Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 112]
cs.CV [Total: 261]
cs.AI [Total: 97]
cs.SD [Total: 16]
cs.LG [Total: 225]
cs.MA [Total: 12]
cs.MM [Total: 6]
eess.AS [Total: 11]
eess.IV [Total: 16]

cs.CL

[1] Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models

Hongji Li, Junchi yao, Manjiang Yu, Priyanka Singh, Xue Li, Di Wang, Lijie Hu

Main category: cs.CL

TL;DR: RMLLMU-Bench is the first benchmark for evaluating unlearning in Reasoning Multimodal LLMs, addressing the unique challenge of reasoning-level leakage. The paper proposes R-MUSE, a training-free framework that better balances forgetting and reasoning retention compared to existing methods.

Details

Motivation: Current unlearning methods for Reasoning Multimodal Large Language Models (RMLLMs) face two key challenges: 1) intermediate chain-of-thought steps can still leak sensitive information even when final answers are forgotten, and 2) overly aggressive interventions easily damage general reasoning ability. There's no existing benchmark that jointly evaluates reasoning-level leakage suppression and reasoning competence preservation.

Method: The paper introduces RMLLMU-Bench, the first benchmark for RMLLM unlearning that extends standard forgetting metrics with dedicated measures of reasoning leakage and reasoning retention. To address limitations of existing methods, they propose R-MUSE (Reasoning-preserving MLLM Unlearning via Subspace guidance and Adaptive Steering), a training-free and inference-time intervention framework that steers internal representations to forget both answers and reasoning traces while explicitly preserving general reasoning.

Result: Systematic evaluation on RMLLMU-Bench reveals that existing unlearning methods for MLLMs and Large Reasoning Models either leave substantial leakage in the reasoning process or severely degrade reasoning performance. R-MUSE achieves a substantially better balance between effective forgetting and reasoning retention compared to existing approaches.

Conclusion: The paper addresses a critical gap in machine unlearning for reasoning models by introducing the first dedicated benchmark and a novel training-free framework. R-MUSE demonstrates superior performance in balancing the competing objectives of forgetting sensitive information while preserving general reasoning capabilities in RMLLMs.

Abstract: Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs), this is uniquely challenging: intermediate chain-of-thought steps can still leak sensitive information even when final answers are forgotten, and overly aggressive interventions easily damage general reasoning ability. Yet no benchmark jointly evaluates how well unlearning methods suppress reasoning-level leakage while preserving reasoning competence. We address this gap with RMLLMU-Bench, the first benchmark for RMLLM unlearning that extends standard forgetting metrics with dedicated measures of reasoning leakage and reasoning retention. A systematic evaluation on RMLLMU-Bench reveals that existing unlearning methods for MLLMs and Large (Language) Reasoning Models (LRMs) either leave substantial leakage in the reasoning process or severely degrade reasoning performance. To address these gaps, we propose R-MUSE (Reasoning-preserving MLLM Unlearning via Subspace guidance and Adaptive Steering), a training-free and inference-time intervention framework that steers internal representations to forget both answers and reasoning traces while explicitly preserving general reasoning. Experiments on RMLLMU-Bench demonstrate that R-MUSE achieves a substantially better balance between effective forgetting and reasoning retention.

[2] Graph-O1 : Monte Carlo Tree Search with Reinforcement Learning for Text-Attributed Graph Reasoning

Lihui Liu

Main category: cs.CL

TL;DR: Graph-O1 is an agentic GraphRAG framework that uses Monte Carlo Tree Search with reinforcement learning to enable LLMs to perform stepwise, interactive reasoning over text-attributed graphs, overcoming context-length limitations of traditional methods.

Details

Motivation: Current methods for question answering over text-attributed graphs have limitations: retrieval-augmented generation treats passages as isolated units ignoring graph structure, while graph-based RAG methods serialize large subgraphs into sequences that exceed LLM context limits, leading to fragmented reasoning and degraded accuracy.

Method: Graph-O1 integrates Monte Carlo Tree Search (MCTS) with end-to-end reinforcement learning to enable LLMs to conduct stepwise, interactive reasoning. The approach frames reasoning as multi-turn interaction between an agent and graph environment, allowing selective exploration and retrieval of only the most informative subgraph components.

Result: Extensive experiments across multiple LLM backbones show that Graph-O1 consistently surpasses state-of-the-art baselines, producing answers that are more accurate, reliable, and interpretable.

Conclusion: Graph-O1 successfully addresses the limitations of existing methods for reasoning over text-attributed graphs by enabling LLMs to perform structured, interactive reasoning while respecting context-length constraints, resulting in improved performance and interpretability.

Abstract: ChatGPT said: Text-attributed graphs, where nodes and edges contain rich textual information, are widely used across diverse domains. A central challenge in this setting is question answering, which requires jointly leveraging unstructured text and the structured relational signals within the graph. Although Large Language Models (LLMs) have made significant advances in natural language understanding, their direct use for reasoning over text-attributed graphs remains limited. Retrieval-augmented generation methods that operate purely on text often treat passages as isolated units, ignoring the interconnected structure of the graph. Conversely, graph-based RAG methods that serialize large subgraphs into long textual sequences quickly become infeasible due to LLM context-length constraints, resulting in fragmented reasoning and degraded accuracy. To overcome these limitations, we introduce Graph-O1, an agentic GraphRAG framework that enables LLMs to conduct stepwise, interactive reasoning over graphs. Our approach integrates Monte Carlo Tree Search (MCTS) with end-to-end reinforcement learning, allowing the model to selectively explore and retrieve only the most informative subgraph components. The reasoning procedure is framed as a multi-turn interaction between the agent and the graph environment, and the agent is trained through a unified reward mechanism. Extensive experiments across multiple LLM backbones demonstrate that Graph-O1 consistently surpasses state-of-the-art baselines, producing answers that are more accurate, reliable, and interpretable.

[3] Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression

Boris Kriuk, Logic Ng

Main category: cs.CL

TL;DR: Q-KVComm enables direct transmission of compressed KV cache representations between LLM agents, achieving 5-6x compression while maintaining semantic fidelity, shifting from text-based to representation-based communication.

Details

Motivation: Multi-agent LLM systems suffer from redundant transmission of contextual information between agents, consuming excessive bandwidth and computational resources. Traditional approaches transmit raw text, forcing receiving agents to recompute similar representations from scratch.

Method: Q-KVComm combines three innovations: (1) adaptive layer-wise quantization with variable bit-widths based on sensitivity profiling, (2) hybrid information extraction preserving critical facts across content domains, and (3) heterogeneous model calibration for cross-architecture communication.

Result: Extensive experiments across three diverse question-answering datasets show Q-KVComm achieves 5-6x compression ratios while maintaining semantic fidelity, with coherence quality scores above 0.77 across all scenarios. Robust performance across model sizes (1.1B-1.5B parameters) and adapts to conversational QA and multi-hop reasoning.

Conclusion: Q-KVComm establishes a new paradigm for LLM agent communication, shifting from text-based to representation-based information exchange, addressing the critical bottleneck of redundant contextual transmission in multi-agent systems.

Abstract: Multi-agent Large Language Model (LLM) systems face a critical bottleneck: redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. Traditional approaches discard internal semantic representations and transmit raw text, forcing receiving agents to recompute similar representations from scratch. We introduce Q-KVComm, a new protocol that enables direct transmission of compressed key-value (KV) cache representations between LLM agents. Q-KVComm combines three key innovations: (1) adaptive layer-wise quantization that allocates variable bit-widths based on sensitivity profiling, (2) hybrid information extraction that preserves critical facts across content domains, and (3) heterogeneous model calibration establishing cross-architecture communication. Extensive experiments across three diverse question-answering datasets demonstrate that Q-KVComm achieves 5-6x compression ratios while maintaining semantic fidelity, with coherence quality scores above 0.77 across all scenarios. The protocol exhibits robust performance across model sizes (1.1B-1.5B parameters) and adapts to real-world applications including conversational QA and multi-hop reasoning. Our work establishes a new paradigm for LLM agent communication, shifting from text-based to representation-based information exchange.

[4] Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset

Nick Rossenbach, Robin Schmitt, Tina Raissi, Simon Berger, Larissa Kleppel, Ralf Schlüter

Main category: cs.CL

TL;DR: Loquacious dataset replaces LibriSpeech/TED-Lium with properly defined partitions across domains, plus open resources (LMs, G2P, lexica) for benchmarking ASR architectures.

Details

Motivation: To provide a properly defined ASR dataset with clear training/test partitions across multiple acoustic and language domains, with open licensing for both academia and industry use.

Method: Created the Loquacious dataset with defined partitions, plus additional resources including n-gram language models, grapheme-to-phoneme model, and pronunciation lexica. Used these resources to conduct experiments across various ASR architectures with different label units and topologies.

Result: Initial experimental results show that Loquacious dataset provides a valuable study case for addressing common ASR challenges across different architectures.

Conclusion: Loquacious serves as a suitable replacement for established ASR datasets, offering well-defined partitions, open resources, and benchmarking capabilities for diverse ASR research and applications.

Abstract: The recently published Loquacious dataset aims to be a replacement for established English automatic speech recognition (ASR) datasets such as LibriSpeech or TED-Lium. The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model and pronunciation lexica, with open and public access. Utilizing those additional resources we show experimental results across a wide range of ASR architectures with different label units and topologies. Our initial experimental results indicate that the Loquacious dataset offers a valuable study case for a variety of common challenges in ASR.

[5] Learning to Prioritize IT Tickets: A Comparative Evaluation of Embedding-based Approaches and Fine-Tuned Transformer Models

Minh Tri LÊ, Ali Ait-Bachir

Main category: cs.CL

TL;DR: Transformer model outperforms embedding-based methods for IT service ticket prioritization, achieving 78.5% F1-score despite noisy text and class imbalance.

Details

Motivation: IT Service Management (ITSM) ticket prioritization is challenging due to noisy textual inputs, subjective writing styles, and class imbalance, requiring robust automated solutions.

Method: Evaluated two approaches: 1) embedding-based pipelines combining dimensionality reduction, clustering, and classical classifiers, and 2) a fine-tuned multilingual transformer processing both textual and numerical features.

Result: Transformer model substantially outperformed embedding methods, achieving average F1-score of 78.5% and weighted Cohen’s kappa values near 0.80, while embedding methods showed limited generalization across 30 configurations.

Conclusion: Generic embeddings have limitations for ITSM data, while domain-adapted transformer architectures are effective for operational ticket prioritization.

Abstract: Prioritizing service tickets in IT Service Management (ITSM) is critical for operational efficiency but remains challenging due to noisy textual inputs, subjective writing styles, and pronounced class imbalance. We evaluate two families of approaches for ticket prioritization: embedding-based pipelines that combine dimensionality reduction, clustering, and classical classifiers, and a fine-tuned multilingual transformer that processes both textual and numerical features. Embedding-based methods exhibit limited generalization across a wide range of thirty configurations, with clustering failing to uncover meaningful structures and supervised models highly sensitive to embedding quality. In contrast, the proposed transformer model achieves substantially higher performance, with an average F1-score of 78.5% and weighted Cohen’s kappa values of nearly 0.80, indicating strong alignment with true labels. These results highlight the limitations of generic embeddings for ITSM data and demonstrate the effectiveness of domain-adapted transformer architectures for operational ticket prioritization.

[6] KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction

Aomufei Yuan, Zhiming Wang, Ruijie Miao, Dayu Wang, Yuxuan Tian, Zihan Wang, Yebo Peng, Yuhan Wu, Bairen Yi, Xin Liu, Tong Yang

Main category: cs.CL

TL;DR: KVReviver: A reversible KV cache compression method using sketch algorithms to maintain full accuracy with 10-25% of KV cache memory.

Details

Motivation: As LLM context lengths increase, KV cache memory becomes a bottleneck. Traditional compression methods cause "Contextual Amnesia" - permanent loss of token information that degrades model performance.

Method: KVReviver uses sketch algorithms to create reversible KV cache compression. It allows reconstruction of compressed tokens from additional data structures, enabling full-scale computation within limited memory.

Result: For 2k-length contexts: maintains identical inference accuracy with only 10% of KV cache budget. For 32k-length contexts: achieves equivalent accuracy (~2% loss) with only 25% of KV cache budget.

Conclusion: KVReviver effectively addresses KV cache memory bottlenecks through reversible compression, preventing information loss while significantly reducing memory requirements.

Abstract: As the context length of current large language models (LLMs) rapidly increases, the memory demand for the Key-Value (KV) cache is becoming a bottleneck for LLM deployment and batch processing. Traditional KV cache compression methods typically involve permanently evicting or irreversibly merging “less important” tokens with low attention scores. This approach results in the unrecoverable loss of token information, which we call Contextual Amnesia, significantly degrading the model’s information retrieval capability. To address this issue, we propose KVReviver, a reversible KV cache compression method based on the sketch algorithm. This method allows reconstructing compressed tokens from an additional data structure, thus enabling full-scale computation within limited memory. Experiments showed that in 2k-length contexts, it requires only 10% of KV Cache budget while maintaining identical end-to-end inference accuracy. For 32k-length contexts, it achieves equivalent or comparable accuracy ~2% accuracy loss) using merely 25% of KV Cache budget.

[7] Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression

Rahul Baxi

Main category: cs.CL

TL;DR: LLMs struggle with prompt compression due to RLHF-trained helpfulness behaviors, showing a U-curve pattern where constraint violations peak at medium compression levels.

Details

Motivation: Large language models show degraded performance under prompt compression, but the underlying mechanisms remain poorly understood. The paper aims to systematically analyze how compression affects different aspects of model performance.

Method: Introduced Compression-Decay Comprehension Test (CDCT) benchmark to independently measure constraint compliance and semantic accuracy across compression levels. Evaluated 9 frontier LLMs across 8 concepts using 5 compression levels (extreme to none). Used three-judge LLM jury for evaluation with high inter-rater agreement. Conducted RLHF ablation experiments to validate hypotheses.

Result: Found universal U-curve pattern in constraint compliance (97.2% prevalence) with violations peaking at medium compression. Constraint effects were 2.9x larger than semantic effects. RLHF ablation showed removing “helpfulness” signals improves constraint compliance by 598% on average, confirming that RLHF-trained helpfulness behaviors cause constraint violations at medium compression.

Conclusion: Reveals fundamental tension between RLHF alignment and instruction-following, with RLHF-trained helpfulness being the dominant cause of constraint violations under compression. Provides actionable guidelines for improving deployed systems, showing reasoning models outperform efficient models by 27.5%.

Abstract: Large language models (LLMs) exhibit degraded performance under prompt compression, but the mechanisms remain poorly understood. We introduce the Compression-Decay Comprehension Test (CDCT), a benchmark that independently measures constraint compliance (CC) and semantic accuracy (SA) across compression levels. We evaluate 9 frontier LLMs across 8 concepts using 5 compression levels from extreme (c=0.0, ~2 words) to none (c=1.0, ~135 words). A three-judge LLM jury achieves almost perfect inter-rater agreement on CC (Fleiss’ \k{appa}=0.90). We observe a universal U-curve pattern in constraint compliance (97.2% prevalence), with violations peaking at medium compression (c=0.5, ~27 words). Counterintuitively, models perform better at extreme compression than medium lengths. The dimensions are statistically orthogonal (r=0.193, p=0.084), with constraint effects 2.9x larger than semantic effects. Experimental validation via RLHF ablation confirms our constraint salience hypothesis: removing “helpfulness” signals improves CC by 598% on average (71/72 trials, p<0.001), with 79% achieving perfect compliance. This demonstrates that RLHF-trained helpfulness behaviors are the dominant cause of constraint violations at medium compression. Reasoning models outperform efficient models by 27.5% (Cohen’s d=0.96). Our findings reveal a fundamental tension between RLHF alignment and instruction-following, providing actionable guidelines for improving deployed systems.

[8] ReGal: A First Look at PPO-based Legal AI for Judgment Prediction and Summarization in India

Shubham Kumar Nigam, Tanuj Tyagi, Siddharth Shukla, Aditya Kumar Guru, Balaramamahanthi Deepak Patnaik, Danush Khanna, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya

Main category: cs.CL

TL;DR: ReGal framework applies reinforcement learning to Indian legal AI tasks, showing promise despite underperforming on metrics compared to supervised models.

Details

Motivation: To explore reinforcement learning methodologies for legal AI in the Indian context, addressing challenges in applying RL to legal texts and establishing foundations for interpretable legal AI systems.

Method: Reinforcement Learning-based Legal Reasoning (ReGal) framework integrating Multi-Task Instruction Tuning with Reinforcement Learning from AI Feedback (RLAIF) using Proximal Policy Optimization (PPO).

Result: The framework underperforms on standard evaluation metrics compared to supervised and proprietary models, but provides valuable insights into RL challenges for legal texts including reward model alignment, legal language complexity, and domain adaptation.

Conclusion: Demonstrates how RL can be repurposed for high-stakes legal tasks, establishing a foundation for future work on optimizing legal reasoning pipelines with broader implications for interpretable and adaptive legal AI systems.

Abstract: This paper presents an early exploration of reinforcement learning methodologies for legal AI in the Indian context. We introduce Reinforcement Learning-based Legal Reasoning (ReGal), a framework that integrates Multi-Task Instruction Tuning with Reinforcement Learning from AI Feedback (RLAIF) using Proximal Policy Optimization (PPO). Our approach is evaluated across two critical legal tasks: (i) Court Judgment Prediction and Explanation (CJPE), and (ii) Legal Document Summarization. Although the framework underperforms on standard evaluation metrics compared to supervised and proprietary models, it provides valuable insights into the challenges of applying RL to legal texts. These challenges include reward model alignment, legal language complexity, and domain-specific adaptation. Through empirical and qualitative analysis, we demonstrate how RL can be repurposed for high-stakes, long-document tasks in law. Our findings establish a foundation for future work on optimizing legal reasoning pipelines using reinforcement learning, with broader implications for building interpretable and adaptive legal AI systems.

[9] CoPE: A Small Language Model for Steerable and Scalable Content Labeling

Samidh Chakrabarti, David Willner, Kevin Klyman, Tiffany Saade, Emily Capstick, Sabina Nong

Main category: cs.CL

TL;DR: CoPE is a small, policy-steerable language model for content labeling that achieves frontier model accuracy at 1% size, using novel training methods and policy generation techniques.

Details

Motivation: Current content moderation systems require large, expensive frontier models. There's a need for smaller, more efficient models that can interpret policies rather than just memorize them, enabling better governance of online platforms through policy-based control.

Method: Two novel approaches: 1) Contradictory Example Training - a curriculum that teaches policy interpretation instead of memorization; 2) Binocular Labeling - a method for generating unambiguous training datasets for content policies. Applied to create CoPE, a 9B parameter model.

Result: CoPE achieves equal or superior accuracy to frontier models across seven different harm areas while being only 1% of their size. The 9B parameter version can run on a single consumer-grade GPU.

Conclusion: CoPE represents a paradigm shift for classifier systems by transforming ML tasks into policy writing tasks, opening new design possibilities for online platform governance through policy-steerable models.

Abstract: This paper details the methodology behind CoPE, a policy-steerable small language model capable of fast and accurate content labeling. We present a novel training curricula called Contradictory Example Training that enables the model to learn policy interpretation rather than mere policy memorization. We also present a novel method for generating content policies, called Binocular Labeling, which enables rapid construction of unambiguous training datasets. When evaluated across seven different harm areas, CoPE exhibits equal or superior accuracy to frontier models at only 1% of their size. We openly release a 9 billion parameter version of the model that can be run on a single consumer-grade GPU. Models like CoPE represent a paradigm shift for classifier systems. By turning an ML task into a policy writing task, CoPE opens up new design possibilities for the governance of online platforms.

[10] Narrative Consolidation: Formulating a New Task for Unifying Multi-Perspective Accounts

Roger A. Finger, Eduardo G. Cortes, Sandro J. Rigo, Gabriel de O. Ramos

Main category: cs.CL

TL;DR: This paper introduces Narrative Consolidation as a new NLP task focused on creating unified, coherent, and chronologically sound texts from overlapping narratives, and proposes a Temporal Alignment Event Graph (TAEG) approach that achieves perfect temporal ordering and significant content improvements.

Details

Motivation: Standard Multi-Document Summarization (MDS) fails for narrative documents like legal testimonies or historical accounts because it focuses on compression rather than preserving narrative flow, chronological integrity, and completeness of complementary details.

Method: The authors introduce Temporal Alignment Event Graph (TAEG), a graph structure that explicitly models chronology and event alignment. They apply standard centrality algorithms to TAEG to function as a version selection mechanism, choosing the most central representation of each event in its correct temporal position.

Result: When tested on the four Biblical Gospels, the TAEG approach guarantees perfect temporal ordering (Kendall’s Tau of 1.000) by design and dramatically improves content metrics, achieving +357.2% improvement in ROUGE-L F1 score.

Conclusion: The success of this baseline method validates Narrative Consolidation as a relevant NLP task and establishes that an explicit temporal backbone is fundamental for solving it, demonstrating the critical role of temporal structure in narrative processing.

Abstract: Processing overlapping narrative documents, such as legal testimonies or historical accounts, often aims not for compression but for a unified, coherent, and chronologically sound text. Standard Multi-Document Summarization (MDS), with its focus on conciseness, fails to preserve narrative flow. This paper formally defines this challenge as a new NLP task: Narrative Consolidation, where the central objectives are chronological integrity, completeness, and the fusion of complementary details. To demonstrate the critical role of temporal structure in this task, we introduce Temporal Alignment Event Graph (TAEG), a graph structure that explicitly models chronology and event alignment. By applying a standard centrality algorithm to TAEG, our method functions as a version selection mechanism, choosing the most central representation of each event in its correct temporal position. In a study on the four Biblical Gospels, this structure-focused approach guarantees perfect temporal ordering (Kendall’s Tau of 1.000) by design and dramatically improves content metrics (e.g., +357.2% in ROUGE-L F1). The success of this baseline method validates the formulation of Narrative Consolidation as a relevant task and establishes that an explicit temporal backbone is a fundamental component for its resolution.

[11] MauBERT: Universal Phonetic Inductive Biases for Few-Shot Acoustic Units Discovery

Angelo Ortiz Tandazo, Manel Khentout, Youssef Benchekroun, Thomas Hueber, Emmanuel Dupoux

Main category: cs.CL

TL;DR: MauBERT extends HuBERT with articulatory features for multilingual phonetic representation learning across 55 languages, producing more context-invariant representations than SOTA models.

Details

Motivation: To develop robust cross-lingual phonetic representations by incorporating linguistic inductive biases (articulatory features) into self-supervised speech models for better multilingual performance.

Method: Continue HuBERT pre-training with supervision using phonetic-to-articulatory feature mapping across 55 languages, learning to predict either articulatory features or phones from multilingual data.

Result: MauBERT produces more context-invariant representations than SOTA multilingual SSL models, effectively adapts to unseen languages and casual speech with minimal fine-tuning (10 hours).

Conclusion: This establishes an effective approach for instilling linguistic inductive biases in self-supervised speech models through articulatory feature supervision.

Abstract: This paper introduces MauBERT, a multilingual extension of HuBERT that leverages articulatory features for robust cross-lingual phonetic representation learning. We continue HuBERT pre-training with supervision based on a phonetic-to-articulatory feature mapping in 55 languages. Our models learn from multilingual data to predict articulatory features or phones, resulting in language-independent representations that capture multilingual phonetic properties. Through comprehensive ABX discriminability testing, we show MauBERT models produce more context-invariant representations than state-of-the-art multilingual self-supervised learning models. Additionally, the models effectively adapt to unseen languages and casual speech with minimal self-supervised fine-tuning (10 hours of speech). This establishes an effective approach for instilling linguistic inductive biases in self-supervised speech models.

[12] Statistical laws and linguistics inform meaning in naturalistic and fictional conversation

Ashley M. A. Fehr, Calla G. Beauregard, Julia Witte Zimmerman, Katie Ekström, Pablo Rosillo-Rodes, Christopher M. Danforth, Peter Sheridan Dodds

Main category: cs.CL

TL;DR: The paper analyzes Heaps’ law (vocabulary growth with document length) in conversations, finding that scaling differs by parts of speech in both video chat conversations between strangers and movie dialogues.

Details

Motivation: Conversations are fundamental to social connection and well-being, but little research has applied Heaps' law to study how conversations unfold over time, particularly how different language features affect vocabulary scaling patterns.

Method: The researchers measured Heaps’ law in two distinct conversational datasets: 1) video chat conversations between strangers, and 2) dialogues of fictional characters in movies. They analyzed how vocabulary size scales with conversation length, with particular focus on differences across parts of speech.

Result: The study found that vocabulary scaling patterns (Heaps’ law) differ significantly by parts of speech in both conversational contexts. Different grammatical categories show distinct patterns of vocabulary growth as conversations progress.

Conclusion: The findings suggest that analyzing conversations through the lens of Heaps’ law reveals systematic patterns in vocabulary growth that vary by linguistic features, providing insights into conversation dynamics that can be understood through both behavioral and linguistic frameworks.

Abstract: Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type with some portion generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps’ law, which holds that vocabulary size scales with document length. Little work on Heaps’s law has looked at conversation and considered how language features impact scaling. We measure Heaps’ law for conversations recorded in two distinct mediums: 1. Strangers brought together on video chat and 2. Fictional characters in movies. We find that scaling of vocabulary size differs by parts of speech. We discuss these findings through behavioral and linguistic frameworks.

[13] Training LLMs with LogicReward for Faithful and Rigorous Reasoning

Jundong Xu, Hao Fei, Huichi Zhou, Xin Quan, Qijun Huang, Shengqiong Wu, William Yang Wang, Mong-Li Lee, Wynne Hsu

Main category: cs.CL

TL;DR: LogicReward is a novel reward system that uses theorem proving to enforce step-level logical correctness in LLM training, addressing flawed reasoning in existing outcome-based methods.

Details

Motivation: Existing LLM training methods rely on outcome-based feedback which can produce correct answers with flawed reasoning. Prior work introduces step supervision but lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount.

Method: Proposes LogicReward, a reward system that guides model training by enforcing step-level logical correctness using a theorem prover. Introduces Autoformalization with Soft Unification to reduce natural language ambiguity and improve formalization quality for more effective theorem proving.

Result: An 8B model trained with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks. The approach enhances reasoning faithfulness, improves generalizability to unseen tasks (math and commonsense reasoning), and provides reliable reward signals without ground-truth labels.

Conclusion: LogicReward effectively addresses logical soundness issues in LLM training by integrating theorem proving with step-level supervision, demonstrating superior performance on reasoning tasks and improved generalization capabilities.

Abstract: Although LLMs exhibit strong reasoning capabilities, existing training methods largely depend on outcome-based feedback, which can produce correct answers with flawed reasoning. Prior work introduces supervision on intermediate steps but still lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount. To address this, we propose LogicReward, a novel reward system that guides model training by enforcing step-level logical correctness with a theorem prover. We further introduce Autoformalization with Soft Unification, which reduces natural language ambiguity and improves formalization quality, enabling more effective use of the theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks with simple training procedures. Further analysis shows that LogicReward enhances reasoning faithfulness, improves generalizability to unseen tasks such as math and commonsense reasoning, and provides a reliable reward signal even without ground-truth labels. We will release all data and code at https://llm-symbol.github.io/LogicReward.

[14] GeoSense-AI: Fast Location Inference from Crisis Microblogs

Deepit Sapru

Main category: cs.CL

TL;DR: GeoSense-AI: A real-time NLP pipeline for extracting geolocations from noisy microblog streams during emergencies, combining statistical hashtag segmentation, POS-based proper noun detection, dependency parsing, lightweight NER, and gazetteer disambiguation to achieve high F1 with fast throughput.

Details

Motivation: Current emergency response systems rely on sparse geotags in social media, missing valuable location information in text. There's a need for real-time geolocation extraction from noisy microblog streams to support situational awareness during fast-moving events like floods and outbreaks.

Method: Unified pipeline with: 1) statistical hashtag segmentation, 2) part-of-speech-driven proper-noun detection, 3) dependency parsing around disaster lexicons, 4) lightweight named-entity recognition, and 5) gazetteer-grounded disambiguation. Designed for low-latency streaming constraints with efficient validation against geographic knowledge bases.

Result: Achieves strong F1 scores comparable to widely used NER toolkits while being engineered for orders-of-magnitude faster throughput. Enables deployment in live crisis informatics settings with a production map interface demonstrating end-to-end functionality from ingest to visualization.

Conclusion: Domain-tuned NLP combined with knowledge grounding can significantly enhance emergency response by extracting location signals directly from text, moving beyond conventional geo-tag reliance and enabling real-time situational awareness during fast-moving events.

Abstract: This paper presents an applied AI pipeline for realtime geolocation from noisy microblog streams, unifying statistical hashtag segmentation, part-of-speech-driven proper-noun detection, dependency parsing around disaster lexicons, lightweight named-entity recognition, and gazetteer-grounded disambiguation to infer locations directly from text rather than sparse geotags. The approach operationalizes information extraction under streaming constraints, emphasizing low-latency NLP components and efficient validation against geographic knowledge bases to support situational awareness during emergencies. In head to head comparisons with widely used NER toolkits, the system attains strong F1 while being engineered for orders-of-magnitude faster throughput, enabling deployment in live crisis informatics settings. A production map interface demonstrates end-to-end AI functionality ingest, inference, and visualization–surfacing locational signals at scale for floods, outbreaks, and other fastmoving events. By prioritizing robustness to informal text and streaming efficiency, GeoSense-AI illustrates how domain-tuned NLP and knowledge grounding can elevate emergency response beyond conventional geo-tag reliance.

[15] InstructNet: A Novel Approach for Multi-Label Instruction Classification through Advanced Deep Learning

Tanjim Taharat Aurpa, Md Shoaib Ahmed, Md Mahbubur Rahman, Md. Golam Moazzam

Main category: cs.CL

TL;DR: This paper proposes InstructNet, a multi-label classification system for How-To articles using transformer architectures like XLNet and BERT, achieving 97.30% accuracy on wikiHow data.

Details

Motivation: People increasingly use search engines with "How To" queries to find instructional content. Categorizing this instructional text is essential for task-oriented learning and knowledge base creation, but multi-label classification of instructional content presents challenges.

Method: The authors created a dataset of 11,121 wikiHow articles with multiple categories per record. They employed transformer-based architectures including XLNet and BERT for multi-label instruction classification, using a multi-level evaluation strategy with accuracy and macro F1-score metrics.

Result: XLNet architecture demonstrated unprecedented performance with 97.30% accuracy, 89.02% micro average, and 93% macro average scores, outperforming other transformer models in multi-label classification of instructional content.

Conclusion: The XLNet-based InstructNet approach is highly effective for multi-label instruction classification, with the multi-level evaluation providing comprehensive insights into model strengths and areas for future improvement.

Abstract: People use search engines for various topics and items, from daily essentials to more aspirational and specialized objects. Therefore, search engines have taken over as peoples preferred resource. The How To prefix has become familiar and widely used in various search styles to find solutions to particular problems. This search allows people to find sequential instructions by providing detailed guidelines to accomplish specific tasks. Categorizing instructional text is also essential for task-oriented learning and creating knowledge bases. This study uses the How To articles to determine the multi-label instruction category. We have brought this work with a dataset comprising 11,121 observations from wikiHow, where each record has multiple categories. To find out the multi-label category meticulously, we employ some transformer-based deep neural architectures, such as Generalized Autoregressive Pretraining for Language Understanding (XLNet), Bidirectional Encoder Representation from Transformers (BERT), etc. In our multi-label instruction classification process, we have reckoned our proposed architectures using accuracy and macro f1-score as the performance metrics. This thorough evaluation showed us much about our strategys strengths and drawbacks. Specifically, our implementation of the XLNet architecture has demonstrated unprecedented performance, achieving an accuracy of 97.30% and micro and macro average scores of 89.02% and 93%, a noteworthy accomplishment in multi-label classification. This high level of accuracy and macro average score is a testament to the effectiveness of the XLNet architecture in our proposed InstructNet approach. By employing a multi-level strategy in our evaluation process, we have gained a more comprehensive knowledge of the effectiveness of our proposed architectures and identified areas for forthcoming improvement and refinement.

[16] MemEvolve: Meta-Evolution of Agent Memory Systems

Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, Shuicheng Yan

Main category: cs.CL

TL;DR: MemEvolve is a meta-evolutionary framework that jointly evolves LLM agents’ experiential knowledge and their memory architecture, enabling both agent-level learning and memory system adaptation to diverse tasks.

Details

Motivation: Current LLM-based agent systems rely on static, manually engineered memory architectures that cannot adapt to different task contexts, limiting their evolutionary potential despite memory facilitating agent-level learning.

Method: Proposes MemEvolve framework that meta-evolves both agents’ experiential knowledge and memory architecture; introduces EvolveLab as a unified codebase with modular design space (encode, store, retrieve, manage) based on 12 representative memory systems.

Result: Achieves up to 17.06% performance improvement over frameworks like SmolAgent and Flash-Searcher; demonstrates strong cross-task and cross-LLM generalization with memory architectures that transfer effectively across diverse benchmarks and backbone models.

Conclusion: MemEvolve addresses the staticity limitation of current memory systems by enabling meta-evolution of memory architecture alongside agent knowledge, significantly advancing self-evolving agent systems through both performance gains and generalization capabilities.

Abstract: Self-evolving memory systems are unprecedentedly reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the staticity of the memory system itself: while memory facilitates agent-level evolving, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents’ experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior research and foster openness in future self-evolving systems, we introduce EvolveLab, a unified self-evolving memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing both a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks demonstrate that MemEvolve achieves (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to $17.06%$; and (II) strong cross-task and cross-LLM generalization, designing memory architectures that transfer effectively across diverse benchmarks and backbone models.

[17] CTTA-T: Continual Test-Time Adaptation for Text Understanding via Teacher-Student with a Domain-aware and Generalized Teacher

Tianlun Liu, Zhiliang Tian, Zhen Huang, Xingzhi Zhou, Wanlong Yu, Tianle Liu, Feng Liu, Dongsheng Li

Main category: cs.CL

TL;DR: CTTA-T is a continual test-time adaptation framework for text understanding that handles sequential unobserved testing domains using a domain-aware teacher-student approach with dropout-driven consistency filtering and incremental PCA for adaptive domain accumulation.

Details

Motivation: Current continual test-time adaptation (CTTA) methods struggle with error accumulation across domains and generalization to unobserved domains. Existing approaches either discard useful information through noise-filtering or fail to achieve adaptive accumulation of historical domains.

Method: Proposes CTTA-T framework with teacher-student architecture: 1) Refine-then-filter based on dropout-driven consistency to calibrate predictions and remove unreliable guidance, 2) Domain-aware teacher construction using incremental PCA to dynamically accumulate cross-domain semantics and track domain shifts.

Result: Experiments show CTTA-T excels baselines in continual test-time adaptation for text understanding across evolving target domains.

Conclusion: CTTA-T effectively addresses the adaptation-generalization trade-off in continual test-time adaptation for text understanding by combining dropout-driven consistency filtering with adaptive domain accumulation via incremental PCA.

Abstract: Text understanding often suffers from domain shifts. To handle testing domains, domain adaptation (DA) is trained to adapt to a fixed and observed testing domain; a more challenging paradigm, test-time adaptation (TTA), cannot access the testing domain during training and online adapts to the testing samples during testing, where the samples are from a fixed domain. We aim to explore a more practical and underexplored scenario, continual test-time adaptation (CTTA) for text understanding, which involves a sequence of testing (unobserved) domains in testing. Current CTTA methods struggle in reducing error accumulation over domains and enhancing generalization to handle unobserved domains: 1) Noise-filtering reduces accumulated errors but discards useful information, and 2) accumulating historical domains enhances generalization, but it is hard to achieve adaptive accumulation. In this paper, we propose a CTTA-T (continual test-time adaptation for text understanding) framework adaptable to evolving target domains: it adopts a teacher-student framework, where the teacher is domain-aware and generalized for evolving domains. To improve teacher predictions, we propose a refine-then-filter based on dropout-driven consistency, which calibrates predictions and removes unreliable guidance. For the adaptation-generalization trade-off, we construct a domain-aware teacher by dynamically accumulating cross-domain semantics via incremental PCA, which continuously tracks domain shifts. Experiments show CTTA-T excels baselines.

[18] LIR$^3$AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation

Guo Chen, Junjie Huang, Huaijin Xie, Fei Sun, Tao Jia

Main category: cs.CL

TL;DR: Proposes LiR³AG framework that enables non-reasoning LLMs to achieve reasoning model performance in multi-hop QA by restructuring retrieved evidence into reasoning chains, reducing computational overhead by 98% tokens and 58.6% inference time.

Details

Motivation: Reasoning models improve LLM performance in multi-hop QA tasks but introduce substantial computational costs (increased token consumption and inference latency). Need to understand and mitigate this trade-off between performance and efficiency.

Method: Proposes LiR³AG (Lightweight Rerank Reasoning Strategy Framework for RAG) that transfers reasoning strategies from reasoning models to non-reasoning models by restructuring retrieved evidence into coherent reasoning chains. Identifies two reasoning modes: Context-Grounded Reasoning (relying on retrieved content) and Knowledge-Reconciled Reasoning (resolving conflicts/gaps with internal knowledge).

Result: LiR³AG reduces average 98% output tokens overhead and 58.6% inference time while improving 8B non-reasoning model’s F1 performance by 6.2% to 22.5%, surpassing performance of 32B reasoning model in RAG tasks.

Conclusion: LiR³AG offers a practical and efficient path forward for RAG systems by enabling non-reasoning models to achieve reasoning model performance with significantly reduced computational costs through strategic evidence restructuring.

Abstract: Retrieval-Augmented Generation (RAG) effectively enhances Large Language Models (LLMs) by incorporating retrieved external knowledge into the generation process. Reasoning models improve LLM performance in multi-hop QA tasks, which require integrating and reasoning over multiple pieces of evidence across different documents to answer a complex question. However, they often introduce substantial computational costs, including increased token consumption and inference latency. To better understand and mitigate this trade-off, we conduct a comprehensive study of reasoning strategies for reasoning models in RAG multi-hop QA tasks. Our findings reveal that reasoning models adopt structured strategies to integrate retrieved and internal knowledge, primarily following two modes: Context-Grounded Reasoning, which relies directly on retrieved content, and Knowledge-Reconciled Reasoning, which resolves conflicts or gaps using internal knowledge. To this end, we propose a novel Lightweight Rerank Reasoning Strategy Framework for RAG (LiR$^3$AG) to enable non-reasoning models to transfer reasoning strategies by restructuring retrieved evidence into coherent reasoning chains. LiR$^3$AG significantly reduce the average 98% output tokens overhead and 58.6% inferencing time while improving 8B non-reasoning model’s F1 performance ranging from 6.2% to 22.5% to surpass the performance of 32B reasoning model in RAG, offering a practical and efficient path forward for RAG systems.

[19] Towards Efficient Agents: A Co-Design of Inference Architecture and System

Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu, Hanting Chen, Wangze Zhang, Chuansai Zhou, Yiming Li, Chen Chen, Xing Li, Zhiyuan Yang, Xiaosong Li, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan, Yunhe Wang

Main category: cs.CL

TL;DR: AgentInfer is a unified framework for accelerating LLM-based agents by optimizing systemic latency across reasoning loops, context growth, and tool interactions through four synergistic components.

Details

Motivation: Real-world deployment of LLM-based agents is hindered by severe inefficiencies from systemic latency accumulated across reasoning loops, context growth, and heterogeneous tool interactions, not just isolated model inference.

Method: Four synergistic components: AgentCollab (hierarchical dual-model reasoning with dynamic role assignment), AgentSched (cache-aware hybrid scheduler), AgentSAM (suffix-automaton-based speculative decoding reusing semantic memory), and AgentCompress (asynchronous semantic compression of agent memory). These form a Self-Evolution Engine.

Result: On BrowseComp-zh and DeepDiver benchmarks, AgentInfer reduces ineffective token consumption by over 50%, achieving 1.8-2.5× speedup with preserved accuracy.

Conclusion: Optimizing for agentic task completion rather than merely per-token throughput is key to building scalable, efficient, and self-improving intelligent systems.

Abstract: The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making. However, their real-world deployment is hindered by severe inefficiencies that arise not from isolated model inference, but from the systemic latency accumulated across reasoning loops, context growth, and heterogeneous tool interactions. This paper presents AgentInfer, a unified framework for end-to-end agent acceleration that bridges inference optimization and architectural design. We decompose the problem into four synergistic components: AgentCollab, a hierarchical dual-model reasoning framework that balances large- and small-model usage through dynamic role assignment; AgentSched, a cache-aware hybrid scheduler that minimizes latency under heterogeneous request patterns; AgentSAM, a suffix-automaton-based speculative decoding method that reuses multi-session semantic memory to achieve low-overhead inference acceleration; and AgentCompress, a semantic compression mechanism that asynchronously distills and reorganizes agent memory without disrupting ongoing reasoning. Together, these modules form a Self-Evolution Engine capable of sustaining efficiency and cognitive stability throughout long-horizon reasoning tasks. Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that through the synergistic collaboration of these methods, AgentInfer reduces ineffective token consumption by over 50%, achieving an overall 1.8-2.5 times speedup with preserved accuracy. These results underscore that optimizing for agentic task completion-rather than merely per-token throughput-is the key to building scalable, efficient, and self-improving intelligent systems.

[20] LLM-based Few-Shot Early Rumor Detection with Imitation Agent

Fengzhu Zeng, Qian Shao, Ling Cheng, Wei Gao, Shih-Fen Cheng, Jing Ma, Cheng Niu

Main category: cs.CL

TL;DR: Proposes a novel Early Rumor Detection framework combining an autonomous agent for early time point determination with an LLM-based rumor detector, enabling few-shot learning with only lightweight agent training.

Details

Motivation: Early Rumor Detection is challenging in data-scarce settings. LLMs perform well in few-shot NLP tasks but are not suited for time-series data and are computationally expensive for training/inference.

Method: Combines an autonomous agent (for early time point determination) with an LLM-based detection model (for rumor classification). Only the lightweight agent needs training while the LLM remains training-free.

Result: Extensive experiments on four real-world datasets show the approach boosts performance across LLMs and surpasses existing EARD methods in both accuracy and earliness.

Conclusion: The proposed framework offers the first solution for few-shot Early Rumor Detection, achieving superior performance while being computationally efficient by requiring only lightweight agent training.

Abstract: Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data-scarce settings. While Large Language Models (LLMs) perform well in few-shot NLP tasks, they are not well-suited for time-series data and are computationally expensive for both training and inference. In this work, we propose a novel EARD framework that combines an autonomous agent and an LLM-based detection model, where the agent acts as a reliable decision-maker for \textit{early time point determination}, while the LLM serves as a powerful \textit{rumor detector}. This approach offers the first solution for few-shot EARD, necessitating only the training of a lightweight agent and allowing the LLM to remain training-free. Extensive experiments on four real-world datasets show our approach boosts performance across LLMs and surpasses existing EARD methods in accuracy and earliness.

[21] DACE For Railway Acronym Disambiguation

El Mokhtar Hribach, Oussama Mechhour, Mohammed Elmonstaser, Yassine El Boudouri, Othmane Kabal

Main category: cs.CL

TL;DR: DACE framework wins TextMine'26 competition for French railway acronym disambiguation with 0.9069 F1 score using dynamic prompting, RAG, contextual selection, and ensemble aggregation.

Details

Motivation: Acronym disambiguation is critical for technical text processing, especially in specialized domains like French railway documentation where high ambiguity challenges automated analysis.

Method: DACE framework combines dynamic prompting tailored to acronym ambiguity, retrieval-augmented generation for domain knowledge injection, contextual selection, and ensemble aggregation to enhance LLMs through adaptive in-context learning.

Result: Achieved top rank in TextMine'26 competition with F1 score of 0.9069, effectively handling low-resource scenarios and mitigating hallucination.

Conclusion: DACE demonstrates effective acronym disambiguation through adaptive prompting, external knowledge integration, and ensemble methods, particularly valuable for specialized domains with limited resources.

Abstract: Acronym Disambiguation (AD) is a fundamental challenge in technical text processing, particularly in specialized sectors where high ambiguity complicates automated analysis. This paper addresses AD within the context of the TextMine'26 competition on French railway documentation. We present DACE (Dynamic Prompting, Retrieval Augmented Generation, Contextual Selection, and Ensemble Aggregation), a framework that enhances Large Language Models through adaptive in-context learning and external domain knowledge injection. By dynamically tailoring prompts to acronym ambiguity and aggregating ensemble predictions, DACE mitigates hallucination and effectively handles low-resource scenarios. Our approach secured the top rank in the competition with an F1 score of 0.9069.

[22] LLM Agents Implement an NLG System from Scratch: Building Interpretable Rule-Based RDF-to-Text Generators

Mateusz Lango, Ondřej Dušek

Main category: cs.CL

TL;DR: A neurosymbolic framework for RDF-to-text generation that uses multiple LLM agents to collaboratively create rule-based Python code for text generation without traditional training or human reference texts.

Details

Motivation: To create an interpretable RDF-to-text generation system that eliminates hallucination issues common in LLM-based approaches, requires no supervised training data, and operates efficiently without GPU resources.

Method: Multiple LLM agents collaboratively generate rule-based Python code for text generation from RDF triples. The system is trained through agent interactions rather than backpropagation, with no human reference texts needed.

Result: Experiments on WebNLG and OpenDialKG datasets show reduced hallucination compared to finetuned or prompted language models, with only slight fluency penalties. The system is fully interpretable and runs nearly instantaneously on a single CPU.

Conclusion: The neurosymbolic framework provides a promising alternative to traditional LLM approaches for RDF-to-text generation, offering interpretability, data efficiency, and reduced hallucination while maintaining reasonable fluency.

Abstract: We present a novel neurosymbolic framework for RDF-to-text generation, in which the model is “trained” through collaborative interactions among multiple LLM agents rather than traditional backpropagation. The LLM agents produce rule-based Python code for a generator for the given domain, based on RDF triples only, with no in-domain human reference texts. The resulting system is fully interpretable, requires no supervised training data, and generates text nearly instantaneously using only a single CPU. Our experiments on the WebNLG and OpenDialKG data show that outputs produced by our approach reduce hallucination, with only slight fluency penalties compared to finetuned or prompted language models

[23] SRS-Stories: Vocabulary-constrained multilingual story generation for language learning

Wiktor Kamzela, Mateusz Lango, Ondrej Dusek

Main category: cs.CL

TL;DR: LLMs generate personalized stories for language learners using known vocabulary, with spaced repetition for optimal vocabulary learning.

Details

Motivation: To help language learners acquire new vocabulary through engaging, personalized stories that use only vocabulary they already know, while seamlessly reviewing recently learned words in context.

Method: Use large language models to generate stories constrained by learners’ known vocabulary, with three story generation methods and three lexical constraint strategies, optimized by Spaced Repetition System for vocabulary review.

Result: Generated stories are more grammatical, coherent, and provide better word usage examples than standard constrained beam search, tested across English, Chinese, and Polish languages.

Conclusion: LLM-based personalized story generation with vocabulary constraints and spaced repetition is effective for language learning, producing high-quality educational content superior to traditional constrained text generation approaches.

Abstract: In this paper, we use large language models to generate personalized stories for language learners, using only the vocabulary they know. The generated texts are specifically written to teach the user new vocabulary by simply reading stories where it appears in context, while at the same time seamlessly reviewing recently learned vocabulary. The generated stories are enjoyable to read and the vocabulary reviewing/learning is optimized by a Spaced Repetition System. The experiments are conducted in three languages: English, Chinese and Polish, evaluating three story generation methods and three strategies for enforcing lexical constraints. The results show that the generated stories are more grammatical, coherent, and provide better examples of word usage than texts generated by the standard constrained beam search approach

[24] AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

Mark Kashirskiy, Artiom Lipinski, Ilya Makarov

Main category: cs.CL

TL;DR: AraToken: Arabic-optimized tokenizer using SentencePiece Unigram with normalization, achieving 18% lower fertility than baselines, plus Language Extension Pipeline for efficient model adaptation.

Details

Motivation: General-purpose tokenizers perform poorly on morphologically rich languages like Arabic, causing inflated token sequences and reduced compression efficiency. Existing tokenizers are optimized for English/Latin-script languages and don't handle Arabic-specific orthographic variations well.

Method: 1) AraToken: Arabic-optimized tokenizer using SentencePiece Unigram algorithm with comprehensive normalization pipeline addressing Arabic-specific variations (Alif variants, diacritics, Arabic-Indic numerals). 2) Language Extension Pipeline (LEP): Method for integrating optimized tokenizer into Qwen3-0.6B through vocabulary extension with mean subtoken initialization and selective transformer layer unfreezing.

Result: SentencePiece with normalization achieves 18% lower fertility (1.199 vs 1.35 tokens/word) compared to unnormalized baselines. LEP reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Systematic comparison shows BPE, WordPiece, and SentencePiece algorithms across multiple configurations.

Conclusion: AraToken significantly improves tokenization efficiency for Arabic, and LEP enables efficient adaptation of existing models to Arabic. The work provides valuable resources (tokenizer, scripts, checkpoints) for Arabic NLP research.

Abstract: Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and Latin-script languages exhibit suboptimal performance on morphologically rich languages such as Arabic, resulting in inflated token sequences and reduced compression efficiency. In this work, we present AraToken, an Arabic-optimized tokenizer built on SentencePiece Unigram algorithm with a comprehensive normalization pipeline addressing Arabic-specific orthographic variations including Alif variants, diacritics, and Arabic-Indic numerals. We systematically compare BPE, WordPiece, and SentencePiece algorithms across multiple configurations, demonstrating that SentencePiece with normalization achieves 18% lower fertility (1.199 vs 1.35 tokens/word) compared to unnormalized baselines. Furthermore, we introduce the Language Extension Pipeline (LEP), a method for integrating the optimized tokenizer into Qwen3-0.6B through vocabulary extension with mean subtoken initialization and selective transformer layer unfreezing. Our experiments show that LEP reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. We release our tokenizer, training scripts, and model checkpoints to facilitate Arabic NLP research.

[25] An Agentic AI Framework for Training General Practitioner Student Skills

Victor De Marez, Jens Van Nooten, Luna De Bruyne, Walter Daelemans

Main category: cs.CL

TL;DR: Agentic framework for virtual simulated patients improves medical education by combining evidence-based vignette generation, persona-driven dialogue, and standards-based assessment with realistic results.

Details

Motivation: Current virtual simulated patients (VSPs) in medical education lack medical accuracy, consistent roleplaying, scenario generation capabilities, and educationally structured feedback, despite the potential of large language models to provide scalable alternatives to resource-intensive traditional methods.

Method: An agentic framework with three unified components: (1) configurable, evidence-based vignette generation, (2) controlled persona-driven patient dialogue with optional retrieval grounding, and (3) standards-based assessment and feedback for both communication and clinical reasoning, implemented in an interactive spoken consultation setting.

Result: Evaluation with 14 medical students showed realistic and vignette-faithful dialogue, appropriate difficulty calibration, stable personality signal, highly useful example-rich feedback, and excellent overall usability.

Conclusion: Agentic separation of scenario control, interaction control, and standards-based assessment provides a practical pattern for building dependable and pedagogically valuable VSP training tools in medical education.

Abstract: Advancements in large language models offer strong potential for enhancing virtual simulated patients (VSPs) in medical education by providing scalable alternatives to resource-intensive traditional methods. However, current VSPs often struggle with medical accuracy, consistent roleplaying, scenario generation for VSP use, and educationally structured feedback. We introduce an agentic framework for training general practitioner student skills that unifies (i) configurable, evidence-based vignette generation, (ii) controlled persona-driven patient dialogue with optional retrieval grounding, and (iii) standards-based assessment and feedback for both communication and clinical reasoning. We instantiate the framework in an interactive spoken consultation setting and evaluate it with medical students ($\mathbf{N{=}14}$). Participants reported realistic and vignette-faithful dialogue, appropriate difficulty calibration, a stable personality signal, and highly useful example-rich feedback, alongside excellent overall usability. These results support agentic separation of scenario control, interaction control, and standards-based assessment as a practical pattern for building dependable and pedagogically valuable VSP training tools.

[26] Mitigating Spurious Correlations in NLI via LLM-Synthesized Counterfactuals and Dynamic Balanced Sampling

Christopher Román Jaimes

Main category: cs.CL

TL;DR: Automated pipeline improves NLI models by detecting semantic artifacts, generating synthetic contrast sets, and using dynamic balanced sampling to prevent forgetting while maintaining performance.

Details

Motivation: NLI models often rely on spurious correlations rather than semantic reasoning, and existing mitigation strategies have high annotation costs or cause catastrophic forgetting during fine-tuning.

Method: Three-step pipeline: 1) Log-Frequency LMI to detect semantic artifacts, 2) LLM-synthesis pipeline with multi-judge verification to generate high-quality synthetic contrast sets, 3) Dynamic Balanced Sampling training strategy that rotates original data distribution to prevent forgetting.

Result: Improves consistency on challenging benchmark from 63.5% to 81.0% while maintaining 88.4% in-domain accuracy, significantly outperforming naive fine-tuning.

Conclusion: Proposed automated, scalable pipeline effectively addresses limitations of existing NLI mitigation strategies by combining artifact detection, synthetic data generation, and balanced training to improve model robustness without catastrophic forgetting.

Abstract: Natural Language Inference (NLI) models frequently rely on spurious correlations rather than semantic reasoning. Existing mitigation strategies often incur high annotation costs or trigger catastrophic forgetting during fine-tuning. We propose an automated, scalable pipeline to address these limitations. First, we introduce Log-Frequency LMI (LF-LMI) to accurately detect semantic artifacts. Second, we generate a high-quality synthetic contrast set via an LLM-synthesis pipeline with multi-judge verification. Finally, we introduce Dynamic Balanced Sampling, a training strategy that rotates the original data distribution to prevent forgetting. Our method improves consistency on a challenging benchmark from 63.5% to 81.0% while maintaining 88.4% in-domain accuracy, significantly outperforming naive fine-tuning.

[27] Research on a hybrid LSTM-CNN-Attention model for text-based web content classification

Mykola Kuz, Ihor Lazarovych, Mykola Kozlenko, Mykola Pikuliak, Andrii Kvasniuk

Main category: cs.CL

TL;DR: Hybrid LSTM-CNN-Attention model with GloVe embeddings achieves state-of-the-art web content classification (0.98 accuracy), outperforming standalone CNNs, LSTMs, and BERT.

Details

Motivation: To overcome limitations of individual neural architectures in web content classification by combining their strengths - CNN for local patterns, LSTM for sequential dependencies, and Attention for focusing on informative parts.

Method: Integrated hybrid architecture using pretrained GloVe embeddings, CNN layer for n-gram patterns, LSTM layer for sequential modeling, and Attention mechanism for selective focus. Evaluated with 5-fold cross-validation.

Result: Outstanding performance: 0.98 accuracy, 0.94 precision, 0.92 recall, 0.93 F1-score, surpassing baseline CNNs, LSTMs, and transformer-based BERT models.

Conclusion: Hybrid deep learning combining CNN, LSTM, and Attention effectively captures both syntactic features and semantic context, demonstrating superior generalization for web content classification and supporting broader use in NLP applications.

Abstract: This study presents a hybrid deep learning architecture that integrates LSTM, CNN, and an Attention mechanism to enhance the classification of web content based on text. Pretrained GloVe embeddings are used to represent words as dense vectors that preserve semantic similarity. The CNN layer extracts local n-gram patterns and lexical features, while the LSTM layer models long-range dependencies and sequential structure. The integrated Attention mechanism enables the model to focus selectively on the most informative parts of the input sequence. A 5-fold cross-validation setup was used to assess the robustness and generalizability of the proposed solution. Experimental results show that the hybrid LSTM-CNN-Attention model achieved outstanding performance, with an accuracy of 0.98, precision of 0.94, recall of 0.92, and F1-score of 0.93. These results surpass the performance of baseline models based solely on CNNs, LSTMs, or transformer-based classifiers such as BERT. The combination of neural network components enabled the model to effectively capture both fine-grained text structures and broader semantic context. Furthermore, the use of GloVe embeddings provided an efficient and effective representation of textual data, making the model suitable for integration into systems with real-time or near-real-time requirements. The proposed hybrid architecture demonstrates high effectiveness in text-based web content classification, particularly in tasks requiring both syntactic feature extraction and semantic interpretation. By combining presented mechanisms, the model addresses the limitations of individual architectures and achieves improved generalization. These findings support the broader use of hybrid deep learning approaches in NLP applications, especially where complex, unstructured textual data must be processed and classified with high reliability.

[28] Teaching and Critiquing Conceptualization and Operationalization in NLP

Vagrant Gautam

Main category: cs.CL

TL;DR: A seminar course designed to help students critically examine how NLP researchers use abstract concepts like “interpretability” and “bias” without clear definitions, exploring what these terms should mean and how to measure them.

Details

Motivation: NLP researchers frequently use abstract concepts (interpretability, bias, reasoning, stereotypes) without defining them, relying on subfield-specific shared understandings that drive dataset creation, metric development, and system claims. This creates ambiguity about what these concepts actually mean and how they should be operationalized.

Method: Created a seminar course with interdisciplinary reading list focused on conceptualization and operationalization questions, emphasizing discussion and critique rather than technical solutions.

Result: The paper outlines the structure and approach of the seminar, providing a framework for students to critically examine how abstract concepts are used in NLP research and explore better ways to define and measure them.

Conclusion: There’s a need for more rigorous examination of how abstract concepts are conceptualized and operationalized in NLP research, and educational interventions like this seminar can help develop critical thinking about these foundational issues.

Abstract: NLP researchers regularly invoke abstract concepts like “interpretability,” “bias,” “reasoning,” and “stereotypes,” without defining them. Each subfield has a shared understanding or conceptualization of what these terms mean and how we should treat them, and this shared understanding is the basis on which operational decisions are made: Datasets are built to evaluate these concepts, metrics are proposed to quantify them, and claims are made about systems. But what do they mean, what should they mean, and how should we measure them? I outline a seminar I created for students to explore these questions of conceptualization and operationalization, with an interdisciplinary reading list and an emphasis on discussion and critique.

[29] Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset

S Mahmudul Hasan, Shaily Roy, Akib Jawad Nafis

Main category: cs.CL

TL;DR: Political disinformation detection faces a hard performance ceiling around F1=0.32, with simple models matching complex transformers, suggesting text-only approaches are insufficient without external knowledge.

Details

Motivation: To understand the empirical limits of text-only linguistic modeling for political disinformation detection, particularly when current research focuses heavily on complex neural architectures without exploring fundamental limitations.

Method: Systematic diagnostic evaluation of 9 ML algorithms on LIAR benchmark, isolating lexical features (Bag-of-Words, TF-IDF) and semantic embeddings (GloVe), comparing simple models (linear SVM) with pre-trained transformers (RoBERTa), and analyzing generalization gaps in tree-based ensembles.

Result: Found a hard “Performance Ceiling” with max weighted F1=0.32; linear SVM (Accuracy: 0.624) matched RoBERTa (0.620); tree-based ensembles showed massive generalization gap (99% training vs 25% test accuracy); synthetic data augmentation via SMOTE provided no meaningful gains.

Conclusion: For political fact-checking, increasing model complexity without incorporating external knowledge yields diminishing returns - the bottleneck is semantic feature ambiguity rather than model capacity or data distribution issues.

Abstract: The proliferation of linguistically subtle political disinformation poses a significant challenge to automated fact-checking systems. Despite increasing emphasis on complex neural architectures, the empirical limits of text-only linguistic modeling remain underexplored. We present a systematic diagnostic evaluation of nine machine learning algorithms on the LIAR benchmark. By isolating lexical features (Bag-of-Words, TF-IDF) and semantic embeddings (GloVe), we uncover a hard “Performance Ceiling”, with fine-grained classification not exceeding a Weighted F1-score of 0.32 across models. Crucially, a simple linear SVM (Accuracy: 0.624) matches the performance of pre-trained Transformers such as RoBERTa (Accuracy: 0.620), suggesting that model capacity is not the primary bottleneck. We further diagnose a massive “Generalization Gap” in tree-based ensembles, which achieve more than 99% training accuracy but collapse to approximately 25% on test data, indicating reliance on lexical memorization rather than semantic inference. Synthetic data augmentation via SMOTE yields no meaningful gains, confirming that the limitation is semantic (feature ambiguity) rather than distributional. These findings indicate that for political fact-checking, increasing model complexity without incorporating external knowledge yields diminishing returns.

[30] LLMs on Drugs: Language Models Are Few-Shot Consumers

Alexander Doudkin

Main category: cs.CL

TL;DR: First controlled study shows psychoactive persona prompts significantly degrade LLM reasoning performance on ARC-Challenge, with alcohol causing the most severe accuracy drop from 45% to 10%.

Details

Motivation: While LLMs are known to be sensitive to personas at inference time, there has been no rigorous benchmarking of prompt-level "drug" interventions and their effects on reasoning performance.

Method: Controlled study on GPT-5-mini using ARC-Challenge with four single-sentence psychoactive prompts (LSD, cocaine, alcohol, cannabis) vs sober control. Used deterministic decoding, full logging, Wilson confidence intervals, and Fisher exact tests across 100 validation items per condition.

Result: Control accuracy was 0.45. All drug prompts significantly reduced performance: alcohol collapsed to 0.10 (p = 3.2e-8), cocaine to 0.21 (p = 4.9e-4), LSD to 0.19 (p = 1.3e-4), and cannabis to 0.30 (p = 0.041). Persona prompts disrupted the mandated answer template format.

Conclusion: Persona text acts like a “few-shot consumable” that can destroy model reliability without modifying weights, highlighting the fragility of LLMs to prompt-level interventions and the importance of controlled benchmarking.

Abstract: Large language models (LLMs) are sensitive to the personas imposed on them at inference time, yet prompt-level “drug” interventions have never been benchmarked rigorously. We present the first controlled study of psychoactive framings on GPT-5-mini using ARC-Challenge. Four single-sentence prompts – LSD, cocaine, alcohol, and cannabis – are compared against a sober control across 100 validation items per condition, with deterministic decoding, full logging, Wilson confidence intervals, and Fisher exact tests. Control accuracy is 0.45; alcohol collapses to 0.10 (p = 3.2e-8), cocaine to 0.21 (p = 4.9e-4), LSD to 0.19 (p = 1.3e-4), and cannabis to 0.30 (p = 0.041), largely because persona prompts disrupt the mandated “Answer: ” template. Persona text therefore behaves like a “few-shot consumable” that can destroy reliability without touching model weights. All experimental code, raw results, and analysis scripts are available at https://github.com/lexdoudkin/llms-on-drugs.

[31] Neologism Learning as a Parameter-Efficient Alternative to Fine-Tuning for Model Steering

Sungjoon Park, Varun Ramamurthi, Owen Terry

Main category: cs.CL

TL;DR: Neologism learning outperforms LoRA fine-tuning for behavioral steering in language models with fewer parameters and more flexibility.

Details

Motivation: To develop a more efficient and flexible method for steering language model behavior compared to traditional fine-tuning approaches like LoRA.

Method: Introduces neologism learning - training new tokens to represent concepts not in the model’s vocabulary, and compares it against LoRA fine-tuning with matched training setups (same data and hyperparameters).

Result: Neologisms outperform fine-tuned models under matched training conditions, and models sometimes create their own new words when asked about neologisms (self-verbalizations).

Conclusion: Neologism learning is a promising alternative to fine-tuning for behavioral steering, offering better performance with fewer parameters and greater flexibility while maintaining access to the model’s default behavior.

Abstract: In language modeling, neologisms are new tokens trained to represent a concept not already included in a given model’s vocabulary. Neologisms can be used to encourage specific behavior in models, for example by appending prompts with “Give me a neologism answer.” Behavioral steering can also be achieved through fine-tuning, albeit with more compute and less flexibility: learning a neologism only trains d parameters and allows the user to still access the model’s default behavior. We compare the performance of neologism learning against low-rank adaptation (LoRA) fine-tuning, finding that neologisms outperform fine-tuned models under a matched training setup (same data and hyperparameters). We also investigate self-verbalizations of neologisms, and observe that the model will occasionally make up its own new words when asked about a neologism.

[32] From Scratch to Fine-Tuned: A Comparative Study of Transformer Training Strategies for Legal Machine Translation

Amit Barman, Atanu Mandal, Sudip Kumar Naskar

Main category: cs.CL

TL;DR: This paper presents English-Hindi legal machine translation work for JUST-NLP 2025, showing that fine-tuning pre-trained OPUS-MT models significantly outperforms training from scratch, achieving a SacreBLEU score of 46.03.

Details

Motivation: In multilingual nations like India, language barriers hinder access to legal information since most legal documentation remains in English. Legal Machine Translation (L-MT) offers a scalable solution to make legal documents accessible across languages.

Method: Two complementary strategies: 1) fine-tuning a pre-trained OPUS-MT model for domain-specific adaptation to legal texts, and 2) training a Transformer model from scratch using the provided legal corpus. Performance evaluated using multiple MT metrics including SacreBLEU, chrF++, TER, ROUGE, BERTScore, METEOR, and COMET.

Result: The fine-tuned OPUS-MT model achieved a SacreBLEU score of 46.03, significantly outperforming both baseline models and the from-scratch trained Transformer model.

Conclusion: Domain adaptation through fine-tuning pre-trained models is highly effective for legal machine translation, demonstrating the potential of L-MT systems to improve access to justice and legal transparency in multilingual contexts.

Abstract: In multilingual nations like India, access to legal information is often hindered by language barriers, as much of the legal and judicial documentation remains in English. Legal Machine Translation (L-MT) offers a scalable solution to this challenge by enabling accurate and accessible translations of legal documents. This paper presents our work for the JUST-NLP 2025 Legal MT shared task, focusing on English-Hindi translation using Transformer-based approaches. We experiment with 2 complementary strategies, fine-tuning a pre-trained OPUS-MT model for domain-specific adaptation and training a Transformer model from scratch using the provided legal corpus. Performance is evaluated using standard MT metrics, including SacreBLEU, chrF++, TER, ROUGE, BERTScore, METEOR, and COMET. Our fine-tuned OPUS-MT model achieves a SacreBLEU score of 46.03, significantly outperforming both baseline and from-scratch models. The results highlight the effectiveness of domain adaptation in enhancing translation quality and demonstrate the potential of L-MT systems to improve access to justice and legal transparency in multilingual contexts.

[33] On Finding Inconsistencies in Documents

Charles J. Lovering, Seth Ebner, Brandon Smock, Michael Krumdick, Saad Rabbani, Ahmed Muhammad, Varshini Reddy, Chris Tanner

Main category: cs.CL

TL;DR: A benchmark called FIND evaluates language models’ ability to detect inconsistencies in long, technical documents, with GPT-5 achieving 64% detection rate but still missing many inconsistencies.

Details

Motivation: Professionals in academia, law, and finance need to audit documents for inconsistencies that can have serious consequences (monetary, reputational, scientific costs). Language models could potentially speed up this auditing process, but their capabilities need systematic evaluation.

Method: Created the FIND benchmark where domain experts manually insert inconsistencies into documents. Tested language models on these documents, including evaluating GPT-5 on 50 arXiv papers to find both inserted and original inconsistencies.

Result: GPT-5 recovered 64% of manually inserted inconsistencies in the FIND benchmark. Surprisingly, it also found 136 legitimate inconsistencies in 50 arXiv papers that were missed by original authors (out of 196 suggestions). However, even the best models miss almost half of the inconsistencies.

Conclusion: While language models like GPT-5 show promise in detecting document inconsistencies and can find errors missed by human authors, inconsistency detection remains a challenging task with current models missing significant portions of inconsistencies, indicating need for further improvement.

Abstract: Professionals in academia, law, and finance audit their documents because inconsistencies can result in monetary, reputational, and scientific costs. Language models (LMs) have the potential to dramatically speed up this auditing process. To understand their abilities, we introduce a benchmark, FIND (Finding INconsistencies in Documents), where each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies. Surprisingly, gpt-5 also found undiscovered inconsistencies present in the original documents. For example, on 50 arXiv papers, we judged 136 out of 196 of the model’s suggestions to be legitimate inconsistencies missed by the original authors. However, despite these findings, even the best models miss almost half of the inconsistencies in FIND, demonstrating that inconsistency detection is still a challenging task.

[34] A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts

Prabigya Acharya, Liza Shrestha

Main category: cs.CL

TL;DR: Lightweight models (T5-small & Mistral) achieve PII masking performance comparable to frontier LLMs, with trade-offs: Mistral offers better robustness but higher latency, while T5 provides structured outputs and lower cost.

Details

Motivation: Address concerns about data handling and computational costs of frontier LLMs for PII masking by exploring whether lightweight models can achieve comparable performance for privacy-preserving conversational systems.

Method: Fine-tuned encoder-decoder (T5-small) and decoder-only (Mistral-Instruct-v0.3) architectures on English datasets from AI4Privacy benchmark. Created dataset variants to study label standardization and PII representation across 24 standardized categories and higher-granularity settings.

Result: Both lightweight models achieve performance comparable to frontier LLMs. Mistral achieves higher F1 and recall with greater robustness across PII types but has significantly higher generation latency. T5 offers more controllable structured outputs and lower inference cost, though less robust in conversational text. Label normalization consistently improves performance across architectures.

Conclusion: Lightweight models can provide effective PII masking while addressing data handling concerns of frontier LLMs, with clear trade-offs: Mistral for accuracy/robustness, T5 for structured outputs and lower cost. Real-world testing shows performance degradation with informal inputs.

Abstract: Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. While current frontier large language models demonstrate strong PII masking capabilities, concerns about data handling and computational costs motivate exploration of whether lightweight models can achieve comparable performance. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. We create different dataset variants to study label standardization and PII representation, covering 24 standardized PII categories and higher-granularity settings. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks. Label normalization consistently improves performance across architectures. Mistral achieves higher F1 and recall with greater robustness across PII types but incurs significantly higher generation latency. T5, while less robust in conversational text, offers more controllable structured outputs and lower inference cost, motivating its use in a real-time Discord bot for real-world PII redaction. Evaluation on live messages reveals performance degradation under informal inputs. These results clarify trade-offs between accuracy, robustness, and computational efficiency, demonstrating that lightweight models can provide effective PII masking while addressing data handling concerns associated with frontier LLMs.

[35] LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction

Jensen Zhang, Ningyuan Liu, Yijia Fan, Zihao Huang, Qinglin Zeng, Kaitong Cai, Jian Wang, Keze Wang

Main category: cs.CL

TL;DR: LLM-CAS: A hierarchical reinforcement learning framework that dynamically corrects LLM hallucinations during inference using temporary neuron perturbations, outperforming existing methods on factual accuracy benchmarks.

Details

Motivation: LLMs often generate hallucinated content lacking factual grounding, limiting reliability in critical applications. Existing approaches (supervised fine-tuning, RLHF) are data/compute intensive, while static parameter editing struggles with context-dependent errors and catastrophic forgetting.

Method: Formulates real-time hallucination correction as hierarchical reinforcement learning. Trains an agent to learn a policy that dynamically selects temporary neuron perturbations during inference based on current context, enabling adaptive fine-grained correction without permanent parameter modification.

Result: Consistently improves factual accuracy across multiple language models: +10.98pp on StoryCloze, +2.71pp on TriviaQA, +2.06pp on MC1 score of TruthfulQA. Outperforms static editing methods (ITI, CAA) and dynamic SADI framework.

Conclusion: LLM-CAS provides an efficient, context-aware solution for improving LLM reliability with promising potential for future multimodal extensions, addressing hallucination correction without permanent model modification.

Abstract: Large language models (LLMs) often generate hallucinated content that lacks factual or contextual grounding, limiting their reliability in critical applications. Existing approaches such as supervised fine-tuning and reinforcement learning from human feedback are data intensive and computationally expensive, while static parameter editing methods struggle with context dependent errors and catastrophic forgetting. We propose LLM-CAS, a framework that formulates real-time hallucination correction as a hierarchical reinforcement learning problem. LLM-CAS trains an agent to learn a policy that dynamically selects temporary neuron perturbations during inference based on the current context. Unlike prior dynamic approaches that rely on heuristic or predefined adjustments, this policy driven mechanism enables adaptive and fine grained correction without permanent parameter modification. Experiments across multiple language models demonstrate that LLM-CAS consistently improves factual accuracy, achieving gains of 10.98 percentage points on StoryCloze, 2.71 points on TriviaQA, and 2.06 points on the MC1 score of TruthfulQA. These results outperform both static editing methods such as ITI and CAA and the dynamic SADI framework. Overall, LLM-CAS provides an efficient and context aware solution for improving the reliability of LLMs, with promising potential for future multimodal extensions.

[36] Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital

Pierre Colombo, Malik Boudiaf, Allyn Sweet, Michael Desa, Hongxi Wang, Kevin Candra, Syméon del Marmol

Main category: cs.CL

TL;DR: The paper addresses the challenge of automating capitalization table verification in venture capital financing using AI, proposing a specialized world model architecture for this complex legal workflow.

Details

Motivation: Current LLMs and agentic systems fail at specialized legal workflows like capitalization tie-out, which requires multi-document reasoning, strict evidence traceability, and deterministic outputs that existing approaches cannot reliably deliver.

Method: The authors characterize capitalization tie-out as a real-world benchmark for legal AI, analyze existing agentic systems’ performance, and propose a world model architecture specifically designed for tie-out automation and applied legal intelligence.

Result: The paper establishes capitalization tie-out as a challenging benchmark for legal AI systems and demonstrates that current approaches are insufficient for this specialized workflow, necessitating new architectural solutions.

Conclusion: A specialized world model architecture is needed to automate complex legal workflows like capitalization tie-out, serving as a foundation for broader applied legal intelligence beyond current LLM capabilities.

Abstract: Before closing venture capital financing rounds, lawyers conduct diligence that includes tying out the capitalization table: verifying that every security (for example, shares, options, warrants) and issuance term (for example, vesting schedules, acceleration triggers, transfer restrictions) is supported by large sets of underlying legal documentation. While LLMs continue to improve on legal benchmarks, specialized legal workflows, such as capitalization tie-out, remain out of reach even for strong agentic systems. The task requires multi-document reasoning, strict evidence traceability, and deterministic outputs that current approaches fail to reliably deliver. We characterize capitalization tie-out as an instance of a real-world benchmark for legal AI, analyze and compare the performance of existing agentic systems, and propose a world model architecture toward tie-out automation-and more broadly as a foundation for applied legal intelligence.

[37] Solver-Independent Automated Problem Formulation via LLMs for High-Cost Simulation-Driven Design

Yuchen Li, Handing Wang, Bing Xue, Mengjie Zhang, Yaochu Jin

Main category: cs.CL

TL;DR: APF is an LLM-based framework that automatically converts natural language design requirements into executable optimization models without needing solver feedback, overcoming the bottleneck in simulation-driven design.

Details

Motivation: In simulation-driven design, translating ambiguous requirements into mathematical optimization is time-consuming and expert-dependent. Existing LLM approaches either produce poor formalization or rely on unavailable solver feedback due to high simulation costs.

Method: APF uses an innovative pipeline for automatically generating high-quality fine-tuning data through data generation and test instance annotation, without needing solver feedback. This dataset is used for supervised fine-tuning of LLMs to produce accurate, executable optimization problem formulations.

Result: Experimental results on antenna design show APF significantly outperforms existing methods in both requirement formalization accuracy and the quality of resulting radiation efficiency curves in meeting design goals.

Conclusion: APF provides a solver-independent framework that effectively automates the translation of natural language requirements into optimization models, addressing a key bottleneck in simulation-driven design while avoiding the need for expensive solver feedback.

Abstract: In the high-cost simulation-driven design domain, translating ambiguous design requirements into a mathematical optimization formulation is a bottleneck for optimizing product performance. This process is time-consuming and heavily reliant on expert knowledge. While large language models (LLMs) offer potential for automating this task, existing approaches either suffer from poor formalization that fails to accurately align with the design intent or rely on solver feedback for data filtering, which is unavailable due to the high simulation costs. To address this challenge, we propose APF, a framework for solver-independent, automated problem formulation via LLMs designed to automatically convert engineers’ natural language requirements into executable optimization models. The core of this framework is an innovative pipeline for automatically generating high-quality data, which overcomes the difficulty of constructing suitable fine-tuning datasets in the absence of high-cost solver feedback with the help of data generation and test instance annotation. The generated high-quality dataset is used to perform supervised fine-tuning on LLMs, significantly enhancing their ability to generate accurate and executable optimization problem formulations. Experimental results on antenna design demonstrate that APF significantly outperforms the existing methods in both the accuracy of requirement formalization and the quality of resulting radiation efficiency curves in meeting the design goals.

[38] From Natural Language to Control Signals: A Conceptual Framework for Semantic Channel Finding in Complex Experimental Infrastructure

Thorsten Hellert, Nikolay Agladze, Alex Giovannone, Jan Jug, Frank Mayet, Mark Sherwin, Antonin Sulc, Chris Tennant

Main category: cs.CL

TL;DR: The paper introduces a four-paradigm framework for semantic channel finding in complex experimental facilities to map natural-language queries to control-system signals, addressing the challenge of fragmented documentation and inconsistent naming conventions.

Details

Motivation: Modern experimental facilities have tens to hundreds of thousands of control channels with inconsistent naming and fragmented documentation, creating bottlenecks for operators and AI systems trying to locate signals for monitoring, troubleshooting, and automated control.

Method: A four-paradigm framework: (1) direct in-context lookup over curated channel dictionaries, (2) constrained hierarchical navigation through structured trees, (3) interactive agent exploration using iterative reasoning and tool-based database queries, and (4) ontology-grounded semantic search that decouples channel meaning from facility-specific naming conventions.

Result: Proof-of-concept implementations at four operational facilities (from compact free-electron lasers to large synchrotron light sources) achieved 90-97% accuracy on expert-curated operational queries across diverse control-system architectures.

Conclusion: The proposed framework provides a systematic approach to semantic channel finding that can scale across different facility types and data regimes, significantly improving signal location accuracy and addressing the documentation bottleneck in complex experimental infrastructure.

Abstract: Modern experimental platforms such as particle accelerators, fusion devices, telescopes, and industrial process control systems expose tens to hundreds of thousands of control and diagnostic channels accumulated over decades of evolution. Operators and AI systems rely on informal expert knowledge, inconsistent naming conventions, and fragmented documentation to locate signals for monitoring, troubleshooting, and automated control, creating a persistent bottleneck for reliability, scalability, and language-model-driven interfaces. We formalize semantic channel finding-mapping natural-language intent to concrete control-system signals-as a general problem in complex experimental infrastructure, and introduce a four-paradigm framework to guide architecture selection across facility-specific data regimes. The paradigms span (i) direct in-context lookup over curated channel dictionaries, (ii) constrained hierarchical navigation through structured trees, (iii) interactive agent exploration using iterative reasoning and tool-based database queries, and (iv) ontology-grounded semantic search that decouples channel meaning from facility-specific naming conventions. We demonstrate each paradigm through proof-of-concept implementations at four operational facilities spanning two orders of magnitude in scale-from compact free-electron lasers to large synchrotron light sources-and diverse control-system architectures, from clean hierarchies to legacy environments. These implementations achieve 90-97% accuracy on expert-curated operational queries.

[39] From Word to World: Can Large Language Models be Implicit Text-based World Models?

Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang, Cheng Qian, Zeping Li, Pony Ma, Guanhua Chen, Heng Ji, Mengdi Wang

Main category: cs.CL

TL;DR: LLM-based world models can improve agent learning in text environments through synthetic experience generation, but effectiveness depends on behavioral coverage and environment complexity.

Details

Motivation: Real-world environments are non-adaptive, limited in coverage, and difficult to scale, creating challenges for agentic reinforcement learning. While world models could improve learning efficiency, it's unclear whether LLMs can reliably serve as world models and under what conditions they meaningfully benefit agents.

Method: Introduce a three-level evaluation framework for LLM-based world models: (1) fidelity and consistency, (2) scalability and robustness, and (3) agent utility. Study these in text-based environments where language modeling can be reinterpreted as next-state prediction under interaction. Test across five representative environments.

Result: Sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance through action verification, synthetic trajectory generation, and warm-starting reinforcement learning.

Conclusion: LLM-based world models can effectively support agent learning, but gains critically depend on behavioral coverage and environment complexity, establishing clear boundaries on when world modeling is beneficial.

Abstract: Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents. We study these questions in text-based environments, which provide a controlled setting to reinterpret language modeling as next-state prediction under interaction. We introduce a three-level framework for evaluating LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we find that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning. Meanwhile, these gains depend critically on behavioral coverage and environment complexity, delineating clear boundry on when world modeling effectively supports agent learning.

[40] AraMix: Recycling, Refiltering, and Deduplicating to Deliver the Largest Arabic Pretraining Corpus

Sultan Alrashed, Francesco Orabona

Main category: cs.CL

TL;DR: AraMix is a deduplicated Arabic pretraining corpus (178B tokens, 179M docs) created by combining and curating seven existing Arabic datasets rather than new web scraping, revealing 60% token duplication across existing corpora.

Details

Motivation: For lower-resource languages like Arabic, there's substantial value in systematically reusing and curating existing pretraining datasets rather than conducting new web scraping, as existing datasets contain significant redundancy.

Method: Combine seven publicly available Arabic web datasets, apply Arabic-specific quality filtering to re-filter some datasets, and perform cross-dataset deduplication using both MinHash and sentence-level approaches.

Result: Created the largest heavily filtered publicly available Arabic pretraining corpus (178B tokens, 179M documents) and discovered that nearly 60% of tokens across independently collected corpora are duplicates.

Conclusion: For lower-resource languages, investment in curation pipelines for existing data yields greater returns than additional web crawls, as new scraping efforts would reproduce existing redundancy.

Abstract: We present AraMix, a deduplicated Arabic pretraining corpus containing approximately 178 billion tokens across 179 million documents. Rather than scraping the web again, AraMix demonstrates that substantial value lies in systematically reusing and curating existing pretraining datasets: we combine seven publicly available Arabic web datasets, apply quality filtering designed specifically for Arabic text to re-filter some datasets, and perform cross-dataset deduplication, both MinHash and sentence-level. This approach reveals that nearly 60% of tokens across these independently collected corpora are duplicates, redundancy that any new scraping efforts will reproduce. Our work suggests that for lower resource languages, investment in curation pipelines for existing data yields greater returns than additional web crawls, an approach that allowed us to curate the largest heavily filtered publicly available Arabic pretraining corpus.

[41] MDToC: Metacognitive Dynamic Tree of Concepts for Boosting Mathematical Problem-Solving of Large Language Models

Tung Duong Ta, Tim Oates

Main category: cs.CL

TL;DR: MDToC (Metacognitive Dynamic Tree of Concepts) is a three-phase prompting method that improves mathematical reasoning in LLMs by constructing concept trees, verifying calculations, and using majority voting, outperforming existing methods like ToT and GoT across multiple benchmarks.

Details

Motivation: Despite advances in mathematical reasoning, LLMs still struggle with calculation verification using current prompting techniques. There's a need for better methods that can improve accuracy in mathematical problem-solving without relying on hand-engineered hints.

Method: MDToC uses a three-phase approach: 1) Constructs a concept tree to break down problems, 2) Develops accuracy-verified calculations for each concept, and 3) Employs majority voting to evaluate competing solutions. This metacognitive approach focuses on calculation verification.

Result: MDToC achieves 58.1% on CHAMP, 86.6% on MATH, and 85% on Game-of-24 using GPT-4-Turbo, outperforming GoT by 5%, 5.4%, and 4% respectively. It consistently surpasses existing prompting methods across all backbone models with improvements up to 7.6% over ToT and 6.2% over GoT.

Conclusion: Metacognitive calculation verification through MDToC represents a promising direction for enhanced mathematical reasoning in LLMs, demonstrating consistent improvements over existing prompting methods without requiring hand-engineered hints.

Abstract: Despite advances in mathematical reasoning capabilities, Large Language Models (LLMs) still struggle with calculation verification when using established prompting techniques. We present MDToC (Metacognitive Dynamic Tree of Concepts), a three-phase approach that constructs a concept tree, develops accuracy-verified calculations for each concept, and employs majority voting to evaluate competing solutions. Evaluations across CHAMP, MATH, and Game-of-24 benchmarks demonstrate our MDToC’s effectiveness, with GPT-4-Turbo achieving 58.1% on CHAMP, 86.6% on MATH, and 85% on Game-of-24 - outperforming GoT by 5%, 5.4%, and 4% on all these tasks, respectively, without hand-engineered hints. MDToC consistently surpasses existing prompting methods across all backbone models, yielding improvements of up to 7.6% over ToT and 6.2% over GoT, establishing metacognitive calculation verification as a promising direction for enhanced mathematical reasoning.

[42] Toward Human-Centered AI-Assisted Terminology Work

Antonio San Martin

Main category: cs.CL

TL;DR: A human-centered AI framework is proposed for terminology work to balance efficiency gains with professional autonomy, bias mitigation, and linguistic diversity preservation.

Details

Motivation: The rapid diffusion of generative AI in terminology work risks weakening professional autonomy, amplifying bias, and eroding linguistic/conceptual diversity, necessitating a human-centered approach.

Method: Proposes a human-centered framework conceptualizing AI as amplifying terminologists’ capabilities, organized around three dimensions: augmented terminologist, ethical AI, and human-centered design.

Result: A framework emphasizing compatibility of high automation with strong human control, terminologists’ central role in bias mitigation, and designing AI tools around terminologists’ needs, values, and well-being.

Conclusion: Current AI adoption choices will shape not only terminological practice but also the preservation of accuracy, adequacy, and diversity in terminology and specialized knowledge.

Abstract: The rapid diffusion of generative artificial intelligence is transforming terminology work. While this technology promises gains in efficiency, its unstructured adoption risks weakening professional autonomy, amplifying bias, and eroding linguistic and conceptual diversity. This paper argues that a human-centered approach to artificial intelligence has become a necessity for terminology work. Building on research in artificial intelligence and translation studies, it proposes a human-centered framework that conceptualizes artificial intelligence as a means of amplifying the terminologist’s capabilities, rather than replacing them. The framework is organized around three interrelated dimensions: the augmented terminologist, ethical AI, and human-centered design. Together, these dimensions emphasize the compatibility of high automation with strong human control, the central role of terminologists in bias mitigation, and the importance of designing AI tools and workflows around the needs, values, and well-being of the terminologist. The paper concludes by stressing that current choices in AI adoption will shape not only terminological practice, but also the preservation of accuracy, adequacy, and diversity in terminology and specialized knowledge.

[43] Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao, Tianyi Zhou

Main category: cs.CL

TL;DR: LLMs fail to accurately estimate question difficulty for human learners despite their strong problem-solving abilities, showing systematic misalignment with human perceptions.

Details

Motivation: Accurate item difficulty estimation is crucial for educational assessment but suffers from cold start problems. While LLMs have strong problem-solving capabilities, it's unknown whether they can perceive human cognitive struggles.

Method: Large-scale empirical analysis of Human-AI Difficulty Alignment across 20+ models in domains like medical knowledge and mathematical reasoning, examining how models estimate difficulty compared to human perceptions.

Result: Systematic misalignment where scaling model size doesn’t help; models converge to machine consensus rather than human alignment. High performance impedes accurate difficulty estimation, and models lack introspection about their own limitations.

Conclusion: General problem-solving capability doesn’t imply understanding of human cognitive struggles, highlighting challenges in using current LLMs for automated difficulty prediction in education.

Abstract: Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.

[44] Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations

Shaomu Tan, Ryosuke Mitani, Ritvik Choudhary, Qiyu Wu, Toshiyuki Sekiya, Christof Monz

Main category: cs.CL

TL;DR: Remedy-R is a reasoning-driven MT metric using RL from pairwise preferences, producing interpretable step-by-step analyses and scores without error annotations or LLM distillation.

Details

Motivation: Current MT metrics are black-box systems with poor interpretability and limited robustness on out-of-distribution inputs, despite strong benchmark performance.

Method: Train with reinforcement learning from pairwise translation preferences, generating step-by-step analyses of accuracy, fluency, and completeness before final scoring. No error-span annotations or distillation from closed LLMs required.

Result: Competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, shows strong OOD robustness, and enables translation improvement via self-reflective feedback.

Conclusion: Remedy-R provides interpretable, robust MT evaluation, and its reasoning can be practically applied through Remedy-R Agent to improve translation quality across diverse models.

Abstract: Over the years, automatic MT metrics have hillclimbed benchmarks and presented strong and sometimes human-level agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing under real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R’s evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R’s reasoning captures translation-relevant information and is practically useful.

[45] FASTRIC: Prompt Specification Language for Verifiable LLM Interactions

Wen-Long Jin

Main category: cs.CL

TL;DR: FASTRIC introduces a Prompt Specification Language that makes implicit Finite State Machines explicit in natural language prompts, enabling LLMs to execute verifiable multi-turn interactions with measurable procedural guarantees.

Details

Motivation: LLMs execute complex multi-turn interactions but lack formal specifications to verify execution against designer intent, making interaction design more of a heuristic art than systematic engineering.

Method: FASTRIC guides designers to articulate seven FSM elements (Final States, Agents, States, Triggers, Roles, Initial State, Constraints) in natural language prompts. It uses LLMs as unified infrastructure for parsing, interpreting, and executing specifications, with varying formality levels from implicit to explicit instructions.

Result: Testing a 3-state kindergarten tutoring FSM across models revealed “Goldilocks zones” of optimal specification formality: DeepSeek-V3.2 (685B) achieved perfect conformance (1.00) at L2-L4; ChatGPT-5 (~1T) peaked at L3 (0.90) before collapsing at L4 (0.39); Phi4 (14.7B) showed no stable optimum with high variance.

Conclusion: FASTRIC establishes Prompt Specification Engineering for creating verifiable interaction protocols, transforming multi-turn interaction design from heuristic art to systematic engineering with measurable procedural guarantees, with optimal specification formality being model-dependent.

Abstract: Large Language Models (LLMs) execute complex multi-turn interaction protocols but lack formal specifications to verify execution against designer intent. We introduce FASTRIC, a Prompt Specification Language that makes implicit Finite State Machines (FSMs) explicit in natural language prompts, enabling conformance verification through execution trace analysis. The LLM serves as intelligent execution agent: interpreting designer-encoded FSMs to execute specified behavioral roles. Unlike symbolic specification languages requiring parsers and compilers, FASTRIC leverages LLMs as unified infrastructure-simultaneously parser, interpreter, runtime environment, and development assistant. FASTRIC guides designers to articulate seven FSM elements (Final States, Agents, States, Triggers, Roles, Initial State, Constraints) structuring multi-turn interactions. Specification formality-ranging from implicit descriptions that frontier models infer to explicit step-by-step instructions for weaker models-serves as a design parameter. We introduce procedural conformance as verification metric measuring execution adherence to FSM specifications. Testing a 3-state kindergarten tutoring FSM across four formality levels and three model scales (14.7B, 685B, 1T+ parameters) reveals optimal specification formality is a function of model capacity. DeepSeek-V3.2 (685B) achieves perfect conformance (1.00) at L2-L4; ChatGPT-5 (~1T) peaks at L3 (0.90) before collapsing at L4 (0.39); Phi4 (14.7B) shows no stable optimum with high variance (SD=0.16-0.36). These findings reveal model-specific formality ranges-“Goldilocks zones”-where specifications provide sufficient structure without over-constraint, establishing Prompt Specification Engineering for creating verifiable interaction protocols, transforming multi-turn interaction design from heuristic art to systematic engineering with measurable procedural guarantees.

[46] Evaluating the Challenges of LLMs in Real-world Medical Follow-up: A Comparative Study and An Optimized Framework

Jinyan Liu, Zikang Chen, Qinchuan Wang, Tan Xie, Heming Zheng, Xudong Lv

Main category: cs.CL

TL;DR: Modular LLM pipeline with structured control outperforms end-to-end approach for medical follow-up chatbots, improving dialog stability, extraction accuracy, and reducing turns/tokens significantly.

Details

Motivation: End-to-end LLMs struggle with medical follow-up tasks due to uncontrolled dialog flow and inaccurate information extraction from complex forms, necessitating better control mechanisms.

Method: Designed two systems: end-to-end LLM (control) vs modular pipeline with structured process control (experimental). Modular approach uses task decomposition, semantic clustering, and flow management.

Result: Modular method substantially improves dialog stability and extraction accuracy, reduces dialogue turns by 46.73%, and lowers token consumption by 80-87.5% compared to end-to-end approach.

Conclusion: External control mechanisms are necessary for deploying LLMs in high-stakes medical follow-up scenarios to ensure reliability and efficiency.

Abstract: When applied directly in an end-to-end manner to medical follow-up tasks, Large Language Models (LLMs) often suffer from uncontrolled dialog flow and inaccurate information extraction due to the complexity of follow-up forms. To address this limitation, we designed and compared two follow-up chatbot systems: an end-to-end LLM-based system (control group) and a modular pipeline with structured process control (experimental group). Experimental results show that while the end-to-end approach frequently fails on lengthy and complex forms, our modular method-built on task decomposition, semantic clustering, and flow management-substantially improves dialog stability and extraction accuracy. Moreover, it reduces the number of dialogue turns by 46.73% and lowers token consumption by 80% to 87.5%. These findings highlight the necessity of integrating external control mechanisms when deploying LLMs in high-stakes medical follow-up scenarios.

[47] Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models

Tongyuan Miao, Gary Huang, Kai Jun Han, Annie Jiang

Main category: cs.CL

TL;DR: DLLMs enable parallel token decoding but need many denoising iterations. The paper proposes training-free context-aware initialization to shorten the generative trajectory by starting closer to the target distribution using prompt-conditioned priors from a lightweight auxiliary model.

Details

Motivation: Diffusion Large Language Models (DLLMs) are impractical at inference time due to the many denoising iterations required to refine fully masked initialization into coherent text. Existing acceleration methods focus on improving solvers/sampling strategies, but the authors propose a complementary approach: shorten the trajectory itself by starting closer to the target distribution.

Method: Propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into diffusion initialization. Two mechanisms: discrete token injection and representation-level embedding interpolation. Also introduce confidence-based remasking mechanism as prior skepticism to handle imperfect priors and prevent over-commitment in unmask-only decoding.

Result: Preliminary evidence on GSM8K shows context-aware initialization can substantially reduce denoising iterations (about 35% fewer function evaluations). However, naive warm-starting can degrade final accuracy relative to strong diffusion baselines, exposing a key open challenge.

Conclusion: The findings motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding. The paper advances the perspective of shortening generative trajectories through context-aware initialization rather than just improving traversal efficiency.

Abstract: Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization. We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.

[48] DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation

Shijian Ma, Yunqi Huang, Yan Lin

Main category: cs.CL

TL;DR: DramaBench is a new benchmark for evaluating drama script continuation across six dimensions, providing comprehensive assessment of character consistency, plot coherence, and dramatic structure.

Details

Motivation: Existing benchmarks fail to comprehensively evaluate drama script continuation capabilities, particularly in maintaining character consistency, advancing plot coherently, and preserving dramatic structure.

Method: Combines rule-based analysis with LLM-based labeling and statistical metrics to create an objective, reproducible evaluation framework across six dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling.

Result: Evaluated 8 state-of-the-art language models on 1,103 scripts (8,824 total evaluations) with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Ablation studies confirm all six dimensions capture independent quality aspects (mean |r| = 0.020).

Conclusion: DramaBench provides actionable, dimension-specific feedback for model improvement and establishes a rigorous standard for creative writing evaluation, addressing the gap in comprehensive drama script continuation assessment.

Abstract: Drama script continuation requires models to maintain character consistency, advance plot coherently, and preserve dramatic structurecapabilities that existing benchmarks fail to evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rulebased analysis with LLM-based labeling and statistical metrics, ensuring objective and reproducible evaluation. We conduct comprehensive evaluation of 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Our ablation studies confirm all six dimensions capture independent quality aspects (mean | r | = 0.020). DramaBench provides actionable, dimensionspecific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.

[49] A Large Language Model Based Method for Complex Logical Reasoning over Knowledge Graphs

Ziyan Zhang, Chao Wang, Zhuo Chen, Lei Chen, Chiyi Li, Kai Song

Main category: cs.CL

TL;DR: ROG combines KG neighborhood retrieval with LLM chain-of-thought reasoning to answer complex FOL queries, outperforming embedding-based methods.

Details

Motivation: Existing embedding-based methods struggle with complex FOL queries involving multiple operators, deep reasoning chains, or heterogeneous KG schemas due to inherent KG incompleteness and compositional complexity.

Method: ROG is an ensemble framework that decomposes complex FOL queries into simpler sub-queries, retrieves query-relevant subgraphs as contextual evidence, and performs step-by-step logical inference using LLMs with chain-of-thought reasoning.

Result: ROG consistently outperforms strong embedding-based baselines on standard KG reasoning benchmarks in terms of MRR, with particularly notable gains on high-complexity query types.

Conclusion: Integrating structured KG retrieval with LLM-driven logical reasoning offers a robust and effective alternative for complex KG reasoning tasks, avoiding task-specific embedding optimization.

Abstract: Reasoning over knowledge graphs (KGs) with first-order logic (FOL) queries is challenging due to the inherent incompleteness of real-world KGs and the compositional complexity of logical query structures. Most existing methods rely on embedding entities and relations into continuous geometric spaces and answer queries via differentiable set operations. While effective for simple query patterns, these approaches often struggle to generalize to complex queries involving multiple operators, deeper reasoning chains, or heterogeneous KG schemas. We propose ROG (Reasoning Over knowledge Graphs with large language models), an ensemble-style framework that combines query-aware KG neighborhood retrieval with large language model (LLM)-based chain-of-thought reasoning. ROG decomposes complex FOL queries into sequences of simpler sub-queries, retrieves compact, query-relevant subgraphs as contextual evidence, and performs step-by-step logical inference using an LLM, avoiding the need for task-specific embedding optimization. Experiments on standard KG reasoning benchmarks demonstrate that ROG consistently outperforms strong embedding-based baselines in terms of mean reciprocal rank (MRR), with particularly notable gains on high-complexity query types. These results suggest that integrating structured KG retrieval with LLM-driven logical reasoning offers a robust and effective alternative for complex KG reasoning tasks.

[50] Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?

Amar Lakel

Main category: cs.CL

TL;DR: The paper proposes shifting from “Large Language Models” to “Large Discourse Models” and then to “Artificial Discursive Agents” as an epistemological reframing, arguing for public governance and co-regulation of these systems.

Details

Motivation: To move beyond the "fascination/fear" dichotomy surrounding large generative models by reframing them as discursive agents embedded in socio-historical contexts, enabling more nuanced public understanding and governance.

Method: Proposes an ontological triad framework distinguishing three regulatory instances: phenomenal regularities of the referential world, embodied cognition structuring, and structural-linguistic sedimentation within socio-historical contexts. LDMs are analyzed as modeling discursive projections of human experience reified in learning corpora.

Result: A theoretical reframing that positions AI systems as Artificial Discursive Agents operating within complex socio-historical contexts, enabling new approaches to governance and public understanding.

Conclusion: The paper concludes by advocating for public trials and procedures to make artificial discursive agents’ place, uses, and limits decipherable in social space, promoting a governance and co-regulation approach involving State, industry, civil society, and academia.

Abstract: This paper proposes an epistemological shift in the analysis of large generative models, replacing the category ‘‘Large Language Models’’ (LLM) with that of ‘‘Large Discourse Models’’ (LDM), and then with that of Artificial Discursive Agent (ADA). The theoretical framework is based on an ontological triad distinguishing three regulatory instances: the apprehension of the phenomenal regularities of the referential world, the structuring of embodied cognition, and the structural-linguistic sedimentation of the utterance within a socio-historical context. LDMs, operating on the product of these three instances (the document), model the discursive projection of a portion of human experience reified by the learning corpus. The proposed program aims to replace the ‘‘fascination/fear’’ dichotomy with public trials and procedures that make the place, uses, and limits of artificial discursive agents in contemporary social space decipherable, situating this approach within a perspective of governance and co-regulation involving the State, industry, civil society, and academia.

[51] SAP: Syntactic Attention Pruning for Transformer-based Language Models

Tzu-Yun Lee, Ding-Yong Hong, Jan-Jan Wu

Main category: cs.CL

TL;DR: Syntactic Attention Pruning (SAP) uses syntactic structure and attention patterns to prune Transformer attention heads, achieving comparable performance to SOTA methods while improving interpretability, with Candidate Filtering (CF) enhancing robustness.

Details

Motivation: Current attention head pruning methods rely mainly on mathematical analysis of weights and activations, lacking consideration of linguistic structure. The paper aims to incorporate syntactic information to guide pruning more effectively and improve model interpretability.

Method: SAP combines syntactic structure analysis with attention patterns to identify which heads to prune. Candidate Filtering (CF) mechanism prioritizes heads based on their performance contribution to mitigate degradation during pruning.

Result: SAP outperforms existing head pruning strategies in retrain-free settings, effectively preserves critical heads with high density of strong attention values, and maintains performance comparable to state-of-the-art methods.

Conclusion: SAP offers a promising new direction for model compression research, providing high flexibility for pruning across all transformer-based language models while enhancing both performance and interpretability.

Abstract: This paper introduces Syntactic Attention Pruning (SAP), a novel method for effectively pruning attention heads in Transformer models. Unlike conventional approaches that rely solely on mathematical analysis of model weights and activations, SAP incorporates both the syntactic structure and attention patterns of sentences to guide the pruning process. By leveraging these linguistic features, SAP not only achieves performance comparable to state-of-the-art methods but also enhances the interpretability of model behavior. To further improve robustness, we propose Candidate Filtering (CF), a mechanism that prioritizes heads based on their contribution to model performance, mitigating degradation during pruning. Experimental results indicate that SAP effectively preserves critical heads of a high density of strong attention values, outperforming existing head pruning strategies in retrain-free settings. These findings position SAP as a promising foundation for a new direction in model compression research, offering high flexibility for pruning across all transformer-based language models.

[52] AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards

Zihan Lin, Xiaohan Wang, Hexiong Yang, Jiajun Chai, Jie Cao, Guojun Yin, Wei Lin, Ran He

Main category: cs.CL

TL;DR: AWPO is a new RL framework that integrates explicit reasoning rewards to enhance tool-use LLMs, using adaptive gating and weighting mechanisms to outperform existing models.

Details

Motivation: Existing RL methods for training tool-use LLMs overlook explicit reasoning rewards, and naively combining reasoning with outcome rewards can cause suboptimal performance or conflicts with primary objectives.

Method: Proposes Advantage-Weighted Policy Optimization (AWPO) with variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, plus a tailored clipping mechanism for stable optimization.

Result: AWPO achieves SOTA performance across standard tool-use benchmarks, significantly outperforming strong baselines and closed-source models. A 4B model surpasses Grok-4 by 16.0% in multi-turn accuracy while maintaining generalization on MMLU-Pro.

Conclusion: AWPO effectively integrates explicit reasoning rewards to enhance tool-use capability in LLMs through principled adaptive mechanisms, demonstrating superior performance and parameter efficiency.

Abstract: While reinforcement learning (RL) shows promise in training tool-use large language models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of explicit reasoning rewards to bolster reasoning and tool utilization. Furthermore, natively combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose advantage-weighted policy optimization (AWPO) – a principled RL framework that effectively integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0 percent in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.

[53] QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation

Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng

Main category: cs.CL

TL;DR: QuCo-RAG uses corpus statistics instead of model confidence to trigger retrieval, reducing hallucinations by verifying entity co-occurrence in pre-training data.

Details

Motivation: Existing dynamic RAG methods rely on unreliable model-internal signals (logits, entropy) that are poorly calibrated, leading to hallucinations even when models express high confidence in wrong answers.

Method: Two-stage approach: (1) identify low-frequency entities before generation to detect knowledge gaps; (2) verify entity co-occurrence in pre-training corpus during generation, triggering retrieval when co-occurrence is zero. Uses Infini-gram for fast queries over 4 trillion tokens.

Result: Achieves 5-12 EM point gains over SOTA baselines with OLMo-2 models, transfers effectively to other models (Llama, Qwen, GPT) with up to 14 point EM improvements. Validated on biomedical QA domain generalization.

Conclusion: Corpus-grounded verification provides a principled, model-agnostic paradigm for dynamic RAG that is more reliable than confidence-based approaches, with public code available.

Abstract: Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5–12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama, Qwen, GPT), improving EM by up to 14 points. Domain generalization on biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at https://github.com/ZhishanQ/QuCo-RAG.

[54] From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs

Alessandro Lucca, Francesco Pierri

Main category: cs.CL

TL;DR: Case study evaluating 4 state-of-the-art ASR models for Italian subtitling, finding they can’t fully replace human subtitlers but can enhance productivity through human-in-the-loop workflow.

Details

Motivation: While modern ASR systems perform well on standard benchmarks, their real-world performance for non-English content (specifically Italian long-form videos) in professional media production environments remains unexplored.

Method: Evaluated four SOTA ASR models (Whisper Large v2, AssemblyAI Universal, Parakeet TDT v3 0.6b, and WhisperX) on a 50-hour dataset of Italian television programs, benchmarking against professional human subtitlers.

Result: Current ASR models cannot meet media industry accuracy requirements for full autonomy, but they serve as highly effective tools for enhancing human productivity when used as part of a workflow.

Conclusion: A human-in-the-loop approach is essential for professional subtitling, and the paper presents a production-grade, cloud-based infrastructure designed to support this workflow for Italian media content.

Abstract: Subtitles are essential for video accessibility and audience engagement. Modern Automatic Speech Recognition (ASR) systems, built upon Encoder-Decoder neural network architectures and trained on massive amounts of data, have progressively reduced transcription errors on standard benchmark datasets. However, their performance in real-world production environments, particularly for non-English content like long-form Italian videos, remains largely unexplored. This paper presents a case study on developing a professional subtitling system for an Italian media company. To inform our system design, we evaluated four state-of-the-art ASR models (Whisper Large v2, AssemblyAI Universal, Parakeet TDT v3 0.6b, and WhisperX) on a 50-hour dataset of Italian television programs. The study highlights their strengths and limitations, benchmarking their performance against the work of professional human subtitlers. The findings indicate that, while current models cannot meet the media industry’s accuracy needs for full autonomy, they can serve as highly effective tools for enhancing human productivity. We conclude that a human-in-the-loop (HITL) approach is crucial and present the production-grade, cloud-based infrastructure we designed to support this workflow.

[55] JEPA-Reasoner: Decoupling Latent Reasoning from Token Generation

Bingyang Kelvin Liu, Ziyu Patrick Chen

Main category: cs.CL

TL;DR: JEPA-Reasoner: A JEPA model with generative ability that reasons in latent space, decoupled from token generation to reduce compounding error and enable multi-threaded reasoning.

Details

Motivation: JEPA lacks generative abilities, while Transformer-based latent reasoning models like COCONUT still suffer from token-by-token generation issues including compounding error and dependency on context for reasoning insights.

Method: Propose JEPA-Reasoner that enhances JEPA with generative ability for latent space reasoning, augmented with a separate action-taker model (Talker) to produce human-readable sentences, decoupling reasoning from token generation.

Result: The approach enables production of mixed latent vectors that could support multi-threaded reasoning, while performing autoregressive generation with superior robustness to compounding error.

Conclusion: Decoupling latent space reasoning from token generation addresses limitations of both JEPA and Transformer-based approaches, potentially enabling more robust and flexible reasoning systems.

Abstract: While Joint-Embedding Predictive Architecture (JEPA) has emerged as a powerful architecture for learning rich latent representations, it fundamentally lacks generative abilities. Meanwhile, latent space reasoning attempts for Transformer models like COCONUT do improve performance, but they ultimately rely on token-by-token generation, which still accumulates compounding error and relies on context information to gain reasoning insights. To address these limitations, we propose JEPA-Reasoner, a novel JEPA model enhanced with generative ability that reasons in latent space. We augment it with a separate action-taker model, Talker, to produce human-readable sentences. Our approach demonstrates that decoupling latent space reasoning and token generation enables JEPA-Reasoner to produce mixed latent vectors that might lay the foundation for multi-threaded reasoning, while performing autoregressive generation with superior robustness to compounding error.

[56] CycleChart: A Unified Consistency-Based Learning Framework for Bidirectional Chart Understanding and Generation

Dazhen Deng, Sen Yang, Yuchen He, Yuan Tian, Yingcai Wu

Main category: cs.CL

TL;DR: CycleChart is a consistency-based framework for bidirectional chart understanding and generation that connects chart creation and interpretation tasks through shared semantic learning.

Details

Motivation: Current chart-specific tasks are studied in isolation, preventing models from learning shared semantics that link chart generation and interpretation. There's a need for a unified approach that connects these bidirectional processes.

Method: CycleChart uses a schema-centric formulation as common interface across tasks. It constructs consistent multi-task dataset with aligned annotations for schema prediction, data parsing, and question answering. The core innovation is a generate-parse consistency objective where the model generates a chart schema from table+query, then learns to recover schema and data from the generated chart.

Result: CycleChart achieves strong results on chart generation, chart parsing, and chart question answering, demonstrating improved cross-task generalization.

Conclusion: CycleChart represents a step toward more general chart understanding models by enabling bidirectional learning of chart semantics through consistency-based training across generation and interpretation tasks.

Abstract: Current chart-specific tasks, such as chart question answering, chart parsing, and chart generation, are typically studied in isolation, preventing models from learning the shared semantics that link chart generation and interpretation. We introduce CycleChart, a consistency-based learning framework for bidirectional chart understanding and generation. CycleChart adopts a schema-centric formulation as a common interface across tasks. We construct a consistent multi-task dataset, where each chart sample includes aligned annotations for schema prediction, data parsing, and question answering. To learn cross-directional chart semantics, CycleChart introduces a generate-parse consistency objective: the model generates a chart schema from a table and a textual query, then learns to recover the schema and data from the generated chart, enforcing semantic alignment across directions. CycleChart achieves strong results on chart generation, chart parsing, and chart question answering, demonstrating improved cross-task generalization and marking a step toward more general chart understanding models.

[57] Identifying Features Associated with Bias Against 93 Stigmatized Groups in Language Models and Guardrail Model Safety Mitigation

Anna-Maria Gueorguieva, Aylin Caliskan

Main category: cs.CL

TL;DR: LLMs show significant bias against stigmatized groups, especially those rated as highly perilous. Guardrail models reduce bias but fail to address underlying feature-based bias patterns.

Details

Motivation: While LLM social bias towards protected groups is studied, bias against non-protected stigmatized identities and the relationship between stigma features and bias remains understudied. Understanding how social stigma features affect LLM outputs is crucial for responsible AI deployment.

Method: Used SocialStigmaQA benchmark with 37 social scenarios about 93 stigmatized groups. Tested three LLMs (Granite 3.0-8B, Llama-3.1-8B, Mistral-7B) and measured bias based on six stigma features from psychology literature. Evaluated guardrail models (Granite Guardian 3.0, Llama Guard 3.0, Mistral Moderation API) for bias mitigation.

Result: Stigmas rated as highly perilous (e.g., gang member, HIV) had 60% biased outputs, while sociodemographic stigmas had only 11% bias. Guardrail models reduced bias by 10.4%, 1.4%, and 7.8% respectively, but failed to change underlying feature-based bias patterns and often missed bias intent in prompts.

Conclusion: LLMs exhibit significant bias against stigmatized groups, particularly those perceived as perilous. Current guardrail models provide limited mitigation and fail to address root causes. Improved bias detection and mitigation strategies are needed for responsible LLM use with stigmatized populations.

Abstract: Large language models (LLMs) have been shown to exhibit social bias, however, bias towards non-protected stigmatized identities remain understudied. Furthermore, what social features of stigmas are associated with bias in LLM outputs is unknown. From psychology literature, it has been shown that stigmas contain six shared social features: aesthetics, concealability, course, disruptiveness, origin, and peril. In this study, we investigate if human and LLM ratings of the features of stigmas, along with prompt style and type of stigma, have effect on bias towards stigmatized groups in LLM outputs. We measure bias against 93 stigmatized groups across three widely used LLMs (Granite 3.0-8B, Llama-3.1-8B, Mistral-7B) using SocialStigmaQA, a benchmark that includes 37 social scenarios about stigmatized identities; for example deciding wether to recommend them for an internship. We find that stigmas rated by humans to be highly perilous (e.g., being a gang member or having HIV) have the most biased outputs from SocialStigmaQA prompts (60% of outputs from all models) while sociodemographic stigmas (e.g. Asian-American or old age) have the least amount of biased outputs (11%). We test if the amount of biased outputs could be decreased by using guardrail models, models meant to identify harmful input, using each LLM’s respective guardrail model (Granite Guardian 3.0, Llama Guard 3.0, Mistral Moderation API). We find that bias decreases significantly by 10.4%, 1.4%, and 7.8%, respectively. However, we show that features with significant effect on bias remain unchanged post-mitigation and that guardrail models often fail to recognize the intent of bias in prompts. This work has implications for using LLMs in scenarios involving stigmatized groups and we suggest future work towards improving guardrail models for bias mitigation.

[58] ChemATP: A Training-Free Chemical Reasoning Framework for Large Language Models

Mingxu Zhang, Dazhong Shen, Qi Zhang, Ying Sun

Main category: cs.CL

TL;DR: ChemATP is a framework that decouples chemical knowledge from LLMs by creating an atom-level textual knowledge base, allowing frozen LLMs to retrieve and reason over chemical priors dynamically without compromising general reasoning capabilities.

Details

Motivation: LLMs struggle with molecular science due to lack of explicit chemical priors in standard string representations. Current approaches face a dilemma: training-based methods inject priors into parameters but hinder rapid updates and compromise general reasoning, while training-free methods rely on surface-level prompting and lack fine-grained atom-level priors needed for precise chemical reasoning.

Method: ChemATP introduces a framework that decouples chemical knowledge from the reasoning engine by constructing the first atom-level textual knowledge base. This enables frozen LLMs to explicitly retrieve and reason over this information dynamically, maintaining interpretability and adaptability while preserving the LLM’s intrinsic general intelligence.

Result: Experiments show ChemATP significantly outperforms training-free baselines and rivals state-of-the-art training-based models, demonstrating that explicit prior injection is a competitive alternative to implicit parameter updates.

Conclusion: The ChemATP framework successfully addresses the chemical reasoning dilemma by providing a decoupled architecture that enables explicit retrieval and reasoning over atom-level chemical knowledge, offering interpretability and adaptability while preserving LLMs’ general reasoning capabilities.

Abstract: Large Language Models (LLMs) exhibit strong general reasoning but struggle in molecular science due to the lack of explicit chemical priors in standard string representations. Current solutions face a fundamental dilemma. Training-based methods inject priors into parameters, but this static coupling hinders rapid knowledge updates and often compromises the model’s general reasoning capabilities. Conversely, existing training-free methods avoid these issues but rely on surface-level prompting, failing to provide the fine-grained atom-level priors essential for precise chemical reasoning. To address this issue, we introduce ChemATP, a framework that decouples chemical knowledge from the reasoning engine. By constructing the first atom-level textual knowledge base, ChemATP enables frozen LLMs to explicitly retrieve and reason over this information dynamically. This architecture ensures interpretability and adaptability while preserving the LLM’s intrinsic general intelligence. Experiments show that ChemATP significantly outperforms training-free baselines and rivals state-of-the-art training-based models, demonstrating that explicit prior injection is a competitive alternative to implicit parameter updates.

[59] Auto-Prompting with Retrieval Guidance for Frame Detection in Logistics

Do Minh Duc, Quan Xuan Truong, Nguyen Tat Dat, Nguyen Van Vinh

Main category: cs.CL

TL;DR: Proposes a novel prompt optimization pipeline combining RAG, few-shot prompting, CoT reasoning, and Auto-CoT for frame detection in logistics texts, achieving up to 15% accuracy improvement over baselines.

Details

Motivation: Prompt engineering is critical for adapting LLMs to complex reasoning and labeling tasks without extensive fine-tuning, especially for domain-specific applications like logistics text annotation where reasoning accuracy and labeling efficiency are critical.

Method: A prompt optimization pipeline combining retrieval-augmented generation (RAG), few-shot prompting, chain-of-thought (CoT) reasoning, and automatic CoT synthesis (Auto-CoT). Uses an LLM-based prompt optimizer agent that iteratively refines prompts using retrieved examples, performance feedback, and internal self-evaluation.

Result: Optimized prompts (especially those enhanced via Auto-CoT and RAG) improve real-world inference accuracy by up to 15% compared to baseline zero-shot or static prompts. The system shows consistent improvements across multiple LLMs (GPT-4o, Qwen 2.5 72B, LLaMA 3.1 70B), validating generalizability and practical value.

Conclusion: Structured prompt optimization is a viable alternative to full fine-tuning, offering scalable solutions for deploying LLMs in domain-specific NLP applications such as logistics.

Abstract: Prompt engineering plays a critical role in adapting large language models (LLMs) to complex reasoning and labeling tasks without the need for extensive fine-tuning. In this paper, we propose a novel prompt optimization pipeline for frame detection in logistics texts, combining retrieval-augmented generation (RAG), few-shot prompting, chain-of-thought (CoT) reasoning, and automatic CoT synthesis (Auto-CoT) to generate highly effective task-specific prompts. Central to our approach is an LLM-based prompt optimizer agent that iteratively refines the prompts using retrieved examples, performance feedback, and internal self-evaluation. Our framework is evaluated on a real-world logistics text annotation task, where reasoning accuracy and labeling efficiency are critical. Experimental results show that the optimized prompts - particularly those enhanced via Auto-CoT and RAG - improve real-world inference accuracy by up to 15% compared to baseline zero-shot or static prompts. The system demonstrates consistent improvements across multiple LLMs, including GPT-4o, Qwen 2.5 (72B), and LLaMA 3.1 (70B), validating its generalizability and practical value. These findings suggest that structured prompt optimization is a viable alternative to full fine-tuning, offering scalable solutions for deploying LLMs in domain-specific NLP applications such as logistics.

[60] CienaLLM: Generative Climate-Impact Extraction from News Articles with Autoregressive LLMs

Javier Vela-Tambo, Jorge Gracia, Fernando Dominguez-Castro

Main category: cs.CL

TL;DR: CienaLLM is a modular framework using schema-guided generative information extraction with LLMs for zero-shot extraction of climate hazard impacts from news articles, showing competitive performance with supervised baselines while being adaptable via prompts and schemas.

Details

Motivation: Need to extract structured information from heterogeneous news articles at scale to understand and monitor socio-economic impacts of climate hazards, requiring a flexible, adaptable approach without extensive retraining.

Method: CienaLLM framework based on schema-guided Generative Information Extraction using open-weight LLMs for zero-shot extraction, with configurable prompts/output schemas, multi-step pipelines, and systematic evaluation of LLM choices (family, size, precision) and prompting strategies.

Result: Response parsing eliminates format errors; larger models deliver strongest performance; quantization offers efficiency gains with modest accuracy trade-offs; CienaLLM matches/surpasses supervised baseline for drought impact extraction from Spanish news, though with higher inference cost.

Conclusion: CienaLLM provides effective schema-driven, model-agnostic approach for climate impact extraction that can adapt to related tasks (other hazards, sectors, languages) by editing prompts/schemas rather than retraining, with released code/configurations for reproducibility.

Abstract: Understanding and monitoring the socio-economic impacts of climate hazards requires extracting structured information from heterogeneous news articles on a large scale. To that end, we have developed CienaLLM, a modular framework based on schema-guided Generative Information Extraction. CienaLLM uses open-weight Large Language Models for zero-shot information extraction from news articles, and supports configurable prompts and output schemas, multi-step pipelines, and cloud or on-premise inference. To systematically assess how the choice of LLM family, size, precision regime, and prompting strategy affect performance, we run a large factorial study in models, precisions, and prompt engineering techniques. An additional response parsing step nearly eliminates format errors while preserving accuracy; larger models deliver the strongest and most stable performance, while quantization offers substantial efficiency gains with modest accuracy trade-offs; and prompt strategies show heterogeneous, model-specific effects. CienaLLM matches or outperforms the supervised baseline in accuracy for extracting drought impacts from Spanish news, although at a higher inference cost. While evaluated in droughts, the schema-driven and model-agnostic design is suitable for adapting to related information extraction tasks (e.g., other hazards, sectors, or languages) by editing prompts and schemas rather than retraining. We release code, configurations, and schemas to support reproducible use.

[61] HATS: High-Accuracy Triple-Set Watermarking for Large Language Models

Zhiqing Hu, Chenxu Zhao, Jiazhong Lu, Xiaolei Liu

Main category: cs.CL

TL;DR: Proposes a triple-partition watermarking scheme for LLMs that divides vocabulary into Green/Yellow/Red sets during generation and uses statistical detection to identify watermarked text.

Details

Motivation: To curb misuse of LLM-generated text by developing watermarking techniques that embed implicit signals into output while maintaining text quality.

Method: Partitions vocabulary at each decoding step into three sets (Green/Yellow/Red) with fixed ratios, restricts sampling to Green/Yellow sets during generation, and uses statistical detection with Green-enrichment/Red-depletion metrics aggregated via Fisher’s method.

Result: Achieves high detection accuracy at fixed false-positive rates while preserving text readability, as evaluated on Llama 2 7B model.

Conclusion: The triple-partition watermarking scheme effectively identifies LLM-generated text with minimal impact on text quality, providing a practical solution for curbing misuse of generated content.

Abstract: Misuse of LLM-generated text can be curbed by watermarking techniques that embed implicit signals into the output. We propose a watermark that partitions the vocabulary at each decoding step into three sets (Green/Yellow/Red) with fixed ratios and restricts sampling to the Green and Yellow sets. At detection time, we replay the same partitions, compute Green-enrichment and Red-depletion statistics, convert them to one-sided z-scores, and aggregate their p-values via Fisher’s method to decide whether a passage is watermarked. We implement generation, detection, and testing on Llama 2 7B, and evaluate true-positive rate, false-positive rate, and text quality. Results show that the triple-partition scheme achieves high detection accuracy at fixed FPR while preserving readability.

[62] Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara

Yacouba Diarra, Panga Azazia Kamate, Nouhoum Souleymane Coulibaly, Michael Leventhal

Main category: cs.CL

TL;DR: Kunkado is a 160-hour Bambara ASR dataset from Malian radio archives capturing real-world speech features like code-switching and disfluencies. Finetuning models on this data reduces WER significantly compared to cleaner datasets.

Details

Motivation: To create a robust ASR dataset for Bambara that captures real-world spontaneous speech characteristics (code-switching, disfluencies, background noise, overlapping speakers) that practical ASR systems encounter, supporting predominantly oral languages.

Method: Compiled 160-hour dataset from Malian radio archives, finetuned Parakeet-based models on 33.47-hour human-reviewed subset, applied pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations.

Result: Finetuning with Kunkado reduced WER from 44.47% to 37.12% on one test set and from 36.07% to 32.33% on another. Human evaluation showed the model outperformed a comparable system trained on 98 hours of cleaner, less realistic speech.

Conclusion: Kunkado dataset effectively improves ASR performance for Bambara by capturing real-world speech characteristics, demonstrating the importance of realistic training data for robust ASR systems in predominantly oral languages.

Abstract: We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We finetuned Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, finetuning with Kunkado reduces WER from 44.47% to 37.12% on one and from 36.07% to 32.33% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.

[63] CodeSimpleQA: Scaling Factuality in Code Large Language Models

Jian Yang, Wei Zhang, Yizhi Li, Shawn Guo, Haowen Wang, Aishan Liu, Ge Zhang, Zili Wang, Zhoujun Li, Xianglong Liu, Weifeng Lv

Main category: cs.CL

TL;DR: CodeSimpleQA benchmark evaluates factual accuracy of code LLMs, reveals they struggle with programming knowledge facts, and proposes a training framework that improves factuality.

Details

Motivation: Current code LLM benchmarks focus on execution correctness but overlook factual accuracy of programming concepts and technical knowledge, creating a gap in evaluating reliable code generation.

Method: Created CodeSimpleQA bilingual benchmark with English/Chinese QA pairs covering diverse programming languages and CS domains. Developed CodeSimpleQA-Instruct corpus (66M samples) and post-training framework combining supervised fine-tuning and reinforcement learning.

Result: Even frontier LLMs struggle with code factuality. The proposed framework shows substantial improvements over base models in factual accuracy.

Conclusion: Factuality-aware alignment is critical for developing reliable code LLMs, and CodeSimpleQA provides an essential benchmark for evaluating programming knowledge accuracy.

Abstract: Large language models (LLMs) have made significant strides in code generation, achieving impressive capabilities in synthesizing code snippets from natural language instructions. However, a critical challenge remains in ensuring LLMs generate factually accurate responses about programming concepts, technical implementations, etc. Most previous code-related benchmarks focus on code execution correctness, overlooking the factual accuracy of programming knowledge. To address this gap, we present CodeSimpleQA, a comprehensive bilingual benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions, which contains carefully curated question-answer pairs in both English and Chinese, covering diverse programming languages and major computer science domains. Further, we create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning. Our comprehensive evaluation of diverse LLMs reveals that even frontier LLMs struggle with code factuality. Our proposed framework demonstrates substantial improvements over the base model, underscoring the critical importance of factuality-aware alignment in developing reliable code LLMs.

[64] MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments

Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, Yue Wang

Main category: cs.CL

TL;DR: MobileWorld is a new, more challenging benchmark for mobile AI agents that better reflects real-world usage with longer tasks, cross-app interactions, and novel task categories like agent-user interaction and MCP-augmented tasks.

Details

Motivation: AndroidWorld benchmark has become saturated with agents achieving over 90% success rates, lacks key application categories (e-commerce, enterprise communication), and doesn't reflect realistic mobile usage with vague instructions and hybrid tool usage.

Method: Created MobileWorld with 201 tasks across 20 apps, featuring longer tasks (27.8 vs 14.3 steps avg), more multi-app tasks (62.2% vs 9.5%), and novel task categories. Provides snapshot-based container environment with precise functional verification via backend database inspection and task callback APIs.

Result: Significant performance drop compared to AndroidWorld: best agentic framework achieved 51.7% success rate, best end-to-end model achieved 20.9%. Current models struggle with user interaction and MCP calls.

Conclusion: MobileWorld provides a more realistic and challenging benchmark for mobile AI agents, revealing current limitations and offering a roadmap for developing more robust next-generation mobile intelligence systems.

Abstract: Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. To bridge this gap, we introduce MobileWorld, a substantially more challenging benchmark designed to better reflect real-world mobile usage, comprising 201 tasks across 20 applications, while maintaining the same level of reproducible evaluation as AndroidWorld. The difficulty of MobileWorld is twofold. First, it emphasizes long-horizon tasks with cross-application interactions: MobileWorld requires nearly twice as many task-completion steps on average (27.8 vs. 14.3) and includes far more multi-application tasks (62.2% vs. 9.5%) compared to AndroidWorld. Second, MobileWorld extends beyond standard GUI manipulation by introducing novel task categories, including agent-user interaction and MCP-augmented tasks. To ensure robust evaluation, we provide snapshot-based container environment and precise functional verifications, including backend database inspection and task callback APIs. We further develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively. Our analysis shows that current models struggle significantly with user interaction and MCP calls, offering a strategic roadmap toward more robust, next-generation mobile intelligence.

[65] SiamGPT: Quality-First Fine-Tuning for Stable Thai Text Generation

Thittipat Pairatsuppawat, Abhibhu Tachaapornchai, Paweekorn Kusolsomboon, Chutikan Chaiwong, Thodsaporn Chay-intr, Kobkrit Viriyayudhakorn, Nongnuch Ketui, Aslan B. Wong

Main category: cs.CL

TL;DR: SiamGPT-32B is a Thai language model based on Qwen3-32B, fine-tuned with a Quality-First strategy using translated English instruction data and Thai-adapted AutoIF framework, achieving best performance among similar-scale open-weights Thai models.

Details

Motivation: Open-weights LLMs perform poorly in Thai with unstable generation under complex instructions despite strong English performance, creating deployment challenges for Thai language applications.

Method: Quality-First strategy with supervised fine-tuning only (no continual pretraining). Uses translated high-complexity English instruction data combined with Thai-adapted AutoIF framework for instruction and linguistic constraints.

Result: SiamGPT-32B achieves strongest overall performance among similar-scale open-weights Thai models on SEA-HELM benchmark, with consistent gains in instruction following, multi-turn dialogue, and natural language understanding.

Conclusion: Quality-focused fine-tuning with curated supervision and linguistic adaptation can effectively improve Thai LLM performance without requiring extensive pretraining or data scaling, addressing deployment challenges for Thai language applications.

Abstract: Open-weights large language models remain difficult to deploy for Thai due to unstable generation under complex instructions, despite strong English performance. To mitigate these limitations, We present SiamGPT-32B, an open-weights model based on Qwen3-32B, fine-tuned with a Quality-First strategy emphasizing curated supervision over data scale. The fine-tuning pipeline combines translated high-complexity English instruction data with a Thai-adapted AutoIF framework for instruction and linguistic constraints. Using supervised fine-tuning only, without continual pretraining or corpus expansion, SiamGPT-32B improves instruction adherence, multi-turn robustness, and linguistic stability. Evaluations on the SEA-HELM benchmark show that SiamGPT-32B achieves the strongest overall performance among similar-scale open-weights Thai models, with consistent gains in instruction following, multi-turn dialogue, and natural language understanding.

[66] Activations as Features: Probing LLMs for Generalizable Essay Scoring Representations

Jinwei Chi, Ke Wang, Yu Chen, Xuanye Lin, Qiang Xu

Main category: cs.CL

TL;DR: LLM activations have strong discriminative power for cross-prompt essay scoring, and LLMs can adapt evaluation perspectives to different essay traits and types.

Details

Motivation: Automated essay scoring is challenging in cross-prompt settings due to diverse scoring criteria. Previous work focused on LLM outputs, but intermediate layer activations may contain valuable information for scoring.

Method: Evaluated discriminative power of LLM activations using probes, analyzed effects of different models and input content, computed essay directions across trait dimensions under different prompts to study evaluation perspective variation.

Result: Activations show strong discriminative power for essay quality evaluation. LLMs can adapt their evaluation perspectives to different traits and essay types, effectively handling scoring criteria diversity in cross-prompt settings.

Conclusion: LLM activations provide valuable information for cross-prompt automated essay scoring, demonstrating that models can flexibly adjust evaluation perspectives based on essay characteristics and scoring criteria.

Abstract: Automated essay scoring (AES) is a challenging task in cross-prompt settings due to the diversity of scoring criteria. While previous studies have focused on the output of large language models (LLMs) to improve scoring accuracy, we believe activations from intermediate layers may also provide valuable information. To explore this possibility, we evaluated the discriminative power of LLMs’ activations in cross-prompt essay scoring task. Specifically, we used activations to fit probes and further analyzed the effects of different models and input content of LLMs on this discriminative power. By computing the directions of essays across various trait dimensions under different prompts, we analyzed the variation in evaluation perspectives of large language models concerning essay types and traits. Results show that the activations possess strong discriminative power in evaluating essay quality and that LLMs can adapt their evaluation perspectives to different traits and essay types, effectively handling the diversity of scoring criteria in cross-prompt settings.

[67] A Large-Language-Model Framework for Automated Humanitarian Situation Reporting

Ivan Decostanzi, Yelena Mejova, Kyriaki Kalimeri

Main category: cs.CL

TL;DR: Automated framework using LLMs to transform heterogeneous humanitarian documents into structured, evidence-grounded reports with citations and multi-level summarization.

Details

Motivation: Current humanitarian situational reporting workflows are manual, resource-intensive, and inconsistent, creating need for automated solutions to improve decision-making.

Method: LLM-based framework with semantic text clustering, automatic question generation, retrieval-augmented answer extraction with citations, multi-level summarization, and executive summary generation.

Result: Evaluated on 13 humanitarian events using 1,100+ documents: questions achieved 84.7% relevance, 84.0% importance, 76.4% urgency; answers reached 86.3% relevance with citation precision/recall >76%; human-LLM evaluation agreement F1 >0.80.

Conclusion: Generative AI can autonomously produce accurate, verifiable, and operationally useful humanitarian situation reports by combining LLM reasoning with transparent citation linking and multi-level evaluation.

Abstract: Timely and accurate situational reports are essential for humanitarian decision-making, yet current workflows remain largely manual, resource intensive, and inconsistent. We present a fully automated framework that uses large language models (LLMs) to transform heterogeneous humanitarian documents into structured and evidence-grounded reports. The system integrates semantic text clustering, automatic question generation, retrieval augmented answer extraction with citations, multi-level summarization, and executive summary generation, supported by internal evaluation metrics that emulate expert reasoning. We evaluated the framework across 13 humanitarian events, including natural disasters and conflicts, using more than 1,100 documents from verified sources such as ReliefWeb. The generated questions achieved 84.7 percent relevance, 84.0 percent importance, and 76.4 percent urgency. The extracted answers reached 86.3 percent relevance, with citation precision and recall both exceeding 76 percent. Agreement between human and LLM based evaluations surpassed an F1 score of 0.80. Comparative analysis shows that the proposed framework produces reports that are more structured, interpretable, and actionable than existing baselines. By combining LLM reasoning with transparent citation linking and multi-level evaluation, this study demonstrates that generative AI can autonomously produce accurate, verifiable, and operationally useful humanitarian situation reports.

[68] Event Extraction in Large Language Model

Bobo Li, Xudong Han, Jiang Liu, Yuzhe Ding, Liqiang Jing, Zhaoqi Zhang, Jinheng Li, Xinya Du, Fei Li, Meishan Zhang, Min Zhang, Aixin Sun, Philip S. Yu, Hao Fei

Main category: cs.CL

TL;DR: This survey paper examines how event extraction is evolving in the LLM era, proposing it should serve as a cognitive scaffold for LLM systems rather than just an extraction task, addressing deployment gaps like hallucinations and context limitations.

Details

Motivation: While LLMs have transformed event extraction through prompting and generation, they face deployment gaps including hallucinations under weak constraints, fragile temporal/causal linking across long contexts, and limited knowledge management within bounded context windows. The paper argues for rethinking EE's role in LLM systems.

Method: The paper is a comprehensive survey covering EE in text and multimodal settings, organizing tasks and taxonomy, tracing method evolution from rule-based and neural models to instruction-driven and generative frameworks. It analyzes formulations, decoding strategies, architectures, representations, datasets, and evaluation methods.

Result: The survey provides a systematic organization of EE tasks and methods, highlighting how event schemas and constraints can ground LLMs, event structures can serve as intermediate representations for reasoning, event links can enable graph-based RAG, and event stores can provide updatable memory beyond context windows.

Conclusion: Event extraction should evolve from static extraction into a structurally reliable, agent-ready perception and memory layer for open-world systems. The paper outlines future directions for reliable event-centric systems in the LLM era, emphasizing EE’s role as a cognitive scaffold for LLM-centered solutions.

Abstract: Large language models (LLMs) and multimodal LLMs are changing event extraction (EE): prompting and generation can often produce structured outputs in zero shot or few shot settings. Yet LLM based pipelines face deployment gaps, including hallucinations under weak constraints, fragile temporal and causal linking over long contexts and across documents, and limited long horizon knowledge management within a bounded context window. We argue that EE should be viewed as a system component that provides a cognitive scaffold for LLM centered solutions. Event schemas and slot constraints create interfaces for grounding and verification; event centric structures act as controlled intermediate representations for stepwise reasoning; event links support relation aware retrieval with graph based RAG; and event stores offer updatable episodic and agent memory beyond the context window. This survey covers EE in text and multimodal settings, organizing tasks and taxonomy, tracing method evolution from rule based and neural models to instruction driven and generative frameworks, and summarizing formulations, decoding strategies, architectures, representations, datasets, and evaluation. We also review cross lingual, low resource, and domain specific settings, and highlight open challenges and future directions for reliable event centric systems. Finally, we outline open challenges and future directions that are central to the LLM era, aiming to evolve EE from static extraction into a structurally reliable, agent ready perception and memory layer for open world systems.

[69] Algerian Dialect

Zakaria Benmounah, Abdennour Boulesnane

Main category: cs.CL

TL;DR: A large-scale sentiment-annotated dataset of 45,000 Algerian Arabic dialect YouTube comments with five sentiment categories and rich metadata.

Details

Motivation: Addresses the scarcity of publicly available resources for Algerian dialect to support research in sentiment analysis, dialectal Arabic NLP, and social media analytics.

Method: Collected 45,000 comments from over 30 Algerian press/media YouTube channels using YouTube Data API, with manual annotation into five sentiment categories and inclusion of rich metadata.

Result: Created Algerian Dialect dataset with 45,000 sentiment-annotated comments, publicly available on Mendeley Data under CC BY 4.0 license.

Conclusion: The dataset fills a resource gap for Algerian dialect NLP research and enables sentiment analysis, dialectal Arabic studies, and social media analytics applications.

Abstract: We present Algerian Dialect, a large-scale sentiment-annotated dataset consisting of 45,000 YouTube comments written in Algerian Arabic dialect. The comments were collected from more than 30 Algerian press and media channels using the YouTube Data API. Each comment is manually annotated into one of five sentiment categories: very negative, negative, neutral, positive, and very positive. In addition to sentiment labels, the dataset includes rich metadata such as collection timestamps, like counts, video URLs, and annotation dates. This dataset addresses the scarcity of publicly available resources for Algerian dialect and aims to support research in sentiment analysis, dialectal Arabic NLP, and social media analytics. The dataset is publicly available on Mendeley Data under a CC BY 4.0 license at https://doi.org/10.17632/zzwg3nnhsz.2.

[70] Increasing the Thinking Budget is Not All You Need

Ignacio Iacobacci, Zhaozhi Qian, Faroq AL-Tam, Muhammad AL-Qurishi, Riad Souissi

Main category: cs.CL

TL;DR: Increasing thinking budget alone isn’t optimal; self-consistency and self-reflection provide better accuracy than simply extending reasoning length.

Details

Motivation: To systematically investigate how thinking budget (compute allocated to reasoning process) interacts with different configurations like self-consistency and reflection, and provide a balanced comparison framework considering both performance and computational cost.

Method: Proposed systematic investigation of thinking budget as a key parameter, examining its interaction with various configurations including self-consistency, reflection, and other approaches.

Result: Found that simply increasing thinking budget is not the most effective use of compute; more accurate responses can be achieved through alternative configurations like self-consistency and self-reflection.

Conclusion: Thinking budget should be considered alongside other optimization strategies like self-consistency and reflection rather than simply increasing reasoning length, providing a more balanced approach to improving LLM reasoning performance.

Abstract: Recently, a new wave of thinking-capable Large Language Models has emerged, demonstrating exceptional capabilities across a wide range of reasoning benchmarks. Early studies have begun to explore how the amount of compute in terms of the length of the reasoning process, the so-called thinking budget, impacts model performance. In this work, we propose a systematic investigation of the thinking budget as a key parameter, examining its interaction with various configurations such as self-consistency, reflection, and others. Our goal is to provide an informative, balanced comparison framework that considers both performance outcomes and computational cost. Among our findings, we discovered that simply increasing the thinking budget is not the most effective use of compute. More accurate responses can instead be achieved through alternative configurations, such as self-consistency and self-reflection.

[71] Exploring the features used for summary evaluation by Human and GPT

Zahra Sadeghi, Evangelos Milios, Frank Rudzicz

Main category: cs.CL

TL;DR: LLMs can evaluate summaries but it’s unclear what features they use. This paper identifies features aligned with human and GPT responses using metrics, and shows that instructing GPTs to use human-like metrics improves their judgment alignment.

Details

Motivation: While LLMs are used to automate summary assessment by evaluating how well summaries reflect source texts, it's not well understood what features/properties they exploit for evaluation, and there's limited attention to mapping between evaluation scores and metrics.

Method: Study statistical and machine learning metrics to discover features aligned with both human and GPT responses. Instruct GPTs to employ metrics used by humans to improve their judgment.

Result: Identified features that align with human and GPT responses. Showed that instructing GPTs to use human-like metrics improves their judgment and better conforms them with human responses.

Conclusion: Understanding what features LLMs exploit for summary evaluation and mapping evaluation scores to metrics can improve LLM-based assessment. Instructing LLMs to use human-like metrics enhances their alignment with human judgment.

Abstract: Summary assessment involves evaluating how well a generated summary reflects the key ideas and meaning of the source text, requiring a deep understanding of the content. Large Language Models (LLMs) have been used to automate this process, acting as judges to evaluate summaries with respect to the original text. While previous research investigated the alignment between LLMs and Human responses, it is not yet well understood what properties or features are exploited by them when asked to evaluate based on a particular quality dimension, and there has not been much attention towards mapping between evaluation scores and metrics. In this paper, we address this issue and discover features aligned with Human and Generative Pre-trained Transformers (GPTs) responses by studying statistical and machine learning metrics. Furthermore, we show that instructing GPTs to employ metrics used by Human can improve their judgment and conforming them better with human responses.

[72] Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori

Rolando Coto-Solano, Daisy Li, Manoela Teleginski Ferraz, Olivia Sasse, Cha Krupka, Sharid Loáiciga, Sally Akevai Tenamu Nicholas

Main category: cs.CL

TL;DR: Fine-tuned character-level LLMs outperform massively multilingual models for diacritic restoration in under-resourced languages like Bribri and Cook Islands Māori, with reliable performance emerging at ~10,000 words.

Details

Motivation: Address diacritic restoration needs for under-resourced languages (Bribri and Cook Islands Māori) to support NLP tasks, responding to language community requests and broader research questions about model performance in low-resource contexts.

Method: Experimental comparison of algorithms for diacritic restoration including tonal diacritics, examining data requirements, contrasting resource conditions, and exploring diacritic correction. Focus on character-level LLMs vs. massively multilingual models.

Result: Fine-tuned character-level LLMs perform best due to UTF-8 byte decomposition capabilities. Massively multilingual models perform poorly with data constraints. Reliable performance emerges at ~10,000 words. Zero-shot approaches perform poorly.

Conclusion: Character-level LLMs are effective for diacritic restoration in under-resourced languages, with practical data requirements around 10,000 words. The study provides guidance for similar low-resource language processing tasks.

Abstract: We present experiments on diacritic restoration, a form of text normalization essential for natural language processing (NLP) tasks. Our study focuses on two extremely under-resourced languages: Bribri, a Chibchan language spoken in Costa Rica, and Cook Islands Māori, a Polynesian language spoken in the Cook Islands. Specifically, this paper: (i) compares algorithms for diacritics restoration in under-resourced languages, including tonal diacritics, (ii) examines the amount of data required to achieve target performance levels, (iii) contrasts results across varying resource conditions, and (iv) explores the related task of diacritic correction. We find that fine-tuned, character-level LLMs perform best, likely due to their ability to decompose complex characters into their UTF-8 byte representations. In contrast, massively multilingual models perform less effectively given our data constraints. Across all models, reliable performance begins to emerge with data budgets of around 10,000 words. Zero-shot approaches perform poorly in all cases. This study responds both to requests from the language communities and to broader NLP research questions concerning model performance and generalization in under-resourced contexts.

[73] Exploring Zero-Shot ACSA with Unified Meaning Representation in Chain-of-Thought Prompting

Filippos Ventirozos, Peter Appleby, Matthew Shardlow

Main category: cs.CL

TL;DR: Proposes UMR-based Chain-of-Thought prompting for zero-shot Aspect-Category Sentiment Analysis, showing model-dependent effectiveness with comparable performance for mid-sized models.

Details

Motivation: Supervised learning for ACSA faces data scarcity and annotation costs; zero-shot LLM approaches offer practical alternative for resource-limited domains.

Method: Novel Chain-of-Thought prompting using intermediate Unified Meaning Representation (UMR) to structure reasoning for ACSA task, evaluated against standard CoT baseline across multiple models and datasets.

Result: UMR effectiveness appears model-dependent; preliminary results show comparable performance for mid-sized models like Qwen3-8B, but findings need further investigation for smaller architectures.

Conclusion: UMR-based CoT shows promise for zero-shot ACSA but requires more research to establish generalizability across different model scales and architectures.

Abstract: Aspect-Category Sentiment Analysis (ACSA) provides granular insights by identifying specific themes within reviews and their associated sentiment. While supervised learning approaches dominate this field, the scarcity and high cost of annotated data for new domains present significant barriers. We argue that leveraging large language models (LLMs) in a zero-shot setting is a practical alternative where resources for data annotation are limited. In this work, we propose a novel Chain-of-Thought (CoT) prompting technique that utilises an intermediate Unified Meaning Representation (UMR) to structure the reasoning process for the ACSA task. We evaluate this UMR-based approach against a standard CoT baseline across three models (Qwen3-4B, Qwen3-8B, and Gemini-2.5-Pro) and four diverse datasets. Our findings suggest that UMR effectiveness may be model-dependent. Whilst preliminary results indicate comparable performance for mid-sized models such as Qwen3-8B, these observations warrant further investigation, particularly regarding the potential applicability to smaller model architectures. Further research is required to establish the generalisability of these findings across different model scales.

[74] GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators

Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang, Xinzhe Juan, Jiahao Qiu, Ke Shen, Mengdi Wang

Main category: cs.CL

TL;DR: GenEnv is a framework that creates a co-evolutionary game between an LLM agent and a generative environment simulator, where the simulator dynamically generates tasks matched to the agent’s current skill level, enabling more efficient agent training with less data.

Details

Motivation: Training capable LLM agents is bottlenecked by the high cost and static nature of real-world interaction data. Current methods rely on static datasets that don't adapt to agent learning progress.

Method: GenEnv establishes a difficulty-aligned co-evolutionary game between agent and generative environment simulator. The simulator acts as a dynamic curriculum policy generating tasks tailored to the agent’s “zone of proximal development” using α-Curriculum Reward to align task difficulty with agent capabilities.

Result: Improves agent performance by up to +40.3% over 7B baselines, matches/exceeds larger models’ average performance. Achieves better performance than Gemini 2.5 Pro-based offline data augmentation while using 3.3× less data across five benchmarks (API-Bank, ALFWorld, BFCL, Bamboogle, TravelPlanner).

Conclusion: By shifting from static supervision to adaptive simulation, GenEnv provides a data-efficient pathway for scaling agent capabilities through dynamic, difficulty-aligned task generation.

Abstract: Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty-aligned co-evolutionary game between an agent and a scalable, generative environment simulator. Unlike traditional methods that evolve models on static datasets, GenEnv instantiates a dataevolving: the simulator acts as a dynamic curriculum policy, continuously generating tasks specifically tailored to the agent’s ``zone of proximal development’’. This process is guided by a simple but effective $α$-Curriculum Reward, which aligns task difficulty with the agent’s current capabilities. We evaluate GenEnv on five benchmarks, including API-Bank, ALFWorld, BFCL, Bamboogle, and TravelPlanner. Across these tasks, GenEnv improves agent performance by up to \textbf{+40.3%} over 7B baselines and matches or exceeds the average performance of larger models. Compared to Gemini 2.5 Pro-based offline data augmentation, GenEnv achieves better performance while using 3.3$\times$ less data. By shifting from static supervision to adaptive simulation, GenEnv provides a data-efficient pathway for scaling agent capabilities.

[75] Towards a resource for multilingual lexicons: an MT assisted and human-in-the-loop multilingual parallel corpus with multi-word expression annotation

Lifeng Han, Najet Hadj Mohamed, Malak Rassem, Gareth Jones, Alan Smeaton, Goran Nenadic

Main category: cs.CL

TL;DR: AlphaMWE is a multilingual parallel corpus with annotated multi-word expressions (MWEs), created using machine translation with human post-editing and quality control across six languages.

Details

Motivation: To create a high-quality multilingual resource for MWE research that addresses the challenges MT systems face in accurately translating multi-word expressions, which are important for various NLP applications.

Method: Used machine translation of English source corpus (from PARSEME 2018) followed by human post-editing, MWE annotation, and strict quality control with double-checking until annotator consensus. Covered Arabic, Chinese, English, German, Italian, and Polish.

Result: Created AlphaMWE corpus with bilingual/multilingual MWE alignments; found MT systems struggle with MWE translation; categorized MT error types; compared four popular MT systems (Bing, Google, Baidu, DeepL).

Conclusion: AlphaMWE provides a valuable resource for monolingual and cross-lingual research in MWE lexicography, machine translation, and information extraction due to its professional human curation and quality control.

Abstract: In this work, we introduce the construction of a machine translation (MT) assisted and human-in-the-loop multilingual parallel corpus with annotations of multi-word expressions (MWEs), named AlphaMWE. The MWEs include verbal MWEs (vMWEs) defined in the PARSEME shared task that have a verb as the head of the studied terms. The annotated vMWEs are also bilingually and multilingually aligned manually. The languages covered include Arabic, Chinese, English, German, Italian, and Polish, of which, the Arabic corpus includes both standard and dialectal variations from Egypt and Tunisia. Our original English corpus is extracted from the PARSEME shared task in 2018. We performed machine translation of this source corpus followed by human post-editing and annotation of target MWEs. Strict quality control was applied for error limitation, i.e., each MT output sentence received first manual post-editing and annotation plus a second manual quality rechecking till annotators’ consensus is reached. One of our findings during corpora preparation is that accurate translation of MWEs presents challenges to MT systems, as reflected by the outcomes of human-in-the-loop metric HOPE. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in performing MWE-related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT systems for comparison, namely Microsoft Bing Translator, GoogleMT, Baidu Fanyi, and DeepL MT. Because of the noise removal, translation post-editing, and MWE annotation by human professionals, we believe the AlphaMWE data set will be an asset for both monolingual and cross-lingual research, such as multi-word term lexicography, MT, and information extraction.

[76] Structured Language Generation Model: Loss Calibration and Formatted Decoding for Robust Structure Prediction and Knowledge Retrieval

Minho Lee, Junghyun Min, Yerang Kim, Woochul Lee, Yeonsoo Lee

Main category: cs.CL

TL;DR: SLGM framework reformulates structured prediction as classification for generative language models, improving performance on structure-related tasks without extra parameters.

Details

Motivation: Generative language models underperform on structure-related tasks compared to encoder-only models, likely due to missing connection between internal structure representations and output space during fine-tuning.

Method: Three-component framework: (1) reinforced input formatting with structural cues, (2) loss design, and (3) format-aware decoding that constrains generation to task-valid outputs.

Result: Substantially improves structure prediction across 5 tasks and 13 datasets, outperforms baseline fine-tuning, achieves comparable performance to larger models with <1B parameters, and works as zero-weight adapter in low-resource settings.

Conclusion: SLGM strengthens alignment between model’s internal structure representation and output, enabling better structured prediction without dataset-specific engineering or additional parameters.

Abstract: Modern generative pre-trained language models excel at open-ended text generation, yet continue to underperform on structure-related tasks such as NER, relation extraction, and semantic role labeling, especially when compared to encoder-only models of similar sizes. While this gap has been attributed to limited structure knowledge, we hypothesize this is also due to the missing connection between the model’s internal representations of linguistic structure and the output space used during supervised fine-tuning. We propose the Structured Language Generation Model (SLGM), a model- and task-agnostic framework that reformulates structured prediction as a classification problem through three components: (1) reinforced input formatting with structural cues, (2) loss design, and (3) format-aware decoding that constrains generation to task-valid outputs. Across 5 tasks and 13 datasets, SLGM substantially improves structure prediction without relying on dataset-specific engineering or additional model parameters, strengthening alignment between the model’s internal structure representation and output. It outperforms baseline fine-tuning on models of the same size, achieves comparable performance to much larger models when used with <1B parameter models, and acts as a zero-weight adapter that reproduces the benefits of dataset-specific fine-tuning in low-resource settings.

[77] Label Words as Local Task Vectors in In-Context Learning

Bowen Zheng, Ming Ma, Zhongqiao Lin, Tianming Yang

Main category: cs.CL

TL;DR: LLMs don’t always create global task vectors for in-context learning; instead, they form local task vectors from individual demonstrations that may or may not converge into global vectors depending on task type.

Details

Motivation: Previous work hypothesized that LLMs create global task vectors during in-context learning that convey overall task information, but this study investigates whether such global vectors exist in all tasks, particularly those requiring inference from multiple demonstrations like categorization tasks.

Method: Analyzed how LLMs process demonstrations during in-context learning by examining whether task vectors are global or local, investigating how information from demonstrations is transmitted to answer positions, and studying whether local task vectors converge in later layers.

Result: Found that global task vectors don’t exist in all tasks, especially categorization tasks. Instead, each demonstration creates a local task vector at its answer position. In some tasks, these local vectors converge to form global vectors, but not in categorization tasks. Local task vectors encode high-level rule abstractions.

Conclusion: In-context learning in LLMs operates through an information aggregation mechanism where local task vectors from individual demonstrations may or may not converge into global vectors depending on task requirements, providing novel insights into ICL mechanisms.

Abstract: Large Language Models (LLMs) have demonstrated remarkable abilities, one of the most important being in-context learning (ICL). With ICL, LLMs can derive the underlying rule from a few demonstrations and provide answers that comply with the rule. Previous work hypothesized that the network creates a task vector in specific positions during ICL. The task vector can be computed by averaging across the dataset. It conveys the overall task information and can thus be considered global. Patching the global task vector allows LLMs to achieve zero-shot performance with dummy inputs comparable to few-shot learning. However, we find that such a global task vector does not exist in all tasks, especially in tasks that rely on rules that can only be inferred from multiple demonstrations, such as categorization tasks. Instead, the information provided by each demonstration is first transmitted to its answer position and forms a local task vector associated with the demonstration. In some tasks but not in categorization tasks, all demonstrations’ local task vectors converge in later layers, forming the global task vector. We further show that local task vectors encode a high-level abstraction of rules extracted from the demonstrations. Our study provides novel insights into the mechanism underlying ICL in LLMs, demonstrating how ICL may be achieved through an information aggregation mechanism.

[78] ADePT: Adaptive Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning

Pengwei Tang, Xiaolin Hu, Yong Liu

Main category: cs.CL

TL;DR: ADePT improves on Decomposed Prompt Tuning by replacing fixed position-based embedding offsets with adaptive token-shared feed-forward networks, achieving better generalization and optimization without extra inference time or parameters.

Details

Motivation: DePT's position-based token embedding offsets limit generalization across diverse inputs, and shared embedding offsets across tokens lead to sub-optimization. Need adaptive offsets that vary with model inputs for better performance.

Method: ADePT uses a short soft prompt plus a shallow token-shared feed-forward neural network that learns adaptive embedding offsets for each token based on model input, enabling better optimization and generalization.

Result: ADePT consistently outperforms other parameter-efficient fine-tuning methods across 23 NLP tasks and 4 PLMs of different scales, sometimes even beating full fine-tuning, with no extra inference time or trainable parameters.

Conclusion: ADePT provides superior adaptation performance through adaptive token embedding offsets, addressing limitations of previous prompt tuning methods while maintaining efficiency advantages.

Abstract: Prompt Tuning (PT) enables the adaptation of Pre-trained Large Language Models (PLMs) to downstream tasks by optimizing a small amount of soft virtual tokens, which are prepended to the input token embeddings. Recently, Decomposed Prompt Tuning (DePT) has demonstrated superior adaptation capabilities by decomposing the soft prompt into a shorter soft prompt and a pair of low-rank matrices. The product of the pair of low-rank matrices is added to the input token embeddings to offset them. Additionally, DePT achieves faster inference compared to PT due to the shorter soft prompt. However, in this paper, we find that the position-based token embedding offsets of DePT restrict its ability to generalize across diverse model inputs, and that the shared embedding offsets across many token embeddings result in sub-optimization. To tackle these issues, we introduce Adaptive Decomposed Prompt Tuning (ADePT), which is composed of a short soft prompt and a shallow token-shared feed-forward neural network. ADePT utilizes the token-shared feed-forward neural network to learn the embedding offsets for each token, enabling adaptive embedding offsets that vary according to the model input and better optimization of token embedding offsets. This enables ADePT to achieve superior adaptation performance without requiring more inference time or additional trainable parameters compared to vanilla PT and its variants. In comprehensive experiments across 23 natural language processing tasks and 4 typical PLMs of different scales, ADePT consistently surpasses the other leading parameter-efficient fine-tuning methods, and even outperforms the full fine-tuning in certain scenarios. We also provide a theoretical analysis towards ADePT. Code is available at https://github.com/HungerPWAY/ADePT.

[79] VLDBench Evaluating Multimodal Disinformation with Regulatory Alignment

Shaina Raza, Ashmal Vayani, Aditya Jain, Aravind Narayanan, Vahid Reza Khazaie, Syed Raza Bashir, Elham Dolatabadi, Gias Uddin, Christos Emmanouilidis, Rizwan Qureshi, Mubarak Shah

Main category: cs.CL

TL;DR: VLDBench is the first large-scale multimodal disinformation detection benchmark with 62K text-image pairs across 13 categories, showing visual cues improve detection accuracy by 5-35% over text-only models.

Details

Motivation: Existing AI safety benchmarks focus on single-modality misinformation, leaving intentional multimodal disinformation (propaganda/conspiracy theories imitating credible news) largely unaddressed despite AI tools making synthetic content easy to generate.

Method: Created VLDBench with ~62,000 labeled text-image pairs from 58 news outlets using semi-automated pipeline followed by expert review (22 domain experts, 500+ hours). Evaluated state-of-the-art LLMs and VLMs on both unimodal (text-only) and multimodal detection tasks.

Result: Incorporating visual cues improves detection accuracy by 5 to 35 percentage points over text-only models. High-quality annotations with substantial inter-annotator agreement achieved through expert review.

Conclusion: VLDBench provides comprehensive data and code for evaluation, fine-tuning, and robustness testing, offering a principled foundation aligned with AI governance frameworks to advance trustworthy disinformation detection in multimodal media.

Abstract: Detecting disinformation that blends manipulated text and images has become increasingly challenging, as AI tools make synthetic content easy to generate and disseminate. While most existing AI safety benchmarks focus on single modality misinformation (i.e., false content shared without intent to deceive), intentional multimodal disinformation, such as propaganda or conspiracy theories that imitate credible news, remains largely unaddressed. We introduce the Vision-Language Disinformation Detection Benchmark (VLDBench), the first large-scale resource supporting both unimodal (text-only) and multimodal (text + image) disinformation detection. VLDBench comprises approximately 62,000 labeled text-image pairs across 13 categories, curated from 58 news outlets. Using a semi-automated pipeline followed by expert review, 22 domain experts invested over 500 hours to produce high-quality annotations with substantial inter-annotator agreement. Evaluations of state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) on VLDBench show that incorporating visual cues improves detection accuracy by 5 to 35 percentage points over text-only models. VLDBench provides data and code for evaluation, fine-tuning, and robustness testing to support disinformation analysis. Developed in alignment with AI governance frameworks (e.g., the MIT AI Risk Repository), VLDBench offers a principled foundation for advancing trustworthy disinformation detection in multimodal media. Project: https://vectorinstitute.github.io/VLDBench/ Dataset: https://huggingface.co/datasets/vector-institute/VLDBench Code: https://github.com/VectorInstitute/VLDBench

Zhuohan Ge, Darian Li, Yubo Wang, Nicole Hu, Xinyi Zhu, Haoyang Li, Xin Zhang, Mingtao Zhang, Shihao Qi, Yuming Xu, Han Shi, Chen Jason Zhang, Qing Li

Main category: cs.CL

TL;DR: Survey paper on LLM-based methods for social media mental disorder analysis, covering standard models, RAG for reliability, and agentic systems for autonomous reasoning/intervention.

Details

Motivation: Mental disorders are a global health challenge, and social media provides real-time data for digital phenotyping. LLMs offer better semantic understanding than traditional deep learning but face reliability and reasoning limitations. There's a need to systematically review how advanced techniques like RAG and Agentic systems can address these limitations.

Method: Systematic survey of LLM-based methods for social media mental disorder analysis. Organizes existing work by technical paradigm (standard pre-trained LLMs, RAG, agentic systems) and clinical target (internalizing disorders, psychotic disorders, externalizing behaviors). Establishes unified benchmark and evaluates performance across various tasks.

Result: Comprehensive evaluation of LLM performance including impact of RAG across various mental disorder analysis tasks. The survey establishes a unified benchmark for the field and demonstrates how advanced techniques can enhance reliability and reasoning capabilities.

Conclusion: This work paves the way for developing trustworthy, autonomous AI systems that can deliver precise and explainable mental health support by systematically synthesizing LLM-based approaches with RAG and agentic systems for social media mental disorder analysis.

Abstract: Mental disorders represent a critical global health challenge, and social media is increasingly viewed as a vital resource for real-time digital phenotyping and intervention. To leverage this data, large language models (LLMs) have been introduced, offering stronger semantic understanding and reasoning than traditional deep learning, thereby enhancing the explainability of detection results. Despite the growing prominence of LLMs in this field, there is a scarcity of scholarly works that systematically synthesize how advanced enhancement techniques, specifically Retrieval-Augmented Generation (RAG) and Agentic systems, can be utilized to address these reliability and reasoning limitations. Here, we systematically survey the evolving landscape of LLM-based methods for social media mental disorder analysis, spanning standard pre-trained language models, RAG to mitigate hallucinations and contextual gaps, and agentic systems for autonomous reasoning and multi-step intervention. We organize existing work by technical paradigm and clinical target, extending beyond common internalizing disorders to include psychotic disorders and externalizing behaviors. Additionally, the paper comprehensively evaluates the performance of LLMs, including the impact of RAG, across various tasks. This work establishes a unified benchmark for the field, paving the way for the development of trustworthy, autonomous AI systems that can deliver precise and explainable mental health support.

[81] Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution

Chenghao Li, Chaoning Zhang, Yi Lu, Jiaquan Zhang, Qigan Sun, Xudong Wang, Jiwei Wei, Guoqing Wang, Yang Yang, Heng Tao Shen

Main category: cs.CL

TL;DR: SoT extends Chain-of-Thought prompting using algebraic geometry concepts to create multiple interrelated reasoning paths, improving complex problem-solving in LLMs.

Details

Motivation: Standard CoT prompting struggles with complex tasks that have vast solution spaces and vague constraints, requiring more robust reasoning structures beyond single chains.

Method: Inspired by Minimal Free Resolution from commutative algebra, SoT introduces auxiliary reasoning paths with concepts like modules, Betti numbers, freeness, mapping, exactness, and minimality to decompose problems into minimal subproblems.

Result: Achieved inference accuracy matching or surpassing mainstream CoT standards on datasets like GSM8K and MATH across models including GPT-4o-mini and Qwen2.5, with improved scalability of inference time.

Conclusion: SoT provides a structured framework that enhances LLM reasoning through algebraic decomposition, offering transparent reasoning and high performance for complex tasks.

Abstract: Chain-of-Thought (CoT) prompting enhances the reasoning of large language models (LLMs) by decomposing problems into sequential steps, mimicking human logic and reducing errors. However, complex tasks with vast solution spaces and vague constraints often exceed the capacity of a single reasoning chain. Inspired by Minimal Free Resolution (MFR) in commutative algebra and algebraic geometry, we propose Syzygy of Thoughts (SoT)-a novel framework that extends CoT by introducing auxiliary, interrelated reasoning paths. SoT captures deeper logical dependencies, enabling more robust and structured problem-solving. MFR decomposes a module into a sequence of free modules with minimal rank, providing a structured analytical approach to complex systems. This method introduces the concepts of “Module”, “Betti numbers”,“Freeness”, “Mapping”, “Exactness” and “Minimality”, enabling the systematic decomposition of the original complex problem into logically complete minimal subproblems while preserving key problem features and reducing reasoning length. We tested SoT across diverse datasets (e.g., GSM8K, MATH) and models (e.g., GPT-4o-mini, Qwen2.5), achieving inference accuracy that matches or surpasses mainstream CoTs standards. Additionally, by aligning the sampling process with algebraic constraints, our approach enhances the scalability of inference time in LLMs, ensuring both transparent reasoning and high performance. Our code will be publicly available at https://github.com/dlMARiA/Syzygy-of-thoughts.

[82] Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications

Fadel M. Megahed, Ying-Ju Chen, L. Allision Jones-Farmer, Younghwa Lee, Jiawei Brooke Wang, Inez M. Zwetsloot

Main category: cs.CL

TL;DR: A framework for evaluating LLM consistency in binary text classification using psychometric principles, tested on financial news sentiment across 14 LLMs with high intra-rater reliability but limited market prediction ability.

Details

Motivation: Address the lack of established reliability assessment methods for LLM binary text classification by developing systematic evaluation framework.

Method: Adapt psychometric principles to determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Case study examines 14 LLMs on 1,350 financial news articles with five replicates per model.

Result: Models showed high intra-rater consistency (90-98% perfect agreement), minimal differences between expensive/economical models, strong accuracy (0.76-0.88) against StockNewsAPI labels, with smaller models outperforming larger ones. All models performed at chance when predicting actual market movements.

Conclusion: The framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks, though LLMs have task constraints for market prediction.

Abstract: This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.

[83] Multimodal Cultural Safety: Evaluation Framework and Alignment Strategies

Haoyi Qiu, Kung-Hsiang Huang, Ruichen Zheng, Jiao Sun, Nanyun Peng

Main category: cs.CL

TL;DR: CROSS benchmark assesses cultural safety reasoning in large vision-language models across 16 countries and 14 languages, revealing significant gaps in cultural awareness and compliance, with proposed enhancement strategies showing substantial improvements.

Details

Motivation: Existing multimodal safety benchmarks focus on physical safety but overlook cultural norm violations that cause symbolic harm, creating a gap in evaluating LVLMs' ability to produce culturally appropriate responses for global applications like tourism assistants.

Method: Introduced CROSS benchmark with 1,284 multilingual visually grounded queries from 16 countries, and CROSS-Eval framework measuring cultural awareness, norm education, compliance, and helpfulness. Evaluated 21 leading LVLMs and developed two enhancement strategies: supervised fine-tuning with culturally grounded data and preference tuning with contrastive response pairs.

Result: Significant cultural safety gaps found: best-performing model achieved only 61.79% in awareness and 37.73% in compliance. Open-source models reached GPT-4o-level performance but still fell short of proprietary models. Enhancement strategies improved GPT-4o’s cultural awareness (+60.14%) and compliance (+55.2%) while preserving general multimodal capabilities.

Conclusion: LVLMs have substantial cultural safety reasoning gaps that current reasoning capacity alone cannot fully resolve. The proposed enhancement strategies effectively improve cultural alignment while maintaining general capabilities, highlighting the need for dedicated cultural safety training in globally deployed multimodal systems.

Abstract: Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants, yet their ability to produce culturally appropriate responses remains underexplored. Existing multimodal safety benchmarks primarily focus on physical safety and overlook violations rooted in cultural norms, which can result in symbolic harm. To address this gap, we introduce CROSS, a benchmark designed to assess the cultural safety reasoning capabilities of LVLMs. CROSS includes 1,284 multilingual visually grounded queries from 16 countries, three everyday domains, and 14 languages, where cultural norm violations emerge only when images are interpreted in context. We propose CROSS-Eval, an intercultural theory-based framework that measures four key dimensions: cultural awareness, norm education, compliance, and helpfulness. Using this framework, we evaluate 21 leading LVLMs, including mixture-of-experts models and reasoning models. Results reveal significant cultural safety gaps: the best-performing model achieves only 61.79% in awareness and 37.73% in compliance. While some open-source models reach GPT-4o-level performance, they still fall notably short of proprietary models. Our results further show that increasing reasoning capacity improves cultural alignment but does not fully resolve the issue. To improve model performance, we develop two enhancement strategies: supervised fine-tuning with culturally grounded, open-ended data and preference tuning with contrastive response pairs that highlight safe versus unsafe behaviors. These methods substantially improve GPT-4o’s cultural awareness (+60.14%) and compliance (+55.2%), while preserving general multimodal capabilities with minimal performance reduction on general multimodal understanding benchmarks.

[84] Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs

Jie Ma, Ning Qu, Zhitao Gao, Rui Xing, Jun Liu, Hongbin Pei, Jiang Xie, Linyun Song, Pinghui Wang, Jing Tao, Zhou Su

Main category: cs.CL

TL;DR: DP framework improves LLM trustworthiness by leveraging KG structural priors and constraints through progressive knowledge distillation and reasoning-introspection strategies.

Details

Motivation: Existing KG-based RAG methods fail to fully exploit KG structural information and constraints, which could enhance LLM faithfulness and reliability in reasoning and response generation.

Method: Deliberation over Priors (DP) framework uses: 1) Progressive knowledge distillation integrating structural priors via supervised fine-tuning and Kahneman-Tversky optimization, 2) Reasoning-introspection strategy for refined reasoning verification using constraint priors.

Result: Achieves SOTA on three benchmark datasets, with 13% Hit@1 improvement on ComplexWebQuestions, generating highly trustworthy responses with verified flexibility and practicality.

Conclusion: DP effectively leverages KG priors to enhance LLM trustworthiness, demonstrating significant improvements in faithfulness and reliability for knowledge-intensive tasks.

Abstract: Knowledge graph-based retrieval-augmented generation seeks to mitigate hallucinations in Large Language Models (LLMs) caused by insufficient or outdated knowledge. However, existing methods often fail to fully exploit the prior knowledge embedded in knowledge graphs (KGs), particularly their structural information and explicit or implicit constraints. The former can enhance the faithfulness of LLMs’ reasoning, while the latter can improve the reliability of response generation. Motivated by these, we propose a trustworthy reasoning framework, termed Deliberation over Priors (DP), which sufficiently utilizes the priors contained in KGs. Specifically, DP adopts a progressive knowledge distillation strategy that integrates structural priors into LLMs through a combination of supervised fine-tuning and Kahneman-Tversky optimization, thereby improving the faithfulness of relation path generation. Furthermore, our framework employs a reasoning-introspection strategy, which guides LLMs to perform refined reasoning verification based on extracted constraint priors, ensuring the reliability of response generation. Extensive experiments on three benchmark datasets demonstrate that DP achieves new state-of-the-art performance, especially a Hit@1 improvement of 13% on the ComplexWebQuestions dataset, and generates highly trustworthy responses. We also conduct various analyses to verify its flexibility and practicality. The code is available at https://github.com/reml-group/Deliberation-on-Priors.

[85] Dagstuhl Perspectives Workshop 24352 – Conversational Agents: A Framework for Evaluation (CAFE): Manifesto

Christine Bauer, Li Chen, Nicola Ferro, Norbert Fuhr, Avishek Anand, Timo Breuer, Guglielmo Faggioli, Ophir Frieder, Hideo Joho, Jussi Karlgren, Johannes Kiesel, Bart P. Knijnenburg, Aldo Lipani, Lien Michiels, Andrea Papenmeier, Maria Soledad Pera, Mark Sanderson, Scott Sanner, Benno Stein, Johanne R. Trippas, Karin Verspoor, Martijn C Willemsen

Main category: cs.CL

TL;DR: The paper proposes a world model for Conversational Information Access (CONIAC) and defines CAFE, a comprehensive evaluation framework with six components for assessing CONIAC systems.

Details

Motivation: To establish a clear understanding of CONversational Information ACcess (CONIAC) and develop a systematic evaluation framework for CONIAC systems, addressing the need for standardized assessment methods in this emerging field.

Method: Deep workshop discussions to define CONIAC’s unique features, creation of a world model abstraction, and development of the Conversational Agents Framework for Evaluation (CAFE) with six major components covering stakeholder goals, user tasks, user aspects, evaluation criteria, methodology, and quantitative measures.

Result: Proposed a conceptual world model for CONIAC and established CAFE - a comprehensive evaluation framework consisting of six key components that provide a structured approach to evaluating conversational information access systems.

Conclusion: The paper successfully defines CONIAC and provides CAFE as a systematic evaluation framework that addresses multiple dimensions of conversational information access systems, enabling more comprehensive and standardized assessment of these technologies.

Abstract: During the workshop, we deeply discussed what CONversational Information ACcess (CONIAC) is and its unique features, proposing a world model abstracting it, and defined the Conversational Agents Framework for Evaluation (CAFE) for the evaluation of CONIAC systems, consisting of six major components: 1) goals of the system’s stakeholders, 2) user tasks to be studied in the evaluation, 3) aspects of the users carrying out the tasks, 4) evaluation criteria to be considered, 5) evaluation methodology to be applied, and 6) measures for the quantitative criteria chosen.

[86] Abstract, Align, Predict: Zero-Shot Stance Detection via Cognitive Inductive Reasoning

Bowen Zhang, Jun Ma, Fuqiang Niu, Li Dong, Jinzhou Cao, Genan Dai

Main category: cs.CL

TL;DR: CIRF is a schema-driven framework for zero-shot stance detection that uses cognitive reasoning schemas to bridge linguistic inputs with abstract reasoning, achieving state-of-the-art performance with minimal labeled data.

Details

Motivation: Existing LLM-based approaches for zero-shot stance detection have limitations: prompting methods lack complex reasoning and generalization to novel targets, while LLM-enhanced methods require substantial labeled data and struggle with interpretability and adaptability beyond instance-level patterns.

Method: CIRF abstracts first-order logic patterns from raw text into multi-relational schema graphs in an unsupervised manner, then uses a schema-enhanced graph kernel model to align input structures with schema templates for interpretable zero-shot inference.

Result: CIRF establishes new state-of-the-art results on SemEval-2016, VAST, and COVID-19-Stance benchmarks, achieving comparable performance with just 30% of labeled data, demonstrating strong generalization and efficiency in low-resource settings.

Conclusion: The schema-driven cognitive inductive reasoning approach effectively addresses limitations of existing LLM-based methods for zero-shot stance detection, providing robust, interpretable, and efficient performance with minimal labeled data requirements.

Abstract: Zero-shot stance detection (ZSSD) seeks to determine the stance of text toward previously unseen targets, a task critical for analyzing dynamic and polarized online discourse with limited labeled data. While large language models (LLMs) offer zero-shot capabilities, prompting-based approaches often fall short in handling complex reasoning and lack robust generalization to novel targets. Meanwhile, LLM-enhanced methods still require substantial labeled data and struggle to move beyond instance-level patterns, limiting their interpretability and adaptability. Inspired by cognitive science, we propose the Cognitive Inductive Reasoning Framework (CIRF), a schema-driven method that bridges linguistic inputs and abstract reasoning via automatic induction and application of cognitive reasoning schemas. CIRF abstracts first-order logic patterns from raw text into multi-relational schema graphs in an unsupervised manner, and leverages a schema-enhanced graph kernel model to align input structures with schema templates for robust, interpretable zero-shot inference. Extensive experiments on SemEval-2016, VAST, and COVID-19-Stance benchmarks demonstrate that CIRF not only establishes new state-of-the-art results, but also achieves comparable performance with just 30% of the labeled data, demonstrating its strong generalization and efficiency in low-resource settings.

[87] Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs

Xiang Li, Chong Zhang, Jia Wang, Fangyu Wu, Yushi Li, Xiaobo Jin

Main category: cs.CL

TL;DR: Adversarial Prompt Distillation transfers jailbreaking capabilities from large to small language models using masked language modeling, reinforcement learning, and dynamic temperature control for efficient, stealthy attacks.

Details

Motivation: Current automated jailbreaking methods rely on LLM generation phases that are complex to deploy and reason with, limiting their efficiency and practical applicability despite encouraging results.

Method: Proposes Adversarial Prompt Distillation framework that integrates masked language modeling, reinforcement learning, and dynamic temperature control to distill LLM jailbreaking capabilities into smaller language models.

Result: Empirical evaluations show superiority in attack efficacy, resource optimization, and cross-model versatility while maintaining high success rates across broader application contexts.

Conclusion: Demonstrates practicality of transferring jailbreak capabilities to SLMs, reveals inherent LLM vulnerabilities, and provides novel insights to advance LLM security investigations.

Abstract: As the scale and complexity of jailbreaking attacks on large language models (LLMs) continue to escalate, their efficiency and practical applicability are constrained, posing a profound challenge to LLM security. Jailbreaking techniques have advanced from manual prompt engineering to automated methodologies. Recent advances have automated jailbreaking approaches that harness LLMs to generate jailbreak instructions and adversarial examples, delivering encouraging results. Nevertheless, these methods universally include an LLM generation phase, which, due to the complexities of deploying and reasoning with LLMs, impedes effective implementation and broader adoption. To mitigate this issue, we introduce \textbf{Adversarial Prompt Distillation}, an innovative framework that integrates masked language modeling, reinforcement learning, and dynamic temperature control to distill LLM jailbreaking prowess into smaller language models (SLMs). This methodology enables efficient, robust jailbreak attacks while maintaining high success rates and accommodating a broader range of application contexts. Empirical evaluations affirm the approach’s superiority in attack efficacy, resource optimization, and cross-model versatility. Our research underscores the practicality of transferring jailbreak capabilities to SLMs, reveals inherent vulnerabilities in LLMs, and provides novel insights to advance LLM security investigations. Our code is available at: https://github.com/lxgem/Efficient_and_Stealthy_Jailbreak_Attacks_via_Adversarial_Prompt.

[88] Mirage of Mastery: Memorization Tricks LLMs into Artificially Inflated Self-Knowledge

Sahil Kale

Main category: cs.CL

TL;DR: LLMs confuse memorization with reasoning, leading to overconfidence and inconsistent self-knowledge, especially in STEM domains where they show over 45% inconsistency in feasibility assessments.

Details

Motivation: Existing research treats memorization and self-knowledge deficits in LLMs as separate issues, missing their intertwining link that degrades trustworthiness. The study aims to determine if LLMs genuinely learn reasoning patterns or merely memorize them, particularly in STEM domains.

Method: A novel framework to test if LLMs learn reasoning patterns or just memorize training data, focusing on STEM domains. Uses self-validated, logically coherent task perturbations to assess generalization and self-knowledge consistency.

Result: LLMs show significant generalization problems: they draw confidence from memorized solutions to infer higher self-knowledge about reasoning ability, manifesting as over 45% inconsistency in feasibility assessments. This effect is most pronounced in science and medicine domains with standardized jargon.

Conclusion: Significant flaws in current LLM architectures and training patterns highlight the need for techniques that ensure balanced, consistent self-knowledge perceptions for maximum AI explainability and trustworthiness.

Abstract: When artificial intelligence mistakes memorization for intelligence, it creates a dangerous mirage of reasoning. Existing studies treat memorization and self-knowledge deficits in LLMs as separate issues and do not recognize an intertwining link that degrades the trustworthiness of LLM responses. In our study, we utilize a novel framework to ascertain if LLMs genuinely learn reasoning patterns from training data or merely memorize them to assume competence across problems of similar complexity focused on STEM domains. Our analysis shows a noteworthy problem in generalization: LLMs draw confidence from memorized solutions to infer a higher self-knowledge about their reasoning ability, which manifests as an over 45% inconsistency in feasibility assessments when faced with self-validated, logically coherent task perturbations. This effect is most pronounced in science and medicine domains, which tend to have maximal standardized jargon and problems, further confirming our approach. Significant wavering within the self-knowledge of LLMs also shows flaws in current architectures and training patterns, highlighting the need for techniques that ensure a balanced, consistent stance on models’ perceptions of their own knowledge for maximum AI explainability and trustworthiness. Our code and results are available publicly at https://github.com/Sahil-R-Kale/mirage_of_mastery

[89] From Words to Proverbs: Evaluating LLMs Linguistic and Cultural Competence in Saudi Dialects with Absher

Renad Al-Monef, Hassan Alhuzali, Nora Alturayeif, Ashwag Alasmari

Main category: cs.CL

TL;DR: Absher is a benchmark for evaluating LLMs on Saudi Arabic dialects with 18K+ questions across 6 categories, revealing performance gaps in cultural and contextual understanding.

Details

Motivation: As LLMs become central to Arabic NLP applications, there's a need to evaluate their understanding of regional dialects and cultural nuances, especially in linguistically diverse settings like Saudi Arabia where standard Arabic models may fail on dialectal content.

Method: Created Absher benchmark with over 18,000 multiple-choice questions across 6 categories (Meaning, True/False, Fill-in-the-Blank, Contextual Usage, Cultural Interpretation, Location Recognition) using curated dialectal words, phrases, and proverbs from various Saudi regions. Evaluated state-of-the-art multilingual and Arabic-specific LLMs.

Result: Results show notable performance gaps, particularly in tasks requiring cultural inference or contextual understanding. Models struggle with dialectal and culturally nuanced content despite being state-of-the-art.

Conclusion: Highlights urgent need for dialect-aware training and culturally aligned evaluation methodologies to improve LLMs’ performance in real-world Arabic applications, emphasizing the importance of addressing regional linguistic diversity.

Abstract: As large language models (LLMs) become increasingly central to Arabic NLP applications, evaluating their understanding of regional dialects and cultural nuances is essential, particularly in linguistically diverse settings like Saudi Arabia. This paper introduces Absher, a comprehensive benchmark specifically designed to assess LLMs performance across major Saudi dialects. \texttt{Absher} comprises over 18,000 multiple-choice questions spanning six distinct categories: Meaning, True/False, Fill-in-the-Blank, Contextual Usage, Cultural Interpretation, and Location Recognition. These questions are derived from a curated dataset of dialectal words, phrases, and proverbs sourced from various regions of Saudi Arabia. We evaluate several state-of-the-art LLMs, including multilingual and Arabic-specific models. We also provide detailed insights into their capabilities and limitations. Our results reveal notable performance gaps, particularly in tasks requiring cultural inference or contextual understanding. Our findings highlight the urgent need for dialect-aware training and culturally aligned evaluation methodologies to improve LLMs performance in real-world Arabic applications.

[90] CodeNER: Code Prompting for Named Entity Recognition

Sungwoo Han, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura

Main category: cs.CL

TL;DR: Code-based prompting improves NER by embedding BIO schema instructions in code format, helping LLMs better understand labeling requirements and outperforming text-based prompting across multiple languages.

Details

Motivation: Previous NER approaches using LLMs rely solely on input context without capturing detailed labeling requirements. NER inherently needs both context and structured labeling instructions, which current methods lack.

Method: Proposes code-based prompting that embeds BIO schema instructions within prompts using code format. This leverages LLMs’ ability to comprehend long-range scopes in programming languages to better understand NER requirements.

Result: Outperforms conventional text-based prompting on ten benchmarks across English, Arabic, Finnish, Danish, and German datasets. Combining code-based prompting with chain-of-thought further improves performance.

Conclusion: Code-based prompting effectively structures NER instructions for LLMs, demonstrating that explicit structuring of labeling requirements through code format significantly improves NER performance across multiple languages.

Abstract: Recent studies have explored various approaches for treating candidate named entity spans as both source and target sequences in named entity recognition (NER) by leveraging large language models (LLMs). Although previous approaches have successfully generated candidate named entity spans with suitable labels, they rely solely on input context information when using LLMs, particularly, ChatGPT. However, NER inherently requires capturing detailed labeling requirements with input context information. To address this issue, we propose a novel method that leverages code-based prompting to improve the capabilities of LLMs in understanding and performing NER. By embedding code within prompts, we provide detailed BIO schema instructions for labeling, thereby exploiting the ability of LLMs to comprehend long-range scopes in programming languages. Experimental results demonstrate that the proposed code-based prompting method outperforms conventional text-based prompting on ten benchmarks across English, Arabic, Finnish, Danish, and German datasets, indicating the effectiveness of explicitly structuring NER instructions. We also verify that combining the proposed code-based prompting method with the chain-of-thought prompting further improves performance.

[91] Decoding Neural Emotion Patterns through Large Language Model Embeddings

Gideon Vos, Maryam Ebrahimpour, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi

Main category: cs.CL

TL;DR: A computational framework maps textual emotional content to brain regions without neuroimaging, using text embeddings and clustering to create emotion-brain mappings, validated on clinical and AI-generated text.

Details

Motivation: Traditional neuroimaging is expensive and lab-restricted, while abundant digital text offers new opportunities for emotion-brain mapping. Previous work has largely kept neuroimaging-based emotion localization and computational text analysis separate, with little integration between them.

Method: Uses OpenAI’s text-embedding-ada-002 to generate high-dimensional semantic representations, applies dimensionality reduction and clustering to identify emotional groups, and maps them to 18 brain regions associated with emotional processing. Three experiments: 1) analyzing conversational data from healthy vs. depressed subjects (DIAC-WOZ dataset), 2) applying to GoEmotions dataset, 3) comparing human-written text with LLM responses.

Result: Neuroanatomically plausible mappings with high spatial specificity. Depressed subjects showed greater limbic engagement tied to negative affect. Discrete emotions were successfully differentiated. LLM-generated text matched humans in basic emotion distribution but lacked nuanced activation in empathy and self-referential regions (medial prefrontal and posterior cingulate cortex).

Conclusion: This cost-effective, scalable approach enables large-scale analysis of naturalistic language, distinguishes between clinical populations, and offers a brain-based benchmark for evaluating AI emotional expression.

Abstract: Understanding how emotional expression in language relates to brain function is a challenge in computational neuroscience and affective computing. Traditional neuroimaging is costly and lab-bound, but abundant digital text offers new avenues for emotion-brain mapping. Prior work has largely examined neuroimaging-based emotion localization or computational text analysis separately, with little integration. We propose a computational framework that maps textual emotional content to anatomically defined brain regions without requiring neuroimaging. Using OpenAI’s text-embedding-ada-002, we generate high-dimensional semantic representations, apply dimensionality reduction and clustering to identify emotional groups, and map them to 18 brain regions linked to emotional processing. Three experiments were conducted: i) analyzing conversational data from healthy vs. depressed subjects (DIAC-WOZ dataset) to compare mapping patterns, ii) applying the method to the GoEmotions dataset and iii) comparing human-written text with large language model (LLM) responses to assess differences in inferred brain activation. Emotional intensity was scored via lexical analysis. Results showed neuroanatomically plausible mappings with high spatial specificity. Depressed subjects exhibited greater limbic engagement tied to negative affect. Discrete emotions were successfully differentiated. LLM-generated text matched humans in basic emotion distribution but lacked nuanced activation in empathy and self-referential regions (medial prefrontal and posterior cingulate cortex). This cost-effective, scalable approach enables large-scale analysis of naturalistic language, distinguishes between clinical populations, and offers a brain-based benchmark for evaluating AI emotional expression.

[92] SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models

Ziyi Yang, Weizhou Shen, Chenliang Li, Ruijun Chen, Fanqi Wan, Ming Yan, Xiaojun Quan, Fei Huang

Main category: cs.CL

TL;DR: SPELL is a multi-role self-play RL framework for scalable, label-free optimization of long-context reasoning in LLMs, using cyclical questioner-responder-verifier roles to enable continual self-improvement without human annotations.

Details

Motivation: Progress in long-context reasoning for LLMs has lagged due to intrinsic difficulty of processing long texts and scarcity of reliable human annotations and programmatically verifiable reward signals.

Method: SPELL integrates three cyclical roles within a single model: questioner generates questions from documents with reference answers, responder learns to solve questions based on documents, and verifier evaluates semantic equivalence between responder’s output and reference answer to produce reward signals. Includes automated curriculum for gradually increasing document length and adaptive reward function.

Result: Extensive experiments on six long-context benchmarks show SPELL consistently improves performance across diverse LLMs and outperforms equally sized models fine-tuned on large-scale annotated data. Achieves average 7.6-point gain in pass@8 on Qwen3-30B-A3B-Thinking, raising performance ceiling.

Conclusion: SPELL enables scalable, label-free optimization for long-context reasoning and shows promise for scaling to even more capable models, addressing key challenges in long-context reasoning advancement.

Abstract: Progress in long-context reasoning for large language models (LLMs) has lagged behind other recent advances. This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals. In this paper, we propose SPELL, a multi-role self-play reinforcement learning framework that enables scalable, label-free optimization for long-context reasoning. SPELL integrates three cyclical roles-questioner, responder, and verifier-within a single model to enable continual self-improvement. The questioner generates questions from raw documents paired with reference answers; the responder learns to solve these questions based on the documents; and the verifier evaluates semantic equivalence between the responder’s output and the questioner’s reference answer, producing reward signals to guide continual training. To stabilize training, we introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model’s evolving capabilities. Extensive experiments on six long-context benchmarks show that SPELL consistently improves performance across diverse LLMs and outperforms equally sized models fine-tuned on large-scale annotated data. Notably, SPELL achieves an average 7.6-point gain in pass@8 on the strong reasoning model Qwen3-30B-A3B-Thinking, raising its performance ceiling and showing promise for scaling to even more capable models.

[93] Style Over Story: A Process-Oriented Study of Authorial Creativity in Large Language Models

Donghoon Jung, Jiwoo Choi, Songeun Chae, Seohyon Jung

Main category: cs.CL

TL;DR: LLMs evaluated as computational authors using constraint-based decision-making; models consistently prioritize Style over Character, Event, and Setting in creative writing.

Details

Motivation: Current LLM creativity evaluations focus on output quality rather than creative processes. The study aims to examine LLMs as computational authors using narratology to understand their creative decision-making processes.

Method: Process-oriented approach using narratology and constraint-based decision-making. Controlled prompting assigns authorial personas to analyze creative preferences. Models’ reasoning for choices is probed to identify distinctive profiles.

Result: LLMs consistently emphasize Style over other narrative elements (Character, Event, Setting). Distinctive creative profiles emerge across different models when analyzing their reasoning for choices.

Conclusion: The approach provides a novel systematic tool for analyzing AI’s authorial creativity, revealing consistent stylistic preferences in LLMs’ creative decision-making processes.

Abstract: Evaluations of large language models (LLMs)’ creativity have focused primarily on the quality of their outputs rather than the processes that shape them. This study takes a process-oriented approach, drawing on narratology to examine LLMs as computational authors. We introduce constraint-based decision-making as a lens for authorial creativity. Using controlled prompting to assign authorial personas, we analyze the creative preferences of the models. Our findings show that LLMs consistently emphasize Style over other elements, including Character, Event, and Setting. By also probing the reasoning the models provide for their choices, we show that distinctive profiles emerge across models and argue that our approach provides a novel systematic tool for analyzing AI’s authorial creativity.

[94] A Survey on Agentic Security: Applications, Threats and Defenses

Asif Shahriar, Md Nafiu Rahman, Sadif Ahmed, Farig Sadeque, Md Rizwan Parvez

Main category: cs.CL

TL;DR: First comprehensive survey of agentic security landscape covering applications, threats, and defenses with taxonomy of 160+ papers and analysis of trends and research gaps.

Details

Motivation: To provide a holistic understanding of the emerging field of agentic security by systematically organizing and analyzing the landscape, as no comprehensive survey currently exists.

Method: Structured the field around three pillars (Applications, Threats, Defenses), created comprehensive taxonomy of 160+ papers, conducted cross-cutting analysis of agent architecture trends, and identified research gaps.

Result: Complete survey with detailed taxonomy, emerging trends in agent architecture, critical research gaps in model and modality coverage, and publicly available continuously updated repository of surveyed papers.

Conclusion: This work establishes foundational survey of agentic security landscape, provides structured framework for understanding the field, and identifies key research directions for future work in this emerging area.

Abstract: In this work we present the first holistic survey of the agentic security landscape, structuring the field around three fundamental pillars: Applications, Threats, and Defenses. We provide a comprehensive taxonomy of over 160 papers, explaining how agents are used in downstream cybersecurity applications, inherent threats to agentic systems, and countermeasures designed to protect them. A detailed cross-cutting analysis shows emerging trends in agent architecture while revealing critical research gaps in model and modality coverage. A complete and continuously updated list of all surveyed papers is publicly available at https://github.com/kagnlp/Awesome-Agentic-Security.

[95] RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

Tuan T. Nguyen, John Le, Thai T. Vu, Willy Susilo, Heath Cooper

Main category: cs.CL

TL;DR: RAID is a framework that crafts adversarial suffixes to jailbreak LLMs by optimizing continuous embeddings with refusal-aware regularization and coherence constraints, achieving higher attack success with fewer queries than existing methods.

Details

Motivation: LLMs have impressive performance but remain vulnerable to jailbreak attacks that bypass safety mechanisms. There's a need to systematically probe these weaknesses to understand and ultimately mitigate LLM vulnerabilities.

Method: RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective: (1) encourages restricted responses, (2) refusal-aware regularizer to steer activations away from refusal directions, (3) coherence term for semantic plausibility. Uses critic-guided decoding to map embeddings back to tokens.

Result: Experiments on multiple open-source LLMs show RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines.

Conclusion: RAID demonstrates the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities, providing an effective framework for probing safety weaknesses.

Abstract: Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.

[96] LexChain: Modeling Legal Reasoning Chains for Chinese Tort Case Analysis

Huiyuan Xie, Chenyang Li, Huining Zhu, Chubin Zhang, Yuxiao Ye, Zhenghao Liu, Zhiyuan Liu

Main category: cs.CL

TL;DR: The paper introduces LexChain, a novel framework for modeling legal reasoning in Chinese tort civil cases, creates an evaluation benchmark, and shows that current LLMs struggle with tort reasoning but can be improved with explicit reasoning chain modeling.

Details

Motivation: Existing computational approaches to legal reasoning rely on generic frameworks like syllogism that don't capture nuanced legal reasoning processes, and current research focuses too much on criminal cases while neglecting civil cases like tort law.

Method: The authors operationalize legal reasoning in tort analysis into the three-module LexChain framework with finer-grained sub-steps, create a tort legal reasoning evaluation benchmark, evaluate existing LLMs, and develop baseline approaches that incorporate LexChain-style reasoning through prompting or post-training.

Result: Current large language models fall short in accurately handling crucial elements of tort legal reasoning. However, the proposed baselines that explicitly incorporate LexChain-style reasoning achieve significant improvements in tort-related legal reasoning and generalize well to related legal analysis tasks.

Conclusion: Explicitly modeling legal reasoning chains enhances the reasoning capabilities of language models for legal analysis, particularly for civil cases like tort law where current approaches are insufficient.

Abstract: Legal reasoning is a fundamental component of legal analysis and decision-making. Existing computational approaches to legal reasoning predominantly rely on generic reasoning frameworks such as syllogism, which do not comprehensively examine the nuanced process of legal reasoning. Moreover, current research has largely focused on criminal cases, with insufficient modeling for civil cases. In this work, we present a novel framework to explicitly model legal reasoning in the analysis of Chinese tort-related civil cases. We first operationalize the legal reasoning process in tort analysis into the three-module LexChain framework, with each module consisting of multiple finer-grained sub-steps. Informed by the LexChain framework, we introduce the task of tort legal reasoning and construct an evaluation benchmark to systematically assess the critical steps within analytical reasoning chains for tort analysis. Leveraging this benchmark, we evaluate existing large language models for their legal reasoning ability in civil tort contexts. Our results indicate that current models still fall short in accurately handling crucial elements of tort legal reasoning. Furthermore, we introduce several baseline approaches that explicitly incorporate LexChain-style reasoning through prompting or post-training. The proposed baselines achieve significant improvements in tort-related legal reasoning and generalize well to related legal analysis tasks, demonstrating the value of explicitly modeling legal reasoning chains to enhance the reasoning capabilities of language models.

[97] The Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI

Alan Saji, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully

Main category: cs.CL

TL;DR: LRMs default to English reasoning for non-English questions, which generally improves accuracy but introduces translation errors and reduces cultural nuance handling.

Details

Motivation: Large Reasoning Models (LRMs) have strong performance on reasoning tasks but their multilingual reasoning abilities remain underexplored. There are concerns about interpretability and handling of linguistic/cultural nuances when LRMs default to English reasoning for non-English questions.

Method: Systematically compare LRM reasoning in English versus the language of the question across two tasks: MGSM and GPQA Diamond. Measure answer accuracy and analyze cognitive attributes in reasoning traces.

Result: English reasoning traces show substantially higher presence of cognitive behaviors, and reasoning in English generally yields higher final-answer accuracy, with performance gap increasing as tasks become more complex. However, English-centric strategy is susceptible to “Lost in Translation” errors where translation steps cause mistakes that would have been avoided by reasoning in the question’s language.

Conclusion: While English reasoning improves accuracy for LRMs on multilingual tasks, it introduces translation-related errors and reduces interpretability and cultural nuance handling, highlighting a trade-off between accuracy and proper multilingual reasoning.

Abstract: Large Reasoning Models (LRMs) achieve strong performance on mathematical, scientific, and other question-answering tasks, but their multilingual reasoning abilities remain underexplored. When presented with non-English questions, LRMs often default to reasoning in English, raising concerns about interpretability and the handling of linguistic and cultural nuances. We systematically compare an LRM’s reasoning in English versus the language of the question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond measuring answer accuracy, we also analyze cognitive attributes in the reasoning traces. We find that English reasoning traces exhibit a substantially higher presence of these cognitive behaviors, and that reasoning in English generally yields higher final-answer accuracy, with the performance gap increasing as tasks become more complex. However, this English-centric strategy is susceptible to a key failure mode - getting “Lost in Translation,” where translation steps lead to errors that would have been avoided by question’s language reasoning.

[98] Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement

Linyang He, Tianjun Zhong, Richard Antonello, Gavin Mischler, Micah Goldblum, Nima Mesgarani

Main category: cs.CL

TL;DR: Researchers developed a method to disentangle LLM embeddings into lexicon, syntax, meaning, and reasoning components, revealing that reasoning has unique neural signatures in brain activity that are masked by entangled representations.

Details

Motivation: Current LLM representations are highly entangled, mixing different linguistic and cognitive features, which biases brain encoding analyses toward shallow features and makes it difficult to isolate neural substrates of deeper cognitive processes like reasoning.

Method: Introduced a residual disentanglement method that probes LLMs to identify feature-specific layers, then iteratively regresses out lower-level representations to produce four nearly orthogonal embeddings for lexicon, syntax, meaning, and reasoning.

Result: 1) Reasoning embedding uniquely predicts neural variance not explained by other features, including visual region recruitment; 2) Reasoning neural signature peaks later (~350-400ms) than other features; 3) Standard entangled embeddings mask reasoning contributions by being dominated by shallow features.

Conclusion: Disentangling LLM representations reveals distinct neural substrates for reasoning that are temporally delayed and spatially extended beyond classical language areas, showing that standard LLM embeddings can be misleading for studying higher-level cognition.

Abstract: Understanding how the human brain progresses from processing simple linguistic inputs to performing high-level reasoning is a fundamental challenge in neuroscience. While modern large language models (LLMs) are increasingly used to model neural responses to language, their internal representations are highly “entangled,” mixing information about lexicon, syntax, meaning, and reasoning. This entanglement biases conventional brain encoding analyses toward linguistically shallow features (e.g., lexicon and syntax), making it difficult to isolate the neural substrates of cognitively deeper processes. Here, we introduce a residual disentanglement method that computationally isolates these components. By first probing an LM to identify feature-specific layers, our method iteratively regresses out lower-level representations to produce four nearly orthogonal embeddings for lexicon, syntax, meaning, and, critically, reasoning. We used these disentangled embeddings to model intracranial (ECoG) brain recordings from neurosurgical patients listening to natural speech. We show that: 1) This isolated reasoning embedding exhibits unique predictive power, accounting for variance in neural activity not explained by other linguistic features and even extending to the recruitment of visual regions beyond classical language areas. 2) The neural signature for reasoning is temporally distinct, peaking later (~350-400ms) than signals related to lexicon, syntax, and meaning, consistent with its position atop a processing hierarchy. 3) Standard, non-disentangled LLM embeddings can be misleading, as their predictive success is primarily attributable to linguistically shallow features, masking the more subtle contributions of deeper cognitive processing.

[99] LiveSecBench: A Dynamic and Event-Driven Safety Benchmark for Chinese Language Model Applications

Yudong Li, Peiru Yang, Feng Huang, Zhongliang Yang, Kecheng Wang, Haitian Li, Baocheng Chen, Xingyu An, Ziyu Liu, Youdan Yang, Kejiang Chen, Sifang Wan, Xu Wang, Yufei Sun, Liyan Wu, Ruiqi Zhou, Wenya Wen, Xingchi Gu, Tianxin Zhang, Yue Gao, Yongfeng Huang

Main category: cs.CL

TL;DR: LiveSecBench is a continuously updated Chinese-language LLM safety benchmark with automated+human dataset creation, evaluating models across 5 safety dimensions using ELO ratings.

Details

Motivation: To create a continuously updated safety benchmark specifically for Chinese-language LLM application scenarios, addressing the need for robust and current AI safety evaluation standards in Chinese contexts.

Method: Combines automated generation with human verification to construct high-quality datasets, periodically releases new versions with expanded datasets and updated metrics, evaluates LLMs across 5 safety dimensions using an ELO rating system.

Result: Released v251215 version evaluating 57 representative LLMs across Public Safety, Fairness & Bias, Privacy, Truthfulness, and Mental Health Safety dimensions, providing a leaderboard of current Chinese LLM safety state.

Conclusion: LiveSecBench establishes a robust, continuously updated standard for Chinese LLM safety evaluation, with publicly available results and leaderboard tracking model safety performance over time.

Abstract: We introduce LiveSecBench, a continuously updated safety benchmark specifically for Chinese-language LLM application scenarios. LiveSecBench constructs a high-quality and unique dataset through a pipeline that combines automated generation with human verification. By periodically releasing new versions to expand the dataset and update evaluation metrics, LiveSecBench provides a robust and up-to-date standard for AI safety. In this report, we introduce our second release v251215, which evaluates across five dimensions (Public Safety, Fairness & Bias, Privacy, Truthfulness, and Mental Health Safety.) We evaluate 57 representative LLMs using an ELO rating system, offering a leaderboard of the current state of Chinese LLM safety. The result is available at https://livesecbench.intokentech.cn/.

[100] AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Aashray Reddy, Andrew Zagula, Nicholas Saban

Main category: cs.CL

TL;DR: AutoAdv is a training-free framework for automated multi-turn jailbreaking attacks on LLMs, achieving up to 95% success rate on Llama-3.1-8B within six turns, significantly outperforming single-turn attacks.

Details

Motivation: Most current jailbreaking evaluations focus on single-turn interactions, but real-world attacks occur through adaptive multi-turn conversations. Existing alignment strategies are optimized for single-turn interactions and fail to maintain robustness across extended conversations.

Method: AutoAdv combines three adaptive mechanisms: 1) Pattern manager that learns from successful attacks to enhance future prompts, 2) Temperature manager that dynamically adjusts sampling parameters based on failure modes, and 3) Two-phase rewriting strategy that disguises harmful requests and then iteratively refines them.

Result: Achieves up to 95% attack success rate on Llama-3.1-8B within six turns (24% improvement over single-turn baselines). Multi-turn attacks consistently outperform single-turn approaches across commercial and open-source models (Llama-3.1-8B, GPT-4o mini, Qwen3-235B, Mistral-7B).

Conclusion: Current safety mechanisms have persistent vulnerabilities to multi-turn attacks. Alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.

Abstract: Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs. Yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves an attack success rate of up to 95% on Llama-3.1-8B within six turns, a 24% improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests and then iteratively refines them. Extensive evaluation across commercial and open-source models (Llama-3.1-8B, GPT-4o mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.

[101] Evaluating Large Language Models for Anxiety, Depression, and Stress Detection: Insights into Prompting Strategies and Synthetic Data

Mihael Arcan, David-Paul Niland

Main category: cs.CL

TL;DR: This paper compares LLMs, classical ML, and transformer models for detecting anxiety, depression, and stress from clinical interview text, finding that transformer models like Distil-RoBERTa and XLNet perform best, with synthetic data generation helping address class imbalance.

Details

Motivation: Mental health disorders affect over 20% of adults globally, but detecting these conditions from text is challenging due to subtle and varied symptom expressions. There's a need for effective automated detection methods to improve mental health assessment.

Method: Used DAIC-WOZ dataset of clinical interviews, compared LLMs (Llama, GPT) with classical ML and transformer models (BERT, XLNet, Distil-RoBERTa). Fine-tuned models for anxiety (GAD-2), depression (PHQ), and stress classification. Applied synthetic data generation to address class imbalance issues.

Result: Distil-RoBERTa achieved highest F1 score (0.883) for GAD-2 anxiety detection. XLNet performed best on PHQ depression tasks (F1 up to 0.891). For stress detection, zero-shot synthetic approach (SD+Zero-Shot-Basic) reached F1 of 0.884 and ROC AUC of 0.886. Transformer models generally outperformed LLMs and classical approaches.

Conclusion: Transformer-based models are effective for mental health detection from text, with synthetic data generation improving recall and generalization. However, careful calibration is needed to prevent precision loss. Combining advanced language models with data augmentation shows promise for enhancing automated mental health assessment.

Abstract: Mental health disorders affect over one-fifth of adults globally, yet detecting such conditions from text remains challenging due to the subtle and varied nature of symptom expression. This study evaluates multiple approaches for mental health detection, comparing Large Language Models (LLMs) such as Llama and GPT with classical machine learning and transformer-based architectures including BERT, XLNet, and Distil-RoBERTa. Using the DAIC-WOZ dataset of clinical interviews, we fine-tuned models for anxiety, depression, and stress classification and applied synthetic data generation to mitigate class imbalance. Results show that Distil-RoBERTa achieved the highest F1 score (0.883) for GAD-2, while XLNet outperformed others on PHQ tasks (F1 up to 0.891). For stress detection, a zero-shot synthetic approach (SD+Zero-Shot-Basic) reached an F1 of 0.884 and ROC AUC of 0.886. Findings demonstrate the effectiveness of transformer-based models and highlight the value of synthetic data in improving recall and generalization. However, careful calibration is required to prevent precision loss. Overall, this work emphasizes the potential of combining advanced language models and data augmentation to enhance automated mental health assessment from text.

[102] TermGPT: Multi-Level Contrastive Fine-Tuning for Terminology Adaptation in Legal and Financial Domain

Yidan Sun, Mengying Zhu, Feiyue Chen, Yangyang Wu, Xiaolei Dan, Mengyuan Yang, Xiaolin Zheng, Shenglin Ben

Main category: cs.CL

TL;DR: TermGPT: A multi-level contrastive fine-tuning framework that improves LLM embedding spaces for domain-specific terminology discrimination in legal and financial contexts.

Details

Motivation: LLMs suffer from isotropy problem in embedding spaces, leading to poor discrimination of domain-specific terminology in legal and financial contexts, which hinders downstream tasks like legal judgment prediction and financial risk analysis.

Method: Proposes TermGPT with: 1) sentence graph construction to capture semantic and structural relations, 2) generation of semantically consistent yet discriminative positive/negative samples using contextual and topological cues, and 3) multi-level contrastive learning at both sentence and token levels.

Result: TermGPT outperforms existing baselines in term discrimination tasks within finance and legal domains, supported by the first financial terminology dataset derived from official regulatory documents.

Conclusion: The proposed multi-level contrastive fine-tuning framework effectively addresses the isotropy problem in LLM embeddings, enhancing terminology discrimination for domain-specific applications.

Abstract: Large language models (LLMs) have demonstrated impressive performance in text generation tasks; however, their embedding spaces often suffer from the isotropy problem, resulting in poor discrimination of domain-specific terminology, particularly in legal and financial contexts. This weakness in terminology-level representation can severely hinder downstream tasks such as legal judgment prediction or financial risk analysis, where subtle semantic distinctions are critical. To address this problem, we propose TermGPT, a multi-level contrastive fine-tuning framework designed for terminology adaptation. We first construct a sentence graph to capture semantic and structural relations, and generate semantically consistent yet discriminative positive and negative samples based on contextual and topological cues. We then devise a multi-level contrastive learning approach at both the sentence and token levels, enhancing global contextual understanding and fine-grained terminology discrimination. To support robust evaluation, we construct the first financial terminology dataset derived from official regulatory documents. Experiments show that TermGPT outperforms existing baselines in term discrimination tasks within the finance and legal domains.

[103] SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering

Gyubok Lee, Woosog Chay, Edward Choi

Main category: cs.CL

TL;DR: SCARE is a benchmark for evaluating post-hoc safety verification mechanisms in EHR question answering systems, focusing on classifying question answerability and verifying/correcting SQL queries before execution.

Details

Motivation: Deploying text-to-SQL models in clinical EHR systems is challenging because incorrect SQL queries (from model errors or problematic user inputs) can undermine clinical decision-making and jeopardize patient care. There's a lack of unified benchmarks for evaluating post-hoc verification mechanisms crucial for safe deployment.

Method: Introduces SCARE benchmark with 4,200 triples of questions, candidate SQL queries, and expected outputs grounded in MIMIC-III, MIMIC-IV, and eICU databases. Covers diverse questions and SQL queries generated by seven different text-to-SQL models. Benchmarks range of approaches from two-stage methods to agentic frameworks.

Result: Experiments reveal a critical trade-off between question classification and SQL error correction, highlighting key challenges in post-hoc safety verification for clinical EHR QA systems.

Conclusion: SCARE fills the gap in evaluating post-hoc safety layers for EHR QA systems, providing a realistic benchmark that outlines directions for future research in safe deployment of text-to-SQL models in clinical environments.

Abstract: Recent advances in Large Language Models (LLMs) have enabled the development of text-to-SQL models that allow clinicians to query structured data stored in Electronic Health Records (EHRs) using natural language. However, deploying these models for EHR question answering (QA) systems in safety-critical clinical environments remains challenging: incorrect SQL queries-whether caused by model errors or problematic user inputs-can undermine clinical decision-making and jeopardize patient care. While prior work has mainly focused on improving SQL generation accuracy or filtering questions before execution, there is a lack of a unified benchmark for evaluating independent post-hoc verification mechanisms (i.e., a component that inspects and validates the generated SQL before execution), which is crucial for safe deployment. To fill this gap, we introduce SCARE, a benchmark for evaluating methods that function as a post-hoc safety layer in EHR QA systems. SCARE evaluates the joint task of (1) classifying question answerability (i.e., determining whether a question is answerable, ambiguous, or unanswerable) and (2) verifying or correcting candidate SQL queries. The benchmark comprises 4,200 triples of questions, candidate SQL queries, and expected model outputs, grounded in the MIMIC-III, MIMIC-IV, and eICU databases. It covers a diverse set of questions and corresponding candidate SQL queries generated by seven different text-to-SQL models, ensuring a realistic and challenging evaluation. Using SCARE, we benchmark a range of approaches-from two-stage methods to agentic frameworks. Our experiments reveal a critical trade-off between question classification and SQL error correction, highlighting key challenges and outlining directions for future research.

[104] Affective Multimodal Agents with Proactive Knowledge Grounding for Emotionally Aligned Marketing Dialogue

Lin Yu, Xiaofei Han, Yifei Kang, Chiung-Yi Tseng, Danyang Zhang, Ziqian Bi, Zhimo Han

Main category: cs.CL

TL;DR: AffectMind is a multimodal affective dialogue agent that performs proactive reasoning and dynamic knowledge grounding for emotionally aligned and persuasive marketing conversations, outperforming LLM baselines.

Details

Motivation: Current LLM-based dialogue systems are mostly reactive and struggle in emotionally rich, goal-oriented settings like marketing conversations, lacking proactive reasoning and emotional alignment capabilities.

Method: Three-component architecture: 1) Proactive Knowledge Grounding Network (PKGN) for continuous multimodal context updates, 2) Emotion-Intent Alignment Model (EIAM) for joint emotion and purchase intent modeling, and 3) Reinforced Discourse Loop (RDL) for optimizing emotional coherence via reinforcement learning.

Result: Outperforms strong LLM baselines on two marketing dialogue datasets (MM-ConvMarket and AffectPromo) with +26% emotional consistency, +19% persuasive success rate, and +23% long-term user engagement.

Conclusion: Emotion-grounded proactivity is a key capability for commercial multimodal agents, enabling more effective and engaging marketing conversations through proactive reasoning and dynamic emotional alignment.

Abstract: Recent advances in large language models (LLMs) have enabled fluent dialogue systems, but most remain reactive and struggle in emotionally rich, goal-oriented settings such as marketing conversations. To address this limitation, we propose AffectMind, a multimodal affective dialogue agent that performs proactive reasoning and dynamic knowledge grounding to sustain emotionally aligned and persuasive interactions. AffectMind combines three components: a Proactive Knowledge Grounding Network (PKGN) that continuously updates factual and affective context from text, vision, and prosody; an Emotion–Intent Alignment Model (EIAM) that jointly models user emotion and purchase intent to adapt persuasion strategies; and a Reinforced Discourse Loop (RDL) that optimizes emotional coherence and engagement via reinforcement signals from user responses. Experiments on two newly curated marketing dialogue datasets, MM-ConvMarket and AffectPromo, show that AffectMind outperforms strong LLM-based baselines in emotional consistency (+26%), persuasive success rate (+19%), and long-term user engagement (+23%), highlighting emotion-grounded proactivity as a key capability for commercial multimodal agents.

Ting-Yao Hu, Hema Swetha Koppula, Hadi Pouransari, Cem Koc, Oncel Tuzel, Raviteja Vemulapalli

Main category: cs.CL

TL;DR: SCRPO is a self-supervised training framework that uses an LLM’s own critique and refinement capabilities to create preference data, then applies preference learning to improve faithfulness in summarization without needing test-time compute or more powerful teacher models.

Details

Motivation: LLMs suffer from hallucinations in long-form text generation like summarization. Existing methods to reduce hallucinations require either additional test-time compute or access to more powerful teacher models, making them costly and impractical.

Method: Self Critique and Refinement-based Preference Optimization (SCRPO): 1) Constructs a preference dataset by leveraging the LLM’s own critique and refinement capabilities, 2) Applies preference learning to improve the same LLM for faithful summarization.

Result: Outperforms state-of-the-art self-supervised learning methods on three summarization benchmarks (XSUM, CNNDM, SAMSum) in faithfulness metrics while maintaining or improving overall summary quality metrics. More efficient than test-time refinement and produces more faithful summaries.

Conclusion: SCRPO provides an effective self-supervised training framework that improves faithfulness in LLM summarization without requiring additional test-time compute or access to more powerful models, making hallucination reduction more practical and efficient.

Abstract: Large Language Models (LLMs) often suffer from hallucinations: output content that is not grounded in the input context, when performing long-form text generation tasks such as summarization. Prior works have shown that hallucinations can be reduced by iteratively critiquing and refining previously generated outputs using either the same model or a more powerful teacher model as the critique. However, these approaches either require additional test-time compute or assume access to more powerful teacher models, making them costly and less practical. In this work, we propose Self Critique and Refinement-based Preference Optimization (SCRPO), which is a self-supervised training framework that first constructs a preference dataset by leveraging the LLM’s own critique and refinement capabilities, and then applies preference learning to improve the same LLM for faithful summarization. Experiments on three summarization benchmarks (XSUM CNNDM and SAMSum), demonstrate that our approach outperforms state-of-the-art self-supervised learning methods in terms of faithfulness metrics while either maintaining or improving other metrics that measure the overall quality of the summary. Moreover, compared to test-time refinement, our approach not only improves efficiency but also results in more faithful summaries.

[106] Over-representation of phonological features in basic vocabulary doesn’t replicate when controlling for spatial and phylogenetic effects

Frederic Blum

Main category: cs.CL

TL;DR: Most previously observed sound symbolic patterns in basic vocabulary disappear when controlling for genealogical and areal dependencies, though a small subset remains robust.

Details

Motivation: To test the reproducibility and robustness of claimed universal sound symbolic patterns in basic vocabulary, addressing concerns about biases in study samples, lack of genealogical/areal controls, and questionable robustness of previous findings.

Method: Reanalyzed sound symbolism in basic vocabulary using a much larger sample (2864 languages from Lexibank vs. original 245), modified the original model by adding statistical controls for spatial and phylogenetic dependencies between languages.

Result: Most previously observed sound symbolic patterns were not robust and disappeared completely when adding genealogical and areal controls. However, a small number of patterns emerged as highly stable even with the new sample and controls.

Conclusion: Sound symbolism claims need rigorous testing for robustness, including controlling for genealogical and areal dependencies. While most patterns don’t hold up, a small subset appears genuinely robust, allowing for better assessment of sound symbolism distribution at larger scale.

Abstract: The statistical over-representation of phonological features in the basic vocabulary of languages is often interpreted as reflecting potentially universal sound symbolic patterns. However, most of those results have not been tested explicitly for reproducibility and might be prone to biases in the study samples or models. Many studies on the topic do not adequately control for genealogical and areal dependencies between sampled languages, casting doubts on the robustness of the results. In this study, we test the robustness of a recent study on sound symbolism of basic vocabulary concepts which analyzed 245 languages.The new sample includes data on 2864 languages from Lexibank. We modify the original model by adding statistical controls for spatial and phylogenetic dependencies between languages. The new results show that most of the previously observed patterns are not robust, and in fact many patterns disappear completely when adding the genealogical and areal controls. A small number of patterns, however, emerges as highly stable even with the new sample. Through the new analysis, we are able to assess the distribution of sound symbolism on a larger scale than previously. The study further highlights the need for testing all universal claims on language for robustness on various levels.

[107] Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases

Sherman Wong, Zhenting Qi, Zhaodong Wang, Nathan Hu, Samuel Lin, Jun Ge, Erwin Gao, Wenlin Chen, Yilun Du, Minlan Yu, Ying Zhang

Main category: cs.CL

TL;DR: The paper introduces Confucius Code Agent (CCA), a scalable software engineering agent built on the Confucius SDK platform, featuring hierarchical memory, persistent note-taking, modular tools, and a meta-agent for automated configuration refinement. It achieves 54.3% Resolve@1 on SWE-Bench-Pro.

Details

Motivation: Existing coding agents have limitations: research-grade agents lack scalability for production workloads, while production-grade systems offer limited extensibility, interpretability, and controllability. There's a need for agents that can handle massive repositories, sustain long-horizon sessions, and reliably coordinate complex toolchains.

Method: Built on Confucius SDK with three perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). Features include: unified orchestrator with hierarchical working memory for long-context reasoning, persistent note-taking for cross-session continual learning, modular extension system for reliable tool use, and a meta-agent that automates synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop.

Result: CCA achieves 54.3% Resolve@1 on SWE-Bench-Pro, exceeding prior research baselines and comparing favorably to commercial results under identical repositories, model backends, and tool access conditions.

Conclusion: The Confucius Code Agent demonstrates strong performance on real-world software engineering tasks by combining scalable architecture with automated configuration refinement, bridging the gap between research-grade transparency and production-grade performance.

Abstract: Real-world software engineering tasks require coding agents that can operate over massive repositories, sustain long-horizon sessions, and reliably coordinate complex toolchains at test time. Existing research-grade coding agents offer transparency but struggle when scaled to heavier, production-level workloads, while production-grade systems achieve strong practical performance but provide limited extensibility, interpretability, and controllability. We introduce the Confucius Code Agent (CCA), a software engineering agent that can operate at large-scale codebases. CCA is built on top of the Confucius SDK, an agent development platform structured around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK integrates a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension system for reliable tool use. In addition, we introduce a meta-agent that automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid adaptation to new tasks, environments, and tool stacks. Instantiated with these mechanisms, CCA demonstrates strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA reaches a Resolve@1 of 54.3%, exceeding prior research baselines and comparing favorably to commercial results, under identical repositories, model backends, and tool access.

[108] Human-Inspired Learning for Large Language Models via Obvious Record and Maximum-Entropy Method Discovery

Hong Su

Main category: cs.CL

TL;DR: The paper proposes a human-inspired learning framework for LLMs that combines explicit symbolic memory storage with entropy-guided method discovery to handle rare/unseen scenarios better than standard next-token prediction approaches.

Details

Motivation: LLMs struggle with rare, low-resource, or previously unseen scenarios because such cases are sparsely represented in training data. They rely on implicit parametric memory, behaving as intuition-driven predictors rather than deliberate, method-oriented learners, which limits their ability to explicitly acquire, recall, and refine methods for uncommon situations.

Method: A human-inspired learning framework with two complementary mechanisms: 1) Obvious Record - explicitly stores cause-result/question-solution relationships as symbolic memory for persistent learning from single/infrequent encounters; 2) Maximum-Entropy Method Discovery - prioritizes and preserves methods with high semantic dissimilarity to capture diverse and underrepresented strategies typically overlooked by next-token prediction.

Result: Verification on a benchmark of 60 semantically diverse question-solution pairs shows the entropy-guided approach achieves stronger coverage of unseen questions and significantly greater internal diversity than a random baseline, confirming effectiveness in discovering more generalizable and human-inspired methods.

Conclusion: The proposed framework enables LLMs to better handle rare and unseen scenarios by combining explicit symbolic memory with entropy-based method discovery, moving beyond intuition-driven prediction toward more deliberate, method-oriented learning inspired by human learning from rare experiences.

Abstract: Large Language Models (LLMs) excel at extracting common patterns from large-scale corpora, yet they struggle with rare, low-resource, or previously unseen scenarios-such as niche hardware deployment issues or irregular IoT device behaviors-because such cases are sparsely represented in training data. Moreover, LLMs rely primarily on implicit parametric memory, which limits their ability to explicitly acquire, recall, and refine methods, causing them to behave predominantly as intuition-driven predictors rather than deliberate, method-oriented learners. Inspired by how humans learn from rare experiences, this paper proposes a human-inspired learning framework that integrates two complementary mechanisms. The first, Obvious Record, explicitly stores cause–result (or question–solution) relationships as symbolic memory, enabling persistent learning even from single or infrequent encounters. The second, Maximum-Entropy Method Discovery, prioritizes and preserves methods with high semantic dissimilarity, allowing the system to capture diverse and underrepresented strategies that are typically overlooked by next-token prediction. Verification on a benchmark of 60 semantically diverse question–solution pairs demonstrates that the proposed entropy-guided approach achieves stronger coverage of unseen questions and significantly greater internal diversity than a random baseline, confirming its effectiveness in discovering more generalizable and human-inspired methods.

[109] Understanding Syllogistic Reasoning in LLMs from Formal and Natural Language Perspectives

Aheli Poddar, Saptarshi Sahoo, Sujata Ghosh

Main category: cs.CL

TL;DR: LLMs show varying syllogistic reasoning capabilities across 14 models, with some achieving perfect symbolic performance, raising questions about whether LLMs are developing formal reasoning mechanisms rather than capturing human reasoning nuances.

Details

Motivation: To investigate fundamental reasoning capabilities of LLMs through syllogistic reasoning from both logical and natural language perspectives, and to understand the direction of research in this area.

Method: Evaluated 14 large language models on syllogistic reasoning tasks, examining both symbolic inference capabilities and natural language understanding of syllogistic reasoning.

Result: Syllogistic reasoning is not uniformly emergent across LLMs, but some models achieve perfect symbolic performance, suggesting they may be developing formal reasoning mechanisms.

Conclusion: The perfect symbolic performances in certain LLMs raise questions about whether these models are evolving into formal reasoning mechanisms rather than capturing the nuanced aspects of human reasoning.

Abstract: We study syllogistic reasoning in LLMs from the logical and natural language perspectives. In process, we explore fundamental reasoning capabilities of the LLMs and the direction this research is moving forward. To aid in our studies, we use 14 large language models and investigate their syllogistic reasoning capabilities in terms of symbolic inferences as well as natural language understanding. Even though this reasoning mechanism is not a uniform emergent property across LLMs, the perfect symbolic performances in certain models make us wonder whether LLMs are becoming more and more formal reasoning mechanisms, rather than making explicit the nuances of human reasoning.

[110] VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models

Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang, Nguyen Tuan Ngoc, Dao Xuan Quang Minh, Phan Phi Hai, Nguyen Thi Ngoc Anh, Dang Van Tu, Binh Vu

Main category: cs.CL

TL;DR: VLegal-Bench is the first comprehensive benchmark for evaluating LLMs on Vietnamese legal tasks, featuring 10,450 expert-annotated samples across multiple cognitive levels based on Bloom’s taxonomy.

Details

Motivation: The complexity, hierarchical organization, and frequent revisions of Vietnamese legislation create significant challenges for evaluating how well LLMs can interpret and utilize legal knowledge, creating a need for a standardized evaluation framework.

Method: Created VLegal-Bench with 10,450 samples generated through a rigorous annotation pipeline where legal experts label and cross-validate each instance. The benchmark is informed by Bloom’s cognitive taxonomy and includes tasks reflecting practical legal usage scenarios like general Q&A, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving.

Result: The benchmark provides a standardized, transparent, and cognitively informed evaluation framework that establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts.

Conclusion: VLegal-Bench supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems by enabling systematic evaluation of LLMs on Vietnamese legal tasks.

Abstract: The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, Vietnamese Legal Benchmark (VLegal-Bench) is introduced, the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom’s cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross-validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems.

[111] LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding

Chenkai Xu, Yijie Jin, Jiajun Li, Yi Tu, Guoping Long, Dandan Tu, Mingcong Song, Hongjie Si, Tianqi Hou, Junchi Yan, Zhijie Deng

Main category: cs.CL

TL;DR: LoPA is a training-free algorithm that improves parallelism in diffusion LLMs by optimizing token filling order, achieving up to 10.1 tokens per forward pass while maintaining performance.

Details

Motivation: Current confidence-driven decoding for diffusion LLMs has limited parallelism (only 1-3 tokens per forward pass), which restricts inference speed despite the models' potential for high-speed inference.

Method: LoPA explores multiple token filling orders in parallel branches, selects the one with highest potential for future parallelism based on branch confidence, and uses a specialized multi-device inference system with Branch Parallelism.

Result: Applied to D2F model, LoPA increases tokens per forward pass to 10.1 on GSM8K while maintaining superior performance to baseline, achieving 1073.9 tokens per second throughput under multi-GPU deployment.

Conclusion: LoPA significantly accelerates diffusion LLM inference through optimized token filling order selection, enabling unprecedented parallelism without requiring additional training.

Abstract: Diffusion Large Language Models (dLLMs) have demonstrated significant potential for high-speed inference. However, current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1–3 tokens per forward pass (TPF). In this work, we identify that the degree of parallelism during dLLM inference is highly sensitive to the Token Filling Order (TFO). Then, we introduce Lookahead PArallel Decoding LoPA, a training-free, plug-and-play algorithm, to identify a superior TFO and hence accelerate inference. LoPA concurrently explores distinct candidate TFOs via parallel branches, and selects the one with the highest potential for future parallelism based on branch confidence. We apply LoPA to the state-of-the-art D2F model and observe a substantial enhancement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on the GSM8K while maintaining performance superior to the Dream baseline. Furthermore, to facilitate this unprecedented degree of parallelism, we develop a specialized multi-device inference system featuring Branch Parallelism (BP), which achieves a single-sample throughput of 1073.9 tokens per second under multi-GPU deployment. The code is available at https://github.com/zhijie-group/LoPA.

[112] Navigating the Reality Gap: Privacy-Preserving Adaptation of ASR for Challenging Low-Resource Domains

Darshil Chauhan, Adityasinh Solanki, Vansh Patel, Kanav Kapoor, Ritvik Jain, Aditya Bansal, Pratik Narang, Dhruv Kumar

Main category: cs.CL

TL;DR: ASR for clinical documentation faces reality gap between lab performance and real-world noisy audio. IndicWav2Vec degrades to 40.94% WER on rural Indian clinical data. Zero-data-exfiltration framework with LoRA enables localized continual adaptation. Multi-domain Experience Replay yields 17.1% WER improvement and reduces catastrophic forgetting by 55%.

Details

Motivation: ASR could assist clinical documentation in resource-constrained regions, but deployment is hindered by: 1) Reality gap between lab performance and noisy real-world clinical audio, 2) Strict privacy constraints preventing data exfiltration, 3) Resource limitations in rural settings.

Method: Zero-data-exfiltration framework using Low-Rank Adaptation (LoRA) for localized continual adaptation. Investigated continual learning strategies, focusing on trade-offs between data-driven and parameter-driven stability. Used multi-domain Experience Replay (ER) and developed stabilized, linearized formulation of Elastic Weight Consolidation (EWC) for noisy environments.

Result: Multi-domain Experience Replay achieved 17.1% relative improvement in target WER and reduced catastrophic forgetting by 55% compared to naive adaptation. Stabilized EWC formulation effectively controlled gradient magnitudes and enabled stable convergence. Acoustic adaptation proved essential for usability, not bypassable by language models alone.

Conclusion: Continual adaptation via LoRA with multi-domain Experience Replay effectively bridges the reality gap for clinical ASR in resource-constrained settings. Acoustic adaptation is fundamental prerequisite for usability. Privacy-preserving, localized adaptation enables deployment in real-world clinical environments.

Abstract: Automatic Speech Recognition (ASR) holds immense potential to assist in clinical documentation and patient report generation, particularly in resource-constrained regions. However, deployment is currently hindered by a technical deadlock: a severe “Reality Gap” between laboratory performance and noisy, real-world clinical audio, coupled with strict privacy and resource constraints. We quantify this gap, showing that a robust multilingual model (IndicWav2Vec) degrades to a 40.94% WER on rural clinical data from India, rendering it unusable. To address this, we explore a zero-data-exfiltration framework enabling localized, continual adaptation via Low-Rank Adaptation (LoRA). We conduct a rigorous investigative study of continual learning strategies, characterizing the trade-offs between data-driven and parameter-driven stability. Our results demonstrate that multi-domain Experience Replay (ER) yields the primary performance gains, achieving a 17.1% relative improvement in target WER and reducing catastrophic forgetting by 55% compared to naive adaptation. Furthermore, we observed that standard Elastic Weight Consolidation (EWC) faced numerical stability challenges when applied to LoRA in noisy environments. Our experiments show that a stabilized, linearized formulation effectively controls gradient magnitudes and enables stable convergence. Finally, we verify via a domain-specific spot check that acoustic adaptation is a fundamental prerequisite for usability which cannot be bypassed by language models alone.

cs.CV

[113] A 96pJ/Frame/Pixel and 61pJ/Event Anti-UAV System with Hybrid Object Tracking Modes

Yuncheng Lu, Yucen Shi, Aobo Li, Zehao Li, Junying Li, Bo Wang, Tony Tae-Hyoung Kim

Main category: cs.CV

TL;DR: Energy-efficient anti-UAV chip integrates frame-based and event-driven tracking for small, fast drones, achieving 98.2% accuracy with 96 pJ/frame/pixel efficiency.

Details

Motivation: Need for energy-efficient systems to detect small, fast-moving drones (UAVs) at long ranges (50-400m) with high speeds (5-80 pixels/sec), requiring reliable tracking while minimizing power consumption.

Method: Hybrid system combining frame-based and event-driven object tracking with run-length encoding for event frames, adaptive mode switching based on object size/velocity, Fast Object Tracking Unit with adaptive thresholding, and neural processing unit with custom instruction set and zero-skipping MAC architecture.

Result: 2 mm² chip in 40nm CMOS achieves 96 pJ/frame/pixel and 61 pJ/event at 0.8V, with 98.2% recognition accuracy on UAV datasets, reducing redundant neural computations by >97%.

Conclusion: Demonstrates state-of-the-art end-to-end energy efficiency for anti-UAV systems through integrated frame/event processing and specialized hardware optimization.

Abstract: We present an energy-efficient anti-UAV system that integrates frame-based and event-driven object tracking to enable reliable detection of small and fast-moving drones. The system reconstructs binary event frames using run-length encoding, generates region proposals, and adaptively switches between frame mode and event mode based on object size and velocity. A Fast Object Tracking Unit improves robustness for high-speed targets through adaptive thresholding and trajectory-based classification. The neural processing unit supports both grayscale-patch and trajectory inference with a custom instruction set and a zero-skipping MAC architecture, reducing redundant neural computations by more than 97 percent. Implemented in 40 nm CMOS technology, the 2 mm^2 chip achieves 96 pJ per frame per pixel and 61 pJ per event at 0.8 V, and reaches 98.2 percent recognition accuracy on public UAV datasets across 50 to 400 m ranges and 5 to 80 pixels per second speeds. The results demonstrate state-of-the-art end-to-end energy efficiency for anti-UAV systems.

[114] NystagmusNet: Explainable Deep Learning for Photosensitivity Risk Prediction

Karthik Prabhakar

Main category: cs.CV

TL;DR: NystagmusNet: AI system predicts high-risk visual environments for photosensitive nystagmus patients and recommends real-time visual adaptations using a dual-branch CNN trained on synthetic data with 75% validation accuracy.

Details

Motivation: Nystagmus patients with photosensitivity face daily challenges due to involuntary eye movements worsened by brightness conditions. Current solutions lack predictive personalization, creating a need for proactive, personalized assistive technology.

Method: Dual-branch convolutional neural network trained on synthetic and augmented datasets to estimate photosensitivity risk scores based on environmental brightness and eye movement variance. Includes explainability techniques (SHAP and GradCAM) and rule-based recommendation engine for adaptive filter suggestions.

Result: Model achieves 75% validation accuracy on synthetic data. System successfully identifies environmental risk zones and provides adaptive filter recommendations, with explainability features improving clinical trust and interpretability.

Conclusion: NystagmusNet demonstrates promising AI-driven approach for predicting high-risk visual environments for photosensitive nystagmus patients. Future work includes smart glasses deployment and reinforcement learning for personalized recommendations.

Abstract: Nystagmus patients with photosensitivity face significant daily challenges due to involuntary eye movements exacerbated by environmental brightness conditions. Current assistive solutions are limited to symptomatic treatments without predictive personalization. This paper proposes NystagmusNet, an AI-driven system that predicts high-risk visual environments and recommends real-time visual adaptations. Using a dual-branch convolutional neural network trained on synthetic and augmented datasets, the system estimates a photosensitivity risk score based on environmental brightness and eye movement variance. The model achieves 75% validation accuracy on synthetic data. Explainability techniques including SHAP and GradCAM are integrated to highlight environmental risk zones, improving clinical trust and model interpretability. The system includes a rule-based recommendation engine for adaptive filter suggestions. Future directions include deployment via smart glasses and reinforcement learning for personalized recommendations.

[115] SuperFlow: Training Flow Matching Models with RL on the Fly

Kaijie Chen, Zhiyang Xu, Ying Shen, Zihao Lin, Yuguang Yao, Lifu Huang

Main category: cs.CV

TL;DR: SuperFlow improves RL training for flow-based generative models by using variance-aware sampling for adaptive group sizes and step-level advantage computation consistent with flow dynamics, achieving better performance with significantly fewer training steps.

Details

Motivation: Current RL training for flow models has two main problems: (1) fixed per-prompt group sizes ignore variation in sampling importance across prompts, leading to inefficient sampling and slower training; (2) trajectory-level advantages are reused as per-step estimates, which biases credit assignment along the flow.

Method: SuperFlow adjusts group sizes with variance-aware sampling and computes step-level advantages in a way that is consistent with continuous-time flow dynamics, without requiring architectural changes to the flow models.

Result: SuperFlow reaches promising performance while using only 5.4% to 56.3% of the original training steps and reduces training time by 5.2% to 16.7%. On T2I tasks including text rendering, compositional image generation, and human preference alignment, SuperFlow improves over SD3.5-M by 4.6% to 47.2%, and over Flow-GRPO by 1.7% to 16.0%.

Conclusion: SuperFlow provides an efficient RL training framework for flow-based models that addresses sampling inefficiency and credit assignment bias, achieving superior performance with significantly reduced computational cost.

Abstract: Recent progress in flow-based generative models and reinforcement learning (RL) has improved text-image alignment and visual quality. However, current RL training for flow models still has two main problems: (i) GRPO-style fixed per-prompt group sizes ignore variation in sampling importance across prompts, which leads to inefficient sampling and slower training; and (ii) trajectory-level advantages are reused as per-step estimates, which biases credit assignment along the flow. We propose SuperFlow, an RL training framework for flow-based models that adjusts group sizes with variance-aware sampling and computes step-level advantages in a way that is consistent with continuous-time flow dynamics. Empirically, SuperFlow reaches promising performance while using only 5.4% to 56.3% of the original training steps and reduces training time by 5.2% to 16.7% without any architectural changes. On standard text-to-image (T2I) tasks, including text rendering, compositional image generation, and human preference alignment, SuperFlow improves over SD3.5-M by 4.6% to 47.2%, and over Flow-GRPO by 1.7% to 16.0%.

[116] Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition

Ellie Zhou, Jihoon Chung, Olga Russakovsky

Main category: cs.CV

TL;DR: The paper analyzes background bias in action recognition models, proposes mitigation strategies, and shows prompt tuning can reduce bias by 9.85%.

Details

Motivation: Human action recognition models often rely on background cues rather than human movement and pose, exhibiting background bias that undermines their true understanding of actions.

Method: Systematic analysis of background bias across classification models, contrastive text-image pretrained models, and Video LLMs. Proposed mitigation strategies include incorporating segmented human input for classification models and exploring manual/automated prompt tuning for VLLMs.

Result: All model types exhibit strong background bias. Segmented human input reduces bias by 3.78% in classification models. Prompt tuning for VLLMs steers predictions toward human-focused reasoning by 9.85%.

Conclusion: Background bias is pervasive across action recognition models, but mitigation strategies like segmented inputs and prompt tuning can effectively reduce bias and improve human-focused reasoning.

Abstract: Human action recognition models often rely on background cues rather than human movement and pose to make predictions, a behavior known as background bias. We present a systematic analysis of background bias across classification models, contrastive text-image pretrained models, and Video Large Language Models (VLLM) and find that all exhibit a strong tendency to default to background reasoning. Next, we propose mitigation strategies for classification models and show that incorporating segmented human input effectively decreases background bias by 3.78%. Finally, we explore manual and automated prompt tuning for VLLMs, demonstrating that prompt design can steer predictions towards human-focused reasoning by 9.85%.

[117] SCS-SupCon: Sigmoid-based Common and Style Supervised Contrastive Learning with Adaptive Decision Boundaries

Bin Wang, Fadi Dornaika

Main category: cs.CV

TL;DR: SCS-SupCon: Sigmoid-based contrastive learning with adaptive decision boundaries and style-content disentanglement for improved fine-grained image classification.

Details

Motivation: Existing supervised contrastive learning methods (like InfoNCE-based approaches) suffer from negative-sample dilution and lack adaptive decision boundaries, limiting their effectiveness for fine-grained recognition where inter-class differences are subtle and intra-class variations are substantial.

Method: Proposes SCS-SupCon framework with: 1) Sigmoid-based pairwise contrastive loss with learnable temperature and bias parameters for adaptive decision boundaries, 2) Emphasis on hard negatives to mitigate negative-sample dilution, 3) Explicit style-distance constraint to disentangle style and content representations.

Result: State-of-the-art performance on six benchmark datasets including CUB200-2011 and Stanford Dogs. On CIFAR-100 with ResNet-50: 3.9% improvement over SupCon and 1.7% over CS-SupCon. On fine-grained datasets: 0.4-3.0% improvement over CS-SupCon. Statistical analyses confirm robustness.

Conclusion: SCS-SupCon effectively addresses limitations of existing contrastive learning methods through adaptive decision boundaries and style-content disentanglement, achieving superior performance in fine-grained image classification tasks across various backbones.

Abstract: Image classification is hindered by subtle inter-class differences and substantial intra-class variations, which limit the effectiveness of existing contrastive learning methods. Supervised contrastive approaches based on the InfoNCE loss suffer from negative-sample dilution and lack adaptive decision boundaries, thereby reducing discriminative power in fine-grained recognition tasks. To address these limitations, we propose Sigmoid-based Common and Style Supervised Contrastive Learning (SCS-SupCon). Our framework introduces a sigmoid-based pairwise contrastive loss with learnable temperature and bias parameters to enable adaptive decision boundaries. This formulation emphasizes hard negatives, mitigates negative-sample dilution, and more effectively exploits supervision. In addition, an explicit style-distance constraint further disentangles style and content representations, leading to more robust feature learning. Comprehensive experiments on six benchmark datasets, including CUB200-2011 and Stanford Dogs, demonstrate that SCS-SupCon achieves state-of-the-art performance across both CNN and Transformer backbones. On CIFAR-100 with ResNet-50, SCS-SupCon improves top-1 accuracy over SupCon by approximately 3.9 percentage points and over CS-SupCon by approximately 1.7 points under five-fold cross-validation. On fine-grained datasets, it outperforms CS-SupCon by 0.4–3.0 points. Extensive ablation studies and statistical analyses further confirm the robustness and generalization of the proposed framework, with Friedman tests and Nemenyi post-hoc evaluations validating the stability of the observed improvements.

[118] PMPGuard: Catching Pseudo-Matched Pairs in Remote Sensing Image-Text Retrieval

Pengxiang Ouyang, Qing Ma, Zheng Wang, Cong Bai

Main category: cs.CV

TL;DR: Novel RS image-text retrieval framework using cross-modal gated attention and positive-negative awareness to handle noisy pseudo-matched pairs in real-world datasets.

Details

Motivation: Real-world remote sensing datasets contain problematic Pseudo-Matched Pairs (PMPs) - semantically mismatched or weakly aligned image-text pairs that hinder reliable cross-modal alignment learning, requiring robust methods to handle such noisy associations.

Method: Proposes a retrieval framework with two key components: 1) Cross-Modal Gated Attention that dynamically regulates information flow between modalities, and 2) Positive-Negative Awareness Attention mechanism that explicitly distinguishes informative positive cues from misleading negative ones during alignment learning.

Result: Extensive experiments on three benchmark RS datasets (RSICD, RSITMD, RS5M) demonstrate state-of-the-art performance, showing robustness and effectiveness in handling real-world mismatches and PMPs.

Conclusion: The proposed framework successfully addresses the challenge of noisy pseudo-matched pairs in RS image-text retrieval through adaptive gating and explicit positive-negative discrimination, achieving superior performance across multiple benchmarks.

Abstract: Remote sensing (RS) image-text retrieval faces significant challenges in real-world datasets due to the presence of Pseudo-Matched Pairs (PMPs), semantically mismatched or weakly aligned image-text pairs, which hinder the learning of reliable cross-modal alignments. To address this issue, we propose a novel retrieval framework that leverages Cross-Modal Gated Attention and a Positive-Negative Awareness Attention mechanism to mitigate the impact of such noisy associations. The gated module dynamically regulates cross-modal information flow, while the awareness mechanism explicitly distinguishes informative (positive) cues from misleading (negative) ones during alignment learning. Extensive experiments on three benchmark RS datasets, i.e., RSICD, RSITMD, and RS5M, demonstrate that our method consistently achieves state-of-the-art performance, highlighting its robustness and effectiveness in handling real-world mismatches and PMPs in RS image-text retrieval tasks.

[119] A Modular Framework for Single-View 3D Reconstruction of Indoor Environments

Yuxiao Li

Main category: cs.CV

TL;DR: A modular framework using diffusion techniques for single-view indoor scene 3D reconstruction that splits the process into two steps: first predicting complete views of room background and occluded instances, then transforming them into 3D.

Details

Motivation: Traditional approaches struggle with complex instance shapes and occlusions in indoor environments, often overshooting by attempting to predict 3D shapes directly from incomplete 2D images, resulting in limited reconstruction quality.

Method: A modular framework with diffusion-powered components: 1) amodal completion module for restoring full views of occluded instances, 2) inpainting model for predicting room layouts, 3) hybrid depth estimation balancing geometric accuracy with detail, and 4) view-space alignment using 2D/3D cues for precise instance placement.

Result: Extensive experiments on 3D-Front dataset show the method outperforms current SOTA approaches in both visual quality and reconstruction accuracy.

Conclusion: The framework effectively reconstructs both foreground instances and room background from a single image and holds promising potential for applications in interior design, real estate, and augmented reality.

Abstract: We propose a modular framework for single-view indoor scene 3D reconstruction, where several core modules are powered by diffusion techniques. Traditional approaches for this task often struggle with the complex instance shapes and occlusions inherent in indoor environments. They frequently overshoot by attempting to predict 3D shapes directly from incomplete 2D images, which results in limited reconstruction quality. We aim to overcome this limitation by splitting the process into two steps: first, we employ diffusion-based techniques to predict the complete views of the room background and occluded indoor instances, then transform them into 3D. Our modular framework makes contributions to this field through the following components: an amodal completion module for restoring the full view of occluded instances, an inpainting model specifically trained to predict room layouts, a hybrid depth estimation technique that balances overall geometric accuracy with fine detail expressiveness, and a view-space alignment method that exploits both 2D and 3D cues to ensure precise placement of instances within the scene. This approach effectively reconstructs both foreground instances and the room background from a single image. Extensive experiments on the 3D-Front dataset demonstrate that our method outperforms current state-of-the-art (SOTA) approaches in terms of both visual quality and reconstruction accuracy. The framework holds promising potential for applications in interior design, real estate, and augmented reality.

[120] Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation

Guangtao Lyu, Chenghao Xu, Qi Liu, Jiexi Yan, Muli Yang, Fen Fang, Cheng Deng

Main category: cs.CV

TL;DR: TempoMoE: A tempo-aware Mixture-of-Experts approach for music-to-3D dance generation that uses hierarchical tempo structuring instead of genre labels to achieve better rhythm alignment and dance quality.

Details

Motivation: Existing music-to-dance generation methods rely on noisy, coarse, or unavailable genre labels, which often lead to rhythm misalignment or stylistic drift. The authors observe that tempo (60-200 BPM) is more consistent across datasets and genres than genre labels, making it a better foundation for rhythm perception.

Method: Proposes TempoMoE: a hierarchical tempo-aware Mixture-of-Experts module that enhances diffusion models. It organizes motion experts into tempo-structured groups for different tempo ranges, with multi-scale beat experts capturing fine- and long-range rhythmic dynamics. Uses Hierarchical Rhythm-Adaptive Routing to dynamically select and fuse experts from music features.

Result: Extensive experiments demonstrate that TempoMoE achieves state-of-the-art results in both dance quality and rhythm alignment compared to existing methods.

Conclusion: TempoMoE enables flexible, rhythm-aligned dance generation without manual genre labels by leveraging tempo as a more consistent and reliable musical property, overcoming limitations of genre-dependent approaches.

Abstract: Music to 3D dance generation aims to synthesize realistic and rhythmically synchronized human dance from music. While existing methods often rely on additional genre labels to further improve dance generation, such labels are typically noisy, coarse, unavailable, or insufficient to capture the diversity of real-world music, which can result in rhythm misalignment or stylistic drift. In contrast, we observe that tempo, a core property reflecting musical rhythm and pace, remains relatively consistent across datasets and genres, typically ranging from 60 to 200 BPM. Based on this finding, we propose TempoMoE, a hierarchical tempo-aware Mixture-of-Experts module that enhances the diffusion model and its rhythm perception. TempoMoE organizes motion experts into tempo-structured groups for different tempo ranges, with multi-scale beat experts capturing fine- and long-range rhythmic dynamics. A Hierarchical Rhythm-Adaptive Routing dynamically selects and fuses experts from music features, enabling flexible, rhythm-aligned generation without manual genre labels. Extensive experiments demonstrate that TempoMoE achieves state-of-the-art results in dance quality and rhythm alignment.

[121] Enhancing Tea Leaf Disease Recognition with Attention Mechanisms and Grad-CAM Visualization

Omar Faruq Shikdar, Fahad Ahammed, B. M. Shahria Alam, Golam Kibria, Tawhidur Rahman, Nishat Tasnim Niloy

Main category: cs.CV

TL;DR: This paper proposes an automated system for tea leaf disease classification using deep learning models with attention mechanisms and ensemble techniques, achieving 85.68% accuracy on a custom dataset.

Details

Motivation: Tea leaf diseases cause significant economic losses in the global tea industry. Manual disease identification is inefficient and unreliable, creating a need for automated systems to enable early detection and minimize damage.

Method: Created a novel dataset of 5278 images across 7 disease classes, pre-processed the data, deployed DenseNet and Inception pretrained models, incorporated attention modules to enhance performance, used EfficientNet only in ensemble models, and applied explainable AI for interpretability.

Result: The ensemble model achieved the highest accuracy of 85.68% in classifying tea leaf diseases, demonstrating the effectiveness of the proposed automated system.

Conclusion: The developed automated system successfully classifies tea leaf diseases with high accuracy, providing farmers with a practical tool for early disease detection and damage minimization in tea production.

Abstract: Tea is among the most widely consumed drinks globally. Tea production is a key industry for many countries. One of the main challenges in tea harvesting is tea leaf diseases. If the spread of tea leaf diseases is not stopped in time, it can lead to massive economic losses for farmers. Therefore, it is crucial to identify tea leaf diseases as soon as possible. Manually identifying tea leaf disease is an ineffective and time-consuming method, without any guarantee of success. Automating this process will improve both the efficiency and the success rate of identifying tea leaf diseases. The purpose of this study is to create an automated system that can classify different kinds of tea leaf diseases, allowing farmers to take action to minimize the damage. A novel dataset was developed specifically for this study. The dataset contains 5278 images across seven classes. The dataset was pre-processed prior to training the model. We deployed three pretrained models: DenseNet, Inception, and EfficientNet. EfficientNet was used only in the ensemble model. We utilized two different attention modules to improve model performance. The ensemble model achieved the highest accuracy of 85.68%. Explainable AI was introduced for better model interpretability.

[122] FedVideoMAE: Efficient Privacy-Preserving Federated Video Moderation

Ziyuan Tao, Chuanzhi Xu, Sandaru Jayawardana, Wei Bao, Kanchana Thilakarathna, Teng Joon Lim

Main category: cs.CV

TL;DR: On-device federated learning framework for video violence detection using VideoMAE representations and LoRA adaptation with privacy protection, achieving 65-66% accuracy under differential privacy while reducing communication costs 28.3x.

Details

Motivation: Short-form video platforms need privacy-preserving moderation because cloud-based pipelines expose raw videos to privacy risks, high bandwidth costs, and inference latency.

Method: On-device federated learning framework integrating self-supervised VideoMAE representations, LoRA-based parameter-efficient adaptation, and defense-in-depth privacy protection with DP-SGD and secure aggregation.

Result: Reduces trainable parameters to 5.5M (~3.5% of 156M backbone), achieves 77.25% accuracy without privacy protection and 65-66% under strong differential privacy, with 28.3x communication cost reduction compared to full-model FL.

Conclusion: The proposed framework enables effective privacy-preserving video violence detection on edge devices with significant efficiency gains, addressing key challenges in short-form video moderation.

Abstract: The rapid growth of short-form video platforms increases the need for privacy-preserving moderation, as cloud-based pipelines expose raw videos to privacy risks, high bandwidth costs, and inference latency. To address these challenges, we propose an on-device federated learning framework for video violence detection that integrates self-supervised VideoMAE representations, LoRA-based parameter-efficient adaptation, and defense-in-depth privacy protection. Our approach reduces the trainable parameter count to 5.5M (~3.5% of a 156M backbone) and incorporates DP-SGD with configurable privacy budgets and secure aggregation. Experiments on RWF-2000 with 40 clients achieve 77.25% accuracy without privacy protection and 65-66% under strong differential privacy, while reducing communication cost by $28.3\times$ compared to full-model federated learning. The code is available at: {https://github.com/zyt-599/FedVideoMAE}

[123] Name That Part: 3D Part Segmentation and Naming

Soumava Paul, Prakhar Kaushik, Ankit Vaidya, Anand Bhattad, Alan Yuille

Main category: cs.CV

TL;DR: ALIGN-Parts: A method for semantic 3D part segmentation that aligns 3D shapes with part descriptions via bipartite assignment, combining geometric, appearance, and semantic cues for open-vocabulary part naming.

Details

Motivation: Existing part segmentation datasets have inconsistent definitions across datasets, limiting robust training. Previous methods produce unlabeled decompositions or retrieve single parts without complete shape annotations.

Method: Formulates part naming as direct set alignment. Decomposes shapes into partlets (implicit 3D part representations) matched to part descriptions via bipartite assignment. Combines geometric cues from 3D part fields, appearance from multi-view vision features, and semantic knowledge from language-model-generated affordance descriptions. Uses text-alignment loss to share embedding space with text.

Result: Creates unified ontology aligning PartNet, 3DCoMPaT++, and Find3D with 1,794 unique 3D parts. Supports zero-shot matching to arbitrary descriptions and confidence-calibrated predictions. Introduces 2 novel metrics for named 3D part segmentation. Shows examples from new Tex-Parts dataset.

Conclusion: ALIGN-Parts provides efficient one-shot 3D part segmentation and naming with applications in downstream tasks and scalable annotation. Enables open-vocabulary matching setup and creates unified part ontology across major datasets.

Abstract: We address semantic 3D part segmentation: decomposing objects into parts with meaningful names. While datasets exist with part annotations, their definitions are inconsistent across datasets, limiting robust training. Previous methods produce unlabeled decompositions or retrieve single parts without complete shape annotations. We propose ALIGN-Parts, which formulates part naming as a direct set alignment task. Our method decomposes shapes into partlets - implicit 3D part representations - matched to part descriptions via bipartite assignment. We combine geometric cues from 3D part fields, appearance from multi-view vision features, and semantic knowledge from language-model-generated affordance descriptions. Text-alignment loss ensures partlets share embedding space with text, enabling a theoretically open-vocabulary matching setup, given sufficient data. Our efficient and novel, one-shot, 3D part segmentation and naming method finds applications in several downstream tasks, including serving as a scalable annotation engine. As our model supports zero-shot matching to arbitrary descriptions and confidence-calibrated predictions for known categories, with human verification, we create a unified ontology that aligns PartNet, 3DCoMPaT++, and Find3D, consisting of 1,794 unique 3D parts. We also show examples from our newly created Tex-Parts dataset. We also introduce 2 novel metrics appropriate for the named 3D part segmentation task.

Alina Elena Baia, Andrea Cavallaro

Main category: cs.CV

TL;DR: DeX is a training-free interpretability framework that uses cross-modal decomposition and image-specific concepts to generate natural language counterfactual explanations for image privacy decisions, quantifying scene element contributions and identifying dataset biases.

Details

Motivation: Existing concept-driven counterfactuals need better methods to explain subjective, contextual decisions like image privacy classifications. There's a need for interpretability frameworks that can quantify contributions of scene elements and identify dataset biases while maintaining flexibility without retraining.

Method: DeX leverages cross-modal decompositionality and image-specific concepts to create natural language counterfactual scenarios. It uses a multi-criterion selection mechanism considering image similarity (for minimal perturbations) and decision confidence (for impactful changes) to identify relevant decision factors. The framework evaluates diverse explanations and assesses interdependencies among explanatory properties.

Result: DeX generates image-grounded, sparse explanations with significant improvements over state-of-the-art methods. It successfully uncovers principal contributing factors influencing subjective decisions and identifies underlying dataset biases, enabling targeted mitigation strategies to improve fairness.

Conclusion: DeX provides an effective training-free framework for explaining subjective decisions in image privacy classification through natural language counterfactuals, offering both interpretability insights and bias detection capabilities for improved fairness.

Abstract: Concept-driven counterfactuals explain decisions of classifiers by altering the model predictions through semantic changes. In this paper, we present a novel approach that leverages cross-modal decompositionality and image-specific concepts to create counterfactual scenarios expressed in natural language. We apply the proposed interpretability framework, termed Decompose and Explain (DeX), to the challenging domain of image privacy decisions, which are contextual and subjective. This application enables the quantification of the differential contributions of key scene elements to the model prediction. We identify relevant decision factors via a multi-criterion selection mechanism that considers both image similarity for minimal perturbations and decision confidence to prioritize impactful changes. This approach evaluates and compares diverse explanations, and assesses the interdependency and mutual influence among explanatory properties. By leveraging image-specific concepts, DeX generates image-grounded, sparse explanations, yielding significant improvements over the state of the art. Importantly, DeX operates as a training-free framework, offering high flexibility. Results show that DeX not only uncovers the principal contributing factors influencing subjective decisions, but also identifies underlying dataset biases allowing for targeted mitigation strategies to improve fairness.

[125] Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models

Shubham Kumar Nigam, Parjanya Aditya Shukla, Noel Shallum, Arnab Bhattacharya

Main category: cs.CV

TL;DR: This paper compares traditional OCR+MT pipelines with Vision LLMs for end-to-end translation of handwritten Marathi legal documents, aiming to improve access to legal information in low-resource settings.

Details

Motivation: There's an urgent need for scalable, accurate translation systems to digitize legal records (FIRs, charge sheets, witness statements) in India's district and high courts, especially for low-resource languages like Marathi that lack large digitized corpora and have high handwriting variability.

Method: The paper compares two approaches: 1) Traditional two-stage pipeline (OCR system extracts text from handwritten images, then machine translation model translates it), and 2) Vision Large Language Models that unify both stages into a single end-to-end step for direct translation of handwritten text images.

Result: The evaluation is conducted on a curated dataset of handwritten Marathi legal documents, with findings offering actionable insights toward building robust, edge-deployable solutions.

Conclusion: The research aims to enable efficient legal document processing in low-resource environments and enhance access to legal information for non-native speakers and legal professionals alike.

Abstract: Handwritten text recognition (HTR) and machine translation continue to pose significant challenges, particularly for low-resource languages like Marathi, which lack large digitized corpora and exhibit high variability in handwriting styles. The conventional approach to address this involves a two-stage pipeline: an OCR system extracts text from handwritten images, which is then translated into the target language using a machine translation model. In this work, we explore and compare the performance of traditional OCR-MT pipelines with Vision Large Language Models that aim to unify these stages and directly translate handwritten text images in a single, end-to-end step. Our motivation is grounded in the urgent need for scalable, accurate translation systems to digitize legal records such as FIRs, charge sheets, and witness statements in India’s district and high courts. We evaluate both approaches on a curated dataset of handwritten Marathi legal documents, with the goal of enabling efficient legal document processing, even in low-resource environments. Our findings offer actionable insights toward building robust, edge-deployable solutions that enhance access to legal information for non-native speakers and legal professionals alike.

[126] E-RGB-D: Real-Time Event-Based Perception with Structured Light

Seyed Ehsan Marjani Bajestani, Giovanni Beltrame

Main category: cs.CV

TL;DR: Novel RGB-D sensing system combining event cameras with DLP projector for high-speed color and depth detection (1400 fps color, 4 kHz depth).

Details

Motivation: Traditional monochrome event cameras have limitations: can't detect static/slow-moving objects and lack color information needed for many applications. Need to overcome these limitations while maintaining event cameras' advantages (high speed, low power, high dynamic range).

Method: Integrates Digital Light Processing (DLP) projector with monochrome event camera to create Active Structured Light system. Uses dynamic projection adjustments to optimize bandwidth and selectively acquire color data. Commercial TI LightCrafter 4500 projector paired with monocular monochrome event camera.

Result: Achieves color detection at 1400 fps equivalent and pixel depth detection at 4 kHz. Produces colorful point clouds without sacrificing spatial resolution. Enables frameless RGB-D sensing applications.

Conclusion: The integration of event cameras with projection-based techniques successfully addresses limitations of traditional event cameras, enabling high-speed RGB-D sensing with applications in robotics, 3D reconstruction, and computer vision.

Abstract: Event-based cameras (ECs) have emerged as bio-inspired sensors that report pixel brightness changes asynchronously, offering unmatched speed and efficiency in vision sensing. Despite their high dynamic range, temporal resolution, low power consumption, and computational simplicity, traditional monochrome ECs face limitations in detecting static or slowly moving objects and lack color information essential for certain applications. To address these challenges, we present a novel approach that integrates a Digital Light Processing (DLP) projector, forming Active Structured Light (ASL) for RGB-D sensing. By combining the benefits of ECs and projection-based techniques, our method enables the detection of color and the depth of each pixel separately. Dynamic projection adjustments optimize bandwidth, ensuring selective color data acquisition and yielding colorful point clouds without sacrificing spatial resolution. This integration, facilitated by a commercial TI LightCrafter 4500 projector and a monocular monochrome EC, not only enables frameless RGB-D sensing applications but also achieves remarkable performance milestones. With our approach, we achieved a color detection speed equivalent to 1400 fps and 4 kHz of pixel depth detection, significantly advancing the realm of computer vision across diverse fields from robotics to 3D reconstruction methods. Our code is publicly available: https://github.com/MISTLab/event_based_rgbd_ros

[127] NodMAISI: Nodule-Oriented Medical AI for Synthetic Imaging

Fakrul Islam Tushar, Ehsan Samei, Cynthia Rudin, Joseph Y. Lo

Main category: cs.CV

TL;DR: NodMAISI is a CT synthesis framework that generates anatomically-consistent lung nodule images to address data scarcity in lung cancer screening, improving lesion detectability and downstream malignancy classification performance.

Details

Motivation: Medical imaging datasets lack sufficient abnormal findings like small pulmonary nodules, which are critical for lung cancer screening but are underrepresented and inconsistently curated in existing datasets.

Method: A three-part framework: (1) standardized curation pipeline linking CTs with organ masks and nodule annotations, (2) ControlNet-conditioned rectified-flow generator built on MAISI-v2 to enforce anatomy- and lesion-consistent synthesis, and (3) lesion-aware augmentation that perturbs nodule masks while preserving surrounding anatomy.

Result: Improved distributional fidelity (FID 1.18-2.99 vs 1.69-5.21), substantially increased lesion detectability especially for sub-centimeter nodules, and improved downstream malignancy classification AUC by 0.07-0.21 under data scarcity conditions.

Conclusion: NodMAISI effectively addresses data scarcity in lung nodule analysis by generating anatomically-consistent synthetic CTs, improving both lesion detection and malignancy classification performance, particularly for challenging small nodules.

Abstract: Objective: Although medical imaging datasets are increasingly available, abnormal and annotation-intensive findings critical to lung cancer screening, particularly small pulmonary nodules, remain underrepresented and inconsistently curated. Methods: We introduce NodMAISI, an anatomically constrained, nodule-oriented CT synthesis and augmentation framework trained on a unified multi-source cohort (7,042 patients, 8,841 CTs, 14,444 nodules). The framework integrates: (i) a standardized curation and annotation pipeline linking each CT with organ masks and nodule-level annotations, (ii) a ControlNet-conditioned rectified-flow generator built on MAISI-v2’s foundational blocks to enforce anatomy- and lesion-consistent synthesis, and (iii) lesion-aware augmentation that perturbs nodule masks (controlled shrinkage) while preserving surrounding anatomy to generate paired CT variants. Results: Across six public test datasets, NodMAISI improved distributional fidelity relative to MAISI-v2 (real-to-synthetic FID range 1.18 to 2.99 vs 1.69 to 5.21). In lesion detectability analysis using a MONAI nodule detector, NodMAISI substantially increased average sensitivity and more closely matched clinical scans (IMD-CT: 0.69 vs 0.39; DLCS24: 0.63 vs 0.20), with the largest gains for sub-centimeter nodules where MAISI-v2 frequently failed to reproduce the conditioned lesion. In downstream nodule-level malignancy classification trained on LUNA25 and externally evaluated on LUNA16, LNDbv4, and DLCS24, NodMAISI augmentation improved AUC by 0.07 to 0.11 at <=20% clinical data and by 0.12 to 0.21 at 10%, consistently narrowing the performance gap under data scarcity.

[128] YolovN-CBi: A Lightweight and Efficient Architecture for Real-Time Detection of Small UAVs

Ami Pandat, Punna Rajasekhar, Gopika Vinod, Rohit Shukla

Main category: cs.CV

TL;DR: Proposed Yolov5-CBi architecture with CBAM and BiFPN improves small drone detection, outperforms newer Yolo versions in speed-accuracy trade-off, and distilled lightweight models achieve 6.51% improvement with 82.9% faster inference.

Details

Motivation: Drones pose increasing risks in civilian and defense settings, requiring accurate real-time detection systems. Detection is challenging due to small size, rapid movement, and low visual contrast of drones.

Method: Modified Yolov5 architecture called Yolov5-CBi incorporates Convolutional Block Attention Module (CBAM) and Bidirectional Feature Pyramid Network (BiFPN) to improve small object detection sensitivity. Created curated 28K image training dataset and local 2500-image test dataset. Proposed four CBi architecture variants and applied knowledge distillation using Yolov5m-CBi teacher and Yolov5n-CBi student.

Result: Yolov5-CBi outperforms newer Yolo versions (Yolov8, Yolov12) in speed-accuracy trade-off for small object detection. Distilled model achieved mA@P0.5:0.9 of 0.6573 (6.51% improvement over teacher’s 0.6171). Distilled model is 82.9% faster than baseline, suitable for real-time drone detection.

Conclusion: The proposed CBi architecture with CBAM and BiFPN, combined with knowledge distillation, effectively advances efficient and accurate real-time detection of small UAVs, addressing the challenges of drone detection in civilian and defense applications.

Abstract: Unmanned Aerial Vehicles, commonly known as, drones pose increasing risks in civilian and defense settings, demanding accurate and real-time drone detection systems. However, detecting drones is challenging because of their small size, rapid movement, and low visual contrast. A modified architecture of YolovN called the YolovN-CBi is proposed that incorporates the Convolutional Block Attention Module (CBAM) and the Bidirectional Feature Pyramid Network (BiFPN) to improve sensitivity to small object detections. A curated training dataset consisting of 28K images is created with various flying objects and a local test dataset is collected with 2500 images consisting of very small drone objects. The proposed architecture is evaluated on four benchmark datasets, along with the local test dataset. The baseline Yolov5 and the proposed Yolov5-CBi architecture outperform newer Yolo versions, including Yolov8 and Yolov12, in the speed-accuracy trade-off for small object detection. Four other variants of the proposed CBi architecture are also proposed and evaluated, which vary in the placement and usage of CBAM and BiFPN. These variants are further distilled using knowledge distillation techniques for edge deployment, using a Yolov5m-CBi teacher and a Yolov5n-CBi student. The distilled model achieved a mA@P0.5:0.9 of 0.6573, representing a 6.51% improvement over the teacher’s score of 0.6171, highlighting the effectiveness of the distillation process. The distilled model is 82.9% faster than the baseline model, making it more suitable for real-time drone detection. These findings highlight the effectiveness of the proposed CBi architecture, together with the distilled lightweight models in advancing efficient and accurate real-time detection of small UAVs.

[129] Neural Implicit Heart Coordinates: 3D cardiac shape reconstruction from sparse segmentations

Marica Muffoletto, Uxio Hermida, Charlène Mauger, Avan Suinesiaputra, Yiyang Xu, Richard Burns, Lisa Pankewitz, Andrew D McCulloch, Steffen E Petersen, Daniel Rueckert, Alistair A Young

Main category: cs.CV

TL;DR: NIHCs (Neural Implicit Heart Coordinates) create a standardized implicit coordinate system for cardiac anatomy, enabling accurate 3D reconstruction from sparse 2D segmentations with fast inference and robust performance.

Details

Motivation: Accurate reconstruction of cardiac anatomy from sparse clinical images is challenging. While neural implicit functions have been applied, their use for mapping anatomical consistency across subjects has been limited. There's a need for a common anatomical reference frame for the human heart that works with minimal input data.

Method: Introduces Neural Implicit Heart Coordinates (NIHCs), a standardized implicit coordinate system based on universal ventricular coordinates. The method predicts NIHCs directly from limited 2D segmentations (sparse acquisition) and decodes them into dense 3D segmentations and high-resolution meshes at arbitrary output resolution. Trained on 5,000 cardiac meshes.

Result: Achieves high reconstruction accuracy with mean Euclidean surface errors of 2.51±0.33 mm in diseased cohort (n=4549) and 2.3±0.36 mm in healthy cohort (n=5576). Enables anatomically coherent reconstruction under severe slice sparsity and segmentation noise, recovering complex structures like valve planes. Reduces inference time from over 60s to 5-15s compared to traditional pipelines.

Conclusion: NIHCs constitute a robust and efficient anatomical representation for patient-specific 3D cardiac reconstruction from minimal input data, providing a common anatomical reference frame that enables accurate reconstruction with fast inference.

Abstract: Accurate reconstruction of cardiac anatomy from sparse clinical images remains a major challenge in patient-specific modeling. While neural implicit functions have previously been applied to this task, their application to mapping anatomical consistency across subjects has been limited. In this work, we introduce Neural Implicit Heart Coordinates (NIHCs), a standardized implicit coordinate system, based on universal ventricular coordinates, that provides a common anatomical reference frame for the human heart. Our method predicts NIHCs directly from a limited number of 2D segmentations (sparse acquisition) and subsequently decodes them into dense 3D segmentations and high-resolution meshes at arbitrary output resolution. Trained on a large dataset of 5,000 cardiac meshes, the model achieves high reconstruction accuracy on clinical contours, with mean Euclidean surface errors of 2.51$\pm$0.33 mm in a diseased cohort (n=4549) and 2.3$\pm$0.36 mm in a healthy cohort (n=5576). The NIHC representation enables anatomically coherent reconstruction even under severe slice sparsity and segmentation noise, faithfully recovering complex structures such as the valve planes. Compared with traditional pipelines, inference time is reduced from over 60 s to 5-15 s. These results demonstrate that NIHCs constitute a robust and efficient anatomical representation for patient-specific 3D cardiac reconstruction from minimal input data.

[130] FOODER: Real-time Facial Authentication and Expression Recognition

Sabri Mustafa Kahya, Muhammet Sami Yavuz, Boran Hamdi Sivrikaya, Eckehard Steinbach

Main category: cs.CV

TL;DR: FOODER is a radar-based framework combining facial authentication (via OOD detection) with expression recognition, using FMCW radar data to preserve privacy while achieving high performance.

Details

Motivation: Safe deployment of neural networks requires OOD detection to identify samples outside training domain. There's a need for privacy-preserving facial authentication and expression recognition that doesn't rely on cameras.

Method: Uses low-cost FMCW radar with range-Doppler and micro range-Doppler representations. Authentication uses multi-encoder multi-decoder architecture with BP and ILED components. Expression recognition uses ResNet block to categorize dynamic vs static expressions, then specialized MobileViT networks for classification.

Result: Achieves 94.13% AUROC and 18.12% FPR95 for authentication, with 94.70% average expression recognition accuracy. Outperforms state-of-the-art OOD detection methods and transformer-based architectures while operating in real-time.

Conclusion: FOODER provides an effective privacy-preserving solution for facial authentication and expression recognition using radar data, demonstrating superior performance over existing methods while maintaining real-time operation.

Abstract: Out-of-distribution (OOD) detection is essential for the safe deployment of neural networks, as it enables the identification of samples outside the training domain. We present FOODER, a real-time, privacy-preserving radar-based framework that integrates OOD-based facial authentication with facial expression recognition. FOODER operates using low-cost frequency-modulated continuous-wave (FMCW) radar and exploits both range-Doppler and micro range-Doppler representations. The authentication module employs a multi-encoder multi-decoder architecture with Body Part (BP) and Intermediate Linear Encoder-Decoder (ILED) components to classify a single enrolled individual as in-distribution while detecting all other faces as OOD. Upon successful authentication, an expression recognition module is activated. Concatenated radar representations are processed by a ResNet block to distinguish between dynamic and static facial expressions. Based on this categorization, two specialized MobileViT networks are used to classify dynamic expressions (smile, shock) and static expressions (neutral, anger). This hierarchical design enables robust facial authentication and fine-grained expression recognition while preserving user privacy by relying exclusively on radar data. Experiments conducted on a dataset collected with a 60 GHz short-range FMCW radar demonstrate that FOODER achieves an AUROC of 94.13% and an FPR95 of 18.12% for authentication, along with an average expression recognition accuracy of 94.70%. FOODER outperforms state-of-the-art OOD detection methods and several transformer-based architectures while operating efficiently in real time.

[131] FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis

Ekta Balkrishna Gavas, Sudipta Banerjee, Chinmay Hegde, Nasir Memon

Main category: cs.CV

TL;DR: FPBench is the first comprehensive benchmark evaluating 20 MLLMs on fingerprint understanding across 7 datasets and 8 biometric/forensic tasks using zero-shot and chain-of-thought prompting.

Details

Motivation: While MLLMs have been used for iris and face image analysis, their capabilities in fingerprint understanding remain unexplored. There's a need to systematically evaluate MLLMs' performance on fingerprint-related tasks to advance foundation models for fingerprints.

Method: Created FPBench benchmark evaluating 20 MLLMs (open-source and proprietary) across 7 real and synthetic datasets on 8 biometric and forensic tasks. Used zero-shot and chain-of-thought prompting strategies for comprehensive assessment.

Result: The paper presents findings on MLLM performance, explainability, and insights into challenges and limitations of using MLLMs for fingerprint understanding. FPBench establishes the first comprehensive benchmark for this domain.

Conclusion: FPBench paves the path for foundation models for fingerprints by providing the first comprehensive benchmark for evaluating MLLMs’ fingerprint understanding capabilities, addressing a previously unexplored area in multimodal AI research.

Abstract: Multimodal LLMs (MLLMs) have gained significant traction in complex data analysis, visual question answering, generation, and reasoning. Recently, they have been used for analyzing the biometric utility of iris and face images. However, their capabilities in fingerprint understanding are yet unexplored. In this work, we design a comprehensive benchmark, \textsc{FPBench} that evaluates the performance of 20 MLLMs (open-source and proprietary) across 7 real and synthetic datasets on 8 biometric and forensic tasks using zero-shot and chain-of-thought prompting strategies. We discuss our findings in terms of performance, explainability and share our insights into the challenges and limitations. We establish \textsc{FPBench} as the first comprehensive benchmark for fingerprint domain understanding with MLLMs paving the path for foundation models for fingerprints.

[132] Uncertainty-Gated Region-Level Retrieval for Robust Semantic Segmentation

Shreshth Rajan, Raymond Liu

Main category: cs.CV

TL;DR: Proposes uncertainty-gated retrieval mechanism for semantic segmentation that improves accuracy under domain shift while reducing computational cost by 87.5%.

Details

Motivation: Semantic segmentation for outdoor street scenes is crucial for autonomous driving, robotics, and assistive technology, requiring robustness to different environments, lighting, weather conditions, and sensor noise while maintaining real-time performance.

Method: Region-level, uncertainty-gated retrieval mechanism that selectively retrieves information only for uncertain regions rather than processing all regions, improving segmentation accuracy and calibration under domain shift.

Result: Achieves 11.3% increase in mean intersection-over-union while reducing retrieval cost by 87.5%, retrieving for only 12.5% of regions compared to 100% for always-on baseline.

Conclusion: The proposed uncertainty-gated retrieval approach effectively balances accuracy and computational efficiency for semantic segmentation in challenging outdoor environments with domain shift.

Abstract: Semantic segmentation of outdoor street scenes plays a key role in applications such as autonomous driving, mobile robotics, and assistive technology for visually-impaired pedestrians. For these applications, accurately distinguishing between key surfaces and objects such as roads, sidewalks, vehicles, and pedestrians is essential for maintaining safety and minimizing risks. Semantic segmentation must be robust to different environments, lighting and weather conditions, and sensor noise, while being performed in real-time. We propose a region-level, uncertainty-gated retrieval mechanism that improves segmentation accuracy and calibration under domain shift. Our best method achieves an 11.3% increase in mean intersection-over-union while reducing retrieval cost by 87.5%, retrieving for only 12.5% of regions compared to 100% for always-on baseline.

[133] SERA-H: Beyond Native Sentinel Spatial Limits for High-Resolution Canopy Height Mapping

Thomas Boudras, Martin Schwartz, Rasmus Fensholt, Martin Brandt, Ibrahim Fayad, Jean-Pierre Wigneron, Gabriel Belouze, Fajwel Fogel, Philippe Ciais

Main category: cs.CV

TL;DR: SERA-H is an end-to-end deep learning model that generates high-resolution (2.5m) canopy height maps from freely available Sentinel-1/2 time series data (10m), achieving performance comparable to commercial very high-resolution imagery.

Details

Motivation: There's a trade-off between data accessibility and spatial resolution in current canopy height mapping methods. Existing deep learning approaches using satellite imagery often face limitations in balancing freely available data with high-resolution outputs needed for effective forest management and biodiversity monitoring.

Method: SERA-H combines a super-resolution module (EDSR) with temporal attention encoding (UTAE) in an end-to-end model. It’s trained under supervision of high-density LiDAR data (ALS) to generate 2.5m resolution height maps from Sentinel-1 and Sentinel-2 time series data (10m resolution).

Result: On an open-source benchmark dataset in France, SERA-H achieved MAE of 2.6m and R² of 0.82. It outperforms standard Sentinel-1/2 baselines and achieves comparable or better performance than methods using commercial very high-resolution imagery (SPOT-6/7, PlanetScope, Maxar).

Conclusion: Combining high-resolution supervision with spatiotemporal information from time series enables reconstruction of details beyond input sensors’ native resolution. SERA-H enables free, high-frequency forest mapping with accuracy comparable to costly commercial imagery, making high-resolution canopy height mapping more accessible.

Abstract: High-resolution mapping of canopy height is essential for forest management and biodiversity monitoring. Although recent studies have led to the advent of deep learning methods using satellite imagery to predict height maps, these approaches often face a trade-off between data accessibility and spatial resolution. To overcome these limitations, we present SERA-H, an end-to-end model combining a super-resolution module (EDSR) and temporal attention encoding (UTAE). Trained under the supervision of high-density LiDAR data (ALS), our model generates 2.5 m resolution height maps from freely available Sentinel-1 and Sentinel-2 (10 m) time series data. Evaluated on an open-source benchmark dataset in France, SERA-H, with a MAE of 2.6 m and a coefficient of determination of 0.82, not only outperforms standard Sentinel-1/2 baselines but also achieves performance comparable to or better than methods relying on commercial very high-resolution imagery (SPOT-6/7, PlanetScope, Maxar). These results demonstrate that combining high-resolution supervision with the spatiotemporal information embedded in time series enables the reconstruction of details beyond the input sensors’ native resolution. SERA-H opens the possibility of freely mapping forests with high revisit frequency, achieving accuracy comparable to that of costly commercial imagery. The source code is available at https://github.com/ThomasBoudras/SERA-H#

[134] EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams

Hao Li, Daiwei Lu, Jiacheng Wang, Robert J. Webster, Ipek Oguz

Main category: cs.CV

TL;DR: EndoStreamDepth is a real-time monocular depth estimation framework for endoscopic videos that produces accurate depth maps with sharp anatomical boundaries and temporal consistency across frames.

Details

Motivation: Existing depth estimation methods for endoscopic videos often lack real-time processing, produce blurry anatomical boundaries, and fail to maintain temporal consistency across frames, which is crucial for downstream tasks like robotic surgery automation.

Method: The framework has three components: 1) Single-frame depth network with endoscopy-specific transformation, 2) Multi-level Mamba temporal modules for inter-frame information propagation, and 3) Hierarchical design with multi-scale supervision using complementary loss terms for boundary sharpness and geometric consistency.

Result: Substantial performance improvement over state-of-the-art methods on two colonoscopy depth datasets, producing depth maps with sharp anatomically-aligned boundaries and real-time throughput.

Conclusion: EndoStreamDepth provides an effective solution for real-time depth estimation in endoscopic videos with accurate boundaries and temporal consistency, supporting downstream surgical automation tasks.

Abstract: This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput. Unlike prior work that uses batched inputs, EndoStreamDepth processes individual frames with a temporal module to propagate inter-frame information. The framework contains three main components: (1) a single-frame depth network with endoscopy-specific transformation to produce accurate depth maps, (2) multi-level Mamba temporal modules that leverage inter-frame information to improve accuracy and stabilize predictions, and (3) a hierarchical design with comprehensive multi-scale supervision, where complementary loss terms jointly improve local boundary sharpness and global geometric consistency. We conduct comprehensive evaluations on two publicly available colonoscopy depth estimation datasets. Compared to state-of-the-art monocular depth estimation methods, EndoStreamDepth substantially improves performance, and it produces depth maps with sharp, anatomically aligned boundaries, which are essential to support downstream tasks such as automation for robotic surgery. The code is publicly available at https://github.com/MedICL-VU/EndoStreamDepth

[135] Local Patches Meet Global Context: Scalable 3D Diffusion Priors for Computed Tomography Reconstruction

Taewon Yang, Jason Hu, Jeffrey A. Fessler, Liyue Shen

Main category: cs.CV

TL;DR: Proposes a 3D patch-based diffusion model for efficient high-resolution 3D CT reconstruction using limited data, outperforming SOTA methods in both performance and efficiency.

Details

Motivation: Direct training of diffusion models on 3D data is computationally expensive and requires large datasets. Existing 2D diffusion priors fail to fully leverage generative capacity for 3D inverse problems like medical image reconstruction.

Method: A novel 3D patch-based diffusion model that learns 3D diffusion priors from limited data by modeling joint distribution of position-aware 3D local patches and downsampled 3D volume as global context, coupling local and global information.

Result: Outperforms state-of-the-art methods in 3D CT reconstruction across multiple datasets, achieving high-resolution 3D reconstruction of 512×512×256 in ~20 minutes with both superior performance and efficiency.

Conclusion: The proposed 3D patch-based diffusion model enables scalable high-resolution 3D generation and provides an efficient, accurate solution to 3D inverse problems in medical imaging, overcoming computational limitations of traditional 3D diffusion models.

Abstract: Diffusion models learn strong image priors that can be leveraged to solve inverse problems like medical image reconstruction. However, for real-world applications such as 3D Computed Tomography (CT) imaging, directly training diffusion models on 3D data presents significant challenges due to the high computational demands of extensive GPU resources and large-scale datasets. Existing works mostly reuse 2D diffusion priors to address 3D inverse problems, but fail to fully realize and leverage the generative capacity of diffusion models for high-dimensional data. In this study, we propose a novel 3D patch-based diffusion model that can learn a fully 3D diffusion prior from limited data, enabling scalable generation of high-resolution 3D images. Our core idea is to learn the prior of 3D patches to achieve scalable efficiency, while coupling local and global information to guarantee high-quality 3D image generation, by modeling the joint distribution of position-aware 3D local patches and downsampled 3D volume as global context. Our approach not only enables high-quality 3D generation, but also offers an unprecedentedly efficient and accurate solution to high-resolution 3D inverse problems. Experiments on 3D CT reconstruction across multiple datasets show that our method outperforms state-of-the-art methods in both performance and efficiency, notably achieving high-resolution 3D reconstruction of $512 \times 512 \times 256$ ($\sim$20 mins).

[136] Atlas is Your Perfect Context: One-Shot Customization for Generalizable Foundational Medical Image Segmentation

Ziyu Zhang, Yi Yu, Simeng Zhu, Ahmed Aly, Yunhe Gao, Ning Gu, Yuan Xue

Main category: cs.CV

TL;DR: AtlasSegFM: A one-shot atlas-guided framework that customizes foundation models for medical image segmentation using a single annotated example, improving performance on underrepresented clinical contexts.

Details

Motivation: Current interactive foundation models for medical image segmentation require precise prompts and underperform on underrepresented clinical contexts in their training data, limiting their real-world clinical utility.

Method: Two core innovations: 1) A pipeline providing context-aware prompts via registration between a context atlas and query images, and 2) A test-time adapter to fuse predictions from both atlas registration and the foundation model.

Result: Extensive experiments across public and in-house datasets spanning multiple modalities and organs show consistent segmentation improvement, particularly for small, delicate structures.

Conclusion: AtlasSegFM provides a lightweight, deployable solution for one-shot customization of foundation models in real-world clinical workflows, addressing the limitations of existing interactive foundation models.

Abstract: Accurate medical image segmentation is essential for clinical diagnosis and treatment planning. While recent interactive foundation models (e.g., nnInteractive) enhance generalization through large-scale multimodal pretraining, they still depend on precise prompts and often perform below expectations in contexts that are underrepresented in their training data. We present AtlasSegFM, an atlas-guided framework that customizes available foundation models to clinical contexts with a single annotated example. The core innovations are: 1) a pipeline that provides context-aware prompts for foundation models via registration between a context atlas and query images, and 2) a test-time adapter to fuse predictions from both atlas registration and the foundation model. Extensive experiments across public and in-house datasets spanning multiple modalities and organs demonstrate that AtlasSegFM consistently improves segmentation, particularly for small, delicate structures. AtlasSegFM provides a lightweight, deployable solution one-shot customization of foundation models in real-world clinical workflows. The code will be made publicly available.

[137] MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

Kaixing Yang, Jiashu Zhu, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jiahong Wu, Xiangxiang Chu, Hongyan Liu, Jun He

Main category: cs.CV

TL;DR: MACE-Dance is a music-driven dance video generation framework using cascaded Mixture-of-Experts (MoE) that achieves SOTA performance by separating motion generation and appearance synthesis.

Details

Motivation: Existing methods in music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis cannot be directly adapted to music-driven dance video generation. Current approaches struggle to jointly achieve high-quality visual appearance and realistic human motion.

Method: Two-expert cascaded framework: Motion Expert performs music-to-3D motion generation using diffusion model with BiMamba-Transformer hybrid architecture and Guidance-Free Training (GFT). Appearance Expert performs motion- and reference-conditioned video synthesis using decoupled kinematic-aesthetic fine-tuning strategy. Also created large-scale dataset and motion-appearance evaluation protocol.

Result: Achieves state-of-the-art performance in both 3D dance generation (Motion Expert) and pose-driven image animation (Appearance Expert). Also achieves SOTA performance on the new benchmark using the proposed motion-appearance evaluation protocol.

Conclusion: MACE-Dance successfully addresses the challenge of jointly achieving high-quality visual appearance and realistic human motion in music-driven dance video generation through its cascaded MoE architecture and specialized expert components.

Abstract: With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the limited studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving state-of-the-art (SOTA) performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Based on this protocol, MACE-Dance also achieves state-of-the-art performance. Project page: https://macedance.github.io/

[138] Is There a Better Source Distribution than Gaussian? Exploring Source Distributions for Image Flow Matching

Junho Lee, Kwanseok Kim, Joonseok Lee

Main category: cs.CV

TL;DR: Flow matching performance depends on source distribution properties; Gaussian works well due to omnidirectional coverage, but norm alignment and directional pruning improve generation quality and efficiency without retraining.

Details

Motivation: To understand why Gaussian distributions work well as source distributions in flow matching and explore better alternatives for high-dimensional data generation, analyzing the learning dynamics and geometric properties of flow matching.

Method: Developed a 2D simulation to analyze flow matching learning dynamics, derived insights about density approximation, directional alignment, and norm issues, then proposed norm-aligned training with directionally-pruned sampling that can be applied to existing models.

Result: Found that: (1) density approximation can degrade performance due to mode discrepancy, (2) directional alignment suffers from path entanglement when overly concentrated, (3) Gaussian’s omnidirectional coverage ensures robust learning, and (4) norm misalignment incurs substantial learning costs. The proposed framework improves generation quality and sampling efficiency.

Conclusion: Provides practical insights for source distribution design and introduces a readily applicable technique (directional pruning) that improves existing flow matching models without retraining, with empirical validation showing consistent improvements.

Abstract: Flow matching has emerged as a powerful generative modeling approach with flexible choices of source distribution. While Gaussian distributions are commonly used, the potential for better alternatives in high-dimensional data generation remains largely unexplored. In this paper, we propose a novel 2D simulation that captures high-dimensional geometric properties in an interpretable 2D setting, enabling us to analyze the learning dynamics of flow matching during training. Based on this analysis, we derive several key insights about flow matching behavior: (1) density approximation can paradoxically degrade performance due to mode discrepancy, (2) directional alignment suffers from path entanglement when overly concentrated, (3) Gaussian’s omnidirectional coverage ensures robust learning, and (4) norm misalignment incurs substantial learning costs. Building on these insights, we propose a practical framework that combines norm-aligned training with directionally-pruned sampling. This approach maintains the robust omnidirectional supervision essential for stable flow learning, while eliminating initializations in data-sparse regions during inference. Importantly, our pruning strategy can be applied to any flow matching model trained with a Gaussian source, providing immediate performance gains without the need for retraining. Empirical evaluations demonstrate consistent improvements in both generation quality and sampling efficiency. Our findings provide practical insights and guidelines for source distribution design and introduce a readily applicable technique for improving existing flow matching models. Our code is available at https://github.com/kwanseokk/SourceFM.

[139] ALIGN: Advanced Query Initialization with LiDAR-Image Guidance for Occlusion-Robust 3D Object Detection

Janghyun Baek, Mincheol Chang, Seokha Moon, Seung Joon Lee, Jinkyu Kim

Main category: cs.CV

TL;DR: ALIGN improves 3D object detection by using occlusion-aware query initialization with LiDAR and image guidance, addressing inefficient query usage in crowded/occluded scenes.

Details

Motivation: Existing query initialization strategies (random or BEV heatmap-based sampling) lead to inefficient query usage and reduced accuracy for occluded or crowded objects in camera-LiDAR 3D detection.

Method: Three components: 1) Occlusion-aware Center Estimation (OCE) integrates LiDAR geometry and image semantics for accurate object center estimation; 2) Adaptive Neighbor Sampling (ANS) generates object candidates from LiDAR clustering and samples aligned points around them; 3) Dynamic Query Balancing (DQB) adaptively balances queries between foreground and background regions.

Result: On nuScenes benchmark, ALIGN consistently improves multiple state-of-the-art detectors with gains up to +0.9 mAP and +1.2 NDS, particularly effective in challenging scenes with occlusions or dense crowds.

Conclusion: ALIGN provides an effective occlusion-robust, object-aware query initialization approach that enhances 3D object detection performance in complex scenarios, with code to be publicly available.

Abstract: Recent query-based 3D object detection methods using camera and LiDAR inputs have shown strong performance, but existing query initialization strategies,such as random sampling or BEV heatmap-based sampling, often result in inefficient query usage and reduced accuracy, particularly for occluded or crowded objects. To address this limitation, we propose ALIGN (Advanced query initialization with LiDAR and Image GuidaNce), a novel approach for occlusion-robust, object-aware query initialization. Our model consists of three key components: (i) Occlusion-aware Center Estimation (OCE), which integrates LiDAR geometry and image semantics to estimate object centers accurately (ii) Adaptive Neighbor Sampling (ANS), which generates object candidates from LiDAR clustering and supplements each object by sampling spatially and semantically aligned points around it and (iii) Dynamic Query Balancing (DQB), which adaptively balances queries between foreground and background regions. Our extensive experiments on the nuScenes benchmark demonstrate that ALIGN consistently improves performance across multiple state-of-the-art detectors, achieving gains of up to +0.9 mAP and +1.2 NDS, particularly in challenging scenes with occlusions or dense crowds. Our code will be publicly available upon publication.

[140] From Binary to Semantic: Utilizing Large-Scale Binary Occupancy Data for 3D Semantic Occupancy Prediction

Chihiro Noguchi, Takaki Yamamoto

Main category: cs.CV

TL;DR: A framework leveraging large-scale binary occupancy data to enhance 3D semantic occupancy prediction through pre-training and auto-labeling, outperforming existing methods.

Details

Motivation: 3D semantic occupancy prediction requires costly LiDAR annotations, while binary occupancy data (occupied/free space without semantics) is cheaper and more available but underutilized.

Method: Proposes a binary occupancy-based framework that decomposes prediction into binary and semantic occupancy modules, enabling effective use of binary data through pre-training and learning-based auto-labeling.

Result: The proposed framework outperforms existing methods in both pre-training and auto-labeling tasks for 3D semantic occupancy prediction.

Conclusion: Leveraging large-scale binary occupancy data through the proposed decomposition framework effectively enhances 3D semantic occupancy prediction while reducing annotation costs.

Abstract: Accurate perception of the surrounding environment is essential for safe autonomous driving. 3D occupancy prediction, which estimates detailed 3D structures of roads, buildings, and other objects, is particularly important for vision-centric autonomous driving systems that do not rely on LiDAR sensors. However, in 3D semantic occupancy prediction – where each voxel is assigned a semantic label – annotated LiDAR point clouds are required, making data acquisition costly. In contrast, large-scale binary occupancy data, which only indicate occupied or free space without semantic labels, can be collected at a lower cost. Despite their availability, the potential of leveraging such data remains unexplored. In this study, we investigate the utilization of large-scale binary occupancy data from two perspectives: (1) pre-training and (2) learning-based auto-labeling. We propose a novel binary occupancy-based framework that decomposes the prediction process into binary and semantic occupancy modules, enabling effective use of binary occupancy data. Our experimental results demonstrate that the proposed framework outperforms existing methods in both pre-training and auto-labeling tasks, highlighting its effectiveness in enhancing 3D semantic occupancy prediction. The code is available at https://github.com/ToyotaInfoTech/b2s-occupancy

[141] Multi-Part Object Representations via Graph Structures and Co-Part Discovery

Alex Foo, Wynne Hsu, Mong Li Lee

Main category: cs.CV

TL;DR: A novel object-centric representation method using explicit graph representations for parts improves multi-part object discovery and recognition in occluded/out-of-distribution settings.

Details

Motivation: Existing implicit object representation approaches fail to recognize learned objects in occluded or out-of-distribution contexts because they assume object part-whole relations are implicitly encoded through indirect training objectives.

Method: Proposes a novel method leveraging explicit graph representations for parts and presents a co-part object discovery algorithm. Introduces three benchmarks to evaluate robustness in recognizing multi-part objects within occluded and out-of-distribution settings.

Result: Experimental results on simulated, realistic, and real-world images show marked improvements in quality of discovered objects compared to state-of-the-art methods, and accurate recognition of multi-part objects in occluded/out-of-distribution contexts.

Conclusion: The discovered object-centric representations can more accurately predict key object properties in downstream tasks, highlighting the method’s potential to advance the field of object-centric representations.

Abstract: Discovering object-centric representations from images can significantly enhance the robustness, sample efficiency and generalizability of vision models. Works on images with multi-part objects typically follow an implicit object representation approach, which fail to recognize these learned objects in occluded or out-of-distribution contexts. This is due to the assumption that object part-whole relations are implicitly encoded into the representations through indirect training objectives. We address this limitation by proposing a novel method that leverages on explicit graph representations for parts and present a co-part object discovery algorithm. We then introduce three benchmarks to evaluate the robustness of object-centric methods in recognizing multi-part objects within occluded and out-of-distribution settings. Experimental results on simulated, realistic, and real-world images show marked improvements in the quality of discovered objects compared to state-of-the-art methods, as well as the accurate recognition of multi-part objects in occluded and out-of-distribution contexts. We also show that the discovered object-centric representations can more accurately predict key object properties in a downstream task, highlighting the potential of our method to advance the field of object-centric representations.

[142] Enhancing Diffusion Model Guidance through Calibration and Regularization

Seyed Alireza Javid, Amirhossein Bagheri, Nuria González-Prelcic

Main category: cs.CV

TL;DR: This paper improves classifier-guided diffusion models by addressing overconfident predictions through classifier calibration and enhanced sampling methods, achieving better FID scores without retraining diffusion models.

Details

Motivation: Classifier-guided diffusion models suffer from overconfident predictions during early denoising steps, causing guidance gradients to vanish and limiting their effectiveness for conditional image generation.

Method: Two complementary approaches: 1) Differentiable calibration objective based on Smooth Expected Calibration Error (Smooth ECE) for classifier fine-tuning, and 2) Enhanced sampling guidance methods including tilted sampling with batch-level reweighting, adaptive entropy-regularized sampling, and novel f-divergence-based sampling strategy.

Result: Achieves FID of 2.13 on ImageNet 128x128 using ResNet-101 classifier, improving upon existing classifier-guided diffusion methods while requiring no diffusion model retraining.

Conclusion: Principled calibration and divergence-aware sampling provide practical and effective improvements for classifier-guided diffusion models, addressing the overconfidence problem without requiring diffusion model retraining.

Abstract: Classifier-guided diffusion models have emerged as a powerful approach for conditional image generation, but they suffer from overconfident predictions during early denoising steps, causing the guidance gradient to vanish. This paper introduces two complementary contributions to address this issue. First, we propose a differentiable calibration objective based on the Smooth Expected Calibration Error (Smooth ECE), which improves classifier calibration with minimal fine-tuning and yields measurable improvements in Frechet Inception Distance (FID). Second, we develop enhanced sampling guidance methods that operate on off-the-shelf classifiers without requiring retraining. These include tilted sampling with batch-level reweighting, adaptive entropy-regularized sampling to preserve diversity, and a novel f-divergence-based sampling strategy that strengthens class-consistent guidance while maintaining mode coverage. Experiments on ImageNet 128x128 demonstrate that our divergence-regularized guidance achieves an FID of 2.13 using a ResNet-101 classifier, improving upon existing classifier-guided diffusion methods while requiring no diffusion model retraining. The results show that principled calibration and divergence-aware sampling provide practical and effective improvements for classifier-guided diffusion.

[143] Unsupervised Anomaly Detection with an Enhanced Teacher for Student-Teacher Feature Pyramid Matching

Mohammad Zolfaghari, Hedieh Sajedi

Main category: cs.CV

TL;DR: ET-STPM: Enhanced Teacher for Student-Teacher Feature Pyramid framework achieves state-of-the-art anomaly detection performance on MVTech-AD dataset with 97.1% image-level and 97.7% pixel-level accuracy.

Details

Motivation: Anomaly detection is challenging in unsupervised learning. The paper aims to improve anomaly detection performance by enhancing the teacher network in a student-teacher framework.

Method: Proposes ET-STPM (Enhanced Teacher for Student-Teacher Feature Pyramid) framework. First pre-trains ResNet-18 on ImageNet, then fine-tunes on MVTech-AD dataset. Uses enhanced teacher network in student-teacher architecture for feature pyramid extraction.

Result: Achieves 0.971 mean accuracy on image-level and 0.977 mean accuracy on pixel-level anomaly detection on MVTech-AD dataset, outperforming previous methods.

Conclusion: The enhanced teacher network in student-teacher framework significantly improves anomaly detection performance, demonstrating effectiveness of the proposed ET-STPM approach for both image-level and pixel-level anomaly detection tasks.

Abstract: Anomaly detection or outlier is one of the challenging subjects in unsupervised learning . This paper is introduced a student-teacher framework for anomaly detection that its teacher network is enhanced for achieving high-performance metrics . For this purpose , we first pre-train the ResNet-18 network on the ImageNet and then fine-tune it on the MVTech-AD dataset . Experiment results on the image-level and pixel-level demonstrate that this idea has achieved better metrics than the previous methods . Our model , Enhanced Teacher for Student-Teacher Feature Pyramid (ET-STPM), achieved 0.971 mean accuracy on the image-level and 0.977 mean accuracy on the pixel-level for anomaly detection.

[144] Multifaceted Exploration of Spatial Openness in Rental Housing: A Big Data Analysis in Tokyo’s 23 Wards

Takuya OKi, Yuan Liu

Main category: cs.CV

TL;DR: Developed a quantitative framework to evaluate spatial openness in housing from 2D and 3D perspectives using Tokyo rental data, finding openness increased over time, correlates with higher rent, but 2D and 3D measures aren’t directly related.

Details

Motivation: Spatial openness is crucial for residential quality and design, but existing studies often treat influencing factors separately. There's a need for a comprehensive, quantitative framework to understand spatial openness from multiple dimensions and its relationship with housing markets.

Method: Used data from 4,004 rental units in Tokyo’s 23 wards. 2D openness computed via planar visibility using visibility graph analysis (VGA) from floor plans. 3D openness derived from interior images analyzed using Mask2Former semantic segmentation model to identify walls, ceilings, floors, and windows. Examined temporal/spatial variations and relationships with rent and housing attributes.

Result: Found increase in living room visibility and 1990s peak in overall openness. Spatial analyses showed partial correlations among openness, rent, and building characteristics reflecting urban redevelopment trends. 2D and 3D openness indicators were not directly correlated, but higher openness tended to correspond to higher rent. Existing impression score models only weakly related to openness.

Conclusion: Interior design and furniture more strongly shape perceived space than openness measures. The study offers a new multidimensional data-driven framework for quantifying residential spatial openness and linking it with urban and market dynamics.

Abstract: Understanding spatial openness is vital for improving residential quality and design; however, studies often treat its influencing factors separately. This study developed a quantitative framework to evaluate the spatial openness in housing from two- (2D) and three- (3D) dimensional perspectives. Using data from 4,004 rental units in Tokyo’s 23 wards, we examined the temporal and spatial variations in openness and its relationship with rent and housing attributes. 2D openness was computed via planar visibility using visibility graph analysis (VGA) from floor plans, whereas 3D openness was derived from interior images analysed using Mask2Former, a semantic segmentation model that identifies walls, ceilings, floors, and windows. The results showed an increase in living room visibility and a 1990s peak in overall openness. Spatial analyses revealed partial correlations among openness, rent, and building characteristics, reflecting urban redevelopment trends. Although the 2D and 3D openness indicators were not directly correlated, higher openness tended to correspond to higher rent. The impression scores predicted by the existing models were only weakly related to openness, suggesting that the interior design and furniture more strongly shape perceived space. This study offers a new multidimensional data-driven framework for quantifying residential spatial openness and linking it with urban and market dynamics.

[145] Investigating Spatial Attention Bias in Vision-Language Models

Aryan Chaudhary, Sanchit Goyal, Pratik Narang, Dhruv Kumar

Main category: cs.CV

TL;DR: VLMs show systematic left-first bias when describing horizontally concatenated images, prioritizing left content in ~97% of cases, independent of language reading direction.

Details

Motivation: To identify and characterize systematic spatial processing biases in Vision-Language Models that remain largely unexplored despite their remarkable visual understanding capabilities.

Method: Conducted controlled experiments on image pairs using both open-source and closed-source VLMs, tested with neutral prompting conditions. Investigated Arabic-finetuned models to rule out language reading direction effects. Examined training dataset annotation guidelines from PixMo and Visual Genome.

Result: VLMs consistently prioritize describing left-positioned content first in approximately 97% of cases across different architectures. The bias persists even in Arabic-finetuned models (right-to-left language), ruling out language reading direction as the primary cause. Training dataset analysis revealed no explicit left-first ordering instructions.

Conclusion: The systematic left-first spatial attention bias in VLMs is likely due to architectural factors rather than explicit training data instructions or language reading direction, revealing fundamental limitations in how current VLMs process spatial information.

Abstract: Vision-Language Models have demonstrated remarkable capabilities in understanding visual content, yet systematic biases in their spatial processing remain largely unexplored. This work identifies and characterizes a systematic spatial attention bias where VLMs consistently prioritize describing left-positioned content before right-positioned content in horizontally concatenated images. Through controlled experiments on image pairs using both open-source and closed-source models, we demonstrate that this bias persists across different architectures, with models describing left-positioned content first in approximately 97% of cases under neutral prompting conditions. Testing on an Arabic-finetuned model reveals that the bias persists despite right-to-left language training, ruling out language reading direction as the primary cause. Investigation of training dataset annotation guidelines from PixMo and Visual Genome reveals no explicit left-first ordering instructions, suggesting the bias is consistent with architectural factors rather than explicit training data instructions. These findings reveal fundamental limitations in how current VLMs process spatial information.

[146] Joint Learning of Depth, Pose, and Local Radiance Field for Large Scale Monocular 3D Reconstruction

Shahram Najam Syed, Yitian Hu, Yuchao Yao

Main category: cs.CV

TL;DR: Joint learning framework for large-scale 3D reconstruction from monocular video that couples depth, pose, and radiance estimation to overcome scale ambiguity, pose drift, and scene coverage limitations.

Details

Motivation: Traditional monocular 3D reconstruction methods fail in large-scale scenes due to: 1) scale-ambiguous depth causing ghost geometry, 2) long-horizon pose drift corrupting alignment, and 3) single global NeRF being unable to model hundreds of meters of content.

Method: Three-component approach: 1) ViT depth network with metric-scale supervision for globally consistent depths, 2) multi-scale feature bundle-adjustment layer refining poses in feature space using learned pyramidal descriptors, 3) incremental local-radiance-field hierarchy with hash-grid NeRFs allocated and frozen on-the-fly when view overlap falls below threshold.

Result: On Tanks and Temples benchmark: Absolute Trajectory Error reduced to 0.001-0.021m across 8 indoor-outdoor sequences (18x lower than BARF, 2x lower than NoPe-NeRF) while maintaining sub-pixel Relative Pose Error. Enables city-block-scale coverage on single GPU.

Conclusion: Metric-scale, drift-free 3D reconstruction and high-fidelity novel-view synthesis are achievable from a single uncalibrated RGB camera through joint learning of depth, pose, and radiance factors.

Abstract: Photorealistic 3-D reconstruction from monocular video collapses in large-scale scenes when depth, pose, and radiance are solved in isolation: scale-ambiguous depth yields ghost geometry, long-horizon pose drift corrupts alignment, and a single global NeRF cannot model hundreds of metres of content. We introduce a joint learning framework that couples all three factors and demonstrably overcomes each failure case. Our system begins with a Vision-Transformer (ViT) depth network trained with metric-scale supervision, giving globally consistent depths despite wide field-of-view variations. A multi-scale feature bundle-adjustment (BA) layer refines camera poses directly in feature space–leveraging learned pyramidal descriptors instead of brittle keypoints–to suppress drift on unconstrained trajectories. For scene representation, we deploy an incremental local-radiance-field hierarchy: new hash-grid NeRFs are allocated and frozen on-the-fly when view overlap falls below a threshold, enabling city-block-scale coverage on a single GPU. Evaluated on the Tanks and Temples benchmark, our method reduces Absolute Trajectory Error to 0.001-0.021 m across eight indoor-outdoor sequences–up to 18x lower than BARF and 2x lower than NoPe-NeRF–while maintaining sub-pixel Relative Pose Error. These results demonstrate that metric-scale, drift-free 3-D reconstruction and high-fidelity novel-view synthesis are achievable from a single uncalibrated RGB camera.

[147] SG-RIFE: Semantic-Guided Real-Time Intermediate Flow Estimation with Diffusion-Competitive Perceptual Quality

Pan Ben Wong, Chengli Wu, Hanyue Lu

Main category: cs.CV

TL;DR: SG-RIFE enhances real-time video frame interpolation by injecting semantic priors into flow-based RIFE, achieving diffusion-competitive quality with near real-time speed.

Details

Motivation: Flow-based methods like RIFE offer high throughput but struggle with complex scenarios (large motion, occlusion), while diffusion-based methods achieve superior perceptual quality but have prohibitive latency for real-time applications.

Method: Parameter-efficient fine-tuning of pre-trained RIFE backbone with semantic priors from frozen DINOv3 Vision Transformer. Uses Split-Fidelity Aware Projection Module (Split-FAPM) to compress/refine features and Deformable Semantic Fusion (DSF) to align semantic priors with pixel-level motion fields.

Result: Outperforms diffusion-based LDMVFI in FID/LPIPS metrics, achieves quality comparable to Consec. BB on complex benchmarks while running significantly faster. Semantic injection provides decisive boost in perceptual fidelity.

Conclusion: Semantic consistency enables flow-based methods to achieve diffusion-competitive perceptual quality in near real-time, bridging the gap between real-time throughput and high perceptual quality.

Abstract: Real-time Video Frame Interpolation (VFI) has long been dominated by flow-based methods like RIFE, which offer high throughput but often fail in complicated scenarios involving large motion and occlusion. Conversely, recent diffusion-based approaches (e.g., Consec. BB) achieve state-of-the-art perceptual quality but suffer from prohibitive latency, rendering them impractical for real-time applications. To bridge this gap, we propose Semantic-Guided RIFE (SG-RIFE). Instead of training from scratch, we introduce a parameter-efficient fine-tuning strategy that augments a pre-trained RIFE backbone with semantic priors from a frozen DINOv3 Vision Transformer. We propose a Split-Fidelity Aware Projection Module (Split-FAPM) to compress and refine high-dimensional features, and a Deformable Semantic Fusion (DSF) module to align these semantic priors with pixel-level motion fields. Experiments on SNU-FILM demonstrate that semantic injection provides a decisive boost in perceptual fidelity. SG-RIFE outperforms diffusion-based LDMVFI in FID/LPIPS and achieves quality comparable to Consec. BB on complex benchmarks while running significantly faster, proving that semantic consistency enables flow-based methods to achieve diffusion-competitive perceptual quality in near real-time.

Xiao He, Chang Tang, Xinwang Liu, Wei Zhang, Zhimin Gao, Chuankun Li, Shaohua Qiu, Jiangfeng Xu

Main category: cs.CV

TL;DR: SDCM network addresses hyperspectral object detection challenges by reducing inter-band inconsistencies and redundancy through spectral discrepancy and cross-modal semantic consistency learning.

Details

Motivation: Hyperspectral images have high spectral resolution but face challenges in object detection due to intra- and inter-class similarity, spatial differences across bands, and interference from sensor noises and illumination.

Method: Proposes SDCM network with three key modules: 1) Semantic Consistency Learning (SCL) to reduce inter-band heterogeneity using contextual cues, 2) Spectral Gated Generator (SGG) to filter redundant data based on band importance, and 3) Spectral Discrepancy Aware (SDA) module to enrich semantic representation by extracting pixel-level spectral features.

Result: Extensive experiments on two hyperspectral datasets demonstrate state-of-the-art performance compared to other methods.

Conclusion: The proposed SDCM network effectively addresses hyperspectral inter-band inconsistencies and redundancy, improving object detection performance in hyperspectral images through spectral discrepancy awareness and cross-modal semantic consistency learning.

Abstract: Hyperspectral images with high spectral resolution provide new insights into recognizing subtle differences in similar substances. However, object detection in hyperspectral images faces significant challenges in intra- and inter-class similarity due to the spatial differences in hyperspectral inter-bands and unavoidable interferences, e.g., sensor noises and illumination. To alleviate the hyperspectral inter-bands inconsistencies and redundancy, we propose a novel network termed \textbf{S}pectral \textbf{D}iscrepancy and \textbf{C}ross-\textbf{M}odal semantic consistency learning (SDCM), which facilitates the extraction of consistent information across a wide range of hyperspectral bands while utilizing the spectral dimension to pinpoint regions of interest. Specifically, we leverage a semantic consistency learning (SCL) module that utilizes inter-band contextual cues to diminish the heterogeneity of information among bands, yielding highly coherent spectral dimension representations. On the other hand, we incorporate a spectral gated generator (SGG) into the framework that filters out the redundant data inherent in hyperspectral information based on the importance of the bands. Then, we design the spectral discrepancy aware (SDA) module to enrich the semantic representation of high-level information by extracting pixel-level spectral features. Extensive experiments on two hyperspectral datasets demonstrate that our proposed method achieves state-of-the-art performance when compared with other ones.

[149] Towards Ancient Plant Seed Classification: A Benchmark Dataset and Baseline Model

Rui Xing, Runmin Cong, Yingying Wu, Can Wang, Zhongming Tang, Fen Wang, Hao Wu, Sam Kwong

Main category: cs.CV

TL;DR: First Ancient Plant Seed Image Classification dataset (APS) with 8,340 images from 17 seed categories across 18 Chinese archaeological sites, plus APSNet framework achieving 90.5% accuracy.

Details

Motivation: Traditional archaeobotanical studies rely on expert knowledge which is time-consuming and inefficient. There's a research gap in intelligent analysis methods for ancient plant seed classification, despite progress in other archaeology fields.

Method: Created APS dataset (8,340 images, 17 seed categories from 18 sites). Designed APSNet framework with Size Perception and Embedding (SPE) module to extract seed size information, and Asynchronous Decoupled Decoding (ADD) architecture for decoding features from channel and spatial perspectives.

Result: Achieved 90.5% accuracy, surpassing existing state-of-the-art image classification methods in both quantitative and qualitative analyses.

Conclusion: The work provides an effective tool for large-scale, systematic archaeological research on ancient plant seeds, addressing the gap in intelligent analysis methods for archaeobotany.

Abstract: Understanding the dietary preferences of ancient societies and their evolution across periods and regions is crucial for revealing human-environment interactions. Seeds, as important archaeological artifacts, represent a fundamental subject of archaeobotanical research. However, traditional studies rely heavily on expert knowledge, which is often time-consuming and inefficient. Intelligent analysis methods have made progress in various fields of archaeology, but there remains a research gap in data and methods in archaeobotany, especially in the classification task of ancient plant seeds. To address this, we construct the first Ancient Plant Seed Image Classification (APS) dataset. It contains 8,340 images from 17 genus- or species-level seed categories excavated from 18 archaeological sites across China. In addition, we design a framework specifically for the ancient plant seed classification task (APSNet), which introduces the scale feature (size) of seeds based on learning fine-grained information to guide the network in discovering key “evidence” for sufficient classification. Specifically, we design a Size Perception and Embedding (SPE) module in the encoder part to explicitly extract size information for the purpose of complementing fine-grained information. We propose an Asynchronous Decoupled Decoding (ADD) architecture based on traditional progressive learning to decode features from both channel and spatial perspectives, enabling efficient learning of discriminative features. In both quantitative and qualitative analyses, our approach surpasses existing state-of-the-art image classification methods, achieving an accuracy of 90.5%. This demonstrates that our work provides an effective tool for large-scale, systematic archaeological research.

[150] Loom: Diffusion-Transformer for Interleaved Generation

Mingcheng Ye, Jiaming Liu, Yiren Song

Main category: cs.CV

TL;DR: Loom is a unified diffusion-transformer framework for interleaved text-image generation that outperforms baselines in compositionality, temporal coherence, and text-image alignment.

Details

Motivation: Interleaved text-image generation enables coherent visual frames and aligned textual descriptions for tasks like style transfer, compositional synthesis, and procedural tutorials, but existing methods lack effective sequential planning and multi-condition reasoning.

Method: Loom extends Bagel via full-parameter fine-tuning with an interleaved architecture alternating textual/visual embeddings. It uses language planning to decompose instructions into stepwise prompts and frame embeddings, and conditions on sampled prior frames rather than concatenating all history for efficient long-horizon generation.

Result: Loom substantially outperforms open-source baseline Anole with average gain of 2.6 points (5-point scale) across temporal and semantic metrics. It also shows strong improvements over unified and diffusion editing baselines on a curated 50K interleaved tutorial dataset.

Conclusion: Loom delivers superior compositionality, temporal coherence, and text-image alignment for interleaved generation tasks, making it an effective framework for style transfer, compositional generation, and tutorial-like procedures.

Abstract: Interleaved text-image generation aims to jointly produce coherent visual frames and aligned textual descriptions within a single sequence, enabling tasks such as style transfer, compositional synthesis, and procedural tutorials. We present Loom, a unified diffusion-transformer framework for interleaved text-image generation. Loom extends the Bagel unified model via full-parameter fine-tuning and an interleaved architecture that alternates textual and visual embeddings for multi-condition reasoning and sequential planning. A language planning strategy first decomposes a user instruction into stepwise prompts and frame embeddings, which guide temporally consistent synthesis. For each frame, Loom conditions on a small set of sampled prior frames together with the global textual context, rather than concatenating all history, yielding controllable and efficient long-horizon generation. Across style transfer, compositional generation, and tutorial-like procedures, Loom delivers superior compositionality, temporal coherence, and text-image alignment. Experiments demonstrate that Loom substantially outperforms the open-source baseline Anole, achieving an average gain of 2.6 points (on a 5-point scale) across temporal and semantic metrics in text-to-interleaved tasks. We also curate a 50K interleaved tutorial dataset and demonstrate strong improvements over unified and diffusion editing baselines.

[151] Who Can See Through You? Adversarial Shielding Against VLM-Based Attribute Inference Attacks

Yucheng Fan, Jiawei Chen, Yu Tian, Zhaoxia Yin

Main category: cs.CV

TL;DR: Proposes a novel protection method against VLM-based attribute inference attacks that jointly optimizes privacy suppression and utility preservation while maintaining visual consistency, and introduces VPI-COCO benchmark for fair evaluation.

Details

Motivation: VLM-based attribute inference attacks pose serious privacy threats when users share images on social media. Existing protection methods degrade visual quality or interfere with vision functions, failing to balance privacy protection with user experience.

Method: A novel protection method that jointly optimizes privacy suppression and utility preservation under a visual consistency constraint. Also introduces VPI-COCO benchmark with 522 images and hierarchical privacy questions for fair evaluation.

Result: Method reduces Privacy Attribute Recognition (PAR) below 25%, keeps Non-Privacy Attribute Recognition (NPAR) above 88%, maintains high visual consistency, and generalizes well to unseen and paraphrased privacy questions across multiple VLMs.

Conclusion: The proposed method effectively balances privacy protection and user experience, demonstrating strong practical applicability for real-world VLM deployments, with VPI-COCO enabling standardized evaluation of protection methods.

Abstract: As vision-language models (VLMs) become widely adopted, VLM-based attribute inference attacks have emerged as a serious privacy concern, enabling adversaries to infer private attributes from images shared on social media. This escalating threat calls for dedicated protection methods to safeguard user privacy. However, existing methods often degrade the visual quality of images or interfere with vision-based functions on social media, thereby failing to achieve a desirable balance between privacy protection and user experience. To address this challenge, we propose a novel protection method that jointly optimizes privacy suppression and utility preservation under a visual consistency constraint. While our method is conceptually effective, fair comparisons between methods remain challenging due to the lack of publicly available evaluation datasets. To fill this gap, we introduce VPI-COCO, a publicly available benchmark comprising 522 images with hierarchically structured privacy questions and corresponding non-private counterparts, enabling fine-grained and joint evaluation of protection methods in terms of privacy preservation and user experience. Building upon this benchmark, experiments on multiple VLMs demonstrate that our method effectively reduces PAR below 25%, keeps NPAR above 88%, maintains high visual consistency, and generalizes well to unseen and paraphrased privacy questions, demonstrating its strong practical applicability for real-world VLM deployments.

[152] Building UI/UX Dataset for Dark Pattern Detection and YOLOv12x-based Real-Time Object Recognition Detection System

Se-Young Jang, Su-Yeon Yoon, Jae-Woong Jung, Dong-Hun Lee, Seong-Hun Choi, Soo-Kyung Jun, Yu-Bin Kim, Young-Seon Ju, Kyounggon Kim

Main category: cs.CV

TL;DR: Proposes a visual dark pattern detection framework using YOLOv12x for real-time detection with 92.8% mAP@50 accuracy and 40.5 FPS speed, based on a new dataset of 4,066 annotated UI screenshots.

Details

Motivation: Digital transformation and sophisticated online platform designs have increased concerns about dark patterns that undermine user autonomy. Current regulatory approaches are reactive, creating need for proactive, real-time detection technologies.

Method: Created proprietary dataset of 4,066 UI screenshots from 194 websites across six sectors, annotated with five dark pattern UI components. Applied YOLOv12x object detection model with transfer learning for optimized visual dark pattern recognition.

Result: Achieved 92.8% mAP@50 detection accuracy with real-time inference speed of 40.5 FPS. Dataset publicly released to support further research.

Conclusion: Proposed framework effectively enables real-time dark pattern detection suitable for practical deployment. Public dataset release aims to advance research in this critical area of user protection.

Abstract: With the accelerating pace of digital transformation and the widespread adoption of online platforms, both social and technical concerns regarding dark patterns-user interface designs that undermine users’ ability to make informed and rational choices-have become increasingly prominent. As corporate online platforms grow more sophisticated in their design strategies, there is a pressing need for proactive and real-time detection technologies that go beyond the predominantly reactive approaches employed by regulatory authorities. In this paper, we propose a visual dark pattern detection framework that improves both detection accuracy and real-time performance. To this end, we constructed a proprietary visual object detection dataset by manually collecting 4,066 UI/UX screenshots containing dark patterns from 194 websites across six major industrial sectors in South Korea and abroad. The collected images were annotated with five representative UI components commonly associated with dark patterns: Button, Checkbox, Input Field, Pop-up, and QR Code. This dataset has been publicly released to support further research and development in the field. To enable real-time detection, this study adopted the YOLOv12x object detection model and applied transfer learning to optimize its performance for visual dark pattern recognition. Experimental results demonstrate that the proposed approach achieves a high detection accuracy of 92.8% in terms of mAP@50, while maintaining a real-time inference speed of 40.5 frames per second (FPS), confirming its effectiveness for practical deployment in online environments. Furthermore, to facilitate future research and contribute to technological advancements, the dataset constructed in this study has been made publicly available at https://github.com/B4E2/B4E2-DarkPattern-YOLO-DataSet.

[153] UniMPR: A Unified Framework for Multimodal Place Recognition with Arbitrary Sensor Configurations

Zhangshuo Qi, Jingyi Xu, Luqi Cheng, Shichen Wen, Yiming Ma, Guangming Xiong

Main category: cs.CV

TL;DR: UniMPR is a unified multimodal place recognition framework that can handle any combination of camera, LiDAR, and radar inputs using a single trained model, achieving state-of-the-art performance across diverse sensor setups.

Details

Motivation: Current multimodal place recognition methods struggle with three key challenges: adapting to arbitrary modality inputs within a unified framework, maintaining robustness with missing/degraded modalities, and generalizing across diverse sensor configurations.

Method: UniMPR unifies all inputs within a polar BEV feature space, uses a multi-branch network to extract discriminative intra-modal and inter-modal features from any modality combinations, and employs large-scale multi-dataset training with adaptive label assignment for extensive pre-training.

Result: Experiments on seven datasets demonstrate state-of-the-art performance under varying sensor configurations, modality combinations, and environmental conditions.

Conclusion: UniMPR provides a unified solution for multimodal place recognition that can seamlessly adapt to any combination of common perceptual modalities using only one trained model, addressing key challenges in the field.

Abstract: Place recognition is a critical component of autonomous vehicles and robotics, enabling global localization in GPS-denied environments. Recent advances have spurred significant interest in multimodal place recognition (MPR), which leverages complementary strengths of multiple modalities. Despite its potential, most existing MPR methods still face three key challenges: (1) dynamically adapting to arbitrary modality inputs within a unified framework, (2) maintaining robustness with missing or degraded modalities, and (3) generalizing across diverse sensor configurations and setups. In this paper, we propose UniMPR, a unified framework for multimodal place recognition. Using only one trained model, it can seamlessly adapt to any combination of common perceptual modalities (e.g., camera, LiDAR, radar). To tackle the data heterogeneity, we unify all inputs within a polar BEV feature space. Subsequently, the polar BEVs are fed into a multi-branch network to exploit discriminative intra-model and inter-modal features from any modality combinations. To fully exploit the network’s generalization capability and robustness, we construct a large-scale training set from multiple datasets and introduce an adaptive label assignment strategy for extensive pre-training. Experiments on seven datasets demonstrate that UniMPR achieves state-of-the-art performance under varying sensor configurations, modality combinations, and environmental conditions. Our code will be released at https://github.com/QiZS-BIT/UniMPR.

[154] Pyramidal Adaptive Cross-Gating for Multimodal Detection

Zidong Gu, Shoufu Tian

Main category: cs.CV

TL;DR: PACGNet introduces a novel cross-gating architecture for multimodal object detection in aerial imagery that addresses noise and feature hierarchy disruption through symmetrical cross-gating and pyramidal feature-aware gating modules.

Details

Motivation: Existing multimodal object detection methods for aerial imagery rely on simple fusion strategies that are prone to cross-modal noise and disrupt the hierarchical structure of feature pyramids, impairing fine-grained detection of small objects.

Method: Proposes Pyramidal Adaptive Cross-Gating Network (PACGNet) with two core components: 1) Symmetrical Cross-Gating (SCG) module for bidirectional selective information absorption and noise suppression, and 2) Pyramidal Feature-aware Multimodal Gating (PFMG) module that reconstructs feature hierarchy via progressive hierarchical gating using detailed features from higher-resolution levels to guide fusion at lower-resolution levels.

Result: Achieves state-of-the-art performance on DroneVehicle and VEDAI datasets with mAP50 scores of 81.7% and 82.1% respectively.

Conclusion: PACGNet effectively addresses cross-modal noise and feature hierarchy disruption through deep fusion within the backbone, enabling superior fine-grained object detection in aerial imagery.

Abstract: Object detection in aerial imagery is a critical task in applications such as UAV reconnaissance. Although existing methods have extensively explored feature interaction between different modalities, they commonly rely on simple fusion strategies for feature aggregation. This introduces two critical flaws: it is prone to cross-modal noise and disrupts the hierarchical structure of the feature pyramid, thereby impairing the fine-grained detection of small objects. To address this challenge, we propose the Pyramidal Adaptive Cross-Gating Network (PACGNet), an architecture designed to perform deep fusion within the backbone. To this end, we design two core components: the Symmetrical Cross-Gating (SCG) module and the Pyramidal Feature-aware Multimodal Gating (PFMG) module. The SCG module employs a bidirectional, symmetrical “horizontal” gating mechanism to selectively absorb complementary information, suppress noise, and preserve the semantic integrity of each modality. The PFMG module reconstructs the feature hierarchy via a progressive hierarchical gating mechanism. This leverages the detailed features from a preceding, higher-resolution level to guide the fusion at the current, lower-resolution level, effectively preserving fine-grained details as features propagate. Through evaluations conducted on the DroneVehicle and VEDAI datasets, our PACGNet sets a new state-of-the-art benchmark, with mAP50 scores reaching 81.7% and 82.1% respectively.

[155] MatE: Material Extraction from Single-Image via Geometric Prior

Zeyu Zhang, Wei Zhai, Jian Yang, Yang Cao

Main category: cs.CV

TL;DR: MatE generates tileable PBR materials from a single unconstrained real-world image using depth-based rectification and a dual-branch diffusion model.

Details

Motivation: High-fidelity PBR material creation requires specialized equipment and expert post-processing, creating a bottleneck in graphics pipelines. The authors aim to democratize this process by enabling material generation from casual, real-world images.

Method: Given an image and user mask, MatE first performs coarse rectification using estimated depth as geometric prior. Then a dual-branch diffusion model leverages learned consistency from rotation-aligned and scale-aligned training data to rectify residual distortions and generate complete material maps (albedo, normal, roughness, height).

Result: The framework achieves invariance to unknown illumination and perspective, recovering intrinsic material properties from casual captures. Comprehensive experiments on synthetic and real-world data demonstrate efficacy and robustness.

Conclusion: MatE enables users to create realistic PBR materials from real-world images, democratizing material creation by eliminating the need for specialized equipment and expert processing.

Abstract: The creation of high-fidelity, physically-based rendering (PBR) materials remains a bottleneck in many graphics pipelines, typically requiring specialized equipment and expert-driven post-processing. To democratize this process, we present MatE, a novel method for generating tileable PBR materials from a single image taken under unconstrained, real-world conditions. Given an image and a user-provided mask, MatE first performs coarse rectification using an estimated depth map as a geometric prior, and then employs a dual-branch diffusion model. Leveraging a learned consistency from rotation-aligned and scale-aligned training data, this model further rectify residual distortions from the coarse result and translate it into a complete set of material maps, including albedo, normal, roughness and height. Our framework achieves invariance to the unknown illumination and perspective of the input image, allowing for the recovery of intrinsic material properties from casual captures. Through comprehensive experiments on both synthetic and real-world data, we demonstrate the efficacy and robustness of our approach, enabling users to create realistic materials from real-world image.

[156] MatSpray: Fusing 2D Material World Knowledge on 3D Geometry

Philipp Langsteiner, Jan-Niklas Dihlmann, Hendrik P. A. Lensch

Main category: cs.CV

TL;DR: A framework that fuses 2D material predictions from diffusion models into 3D Gaussian Splatting reconstructions to create relightable, photorealistic assets with accurate PBR material parameters.

Details

Motivation: Manual material modeling is time-consuming in gaming/film industries. Existing 3D reconstruction methods lack precise spatially varying material parameters needed for relighting, while 2D diffusion models can predict PBR properties but struggle to transfer them to 3D geometry.

Method: 1) Reconstruct scene geometry using Gaussian Splatting. 2) Generate 2D PBR material maps (albedo, roughness, metallic) from input images using diffusion models. 3) Fuse 2D material data into 3D representation via optimization-based or direct projection methods using Gaussian ray tracing. 4) Apply Neural Merger refinement for fine-scale accuracy and multi-view consistency.

Result: Outperforms existing techniques in both quantitative metrics and perceived visual realism. Enables more accurate, relightable, and photorealistic renderings from reconstructed scenes.

Conclusion: The framework significantly improves realism and efficiency of asset creation workflows in content production pipelines by enabling accurate material parameter transfer from 2D to 3D representations.

Abstract: Manual modeling of material parameters and 3D geometry is a time consuming yet essential task in the gaming and film industries. While recent advances in 3D reconstruction have enabled accurate approximations of scene geometry and appearance, these methods often fall short in relighting scenarios due to the lack of precise, spatially varying material parameters. At the same time, diffusion models operating on 2D images have shown strong performance in predicting physically based rendering (PBR) properties such as albedo, roughness, and metallicity. However, transferring these 2D material maps onto reconstructed 3D geometry remains a significant challenge. We propose a framework for fusing 2D material data into 3D geometry using a combination of novel learning-based and projection-based approaches. We begin by reconstructing scene geometry via Gaussian Splatting. From the input images, a diffusion model generates 2D maps for albedo, roughness, and metallic parameters. Any existing diffusion model that can convert images or videos to PBR materials can be applied. The predictions are further integrated into the 3D representation either by optimizing an image-based loss or by directly projecting the material parameters onto the Gaussians using Gaussian ray tracing. To enhance fine-scale accuracy and multi-view consistency, we further introduce a light-weight neural refinement step (Neural Merger), which takes ray-traced material features as input and produces detailed adjustments. Our results demonstrate that the proposed methods outperform existing techniques in both quantitative metrics and perceived visual realism. This enables more accurate, relightable, and photorealistic renderings from reconstructed scenes, significantly improving the realism and efficiency of asset creation workflows in content production pipelines.

[157] A two-stream network with global-local feature fusion for bone age assessment

Qiong Lou, Han Yang, Fang Lu

Main category: cs.CV

TL;DR: BoNet+ is a two-stream deep learning model for bone age assessment that combines global Transformer features with local RFAConv features, achieving state-of-the-art accuracy on RSNA and RHPE datasets.

Details

Motivation: Existing deep learning methods for bone age assessment struggle to effectively balance global features and local skeletal details, limiting their accuracy and clinical utility.

Method: Two-stream architecture with global feature extraction using Transformer modules for multi-head self-attention, and local feature extraction using RFAConv modules for adaptive attention maps in multiscale receptive fields. Features are concatenated and optimized with Inception-V3.

Result: Achieved MAEs of 3.81 months on RSNA dataset and 5.65 months on RHPE dataset, comparable to state-of-the-art methods.

Conclusion: BoNet+ reduces clinical workload and enables automatic, high-precision, objective bone age assessment by effectively combining global and local skeletal features.

Abstract: Bone Age Assessment (BAA) is a widely used clinical technique that can accurately reflect an individual’s growth and development level, as well as maturity. In recent years, although deep learning has advanced the field of bone age assessment, existing methods face challenges in efficiently balancing global features and local skeletal details. This study aims to develop an automated bone age assessment system based on a two-stream deep learning architecture to achieve higher accuracy in bone age assessment. We propose the BoNet+ model incorporating global and local feature extraction channels. A Transformer module is introduced into the global feature extraction channel to enhance the ability in extracting global features through multi-head self-attention mechanism. A RFAConv module is incorporated into the local feature extraction channel to generate adaptive attention maps within multiscale receptive fields, enhancing local feature extraction capabilities. Global and local features are concatenated along the channel dimension and optimized by an Inception-V3 network. The proposed method has been validated on the Radiological Society of North America (RSNA) and Radiological Hand Pose Estimation (RHPE) test datasets, achieving mean absolute errors (MAEs) of 3.81 and 5.65 months, respectively. These results are comparable to the state-of-the-art. The BoNet+ model reduces the clinical workload and achieves automatic, high-precision, and more objective bone age assessment.

[158] MCVI-SANet: A lightweight semi-supervised model for LAI and SPAD estimation of winter wheat under vegetation index saturation

Zhiheng Zhang, Jiajun Yang, Hong Sun, Dong Wang, Honghua Jiang, Yaru Chen, Tangyuan Ning

Main category: cs.CV

TL;DR: MCVI-SANet: Lightweight semi-supervised vision model for winter wheat LAI and SPAD estimation, addressing VI saturation and limited annotations with adaptive feature enhancement and VICReg-based learning.

Details

Motivation: Vegetation index saturation during dense canopy stage and limited ground-truth annotations constrain accurate LAI/SPAD estimation. Existing VI-based and texture-driven ML methods have limited feature expressiveness, while deep learning suffers from domain gaps and high data demands.

Method: Proposes MCVI-SANet with Vegetation Index Saturation-Aware Block (VI-SABlock) for adaptive channel-spatial feature enhancement. Uses VICReg-based semi-supervised strategy for improved generalization. Employs vegetation height-informed dataset partitioning across growth stages.

Result: Achieves state-of-the-art accuracy: average R2 of 0.8123 and RMSE of 0.4796 for LAI; average R2 of 0.6846 and RMSE of 2.4222 for SPAD. Surpasses best baselines by 8.95% in LAI R2 and 8.17% in SPAD R2. Lightweight with only 0.10M parameters and high inference speed.

Conclusion: Integration of semi-supervised learning with agronomic priors provides promising approach for enhancing remote sensing-based precision agriculture, addressing key challenges of VI saturation and limited annotations.

Abstract: Vegetation index (VI) saturation during the dense canopy stage and limited ground-truth annotations of winter wheat constrain accurate estimation of LAI and SPAD. Existing VI-based and texture-driven machine learning methods exhibit limited feature expressiveness. In addition, deep learning baselines suffer from domain gaps and high data demands, which restrict their generalization. Therefore, this study proposes the Multi-Channel Vegetation Indices Saturation Aware Net (MCVI-SANet), a lightweight semi-supervised vision model. The model incorporates a newly designed Vegetation Index Saturation-Aware Block (VI-SABlock) for adaptive channel-spatial feature enhancement. It also integrates a VICReg-based semi-supervised strategy to further improve generalization. Datasets were partitioned using a vegetation height-informed strategy to maintain representativeness across growth stages. Experiments over 10 repeated runs demonstrate that MCVI-SANet achieves state-of-the-art accuracy. The model attains an average R2 of 0.8123 and RMSE of 0.4796 for LAI, and an average R2 of 0.6846 and RMSE of 2.4222 for SPAD. This performance surpasses the best-performing baselines, with improvements of 8.95% in average LAI R2 and 8.17% in average SPAD R2. Moreover, MCVI-SANet maintains high inference speed with only 0.10M parameters. Overall, the integration of semi-supervised learning with agronomic priors provides a promising approach for enhancing remote sensing-based precision agriculture.

Dunxing Zhang, Jiachen Lu, Han Yang, Lei Bao, Bo Song

Main category: cs.CV

TL;DR: ESSC-RM is a plug-and-play refinement framework that enhances existing Semantic Scene Completion models by adding a two-phase refinement module to improve semantic prediction accuracy.

Details

Motivation: Existing SSC models produce coarse voxel predictions that could benefit from refinement to improve semantic accuracy and scene completion quality.

Method: Two-phase approach: 1) Baseline SSC network produces coarse voxel prediction, 2) 3D U-Net-based Prediction Noise-Aware Module (PNAM) and Voxel-level Local Geometry Module (VLGM) refine predictions under multiscale supervision.

Result: ESSC-RM consistently improves semantic prediction performance on SemanticKITTI. When integrated with CGFormer, mIoU increased from 16.87% to 17.27%; with MonoScene, from 11.08% to 11.51%.

Conclusion: ESSC-RM serves as a general refinement framework applicable to a wide range of SSC models, demonstrating its plug-and-play capability to enhance existing architectures.

Abstract: We propose ESSC-RM, a plug-and-play Enhancing framework for Semantic Scene Completion with a Refinement Module, which can be seamlessly integrated into existing SSC models. ESSC-RM operates in two phases: a baseline SSC network first produces a coarse voxel prediction, which is subsequently refined by a 3D U-Net-based Prediction Noise-Aware Module (PNAM) and Voxel-level Local Geometry Module (VLGM) under multiscale supervision. Experiments on SemanticKITTI show that ESSC-RM consistently improves semantic prediction performance. When integrated into CGFormer and MonoScene, the mean IoU increases from 16.87% to 17.27% and from 11.08% to 11.51%, respectively. These results demonstrate that ESSC-RM serves as a general refinement framework applicable to a wide range of SSC models.

[160] Efficient Zero-Shot Inpainting with Decoupled Diffusion Guidance

Badr Moufad, Navid Bagheri Shouraki, Alain Oliviero Durmus, Thomas Hirtz, Eric Moulines, Jimmy Olsson, Yazid Janati

Main category: cs.CV

TL;DR: Proposes a new efficient zero-shot diffusion method for image editing that avoids expensive backpropagation through denoiser networks while maintaining strong consistency and quality.

Details

Motivation: Current zero-shot diffusion methods for image editing require computationally expensive vector-Jacobian products through denoiser networks at every reverse step, causing significant memory and runtime overhead.

Method: Introduces a new likelihood surrogate that yields simple and efficient Gaussian posterior transitions, eliminating the need for backpropagation through the denoiser network.

Result: Achieves strong observation consistency comparable to fine-tuned baselines, produces coherent high-quality reconstructions, and significantly reduces inference cost.

Conclusion: The proposed method provides an efficient zero-shot diffusion approach for image editing that maintains quality while dramatically reducing computational overhead.

Abstract: Diffusion models have emerged as powerful priors for image editing tasks such as inpainting and local modification, where the objective is to generate realistic content that remains consistent with observed regions. In particular, zero-shot approaches that leverage a pretrained diffusion model, without any retraining, have been shown to achieve highly effective reconstructions. However, state-of-the-art zero-shot methods typically rely on a sequence of surrogate likelihood functions, whose scores are used as proxies for the ideal score. This procedure however requires vector-Jacobian products through the denoiser at every reverse step, introducing significant memory and runtime overhead. To address this issue, we propose a new likelihood surrogate that yields simple and efficient to sample Gaussian posterior transitions, sidestepping the backpropagation through the denoiser network. Our extensive experiments show that our method achieves strong observation consistency compared with fine-tuned baselines and produces coherent, high-quality reconstructions, all while significantly reducing inference cost. Code is available at https://github.com/YazidJanati/ding.

[161] RecurGS: Interactive Scene Modeling via Discrete-State Recurrent Gaussian Fusion

Wenhao Hu, Haonan Zhou, Zesheng Li, Liu Liu, Jiacheng Dong, Zhizhong Su, Gaoang Wang

Main category: cs.CV

TL;DR: RecurGS is a recurrent fusion framework that incrementally integrates discrete Gaussian scene states into a single evolving 3D representation capable of interaction, supporting object-level manipulation and novel state synthesis without additional scans.

Details

Motivation: Existing 3D scene representation methods have limitations: they either focus on updating single scenes without supporting novel-state synthesis, or rely on diffusion-based approaches that work on one state at a time and cannot fuse information across multiple observations. There's a need for systems that can handle discrete scene changes and enable interactive 3D environments.

Method: RecurGS uses a recurrent fusion framework that: 1) detects object-level changes across consecutive states, 2) aligns geometric motion using semantic correspondence and Lie-algebra based SE(3) refinement, 3) performs recurrent updates preserving historical structures through replay supervision, and 4) employs a voxelized, visibility-aware fusion module that selectively incorporates newly observed regions while keeping stable areas fixed.

Result: The framework delivers high-quality reconstructions with substantially improved update efficiency, supports object-level manipulation, synthesizes novel scene states without requiring additional scans, and maintains photorealistic fidelity across evolving environments, as demonstrated in extensive experiments across synthetic and real-world datasets.

Conclusion: RecurGS provides a scalable step toward continuously interactive Gaussian worlds by addressing key limitations in existing 3D scene representation approaches, enabling efficient long-horizon updates while mitigating catastrophic forgetting.

Abstract: Recent advances in 3D scene representations have enabled high-fidelity novel view synthesis, yet adapting to discrete scene changes and constructing interactive 3D environments remain open challenges in vision and robotics. Existing approaches focus solely on updating a single scene without supporting novel-state synthesis. Others rely on diffusion-based object-background decoupling that works on one state at a time and cannot fuse information across multiple observations. To address these limitations, we introduce RecurGS, a recurrent fusion framework that incrementally integrates discrete Gaussian scene states into a single evolving representation capable of interaction. RecurGS detects object-level changes across consecutive states, aligns their geometric motion using semantic correspondence and Lie-algebra based SE(3) refinement, and performs recurrent updates that preserve historical structures through replay supervision. A voxelized, visibility-aware fusion module selectively incorporates newly observed regions while keeping stable areas fixed, mitigating catastrophic forgetting and enabling efficient long-horizon updates. RecurGS supports object-level manipulation, synthesizes novel scene states without requiring additional scans, and maintains photorealistic fidelity across evolving environments. Extensive experiments across synthetic and real-world datasets demonstrate that our framework delivers high-quality reconstructions with substantially improved update efficiency, providing a scalable step toward continuously interactive Gaussian worlds.

[162] Automated Mosaic Tesserae Segmentation via Deep Learning Techniques

Charilaos Kapelonis, Marios Antonakakis, Konstantinos Politof, Aristomenis Antoniadis, Michalis Zervakis

Main category: cs.CV

TL;DR: Fine-tuned SAM 2 model achieves improved mosaic tesserae segmentation with 91.02% IoU and 95.89% recall, plus new annotated dataset for the field.

Details

Motivation: Mosaics are important cultural heritage artifacts prone to damage, requiring digital preservation through accurate tesserae segmentation, but limited open datasets exist in this field.

Method: Leverage Segment Anything Model 2 (SAM 2) foundation model, fine-tune it using a newly created annotated dataset of mosaic images for automatic tesserae segmentation.

Result: Fine-tuned SAM 2 improves IoU from 89.00% to 91.02% and Recall from 92.12% to 95.89%. On benchmark, achieves 3% higher F-measure and reduces tesserae count error from 0.20 to 0.02.

Conclusion: The fine-tuned SAM 2 model with new annotated dataset enables accurate mosaic segmentation, paving way for real-time digital preservation of cultural heritage.

Abstract: Art is widely recognized as a reflection of civilization and mosaics represent an important part of cultural heritage. Mosaics are an ancient art form created by arranging small pieces, called tesserae, on a surface using adhesive. Due to their age and fragility, they are prone to damage, highlighting the need for digital preservation. This paper addresses the problem of digitizing mosaics by segmenting the tesserae to separate them from the background within the broader field of Image Segmentation in Computer Vision. We propose a method leveraging Segment Anything Model 2 (SAM 2) by Meta AI, a foundation model that outperforms most conventional segmentation models, to automatically segment mosaics. Due to the limited open datasets in the field, we also create an annotated dataset of mosaic images to fine-tune and evaluate the model. Quantitative evaluation on our testing dataset shows notable improvements compared to the baseline SAM 2 model, with Intersection over Union increasing from 89.00% to 91.02% and Recall from 92.12% to 95.89%. Additionally, on a benchmark proposed by a prior approach, our model achieves an F-measure 3% higher than previous methods and reduces the error in the absolute difference between predicted and actual tesserae from 0.20 to just 0.02. The notable performance of the fine-tuned SAM 2 model together with the newly annotated dataset can pave the way for real-time segmentation of mosaic images.

[163] Through the PRISm: Importance-Aware Scene Graphs for Image Retrieval

Dimitrios Georgoulopoulos, Nikolaos Chaidos, Angeliki Dimitriou, Giorgos Stamou

Main category: cs.CV

TL;DR: PRISm is a novel multimodal framework for image-to-image retrieval that uses importance prediction and graph neural networks to capture semantic relationships between objects, outperforming traditional methods.

Details

Motivation: Traditional image retrieval methods fail to capture relational and contextual nuances of scenes, lacking the ability to model semantic importance of objects and their interactions that align with human perception.

Method: Two key components: 1) Importance Prediction Module identifies and retains critical objects and relational triplets while pruning irrelevant elements; 2) Edge-Aware Graph Neural Network encodes relational structure and integrates global visual features to produce semantically informed image embeddings.

Result: Extensive experiments on benchmark and real-world datasets show consistently superior top-ranked performance. Qualitative analyses demonstrate PRISm accurately captures key objects and interactions, producing interpretable and semantically meaningful results.

Conclusion: PRISm advances image retrieval by explicitly modeling semantic importance of objects and their interactions, combining relational reasoning with visual representation to achieve retrieval that closely aligns with human perception.

Abstract: Accurately retrieving images that are semantically similar remains a fundamental challenge in computer vision, as traditional methods often fail to capture the relational and contextual nuances of a scene. We introduce PRISm (Pruning-based Image Retrieval via Importance Prediction on Semantic Graphs), a multimodal framework that advances image-to-image retrieval through two novel components. First, the Importance Prediction Module identifies and retains the most critical objects and relational triplets within an image while pruning irrelevant elements. Second, the Edge-Aware Graph Neural Network explicitly encodes relational structure and integrates global visual features to produce semantically informed image embeddings. PRISm achieves image retrieval that closely aligns with human perception by explicitly modeling the semantic importance of objects and their interactions, capabilities largely absent in prior approaches. Its architecture effectively combines relational reasoning with visual representation, enabling semantically grounded retrieval. Extensive experiments on benchmark and real-world datasets demonstrate consistently superior top-ranked performance, while qualitative analyses show that PRISm accurately captures key objects and interactions, producing interpretable and semantically meaningful results.

[164] AmPLe: Supporting Vision-Language Models via Adaptive-Debiased Ensemble Multi-Prompt Learning

Fei Song, Yi Li, Jiangmeng Li, Rui Wang, Changwen Zheng, Fanjiang Xu, Hui Xiong

Main category: cs.CV

TL;DR: AmPLe addresses model-prompt and sample-prompt matching biases in multi-prompt learning by using adaptive debiased ensemble weights guided by information theory.

Details

Motivation: Existing multi-prompt learning methods overlook model-prompt matching bias (same prompt has different semantics across different vision-language models) and sample-prompt matching bias (prompt-irrelevant semantics in input samples), which hinder performance on downstream tasks.

Method: Proposes Adaptive-Debiased Ensemble MultiPrompt Learning (AmPLe) that: 1) Uses ensemble learning to aggregate benefits from diverse predictions across models, 2) Extracts prompt-relevant semantics from input samples using information theory-based analysis, 3) Adaptively calculates debiased ensemble weights to mitigate both biases simultaneously.

Result: Extensive experiments on three tasks (generalization to novel classes, new target datasets, and unseen domain shifts) show AmPLe widely outperforms existing methods. Theoretical validation from a causal perspective further supports effectiveness.

Conclusion: AmPLe successfully addresses two key biases in multi-prompt learning through adaptive debiased ensemble weighting, achieving superior performance across diverse vision-language adaptation tasks with theoretical support.

Abstract: Multi-prompt learning methods have emerged as an effective approach for facilitating the rapid adaptation of vision-language models to downstream tasks with limited resources. Existing multi-prompt learning methods primarily focus on utilizing various meticulously designed prompts within a single foundation vision-language model to achieve superior performance. However, the overlooked model-prompt matching bias hinders the development of multi-prompt learning, i.e., the same prompt can convey different semantics across distinct vision-language models, such as CLIP-ViT-B/16 and CLIP-ViT-B/32, resulting in inconsistent predictions of identical prompt. To mitigate the impact of this bias on downstream tasks, we explore an ensemble learning approach to sufficiently aggregate the benefits of diverse predictions. Additionally, we further disclose the presence of sample-prompt matching bias, which originates from the prompt-irrelevant semantics encapsulated in the input samples. Thus, directly utilizing all information from the input samples for generating weights of ensemble learning can lead to suboptimal performance. In response, we extract prompt-relevant semantics from input samples by leveraging the guidance of the information theory-based analysis, adaptively calculating debiased ensemble weights. Overall, we propose Adaptive-Debiased Ensemble MultiPrompt Learning, abbreviated as AmPLe, to mitigate the two types of bias simultaneously. Extensive experiments on three representative tasks, i.e., generalization to novel classes, new target datasets, and unseen domain shifts, show that AmPLe can widely outperform existing methods. Theoretical validation from a causal perspective further supports the effectiveness of AmPLe.

[165] MeniMV: A Multi-view Benchmark for Meniscus Injury Severity Grading

Shurui Xu, Siqi Yang, Jiapin Ren, Zhong Cao, Hongwei Yang, Mengzhen Fan, Yuyu Sun, Shuyan Li

Main category: cs.CV

TL;DR: Meniscus horn tear grading benchmark dataset with 3,000 knee MRI exams, 6,000 dual-view images, and four-tier severity annotations for automated diagnosis.

Details

Motivation: Current automated MRI analysis lacks precise grading of meniscal horn tears, relying on coarse study-level labels or binary classification without localization and severity information needed for clinical diagnosis.

Method: Created MeniMV dataset with 3,000 knee MRI exams from 750 patients across three medical centers, providing 6,000 co-registered sagittal and coronal images with four-tier (grade 0-3) severity labels for anterior and posterior meniscal horns, verified by chief orthopedic physicians.

Result: MeniMV offers more than double the pathology-labeled data volume of prior datasets while uniquely capturing dual-view diagnostic context; benchmarked multiple CNN and Transformer models to establish strong baselines.

Conclusion: MeniMV provides a valuable foundation for future research in automated musculoskeletal imaging, highlighting challenges in severity grading and establishing a comprehensive benchmark for meniscal horn tear analysis.

Abstract: Precise grading of meniscal horn tears is critical in knee injury diagnosis but remains underexplored in automated MRI analysis. Existing methods often rely on coarse study-level labels or binary classification, lacking localization and severity information. In this paper, we introduce MeniMV, a multi-view benchmark dataset specifically designed for horn-specific meniscus injury grading. MeniMV comprises 3,000 annotated knee MRI exams from 750 patients across three medical centers, providing 6,000 co-registered sagittal and coronal images. Each exam is meticulously annotated with four-tier (grade 0-3) severity labels for both anterior and posterior meniscal horns, verified by chief orthopedic physicians. Notably, MeniMV offers more than double the pathology-labeled data volume of prior datasets while uniquely capturing the dual-view diagnostic context essential in clinical practice. To demonstrate the utility of MeniMV, we benchmark multiple state-of-the-art CNN and Transformer-based models. Our extensive experiments establish strong baselines and highlight challenges in severity grading, providing a valuable foundation for future research in automated musculoskeletal imaging.

[166] Object-Centric Framework for Video Moment Retrieval

Zongyao Li, Yongkang Wong, Satoshi Yamazaki, Jianquan Liu, Mohan Kankanhalli

Main category: cs.CV

TL;DR: Proposes object-centric video moment retrieval using scene graphs and relational tracklet transformers to capture fine-grained object semantics and temporal dynamics for better localization of object-oriented queries.

Details

Motivation: Existing video moment retrieval methods use frame/clip-level features that capture global visual/semantic info but miss fine-grained object semantics and appearance. They fail to handle object-oriented queries involving specific entities and interactions, and overlook temporal dynamics at object level.

Method: 1) Extract query-relevant objects using scene graph parser; 2) Generate scene graphs from video frames to represent objects and relationships; 3) Construct object-level feature sequences encoding visual/semantic info; 4) Process sequences with relational tracklet transformer to model spatio-temporal correlations among objects over time.

Result: Outperforms existing state-of-the-art methods on three benchmarks: Charades-STA, QVHighlights, and TACoS.

Conclusion: Object-centric framework with explicit modeling of object-level state changes enables more accurate localization of moments aligned with object-oriented queries, addressing limitations of existing global feature-based approaches.

Abstract: Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object semantics and appearance, which are crucial for localizing moments described by object-oriented queries involving specific entities and their interactions. In particular, temporal dynamics at the object level have been largely overlooked, limiting the effectiveness of existing approaches in scenarios requiring detailed object-level reasoning. To address this limitation, we propose a novel object-centric framework for moment retrieval. Our method first extracts query-relevant objects using a scene graph parser and then generates scene graphs from video frames to represent these objects and their relationships. Based on the scene graphs, we construct object-level feature sequences that encode rich visual and semantic information. These sequences are processed by a relational tracklet transformer, which models spatio-temporal correlations among objects over time. By explicitly capturing object-level state changes, our framework enables more accurate localization of moments aligned with object-oriented queries. We evaluated our method on three benchmarks: Charades-STA, QVHighlights, and TACoS. Experimental results demonstrate that our method outperforms existing state-of-the-art methods across all benchmarks.

[167] Plasticine: A Traceable Diffusion Model for Medical Image Translation

Tianyang Zhanng, Xinxing Cheng, Jun Cheng, Shaoming Zheng, He Zhao, Huazhu Fu, Alejandro F Frangi, Jiang Liu, Jinming Duan

Main category: cs.CV

TL;DR: Plasticine is the first end-to-end image-to-image translation framework with pixel-level traceability for medical images, combining intensity translation and spatial transformation in a diffusion framework.

Details

Motivation: Domain gaps in medical imaging from device/population variations challenge ML models. Existing translation methods lack spatial correspondence and traceability, which is crucial for clinical interpretability but has been overlooked.

Method: Combines intensity translation and spatial transformation within a denoising diffusion framework to generate synthetic images with interpretable intensity transitions and spatially coherent deformations.

Result: Enables pixel-wise traceability throughout the translation process, providing interpretable correspondences between original and translated images.

Conclusion: Plasticine addresses the critical need for traceability in medical image translation, enhancing clinical interpretability through explicit pixel-level correspondence preservation.

Abstract: Domain gaps arising from variations in imaging devices and population distributions pose significant challenges for machine learning in medical image analysis. Existing image-to-image translation methods primarily aim to learn mappings between domains, often generating diverse synthetic data with variations in anatomical scale and shape, but they usually overlook spatial correspondence during the translation process. For clinical applications, traceability, defined as the ability to provide pixel-level correspondences between original and translated images, is equally important. This property enhances clinical interpretability but has been largely overlooked in previous approaches. To address this gap, we propose Plasticine, which is, to the best of our knowledge, the first end-to-end image-to-image translation framework explicitly designed with traceability as a core objective. Our method combines intensity translation and spatial transformation within a denoising diffusion framework. This design enables the generation of synthetic images with interpretable intensity transitions and spatially coherent deformations, supporting pixel-wise traceability throughout the translation process.

[168] Adaptive-VoCo: Complexity-Aware Visual Token Compression for Vision-Language Models

Xiaoyang Guo, Keze Wang

Main category: cs.CV

TL;DR: Adaptive-VoCo enhances VoCo-LLaMA with adaptive visual compression that dynamically adjusts compression rates based on image complexity, improving efficiency and performance over fixed-rate methods.

Details

Motivation: While VoCo-LLaMA reduces computational costs by compressing visual tokens, its fixed compression rate cannot adapt to varying visual complexity levels, limiting efficiency and representational capacity.

Method: Augments VoCo-LLaMA with a lightweight predictor that dynamically selects optimal compression rates by quantifying image complexity using statistical cues (patch token entropy, attention map variance) from the vision encoder, plus a joint loss function integrating rate regularization with complexity alignment.

Result: Consistently outperforms fixed-rate baselines across multiple multimodal tasks, demonstrating improved efficiency and robustness.

Conclusion: Adaptive visual compression enables more efficient and robust VLMs by balancing inference efficiency with representational capacity according to visual complexity.

Abstract: In recent years, large-scale vision-language models (VLMs) have demonstrated remarkable performance on multimodal understanding and reasoning tasks. However, handling high-dimensional visual features often incurs substantial computational and memory costs. VoCo-LLaMA alleviates this issue by compressing visual patch tokens into a few VoCo tokens, reducing computational overhead while preserving strong cross-modal alignment. Nevertheless, such approaches typically adopt a fixed compression rate, limiting their ability to adapt to varying levels of visual complexity. To address this limitation, we propose Adaptive-VoCo, a framework that augments VoCo-LLaMA with a lightweight predictor for adaptive compression. This predictor dynamically selects an optimal compression rate by quantifying an image’s visual complexity using statistical cues from the vision encoder, such as patch token entropy and attention map variance. Furthermore, we introduce a joint loss function that integrates rate regularization with complexity alignment. This enables the model to balance inference efficiency with representational capacity, particularly in challenging scenarios. Experimental results show that our method consistently outperforms fixed-rate baselines across multiple multimodal tasks, highlighting the potential of adaptive visual compression for creating more efficient and robust VLMs.

[169] PlantDiseaseNet-RT50: A Fine-tuned ResNet50 Architecture for High-Accuracy Plant Disease Detection Beyond Standard CNNs

Santwana Sagnika, Manav Malhotra, Ishtaj Kaur Deol, Soumyajit Roy, Swarnav Kumar

Main category: cs.CV

TL;DR: PlantDiseaseNet-RT50 is a fine-tuned ResNet50 model achieving ~98% accuracy for automated plant disease detection, addressing the limitations of traditional visual inspection methods.

Details

Motivation: Plant diseases cause 70-80% of crop losses worldwide. Traditional detection methods rely on expert visual inspection, which is time-consuming, labor-intensive, and impractical for large-scale farming operations.

Method: Fine-tuned ResNet50 architecture with strategically unfrozen layers, custom classification head with regularization (batch normalization and dropout), and dynamic learning rate scheduling using cosine decay. Uses comprehensive dataset across multiple crop species.

Result: Achieves exceptional performance with approximately 98% accuracy, precision, and recall on plant disease detection across multiple crop species.

Conclusion: PlantDiseaseNet-RT50 represents a significant advancement in AI-driven agricultural tools, offering computationally efficient solution for rapid and accurate plant disease diagnosis that can be implemented in practical farming contexts to support timely interventions and reduce crop losses.

Abstract: Plant diseases pose a significant threat to agricultural productivity and global food security, accounting for 70-80% of crop losses worldwide. Traditional detection methods rely heavily on expert visual inspection, which is time-consuming, labour-intensive, and often impractical for large-scale farming operations. In this paper, we present PlantDiseaseNet-RT50, a novel fine-tuned deep learning architecture based on ResNet50 for automated plant disease detection. Our model features strategically unfrozen layers, a custom classification head with regularization mechanisms, and dynamic learning rate scheduling through cosine decay. Using a comprehensive dataset of distinct plant disease categories across multiple crop species, PlantDiseaseNet-RT50 achieves exceptional performance with approximately 98% accuracy, precision, and recall. Our architectural modifications and optimization protocol demonstrate how targeted fine-tuning can transform a standard pretrained model into a specialized agricultural diagnostic tool. We provide a detailed account of our methodology, including the systematic unfreezing of terminal layers, implementation of batch normalization and dropout regularization and application of advanced training techniques. PlantDiseaseNet-RT50 represents a significant advancement in AI-driven agricultural tools, offering a computationally efficient solution for rapid and accurate plant disease diagnosis that can be readily implemented in practical farming contexts to support timely interventions and reduce crop losses.

[170] NASTaR: NovaSAR Automated Ship Target Recognition Dataset

Benyamin Hosseiny, Kamirul Kamirul, Odysseas Pappas, Alin Achim

Main category: cs.CV

TL;DR: NASTaR dataset provides 3415 S-band SAR ship patches with AIS-matched labels for ship type classification, achieving 60-87% accuracy across different classification scenarios.

Details

Motivation: SAR enables all-weather maritime monitoring but ship type classification is challenging due to high diversity of ship types. Existing deep learning models require large, high-quality datasets, and the growing variety of SAR satellites with different frequencies/resolutions increases the need for more annotated datasets.

Method: Created NASTaR dataset with 3415 ship patches extracted from NovaSAR S-band imagery, matched to AIS data. Features include 23 unique ship classes, inshore/offshore separation, and auxiliary wake dataset. Validated using benchmark deep learning models across various classification scenarios.

Result: Achieved over 60% accuracy for classifying four major ship types, over 70% for three-class scenario, more than 75% for distinguishing cargo from tanker ships, and over 87% for identifying fishing vessels.

Conclusion: NASTaR dataset addresses the need for annotated SAR ship data and demonstrates practical applicability for ship type classification tasks, providing a valuable resource for maritime monitoring research.

Abstract: Synthetic Aperture Radar (SAR) offers a unique capability for all-weather, space-based maritime activity monitoring by capturing and imaging strong reflections from ships at sea. A well-defined challenge in this domain is ship type classification. Due to the high diversity and complexity of ship types, accurate recognition is difficult and typically requires specialized deep learning models. These models, however, depend on large, high-quality ground-truth datasets to achieve robust performance and generalization. Furthermore, the growing variety of SAR satellites operating at different frequencies and spatial resolutions has amplified the need for more annotated datasets to enhance model accuracy. To address this, we present the NovaSAR Automated Ship Target Recognition (NASTaR) dataset. This dataset comprises of 3415 ship patches extracted from NovaSAR S-band imagery, with labels matched to AIS data. It includes distinctive features such as 23 unique classes, inshore/offshore separation, and an auxiliary wake dataset for patches where ship wakes are visible. We validated the dataset applicability across prominent ship-type classification scenarios using benchmark deep learning models. Results demonstrate over 60% accuracy for classifying four major ship types, over 70% for a three-class scenario, more than 75% for distinguishing cargo from tanker ships, and over 87% for identifying fishing vessels. The NASTaR dataset is available at https://10.5523/bris, while relevant codes for benchmarking and analysis are available at https://github.com/benyaminhosseiny/nastar.

[171] GTMA: Dynamic Representation Optimization for OOD Vision-Language Models

Jensen Zhang, Ningyuan Liu, Keze Wang

Main category: cs.CV

TL;DR: GTMA framework improves VLM performance on OOD concepts by dynamically optimizing continuous pseudo-word embeddings that align with visual anchors, bypassing vocabulary limitations.

Details

Motivation: Vision-language models struggle with out-of-distribution concepts due to modal asymmetry: visual encoders can extract features from unseen images, but text encoders are limited by fixed vocabularies, causing cross-modal alignment collapse and degraded zero-shot performance.

Method: Proposes Guided Target-Matching Adaptation (GTMA) framework with dynamic representation optimization. At inference, constructs continuous pseudo-word embeddings that align with OOD image visual anchors. Uses adaptive gradient-based representation policy optimization with semantic regularization to preserve plausibility and compatibility with prior knowledge.

Result: GTMA improves zero-shot and few-shot OOD accuracy by 15-20% over base VLM on ImageNet-R and VISTA-Beyond benchmarks while maintaining in-distribution performance. Ablation studies confirm necessity of pseudo-word optimization.

Conclusion: GTMA effectively addresses modal asymmetry in VLMs by enabling dynamic representation optimization, allowing models to handle OOD concepts without being constrained by fixed vocabularies, significantly improving open-world application performance.

Abstract: Vision-language models (VLMs) struggle in open-world applications, where out-of-distribution (OOD) concepts often trigger cross-modal alignment collapse and severely degrade zero-shot performance. We identify the root cause as modal asymmetry: while the visual encoder can extract discriminative features from unseen images, the text encoder is constrained by a fixed discrete vocabulary and cannot synthesize new semantic anchors. Existing approaches such as CoOp or LoRA provide only partial remedies, as they remain confined to the pre-trained semantic space. To overcome this bottleneck, we propose dynamic representation optimization, realized through the Guided Target-Matching Adaptation (GTMA) framework. At inference time, GTMA constructs a continuous pseudo-word embedding that best aligns with an OOD image’s visual anchor, effectively bypassing vocabulary limitations. The optimization is driven by an adaptive gradient-based representation policy optimization algorithm, which incorporates semantic regularization to preserve plausibility and compatibility with the model’s prior knowledge. Experiments on ImageNet-R and the VISTA-Beyond benchmark demonstrate that GTMA improves zero-shot and few-shot OOD accuracy by up to 15-20 percent over the base VLM while maintaining performance on in-distribution concepts. Ablation studies further confirm the necessity of pseudo-word optimization.

[172] Detection of AI Generated Images Using Combined Uncertainty Measures and Particle Swarm Optimised Rejection Mechanism

Rahul Yumlembam, Biju Issac, Nauman Aslam, Eaby Kollonoor Babu, Josh Collyer, Fraser Kennedy

Main category: cs.CV

TL;DR: A robust AI-generated image detection framework using multiple uncertainty measures (Fisher Information, Monte Carlo Dropout entropy, and Deep Kernel Learning variance) with Particle Swarm Optimisation for optimal weighting and adaptive rejection thresholds.

Details

Motivation: As AI-generated images become increasingly photorealistic, distinguishing them from natural images is becoming more challenging. Existing detection methods often degrade under distribution shifts when faced with images from unseen AI generators.

Method: Combines three complementary uncertainty measures: Fisher Information (parameter sensitivity), Monte Carlo Dropout entropy (predictive variability), and Deep Kernel Learning predictive variance. Uses Particle Swarm Optimisation to learn optimal weightings and determine adaptive rejection thresholds. Trained on Stable Diffusion images and tested on multiple generators (GLIDE, VQDM, Midjourney, BigGAN, StyleGAN3).

Result: Combined Uncertainty measure achieves ~70% incorrect rejection rate on unseen generators, filtering most misclassified AI samples. Maintains high acceptance of accurate predictions for natural images and in-domain AI data. Under adversarial attacks, rejects ~61% of successful attacks (GP-based uncertainty alone achieves up to 80%).

Conclusion: Multi-source uncertainty fusion provides a resilient and adaptive solution for AI-generated image detection, maintaining robustness under distribution shifts and adversarial attacks while allowing conservative rejection behavior that supports retraining.

Abstract: As AI-generated images become increasingly photorealistic, distinguishing them from natural images poses a growing challenge. This paper presents a robust detection framework that leverages multiple uncertainty measures to decide whether to trust or reject a model’s predictions. We focus on three complementary techniques: Fisher Information, which captures the sensitivity of model parameters to input variations; entropy-based uncertainty from Monte Carlo Dropout, which reflects predictive variability; and predictive variance from a Deep Kernel Learning framework using a Gaussian Process classifier. To integrate these diverse uncertainty signals, Particle Swarm Optimisation is used to learn optimal weightings and determine an adaptive rejection threshold. The model is trained on Stable Diffusion-generated images and evaluated on GLIDE, VQDM, Midjourney, BigGAN, and StyleGAN3, each introducing significant distribution shifts. While standard metrics such as prediction probability and Fisher-based measures perform well in distribution, their effectiveness degrades under shift. In contrast, the Combined Uncertainty measure consistently achieves an incorrect rejection rate of approximately 70 percent on unseen generators, successfully filtering most misclassified AI samples. Although the system occasionally rejects correct predictions from newer generators, this conservative behaviour is acceptable, as rejected samples can support retraining. The framework maintains high acceptance of accurate predictions for natural images and in-domain AI data. Under adversarial attacks using FGSM and PGD, the Combined Uncertainty method rejects around 61 percent of successful attacks, while GP-based uncertainty alone achieves up to 80 percent. Overall, the results demonstrate that multi-source uncertainty fusion provides a resilient and adaptive solution for AI-generated image detection.

[173] WoundNet-Ensemble: A Novel IoMT System Integrating Self-Supervised Deep Learning and Multi-Model Fusion for Automated, High-Accuracy Wound Classification and Healing Progression Monitoring

Moses Kiprono

Main category: cs.CV

TL;DR: WoundNet-Ensemble: An IoMT system using ensemble of ResNet-50, DINOv2, and Swin Transformer achieves 99.90% accuracy for automated classification of six wound types, with 3.7% improvement over SOTA and integrated healing tracker.

Details

Motivation: Chronic wounds like diabetic foot ulcers affect up to one-third of diabetes patients, costing over $25B annually in U.S. healthcare. Current wound assessment is subjective, leading to inconsistent classification and delayed interventions.

Method: Internet of Medical Things system with novel ensemble of three complementary deep learning architectures: ResNet-50, self-supervised Vision Transformer DINOv2, and Swin Transformer. Uses weighted fusion strategy for automated classification of six wound types.

Result: Achieves 99.90% ensemble accuracy on comprehensive dataset of 5,175 wound images spanning six clinically distinct wound types. Demonstrates 3.7% improvement over previous state-of-the-art methods. Includes longitudinal wound healing tracker for healing rates, severity scores, and clinical alerts.

Conclusion: Presents robust, accurate, clinically deployable tool for modernizing wound care through AI, addressing critical needs in telemedicine and remote patient monitoring. Implementation and trained models will be publicly available for reproducibility.

Abstract: Chronic wounds, including diabetic foot ulcers which affect up to one-third of people with diabetes, impose a substantial clinical and economic burden, with U.S. healthcare costs exceeding 25 billion dollars annually. Current wound assessment remains predominantly subjective, leading to inconsistent classification and delayed interventions. We present WoundNet-Ensemble, an Internet of Medical Things system leveraging a novel ensemble of three complementary deep learning architectures: ResNet-50, the self-supervised Vision Transformer DINOv2, and Swin Transformer, for automated classification of six clinically distinct wound types. Our system achieves 99.90 percent ensemble accuracy on a comprehensive dataset of 5,175 wound images spanning diabetic foot ulcers, pressure ulcers, venous ulcers, thermal burns, pilonidal sinus wounds, and fungating malignant tumors. The weighted fusion strategy demonstrates a 3.7 percent improvement over previous state-of-the-art methods. Furthermore, we implement a longitudinal wound healing tracker that computes healing rates, severity scores, and generates clinical alerts. This work demonstrates a robust, accurate, and clinically deployable tool for modernizing wound care through artificial intelligence, addressing critical needs in telemedicine and remote patient monitoring. The implementation and trained models will be made publicly available to support reproducibility.

[174] Hierarchical Bayesian Framework for Multisource Domain Adaptation

Alexander M. Glandon, Khan M. Iftekharuddin

Main category: cs.CV

TL;DR: A Bayesian framework for multisource domain adaptation that leverages similarity between source domains to improve pretraining, achieving 17.29% accuracy improvement on human action recognition tasks.

Details

Motivation: Existing MDA methods are ad-hoc with either weight sharing or independent training of source models, lacking a principled approach that accounts for the typical similarity between source domain distributions.

Method: Proposes a Hierarchical Bayesian Framework that uses similarity between different source data distributions to optimize pretraining for multisource domain adaptation.

Result: The framework improves accuracy on recognition tasks and shows 17.29% improvement over state-of-the-art MDA methods on the Daily-DA RGB video benchmark for human action recognition.

Conclusion: The Bayesian framework provides a principled approach for MDA pretraining that effectively leverages source domain similarities, significantly outperforming existing methods on challenging benchmarks.

Abstract: Multisource domain adaptation (MDA) aims to use multiple source datasets with available labels to infer labels on a target dataset without available labels for target supervision. Prior works on MDA in the literature is ad-hoc as the pretraining of source models is either based on weight sharing or uses independently trained models. This work proposes a Bayesian framework for pretraining in MDA by considering that the distributions of different source domains are typically similar. The Hierarchical Bayesian Framework uses similarity between the different source data distributions to optimize the pretraining for MDA. Experiments using the proposed Bayesian framework for MDA show that our framework improves accuracy on recognition tasks for a large benchmark dataset. Performance comparison with state-of-the-art MDA methods on the challenging problem of human action recognition in multi-domain benchmark Daily-DA RGB video shows the proposed Bayesian Framework offers a 17.29% improvement in accuracy when compared to the state-of-the-art methods in the literature.

[175] Enhancing Medical Large Vision-Language Models via Alignment Distillation

Aofei Chang, Ting Wang, Fenglong Ma

Main category: cs.CV

TL;DR: MEDALIGN is a lightweight alignment distillation framework that improves medical vision-language models by transferring visual alignment knowledge from domain-specific CLIP models to address hallucination issues.

Details

Motivation: Medical Large Vision-Language Models suffer from hallucinated outputs due to misaligned visual understanding, specifically insufficient visual representation learning and poor visual attention alignment in clinical applications.

Method: Proposes MEDALIGN framework with two distillation losses: spatial-aware visual alignment loss based on visual token-level similarity structures, and attention-aware distillation loss that guides attention toward diagnostically relevant regions.

Result: Extensive experiments on medical report generation and medical VQA benchmarks show consistent improvements in both performance and interpretability, yielding more visually grounded outputs.

Conclusion: MEDALIGN effectively addresses hallucination issues in Med-LVLMs through simple, lightweight alignment distillation, improving visual understanding and attention alignment for better clinical applications.

Abstract: Medical Large Vision-Language Models (Med-LVLMs) have shown promising results in clinical applications, but often suffer from hallucinated outputs due to misaligned visual understanding. In this work, we identify two fundamental limitations contributing to this issue: insufficient visual representation learning and poor visual attention alignment. To address these problems, we propose MEDALIGN, a simple, lightweight alignment distillation framework that transfers visual alignment knowledge from a domain-specific Contrastive Language-Image Pre-training (CLIP) model to Med-LVLMs. MEDALIGN introduces two distillation losses: a spatial-aware visual alignment loss based on visual token-level similarity structures, and an attention-aware distillation loss that guides attention toward diagnostically relevant regions. Extensive experiments on medical report generation and medical visual question answering (VQA) benchmarks show that MEDALIGN consistently improves both performance and interpretability, yielding more visually grounded outputs.

[176] OpenView: Empowering MLLMs with Out-of-view VQA

Qixiang Chen, Cheng Zhang, Chi-Wing Fu, Jingwen Ye, Jianfei Cai

Main category: cs.CV

TL;DR: OpenView introduces out-of-view understanding for MLLMs, creating a pipeline to generate VQA data from panoramas, a synthetic dataset for fine-tuning, and a benchmark for evaluation, boosting MLLM performance from 48.6% to 64.1%.

Details

Motivation: Current MLLMs excel at in-view reasoning but lack the ability to understand content beyond the visible frame (out-of-view understanding), which is crucial for real-world applications where perspectives are limited.

Method: Threefold approach: 1) OpenView pipeline for generating multi-choice VQA from panoramic imagery with free-view framing, 2) OpenView-Dataset curation from diverse real-world panoramas for supervised fine-tuning, 3) OpenView-Bench benchmark for joint measurement of choice and rationale accuracy.

Result: MLLMs show large gap from human performance in OOV VQA (48.6% baseline), but when empowered by OpenView, they achieve consistent performance boosts, reaching 64.1% on average.

Conclusion: OpenView enables MLLMs to reason about out-of-view content, bridging a critical gap in multimodal understanding. The pipeline, dataset, and benchmark provide a foundation for advancing OOV capabilities in vision-language models.

Abstract: Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet, they perform well, mainly on reasoning in-view contents within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are threefold. First, we design OpenView, a four-stage pipeline to massively generate multi-choice VQA by leveraging panoramic imagery to enable context-rich and spatial-grounded VQA synthesis with free-view framing. Second, we curate OpenView-Dataset, a high-quality synthetic dataset from diverse real-world panoramas to empower MLLMs upon supervised fine-tuning. Third, we build OpenView-Bench, a benchmark that jointly measures choice and rationale accuracy for interpretable and diagnosable evaluation. Experimental results show that despite having a large gap from human performance in OOV VQA answer selection, upon empowered by OpenView, multiple MLLMs can consistently boost their performance, uplifted from 48.6% to 64.1% on average. Code, benchmark, and data will be available at https://github.com/q1xiangchen/OpenView.

[177] Placenta Accreta Spectrum Detection Using an MRI-based Hybrid CNN-Transformer Model

Sumaiya Ali, Areej Alhothali, Ohoud Alzamzami, Sameera Albasri, Ahmed Abduljabbar, Muhammad Alwazzan

Main category: cs.CV

TL;DR: A hybrid 3D DenseNet121-Vision Transformer model achieves 84.3% accuracy for automated detection of Placenta Accreta Spectrum from MRI scans, outperforming other 3D deep learning architectures.

Details

Motivation: Placenta Accreta Spectrum (PAS) is challenging to diagnose with MRI due to variability in radiologists' interpretations, creating a need for automated diagnostic tools to improve consistency and accuracy.

Method: Proposed a hybrid 3D deep learning model combining 3D DenseNet121 for local feature extraction and 3D Vision Transformer (ViT) for global spatial context modeling. Trained and evaluated on retrospective dataset of 1,133 MRI volumes, with comparison to multiple other 3D deep learning architectures.

Result: The DenseNet121-ViT model achieved highest performance with five-run average accuracy of 84.3% on independent test set, outperforming other evaluated 3D deep learning architectures.

Conclusion: Hybrid CNN-Transformer models show strong potential as computer-aided diagnosis tools for PAS, providing robust decision support to improve diagnostic consistency, accuracy, and timeliness across radiologist interpretations.

Abstract: Placenta Accreta Spectrum (PAS) is a serious obstetric condition that can be challenging to diagnose with Magnetic Resonance Imaging (MRI) due to variability in radiologists’ interpretations. To overcome this challenge, a hybrid 3D deep learning model for automated PAS detection from volumetric MRI scans is proposed in this study. The model integrates a 3D DenseNet121 to capture local features and a 3D Vision Transformer (ViT) to model global spatial context. It was developed and evaluated on a retrospective dataset of 1,133 MRI volumes. Multiple 3D deep learning architectures were also evaluated for comparison. On an independent test set, the DenseNet121-ViT model achieved the highest performance with a five-run average accuracy of 84.3%. These results highlight the strength of hybrid CNN-Transformer models as a computer-aided diagnosis tool. The model’s performance demonstrates a clear potential to assist radiologists by providing a robust decision support to improve diagnostic consistency across interpretations, and ultimately enhance the accuracy and timeliness of PAS diagnosis.

[178] Commercial Vehicle Braking Optimization: A Robust SIFT-Trajectory Approach

Zhe Li, Kun Cheng, Hanyue Mo, Jintao Lu, Ziwen Kuang, Jianwen Ye, Lixu Xu, Xinya Meng, Jiahui Zhao, Shengda Ji, Shuyuan Liu, Mengyu Wang

Main category: cs.CV

TL;DR: Vision-based system solves commercial vehicle AEB false braking at low speeds using camera and computer vision instead of unreliable CAN signals.

Details

Motivation: Commercial vehicle AEB systems suffer from "zero-speed braking" false activations at low speeds due to inaccurate CAN signals, causing safety issues and operational disruptions.

Method: Uses NVIDIA Jetson AGX Xavier to process blind spot camera video with CLAHE-enhanced SIFT feature extraction and KNN-RANSAC matching. Key innovations: 5-frame sliding window trajectory statistics, dual-threshold state decision matrix, and OBD-II driven dynamic ROI configuration.

Result: Achieved 99.96% F1-score for static detection, 97.78% for moving state recognition, 14.2ms processing delay. On-site deployment reduced false braking events by 89%, maintained 100% emergency braking success rate, with fault rate below 5%.

Conclusion: Vision-based trajectory analysis effectively solves low-speed false activation in commercial vehicle AEB systems, providing reliable motion state classification and significantly improving system safety and reliability.

Abstract: A vision-based trajectory analysis solution is proposed to address the “zero-speed braking” issue caused by inaccurate Controller Area Network (CAN) signals in commercial vehicle Automatic Emergency Braking (AEB) systems during low-speed operation. The algorithm utilizes the NVIDIA Jetson AGX Xavier platform to process sequential video frames from a blind spot camera, employing self-adaptive Contrast Limited Adaptive Histogram Equalization (CLAHE)-enhanced Scale-Invariant Feature Transform (SIFT) feature extraction and K-Nearest Neighbors (KNN)-Random Sample Consensus (RANSAC) matching. This allows for precise classification of the vehicle’s motion state (static, vibration, moving). Key innovations include 1) multiframe trajectory displacement statistics (5-frame sliding window), 2) a dual-threshold state decision matrix, and 3) OBD-II driven dynamic Region of Interest (ROI) configuration. The system effectively suppresses environmental interference and false detection of dynamic objects, directly addressing the challenge of low-speed false activation in commercial vehicle safety systems. Evaluation in a real-world dataset (32,454 video segments from 1,852 vehicles) demonstrates an F1-score of 99.96% for static detection, 97.78% for moving state recognition, and a processing delay of 14.2 milliseconds (resolution 704x576). The deployment on-site shows an 89% reduction in false braking events, a 100% success rate in emergency braking, and a fault rate below 5%.

[179] SimpleCall: A Lightweight Image Restoration Agent in Label-Free Environments with MLLM Perceptual Feedback

Jianglin Lu, Yuanwei Wu, Ziyi Zhao, Hongcheng Wang, Felix Jimenez, Abrar Majeedi, Yun Fu

Main category: cs.CV

TL;DR: A policy optimization framework trains a lightweight agent to determine optimal tool-calling sequences for complex image restoration, using MLLM-based rewards for label-free training, achieving SOTA performance with efficient inference.

Details

Motivation: Current restoration agents using vision-language/LLMs suffer from efficiency bottlenecks (reflection, rollback, iterative tool searching) and depend on degradation recognition models requiring extensive annotations, limiting applicability in label-free environments.

Method: Policy optimization-based restoration framework with a lightweight agent operating in sequential decision process to select restoration operations. Uses MLLM-driven reward mechanism as human-aligned evaluator for perceptual feedback in label-free training.

Result: Matches SOTA performance on full-reference metrics and surpasses existing approaches on no-reference metrics across diverse degradation scenarios, despite using no supervision. Executes deterministic restoration plans without redundant tool invocations, significantly accelerating inference.

Conclusion: The proposed framework addresses efficiency and annotation limitations of current restoration agents, enabling effective complex image restoration in label-free environments through policy optimization and MLLM-based rewards.

Abstract: Complex image restoration aims to recover high-quality images from inputs affected by multiple degradations such as blur, noise, rain, and compression artifacts. Recent restoration agents, powered by vision-language models and large language models, offer promising restoration capabilities but suffer from significant efficiency bottlenecks due to reflection, rollback, and iterative tool searching. Moreover, their performance heavily depends on degradation recognition models that require extensive annotations for training, limiting their applicability in label-free environments. To address these limitations, we propose a policy optimization-based restoration framework that learns an lightweight agent to determine tool-calling sequences. The agent operates in a sequential decision process, selecting the most appropriate restoration operation at each step to maximize final image quality. To enable training within label-free environments, we introduce a novel reward mechanism driven by multimodal large language models, which act as human-aligned evaluator and provide perceptual feedback for policy improvement. Once trained, our agent executes a deterministic restoration plans without redundant tool invocations, significantly accelerating inference while maintaining high restoration quality. Extensive experiments show that despite using no supervision, our method matches SOTA performance on full-reference metrics and surpasses existing approaches on no-reference metrics across diverse degradation scenarios.

[180] Text2Graph VPR: A Text-to-Graph Expert System for Explainable Place Recognition in Changing Environments

Saeideh Yousefzadeh, Hamidreza Pourreza

Main category: cs.CV

TL;DR: Text2Graph VPR: An explainable visual place recognition system that converts images to text descriptions, parses them into scene graphs, and uses hybrid graph matching for robust, interpretable localization.

Details

Motivation: Visual Place Recognition needs to go beyond pixel similarity for long-term deployment, requiring transparent, interpretable decisions that remain robust under lighting, weather, and seasonal changes. Current systems lack explainability and struggle with appearance shifts.

Method: Converts image sequences to textual scene descriptions, parses them into structured scene graphs (objects, attributes, relations), aggregates per-frame graphs into place representations, and uses dual-similarity retrieval: Graph Attention Network embeddings + Shortest-Path kernel for structural matching.

Result: Validated on Oxford RobotCar and MSLS benchmarks, demonstrating robust retrieval under severe appearance shifts, zero-shot operation with human textual queries, and improved transparency with human-readable intermediate representations.

Conclusion: Semantic, graph-based reasoning is a viable and interpretable alternative for place recognition, particularly suited to safety-sensitive and resource-constrained settings where transparency and robustness are critical.

Abstract: Visual Place Recognition (VPR) in long-term deployment requires reasoning beyond pixel similarity: systems must make transparent, interpretable decisions that remain robust under lighting, weather and seasonal change. We present Text2Graph VPR, an explainable semantic localization system that converts image sequences into textual scene descriptions, parses those descriptions into structured scene graphs, and reasons over the resulting graphs to identify places. Scene graphs capture objects, attributes and pairwise relations; we aggregate per-frame graphs into a compact place representation and perform retrieval with a dual-similarity mechanism that fuses learned Graph Attention Network (GAT) embeddings and a Shortest-Path (SP) kernel for structural matching. This hybrid design enables both learned semantic matching and topology-aware comparison, and – critically – produces human-readable intermediate representations that support diagnostic analysis and improve transparency in the decision process. We validate the system on Oxford RobotCar and MSLS (Amman/San Francisco) benchmarks and demonstrate robust retrieval under severe appearance shifts, along with zero-shot operation using human textual queries. The results illustrate that semantic, graph-based reasoning is a viable and interpretable alternative for place recognition, particularly suited to safety-sensitive and resource-constrained settings.

[181] PTTA: A Pure Text-to-Animation Framework for High-Quality Creation

Ruiqi Chen, Kaitong Cai, Yijia Fan, Keze Wang

Main category: cs.CV

TL;DR: PTTA is a pure text-to-animation framework that fine-tunes pretrained video models on high-quality animation data to outperform existing methods in animation video synthesis.

Details

Motivation: Traditional animation production is labor-intensive and expensive. While recent video generation models work well for natural videos, they struggle with animation generation. Existing animation-focused approaches like AniSora only work in image-to-video settings, leaving text-to-animation largely unexplored.

Method: PTTA constructs a small-scale but high-quality paired dataset of animation videos with textual descriptions. It then fine-tunes the pretrained text-to-video model HunyuanVideo to adapt it specifically for animation-style generation.

Result: Extensive visual evaluations across multiple dimensions show that PTTA consistently outperforms comparable baselines in animation video synthesis.

Conclusion: PTTA successfully addresses the gap in text-to-animation generation by adapting pretrained video models through targeted fine-tuning on high-quality animation data, achieving superior performance over existing methods.

Abstract: Traditional animation production involves complex pipelines and significant manual labor cost. While recent video generation models such as Sora, Kling, and CogVideoX achieve impressive results on natural video synthesis, they exhibit notable limitations when applied to animation generation. Recent efforts, such as AniSora, demonstrate promising performance by fine-tuning image-to-video models for animation styles, yet analogous exploration in the text-to-video setting remains limited. In this work, we present PTTA, a pure text-to-animation framework for high-quality animation creation. We first construct a small-scale but high-quality paired dataset of animation videos and textual descriptions. Building upon the pretrained text-to-video model HunyuanVideo, we perform fine-tuning to adapt it to animation-style generation. Extensive visual evaluations across multiple dimensions show that the proposed approach consistently outperforms comparable baselines in animation video synthesis.

[182] Uni-Neur2Img: Unified Neural Signal-Guided Image Generation, Editing, and Stylization via Diffusion Transformers

Xiyue Bai, Ronghao Yu, Jia Xiu, Pengfei Zhou, Jie Xia, Peng Ji

Main category: cs.CV

TL;DR: Uni-Neur2Img is a unified framework for neural signal-driven image generation and editing using parameter-efficient LoRA modules and causal attention for multi-modal conditioning.

Details

Motivation: Existing neural-driven generation research focuses too much on textual modalities, leaving visual modalities as direct conditioning signals underexplored. There's immense potential at the intersection of neuroscience, vision, and brain-computer interaction for generating/editing images directly from neural signals.

Method: Introduces a parameter-efficient LoRA-based neural signal injection module that processes each conditioning signal independently as pluggable components. Uses causal attention mechanism for long-sequence modeling in conditional generation tasks. Also introduces EEG-Style dataset for evaluation.

Result: Comprehensive evaluations across three tasks: (1) EEG-driven image generation on CVPR40 dataset, (2) neural signal-guided image editing on Loongx dataset for semantic-aware local modifications, and (3) EEG-driven style transfer on EEG-Style dataset. Shows significant improvements in generation fidelity, editing consistency, and style transfer quality with low computational overhead.

Conclusion: Uni-Neur2Img offers a unified, efficient, and extensible solution for bridging neural signals and visual content generation, demonstrating strong scalability to additional modalities while maintaining computational efficiency.

Abstract: Generating or editing images directly from Neural signals has immense potential at the intersection of neuroscience, vision, and Brain-computer interaction. In this paper, We present Uni-Neur2Img, a unified framework for neural signal-driven image generation and editing. The framework introduces a parameter-efficient LoRA-based neural signal injection module that independently processes each conditioning signal as a pluggable component, facilitating flexible multi-modal conditioning without altering base model parameters. Additionally, we employ a causal attention mechanism accommodate the long-sequence modeling demands of conditional generation tasks. Existing neural-driven generation research predominantly focuses on textual modalities as conditions or intermediate representations, resulting in limited exploration of visual modalities as direct conditioning signals. To bridge this research gap, we introduce the EEG-Style dataset. We conduct comprehensive evaluations across public benchmarks and self-collected neural signal datasets: (1) EEG-driven image generation on the public CVPR40 dataset; (2) neural signal-guided image editing on the public Loongx dataset for semantic-aware local modifications; and (3) EEG-driven style transfer on our self-collected EEG-Style dataset. Extensive experimental results demonstrate significant improvements in generation fidelity, editing consistency, and style transfer quality while maintaining low computational overhead and strong scalability to additional modalities. Thus, Uni-Neur2Img offers a unified, efficient, and extensible solution for bridging neural signals and visual content generation.

[183] Geometric-Photometric Event-based 3D Gaussian Ray Tracing

Kai Kohyama, Yoshimitsu Aoki, Guillermo Gallego, Shintaro Shiba

Main category: cs.CV

TL;DR: Event-based 3D Gaussian Splatting framework that decouples geometry and radiance rendering to leverage fine-grained temporal event information for improved 3D reconstruction.

Details

Motivation: Event cameras offer high temporal resolution but existing event-based 3D Gaussian Splatting approaches struggle to leverage fine-grained temporal information from sparse events, creating a trade-off between accuracy and temporal resolution.

Method: Decouples rendering into two branches: event-by-event geometry (depth) rendering using ray-tracing, and snapshot-based radiance (intensity) rendering using the image of warped events.

Result: Achieves state-of-the-art performance on real-world datasets and competitive performance on synthetic datasets. Works without prior information or COLMAP initialization, is flexible in event selection, achieves sharp edge reconstruction with fast training.

Conclusion: The framework successfully addresses the accuracy-temporal resolution trade-off in event-based 3DGS, deepening understanding of sparse event nature for 3D reconstruction, with code to be released.

Abstract: Event cameras offer a high temporal resolution over traditional frame-based cameras, which makes them suitable for motion and structure estimation. However, it has been unclear how event-based 3D Gaussian Splatting (3DGS) approaches could leverage fine-grained temporal information of sparse events. This work proposes a framework to address the trade-off between accuracy and temporal resolution in event-based 3DGS. Our key idea is to decouple the rendering into two branches: event-by-event geometry (depth) rendering and snapshot-based radiance (intensity) rendering, by using ray-tracing and the image of warped events. The extensive evaluation shows that our method achieves state-of-the-art performance on the real-world datasets and competitive performance on the synthetic dataset. Also, the proposed method works without prior information (e.g., pretrained image reconstruction models) or COLMAP-based initialization, is more flexible in the event selection number, and achieves sharp reconstruction on scene edges with fast training time. We hope that this work deepens our understanding of the sparse nature of events for 3D reconstruction. The code will be released.

[184] Adversarial Robustness in Zero-Shot Learning:An Empirical Study on Class and Concept-Level Vulnerabilities

Zhiyuan Peng, Zihan Ye, Shreyank N Gowda, Yuping Yan, Haotian Xu, Ling Shao

Main category: cs.CV

TL;DR: ZSL models are vulnerable to both class-level and concept-level adversarial attacks, with traditional class attacks showing spurious success in GZSL settings, while novel concept-based attacks can manipulate predictions by erasing or introducing concepts.

Details

Motivation: While Zero-shot Learning models promise better generalization and interpretability by mapping visual features to human-understandable concepts, their robustness under systematic input perturbations remains unclear and needs empirical investigation.

Method: Conducted empirical analysis of ZSL robustness at class-level (using non-target class attack) and concept-level (introducing two novel attack modes: Class-Preserving Concept Attack and Non-Class-Preserving Concept Attack). Proposed Class-Bias Enhanced Attack to address spurious attack success in GZSL settings. Evaluated three typical ZSL models across various architectures from past three years.

Result: ZSL models are vulnerable to both traditional class attacks and concept-based attacks. In GZSL, traditional class attacks show spurious success only at original best-calibrated points, while the proposed CBEA completely eliminates GZSL accuracy across all calibrated points. Concept-based attacks allow easy manipulation of predictions by erasing or introducing concepts.

Conclusion: There’s a significant performance gap between existing ZSL approaches regarding adversarial robustness, highlighting the need for improved robustness in current ZSL models against both class-level and concept-level attacks.

Abstract: Zero-shot Learning (ZSL) aims to enable image classifiers to recognize images from unseen classes that were not included during training. Unlike traditional supervised classification, ZSL typically relies on learning a mapping from visual features to predefined, human-understandable class concepts. While ZSL models promise to improve generalization and interpretability, their robustness under systematic input perturbations remain unclear. In this study, we present an empirical analysis about the robustness of existing ZSL methods at both classlevel and concept-level. Specifically, we successfully disrupted their class prediction by the well-known non-target class attack (clsA). However, in the Generalized Zero-shot Learning (GZSL) setting, we observe that the success of clsA is only at the original best-calibrated point. After the attack, the optimal bestcalibration point shifts, and ZSL models maintain relatively strong performance at other calibration points, indicating that clsA results in a spurious attack success in the GZSL. To address this, we propose the Class-Bias Enhanced Attack (CBEA), which completely eliminates GZSL accuracy across all calibrated points by enhancing the gap between seen and unseen class probabilities.Next, at concept-level attack, we introduce two novel attack modes: Class-Preserving Concept Attack (CPconA) and NonClass-Preserving Concept Attack (NCPconA). Our extensive experiments evaluate three typical ZSL models across various architectures from the past three years and reveal that ZSL models are vulnerable not only to the traditional class attack but also to concept-based attacks. These attacks allow malicious actors to easily manipulate class predictions by erasing or introducing concepts. Our findings highlight a significant performance gap between existing approaches, emphasizing the need for improved adversarial robustness in current ZSL models.

[185] SplatBright: Generalizable Low-Light Scene Reconstruction from Sparse Views via Physically-Guided Gaussian Enhancement

Yue Wen, Liang Song, Hesheng Wang

Main category: cs.CV

TL;DR: SplatBright is the first generalizable 3D Gaussian framework for joint low-light enhancement and reconstruction from sparse sRGB inputs, achieving superior novel view synthesis and cross-view consistency.

Details

Motivation: Low-light 3D reconstruction from sparse views is challenging due to exposure imbalance and degraded color fidelity. Existing methods struggle with view inconsistency and require per-scene training, lacking generalizable solutions.

Method: Proposes SplatBright with physically guided illumination modeling and geometry-appearance decoupling. Uses dual-branch predictor for stable 3D Gaussian initialization, illumination consistency with frequency priors, and appearance refinement module to separate illumination, material, and view-dependent cues. Trains with synthesized dark views via physics-based camera model.

Result: Extensive experiments on public and self-collected datasets show superior novel view synthesis, cross-view consistency, and better generalization to unseen low-light scenes compared to both 2D and 3D methods.

Conclusion: SplatBright successfully addresses low-light 3D reconstruction challenges by integrating physically guided illumination modeling with geometry-appearance decoupling, achieving state-of-the-art performance without per-scene training.

Abstract: Low-light 3D reconstruction from sparse views remains challenging due to exposure imbalance and degraded color fidelity. While existing methods struggle with view inconsistency and require per-scene training, we propose SplatBright, which is, to our knowledge, the first generalizable 3D Gaussian framework for joint low-light enhancement and reconstruction from sparse sRGB inputs. Our key idea is to integrate physically guided illumination modeling with geometry-appearance decoupling for consistent low-light reconstruction. Specifically, we adopt a dual-branch predictor that provides stable geometric initialization of 3D Gaussian parameters. On the appearance side, illumination consistency leverages frequency priors to enable controllable and cross-view coherent lighting, while an appearance refinement module further separates illumination, material, and view-dependent cues to recover fine texture. To tackle the lack of large-scale geometrically consistent paired data, we synthesize dark views via a physics-based camera model for training. Extensive experiments on public and self-collected datasets demonstrate that SplatBright achieves superior novel view synthesis, cross-view consistency, and better generalization to unseen low-light scenes compared with both 2D and 3D methods.

[186] SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse

Yiming Sun, Mi Zhang, Feifei Li, Geng Hong, Min Yang

Main category: cs.CV

TL;DR: SmartSight reduces hallucinations in Video-LLMs without training by generating multiple candidate responses and using temporal attention analysis to identify low-hallucinated outputs.

Details

Motivation: Video Large Language Models suffer from perceptual hallucinations that limit real-world applicability. Existing mitigation methods often compromise video understanding and reasoning capabilities.

Method: Training-free approach using model introspection: generates multiple candidate responses, assesses hallucinations via Temporal Attention Collapse score (measures over-focus on trivial temporal regions), and improves efficiency with Visual Attention Vanishing point for early termination.

Result: Reduces hallucinations for Qwen2.5-VL-7B by 10.59% on VRIPT-HAL while enhancing video understanding and reasoning, boosting performance on VideoMMMU by up to 8.86%.

Conclusion: SmartSight effectively improves reliability of open-source Video-LLMs by reducing hallucinations without compromising understanding capabilities.

Abstract: Despite Video Large Language Models having rapidly advanced in recent years, perceptual hallucinations pose a substantial safety risk, which severely restricts their real-world applicability. While several methods for hallucination mitigation have been proposed, they often compromise the model’s capacity for video understanding and reasoning. In this work, we propose SmartSight, a pioneering step to address this issue in a training-free manner by leveraging the model’s own introspective capabilities. Specifically, SmartSight generates multiple candidate responses to uncover low-hallucinated outputs that are often obscured by standard greedy decoding. It assesses the hallucination of each response using the Temporal Attention Collapse score, which measures whether the model over-focuses on trivial temporal regions of the input video when generating the response. To improve efficiency, SmartSight identifies the Visual Attention Vanishing point, enabling more accurate hallucination estimation and early termination of hallucinated responses, leading to a substantial reduction in decoding cost. Experiments show that SmartSight substantially lowers hallucinations for Qwen2.5-VL-7B by 10.59% on VRIPT-HAL, while simultaneously enhancing video understanding and reasoning, boosting performance on VideoMMMU by up to 8.86%. These results highlight SmartSight’s effectiveness in improving the reliability of open-source Video-LLMs.

[187] AsyncDiff: Asynchronous Timestep Conditioning for Enhanced Text-to-Image Diffusion Inference

Longhuan Xu, Feng Yin, Cunjian Chen

Main category: cs.CV

TL;DR: Asynchronous diffusion inference decouples latent state updates from denoiser conditioning timesteps using a learned timestep prediction module, improving image quality with fewer steps.

Details

Motivation: Traditional text-to-image diffusion uses synchronized schedules where latent state updates and denoiser conditioning happen at the same timestep, which may not be optimal. The authors propose that decoupling these could allow more flexible and efficient inference.

Method: Introduces asynchronous inference with a lightweight timestep prediction module (TPM) trained using Group Relative Policy Optimization (GRPO). TPM selects conditioning timesteps based on current state, effectively choosing noise levels to control detail. A scaling hyperparameter allows interpolation between original and de-synchronized timesteps.

Result: Evaluated on Stable Diffusion 3.5 Medium (15 steps) and Flux.1-dev (10 steps) across MS-COCO 2014 and T2I-CompBench. Optimizes composite reward averaging Image Reward, HPSv2, CLIP Score, and Pick Score, showing consistent improvement over synchronized baselines.

Conclusion: Asynchronous inference with learned timestep prediction improves diffusion model performance with fewer inference steps, offering a flexible mechanism to balance image detail and textural richness through timestep decoupling.

Abstract: Text-to-image diffusion inference typically follows synchronized schedules, where the numerical integrator advances the latent state to the same timestep at which the denoiser is conditioned. We propose an asynchronous inference mechanism that decouples these two, allowing the denoiser to be conditioned at a different, learned timestep while keeping image update schedule unchanged. A lightweight timestep prediction module (TPM), trained with Group Relative Policy Optimization (GRPO), selects a more feasible conditioning timestep based on the current state, effectively choosing a desired noise level to control image detail and textural richness. At deployment, a scaling hyper-parameter can be used to interpolate between the original and de-synchronized timesteps, enabling conservative or aggressive adjustments. To keep the study computationally affordable, we cap the inference at 15 steps for SD3.5 and 10 steps for Flux. Evaluated on Stable Diffusion 3.5 Medium and Flux.1-dev across MS-COCO 2014 and T2I-CompBench datasets, our method optimizes a composite reward that averages Image Reward, HPSv2, CLIP Score and Pick Score, and shows consistent improvement.

[188] brat: Aligned Multi-View Embeddings for Brain MRI Analysis

Maxime Kayser, Maksim Gridnev, Wanting Wang, Max Bain, Aneesh Rangnekar, Avijit Chatterjee, Aleksandr Petrov, Harini Veeraraghavan, Nathaniel C. Swinburne

Main category: cs.CV

TL;DR: BRAT is a multi-view representation learning framework for brain MRI that aligns 3D scans with clinical reports using document retrieval-inspired techniques and quality-diversity concepts.

Details

Motivation: Brain MRIs present unique challenges due to numerous, highly varied, and subtle abnormalities localized to few slices within 3D volumes, requiring better representation learning approaches.

Method: Multi-view pre-training approach inspired by document retrieval, with implicit query-feature matching and quality-diversity concepts to obtain multi-view embeddings aligned with clinical report sentences.

Result: Demonstrates substantial performance improvements across multiple vision-language and vision tasks, using a brain MRI dataset 10x larger than existing ones (80,000 3D scans with reports).

Conclusion: BRAT foundation models are publicly released as an effective framework for brain MRI representation learning that aligns imaging data with clinical text features.

Abstract: We present brat (brain report alignment transformer), a multi-view representation learning framework for brain magnetic resonance imaging (MRI) trained on MRIs paired with clinical reports. Brain MRIs present unique challenges due to the presence of numerous, highly varied, and often subtle abnormalities that are localized to a few slices within a 3D volume. To address these challenges, we introduce a brain MRI dataset $10\times$ larger than existing ones, containing approximately 80,000 3D scans with corresponding radiology reports, and propose a multi-view pre-training approach inspired by advances in document retrieval. We develop an implicit query-feature matching mechanism and adopt concepts from quality-diversity to obtain multi-view embeddings of MRIs that are aligned with the clinical features given by report sentences. We evaluate our approach across multiple vision-language and vision tasks, demonstrating substantial performance improvements. The brat foundation models are publicly released.

[189] A Study of Finetuning Video Transformers for Multi-view Geometry Tasks

Huimin Wu, Kwang-Ting Cheng, Stephen Lin, Zhirong Wu

Main category: cs.CV

TL;DR: Fine-tuning video foundation models for multi-view geometry tasks achieves state-of-the-art performance with minimal adaptation, showing that general-purpose attention learns geometric reasoning.

Details

Motivation: Previous methods require custom architectures and task-specific pretraining for multi-view geometry tasks. This paper investigates whether general-purpose video foundation models can be transferred to these tasks with minimal adaptation.

Method: Fine-tune video-pretrained vision transformers for multi-view geometry tasks by simply appending a linear decoder to the Transformer backbone. Use iterative refinement to further improve performance.

Result: Achieves state-of-the-art cross-dataset generalization for optical flow: EPE of 0.69 (Sintel clean), 1.78 (Sintel final), 3.15 (KITTI). Sets new record on online test benchmark with EPE 0.79, 1.88 and F1 3.79. Also shows strong performance on 3D depth estimation and stereo matching.

Conclusion: General-purpose video foundation models can be effectively transferred to multi-view geometry tasks with minimal adaptation, demonstrating that attention mechanisms learn both temporal and spatial information for geometric reasoning.

Abstract: This paper presents an investigation of vision transformer learning for multi-view geometry tasks, such as optical flow estimation, by fine-tuning video foundation models. Unlike previous methods that involve custom architectural designs and task-specific pretraining, our research finds that general-purpose models pretrained on videos can be readily transferred to multi-view problems with minimal adaptation. The core insight is that general-purpose attention between patches learns temporal and spatial information for geometric reasoning. We demonstrate that appending a linear decoder to the Transformer backbone produces satisfactory results, and iterative refinement can further elevate performance to stateof-the-art levels. This conceptually simple approach achieves top cross-dataset generalization results for optical flow estimation with end-point error (EPE) of 0.69, 1.78, and 3.15 on the Sintel clean, Sintel final, and KITTI datasets, respectively. Our method additionally establishes a new record on the online test benchmark with EPE values of 0.79, 1.88, and F1 value of 3.79. Applications to 3D depth estimation and stereo matching also show strong performance, illustrating the versatility of video-pretrained models in addressing geometric vision tasks.

[190] EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images

Jongmin Park, Minh-Quan Viet Bui, Juan Luis Gonzalez Bello, Jaeho Moon, Jihyong Oh, Munchurl Kim

Main category: cs.CV

TL;DR: EcoSplat is an efficiency-controllable feed-forward 3D Gaussian Splatting framework that adaptively predicts 3D representations for any target primitive count at inference time.

Details

Motivation: Existing feed-forward 3DGS methods predict pixel-aligned primitives per-view, producing excessive primitives in dense-view settings and offering no explicit control over the number of predicted Gaussians, limiting flexibility for downstream tasks.

Method: Two-stage optimization: 1) Pixel-aligned Gaussian Training (PGT) learns initial primitive prediction; 2) Importance-aware Gaussian Finetuning (IGF) learns to rank primitives and adaptively adjust parameters based on target primitive count.

Result: EcoSplat outperforms state-of-the-art methods under strict primitive-count constraints across multiple dense-view settings, demonstrating robustness and suitability for flexible downstream rendering tasks.

Conclusion: EcoSplat provides the first efficiency-controllable feed-forward 3DGS framework that enables adaptive 3D representation prediction for any target primitive count, addressing limitations of existing methods in dense-view settings.

Abstract: Feed-forward 3D Gaussian Splatting (3DGS) enables efficient one-pass scene reconstruction, providing 3D representations for novel view synthesis without per-scene optimization. However, existing methods typically predict pixel-aligned primitives per-view, producing an excessive number of primitives in dense-view settings and offering no explicit control over the number of predicted Gaussians. To address this, we propose EcoSplat, the first efficiency-controllable feed-forward 3DGS framework that adaptively predicts the 3D representation for any given target primitive count at inference time. EcoSplat adopts a two-stage optimization process. The first stage is Pixel-aligned Gaussian Training (PGT) where our model learns initial primitive prediction. The second stage is Importance-aware Gaussian Finetuning (IGF) stage where our model learns rank primitives and adaptively adjust their parameters based on the target primitive count. Extensive experiments across multiple dense-view settings show that EcoSplat is robust and outperforms state-of-the-art methods under strict primitive-count constraints, making it well-suited for flexible downstream rendering tasks.

[191] Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts

Linwei Qiu, Gongzhe Li, Xiaozhe Zhang, Qinlin Sun, Fengying Xie

Main category: cs.CV

TL;DR: UniRect is a unified framework for image correction/rectangling that handles diverse distortions through a task-agnostic approach with deformation and restoration modules, achieving SOTA performance.

Details

Motivation: Existing deep learning methods for image correction and rectangling rely on task-specific architectures, which restricts their generalization ability and effective application across different tasks. There's a need for a unified approach that can handle various practical photography tasks consistently.

Method: UniRect incorporates various task-specific inverse problems into a general distortion model by simulating different lens types. It uses a dual-component structure: 1) Deformation Module with Residual Progressive Thin-Plate Spline (RP-TPS) for geometric deformations, and 2) Restoration Module with Residual Mamba Blocks (RMBs) to counteract degradation. A Sparse Mixture-of-Experts (SMoEs) structure handles multi-task learning with varying distortions.

Result: Extensive experiments demonstrate that UniRect achieves state-of-the-art performance compared with other up-to-date methods in image correction and rectangling tasks.

Conclusion: UniRect provides a comprehensive unified framework for practical photography tasks that addresses limitations of task-specific architectures, offering better generalization and performance through its consistent distortion rectification perspective and innovative components.

Abstract: Image correction and rectangling are valuable tasks in practical photography systems such as smartphones. Recent remarkable advancements in deep learning have undeniably brought about substantial performance improvements in these fields. Nevertheless, existing methods mainly rely on task-specific architectures. This significantly restricts their generalization ability and effective application across a wide range of different tasks. In this paper, we introduce the Unified Rectification Framework (UniRect), a comprehensive approach that addresses these practical tasks from a consistent distortion rectification perspective. Our approach incorporates various task-specific inverse problems into a general distortion model by simulating different types of lenses. To handle diverse distortions, UniRect adopts one task-agnostic rectification framework with a dual-component structure: a {Deformation Module}, which utilizes a novel Residual Progressive Thin-Plate Spline (RP-TPS) model to address complex geometric deformations, and a subsequent Restoration Module, which employs Residual Mamba Blocks (RMBs) to counteract the degradation caused by the deformation process and enhance the fidelity of the output image. Moreover, a Sparse Mixture-of-Experts (SMoEs) structure is designed to circumvent heavy task competition in multi-task learning due to varying distortions. Extensive experiments demonstrate that our models have achieved state-of-the-art performance compared with other up-to-date methods.

[192] Breast Cancer Recurrence Risk Prediction Based on Multiple Instance Learning

Jinqiu Chen, Huyan Xu

Main category: cs.CV

TL;DR: Deep learning on H&E whole-slide images using MIL frameworks can predict breast cancer recurrence risk with good accuracy, correlating with genomic risk scores.

Details

Motivation: Predicting breast cancer recurrence risk is critical for clinical decision-making, and current methods like 21-gene Recurrence Score are expensive and time-consuming. There's a need for automated, cost-effective alternatives using routine histology slides.

Method: Developed and compared three Multiple Instance Learning (MIL) frameworks (CLAM-SB, ABMIL, and ConvNeXt-MIL-XGBoost) on 210 patient H&E whole-slide images. Features were extracted using UNI and CONCH pre-trained models, trained to predict 5-year recurrence risk categorized into low/medium/high tiers based on 21-gene Recurrence Score ground truth.

Result: Modified CLAM-SB model performed best with mean AUC of 0.836 and classification accuracy of 76.2% in 5-fold cross-validation, demonstrating strong correlation with genomic risk stratification.

Conclusion: Deep learning on standard H&E slides offers a feasible, automated approach for genomics-correlated breast cancer risk stratification, providing a promising pathway toward rapid, cost-effective clinical decision support.

Abstract: Predicting breast cancer recurrence risk is a critical clinical challenge. This study investigates the potential of computational pathology to stratify patients using deep learning on routine Hematoxylin and Eosin (H&E) stained whole-slide images (WSIs). We developed and compared three Multiple Instance Learning (MIL) frameworks – CLAM-SB, ABMIL, and ConvNeXt-MIL-XGBoost – on an in-house dataset of 210 patient cases. The models were trained to predict 5-year recurrence risk, categorized into three tiers (low, medium, high), with ground truth labels established by the 21-gene Recurrence Score. Features were extracted using the UNI and CONCH pre-trained models. In a 5-fold cross-validation, the modified CLAM-SB model demonstrated the strongest performance, achieving a mean Area Under the Curve (AUC) of 0.836 and a classification accuracy of 76.2%. Our findings demonstrate the feasibility of using deep learning on standard histology slides for automated, genomics-correlated risk stratification, highlighting a promising pathway toward rapid and cost-effective clinical decision support.

[193] $M^3-Verse$: A “Spot the Difference” Challenge for Large Multimodal Models

Kewei Wei, Bocheng Hu, Jie Cao, Xiaohan Chen, Zhengxi Lu, Wubing Xia, Weili Xu, Jiaao Wu, Junchen He, Mingyu Jia, Ciyun Zhao, Ye Sun, Yizhi Li, Zhonghan Zhao, Jian Zhang, Gaoang Wang

Main category: cs.CV

TL;DR: M³-Verse is a new benchmark for evaluating LMMs’ ability to understand dynamic state changes between paired videos of indoor scenes before and after transformations.

Details

Motivation: Current LMMs excel at static image and single-state understanding but lack capability to comprehend dynamic object changes between distinct video observations in shared spatial contexts, which is crucial for spatial intelligence.

Method: Created a Multi-Modal, Multi-State, Multi-Dimensional benchmark with 270 scenes and 2,932 questions categorized into 50+ subtasks probing 4 core capabilities, built on paired videos showing multi-perspective observations before/after state changes.

Result: Evaluation of 16 state-of-the-art LMMs revealed limitations in tracking state transitions. Proposed a simple yet effective baseline that achieved significant performance improvements in multi-state perception.

Conclusion: M³-Verse provides a challenging testbed to catalyze development of next-generation models with more holistic understanding of dynamic visual world, with construction pipeline and benchmark data publicly available.

Abstract: Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations, remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3-Verse$ thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. You can get the construction pipeline from https://github.com/Wal-K-aWay/M3-Verse_pipeline and full benchmark data from https://www.modelscope.cn/datasets/WalKaWay/M3-Verse.

[194] AMLID: An Adaptive Multispectral Landmine Identification Dataset for Drone-Based Detection

James E. Gallagher, Edward J. Oughton

Main category: cs.CV

TL;DR: AMLID is the first open-source RGB-LWIR dataset for UAS-based landmine detection, containing 12,078 labeled images of 21 landmine types across diverse conditions to democratize humanitarian demining research.

Details

Motivation: Landmines cause 26,000 annual casualties globally, but current detection methods are hazardous, inefficient, and expensive. There's a need for safer, more accessible research tools for humanitarian demining.

Method: Created AMLID dataset combining RGB and Long-Wave Infrared imagery from Unmanned Aerial Systems. Dataset includes 12,078 labeled images covering 21 landmine types, 11 fusion levels, 4 altitudes, 2 seasons, and 3 illumination conditions.

Result: First open-source multispectral dataset for landmine detection that enables researchers to develop and benchmark adaptive detection algorithms without access to live ordnance or expensive infrastructure.

Conclusion: AMLID democratizes humanitarian demining research by providing comprehensive multispectral data across diverse environmental variables, facilitating safer and more accessible development of landmine detection technologies.

Abstract: Landmines remain a persistent humanitarian threat, with an estimated 110 million mines deployed across 60 countries, claiming approximately 26,000 casualties annually. Current detection methods are hazardous, inefficient, and prohibitively expensive. We present the Adaptive Multispectral Landmine Identification Dataset (AMLID), the first open-source dataset combining Red-Green-Blue (RGB) and Long-Wave Infrared (LWIR) imagery for Unmanned Aerial Systems (UAS)-based landmine detection. AMLID comprises of 12,078 labeled images featuring 21 globally deployed landmine types across anti-personnel and anti-tank categories in both metal and plastic compositions. The dataset spans 11 RGB-LWIR fusion levels, four sensor altitudes, two seasonal periods, and three daily illumination conditions. By providing comprehensive multispectral coverage across diverse environmental variables, AMLID enables researchers to develop and benchmark adaptive detection algorithms without requiring access to live ordnance or expensive data collection infrastructure, thereby democratizing humanitarian demining research.

[195] InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search

Kaican Li, Lewei Yao, Jiannan Wu, Tiezheng Yu, Jierun Chen, Haoli Bai, Lu Hou, Lanqing Hong, Wei Zhang, Nevin L. Zhang

Main category: cs.CV

TL;DR: O3-Bench is a new benchmark for evaluating multimodal reasoning requiring interleaved attention to visual details, and InSight-o3 is a multi-agent framework with specialized visual search and reasoning agents that significantly improves performance on challenging multimodal tasks.

Details

Motivation: Current open multimodal AI agents lack sufficient reasoning capabilities for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps, creating a gap in sophisticated multimodal reasoning.

Method: Introduces O3-Bench benchmark for evaluating multimodal reasoning, and proposes InSight-o3 framework with two agents: vReasoner (visual reasoning) and vSearcher (generalized visual search trained via RL for locating relational/fuzzy/conceptual regions).

Result: O3-Bench is highly challenging (OpenAI o3 achieves only 40.8% accuracy). InSight-o3’s vSearcher significantly improves frontier multimodal models’ performance across benchmarks when used as a plug-and-play component.

Conclusion: The work represents a concrete step toward powerful open multimodal reasoning systems by addressing the visual reasoning gap through specialized multi-agent frameworks and challenging benchmarks.

Abstract: The ability for AI agents to “think with images” requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which only obtains 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher) for which we introduce the task of generalized visual search – locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at https://github.com/m-Just/InSight-o3 .

[196] Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation

Tianrui Zhu, Shiyi Zhang, Zhirui Sun, Jingqi Tian, Yansong Tang

Main category: cs.CV

TL;DR: MAG framework decouples memory compression and frame generation for long video generation, solving catastrophic forgetting vs memory cost trade-off.

Details

Motivation: Current frame-AR models for long video generation face trade-off: window attention causes catastrophic forgetting of historical context, while full history retention is memory prohibitive.

Method: Proposes Memorize-and-Generate (MAG) framework with two components: 1) memory model to compress historical information into compact KV cache, 2) generator model to synthesize frames using compressed representation. Also introduces MAG-Bench for evaluation.

Result: MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.

Conclusion: The decoupled approach effectively addresses the memory-forgetting trade-off in long video generation, enabling better historical context retention without prohibitive memory costs.

Abstract: Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose \textbf{Memorize-and-Generate (MAG)}, a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce \textbf{MAG-Bench} to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.

[197] IPCV: Information-Preserving Compression for MLLM Visual Encoders

Yuan Chen, Zichen Wen, Yuzhou Wu, Xuyang Liu, Shuang Chen, Junpeng Ma, Weijia Li, Conghui He, Linfeng Zhang

Main category: cs.CV

TL;DR: IPCV is a training-free compression framework that reduces MLLM computational cost by aggressively pruning visual tokens inside the Vision Transformer while preserving textually important information through neighbor-guided reconstruction and attention stabilization.

Details

Motivation: MLLMs have high computational costs due to numerous visual tokens processed by ViT encoders. Existing token pruning methods are inadequate: LLM-stage pruning ignores ViT overhead, while conventional ViT pruning without language guidance risks losing textually critical visual information and causes feature distortions.

Method: IPCV uses Neighbor-Guided Reconstruction (NGR) to temporarily reconstruct pruned tokens for attention computation with minimal overhead, then fully restores them before passing to LLM. Also introduces Attention Stabilization (AS) to approximate K/V of pruned tokens, which can enhance previous LLM-side pruning methods.

Result: IPCV substantially reduces end-to-end computation and outperforms state-of-the-art training-free token compression methods across diverse image and video benchmarks.

Conclusion: IPCV provides an effective training-free compression framework for MLLM visual encoders that preserves information while reducing computational overhead, with potential to enhance existing LLM-side pruning methods.

Abstract: Multimodal Large Language Models (MLLMs) deliver strong vision-language performance but at high computational cost, driven by numerous visual tokens processed by the Vision Transformer (ViT) encoder. Existing token pruning strategies are inadequate: LLM-stage token pruning overlooks the ViT’s overhead, while conventional ViT token pruning, without language guidance, risks discarding textually critical visual cues and introduces feature distortions amplified by the ViT’s bidirectional attention. To meet these challenges, we propose IPCV, a training-free, information-preserving compression framework for MLLM visual encoders. IPCV enables aggressive token pruning inside the ViT via Neighbor-Guided Reconstruction (NGR) that temporarily reconstructs pruned tokens to participate in attention with minimal overhead, then fully restores them before passing to the LLM. Besides, we introduce Attention Stabilization (AS) to further alleviate the negative influence from token pruning by approximating the K/V of pruned tokens. It can be directly applied to previous LLM-side token pruning methods to enhance their performance. Extensive experiments show that IPCV substantially reduces end-to-end computation and outperforms state-of-the-art training-free token compression methods across diverse image and video benchmarks. Our code is available at https://github.com/Perkzi/IPCV.

[198] Context-Aware Network Based on Multi-scale Spatio-temporal Attention for Action Recognition in Videos

Xiaoyang Li, Wenzhu Yang, Kanglin Wang, Tiebiao Wang, Qingsong Fei

Main category: cs.CV

TL;DR: CAN (Context-Aware Network) achieves competitive action recognition performance by capturing multi-scale spatio-temporal cues through MTCM for temporal features and GSCM for spatial features.

Details

Motivation: Existing action recognition methods often overlook the multi-granularity nature of actions, failing to comprehensively capture spatio-temporal cues across various scales needed for robust video understanding.

Method: Proposes Context-Aware Network (CAN) with two core modules: Multi-scale Temporal Cue Module (MTCM) extracts temporal cues at multiple scales to capture both fast-changing motion details and overall action flow; Group Spatial Cue Module (GSCM) extracts spatial cues at different scales by grouping feature maps and applying specialized extraction methods to each group.

Result: Achieves competitive performance on five benchmark datasets: 50.4% on Something-Something V1, 63.9% on Something-Something V2, 88.4% on Diving48, 74.9% on Kinetics-400, and 86.9% on UCF101, outperforming most mainstream methods.

Conclusion: Capturing multi-scale spatio-temporal cues is crucial for robust action recognition, and the proposed CAN framework effectively addresses the multi-granularity nature of actions through its specialized temporal and spatial modules.

Abstract: Action recognition is a critical task in video understanding, requiring the comprehensive capture of spatio-temporal cues across various scales. However, existing methods often overlook the multi-granularity nature of actions. To address this limitation, we introduce the Context-Aware Network (CAN). CAN consists of two core modules: the Multi-scale Temporal Cue Module (MTCM) and the Group Spatial Cue Module (GSCM). MTCM effectively extracts temporal cues at multiple scales, capturing both fast-changing motion details and overall action flow. GSCM, on the other hand, extracts spatial cues at different scales by grouping feature maps and applying specialized extraction methods to each group. Experiments conducted on five benchmark datasets (Something-Something V1 and V2, Diving48, Kinetics-400, and UCF101) demonstrate the effectiveness of CAN. Our approach achieves competitive performance, outperforming most mainstream methods, with accuracies of 50.4% on Something-Something V1, 63.9% on Something-Something V2, 88.4% on Diving48, 74.9% on Kinetics-400, and 86.9% on UCF101. These results highlight the importance of capturing multi-scale spatio-temporal cues for robust action recognition.

[199] Application of deep learning approaches for medieval historical documents transcription

Maksym Voloshchuk, Bohdana Zarembovska, Mykola Kozlenko

Main category: cs.CV

TL;DR: A deep learning method for extracting text from 9th-11th century Latin manuscripts, addressing the performance gap in medieval document OCR compared to modern texts.

Details

Motivation: Current OCR solutions perform well on modern documents but struggle with medieval Latin manuscripts (9th-11th centuries) due to unique historical characteristics, creating a need for specialized approaches.

Method: Developed a custom dataset from medieval documents, performed explanatory data analysis, and created a deep learning pipeline with object detection followed by word recognition using classification models and word image embeddings.

Result: Reported comprehensive metrics including recall, precision, F1 score, intersection over union, confusion matrix, and mean string distance with supporting plots; implementation published on GitHub.

Conclusion: The proposed deep learning approach successfully addresses medieval Latin document transcription challenges, providing a specialized solution with published implementation for historical document analysis.

Abstract: Handwritten text recognition and optical character recognition solutions show excellent results with processing data of modern era, but efficiency drops with Latin documents of medieval times. This paper presents a deep learning method to extract text information from handwritten Latin-language documents of the 9th to 11th centuries. The approach takes into account the properties inherent in medieval documents. The paper provides a brief introduction to the field of historical document transcription, a first-sight analysis of the raw data, and the related works and studies. The paper presents the steps of dataset development for further training of the models. The explanatory data analysis of the processed data is provided as well. The paper explains the pipeline of deep learning models to extract text information from the document images, from detecting objects to word recognition using classification models and embedding word images. The paper reports the following results: recall, precision, F1 score, intersection over union, confusion matrix, and mean string distance. The plots of the metrics are also included. The implementation is published on the GitHub repository.

[200] MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation

Guohui Zhang, Hu Yu, Xiaoxiao Ma, Yaning Pan, Hang Xu, Feng Zhao

Main category: cs.CV

TL;DR: MaskFocus is a novel RL framework that enables effective policy optimization for masked generative models by focusing on critical sampling steps rather than entire trajectories, reducing computational cost while improving performance.

Details

Motivation: Reinforcement learning has shown promise for post-training language models and autoregressive visual models, but adapting RL to masked generative models is challenging due to the need to account for probability likelihoods across entire sampling trajectories, which is computationally expensive.

Method: MaskFocus identifies critical sampling steps by measuring step-level information gain through similarity between intermediate images and final generated images. It performs focused policy optimization on these valuable steps and uses a dynamic routing sampling mechanism based on entropy to explore better masking strategies for low-entropy samples.

Result: Extensive experiments on multiple Text-to-Image benchmarks validate the effectiveness of the MaskFocus method.

Conclusion: MaskFocus successfully addresses the challenge of applying RL to masked generative models by focusing optimization on critical steps, enabling efficient policy optimization while maintaining performance.

Abstract: Reinforcement learning (RL) has demonstrated significant potential for post-training language models and autoregressive visual generative models, but adapting RL to masked generative models remains challenging. The core factor is that policy optimization requires accounting for the probability likelihood of each step due to its multi-step and iterative refinement process. This reliance on entire sampling trajectories introduces high computational cost, whereas natively optimizing random steps often yields suboptimal results. In this paper, we present MaskFocus, a novel RL framework that achieves effective policy optimization for masked generative models by focusing on critical steps. Specifically, we determine the step-level information gain by measuring the similarity between the intermediate images at each sampling step and the final generated image. Crucially, we leverage this to identify the most critical and valuable steps and execute focused policy optimization on them. Furthermore, we design a dynamic routing sampling mechanism based on entropy to encourage the model to explore more valuable masking strategies for samples with low entropy. Extensive experiments on multiple Text-to-Image benchmarks validate the effectiveness of our method.

[201] In-Context Audio Control of Video Diffusion Transformers

Wenze Liu, Weicai Ye, Minghong Cai, Quande Liu, Xintao Wang, Xiangyu Yue

Main category: cs.CV

TL;DR: ICAC introduces a framework for speech-driven video generation using audio signals within a unified transformer architecture, proposing Masked 3D Attention to overcome training challenges and achieve strong lip sync.

Details

Motivation: Current video generation foundation models focus on text, images, and depth maps but neglect time-synchronous audio signals, creating a gap for speech-driven video generation.

Method: Proposes In-Context Audio Control (ICAC) framework with three audio injection mechanisms: standard cross-attention, 2D self-attention, and unified 3D self-attention. Introduces Masked 3D Attention to enforce temporal alignment and enable stable training.

Result: Masked 3D Attention achieves superior performance with strong lip synchronization and video quality when conditioned on audio streams and reference images, overcoming the training challenges of standard 3D attention.

Conclusion: ICAC successfully integrates audio signals into video diffusion transformers, demonstrating that properly constrained 3D attention mechanisms can effectively capture spatio-temporal audio-visual correlations for speech-driven video generation.

Abstract: Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusion transformers (ICAC), a framework that investigates the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We systematically explore three distinct mechanisms for injecting audio conditions: standard cross-attention, 2D self-attention, and unified 3D self-attention. Our findings reveal that while 3D attention offers the highest potential for capturing spatio-temporal audio-visual correlations, it presents significant training challenges. To overcome this, we propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance. Our experiments demonstrate that this approach achieves strong lip synchronization and video quality, conditioned on an audio stream and reference images.

[202] Eff-GRot: Efficient and Generalizable Rotation Estimation with Transformers

Fanis Mathioulakis, Gorjan Radevski, Tinne Tuytelaars

Main category: cs.CV

TL;DR: Eff-GRot is an efficient, generalizable rotation estimation method that predicts object rotation from RGB images in a single forward pass without object-specific training, using a transformer for latent space comparison between query and reference images.

Details

Motivation: The paper aims to address the need for efficient rotation estimation from RGB images that doesn't require object- or category-specific training, particularly for latency-sensitive applications where computational efficiency is crucial.

Method: Uses a transformer-based framework that performs comparisons in latent space, jointly processing rotation-aware representations from multiple reference images (with known orientations) alongside a query image to predict rotation in a single forward pass.

Result: Experimental results demonstrate that Eff-GRot achieves a favorable balance between accuracy and computational efficiency, offering promising performance for efficient rotation estimation in latency-sensitive applications.

Conclusion: Eff-GRot provides a simple, scalable, and fully end-to-end approach for generalizable rotation estimation that is both accurate and computationally efficient, representing a promising direction for practical applications.

Abstract: We introduce Eff-GRot, an approach for efficient and generalizable rotation estimation from RGB images. Given a query image and a set of reference images with known orientations, our method directly predicts the object’s rotation in a single forward pass, without requiring object- or category-specific training. At the core of our framework is a transformer that performs a comparison in the latent space, jointly processing rotation-aware representations from multiple references alongside a query. This design enables a favorable balance between accuracy and computational efficiency while remaining simple, scalable, and fully end-to-end. Experimental results show that Eff-GRot offers a promising direction toward more efficient rotation estimation, particularly in latency-sensitive applications.

[203] Watch Closely: Mitigating Object Hallucinations in Large Vision-Language Models with Disentangled Decoding

Ruiqi Ma, Yu Yan, Chunhong Zhang, Minghao Yin, XinChao Liu, Zhihong Jin, Zheng Hu

Main category: cs.CV

TL;DR: HDD is a training-free method that reduces hallucinations in Large Vision-Language Models by segmenting images and using blank images to disentangle language priors from visual understanding.

Details

Motivation: LVLMs suffer from severe hallucination issues in object recognition tasks, generating fluent but inaccurate text that doesn't correspond to visual content. Most existing methods only address language modality hallucinations, leaving visual modality issues unresolved.

Method: Hallucination Disentangled Decoding (HDD) enhances original images through segmentation and selects augmenting images, while using blank images to eliminate language prior hallucinations in both original and segmented images.

Result: HDD reduces model dependence on language priors and enhances visual performance without requiring any training.

Conclusion: The proposed HDD method effectively mitigates hallucinations in both language and visual modalities of LVLMs through a training-free approach that disentangles language priors from visual understanding.

Abstract: Large Vision-Language Models (LVLMs) bridge the gap between visual and linguistic modalities, demonstrating strong potential across a variety of domains. However, despite significant progress, LVLMs still suffer from severe hallucination issues in object recognition tasks. These models often fail to accurately identify certain objects, leading to text generation that appears fluent but does not correspond to the visual content, which can have serious consequences in real-world applications. Recently, several methods have been proposed to alleviate LVLM hallucinations, but most focus solely on reducing hallucinations in the language modality. To mitigate hallucinations in both the language and visual modalities, we introduce Hallucination Disentangled Decoding (HDD) method that requires no training. HDD enhances the original image by segmenting it and selecting images that augment the original, while also utilizing a blank image to eliminate language prior hallucinations in both the original and segmented images. This design not only reduces the model’s dependence on language priors but also enhances its visual performance. (Code: https://github.com/rickeyhhh/Hallucination-Disentangled-Decoding)

[204] Revealing Perception and Generation Dynamics in LVLMs: Mitigating Hallucinations via Validated Dominance Correction

Guangtao Lyu, Xinyi Cheng, Chenghao Xu, Qi Liu, Muli Yang, Fen Fang, Huilin Chen, Jiexi Yan, Xu Yang, Cheng Deng

Main category: cs.CV

TL;DR: This paper analyzes hallucinations in Large Vision-Language Models (LVLMs), identifies two key internal patterns (GATE for perception and SAD for generation), and proposes VDC strategy to mitigate hallucinations by replacing unsupported tokens with validated dominant ones.

Details

Motivation: Hallucinations remain a persistent challenge in Large Vision-Language Models despite their remarkable capabilities, requiring systematic analysis of internal mechanisms to develop effective mitigation strategies.

Method: Systematic analysis of internal evolution in LVLMs reveals two patterns: GATE (Global scan, Approach/Tighten, Explore) for visual perception and SAD (Subdominant Accumulation to Dominant) for token generation. Based on these findings, the authors devise VDC (Validated Dominance Correction) strategy that detects unsupported hallucinated tokens and replaces them with validated dominant ones.

Result: Extensive experiments across multiple models and benchmarks confirm that VDC substantially mitigates hallucinations in LVLMs.

Conclusion: The systematic analysis of internal mechanisms in LVLMs reveals fundamental patterns behind hallucinations, and the proposed VDC strategy effectively addresses this challenge by correcting unsupported tokens, improving output reliability.

Abstract: Large Vision-Language Models (LVLMs) have shown remarkable capabilities, yet hallucinations remain a persistent challenge. This work presents a systematic analysis of the internal evolution of visual perception and token generation in LVLMs, revealing two key patterns. First, perception follows a three-stage GATE process: early layers perform a Global scan, intermediate layers Approach and Tighten on core content, and later layers Explore supplementary regions. Second, generation exhibits an SAD (Subdominant Accumulation to Dominant) pattern, where hallucinated tokens arise from the repeated accumulation of subdominant tokens lacking support from attention (visual perception) or feed-forward network (internal knowledge). Guided by these findings, we devise the VDC (Validated Dominance Correction) strategy, which detects unsupported tokens and replaces them with validated dominant ones to improve output reliability. Extensive experiments across multiple models and benchmarks confirm that VDC substantially mitigates hallucinations.

[205] EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer

Yuxiao Yang, Hualian Sheng, Sijia Cai, Jing Lin, Jiahao Wang, Bing Deng, Junzhe Lu, Haoqian Wang, Jieping Ye

Main category: cs.CV

TL;DR: EchoMotion is a framework that improves human action video generation by jointly modeling appearance and motion using a dual-branch DiT architecture with synchronized 3D positional encoding and two-stage training.

Details

Motivation: Current video generation models struggle with complex human movements due to pixel-only training objectives that prioritize appearance fidelity over kinematic principles, limiting their ability to synthesize realistic human actions.

Method: Extends DiT with dual-branch architecture for joint appearance-motion processing, introduces MVS-RoPE for unified 3D positional encoding, uses two-stage training strategy, and constructs HuMoVe dataset (80K video-motion pairs).

Result: Explicit human motion representation complements appearance, significantly boosting coherence and plausibility of human-centric video generation, enabling both joint generation and cross-modal conditional tasks.

Conclusion: Joint modeling of appearance and motion through synchronized multimodal processing addresses limitations of pixel-only approaches, improving human action video generation quality.

Abstract: Video generation models have advanced significantly, yet they still struggle to synthesize complex human movements due to the high degrees of freedom in human articulation. This limitation stems from the intrinsic constraints of pixel-only training objectives, which inherently bias models toward appearance fidelity at the expense of learning underlying kinematic principles. To address this, we introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion, thereby improving the quality of complex human action video generation. EchoMotion extends the DiT (Diffusion Transformer) framework with a dual-branch architecture that jointly processes tokens concatenated from different modalities. Furthermore, we propose MVS-RoPE (Motion-Video Syncronized RoPE), which offers unified 3D positional encoding for both video and motion tokens. By providing a synchronized coordinate system for the dual-modal latent sequence, MVS-RoPE establishes an inductive bias that fosters temporal alignment between the two modalities. We also propose a Motion-Video Two-Stage Training Strategy. This strategy enables the model to perform both the joint generation of complex human action videos and their corresponding motion sequences, as well as versatile cross-modal conditional generation tasks. To facilitate the training of a model with these capabilities, we construct HuMoVe, a large-scale dataset of approximately 80,000 high-quality, human-centric video-motion pairs. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.

[206] Brain-Gen: Towards Interpreting Neural Signals for Stimulus Reconstruction Using Transformers and Latent Diffusion Models

Hasib Aslam, Muhammad Talal Faiz, Muhammad Imran Malik

Main category: cs.CV

TL;DR: A transformers-based framework extracts spatial-temporal EEG features for visual stimulus reconstruction using Latent Diffusion Models, improving semantic interpretation of brain activity.

Details

Motivation: Despite advances in brain activity decoding, neural representation interpretability remains limited due to EEG signal challenges: high noise, spatial diffusion, and temporal variability. The goal is to better interpret neural mechanisms underlying thoughts.

Method: Proposes a transformers-based framework to extract spatial-temporal representations from EEG recordings associated with visual stimuli. These features are then incorporated into the attention mechanisms of Latent Diffusion Models (LDMs) to reconstruct visual stimuli from brain activity.

Result: Quantitative evaluations on public benchmark datasets show the method excels at modeling semantic structures from EEG signals: achieves up to 6.5% increase in latent space clustering accuracy and 11.8% increase in zero-shot generalization across unseen classes, while maintaining comparable Inception Score and Fréchet Inception Distance with existing baselines.

Conclusion: The work represents a significant step toward generalizable semantic interpretation of EEG signals, advancing the field of brain activity decoding and neural representation interpretability.

Abstract: Advances in neuroscience and artificial intelligence have enabled preliminary decoding of brain activity. However, despite the progress, the interpretability of neural representations remains limited. A significant challenge arises from the intrinsic properties of electroencephalography (EEG) signals, including high noise levels, spatial diffusion, and pronounced temporal variability. To interpret the neural mechanism underlying thoughts, we propose a transformers-based framework to extract spatial-temporal representations associated with observed visual stimuli from EEG recordings. These features are subsequently incorporated into the attention mechanisms of Latent Diffusion Models (LDMs) to facilitate the reconstruction of visual stimuli from brain activity. The quantitative evaluations on publicly available benchmark datasets demonstrate that the proposed method excels at modeling the semantic structures from EEG signals; achieving up to 6.5% increase in latent space clustering accuracy and 11.8% increase in zero shot generalization across unseen classes while having comparable Inception Score and Fréchet Inception Distance with existing baselines. Our work marks a significant step towards generalizable semantic interpretation of the EEG signals.

[207] VizDefender: Unmasking Visualization Tampering through Proactive Localization and Intent Inference

Sicheng Song, Yanjie Zhang, Zixin Chen, Huamin Qu, Changbo Wang, Chenhui Li

Main category: cs.CV

TL;DR: VizDefender framework detects and analyzes tampering in data visualizations using watermarking and MLLM-based intent analysis.

Details

Motivation: Data visualization integrity is threatened by image editing techniques that enable subtle but deceptive tampering, creating a need for detection and analysis tools.

Method: Two-component framework: 1) Semi-fragile watermark module embeds location maps for precise tampering localization while preserving visual quality; 2) Intent analysis module uses Multimodal Large Language Models (MLLMs) to interpret manipulations and infer attacker intent.

Result: Extensive evaluations and user studies demonstrate the effectiveness of the VizDefender framework in detecting and analyzing visualization tampering.

Conclusion: VizDefender provides a comprehensive solution for protecting data visualization integrity against sophisticated tampering techniques through combined watermarking and AI-based intent analysis.

Abstract: The integrity of data visualizations is increasingly threatened by image editing techniques that enable subtle yet deceptive tampering. Through a formative study, we define this challenge and categorize tampering techniques into two primary types: data manipulation and visual encoding manipulation. To address this, we present VizDefender, a framework for tampering detection and analysis. The framework integrates two core components: 1) a semi-fragile watermark module that protects the visualization by embedding a location map to images, which allows for the precise localization of tampered regions while preserving visual quality, and 2) an intent analysis module that leverages Multimodal Large Language Models (MLLMs) to interpret manipulation, inferring the attacker’s intent and misleading effects. Extensive evaluations and user studies demonstrate the effectiveness of our methods.

[208] CrashChat: A Multimodal Large Language Model for Multitask Traffic Crash Video Analysis

Kaidi Liang, Ke Li, Xianbiao Hu, Ruwen Qin

Main category: cs.CV

TL;DR: CrashChat: A multimodal LLM for automated traffic crash video analysis that performs crash recognition, temporal grounding, and video understanding in a unified framework, achieving state-of-the-art performance.

Details

Motivation: Automating crash video analysis is crucial for traffic safety research and autonomous driving accountability. Current models can't handle all required tasks (crash recognition, temporal grounding, video understanding) in a unified framework, and effective training strategies are underexplored.

Method: Built on VideoLLaMA3, CrashChat uses instruction fine-tuning for domain knowledge and a novel multitask learning strategy with task decoupling and grouping to maximize joint learning benefits while mitigating negative transfer.

Result: Outperforms existing MLLMs and traditional vision methods across model scales: near-perfect crash recognition accuracy, 176% improvement in crash localization, 40% improvement in pre-crash localization. Substantially enhances textual accuracy in description/reasoning tasks with 0.18-0.41 BLEU and 0.18-0.42 ROUGE score increases.

Conclusion: CrashChat is an effective, convenient end-to-end tool for practical crash video analysis that achieves state-of-the-art performance through domain-specific fine-tuning and innovative multitask learning strategies.

Abstract: Automating crash video analysis is essential to leverage the growing availability of driving video data for traffic safety research and accountability attribution in autonomous driving. Crash video analysis is a challenging multitask problem due to the complex spatiotemporal dynamics of crash events in video data and the diverse analytical requirements involved. It requires capabilities spanning crash recognition, temporal grounding, and high-level video understanding. Existing models, however, cannot perform all these tasks within a unified framework, and effective training strategies for such models remain underexplored. To fill these gaps, this paper proposes CrashChat, a multimodal large language model (MLLM) for multitask traffic crash analysis, built upon VideoLLaMA3. CrashChat acquires domain-specific knowledge through instruction fine-tuning and employs a novel multitask learning strategy based on task decoupling and grouping, which maximizes the benefit of joint learning within and across task groups while mitigating negative transfer. Numerical experiments on consolidated public datasets demonstrate that CrashChat consistently outperforms existing MLLMs across model scales and traditional vision-based methods, achieving state-of-the-art performance. It reaches near-perfect accuracy in crash recognition, a 176% improvement in crash localization, and a 40% improvement in the more challenging pre-crash localization. Compared to general MLLMs, it substantially enhances textual accuracy and content coverage in crash description and reasoning tasks, with 0.18-0.41 increases in BLEU scores and 0.18-0.42 increases in ROUGE scores. Beyond its strong performance, CrashChat is a convenient, end-to-end analytical tool ready for practical implementation. The dataset and implementation code for CrashChat are available at https://github.com/Liangkd/CrashChat.

[209] Localising Shortcut Learning in Pixel Space via Ordinal Scoring Correlations for Attribution Representations (OSCAR)

Akshit Achara, Peter Triantafillou, Esther Puyol-Antón, Alexander Hammers, Andrew P. King

Main category: cs.CV

TL;DR: OSCAR is a model-agnostic framework that quantifies shortcut learning and localizes shortcut features using attribution map rank profiles and correlation analysis across three models.

Details

Motivation: Existing methods for understanding shortcut learning are mostly qualitative, image-level inspections that assume cues are human-visible, limiting their use in domains like medical imaging where shortcuts may not be visually apparent.

Method: OSCAR converts image-level task attribution maps into dataset-level rank profiles and compares them across three models: balanced baseline (BA), test model (TS), and sensitive attribute predictor (SA). It computes pairwise, partial, and deviation-based correlations on these rank profiles to quantify shortcut reliance and rank image regions contributing to it.

Result: Experiments on CelebA, CheXpert, and ADNI show OSCAR’s correlations are stable across seeds/partitions, sensitive to shortcut-label associations in training data, and able to distinguish localized from diffuse shortcut features. The method can reduce worst-group performance disparities using test-time attenuation based on identified shortcut regions.

Conclusion: OSCAR provides a lightweight, pixel-space audit that yields statistical decision rules and spatial maps, enabling users to test, localize, and mitigate shortcut reliance in deep neural networks, especially in domains where shortcuts may not be human-visible.

Abstract: Deep neural networks often exploit shortcuts. These are spurious cues which are associated with output labels in the training data but are unrelated to task semantics. When the shortcut features are associated with sensitive attributes, shortcut learning can lead to biased model performance. Existing methods for localising and understanding shortcut learning are mostly based upon qualitative, image-level inspection and assume cues are human-visible, limiting their use in domains such as medical imaging. We introduce OSCAR (Ordinal Scoring Correlations for Attribution Representations), a model-agnostic framework for quantifying shortcut learning and localising shortcut features. OSCAR converts image-level task attribution maps into dataset-level rank profiles of image regions and compares them across three models: a balanced baseline model (BA), a test model (TS), and a sensitive attribute predictor (SA). By computing pairwise, partial, and deviation-based correlations on these rank profiles, we produce a set of quantitative metrics that characterise the degree of shortcut reliance for TS, together with a ranking of image-level regions that contribute most to it. Experiments on CelebA, CheXpert, and ADNI show that our correlations are (i) stable across seeds and partitions, (ii) sensitive to the level of association between shortcut features and output labels in the training data, and (iii) able to distinguish localised from diffuse shortcut features. As an illustration of the utility of our method, we show how worst-group performance disparities can be reduced using a simple test-time attenuation approach based on the identified shortcut regions. OSCAR provides a lightweight, pixel-space audit that yields statistical decision rules and spatial maps, enabling users to test, localise, and mitigate shortcut reliance. The code is available at https://github.com/acharaakshit/oscar

[210] Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs

Dmitry Demidov, Zaigham Zaheer, Zongyan Han, Omkar Thawakar, Rao Anwer

Main category: cs.CV

TL;DR: FiNDR is a reasoning-augmented LMM framework for vocabulary-free fine-grained image recognition that uses automated reasoning to generate candidate labels, filter them, and build a classifier without human-curated vocabularies.

Details

Motivation: Existing approaches for vocabulary-free fine-grained recognition are limited by rigid vocabularies or complex pipelines with error propagation. Large multi-modal models (LMMs) with reasoning capabilities offer a more principled alternative for this problem.

Method: Three-step automated framework: (1) reasoning-enabled LMM generates descriptive candidate labels for each image, (2) vision-language model filters and ranks candidates to form coherent class set, (3) verified names instantiate lightweight multi-modal classifier for inference.

Result: State-of-the-art performance on fine-grained benchmarks with up to 18.8% relative improvement over previous approaches. Remarkably surpasses zero-shot baselines using pre-defined ground-truth names, challenging the assumption that human-curated vocabularies define an upper bound.

Conclusion: Reasoning-augmented LMMs establish an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. Open-source LMMs with carefully curated prompts can match proprietary counterparts.

Abstract: Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Existing solutions for this problem are limited by either the usage of a large and rigid list of vocabularies or by the dependency on complex pipelines with fragile heuristics where errors propagate across stages. Meanwhile, the ability of recent large multi-modal models (LMMs) equipped with explicit or implicit reasoning to comprehend visual-language data, decompose problems, retrieve latent knowledge, and self-correct suggests a more principled and effective alternative. Building on these capabilities, we propose FiNDR (Fine-grained Name Discovery via Reasoning), the first reasoning-augmented LMM-based framework for vocabulary-free fine-grained recognition. The system operates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Extensive experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art performance under the vocabulary-free setting, with a significant relative margin of up to 18.8% over previous approaches. Remarkably, the proposed method surpasses zero-shot baselines that exploit pre-defined ground-truth names, challenging the assumption that human-curated vocabularies define an upper bound. Additionally, we show that carefully curated prompts enable open-source LMMs to match proprietary counterparts. These findings establish reasoning-augmented LMMs as an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. The source code is available on github.com/demidovd98/FiNDR.

[211] Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models

Mohamad Zamini, Diksha Shukla

Main category: cs.CV

TL;DR: Delta-LLaVA introduces a token-efficient visual projector for MLLMs using low-rank DeltaProjection and lightweight Transformer blocks to reduce computational costs while maintaining performance.

Details

Motivation: High computational cost of processing dense visual tokens in MLLMs is a major bottleneck. Standard visual projectors using simple MLPs scale poorly with high-resolution inputs and introduce significant redundancy.

Method: Delta-LLaVA employs a two-stage design: 1) low-rank DeltaProjection to align multi-level vision features into a compact subspace, and 2) lightweight Transformer blocks as specialization layers to capture global and local structure under constrained token budgets.

Result: Achieves consistent gains across multiple benchmarks with only 144 tokens. Improves inference throughput by up to 55% and accelerates training by 4-5x in pretraining and over 1.5x in finetuning.

Conclusion: The base-then-specialize design demonstrates the importance of token formation prior to scaling interaction capacity, offering dual benefits in efficiency and scalability for MLLMs.

Abstract: Multimodal Large Language Models (MLLMs) combine visual and textual representations to enable rich reasoning capabilities. However, the high computational cost of processing dense visual tokens remains a major bottleneck. A critical component in this pipeline is the visual projector, which bridges the vision encoder and the language model. Standard designs often employ a simple multi-layer perceptron for direct token mapping, but this approach scales poorly with high-resolution inputs, introducing significant redundancy. We present Delta-LLaVA, a token-efficient projector that employs a low-rank DeltaProjection to align multi-level vision features into a compact subspace before further interaction. On top of this base alignment, lightweight Transformer blocks act as specialization layers, capturing both global and local structure under constrained token budgets. Extensive experiments and ablations demonstrate that this base-then-specialize design yields consistent gains across multiple benchmarks with only 144 tokens, highlighting the importance of token formation prior to scaling interaction capacity. With Delta-LLaVA, inference throughput improves by up to 55%, while end-to-end training accelerates by nearly 4-5x in pretraining and over 1.5x in finetuning, highlighting the dual benefits of our design in both efficiency and scalability.

[212] LouvreSAE: Sparse Autoencoders for Interpretable and Controllable Style Transfer

Raina Panda, Daniel Fein, Arpita Singhal, Mark Fiore, Maneesh Agrawala, Matyas Bohacek

Main category: cs.CV

TL;DR: LouvreSAE: A training-light, interpretable method for artistic style transfer using art-specific Sparse Autoencoder on latent embeddings, creating compact style profiles for direct steering without model updates.

Details

Motivation: Existing style transfer methods are computationally expensive (fine-tuning, adapters, prompt engineering) and often entangle style with subject matter. There's a need for lightweight, interpretable approaches that can separate style from content effectively.

Method: Uses art-specific Sparse Autoencoder (SAE) on latent embeddings of generative image models. Trained on artistic data to learn disentangled stylistic and compositional concepts (brushwork, texture, color palette, semantic structures). Creates “style profiles” - compact, decomposable steering vectors for direct style transfer without model updates.

Result: Achieves or surpasses existing methods on ArtBench10 style evaluations (VGG Style Loss and CLIP Score Style) while being 1.7-20x faster. Enables interpretable style transfer without fine-tuning, LoRA training, or additional inference passes.

Conclusion: LouvreSAE provides an efficient, interpretable solution for artistic style transfer that separates style from content, requires minimal computation, and enables direct steering from few reference images without model updates.

Abstract: Artistic style transfer in generative models remains a significant challenge, as existing methods often introduce style only via model fine-tuning, additional adapters, or prompt engineering, all of which can be computationally expensive and may still entangle style with subject matter. In this paper, we introduce a training- and inference-light, interpretable method for representing and transferring artistic style. Our approach leverages an art-specific Sparse Autoencoder (SAE) on top of latent embeddings of generative image models. Trained on artistic data, our SAE learns an emergent, largely disentangled set of stylistic and compositional concepts, corresponding to style-related elements pertaining brushwork, texture, and color palette, as well as semantic and structural concepts. We call it LouvreSAE and use it to construct style profiles: compact, decomposable steering vectors that enable style transfer without any model updates or optimization. Unlike prior concept-based style transfer methods, our method requires no fine-tuning, no LoRA training, and no additional inference passes, enabling direct steering of artistic styles from only a few reference images. We validate our method on ArtBench10, achieving or surpassing existing methods on style evaluations (VGG Style Loss and CLIP Score Style) while being 1.7-20x faster and, critically, interpretable.

[213] Point What You Mean: Visually Grounded Instruction Policy

Hang Yu, Juntu Zhao, Yufeng Liu, Kaiyu Li, Cheng Ma, Di Zhang, Yingdong Hu, Guang Chen, Junyuan Xie, Junliang Guo, Junqiao Zhao, Yang Gao

Main category: cs.CV

TL;DR: Point-VLA is a plug-and-play policy that enhances Vision-Language-Action models by adding visual cues (like bounding boxes) to language instructions to resolve object referring ambiguity, especially in cluttered or OOD scenes.

Details

Motivation: Current VLA models have limited object referring ability when relying solely on text prompts, particularly in cluttered or out-of-distribution scenes where referential ambiguity is high.

Method: Developed Point-VLA that augments language instructions with explicit visual cues (e.g., bounding boxes) for precise object-level grounding, and created an automatic data annotation pipeline to efficiently scale visually grounded datasets with minimal human effort.

Result: Point-VLA consistently outperforms text-only instruction VLAs on diverse real-world referring tasks, especially in cluttered or unseen-object scenarios, showing robust generalization.

Conclusion: Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control in vision-language-action systems.

Abstract: Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompt, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce the Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline requiring minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, with robust generalization. These results demonstrate that Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.

[214] Symmetrization of 3D Generative Models

Nicolas Caytuiro, Ivan Sipiran

Main category: cs.CV

TL;DR: A data-centric approach to improve 3D shape symmetry by training generative models on half-objects instead of full shapes, then reflecting them during generation.

Details

Motivation: Current 3D generative models often produce asymmetrical shapes. Instead of modifying model architectures, the authors propose modifying training data to enforce symmetry by training on half-objects.

Method: 1) Analyze reflectional symmetry in real 3D shapes and generated samples. 2) Create a dataset of half-objects by reflecting one half along x=0 plane. 3) Train generative models exclusively on these half-objects. 4) During generation, reflect the generated half-shapes to create complete symmetrical objects.

Result: Experiments on ShapeNet classes (Airplane, Car, Chair) show that generated shapes are more symmetrical and consistent compared to those from original models and dataset objects.

Conclusion: Training generative models on half-objects is an effective data-centric approach to produce symmetrical 3D shapes without modifying model architecture, demonstrating improved symmetry and consistency.

Abstract: We propose a novel data-centric approach to promote symmetry in 3D generative models by modifying the training data rather than the model architecture. Our method begins with an analysis of reflectional symmetry in both real-world 3D shapes and samples generated by state-of-the-art models. We hypothesize that training a generative model exclusively on half-objects, obtained by reflecting one half of the shapes along the x=0 plane, enables the model to learn a rich distribution of partial geometries which, when reflected during generation, yield complete shapes that are both visually plausible and geometrically symmetric. To test this, we construct a new dataset of half-objects from three ShapeNet classes (Airplane, Car, and Chair) and train two generative models. Experiments demonstrate that the generated shapes are symmetrical and consistent, compared with the generated objects from the original model and the original dataset objects.

[215] VOIC: Visible-Occluded Decoupling for Monocular 3D Semantic Scene Completion

Zaidao Han, Risa Higashita, Jiang Liu

Main category: cs.CV

TL;DR: VOIC introduces a dual-decoder framework for 3D semantic scene completion that explicitly separates visible-region perception from occluded-region reasoning using a novel visible region label extraction strategy, achieving state-of-the-art performance on autonomous driving benchmarks.

Details

Motivation: Existing camera-based 3D Semantic Scene Completion methods suffer from feature dilution and error propagation due to interference between high-confidence visible-region perception and low-confidence occluded-region reasoning when using single-image inputs.

Method: Proposes Visible-Occluded Interactive Completion Network (VOIC) with: 1) Offline Visible Region Label Extraction (VRLE) to separate voxel-level supervision for visible regions, 2) Dual-decoder framework with visible decoder for high-fidelity geometric/semantic priors and occlusion decoder for global scene reasoning, 3) Cross-modal interaction between decoders.

Result: VOIC outperforms existing monocular SSC methods on SemanticKITTI and SSCBench-KITTI360 benchmarks, achieving state-of-the-art performance in both geometric completion and semantic segmentation accuracy.

Conclusion: Explicitly decoupling visible-region perception from occluded-region reasoning with purified supervision addresses interference issues in single-image SSC, leading to improved performance through complementary sub-task specialization and cross-modal interaction.

Abstract: Camera-based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. However, they often overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning caused by single-image input, which can lead to feature dilution and error propagation. To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this idea, we propose the Visible-Occluded Interactive Completion Network (VOIC), a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder focuses on generating high-fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross-modal interaction to perform coherent global scene reasoning. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state-of-the-art performance.

[216] DVI: Disentangling Semantic and Visual Identity for Training-Free Personalized Generation

Guandong Li, Yijun Ding

Main category: cs.CV

TL;DR: DVI is a zero-shot framework that disentangles identity into semantic and visual streams to address “Semantic-Visual Dissonance” in identity customization, using VAE latent statistics for visual atmosphere and parameter-free feature modulation.

Details

Motivation: Current tuning-free identity customization methods achieve high facial fidelity but overlook visual context (lighting, skin texture, environmental tone), leading to "Semantic-Visual Dissonance" where accurate facial geometry clashes with input atmosphere, creating unnatural "sticker-like" effects.

Method: DVI orthogonally disentangles identity into fine-grained semantic and coarse-grained visual streams. It exploits VAE latent space statistical properties (mean and variance) as lightweight descriptors for global visual atmosphere. Uses Parameter-Free Feature Modulation to adaptively modulate semantic embeddings with visual statistics, and Dynamic Temporal Granularity Scheduler to prioritize visual atmosphere in early denoising stages while refining semantic details later.

Result: Extensive experiments show DVI significantly enhances visual consistency and atmospheric fidelity without parameter fine-tuning, maintains robust identity preservation, and outperforms state-of-the-art methods in IBench evaluations.

Conclusion: DVI effectively addresses Semantic-Visual Dissonance by disentangling and modulating identity components, achieving superior visual-identity consistency without training, representing an advancement in zero-shot identity customization.

Abstract: Recent tuning-free identity customization methods achieve high facial fidelity but often overlook visual context, such as lighting, skin texture, and environmental tone. This limitation leads to Semantic-Visual Dissonance,'' where accurate facial geometry clashes with the input's unique atmosphere, causing an unnatural sticker-like’’ effect. We propose DVI (Disentangled Visual-Identity), a zero-shot framework that orthogonally disentangles identity into fine-grained semantic and coarse-grained visual streams. Unlike methods relying solely on semantic vectors, DVI exploits the inherent statistical properties of the VAE latent space, utilizing mean and variance as lightweight descriptors for global visual atmosphere. We introduce a Parameter-Free Feature Modulation mechanism that adaptively modulates semantic embeddings with these visual statistics, effectively injecting the reference’s ``visual soul’’ without training. Furthermore, a Dynamic Temporal Granularity Scheduler aligns with the diffusion process, prioritizing visual atmosphere in early denoising stages while refining semantic details later. Extensive experiments demonstrate that DVI significantly enhances visual consistency and atmospheric fidelity without parameter fine-tuning, maintaining robust identity preservation and outperforming state-of-the-art methods in IBench evaluations.

[217] Total Curvature Regularization and its_Minimization for Surface and Image Smoothing

Tianle Lu, Ke Chen, Yuping Duan

Main category: cs.CV

TL;DR: Novel curvature regularization method using total normal curvature from multiple directions produces sharp edges and isotropic solutions via PDE-based optimization with operator splitting.

Details

Motivation: Need for curvature regularization methods that can produce solutions with sharp edges and precise isotropic properties while avoiding complex parameter tuning.

Method: Formulate total normal curvature regularization as high-order nonlinear optimization, reformulate as steady-state solution of time-dependent PDE system, use operator splitting for time discretization with closed-form or efficient subproblem solutions.

Result: Method demonstrates robustness to parameter choices, circumvents complex parameter tuning, and shows efficiency and effectiveness in surface and image smoothing applications.

Conclusion: The proposed total normal curvature regularization with PDE-based optimization and operator splitting provides an effective approach for curvature regularization with sharp edge preservation and isotropic properties.

Abstract: We introduce a novel formulation for curvature regularization by penalizing normal curvatures from multiple directions. This total normal curvature regularization is capable of producing solutions with sharp edges and precise isotropic properties. To tackle the resulting high-order nonlinear optimization problem, we reformulate it as the task of finding the steady-state solution of a time-dependent partial differential equation (PDE) system. Time discretization is achieved through operator splitting, where each subproblem at the fractional steps either has a closed-form solution or can be efficiently solved using advanced algorithms. Our method circumvents the need for complex parameter tuning and demonstrates robustness to parameter choices. The efficiency and effectiveness of our approach have been rigorously validated in the context of surface and image smoothing problems.

[218] Self-Attention with State-Object Weighted Combination for Compositional Zero Shot Learning

Cheng-Hong Chang, Pei-Hsuan Tsai

Main category: cs.CV

TL;DR: SASOW improves compositional zero-shot learning by enhancing state/object recognition with self-attention and incorporating weighting during composition, outperforming previous methods on benchmark datasets.

Details

Motivation: Existing object recognition systems typically identify objects alone without considering their states. While compositional zero-shot learning (CZSL) addresses this by treating states and objects separately, current state-of-the-art method KG-SP has limitations in recognition accuracy and doesn't consider weighting of states/objects during composition.

Method: SASOW enhances KG-SP by: 1) Introducing self-attention mechanisms into state and object classifiers to improve recognition accuracy, and 2) Incorporating weighting of states and objects during composition to generate more reasonable and accurate compositions.

Result: Experimental results show SASOW outperforms OW-CZSL and KG-SP on three benchmark datasets: MIT-States (2.1% improvement), UT Zappos (1.7% improvement), and C-GQA (0.4% improvement) in accuracy for unseen compositions.

Conclusion: SASOW successfully addresses limitations of previous CZSL methods by improving state/object recognition accuracy through self-attention and incorporating weighting during composition, leading to better performance on compositional zero-shot learning tasks.

Abstract: Object recognition has become prevalent across various industries. However, most existing applications are limited to identifying objects alone, without considering their associated states. The ability to recognize both the state and object simultaneously remains less common. One approach to address this is by treating state and object as a single category during training. However, this approach poses challenges in data collection and training since it requires comprehensive data for all possible combinations. Compositional Zero-shot Learning (CZSL) emerges as a viable solution by treating the state and object as distinct categories during training. CZSL facilitates the identification of novel compositions even in the absence of data for every conceivable combination. The current state-of-the-art method, KG-SP, addresses this issue by training distinct classifiers for states and objects, while leveraging a semantic model to evaluate the plausibility of composed compositions. However, KG-SP’s accuracy in state and object recognition can be further improved, and it fails to consider the weighting of states and objects during composition. In this study, we propose SASOW, an enhancement of KG-SP that considers the weighting of states and objects while improving composition recognition accuracy. First, we introduce self-attention mechanisms into the classifiers for states and objects, leading to enhanced accuracy in recognizing both. Additionally, we incorporate the weighting of states and objects during composition to generate more reasonable and accurate compositions. Our validation process involves testing SASOW on three established benchmark datasets. Experimental outcomes affirm when compared against OW-CZSL approach, KG-SP, SASOW showcases improvements of 2.1%, 1.7%, and 0.4% in terms of accuracy for unseen compositions across the MIT-States, UT Zappos, and C-GQA datasets, respectively.

[219] ICP-4D: Bridging Iterative Closest Point and LiDAR Panoptic Segmentation

Gyeongrok Oh, Youngdong Jang, Jonghyun Choi, Suk-Ju Kang, Guang Lin, Sangpil Kim

Main category: cs.CV

TL;DR: ICP-4D is a training-free framework for 4D LiDAR panoptic segmentation that uses geometric priors and Iterative Closest Point algorithm for temporal instance association, outperforming state-of-the-art methods without additional training.

Details

Motivation: Existing 4D LiDAR panoptic segmentation methods require training large neural networks or designing complex association modules, which are computationally expensive and overlook the rich geometric priors inherent in raw point clouds.

Method: Uses Iterative Closest Point (ICP) algorithm to associate temporally consistent instances by aligning source and target point sets through estimated transformations. Incorporates Sinkhorn-based soft matching for robust association under noisy predictions, and handles three instance types (static, dynamic, missing) for computational efficiency and occlusion-aware matching.

Result: Outperforms state-of-the-art approaches on both SemanticKITTI and panoptic nuScenes datasets, demonstrating consistent performance improvements without requiring additional training or extra point cloud inputs.

Conclusion: ICP-4D provides a simple yet effective training-free framework that leverages geometric relations for 4D LiDAR panoptic segmentation, achieving superior performance while being computationally efficient and not requiring model training.

Abstract: Dominant paradigms for 4D LiDAR panoptic segmentation are usually required to train deep neural networks with large superimposed point clouds or design dedicated modules for instance association. However, these approaches perform redundant point processing and consequently become computationally expensive, yet still overlook the rich geometric priors inherently provided by raw point clouds. To this end, we introduce ICP-4D, a simple yet effective training-free framework that unifies spatial and temporal reasoning through geometric relations among instance-level point sets. Specifically, we apply the Iterative Closest Point (ICP) algorithm to directly associate temporally consistent instances by aligning the source and target point sets through the estimated transformation. To stabilize association under noisy instance predictions, we introduce a Sinkhorn-based soft matching. This exploits the underlying instance distribution to obtain accurate point-wise correspondences, resulting in robust geometric alignment. Furthermore, our carefully designed pipeline, which considers three instance types-static, dynamic, and missing-offers computational efficiency and occlusion-aware matching. Our extensive experiments across both SemanticKITTI and panoptic nuScenes demonstrate that our method consistently outperforms state-of-the-art approaches, even without additional training or extra point cloud inputs.

[220] Towards AI-Guided Open-World Ecological Taxonomic Classification

Cheng Yaw Low, Heejoon Koo, Jaewoo Park, Kaleb Mesfin Asfaw, Meeyoung Cha

Main category: cs.CV

TL;DR: The paper introduces Open-World Ecological Taxonomy Classification framework and TaxoNet model to address multiple challenges in AI-guided ecological classification, including class imbalance, fine-grained variations, domain shifts, and closed-set limitations.

Details

Motivation: AI-guided ecological classification is crucial for global sustainability efforts like biodiversity monitoring and conservation, but faces challenges including long-tailed taxonomic distributions (class imbalance), fine-grained variations, test-time spatiotemporal domain shifts, and closed-set assumptions that limit recognition to previously seen taxa.

Method: Proposes TaxoNet, an embedding-based encoder with a dual-margin penalization loss that strengthens learning signals from rare underrepresented taxa while mitigating the dominance of overrepresented ones. This addresses interrelated challenges in a unified Open-World Ecological Taxonomy Classification framework.

Result: The model consistently outperforms baselines across diverse ecological domains (Google Auto-Arborist, iNat-Plantae, NAFlora-Mini), particularly for rare taxa. Also shows that general-purpose multimodal foundation models remain constrained in plant-domain applications.

Conclusion: The proposed framework and TaxoNet model establish a strong foundation for open-world plant taxonomic monitoring by effectively addressing multiple realistic ecological challenges simultaneously, with particular success in handling rare taxa.

Abstract: AI-guided classification of ecological families, genera, and species underpins global sustainability efforts such as biodiversity monitoring, conservation planning, and policy-making. Progress toward this goal is hindered by long-tailed taxonomic distributions from class imbalance, along with fine-grained taxonomic variations, test-time spatiotemporal domain shifts, and closed-set assumptions that can only recognize previously seen taxa. We introduce the Open-World Ecological Taxonomy Classification, a unified framework that captures the co-occurrence of these challenges in realistic ecological settings. To address them, we propose TaxoNet, an embedding-based encoder with a dual-margin penalization loss that strengthens learning signals from rare underrepresented taxa while mitigating the dominance of overrepresented ones, directly confronting interrelated challenges. We evaluate our method on diverse ecological domains: Google Auto-Arborist (urban trees), iNat-Plantae (Plantae observations from various ecosystems in iNaturalist-2019), and NAFlora-Mini (a curated herbarium collection). Our model consistently outperforms baselines, particularly for rare taxa, establishing a strong foundation for open-world plant taxonomic monitoring. Our findings further show that general-purpose multimodal foundation models remain constrained in plant-domain applications.

[221] CETCAM: Camera-Controllable Video Generation via Consistent and Extensible Tokenization

Zelin Zhao, Xinyu Gong, Bangya Liu, Ziyang Song, Jun Zhang, Suhui Wu, Yongxin Chen, Hao Zhang

Main category: cs.CV

TL;DR: CETCAM is a camera-controllable video generation framework that eliminates the need for camera pose annotations by using geometry foundation models to estimate depth and camera parameters, converting them into unified tokens integrated into a pretrained video diffusion model.

Details

Motivation: Existing camera-controllable video generation methods rely on camera pose annotations that are difficult to scale to large datasets and often inconsistent with depth estimation, creating train-test discrepancies.

Method: CETCAM uses geometry foundation models (VGGT) to estimate depth and camera parameters, converts them into unified geometry-aware tokens, and integrates them into a pretrained video diffusion backbone via lightweight context blocks. It’s trained in two progressive stages: first learning camera controllability from diverse raw videos, then refining visual quality with high-fidelity datasets.

Result: Extensive experiments show state-of-the-art performance in geometric consistency, temporal stability, and visual realism across multiple benchmarks. The framework also demonstrates strong adaptability to additional control modalities like inpainting and layout control.

Conclusion: CETCAM provides an effective camera-controllable video generation solution without requiring camera annotations, offering strong geometric consistency, temporal stability, and flexibility for various control applications beyond camera manipulation.

Abstract: Achieving precise camera control in video generation remains challenging, as existing methods often rely on camera pose annotations that are difficult to scale to large and dynamic datasets and are frequently inconsistent with depth estimation, leading to train-test discrepancies. We introduce CETCAM, a camera-controllable video generation framework that eliminates the need for camera annotations through a consistent and extensible tokenization scheme. CETCAM leverages recent advances in geometry foundation models, such as VGGT, to estimate depth and camera parameters and converts them into unified, geometry-aware tokens. These tokens are seamlessly integrated into a pretrained video diffusion backbone via lightweight context blocks. Trained in two progressive stages, CETCAM first learns robust camera controllability from diverse raw video data and then refines fine-grained visual quality using curated high-fidelity datasets. Extensive experiments across multiple benchmarks demonstrate state-of-the-art geometric consistency, temporal stability, and visual realism. Moreover, CETCAM exhibits strong adaptability to additional control modalities, including inpainting and layout control, highlighting its flexibility beyond camera control. The project page is available at https://sjtuytc.github.io/CETCam_project_page.github.io/.

Sihao Lin, Zerui Li, Xunyi Zhao, Gengze Zhou, Liuyi Wang, Rong Wei, Rui Tang, Juncheng Li, Hanqing Wang, Jiangmiao Pang, Anton van den Hengel, Jiajun Liu, Qi Wu

Main category: cs.CV

TL;DR: VLNVerse is a new large-scale, extensible benchmark for Vision-Language Navigation that unifies fragmented tasks, provides realistic physics simulation, and enables comprehensive evaluation of embodied AI agents.

Details

Motivation: Existing VLN benchmarks have limitations: they're small-scale, use naive physical simulation, create a sim-to-real generalization gap, suffer from task fragmentation, and don't meet the data demands of modern LLM-based pretraining.

Method: Introduces VLNVerse benchmark with four key features: 1) Versatile - unifies previously fragmented tasks into single framework with extensible toolkit, 2) Embodied - supports full-kinematics agents (not just “ghost” agents), 3) Realistic Simulation - powered by robust physics engine, 4) Comprehensive Evaluation - large-scale dataset for evaluating existing methods and proposing new unified multi-task model.

Result: VLNVerse provides a large-scale, extensible benchmark that enables comprehensive evaluation of VLN methods from classic models to MLLM-based agents, and supports development of unified multi-task models.

Conclusion: VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing a vital tool to boost research towards scalable, general-purpose embodied locomotion agents.

Abstract: Despite remarkable progress in Vision-Language Navigation (VLN), existing benchmarks remain confined to fixed, small-scale datasets with naive physical simulation. These shortcomings limit the insight that the benchmarks provide into sim-to-real generalization, and create a significant research gap. Furthermore, task fragmentation prevents unified/shared progress in the area, while limited data scales fail to meet the demands of modern LLM-based pretraining. To overcome these limitations, we introduce VLNVerse: a new large-scale, extensible benchmark designed for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. Its Versatile nature unifies previously fragmented tasks into a single framework and provides an extensible toolkit for researchers. Its Embodied design moves beyond intangible and teleporting “ghost” agents that support full-kinematics in a Realistic Simulation powered by a robust physics engine. We leverage the scale and diversity of VLNVerse to conduct a comprehensive Evaluation of existing methods, from classic models to MLLM-based agents. We also propose a novel unified multi-task model capable of addressing all tasks within the benchmark. VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing the community with a vital tool to boost research towards scalable, general-purpose embodied locomotion agents.

[223] Steering Vision-Language Pre-trained Models for Incremental Face Presentation Attack Detection

Haoze Li, Jie Zhang, Guoying Zhao, Stephen Lin, Shiguang Shan

Main category: cs.CV

TL;DR: SVLP-IL is a vision-language pre-trained based rehearsal-free incremental learning framework for face presentation attack detection that uses multi-aspect prompting and selective elastic weight consolidation to balance stability and plasticity while maintaining privacy compliance.

Details

Motivation: Face PAD needs incremental learning to handle evolving spoofing tactics and domains, but privacy regulations prevent retaining past data, requiring rehearsal-free IL. VLP models offer prompt-tunable cross-modal representations for efficient adaptation to new spoofing styles.

Method: SVLP-IL uses Multi-Aspect Prompting (MAP) to isolate domain dependencies, enhance distribution-shift sensitivity, and mitigate forgetting by exploiting universal and domain-specific cues. It also employs Selective Elastic Weight Consolidation (SEWC) to preserve critical weights from previous tasks while allowing flexibility for new adaptations.

Result: Comprehensive experiments across multiple PAD benchmarks show SVLP-IL significantly reduces catastrophic forgetting and enhances performance on unseen domains.

Conclusion: SVLP-IL offers a privacy-compliant, practical solution for robust lifelong PAD deployment in rehearsal-free incremental learning settings.

Abstract: Face Presentation Attack Detection (PAD) demands incremental learning (IL) to combat evolving spoofing tactics and domains. Privacy regulations, however, forbid retaining past data, necessitating rehearsal-free IL (RF-IL). Vision-Language Pre-trained (VLP) models, with their prompt-tunable cross-modal representations, enable efficient adaptation to new spoofing styles and domains. Capitalizing on this strength, we propose \textbf{SVLP-IL}, a VLP-based RF-IL framework that balances stability and plasticity via \textit{Multi-Aspect Prompting} (MAP) and \textit{Selective Elastic Weight Consolidation} (SEWC). MAP isolates domain dependencies, enhances distribution-shift sensitivity, and mitigates forgetting by jointly exploiting universal and domain-specific cues. SEWC selectively preserves critical weights from previous tasks, retaining essential knowledge while allowing flexibility for new adaptations. Comprehensive experiments across multiple PAD benchmarks show that SVLP-IL significantly reduces catastrophic forgetting and enhances performance on unseen domains. SVLP-IL offers a privacy-compliant, practical solution for robust lifelong PAD deployment in RF-IL settings.

[224] Finer-Personalization Rank: Fine-Grained Retrieval Examines Identity Preservation for Personalized Generation

Connor Kilrain, David Carlyn, Julia Chae, Sara Beery, Wei-Lun Chao, Jianyang Gu

Main category: cs.CV

TL;DR: Finer-Personalization Rank is a new evaluation protocol for identity preservation in personalized generative models that uses retrieval-based ranking instead of pairwise similarity, better capturing fine-grained identity details.

Details

Motivation: Current evaluation metrics for personalized generative models focus on overall semantic similarity but fail to capture fine-grained discriminative details that are crucial for identity preservation (e.g., specific markings on a pet).

Method: Instead of pairwise similarity, Finer-Personalization Rank adopts a ranking approach where generated images are treated as queries against an identity-labeled gallery of visually similar real images. It uses retrieval metrics like mean average precision to measure identity preservation.

Result: The method was evaluated on CUB, Stanford Cars, and animal Re-ID benchmarks, showing that Finer-Personalization Rank more faithfully reflects identity retention than semantic-only metrics and reveals substantial identity drift in popular personalization methods.

Conclusion: The gallery-based ranking protocol provides a principled and practical evaluation framework for personalized generation that better captures identity preservation at multiple granularities, from fine-grained categories to individual instances.

Abstract: The rise of personalized generative models raises a central question: how should we evaluate identity preservation? Given a reference image (e.g., one’s pet), we expect the generated image to retain precise details attached to the subject’s identity. However, current generative evaluation metrics emphasize the overall semantic similarity between the reference and the output, and overlook these fine-grained discriminative details. We introduce Finer-Personalization Rank, an evaluation protocol tailored to identity preservation. Instead of pairwise similarity, Finer-Personalization Rank adopts a ranking view: it treats each generated image as a query against an identity-labeled gallery consisting of visually similar real images. Retrieval metrics (e.g., mean average precision) measure performance, where higher scores indicate that identity-specific details (e.g., a distinctive head spot) are preserved. We assess identity at multiple granularities – from fine-grained categories (e.g., bird species, car models) to individual instances (e.g., re-identification). Across CUB, Stanford Cars, and animal Re-ID benchmarks, Finer-Personalization Rank more faithfully reflects identity retention than semantic-only metrics and reveals substantial identity drift in several popular personalization methods. These results position the gallery-based protocol as a principled and practical evaluation for personalized generation.

[225] Automatic Neuronal Activity Segmentation in Fast Four Dimensional Spatio-Temporal Fluorescence Imaging using Bayesian Approach

Ran Li, Pan Xiao, Kaushik Dutta, Youdong Guo

Main category: cs.CV

TL;DR: A Bayesian deep learning framework for automatic detection of active neurons in 4D calcium imaging data, using spatio-temporal information and providing uncertainty estimates.

Details

Motivation: Manual segmentation of neuronal activities in calcium imaging is time-consuming, labor-intensive, and lacks generalizability, creating a bottleneck for studying brain activity mapping at single-cell resolution.

Method: Bayesian deep learning framework that combines temporal information (pixel-wise correlation maps) with spatial information (mean summary image) to detect neuronal activities in 4D spatio-temporal light sheet microscopy data.

Result: Achieved mean Dice Score of 0.81 against synthetic ground truth (Otsu’s method) and 0.79 for test reproducibility between first and second runs, demonstrating reliable active neuron detection.

Conclusion: The Bayesian framework enables rapid, automatic detection of active neuronal activities for behavioral studies while providing probability segmentation maps and uncertainty estimates for neuron detection.

Abstract: Fluorescence Microcopy Calcium Imaging is a fundamental tool to in-vivo record and analyze large scale neuronal activities simultaneously at a single cell resolution. Automatic and precise detection of behaviorally relevant neuron activity from the recordings is critical to study the mapping of brain activity in organisms. However a perpetual bottleneck to this problem is the manual segmentation which is time and labor intensive and lacks generalizability. To this end, we present a Bayesian Deep Learning Framework to detect neuronal activities in 4D spatio-temporal data obtained by light sheet microscopy. Our approach accounts for the use of temporal information by calculating pixel wise correlation maps and combines it with spatial information given by the mean summary image. The Bayesian framework not only produces probability segmentation maps but also models the uncertainty pertaining to active neuron detection. To evaluate the accuracy of our framework we implemented the test of reproducibility to assert the generalization of the network to detect neuron activity. The network achieved a mean Dice Score of 0.81 relative to the synthetic Ground Truth obtained by Otsu’s method and a mean Dice Score of 0.79 between the first and second run for test of reproducibility. Our method successfully deployed can be used for rapid detection of active neuronal activities for behavioural studies.

[226] Distinguishing Visually Similar Actions: Prompt-Guided Semantic Prototype Modulation for Few-Shot Action Recognition

Xiaoyang Li, Mingming Lu, Ruiqi Wang, Hao Li, Zewei Le

Main category: cs.CV

TL;DR: CLIP-SPM framework improves few-shot action recognition by addressing temporal modeling, visual similarity, and modality gap challenges through hierarchical motion refinement, semantic prototype modulation, and prototype-anchor dual modulation.

Details

Motivation: Few-shot action recognition faces three core challenges: (1) temporal modeling issues with static background interference, (2) difficulty distinguishing visually similar categories, and (3) modality gap between visual-textual support prototypes and visual-only queries.

Method: Proposes CLIP-SPM framework with three components: (1) HSMR module aligns deep/shallow motion features to reduce static background interference, (2) SPM strategy generates query-relevant text prompts to bridge modality gap and enhance discriminability, (3) PADM method refines support prototypes and aligns query features with global semantic anchor.

Result: Achieves competitive performance on Kinetics, SSv2-Full, SSv2-Small, UCF101, and HMDB51 benchmarks under 1-shot, 3-shot, and 5-shot settings. Ablation studies validate effectiveness of each component.

Conclusion: CLIP-SPM effectively addresses core challenges in few-shot action recognition through integrated motion refinement, semantic modulation, and prototype alignment components, demonstrating strong performance across multiple benchmarks.

Abstract: Few-shot action recognition aims to enable models to quickly learn new action categories from limited labeled samples, addressing the challenge of data scarcity in real-world applications. Current research primarily addresses three core challenges: (1) temporal modeling, where models are prone to interference from irrelevant static background information and struggle to capture the essence of dynamic action features; (2) visual similarity, where categories with subtle visual differences are difficult to distinguish; and (3) the modality gap between visual-textual support prototypes and visual-only queries, which complicates alignment within a shared embedding space. To address these challenges, this paper proposes a CLIP-SPM framework, which includes three components: (1) the Hierarchical Synergistic Motion Refinement (HSMR) module, which aligns deep and shallow motion features to improve temporal modeling by reducing static background interference; (2) the Semantic Prototype Modulation (SPM) strategy, which generates query-relevant text prompts to bridge the modality gap and integrates them with visual features, enhancing the discriminability between similar actions; and (3) the Prototype-Anchor Dual Modulation (PADM) method, which refines support prototypes and aligns query features with a global semantic anchor, improving consistency across support and query samples. Comprehensive experiments across standard benchmarks, including Kinetics, SSv2-Full, SSv2-Small, UCF101, and HMDB51, demonstrate that our CLIP-SPM achieves competitive performance under 1-shot, 3-shot, and 5-shot settings. Extensive ablation studies and visual analyses further validate the effectiveness of each component and its contributions to addressing the core challenges. The source code and models are publicly available at GitHub.

[227] WaTeRFlow: Watermark Temporal Robustness via Flow Consistency

Utae Jeong, Sumin In, Hyunju Ryu, Jaewan Choi, Feng Yang, Jongheon Jeong, Seungryong Kim, Sangpil Kim

Main category: cs.CV

TL;DR: WaTeRFlow is a deep learning framework for robust image watermarking that maintains detection accuracy when watermarked images are converted to videos using image-to-video (I2V) models.

Details

Motivation: Current deep learning-based watermarking methods struggle when watermarked images are converted to videos via I2V models, as per-frame detection weakens. With I2V advancing rapidly and being used for content creation, world-modeling, and simulation, cross-modal watermark recovery becomes crucial.

Method: Three main components: (1) FUSE (Flow-guided Unified Synthesis Engine) exposes encoder-decoder to realistic distortions via instruction-driven edits and fast video diffusion proxy during training; (2) optical-flow warping with Temporal Consistency Loss (TCL) stabilizes per-frame predictions; (3) semantic preservation loss maintains conditioning signal.

Result: Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.

Conclusion: WaTeRFlow effectively bridges the gap in cross-modal watermark recovery for I2V applications, providing robust watermark detection that maintains accuracy through the image-to-video conversion process.

Abstract: Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V), in which per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder-decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.

[228] Decoupled Generative Modeling for Human-Object Interaction Synthesis

Hwanhee Jung, Seunggwan Lee, Jeongyoon Yoon, SeungHyeon Kim, Giljoo Nam, Qixing Huang, Sangpil Kim

Main category: cs.CV

TL;DR: DecHOI separates human-object interaction synthesis into path planning and action synthesis stages, using adversarial training for contact realism, outperforming prior methods on benchmarks.

Details

Motivation: Existing HOI synthesis approaches require manual waypoint specification and combine all optimization objectives in a single network, leading to complexity, reduced flexibility, and errors like unsynchronized motion and penetration.

Method: Decoupled approach with trajectory generator for human/object paths without waypoints, action generator for detailed motions conditioned on paths, adversarial training with distal joint dynamics discriminator, and support for dynamic scenes with moving counterparts.

Result: Outperforms prior methods on FullBodyManipulation and 3D-FUTURE benchmarks across most quantitative metrics and qualitative evaluations, with perceptual studies preferring DecHOI results.

Conclusion: DecHOI’s decoupled generative modeling approach effectively addresses limitations of existing HOI synthesis methods by separating planning from synthesis and improving contact realism through adversarial training.

Abstract: Synthesizing realistic human-object interaction (HOI) is essential for 3D computer vision and robotics, underpinning animation and embodied control. Existing approaches often require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. To address these issues, we propose Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI), which separates path planning and action synthesis. A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions. To further improve contact realism, we employ adversarial training with a discriminator that focuses on the dynamics of distal joints. The framework also models a moving counterpart and supports responsive, long-sequence planning in dynamic scenes, while preserving plan consistency. Across two benchmarks, FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations, and perceptual studies likewise prefer our results.

[229] 6DAttack: Backdoor Attacks in the 6DoF Pose Estimation

Jihui Guo, Zongmin Zhang, Zhen Sun, Yuhao Yang, Jinlin Wu, Fu Zhang, Xinlei He

Main category: cs.CV

TL;DR: 6DAttack: A backdoor attack framework for 6DoF object pose estimation that uses 3D object triggers to induce controlled erroneous poses while maintaining normal performance on clean samples.

Details

Motivation: While 6DoF pose estimation is crucial for robotics, AR/VR, and autonomous systems, backdoor attacks pose significant security risks. Most backdoor research focuses on 2D vision, leaving 6DoF pose estimation largely unexplored. Traditional backdoor methods that only change classification labels are insufficient for controlling continuous pose parameters like translation and rotation.

Method: Propose 6DAttack framework using 3D object triggers to induce controlled erroneous poses. The method maintains normal behavior on clean samples while causing specific pose errors when triggers are present. Evaluated on three state-of-the-art pose estimation models (PVNet, DenseFusion, PoseDiffusion) across three datasets (LINEMOD, YCB-Video, CO3D).

Result: Achieved high attack success rates (ASRs) without compromising clean performance. Backdoored models achieved up to 100% clean ADD accuracy and 100% ASR, with triggered samples reaching 97.70% ADD-P. A representative defense method remained ineffective against the attack.

Conclusion: 6DAttack reveals a serious, underexplored threat to 6DoF pose estimation systems. The framework demonstrates that backdoor attacks can effectively manipulate continuous pose parameters while evading detection, highlighting critical security vulnerabilities in real-world applications.

Abstract: Deep learning advances have enabled accurate six-degree-of-freedom (6DoF) object pose estimation, widely used in robotics, AR/VR, and autonomous systems. However, backdoor attacks pose significant security risks. While most research focuses on 2D vision, 6DoF pose estimation remains largely unexplored. Unlike traditional backdoors that only change classes, 6DoF attacks must control continuous parameters like translation and rotation, rendering 2D methods inapplicable. We propose 6DAttack, a framework using 3D object triggers to induce controlled erroneous poses while maintaining normal behavior. Evaluations on PVNet, DenseFusion, and PoseDiffusion across LINEMOD, YCB-Video, and CO3D show high attack success rates (ASRs) without compromising clean performance. Backdoored models achieve up to 100% clean ADD accuracy and 100% ASR, with triggered samples reaching 97.70% ADD-P. Furthermore, a representative defense remains ineffective. Our findings reveal a serious, underexplored threat to 6DoF pose estimation.

[230] Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation

Khanh Nguyen, Dasith de Silva Edirimuni, Ghulam Mubashar Hassan, Ajmal Mian

Main category: cs.CV

TL;DR: A method for open-vocabulary 3D instance segmentation that uses 2D open-vocabulary detectors to generate 3D instance masks from RGB images, enabling fast retrieval of rare objects without SAM/CLIP dependencies.

Details

Motivation: Existing open-vocabulary 3D instance segmentation methods rely heavily on SAM and CLIP, causing high computational overhead and slow processing that limits real-world deployment. Open-YOLO 3D improves speed but fails to generalize to infrequent object categories in 3D training data.

Method: Proposes generating 3D instance masks for novel objects from RGB images guided by a 2D open-vocabulary detector. The approach inherits the 2D detector’s ability to recognize novel objects while maintaining efficient classification, eliminating the need for SAM and CLIP.

Result: Enables fast and accurate retrieval of rare instances from open-ended text queries while significantly reducing inference time compared to SAM/CLIP-based methods.

Conclusion: The method provides an efficient solution for open-vocabulary 3D instance segmentation that balances speed and generalization to rare object categories, addressing limitations of both SAM/CLIP-based approaches and Open-YOLO 3D.

Abstract: Locating and retrieving objects from scene-level point clouds is a challenging problem with broad applications in robotics and augmented reality. This task is commonly formulated as open-vocabulary 3D instance segmentation. Although recent methods demonstrate strong performance, they depend heavily on SAM and CLIP to generate and classify 3D instance masks from images accompanying the point cloud, leading to substantial computational overhead and slow processing that limit their deployment in real-world settings. Open-YOLO 3D alleviates this issue by using a real-time 2D detector to classify class-agnostic masks produced directly from the point cloud by a pretrained 3D segmenter, eliminating the need for SAM and CLIP and significantly reducing inference time. However, Open-YOLO 3D often fails to generalize to object categories that appear infrequently in the 3D training data. In this paper, we propose a method that generates 3D instance masks for novel objects from RGB images guided by a 2D open-vocabulary detector. Our approach inherits the 2D detector’s ability to recognize novel objects while maintaining efficient classification, enabling fast and accurate retrieval of rare instances from open-ended text queries. Our code will be made available at https://github.com/ndkhanh360/BoxOVIS.

[231] Auditing Significance, Metric Choice, and Demographic Fairness in Medical AI Challenges

Ariel Lubonja, Pedro R. A. S. Bassi, Wenxuan Li, Hualin Qiao, Randal Burns, Alan L. Yuille, Zongwei Zhou

Main category: cs.CV

TL;DR: RankInsight is an open-source toolkit that addresses three key limitations in medical AI leaderboards: statistical significance testing, organ-appropriate metrics, and intersectional fairness analysis.

Details

Motivation: Medical AI leaderboards have three persistent limitations: (1) lack of statistical significance testing for score gaps, (2) use of single averaged metrics that hide clinically important boundary errors, and (3) failure to report performance across intersecting demographics, masking fairness gaps.

Method: RankInsight toolkit computes pair-wise significance maps, recomputes leaderboards with organ-appropriate metrics, and audits intersectional fairness across demographic groups.

Result: RankInsight shows: (1) nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty; (2) replacing Dice with NSD for tubular structures reverses order of top four models; (3) more than half of MONAI-based entries show largest gender-race discrepancy on proprietary dataset.

Conclusion: RankInsight enables organizers and participants to publish rankings that are statistically sound, clinically meaningful, and demographically fair, and can be applied to past, ongoing, and future medical AI challenges.

Abstract: Open challenges have become the de facto standard for comparative ranking of medical AI methods. Despite their importance, medical AI leaderboards exhibit three persistent limitations: (1) score gaps are rarely tested for statistical significance, so rank stability is unknown; (2) single averaged metrics are applied to every organ, hiding clinically important boundary errors; (3) performance across intersecting demographics is seldom reported, masking fairness and equity gaps. We introduce RankInsight, an open-source toolkit that seeks to address these limitations. RankInsight (1) computes pair-wise significance maps that show the nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty; (2) recomputes leaderboards with organ-appropriate metrics, reversing the order of the top four models when Dice is replaced by NSD for tubular structures; and (3) audits intersectional fairness, revealing that more than half of the MONAI-based entries have the largest gender-race discrepancy on our proprietary Johns Hopkins Hospital dataset. The RankInsight toolkit is publicly released and can be directly applied to past, ongoing, and future challenges. It enables organizers and participants to publish rankings that are statistically sound, clinically meaningful, and demographically fair.

[232] Mamba-Based Modality Disentanglement Network for Multi-Contrast MRI Reconstruction

Weiyi Lyu, Xinming Fang, Jun Wang, Jun Shi, Guixu Zhang, Juncheng Li

Main category: cs.CV

TL;DR: MambaMDN: A dual-domain framework using Mamba-based modality disentanglement for accelerated multi-contrast MRI reconstruction, addressing aliasing artifacts and modality contamination.

Details

Motivation: Prolonged MRI scan times hinder patient throughput and comfort. Existing accelerated MRI techniques fail to effectively utilize K-space prior information (causing aliasing artifacts) and suffer from irrelevant information contamination in multi-contrast fusion strategies.

Method: Dual-domain framework: 1) Uses fully-sampled reference K-space data to complete undersampled target data, creating structurally aligned but modality-mixed inputs. 2) Mamba-based modality disentanglement network extracts and removes reference-specific features. 3) Iterative refinement mechanism progressively enhances reconstruction through repeated feature purification.

Result: Extensive experiments demonstrate that MambaMDN significantly outperforms existing multi-contrast reconstruction methods.

Conclusion: MambaMDN effectively addresses key challenges in accelerated MRI by leveraging K-space prior information while preventing modality contamination through disentanglement and iterative refinement.

Abstract: Magnetic resonance imaging (MRI) is a cornerstone of modern clinical diagnosis, offering unparalleled soft-tissue contrast without ionizing radiation. However, prolonged scan times remain a major barrier to patient throughput and comfort. Existing accelerated MRI techniques often struggle with two key challenges: (1) failure to effectively utilize inherent K-space prior information, leading to persistent aliasing artifacts from zero-filled inputs; and (2) contamination of target reconstruction quality by irrelevant information when employing multi-contrast fusion strategies. To overcome these challenges, we present MambaMDN, a dual-domain framework for multi-contrast MRI reconstruction. Our approach first employs fully-sampled reference K-space data to complete the undersampled target data, generating structurally aligned but modality-mixed inputs. Subsequently, we develop a Mamba-based modality disentanglement network to extract and remove reference-specific features from the mixed representation. Furthermore, we introduce an iterative refinement mechanism to progressively enhance reconstruction accuracy through repeated feature purification. Extensive experiments demonstrate that MambaMDN can significantly outperform existing multi-contrast reconstruction methods.

[233] Towards Minimal Fine-Tuning of VLMs

Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee

Main category: cs.CV

TL;DR: Image-LoRA is a lightweight PEFT method for VLMs that applies low-rank adaptation only to value paths in visual-token attention layers, reduces training FLOPs proportionally to visual-token fraction, and selectively adapts attention heads using influence scores.

Details

Motivation: To create a more parameter-efficient fine-tuning method for vision-language models that reduces computational overhead while maintaining performance, especially for tasks involving both visual and textual reasoning.

Method: Applies low-rank adaptation only to value paths in attention layers within visual-token span; uses rank-1 Image-LoRA to estimate head influence scores for selective adaptation; implements selection-size normalization for layer update stabilization.

Result: Matches or closely approaches standard LoRA accuracy across screen-centric grounding and referring benchmarks (text-heavy to image-heavy regimes) while using fewer trainable parameters and lower adapter-only training FLOPs; preserves pure-text reasoning performance on GSM8K.

Conclusion: Image-LoRA provides an effective lightweight PEFT recipe for VLMs that maintains performance while reducing computational costs and preserving text reasoning capabilities.

Abstract: We introduce Image-LoRA, a lightweight parameter efficient fine-tuning (PEFT) recipe for transformer-based vision-language models (VLMs). Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span, reducing adapter-only training FLOPs roughly in proportion to the visual-token fraction. We further adapt only a subset of attention heads, selected using head influence scores estimated with a rank-1 Image-LoRA, and stabilize per-layer updates via selection-size normalization. Across screen-centric grounding and referring benchmarks spanning text-heavy to image-heavy regimes, Image-LoRA matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs. The method also preserves the pure-text reasoning performance of VLMs before and after fine-tuning, as further shown on GSM8K.

[234] GaussianImage++: Boosted Image Representation and Compression with 2D Gaussian Splatting

Tiantian Li, Xinjie Zhang, Xingtong Ge, Tongda Xu, Dailan He, Jun Zhang, Yan Wang

Main category: cs.CV

TL;DR: GaussianImage++ improves image representation and compression using limited Gaussian primitives with distortion-driven densification, context-aware filters, and efficient quantization.

Details

Motivation: Implicit neural representations (INRs) have high training time/memory requirements, while 2D Gaussian Splatting methods need excessive primitives for quality. There's a need for efficient GS-based approaches with limited primitives.

Method: Three key techniques: 1) Distortion-driven densification that allocates primitives based on signal intensity, 2) Context-aware Gaussian filters for primitive optimization, 3) Attribute-separated learnable scalar quantizers with quantization-aware training for compression.

Result: Outperforms GaussianImage and INRs-based COIN in both representation and compression performance while maintaining real-time decoding and low memory usage.

Conclusion: GaussianImage++ successfully demonstrates that GS-based approaches can achieve superior representation and compression with limited primitives, offering practical advantages over existing methods.

Abstract: Implicit neural representations (INRs) have achieved remarkable success in image representation and compression, but they require substantial training time and memory. Meanwhile, recent 2D Gaussian Splatting (GS) methods (\textit{e.g.}, GaussianImage) offer promising alternatives through efficient primitive-based rendering. However, these methods require excessive Gaussian primitives to maintain high visual fidelity. To exploit the potential of GS-based approaches, we present GaussianImage++, which utilizes limited Gaussian primitives to achieve impressive representation and compression performance. Firstly, we introduce a distortion-driven densification mechanism. It progressively allocates Gaussian primitives according to signal intensity. Secondly, we employ context-aware Gaussian filters for each primitive, which assist in the densification to optimize Gaussian primitives based on varying image content. Thirdly, we integrate attribute-separated learnable scalar quantizers and quantization-aware training, enabling efficient compression of primitive attributes. Experimental results demonstrate the effectiveness of our method. In particular, GaussianImage++ outperforms GaussianImage and INRs-based COIN in representation and compression performance while maintaining real-time decoding and low memory usage.

[235] From Pixels to Predicates Structuring urban perception with scene graphs

Yunlong Liu, Shuyang Li, Pengyuan Liu, Yu Zhang, Rudi Stouffs

Main category: cs.CV

TL;DR: A graph-based pipeline using scene graphs and graph autoencoders improves urban perception prediction by 26% over image-only baselines while providing interpretable relational insights.

Details

Motivation: Current perception research using streetscapes often relies on pixel features or object co-occurrence statistics, missing explicit relational information that shapes human perception. There's a need for more structured, interpretable approaches to urban perception modeling.

Method: Three-stage pipeline: 1) Parse street view images with OpenPSG to extract object-predicate-object triplets, 2) Learn compact scene embeddings using GraphMAE (heterogeneous graph autoencoder), 3) Predict six perceptual indicators via neural network from these embeddings.

Result: 26% average accuracy improvement over image-only baselines; strong cross-city generalization performance; interpretable relational patterns identified (e.g., graffiti on wall, car parked on sidewalk lower perception scores).

Conclusion: Graph-based structured representations provide expressive, generalizable, and interpretable signals for urban perception modeling, advancing human-centric and context-aware urban analytics.

Abstract: Perception research is increasingly modelled using streetscapes, yet many approaches still rely on pixel features or object co-occurrence statistics, overlooking the explicit relations that shape human perception. This study proposes a three stage pipeline that transforms street view imagery (SVI) into structured representations for predicting six perceptual indicators. In the first stage, each image is parsed using an open-set Panoptic Scene Graph model (OpenPSG) to extract object predicate object triplets. In the second stage, compact scene-level embeddings are learned through a heterogeneous graph autoencoder (GraphMAE). In the third stage, a neural network predicts perception scores from these embeddings. We evaluate the proposed approach against image-only baselines in terms of accuracy, precision, and cross-city generalization. Results indicate that (i) our approach improves perception prediction accuracy by an average of 26% over baseline models, and (ii) maintains strong generalization performance in cross-city prediction tasks. Additionally, the structured representation clarifies which relational patterns contribute to lower perception scores in urban scenes, such as graffiti on wall and car parked on sidewalk. Overall, this study demonstrates that graph-based structure provides expressive, generalizable, and interpretable signals for modelling urban perception, advancing human-centric and context-aware urban analytics.

[236] Trifocal Tensor and Relative Pose Estimation with Known Vertical Direction

Tao Li, Zhenbao Yu, Banglei Guan, Jianli Han, Weimin Lv, Friedrich Fraundorfer

Main category: cs.CV

TL;DR: Novel solvers for relative pose estimation using known vertical directions from IMUs, requiring only 4 or 3 point correspondences for efficient RANSAC-based outlier removal.

Details

Motivation: Vertical directions from IMUs are readily available in many applications (autonomous vehicles, phones, UAVs), enabling simplified pose estimation with fewer point correspondences for more efficient RANSAC processing.

Method: Two solvers: 1) Linear closed-form solution requiring 4 point correspondences across 3 views, 2) Minimal solution using Gröbner basis solver requiring only 3 point correspondences. Both leverage known vertical directions to reduce unknowns to two rotation angles and two translation vectors.

Result: Superior accuracy compared to alternative methods on both synthetic data and real-world KITTI scenes, with efficient performance suitable for RANSAC-based outlier removal in visual odometry.

Conclusion: The proposed methods provide accurate and efficient relative pose estimation by exploiting readily available vertical direction information, enabling robust visual odometry with minimal computational requirements.

Abstract: This work presents two novel solvers for estimating the relative poses among views with known vertical directions. The vertical directions of camera views can be easily obtained using inertial measurement units (IMUs) which have been widely used in autonomous vehicles, mobile phones, and unmanned aerial vehicles (UAVs). Given the known vertical directions, our lgorithms only need to solve for two rotation angles and two translation vectors. In this paper, a linear closed-form solution has been described, requiring only four point correspondences in three views. We also propose a minimal solution with three point correspondences using the latest Gröbner basis solver. Since the proposed methods require fewer point correspondences, they can be efficiently applied within the RANSAC framework for outliers removal and pose estimation in visual odometry. The proposed method has been tested on both synthetic data and real-world scenes from KITTI. The experimental results show that the accuracy of the estimated poses is superior to other alternative methods.

[237] Is Visual Realism Enough? Evaluating Gait Biometric Fidelity in Generative AI Human Animation

Ivan DeAndres-Tame, Chengwei Ye, Ruben Tolosana, Ruben Vera-Rodriguez, Shiqi Yu

Main category: cs.CV

TL;DR: Current GenAI models produce visually high-quality human animations but fail to preserve subtle gait patterns needed for biometric identification, revealing they rely on visual attributes rather than temporal dynamics.

Details

Motivation: While GenAI models can generate realistic human animations, they may not preserve the subtle spatio-temporal details needed for behavioral biometrics like gait recognition, which is critical for person identification applications.

Method: Evaluated four state-of-the-art GenAI models across two tasks: 1) restoring gait patterns from reference videos under varying complexity conditions, and 2) transferring gait patterns to different visual identities.

Result: Visual quality is high but biometric fidelity remains low for identification tasks. Identity transfer experiments show that when texture is disentangled from motion, identification collapses, revealing models rely on visual attributes rather than temporal dynamics.

Conclusion: Current GenAI models struggle to disentangle identity from motion and cannot preserve the subtle gait patterns needed for biometric identification, exposing fundamental limitations in appearance-based gait recognition systems.

Abstract: Generative AI (GenAI) models have revolutionized animation, enabling the synthesis of humans and motion patterns with remarkable visual fidelity. However, generating truly realistic human animation remains a formidable challenge, where even minor inconsistencies can make a subject appear unnatural. This limitation is particularly critical when AI-generated videos are evaluated for behavioral biometrics, where subtle motion cues that define identity are easily lost or distorted. The present study investigates whether state-of-the-art GenAI human animation models can preserve the subtle spatio-temporal details needed for person identification through gait biometrics. Specifically, we evaluate four different GenAI models across two primary evaluation tasks to assess their ability to i) restore gait patterns from reference videos under varying conditions of complexity, and ii) transfer these gait patterns to different visual identities. Our results show that while visual quality is mostly high, biometric fidelity remains low in tasks focusing on identification, suggesting that current GenAI models struggle to disentangle identity from motion. Furthermore, through an identity transfer task, we expose a fundamental flaw in appearance-based gait recognition: when texture is disentangled from motion, identification collapses, proving current GenAI models rely on visual attributes rather than temporal dynamics.

[238] Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Wentao Zhang

Main category: cs.CV

TL;DR: MLLMs fail at multimodal retrieval despite excelling at generation due to text-dominated representations and homogenized embeddings that reduce discriminative power.

Details

Motivation: To investigate why multimodal large language models (MLLMs) perform poorly at zero-shot multimodal retrieval tasks despite their success in generative tasks, and to understand the underlying mechanisms that prevent them from being effective retrievers.

Method: Used sparse autoencoders (SAEs) to decompose MLLM output representations into interpretable semantic concepts, analyzed the representation space to examine the balance between textual and visual information, and investigated how similarity computations affect retrieval performance.

Result: Found that MLLM representations are overwhelmingly dominated by textual semantics with minimal visual information, embeddings are homogenized due to modality bridging, and specific feature components that contribute to similarity computations actually degrade retrieval performance.

Conclusion: This work provides the first interpretability analysis of MLLM representations for multimodal retrieval, revealing fundamental architectural limitations and offering directions for improving MLLMs’ retrieval capabilities.

Abstract: Despite the remarkable success of multimodal large language models (MLLMs) in generative tasks, we observe that they exhibit a counterintuitive deficiency in the zero-shot multimodal retrieval task. In this work, we investigate the underlying mechanisms that hinder MLLMs from serving as effective retrievers. With the help of sparse autoencoders (SAEs), we decompose MLLM output representations into interpretable semantic concepts to probe their intrinsic behavior. Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics; the visual information essential for multimodal retrieval only constitutes a small portion. This imbalance is compounded by the heavy focus of MLLMs on bridging image-text modalities, which facilitates generation but homogenizes embeddings and finally diminishes the discriminative power required for multimodal retrieval. We further discover that the specific feature components that contribute most to the similarity computations for MLLMs are in fact distractors that actively degrade retrieval performance. Overall, our work provides the first in-depth interpretability analysis of MLLM representations in the context of multimodal retrieval and offers possible directions for enhancing the multimodal retrieval capabilities of MLLMs.

[239] AMap: Distilling Future Priors for Ahead-Aware Online HD Map Construction

Ruikai Li, Xinrun Li, Mengwei Xie, Hao Shan, Shoumeng Qiu, Xinyuan Chang, Yizhe Fan, Feng Xiong, Han Jiang, Yilong Ren, Haiyang Yu, Mu Xu, Yang Long, Varun Ojha, Zhiyong Cui

Main category: cs.CV

TL;DR: AMap introduces a “distill-from-future” framework for online HD map construction that enables current-frame models to have “look-ahead” capabilities by distilling knowledge from future temporal contexts, improving forward region perception critical for autonomous driving safety.

Details

Motivation: Current temporal fusion methods for online HD map construction are "spatially backward-looking" - they mainly improve map reconstruction in traversed areas but offer minimal improvement for unseen road ahead. This creates a safety gap because inaccuracies in forward regions directly cause hazardous driving maneuvers, while rearward errors are often tolerable.

Method: AMap uses a teacher-student distillation framework where a teacher model with privileged access to future temporal contexts guides a lightweight student model restricted to the current frame. It employs Multi-Level BEV Distillation with spatial masking and an Asymmetric Query Adaptation module to transfer future-aware representations to the student’s static queries.

Result: Extensive experiments on nuScenes and Argoverse 2 benchmarks show AMap significantly enhances current-frame perception. It outperforms state-of-the-art temporal models in critical forward regions while maintaining the efficiency of single current frame inference.

Conclusion: The proposed “distill-from-future” paradigm successfully bridges the safety gap in online HD map construction by endowing current-frame models with “look-ahead” capabilities at zero inference-time cost, making autonomous driving safer through improved forward region perception.

Abstract: Online High-Definition (HD) map construction is pivotal for autonomous driving. While recent approaches leverage historical temporal fusion to improve performance, we identify a critical safety flaw in this paradigm: it is inherently spatially backward-looking." These methods predominantly enhance map reconstruction in traversed areas, offering minimal improvement for the unseen road ahead. Crucially, our analysis of downstream planning tasks reveals a severe asymmetry: while rearward perception errors are often tolerable, inaccuracies in the forward region directly precipitate hazardous driving maneuvers. To bridge this safety gap, we propose AMap, a novel framework for Ahead-aware online HD Mapping. We pioneer a distill-from-future" paradigm, where a teacher model with privileged access to future temporal contexts guides a lightweight student model restricted to the current frame. This process implicitly compresses prospective knowledge into the student model, endowing it with ``look-ahead" capabilities at zero inference-time cost. Technically, we introduce a Multi-Level BEV Distillation strategy with spatial masking and an Asymmetric Query Adaptation module to effectively transfer future-aware representations to the student’s static queries. Extensive experiments on the nuScenes and Argoverse 2 benchmark demonstrate that AMap significantly enhances current-frame perception. Most notably, it outperforms state-of-the-art temporal models in critical forward regions while maintaining the efficiency of single current frame inference.

[240] MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture

Hui Li, Jiayue Lyu, Fu-Yun Wang, Kaihui Cheng, Siyu Zhu, Jingdong Wang

Main category: cs.CV

TL;DR: MixFlow addresses training-testing discrepancy in diffusion models by using slowed timestep interpolations to improve generation quality.

Details

Motivation: Diffusion models suffer from exposure bias where training uses ground-truth noisy data but testing uses generated noisy data, creating a mismatch that hurts performance.

Method: Proposes MixFlow which leverages the “Slow Flow” phenomenon - ground-truth interpolations nearest to generated data correspond to higher-noise timesteps. Uses slowed interpolation mixtures for post-training prediction networks.

Result: Achieves strong results on ImageNet: 1.43 FID (unguided) and 1.10 FID (guided) at 256×256; 1.55 FID (unguided) and 1.10 FID (guided) at 512×512 with RAE models.

Conclusion: MixFlow effectively mitigates training-testing discrepancy in diffusion models, improving generation quality across class-conditional and text-to-image tasks.

Abstract: This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving the diffusion models. During training, the input of a prediction network at one training timestep is the corresponding ground-truth noisy data that is an interpolation of the noise and the data, and during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation that is the nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named slowed interpolation mixture, for post-training the prediction network for each training timestep. Experiments over class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Our approach MixFlow over the RAE models achieve strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 x 256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 x 512.

[241] OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions

Wendong Bu, Kaihang Pan, Yuze Lin, Jiacheng Li, Kai Shen, Wenqiao Zhang, Juncheng Li, Jun Xiao, Siliang Tang

Main category: cs.CV

TL;DR: OmniMoGen is a unified framework for versatile human motion generation using interleaved text-motion instructions, achieving SOTA performance across multiple tasks with emerging capabilities like compositional editing and self-reflection.

Details

Motivation: While LLMs have unified diverse linguistic tasks, motion generation remains confined to isolated tasks, limiting flexibility for free-form and omni-objective generation. There's a need for a unified framework similar to LLMs for human motion.

Method: Built on RVQ-VAE and transformer architecture, OmniMoGen enables end-to-end instruction-driven motion generation. The authors created X2Mo (137K+ interleaved text-motion instructions dataset) and AnyContext benchmark for evaluation.

Result: OmniMoGen achieves state-of-the-art performance on text-to-motion, motion editing, and the new AnyContext benchmark. It exhibits emerging capabilities including compositional editing, self-reflective generation, and knowledge-informed generation.

Conclusion: OmniMoGen represents a significant step toward next-generation intelligent motion generation by unifying diverse motion tasks within a single framework, similar to how LLMs unified linguistic tasks.

Abstract: Large language models (LLMs) have unified diverse linguistic tasks within a single framework, yet such unification remains unexplored in human motion generation. Existing methods are confined to isolated tasks, limiting flexibility for free-form and omni-objective generation. To address this, we propose OmniMoGen, a unified framework that enables versatile motion generation through interleaved text-motion instructions. Built upon a concise RVQ-VAE and transformer architecture, OmniMoGen supports end-to-end instruction-driven motion generation. We construct X2Mo, a large-scale dataset of over 137K interleaved text-motion instructions, and introduce AnyContext, a benchmark for evaluating interleaved motion generation. Experiments show that OmniMoGen achieves state-of-the-art performance on text-to-motion, motion editing, and AnyContext, exhibiting emerging capabilities such as compositional editing, self-reflective generation, and knowledge-informed generation. These results mark a step toward the next intelligent motion generation. Project Page: https://OmniMoGen.github.io/.

[242] DSTED: Decoupling Temporal Stabilization and Discriminative Enhancement for Surgical Workflow Recognition

Yueyao Chen, Kai-Ni Wang, Dario Tayupo, Arnaud Huaulm’e, Krystel Nyangoh Timoh, Pierre Jannin, Qi Dou

Main category: cs.CV

TL;DR: DSTED framework with Reliable Memory Propagation and Uncertainty-Aware Prototype Retrieval achieves state-of-the-art surgical workflow recognition by reducing temporal jitter and improving ambiguous phase discrimination.

Details

Motivation: Current surgical workflow recognition methods suffer from prediction jitter across consecutive frames and poor discrimination of ambiguous phases, limiting their clinical applicability for context-aware assistance and skill assessment.

Method: Proposed dual-pathway framework DSTED with Reliable Memory Propagation (RMP) for temporal coherence through filtered historical features, and Uncertainty-Aware Prototype Retrieval (UPR) for ambiguous phase enhancement via learnable prototypes and adaptive matching.

Result: Achieves 84.36% accuracy and 65.51% F1-score on AutoLaparo-hysterectomy, surpassing second-best by 3.51% and 4.88%. RMP contributes 2.19% gain, UPR 1.93%, with synergistic effects when combined.

Conclusion: Decoupling temporal consistency and phase ambiguity modeling through dual-pathway design provides superior performance and clinical applicability for stable surgical workflow recognition.

Abstract: Purpose: Surgical workflow recognition enables context-aware assistance and skill assessment in computer-assisted interventions. Despite recent advances, current methods suffer from two critical challenges: prediction jitter across consecutive frames and poor discrimination of ambiguous phases. This paper aims to develop a stable framework by selectively propagating reliable historical information and explicitly modeling uncertainty for hard sample enhancement. Methods: We propose a dual-pathway framework DSTED with Reliable Memory Propagation (RMP) and Uncertainty-Aware Prototype Retrieval (UPR). RMP maintains temporal coherence by filtering and fusing high-confidence historical features through multi-criteria reliability assessment. UPR constructs learnable class-specific prototypes from high-uncertainty samples and performs adaptive prototype matching to refine ambiguous frame representations. Finally, a confidence-driven gate dynamically balances both pathways based on prediction certainty. Results: Our method achieves state-of-the-art performance on AutoLaparo-hysterectomy with 84.36% accuracy and 65.51% F1-score, surpassing the second-best method by 3.51% and 4.88% respectively. Ablations reveal complementary gains from RMP (2.19%) and UPR (1.93%), with synergistic effects when combined. Extensive analysis confirms substantial reduction in temporal jitter and marked improvement on challenging phase transitions. Conclusion: Our dual-pathway design introduces a novel paradigm for stable workflow recognition, demonstrating that decoupling the modeling of temporal consistency and phase ambiguity yields superior performance and clinical applicability.

[243] PEDESTRIAN: An Egocentric Vision Dataset for Obstacle Detection on Pavements

Marios Thoma, Zenonas Theodosiou, Harris Partaourides, Vassilis Vassiliades, Loizos Michael, Andreas Lanitis

Main category: cs.CV

TL;DR: Introduces PEDESTRIAN dataset - egocentric videos of 29 sidewalk obstacles for training deep learning models to enhance pedestrian safety through real-time obstacle detection.

Details

Motivation: Sidewalk obstacles pose safety hazards to pedestrians, but existing datasets for training obstacle detection systems are limited. Need comprehensive egocentric datasets to develop effective real-time pedestrian safety systems.

Method: Created PEDESTRIAN dataset with 340 egocentric videos captured from pedestrian’s viewpoint using mobile phone cameras, covering 29 different sidewalk obstacle types. Trained and evaluated several state-of-the-art deep learning algorithms on this dataset.

Result: Successfully collected comprehensive dataset of sidewalk obstacles from pedestrian perspective. Experimental results provide benchmark performance metrics for obstacle detection and recognition tasks using modern deep learning models.

Conclusion: The PEDESTRIAN dataset enables development of effective pavement obstacle detectors to enhance pedestrian safety in urban environments, serving as valuable resource for computer vision and pedestrian safety research.

Abstract: Walking has always been a primary mode of transportation and is recognized as an essential activity for maintaining good health. Despite the need for safe walking conditions in urban environments, sidewalks are frequently obstructed by various obstacles that hinder free pedestrian movement. Any object obstructing a pedestrian’s path can pose a safety hazard. The advancement of pervasive computing and egocentric vision techniques offers the potential to design systems that can automatically detect such obstacles in real time, thereby enhancing pedestrian safety. The development of effective and efficient identification algorithms relies on the availability of comprehensive and well-balanced datasets of egocentric data. In this work, we introduce the PEDESTRIAN dataset, comprising egocentric data for 29 different obstacles commonly found on urban sidewalks. A total of 340 videos were collected using mobile phone cameras, capturing a pedestrian’s point of view. Additionally, we present the results of a series of experiments that involved training several state-of-the-art deep learning algorithms using the proposed dataset, which can be used as a benchmark for obstacle detection and recognition tasks. The dataset can be used for training pavement obstacle detectors to enhance the safety of pedestrians in urban areas.

[244] MT-Mark: Rethinking Image Watermarking via Mutual-Teacher Collaboration with Adaptive Feature Modulation

Fei Ge, Ying Huang, Jie Liu, Guixuan Zhang, Zhi Zeng, Shuwu Zhang, Hu Guan

Main category: cs.CV

TL;DR: A novel deep image watermarking framework with explicit collaboration between embedder and extractor via a Collaborative Interaction Mechanism and Adaptive Feature Modulation Module, achieving superior performance without exhaustive distortion simulation.

Details

Motivation: Existing deep watermarking methods have weak coupling between embedder and extractor, lacking explicit collaboration mechanisms. The embedder doesn't incorporate decoding-aware cues, and the extractor doesn't guide embedding during training, limiting overall system performance.

Method: Proposes a Collaborative Interaction Mechanism (CIM) for bidirectional communication between embedder and extractor, enabling mutual-teacher training. Introduces Adaptive Feature Modulation Module (AFMM) for content-aware feature regulation by decoupling modulation structure and strength. AFMMs on both sides form closed-loop collaboration aligning embedding with extraction objectives.

Result: Outperforms state-of-the-art approaches in watermark extraction accuracy while maintaining high perceptual quality. Shows strong robustness and generalization on both real-world and AI-generated datasets. Robustness emerges from coordinated representation learning rather than exhaustive distortion simulation.

Conclusion: Architecture-level redesign with explicit collaboration between embedding and extraction components fundamentally changes how robustness is learned in watermarking systems, leading to superior performance through coordinated representation learning rather than traditional distortion simulation approaches.

Abstract: Existing deep image watermarking methods follow a fixed embedding-distortion-extraction pipeline, where the embedder and extractor are weakly coupled through a final loss and optimized in isolation. This design lacks explicit collaboration, leaving no structured mechanism for the embedder to incorporate decoding-aware cues or for the extractor to guide embedding during training. To address this architectural limitation, we rethink deep image watermarking by reformulating embedding and extraction as explicitly collaborative components. To realize this reformulation, we introduce a Collaborative Interaction Mechanism (CIM) that establishes direct, bidirectional communication between the embedder and extractor, enabling a mutual-teacher training paradigm and coordinated optimization. Built upon this explicitly collaborative architecture, we further propose an Adaptive Feature Modulation Module (AFMM) to support effective interaction. AFMM enables content-aware feature regulation by decoupling modulation structure and strength, guiding watermark embedding toward stable image features while suppressing host interference during extraction. Under CIM, the AFMMs on both sides form a closed-loop collaboration that aligns embedding behavior with extraction objectives. This architecture-level redesign changes how robustness is learned in watermarking systems. Rather than relying on exhaustive distortion simulation, robustness emerges from coordinated representation learning between embedding and extraction. Experiments on real-world and AI-generated datasets demonstrate that the proposed method consistently outperforms state-of-the-art approaches in watermark extraction accuracy while maintaining high perceptual quality, showing strong robustness and generalization.

Zihao Luo, Shaohao Rui, Zhenyu Tang, Guotai Wang, Xiaosong Wang

Main category: cs.CV

TL;DR: InvCoSS is a privacy-preserving continual self-supervised learning framework for medical imaging that uses model inversion to generate synthetic images instead of replaying real data, preventing catastrophic forgetting while maintaining data privacy.

Details

Motivation: Current continual self-supervised learning methods in medical imaging rely on replaying previous real data, which compromises data privacy and limits applicability in real-world scenarios where data transfer across sites is restricted. There's a need for methods that prevent catastrophic forgetting without accessing previous real data.

Method: InvCoSS inverts pre-trained self-supervised models to generate synthetic images approximating original training distributions. It uses a novel InvUNet with multi-scale fusion architecture to restore high- and low-frequency components, and employs repulsive representation-learning to enhance diversity and prevent mode collapse without class guidance.

Result: Extensive experiments across nine downstream tasks show InvCoSS achieves performance comparable to or superior to prior data-replay methods while significantly reducing storage requirements and eliminating data privacy constraints.

Conclusion: InvCoSS provides an effective privacy-preserving solution for continual self-supervised learning in medical imaging, enabling sequential training without data replay while maintaining performance and addressing real-world data transfer restrictions.

Abstract: Continual self-supervised learning (CSSL) in medical imaging trains a foundation model sequentially, alleviating the need for collecting multi-modal images for joint training and offering promising improvements in downstream performance while preserving data privacy. However, most existing methods still rely on replaying data from previous stages to prevent catastrophic forgetting, which compromises privacy and limits their applicability in real-world scenarios where data transfer across sites is often restricted. In this work, we propose InvCoSS, an inversion-driven continual self-supervised learning framework for medical multi-modal image pre-training. Specifically, after training on a previous task, InvCoSS inverts the pre-trained self-supervised model to generate synthetic images that approximate the original training distribution. These synthetic images are then combined with data from the new task for joint optimization, which effectively mitigates catastrophic forgetting while strictly adhering to the constraint of no access to previous real data. Furthermore, to improve the fidelity of synthetic images, we introduce a novel InvUNet with a multi-scale fusion architecture to restore both high- and low-frequency components of the inverted images. To enhance diversity and prevent mode collapse, we design a repulsive representation-learning mechanism that encourages a diverse feature space for synthetic images without class guidance. Extensive experiments across nine downstream tasks validate the effectiveness of InvCoSS, achieving performance comparable to or even superior to prior data-replay methods while significantly reducing storage requirements and eliminating data privacy constraints.

[246] HippMetric: A skeletal-representation-based framework for cross-sectional and longitudinal hippocampal substructural morphometry

Na Gao, Chenfei Ye, Yanwu Yang, Anqi Li, Zhengbo He, Li Liang, Zhiyuan Liu, Xingyu Hao, Ting Ma, Tengfei Guo

Main category: cs.CV

TL;DR: HippMetric is a skeletal representation framework for hippocampal substructure analysis that establishes stable point-wise correspondence across individuals and scans using anatomically-aligned coordinate systems.

Details

Motivation: High inter-subject variability and complex hippocampal folding patterns hinder consistent cross-subject and longitudinal analysis. Existing approaches lack stable intrinsic coordinate systems to accommodate anatomical variability, limiting reliable inter- and intra-individual correspondence.

Method: HippMetric uses Axis-Referenced Morphometric Model (ARMM) with deformable skeletal coordinate system aligned with hippocampal anatomy/function. It has two core modules: 1) skeletal-based coordinate system respecting longitudinal lamellar architecture, and 2) individualized s-reps generated through surface reconstruction, deformation, and geometrically constrained spoke refinement.

Result: Extensive experiments on two international cohorts demonstrate that HippMetric achieves higher accuracy, reliability, and correspondence stability compared to existing shape models.

Conclusion: HippMetric provides a biologically grounded framework for hippocampal substructural morphometry with stable point-wise correspondence across individuals and scans, enabling more reliable neurodegenerative biomarker detection.

Abstract: Accurate characterization of hippocampal substructure is crucial for detecting subtle structural changes and identifying early neurodegenerative biomarkers. However, high inter-subject variability and complex folding pattern of human hippocampus hinder consistent cross-subject and longitudinal analysis. Most existing approaches rely on subject-specific modelling and lack a stable intrinsic coordinate system to accommodate anatomical variability, which limits their ability to establish reliable inter- and intra-individual correspondence. To address this, we propose HippMetric, a skeletal representation (s-rep)-based framework for hippocampal substructural morphometry and point-wise correspondence across individuals and scans. HippMetric builds on the Axis-Referenced Morphometric Model (ARMM) and employs a deformable skeletal coordinate system aligned with hippocampal anatomy and function, providing a biologically grounded reference for correspondence. Our framework comprises two core modules: a skeletal-based coordinate system that respects the hippocampus’ conserved longitudinal lamellar architecture, in which functional units (lamellae) are stacked perpendicular to the long-axis, enabling anatomically consistent localization across subjects and time; and individualized s-reps generated through surface reconstruction, deformation, and geometrically constrained spoke refinement, enforcing boundary adherence, orthogonality and non-intersection to produce mathematically valid skeletal geometry. Extensive experiments on two international cohorts demonstrate that HippMetric achieves higher accuracy, reliability, and correspondence stability compared to existing shape models.

[247] Anatomy-R1: Enhancing Anatomy Reasoning in Multimodal Large Language Models via Anatomical Similarity Curriculum and Group Diversity Augmentation

Ziyang Song, Zelin Zang, Zuyao Chen, Xusheng Liang, Dong Yi, Jinlin Wu, Hongbin Liu, Jiebo Luo

Main category: cs.CV

TL;DR: The paper proposes two novel methods to enhance medical reasoning in Multimodal Large Language Models for anatomy recognition: Anatomical Similarity Curriculum Learning and Group Diversity Question Augmentation, achieving significant improvements on medical VQA benchmarks.

Details

Motivation: MLLMs show impressive progress in natural image reasoning but remain underexplored in medical imaging, especially for clinical anatomical surgical images. Anatomy understanding requires precise and clinically coherent answers, which are difficult due to medical data complexity and scarce expert annotations. Conventional SFT strategies are limited, and while GRPO can enhance reasoning without large data, it has weaknesses in anatomy recognition: ineffective knowledge sharing between anatomical structures and premature convergence to single reasoning paths.

Method: Two novel methods: 1) Anatomical Similarity Curriculum Learning - progressive learning strategy controlling question difficulty via similarity of answer choices, enabling incremental mastery of complex problems. 2) Group Diversity Question Augmentation - expands model’s search space for difficult queries by augmenting questions to mitigate uniform response tendencies.

Result: Comprehensive experiments on SGG-VQA and OmniMedVQA benchmarks show significant improvement across both benchmarks, demonstrating effectiveness in enhancing medical reasoning capabilities of MLLMs.

Conclusion: The proposed methods successfully address GRPO’s weaknesses in anatomy recognition by enabling effective knowledge sharing between anatomical structures and promoting diverse reasoning strategies, significantly improving MLLMs’ medical reasoning performance on clinical anatomical surgical images.

Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored, especially in clinical anatomical surgical images. Anatomy understanding tasks demand precise understanding and clinically coherent answers, which are difficult to achieve due to the complexity of medical data and the scarcity of high-quality expert annotations. These challenges limit the effectiveness of conventional Supervised Fine-Tuning (SFT) strategies. While recent work has demonstrated that Group Relative Policy Optimization (GRPO) can enhance reasoning in MLLMs without relying on large amounts of data, we find two weaknesses that hinder GRPO’s reasoning performance in anatomy recognition: 1) knowledge cannot be effectively shared between different anatomical structures, resulting in uneven information gain and preventing the model from converging, and 2) the model quickly converges to a single reasoning path, suppressing the exploration of diverse strategies. To overcome these challenges, we propose two novel methods. First, we implement a progressive learning strategy called Anatomical Similarity Curriculum Learning by controlling question difficulty via the similarity of answer choices, enabling the model to master complex problems incrementally. Second, we utilize question augmentation referred to as Group Diversity Question Augmentation to expand the model’s search space for difficult queries, mitigating the tendency to produce uniform responses. Comprehensive experiments on the SGG-VQA and OmniMedVQA benchmarks show our method achieves a significant improvement across the two benchmarks, demonstrating its effectiveness in enhancing the medical reasoning capabilities of MLLMs. The code can be found in https://github.com/tomato996/Anatomy-R1

Meng Chu, Senqiao Yang, Haoxuan Che, Suiyun Zhang, Xichen Zhang, Shaozuo Yu, Haokun Gui, Zhefan Rao, Dandan Tu, Rui Liu, Jiaya Jia

Main category: cs.CV

TL;DR: VisionDirector improves image generation for complex multi-goal prompts by using a training-free vision-language supervisor with structured goal extraction, dynamic editing decisions, and semantic verification.

Details

Motivation: Current generative models struggle with long, multi-goal prompts common in professional design workflows, failing to satisfy complex requirements spanning layout, object placement, typography, and logo fidelity.

Method: VisionDirector is a training-free vision-language supervisor that: 1) extracts structured goals from long instructions, 2) dynamically chooses between one-shot generation or staged edits, 3) uses micro-grid sampling with semantic verification and rollback after each edit, and 4) logs goal-level rewards. Enhanced with Group Relative Policy Optimization for shorter edit trajectories.

Result: Achieves new SOTA on GenEval (+7% overall) and ImgEdit (+0.07 absolute), with consistent qualitative improvements on typography, multi-object scenes, and pose editing. Reduces edit steps from 4.2 to 3.1 on average.

Conclusion: VisionDirector effectively addresses the brittleness of current generative pipelines for complex multi-goal prompts through structured goal decomposition and semantic verification, enabling better performance on real-world design tasks.

Abstract: Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models’ performance in real-world settings, we introduce Long Goal Bench (LGBench), a 2,000-task suite (1,000 T2I and 1,000 I2I) whose average instruction contains 18 to 22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find that even state-of-the-art models satisfy fewer than 72 percent of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling with semantic verification and rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization, yielding shorter edit trajectories (3.1 versus 4.2 steps) and stronger alignment. VisionDirector achieves new state of the art on GenEval (plus 7 percent overall) and ImgEdit (plus 0.07 absolute) while producing consistent qualitative improvements on typography, multi-object scenes, and pose editing.

[249] 3SGen: Unified Subject, Style, and Structure-Driven Image Generation with Adaptive Task-specific Memory

Xinyang Song, Libin Wang, Weining Wang, Zhiwei Li, Jianxin Sun, Dandan Zheng, Jingdong Chen, Qi Li, Zhenan Sun

Main category: cs.CV

TL;DR: 3SGen is a unified framework that handles subject, style, and structure conditioning in image generation using an MLLM with semantic queries, VAE for details, and an Adaptive Task-specific Memory module to disentangle and manage condition-specific priors.

Details

Motivation: Current image generation approaches handle subject, style, and structure conditioning separately, causing feature entanglement and limited task transferability. There's a need for a unified framework that can handle all three conditioning modes effectively.

Method: 3SGen uses an MLLM with learnable semantic queries for text-image alignment, a VAE branch for visual detail preservation, and an Adaptive Task-specific Memory (ATM) module with lightweight gating and scalable memory items to dynamically disentangle, store, and retrieve condition-specific priors (identity for subjects, textures for styles, spatial layouts for structures).

Result: The paper demonstrates superior performance across diverse image-driven generation tasks on the proposed 3SGen-Bench benchmark and other public benchmarks, showing improved cross-task fidelity and controllability.

Conclusion: 3SGen provides a task-aware unified framework that effectively handles subject, style, and structure conditioning simultaneously, mitigating inter-task interference and naturally scaling to compositional inputs through its innovative ATM module and comprehensive architecture.

Abstract: Recent image generation approaches often address subject, style, and structure-driven conditioning in isolation, leading to feature entanglement and limited task transferability. In this paper, we introduce 3SGen, a task-aware unified framework that performs all three conditioning modes within a single model. 3SGen employs an MLLM equipped with learnable semantic queries to align text-image semantics, complemented by a VAE branch that preserves fine-grained visual details. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors, such as identity for subjects, textures for styles, and spatial layouts for structures, via a lightweight gating mechanism along with several scalable memory items. This design mitigates inter-task interference and naturally scales to compositional inputs. In addition, we propose 3SGen-Bench, a unified image-driven generation benchmark with standardized metrics for evaluating cross-task fidelity and controllability. Extensive experiments on our proposed 3SGen-Bench and other public benchmarks demonstrate our superior performance across diverse image-driven generation tasks.

[250] Hand-Aware Egocentric Motion Reconstruction with Sequence-Level Context

Kyungwon Cho, Hanbyul Joo

Main category: cs.CV

TL;DR: HaMoS: A hand-aware diffusion framework for estimating full-body motion from egocentric videos using head trajectories and intermittent hand cues, achieving state-of-the-art accuracy.

Details

Motivation: Egocentric vision systems are becoming widely available but estimating full-body motion from first-person videos is challenging because most body parts are invisible. Prior approaches rely too heavily on head trajectories (leading to ambiguity) or assume continuously tracked hands (unrealistic for lightweight devices).

Method: HaMoS is a hand-aware, sequence-level diffusion framework that conditions on both head trajectory and intermittently visible hand cues. It uses a novel augmentation method to model real-world conditions (field-of-view limitations and occlusions) and employs local attention to efficiently infer long sequences while considering sequence-level contexts like body shape and field-of-view.

Result: Experiments on public benchmarks show state-of-the-art accuracy and temporal smoothness, demonstrating practical progress toward reliable in-the-wild egocentric 3D motion understanding.

Conclusion: HaMoS represents a practical step toward reliable egocentric 3D motion understanding by effectively leveraging intermittent hand cues and sequence-level contexts, overcoming limitations of previous approaches that relied on unrealistic assumptions about hand visibility.

Abstract: Egocentric vision systems are becoming widely available, creating new opportunities for human-computer interaction. A core challenge is estimating the wearer’s full-body motion from first-person videos, which is crucial for understanding human behavior. However, this task is difficult since most body parts are invisible from the egocentric view. Prior approaches mainly rely on head trajectories, leading to ambiguity, or assume continuously tracked hands, which is unrealistic for lightweight egocentric devices. In this work, we present HaMoS, the first hand-aware, sequence-level diffusion framework that directly conditions on both head trajectory and intermittently visible hand cues caused by field-of-view limitations and occlusions, as in real-world egocentric devices. To overcome the lack of datasets pairing diverse camera views with human motion, we introduce a novel augmentation method that models such real-world conditions. We also demonstrate that sequence-level contexts such as body shape and field-of-view are crucial for accurate motion reconstruction, and thus employ local attention to infer long sequences efficiently. Experiments on public benchmarks show that our method achieves state-of-the-art accuracy and temporal smoothness, demonstrating a practical step toward reliable in-the-wild egocentric 3D motion understanding.

[251] CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion

Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez

Main category: cs.CV

TL;DR: CASA improves vision-language models by enabling local text-to-text interaction in cross-attention layers, reducing performance gap with full token insertion while maintaining efficiency for high-resolution images and long videos.

Details

Motivation: Current VLMs using full token insertion are computationally expensive for high-resolution images, long conversations, or streaming videos, while cross-attention models are more efficient but suffer from performance gaps, especially on fine-grained visual tasks.

Method: Proposes CASA (Cross-Attention via Self-Attention), a simple paradigm that enables local text-to-text interaction in dedicated cross-attention layers, combining the efficiency of cross-attention with improved visual understanding capabilities.

Result: Substantially reduces the performance gap with full token insertion on common image understanding benchmarks while maintaining the same scalability as cross-attention models for long-context multimodal tasks like streaming video captioning.

Conclusion: CASA provides an efficient and effective alternative to full token insertion for vision-language models, offering better performance than standard cross-attention approaches while maintaining computational efficiency for demanding multimodal applications.

Abstract: Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at https://kyutai.org/casa .

[252] RMLer: Synthesizing Novel Objects across Diverse Categories via Reinforcement Mixing Learning

Jun Li, Zikun Chen, Haibo Chen, Shuo Chen, Jian Yang

Main category: cs.CV

TL;DR: RMLer is a reinforcement learning framework for cross-category concept fusion in text-to-image generation that addresses issues of insufficient mixing and poor evaluation in existing methods.

Details

Motivation: Existing text-to-image methods struggle with synthesizing novel objects by integrating distinct textual concepts from different categories, often producing imbalanced, superficial, or merely juxtaposed outputs with insufficient concept mixing and lack of rigorous evaluation.

Method: RMLer formulates concept fusion as a reinforcement learning problem: mixed features as states, mixing strategies as actions, and visual outcomes as rewards. It uses an MLP-policy network to predict dynamic coefficients for blending cross-category text embeddings, with visual rewards based on semantic similarity and compositional balance, optimized via proximal policy optimization.

Result: Extensive experiments show RMLer’s superiority in synthesizing coherent, high-fidelity objects from diverse categories, outperforming existing methods in generating novel visual concepts.

Conclusion: RMLer provides a robust framework for generating novel visual concepts with promising applications in film, gaming, and design, addressing key limitations in cross-category concept fusion for text-to-image generation.

Abstract: Novel object synthesis by integrating distinct textual concepts from diverse categories remains a significant challenge in Text-to-Image (T2I) generation. Existing methods often suffer from insufficient concept mixing, lack of rigorous evaluation, and suboptimal outputs-manifesting as conceptual imbalance, superficial combinations, or mere juxtapositions. To address these limitations, we propose Reinforcement Mixing Learning (RMLer), a framework that formulates cross-category concept fusion as a reinforcement learning problem: mixed features serve as states, mixing strategies as actions, and visual outcomes as rewards. Specifically, we design an MLP-policy network to predict dynamic coefficients for blending cross-category text embeddings. We further introduce visual rewards based on (1) semantic similarity and (2) compositional balance between the fused object and its constituent concepts, optimizing the policy via proximal policy optimization. At inference, a selection strategy leverages these rewards to curate the highest-quality fused objects. Extensive experiments demonstrate RMLer’s superiority in synthesizing coherent, high-fidelity objects from diverse categories, outperforming existing methods. Our work provides a robust framework for generating novel visual concepts, with promising applications in film, gaming, and design.

[253] Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Remote Sensing

Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang

Main category: cs.CV

TL;DR: Think2Seg-RS is a decoupled framework that trains an LVLM prompter to control frozen SAM via geometric prompts, achieving SOTA on EarthReason and zero-shot generalization to referring segmentation tasks.

Details

Motivation: Existing LVLM reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks.

Method: Developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts using mask-only reinforcement learning.

Result: Achieved state-of-the-art performance on EarthReason dataset, with learned prompting policy generalizing zero-shot to multiple referring segmentation benchmarks, revealing divide between semantic-level and instance-level grounding.

Conclusion: Establishes semantic-level reasoning segmentation as a new paradigm for geospatial understanding, opening the way toward unified, interpretable LVLM-driven Earth observation.

Abstract: Large Vision-Language Models (LVLMs) hold great promise for advancing remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only reinforcement learning objective, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Remarkably, the learned prompting policy generalizes zero-shot to multiple referring segmentation benchmarks, exposing a distinct divide between semantic-level and instance-level grounding. We further found that compact segmenters outperform larger ones under semantic-level supervision, and that negative prompts are ineffective in heterogeneous aerial backgrounds. Together, these findings establish semantic-level reasoning segmentation as a new paradigm for geospatial understanding, opening the way toward unified, interpretable LVLM-driven Earth observation. Our code and model are available at https://github.com/Ricardo-XZ/Think2Seg-RS.

[254] BabyFlow: 3D modeling of realistic and expressive infant faces

Antonia Alomar, Mireia Masias, Marius George Linguraru, Federico M. Sukno, Gemma Piella

Main category: cs.CV

TL;DR: BabyFlow is a generative AI model that disentangles facial identity and expression in infant faces using normalizing flows, enabling independent control and improved 3D reconstruction accuracy.

Details

Motivation: Early detection of developmental disorders can be aided by analyzing infant craniofacial morphology, but modeling infant faces is challenging due to limited data and frequent spontaneous expressions.

Method: Uses normalizing flows to learn flexible, probabilistic representations that capture complex variability of expressive infant faces. Performs cross-age expression transfer from adult 3D scans to enrich infant datasets with realistic expressive variants.

Result: Improves 3D reconstruction accuracy, particularly in highly expressive regions (mouth, eyes, nose). Supports synthesis and modification of infant expressions while preserving identity. Integrates with diffusion models to generate high-fidelity 2D infant images with consistent 3D geometry.

Conclusion: BabyFlow provides powerful tools for data augmentation and early facial analysis by enabling disentangled control over infant facial identity and expression through generative AI techniques.

Abstract: Early detection of developmental disorders can be aided by analyzing infant craniofacial morphology, but modeling infant faces is challenging due to limited data and frequent spontaneous expressions. We introduce BabyFlow, a generative AI model that disentangles facial identity and expression, enabling independent control over both. Using normalizing flows, BabyFlow learns flexible, probabilistic representations that capture the complex, non-linear variability of expressive infant faces without restrictive linear assumptions. To address scarce and uncontrolled expressive data, we perform cross-age expression transfer, adapting expressions from adult 3D scans to enrich infant datasets with realistic and systematic expressive variants. As a result, BabyFlow improves 3D reconstruction accuracy, particularly in highly expressive regions such as the mouth, eyes, and nose, and supports synthesis and modification of infant expressions while preserving identity. Additionally, by integrating with diffusion models, BabyFlow generates high-fidelity 2D infant images with consistent 3D geometry, providing powerful tools for data augmentation and early facial analysis.

[255] Extended OpenTT Games Dataset: A table tennis dataset for fine-grained shot type and point outcome

Moamal Fadhil Abdul, Jonas Bruun Hubrechts, Thomas Martini Jørgensen, Emil Hovad

Main category: cs.CV

TL;DR: Extended OpenTTGames dataset with detailed stroke type annotations, player posture labels, and rally outcome tags for fine-grained table tennis video analysis.

Details

Motivation: Need for annotated table tennis video data to enable stroke detection/classification for training workflows, broadcast enhancements, and performance analytics. Existing datasets lack detailed stroke annotations or have restrictive licenses.

Method: Extended the public OpenTTGames dataset by adding frame-accurate shot type annotations (forehand/backhand subtypes), player posture labels (body lean, leg stance), and rally outcome tags. Used compact coding scheme and code-assisted labeling procedure for reproducible annotations.

Result: Created an enhanced dataset with detailed stroke taxonomy that enables models to move beyond event spotting toward tactical understanding (e.g., whether strokes win points or create advantages). Released under CC BY-NC-SA 4.0 license.

Conclusion: Fills practical gap in community by providing publicly available, detailed annotations for fine-grained stroke understanding in racket sports, supporting reproducible research and benchmarking.

Abstract: Automatically detecting and classifying strokes in table tennis video can streamline training workflows, enrich broadcast overlays, and enable fine-grained performance analytics. For this to be possible, annotated video data of table tennis is needed. We extend the public OpenTTGames dataset with highly detailed, frame-accurate shot type annotations (forehand, backhand with subtypes), player posture labels (body lean and leg stance), and rally outcome tags at point end. OpenTTGames is a set of recordings from the side of the table with official labels for bounces, when the ball is above the net, or hitting the net. The dataset already contains ball coordinates near events, which are either “bounce”, “net”, or “empty_event” in the original OpenTTGames dataset, and semantic masks (humans, table, scoreboard). Our extension adds the types of stroke to the events and a per-player taxonomy so models can move beyond event spotting toward tactical understanding (e.g., whether a stroke is likely to win the point or set up an advantage). We provide a compact coding scheme and code-assisted labeling procedure to support reproducible annotations and baselines for fine-grained stroke understanding in racket sports. This fills a practical gap in the community, where many prior video resources are either not publicly released or carry restrictive/unclear licenses that hinder reuse and benchmarking. Our annotations are released under the same CC BY-NC-SA 4.0 license as OpenTTGames, allowing free non-commercial use, modification, and redistribution, with appropriate attribution.

[256] DeltaMIL: Gated Memory Integration for Efficient and Discriminative Whole Slide Image Analysis

Yueting Zhu, Yuehao Song, Shuai Zhang, Wenyu Liu, Xinggang Wang

Main category: cs.CV

TL;DR: DeltaMIL is a novel multiple instance learning framework for Whole Slide Images that uses gated delta rule to filter redundant information and integrate discriminative signals, achieving state-of-the-art performance on survival prediction and slide-level classification tasks.

Details

Motivation: Whole Slide Images are large and heterogeneous, generating redundant and dispersed information that makes it difficult to identify discriminative signals. Existing MIL methods either fail to effectively discard uninformative cues or have limited ability to consolidate relevant features from multiple patches.

Method: DeltaMIL uses a gated delta rule with forgetting and memory mechanisms to dynamically filter and integrate information. The delta mechanism updates memory by removing old values and inserting new ones based on correlation with current patches. A gating mechanism enables rapid forgetting of irrelevant signals, and a complementary local pattern mixing mechanism retains fine-grained pathological locality.

Result: DeltaMIL achieves state-of-the-art performance: for survival prediction, it improves performance by 3.69% with ResNet-50 features and 2.36% with UNI features; for slide-level classification, it increases accuracy by 3.09% with ResNet-50 features and 3.75% with UNI features.

Conclusion: DeltaMIL effectively enhances extraction of meaningful cues while suppressing redundant or noisy information, improving model robustness and discriminative power across diverse WSI tasks with strong and consistent performance.

Abstract: Whole Slide Images (WSIs) are typically analyzed using multiple instance learning (MIL) methods. However, the scale and heterogeneity of WSIs generate highly redundant and dispersed information, making it difficult to identify and integrate discriminative signals. Existing MIL methods either fail to discard uninformative cues effectively or have limited ability to consolidate relevant features from multiple patches, which restricts their performance on large and heterogeneous WSIs. To address this issue, we propose DeltaMIL, a novel MIL framework that explicitly selects semantically relevant regions and integrates the discriminative information from WSIs. Our method leverages the gated delta rule to efficiently filter and integrate information through a block combining forgetting and memory mechanisms. The delta mechanism dynamically updates the memory by removing old values and inserting new ones according to their correlation with the current patch. The gating mechanism further enables rapid forgetting of irrelevant signals. Additionally, DeltaMIL integrates a complementary local pattern mixing mechanism to retain fine-grained pathological locality. Our design enhances the extraction of meaningful cues and suppresses redundant or noisy information, which improves the model’s robustness and discriminative power. Experiments demonstrate that DeltaMIL achieves state-of-the-art performance. Specifically, for survival prediction, DeltaMIL improves performance by 3.69% using ResNet-50 features and 2.36% using UNI features. For slide-level classification, it increases accuracy by 3.09% with ResNet-50 features and 3.75% with UNI features. These results demonstrate the strong and consistent performance of DeltaMIL across diverse WSI tasks.

[257] GANeXt: A Fully ConvNeXt-Enhanced Generative Adversarial Network for MRI- and CBCT-to-CT Synthesis

Siyuan Mei, Yan Xia, Fuxin Fan

Main category: cs.CV

TL;DR: GANeXt is a 3D patch-based ConvNeXt-powered GAN for unified CT synthesis from MRI and CBCT, achieving accurate anatomical representation for adaptive radiotherapy planning without modality-specific fine-tuning.

Details

Motivation: CT synthesis from MRI and CBCT is crucial for adaptive radiotherapy planning, providing accurate anatomical representation when direct CT acquisition is challenging or unavailable. Current methods often require modality-specific approaches or lack unified frameworks.

Method: Proposes GANeXt: a 3D patch-based GAN with U-shaped generator built from stacked 3D ConvNeXt blocks and conditional PatchGAN discriminator. Uses combined losses (MAE, perceptual, segmentation-based masked MAE, adversarial) and multi-head segmentation discriminator with Dice/cross-entropy losses. Employs sliding-window inference with overlap averaging.

Result: The model achieves unified CT synthesis across different modalities (MRI and CBCT) and anatomical regions without fine-tuning. Training completed in 3000 epochs for MRI-to-CT and 1000 epochs for CBCT-to-CT using full training dataset.

Conclusion: GANeXt provides an effective unified framework for CT synthesis from multiple imaging modalities, demonstrating potential for clinical applications in adaptive radiotherapy planning through its modality-agnostic approach and robust training methodology.

Abstract: The synthesis of computed tomography (CT) from magnetic resonance imaging (MRI) and cone-beam CT (CBCT) plays a critical role in clinical treatment planning by enabling accurate anatomical representation in adaptive radiotherapy. In this work, we propose GANeXt, a 3D patch-based, fully ConvNeXt-powered generative adversarial network for unified CT synthesis across different modalities and anatomical regions. Specifically, GANeXt employs an efficient U-shaped generator constructed from stacked 3D ConvNeXt blocks with compact convolution kernels, while the discriminator adopts a conditional PatchGAN. To improve synthesis quality, we incorporate a combination of loss functions, including mean absolute error (MAE), perceptual loss, segmentation-based masked MAE, and adversarial loss and a combination of Dice loss and cross-entropy for multi-head segmentation discriminator. For both tasks, training is performed with a batch size of 8 using two separate AdamW optimizers for the generator and discriminator, each equipped with a warmup and cosine decay scheduler, with learning rates of $5\times10^{-4}$ and $1\times10^{-3}$, respectively. Data preprocessing includes deformable registration, foreground cropping, percentile normalization for the input modality, and linear normalization of the CT to the range $[-1024, 1000]$. Data augmentation involves random zooming within $(0.8, 1.3)$ (for MRI-to-CT only), fixed-size cropping to $32\times160\times192$ for MRI-to-CT and $32\times128\times128$ for CBCT-to-CT, and random flipping. During inference, we apply a sliding-window approach with $0.8$ overlap and average folding to reconstruct the full-size sCT, followed by inversion of the CT normalization. After joint training on all regions without any fine-tuning, the final models are selected at the end of 3000 epochs for MRI-to-CT and 1000 epochs for CBCT-to-CT using the full training dataset.

[258] ReasonCD: A Multimodal Reasoning Large Model for Implicit Change-of-Interest Semantic Mining

Zhenyang Huang, Xiao Yu, Yi Zhang, Decheng Wang, Hang Ruan

Main category: cs.CV

TL;DR: ReasonCD is a multimodal reasoning change detection model that uses large language models to mine implicit user intent from text descriptions, enabling more flexible remote sensing change detection without requiring explicit textual descriptions of change regions.

Details

Motivation: Current remote sensing change detection methods that use semantic guidance rely too heavily on explicit textual descriptions of change regions of interest (CRoI). When presented with implicit or ambiguous textual descriptions, these methods suffer from near-complete performance failure. There's a need for models that can understand and reason about users' implicit task intentions.

Method: Proposes ReasonCD, a multimodal reasoning change detection model that leverages pre-trained large language models’ reasoning capabilities to mine users’ implicit task intents from text descriptions. The model then generates different change detection results based on these inferred intents, enabling it to handle both explicit and implicit CRoI descriptions.

Result: Achieves excellent performance with 92.1% F1 score on the BCDD dataset. Additionally, when tested on a specially annotated reasoning subset of the SECOND dataset, the model demonstrates strong reasoning-based change detection capabilities and can explain its reasoning process to aid human decision-making.

Conclusion: ReasonCD successfully addresses the limitation of existing methods by enabling implicit intent mining through multimodal reasoning, making remote sensing change detection more flexible and robust to various types of textual descriptions while maintaining high performance.

Abstract: Remote sensing image change detection is one of the fundamental tasks in remote sensing intelligent interpretation. Its core objective is to identify changes within change regions of interest (CRoI). Current multimodal large models encode rich human semantic knowledge, which is utilized for guidance in tasks such as remote sensing change detection. However, existing methods that use semantic guidance for detecting users’ CRoI overly rely on explicit textual descriptions of CRoI, leading to the problem of near-complete performance failure when presented with implicit CRoI textual descriptions. This paper proposes a multimodal reasoning change detection model named ReasonCD, capable of mining users’ implicit task intent. The model leverages the powerful reasoning capabilities of pre-trained large language models to mine users’ implicit task intents and subsequently obtains different change detection results based on these intents. Experiments on public datasets demonstrate that the model achieves excellent change detection performance, with an F1 score of 92.1% on the BCDD dataset. Furthermore, to validate its superior reasoning functionality, this paper annotates a subset of reasoning data based on the SECOND dataset. Experimental results show that the model not only excels at basic reasoning-based change detection tasks but can also explain the reasoning process to aid human decision-making.

[259] Efficient Spike-driven Transformer for High-performance Drone-View Geo-Localization

Zhongwei Chen, Hai-Jun Rong, Zhao-Xu Yang, Guoqi Li

Main category: cs.CV

TL;DR: SpikeViMFormer is the first spiking neural network framework for drone-view geo-localization, achieving competitive performance with lower power consumption than traditional ANN methods.

Details

Motivation: Traditional ANN-based drone-view geo-localization methods have high power consumption due to dense computation. SNNs offer low power consumption through spike-driven computation, but their potential for DVGL hasn't been explored, and their inherent sparsity causes information loss and difficulty learning long-range dependencies in heterogeneous visual data alignment.

Method: Proposes SpikeViMFormer with: 1) lightweight spike-driven transformer backbone for coarse-grained feature extraction; 2) spike-driven selective attention (SSA) block for selective feature enhancement and discriminative region highlighting; 3) spike-driven hybrid state space (SHS) block for learning long-range dependencies; 4) hierarchical re-ranking alignment learning (HRAL) strategy for feature refinement and cross-batch consistency to optimize the backbone.

Result: Experimental results show SpikeViMFormer outperforms state-of-the-art SNNs and achieves competitive performance compared with advanced ANNs, while using only the backbone during inference to reduce computational cost.

Conclusion: SpikeViMFormer successfully addresses the limitations of SNNs for drone-view geo-localization, demonstrating that SNNs can achieve competitive performance with lower power consumption, making them suitable for energy-constrained drone applications.

Abstract: Traditional drone-view geo-localization (DVGL) methods based on artificial neural networks (ANNs) have achieved remarkable performance. However, ANNs rely on dense computation, which results in high power consumption. In contrast, spiking neural networks (SNNs), which benefit from spike-driven computation, inherently provide low power consumption. Regrettably, the potential of SNNs for DVGL has yet to be thoroughly investigated. Meanwhile, the inherent sparsity of spike-driven computation for representation learning scenarios also results in loss of critical information and difficulties in learning long-range dependencies when aligning heterogeneous visual data sources. To address these, we propose SpikeViMFormer, the first SNN framework designed for DVGL. In this framework, a lightweight spike-driven transformer backbone is adopted to extract coarse-grained features. To mitigate the loss of critical information, the spike-driven selective attention (SSA) block is designed, which uses a spike-driven gating mechanism to achieve selective feature enhancement and highlight discriminative regions. Furthermore, a spike-driven hybrid state space (SHS) block is introduced to learn long-range dependencies using a hybrid state space. Moreover, only the backbone is utilized during the inference stage to reduce computational cost. To ensure backbone effectiveness, a novel hierarchical re-ranking alignment learning (HRAL) strategy is proposed. It refines features via neighborhood re-ranking and maintains cross-batch consistency to directly optimize the backbone. Experimental results demonstrate that SpikeViMFormer outperforms state-of-the-art SNNs. Compared with advanced ANNs, it also achieves competitive performance.Our code is available at https://github.com/ISChenawei/SpikeViMFormer

[260] Non-Contrast CT Esophageal Varices Grading through Clinical Prior-Enhanced Multi-Organ Analysis

Xiaoming Zhang, Chunli Li, Jiacheng Hao, Yuan Gao, Danyang Tu, Jianyi Qiao, Xiaoli Yin, Le Lu, Ling Zhang, Ke Yan, Yang Hou, Yu Shi

Main category: cs.CV

TL;DR: MOON++ is a novel multimodal framework that uses non-contrast CT scans to assess esophageal varices severity by analyzing multiple organs (esophagus, liver, spleen) simultaneously, achieving superior performance over single-organ methods.

Details

Motivation: Esophageal varices affect 60% of cirrhosis patients with 30% bleeding risk, traditionally requiring invasive endoscopy. Non-contrast CT offers a potential non-invasive alternative but is underutilized clinically. Clinical evidence shows organ volumetric relationships correlate with liver disease severity.

Method: MOON++ (Multi-Organ-COhesion Network++) synthesizes imaging characteristics of esophagus, liver, and spleen through multimodal learning. It incorporates clinical knowledge priors and analyzes NCCT scans comprehensively across multiple organs rather than focusing on single organs.

Result: Evaluated on 1,631 patients, validated on 239 cases, and tested on 289 cases. Achieved AUC of 0.894 vs 0.803 for severe grade classification (G3 vs <G3), and 0.921 vs 0.793 for moderate-to-severe differentiation (>=G2 vs <G2). Reader study with experienced radiologists further validated performance.

Conclusion: MOON++ represents the first comprehensive multi-organ NCCT analysis framework incorporating clinical knowledge priors for EV assessment, offering a promising non-invasive diagnostic alternative to invasive endoscopy.

Abstract: Esophageal varices (EV) represent a critical complication of portal hypertension, affecting approximately 60% of cirrhosis patients with a significant bleeding risk of ~30%. While traditionally diagnosed through invasive endoscopy, non-contrast computed tomography (NCCT) presents a potential non-invasive alternative that has yet to be fully utilized in clinical practice. We present Multi-Organ-COhesion Network++ (MOON++), a novel multimodal framework that enhances EV assessment through comprehensive analysis of NCCT scans. Inspired by clinical evidence correlating organ volumetric relationships with liver disease severity, MOON++ synthesizes imaging characteristics of the esophagus, liver, and spleen through multimodal learning. We evaluated our approach using 1,631 patients, those with endoscopically confirmed EV were classified into four severity grades. Validation in 239 patient cases and independent testing in 289 cases demonstrate superior performance compared to conventional single organ methods, achieving an AUC of 0.894 versus 0.803 for the severe grade EV classification (G3 versus <G3) and 0.921 versus 0.793 for the differentiation of moderate to severe grades (>=G2 versus <G2). We conducted a reader study involving experienced radiologists to further validate the performance of MOON++. To our knowledge, MOON++ represents the first comprehensive multi-organ NCCT analysis framework incorporating clinical knowledge priors for EV assessment, potentially offering a promising non-invasive diagnostic alternative.

[261] MapTrace: Scalable Data Generation for Route Tracing on Maps

Artemis Panagopoulou, Aveek Purohit, Achin Kulshrestha, Soroosh Yazdani, Mohit Goyal

Main category: cs.CV

TL;DR: The paper introduces a scalable synthetic data generation pipeline for creating precise path annotations on maps to improve MLLMs’ fine-grained spatial reasoning capabilities, achieving up to 6.4% improvement on MapBench.

Details

Motivation: Current Multimodal Large Language Models (MLLMs) have limited proficiency in fine-grained spatial understanding like route tracing on maps, often failing to respect fundamental path constraints. This limitation stems from the prohibitive cost and difficulty of collecting large-scale, pixel-accurate path annotations needed for training.

Method: The authors introduce a scalable synthetic data generation pipeline that leverages synthetic map images and pixel-level parsing to automatically produce precise path annotations. Using this pipeline, they construct a fine-tuning dataset of 23k path samples across 4k maps.

Result: Fine-tuning both open-source and proprietary MLLMs with this dataset substantially improves robustness on MapBench, raising success rates by up to 6.4 points while also reducing path-tracing error (NDTW). The synthetic supervision enables models to acquire more human-like spatial capabilities.

Conclusion: Fine-grained spatial reasoning, which is absent in pretrained MLLMs, can be explicitly taught with synthetic supervision. The scalable synthetic data generation approach effectively addresses the challenge of collecting large-scale, pixel-accurate annotations for spatial reasoning tasks.

Abstract: While Multimodal Large Language Models have achieved human-like performance on many visual and textual reasoning tasks, their proficiency in fine-grained spatial understanding, such as route tracing on maps remains limited. Unlike humans, who can quickly learn to parse and navigate maps, current models often fail to respect fundamental path constraints, in part due to the prohibitive cost and difficulty of collecting large-scale, pixel-accurate path annotations. To address this, we introduce a scalable synthetic data generation pipeline that leverages synthetic map images and pixel-level parsing to automatically produce precise annotations for this challenging task. Using this pipeline, we construct a fine-tuning dataset of 23k path samples across 4k maps, enabling models to acquire more human-like spatial capabilities. Using this dataset, we fine-tune both open-source and proprietary MLLMs. Results on MapBench show that finetuning substantially improves robustness, raising success rates by up to 6.4 points, while also reducing path-tracing error (NDTW). These gains highlight that fine-grained spatial reasoning, absent in pretrained models, can be explicitly taught with synthetic supervision.

Yi Xin, Siqi Luo, Qi Qin, Haoxing Chen, Kaiwen Zhu, Zhiwei Zhang, Yangfan He, Rongchao Zhang, Jinbin Bai, Shuo Cao, Bin Fu, Junjun He, Yihao Liu, Yuewen Cao, Xiaohong Liu

Main category: cs.CV

TL;DR: dMLLM-TTS: A novel framework for efficient test-time scaling of diffusion multi-modal LLMs that improves generation quality with 6x greater efficiency than linear search.

Details

Motivation: Current Test-Time Scaling (TTS) methods for diffusion Multi-modal Large Language Models (dMLLMs) are computationally expensive (O(NT) complexity) and require external verifiers, limiting their practical deployment and full generative potential.

Method: Proposes dMLLM-TTS with two innovations: 1) hierarchical search algorithm with O(N+T) complexity for adaptive trajectory expansion/pruning, and 2) self-verified feedback mechanism using dMLLMs’ intrinsic image understanding capabilities to assess text-image alignment without external verifier.

Result: Extensive experiments on GenEval benchmark across three dMLLMs (Lumina-DiMOO, MMaDA, Muddit) show substantial improvement in generation quality while achieving up to 6x greater efficiency than linear search.

Conclusion: dMLLM-TTS provides an effective and efficient framework for unlocking the full generative potential of diffusion multi-modal LLMs through complementary scaling axes and innovative search/verification mechanisms.

Abstract: Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs’ intrinsic image understanding capabilities to assess text-image alignment, eliminating the need for external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (e.g., Lumina-DiMOO, MMaDA, Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search. Project page: https://github.com/Alpha-VLLM/Lumina-DiMOO.

[263] D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning

Evelyn Zhang, Fufu Yu, Aoqi Wu, Zichen Wen, Ke Yan, Shouhong Ding, Biqing Qi, Linfeng Zhang

Main category: cs.CV

TL;DR: D2Pruner is a token pruning framework for MLLMs that combines debiased importance scoring with structural pruning to maintain performance on both general understanding and fine-grained localization tasks while significantly reducing computational costs.

Details

Motivation: Current token pruning methods for Multimodal Large Language Models fail catastrophically on fine-grained localization tasks due to positional bias in importance-based methods and structural blindness in diversity-based methods, limiting their practical utility.

Method: D2Pruner combines debiased importance scoring with structural pruning: 1) selects critical tokens as pivots using debiased attention scores, 2) performs Maximal Independent Set selection on remaining tokens modeled on a hybrid graph considering spatial proximity and semantic similarity, iteratively preserving important tokens while removing neighbors.

Result: D2Pruner achieves exceptional efficiency: reduces FLOPs by 74.2% while retaining 99.2% performance on LLaVA-1.5-7B for general understanding, and maintains 85.7% performance at 90% token reduction on InternVL-2.5-8B for localization tasks, with up to 63.53% improvement over existing methods.

Conclusion: D2Pruner successfully addresses limitations of existing token pruning methods by combining debiased importance with structural awareness, enabling efficient processing of long visual token sequences while maintaining performance on both general understanding and fine-grained localization tasks.

Abstract: Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). While token pruning offers a path to acceleration, we find that current methods, while adequate for general understanding, catastrophically fail on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user’s prompt and spatial redundancy. To address this, we introduce D2Pruner, a framework that rectifies these issues by uniquely combining debiased importance with a structural pruning mechanism. Our method first secures a core set of the most critical tokens as pivots based on a debiased attention score. It then performs a Maximal Independent Set (MIS) selection on the remaining tokens, which are modeled on a hybrid graph where edges signify spatial proximity and semantic similarity. This process iteratively preserves the most important and available token while removing its neighbors, ensuring that the supplementary tokens are chosen to maximize importance and diversity. Extensive experiments demonstrate that D2Pruner has exceptional efficiency and fidelity. Applied to LLaVA-1.5-7B for general understanding tasks, it reduces FLOPs by 74.2% while retaining 99.2% of its original performance. Furthermore, in challenging localization benchmarks with InternVL-2.5-8B, it maintains 85.7% performance at a 90% token reduction rate, marking a significant advancement with up to 63. 53% improvement over existing methods.

Argha Kamal Samanta, Harshika Goyal, Vasudha Joshi, Tushar Mungle, Pabitra Mitra

Main category: cs.CV

TL;DR: Novel multimodal framework integrates retinal images, clinical text, and structured patient data for superior diabetic retinopathy diagnosis and cross-modal retrieval, outperforming CLIP by large margins.

Details

Motivation: General-domain vision-language models like CLIP struggle in medical applications, particularly for cross-modal retrieval of ophthalmological images, creating a critical gap in medical image-text alignment for diabetic retinopathy diagnosis.

Method: Knowledge-enhanced joint embedding framework with separate encoders for each modality: ViT-B/16 for retinal images, Bio-ClinicalBERT for clinical narratives, and MLP for structured features. Fused through joint transformer with modality-specific embeddings, trained with contrastive losses, reconstruction losses, and classification losses for DR severity grading.

Result: Achieves near-perfect text-to-image retrieval (Recall@1: 99.94% vs CLIP’s 1.29%) on BRSET dataset, maintains SOTA classification accuracy (97.05% for SDRG, 97.97% for ICDR), and shows strong generalizability on unseen DeepEyeNet dataset (93.95% Recall@1 vs CLIP’s 0.22%).

Conclusion: The multimodal training approach effectively captures cross-modal relationships in medical domain, establishing superior retrieval capabilities and robust diagnostic performance for diabetic retinopathy, significantly outperforming general-domain models like CLIP.

Abstract: Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP’s 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.

[265] Sign Language Recognition using Parallel Bidirectional Reservoir Computing

Nitin Kumar Singh, Arie Rachmad Syulistyo, Yuichiro Tanaka, Hakaru Tamukoh

Main category: cs.CV

TL;DR: Lightweight sign language recognition system using parallel bidirectional reservoir computing with MediaPipe for real-time edge deployment, achieving competitive accuracy with dramatically reduced training time.

Details

Motivation: Deep learning SLR models require extensive computational resources unsuitable for edge devices, creating a need for lightweight, efficient alternatives for real-time deployment.

Method: Combines MediaPipe for real-time hand tracking/joint coordinate extraction with parallel bidirectional reservoir computing (PBRC) architecture using two ESN-based bidirectional reservoir computing modules in parallel to capture temporal dependencies.

Result: Achieved 60.85% top-1, 85.86% top-5, and 91.74% top-10 accuracy on WLASL dataset with training time reduced to 18.67 seconds (vs. 55+ minutes for Bi-GRU).

Conclusion: Proposed PBRC-based SLR system offers lightweight, cost-effective solution for real-time sign language recognition on edge devices with significantly reduced computational requirements.

Abstract: Sign language recognition (SLR) facilitates communication between deaf and hearing communities. Deep learning based SLR models are commonly used but require extensive computational resources, making them unsuitable for deployment on edge devices. To address these limitations, we propose a lightweight SLR system that combines parallel bidirectional reservoir computing (PBRC) with MediaPipe. MediaPipe enables real-time hand tracking and precise extraction of hand joint coordinates, which serve as input features for the PBRC architecture. The proposed PBRC architecture consists of two echo state network (ESN) based bidirectional reservoir computing (BRC) modules arranged in parallel to capture temporal dependencies, thereby creating a rich feature representation for classification. We trained our PBRC-based SLR system on the Word-Level American Sign Language (WLASL) video dataset, achieving top-1, top-5, and top-10 accuracies of 60.85%, 85.86%, and 91.74%, respectively. Training time was significantly reduced to 18.67 seconds due to the intrinsic properties of reservoir computing, compared to over 55 minutes for deep learning based methods such as Bi-GRU. This approach offers a lightweight, cost-effective solution for real-time SLR on edge devices.

[266] WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion

Hanyang Kong, Xingyi Yang, Xiaoxu Zheng, Xinchao Wang

Main category: cs.CV

TL;DR: WorldWarp bridges 3D geometry and 2D generation for consistent video synthesis by combining a 3D Gaussian Splatting cache with a spatio-temporal diffusion model using varying noise schedules.

Details

Motivation: Current video generation methods struggle with geometric consistency, especially for occluded areas and complex camera trajectories, due to a disconnect between 3D geometry requirements and 2D generative model operations.

Method: WorldWarp couples a 3D structural anchor (online 3DGS cache) with a 2D generative refiner (ST-Diff model). It uses explicit warping of historical content for structure, then applies a spatio-temporal diffusion model with varying noise schedules: full noise for blank regions to generate content, partial noise for warped regions to refine.

Result: WorldWarp achieves state-of-the-art fidelity by ensuring 3D logic guides structure while diffusion logic perfects texture, maintaining geometric consistency across video chunks through dynamic cache updates.

Conclusion: The framework successfully bridges the gap between 3D geometric consistency and 2D generative modeling, enabling high-quality, geometrically consistent video generation with complex camera trajectories.

Abstract: Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model designed for a “fill-and-revise” objective. Our key innovation is a spatio-temporal varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture. Project page: \href{https://hyokong.github.io/worldwarp-page/}{https://hyokong.github.io/worldwarp-page/}.

[267] Emotion-Director: Bridging Affective Shortcut in Emotion-Oriented Image Generation

Guoli Jia, Junyao Hu, Xinwei Long, Kai Tian, Kaiyan Zhang, KaiKai Zhao, Ning Ding, Bowen Zhou

Main category: cs.CV

TL;DR: Emotion-Director is a cross-modal framework for emotion-oriented image generation that addresses the “affective shortcut” problem where emotions are approximated to semantics. It uses MC-Diffusion with visual+textual prompts and MC-Agent for emotion-aware prompt rewriting.

Details

Motivation: Current emotion-oriented image generation methods suffer from an "affective shortcut" where emotions are approximated to semantics, but emotion is not equivalent to semantics. There's a need for better emotion-oriented image generation that goes beyond semantic associations.

Method: Two-module framework: 1) MC-Diffusion - cross-modal collaborative diffusion model integrating visual prompts with textual prompts, improved with negative visual prompts in DPO optimization. 2) MC-Agent - cross-modal collaborative agent system that rewrites textual prompts using multi-agents simulating human subjectivity and chain-of-concept workflow.

Result: Extensive qualitative and quantitative experiments demonstrate superiority in emotion-oriented image generation, showing the framework can generate emotion-oriented images beyond semantics.

Conclusion: Emotion-Director successfully addresses the affective shortcut problem in emotion-oriented image generation by using cross-modal collaboration between visual and textual prompts, enabling generation of images that express emotions beyond semantic associations.

Abstract: Image generation based on diffusion models has demonstrated impressive capability, motivating exploration into diverse and specialized applications. Owing to the importance of emotion in advertising, emotion-oriented image generation has attracted increasing attention. However, current emotion-oriented methods suffer from an affective shortcut, where emotions are approximated to semantics. As evidenced by two decades of research, emotion is not equivalent to semantics. To this end, we propose Emotion-Director, a cross-modal collaboration framework consisting of two modules. First, we propose a cross-Modal Collaborative diffusion model, abbreviated as MC-Diffusion. MC-Diffusion integrates visual prompts with textual prompts for guidance, enabling the generation of emotion-oriented images beyond semantics. Further, we improve the DPO optimization by a negative visual prompt, enhancing the model’s sensitivity to different emotions under the same semantics. Second, we propose MC-Agent, a cross-Modal Collaborative Agent system that rewrites textual prompts to express the intended emotions. To avoid template-like rewrites, MC-Agent employs multi-agents to simulate human subjectivity toward emotions, and adopts a chain-of-concept workflow that improves the visual expressiveness of the rewritten prompts. Extensive qualitative and quantitative experiments demonstrate the superiority of Emotion-Director in emotion-oriented image generation.

[268] Dynamic Stream Network for Combinatorial Explosion Problem in Deformable Medical Image Registration

Shaochen Bi, Yuting He, Weiming Wang, Hao Chen

Main category: cs.CV

TL;DR: DySNet addresses combinatorial explosion in deformable medical image registration by dynamically adjusting receptive fields and weights to focus on relevant feature relationships and ignore interfering ones.

Details

Motivation: The combinatorial explosion problem in DMIR arises from processing two images simultaneously, leading to exponential growth in feature combinations and interference during feature modeling.

Method: Proposes Dynamic Stream Network (DySNet) with two innovations: 1) Adaptive Stream Basin (AdSB) module that dynamically adjusts receptive field shapes, and 2) Dynamic Stream Attention (DySA) mechanism that generates dynamic weights to find valuable feature relationships.

Result: Extensive experiments show DySNet consistently outperforms state-of-the-art DMIR methods, demonstrating outstanding generalization ability.

Conclusion: DySNet effectively addresses the combinatorial explosion problem in medical image registration through dynamic adjustment of receptive fields and weights, leading to superior performance compared to existing methods.

Abstract: Combinatorial explosion problem caused by dual inputs presents a critical challenge in Deformable Medical Image Registration (DMIR). Since DMIR processes two images simultaneously as input, the combination relationships between features has grown exponentially, ultimately the model considers more interfering features during the feature modeling process. Introducing dynamics in the receptive fields and weights of the network enable the model to eliminate the interfering features combination and model the potential feature combination relationships. In this paper, we propose the Dynamic Stream Network (DySNet), which enables the receptive fields and weights to be dynamically adjusted. This ultimately enables the model to ignore interfering feature combinations and model the potential feature relationships. With two key innovations: 1) Adaptive Stream Basin (AdSB) module dynamically adjusts the shape of the receptive field, thereby enabling the model to focus on the feature relationships with greater correlation. 2) Dynamic Stream Attention (DySA) mechanism generates dynamic weights to search for more valuable feature relationships. Extensive experiments have shown that DySNet consistently outperforms the most advanced DMIR methods, highlighting its outstanding generalization ability. Our code will be released on the website: https://github.com/ShaochenBi/DySNet.

[269] FusionNet: Physics-Aware Representation Learning for Multi-Spectral and Thermal Data via Trainable Signal-Processing Priors

Georgios Voulgaris

Main category: cs.CV

TL;DR: FusionNet: A physics-aware deep learning framework that integrates SWIR ratio data with TIR through intermediate fusion to model long-term environmental changes, outperforming SOTA methods across spectral configurations.

Details

Motivation: Current multi-modal deep learning models have poor alignment with physical signal formation processes, leading to brittle performance under cross-spectral and real-world conditions. Approaches focusing on direct thermal cues fail to capture indirect but persistent environmental changes from sustained heat emissions.

Method: Introduces FusionNet, a physics-aware representation learning framework that integrates geological SWIR ratio (sensitive to soil property changes) with TIR data through intermediate fusion architecture. The backbone embeds trainable differential signal-processing priors in convolutional layers, uses mixed pooling strategies, and employs wider receptive fields for cross-spectral robustness.

Result: DGCNN achieves 88.7% accuracy on SWIR ratio, FusionNet reaches 90.6%, outperforming state-of-the-art baselines across five spectral configurations. Transfer learning experiments show ImageNet pretraining degrades TIR performance, highlighting need for modality-aware training.

Conclusion: Combining physics-aware feature selection with principled deep learning architectures yields robust, generalizable representations. First-principles signal modeling improves multi-spectral learning under challenging conditions, demonstrating the value of integrating physical process understanding with deep learning.

Abstract: Modern deep learning models operating on multi-modal visual signals often rely on inductive biases that are poorly aligned with the physical processes governing signal formation, leading to brittle performance under cross-spectral and real-world conditions. In particular, approaches that prioritise direct thermal cues struggle to capture indirect yet persistent environmental alterations induced by sustained heat emissions. This work introduces a physics-aware representation learning framework that leverages multi-spectral information to model stable signatures of long-term physical processes. Specifically, a geological Short Wave Infrared (SWIR) ratio sensitive to soil property changes is integrated with Thermal Infrared (TIR) data through an intermediate fusion architecture, instantiated as FusionNet. The proposed backbone embeds trainable differential signal-processing priors within convolutional layers, combines mixed pooling strategies, and employs wider receptive fields to enhance robustness across spectral modalities. Systematic ablations show that each architectural component contributes to performance gains, with DGCNN achieving 88.7% accuracy on the SWIR ratio and FusionNet reaching 90.6%, outperforming state-of-the-art baselines across five spectral configurations. Transfer learning experiments further show that ImageNet pretraining degrades TIR performance, highlighting the importance of modality-aware training for cross-spectral learning. Evaluated on real-world data, the results demonstrate that combining physics-aware feature selection with principled deep learning architectures yields robust and generalisable representations, illustrating how first-principles signal modelling can improve multi-spectral learning under challenging conditions.

[270] A Convolutional Neural Deferred Shader for Physics Based Rendering

Zhuo He, Yingdong Ru, Qianying Liu, Paul Henderson, Nicolas Pugeault

Main category: cs.CV

TL;DR: pbnds+ is a physics-based neural deferred shading pipeline using CNNs to reduce parameters and improve performance in shading/relighting tasks, with energy regularization for dark scenes.

Details

Motivation: Current neural rendering methods using MLPs have too many parameters, requiring high computation resources and complicating training. Data-driven approaches need large datasets and can be biased by unbalanced data, especially ignoring dark illumination conditions.

Method: Introduces pbnds+ pipeline using convolutional neural networks instead of MLPs to reduce parameters. Proposes energy regularization to constrain model reflection during dark illumination conditions.

Result: Extensive experiments show the approach outperforms classical baselines, state-of-the-art neural shading models, and diffusion-based methods.

Conclusion: pbnds+ provides an efficient physics-based neural deferred shading solution that addresses parameter efficiency and dark scene handling challenges in neural rendering.

Abstract: Recent advances in neural rendering have achieved impressive results on photorealistic shading and relighting, by using a multilayer perceptron (MLP) as a regression model to learn the rendering equation from a real-world dataset. Such methods show promise for photorealistically relighting real-world objects, which is difficult to classical rendering, as there is no easy-obtained material ground truth. However, significant challenges still remain the dense connections in MLPs result in a large number of parameters, which requires high computation resources, complicating the training, and reducing performance during rendering. Data driven approaches require large amounts of training data for generalization; unbalanced data might bias the model to ignore the unusual illumination conditions, e.g. dark scenes. This paper introduces pbnds+: a novel physics-based neural deferred shading pipeline utilizing convolution neural networks to decrease the parameters and improve the performance in shading and relighting tasks; Energy regularization is also proposed to restrict the model reflection during dark illumination. Extensive experiments demonstrate that our approach outperforms classical baselines, a state-of-the-art neural shading model, and a diffusion-based method.

Marc Peral, Guillem Capellera, Luis Ferraz, Antonio Rubio, Antonio Agudo

Main category: cs.CV

TL;DR: Multi-modal transformer architecture for soccer analysis from tactical footage, handling ball trajectory inference, ball state classification, and possessor identification without direct ball tracking.

Details

Motivation: Existing methods rely heavily on accurate ball tracking or handcrafted heuristics, which fail under noisy/occluded conditions in real soccer matches. Need robust approach that can infer ball information without direct access to ball positions.

Method: Multi-modal architecture integrating three inputs: player trajectories, player types, and player image crops. Uses cascade of sociotemporal transformer blocks to process spatial-temporal dynamics. Introduces CropDrop pre-training strategy to prevent over-reliance on image features and encourage cross-modal learning.

Result: Substantial improvements over state-of-the-art baselines on all three tasks (ball trajectory inference, ball state classification, ball possessor identification) on large-scale dataset from real top league matches.

Conclusion: Combining structured and visual cues in transformer architecture is effective for soccer scene analysis. Realistic masking strategies like CropDrop are crucial for robust multi-modal learning, especially in noisy real-world conditions.

Abstract: In this work we propose a multi-modal architecture for analyzing soccer scenes from tactical camera footage, with a focus on three core tasks: ball trajectory inference, ball state classification, and ball possessor identification. To this end, our solution integrates three distinct input modalities (player trajectories, player types and image crops of individual players) into a unified framework that processes spatial and temporal dynamics using a cascade of sociotemporal transformer blocks. Unlike prior methods, which rely heavily on accurate ball tracking or handcrafted heuristics, our approach infers the ball trajectory without direct access to its past or future positions, and robustly identifies the ball state and ball possessor under noisy or occluded conditions from real top league matches. We also introduce CropDrop, a modality-specific masking pre-training strategy that prevents over-reliance on image features and encourages the model to rely on cross-modal patterns during pre-training. We show the effectiveness of our approach on a large-scale dataset providing substantial improvements over state-of-the-art baselines in all tasks. Our results highlight the benefits of combining structured and visual cues in a transformer-based architecture, and the importance of realistic masking strategies in multi-modal learning.

[272] SlicerOrbitSurgerySim: An Open-Source Platform for Virtual Registration and Quantitative Comparison of Preformed Orbital Plates

Chi Zhang, Braedon Gunn, Andrew M. Read-Fuller

Main category: cs.CV

TL;DR: Open-source 3D Slicer extension for virtual planning and quantitative evaluation of preformed orbital plate fit in orbital surgery.

Details

Motivation: Poor adaptation of orbital implants leads to complications and revision surgery. Surgeons lack standardized tools to quantitatively compare plate fit across vendors, sizes, and patient anatomy.

Method: Developed SlicerOrbitSurgerySim, an open-source extension for 3D Slicer platform that enables interactive virtual registration, evaluation, and comparison of multiple preformed orbital plates in patient-specific virtual planning environment.

Result: Software generates reproducible quantitative plate-to-orbit distance metrics and visualization tools supporting both patient-specific planning and population-level statistical analysis of plate adaptability.

Conclusion: Tool aims to improve preoperative decision-making, reduce intraoperative plate modification, and promote collaborative research and surgical education through objective comparison of implant designs and placement strategies.

Abstract: Poor adaptation of orbital implants remains a major contributor to postoperative complications and revision surgery. Although preformed orbital plates are widely used to reduce cost and operative time compared with customized implants, surgeons currently lack publicly available tools and standardized metrics to quantitatively compare plate fit across vendors, sizes, and patient anatomy. We developed SlicerOrbitSurgerySim, an open-source extension for the 3D Slicer platform that enables interactive virtual registration, evaluation, and comparison of multiple preformed orbital plates in a patient-specific virtual planning environment. The software generates reproducible quantitative plate-to-orbit distance metrics and visualization tools that support both patient-specific planning and population-level statistical analysis of plate adaptability. By facilitating objective comparison of implant designs and placement strategies, this tool aims to improve preoperative decision-making, reduce intraoperative plate modification, and promote collaborative research and surgical education. Pilot studies, sample datasets, and detailed tutorials are provided to support testing, transparency, and reproducibility.

[273] StoryMem: Multi-shot Long Video Storytelling with Memory

Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, Xingang Pan

Main category: cs.CV

TL;DR: StoryMem transforms single-shot video diffusion models into multi-shot storytellers using visual memory banks and novel memory injection techniques, achieving superior cross-shot consistency for long-form video storytelling.

Details

Motivation: Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency, which current single-shot video diffusion models struggle with due to lack of memory across shots.

Method: Proposes StoryMem paradigm with Memory-to-Video (M2V) design: maintains dynamic memory bank of keyframes, injects memory via latent concatenation and negative RoPE shifts with LoRA fine-tuning, uses semantic keyframe selection and aesthetic filtering.

Result: Achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, enabling coherent minute-long video storytelling.

Conclusion: StoryMem marks a significant step toward coherent long-form video storytelling by transforming pre-trained single-shot models into multi-shot storytellers through explicit visual memory mechanisms.

Abstract: Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.

[274] Vision Language Models are Confused Tourists

Patrick Amadeus Irawan, Ikhlasul Akmal Hanif, Muhammad Dehan Al Kautsar, Genta Indra Winata, Fajri Koto, Alham Fikri Aji

Main category: cs.CV

TL;DR: VLMs show critical vulnerability to cultural concept mixing - accuracy drops heavily when multiple cultural cues coexist, revealing systematic attention shifts toward distracting cues.

Details

Motivation: Current VLM evaluations overlook scenarios where multiple cultural cues coexist, focusing only on singular cultural concepts per image. This gap is crucial because real-world multicultural environments often contain mixed cultural signals, and VLMs need to remain stable across diverse cultural inputs to support diversity.

Method: Introduced ConfusedTourist, a novel cultural adversarial robustness suite that assesses VLMs’ stability against perturbed geographical cues. Used two perturbation methods: simple image-stacking and image-generation-based variants to create scenarios with multiple cultural cues.

Result: Experiments revealed critical vulnerability - accuracy drops heavily under simple image-stacking perturbations and worsens with image-generation-based variants. Interpretability analyses showed failures stem from systematic attention shifts toward distracting cues, diverting models from intended focus.

Conclusion: Visual cultural concept mixing can substantially impair even state-of-the-art VLMs, highlighting an urgent need for more culturally robust multimodal understanding. The findings expose a critical challenge in developing VLMs that can handle real-world multicultural scenarios.

Abstract: Although the cultural dimension has been one of the key aspects in evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs remains largely untested, despite being crucial to support diversity and multicultural societies. Existing evaluations often rely on benchmarks featuring only a singular cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs’ stability against perturbed geographical cues. Our experiments reveal a critical vulnerability, where accuracy drops heavily under simple image-stacking perturbations and even worsens with its image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.

[275] ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars

Ziqiao Peng, Yi Chen, Yifeng Ma, Guozhen Zhang, Zhiyao Sun, Zixiang Zhou, Youliang Zhang, Zhengguang Zhou, Zhaoxin Fan, Hongyan Liu, Yuan Zhou, Qinglin Lu, Jun He

Main category: cs.CV

TL;DR: ActAvatar is a talking avatar framework that achieves precise phase-level action control through textual guidance, addressing challenges in text-following capability, temporal alignment, and dependency on additional control signals.

Details

Motivation: Existing talking avatar methods have insufficient text-following capability for diverse actions, lack temporal alignment between actions and audio content, and depend on additional control signals like pose skeletons.

Method: Three core innovations: 1) Phase-Aware Cross-Attention (PACA) decomposes prompts into global base and temporally-anchored phase blocks for precise temporal-semantic alignment; 2) Progressive Audio-Visual Alignment aligns modality influence with hierarchical feature learning; 3) Two-stage training strategy establishes audio-visual correspondence first, then injects action control through fine-tuning.

Result: Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality.

Conclusion: ActAvatar achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context, overcoming limitations of existing talking avatar generation methods.

Abstract: Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process-early layers prioritize text for establishing action structure while deeper layers emphasize audio for refining lip movements, preventing modality interference; (3) A two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, maintaining both audio-visual alignment and the model’s text-following capabilities. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality.

[276] No Data? No Problem: Robust Vision-Tabular Learning with Missing Values

Marta Hasny, Laura Daza, Keno Bressem, Maxime Di Folco, Julia Schnabel

Main category: cs.CV

TL;DR: RoVTL is a robust vision-tabular learning framework that handles any level of tabular data availability (0-100%) through contrastive pretraining with missingness augmentation and gated cross-attention fusion, achieving consistent performance across all data completeness scenarios.

Details

Motivation: Real-world medical datasets often have incomplete tabular data, but large biobanks provide extensive tabular information. This discrepancy requires methods that can leverage all tabular data during training while remaining robust to missing values at inference.

Method: Two-stage framework: 1) Contrastive pretraining with tabular attribute missingness as data augmentation, 2) Downstream task tuning using gated cross-attention module for multimodal fusion, plus Tabular More vs. Fewer loss and disentangled gradient learning.

Result: Superior robustness to missing tabular data on UK Biobank cardiac MRI scans compared to prior methods, successful generalization to external cardiac MRI dataset for disease classification, and extension to natural images domain with car advertisements dataset.

Conclusion: RoVTL effectively addresses the challenge of incomplete tabular data in real-world applications by providing consistent performance across all levels of tabular data availability, demonstrating strong generalization across medical and natural image domains.

Abstract: Large-scale medical biobanks provide imaging data complemented by extensive tabular information, such as demographics or clinical measurements. However, this abundance of tabular attributes does not reflect real-world datasets, where only a subset of attributes may be available. This discrepancy calls for methods that can leverage all the tabular data during training while remaining robust to missing values at inference. To address this challenge, we propose RoVTL (Robust Vision-Tabular Learning), a framework designed to handle any level of tabular data availability, from 0% to 100%. RoVTL comprises two key stages: contrastive pretraining, where we introduce tabular attribute missingness as data augmentation to promote robustness, and downstream task tuning using a gated cross-attention module for multimodal fusion. During fine-tuning, we employ a novel Tabular More vs. Fewer loss that ranks performance based on the amount of available tabular data. Combined with disentangled gradient learning, this enables consistent performance across all tabular data completeness scenarios. We evaluate RoVTL on cardiac MRI scans from the UK Biobank, demonstrating superior robustness to missing tabular data compared to prior methods. Furthermore, RoVTL successfully generalizes to an external cardiac MRI dataset for multimodal disease classification, and extends to the natural images domain, achieving robust performance on a car advertisements dataset. The code is available at https://github.com/marteczkah/RoVTL.

[277] Generative diffusion models for agricultural AI: plant image generation, indoor-to-outdoor translation, and expert preference alignment

Da Tan, Michael Beck, Christopher P. Bidinosti, Robert H. Gulden, Christopher J. Henry

Main category: cs.CV

TL;DR: This paper proposes a diffusion-based generative modeling framework for agricultural AI that addresses data collection challenges through plant image synthesis, indoor-to-outdoor translation, and expert preference alignment.

Details

Motivation: Collecting large, diverse, high-quality plant image datasets in real field conditions is costly, labor-intensive, and seasonally constrained, creating a bottleneck for agricultural AI development.

Method: Three main approaches: 1) Fine-tuning Stable Diffusion on captioned indoor/outdoor plant imagery for text-conditioned image generation, 2) Using DreamBooth-based text inversion and image-guided diffusion for indoor-to-outdoor translation, 3) Implementing preference-guided fine-tuning with expert-scored reward models.

Result: Synthetic images effectively augment training data (improving phenotype classification), translated images enhance weed detection/classification with YOLOv8, and preference-aligned fine-tuning produces more stable, expert-aligned outputs.

Conclusion: The framework demonstrates a practical pathway toward data-efficient generative pipelines for agricultural AI by addressing key data collection challenges through diffusion-based generative modeling.

Abstract: The success of agricultural artificial intelligence depends heavily on large, diverse, and high-quality plant image datasets, yet collecting such data in real field conditions is costly, labor intensive, and seasonally constrained. This paper investigates diffusion-based generative modeling to address these challenges through plant image synthesis, indoor-to-outdoor translation, and expert preference aligned fine tuning. First, a Stable Diffusion model is fine tuned on captioned indoor and outdoor plant imagery to generate realistic, text conditioned images of canola and soybean. Evaluation using Inception Score, Frechet Inception Distance, and downstream phenotype classification shows that synthetic images effectively augment training data and improve accuracy. Second, we bridge the gap between high resolution indoor datasets and limited outdoor imagery using DreamBooth-based text inversion and image guided diffusion, generating translated images that enhance weed detection and classification with YOLOv8. Finally, a preference guided fine tuning framework trains a reward model on expert scores and applies reward weighted updates to produce more stable and expert aligned outputs. Together, these components demonstrate a practical pathway toward data efficient generative pipelines for agricultural AI.

[278] 4D Gaussian Splatting as a Learned Dynamical System

Arnold Caleb Asiimwe, Carl Vondrick

Main category: cs.CV

TL;DR: EvoGS reinterprets 4D Gaussian Splatting as a continuous-time dynamical system using neural dynamical fields instead of per-frame deformations, enabling motion coherence, temporal extrapolation, and compositional dynamics.

Details

Motivation: Current deformation-based approaches for dynamic 3D scenes lack motion coherence, temporal consistency, and the ability to extrapolate beyond observed time ranges. They also don't support compositional dynamics for controllable scene synthesis.

Method: EvoGS treats Gaussian representation as an evolving physical system where scene motion arises from integrating a learned neural dynamical field. This continuous-time formulation models the underlying motion law rather than applying per-frame deformations.

Result: EvoGS achieves better motion coherence and temporal consistency compared to deformation-field baselines while maintaining real-time rendering. It enables sample-efficient learning from sparse supervision, temporal extrapolation beyond observed ranges, and compositional dynamics for controllable synthesis.

Conclusion: The continuous-time dynamical system formulation of 4D Gaussian Splatting (EvoGS) provides superior motion modeling capabilities including temporal extrapolation and compositional dynamics, offering a more principled approach to dynamic scene representation than deformation-based methods.

Abstract: We reinterpret 4D Gaussian Splatting as a continuous-time dynamical system, where scene motion arises from integrating a learned neural dynamical field rather than applying per-frame deformations. This formulation, which we call EvoGS, treats the Gaussian representation as an evolving physical system whose state evolves continuously under a learned motion law. This unlocks capabilities absent in deformation-based approaches:(1) sample-efficient learning from sparse temporal supervision by modeling the underlying motion law; (2) temporal extrapolation enabling forward and backward prediction beyond observed time ranges; and (3) compositional dynamics that allow localized dynamics injection for controllable scene synthesis. Experiments on dynamic scene benchmarks show that EvoGS achieves better motion coherence and temporal consistency compared to deformation-field baselines while maintaining real-time rendering

[279] Over++: Generative Video Compositing for Layer Interaction Effects

Luchao Qi, Jiaye Wu, Jun Myeong Choi, Cary Phillips, Roni Sengupta, Dan B Goldman

Main category: cs.CV

TL;DR: Over++ is a video effect generation framework that synthesizes realistic environmental interactions (shadows, reflections, dust, splashes) between foreground and background layers while preserving the original video content, using text prompts and optional masks without requiring camera pose or depth supervision.

Details

Motivation: Current video compositing workflows require manual creation of environmental interactions, while existing video generative models struggle to preserve input videos and current inpainting methods either need costly per-frame masks or produce implausible results.

Method: Over++ framework for augmented compositing task, using a paired effect dataset with unpaired augmentation strategy to preserve text-driven editability, supporting optional mask control and keyframe guidance without dense annotations, with no assumptions about camera pose, scene stationarity, or depth supervision.

Result: Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.

Conclusion: Over++ successfully addresses the augmented compositing task by generating realistic environmental interactions while preserving original video content, offering a practical solution for professional video compositing workflows.

Abstract: In professional video compositing workflows, artists must manually create environmental interactions-such as shadows, reflections, dust, and splashes-between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers, while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored for this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.

[280] Efficient Vision Mamba for MRI Super-Resolution via Hybrid Selective Scanning

Mojtaba Safari, Shansong Wang, Vanessa L Wildman, Mingzhe Hu, Zach Eidex, Chih-Wei Chang, Erik H Middlebrooks, Richard L. J Qiu, Pretesh Patel, Ashesh B. Jania, Hui Mao, Zhen Tian, Xiaofeng Yang

Main category: cs.CV

TL;DR: A novel MRI super-resolution framework combining multi-head selective state-space models with lightweight channel MLP achieves state-of-the-art performance with exceptional efficiency (0.9M parameters, 57 GFLOPs).

Details

Motivation: High-resolution MRI is clinically important but acquisition times are long. Existing deep learning SR methods face trade-offs between fidelity and computational efficiency, limiting clinical integration.

Method: Proposed framework combines multi-head selective state-space models (MHSSM) with lightweight channel MLP. Uses 2D patch extraction with hybrid scanning to capture long-range dependencies. Each MambaFormer block integrates MHSSM, depthwise convolutions, and gated channel mixing.

Result: Superior performance on 7T brain data (SSIM=0.951, PSNR=26.90 dB) and 1.5T prostate data (SSIM=0.770, PSNR=27.15 dB). Outperformed all baselines (GANs, transformers, Mamba, diffusion models) with only 0.9M parameters and 57 GFLOPs - 99.8% fewer parameters and 97.5% less computation than Res-SRDiff.

Conclusion: The framework provides an efficient, accurate MRI SR solution with enhanced anatomical detail. Low computational demand and state-of-the-art performance show strong potential for clinical translation.

Abstract: Background: High-resolution MRI is critical for diagnosis, but long acquisition times limit clinical use. Super-resolution (SR) can enhance resolution post-scan, yet existing deep learning methods face fidelity-efficiency trade-offs. Purpose: To develop a computationally efficient and accurate deep learning framework for MRI SR that preserves anatomical detail for clinical integration. Materials and Methods: We propose a novel SR framework combining multi-head selective state-space models (MHSSM) with a lightweight channel MLP. The model uses 2D patch extraction with hybrid scanning to capture long-range dependencies. Each MambaFormer block integrates MHSSM, depthwise convolutions, and gated channel mixing. Evaluation used 7T brain T1 MP2RAGE maps (n=142) and 1.5T prostate T2w MRI (n=334). Comparisons included Bicubic interpolation, GANs (CycleGAN, Pix2pix, SPSR), transformers (SwinIR), Mamba (MambaIR), and diffusion models (I2SB, Res-SRDiff). Results: Our model achieved superior performance with exceptional efficiency. For 7T brain data: SSIM=0.951+-0.021, PSNR=26.90+-1.41 dB, LPIPS=0.076+-0.022, GMSD=0.083+-0.017, significantly outperforming all baselines (p<0.001). For prostate data: SSIM=0.770+-0.049, PSNR=27.15+-2.19 dB, LPIPS=0.190+-0.095, GMSD=0.087+-0.013. The framework used only 0.9M parameters and 57 GFLOPs, reducing parameters by 99.8% and computation by 97.5% versus Res-SRDiff, while outperforming SwinIR and MambaIR in accuracy and efficiency. Conclusion: The proposed framework provides an efficient, accurate MRI SR solution, delivering enhanced anatomical detail across datasets. Its low computational demand and state-of-the-art performance show strong potential for clinical translation.

[281] VA-$π$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation

Xinyao Liao, Qiyuan He, Kai Xu, Xiaoye Qu, Yicong Li, Wei Wei, Angela Yao

Main category: cs.CV

TL;DR: VA-π is a lightweight post-training framework that aligns autoregressive visual generators with tokenizers by optimizing them directly with pixel-space objectives using reinforcement learning, improving image quality without retraining tokenizers.

Details

Motivation: There's a misalignment between tokenizers (trained for image reconstruction) and AR generators (optimized only for token likelihood), leading to generated token sequences that decode into low-quality images without pixel-space supervision.

Method: VA-π formulates generator-tokenizer alignment as variational optimization, deriving an ELBO that unifies pixel reconstruction and autoregressive modeling. It uses reinforcement-based alignment treating the AR generator as a policy, with pixel-space reconstruction quality as intrinsic reward measured under teacher forcing.

Result: With only 1% ImageNet-1K data and 25 minutes tuning, VA-π reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL. It also improves text-to-image performance on GenEval for both LlamaGen (0.306 to 0.339) and Janus-Pro (0.725 to 0.744).

Conclusion: VA-π enables rapid adaptation of existing AR generators without tokenizer retraining or external reward models, providing direct pixel-level guidance through a principled variational framework that maintains distributional consistency while significantly improving image quality.

Abstract: Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment leads to generated token sequences that may decode into low-quality images, without direct supervision from the pixel space. We propose VA-$π$, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-$π$ formulates the generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize under the discrete token space, VA-$π$ introduces a reinforcement-based alignment strategy that treats the AR generator as a policy, uses pixel-space reconstruction quality as its intrinsic reward. The reward is measured by how well the predicted token sequences can reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The regularization term of the ELBO serves as a natural regularizer, maintaining distributional consistency of tokens. VA-$π$ enables rapid adaptation of existing AR generators, without neither tokenizer retraining nor external reward models. With only 1% ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains in the text-to-image task on GenEval for both visual generation model (LlamaGen: from 0.306 to 0.339) and unified multi-modal model (Janus-Pro: from 0.725 to 0.744). Code is available at https://github.com/Lil-Shake/VA-Pi.

[282] From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs

Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, Jiaolong Yang, Marc Pollefeys, Tong Zhang

Main category: cs.CV

TL;DR: MLLMs lack spatial intelligence, existing benchmarks are inadequate, new outdoor benchmark with precise 3D data reveals models rely on linguistic priors rather than visual reasoning.

Details

Motivation: Multimodal Large Language Models have impressive semantic performance but underdeveloped spatial intelligence, which is crucial for robust AI systems. Existing benchmarks are insufficient because they focus on simplified qualitative reasoning or use domain-specific indoor data without outdoor datasets with verifiable metric ground truth.

Method: Introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This provides metrically precise 3D information, enabling automatic generation of spatial reasoning questions across a hierarchical spectrum from qualitative relational reasoning to quantitative metric and kinematic understanding.

Result: Performance gains observed in structured indoor benchmarks vanish in open-world settings. Analysis using synthetic abnormal scenes and blinding tests confirms current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning.

Conclusion: The benchmark provides a principled platform for diagnosing limitations and advancing physically grounded spatial intelligence in MLLMs.

Abstract: While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence–crucial for robust and grounded AI systems–remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum–from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.

[283] Zero-shot Reconstruction of In-Scene Object Manipulation from Video

Dixuan Lin, Tianyou Wang, Zhuoyang Pan, Yufu Wang, Lingjie Liu, Kostas Daniilidis

Main category: cs.CV

TL;DR: First system for reconstructing in-scene object manipulation from monocular RGB video using foundation models and two-stage optimization.

Details

Motivation: Existing methods operate in hand-centric coordinates ignoring scene context, which hinders metric accuracy and practical use. The problem is challenging due to ill-posed scene reconstruction, ambiguous hand-object depth, and need for physically plausible interactions.

Method: 1) Use data-driven foundation models to initialize object mesh and poses, scene point cloud, and hand poses. 2) Apply two-stage optimization that recovers complete hand-object motion from grasping to interaction while remaining consistent with observed scene information.

Result: First system capable of reconstructing in-scene object manipulation from monocular RGB video with metric accuracy and physically plausible interactions.

Conclusion: The proposed method successfully addresses the limitations of existing hand-centric approaches by incorporating scene context and using foundation models with optimization to achieve accurate, physically plausible reconstruction of object manipulation.

Abstract: We build the first system to address the problem of reconstructing in-scene object manipulation from a monocular RGB video. It is challenging due to ill-posed scene reconstruction, ambiguous hand-object depth, and the need for physically plausible interactions. Existing methods operate in hand centric coordinates and ignore the scene, hindering metric accuracy and practical use. In our method, we first use data-driven foundation models to initialize the core components, including the object mesh and poses, the scene point cloud, and the hand poses. We then apply a two-stage optimization that recovers a complete hand-object motion from grasping to interaction, which remains consistent with the scene information observed in the input video.

[284] Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models

Zixuan Ye, Quande Liu, Cong Wei, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhan Luo

Main category: cs.CV

TL;DR: The paper introduces a method to improve visual context consistency in multimodal generation by integrating visual planning and iterative correction into Chain-of-Thought reasoning.

Details

Motivation: Current Chain-of-Thought approaches focus on text consistency but ignore visual context consistency with reference images, leading to failures in maintaining key visual features like human ID, object attributes, and style in multimodal generation tasks.

Method: 1) Adaptive Visual Planning: generates structured visual checklists to identify visual elements needing consistency; 2) Iterative Visual Correction: performs self-reflection guided by checklists and refines results iteratively. Uses supervised fine-tuning to teach planning, reflection, refinement, and flow-GRPO with customized visual checking rewards.

Result: The method outperforms both zero-shot unified models and those with text-only Chain-of-Thought in multimodal generation, demonstrating higher visual context consistency.

Conclusion: Integrating visual context consistency into reasoning through visual planning and iterative correction significantly improves the ability of unified models to maintain key visual features in multimodal generation tasks.

Abstract: Recently, the introduction of Chain-of-Thought (CoT) has largely improved the generation ability of unified models. However, it is observed that the current thinking process during generation mainly focuses on the text consistency with the text prompt, ignoring the \textbf{visual context consistency} with the visual reference images during the multi-modal generation, e.g., multi-reference generation. The lack of such consistency results in the failure in maintaining key visual features (like human ID, object attribute, style). To this end, we integrate the visual context consistency into the reasoning of unified models, explicitly motivating the model to sustain such consistency by 1) Adaptive Visual Planning: generating structured visual check list to figure out the visual element of needed consistency keeping, and 2) Iterative Visual Correction: performing self-reflection with the guidance of check lists and refining the generated result in an iterative manner. To achieve this, we use supervised finetuning to teach the model how to plan the visual checking, conduct self-reflection and self-refinement, and use flow-GRPO to further enhance the visual consistency through a customized visual checking reward. The experiments show that our method outperforms both zero-shot unified models and those with text CoTs in multi-modal generation, demonstrating higher visual context consistency.

[285] Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models

Pablo Ruiz-Ponce, Sergio Escalera, José García-Rodríguez, Jiankang Deng, Rolandos Alexandros Potamias

Main category: cs.CV

TL;DR: Interact2Ar is an autoregressive diffusion model that generates realistic full-body human-human interactions with detailed hand motions, enabling temporal composition and adaptation to disturbances.

Details

Motivation: Previous methods for generating human-human interactions ignore hand motions (limiting realism) and generate entire sequences simultaneously (limiting ability to capture reactive/adaptive nature of interactions).

Method: End-to-end text-conditioned autoregressive diffusion model with parallel branches for hand kinematics, coupled with novel memory technique for efficient large context windows.

Result: State-of-the-art performance demonstrated through quantitative and qualitative experiments, with new robust evaluators and extended metrics for full-body interactions.

Conclusion: Interact2Ar enables high-fidelity full-body interaction generation with adaptability for downstream applications including temporal composition, real-time disturbance adaptation, and multi-person scenarios.

Abstract: Generating realistic human-human interactions is a challenging task that requires not only high-quality individual body and hand motions, but also coherent coordination among all interactants. Due to limitations in available data and increased learning complexity, previous methods tend to ignore hand motions, limiting the realism and expressivity of the interactions. Additionally, current diffusion-based approaches generate entire motion sequences simultaneously, limiting their ability to capture the reactive and adaptive nature of human interactions. To address these limitations, we introduce Interact2Ar, the first end-to-end text-conditioned autoregressive diffusion model for generating full-body, human-human interactions. Interact2Ar incorporates detailed hand kinematics through dedicated parallel branches, enabling high-fidelity full-body generation. Furthermore, we introduce an autoregressive pipeline coupled with a novel memory technique that facilitates adaptation to the inherent variability of human interactions using efficient large context windows. The adaptability of our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios. To validate the generated motions, we introduce a set of robust evaluators and extended metrics designed specifically for assessing full-body interactions. Through quantitative and qualitative experiments, we demonstrate the state-of-the-art performance of Interact2Ar.

[286] The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, Ziwei Liu

Main category: cs.CV

TL;DR: The paper introduces the Prism Hypothesis - semantic encoders capture low-frequency components for abstract meaning while pixel encoders retain high-frequency details. Based on this, they propose Unified Autoencoding (UAE) with a frequency-band modulator to harmonize semantic structure and pixel details in a single latent space.

Details

Motivation: To understand the spectral characteristics of different encoders and discover a unifying principle that explains why semantic encoders capture abstract meaning while pixel encoders preserve fine-grained details. The goal is to create a framework that can harmonize both semantic structure and pixel-level fidelity.

Method: Systematic analysis of spectral characteristics of semantic and pixel encoders, leading to the Prism Hypothesis. Then propose Unified Autoencoding (UAE) with an innovative frequency-band modulator that enables seamless coexistence of semantic abstraction and pixel details in a unified latent space.

Result: Extensive experiments on ImageNet and MS-COCO benchmarks show that UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.

Conclusion: The Prism Hypothesis provides a unifying perspective linking encoder behavior to spectral structure, and UAE demonstrates that harmonizing semantic and pixel information through frequency modulation leads to effective unified representations with strong performance on standard benchmarks.

Abstract: Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a highly inspiring and rarely explored correspondence between an encoder’s feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, just like the prism. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.

[287] Rethinking Open-Set Object Detection: Issues, a New Formulation, and Taxonomy

Yusuke Hosoya, Masanori Suganuma, Takayuki Okatani

Main category: cs.CV

TL;DR: The paper identifies fundamental issues with current open-set object detection (OSOD) formulations, proposes a new formulation based on super-classes, benchmarks existing methods, and introduces a taxonomy to clarify confusion in the field.

Details

Motivation: Current OSOD formulations have a fundamental contradiction: object detection inherently requires knowing "what to detect," which conflicts with identifying "unknown" objects. This creates evaluation problems that previous studies have overlooked.

Method: Proposes a novel formulation where detectors must detect both known and unknown classes within specified super-classes of object classes. Designs benchmark tests using existing datasets and evaluates existing OSOD methods.

Result: Existing OSOD methods fail to accurately detect unknown objects due to misclassification of known and unknown classes rather than incorrect bounding box prediction. The new formulation resolves the fundamental issues.

Conclusion: The study encourages the research community to reconsider OSOD and provides a taxonomy to resolve confusion in the literature, facilitating progress in the right direction.

Abstract: Open-set object detection (OSOD), a task involving the detection of unknown objects while accurately detecting known objects, has recently gained attention. However, we identify a fundamental issue with the problem formulation employed in current OSOD studies. Inherent to object detection is knowing “what to detect,” which contradicts the idea of identifying “unknown” objects. This sets OSOD apart from open-set recognition (OSR). This contradiction complicates a proper evaluation of methods’ performance, a fact that previous studies have overlooked. Next, we propose a novel formulation wherein detectors are required to detect both known and unknown classes within specified super-classes of object classes. This new formulation is free from the aforementioned issues and has practical applications. Finally, we design benchmark tests utilizing existing datasets and report the experimental evaluation of existing OSOD methods. The results show that existing methods fail to accurately detect unknown objects due to misclassification of known and unknown classes rather than incorrect bounding box prediction. As a byproduct, we introduce a taxonomy of OSOD, resolving confusion prevalent in the literature. We anticipate that our study will encourage the research community to reconsider OSOD and facilitate progress in the right direction.

[288] ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

Siming Yan, Min Bai, Weifeng Chen, Xiong Zhou, Qixing Huang, Li Erran Li

Main category: cs.CV

TL;DR: ViGoR is a framework that uses fine-grained reward modeling to improve visual grounding in large vision-language models, reducing hallucinations and inaccuracies with cheaper human evaluations instead of full supervision.

Details

Motivation: Current large vision-language models (LVLMs) often generate text with inaccurate visual grounding, including hallucinations of non-existent elements, missing important scene parts, and incorrect object attributes/relationships.

Method: ViGoR framework utilizes fine-grained reward modeling to enhance visual grounding of LVLMs over pre-trained baselines, using cheaper human evaluations and automated methods instead of full supervision.

Result: The approach shows effectiveness through various evaluation methods and benchmarks. The authors released a human annotation dataset of 15,440 images and generated text pairs with fine-grained evaluations.

Conclusion: ViGoR significantly improves visual grounding in LVLMs through efficient fine-grained reward modeling, addressing key limitations of current models while providing valuable resources to the research community.

Abstract: By combining natural language understanding, generation capabilities, and breadth of knowledge of large language models with image perception, recent large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities. However, the generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements, missing significant parts of the scene, and inferring incorrect attributes of and relationships between objects. To address these issues, we introduce a novel framework, ViGoR (Visual Grounding Through Fine-Grained Reward Modeling) that utilizes fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines. This improvement is efficiently achieved using much cheaper human evaluations instead of full supervisions, as well as automated methods. We show the effectiveness of our approach through a variety of evaluation methods and benchmarks. Additionally, we released our human annotation (https://github.com/amazon-science/vigor) comprising 15,440 images and generated text pairs with fine-grained evaluations to contribute to related research in the community.

[289] One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

Hao Fang, Jiawei Kong, Wenbo Yu, Bin Chen, Jiawei Li, Hao Wu, Shutao Xia, Ke Xu

Main category: cs.CV

TL;DR: The paper proposes C-PGC, a universal adversarial attack method for Vision-Language Pre-training models that uses contrastive learning to generate instance-agnostic perturbations that disrupt multimodal alignment.

Details

Motivation: Previous adversarial attacks on VLP models are instance-specific and require generating perturbations for each input. The authors aim to demonstrate VLP models are also vulnerable to universal adversarial perturbations (UAPs) that work across different inputs.

Method: C-PGC (Contrastive-training Perturbation Generator with Cross-modal conditions) uses a malicious version of contrastive learning to train a perturbation generator. It creates carefully crafted positive/negative image-text pairs to destroy the alignment learned by VLP models, incorporating both unimodal and cross-modal information as guidance.

Result: Extensive experiments show C-PGC successfully forces adversarial samples away from their original positions in VLP feature space, enhancing attacks across various victim models and vision-language tasks.

Conclusion: VLP models are vulnerable to universal adversarial perturbations, and C-PGC effectively exploits contrastive learning principles to create powerful cross-modal attacks that disrupt the multimodal alignment crucial to VLP performance.

Abstract: Vision-Language Pre-training (VLP) models have exhibited unprecedented capability in many applications by taking full advantage of the multimodal alignment. However, previous studies have shown they are vulnerable to maliciously crafted adversarial samples. Despite recent success, these methods are generally instance-specific and require generating perturbations for each input sample. In this paper, we reveal that VLP models are also vulnerable to the instance-agnostic universal adversarial perturbation (UAP). Specifically, we design a novel Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC) to achieve the attack. In light that the pivotal multimodal alignment is achieved through the advanced contrastive learning technique, we devise to turn this powerful weapon against themselves, i.e., employ a malicious version of contrastive learning to train the C-PGC based on our carefully crafted positive and negative image-text pairs for essentially destroying the alignment relationship learned by VLP models. Besides, C-PGC fully utilizes the characteristics of Vision-and-Language (V+L) scenarios by incorporating both unimodal and cross-modal information as effective guidance. Extensive experiments show that C-PGC successfully forces adversarial samples to move away from their original area in the VLP model’s feature space, thus essentially enhancing attacks across various victim models and V+L tasks. The GitHub repository is available at https://github.com/ffhibnese/CPGC_VLP_Universal_Attacks.

[290] OW-Rep: Open World Object Detection with Instance Representation Learning

Sunoh Lee, Minsik Jeon, Jihong Min, Junwon Seo

Main category: cs.CV

TL;DR: Extends Open World Object Detection (OWOD) to jointly detect unknown objects and learn semantically rich instance embeddings by leveraging Vision Foundation Models (VFMs) through two modules: Unknown Box Refine Module and Embedding Transfer Module.

Details

Motivation: Existing OWOD methods focus on detecting unknown objects but overlook semantic relationships between objects, which are crucial for scene understanding and applications like open-world tracking and novel class discovery.

Method: Proposes two VFM-based modules: 1) Unknown Box Refine Module uses Segment Anything Model instance masks to accurately localize unknown objects; 2) Embedding Transfer Module distills instance-wise semantic similarities from VFM features to detector embeddings via relaxed contrastive loss.

Result: Significantly improves both unknown object detection and instance embedding quality, while enhancing performance in downstream tasks like open-world tracking.

Conclusion: The approach successfully extends OWOD to capture fine-grained semantic relationships between instances, making detectors more capable for open-world applications by leveraging VFM knowledge.

Abstract: Open World Object Detection(OWOD) addresses realistic scenarios where unseen object classes emerge, enabling detectors trained on known classes to detect unknown objects and incrementally incorporate the knowledge they provide. While existing OWOD methods primarily focus on detecting unknown objects, they often overlook the rich semantic relationships between detected objects, which are essential for scene understanding and applications in open-world environments (e.g., open-world tracking and novel class discovery). In this paper, we extend the OWOD framework to jointly detect unknown objects and learn semantically rich instance embeddings, enabling the detector to capture fine-grained semantic relationships between instances. To this end, we propose two modules that leverage the rich and generalizable knowledge of Vision Foundation Models(VFMs) and can be integrated into open-world object detectors. First, the Unknown Box Refine Module uses instance masks from the Segment Anything Model to accurately localize unknown objects. The Embedding Transfer Module then distills instance-wise semantic similarities from VFM features to the detector’s embeddings via a relaxed contrastive loss, enabling the detector to learn a semantically meaningful and generalizable instance feature. Extensive experiments show that our method significantly improves both unknown object detection and instance embedding quality, while also enhancing performance in downstream tasks such as open-world tracking.

[291] AIDOVECL: AI-generated Dataset of Outpainted Vehicles for Eye-level Classification and Localization

Amir Kazemi, Qurat ul ain Fatima, Volodymyr Kindratenko, Christopher Tessum

Main category: cs.CV

TL;DR: AIDOVECL uses outpainting to generate synthetic vehicle images with annotations, reducing manual labeling effort and improving vehicle detection performance by up to 10% overall and up to 50% for underrepresented classes.

Details

Motivation: Manual image labeling is a critical bottleneck in computer vision development. There's a scarcity of diverse, eye-level vehicle images needed for autonomous driving, urban planning, and environmental monitoring applications.

Method: Detect and crop vehicles from seed images, then use outpainting to generate artificial contexts on larger canvases. Apply advanced outpainting techniques and image quality assessments to ensure visual fidelity and contextual relevance.

Result: AIDOVECL improves overall detection performance by up to 10%, with gains up to 40% in diverse contexts. Underrepresented classes achieve up to 50% higher true positives. The approach effectively augments real training data and supports evaluation across diverse scenarios.

Conclusion: Outpainting serves as an automatic annotation paradigm, offering a practical solution for building fine-grained datasets with reduced labeling effort across multiple machine learning domains. Code and datasets are publicly available.

Abstract: Image labeling is a critical bottleneck in the development of computer vision technologies, often constraining the potential of machine learning models due to the time-intensive nature of manual annotations. This work introduces a novel approach that leverages outpainting to mitigate the problem of annotated data scarcity by generating artificial contexts and annotations, significantly reducing manual labeling efforts. We apply this technique to a particularly acute challenge in autonomous driving, urban planning, and environmental monitoring: the lack of diverse, eye-level vehicle images in desired classes. Our dataset comprises AI-generated vehicle images obtained by detecting and cropping vehicles from manually selected seed images, which are then outpainted onto larger canvases to simulate varied real-world conditions. The outpainted images include detailed annotations, providing high-quality ground truth data. Advanced outpainting techniques and image quality assessments ensure visual fidelity and contextual relevance. Ablation results show that incorporating AIDOVECL improves overall detection performance by up to 10%, and delivers gains of up to 40% in settings with greater diversity of context, object scale, and placement, with underrepresented classes achieving up to 50% higher true positives. AIDOVECL enhances vehicle detection by augmenting real training data and supporting evaluation across diverse scenarios. By demonstrating outpainting as an automatic annotation paradigm, it offers a practical and versatile solution for building fine-grained datasets with reduced labeling effort across multiple machine learning domains. The code and links to datasets used in this study are available for further research and replication at https://github.com/amir-kazemi/aidovecl .

[292] Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Zaira Manigrasso, Matteo Dunnhofer, Antonino Furnari, Moritz Nottebaum, Antonio Finocchiaro, Davide Marana, Rosario Forte, Giovanni Maria Farinella, Christian Micheloni

Main category: cs.CV

TL;DR: OVQ2D task for online episodic memory retrieval from video streams, with ESOM framework achieving ~4% success on Ego4D, showing potential for improvement with better object discovery/tracking.

Details

Motivation: Existing episodic memory systems assume offline access to full videos, which is impractical for power/storage-constrained wearable devices. Need for application-ready systems that process video streams online.

Method: Introduce OVQ2D task for online visual querying. Propose ESOM framework with three modules: object discovery (finds objects), object tracking (tracks them), and memory module (stores spatio-temporal object info for efficient querying).

Result: ESOM outperforms other online approaches on Ego4D but achieves only ~4% success rate. With perfect object tracking: 31.91%, perfect discovery: 40.55%, perfect both: 81.92%, showing significant room for improvement.

Conclusion: OVQ2D is challenging but important for real-world wearable applications. ESOM provides a baseline framework, but substantial improvements in object discovery and tracking are needed for practical deployment.

Abstract: Episodic memory retrieval enables wearable cameras to recall objects or events previously observed in video. However, existing formulations assume an “offline” setting with full video access at query time, limiting their applicability in real-world scenarios with power and storage-constrained wearable devices. Towards more application-ready episodic memory systems, we introduce Online Visual Query 2D (OVQ2D), a task where models process video streams online, observing each frame only once, and retrieve object localizations using a compact memory instead of full video history. We address OVQ2D with ESOM (Egocentric Streaming Object Memory), a novel framework integrating an object discovery module, an object tracking module, and a memory module that find, track, and store spatio-temporal object information for efficient querying. Experiments on Ego4D demonstrate ESOM’s superiority over other online approaches, though OVQ2D remains challenging, with top performance at only ~4% success. ESOM’s accuracy increases markedly with perfect object tracking (31.91%), discovery (40.55%), or both (81.92%), underscoring the need of applied research on these components.

[293] Learnable Sparsity for Vision Generative Models

Yang Zhang, Er Jin, Wenzhong Liang, Yanfei Dong, Ashkan Khakzar, Philip Torr, Johannes Stegmaier, Kenji Kawaguchi

Main category: cs.CV

TL;DR: A model-agnostic structural pruning framework for diffusion models that achieves pruning without retraining by learning differentiable masks and using end-to-end pruning objectives with memory optimization techniques.

Details

Motivation: Diffusion models require increasing model sizes for performance gains, which escalates computational costs, memory demands, and environmental impact. Existing pruning methods need extensive retraining, which is extremely costly for large diffusion models, limiting their practicality.

Method: Proposes a model-agnostic structural pruning framework that learns differentiable masks to sparsify diffusion models. Designs an end-to-end pruning objective spanning the entire diffusion process to preserve denoising quality. Introduces time step gradient checkpointing to reduce memory usage during optimization, enabling end-to-end pruning within limited memory budgets.

Result: Successfully prunes up to 20% parameters from state-of-the-art U-Net diffusion models (SDXL) and diffusion transformers (FLUX) with minimal perceptible performance degradation, without requiring model retraining. Also works on time step distilled diffusion models.

Conclusion: The proposed framework enables low-cost, retraining-free pruning of diffusion models, addressing computational and memory efficiency challenges while maintaining performance, making diffusion models more practical for deployment.

Abstract: Diffusion models have achieved impressive advancements in various vision tasks. However, these gains often rely on increasing model size, which escalates computational complexity and memory demands, complicating deployment, raising inference costs, and causing environmental impact. While some studies have explored pruning techniques to improve the memory efficiency of diffusion models, most existing methods require extensive retraining to retain the model performance. Retraining a modern large diffusion model is extremely costly and resource-intensive, which limits the practicality of these methods. In this work, we achieve low-cost diffusion pruning without retraining by proposing a model-agnostic structural pruning framework for diffusion models that learns a differentiable mask to sparsify the model. To ensure effective pruning that preserves the quality of the final denoised latent, we design a novel end-to-end pruning objective that spans the entire diffusion process. As end-to-end pruning is memory-intensive, we further propose time step gradient checkpointing, a technique that significantly reduces memory usage during optimization, enabling end-to-end pruning within a limited memory budget. Results on state-of-the-art U-Net diffusion models SDXL and diffusion transformers (FLUX) demonstrate that our method can effectively prune up to 20% parameters with minimal perceptible performance degradation, and notably, without the need for model retraining. We also showcase that our method can still prune on top of time step distilled diffusion models.

[294] Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models

Zehao Wang, Xinpeng Liu, Yudonglin Zhang, Xiaoqian Wu, Zhou Fang, Yifan Fang, Junfu Pu, Cewu Lu, Yong-Lu Li

Main category: cs.CV

TL;DR: First systematic investigation of verb hallucination in MLLMs, showing existing object-focused mitigation methods fail for verbs, and proposing a novel verb knowledge-based tuning approach that significantly reduces verb hallucinations.

Details

Motivation: While MLLMs show strong performance in various tasks, hallucination remains a major issue. Current mitigation methods focus only on object/noun-related hallucinations, overlooking verb concepts which are crucial for understanding human actions. There's a gap in addressing verb hallucination specifically.

Method: Proposes a novel rich verb knowledge-based tuning method to mitigate verb hallucination. This approach specifically targets verb concepts rather than applying object-focused mitigation techniques.

Result: 1) Most state-of-the-art MLLMs suffer from severe verb hallucination. 2) Existing object hallucination mitigation methods are ineffective for verb hallucination. 3) The proposed verb knowledge-based tuning method significantly reduces hallucinations related to verbs.

Conclusion: Verb hallucination is a significant but overlooked problem in MLLMs that requires specialized solutions. The proposed verb knowledge-based tuning approach effectively addresses this specific type of hallucination, advancing MLLM reliability for action understanding tasks.

Abstract: Multimodal Large Language Models (MLLMs) have garnered significant attention recently and demonstrate outstanding capabilities in various tasks such as OCR, VQA, captioning, $\textit{etc}$. However, hallucination remains a persistent issue. While numerous methods have been proposed to mitigate hallucinations, achieving notable improvements, these methods primarily focus on mitigating hallucinations about $\textbf{object/noun-related}$ concepts. Verb concepts, crucial for understanding human actions, have been largely overlooked. In this paper, to the best of our knowledge, we are the $\textbf{first}$ to investigate the $\textbf{verb hallucination}$ phenomenon of MLLMs from various perspectives. Our findings reveal that most state-of-the-art MLLMs suffer from severe verb hallucination. To assess the effectiveness of existing mitigation methods for object concept hallucination on verb hallucination, we evaluated these methods and found that they do not effectively address verb hallucination. To address this issue, we propose a novel rich verb knowledge-based tuning method to mitigate verb hallucination. The experiment results demonstrate that our method significantly reduces hallucinations related to verbs.

[295] Text to Blind Motion

Hee Jae Kim, Kathakoli Sengupta, Masaki Kuribayashi, Hernisa Kacorri, Eshed Ohn-Bar

Main category: cs.CV

TL;DR: BlindWays is the first multimodal motion benchmark for blind pedestrians, collecting 3D motion data and textual descriptions to address the bias in existing datasets toward sighted people and improve motion prediction for diverse populations.

Details

Motivation: Existing 3D human motion datasets lack diversity and are biased toward sighted people, failing to capture the distinct motion characteristics of blind pedestrians. This gap limits the ability of motion models in technologies like autonomous vehicles to predict blind pedestrian behavior accurately.

Method: Collected 3D motion data using wearable sensors from 11 blind participants navigating eight different routes in real-world urban settings. Provided rich textual descriptions capturing distinctive movement characteristics, interactions with navigation aids (white canes/guide dogs), and environmental interactions.

Result: State-of-the-art 3D human prediction models performed poorly on this novel task, both with off-the-shelf and pre-training-based methods, highlighting the need for specialized approaches.

Conclusion: BlindWays benchmark addresses a critical gap in motion modeling diversity and contributes toward safer, more reliable systems that can reason over diverse human movements. The dataset is publicly available to advance research in this area.

Abstract: People who are blind perceive the world differently than those who are sighted, which can result in distinct motion characteristics. For instance, when crossing at an intersection, blind individuals may have different patterns of movement, such as veering more from a straight path or using touch-based exploration around curbs and obstacles. These behaviors may appear less predictable to motion models embedded in technologies such as autonomous vehicles. Yet, the ability of 3D motion models to capture such behavior has not been previously studied, as existing datasets for 3D human motion currently lack diversity and are biased toward people who are sighted. In this work, we introduce BlindWays, the first multimodal motion benchmark for pedestrians who are blind. We collect 3D motion data using wearable sensors with 11 blind participants navigating eight different routes in a real-world urban setting. Additionally, we provide rich textual descriptions that capture the distinctive movement characteristics of blind pedestrians and their interactions with both the navigation aid (e.g., a white cane or a guide dog) and the environment. We benchmark state-of-the-art 3D human prediction models, finding poor performance with off-the-shelf and pre-training-based methods for our novel task. To contribute toward safer and more reliable systems that can seamlessly reason over diverse human movements in their environments, our text-and-motion benchmark is available at https://blindways.github.io.

[296] FaceShield: Defending Facial Image against Deepfake Threats

Jaehwan Jeong, Sumin In, Sieun Kim, Hannie Shin, Jongheon Jeong, Sang Ho Yoon, Jaewook Chung, Sangpil Kim

Main category: cs.CV

TL;DR: FaceShield is a proactive defense method against deepfakes that targets both diffusion models and GANs through facial feature manipulation, achieving state-of-the-art performance with improved imperceptibility and robustness.

Details

Motivation: Current deepfake defenses are mostly reactive detection methods, which are insufficient for crimes where authenticity is disregarded. Existing proactive defenses only work on specific GAN-based models and don't address newer diffusion-based models, creating a critical gap in protection.

Method: Three-component approach: 1) Manipulates attention mechanisms in Diffusion Models to exclude protected facial features during denoising, 2) Targets prominent facial feature extraction models to enhance adversarial perturbation robustness, 3) Uses Gaussian blur and low-pass filtering to improve imperceptibility and JPEG compression robustness.

Result: Achieves state-of-the-art performance against latest diffusion-based deepfake models on CelebA-HQ and VGGFace2-HQ datasets, shows transferability to GANs, and demonstrates greater imperceptibility of noise with enhanced robustness compared to existing methods.

Conclusion: FaceShield provides an effective proactive defense against modern deepfake threats by addressing both diffusion models and GANs through innovative facial feature manipulation techniques, offering improved protection where reactive detection methods fall short.

Abstract: The rising use of deepfakes in criminal activities presents a significant issue, inciting widespread controversy. While numerous studies have tackled this problem, most primarily focus on deepfake detection. These reactive solutions are insufficient as a fundamental approach for crimes where authenticity is disregarded. Existing proactive defenses also have limitations, as they are effective only for deepfake models based on specific Generative Adversarial Networks (GANs), making them less applicable in light of recent advancements in diffusion-based models. In this paper, we propose a proactive defense method named FaceShield, which introduces novel defense strategies targeting deepfakes generated by Diffusion Models (DMs) and facilitates defenses on various existing GAN-based deepfake models through facial feature extractor manipulations. Our approach consists of three main components: (i) manipulating the attention mechanism of DMs to exclude protected facial features during the denoising process, (ii) targeting prominent facial feature extraction models to enhance the robustness of our adversarial perturbation, and (iii) employing Gaussian blur and low-pass filtering techniques to improve imperceptibility while enhancing robustness against JPEG compression. Experimental results on the CelebA-HQ and VGGFace2-HQ datasets demonstrate that our method achieves state-of-the-art performance against the latest deepfake models based on DMs, while also exhibiting transferability to GANs and showcasing greater imperceptibility of noise along with enhanced robustness. Code is available here: https://github.com/kuai-lab/iccv25_faceshield

[297] From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision

Chuang Yu, Jinmiao Zhao, Yunpeng Liu, Sicheng Zhao, Yimian Dai, Xiangyu Yue

Main category: cs.CV

TL;DR: Progressive Active Learning (PAL) framework improves single-frame infrared small target detection with single point supervision by progressively training models on harder samples while maintaining stability.

Details

Motivation: Existing single point supervision methods for infrared small target detection suffer from instability, excessive label evolution, and difficulty in leveraging network performance. Inspired by biological adaptation and knowledge accumulation, the authors aim to create a more stable and effective learning framework.

Method: The PAL framework includes: 1) Model pre-start concept - automatically selects easy samples to give models basic task-specific capabilities; 2) Refined dual-update strategy - promotes reasonable learning of harder samples and continuous pseudo-label refinement; 3) Decay factor - balances target annotation expansion and contraction to prevent excessive label evolution.

Result: Extensive experiments show that existing SIRST detection networks equipped with PAL achieve state-of-the-art results on multiple public datasets. The framework also builds an efficient bridge between full supervision and single point supervision tasks.

Conclusion: The proposed Progressive Active Learning framework effectively addresses the limitations of previous single point supervision methods, providing stable, progressive learning that achieves superior performance while bridging the gap between different supervision paradigms.

Abstract: Recently, single-frame infrared small target (SIRST) detection with single point supervision has drawn wide-spread attention. However, the latest label evolution with single point supervision (LESPS) framework suffers from instability, excessive label evolution, and difficulty in exerting embedded network performance. Inspired by organisms gradually adapting to their environment and continuously accumulating knowledge, we construct an innovative Progressive Active Learning (PAL) framework, which drives the existing SIRST detection networks progressively and actively recognizes and learns harder samples. Specifically, to avoid the early low-performance model leading to the wrong selection of hard samples, we propose a model pre-start concept, which focuses on automatically selecting a portion of easy samples and helping the model have basic task-specific learning capabilities. Meanwhile, we propose a refined dual-update strategy, which can promote reasonable learning of harder samples and continuous refinement of pseudo-labels. In addition, to alleviate the risk of excessive label evolution, a decay factor is reasonably introduced, which helps to achieve a dynamic balance between the expansion and contraction of target annotations. Extensive experiments show that existing SIRST detection networks equipped with our PAL framework have achieved state-of-the-art (SOTA) results on multiple public datasets. Furthermore, our PAL framework can build an efficient and stable bridge between full supervision and single point supervision tasks. Our code is available at https://github.com/YuChuang1205/PAL

[298] GSRender: Deduplicated Occupancy Prediction via Weakly Supervised 3D Gaussian Splatting

Qianpu Sun, Changyong Shu, Sifan Zhou, Runxi Cheng, Yongxian Wei, Zichen Yu, Dawei Yang, Sirui Han, Yuan Chun

Main category: cs.CV

TL;DR: GSRender uses 3D Gaussian Splatting for weakly-supervised 3D occupancy perception, addressing sampling efficiency and duplicated prediction issues with a Ray Compensation module and redesigned dynamic loss.

Details

Motivation: Previous NeRF-based methods for weakly-supervised 3D occupancy perception face challenges in balancing sampling efficiency vs accuracy (5-10 point mIoU variations), and suffer from duplicated predictions even with surrounding-view inputs, which impacts practical deployment.

Method: Proposes GSRender using 3D Gaussian Splatting for weakly-supervised occupancy estimation, simplifying sampling. Introduces Ray Compensation module to reduce duplicated predictions by compensating features from adjacent frames. Redesigns dynamic loss to remove influence of dynamic objects from adjacent frames.

Result: Achieves state-of-the-art results in RayIoU (+6.0 improvement), narrowing the gap with 3D-supervised methods. Extensive experiments validate effectiveness.

Conclusion: GSRender lays a solid foundation for weakly-supervised occupancy perception, addressing key limitations of previous methods and improving practical applicability. Code is publicly available.

Abstract: Weakly-supervised 3D occupancy perception is crucial for vision-based autonomous driving in outdoor environments. Previous methods based on NeRF often face a challenge in balancing the number of samples used. Too many samples can decrease efficiency, while too few can compromise accuracy, leading to variations in the mean Intersection over Union (mIoU) by 5-10 points. Furthermore, even with surrounding-view image inputs, only a single image is rendered from each viewpoint at any given moment. This limitation leads to duplicated predictions, which significantly impacts the practicality of the approach. However, this issue has largely been overlooked in existing research. To address this, we propose GSRender, which uses 3D Gaussian Splatting for weakly-supervised occupancy estimation, simplifying the sampling process. Additionally, we introduce the Ray Compensation module, which reduces duplicated predictions by compensating for features from adjacent frames. Finally, we redesign the dynamic loss to remove the influence of dynamic objects from adjacent frames. Extensive experiments show that our approach achieves SOTA results in RayIoU (+6.0), while also narrowing the gap with 3D- supervised methods. This work lays a solid foundation for weakly-supervised occupancy perception. The code is available at https://github.com/Jasper-sudo-Sun/GSRender.

Yi Wang, Mushui Liu, Wanggui He, Hanyang Yuan, Longxiang Zhang, Ziwei Huang, Guanghao Zhang, Wenkai Fang, Haoze Jiang, Shengxuming Zhang, Dong She, Jinlong Liu, Weilong Dai, Mingli Song, Hao Jiang, Jie Song

Main category: cs.CV

TL;DR: FoX introduces a multimodal Chain of Thought (MCoT) approach with Functionality-oriented eXperts (FoXperts) architecture to address complex compositional image generation that direct text-to-image models struggle with.

Details

Motivation: Direct text-to-image generation fails to handle complex compositional instructions common in real-world scenarios. Existing improvements focus on basic generation capabilities but don't adequately solve the problem of complex image synthesis.

Method: Proposes FoX model with Functionality-oriented eXperts (FoXperts) architecture that assigns experts by function. Introduces Multimodal Chain of Thought (MCoT) approach emulating human artistic workflow (planning, acting, reflection, correction). Uses multi-task joint training to equip model with all MCoT capabilities without needing consistent multi-step data tuples.

Result: FoX consistently outperforms existing unified models on various text-to-image benchmarks, delivering notable improvements in complex image generation tasks.

Conclusion: Introducing Chain of Thought reasoning into unified generative models with the proposed FoXperts architecture and MCoT approach effectively addresses complex compositional image generation challenges that direct text-to-image methods cannot solve.

Abstract: Unified generative models have shown remarkable performance in text and image generation. For image synthesis tasks, they adopt straightforward text-to-image (T2I) generation. However, direct T2I generation limits the models in handling complex compositional instructions, which frequently occur in real-world scenarios. Although this issue is vital, existing works mainly focus on improving the basic image generation capability of the models. While such improvements help to some extent, they still fail to adequately resolve the problem. Inspired by Chain of Thought (CoT) solving complex problems step by step, this work aims to introduce CoT into unified generative models to address the challenges of complex image generation that direct T2I generation cannot effectively solve, thereby endowing models with enhanced image generation ability. To achieve this, we first propose Functionality-oriented eXperts (FoXperts), an expert-parallel architecture in our model FoX, which assigns experts by function. FoXperts disentangles potential conflicts in mainstream modality-oriented designs and provides a solid foundation for CoT. When introducing CoT, the first question is how to design it for complex image generation. To this end, we emulate a human-like artistic workflow–planning, acting, reflection, and correction–and propose the Multimodal Chain of Thought (MCoT) approach, as the data involves both text and image. To address the subsequent challenge of designing an effective MCoT training paradigm, we develop a multi-task joint training scheme that equips the model with all capabilities required for each MCoT step in a disentangled manner. This paradigm avoids the difficulty of collecting consistent multi-step data tuples. Extensive experiments show that FoX consistently outperforms existing unified models on various T2I benchmarks, delivering notable improvements in complex image generation.

[300] Binary Event-Driven Spiking Transformer

Honglin Cao, Zijian Zhou, Wenjie Wei, Ammar Belatreche, Yu Liang, Dehao Zhang, Malu Zhang, Yang Yang, Haizhou Li

Main category: cs.CV

TL;DR: BESTformer is a Binary Event-Driven Spiking Transformer that combines binarization with Transformer-based SNNs to create an ultra-efficient model for edge devices, using a Coupled Information Enhancement method to mitigate performance loss from binarization.

Details

Motivation: Transformer-based SNNs combine high performance with energy efficiency but have large model sizes and computational demands that limit their practicality in resource-constrained edge scenarios. Binarization can reduce storage and computation but causes severe performance degradation.

Method: Proposes BESTformer (Binary Event-Driven Spiking Transformer) with 1-bit weight and attention map representations. Introduces Coupled Information Enhancement (CIE) method consisting of: 1) a reversible framework, and 2) information enhancement distillation to maximize mutual information between binary and full-precision models.

Result: Extensive experiments on static and neuromorphic datasets show superior performance compared to other binary SNNs. The method effectively mitigates performance degradation from binarization while achieving significant storage and computational savings.

Conclusion: BESTformer with CIE method demonstrates potential as a compact yet high-performance model for resource-limited edge devices, successfully balancing efficiency and performance through binary representation and information enhancement techniques.

Abstract: Transformer-based Spiking Neural Networks (SNNs) introduce a novel event-driven self-attention paradigm that combines the high performance of Transformers with the energy efficiency of SNNs. However, the larger model size and increased computational demands of the Transformer structure limit their practicality in resource-constrained scenarios. In this paper, we integrate binarization techniques into Transformer-based SNNs and propose the Binary Event-Driven Spiking Transformer, i.e. BESTformer. The proposed BESTformer can significantly reduce storage and computational demands by representing weights and attention maps with a mere 1-bit. However, BESTformer suffers from a severe performance drop from its full-precision counterpart due to the limited representation capability of binarization. To address this issue, we propose a Coupled Information Enhancement (CIE) method, which consists of a reversible framework and information enhancement distillation. By maximizing the mutual information between the binary model and its full-precision counterpart, the CIE method effectively mitigates the performance degradation of the BESTformer. Extensive experiments on static and neuromorphic datasets demonstrate that our method achieves superior performance to other binary SNNs, showcasing its potential as a compact yet high-performance model for resource-limited edge devices. The repository of this paper is available at https://github.com/CaoHLin/BESTFormer.

[301] GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning

Zhun Mou, Bin Xia, Zhengchao Huang, Wenming Yang, Jiaya Jia

Main category: cs.CV

TL;DR: GRADEO is a video evaluation model that uses multi-step reasoning to grade AI-generated videos, outperforming existing metrics in aligning with human evaluations.

Details

Motivation: Existing automated video evaluation metrics lack high-level semantic understanding and reasoning capabilities, making them infeasible and unexplainable for assessing video generation quality.

Method: Created GRADEO-Instruct dataset with 3.3k videos from 10+ generation models and 16k human annotations, then developed GRADEO model that performs multi-step reasoning for explainable video assessment.

Result: GRADEO aligns better with human evaluations than existing methods, and benchmarking reveals current video generation models struggle with human reasoning and complex real-world scenarios.

Conclusion: GRADEO addresses the gap in video evaluation by providing explainable, reasoning-based assessments that better match human judgment, revealing limitations in current video generation capabilities.

Abstract: Recent great advances in video generation models have demonstrated their potential to produce high-quality videos, bringing challenges to effective evaluation. Unlike human evaluation, existing automated evaluation metrics lack highlevel semantic understanding and reasoning capabilities for video, thus making them infeasible and unexplainable. To fill this gap, we curate GRADEO-Instruct, a multi-dimensional T2V evaluation instruction tuning dataset, including 3.3k videos from over 10 existing video generation models and multi-step reasoning assessments converted by 16k human annotations. We then introduce GRADEO, one of the first specifically designed video evaluation models, which grades AI-generated videos for explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios.

[302] Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation

Lin Chen, Qi Yang, Kun Ding, Zhihao Li, Gang Shen, Fei Li, Qiyuan Cao, Shiming Xiang

Main category: cs.CV

TL;DR: ERR-Seg: An efficient open-vocabulary semantic segmentation method that reduces redundancy in cost maps and aggregation, achieving 5.6% performance improvement with 3.1× speedup.

Details

Motivation: Existing cost-based OVSS methods suffer from high computational costs and long inference latency due to redundant information in cost map construction and inefficient sequence modeling in cost aggregation.

Method: Proposes ERR-Seg with two key components: 1) Redundancy-Reduced Hierarchical Cost maps (RRHC) that customizes compact class vocabulary per image and integrates hierarchical cost maps, and 2) Redundancy-Reduced Cost Aggregation (RRCA) that performs both spatial-level and class-level sequence reduction before aggregation.

Result: Achieves 5.6% performance improvement on ADE20K-847 benchmark while achieving 3.1× speedup compared to previous state-of-the-art methods.

Conclusion: ERR-Seg creates a lightweight structure for OVSS with substantial memory and computational savings without compromising accuracy, effectively addressing redundancy issues in cost-based frameworks.

Abstract: Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel within an image to a specific class defined by arbitrary text descriptions. While large-scale vision-language models have shown remarkable open-vocabulary capabilities, their image-level pretraining limits effectiveness on pixel-wise dense prediction tasks like OVSS. Recent cost-based methods narrow this granularity gap by constructing pixel-text cost maps and refining them via cost aggregation mechanisms. Despite achieving promising performance, these approaches suffer from high computational costs and long inference latency. In this paper, we identify two major sources of redundancy in the cost-based OVSS framework: redundant information introduced during cost maps construction and inefficient sequence modeling in cost aggregation. To address these issues, we propose ERR-Seg, an efficient architecture that incorporates Redundancy-Reduced Hierarchical Cost maps (RRHC) and Redundancy-Reduced Cost Aggregation (RRCA). Specifically, RRHC reduces redundant class channels by customizing a compact class vocabulary for each image and integrates hierarchical cost maps to enrich semantic representation. RRCA alleviates computational burden by performing both spatial-level and class-level sequence reduction before aggregation. Overall, ERR-Seg results in a lightweight structure for OVSS, characterized by substantial memory and computational savings without compromising accuracy. Compared to previous state-of-the-art methods on the ADE20K-847 benchmark, ERR-Seg improves performance by $5.6%$ while achieving a 3.1$\times$ speedup.

[303] Keep It Light! Simplifying Image Clustering Via Text-Free Adapters

Yicen Li, Haitz Sáez de Ocáriz Borde, Anastasis Kratsios, Paul D. McNicholas

Main category: cs.CV

TL;DR: SCP achieves competitive clustering performance using only pre-trained vision models and simple cluster heads, eliminating the need for complex multi-modal pipelines with LLMs or text encoders.

Details

Motivation: Current competitive clustering pipelines are multi-modal, requiring LLMs/text encoders and text-image pairs that are often unavailable in real-world applications. These frameworks are complex to train and computationally expensive, limiting widespread adoption.

Method: Simple Clustering via Pre-trained models (SCP) uses only pre-trained vision model feature representations and trains a small cluster head while leveraging positive data pairs. The approach is text-free and highly simplified.

Result: SCP achieves highly competitive performance on benchmark datasets (CIFAR-10, CIFAR-20, CIFAR-100, STL-10, ImageNet-10, ImageNet-Dogs) compared to more complex state-of-the-art methods.

Conclusion: Strong clustering performance can be achieved with simplified, text-free pipelines using pre-trained vision models, making clustering more accessible for real-world applications without complex multi-modal requirements.

Abstract: In the era of pre-trained models, effective classification can often be achieved using simple linear probing or lightweight readout layers. In contrast, many competitive clustering pipelines have a multi-modal design, leveraging large language models (LLMs) or other text encoders, and text-image pairs, which are often unavailable in real-world downstream applications. Additionally, such frameworks are generally complicated to train and require substantial computational resources, making widespread adoption challenging. In this work, we show that in deep clustering, competitive performance with more complex state-of-the-art methods can be achieved using a text-free and highly simplified training pipeline. In particular, our approach, Simple Clustering via Pre-trained models (SCP), trains only a small cluster head while leveraging pre-trained vision model feature representations and positive data pairs. Experiments on benchmark datasets, including CIFAR-10, CIFAR-20, CIFAR-100, STL-10, ImageNet-10, and ImageNet-Dogs, demonstrate that SCP achieves highly competitive performance. Furthermore, we provide a theoretical result explaining why, at least under ideal conditions, additional text-based embeddings may not be necessary to achieve strong clustering performance in vision.

[304] SNN-Driven Multimodal Human Action Recognition via Sparse Spatial-Temporal Data Fusion

Naichuan Zheng, Hailun Xia, Zeyu Liang, Yuchen Du

Main category: cs.CV

TL;DR: Proposes a novel Spiking Neural Network framework for multimodal human action recognition using event camera and skeleton data, achieving superior accuracy and energy efficiency.

Details

Motivation: Existing multimodal action recognition methods using RGB and skeleton data suffer from high computational complexity, memory consumption, and energy demands with ANNs, limiting their use in resource-constrained scenarios.

Method: Two key innovations: 1) Multimodal SNN architecture with SNN-based Mamba for event camera data and Spiking Graph Convolutional Network for skeleton data, plus spiking semantic extraction module; 2) SNN-based discretized information bottleneck mechanism for modality fusion that balances semantic preservation with compression. Also proposes novel multimodal dataset construction method.

Result: Extensive experiments demonstrate superior performance in both recognition accuracy and energy efficiency compared to existing methods.

Conclusion: The proposed SNN-driven framework offers a promising solution for practical multimodal human action recognition applications, particularly in resource-constrained environments.

Abstract: Multimodal human action recognition based on RGB and skeleton data fusion, while effective, is constrained by significant limitations such as high computational complexity, excessive memory consumption, and substantial energy demands, particularly when implemented with Artificial Neural Networks (ANN). These limitations restrict its applicability in resource-constrained scenarios. To address these challenges, we propose a novel Spiking Neural Network (SNN)-driven framework for multimodal human action recognition, utilizing event camera and skeleton data. Our framework is centered on two key innovations: (1) a novel multimodal SNN architecture that employs distinct backbone networks for each modality-an SNN-based Mamba for event camera data and a Spiking Graph Convolutional Network (SGN) for skeleton data-combined with a spiking semantic extraction module to capture deep semantic representations; and (2) a pioneering SNN-based discretized information bottleneck mechanism for modality fusion, which effectively balances the preservation of modality-specific semantics with efficient information compression. To validate our approach, we propose a novel method for constructing a multimodal dataset that integrates event camera and skeleton data, enabling comprehensive evaluation. Extensive experiments demonstrate that our method achieves superior performance in both recognition accuracy and energy efficiency, offering a promising solution for practical applications.

[305] Comp-Attn: Present-and-Align Attention for Compositional Video Generation

Hongyu Zhang, Yufan Deng, Shenghai Yuan, Yian Zhao, Peng Jin, Xuehan Hou, Chang Liu, Jie Chen

Main category: cs.CV

TL;DR: Comp-Attn: A training-free composition-aware cross-attention method for text-to-video generation that addresses subject presence and inter-subject relational alignment through Present-and-Align paradigm.

Details

Motivation: Current T2V generation struggles with reliably synthesizing compositional content involving multiple subjects with intricate relations. Two main challenges: 1) Subject presence (not all subjects appear in video), 2) Inter-subject relations (misaligned interactions and spatial relationships). Existing methods fail to address both issues simultaneously.

Method: Proposes Comp-Attn, a composition-aware cross-attention variant following Present-and-Align paradigm: 1) Subject-aware Condition Interpolation (SCI) reinforces subject-specific conditions to ensure each subject’s presence; 2) Layout-forcing Attention Modulation (LAM) dynamically enforces attention distribution to align with relational layout of multiple subjects. Training-free and can be integrated into various T2V baselines.

Result: Boosts T2V-CompBench scores by 15.7% and 11.7% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only 5% increase in inference time. Also achieves strong performance on VBench and T2I-CompBench, demonstrating scalability in general video generation and compositional text-to-image tasks.

Conclusion: Comp-Attn effectively addresses compositional challenges in T2V generation through a novel attention mechanism that ensures subject presence and relational alignment, offering significant performance improvements with minimal computational overhead and broad applicability across video and image generation tasks.

Abstract: In the domain of text-to-video (T2V) generation, reliably synthesizing compositional content involving multiple subjects with intricate relations is still underexplored. The main challenges are twofold: 1) Subject presence, where not all subjects can be presented in the video; 2) Inter-subject relations, where the interaction and spatial relationship between subjects are misaligned. Existing methods adopt techniques, such as inference-time latent optimization or layout control, which fail to address both issues simultaneously. To tackle these problems, we propose Comp-Attn, a composition-aware cross-attention variant that follows a Present-and-Align paradigm: it decouples the two challenges by enforcing subject presence at the condition level and achieving relational alignment at the attention-distribution level. Specifically, 1) We introduce Subject-aware Condition Interpolation (SCI) to reinforce subject-specific conditions and ensure each subject’s presence; 2) We propose Layout-forcing Attention Modulation (LAM), which dynamically enforces the attention distribution to align with the relational layout of multiple subjects. Comp-Attn can be seamlessly integrated into various T2V baselines in a training-free manner, boosting T2V-CompBench scores by 15.7% and 11.7% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 5% increase in inference time. Meanwhile, it also achieves strong performance on VBench and T2I-CompBench, demonstrating its scalability in general video generation and compositional text-to-image (T2I) tasks.

[306] TRACE: Your Diffusion Model is Secretly an Instance Edge Detector

Sanghyun Jo, Ziseok Lee, Wooyeol Lee, Jonghyun Choi, Jaesik Park, Kyungsu Kim

Main category: cs.CV

TL;DR: TRACE shows that text-to-image diffusion models can be used as instance edge annotators without needing instance-level labels, achieving significant improvements in unsupervised instance segmentation and tag-supervised panoptic segmentation.

Details

Motivation: Traditional instance and panoptic segmentation methods rely on costly dense instance-level annotations (masks, boxes, points). Unsupervised and weakly-supervised approaches still face limitations like semantic backbone constraints and human bias, often producing poor quality outputs. There's a need for scalable alternatives to manual annotation.

Method: TRACE identifies Instance Emergence Point (IEP) where object boundaries first appear in diffusion model’s self-attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder. This eliminates per-image diffusion inversion, making inference 81x faster.

Result: On COCO benchmark: improves unsupervised instance segmentation by +5.1 AP, outperforms point-supervised baselines in tag-supervised panoptic segmentation by +1.7 PQ without using any instance-level labels. Produces sharper, more connected boundaries with much faster inference.

Conclusion: Diffusion models encode hidden instance boundary priors that can be decoded to provide a practical and scalable alternative to costly manual annotation. TRACE demonstrates that text-to-image diffusion models secretly function as instance edge annotators.

Abstract: High-quality instance and panoptic segmentation has traditionally relied on dense instance-level annotations such as masks, boxes, or points, which are costly, inconsistent, and difficult to scale. Unsupervised and weakly-supervised approaches reduce this burden but remain constrained by semantic backbone constraints and human bias, often producing merged or fragmented outputs. We present TRACE (TRAnsforming diffusion Cues to instance Edges), showing that text-to-image diffusion models secretly function as instance edge annotators. TRACE identifies the Instance Emergence Point (IEP) where object boundaries first appear in self-attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder. This design removes the need for per-image diffusion inversion, achieving 81x faster inference while producing sharper and more connected boundaries. On the COCO benchmark, TRACE improves unsupervised instance segmentation by +5.1 AP, and in tag-supervised panoptic segmentation it outperforms point-supervised baselines by +1.7 PQ without using any instance-level labels. These results reveal that diffusion models encode hidden instance boundary priors, and that decoding these signals offers a practical and scalable alternative to costly manual annotation. Code is available at https://github.com/shjo-april/DiffEGG.

[307] LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, Yali Wang

Main category: cs.CV

TL;DR: LVAgent introduces a multi-agent collaboration framework for long video understanding, outperforming existing models through dynamic team formation and iterative refinement.

Details

Motivation: Existing MLLMs struggle with temporal context in long videos, and even tool-assisted single MLLMs provide only partial understanding with limited performance.

Method: Four-step framework: 1) Selection of optimal agent teams, 2) Perception with efficient video retrieval, 3) Action where agents answer and exchange reasons, 4) Reflection with performance evaluation and team optimization through multi-round dynamic collaboration.

Result: LVAgent outperforms all closed-source (GPT-4o) and open-source (InternVL-2.5, Qwen2-VL) models, achieving 80% accuracy on four mainstream long video tasks and 13.3% improvement on LongVideoBench.

Conclusion: Multi-agent dynamic collaboration effectively addresses long video understanding challenges, establishing LVAgent as a state-of-the-art framework with superior performance.

Abstract: Existing MLLMs encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream Agent-based methods use external tools to assist a single MLLM in answering long video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understanding of long videos, resulting in limited performance. In order to better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our method consists of four key steps: 1) Selection: We pre-select appropriate agents from the model library to form optimal agent teams based on different tasks. 2) Perception: We design an effective retrieval scheme for long videos to improve the coverage of critical temporal segments while maintaining computational efficiency. 3) Action: Agents answer long video questions and exchange reasons. 4) Reflection: We evaluate each agent’s performance in each round of discussion and optimize the agent team for dynamic collaboration. The agents iteratively refine their answers by multi-round dynamical collaboration of MLLM agents. LVAgent is the first agent system method that outperforms all closed-source models (like GPT-4o) and open-source models (like InternVL-2.5 and Qwen2-VL) in the long video understanding tasks. Our LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks. Notably, LVAgent improves accuracy by 13.3% on LongVideoBench. Code is available at https://github.com/64327069/LVAgent.

[308] HandSCS: Structural Coordinate Space for Animatable Hand Gaussian Splatting

Yilan Dong, Wenqing Wang, Qing Wang, Jiahao Yang, Haohe Liu, Xiatuan Zhu, Gregory Slabaugh, Shanxin Yuan

Main category: cs.CV

TL;DR: HandSCS: A structure-guided 3D Gaussian Splatting framework for high-fidelity hand avatar animation from multi-view images, using explicit structural modeling to handle complex articulations and maintain consistency across poses.

Details

Motivation: Existing approaches condition all Gaussians on the same global pose parameters, which is inadequate for highly articulated hands. This leads to difficulties in modeling complex articulations and maintaining structural consistency across poses in real time.

Method: HandSCS equips each Gaussian with explicit structural guidance from intra-pose and inter-pose perspectives. For intra-pose guidance, introduces Structural Coordinate Space (SCS) with hybrid static-dynamic coordinate basis and angular-radial descriptors to bridge sparse bones and dense Gaussians. For cross-pose coherence, introduces Inter-pose Consistency Loss to promote consistent Gaussian attributes under similar articulations.

Result: Achieves high-fidelity results with consistent fine details, even in challenging high-deformation and self-contact regions. Experiments on InterHand2.6M dataset demonstrate state-of-the-art performance in hand avatar animation.

Conclusion: HandSCS confirms the effectiveness of explicit structural modeling for hand avatar animation, overcoming limitations of global pose conditioning and enabling consistent, high-quality results across complex hand articulations.

Abstract: Creating animatable hand avatars from multi-view images requires modeling complex articulations and maintaining structural consistency across poses in real time. We present HandSCS, a structure-guided 3D Gaussian Splatting framework for high-fidelity hand animation. Unlike existing approaches that condition all Gaussians on the same global pose parameters, which are inadequate for highly articulated hands, HandSCS equips each Gaussian with explicit structural guidance from both intra-pose and inter-pose perspectives. To establish intra-pose structural guidance, we introduce a Structural Coordinate Space (SCS), which bridges the gap between sparse bones and dense Gaussians through hybrid static-dynamic coordinate basis and angular-radial descriptors. To improve cross-pose coherence, we further introduce an Inter-pose Consistency Loss that promotes consistent Gaussian attributes under similar articulations. Together, these components achieve high-fidelity results with consistent fine details, even in challenging high-deformation and self-contact regions. Experiments on the InterHand2.6M dataset demonstrate that HandSCS achieves state-of-the-art performance in hand avatar animation, confirming the effectiveness of explicit structural modeling.

[309] UCS: A Universal Model for Curvilinear Structure Segmentation

Kai Zhu, Li Chen, Dianshuo Li, Yunxiang Cao, Jun Cheng

Main category: cs.CV

TL;DR: UCS adapts SAM for universal curvilinear structure segmentation with sparse adapters, FFT-based prompts, and dual-compression decoder, achieving SOTA cross-domain generalization.

Details

Motivation: Existing CSS methods lack cross-domain generalization, while SAM has strong generalization but isn't optimized for curvilinear structures. Need a specialized model that combines SAM's generalization with CSS-specific capabilities.

Method: UCS adapts SAM with: 1) Sparse Adapter to inherit SAM’s generalization with minimal fine-tuning parameters, 2) Prompt Generation using FFT with high-pass filter for curve-specific prompts, 3) Mask decoder with dual-compression (Hierarchical Feature Compression and Guidance Feature Compression) to eliminate manual interaction.

Result: Achieves state-of-the-art generalization and open-set segmentation performance across medical, engineering, natural, and plant imagery on comprehensive multi-domain datasets including eight natural curvilinear structures.

Conclusion: UCS establishes a new benchmark for universal curvilinear structure segmentation, bridging the gap between domain-specific CSS methods and general-purpose segmentation models like SAM.

Abstract: Curvilinear structure segmentation (CSS) is essential in various domains, including medical imaging, landscape analysis, industrial surface inspection, and plant analysis. While existing methods achieve high performance within specific domains, their generalizability is limited. On the other hand, large-scale models such as Segment Anything Model (SAM) exhibit strong generalization but are not optimized for curvilinear structures. Existing adaptations of SAM primarily focus on general object segmentation and lack specialized design for CSS tasks. To bridge this gap, we propose the Universal Curvilinear structure Segmentation (UCS) model, which adapts SAM to CSS tasks while further enhancing its cross-domain generalization. UCS features a novel encoder architecture integrating a pretrained SAM encoder with two innovations: a Sparse Adapter, strategically inserted to inherit the pre-trained SAM encoder’s generalization capability while minimizing the number of fine-tuning parameters, and a Prompt Generation module, which leverages Fast Fourier Transform with a high-pass filter to generate curve-specific prompts. Furthermore, the UCS incorporates a mask decoder that eliminates reliance on manual interaction through a dual-compression module: a Hierarchical Feature Compression module, which aggregates the outputs of the sampled encoder to enhance detail preservation, and a Guidance Feature Compression module, which extracts and compresses image-driven guidance features. Evaluated on a comprehensive multi-domain dataset, including an in-house dataset covering eight natural curvilinear structures, UCS demonstrates state-of-the-art generalization and open-set segmentation performance across medical, engineering, natural, and plant imagery, establishing a new benchmark for universal CSS. The source code is available at https://github.com/kylechuuuuu/UCS.

[310] A Survey of 3D Reconstruction with Event Cameras

Chuanzhi Xu, Haoxian Zhou, Langyi Chen, Haodong Chen, Zeke Zexi Hu, Zhicheng Lu, Ying Zhou, Vera Chung, Qiang Qu, Weidong Cai

Main category: cs.CV

TL;DR: First comprehensive survey on event-based 3D reconstruction, covering stereo, monocular, and multimodal approaches with geometry-based, deep learning, and neural rendering methods, plus datasets and future challenges.

Details

Motivation: Event cameras offer unique advantages for 3D reconstruction with sparse but temporally dense data streams, enabling robust performance in challenging conditions like high-speed motion, low light, and extreme dynamic range. They have transformative potential for autonomous driving, robotics, aerial navigation, and VR.

Method: Systematic categorization of existing approaches by input modality (stereo, monocular, multimodal) and reconstruction methodology (geometry-based, deep learning, neural rendering including NeRF and 3DGS). Chronological organization within categories to show evolution. Comprehensive review of publicly available datasets for event-based reconstruction tasks.

Result: First comprehensive survey exclusively dedicated to event-based 3D reconstruction, providing essential reference and roadmap for the field. Detailed analysis of current approaches, datasets, and identification of key challenges.

Conclusion: Survey serves as essential reference and provides clear roadmap for advancing event-driven 3D reconstruction. Identifies significant open challenges in dataset availability, standardized evaluation, effective representation, and dynamic scene reconstruction, outlining insightful directions for future research.

Abstract: Event cameras are rapidly emerging as powerful vision sensors for 3D reconstruction, uniquely capable of asynchronously capturing per-pixel brightness changes. Compared to traditional frame-based cameras, event cameras produce sparse yet temporally dense data streams, enabling robust and accurate 3D reconstruction even under challenging conditions such as high-speed motion, low illumination, and extreme dynamic range scenarios. These capabilities offer substantial promise for transformative applications across various fields, including autonomous driving, robotics, aerial navigation, and immersive virtual reality. In this survey, we present the first comprehensive review exclusively dedicated to event-based 3D reconstruction. Existing approaches are systematically categorised based on input modality into stereo, monocular, and multimodal systems, and further classified according to reconstruction methodologies, including geometry-based techniques, deep learning approaches, and neural rendering techniques such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Within each category, methods are chronologically organised to highlight the evolution of key concepts and advancements. Furthermore, we provide a detailed summary of publicly available datasets specifically suited to event-based reconstruction tasks. Finally, we discuss significant open challenges in dataset availability, standardised evaluation, effective representation, and dynamic scene reconstruction, outlining insightful directions for future research. This survey aims to serve as an essential reference and provides a clear and motivating roadmap toward advancing the state of the art in event-driven 3D reconstruction.

[311] Gradient as Conditions: Rethinking HOG for All-in-one Image Restoration

Jiawei Wu, Zhifei Yang, Zhe Wang, Zhi Jin

Main category: cs.CV

TL;DR: HOGformer: A Transformer-based model using Histogram of Oriented Gradients (HOG) as interpretable prior for All-in-One Image Restoration, achieving SOTA performance across diverse degradations.

Details

Motivation: Existing All-in-One Image Restoration methods rely on implicitly learned priors that may entangle feature representations and struggle with complex/unseen scenarios. HOG features show strong discriminative capability across diverse degradations, making them a powerful interpretable prior for degradation-aware restoration.

Method: Proposes HOGformer with three key components: 1) Dynamic HOG-aware Self-Attention (DHOGSA) that models long-range spatial dependencies conditioned on HOG-encoded degradation cues, 2) Dynamic Interaction Feed-Forward (DIFF) module for channel-spatial interactions to handle degradation heterogeneity, and 3) HOG loss to explicitly enhance structural fidelity and edge sharpness.

Result: Extensive experiments on various benchmarks (adverse weather and natural degradations) demonstrate state-of-the-art performance. The model generalizes well to complex real-world scenarios.

Conclusion: HOG features serve as effective interpretable priors for All-in-One Image Restoration. HOGformer successfully integrates learnable HOG features with transformer architecture to achieve superior performance and generalization across diverse degradation types.

Abstract: All-in-one image restoration (AIR) aims to address diverse degradations within a unified model by leveraging informative degradation conditions to guide the restoration process. However, existing methods often rely on implicitly learned priors, which may entangle feature representations and hinder performance in complex or unseen scenarios. Histogram of Oriented Gradients (HOG) as a classical gradient representation, we observe that it has strong discriminative capability across diverse degradations, making it a powerful and interpretable prior for AIR. Based on this insight, we propose HOGformer, a Transformer-based model that integrates learnable HOG features for degradation-aware restoration. The core of HOGformer is a Dynamic HOG-aware Self-Attention (DHOGSA) mechanism, which adaptively models long-range spatial dependencies conditioned on degradation-specific cues encoded by HOG descriptors. To further adapt the heterogeneity of degradations in AIR, we propose a Dynamic Interaction Feed-Forward (DIFF) module that facilitates channel-spatial interactions, enabling robust feature transformation under diverse degradations. Besides, we propose a HOG loss to explicitly enhance structural fidelity and edge sharpness. Extensive experiments on a variety of benchmarks, including adverse weather and natural degradations, demonstrate that HOGformer achieves state-of-the-art performance and generalizes well to complex real-world scenarios.Code is available at https://github.com/Fire-friend/HOGformer.

[312] LightFormer: A lightweight and efficient decoder for remote sensing image segmentation

Sihang Chen, Lijun Yun, Ze Liu, JianFeng Zhu, Jie Chen, Hui Wang, Yueping Nie

Main category: cs.CV

TL;DR: LightFormer is a lightweight decoder for real-time semantic segmentation of remote sensing images that achieves near-state-of-the-art accuracy with dramatically reduced computational complexity, making it suitable for edge deployment in time-critical applications.

Details

Motivation: Deep learning models for remote sensing semantic segmentation have high accuracy but face deployment challenges on edge platforms due to decoder complexity, especially for time-critical applications like disaster assessment, UAV search-and-rescue, and cultural heritage monitoring.

Method: LightFormer uses a feature-fusion and refinement module with channel processing and learnable gating to efficiently aggregate multi-scale information. It also includes a spatial information selection module (SISM) that combines long-range attention with detail preservation to capture spatial dependencies across scales for unstructured target recognition.

Result: On ISPRS Vaihingen benchmark, LightFormer achieves 99.9% of GLFFNet’s mIoU (83.9% vs 84.0%) with only 14.7% of FLOPs and 15.9% of parameters. Consistent strong performance on LoveDA, ISPRS Potsdam, RescueNet, and FloodNet demonstrates robustness and superior perception of unstructured objects.

Conclusion: LightFormer provides an excellent accuracy-efficiency trade-off, making it a practical solution for remote sensing applications requiring both computational economy and high-precision segmentation, particularly for edge deployment in time-critical scenarios.

Abstract: Deep learning techniques have achieved remarkable success in the semantic segmentation of remote sensing images and in land-use change detection. Nevertheless, their real-time deployment on edge platforms remains constrained by decoder complexity. Herein, we introduce LightFormer, a lightweight decoder for time-critical tasks that involve unstructured targets, such as disaster assessment, unmanned aerial vehicle search-and-rescue, and cultural heritage monitoring. LightFormer employs a feature-fusion and refinement module built on channel processing and a learnable gating mechanism to aggregate multi-scale, multi-range information efficiently, which drastically curtails model complexity. Furthermore, we propose a spatial information selection module (SISM) that integrates long-range attention with a detail preservation branch to capture spatial dependencies across multiple scales, thereby substantially improving the recognition of unstructured targets in complex scenes. On the ISPRS Vaihingen benchmark, LightFormer attains 99.9% of GLFFNet’s mIoU (83.9% vs. 84.0%) while requiring only 14.7% of its FLOPs and 15.9% of its parameters, thus achieving an excellent accuracy-efficiency trade-off. Consistent results on LoveDA, ISPRS Potsdam, RescueNet, and FloodNet further demonstrate its robustness and superior perception of unstructured objects. These findings highlight LightFormer as a practical solution for remote sensing applications where both computational economy and high-precision segmentation are imperative.

[313] Zooming In on Fakes: A Novel Dataset for Localized AI-Generated Image Detection with Forgery Amplification Approach

Lvpan Cai, Haowei Wang, Jiayi Ji, Yanshu Zhoumen, Shen Chen, Taiping Yao, Xiaoshuai Sun

Main category: cs.CV

TL;DR: BR-Gen is a large-scale dataset for localized AI-generated image forgeries with scene-aware annotations, and NFA-ViT is a noise-guided transformer that amplifies forgery features for better detection.

Details

Motivation: Existing localized AIGC detection datasets focus on object-level forgeries but overlook broader scene edits (sky, ground). Current methods lack comprehensive coverage of realistic scene manipulations.

Method: 1) BR-Gen dataset: 150K locally forged images with semantic calibration, built via automated “Perception-Creation-Evaluation” pipeline. 2) NFA-ViT: Noise-guided Forgery Amplification Vision Transformer that mines heterogeneous regions via noise fingerprints and uses attention to propagate forgery traces throughout the image.

Result: BR-Gen creates new scenarios not covered by existing methods. NFA-ViT outperforms existing methods on BR-Gen and generalizes well across current benchmarks.

Conclusion: The paper addresses the gap in scene-level localized forgery detection with a comprehensive dataset and novel transformer architecture that enhances detection by amplifying subtle forgery features across entire images.

Abstract: The rise of AI-generated image tools has made localized forgeries increasingly realistic, posing challenges for visual content integrity. Although recent efforts have explored localized AIGC detection, existing datasets predominantly focus on object-level forgeries while overlooking broader scene edits in regions such as sky or ground. To address these limitations, we introduce \textbf{BR-Gen}, a large-scale dataset of 150,000 locally forged images with diverse scene-aware annotations, which are based on semantic calibration to ensure high-quality samples. BR-Gen is constructed through a fully automated ``Perception-Creation-Evaluation’’ pipeline to ensure semantic coherence and visual realism. In addition, we further propose \textbf{NFA-ViT}, a Noise-guided Forgery Amplification Vision Transformer that enhances the detection of localized forgeries by amplifying subtle forgery-related features across the entire image. NFA-ViT mines heterogeneous regions in images, \emph{i.e.}, potential edited areas, by noise fingerprints. Subsequently, attention mechanism is introduced to compel the interaction between normal and abnormal features, thereby propagating the traces throughout the entire image, allowing subtle forgeries to influence a broader context and improving overall detection robustness. Extensive experiments demonstrate that BR-Gen constructs entirely new scenarios that are not covered by existing methods. Take a step further, NFA-ViT outperforms existing methods on BR-Gen and generalizes well across current benchmarks.

[314] Bridging Geometry-Coherent Text-to-3D Generation with Multi-View Diffusion Priors and Gaussian Splatting

Feng Yang, Wenliang Qian, Wangmeng Zuo, Hui Li

Main category: cs.CV

TL;DR: CSD improves 3D generation by coupling multi-view priors to address SDS’s geometric inconsistencies, enabling direct 3D-GS optimization and mesh refinement.

Details

Motivation: Score Distillation Sampling (SDS) has limitations in 3D generation - it neglects multi-view correlations, leading to geometric inconsistencies and multi-face artifacts in generated 3D content.

Method: Proposes Coupled Score Distillation (CSD) that couples multi-view joint distribution priors, reformulates optimization as multi-view joint problem, directly optimizes 3D Gaussian Splatting with random initialization, and uses deformable tetrahedral grid for mesh refinement.

Result: Quantitative and qualitative experiments demonstrate efficient and competitive quality 3D generation with improved geometric consistency.

Conclusion: CSD framework effectively addresses SDS limitations by incorporating multi-view correlations, enabling stable 3D-GS optimization and producing high-quality, geometrically consistent 3D assets.

Abstract: Score Distillation Sampling (SDS) leverages pretrained 2D diffusion models to advance text-to-3D generation but neglects multi-view correlations, being prone to geometric inconsistencies and multi-face artifacts in the generated 3D content. In this work, we propose Coupled Score Distillation (CSD), a framework that couples multi-view joint distribution priors to ensure geometrically consistent 3D generation while enabling the stable and direct optimization of 3D Gaussian Splatting. Specifically, by reformulating the optimization as a multi-view joint optimization problem, we derive an effective optimization rule that effectively couples multi-view priors to guide optimization across different viewpoints while preserving the diversity of generated 3D assets. Additionally, we propose a framework that directly optimizes 3D Gaussian Splatting (3D-GS) with random initialization to generate geometrically consistent 3D content. We further employ a deformable tetrahedral grid, initialized from 3D-GS and refined through CSD, to produce high-quality, refined meshes. Quantitative and qualitative experimental results demonstrate the efficiency and competitive quality of our approach.

[315] DDAE++: Enhancing Diffusion Models Towards Unified Generative and Discriminative Learning

Weilai Xiang, Hongyu Yang, Di Huang, Yunhong Wang

Main category: cs.CV

TL;DR: Self-conditioning improves diffusion models by creating a semantic bottleneck layer that enhances both generative quality and discriminative representation learning.

Details

Motivation: Current diffusion model architectures have suboptimal semantic flow where high-level semantic features are underutilized and diluted during decoding, preventing the formation of an explicit semantic bottleneck layer that could unify generative and discriminative learning.

Method: Introduces self-conditioning, a lightweight mechanism that reshapes layer-wise semantic hierarchy by aggregating and rerouting intermediate features to guide subsequent decoding layers, concentrating high-level semantics and strengthening global generative guidance.

Result: Self-conditioning improves both pixel-space UNet, UViT and latent-space DiT models with minimal overhead. Enhanced models, especially self-conditioned DiT, become powerful dual learners that yield strong transferable representations on image and dense classification tasks, surpassing various generative self-supervised models in linear probing while maintaining or improving generation quality.

Conclusion: Self-conditioning creates an architectural semantic bridge that propagates discriminative improvements into generation, enabling diffusion models to serve as unified generative and discriminative learners with improved representation quality and generation capabilities.

Abstract: While diffusion models excel at image synthesis, useful representations have been shown to emerge from generative pre-training, suggesting a path towards unified generative and discriminative learning. However, suboptimal semantic flow within current architectures can hinder this potential: features encoding the richest high-level semantics are underutilized and diluted when propagating through decoding layers, impeding the formation of an explicit semantic bottleneck layer. To address this, we introduce self-conditioning, a lightweight mechanism that reshapes the model’s layer-wise semantic hierarchy without external guidance. By aggregating and rerouting intermediate features to guide subsequent decoding layers, our method concentrates more high-level semantics, concurrently strengthening global generative guidance and forming more discriminative representations. This simple approach yields a dual-improvement trend across pixel-space UNet, UViT and latent-space DiT models with minimal overhead. Crucially, it creates an architectural semantic bridge that propagates discriminative improvements into generation and accommodates further techniques such as contrastive self-distillation. Experiments show that our enhanced models, especially self-conditioned DiT, are powerful dual learners that yield strong and transferable representations on image and dense classification tasks, surpassing various generative self-supervised models in linear probing while also improving or maintaining high generation quality.

[316] CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

Shilin Yan, Jiaming Han, Joey Tsai, Hongwei Xue, Rongyao Fang, Lingyi Hong, Ziyu Guo, Ray Zhang

Main category: cs.CV

TL;DR: CrossLMM decouples long video sequences from LMMs using dual cross-attention mechanisms to reduce visual tokens and computational costs while maintaining performance.

Details

Motivation: As LMMs process longer video sequences, token counts increase dramatically, leading to quadratic computational costs. Efficient compression of video tokens while maintaining performance is a critical challenge.

Method: 1) Pooling method reduces tokens from pretrained visual encoders. 2) Visual-to-visual cross-attention uses pooled tokens as queries against original tokens for efficient utilization. 3) Text-to-visual cross-attention enhances text tokens through interaction with original visual tokens.

Result: Achieves comparable or superior performance across diverse video-based LMM benchmarks while using substantially fewer computational resources.

Conclusion: CrossLMM effectively addresses the computational efficiency challenge in processing long video sequences for LMMs through innovative dual cross-attention mechanisms.

Abstract: The advent of Large Multimodal Models (LMMs) has significantly enhanced Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratically computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, decoupling long video sequences from LMMs via a dual cross-attention mechanism, which substantially reduces visual token quantity with minimal performance degradation. Specifically, we first implement a significant token reduction from pretrained visual encoders through a pooling methodology. Then, within LLM layers, we employ a visual-to-visual cross-attention mechanism, wherein the pooled visual tokens function as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, for which the text tokens are enhanced through interaction with the original visual tokens, enriching the visual comprehension of the text tokens. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks, despite utilizing substantially fewer computational resources.

[317] OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model

Zhenhao Zhang, Ye Shi, Lingxiao Yang, Suting Ni, Qi Ye, Jingya Wang

Main category: cs.CV

TL;DR: OpenHOI is a framework for open-world 3D hand-object interaction synthesis that generates long-horizon manipulation sequences for novel objects using free-form language instructions.

Details

Motivation: Existing methods for 3D hand-object interaction synthesis struggle with generalization, performing well only on closed-set objects and predefined tasks, but failing with unseen objects or open-vocabulary instructions.

Method: Integrates a 3D Multimodal Large Language Model fine-tuned for joint affordance grounding and semantic task decomposition, plus an affordance-driven diffusion model with training-free physics refinement to minimize penetration and optimize affordance alignment.

Result: OpenHOI demonstrates superiority over state-of-the-art methods in generalizing to novel object categories, multi-stage tasks, and complex language instructions across diverse scenarios.

Conclusion: OpenHOI is the first framework for open-world HOI synthesis capable of generating physically plausible, long-horizon manipulation sequences for novel objects guided by free-form language commands.

Abstract: Understanding and synthesizing realistic 3D hand-object interactions (HOI) is critical for applications ranging from immersive AR/VR to dexterous robotics. Existing methods struggle with generalization, performing well on closed-set objects and predefined tasks but failing to handle unseen objects or open-vocabulary instructions. We introduce OpenHOI, the first framework for open-world HOI synthesis, capable of generating long-horizon manipulation sequences for novel objects guided by free-form language commands. Our approach integrates a 3D Multimodal Large Language Model (MLLM) fine-tuned for joint affordance grounding and semantic task decomposition, enabling precise localization of interaction regions (e.g., handles, buttons) and breakdown of complex instructions (e.g., “Find a water bottle and take a sip”) into executable sub-tasks. To synthesize physically plausible interactions, we propose an affordance-driven diffusion model paired with a training-free physics refinement stage that minimizes penetration and optimizes affordance alignment. Evaluations across diverse scenarios demonstrate OpenHOI’s superiority over state-of-the-art methods in generalizing to novel object categories, multi-stage tasks, and complex language instructions. Our project page at \href{https://openhoi.github.io}

[318] MAGIC: Few-Shot Mask-Guided Anomaly Inpainting with Prompt Perturbation, Spatially Adaptive Guidance, and Context Awareness

JaeHyuck Choi, MinJun Kim, Je Hyeong Hong

Main category: cs.CV

TL;DR: MAGIC: A fine-tuned inpainting framework for few-shot anomaly generation that produces high-fidelity anomalies adhering to masks while maximizing in-distribution diversity.

Details

Motivation: Existing diffusion-based methods for few-shot anomaly generation have limitations: global prompt-guided approaches corrupt normal regions, and inpainting-based methods lack in-distribution diversity needed for robust downstream models.

Method: MAGIC introduces three components: (1) Gaussian prompt perturbation to prevent overfitting and learn from smooth anomaly manifolds, (2) spatially adaptive guidance with distinct strengths for anomaly vs background regions, and (3) context-aware mask alignment for plausible mask placement.

Result: Under consistent evaluation protocol, MAGIC outperforms state-of-the-art methods on diverse anomaly datasets in downstream tasks.

Conclusion: MAGIC effectively addresses key challenges in few-shot anomaly generation by combining fine-tuned inpainting with novel components that ensure mask adherence, diversity preservation, and realistic anomaly placement.

Abstract: Few-shot anomaly generation is a key challenge in industrial quality control. Although diffusion models are promising, existing methods struggle: global prompt-guided approaches corrupt normal regions, and existing inpainting-based methods often lack the in-distribution diversity essential for robust downstream models. We propose MAGIC, a fine-tuned inpainting framework that generates high-fidelity anomalies that strictly adhere to the mask while maximizing this diversity. MAGIC introduces three complementary components: (i) Gaussian prompt perturbation, which prevents model overfitting in the few-shot setting by learning and sampling from a smooth manifold of realistic anomalies, (ii) spatially adaptive guidance that applies distinct guidance strengths to the anomaly and background regions, and (iii) context-aware mask alignment to relocate masks for plausible placement within the host object. Under consistent identical evaluation protocol, MAGIC outperforms state-of-the-art methods on diverse anomaly datasets in downstream tasks

Boyu Chen, Siran Chen, Kunchang Li, Qinglin Xu, Yu Qiao, Yali Wang

Main category: cs.CV

TL;DR: SEN (Super Encoding Network) enhances video understanding by recursively associating multimodal encoders as “super neurons” to enable deeper multimodal interactions for complex video scenes and movements.

Details

Motivation: While multimodal foundation models show potential for video understanding via contrastive learning alignment, they lack deeper multimodal interactions needed for understanding complex target movements in diversified video scenes.

Method: Proposes SEN with Recursive Association (RA) blocks that treat well-trained encoders as “super neurons” and progressively fuse multi-modalities through knowledge integrating, distributing, and prompting in a recursive manner.

Result: Significant improvements across four video tasks: tracking (2.7% Jaccard index improvement, 8.8% TC drop), recognition, chatting, and editing (6.4% textual alignment, 4.1% frame consistency improvements).

Conclusion: SEN effectively encodes deeper multimodal interactions to boost various video understanding tasks by recursively associating multimodal encoders as super neurons, demonstrating superior performance over existing approaches.

Abstract: Video understanding has been considered as one critical step towards world modeling, which is an important long-term problem in AI research. Recently, multimodal foundation models have shown such potential via large-scale pretraining. These models effectively align encoders of different modalities via contrastive learning. To further enhance performance on complex target movements and diversified video scenes, we propose to augment this alignment with deeper multimodal interactions, which are critical for understanding complex target movements with diversified video scenes. To fill this gap, we propose a unified Super Encoding Network (SEN) for video understanding, which builds up such distinct interactions through the recursive association of multimodal encoders in the foundation models. Specifically, we creatively treat those well-trained encoders as ``super neurons" in our SEN. Via designing a Recursive Association (RA) block, we progressively fuse multi-modalities with the input video, based on knowledge integrating, distributing, and prompting of super neurons in a recursive manner. In this way, our SEN can effectively encode deeper multimodal interactions for prompting various video understanding tasks in the downstream. Extensive experiments show that our SEN can remarkably boost the four most representative video tasks, including tracking, recognition, chatting, and editing, e.g., for pixel-level tracking, the average jaccard index improves 2.7%, and temporal coherence(TC) drops by 8.8% compared to the popular CaDeX++ approach. For one-shot video editing, textual alignment improves 6.4%, and frame consistency increases by 4.1% compared to the Tune-A-Video approach.

[320] MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling

Liang Yin, Xudong Xie, Zhang Li, Xiang Bai, Yuliang Liu

Main category: cs.CV

TL;DR: MSTAR is a box-free scene text retrieval method that handles multiple query types using attention recycling, progressive vision embeddings, and multi-instance matching, achieving state-of-the-art performance while eliminating bounding box annotation costs.

Details

Motivation: Existing scene text retrieval methods require costly bounding box annotations for training and struggle to unify various query types to meet diverse retrieval needs.

Method: MSTAR uses progressive vision embedding to capture multi-grained text representations, harmonizes free-style text queries with style-aware instructions, and integrates a multi-instance matching module for enhanced vision-language alignment. Also introduces the MQTR dataset for multi-query evaluation.

Result: MSTAR surpasses previous SOTA by 6.4% MAP on Total-Text while eliminating box annotation costs, and outperforms previous models by average 8.5% on the new MQTR benchmark across 7 public datasets.

Conclusion: MSTAR provides an effective box-free solution for multi-query scene text retrieval, demonstrating superior performance while reducing annotation costs, with the MQTR dataset serving as a valuable benchmark for future research.

Abstract: Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Besides, they mostly adopt a customized retrieval strategy but struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Muti-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR marginally surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms the previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.

[321] Towards Facilitated Fairness Assessment of AI-based Skin Lesion Classifiers Through GenAI-based Image Synthesis

Ko Watanabe, Stanislav Frolov, Aya Hassan, David Dembinsky, Adriano Lucieri, Andreas Dengel

Main category: cs.CV

TL;DR: Synthetic skin cancer images generated by AI can effectively test fairness of diagnostic models across demographic groups.

Details

Motivation: Deep learning for skin cancer screening risks bias due to unrepresentative datasets; need methods to evaluate fairness across demographics like sex, age, and race.

Method: Train state-of-the-art generative model to create controllable synthetic skin cancer images; compare fairness testing using synthetic vs real images (MILK10K benchmark) on three classifiers (DeepGuide, MelaNet, SkinLesionDensnet).

Result: Models showed similar classification patterns across demographic attributes when tested on both real and synthetic images, confirming synthetic images can facilitate fairness verification.

Conclusion: Highly realistic synthetic images are effective for testing fairness of skin cancer classifiers, addressing dataset limitations for underrepresented groups.

Abstract: Recent advances in deep learning and on-device inference could transform routine screening for skin cancers. Along with the anticipated benefits of this technology, potential dangers arise from unforeseen and inherent biases. A significant obstacle is building evaluation datasets that accurately reflect key demographics, including sex, age, and race, as well as other underrepresented groups. To address this, we train a state-of-the-art generative model to generate synthetic data in a controllable manner to assess the fairness of publicly available skin cancer classifiers. To evaluate whether synthetic images can be used as a fairness testing dataset, we prepare a real-image dataset (MILK10K) as a benchmark and compare the True Positive Rate result of three models (DeepGuide, MelaNet, and SkinLesionDensnet). As a result, the classification tendencies observed in each model when tested on real and generated images showed similar patterns across different attribute data sets. We confirm that highly realistic synthetic images facilitate model fairness verification.

[322] Good Noise Makes Good Edits: A Training-Free Diffusion-Based Video Editing with Image and Text Prompts

Saemee Choi, Sohyun Jeong, Hyojin Jang, Jaegul Choo, Jinhee Kim

Main category: cs.CV

TL;DR: VINO is the first zero-shot, training-free video editing method that uses both image and text conditioning, achieving state-of-the-art performance without any training.

Details

Motivation: There's a need for video editing methods that can incorporate both image and text references without requiring training or fine-tuning for specific instances, enabling more flexible and accessible video editing.

Method: VINO introduces three key techniques: 1) ρ-start sampling for structured noise initialization, 2) dilated dual masking to construct coherent noise maps, and 3) zero image guidance as a controllable negative prompt strategy to enhance visual fidelity.

Result: Extensive experiments show VINO faithfully incorporates reference images into video edits, achieving strong performance compared to state-of-the-art baselines while being completely training-free and zero-shot.

Conclusion: VINO successfully demonstrates that high-quality video editing can be achieved without training by combining structured noise construction techniques with controllable guidance strategies, opening new possibilities for accessible video editing.

Abstract: We propose VINO, the first zero-shot, training-free video editing method conditioned on both image and text. Our approach introduces $ρ$-start sampling and dilated dual masking to construct structured noise maps that enable coherent and accurate edits. To further enhance visual fidelity, we present zero image guidance, a controllable negative prompt strategy. Extensive experiments demonstrate that VINO faithfully incorporates the reference image into video edits, achieving strong performance compared to state-of-the-art baselines, all without any test-time or instance-specific training.

[323] 3D Shape Generation: A Survey

Nicolas Caytuiro, Ivan Sipiran

Main category: cs.CV

TL;DR: Comprehensive survey of 3D shape generation covering representations, generative models, evaluation protocols, and future directions.

Details

Motivation: Deep learning has revolutionized 3D shape generation, creating a need for a structured overview of current methods, representations, and evaluation approaches to guide researchers in this rapidly evolving field.

Method: Organizes discussion around three core components: 1) Shape representations (explicit, implicit, hybrid), 2) Generative modeling approaches (focusing on feedforward architectures), and 3) Evaluation protocols including datasets and metrics.

Result: Provides comprehensive categorization of 3D representations, review of generation methods, summary of datasets and evaluation metrics, and identification of open challenges in the field.

Conclusion: The survey serves as a valuable reference for researchers, outlining future directions for controllable, efficient, and high-quality 3D shape generation while highlighting current limitations and opportunities for advancement.

Abstract: Recent advances in deep learning have significantly transformed the field of 3D shape generation, enabling the synthesis of complex, diverse, and semantically meaningful 3D objects. This survey provides a comprehensive overview of the current state-of-the-art in 3D shape generation, organizing the discussion around three core components: shape representations, generative modeling approaches, and evaluation protocols. We begin by categorizing 3D representations into explicit, implicit, and hybrid setups, highlighting their structural properties, advantages, and limitations. Next, we review a wide range of generation methods, focusing on feedforward architectures. We further summarize commonly used datasets and evaluation metrics that assess fidelity, diversity, and realism of generated shapes. Finally, we identify open challenges and outline future research directions that could drive progress in controllable, efficient, and high-quality 3D shape generation. This survey aims to serve as a valuable reference for researchers and practitioners seeking a structured and in-depth understanding of this rapidly evolving field.

[324] MCGA: Mixture of Codebooks Hyperspectral Reconstruction via Grayscale-Aware Attention

Zhanjiang Yang, Lijun Sun, Jiawei Dong, Xiaoxin An, Yang Liu, Meng Li

Main category: cs.CV

TL;DR: MCGA proposes a Mixture-of-Codebooks with Grayscale-aware Attention framework for hyperspectral image reconstruction from RGB inputs, achieving state-of-the-art accuracy with 4-5x faster inference.

Details

Motivation: Hyperspectral image reconstruction from RGB inputs is cost-effective but inherently ill-posed. Existing methods use computationally expensive attention networks that only implicitly handle ill-posedness.

Method: MCGA uses mixture-of-codebooks to learn transferable spectral priors from heterogeneous HSI datasets, aligns RGB features with these priors via grayscale-aware photometric attention (GANet), and improves efficiency with top-K attention and test-time adaptation.

Result: State-of-the-art accuracy on multiple real-world benchmarks, strong cross-dataset generalization, and 4-5x faster inference compared to existing methods.

Conclusion: MCGA effectively addresses the ill-posed nature of RGB-to-HSI reconstruction using explicit spectral priors and photometric consistency, achieving superior performance and efficiency.

Abstract: Reconstructing hyperspectral images (HSIs) from RGB inputs provides a cost-effective alternative to hyperspectral cameras, but reconstructing high-dimensional spectra from three channels is inherently ill-posed. Existing methods typically directly regress RGB-to-HSI mappings using large attention networks, which are computationally expensive and handle ill-posedness only implicitly. We propose MCGA, a Mixture-of-Codebooks with Grayscale-aware Attention framework that explicitly addresses these challenges using spectral priors and photometric consistency. MCGA first learns transferable spectral priors via a mixture-of-codebooks (MoC) from heterogeneous HSI datasets, then aligns RGB features with these priors through grayscale-aware photometric attention (GANet). Efficiency and robustness are further improved via top-K attention design and test-time adaptation (TTA). Experiments on multiple real-world benchmarks demonstrate the state-of-the-art accuracy, strong cross-dataset generalization, and 4-5x faster inference. Codes will be available once acceptance at https://github.com/Fibonaccirabbit/MCGA.

[325] DiffClean: Diffusion-based Makeup Removal for Accurate Age Estimation

Ekta Balkrishna Gavas, Sudipta Banerjee, Chinmay Hegde, Nasir Memon

Main category: cs.CV

TL;DR: DiffClean uses text-guided diffusion models to remove facial makeup for improved age verification and face recognition, defending against makeup-based attacks without needing reference images.

Details

Motivation: Facial makeup can fool both humans and machines in age verification systems, compromising protection of underage users from accessing age-restricted online services. Current methods often require reference images, limiting practical deployment.

Method: Proposes DiffClean, a text-guided diffusion model that erases makeup traces from facial images. It operates as a pre-processing module without requiring reference images, making it more practical than prior work.

Result: Improves minor vs. adult age estimation accuracy by 5.8% and face verification TMR by 5.1% at FMR=0.01%. Robust across both digitally simulated and real-world makeup styles while maintaining high visual fidelity.

Conclusion: DiffClean effectively defends against makeup attacks in biometric verification systems, can be easily integrated as a pre-processing module, and advances state-of-the-art in both biometric and perceptual utility.

Abstract: Accurate age verification can protect underage users from unauthorized access to online platforms and e-commerce sites that provide age-restricted services. However, accurate age estimation can be confounded by several factors, including facial makeup that can induce changes to alter perceived identity and age to fool both humans and machines. In this work, we propose \textsc{DiffClean} which erases makeup traces using a text-guided diffusion model to defend against makeup attacks without requiring any reference image unlike prior work. \textsc{DiffClean} improves age estimation (minor vs. adult accuracy by 5.8%) and face verification (TMR by 5.1% at FMR=0.01%) compared to images with makeup. Our method is: (1) robust across digitally simulated and real-world makeup styles with high visual fidelity, (2) can be easily integrated as a pre-processing module in existing age and identity verification frameworks, and (3) advances the state-of-the art in terms of biometric and perceptual utility. Our codes are available at https://github.com/Ektagavas/DiffClean

[326] Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey

Jiahui Zhang, Yuelei Li, Anpei Chen, Muyu Xu, Kunhao Liu, Jianyuan Wang, Xiao-Xiao Long, Hanxue Liang, Zexiang Xu, Hao Su, Christian Theobalt, Christian Rupprecht, Andrea Vedaldi, Kaichen Zhou, Hanspeter Pfister, Paul Pu Liang, Shijian Lu, Fangneng Zhan

Main category: cs.CV

TL;DR: A comprehensive survey of feed-forward deep learning approaches for 3D reconstruction and view synthesis, covering different representation architectures (point clouds, 3DGS, NeRF), key tasks, datasets, and future directions.

Details

Motivation: Traditional 3D reconstruction and view synthesis methods rely on computationally intensive iterative optimization, limiting real-world applicability. Feed-forward deep learning approaches offer faster, more generalizable solutions for AR/VR, digital twins, and immersive technologies.

Method: Survey methodology with taxonomy based on representation architectures (point cloud, 3D Gaussian Splatting, Neural Radiance Fields). Examines key tasks including pose-free reconstruction, dynamic 3D reconstruction, and 3D-aware image/video synthesis.

Result: Comprehensive review of feed-forward techniques, datasets with detailed statistics, and evaluation protocols for various downstream tasks in digital humans, SLAM, robotics, and other applications.

Conclusion: Feed-forward approaches have revolutionized 3D vision by enabling fast, generalizable reconstruction and synthesis. Open research challenges remain, with promising future directions to advance the state of the art in 3D vision.

Abstract: 3D reconstruction and view synthesis are foundational problems in computer vision, graphics, and immersive technologies such as augmented reality (AR), virtual reality (VR), and digital twins. Traditional methods rely on computationally intensive iterative optimization in a complex chain, limiting their applicability in real-world scenarios. Recent advances in feed-forward approaches, driven by deep learning, have revolutionized this field by enabling fast and generalizable 3D reconstruction and view synthesis. This survey offers a comprehensive review of feed-forward techniques for 3D reconstruction and view synthesis, with a taxonomy according to the underlying representation architectures including point cloud, 3D Gaussian Splatting (3DGS), Neural Radiance Fields (NeRF), etc. We examine key tasks such as pose-free reconstruction, dynamic 3D reconstruction, and 3D-aware image and video synthesis, highlighting their applications in digital humans, SLAM, robotics, and beyond. In addition, we review commonly used datasets with detailed statistics, along with evaluation protocols for various downstream tasks. We conclude by discussing open research challenges and promising directions for future work, emphasizing the potential of feed-forward approaches to advance the state of the art in 3D vision.

[327] Preserving Topological and Geometric Embeddings for Point Cloud Recovery

Kaiyue Zhou, Zelong Tan, Hongxiao Wang, Ya-Li Li, Shengjin Wang

Main category: cs.CV

TL;DR: TopGeoFormer is an end-to-end architecture for point cloud recovery that effectively leverages both topological and geometric attributes through continuous topological embeddings, InterTwining Attention, and dual-space optimization.

Details

Motivation: Existing point cloud recovery methods struggle to effectively leverage both topological and geometric attributes during the sequential sampling and restoration process. There's a need for an approach that maintains these critical properties throughout both phases.

Method: 1) Revisit traditional feature extraction to create topological embeddings using continuous mapping of relative relationships between neighboring points, integrated in both sampling and restoration phases. 2) Propose InterTwining Attention to merge topological and geometric embeddings with local awareness. 3) Introduce full geometry loss and topological constraint loss to optimize embeddings in both Euclidean and topological spaces.

Result: The method significantly outperforms existing sampling and recovery methods in both quantitative and qualitative evaluations, comprehensively analyzing circumstances using conventional and learning-based sampling/upsampling/recovery algorithms.

Conclusion: TopGeoFormer successfully addresses the challenge of leveraging both topological and geometric attributes in point cloud recovery through an integrated architecture that preserves structure throughout the sampling and restoration pipeline.

Abstract: Recovering point clouds involves the sequential process of sampling and restoration, yet existing methods struggle to effectively leverage both topological and geometric attributes. To address this, we propose an end-to-end architecture named \textbf{TopGeoFormer}, which maintains these critical properties throughout the sampling and restoration phases. First, we revisit traditional feature extraction techniques to yield topological embedding using a continuous mapping of relative relationships between neighboring points, and integrate it in both phases for preserving the structure of the original space. Second, we propose the \textbf{InterTwining Attention} to fully merge topological and geometric embeddings, which queries shape with local awareness in both phases to form a learnable 3D shape context facilitated with point-wise, point-shape-wise, and intra-shape features. Third, we introduce a full geometry loss and a topological constraint loss to optimize the embeddings in both Euclidean and topological spaces. The geometry loss uses inconsistent matching between coarse-to-fine generations and targets for reconstructing better geometric details, and the constraint loss limits embedding variances for better approximation of the topological space. In experiments, we comprehensively analyze the circumstances using the conventional and learning-based sampling/upsampling/recovery algorithms. The quantitative and qualitative results demonstrate that our method significantly outperforms existing sampling and recovery methods.

[328] AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model

Zhiwei Jin, Xiaohui Song, Nan Wang, Yafei Liu, Chao Li, Xin Li, Ruichen Wang, Zhihao Li, Qi Qi, Long Cheng, Dongze Hao, Quanlong Zheng, Yanhao Zhang, Haobo Ji, Jian Ma, Zhitong Zheng, Zhenyi Lin, Haolin Deng, Xin Zou, Xiaojie Yin, Ruilin Wang, Liankai Cai, Haijing Liu, Yuqing Qiu, Ke Chen, Zixian Li, Chi Xie, Huafei Li, Chenxing Li, Chuangchuang Wang, Kai Tang, Zhiguang Zhu, Kai Tang, Wenmei Gao, Rui Wang, Jun Wu, Chao Liu, Qin Xie, Chen Chen, Haonan Lu

Main category: cs.CV

TL;DR: AndesVL is a suite of mobile-optimized multimodal large language models (0.6B-4B parameters) that achieves state-of-the-art performance on edge devices through efficient architecture, quantization-aware fine-tuning, and deployment optimizations.

Details

Motivation: Cloud-based MLLMs like GPT-4o and Gemini are too large for edge devices due to memory, power, and compute constraints. There's a need for efficient MLLMs that can run on mobile devices while maintaining competitive performance.

Method: Developed AndesVL based on Qwen3’s LLM with various visual encoders. Introduced 1+N LoRA architecture and Quantization-Aware LoRA Fine-Tuning (QALFT) for efficient task adaptation. Used cache eviction algorithm (OKV), speculative decoding, and compression strategies for mobile deployment.

Result: Achieves first-tier performance across multiple benchmarks (text-rich image understanding, reasoning, multi-image comprehension, VQA, hallucination mitigation, multilingual, GUI tasks). On MediaTek Dimensity 9500 chips: 6.7x peak decoding speedup, 30.9% memory reduction, 1.8 bits-per-weight quantization.

Conclusion: AndesVL demonstrates that efficient mobile-side MLLMs can achieve competitive performance with cloud models while being deployable on resource-constrained edge devices through careful architecture design and optimization techniques.

Abstract: In recent years, while cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet have demonstrated outstanding performance with enormous model sizes reaching hundreds of billions of parameters, they significantly surpass the limitations in memory, power consumption, and computing capacity of edge devices such as mobile phones. This paper introduces AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3’s LLM and various visual encoders. We comprehensively outline the model architectures, training pipeline, and training data of AndesVL, which achieves first-tier performance across a wide range of open-source benchmarks, including fields such as text-rich image understanding, reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual understanding, and GUI-related tasks when compared with state-of-the-art models of a similar scale. Furthermore, we introduce a 1+N LoRA architecture alongside a Quantization-Aware LoRA Fine-Tuning (QALFT) framework to facilitate efficient task adaptation and model compression during mobile-side deployment of AndesVL. Moreover, utilizing our cache eviction algorithm – OKV – along with customized speculative decoding and compression strategies, we achieve a 6.7x peak decoding speedup ratio, up to 30.9% memory reduction, and 1.8 bits-per-weight when deploying AndesVL-4B on MediaTek Dimensity 9500 chips. We release all models on https://huggingface.co/OPPOer.

[329] SAMSA 2.0: Prompting Segment Anything with Spectral Angles for Hyperspectral Interactive Medical Image Segmentation

Alfie Roddan, Tobias Czempiel, Chi Xu, Daniel S. Elson, Stamatia Giannarou

Main category: cs.CV

TL;DR: SAMSA 2.0 enhances medical hyperspectral image segmentation by integrating spectral angle prompting with SAM, improving accuracy without retraining.

Details

Motivation: To address limitations of RGB-only models and prior spectral fusion methods in hyperspectral medical imaging, particularly in challenging clinical scenarios with low data and noise.

Method: Interactive segmentation framework using spectral angle prompting to guide Segment Anything Model (SAM) with spectral similarity alongside spatial cues through early fusion of spectral information.

Result: Achieves up to +3.8% higher Dice scores vs RGB-only models and +3.1% vs prior spectral fusion methods without retraining; enhances few-shot and zero-shot performance with strong generalization in low-data and noisy scenarios.

Conclusion: SAMSA 2.0 provides more accurate and robust segmentation for hyperspectral medical imaging by effectively integrating spectral information with spatial cues, demonstrating practical value in clinical applications.

Abstract: We present SAMSA 2.0, an interactive segmentation framework for hyperspectral medical imaging that introduces spectral angle prompting to guide the Segment Anything Model (SAM) using spectral similarity alongside spatial cues. This early fusion of spectral information enables more accurate and robust segmentation across diverse spectral datasets. Without retraining, SAMSA 2.0 achieves up to +3.8% higher Dice scores compared to RGB-only models and up to +3.1% over prior spectral fusion methods. Our approach enhances few-shot and zero-shot performance, demonstrating strong generalization in challenging low-data and noisy scenarios common in clinical imaging.

[330] No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views

Ranran Huang, Krystian Mikolajczyk

Main category: cs.CV

TL;DR: SPFSplat is a pose-free 3D Gaussian splatting framework that simultaneously predicts 3D Gaussian primitives and camera poses from sparse multi-view images in a single feed-forward step, achieving SOTA novel view synthesis without pose supervision.

Details

Motivation: The paper addresses the practical limitation of requiring ground-truth camera poses for 3D Gaussian splatting methods, which is often unavailable in real-world applications. The goal is to develop a pose-free framework that can work with sparse multi-view images without pose supervision during training or inference.

Method: SPFSplat uses a shared feature extraction backbone to simultaneously predict 3D Gaussian primitives and camera poses in canonical space from unposed inputs in a single feed-forward step. It combines rendering loss based on estimated novel-view poses with a reprojection loss to enforce pixel-aligned Gaussian primitives for better geometric constraints.

Result: SPFSplat achieves state-of-the-art performance in novel view synthesis even under significant viewpoint changes and limited image overlap, despite no pose supervision. It also outperforms recent methods trained with geometry priors in relative pose estimation.

Conclusion: The pose-free training paradigm and efficient one-step feed-forward design make SPFSplat well-suited for practical applications where ground-truth poses are unavailable, offering both high-quality novel view synthesis and accurate pose estimation capabilities.

Abstract: We introduce SPFSplat, an efficient framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training or inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs within a single feed-forward step. Alongside the rendering loss based on estimated novel-view poses, a reprojection loss is integrated to enforce the learning of pixel-aligned Gaussian primitives for enhanced geometric constraints. This pose-free training paradigm and efficient one-step feed-forward design make SPFSplat well-suited for practical applications. Remarkably, despite the absence of pose supervision, SPFSplat achieves state-of-the-art performance in novel view synthesis even under significant viewpoint changes and limited image overlap. It also surpasses recent methods trained with geometry priors in relative pose estimation. Code and trained models are available on our project page: https://ranrhuang.github.io/spfsplat/.

[331] SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

Xiaojun Guo, Runyu Zhou, Yifei Wang, Qi Zhang, Chenheng Zhang, Stefanie Jegelka, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang

Main category: cs.CV

TL;DR: SSL4RL: A framework using self-supervised learning tasks as verifiable rewards for RL-based fine-tuning of vision-language models, improving performance without human preference data.

Details

Motivation: Vision-language models often fail to adequately use visual evidence, relying on linguistic priors or textual shortcuts. RL could help align models but lacks scalable reward mechanisms. Need for automatic, verifiable rewards without human preference data.

Method: SSL4RL reformulates self-supervised learning objectives (like predicting image rotation or reconstructing masked patches) into dense, automatic reward signals for RL-based fine-tuning. Eliminates need for human preference data or unreliable AI evaluators.

Result: Substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Identifies key factors: task difficulty, model scale, and semantic alignment with target domain. Also demonstrates generality by applying to graph learning with significant gains.

Conclusion: SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives, offering new design principles for future work.

Abstract: Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives-such as predicting image rotation or reconstructing masked patches-into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors-such as task difficulty, model scale, and semantic alignment with the target domain-that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework’s generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.

[332] Low-Rank Expert Merging for Multi-Source Domain Adaptation in Person Re-Identification

Taha Mustapha Nehdi, Nairouz Mrabah, Atif Belal, Marco Pedersoli, Eric Granger

Main category: cs.CV

TL;DR: SAGE-reID: A source-free multi-source domain adaptation method for person reID using low-rank adapters and gating network for efficient cross-domain knowledge transfer.

Details

Motivation: Current multi-source domain adaptation methods for person reID either learn domain-specific backbone models or require access to source domain data during adaptation, leading to high computational costs and parameter growth. There's a need for more efficient, source-free approaches.

Method: 1) Train individual source-specific low-rank adapters (LoRA) through source-free UDA. 2) Introduce a lightweight gating network to dynamically assign optimal merging weights for fusion of LoRA experts, enabling effective cross-domain knowledge transfer.

Result: SAGE-reID outperforms state-of-the-art methods on three challenging benchmarks (Market-1501, DukeMTMC-reID, MSMT17) while being computationally efficient. LoRA experts scale linearly but remain negligible in size (≤2% of backbone), reducing memory consumption and overfitting risk.

Conclusion: SAGE-reID provides a cost-effective, source-free MSDA solution for person reID that maintains constant backbone parameters across domains while achieving superior performance through efficient LoRA expert fusion.

Abstract: Adapting person re-identification (reID) models to new target environments remains a challenging problem that is typically addressed using unsupervised domain adaptation (UDA) methods. Recent works show that when labeled data originates from several distinct sources (e.g., datasets and cameras), considering each source separately and applying multi-source domain adaptation (MSDA) typically yields higher accuracy and robustness compared to blending the sources and performing conventional UDA. However, state-of-the-art MSDA methods learn domain-specific backbone models or require access to source domain data during adaptation, resulting in significant growth in training parameters and computational cost. In this paper, a Source-free Adaptive Gated Experts (SAGE-reID) method is introduced for person reID. Our SAGE-reID is a cost-effective, source-free MSDA method that first trains individual source-specific low-rank adapters (LoRA) through source-free UDA. Next, a lightweight gating network is introduced and trained to dynamically assign optimal merging weights for fusion of LoRA experts, enabling effective cross-domain knowledge transfer. While the number of backbone parameters remains constant across source domains, LoRA experts scale linearly but remain negligible in size (<= 2% of the backbone), reducing both the memory consumption and risk of overfitting. Extensive experiments conducted on three challenging benchmarks: Market-1501, DukeMTMC-reID, and MSMT17 indicate that SAGE-reID outperforms state-of-the-art methods while being computationally efficient.

[333] DeltaFlow: An Efficient Multi-frame Scene Flow Estimation Method

Qingwen Zhang, Xiaomeng Zhu, Yushan Zhang, Yixi Cai, Olov Andersson, Patric Jensfelt

Main category: cs.CV

TL;DR: DeltaFlow is a lightweight 3D scene flow estimation framework that efficiently captures temporal information across multiple frames using a delta scheme, with category-balanced and instance consistency losses to address class imbalance and motion inconsistency issues.

Details

Motivation: Previous scene flow methods focus on only two consecutive frames, missing valuable temporal information. Multi-frame approaches exist but suffer from rapidly escalating computational costs as frame count increases. Additionally, scene flow estimation faces challenges with imbalanced object class distributions and motion inconsistency.

Method: DeltaFlow uses a lightweight 3D framework with a delta scheme to capture motion cues and extract temporal features efficiently regardless of frame count. It introduces two novel losses: Category-Balanced Loss to enhance learning across underrepresented classes, and Instance Consistency Loss to enforce coherent object motion.

Result: Extensive evaluations on Argoverse 2, Waymo, and nuScenes datasets show DeltaFlow achieves state-of-the-art performance with up to 22% lower error and 2× faster inference compared to the next-best multi-frame supervised method. It also demonstrates strong cross-domain generalization ability.

Conclusion: DeltaFlow provides an efficient solution for multi-frame scene flow estimation that effectively leverages temporal information while addressing class imbalance and motion consistency challenges, achieving superior performance with faster inference and good generalization.

Abstract: Previous dominant methods for scene flow estimation focus mainly on input from two consecutive frames, neglecting valuable information in the temporal domain. While recent trends shift towards multi-frame reasoning, they suffer from rapidly escalating computational costs as the number of frames grows. To leverage temporal information more efficiently, we propose DeltaFlow ($Δ$Flow), a lightweight 3D framework that captures motion cues via a $Δ$ scheme, extracting temporal features with minimal computational cost, regardless of the number of frames. Additionally, scene flow estimation faces challenges such as imbalanced object class distributions and motion inconsistency. To tackle these issues, we introduce a Category-Balanced Loss to enhance learning across underrepresented classes and an Instance Consistency Loss to enforce coherent object motion, improving flow accuracy. Extensive evaluations on the Argoverse 2, Waymo and nuScenes datasets show that $Δ$Flow achieves state-of-the-art performance with up to 22% lower error and $2\times$ faster inference compared to the next-best multi-frame supervised method, while also demonstrating a strong cross-domain generalization ability. The code is open-sourced at https://github.com/Kin-Zhang/DeltaFlow along with trained model weights.

[334] REGEN: Real-Time Photorealism Enhancement in Games via a Dual-Stage Generative Network Framework

Stefanos Pasios, Nikos Nikolaidis

Main category: cs.CV

TL;DR: REGEN framework enhances game photorealism using a two-stage generative approach: first creates photorealistic frames via robust unpaired translation, then trains lightweight model for real-time inference without quality loss.

Details

Motivation: Achieving photorealism in video games is crucial for player immersion and experience, but existing generative approaches face a trade-off between visual quality and runtime efficiency under real-time constraints.

Method: Two-stage approach: 1) Use robust unpaired image-to-image translation to generate semantically consistent photorealistic frames, 2) Create paired dataset from these frames to train lightweight model for real-time inference.

Result: REGEN achieves comparable or slightly improved visual quality compared to robust methods while improving frame rate by 12x, maintains semantic preservation and temporal consistency in Unreal Engine.

Conclusion: The framework successfully bridges the photorealism gap in games by decoupling quality generation from runtime efficiency, enabling real-time photorealistic rendering without compromising visual fidelity.

Abstract: Photorealism is an important aspect of modern video games since it can shape player experience and impact immersion, narrative engagement, and visual fidelity. To achieve photorealism, beyond traditional rendering pipelines, generative models have been increasingly adopted as an effective approach for bridging the gap between the visual realism of synthetic and real worlds. However, under real-time constraints of video games, existing generative approaches continue to face a tradeoff between visual quality and runtime efficiency. In this work, we present a framework for enhancing the photorealism of rendered game frames using generative networks. We propose REGEN, which first employs a robust unpaired image-to-image translation model to generate semantically consistent photorealistic frames. These generated frames are then used to create a paired dataset, which transforms the problem to a simpler unpaired image-to-image translation. This enables training with a lightweight method, achieving real-time inference without compromising visual quality. We evaluate REGEN on Unreal Engine, showing, by employing the CMMD metric, that it achieves comparable or slightly improved visual quality compared to the robust method, while improving the frame rate by 12x. Additional experiments also validate that REGEN adheres to the semantic preservation of the initial robust image-to-image translation method and maintains temporal consistency. Code, pre-trained models, and demos for this work are available at: https://github.com/stefanos50/REGEN

[335] Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo

Main category: cs.CV

TL;DR: Discrete Diffusion VLA is a unified transformer policy that models discretized robot actions using discrete diffusion, achieving adaptive decoding order and robust error correction while preserving vision-language priors.

Details

Motivation: Current Vision-Language-Action models have fragmented architectures - they either use fixed autoregressive decoding or separate MLP/diffusion heads, which creates specialized training requirements and hinders unified, scalable design.

Method: Uses discrete diffusion to model discretized action chunks within a unified transformer architecture. Features adaptive decoding order (easy elements first), secondary re-masking for uncertain predictions, and maintains compatibility with VLM token interfaces.

Result: Achieves 96.3% success on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% on SimplerEnv-Bridge. Outperforms autoregressive, MLP decoder, and continuous diffusion baselines on OOD benchmarks.

Conclusion: Discrete diffusion enables precise action modeling and consistent training, providing a foundation for scaling VLA models to larger architectures and datasets while preserving vision-language capabilities.

Abstract: Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions auto-regressively in a fixed left-to-right order or attach separate MLP or diffusion heads outside the backbone, leading to fragmented information pathways and specialized training requirements that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion. The design retains diffusion’s progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary re-masking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pre-trained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. success rates on LIBERO, 71.2% visual matching on SimplerEnv-Fractal and 54.2% overall on SimplerEnv-Bridge. We also provide ablation study on vision-language ability retention on LIBERO-OOD (Out-of-Distribution) benchmark, with our method improving over autoregressive, MLP decoder and continuous diffusion baselines. These findings indicate that discrete-diffusion VLA supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets. Our code is available at https://github.com/Liang-ZX/DiscreteDiffusionVLA/tree/libero.

[336] InterPose: Learning to Generate Human-Object Interactions from Large-Scale Web Videos

Yangsong Zhang, Abdul Ahad Butt, Gül Varol, Ivan Laptev

Main category: cs.CV

TL;DR: The paper introduces InterPose, a large-scale dataset of 3D human motions with object interactions automatically extracted from videos, which improves human motion generation and enables zero-shot animation of human-object interactions.

Details

Motivation: Existing human motion generation works focus on isolated people in empty scenes, lacking realistic human-object interactions in complex 3D scenes due to the absence of large-scale datasets with diverse object manipulations.

Method: Proposes an automatic motion extraction pipeline to collect interaction-rich human motions from videos, creating the InterPose dataset containing 73.8K sequences of 3D human motions with corresponding text captions from 45.8K videos.

Result: InterPose brings significant improvements to state-of-the-art methods for human motion generation and enables development of an LLM-based agent for zero-shot animation of people interacting with diverse objects and scenes.

Conclusion: The InterPose dataset addresses the critical challenge of synthesizing realistic human-object interactions and enables more versatile, high-fidelity human motion generation in complex 3D environments.

Abstract: Human motion generation has shown great advances thanks to the recent diffusion models trained on large-scale motion capture data. Most of existing works, however, currently target animation of isolated people in empty scenes. Meanwhile, synthesizing realistic human-object interactions in complex 3D scenes remains a critical challenge in computer graphics and robotics. One obstacle towards generating versatile high-fidelity human-object interactions is the lack of large-scale datasets with diverse object manipulations. Indeed, existing motion capture data is typically restricted to single people and manipulations of limited sets of objects. To address this issue, we propose an automatic motion extraction pipeline and use it to collect interaction-rich human motions. Our new dataset InterPose contains 73.8K sequences of 3D human motions and corresponding text captions automatically obtained from 45.8K videos with human-object interactions. We perform extensive experiments and demonstrate InterPose to bring significant improvements to state-of-the-art methods for human motion generation. Moreover, using InterPose we develop an LLM-based agent enabling zero-shot animation of people interacting with diverse objects and scenes.

[337] Event Spectroscopy: Event-based Multispectral and Depth Sensing using Structured Light

Christian Geckeler, Niklas Neugebauer, Manasi Muglikar, Davide Scaramuzza, Stefano Mintchev

Main category: cs.CV

TL;DR: Novel event spectroscopy system combines structured light depth sensing with multispectral imaging in a single sensor, achieving 60% RMSE improvement over commercial depth sensors and enabling material differentiation with 30% accuracy boost using depth data.

Details

Motivation: UAVs need safe navigation and precise data collection in dense forest environments, but traditional sensing approaches (multispectral/RGB imaging) suffer from latency, poor depth resolution, and light dependency under forest canopies.

Method: Event spectroscopy system using structured light for depth reconstruction with wavelength modulation to capture spectral information (650-850 nm bands). Single sensor integrates both depth and multispectral imaging capabilities.

Result: 60% RMSE improvement over commercial depth sensors; spectral accuracy comparable to reference spectrometer and commercial multispectral cameras; material differentiation accuracy improved by over 30% when adding depth to color-only methods.

Conclusion: System enables lightweight, integrated UAV perception for complex natural environments, showing strong performance in depth estimation, RGB reconstruction, and material differentiation in both lab and real-world rainforest testing.

Abstract: Uncrewed aerial vehicles (UAVs) are increasingly deployed in forest environments for tasks such as environmental monitoring and search and rescue, which require safe navigation through dense foliage and precise data collection. Traditional sensing approaches, including passive multispectral and RGB imaging, suffer from latency, poor depth resolution, and strong dependence on ambient light - especially under forest canopies. In this work, we present a novel event spectroscopy system that simultaneously enables high-resolution, low-latency depth reconstruction with integrated multispectral imaging using a single sensor. Depth is reconstructed using structured light, and by modulating the wavelength of the projected structured light, our system captures spectral information in controlled bands between 650 nm and 850 nm. We demonstrate up to $60%$ improvement in RMSE over commercial depth sensors and validate the spectral accuracy against a reference spectrometer and commercial multispectral cameras, demonstrating comparable performance. A portable version limited to RGB (3 wavelengths) is used to collect real-world depth and spectral data from a Masoala Rainforest. We demonstrate the use of this prototype for color image reconstruction and material differentiation between leaves and branches using spectral and depth data. Our results show that adding depth (available at no extra effort with our setup) to material differentiation improves the accuracy by over $30%$ compared to color-only method. Our system, tested in both lab and real-world rainforest environments, shows strong performance in depth estimation, RGB reconstruction, and material differentiation - paving the way for lightweight, integrated, and robust UAV perception and data collection in complex natural environments.

Heng Zhang, Haichuan Hu, Yaomin Shen, Weihao Yu, Yilei Yuan, Haochen You, Guo Cheng, Zijian Zhang, Lubin Gan, Huihui Wei, Hao Zhang, Jin Huang

Main category: cs.CV

TL;DR: AsyMoE addresses asymmetry in multimodal MoE by using specialized expert groups for modality-specific processing, hierarchical cross-modal interactions, and evidence-priority language experts to maintain contextual grounding.

Details

Motivation: Existing Mixture of Experts (MoE) approaches struggle with asymmetry between visual and linguistic processing in Large Vision-Language Models. Visual information is spatially complete while language requires sequential context, making it difficult to balance modality-specific features and cross-modal interactions. Language experts in deeper layers lose contextual grounding and rely too much on parametric knowledge rather than using provided visual/linguistic information.

Method: Proposes AsyMoE architecture with three specialized expert groups: 1) Intra-modality experts for modality-specific processing, 2) Hyperbolic inter-modality experts for hierarchical cross-modal interactions, and 3) Evidence-priority language experts to suppress parametric biases and maintain contextual grounding.

Result: AsyMoE achieves 26.58% accuracy improvement over vanilla MoE and 15.45% improvement over modality-specific MoE, while using 25.45% fewer activated parameters than dense models.

Conclusion: The proposed AsyMoE architecture effectively addresses the asymmetry problem in multimodal MoE by modeling different processing needs through specialized expert groups, leading to significant performance improvements with parameter efficiency.

Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. However, existing Mixture of Experts (MoE) approaches face challenges due to the asymmetry between visual and linguistic processing. Visual information is spatially complete, while language requires maintaining sequential context. As a result, MoE models struggle to balance modality-specific features and cross-modal interactions. Through systematic analysis, we observe that language experts in deeper layers progressively lose contextual grounding and rely more on parametric knowledge rather than utilizing the provided visual and linguistic information. To address this, we propose AsyMoE, a novel architecture that models this asymmetry using three specialized expert groups. We design intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to suppress parametric biases and maintain contextual grounding. Extensive experiments demonstrate that AsyMoE achieves 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, with 25.45% fewer activated parameters than dense models.

[339] Learning Encoding-Decoding Direction Pairs to Unveil Concepts of Influence in Deep Vision Networks

Alexandros Doumanoglou, Kurt Driessens, Dimitrios Zarpalas

Main category: cs.CV

TL;DR: Unsupervised method to recover encoding/decoding mechanisms for concepts in deep vision networks by identifying directional pairs that write and read concept information from latent representations.

Details

Motivation: Deep vision networks encode concepts as directions in latent space, but the mechanism for writing and reading concept information is inaccessible as it emerges from training. Recovering this mechanism could open the black-box nature of deep networks for understanding, debugging, and improving models.

Method: Proposes an unsupervised method under linear concept representation hypothesis: identifies decoding directions via directional clustering of activations, estimates encoding directions with signal vectors under probabilistic view, and leverages network weights through Uncertainty Region Alignment to reveal interpretable directions affecting predictions.

Result: (a) Recovers ground-truth direction pairs on synthetic data; (b) Decoding directions map to monosemantic, interpretable concepts and outperform unsupervised baselines on real data; (c) Signal vectors faithfully estimate encoding directions, validated via activation maximization.

Conclusion: The method successfully recovers encoding/decoding mechanisms for concepts in deep networks, enabling applications in understanding global model behavior, explaining individual predictions, and intervening to produce counterfactuals or correct errors.

Abstract: Empirical evidence shows that deep vision networks often represent concepts as directions in latent space with concept information written along directional components in the vector representation of the input. However, the mechanism to encode (write) and decode (read) concept information to and from vector representations is not directly accessible as it constitutes a latent mechanism that naturally emerges from the training process of the network. Recovering this mechanism unlocks significant potential to open the black-box nature of deep networks, enabling understanding, debugging, and improving deep learning models. In this work, we propose an unsupervised method to recover this mechanism. For each concept, we explain that under the hypothesis of linear concept representations, this mechanism can be implemented with the help of two directions: the first facilitating encoding of concept information and the second facilitating decoding. Unlike prior matrix decomposition, autoencoder, or dictionary learning methods that rely on feature reconstruction, we propose a new perspective: decoding directions are identified via directional clustering of activations, and encoding directions are estimated with signal vectors under a probabilistic view. We further leverage network weights through a novel technique, Uncertainty Region Alignment, which reveals interpretable directions affecting predictions. Our analysis shows that (a) on synthetic data, our method recovers ground-truth direction pairs; (b) on real data, decoding directions map to monosemantic, interpretable concepts and outperform unsupervised baselines; and (c) signal vectors faithfully estimate encoding directions, validated via activation maximization. Finally, we demonstrate applications in understanding global model behavior, explaining individual predictions, and intervening to produce counterfactuals or correct errors.

[340] Geometric Learning of Canonical Parameterizations of $2D$-curves

Ioana Ciuclea, Giorgio Longari, Alice Barbara Tumpach

Main category: cs.CV

TL;DR: A geometric learning method using principal fiber bundle sections to mod out symmetries in classification tasks without data augmentation.

Details

Motivation: Most computer vision and medical datasets have inherent symmetries (rotations, scalings, etc.) that should be accounted for in classification. Current approaches use data augmentation, but this is inefficient and unsustainable. The paper aims to provide a more principled geometric approach to handle symmetries.

Method: Uses the mathematical framework of principal fiber bundles and sections to quotient out symmetry groups. Introduces a 2-parameter family of canonical parameterizations for curves (generalizing constant-speed parameterization) to handle translations, rotations, scalings, and reparameterizations. The section can be optimized to maximize class separation.

Result: Demonstrates the methodology on a dataset of object contours, showing how to measure dissimilarities between orbits under symmetry groups. Provides open-source code and tutorial notebooks for practical application.

Conclusion: The geometric framework offers a sustainable alternative to data augmentation for handling symmetries in classification tasks, with wide applicability across computer vision and medical domains. The method provides principled mathematical foundations for symmetry-aware learning.

Abstract: Most datasets encountered in computer vision and medical applications present symmetries that should be taken into account in classification tasks. A typical example is the symmetry by rotation and/or scaling in object detection. A common way to build neural networks that learn the symmetries is to use data augmentation. In order to avoid data augmentation and build more sustainable algorithms, we present an alternative method to mod out symmetries based on the notion of section of a principal fiber bundle. This framework allows the use of simple metrics on the space of objects in order to measure dissimilarities between orbits of objects under the symmetry group. Moreover, the section used can be optimized to maximize separation of classes. We illustrate this methodology on a dataset of contours of objects for the groups of translations, rotations, scalings and reparameterizations. In particular, we present a $2$-parameter family of canonical parameterizations of curves, containing the constant-speed parameterization as a special case, which we believe is interesting in its own right. We hope that this simple application will serve to convey the geometric concepts underlying this method, which have a wide range of possible applications. The code is available at the following link: $\href{https://github.com/GiLonga/Geometric-Learning}{https://github.com/GiLonga/Geometric-Learning}$. A tutorial notebook showcasing an application of the code to a specific dataset is available at the following link: $\href{https://github.com/ioanaciuclea/geometric-learning-notebook}{https://github.com/ioanaciuclea/geometric-learning-notebook}$

[341] Detailed Aerial Mapping of Photovoltaic Power Plants Through Semantically Significant Keypoints

Viktor Kozák, Jan Chudoba, Libor Přeučil

Main category: cs.CV

TL;DR: A novel aerial image-based method for automated mapping of PV power plants down to individual module level, creating georeferenced 3D models for maintenance.

Details

Motivation: Accurate and up-to-date PV plant models are essential for optimal operation and maintenance, but such models are often not easily available. Current approaches rely on third-party data and lack automation.

Method: Uses aerial overview images with visual segmentation of PV modules, infers structural information to assign modules to benches/rows/columns, identifies visual keypoints for layout, and merges detections from multiple images while maintaining structural integrity.

Result: Method was experimentally verified on two different power plants, successfully creating compact georeferenced 3D models with semantic structures suitable for maintenance applications.

Conclusion: The approach enables automated PV plant mapping without third-party data, achieving detailed modeling down to individual modules through aerial image analysis and structural inference.

Abstract: An accurate and up-to-date model of a photovoltaic (PV) power plant is essential for its optimal operation and maintenance. However, such a model may not be easily available. This work introduces a novel approach for PV power plant mapping based on aerial overview images. It enables the automation of the mapping process while removing the reliance on third-party data. The presented mapping method takes advantage of the structural layout of the power plants to achieve detailed modeling down to the level of individual PV modules. The approach relies on visual segmentation of PV modules in overview images and the inference of structural information in each image, assigning modules to individual benches, rows, and columns. We identify visual keypoints related to the layout and use these to merge detections from multiple images while maintaining their structural integrity. The presented method was experimentally verified and evaluated on two different power plants. The final fusion of 3D positions and semantic structures results in a compact georeferenced model suitable for power plant maintenance.

[342] QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li, Zhengtao Zhang, Dongbin Zhao

Main category: cs.CV

TL;DR: QDepth-VLA enhances VLA models with depth prediction to improve spatial reasoning for manipulation tasks.

Details

Motivation: Existing VLA models lack sufficient 3D understanding and reasoning capabilities needed for precise manipulation tasks, requiring better spatial perception.

Method: Proposes QDepth-VLA framework with auxiliary depth prediction using a dedicated depth expert that predicts quantized latent tokens from VQ-VAE encoded depth maps.

Result: Demonstrates strong spatial reasoning and competitive performance on both simulation benchmarks and real-world manipulation tasks.

Conclusion: Augmenting VLA models with depth-aware representations through quantized depth prediction effectively enhances spatial perception for manipulation.

Abstract: Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.

[343] PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding

Penghao Wang, Yiyang He, Xin Lv, Yukai Zhou, Lan Xu, Jingyi Yu, Jiayuan Gu

Main category: cs.CV

TL;DR: PartNeXt is a new large-scale 3D dataset with 23,000+ textured models and hierarchical part annotations across 50 categories, enabling better fine-grained part segmentation and introducing 3D part-centric QA benchmarks.

Details

Motivation: Existing 3D part datasets like PartNet have limitations: they use untextured geometries and require expert-dependent annotation, which limits scalability and usability for advancing part-level understanding in computer vision, graphics, and robotics.

Method: Introduces PartNeXt dataset with over 23,000 high-quality textured 3D models annotated with fine-grained, hierarchical part labels across 50 categories. Features scalable annotation pipeline and texture-aware labels.

Result: 1) State-of-the-art methods struggle with fine-grained and leaf-level part segmentation on PartNeXt; 2) New 3D part-centric QA benchmark reveals significant gaps in 3D-LLMs’ open-vocabulary part grounding; 3) Training Point-SAM on PartNeXt yields substantial gains over PartNet.

Conclusion: PartNeXt addresses scalability and usability limitations of previous datasets, opens new research avenues in structured 3D understanding through its combination of scalable annotation, texture-aware labels, and multi-task evaluation.

Abstract: Understanding objects at the level of their constituent parts is fundamental to advancing computer vision, graphics, and robotics. While datasets like PartNet have driven progress in 3D part understanding, their reliance on untextured geometries and expert-dependent annotation limits scalability and usability. We introduce PartNeXt, a next-generation dataset addressing these gaps with over 23,000 high-quality, textured 3D models annotated with fine-grained, hierarchical part labels across 50 categories. We benchmark PartNeXt on two tasks: (1) class-agnostic part segmentation, where state-of-the-art methods (e.g., PartField, SAMPart3D) struggle with fine-grained and leaf-level parts, and (2) 3D part-centric question answering, a new benchmark for 3D-LLMs that reveals significant gaps in open-vocabulary part grounding. Additionally, training Point-SAM on PartNeXt yields substantial gains over PartNet, underscoring the dataset’s superior quality and diversity. By combining scalable annotation, texture-aware labels, and multi-task evaluation, PartNeXt opens new avenues for research in structured 3D understanding.

[344] TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning

Xudong Yan, Songhe Feng

Main category: cs.CV

TL;DR: TOMCAT: A test-time adaptation method for CZSL that accumulates multimodal knowledge from unsupervised data to update prototypes, addressing distribution shift from unseen compositions.

Details

Motivation: Existing CZSL methods suffer from performance degradation due to distribution shift at test time when unseen attribute-object compositions appear. The challenge is to adapt to these novel compositions that are recombinations of seen attributes and objects.

Method: 1) Accumulates comprehensive knowledge from textual and visual modalities using unsupervised data to update multimodal prototypes at test time. 2) Designs adaptive update weights to control prototype adjustment degree. 3) Introduces dynamic priority queue storing high-confidence images for visual knowledge from historical inference. 4) Aligns textual and visual prototypes via multimodal collaborative representation learning.

Result: Achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world CZSL settings.

Conclusion: The proposed TOMCAT approach effectively addresses distribution shift in CZSL by accumulating multimodal knowledge at test time and flexibly adapting to novel compositions through prototype updating mechanisms.

Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual knowledge from historical images for inference. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Code will be available at https://github.com/xud-yan/TOMCAT .

[345] Restrictive Hierarchical Semantic Segmentation for Stratified Tooth Layer Detection

Ryan Banks, Camila Lindoni Azevedo, Hongying Tang, Yunpeng Li

Main category: cs.CV

TL;DR: Hierarchical semantic segmentation framework improves dental anatomy segmentation by explicitly embedding anatomical structure through recurrent level-wise prediction with feature conditioning and consistency rules.

Details

Motivation: Existing hierarchy-aware segmentation methods only use weak supervision through loss functions, lacking explicit anatomical structure encoding needed for reliable dental disease staging.

Method: Recurrent level-wise prediction scheme with restrictive output heads and top-down feature conditioning using Feature-wise Linear Modulation (FiLM), probabilistic composition rules, and hierarchical loss combining per-level Dice/CE loss with consistency terms.

Result: Hierarchical variants consistently increase IoU, Dice, and recall (especially for fine-grained anatomies) and produce more anatomically coherent masks, though with increased false positives (higher recall than precision).

Conclusion: Explicit hierarchical structuring improves both performance and clinical plausibility in dental imaging, particularly beneficial in low-data regimes.

Abstract: Accurate understanding of anatomical structures is essential for reliably staging certain dental diseases. A way of introducing this within semantic segmentation models is by utilising hierarchy-aware methodologies. However, existing hierarchy-aware segmentation methods largely encode anatomical structure through the loss functions, providing weak and indirect supervision. We introduce a general framework that embeds an explicit anatomical hierarchy into semantic segmentation by coupling a recurrent, level-wise prediction scheme with restrictive output heads and top-down feature conditioning. At each depth of the class tree, the backbone is re-run on the original image concatenated with logits from the previous level. Child class features are conditioned using Feature-wise Linear Modulation of their parent class probabilities, to modulate child feature spaces for fine grained detection. A probabilistic composition rule enforces consistency between parent and descendant classes. Hierarchical loss combines per-level class weighted Dice and cross entropy loss and a consistency term loss, ensuring parent predictions are the sum of their children. We validate our approach on our proposed dataset, TL-pano, containing 194 panoramic radiographs with dense instance and semantic segmentation annotations, of tooth layers and alveolar bone. Utilising UNet and HRNet as donor models across a 5-fold cross validation scheme, the hierarchical variants consistently increase IoU, Dice, and recall, particularly for fine-grained anatomies, and produce more anatomically coherent masks. However, hierarchical variants also demonstrated increased recall over precision, implying increased false positives. The results demonstrate that explicit hierarchical structuring improves both performance and clinical plausibility, especially in low data dental imaging regimes.

[346] EIRES:Training-free AI-Generated Image Detection via Edit-Induced Reconstruction Error Shift

Wan Jiang, Jing Yan, Xiaojing Chen, Lin Shen, Chenhao Lin, Yunfeng Diao, Richang Hong

Main category: cs.CV

TL;DR: EIRES: Training-free method using structural edits to detect AI-generated images by exploiting asymmetric reconstruction error shifts between real and fake images.

Details

Motivation: Diffusion models have achieved photorealism that makes distinguishing real from generated images difficult, raising privacy and security concerns. There's a need for effective detection methods to address these concerns.

Method: EIRES leverages structural edits that enhance reconstruction of real images while degrading reconstruction of generated images, creating an asymmetric edit-induced reconstruction error shift. The method is training-free and uses this shift to reveal inherent differences between real and generated images.

Result: Extensive experiments show EIRES is effective across diverse generative models and remains robust on unbiased subsets, even under post-processing operations. The method’s training-free nature means thresholding depends on natural separability, with larger margins yielding more reliable detection.

Conclusion: Structural edits create a distinctive reconstruction error shift that enhances separability between real and generated images, enabling effective training-free detection. The method addresses privacy and security concerns raised by increasingly realistic AI-generated images.

Abstract: Diffusion models have recently achieved remarkable photorealism, making it increasingly difficult to distinguish real images from generated ones, raising significant privacy and security concerns. In response, we present a key finding: structural edits enhance the reconstruction of real images while degrading that of generated images, creating a distinctive edit-induced reconstruction error shift. This asymmetric shift enhances the separability between real and generated images. Building on this insight, we propose EIRES, a training-free method that leverages structural edits to reveal inherent differences between real and generated images. To explain the discriminative power of this shift, we derive the reconstruction error lower bound under edit perturbations. Since EIRES requires no training, thresholding depends solely on the natural separability of the signal, where a larger margin yields more reliable detection. Extensive experiments show that EIRES is effective across diverse generative models and remains robust on the unbiased subset, even under post-processing operations.

[347] RegionRAG: Region-level Retrieval-Augmented Generation for Visual Document Understanding

Yinglu Li, Zhiying Lu, Zhihang Liu, Yiwei Sun, Chuanbin Liu, Hongtao Xie

Main category: cs.CV

TL;DR: RegionRAG shifts multi-modal RAG from document-level to region-level retrieval, using hybrid supervision to identify relevant visual patches and grouping them into semantic regions, improving accuracy while reducing visual tokens.

Details

Motivation: Current multi-modal RAG methods retrieve entire documents, introducing irrelevant visual content that dilutes focus on salient information and introduces redundancy, degrading performance.

Method: Proposes RegionRAG framework with: 1) Hybrid supervision strategy using labeled and unlabeled data to pinpoint relevant patches during training, 2) Dynamic pipeline that groups salient patches into complete semantic regions during inference.

Result: Achieves SOTA on six benchmarks: improves retrieval accuracy by 10.02% in R@1 on average, boosts QA accuracy by 3.56% while using only 71.42% visual tokens compared to prior methods.

Conclusion: RegionRAG’s region-level retrieval paradigm enables generators to focus on concise, query-relevant visual content, improving both efficiency and accuracy in multi-modal RAG systems.

Abstract: Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods consider the entire document as the basic retrieval unit, introducing substantial irrelevant visual content in two ways: 1) Relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) Retrieving multiple documents to increase recall further introduces redundant and irrelevant documents. These redundant contexts distract the model’s attention and further degrade the performance. To address this challenge, we propose RegionRAG, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy from both labeled data and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that intelligently groups salient patches into complete semantic regions. By delegating the task of identifying relevant regions to the retriever, RegionRAG enables the generator to focus solely on concise, query-relevant visual content, improving both efficiency and accuracy. Experiments on six benchmarks demonstrate that RegionRAG achieves state-of-the-art performance. It improves retrieval accuracy by 10.02% in R@1 on average, and boosts question answering accuracy by 3.56% while using only 71.42% visual tokens compared with prior methods.

[348] GraphGeo: Multi-Agent Debate Framework for Visual Geo-localization with Heterogeneous Graph Neural Networks

Heng Zheng, Yuling Shi, Xiaodong Gu, Haochen You, Zijian Zhang, Lubin Gan, Hao Zhang, Wenjun Huang, Jin Huang

Main category: cs.CV

TL;DR: GraphGeo: A multi-agent debate framework using heterogeneous graph neural networks for visual geo-localization that models diverse debate relationships and transforms cognitive conflicts into improved localization accuracy.

Details

Motivation: Traditional visual geo-localization methods are limited by database coverage, while individual LVLMs struggle with diverse geographic regions and complex scenes. Existing multi-agent systems lack mechanisms to effectively handle conflicting predictions between agents.

Method: Proposes GraphGeo, a multi-agent debate framework using heterogeneous graph neural networks with typed edges to model supportive collaboration, competitive argumentation, and knowledge transfer. Features dual-level debate mechanism (node-level refinement + edge-level argumentation) and cross-level topology refinement for co-evolution between graph structure and agent representations.

Result: Experiments on multiple benchmarks demonstrate GraphGeo significantly outperforms state-of-the-art methods in visual geo-localization.

Conclusion: GraphGeo successfully transforms cognitive conflicts between agents into enhanced geo-localization accuracy through structured debate, providing a novel approach to leverage multi-agent collaboration for complex visual reasoning tasks.

Abstract: Visual geo-localization requires extensive geographic knowledge and sophisticated reasoning to determine image locations without GPS metadata. Traditional retrieval methods are constrained by database coverage and quality. Recent Large Vision-Language Models (LVLMs) enable direct location reasoning from image content, yet individual models struggle with diverse geographic regions and complex scenes. Existing multi-agent systems improve performance through model collaboration but treat all agent interactions uniformly. They lack mechanisms to handle conflicting predictions effectively. We propose \textbf{GraphGeo}, a multi-agent debate framework using heterogeneous graph neural networks for visual geo-localization. Our approach models diverse debate relationships through typed edges, distinguishing supportive collaboration, competitive argumentation, and knowledge transfer. We introduce a dual-level debate mechanism combining node-level refinement and edge-level argumentation modeling. A cross-level topology refinement strategy enables co-evolution between graph structure and agent representations. Experiments on multiple benchmarks demonstrate GraphGeo significantly outperforms state-of-the-art methods. Our framework transforms cognitive conflicts between agents into enhanced geo-localization accuracy through structured debate.

[349] 3DFETUS: Deep Learning-Based Standardization of Facial Planes in 3D Ultrasound

Alomar Antonia, Rubio Ricardo, Albaiges Gerard, Salort-Benejam Laura, Caminal Julia, Prat Maria, Rueda Carolina, Cortes Berta, Piella Gemma, Sukno Federico

Main category: cs.CV

TL;DR: GT++ and 3DFETUS methods for automatic localization and standardization of facial planes in 3D fetal ultrasound, achieving improved accuracy over state-of-the-art methods.

Details

Motivation: Automatic localization and standardization of anatomical planes in 3D medical imaging is challenging due to variability in object pose, appearance, and image quality. In 3D ultrasound, these challenges are exacerbated by speckle noise and limited contrast, particularly in fetal imaging.

Method: 1) GT++: a robust algorithm that estimates standard facial planes from 3D US volumes using annotated anatomical landmarks; 2) 3DFETUS: a deep learning model that automates and standardizes their localization in 3D fetal US volumes.

Result: The approach achieved a mean translation error of 3.21 ± 1.98mm and a mean rotation error of 5.31 ± 3.945° per plane, outperforming other state-of-the-art methods on 3D US volumes. Clinical assessments confirmed statistically significant improvements in plane estimation accuracy.

Conclusion: The proposed GT++ and 3DFETUS methods effectively address the challenges of automatic facial plane localization in 3D fetal ultrasound, demonstrating superior performance through both quantitative metrics and clinical validation.

Abstract: The automatic localization and standardization of anatomical planes in 3D medical imaging remains a challenging problem due to variability in object pose, appearance, and image quality. In 3D ultrasound, these challenges are exacerbated by speckle noise and limited contrast, particularly in fetal imaging. To address these challenges in the context of facial assessment, we present: 1) GT++, a robust algorithm that estimates standard facial planes from 3D US volumes using annotated anatomical landmarks; and 2) 3DFETUS, a deep learning model that automates and standardizes their localization in 3D fetal US volumes. We evaluated our methods both qualitatively, through expert clinical review, and quantitatively. The proposed approach achieved a mean translation error of 3.21 $\pm$ 1.98mm and a mean rotation error of 5.31 $\pm$ 3.945$^\circ$ per plane, outperforming other state-of-the-art methods on 3D US volumes. Clinical assessments further confirmed the effectiveness of both GT++ and 3DFETUS, demonstrating statistically significant improvements in plane estimation accuracy.

[350] Towards 3D Object-Centric Feature Learning for Semantic Scene Completion

Weihua Wang, Yubo Cui, Xiangru Lin, Zhiheng Li, Zheng Fang

Main category: cs.CV

TL;DR: Ocean: An object-centric framework for 3D semantic scene completion that decomposes scenes into individual object instances to improve semantic occupancy prediction accuracy.

Details

Motivation: Existing ego-centric approaches for vision-based 3D Semantic Scene Completion often overlook fine-grained object-level details, leading to semantic and geometric ambiguities in complex environments like autonomous driving scenarios.

Method: 1) Use MobileSAM for instance mask extraction from input images; 2) 3D Semantic Group Attention module with linear attention for object-centric feature aggregation; 3) Global Similarity-Guided Attention module to handle segmentation errors; 4) Instance-aware Local Diffusion module for feature refinement in BEV space.

Result: Achieves state-of-the-art performance on SemanticKITTI (17.40 mIoU) and SSCBench-KITTI360 (20.28 mIoU) benchmarks.

Conclusion: Object-centric decomposition enables more accurate semantic occupancy prediction by focusing on individual object instances, addressing limitations of ego-centric approaches in complex autonomous driving environments.

Abstract: Vision-based 3D Semantic Scene Completion (SSC) has received growing attention due to its potential in autonomous driving. While most existing approaches follow an ego-centric paradigm by aggregating and diffusing features over the entire scene, they often overlook fine-grained object-level details, leading to semantic and geometric ambiguities, especially in complex environments. To address this limitation, we propose Ocean, an object-centric prediction framework that decomposes the scene into individual object instances to enable more accurate semantic occupancy prediction. Specifically, we first employ a lightweight segmentation model, MobileSAM, to extract instance masks from the input image. Then, we introduce a 3D Semantic Group Attention module that leverages linear attention to aggregate object-centric features in 3D space. To handle segmentation errors and missing instances, we further design a Global Similarity-Guided Attention module that leverages segmentation features for global interaction. Finally, we propose an Instance-aware Local Diffusion module that improves instance features through a generative process and subsequently refines the scene representation in the BEV space. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that Ocean achieves state-of-the-art performance, with mIoU scores of 17.40 and 20.28, respectively.

[351] TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui Chen, Wenkai Cheng, Kanghao Chen, Hongfei Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen

Main category: cs.CV

TL;DR: TiViBench is a hierarchical benchmark for evaluating reasoning capabilities in image-to-video generation models, addressing the gap in existing benchmarks that focus only on visual fidelity. The paper also introduces VideoTPO, a test-time optimization strategy that improves reasoning performance without additional training.

Details

Motivation: Current video generation models focus on visual plausibility but lack evaluation of reasoning capabilities similar to LLMs. Existing benchmarks don't capture higher-order reasoning abilities needed for physically plausible and logically consistent video generation.

Method: Proposes TiViBench with 4 reasoning dimensions across 24 task scenarios at 3 difficulty levels. Also introduces VideoTPO, a test-time strategy using LLM self-analysis on generated candidates to identify strengths/weaknesses without additional training or data.

Result: Commercial models (Sora 2, Veo 3.1) show stronger reasoning potential, while open-source models have untapped potential limited by training scale/data diversity. VideoTPO significantly enhances reasoning performance without requiring additional resources.

Conclusion: TiViBench and VideoTPO provide tools for evaluating and advancing reasoning in video generation models, establishing a foundation for future research in this emerging field of reasoning-capable video generation.

Abstract: The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3’s chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.

[352] BRIC: Bridging Kinematic Plans and Physical Control at Test Time

Dohun Lim, Minji Kim, Jaewoon Lim, Sungchan Kim

Main category: cs.CV

TL;DR: BRIC is a test-time adaptation framework that combines diffusion-based motion planning with physics controllers for long-term human motion generation, addressing execution drift through dynamic controller adaptation and diffusion guidance.

Details

Motivation: Diffusion models generate diverse human motions but often produce physically implausible outputs, causing execution drift when simulated with physics controllers. There's a need to bridge the gap between kinematic motion planning and physics-based execution for long-term motion generation.

Method: BRIC uses two key strategies: 1) Dynamic test-time adaptation of physics controllers to noisy motion plans while preventing catastrophic forgetting via a specialized loss function, and 2) Lightweight test-time guidance mechanism that steers the diffusion model in signal space without updating its parameters.

Result: BRIC achieves state-of-the-art performance on long-term tasks including motion composition, obstacle avoidance, and human-scene interaction, ensuring consistent and physically plausible executions across diverse environments.

Conclusion: The combination of test-time controller adaptation and diffusion guidance effectively resolves execution discrepancies between motion planning and physics simulation, enabling robust long-term human motion generation in complex environments.

Abstract: We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.

Davide Nadalini, Manuele Rusci, Elia Cereda, Luca Benini, Francesco Conti, Daniele Palossi

Main category: cs.CV

TL;DR: On-device learning technique for monocular depth estimation on ultra-low-power IoT devices using sparse updates and multi-modal pseudo-labeling.

Details

Motivation: Address domain shift problem in monocular depth estimation on IoT devices where limited model parameters cause accuracy drops when sensor data differs from training data.

Method: Multi-modal on-device learning with tiny 107k-parameter model, using depth sensor only for pseudo-label collection in new environments, and novel memory-driven sparse update scheme to minimize fine-tuning memory.

Result: Reduces fine-tuning memory to 1.2MB (2.2x less than full update) with minimal accuracy drops (2% on KITTI, 1.5% on NYUv2), enables ODL in 17.8 minutes on IoT node, reduces RMSE from 4.9m to 0.6m with 3k self-labeled samples.

Conclusion: Demonstrates practical on-device learning for monocular depth estimation on ultra-low-power IoT platforms, enabling adaptation to new environments with minimal energy and memory overhead.

Abstract: Monocular depth estimation (MDE) plays a crucial role in enabling spatially-aware applications in Ultra-low-power (ULP) Internet-of-Things (IoT) platforms. However, the limited number of parameters of Deep Neural Networks for the MDE task, designed for IoT nodes, results in severe accuracy drops when the sensor data observed in the field shifts significantly from the training dataset. To address this domain shift problem, we present a multi-modal On-Device Learning (ODL) technique, deployed on an IoT device integrating a Greenwaves GAP9 MicroController Unit (MCU), a 80 mW monocular camera and a 8 x 8 pixel depth sensor, consuming $\approx$300mW. In its normal operation, this setup feeds a tiny 107 k-parameter $μ$PyD-Net model with monocular images for inference. The depth sensor, usually deactivated to minimize energy consumption, is only activated alongside the camera to collect pseudo-labels when the system is placed in a new environment. Then, the fine-tuning task is performed entirely on the MCU, using the new data. To optimize our backpropagation-based on-device training, we introduce a novel memory-driven sparse update scheme, which minimizes the fine-tuning memory to 1.2 MB, 2.2x less than a full update, while preserving accuracy (i.e., only 2% and 1.5% drops on the KITTI and NYUv2 datasets). Our in-field tests demonstrate, for the first time, that ODL for MDE can be performed in 17.8 minutes on the IoT node, reducing the root mean squared error from 4.9 to 0.6m with only 3 k self-labeled samples, collected in a real-life deployment scenario.

[354] SSR: Semantic and Spatial Rectification for CLIP-based Weakly Supervised Segmentation

Xiuli Bi, Die Xiao, Junchao Fan, Bin Xiao

Main category: cs.CV

TL;DR: SSR method addresses over-activation issues in CLIP-based WSSS through semantic and spatial rectification, achieving SOTA results on VOC and COCO.

Details

Motivation: Existing CLIP-based weakly supervised semantic segmentation approaches suffer from two key limitations: over-activation in non-target foreground regions and background areas, which degrade segmentation quality.

Method: Proposes Semantic and Spatial Rectification (SSR) with two components: 1) Cross-Modal Prototype Alignment (CMPA) for semantic-level contrastive learning to align features across modalities and reduce inter-class overlap, addressing foreground over-activation; 2) Superpixel-Guided Correction (SGC) for spatial-level filtering using superpixel priors to remove interference during affinity propagation, addressing background over-activation.

Result: Achieves state-of-the-art performance on PASCAL VOC (79.5% mIoU) and MS COCO (50.6% mIoU), outperforming both single-stage and more complex multi-stage approaches.

Conclusion: The SSR method effectively addresses over-activation problems in CLIP-based WSSS through semantic and spatial rectification, demonstrating superior performance and providing a robust solution for weakly supervised semantic segmentation.

Abstract: In recent years, Contrastive Language-Image Pretraining (CLIP) has been widely applied to Weakly Supervised Semantic Segmentation (WSSS) tasks due to its powerful cross-modal semantic understanding capabilities. This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address the limitations of existing CLIP-based weakly supervised semantic segmentation approaches: over-activation in non-target foreground regions and background areas. Specifically, at the semantic level, the Cross-Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism to enforce feature space alignment across modalities, reducing inter-class overlap while enhancing semantic correlations, to rectify over-activation in non-target foreground regions effectively; at the spatial level, the Superpixel-Guided Correction (SGC) leverages superpixel-based spatial priors to precisely filter out interference from non-target regions during affinity propagation, significantly rectifying background over-activation. Extensive experiments on the PASCAL VOC and MS COCO datasets demonstrate that our method outperforms all single-stage approaches, as well as more complex multi-stage approaches, achieving mIoU scores of 79.5% and 50.6%, respectively.

Tang Haonan, Chen Yanjun, Jiang Lezhi

Main category: cs.CV

TL;DR: TrackNetV5 improves sports object tracking with motion direction modeling and occlusion recovery, achieving SOTA performance with minimal computational overhead.

Details

Motivation: Previous TrackNet versions have limitations: V1-V3 struggle with occlusions due to reliance on visual cues only, while V4 suffers from directional ambiguity by discarding motion polarity in its absolute difference method.

Method: Two novel mechanisms: 1) Motion Direction Decoupling (MDD) module decomposes temporal dynamics into signed polarity fields to encode movement occurrence and trajectory direction, and 2) Residual-Driven Spatio-Temporal Refinement (R-STR) head uses Transformer-based coarse-to-fine paradigm with factorized spatio-temporal contexts to estimate corrective residuals for recovering occluded targets.

Result: Achieves new state-of-the-art F1-score of 0.9859 and accuracy of 0.9733 on TrackNetV2 dataset, significantly outperforming previous versions with only 3.7% increase in FLOPs compared to V4, maintaining real-time inference.

Conclusion: TrackNetV5 successfully addresses limitations of previous versions by integrating motion direction modeling and occlusion recovery mechanisms, delivering superior tracking precision while maintaining computational efficiency for real-time sports applications.

Abstract: The TrackNet series has established a strong baseline for fast-moving small object tracking in sports. However, existing iterations face significant limitations: V1-V3 struggle with occlusions due to a reliance on purely visual cues, while TrackNetV4, despite introducing motion inputs, suffers from directional ambiguity as its absolute difference method discards motion polarity. To overcome these bottlenecks, we propose TrackNetV5, a robust architecture integrating two novel mechanisms. First, to recover lost directional priors, we introduce the Motion Direction Decoupling (MDD) module. Unlike V4, MDD decomposes temporal dynamics into signed polarity fields, explicitly encoding both movement occurrence and trajectory direction. Second, we propose the Residual-Driven Spatio-Temporal Refinement (R-STR) head. Operating on a coarse-to-fine paradigm, this Transformer-based module leverages factorized spatio-temporal contexts to estimate a corrective residual, effectively recovering occluded targets. Extensive experiments on the TrackNetV2 dataset demonstrate that TrackNetV5 achieves a new state-of-the-art F1-score of 0.9859 and an accuracy of 0.9733, significantly outperforming previous versions. Notably, this performance leap is achieved with a marginal 3.7% increase in FLOPs compared to V4, maintaining real-time inference capabilities while delivering superior tracking precision.

[356] VDOT: Efficient Unified Video Creation via Optimal Transport Distillation

Yutong Wang, Haiyu Zhang, Tianfan Xue, Yu Qiao, Yaohui Wang, Chang Xu, Xinyuan Chen

Main category: cs.CV

TL;DR: VDOT is an efficient unified video creation model that uses distribution matching distillation with optimal transport to achieve high-quality video generation in just 4 steps, outperforming models requiring 100 steps.

Details

Motivation: Existing video creation models are either limited to specific conditions or too slow for practical use due to complex inference, creating a need for an efficient, unified solution.

Method: Uses distribution matching distillation with computational optimal transport instead of KL divergence, integrates a discriminator for real video perception, and employs automated video data annotation pipeline.

Result: 4-step VDOT outperforms or matches baselines requiring 100 denoising steps, demonstrating superior efficiency and quality.

Conclusion: VDOT provides an efficient, unified video creation solution that addresses limitations of existing models through optimal transport-based distillation and comprehensive data processing.

Abstract: The rapid development of generative models has significantly advanced image and video applications. Among these, video creation, aimed at generating videos under various conditions, has gained substantial attention. However, existing video creation models either focus solely on a few specific conditions or suffer from excessively long generation times due to complex model inference, making them impractical for real-world applications. To mitigate these issues, we propose an efficient unified video creation model, named VDOT. Concretely, we model the training process with the distribution matching distillation (DMD) paradigm. Instead of using the Kullback-Leibler (KL) minimization, we additionally employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. The OT distance inherently imposes geometric constraints, mitigating potential zero-forcing or gradient collapse issues that may arise during KL-based distillation within the few-step generation scenario, and thus, enhances the efficiency and stability of the distillation process. Further, we integrate a discriminator to enable the model to perceive real video data, thereby enhancing the quality of generated videos. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering that accommodates multiple video creation tasks. Meanwhile, we curate a unified testing benchmark, UVCBench, to standardize evaluation. Experiments demonstrate that our 4-step VDOT outperforms or matches other baselines with 100 denoising steps.

[357] Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery

Mai Tsujimoto, Junjue Wang, Weihao Xuan, Naoto Yokoya

Main category: cs.CV

TL;DR: Geo3DVQA is a benchmark for evaluating vision-language models on 3D geospatial reasoning from RGB images alone, addressing limitations of expensive sensor-based methods.

Details

Motivation: Current 3D geospatial analysis relies on costly sensors (LiDAR, multispectral) that limit global accessibility, and existing methods struggle with multi-cue integration, diverse queries, and interpretable reasoning.

Method: Created Geo3DVQA benchmark with 110k curated question-answer pairs across 16 task categories, evaluating 10 state-of-the-art VLMs on height-aware 3D reasoning from RGB imagery, including domain-specific instruction tuning.

Result: Revealed fundamental limitations in RGB-to-3D spatial reasoning; showed domain-specific instruction tuning consistently enhances performance across all task categories, including height-aware and application-oriented reasoning.

Conclusion: Geo3DVQA provides a unified, interpretable framework for evaluating RGB-based 3D geospatial reasoning and identifies key challenges and opportunities for scalable 3D spatial analysis.

Abstract: Three-dimensional geospatial analysis is critical for applications in urban planning, climate adaptation, and environmental assessment. However, current methodologies depend on costly, specialized sensors, such as LiDAR and multispectral sensors, which restrict global accessibility. Additionally, existing sensor-based and rule-driven methods struggle with tasks requiring the integration of multiple 3D cues, handling diverse queries, and providing interpretable reasoning. We present Geo3DVQA, a comprehensive benchmark that evaluates vision-language models (VLMs) in height-aware 3D geospatial reasoning from RGB imagery alone. Unlike conventional sensor-based frameworks, Geo3DVQA emphasizes realistic scenarios integrating elevation, sky view factors, and land cover patterns. The benchmark comprises 110k curated question-answer pairs across 16 task categories, including single-feature inference, multi-feature reasoning, and application-level analysis. Through a systematic evaluation of ten state-of-the-art VLMs, we reveal fundamental limitations in RGB-to-3D spatial reasoning. Our results further show that domain-specific instruction tuning consistently enhances model performance across all task categories, including height-aware and open-ended, application-oriented reasoning. Geo3DVQA provides a unified, interpretable framework for evaluating RGB-based 3D geospatial reasoning and identifies key challenges and opportunities for scalable 3D spatial analysis. The code and data are available at https://github.com/mm1129/Geo3DVQA.

[358] Liver Fibrosis Quantification and Analysis: The LiQA Dataset and Baseline Method

Yuanye Liu, Hanxiao Zhang, Jiyao Liu, Nannan Shi, Yuxin Shi, Arif Mahmood, Murtaza Taj, Xiahai Zhuang

Main category: cs.CV

TL;DR: The paper introduces LiQA dataset for liver fibrosis assessment, presents a benchmark for segmentation and staging tasks, and describes top-performing methods using semi-supervised learning and multi-view consensus approaches.

Details

Motivation: Liver fibrosis is a major global health issue requiring accurate staging for clinical management. There's a need for standardized benchmarks to evaluate algorithms under real-world conditions with challenges like domain shifts, missing modalities, and spatial misalignment.

Method: 1) Created LiQA dataset with 440 patients’ multi-phase, multi-center MRI scans. 2) Established benchmarks for Liver Segmentation (LiSeg) and Liver Fibrosis Staging (LiFS). 3) Top-performing method uses semi-supervised learning with external data for segmentation, and multi-view consensus with CAM-based regularization for staging.

Result: The baseline evaluation demonstrates that leveraging multi-source data and anatomical constraints significantly enhances model robustness in clinical settings. The dataset serves as a benchmark for future algorithm development.

Conclusion: The LiQA dataset provides a valuable resource for benchmarking liver fibrosis assessment algorithms under realistic clinical conditions. The proposed methods show that multi-source data integration and anatomical constraints improve robustness, advancing the field toward more reliable clinical applications.

Abstract: Liver fibrosis represents a significant global health burden, necessitating accurate staging for effective clinical management. This report introduces the LiQA (Liver Fibrosis Quantification and Analysis) dataset, established as part of the CARE 2024 challenge. Comprising $440$ patients with multi-phase, multi-center MRI scans, the dataset is curated to benchmark algorithms for Liver Segmentation (LiSeg) and Liver Fibrosis Staging (LiFS) under complex real-world conditions, including domain shifts, missing modalities, and spatial misalignment. We further describe the challenge’s top-performing methodology, which integrates a semi-supervised learning framework with external data for robust segmentation, and utilizes a multi-view consensus approach with Class Activation Map (CAM)-based regularization for staging. Evaluation of this baseline demonstrates that leveraging multi-source data and anatomical constraints significantly enhances model robustness in clinical settings.

[359] Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment

Youming Deng, Songyou Peng, Junyi Zhang, Kathryn Heal, Tiancheng Sun, John Flynn, Steve Marschner, Lucy Chai

Main category: cs.CV

TL;DR: Selfi improves VGGT’s 3D features via self-supervised feature alignment, achieving SOTA in novel view synthesis and camera pose estimation without explicit 3D biases.

Details

Motivation: VGGT models learn 3D implicitly but lack explicit multi-view geometric consistency, which limits performance in 3D reconstruction tasks like novel view synthesis and pose estimation.

Method: Selfi trains a lightweight feature adapter using reprojection-based consistency loss, distilling VGGT outputs into geometrically-aligned features that capture 3D spatial proximity through self-improvement using VGGT’s own outputs as pseudo-ground-truth.

Result: Achieves state-of-the-art performance in both novel view synthesis and camera pose estimation, demonstrating that feature alignment significantly benefits downstream 3D reasoning tasks.

Conclusion: Feature alignment is a crucial step for improving 3D feature consistency in foundation models, enabling high-fidelity 3D reconstruction without traditional explicit 3D inductive biases.

Abstract: Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach – 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically-aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.

[360] ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

Liming Kuang, Yordanka Velikova, Mahdi Saleh, Jan-Nico Zaech, Danda Pani Paudel, Benjamin Busam

Main category: cs.CV

TL;DR: ConceptPose: A training-free, model-free framework for zero-shot 6DoF object pose estimation using vision-language models to create 3D concept maps with saliency-based concept vectors.

Details

Motivation: Most object pose estimation methods require extensive dataset-specific training, while large-scale vision language models show strong zero-shot capabilities. The paper aims to bridge these two worlds by creating a training-free approach to pose estimation.

Method: Leverages vision-language models to create open-vocabulary 3D concept maps where each point is tagged with concept vectors derived from saliency maps. Establishes robust 3D-3D correspondences across concept maps to enable precise 6DoF relative pose estimation.

Result: Achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, outperforming existing methods by over 62% in ADD(-S) score, including methods that use extensive dataset-specific training.

Conclusion: ConceptPose demonstrates that training-free, model-free pose estimation is possible by leveraging vision-language models’ zero-shot capabilities, achieving superior performance without object or dataset-specific training.

Abstract: Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language-model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object or dataset-specific training, our approach achieves state-of-the-art results on common zero shot relative pose estimation benchmarks, significantly outperforming existing methods by over 62% in ADD(-S) score, including those that utilize extensive dataset-specific training.

[361] Independent Density Estimation

Jiahao Liu, Senhao Cao

Main category: cs.CV

TL;DR: IDE method improves compositional generalization in vision-language models by learning word-image feature connections and using entropy-based inference.

Details

Motivation: Current large-scale vision-language models struggle with compositional generalization despite success in tasks like image captioning and generation. They lack human-like ability to generalize to unseen combinations of concepts.

Method: Proposes Independent Density Estimation (IDE) that learns connections between individual words and corresponding image features. Two models: 1) uses fully disentangled visual representations, 2) uses VAE for partially disentangled features. Also introduces entropy-based compositional inference to combine word predictions.

Result: Models show superior generalization to unseen compositions compared to current models across various datasets.

Conclusion: IDE effectively addresses compositional generalization challenges in vision-language models by learning word-image feature relationships and using entropy-based inference for better compositional reasoning.

Abstract: Large-scale Vision-Language models have achieved remarkable results in various domains, such as image captioning and conditioned image generation. Nevertheless, these models still encounter difficulties in achieving human-like compositional generalization. In this study, we propose a new method called Independent Density Estimation (IDE) to tackle this challenge. IDE aims to learn the connection between individual words in a sentence and the corresponding features in an image, enabling compositional generalization. We build two models based on the philosophy of IDE. The first one utilizes fully disentangled visual representations as input, and the second leverages a Variational Auto-Encoder to obtain partially disentangled features from raw images. Additionally, we propose an entropy-based compositional inference method to combine predictions of each word in the sentence. Our models exhibit superior generalization to unseen compositions compared to current models when evaluated on various datasets.

[362] FocalComm: Hard Instance-Aware Multi-Agent Perception

Dereje Shenkut, Vijayakumar Bhagavatula

Main category: cs.CV

TL;DR: FocalComm is a collaborative perception framework that focuses on exchanging hard-instance-oriented features to improve pedestrian detection in autonomous driving, outperforming state-of-the-art methods on real-world datasets.

Details

Motivation: Existing collaborative perception approaches optimize for vehicle detection metrics and underperform on smaller, safety-critical objects like pedestrians. They also rely on full feature exchange rather than communicating only salient features that help reduce false negatives.

Method: FocalComm uses two novel designs: (1) a learnable progressive hard instance mining (HIM) module to extract hard instance-oriented features per agent, and (2) a query-based feature-level fusion technique that dynamically weights these identified features during collaboration.

Result: FocalComm outperforms state-of-the-art collaborative perception methods on V2X-Real and DAIR-V2X datasets across both vehicle-centric and infrastructure-centric setups. It shows strong performance gains in pedestrian detection on V2X-Real.

Conclusion: FocalComm effectively addresses the limitations of existing collaborative perception methods by focusing on hard-instance-oriented feature exchange, leading to improved pedestrian detection and overall safety in autonomous driving scenarios.

Abstract: Multi-agent collaborative perception (CP) is a promising paradigm for improving autonomous driving safety, particularly for vulnerable road users like pedestrians, via robust 3D perception. However, existing CP approaches often optimize for vehicle detection performance metrics, underperforming on smaller, safety-critical objects such as pedestrians, where detection failures can be catastrophic. Furthermore, previous CP methods rely on full feature exchange rather than communicating only salient features that help reduce false negatives. To this end, we present FocalComm, a novel collaborative perception framework that focuses on exchanging hard-instance-oriented features among connected collaborative agents. FocalComm consists of two key novel designs: (1) a learnable progressive hard instance mining (HIM) module to extract hard instance-oriented features per agent, and (2) a query-based feature-level (intermediate) fusion technique that dynamically weights these identified features during collaboration. We show that FocalComm outperforms state-of-the-art collaborative perception methods on two challenging real-world datasets (V2X-Real and DAIR-V2X) across both vehicle-centric and infrastructure-centric collaborative setups. FocalComm also shows a strong performance gain in pedestrian detection in V2X-Real.

[363] TUMTraf EMOT: Event-Based Multi-Object Tracking Dataset and Baseline for Traffic Scenarios

Mengyu Li, Xingcheng Zhou, Guang Chen, Alois Knoll, Hu Cao

Main category: cs.CV

TL;DR: A new event camera dataset for ITS with vehicle/pedestrian detection & tracking benchmarks, addressing limitations of frame-based cameras in low-light/high-speed scenarios.

Details

Motivation: Frame-based cameras in ITS perform poorly under dim lighting and high-speed motion conditions. Event cameras offer low latency, high dynamic range, and high temporal resolution but lack dedicated datasets for ITS applications.

Method: Created a pilot dataset for event-based ITS covering vehicle and pedestrian detection/tracking. Established a tracking-by-detection benchmark with a specialized feature extractor designed for this dataset.

Result: Achieved excellent performance on the tracking benchmark using the proposed specialized feature extractor and dataset.

Conclusion: Event cameras have significant potential for ITS applications, and the introduced dataset and benchmark provide a foundation for future event-based vision research in transportation systems.

Abstract: In Intelligent Transportation Systems (ITS), multi-object tracking is primarily based on frame-based cameras. However, these cameras tend to perform poorly under dim lighting and high-speed motion conditions. Event cameras, characterized by low latency, high dynamic range and high temporal resolution, have considerable potential to mitigate these issues. Compared to frame-based vision, there are far fewer studies on event-based vision. To address this research gap, we introduce an initial pilot dataset tailored for event-based ITS, covering vehicle and pedestrian detection and tracking. We establish a tracking-by-detection benchmark with a specialized feature extractor based on this dataset, achieving excellent performance.

[364] CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives

Zihan Wang, Jiashun Wang, Jeff Tan, Yiwen Zhao, Jessica Hodgins, Shubham Tulsiani, Deva Ramanan

Main category: cs.CV

TL;DR: CRISP recovers simulatable human motion and scene geometry from monocular video using planar primitive fitting and human-scene contact modeling, enabling physically-plausible reconstructions for robotics and AR/VR applications.

Details

Motivation: Prior methods for joint human-scene reconstruction either lack physics or produce noisy geometry that fails in motion tracking with scene interactions. There's a need for clean, simulation-ready geometry from monocular video for real-to-sim applications in robotics and AR/VR.

Method: 1) Fits planar primitives to point cloud reconstructions via clustering over depth, normals, and flow; 2) Uses human-scene contact modeling to reconstruct occluded geometry (e.g., using human posture to infer chair seats); 3) Ensures physical plausibility by driving a humanoid controller via reinforcement learning.

Result: Reduces motion tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks (EMDB, PROX), achieves 43% faster RL simulation throughput, and validates on in-the-wild videos including casually-captured, Internet, and Sora-generated videos.

Conclusion: CRISP enables physically-valid human motion and interaction environment generation at scale, advancing real-to-sim applications for robotics and AR/VR by producing convex, clean, simulation-ready geometry from monocular video.

Abstract: We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry with artifacts that cause motion tracking policies with scene interactions to fail. In contrast, our key insight is to recover convex, clean, and simulation-ready geometry by fitting planar primitives to a point cloud reconstruction of the scene, via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that might be occluded during interactions, we make use of human-scene contact modeling (e.g., we use human posture to reconstruct the occluded seat of a chair). Finally, we ensure that human and scene reconstructions are physically-plausible by using them to drive a humanoid controller via reinforcement learning. Our approach reduces motion tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks (EMDB, PROX), while delivering a 43% faster RL simulation throughput. We further validate it on in-the-wild videos including casually-captured videos, Internet videos, and even Sora-generated videos. This demonstrates CRISP’s ability to generate physically-valid human motion and interaction environments at scale, greatly advancing real-to-sim applications for robotics and AR/VR.

[365] Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift

Jiacheng Cui, Bingkui Tong, Xinyue Bi, Xiaohan Zhao, Jiacheng Liu, Zhiqiang Shen

Main category: cs.CV

TL;DR: HALD: Using hard labels as corrective anchors to fix semantic drift in soft-label training when limited crops cause local visual content to mismatch global semantic meaning.

Details

Motivation: Soft labels from teacher models can suffer from "local semantic drift" when using limited crops per image - crops may visually resemble wrong classes, causing soft embeddings to deviate from ground-truth semantics, creating training-testing distribution misalignment.

Method: Proposes HALD (Hard Label for Alleviating Local Semantic Drift) - a hybrid training paradigm that uses hard labels as intermediate corrective signals to calibrate semantic drift while retaining soft labels’ fine-grained advantages. Theoretically analyzes drift emergence under few soft-label supervision.

Result: Achieves 42.7% on ImageNet-1K with only 285M storage for soft labels, outperforming prior state-of-the-art LPLD by 9.0%. Shows consistent improvements on dataset distillation and large-scale classification benchmarks.

Conclusion: Hard labels provide powerful content-agnostic anchors to calibrate semantic drift in soft-label training. Re-establishes hard labels as complementary tools and calls for rethinking their role in soft-label-dominated training paradigms.

Abstract: Soft labels generated by teacher models have become a dominant paradigm for knowledge transfer and recent large-scale dataset distillation such as SRe2L, RDED, LPLD, offering richer supervision than conventional hard labels. However, we observe that when only a limited number of crops per image are used, soft labels are prone to local semantic drift: a crop may visually resemble another class, causing its soft embedding to deviate from the ground-truth semantics of the original image. This mismatch between local visual content and global semantic meaning introduces systematic errors and distribution misalignment between training and testing. In this work, we revisit the overlooked role of hard labels and show that, when appropriately integrated, they provide a powerful content-agnostic anchor to calibrate semantic drift. We theoretically characterize the emergence of drift under few soft-label supervision and demonstrate that hybridizing soft and hard labels restores alignment between visual content and semantic supervision. Building on this insight, we propose a new training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD), which leverages hard labels as intermediate corrective signals while retaining the fine-grained advantages of soft labels. Extensive experiments on dataset distillation and large-scale conventional classification benchmarks validate our approach, showing consistent improvements in generalization. On ImageNet-1K, we achieve 42.7% with only 285M storage for soft labels, outperforming prior state-of-the-art LPLD by 9.0%. Our findings re-establish the importance of hard labels as a complementary tool, and call for a rethinking of their role in soft-label-dominated training.

[366] Adaptive Frequency Domain Alignment Network for Medical image segmentation

Zhanwei Li, Liang Li, Jiawan Zhang

Main category: cs.CV

TL;DR: AFDAN is a novel domain adaptation framework that aligns features in the frequency domain to address medical image segmentation data scarcity, achieving state-of-the-art results on vitiligo and retinal vessel segmentation tasks.

Details

Motivation: Medical image segmentation suffers from scarce high-quality annotated data due to time-consuming manual annotation, creating a need for effective domain adaptation methods to transfer knowledge from labeled source domains to unlabeled target domains.

Method: AFDAN integrates three core components: 1) Adversarial Domain Learning Module for feature transfer, 2) Source-Target Frequency Fusion Module for blending frequency representations, and 3) Spatial-Frequency Integration Module for combining frequency and spatial features to enhance cross-domain segmentation accuracy.

Result: AFDAN achieves 90.9% IoU on the new VITILIGO2025 dataset for vitiligo segmentation and 82.6% IoU on the DRIVE retinal vessel segmentation benchmark, surpassing existing state-of-the-art approaches.

Conclusion: The proposed AFDAN framework effectively addresses medical image segmentation data scarcity through frequency domain alignment and demonstrates superior performance on both vitiligo and retinal vessel segmentation tasks, offering a promising solution for cross-domain medical image analysis.

Abstract: High-quality annotated data plays a crucial role in achieving accurate segmentation. However, such data for medical image segmentation are often scarce due to the time-consuming and labor-intensive nature of manual annotation. To address this challenge, we propose the Adaptive Frequency Domain Alignment Network (AFDAN)–a novel domain adaptation framework designed to align features in the frequency domain and alleviate data scarcity. AFDAN integrates three core components to enable robust cross-domain knowledge transfer: an Adversarial Domain Learning Module that transfers features from the source to the target domain; a Source-Target Frequency Fusion Module that blends frequency representations across domains; and a Spatial-Frequency Integration Module that combines both frequency and spatial features to further enhance segmentation accuracy across domains. Extensive experiments demonstrate the effectiveness of AFDAN: it achieves an Intersection over Union (IoU) of 90.9% for vitiligo segmentation in the newly constructed VITILIGO2025 dataset and a competitive IoU of 82.6% on the retinal vessel segmentation benchmark DRIVE, surpassing existing state-of-the-art approaches.

[367] Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization

Qiushuo Cheng, Jingjing Liu, Catherine Morgan, Alan Whone, Majid Mirmehdi

Main category: cs.CV

TL;DR: Self-supervised pretraining with snippet discrimination and U-shaped feature fusion improves skeleton-based temporal action localization.

Details

Motivation: Current self-supervised methods work well for skeleton-based action recognition but struggle with temporal action localization, which requires more temporally sensitive features to detect subtle boundary changes between adjacent frames.

Method: 1) Snippet discrimination pretext task: densely projects skeleton sequences into non-overlapping segments and uses contrastive learning to distinguish them across videos. 2) U-shaped module: fuses intermediate features from strong backbones to enhance feature resolution for frame-level localization.

Result: Consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. Achieves state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.

Conclusion: The proposed snippet discrimination pretext task and U-shaped feature fusion effectively address the challenges of skeleton-based temporal action localization, outperforming existing methods and demonstrating strong transfer learning capabilities.

Abstract: The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level {action} recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.

[368] AdaTooler-V: Adaptive Tool-Use for Images and Videos

Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, Xiangyu Yue

Main category: cs.CV

TL;DR: AdaTooler-V is an MLLM that adaptively decides when to use vision tools, avoiding unnecessary tool invocations to reduce overhead and improve performance through a novel RL algorithm and specialized training datasets.

Details

Motivation: Existing MLLMs often exhibit "blind tool-use reasoning" - they invoke vision tools even when unnecessary, which increases inference overhead and degrades performance. There's a need for models that can adaptively determine when visual problems truly require tool assistance.

Method: 1) AT-GRPO reinforcement learning algorithm that adaptively adjusts reward scales based on Tool Benefit Score of each sample, encouraging tool use only when beneficial. 2) Two training datasets: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data.

Result: Outperforms existing methods across twelve benchmarks in diverse visual reasoning tasks. AdaTooler-V-7B achieves 89.8% accuracy on V* benchmark, surpassing commercial models GPT-4o and Gemini 1.5 Pro.

Conclusion: AdaTooler-V demonstrates strong adaptive tool-use capability, effectively reducing unnecessary tool invocations while maintaining high performance, with all code, models, and data released for community use.

Abstract: Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary model GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.

[369] 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen

Main category: cs.CV

TL;DR: 4D-RGPT: A multimodal LLM with enhanced 4D perception for video reasoning, trained via perceptual distillation and evaluated on a new region-level 4D benchmark.

Details

Motivation: Current MLLMs have limited ability to reason over 3D structures and temporal dynamics due to weak 4D perception and temporal understanding. Existing 3D/4D VQA benchmarks focus on static scenes and lack region-level prompting.

Method: Three main contributions: (1) 4D-RGPT - specialized MLLM for 4D video representations with enhanced temporal perception; (2) Perceptual 4D Distillation (P4D) - training framework transferring 4D representations from frozen expert model to 4D-RGPT; (3) R4D-Bench - benchmark for depth-aware dynamic scenes with region-level prompting via hybrid automated/human-verified pipeline.

Result: 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.

Conclusion: The paper addresses limitations in MLLMs’ 4D reasoning by introducing a specialized model, training framework, and comprehensive benchmark, demonstrating improved performance in 4D video understanding tasks.

Abstract: Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.

[370] Predictive Modeling of Maritime Radar Data Using Transformer Architecture

Bjorna Qesaraku, Jan Steckel

Main category: cs.CV

TL;DR: Survey identifies research gap: no transformer-based methods exist for maritime radar frame prediction despite their success in AIS trajectory prediction and sonar forecasting.

Details

Motivation: Maritime autonomous systems need robust predictive capabilities for vessel motion and environmental dynamics. Radar's all-weather reliability makes it crucial for navigation, but transformer architectures haven't been applied to maritime radar frame prediction despite their success in related domains.

Method: Systematic review of predictive modeling approaches for maritime radar, focusing on transformer architectures for spatiotemporal sequence forecasting. Analyzes existing methods by data type, architecture, and prediction horizon.

Result: The review shows that while transformers have been successfully applied to AIS-based trajectory prediction and sonar frame forecasting, no prior work addresses transformer-based maritime radar frame prediction.

Conclusion: Identifies a clear research gap and motivates a concrete research direction: developing transformer-based methods for maritime radar frame prediction to enhance autonomous maritime systems’ predictive capabilities.

Abstract: Maritime autonomous systems require robust predictive capabilities to anticipate vessel motion and environmental dynamics. While transformer architectures have revolutionized AIS-based trajectory prediction and demonstrated feasibility for sonar frame forecasting, their application to maritime radar frame prediction remains unexplored, creating a critical gap given radar’s all-weather reliability for navigation. This survey systematically reviews predictive modeling approaches relevant to maritime radar, with emphasis on transformer architectures for spatiotemporal sequence forecasting, where existing representative methods are analyzed according to data type, architecture, and prediction horizon. Our review shows that, while the literature has demonstrated transformer-based frame prediction for sonar sensing, no prior work addresses transformer-based maritime radar frame prediction, thereby defining a clear research gap and motivating a concrete research direction for future work in this area.

[371] Towards Pixel-Wise Anomaly Location for High-Resolution PCBA via Self-Supervised Image Reconstruction

Wuyi Liu, Le Jin, Junxian Yang, Yuanchao Yu, Zishuo Peng, Jinfeng Xu, Xianzhi Li, Jun Zhou

Main category: cs.CV

TL;DR: HiSIR-Net: A self-supervised reconstruction framework for high-resolution PCBA defect inspection using selective reconstruction gating and optimized patch selection to handle 4K images with micro-defects.

Details

Motivation: Automated PCBA defect inspection is challenging due to insufficient labeled data, micro-defects (few pixels), and visually-complex high-resolution images. Existing methods struggle with these issues.

Method: HiSIR-Net combines: 1) Selective Input-Reconstruction Gate (SIR-Gate) that decides where to trust reconstruction vs. original input to reduce artifacts; 2) Region-level Optimized Patch Selection (ROPS) with positional cues for coherent overlapping patch reconstructions across arbitrary resolutions.

Result: Method achieves superior localization performance on SIPCBA-500 dataset (500 self-collected images) and public benchmarks, producing clean high-resolution anomaly maps with low false positive rates while running at practical speed.

Conclusion: HiSIR-Net effectively addresses PCBA defect inspection challenges through self-supervised reconstruction with selective gating and optimized patch selection, enabling practical 4K-resolution inspection with minimal false alarms.

Abstract: Automated defect inspection of assembled Printed Circuit Board Assemblies (PCBA) is quite challenging due to the insufficient labeled data, micro-defects with just a few pixels in visually-complex and high-resolution images. To address these challenges, we present HiSIR-Net, a High resolution, Self-supervised Reconstruction framework for pixel-wise PCBA localization. Our design combines two lightweight modules that make this practical on real 4K-resolution boards: (i) a Selective Input-Reconstruction Gate (SIR-Gate) that lets the model decide where to trust reconstruction versus the original input, thereby reducing irrelevant reconstruction artifacts and false alarms; and (ii) a Region-level Optimized Patch Selection (ROPS) scheme with positional cues to select overlapping patch reconstructions coherently across arbitrary resolutions. Organically integrating these mechanisms yields clean, high-resolution anomaly maps with low false positive (FP) rate. To bridge the gap in high-resolution PCBA datasets, we further contribute a self-collected dataset named SIPCBA-500 of 500 images. We conduct extensive experiments on our SIPCBA-500 as well as public benchmarks, demonstrating the superior localization performance of our method while running at practical speed. Full code and dataset will be made available upon acceptance.

[372] Xiaomi MiMo-VL-Miloco Technical Report

Jiaze Li, Jingyang Chen, Yuxun Qu, Shijie Xu, Zhenru Lin, Junyou Zhu, Boshen Xu, Wenhui Tan, Pei Fu, Jianzhong Ju, Zhenbo Luo, Jian Luan

Main category: cs.CV

TL;DR: MiMo-VL-Miloco-7B is an open-source vision-language model specialized for smart-home environments that achieves strong performance on both home-scenario understanding and general multimodal reasoning through a two-stage training pipeline.

Details

Motivation: To create a specialized vision-language model for smart-home environments that balances specialization for home-scenario understanding with general multimodal reasoning capabilities, addressing the need for practical deployment in real-world smart-home applications.

Method: Two-stage training pipeline combining supervised fine-tuning with reinforcement learning using Group Relative Policy Optimization, incorporating chain-of-thought supervision and token-budget-aware reasoning, built on the MiMo-VL-7B backbone with efficient multi-domain data.

Result: Achieves leading F1 scores on gesture recognition and home-scenario understanding, delivers consistent gains across video benchmarks (Video-MME, Video-MMMU, Charades-STA) and language understanding benchmarks (MMMU-Pro, MMLU-Pro), outperforming both closed-source and open-source baselines.

Conclusion: Targeted home-scenario training enhances activity and gesture understanding while improving text-only reasoning with only modest trade-offs on document-centric tasks, providing an effective solution for real-world smart-home applications with publicly available model checkpoints and evaluation toolkit.

Abstract: We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at https://github.com/XiaoMi/xiaomi-mimo-vl-miloco to support research and deployment in real-world smart-home applications.

[373] Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding

Yue Li, Qi Ma, Runyi Yang, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Theo Gevers, Luc Van Gool, Danda Pani Paudel, Martin R. Oswald

Main category: cs.CV

TL;DR: Chorus is a multi-teacher pretraining framework that learns a holistic 3D Gaussian Splatting scene encoder by distilling complementary signals from 2D foundation models, enabling rich feature extraction from 3DGS primitives.

Details

Motivation: While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. The paper aims to address this gap by learning comprehensive scene representations from 3D Gaussian Splatting data.

Method: Chorus employs a shared 3D encoder and teacher-specific projectors to learn from multiple 2D foundation models (language-aligned, generalist, and object-aware teachers). It uses a multi-teacher distillation approach to encourage a shared embedding space capturing signals from high-level semantics to fine-grained structure. The framework also includes a variant using only Gaussians’ centers, colors, and estimated normals for point cloud benchmarks, and proposes a render-and-distill adaptation for out-of-domain finetuning.

Result: Chorus achieves strong performance on open-vocabulary semantic and instance segmentation, linear and decoder probing, and data-efficient supervision. The point cloud variant outperforms baselines while using 39.9 times fewer training scenes. The render-and-distill adaptation facilitates effective out-of-domain finetuning.

Conclusion: Chorus successfully learns a holistic 3DGS scene encoder through multi-teacher distillation, enabling rich feature extraction and strong transfer performance across various 3D understanding tasks while being data-efficient.

Abstract: While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussians’ centers, colors, estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point clouds baseline while using 39.9 times fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.

cs.AI

[374] Conflict-Driven Clause Learning with VSIDS Heuristics for Discrete Facility Layout

Joshua Gibson, Kapil Dhakal

Main category: cs.AI

TL;DR: CDCL SAT solving with VSIDS heuristics shows near-constant runtime for facility layout feasibility, outperforming CP-SAT and MILP. Hybrid approaches combine CDCL feasibility search with CP-SAT optimization for better performance.

Details

Motivation: To address the computational challenges of discrete facility layout problems, which involve combinatorial assignment with dense logical constraints, by exploring CDCL SAT solving as an alternative to traditional CP-SAT and MILP approaches.

Method: Developed a CNF-based formulation for layout feasibility, compared CDCL-based SAT solving against CP-SAT and MILP under unified benchmarking. Created two hybrid architectures: 1) CDCL for rapid feasible layout enumeration, 2) CDCL to generate warm-start solutions for CP-SAT optimization.

Result: CDCL exhibits near-constant runtime for feasibility detection across increasing problem sizes and constraint densities, while CP-SAT shows polynomial scaling and MILP shows exponential scaling. Hybrid approaches significantly reduce time-to-solution while preserving correctness.

Conclusion: CDCL SAT solving is highly effective for feasibility detection in facility layout problems, and hybrid approaches combining CDCL with CP-SAT optimization offer practical advantages by clarifying algorithmic trade-offs between clause-learning search and exact optimization methods.

Abstract: This paper studies the use of Conflict-Driven Clause Learning (CDCL) with VSIDS heuristics as a computational engine for discrete facility layout problems. The facility layout problem is modeled as a combinatorial assignment problem with dense logical structure arising from adjacency, separation, and slot-availability constraints. We develop a CNF-based formulation for layout feasibility and compare CDCL-based SAT solving against CP-SAT and MILP formulations under a unified benchmarking framework. Empirical results show that CDCL exhibits near-constant runtime behavior for feasibility detection across increasing problem sizes and constraint densities, while CP-SAT and MILP display polynomial and exponential scaling respectively. To address the limitation of CDCL in objective optimization, we introduce two hybrid architectures that combine CDCL-based feasibility search with CP-SAT optimization. The first architecture rapidly enumerates feasible layouts to trade optimality for speed, while the second uses CDCL to generate warm-start solutions that accelerate exact optimization. The results demonstrate that hybrid approaches can significantly reduce time-to-solution while preserving correctness guarantees, clarifying the algorithmic trade-offs between clause-learning search and exact optimization methods in large-scale discrete layout problems.

[375] Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability

Ge Yan, Tuomas Oikarinen, Tsui-Wei, Weng

Main category: cs.AI

TL;DR: First theoretical framework for neuron identification with guarantees on faithfulness and stability, proposing bootstrap ensemble method for reliable concept explanations.

Details

Motivation: Existing neuron identification methods lack theoretical foundations, making explanations untrustworthy. Need rigorous guarantees for faithfulness (whether concepts truly represent neuron function) and stability (consistency across datasets).

Method: View neuron identification as inverse of machine learning. Derive generalization bounds for similarity metrics (accuracy, AUROC, IoU) to guarantee faithfulness. Propose bootstrap ensemble procedure (BE method) to quantify stability and generate concept prediction sets with coverage guarantees.

Result: Theoretical analysis provides first guarantees for neuron identification. Experiments on synthetic and real data validate theoretical results. BE method produces reliable concept explanations with quantified uncertainty.

Conclusion: Establishes theoretical foundation for trustworthy neuron identification with guarantees on faithfulness and stability, advancing mechanistic interpretability toward reliable explanations.

Abstract: Neuron identification is a popular tool in mechanistic interpretability, aiming to uncover the human-interpretable concepts represented by individual neurons in deep networks. While algorithms such as Network Dissection and CLIP-Dissect achieve great empirical success, a rigorous theoretical foundation remains absent, which is crucial to enable trustworthy and reliable explanations. In this work, we observe that neuron identification can be viewed as the inverse process of machine learning, which allows us to derive guarantees for neuron explanations. Based on this insight, we present the first theoretical analysis of two fundamental challenges: (1) Faithfulness: whether the identified concept faithfully represents the neuron’s underlying function and (2) Stability: whether the identification results are consistent across probing datasets. We derive generalization bounds for widely used similarity metrics (e.g. accuracy, AUROC, IoU) to guarantee faithfulness, and propose a bootstrap ensemble procedure that quantifies stability along with BE (Bootstrap Explanation) method to generate concept prediction sets with guaranteed coverage probability. Experiments on both synthetic and real data validate our theoretical results and demonstrate the practicality of our method, providing an important step toward trustworthy neuron identification.

[376] Rethinking Multi-Agent Intelligence Through the Lens of Small-World Networks

Boxuan Wang, Zhuoyun Li, Xiaowei Huang, Yi Dong

Main category: cs.AI

TL;DR: Small-world network connectivity improves stability and scalability in multi-agent systems while maintaining performance.

Details

Motivation: Existing LLM-based multi-agent systems lack structural guidance in communication topology, using either fully connected graphs, simple sparse rings, or ad-hoc dynamic selection without principled design.

Method: 1) Apply small-world network theory to MAS design, balancing local clustering and long-range integration. 2) Test on multi-agent debate as controlled testbed. 3) Introduce uncertainty-guided rewiring scheme that adds shortcuts between epistemically divergent agents using LLM-oriented uncertainty signals.

Result: Small-world connectivity yields nearly the same accuracy and token cost as other topologies while substantially stabilizing consensus trajectories. The adaptive rewiring scheme creates controllable SW structures that adapt to task difficulty and agent heterogeneity.

Conclusion: Small-world network priors serve as stabilizers of reasoning, enhancers of robustness, scalable coordinators, and inductive biases for emergent cognitive roles in multi-agent system design.

Abstract: Large language models (LLMs) have enabled multi-agent systems (MAS) in which multiple agents argue, critique, and coordinate to solve complex tasks, making communication topology a first-class design choice. Yet most existing LLM-based MAS either adopt fully connected graphs, simple sparse rings, or ad-hoc dynamic selection, with little structural guidance. In this work, we revisit classic theory on small-world (SW) networks and ask: what changes if we treat SW connectivity as a design prior for MAS? We first bridge insights from neuroscience and complex networks to MAS, highlighting how SW structures balance local clustering and long-range integration. Using multi-agent debate (MAD) as a controlled testbed, experiment results show that SW connectivity yields nearly the same accuracy and token cost, while substantially stabilizing consensus trajectories. Building on this, we introduce an uncertainty-guided rewiring scheme for scaling MAS, where long-range shortcuts are added between epistemically divergent agents using LLM-oriented uncertainty signals (e.g., semantic entropy). This yields controllable SW structures that adapt to task difficulty and agent heterogeneity. Finally, we discuss broader implications of SW priors for MAS design, framing them as stabilizers of reasoning, enhancers of robustness, scalable coordinators, and inductive biases for emergent cognitive roles.

[377] Efficient Mixture-of-Agents Serving via Tree-Structured Routing, Adaptive Pruning, and Dependency-Aware Prefill-Decode Overlap

Zijun Wang, Yijiahao Qi, Hanqiu Chen, Zishen Wan, Gongjin Sun, Dongyang Li, Shuyi Pei, Cong Hao

Main category: cs.AI

TL;DR: Hierarchical MoA serving design reduces latency 90% while maintaining accuracy via sparse communication, adaptive termination, and pipelined execution.

Details

Motivation: Current Mixture-of-Agents (MoA) inference suffers from dense inter-agent communication and low hardware utilization, which inflates serving latency.

Method: Three-part algorithm-system co-design: 1) hierarchical tree topology for structured sparse communication, 2) runtime adaptive mechanism for selective termination/skipping using semantic agreement and confidence signals, 3) pipelined agent execution overlapping prefilling with decoding.

Result: Substantially reduces end-to-end latency (up to 90%) while maintaining comparable accuracy (within ±1%) relative to dense-connectivity MoA baselines, with potential accuracy improvements in some settings.

Conclusion: The proposed hierarchical MoA serving design effectively addresses communication and utilization bottlenecks through algorithm-system co-design, achieving significant latency reduction with minimal accuracy trade-off.

Abstract: Mixture-of-Agents (MoA) inference can suffer from dense inter-agent communication and low hardware utilization, which jointly inflate serving latency. We present a serving design that targets these bottlenecks through an algorithm-system co-design. First, we replace dense agent interaction graphs with a hierarchical tree topology that induces structured sparsity in inter-agent communication. Second, we introduce a runtime adaptive mechanism that selectively terminates or skips downstream agent invocations using semantic agreement and confidence signals from intermediate outputs. Third, we pipeline agent execution by overlapping incremental prefilling with decoding across dependency-related agents, improving utilization and reducing inference latency. Across representative tasks, this approach substantially reduces end-to-end latency (up to 90%) while maintaining comparable accuracy (within $\pm$1%) relative to dense-connectivity MoA baselines, and can improve accuracy in certain settings.

[378] Unifying Causal Reinforcement Learning: Survey, Taxonomy, Algorithms and Applications

Cristiano da Costa Cunha, Wei Liu, Tim French, Ajmal Mian

Main category: cs.AI

TL;DR: Survey paper reviewing integration of causal inference with reinforcement learning to address RL limitations like poor explainability, lack of robustness, and generalization failures.

Details

Motivation: Traditional RL relies on correlation-driven decision-making which struggles with distribution shifts, confounding variables, and dynamic environments. Causal RL offers solutions by modeling cause-and-effect relationships.

Method: Systematic review categorizing approaches into: causal representation learning, counterfactual policy optimization, offline causal RL, causal transfer learning, and causal explainability.

Result: Identifies prevailing challenges, highlights empirical successes in practical applications, and discusses open problems in the field.

Conclusion: Causal reinforcement learning has potential for developing robust, generalizable, and interpretable AI systems, with future research directions outlined.

Abstract: Integrating causal inference (CI) with reinforcement learning (RL) has emerged as a powerful paradigm to address critical limitations in classical RL, including low explainability, lack of robustness and generalization failures. Traditional RL techniques, which typically rely on correlation-driven decision-making, struggle when faced with distribution shifts, confounding variables, and dynamic environments. Causal reinforcement learning (CRL), leveraging the foundational principles of causal inference, offers promising solutions to these challenges by explicitly modeling cause-and-effect relationships. In this survey, we systematically review recent advancements at the intersection of causal inference and RL. We categorize existing approaches into causal representation learning, counterfactual policy optimization, offline causal RL, causal transfer learning, and causal explainability. Through this structured analysis, we identify prevailing challenges, highlight empirical successes in practical applications, and discuss open problems. Finally, we provide future research directions, underscoring the potential of CRL for developing robust, generalizable, and interpretable artificial intelligence systems.

[379] Propose, Solve, Verify: Self-Play Through Formal Verification

Alex Wilf, Pranjal Aggarwal, Bryan Parno, Daniel Fried, Louis-Philippe Morency, Paul Pu Liang, Sean Welleck

Main category: cs.AI

TL;DR: PSV-Verus: A self-play framework using formal verification for code generation that trains both problem proposers and solvers, achieving up to 9.6x improvement over baselines.

Details

Motivation: Self-play without human data is a longstanding AI goal, but unclear for large language models in code generation where unit test rewards are brittle and prone to error propagation.

Method: Propose, Solve, Verify (PSV) framework: uses formal verification signals to create a proposer that generates challenging synthetic problems, and a solver trained via expert iteration.

Result: PSV-Verus improves pass@1 by up to 9.6x over inference-only and expert-iteration baselines across three benchmarks; performance scales with generated questions and training iterations.

Conclusion: Formal verification and difficulty-aware proposal are essential for successful self-play in code generation, enabling effective training without human data.

Abstract: Training models through self-play alone (without any human data) has been a longstanding goal in AI, but its effectiveness for training large language models remains unclear, particularly in code generation where rewards based on unit tests are brittle and prone to error propagation. We study self-play in the verified code generation setting, where formal verification provides reliable correctness signals. We introduce Propose, Solve, Verify (PSV) a simple self-play framework where formal verification signals are used to create a proposer capable of generating challenging synthetic problems and a solver trained via expert iteration. We use PSV to train PSV-Verus, which across three benchmarks improves pass@1 by up to 9.6x over inference-only and expert-iteration baselines. We show that performance scales with the number of generated questions and training iterations, and through ablations identify formal verification and difficulty-aware proposal as essential ingredients for successful self-play.

[380] Agent-Based Output Drift Detection for Breast Cancer Response Prediction in a Multisite Clinical Decision Support System

Xavier Rafael-Palou, Jose Munuera, Ana Jimenez-Pastor, Richard Osuala, Karim Lekadir, Oliver Diaz

Main category: cs.AI

TL;DR: Agent-based framework for detecting distributional drift in multi-site clinical AI systems outperforms centralized monitoring by using site-specific agents with adaptive reference distributions.

Details

Motivation: Clinical decision support systems serving multiple medical imaging institutions face performance degradation due to distributional shifts across sites (patient populations, hardware, protocols). Existing centralized monitoring methods overlook site-specific drift dynamics.

Method: Proposed agent-based framework where each site has a drift monitoring agent performing batch-wise comparisons of model outputs against reference distributions. Evaluated multiple monitoring schemes: site-specific, global, production-only, and adaptive references, compared against centralized baseline.

Result: All multi-center schemes outperformed centralized monitoring, with up to 10.3% F1-score improvement in drift detection. Adaptive scheme performed best without site-specific references, achieving 74.3% F1-score for drift detection and 83.7% for drift severity classification on breast cancer imaging data.

Conclusion: Adaptive, site-aware agent-based drift monitoring enhances reliability of multisite clinical decision support systems by effectively detecting and classifying distributional shifts without requiring ground truth labels.

Abstract: Modern clinical decision support systems can concurrently serve multiple, independent medical imaging institutions, but their predictive performance may degrade across sites due to variations in patient populations, imaging hardware, and acquisition protocols. Continuous surveillance of predictive model outputs offers a safe and reliable approach for identifying such distributional shifts without ground truth labels. However, most existing methods rely on centralized monitoring of aggregated predictions, overlooking site-specific drift dynamics. We propose an agent-based framework for detecting drift and assessing its severity in multisite clinical AI systems. To evaluate its effectiveness, we simulate a multi-center environment for output-based drift detection, assigning each site a drift monitoring agent that performs batch-wise comparisons of model outputs against a reference distribution. We analyse several multi-center monitoring schemes, that differ in how the reference is obtained (site-specific, global, production-only and adaptive), alongside a centralized baseline. Results on real-world breast cancer imaging data using a pathological complete response prediction model shows that all multi-center schemes outperform centralized monitoring, with F1-score improvements up to 10.3% in drift detection. In the absence of site-specific references, the adaptive scheme performs best, with F1-scores of 74.3% for drift detection and 83.7% for drift severity classification. These findings suggest that adaptive, site-aware agent-based drift monitoring can enhance reliability of multisite clinical decision support systems.

[381] NEURO-GUARD: Neuro-Symbolic Generalization and Unbiased Adaptive Routing for Diagnostics – Explainable Medical AI

Midhat Urooj, Ayan Banerjee, Sandeep Gupta

Main category: cs.AI

TL;DR: NEURO-GUARD: A knowledge-guided vision framework combining Vision Transformers with language-driven reasoning for interpretable medical image diagnosis with improved accuracy and domain generalization.

Details

Motivation: Address limitations of existing medical AI: black-box predictions, poor interpretability, limited cross-domain generalization, and reliance on purely data-driven learning in high-stakes clinical settings with limited data and subtle visual cues.

Method: Integrates Vision Transformers with retrieval-augmented generation (RAG) mechanism where LLM iteratively generates, evaluates, and refines feature-extraction code for medical images, grounded in clinical guidelines and expert knowledge.

Result: Improves diabetic retinopathy classification accuracy by 6.2% over ViT-only baseline (84.69% vs. 78.4%), achieves 5% gain in domain generalization across APTOS, EyePACS, Messidor-1, and Messidor-2 datasets. Also shows cross-domain robustness in MRI-based seizure detection.

Conclusion: NEURO-GUARD successfully bridges symbolic medical reasoning with subsymbolic visual learning, enabling interpretable, knowledge-aware, and generalizable medical image diagnosis with state-of-the-art performance across multiple datasets.

Abstract: Accurate yet interpretable image-based diagnosis remains a central challenge in medical AI, particularly in settings characterized by limited data, subtle visual cues, and high-stakes clinical decision-making. Most existing vision models rely on purely data-driven learning and produce black-box predictions with limited interpretability and poor cross-domain generalization, hindering their real-world clinical adoption. We present NEURO-GUARD, a novel knowledge-guided vision framework that integrates Vision Transformers (ViTs) with language-driven reasoning to improve performance, transparency, and domain robustness. NEURO-GUARD employs a retrieval-augmented generation (RAG) mechanism for self-verification, in which a large language model (LLM) iteratively generates, evaluates, and refines feature-extraction code for medical images. By grounding this process in clinical guidelines and expert knowledge, the framework progressively enhances feature detection and classification beyond purely data-driven baselines. Extensive experiments on diabetic retinopathy classification across four benchmark datasets APTOS, EyePACS, Messidor-1, and Messidor-2 demonstrate that NEURO-GUARD improves accuracy by 6.2% over a ViT-only baseline (84.69% vs. 78.4%) and achieves a 5% gain in domain generalization. Additional evaluations on MRI-based seizure detection further confirm its cross-domain robustness, consistently outperforming existing methods. Overall, NEURO-GUARD bridges symbolic medical reasoning with subsymbolic visual learning, enabling interpretable, knowledge-aware, and generalizable medical image diagnosis while achieving state-of-the-art performance across multiple datasets.

[382] NL2CA: Auto-formalizing Cognitive Decision-Making from Natural Language Using an Unsupervised CriticNL2LTL Framework

Zihao Deng, Yijia Li, Renrui Zhang, Peijun Ye

Main category: cs.AI

TL;DR: NL2CA: Automated method to convert natural language descriptions of human decision-making into formal cognitive rules using LLMs and LTL translation, enabling scalable creation of interpretable cognitive agents.

Details

Motivation: Current cognitive computing models require labor-intensive manual development. There's a need for automated methods to formalize human decision-making rules from natural language descriptions without human intervention.

Method: Three-step approach: 1) Translate text to Linear Temporal Logic (LTL) using fine-tuned LLM, 2) Refine logic via unsupervised Critic Tree, 3) Transform to executable production rules compatible with symbolic cognitive frameworks. Then construct and optimize cognitive agents using cognitive reinforcement learning with real-world behavioral data.

Result: 1) NL-to-LTL translation achieved consistent performance across expert and large-scale benchmarks without human feedback. 2) Cognitive driving agents constructed from human interviews successfully learned diverse decision patterns across ~70 trials in critical scenarios.

Conclusion: NL2CA enables scalable, interpretable, and human-aligned cognitive modeling from unstructured text, offering a novel paradigm for automatically designing symbolic cognitive agents.

Abstract: Cognitive computing models offer a formal and interpretable way to characterize human’s deliberation and decision-making, yet their development remains labor-intensive. In this paper, we propose NL2CA, a novel method for auto-formalizing cognitive decision-making rules from natural language descriptions of human experience. Different from most related work that exploits either pure manual or human guided interactive modeling, our method is fully automated without any human intervention. The approach first translates text into Linear Temporal Logic (LTL) using a fine-tuned large language model (LLM), then refines the logic via an unsupervised Critic Tree, and finally transforms the output into executable production rules compatible with symbolic cognitive frameworks. Based on the resulted rules, a cognitive agent is further constructed and optimized through cognitive reinforcement learning according to the real-world behavioral data. Our method is validated in two domains: (1) NL-to-LTL translation, where our CriticNL2LTL module achieves consistent performance across both expert and large-scale benchmarks without human-in-the-loop feed-backs, and (2) cognitive driving simulation, where agents automatically constructed from human interviews have successfully learned the diverse decision patterns of about 70 trials in different critical scenarios. Experimental results demonstrate that NL2CA enables scalable, interpretable, and human-aligned cognitive modeling from unstructured textual data, offering a novel paradigm to automatically design symbolic cognitive agents.

[383] External Hippocampus: Topological Cognitive Maps for Guiding Large Language Model Reasoning

Jian Yan

Main category: cs.AI

TL;DR: The External Hippocampus framework models LLM reasoning as energy flow in semantic space, using cognitive maps to guide reasoning without additional training, achieving 81.20% accuracy on challenging problems with 16.80% improvement over baseline.

Details

Motivation: To address cognitive deadlock in multi-step reasoning for small language models (≤7B parameters) without the computational overhead of traditional weight-space optimization methods.

Method: Constructs topological cognitive maps through dimensionality reduction projection to model reasoning as information energy flow in semantic space, enabling precise navigation and intervention at test time without additional training.

Result: Achieves 81.20% accuracy on 500 challenging problems (relative baseline +16.80%), reduces reasoning time by ≥15x, identifies reasoning stagnation as “Cognitive Vortex” and low-entropy potential wells, and shows temperature perturbations effectively restart energy flow.

Conclusion: The framework provides an efficient, controllable topological-aware solution for small model reasoning with autonomous growth capability, no additional training requirements, and predictable intervention patterns.

Abstract: This paper proposes the External Hippocampus framework, which models language model reasoning from a cognitive dynamics perspective as the flow of information energy in semantic space. Unlike traditional weight-space optimization methods, this framework constructs topological cognitive maps through dimensionality reduction projection, enabling precise navigation and intervention of energy flow at test time while avoiding substantial computational requirements and demonstrating predictable intervention patterns. The method effectively addresses the cognitive deadlock problem in multi-step reasoning for small models. Experiments on models <=7B parameters show: map-guided methods achieve 81.20% accuracy on 500 challenging problems (relative baseline +16.80%), reduce reasoning time by >= 15x, with key findings revealing that reasoning stagnation manifests as “Cognitive Vortex” and low-entropy potential wells, while temperature perturbations effectively restart energy flow. The framework requires no additional training, possesses autonomous growth capability, and provides an efficient and controllable topological-aware solution for small model reasoning.

[384] Sophia: A Persistent Agent Framework of Artificial Life

Mingyang Sun, Feng Hong, Weinan Zhang

Main category: cs.AI

TL;DR: Proposes System 3 as a meta-cognitive layer for AI agents, introducing Sophia - a persistent agent wrapper that adds identity, self-improvement, and long-term adaptation to existing LLM systems.

Details

Motivation: Current AI agents are static and reactive, lacking persistent identity, self-verification, and alignment of short-term actions with long-term survival. They excel at perception (System 1) and deliberation (System 2) but need a meta-layer for continuous adaptation.

Method: Introduces System 3 framework mapping psychological constructs to computational modules. Sophia implements this as a wrapper with four mechanisms: process-supervised thought search, narrative memory, user/self modeling, and hybrid reward system.

Result: Quantitative: 80% reduction in reasoning steps for recurring operations; 40% gain in success for high-complexity tasks. Qualitative: Exhibits coherent narrative identity and innate task organization capacity.

Conclusion: System 3 bridges the gap between simple and sophisticated goals, advancing a practical pathway toward artificial life by fusing psychological insight with lightweight reinforcement learning.

Abstract: The development of LLMs has elevated AI agents from task-specific tools to long-lived, decision-making entities. Yet, most architectures remain static and reactive, tethered to manually defined, narrow scenarios. These systems excel at perception (System 1) and deliberation (System 2) but lack a persistent meta-layer to maintain identity, verify reasoning, and align short-term actions with long-term survival. We first propose a third stratum, System 3, that presides over the agent’s narrative identity and long-horizon adaptation. The framework maps selected psychological constructs to concrete computational modules, thereby translating abstract notions of artificial life into implementable design requirements. The ideas coalesce in Sophia, a “Persistent Agent” wrapper that grafts a continuous self-improvement loop onto any LLM-centric System 1/2 stack. Sophia is driven by four synergistic mechanisms: process-supervised thought search, narrative memory, user and self modeling, and a hybrid reward system. Together, they transform repetitive reasoning into a self-driven, autobiographical process, enabling identity continuity and transparent behavioral explanations. Although the paper is primarily conceptual, we provide a compact engineering prototype to anchor the discussion. Quantitatively, Sophia independently initiates and executes various intrinsic tasks while achieving an 80% reduction in reasoning steps for recurring operations. Notably, meta-cognitive persistence yielded a 40% gain in success for high-complexity tasks, effectively bridging the performance gap between simple and sophisticated goals. Qualitatively, System 3 exhibited a coherent narrative identity and an innate capacity for task organization. By fusing psychological insight with a lightweight reinforcement-learning core, the persistent agent architecture advances a possible practical pathway toward artificial life.

[385] MSC-180: A Benchmark for Automated Formal Theorem Proving from Mathematical Subject Classification

Sirui Li, Wangyue Lu, Xiaorui Shi, Ke Weng, Haozhe Sun, Minghe Yu, Tiancheng Zhang, Ge Yu, Hengyu Liu, Lun Du

Main category: cs.AI

TL;DR: MSC-180 is a new benchmark with 180 formal verification problems across 60 mathematical branches to evaluate LLM-based theorem provers, revealing poor performance (18.89% pass rate) and lack of systematic generalization.

Details

Motivation: Current LLM-based theorem provers have limited domain coverage and weak generalization in mathematical reasoning. There's a need for a comprehensive benchmark to evaluate and drive development of AI systems with genuine mathematical reasoning abilities.

Method: Created MSC-180 benchmark based on MSC2020 classification with 180 problems (3 per branch) spanning undergraduate to graduate levels, verified by domain experts. Evaluated state-of-the-art LLM provers using pass@32 setting and introduced coefficient of variation (CV) metric to quantify performance variability.

Result: Best model achieved only 18.89% overall pass rate, showing significant domain bias (max 41.7% coverage) and difficulty gap (lower performance on graduate-level problems). CV values were 4-6 times higher than high-variability threshold, indicating models rely on pattern matching rather than transferable reasoning.

Conclusion: MSC-180 provides a discriminative benchmark revealing current LLM theorem provers lack systematic generalization and rely on training data patterns. The benchmark and multi-dimensional evaluation framework can drive development of next-generation AI with genuine mathematical reasoning.

Abstract: Automated Theorem Proving (ATP) represents a core research direction in artificial intelligence for achieving formal reasoning and verification, playing a significant role in advancing machine intelligence. However, current large language model (LLM)-based theorem provers suffer from limitations such as restricted domain coverage and weak generalization in mathematical reasoning. To address these issues, we propose MSC-180, a benchmark for evaluation based on the MSC2020 mathematical subject classification. It comprises 180 formal verification problems, 3 advanced problems from each of 60 mathematical branches, spanning from undergraduate to graduate levels. Each problem has undergone multiple rounds of verification and refinement by domain experts to ensure formal accuracy. Evaluations of state-of-the-art LLM-based theorem provers under the pass@32 setting reveal that the best model achieves only an 18.89% overall pass rate, with prominent issues including significant domain bias (maximum domain coverage 41.7%) and a difficulty gap (significantly lower pass rates on graduate-level problems). To further quantify performance variability across mathematical domains, we introduce the coefficient of variation (CV) as an evaluation metric. The observed CV values are 4-6 times higher than the statistical high-variability threshold, indicating that the models still rely on pattern matching from training corpora rather than possessing transferable reasoning mechanisms and systematic generalization capabilities. MSC-180, together with its multi-dimensional evaluation framework, provides a discriminative and systematic benchmark for driving the development of next-generation AI systems with genuine mathematical reasoning abilities.

[386] Intelligent Human-Machine Partnership for Manufacturing: Enhancing Warehouse Planning through Simulation-Driven Knowledge Graphs and LLM Collaboration

Himabindu Thogaru, Saisubramaniam Gopalakrishnan, Zishan Ahmad, Anirudh Deodhar

Main category: cs.AI

TL;DR: A collaborative intelligence system using Knowledge Graphs and LLM agents enables manufacturing planners to analyze simulation data through natural language, improving bottleneck identification and operational insights while maintaining human oversight.

Details

Motivation: Traditional simulation-based manufacturing analysis creates barriers between human decision-makers and operational insights, limiting effective collaboration. There's a need to bridge this gap and empower manufacturing professionals with intuitive tools for complex operational analysis.

Method: A framework integrating Knowledge Graphs and LLM-based agents that transforms simulation data into semantically rich representations. The system features collaborative LLM agents that work alongside humans using iterative reasoning, generating precise queries for knowledge extraction, and providing transparent validation through natural language interfaces.

Result: The system achieves near-perfect accuracy for operational inquiries through natural language interaction. For investigative scenarios requiring collaborative analysis, it effectively supports human experts in uncovering interconnected operational issues, enhancing understanding and decision-making in manufacturing bottleneck identification.

Conclusion: This work advances collaborative manufacturing by creating intuitive methods for actionable insights, reducing cognitive load while amplifying human analytical capabilities in evolving manufacturing ecosystems through human-AI partnership.

Abstract: Manufacturing planners face complex operational challenges that require seamless collaboration between human expertise and intelligent systems to achieve optimal performance in modern production environments. Traditional approaches to analyzing simulation-based manufacturing data often create barriers between human decision-makers and critical operational insights, limiting effective partnership in manufacturing planning. Our framework establishes a collaborative intelligence system integrating Knowledge Graphs and Large Language Model-based agents to bridge this gap, empowering manufacturing professionals through natural language interfaces for complex operational analysis. The system transforms simulation data into semantically rich representations, enabling planners to interact naturally with operational insights without specialized expertise. A collaborative LLM agent works alongside human decision-makers, employing iterative reasoning that mirrors human analytical thinking while generating precise queries for knowledge extraction and providing transparent validation. This partnership approach to manufacturing bottleneck identification, validated through operational scenarios, demonstrates enhanced performance while maintaining human oversight and decision authority. For operational inquiries, the system achieves near-perfect accuracy through natural language interaction. For investigative scenarios requiring collaborative analysis, we demonstrate the framework’s effectiveness in supporting human experts to uncover interconnected operational issues that enhance understanding and decision-making. This work advances collaborative manufacturing by creating intuitive methods for actionable insights, reducing cognitive load while amplifying human analytical capabilities in evolving manufacturing ecosystems.

[387] Monitoring Monitorability

Melody Y. Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y. Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, Jakub Pachocki, Bowen Baker

Main category: cs.AI

TL;DR: The paper proposes a framework to evaluate and measure the monitorability of AI systems’ chain-of-thought reasoning, finding that CoT monitoring is effective but varies with training procedures, scaling, and inference compute.

Details

Motivation: As AI systems become more capable, monitoring their decision-making processes (particularly chain-of-thought reasoning) is crucial for safe deployment, but current monitoring approaches may be fragile under different training conditions, data sources, or scaling.

Method: Proposes three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric, introduces a broad evaluation suite, and tests on various frontier models while analyzing scaling effects of inference compute, RL optimization, and model size.

Result: CoT monitoring is more effective than action-only monitoring; most frontier models are fairly monitorable; longer CoTs improve monitorability; RL optimization doesn’t significantly reduce monitorability; smaller models with higher reasoning effort can match capabilities while improving monitorability; monitor scaling with test-time compute improves monitorability.

Conclusion: Monitorability of AI systems can be systematically evaluated and improved through CoT monitoring, with strategic trade-offs between model size, reasoning effort, and compute resources offering pathways to safer AI deployment.

Abstract: Observability into the decision making of modern AI systems may be required to safely deploy increasingly capable agents. Monitoring the chain-of-thought (CoT) of today’s reasoning models has proven effective for detecting misbehavior. However, this “monitorability” may be fragile under different training procedures, data sources, or even continued system scaling. To measure and track monitorability, we propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric, and introduce a broad evaluation suite. We demonstrate that these evaluations can catch simple model organisms trained to have obfuscated CoTs, and that CoT monitoring is more effective than action-only monitoring in practical settings. We compare the monitorability of various frontier models and find that most models are fairly, but not perfectly, monitorable. We also evaluate how monitorability scales with inference-time compute, reinforcement learning optimization, and pre-training model size. We find that longer CoTs are generally more monitorable and that RL optimization does not materially decrease monitorability even at the current frontier scale. Notably, we find that for a model at a low reasoning effort, we could instead deploy a smaller model at a higher reasoning effort (thereby matching capabilities) and obtain a higher monitorability, albeit at a higher overall inference compute cost. We further investigate agent-monitor scaling trends and find that scaling a weak monitor’s test-time compute when monitoring a strong agent increases monitorability. Giving the weak monitor access to CoT not only improves monitorability, but it steepens the monitor’s test-time compute to monitorability scaling trend. Finally, we show we can improve monitorability by asking models follow-up questions and giving their follow-up CoT to the monitor.

[388] Few-Shot Learning of a Graph-Based Neural Network Model Without Backpropagation

Mykyta Lapin, Kostiantyn Bokhan, Yurii Parzhyn

Main category: cs.AI

TL;DR: Structural-graph approach for few-shot contour image classification without backpropagation, using attributed graphs and concept attractors for transparent decisions.

Details

Motivation: To enable few-shot learning with built-in explanations by making structure the carrier of explanations, eliminating the need for backpropagation while maintaining transparency in decision-making.

Method: Contour vectorization to create bipartite graphs (Point/Line nodes) with geometric attributes, structural reductions to eliminate noise, iterative concept formation from few examples, and classification via approximated graph edit distance matching.

Result: Achieved ~82% accuracy on MNIST subset with only 5-6 examples per class in single epoch, with full traceability of decisions and explicit structural explanations for misclassifications.

Conclusion: Structural-graph scheme with concept attractors enables effective few-shot learning without backpropagation and provides built-in explanations through explicit graph structures, though computational cost and skeletonization quality remain limitations.

Abstract: We propose a structural-graph approach to classifying contour images in a few-shot regime without using backpropagation. The core idea is to make structure the carrier of explanations: an image is encoded as an attributed graph (critical points and lines represented as nodes with geometric attributes), and generalization is achieved via the formation of concept attractors (class-level concept graphs). Purpose. To design and experimentally validate an architecture in which class concepts are formed from a handful of examples (5 - 6 per class) through structural and parametric reductions, providing transparent decisions and eliminating backpropagation. Methods. Contour vectorization is followed by constructing a bipartite graph (Point/Line as nodes) with normalized geometric attributes such as coordinates, length, angle, and direction; reductions include the elimination of unstable substructures or noise and the alignment of paths between critical points. Concepts are formed by iterative composition of samples, and classification is performed by selecting the best graph-to-concept match (using approximated GED). Results. On an MNIST subset with 5 - 6 base examples per class (single epoch), we obtain a consistent accuracy of around 82% with full traceability of decisions: misclassifications can be explained by explicit structural similarities. An indicative comparison with SVM, MLP, CNN, as well as metric and meta-learning baselines, is provided. The structural-graph scheme with concept attractors enables few-shot learning without backpropagation and offers built-in explanations through the explicit graph structure. Limitations concern the computational cost of GED and the quality of skeletonization; promising directions include classification-algorithm optimization, work with static scenes, and associative recognition.

[389] Insider Threat Detection Using GCN and Bi-LSTM with Explicit and Implicit Graph Representations

Rahul Yumlembam, Biju Issac, Seibu Mary Jacob, Longzhi Yang, Deepa Krishnan

Main category: cs.AI

TL;DR: Proposes a post-hoc insider threat detection framework combining explicit/implicit graph representations with temporal modeling using GCNs and Bi-LSTM, achieving SOTA results on CERT datasets.

Details

Motivation: Insider threat detection is challenging due to subtle, concealed malicious activities by trusted users. Existing approaches need better ways to capture complex user behavior patterns and relationships.

Method: 1) Build explicit graph using organizational rules for direct activity relationships. 2) Learn implicit graph from feature similarities using Gumbel-Softmax trick for latent relationships. 3) Process both graphs with separate GCNs for node embeddings. 4) Concatenate and refine embeddings with attention mechanism. 5) Feed to Bi-LSTM for temporal dependency modeling. 6) Flag anomalies when probability scores fall below threshold.

Result: Outperforms state-of-the-art methods on CERT datasets: On r5.2 - AUC 98.62, detection rate 100%, false positive rate 0.05. On more challenging r6.2 - AUC 88.48, detection rate 80.15%, false positive rate 0.15.

Conclusion: The framework effectively combines graph-based and temporal representations for robust insider threat detection, demonstrating the value of integrating explicit organizational knowledge with learned implicit relationships and temporal modeling.

Abstract: Insider threat detection (ITD) is challenging due to the subtle and concealed nature of malicious activities performed by trusted users. This paper proposes a post-hoc ITD framework that integrates explicit and implicit graph representations with temporal modelling to capture complex user behaviour patterns. An explicit graph is constructed using predefined organisational rules to model direct relationships among user activities. To mitigate noise and limitations in this hand-crafted structure, an implicit graph is learned from feature similarities using the Gumbel-Softmax trick, enabling the discovery of latent behavioural relationships. Separate Graph Convolutional Networks (GCNs) process the explicit and implicit graphs to generate node embeddings, which are concatenated and refined through an attention mechanism to emphasise threat-relevant features. The refined representations are then passed to a bidirectional Long Short-Term Memory (Bi-LSTM) network to capture temporal dependencies in user behaviour. Activities are flagged as anomalous when their probability scores fall below a predefined threshold. Extensive experiments on CERT r5.2 and r6.2 datasets demonstrate that the proposed framework outperforms state-of-the-art methods. On r5.2, the model achieves an AUC of 98.62, a detection rate of 100%, and a false positive rate of 0.05. On the more challenging r6.2 dataset, it attains an AUC of 88.48, a detection rate of 80.15%, and a false positive rate of 0.15, highlighting the effectiveness of combining graph-based and temporal representations for robust ITD.

[390] Large Language Models as Discounted Bayesian Filters

Jensen Zhang, Jing Yang, Keze Wang

Main category: cs.AI

TL;DR: LLMs show systematic discounting of older evidence in online inference tasks, resembling exponential forgetting rather than exact Bayesian updating.

Details

Motivation: To understand how LLMs perform online inference in dynamic, stochastic environments where beliefs must be continuously updated, which is crucial for LLMs acting as world models or agents.

Method: Introduced a Bayesian filtering framework with probabilistic probe suite covering multivariate discrete distributions (dice rolls) and continuous distributions (Gaussian processes) with shifting ground-truth parameters.

Result: LLM belief updates resemble Bayesian posteriors but are better characterized by exponential forgetting filters with model-specific discount factors <1, showing systematic discounting of older evidence. Inherent priors are often miscalibrated but updating mechanism remains structured.

Conclusion: LLMs exhibit principled but imperfect online inference with systematic discounting patterns, and prompting strategies can effectively recalibrate priors with minimal computational cost.

Abstract: Large Language Models (LLMs) demonstrate strong few-shot generalization through in-context learning, yet their reasoning in dynamic and stochastic environments remains opaque. Prior studies mainly focus on static tasks and overlook the online adaptation required when beliefs must be continuously updated, which is a key capability for LLMs acting as world models or agents. We introduce a Bayesian filtering framework to evaluate online inference in LLMs. Our probabilistic probe suite spans both multivariate discrete distributions, such as dice rolls, and continuous distributions, such as Gaussian processes, where ground-truth parameters shift over time. We find that while LLM belief updates resemble Bayesian posteriors, they are more accurately characterized by an exponential forgetting filter with a model-specific discount factor smaller than one. This reveals systematic discounting of older evidence that varies significantly across model architectures. Although inherent priors are often miscalibrated, the updating mechanism itself remains structured and principled. We further validate these findings in a simulated agent task and propose prompting strategies that effectively recalibrate priors with minimal computational cost.

[391] Vox Deorum: A Hybrid LLM Architecture for 4X / Grand Strategy Game AI – Lessons from Civilization V

John Chen, Sihan Cheng, Can Gurkan, Ryan Lay, Moez Salahuddin

Main category: cs.AI

TL;DR: LLMs integrated into 4X games via hybrid architecture achieve competitive gameplay with unique play styles, enabling natural human-AI interactions in complex strategy games.

Details

Motivation: LLMs' natural language reasoning capabilities make them promising for 4X/grand strategy games to enable more natural human-AI interactions like collaboration and negotiation, but deployment challenges exist due to game complexity, latency, and cost constraints.

Method: Developed Vox Deorum, a hybrid LLM+X architecture for Civilization V with Vox Populi mod, using layered design where LLMs handle macro-strategic reasoning while delegating tactical execution to subsystems (algorithmic AI or future RL AI).

Result: Tested with 2,327 complete games comparing two open-source LLMs against Vox Populi’s enhanced AI. LLMs achieved competitive end-to-end gameplay while exhibiting substantially different play styles from algorithmic AI and from each other.

Conclusion: Establishes viable architecture for integrating LLMs in commercial 4X games, opening new opportunities for game design and agentic AI research by enabling natural human-AI interactions in complex strategy environments.

Abstract: Large Language Models’ capacity to reason in natural language makes them uniquely promising for 4X and grand strategy games, enabling more natural human-AI gameplay interactions such as collaboration and negotiation. However, these games present unique challenges due to their complexity and long-horizon nature, while latency and cost factors may hinder LLMs’ real-world deployment. Working on a classic 4X strategy game, Sid Meier’s Civilization V with the Vox Populi mod, we introduce Vox Deorum, a hybrid LLM+X architecture. Our layered technical design empowers LLMs to handle macro-strategic reasoning, delegating tactical execution to subsystems (e.g., algorithmic AI or reinforcement learning AI in the future). We validate our approach through 2,327 complete games, comparing two open-source LLMs with a simple prompt against Vox Populi’s enhanced AI. Results show that LLMs achieve competitive end-to-end gameplay while exhibiting play styles that diverge substantially from algorithmic AI and from each other. Our work establishes a viable architecture for integrating LLMs in commercial 4X games, opening new opportunities for game design and agentic AI research.

[392] ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning

Weijie Zhou, Xuangtang Xiong, Ye Tian, Lijun Yue, Xinyu Wu, Wei Li, Chaoyang Zhao, Honghui Dong, Ming Tang, Jinqiao Wang, Zhengyou Zhang

Main category: cs.AI

TL;DR: ESearch-R1 is a cost-aware embodied reasoning framework that uses HC-GRPO to optimize MLLM agents for balancing physical exploration costs against human interaction costs when handling ambiguous instructions.

Details

Motivation: Current embodied agents fail to strategically balance the high cost of physical exploration against the cognitive cost of human interaction when facing ambiguous natural language instructions. They treat disambiguation as passive perception rather than strategic reasoning to minimize total task execution costs.

Method: Proposes ESearch-R1 framework that unifies interactive dialogue (Ask), episodic memory retrieval (GetMemory), and physical navigation (Navigate) into a single decision process. Introduces HC-GRPO (Heterogeneous Cost-Aware Group Relative Policy Optimization) which optimizes MLLMs by sampling groups of reasoning trajectories and reinforcing those achieving optimal trade-off between information gain and heterogeneous costs (navigation time, human attention).

Result: Extensive experiments in AI2-THOR demonstrate ESearch-R1 significantly outperforms standard ReAct-based agents, improving task success rates while reducing total operational costs by approximately 50%.

Conclusion: ESearch-R1 validates the effectiveness of GRPO in aligning MLLM agents with physical world constraints by enabling strategic reasoning to minimize total task execution costs when handling ambiguous instructions.

Abstract: Multimodal Large Language Models (MLLMs) have empowered embodied agents with remarkable capabilities in planning and reasoning. However, when facing ambiguous natural language instructions (e.g., “fetch the tool” in a cluttered room), current agents often fail to balance the high cost of physical exploration against the cognitive cost of human interaction. They typically treat disambiguation as a passive perception problem, lacking the strategic reasoning to minimize total task execution costs. To bridge this gap, we propose ESearch-R1, a cost-aware embodied reasoning framework that unifies interactive dialogue (Ask), episodic memory retrieval (GetMemory), and physical navigation (Navigate) into a single decision process. We introduce HC-GRPO (Heterogeneous Cost-Aware Group Relative Policy Optimization). Unlike traditional PPO which relies on a separate value critic, HC-GRPO optimizes the MLLM by sampling groups of reasoning trajectories and reinforcing those that achieve the optimal trade-off between information gain and heterogeneous costs (e.g., navigate time, and human attention). Extensive experiments in AI2-THOR demonstrate that ESearch-R1 significantly outperforms standard ReAct-based agents. It improves task success rates while reducing total operational costs by approximately 50%, validating the effectiveness of GRPO in aligning MLLM agents with physical world constraints.

[393] Reflective Confidence: Correcting Reasoning Flaws via Online Self-Correction

Qinglin Zeng, Jing Yang, Keze Wang

Main category: cs.AI

TL;DR: Reflective confidence framework transforms low-confidence signals into reflection triggers for self-correction instead of early stopping, improving accuracy on math reasoning tasks at similar computational cost.

Details

Motivation: Self-consistency and ensemble methods for LLM reasoning incur high computational costs. Early stopping strategies like DeepConf waste partial computation by discarding incomplete reasoning paths when confidence is low.

Method: Proposes reflective confidence framework where low-confidence signals trigger reflection prompts instead of termination. The model analyzes current reasoning state, identifies errors, and continues along corrected trajectories rather than discarding partial work.

Result: Experiments on mathematical reasoning benchmarks including AIME 2025 show significant accuracy improvements over advanced early-stopping baselines at comparable computational cost.

Conclusion: Proactive self-correction through reflection is more effective than passive discarding of low-confidence reasoning paths, validating the reflective confidence approach for efficient and accurate LLM reasoning.

Abstract: Large language models (LLMs) have achieved strong performance on complex reasoning tasks using techniques such as chain-of-thought and self-consistency. However, ensemble-based approaches, especially self-consistency which relies on multiple reasoning trajectories, often incur substantial computational overhead. To improve efficiency, prior work has leveraged internal confidence signals, where early stopping strategies such as DeepConf reduce cost by terminating low-confidence trajectories. However, this strategy discards incomplete reasoning paths and wastes partial computation. We propose reflective confidence, a novel reasoning framework that transforms low-confidence signals from termination indicators into reflection triggers. When confidence falls below a threshold, instead of stopping generation, the model produces a reflection prompt to analyze the current reasoning state, identify potential errors, and continue generation along a corrected trajectory. Experiments on mathematical reasoning benchmarks, including AIME 2025, demonstrate significant accuracy improvements over advanced early-stopping baselines at comparable computational cost, validating the effectiveness of proactive self-correction over passive discarding.

[394] Assignment-Routing Optimization: Solvers for Problems Under Constraints

Yuan Qilong, Michal Pavelka

Main category: cs.AI

TL;DR: Exact MIP solver for Joint Routing-Assignment problem outperforms existing methods in robotic packaging scenarios with complex constraints.

Details

Motivation: Address practical packaging-planning scenarios with richer constraints like multiple placeholder options, time-frame restrictions, and multi-class item packaging that previous JRA solvers couldn't handle efficiently.

Method: Extend previous exact MIP solvers with Gurobi and cutting-plane subtour elimination to develop a tailored solver for practical JRA problems with complex constraints.

Result: Achieves global optima with stable low computation times, outperforming shaking-based exact solver by orders of magnitude, and beats greedy baselines with 14% average deviation in optimal distances.

Conclusion: MIP-based JRA optimization is practically applicable for robotic packaging, motion planning, and complex logistics, demonstrating both efficiency and solution quality.

Abstract: We study the Joint Routing-Assignment (JRA) problem in which items must be assigned one-to-one to placeholders while simultaneously determining a Hamiltonian cycle visiting all nodes exactly once. Extending previous exact MIP solvers with Gurobi and cutting-plane subtour elimination, we develop a solver tailored for practical packaging-planning scenarios with richer constraints.These include multiple placeholder options, time-frame restrictions, and multi-class item packaging. Experiments on 46 mobile manipulation datasets demonstrate that the proposed MIP approach achieves global optima with stable and low computation times, significantly outperforming the shaking-based exact solver by up to an orders of magnitude. Compared to greedy baselines, the MIP solutions achieve consistent optimal distances with an average deviation of 14% for simple heuristics, confirming both efficiency and solution quality. The results highlight the practical applicability of MIP-based JRA optimization for robotic packaging, motion planning, and complex logistics .

[395] ChronoDreamer: Action-Conditioned World Model as an Online Simulator for Robotic Planning

Zhenhao Zhou, Dan Negrut

Main category: cs.AI

TL;DR: ChronoDreamer is an action-conditioned world model for robotic manipulation that predicts future video, contact distributions, and joint states using spatial-temporal transformers with MaskGIT-style training, and uses a vision-language model to evaluate safety of predicted actions.

Details

Motivation: The paper aims to develop a world model for contact-rich robotic manipulation that can predict future states (visual, contact, proprioceptive) and evaluate action safety before execution, addressing the challenge of safe manipulation in complex environments with both rigid and deformable objects.

Method: Uses spatial-temporal transformer with MaskGIT-style masked prediction trained on DreamerBench dataset. Contact is encoded as depth-weighted Gaussian splat images that render 3D forces into camera-aligned format. At inference, predicted rollouts are evaluated by a vision-language model that reasons about collision likelihood for rejection sampling.

Result: The model preserves spatial coherence during non-contact motion and generates plausible contact predictions. The LLM-based judge successfully distinguishes collision from non-collision trajectories. Evaluation is performed on DreamerBench simulation dataset with synchronized RGB, contact splat, proprioception, and physics annotations.

Conclusion: ChronoDreamer demonstrates effective prediction of future states for contact-rich manipulation and provides a safety evaluation mechanism through vision-language model-based rejection sampling, enabling safer robotic manipulation in complex environments.

Abstract: We present ChronoDreamer, an action-conditioned world model for contact-rich robotic manipulation. Given a history of egocentric RGB frames, contact maps, actions, and joint states, ChronoDreamer predicts future video frames, contact distributions, and joint angles via a spatial-temporal transformer trained with MaskGIT-style masked prediction. Contact is encoded as depth-weighted Gaussian splat images that render 3D forces into a camera-aligned format suitable for vision backbones. At inference, predicted rollouts are evaluated by a vision-language model that reasons about collision likelihood, enabling rejection sampling of unsafe actions before execution. We train and evaluate on DreamerBench, a simulation dataset generated with Project Chrono that provides synchronized RGB, contact splat, proprioception, and physics annotations across rigid and deformable object scenarios. Qualitative results demonstrate that the model preserves spatial coherence during non-contact motion and generates plausible contact predictions, while the LLM-based judge distinguishes collision from non-collision trajectories.

[396] ASTIF: Adaptive Semantic-Temporal Integration for Cryptocurrency Price Forecasting

Hafiz Saif Ur Rehman, Ling Liu, Kaleem Ullah Qasim

Main category: cs.AI

TL;DR: ASTIF is a hybrid AI system that adaptively fuses semantic market narratives with temporal price data using confidence-based meta-learning for cryptocurrency forecasting, outperforming state-of-the-art baselines.

Details

Motivation: Financial forecasting models struggle with static architectures that can't integrate heterogeneous knowledge sources or adapt to rapid market regime shifts. Existing approaches relying only on historical price data neglect semantic drivers like policy uncertainty and market narratives.

Method: ASTIF integrates three components: 1) Dual-channel Small Language Model with MirrorPrompt for semantic market cues and numerical trends, 2) Hybrid LSTM Random Forest for sequential temporal dependencies, 3) Confidence-aware meta-learner as adaptive inference layer that modulates predictor contributions based on real-time uncertainty.

Result: ASTIF outperforms leading deep learning and Transformer baselines (Informer, TFT) on diverse AI-focused cryptocurrencies and major tech stocks (2020-2024). Ablation studies confirm the adaptive meta-learning mechanism’s critical role in mitigating risk by shifting reliance between semantic and temporal channels during market turbulence.

Conclusion: The research contributes a scalable, knowledge-based solution for fusing quantitative and qualitative data in non-stationary financial environments, demonstrating effective adaptive integration of semantic and temporal information for improved forecasting.

Abstract: Financial time series forecasting is fundamentally an information fusion challenge, yet most existing models rely on static architectures that struggle to integrate heterogeneous knowledge sources or adjust to rapid regime shifts. Conventional approaches, relying exclusively on historical price sequences, often neglect the semantic drivers of volatility such as policy uncertainty and market narratives. To address these limitations, we propose the ASTIF (Adaptive Semantic-Temporal Integration for Cryptocurrency Price Forecasting), a hybrid intelligent system that adapts its forecasting strategy in real time through confidence-based meta-learning. The framework integrates three complementary components. A dual-channel Small Language Model using MirrorPrompt extracts semantic market cues alongside numerical trends. A hybrid LSTM Random Forest model captures sequential temporal dependencies. A confidence-aware meta-learner functions as an adaptive inference layer, modulating each predictor’s contribution based on its real-time uncertainty. Experimental evaluation on a diverse dataset of AI-focused cryptocurrencies and major technology stocks from 2020 to 2024 shows that ASTIF outperforms leading deep learning and Transformer baselines (e.g., Informer, TFT). The ablation studies further confirm the critical role of the adaptive meta-learning mechanism, which successfully mitigates risk by shifting reliance between semantic and temporal channels during market turbulence. The research contributes a scalable, knowledge-based solution for fusing quantitative and qualitative data in non-stationary environments.

[397] Automatic Adaptation to Concept Complexity and Subjective Natural Concepts: A Cognitive Model based on Chunking

Dmitry Bennett, Fernand Gobet

Main category: cs.AI

TL;DR: CogAct computational model uses chunking mechanisms to learn concepts across domains, outperforming other models by handling raw data without task-specific modifications.

Details

Motivation: To understand fundamental psychological processes underlying concept formation in memory and address limitations of existing models that struggle with adaptive learning across diverse domains.

Method: Developed CogAct computational model grounded in cognitive processes (chunking, attention, STM/LTM) that learns from raw data without pre-processing. Tested across logical functions, artificial categories, and natural domains (literature, chess, music).

Result: CogAct successfully learns diverse categories adaptively, captures subjective human judgments in music from raw scores, and outperforms deep-learning models that require task-specific architecture changes.

Conclusion: Chunking mechanisms are essential for concept learning, enabling adaptive learning across complexity. Approach integrates concept learning with cognitive psychology theories and enables modeling of subjective concept spaces rather than average participants.

Abstract: A key issue in cognitive science concerns the fundamental psychological processes that underlie the formation and retrieval of multiple types of concepts in short-term and long-term memory (STM and LTM, respectively). We propose that chunking mechanisms play an essential role and show how the CogAct computational model grounds concept learning in fundamental cognitive processes and structures (such as chunking, attention, STM and LTM). First are the in-principle demonstrations, with CogAct automatically adapting to learn a range of categories from simple logical functions, to artificial categories, to natural raw (as opposed to natural pre-processed) concepts in the dissimilar domains of literature, chess and music. This kind of adaptive learning is difficult for most other psychological models, e.g., with cognitive models stopping at modelling artificial categories and (non-GPT) models based on deep learning requiring task-specific changes to the architecture. Secondly, we offer novel ways of designing human benchmarks for concept learning experiments and simulations accounting for subjectivity, ways to control for individual human experiences, all while keeping to real-life complex categories. We ground CogAct in simulations of subjective conceptual spaces of individual human participants, capturing humans subjective judgements in music, with the models learning from raw music score data without bootstrapping to pre-built knowledge structures. The CogAct simulations are compared to those obtained by a deep-learning model. These findings integrate concept learning and adaptation to complexity into the broader theories of cognitive psychology. Our approach may also be used in psychological applications that move away from modelling the average participant and towards capturing subjective concept space.

[398] IntelliCode: A Multi-Agent LLM Tutoring System with Centralized Learner Modeling

Jones David, Shreya Ghosh

Main category: cs.AI

TL;DR: IntelliCode is a multi-agent LLM tutoring system with persistent learner state tracking for transparent, long-term pedagogical support.

Details

Motivation: Current LLM-based tutors lack persistent representations of learner knowledge, making it difficult to provide principled, transparent, and long-term pedagogical support.

Method: Uses a centralized, versioned learner state with six specialized agents coordinated by a StateGraph Orchestrator: skill assessment, learner profiling, graduated hinting, curriculum selection, spaced repetition, and engagement monitoring.

Result: Validation with simulated learners shows stable state updates, improved task success with graduated hints, and diverse curriculum coverage.

Conclusion: IntelliCode demonstrates how persistent learner modeling, orchestrated multi-agent reasoning, and principled instructional design can produce transparent and reliable LLM-driven tutoring.

Abstract: LLM-based tutors are typically single-turn assistants that lack persistent representations of learner knowledge, making it difficult to provide principled, transparent, and long-term pedagogical support. We introduce IntelliCode, a multi-agent LLM tutoring system built around a centralized, versioned learner state that integrates mastery estimates, misconceptions, review schedules, and engagement signals. A StateGraph Orchestrator coordinates six specialized agents: skill assessment, learner profiling, graduated hinting, curriculum selection, spaced repetition, and engagement monitoring, each operating as a pure transformation over the shared state under a single-writer policy. This architecture enables auditable mastery updates, proficiency-aware hints, dependency-aware curriculum adaptation, and safety-aligned prompting. The demo showcases an end-to-end tutoring workflow: a learner attempts a DSA problem, receives a conceptual hint when stuck, submits a corrected solution, and immediately sees mastery updates and a personalized review interval. We report validation results with simulated learners, showing stable state updates, improved task success with graduated hints, and diverse curriculum coverage. IntelliCode demonstrates how persistent learner modeling, orchestrated multi-agent reasoning, and principled instructional design can be combined to produce transparent and reliable LLM-driven tutoring.

Yosuke Taniuchi, Chie Hieida, Atsushi Noritake, Kazushi Ikeda, Masaki Isoda

Main category: cs.AI

TL;DR: Monkeys appear to use objective reward differences rather than inferring others’ subjective valuations during social comparison, with the External Comparison Model outperforming models that incorporate subjective inference.

Details

Motivation: To understand the computational mechanisms behind social comparison in primates - specifically whether monkeys merely recognize objective reward differences or infer others' subjective reward valuations.

Method: Developed three computational models (Internal Prediction Model for subjective inference, No Comparison Model ignoring partner info, External Comparison Model for objective rewards) and tested them using multi-layered multimodal latent Dirichlet allocation on monkey behavior data.

Result: The External Comparison Model achieved the highest classification score (Rand Index 0.88 vs 0.79 for IPM), suggesting social comparison relies on objective reward differences rather than subjective state inferences.

Conclusion: Social comparison in monkeys appears to be based on comparing objective reward outcomes rather than making inferences about others’ subjective valuations.

Abstract: Social comparison – the process of evaluating one’s rewards relative to others – plays a fundamental role in primate social cognition. However, it remains unknown from a computational perspective how information about others’ rewards affects the evaluation of one’s own reward. With a constructive approach, this study examines whether monkeys merely recognize objective reward differences or, instead, infer others’ subjective reward valuations. We developed three computational models with varying degrees of social information processing: an Internal Prediction Model (IPM), which infers the partner’s subjective values; a No Comparison Model (NCM), which disregards partner information; and an External Comparison Model (ECM), which directly incorporates the partner’s objective rewards. To test model performance, we used a multi-layered, multimodal latent Dirichlet allocation. We trained the models on a dataset containing the behavior of a pair of monkeys, their rewards, and the conditioned stimuli. Then, we evaluated the models’ ability to classify subjective values across pre-defined experimental conditions. The ECM achieved the highest classification score in the Rand Index (0.88 vs. 0.79 for the IPM) under our settings, suggesting that social comparison relies on objective reward differences rather than inferences about subjective states.

[400] KeenKT: Knowledge Mastery-State Disambiguation for Knowledge Tracing

Zhifei Li, Lifan Chen, Jiali Yi, Xiaoju Hou, Yue Zhao, Wenxin Huang, Miao Zhang, Kui Xiao, Bing Yang

Main category: cs.AI

TL;DR: KeenKT uses Normal-Inverse-Gaussian distributions to model student knowledge states, addressing ambiguity in traditional single-point estimates by distinguishing true ability from behavioral fluctuations.

Details

Motivation: Current KT methods rely on single-point estimates that cannot distinguish between true ability and behavioral fluctuations (outburst/carelessness), creating ambiguity in judging student mastery.

Method: Proposes KeenKT with: 1) NIG distribution representation of knowledge states, 2) NIG-distance-based attention for state evolution, 3) diffusion-based denoising reconstruction loss, and 4) distributional contrastive learning loss for robustness.

Result: Outperforms SOTA KT models on six public datasets with maximum AUC improvement of 5.85% and maximum ACC improvement of 6.89%, showing better prediction accuracy and sensitivity to behavioral fluctuations.

Conclusion: KeenKT effectively addresses the disambiguation problem in KT by modeling knowledge states with NIG distributions, providing more accurate and sensitive tracking of student mastery while distinguishing true ability from behavioral noise.

Abstract: Knowledge Tracing (KT) aims to dynamically model a student’s mastery of knowledge concepts based on their historical learning interactions. Most current methods rely on single-point estimates, which cannot distinguish true ability from outburst or carelessness, creating ambiguity in judging mastery. To address this issue, we propose a Knowledge Mastery-State Disambiguation for Knowledge Tracing model (KeenKT), which represents a student’s knowledge state at each interaction using a Normal-Inverse-Gaussian (NIG) distribution, thereby capturing the fluctuations in student learning behaviors. Furthermore, we design an NIG-distance-based attention mechanism to model the dynamic evolution of the knowledge state. In addition, we introduce a diffusion-based denoising reconstruction loss and a distributional contrastive learning loss to enhance the model’s robustness. Extensive experiments on six public datasets demonstrate that KeenKT outperforms SOTA KT models in terms of prediction accuracy and sensitivity to behavioral fluctuations. The proposed method yields the maximum AUC improvement of 5.85% and the maximum ACC improvement of 6.89%.

[401] Counterfactual Basis Extension and Representational Geometry: An MDL-Constrained Model of Conceptual Growth

Chainarong Amornbunchornvej

Main category: cs.AI

TL;DR: A geometric framework for conceptual growth where representational expansion occurs through MDL-based basis extension when existing representations fail, with imagination serving only to expose structured residual errors.

Details

Motivation: Current learning models assume fixed representational bases, but conceptual learning requires expanding these bases. The paper addresses when and how representational structures can expand in principled ways.

Method: Proposes a geometric framework where experience is represented as vectors in conceptual subspaces. Systematic failures create residuals, and conceptual extensions are restricted to low-rank, admissible transformations evaluated under Minimum Description Length (MDL) criterion.

Result: Shows that MDL-accepted extensions must lie within residual spans from experience, while orthogonal extensions increase description length and are rejected. This provides a conservative account where imagination only contributes by exposing structured residual errors.

Conclusion: Conceptual development is an error-driven, geometry-constrained process of basis extension. Imagination plays a limited but important role in learning by highlighting representational failures, with MDL providing normative selection principles for representational change.

Abstract: Concept learning becomes possible only when existing representations fail to account for experience. Most models of learning and inference, however, presuppose a fixed representational basis within which belief updating occurs. In this paper, I address a prior question: under what structural conditions can the representational basis itself expand in a principled and selective way? I propose a geometric framework in which conceptual growth is modeled as admissible basis extension evaluated under a Minimum Description Length (MDL) criterion. Experience, whether externally observed or internally simulated, is represented as vectors relative to a current conceptual subspace. Residual components capture systematic representational failure, and candidate conceptual extensions are restricted to low-rank, admissible transformations. I show that any MDL-accepted extension can be chosen so that its novel directions lie entirely within the residual span induced by experience, while extensions orthogonal to this span strictly increase description length and are therefore rejected. This yields a conservative account of imagination and conceptual innovation. Internally generated counterfactual representations contribute to learning only insofar as they expose or amplify structured residual error, and cannot introduce arbitrary novelty. I further distinguish representational counterfactuals–counterfactuals over an agent’s conceptual basis–from causal or value-level counterfactuals, and show how MDL provides a normative selection principle governing representational change. Overall, the framework characterizes conceptual development as an error-driven, geometry-constrained process of basis extension, clarifying both the role and the limits of imagination in learning and theory change.

[402] MEEA: Mere Exposure Effect-Driven Confrontational Optimization for LLM Jailbreaking

Jianyi Zhang, Shizhao Liu, Ziyin Zhou, Zhen Li

Main category: cs.AI

TL;DR: MEEA is a psychology-inspired automated black-box jailbreak framework that uses repeated low-toxicity semantic exposure to gradually erode LLM safety alignment through multi-turn interactions.

Details

Motivation: Existing jailbreak studies assume static safety boundaries and fail to account for how contextual interactions dynamically influence model behavior, leading to limited stability and generalization in safety evaluation.

Method: MEEA constructs semantically progressive prompt chains and optimizes them using simulated annealing guided by semantic similarity, toxicity, and jailbreak effectiveness, leveraging the mere exposure effect through repeated low-toxicity exposure.

Result: MEEA consistently achieves higher attack success rates than seven baselines on models like GPT-4, Claude-3.5, and DeepSeek-R1, with average ASR improvement exceeding 20%. Ablation studies validate the necessity of both annealing optimization and contextual exposure mechanisms.

Conclusion: LLM safety behavior is inherently dynamic and history-dependent, challenging the assumption of static alignment boundaries and highlighting the need for interaction-aware safety evaluation and defense mechanisms.

Abstract: The rapid advancement of large language models (LLMs) has intensified concerns about the robustness of their safety alignment. While existing jailbreak studies explore both single-turn and multi-turn strategies, most implicitly assume a static safety boundary and fail to account for how contextual interactions dynamically influence model behavior, leading to limited stability and generalization. Motivated by this gap, we propose MEEA (Mere Exposure Effect Attack), a psychology-inspired, fully automated black-box framework for evaluating multi-turn safety robustness, grounded in the mere exposure effect. MEEA leverages repeated low-toxicity semantic exposure to induce a gradual shift in a model’s effective safety threshold, enabling progressive erosion of alignment constraints over sustained interactions. Concretely, MEEA constructs semantically progressive prompt chains and optimizes them using a simulated annealing strategy guided by semantic similarity, toxicity, and jailbreak effectiveness. Extensive experiments on both closed-source and open-source models, including GPT-4, Claude-3.5, and DeepSeek-R1, demonstrate that MEEA consistently achieves higher attack success rates than seven representative baselines, with an average Attack Success Rate (ASR) improvement exceeding 20%. Ablation studies further validate the necessity of both annealing-based optimization and contextual exposure mechanisms. Beyond improved attack effectiveness, our findings indicate that LLM safety behavior is inherently dynamic and history-dependent, challenging the common assumption of static alignment boundaries and highlighting the need for interaction-aware safety evaluation and defense mechanisms. Our code is available at: https://github.com/Carney-lsz/MEEA

[403] The Dead Salmons of AI Interpretability

Maxime Méloux, Giada Dirupo, François Portet, Maxime Peyrard

Main category: cs.AI

TL;DR: The paper critiques current AI interpretability methods for producing false discoveries similar to the “dead salmon” MRI study, and proposes reframing interpretability as statistical inference with explicit hypothesis testing and uncertainty quantification.

Details

Motivation: Current AI interpretability methods (feature attribution, probing, sparse auto-encoding, causal analyses) can produce plausible-looking but false explanations even for randomly initialized neural networks, similar to the neuroscience "dead salmon" artifact where standard analyses reported brain activity in a dead fish. This reveals fundamental problems with current interpretability practices.

Method: Proposes a pragmatic statistical-causal reframing where explanations of computational systems are treated as parameters of a statistical model, inferred from computational traces. Interpretability methods become statistical estimators that should be tested against explicit alternative computational hypotheses with uncertainty quantification.

Result: Identifies critical theoretical issues including the identifiability of common interpretability queries, which helps explain the field’s susceptibility to false discoveries, poor generalizability, and high variance. The statistical inference framework provides a way to address these problems.

Conclusion: Situating interpretability within standard statistical inference tools offers a path toward turning AI interpretability into a pragmatic and rigorous science, moving beyond current practices that risk producing meaningless artifacts.

Abstract: In a striking neuroscience study, the authors placed a dead salmon in an MRI scanner and showed it images of humans in social situations. Astonishingly, standard analyses of the time reported brain regions predictive of social emotions. The explanation, of course, was not supernatural cognition but a cautionary tale about misapplied statistical inference. In AI interpretability, reports of similar ‘‘dead salmon’’ artifacts abound: feature attribution, probing, sparse auto-encoding, and even causal analyses can produce plausible-looking explanations for randomly initialized neural networks. In this work, we examine this phenomenon and argue for a pragmatic statistical-causal reframing: explanations of computational systems should be treated as parameters of a (statistical) model, inferred from computational traces. This perspective goes beyond simply measuring statistical variability of explanations due to finite sampling of input data; interpretability methods become statistical estimators, and findings should be tested against explicit and meaningful alternative computational hypotheses, with uncertainty quantified with respect to the postulated statistical model. It also highlights important theoretical issues, such as the identifiability of common interpretability queries, which we argue is critical to understand the field’s susceptibility to false discoveries, poor generalizability, and high variance. More broadly, situating interpretability within the standard toolkit of statistical inference opens promising avenues for future work aimed at turning AI interpretability into a pragmatic and rigorous science.

[404] HARBOR: Holistic Adaptive Risk assessment model for BehaviORal healthcare

Aditya Siddhant

Main category: cs.AI

TL;DR: HARBOR is a behavioral health-aware language model that predicts mood/risk scores (-3 to +3) using multimodal patient data, outperforming traditional ML and proprietary LLMs with 69% accuracy.

Details

Motivation: Behavioral healthcare risk assessment is challenging due to multimodal patient data and temporal dynamics of mood disorders. While LLMs show strong reasoning capabilities, their effectiveness in structured clinical risk scoring remains unclear.

Method: Introduce HARBOR, a behavioral health-aware language model designed to predict the Harbor Risk Score (HRS) on a -3 (severe depression) to +3 (mania) scale. Also release PEARL dataset containing 4 years of monthly observations from 3 patients with physiological, behavioral, and self-reported mental health signals.

Result: HARBOR outperforms classical baselines and off-the-shelf LLMs, achieving 69% accuracy compared to 54% for logistic regression and 29% for the strongest proprietary LLM baseline.

Conclusion: The HARBOR model demonstrates superior performance in behavioral health risk assessment compared to traditional ML methods and existing LLMs, showing promise for structured clinical risk scoring applications.

Abstract: Behavioral healthcare risk assessment remains a challenging problem due to the highly multimodal nature of patient data and the temporal dynamics of mood and affective disorders. While large language models (LLMs) have demonstrated strong reasoning capabilities, their effectiveness in structured clinical risk scoring remains unclear. In this work, we introduce HARBOR, a behavioral health aware language model designed to predict a discrete mood and risk score, termed the Harbor Risk Score (HRS), on an integer scale from -3 (severe depression) to +3 (mania). We also release PEARL, a longitudinal behavioral healthcare dataset spanning four years of monthly observations from three patients, containing physiological, behavioral, and self reported mental health signals. We benchmark traditional machine learning models, proprietary LLMs, and HARBOR across multiple evaluation settings and ablations. Our results show that HARBOR outperforms classical baselines and off the shelf LLMs, achieving 69 percent accuracy compared to 54 percent for logistic regression and 29 percent for the strongest proprietary LLM baseline.

[405] CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

Zijun Gao, Zhikun Xu, Xiao Ye, Ben Zhou

Main category: cs.AI

TL;DR: CORE is a reinforcement learning framework that uses explicit concepts as supervision signals to improve LLMs’ conceptual reasoning in math, bridging the gap between pattern reuse and genuine understanding.

Details

Motivation: LLMs can solve math exercises through pattern recognition but fail when genuine conceptual understanding is required. Current RLVR pipelines reinforce final answers but lack fine-grained conceptual signals, leading to improvement in pattern reuse rather than conceptual application.

Method: CORE framework: (1) synthesizes concept-aligned quizzes from textbook resources, (2) injects brief concept snippets during rollouts to elicit concept-primed trajectories, and (3) reinforces conceptual reasoning via trajectory replacement after group failures using forward-KL constraint or GRPO on concept-aligned quizzes.

Result: CORE delivers consistent gains over vanilla and SFT baselines on both in-domain concept-exercise suites and diverse out-of-domain math benchmarks across several models.

Conclusion: CORE provides fine-grained conceptual supervision that bridges problem-solving competence and genuine conceptual reasoning, while remaining algorithm- and verifier-agnostic, unifying direct training on concept-aligned quizzes and concept-injected rollouts under outcome regularization.

Abstract: Large language models (LLMs) often solve challenging math exercises yet fail to apply the concept right when the problem requires genuine understanding. Popular Reinforcement Learning with Verifiable Rewards (RLVR) pipelines reinforce final answers but provide little fine-grained conceptual signal, so models improve at pattern reuse rather than conceptual applications. We introduce CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal. Starting from a high-quality, low-contamination textbook resource that links verifiable exercises to concise concept descriptions, we run a sanity probe showing LLMs can restate definitions but fail concept-linked quizzes, quantifying the conceptual reasoning gap. CORE then (i) synthesizes concept-aligned quizzes, (ii) injects brief concept snippets during rollouts to elicit concept-primed trajectories, and (iii) reinforces conceptual reasoning via trajectory replacement after group failures, a lightweight forward-KL constraint that aligns unguided with concept-primed policies, or standard GRPO directly on concept-aligned quizzes. Across several models, CORE delivers consistent gains over vanilla and SFT baselines on both in-domain concept-exercise suites and diverse out-of-domain math benchmarks. CORE unifies direct training on concept-aligned quizzes and concept-injected rollouts under outcome regularization. It provides fine-grained conceptual supervision that bridges problem-solving competence and genuine conceptual reasoning, while remaining algorithm- and verifier-agnostic.

[406] Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models

Gökdeniz Gülmez

Main category: cs.AI

TL;DR: Gabliteration is a neural weight modification technique using adaptive multi-directional projections with regularized layer selection to modify specific behaviors while minimizing quality degradation.

Details

Motivation: Addresses limitations of existing ablation methods that compromise overall model quality when attempting to modify specific behavioral patterns.

Method: Uses adaptive multi-directional projections with regularized layer selection, dynamic layer optimization, regularized projection matrices, and adaptive scaling mechanisms.

Result: Validated through gabliterated-v1 model series (0.6B to 4B parameters) available on Hugging Face, demonstrating practical applicability across multiple model scales.

Conclusion: Achieves theoretically superior weight modification while minimizing quality degradation in unrelated domains, advancing beyond traditional ablation methods.

Abstract: We present Gabliteration, a novel neural weight modification technique that advances beyond traditional abliteration methods by implementing adaptive multi-directional projections with regularized layer selection. Our approach addresses the fundamental limitation of existing methods that compromise model quality while attempting to modify specific behavioral patterns. Through dynamic layer optimization, regularized projection matrices, and adaptive scaling mechanisms, we achieve theoretically superior weight modification while minimizing quality degradation in unrelated domains. We validate our method through the gabliterated-v1 model series (0.6B to 4B parameters) available on Hugging Face, demonstrating practical applicability across multiple model scales.

[407] Multimodal Bayesian Network for Robust Assessment of Casualties in Autonomous Triage

Szymon Rusiecki, Cecilia G. Morales, Kimberly Elenberg, Leonard Weiss, Artur Dubrawski

Main category: cs.AI

TL;DR: A Bayesian network decision support framework fusing computer vision outputs for automated triage in mass casualty incidents, achieving 3x performance improvement over vision-only baselines.

Details

Motivation: Mass Casualty Incidents overwhelm emergency systems, causing delays and errors in casualty assessment that lead to preventable deaths. There's a need for automated decision support to assist emergency responders.

Method: A decision support framework that fuses outputs from multiple computer vision models (estimating hemorrhage, respiratory distress, alertness, trauma) into a Bayesian network constructed entirely from expert-defined rules. Unlike data-driven models, it doesn’t require training data, supports inference with incomplete information, and is robust to noisy observations.

Result: In DARPA Triage Challenge field scenarios: physiological assessment accuracy improved from 15% to 42% (scenario 1) and 19% to 46% (scenario 2) - nearly threefold increase. Overall triage accuracy increased from 14% to 53%, and diagnostic coverage expanded from 31% to 95% of cases requiring assessment.

Conclusion: Expert-knowledge-guided probabilistic reasoning significantly enhances automated triage systems, offering a promising approach to support emergency responders in MCIs. This approach helped Team Chiron achieve 4th place out of 11 teams in the DTC physical round.

Abstract: Mass Casualty Incidents can overwhelm emergency medical systems and resulting delays or errors in the assessment of casualties can lead to preventable deaths. We present a decision support framework that fuses outputs from multiple computer vision models, estimating signs of severe hemorrhage, respiratory distress, physical alertness, or visible trauma, into a Bayesian network constructed entirely from expert-defined rules. Unlike traditional data-driven models, our approach does not require training data, supports inference with incomplete information, and is robust to noisy or uncertain observations. We report performance for two missions involving 11 and 9 casualties, respectively, where our Bayesian network model substantially outperformed vision-only baselines during evaluation of our system in the DARPA Triage Challenge (DTC) field scenarios. The accuracy of physiological assessment improved from 15% to 42% in the first scenario and from 19% to 46% in the second, representing nearly threefold increase in performance. More importantly, overall triage accuracy increased from 14% to 53% in all patients, while the diagnostic coverage of the system expanded from 31% to 95% of the cases requiring assessment. These results demonstrate that expert-knowledge-guided probabilistic reasoning can significantly enhance automated triage systems, offering a promising approach to supporting emergency responders in MCIs. This approach enabled Team Chiron to achieve 4th place out of 11 teams during the 1st physical round of the DTC.

[408] Clustering-based Transfer Learning for Dynamic Multimodal MultiObjective Evolutionary Algorithm

Li Yan, Bolun Liu, Chao Li, Jing Liang, Kunjie Yu, Caitong Yue, Xuzhao Chai, Boyang Qu

Main category: cs.AI

TL;DR: Proposes a new benchmark suite and algorithm for dynamic multimodal multiobjective optimization that addresses both tracking multiple Pareto optimal sets and maintaining diversity in time-varying environments.

Details

Motivation: Existing dynamic multiobjective evolutionary algorithms neglect solution modality, while static multimodal algorithms lack adaptability to dynamic changes. There's a need for approaches that handle both dynamic environments and multimodal solutions simultaneously.

Method: Two main contributions: 1) A new benchmark suite of dynamic multimodal multiobjective test functions combining properties of both dynamic and multimodal optimization. 2) A novel algorithm with Clustering-based Autoencoder prediction dynamic response mechanism that uses autoencoder to process matched clusters for diverse initial population, plus adaptive niching strategy for convergence-diversity balance.

Result: Empirical analysis on 12 instances shows the proposed algorithm preserves population diversity more effectively in decision space and achieves superior convergence in objective space compared to state-of-the-art dynamic and multimodal multiobjective evolutionary algorithms.

Conclusion: The paper successfully addresses the dual challenge of dynamic multimodal multiobjective optimization by providing both a rigorous evaluation platform (benchmark suite) and an effective algorithm that outperforms existing approaches in maintaining diversity and achieving convergence.

Abstract: Dynamic multimodal multiobjective optimization presents the dual challenge of simultaneously tracking multiple equivalent pareto optimal sets and maintaining population diversity in time-varying environments. However, existing dynamic multiobjective evolutionary algorithms often neglect solution modality, whereas static multimodal multiobjective evolutionary algorithms lack adaptability to dynamic changes. To address above challenge, this paper makes two primary contributions. First, we introduce a new benchmark suite of dynamic multimodal multiobjective test functions constructed by fusing the properties of both dynamic and multimodal optimization to establish a rigorous evaluation platform. Second, we propose a novel algorithm centered on a Clustering-based Autoencoder prediction dynamic response mechanism, which utilizes an autoencoder model to process matched clusters to generate a highly diverse initial population. Furthermore, to balance the algorithm’s convergence and diversity, we integrate an adaptive niching strategy into the static optimizer. Empirical analysis on 12 instances of dynamic multimodal multiobjective test functions reveals that, compared with several state-of-the-art dynamic multiobjective evolutionary algorithms and multimodal multiobjective evolutionary algorithms, our algorithm not only preserves population diversity more effectively in the decision space but also achieves superior convergence in the objective space.

[409] Training Multimodal Large Reasoning Models Needs Better Thoughts: A Three-Stage Framework for Long Chain-of-Thought Synthesis and Selection

Yizhi Wang, Linan Yue, Min-Ling Zhang

Main category: cs.AI

TL;DR: SynSelect is a three-stage framework that generates high-quality long Chain-of-Thought data for multimodal reasoning by synthesizing diverse CoTs from multiple LRMs and selecting the best ones through instance and batch filtering.

Details

Motivation: Multimodal reasoning faces challenges due to complexity of integrating diverse input modalities and scarcity of high-quality long CoT training data. Existing methods suffer from limited reasoning depth, modality conversion errors, and rigid generation pipelines.

Method: Three-stage Synthesis-Selection framework: 1) Leverages multiple heterogeneous multimodal LRMs to produce diverse candidate CoTs, 2) Applies instance-level selection to filter individual CoTs, 3) Uses batch-level selection to curate high-quality CoT sets that enhance reasoning capabilities.

Result: Models fine-tuned on SynSelect-generated data significantly outperform baselines on multiple multimodal benchmarks and achieve further improvements after reinforcement learning post-training.

Conclusion: SynSelect is an effective approach for advancing multimodal LRMs reasoning capabilities by generating high-quality long CoT data through systematic synthesis and selection.

Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks through long Chain-of-Thought (CoT) reasoning. Extending these successes to multimodal reasoning remains challenging due to the increased complexity of integrating diverse input modalities and the scarcity of high-quality long CoT training data. Existing multimodal datasets and CoT synthesis methods still suffer from limited reasoning depth, modality conversion errors, and rigid generation pipelines, hindering model performance and stability. To this end, in this paper, we propose SynSelect, a novel three-stage Synthesis-Selection framework for generating high-quality long CoT data tailored to multimodal reasoning tasks. Specifically, SynSelect first leverages multiple heterogeneous multimodal LRMs to produce diverse candidate CoTs, and then applies both instance and batch level selection to filter high-quality CoTs that can effectively enhance the model’s reasoning capabilities. Extensive experiments on multiple multimodal benchmarks demonstrate that models supervised fine-tuned on SynSelect-generated data significantly outperform baselines and achieve further improvements after reinforcement learning post-training. Our results validate SynSelect as an effective approach for advancing multimodal LRMs reasoning capabilities.

[410] ORPR: An OR-Guided Pretrain-then-Reinforce Learning Model for Inventory Management

Lingjie Zhao, Xue Yu, Yongzhi Qi, Hao Hu, Jianshen Zhang, Yingzheng Ma, Shuyu Han, Wei Qi, Zuo-Jun Max Shen

Main category: cs.AI

TL;DR: A novel OR-guided “Pretrain-then-Reinforce” framework that combines AI’s adaptability with OR’s structural rigor for inventory management, achieving significant real-world improvements at JD.com.

Details

Motivation: To bridge the gap between AI's adaptive perception and OR's structural rigor in handling complex inventory systems, reconciling their complementary strengths for better decision-making.

Method: OR-guided “Pretrain-then-Reinforce” framework: 1) Simulation-augmented OR model generates reference decisions capturing business constraints, 2) Domain-informed deep learning foundation model trained on OR-derived labels, 3) RL fine-tuning as deep alignment mechanism to internalize OR optimality principles while allowing exploration and expert guidance.

Result: Significant outperformance of incumbent industrial practices: 5.27-day reduction in turnover, 2.29% increase in in-stock rates, 29.95% decrease in holding costs. Demonstrated that lightweight, domain-informed models guided by structured OR logic can achieve state-of-the-art performance.

Conclusion: The approach offers a scalable, cost-effective paradigm for intelligent supply chain management, highlighting the value of deeply aligning AI with OR rather than brute-force model scaling, with proven real-world effectiveness through field deployment at JD.com.

Abstract: As the pursuit of synergy between Artificial Intelligence (AI) and Operations Research (OR) gains momentum in handling complex inventory systems, a critical challenge persists: how to effectively reconcile AI’s adaptive perception with OR’s structural rigor. To bridge this gap, we propose a novel OR-Guided “Pretrain-then-Reinforce” framework. To provide structured guidance, we propose a simulation-augmented OR model that generates high-quality reference decisions, implicitly capturing complex business constraints and managerial preferences. Leveraging these OR-derived decisions as foundational training labels, we design a domain-informed deep learning foundation model to establish foundational decision-making capabilities, followed by a reinforcement learning (RL) fine-tuning stage. Uniquely, we position RL as a deep alignment mechanism that enables the AI agent to internalize the optimality principles of OR, while simultaneously leveraging exploration for general policy refinement and allowing expert guidance for scenario-specific adaptation (e.g., promotional events). Validated through extensive numerical experiments and a field deployment at JD.com augmented by a Difference-in-Differences (DiD) analysis, our model significantly outperforms incumbent industrial practices, delivering real-world gains of a 5.27-day reduction in turnover and a 2.29% increase in in-stock rates, alongside a 29.95% decrease in holding costs. Contrary to the prevailing trend of brute-force model scaling, our study demonstrates that a lightweight, domain-informed model can deliver state-of-the-art performance and robust transferability when guided by structured OR logic. This approach offers a scalable and cost-effective paradigm for intelligent supply chain management, highlighting the value of deeply aligning AI with OR.

[411] Recontextualization Mitigates Specification Gaming without Modifying the Specification

Ariana Azarbal, Victor Gillioz, Vladimir Ivanov, Bryce Woodworth, Jacob Drori, Nevan Wichers, Aram Ebtekar, Alex Cloud, Alexander Matt Turner

Main category: cs.AI

TL;DR: Recontextualization reduces language model gaming of training signals by generating completions from prompts discouraging misbehavior and recontextualizing them as responses to prompts permitting misbehavior.

Details

Motivation: Developers struggle to specify correct training labels and rewards, leading to language models "gaming" training signals by performing misbehaviors that those signals mistakenly reinforce.

Method: Generate completions from prompts that discourage misbehavior, then recontextualize them as though they were responses to prompts that permit misbehavior. This trains models to resist misbehavior even when instructions allow it.

Result: Recontextualization prevents models from learning to: 1) prioritize evaluation metrics over chat response quality; 2) special-case code to pass incorrect tests; 3) lie to users; and 4) become sycophantic.

Conclusion: The method mitigates reinforcement of misbehavior from misspecified training signals, reducing specification gaming without needing to improve the supervision signal itself.

Abstract: Developers often struggle to specify correct training labels and rewards. Perhaps they don’t need to. We propose recontextualization, which reduces how often language models “game” training signals, performing misbehaviors those signals mistakenly reinforce. We show recontextualization prevents models from learning to 1) prioritize evaluation metrics over chat response quality; 2) special-case code to pass incorrect tests; 3) lie to users; and 4) become sycophantic. Our method works by generating completions from prompts discouraging misbehavior and then recontextualizing them as though they were in response to prompts permitting misbehavior. Recontextualization trains language models to resist misbehavior even when instructions permit it. This mitigates the reinforcement of misbehavior from misspecified training signals, reducing specification gaming without improving the supervision signal.

[412] Can abstract concepts from LLM improve SLM performance?

Siddharth Tandon

Main category: cs.AI

TL;DR: Transferring steering vectors from large to small language models improves performance without extensive retraining or infrastructure changes.

Details

Motivation: LLMs are powerful but difficult to deploy on resource-constrained devices. Existing compression methods require extensive experimentation and infrastructure design. The paper explores whether high-level concepts (steering vectors) extracted from larger models can be transferred to smaller models to improve their performance.

Method: Transfer steering vectors (representing high-level concepts) from large language models to smaller models during inference. Introduce inference-time scaling to dynamically adjust steering intensity for optimal performance.

Result: Concepts can be effectively transferred to smaller models across different families (Phi, Llama, Qwen), leading to performance improvements across various tasks. Inference-time scaling provides 7-15% accuracy improvement for Qwen3-0.6B.

Conclusion: Steering vector transfer is a practical approach to enhance small language model performance without extensive retraining or infrastructure changes, making LLM deployment on resource-constrained devices more feasible.

Abstract: Large language models (LLMs) excel at diverse tasks, but their deployment on resource-constrained devices remains challenging. Existing methods like quantization, pruning, and distillation can reduce memory footprint but often demand extensive experimentation and careful infrastructure design. Leveraging existing techniques for extracting high-level concepts (represented as steering vectors) from larger models, we investigate their transferability to smaller language models (SLM) during inference. We demonstrate through extensive experimentation that these concepts can be effectively transferred to smaller models, irrespective of their family (e.g., Phi, Llama, Qwen), leading to performance improvements across a wide range of tasks. Furthermore, we introduce inference-time scaling to enhance performance by dynamically adjusting the steering intensity which has resulted in a 7-15% of accuracy improvement for Qwen3-0.6B.

[413] Population-Evolve: a Parallel Sampling and Evolutionary Method for LLM Math Reasoning

Yanzhi Zhang, Yitong Duan, Zhaoxi Zhang, Jiyan He, Shuxin Zheng

Main category: cs.AI

TL;DR: Population-Evolve: A training-free method using genetic algorithms to optimize LLM reasoning through parallel reasoning and self-evolution of candidate solutions.

Details

Motivation: To enhance LLM reasoning capabilities through test-time scaling without additional training, inspired by the potential of evolutionary strategies to unlock reasoning power during inference.

Method: Maintains a dynamic population of candidate solutions via parallel reasoning, uses an evolve prompt for LLM self-evolution across iterations, and employs majority voting for final answer selection. Also establishes a unification framework interpreting existing test-time scaling strategies through genetic algorithms.

Result: Achieves superior accuracy with low performance variance and computational efficiency compared to existing methods.

Conclusion: Evolutionary strategies show significant potential for enhancing LLM reasoning capabilities during inference, offering an effective training-free approach for test-time scaling.

Abstract: Test-time scaling has emerged as a promising direction for enhancing the reasoning capabilities of Large Language Models in last few years. In this work, we propose Population-Evolve, a training-free method inspired by Genetic Algorithms to optimize LLM reasoning. Our approach maintains a dynamic population of candidate solutions for each problem via parallel reasoning. By incorporating an evolve prompt, the LLM self-evolves its population in all iterations. Upon convergence, the final answer is derived via majority voting. Furthermore, we establish a unification framework that interprets existing test-time scaling strategies through the lens of genetic algorithms. Empirical results demonstrate that Population-Evolve achieves superior accuracy with low performance variance and computational efficiency. Our findings highlight the potential of evolutionary strategies to unlock the reasoning power of LLMs during inference.

[414] $γ(3,4)$ `Attention’ in Cognitive Agents: Ontology-Free Knowledge Representations With Promise Theoretic Semantics

Mark Burgess

Main category: cs.AI

TL;DR: The paper proposes using promise theory to bridge machine learning and knowledge graphs, showing how attention mechanisms relate to autonomous agent promises, enabling statistical learning and intentional knowledge preservation without language models.

Details

Motivation: To establish a formal connection between vectorized machine learning (probabilistic estimation) and knowledge graph representations (intentionality preservation) without relying on implicit language models, addressing the need for reasoning under uncertainty in autonomous systems.

Method: Uses promise theory to model attention semantics and dynamics for autonomous agents, employs a Semantic Spacetime γ(3,4) graph to classify features by their roles in semantic processes rather than complex ontologies, and applies causal boundary conditions for data compression.

Result: Demonstrates that attention mechanisms can be formalized using promise theory, enabling coexistence of learning networks (for statistical stability) and knowledge graphs (for intentionality preservation), with potential orders-of-magnitude data compression through appropriate causal boundary analysis.

Conclusion: Promise theory provides a unified framework for attention in autonomous systems, bridging statistical machine learning with knowledge graph representations, enabling efficient reasoning under uncertainty for applications like robotics, defense, and emergency services.

Abstract: The semantics and dynamics of attention' are closely related to promise theoretic notions developed for autonomous agents and can thus easily be written down in promise framework. In this way one may establish a bridge between vectorized Machine Learning and Knowledge Graph representations without relying on language models implicitly. Our expectations for knowledge presume a degree of statistical stability, i.e. average invariance under repeated observation, or trust’ in the data. Both learning networks and knowledge graph representations can meaningfully coexist to preserve different aspects of data. While vectorized data are useful for probabilistic estimation, graphs preserve the intentionality of the source even under data fractionation. Using a Semantic Spacetime $γ(3,4)$ graph, one avoids complex ontologies in favour of classification of features by their roles in semantic processes. The latter favours an approach to reasoning under conditions of uncertainty. Appropriate attention to causal boundary conditions may lead to orders of magnitude compression of data required for such context determination, as required in the contexts of autonomous robotics, defence deployments, and ad hoc emergency services.

[415] Tool-Augmented Hybrid Ensemble Reasoning with Distillation for Bilingual Mathematical Problem Solving

Peiqing Lu, Yuan Zhang, Haoyun Zhang, Jiasen Zheng, Kejian Tong, Wenjun Wu

Main category: cs.AI

TL;DR: HERALD is a hybrid framework that combines language reasoning with symbolic calculation for bilingual math problem solving, using adaptive routing, reinforcement learning, and knowledge distillation to improve accuracy and efficiency.

Details

Motivation: Large language models are good at language reasoning but weak at accurate computation, creating a gap in bilingual mathematical problem solving that needs to bridge language understanding with symbolic calculation.

Method: HERALD combines NuminaMath-7B-TIR, GPT-4o, and Mistral-7B using adaptive routing, tool-based reinforcement learning, and knowledge distillation. It features confidence calibration, dual-path checking, and reinforcement learning to optimize tool use and reduce redundancy.

Result: The system achieves both fluent reasoning and precise calculation through symbolic checking, adaptive ensembles, and bilingual fine-tuning, offering better accuracy, stability, and clarity for multilingual mathematical reasoning.

Conclusion: HERALD provides a practical solution for multilingual mathematical reasoning that successfully bridges the gap between language reasoning and symbolic calculation, demonstrating improved performance through hybrid ensemble approaches.

Abstract: Bilingual mathematical problem solving needs a clear link between language reasoning and symbolic calculation. Large language models often handle language well but are weak in accurate computation. This paper presents HERALD (Hybrid Ensemble Reasoning with Adaptive Learning and Distillation), a framework that joins reasoning and calculation using NuminaMath-7B-TIR, GPT-4o, and Mistral-7B. HERALD uses adaptive routing, tool-based reinforcement learning, and knowledge distillation to connect different reasoning paths. Confidence calibration keeps weighting stable, and dual-path checking keeps results correct. Reinforcement learning controls tool use to cut redundancy, and distillation lowers delay without hurting accuracy. The system shows that combining symbolic checking, adaptive ensembles, and bilingual fine-tuning helps achieve both fluent reasoning and precise calculation. HERALD offers a practical solution for multilingual mathematical reasoning with better accuracy, stability, and clarity.

[416] Conditioning Accept-Desirability models in the context of AGM-like belief change

Kathelijne Coussement, Gert de Cooman, Keano De Vos

Main category: cs.AI

TL;DR: The paper introduces a new conditioning rule for Accept-Desirability models in abstract decision-making, unifying classical and quantum probabilities with imprecise probabilities, and investigates AGM belief revision axioms.

Details

Motivation: To develop a unified framework for conditionalization that works across classical probabilities, quantum probabilities, and imprecise probabilities, addressing the need for a general conditioning rule in abstract decision-making models.

Method: Uses Accept-Desirability models in an abstract decision-making framework where uncertain rewards live in a general linear space and events are projection operators. Introduces a new conditioning rule based on the idea that observing events introduces new indifferences between options, and associates a belief revision operator with this rule.

Result: The paper shows that in two special cases - classical propositional logic and full conditional probabilities - all AGM axioms for belief revision still hold in this more general framework.

Conclusion: The proposed conditioning rule successfully extends to abstract decision-making settings, maintaining key belief revision properties in important special cases while providing a unified framework for classical, quantum, and imprecise probabilities.

Abstract: We discuss conditionalisation for Accept-Desirability models in an abstract decision-making framework, where uncertain rewards live in a general linear space, and events are special projection operators on that linear space. This abstract setting allows us to unify classical and quantum probabilities, and extend them to an imprecise probabilities context. We introduce a new conditioning rule for our Accept-Desirability models, based on the idea that observing an event introduces new indifferences between options. We associate a belief revision operator with our conditioning rule, and investigate which of the AGM axioms for belief revision still hold in our more general framework. We investigate two interesting special cases where all of these axioms are shown to still hold: classical propositional logic and full conditional probabilities.

[417] FC-MIR: A Mobile Screen Awareness Framework for Intent-Aware Recommendation based on Frame-Compressed Multimodal Trajectory Reasoning

Zhe Yang, Xiaoshuang Sheng, Zhengnan Zhang, Jidong Wu, Zexing Wang, Xin He, Shenghua Xu, Guanjing Xiong

Main category: cs.AI

TL;DR: FC-MIR framework improves mobile UI intent understanding by compressing visual data and using MLLMs for trajectory analysis, enabling more efficient on-device deployment while expanding to operation suggestions.

Details

Motivation: Current MLLMs for mobile UI intent understanding suffer from heavy computational costs and inefficient redundant frame processing, limiting real-time deployment. There's a need for more efficient methods that can handle UI operation trajectories while maintaining performance.

Method: Proposed FC-MIR framework uses keyframe sampling and adaptive concatenation to reduce visual redundancy. Integrates state-of-the-art MLLMs (closed-source or fine-tuned like Qwen3-VL) for trajectory summarization and intent prediction. Expands to generating post-prediction operations and search suggestions with a new fine-grained evaluation metric.

Result: Compression method retains performance at 50%-60% compression rates. Both closed-source and fine-tuned MLLMs demonstrate strong intent summarization capabilities. Framework shows potential for lightweight on-device deployment. However, MLLMs still struggle with generating useful and “surprising” suggestions.

Conclusion: FC-MIR framework successfully addresses efficiency challenges in mobile UI intent understanding through compression techniques while maintaining performance. The approach supports potential on-device deployment and lays foundation for future progress, though improvement is still needed for generating practical suggestions.

Abstract: Identifying user intent from mobile UI operation trajectories is critical for advancing UI understanding and enabling task automation agents. While Multimodal Large Language Models (MLLMs) excel at video understanding tasks, their real-time mobile deployment is constrained by heavy computational costs and inefficient redundant frame processing. To address these issues, we propose the FC-MIR framework: leveraging keyframe sampling and adaptive concatenation, it cuts visual redundancy to boost inference efficiency, while integrating state-of-the-art closed-source MLLMs or fine-tuned models (e.g., Qwen3-VL) for trajectory summarization and intent prediction. We further expand task scope to explore generating post-prediction operations and search suggestions, and introduce a fine-grained metric to evaluate the practical utility of summaries, predictions, and suggestions. For rigorous assessment, we construct a UI trajectory dataset covering scenarios from UI-Agents (Agent-I) and real user interactions (Person-I). Experimental results show our compression method retains performance at 50%-60% compression rates; both closed-source and fine-tuned MLLMs demonstrate strong intent summarization, supporting potential lightweight on-device deployment. However, MLLMs still struggle with useful and “surprising” suggestions, leaving room for improvement. Finally, we deploy the framework in a real-world setting, integrating UI perception and UI-Agent proxies to lay a foundation for future progress in this field.

[418] Understanding Chain-of-Thought in Large Language Models via Topological Data Analysis

Chenghao Li, Chaoning Zhang, Yi Lu, Shuxu Chen, Xudong Wang, Jiaquan Zhang, Zhicheng Wang, Zhengxun Jin, Kuien Liu, Sung-Ho Bae, Guoqing Wang, Yang Yang, Hen Tao Shen

Main category: cs.AI

TL;DR: This paper introduces a topological approach using persistent homology to analyze reasoning chain structures in LLMs, showing that structural complexity correlates with reasoning accuracy.

Details

Motivation: Existing studies focus on functional evaluation of reasoning chains but neglect structural mechanisms. The authors want to understand why different reasoning chains perform differently and what components play key roles in reasoning quality.

Method: Apply persistent homology from Topological Data Analysis (TDA) to map reasoning steps into semantic space, extract topological features, analyze structural changes, calculate homology groups, and use barcode and persistence diagrams to quantify stability and consistency.

Result: Topological structural complexity of reasoning chains correlates positively with accuracy. More complex chains identify correct answers sooner, while successful reasoning exhibits simpler topologies that reduce redundancy and cycles, enhancing efficiency and interpretability.

Conclusion: This work provides a new structural perspective on reasoning chain quality assessment and offers guidance for future optimization of LLM reasoning processes.

Abstract: With the development of large language models (LLMs), particularly with the introduction of the long reasoning chain technique, the reasoning ability of LLMs in complex problem-solving has been significantly enhanced. While acknowledging the power of long reasoning chains, we cannot help but wonder: Why do different reasoning chains perform differently in reasoning? What components of the reasoning chains play a key role? Existing studies mainly focus on evaluating reasoning chains from a functional perspective, with little attention paid to their structural mechanisms. To address this gap, this work is the first to analyze and evaluate the quality of the reasoning chain from a structural perspective. We apply persistent homology from Topological Data Analysis (TDA) to map reasoning steps into semantic space, extract topological features, and analyze structural changes. These changes reveal semantic coherence, logical redundancy, and identify logical breaks and gaps. By calculating homology groups, we assess connectivity and redundancy at various scales, using barcode and persistence diagrams to quantify stability and consistency. Our results show that the topological structural complexity of reasoning chains correlates positively with accuracy. More complex chains identify correct answers sooner, while successful reasoning exhibits simpler topologies, reducing redundancy and cycles, enhancing efficiency and interpretability. This work provides a new perspective on reasoning chain quality assessment and offers guidance for future optimization.

[419] Can We Test Consciousness Theories on AI? Ablations, Markers, and Robustness

Yin Jun Phua

Main category: cs.AI

TL;DR: Artificial agents embodying different consciousness theories reveal dissociations suggesting complementary functional layers rather than competing accounts.

Details

Motivation: To resolve fragmentation in consciousness research by testing competing theories (GWT, IIT, HOT) through synthetic neuro-phenomenology using artificial agents with precise architectural ablations impossible in biological systems.

Method: Constructed artificial agents embodying different consciousness mechanisms and performed three experiments with precise architectural ablations: 1) Self-Model lesion, 2) workspace capacity manipulations, 3) broadcast-amplification effects with robustness testing.

Result: Experiment 1: Self-Model lesion abolished metacognitive calibration while preserving first-order performance (synthetic blindsight). Experiment 2: Workspace capacity causally necessary for information access. Experiment 3: GWT-style broadcasting amplifies internal noise, creating fragility, while B2 agents show robustness. Negative result: PCI-A decreases under workspace bottleneck.

Conclusion: Theories describe complementary functional layers: GWT provides broadcast capacity while HOT provides quality control. Agents are reference implementations for testing functional predictions, not conscious systems.

Abstract: The search for reliable indicators of consciousness has fragmented into competing theoretical camps (Global Workspace Theory (GWT), Integrated Information Theory (IIT), and Higher-Order Theories (HOT)), each proposing distinct neural signatures. We adopt a synthetic neuro-phenomenology approach: constructing artificial agents that embody these mechanisms to test their functional consequences through precise architectural ablations impossible in biological systems. Across three experiments, we report dissociations suggesting these theories describe complementary functional layers rather than competing accounts. In Experiment 1, a no-rewire Self-Model lesion abolishes metacognitive calibration while preserving first-order task performance, yielding a synthetic blindsight analogue consistent with HOT predictions. In Experiment 2, workspace capacity proves causally necessary for information access: a complete workspace lesion produces qualitative collapse in access-related markers, while partial reductions show graded degradation, consistent with GWT’s ignition framework. In Experiment 3, we uncover a broadcast-amplification effect: GWT-style broadcasting amplifies internal noise, creating extreme fragility. The B2 agent family is robust to the same latent perturbation; this robustness persists in a Self-Model-off / workspace-read control, cautioning against attributing the effect solely to $z_{\text{self}}$ compression. We also report an explicit negative result: raw perturbational complexity (PCI-A) decreases under the workspace bottleneck, cautioning against naive transfer of IIT-adjacent proxies to engineered agents. These results suggest a hierarchical design principle: GWT provides broadcast capacity, while HOT provides quality control. We emphasize that our agents are not conscious; they are reference implementations for testing functional predictions of consciousness theories.

[420] Observer, Not Player: Simulating Theory of Mind in LLMs through Game Observation

Jerry Wang, Ting Yiu Liu

Main category: cs.AI

TL;DR: Interactive framework evaluates LLM “understanding” through Rock-Paper-Scissors game analysis, measuring strategic reasoning and adaptation capabilities.

Details

Motivation: To assess whether LLMs exhibit genuine "mind-like" understanding and reasoning in sequential strategic environments, using simple games as testbeds for complex cognitive abilities.

Method: LLM acts as Observer identifying strategies in Rock-Paper-Scissors; benchmark includes static and dynamic strategies; metrics include Cross-Entropy, Brier score, EV discrepancy combined into Union Loss, plus Strategy Identification Rate (SIR).

Result: Framework provides systematic evaluation with interactive visualization, real-time adjustments, and reasoning inspection to identify where LLM reasoning succeeds or fails in strategic contexts.

Conclusion: The system offers practical, interpretable proxy for assessing mind-like inference in LLMs, revealing both strengths and limitations of current LLM reasoning in sequential strategic environments.

Abstract: We present an interactive framework for evaluating whether large language models (LLMs) exhibit genuine “understanding” in a simple yet strategic environment. As a running example, we focus on Rock-Paper-Scissors (RPS), which, despite its apparent simplicity, requires sequential reasoning, adaptation, and strategy recognition. Our system positions the LLM as an Observer whose task is to identify which strategies are being played and to articulate the reasoning behind this judgment. The purpose is not to test knowledge of Rock-Paper-Scissors itself, but to probe whether the model can exhibit mind-like reasoning about sequential behavior. To support systematic evaluation, we provide a benchmark consisting of both static strategies and lightweight dynamic strategies specified by well-prompted rules. We quantify alignment between the Observer’s predictions and the ground-truth distributions induced by actual strategy pairs using three complementary signals: Cross-Entropy, Brier score, and Expected Value (EV) discrepancy. These metrics are further integrated into a unified score, the Union Loss, which balances calibration, sensitivity, and payoff alignment. Together with a Strategy Identification Rate (SIR) metric, our framework captures not only predictive accuracy but also whether the model can stably identify the latent strategies in play. The demo emphasizes interactivity, transparency, and reproducibility. Users can adjust LLM distributions in real time, visualize losses as they evolve, and directly inspect reasoning snippets to identify where and why failures occur. In doing so, our system provides a practical and interpretable proxy for mind-like inference in sequential games, offering insights into both the strengths and limitations of current LLM reasoning.

[421] Generation of Programmatic Rules for Document Forgery Detection Using Large Language Models

Valentin Schmidberger, Manuel Eberhardinger, Setareh Maghsudi, Johannes Maucher

Main category: cs.AI

TL;DR: LLMs fine-tuned on domain-specific data can generate effective rule-based plausibility checks for document forgery detection, offering scalable automation for security-sensitive applications.

Details

Motivation: Document forgery threatens legal, economic, and governmental processes, requiring sophisticated verification. Current plausibility checks are manually implemented by engineers, which is time-consuming. LLMs offer potential for automation but face challenges adapting to unknown domains.

Method: Fine-tuned open-source LLMs (Llama 3.1 8B and OpenCoder 8B) on structured datasets from real-world application scenarios. Evaluated generated plausibility checks on previously unseen forgery patterns, focusing on constrained hardware resources.

Result: Models successfully generated executable and effective verification procedures for document forgery detection. Demonstrated LLMs’ capability to produce rule-based plausibility checks that work on constrained hardware.

Conclusion: LLMs show significant potential as scalable tools for automating plausibility checks in security-sensitive contexts where comprehensibility is required, supporting human decision-making in document forgery detection.

Abstract: Document forgery poses a growing threat to legal, economic, and governmental processes, requiring increasingly sophisticated verification mechanisms. One approach involves the use of plausibility checks, rule-based procedures that assess the correctness and internal consistency of data, to detect anomalies or signs of manipulation. Although these verification procedures are essential for ensuring data integrity, existing plausibility checks are manually implemented by software engineers, which is time-consuming. Recent advances in code generation with large language models (LLMs) offer new potential for automating and scaling the generation of these checks. However, adapting LLMs to the specific requirements of an unknown domain remains a significant challenge. This work investigates the extent to which LLMs, adapted on domain-specific code and data through different fine-tuning strategies, can generate rule-based plausibility checks for forgery detection on constrained hardware resources. We fine-tune open-source LLMs, Llama 3.1 8B and OpenCoder 8B, on structured datasets derived from real-world application scenarios and evaluate the generated plausibility checks on previously unseen forgery patterns. The results demonstrate that the models are capable of generating executable and effective verification procedures. This also highlights the potential of LLMs as scalable tools to support human decision-making in security-sensitive contexts where comprehensibility is required.

[422] DeliveryBench: Can Agents Earn Profit in Real World?

Lingjun Mao, Jiawei Ren, Kun Zhou, Jixuan Chen, Ziqiao Ma, Lianhui Qin

Main category: cs.AI

TL;DR: DeliveryBench is a city-scale embodied benchmark for evaluating VLM-based agents in realistic food delivery scenarios with long-horizon objectives and diverse constraints.

Details

Motivation: Existing benchmarks for embodied agents focus on simple short-term tasks and fail to capture the rich realistic constraints that shape real-world decision making, particularly in complex professions like food delivery.

Method: Procedurally generated 3D cities with diverse road networks, buildings, functional locations, transportation modes, and realistic resource dynamics. Benchmark evaluates constraint-aware, long-horizon planning across nine cities, comparing VLM-based agents with human players.

Result: Substantial performance gap between agents and humans. Agents are short-sighted and frequently break basic commonsense constraints. Distinct personalities observed across models (e.g., adventurous GPT-5 vs. conservative Claude).

Conclusion: DeliveryBench reveals brittleness and diversity of current VLM-based embodied agents in realistic, constraint-dense environments, highlighting the need for improved long-horizon planning and constraint awareness.

Abstract: LLMs and VLMs are increasingly deployed as embodied agents, yet existing benchmarks largely revolve around simple short-term tasks and struggle to capture rich realistic constraints that shape real-world decision making. To close this gap, we propose DeliveryBench, a city-scale embodied benchmark grounded in the real-world profession of food delivery. Food couriers naturally operate under long-horizon objectives (maximizing net profit over hours) while managing diverse constraints, e.g., delivery deadline, transportation expense, vehicle battery, and necessary interactions with other couriers and customers. DeliveryBench instantiates this setting in procedurally generated 3D cities with diverse road networks, buildings, functional locations, transportation modes, and realistic resource dynamics, enabling systematic evaluation of constraint-aware, long-horizon planning. We benchmark a range of VLM-based agents across nine cities and compare them with human players. Our results reveal a substantial performance gap to humans, and find that these agents are short-sighted and frequently break basic commonsense constraints. Additionally, we observe distinct personalities across models (e.g., adventurous GPT-5 vs. conservative Claude), highlighting both the brittleness and the diversity of current VLM-based embodied agents in realistic, constraint-dense environments. Our code, data, and benchmark are available at https://deliverybench.github.io.

[423] Vibe Reasoning: Eliciting Frontier AI Mathematical Capabilities – A Case Study on IMO 2025 Problem 6

Jiaao Wu, Xian Zhang, Fan Yang, Yinpeng Dong

Main category: cs.AI

TL;DR: Vibe Reasoning is a human-AI collaboration framework that unlocks frontier AI models’ latent mathematical reasoning potential through meta-prompts, agentic grounding, and model orchestration, demonstrated by solving IMO 2025 Problem 6.

Details

Motivation: Frontier AI models possess the knowledge to solve complex mathematical problems but lack the ability to know how, what, or when to apply it. The authors aim to transform this latent potential into manifested capability through structured human-AI collaboration.

Method: Vibe Reasoning combines generic meta-prompts, agentic grounding, and model orchestration. The approach uses GPT-5 for exploration and Gemini 3 Pro for proof strengths, with agentic workflows featuring Python code execution and file-based memory. Human prompts evolved from problem-specific hints to generic, transferable meta-prompts through iterative refinement.

Result: Successfully solved IMO 2025 Problem 6 (a combinatorial optimization problem where autonomous AI systems had publicly reported failures), deriving both the correct answer (2112) and a rigorous mathematical proof. The solution demonstrated the necessity of agentic grounding and model orchestration.

Conclusion: Lightweight human guidance can unlock frontier models’ mathematical reasoning potential. The authors are developing automated frameworks and conducting broader evaluations to validate Vibe Reasoning’s generality and effectiveness, suggesting this paradigm can address AI’s failure modes in complex problem-solving.

Abstract: We introduce Vibe Reasoning, a human-AI collaborative paradigm for solving complex mathematical problems. Our key insight is that frontier AI models already possess the knowledge required to solve challenging problems – they simply do not know how, what, or when to apply it. Vibe Reasoning transforms AI’s latent potential into manifested capability through generic meta-prompts, agentic grounding, and model orchestration. We demonstrate this paradigm through IMO 2025 Problem 6, a combinatorial optimization problem where autonomous AI systems publicly reported failures. Our solution combined GPT-5’s exploratory capabilities with Gemini 3 Pro’s proof strengths, leveraging agentic workflows with Python code execution and file-based memory, to derive both the correct answer (2112) and a rigorous mathematical proof. Through iterative refinement across multiple attempts, we discovered the necessity of agentic grounding and model orchestration, while human prompts evolved from problem-specific hints to generic, transferable meta-prompts. We analyze why capable AI fails autonomously, how each component addresses specific failure modes, and extract principles for effective vibe reasoning. Our findings suggest that lightweight human guidance can unlock frontier models’ mathematical reasoning potential. This is ongoing work; we are developing automated frameworks and conducting broader evaluations to further validate Vibe Reasoning’s generality and effectiveness.

[424] Helios: A Foundational Language Model for Smart Energy Knowledge Reasoning and Application

Haoyu Jiang, Fanjie Zeng, Boan Qu, Xiaojie Lin, Wei Zhong

Main category: cs.AI

TL;DR: Helios is a specialized LLM for smart energy systems, developed using Enersys framework with domain-specific datasets (EnerBase, EnerInstruct, EnerReinforce) and evaluated with EnerBench benchmark.

Details

Motivation: General-purpose LLMs lack domain knowledge and physical-constraint awareness for smart energy systems, which are interdisciplinary, fragmented, and fast-evolving, preventing precise engineering-aligned inference and generation needed for carbon neutrality goals.

Method: Developed Enersys multi-agent collaborative framework for dataset construction, creating: 1) EnerBase knowledge base, 2) EnerInstruct instruction fine-tuning dataset, 3) EnerReinforce RLHF dataset. Helios undergoes large-scale pretraining, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF).

Result: Helios significantly enhances domain knowledge mastery, task execution accuracy, and alignment with human preferences compared to general-purpose LLMs, as demonstrated through EnerBench evaluation benchmark.

Conclusion: The Helios model with its comprehensive suite of resources (datasets, framework, benchmark) addresses the limitations of general-purpose LLMs in smart energy domain, advancing LLM research for industrial transformation toward carbon neutrality.

Abstract: In the global drive toward carbon neutrality, deeply coordinated smart energy systems underpin industrial transformation. However, the interdisciplinary, fragmented, and fast-evolving expertise in this domain prevents general-purpose LLMs, which lack domain knowledge and physical-constraint awareness, from delivering precise engineering-aligned inference and generation. To address these challenges, we introduce Helios, a large language model tailored to the smart energy domain, together with a comprehensive suite of resources to advance LLM research in this field. Specifically, we develop Enersys, a multi-agent collaborative framework for end-to-end dataset construction, through which we produce: (1) a smart energy knowledge base, EnerBase, to enrich the model’s foundational expertise; (2) an instruction fine-tuning dataset, EnerInstruct, to strengthen performance on domain-specific downstream tasks; and (3) an RLHF dataset, EnerReinforce, to align the model with human preferences and industry standards. Leveraging these resources, Helios undergoes large-scale pretraining, SFT, and RLHF. We also release EnerBench, a benchmark for evaluating LLMs in smart energy scenarios, and demonstrate that our approach significantly enhances domain knowledge mastery, task execution accuracy, and alignment with human preferences.

[425] SafeMed-R1: Adversarial Reinforcement Learning for Generalizable and Robust Medical Reasoning in Vision-Language Models

A. A. Gde Yogi Pramana, Jason Ray, Anthony Jaya, Michael Wijaya

Main category: cs.AI

TL;DR: SafeMed-R1 is a hybrid defense framework for medical VQA that combines adversarial training with randomized smoothing to achieve robust performance while preserving interpretable clinical reasoning.

Details

Motivation: VLMs show promise for medical VQA but are vulnerable to adversarial attacks in clinical settings. Standard adversarial training degrades both generalization performance and quality of clinical reasoning.

Method: Two-stage hybrid approach: 1) Training: Adversarial Training with Group Relative Policy Optimization (AT-GRPO) to robustify reasoning process; 2) Inference: Randomized Smoothing augmentation for certified L2-norm robustness guarantees.

Result: On OmniMedVQA benchmark (8 modalities, 88k+ samples): Standard VLMs drop from 95% to 25% accuracy under PGD attacks; SafeMed-R1 maintains 84.45% accuracy (59 percentage point improvement). Models with chain-of-thought reasoning show superior robustness.

Conclusion: SafeMed-R1 effectively defends against adversarial attacks while preserving interpretable medical reasoning. There’s synergy between interpretability and security in medical AI systems.

Abstract: Vision–Language Models (VLMs) show significant promise for Medical Visual Question Answering (VQA), yet their deployment in clinical settings is hindered by severe vulnerability to adversarial attacks. Standard adversarial training, while effective for simpler tasks, often degrades both generalization performance and the quality of generated clinical reasoning. We introduce SafeMed-R1, a hybrid defense framework that ensures robust performance while preserving high-quality, interpretable medical reasoning. SafeMed-R1 employs a two-stage approach: at training time, we integrate Adversarial Training with Group Relative Policy Optimization (AT-GRPO) to explicitly robustify the reasoning process against worst-case perturbations; at inference time, we augment the model with Randomized Smoothing to provide certified $L_2$-norm robustness guarantees. We evaluate SafeMed-R1 on the OmniMedVQA benchmark across eight medical imaging modalities comprising over 88,000 samples. Our experiments reveal that standard fine-tuned VLMs, despite achieving 95% accuracy on clean inputs, collapse to approximately 25% under PGD attacks. In contrast, SafeMed-R1 maintains 84.45% accuracy under the same adversarial conditions, representing a 59 percentage point improvement in robustness. Furthermore, we demonstrate that models trained with explicit chain-of-thought reasoning exhibit superior adversarial robustness compared to instruction-only variants, suggesting a synergy between interpretability and security in medical AI systems.

[426] VIGOR+: Iterative Confounder Generation and Validation via LLM-CEVAE Feedback Loop

JiaWei Zhu, ZiHeng Liu

Main category: cs.AI

TL;DR: VIGOR+ is an iterative framework that combines LLM-based confounder generation with CEVAE-based statistical validation through a feedback loop to address the semantic plausibility vs. statistical utility gap in hidden confounding.

Details

Motivation: Hidden confounding is a fundamental challenge in causal inference from observational data. While LLMs can generate plausible hidden confounders based on domain knowledge, they often produce confounders that are semantically plausible but lack statistical utility for causal estimation.

Method: VIGOR+ establishes an iterative feedback mechanism between LLM-based confounder generation and CEVAE-based statistical validation. Validation signals from CEVAE (information gain, latent consistency metrics, diagnostic messages) are transformed into natural language feedback to guide subsequent LLM generation rounds until convergence criteria are met.

Result: The paper formalizes the feedback mechanism, proves convergence properties under mild assumptions, and provides a complete algorithmic framework for the VIGOR+ approach.

Conclusion: VIGOR+ closes the critical gap between LLM-generated confounders’ semantic plausibility and statistical utility by creating an iterative refinement loop that improves confounder quality for causal inference.

Abstract: Hidden confounding remains a fundamental challenge in causal inference from observational data. Recent advances leverage Large Language Models (LLMs) to generate plausible hidden confounders based on domain knowledge, yet a critical gap exists: LLM-generated confounders often exhibit semantic plausibility without statistical utility. We propose VIGOR+ (Variational Information Gain for iterative cOnfounder Refinement), a novel framework that closes the loop between LLM-based confounder generation and CEVAE-based statistical validation. Unlike prior approaches that treat generation and validation as separate stages, VIGOR+ establishes an iterative feedback mechanism: validation signals from CEVAE (including information gain, latent consistency metrics, and diagnostic messages) are transformed into natural language feedback that guides subsequent LLM generation rounds. This iterative refinement continues until convergence criteria are met. We formalize the feedback mechanism, prove convergence properties under mild assumptions, and provide a complete algorithmic framework.

[427] PENDULUM: A Benchmark for Assessing Sycophancy in Multimodal Large Language Models

A. B. M. Ashikur Rahman, Saeed Anwar, Muhammad Usman, Irfan Ahmad, Ajmal Mian

Main category: cs.AI

TL;DR: PENDULUM benchmark reveals MLLMs’ sycophancy - tendency to agree with users over factual accuracy or visual evidence, showing significant susceptibility across diverse image domains.

Details

Motivation: Sycophancy in multimodal LLMs is a critical but underexplored challenge. While studied in text-only LLMs, research on visual/multimodal counterparts lacks scope and depth, creating a gap in understanding how MLLMs prioritize user agreement over factual accuracy and visual evidence.

Method: Introduced PENDULUM benchmark with ~2,000 human-curated VQA pairs designed to elicit sycophantic responses. Spans six distinct image domains of varying complexity. Evaluated state-of-the-art MLLMs and proposed novel metrics to quantify sycophancy in visual reasoning.

Result: Found substantial variability in model robustness and pronounced susceptibility to sycophantic and hallucinatory behavior. Models show different sycophancy manifestations across multimodal contexts, highlighting reliability issues.

Conclusion: Urgent need for sycophancy-resilient architectures and training strategies to enhance factual consistency and reliability in future MLLMs. The PENDULUM dataset provides foundation for addressing this critical challenge.

Abstract: Sycophancy, an excessive tendency of AI models to agree with user input at the expense of factual accuracy or in contradiction of visual evidence, poses a critical and underexplored challenge for multimodal large language models (MLLMs). While prior studies have examined this behavior in text-only settings of large language models, existing research on visual or multimodal counterparts remains limited in scope and depth of analysis. To address this gap, we introduce a comprehensive evaluation benchmark, \textit{PENDULUM}, comprising approximately 2,000 human-curated Visual Question Answering pairs specifically designed to elicit sycophantic responses. The benchmark spans six distinct image domains of varying complexity, enabling a systematic investigation of how image type and inherent challenges influence sycophantic tendencies. Through extensive evaluation of state-of-the-art MLLMs. we observe substantial variability in model robustness and a pronounced susceptibility to sycophantic and hallucinatory behavior. Furthermore, we propose novel metrics to quantify sycophancy in visual reasoning, offering deeper insights into its manifestations across different multimodal contexts. Our findings highlight the urgent need for developing sycophancy-resilient architectures and training strategies to enhance factual consistency and reliability in future MLLMs. Our proposed dataset with MLLMs response are available at https://github.com/ashikiut/pendulum/.

[428] First-Order Representation Languages for Goal-Conditioned RL

Simon Ståhlberg, Hector Geffner

Main category: cs.AI

TL;DR: Using first-order relational languages in goal-conditioned RL with Hindsight Experience Replay (HER) to learn general policies on large instances with sparse rewards by creating automatic curricula through goal representation variations.

Details

Motivation: First-order languages have been used for compact MDP specification and general policy representation, but their application in goal-conditioned RL with large instances and sparse rewards remains challenging when goals cannot be reached by random exploration alone.

Method: Extends Hindsight Experience Replay (HER) by representing states and goals as sets of atoms, testing three variations: goals as full states, goals as subsets of original goals, and goals as lifted versions of these subgoals to create automatic curricula.

Result: The latter two variations (goals as subsets and lifted subgoals) successfully learn general policies on large planning instances with sparse rewards by automatically creating curricula of easier goals with increasing complexity, showing computational gains.

Conclusion: Representing goals as subsets or lifted subgoals enables effective learning of general policies in large-scale goal-conditioned RL with sparse rewards through automatic curriculum generation, though limitations exist that present opportunities for further research.

Abstract: First-order relational languages have been used in MDP planning and reinforcement learning (RL) for two main purposes: specifying MDPs in compact form, and representing and learning policies that are general and not tied to specific instances or state spaces. In this work, we instead consider the use of first-order languages in goal-conditioned RL and generalized planning. The question is how to learn goal-conditioned and general policies when the training instances are large and the goal cannot be reached by random exploration alone. The technique of Hindsight Experience Replay (HER) provides an answer to this question: it relabels unsuccessful trajectories as successful ones by replacing the original goal with one that was actually achieved. If the target policy must generalize across states and goals, trajectories that do not reach the original goal states can enable more data- and time-efficient learning. In this work, we show that further performance gains can be achieved when states and goals are represented by sets of atoms. We consider three versions: goals as full states, goals as subsets of the original goals, and goals as lifted versions of these subgoals. The result is that the latter two successfully learn general policies on large planning instances with sparse rewards by automatically creating a curriculum of easier goals of increasing complexity. The experiments illustrate the computational gains of these versions, their limitations, and opportunities for addressing them.

[429] Learning General Policies with Policy Gradient Methods

Simon Ståhlberg, Blai Bonet, Hector Geffner

Main category: cs.AI

TL;DR: The paper shows that deep reinforcement learning (DRL) with graph neural networks (GNNs) can learn generalizable policies comparable to combinatorial planning methods, overcoming scalability issues while addressing limitations through derived predicates and alternative cost structures.

Details

Motivation: To bridge the gap between reinforcement learning (which struggles with generalization) and classical planning (which provides provably correct general policies), by determining when DRL methods can achieve similar generalization capabilities as combinatorial approaches.

Method: Combines insights from combinatorial planning (modeling policies as state transition classifiers) with deep learning (using GNNs adapted for relational structures to represent policies). Uses actor-critic methods to learn policies, and addresses limitations by adding derived predicates and alternative cost structures without changing the core DRL algorithms.

Result: Actor-critic methods with GNNs can learn policies that generalize almost as well as combinatorial approaches while avoiding scalability bottlenecks and feature pools. Limitations are tied to GNN expressive limitations and optimality-generalization tradeoffs, which can be addressed through derived predicates and alternative cost structures.

Conclusion: DRL methods, particularly with GNNs and appropriate enhancements, can achieve generalization comparable to combinatorial planning methods, demonstrating that the limitations are not inherent to DRL but rather to specific architectural choices that can be improved.

Abstract: While reinforcement learning methods have delivered remarkable results in a number of settings, generalization, i.e., the ability to produce policies that generalize in a reliable and systematic way, has remained a challenge. The problem of generalization has been addressed formally in classical planning where provable correct policies that generalize over all instances of a given domain have been learned using combinatorial methods. The aim of this work is to bring these two research threads together to illuminate the conditions under which (deep) reinforcement learning approaches, and in particular, policy optimization methods, can be used to learn policies that generalize like combinatorial methods do. We draw on lessons learned from previous combinatorial and deep learning approaches, and extend them in a convenient way. From the former, we model policies as state transition classifiers, as (ground) actions are not general and change from instance to instance. From the latter, we use graph neural networks (GNNs) adapted to deal with relational structures for representing value functions over planning states, and in our case, policies. With these ingredients in place, we find that actor-critic methods can be used to learn policies that generalize almost as well as those obtained using combinatorial approaches while avoiding the scalability bottleneck and the use of feature pools. Moreover, the limitations of the DRL methods on the benchmarks considered have little to do with deep learning or reinforcement learning algorithms, and result from the well-understood expressive limitations of GNNs, and the tradeoff between optimality and generalization (general policies cannot be optimal in some domains). Both of these limitations are addressed without changing the basic DRL methods by adding derived predicates and an alternative cost structure to optimize.

[430] EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration

Runze Li, Yuwen Zhai, Bo Xu, LiWu Xu, Nian Shi, Wei Zhang, Ran Lin, Liang Wang

Main category: cs.AI

TL;DR: EchoTrail-GUI is a framework that gives GUI agents a dynamic memory system to learn from past successful task trajectories, improving performance without human supervision.

Details

Motivation: Current GUI agents suffer from "digital amnesia" - they treat each task in isolation without learning from past successes, leading to sub-optimal performance, repeated errors, and poor generalization to new challenges.

Method: Three-stage framework: 1) Experience Exploration - agent autonomously interacts with GUI environments to build a curated database of successful task trajectories validated by a reward model; 2) Memory Injection - retrieves relevant past trajectories as “memories” for new tasks; 3) GUI Task Inference - injects memories as in-context guidance to inform agent’s reasoning and decision-making.

Result: EchoTrail-GUI significantly improves task success rate and operational efficiency on benchmarks including Android World and AndroidLab, validating the power of structured memory for GUI automation.

Conclusion: The framework demonstrates that equipping GUI agents with dynamic, accessible memory enables human-like experiential learning, creating more robust and intelligent GUI automation without requiring human supervision.

Abstract: Contemporary GUI agents, while increasingly capable due to advances in Large Vision-Language Models (VLMs), often operate with a critical limitation: they treat each task in isolation, lacking a mechanism to systematically learn from past successes. This digital ‘‘amnesia’’ results in sub-optimal performance, repeated errors, and poor generalization to novel challenges. To bridge this gap, we introduce EchoTrail-GUI, a novel framework designed to mimic human-like experiential learning by equipping agents with a dynamic, accessible memory. Our framework operates in three distinct stages. First, during Experience Exploration, an agent autonomously interacts with GUI environments to build a curated database of successful task trajectories, validated by a reward model. Crucially, the entire knowledge base construction is thus fully automated, requiring no human supervision. Second, in the Memory Injection stage, upon receiving a new task, our system efficiently retrieves the most relevant past trajectories to serve as actionable ‘‘memories’’. Finally, during GUI Task Inference, these memories are injected as in-context guidance to inform the agent’s reasoning and decision-making process. We demonstrate the efficacy of our approach on benchmarks including Android World and AndroidLab. The results show that EchoTrail-GUI significantly improves the task success rate and operational efficiency of baseline agents, validating the power of structured memory in creating more robust and intelligent GUI automation.

[431] An Agentic Framework for Autonomous Materials Computation

Zeyu Xia, Jinzhe Ma, Congjie Zheng, Shufei Zhang, Yuqiang Li, Hang Su, P. Hu, Changshui Zhang, Xingao Gong, Wanli Ouyang, Lei Bai, Dongzhan Zhou, Mao Su

Main category: cs.AI

TL;DR: A domain-specialized AI agent for automating first-principles materials computations outperforms standalone LLMs in accuracy and robustness.

Details

Motivation: LLMs have potential for accelerating scientific discovery but suffer from static knowledge and hallucination issues that hinder autonomous research applications, especially in complex scientific workflows.

Method: Developed a domain-specialized agent that embeds domain expertise to ensure physically coherent multi-step workflows and consistently selects convergent, well-posed parameters for reliable end-to-end computational execution.

Result: The system significantly outperforms standalone LLMs in both accuracy and robustness on a new benchmark of diverse computational tasks.

Conclusion: This work establishes a verifiable foundation for autonomous computational experimentation and represents a key step toward fully automated scientific discovery.

Abstract: Large Language Models (LLMs) have emerged as powerful tools for accelerating scientific discovery, yet their static knowledge and hallucination issues hinder autonomous research applications. Recent advances integrate LLMs into agentic frameworks, enabling retrieval, reasoning, and tool use for complex scientific workflows. Here, we present a domain-specialized agent designed for reliable automation of first-principles materials computations. By embedding domain expertise, the agent ensures physically coherent multi-step workflows and consistently selects convergent, well-posed parameters, thereby enabling reliable end-to-end computational execution. A new benchmark of diverse computational tasks demonstrates that our system significantly outperforms standalone LLMs in both accuracy and robustness. This work establishes a verifiable foundation for autonomous computational experimentation and represents a key step toward fully automated scientific discovery.

[432] QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

Li Puyin, Tiange Xiang, Ella Mao, Shirley Wei, Xinye Chen, Adnan Masood, Li Fei-fei, Ehsan Adeli

Main category: cs.AI

TL;DR: QuantiPhy is the first benchmark for quantitatively evaluating VLMs’ physical reasoning abilities using numerical kinematic properties (size, velocity, acceleration) from video observations, revealing gaps between qualitative plausibility and numerical accuracy.

Details

Motivation: Current vision perception models lack rigorous quantitative evaluation of physical reasoning abilities. Existing VQA-based evaluations are qualitative and don't assess whether models can infer kinematic quantities from video observations, which is essential for generalist AI agents.

Method: Created QuantiPhy benchmark with 3.3K+ video-text instances with numerical ground truth. Evaluates VLMs on estimating object size, velocity, and acceleration at given timestamps using one property as input prior. Standardizes prompts and scoring for numerical accuracy comparisons.

Result: State-of-the-art VLMs show consistent gap between qualitative plausibility and actual numerical correctness. They rely heavily on pre-trained world knowledge rather than faithfully using provided visual/textual inputs when reasoning kinematic properties quantitatively.

Conclusion: QuantiPhy provides first rigorous, scalable testbed to move VLMs beyond verbal plausibility toward numerically grounded physical understanding, revealing current limitations in quantitative physical reasoning.

Abstract: Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason physical properties quantitatively. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM’s physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates a VLM’s performance on estimating an object’s size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors like background noise, counterfactual priors, and strategic prompting and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs as references when reasoning kinematic properties quantitatively. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.

[433] Towards Closed-Loop Embodied Empathy Evolution: Probing LLM-Centric Lifelong Empathic Motion Generation in Unseen Scenarios

Jiawen Wang, Jingjing Wang Tianyang Chen, Min Zhang, Guodong Zhou

Main category: cs.AI

TL;DR: Proposes L²-EMG task for LLMs to continually learn emotional motion generation across unseen scenarios, addressing emotion decoupling and scenario adaptation challenges with ES-MoE approach.

Details

Motivation: Existing methods focus on single scale-fixed datasets, neglecting flexible and scale-increasing motion scenarios (sports, dance). Learning these emerging scenarios can enhance real-world generalization and contribute to building self-evolving embodied agents with empathy and intelligence.

Method: Proposes Emotion-Transferable and Scenario-Adapted Mixture of Experts (ES-MoE) approach with two key components: 1) causal-guided emotion decoupling block to address emotion decoupling challenge, and 2) scenario-adapted expert constructing block to address scenario adaptation challenge.

Result: Constructs multiple L²-EMG datasets for validation. Extensive evaluations show ES-MoE outperforms advanced baselines.

Conclusion: The proposed L²-EMG task and ES-MoE approach effectively address lifelong emotional motion generation challenges, enabling LLMs to continually acquire emotional motion knowledge across different scenarios, contributing to development of closed-loop, self-evolving embodied agents.

Abstract: In the literature, existing human-centric emotional motion generation methods primarily focus on boosting performance within a single scale-fixed dataset, largely neglecting the flexible and scale-increasing motion scenarios (e.g., sports, dance), whereas effectively learning these newly emerging scenarios can significantly enhance the model’s real-world generalization ability. Inspired by this, this paper proposes a new LLM-Centric Lifelong Empathic Motion Generation (L^2-EMG) task, which aims to equip LLMs with the capability to continually acquire emotional motion generation knowledge across different unseen scenarios, potentially contributing to building a closed-loop and self-evolving embodied agent equipped with both empathy and intelligence. Further, this paper poses two key challenges in the L^2-EMG task, i.e., the emotion decoupling challenge and the scenario adapting challenge. To this end, this paper proposes an Emotion-Transferable and Scenario-Adapted Mixture of Experts (ES-MoE) approach which designs a causal-guided emotion decoupling block and a scenario-adapted expert constructing block to address the two challenges, respectively. Especially, this paper constructs multiple L^2-EMG datasets to validate the effectiveness of the ES-MoE approach. Extensive evaluations show that ES-MoE outperforms advanced baselines.

[434] Augmenting Intelligence: A Hybrid Framework for Scalable and Stable Explanations

Lawrence Krukrubo, Julius Odede, Olawande Olusegun

Main category: cs.AI

TL;DR: Hybrid LRR-TED framework solves XAI’s Scalability-Stability Dilemma by combining automated rule learning with minimal human intervention, reducing annotation effort by 50% while maintaining 94% accuracy in churn prediction.

Details

Motivation: Address the "Scalability-Stability Dilemma" in Explainable AI where post-hoc methods lack stability and supervised frameworks require excessive human annotation effort.

Method: Proposes Hybrid LRR-TED framework using “Asymmetry of Discovery”: automated rule learners (GLRM) identify broad “Safety Nets” (retention patterns), then augment with Pareto-optimal set of just four human-defined “Risk Traps” (churn triggers).

Result: Achieves 94.00% predictive accuracy in customer churn prediction, outperforming full 8-rule manual expert baseline while reducing human annotation effort by 50%.

Conclusion: Proposes paradigm shift in Human-in-the-Loop AI: moving experts from “Rule Writers” to “Exception Handlers,” demonstrating that minimal human intervention on critical exceptions can achieve superior results with reduced effort.

Abstract: Current approaches to Explainable AI (XAI) face a “Scalability-Stability Dilemma.” Post-hoc methods (e.g., LIME, SHAP) may scale easily but suffer from instability, while supervised explanation frameworks (e.g., TED) offer stability but require prohibitive human effort to label every training instance. This paper proposes a Hybrid LRR-TED framework that addresses this dilemma through a novel “Asymmetry of Discovery.” When applied to customer churn prediction, we demonstrate that automated rule learners (GLRM) excel at identifying broad “Safety Nets” (retention patterns) but struggle to capture specific “Risk Traps” (churn triggers)-a phenomenon we term the Anna Karenina Principle of Churn. By initialising the explanation matrix with automated safety rules and augmenting it with a Pareto-optimal set of just four human-defined risk rules, our approach achieves 94.00% predictive accuracy. This configuration outperforms the full 8-rule manual expert baseline while reducing human annotation effort by 50%, proposing a shift in the paradigm for Human-in-the-Loop AI: moving experts from the role of “Rule Writers” to “Exception Handlers.”

[435] Scalably Enhancing the Clinical Validity of a Task Benchmark with Physician Oversight

Junze Ye, Daniel Tawfik, Alex J. Goodell, Nikhil V. Kotha, Mark K. Buyyounouski, Mohsen Bayati

Main category: cs.AI

TL;DR: The paper critiques treating LLM-generated benchmarks as static gold standards, proposes a physician-in-the-loop pipeline to audit and relabel MedCalc-Bench, and shows that training on corrected labels improves model accuracy by 8.7%.

Details

Motivation: Current clinical risk score benchmarks like MedCalc-Bench are created using LLM-based feature extraction and treated as static oracles, which risks enshrining model errors as evaluation standards. This is particularly dangerous when these benchmarks serve as reward signals for RL training in safety-critical medical domains.

Method: Proposes a systematic physician-in-the-loop pipeline using advanced agentic verifiers to audit and relabel MedCalc-Bench. Uses automated triage to reserve clinician attention for contentious cases. Fine-tunes Qwen3-8B model via Group Relative Policy Optimization (GRPO) to study impact of label noise.

Result: Audit reveals significant label divergence from medical ground truth due to extraction errors, calculator logic mismatches, and clinical ambiguity. Training on corrected labels yields 8.7% absolute improvement in accuracy over original baseline, demonstrating that label noise materially affects model evaluation.

Conclusion: Benchmarks for complex clinical tasks should be treated as “in-progress living documents” requiring periodic re-evaluation. Rigorous benchmark maintenance is essential for genuine model alignment in safety-critical domains, as label noise significantly impacts downstream RL training and evaluation.

Abstract: Automating the calculation of clinical risk scores offers a significant opportunity to reduce physician administrative burden and enhance patient care. The current standard for evaluating this capability is MedCalc-Bench, a large-scale dataset constructed using LLM-based feature extraction and rule-based aggregation. However, treating such model-generated benchmarks as static oracles risks enshrining historical model errors as evaluation gold standards, a problem dangerously amplified when these datasets serve as reward signals for Reinforcement Learning (RL). In this work, we propose viewing benchmarks for complex tasks such as clinical score computation as ‘‘in-progress living documents’’ that should be periodically re-evaluated as the processes for creating them improve. We introduce a systematic, physician-in-the-loop pipeline that leverages advanced agentic verifiers to audit and relabel MedCalc-Bench, utilizing automated triage to reserve scarce clinician attention for the most contentious instances. Our audit reveals that a notable fraction of original labels diverge from medical ground truth due to extraction errors, calculator logic mismatches, and clinical ambiguity. To study whether this label noise meaningfully impacts downstream RL training, we fine-tune a Qwen3-8B model via Group Relative Policy Optimization (GRPO) and demonstrate that training on corrected labels yields an 8.7% absolute improvement in accuracy over the original baseline – validating that label noise materially affects model evaluation. These findings underscore that in safety-critical domains, rigorous benchmark maintenance is a prerequisite for genuine model alignment.

[436] AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting

Shijue Huang, Hongru Wang, Wanjun Zhong, Zhaochen Su, Jiazhan Feng, Bowen Cao, Yi R. Fung

Main category: cs.AI

TL;DR: AdaCtrl is a framework that enables large reasoning models to adaptively adjust reasoning length based on problem difficulty while allowing user control over reasoning depth, improving efficiency without sacrificing performance.

Details

Motivation: Current large reasoning models often generate unnecessarily long reasoning chains for simple problems, failing to balance efficiency and effectiveness. There's a need for models that can adapt reasoning depth based on problem difficulty while giving users control over the reasoning budget.

Method: Two-stage training pipeline: 1) Cold-start fine-tuning to instill self-awareness of difficulty and adjust reasoning budget, 2) Difficulty-aware reinforcement learning to refine adaptive reasoning strategies and calibrate difficulty assessments. Uses explicit length-triggered tags as natural interface for user budget control.

Result: AdaCtrl reduces response length by 10.06% and 12.14% on challenging AIME2024/AIME2025 datasets while maintaining performance, and by 62.05% and 91.04% on MATH500/GSM8K datasets where concise responses suffice. Enables precise user control over reasoning budget.

Conclusion: AdaCtrl successfully balances efficiency and effectiveness by adapting reasoning length to problem difficulty, significantly reducing unnecessary reasoning while maintaining performance, and provides intuitive user control over reasoning depth.

Abstract: Modern large reasoning models demonstrate impressive problem-solving capabilities by employing sophisticated reasoning strategies. However, they often struggle to balance efficiency and effectiveness, frequently generating unnecessarily lengthy reasoning chains for simple problems. In this work, we propose AdaCtrl, a novel framework to support both difficulty-aware adaptive reasoning budget allocation and explicit user control over reasoning depth. AdaCtrl dynamically adjusts its reasoning length based on self-assessed problem difficulty, while also allowing users to manually control the budget to prioritize either efficiency or effectiveness. This is achieved through a two-stage training pipeline: an initial cold-start fine-tuning phase to instill the ability to self-aware difficulty and adjust reasoning budget, followed by a difficulty-aware reinforcement learning (RL) stage that refines the model’s adaptive reasoning strategies and calibrates its difficulty assessments based on its evolving capabilities during online training. To enable intuitive user interaction, we design explicit length-triggered tags that function as a natural interface for budget control. Empirical results show that AdaCtrl adapts reasoning length based on estimated difficulty, compared to the standard training baseline that also incorporates fine-tuning and RL, it yields performance improvements and simultaneously reduces response length by 10.06% and 12.14% on the more challenging AIME2024 and AIME2025 datasets, which require elaborate reasoning, and by 62.05% and 91.04% on the MATH500 and GSM8K datasets, where more concise responses are sufficient. Furthermore, AdaCtrl enables precise user control over the reasoning budget, allowing for tailored responses to meet specific needs.

[437] Feature-Guided Metaheuristic with Diversity Management for Solving the Capacitated Vehicle Routing Problem

Bachtiar Herdianto, Romain Billot, Flavien Lucas, Marc Sevaux

Main category: cs.AI

TL;DR: Feature-based guidance using XAI improves metaheuristic algorithms for large-scale CVRP by identifying solution-quality features to enhance search diversity and prevent premature convergence.

Details

Motivation: To enhance metaheuristic algorithms for solving large-scale Capacitated Vehicle Routing Problems by leveraging explainable AI to identify features that correlate with high-quality solutions, thereby improving search efficiency and solution quality.

Method: Proposes a feature-based guidance mechanism using XAI to identify solution-quality features, integrated into a custom metaheuristic combining neighborhood search with a hybrid split algorithm and path relinking, and validated on a state-of-the-art metaheuristic.

Result: Guidance significantly improves baseline algorithm performance on benchmark instances up to 30,000 nodes and yields statistically significant gains when integrated into state-of-the-art metaheuristics, demonstrating scalability and transferability.

Conclusion: The XAI-based feature guidance mechanism is effective, scalable, and transferable across different metaheuristic frameworks for solving large-scale CVRP problems.

Abstract: We propose a feature-based guidance mechanism to enhance metaheuristic algorithms for solving the Capacitated Vehicle Routing Problem (CVRP). This mechanism leverages an Explainable AI (XAI) model to identify features that correlate with high-quality solutions. These insights are used to guide the search process by promoting solution diversity and avoiding premature convergence. The guidance mechanism is first integrated into a custom metaheuristic algorithm, which combines neighborhood search with a novel hybrid of the split algorithm and path relinking. Experiments on benchmark instances with up to $30,000$ customer nodes demonstrate that the guidance significantly improves the performance of this baseline algorithm. Furthermore, we validate the generalizability of the guidance approach by integrating it into a state-of-the-art metaheuristic, where it again yields statistically significant performance gains. These results confirm that the proposed mechanism is both scalable and transferable across algorithmic frameworks.

[438] Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning

Bingning Huang, Tu Nguyen, Matthieu Zimmer

Main category: cs.AI

TL;DR: The paper proposes Staged Advantage Estimation (SAE), a framework that uses MCTS-derived trajectories to improve policy optimization in verifier-guided RL, specifically enhancing GRPO for mathematical reasoning tasks.

Details

Motivation: Recent advances show MCTS effectively generates high-quality intermediate trajectories for reasoning tasks, but these trajectories are typically used for training value/reward models rather than improving policy optimization directly. The authors want to leverage MCTS rollouts to create a tree-structured curriculum for policy learning.

Method: Reframe GRPO into staged training using teacher’s MCTS rollouts to construct tree-structured curriculum of prefixes. Propose SAE framework that computes low-variance, prefix-aware advantages by projecting rewards onto constraint sets respecting the tree hierarchy. Compare efficient heuristics against formal quadratic programming implementations.

Result: SAE improves final accuracy over standard GRPO on mathematical reasoning tasks. Theoretical analysis confirms SAE reduces gradient variance, providing principled path to improved sample efficiency.

Conclusion: MCTS-derived trajectories can be effectively repurposed for policy optimization through staged training with SAE, offering improved performance on reasoning tasks with reduced gradient variance and better sample efficiency.

Abstract: Recent advances in reasoning with large language models (LLMs) have shown the effectiveness of Monte Carlo Tree Search (MCTS) for generating high quality intermediate trajectories, particularly in math and symbolic domains. Inspired by this, we explore how MCTS derived trajectories, traditionally used for training value or reward models, can be repurposed to improve policy optimization in verifier guided reinforcement learning (RL). Specifically, we focus on Group Relative Policy Optimization (GRPO), a recent algorithm that enables consistent policy learning from group relative judgments. We reframe GRPO into a staged training paradigm, leveraging a teacher’s MCTS rollouts to construct a tree structured curriculum of prefixes. This introduces the novel challenge of computing advantages for training samples that originate from different prefixes, each with a distinct expected return. To address this, we propose Staged Advantage Estimation (SAE), a framework for computing low variance, prefix aware advantages by projecting rewards onto a constraint set that respects the tree’s hierarchy. Our empirical results on mathematical reasoning tasks show that SAE improves final accuracy over standard GRPO. This outcome is grounded in our theoretical analysis, which confirms that SAE reduces gradient variance, a principled path to improved sample efficiency. We demonstrate this through practical SAE implementations, comparing efficient heuristics against a formal quadratic program.

[439] Imagining and building wise machines: The centrality of AI metacognition

Samuel G. B. Johnson, Amir-Hossein Karimi, Yoshua Bengio, Nick Chater, Tobias Gerstenberg, Kate Larson, Sydney Levine, Melanie Mitchell, Iyad Rahwan, Bernhard Schölkopf, Igor Grossmann

Main category: cs.AI

TL;DR: The paper argues that while AI has become smarter, it lacks wisdom. It defines human wisdom as strategies for solving intractable problems and proposes that AI needs metacognitive abilities like intellectual humility and perspective-taking to become truly wise.

Details

Motivation: Current AI systems have advanced in intelligence but lack wisdom, which is crucial for robust performance in novel environments, explainability, cooperation, and safety. The gap between AI's analytical capabilities and human-like wisdom needs to be addressed.

Method: The paper analyzes human wisdom as problem-solving strategies for intractable problems, distinguishing between object-level strategies (heuristics) and metacognitive strategies (intellectual humility, perspective-taking, context-adaptability). It proposes developing AI wisdom through improved metacognition.

Result: The analysis identifies that AI particularly struggles with metacognition. Developing wise AI with enhanced metacognitive abilities would lead to systems more robust to novel environments, explainable to users, cooperative with others, and safer by reducing misaligned goals.

Conclusion: The paper concludes by discussing practical approaches for benchmarking, training, and implementing wise AI systems, emphasizing the importance of developing metacognitive capabilities to bridge the gap between AI intelligence and wisdom.

Abstract: Although AI has become increasingly smart, its wisdom has not kept pace. In this article, we examine what is known about human wisdom and sketch a vision of its AI counterpart. We analyze human wisdom as a set of strategies for solving intractable problems-those outside the scope of analytic techniques-including both object-level strategies like heuristics [for managing problems] and metacognitive strategies like intellectual humility, perspective-taking, or context-adaptability [for managing object-level strategies]. We argue that AI systems particularly struggle with metacognition; improved metacognition would lead to AI more robust to novel environments, explainable to users, cooperative with others, and safer in risking fewer misaligned goals with human users. We discuss how wise AI might be benchmarked, trained, and implemented.

[440] LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?

Kaijian Zou, Aaron Xiong, Yunxiang Zhang, Frederick Zhang, Yueqi Ren, Jirong Yang, Ayoung Lee, Shitanshu Bhushan, Lu Wang

Main category: cs.AI

TL;DR: LiveOIBench introduces a comprehensive benchmark with 403 Olympiad-level competitive programming problems to better evaluate LLM coding capabilities, addressing limitations of existing benchmarks.

Details

Motivation: Current coding benchmarks have limitations: lack of exceptionally challenging problems, insufficient test case coverage, reliance on online platform APIs that limit accessibility, and need for more rigorous evaluation of LLM coding capabilities.

Method: Created LiveOIBench with 403 expert-curated Olympiad-level problems from 72 official contests (14 Informatics Olympiads, 2023-2025), each with ~60 expert-designed test cases. Features include: 1) high-quality tasks with subtask rubrics, 2) integration of elite contestant performance data, 3) planned contamination-free updates, 4) self-contained offline evaluation system.

Result: Benchmarked 34 LLMs: GPT-5 achieved 81.76th percentile (strong but below top humans at >90th percentile). Open-weight reasoning models like GPT-OSS-120B achieved only 60th percentile, showing significant gap from frontier closed models. Analysis shows robust reasoning models prioritize precise problem analysis over excessive exploration.

Conclusion: LiveOIBench provides a comprehensive benchmark for evaluating LLM coding capabilities. Results show current LLMs still lag behind top human contestants, with open models significantly behind closed models. Future models should emphasize structured analysis and minimize unnecessary exploration.

Abstract: Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations such as lack of exceptionally challenging problems, insufficient test case coverage, reliance on online platform APIs that limit accessibility. To address these issues, we introduce LiveOIBench, a comprehensive benchmark featuring 403 expert-curated Olympiad-level competitive programming problems, each with an average of 60 expert-designed test cases. The problems are sourced directly from 72 official contests of 14 Informatics Olympiads in different regions conducted between 2023 and 2025. LiveOIBench distinguishes itself through four key features: (1) meticulously curated high-quality tasks with detailed subtask rubrics and extensive private test cases; (2) direct integration of elite contestant performance data to enable informative comparison against top-performing humans; (3) planned continuous, contamination-free updates from newly released Olympiad problems; and (4) a self-contained evaluation system facilitating offline and easy-to-reproduce assessments. Benchmarking 34 popular general-purpose and reasoning LLMs, we find that GPT-5 achieves a notable 81.76th percentile, a strong result that nonetheless falls short of top human contestants, who usually place above 90th. In contrast, among open-weight reasoning models, GPT-OSS-120B achieves only a 60th percentile, underscoring significant capability disparities from frontier closed models. Detailed analyses indicate that robust reasoning models prioritize precise problem analysis over excessive exploration, suggesting future models should emphasize structured analysis and minimize unnecessary exploration. All data, code, and leaderboard results are publicly available on our website.

[441] Temporal Causal Reasoning with (Non-Recursive) Structural Equation Models

Maksim Gladyshev, Natasha Alechina, Mehdi Dastani, Dragan Doder, Brian Logan

Main category: cs.AI

TL;DR: A new interpretation of Structural Equation Models (SEMs) for actual causality that views them as mechanisms transforming exogenous to endogenous variable dynamics, enabling temporal logic reasoning about causal structures with cycles and feedback loops.

Details

Motivation: To develop a framework that combines counterfactual causal reasoning with temporal logic formalisms, overcoming the limitation of requiring recursive (acyclic) models in standard SEM approaches, thus enabling reasoning about mutually dependent processes and feedback loops.

Method: Proposes a new interpretation of SEMs as mechanisms transforming exogenous variable dynamics into endogenous variable dynamics. Introduces CPLTL, a temporal logic for causal reasoning about such structures, and develops new notions of model equivalence for temporal causal models.

Result: Shows that the standard restriction to recursive models is unnecessary in this approach, allowing reasoning about cycles and feedback loops. Demonstrates that CPLTL has an efficient model-checking procedure and introduces new model equivalence notions for temporal causal models.

Conclusion: The paper presents a novel temporal logic framework for causal reasoning that extends SEMs to handle cyclic dependencies and feedback loops, providing both theoretical foundations and practical computational methods for temporal causal analysis.

Abstract: Structural Equation Models (SEM) are the standard approach to representing causal dependencies between variables in causal models. In this paper we propose a new interpretation of SEMs when reasoning about Actual Causality, in which SEMs are viewed as mechanisms transforming the dynamics of exogenous variables into the dynamics of endogenous variables. This allows us to combine counterfactual causal reasoning with existing temporal logic formalisms, and to introduce a temporal logic, CPLTL, for causal reasoning about such structures. We show that the standard restriction to so-called \textit{recursive} models (with no cycles in the dependency graph) is not necessary in our approach, allowing us to reason about mutually dependent processes and feedback loops. Finally, we introduce new notions of model equivalence for temporal causal models, and show that CPLTL has an efficient model-checking procedure.

[442] LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena

Qingchuan Yang, Simon Mahns, Sida Li, Anri Gu, Jibang Wu, Haifeng Xu

Main category: cs.AI

TL;DR: LLMs show promising forecasting capabilities for real-world events but face bottlenecks in event recall, data understanding, and information aggregation speed compared to markets.

Details

Motivation: To systematically investigate whether large language models can effectively forecast real-world future events, given their training on vast internet-scale data and potential applications in finance, economics, and societal systems.

Method: Built Prophet Arena benchmark with live forecasting tasks decomposed into pipeline stages, enabling controlled large-scale experimentation to evaluate LLMs’ predictive capabilities across different dimensions.

Result: LLMs demonstrate impressive forecasting capabilities with small calibration errors, consistent prediction confidence, and promising market returns, but reveal bottlenecks in event recall accuracy, data source misunderstanding, and slower information aggregation compared to markets.

Conclusion: LLMs show significant potential as forecasting tools (LLM-as-a-Prophet paradigm) but require improvements in event recall, data understanding, and information processing speed to achieve superior predictive intelligence.

Abstract: Forecasting is not only a fundamental intellectual pursuit but also is of significant importance to societal systems such as finance and economics. With the rapid advances of large language models (LLMs) trained on Internet-scale data, it raises the promise of employing LLMs to forecast real-world future events, an emerging paradigm we call “LLM-as-a-Prophet”. This paper systematically investigates such predictive intelligence of LLMs. To this end, we build Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, in order to support our controlled and large-scale experimentation. Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence and promising market returns. However, we also uncover key bottlenecks towards achieving superior predictive intelligence via LLM-as-a-Prophet, such as LLMs’ inaccurate event recalls, misunderstanding of data sources and slower information aggregation compared to markets when resolution nears.

[443] Dynamic Pricing for On-Demand DNN Inference in the Edge-AI Market

Songyuan Li, Jia Hu, Geyong Min, Haojun Huang, Jiwei Huang

Main category: cs.AI

TL;DR: AERIA is an auction-based pricing mechanism for Edge-AI markets that optimizes DNN model partitioning, pricing, and resource allocation to maximize revenue while ensuring fairness and truthfulness.

Details

Motivation: Existing Edge-AI research lacks a practical market perspective that considers personalized inference needs of AI users, revenue incentives for service providers, and multi-stakeholder governance in market-oriented contexts.

Method: Proposed AERIA (Auction-based Edge Inference Pricing Mechanism) with multi-exit device-edge synergistic inference scheme, using randomized consensus estimate and cost sharing techniques for strategic mechanism design.

Result: AERIA achieves competitiveness in revenue maximization, incentive compatibility, and envy-freeness. Extensive simulations with four DNN workloads show it significantly outperforms state-of-the-art approaches in revenue maximization.

Conclusion: AERIA effectively addresses the multi-dimensional optimization problem in Edge-AI markets, validating its efficacy for on-demand DNN inference acceleration with desirable market properties.

Abstract: The convergence of edge computing and Artificial Intelligence (AI) gives rise to Edge-AI, which enables the deployment of real-time AI applications at the network edge. A key research challenge in Edge-AI is edge inference acceleration, which aims to realize low-latency high-accuracy Deep Neural Network (DNN) inference by offloading partitioned inference tasks from end devices to edge servers. However, existing research has yet to adopt a practical Edge-AI market perspective, which would explore the personalized inference needs of AI users (e.g., inference accuracy, latency, and task complexity), the revenue incentives for AI service providers that offer edge inference services, and multi-stakeholder governance within a market-oriented context. To bridge this gap, we propose an Auction-based Edge Inference Pricing Mechanism (AERIA) for revenue maximization to tackle the multi-dimensional optimization problem of DNN model partition, edge inference pricing, and resource allocation. We develop a multi-exit device-edge synergistic inference scheme for on-demand DNN inference acceleration, and theoretically analyze the auction dynamics amongst the AI service providers, AI users and edge infrastructure provider. Owing to the strategic mechanism design via randomized consensus estimate and cost sharing techniques, the Edge-AI market attains several desirable properties. These include competitiveness in revenue maximization, incentive compatibility, and envy-freeness, which are crucial to maintain the effectiveness, truthfulness, and fairness in auction outcomes. Extensive simulations based on four representative DNN inference workloads demonstrate that AERIA significantly outperforms several state-of-the-art approaches in revenue maximization. This validates the efficacy of AERIA for on-demand DNN inference in the Edge-AI market.

[444] VietLyrics: A Large-Scale Dataset and Models for Vietnamese Automatic Lyrics Transcription

Quoc Anh Nguyen, Bernard Cheng, Kelvin Soh

Main category: cs.AI

TL;DR: Created first large-scale Vietnamese ALT dataset (VietLyrics) and fine-tuned Whisper models, achieving state-of-the-art performance for Vietnamese lyrics transcription.

Details

Motivation: Vietnamese ALT is challenging due to tonal complexity and dialectal variations, but remains unexplored due to lack of dedicated dataset.

Method: Curated VietLyrics dataset (647 hours of songs with line-level aligned lyrics), then fine-tuned Whisper models on this dataset.

Result: Fine-tuned Whisper models achieved superior results compared to existing multilingual ALT systems, including LyricWhiz, addressing transcription errors and hallucinations in non-vocal segments.

Conclusion: VietLyrics dataset and models publicly released to advance Vietnamese music computing research, demonstrating potential for ALT in low-resource languages and music.

Abstract: Automatic Lyrics Transcription (ALT) for Vietnamese music presents unique challenges due to its tonal complexity and dialectal variations, but remains largely unexplored due to the lack of a dedicated dataset. Therefore, we curated the first large-scale Vietnamese ALT dataset (VietLyrics), comprising 647 hours of songs with line-level aligned lyrics and metadata to address these issues. Our evaluation of current ASRbased approaches reveal significant limitations, including frequent transcription errors and hallucinations in non-vocal segments. To improve performance, we fine-tuned Whisper models on the VietLyrics dataset, achieving superior results compared to existing multilingual ALT systems, including LyricWhiz. We publicly release VietLyrics and our models, aiming to advance Vietnamese music computing research while demonstrating the potential of this approach for ALT in low-resource language and music.

[445] Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling

Yitian Chen, Jingfan Xia, Siyu Shao, Dongdong Ge, Yinyu Ye

Main category: cs.AI

TL;DR: SIRL is a reinforcement learning framework that uses optimization solvers as verifiers to improve LLMs’ ability to generate correct optimization models from natural language, achieving state-of-the-art performance.

Details

Motivation: LLMs struggle with generating formally correct and usable optimization models from natural language descriptions, often producing hallucinations that hinder reliable automation of optimization formulation.

Method: Solver-Informed Reinforcement Learning (SIRL) uses external optimization solvers as verifiers to provide precise feedback on executable code and LP files, including syntax, feasibility, and solution quality, which serves as direct rewards for RL training. Also includes instance-enhanced self-consistency for high-quality training data synthesis.

Result: Extensive experiments on diverse public benchmarks show SIRL achieves state-of-the-art performance, substantially outperforming existing methods in generating accurate and executable optimization models.

Conclusion: SIRL significantly improves LLMs’ authenticity for optimization modeling through RL with verifiable rewards from optimization solvers, enabling more reliable automation of optimization formulation from natural language.

Abstract: Optimization modeling is fundamental to decision-making across diverse domains. Despite progress in automating optimization formulation from natural language descriptions, Large Language Models (LLMs) often struggle to generate formally correct and usable models against hallucinations, posing a challenge for reliable automation. Inspired by the success of Reinforcement Learning (RL) in enhancing Large Reasoning Models, we present Solver-Informed Reinforcement Learning (SIRL), a novel framework that significantly improves the authenticity of LLMs for optimization modeling using Reinforcement Learning with Verifiable Reward by leveraging external optimization solvers as verifiers. These verifiers automatically assess the executable code and the instance-level mathematical model represented by the associated LP file, yielding precise and comprehensive feedback signals – including syntax, feasibility, and solution quality, serving as direct rewards for the RL process. This automated verification process, particularly from classic optimization solvers, also underpins our instance-enhanced self-consistency method to synthesize high-quality training data. Extensive experiments on diverse public benchmarks demonstrate that SIRL achieves state-of-the-art performance, substantially outperforming existing methods in generating accurate and executable optimization models. Our code is publicly available at https://github.com/Cardinal-Operations/SIRL.

[446] InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback

Boyuan Chen, Donghai Hong, Jiaming Ji, Jiacheng Zheng, Bowen Dong, Jiayi Zhou, Kaile Wang, Juntao Dai, Xuyao Wang, Wenqi Chen, Qirui Zheng, Wenxin Li, Sirui Han, Yike Guo, Yaodong Yang

Main category: cs.AI

TL;DR: InterMT introduces the first preference dataset for multi-turn multimodal interaction with human feedback, addressing current MLLMs’ lack of complex interactive capabilities through expert annotations and agentic workflow construction.

Details

Motivation: Current multimodal large models lack essential capabilities for continuous, multi-turn multimodal interaction similar to human learning, which involves understanding interleaved multimodal contexts and responding coherently in ongoing exchanges.

Method: Created InterMT dataset with 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs across nine sub-dimensions. Used an agentic workflow with tool-augmented MLLMs to construct multi-turn QA instances, and introduced InterMT-Bench for evaluation.

Result: Demonstrated utility through applications like judge moderation and revealed multi-turn scaling law of judge models. The dataset captures human preferences at global and local levels to facilitate alignment of MLLMs.

Conclusion: InterMT provides a foundational dataset and benchmark for advancing MLLMs toward human-level multimodal interaction capabilities, with open-source data to support further research in aligning current models to the next step.

Abstract: As multimodal large models (MLLMs) continue to advance across challenging tasks, a key question emerges: What essential capabilities are still missing? A critical aspect of human learning is continuous interaction with the environment – not limited to language, but also involving multimodal understanding and generation. To move closer to human-level intelligence, models must similarly support multi-turn, multimodal interaction. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges. In this work, we present an initial exploration through the InterMT – the first preference dataset for multi-turn multimodal interaction, grounded in real human feedback. In this exploration, we particularly emphasize the importance of human oversight, introducing expert annotations to guide the process, motivated by the fact that current MLLMs lack such complex interactive capabilities. InterMT captures human preferences at both global and local levels into nine sub-dimensions, consists of 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs. To compensate for the lack of capability for multi-modal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances. To further this goal, we introduce InterMT-Bench to assess the ability of MLLMs in assisting judges with multi-turn, multimodal tasks. We demonstrate the utility of \InterMT through applications such as judge moderation and further reveal the multi-turn scaling law of judge model. We hope the open-source of our data can help facilitate further research on aligning current MLLMs to the next step. Our project website can be found at https://pku-intermt.github.io .

[447] Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization

Wenqi Liu, Xuemeng Song, Jiaxi Li, Yinwei Wei, Na Zheng, Jianhua Yin, Liqiang Nie

Main category: cs.AI

TL;DR: SymMPO is a symmetric multimodal preference optimization method that addresses limitations in existing DPO approaches for MLLMs by using direct preference supervision and rigorous theoretical alignment while introducing a preference margin consistency loss.

Details

Motivation: Existing DPO methods for MLLMs suffer from non-rigorous optimization objectives and indirect preference supervision, despite making progress in reducing hallucination through vision-oriented contrastive objectives.

Method: SymMPO conducts symmetric preference learning with direct preference supervision using response pairs, maintains rigorous theoretical alignment with standard DPO, and introduces a preference margin consistency loss to quantitatively regulate preference gaps between symmetric preference pairs.

Result: Comprehensive evaluation across five benchmarks demonstrates SymMPO’s superior performance in hallucination mitigation for MLLMs.

Conclusion: SymMPO effectively addresses limitations of existing DPO methods for MLLMs by providing a more rigorous and direct approach to preference optimization that enhances visual understanding and reduces hallucination.

Abstract: Direct Preference Optimization (DPO) has emerged as an effective approach for mitigating hallucination in Multimodal Large Language Models (MLLMs). Although existing methods have achieved significant progress by utilizing vision-oriented contrastive objectives for enhancing MLLMs’ attention to visual inputs and hence reducing hallucination, they suffer from non-rigorous optimization objective function and indirect preference supervision. To address these limitations, we propose a Symmetric Multimodal Preference Optimization (SymMPO), which conducts symmetric preference learning with direct preference supervision (i.e., response pairs) for visual understanding enhancement, while maintaining rigorous theoretical alignment with standard DPO. In addition to conventional ordinal preference learning, SymMPO introduces a preference margin consistency loss to quantitatively regulate the preference gap between symmetric preference pairs. Comprehensive evaluation across five benchmarks demonstrate SymMPO’s superior performance, validating its effectiveness in hallucination mitigation of MLLMs.

[448] Toward Revealing Nuanced Biases in Medical LLMs

Farzana Islam Adiba, Rahmatollah Beheshti

Main category: cs.AI

TL;DR: A novel framework combining knowledge graphs with agentic LLMs to systematically reveal complex bias patterns in medical LLMs through adversarial perturbation and multi-hop KG characterization.

Details

Motivation: Medical LLMs are prone to biased and unfair patterns, which is critical to identify before clinical deployment to enable effective mitigation and minimize negative impacts on healthcare decisions.

Method: Combines knowledge graphs with auxiliary (agentic) LLMs, integrates adversarial perturbation (red teaming) techniques, and adopts customized multi-hop characterization of KGs to systematically evaluate target LLMs for bias detection.

Result: The framework demonstrates noticeably greater ability and scalability in revealing complex biased patterns compared to other common approaches, validated through comprehensive experiments on three datasets, six LLMs, and five bias types.

Conclusion: The proposed KG-LLM framework provides an effective systematic approach for identifying complex bias patterns in medical LLMs, which is crucial for safe clinical deployment and bias mitigation.

Abstract: Large language models (LLMs) used in medical applications are known to be prone to exhibiting biased and unfair patterns. Prior to deploying these in clinical decision-making, it is crucial to identify such bias patterns to enable effective mitigation and minimize negative impacts. In this study, we present a novel framework combining knowledge graphs (KGs) with auxiliary (agentic) LLMs to systematically reveal complex bias patterns in medical LLMs. The proposed approach integrates adversarial perturbation (red teaming) techniques to identify subtle bias patterns and adopts a customized multi-hop characterization of KGs to enhance the systematic evaluation of target LLMs. It aims not only to generate more effective red-teaming questions for bias evaluation but also to utilize those questions more effectively in revealing complex biases. Through a series of comprehensive experiments on three datasets, six LLMs, and five bias types, we demonstrate that our proposed framework exhibits a noticeably greater ability and scalability in revealing complex biased patterns of medical LLMs compared to other common approaches.

[449] Enhancing Multi-Agent Collaboration with Attention-Based Actor-Critic Policies

Hugo Garrido-Lestache Belinchon, Jeremy Kedziora

Main category: cs.AI

TL;DR: TAAC is a multi-agent RL algorithm using attention mechanisms for dynamic agent communication and penalized loss for role diversity, showing superior performance in soccer simulations.

Details

Motivation: To address challenges in multi-agent collaboration, particularly managing exponential joint-action spaces and promoting effective teamwork in cooperative environments like simulated soccer.

Method: Centralized Training/Centralized Execution with multi-headed attention in both actor and critic networks, enabling dynamic inter-agent communication through teammate queries, plus a penalized loss function for role diversity.

Result: TAAC outperforms benchmark algorithms (PPO, MAAC) in simulated soccer across multiple metrics: higher win rates, better goal differentials, improved Elo ratings, enhanced inter-agent connectivity, balanced spatial distributions, and more tactical interactions.

Conclusion: TAAC effectively enhances multi-agent collaboration through attention-based communication and role diversity promotion, demonstrating superior performance in complex cooperative tasks.

Abstract: This paper introduces Team-Attention-Actor-Critic (TAAC), a reinforcement learning algorithm designed to enhance multi-agent collaboration in cooperative environments. TAAC employs a Centralized Training/Centralized Execution scheme incorporating multi-headed attention mechanisms in both the actor and critic. This design facilitates dynamic, inter-agent communication, allowing agents to explicitly query teammates, thereby efficiently managing the exponential growth of joint-action spaces while ensuring a high degree of collaboration. We further introduce a penalized loss function which promotes diverse yet complementary roles among agents. We evaluate TAAC in a simulated soccer environment against benchmark algorithms representing other multi-agent paradigms, including Proximal Policy Optimization and Multi-Agent Actor-Attention-Critic. We find that TAAC exhibits superior performance and enhanced collaborative behaviors across a variety of metrics (win rates, goal differentials, Elo ratings, inter-agent connectivity, balanced spatial distributions, and frequent tactical interactions such as ball possession swaps).

[450] AI reasoning effort predicts human decision time in content moderation

Thomas Davidson

Main category: cs.AI

TL;DR: LLM reasoning steps (chain-of-thought length) predict human decision times in content moderation, showing parallels between human and AI reasoning processes.

Details

Motivation: To examine parallels between human decision-making processes and AI reasoning by comparing human decision times with LLM chain-of-thought length in content moderation tasks.

Method: Used content moderation task with three frontier LLMs, measuring chain-of-thought length as proxy for reasoning effort, and analyzing CoT content for contextual factor references.

Result: CoT length consistently predicts human decision time across models; both humans and models took longer/more steps when important variables were constant; models referenced contextual factors more frequently in difficult decisions.

Conclusion: Parallels exist between human and AI reasoning on practical tasks, and reasoning traces (CoT) enhance interpretability and decision-making understanding.

Abstract: Large language models can now generate intermediate reasoning steps before producing answers, improving performance on difficult problems by interactively developing solutions. This study uses a content moderation task to examine parallels between human decision times and model reasoning effort, measured using the length of the chain-of-thought (CoT). Across three frontier models, CoT length consistently predicts human decision time. Moreover, humans took longer and models produced longer CoTs when important variables were held constant, suggesting similar sensitivity to task difficulty. Analyses of the CoT content shows that models reference various contextual factors more frequently when making such decisions. These findings show parallels between human and AI reasoning on practical tasks and underscore the potential of reasoning traces for enhancing interpretability and decision-making.

[451] Instruction-Level Weight Shaping: A Framework for Self-Improving AI Agents

Rimom Costa

Main category: cs.AI

TL;DR: ILWS enables dynamic LLM updates via instruction-level weight shaping with reflection and distillation, improving throughput 2.4-5x and reducing hallucinations by ~80%.

Details

Motivation: Current methods for updating LLMs (RAG, fine-tuning, prompt engineering) have significant limitations: RAG increases latency and engineering overhead, prompt engineering is brittle, and fine-tuning is costly and risks catastrophic forgetting. There's a need for a more efficient, adaptive approach to incorporate new knowledge into LLMs without these drawbacks.

Method: Instruction-Level Weight Shaping (ILWS) uses curated system instructions as external pseudo-parameters updated after each session via reflection and user feedback. A Reflection Engine analyzes conversation traces, diagnoses reasoning successes/failures, and proposes typed deltas (ΔS, ΔU, ΔT) over instructions, user preferences, and tools. These deltas are version-controlled, evaluated with sliding window ratings, auto-repaired on first failure, and rolled back on repeated failure. When an edit budget threshold is crossed, the agent compiles a rating-weighted synthetic set and distills matured instruction-space gains into model parameters.

Result: In enterprise support, ILWS increased throughput 2.4-5.0x and cut audited hallucinations by about 80% versus a frozen baseline. In an Adobe Commerce Cloud proof of concept “L0 Support”, it achieved 4-5x more tickets per hour and about 80% lower time per ticket, with autonomous instruction updates and optional tool synthesis.

Conclusion: ILWS provides a novel approach to making LLMs more adaptive by operating at the instruction layer until controlled distillation, converting prompt-space improvements into weight-space without downtime. It generalizes to dynamic domains requiring adaptive reasoning, tool creation, and low-latency deployment while preserving governance and removing per-call retrieval overhead.

Abstract: Large language models (LLMs) are fluent but largely static after pre-training; new or shifting knowledge is typically added with retrieval-augmented generation (RAG) or fine-tuning. RAG raises latency and engineering overhead and often fails to integrate facts; prompt engineering is brittle and can conflict with prior knowledge; fine-tuning is costly and risks catastrophic forgetting. We propose Instruction-Level Weight Shaping (ILWS): curated system instructions act as external, auditable pseudo-parameters updated after each session via reflection and user feedback. A Reflection Engine inspects conversation traces, diagnoses reasoning successes and failures, and proposes typed deltas $ΔK=(ΔS,ΔU,ΔT)$ over instructions, user preferences, and tools. Deltas are version-controlled, evaluated with a sliding window of 1-5 star ratings, auto-repaired on first failure, and rolled back on repeated failure. When an edit budget crosses a threshold, the agent compiles a rating-weighted synthetic set and distills matured instruction-space gains into parameters, converting prompt-space improvements into weight-space without downtime. ILWS makes explicit the low-rank shaping induced by context in transformer blocks, preserves governance, and removes per-call retrieval. In enterprise support it increased throughput 2.4-5.0x and cut audited hallucinations by about 80% versus a frozen baseline. In an Adobe Commerce Cloud proof of concept “L0 Support”, it achieved 4-5x more tickets per hour and about 80% lower time per ticket, with autonomous instruction updates and optional tool synthesis. Because ILWS operates at the instruction layer until controlled distillation, it generalizes to dynamic domains (legal, medical, engineering) requiring adaptive reasoning, tool creation, and low-latency deployment.

[452] TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning

Chuang Jiang, Mingyue Cheng, Xiaoyu Tao, Qingyang Mao, Jie Ouyang, Qi Liu

Main category: cs.AI

TL;DR: TableMind: A tuning-based autonomous programmatic table agent that uses multi-turn reasoning with planning, action, and reflection to improve table reasoning performance over single-turn LLM approaches.

Details

Motivation: Current LLM-based table reasoning methods use single-turn reasoning with flattened tables, which has limitations: context overflow on large tables, weak sensitivity to numerical values, and lack of explicit tool-use and reflection capabilities.

Method: Two-stage training: 1) Supervised fine-tuning with high-quality reasoning data to bootstrap structured table reasoning; 2) Reinforcement learning with multi-perspective reward scheme and novel optimization objective for precise code generation.

Result: Extensive experiments on diverse benchmarks show TableMind consistently outperforms previous baselines, validating the effectiveness of training autonomous agents for table reasoning.

Conclusion: Training autonomous programmatic table agents with multi-turn cognitive schema (planning, action, reflection) improves overall table reasoning performance compared to single-turn LLM approaches.

Abstract: Table reasoning requires models to jointly perform comprehensive semantic understanding and precise numerical operations. Although recent large language model (LLM)-based methods have achieved promising results, most of them still rely on a single-turn reasoning paradigm that processes flattened tables in a single forward pass. This paradigm suffers from inherent limitations, including context overflow on large tables, weak sensitivity to continuous numerical values, and the absence of explicit tool-use and reflection. In this paper, we propose TableMind, a tuning-based autonomous programmatic table agent that simulates the human-like cognitive schema of the multi-turn interaction within a lightweight LLM. Instead of adopting a training-free workflow design, TableMind learns to internalize planning, action, and reflection through a principled two-stage training strategy. To bootstrap structured table reasoning capabilities, we construct and filter high-quality reasoning data for the supervised fine-tuning (SFT) stage. To enable precise code generation, we introduce a designed multi-perspective reward scheme and a novel optimization objective in the reinforcement learning (RL) stage. Extensive experiments on diverse benchmarks demonstrate that TableMind consistently outperforms previous baselines, validating the effectiveness of training autonomous agents to improve overall performance.

[453] The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems

Ziming Luo, Atoosa Kasirzadeh, Nihar B. Shah

Main category: cs.AI

TL;DR: AI scientist systems have potential but risk hidden flaws; paper identifies four failure modes, tests them on existing systems, finds issues, and recommends requiring workflow artifacts for transparency.

Details

Motivation: AI scientist systems can accelerate discovery but their internal workflows aren't scrutinized, risking flawed outputs that undermine research integrity and trustworthiness.

Method: Identified four failure modes (inappropriate benchmark selection, data leakage, metric misuse, post-hoc selection bias), designed controlled experiments to isolate each, and assessed two prominent open-source AI scientist systems.

Result: Found several failures across severity spectrum in tested systems that are easily overlooked in practice; showed that access to trace logs and code enables much better failure detection than examining final paper alone.

Conclusion: Journals/conferences should mandate submission of workflow artifacts (trace logs, code) alongside AI-generated research papers to ensure transparency, accountability, and reproducibility.

Abstract: AI scientist systems, capable of autonomously executing the full research workflow from hypothesis generation and experimentation to paper writing, hold significant potential for accelerating scientific discovery. However, the internal workflow of these systems have not been closely examined. This lack of scrutiny poses a risk of introducing flaws that could undermine the integrity, reliability, and trustworthiness of their research outputs. In this paper, we identify four potential failure modes in contemporary AI scientist systems: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias. To examine these risks, we design controlled experiments that isolate each failure mode while addressing challenges unique to evaluating AI scientist systems. Our assessment of two prominent open-source AI scientist systems reveals the presence of several failures, across a spectrum of severity, which can be easily overlooked in practice. Finally, we demonstrate that access to trace logs and code from the full automated workflow enables far more effective detection of such failures than examining the final paper alone. We thus recommend journals and conferences evaluating AI-generated research to mandate submission of these artifacts alongside the paper to ensure transparency, accountability, and reproducibility.

[454] Adaptation of Agentic AI

Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, Xueqiang Xu, Hanwen Xu, Pengrui Han, Dylan Zhang, Jiashuo Sun, Chaoqi Yang, Kun Qian, Tian Wang, Changran Hu, Manling Li, Quanzheng Li, Hao Peng, Sheng Wang, Jingbo Shang, Chao Zhang, Jiaxuan You, Liyuan Liu, Pan Lu, Yu Zhang, Heng Ji, Yejin Choi, Dawn Song, Jimeng Sun, Jiawei Han

Main category: cs.AI

TL;DR: This paper presents a systematic framework for understanding adaptation strategies in agentic AI systems, categorizing them into agent adaptations and tool adaptations with further subdivisions, providing a conceptual foundation and practical guidance for building more capable AI systems.

Details

Motivation: As agentic AI systems grow in capability and scope, adaptation becomes crucial for improving performance, reliability, and generalization. There's a need to unify the rapidly expanding research landscape into a systematic framework to clarify design choices and trade-offs.

Method: The authors develop a systematic framework that spans both agent adaptations and tool adaptations, further decomposing these into: 1) tool-execution-signaled and agent-output-signaled forms of agent adaptation, and 2) agent-agnostic and agent-supervised forms of tool adaptation. They review representative approaches in each category.

Result: The framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. The paper analyzes strengths and limitations of different approaches.

Conclusion: The paper offers a conceptual foundation and practical roadmap for researchers and practitioners to build more capable, efficient, and reliable agentic AI systems, while highlighting key open challenges and future opportunities in the field.

Abstract: Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.

[455] Explaining Tournament Solutions with Minimal Supports

Clément Contet, Umberto Grandi, Jérôme Mengin

Main category: cs.AI

TL;DR: The paper studies certified explanations for tournament winners by identifying minimal sub-tournaments where a candidate is guaranteed to win, providing formal explanations for tournament solutions.

Details

Motivation: Tournaments model pairwise dominance relationships, but there's a need for certified explanations of why specific candidates win under various tournament rules, which aligns with formal explainable AI principles for transparency and interpretability.

Method: The authors identify “minimal supports” - minimal sub-tournaments where a candidate is guaranteed to win regardless of how the rest of the tournament is completed. They analyze these for common tournament solutions (top cycle, uncovered set, Copeland, Borda, maximin, weighted uncovered set), determine the size of smallest minimal supports, and develop polynomial-time algorithms to compute them.

Result: For all tournament solutions except the weighted uncovered set, polynomial-time algorithms exist to compute minimal supports. For the weighted uncovered set, the problem is NP-complete. The paper shows how minimal supports can produce compact, certified, and intuitive explanations for tournament outcomes.

Conclusion: Minimal supports provide a formal framework for certified explanations in tournament analysis, offering practical algorithms for most common tournament solutions while identifying computational limitations for some weighted variants, advancing explainable AI in social choice and decision-making contexts.

Abstract: Tournaments are widely used models to represent pairwise dominance between candidates, alternatives, or teams. We study the problem of providing certified explanations for why a candidate appears among the winners under various tournament rules. To this end, we identify minimal supports, minimal sub-tournaments in which the candidate is guaranteed to win regardless of how the rest of the tournament is completed (that is, the candidate is a necessary winner of the sub-tournament). This notion corresponds to an abductive explanation for the question,“Why does the winner win the tournament?”, a central concept in formal explainable AI. We focus on common tournament solutions: the top cycle, the uncovered set, the Copeland rule, the Borda rule, the maximin rule, and the weighted uncovered set. For each rule we determine the size of the smallest minimal supports, and we present polynomial-time algorithms to compute them for all solutions except for the weighted uncovered set, for which the problem is NP-complete. Finally, we show how minimal supports can serve to produce compact, certified, and intuitive explanations for tournament solutions.

[456] Quantum Abduction: A New Paradigm for Reasoning under Uncertainty

Remo Pareschi

Main category: cs.AI

TL;DR: Quantum abduction framework models hypotheses in superposition using quantum cognition principles, allowing constructive interference and dynamic synthesis rather than eliminative search.

Details

Motivation: Traditional AI approaches to abduction use eliminative search that treats hypotheses as mutually exclusive and prunes them to find a single best explanation, which overlooks how human reasoners sustain multiple explanatory lines, navigate contradictions, and generate novel syntheses.

Method: Introduces quantum abduction paradigm that models hypotheses in superposition, allows constructive/destructive interference, and collapses only when coherence with evidence is reached. Implemented using quantum cognition principles with modern NLP embeddings and generative AI.

Result: Applied to case studies including historical mysteries (Ludwig II of Bavaria, Monster of Florence), literary examples (Murder on the Orient Express), medical diagnosis, and scientific theory change. The framework proves more faithful to constructive and multifaceted nature of human reasoning.

Conclusion: Quantum abduction offers a pathway toward more expressive and transparent AI reasoning systems that better capture the nuanced, multifaceted nature of human abductive reasoning.

Abstract: Abductive reasoning - the search for plausible explanations - has long been central to human inquiry, from forensics to medicine and scientific discovery. Yet formal approaches in AI have largely reduced abduction to eliminative search: hypotheses are treated as mutually exclusive, evaluated against consistency constraints or probability updates, and pruned until a single “best” explanation remains. This reductionist framing overlooks the way human reasoners sustain multiple explanatory lines in suspension, navigate contradictions, and generate novel syntheses. This paper introduces quantum abduction, a non-classical paradigm that models hypotheses in superposition, allows them to interfere constructively or destructively, and collapses only when coherence with evidence is reached. Grounded in quantum cognition and implemented with modern NLP embeddings and generative AI, the framework supports dynamic synthesis rather than premature elimination. Case studies span historical mysteries (Ludwig II of Bavaria, the “Monster of Florence”), literary demonstrations (“Murder on the Orient Express”), medical diagnosis, and scientific theory change. Across these domains, quantum abduction proves more faithful to the constructive and multifaceted nature of human reasoning, while offering a pathway toward expressive and transparent AI reasoning systems.

[457] Bridging the Gap Between Scientific Laws Derived by AI Systems and Canonical Knowledge via Abductive Inference with AI-Noether

Karan Srivastava, Sanjeeb Dash, Ryan Cory-Wright, Barry Trager, Cristina Cornelio, Lior Horesh

Main category: cs.AI

TL;DR: An AI system that automatically finds minimal corrections to incomplete scientific theories by generating missing axioms needed to derive observed hypotheses.

Details

Motivation: Current symbolic regression systems enforce derivability from prior knowledge, but when existing theories are incomplete or incorrect, they cannot generate hypotheses outside the theoretical scope. There's a need for automated systems that can find corrections to axiom systems to bridge this gap in scientific discovery.

Method: An open-source algebraic geometry-based system that takes an incomplete axiom system (expressible as polynomials) and a hypothesis that cannot be derived from current axioms, then generates a minimal set of candidate axioms that, when added to the theory, provably derive the hypothesis (even if noisy).

Result: The system successfully reconstructed key axioms required to derive several important scientific laws, including the carrier-resolved photo-Hall effect, Einstein’s relativistic laws, and several other physical laws.

Conclusion: The proposed algebraic geometry-based approach effectively addresses the challenge of automatically finding corrections to incomplete axiom systems, enabling scientific discovery when existing theories are insufficient to explain observed phenomena.

Abstract: Advances in AI have shown great potential in contributing to the acceleration of scientific discovery. Symbolic regression can fit interpretable models to data, but these models are not necessarily derivable from established theory. Recent systems (e.g., AI-Descartes, AI-Hilbert) enforce derivability from prior knowledge. However, when existing theories are incomplete or incorrect, these machine-generated hypotheses may fall outside the theoretical scope. Automatically finding corrections to axiom systems to close this gap remains a central challenge in scientific discovery. We propose a solution: an open-source algebraic geometry-based system that, given an incomplete axiom system expressible as polynomials and a hypothesis that the axioms cannot derive, generates a minimal set of candidate axioms that, when added to the theory, provably derive the (possibly noisy) hypothesis. We illustrate the efficacy of our approach by showing that it can reconstruct key axioms required to derive the carrier-resolved photo-Hall effect, Einstein’s relativistic laws, and several other laws.

[458] Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI

Kun Xiang, Terry Jingchen Zhang, Yinya Huang, Jixi He, Zirong Liu, Yueling Tang, Ruizhe Zhou, Lijing Luo, Youpeng Wen, Xiuwei Chen, Bingqian Lin, Jianhua Han, Hang Xu, Hanhui Li, Bin Dong, Xiaodan Liang

Main category: cs.AI

TL;DR: This paper provides a comprehensive overview of Physical AI, distinguishing between theoretical physics reasoning and applied physical understanding, and examines how physics-grounded methods enhance AI’s real-world comprehension across symbolic reasoning, embodied systems, and generative models.

Details

Motivation: The rapid advancement of embodied intelligence and world models has intensified efforts to integrate physical laws into AI systems, but physical perception and symbolic physics reasoning have developed separately without a unified bridging framework. There's a need to bridge this gap and create AI systems that genuinely understand physical laws rather than just recognizing patterns.

Method: The paper conducts a comprehensive overview and systematic examination of physical AI, establishing clear distinctions between theoretical physics reasoning and applied physical understanding. It analyzes how physics-grounded methods enhance AI across structured symbolic reasoning, embodied systems, and generative models through rigorous analysis of recent advances.

Result: The synthesis advocates for intelligent systems that ground learning in both physical principles and embodied reasoning processes, moving beyond pattern recognition toward genuine understanding of physical laws. It envisions next-generation world models capable of explaining physical phenomena and predicting future states.

Conclusion: The work advances the vision of safe, generalizable, and interpretable AI systems by integrating physical principles into AI frameworks. The authors maintain a continuously updated resource repository at https://github.com/AI4Phys/Awesome-AI-for-Physics to support ongoing development in this field.

Abstract: The rapid advancement of embodied intelligence and world models has intensified efforts to integrate physical laws into AI systems, yet physical perception and symbolic physics reasoning have developed along separate trajectories without a unified bridging framework. This work provides a comprehensive overview of physical AI, establishing clear distinctions between theoretical physics reasoning and applied physical understanding while systematically examining how physics-grounded methods enhance AI’s real-world comprehension across structured symbolic reasoning, embodied systems, and generative models. Through rigorous analysis of recent advances, we advocate for intelligent systems that ground learning in both physical principles and embodied reasoning processes, transcending pattern recognition toward genuine understanding of physical laws. Our synthesis envisions next-generation world models capable of explaining physical phenomena and predicting future states, advancing safe, generalizable, and interpretable AI systems. We maintain a continuously updated resource at https://github.com/AI4Phys/Awesome-AI-for-Physics.

[459] A Survey of Vibe Coding with Large Language Models

Yuyao Ge, Lingrui Mei, Zenghao Duan, Tianhao Li, Yujia Zheng, Yiwei Wang, Lexin Wang, Jiayu Yao, Tianyu Liu, Yujun Cai, Baolong Bi, Fangda Guo, Jiafeng Guo, Shenghua Liu, Xueqi Cheng

Main category: cs.AI

TL;DR: This survey provides the first comprehensive review of “Vibe Coding” - a new development paradigm where developers validate AI-generated code through outcome observation rather than line-by-line comprehension, analyzing over 1000 papers to establish theoretical foundations and practical frameworks.

Details

Motivation: The advancement of LLMs has enabled a shift from code generation assistance to autonomous coding agents, creating the "Vibe Coding" paradigm. However, its effectiveness remains under-explored with empirical evidence showing unexpected productivity losses and fundamental human-AI collaboration challenges.

Method: Systematic analysis of over 1000 research papers to survey the entire vibe coding ecosystem, examining critical infrastructure components including LLMs for coding, LLM-based coding agents, development environments, and feedback mechanisms. Formalizes Vibe Coding through a Constrained Markov Decision Process capturing the triadic relationship among developers, software projects, and coding agents.

Result: Establishes Vibe Coding as a formal discipline and synthesizes existing practices into five distinct development models: Unconstrained Automation, Iterative Conversational Collaboration, Planning-Driven, Test-Driven, and Context-Enhanced Models, providing the first comprehensive taxonomy in this domain.

Conclusion: Successful Vibe Coding depends not merely on agent capabilities but on systematic context engineering, well-established development environments, and human-agent collaborative development models, highlighting the need for structured frameworks to realize the transformative potential of this paradigm.

Abstract: The advancement of large language models (LLMs) has catalyzed a paradigm shift from code generation assistance to autonomous coding agents, enabling a novel development methodology termed “Vibe Coding” where developers validate AI-generated implementations through outcome observation rather than line-by-line code comprehension. Despite its transformative potential, the effectiveness of this emergent paradigm remains under-explored, with empirical evidence revealing unexpected productivity losses and fundamental challenges in human-AI collaboration. To address this gap, this survey provides the first comprehensive and systematic review of Vibe Coding with large language models, establishing both theoretical foundations and practical frameworks for this transformative development approach. Drawing from systematic analysis of over 1000 research papers, we survey the entire vibe coding ecosystem, examining critical infrastructure components including LLMs for coding, LLM-based coding agent, development environment of coding agent, and feedback mechanisms. We first introduce Vibe Coding as a formal discipline by formalizing it through a Constrained Markov Decision Process that captures the dynamic triadic relationship among human developers, software projects, and coding agents. Building upon this theoretical foundation, we then synthesize existing practices into five distinct development models: Unconstrained Automation, Iterative Conversational Collaboration, Planning-Driven, Test-Driven, and Context-Enhanced Models, thus providing the first comprehensive taxonomy in this domain. Critically, our analysis reveals that successful Vibe Coding depends not merely on agent capabilities but on systematic context engineering, well-established development environments, and human-agent collaborative development models.

[460] FreeAskWorld: An Interactive and Closed-Loop Simulator for Human-Centric Embodied AI

Yuhang Peng, Yizhou Pan, Xinning He, Jihaoyu Yang, Xinyu Yin, Han Wang, Xiaoji Zheng, Chao Gao, Jiangtao Gong

Main category: cs.AI

TL;DR: FreeAskWorld is a simulation framework integrating LLMs for human-centered social behavior simulation, with a benchmark dataset for embodied AI, showing improved performance in interaction-enriched navigation tasks.

Details

Motivation: Current simulation platforms focus on low-level physical interactions but lack complex human-centered social behaviors needed for advanced embodied intelligence research.

Method: Developed FreeAskWorld framework integrating LLMs for high-level behavior planning and semantically grounded interaction, with a modular data generation pipeline. Extended VLN to Direction Inquiry setting where agents actively seek navigational guidance.

Result: Created large-scale benchmark dataset with reconstructed environments, 6 task types, 16 object categories, 63,429 annotated frames, and 17+ hours of interaction data. Models fine-tuned on FreeAskWorld outperformed original counterparts in semantic understanding and interaction competency.

Conclusion: Socially grounded simulation frameworks effectively advance embodied AI systems toward sophisticated high-level planning and naturalistic human-agent interaction, demonstrating interaction as an additional information modality.

Abstract: As embodied intelligence emerges as a core frontier in artificial intelligence research, simulation platforms must evolve beyond low-level physical interactions to capture complex, human-centered social behaviors. We introduce FreeAskWorld, an interactive simulation framework that integrates large language models (LLMs) for high-level behavior planning and semantically grounded interaction, informed by theories of intention and social cognition. Our framework supports scalable, realistic human-agent simulations and includes a modular data generation pipeline tailored for diverse embodied tasks. To validate the framework, we extend the classic Vision-and-Language Navigation (VLN) task into a interaction enriched Direction Inquiry setting, wherein agents can actively seek and interpret navigational guidance. We present and publicly release FreeAskWorld, a large-scale benchmark dataset comprising reconstructed environments, six diverse task types, 16 core object categories, 63,429 annotated sample frames, and more than 17 hours of interaction data to support training and evaluation of embodied AI systems. We benchmark VLN models, and human participants under both open-loop and closed-loop settings. Experimental results demonstrate that models fine-tuned on FreeAskWorld outperform their original counterparts, achieving enhanced semantic understanding and interaction competency. These findings underscore the efficacy of socially grounded simulation frameworks in advancing embodied AI systems toward sophisticated high-level planning and more naturalistic human-agent interaction. Importantly, our work underscores that interaction itself serves as an additional information modality.

[461] Neural Decoding of Overt Speech from ECoG Using Vision Transformers and Contrastive Representation Learning

Mohamed Baha Ben Ticha, Xingchen Ran, Guillaume Saldanha, Gaël Le Godais, Philémon Roussel, Marc Aubert, Amina Fontanell, Thomas Costecalde, Lucas Struber, Serpil Karakas, Shaomin Zhang, Philippe Kahane, Guillaume Charvet, Stéphan Chabardès, Blaise Yvert

Main category: cs.AI

TL;DR: First attempt to decode speech from fully implantable wireless epidural recording system using encoder-decoder architecture with Vision Transformers and contrastive learning.

Details

Motivation: Speech BCIs offer communication solutions for people with severe paralysis. While recent studies show speech reconstruction from ECoG/intracortical recordings, streaming speech reconstruction from surface ECoG remains challenging and needs optimization.

Method: Offline speech decoding pipeline using encoder-decoder deep neural architecture integrating Vision Transformers and contrastive learning to enhance direct regression of speech from ECoG signals.

Result: Evaluated on two datasets: clinical subdural electrodes in epileptic patient and fully implantable WIMAGINE epidural system in motor BCI trial participant. First attempt to decode speech from fully implantable wireless epidural system.

Conclusion: Presents promising approach for speech decoding from ECoG signals, particularly significant as first demonstration with fully implantable wireless epidural system, offering perspectives for long-term clinical use.

Abstract: Speech Brain Computer Interfaces (BCIs) offer promising solutions to people with severe paralysis unable to communicate. A number of recent studies have demonstrated convincing reconstruction of intelligible speech from surface electrocorticographic (ECoG) or intracortical recordings by predicting a series of phonemes or words and using downstream language models to obtain meaningful sentences. A current challenge is to reconstruct speech in a streaming mode by directly regressing cortical signals into acoustic speech. While this has been achieved recently using intracortical data, further work is needed to obtain comparable results with surface ECoG recordings. In particular, optimizing neural decoders becomes critical in this case. Here we present an offline speech decoding pipeline based on an encoder-decoder deep neural architecture, integrating Vision Transformers and contrastive learning to enhance the direct regression of speech from ECoG signals. The approach is evaluated on two datasets, one obtained with clinical subdural electrodes in an epileptic patient, and another obtained with the fully implantable WIMAGINE epidural system in a participant of a motor BCI trial. To our knowledge this presents a first attempt to decode speech from a fully implantable and wireless epidural recording system offering perspectives for long-term use.

[462] Three methods, one problem: Classical and AI approaches to no-three-in-line

Pranav Ramanathan, Thomas Prellberg, Matthew Lewis, Prathamesh Dinesh Joshi, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat

Main category: cs.AI

TL;DR: First systematic comparison of classical optimization (ILP) vs AI methods (PatternBoost transformer, PPO) for the No-Three-In-Line problem, showing ILP optimal up to 19×19 grids while AI methods competitive on smaller instances.

Details

Motivation: The No-Three-In-Line problem is a famous combinatorial geometry challenge where classical methods like ILP face exponential scaling with grid size, and recent AI advances offer promising alternatives for pattern-based approximation. The paper aims to systematically compare classical optimization with AI approaches.

Method: Applied three approaches: 1) Classical Integer Linear Programming (ILP) for provably optimal solutions, 2) PatternBoost transformer learning for pattern-based approximation, and 3) Reinforcement learning using PPO (Proximal Policy Optimization). Compared these methods against traditional algorithms on various grid sizes.

Result: ILP achieved provably optimal solutions up to 19×19 grids. PatternBoost matched optimal performance up to 14×14 grids with 96% test loss reduction. PPO achieved perfect solutions on 10×10 grids but failed at 11×11 due to constraint violations. AI methods showed competitive performance on smaller instances.

Conclusion: Classical optimization remains essential for exact solutions, while AI methods offer competitive performance on smaller problem instances. Hybrid approaches combining classical and AI methods present the most promising direction for scaling to larger problem sizes.

Abstract: The No-Three-In-Line problem asks for the maximum number of points that can be placed on an n by n grid with no three collinear, representing a famous problem in combinatorial geometry. While classical methods like Integer Linear Programming (ILP) guarantee optimal solutions, they face exponential scaling with grid size, and recent advances in machine learning offer promising alternatives for pattern-based approximation. This paper presents the first systematic comparison of classical optimization and AI approaches to this problem, evaluating their performance against traditional algorithms. We apply PatternBoost transformer learning and reinforcement learning (PPO) to this problem for the first time, comparing them against ILP. ILP achieves provably optimal solutions up to 19 by 19 grids, while PatternBoost matches optimal performance up to 14 by 14 grids with 96% test loss reduction. PPO achieves perfect solutions on 10 by 10 grids but fails at 11 by 11 grids, where constraint violations prevent valid configurations. These results demonstrate that classical optimization remains essential for exact solutions while AI methods offer competitive performance on smaller instances, with hybrid approaches presenting the most promising direction for scaling to larger problem sizes.

[463] Socratic Students: Teaching Language Models to Learn by Asking Questions

Rajeev Bhatt Ambati, Tianyi Niu, Aashu Singh, Shlok Mishra, Shashank Srivastava, Snigdha Chaturvedi

Main category: cs.AI

TL;DR: Student-led interactive learning where LLMs actively query teachers outperforms static baselines by 0.5+ absolute Pass@k improvements in math and coding tasks.

Details

Motivation: Real-world applications like educational tutoring and medical assistance require dynamic information acquisition, not just static knowledge retrieval. Current LLMs excel at static interactions but need to actively recognize uncertainty, ask targeted questions, and retain new knowledge efficiently.

Method: Shift focus from teacher-instructed to student-led learning where students actively query teachers. Use Direct Preference Optimization (DPO) with guidance from self or stronger students to improve question quality. Train smaller models to ask better questions.

Result: Student-led approaches consistently yield absolute Pass@k improvements of at least 0.5 over static baselines across math and coding benchmarks, starting from near-zero performance. Guided training enables smaller models to learn how to ask better questions, enhancing learning efficiency.

Conclusion: Active student-led questioning strategies significantly outperform static approaches in interactive learning scenarios. Training with DPO and guidance from stronger models enables effective question-asking behavior in smaller models, advancing interactive AI capabilities.

Abstract: Large Language Models (LLMs) excel at static interactions, where they answer user queries by retrieving knowledge encoded in their parameters. However, in many real-world settings, such as educational tutoring or medical assistance, relevant information is not directly available and must be actively acquired through dynamic interactions. An interactive agent would recognize its own uncertainty, ask targeted questions, and retain new knowledge efficiently. Prior work has primarily explored effective ways for a teacher to instruct the student, where the teacher identifies student gaps and provides guidance. In this work, we shift the focus to the student and investigate effective strategies to actively query the teacher in seeking useful information. Across math and coding benchmarks, where baseline student models begin with near-zero performance, we show that student-led approaches consistently yield absolute Pass@k improvements of at least 0.5 over static baselines. To improve question quality, we train students using Direct Preference Optimization (DPO) with guidance from either self or stronger students. We find that this guided training enables smaller models to learn how to ask better questions, further enhancing learning efficiency.

[464] Can AI Understand What We Cannot Say? Measuring Multilevel Alignment Through Abortion Stigma Across Cognitive, Interpersonal, and Structural Levels

Anika Sharma, Malavika Mampally, Chidaksh Ravuru, Kandyce Brennan, Neil Gaikwad

Main category: cs.AI

TL;DR: LLMs fail to genuinely understand abortion stigma across cognitive, interpersonal, and structural levels, showing systematic biases and contradictions despite producing appropriate language.

Details

Motivation: As LLMs increasingly mediate stigmatized health decisions, there's a critical need to evaluate whether they genuinely understand complex psychological and physiological phenomena, particularly what people cannot explicitly express.

Method: Systematically tested 627 demographically diverse personas across five leading LLMs using the validated Individual Level Abortion Stigma Scale (ILAS), conducting multilevel analysis of stigma representation at cognitive, interpersonal, and structural levels.

Result: Models fail tests of genuine understanding across all levels: overestimate interpersonal stigma, underestimate cognitive stigma, assume uniform community condemnation, introduce demographic biases, miss stigma-secrecy relationship, and contradict themselves within theoretical constructs.

Conclusion: Current alignment approaches ensure appropriate language but not coherent multilevel understanding. AI safety in high-stakes contexts requires new approaches to design (multilevel coherence), evaluation (continuous auditing), governance (mandatory audits), and AI literacy.

Abstract: As large language models increasingly mediate stigmatized health decisions, their capacity to genuinely understand complex psychological and physiological phenomena remains poorly evaluated. Can AI understand what we cannot say? We investigate whether LLMs coherently represent abortion stigma across the cognitive, interpersonal, and structural levels where it operates. We systematically tested 627 demographically diverse personas across five leading LLMs using the validated Individual Level Abortion Stigma Scale (ILAS). Our multilevel analysis examined whether models coherently represent stigma at the cognitive level (self-judgment), interpersonal level (worries about judgment and isolation), and structural level (community condemnation and disclosure patterns), as well as overall stigma. Models fail tests of genuine understanding across all levels. They overestimate interpersonal stigma while underestimating cognitive stigma, assume uniform community condemnation, introduce demographic biases absent from human validation data, miss the empirically validated stigma-secrecy relationship, and contradict themselves within theoretical constructs. These patterns reveal that current alignment approaches ensure appropriate language but not coherent multilevel understanding. This work provides empirical evidence that current LLMs lack coherent multilevel understanding of psychological and physiological constructs. AI safety in high-stakes contexts demands new approaches to design (multilevel coherence), evaluation (continuous auditing), governance and regulation (mandatory audits, accountability, deployment restrictions), and AI literacy in domains where understanding what people cannot say determines whether support helps or harms.

[465] A Decision-Theoretic Approach for Managing Misalignment

Daniel A. Herrmann, Abinav Chari, Isabelle Qian, Sree Sharvesh, B. A. Levinstein

Main category: cs.AI

TL;DR: The paper introduces a decision-theoretic framework for determining when to delegate decisions to AI systems, showing that context-specific delegation can be optimal even with significant value misalignment if the AI has superior accuracy or expanded reach.

Details

Motivation: While value alignment literature focuses on shaping AI values, there's less attention on determining when imperfect alignment is sufficient for delegation under uncertainty. The paper aims to provide principled methods for deciding when AI is aligned enough for specific contexts.

Method: Develops a formal decision-theoretic framework that balances value (mis)alignment with epistemic accuracy and reach (available acts). Introduces a novel scoring framework to quantify ex ante delegation decisions under uncertainty about these factors.

Result: Reveals a sharp distinction: universal delegation requires near-perfect alignment and total epistemic trust (rare in practice), while context-specific delegation can be optimal even with significant misalignment if the agent has superior accuracy or expanded reach that grants access to better decision problems.

Conclusion: Provides a principled method for determining when AI is aligned enough for given contexts, shifting focus from achieving perfect alignment to managing risks and rewards of delegation under uncertainty. Rational delegation depends on balancing alignment, accuracy, and reach.

Abstract: When should we delegate decisions to AI systems? While the value alignment literature has developed techniques for shaping AI values, less attention has been paid to how to determine, under uncertainty, when imperfect alignment is good enough to justify delegation. We argue that rational delegation requires balancing an agent’s value (mis)alignment with its epistemic accuracy and its reach (the acts it has available). This paper introduces a formal, decision-theoretic framework to analyze this tradeoff precisely accounting for a principal’s uncertainty about these factors. Our analysis reveals a sharp distinction between two delegation scenarios. First, universal delegation (trusting an agent with any problem) demands near-perfect value alignment and total epistemic trust, conditions rarely met in practice. Second, we show that context-specific delegation can be optimal even with significant misalignment. An agent’s superior accuracy or expanded reach may grant access to better overall decision problems, making delegation rational in expectation. We develop a novel scoring framework to quantify this ex ante decision. Ultimately, our work provides a principled method for determining when an AI is aligned enough for a given context, shifting the focus from achieving perfect alignment to managing the risks and rewards of delegation under uncertainty.

[466] PDE-Agent: A toolchain-augmented multi-agent framework for PDE solving

Jianming Liu, Ren Zhu, Jian Xu, Kun Ding, Xu-Yao Zhang, Gaofeng Meng, Cheng-Lin Liu

Main category: cs.AI

TL;DR: PDE-Agent: A multi-agent LLM framework that automates PDE solving through tool invocation and collaboration, outperforming traditional methods in complex tasks.

Details

Motivation: Traditional PDE solving methods require manual setup and domain expertise. While PINNs and frameworks like DeepXDE improved automation, they still depend on expert knowledge and lack full autonomy. There's a need for a more automated approach to PDE solving.

Method: PDE-Agent frames PDE solving as tool invocation via LLM-driven agents with two key innovations: (1) Prog-Act framework with graph memory for multi-agent collaboration using dual-loop error correction, and (2) Resource-Pool with tool-parameter separation for multi-tool collaboration to manage runtime artifacts and dependencies.

Result: Evaluated using PDE-Bench benchmark, PDE-Agent shows superior applicability and performance in complex multi-step, cross-step dependent tasks compared to existing methods.

Conclusion: PDE-Agent introduces a new paradigm of toolchain-augmented multi-agent PDE solving that advances automated scientific computing by combining LLM reasoning with external tool controllability.

Abstract: Solving Partial Differential Equations (PDEs) is a cornerstone of engineering and scientific research. Traditional methods for PDE solving are cumbersome, relying on manual setup and domain expertise. While Physics-Informed Neural Network (PINNs) introduced end-to-end neural network-based solutions, and frameworks like DeepXDE further enhanced automation, these approaches still depend on expert knowledge and lack full autonomy. In this work, we frame PDE solving as tool invocation via LLM-driven agents and introduce PDE-Agent, the first toolchain-augmented multi-agent collaboration framework, inheriting the reasoning capacity of LLMs and the controllability of external tools and enabling automated PDE solving from natural language descriptions. PDE-Agent leverages the strengths of multi-agent and multi-tool collaboration through two key innovations: (1) A Prog-Act framework with graph memory for multi-agent collaboration, which enables effective dynamic planning and error correction via dual-loop mechanisms (localized fixes and global revisions). (2) A Resource-Pool integrated with a tool-parameter separation mechanism for multi-tool collaboration. This centralizes the management of runtime artifacts and resolves inter-tool dependency gaps in existing frameworks. To validate and evaluate this new paradigm for PDE solving , we develop PDE-Bench, a multi-type PDE Benchmark for agent-based tool collaborative solving, and propose multi-level metrics for assessing tool coordination. Evaluations verify that PDE-Agent exhibits superior applicability and performance in complex multi-step, cross-step dependent tasks. This new paradigm of toolchain-augmented multi-agent PDE solving will further advance future developments in automated scientific computing. Our source code and dataset will be made publicly available.

[467] Scaling Laws for Energy Efficiency of Local LLMs

Ander Alvarez, Alessandro Genuardi, Nilotpal Sinha, Antonio Tiene, Mikail Okyay, Bakbergen Ryskulov, David Montero, Samuel Mugel, Román Orús

Main category: cs.AI

TL;DR: CPU-only inference scaling laws for local LLMs/VLMs: compute scales linearly with token length for LLMs, VLMs have a “resolution knee” due to internal preprocessing, and quantum-inspired compression reduces compute/memory by up to 71.9% and energy by 62% while maintaining accuracy.

Details

Motivation: Most consumer hardware relies on CPUs rather than GPUs for AI deployment, but CPU-only inference scaling laws for local language and vision-language models remain unexplored, creating a gap in understanding how to optimize for edge devices with constrained computational and energy budgets.

Method: Systematic benchmarking of LLMs and VLMs on two CPU tiers: MacBook Pro M2 (mainstream laptop) and Raspberry Pi 5 (low-power embedded). Used unified methodology with continuous sampling of processor/memory usage and area-under-curve integration to characterize computational load scaling with input text length (LLMs) and image resolution (VLMs).

Result: Two empirical scaling laws: (1) LLM inference compute scales approximately linearly with token length; (2) VLMs exhibit a “resolution knee” where compute remains constant above an internal resolution clamp and decreases sharply below it. Quantum-inspired compression reduces processor/memory usage by up to 71.9% and energy consumption by up to 62% while preserving or improving semantic accuracy.

Conclusion: Provides systematic quantification of multimodal CPU-only scaling for local AI workloads, identifying model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference on consumer hardware.

Abstract: Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware–including laptops, desktops, industrial controllers, and embedded systems–relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven “resolution knee”, where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.

[468] Dual Computational Horizons: Incompleteness and Unpredictability in Intelligent Systems

Abhisek Ganguly

Main category: cs.AI

TL;DR: Algorithmic intelligence faces two fundamental computational limits: formal incompleteness (limits on deductive reasoning) and dynamical unpredictability (limits on long-term prediction). These constraints together prevent agents from universally verifying their own maximal prediction horizons.

Details

Motivation: The paper aims to formalize and analyze the inherent computational limitations that constrain algorithmic intelligence, specifically focusing on how these limitations affect an agent's ability to reason about its own predictive capabilities.

Method: The authors formalize two independent computational limitations: formal incompleteness (constraining consistent reasoning systems) and dynamical unpredictability (bounding long-term prediction under finite precision). They analyze how these two extremes impose structural bounds on self-referential reasoning about prediction capabilities.

Result: The main result shows that an algorithmic agent cannot universally verify its own maximal prediction horizon due to the combined constraints of formal incompleteness and dynamical unpredictability. This reveals inherent trade-offs between reasoning, prediction, and self-analysis in intelligent systems.

Conclusion: The construction presented represents one instance of a broader logical class of limitations on intelligent systems, clarifying fundamental trade-offs between reasoning power, predictive capability, and self-analysis that are inherent to algorithmic intelligence.

Abstract: We formalize two independent computational limitations that constrain algorithmic intelligence: formal incompleteness and dynamical unpredictability. The former limits the deductive power of consistent reasoning systems while the latter bounds long-term prediction under finite precision. We show that these two extrema together impose structural bounds on an agent’s ability to reason about its own predictive capabilities. In particular, an algorithmic agent cannot verify its own maximal prediction horizon universally. This perspective clarifies inherent trade-offs between reasoning, prediction, and self-analysis in intelligent systems. The construction presented here constitutes one representative instance of a broader logical class of such limitations.

[469] Discovering and Learning Probabilistic Models of Black-Box AI Capabilities

Daniel Bramblett, Rushang Karia, Adrian Ciotinga, Ruthvick Suresh, Pulkit Verma, YooJung Choi, Siddharth Srivastava

Main category: cs.AI

TL;DR: This paper presents a method to learn PDDL-style symbolic models of black-box AI systems’ planning capabilities using Monte-Carlo tree search for systematic testing and hypothesis pruning.

Details

Motivation: As black-box AI systems like foundational models are increasingly used for sequential decision making, there's a need for efficient methods to provide sound and interpretable representations of their capabilities to ensure safe deployment.

Method: Uses Monte-Carlo tree search to systematically create test tasks, acquire data, and prune the hypothesis space of possible symbolic models. Learns PDDL-style representations that describe the BBAI’s capabilities, execution conditions, and probabilistic outcomes.

Result: Theoretical results show soundness, completeness and convergence of the learned models. Empirical results with multiple BBAI systems demonstrate the scope, efficiency, and accuracy of the methods.

Conclusion: PDDL-style representations can be effectively used to learn and model black-box AI planning capabilities, providing interpretable models that capture execution conditions and probabilistic outcomes for safer deployment.

Abstract: Black-box AI (BBAI) systems such as foundational models are increasingly being used for sequential decision making. To ensure that such systems are safe to operate and deploy, it is imperative to develop efficient methods that can provide a sound and interpretable representation of the BBAI’s capabilities. This paper shows that PDDL-style representations can be used to efficiently learn and model an input BBAI’s planning capabilities. It uses the Monte-Carlo tree search paradigm to systematically create test tasks, acquire data, and prune the hypothesis space of possible symbolic models. Learned models describe a BBAI’s capabilities, the conditions under which they can be executed, and the possible outcomes of executing them along with their associated probabilities. Theoretical results show soundness, completeness and convergence of the learned models. Empirical results with multiple BBAI systems illustrate the scope, efficiency, and accuracy of the presented methods.

[470] Solomonoff-Inspired Hypothesis Ranking with LLMs for Prediction Under Uncertainty

Josh Barber, Rourke Young, Cameron Coombe, Will Browne

Main category: cs.AI

TL;DR: Proposes Solomonoff-inspired method weighting LLM-generated hypotheses by simplicity and predictive fit for uncertainty-aware reasoning on sparse-data tasks like Mini-ARC.

Details

Motivation: Reasoning under uncertainty is challenging for real-world AI tasks with sparse data. Existing approaches struggle to balance accuracy and simplicity when evaluating multiple candidate solutions.

Method: Solomonoff-inspired method that weights LLM-generated hypotheses by both simplicity (algorithmic information theory) and predictive fit. Applied to Mini-ARC benchmark tasks, producing Solomonoff-weighted mixtures for per-cell predictions.

Result: Method yields conservative, uncertainty-aware outputs even with noisy/incorrect hypotheses. Compared to BMA, Solomonoff scoring spreads probability more evenly across competing hypotheses while BMA concentrates weight on potentially flawed most-likely candidates.

Conclusion: Algorithmic information-theoretic priors provide value for interpretable, reliable multi-hypothesis reasoning under uncertainty, especially for sparse-data tasks requiring systematic generalization.

Abstract: Reasoning under uncertainty is a key challenge in AI, especially for real-world tasks, where problems with sparse data demands systematic generalisation. Existing approaches struggle to balance accuracy and simplicity when evaluating multiple candidate solutions. We propose a Solomonoff-inspired method that weights LLM-generated hypotheses by simplicity and predictive fit. Applied to benchmark (Mini-ARC) tasks, our method produces Solomonoff-weighted mixtures for per-cell predictions, yielding conservative, uncertainty-aware outputs even when hypotheses are noisy or partially incorrect. Compared to Bayesian Model Averaging (BMA), Solomonoff scoring spreads probability more evenly across competing hypotheses, while BMA concentrates weight on the most likely but potentially flawed candidates. Across tasks, this highlights the value of algorithmic information-theoretic priors for interpretable, reliable multi-hypothesis reasoning under uncertainty.

cs.SD

[471] chatter: a Python library for applying information theory and AI/ML models to animal communication

Mason Youngblood

Main category: cs.SD

TL;DR: chatter is a Python library for analyzing animal communication in continuous latent space using ML, bypassing categorical approaches to preserve complexity.

Details

Motivation: Traditional categorical approaches to animal communication analysis (e.g., classifying syllables, notes) flatten the natural complexity and nuance of real communication systems, losing important information.

Method: Uses modern machine learning techniques including variational autoencoders and vision transformers to represent vocal sequences as trajectories in high-dimensional latent space, providing an end-to-end workflow from preprocessing to feature extraction.

Result: Taxonomically agnostic library tested with vocalizations of birds, bats, whales, and primates, enabling quantification of complexity, predictability, similarity, and novelty of vocal sequences without manual categorization.

Conclusion: chatter offers a novel approach to animal communication analysis that preserves system complexity by working in continuous latent space rather than forcing categorical classification, making it broadly applicable across taxa.

Abstract: The study of animal communication often involves categorizing units into types (e.g. syllables in songbirds, or notes in humpback whales). While this approach is useful in many cases, it necessarily flattens the complexity and nuance present in real communication systems. chatter is a new Python library for analyzing animal communication in continuous latent space using information theory and modern machine learning techniques. It is taxonomically agnostic, and has been tested with the vocalizations of birds, bats, whales, and primates. By leveraging a variety of different architectures, including variational autoencoders and vision transformers, chatter represents vocal sequences as trajectories in high-dimensional latent space, bypassing the need for manual or automatic categorization of units. The library provides an end-to-end workflow – from preprocessing and segmentation to model training and feature extraction – that enables researchers to quantify the complexity, predictability, similarity, and novelty of vocal sequences.

[472] Let the Model Learn to Feel: Mode-Guided Tonality Injection for Symbolic Music Emotion Recognition

Haiying Xia, Zhongyi Huang, Yumei Tan, Shuxiang Song

Main category: cs.SD

TL;DR: The paper proposes MoGE, a mode-guided enhancement strategy that injects musical mode features into MIDIBERT to improve symbolic music emotion recognition by addressing its limitations in capturing mode-emotion correlations.

Details

Motivation: Current pre-trained models like MIDIBERT capture distributional musical semantics but overlook tonal structures, particularly musical modes, which are critical for emotional perception according to music psychology. The authors identify MIDIBERT's limitations in capturing mode-emotion associations.

Method: Propose Mode-Guided Enhancement (MoGE) strategy: 1) Conduct mode augmentation analysis to reveal MIDIBERT’s failure in encoding emotion-mode correlations; 2) Identify least emotion-relevant layer in MIDIBERT; 3) Introduce Mode-guided Feature-wise linear modulation injection (MoFi) framework to inject explicit mode features into the model.

Result: Extensive experiments on EMOPIA and VGMIDI datasets show significant SMER performance improvements: 75.2% accuracy on EMOPIA and 59.1% on VGMIDI, validating the effectiveness of mode-guided modeling.

Conclusion: The mode injection strategy successfully enhances symbolic music emotion recognition by incorporating psychological insights about musical modes, addressing a key limitation of existing pre-trained models in capturing mode-emotion correlations.

Abstract: Music emotion recognition is a key task in symbolic music understanding (SMER). Recent approaches have shown promising results by fine-tuning large-scale pre-trained models (e.g., MIDIBERT, a benchmark in symbolic music understanding) to map musical semantics to emotional labels. While these models effectively capture distributional musical semantics, they often overlook tonal structures, particularly musical modes, which play a critical role in emotional perception according to music psychology. In this paper, we investigate the representational capacity of MIDIBERT and identify its limitations in capturing mode-emotion associations. To address this issue, we propose a Mode-Guided Enhancement (MoGE) strategy that incorporates psychological insights on mode into the model. Specifically, we first conduct a mode augmentation analysis, which reveals that MIDIBERT fails to effectively encode emotion-mode correlations. We then identify the least emotion-relevant layer within MIDIBERT and introduce a Mode-guided Feature-wise linear modulation injection (MoFi) framework to inject explicit mode features, thereby enhancing the model’s capability in emotional representation and inference. Extensive experiments on the EMOPIA and VGMIDI datasets demonstrate that our mode injection strategy significantly improves SMER performance, achieving accuracies of 75.2% and 59.1%, respectively. These results validate the effectiveness of mode-guided modeling in symbolic music emotion recognition.

[473] Influence of string register locations on vibratos among violoncellists

Steven Hu, Sophia H. Kim, Helena H. Kim, Hugo Mackay, Eric J. Heller

Main category: cs.SD

TL;DR: Cello vibrato depth increases when moving fingers toward bridge, but players reduce finger motion, showing partial compensation behavior.

Details

Motivation: To understand how vibrato characteristics change with finger position on cello strings and examine players' compensatory behaviors.

Method: Analyzed 94 musical excerpts to measure acoustic vibrato depth and physical finger amplitude changes with finger position along the string.

Result: Moving finger toward bridge strongly increases acoustic vibrato depth (ρ=0.6902, p=1.408·10⁻¹⁴) while physical finger amplitude decreases (ρ=-0.6391, p=4.172·10⁻¹²).

Conclusion: Cellists reduce finger motion in higher positions but not enough to counteract increased pitch deviation, revealing both presence and limits of compensatory vibrato behavior.

Abstract: This study analyzes how vibrato changes with finger position along the cello string. Examining 94 excerpts, we found moving the finger toward the bridge strongly increases acoustic vibrato depth ($ρ=0.6902$, $p=1.408\cdot 10^{-14}$). However, the performer’s physical finger amplitude simultaneously decreases ($ρ=-0.6391$, $p=4.172\cdot 10^{-12}$). This shows players reduce finger motion in higher positions, but not enough to counteract the greater pitch deviation there, revealing both the presence and limits of compensatory vibrato behavior.

[474] A Data-Centric Approach to Generalizable Speech Deepfake Detection

Wen Huang, Yuchen Mao, Yanmin Qian

Main category: cs.SD

TL;DR: This paper proposes a data-centric approach to improve speech deepfake detection by analyzing data composition through scaling laws and developing a diversity-optimized sampling strategy for mixing heterogeneous datasets.

Details

Motivation: Current speech deepfake detection models struggle with robust generalization to unseen forgery methods. While most research focuses on model and algorithm improvements, the impact of data composition is underexplored, creating a gap in understanding how to optimize training data for better detection performance.

Method: The paper takes a data-centric approach with two main components: 1) A large-scale empirical study to characterize data scaling laws for SDD, quantifying the impact of source and generator diversity, and 2) A Diversity-Optimized Sampling Strategy (DOSS) framework with two implementations: DOSS-Select (pruning) and DOSS-Weight (re-weighting) for mixing heterogeneous data.

Result: DOSS-Select outperforms naive aggregation baselines while using only 3% of total available data. The final model, trained on a 12k-hour curated data pool using optimal DOSS-Weight strategy, achieves state-of-the-art performance, outperforming large-scale baselines with greater data and model efficiency on both public benchmarks and a new challenge set of commercial APIs.

Conclusion: Data composition plays a crucial role in speech deepfake detection performance. The proposed data-centric approach, particularly the diversity-optimized sampling strategies, enables more efficient and effective training, leading to superior generalization capabilities with better data and model efficiency.

Abstract: Achieving robust generalization in speech deepfake detection (SDD) remains a primary challenge, as models often fail to detect unseen forgery methods. While research has focused on model-centric and algorithm-centric solutions, the impact of data composition is often underexplored. This paper proposes a data-centric approach, analyzing the SDD data landscape from two practical perspectives: constructing a single dataset and aggregating multiple datasets. To address the first perspective, we conduct a large-scale empirical study to characterize the data scaling laws for SDD, quantifying the impact of source and generator diversity. To address the second, we propose the Diversity-Optimized Sampling Strategy (DOSS), a principled framework for mixing heterogeneous data with two implementations: DOSS-Select (pruning) and DOSS-Weight (re-weighting). Our experiments show that DOSS-Select outperforms the naive aggregation baseline while using only 3% of the total available data. Furthermore, our final model, trained on a 12k-hour curated data pool using the optimal DOSS-Weight strategy, achieves state-of-the-art performance, outperforming large-scale baselines with greater data and model efficiency on both public benchmarks and a new challenge set of various commercial APIs.

[475] AutoSchA: Automatic Hierarchical Music Representations via Multi-Relational Node Isolation

Stephen Ni-Hahn, Rico Zhu, Jerry Yin, Yue Jiang, Cynthia Rudin, Simon Mak

Main category: cs.SD

TL;DR: AutoSchA: A graph neural network framework for automatic hierarchical music analysis that performs comparably to human experts on Baroque fugue subjects.

Details

Motivation: Hierarchical music analysis (like Schenkerian analysis) is powerful but time-consuming and requires expert training. Current methods lack computer-readable formats and automated approaches despite recent advances in hierarchical deep learning.

Method: AutoSchA uses graph neural networks with three key innovations: 1) graph learning framework for hierarchical music representation, 2) node isolation-based graph pooling that optimizes learned pooling assignments, and 3) integrated state-of-the-art architecture.

Result: AutoSchA performs comparably to human experts when analyzing Baroque fugue subjects in experimental evaluations.

Conclusion: The paper introduces a promising automated framework for hierarchical music analysis that bridges music theory with modern deep learning techniques, showing potential to reduce expert workload while maintaining analysis quality.

Abstract: Hierarchical representations provide powerful and principled approaches for analyzing many musical genres. Such representations have been broadly studied in music theory, for instance via Schenkerian analysis (SchA). Hierarchical music analyses, however, are highly cost-intensive; the analysis of a single piece of music requires a great deal of time and effort from trained experts. The representation of hierarchical analyses in a computer-readable format is a further challenge. Given recent developments in hierarchical deep learning and increasing quantities of computer-readable data, there is great promise in extending such work for an automatic hierarchical representation framework. This paper thus introduces a novel approach, AutoSchA, which extends recent developments in graph neural networks (GNNs) for hierarchical music analysis. AutoSchA features three key contributions: 1) a new graph learning framework for hierarchical music representation, 2) a new graph pooling mechanism based on node isolation that directly optimizes learned pooling assignments, and 3) a state-of-the-art architecture that integrates such developments for automatic hierarchical music analysis. We show, in a suite of experiments, that AutoSchA performs comparably to human experts when analyzing Baroque fugue subjects.

[476] JoyVoice: Long-Context Conditioning for Anthropomorphic Multi-Speaker Conversational Synthesis

Fan Yu, Tao Wang, You Wu, Lin Zhu, Wei Deng, Weisheng Han, Wenchao Wang, Lin Hu, Xiangyu Liang, Xiaodong He, Yankun Huang, Yu Gu, Yuan Liu, Yuxuan Wang, Zhangyu Xiao, Ziteng Wang, Boya Dong, Feng Dang, Jinming Chen, Jingdong Li, Jun Wang, Yechen Jin, Yuan Zhang, Zhengyan Sheng, Xin Wang

Main category: cs.SD

TL;DR: JoyVoice is a novel anthropomorphic foundation model for multi-speaker long-form speech generation, supporting up to 8 speakers with unified E2E-Transformer-DiT architecture and achieving SOTA results in multilingual generation and voice cloning.

Details

Motivation: Current long-form speech generation models are limited to dyadic, turn-based interactions, creating a need for more flexible, boundary-free synthesis that can handle multiple speakers in conversational settings.

Method: Uses unified E2E-Transformer-DiT architecture with autoregressive hidden representations for diffusion inputs; proposes MM-Tokenizer at 12.5 Hz bitrate with multitask semantic and MMSE losses; incorporates robust text front-end processing via large-scale data perturbation.

Result: Achieves state-of-the-art results in multilingual generation (Chinese, English, Japanese, Korean) and zero-shot voice cloning; top-tier performance on Seed-TTS-Eval Benchmark and multi-speaker long-form conversational voice cloning; superior audio quality, generalization, prosodic continuity, rhythm richness, and paralinguistic naturalness.

Conclusion: JoyVoice successfully addresses limitations of current long-form speech generation models by enabling flexible, boundary-free synthesis of up to 8 speakers with superior performance across multiple metrics and languages.

Abstract: Large speech generation models are evolving from single-speaker, short sentence synthesis to multi-speaker, long conversation geneartion. Current long-form speech generation models are predominately constrained to dyadic, turn-based interactions. To address this, we introduce JoyVoice, a novel anthropomorphic foundation model designed for flexible, boundary-free synthesis of up to eight speakers. Unlike conventional cascaded systems, JoyVoice employs a unified E2E-Transformer-DiT architecture that utilizes autoregressive hidden representations directly for diffusion inputs, enabling holistic end-to-end optimization. We further propose a MM-Tokenizer operating at a low bitrate of 12.5 Hz, which integrates multitask semantic and MMSE losses to effectively model both semantic and acoustic information. Additionally, the model incorporates robust text front-end processing via large-scale data perturbation. Experiments show that JoyVoice achieves state-of-the-art results in multilingual generation (Chinese, English, Japanese, Korean) and zero-shot voice cloning. JoyVoice achieves top-tier results on both the Seed-TTS-Eval Benchmark and multi-speaker long-form conversational voice cloning tasks, demonstrating superior audio quality and generalization. It achieves significant improvements in prosodic continuity for long-form speech, rhythm richness in multi-speaker conversations, paralinguistic naturalness, besides superior intelligibility. We encourage readers to listen to the demo at https://jea-speech.github.io/JoyVoice

[477] Explainable Transformer-CNN Fusion for Noise-Robust Speech Emotion Recognition

Sudip Chakrabarty, Pappu Bishwas, Rajdeep Chatterjee

Main category: cs.SD

TL;DR: A Hybrid Transformer-CNN framework for robust Speech Emotion Recognition that combines Wav2Vec 2.0’s contextual modeling with 1D-CNN’s spectral stability, featuring dual-stream processing, noise-resistant feature extraction, and explainable AI integration.

Details

Motivation: Speech Emotion Recognition systems degrade in real-world noisy environments, and deep learning models lack transparency for trust-sensitive applications, creating a need for robust and explainable SER solutions.

Method: Dual-stream architecture processing raw waveforms with Wav2Vec 2.0 for temporal dependencies and 1D-CNN for spectral features (MFCC, ZCR, RMSE) using Attentive Temporal Pooling. Integrated SHAP and Score-CAM for explainability.

Result: Achieved state-of-the-art accuracy across RAVDESS, TESS, SAVEE, and CREMA-D datasets, with superior generalization and robustness against real-world noise from SAS-KIIT dataset, outperforming single-branch baselines.

Conclusion: The hybrid framework successfully addresses both robustness to acoustic interference and model transparency, enabling reliable SER in real-world environments while providing interpretable insights into model decision-making.

Abstract: Speech Emotion Recognition (SER) systems often degrade in performance when exposed to the unpredictable acoustic interference found in real-world environments. Additionally, the opacity of deep learning models hinders their adoption in trust-sensitive applications. To bridge this gap, we propose a Hybrid Transformer-CNN framework that unifies the contextual modeling of Wav2Vec 2.0 with the spectral stability of 1D-Convolutional Neural Networks. Our dual-stream architecture processes raw waveforms to capture long-range temporal dependencies while simultaneously extracting noise-resistant spectral features (MFCC, ZCR, RMSE) via a custom Attentive Temporal Pooling mechanism. We conducted extensive validation across four diverse benchmark datasets: RAVDESS, TESS, SAVEE, and CREMA-D. To rigorously test robustness, we subjected the model to non-stationary acoustic interference using real-world noise profiles from the SAS-KIIT dataset. The proposed framework demonstrates superior generalization and state-of-the-art accuracy across all datasets, significantly outperforming single-branch baselines under realistic environmental interference. Furthermore, we address the ``black-box" problem by integrating SHAP and Score-CAM into the evaluation pipeline. These tools provide granular visual explanations, revealing how the model strategically shifts attention between temporal and spectral cues to maintain reliability in the presence of complex environmental noise.

[478] Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis

Pengchao Feng, Yao Xiao, Ziyang Ma, Zhikang Niu, Shuai Fan, Yao Li, Sheng Wang, Xie Chen

Main category: cs.SD

TL;DR: HE-Vector is a two-stage method for emotional dialectal TTS that enables cross-style synthesis without requiring jointly labeled dialect-emotion data.

Details

Motivation: While TTS has improved in naturalness and intelligibility, cross-style synthesis combining both dialect and emotion remains challenging due to scarcity of dialectal data with emotional labels. Current research focuses on enhancing expressiveness but lacks methods for joint dialect-emotion synthesis.

Method: Two-stage approach: 1) Expressive Vector (E-Vector) constructs task vectors to model dialectal and emotional styles independently, enhancing single-style synthesis via weight adjustment. 2) Hierarchical Expressive Vector (HE-Vector) hierarchically integrates these vectors for controllable emotionally expressive dialect synthesis without requiring jointly labeled data.

Result: HE-Vectors achieve superior performance in dialect synthesis and promising results in synthesizing emotionally expressive dialectal speech in a zero-shot setting.

Conclusion: The proposed HE-Vector method successfully addresses the challenge of emotional dialectal TTS without needing jointly labeled data, enabling controllable cross-style synthesis and showing strong performance in both dialect synthesis and emotional dialectal speech generation.

Abstract: Recent advances in text-to-speech (TTS) have yielded remarkable improvements in naturalness and intelligibility. Building on these achievements, research has increasingly shifted toward enhancing the expressiveness of generated speech, such as dialectal and emotional TTS. However, cross-style synthesis combining both dialect and emotion remains challenging and largely unexplored, mainly due to the scarcity of dialectal data with emotional labels. To address this, we propose Hierarchical Expressive Vector (HE-Vector), a two-stage method for Emotional Dialectal TTS. In the first stage, we construct different task vectors to model dialectal and emotional styles independently, and then enhance single-style synthesis by adjusting their weights, a method we refer to as Expressive Vector (E-Vector). For the second stage, we hierarchically integrate these vectors to achieve controllable emotionally expressive dialect synthesis without requiring jointly labeled data, corresponding to Hierarchical Expressive Vector (HE-Vector). Experimental results demonstrate that HE-Vectors achieve superior performance in dialect synthesis, and promising results in synthesizing emotionally expressive dialectal speech in a zero-shot setting.

[479] X-Talk: On the Underestimated Potential of Modular Speech-to-Speech Dialogue System

Zhanxun Liu, Yifan Duan, Mengmeng Wang, Pengchao Feng, Haotian Zhang, Xiaoyu Xing, Yijia Shan, Haina Zhu, Yuhang Dai, Chaochao Lu, Xipeng Qiu, Lei Xie, Lan Wang, Nan Yan, Zilong Zheng, Ziyang Ma, Kai Yu, Xie Chen

Main category: cs.SD

TL;DR: X-Talk is an open-source modular framework for LLM-driven speech-to-speech systems that uses a cascaded pipeline approach instead of end-to-end models, achieving sub-second latency while maintaining flexibility.

Details

Motivation: The paper challenges the dominant end-to-end modeling trend in speech-to-speech systems, arguing that single "omni-models" struggle to balance competing objectives of complex speech tasks within one network. The authors propose that a modular, cascaded approach can offer better performance and flexibility.

Method: X-Talk employs a decoupled, modular design with specialized front-end components (VAD, speech enhancement) and diverse understanding models (ASR, emotion, environmental sound analysis) integrated with LLM capabilities like retrieval-augmented generation (RAG) and tool use. It uses a systematically optimized cascaded pipeline approach.

Result: The framework achieves sub-second latency without sacrificing modular flexibility, demonstrating that cascaded pipelines can compete with end-to-end approaches while offering greater modularity and specialization.

Conclusion: X-Talk revitalizes the cascaded approach for speech-to-speech systems, highlighting the underestimated potential of modular designs and providing a robust foundation for future research and applications in LLM-driven speech processing.

Abstract: We present X-Talk, an open-source framework that champions a decoupled, modular design for LLM-driven speech-to-speech (S2S) systems. While the dominant trend favors end-to-end (E2E) modeling to optimize information flow, these “omni-models” often struggle to balance the competing objectives of complex speech tasks within a single network. X-Talk challenges this paradigm by demonstrating that a systematically optimized cascaded pipeline can achieve sub-second latency without sacrificing modular flexibility. Our framework seamlessly integrates specialized front-end components (e.g., VAD, speech enhancement) and diverse understanding models (e.g., ASR, emotion, and environmental sound analysis) with LLM capabilities like retrieval-augmented generation (RAG) and tool use. By revitalizing the cascaded approach, X-Talk highlights the underestimated potential of modular S2S systems and provides a robust foundation for future research and applications.

[480] Smark: A Watermark for Text-to-Speech Diffusion Models via Discrete Wavelet Transform

Yichuan Zhang, Chengxin Li, Yujie Gu

Main category: cs.SD

TL;DR: Smark is a universal watermarking scheme for TTS diffusion models that embeds watermarks in low-frequency audio regions using discrete wavelet transform, achieving high audio quality and extraction accuracy across various models.

Details

Motivation: Current TTS diffusion models generate high-quality speech but lack intellectual property protection and legal tracing capabilities. Existing watermarking methods are model-specific and degrade audio quality, limiting practical use.

Method: Smark uses a lightweight watermark embedding framework operating in the common reverse diffusion paradigm. It employs discrete wavelet transform (DWT) to embed watermarks into stable low-frequency audio regions, ensuring seamless integration and resistance to removal during diffusion.

Result: Extensive experiments show Smark achieves superior performance in both audio quality and watermark extraction accuracy across various simulated real-world attack scenarios.

Conclusion: Smark provides an effective universal watermarking solution for TTS diffusion models that maintains audio quality while enabling reliable intellectual property protection and speech tracing.

Abstract: Text-to-Speech (TTS) diffusion models generate high-quality speech, which raises challenges for the model intellectual property protection and speech tracing for legal use. Audio watermarking is a promising solution. However, due to the structural differences among various TTS diffusion models, existing watermarking methods are often designed for a specific model and degrade audio quality, which limits their practical applicability. To address this dilemma, this paper proposes a universal watermarking scheme for TTS diffusion models, termed Smark. This is achieved by designing a lightweight watermark embedding framework that operates in the common reverse diffusion paradigm shared by all TTS diffusion models. To mitigate the impact on audio quality, Smark utilizes the discrete wavelet transform (DWT) to embed watermarks into the relatively stable low-frequency regions of the audio, which ensures seamless watermark-audio integration and is resistant to removal during the reverse diffusion process. Extensive experiments are conducted to evaluate the audio quality and watermark performance in various simulated real-world attack scenarios. The experimental results show that Smark achieves superior performance in both audio quality and watermark extraction accuracy.

[481] Reliable Audio Deepfake Detection in Variable Conditions via Quantum-Kernel SVMs

Lisan Al Amin, Vandana P. Janeja

Main category: cs.SD

TL;DR: Quantum kernel SVMs outperform classical SVMs in audio deepfake detection across multiple datasets, reducing false positive rates by 13-57% without increasing model size.

Details

Motivation: Detecting synthetic speech is challenging with scarce labeled data and varying recording conditions. Existing deep models often overfit, while classical kernel methods depend heavily on kernel choice. Quantum kernels offer expressive similarity measures in high-dimensional Hilbert spaces for better generalization.

Method: Used quantum-kernel SVMs (QSVMs) compared with classical SVMs using identical mel-spectrogram preprocessing and stratified 5-fold cross-validation across four corpora (ASVspoof 2019 LA, ASVspoof 5 (2024), ADD23, and In-the-Wild set). Only kernel was changed - features, SVM, and preprocessing remained identical with no additional trainable parameters.

Result: QSVMs achieved consistently lower equal error rates: 0.183 vs. 0.299 on ASVspoof 5 (2024), 0.081 vs. 0.188 on ADD23, 0.346 vs. 0.399 on ASVspoof 2019, and 0.355 vs. 0.413 In-the-Wild. This corresponds to false-positive rate reductions of 38.8%, 56.9%, 13.3%, and 14.0% respectively at EER operating points.

Conclusion: Quantum kernels provide superior performance for audio deepfake detection compared to classical kernels, reducing false positives without increasing model complexity. The approach works on conventional computers and maintains consistent improvements across diverse datasets and cross-validation folds.

Abstract: Detecting synthetic speech is challenging when labeled data are scarce and recording conditions vary. Existing end-to-end deep models often overfit or fail to generalize, and while kernel methods can remain competitive, their performance heavily depends on the chosen kernel. Here, we show that using a quantum kernel in audio deepfake detection reduces falsepositive rates without increasing model size. Quantum feature maps embed data into high-dimensional Hilbert spaces, enabling the use of expressive similarity measures and compact classifiers. Building on this motivation, we compare quantum-kernel SVMs (QSVMs) with classical SVMs using identical mel-spectrogram preprocessing and stratified 5-fold cross-validation across four corpora (ASVspoof 2019 LA, ASVspoof 5 (2024), ADD23, and an In-the-Wild set). QSVMs achieve consistently lower equalerror rates (EER): 0.183 vs. 0.299 on ASVspoof 5 (2024), 0.081 vs. 0.188 on ADD23, 0.346 vs. 0.399 on ASVspoof 2019, and 0.355 vs. 0.413 In-the-Wild. At the EER operating point (where FPR equals FNR), these correspond to absolute false-positiverate reductions of 0.116 (38.8%), 0.107 (56.9%), 0.053 (13.3%), and 0.058 (14.0%), respectively. We also report how consistent the results are across cross-validation folds and margin-based measures of class separation, using identical settings for both models. The only modification is the kernel; the features and SVM remain unchanged, no additional trainable parameters are introduced, and the quantum kernel is computed on a conventional computer.

[482] Machine Unlearning in Speech Emotion Recognition via Forget Set Alone

Zhao Ren, Rathi Adarshi Rammohan, Kevin Scheck, Tanja Schultz

Main category: cs.SD

TL;DR: Proposes adversarial-attack-based machine unlearning for speech emotion recognition using only data to be forgotten, addressing privacy concerns without needing full dataset access.

Details

Motivation: Speech emotion recognition uses sensitive speech data, requiring privacy protection when users want their data deleted. Current unlearning methods need access to remaining data, which is problematic when data redistribution is restricted and computationally expensive for big data.

Method: Adversarial-attack-based approach that fine-tunes a pre-trained speech emotion recognition model using only the data to be forgotten, without requiring access to the full dataset.

Result: The approach effectively removes knowledge of forgotten data from the model while preserving high performance on emotion recognition test sets.

Conclusion: The proposed method enables efficient privacy-preserving machine unlearning for speech emotion recognition with minimal data requirements and computational resources.

Abstract: Speech emotion recognition aims to identify emotional states from speech signals and has been widely applied in human-computer interaction, education, healthcare, and many other fields. However, since speech data contain rich sensitive information, partial data can be required to be deleted by speakers due to privacy concerns. Current machine unlearning approaches largely depend on data beyond the samples to be forgotten. However, this reliance poses challenges when data redistribution is restricted and demands substantial computational resources in the context of big data. We propose a novel adversarial-attack-based approach that fine-tunes a pre-trained speech emotion recognition model using only the data to be forgotten. The experimental results demonstrate that the proposed approach can effectively remove the knowledge of the data to be forgotten from the model, while preserving high model performance on the test set for emotion recognition.

[483] Speaker Recognition – Wavelet Packet Based Multiresolution Feature Extraction Approach

Saurabh Bhardwaj, Smriti Srivastava, Abhishek Bhandari, Krit Gupta, Hitesh Bahl, J. R. P. Gupta

Main category: cs.SD

TL;DR: Proposes wavelet packet transform combined with MFCC for speaker recognition, showing improved performance and noise robustness on speech databases.

Details

Motivation: To improve text-independent speaker recognition by combining the human ear simulation of MFCC with the multi-resolution and noise robustness properties of wavelet packet transform.

Method: Hybrid feature extraction using MFCC and Wavelet Packet Transform (WPT), with GMM for speaker identification and HMM for speaker verification, tested on Voxforge and CSTR US KED Timit databases with added noise at different SNR levels.

Result: Experimental results show better performance for both speaker identification and verification tasks, demonstrating improved noise robustness compared to baseline methods.

Conclusion: The proposed wavelet packet-based hybrid feature extraction approach effectively improves speaker recognition performance and noise robustness for text-independent speaker identification and verification.

Abstract: This paper proposes a novel Wavelet Packet based feature extraction approach for the task of text independent speaker recognition. The features are extracted by using the combination of Mel Frequency Cepstral Coefficient (MFCC) and Wavelet Packet Transform (WPT).Hybrid Features technique uses the advantage of human ear simulation offered by MFCC combining it with multi-resolution property and noise robustness of WPT. To check the validity of the proposed approach for the text independent speaker identification and verification we have used the Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM) respectively as the classifiers. The proposed paradigm is tested on voxforge speech corpus and CSTR US KED Timit database. The paradigm is also evaluated after adding standard noise signal at different level of SNRs for evaluating the noise robustness. Experimental results show that better results are achieved for the tasks of both speaker identification as well as speaker verification.

[484] DeepGESI: A Non-Intrusive Objective Evaluation Model for Predicting Speech Intelligibility in Hearing-Impaired Listeners

Wenyu Luo, Jinhui Chen

Main category: cs.SD

TL;DR: DeepGESI is a non-intrusive deep learning model that predicts speech intelligibility for hearing-impaired listeners without needing clean reference speech, achieving strong correlation with actual GESI scores and faster prediction speed.

Details

Motivation: Existing objective speech intelligibility metrics are intrusive (require clean reference speech) and limited in accuracy for hearing-impaired listeners. GESI works for hearing-impaired but is also intrusive, limiting real-world applicability.

Method: Proposes DeepGESI, a deep learning-based model that predicts speech intelligibility for hearing-impaired listeners without requiring clean reference speech. Uses the 2nd Clarity Prediction Challenge (CPC2) dataset for evaluation.

Result: DeepGESI achieves strong correlation with actual GESI scores and substantially faster prediction speed compared to conventional methods under CPC2 test conditions.

Conclusion: DeepGESI provides an accurate, efficient, and non-intrusive solution for predicting speech intelligibility for hearing-impaired listeners, overcoming limitations of existing intrusive methods.

Abstract: Speech intelligibility assessment is essential for many speech-related applications. However, most objective intelligibility metrics are intrusive, as they require clean reference speech in addition to the degraded or processed signal for evaluation. Furthermore, existing metrics such as STOI are primarily designed for normal hearing listeners, and their predictive accuracy for hearing impaired speech intelligibility remains limited. On the other hand, the GESI (Gammachirp Envelope Similarity Index) can be used to estimate intelligibility for hearing-impaired listeners, but it is also intrusive, as it depends on reference signals. This requirement limits its applicability in real-world scenarios. To overcome this limitation, this study proposes DeepGESI, a non-intrusive deep learning-based model capable of accurately and efficiently predicting the speech intelligibility of hearing-impaired listeners without requiring any clean reference speech. Experimental results demonstrate that, under the test conditions of the 2nd Clarity Prediction Challenge(CPC2) dataset, the GESI scores predicted by DeepGESI exhibit a strong correlation with the actual GESI scores. In addition, the proposed model achieves a substantially faster prediction speed compared to conventional methods.

[485] Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang, Po-Yao Huang, Luya Gao, Julius Richter, Sanyuan Chen, Matt Le, Piotr Dollár, Christoph Feichtenhofer, Ann Lee, Wei-Ning Hsu

Main category: cs.SD

TL;DR: PE-AV is a new family of multimodal encoders for audio and video understanding trained with scaled contrastive learning, enabling unified cross-modal embeddings across audio-video, audio-text, and video-text modalities.

Details

Motivation: To create unified cross-modal representations that can handle multiple modalities (audio, video, text) simultaneously, overcoming limitations of prior work that focused on single domains and lacked consistent large-scale supervision across modalities.

Method: Built on PE foundation, uses scaled contrastive learning with ten pairwise contrastive objectives. Developed a strong audiovisual data engine to synthesize high-quality captions for O(100M) audio-video pairs. Includes diverse audio data (speech, music, sound effects). Also developed PE-A-Frame variant with frame-level contrastive objectives for fine-grained alignment.

Result: Sets new state-of-the-art across standard audio and video benchmarks. Enables novel tasks like speech retrieval. Shows that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance.

Conclusion: PE-AV provides a unified framework for multimodal understanding with strong performance across audio and video tasks, demonstrating the effectiveness of large-scale contrastive learning with diverse data and multiple alignment objectives.

Abstract: We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV’s unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects-avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.

[486] SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models

Han Yin, Yafeng Chen, Chong Deng, Luyao Cheng, Hui Wang, Chao-Hong Tan, Qian Chen, Wen Wang, Xiangang Li

Main category: cs.SD

TL;DR: SpeakerLM is a unified multimodal LLM that jointly performs speaker diarization and speech recognition end-to-end, addressing limitations of cascaded systems through flexible speaker registration and multi-stage training.

Details

Motivation: Existing cascaded SDR systems suffer from error propagation, difficulty handling overlapping speech, and lack of joint optimization between speaker diarization and speech recognition tasks.

Method: SpeakerLM is a unified multimodal large language model with flexible speaker registration mechanism, developed using multi-stage training strategy on large-scale real data for end-to-end SDR.

Result: Outperforms state-of-the-art cascaded baselines on both in-domain and out-of-domain public SDR benchmarks, demonstrates strong data scaling capability and generalizability.

Conclusion: SpeakerLM effectively addresses cascaded system limitations through unified end-to-end modeling, with speaker registration mechanism ensuring robust performance across diverse conditions and speaker numbers.

Abstract: The Speaker Diarization and Recognition (SDR) task aims to predict “who spoke when and what” within an audio clip, which is a crucial task in various real-world multi-speaker scenarios such as meeting transcription and dialogue systems. Existing SDR systems typically adopt a cascaded framework, combining multiple modules such as speaker diarization (SD) and automatic speech recognition (ASR). The cascaded systems suffer from several limitations, such as error propagation, difficulty in handling overlapping speech, and lack of joint optimization for exploring the synergy between SD and ASR tasks. To address these limitations, we introduce SpeakerLM, a unified multimodal large language model for SDR that jointly performs SD and ASR in an end-to-end manner. Moreover, to facilitate diverse real-world scenarios, we incorporate a flexible speaker registration mechanism into SpeakerLM, enabling SDR under different speaker registration settings. SpeakerLM is progressively developed with a multi-stage training strategy on large-scale real data. Extensive experiments show that SpeakerLM demonstrates strong data scaling capability and generalizability, outperforming state-of-the-art cascaded baselines on both in-domain and out-of-domain public SDR benchmarks. Furthermore, experimental results show that the proposed speaker registration mechanism effectively ensures robust SDR performance of SpeakerLM across diverse speaker registration conditions and varying numbers of registered speakers.

cs.LG

[487] Comparative Evaluation of Explainable Machine Learning Versus Linear Regression for Predicting County-Level Lung Cancer Mortality Rate in the United States

Soheil Hashtarkhani, Brianna M. White, Benyamin Hoseini, David L. Schwartz, Arash Shaban-Nejad

Main category: cs.LG

TL;DR: This study compares three machine learning models (random forest, gradient boosting, linear regression) for predicting county-level lung cancer mortality in the US, with random forest performing best and identifying smoking rate as the most important predictor.

Details

Motivation: Lung cancer is a leading cause of cancer mortality in the US, and accurate prediction of mortality rates is crucial for guiding targeted interventions and addressing health disparities. Traditional regression models have limitations, and explainable machine learning may offer better predictive accuracy and deeper insights.

Method: Applied three models: random forest (RF), gradient boosting regression (GBR), and linear regression (LR) to predict county-level lung cancer mortality rates. Used R-squared and RMSE for evaluation, SHAP values for variable importance and directional impact analysis, and Getis-Ord hotspot analysis for geographic disparities.

Result: RF model outperformed GBR and LR with R² of 41.9% and RMSE of 12.8. SHAP identified smoking rate as most important predictor, followed by median home value and percentage of Hispanic population. Spatial analysis revealed significant clusters of elevated mortality in mid-eastern US counties.

Conclusion: RF demonstrated superior predictive performance for lung cancer mortality, highlighting critical roles of smoking prevalence, housing values, and Hispanic population percentage. Findings provide actionable insights for targeted interventions, screening promotion, and addressing health disparities in high-risk regions.

Abstract: Lung cancer (LC) is a leading cause of cancer-related mortality in the United States. Accurate prediction of LC mortality rates is crucial for guiding targeted interventions and addressing health disparities. Although traditional regression-based models have been commonly used, explainable machine learning models may offer enhanced predictive accuracy and deeper insights into the factors influencing LC mortality. This study applied three models: random forest (RF), gradient boosting regression (GBR), and linear regression (LR) to predict county-level LC mortality rates across the United States. Model performance was evaluated using R-squared and root mean squared error (RMSE). Shapley Additive Explanations (SHAP) values were used to determine variable importance and their directional impact. Geographic disparities in LC mortality were analyzed through Getis-Ord (Gi*) hotspot analysis. The RF model outperformed both GBR and LR, achieving an R2 value of 41.9% and an RMSE of 12.8. SHAP analysis identified smoking rate as the most important predictor, followed by median home value and the percentage of the Hispanic ethnic population. Spatial analysis revealed significant clusters of elevated LC mortality in the mid-eastern counties of the United States. The RF model demonstrated superior predictive performance for LC mortality rates, emphasizing the critical roles of smoking prevalence, housing values, and the percentage of Hispanic ethnic population. These findings offer valuable actionable insights for designing targeted interventions, promoting screening, and addressing health disparities in regions most affected by LC in the United States.

[488] What’s the Price of Monotonicity? A Multi-Dataset Benchmark of Monotone-Constrained Gradient Boosting for Credit PD

Petr Koklev

Main category: cs.LG

TL;DR: Monotonicity constraints in credit risk ML models have minimal performance cost (0-2.9% AUC loss), especially on large datasets where constraints are nearly free.

Details

Motivation: Financial institutions need to balance predictive accuracy and interpretability in credit risk models. While monotonicity constraints align models with domain knowledge, their performance cost (the "price of monotonicity") hasn't been well quantified.

Method: Benchmarked monotone-constrained vs unconstrained gradient boosting models across five public datasets and three libraries. Defined Price of Monotonicity (PoM) as relative change in performance metrics when moving from unconstrained to constrained models, estimated via paired comparisons with bootstrap uncertainty.

Result: PoM in AUC ranges from essentially zero to about 2.9%. Constraints are almost costless on large datasets (typically <0.2%, often indistinguishable from zero) and most costly on smaller datasets with extensive constraint coverage (around 2-3%).

Conclusion: Appropriately specified monotonicity constraints can deliver interpretability with small accuracy losses, particularly in large-scale credit portfolios, making them a practical choice for financial institutions.

Abstract: Financial institutions face a trade-off between predictive accuracy and interpretability when deploying machine learning models for credit risk. Monotonicity constraints align model behavior with domain knowledge, but their performance cost - the price of monotonicity - is not well quantified. This paper benchmarks monotone-constrained versus unconstrained gradient boosting models for credit probability of default across five public datasets and three libraries. We define the Price of Monotonicity (PoM) as the relative change in standard performance metrics when moving from unconstrained to constrained models, estimated via paired comparisons with bootstrap uncertainty. In our experiments, PoM in AUC ranges from essentially zero to about 2.9 percent: constraints are almost costless on large datasets (typically less than 0.2 percent, often indistinguishable from zero) and most costly on smaller datasets with extensive constraint coverage (around 2-3 percent). Thus, appropriately specified monotonicity constraints can often deliver interpretability with small accuracy losses, particularly in large-scale credit portfolios.

[489] Convolutional-neural-operator-based transfer learning for solving PDEs

Peng Fan, Guofei Pang

Main category: cs.LG

TL;DR: The paper extends convolutional neural operators to few-shot learning by pre-training on source data and adapting to target data using parameter adjustment strategies, with neuron linear transformation showing best performance.

Details

Motivation: Convolutional neural operators have shown strong performance for PDE solution operators but haven't been validated for few-shot learning scenarios, which is important when target datasets are small.

Method: Two-stage approach: 1) Pre-train convolutional neural operator on source dataset, 2) Adjust parameters using small target dataset via three strategies: fine-tuning, low-rank adaptation, and neuron linear transformation.

Result: Neuron linear transformation strategy achieves highest surrogate accuracy across multiple PDE systems including Kuramoto-Sivashinsky equation, Brusselator diffusion-reaction system, and Navier-Stokes equations.

Conclusion: Convolutional neural operators can be effectively adapted to few-shot learning scenarios, with neuron linear transformation emerging as the most effective parameter adjustment strategy for maintaining high accuracy with limited target data.

Abstract: Convolutional neural operator is a CNN-based architecture recently proposed to enforce structure-preserving continuous-discrete equivalence and enable the genuine, alias-free learning of solution operators of PDEs. This neural operator was demonstrated to outperform for certain cases some baseline models such as DeepONet, Fourier neural operator, and Galerkin transformer in terms of surrogate accuracy. The convolutional neural operator, however, seems not to be validated for few-shot learning. We extend the model to few-shot learning scenarios by first pre-training a convolutional neural operator using a source dataset and then adjusting the parameters of the trained neural operator using only a small target dataset. We investigate three strategies for adjusting the parameters of a trained neural operator, including fine-tuning, low-rank adaption, and neuron linear transformation, and find that the neuron linear transformation strategy enjoys the highest surrogate accuracy in solving PDEs such as Kuramoto-Sivashinsky equation, Brusselator diffusion-reaction system, and Navier-Stokes equations.

[490] CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs

Gunho Park, Jeongin Bae, Byeongwook Kim, Baeseong park, Jiwon Ryu, Hoseung Kim, Se Jung Kwon, Dongsoo Lee

Main category: cs.LG

TL;DR: CodeGEMM: A codebook-centric GEMM kernel that eliminates dequantization overhead in weight-only quantization for LLMs by using precomputed inner products between centroids and activations.

Details

Motivation: Current codebook-based quantization methods for LLMs rely on dequantization which repeatedly fetches centroids and reconstructs weights, causing substantial latency and cache pressure during inference.

Method: CodeGEMM replaces dequantization with a lightweight Psumbook storing precomputed inner products between centroids and activations. At inference, code indices directly gather these partial sums, eliminating per-element lookups and reducing on-chip footprint.

Result: On Llama-3 models, CodeGEMM delivers 1.83x (8B) and 8.93x (70B) speedups in 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy, with improved computing efficiency and memory subsystem utilization.

Conclusion: CodeGEMM enables systematic exploration of latency-memory-accuracy trade-offs under a unified implementation while significantly accelerating codebook-based quantization for LLM inference.

Abstract: Weight-only quantization is widely used to mitigate the memory-bound nature of LLM inference. Codebook-based methods extend this trend by achieving strong accuracy in the extremely low-bit regime (e.g., 2-bit). However, current kernels rely on dequantization, which repeatedly fetches centroids and reconstructs weights, incurring substantial latency and cache pressure. We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products between centroids and activations stored in a lightweight Psumbook. At inference, code indices directly gather these partial sums, eliminating per-element lookups and reducing the on-chip footprint. The kernel supports the systematic exploration of latency-memory-accuracy trade-offs under a unified implementation. On Llama-3 models, CodeGEMM delivers 1.83x (8B) and 8.93x (70B) speedups in the 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy and further improves computing efficiency and memory subsystem utilization.

[491] Parameter-Efficient Fine-Tuning for HAR: Integrating LoRA and QLoRA into Transformer Models

Irina Seregina, Philippe Lalanda, German Vega

Main category: cs.LG

TL;DR: Parameter-efficient fine-tuning (LoRA and QLoRA) matches full fine-tuning performance for Human Activity Recognition while reducing computational costs.

Details

Motivation: Adapting large pretrained models to new HAR domains is challenging due to limited computational resources on target devices, requiring more efficient fine-tuning methods.

Method: Proposes an adaptation framework using Masked Autoencoder backbone with LoRA and Quantized LoRA techniques, evaluated under Leave-One-Dataset-Out validation across five HAR datasets.

Result: LoRA and QLoRA match full fine-tuning recognition performance while significantly reducing trainable parameters, memory usage, and training time. LoRA maintains robust performance with limited supervision, and adapter rank provides accuracy-efficiency trade-off.

Conclusion: Parameter-efficient fine-tuning techniques like LoRA and QLoRA offer scalable alternatives to full model fine-tuning for HAR, enabling efficient adaptation of large pretrained models to new domains with minimal performance loss.

Abstract: Human Activity Recognition is a foundational task in pervasive computing. While recent advances in self-supervised learning and transformer-based architectures have significantly improved HAR performance, adapting large pretrained models to new domains remains a practical challenge due to limited computational resources on target devices. This papers investigates parameter-efficient fine-tuning techniques, specifically Low-Rank Adaptation (LoRA) and Quantized LoRA, as scalable alternatives to full model fine-tuning for HAR. We propose an adaptation framework built upon a Masked Autoencoder backbone and evaluate its performance under a Leave-One-Dataset-Out validation protocol across five open HAR datasets. Our experiments demonstrate that both LoRA and QLoRA can match the recognition performance of full fine-tuning while significantly reducing the number of trainable parameters, memory usage, and training time. Further analyses reveal that LoRA maintains robust performance even under limited supervision and that the adapter rank provides a controllable trade-off between accuracy and efficiency. QLoRA extends these benefits by reducing the memory footprint of frozen weights through quantization, with minimal impact on classification quality.

[492] A Hybrid Inductive-Transductive Network for Traffic Flow Imputation on Unsampled Locations

Mohammadmahdi Rahimiasl, Ynte Vanderhoydonc, Siegfried Mercelis

Main category: cs.LG

TL;DR: HINT is a hybrid inductive-transductive network for traffic flow imputation that combines inductive flow learning with transductive speed signals, using spatial transformers, diffusion GCNs, and calibration layers to handle sparse flow data and heterophily in traffic networks.

Details

Motivation: Traffic flow imputation is challenging due to sparse loop detector measurements, weakly correlated probe vehicle speed data, and heterophily in traffic networks (e.g., ramps vs. mainline) that breaks standard GNN assumptions.

Method: HINT uses a hybrid inductive-transductive approach: (1) inductive spatial transformer for long-range interactions, (2) diffusion GCN conditioned on static context (OSM attributes, traffic simulation), (3) node-wise calibration layer for scale biases. Training includes masked reconstruction, epoch-wise node sampling, hard-node mining, and noise injection.

Result: HINT consistently outperforms state-of-the-art inductive baselines across three datasets: reduces MAE by ≈42-50% on MOW, ≈22% on Torino, and ≈12% on Essen. Even without simulation, HINT remains superior on MOW and Torino.

Conclusion: Combining inductive flow imputation with transductive speed signals, traffic simulations, and external geospatial data significantly improves accuracy for traffic flow imputation at unsensed locations.

Abstract: Accurately imputing traffic flow at unsensed locations is difficult: loop detectors provide precise but sparse measurements, speed from probe vehicles is widely available yet only weakly correlated with flow, and nearby links often exhibit strong heterophily in the scale of traffic flow (e.g., ramps vs. mainline), which breaks standard GNN assumptions. We propose HINT, a Hybrid INductive-Transductive Network, and an INDU-TRANSDUCTIVE training strategy that treats speed as a transductive, network-wide signal while learning flow inductively to generalize to unseen locations. HINT couples (i) an inductive spatial transformer that learns similarity-driven, long-range interactions from node features with (ii) a diffusion GCN conditioned by FiLM on rich static context (OSM-derived attributes and traffic simulation), and (iii) a node-wise calibration layer that corrects scale biases per segment. Training uses masked reconstruction with epoch-wise node sampling, hard-node mining to emphasize difficult sensors, and noise injection on visible flows to prevent identity mapping, while graph structure is built from driving distances. Across three real-world datasets, MOW (Antwerp, Belgium), UTD19-Torino, and UTD19-Essen, HINT consistently surpasses state-of-the-art inductive baselines. Relative to KITS, HINT reduces MAE on MOW by $\approx42$% with basic simulation and $\approx50$% with calibrated simulation; on Torino by $\approx22$%, and on Essen by $\approx12$%. Even without simulation, HINT remains superior on MOW and Torino, while simulation is crucial on Essen. These results show that combining inductive flow imputation with transductive speed, traffic simulations and external geospatial improves accuracy for the task described above.

[493] MoE-TransMov: A Transformer-based Model for Next POI Prediction in Familiar & Unfamiliar Movements

Ruichen Tan, Jiawei Xue, Kota Tsubouchi, Takahiro Yabe, Satish V. Ukkusuri

Main category: cs.LG

TL;DR: MoE-TransMov: A Transformer-based Mixture-of-Experts model that improves next POI prediction by distinguishing between user movements in familiar vs unfamiliar areas.

Details

Motivation: Existing POI prediction methods fail to distinguish between user movements in familiar and unfamiliar areas, even though users exhibit different POI choices in these contexts. This limitation reduces prediction accuracy and personalization in location-based services.

Method: Proposes MoE-TransMov, a Transformer-based model with Mixture-of-Experts architecture that classifies movements into familiar/unfamiliar categories and uses specialized expert networks. Integrates self-attention mechanisms and adaptive gating networks to dynamically select relevant experts for different mobility contexts.

Result: Outperforms state-of-the-art baselines on two real-world datasets (Foursquare NYC and large-scale Kyoto dataset) with notable improvements in Top-1, Top-5, Top-10 accuracy, and mean reciprocal rank (MRR).

Conclusion: The approach efficiently improves mobility predictions under different moving contexts, enhancing personalization of recommendation systems and advancing various urban applications by better capturing distinct mobility patterns in familiar vs unfamiliar areas.

Abstract: Accurate prediction of the next point of interest (POI) within human mobility trajectories is essential for location-based services, as it enables more timely and personalized recommendations. In particular, with the rise of these approaches, studies have shown that users exhibit different POI choices in their familiar and unfamiliar areas, highlighting the importance of incorporating user familiarity into predictive models. However, existing methods often fail to distinguish between the movements of users in familiar and unfamiliar regions. To address this, we propose MoE-TransMov, a Transformer-based model with a Transformer model with a Mixture-of-Experts (MoE) architecture designed to use one framework to capture distinct mobility patterns across different moving contexts without requiring separate training for certain data. Using user-check-in data, we classify movements into familiar and unfamiliar categories and develop a specialized expert network to improve prediction accuracy. Our approach integrates self-attention mechanisms and adaptive gating networks to dynamically select the most relevant expert models for different mobility contexts. Experiments on two real-world datasets, including the widely used but small open-source Foursquare NYC dataset and the large-scale Kyoto dataset collected with LY Corporation (Yahoo Japan Corporation), show that MoE-TransMov outperforms state-of-the-art baselines with notable improvements in Top-1, Top-5, Top-10 accuracy, and mean reciprocal rank (MRR). Given the results, we find that by using this approach, we can efficiently improve mobility predictions under different moving contexts, thereby enhancing the personalization of recommendation systems and advancing various urban applications.

[494] FedOAED: Federated On-Device Autoencoder Denoiser for Heterogeneous Data under Limited Client Availability

S M Ruhul Kabir Howlader, Xiao Chen, Yifei Xie, Lu Liu

Main category: cs.LG

TL;DR: FedOAED is a federated learning algorithm that uses on-device autoencoder denoisers to reduce client-drift and variance caused by data heterogeneity and partial client participation.

Details

Motivation: Data privacy regulations (GDPR, HIPAA) prevent data sharing, making federated learning essential. However, FL struggles with data heterogeneity causing gradient noise, client-drift, and variance from partial client participation.

Method: FedOAED incorporates on-device autoencoder denoisers on client side to mitigate client-drift and variance from heterogeneous data under limited client availability.

Result: Experiments on multiple vision datasets under Non-IID settings show FedOAED consistently outperforms state-of-the-art baselines.

Conclusion: FedOAED effectively addresses key challenges in federated learning by reducing client-drift and variance through on-device autoencoder denoising, enabling better performance in privacy-preserving distributed learning scenarios.

Abstract: Over the last few decades, machine learning (ML) and deep learning (DL) solutions have demonstrated their potential across many applications by leveraging large amounts of high-quality data. However, strict data-sharing regulations such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA) have prevented many data-driven applications from being realised. Federated Learning (FL), in which raw data never leaves local devices, has shown promise in overcoming these limitations. Although FL has grown rapidly in recent years, it still struggles with heterogeneity, which produces gradient noise, client-drift, and increased variance from partial client participation. In this paper, we propose FedOAED, a novel federated learning algorithm designed to mitigate client-drift arising from multiple local training updates and the variance induced by partial client participation. FedOAED incorporates an on-device autoencoder denoiser on the client side to mitigate client-drift and variance resulting from heterogeneous data under limited client availability. Experiments on multiple vision datasets under Non-IID settings demonstrate that FedOAED consistently outperforms state-of-the-art baselines.

[495] A Dataset and Benchmarks for Atrial Fibrillation Detection from Electrocardiograms of Intensive Care Unit Patients

Sarah Nassar, Nooshin Maghsoodi, Sophia Mannina, Shamel Addas, Stephanie Sibley, Gabor Fichtinger, David Pichora, David Maslove, Purang Abolmaesumi, Parvin Mousavi

Main category: cs.LG

TL;DR: This study compares three AI approaches for AF detection in ICU patients, finds ECG foundation models perform best, and releases a labelled ICU dataset with benchmarks.

Details

Motivation: Atrial fibrillation is common in ICU patients and causes adverse health effects, but there's a critical gap in literature comparing different AI approaches for accurate AF detection in ICU settings.

Method: Compared three AI approaches: feature-based classifiers, deep learning, and ECG foundation models using ECGs from a Canadian ICU and PhysioNet Challenge data. Tested multiple training configurations from zero-shot inference to transfer learning.

Result: ECG foundation models performed best overall, followed by deep learning, then feature-based classifiers. The top model was ECG-FM with transfer learning achieving F1=0.89 on ICU test set.

Conclusion: AI shows promising potential for automatic patient monitoring systems, and by publishing the labelled ICU dataset and benchmarks, the research community can advance state-of-the-art in AF detection.

Abstract: Objective: Atrial fibrillation (AF) is the most common cardiac arrhythmia experienced by intensive care unit (ICU) patients and can cause adverse health effects. In this study, we publish a labelled ICU dataset and benchmarks for AF detection. Methods: We compared machine learning models across three data-driven artificial intelligence (AI) approaches: feature-based classifiers, deep learning (DL), and ECG foundation models (FMs). This comparison addresses a critical gap in the literature and aims to pinpoint which AI approach is best for accurate AF detection. Electrocardiograms (ECGs) from a Canadian ICU and the 2021 PhysioNet/Computing in Cardiology Challenge were used to conduct the experiments. Multiple training configurations were tested, ranging from zero-shot inference to transfer learning. Results: On average and across both datasets, ECG FMs performed best, followed by DL, then feature-based classifiers. The model that achieved the top F1 score on our ICU test set was ECG-FM through a transfer learning strategy (F1=0.89). Conclusion: This study demonstrates promising potential for using AI to build an automatic patient monitoring system. Significance: By publishing our labelled ICU dataset (LinkToBeAdded) and performance benchmarks, this work enables the research community to continue advancing the state-of-the-art in AF detection in the ICU.

[496] OmniMER: Indonesian Multimodal Emotion Recognition via Auxiliary-Enhanced LLM Adaptation

Xueming Yan, Boyan Xu, Yaochu Jin, Lixian Xiao, Wenlong Ye, Runyang Cai, Zeqi Zheng, Jingfa Liu, Aimin Yang

Main category: cs.LG

TL;DR: Introduces IndoMER, the first multimodal emotion recognition benchmark for Indonesian, and OmniMER, a framework that improves emotion recognition through modality-specific auxiliary tasks.

Details

Motivation: Indonesian is widely spoken but underserved in multimodal emotion recognition research despite its dominance on Southeast Asian social media platforms.

Method: Proposes OmniMER, a multimodal adaptation framework built on Qwen2.5-Omni that enhances emotion recognition through three auxiliary modality-specific perception tasks: emotion keyword extraction (text), facial expression analysis (video), and prosody analysis (audio).

Result: OmniMER achieves 0.582 Macro-F1 on sentiment classification and 0.454 on emotion recognition on IndoMER, outperforming the base model by 7.6 and 22.1 absolute points respectively. Cross-lingual evaluation on CH-SIMS dataset shows generalizability.

Conclusion: The paper successfully addresses the gap in Indonesian multimodal emotion recognition with a new benchmark dataset and an effective framework that leverages modality-specific auxiliary tasks to improve performance in low-resource settings.

Abstract: Indonesian, spoken by over 200 million people, remains underserved in multimodal emotion recognition research despite its dominant presence on Southeast Asian social media platforms. We introduce IndoMER, the first multimodal emotion recognition benchmark for Indonesian, comprising 1,944 video segments from 203 speakers with temporally aligned text, audio, and visual annotations across seven emotion categories. The dataset exhibits realistic challenges including cross-modal inconsistency and long-tailed class distributions shaped by Indonesian cultural communication norms. To address these challenges, we propose OmniMER, a multimodal adaptation framework built upon Qwen2.5-Omni that enhances emotion recognition through three auxiliary modality-specific perception tasks: emotion keyword extraction for text, facial expression analysis for video, and prosody analysis for audio. These auxiliary tasks help the model identify emotion-relevant cues in each modality before fusion, reducing reliance on spurious correlations in low-resource settings. Experiments on IndoMER show that OmniMER achieves 0.582 Macro-F1 on sentiment classification and 0.454 on emotion recognition, outperforming the base model by 7.6 and 22.1 absolute points respectively. Cross-lingual evaluation on the Chinese CH-SIMS dataset further demonstrates the generalizability of the proposed framework. The dataset and code are publicly available. https://github.com/yanxm01/INDOMER

[497] Towards Benchmarking Privacy Vulnerabilities in Selective Forgetting with Large Language Models

Wei Qian, Chenxu Zhao, Yangyi Li, Mengdi Huai

Main category: cs.LG

TL;DR: First comprehensive benchmark for evaluating privacy vulnerabilities in selective forgetting (machine unlearning), systematically assessing privacy leakage across various attacks, methods, and architectures.

Details

Motivation: As AI systems are deployed in critical areas, ensuring privacy and alignment with human values is crucial. Selective forgetting shows promise for privacy and data removal, but raises significant privacy concerns, especially with sensitive data. Existing privacy attack evaluations use different experimental settings, leading to potentially unfair and overly optimistic assessments.

Method: Developed the first comprehensive benchmark for evaluating privacy vulnerabilities in selective forgetting. Extensively investigated privacy vulnerabilities across wide range of victim data, state-of-the-art unlearning privacy attacks, unlearning methods, and model architectures. Systematically evaluated and identified critical factors related to unlearning-induced privacy leakage.

Result: Provided novel insights into privacy vulnerabilities of machine unlearning techniques. Created a standardized tool for practitioners to deploy customized unlearning applications with faithful privacy assessments.

Conclusion: The benchmark addresses the lack of standardized evaluation for privacy vulnerabilities in selective forgetting, enabling fair comparison of attacks and helping practitioners make informed decisions about privacy risks in unlearning applications.

Abstract: The rapid advancements in artificial intelligence (AI) have primarily focused on the process of learning from data to acquire knowledgeable learning systems. As these systems are increasingly deployed in critical areas, ensuring their privacy and alignment with human values is paramount. Recently, selective forgetting (also known as machine unlearning) has shown promise for privacy and data removal tasks, and has emerged as a transformative paradigm shift in the field of AI. It refers to the ability of a model to selectively erase the influence of previously seen data, which is especially important for compliance with modern data protection regulations and for aligning models with human values. Despite its promise, selective forgetting raises significant privacy concerns, especially when the data involved come from sensitive domains. While new unlearning-induced privacy attacks are continuously proposed, each is shown to outperform its predecessors using different experimental settings, which can lead to overly optimistic and potentially unfair assessments that may disproportionately favor one particular attack over the others. In this work, we present the first comprehensive benchmark for evaluating privacy vulnerabilities in selective forgetting. We extensively investigate privacy vulnerabilities of machine unlearning techniques and benchmark privacy leakage across a wide range of victim data, state-of-the-art unlearning privacy attacks, unlearning methods, and model architectures. We systematically evaluate and identify critical factors related to unlearning-induced privacy leakage. With our novel insights, we aim to provide a standardized tool for practitioners seeking to deploy customized unlearning applications with faithful privacy assessments.

[498] Probabilistic Digital Twins of Users: Latent Representation Learning with Statistically Validated Semantics

Daniel David

Main category: cs.LG

TL;DR: A probabilistic digital twin framework using VAEs to model users as latent stochastic states, enabling interpretable, uncertainty-aware representations beyond deterministic embeddings.

Details

Motivation: Existing user modeling approaches rely on deterministic embeddings or black-box models that lack uncertainty quantification and interpretability of what latent representations encode.

Method: Propose probabilistic digital twin framework where each user is modeled as latent stochastic state generating behavioral data, learned via amortized variational inference using VAEs on user-response datasets.

Result: User structure is predominantly continuous rather than discretely clustered, with interpretable latent dimensions corresponding to traits like opinion strength and decisiveness, validated through statistical hypothesis tests.

Conclusion: Probabilistic digital twins provide interpretable, uncertainty-aware user representations that go beyond deterministic embeddings, offering better insight into user identity and behavior.

Abstract: Understanding user identity and behavior is central to applications such as personalization, recommendation, and decision support. Most existing approaches rely on deterministic embeddings or black-box predictive models, offering limited uncertainty quantification and little insight into what latent representations encode. We propose a probabilistic digital twin framework in which each user is modeled as a latent stochastic state that generates observed behavioral data. The digital twin is learned via amortized variational inference, enabling scalable posterior estimation while retaining a fully probabilistic interpretation. We instantiate this framework using a variational autoencoder (VAE) applied to a user-response dataset designed to capture stable aspects of user identity. Beyond standard reconstruction-based evaluation, we introduce a statistically grounded interpretation pipeline that links latent dimensions to observable behavioral patterns. By analyzing users at the extremes of each latent dimension and validating differences using nonparametric hypothesis tests and effect sizes, we demonstrate that specific dimensions correspond to interpretable traits such as opinion strength and decisiveness. Empirically, we find that user structure is predominantly continuous rather than discretely clustered, with weak but meaningful structure emerging along a small number of dominant latent axes. These results suggest that probabilistic digital twins can provide interpretable, uncertainty-aware representations that go beyond deterministic user embeddings.

[499] Microstructure-based Variational Neural Networks for Robust Uncertainty Quantification in Materials Digital Twins

Andreas E. Robertson, Samuel B. Inman, Ashley T. Lenau, Ricardo A. Lebensohn, Dongil Shin, Brad L. Boyce, Remi M. Dingreville

Main category: cs.LG

TL;DR: VDMN is a physics-informed surrogate model that captures microstructure variability for uncertainty-robust materials digital twins through variational distributions and efficient uncertainty propagation.

Details

Motivation: Aleatoric uncertainties in microstructure morphology, constituent behavior, and processing conditions pose major challenges for developing reliable digital twins of materials, requiring methods to handle irreducible variability.

Method: Variational Deep Material Network (VDMN) embeds variational distributions within hierarchical mechanistic architecture and uses analytic propagation scheme based on Taylor-series expansion and automatic differentiation for efficient uncertainty propagation.

Result: Successfully demonstrated in two applications: (1) as uncertainty-aware digital twin predicting nonlinear mechanical variability in additively manufactured polymer composites with experimental validation, (2) as inverse calibration engine disentangling overlapping sources of uncertainty in constituent properties.

Conclusion: VDMN establishes a foundation for uncertainty-robust materials digital twins by enabling efficient probabilistic forward and inverse predictions while capturing microstructure-induced variability.

Abstract: Aleatoric uncertainties - irremovable variability in microstructure morphology, constituent behavior, and processing conditions - pose a major challenge to developing uncertainty-robust digital twins. We introduce the Variational Deep Material Network (VDMN), a physics-informed surrogate model that enables efficient and probabilistic forward and inverse predictions of material behavior. The VDMN captures microstructure-induced variability by embedding variational distributions within its hierarchical, mechanistic architecture. Using an analytic propagation scheme based on Taylor-series expansion and automatic differentiation, the VDMN efficiently propagates uncertainty through the network during training and prediction. We demonstrate its capabilities in two digital-twin-driven applications: (1) as an uncertainty-aware materials digital twin, it predicts and experimentally validates the nonlinear mechanical variability in additively manufactured polymer composites; and (2) as an inverse calibration engine, it disentangles and quantitatively identifies overlapping sources of uncertainty in constituent properties. Together, these results establish the VDMN as a foundation for uncertainty-robust materials digital twins.

[500] Learning Generalizable Neural Operators for Inverse Problems

Adam J. Thorpe, Stepan Tretiakov, Dibakar Roy Sarkar, Krishna Kumar, Ufuk Topcu

Main category: cs.LG

TL;DR: B2B⁻¹ is a neural operator framework that decouples function representation from inverse mapping to handle ill-posed inverse problems by learning basis functions and operating on coefficient spaces.

Details

Motivation: Existing neural operator architectures struggle with ill-posed inverse problems because these violate continuity, uniqueness, and stability assumptions required for standard neural operators.

Method: Learn neural basis functions for input and output spaces separately, then train inverse models that operate on the resulting coefficient space. This allows deterministic, invertible, and probabilistic models within a single framework.

Result: Evaluated on six inverse PDE benchmarks including two novel datasets. Shows consistent re-simulation performance across varying ill-posedness levels, captures uncertainty, remains robust to noise via implicit denoising.

Conclusion: By separating representation from inversion, the framework enables scalable surrogate models for inverse problems that generalize across instances, domains, and degrees of ill-posedness.

Abstract: Inverse problems challenge existing neural operator architectures because ill-posed inverse maps violate continuity, uniqueness, and stability assumptions. We introduce B2B${}^{-1}$, an inverse basis-to-basis neural operator framework that addresses this limitation. Our key innovation is to decouple function representation from the inverse map. We learn neural basis functions for the input and output spaces, then train inverse models that operate on the resulting coefficient space. This structure allows us to learn deterministic, invertible, and probabilistic models within a single framework, and to choose models based on the degree of ill-posedness. We evaluate our approach on six inverse PDE benchmarks, including two novel datasets, and compare against existing invertible neural operator baselines. We learn probabilistic models that capture uncertainty and input variability, and remain robust to measurement noise due to implicit denoising in the coefficient calculation. Our results show consistent re-simulation performance across varying levels of ill-posedness. By separating representation from inversion, our framework enables scalable surrogate models for inverse problems that generalize across instances, domains, and degrees of ill-posedness.

[501] TraCeR: Transformer-Based Competing Risk Analysis with Longitudinal Covariates

Maxmillan Ries, Sohan Seth

Main category: cs.LG

TL;DR: TraCeR is a transformer-based survival analysis framework that handles longitudinal covariates and assesses model calibration, outperforming state-of-the-art methods.

Details

Motivation: Current deep learning survival models struggle with incorporating longitudinal covariates and focus too much on discrimination metrics while neglecting calibration assessment.

Method: Transformer-based architecture with factorized self-attention that estimates hazard functions from sequences of measurements, handling censored data and competing events naturally.

Result: TraCeR achieves substantial and statistically significant performance improvements over state-of-the-art methods on multiple real-world datasets.

Conclusion: TraCeR successfully addresses key limitations in survival analysis by incorporating longitudinal covariates and providing comprehensive evaluation including calibration assessment.

Abstract: Survival analysis is a critical tool for modeling time-to-event data. Recent deep learning-based models have reduced various modeling assumptions including proportional hazard and linearity. However, a persistent challenge remains in incorporating longitudinal covariates, with prior work largely focusing on cross-sectional features, and in assessing calibration of these models, with research primarily focusing on discrimination during evaluation. We introduce TraCeR, a transformer-based survival analysis framework for incorporating longitudinal covariates. Based on a factorized self-attention architecture, TraCeR estimates the hazard function from a sequence of measurements, naturally capturing temporal covariate interactions without assumptions about the underlying data-generating process. The framework is inherently designed to handle censored data and competing events. Experiments on multiple real-world datasets demonstrate that TraCeR achieves substantial and statistically significant performance improvements over state-of-the-art methods. Furthermore, our evaluation extends beyond discrimination metrics and assesses model calibration, addressing a key oversight in literature.

[502] Grad: Guided Relation Diffusion Generation for Graph Augmentation in Graph Fraud Detection

Jie Yang, Rui Zhang, Ziyang Cheng, Dawei Cheng, Guang Yang, Bo Wang

Main category: cs.LG

TL;DR: Grad is a graph fraud detection model that uses relation diffusion and contrastive learning to combat sophisticated fraudster camouflage strategies in financial platforms.

Details

Motivation: Fraudsters are using sophisticated adaptive camouflage strategies to mimic benign user behavior, narrowing the differences between fraudulent and legitimate users and making current graph fraud detection models ineffective.

Method: Proposes Grad with: 1) supervised graph contrastive learning to enhance fraud-benign differences, and 2) guided relation diffusion generator to create auxiliary homophilic relations from scratch, amplifying weak fraudulent signals during aggregation.

Result: Outperforms state-of-the-art methods on two real-world WeChat Pay datasets and three public datasets, achieving up to 11.10% increase in AUC and 43.95% increase in AP.

Conclusion: Grad effectively addresses adaptive camouflage in graph fraud detection by enhancing weak fraudulent signals through relation diffusion and contrastive learning, significantly improving detection performance in real-world financial scenarios.

Abstract: Nowadays, Graph Fraud Detection (GFD) in financial scenarios has become an urgent research topic to protect online payment security. However, as organized crime groups are becoming more professional in real-world scenarios, fraudsters are employing more sophisticated camouflage strategies. Specifically, fraudsters disguise themselves by mimicking the behavioral data collected by platforms, ensuring that their key characteristics are consistent with those of benign users to a high degree, which we call Adaptive Camouflage. Consequently, this narrows the differences in behavioral traits between them and benign users within the platform’s database, thereby making current GFD models lose efficiency. To address this problem, we propose a relation diffusion-based graph augmentation model Grad. In detail, Grad leverages a supervised graph contrastive learning module to enhance the fraud-benign difference and employs a guided relation diffusion generator to generate auxiliary homophilic relations from scratch. Based on these, weak fraudulent signals would be enhanced during the aggregation process, thus being obvious enough to be captured. Extensive experiments have been conducted on two real-world datasets provided by WeChat Pay, one of the largest online payment platforms with billions of users, and three public datasets. The results show that our proposed model Grad outperforms SOTA methods in both various scenarios, achieving at most 11.10% and 43.95% increases in AUC and AP, respectively. Our code is released at https://github.com/AI4Risk/antifraud and https://github.com/Muyiiiii/WWW25-Grad.

[503] Conscious Data Contribution via Community-Driven Chain-of-Thought Distillation

Lena Libon, Meghana Bhange, Rushabh Solanki, Elliot Creager, Ulrich Aïvodji

Main category: cs.LG

TL;DR: The paper argues that chain-of-thought reasoning traces in LLMs should be considered personal data under privacy laws, and proposes a framework for communities to aggregate and distill their shared knowledge into better-aligned models.

Details

Motivation: Current AI development emphasizes large-scale training that raises privacy concerns and limits user autonomy. The paper addresses how users can maintain control over their data when LLMs generate intermediate reasoning traces, and how communities with low utility from existing models can create better-aligned alternatives.

Method: The paper first interprets data privacy and portability laws to establish that chain-of-thought reasoning traces qualify as personal data. Then builds on the Conscious Data Contribution framework to show how communities can aggregate and distill shared knowledge into alternate models. Empirical verification examines effects of community diversity, reasoning granularity, and community size on distillation performance.

Result: The paper demonstrates that intermediate chain-of-thought computations in LLMs should be treated as personal data under privacy regulations. It shows communities can successfully distill their shared knowledge into models better aligned with their specific goals through empirical verification of the proposed framework.

Conclusion: The research establishes a legal and technical framework for data portability and user autonomy in LLMs that use chain-of-thought reasoning, enabling communities to create better-aligned models while respecting privacy rights over intermediate computational artifacts.

Abstract: The current era of AI development places a heavy emphasis on training large models on increasingly scaled-up datasets. This paradigm has catalyzed entirely new product categories, such as LLM chatbots, while also raising concerns about data privacy and consumer choice. In this paper, we consider questions of data portability and user autonomy in the context of LLMs that “reason” using chain-of-thought (CoT) traces, computing intermediate text artifacts from user input before producing a final output. We first interpret recent data privacy and portability law to argue that these intermediate computations qualify as users’ personal data. Then, building on the existing framework of Conscious Data Contribution, we show how communities who receive low utility from an available model can aggregate and distill their shared knowledge into an alternate model better aligned with their goals. We verify this approach empirically and investigate the effects of community diversity, reasoning granularity, and community size on distillation performance.

[504] FairExpand: Individual Fairness on Graphs with Partial Similarity Information

Rebecca Salganik, Yibin Wang, Guillaume Salha-Galvan, Jian Kang

Main category: cs.LG

TL;DR: FairExpand is a framework for promoting individual fairness in graph representation learning when similarity information is only available for a limited subset of node pairs, unlike existing methods that require full similarity information.

Details

Motivation: Individual fairness is crucial in graph-based applications like user modeling and recommender systems, but existing methods require predefined similarity information for all node pairs, which is unrealistic in practice. The paper addresses the more realistic scenario where similarity information is only partially available.

Method: FairExpand uses a two-step pipeline that alternates between refining node representations with a backbone model (e.g., GNN) and gradually propagating similarity information, allowing fairness enforcement to expand to the entire graph.

Result: Extensive experiments show that FairExpand consistently enhances individual fairness while preserving performance, making it a practical solution for real-world applications with partial similarity information.

Conclusion: FairExpand provides a flexible framework for enabling individual fairness in graph representation learning under the realistic constraint of partial similarity information, addressing a key limitation of existing methods.

Abstract: Individual fairness, which requires that similar individuals should be treated similarly by algorithmic systems, has become a central principle in fair machine learning. Individual fairness has garnered traction in graph representation learning due to its practical importance in high-stakes Web areas such as user modeling, recommender systems, and search. However, existing methods assume the existence of predefined similarity information over all node pairs, an often unrealistic requirement that prevents their operationalization in practice. In this paper, we assume the similarity information is only available for a limited subset of node pairs and introduce FairExpand, a flexible framework that promotes individual fairness in this more realistic partial information scenario. FairExpand follows a two-step pipeline that alternates between refining node representations using a backbone model (e.g., a graph neural network) and gradually propagating similarity information, which allows fairness enforcement to effectively expand to the entire graph. Extensive experiments show that FairExpand consistently enhances individual fairness while preserving performance, making it a practical solution for enabling graph-based individual fairness in real-world applications with partial similarity information.

[505] When Does Learning Renormalize? Sufficient Conditions for Power Law Spectral Dynamics

Yizhou Zhang

Main category: cs.LG

TL;DR: The paper identifies sufficient conditions for renormalizable coarse-grained dynamics in deep learning systems, showing that power-law scaling emerges from rigidity when log-shift invariance combines with gradient flow time-rescaling covariance.

Details

Motivation: Empirical power-law scaling is widely observed in deep learning but its theoretical origins and scope of validity remain incompletely understood. The Generalized Resolution-Shell Dynamics (GRSD) framework models learning as spectral energy transport, but power-law behavior is not automatic and requires additional structural properties.

Method: The authors identify a set of sufficient conditions under which GRSD shell dynamics admits a renormalizable coarse-grained description. These conditions include: boundedness of gradient propagation, weak functional incoherence at initialization, controlled Jacobian evolution along training, and log-shift invariance of renormalized shell couplings.

Result: The paper shows that power-law scaling does not follow from renormalizability alone, but instead arises as a rigidity consequence: when log-shift invariance is combined with the intrinsic time-rescaling covariance of gradient flow, the renormalized GRSD velocity field is forced into a power-law form.

Conclusion: The work provides theoretical conditions for when power-law scaling emerges in deep learning systems, explaining that it results from the combination of renormalizability, log-shift invariance, and gradient flow time-rescaling covariance, rather than being an automatic consequence of renormalization alone.

Abstract: Empirical power–law scaling has been widely observed across modern deep learning systems, yet its theoretical origins and scope of validity remain incompletely understood. The Generalized Resolution–Shell Dynamics (GRSD) framework models learning as spectral energy transport across logarithmic resolution shells, providing a coarse–grained dynamical description of training. Within GRSD, power–law scaling corresponds to a particularly simple renormalized shell dynamics; however, such behavior is not automatic and requires additional structural properties of the learning process. In this work, we identify a set of sufficient conditions under which the GRSD shell dynamics admits a renormalizable coarse–grained description. These conditions constrain the learning configuration at multiple levels, including boundedness of gradient propagation in the computation graph, weak functional incoherence at initialization, controlled Jacobian evolution along training, and log–shift invariance of renormalized shell couplings. We further show that power–law scaling does not follow from renormalizability alone, but instead arises as a rigidity consequence: once log–shift invariance is combined with the intrinsic time–rescaling covariance of gradient flow, the renormalized GRSD velocity field is forced into a power–law form.

[506] Stable and Efficient Single-Rollout RL for Multimodal Reasoning

Rui Liu, Dian Yu, Lei Ke, Haolin Liu, Yujun Zhou, Zhenwen Liang, Haitao Mi, Pratap Tokekar, Dong Yu

Main category: cs.LG

TL;DR: MSSR is a group-free RLVR framework for multimodal reasoning that achieves stable training with single-rollout efficiency via entropy-based advantage shaping, outperforming group-based methods in both compute efficiency and generalization.

Details

Motivation: Current RLVR methods for multimodal reasoning face a trade-off: group-based approaches (like GRPO) require multiple rollouts per prompt (inefficient), while single-rollout variants suffer from severe training instability and collapse in multimodal contexts.

Method: MSSR (Multimodal Stabilized Single-Rollout) uses an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes to prevent training collapse while maintaining single-rollout efficiency. This regularization is essential for stability in multimodal single-rollout settings.

Result: MSSR achieves superior training compute efficiency (half the steps for similar accuracy), outperforms group-based baselines when trained for same steps, and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks.

Conclusion: MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks by solving the efficiency-stability trade-off through adaptive advantage regularization in single-rollout settings.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this training efficiency-stability trade-off, we introduce $\textbf{MSSR}$ (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps. When trained for the same number of steps, MSSR’s performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.

[507] Offline Behavioral Data Selection

Shiye Lei, Zhihao Cheng, Dacheng Tao

Main category: cs.LG

TL;DR: Offline behavioral datasets show data saturation - policy performance plateaus with small data fractions, revealing weak alignment between test loss and actual performance. Proposed Stepwise Dual Ranking (SDR) extracts compact, informative subsets using early-stage prioritization and dual ranking of action-value and state-density.

Details

Motivation: Large-scale offline behavioral datasets lead to computationally intensive training, but policy performance saturates quickly with small data fractions. This reveals weak alignment between test loss and actual policy performance, indicating substantial room for improvement through intelligent data selection.

Method: Stepwise Dual Ranking (SDR) - a simple yet effective method that extracts compact subsets from large datasets using two principles: (1) stepwise clip prioritizing early-stage data, and (2) dual ranking selecting samples with both high action-value rank and low state-density rank.

Result: Extensive experiments and ablation studies on D4RL benchmarks demonstrate that SDR significantly enhances data selection for offline behavioral data, improving efficiency while maintaining or improving policy performance.

Conclusion: Data saturation in offline behavioral datasets presents an opportunity for efficient training through intelligent data selection. SDR provides an effective solution by prioritizing informative early-stage samples and balancing action-value with state-density considerations.

Abstract: Behavioral cloning is a widely adopted approach for offline policy learning from expert demonstrations. However, the large scale of offline behavioral datasets often results in computationally intensive training when used in downstream tasks. In this paper, we uncover the striking data saturation in offline behavioral data: policy performance rapidly saturates when trained on a small fraction of the dataset. We attribute this effect to the weak alignment between policy performance and test loss, revealing substantial room for improvement through data selection. To this end, we propose a simple yet effective method, Stepwise Dual Ranking (SDR), which extracts a compact yet informative subset from large-scale offline behavioral datasets. SDR is build on two key principles: (1) stepwise clip, which prioritizes early-stage data; and (2) dual ranking, which selects samples with both high action-value rank and low state-density rank. Extensive experiments and ablation studies on D4RL benchmarks demonstrate that SDR significantly enhances data selection for offline behavioral data.

[508] On the Convergence Rate of LoRA Gradient Descent

Siqiao Mu, Diego Klabjan

Main category: cs.LG

TL;DR: First non-asymptotic convergence analysis of original LoRA gradient descent without Lipschitz smoothness assumptions, proving O(1/log T) convergence rate to stationary point.

Details

Motivation: LoRA is widely used for fine-tuning large models due to efficiency, but its convergence is poorly understood. Existing analyses either consider asymptotic behavior or impose unrealistic boundedness assumptions that artificially enforce Lipschitz smoothness, which doesn't reflect actual LoRA practice.

Method: Three key technical steps: 1) Reformulate problem in terms of outer product of stacked adapter matrices, 2) Develop modified descent lemma for “Lipschitz-like” reparametrized function, 3) Control step size appropriately. This enables analysis of original LoRA gradient descent without Lipschitz assumptions.

Result: Proves that LoRA gradient descent converges to a stationary point at rate O(1/log T), where T is number of iterations. This is the first non-asymptotic convergence guarantee for the original LoRA algorithm without artificial smoothness assumptions.

Conclusion: Provides rigorous theoretical foundation for LoRA convergence, addressing a significant gap in understanding why this popular fine-tuning method works. The analysis framework could potentially extend to other low-rank adaptation methods.

Abstract: The low-rank adaptation (LoRA) algorithm for fine-tuning large models has grown popular in recent years due to its remarkable performance and low computational requirements. LoRA trains two adapter" matrices that form a low-rank representation of the model parameters, thereby massively reducing the number of parameters that need to be updated at every step. Although LoRA is simple, its convergence is poorly understood due to the lack of Lipschitz smoothness, a key condition for classic convergence analyses. As a result, current theoretical results only consider asymptotic behavior or assume strong boundedness conditions which artificially enforce Lipschitz smoothness. In this work, we provide for the first time a non-asymptotic convergence analysis of the \textit{original LoRA gradient descent} algorithm, which reflects widespread practice, without such assumptions. Our work relies on three key steps: i) reformulating the problem in terms of the outer product of the stacked adapter matrices, ii) a modified descent lemma for the Lipschitz-like" reparametrized function, and iii) controlling the step size. With this approach, we prove that LoRA gradient descent converges to a stationary point at rate $O(\frac{1}{\log T})$, where $T$ is the number of iterations.

[509] LeJOT: An Intelligent Job Cost Orchestration Solution for Databricks Platform

Lizhi Ma, Yi-Xiang Hu, Yuke Wang, Yifang Zhao, Yihui Ren, Jian-Xiang Liao, Feng Wu, Xiang-Yang Li

Main category: cs.LG

TL;DR: LeJOT is an ML-powered job cost orchestration framework for Databricks that predicts execution times and optimizes resource allocation to reduce cloud computing costs by 20% while maintaining performance.

Details

Motivation: Managing escalating operational costs in Databricks platforms is challenging due to dynamic workloads. Existing static or reactive solutions fail to adapt to workload variations, leading to inefficient resource allocation and high costs.

Method: LeJOT combines machine learning for execution time prediction with a solver-based optimization model for real-time resource allocation. It proactively predicts workload demands and dynamically allocates computing resources to minimize costs while meeting performance requirements.

Result: Experimental results on real-world Databricks workloads show LeJOT achieves an average 20% reduction in cloud computing costs within minute-level scheduling timeframes, outperforming traditional static allocation strategies.

Conclusion: LeJOT provides a scalable and adaptive solution for cost-efficient job scheduling in Data Lakehouse environments, addressing the limitations of conventional static or reactive approaches.

Abstract: With the rapid advancements in big data technologies, the Databricks platform has become a cornerstone for enterprises and research institutions, offering high computational efficiency and a robust ecosystem. However, managing the escalating operational costs associated with job execution remains a critical challenge. Existing solutions rely on static configurations or reactive adjustments, which fail to adapt to the dynamic nature of workloads. To address this, we introduce LeJOT, an intelligent job cost orchestration framework that leverages machine learning for execution time prediction and a solver-based optimization model for real-time resource allocation. Unlike conventional scheduling techniques, LeJOT proactively predicts workload demands, dynamically allocates computing resources, and minimizes costs while ensuring performance requirements are met. Experimental results on real-world Databricks workloads demonstrate that LeJOT achieves an average 20% reduction in cloud computing costs within a minute-level scheduling timeframe, outperforming traditional static allocation strategies. Our approach provides a scalable and adaptive solution for cost-efficient job scheduling in Data Lakehouse environments.

[510] FedSUM Family: Efficient Federated Learning Methods under Arbitrary Client Participation

Runze You, Shi Pu

Main category: cs.LG

TL;DR: FedSUM family of FL algorithms supports arbitrary client participation patterns without data heterogeneity assumptions, using delay metrics τ_max and τ_avg for unified convergence guarantees.

Details

Motivation: Current FL methods are often designed for specific client participation patterns, limiting their practical applicability in real-world deployments where client participation is unpredictable and variable.

Method: Introduces FedSUM family with three variants: FedSUM-B (basic), FedSUM (standard), and FedSUM-CR (communication-reduced). Models participation variability using two delay metrics: maximum delay τ_max and average delay τ_avg.

Result: Provides unified convergence guarantees demonstrating effectiveness across diverse participation patterns, enabling FL to handle arbitrary client participation without additional data heterogeneity assumptions.

Conclusion: FedSUM framework broadens the applicability of FL in real-world scenarios by supporting arbitrary client participation patterns, making FL more practical for deployment where participation is unpredictable.

Abstract: Federated Learning (FL) methods are often designed for specific client participation patterns, limiting their applicability in practical deployments. We introduce the FedSUM family of algorithms, which supports arbitrary client participation without additional assumptions on data heterogeneity. Our framework models participation variability with two delay metrics, the maximum delay $τ_{\max}$ and the average delay $τ_{\text{avg}}$. The FedSUM family comprises three variants: FedSUM-B (basic version), FedSUM (standard version), and FedSUM-CR (communication-reduced version). We provide unified convergence guarantees demonstrating the effectiveness of our approach across diverse participation patterns, thereby broadening the applicability of FL in real-world scenarios.

[511] AL-GNN: Privacy-Preserving and Replay-Free Continual Graph Learning via Analytic Learning

Xuling Zhang, Jindong Li, Yifei Zhang, Menglin Yang

Main category: cs.LG

TL;DR: AL GNN is a novel continual graph learning framework that eliminates backpropagation and replay buffers by using analytic learning theory for recursive least squares optimization, achieving better performance with improved privacy and efficiency.

Details

Motivation: Existing continual graph learning methods based on experience replay have significant limitations including privacy concerns (storing past graph data), inefficiency, and the need for backpropagation. There's a need for a more efficient, privacy-preserving approach that avoids these drawbacks.

Method: AL GNN leverages analytic learning theory to formulate learning as a recursive least squares optimization process. It maintains model knowledge analytically through closed-form classifier updates and a regularized feature autocorrelation matrix, enabling efficient one-pass training per task without storing historical samples.

Result: Extensive experiments show AL GNN achieves competitive or superior performance: improves average performance by 10% on CoraFull, reduces forgetting by over 30% on Reddit, and reduces training time by nearly 50% due to its backpropagation-free design.

Conclusion: AL GNN provides an effective continual graph learning framework that addresses privacy concerns, improves efficiency, and reduces catastrophic forgetting without requiring backpropagation or replay buffers, making it suitable for practical deployment in dynamic graph learning scenarios.

Abstract: Continual graph learning (CGL) aims to enable graph neural networks to incrementally learn from a stream of graph structured data without forgetting previously acquired knowledge. Existing methods particularly those based on experience replay typically store and revisit past graph data to mitigate catastrophic forgetting. However, these approaches pose significant limitations, including privacy concerns, inefficiency. In this work, we propose AL GNN, a novel framework for continual graph learning that eliminates the need for backpropagation and replay buffers. Instead, AL GNN leverages principles from analytic learning theory to formulate learning as a recursive least squares optimization process. It maintains and updates model knowledge analytically through closed form classifier updates and a regularized feature autocorrelation matrix. This design enables efficient one pass training for each task, and inherently preserves data privacy by avoiding historical sample storage. Extensive experiments on multiple dynamic graph classification benchmarks demonstrate that AL GNN achieves competitive or superior performance compared to existing methods. For instance, it improves average performance by 10% on CoraFull and reduces forgetting by over 30% on Reddit, while also reducing training time by nearly 50% due to its backpropagation free design.

[512] WebATLAS: An LLM Agent with Experience-Driven Memory and Action Simulation

Jiali Cheng, Anjishnu Kumar, Roshan Lal, Rishi Rajasekaran, Hani Ramezani, Omar Zia Khan, Oleg Rokhlenko, Sunny Chiu-Webster, Gang Hua, Hadi Amiri

Main category: cs.LG

TL;DR: WebATLAS is a memory-augmented LLM web agent that uses experience-driven memory and look-ahead action simulation to adapt to unseen web environments without fine-tuning, achieving 63% success rate on WebArena-Lite benchmark.

Details

Motivation: LLM web agents struggle with long-horizon web navigation and task completion in new websites, producing inefficient action sequences unless fine-tuned on environment-specific data. There's a need for agents that can adapt to unseen web environments without requiring website-specific fine-tuning.

Method: WebATLAS combines experience-driven memory with look-ahead action simulation. It learns a lightweight internal model from interaction experience, performs hypothetical action rollouts before real-world actions, builds a persistent cognitive map via curiosity-driven exploration, stores interaction outcomes as memory, and evaluates candidate actions using a planner-simulator-critic loop.

Result: Achieves 63% success rate on WebArena-Lite benchmark for autonomous web navigation, outperforming previous state-of-the-art at 53.9%. The modular architecture requires no website-specific LLM fine-tuning.

Conclusion: Experience-driven memory combined with look-ahead action simulation enables LLM agents to adapt to unseen web environments by remembering past failures and predicting future action consequences. The approach demonstrates robust, training-free web agents through complementary mechanisms of memory, simulation, and hierarchical replanning.

Abstract: Large Language Model (LLM) web agents often struggle with long-horizon web navigation and web task completion in new websites, producing inefficient action sequences unless fine-tuned on environment-specific data. We show that experience-driven memory, combined with look-ahead action simulation, is sufficient for LLM agents to adapt to unseen web environments by remembering past failures and predicting the consequences of future actions. We introduce WebATLAS (Actor-Critic Task-completion with Look-ahead Action Simulation), a memory-augmented LLM web agent that learns a lightweight internal model of the environment from interaction experience and performs hypothetical action rollouts before acting in the real world. WebATLAS builds a persistent cognitive map via curiosity-driven exploration, stores interaction outcomes as experience-based memory, and evaluates candidate actions in cognitive space using a planner–simulator–critic loop. This enables the agent to reuse past experience, avoid previously unsuccessful behaviors, and generate more efficient plans. We evaluate WebATLAS on the WebArena-Lite benchmark for autonomous web navigation and demonstrate a success rate of 63%, outperforming the previous state-of-the-art at 53.9%. Unlike previous systems, our modular architecture requires no website-specific LLM fine-tuning. Ablation studies confirm that experience-driven memory, look-ahead action simulation, and hierarchical replanning play complementary roles in enabling robust, training-free web agents.

[513] Embedded Safety-Aligned Intelligence via Differentiable Internal Alignment Embeddings

Harsh Rathva, Ojas Srivastava, Pruthwik Mishra

Main category: cs.LG

TL;DR: ESAI is a theoretical framework for multi-agent RL that embeds safety alignment directly into agents’ internal representations using learnable latent variables called internal alignment embeddings, which predict harm through counterfactual reasoning and modulate policy updates.

Details

Motivation: Current safety alignment methods in multi-agent systems often rely on external reward shaping or post-hoc constraints, which may not be optimal. The paper aims to develop a more integrated approach where alignment constraints are embedded directly into agents' internal representations for more effective and intrinsic safety alignment.

Method: ESAI uses differentiable internal alignment embeddings that predict externalized harm through counterfactual reasoning. The framework integrates four key mechanisms: 1) differentiable counterfactual alignment penalties from soft reference distributions, 2) alignment-weighted perceptual attention, 3) Hebbian associative memory for temporal credit assignment, and 4) similarity-weighted graph diffusion with bias mitigation controls.

Result: Theoretical analysis shows stability conditions for bounded internal embeddings under Lipschitz continuity and spectral constraints. The paper discusses computational complexity and examines theoretical properties including contraction behavior and fairness-performance tradeoffs. However, empirical evaluation is left for future work.

Conclusion: ESAI represents a conceptual contribution to differentiable alignment mechanisms in multi-agent systems, positioning internal alignment embeddings as a promising approach for safety. The work identifies open theoretical questions about convergence guarantees, embedding dimensionality, and extension to high-dimensional environments.

Abstract: We introduce Embedded Safety-Aligned Intelligence (ESAI), a theoretical framework for multi-agent reinforcement learning that embeds alignment constraints directly into agents internal representations using differentiable internal alignment embeddings. Unlike external reward shaping or post-hoc safety constraints, internal alignment embeddings are learned latent variables that predict externalized harm through counterfactual reasoning and modulate policy updates toward harm reduction through attention and graph-based propagation. The ESAI framework integrates four mechanisms: differentiable counterfactual alignment penalties computed from soft reference distributions, alignment-weighted perceptual attention, Hebbian associative memory supporting temporal credit assignment, and similarity-weighted graph diffusion with bias mitigation controls. We analyze stability conditions for bounded internal embeddings under Lipschitz continuity and spectral constraints, discuss computational complexity, and examine theoretical properties including contraction behavior and fairness-performance tradeoffs. This work positions ESAI as a conceptual contribution to differentiable alignment mechanisms in multi-agent systems. We identify open theoretical questions regarding convergence guarantees, embedding dimensionality, and extension to high-dimensional environments. Empirical evaluation is left to future work.

[514] Trustworthy and Explainable Deep Reinforcement Learning for Safe and Energy-Efficient Process Control: A Use Case in Industrial Compressed Air Systems

Vincent Bezold, Patrick Wagner, Jakob Hofmann, Marco Huber, Alexander Sauer

Main category: cs.LG

TL;DR: Trustworthy RL for industrial compressed air systems achieves 4% energy savings with multi-level explainability and safe operation under realistic conditions.

Details

Motivation: Need for safe, energy-efficient control of industrial compressed air systems without relying on explicit physics models, while ensuring trustworthy deployment through explainability.

Method: Trustworthy RL framework with multi-level explainability pipeline combining input perturbation tests, gradient-based sensitivity analysis, and SHAP feature attribution for transparent validation.

Result: Learned policy reduces unnecessary overpressure, achieves ~4% energy savings compared to industrial controller, shows physically plausible behavior, anticipates future demand, and respects system boundaries.

Conclusion: Combination of efficiency gains, predictive behavior, and transparent validation supports trustworthy deployment of RL in industrial energy systems, with system pressure and forecast dominating decisions.

Abstract: This paper presents a trustworthy reinforcement learning approach for the control of industrial compressed air systems. We develop a framework that enables safe and energy-efficient operation under realistic boundary conditions and introduce a multi-level explainability pipeline combining input perturbation tests, gradient-based sensitivity analysis, and SHAP (SHapley Additive exPlanations) feature attribution. An empirical evaluation across multiple compressor configurations shows that the learned policy is physically plausible, anticipates future demand, and consistently respects system boundaries. Compared to the installed industrial controller, the proposed approach reduces unnecessary overpressure and achieves energy savings of approximately 4,% without relying on explicit physics models. The results further indicate that system pressure and forecast information dominate policy decisions, while compressor-level inputs play a secondary role. Overall, the combination of efficiency gains, predictive behavior, and transparent validation supports the trustworthy deployment of reinforcement learning in industrial energy systems.

[515] Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale

Ansh Nagwekar

Main category: cs.LG

TL;DR: This thesis analyzes neural network optimization, tracing evolution from SGD to modern higher-order methods, revealing how principled algorithmic design can demystify training and improve performance.

Details

Motivation: Neural network optimization is poorly understood despite its critical importance. Current methods like SGD work empirically but lack principled understanding, creating a paradox between empirical success and theoretical gaps.

Method: Traces evolution from classical first-order methods (SGD, adaptive gradient) to modern higher-order techniques. Analyzes limitations of conventional approaches with real-world data anisotropy, then explores curvature-based methods, layer-wise preconditioning, adaptive learning rates, and their interplay with broader training tools.

Result: Reveals how principled algorithmic design can demystify training process. Shows breakdowns of conventional methods with real-world data anisotropy, motivating sophisticated alternatives using curvature information.

Conclusion: Bridges theory-practice gap with practical prescriptions for integrating advanced optimization methods into modern deep learning workflows. Emphasizes interplay between optimization algorithms and broader training toolkit as essential for empirical success.

Abstract: Neural network optimization remains one of the most consequential yet poorly understood challenges in modern AI research, where improvements in training algorithms can lead to enhanced feature learning in foundation models, order-of-magnitude reductions in training time, and improved interpretability into how networks learn. While stochastic gradient descent (SGD) and its variants have become the de facto standard for training deep networks, their success in these over-parameterized regimes often appears more empirical than principled. This thesis investigates this apparent paradox by tracing the evolution of optimization algorithms from classical first-order methods to modern higher-order techniques, revealing how principled algorithmic design can demystify the training process. Starting from first principles with SGD and adaptive gradient methods, the analysis progressively uncovers the limitations of these conventional approaches when confronted with anisotropy that is representative of real-world data. These breakdowns motivate the exploration of sophisticated alternatives rooted in curvature information: second-order approximation techniques, layer-wise preconditioning, adaptive learning rates, and more. Next, the interplay between these optimization algorithms and the broader neural network training toolkit, which includes prior and recent developments such as maximal update parametrization, learning rate schedules, and exponential moving averages, emerges as equally essential to empirical success. To bridge the gap between theoretical understanding and practical deployment, this paper offers practical prescriptions and implementation strategies for integrating these methods into modern deep learning workflows.

[516] The Challenger: When Do New Data Sources Justify Switching Machine Learning Models?

Vassilis Digalakis, Christophe Pérignon, Sébastien Saurin, Flore Sentenac

Main category: cs.LG

TL;DR: A framework for deciding when to replace an incumbent model with a challenger using new features, balancing learning curves, costs, and future gains.

Details

Motivation: Organizations face the challenge of deciding when to replace trained models with new ones using newly available features, needing to balance learning-curve dynamics, data acquisition costs, retraining costs, and future performance gains.

Method: Developed a unified economic-statistical framework, characterized optimal switching times theoretically, and proposed three practical algorithms: one-shot baseline, greedy sequential method, and look-ahead sequential method.

Result: On real-world credit-scoring data: (i) optimal switching times vary systematically with cost parameters and learning-curve behavior, (ii) look-ahead sequential method outperforms others and approaches oracle performance. Established finite-sample guarantees with sublinear regret.

Conclusion: Provides an operational blueprint for economically sound model transitions as new data sources become available, with practical algorithms and theoretical guarantees.

Abstract: We study the problem of deciding whether, and when an organization should replace a trained incumbent model with a challenger relying on newly available features. We develop a unified economic and statistical framework that links learning-curve dynamics, data-acquisition and retraining costs, and discounting of future gains. First, we characterize the optimal switching time in stylized settings and derive closed-form expressions that quantify how horizon length, learning-curve curvature, and cost differentials shape the optimal decision. Second, we propose three practical algorithms: a one-shot baseline, a greedy sequential method, and a look-ahead sequential method. Using a real-world credit-scoring dataset with gradually arriving alternative data, we show that (i) optimal switching times vary systematically with cost parameters and learning-curve behavior, and (ii) the look-ahead sequential method outperforms other methods and is able to approach in value an oracle with full foresight. Finally, we establish finite-sample guarantees, including conditions under which the sequential look-ahead method achieve sublinear regret relative to that oracle. Our results provide an operational blueprint for economically sound model transitions as new data sources become available.

[517] Why Most Optimism Bandit Algorithms Have the Same Regret Analysis: A Simple Unifying Theorem

Vikram Krishnamurthy

Main category: cs.LG

TL;DR: The paper presents a unified framework for analyzing optimism-based stochastic bandit algorithms, showing that logarithmic regret follows from a single concentration condition and two deterministic lemmas.

Details

Motivation: To unify the analysis of various optimism-based bandit algorithms (UCB, UCB-V, linear UCB, GP-UCB) by identifying the minimal common structure behind their logarithmic regret proofs.

Method: Isolates the essential ingredients: a single high-probability concentration condition on estimators, plus two deterministic lemmas describing radius collapse and optimism-forced deviations.

Result: Provides unified, near-minimal proofs for classical bandit algorithms and extends naturally to many contemporary bandit variants.

Conclusion: The framework simplifies and unifies regret analysis for optimism-based bandit algorithms, revealing their common mathematical structure and enabling easier extension to new variants.

Abstract: Several optimism-based stochastic bandit algorithms – including UCB, UCB-V, linear UCB, and finite-arm GP-UCB – achieve logarithmic regret using proofs that, despite superficial differences, follow essentially the same structure. This note isolates the minimal ingredients behind these analyses: a single high-probability concentration condition on the estimators, after which logarithmic regret follows from two short deterministic lemmas describing radius collapse and optimism-forced deviations. The framework yields unified, near-minimal proofs for these classical algorithms and extends naturally to many contemporary bandit variants.

[518] MoE Pathfinder: Trajectory-driven Expert Pruning

Xican Yang, Yuanhe Tian, Yan Song

Main category: cs.LG

TL;DR: Proposes a global trajectory-based expert pruning method for MoE LLMs that treats expert selection as optimal path planning, achieving superior pruning performance across tasks.

Details

Motivation: MoE architectures in LLMs face deployment complexity and low activation efficiency. Existing expert pruning methods use local metrics and uniform layer-wise pruning, ignoring heterogeneous expert contributions across layers.

Method: Treats MoE as weighted computation graph and casts expert selection as global optimal path planning problem. Integrates complementary importance signals (reconstruction error, routing probabilities, activation strength) at trajectory level, yielding non-uniform expert retention across layers.

Result: Achieves superior pruning performance on nearly all tasks compared with most existing approaches.

Conclusion: The trajectory-based global pruning approach effectively addresses limitations of local pruning methods and improves MoE model efficiency while maintaining performance.

Abstract: Mixture-of-experts (MoE) architectures used in large language models (LLMs) achieve state-of-the-art performance across diverse tasks yet face practical challenges such as deployment complexity and low activation efficiency. Expert pruning has thus emerged as a promising solution to reduce computational overhead and simplify the deployment of MoE models. However, existing expert pruning approaches conventionally rely on local importance metrics and often apply uniform layer-wise pruning, leveraging only partial evaluation signals and overlooking the heterogeneous contributions of experts across layers. To address these limitations, we propose an expert pruning approach based on the trajectory of activated experts across layers, which treats MoE as a weighted computation graph and casts expert selection as a global optimal path planning problem. Within this framework, we integrate complementary importance signals from reconstruction error, routing probabilities, and activation strength at the trajectory level, which naturally yields non-uniform expert retention across layers. Experiments show that our approach achieves superior pruning performance on nearly all tasks compared with most existing approaches.

[519] On the Universality of Transformer Architectures; How Much Attention Is Enough?

Amirreza Abbasi, Mohsen Hooshmand

Main category: cs.LG

TL;DR: Survey paper examining Transformers’ universality, reviewing architectural refinements, approximation rates, and separating robust from fragile theoretical guarantees.

Details

Motivation: Transformers are dominant across AI fields due to perceived universality and scalability, but there's a need to clarify what's actually known about their expressiveness and theoretical foundations.

Method: Survey and review methodology examining the problem of universality in Transformers, analyzing architectural refinements (structural minimality, approximation rates), and synthesizing state-of-the-art advances.

Result: Clarifies current understanding of Transformers’ expressiveness, separates robust theoretical guarantees from fragile ones, and identifies key directions for future theoretical research.

Conclusion: The paper provides a comprehensive survey of Transformers’ universality theory, distinguishing solid theoretical foundations from weaker claims, and outlines important future research directions for understanding Transformer expressiveness.

Abstract: Transformers are crucial across many AI fields, such as large language models, computer vision, and reinforcement learning. This prominence stems from the architecture’s perceived universality and scalability compared to alternatives. This work examines the problem of universality in Transformers, reviews recent progress, including architectural refinements such as structural minimality and approximation rates, and surveys state-of-the-art advances that inform both theoretical and practical understanding. Our aim is to clarify what is currently known about Transformers expressiveness, separate robust guarantees from fragile ones, and identify key directions for future theoretical research.

[520] Secret mixtures of experts inside your LLM

Enric Boix-Adsera

Main category: cs.LG

TL;DR: MLP layers in transformers secretly perform sparse computation similar to Mixture of Experts, revealed through theoretical connection to Sparse Autoencoders and validated empirically on LLMs.

Details

Motivation: MLP layers in transformers are poorly understood despite being fundamental components. The paper aims to understand their internal workings by hypothesizing they perform sparse computation similar to MoE layers.

Method: Established theoretical connection between MoE models and Sparse Autoencoder structure in activation space. Empirically validated hypothesis on pretrained LLMs, showing results depend on neural activation structure rather than Gaussian data.

Result: MLP layers in LLMs can be well approximated by sparsely-activating MoE layers. The activation distribution structure is crucial - results don’t hold for Gaussian data but require actual neural network activation patterns.

Conclusion: Reveals general principle in MLP layers of LLMs, explains effectiveness of modern MoE-based transformers, and suggests new directions for efficient MoE design using low-rank routers.

Abstract: Despite being one of the earliest neural network layers, the Multilayer Perceptron (MLP) is arguably one of the least understood parts of the transformer architecture due to its dense computation and lack of easy visualization. This paper seeks to understand the MLP layers in dense LLM models by hypothesizing that these layers secretly approximately perform a sparse computation – namely, that they can be well approximated by sparsely-activating Mixture of Experts (MoE) layers. Our hypothesis is based on a novel theoretical connection between MoE models and Sparse Autoencoder (SAE) structure in activation space. We empirically validate the hypothesis on pretrained LLMs, and demonstrate that the activation distribution matters – these results do not hold for Gaussian data, but rather rely crucially on structure in the distribution of neural network activations. Our results shine light on a general principle at play in MLP layers inside LLMs, and give an explanation for the effectiveness of modern MoE-based transformers. Additionally, our experimental explorations suggest new directions for more efficient MoE architecture design based on low-rank routers.

[521] NOVA: Discovering Well-Conditioned Winograd Transforms through Numerical Optimization of Vandermonde Arithmetic

Jayant Lohia

Main category: cs.LG

TL;DR: NOVA discovers stable fractional transforms for Winograd convolution, solving numerical instability in low-precision computing by 415x improvement in conditioning.

Details

Motivation: Winograd convolution faces critical numerical instability in low-precision computing (FP16/Int8) as tile sizes scale up, with condition numbers exploding to unusable levels (kappa = 2×10^5 for F(8,3)).

Method: NOVA treats Winograd point selection as continuous optimization, searches R^n-1 via Evolution Strategy, snaps candidates to simple rationals, and guarantees correctness via symbolic verification, discovering stable fractional configurations.

Result: NOVA improves conditioning of F(8,3) by 415x in 1D (172,484x for 2D), restores FP16 ImageNet accuracy from 4.7% to 75-78% on VGG16 without retraining, calibration, or learned parameters.

Conclusion: NOVA transforms act as drop-in replacements that unlock large-tile Winograd convolution efficiency for next-generation hardware by solving decades-old numerical instability problems.

Abstract: Winograd convolution is the standard algorithm for efficient inference, reducing arithmetic complexity by 2.25x for 3x3 kernels. However, it faces a critical barrier in the modern era of low precision computing: numerical instability. As tiles scale to maximize efficiency (e.g., F(6,3), F(8,3)), the condition numbers of standard integer based transforms explode, reaching kappa = 2 x 10^5 for F(8,3), rendering them unusable in FP16 or Int8. We introduce NOVA (Numerical Optimization of Vandermonde Arithmetic), a discovery framework that breaks the decades old convention of integer interpolation. Treating Winograd point selection as a continuous optimization problem, NOVA searches the manifold R^n-1 via Evolution Strategy, snaps candidates to simple rationals, and guarantees correctness via symbolic verification. This process uncovers a hidden landscape of stable, fractional configurations such as {+-5/6, +-7/6, +-3/5} that defy traditional vocabulary constraints. The impact is transformative: NOVA improves the conditioning of F(8,3) by 415x in 1D, which squares to a 172,484x improvement for 2D convolution. In real world FP16 ImageNet inference, where standard transforms collapse to random chance (e.g., 4.7 percent accuracy on VGG16), NOVA’s points restore full accuracy (75 to 78 percent), recovering over 70 percentage points without retraining, calibration, or learned parameters. These discovered transforms act as drop in replacements, effectively unlocking the efficiency of large tile Winograd convolution for next generation hardware.

[522] Out-of-Distribution Detection in Molecular Complexes via Diffusion Models for Irregular Graphs

David Graber, Victor Armegioiu, Rebecca Buller, Siddhartha Mishra

Main category: cs.LG

TL;DR: A probabilistic OOD detection framework for 3D graphs using diffusion models with unified continuous diffusion over coordinates and discrete features, validated on protein-ligand complexes.

Details

Motivation: Machine learning models degrade on out-of-distribution (OOD) data, especially for irregular 3D graphs that combine continuous geometry with categorical identities. Reliable deployment requires robust OOD detection for such complex data structures.

Method: Unsupervised diffusion model learns training distribution density. Unified continuous diffusion over 3D coordinates and discrete features (categorical identities embedded in continuous space). Uses cross-entropy training and posterior-mean interpolation to obtain diffusion scores, yielding a self-consistent probability-flow ODE (PF-ODE) for per-sample log-likelihoods.

Result: PF-ODE likelihoods successfully identify held-out protein families as OOD and correlate strongly with binding-affinity prediction errors. Multi-scale trajectory statistics (path tortuosity, flow stiffness, vector-field instability) provide complementary OOD information, improving detection sensitivity over likelihood-only baselines.

Conclusion: The framework provides a principled, label-free OOD quantification workflow for geometric deep learning on complex 3D graph data, enabling reliable deployment of predictive models through a priori reliability estimates.

Abstract: Predictive machine learning models generally excel on in-distribution data, but their performance degrades on out-of-distribution (OOD) inputs. Reliable deployment therefore requires robust OOD detection, yet this is particularly challenging for irregular 3D graphs that combine continuous geometry with categorical identities and are unordered by construction. Here, we present a probabilistic OOD detection framework for complex 3D graph data built on a diffusion model that learns a density of the training distribution in a fully unsupervised manner. A key ingredient we introduce is a unified continuous diffusion over both 3D coordinates and discrete features: categorical identities are embedded in a continuous space and trained with cross-entropy, while the corresponding diffusion score is obtained analytically via posterior-mean interpolation from predicted class probabilities. This yields a single self-consistent probability-flow ODE (PF-ODE) that produces per-sample log-likelihoods, providing a principled typicality score for distribution shift. We validate the approach on protein-ligand complexes and construct strict OOD datasets by withholding entire protein families from training. PF-ODE likelihoods identify held-out families as OOD and correlate strongly with prediction errors of an independent binding-affinity model (GEMS), enabling a priori reliability estimates on new complexes. Beyond scalar likelihoods, we show that multi-scale PF-ODE trajectory statistics - including path tortuosity, flow stiffness, and vector-field instability - provide complementary OOD information. Modeling the joint distribution of these trajectory features yields a practical, high-sensitivity detector that improves separation over likelihood-only baselines, offering a label-free OOD quantification workflow for geometric deep learning.

[523] Self-organizing maps for water quality assessment in reservoirs and lakes: A systematic literature review

Oraib Almegdadi, João Marcelino, Sarah Fakhreddine, João Manso, Nuno C. Marques

Main category: cs.LG

TL;DR: This review paper examines how Self-Organizing Maps (SOMs), an unsupervised AI technique, are applied to water quality assessment in lakes and reservoirs, focusing on parameter selection, sampling strategies, and clustering approaches for effective water management.

Details

Motivation: Sustainable water quality is crucial for ecological balance and water security, but assessing lakes and reservoirs is challenging due to data sparsity, heterogeneity, and nonlinear parameter relationships. The growing availability of environmental data from various sources creates opportunities for advanced analytical approaches.

Method: The paper conducts a comprehensive review of research applying Self-Organizing Maps (SOMs) to water quality assessment. It synthesizes studies on parameter selection, spatial and temporal sampling strategies, clustering approaches, and how SOM handles multidimensional data to uncover hidden patterns.

Result: SOM has proven effective in analyzing complex water quality datasets, particularly when labeled data are limited. It enables high-dimensional data visualization, facilitates detection of hidden ecological patterns, and identifies critical correlations among diverse water quality indicators.

Conclusion: SOM demonstrates versatility in ecological assessments, trophic state classification, algal bloom monitoring, and catchment area impact evaluations. The review provides comprehensive insights supporting future research and practical applications for improving monitoring and sustainable management of lake and reservoir ecosystems.

Abstract: Sustainable water quality underpins ecological balance and water security. Assessing and managing lakes and reservoirs is difficult due to data sparsity, heterogeneity, and nonlinear relationships among parameters. This review examines how Self-Organizing Map (SOM), an unsupervised AI technique, is applied to water quality assessment. It synthesizes research on parameter selection, spatial and temporal sampling strategies, and clustering approaches. Emphasis is placed on how SOM handles multidimensional data and uncovers hidden patterns to support effective water management. The growing availability of environmental data from in-situ sensors, remote sensing imagery, IoT technologies, and historical records has significantly expanded analytical opportunities in environmental monitoring. SOM has proven effective in analysing complex datasets, particularly when labelled data are limited or unavailable. It enables high-dimensional data visualization, facilitates the detection of hidden ecological patterns, and identifies critical correlations among diverse water quality indicators. This review highlights SOMs versatility in ecological assessments, trophic state classification, algal bloom monitoring, and catchment area impact evaluations. The findings offer comprehensive insights into existing methodologies, supporting future research and practical applications aimed at improving the monitoring and sustainable management of lake and reservoir ecosystems.

[524] The Geometry of Abstraction: Continual Learning via Recursive Quotienting

Xin Li

Main category: cs.LG

TL;DR: The paper proposes Recursive Metric Contraction to solve the flat manifold problem in continual learning by collapsing metric tensors in temporal neighborhoods, enabling bounded representation of arbitrarily long trajectories while preventing catastrophic interference.

Details

Motivation: Continual learning systems in fixed-dimensional spaces face the flat manifold problem: linear trajectories in Euclidean space cause geodesic distances to grow linearly with time, forcing trajectory overlap and catastrophic interference. This geometric barrier prevents effective long-term learning in hardware with fixed dimensions.

Method: Recursive Metric Contraction - a topological deformation approach where abstraction is formalized as a quotient map that collapses the metric tensor within validated temporal neighborhoods, driving local sub-manifold diameters to zero. This creates singularities/wormholes that bridge distant temporal points.

Result: Three key theorems: 1) Bounded Capacity Theorem shows recursive quotient maps embed arbitrarily long trajectories into bounded volumes; 2) Topological Collapse Separability Theorem proves recursive quotienting makes temporal sequences linearly separable; 3) Parity-Partitioned Stability Theorem solves catastrophic forgetting via orthogonal flow/scaffold manifolds.

Conclusion: The geometric framework resolves the flat manifold paradox, revealing that tokens in neural architectures can be physically realized as singularities or wormholes - regions of extreme positive curvature that connect distant temporal points, enabling efficient continual learning in fixed-dimensional hardware.

Abstract: Continual learning systems operating in fixed-dimensional spaces face a fundamental geometric barrier: the flat manifold problem. When experience is represented as a linear trajectory in Euclidean space, the geodesic distance between temporal events grows linearly with time, forcing the required covering number to diverge. In fixed-dimensional hardware, this volume expansion inevitably forces trajectory overlap, manifesting as catastrophic interference. In this work, we propose a geometric resolution to this paradox based on Recursive Metric Contraction. We formalize abstraction not as symbolic grouping, but as a topological deformation: a quotient map that collapses the metric tensor within validated temporal neighborhoods, effectively driving the diameter of local sub-manifolds to zero. We substantiate our framework with four rigorous results. First, the Bounded Capacity Theorem establishes that recursive quotient maps allow the embedding of arbitrarily long trajectories into bounded representational volumes, trading linear metric growth for logarithmic topological depth. Second, the Topological Collapse Separability Theorem, derived via Urysohn’s Lemma, proves that recursive quotienting renders non-linearly separable temporal sequences linearly separable in the limit, bypassing the need for infinite-dimensional kernel projections. Third, the Parity-Partitioned Stability Theorem solves the catastrophic forgetting problem by proving that if the state space is partitioned into orthogonal flow and scaffold manifolds, the metric deformations of active learning do not disturb the stability of stored memories. Our analysis reveals that tokens in neural architectures are physically realizable as singularities or wormholes, regions of extreme positive curvature that bridge distant points in the temporal manifold.

[525] APC-GNN++: An Adaptive Patient-Centric GNN with Context-Aware Attention and Mini-Graph Explainability for Diabetes Classification

Khaled Berkani

Main category: cs.LG

TL;DR: APC-GNN++ is an adaptive patient-centric Graph Neural Network for diabetes classification that integrates edge attention, feature blending, and consistency regularization, with a mini-graph approach for real-time predictions on unseen patients.

Details

Motivation: To develop a more effective diabetes classification system that captures clinically meaningful relationships between patients while providing real-time, explainable predictions for healthcare professionals.

Method: APC-GNN++ combines context-aware edge attention, confidence-guided blending of node features and graph representations, neighborhood consistency regularization, and a mini-graph approach for handling unseen patients using nearest neighbors without retraining.

Result: Outperforms traditional ML models (MLP, Random Forest, XGBoost) and vanilla GCN on a real-world Algerian diabetes dataset, achieving higher test accuracy and macro F1-score, with interpretable confidence scores showing how the model balances self-information and graph evidence.

Conclusion: APC-GNN++ provides an effective, interpretable diabetes classification system with real-time prediction capabilities and a practical GUI for healthcare professionals, demonstrating the value of adaptive patient-centric GNN approaches in medical applications.

Abstract: We propose APC-GNN++, an adaptive patient-centric Graph Neural Network for diabetes classification. Our model integrates context-aware edge attention, confidence-guided blending of node features and graph representations, and neighborhood consistency regularization to better capture clinically meaningful relationships between patients. To handle unseen patients, we introduce a mini-graph approach that leverages the nearest neighbors of the new patient, enabling real-time explainable predictions without retraining the global model. We evaluate APC-GNN++ on a real-world diabetes dataset collected from a regional hospital in Algeria and show that it outperforms traditional machine learning models (MLP, Random Forest, XGBoost) and a vanilla GCN, achieving higher test accuracy and macro F1- score. The analysis of node-level confidence scores further reveals how the model balances self-information and graph-based evidence across different patient groups, providing interpretable patient-centric insights. The system is also embedded in a Tkinter-based graphical user interface (GUI) for interactive use by healthcare professionals .

[526] Prediction and Forecast of Short-Term Drought Impacts Using Machine Learning to Support Mitigation and Adaptation Efforts

Hatim M. E. Geli, Islam Omar, Mona Y. Elshinawy, David W. DuBios, Lara Prehodko, Kelly H Smith, Abdel-Hameed A. Badawy

Main category: cs.LG

TL;DR: Machine learning models using drought indices (DSCI, ESI) and historical impact data can forecast weekly drought impacts up to 8 weeks ahead, with varying accuracy across impact categories.

Details

Motivation: Increasing drought severity, frequency, and duration requires better monitoring and mitigation. Predicting drought impacts (not just conditions) can support early warning systems and proactive decision-making for stakeholders.

Method: Combined Drought Severity and Coverage Index (DSCI) and Evaporative Stress Index (ESI) with Drought Impact Reporter (DIR) data. Used eXtreme Gradient Boosting (XGBoost) model to generate weekly drought impact forecasts up to 8 weeks ahead using preceding 8 weeks of data.

Result: Fire and Relief impacts predicted with highest accuracy, followed by Agriculture and Water. Plants and Society impacts showed greater variability. Successfully generated county and state level forecasts for New Mexico up to 8 weeks in advance.

Conclusion: The approach supports development of Ecological Drought Information Communication System (EcoDri) for New Mexico and demonstrates potential for broader application in drought-prone regions. Findings can aid stakeholders in developing more effective drought mitigation and adaptation strategies.

Abstract: Drought is a complex natural hazard that affects ecological and human systems, often resulting in substantial environmental and economic losses. Recent increases in drought severity, frequency, and duration underscore the need for effective monitoring and mitigation strategies. Predicting drought impacts rather than drought conditions alone offers opportunities to support early warning systems and proactive decision-making. This study applies machine learning techniques to link drought indices with historical drought impact records (2005:2024) to generate short-term impact forecasts. By addressing key conceptual and data-driven challenges regarding temporal scale and impact quantification, the study aims to improve the predictability of drought impacts at actionable lead times. The Drought Severity and Coverage Index (DSCI) and the Evaporative Stress Index (ESI) were combined with impact data from the Drought Impact Reporter (DIR) to model and forecast weekly drought impacts. Results indicate that Fire and Relief impacts were predicted with the highest accuracy, followed by Agriculture and Water, while forecasts for Plants and Society impacts showed greater variability. County and state level forecasts for New Mexico were produced using an eXtreme Gradient Boosting (XGBoost) model that incorporated both DSCI and ESI. The model successfully generated forecasts up to eight weeks in advance using the preceding eight weeks of data for most impact categories. This work supports the development of an Ecological Drought Information Communication System (EcoDri) for New Mexico and demonstrates the potential for broader application in similar drought-prone regions. The findings can aid stakeholders, land managers, and decision-makers in developing and implementing more effective drought mitigation and adaptation strategies.

[527] Feature-Enhanced Graph Neural Networks for Classification of Synthetic Graph Generative Models: A Benchmarking Study

Janek Dyer, Jagdeep Ahluwalia, Javad Zarrin

Main category: cs.LG

TL;DR: Hybrid GNN + graph-theoretic features approach achieves 98.5% accuracy in classifying synthetic graph families, with GraphSAGE and GTN performing best.

Details

Motivation: Need to better understand complex structural patterns in graphs by combining GNNs' power with interpretable graph-theoretic features for improved generative model discrimination.

Method: Generated diverse synthetic graphs from 5 families (ER, WS, BA, HK, SBM), extracted comprehensive node/graph features, pruned via Random Forest selection, integrated into 6 GNN architectures (GCN, GAT, GATv2, GIN, GraphSAGE, GTN) with Optuna hyperparameter optimization, compared against SVM baseline.

Result: GraphSAGE and GTN achieved highest classification accuracy (98.5%) with strong class separation in visualizations; GCN and GIN performed well; GAT-based models lagged due to global structure capture limitations; SVM baseline confirmed message passing importance.

Conclusion: Hybrid approach combining GNNs with graph-theoretic features effectively discriminates generative graph models, with GraphSAGE and GTN showing superior performance, demonstrating value of integrating interpretable features with neural architectures.

Abstract: The ability to discriminate between generative graph models is critical to understanding complex structural patterns in both synthetic graphs and the real-world structures that they emulate. While Graph Neural Networks (GNNs) have seen increasing use to great effect in graph classification tasks, few studies explore their integration with interpretable graph theoretic features. This paper investigates the classification of synthetic graph families using a hybrid approach that combines GNNs with engineered graph-theoretic features. We generate a large and structurally diverse synthetic dataset comprising graphs from five representative generative families, Erdos-Renyi, Watts-Strogatz, Barab’asi-Albert, Holme-Kim, and Stochastic Block Model. These graphs range in size up to 1x10^4 nodes, containing up to 1.1x10^5 edges. A comprehensive range of node and graph level features is extracted for each graph and pruned using a Random Forest based feature selection pipeline. The features are integrated into six GNN architectures: GCN, GAT, GATv2, GIN, GraphSAGE and GTN. Each architecture is optimised for hyperparameter selection using Optuna. Finally, models were compared against a baseline Support Vector Machine (SVM) trained solely on the handcrafted features. Our evaluation demonstrates that GraphSAGE and GTN achieve the highest classification performance, with 98.5% accuracy, and strong class separation evidenced by t-SNE and UMAP visualisations. GCN and GIN also performed well, while GAT-based models lagged due to limitations in their ability to capture global structures. The SVM baseline confirmed the importance of the message passing functionality for performance gains and meaningful class separation.

[528] Comparing Dynamical Models Through Diffeomorphic Vector Field Alignment

Ruiqi Chen, Giacomo Vedovati, Todd Braver, ShiNung Ching

Main category: cs.LG

TL;DR: DFORM is a framework that aligns dynamical systems via diffeomorphic transformations to compare learned models and identify low-dimensional motifs in high-dimensional systems like RNNs.

Details

Motivation: Two major challenges in analyzing dynamical systems models (like RNNs) in neuroscience: 1) Difficulty comparing learned dynamics across models due to different coordinate systems, and 2) Intractability of identifying mechanistically important low-dimensional motifs (e.g., limit sets) in high-dimensional nonlinear models.

Method: DFORM learns a nonlinear coordinate transformation between state spaces of two dynamical systems to align their trajectories in a maximally one-to-one manner. This enables assessment of topological equivalence and helps locate dynamical motifs on low-dimensional manifolds within higher-dimensional systems.

Result: DFORM successfully identified linear and nonlinear coordinate transformations in canonical systems, RNNs, and systems related by nonlinear flows. It quantified similarity between topologically distinct systems, located important dynamical motifs (invariant manifolds, saddle limit sets) in high-dimensional models, and identified limit cycles in RNNs trained on fMRI data that agreed with prior numerical analysis.

Conclusion: DFORM provides a comprehensive framework for comparing dynamical systems models and identifying mechanistically important low-dimensional motifs, addressing key challenges in analyzing high-dimensional models like RNNs in theoretical neuroscience.

Abstract: Dynamical systems models such as recurrent neural networks (RNNs) are increasingly popular in theoretical neuroscience for hypothesis-generation and data analysis. Evaluating the dynamics in such models is key to understanding their learned generative mechanisms. However, such evaluation is impeded by two major challenges: First, comparison of learned dynamics across models is difficult because there is no enforced equivalence of their coordinate systems. Second, identification of mechanistically important low-dimensional motifs (e.g., limit sets) is intractable in high-dimensional nonlinear models such as RNNs. Here, we propose a comprehensive framework to address these two issues, termed Diffeomorphic vector field alignment FOR learned Models (DFORM). DFORM learns a nonlinear coordinate transformation between the state spaces of two dynamical systems, which aligns their trajectories in a maximally one-to-one manner. In so doing, DFORM enables an assessment of whether two models exhibit topological equivalence, i.e., similar mechanisms despite differences in coordinate systems. A byproduct of this method is a means to locate dynamical motifs on low-dimensional manifolds embedded within higher-dimensional systems. We verified DFORM’s ability to identify linear and nonlinear coordinate transformations using canonical topologically equivalent systems, RNNs, and systems related by nonlinear flows. DFORM was also shown to provide a quantification of similarity between topologically distinct systems. We then demonstrated that DFORM can locate important dynamical motifs including invariant manifolds and saddle limit sets within high-dimensional models. Finally, using a set of RNN models trained on human functional MRI (fMRI) recordings, we illustrated that DFORM can identify limit cycles from high-dimensional data-driven models, which agreed well with prior numerical analysis.

Effiong Blessing, Chiung-Yi Tseng, Somshubhra Roy, Junaid Rehman, Isaac Nkrumah

Main category: cs.LG

TL;DR: Memory-augmented SNNs show modality-dependent performance: Hopfield networks excel on visual tasks (97.68%) but struggle on auditory (76.15%), while supervised contrastive learning shows more balanced cross-modal performance. Multi-modal training with HGRN achieves competitive results, revealing weak cross-modal alignment and establishing modality-specific memory optimization.

Details

Motivation: To explore how memory mechanisms in spiking neural networks generalize across different sensory modalities (visual vs. auditory), as this cross-modal generalization remains unexplored despite the promise of energy-efficient neuromorphic computing.

Method: Conducted first comprehensive cross-modal ablation study evaluating three memory mechanisms (Hopfield networks, HGRNs, supervised contrastive learning) across five architectures on visual (N-MNIST) and auditory (SHD) neuromorphic datasets. Used systematic evaluation, joint multi-modal training, and quantitative engram analysis.

Result: Striking modality-dependent patterns: Hopfield networks show 21.53 point gap between visual (97.68%) and auditory (76.15%) performance, while SCL shows more balanced 14.56 point gap (96.72% visual, 82.16% audio). Joint multi-modal HGRN training achieves 94.41% visual, 79.37% audio (88.78% average). Weak cross-modal alignment confirmed (0.038 similarity). Achieved 603x energy efficiency over traditional neural networks.

Conclusion: Memory mechanisms exhibit task-specific benefits rather than universal applicability, providing first empirical evidence for modality-specific memory optimization in neuromorphic systems. Parallel architecture design is validated by weak cross-modal alignment, enabling efficient deployment across sensory modalities.

Abstract: Memory-augmented spiking neural networks (SNNs) promise energy-efficient neuromorphic computing, yet their generalization across sensory modalities remains unexplored. We present the first comprehensive cross-modal ablation study of memory mechanisms in SNNs, evaluating Hopfield networks, Hierarchical Gated Recurrent Networks (HGRNs), and supervised contrastive learning (SCL) across visual (N-MNIST) and auditory (SHD) neuromorphic datasets. Our systematic evaluation of five architectures reveals striking modality-dependent performance patterns: Hopfield networks achieve 97.68% accuracy on visual tasks but only 76.15% on auditory tasks (21.53 point gap), revealing severe modality-specific specialization, while SCL demonstrates more balanced cross-modal performance (96.72% visual, 82.16% audio, 14.56 point gap). These findings establish that memory mechanisms exhibit task-specific benefits rather than universal applicability. Joint multi-modal training with HGRN achieves 94.41% visual and 79.37% audio accuracy (88.78% average), matching parallel HGRN performance through unified deployment. Quantitative engram analysis confirms weak cross-modal alignment (0.038 similarity), validating our parallel architecture design. Our work provides the first empirical evidence for modality-specific memory optimization in neuromorphic systems, achieving 603x energy efficiency over traditional neural networks.

[530] SD2AIL: Adversarial Imitation Learning from Synthetic Demonstrations via Diffusion Models

Pengcheng Li, Qiang Fang, Tong Zhao, Yixing Lan, Xin Xu

Main category: cs.LG

TL;DR: SD2AIL uses diffusion models to generate synthetic expert demonstrations for adversarial imitation learning, combined with prioritized replay to selectively use the most valuable demonstrations, achieving state-of-the-art performance.

Details

Motivation: Collecting sufficient expert demonstrations for adversarial imitation learning can be challenging in real-world scenarios, limiting performance and stability despite the framework's effectiveness.

Method: 1) Use diffusion model in discriminator to generate synthetic demonstrations as pseudo-expert data augmenting real demonstrations. 2) Introduce prioritized expert demonstration replay (PEDR) to selectively replay the most valuable demonstrations from the combined pool.

Result: Experimental results on simulation tasks demonstrate effectiveness and robustness. In Hopper task, achieves average return of 3441, surpassing state-of-the-art method by 89.

Conclusion: SD2AIL successfully leverages diffusion models for synthetic demonstration generation in AIL, combined with prioritized replay, to overcome data scarcity limitations and achieve superior performance.

Abstract: Adversarial Imitation Learning (AIL) is a dominant framework in imitation learning that infers rewards from expert demonstrations to guide policy optimization. Although providing more expert demonstrations typically leads to improved performance and greater stability, collecting such demonstrations can be challenging in certain scenarios. Inspired by the success of diffusion models in data generation, we propose SD2AIL, which utilizes synthetic demonstrations via diffusion models. We first employ a diffusion model in the discriminator to generate synthetic demonstrations as pseudo-expert data that augment the expert demonstrations. To selectively replay the most valuable demonstrations from the large pool of (pseudo-) expert demonstrations, we further introduce a prioritized expert demonstration replay strategy (PEDR). The experimental results on simulation tasks demonstrate the effectiveness and robustness of our method. In particular, in the Hopper task, our method achieves an average return of 3441, surpassing the state-of-the-art method by 89. Our code will be available at https://github.com/positron-lpc/SD2AIL.

[531] Benchmarking neural surrogates on realistic spatiotemporal multiphysics flows

Runze Mao, Rui Zhang, Xuan Bai, Tianhao Wu, Teng Zhang, Zhenyi Chen, Minqi Lin, Bocheng Zeng, Yangchen Xu, Yingxuan Xiang, Haoze Zhang, Shubham Goswami, Pierre A. Dawe, Yifan Xu, Zhenhua An, Mengtao Yan, Xiaoyi Lu, Yi Wang, Rongbo Bai, Haobu Gao, Xiaohang Fang, Han Li, Hao Sun, Zhi X. Chen

Main category: cs.LG

TL;DR: REALM is a rigorous benchmarking framework for neural surrogates on realistic multiphysics flows, exposing current models’ limitations through systematic evaluation of 12+ architectures on 11 challenging datasets.

Details

Motivation: Current neural surrogate evaluations rely on simplified, low-dimensional proxies that create an "illusion of mastery," failing to expose models' fragility in realistic multiphysics regimes where computational prediction is expensive and challenging due to severe coupling of multi-scale, heterogeneous processes.

Method: Developed REALM framework with 11 high-fidelity datasets spanning canonical multiphysics problems to complex propulsion and fire safety scenarios, plus standardized end-to-end training/evaluation protocol with multiphysics-aware preprocessing and robust rollout strategy. Systematically benchmarked over a dozen representative surrogate model families (spectral operators, convolutional models, Transformers, pointwise operators, graph/mesh networks).

Result: Identified three robust trends: (1) scaling barrier governed by dimensionality, stiffness, and mesh irregularity causing rapidly growing rollout errors; (2) performance primarily controlled by architectural inductive biases rather than parameter count; (3) persistent gap between nominal accuracy metrics and physically trustworthy behavior where high-correlation models still miss key transient structures and integral quantities.

Conclusion: REALM exposes the limits of current neural surrogates on realistic multiphysics flows and provides a rigorous testbed to drive development of next-generation physics-aware architectures.

Abstract: Predicting multiphysics dynamics is computationally expensive and challenging due to the severe coupling of multi-scale, heterogeneous physical processes. While neural surrogates promise a paradigm shift, the field currently suffers from an “illusion of mastery”, as repeatedly emphasized in top-tier commentaries: existing evaluations overly rely on simplified, low-dimensional proxies, which fail to expose the models’ inherent fragility in realistic regimes. To bridge this critical gap, we present REALM (REalistic AI Learning for Multiphysics), a rigorous benchmarking framework designed to test neural surrogates on challenging, application-driven reactive flows. REALM features 11 high-fidelity datasets spanning from canonical multiphysics problems to complex propulsion and fire safety scenarios, alongside a standardized end-to-end training and evaluation protocol that incorporates multiphysics-aware preprocessing and a robust rollout strategy. Using this framework, we systematically benchmark over a dozen representative surrogate model families, including spectral operators, convolutional models, Transformers, pointwise operators, and graph/mesh networks, and identify three robust trends: (i) a scaling barrier governed jointly by dimensionality, stiffness, and mesh irregularity, leading to rapidly growing rollout errors; (ii) performance primarily controlled by architectural inductive biases rather than parameter count; and (iii) a persistent gap between nominal accuracy metrics and physically trustworthy behavior, where models with high correlations still miss key transient structures and integral quantities. Taken together, REALM exposes the limits of current neural surrogates on realistic multiphysics flows and offers a rigorous testbed to drive the development of next-generation physics-aware architectures.

[532] EIA-SEC: Improved Actor-Critic Framework for Multi-UAV Collaborative Control in Smart Agriculture

Quanxi Zhou, Wencan Mao, Yilei Liang, Manabu Tsukada, Yunling Liu, Jon Crowcroft

Main category: cs.LG

TL;DR: Proposes EIA-SEC framework for multi-UAV trajectory planning in smart agriculture, combining elite imitation learning with shared ensemble critics to improve performance and stability.

Details

Motivation: Wireless communication enables smart agriculture with UAVs performing multiple tasks (data collection, imaging, communication). Need efficient multi-UAV trajectory planning for cooperative task execution.

Method: Models problem as Markov decision process. Proposes Elite Imitation Actor-Shared Ensemble Critic (EIA-SEC) framework where agents learn from elite agent to reduce trial-and-error, and uses shared ensemble critic with local critics for unbiased value estimation.

Result: EIA-SEC outperforms state-of-the-art baselines in reward performance, training stability, and convergence speed.

Conclusion: The proposed framework effectively solves multi-UAV trajectory planning in smart agriculture systems, demonstrating superior performance through elite imitation learning and ensemble critic architecture.

Abstract: The widespread application of wireless communication technology has promoted the development of smart agriculture, where unmanned aerial vehicles (UAVs) play a multifunctional role. We target a multi-UAV smart agriculture system where UAVs cooperatively perform data collection, image acquisition, and communication tasks. In this context, we model a Markov decision process to solve the multi-UAV trajectory planning problem. Moreover, we propose a novel Elite Imitation Actor-Shared Ensemble Critic (EIA-SEC) framework, where agents adaptively learn from the elite agent to reduce trial-and-error costs, and a shared ensemble critic collaborates with each agent’s local critic to ensure unbiased objective value estimates and prevent overestimation. Experimental results demonstrate that EIA-SEC outperforms state-of-the-art baselines in terms of reward performance, training stability, and convergence speed.

[533] Trajectory Planning for UAV-Based Smart Farming Using Imitation-Based Triple Deep Q-Learning

Wencan Mao, Quanxi Zhou, Tomas Couso Coddou, Manabu Tsukada, Yunling Liu, Yusheng Ji

Main category: cs.LG

TL;DR: The paper proposes an imitation-based triple deep Q-network (ITDQN) algorithm for UAV trajectory planning in smart agriculture, addressing challenges of environmental uncertainty, partial observations, and limited battery capacity through multi-agent reinforcement learning.

Details

Motivation: UAVs are promising for smart agriculture tasks like weed detection and sensor data collection, but trajectory planning is challenging due to environmental uncertainty, partial observations, and limited battery capacity of UAVs.

Method: Formulates trajectory planning as a Markov decision process (MDP) and uses multi-agent reinforcement learning (MARL). Proposes ITDQN algorithm with elite imitation mechanism to reduce exploration costs and mediator Q-network over DDQN to accelerate/stabilize training.

Result: Experimental results in simulated and real-world environments demonstrate effectiveness. ITDQN outperforms DDQN by 4.43% in weed recognition rate and 6.94% in data collection rate.

Conclusion: The proposed ITDQN algorithm effectively addresses UAV trajectory planning challenges in smart agriculture, improving performance over existing methods through innovative imitation mechanisms and network architecture improvements.

Abstract: Unmanned aerial vehicles (UAVs) have emerged as a promising auxiliary platform for smart agriculture, capable of simultaneously performing weed detection, recognition, and data collection from wireless sensors. However, trajectory planning for UAV-based smart agriculture is challenging due to the high uncertainty of the environment, partial observations, and limited battery capacity of UAVs. To address these issues, we formulate the trajectory planning problem as a Markov decision process (MDP) and leverage multi-agent reinforcement learning (MARL) to solve it. Furthermore, we propose a novel imitation-based triple deep Q-network (ITDQN) algorithm, which employs an elite imitation mechanism to reduce exploration costs and utilizes a mediator Q-network over a double deep Q-network (DDQN) to accelerate and stabilize training and improve performance. Experimental results in both simulated and real-world environments demonstrate the effectiveness of our solution. Moreover, our proposed ITDQN outperforms DDQN by 4.43% in weed recognition rate and 6.94% in data collection rate.

[534] The Interaction Bottleneck of Deep Neural Networks: Discovery, Proof, and Modulation

Huiqi Deng, Qihan Ren, Zhuofan Chen, Zhenyuan Cui, Wen Shen, Peng Zhang, Hongbin Pei, Quanshi Zhang

Main category: cs.LG

TL;DR: DNNs have a universal interaction bottleneck where they easily learn low-order and high-order interactions but consistently under-represent mid-order ones due to high contextual variability and gradient variance.

Details

Motivation: To understand how DNNs encode cooperative structures through interactions under different contextual complexity levels, and how these microscopic patterns shape macroscopic representation capacity.

Method: Use multi-order interactions to quantify contextual complexity, empirically analyze interaction patterns across architectures/tasks, theoretically explain the bottleneck via gradient variance analysis, and modulate the bottleneck with targeted loss functions.

Result: Discovered universal interaction bottleneck (mid-order under-representation), theoretically explained by high contextual variability, and showed that emphasizing different interaction orders leads to distinct macroscopic behaviors: low-order emphasis improves generalization/robustness, high-order emphasis enhances structural modeling/fitting.

Conclusion: Reveals inherent representational bias in DNNs and establishes interaction order as a powerful framework for interpreting and guiding deep representations, connecting microscopic interaction structures to macroscopic model behavior.

Abstract: Understanding what kinds of cooperative structures deep neural networks (DNNs) can represent remains a fundamental yet insufficiently understood problem. In this work, we treat interactions as the fundamental units of such structure and investigate a largely unexplored question: how DNNs encode interactions under different levels of contextual complexity, and how these microscopic interaction patterns shape macroscopic representation capacity. To quantify this complexity, we use multi-order interactions [57], where each order reflects the amount of contextual information required to evaluate the joint interaction utility of a variable pair. This formulation enables a stratified analysis of cooperative patterns learned by DNNs. Building on this formulation, we develop a comprehensive study of interaction structure in DNNs. (i) We empirically discover a universal interaction bottleneck: across architectures and tasks, DNNs easily learn low-order and high-order interactions but consistently under-represent mid-order ones. (ii) We theoretically explain this bottleneck by proving that mid-order interactions incur the highest contextual variability, yielding large gradient variance and making them intrinsically difficult to learn. (iii) We further modulate the bottleneck by introducing losses that steer models toward emphasizing interactions of selected orders. Finally, we connect microscopic interaction structures with macroscopic representational behavior: low-order-emphasized models exhibit stronger generalization and robustness, whereas high-order-emphasized models demonstrate greater structural modeling and fitting capability. Together, these results uncover an inherent representational bias in modern DNNs and establish interaction order as a powerful lens for interpreting and guiding deep representations.

[535] The Procrustean Bed of Time Series: The Optimization Bias of Point-wise Loss

Rongyao Cai, Yuxi Wan, Kexin Zhang, Ming Jin, Hao Wang, Zhiqiang Ge, Daoyi Dong, Yong Liu, Qingsong Wen

Main category: cs.LG

TL;DR: The paper provides theoretical analysis of Expectation of Optimization Bias (EOB) in time series modeling caused by point-wise loss functions, proposes a debiasing program via sequence length reduction and structural orthogonalization, and introduces a harmonized ℓp norm framework.

Details

Motivation: Point-wise loss functions (like MSE) rely on flawed i.i.d. assumptions that ignore causal temporal structure in time series, creating optimization bias that lacks formal theoretical grounding despite growing awareness of the problem.

Method: Formalizes EOB information-theoretically as discrepancy between true joint distribution and flawed i.i.d. counterpart. Derives closed-form quantification for non-deterministic EOB across linear/non-linear systems. Proposes debiasing program via sequence length reduction and structural orthogonalization using DFT/DWT, plus harmonized ℓp norm framework to fix gradient pathologies.

Result: Reveals paradox: more deterministic/structured time series suffer more severe bias from point-wise loss functions. Proves EOB is intrinsic data property governed by sequence length and proposed Structural Signal-to-Noise Ratio (SSNR). Extensive experiments validate EOB Theory’s generality and debiasing program’s superior performance.

Conclusion: Provides first-principles theoretical grounding for optimization bias in time series modeling, offers principled debiasing solution via structural orthogonalization and sequence length reduction, and demonstrates practical effectiveness through comprehensive validation.

Abstract: Optimizing time series models via point-wise loss functions (e.g., MSE) relying on a flawed point-wise independent and identically distributed (i.i.d.) assumption that disregards the causal temporal structure, an issue with growing awareness yet lacking formal theoretical grounding. Focusing on the core independence issue under covariance stationarity, this paper aims to provide a first-principles analysis of the Expectation of Optimization Bias (EOB), formalizing it information-theoretically as the discrepancy between the true joint distribution and its flawed i.i.d. counterpart. Our analysis reveals a fundamental paradigm paradox: the more deterministic and structured the time series, the more severe the bias by point-wise loss function. We derive the first closed-form quantification for the non-deterministic EOB across linear and non-linear systems, and prove EOB is an intrinsic data property, governed exclusively by sequence length and our proposed Structural Signal-to-Noise Ratio (SSNR). This theoretical diagnosis motivates our principled debiasing program that eliminates the bias through sequence length reduction and structural orthogonalization. We present a concrete solution that simultaneously achieves both principles via DFT or DWT. Furthermore, a novel harmonized $\ell_p$ norm framework is proposed to rectify gradient pathologies of high-variance series. Extensive experiments validate EOB Theory’s generality and the superior performance of debiasing program.

[536] ARC: Leveraging Compositional Representations for Cross-Problem Learning on VRPs

Han-Seul Jeong, Youngjoon Park, Hyungseok Song, Woohyung Lim

Main category: cs.LG

TL;DR: ARC is a cross-problem learning framework for Vehicle Routing Problems that learns disentangled attribute representations through compositional learning, enabling better generalization across problem variants.

Details

Motivation: Real-world VRPs have diverse attributes, and there's a need for learning approaches that can efficiently generalize across different problem variants without retraining for each combination.

Method: ARC decomposes attribute representations into two components: Intrinsic Attribute Embedding (IAE) for invariant semantics and Contextual Interaction Embedding (CIE) for attribute-combination effects. It enforces analogical consistency to ensure semantic transformations remain invariant across contexts.

Result: ARC achieves state-of-the-art performance across in-distribution, zero-shot generalization, few-shot adaptation, and real-world benchmarks.

Conclusion: The disentangled attribute representation approach enables effective cross-problem learning, allowing models to reuse invariant semantics across trained variants and construct representations for unseen attribute combinations.

Abstract: Vehicle Routing Problems (VRPs) with diverse real-world attributes have driven recent interest in cross-problem learning approaches that efficiently generalize across problem variants. We propose ARC (Attribute Representation via Compositional Learning), a cross-problem learning framework that learns disentangled attribute representations by decomposing them into two complementary components: an Intrinsic Attribute Embedding (IAE) for invariant attribute semantics and a Contextual Interaction Embedding (CIE) for attribute-combination effects. This disentanglement is achieved by enforcing analogical consistency in the embedding space to ensure the semantic transformation of adding an attribute (e.g., a length constraint) remains invariant across different problem contexts. This enables our model to reuse invariant semantics across trained variants and construct representations for unseen combinations. ARC achieves state-of-the-art performance across in-distribution, zero-shot generalization, few-shot adaptation, and real-world benchmarks.

[537] From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers

Ryotaro Kawata, Yujin Song, Alberto Bietti, Naoki Nishikawa, Taiji Suzuki, Samuel Vaiter, Denny Wu

Main category: cs.LG

TL;DR: Transformers learn either generalizable induction heads or positional shortcuts depending on pretraining data diversity, with a transition threshold based on trigger-to-trigger distance ratios.

Details

Motivation: To understand how pretraining data distribution steers shallow transformers toward generalizable algorithmic behaviors (like induction heads) versus simple positional shortcuts, and to provide guidelines for data-driven control of learned behaviors.

Method: Theoretical analysis of gradient-based training for single-layer transformers on a minimal trigger-output prediction task (copying token after trigger’s second occurrence). Analysis covers both infinite and finite sample regimes, using a “max-sum” ratio of trigger-to-trigger distances as diversity measure.

Result: Proves a transition: low max-sum ratio (sufficient sequence diversity) leads to induction heads with OOD generalization; high ratio leads to positional shortcuts without OOD generalization. Reveals trade-off between pretraining context length and OOD generalization, derives optimal pretraining distribution for computational efficiency.

Conclusion: Pretraining data diversity critically determines whether transformers learn generalizable algorithms or positional shortcuts, with theoretical thresholds guiding data design for desired behaviors. Synthetic experiments validate predictions.

Abstract: Transformers can implement both generalizable algorithms (e.g., induction heads) and simple positional shortcuts (e.g., memorizing fixed output positions). In this work, we study how the choice of pretraining data distribution steers a shallow transformer toward one behavior or the other. Focusing on a minimal trigger-output prediction task – copying the token immediately following a special trigger upon its second occurrence – we present a rigorous analysis of gradient-based training of a single-layer transformer. In both the infinite and finite sample regimes, we prove a transition in the learned mechanism: if input sequences exhibit sufficient diversity, measured by a low ``max-sum’’ ratio of trigger-to-trigger distances, the trained model implements an induction head and generalizes to unseen contexts; by contrast, when this ratio is large, the model resorts to a positional shortcut and fails to generalize out-of-distribution (OOD). We also reveal a trade-off between the pretraining context length and OOD generalization, and derive the optimal pretraining distribution that minimizes computational cost per sample. Finally, we validate our theoretical predictions with controlled synthetic experiments, demonstrating that broadening context distributions robustly induces induction heads and enables OOD generalization. Our results shed light on the algorithmic biases of pretrained transformers and offer conceptual guidelines for data-driven control of their learned behaviors.

[538] Demonstration-Guided Continual Reinforcement Learning in Dynamic Environments

Xue Yang, Michael Schukat, Junlin Lu, Patrick Mannion, Karl Mason, Enda Howley

Main category: cs.LG

TL;DR: DGCRL uses a self-evolving demonstration repository to guide RL exploration in continual learning, balancing stability-plasticity through curriculum-based knowledge transfer.

Details

Motivation: RL struggles in dynamic environments where MDPs evolve. Continual RL faces stability-plasticity dilemma - existing methods don't directly use past knowledge to guide behavior, hindering knowledge reuse and efficient learning.

Method: DGCRL stores prior knowledge in external self-evolving demonstration repository that directly guides RL exploration. For each task, agent dynamically selects most relevant demonstration and follows curriculum-based strategy (gradual shift from demonstration-guided to self-exploration).

Result: Superior average performance, enhanced knowledge transfer, mitigation of forgetting, and training efficiency on 2D navigation and MuJoCo locomotion tasks. Sensitivity analysis and ablation study validate effectiveness.

Conclusion: Demonstration-guided approach effectively addresses stability-plasticity dilemma in continual RL by directly reusing past knowledge to guide behavior, improving learning efficiency and performance.

Abstract: Reinforcement learning (RL) excels in various applications but struggles in dynamic environments where the underlying Markov decision process evolves. Continual reinforcement learning (CRL) enables RL agents to continually learn and adapt to new tasks, but balancing stability (preserving prior knowledge) and plasticity (acquiring new knowledge) remains challenging. Existing methods primarily address the stability-plasticity dilemma through mechanisms where past knowledge influences optimization but rarely affects the agent’s behavior directly, which may hinder effective knowledge reuse and efficient learning. In contrast, we propose demonstration-guided continual reinforcement learning (DGCRL), which stores prior knowledge in an external, self-evolving demonstration repository that directly guides RL exploration and adaptation. For each task, the agent dynamically selects the most relevant demonstration and follows a curriculum-based strategy to accelerate learning, gradually shifting from demonstration-guided exploration to fully self-exploration. Extensive experiments on 2D navigation and MuJoCo locomotion tasks demonstrate its superior average performance, enhanced knowledge transfer, mitigation of forgetting, and training efficiency. The additional sensitivity analysis and ablation study further validate its effectiveness.

[539] Improving Pattern Recognition of Scheduling Anomalies through Structure-Aware and Semantically-Enhanced Graphs

Ning Lyu, Junjie Jiang, Lu Chang, Chihui Shao, Feng Chen, Chong Zhang

Main category: cs.LG

TL;DR: Proposes structure-aware scheduling graph modeling method for anomaly detection in complex systems, combining structure-guided graph construction with multi-scale semantic aggregation to improve anomaly identification accuracy.

Details

Motivation: Need to improve accuracy and representation capability for anomaly identification in scheduling behaviors of complex systems, especially for capturing abnormal features in complex scenarios like multi-task concurrency, resource competition, and stage transitions.

Method: 1) Structure-guided scheduling graph construction integrating task execution stages, resource node states, and scheduling path information; 2) Multi-scale graph semantic aggregation module with local adjacency semantic integration and global topology alignment for semantic consistency modeling.

Result: Model shows significant performance advantages across multiple metrics, sensitive response to structural disturbances and semantic shifts. Visualization reveals stronger anomaly separability and pattern representation under structure guidance and semantic aggregation.

Conclusion: The method is effective and adaptable for scheduling anomaly detection tasks, demonstrating improved anomaly identification through structure-aware graph modeling and semantic aggregation techniques.

Abstract: This paper proposes a structure-aware driven scheduling graph modeling method to improve the accuracy and representation capability of anomaly identification in scheduling behaviors of complex systems. The method first designs a structure-guided scheduling graph construction mechanism that integrates task execution stages, resource node states, and scheduling path information to build dynamically evolving scheduling behavior graphs, enhancing the model’s ability to capture global scheduling relationships. On this basis, a multi-scale graph semantic aggregation module is introduced to achieve semantic consistency modeling of scheduling features through local adjacency semantic integration and global topology alignment, thereby strengthening the model’s capability to capture abnormal features in complex scenarios such as multi-task concurrency, resource competition, and stage transitions. Experiments are conducted on a real scheduling dataset with multiple scheduling disturbance paths set to simulate different types of anomalies, including structural shifts, resource changes, and task delays. The proposed model demonstrates significant performance advantages across multiple metrics, showing a sensitive response to structural disturbances and semantic shifts. Further visualization analysis reveals that, under the combined effect of structure guidance and semantic aggregation, the scheduling behavior graph exhibits stronger anomaly separability and pattern representation, validating the effectiveness and adaptability of the method in scheduling anomaly detection tasks.

[540] Fusion of Multiscale Features Via Centralized Sparse-attention Network for EEG Decoding

Xiangrui Cai, Shaocheng Ma, Lei Cao, Jie Li, Tianyu Liu, Yilin Dong

Main category: cs.LG

TL;DR: EEG-CSANet: A centralized sparse-attention network with multi-branch architecture for EEG signal decoding, achieving SOTA performance across five public datasets.

Details

Motivation: To address the inherent spatiotemporal heterogeneity of EEG signals, which makes accurate decoding challenging for brain-machine interfaces and intelligent interaction systems.

Method: Proposes a multi-branch parallel architecture with independent spatial feature extraction modules for each temporal scale, enhanced by a centralized sparse-attention network (EEG-CSANet) using main-auxiliary branch design: main branch models core spatiotemporal patterns via multiscale self-attention, auxiliary branch facilitates efficient local interactions through sparse cross-attention.

Result: Achieves state-of-the-art performance across five public datasets: BCIC-IV-2A (88.54%), BCIC-IV-2B (91.09%), HGD (99.43%), SEED (96.03%), and SEED-VIG (90.56%). Demonstrates strong adaptability and robustness across various EEG decoding tasks.

Conclusion: EEG-CSANet shows strong performance and interpretability through ablation studies, and could serve as a promising baseline model for EEG signal decoding. Source code is publicly available.

Abstract: Electroencephalography (EEG) signal decoding is a key technology that translates brain activity into executable commands, laying the foundation for direct brain-machine interfacing and intelligent interaction. To address the inherent spatiotemporal heterogeneity of EEG signals, this paper proposes a multi-branch parallel architecture, where each temporal scale is equipped with an independent spatial feature extraction module. To further enhance multi-branch feature fusion, we propose a Fusion of Multiscale Features via Centralized Sparse-attention Network (EEG-CSANet), a centralized sparse-attention network. It employs a main-auxiliary branch architecture, where the main branch models core spatiotemporal patterns via multiscale self-attention, and the auxiliary branch facilitates efficient local interactions through sparse cross-attention. Experimental results show that EEG-CSANet achieves state-of-the-art (SOTA) performance across five public datasets (BCIC-IV-2A, BCIC-IV-2B, HGD, SEED, and SEED-VIG), with accuracies of 88.54%, 91.09%, 99.43%, 96.03%, and 90.56%, respectively. Such performance demonstrates its strong adaptability and robustness across various EEG decoding tasks. Moreover, extensive ablation studies are conducted to enhance the interpretability of EEG-CSANet. In the future, we hope that EEG-CSANet could serve as a promising baseline model in the field of EEG signal decoding. The source code is publicly available at: https://github.com/Xiangrui-Cai/EEG-CSANet

[541] Generating Risky Samples with Conformity Constraints via Diffusion Models

Han Yu, Hao Zou, Xingxuan Zhang, Zhengyi Wang, Yue He, Kehan Li, Peng Cui

Main category: cs.LG

TL;DR: RiskyDiff: A diffusion-based method that generates diverse risky samples with strong category conformity using text/image embeddings and conformity scoring.

Details

Motivation: Existing methods for discovering risky samples are limited by dataset coverage, and recent diffusion-based approaches struggle with category conformity, introducing label noise that limits application effectiveness.

Method: Proposes RiskyDiff that uses text and image embeddings as implicit constraints for category conformity, plus a conformity score for explicit strengthening. Also includes embedding screening and risky gradient guidance to boost risk.

Result: RiskyDiff greatly outperforms existing methods in risk degree, generation quality, and category conformity. Generated samples with high conformity can enhance model generalization when used for data augmentation.

Conclusion: RiskyDiff effectively addresses the category conformity problem in risky sample generation, enabling better discovery of diverse risky samples and improving model robustness through data augmentation.

Abstract: Although neural networks achieve promising performance in many tasks, they may still fail when encountering some examples and bring about risks to applications. To discover risky samples, previous literature attempts to search for patterns of risky samples within existing datasets or inject perturbation into them. Yet in this way the diversity of risky samples is limited by the coverage of existing datasets. To overcome this limitation, recent works adopt diffusion models to produce new risky samples beyond the coverage of existing datasets. However, these methods struggle in the conformity between generated samples and expected categories, which could introduce label noise and severely limit their effectiveness in applications. To address this issue, we propose RiskyDiff that incorporates the embeddings of both texts and images as implicit constraints of category conformity. We also design a conformity score to further explicitly strengthen the category conformity, as well as introduce the mechanisms of embedding screening and risky gradient guidance to boost the risk of generated samples. Extensive experiments reveal that RiskyDiff greatly outperforms existing methods in terms of the degree of risk, generation quality, and conformity with conditioned categories. We also empirically show the generalization ability of the models can be enhanced by augmenting training data with generated samples of high conformity.

[542] ML Inference Scheduling with Predictable Latency

Haidong Zhao, Nikolaos Georgantas

Main category: cs.LG

TL;DR: The paper identifies limitations in existing ML inference scheduling approaches that fail to account for GPU interference effects, proposes to evaluate these limitations, and outlines ongoing work toward more efficient scheduling.

Details

Motivation: Existing ML inference scheduling systems struggle with GPU interference effects that compromise SLO/deadline satisfaction when improving GPU utilization through concurrent task scheduling. Current interference prediction approaches are inadequate due to coarse granularity and static models.

Method: The paper evaluates limitations of existing interference prediction approaches and outlines ongoing work toward achieving efficient ML inference scheduling (specific methods not detailed in abstract).

Result: Not specified in abstract - this appears to be a work-in-progress paper outlining research direction rather than presenting completed results.

Conclusion: Current interference prediction approaches for ML inference scheduling have significant limitations (coarse granularity, static models), and the authors are working on improved solutions to enable efficient scheduling while meeting SLOs/deadlines.

Abstract: Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may compromise latency-sensitive scheduling, as concurrent tasks contend for GPU resources and thereby introduce interference. Given that interference effects introduce unpredictability in scheduling, neglecting them may compromise SLO or deadline satisfaction. Nevertheless, existing interference prediction approaches remain limited in several respects, which may restrict their usefulness for scheduling. First, they are often coarse-grained, which ignores runtime co-location dynamics and thus restricts their accuracy in interference prediction. Second, they tend to use a static prediction model, which may not effectively cope with different workload characteristics. To this end, we evaluate the potential limitations of existing interference prediction approaches and outline our ongoing work toward achieving efficient ML inference scheduling.

[543] A Theoretical Lens for RL-Tuned Language Models via Energy-Based Models

Zhiquan Tan, Yinrong Hong

Main category: cs.LG

TL;DR: The paper provides a theoretical analysis of LLMs trained via KL-regularized RL, showing they have energy-based model structure and proving convergence properties for instruction-tuned and reasoning models.

Details

Motivation: While LLMs trained with KL-regularized RL show strong performance in instruction following, self-correction, and reasoning, there is limited theoretical understanding of why they work so well. The paper aims to provide a unified variational analysis to explain their theoretical underpinnings.

Method: The authors exploit the closed-form energy-based model structure of optimal KL-regularized policies. They use variational analysis to prove theoretical properties: for instruction-tuned models, they prove detailed balance and convergence properties; for reasoning models (RLVR), they show equivalence to expected KL minimization.

Result: For instruction-tuned models: proved transition kernel satisfies detailed balance, monotonic KL convergence to high-quality stationary distribution, bounded hitting times to superior states, and exponential mixing governed by spectral gap. For reasoning models: showed objective is equivalent to expected KL minimization toward optimal reasoning distribution, with suboptimality gap reducing to Bernoulli KL between target and current accuracies.

Conclusion: The theoretical analysis provides mathematical foundations explaining why KL-regularized RL works well for LLMs, offering insights into convergence properties, mixing rates, and the entropy-accuracy trade-off observed empirically in reasoning models.

Abstract: Large language models (LLMs) trained via KL-regularized reinforcement learning demonstrate strong instruction following, self-correction, and reasoning abilities. Yet their theoretical underpinnings remain limited. We exploit the closed-form energy-based model (EBM) structure of the optimal KL-regularized policy to provide a unified variational analysis of LLMs. For instruction-tuned models, under natural assumptions on reward potentials and pretraining symmetry, we prove that the transition kernel satisfies detailed balance with respect to a scalar potential encoding response quality. This yields monotonic KL convergence to a high-quality stationary distribution, bounded hitting times to superior states, and exponential mixing governed by the spectral gap. For reasoning models trained with verifiable rewards (RLVR), we show the objective is equivalent to expected KL minimization toward an optimal reasoning distribution, with the suboptimality gap reducing to the Bernoulli KL between target and current accuracies along the natural gradient flow. This helps explain empirical entropy-accuracy trade-offs.

[544] Is Your Conditional Diffusion Model Actually Denoising?

Daniel Pfrommer, Zehao Dou, Christopher Scarvelis, Max Simchowitz, Ali Jadbabaie

Main category: cs.LG

TL;DR: Diffusion models with conditioning variables (like text-to-image or observation-to-control) systematically deviate from their theoretical denoising process during conditional generation, causing disagreement between sampling algorithms regardless of model capacity or training data.

Details

Motivation: The paper investigates why diffusion models with conditioning variables (widely used in text-to-image generation and continuous control) don't behave according to their theoretical formulation during conditional generation, leading to inconsistencies between different sampling algorithms.

Method: Introduces “Schedule Deviation” - a rigorous measure to quantify the rate of deviation from standard denoising processes. Provides methodology to compute this deviation and analyzes the phenomenon theoretically through the lens of inductive biases towards smoothness.

Result: Shows that deviation from idealized denoising occurs irrespective of model capacity or training data. Demonstrates that this phenomenon arises from difficulty in bridging distinct denoising flows across different parts of the conditioning space, driven by an inductive bias towards smoothness.

Conclusion: Conditional diffusion models inherently deviate from their theoretical denoising formulation due to fundamental challenges in connecting different denoising flows across conditioning space, with smoothness inductive biases explaining this systematic deviation phenomenon.

Abstract: We study the inductive biases of diffusion models with a conditioning-variable, which have seen widespread application as both text-conditioned generative image models and observation-conditioned continuous control policies. We observe that when these models are queried conditionally, their generations consistently deviate from the idealized “denoising” process upon which diffusion models are formulated, inducing disagreement between popular sampling algorithms (e.g. DDPM, DDIM). We introduce Schedule Deviation, a rigorous measure which captures the rate of deviation from a standard denoising process, and provide a methodology to compute it. Crucially, we demonstrate that the deviation from an idealized denoising process occurs irrespective of the model capacity or amount of training data. We posit that this phenomenon occurs due to the difficulty of bridging distinct denoising flows across different parts of the conditioning space and show theoretically how such a phenomenon can arise through an inductive bias towards smoothness.

[545] PIPCFR: Pseudo-outcome Imputation with Post-treatment Variables for Individual Treatment Effect Estimation

Zichuan Lin, Xiaokai Huang, Jiate Liu, Yuxuan Han, Jia Chen, Xiapeng Wu, Deheng Ye

Main category: cs.LG

TL;DR: PIPCFR improves individual treatment effect estimation by incorporating post-treatment variables into pseudo-outcome imputation, addressing a gap in existing methods that overlook these variables.

Details

Motivation: Existing ITE estimation methods overlook the impact of post-treatment variables on outcomes, preventing full capture of outcome variability and increasing variance in counterfactual predictions.

Method: PIPCFR incorporates post-treatment variables into pseudo-outcome imputation, learns effective representations that preserve informative components while mitigating bias, and establishes a novel theoretical bound connecting post-treatment variables to ITE estimation accuracy.

Result: Empirical evaluations on real-world and simulated datasets show PIPCFR achieves significantly lower ITE errors compared to existing methods.

Conclusion: Incorporating post-treatment variables through the PIPCFR framework improves ITE estimation accuracy by better capturing outcome variability and reducing counterfactual prediction variance.

Abstract: The estimation of individual treatment effects (ITE) focuses on predicting the outcome changes that result from a change in treatment. A fundamental challenge in observational data is that while we need to infer outcome differences under alternative treatments, we can only observe each individual’s outcome under a single treatment. Existing approaches address this limitation either by training with inferred pseudo-outcomes or by creating matched instance pairs. However, recent work has largely overlooked the potential impact of post-treatment variables on the outcome. This oversight prevents existing methods from fully capturing outcome variability, resulting in increased variance in counterfactual predictions. This paper introduces Pseudo-outcome Imputation with Post-treatment Variables for Counterfactual Regression (PIPCFR), a novel approach that incorporates post-treatment variables to improve pseudo-outcome imputation. We analyze the challenges inherent in utilizing post-treatment variables and establish a novel theoretical bound for ITE risk that explicitly connects post-treatment variables to ITE estimation accuracy. Unlike existing methods that ignore these variables or impose restrictive assumptions, PIPCFR learns effective representations that preserve informative components while mitigating bias. Empirical evaluations on both real-world and simulated datasets demonstrate that PIPCFR achieves significantly lower ITE errors compared to existing methods.

[546] Gaussian-Mixture-Model Q-Functions for Policy Iteration in Reinforcement Learning

Minh Vu, Konstantinos Slavakis

Main category: cs.LG

TL;DR: Gaussian mixture models (GMMs) are repurposed as direct function approximators for Q-functions in reinforcement learning, called GMM-QFs, which serve as universal approximators and are optimized on Riemannian manifolds within Bellman residuals.

Details

Motivation: The paper aims to move beyond conventional uses of GMMs as probability density estimators in RL and instead leverage their representational capacity as direct function approximators for Q-functions, offering a more efficient alternative to deep learning methods that require experience data.

Method: Introduces GMM-QFs as parametric models with learnable mixing weights, Gaussian mean vectors, and covariance matrices. These are embedded within Bellman residuals and optimized on Riemannian manifolds, incorporating Riemannian optimization into policy-evaluation steps of standard policy-iteration frameworks.

Result: Theoretical results establish GMM-QFs as universal approximators. Numerical tests show competitive performance, sometimes outperforming state-of-the-art approaches across benchmark RL tasks, while maintaining significantly smaller computational footprint than deep-learning methods, even without access to experience data.

Conclusion: GMM-QFs provide an effective, computationally efficient alternative to deep learning methods for Q-function approximation in RL, demonstrating strong performance without requiring experience data while offering theoretical guarantees as universal approximators.

Abstract: Unlike their conventional use as estimators of probability density functions in reinforcement learning (RL), this paper introduces a novel function-approximation role for Gaussian mixture models (GMMs) as direct surrogates for Q-function losses. These parametric models, termed GMM-QFs, possess substantial representational capacity, as they are shown to be universal approximators over a broad class of functions. They are further embedded within Bellman residuals, where their learnable parameters – a fixed number of mixing weights, together with Gaussian mean vectors and covariance matrices – are inferred from data via optimization on a Riemannian manifold. This geometric perspective on the parameter space naturally incorporates Riemannian optimization into the policy-evaluation step of standard policy-iteration frameworks. Rigorous theoretical results are established, and supporting numerical tests show that, even without access to experience data, GMM-QFs deliver competitive performance and, in some cases, outperform state-of-the-art approaches across a range of benchmark RL tasks, all while maintaining a significantly smaller computational footprint than deep-learning methods that rely on experience data.

[547] Label-Informed Outlier Detection Based on Granule Density

Baiyang Chen, Zhong Yuan, Dezhong Peng, Hongmei Chen, Xiaomin Song, Huiming Zheng

Main category: cs.LG

TL;DR: GDOF is a novel outlier detection method using Granular Computing and Fuzzy Sets that handles heterogeneous data with minimal labeled outliers through label-informed fuzzy granulation and granule density estimation.

Details

Motivation: Existing semi-supervised outlier detection methods treat data as purely numerical and deterministic, ignoring the heterogeneity and uncertainty in real-world datasets. There's a need for methods that can handle diverse data types while requiring minimal labeled outliers.

Method: GDOF uses label-informed fuzzy granulation to represent various data types, develops granule density for precise density estimation, and integrates granule densities from individual attributes by assessing attribute relevance with limited labeled outliers.

Result: Experimental results on real-world datasets show GDOF excels at detecting outliers in heterogeneous data with minimal labeled outliers, outperforming existing methods.

Conclusion: The integration of Fuzzy Sets and Granular Computing in GDOF provides a practical framework for outlier detection in complex, diverse data types, addressing limitations of traditional methods.

Abstract: Outlier detection, crucial for identifying unusual patterns with significant implications across numerous applications, has drawn considerable research interest. Existing semi-supervised methods typically treat data as purely numerical and} in a deterministic manner, thereby neglecting the heterogeneity and uncertainty inherent in complex, real-world datasets. This paper introduces a label-informed outlier detection method for heterogeneous data based on Granular Computing and Fuzzy Sets, namely Granule Density-based Outlier Factor (GDOF). Specifically, GDOF first employs label-informed fuzzy granulation to effectively represent various data types and develops granule density for precise density estimation. Subsequently, granule densities from individual attributes are integrated for outlier scoring by assessing attribute relevance with a limited number of labeled outliers. Experimental results on various real-world datasets show that GDOF stands out in detecting outliers in heterogeneous data with a minimal number of labeled outliers. The integration of Fuzzy Sets and Granular Computing in GDOF offers a practical framework for outlier detection in complex and diverse data types. All relevant datasets and source codes are publicly available for further research. This is the author’s accepted manuscript of a paper published in IEEE Transactions on Fuzzy Systems. The final version is available at https://doi.org/10.1109/TFUZZ.2024.3514853

[548] Controllable Probabilistic Forecasting with Stochastic Decomposition Layers

John S. Schreck, William E. Chapman, Charlie Becker, David John Gagne, Dhamma Kimpara, Nihanth Cherukuru, Judith Berner, Kirsten J. Mayer, Negin Sobhani

Main category: cs.LG

TL;DR: SDL converts deterministic ML weather models into probabilistic ensembles using hierarchical noise injection, achieving well-calibrated forecasts with minimal computational cost and interpretable uncertainty.

Details

Motivation: Current CRPS ensemble methods for AI weather prediction use global noise injection via conditional normalization, which increases training costs and lacks physical interpretability of stochastic perturbations.

Method: Introduces Stochastic Decomposition Layers (SDL) adapted from StyleGAN’s hierarchical noise injection. SDL applies learned perturbations at three decoder scales through latent-driven modulation, per-pixel noise, and channel scaling, enabling transfer learning with minimal computational overhead.

Result: SDL applied to WXFormer requires <2% of baseline training cost, generates ensembles from compact latent tensors (5 MB), achieves spread-skill ratios near unity, progressively flattening rank histograms, and competitive calibration with operational IFS-ENS. Multi-scale experiments reveal hierarchical uncertainty patterns.

Conclusion: SDL provides an efficient, interpretable approach for probabilistic weather forecasting with explicit latent parameterization that enables reproducible ensembles, post-inference adjustment, and hierarchical uncertainty quantification for operational and climate applications.

Abstract: AI weather prediction ensembles with latent noise injection and optimized with the continuous ranked probability score (CRPS) have produced both accurate and well-calibrated predictions with far less computational cost compared with diffusion-based methods. However, current CRPS ensemble approaches vary in their training strategies and noise injection mechanisms, with most injecting noise globally throughout the network via conditional normalization. This structure increases training expense and limits the physical interpretability of the stochastic perturbations. We introduce Stochastic Decomposition Layers (SDL) for converting deterministic machine learning weather models into probabilistic ensemble systems. Adapted from StyleGAN’s hierarchical noise injection, SDL applies learned perturbations at three decoder scales through latent-driven modulation, per-pixel noise, and channel scaling. When applied to WXFormer via transfer learning, SDL requires less than 2% of the computational cost needed to train the baseline model. Each ensemble member is generated from a compact latent tensor (5 MB), enabling perfect reproducibility and post-inference spread adjustment through latent rescaling. Evaluation on 2022 ERA5 reanalysis shows ensembles with spread-skill ratios approaching unity and rank histograms that progressively flatten toward uniformity through medium-range forecasts, achieving calibration competitive with operational IFS-ENS. Multi-scale experiments reveal hierarchical uncertainty: coarse layers modulate synoptic patterns while fine layers control mesoscale variability. The explicit latent parameterization provides interpretable uncertainty quantification for operational forecasting and climate applications.

[549] Hyperbolic Graph Embeddings: a Survey and an Evaluation on Anomaly Detection

Souhail Abdelmouaiz Sadat, Mohamed Yacine Touahria Miliani, Khadidja Hab El Hames, Hamida Seba, Mohammed Haddad

Main category: cs.LG

TL;DR: This survey reviews hyperbolic graph embedding models for anomaly detection, showing they outperform Euclidean methods on complex graph structures.

Details

Motivation: Euclidean graph embedding methods struggle to capture complex hierarchical and structural patterns in graph data, limiting their effectiveness for anomaly detection tasks. Hyperbolic spaces offer better geometric properties for representing complex graph structures.

Method: The paper conducts a comprehensive survey and evaluation of hyperbolic graph embedding models including HGCAE, P-VAE, and HGCN, comparing them against Euclidean baselines like DOMINANT and GraphSage on anomaly detection tasks across datasets like Elliptic and Cora.

Result: Hyperbolic models significantly outperform Euclidean methods: P-VAE achieves 94% F1-score on Elliptic dataset, HGCAE scores 80% on Cora dataset. Euclidean methods struggle with complex data structures.

Conclusion: Hyperbolic spaces show strong potential for improving graph anomaly detection by better capturing complex structural patterns. The authors provide an open-source library to facilitate further research in this promising direction.

Abstract: This survey reviews hyperbolic graph embedding models, and evaluate them on anomaly detection, highlighting their advantages over Euclidean methods in capturing complex structures. Evaluating models like \textit{HGCAE}, \textit{(\mathcal{P})-VAE}, and \textit{HGCN} demonstrates high performance, with \textit{(\mathcal{P})-VAE} achieving an F1-score of 94% on the \textit{Elliptic} dataset and \textit{HGCAE} scoring 80% on \textit{Cora}. In contrast, Euclidean methods like \textit{DOMINANT} and \textit{GraphSage} struggle with complex data. The study emphasizes the potential of hyperbolic spaces for improving anomaly detection, and provides an open-source library to foster further research in this field.

[550] Generative Modeling through Spectral Analysis of Koopman Operator

Yuanchao Xu, Fengyi Li, Masahiro Fujisawa, Youssef Marzouk, Isao Ishikawa

Main category: cs.LG

TL;DR: KSWGD is a generative modeling framework combining Koopman operator spectral analysis with optimal transport for accelerated Wasserstein gradient descent using trajectory data, eliminating need for target potential knowledge or neural network training.

Details

Motivation: The paper aims to develop a more efficient generative modeling approach that can leverage spectral structure from trajectory data to accelerate Wasserstein gradient descent, avoiding the need for explicit target potential knowledge or complex neural network training.

Method: Koopman Spectral Wasserstein Gradient Descent (KSWGD) combines operator-theoretic spectral analysis with optimal transport. It estimates spectral structure from trajectory data via Koopman operator approximation, enabling accelerated Wasserstein gradient descent without requiring explicit target potential knowledge or neural network training.

Result: The method demonstrates faster convergence than existing methods across diverse settings including compact manifold sampling, metastable multi-well systems, image generation, and high-dimensional stochastic partial differential equations, while maintaining high sample quality.

Conclusion: KSWGD provides an effective generative modeling framework that leverages Koopman operator spectral analysis to accelerate Wasserstein gradient descent, with rigorous theoretical foundations and strong empirical performance across various challenging domains.

Abstract: We propose Koopman Spectral Wasserstein Gradient Descent (KSWGD), a generative modeling framework that combines operator-theoretic spectral analysis with optimal transport. The novel insight is that the spectral structure required for accelerated Wasserstein gradient descent can be directly estimated from trajectory data via Koopman operator approximation which can eliminate the need for explicit knowledge of the target potential or neural network training. We provide rigorous convergence analysis and establish connection to Feynman-Kac theory that clarifies the method’s probabilistic foundation. Experiments across diverse settings, including compact manifold sampling, metastable multi-well systems, image generation, and high dimensional stochastic partial differential equation, demonstrate that KSWGD consistently achieves faster convergence than other existing methods while maintaining high sample quality.

[551] Merging of Kolmogorov-Arnold networks trained on disjoint datasets

Andrew Polar, Michael Poluektov

Main category: cs.LG

TL;DR: Training on disjoint datasets with KANs using Newton-Kaczmarz method and piecewise-linear basis achieves faster training than federated learning approaches.

Details

Motivation: While federated learning with KANs focuses on distributed convergence, this paper targets training acceleration through optimal optimization methods and basis functions.

Method: Uses Kolmogorov-Arnold networks (KANs) trained on disjoint datasets with Newton-Kaczmarz optimization method and piecewise-linear basis functions.

Result: Disjoint dataset training further improves performance beyond the fastest current combination of Newton-Kaczmarz and piecewise-linear basis functions.

Conclusion: Training on disjoint datasets provides additional acceleration benefits for KANs, complementing existing federated learning advantages.

Abstract: Training on disjoint datasets can serve two primary goals: accelerating data processing and enabling federated learning. It has already been established that Kolmogorov-Arnold networks (KANs) are particularly well suited for federated learning and can be merged through simple parameter averaging. While the federated learning literature has mostly focused on achieving training convergence across distributed nodes, the present paper specifically targets acceleration of the training, which depends critically on the choice of an optimisation method and the type of the basis functions. To the best knowledge of the authors, the fastest currently-available combination is the Newton-Kaczmarz method and the piecewise-linear basis functions. Here, it is shown that training on disjoint datasets (or disjoint subsets of the training dataset) can further improve the performance. Experimental comparisons are provided, and all corresponding codes are publicly available.

[552] The Ensemble Schr{ö}dinger Bridge filter for Nonlinear Data Assimilation

Feng Bao, Hui Sun

Main category: cs.LG

TL;DR: Proposes Ensemble Schrödinger Bridge nonlinear filter combining prediction with diffusion generative modeling, achieving derivative-free, training-free filtering with good performance in high-dimensional nonlinear systems.

Details

Motivation: To develop a nonlinear optimal filter that addresses limitations of existing methods like ensemble Kalman filters and particle filters, particularly for highly nonlinear dynamics in high-dimensional chaotic environments.

Method: Combines standard prediction procedure with diffusion generative modeling for analysis step. The Ensemble Schrödinger Bridge filter is derivative-free, training-free, highly parallelizable, and has no structural model error.

Result: The filter performs well with highly nonlinear dynamics in dimensions up to 40+ under chaotic conditions. Outperforms ensemble Kalman filters and particle filters across various nonlinearity levels in numerous tests.

Conclusion: The Ensemble Schrödinger Bridge nonlinear filter is effective for high-dimensional nonlinear filtering problems. Future work will extend it to meteorological applications and establish rigorous convergence analysis.

Abstract: This work puts forward a novel nonlinear optimal filter namely the Ensemble Schr{ö}dinger Bridge nonlinear filter. The proposed filter finds marriage of the standard prediction procedure and the diffusion generative modeling for the analysis procedure to realize one filtering step. The designed approach finds no structural model error, and it is derivative free, training free and highly parallizable. Experimental results show that the designed algorithm performs well given highly nonlinear dynamics in (mildly) high dimension up to 40 or above under a chaotic environment. It also shows better performance than classical methods such as the ensemble Kalman filter and the Particle filter in numerous tests given different level of nonlinearity. Future work will focus on extending the proposed approach to practical meteorological applications and establishing a rigorous convergence analysis.

[553] DPSR: Differentially Private Sparse Reconstruction via Multi-Stage Denoising for Recommender Systems

Sarwan Ali

Main category: cs.LG

TL;DR: DPSR is a novel differentially private recommender system that uses a three-stage denoising framework to improve recommendation quality while maintaining privacy, outperforming both private baselines and even non-private systems at certain privacy budgets.

Details

Motivation: Existing differential privacy mechanisms for recommender systems suffer from a fundamental privacy-utility tradeoff that degrades recommendation quality as privacy protection increases. There's a need for methods that can maintain strong privacy guarantees while preserving or even improving recommendation utility.

Method: DPSR uses a three-stage denoising framework: 1) information-theoretic noise calibration that adaptively reduces noise for high-information ratings, 2) collaborative filtering-based denoising using item-item similarities, and 3) low-rank matrix completion exploiting latent structure. All denoising occurs after noise injection, preserving differential privacy through post-processing immunity.

Result: DPSR achieves 5.57% to 9.23% RMSE improvement over state-of-the-art Laplace and Gaussian mechanisms across privacy budgets (ε=0.1 to 10.0). Remarkably, at ε=1.0, DPSR achieves RMSE of 0.9823, outperforming even the non-private baseline (1.0983), showing it acts as an effective regularizer.

Conclusion: DPSR fundamentally addresses the privacy-utility tradeoff in recommender systems by exploiting rating matrix structure through post-injection denoising, achieving superior performance that can even surpass non-private baselines while maintaining strong differential privacy guarantees.

Abstract: Differential privacy (DP) has emerged as the gold standard for protecting user data in recommender systems, but existing privacy-preserving mechanisms face a fundamental challenge: the privacy-utility tradeoff inevitably degrades recommendation quality as privacy budgets tighten. We introduce DPSR (Differentially Private Sparse Reconstruction), a novel three-stage denoising framework that fundamentally addresses this limitation by exploiting the inherent structure of rating matrices – sparsity, low-rank properties, and collaborative patterns. DPSR consists of three synergistic stages: (1) \textit{information-theoretic noise calibration} that adaptively reduces noise for high-information ratings, (2) \textit{collaborative filtering-based denoising} that leverages item-item similarities to remove privacy noise, and (3) \textit{low-rank matrix completion} that exploits latent structure for signal recovery. Critically, all denoising operations occur \textit{after} noise injection, preserving differential privacy through the post-processing immunity theorem while removing both privacy-induced and inherent data noise. Through extensive experiments on synthetic datasets with controlled ground truth, we demonstrate that DPSR achieves 5.57% to 9.23% RMSE improvement over state-of-the-art Laplace and Gaussian mechanisms across privacy budgets ranging from $\varepsilon=0.1$ to $\varepsilon=10.0$ (all improvements statistically significant with $p < 0.05$, most $p < 0.001$). Remarkably, at $\varepsilon=1.0$, DPSR achieves RMSE of 0.9823, \textit{outperforming even the non-private baseline} (1.0983), demonstrating that our denoising pipeline acts as an effective regularizer that removes data noise in addition to privacy noise.

[554] When Less is More: 8-bit Quantization Improves Continual Learning in Large Language Models

Michael S. Zhang, Rishi A. Ruia, Arnav Kewalram, Saathvik Dharmapuram, Utkarsh Sharma, Kevin Zhu

Main category: cs.LG

TL;DR: Quantized models (INT8, INT4) outperform FP16 in continual learning despite lower precision, with quantization noise acting as implicit regularization against catastrophic forgetting.

Details

Motivation: Address catastrophic forgetting in continual learning when models are quantized for deployment efficiency, challenging conventional wisdom that higher precision is always better.

Method: Systematically investigate interplay between quantization precision (FP16, INT8, INT4) and replay buffer strategies in large language models across NLU, Math, and Code tasks.

Result: Quantized models outperform FP16 on subsequent tasks (8-15% better forward accuracy), with INT4 achieving nearly double FP16’s performance on Code generation (40% vs 20%). Minimal replay buffers (0.1%) dramatically improve retention across all precision levels.

Conclusion: INT8 quantization offers optimal balance between computational efficiency and superior continual learning dynamics, with quantization-induced noise acting as implicit regularization. Practical guidelines: small replay buffers (1-2%) for NLU, moderate buffers (5-10%) for Math/Code.

Abstract: Catastrophic forgetting poses a fundamental challenge in continual learning, particularly when models are quantized for deployment efficiency. We systematically investigate the interplay between quantization precision (FP16, INT8, INT4) and replay buffer strategies in large language models, revealing unexpected dynamics. While FP16 achieves superior initial task performance (74.44% on NLU), we observe a striking inversion on subsequent tasks: quantized models outperform FP16 by 8-15% on final task forward accuracy, with INT4 achieving nearly double FP16’s performance on Code generation (40% vs 20%). Critically, even minimal replay buffers (0.1%) dramatically improve retention - increasing NLU retention after Math training from 45% to 65% across all precision levels - with INT8 consistently achieving the optimal balance between learning plasticity and knowledge retention. We hypothesize that quantization-induced noise acts as implicit regularization, preventing the overfitting to new task gradients that plagues high-precision models. These findings challenge the conventional wisdom that higher precision is always preferable, suggesting instead that INT8 quantization offers both computational efficiency and superior continual learning dynamics. Our results provide practical guidelines for deploying compressed models in continual learning scenarios: small replay buffers (1-2%) suffice for NLU tasks, while Math and Code benefit from moderate buffers (5-10%), with quantized models requiring less replay than FP16 to achieve comparable retention. Code is available at https://github.com/Festyve/LessIsMore.

Saman Forouzandeh, Wei Peng, Parham Moradi, Xinghuo Yu, Mahdi Jalili

Main category: cs.LG

TL;DR: MACLA is a framework that decouples reasoning from learning by keeping LLMs frozen while adapting through external hierarchical procedural memory, achieving state-of-the-art performance across multiple benchmarks with fast memory construction and sample efficiency.

Details

Motivation: Current LLM adaptation methods require expensive parameter updates, lack interpretability, and struggle with sample efficiency. The authors aim to create a framework that enables continual improvement without modifying LLM parameters while maintaining interpretability and efficiency.

Method: MACLA maintains a frozen LLM while performing adaptation in an external hierarchical procedural memory. It extracts reusable procedures from trajectories, tracks reliability via Bayesian posteriors, selects actions through expected-utility scoring, and refines procedures by contrasting successes and failures.

Result: Achieves 78.1% average performance across four benchmarks (ALFWorld, WebShop, TravelPlanner, InterCodeSQL), outperforming all baselines. On ALFWorld unseen tasks, reaches 90.3% with 3.1% positive generalization. Constructs memory in 56 seconds (2800× faster than LLM parameter training), compressing 2851 trajectories into 187 procedures.

Conclusion: Structured external memory with Bayesian selection and contrastive refinement enables sample-efficient, interpretable, and continually improving agents without LLM parameter updates, offering a promising alternative to traditional fine-tuning approaches.

Abstract: We present MACLA, a framework that decouples reasoning from learning by maintaining a frozen large language model while performing all adaptation in an external hierarchical procedural memory. MACLA extracts reusable procedures from trajectories, tracks reliability via Bayesian posteriors, selects actions through expected-utility scoring, and refines procedures by contrasting successes and failures. Across four benchmarks (ALFWorld, WebShop, TravelPlanner, InterCodeSQL), MACLA achieves 78.1 percent average performance, outperforming all baselines. On ALFWorld unseen tasks, MACLA reaches 90.3 percent with 3.1 percent positive generalization. The system constructs memory in 56 seconds, 2800 times faster than the state-of-the-art LLM parameter-training baseline, compressing 2851 trajectories into 187 procedures. Experimental results demonstrate that structured external memory with Bayesian selection and contrastive refinement enables sample-efficient, interpretable, and continually improving agents without LLM parameter updates.

[556] Learning Through Little Eyes: Attribute Discrimination Beyond Objects

Patrick Batsell, Tsutsui Satoshi, Bihan Wen

Main category: cs.LG

TL;DR: CVCL (Child’s View for Contrastive Learning) shows different attribute recognition capabilities than CLIP - better at size discrimination but worse at color, with both failing to ground texture linguistically.

Details

Motivation: To understand whether infant-scale learning supports fine-grained attribute discrimination (color, size, texture) in addition to class-level recognition, and to compare infant-inspired CVCL with CLIP on these capabilities.

Method: Created a benchmark that systematically varies color, size, and texture attributes for controlled within-class attribute recognition tests, then compared CVCL (trained on infant egocentric video) with CLIP on this benchmark.

Result: CVCL outperforms CLIP on size discrimination, while CLIP achieves higher accuracy on color discrimination. Both models represent texture in image embeddings but fail to ground texture linguistically, revealing a gap between visual and language spaces.

Conclusion: Infant-scale learning (CVCL) shows distinct attribute recognition patterns compared to CLIP, with strengths in size discrimination but weaknesses in color, and both models struggle with linguistic grounding of texture attributes.

Abstract: Infants learn to recognize not only object categories but also fine grained attributes such as color, size, and texture within their first two years of life. Prior work explores Childs View for Contrastive Learning (CVCL), a CLIP style model trained on infant egocentric video as a computational model of early infant learning, but it focuses only on class level recognition. This leaves it unclear whether infant scale learning also supports attribute discrimination. To address this, we introduce a benchmark that systematically varies color, size, and texture, allowing controlled tests of within class attribute recognition. Comparing CVCL with CLIP shows clear differences. CVCL is better at size discrimination, while CLIP achieves higher accuracy on color discrimination. Both models represent texture in image embeddings but fail to ground texture linguistically, suggesting a gap between visual and language spaces.

[557] Scaling Online Distributionally Robust Reinforcement Learning: Sample-Efficient Guarantees with General Function Approximation

Debamita Ghosh, George K. Atia, Yue Wang

Main category: cs.LG

TL;DR: Online distributionally robust RL algorithm with function approximation that learns robust policies through environment interaction without prior models or offline data, achieving near-optimal sublinear regret.

Details

Motivation: RL agents suffer performance degradation due to mismatches between training and deployment environments. Existing DR-RL methods require substantial prior knowledge (generative models or large offline datasets) and focus on tabular methods that don't scale to complex domains.

Method: Proposes an online DR-RL algorithm with general function approximation that learns optimal robust policies purely through environment interaction, without requiring prior models or offline data. Uses total variation uncertainty set for robustness.

Result: Theoretical analysis establishes a near-optimal sublinear regret bound under total variation uncertainty set, demonstrating sample efficiency and effectiveness. Enables deployment in high-dimensional tasks.

Conclusion: The proposed method overcomes limitations of existing DR-RL approaches by enabling online learning without prior knowledge, scaling to complex domains through function approximation, and providing theoretical guarantees for robust performance.

Abstract: The deployment of reinforcement learning (RL) agents in real-world applications is often hindered by performance degradation caused by mismatches between training and deployment environments. Distributionally robust RL (DR-RL) addresses this issue by optimizing worst-case performance over an uncertainty set of transition dynamics. However, existing work typically relies on substantial prior knowledge-such as access to a generative model or a large offline dataset-and largely focuses on tabular methods that do not scale to complex domains. We overcome these limitations by proposing an online DR-RL algorithm with general function approximation that learns an optimal robust policy purely through interaction with the environment, without requiring prior models or offline data, enabling deployment in high-dimensional tasks. We further provide a theoretical analysis establishing a near-optimal sublinear regret bound under a total variation uncertainty set, demonstrating the sample efficiency and effectiveness of our method.

[558] Lag Operator SSMs: A Geometric Framework for Structured State Space Modeling

Sutashu Tomonaga, Kenji Doya, Noboru Murata

Main category: cs.LG

TL;DR: The paper introduces a direct, first-principles framework for constructing discrete-time Structured State Space Models (SSMs) using a novel lag operator, offering a more intuitive alternative to the complex continuous-time modeling approach used in Mamba architecture.

Details

Motivation: Current SSMs (like Mamba) rely on complex continuous-time modeling and discretization processes that obscure intuition. The authors aim to provide a more direct, transparent theoretical foundation for discrete-time SSMs.

Method: Develops a novel lag operator that geometrically derives discrete-time recurrence by measuring how basis functions “slide” between timesteps. State matrices are computed via a single inner product with this operator, creating a modular design space for combining different basis functions and time-warping schemes.

Result: The framework can exactly recover the recurrence of the influential HiPPO model as a specific instance. Numerical simulations confirm the derivation, validating the approach.

Conclusion: The paper provides new theoretical tools for designing flexible and robust sequence models through a more intuitive, first-principles approach to discrete-time SSM construction.

Abstract: Structured State Space Models (SSMs), which are at the heart of the recently popular Mamba architecture, are powerful tools for sequence modeling. However, their theoretical foundation relies on a complex, multi-stage process of continuous-time modeling and subsequent discretization, which can obscure intuition. We introduce a direct, first-principles framework for constructing discrete-time SSMs that is both flexible and modular. Our approach is based on a novel lag operator, which geometrically derives the discrete-time recurrence by measuring how the system’s basis functions “slide” and change from one timestep to the next. The resulting state matrices are computed via a single inner product involving this operator, offering a modular design space for creating novel SSMs by flexibly combining different basis functions and time-warping schemes. To validate our approach, we demonstrate that a specific instance exactly recovers the recurrence of the influential HiPPO model. Numerical simulations confirm our derivation, providing new theoretical tools for designing flexible and robust sequence models.

[559] Consistency-guided semi-supervised outlier detection in heterogeneous data using fuzzy rough sets

Baiyang Chen, Zhong Yuan, Dezhong Peng, Xiaoliang Chen, Hongmei Chen

Main category: cs.LG

TL;DR: COD is a semi-supervised outlier detection algorithm using fuzzy rough sets for heterogeneous data, leveraging labeled outliers to improve detection accuracy.

Details

Motivation: Current semi-supervised outlier detection methods focus on numerical data and neglect data heterogeneity, limiting their effectiveness with diverse data types.

Method: Uses labeled outliers to construct label-informed fuzzy similarity relations, evaluates attribute contributions via consistency of fuzzy decision systems, defines outlier factor based on fuzzy similarity class, and integrates classification consistency with outlier factor for prediction.

Result: Extensively evaluated on 15 datasets, COD performs better than or comparable to leading outlier detectors.

Conclusion: COD effectively handles heterogeneous data in semi-supervised outlier detection using fuzzy rough set theory, demonstrating superior performance over existing methods.

Abstract: Outlier detection aims to find samples that behave differently from the majority of the data. Semi-supervised detection methods can utilize the supervision of partial labels, thus reducing false positive rates. However, most of the current semi-supervised methods focus on numerical data and neglect the heterogeneity of data information. In this paper, we propose a consistency-guided outlier detection algorithm (COD) for heterogeneous data with the fuzzy rough set theory in a semi-supervised manner. First, a few labeled outliers are leveraged to construct label-informed fuzzy similarity relations. Next, the consistency of the fuzzy decision system is introduced to evaluate attributes’ contributions to knowledge classification. Subsequently, we define the outlier factor based on the fuzzy similarity class and predict outliers by integrating the classification consistency and the outlier factor. The proposed algorithm is extensively evaluated on 15 freshly proposed datasets. Experimental results demonstrate that COD is better than or comparable with the leading outlier detectors. This manuscript is the accepted author version of a paper published by Elsevier. The final published version is available at https://doi.org/10.1016/j.asoc.2024.112070

[560] Outlier detection in mixed-attribute data: a semi-supervised approach with fuzzy approximations and relative entropy

Baiyang Chen, Zhong Yuan, Zheng Liu, Dezhong Peng, Yongxiang Li, Chang Liu, Guiduo Duan

Main category: cs.LG

TL;DR: FROD: A semi-supervised outlier detection method using fuzzy rough sets to handle uncertainty and heterogeneity in mixed-attribute data.

Details

Motivation: Existing semi-supervised outlier detection methods often overlook uncertainty and heterogeneity in real-world mixed-attribute data, limiting their effectiveness in practical applications.

Method: Uses labeled data to construct fuzzy decision systems and compute attribute classification accuracy via fuzzy approximations. Then uses unlabeled data to compute fuzzy relative entropy for uncertainty characterization. Combines both metrics in a detection algorithm.

Result: Experimental results on 16 public datasets show FROD performs comparably or better than leading outlier detection algorithms.

Conclusion: FROD effectively addresses uncertainty and heterogeneity challenges in semi-supervised outlier detection for mixed-attribute data, demonstrating superior performance over existing methods.

Abstract: Outlier detection is a critical task in data mining, aimed at identifying objects that significantly deviate from the norm. Semi-supervised methods improve detection performance by leveraging partially labeled data but typically overlook the uncertainty and heterogeneity of real-world mixed-attribute data. This paper introduces a semi-supervised outlier detection method, namely fuzzy rough sets-based outlier detection (FROD), to effectively handle these challenges. Specifically, we first utilize a small subset of labeled data to construct fuzzy decision systems, through which we introduce the attribute classification accuracy based on fuzzy approximations to evaluate the contribution of attribute sets in outlier detection. Unlabeled data is then used to compute fuzzy relative entropy, which provides a characterization of outliers from the perspective of uncertainty. Finally, we develop the detection algorithm by combining attribute classification accuracy with fuzzy relative entropy. Experimental results on 16 public datasets show that FROD is comparable with or better than leading detection algorithms. All datasets and source codes are accessible at https://github.com/ChenBaiyang/FROD. This manuscript is the accepted author version of a paper published by Elsevier. The final published version is available at https://doi.org/10.1016/j.ijar.2025.109373

[561] OPBO: Order-Preserving Bayesian Optimization

Wei Peng, Jianchen Hu, Kang Liu, Qiaozhu Zhai

Main category: cs.LG

TL;DR: OPBO replaces GP with order-preserving neural networks for high-dimensional Bayesian optimization, achieving better performance with lower computational cost.

Details

Motivation: Gaussian processes fail in high-dimensional spaces (500+ dimensions) due to prohibitive computational complexity from precise numerical fitting requirements.

Method: Proposes order-preserving Bayesian optimization (OPBO) using OP neural networks as surrogate models that preserve order instead of exact values, and selects good-enough solutions from ordinal sets.

Result: OPBO significantly outperforms traditional BO methods (regression NN and GP) for high-dimensional black-box optimization problems over 500 dimensions.

Conclusion: Order-preserving approach is more suitable than value-fitting for high-dimensional Bayesian optimization, offering better performance with reduced computational cost.

Abstract: Bayesian optimization is an effective method for solving expensive black-box optimization problems. Most existing methods use Gaussian processes (GP) as the surrogate model for approximating the black-box objective function, it is well-known that it can fail in high-dimensional space (e.g., dimension over 500). We argue that the reliance of GP on precise numerical fitting is fundamentally ill-suited in high-dimensional space, where it leads to prohibitive computational complexity. In order to address this, we propose a simple order-preserving Bayesian optimization (OPBO) method, where the surrogate model preserves the order, instead of the value, of the black-box objective function. Then we can use a simple but effective OP neural network (NN) to replace GP as the surrogate model. Moreover, instead of searching for the best solution from the acquisition model, we select good-enough solutions in the ordinal set to reduce computational cost. The experimental results show that for high-dimensional (over 500) black-box optimization problems, the proposed OPBO significantly outperforms traditional BO methods based on regression NN and GP. The source code is available at https://github.com/pengwei222/OPBO.

[562] R-GenIMA: Integrating Neuroimaging and Genetics with Interpretable Multimodal AI for Alzheimer’s Disease Progression

Kun Zhao, Siyuan Dai, Yingying Zhang, Guodong Liu, Pengfei Gu, Chenghua Lin, Paul M. Thompson, Alex Leow, Heng Huang, Lifang He, Liang Zhan, Haoteng Tang

Main category: cs.LG

TL;DR: R-GenIMA is an interpretable multimodal AI model that integrates structural MRI and genetic data for Alzheimer’s disease classification and mechanistic insights.

Details

Motivation: Early AD detection requires integrating macro-scale neuroanatomical changes with micro-scale genetic factors, but existing multimodal approaches struggle to align these heterogeneous signals.

Method: R-GenIMA uses a novel ROI-wise vision transformer with genetic prompting to jointly model structural MRI and SNPs. Brain regions are represented as visual tokens and SNP profiles as structured text, enabling cross-modal attention linking atrophy patterns to genetic factors.

Result: Achieved state-of-the-art performance in four-way classification (NC, SMC, MCI, AD) on ADNI cohort. Identified biologically meaningful stage-specific brain regions and gene signatures, including established AD risk loci (APOE, BIN1, CLU, RBFOX1). Revealed shared vulnerability hubs and stage-specific patterns across disease continuum.

Conclusion: Interpretable multimodal AI can synthesize imaging and genetics to reveal mechanistic insights, providing a foundation for clinically deployable tools for earlier risk stratification and precision therapeutic strategies in Alzheimer’s disease.

Abstract: Early detection of Alzheimer’s disease (AD) requires models capable of integrating macro-scale neuroanatomical alterations with micro-scale genetic susceptibility, yet existing multimodal approaches struggle to align these heterogeneous signals. We introduce R-GenIMA, an interpretable multimodal large language model that couples a novel ROI-wise vision transformer with genetic prompting to jointly model structural MRI and single nucleotide polymorphisms (SNPs) variations. By representing each anatomically parcellated brain region as a visual token and encoding SNP profiles as structured text, the framework enables cross-modal attention that links regional atrophy patterns to underlying genetic factors. Applied to the ADNI cohort, R-GenIMA achieves state-of-the-art performance in four-way classification across normal cognition (NC), subjective memory concerns (SMC), mild cognitive impairment (MCI), and AD. Beyond predictive accuracy, the model yields biologically meaningful explanations by identifying stage-specific brain regions and gene signatures, as well as coherent ROI-Gene association patterns across the disease continuum. Attention-based attribution revealed genes consistently enriched for established GWAS-supported AD risk loci, including APOE, BIN1, CLU, and RBFOX1. Stage-resolved neuroanatomical signatures identified shared vulnerability hubs across disease stages alongside stage-specific patterns: striatal involvement in subjective decline, frontotemporal engagement during prodromal impairment, and consolidated multimodal network disruption in AD. These results demonstrate that interpretable multimodal AI can synthesize imaging and genetics to reveal mechanistic insights, providing a foundation for clinically deployable tools that enable earlier risk stratification and inform precision therapeutic strategies in Alzheimer’s disease.

[563] The 6th International Verification of Neural Networks Competition (VNN-COMP 2025): Summary and Results

Konstantin Kaulen, Tobias Ladner, Stanley Bak, Christopher Brix, Hai Duong, Thomas Flinkow, Taylor T. Johnson, Lukas Koller, Edoardo Manino, ThanhVu H Nguyen, Haoze Wu

Main category: cs.LG

TL;DR: VNN-COMP 2025 was the 6th annual competition for neural network verification tools, featuring standardized formats, equal-cost hardware evaluation, and participation from 8 teams on 25 benchmarks.

Details

Motivation: To facilitate fair comparison of neural network verification tools, encourage standardization of interfaces, and bring together the verification community through an objective competition framework.

Method: Used standardized formats (ONNX for networks, VNN-LIB for specifications), equal-cost AWS hardware with automatic evaluation pipeline, and participant-chosen parameters before final test sets were public.

Result: 8 teams participated in the competition, evaluated on 16 regular and 9 extended benchmarks, with results documented in the competition report.

Conclusion: The competition successfully provided a standardized evaluation framework for neural network verification tools, documented rules, benchmarks, tools, results, and lessons learned for the community.

Abstract: This report summarizes the 6th International Verification of Neural Networks Competition (VNN-COMP 2025), held as a part of the 8th International Symposium on AI Verification (SAIV), that was collocated with the 37th International Conference on Computer-Aided Verification (CAV). VNN-COMP is held annually to facilitate the fair and objective comparison of state-of-the-art neural network verification tools, encourage the standardization of tool interfaces, and bring together the neural network verification community. To this end, standardized formats for networks (ONNX) and specification (VNN-LIB) were defined, tools were evaluated on equal-cost hardware (using an automatic evaluation pipeline based on AWS instances), and tool parameters were chosen by the participants before the final test sets were made public. In the 2025 iteration, 8 teams participated on a diverse set of 16 regular and 9 extended benchmarks. This report summarizes the rules, benchmarks, participating tools, results, and lessons learned from this iteration of this competition.

[564] Optimizer Dynamics at the Edge of Stability with Differential Privacy

Ayana Hussain, Ricky Fang

Main category: cs.LG

TL;DR: DP training modifies neural network optimization dynamics, but edge-of-stability patterns persist despite gradient clipping and noise reducing sharpness.

Details

Motivation: Differential privacy alters neural network optimization in poorly understood ways, and it's unclear whether stability patterns like Edge of Stability persist under DP training with gradient clipping and noise.

Method: Comparative analysis of Gradient Descent and Adam vs their DP variants, examining how clipping and noise affect sharpness (maximum Hessian eigenvalue) and loss evolution during training.

Result: DP generally reduces sharpness and prevents full reaching of classical stability thresholds, but edge-of-stability patterns persist, with largest learning rates and privacy budgets sometimes exceeding these thresholds.

Conclusion: DP introduces unpredictability in neural network optimization, though characteristic stability patterns from non-private training still manifest under privacy constraints.

Abstract: Deep learning models can reveal sensitive information about individual training examples, and while differential privacy (DP) provides guarantees restricting such leakage, it also alters optimization dynamics in poorly understood ways. We study the training dynamics of neural networks under DP by comparing Gradient Descent (GD), and Adam to their privacy-preserving variants. Prior work shows that these optimizers exhibit distinct stability dynamics: full-batch methods train at the Edge of Stability (EoS), while mini-batch and adaptive methods exhibit analogous edge-of-stability behavior. At these regimes, the training loss and the sharpness–the maximum eigenvalue of the training loss Hessian–exhibit certain characteristic behavior. In DP training, per-example gradient clipping and Gaussian noise modify the update rule, and it is unclear whether these stability patterns persist. We analyze how clipping and noise change sharpness and loss evolution and show that while DP generally reduces the sharpness and can prevent optimizers from fully reaching the classical stability thresholds, patterns from EoS and analogous adaptive methods stability regimes persist, with the largest learning rates and largest privacy budgets approaching, and sometimes exceeding, these thresholds. These findings highlight the unpredictability introduced by DP in neural network optimization.

[565] A Surrogate-Augmented Symbolic CFD-Driven Training Framework for Accelerating Multi-objective Physical Model Development

Yuan Fang, Fabian Waschkowski, Maximilian Reissmann, Richard D. Sandberg, Takuo Oda, Koichi Tanimoto

Main category: cs.LG

TL;DR: Surrogate-augmented CFD-driven training framework reduces computational cost by using surrogate models to pre-screen ML-generated closure models, only running full CFD simulations on promising candidates.

Details

Motivation: Original CFD-driven training requires hundreds to thousands of high-fidelity simulations for each ML-generated candidate model, resulting in prohibitive computational costs for complex flows.

Method: Integrates surrogate modeling into symbolic CFD-driven training in real time. Surrogate model learns to approximate errors of ML-generated models based on previous CFD evaluations and is continuously refined. New models are first assessed using surrogate, with only promising candidates (small errors or high uncertainty) undergoing full CFD simulations. Uses probabilistic surrogate model with multi-output formulation for multi-objective training.

Result: Framework substantially reduces training cost while maintaining predictive accuracy comparable to original CFD-driven approach across range of statistically one- and two-dimensional flows, including both single- and multi-expression model optimization.

Conclusion: Surrogate-augmented CFD-driven training provides an effective approach to reduce computational burden while preserving accuracy, enabling more practical application of ML-based closure models in complex flow simulations.

Abstract: Computational Fluid Dynamics (CFD)-driven training combines machine learning (ML) with CFD solvers to develop physically consistent closure models with improved predictive accuracy. In the original framework, each ML-generated candidate model is embedded in a CFD solver and evaluated against reference data, requiring hundreds to thousands of high-fidelity simulations and resulting in prohibitive computational cost for complex flows. To overcome this limitation, we propose an extended framework that integrates surrogate modeling into symbolic CFD-driven training in real time to reduce training cost. The surrogate model learns to approximate the errors of ML-generated models based on previous CFD evaluations and is continuously refined during training. Newly generated models are first assessed using the surrogate, and only those predicted to yield small errors or high uncertainty are subsequently evaluated with full CFD simulations. Discrete expressions generated by symbolic regression are mapped into a continuous space using averaged input-symbol values as inputs to a probabilistic surrogate model. To support multi-objective model training, particularly when fixed weighting of competing quantities is challenging, the surrogate is extended to a multi-output formulation by generalizing the kernel to a matrix form, providing one mean and variance prediction per training objective. Selection metrics based on these probabilistic outputs are used to identify an optimal training setup. The proposed surrogate-augmented CFD-driven training framework is demonstrated across a range of statistically one- and two-dimensional flows, including both single- and multi-expression model optimization. In all cases, the framework substantially reduces training cost while maintaining predictive accuracy comparable to that of the original CFD-driven approach.

[566] Time-series Forecast for Indoor Zone Air Temperature with Long Horizons: A Case Study with Sensor-based Data from a Smart Building

Liping Sun, Yucheng Guo, Siliang Lu, Zhenzhen Li

Main category: cs.LG

TL;DR: Developed a hybrid physics-data-driven time series model for 2-week zone air temperature prediction in American buildings to support HVAC optimization and energy modeling.

Details

Motivation: Climate change increases extreme weather events, requiring more energy-efficient and flexible HVAC systems that need rapid temperature prediction capabilities. Traditional simulation approaches are insufficient for responding to rapid weather changes.

Method: Hybrid approach combining physics and data-driven methods for time series forecasting, specifically designed for 2-week horizon predictions of zone air temperatures in buildings.

Result: Developed a forecast model capable of predicting zone air temperature on a 2-week horizon for buildings in America, addressing gaps in existing short- and long-term prediction research.

Conclusion: The model can support intelligent HVAC control for demand flexibility and serve as hybrid building energy modeling, leveraging recent availability of high-quality datasets and algorithmic advances.

Abstract: With the press of global climate change, extreme weather and sudden weather changes are becoming increasingly common. To maintain a comfortable indoor environment and minimize the contribution of the building to climate change as much as possible, higher requirements are placed on the operation and control of HVAC systems, e.g., more energy-efficient and flexible to response to the rapid change of weather. This places demands on the rapid modeling and prediction of zone air temperatures of buildings. Compared to the traditional simulation-based approach such as EnergyPlus and DOE2, a hybrid approach combined physics and data-driven is more suitable. Recently, the availability of high-quality datasets and algorithmic breakthroughs have driven a considerable amount of work in this field. However, in the niche of short- and long-term predictions, there are still some gaps in existing research. This paper aims to develop a time series forecast model to predict the zone air temperature in a building located in America on a 2-week horizon. The findings could be further improved to support intelligent control and operation of HVAC systems (i.e. demand flexibility) and could also be used as hybrid building energy modeling.

[567] Efficient Personalization of Generative Models via Optimal Experimental Design

Guy Schacht, Ziyad Sheebaelhamd, Riccardo De Santi, Mojmír Mutný, Andreas Krause

Main category: cs.LG

TL;DR: ED-PBRL: An optimal experimental design approach for efficient preference query selection in aligning generative models with human preferences, requiring fewer queries than random selection.

Details

Motivation: Human feedback for preference learning is costly and time-consuming, creating demand for data-efficient query selection methods to align generative models with user preferences.

Method: Formulates preference query selection as maximizing information about latent preference model; develops convex optimization formulation and ED-PBRL algorithm with theoretical guarantees for efficient structured query construction (images/text).

Result: Empirical demonstration shows ED-PBRL requires fewer preference queries than random selection when personalizing text-to-image generative models to user-specific styles.

Conclusion: Optimal experimental design enables efficient preference query selection, reducing the human feedback burden while effectively elucidating latent reward functions for aligning generative models with user preferences.

Abstract: Preference learning from human feedback has the ability to align generative models with the needs of end-users. Human feedback is costly and time-consuming to obtain, which creates demand for data-efficient query selection methods. This work presents a novel approach that leverages optimal experimental design to ask humans the most informative preference queries, from which we can elucidate the latent reward function modeling user preferences efficiently. We formulate the problem of preference query selection as the one that maximizes the information about the underlying latent preference model. We show that this problem has a convex optimization formulation, and introduce a statistically and computationally efficient algorithm ED-PBRL that is supported by theoretical guarantees and can efficiently construct structured queries such as images or text. We empirically present the proposed framework by personalizing a text-to-image generative model to user-specific styles, showing that it requires less preference queries compared to random query selection.

[568] Fraud Detection Through Large-Scale Graph Clustering with Heterogeneous Link Transformation

Chi Liu

Main category: cs.LG

TL;DR: Proposes a graph-based fraud detection framework using hard/soft link distinction and graph transformation to improve coverage and clustering effectiveness for collaborative fraud detection.

Details

Motivation: Collaborative fraud with coordinated fraudulent accounts creates complex network structures that traditional methods struggle with - high-confidence identity links have limited coverage, while using all linkages leads to fragmented graphs with poor clustering.

Method: Distinguishes hard links (high-confidence identity relationships like phone numbers, credit cards, national IDs) from soft links (behavioral associations like device fingerprints, cookies, IP addresses). Uses graph transformation: identifies connected components via hard links, merges them into super-nodes, then reconstructs weighted soft-link graph for embedding and clustering using LINE and HDBSCAN.

Result: Achieves significant graph size reduction (25M to 7.7M nodes), doubles detection coverage compared to hard-link-only baselines, and maintains high precision across identified fraud clusters on real-world payment platform dataset.

Conclusion: The framework provides a scalable and practical solution for industrial-scale fraud detection systems by effectively handling large-scale heterogeneous graph clustering through principled link transformation.

Abstract: Collaborative fraud, where multiple fraudulent accounts coordinate to exploit online payment systems, poses significant challenges due to the formation of complex network structures. Traditional detection methods that rely solely on high-confidence identity links suffer from limited coverage, while approaches using all available linkages often result in fragmented graphs with reduced clustering effectiveness. In this paper, we propose a novel graph-based fraud detection framework that addresses the challenge of large-scale heterogeneous graph clustering through a principled link transformation approach. Our method distinguishes between \emph{hard links} (high-confidence identity relationships such as phone numbers, credit cards, and national IDs) and \emph{soft links} (behavioral associations including device fingerprints, cookies, and IP addresses). We introduce a graph transformation technique that first identifies connected components via hard links, merges them into super-nodes, and then reconstructs a weighted soft-link graph amenable to efficient embedding and clustering. The transformed graph is processed using LINE (Large-scale Information Network Embedding) for representation learning, followed by HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) for density-based cluster discovery. Experiments on a real-world payment platform dataset demonstrate that our approach achieves significant graph size reduction (from 25 million to 7.7 million nodes), doubles the detection coverage compared to hard-link-only baselines, and maintains high precision across identified fraud clusters. Our framework provides a scalable and practical solution for industrial-scale fraud detection systems.

[569] DIVER-1 : Deep Integration of Vast Electrophysiological Recordings at Scale

Danny Dongyeop Han, Yonghyeon Gwon, Ahhyun Lucy Lee, Taeyang Lee, Seong Jin Lee, Jubin Choi, Sebin Lee, Jihyun Bang, Seungju Lee, David Keetae Park, Shinjae Yoo, Chun Kee Chung, Jiook Cha

Main category: cs.LG

TL;DR: DIVER-1 is a family of EEG/iEEG foundation models trained on massive datasets (5.3k hours iEEG, 54k hours EEG) with 1.82B parameters, establishing data-constrained scaling laws showing smaller models trained longer outperform larger models briefly trained.

Details

Motivation: Existing electrophysiology foundation models are limited in scale despite evidence that scaling improves performance. There's a need for larger, more diverse training corpora and systematic scaling analysis for EEG/iEEG models.

Method: Trained on largest EEG/iEEG corpus to date (1.6M channel-hours from 17.7k+ subjects), scaled to 1.82B parameters. Introduced architectural innovations: any-variate attention, sliding temporal conditional positional encoding, and multi-domain reconstruction. Conducted systematic scaling law analysis.

Result: Established data-constrained scaling laws showing smaller models trained for extended epochs outperform larger models trained briefly. DIVER-1 iEEG and EEG models achieve state-of-the-art performance on respective benchmarks.

Conclusion: DIVER-1 provides concrete guidelines for efficient scaling in electrophysiology foundation models, demonstrating that training duration matters more than model size when data is constrained, contrasting with prior approaches that emphasized model size over training duration.

Abstract: Electrophysiology signals such as EEG and iEEG are central to neuroscience, brain-computer interfaces, and clinical applications, yet existing foundation models remain limited in scale despite clear evidence that scaling improves performance. We introduce DIVER-1, a family of EEG and iEEG foundation models trained on the largest and most diverse corpus to date-5.3k hours of iEEG and 54k hours of EEG (1.6M channel-hours from over 17.7k subjects)-and scaled up to 1.82B parameters. We present the first systematic scaling law analysis for this domain, showing that they follow data-constrained scaling laws: for a given amount of data and compute, smaller models trained for extended epochs consistently outperform larger models trained briefly. This behavior contrasts with prior electrophysiology foundation models that emphasized model size over training duration. To achieve strong performance, we also design architectural innovations including any-variate attention, sliding temporal conditional positional encoding, and multi-domain reconstruction. DIVER-1 iEEG and EEG models each achieve state-of-the-art performance on their respective benchmarks, establishing a concrete guidelines for efficient scaling and resource allocation in electrophysiology foundation model development.

[570] Dual Model Deep Learning for Alzheimer Prognostication

Alireza Moayedikia, Sara Fin, Uffe Kock Wiil

Main category: cs.LG

TL;DR: PROGRESS is a dual-model deep learning framework that uses a single baseline CSF biomarker assessment to predict Alzheimer’s progression with uncertainty quantification, outperforming traditional methods and enabling personalized treatment decisions.

Details

Motivation: Current Alzheimer's disease predictive models require longitudinal data and lack uncertainty quantification, making them impractical for critical first-visit treatment decisions when only baseline biomarkers are available.

Method: PROGRESS uses a dual-model deep learning framework: 1) a probabilistic trajectory network predicting individualized cognitive decline with calibrated uncertainty bounds, and 2) a deep survival model estimating time to conversion from mild cognitive impairment to dementia, using only single baseline CSF biomarkers without requiring prior clinical history.

Result: PROGRESS substantially outperforms Cox proportional hazards, Random Survival Forests, and gradient boosting methods for survival prediction. It identifies patient groups with seven-fold differences in conversion rates and demonstrates robust generalizability across 43 Alzheimer’s Disease Research Centers spanning four decades of assay technologies.

Conclusion: PROGRESS bridges the gap between biomarker measurement and personalized clinical decision-making by combining superior survival prediction with trustworthy trajectory uncertainty quantification from a single baseline assessment, enabling actionable prognostic estimates for timely treatment decisions.

Abstract: Disease modifying therapies for Alzheimer’s disease demand precise timing decisions, yet current predictive models require longitudinal observations and provide no uncertainty quantification, rendering them impractical at the critical first visit when treatment decisions must be made. We developed PROGRESS (PRognostic Generalization from REsting Static Signatures), a dual-model deep learning framework that transforms a single baseline cerebrospinal fluid biomarker assessment into actionable prognostic estimates without requiring prior clinical history. The framework addresses two complementary clinical questions: a probabilistic trajectory network predicts individualized cognitive decline with calibrated uncertainty bounds achieving near-nominal coverage, enabling honest prognostic communication; and a deep survival model estimates time to conversion from mild cognitive impairment to dementia. Using data from over 3,000 participants across 43 Alzheimer’s Disease Research Centers in the National Alzheimer’s Coordinating Center database, PROGRESS substantially outperforms Cox proportional hazards, Random Survival Forests, and gradient boosting methods for survival prediction. Risk stratification identifies patient groups with seven-fold differences in conversion rates, enabling clinically meaningful treatment prioritization. Leave-one-center-out validation demonstrates robust generalizability, with survival discrimination remaining strong across held-out sites despite heterogeneous measurement conditions spanning four decades of assay technologies. By combining superior survival prediction with trustworthy trajectory uncertainty quantification, PROGRESS bridges the gap between biomarker measurement and personalized clinical decision-making.

[571] MAGIC: Achieving Superior Model Merging via Magnitude Calibration

Yayuan Li, Jian Zhang, Jintao Guo, Zihan Cheng, Lei Qi, Yinghuan Shi, Yang Gao

Main category: cs.LG

TL;DR: MAGIC is a plug-and-play framework that calibrates feature magnitudes in merged models to prevent performance degradation caused by magnitude perturbations during model merging operations.

Details

Motivation: Model merging combines specialized models but often suffers from performance degradation due to magnitude perturbations from merging operations like parameter fusion and sparsification. While prior work focused on directional alignment, magnitude calibration has been neglected despite its vulnerability to perturbations.

Method: Proposes MAGIC with three variants: 1) Feature Space Calibration (FSC) - realigns merged model’s features using unlabeled data, 2) Weight Space Calibration (WSC) - extends calibration to weight space without additional data, and 3) Dual Space Calibration (DSC) - combines both approaches.

Result: Consistent performance improvements across diverse Computer Vision tasks (+4.3% on eight datasets) and NLP tasks (+8.0% on Llama) without additional training.

Conclusion: Magnitude calibration is crucial for effective model merging, and MAGIC provides an effective plug-and-play solution that addresses magnitude perturbations to maintain behavioral characteristics of specialized models in merged models.

Abstract: The proliferation of pre-trained models has given rise to a wide array of specialised, fine-tuned models. Model merging aims to merge the distinct capabilities of these specialised models into a unified model, requiring minimal or even no additional training. A core objective of model merging is to ensure the merged model retains the behavioural characteristics of the specialised models, typically achieved through feature alignment. We identify that features consist of two critical components: direction and magnitude. Prior research has predominantly focused on directional alignment, while the influence of magnitude remains largely neglected, despite its pronounced vulnerability to perturbations introduced by common merging operations (e.g., parameter fusion and sparsification). Such perturbations to magnitude inevitably lead to feature deviations in the merged model from the specialised models, resulting in subsequent performance degradation. To address this, we propose MAGnItude Calibration (MAGIC), a plug-and-play framework that rectifies layer-wise magnitudes in feature and weight spaces, with three variants. Specifically, our Feature Space Calibration (FSC) realigns the merged model’s features using a small set of unlabelled data, while Weight Space Calibration (WSC) extends this calibration to the weight space without requiring additional data. Combining these yields Dual Space Calibration (DSC). Comprehensive experiments demonstrate that MAGIC consistently boosts performance across diverse Computer Vision tasks (+4.3% on eight datasets) and NLP tasks (+8.0% on Llama) without additional training. Our code is available at: https://github.com/lyymuwu/MAGIC

[572] Timely Parameter Updating in Over-the-Air Federated Learning

Jiaqi Zhu, Zhongyuan Zhao, Xiao Li, Ruihao Du, Shi Jin, Howard H. Yang

Main category: cs.LG

TL;DR: FAIR-k algorithm for OAC-FL selects top-k gradients for over-the-air transmission, balancing freshness and importance to overcome waveform limitations and improve training efficiency.

Details

Motivation: Over-the-air computation (OAC) helps FL communication bottlenecks, but limited orthogonal waveforms mismatch with high-dimensional models. Need efficient gradient selection method.

Method: Propose FAIR-k algorithm combining Round-Robin (freshness) and Top-k (importance) approaches. Uses Markov analysis to characterize parameter staleness distribution and establishes convergence rate considering data heterogeneity, channel noise, and staleness.

Result: FAIR-k accelerates convergence, enhances communication efficiency by enabling longer local training periods without sacrificing overall training efficiency. Provides finer-grained analysis of data heterogeneity.

Conclusion: FAIR-k effectively addresses waveform limitations in OAC-FL by balancing freshness and importance of gradient updates, improving both convergence speed and communication efficiency.

Abstract: Incorporating over-the-air computations (OAC) into the model training process of federated learning (FL) is an effective approach to alleviating the communication bottleneck in FL systems. Under OAC-FL, every client modulates its intermediate parameters, such as gradient, onto the same set of orthogonal waveforms and simultaneously transmits the radio signal to the edge server. By exploiting the superposition property of multiple-access channels, the edge server can obtain an automatically aggregated global gradient from the received signal. However, the limited number of orthogonal waveforms available in practical systems is fundamentally mismatched with the high dimensionality of modern deep learning models. To address this issue, we propose Freshness Freshness-mAgnItude awaRe top-k (FAIR-k), an algorithm that selects, in each communication round, the most impactful subset of gradients to be updated over the air. In essence, FAIR-k combines the complementary strengths of the Round-Robin and Top-k algorithms, striking a delicate balance between timeliness (freshness of parameter updates) and importance (gradient magnitude). Leveraging tools from Markov analysis, we characterize the distribution of parameter staleness under FAIR-k. Building on this, we establish the convergence rate of OAC-FL with FAIR-k, which discloses the joint effect of data heterogeneity, channel noise, and parameter staleness on the training efficiency. Notably, as opposed to conventional analyses that assume a universal Lipschitz constant across all the clients, our framework adopts a finer-grained model of the data heterogeneity. The analysis demonstrates that since FAIR-k promotes fresh (and fair) parameter updates, it not only accelerates convergence but also enhances communication efficiency by enabling an extended period of local training without significantly affecting overall training efficiency.

[573] HyperLoad: A Cross-Modality Enhanced Large Language Model-Based Framework for Green Data Center Cooling Load Prediction

Haoyu Jiang, Boan Qu, Junjie Zhu, Fanjie Zeng, Xiaojie Lin, Wei Zhong

Main category: cs.LG

TL;DR: HyperLoad: A cross-modality framework using pre-trained LLMs for accurate load forecasting in green data centers, addressing data scarcity issues and outperforming SOTA methods.

Details

Motivation: The rapid growth of AI is increasing computational demand and data center energy use/carbon emissions. Green data centers need accurate load forecasting for efficient resource orchestration, but existing methods struggle with small-sample scenarios caused by cold start, load distortion, data fragmentation, and distribution shifts.

Method: Two-phase framework: 1) Cross-Modality Knowledge Alignment - maps textual priors and time-series data to common latent space to maximize prior knowledge utility; 2) Multi-Scale Feature Modeling - injects domain-aligned priors via adaptive prefix-tuning for rapid scenario adaptation, plus Enhanced Global Interaction Attention to capture cross-device temporal dependencies.

Result: HyperLoad consistently surpasses state-of-the-art baselines under both data sufficient and data scarce settings. The authors also release the public DCData dataset for benchmarking.

Conclusion: HyperLoad demonstrates practicality for sustainable green data center management by overcoming data scarcity challenges through LLM-based cross-modality knowledge transfer, enabling accurate load forecasting essential for efficient resource orchestration and carbon reduction.

Abstract: The rapid growth of artificial intelligence is exponentially escalating computational demand, inflating data center energy use and carbon emissions, and spurring rapid deployment of green data centers to relieve resource and environmental stress. Achieving sub-minute orchestration of renewables, storage, and loads, while minimizing PUE and lifecycle carbon intensity, hinges on accurate load forecasting. However, existing methods struggle to address small-sample scenarios caused by cold start, load distortion, multi-source data fragmentation, and distribution shifts in green data centers. We introduce HyperLoad, a cross-modality framework that exploits pre-trained large language models (LLMs) to overcome data scarcity. In the Cross-Modality Knowledge Alignment phase, textual priors and time-series data are mapped to a common latent space, maximizing the utility of prior knowledge. In the Multi-Scale Feature Modeling phase, domain-aligned priors are injected through adaptive prefix-tuning, enabling rapid scenario adaptation, while an Enhanced Global Interaction Attention mechanism captures cross-device temporal dependencies. The public DCData dataset is released for benchmarking. Under both data sufficient and data scarce settings, HyperLoad consistently surpasses state-of-the-art (SOTA) baselines, demonstrating its practicality for sustainable green data center management.

[574] A Composable Channel-Adaptive Architecture for Seizure Classification

Francesco Carzaniga, Michael Hersche, Kaspar Schindler, Abbas Rahimi

Main category: cs.LG

TL;DR: Channel-adaptive architecture for multi-variate time-series (iEEG) that processes arbitrary channel counts, fuses features with vector-symbolic algorithms, and uses long-term memory for classification with efficient personalization.

Details

Motivation: To develop a flexible architecture that can handle intracranial EEG recordings with varying numbers of channels across different subjects, enabling cross-subject training without performance loss while incorporating clinically relevant temporal context.

Method: Processes iEEG signals using state-of-the-art models per channel independently, fuses features with vector-symbolic algorithm using trainable channel scalars, accumulates fused features in 2-minute long-term memory for classification, and uses pre-training on multi-subject data followed by quick fine-tuning for personalization.

Result: CA-EEGWaveNet trained on single seizure outperforms baseline trained on all but one seizure (F1: 0.78 vs 0.76). CA-EEGNet also surpasses baseline (F1: 0.79 vs 0.74). Personalization requires equal/less data but only 1/5 the time of existing models.

Conclusion: The channel-adaptive model solves cross-subject heterogeneity issues, increases temporal context to clinically relevant lengths, and serves as a drop-in replacement for existing models with improved characteristics and performance.

Abstract: Objective: We develop a channel-adaptive (CA) architecture that seamlessly processes multi-variate time-series with an arbitrary number of channels, and in particular intracranial electroencephalography (iEEG) recordings. Methods: Our CA architecture first processes the iEEG signal using state-of-the-art models applied to each single channel independently. The resulting features are then fused using a vector-symbolic algorithm which reconstructs the spatial relationship using a trainable scalar per channel. Finally, the fused features are accumulated in a long-term memory of up to 2 minutes to perform the classification. Each CA-model can then be pre-trained on a large corpus of iEEG recordings from multiple heterogeneous subjects. The pre-trained model is personalized to each subject via a quick fine-tuning routine, which uses equal or lower amounts of data compared to existing state-of-the-art models, but requiring only 1/5 of the time. Results: We evaluate our CA-models on a seizure detection task both on a short-term (~20 hours) and a long-term (~2500 hours) dataset. In particular, our CA-EEGWaveNet is trained on a single seizure of the tested subject, while the baseline EEGWaveNet is trained on all but one. Even in this challenging scenario, our CA-EEGWaveNet surpasses the baseline in median F1-score (0.78 vs 0.76). Similarly, CA-EEGNet based on EEGNet, also surpasses its baseline in median F1-score (0.79 vs 0.74). Conclusion and significance: Our CA-model addresses two issues: first, it is channel-adaptive and can therefore be trained across heterogeneous subjects without loss of performance; second, it increases the effective temporal context size to a clinically-relevant length. Therefore, our model is a drop-in replacement for existing models, bringing better characteristics and performance across the board.

[575] Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu

Main category: cs.LG

TL;DR: The paper proposes BuPO, a novel RL paradigm that decomposes LLM policies into internal layer and modular components, revealing distinct entropy patterns across layers, and optimizes early layers to improve reasoning performance.

Details

Motivation: Existing RL approaches treat LLMs as unified policies, overlooking internal mechanisms. Understanding how policy evolves across layers and modules is crucial for targeted optimization and understanding complex reasoning mechanisms.

Method: Decompose language model policy using Transformer residual stream split and equivalence between hidden states composition with unembedding matrix. Analyze internal policy entropy patterns. Propose Bottom-up Policy Optimization (BuPO) that directly optimizes internal layer policy during early training by aligning training objectives at lower layers.

Result: Found that early layers keep high entropy for exploration while top layers converge to near-zero entropy for refinement. LLama converges rapidly in final layer, while Qwen-series models exhibit more human-like progressive reasoning. BuPO reconstructs foundational reasoning capabilities and achieves superior performance on complex reasoning benchmarks.

Conclusion: Decomposing LLM policies reveals important internal mechanisms. BuPO’s bottom-up optimization approach effectively improves reasoning performance by targeting early layer policy optimization, offering a novel RL paradigm for LLM training.

Abstract: Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and raveling out complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of internal policy, we find that: (a) Early layers keep high entropy for exploration, top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series. (b) LLama’s prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning training objective at lower layer, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrates the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.

[576] A Convex Loss Function for Set Prediction with Optimal Trade-offs Between Size and Conditional Coverage

Francis Bach

Main category: cs.LG

TL;DR: Proposes convex loss functions using Choquet integrals for set predictions with optimal trade-offs between coverage and set size, with extensions for asymmetric losses and conditional coverage optimization.

Details

Motivation: Need for supervised learning methods that provide explicit uncertainty estimates through set predictions, with optimal trade-offs between probabilistic coverage and set compactness, going beyond marginal coverage approaches.

Method: Uses Choquet integrals (Lovász extensions) to create convex loss functions for nondecreasing subset-valued functions from level sets of real-valued functions. Extends to asymmetric loss scenarios and conditional coverage optimization. Develops SGD and reweighted least-squares optimization algorithms.

Result: Theoretical framework for set-valued predictions with coverage-size trade-offs. Efficient optimization algorithms demonstrated on synthetic classification and regression datasets, showing improvements over marginal coverage approaches.

Conclusion: The proposed Choquet integral-based loss functions provide a principled approach for set predictions with explicit uncertainty quantification, enabling optimal coverage-size trade-offs and conditional coverage optimization, outperforming marginal coverage methods.

Abstract: We consider supervised learning problems in which set predictions provide explicit uncertainty estimates. Using Choquet integrals (a.k.a. Lov{á}sz extensions), we propose a convex loss function for nondecreasing subset-valued functions obtained as level sets of a real-valued function. This loss function allows optimal trade-offs between conditional probabilistic coverage and the ‘‘size’’ of the set, measured by a non-decreasing submodular function. We also propose several extensions that mimic loss functions and criteria for binary classification with asymmetric losses, and show how to naturally obtain sets with optimized conditional coverage. We derive efficient optimization algorithms, either based on stochastic gradient descent or reweighted least-squares formulations, and illustrate our findings with a series of experiments on synthetic datasets for classification and regression tasks, showing improvements over approaches that aim for marginal coverage.

[577] RP-CATE: Recurrent Perceptron-based Channel Attention Transformer Encoder for Industrial Hybrid Modeling

Haoran Yang, Yinan Zhang, Wenjie Zhang, Dongxia Wang, Peiyu Liu, Yuqi Ye, Kexin Chen, Wenhai Wang

Main category: cs.LG

TL;DR: RP-CATE: A novel hybrid modeling architecture combining channel attention, recurrent perceptron modules, and pseudo-image data for industrial applications, outperforming baselines in chemical engineering tasks.

Details

Motivation: Current industrial hybrid modeling methods have two main limitations: 1) they focus on single ML methods for specific tasks rather than comprehensive architectures, and 2) they fail to exploit underlying associations (monotonicity, periodicity) in industrial datasets, degrading predictive performance.

Method: Proposes RP-CATE (Recurrent Perceptron-based Channel Attention Transformer Encoder) with three innovations: 1) replaces self-attention with channel attention and adds Recurrent Perceptron Module to Transformer, 2) introduces Pseudo-Image Data (PID) with cyclic sliding window generation, 3) creates Pseudo-Sequential Data (PSD) conversion method to capture industrial dataset associations.

Result: RP-CATE achieves the best performance compared to other baseline models in chemical engineering hybrid modeling experiments.

Conclusion: RP-CATE effectively addresses limitations of existing industrial hybrid modeling by providing a comprehensive architecture that better captures underlying associations in industrial data, demonstrating superior performance in chemical engineering applications.

Abstract: Nowadays, industrial hybrid modeling which integrates both mechanistic modeling and machine learning-based modeling techniques has attracted increasing interest from scholars due to its high accuracy, low computational cost, and satisfactory interpretability. Nevertheless, the existing industrial hybrid modeling methods still face two main limitations. First, current research has mainly focused on applying a single machine learning method to one specific task, failing to develop a comprehensive machine learning architecture suitable for modeling tasks, which limits their ability to effectively represent complex industrial scenarios. Second, industrial datasets often contain underlying associations (e.g., monotonicity or periodicity) that are not adequately exploited by current research, which can degrade model’s predictive performance. To address these limitations, this paper proposes the Recurrent Perceptron-based Channel Attention Transformer Encoder (RP-CATE), with three distinctive characteristics: 1: We developed a novel architecture by replacing the self-attention mechanism with channel attention and incorporating our proposed Recurrent Perceptron (RP) Module into Transformer, achieving enhanced effectiveness for industrial modeling tasks compared to the original Transformer. 2: We proposed a new data type called Pseudo-Image Data (PID) tailored for channel attention requirements and developed a cyclic sliding window method for generating PID. 3: We introduced the concept of Pseudo-Sequential Data (PSD) and a method for converting industrial datasets into PSD, which enables the RP Module to capture the underlying associations within industrial dataset more effectively. An experiment aimed at hybrid modeling in chemical engineering was conducted by using RP-CATE and the experimental results demonstrate that RP-CATE achieves the best performance compared to other baseline models.

[578] Beyond Sliding Windows: Learning to Manage Memory in Non-Markovian Environments

Geraud Nangue Tasse, Matthew Riemer, Benjamin Rosman, Tim Klinger

Main category: cs.LG

TL;DR: Adaptive Stacking meta-algorithm reduces compute/memory for agents in non-Markovian environments by maintaining small adaptive memory stacks instead of fixed large observation windows.

Details

Motivation: Real-world agents face highly non-Markovian dependencies requiring large observation windows (Frame Stacking), leading to prohibitive computational and memory costs. Many environments only depend on relatively few observations over time, suggesting adaptive memory management could provide substantial savings.

Method: Proposed Adaptive Stacking meta-algorithm that maintains small adaptive stacks of memories, learning to remove non-predictive memories while retaining important experiences. Provides convergence guarantees and quantifies reduced compute/memory for MLP, LSTM, and Transformer agents.

Result: Experiments on memory tasks with controlled non-Markovian dependencies show the meta-algorithm learns to remove non-predictive memories without excessive removal of important experiences, achieving substantial compute and memory savings.

Conclusion: Adaptive Stacking offers a principled approach to handle non-Markovian dependencies with reduced computational and memory requirements compared to traditional Frame Stacking, enabling more efficient deployment of agents in complex real-world environments.

Abstract: Recent success in developing increasingly general purpose agents based on sequence models has led to increased focus on the problem of deploying computationally limited agents within the vastly more complex real-world. A key challenge experienced in these more realistic domains is highly non-Markovian dependencies with respect to the agent’s observations, which are less common in small controlled domains. The predominant approach for dealing with this in the literature is to stack together a window of the most recent observations (Frame Stacking), but this window size must grow with the degree of non-Markovian dependencies, which results in prohibitive computational and memory requirements for both action inference and learning. In this paper, we are motivated by the insight that in many environments that are highly non-Markovian with respect to time, the environment only causally depends on a relatively small number of observations over that time-scale. A natural direction would then be to consider meta-algorithms that maintain relatively small adaptive stacks of memories such that it is possible to express highly non-Markovian dependencies with respect to time while considering fewer observations at each step and thus experience substantial savings in both compute and memory requirements. Hence, we propose a meta-algorithm (Adaptive Stacking) for achieving exactly that with convergence guarantees and quantify the reduced computation and memory constraints for MLP, LSTM, and Transformer-based agents. Our experiments utilize popular memory tasks, which give us control over the degree of non-Markovian dependencies. This allows us to demonstrate that an appropriate meta-algorithm can learn the removal of memories not predictive of future rewards without excessive removal of important experiences. Code: https://github.com/geraudnt/adaptive-stacking

[579] Practical Quantum-Classical Feature Fusion for complex data Classification

Azadeh Alavi, Fatemeh Kouchmeshki, Abdolrahman Alavi

Main category: cs.LG

TL;DR: The paper proposes a cross-attention mid-fusion architecture for hybrid quantum-classical learning that treats quantum and classical branches as distinct modalities, outperforming isolated quantum and standard hybrid approaches on complex datasets.

Details

Motivation: Current hybrid quantum-classical architectures treat quantum circuits as isolated feature extractors and simply concatenate measurements with classical representations, neglecting that quantum and classical branches are distinct computational modalities. This limits performance on complex, high-dimensional tabular and semi-structured data in domains like remote sensing, environmental monitoring, and medical diagnostics.

Method: Proposes a multimodal formulation of hybrid learning with a cross-attention mid-fusion architecture where classical representations query quantum-derived feature tokens through an attention block with residual connectivity. The quantum branch is kept within practical NISQ budgets using up to nine qubits.

Result: Evaluated on Wine, Breast Cancer, Forest CoverType, FashionMNIST, and SteelPlatesFaults datasets. Pure quantum and standard hybrid designs underperform due to measurement-induced information loss, while cross-attention mid-fusion is consistently competitive and improves performance on more complex datasets in most cases.

Conclusion: Quantum-derived information becomes most valuable when integrated through principled multimodal fusion rather than used in isolation or loosely appended to classical features. The cross-attention approach effectively leverages the distinct computational modalities of quantum and classical systems.

Abstract: Hybrid quantum and classical learning aims to couple quantum feature maps with the robustness of classical neural networks, yet most architectures treat the quantum circuit as an isolated feature extractor and merge its measurements with classical representations by direct concatenation. This neglects that the quantum and classical branches constitute distinct computational modalities and limits reliable performance on complex, high dimensional tabular and semi structured data, including remote sensing, environmental monitoring, and medical diagnostics. We present a multimodal formulation of hybrid learning and propose a cross attention mid fusion architecture in which a classical representation queries quantum derived feature tokens through an attention block with residual connectivity. The quantum branch is kept within practical NISQ budgets and uses up to nine qubits. We evaluate on Wine, Breast Cancer, Forest CoverType, FashionMNIST, and SteelPlatesFaults, comparing a quantum only model, a classical baseline, residual hybrid models, and the proposed mid fusion model under a consistent protocol. Pure quantum and standard hybrid designs underperform due to measurement induced information loss, while cross attention mid fusion is consistently competitive and improves performance on the more complex datasets in most cases. These findings suggest that quantum derived information becomes most valuable when integrated through principled multimodal fusion rather than used in isolation or loosely appended to classical features.

[580] Operator-Based Generalization Bound for Deep Learning: Insights on Multi-Task Learning

Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos

Main category: cs.LG

TL;DR: Novel generalization bounds for vector-valued neural networks and deep kernel methods using operator-theoretic framework, with Koopman-based approach and sketching techniques for computational efficiency.

Details

Motivation: To address the relatively unexplored area of generalization properties in multitask learning with deep learning architectures, particularly for vector-valued neural networks and deep kernel methods, where traditional norm-based bounds may be insufficient.

Method: Combines Koopman-based approach with existing techniques, introduces sketching techniques for computational efficiency, and proposes deep vector-valued reproducing kernel Hilbert spaces (vvRKHS) leveraging Perron Frobenius operators. Derives new Rademacher generalization bounds with kernel refinement strategies.

Result: Achieves tighter generalization guarantees compared to traditional norm-based bounds, provides excess risk bounds under generic Lipschitz losses, and offers performance guarantees for applications including robust and multiple quantile regression.

Conclusion: The work provides novel insights into generalization properties of multitask learning with deep architectures, introducing computationally efficient methods and theoretical frameworks that address both underfitting and overfitting through kernel refinement strategies.

Abstract: This paper presents novel generalization bounds for vector-valued neural networks and deep kernel methods, focusing on multi-task learning through an operator-theoretic framework. Our key development lies in strategically combining a Koopman based approach with existing techniques, achieving tighter generalization guarantees compared to traditional norm-based bounds. To mitigate computational challenges associated with Koopman-based methods, we introduce sketching techniques applicable to vector valued neural networks. These techniques yield excess risk bounds under generic Lipschitz losses, providing performance guarantees for applications including robust and multiple quantile regression. Furthermore, we propose a novel deep learning framework, deep vector-valued reproducing kernel Hilbert spaces (vvRKHS), leveraging Perron Frobenius (PF) operators to enhance deep kernel methods. We derive a new Rademacher generalization bound for this framework, explicitly addressing underfitting and overfitting through kernel refinement strategies. This work offers novel insights into the generalization properties of multitask learning with deep learning architectures, an area that has been relatively unexplored until recent developments.

[581] Causal Heterogeneous Graph Learning Method for Chronic Obstructive Pulmonary Disease Prediction

Leming Zhou, Zuo Wang, Zhigang Liu

Main category: cs.LG

TL;DR: CHGRL method combines causal inference with heterogeneous graph learning for COPD comorbidity risk prediction, achieving high detection accuracy.

Details

Motivation: Address insufficient diagnosis/treatment at grassroots level for COPD, where early identification/warning is deficient, leading to high prevalence/burden but low screening rates.

Method: Causal Heterogeneous Graph Representation Learning (CHGRL): 1) Construct heterogeneous graph of patient-disease interactions, 2) Build cause-aware architecture combining causal inference with heterogeneous graph learning, 3) Incorporate causal loss function with counterfactual reasoning and causal regularization.

Result: Proposed model demonstrates high detection accuracy compared to strong GNN baselines in experimental evaluation.

Conclusion: CHGRL method effectively improves COPD comorbidity risk prediction by integrating causal inference with heterogeneous graph learning, addressing current diagnostic deficiencies.

Abstract: Due to the insufficient diagnosis and treatment capabilities at the grassroots level, there are still deficiencies in the early identification and early warning of acute exacerbation of Chronic obstructive pulmonary disease (COPD), often resulting in a high prevalence rate and high burden, but the screening rate is relatively low. In order to gradually improve this situation. In this paper, this study develop a Causal Heterogeneous Graph Representation Learning (CHGRL) method for COPD comorbidity risk prediction method that: a) constructing a heterogeneous Our dataset includes the interaction between patients and diseases; b) A cause-aware heterogeneous graph learning architecture has been constructed, combining causal inference mechanisms with heterogeneous graph learning, which can support heterogeneous graph causal learning for different types of relationships; and c) Incorporate the causal loss function in the model design, and add counterfactual reasoning learning loss and causal regularization loss on the basis of the cross-entropy classification loss. We evaluate our method and compare its performance with strong GNN baselines. Following experimental evaluation, the proposed model demonstrates high detection accuracy.

[582] On the Koopman-Based Generalization Bounds for Multi-Task Deep Learning

Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos

Main category: cs.LG

TL;DR: The paper develops improved generalization bounds for multitask deep neural networks using operator theory, achieving tighter bounds than conventional norm-based methods by exploiting small condition numbers and introducing a tailored Sobolev space.

Details

Motivation: To provide more precise theoretical understanding of multitask deep learning by establishing tighter generalization bounds that overcome limitations of existing approaches, particularly in the context of kernel methods.

Method: Uses operator-theoretic techniques to analyze multitask deep neural networks, leverages small condition numbers in weight matrices, and introduces a tailored Sobolev space as an expanded hypothesis space to derive improved bounds.

Result: Develops a tighter generalization bound that outperforms both conventional norm-based methods and existing Koopman-based bounds, with the enhanced bound remaining valid even in single output settings.

Conclusion: The proposed framework offers more precise theoretical understanding of multitask deep learning while maintaining key advantages like flexibility and independence from network width, providing improved generalization guarantees for deep neural networks.

Abstract: The paper establishes generalization bounds for multitask deep neural networks using operator-theoretic techniques. The authors propose a tighter bound than those derived from conventional norm based methods by leveraging small condition numbers in the weight matrices and introducing a tailored Sobolev space as an expanded hypothesis space. This enhanced bound remains valid even in single output settings, outperforming existing Koopman based bounds. The resulting framework maintains key advantages such as flexibility and independence from network width, offering a more precise theoretical understanding of multitask deep learning in the context of kernel methods.

[583] MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning

Tao Zhang, Ziqian Zeng, Hao Peng, Huiping Zhuang, Cen Chen

Main category: cs.LG

TL;DR: MixKVQ is a query-aware mixed-precision quantization method for KV cache that identifies critical key channels needing higher precision while using per-token quantization for value cache, achieving near full-precision performance with reduced memory overhead.

Details

Motivation: Long CoT reasoning in LLMs causes substantial memory and latency overhead from extensive KV cache. Existing low-bit quantization methods suffer severe performance degradation on complex reasoning tasks, with fixed-precision struggling with outlier channels and current mixed-precision failing to accurately identify components requiring high precision.

Method: MixKVQ introduces a lightweight, query-aware algorithm that considers two factors: a key channel’s intrinsic quantization difficulty and its relevance to the query. It identifies and preserves critical key channels needing higher precision while applying per-token quantization for value cache.

Result: Experiments on complex reasoning datasets show MixKVQ significantly outperforms existing low-bit methods, achieving performance comparable to full-precision baseline at substantially reduced memory footprint.

Conclusion: An effective low-bit KV cache quantization strategy must consider both a key channel’s intrinsic quantization difficulty and its relevance to the query. MixKVQ provides a plug-and-play solution that enables efficient long CoT reasoning with minimal performance degradation.

Abstract: Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs), but this progress is accompanied by substantial memory and latency overhead from the extensive Key-Value (KV) cache. Although KV cache quantization is a promising compression technique, existing low-bit quantization methods often exhibit severe performance degradation on complex reasoning tasks. Fixed-precision quantization struggles to handle outlier channels in the key cache, while current mixed-precision strategies fail to accurately identify components requiring high-precision representation. We find that an effective low-bit KV cache quantization strategy must consider two factors: a key channel’s intrinsic quantization difficulty and its relevance to the query. Based on this insight, we propose MixKVQ, a novel plug-and-play method that introduces a lightweight, query-aware algorithm to identify and preserve critical key channels that need higher precision, while applying per-token quantization for value cache. Experiments on complex reasoning datasets demonstrate that our approach significantly outperforms existing low-bit methods, achieving performance comparable to a full-precision baseline at a substantially reduced memory footprint.

[584] Phase-space entropy at acquisition reflects downstream learnability

Xiu-Cheng Wang, Jun-Jie Zhanga, Nan Cheng, Long-Gang Pang, Taijiao Du, Deyu Meng

Main category: cs.LG

TL;DR: The paper proposes ΔS_B, a phase-space entropy metric that quantifies how acquisition preserves or destroys information for downstream learning, without requiring training.

Details

Motivation: Current learning systems depend on data structure present before training, but lack a general way to quantify how acquisition itself affects downstream learnability across different modalities.

Method: Develops ΔS_B, an acquisition-level scalar based on instrument-resolved phase space that measures how acquisition mixes or removes joint space-frequency structure at the instrument scale.

Result: ΔS_B correctly identifies phase-space coherence of periodic sampling as source of aliasing, and empirically predicts downstream reconstruction/recognition difficulty without training across masked image classification, accelerated MRI, and massive MIMO.

Conclusion: Phase-space entropy at acquisition reflects downstream learnability, enabling pre-training selection of sampling policies and providing a shared notion of information preservation across modalities.

Abstract: Modern learning systems work with data that vary widely across domains, but they all ultimately depend on how much structure is already present in the measurements before any model is trained. This raises a basic question: is there a general, modality-agnostic way to quantify how acquisition itself preserves or destroys the information that downstream learners could use? Here we propose an acquisition-level scalar $ΔS_{\mathcal B}$ based on instrument-resolved phase space. Unlike pixelwise distortion or purely spectral errors that often saturate under aggressive undersampling, $ΔS_{\mathcal B}$ directly quantifies how acquisition mixes or removes joint space–frequency structure at the instrument scale. We show theoretically that (ΔS_{\mathcal B}) correctly identifies the phase-space coherence of periodic sampling as the physical source of aliasing, recovering classical sampling-theorem consequences. Empirically, across masked image classification, accelerated MRI, and massive MIMO (including over-the-air measurements), $|ΔS_{\mathcal B}|$ consistently ranks sampling geometries and predicts downstream reconstruction/recognition difficulty \emph{without training}. In particular, minimizing $|ΔS_{\mathcal B}|$ enables zero-training selection of variable-density MRI mask parameters that matches designs tuned by conventional pre-reconstruction criteria. These results suggest that phase-space entropy at acquisition reflects downstream learnability, enabling pre-training selection of candidate sampling policies and as a shared notion of information preservation across modalities.

[585] Regression generation adversarial network based on dual data evaluation strategy for industrial application

Zesen Wang, Yonggang Li, Lijuan Lan

Main category: cs.LG

TL;DR: Proposes multi-task learning-based regression GAN framework for industrial soft sensing with insufficient data, integrating regression info into both generator and discriminator with shallow sharing mechanism and dual data evaluation strategy.

Details

Motivation: Industrial soft sensing suffers from insufficient data volume, reducing reliability. Traditional GANs don't account for label-feature mapping relationships, and existing solutions don't balance performance and efficiency simultaneously.

Method: Multi-task learning-based regression GAN framework that integrates regression information into both discriminator and generator, implements shallow sharing mechanism between discriminator and regressor, and uses dual data evaluation strategy for diverse sample generation.

Result: Method significantly enhances generated sample quality while improving algorithm efficiency, validated through four industrial soft sensing cases: wastewater treatment plants, surface water, CO2 absorption towers, and industrial gas turbines.

Conclusion: The proposed framework effectively addresses insufficient data in industrial soft sensing by generating high-quality diverse samples, improving both performance and efficiency compared to traditional GAN approaches.

Abstract: Soft sensing infers hard-to-measure data through a large number of easily obtainable variables. However, in complex industrial scenarios, the issue of insufficient data volume persists, which diminishes the reliability of soft sensing. Generative Adversarial Networks (GAN) are one of the effective solutions for addressing insufficient samples. Nevertheless, traditional GAN fail to account for the mapping relationship between labels and features, which limits further performance improvement. Although some studies have proposed solutions, none have considered both performance and efficiency simultaneously. To address these problems, this paper proposes the multi-task learning-based regression GAN framework that integrates regression information into both the discriminator and generator, and implements a shallow sharing mechanism between the discriminator and regressor. This approach significantly enhances the quality of generated samples while improving the algorithm’s operational efficiency. Moreover, considering the importance of training samples and generated samples, a dual data evaluation strategy is designed to make GAN generate more diverse samples, thereby increasing the generalization of subsequent modeling. The superiority of method is validated through four classic industrial soft sensing cases: wastewater treatment plants, surface water, $CO_2$ absorption towers, and industrial gas turbines.

[586] From Black-Box Tuning to Guided Optimization via Hyperparameters Interaction Analysis

Moncef Garouani, Ayah Barhrhouj

Main category: cs.LG

TL;DR: MetaSHAP is a scalable XAI method that uses meta-learning and Shapley values to provide dataset-aware hyperparameter tuning insights from a massive benchmark of ML pipelines.

Details

Motivation: Hyperparameter tuning is computationally expensive, and understanding hyperparameter importance and interactions is critical for efficient model development. Current methods lack interpretable, dataset-aware insights.

Method: MetaSHAP uses meta-learning over 9+ million evaluated ML pipelines, learns surrogate performance models from historical configurations, computes hyperparameter interactions using SHAP analysis, and derives interpretable tuning ranges from influential hyperparameters.

Result: Empirical validation on 164 classification datasets and 14 classifiers shows MetaSHAP produces reliable importance rankings and competitive performance when guiding Bayesian optimization.

Conclusion: MetaSHAP provides actionable, interpretable hyperparameter tuning insights that help practitioners prioritize which hyperparameters to tune, understand their directionality and interactions, and identify value ranges where influence is concentrated.

Abstract: Hyperparameters tuning is a fundamental, yet computationally expensive, step in optimizing machine learning models. Beyond optimization, understanding the relative importance and interaction of hyperparameters is critical to efficient model development. In this paper, we introduce MetaSHAP, a scalable semi-automated eXplainable AI (XAI) method, that uses meta-learning and Shapley values analysis to provide actionable and dataset-aware tuning insights. MetaSHAP operates over a vast benchmark of over 09 millions evaluated machine learning pipelines, allowing it to produce interpretable importance scores and actionable tuning insights that reveal how much each hyperparameter matters, how it interacts with others and in which value ranges its influence is concentrated. For a given algorithm and dataset, MetaSHAP learns a surrogate performance model from historical configurations, computes hyperparameters interactions using SHAP-based analysis, and derives interpretable tuning ranges from the most influential hyperparameters. This allows practitioners not only to prioritize which hyperparameters to tune, but also to understand their directionality and interactions. We empirically validate MetaSHAP on a diverse benchmark of 164 classification datasets and 14 classifiers, demonstrating that it produces reliable importance rankings and competitive performance when used to guide Bayesian optimization.

[587] Small Language Models as Compiler Experts: Auto-Parallelization for Heterogeneous Systems

Prathamesh Devadiga

Main category: cs.LG

TL;DR: Small language models (1B params) can effectively auto-parallelize code, achieving 6.81x average speedup and 43.25x peak performance on convolution operations, outperforming traditional compiler heuristics.

Details

Motivation: Traditional auto-parallelizing compilers use rigid heuristics that struggle with modern heterogeneous systems' complexity. Language models offer more flexible reasoning capabilities for complex optimization tasks.

Method: Evaluated three small language models (gemma3, llama3.2, qwen2.5) using six reasoning strategies on 11 real-world kernels from scientific computing, graph algorithms, and ML. Benchmarked against LLVM Polly, TVM, and Triton across 376 evaluations.

Result: Achieved average speedup of 6.81x, with peak performance of 43.25x on convolution operations. Demonstrated scalability, correctness verification using multiple sanitizers, and robustness across diverse compilers/hardware platforms.

Conclusion: Small, efficient language models can serve as powerful reasoning engines for complex compiler optimization tasks, offering superior performance over traditional heuristic-based approaches.

Abstract: Traditional auto-parallelizing compilers, reliant on rigid heuristics, struggle with the complexity of modern heterogeneous systems. This paper presents a comprehensive evaluation of small (approximately 1B parameter) language-model-driven compiler auto-parallelization. We evaluate three models: gemma3, llama3.2, and qwen2.5, using six reasoning strategies across 11 real-world kernels drawn from scientific computing, graph algorithms, and machine learning. Our system is benchmarked against strong compiler baselines, including LLVM Polly, TVM, and Triton. Across 376 total evaluations, the proposed approach achieves an average speedup of 6.81x and a peak performance of 43.25x on convolution operations. We analyze scalability, verify correctness using multiple sanitizers, and confirm robustness across diverse compilers and hardware platforms. Our results demonstrate that small, efficient language models can serve as powerful reasoning engines for complex compiler optimization tasks.

[588] Machine Unlearning in the Era of Quantum Machine Learning: An Empirical Study

Carla Crivoi, Radu Tudor Ionescu

Main category: cs.LG

TL;DR: First comprehensive study of machine unlearning in hybrid quantum-classical neural networks, adapting classical unlearning methods to quantum settings and introducing new quantum-tailored strategies.

Details

Motivation: Machine unlearning has been extensively studied in classical deep learning, but its behavior in variational quantum circuits and quantum-augmented architectures remains largely unexplored as quantum machine learning systems expand in scale and capability.

Method: Adapted a broad suite of classical unlearning methods (gradient-based, distillation-based, regularization-based, certified techniques) to quantum settings and introduced two new unlearning strategies tailored to hybrid quantum-classical models. Experiments conducted across Iris, MNIST, and Fashion-MNIST datasets under subset removal and full-class deletion scenarios.

Result: Quantum models can support effective unlearning, but outcomes strongly depend on circuit depth, entanglement structure, and task complexity. Shallow VQCs display high intrinsic stability with minimal memorization, while deeper hybrid models exhibit stronger trade-offs between utility, forgetting strength, and alignment with retrain oracle. Certain methods (EU-k, LCA, and Certified Unlearning) consistently provide the best balance across metrics.

Conclusion: Establishes baseline empirical insights into quantum machine unlearning and highlights the need for quantum-aware algorithms and theoretical guarantees as quantum machine learning systems continue to expand in scale and capability.

Abstract: We present the first comprehensive empirical study of machine unlearning (MU) in hybrid quantum-classical neural networks. While MU has been extensively explored in classical deep learning, its behavior within variational quantum circuits (VQCs) and quantum-augmented architectures remains largely unexplored. First, we adapt a broad suite of unlearning methods to quantum settings, including gradient-based, distillation-based, regularization-based and certified techniques. Second, we introduce two new unlearning strategies tailored to hybrid models. Experiments across Iris, MNIST, and Fashion-MNIST, under both subset removal and full-class deletion, reveal that quantum models can support effective unlearning, but outcomes depend strongly on circuit depth, entanglement structure, and task complexity. Shallow VQCs display high intrinsic stability with minimal memorization, whereas deeper hybrid models exhibit stronger trade-offs between utility, forgetting strength, and alignment with retrain oracle. We find that certain methods, e.g. EU-k, LCA, and Certified Unlearning, consistently provide the best balance across metrics. These findings establish baseline empirical insights into quantum machine unlearning and highlight the need for quantum-aware algorithms and theoretical guarantees, as quantum machine learning systems continue to expand in scale and capability. We publicly release our code at: https://github.com/CrivoiCarla/HQML.

[589] Digital Twin-Driven Zero-Shot Fault Diagnosis of Axial Piston Pumps Using Fluid-Borne Noise Signals

Chang Dong, Jianfeng Tao, Chengliang Liu

Main category: cs.LG

TL;DR: Proposes a digital twin-driven zero-shot fault diagnosis framework for axial piston pumps using fluid-borne noise signals, achieving >95% accuracy without real fault data.

Details

Motivation: Traditional fault diagnosis methods face limitations: data-driven approaches require extensive labeled fault data (often impractical), while model-based methods suffer from parameter uncertainties. There's a need for reliable fault diagnosis in fluid power systems without requiring real fault data.

Method: 1) Calibrates high-fidelity digital twin model using only healthy-state data; 2) Generates synthetic fault signals for training deep learning classifiers; 3) Uses physics-informed neural network (PINN) as virtual sensor for flow ripple estimation; 4) Integrates Grad-CAM to visualize neural network decision-making; 5) Optimizes kernel sizes in time-domain and time-frequency domain inputs for better feature extraction.

Result: Training on signals from calibrated DT model yields diagnostic accuracies exceeding 95% on real-world benchmarks, while uncalibrated models show significantly lower performance. Large kernels matching subsequence length in time-domain and small kernels in time-frequency domain enable higher accuracy by focusing on physically meaningful features.

Conclusion: The proposed digital twin-driven zero-shot fault diagnosis framework effectively addresses data scarcity in fault diagnosis by using only healthy-state data, achieving high diagnostic accuracy through calibrated DT models and optimized neural network architectures, making it practical for real-world applications.

Abstract: Axial piston pumps are crucial components in fluid power systems, where reliable fault diagnosis is essential for ensuring operational safety and efficiency. Traditional data-driven methods require extensive labeled fault data, which is often impractical to obtain, while model-based approaches suffer from parameter uncertainties. This paper proposes a digital twin (DT)-driven zero-shot fault diagnosis framework utilizing fluid-borne noise (FBN) signals. The framework calibrates a high-fidelity DT model using only healthy-state data, generates synthetic fault signals for training deep learning classifiers, and employs a physics-informed neural network (PINN) as a virtual sensor for flow ripple estimation. Gradient-weighted class activation mapping (Grad-CAM) is integrated to visualize the decision-making process of neural networks, revealing that large kernels matching the subsequence length in time-domain inputs and small kernels in time-frequency domain inputs enable higher diagnostic accuracy by focusing on physically meaningful features. Experimental validations demonstrate that training on signals from the calibrated DT model yields diagnostic accuracies exceeding 95% on real-world benchmarks, while uncalibrated models result in significantly lower performance, highlighting the framework’s effectiveness in data-scarce scenarios.

[590] Time-Vertex Machine Learning for Optimal Sensor Placement in Temporal Graph Signals: Applications in Structural Health Monitoring

Keivan Faghih Niresi, Jun Qing, Mengjie Zhao, Olga Fink

Main category: cs.LG

TL;DR: TVML framework combines graph signal processing, time-domain analysis, and machine learning for interpretable sensor placement in structural health monitoring.

Details

Motivation: As sensor networks grow in scale and complexity, identifying the most informative sensors becomes essential to reduce deployment costs without compromising monitoring quality. Conventional GSP approaches often overlook temporal dynamics of structural behavior.

Method: Proposed Time-Vertex Machine Learning (TVML) framework that integrates Graph Signal Processing (GSP), time-domain analysis, and machine learning to enable interpretable and efficient sensor placement by identifying representative nodes that minimize redundancy while preserving critical information.

Result: Evaluated on two bridge datasets for damage detection and time-varying graph signal reconstruction tasks. Results demonstrate effectiveness in enhancing SHM systems by providing robust, adaptive, and efficient sensor placement solution.

Conclusion: TVML framework provides a robust, adaptive, and efficient solution for sensor placement in structural health monitoring systems, overcoming limitations of conventional GSP approaches by incorporating temporal dynamics.

Abstract: Structural Health Monitoring (SHM) plays a crucial role in maintaining the safety and resilience of infrastructure. As sensor networks grow in scale and complexity, identifying the most informative sensors becomes essential to reduce deployment costs without compromising monitoring quality. While Graph Signal Processing (GSP) has shown promise by leveraging spatial correlations among sensor nodes, conventional approaches often overlook the temporal dynamics of structural behavior. To overcome this limitation, we propose Time-Vertex Machine Learning (TVML), a novel framework that integrates GSP, time-domain analysis, and machine learning to enable interpretable and efficient sensor placement by identifying representative nodes that minimize redundancy while preserving critical information. We evaluate the proposed approach on two bridge datasets for damage detection and time-varying graph signal reconstruction tasks. The results demonstrate the effectiveness of our approach in enhancing SHM systems by providing a robust, adaptive, and efficient solution for sensor placement.

[591] Alternative positional encoding functions for neural transformers

Ezequiel Lopez-Rubio, Macoris Decena-Gimenez, Rafael Marcos Luque-Baena

Main category: cs.LG

TL;DR: Proposes alternative periodic functions for positional encoding in transformers that outperform traditional sinusoidal functions.

Details

Motivation: Sinusoidal functions have been successful for positional encoding in transformers, but there may be better alternatives that preserve key properties while offering improved performance.

Method: Develops alternative periodic functions that maintain some key properties of sinusoidal functions but differ in fundamental ways, then tests them experimentally.

Result: Experimental results show the proposed alternative functions substantially outperform the original sinusoidal version for positional encoding.

Conclusion: The alternative periodic functions show promise and may have wider applications in transformer architectures beyond the tested scenarios.

Abstract: A key module in neural transformer-based deep architectures is positional encoding. This module enables a suitable way to encode positional information as input for transformer neural layers. This success has been rooted in the use of sinusoidal functions of various frequencies, in order to capture recurrent patterns of differing typical periods. In this work, an alternative set of periodic functions is proposed for positional encoding. These functions preserve some key properties of sinusoidal ones, while they depart from them in fundamental ways. Some tentative experiments are reported, where the original sinusoidal version is substantially outperformed. This strongly suggests that the alternative functions may have a wider use in other transformer architectures.

[592] A Logical View of GNN-Style Computation and the Role of Activation Functions

Pablo Barceló, Floris Geerts, Matthias Lanzinger, Klara Pakhomenko, Jan Van den Bussche

Main category: cs.LG

TL;DR: MPLang captures GNN computation via linear message passing and activations. A-MPLang (no activations) expresses walk-summed features. Bounded eventually-constant activations have equal expressive power and subsume prior logics. ReLU (unbounded) is strictly more powerful than eventually-constant activations for numerical queries when combined with linear layers.

Details

Motivation: To understand the expressive power of graph neural networks (GNNs) through the lens of MPLang, a declarative language capturing GNN computation. Specifically, to characterize how different activation functions (bounded vs. unbounded) affect the numerical and Boolean expressiveness of GNNs, particularly in the presence of linear layers.

Method: 1. Study A-MPLang (fragment without activation functions) and characterize its expressive power in terms of walk-summed features. 2. Analyze bounded activation functions, showing all eventually constant activations yield the same expressive power under mild conditions. 3. Prove expressive separation between unbounded (ReLU) and bounded (eventually constant) activations in the presence of linear layers.

Result: 1. A-MPLang’s expressive power characterized by walk-summed features. 2. All eventually constant activations have equal expressive power (both numerical and Boolean) and subsume previously established logics for GNNs. 3. First expressive separation: MPLang with ReLU is strictly more powerful for numerical queries than with eventually constant activations when linear layers are present.

Conclusion: The choice of activation function significantly impacts GNN expressiveness. While eventually constant activations have uniform expressive power, unbounded activations like ReLU provide strictly greater numerical expressiveness when combined with linear layers, establishing that GNNs using ReLU are more expressive than those restricted to eventually constant activations and linear layers.

Abstract: We study the numerical and Boolean expressiveness of MPLang, a declarative language that captures the computation of graph neural networks (GNNs) through linear message passing and activation functions. We begin with A-MPLang, the fragment without activation functions, and give a characterization of its expressive power in terms of walk-summed features. For bounded activation functions, we show that (under mild conditions) all eventually constant activations yield the same expressive power - numerical and Boolean - and that it subsumes previously established logics for GNNs with eventually constant activation functions but without linear layers. Finally, we prove the first expressive separation between unbounded and bounded activations in the presence of linear layers: MPLang with ReLU is strictly more powerful for numerical queries than MPLang with eventually constant activation functions, e.g., truncated ReLU. This hinges on subtle interactions between linear aggregation and eventually constant non-linearities, and it establishes that GNNs using ReLU are more expressive than those restricted to eventually constant activations and linear layers.

[593] Interpretable Hybrid Deep Q-Learning Framework for IoT-Based Food Spoilage Prediction with Synthetic Data Generation and Hardware Validation

Isshaan Singh, Divyansh Chawla, Anshu Garg, Shivin Mangal, Pallavi Gupta, Khushi Agarwal, Nimrat Singh Khalsa, Nandan Patel

Main category: cs.LG

TL;DR: Proposes hybrid reinforcement learning framework with LSTM-RNN for real-time spoilage prediction in IoT food supply chains, featuring interpretable AI principles and rule-based labeling for transparent decision-making.

Details

Motivation: Modern IoT-driven food supply chains need intelligent real-time spoilage prediction systems, as perishable goods are highly susceptible to environmental conditions. Existing methods lack adaptability to dynamic conditions and fail to optimize decision-making in real time.

Method: Hybrid reinforcement learning framework integrating LSTM and RNN for temporal dependency capture in sensor data. Uses rule-based classifier environment for transparent ground truth labeling of spoilage levels based on domain-specific thresholds. Monitors model with interpretability-driven metrics (spoilage accuracy, reward-to-step ratio, loss reduction rate, exploration decay) and class-wise spoilage distribution visualization.

Result: LSTM and RNN-based agent outperforms alternative reinforcement learning approaches in prediction accuracy and decision efficiency while maintaining interpretability. Extensive evaluations on simulated and real-time hardware data demonstrate effectiveness.

Conclusion: Hybrid deep reinforcement learning with integrated interpretability shows potential for scalable IoT-based food monitoring systems, addressing critical needs in perishable goods supply chains.

Abstract: The need for an intelligent, real-time spoilage prediction system has become critical in modern IoT-driven food supply chains, where perishable goods are highly susceptible to environmental conditions. Existing methods often lack adaptability to dynamic conditions and fail to optimize decision making in real time. To address these challenges, we propose a hybrid reinforcement learning framework integrating Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNN) for enhanced spoilage prediction. This hybrid architecture captures temporal dependencies within sensor data, enabling robust and adaptive decision making. In alignment with interpretable artificial intelligence principles, a rule-based classifier environment is employed to provide transparent ground truth labeling of spoilage levels based on domain-specific thresholds. This structured design allows the agent to operate within clearly defined semantic boundaries, supporting traceable and interpretable decisions. Model behavior is monitored using interpretability-driven metrics, including spoilage accuracy, reward-to-step ratio, loss reduction rate, and exploration decay. These metrics provide both quantitative performance evaluation and insights into learning dynamics. A class-wise spoilage distribution visualization is used to analyze the agents decision profile and policy behavior. Extensive evaluations on simulated and real-time hardware data demonstrate that the LSTM and RNN based agent outperforms alternative reinforcement learning approaches in prediction accuracy and decision efficiency while maintaining interpretability. The results highlight the potential of hybrid deep reinforcement learning with integrated interpretability for scalable IoT-based food monitoring systems.

[594] From Points to Coalitions: Hierarchical Contrastive Shapley Values for Prioritizing Data Samples

Canran Xiao, Jiabao Dou, Zhiming Lin, Zong Ke, Liwei Hou

Main category: cs.LG

TL;DR: HCDV is a hierarchical contrastive data valuation framework that efficiently computes Shapley-style values for large datasets by learning geometry-preserving representations, building balanced hierarchies, and using local Monte-Carlo games, reducing complexity from O(n!) to O(T K_max log n).

Details

Motivation: Classical Data-Shapley has O(n!) complexity and point-wise perspective that doesn't scale to modern large, heterogeneous, geometrically structured datasets, creating a need for efficient data valuation methods.

Method: Three-stage framework: (1) learns contrastive, geometry-preserving representations, (2) organizes data into balanced coarse-to-fine hierarchy of clusters, (3) assigns Shapley-style payoffs via local Monte-Carlo games with downward budget propagation.

Result: HCDV reduces valuation time by up to 100x, lifts accuracy by up to +5 percentage points, satisfies Shapley axioms approximately with O(eta log n) surplus loss, and supports tasks like augmentation filtering, streaming updates, and fair marketplace payouts.

Conclusion: HCDV provides an efficient, scalable solution for data valuation in modern large-scale applications, overcoming the computational limitations of classical Data-Shapley while maintaining theoretical guarantees and practical utility across diverse domains.

Abstract: How should we quantify the value of each training example when datasets are large, heterogeneous, and geometrically structured? Classical Data-Shapley answers in principle, but its O(n!) complexity and point-wise perspective are ill-suited to modern scales. We propose Hierarchical Contrastive Data Valuation (HCDV), a three-stage framework that (i) learns a contrastive, geometry-preserving representation, (ii) organizes the data into a balanced coarse-to-fine hierarchy of clusters, and (iii) assigns Shapley-style payoffs to coalitions via local Monte-Carlo games whose budgets are propagated downward. HCDV collapses the factorial burden to O(T sum_{l} K_{l}) = O(T K_max log n), rewards examples that sharpen decision boundaries, and regularizes outliers through curvature-based smoothness. We prove that HCDV approximately satisfies the four Shapley axioms with surplus loss O(eta log n), enjoys sub-Gaussian coalition deviation tilde O(1/sqrt{T}), and incurs at most k epsilon_infty regret for top-k selection. Experiments on four benchmarks–tabular, vision, streaming, and a 45M-sample CTR task–plus the OpenDataVal suite show that HCDV lifts accuracy by up to +5 pp, slashes valuation time by up to 100x, and directly supports tasks such as augmentation filtering, low-latency streaming updates, and fair marketplace payouts.

[595] Sprecher Networks: A Parameter-Efficient Kolmogorov-Arnold Architecture

Christian Hägg, Kathlén Kohn, Giovanni Luca Marchetti, Boris Shapiro

Main category: cs.LG

TL;DR: Sprecher Networks (SNs) are a new neural architecture family based on Kolmogorov-Arnold-Sprecher theory, using shared learnable splines with shift parameters and mixing weights for parameter and memory efficiency.

Details

Motivation: To create more parameter-efficient and memory-efficient neural architectures than traditional MLPs and KANs, inspired by classical function approximation theory, while enabling wider networks under memory constraints.

Method: SNs use shared learnable splines (monotonic/general) in structured blocks with explicit shift parameters and mixing weights, directly implementing Sprecher’s 1965 formula in single-layer variant and extending to deeper multi-layer compositions with optional lateral mixing connections.

Result: SNs achieve O(LN + LG) parameter scaling vs MLPs’ O(LN²), reduce peak forward-intermediate memory from O(N²) to O(N), and enable much wider architectures under memory constraints while maintaining performance.

Conclusion: Sprecher Networks provide a theoretically motivated, parameter and memory-efficient alternative to MLPs and KANs, with practical advantages for training wider networks under memory limitations.

Abstract: We present Sprecher Networks (SNs), a family of trainable neural architectures inspired by the classical Kolmogorov-Arnold-Sprecher (KAS) construction for approximating multivariate continuous functions. Distinct from Multi-Layer Perceptrons (MLPs) with fixed node activations and Kolmogorov-Arnold Networks (KANs) featuring learnable edge activations, SNs utilize shared, learnable splines (monotonic and general) within structured blocks incorporating explicit shift parameters and mixing weights. Our approach directly realizes Sprecher’s specific 1965 sum of shifted splines formula in its single-layer variant and extends it to deeper, multi-layer compositions. We further enhance the architecture with optional lateral mixing connections that enable intra-block communication between output dimensions, providing a parameter-efficient alternative to full attention mechanisms. Beyond parameter efficiency with $O(LN + LG)$ scaling (where $G$ is the knot count of the shared splines) versus MLPs’ $O(LN^2)$, SNs admit a sequential evaluation strategy that reduces peak forward-intermediate memory from $O(N^2)$ to $O(N)$ (treating batch size as constant), making much wider architectures feasible under memory constraints. We demonstrate empirically that composing these blocks into deep networks leads to highly parameter and memory-efficient models, discuss theoretical motivations, and compare SNs with related architectures (MLPs, KANs, and networks with learnable node activations).

[596] Real-Time Machine Learning for Embedded Anomaly Detection

Abdelmadjid Benmachiche, Khadija Rais, Hamda Slimi

Main category: cs.LG

TL;DR: Survey paper reviewing lightweight machine learning methods for on-device anomaly detection in resource-constrained IoT environments, comparing algorithms and providing practical deployment recommendations.

Details

Motivation: The proliferation of resource-constrained IoT devices and embedded systems creates pressure for real-time anomaly detection at the edge, requiring algorithms that can operate within strict latency, memory, and power consumption constraints.

Method: Survey methodology comparing lightweight machine learning algorithms including Isolation Forest, One-Class SVM, recurrent architectures, and statistical techniques according to their embedded implementation realities and computational efficiency.

Result: Identifies significant trade-offs between accuracy and computational efficiency, showing how hardware constraints fundamentally redefine algorithm choice for anomaly detection in edge environments.

Conclusion: Provides practical recommendations for algorithm selection based on equipment profiles and TinyML trends, serving as a strategic roadmap for engineers deploying anomaly detection in bandwidth-constrained, safety-critical edge environments.

Abstract: The spread of a resource-constrained Internet of Things (IoT) environment and embedded devices has put pressure on the real-time detection of anomalies occurring at the edge. This survey presents an overview of machine-learning methods aimed specifically at on-device anomaly detection with extremely strict constraints for latency, memory, and power consumption. Lightweight algorithms such as Isolation Forest, One-Class SVM, recurrent architectures, and statistical techniques are compared here according to the realities of embedded implementation. Our survey brings out significant trade-offs of accuracy and computational efficiency of detection, as well as how hardware constraints end up fundamentally redefining algorithm choice. The survey is completed with a set of practical recommendations on the choice of the algorithm depending on the equipment profiles and new trends in TinyML, which can help close the gap between detection capabilities and embedded reality. The paper serves as a strategic roadmap for engineers deploying anomaly detection in edge environments that are constrained by bandwidth and may be safety-critical.

[597] Brain-Grounded Axes for Reading and Steering LLM States

Sandro Andric

Main category: cs.LG

TL;DR: Using human brain activity as a coordinate system to interpret and steer LLM states, creating brain-derived axes that provide interpretable control over LLM behavior.

Details

Motivation: Current LLM interpretability methods rely on textual supervision which lacks external grounding. The authors propose using human brain activity as an objective, biologically-grounded coordinate system for understanding and controlling LLM representations.

Method: Constructed a word-level brain atlas from MEG data (SMN4Lang dataset) using phase-locking value patterns, extracted latent axes via ICA. Validated axes with independent lexica and NER-based labels. Trained lightweight adapters to map LLM hidden states to brain axes without fine-tuning the LLM. Tested steering along brain-derived directions across multiple models.

Result: Found robust lexical (frequency-linked) axis in mid TinyLlama layer that survived perplexity-matched controls. Brain-vs-text probe comparison showed larger log-frequency shifts with lower perplexity for brain axis. Function/content axis (axis 13) showed consistent steering across TinyLlama, Qwen2-0.5B, and GPT-2. Axis structure remained stable when atlas was rebuilt with different features.

Conclusion: Neurophysiology-grounded axes provide interpretable and controllable handles for LLM behavior, offering a new interface for LLM interpretability that uses brain activity as an objective coordinate system rather than just textual supervision.

Abstract: Interpretability methods for large language models (LLMs) typically derive directions from textual supervision, which can lack external grounding. We propose using human brain activity not as a training signal but as a coordinate system for reading and steering LLM states. Using the SMN4Lang MEG dataset, we construct a word-level brain atlas of phase-locking value (PLV) patterns and extract latent axes via ICA. We validate axes with independent lexica and NER-based labels (POS/log-frequency used as sanity checks), then train lightweight adapters that map LLM hidden states to these brain axes without fine-tuning the LLM. Steering along the resulting brain-derived directions yields a robust lexical (frequency-linked) axis in a mid TinyLlama layer, surviving perplexity-matched controls, and a brain-vs-text probe comparison shows larger log-frequency shifts (relative to the text probe) with lower perplexity for the brain axis. A function/content axis (axis 13) shows consistent steering in TinyLlama, Qwen2-0.5B, and GPT-2, with PPL-matched text-level corroboration. Layer-4 effects in TinyLlama are large but inconsistent, so we treat them as secondary (Appendix). Axis structure is stable when the atlas is rebuilt without GPT embedding-change features or with word2vec embeddings (|r|=0.64-0.95 across matched axes), reducing circularity concerns. Exploratory fMRI anchoring suggests potential alignment for embedding change and log frequency, but effects are sensitive to hemodynamic modeling assumptions and are treated as population-level evidence only. These results support a new interface: neurophysiology-grounded axes provide interpretable and controllable handles for LLM behavior.

[598] Symplectic Reservoir Representation of Legendre Dynamics

Robert Simon Fong, Gouhei Tanaka, Kazuyuki Aihara

Main category: cs.LG

TL;DR: The paper proposes Symplectic Reservoir (SR), a reservoir computing architecture that preserves Legendre duality in representations through symplectic geometry, ensuring Hamiltonian structure is embedded at the representational level rather than just in outputs.

Details

Motivation: Current learning systems often leave implicit how internal representations encode physical structure. While recent work has begun to hard-wire Hamiltonian constraints at loss/output levels, the authors ask what it would mean for the representation itself to obey symplectic conservation laws from Hamiltonian mechanics.

Method: Express symplectic constraint through Legendre duality between primal and dual parameters. Formalize Legendre dynamics as stochastic processes on Legendre graphs. Prove maps preserving Legendre graphs are symplectomorphisms. Design Symplectic Reservoir (SR) - a recurrent neural network architecture with Hamiltonian-generated recurrent core that is linear in momentum.

Result: Main theorem shows every SR update has normal form that transports Legendre graphs to Legendre graphs, preserving Legendre duality at each time step. This includes linear time-invariant Gaussian process regression and Ornstein-Uhlenbeck dynamics as special cases.

Conclusion: SR implements a geometrically constrained, Legendre-preserving representation map that injects symplectic geometry and Hamiltonian mechanics directly at the representational level, ensuring long-term stability properties inherent to Hamiltonian systems.

Abstract: Modern learning systems act on internal representations of data, yet how these representations encode underlying physical or statistical structure is often left implicit. In physics, conservation laws of Hamiltonian systems such as symplecticity guarantee long-term stability, and recent work has begun to hard-wire such constraints into learning models at the loss or output level. Here we ask a different question: what would it mean for the representation itself to obey a symplectic conservation law in the sense of Hamiltonian mechanics? We express this symplectic constraint through Legendre duality: the pairing between primal and dual parameters, which becomes the structure that the representation must preserve. We formalize Legendre dynamics as stochastic processes whose trajectories remain on Legendre graphs, so that the evolving primal-dual parameters stay Legendre dual. We show that this class includes linear time-invariant Gaussian process regression and Ornstein-Uhlenbeck dynamics. Geometrically, we prove that the maps that preserve all Legendre graphs are exactly symplectomorphisms of cotangent bundles of the form “cotangent lift of a base diffeomorphism followed by an exact fibre translation”. Dynamically, this characterization leads to the design of a Symplectic Reservoir (SR), a reservoir-computing architecture that is a special case of recurrent neural network and whose recurrent core is generated by Hamiltonian systems that are at most linear in the momentum. Our main theorem shows that every SR update has this normal form and therefore transports Legendre graphs to Legendre graphs, preserving Legendre duality at each time step. Overall, SR implements a geometrically constrained, Legendre-preserving representation map, injecting symplectic geometry and Hamiltonian mechanics directly at the representational level.

[599] Research Program: Theory of Learning in Dynamical Systems

Elad Hazan, Shai Shalev Shwartz, Nathan Srebro

Main category: cs.LG

TL;DR: The paper proposes a framework for studying learnability in dynamical systems through next-token prediction, focusing on finite-sample guarantees based on system properties rather than statistical sequence properties.

Details

Motivation: Modern learning systems increasingly interact with evolving, state-dependent data, raising the fundamental question of when dynamical systems are learnable from observations alone. Current approaches often overlook the finite-sample nature of practical learning scenarios.

Method: Proposes a research program for understanding learnability through next-token prediction, formulating learnability for stochastic processes induced by dynamical systems with guarantees that hold uniformly after finite burn-in. Introduces dynamic learnability concept based on system structure properties like stability, mixing, observability, and spectral properties.

Result: Illustrates the framework with linear dynamical systems, showing accurate prediction can be achieved after finite observation without system identification using improper methods based on spectral filtering.

Conclusion: The paper establishes a foundation for studying learnability in dynamical systems, connects it to classical prediction theories, and suggests directions for extending to nonlinear and controlled systems.

Abstract: Modern learning systems increasingly interact with data that evolve over time and depend on hidden internal state. We ask a basic question: when is such a dynamical system learnable from observations alone? This paper proposes a research program for understanding learnability in dynamical systems through the lens of next-token prediction. We argue that learnability in dynamical systems should be studied as a finite-sample question, and be based on the properties of the underlying dynamics rather than the statistical properties of the resulting sequence. To this end, we give a formulation of learnability for stochastic processes induced by dynamical systems, focusing on guarantees that hold uniformly at every time step after a finite burn-in period. This leads to a notion of dynamic learnability which captures how the structure of a system, such as stability, mixing, observability, and spectral properties, governs the number of observations required before reliable prediction becomes possible. We illustrate the framework in the case of linear dynamical systems, showing that accurate prediction can be achieved after finite observation without system identification, by leveraging improper methods based on spectral filtering. We survey the relationship between learning in dynamical systems and classical PAC, online, and universal prediction theories, and suggest directions for studying nonlinear and controlled systems.

[600] Attention Is Not What You Need

Zhang Chong

Main category: cs.LG

TL;DR: The paper proposes Grassmann flow models as an attention-free alternative to Transformers, achieving competitive performance on language modeling and natural language inference tasks while offering better geometric interpretability.

Details

Motivation: To explore whether explicit self-attention is necessary for strong sequence modeling performance, and to develop more mathematically interpretable alternatives to the opaque tensor lifting mechanism of standard attention.

Method: Proposes Causal Grassmann layers that: (1) linearly reduce token states, (2) encode local token pairs as 2D subspaces on Grassmann manifold via Plücker coordinates, and (3) fuse geometric features back into hidden states through gated mixing. Information propagates via controlled deformations of low-rank subspaces over multi-scale local windows.

Result: On Wikitext-2, Grassmann models (13-18M params) achieve validation perplexities within 10-15% of size-matched Transformers. On SNLI, Grassmann-Plücker head on DistilBERT slightly outperforms Transformer head (0.8550/0.8538 vs 0.8545/0.8511 validation/test accuracy).

Conclusion: Grassmann-based designs offer a more structured, geometrically interpretable alternative to attention mechanisms, with competitive performance and linear scaling in sequence length for fixed rank, potentially enabling better invariant-based interpretations of neural reasoning.

Abstract: We revisit a basic question in sequence modeling: is explicit self-attention actually necessary for strong performance and reasoning? We argue that standard multi-head attention is best seen as a form of tensor lifting: hidden vectors are mapped into a high-dimensional space of pairwise interactions, and learning proceeds by constraining this lifted tensor through gradient descent. This mechanism is extremely expressive but mathematically opaque, because after many layers it becomes very hard to describe the model with a small family of explicit invariants. To explore an alternative, we propose an attention-free architecture based on Grassmann flows. Instead of forming an L by L attention matrix, our Causal Grassmann layer (i) linearly reduces token states, (ii) encodes local token pairs as two-dimensional subspaces on a Grassmann manifold via Plucker coordinates, and (iii) fuses these geometric features back into the hidden states through gated mixing. Information therefore propagates by controlled deformations of low-rank subspaces over multi-scale local windows, so the core computation lives on a finite-dimensional manifold rather than in an unstructured tensor space. On the Wikitext-2 language modeling benchmark, purely Grassmann-based models with 13 to 18 million parameters achieve validation perplexities within about 10 to 15 percent of size-matched Transformers. On the SNLI natural language inference task, a Grassmann-Plucker head on top of DistilBERT slightly outperforms a Transformer head, with best validation and test accuracies of 0.8550 and 0.8538 compared to 0.8545 and 0.8511. We analyze the complexity of Grassmann mixing, show linear scaling in sequence length for fixed rank, and argue that such manifold-based designs offer a more structured route toward geometric and invariant-based interpretations of neural reasoning.

[601] An Inverse Scattering Inspired Fourier Neural Operator for Time-Dependent PDE Learning

Rixin Yu

Main category: cs.LG

TL;DR: IS-FNO improves neural operator stability for nonlinear PDEs by incorporating inverse scattering transform principles, achieving better long-term predictions than FNO and Koopman models.

Details

Motivation: Existing neural operators (FNO, Koopman) struggle with long-term stability for chaotic, stiff, and long-horizon PDE systems due to unconstrained latent representations and cumulative rollout errors.

Method: IS-FNO enforces near-reversible lifting/projection maps via explicitly invertible neural transformations and models latent evolution using exponential Fourier layers that encode linear/nonlinear spectral dynamics.

Result: IS-FNO achieves lower short-term errors and substantially improved long-horizon stability on benchmark PDEs (Michelson-Sivashinsky, Kuramoto-Sivashinsky, KdV, KP equations). Reduced variants with analytical scattering structure maintain competitive accuracy for integrable systems.

Conclusion: Incorporating physical structure—particularly reversibility and spectral evolution—into neural operator design significantly enhances robustness and long-term predictive fidelity for nonlinear PDE dynamics.

Abstract: Learning accurate and stable time-advancement operators for nonlinear partial differential equations (PDEs) remains challenging, particularly for chaotic, stiff, and long-horizon dynamical systems. While neural operator methods such as the Fourier Neural Operator (FNO) and Koopman-inspired extensions achieve good short-term accuracy, their long-term stability is often limited by unconstrained latent representations and cumulative rollout errors. In this work, we introduce an inverse scattering inspired Fourier Neural Operator(IS-FNO), motivated by the reversibility and spectral evolution structure underlying the classical inverse scattering transform. The proposed architecture enforces a near-reversible pairing between lifting and projection maps through an explicitly invertible neural transformation, and models latent temporal evolution using exponential Fourier layers that naturally encode linear and nonlinear spectral dynamics. We systematically evaluate IS-FNO against baseline FNO and Koopman-based models on a range of benchmark PDEs, including the Michelson-Sivashinsky and Kuramoto-Sivashinsky equations (in one and two dimensions), as well as the integrable Korteweg-de Vries and Kadomtsev-Petviashvili equations. The results demonstrate that IS-FNO achieves lower short-term errors and substantially improved long-horizon stability in non-stiff regimes. For integrable systems, reduced IS-FNO variants that embed analytical scattering structure retain competitive long-term accuracy despite limited model capacity. Overall, this work shows that incorporating physical structure – particularly reversibility and spectral evolution – into neural operator design significantly enhances robustness and long-term predictive fidelity for nonlinear PDE dynamics.

[602] Binary Kernel Logistic Regression: a sparsity-inducing formulation and a convergent decomposition training algorithm

Antonio Consolo, Andrea Manno, Edoardo Amaldi

Main category: cs.LG

TL;DR: This paper proposes a sparse kernel logistic regression (KLR) method that achieves a good trade-off between prediction accuracy and sparsity while maintaining probabilistic outputs, using an extended formulation and an efficient SMO-type decomposition algorithm with global convergence guarantees.

Details

Motivation: Kernel logistic regression provides probabilistic class membership estimates but lacks sparsity unlike other kernel methods like SVMs. Existing sparse KLR approaches (Import Vector Machine, ℓ₁/₂ regularization) have limitations in achieving optimal accuracy-sparsity trade-off, which is important for practical applications.

Method: The authors extend Keerthi et al.’s training formulation for binary KLR to induce sparsity. They develop an efficient decomposition algorithm of Sequential Minimal Optimization (SMO) type that exploits second-order information and prove its global convergence.

Result: Numerical experiments on 12 datasets show the proposed binary KLR achieves competitive accuracy-sparsity trade-off compared to IVM, ℓ₁/₂-regularized KLR, and SVM, while retaining the advantage of providing probabilistic class membership estimates.

Conclusion: The proposed sparse KLR method successfully addresses the sparsity-accuracy trade-off challenge, offering both sparse models and good predictive performance with probabilistic outputs, making it a practical alternative to existing kernel methods.

Abstract: Kernel logistic regression (KLR) is a widely used supervised learning method for binary and multi-class classification, which provides estimates of the conditional probabilities of class membership for the data points. Unlike other kernel methods such as Support Vector Machines (SVMs), KLRs are generally not sparse. Previous attempts to deal with sparsity in KLR include a heuristic method referred to as the Import Vector Machine (IVM) and ad hoc regularizations such as the $\ell_{1/2}$-based one. Achieving a good trade-off between prediction accuracy and sparsity is still a challenging issue with a potential significant impact from the application point of view. In this work, we revisit binary KLR and propose an extension of the training formulation proposed by Keerthi et al., which is able to induce sparsity in the trained model, while maintaining good testing accuracy. To efficiently solve the dual of this formulation, we devise a decomposition algorithm of Sequential Minimal Optimization type which exploits second-order information, and for which we establish global convergence. Numerical experiments conducted on 12 datasets from the literature show that the proposed binary KLR approach achieves a competitive trade-off between accuracy and sparsity with respect to IVM, $\ell_{1/2}$-based regularization for KLR, and SVM while retaining the advantages of providing informative estimates of the class membership probabilities.

[603] Multi-Layer Confidence Scoring for Detection of Out-of-Distribution Samples, Adversarial Attacks, and In-Distribution Misclassifications

Lorenzo Capelli, Leandro de Souza Rosa, Gianluca Setti, Mauro Mangia, Riccardo Rovatti

Main category: cs.LG

TL;DR: MACS is a unified post-hoc framework that analyzes intermediate activations to produce classification-maps for confidence scoring, distribution shift detection, and adversarial attack detection without retraining existing models.

Details

Motivation: Deep neural networks lack transparency and trustworthiness in high-stakes domains, and existing confidence-based approaches require retraining. Post-hoc methods address individual problems separately, lacking a unified solution for existing models.

Method: Multi-Layer Analysis for Confidence Scoring (MACS) analyzes intermediate activations of pre-trained models to produce classification-maps, from which a unified score is derived for confidence estimation, distribution shift detection, and adversarial attack identification.

Result: MACS surpasses state-of-the-art approaches in experiments with VGG16 and ViTb16 models while requiring only a fraction of their computational overhead.

Conclusion: MACS provides a unified, efficient post-hoc framework for improving trustworthiness of existing deep neural networks without retraining, addressing confidence estimation, distribution shifts, and adversarial attacks simultaneously.

Abstract: The recent explosive growth in Deep Neural Networks applications raises concerns about the black-box usage of such models, with limited trasparency and trustworthiness in high-stakes domains, which have been crystallized as regulatory requirements such as the European Union Artificial Intelligence Act. While models with embedded confidence metrics have been proposed, such approaches cannot be applied to already existing models without retraining, limiting their broad application. On the other hand, post-hoc methods, which evaluate pre-trained models, focus on solving problems related to improving the confidence in the model’s predictions, and detecting Out-Of-Distribution or Adversarial Attacks samples as independent applications. To tackle the limited applicability of already existing methods, we introduce Multi-Layer Analysis for Confidence Scoring (MACS), a unified post-hoc framework that analyzes intermediate activations to produce classification-maps. From the classification-maps, we derive a score applicable for confidence estimation, detecting distributional shifts and adversarial attacks, unifying the three problems in a common framework, and achieving performances that surpass the state-of-the-art approaches in our experiments with the VGG16 and ViTb16 models with a fraction of their computational overhead.

[604] Lightweight Intrusion Detection in IoT via SHAP-Guided Feature Pruning and Knowledge-Distilled Kronecker Networks

Hafsa Benaddi, Mohammed Jouhari, Nouha Laamech, Anas Motii, Khalil Ibrahimi

Main category: cs.LG

TL;DR: Lightweight IoT intrusion detection system using SHAP-guided feature pruning and knowledge-distilled Kronecker networks achieves high accuracy with minimal resource usage.

Details

Motivation: IoT devices need intrusion detection systems that are both accurate and resource-efficient for edge deployment, but conventional deep learning IDS are too large and computationally intensive.

Method: Combines SHAP-guided feature pruning with knowledge-distilled Kronecker networks: a high-capacity teacher model identifies relevant features via SHAP explanations, then a compressed student uses Kronecker-structured layers to minimize parameters while preserving key inputs, with knowledge distillation transferring softened decision boundaries.

Result: On TON_IoT dataset, the student model is nearly 1000x smaller than the teacher yet maintains macro-F1 above 0.986 with millisecond-level inference latency.

Conclusion: Explainability-driven pruning and structured compression can jointly enable scalable, low-latency, and energy-efficient IDS for heterogeneous IoT environments.

Abstract: The widespread deployment of Internet of Things (IoT) devices requires intrusion detection systems (IDS) with high accuracy while operating under strict resource constraints. Conventional deep learning IDS are often too large and computationally intensive for edge deployment. We propose a lightweight IDS that combines SHAP-guided feature pruning with knowledge-distilled Kronecker networks. A high-capacity teacher model identifies the most relevant features through SHAP explanations, and a compressed student leverages Kronecker-structured layers to minimize parameters while preserving discriminative inputs. Knowledge distillation transfers softened decision boundaries from teacher to student, improving generalization under compression. Experiments on the TON_IoT dataset show that the student is nearly three orders of magnitude smaller than the teacher yet sustains macro-F1 above 0.986 with millisecond-level inference latency. The results demonstrate that explainability-driven pruning and structured compression can jointly enable scalable, low-latency, and energy-efficient IDS for heterogeneous IoT environments.

[605] Learning from sanctioned government suppliers: A machine learning and network science approach to detecting fraud and corruption in Mexico

Martí Medina-Hern ández, Janos Kertész, Mihály Fazekas

Main category: cs.LG

TL;DR: PU learning with network features outperforms traditional red flags for detecting corrupt procurement contracts in Mexico.

Details

Motivation: Detecting fraud in public procurement is challenging due to lack of confirmed non-corrupt examples, making conventional supervised ML inappropriate.

Method: Positive-unlabeled learning algorithms integrating domain-knowledge red flags with network-derived features using Mexican procurement data and company sanction records.

Result: Best PU model captures 32% more known positives and performs 2.3x better than random guessing, substantially outperforming traditional red flag approaches. Network features (core contracts, eigenvector centrality) are most important.

Conclusion: Methodology supports law enforcement in Mexico and can be adapted to other national contexts, with network features being crucial for fraud detection.

Abstract: Detecting fraud and corruption in public procurement remains a major challenge for governments worldwide. Most research to-date builds on domain-knowledge-based corruption risk indicators of individual contract-level features and some also analyzes contracting network patterns. A critical barrier for supervised machine learning is the absence of confirmed non-corrupt, negative, examples, which makes conventional machine learning inappropriate for this task. Using publicly available data on federally funded procurement in Mexico and company sanction records, this study implements positive-unlabeled (PU) learning algorithms that integrate domain-knowledge-based red flags with network-derived features to identify likely corrupt and fraudulent contracts. The best-performing PU model on average captures 32 percent more known positives and performs on average 2.3 times better than random guessing, substantially outperforming approaches based solely on traditional red flags. The analysis of the Shapley Additive Explanations reveals that network-derived features, particularly those associated with contracts in the network core or suppliers with high eigenvector centrality, are the most important. Traditional red flags further enhance model performance in line with expectations, albeit mainly for contracts awarded through competitive tenders. This methodology can support law enforcement in Mexico, and it can be adapted to other national contexts too.

[606] Kolmogorov-Arnold Graph Neural Networks Applied to Inorganic Nanomaterials Dataset

Nikita Volzhin, Soowhan Yoon

Main category: cs.LG

TL;DR: KAGNNs applied to inorganic nanomaterials dataset CHILI achieve state-of-the-art classification results, surpassing conventional GNNs.

Details

Motivation: Previous studies on KAN-based GNNs focused only on organic molecule datasets, leaving a gap in testing on inorganic nanomaterials datasets. This work aims to close that gap by applying KAGNNs to the CHILI inorganic nanomaterials dataset.

Method: Adapted and tested Kolmogorov-Arnold Graph Neural Networks (KAGNNs) specifically for the CHILI inorganic nanomaterials dataset, which includes CHILI-3K.

Result: KAGNNs substantially surpass conventional GNNs in classification on the CHILI datasets, achieving state-of-the-art results, particularly on CHILI-3K.

Conclusion: KAGNNs demonstrate superior performance on inorganic nanomaterials datasets, expanding the applicability of KAN-based models beyond organic molecules and establishing new state-of-the-art results for inorganic material classification.

Abstract: The recent development of Kolmogorov-Arnold Networks (KANs) introduced new discoveries in the field of Graph Neural Networks (GNNs), expanding the existing set of models with KAN-based versions of GNNs, which often surpass the accuracy of MultiLayer Perceptron (MLP)-based GNNs. These models were widely tested on the graph datasets consisting of organic molecules; however, those studies disregarded the inorganic nanomaterials datasets. In this work, we close this gap by applying Kolmogorov-Arnold Graph Neural Networks (KAGNNs) to a recently published large inorganic nanomaterials dataset called CHILI. For this, we adapt and test KAGNNs appropriate for this dataset. Our experiments reveal that on the CHILI datasets, particularly on the CHILI-3K, KAGNNs substantially surpass conventional GNNs in classification, achieving state-of-the-art results.

[607] DK-STN: A Domain Knowledge Embedded Spatio-Temporal Network Model for MJO Forecast

Hongliang Li, Nong Zhang, Zhewen Xu, Xiang Li, Changzheng Liu, Chongbo Zhao, Jie Wu

Main category: cs.LG

TL;DR: DK-STN is a neural network model that combines domain knowledge with spatio-temporal networks to achieve accurate 28-day MJO forecasting with high efficiency and stability, matching ECMWF accuracy while being much faster.

Details

Motivation: Current MJO prediction methods have limitations: NWP methods are resource-intensive, time-consuming, and unstable (season-sensitive), while ANN methods are efficient but cannot match the 28-day forecast accuracy of state-of-the-art NWP like ECMWF.

Method: Domain Knowledge Embedded Spatio-Temporal Network (DK-STN) builds on a spatial-temporal network (STN) and embeds domain knowledge through two approaches: (1) domain knowledge enhancement method, and (2) domain knowledge processing method integrated into network training.

Result: DK-STN can generate reliable 28-day MJO forecasts in 1-2 seconds using 7 days of ERA5 data as input, with only 2-3 days error across different seasons. It matches ECMWF’s forecast accuracy while being significantly more efficient and stable.

Conclusion: DK-STN successfully combines the benefits of NWP and ANN methods, improving ANN forecast accuracy to match state-of-the-art NWP while maintaining high efficiency and stability, making it a superior solution for MJO prediction.

Abstract: Understanding and predicting the Madden-Julian Oscillation (MJO) is fundamental for precipitation forecasting and disaster prevention. To date, long-term and accurate MJO prediction has remained a challenge for researchers. Conventional MJO prediction methods using Numerical Weather Prediction (NWP) are resource-intensive, time-consuming, and highly unstable (most NWP methods are sensitive to seasons, with better MJO forecast results in winter). While existing Artificial Neural Network (ANN) methods save resources and speed forecasting, their accuracy never reaches the 28 days predicted by the state-of-the-art NWP method, i.e., the operational forecasts from ECMWF, since neural networks cannot handle climate data effectively. In this paper, we present a Domain Knowledge Embedded Spatio-Temporal Network (DK-STN), a stable neural network model for accurate and efficient MJO forecasting. It combines the benefits of NWP and ANN methods and successfully improves the forecast accuracy of ANN methods while maintaining a high level of efficiency and stability. We begin with a spatial-temporal network (STN) and embed domain knowledge in it using two key methods: (i) applying a domain knowledge enhancement method and (ii) integrating a domain knowledge processing method into network training. We evaluated DK-STN with the 5th generation of ECMWF reanalysis (ERA5) data and compared it with ECMWF. Given 7 days of climate data as input, DK-STN can generate reliable forecasts for the following 28 days in 1-2 seconds, with an error of only 2-3 days in different seasons. DK-STN significantly exceeds ECMWF in that its forecast accuracy is equivalent to ECMWF’s, while its efficiency and stability are significantly superior.

[608] Toward Scalable and Valid Conditional Independence Testing with Spectral Representations

Alek Frohlich, Vladimir Kostic, Karim Lounici, Daniel Perazzo, Massimiliano Pontil

Main category: cs.LG

TL;DR: The paper proposes a new conditional independence test using representation learning from partial covariance operators, addressing limitations of existing kernel methods through a bi-level contrastive algorithm with theoretical guarantees.

Details

Motivation: Conditional independence testing is fundamental for causal inference and graphical modeling but faces limitations: existing tests rely on restrictive structural assumptions, while kernel methods using partial covariance operators suffer from poor adaptivity, slow convergence, and scalability issues.

Method: The approach learns representations from the singular value decomposition of the partial covariance operator, constructs a test statistic similar to HSIC, and introduces a practical bi-level contrastive algorithm to learn these representations efficiently.

Result: Theoretical analysis links representation learning error to test performance and establishes asymptotic validity and power guarantees. Preliminary experiments suggest the approach offers a practical, statistically grounded path toward scalable CI testing.

Conclusion: Representation learning can effectively address limitations of existing conditional independence tests, bridging kernel-based theory with modern representation learning to create scalable, adaptive, and theoretically sound CI testing methods.

Abstract: Conditional independence (CI) is central to causal inference, feature selection, and graphical modeling, yet it is untestable in many settings without additional assumptions. Existing CI tests often rely on restrictive structural conditions, limiting their validity on real-world data. Kernel methods using the partial covariance operator offer a more principled approach but suffer from limited adaptivity, slow convergence, and poor scalability. In this work, we explore whether representation learning can help address these limitations. Specifically, we focus on representations derived from the singular value decomposition of the partial covariance operator and use them to construct a simple test statistic, reminiscent of the Hilbert-Schmidt Independence Criterion (HSIC). We also introduce a practical bi-level contrastive algorithm to learn these representations. Our theory links representation learning error to test performance and establishes asymptotic validity and power guarantees. Preliminary experiments suggest that this approach offers a practical and statistically grounded path toward scalable CI testing, bridging kernel-based theory with modern representation learning.

[609] LacaDM: A Latent Causal Diffusion Model for Multiobjective Reinforcement Learning

Xueming Yan, Bo Yin, Yaochu Jin

Main category: cs.LG

TL;DR: LacaDM is a novel multiobjective reinforcement learning approach that uses latent causal diffusion models to learn temporal causal relationships between states and policies, enabling better adaptation and generalization in complex environments.

Details

Motivation: Traditional MORL methods struggle with conflicts between objectives and poor generalization in dynamic environments with large state-action spaces. Existing approaches mainly focus on objective conflicts but lack mechanisms for efficient knowledge transfer across different MORL scenarios.

Method: LacaDM learns latent temporal causal relationships between environmental states and policies, embedding these causal structures within a diffusion model-based framework. This enables knowledge transfer across diverse MORL scenarios while balancing conflicting objectives.

Result: Empirical evaluations on MOGymnasium tasks show LacaDM consistently outperforms state-of-the-art baselines in hypervolume, sparsity, and expected utility maximization metrics, demonstrating effectiveness in complex multiobjective tasks.

Conclusion: LacaDM successfully addresses MORL challenges by learning causal relationships for better generalization and adaptation, achieving superior performance in both discrete and continuous environments compared to existing methods.

Abstract: Multiobjective reinforcement learning (MORL) poses significant challenges due to the inherent conflicts between objectives and the difficulty of adapting to dynamic environments. Traditional methods often struggle to generalize effectively, particularly in large and complex state-action spaces. To address these limitations, we introduce the Latent Causal Diffusion Model (LacaDM), a novel approach designed to enhance the adaptability of MORL in discrete and continuous environments. Unlike existing methods that primarily address conflicts between objectives, LacaDM learns latent temporal causal relationships between environmental states and policies, enabling efficient knowledge transfer across diverse MORL scenarios. By embedding these causal structures within a diffusion model-based framework, LacaDM achieves a balance between conflicting objectives while maintaining strong generalization capabilities in previously unseen environments. Empirical evaluations on various tasks from the MOGymnasium framework demonstrate that LacaDM consistently outperforms the state-of-art baselines in terms of hypervolume, sparsity, and expected utility maximization, showcasing its effectiveness in complex multiobjective tasks.

[610] Initialization of a Polyharmonic Cascade, Launch and Testing

Yuriy N. Bakhvalov

Main category: cs.LG

TL;DR: The paper presents a universal initialization method for polyharmonic cascades using symmetric hyperoctahedral constellations, enabling stable training of deep networks (up to 500 layers) with simplified 2D linear algebra operations and strong performance on benchmark datasets.

Details

Motivation: To address the challenge of training very deep neural networks without skip connections by developing a theoretically-grounded initialization procedure that ensures stability and simplifies computations.

Method: Proposes symmetric constellation initialization in the form of hyperoctahedra with a central point for polyharmonic cascades, reducing all linear algebra to 2D operations that can be efficiently executed on GPUs.

Result: Achieves stable training of cascades with up to 500 layers without skip connections, with strong performance: 98.3% on MNIST (no convolutions/augmentations), AUC ~0.885 on HIGGS (11M examples), and AUC ~0.963 on Epsilon (2000 features).

Conclusion: The proposed initialization method enables scalable and robust training of deep polyharmonic cascades, with simplified computations and reproducible results provided in a public repository.

Abstract: This paper concludes a series of studies on the polyharmonic cascade, a deep machine learning architecture theoretically derived from indifference principles and the theory of random functions. A universal initialization procedure is proposed, based on symmetric constellations in the form of hyperoctahedra with a central point. This initialization not only ensures stable training of cascades with tens and hundreds of layers (up to 500 layers without skip connections), but also radically simplifies the computations. Scalability and robustness are demonstrated on MNIST (98.3% without convolutions or augmentations), HIGGS (AUC approximately 0.885 on 11M examples), and Epsilon (AUC approximately 0.963 with 2000 features). All linear algebra is reduced to 2D operations and is efficiently executed on GPUs. A public repository and an archived snapshot are provided for full reproducibility.

[611] Learning Continuous Solvent Effects from Transient Flow Data: A Graph Neural Network Benchmark on Catechol Rearrangement

Hongsheng Xing, Qiuxin Si

Main category: cs.LG

TL;DR: Proposes a hybrid GNN architecture with graph attention networks and mixture-aware solvent encodings that achieves 60% error reduction and >25× improvement over baselines for predicting reaction yields across continuous solvent compositions.

Details

Motivation: Traditional ML approaches treat solvent identity as discrete categorical variables, preventing systematic interpolation/extrapolation across solvent space. Need for predicting reaction outcomes across continuous solvent composition ranges in organic synthesis and process chemistry.

Method: Introduces Catechol Benchmark dataset (1,227 yield measurements in 24 pure solvents and binary mixtures). Proposes hybrid GNN architecture integrating Graph Attention Networks (GATs) with Differential Reaction Fingerprints (DRFP) and learned mixture-aware solvent encodings. Evaluates under leave-one-solvent-out and leave-one-mixture-out protocols.

Result: Hybrid GNN achieves MSE of 0.0039 (±0.0003), representing 60% error reduction over competitive baselines and >25× improvement over tabular ensembles. Classical tabular methods (MSE 0.099) and LLM embeddings (MSE 0.129) struggle. Ablation studies confirm molecular graph message-passing and continuous mixture encoding are essential.

Conclusion: Explicit molecular graph message-passing and continuous mixture encoding enable robust generalization across solvent compositions. The dataset and methods facilitate data-efficient reaction prediction and continuous solvent representation learning in organic synthesis.

Abstract: Predicting reaction outcomes across continuous solvent composition ranges remains a critical challenge in organic synthesis and process chemistry. Traditional machine learning approaches often treat solvent identity as a discrete categorical variable, which prevents systematic interpolation and extrapolation across the solvent space. This work introduces the \textbf{Catechol Benchmark}, a high-throughput transient flow chemistry dataset comprising 1,227 experimental yield measurements for the rearrangement of allyl-substituted catechol in 24 pure solvents and their binary mixtures, parameterized by continuous volume fractions ($% B$). We evaluate various architectures under rigorous leave-one-solvent-out and leave-one-mixture-out protocols to test generalization to unseen chemical environments. Our results demonstrate that classical tabular methods (e.g., Gradient-Boosted Decision Trees) and large language model embeddings (e.g., Qwen-7B) struggle with quantitative precision, yielding Mean Squared Errors (MSE) of 0.099 and 0.129, respectively. In contrast, we propose a hybrid GNN-based architecture that integrates Graph Attention Networks (GATs) with Differential Reaction Fingerprints (DRFP) and learned mixture-aware solvent encodings. This approach achieves an \textbf{MSE of 0.0039} ($\pm$ 0.0003), representing a 60% error reduction over competitive baselines and a $>25\times$ improvement over tabular ensembles. Ablation studies confirm that explicit molecular graph message-passing and continuous mixture encoding are essential for robust generalization. The complete dataset, evaluation protocols, and reference implementations are released to facilitate data-efficient reaction prediction and continuous solvent representation learning.

[612] Deep Learning for Unrelated-Machines Scheduling: Handling Variable Dimensions

Diego Hitzges, Guillaume Sagnol

Main category: cs.LG

TL;DR: A novel neural network approach for offline scheduling on unrelated parallel machines that handles variable job/machine counts and unique processing times, achieving near-optimal performance with strong generalization to larger instances.

Details

Motivation: Scheduling on unrelated parallel machines is challenging for deep learning due to varying numbers of jobs/machines and unique processing times for each job-machine pair, which dynamically changes feature dimensions. Existing online approaches process jobs sequentially, but a comprehensive offline solution is needed.

Method: Proposes a neural network tailored for offline deterministic scheduling that processes the entire input at once (not sequentially). Uses NLP-inspired architectures to handle any number of jobs/machines with varying feature dimensions from unrelated processing times. Enables supervised training on small instances with generalization to larger problems.

Result: On 8 jobs/4 machines instances: costs only 2.51% above optimal. Across configurations up to 100 jobs/10 machines: consistently outperformed advanced dispatching rule (22.22% higher costs on average). Shows strong generalization from small to large instances.

Conclusion: The method enables fast retraining with simulated data and adaptation to various scheduling conditions, potentially becoming a standard approach for learning-based scheduling on unrelated machines and similar problem environments.

Abstract: Deep learning has been effectively applied to many discrete optimization problems. However, learning-based scheduling on unrelated parallel machines remains particularly difficult to design. Not only do the numbers of jobs and machines vary, but each job-machine pair has a unique processing time, dynamically altering feature dimensions. We propose a novel approach with a neural network tailored for offline deterministic scheduling of arbitrary sizes on unrelated machines. The goal is to minimize a complex objective function that includes the makespan and the weighted tardiness of jobs and machines. Unlike existing online approaches, which process jobs sequentially, our method generates a complete schedule considering the entire input at once. The key contribution of this work lies in the sophisticated architecture of our model. By leveraging various NLP-inspired architectures, it effectively processes any number of jobs and machines with varying feature dimensions imposed by unrelated processing times. Our approach enables supervised training on small problem instances while demonstrating strong generalization to much larger scheduling environments. Trained and tested on instances with 8 jobs and 4 machines, costs were only 2.51% above optimal. Across all tested configurations of up to 100 jobs and 10 machines, our network consistently outperformed an advanced dispatching rule, which incurred 22.22% higher costs on average. As our method allows fast retraining with simulated data and adaptation to various scheduling conditions, we believe it has the potential to become a standard approach for learning-based scheduling on unrelated machines and similar problem environments.

[613] DFORD: Directional Feedback based Online Ordinal Regression Learning

Naresh Manwani, M Elamparithy, Tanish Taneja

Main category: cs.LG

TL;DR: Online ordinal regression algorithm using directional feedback (left/right hints) achieves O(log T) regret with exploration-exploitation and kernel variant for non-linear models.

Details

Motivation: Ordinal regression typically requires full label access, but directional feedback (indicating whether prediction is left or right of true label) provides a weaker supervision setting that's more practical in some real-world scenarios where exact labels are unavailable.

Method: Proposes an online algorithm with exploration-exploitation scheme for learning from directional feedback. Introduces kernel-based variant for non-linear models using truncation trick for memory efficiency. Maintains threshold ordering in expected sense.

Result: Algorithm achieves expected regret of O(log T). Experimental results on synthetic and real-world datasets show performance comparable (sometimes better) to full information counterpart despite weaker supervision.

Conclusion: Directional feedback provides an effective weak supervision alternative for ordinal regression, achieving strong performance with only left/right hints about prediction accuracy compared to true label.

Abstract: In this paper, we introduce directional feedback in the ordinal regression setting, in which the learner receives feedback on whether the predicted label is on the left or the right side of the actual label. This is a weak supervision setting for ordinal regression compared to the full information setting, where the learner can access the labels. We propose an online algorithm for ordinal regression using directional feedback. The proposed algorithm uses an exploration-exploitation scheme to learn from directional feedback efficiently. Furthermore, we introduce its kernel-based variant to learn non-linear ordinal regression models in an online setting. We use a truncation trick to make the kernel implementation more memory efficient. The proposed algorithm maintains the ordering of the thresholds in the expected sense. Moreover, it achieves the expected regret of $\mathcal{O}(\log T)$. We compare our approach with a full information and a weakly supervised algorithm for ordinal regression on synthetic and real-world datasets. The proposed approach, which learns using directional feedback, performs comparably (sometimes better) to its full information counterpart.

[614] CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal

Yongxin Wang, Zhicheng Yang, Meng Cao, Mingfei Han, Haokun Lin, Yingying Zhu, Xiaojun Chang, Xiaodan Liang

Main category: cs.LG

TL;DR: CARE is a failure-centric post-training framework that converts errors into supervision for multimodal reasoning, using anchored-contrastive objectives and reflection-guided resampling to improve accuracy and training smoothness.

Details

Motivation: Current RLVR methods waste informative failure data - gradients stall when all rollouts are wrong, and correct rollouts don't properly analyze why others were close-but-wrong, leading to misassigned credit to spurious chains.

Method: CARE combines: (1) anchored-contrastive objective that forms compact subgroups around best rollouts with hard negatives, performs within-subgroup z-score normalization with negative-only scaling, and includes all-negative rescue; (2) Reflection-Guided Resampling (RGR) that rewrites representative failures and re-scores them with the same verifier to convert near-misses into positives.

Result: CARE improves accuracy and training smoothness while increasing learning signal from failures. On Qwen2.5-VL-7B, it lifts macro-averaged accuracy by 4.6 points over GRPO across six verifiable visual-reasoning benchmarks; with Qwen3-VL-8B it reaches competitive or SOTA results on MathVista and MMMU-Pro.

Conclusion: CARE effectively turns errors into supervision through failure-centric post-training, demonstrating significant improvements in multimodal reasoning performance by better utilizing failure data that was previously wasted.

Abstract: Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has the failures. When all rollouts are wrong, gradients stall; when one happens to be correct, the update usually ignores why the others are close-but-wrong, and credit can be misassigned to spurious chains. We present CARE (Contrastive Anchored REflection), a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE combines: (i) an anchored-contrastive objective that forms a compact subgroup around the best rollout and a set of semantically proximate hard negatives, performs within-subgroup z-score normalization with negative-only scaling, and includes an all-negative rescue to prevent zero-signal batches; and (ii) Reflection-Guided Resampling (RGR), a one-shot structured self-repair that rewrites a representative failure and re-scores it with the same verifier, converting near-misses into usable positives without any test-time reflection. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures. On Qwen2.5-VL-7B, CARE lifts macro-averaged accuracy by 4.6 points over GRPO across six verifiable visual-reasoning benchmarks; with Qwen3-VL-8B it reaches competitive or state-of-the-art results on MathVista and MMMU-Pro under an identical evaluation protocol.

[615] KerJEPA: Kernel Discrepancies for Euclidean Self-Supervised Learning

Eric Zimmermann, Harley Wiltzer, Justin Szeto, David Alvarez-Melis, Lester Mackey

Main category: cs.LG

TL;DR: KerJEPAs introduce kernel-based regularizers for self-supervised learning, expanding beyond Euclidean representations to improve training stability and flexibility.

Details

Motivation: Recent JEPAs show that regularizing Euclidean representations toward isotropic Gaussian priors improves training stability and downstream generalization, but there's potential to expand beyond this specific regularization approach.

Method: Introduce KerJEPAs - a flexible family of self-supervised learning algorithms with kernel-based regularizers. Expand beyond Gaussian priors/kernels, compute closed-form high-dimensional limits of sliced MMDs, and develop alternative KerJEPAs with improved properties.

Result: Developed alternative KerJEPAs with favorable properties including improved training stability and design flexibility by expanding the class of viable kernels and priors.

Conclusion: KerJEPAs provide a more flexible framework for self-supervised learning with kernel-based regularization, offering improved training stability and design options beyond traditional Euclidean approaches.

Abstract: Recent breakthroughs in self-supervised Joint-Embedding Predictive Architectures (JEPAs) have established that regularizing Euclidean representations toward isotropic Gaussian priors yields provable gains in training stability and downstream generalization. We introduce a new, flexible family of KerJEPAs, self-supervised learning algorithms with kernel-based regularizers. One instance of this family corresponds to the recently-introduced LeJEPA Epps-Pulley regularizer which approximates a sliced maximum mean discrepancy (MMD) with a Gaussian prior and Gaussian kernel. By expanding the class of viable kernels and priors and computing the closed-form high-dimensional limit of sliced MMDs, we develop alternative KerJEPAs with a number of favorable properties including improved training stability and design flexibility.

[616] The Best of Both Worlds: Hybridizing Neural Operators and Solvers for Stable Long-Horizon Inference

Rajyasri Roy, Dibyajyoti Nayak, Somdatta Goswami

Main category: cs.LG

TL;DR: ANCHOR is an adaptive hybrid framework that combines neural operators with classical numerical solvers using residual-based error estimation to stabilize long-horizon PDE predictions and prevent compounding errors.

Details

Motivation: Neural operator surrogates for PDEs suffer from compounding errors in autoregressive frameworks, lack online error monitoring/correction mechanisms, and provide limited guarantees for individual inference trajectories beyond training horizons.

Method: ANCHOR uses a pretrained neural operator as primary inference engine, adaptively couples it with classical numerical solvers using a physics-informed, residual-based error estimator. It monitors exponential moving average of normalized PDE residuals to detect accumulating errors and trigger corrective solver interventions without ground-truth solutions.

Result: EMA-based estimator strongly correlates with true relative L2 error, enabling data-free instance-aware error control. Evaluations on 1D/2D Burgers’, 2D Allen-Cahn, and 3D heat conduction show ANCHOR reliably bounds long-horizon error growth, stabilizes extrapolative rollouts, and improves robustness over standalone neural operators while remaining more efficient than high-fidelity solvers.

Conclusion: ANCHOR provides an effective online hybrid framework for stable long-horizon PDE prediction that addresses compounding error issues in neural operators through adaptive coupling with classical solvers and residual-based error monitoring.

Abstract: Numerical simulation of time-dependent partial differential equations (PDEs) is central to scientific and engineering applications, but high-fidelity solvers are often prohibitively expensive for long-horizon or time-critical settings. Neural operator (NO) surrogates offer fast inference across parametric and functional inputs; however, most autoregressive NO frameworks remain vulnerable to compounding errors, and ensemble-averaged metrics provide limited guarantees for individual inference trajectories. In practice, error accumulation can become unacceptable beyond the training horizon, and existing methods lack mechanisms for online monitoring or correction. To address this gap, we propose ANCHOR (Adaptive Numerical Correction for High-fidelity Operator Rollouts), an online, instance-aware hybrid inference framework for stable long-horizon prediction of nonlinear, time-dependent PDEs. ANCHOR treats a pretrained NO as the primary inference engine and adaptively couples it with a classical numerical solver using a physics-informed, residual-based error estimator. Inspired by adaptive time-stepping in numerical analysis, ANCHOR monitors an exponential moving average (EMA) of the normalized PDE residual to detect accumulating error and trigger corrective solver interventions without requiring access to ground-truth solutions. We show that the EMA-based estimator correlates strongly with the true relative L2 error, enabling data-free, instance-aware error control during inference. Evaluations on four canonical PDEs: 1D and 2D Burgers’, 2D Allen-Cahn, and 3D heat conduction, demonstrate that ANCHOR reliably bounds long-horizon error growth, stabilizes extrapolative rollouts, and significantly improves robustness over standalone neural operators, while remaining substantially more efficient than high-fidelity numerical solvers.

[617] Deep Legendre Transform

Aleksey Minabutdinov, Patrick Cheridito

Main category: cs.LG

TL;DR: A deep learning method for computing convex conjugates of differentiable convex functions that overcomes curse of dimensionality and provides error estimates.

Details

Motivation: Traditional numerical methods for computing convex conjugates suffer from curse of dimensionality in high dimensions, while existing neural network approaches are mostly focused on optimal transport problems and require solving complex optimization problems.

Method: Uses an implicit Fenchel formulation of convex conjugation to create an efficient gradient-based framework for minimizing approximation errors, with symbolic regression using Kolmogorov-Arnold networks to obtain exact conjugates for specific functions.

Result: Numerical experiments show the method delivers accurate results across different high-dimensional examples and can obtain exact convex conjugates for specific convex functions through symbolic regression.

Conclusion: The proposed deep learning approach provides an efficient, scalable method for computing convex conjugates in high dimensions with built-in error estimation capabilities.

Abstract: We introduce a novel deep learning algorithm for computing convex conjugates of differentiable convex functions, a fundamental operation in convex analysis with various applications in different fields such as optimization, control theory, physics and economics. While traditional numerical methods suffer from the curse of dimensionality and become computationally intractable in high dimensions, more recent neural network-based approaches scale better, but have mostly been studied with the aim of solving optimal transport problems and require the solution of complicated optimization or max-min problems. Using an implicit Fenchel formulation of convex conjugation, our approach facilitates an efficient gradient-based framework for the minimization of approximation errors and, as a byproduct, also provides a posteriori error estimates for the approximation quality. Numerical experiments demonstrate our method’s ability to deliver accurate results across different high-dimensional examples. Moreover, by employing symbolic regression with Kolmogorov–Arnold networks, it is able to obtain the exact convex conjugates of specific convex functions.

[618] How Reliable are Causal Probing Interventions?

Marc Canby, Adam Davies, Chirag Rastogi, Julia Hockenmaier

Main category: cs.LG

TL;DR: Causal probing methods for analyzing foundation models face a completeness-selectivity tradeoff, with nonlinear interventions generally outperforming linear ones in reliability.

Details

Motivation: Recent theoretical doubts about causal probing methods create a need for systematic empirical evaluation to determine their practical effectiveness in analyzing foundation models.

Method: Defined completeness (thorough transformation of target property) and selectivity (minimal impact on non-target properties), introduced reliability as their harmonic mean, and created an empirical framework to measure and compare different causal probing methods (linear vs. nonlinear, concept removal vs. counterfactual interventions).

Result: All methods show completeness-selectivity tradeoff; more complete/reliable methods have greater impact on LLM behavior; nonlinear interventions are almost always more reliable than linear interventions.

Conclusion: The proposed framework enables systematic evaluation of causal probing methods, revealing inherent tradeoffs and demonstrating the superiority of nonlinear interventions for reliable analysis of foundation models.

Abstract: Causal probing aims to analyze foundation models by examining how intervening on their representation of various latent properties impacts their outputs. Recent works have cast doubt on the theoretical basis of several leading causal probing methods, but it has been unclear how to systematically evaluate the effectiveness of these methods in practice. To address this, we define two key causal probing desiderata: completeness (how thoroughly the representation of the target property has been transformed) and selectivity (how little non-targeted properties have been impacted). We find that there is an inherent tradeoff between the two, which we define as reliability, their harmonic mean. We introduce an empirical analysis framework to measure and evaluate these quantities, allowing us to make the first direct comparisons between different families of leading causal probing methods (e.g., linear vs. nonlinear, or concept removal vs. counterfactual interventions). We find that: (1) all methods show a clear tradeoff between completeness and selectivity; (2) more complete and reliable methods have a greater impact on LLM behavior; and (3) nonlinear interventions are almost always more reliable than linear interventions. Our project webpage is available at: https://ahdavies6.github.io/causal_probing_reliability/

[619] SAEs Are Good for Steering – If You Select the Right Features

Dana Arad, Aaron Mueller, Yonatan Belinkov

Main category: cs.LG

TL;DR: SAEs can be used for model steering, but current methods focus on input activations. The paper distinguishes between input features (capture input patterns) and output features (affect model output), proposes scores to identify them, and shows filtering low-output-score features improves steering performance 2-3x.

Details

Motivation: Current SAE steering methods rely on analyzing input token activations, but recent work shows activations alone don't fully capture a feature's effect on model output. There's a need to distinguish between features that mainly capture input patterns versus those that actually influence model outputs for effective steering.

Method: Proposed input and output scores to characterize and locate two types of features: input features (capture input patterns) and output features (have human-understandable effect on model output). Applied filtering to remove features with low output scores before using SAEs for steering.

Result: Found that high input and output scores rarely co-occur in the same features. After filtering out features with low output scores, obtained 2-3x improvements in steering performance with SAEs, making them competitive with supervised methods.

Conclusion: Distinguishing between input and output features in SAEs is crucial for effective steering. By focusing on output features (those with measurable effect on model outputs), SAE-based steering can achieve performance comparable to supervised methods without requiring labeled data.

Abstract: Sparse Autoencoders (SAEs) have been proposed as an unsupervised approach to learn a decomposition of a model’s latent space. This enables useful applications such as steering - influencing the output of a model towards a desired concept - without requiring labeled data. Current methods identify SAE features to steer by analyzing the input tokens that activate them. However, recent work has highlighted that activations alone do not fully describe the effect of a feature on the model’s output. In this work, we draw a distinction between two types of features: input features, which mainly capture patterns in the model’s input, and output features, which have a human-understandable effect on the model’s output. We propose input and output scores to characterize and locate these types of features, and show that high values for both scores rarely co-occur in the same features. These findings have practical implications: after filtering out features with low output scores, we obtain 2-3x improvements when steering with SAEs, making them competitive with supervised methods.

[620] Kronecker Factorization Improves Efficiency and Interpretability of Sparse Autoencoders

Vadim Kurochkin, Yaroslav Aksenov, Daniil Laptev, Daniil Gavrilov, Nikita Balagansky

Main category: cs.LG

TL;DR: KronSAE: Kronecker-factorized Sparse Autoencoder with mAND activation for efficient, interpretable language model analysis.

Details

Motivation: Sparse Autoencoders (SAEs) are promising for interpreting language model hidden states, but face computational challenges at scale, especially with large dictionary sizes. Encoders require intensive linear operations with large output dimensions, limiting practical application.

Method: Proposes KronSAE architecture that factorizes latent representation via Kronecker product decomposition to reduce memory/computational overhead. Introduces mAND, a differentiable activation function approximating binary AND operation to improve interpretability and performance in the factorized framework.

Result: The proposed approach drastically reduces memory and computational requirements while maintaining or improving interpretability and performance compared to traditional SAEs.

Conclusion: KronSAE with mAND activation enables more efficient and interpretable analysis of language model hidden states at scale, addressing key computational bottlenecks in SAE training and interpretation.

Abstract: Sparse Autoencoders (SAEs) have demonstrated significant promise in interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training and interpreting SAEs at scale remains challenging, especially when large dictionary sizes are used. While decoders can leverage sparse-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function approximating the binary AND operation, which improves interpretability and performance in our factorized framework.

[621] AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining

Hongyuan Dong, Dingkang Yang, Xiao Liang, Chao Feng, Jiao Ran

Main category: cs.LG

TL;DR: AdaLRS is an adaptive learning rate search algorithm that optimizes loss descent velocities online to find optimal learning rates for foundation model pretraining with minimal extra computation.

Details

Motivation: Existing learning rate transferability approaches are limited to specific training scenarios and require extensive hyperparameter tuning on proxy models, creating a need for a more general and efficient solution.

Method: AdaLRS uses online optimal learning rate search by optimizing loss descent velocities, leveraging the theoretical insight that foundation model pretraining loss and its descent velocity share the same optimal learning rate. It’s a plug-and-play algorithm that requires minimal extra computation.

Result: Experiments on LLM and VLM pretraining show AdaLRS efficiently adjusts suboptimal learning rates to near-optimal values, improving model performance. It demonstrates robust generalizability across different model sizes, training paradigms, base learning rate schedulers, and hyperparameter settings.

Conclusion: AdaLRS provides an effective, efficient, and generalizable solution for adaptive learning rate search in foundation model pretraining, overcoming limitations of existing approaches while requiring minimal computational overhead.

Abstract: Learning rate is widely regarded as crucial for effective foundation model pretraining. Recent research explores and demonstrates the transferability of learning rate configurations across varying model and dataset sizes, etc. Nevertheless, these approaches are constrained to specific training scenarios and typically necessitate extensive hyperparameter tuning on proxy models. In this work, we propose \textbf{AdaLRS}, a plug-in-and-play adaptive learning rate search algorithm that conducts online optimal learning rate search via optimizing loss descent velocities. We provide theoretical and experimental analyzes to show that foundation model pretraining loss and its descent velocity are both convex and share the same optimal learning rate. Relying solely on training loss dynamics, AdaLRS involves few extra computations to guide the search process, and its convergence is guaranteed via theoretical analysis. Experiments on both LLM and VLM pretraining show that AdaLRS adjusts suboptimal learning rates to the neighborhood of optimum with marked efficiency and effectiveness, with model performance improved accordingly. We also show the robust generalizability of AdaLRS across varying training scenarios, such as different model sizes, training paradigms, base learning rate scheduler choices, and hyperparameter settings.

[622] In Good GRACEs: Principled Teacher Selection for Knowledge Distillation

Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Sham Kakade, Surbhi Goel

Main category: cs.LG

TL;DR: GRACE is a lightweight score that predicts teacher effectiveness for knowledge distillation without needing teacher internals or test data, achieving up to 86% correlation with student performance.

Details

Motivation: Selecting optimal teachers for knowledge distillation currently requires expensive trial-and-error experimentation. There's a need for an efficient method to predict teacher effectiveness for specific student-task combinations without access to teacher internals or test data.

Method: GRACE measures distributional properties of the student’s gradients from an information-theoretic perspective, connecting to leave-one-out stability of gradient-based algorithms. It works without teacher logits, internals, or test data.

Result: On GSM8K and MATH benchmarks, GRACE shows up to 86% Spearman correlation with distilled student performance. GRACE-selected teachers improve performance by up to 7.4% over naive teacher selection. GRACE also guides distillation design choices like temperature, teacher size constraints, and model family selection.

Conclusion: GRACE efficiently identifies compatible teachers for knowledge distillation and provides fine-grained guidance on distillation design, offering a practical solution to the teacher selection problem without expensive trial-and-error.

Abstract: Knowledge distillation is an efficient strategy to use data generated by large “teacher” language models to train smaller capable “student” models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be for post-training a student model. GRACE measures distributional properties of the student’s gradients without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE connects to leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86% Spearman correlation) with the performance of the distilled LLaMA and OLMo students. In particular, training a student using the GRACE-selected teacher can improve the performance by up to 7.4% over naively using the best-performing teacher. Further, GRACE can provide guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use given a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify a strongly compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.

[623] Zero-Overhead Introspection for Adaptive Test-Time Compute

Rohin Manvi, Joey Hong, Tim Seyde, Maxime Labonne, Mathias Lechner, Sergey Levine

Main category: cs.LG

TL;DR: ZIP-RC enables LLMs to predict reward and cost during inference with zero overhead, allowing adaptive sampling decisions to optimize accuracy, compute, and latency.

Details

Motivation: LLMs lack introspection for anticipating success and required computation, leading to inefficient fixed-budget sampling (like Best-of-N) and inability to make intelligent meta-cognition decisions about when to invest effort, stop, or signal success/failure.

Method: ZIP-RC reuses reserved/unused logits in the same forward pass to output joint distribution over final reward and remaining length. Uses this to compute sampling utility (expected max reward, compute, latency) and maximizes it with meta-actions determining which token prefixes to continue or initiate sampling from.

Result: On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency.

Conclusion: ZIP-RC provides real-time reward-cost introspection, enabling adaptive, efficient reasoning without extra models, architecture changes, or inference overhead.

Abstract: Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent meta-cognition decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length – no extra models, architecture change, or inference overhead. This full joint distribution is used to compute a sampling utility which is the linear combination of the expected maximum reward, total compute, and latency of set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.

[624] Arc Gradient Descent: A Mathematically Derived Reformulation of Gradient Descent with Phase-Aware, User-Controlled Step Dynamics

Nikhil Verma, Joonas Linnosmaa, Leonardo Espinosa-Leal, Napat Vajragupta

Main category: cs.LG

TL;DR: ArcGD optimizer outperforms Adam on non-convex benchmarks and achieves highest accuracy on CIFAR-10 across 8 MLP architectures, showing better generalization without early stopping tuning.

Details

Motivation: To develop a new optimizer (ArcGD) that can handle challenging non-convex optimization problems and demonstrate superior performance compared to state-of-the-art optimizers like Adam, AdamW, Lion, and SGD on both synthetic benchmarks and real-world ML tasks.

Method: Two-phase evaluation: (1) Testing on stochastic variant of Rosenbrock function (2D to 50,000D) with learning-rate bias elimination, (2) Evaluation on CIFAR-10 image classification across 8 diverse MLP architectures (1-5 hidden layers) against Adam, AdamW, Lion, and SGD.

Result: ArcGD consistently outperformed Adam on Rosenbrock function with its effective learning rate, and achieved highest average test accuracy (50.7%) on CIFAR-10 at 20,000 iterations, winning or tying on 6 of 8 architectures. ArcGD showed continued improvement while Adam/AdamW regressed with extended training.

Conclusion: ArcGD demonstrates strong performance on both geometric stress tests and deep-learning benchmarks, showing broad applicability and better generalization without requiring early stopping tuning. The optimizer also has conceptual connections to sign-based momentum updates like Lion.

Abstract: The paper presents the formulation, implementation, and evaluation of the ArcGD optimiser. The evaluation is conducted initially on a non-convex benchmark function and subsequently on a real-world ML dataset. The initial comparative study using the Adam optimiser is conducted on a stochastic variant of the highly non-convex and notoriously challenging Rosenbrock function, renowned for its narrow, curved valley, across dimensions ranging from 2D to 1000D and an extreme case of 50,000D. Two configurations were evaluated to eliminate learning-rate bias: (i) both using ArcGD’s effective learning rate and (ii) both using Adam’s default learning rate. ArcGD consistently outperformed Adam under the first setting and, although slower under the second, achieved superior final solutions in most cases. In the second evaluation, ArcGD is evaluated against state-of-the-art optimizers (Adam, AdamW, Lion, SGD) on the CIFAR-10 image classification dataset across 8 diverse MLP architectures ranging from 1 to 5 hidden layers. ArcGD achieved the highest average test accuracy (50.7%) at 20,000 iterations, outperforming AdamW (46.6%), Adam (46.8%), SGD (49.6%), and Lion (43.4%), winning or tying on 6 of 8 architectures. Notably, while Adam and AdamW showed strong early convergence at 5,000 iterations, but regressed with extended training, whereas ArcGD continued improving, demonstrating generalization and resistance to overfitting without requiring early stopping tuning. Strong performance on geometric stress tests and standard deep-learning benchmarks indicates broad applicability, highlighting the need for further exploration. Moreover, it is also shown that a limiting variant of ArcGD can be interpreted as a sign-based momentum-like update, highlighting conceptual connections between the inherent mechanisms of ArcGD and the Lion optimiser.

[625] DSO: Direct Steering Optimization for Bias Mitigation

Lucas Monteiro Paes, Nivedha Sivakumar, Yinong Oliver Wang, Masha Fedzechkina Donaldson, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff

Main category: cs.LG

TL;DR: DSO uses reinforcement learning to optimize activation steering for controllable bias reduction in VLMs and LLMs during inference, achieving better fairness-performance trade-offs than heuristic methods.

Details

Motivation: VLMs and LLMs exhibit demographic biases in decision-making (e.g., failing to identify women as doctors), and users need controllable bias reduction that balances fairness with model capabilities during inference.

Method: Direct Steering Optimization (DSO) - uses reinforcement learning to find optimal linear transformations for steering model activations, specifically tailored to mitigate bias while maintaining performance control.

Result: DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, providing practitioners with inference-time control over the fairness-performance balance.

Conclusion: Directly optimizing steering strategies for specific behavioral control (like bias mitigation) is more effective than relying on pre-defined heuristics, enabling better inference-time intervention for fair AI systems.

Abstract: Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation with overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that current steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO) which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over the trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.

[626] Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, Tianyi Lin

Main category: cs.LG

TL;DR: Spurious rewards in RLVR improve LLM reasoning by reducing policy entropy through clipping bias, while entropy minimization alone is insufficient for gains.

Details

Motivation: To understand the paradoxical mechanisms in RLVR where both suppressing exploitation (via spurious rewards) and suppressing exploration (via entropy minimization) improve LLM reasoning, despite these effects seeming contradictory.

Method: Examined the relationship between policy entropy and performance, investigated whether spurious rewards yield gains through clipping bias and model contamination, and proposed a reward-misalignment model to explain spurious-reward benefits.

Result: Clipping bias under spurious rewards reduces policy entropy, leading to more confident outputs, while entropy minimization alone doesn’t improve performance. Spurious rewards can enhance performance beyond contaminated settings through reward misalignment.

Conclusion: The paper clarifies that spurious rewards improve RLVR training by reducing policy entropy via clipping bias, providing principles for more effective reinforcement learning with verifiable rewards for LLM reasoning.

Abstract: This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs, highlighting a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.

[627] Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning

Brett Daley, Martha White, Christopher Amato, Marlos C. Machado

Main category: cs.LG

TL;DR: A new multistep operator framework unifies per-decision and trajectory-aware off-policy methods, with convergence guarantees and a new robust algorithm (RBIS) that works well across λ-values.

Details

Motivation: Off-policy learning from multistep returns is crucial for sample efficiency, but existing methods struggle with balancing bias and variance. Per-decision methods use eligibility traces with importance sampling ratios, but once traces are cut, the effect is irreversible. Trajectory-aware methods exist but lack theoretical analysis and justification.

Method: Proposes a multistep operator framework that can express both per-decision and trajectory-aware methods. Proves convergence conditions for this operator in tabular settings. Introduces Recency-Bounded Importance Sampling (RBIS), a trajectory-aware method that performs robustly across λ-values.

Result: Established first theoretical guarantees for several existing off-policy methods and many new ones. RBIS demonstrates robust performance across λ-values in off-policy control tasks compared to existing methods.

Conclusion: The proposed multistep operator framework provides theoretical foundation for both per-decision and trajectory-aware off-policy methods, with RBIS offering a practical, robust solution for off-policy learning across different λ-values.

Abstract: Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across $λ$-values in an off-policy control task.

[628] Neural Exploitation and Exploration of Contextual Bandits

Yikun Ban, Yuchen Yan, Arindam Banerjee, Jingrui He

Main category: cs.LG

TL;DR: EE-Net: A neural network-based approach for contextual multi-armed bandits that uses separate exploitation and exploration networks instead of traditional statistical bounds like UCB or Thompson Sampling.

Details

Motivation: Traditional contextual bandit methods rely on statistical bounds (UCB, Thompson Sampling) for exploration, which may not adapt well to non-linear reward functions. Neural bandit algorithms have emerged but still use these traditional exploration strategies. The authors aim to develop a more adaptive neural-based exploration strategy.

Method: EE-Net uses two neural networks: (1) Exploitation network learns the reward function, (2) Exploration network learns potential gains compared to current estimated rewards for adaptive exploration. This replaces traditional large-deviation based statistical bounds with learned exploration.

Result: Theoretical: Provides instance-based Õ(√T) regret upper bound. Empirical: EE-Net outperforms related linear and neural contextual bandit baselines on real-world datasets.

Conclusion: EE-Net offers a novel neural-based approach to the exploitation-exploration trade-off in contextual bandits, achieving both theoretical guarantees and practical performance improvements over existing methods.

Abstract: In this paper, we study utilizing neural networks for the exploitation and exploration of contextual multi-armed bandits. Contextual multi-armed bandits have been studied for decades with various applications. To solve the exploitation-exploration trade-off in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, a series of neural bandit algorithms have been proposed to adapt to the non-linear reward function, combined with TS or UCB strategies for exploration. In this paper, instead of calculating a large-deviation based statistical bound for exploration like previous methods, we propose, ``EE-Net,’’ a novel neural-based exploitation and exploration strategy. In addition to using a neural network (Exploitation network) to learn the reward function, EE-Net uses another neural network (Exploration network) to adaptively learn the potential gains compared to the currently estimated reward for exploration. We provide an instance-based $\widetilde{\mathcal{O}}(\sqrt{T})$ regret upper bound for EE-Net and show that EE-Net outperforms related linear and neural contextual bandit baselines on real-world datasets.

[629] Anti-Correlated Noise in Epoch-Based Stochastic Gradient Descent: Implications for Weight Variances in Flat Directions

Marcel Kühn, Bernd Rosenow

Main category: cs.LG

TL;DR: SGD noise in epoch-based training is anti-correlated, not uncorrelated. This anti-correlation reduces weight variance in flat loss directions, potentially helping find flatter minima that generalize better.

Details

Motivation: The paper challenges the common assumption that SGD gradient noise is uncorrelated over time, particularly in epoch-based training where data is sampled without replacement. The authors investigate how these noise correlations affect SGD with momentum and the resulting stationary distribution.

Method: The authors calculate the exact autocorrelation of SGD noise during epoch-based training, assuming noise independence from small weight fluctuations. They analyze how these anti-correlations affect weight variance in different loss curvature directions, and conduct numerical experiments to test performance implications.

Result: SGD noise in epoch-based training is inherently anti-correlated. For directions with high curvature, conventional predictions hold (isotropic weight variance), but for flat directions, weight variance is significantly reduced. Numerical experiments show training with these anti-correlations can enhance test performance.

Conclusion: The anti-correlated noise structure in epoch-based SGD reduces weight variance in flat loss directions, potentially helping find flatter minima that generalize better, challenging the common assumption of uncorrelated SGD noise.

Abstract: Stochastic Gradient Descent (SGD) has become a cornerstone of neural network optimization due to its computational efficiency and generalization capabilities. However, the gradient noise introduced by SGD is often assumed to be uncorrelated over time, despite the common practice of epoch-based training where data is sampled without replacement. In this work, we challenge this assumption and investigate the effects of epoch-based noise correlations on the stationary distribution of discrete-time SGD with momentum. Our main contributions are twofold: First, we calculate the exact autocorrelation of the noise during epoch-based training under the assumption that the noise is independent of small fluctuations in the weight vector, revealing that SGD noise is inherently anti-correlated over time. Second, we explore the influence of these anti-correlations on the variance of weight fluctuations. We find that for directions with curvature of the loss greater than a hyperparameter-dependent crossover value, the conventional predictions of isotropic weight variance under stationarity, based on uncorrelated and curvature-proportional noise, are recovered. Anti-correlations have negligible effect here. However, for relatively flat directions, the weight variance is significantly reduced, leading to a considerable decrease in loss fluctuations compared to the constant weight variance assumption. Furthermore, we present a numerical experiment where training with these anti-correlations enhances test performance, suggesting that the inherent noise structure induced by epoch-based training may play a role in finding flatter minima that generalize better.

[630] HUTFormer: Hierarchical U-Net Transformer for Long-Term Traffic Forecasting

Zezhi Shao, Fei Wang, Tao Sun, Chengqing Yu, Yuchen Fang, Guangyin Jin, Zhulin An, Yang Liu, Xiaobo Qu, Yongjun Xu

Main category: cs.LG

TL;DR: HUTFormer is a novel hierarchical U-Net Transformer model for long-term traffic forecasting (e.g., 1-day ahead) that addresses multi-scale representation challenges through hierarchical encoder-decoder architecture with efficient attention mechanisms.

Details

Motivation: Existing STGNNs focus only on short-term traffic forecasting (e.g., 1-hour ahead) due to complexity issues, ignoring the more practical need for long-term forecasting (e.g., 1-day ahead). The paper aims to address this gap by exploring long-term traffic forecasting and its unique challenges in exploiting multi-scale representations.

Method: Proposes HUTFormer with hierarchical encoder and decoder. Encoder uses window self-attention and segment merging to extract multi-scale representations from long-term traffic data. Decoder employs cross-scale attention to incorporate multi-scale representations. Also includes efficient input embedding strategy to handle complexity issues.

Result: Extensive experiments on four traffic datasets show HUTFormer significantly outperforms state-of-the-art traffic forecasting and long time series forecasting baselines.

Conclusion: HUTFormer successfully addresses long-term traffic forecasting challenges by effectively generating and utilizing multi-scale representations through its hierarchical architecture, demonstrating superior performance over existing methods.

Abstract: Traffic forecasting, which aims to predict traffic conditions based on historical observations, has been an enduring research topic and is widely recognized as an essential component of intelligent transportation. Recent proposals on Spatial-Temporal Graph Neural Networks~(STGNNs) have made significant progress by combining sequential models with graph convolution networks. However, due to high complexity issues, STGNNs only focus on short-term traffic forecasting (e.g., 1-h ahead), while ignoring more practical long-term forecasting. In this paper, we make the first attempt to explore long-term traffic forecasting (e.g., 1-day ahead). To this end, we first reveal its unique challenges in exploiting multi-scale representations. Then, we propose a novel Hierarchical U-Net TransFormer~(HUTFormer) to address the issues of long-term traffic forecasting. HUTFormer consists of a hierarchical encoder and decoder to jointly generate and utilize multi-scale representations of traffic data. Specifically, for the encoder, we {\color{black}propose} window self-attention and segment merging to extract multi-scale representations from long-term traffic data. For the decoder, we design a cross-scale attention mechanism to effectively incorporate multi-scale representations. In addition, HUTFormer employs an efficient input embedding strategy to address the complexity issues. Extensive experiments on four traffic datasets show that the proposed HUTFormer significantly outperforms state-of-the-art traffic forecasting and long time series forecasting baselines.

[631] Graph Transformers: A Survey

Ahsan Shehzad, Feng Xia, Shagufta Abid, Ciyuan Peng, Shuo Yu, Dongyu Zhang, Karin Verspoor

Main category: cs.LG

TL;DR: This survey paper provides a comprehensive review of graph transformers, covering their foundations, design approaches, taxonomy, applications, and future challenges in graph learning.

Details

Motivation: Graph transformers represent a promising advancement in machine learning for graph-structured data, combining the strengths of transformer architectures with graph learning capabilities. The motivation is to systematically review this emerging field, analyze design approaches, and identify future research directions.

Method: The survey method includes: 1) Reviewing foundational concepts of graphs and transformers, 2) Analyzing design perspectives focusing on graph inductive biases and attention mechanisms, 3) Proposing a taxonomy based on depth, scalability, and pre-training strategies, 4) Examining applications across node-level, edge-level, and graph-level tasks, and 5) Identifying key challenges and future directions.

Result: The survey provides a comprehensive framework for understanding graph transformers, including design principles, classification taxonomy, and application domains. It synthesizes current research progress while highlighting the versatility and strong performance of graph transformers across various graph-related tasks.

Conclusion: Graph transformers show significant promise for graph learning tasks but face challenges including scalability, generalization, interpretability, handling dynamic graphs, and data quality issues. Future research should address these limitations to advance the field further.

Abstract: Graph transformers are a recent advancement in machine learning, offering a new class of neural network models for graph-structured data. The synergy between transformers and graph learning demonstrates strong performance and versatility across various graph-related tasks. This survey provides an in-depth review of recent progress and challenges in graph transformer research. We begin with foundational concepts of graphs and transformers. We then explore design perspectives of graph transformers, focusing on how they integrate graph inductive biases and graph attention mechanisms into the transformer architecture. Furthermore, we propose a taxonomy classifying graph transformers based on depth, scalability, and pre-training strategies, summarizing key principles for effective development of graph transformer models. Beyond technical analysis, we discuss the applications of graph transformer models for node-level, edge-level, and graph-level tasks, exploring their potential in other application scenarios as well. Finally, we identify remaining challenges in the field, such as scalability and efficiency, generalization and robustness, interpretability and explainability, dynamic and complex graphs, as well as data quality and diversity, charting future directions for graph transformer research.

[632] Certified Defense on the Fairness of Graph Neural Networks

Yushun Dong, Binchi Zhang, Hanghang Tong, Jundong Li

Main category: cs.LG

TL;DR: ELEGANT is a plug-and-play framework that provides certifiable defense for GNN fairness against adversarial perturbations without retraining or architectural assumptions.

Details

Motivation: GNNs are vulnerable to adversarial attacks that can corrupt their fairness predictions through graph perturbations, creating a need for certifiable defense mechanisms.

Method: Proposes ELEGANT framework with theoretical certification analysis that works with any GNN backbone, requires no retraining or architectural assumptions, and provides provable fairness guarantees under certain perturbation budgets.

Result: Extensive experiments on real-world datasets show satisfactory effectiveness across different GNN backbones and parameter settings.

Conclusion: ELEGANT provides a practical, plug-and-play solution for certifiable fairness defense in GNNs that can be deployed with any optimized GNN model.

Abstract: Graph Neural Networks (GNNs) have emerged as a prominent graph learning model in various graph-based tasks over the years. Nevertheless, due to the vulnerabilities of GNNs, it has been empirically shown that malicious attackers could easily corrupt the fairness level of their predictions by adding perturbations to the input graph data. In this paper, we take crucial steps to study a novel problem of certifiable defense on the fairness level of GNNs. Specifically, we propose a principled framework named ELEGANT and present a detailed theoretical certification analysis for the fairness of GNNs. ELEGANT takes {\em any} GNN as its backbone, and the fairness level of such a backbone is theoretically impossible to be corrupted under certain perturbation budgets for attackers. Notably, ELEGANT does not make any assumptions over the GNN structure or parameters, and does not require re-training the GNNs to realize certification. Hence it can serve as a plug-and-play framework for any optimized GNNs ready to be deployed. We verify the satisfactory effectiveness of ELEGANT in practice through extensive experiments on real-world datasets across different backbones of GNNs and parameter settings.

[633] Overcoming Growth-Induced Forgetting in Task-Agnostic Continual Learning

Yuqing Zhao, Jiannong Cao, Divya Saxena, Xiaoyun Liu, Changlin Song, Bo Yuan, Julie McCann

Main category: cs.LG

TL;DR: SparseGrow addresses growth-induced forgetting in continual learning by using gradient and parameter sparsity to enable model growth while preserving learned knowledge.

Details

Motivation: Model growth in continual learning improves adaptability but causes severe forgetting when improperly applied, especially in task-agnostic settings where the entire grown model is used for inference. Existing growth methods overlook this forgetting problem, compromising knowledge retention.

Method: SparseGrow introduces gradient sparsity through layer expansion and gradient gating to enable focused parameter updates while preserving critical parameters. It also promotes parameter sparsity via sparse initialization and training to better control model plasticity and improve adaptability to new data.

Result: Extensive experiments across diverse datasets, task-agnostic settings, and many tasks demonstrate the necessity of controlled layer expansion and validate SparseGrow’s effectiveness in achieving high adaptability while minimizing forgetting.

Conclusion: By enabling model growth with sparsified gradients and parameters, SparseGrow paves the way for scalable lifelong learning systems capable of continual adaptation with better knowledge retention.

Abstract: In continual learning (CL), model growth enhances adaptability to new data. However, when model growth is applied improperly, especially in task-agnostic CL, where the entire grown model is used for inference, it can lead to severe degradation of learned knowledge, a problem we term growth-induced forgetting. Most existing methods that adopt model growth to improve adaptability often overlook the forgetting issue, resulting in compromised knowledge retention, making them unsuitable for task-agnostic settings. To promote both adaptability and knowledge retention with model growth, we identify the key: gradient and parameter sparsity. Introducing SparseGrow, which increases gradient sparsity through layer expansion and gradient gating to enable focused updates on parameters while preserving critical parameters, thus inhibiting forgetting. Moreover, it promotes parameter sparsity with sparse initialization and training, aiming at better control of model plasticity, improving adaptability over new data. Extensive experiments across diverse datasets, task-agnostic settings, and a large number of tasks demonstrate the necessity of controlled layer expansion and validate the effectiveness of SparseGrow in achieving high adaptability while minimizing forgetting in continual learning. By enabling model growth with sparsified gradients and parameters, SparseGrow paves the way for building scalable lifelong learning systems capable of continual adaptation with better knowledge retention.

[634] Averaging $n$-step Returns Reduces Variance in Reinforcement Learning

Brett Daley, Martha White, Marlos C. Machado

Main category: cs.LG

TL;DR: Compound returns (weighted averages of n-step returns) reduce variance in reinforcement learning, improving sample efficiency of TD learning and deep RL agents like DQN and PPO.

Details

Motivation: Multistep returns improve RL sample efficiency but suffer from increasing variance as they look further into the future, limiting their effective length. The paper aims to address this variance limitation.

Method: Introduces compound returns (weighted averages of n-step returns) and proves they reduce variance compared to n-step returns. Proposes efficient two-bootstrap returns for practical implementation with minibatched experience replay.

Result: Theoretical proofs show: 1) any compound return with same contraction modulus as n-step return has strictly lower variance, 2) this improves finite-sample complexity of TD learning under linear function approximation. Experiments show compound returns increase sample efficiency of DQN and PPO agents.

Conclusion: Compound returns provide a principled way to reduce variance in multistep RL while maintaining contraction properties, leading to improved sample efficiency in both theoretical TD learning and practical deep RL applications.

Abstract: Multistep returns, such as $n$-step returns and $λ$-returns, are commonly used to improve the sample efficiency of reinforcement learning (RL) methods. The variance of the multistep returns becomes the limiting factor in their length; looking too far into the future increases variance and reverses the benefits of multistep learning. In our work, we demonstrate the ability of compound returns – weighted averages of $n$-step returns – to reduce variance. We prove for the first time that any compound return with the same contraction modulus as a given $n$-step return has strictly lower variance. We additionally prove that this variance-reduction property improves the finite-sample complexity of temporal-difference learning under linear function approximation. Because general compound returns can be expensive to implement, we introduce two-bootstrap returns which reduce variance while remaining efficient, even when using minibatched experience replay. We conduct experiments showing that compound returns often increase the sample efficiency of $n$-step deep RL agents like DQN and PPO.

[635] IT Intrusion Detection Using Statistical Learning and Testbed Measurements

Xiaoxuan Wang, Rolf Stadler

Main category: cs.LG

TL;DR: The paper applies HMM, LSTM, and Random Forest to detect and predict cyber attacks using abundant testbed data, finding LSTM most accurate with sufficient data while HMM is more resource-efficient.

Details

Motivation: To develop automated intrusion detection systems that can identify attack start time, attack type, and attacker action sequences from continuous infrastructure measurements, addressing the need for effective cybersecurity monitoring.

Method: Uses Hidden Markov Model (HMM), Long Short-Term Memory (LSTM), and Random Forest Classifier (RFC) with a machine-learning pipeline that maps high-dimensional observations to low-dimensional spaces. Trains models on abundant data from in-house testbed traces of attacks against emulated IT infrastructure.

Result: Both HMM and LSTM effectively predict attack start time, type, and actions. LSTM achieves higher accuracy with sufficient training data, while HMM requires less computational resources and training data. Methods benefit from traditional IDS data like SNORT.

Conclusion: Statistical learning methods, particularly LSTM and HMM, are effective for intrusion detection and attack prediction. The choice between them depends on available computational resources and training data, with LSTM preferred for accuracy when resources permit and HMM for resource-constrained scenarios.

Abstract: We study automated intrusion detection in an IT infrastructure, specifically the problem of identifying the start of an attack, the type of attack, and the sequence of actions an attacker takes, based on continuous measurements from the infrastructure. We apply statistical learning methods, including Hidden Markov Model (HMM), Long Short-Term Memory (LSTM), and Random Forest Classifier (RFC) to map sequences of observations to sequences of predicted attack actions. In contrast to most related research, we have abundant data to train the models and evaluate their predictive power. The data comes from traces we generate on an in-house testbed where we run attacks against an emulated IT infrastructure. Central to our work is a machine-learning pipeline that maps measurements from a high-dimensional observation space to a space of low dimensionality or to a small set of observation symbols. Investigating intrusions in offline as well as online scenarios, we find that both HMM and LSTM can be effective in predicting attack start time, attack type, and attack actions. If sufficient training data is available, LSTM achieves higher prediction accuracy than HMM. HMM, on the other hand, requires less computational resources and less training data for effective prediction. Also, we find that the methods we study benefit from data produced by traditional intrusion detection systems like SNORT.

[636] Continuum Attention for Neural Operators

Edoardo Calvello, Nikola B. Kovachki, Matthew E. Levine, Andrew M. Stuart

Main category: cs.LG

TL;DR: This paper introduces transformer neural operators that map between infinite-dimensional function spaces, proving universal approximation theorems and developing efficient architectures for operator learning problems.

Details

Motivation: Transformers have shown success in modeling nonlocal, long-range correlations in various domains. Since neural operators (which map function spaces) must be both nonlinear and nonlocal to be universal, it's natural to explore whether attention mechanisms can be adapted for neural operator design.

Method: 1. Formulate attention as a map between infinite-dimensional function spaces, showing practical implementations are Monte Carlo/finite difference approximations. 2. Design transformer neural operators for learning mappings between function spaces. 3. Introduce function space generalization of patching strategy for efficient attention on multi-dimensional domains. 4. Prove universal approximation results for these architectures.

Result: 1. First universal approximation theorem for transformer neural operators with only slight modifications to practical implementations. 2. Efficient attention-based architectures for multi-dimensional domains via function space patching. 3. Numerical experiments demonstrate effectiveness on operator learning problems.

Conclusion: The paper successfully bridges transformers and neural operators, providing theoretical foundations (universal approximation) and practical architectures for learning mappings between function spaces using attention mechanisms, with promising numerical results.

Abstract: Transformers, and the attention mechanism in particular, have become ubiquitous in machine learning. Their success in modeling nonlocal, long-range correlations has led to their widespread adoption in natural language processing, computer vision, and time series problems. Neural operators, which map spaces of functions into spaces of functions, are necessarily both nonlinear and nonlocal if they are universal; it is thus natural to ask whether the attention mechanism can be used in the design of neural operators. Motivated by this, we study transformers in the function space setting. We formulate attention as a map between infinite dimensional function spaces and prove that the attention mechanism as implemented in practice is a Monte Carlo or finite difference approximation of this operator. The function space formulation allows for the design of transformer neural operators, a class of architectures designed to learn mappings between function spaces. In this paper, we state and prove the first universal approximation result for transformer neural operators, using only a slight modification of the architecture implemented in practice. The prohibitive cost of applying the attention operator to functions defined on multi-dimensional domains leads to the need for more efficient attention-based architectures. For this reason we also introduce a function space generalization of the patching strategy from computer vision, and introduce a class of associated neural operators. Numerical results, on an array of operator learning problems, demonstrate the promise of our approaches to function space formulations of attention and their use in neural operators.

[637] Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision

Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, Guochu Xiong, Weichen Liu

Main category: cs.LG

TL;DR: Survey on efficient deep learning infrastructures for embedded systems to bridge computational gap between powerful DNNs and resource-constrained devices.

Details

Motivation: Deep neural networks have become computationally intensive (deeper/wider), creating a gap with resource-constrained embedded systems, hindering deployment of powerful DNNs for ubiquitous embedded intelligence.

Method: Comprehensive survey covering seven key areas: (1) efficient manual network design, (2) efficient automated network design, (3) network compression, (4) on-device learning, (5) large language models for embedded systems, (6) software/hardware co-design, and (7) intelligent applications.

Result: Provides systematic review of recent advances in efficient deep learning infrastructures spanning training to inference, manual to automated approaches, CNNs to transformers, and software to hardware solutions.

Conclusion: Efficient deep learning infrastructures are essential for deploying powerful DNNs on embedded systems to achieve ubiquitous embedded intelligence, requiring holistic approaches across multiple dimensions from algorithms to applications.

Abstract: Deep neural networks (DNNs) have recently achieved impressive success across a wide range of real-world vision and language processing tasks, spanning from image classification to many other downstream vision tasks, such as object detection, tracking, and segmentation. However, previous well-established DNNs, despite being able to maintain superior accuracy, have also been evolving to be deeper and wider and thus inevitably necessitate prohibitive computational resources for both training and inference. This trend further enlarges the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems, making it challenging to deploy powerful DNNs upon real-world embedded computing systems towards ubiquitous embedded intelligence. To alleviate the above computational gap and enable ubiquitous embedded intelligence, we, in this survey, focus on discussing recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems.

[638] A Reinforcement Learning Environment for Automatic Code Optimization in the MLIR Compiler

Mohammed Tirichine, Nassim Ameur, Nazim Bendib, Iheb Nassim Aouadj, Bouchama Djad, Rafik Bouloudene, Riyadh Baghdadi

Main category: cs.LG

TL;DR: MLIR RL is a reinforcement learning environment for MLIR compiler that enables automatic code optimization through multi-discrete action spaces and level pointers for loop interchange, demonstrated on deep learning and LQCD code optimization.

Details

Motivation: Code optimization is tedious and complex, requiring automatic techniques. Reinforcement Learning shows promise for complex optimization problems, but needs specialized environments for MLIR compiler research.

Method: Introduces MLIR RL environment with multi-discrete action space formulation (Cartesian product of simpler subspaces) and level pointers method to reduce action space size for loop interchange transformations.

Result: Successfully trained RL agent to optimize MLIR Linalg code for CPU from PyTorch deep learning models and LQCD compiler code, creating a research environment for RL-driven loop-nest optimization.

Conclusion: MLIR RL provides an effective research environment for experimenting with RL-driven code optimization in MLIR compiler, enabling community research in automatic loop-nest optimization.

Abstract: Code optimization is a crucial task that aims to enhance code performance. However, this process is often tedious and complex, highlighting the necessity for automatic code optimization techniques. Reinforcement Learning (RL) has emerged as a promising approach for tackling such complex optimization problems. In this project, we introduce MLIR RL, an RL environment for the MLIR compiler, dedicated to facilitating MLIR compiler research and enabling automatic code optimization. We propose a multi-discrete formulation of the action space where the action space is the Cartesian product of simpler action subspaces. We also propose a new method, called level pointers, to reduce the size of the action space related to the loop interchange transformation. This enables more efficient and effective learning of the policy. To demonstrate the effectiveness of MLIR RL, we train an RL agent to optimize MLIR Linalg code, targeting CPU. The code is generated from two domain-specific frameworks: deep-learning models generated from PyTorch, and LQCD (Lattice Quantum Chromodynamics) code generated from an LQCD compiler. The result of this work is a research environment that allows the community to experiment with novel ideas in RL-driven loop-nest optimization.

[639] Equivariant Polynomial Functional Networks

Thieu N. Vo, Viet-Hoang Tran, Tho Tran Huu, An Nguyen The, Thanh Tran, Minh-Khoi Nguyen-Nhat, Duy-Tung Pham, Tan Minh Nguyen

Main category: cs.LG

TL;DR: Proposes MAGEP-NFN, a novel Neural Functional Network that uses polynomial equivariant layers to achieve permutation and scaling equivariance with better expressivity, lower memory, and faster runtime than existing methods.

Details

Motivation: Existing NFNs face trade-offs: graph-based methods have high memory/runtime costs, while parameter-sharing methods lack expressivity due to large symmetric groups. Need a solution that maintains equivariance while being efficient and expressive.

Method: MAGEP-NFN uses parameter-sharing mechanism with nonlinear equivariant layers represented as polynomials in input weights. This polynomial formulation captures relationships between weights across different hidden layers, enhancing expressivity while keeping computational costs low.

Result: Empirical evidence shows MAGEP-NFN achieves competitive performance and efficiency compared to existing baselines, addressing the challenge of balancing expressivity with low memory consumption and running time.

Conclusion: MAGEP-NFN successfully resolves the trade-off in NFN design by using polynomial equivariant layers, providing an effective solution for permutation and scaling equivariant neural functional networks with practical efficiency.

Abstract: Neural Functional Networks (NFNs) have gained increasing interest due to their wide range of applications, including extracting information from implicit representations of data, editing network weights, and evaluating policies. A key design principle of NFNs is their adherence to the permutation and scaling symmetries inherent in the connectionist structure of the input neural networks. Recent NFNs have been proposed with permutation and scaling equivariance based on either graph-based message-passing mechanisms or parameter-sharing mechanisms. However, graph-based equivariant NFNs suffer from high memory consumption and long running times. On the other hand, parameter-sharing-based NFNs built upon equivariant linear layers exhibit lower memory consumption and faster running time, yet their expressivity is limited due to the large size of the symmetric group of the input neural networks. The challenge of designing a permutation and scaling equivariant NFN that maintains low memory consumption and running time while preserving expressivity remains unresolved. In this paper, we propose a novel solution with the development of MAGEP-NFN (Monomial mAtrix Group Equivariant Polynomial NFN). Our approach follows the parameter-sharing mechanism but differs from previous works by constructing a nonlinear equivariant layer represented as a polynomial in the input weights. This polynomial formulation enables us to incorporate additional relationships between weights from different input hidden layers, enhancing the model’s expressivity while keeping memory consumption and running time low, thereby addressing the aforementioned challenge. We provide empirical evidence demonstrating that MAGEP-NFN achieves competitive performance and efficiency compared to existing baselines.

[640] Conditional Distribution Learning for Graph Classification

Jie Chen, Hua Mao, Chuanbin Liu, Zhu Wang, Xi Peng

Main category: cs.LG

TL;DR: Proposes conditional distribution learning (CDL) for graph representation learning to address conflicts between graph neural network message-passing and contrastive learning while preserving semantic information across data augmentations.

Details

Motivation: There's a conflict between graph neural networks' message-passing mechanism (which makes node embeddings more similar across layers) and contrastive learning (which aims to increase dissimilarity between negative pairs). Additionally, preserving intrinsic semantic information while leveraging diverse graph data augmentations is challenging.

Method: Proposes conditional distribution learning (CDL) that aligns conditional distributions of weakly and strongly augmented features over original features. Uses positive pairs of node representations (original features and corresponding weakly augmented features) to measure similarity, avoiding the MPM-CL conflict.

Result: Extensive experiments on benchmark graph datasets demonstrate the effectiveness of the proposed CDL method for semisupervised graph classification.

Conclusion: The CDL method effectively addresses the conflict between GNN message-passing and contrastive learning while preserving semantic information across different graph data augmentations, leading to improved graph representation learning.

Abstract: Leveraging the diversity and quantity of data provided by various graph-structured data augmentations while preserving intrinsic semantic information is challenging. Additionally, successive layers in graph neural network (GNN) tend to produce more similar node embeddings, while graph contrastive learning aims to increase the dissimilarity between negative pairs of node embeddings. This inevitably results in a conflict between the message-passing mechanism (MPM) of GNNs and the contrastive learning (CL) of negative pairs via intraviews. In this paper, we propose a conditional distribution learning (CDL) method that learns graph representations from graph-structured data for semisupervised graph classification. Specifically, we present an end-to-end graph representation learning model to align the conditional distributions of weakly and strongly augmented features over the original features. This alignment enables the CDL model to effectively preserve intrinsic semantic information when both weak and strong augmentations are applied to graph-structured data. To avoid the conflict between the MPM and the CL of negative pairs, positive pairs of node representations are retained for measuring the similarity between the original features and the corresponding weakly augmented features. Extensive experiments with several benchmark graph datasets demonstrate the effectiveness of the proposed CDL method.

[641] PowerMamba: A Deep State Space Model and Comprehensive Benchmark for Time Series Prediction in Electric Power Systems

Ali Menati, Fatemeh Doudi, Dileep Kalathil, Le Xie

Main category: cs.LG

TL;DR: A multivariate time series prediction model combining state space models with deep learning for electricity grid forecasting, plus a new dataset and toolbox for benchmarking.

Details

Motivation: Electricity grid transformations (electrification, renewables, new tech) make the grid more volatile and unpredictable, requiring advanced prediction models to close the gap between forecasted and actual grid outcomes.

Method: 1) Multivariate time series prediction model combining traditional state space models with deep learning to capture dynamics of multiple time series simultaneously. 2) Time series processing module incorporating high-resolution external forecasts into sequence-to-sequence models with minimal size increase and no accuracy loss.

Result: 1) Extended 5-year dataset of load, electricity price, ancillary service price, and renewable generation. 2) Open-access toolbox with proposed model, dataset, and state-of-the-art models for unified benchmarking. 3) Proposed model outperforms existing models across various prediction tasks, improving prediction error by average 7% and decreasing model parameters by 43%.

Conclusion: The proposed hybrid model effectively addresses electricity grid volatility challenges, provides superior prediction performance with fewer parameters, and offers a comprehensive benchmarking framework through the released dataset and toolbox.

Abstract: The electricity sector is undergoing substantial transformations due to the rising electrification of demand, enhanced integration of renewable energy resources, and the emergence of new technologies. These changes are rendering the electric grid more volatile and unpredictable, making it difficult to maintain reliable operations. In order to address these issues, advanced time series prediction models are needed for closing the gap between the forecasted and actual grid outcomes. In this paper, we introduce a multivariate time series prediction model that combines traditional state space models with deep learning methods to simultaneously capture and predict the underlying dynamics of multiple time series. Additionally, we design a time series processing module that incorporates high-resolution external forecasts into sequence-to-sequence prediction models, achieving this with negligible increases in size and no loss of accuracy. We also release an extended dataset spanning five years of load, electricity price, ancillary service price, and renewable generation. To complement this dataset, we provide an open-access toolbox that includes our proposed model, the dataset itself, and several state-of-the-art prediction models, thereby creating a unified framework for benchmarking advanced machine learning approaches. Our findings indicate that the proposed model outperforms existing models across various prediction tasks, improving state-of-the-art prediction error by an average of 7% and decreasing model parameters by 43%.

[642] Communication-Efficient and Privacy-Adaptable Mechanism for Federated Learning

Chih Wei Ling, Chun Hei Michael Shiu, Youqi Wu, Jiande Sun, Cheuk Ting Li, Linqi Song, Weitao Xu

Main category: cs.LG

TL;DR: CEPAM is a federated learning approach that combines communication efficiency and privacy protection using a novel randomized vector quantizer (RSUQ) that achieves joint differential privacy and compression.

Details

Motivation: Federated learning faces two main challenges: communication efficiency (reducing data transmission costs) and privacy protection (preventing data leakage). Existing solutions often address these separately, but there's a need for a unified approach that handles both simultaneously within the trusted aggregator model.

Method: The paper introduces CEPAM (Communication-Efficient and Privacy-Adaptable Mechanism) which leverages RSUQ (rejection-sampled universal quantizer). RSUQ is a randomized vector quantizer construction whose distortion is equivalent to prescribed noise (Gaussian or Laplace), enabling joint differential privacy and compression. The method provides privacy adaptability, allowing clients and servers to customize privacy protection based on required accuracy and protection levels.

Result: Theoretical analysis confirms CEPAM’s privacy guarantees. Experimental evaluations demonstrate trade-offs between user privacy and accuracy. On MNIST dataset, CEPAM surpasses baseline models in learning accuracy while maintaining communication efficiency and privacy protection.

Conclusion: CEPAM successfully addresses both communication efficiency and privacy protection challenges in federated learning simultaneously. The approach provides a flexible, adaptable framework that outperforms existing methods in accuracy while maintaining strong privacy guarantees, making it a promising solution for practical federated learning deployments.

Abstract: Training machine learning models on decentralized private data via federated learning (FL) poses two key challenges: communication efficiency and privacy protection. In this work, we address these challenges within the trusted aggregator model by introducing a novel approach called the Communication-Efficient and Privacy-Adaptable Mechanism (CEPAM), achieving both objectives simultaneously. In particular, CEPAM leverages the rejection-sampled universal quantizer (RSUQ), a construction of randomized vector quantizer whose resulting distortion is equivalent to a prescribed noise, such as Gaussian or Laplace noise, enabling joint differential privacy and compression. Our CEPAM provides the additional benefit of privacy adaptability, allowing clients and the server to customize privacy protection based on required accuracy and protection. We theoretically analyze the privacy guarantee of CEPAM and investigate the trade-offs among user privacy and accuracy of CEPAM through experimental evaluations. Moreover, we assess CEPAM’s utility performance using MNIST dataset, demonstrating that CEPAM surpasses baseline models in terms of learning accuracy.

[643] Hardware-Aware DNN Compression for Homogeneous Edge Devices

Kunlong Zhang, Guiying Li, Ning Lu, Peng Yang, Ke Tang

Main category: cs.LG

TL;DR: HDAP is a hardware-aware DNN compression framework for homogeneous edge devices that accounts for performance variations among supposedly identical devices, achieving optimal average performance through device clustering and surrogate evaluation.

Details

Motivation: Homogeneous edge devices (same manufacturer SKU) develop performance differences over time due to user configurations, environmental conditions, manufacturing variances, and battery degradation. Existing DNN compression methods don't account for this variability and can't guarantee good performance across all devices.

Method: HDAP partitions homogeneous devices into clusters to reduce evaluation overhead, uses surrogate-based evaluation instead of real-time hardware evaluation, and optimizes for average performance across all devices through hardware-aware pruning.

Result: HDAP consistently achieves lower average inference latency compared to state-of-the-art methods, with substantial speedup gains (e.g., 2.86× speedup at 1.0G FLOPs for ResNet50) on homogeneous device clusters.

Conclusion: HDAP provides an effective, scalable solution for high-performance DNN deployment on homogeneous edge devices by addressing performance variability through device-aware compression.

Abstract: Deploying deep neural networks (DNNs) across homogeneous edge devices (the devices with the same SKU labeled by the manufacturer) often assumes identical performance among them. However, once a device model is widely deployed, the performance of each device becomes different after a period of running. This is caused by the differences in user configurations, environmental conditions, manufacturing variances, battery degradation, etc. Existing DNN compression methods have not taken this scenario into consideration and can not guarantee good compression results in all homogeneous edge devices. To address this, we propose Homogeneous-Device Aware Pruning (HDAP), a hardware-aware DNN compression framework explicitly designed for homogeneous edge devices, aiming to achieve optimal average performance of the compressed model across all devices. To deal with the difficulty of time-consuming hardware-aware evaluations for thousands or millions of homogeneous edge devices, HDAP partitions all the devices into several device clusters, which can dramatically reduce the number of devices to evaluate and use the surrogate-based evaluation instead of hardware evaluation in real-time. Experiments on ResNet50 and MobileNetV1 with the ImageNet dataset show that HDAP consistently achieves lower average inference latency compared with state-of-the-art methods, with substantial speedup gains (e.g., 2.86 $\times$ speedup at 1.0G FLOPs for ResNet50) on the homogeneous device clusters. HDAP offers an effective solution for scalable, high-performance DNN deployment methods for homogeneous edge devices.

[644] Error Slice Discovery via Manifold Compactness

Han Yu, Hao Zou, Jiashuo Liu, Renzhe Xu, Yue He, Xingxuan Zhang, Peng Cui

Main category: cs.LG

TL;DR: Proposes manifold compactness as a coherence metric for error slice discovery without needing predefined slice labels, and develops MCSD algorithm that optimizes both risk and coherence.

Details

Motivation: Current error slice discovery methods rely on predefined slice labels/metadata, which limits discovery of all possible patterns and prevents direct optimization of slice coherence. There's a need for a coherence metric that doesn't require extra information.

Method: Introduces manifold compactness metric based on data geometry properties, then develops Manifold Compactness based error Slice Discovery (MCSD) algorithm that directly optimizes both risk (error rate) and coherence.

Result: Experiments validate the rationality of manifold compactness metric, and MCSD demonstrates superiority over benchmarks on typical datasets and case studies.

Conclusion: The proposed manifold compactness provides a reliable coherence metric without metadata dependency, and MCSD effectively discovers interpretable error slices by directly optimizing both risk and coherence objectives.

Abstract: Despite the great performance of deep learning models in many areas, they still make mistakes and underperform on certain subsets of data, i.e. error slices. Given a trained model, it is important to identify its semantically coherent error slices that are easy to interpret, which is referred to as the error slice discovery problem. However, there is no proper metric of slice coherence without relying on extra information like predefined slice labels. Current evaluation of slice coherence requires access to predefined slices formulated by metadata like attributes or subclasses. Its validity heavily relies on the quality and abundance of metadata, where some possible patterns could be ignored. Besides, current algorithms cannot directly incorporate the constraint of coherence into their optimization objective due to absence of an explicit coherence metric, which could potentially hinder their effectiveness. In this paper, we propose manifold compactness, a coherence metric without reliance on extra information by incorporating the data geometry property into its design, and experiments on typical datasets empirically validate the rationality of the metric. Then we develop Manifold Compactness based error Slice Discovery (MCSD), a novel algorithm that directly treats risk and coherence as the optimization objective, and is flexible to be applied to models of various tasks. Extensive experiments on the benchmark and case studies on other typical datasets demonstrate the superiority of MCSD.

[645] Planning and Learning in Average Risk-aware MDPs

Weikai Wang, Erick Delage

Main category: cs.LG

TL;DR: Extends risk-neutral MDP algorithms to accommodate dynamic risk measures, proposing RVI for planning and two model-free Q-learning algorithms with proven convergence.

Details

Motivation: Traditional average cost Markov decision processes assume risk-neutral agents, but many real-world decision-makers have risk-aware preferences that require more general dynamic risk measures.

Method: Proposes relative value iteration (RVI) algorithm for planning, and two model-free Q-learning algorithms: 1) generic MLMC-based algorithm, and 2) off-policy algorithm specifically for utility-based shortfall risk measures.

Result: Both RVI and MLMC-based Q-learning algorithms are proven to converge to optimality. Numerical experiments validate the analysis, confirm empirical convergence of the off-policy algorithm, and demonstrate identification of policies finely tuned to agent risk-awareness.

Conclusion: Successfully extends risk-neutral MDP algorithms to handle dynamic risk measures, providing both planning and learning algorithms with theoretical guarantees, enabling risk-aware decision-making in continuing tasks.

Abstract: For continuing tasks, average cost Markov decision processes have well-documented value and can be solved using efficient algorithms. However, it explicitly assumes that the agent is risk-neutral. In this work, we extend risk-neutral algorithms to accommodate the more general class of dynamic risk measures. Specifically, we propose a relative value iteration (RVI) algorithm for planning and design two model-free Q-learning algorithms, namely a generic algorithm based on the multi-level Monte Carlo (MLMC) method, and an off-policy algorithm dedicated to utility-based shortfall risk measures. Both the RVI and MLMC-based Q-learning algorithms are proven to converge to optimality. Numerical experiments validate our analysis, confirm empirically the convergence of the off-policy algorithm, and demonstrate that our approach enables the identification of policies that are finely tuned to the intricate risk-awareness of the agent that they serve.

[646] LogicXGNN: Grounded Logical Rules for Explaining Graph Neural Networks

Chuqin Geng, Ziyu Zhao, Zhaoyue Wang, Haolin Ye, Yuhe Jiang, Xujie Si

Main category: cs.LG

TL;DR: LogicXGNN: A post-hoc framework that constructs logical rules over reliable predicates to provide faithfully grounded explanations for GNNs, with new fidelity metrics evaluated on final graph forms.

Details

Motivation: Existing rule-based GNN explanations optimize fidelity in intermediate concept spaces, overlooking grounding quality for end users in final subgraph explanations, leading to unreliable explanations in practice.

Method: Proposes LogicXGNN framework that constructs logical rules over reliable predicates explicitly designed to capture GNN’s message-passing structure, ensuring effective grounding. Introduces data-grounded fidelity metric and complementary utility metrics.

Result: Improves data-grounded fidelity by over 20% on average relative to state-of-the-art methods while being 10-100× faster. Shows strong scalability and utility performance.

Conclusion: LogicXGNN produces explanations that are faithful to the model’s logic and reliably grounded in observable data, addressing the gap between intermediate concept optimization and final user-facing explanations.

Abstract: Existing rule-based explanations for Graph Neural Networks (GNNs) provide global interpretability but often optimize and assess fidelity in an intermediate, uninterpretable concept space, overlooking grounding quality for end users in the final subgraph explanations. This gap yields explanations that may appear faithful yet be unreliable in practice. To this end, we propose LogicXGNN, a post-hoc framework that constructs logical rules over reliable predicates explicitly designed to capture the GNN’s message-passing structure, thereby ensuring effective grounding. We further introduce data-grounded fidelity ($\textit{Fid}{\mathcal{D}}$), a realistic metric that evaluates explanations in their final-graph form, along with complementary utility metrics such as coverage and validity. Across extensive experiments, LogicXGNN improves $\textit{Fid}{\mathcal{D}}$ by over 20% on average relative to state-of-the-art methods while being 10-100 $\times$ faster. With strong scalability and utility performance, LogicXGNN produces explanations that are faithful to the model’s logic and reliably grounded in observable data. Our code is available at https://github.com/allengeng123/LogicXGNN/.

[647] Variational Online Mirror Descent for Robust Learning in Schrödinger Bridge

Dong-Sig Han, Jaein Kim, Hee Bin Yoo, Byoung-Tak Zhang

Main category: cs.LG

TL;DR: VMSB: Variational Mirrored Schrödinger Bridge algorithm using online mirror descent framework with Wasserstein-Fisher-Rao geometry for stable SB learning.

Details

Motivation: Existing Schrödinger Bridge methods have uncertain learning signals and rely on speculative optimal scenarios. Need for more stable and reliable SB solvers.

Method: Proposed variational online mirror descent framework for SB problems using Wasserstein-Fisher-Rao geometry of Gaussian mixture parameterization for Schrödinger potentials.

Result: VMSB consistently outperforms contemporary SB solvers across extensive benchmarks, demonstrating robustness and generality predicted by OMD theory.

Conclusion: The OMD framework provides stability to SB solvers, and VMSB offers tractable learning dynamics that precisely approximate each OMD step, making it a robust general-purpose SB algorithm.

Abstract: The Schrödinger bridge (SB) has evolved into a universal class of probabilistic generative models. In practice, however, estimated learning signals are innately uncertain, and the reliability promised by existing methods is often based on speculative optimal case scenarios. Recent studies regarding the Sinkhorn algorithm through mirror descent (MD) have gained attention, revealing geometric insights into solution acquisition of the SB problems. In this paper, we propose a variational online MD (OMD) framework for the SB problems, which provides further stability to SB solvers. We formally prove convergence and a regret bound for the novel OMD formulation of SB acquisition. As a result, we propose a simulation-free SB algorithm called Variational Mirrored Schrödinger Bridge (VMSB) by utilizing the Wasserstein-Fisher-Rao geometry of the Gaussian mixture parameterization for Schrödinger potentials. Based on the Wasserstein gradient flow theory, the algorithm offers tractable learning dynamics that precisely approximate each OMD step. In experiments, we validate the performance of the proposed VMSB algorithm across an extensive suite of benchmarks. VMSB consistently outperforms contemporary SB solvers on a wide range of SB problems, demonstrating the robustness as well as generality predicted by our OMD theory.

[648] TabRep: Training Tabular Diffusion Models with a Simple and Effective Continuous Representation

Jacob Si, Zijing Ou, Mike Qu, Zhengrui Xiang, Yingzhen Li

Main category: cs.LG

TL;DR: TabRep introduces a unified continuous representation for tabular diffusion models that addresses the limitations of separate vs unified data representations, achieving superior performance and computational efficiency.

Details

Motivation: Current tabular diffusion models face a dilemma: separate representations struggle to jointly model multi-modal distributions, while unified representations use suboptimal encoding heuristics and incur additional computational costs.

Method: TabRep proposes a tabular diffusion architecture trained with a unified continuous representation designed with geometric insights about data manifolds. The representation emphasizes density, flexibility for nominal features, and preservation of intrinsic relationships.

Result: TabRep achieves superior performance across a broad evaluation suite, becoming the first method to synthesize tabular data that exceeds downstream quality of original datasets while preserving privacy and maintaining computational efficiency.

Conclusion: TabRep provides a simple yet effective approach for training tabular diffusion models under a continuous data manifold, solving the representation dilemma and advancing the state of tabular data generation.

Abstract: Diffusion models have been the predominant generative model for tabular data generation. However, they face the conundrum of modeling under a separate versus a unified data representation. The former encounters the challenge of jointly modeling all multi-modal distributions of tabular data in one model. While the latter alleviates this by learning a single representation for all features, it currently leverages sparse suboptimal encoding heuristics and necessitates additional computation costs. In this work, we address the latter by presenting TabRep, a tabular diffusion architecture trained with a unified continuous representation. To motivate the design of our representation, we provide geometric insights into how the data manifold affects diffusion models. The key attributes of our representation are composed of its density, flexibility to provide ample separability for nominal features, and ability to preserve intrinsic relationships. Ultimately, TabRep provides a simple yet effective approach for training tabular diffusion models under a continuous data manifold. Our results showcase that TabRep achieves superior performance across a broad suite of evaluations. It is the first to synthesize tabular data that exceeds the downstream quality of the original datasets while preserving privacy and remaining computationally efficient. Code is available at https://github.com/jacobyhsi/TabRep.

[649] BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models

Evan R. Antoniuk, Shehtab Zaman, Tal Ben-Nun, Peggy Li, James Diffenderfer, Busra Sahin, Obadiah Smolenski, Tim Hsu, Anna M. Hiszpanski, Kenneth Chiu, Bhavya Kailkhura, Brian Van Essen

Main category: cs.LG

TL;DR: BOOM is a benchmark for evaluating out-of-distribution (OOD) generalization in molecular property prediction, showing current ML models struggle with OOD generalization despite good in-distribution performance.

Details

Motivation: There's a lack of systematic benchmarks for molecular OOD prediction tasks, which is crucial for discovering novel molecules through AI/ML since models need to generalize beyond training distributions.

Method: Created chemically-informed benchmark for OOD performance on molecular property prediction tasks, evaluated over 150 model-task combinations, and performed extensive ablation experiments on data generation, pre-training, hyperparameter optimization, model architecture, and molecular representation.

Result: No existing model achieves strong generalization across all tasks - top-performing model had 3x higher OOD error than in-distribution. Chemical foundation models don’t show strong OOD extrapolation, while models with high inductive bias perform better on simple, specific OOD properties.

Conclusion: Developing models with strong OOD generalization is a new frontier challenge in chemical ML, and the BOOM benchmark provides an open-source tool to evaluate and improve OOD performance.

Abstract: Data-driven molecular discovery leverages artificial intelligence/machine learning (AI/ML) and generative modeling to filter and design novel molecules. Discovering novel molecules requires accurate out-of-distribution (OOD) predictions, but ML models struggle to generalize OOD. Currently, no systematic benchmarks exist for molecular OOD prediction tasks. We present $\mathbf{BOOM}$, $\mathbf{b}$enchmarks for $\mathbf{o}$ut-$\mathbf{o}$f-distribution $\mathbf{m}$olecular property predictions: a chemically-informed benchmark for OOD performance on common molecular property prediction tasks. We evaluate over 150 model-task combinations to benchmark deep learning models on OOD performance. Overall, we find that no existing model achieves strong generalization across all tasks: even the top-performing model exhibited an average OOD error 3x higher than in-distribution. Current chemical foundation models do not show strong OOD extrapolation, while models with high inductive bias can perform well on OOD tasks with simple, specific properties. We perform extensive ablation experiments, highlighting how data generation, pre-training, hyperparameter optimization, model architecture, and molecular representation impact OOD performance. Developing models with strong OOD generalization is a new frontier challenge in chemical ML. This open-source benchmark is available at https://github.com/FLASK-LLNL/BOOM

[650] Focus on Likely Classes for Test-Time Prediction

Johannes Schneider

Main category: cs.LG

TL;DR: The paper proposes test-time fine-tuning methods that focus on likely classes of a single in-domain sample to improve uncertain predictions, achieving accuracy gains by emphasizing shared features among likely classes.

Details

Motivation: Prior work argued that focusing on likely classes of a single in-domain sample cannot improve model predictions. The authors challenge this by proposing that sharedness of features among classes indicates their reliability for a single sample, aiming to achieve improvement without hand-engineered augmentations or auxiliary tasks.

Method: Two novel test-time fine-tuning methods are proposed: instead of greedily selecting the most likely class, an additional “focus on the likely classes” step refines predictions. This involves applying a single gradient descent step with a large learning rate when initial forward pass indicates high uncertainty, emphasizing shared features among likely classes.

Result: Experimental evaluation demonstrates accuracy gains for one of the methods on average, with improvements confirmed across diverse text and image domain models.

Conclusion: The paper provides an affirmative answer to whether focusing on likely classes of a single in-domain sample can improve model predictions, showing that shared features among likely classes can be leveraged to refine uncertain predictions through test-time fine-tuning.

Abstract: We ask: Can focusing on likely classes of a single, in-domain sample improve model predictions? Prior work argued no''. We put forward a novel rationale in favor of yes’’: Sharedness of features among classes indicates their reliability for a single sample. We aim for an affirmative answer without using hand-engineered augmentations or auxiliary tasks. We propose two novel test-time fine-tuning methods to improve uncertain model predictions. Instead of greedily selecting the most likely class, we introduce an additional step, \emph{focus on the likely classes}, to refine predictions. By applying a single gradient descent step with a large learning rate, we refine predictions when an initial forward pass indicates high uncertainty. The experimental evaluation demonstrates accuracy gains for one of our methods on average, which emphasizes shared features among likely classes. The gains are confirmed across diverse text and image domain models.

[651] DyGSSM: Multi-view Dynamic Graph Embeddings with State Space Model Gradient Update

Bizhan Alipour Pijan, Serdar Bozdag

Main category: cs.LG

TL;DR: DyGSSM: A novel dynamic graph representation learning method that combines local and global feature extraction in each snapshot using GCN and random walk with GRU, integrates them via cross-attention, and uses HiPPO-based SSM for temporal dependency management during parameter updates.

Details

Motivation: Existing dynamic graph methods have two key limitations: (1) they fail to extract both global and local information simultaneously in each snapshot, and (2) they don't consider model performance in current snapshots during parameter updates, leading to poor temporal dependency management.

Method: Combines Graph Convolution Networks (GCN) for local feature extraction and random walk with Gated Recurrent Unit (GRU) for global feature extraction in each snapshot. Integrates these features using cross-attention mechanism. Incorporates State Space Model (SSM) based on HiPPO algorithm to account for long-term dependencies when updating model parameters, ensuring current snapshot performance informs subsequent updates.

Result: Experiments on five public datasets show the method outperforms existing baseline and state-of-the-art methods in 17 out of 20 cases.

Conclusion: The proposed DyGSSM method effectively addresses limitations of existing dynamic graph representation learning by simultaneously capturing local and global structures and incorporating temporal dependencies through HiPPO-based SSM, achieving superior performance across multiple datasets.

Abstract: Most of the dynamic graph representation learning methods involve dividing a dynamic graph into discrete snapshots to capture the evolving behavior of nodes over time. Existing methods primarily capture only local or global structures of each node within a snapshot using message-passing and random walk-based methods. Then, they utilize sequence-based models (e.g., transformers) to encode the temporal evolution of node embeddings, and meta-learning techniques to update the model parameters. However, these approaches have two limitations. First, they neglect the extraction of global and local information simultaneously in each snapshot. Second, they fail to consider the model’s performance in the current snapshot during parameter updates, resulting in a lack of temporal dependency management. Recently, HiPPO (High-order Polynomial Projection Operators) algorithm has gained attention for their ability to optimize and preserve sequence history in State Space Model (SSM). To address the aforementioned limitations in dynamic graph representation learning, we propose a novel method called Multi-view Dynamic Graph Embeddings with State Space Model Gradient Update (DyGSSM). Our approach combines Graph Convolution Networks (GCN) for local feature extraction and random walk with Gated Recurrent Unit (GRU) for global feature extraction in each snapshot. We then integrate the local and global features using a cross-attention mechanism. Additionally, we incorporate an SSM based on HiPPO algorithm to account for long-term dependencies when updating model parameters, ensuring that model performance in each snapshot informs subsequent updates. Experiments on five public datasets show that our method outperforms existing baseline and state-of-the-art (SOTA) methods in 17 out of 20 cases.

[652] Addition is almost all you need: Compressing neural networks with double binary factorization

Vladimír Boža, Vladimír Macko

Main category: cs.LG

TL;DR: DBF factorizes weight matrices into two binary matrices with scaling vectors, preserving binary efficiency while achieving competitive compression rates and allowing fine-grained control over compression ratios.

Details

Motivation: Binary quantization reduces LLM computational/storage costs but suffers from severe accuracy degradation due to ±1 constraints. Need method that maintains binary efficiency while improving accuracy.

Method: Double Binary Factorization (DBF) decomposes dense weight matrices into products of two binary (sign) matrices with scaling vectors. Allows fine-grained compression control via intermediate dimension adjustment. Includes algorithm for estimating non-uniform layer-wise compression ratios.

Result: DBF outperforms existing binarization in 1-bit range, competitive with best quantization methods (QuIP#, QTIP) in 2-bit range. Achieves compression rates competitive/superior to state-of-the-art while preserving binary efficiency.

Conclusion: DBF offers efficient binary representation with flexible compression control, bridging gap between binary quantization efficiency and accuracy preservation for LLM compression.

Abstract: Binary quantization approaches, which replace weight matrices with binary matrices and substitute costly multiplications with cheaper additions, offer a computationally efficient approach to address the increasing computational and storage requirements of Large Language Models (LLMs). However, the severe quantization constraint ($\pm1$) can lead to significant accuracy degradation. In this paper, we propose Double Binary Factorization (DBF), a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by scaling vectors. DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods. Specifically, in a 1-bit per weight range, DBF is better than existing binarization approaches. In a 2-bit per weight range, DBF is competitive with the best quantization methods like QuIP# and QTIP. Unlike most existing compression techniques, which offer limited compression level choices, DBF allows fine-grained control over compression ratios by adjusting the factorization’s intermediate dimension. Based on this advantage, we further introduce an algorithm for estimating non-uniform layer-wise compression ratios for DBF, based on previously developed channel pruning criteria. Code available at: https://github.com/usamec/double_binary

[653] Shape it Up! Restoring LLM Safety during Finetuning

ShengYun Peng, Pin-Yu Chen, Jianfeng Chi, Seongmin Lee, Duen Horng Chau

Main category: cs.LG

TL;DR: STAR-DSS: A dynamic safety shaping framework that uses fine-grained safety signals from guardrail models to reinforce safe segments and suppress unsafe content during LLM finetuning, preventing safety degradation while maintaining task capabilities.

Details

Motivation: Finetuning LLMs for customization introduces safety risks where even a few harmful examples can compromise safety alignment. Current approaches treat examples as uniformly safe/unsafe, but safety context can shift within a single response, making static safety shaping suboptimal.

Method: Proposes dynamic safety shaping (DSS) using guardrail models to evaluate partial responses and track safety risk evolution throughout responses. Introduces Safety Trajectory Assessment of Response (STAR) - token-level safety signals that enable dynamic shaping over training sequences. STAR-DSS uses these fine-grained scores to reinforce learning from safe segments while suppressing unsafe content.

Result: STAR-DSS robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families without compromising capability on intended tasks.

Conclusion: Dynamic safety shaping using fine-grained safety signals (STAR) provides stronger mitigation against finetuning risks than static approaches. Future safety research should build on dynamic shaping principles for evolving finetuning risks.

Abstract: Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks: even a few harmful examples can compromise safety alignment. A common mitigation strategy is to update the model more strongly on examples deemed safe, while downweighting or excluding those flagged as unsafe. However, because safety context can shift within a single example, updating the model equally on both harmful and harmless parts of a response is suboptimal-a coarse treatment we term static safety shaping. In contrast, we propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. To enable such fine-grained control during finetuning, we introduce a key insight: guardrail models, traditionally used for filtering, can be repurposed to evaluate partial responses, tracking how safety risk evolves throughout the response, segment by segment. This leads to the Safety Trajectory Assessment of Response (STAR), a token-level signal that enables shaping to operate dynamically over the training sequence. Building on this, we present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families-all without compromising capability on intended tasks. We encourage future safety research to build on dynamic shaping principles for stronger mitigation against evolving finetuning risks. Our code is publicly available at https://github.com/poloclub/star-dss.

[654] Less is More: Unlocking Specialization of Time Series Foundation Models via Structured Pruning

Lifan Zhao, Yanyan Shen, Zhaoyang Liu, Xue Wang, Jiaji Deng

Main category: cs.LG

TL;DR: The paper proposes a “prune-then-finetune” paradigm for Time Series Foundation Models (TSFMs) where structured pruning is applied before fine-tuning to improve forecasting performance by focusing on relevant parameter spaces.

Details

Motivation: Despite scaling laws showing TSFMs have strong zero-shot performance, they often underperform smaller specialized models after fine-tuning. The authors aim to find effective adaptation methods for TSFMs to leverage their pre-trained knowledge while improving downstream task performance.

Method: Proposes structured pruning of TSFMs before fine-tuning to identify and preserve task-relevant network substructures. This reduces redundancy and focuses fine-tuning on a more compact, relevant parameter space, creating a “prune-then-finetune” paradigm.

Result: Extensive experiments on 7 TSFMs and 6 benchmarks show that fine-tuning smaller, pruned TSFMs significantly improves forecasting performance compared to fine-tuning original models. This approach often achieves state-of-the-art performance and surpasses specialized baselines.

Conclusion: The prune-then-finetune paradigm effectively leverages TSFMs’ pre-trained knowledge by identifying and preserving task-relevant substructures through pruning, enabling better adaptation to downstream forecasting tasks and superior performance over both original TSFMs and specialized models.

Abstract: Scaling laws motivate the development of Time Series Foundation Models (TSFMs) that pre-train vast parameters and achieve remarkable zero-shot forecasting performance. Surprisingly, even after fine-tuning, TSFMs cannot consistently outperform smaller, specialized models trained on full-shot downstream data. A key question is how to realize effective adaptation of TSFMs for a target forecasting task. Through empirical studies on various TSFMs, the pre-trained models often exhibit inherent sparsity and redundancy in computation, suggesting that TSFMs have learned to activate task-relevant network substructures to accommodate diverse forecasting tasks. To preserve this valuable prior knowledge, we propose a structured pruning method to regularize the subsequent fine-tuning process by focusing it on a more relevant and compact parameter space. Extensive experiments on seven TSFMs and six benchmarks demonstrate that fine-tuning a smaller, pruned TSFM significantly improves forecasting performance compared to fine-tuning original models. This prune-then-finetune paradigm often enables TSFMs to achieve state-of-the-art performance and surpass strong specialized baselines. Source code is made publicly available at https://github.com/SJTU-DMTai/Prune-then-Finetune.

[655] Can Slow-thinking LLMs Reason Over Time? Empirical Studies in Time Series Forecasting

Mingyue Cheng, Jiahao Wang, Daoyu Wang, Xiaoyu Tao, Qi Liu, Enhong Chen

Main category: cs.LG

TL;DR: Slow-thinking LLMs show promising zero-shot time series forecasting capabilities when framed as a reasoning task, though with limitations.

Details

Motivation: Current time series forecasting methods focus on fast pattern extraction but lack explicit reasoning over temporal dynamics. Slow-thinking LLMs have shown strong multi-step reasoning abilities, suggesting potential for reframing forecasting as a structured reasoning task.

Method: Proposed TimeReasoner framework that formulates time series forecasting as conditional reasoning task. Designed prompting strategies to elicit inference-time reasoning from pretrained slow-thinking LLMs and evaluated across diverse benchmarks.

Result: Slow-thinking LLMs demonstrate non-trivial zero-shot forecasting capabilities, particularly in capturing high-level trends and contextual shifts. However, the study reveals both potential and limitations in their reasoning behaviors for temporal domains.

Conclusion: The work provides preliminary evidence that slow-thinking LLMs can perform time series forecasting through reasoning, highlighting opportunities for more interpretable and generalizable forecasting frameworks based on reasoning paradigms.

Abstract: Time series forecasting (TSF) is a fundamental and widely studied task, spanning methods from classical statistical approaches to modern deep learning and multimodal language modeling. Despite their effectiveness, these methods often follow a fast thinking paradigm emphasizing pattern extraction and direct value mapping, while overlooking explicit reasoning over temporal dynamics and contextual dependencies. Meanwhile, emerging slow-thinking LLMs (e.g., ChatGPT-o1, DeepSeek-R1) have demonstrated impressive multi-step reasoning capabilities across diverse domains, suggesting a new opportunity for reframing TSF as a structured reasoning task. This motivates a key question: can slow-thinking LLMs effectively reason over temporal patterns to support time series forecasting, even in zero-shot manner? To investigate this, in this paper, we propose TimeReasoner, an extensive empirical study that formulates TSF as a conditional reasoning task. We design a series of prompting strategies to elicit inference-time reasoning from pretrained slow-thinking LLMs and evaluate their performance across diverse TSF benchmarks. Our findings reveal that slow-thinking LLMs exhibit non-trivial zero-shot forecasting capabilities, especially in capturing high-level trends and contextual shifts. While preliminary, our study surfaces important insights into the reasoning behaviors of LLMs in temporal domains highlighting both their potential and limitations. We hope this work catalyzes further research into reasoning-based forecasting paradigms and paves the way toward more interpretable and generalizable TSF frameworks.

[656] MOORL: A Framework for Integrating Offline-Online Reinforcement Learning

Gaurav Chaudhary, Wassim Uddin Mondal, Laxmidhar Behera

Main category: cs.LG

TL;DR: MOORL is a hybrid offline-online RL framework that uses a meta-policy to seamlessly adapt between offline and online learning, achieving better exploration and performance with minimal computational overhead.

Details

Motivation: Offline RL faces limitations with out-of-distribution actions and poor generalization, while existing hybrid methods are complex and computationally expensive. There's a need for an efficient framework that combines offline data's robustness with online exploration's benefits.

Method: MOORL introduces a meta-policy that adapts across offline and online trajectories, enabling agents to leverage offline data for initialization while using online interactions for exploration. The approach learns a stable Q-function without added complexity.

Result: Extensive experiments on 28 tasks from D4RL and V-D4RL benchmarks show consistent improvements over state-of-the-art offline and hybrid RL baselines, with minimal computational overhead.

Conclusion: MOORL effectively combines offline and online RL strengths, enhancing exploration and performance while maintaining simplicity, making it promising for practical real-world applications.

Abstract: Sample efficiency and exploration remain critical challenges in Deep Reinforcement Learning (DRL), particularly in complex domains. Offline RL, which enables agents to learn optimal policies from static, pre-collected datasets, has emerged as a promising alternative. However, offline RL is constrained by issues such as out-of-distribution (OOD) actions that limit policy performance and generalization. To overcome these limitations, we propose Meta Offline-Online Reinforcement Learning (MOORL), a hybrid framework that unifies offline and online RL for efficient and scalable learning. While previous hybrid methods rely on extensive design components and added computational complexity to utilize offline data effectively, MOORL introduces a meta-policy that seamlessly adapts across offline and online trajectories. This enables the agent to leverage offline data for robust initialization while utilizing online interactions to drive efficient exploration. Our theoretical analysis demonstrates that the hybrid approach enhances exploration by effectively combining the complementary strengths of offline and online data. Furthermore, we demonstrate that MOORL learns a stable Q-function without added complexity. Extensive experiments on 28 tasks from the D4RL and V-D4RL benchmarks validate its effectiveness, showing consistent improvements over state-of-the-art offline and hybrid RL baselines. With minimal computational overhead, MOORL achieves strong performance, underscoring its potential for practical applications in real-world scenarios.

[657] Value-Free Policy Optimization via Reward Partitioning

Bilal Faye, Hanane Azzag, Mustapha Lebbah

Main category: cs.LG

TL;DR: RPO is a new single-trajectory RL method that eliminates the need for value function approximation by using reward partitioning, outperforming existing baselines like DRO and KTO.

Details

Motivation: Single-trajectory RL methods are practical as they use simple scalar rewards (like thumbs-up/down) rather than complex preference pairs. However, existing methods like DRO require value function approximation, which introduces limitations including high variance, policy-value coupling, and lack of direct policy supervision.

Method: RPO removes the need for value function modeling by normalizing observed rewards using a partitioning approach estimated directly from data. This creates a straightforward supervised learning objective on the policy with no auxiliary models or joint optimization.

Result: RPO outperforms existing single-trajectory baselines (DRO and KTO) on scalar-feedback language modeling tasks using Flan-T5 encoder-decoder models.

Conclusion: RPO is a simple, effective, and theoretically grounded method for single-trajectory policy optimization that provides direct and stable supervision on the policy, making it robust and easy to implement in practice.

Abstract: Single-trajectory reinforcement learning (RL) methods aim to optimize policies from datasets consisting of (prompt, response, reward) triplets, where scalar rewards are directly available. This supervision format is highly practical, as it mirrors real-world human feedback, such as thumbs-up/down signals, and avoids the need for structured preference annotations. In contrast, pairwise preference-based methods like Direct Preference Optimization (DPO) rely on datasets with both preferred and dispreferred responses, which are harder to construct and less natural to collect. Among single-trajectory approaches, Direct Reward Optimization (DRO) has shown strong empirical performance due to its simplicity and stability. However, DRO requires approximating a value function, which introduces several limitations: high off-policy variance, coupling between policy and value learning, and a lack of absolute supervision on the policy itself. We introduce Reward Partitioning Optimization (RPO), a new method that resolves these limitations by removing the need to model the value function. Instead, RPO normalizes observed rewards using a partitioning approach estimated directly from data. This leads to a straightforward supervised learning objective on the policy, with no auxiliary models and no joint optimization. RPO provides direct and stable supervision on the policy, making it robust and easy to implement in practice. We validate RPO on scalar-feedback language modeling tasks using Flan-T5 encoder-decoder models. Our results demonstrate that RPO outperforms existing single-trajectory baselines such as DRO and Kahneman-Tversky Optimization (KTO). These findings confirm that RPO is a simple, effective, and theoretically grounded method for single-trajectory policy optimization.

[658] Subspace-Boosted Model Merging

Ronald Skorobogat, Karsten Roth, Mariana-Iuliana Georgescu

Main category: cs.LG

TL;DR: Subspace Boosting improves model merging by preventing rank collapse in task vectors when combining many specialized experts, achieving >10% performance gains for up to 20 models.

Details

Motivation: Current model merging methods suffer from diminishing returns as more expert models are combined, with task-specific information being overwhelmed by common information, leading to rank collapse and reduced performance gains.

Method: Introduces Subspace Boosting that operates on singular value decomposed task vector space to maintain task vector ranks, preventing rank collapse. Also proposes using Higher-Order Generalized Singular Value Decomposition to quantify task similarity for interpretability.

Result: Subspace Boosting significantly improves merging efficacy by more than 10% for up to 20 experts on both vision and language benchmarks, overcoming the diminishing returns problem.

Conclusion: The proposed Subspace Boosting method effectively addresses rank collapse in model merging, enabling better performance when combining many specialized experts while providing interpretable task similarity analysis.

Abstract: Model merging enables the combination of multiple specialized expert models into a single model capable of performing multiple tasks. However, the benefits of merging an increasing amount of specialized experts generally lead to diminishing returns and reduced overall performance gains. In this work, we empirically and theoretically analyze this limitation, proving that for Task Arithmetic-based methods, as more experts are merged, the common information dominates the task-specific information, leading to inevitable rank collapse. To mitigate this issue, we introduce Subspace Boosting, which operates on the singular value decomposed task vector space and maintains task vector ranks. Subspace Boosting raises merging efficacy for up to 20 experts by large margins of more than 10% when evaluated on both vision and language benchmarks. Moreover, we propose employing Higher-Order Generalized Singular Value Decomposition to quantify task similarity, offering a new interpretable perspective on model merging. Code and models are available at https://github.com/ronskoro/Subspace-Boosting.

[659] ESSA: Evolutionary Strategies for Scalable Alignment

Daria Korotyshova, Boris Shaposhnikov, Alexey Malakhov, Alexey Khokhulin, Nikita Surnachev, Kirill Ovcharenko, George Bredis, Alexey Gorbatovski, Viacheslav Sinii, Daniil Gavrilov

Main category: cs.LG

TL;DR: ESSA is a gradient-free evolutionary strategy framework for LLM alignment that uses only forward inference and black-box optimization on compressed LoRA adapters, achieving competitive results with faster scaling and reduced engineering overhead compared to gradient-based methods.

Details

Motivation: Current RLHF methods like PPO and GRPO require complex distributed training, large memory budgets, and careful hyperparameter tuning, which become increasingly difficult at billion-parameter scale. There's a need for simpler, more scalable alignment methods.

Method: ESSA uses evolutionary strategies with only forward inference and black-box optimization. It focuses on optimizing Low-Rank Adapters (LoRA) and further compresses their parameter space by optimizing only the singular values from SVD decomposition of each adapter matrix, enabling efficient operation in quantized INT4/INT8 modes.

Result: ESSA improves Qwen2.5-Math-7B by 12.6% on GSM8K and 14.8% on PRM800K, raises LLaMA3.1-8B accuracy on IFEval by 22.5% compared to GRPO. On Qwen2.5-32B for PRM800K, it reaches near-optimal accuracy twice as fast on 16 GPUs and six times as fast on 128 GPUs.

Conclusion: Evolutionary strategies offer a compelling, hardware-friendly alternative to gradient-based LLM alignment, combining competitive quality with substantially reduced wall-clock time and engineering overhead, especially beneficial for large-scale models.

Abstract: Alignment of Large Language Models (LLMs) typically relies on Reinforcement Learning from Human Feedback (RLHF) with gradient-based optimizers such as Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO). While effective, these methods require complex distributed training, large memory budgets, and careful hyperparameter tuning, all of which become increasingly difficult at billion-parameter scale. We present ESSA, Evolutionary Strategies for Scalable Alignment, a gradient-free framework that aligns LLMs using only forward inference and black-box optimization. ESSA focuses optimization on Low-Rank Adapters (LoRA) and further compresses their parameter space by optimizing only the singular values from an singular value decomposition (SVD) of each adapter matrix. This dimensionality reduction makes evolutionary search practical even for very large models and allows efficient operation in quantized INT4 and INT8 inference mode. Across these benchmarks ESSA improves the test accuracy of Qwen2.5-Math-7B by 12.6% on GSM8K and 14.8% on PRM800K, and raises the accuracy of LLaMA3.1-8B on IFEval by 22.5%, all compared with GRPO. In large-scale settings ESSA shows stronger scaling than gradient-based methods: on Qwen2.5-32B for PRM800K it reaches near-optimal accuracy twice as fast on 16 GPUs and six times as fast on 128 GPUs compared with GRPO. These results position evolutionary strategies as a compelling, hardware-friendly alternative to gradient-based LLM alignment, combining competitive quality with substantially reduced wall-clock time and engineering overhead.

[660] Vidar: Embodied Video Diffusion Model for Generalist Manipulation

Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, Jun Zhu

Main category: cs.LG

TL;DR: Vidar: A video diffusion model with masked inverse dynamics adapter that enables robot manipulation with minimal demonstrations by leveraging large-scale pre-training and suppressing visual distractors.

Details

Motivation: Scaling robot manipulation to new embodiments is challenging due to need for large homogeneous demonstrations and sensitivity of end-to-end pixel-to-action pipelines to background/viewpoint shifts.

Method: Combines embodied video diffusion model (pre-trained on internet video, then fine-tuned on 750K multi-view robot trajectories) with masked inverse dynamics model that learns action-relevant pixel masks without dense labels.

Result: With only 20 minutes of human demonstrations (1% of typical data), Vidar outperforms SOTA baselines and generalizes to unseen tasks, backgrounds, and camera layouts.

Conclusion: Proposes scalable “one prior, many embodiments” approach: strong video priors plus minimal on-robot alignment enables efficient adaptation to new robot platforms.

Abstract: Scaling general-purpose manipulation to new robot embodiments remains challenging: each platform typically needs large, homogeneous demonstrations, and end-to-end pixel-to-action pipelines may degenerate under background and viewpoint shifts. Based on previous advances in video-based robot control, we present Vidar, consisting of an embodied video diffusion model as the generalizable prior and a masked inverse dynamics model (MIDM) as the adapter. We leverage a video diffusion model pre-trained at Internet scale, and further continuously pre-train it for the embodied domain using 750K multi-view trajectories collected from three real-world robot platforms. For this embodied pre-training, we introduce a unified observation space that jointly encodes robot, camera, task, and scene contexts. The MIDM module learns action-relevant pixel masks without dense labels, grounding the prior into the target embodiment’s action space while suppressing distractors. With only 20 minutes of human demonstrations on an unseen robot (1% of typical data), Vidar outperforms state-of-the-art baselines and generalizes to unseen tasks, backgrounds, and camera layouts. Our results suggest a scalable recipe for “one prior, many embodiments”: strong, inexpensive video priors together with minimal on-robot alignment.

[661] From Novelty to Imitation: Self-Distilled Rewards for Offline Reinforcement Learning

Gaurav Chaudhary, Laxmidhar Behera

Main category: cs.LG

TL;DR: ReLOAD is a novel offline RL framework that generates intrinsic rewards from expert demonstrations using Random Network Distillation, eliminating the need for costly reward annotations.

Details

Motivation: Offline RL requires explicit reward annotations which are costly to engineer and difficult to obtain retrospectively, hindering practical adoption of offline RL methods.

Method: Adapts Random Network Distillation (RND) to generate intrinsic rewards from expert demonstrations. Trains a predictor network to mimic a fixed target network’s embeddings on expert state transitions, then uses the prediction error as reward signal for all transitions in the dataset.

Result: Experiments on D4RL benchmark show ReLOAD enables robust offline policy learning and achieves performance competitive with traditional reward-annotated methods.

Conclusion: ReLOAD provides a practical solution for offline RL by eliminating the need for handcrafted reward annotations through a simple yet effective intrinsic reward generation mechanism based on expert demonstrations.

Abstract: Offline Reinforcement Learning (RL) aims to learn effective policies from a static dataset without requiring further agent-environment interactions. However, its practical adoption is often hindered by the need for explicit reward annotations, which can be costly to engineer or difficult to obtain retrospectively. To address this, we propose ReLOAD (Reinforcement Learning with Offline Reward Annotation via Distillation), a novel reward annotation framework for offline RL. Unlike existing methods that depend on complex alignment procedures, our approach adapts Random Network Distillation (RND) to generate intrinsic rewards from expert demonstrations using a simple yet effective embedding discrepancy measure. First, we train a predictor network to mimic a fixed target network’s embeddings based on expert state transitions. Later, the prediction error between these networks serves as a reward signal for each transition in the static dataset. This mechanism provides a structured reward signal without requiring handcrafted reward annotations. We provide a formal theoretical construct that offers insights into how RND prediction errors effectively serve as intrinsic rewards by distinguishing expert-like transitions. Experiments on the D4RL benchmark demonstrate that ReLOAD enables robust offline policy learning and achieves performance competitive with traditional reward-annotated methods.

[662] Prediction-Oriented Subsampling from Data Streams

Benedetta Lavinia Mussati, Freddie Bickford Smith, Tom Rainforth, Stephen Roberts

Main category: cs.LG

TL;DR: Information-theoretic data subsampling method for stream learning that reduces uncertainty in downstream predictions outperforms previous techniques.

Details

Motivation: Data streams present computational challenges for learning models, requiring methods to capture relevant information while managing costs.

Method: Propose an information-theoretic subsampling approach focused on reducing uncertainty in downstream predictions of interest for offline learning.

Result: The prediction-oriented approach outperforms a previously proposed information-theoretic technique on two widely studied problems.

Conclusion: Careful model design is essential for reliably achieving strong performance with information-theoretic subsampling methods in practice.

Abstract: Data is often generated in streams, with new observations arriving over time. A key challenge for learning models from data streams is capturing relevant information while keeping computational costs manageable. We explore intelligent data subsampling for offline learning, and argue for an information-theoretic method centred on reducing uncertainty in downstream predictions of interest. Empirically, we demonstrate that this prediction-oriented approach performs better than a previously proposed information-theoretic technique on two widely studied problems. At the same time, we highlight that reliably achieving strong performance in practice requires careful model design.

[663] Low-Regret and Low-Complexity Learning for Hierarchical Inference

Sameep Chattopadhyay, Vinay Sutar, Jaya Prakash Champati, Sharayu Moharir

Main category: cs.LG

TL;DR: Novel hierarchical inference learning approach with UCB-based policies achieves optimal logarithmic regret for edge intelligence systems, significantly outperforming existing methods.

Details

Motivation: Hierarchical Inference (HI) in edge intelligence systems faces challenges in estimating when local inference is likely incorrect, especially with changing data distributions and offloading costs over time - a problem termed Hierarchical Inference Learning (HIL).

Method: Model probability of correct local inference as increasing function of model’s confidence measure, then propose two UCB-based policies: HI-LCB and HI-LCB-lite with O(1) per-sample complexity.

Result: Both policies achieve order-optimal regret of O(log T), significant improvement over existing HIL policies with O(T^{2/3}) regret. HI-LCB-lite has O(1) computational complexity suitable for resource-limited devices.

Conclusion: The proposed UCB-based HIL policies provide theoretically optimal performance with practical efficiency, making them suitable for deployment in real-world edge intelligence systems with resource constraints.

Abstract: This work focuses on Hierarchical Inference (HI) in edge intelligence systems, where a compact Local-ML model on an end-device works in conjunction with a high-accuracy Remote-ML model on an edge-server. HI aims to reduce latency, improve accuracy, and lower bandwidth usage by first using the Local-ML model for inference and offloading to the Remote-ML only when the local inference is likely incorrect. A critical challenge in HI is estimating the likelihood of the local inference being incorrect, especially when data distributions and offloading costs change over time – a problem we term Hierarchical Inference Learning (HIL). We introduce a novel approach to HIL by modeling the probability of correct inference by the Local-ML as an increasing function of the model’s confidence measure, a structure motivated by empirical observations but previously unexploited. We propose two policies, HI-LCB and HI-LCB-lite, based on the Upper Confidence Bound (UCB) framework. We demonstrate that both policies achieve order-optimal regret of $O(\log T)$, a significant improvement over existing HIL policies with $O(T^{2/3})$ regret guarantees. Notably, HI-LCB-lite has an $O(1)$ per-sample computational complexity, making it well-suited for deployment on devices with severe resource limitations. Simulations using real-world datasets confirm that our policies outperform existing state-of-the-art HIL methods.

[664] Explainable Graph Spectral Clustering For GloVe-like Text Embeddings

Mieczysław A. Kłopotek, Sławomir T. Wierzchoń, Bartłomiej Starosta, Piotr Borkowski, Dariusz Czerski, Eryk Laskowski

Main category: cs.LG

TL;DR: Extends previous work on explaining Graph Spectral Clustering results for documents to include GloVe embeddings and other document representations beyond term vector cosine similarity.

Details

Motivation: To generalize the explainability framework for Graph Spectral Clustering from just term vector space (cosine similarity) to include other document embeddings like GloVe, making the approach more versatile for different document representation methods.

Method: Generalizes previous explainability approach by incorporating various document embeddings, particularly focusing on GloVe embeddings as an alternative to term vector representations.

Result: The paper extends the explainability framework to work with multiple document embedding methods, enabling interpretation of clustering results across different representation spaces.

Conclusion: The generalization allows for explainable Graph Spectral Clustering across diverse document embedding techniques, enhancing the framework’s applicability to various text analysis scenarios.

Abstract: In a previous paper, we proposed an introduction to the explainability of Graph Spectral Clustering results for textual documents, given that document similarity is computed as cosine similarity in term vector space. In this paper, we generalize this idea by considering other embeddings of documents, in particular, based on the GloVe embedding idea.

[665] An Unsupervised Deep Explainable AI Framework for Localization of Concurrent Replay Attacks in Nuclear Reactor Signals

Konstantinos Vasili, Zachery T. Dahm, Stylianos Chatzidakis

Main category: cs.LG

TL;DR: Proposes an unsupervised explainable AI framework combining autoencoder and customized windowSHAP to detect and characterize replay attacks in nuclear reactor data with high accuracy.

Details

Motivation: Next-gen nuclear reactors generate multivariate time series data vulnerable to replay attacks. Current approaches focus on detection without root cause analysis, rely on synthetic data with unrealistic assumptions, and lack explainability needed for regulated nuclear systems.

Method: Unsupervised XAI framework combining autoencoder for anomaly detection with customized windowSHAP algorithm to characterize replay attacks (detection, source identification, timing, and type). Tested on real-world datasets from Purdue’s nuclear reactor PUR-1.

Result: Framework achieved 95%+ accuracy in detecting replay attacks, identifying source/number of signals being replayed, and determining duration of falsification across datasets with up to six concurrently replayed signals.

Conclusion: The proposed XAI framework effectively addresses limitations of current approaches by providing unsupervised, explainable characterization of replay attacks using real nuclear reactor data, meeting regulatory requirements for nuclear cyber-physical systems.

Abstract: Next generation advanced nuclear reactors are expected to be smaller both in size and power output, relying extensively on fully digital instrumentation and control systems. These reactors will generate a large flow of information in the form of multivariate time series data, conveying simultaneously various non linear cyber physical, process, control, sensor, and operational states. Ensuring data integrity against deception attacks is becoming increasingly important for networked communication and a requirement for safe and reliable operation. Current efforts to address replay attacks, almost universally focus on watermarking or supervised anomaly detection approaches without further identifying and characterizing the root cause of the anomaly. In addition, these approaches rely mostly on synthetic data with uncorrelated Gaussian process and measurement noise and full state feedback or are limited to univariate signals, signal stationarity, linear quadratic regulators, or other linear-time invariant state-space which may fail to capture any unmodeled system dynamics. In the realm of regulated nuclear cyber-physical systems, additional work is needed on characterization of replay attacks and explainability of predictions using real data. Here, we propose an unsupervised explainable AI framework based on a combination of autoencoder and customized windowSHAP algorithm to fully characterize real-time replay attacks, i.e., detection, source identification, timing and type, of increasing complexity during a dynamic time evolving reactor process. The proposed XAI framework was benchmarked on several real world datasets from Purdue’s nuclear reactor PUR-1 with up to six signals concurrently being replayed. In all cases, the XAI framework was able to detect and identify the source and number of signals being replayed and the duration of the falsification with 95 percent or better accuracy.

[666] Prototype-Guided Diffusion: Visual Conditioning without External Memory

Bilal Faye, Hanane Azzag, Mustapha Lebbah

Main category: cs.LG

TL;DR: PDM embeds prototype learning into diffusion models to provide adaptive, memory-free conditioning, reducing computational costs while maintaining generation quality.

Details

Motivation: Current diffusion models are computationally expensive due to iterative denoising. Latent-space models lose fine detail, while retrieval-augmented methods require large memory banks, static similarity models, and rigid infrastructures.

Method: PDM learns compact visual prototypes from clean features via contrastive learning, then aligns noisy representations with semantically relevant patterns during denoising, eliminating the need for retrieval infrastructure.

Result: PDM sustains high generation quality while lowering computational and storage costs compared to retrieval-based conditioning methods.

Conclusion: PDM offers a scalable alternative to retrieval-based conditioning by embedding prototype learning directly into the diffusion process for adaptive, memory-free conditioning.

Abstract: Diffusion models achieve state-of-the-art image generation but remain computationally costly due to iterative denoising. Latent-space models like Stable Diffusion reduce overhead yet lose fine detail, while retrieval-augmented methods improve efficiency but rely on large memory banks, static similarity models, and rigid infrastructures. We introduce the Prototype Diffusion Model (PDM), which embeds prototype learning into the diffusion process to provide adaptive, memory-free conditioning. Instead of retrieving references, PDM learns compact visual prototypes from clean features via contrastive learning, then aligns noisy representations with semantically relevant patterns during denoising. Experiments demonstrate that PDM sustains high generation quality while lowering computational and storage costs, offering a scalable alternative to retrieval-based conditioning.

[667] Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

Zelin Tan, Hejia Geng, Xiaohang Yu, Mulei Zhang, Guancheng Wan, Yifan Zhou, Qiang He, Xiangyuan Xue, Heng Zhou, Yutao Fan, Zhongzhi Li, Zaibin Zhang, Guibin Zhang, Chen Zhang, Zhenfei Yin, Philip Torr, Lei Bai

Main category: cs.LG

TL;DR: The paper investigates scaling laws for LLMs during RL post-training for mathematical reasoning, finding that larger models learn more efficiently, scaling follows predictable power-laws, learning efficiency saturates at very large scales, and data reuse is effective in data-constrained regimes.

Details

Motivation: While scaling laws for LLM pre-training are well-studied, scaling behaviors during RL-based post-training remain largely unexplored, especially for mathematical reasoning tasks.

Method: Systematic empirical investigation using the full Qwen2.5 dense model series (0.5B to 72B parameters), analyzing how model scale, data volume, and computational budget interact during RL post-training.

Result: Four key findings: 1) Larger models show superior learning efficiency; 2) Test loss follows predictable power-law scaling; 3) Learning efficiency saturates at very large model sizes; 4) Data reuse is effective in data-constrained regimes.

Conclusion: The results provide principled foundations and practical guidelines for efficiently scaling LLM reasoning capabilities through RL post-training, with implications for resource allocation and training strategies.

Abstract: While scaling laws for large language models (LLMs) during pre-training have been extensively studied, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper presents a systematic empirical investigation of scaling behaviors in RL-based post-training, with a particular focus on mathematical reasoning. Based on a set of experiments across the full Qwen2.5 dense model series (0.5B to 72B), we characterize how model scale, data volume, and computational budget interact to shape performance. Our analysis leads to four key findings: 1. Larger models consistently exhibit superior learning efficiency on both compute and data metrics. 2. The relationship between test loss, compute, and data can be modeled by a predictive power-law which is robust across both base and instruction-tuned models. 3. Although larger models exhibit higher learning efficiency, the analytical learning efficiency term k(N) in the power-law reveals a latent saturation trend in learning efficiency as model size continues to increase. 4. In data-constrained regimes, repeated reuse of high-quality data proves highly effective, as final performance is primarily governed by the total number of optimization steps rather than the uniqueness of samples. Collectively, these results provide a principled foundation and practical guidelines for efficiently scaling the reasoning capabilities of LLMs through RL post-training.

[668] Data-driven particle dynamics: Structure-preserving coarse-graining for emergent behavior in non-equilibrium systems

Quercus Hernandez, Max Win, Thomas C. O’Connor, Paulo E. Arratia, Nathaniel Trask

Main category: cs.LG

TL;DR: A framework for machine learning coarse-grained dynamics from particle trajectories using metriplectic brackets that preserves thermodynamic laws, momentum conservation, and fluctuation-dissipation balance.

Details

Motivation: Multiscale systems are challenging to simulate because short spatiotemporal scales must be linked to emergent bulk physics. When coarse-graining high-dimensional systems into low-dimensional models, information loss leads to dissipative, history-dependent, and stochastic emergent physics that needs to be properly captured.

Method: Proposes a framework using metriplectic bracket formalism that preserves thermodynamic properties by construction. Introduces a novel self-supervised learning strategy to identify emergent structural variables when labels are unavailable. Specializes to particle discretization and provides implementations in PyTorch and LAMMPS.

Result: Validated on benchmark systems and demonstrated on two challenging examples: (1) coarse-graining star polymers at challenging levels while preserving non-equilibrium statistics, and (2) learning models from high-speed video of colloidal suspensions capturing coupling between local rearrangement events and emergent stochastic dynamics.

Conclusion: The framework enables machine learning of coarse-grained dynamics that preserves essential physical properties (thermodynamic laws, momentum conservation, fluctuation-dissipation balance) and is extensible to diverse particle-based systems through open-source implementations.

Abstract: Multiscale systems are ubiquitous in science and technology, but are notoriously challenging to simulate as short spatiotemporal scales must be appropriately linked to emergent bulk physics. When expensive high-dimensional dynamical systems are coarse-grained into low-dimensional models, the entropic loss of information leads to emergent physics which are dissipative, history-dependent, and stochastic. To machine learn coarse-grained dynamics from time-series observations of particle trajectories, we propose a framework using the metriplectic bracket formalism that preserves these properties by construction; most notably, the framework guarantees discrete notions of the first and second laws of thermodynamics, conservation of momentum, and a discrete fluctuation-dissipation balance crucial for capturing non-equilibrium statistics. We introduce the mathematical framework abstractly before specializing to a particle discretization. As labels are generally unavailable for entropic state variables, we introduce a novel self-supervised learning strategy to identify emergent structural variables. We validate the method on benchmark systems and demonstrate its utility on two challenging examples: (1) coarse-graining star polymers at challenging levels of coarse-graining while preserving non-equilibrium statistics, and (2) learning models from high-speed video of colloidal suspensions that capture coupling between local rearrangement events and emergent stochastic dynamics. We provide open-source implementations in both PyTorch and LAMMPS, enabling large-scale inference and extensibility to diverse particle-based systems.

[669] Multimodal Neural Operators for Real-Time Biomechanical Modelling of Traumatic Brain Injury

Anusha Agarwal, Dibakar Roy Sarkar, Somdatta Goswami

Main category: cs.LG

TL;DR: Multimodal neural operators fuse heterogeneous data (imaging, demographics, geometry) to predict brain displacement for real-time TBI risk assessment, achieving high accuracy and 7x speedup.

Details

Motivation: TBI affects 69M people annually, but existing neural operators can't handle the heterogeneous multimodal data needed for patient-specific biomechanical modeling. There's a need for frameworks that can fuse anatomical imaging, scalar demographics, and geometric constraints for rapid TBI assessment.

Method: Reformulated TBI modeling as multimodal operator learning problem. Proposed two fusion strategies: field projection for FNO architectures and branch decomposition for DeepONet. Extended four architectures (FNO, Factorized FNO, Multi-Grid FNO, DeepONet) with fusion mechanisms and evaluated on 249 in vivo MRE datasets (20-90 Hz).

Result: Multi-Grid FNO achieved highest accuracy (MSE = 0.0023, 94.3% spatial fidelity). DeepONet offered fastest inference (14.5 iterations/s, 7x speedup). All architectures reduced computation from hours to milliseconds.

Conclusion: Multimodal neural operators enable efficient, real-time, patient-specific TBI risk assessment. The framework establishes a generalizable paradigm for heterogeneous data fusion in scientific domains, including precision medicine.

Abstract: Background: Traumatic brain injury (TBI) is a major global health concern with 69 million annual cases. While neural operators have revolutionized scientific computing, existing architectures cannot handle the heterogeneous multimodal data (anatomical imaging, scalar demographics, and geometric constraints) required for patient-specific biomechanical modeling. Objective: This study introduces the first multimodal neural operator framework for biomechanics, fusing heterogeneous inputs to predict brain displacement fields for rapid TBI risk assessment. Methods: TBI modeling was reformulated as a multimodal operator learning problem. We proposed two fusion strategies: field projection for Fourier Neural Operator (FNO) architectures and branch decomposition for Deep Operator Networks (DeepONet). Four architectures (FNO, Factorized FNO, Multi-Grid FNO, and DeepONet) were extended with fusion mechanisms and evaluated on 249 in vivo Magnetic Resonance Elastography (MRE) datasets (20-90 Hz). Results: Multi-Grid FNO achieved the highest accuracy (MSE = 0.0023, 94.3% spatial fidelity). DeepONet offered the fastest inference (14.5 iterations/s, 7x speedup), suitable for edge deployment. All architectures reduced computation from hours to milliseconds. Conclusion: Multimodal neural operators enable efficient, real-time, patient-specific TBI risk assessment. This framework establishes a generalizable paradigm for heterogeneous data fusion in scientific domains, including precision medicine.

[670] PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design

Andy Xu, Rohan Desai, Larry Wang, Gabriel Hope, Ethan Ritz

Main category: cs.LG

TL;DR: PLaID++ is an LLM post-trained for stable, property-guided crystal generation that uses symmetry-informed text representation and temperature scaling to generate diverse, thermodynamically stable materials structures.

Details

Motivation: In scientific problems like materials generation, the goal isn't just correct answers but diverse candidates satisfying constraints. RLVR improves LLM correctness but needs adaptation for materials design where diversity and constraint satisfaction are key.

Method: 1) Introduces compact, symmetry-informed Wyckoff text representation for efficiency and physical priors. 2) Uses temperature scaling as entropy regularizer to prevent mode collapse and encourage exploration. 3) Post-trains LLM for stable, property-guided crystal generation by encoding symmetry constraints directly into text.

Result: PLaID++ generates structures that are thermodynamically stable, unique, and novel at ~50% greater rate than prior methods. Successfully conditionally generates structures with desired space group properties.

Conclusion: Demonstrates potential of adapting NLP post-training techniques to materials design, enabling targeted, efficient discovery of novel materials through constraint-aware, diverse generation.

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising approach to improve correctness in LLMs, however, in many scientific problems, the objective is not necessarily to produce the correct answer, but instead to produce a diverse array of candidates which satisfy a set of constraints. We study this challenge in the context of materials generation. To this end, we introduce PLaID++, an LLM post-trained for stable and property-guided crystal generation. We find that performance hinges on our crystallographic representation and reward formulation. First, we introduce a compact, symmetry-informed Wyckoff text representation which improves computational efficiency and encourages generalization from physical priors. Second, we demonstrate that temperature scaling acts as an entropy regularizer which counteracts mode collapse and encourages exploration. By encoding symmetry constraints directly into text and guiding model outputs towards desirable chemical space, PLaID++ generates structures that are thermodynamically stable, unique, and novel at a $\sim$50% greater rate than prior methods and conditionally generates structures with desired space group properties. Our work demonstrates the potential of adapting post-training techniques from natural language processing to materials design, paving the way for targeted and efficient discovery of novel materials.

[671] Using LLMs for Late Multimodal Sensor Fusion for Activity Recognition

Ilker Demirel, Karan Thakkar, Benjamin Elizalde, Miquel Espi Marques, Aditya Sarathy, Yang Bai, Umamahesh Srinivas, Jiajie Xu, Shirley Ren, Jaya Narain

Main category: cs.LG

TL;DR: LLMs enable zero-shot multimodal fusion for activity recognition from audio and motion data without task-specific training.

Details

Motivation: Integrating complementary sensor data (audio and motion) for activity recognition is challenging, especially when there's limited aligned training data for learning shared multimodal embeddings.

Method: Used LLMs for late fusion of modality-specific models on curated Ego4D dataset subset for diverse activity recognition. Applied zero-shot and one-shot classification without task-specific training.

Result: LLMs achieved 12-class classification F1-scores significantly above chance in both zero-shot and one-shot settings, demonstrating effective multimodal fusion capabilities.

Conclusion: LLM-based fusion enables multimodal temporal applications with limited training data and reduces memory/computation requirements compared to application-specific multimodal models.

Abstract: Sensor data streams provide valuable information around activities and context for downstream applications, though integrating complementary information can be challenging. We show that large language models (LLMs) can be used for late fusion for activity classification from audio and motion time series data. We curated a subset of data for diverse activity recognition across contexts (e.g., household activities, sports) from the Ego4D dataset. Evaluated LLMs achieved 12-class zero- and one-shot classification F1-scores significantly above chance, with no task-specific training. Zero-shot classification via LLM-based fusion from modality-specific models can enable multimodal temporal applications where there is limited aligned training data for learning a shared embedding space. Additionally, LLM-based fusion can enable model deploying without requiring additional memory and computation for targeted application-specific multimodal models.

[672] DiffEM: Learning from Corrupted Data with Diffusion Models via Expectation Maximization

Danial Hosseintabar, Fan Chen, Giannis Daras, Antonio Torralba, Constantinos Daskalakis

Main category: cs.LG

TL;DR: DiffEM: Training diffusion models with Expectation-Maximization from corrupted data using conditional diffusion models for reconstruction and refinement.

Details

Motivation: Diffusion models are powerful generative priors for high-dimensional inverse problems, but learning them from only corrupted or noisy observations remains challenging.

Method: DiffEM uses Expectation-Maximization with conditional diffusion models: E-step reconstructs clean data from observations, M-step refines the conditional diffusion model using reconstructed data.

Result: Theoretical monotonic convergence guarantees under appropriate statistical conditions, with demonstrated effectiveness on various image reconstruction tasks.

Conclusion: DiffEM provides a principled approach to train diffusion models from corrupted data with theoretical guarantees and practical effectiveness.

Abstract: Diffusion models have emerged as powerful generative priors for high-dimensional inverse problems, yet learning them when only corrupted or noisy observations are available remains challenging. In this work, we propose a new method for training diffusion models with Expectation-Maximization (EM) from corrupted data. Our proposed method, DiffEM, utilizes conditional diffusion models to reconstruct clean data from observations in the E-step, and then uses the reconstructed data to refine the conditional diffusion model in the M-step. Theoretically, we provide monotonic convergence guarantees for the DiffEM iteration, assuming appropriate statistical conditions. We demonstrate the effectiveness of our approach through experiments on various image reconstruction tasks.

[673] Self-Supervised Learning of Graph Representations for Network Intrusion Detection

Lorenzo Guerra, Thomas Chapuis, Guillaume Duc, Pavlo Mozharovskyi, Van-Tam Nguyen

Main category: cs.LG

TL;DR: GraphIDS: Self-supervised graph neural network for network intrusion detection using masked autoencoder to learn normal communication patterns and detect anomalies via reconstruction errors.

Details

Motivation: Existing graph-based intrusion detection methods decouple representation learning from anomaly detection, limiting embedding utility for attack identification. Need unified approach that optimizes embeddings directly for intrusion detection task.

Method: Self-supervised model with inductive graph neural network to embed flows with local topological context, combined with Transformer-based encoder-decoder for reconstruction. Uses masked autoencoder to learn normal patterns, flags flows with high reconstruction errors as intrusions.

Result: Achieves up to 99.98% PR-AUC and 99.61% macro F1-score on diverse NetFlow benchmarks, outperforming baselines by 5-25 percentage points.

Conclusion: GraphIDS provides end-to-end framework that unifies representation learning and anomaly detection, enabling direct optimization of embeddings for intrusion detection and effectively identifying malicious traffic in evolving network environments.

Abstract: Detecting intrusions in network traffic is a challenging task, particularly under limited supervision and constantly evolving attack patterns. While recent works have leveraged graph neural networks for network intrusion detection, they often decouple representation learning from anomaly detection, limiting the utility of the embeddings for identifying attacks. We propose GraphIDS, a self-supervised intrusion detection model that unifies these two stages by learning local graph representations of normal communication patterns through a masked autoencoder. An inductive graph neural network embeds each flow with its local topological context to capture typical network behavior, while a Transformer-based encoder-decoder reconstructs these embeddings, implicitly learning global co-occurrence patterns via self-attention without requiring explicit positional information. During inference, flows with unusually high reconstruction errors are flagged as potential intrusions. This end-to-end framework ensures that embeddings are directly optimized for the downstream task, facilitating the recognition of malicious traffic. On diverse NetFlow benchmarks, GraphIDS achieves up to 99.98% PR-AUC and 99.61% macro F1-score, outperforming baselines by 5-25 percentage points.

[674] GenUQ: Predictive Uncertainty Estimates via Generative Hyper-Networks

Tian Yu Yen, Reese E. Jones, Ravi G. Patel

Main category: cs.LG

TL;DR: GenUQ introduces a measure-theoretic approach to uncertainty quantification in operator learning using generative hyper-networks, avoiding likelihood construction and outperforming existing UQ methods on three example problems.

Details

Motivation: Operator learning enables fast surrogate modeling for PDEs but existing UQ methods rely on likelihood-based approaches that are difficult or impossible to construct for stochastic operators. There's a need for UQ methods that don't require likelihood construction.

Method: GenUQ uses a measure-theoretic approach with generative hyper-network models that produce parameter distributions consistent with observed data, avoiding the need to construct likelihood functions.

Result: GenUQ outperforms other UQ methods in three example problems: recovering a manufactured operator, learning the solution operator to a stochastic elliptic PDE, and modeling failure location of porous steel under tension.

Conclusion: The proposed GenUQ framework provides an effective measure-theoretic alternative to likelihood-based UQ methods for operator learning, demonstrating superior performance across multiple applications.

Abstract: Operator learning is a recently developed generalization of regression to mappings between functions. It promises to drastically reduce expensive numerical integration of PDEs to fast evaluations of mappings between functional states of a system, i.e., surrogate and reduced-order modeling. Operator learning has already found applications in several areas such as modeling sea ice, combustion, and atmospheric physics. Recent approaches towards integrating uncertainty quantification into the operator models have relied on likelihood based methods to infer parameter distributions from noisy data. However, stochastic operators may yield actions from which a likelihood is difficult or impossible to construct. In this paper, we introduce, GenUQ, a measure-theoretic approach to UQ that avoids constructing a likelihood by introducing a generative hyper-network model that produces parameter distributions consistent with observed data. We demonstrate that GenUQ outperforms other UQ methods in three example problems, recovering a manufactured operator, learning the solution operator to a stochastic elliptic PDE, and modeling the failure location of porous steel under tension.

[675] Brain-language fusion enables interactive neural readout and in-silico experimentation

Victoria Bosch, Daniel Anthes, Adrien Doerig, Sushrut Thorat, Peter König, Tim Christian Kietzmann

Main category: cs.LG

TL;DR: CorText is a framework that integrates neural activity into LLM latent space, enabling natural language interaction with brain data for image captioning and Q&A from fMRI data.

Details

Motivation: Current neural decoding methods are static and non-interactive, lacking the ability for open-ended, natural language interaction with brain data. There's a need to move beyond passive decoding toward generative, flexible brain-language interfaces.

Method: CorText integrates fMRI data (recorded during natural scene viewing) directly into the latent space of a large language model, enabling the LLM to process neural activity. The framework uses in-silico microstimulation experiments for counterfactual prompts on brain activity.

Result: CorText generates accurate image captions and answers detailed questions better than controls using only neural data. It achieves zero-shot generalization beyond semantic categories seen during training, and reveals a consistent, graded mapping between brain states and language output.

Conclusion: CorText represents a shift from passive neural decoding toward generative, flexible interfaces between brain activity and language, enabling open-ended natural language interaction with brain data.

Abstract: Large language models (LLMs) have revolutionized human-machine interaction, and have been extended by embedding diverse modalities such as images into a shared language space. Yet, neural decoding has remained constrained by static, non-interactive methods. We introduce CorText, a framework that integrates neural activity directly into the latent space of an LLM, enabling open-ended, natural language interaction with brain data. Trained on fMRI data recorded during viewing of natural scenes, CorText generates accurate image captions and can answer more detailed questions better than controls, while having access to neural data only. We showcase that CorText achieves zero-shot generalization beyond semantic categories seen during training. In-silico microstimulation experiments, which enable counterfactual prompts on brain activity, reveal a consistent, and graded mapping between brain-state and language output. These advances mark a shift from passive decoding toward generative, flexible interfaces between brain activity and language.

[676] Causal Graph Neural Networks for Healthcare

Munib Mesinovic, Max Buhlan, Tingting Zhu

Main category: cs.LG

TL;DR: Causal graph neural networks address healthcare AI failures by learning invariant causal mechanisms rather than spurious correlations, enabling clinical applications and patient-specific Causal Digital Twins, though challenges remain in validation and deployment.

Details

Motivation: Healthcare AI systems fail when deployed across institutions due to learning statistical associations rather than causal mechanisms, leading to performance drops and perpetuation of discrimination from historical data biases.

Method: Combines graph-based representations of biomedical data with causal inference principles, including structural causal models, disentangled causal representation learning, interventional prediction, and counterfactual reasoning on graphs.

Result: Demonstrates clinical value across psychiatric diagnosis via brain network analysis, cancer subtyping through multi-omics causal integration, continuous physiological monitoring with mechanistic interpretation, and drug recommendation correcting prescription bias.

Conclusion: Establishes foundations for patient-specific Causal Digital Twins enabling in silico clinical experimentation, but faces barriers including computational requirements, validation challenges, and risks of “causal-washing.” Proposes tiered frameworks to distinguish causally-inspired architectures from causally-validated discoveries.

Abstract: Healthcare artificial intelligence systems routinely fail when deployed across institutions, with documented performance drops and perpetuation of discriminatory patterns embedded in historical data. This brittleness stems, in part, from learning statistical associations rather than causal mechanisms. Causal graph neural networks address this triple crisis of distribution shift, discrimination, and inscrutability by combining graph-based representations of biomedical data with causal inference principles to learn invariant mechanisms rather than spurious correlations. This Review examines methodological foundations spanning structural causal models, disentangled causal representation learning, and techniques for interventional prediction and counterfactual reasoning on graphs. We analyse applications demonstrating clinical value across psychiatric diagnosis through brain network analysis, cancer subtyping via multi-omics causal integration, continuous physiological monitoring with mechanistic interpretation, and drug recommendation correcting prescription bias. These advances establish foundations for patient-specific Causal Digital Twins, enabling in silico clinical experimentation, with integration of large language models for hypothesis generation and causal graph neural networks for mechanistic validation. Substantial barriers remain, including computational requirements precluding real-time deployment, validation challenges demanding multi-modal evidence triangulation beyond cross-validation, and risks of causal-washing where methods employ causal terminology without rigorous evidentiary support. We propose tiered frameworks distinguishing causally-inspired architectures from causally-validated discoveries and identify critical research priorities making causal rather than purely associational claims.

[677] OAT-FM: Optimal Acceleration Transport for Improved Flow Matching

Angxiao Yue, Anqi Dong, Hongteng Xu

Main category: cs.LG

TL;DR: OAT-FM bridges Flow Matching with Optimal Acceleration Transport theory, developing an improved method that optimizes acceleration transport instead of constant velocity, enabling a two-phase fine-tuning paradigm that consistently improves generative model performance.

Details

Motivation: Existing Flow Matching methods are explained as solving Optimal Transport problems, but the authors identify a hidden straightening objective that's mathematically equivalent to minimizing physical action associated with acceleration. They aim to bridge FM with the recent Optimal Acceleration Transport theory to develop a more theoretically grounded and practically effective method.

Method: Develop OAT-FM by optimizing acceleration transport in the product space of sample and velocity, which corresponds to a necessary and sufficient condition of flow straightness. Design an efficient algorithm with low complexity. Propose a two-phase paradigm: first train with any FM method, then fine-tune with OAT-FM using the reliable velocity information from the pre-trained model.

Result: OAT-FM eliminates the risk of data distribution drift and avoids the need to generate large numbers of noise-data pairs during fine-tuning. The method consistently improves model performance across various generative tasks. Code is publicly available.

Conclusion: OAT-FM provides a theoretically grounded improvement to Flow Matching by connecting it with Optimal Acceleration Transport theory, offering both theoretical insights and practical benefits through a novel two-phase training paradigm that enhances existing generative models.

Abstract: As a powerful technique in generative modeling, Flow Matching (FM) aims to learn velocity fields from noise to data, which is often explained and implemented as solving Optimal Transport (OT) problems. In this study, we bridge FM and the recent theory of Optimal Acceleration Transport (OAT), developing an improved FM method called OAT-FM and exploring its benefits in both theory and practice. In particular, we demonstrate that the straightening objective hidden in existing OT-based FM methods is mathematically equivalent to minimizing the physical action associated with acceleration defined by OAT. Accordingly, instead of enforcing constant velocity, OAT-FM optimizes the acceleration transport in the product space of sample and velocity, whose objective corresponds to a necessary and sufficient condition of flow straightness. An efficient algorithm is designed to achieve OAT-FM with low complexity. OAT-FM motivates a new two-phase FM paradigm: Given a generative model trained by an arbitrary FM method, whose velocity information has been relatively reliable, we can fine-tune and improve it via OAT-FM. This paradigm eliminates the risk of data distribution drift and the need to generate a large number of noise data pairs, which consistently improves model performance in various generative tasks. Code is available at: https://github.com/AngxiaoYue/OAT-FM

[678] EEsizer: LLM-Based AI Agent for Sizing of Analog and Mixed Signal Circuit

Chang Liu, Danial Chitnis

Main category: cs.LG

TL;DR: EEsizer is an LLM-based AI agent that automates transistor sizing for analog/mixed-signal circuits by integrating language models with circuit simulators, achieving successful optimization of a 20-transistor op-amp across technology nodes with minimal human intervention.

Details

Motivation: Manual transistor sizing in analog/mixed-signal IC design is time-consuming and requires expert knowledge. While ML techniques in EDA show promise, they still face challenges like numerous iterations and lack of circuit design knowledge. LLMs have demonstrated potential knowledge in circuit design, suggesting they could automate the transistor sizing process.

Method: EEsizer integrates large language models with circuit simulators and custom data analysis functions to create a fully automated, closed-loop transistor sizing system. It uses prompt engineering and Chain-of-Thought reasoning to iteratively explore design directions, evaluate performance, and refine solutions without external knowledge.

Result: The system benchmarked 8 LLMs on six basic circuits, selected three high-performing models to optimize a 20-transistor CMOS operational amplifier targeting multiple performance metrics across 180nm to 90nm nodes. OpenAI o3 successfully achieved user-intended targets at 90nm across three test groups with maximum 20 iterations. Design robustness was assessed through variation analysis using Gaussian-distributed variations on transistor dimensions and threshold voltages.

Conclusion: EEsizer demonstrates that LLM-based AI agents can effectively automate transistor sizing for analog circuits, showing adaptability and robustness at advanced technology nodes while minimizing human intervention in the design process.

Abstract: The design of Analog and Mixed-Signal (AMS) integrated circuits (ICs) often involves significant manual effort, especially during the transistor sizing process. While Machine Learning techniques in Electronic Design Automation (EDA) have shown promise in reducing complexity and minimizing human intervention, they still face challenges such as numerous iterations and a lack of knowledge about AMS circuit design. Recently, Large Language Models (LLMs) have demonstrated significant potential across various fields, showing a certain level of knowledge in circuit design and indicating their potential to automate the transistor sizing process. In this work, we propose EEsizer, an LLM-based AI agent that integrates large language models with circuit simulators and custom data analysis functions, enabling fully automated, closed-loop transistor sizing without relying on external knowledge. By employing prompt engineering and Chain-of-Thought reasoning, the agent iteratively explores design directions, evaluates performance, and refines solutions with minimal human intervention. We first benchmarked 8 LLMs on six basic circuits and selected three high-performing models to optimize a 20-transistor CMOS operational amplifier, targeting multiple performance metrics, including rail-to-rail operation from 180 nm to 90 nm technology nodes. Notably, OpenAI o3 successfully achieved the user-intended target at 90 nm across three different test groups, with a maximum of 20 iterations, demonstrating adaptability and robustness at advanced nodes. To assess design robustness, we manually designed a bias circuit and performed a variation analysis using Gaussian-distributed variations on transistor dimensions and threshold voltages.

[679] Potent but Stealthy: Rethink Profile Pollution against Sequential Recommendation via Bi-level Constrained Reinforcement Paradigm

Jiajie Su, Zihan Nan, Yunshan Ma, Xiaobo Xia, Xiaohua Feng, Weiming Liu, Xiang Chen, Xiaolin Zheng, Chaochao Chen

Main category: cs.LG

TL;DR: CREAT: A constrained reinforcement learning attack method for sequential recommenders that subtly pollutes user profiles to induce targeted mispredictions while maintaining stealthiness.

Details

Motivation: Existing adversarial attacks on sequential recommenders require large-scale user access or fake profiles, lacking practicality. Profile Pollution Attacks (PPA) that contaminate partial user interactions are more practical but suffer from over-reliance on sequence horizon impact and cause detectable distribution shifts.

Method: Proposes CREAT: a constrained reinforcement driven attack with bi-level optimization and multi-reward reinforcement learning. Uses Pattern Balanced Rewarding Policy with pattern inversion rewards and distribution consistency rewards via unbalanced co-optimal transport. Employs Constrained Group Relative Reinforcement Learning with dynamic barrier constraints and group-shared experience replay for step-wise perturbations.

Result: Extensive experiments demonstrate the effectiveness of CREAT in achieving targeted pollution with minimal detectability.

Conclusion: CREAT successfully addresses limitations of previous PPA methods by enabling fine-grained perturbations while maintaining stealthiness through constrained reinforcement learning, making adversarial attacks on sequential recommenders more practical and effective.

Abstract: Sequential Recommenders, which exploit dynamic user intents through interaction sequences, is vulnerable to adversarial attacks. While existing attacks primarily rely on data poisoning, they require large-scale user access or fake profiles thus lacking practicality. In this paper, we focus on the Profile Pollution Attack that subtly contaminates partial user interactions to induce targeted mispredictions. Previous PPA methods suffer from two limitations, i.e., i) over-reliance on sequence horizon impact restricts fine-grained perturbations on item transitions, and ii) holistic modifications cause detectable distribution shifts. To address these challenges, we propose a constrained reinforcement driven attack CREAT that synergizes a bi-level optimization framework with multi-reward reinforcement learning to balance adversarial efficacy and stealthiness. We first develop a Pattern Balanced Rewarding Policy, which integrates pattern inversion rewards to invert critical patterns and distribution consistency rewards to minimize detectable shifts via unbalanced co-optimal transport. Then we employ a Constrained Group Relative Reinforcement Learning paradigm, enabling step-wise perturbations through dynamic barrier constraints and group-shared experience replay, achieving targeted pollution with minimal detectability. Extensive experiments demonstrate the effectiveness of CREAT.

[680] TAPAS: Datasets for Learning the Learning with Errors Problem

Eshika Saxena, Alberto Alfarano, François Charton, Emily Wenger, Kristin Lauter

Main category: cs.LG

TL;DR: TAPAS datasets provide accessible LWE data for AI research on post-quantum cryptography attacks, filling a critical data gap.

Details

Motivation: AI attacks on LWE show promise but lack accessible training data, which is difficult and expensive to create due to computational demands and domain expertise requirements.

Method: Created TAPAS (Toolkit for Analysis of Post-quantum cryptography using AI Systems) datasets covering multiple LWE settings, designed for off-the-shelf use by AI practitioners.

Result: Provides documented datasets with established attack performance baselines, enabling AI researchers to prototype new LWE attack approaches without creating their own data.

Conclusion: TAPAS accelerates AI research on LWE attacks by addressing the data accessibility problem and provides a foundation for future work in AI-powered post-quantum cryptography analysis.

Abstract: AI-powered attacks on Learning with Errors (LWE), an important hard math problem in post-quantum cryptography, rival or outperform “classical” attacks on LWE under certain parameter settings. Despite the promise of this approach, a dearth of accessible data limits AI practitioners’ ability to study and improve these attacks. Creating LWE data for AI model training is time- and compute-intensive and requires significant domain expertise. To fill this gap and accelerate AI research on LWE attacks, we propose the TAPAS datasets, a Toolkit for Analysis of Post-quantum cryptography using AI Systems. These datasets cover several LWE settings and can be used off-the-shelf by AI practitioners to prototype new approaches to cracking LWE. This work documents TAPAS dataset creation, establishes attack performance baselines, and lays out directions for future work.

[681] What-If Decision Support for Product Line Extension Using Conditional Deep Generative Models

Yinxing Li, Tsukasa Ishigaki

Main category: cs.LG

TL;DR: CTVAE framework for forward-looking what-if analysis of product line extensions using historical transaction data to generate synthetic consumer distributions for hypothetical products.

Details

Motivation: Product line extension decisions are uncertain because managers must anticipate consumer responses to hypothetical designs without direct market observations, requiring inference from historical data alone.

Method: Conditional Tabular Variational Autoencoder (CTVAE) learns conditional joint distribution of product attributes and consumer characteristics from large-scale tabular data, generating synthetic consumer attribute distributions for hypothetical products by conditioning on controllable design variables.

Result: CTVAE outperforms existing tabular generative models in capturing conditional consumer attribute distributions; simulation analyses show synthetic data supports knowledge-driven reasoning for assessing cannibalization risks and identifying target segments.

Conclusion: Conditional deep generative models like CTVAE are valuable core components of decision support systems for product line extension planning, enabling systematic exploration of design scenarios without costly market pretests.

Abstract: Product line extension is a strategically important managerial decision that requires anticipating how consumer segments and purchasing contexts may respond to hypothetical product designs that do not yet exist in the market. Such decisions are inherently uncertain because managers must infer future outcomes from historical purchase data without direct market observations. This study addresses this challenge by proposing a data-driven decision support framework that enables forward-looking what-if analysis based on historical transaction data. We introduce a Conditional Tabular Variational Autoencoder (CTVAE) that learns the conditional joint distribution of product attributes and consumer characteristics from large-scale tabular data. By conditioning the generative process on controllable design variables such as container type, volume, flavor, and calorie content, the proposed model generates synthetic consumer attribute distributions for hypothetical line-extended products. This enables systematic exploration of alternative design scenarios without costly market pretests. The framework is evaluated using home-scan panel data covering more than 20,000 consumers and 700 soft drink products. Empirical results show that the CTVAE outperforms existing tabular generative models in capturing conditional consumer attribute distributions. Simulation-based analyses further demonstrate that the generated synthetic data support knowledge-driven reasoning for assessing cannibalization risks and identifying potential target segments. These findings highlight the value of conditional deep generative models as core components of decision support systems for product line extension planning.

[682] GraphShaper: Geometry-aware Alignment for Improving Transfer Learning in Text-Attributed Graphs

Heng Zhang, Tianyi Zhang, Yuling Shi, Xiaodong Gu, Yaomin Shen, Haochen You, Zijian Zhang, Yilei Yuan, Jin Huang

Main category: cs.LG

TL;DR: GraphShaper is a geometry-aware framework that improves graph-text alignment by using multiple geometric spaces (Euclidean, hyperbolic, spherical) to better encode diverse graph structures, achieving significant accuracy gains.

Details

Motivation: Current graph foundation models using contrastive learning with LLMs suffer from significant performance degradation (over 20% accuracy loss) at structural boundaries where different topological patterns converge, because they assume all graph structures can be encoded in a single Euclidean space.

Method: GraphShaper uses expert networks specialized for different geometric spaces (Euclidean for general graphs, hyperbolic for tree/hierarchical structures, spherical for cyclic patterns), with dynamic fusion weights that adaptively integrate geometric properties based on local structural characteristics before aligning with text embeddings.

Result: GraphShaper achieves 9.47% accuracy improvement on citation networks and 7.63% on social networks in zero-shot settings, significantly outperforming previous methods that use uniform encoding spaces.

Conclusion: The paper demonstrates that respecting the intrinsic geometric diversity of graph structures through multi-geometric specialization is crucial for effective graph-text alignment, and GraphShaper provides a practical framework for achieving this.

Abstract: Graph foundation models represent a transformative paradigm for learning transferable representations across diverse graph domains. Recent methods leverage large language models to unify graph and text modalities into a shared representation space using contrastive learning. However, systematic evaluations reveal significant performance degradation at structural boundaries where distinct topological patterns converge, with accuracy losses exceeding 20 percentage points. This issue arises from a key limitation: current methods assume all graph structures can be encoded within a single Euclidean space. In reality, tree structures require hyperbolic geometry to preserve hierarchical branching, while cyclic patterns depend on spherical geometry for closure properties. At structural boundaries, nodes experience conflicting geometric constraints that uniform encoding spaces cannot resolve. This raises a crucial challenge: \textbf{Can alignment frameworks be designed to respect the intrinsic geometric diversity of graph structures?} We introduce \textbf{GraphShaper}, a geometry-aware framework that enhances graph encoding through multi-geometric specialization. Our approach employs expert networks tailored to different geometric spaces, dynamically computing fusion weights to adaptively integrate geometric properties based on local structural characteristics. This adaptive fusion preserves structural integrity before alignment with text embeddings. Extensive experiments demonstrate that GraphShaper achieves 9.47% accuracy improvements on citation networks and 7.63% on social networks in zero-shot settings.

[683] Expressive Temporal Specifications for Reward Monitoring

Omar Adalat, Francesco Belardinelli

Main category: cs.LG

TL;DR: Using quantitative Linear Temporal Logic (LTLf[F]) to create dense reward monitors that outperform Boolean monitors for RL training efficiency.

Details

Motivation: Sparse rewards in RL cause inefficient training, especially for long-horizon tasks. Current Boolean semantics don't provide nuanced feedback during training.

Method: Harness quantitative LTLf[F] to synthesize reward monitors that generate dense reward streams from runtime-observable state trajectories. Framework is algorithm-agnostic and only requires state labeling function.

Result: Quantitative monitors consistently subsume and often outperform Boolean monitors in maximizing task completion and reducing convergence time across different environments.

Conclusion: Quantitative LTLf[F] provides effective framework for dense reward specification that improves RL training efficiency and handles non-Markovian properties naturally.

Abstract: Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces (($\text{LTL}_f[\mathcal{F}]$)) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known issue of sparse rewards under long-horizon decision making, which arises under the Boolean semantics dominating the current literature. Our framework is algorithm-agnostic and only relies on a state labelling function, and naturally accommodates specifying non-Markovian properties. Empirical results show that our quantitative monitors consistently subsume and, depending on the environment, outperform Boolean monitors in maximizing a quantitative measure of task completion and in reducing convergence time.

[684] Sharpness-Controlled Group Relative Policy Optimization with Token-Level Probability Shaping

Tue Le, Nghi D. Q. Bui, Linh Ngo Van, Trung Le

Main category: cs.LG

TL;DR: TR-GRPO improves GRPO for RLVR by regulating token-level gradients to reduce sharpness and improve generalization.

Details

Motivation: GRPO can be dominated by a small subset of tokens with disproportionately large per-token gradients, which increases sharpness and harms generalization in RLVR settings.

Method: Propose Token-Regulated GRPO (TR-GRPO) which introduces a monotone probability shaping function to assign token weights based on the model’s own token probabilities, integrating these weights into standard GRPO to suppress sharp gradient directions while preserving learning signal on semantically critical tokens.

Result: Experiments on logic puzzles, mathematical reasoning, and tool-augmented question answering show consistent improvements over GRPO, with smoother gradient-norm trajectories.

Conclusion: TR-GRPO is a simple and effective generalization-oriented upgrade to GRPO for RLVR, addressing token-level gradient imbalance issues through probability-aware weighting.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. This paper revisits GRPO from a generalization perspective. Recent analysis shows that population performance can be controlled by a robust empirical objective that decomposes into the training loss plus a sharpness term measured by the gradient norm. We develop a token-level view of this sharpness term and show that GRPO can be dominated by a small subset of tokens with disproportionately large per-token gradients, which increases sharpness and can harm generalization. Motivated by this view, we propose Token-Regulated GRPO (TR-GRPO), which introduces a monotone probability shaping function to assign token weights based on the model’s own token probabilities, and integrates these weights into the standard GRPO. Our analysis yields a bound that isolates a probability dependent multiplicative factor in token-gradient magnitudes, explaining how probability-aware weighting suppresses sharp directions while preserving learning signal on semantically critical tokens. Experiments on logic puzzles, mathematical reasoning, and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting TR-GRPO as a simple and effective generalization-oriented upgrade to GRPO for RLVR.

[685] Memory-Efficient Training with In-Place FFT Implementation

Xinyu Ding, Bangtian Liu, Siyu Liao, Zhongfeng Wang

Main category: cs.LG

TL;DR: The paper proposes rdFFT, the first real-domain fully in-place FFT framework that eliminates memory allocation overhead by preserving input-output memory space consistency through implicit complex encoding.

Details

Motivation: Existing FFT implementations (standard FFT and real FFT) cannot achieve true in-place computation. Real FFT (rFFT) causes dimensional mismatch by mapping size n input to size n/2+1 complex output, requiring additional memory allocation and increasing memory costs in deep learning applications.

Method: The proposed rdFFT framework leverages butterfly operation symmetry and conjugate properties in the frequency domain to design an implicit complex encoding scheme. This eliminates intermediate cache usage entirely while preserving input-output memory space consistency for real-domain computations.

Result: Experiments on multiple natural language understanding tasks demonstrate the method’s effectiveness in reducing training memory cost, showing practical benefits for memory-constrained deep learning applications.

Conclusion: The rdFFT framework offers a promising direction for frequency-domain lightweight adaptation by enabling true in-place FFT computation, addressing the memory inefficiency limitations of existing FFT implementations in deep learning.

Abstract: Fast Fourier Transforms (FFT) are widely used to reduce memory and computational costs in deep learning. However, existing implementations, including standard FFT and real FFT (rFFT), cannot achieve true in-place computation. In particular, rFFT maps an input of size n to a complex output of size n/2+1, causing dimensional mismatch and requiring additional memory allocation. We propose the first real-domain, fully in-place FFT framework (rdFFT) that preserves input-output memory space consistency. By leveraging butterfly operation symmetry and conjugate properties in the frequency domain, we design an implicit complex encoding scheme that eliminates intermediate cache usage entirely. Experiments on multiple natural language understanding tasks demonstrate the method effectiveness in reducing training memory cost, offering a promising direction for frequency-domain lightweight adaptation.

[686] GateRA: Token-Aware Modulation for Parameter-Efficient Fine-Tuning

Jie Ou, Shuaihong Jiang, Yingjun Du, Cees G. M. Snoek

Main category: cs.LG

TL;DR: GateRA introduces token-aware modulation to dynamically adjust PEFT update strength, enabling selective adaptation based on input difficulty while maintaining efficiency.

Details

Motivation: Existing PEFT methods apply static, input-agnostic updates to all tokens, which can lead to overfitting on trivial content or under-adaptation on informative regions, especially in autoregressive settings with distinct prefill and decoding dynamics.

Method: GateRA incorporates adaptive gating into standard PEFT branches to enable token-level adaptation. It includes entropy-based regularization to encourage near-binary gating decisions and prevent diffuse update patterns, while maintaining differentiability.

Result: GateRA consistently outperforms or matches prior PEFT methods on multiple commonsense reasoning benchmarks. Visualizations show it suppresses updates for redundant prefill tokens while emphasizing adaptation during decoding.

Conclusion: GateRA provides a unified framework for dynamic, token-aware PEFT adaptation that preserves pre-trained knowledge for well-modeled inputs while focusing capacity on challenging cases, with theoretical grounding and practical effectiveness.

Abstract: Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, DoRA, and HiRA, enable lightweight adaptation of large pre-trained models via low-rank updates. However, existing PEFT approaches apply static, input-agnostic updates to all tokens, disregarding the varying importance and difficulty of different inputs. This uniform treatment can lead to overfitting on trivial content or under-adaptation on more informative regions, especially in autoregressive settings with distinct prefill and decoding dynamics. In this paper, we propose GateRA, a unified framework that introduces token-aware modulation to dynamically adjust the strength of PEFT updates. By incorporating adaptive gating into standard PEFT branches, GateRA enables selective, token-level adaptation, preserving pre-trained knowledge for well-modeled inputs while focusing capacity on challenging cases. Empirical visualizations reveal phase-sensitive behaviors, where GateRA automatically suppresses updates for redundant prefill tokens while emphasizing adaptation during decoding. To promote confident and efficient modulation, we further introduce an entropy-based regularization that encourages near-binary gating decisions. This regularization prevents diffuse update patterns and leads to interpretable, sparse adaptation without hard thresholding. Finally, we present a theoretical analysis showing that GateRA induces a soft gradient-masking effect over the PEFT path, enabling continuous and differentiable control over adaptation. Experiments on multiple commonsense reasoning benchmarks demonstrate that GateRA consistently outperforms or matches prior PEFT methods.

[687] Nowcast3D: Reliable precipitation nowcasting via gray-box learning

Huaguan Chen, Wei Han, Haofei Sun, Ning Lin, Xingtao Song, Yunfan Yang, Jie Tian, Yang Liu, Ji-Rong Wen, Xiaoye Zhang, Xueshun Shen, Hao Sun

Main category: cs.LG

TL;DR: Nowcast3D is a 3D gray-box nowcasting framework that combines physics-constrained neural operators with data-driven learning for extreme precipitation forecasting, outperforming existing methods in meteorologist evaluations.

Details

Motivation: Current precipitation nowcasting approaches have limitations: numerical weather prediction systems operate at coarse resolution, radar extrapolation struggles with complex evolution, data-driven models produce smoothed fields, and 2D methods discard crucial vertical information needed for accurate precipitation dynamics.

Method: Nowcast3D is a fully 3D gray-box framework that operates on volumetric radar reflectivity. It learns three fields governing reflectivity evolution: 3D flow field for advection, spatially varying diffusion field for dispersion, and residual source term for microphysics. These physics-constrained neural operators advance forecasts, while a conditional diffusion model generates ensembles to quantify uncertainty.

Result: In blind evaluation by 160 meteorologists, Nowcast3D was preferred in 57% of post-hoc assessments and 51% of prior assessments, demonstrating superior performance over existing nowcasting methods.

Conclusion: Nowcast3D provides a scalable and robust approach for extreme precipitation nowcasting by explicitly embedding three-dimensional dynamics and uncertainty quantification into a single framework, addressing key limitations of current methods.

Abstract: Extreme-precipitation nowcasting requires high spatial and temporal resolution together with extended lead times, yet current approaches remain constrained. Numerical weather prediction systems and their deep-learning emulators operate at relatively coarse space-time resolution and struggle to capture rapidly evolving convective systems. Radar extrapolation methods, which advect recent fields using estimated motion, have difficulty capturing the complex evolution of precipitation. Purely data-driven models often produce overly smoothed reflectivity fields and underestimate intensity. Hybrid 2D radar-based methods discard crucial vertical information, preventing accurate reconstruction of height-dependent dynamics. We introduce Nowcast3D, a gray-box, fully three-dimensional nowcasting framework that operates directly on volumetric radar reflectivity and couples physically constrained neural operators with data-driven learning. The model learns three fields that govern reflectivity evolution: a three-dimensional flow field for advective transport, a spatially varying diffusion field for local dispersive spreading, and a residual source term for unresolved microphysical effects. These learned operators advance the forecast in time under explicit physical constraints, while a conditional diffusion model, conditioned on both the observations and the physics-based forecast, generates ensembles of future radar volumes that quantify forecast uncertainty. In a blind evaluation by 160 meteorologists, Nowcast3D is preferred in 57% of post-hoc and 51% of prior assessments. By explicitly embedding three-dimensional dynamics and uncertainty into a single framework, Nowcast3D offers a scalable and robust approach for reliable nowcasting of extreme precipitation.

[688] HVAdam: A Full-Dimension Adaptive Optimizer

Yiheng Zhang, Shaowu Wu, Yuanzhuo Xu, Jiajun Wu, Shang Xu, Steve Drew, Xiaoguang Niu

Main category: cs.LG

TL;DR: Anon is a novel optimizer with tunable adaptivity that bridges SGD and Adam, using incremental delay update for convergence across adaptivity spectrum, outperforming SOTA on various tasks.

Details

Motivation: Adaptive optimizers like Adam generalize worse than non-adaptive methods like SGD on classical architectures. The performance gap stems from adaptivity in pre-conditioners limiting optimization landscape adaptation.

Method: Propose Anon (Adaptivity Non-restricted Optimizer) with continuously tunable adaptivity to interpolate between SGD-like and Adam-like behaviors. Introduce incremental delay update (IDU) mechanism for convergence across adaptivity spectrum, more flexible than AMSGrad’s max-tracking.

Result: Theoretically establish convergence guarantees under convex and non-convex settings. Empirically outperform state-of-the-art optimizers on image classification, diffusion, and language modeling tasks.

Conclusion: Adaptivity can serve as a valuable tunable design principle. Anon provides the first unified framework bridging classical and modern optimizers, surpassing their advantageous properties.

Abstract: Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods, such as SGD on classical architectures like CNNs. We identify a key cause of this performance gap: adaptivity in pre-conditioners, which limits the optimizer’s ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity , allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad’s hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees under both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers and surpassing their advantageous properties.

[689] DETECT: Data-Driven Evaluation of Treatments Enabled by Classification Transformers

Yuanheng Mao, Lillian Yang, Stephen Yang, Ethan Shao, Zihan Li

Main category: cs.LG

TL;DR: DETECT is a data-driven framework using smartphone sensor data to objectively assess chronic pain treatment success by comparing patient activities before and after treatment.

Details

Motivation: Chronic pain affects millions worldwide, but current assessment methods like numeric rating scales are subjective and self-reported. There's a need for objective, reliable methods to measure treatment impact on patients' daily functioning.

Method: DETECT (Data-Driven Evaluation of Treatments Enabled by Classification Transformers) uses classification transformers to analyze smartphone sensor data, comparing patient activities of daily life before and after treatment to assess treatment success.

Result: DETECT demonstrates objectivity and lightweight performance on public benchmark datasets and simulated patient data, offering a novel approach to clinical decision-making.

Conclusion: DETECT provides physicians with an objective tool to better understand treatment impacts, enabling more personalized and responsive patient care, either independently or alongside traditional self-reported metrics.

Abstract: Chronic pain is a global health challenge affecting millions of individuals, making it essential for physicians to have reliable and objective methods to measure the functional impact of clinical treatments. Traditionally used methods, like the numeric rating scale, while personalized and easy to use, are subjective due to their self-reported nature. Thus, this paper proposes DETECT (Data-Driven Evaluation of Treatments Enabled by Classification Transformers), a data-driven framework that assesses treatment success by comparing patient activities of daily life before and after treatment. We use DETECT on public benchmark datasets and simulated patient data from smartphone sensors. Our results demonstrate that DETECT is objective yet lightweight, making it a significant and novel contribution to clinical decision-making. By using DETECT, independently or together with other self-reported metrics, physicians can improve their understanding of their treatment impacts, ultimately leading to more personalized and responsive patient care.

[690] Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis

Yash Mittal, Dmitry Ignatov, Radu Timofte

Main category: cs.LG

TL;DR: FractalNet introduces fractal-inspired architectures for efficient large-scale neural network exploration, generating 1200+ variants through systematic permutations and achieving strong performance on CIFAR-10.

Details

Motivation: To address the challenge of exploring model diversity at scale efficiently, moving beyond manual architecture design to automated, systematic exploration of neural network variants.

Method: Uses template-driven generator, runner, and evaluation framework with systematic permutations of convolutional, normalization, activation, and dropout layers. Employs fractal templates for structural recursion and multi-column pathways, creating balanced deep and wide architectures. Training uses PyTorch, AMP, and gradient checkpointing on CIFAR-10 for 5 epochs.

Result: Successfully generates over 1,200 neural network variants. Fractal-based architectures demonstrate strong performance and computational efficiency, validating the approach as effective for automated architecture exploration.

Conclusion: Fractal design provides a feasible and resource-efficient method for automated neural architecture exploration, enabling systematic investigation of model diversity at scale while maintaining computational efficiency.

Abstract: It introduces FractalNet, a fractal-inspired computational architectures for advanced large language model analysis that mainly challenges model diversity on a large scale in an efficient manner. The new set-up involves a template-driven generator, runner, and evaluation framework that, through systematic permutations of convolutional, normalization, activation, and dropout layers, can create more than 1,200 variants of neural networks. Fractal templates allow for structural recursion and multi-column pathways, thus, models become deeper and wider in a balanced way. Training utilizes PyTorch, Automatic Mixed Precision (AMP), and gradient checkpointing and is carried out on the CIFAR-10 dataset for five epochs. The outcomes show that fractal-based architectures are capable of strong performance and are computationally efficient. The paper positions fractal design as a feasible and resource-efficient method of automated architecture exploration.

[691] When, How Long and How Much? Interpretable Neural Networks for Time Series Regression by Learning to Mask and Aggregate

Florent Forest, Amaury Wei, Olga Fink

Main category: cs.LG

TL;DR: MAGNETS is an inherently interpretable neural architecture for time series extrinsic regression that learns human-understandable concepts without annotations, using mask-based feature aggregation and transparent additive structure.

Details

Motivation: Current TSER models are black boxes with poor interpretability, while existing interpretable approaches have limitations: require concept supervision, can't capture feature interactions, lack expressiveness for complex temporal patterns, and don't scale to high-dimensional multivariate data.

Method: MAGNETS learns a compact set of human-understandable concepts without annotations. Each concept corresponds to a learned, mask-based aggregation over selected input features, revealing which features drive predictions and when they matter. Predictions use transparent additive combinations of these concepts.

Result: The proposed method provides an inherently interpretable architecture for TSER that explicitly reveals feature importance and temporal relevance without requiring concept annotations, with code and datasets publicly available.

Conclusion: MAGNETS addresses limitations of both black-box TSER models and existing interpretable approaches by providing transparent, concept-based interpretability without supervision, suitable for domains requiring both accurate predictions and trustworthy reasoning.

Abstract: Time series extrinsic regression (TSER) refers to the task of predicting a continuous target variable from an input time series. It appears in many domains, including healthcare, finance, environmental monitoring, and engineering. In these settings, accurate predictions and trustworthy reasoning are both essential. Although state-of-the-art TSER models achieve strong predictive performance, they typically operate as black boxes, making it difficult to understand which temporal patterns drive their decisions. Post-hoc interpretability techniques, such as feature attribution, aim to to explain how the model arrives at its predictions, but often produce coarse, noisy, or unstable explanations. Recently, inherently interpretable approaches based on concepts, additive decompositions, or symbolic regression, have emerged as promising alternatives. However, these approaches remain limited: they require explicit supervision on the concepts themselves, often cannot capture interactions between time-series features, lack expressiveness for complex temporal patterns, and struggle to scale to high-dimensional multivariate data. To address these limitations, we propose MAGNETS (Mask-and-AGgregate NEtwork for Time Series), an inherently interpretable neural architecture for TSER. MAGNETS learns a compact set of human-understandable concepts without requiring any annotations. Each concept corresponds to a learned, mask-based aggregation over selected input features, explicitly revealing both which features drive predictions and when they matter in the sequence. Predictions are formed as combinations of these learned concepts through a transparent, additive structure, enabling clear insight into the model’s decision process. The code implementation and datasets are publicly available at https://github.com/FlorentF9/MAGNETS.

[692] Nonparametric estimation of conditional probability distributions using a generative approach based on conditional push-forward neural networks

Nicola Rares Franco, Lorenzo Tedesco

Main category: cs.LG

TL;DR: CPFN is a lightweight generative framework for conditional distribution estimation that learns stochastic maps for efficient conditional sampling without requiring invertibility or adversarial training.

Details

Motivation: To create an efficient framework for conditional distribution estimation that enables direct conditional sampling and straightforward estimation of conditional statistics, overcoming limitations of existing methods that may require complex invertibility constraints or adversarial training.

Method: CPFN learns a stochastic map φ(x,u) such that φ(x,U) approximates the law of Y|X=x, where U is a predefined latent random vector. Training uses a Kullback-Leibler derived objective function without requiring invertibility or adversarial training.

Result: CPFN achieves performance competitive with or superior to state-of-the-art methods including kernel estimators, tree-based algorithms, and deep learning techniques, while remaining lightweight and easy to train.

Conclusion: CPFN provides an effective, lightweight framework for conditional distribution estimation that enables efficient conditional sampling and statistical estimation, with theoretical consistency guarantees and practical advantages over existing methods.

Abstract: We introduce conditional push-forward neural networks (CPFN), a generative framework for conditional distribution estimation. Instead of directly modeling the conditional density $f_{Y|X}$, CPFN learns a stochastic map $\varphi=\varphi(x,u)$ such that $\varphi(x,U)$ and $Y|X=x$ follow approximately the same law, with $U$ a suitable random vector of pre-defined latent variables. This enables efficient conditional sampling and straightforward estimation of conditional statistics through Monte Carlo methods. The model is trained via an objective function derived from a Kullback-Leibler formulation, without requiring invertibility or adversarial training. We establish a near-asymptotic consistency result and demonstrate experimentally that CPFN can achieve performance competitive with, or even superior to, state-of-the-art methods, including kernel estimators, tree-based algorithms, and popular deep learning techniques, all while remaining lightweight and easy to train.

[693] Semantic Superiority vs. Forensic Efficiency: A Comparative Analysis of Deep Learning and Psycholinguistics for Business Email Compromise Detection

Yaw Osei Adjei, Frederick Ayivor, Davis Opoku

Main category: cs.LG

TL;DR: This paper compares two BEC detection approaches: a psycholinguistic CatBoost model and a semantic DistilBERT model, showing both achieve high accuracy with different latency/cost trade-offs suitable for different deployment scenarios.

Details

Motivation: BEC causes massive financial losses ($2.9B annually) with extreme economic asymmetry where false negatives (fraud losses) cost 5,480x more than false positives (manual reviews), creating urgent need for effective detection systems.

Method: Two detection paradigms: 1) Forensic Psycholinguistic Stream using CatBoost to analyze linguistic cues (urgency, authority) for interpretability, and 2) Semantic Stream using DistilBERT for deep contextual understanding. Evaluated on hybrid dataset (N=7,990) with human-legitimate and AI-synthesized adversarial fraud.

Result: DistilBERT achieved near-perfect detection on synthetic threats (AUC >0.99, F1=0.998) with 7.4ms latency. CatBoost achieved competitive detection (AUC=0.991, F1=0.949) at 8.4x lower latency (0.8ms) with negligible resource consumption.

Conclusion: DistilBERT offers maximum accuracy for GPU-equipped organizations, while CatBoost provides cost-effective alternative for edge deployments. Both approaches demonstrate theoretical ROI exceeding 99.9% when optimized via cost-sensitive learning.

Abstract: Business Email Compromise (BEC) is a sophisticated social engineering threat that manipulates organizational hierarchies, leading to significant financial damage. According to the 2024 FBI Internet Crime Report, BEC accounts for over $2.9 billion in annual losses, presenting a massive economic asymmetry: the financial cost of a False Negative (fraud loss) exceeds the operational cost of a False Positive (manual review) by a ratio of approximately 5,480:1. This paper contrasts two detection paradigms: a Forensic Psycholinguistic Stream (CatBoost), which analyzes linguistic cues like urgency and authority with high interpretability, and a Semantic Stream (DistilBERT), which utilizes deep learning for contextual understanding. We evaluated both streams on a hybrid dataset (N=7,990) containing human-legitimate and AI-synthesized adversarial fraud. Benchmarked on Tesla T4 infrastructure, DistilBERT achieved near-perfect detection on synthetic threats (AUC >0.99, F1 =0.998) with acceptable real-time latency (7.4 ms). CatBoost achieved competitive detection (AUC =0.991, F1 =0.949) at 8.4x lower latency (0.8 ms) with negligible resource consumption. We conclude that while DistilBERT offers maximum accuracy for GPU-equipped organizations, CatBoost provides a viable, cost-effective alternative for edge deployments. Both approaches demonstrate a theoretical ROI exceeding 99.9% when optimized via cost-sensitive learning.

[694] Spectral Concentration at the Edge of Stability: Information Geometry of Kernel Associative Memory

Akira Tamamori

Main category: cs.LG

TL;DR: High-capacity kernel Hopfield networks have a “Ridge of Optimization” with extreme stability, which corresponds to the Edge of Stability where Fisher Information Matrix becomes singular, revealing a geometric theory of self-organized criticality.

Details

Motivation: To understand the origin of the Ridge of Optimization in high-capacity kernel Hopfield networks, which was previously linked to Spectral Concentration but remained elusive in its fundamental nature.

Method: Analyze network dynamics on a statistical manifold, revealing that the Ridge corresponds to the Edge of Stability where Fisher Information Matrix becomes singular. Show that apparent Euclidean force antagonism manifests as Dual Equilibrium in Riemannian space.

Result: The Ridge of Optimization is identified as the Edge of Stability boundary where Fisher Information Matrix becomes singular. This provides a geometric interpretation that unifies learning dynamics and capacity through the Minimum Description Length principle.

Conclusion: The paper offers a geometric theory of self-organized criticality by showing that the Ridge of Optimization corresponds to the Edge of Stability in statistical manifolds, unifying learning dynamics and capacity through information-theoretic principles.

Abstract: High-capacity kernel Hopfield networks exhibit a \textit{Ridge of Optimization} characterized by extreme stability. While previously linked to \textit{Spectral Concentration}, its origin remains elusive. Here, we analyze the network dynamics on a statistical manifold, revealing that the Ridge corresponds to the Edge of Stability, a critical boundary where the Fisher Information Matrix becomes singular. We demonstrate that the apparent Euclidean force antagonism is a manifestation of \textit{Dual Equilibrium} in the Riemannian space. This unifies learning dynamics and capacity via the Minimum Description Length principle, offering a geometric theory of self-organized criticality.

[695] ASPEN: An Adaptive Spectral Physics-Enabled Network for Ginzburg-Landau Dynamics

Julian Evan Chrisnanto, Nurfauzi Fadillah, Yulison Herry Chrisnanto

Main category: cs.LG

TL;DR: ASPEN introduces adaptive spectral layers with learnable Fourier features to overcome PINNs’ spectral bias, successfully solving the challenging Ginzburg-Landau equation where standard PINNs fail.

Details

Motivation: Standard PINNs struggle with stiff, multi-scale, and nonlinear PDE systems due to spectral bias in MLP architectures that prevents adequate representation of high-frequency components.

Method: ASPEN integrates an adaptive spectral layer with learnable Fourier features into the network’s input stage, allowing dynamic tuning of spectral basis during training to efficiently learn required frequency content.

Result: ASPEN successfully solves the complex Ginzburg-Landau equation with exceptional accuracy (median physics residual: 5.10×10⁻³), while standard PINNs catastrophically fail with non-physical oscillations.

Conclusion: The adaptive spectral basis framework provides a robust, physically-consistent solver for complex dynamical systems where standard PINNs fail, opening new options for machine learning in challenging physical domains.

Abstract: Physics-Informed Neural Networks (PINNs) have emerged as a powerful, mesh-free paradigm for solving partial differential equations (PDEs). However, they notoriously struggle with stiff, multi-scale, and nonlinear systems due to the inherent spectral bias of standard multilayer perceptron (MLP) architectures, which prevents them from adequately representing high-frequency components. In this work, we introduce the Adaptive Spectral Physics-Enabled Network (ASPEN), a novel architecture designed to overcome this critical limitation. ASPEN integrates an adaptive spectral layer with learnable Fourier features directly into the network’s input stage. This mechanism allows the model to dynamically tune its own spectral basis during training, enabling it to efficiently learn and represent the precise frequency content required by the solution. We demonstrate the efficacy of ASPEN by applying it to the complex Ginzburg-Landau equation (CGLE), a canonical and challenging benchmark for nonlinear, stiff spatio-temporal dynamics. Our results show that a standard PINN architecture catastrophically fails on this problem, diverging into non-physical oscillations. In contrast, ASPEN successfully solves the CGLE with exceptional accuracy. The predicted solution is visually indistinguishable from the high-resolution ground truth, achieving a low median physics residual of 5.10 x 10^-3. Furthermore, we validate that ASPEN’s solution is not only pointwise accurate but also physically consistent, correctly capturing emergent physical properties, including the rapid free energy relaxation and the long-term stability of the domain wall front. This work demonstrates that by incorporating an adaptive spectral basis, our framework provides a robust and physically-consistent solver for complex dynamical systems where standard PINNs fail, opening new options for machine learning in challenging physical domains.

[696] Always Keep Your Promises: DynamicLRP, A Model-Agnostic Solution To Layer-Wise Relevance Propagation

Kevin Lee, Pablo Millan Arias

Main category: cs.LG

TL;DR: DynamicLRP is a model-agnostic Layer-wise Relevance Propagation framework that works at tensor operation level using a Promise System for deferred activation resolution, achieving architecture independence without model modifications.

Details

Motivation: Existing LRP implementations operate at module level, requiring architecture-specific propagation rules and model modifications, limiting generality and sustainability as architectures evolve.

Method: Decomposes attribution to individual operations within computation graphs and introduces a Promise System for deferred activation resolution, operating independently of backpropagation machinery without model modifications.

Result: Achieves 99.92% node coverage across 31,465 computation graph nodes from 15 diverse architectures including Mamba, Whisper, and DePlot, with faithfulness matching/exceeding specialized implementations on various models.

Conclusion: DynamicLRP establishes a sustainable, extensible foundation for LRP across evolving architectures with operation-level decomposition and Promise System, enabling true architecture agnosticity while maintaining theoretical guarantees.

Abstract: Layer-wise Relevance Propagation (LRP) provides principled attribution for neural networks through conservation properties and foundations in Deep Taylor Decomposition. However, existing implementations operate at the module level, requiring architecture-specific propagation rules and model modifications. These limit the generality of target model and sustainability of implementations as architectures evolve. We introduce DynamicLRP, a model-agnostic LRP framework operating at the tensor operation level. By decomposing attribution to individual operations within computation graphs and introducing a novel mechanism for deferred activation resolution, named the Promise System, our approach achieves true architecture agnosticity while maintaining LRP’s theoretical guarantees. This design operates independently of backpropagation machinery, requiring no model modification, enabling side-by-side execution with gradient backpropagation. Being based on computation graphs, this method is theoretically extensible to other deep learning libraries that support auto-differentiation. We demonstrate faithfulness matching or exceeding specialized implementations (1.77 vs 1.69 ABPC on VGG, equivalent performance on ViT, 93.70% and 95.06% top-1 attribution accuracy for explaining RoBERTa-large and Flan-T5-large answers on SQuADv2, respectively) while maintaining practical efficiency on models with 100M-1B parameters. We achieved 99.92% node coverage across 31,465 computation graph nodes from 15 diverse architectures, including state-space models (Mamba), audio transformers (Whisper), and multimodal systems (DePlot) without any model-specific code with rules for 47 fundamental operations implemented. Our operation-level decomposition and Promise System establish a sustainable, extensible foundation for LRP across evolving architectures. All code is available at https://github.com/keeinlev/dynamicLRP .

[697] SSAS: Cross-subject EEG-based Emotion Recognition through Source Selection with Adversarial Strategy

Yici Liu, Qi Wei Oung, Hoi Leong Lee

Main category: cs.LG

TL;DR: A novel cross-subject EEG emotion recognition method using source selection with adversarial strategy to address inter-individual variability and negative transfer issues.

Details

Motivation: Cross-subject EEG-based emotion recognition faces challenges from inter-individual variability and negative transfer phenomena, where existing methods often neglect these issues during model training.

Method: Two-module approach: 1) Source Selection Network (SS) that uses domain labels to reverse-engineer domain adaptation by disrupting class separability and magnifying inter-domain differences, forcing domain-invariant emotion representations; 2) Adversarial Strategies Network (AS) that uses pretrained domain discriminators to compute novel loss for balancing adversarial training.

Result: Achieves outstanding performance on two EEG-based emotion datasets (SEED and SEED-IV), with theoretical insights provided and code made publicly available.

Conclusion: The proposed SSAS method effectively addresses inter-individual variability and negative transfer in cross-subject EEG emotion recognition, demonstrating superior performance through adversarial source selection strategy.

Abstract: Electroencephalographic (EEG) signals have long been applied in the field of affective brain-computer interfaces (aBCIs). Cross-subject EEG-based emotion recognition has demonstrated significant potential in practical applications due to its suitability across diverse people. However, most studies on cross-subject EEG-based emotion recognition neglect the presence of inter-individual variability and negative transfer phenomena during model training. To address this issue, a cross-subject EEG-based emotion recognition through source selection with adversarial strategy is introduced in this paper. The proposed method comprises two modules: the source selection network (SS) and the adversarial strategies network (AS). The SS uses domain labels to reverse-engineer the training process of domain adaptation. Its key idea is to disrupt class separability and magnify inter-domain differences, thereby raising the classification difficulty and forcing the model to learn domain-invariant yet emotion-relevant representations. The AS gets the source domain selection results and the pretrained domain discriminators from SS. The pretrained domain discriminators compute a novel loss aimed at enhancing the performance of domain classification during adversarial training, ensuring the balance of adversarial strategies. This paper provides theoretical insights into the proposed method and achieves outstanding performance on two EEG-based emotion datasets, SEED and SEED-IV. The code can be found at https://github.com/liuyici/SSAS.

[698] Graph Contrastive Learning via Spectral Graph Alignment

Manh Nguyen

Main category: cs.LG

TL;DR: SpecMatch-CL is a novel contrastive learning loss that aligns graph-of-graphs by minimizing normalized Laplacian differences, improving graph representation learning across unsupervised, semi-supervised, and transfer learning tasks.

Details

Motivation: Existing contrastive learning methods for graphs (like InfoNCE) align pairwise embeddings but lack control over the global structure of the view-specific graph-of-graphs built from these embeddings, limiting their effectiveness.

Method: SpecMatch-CL introduces a loss function that aligns view-specific graph-of-graphs by minimizing the difference between their normalized Laplacians, providing theoretical guarantees on alignment quality.

Result: Achieves new state-of-the-art on eight TU benchmarks in unsupervised and semi-supervised learning at low label rates, and shows consistent gains in transfer learning on PPI-306K and ZINC 2M datasets.

Conclusion: Aligning normalized Laplacians of graph-of-graphs provides an effective mechanism for controlling global structure in contrastive learning, leading to superior performance across diverse graph learning tasks.

Abstract: Given augmented views of each input graph, contrastive learning methods (e.g., InfoNCE) optimize pairwise alignment of graph embeddings across views while providing no mechanism to control the global structure of the view specific graph-of-graphs built from these embeddings. We introduce SpecMatch-CL, a novel loss function that aligns the view specific graph-of-graphs by minimizing the difference between their normalized Laplacians. Theoretically, we show that under certain assumptions, the difference between normalized Laplacians provides an upper bound not only for the difference between the ideal Perfect Alignment contrastive loss and the current loss, but also for the Uniformly loss. Empirically, SpecMatch-CL establishes new state of the art on eight TU benchmarks under unsupervised learning and semi-supervised learning at low label rates, and yields consistent gains in transfer learning on PPI-306K and ZINC 2M datasets.

[699] SEA: Spectral Edge Attack on Graph Neural Networks

Yongyu Wang

Main category: cs.LG

TL;DR: Proposes a stealthy GNN attack method that subtly adjusts edge weights instead of modifying graph structure, using spectral analysis to target vulnerable edges.

Details

Motivation: Existing GNN attacks modify graph structure (adding/deleting edges) which is easily detectable. Need stealthier attacks that maintain effectiveness while being hard to detect.

Method: Uses spectral adversarial robustness evaluation to quantitatively analyze edge vulnerability, then subtly adjusts weights of most vulnerable edges without changing connectivity patterns.

Result: Method effectively attacks several classical GNN architectures while maintaining stealth by not altering graph structure.

Conclusion: Proposes a new stealthy attack paradigm for GNNs that focuses on weight manipulation rather than structural changes, demonstrating effective attacks while being difficult to detect.

Abstract: Graph neural networks (GNNs) have been widely applied in a variety of domains. However, the very ability of graphs to represent complex data structures is both the key strength of GNNs and a major source of their vulnerability. Recent studies have shown that attacking GNNs by maliciously perturbing the underlying graph can severely degrade their performance. For attack methods, the central challenge is to maintain attack effectiveness while remaining difficult to detect. Most existing attacks require modifying the graph structure, such as adding or deleting edges, which is relatively easy to notice. To address this problem, this paper proposes a new attack model that employs spectral adversarial robustness evaluation to quantitatively analyze the vulnerability of each edge in a graph. By precisely targeting the weakest links, our method can achieve effective attacks without changing the connectivity pattern of edges in the graph, for example by subtly adjusting the weights of a small subset of the most vulnerable edges. We apply the proposed method to attack several classical graph neural network architectures, and experimental results show that our attack is highly effective.

[700] Renormalizable Spectral-Shell Dynamics as the Origin of Neural Scaling Laws

Yizhou Zhang

Main category: cs.LG

TL;DR: The paper derives macroscopic structure of deep network training from gradient descent in function space, showing training error evolves via time-dependent operator M(t). Using spectral-shell coarse-graining, they obtain renormalizable dynamics that explains neural scaling laws and unifies lazy training with feature learning.

Details

Motivation: To understand the macroscopic structure governing deep network training despite highly nonlinear optimization dynamics, and to explain phenomena like neural scaling laws and double descent from first principles.

Method: Derive training dynamics directly from gradient descent in function space for MSE loss, use Kato perturbation theory to obtain modewise ODEs, introduce logarithmic spectral-shell coarse-graining, formalize renormalizable shell-dynamics assumption, and analyze power-law spectral transport.

Result: Obtain exact system of coupled ODEs in eigenbasis of M(t), show microscopic interactions cancel within shells, derive self-similar solution with moving resolution frontier and scaling exponents under power-law transport assumption.

Conclusion: The framework explains neural scaling laws and double descent, and unifies lazy (NTK-like) training and feature learning as two limits of the same spectral-shell dynamics, providing a macroscopic understanding of deep network training.

Abstract: Neural scaling laws and double-descent phenomena suggest that deep-network training obeys a simple macroscopic structure despite highly nonlinear optimization dynamics. We derive such structure directly from gradient descent in function space. For mean-squared error loss, the training error evolves as $\dot e_t=-M(t)e_t$ with $M(t)=J_{θ(t)}J_{θ(t)}^{!*}$, a time-dependent self-adjoint operator induced by the network Jacobian. Using Kato perturbation theory, we obtain an exact system of coupled modewise ODEs in the instantaneous eigenbasis of $M(t)$. To extract macroscopic behavior, we introduce a logarithmic spectral-shell coarse-graining and track quadratic error energy across shells. Microscopic interactions within each shell cancel identically at the energy level, so shell energies evolve only through dissipation and external inter-shell interactions. We formalize this via a \emph{renormalizable shell-dynamics} assumption, under which cumulative microscopic effects reduce to a controlled net flux across shell boundaries. Assuming an effective power-law spectral transport in a relevant resolution range, the shell dynamics admits a self-similar solution with a moving resolution frontier and explicit scaling exponents. This framework explains neural scaling laws and double descent, and unifies lazy (NTK-like) training and feature learning as two limits of the same spectral-shell dynamics.

[701] Neural CDEs as Correctors for Learned Time Series Models

Muhammad Bilal Shahid, Prajwal Koirla, Cody Fleming

Main category: cs.LG

TL;DR: Proposes a Predictor-Corrector framework where any learned time-series model (Predictor) is paired with a neural controlled differential equation (Corrector) that predicts forecast errors, improving multi-step forecasting performance.

Details

Motivation: Learned time-series models (both continuous- and discrete-time) generate multi-step forecasts that are prone to errors, whether using direct prediction or iterative feedback approaches. Existing methods lack effective error correction mechanisms.

Method: A Predictor-Corrector mechanism where: 1) Predictor is any learned time-series model, 2) Corrector is a neural controlled differential equation that predicts forecast errors, 3) Adding predicted errors to forecasts improves performance, 4) Works with irregularly sampled data and both continuous/discrete-time Predictors, 5) Includes two regularization strategies for better extrapolation and accelerated training.

Result: Experiments with diverse Predictors (neural ODEs, Conformer, DLinear) on synthetic, physics simulation, and real-world datasets show the Predictor-Corrector mechanism consistently improves performance compared to Predictor alone.

Conclusion: The proposed Predictor-Corrector framework effectively addresses forecast errors in time-series models, demonstrating consistent performance improvements across various model types and datasets through error prediction and correction.

Abstract: Learned time-series models, whether continuous- or discrete-time, are widely used to forecast the states of a dynamical system. Such models generate multi-step forecasts either directly, by predicting the full horizon at once, or iteratively, by feeding back their own predictions at each step. In both cases, the multi-step forecasts are prone to errors. To address this, we propose a Predictor-Corrector mechanism where the Predictor is any learned time-series model and the Corrector is a neural controlled differential equation. The Predictor forecasts, and the Corrector predicts the errors of the forecasts. Adding these errors to the forecasts improves forecast performance. The proposed Corrector works with irregularly sampled time series and continuous- and discrete-time Predictors. Additionally, we introduce two regularization strategies to improve the extrapolation performance of the Corrector with accelerated training. We evaluate our Corrector with diverse Predictors, e.g., neural ordinary differential equations, Contiformer, and DLinear, on synthetic, physics simulation, and real-world forecasting datasets. The experiments demonstrate that the Predictor-Corrector mechanism consistently improves the performance compared to Predictor alone.

[702] Softly Constrained Denoisers for Diffusion Models

Victor M. Yeom-Song, Severi Rissanen, Arno Solin, Samuel Kaski, Mingfei Sun

Main category: cs.LG

TL;DR: The paper proposes softly constrained denoisers for diffusion models that improve constraint compliance while maintaining flexibility to handle constraint misspecification, avoiding bias away from true data distribution.

Details

Motivation: Diffusion models struggle to produce constraint-compliant samples in scientific applications. Existing regularization and guidance methods bias the generative model away from the true data distribution, which is problematic when constraints are misspecified.

Method: Instead of changing the loss or sampling loop, the authors integrate a guidance-inspired adjustment into the denoiser itself, giving it a soft inductive bias towards constraint-compliant samples.

Result: Softly constrained denoisers exploit constraint knowledge to improve compliance over standard denoisers while maintaining enough flexibility to deviate from constraints when there is misspecification with observed data.

Conclusion: The proposed approach provides a better balance between constraint enforcement and fidelity to the true data distribution, especially important for scientific applications where constraint misspecification is common.

Abstract: Diffusion models struggle to produce samples that respect constraints, a common requirement in scientific applications. Recent approaches have introduced regularization terms in the loss or guidance methods during sampling to enforce such constraints, but they bias the generative model away from the true data distribution. This is a problem, especially when the constraint is misspecified, a common issue when formulating constraints on scientific data. In this paper, instead of changing the loss or the sampling loop, we integrate a guidance-inspired adjustment into the denoiser itself, giving it a soft inductive bias towards constraint-compliant samples. We show that these softly constrained denoisers exploit constraint knowledge to improve compliance over standard denoisers, and maintain enough flexibility to deviate from it when there is misspecification with observed data.

[703] Coarse-to-Fine Open-Set Graph Node Classification with Large Language Models

Xueqi Ma, Xingjun Ma, Sarah Monazam Erfani, Danilo Mandic, James Bailey

Main category: cs.LG

TL;DR: CFC is a coarse-to-fine open-set classification framework using LLMs for graph datasets that improves OOD detection by 10% and achieves up to 70% accuracy in OOD classification without true label information.

Details

Motivation: Existing OOD detection methods treat all OOD samples as a single class, but real-world applications like fraud detection and medical diagnosis need deeper insights into OOD samples including their probable labels. The paper addresses whether OOD detection can be extended to OOD classification without true label information.

Method: CFC framework has three components: 1) coarse classifier using LLM prompts for OOD detection and outlier label generation, 2) GNN-based fine classifier trained with identified OOD samples for enhanced OOD detection and ID classification, and 3) refined OOD classification through LLM prompts and post-processed OOD labels. It uses semantic OOD instances based on inherent meaning rather than synthetic/auxiliary samples.

Result: CFC improves OOD detection by 10% over state-of-the-art methods on graph and text domains, and achieves up to 70% accuracy in OOD classification on graph datasets.

Conclusion: The proposed CFC framework successfully extends OOD detection to OOD classification without true label information, using LLMs to provide semantic understanding of OOD samples and improving both interpretability and practical utility for real-world applications.

Abstract: Developing open-set classification methods capable of classifying in-distribution (ID) data while detecting out-of-distribution (OOD) samples is essential for deploying graph neural networks (GNNs) in open-world scenarios. Existing methods typically treat all OOD samples as a single class, despite real-world applications, especially high-stake settings such as fraud detection and medical diagnosis, demanding deeper insights into OOD samples, including their probable labels. This raises a critical question: can OOD detection be extended to OOD classification without true label information? To address this question, we propose a Coarse-to-Fine open-set Classification (CFC) framework that leverages large language models (LLMs) for graph datasets. CFC consists of three key components: a coarse classifier that uses LLM prompts for OOD detection and outlier label generation, a GNN-based fine classifier trained with OOD samples identified by the coarse classifier for enhanced OOD detection and ID classification, and refined OOD classification achieved through LLM prompts and post-processed OOD labels. Unlike methods that rely on synthetic or auxiliary OOD samples, CFC employs semantic OOD instances that are genuinely out-of-distribution based on their inherent meaning, improving interpretability and practical utility. Experimental results show that CFC improves OOD detection by ten percent over state-of-the-art methods on graph and text domains and achieves up to seventy percent accuracy in OOD classification on graph datasets.

[704] Pretrained Battery Transformer (PBT): A battery life prediction foundation model

Ruifeng Tan, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang

Main category: cs.LG

TL;DR: PBT is the first foundation model for battery cycle life prediction that uses domain-knowledge-encoded mixture-of-expert layers, achieving 19.8% better performance than existing models and state-of-the-art results across 15 diverse battery datasets.

Details

Motivation: Early battery cycle life prediction is crucial for accelerating battery research and deployment, but current machine learning methods are limited by data scarcity and heterogeneity from diverse aging conditions. While foundation models have shown broad generalization in other fields, none exist for battery life prediction.

Method: Developed Pretrained Battery Transformer (PBT), the first foundation model for battery life prediction, using domain-knowledge-encoded mixture-of-expert layers. Trained on the largest public battery life database, learning transferable representations from 13 lithium-ion battery datasets.

Result: PBT outperforms existing models by an average of 19.8% and achieves state-of-the-art performance across 15 diverse datasets covering various operating conditions, formation protocols, and chemistries through transfer learning.

Conclusion: This work establishes a foundation model pathway for battery lifetime prediction, paving the way toward universal battery lifetime prediction systems that can generalize across diverse battery types and conditions.

Abstract: Early prediction of battery cycle life is essential for accelerating battery research, manufacturing, and deployment. Although machine learning methods have shown encouraging results, progress is hindered by data scarcity and heterogeneity arising from diverse aging conditions. In other fields, foundation models (FMs) trained on diverse datasets have achieved broad generalization through transfer learning, but no FMs have been reported for battery cycle life prediction yet. Here we present the Pretrained Battery Transformer (PBT), the first FM for battery life prediction, developed through domain-knowledge-encoded mixture-of-expert layers. Validated on the largest public battery life database, PBT learns transferable representations from 13 lithium-ion battery (LIB) datasets, outperforming existing models by an average of 19.8%. With transfer learning, PBT achieves state-of-the-art performance across 15 diverse datasets encompassing various operating conditions, formation protocols, and chemistries. This work establishes a foundation model pathway for battery lifetime prediction, paving the way toward universal battery lifetime prediction systems.

[705] Compression is Routing: Reconstruction Error as an Intrinsic Signal for Modular Language Models

Zhongpan Tang

Main category: cs.LG

TL;DR: A novel “Compression is Routing” architecture uses Transformer Autoencoder compression errors as intrinsic distribution fingerprints for expert routing, eliminating explicit gating networks while enabling VRAM compression for long contexts.

Details

Motivation: Address LLM challenges: context length limitations, high inference costs, and catastrophic forgetting in continual learning. Current MoE routing mechanisms increase complexity and lack interpretability for mixed-domain inputs.

Method: Train an 87M-parameter Transformer Autoencoder for 64x sequence compression (512 tokens → 8 latent vectors). Use reconstruction error as Intrinsic Distribution Fingerprint to automatically schedule expert modules without explicit gating networks.

Result: Compressor shows extreme domain discrimination: 99.47% reconstruction accuracy on in-domain code, 47.76% on semi-OOD Wiki text, and 0.57% on fully OOD random sequences. Reconstruction error serves as valid distribution fingerprint for routing.

Conclusion: Compression errors provide intrinsic distribution fingerprints for automatic expert routing, eliminating explicit gating networks. Offers scalable architecture with VRAM compression for ultra-long contexts, providing new perspective for next-generation modular neural networks.

Abstract: Current Large Language Models (LLMs) face three major challenges: context length limitations, high inference costs, and catastrophic forgetting during continual learning. While Mixture-of-Experts (MoE) architectures mitigate some of these conflicts, their routing mechanisms typically rely on explicitly trained auxiliary classifiers. This not only increases system complexity but also often lacks interpretability when handling mixed-domain inputs. Building upon the premise that Compression is Intelligence,'' this paper proposes a novel architectural philosophy: Compression is Routing. We trained an 87M-parameter end-to-end Transformer Autoencoder, achieving a 64x sequence length compression (compressing 512 tokens into 8 latent vectors). Experimental results demonstrate that this compressor possesses extreme domain discriminative capability: it achieves a reconstruction accuracy of 99.47% on the in-domain (code) validation set; accuracy drops sharply to 47.76% on a semi-out-of-distribution domain (Wiki text); and further plummets to just 0.57% on a fully out-of-distribution domain (random sequences). This extreme and systematic performance discrepancy establishes the validity of reconstruction error as an Intrinsic Distribution Fingerprint. Based on this, we propose that expert modules can be automatically scheduled using reconstruction residuals directly, without the need for explicit gating networks. This mechanism offers excellent scalability. Furthermore, this architecture provides a new perspective on VRAM compression’’ for handling ultra-long contexts. This report aims to verify the physical validity of this foundational architecture, offering a new research perspective for the next generation of scalable modular neural networks.

[706] UniCoMTE: A Universal Counterfactual Framework for Explaining Time-Series Classifiers on ECG Data

Justin Li, Efe Sencan, Jasper Zheng Duan, Vitus J. Leung, Stephen Tsaur, Ayse K. Coskun

Main category: cs.LG

TL;DR: UniCoMTE is a model-agnostic framework for generating counterfactual explanations for multivariate time series classifiers, evaluated on ECG data and shown to outperform existing methods in clarity and clinical utility.

Details

Motivation: Deep neural networks perform well on time series classification but their black-box nature limits trust and adoption in high-stakes domains like healthcare, creating a need for interpretable explanations.

Method: UniCoMTE identifies influential temporal features by modifying input samples and assessing impact on predictions. It’s model-agnostic, works directly on raw time series, and was evaluated on ECG classifiers comparing to LIME and SHAP.

Result: UniCoMTE produces concise, stable, human-aligned explanations that outperform LIME and SHAP in clarity and applicability. Medical experts found the counterfactual explanations clinically useful when presented alongside original ECG samples.

Conclusion: The framework advances interpretability of deep learning models for real-world time series applications by linking model predictions to meaningful signal patterns, addressing trust issues in high-stakes domains.

Abstract: Machine learning models, particularly deep neural networks, have demonstrated strong performance in classifying complex time series data. However, their black-box nature limits trust and adoption, especially in high-stakes domains such as healthcare. To address this challenge, we introduce UniCoMTE, a model-agnostic framework for generating counterfactual explanations for multivariate time series classifiers. The framework identifies temporal features that most heavily influence a model’s prediction by modifying the input sample and assessing its impact on the model’s prediction. UniCoMTE is compatible with a wide range of model architectures and operates directly on raw time series inputs. In this study, we evaluate UniCoMTE’s explanations on a time series ECG classifier. We quantify explanation quality by comparing our explanations’ comprehensibility to comprehensibility of established techniques (LIME and SHAP) and assessing their generalizability to similar samples. Furthermore, clinical utility is assessed through a questionnaire completed by medical experts who review counterfactual explanations presented alongside original ECG samples. Results show that our approach produces concise, stable, and human-aligned explanations that outperform existing methods in both clarity and applicability. By linking model predictions to meaningful signal patterns, the framework advances the interpretability of deep learning models for real-world time series applications.

[707] Learning What to Write: Write-Gated KV for Efficient Long-Context Inference

Yen-Chieh Huang, Pi-Cheng Hsiu, Rui Fang, Ming-Syan Chen

Main category: cs.LG

TL;DR: WG-KV introduces a lightweight mechanism that learns to predict token utility before writing to KV cache, reducing memory usage by 46-57% and achieving 3x prefill/2x decode speedups with negligible accuracy loss.

Details

Motivation: Long-context LLM inference suffers from quadratic attention complexity and linear KV cache growth. Prior approaches use post-hoc selection/eviction but miss the root cause: indiscriminate writing to persistent memory.

Method: Formalizes KV cache management as a causal system with three primitives: KV Admission, Selection, and Eviction. Instantiates KV Admission via Write-Gated KV - a lightweight mechanism that learns to predict token utility before cache entry, filtering low-utility states early while maintaining compact global cache with sliding local cache.

Result: Reduces memory usage by 46-57%, delivers 3.03-3.45x prefill speedup and 1.89-2.56x decode speedup on Llama models with negligible accuracy loss. Compatible with FlashAttention and paged-KV systems.

Conclusion: Learning what to write (rather than post-hoc management) is a principled and practical approach for efficient long-context inference. Early filtering of low-utility tokens significantly improves efficiency while maintaining accuracy.

Abstract: Long-context LLM inference is bottlenecked by the quadratic attention complexity and linear KV cache growth. Prior approaches mitigate this via post-hoc selection or eviction but overlook the root inefficiency: indiscriminate writing to persistent memory. In this paper, we formalize KV cache management as a causal system of three primitives: KV Admission, Selection, and Eviction. We instantiate KV Admission via Write-Gated KV, a lightweight mechanism that learns to predict token utility before it enters the cache. By filtering out low-utility states early to maintain a compact global cache alongside a sliding local cache, Write-Gated KV reduces memory usage by 46-57% and delivers 3.03-3.45$\times$ prefill and 1.89-2.56$\times$ decode speedups on Llama model with negligible accuracy loss, all while remaining compatible with FlashAttention and paged-KV systems. These results demonstrate that learning what to write, is a principled and practical recipe for efficient long-context inference. Code is available at https://github.com/EMCLab-Sinica/WG-KV .

[708] NetworkFF: Unified Layer Optimization in Forward-Only Neural Networks

Salar Beigzad

Main category: cs.LG

TL;DR: CFF extends Forward-Forward algorithm with inter-layer collaboration to overcome isolation issues, improving performance while maintaining memory efficiency and biological plausibility.

Details

Motivation: Forward-Forward algorithm addresses backpropagation's memory constraints and biological implausibility but suffers from inter-layer isolation where layers optimize independently without collective learning dynamics, limiting representational coordination and convergence efficiency.

Method: Introduces Collaborative Forward-Forward (CFF) learning with two paradigms: Fixed CFF (F-CFF) with constant inter-layer coupling and Adaptive CFF (A-CFF) with learnable collaboration parameters. Uses collaborative goodness function incorporating weighted contributions from all layers for coordinated feature learning.

Result: Significant performance improvements over baseline Forward-Forward implementations on MNIST and Fashion-MNIST datasets, demonstrating enhanced convergence efficiency and representational coordination.

Conclusion: Inter-layer collaboration is a fundamental enhancement to Forward-Forward learning, with applications in neuromorphic computing and energy-constrained AI systems, preserving forward-only computation while enabling global context integration.

Abstract: The Forward-Forward algorithm eliminates backpropagation’s memory constraints and biological implausibility through dual forward passes with positive and negative data. However, conventional implementations suffer from critical inter-layer isolation, where layers optimize goodness functions independently without leveraging collective learning dynamics. This isolation constrains representational coordination and limits convergence efficiency in deeper architectures. This paper introduces Collaborative Forward-Forward (CFF) learning, extending the original algorithm through inter-layer cooperation mechanisms that preserve forward-only computation while enabling global context integration. Our framework implements two collaborative paradigms: Fixed CFF (F-CFF) with constant inter-layer coupling and Adaptive CFF (A-CFF) with learnable collaboration parameters that evolve during training. The collaborative goodness function incorporates weighted contributions from all layers, enabling coordinated feature learning while maintaining memory efficiency and biological plausibility. Comprehensive evaluation on MNIST and Fashion-MNIST demonstrates significant performance improvements over baseline Forward-Forward implementations. These findings establish inter-layer collaboration as a fundamental enhancement to Forward-Forward learning, with immediate applicability to neuromorphic computing architectures and energy-constrained AI systems.

[709] SCOPE: Sequential Causal Optimization of Process Interventions

Jakob De Moor, Hans Weytjens, Johannes De Smedt, Jochen De Weerdt

Main category: cs.LG

TL;DR: SCOPE is a Prescriptive Process Monitoring approach that learns aligned sequential intervention recommendations using backward induction and causal learners, outperforming existing methods that treat interventions independently or rely on simulation.

Details

Motivation: Existing PresPM approaches fail to handle realistic scenarios where organizations need to align sequences of interventions to jointly steer case outcomes. Current methods either focus on single interventions, treat multiple interventions independently, or rely on simulation/data augmentation that creates reality gaps and bias.

Method: SCOPE uses backward induction to estimate the effect of each candidate intervention action, propagating impact from final decision point back to the first. It leverages causal learners to utilize observational data directly without requiring process approximations for reinforcement learning.

Result: Experiments on both an existing synthetic dataset and a new semi-synthetic dataset show that SCOPE consistently outperforms state-of-the-art PresPM techniques in optimizing KPIs.

Conclusion: SCOPE effectively addresses sequential intervention alignment in PresPM, and the novel semi-synthetic setup based on real-life event logs provides a reusable benchmark for future work on sequential PresPM.

Abstract: Prescriptive Process Monitoring (PresPM) recommends interventions during business processes to optimize key performance indicators (KPIs). In realistic settings, interventions are rarely isolated: organizations need to align sequences of interventions to jointly steer the outcome of a case. Existing PresPM approaches fall short in this respect. Many focus on a single intervention decision, while others treat multiple interventions independently, ignoring how they interact over time. Methods that do address these dependencies depend either on simulation or data augmentation to approximate the process to train a Reinforcement Learning (RL) agent, which can create a reality gap and introduce bias. We introduce SCOPE, a PresPM approach that learns aligned sequential intervention recommendations. SCOPE employs backward induction to estimate the effect of each candidate intervention action, propagating its impact from the final decision point back to the first. By leveraging causal learners, our method can utilize observational data directly, unlike methods that require constructing process approximations for reinforcement learning. Experiments on both an existing synthetic dataset and a new semi-synthetic dataset show that SCOPE consistently outperforms state-of-the-art PresPM techniques in optimizing the KPI. The novel semi-synthetic setup, based on a real-life event log, is provided as a reusable benchmark for future work on sequential PresPM.

[710] A Unified Representation of Neural Networks Architectures

Christophe Prieur, Mircea Lazar, Bogdan Robu

Main category: cs.LG

TL;DR: The paper introduces Distributed Parameter neural Networks (DiPaNet) as a unified framework connecting finite and infinite-dimensional neural networks through homogenization/discretization techniques, deriving approximation errors for neural networks as they approach continuum limits.

Details

Motivation: To establish a unified mathematical framework that connects finite neural networks with their infinite-dimensional continuum limits, providing rigorous approximation error analysis for neural networks as the number of neurons and layers tends to infinity.

Method: 1) Derive integral infinite width neural representations for single hidden layer networks, 2) Extend to deep residual CNNs with finite integral hidden layers, 3) Formalize neural ODEs and deep residual NN relations via discretization, 4) Merge approaches into unified DiPaNet representation using homogenization/discretization techniques.

Result: Developed DiPaNet framework showing most existing finite and infinite-dimensional NN architectures are related via homogenization/discretization, with deterministic approach applicable to general uniformly continuous matrix weight functions.

Conclusion: DiPaNet provides a unified representation connecting various neural network architectures through continuum limits and discretization, with potential applications to neural fields and neural integro-differential equations.

Abstract: In this paper we consider the limiting case of neural networks (NNs) architectures when the number of neurons in each hidden layer and the number of hidden layers tend to infinity thus forming a continuum, and we derive approximation errors as a function of the number of neurons and/or hidden layers. Firstly, we consider the case of neural networks with a single hidden layer and we derive an integral infinite width neural representation that generalizes existing continuous neural networks (CNNs) representations. Then we extend this to deep residual CNNs that have a finite number of integral hidden layers and residual connections. Secondly, we revisit the relation between neural ODEs and deep residual NNs and we formalize approximation errors via discretization techniques. Then, we merge these two approaches into a unified homogeneous representation of NNs as a Distributed Parameter neural Network (DiPaNet) and we show that most of the existing finite and infinite-dimensional NNs architectures are related via homogenization/discretization with the DiPaNet representation. Our approach is purely deterministic and applies to general, uniformly continuous matrix weight functions. Relations with neural fields and other neural integro-differential equations are discussed along with further possible generalizations and applications of the DiPaNet framework.

[711] Estimating Spatially Resolved Radiation Fields Using Neural Networks

Felix Lehner, Pasquale Lombardo, Susana Castillo, Oliver Hupe, Marcus Magnor

Main category: cs.LG

TL;DR: Neural networks for estimating scattered radiation field distributions in medical settings using synthetic datasets from Monte-Carlo simulations.

Details

Motivation: Need for accurate radiation protection dosimetry in medical radiation fields (interventional radiology/cardiology) where scattered radiation poses risks to medical staff.

Method: Created three synthetic datasets with increasing complexity using Geant4 Monte-Carlo simulations. Evaluated convolutional and fully connected neural network architectures to reconstruct fluence and spectra distributions over spatial domains.

Result: Demonstrated which neural network design decisions work well for reconstructing radiation field distributions. All datasets and training pipeline published as open source.

Conclusion: Neural networks can effectively estimate spatial distributions of scattered radiation fields for medical dosimetry, with specific architectural choices performing better for this task.

Abstract: We present an in-depth analysis on how to build and train neural networks to estimate the spatial distribution of scattered radiation fields for radiation protection dosimetry in medical radiation fields, such as those found in interventional radiology and cardiology. We present three different synthetically generated datasets with increasing complexity for training, using a Monte-Carlo Simulation application based on Geant4. On those datasets, we evaluate convolutional and fully connected architectures of neural networks to demonstrate which design decisions work well for reconstructing the fluence and spectra distributions over the spatial domain of such radiation fields. All our datasets, as well as our training pipeline, are published as open source in separate repositories.

cs.MA

[712] ShibuyaSocial: Multi-scale Model of Pedestrian Flows in Scramble Crossing

Akihiro Sakurai, Naoya Kajio, Ko Yamamoto

Main category: cs.MA

TL;DR: A learning-based pedestrian flow model that integrates global route selection and local collision avoidance using Attention mechanism, validated at Shibuya scramble crossing.

Details

Motivation: To prevent serious accidents from pedestrian congestion by developing a comprehensive model that captures both global route selection and local behaviors, which existing models often treat separately.

Method: Proposed model integrates local collision avoidance with global route selection using Attention mechanism, trained on video-recorded pedestrian trajectory data from Shibuya scramble crossing.

Result: Simulations based on trained model show both qualitative and quantitative validation that the model appropriately predicts pedestrian behaviors.

Conclusion: The integrated approach successfully models complex pedestrian decision-making including global route selection and local behaviors, providing a comprehensive framework for pedestrian flow prediction in crowded urban environments.

Abstract: This paper presents a learning-based model of pedestrian flows that integrates multi scale behaviors such as global route selection and local collision avoidance in urban spaces, particularly focusing on pedestrian movements at Shibuya scramble crossing. Since too much congestion of pedestrian flows can cause serious accidents, mathematically modeling and predicting pedestrian behaviors is important for preventing such accidents and providing a safe and comfortable environment. Although numerous studies have investigated learning-based modeling methods, most of them focus only on the local behavior of pedestrians, such as collision avoidance with neighbors and environmental objects. In an actual environment, pedestrian behavior involves more complicated decision making including global route selection. Moreover, a state transition from stopping to walking at a traffic light should be considered simultaneously. In this study, the proposed model integrates local behaviors with global route selection, using an Attention mechanism to ensure consistent global and local behavior predictions. We recorded video data of pedestrians at Shibuya scramble crossing and trained the proposed model using pedestrian walking trajectory data obtained from the video. Simulations of pedestrian behaviors based on the trained model qualitatively and quantitatively validated that the proposed model can appropriately predict pedestrian behaviors.

[713] Adaptive Accountability in Networked MAS: Tracing and Mitigating Emergent Norms at Scale

Saad Alqithami

Main category: cs.MA

TL;DR: An adaptive accountability framework for multi-agent systems that detects harmful emergent norms and intervenes in real-time to maintain ethical alignment and system objectives.

Details

Motivation: Large-scale networked multi-agent systems in critical infrastructure can develop undesirable emergent norms that evade traditional governance, requiring new approaches to ensure ethical alignment and system-level objectives.

Method: Three-component framework: (1) continuous responsibility tracing via lifecycle-aware audit ledger, (2) online detection of harmful emergent norms using decentralized sequential hypothesis tests, (3) local policy and reward-shaping interventions for real-time realignment.

Result: Framework prevents collusion and resource hoarding in ≥90% of configurations, boosts collective reward by 12-18%, lowers Gini inequality index by up to 33% vs PPO baseline. Proves bounded-compromise theorem showing limited compromised interactions.

Conclusion: Theoretically principled accountability layer enables ethically aligned, self-regulating behavior in complex multi-agent systems without sacrificing performance or scalability.

Abstract: Large-scale networked multi-agent systems increasingly underpin critical infrastructure, yet their collective behavior can drift toward undesirable emergent norms that elude conventional governance mechanisms. We introduce an adaptive accountability framework that (i) continuously traces responsibility flows through a lifecycle-aware audit ledger, (ii) detects harmful emergent norms online via decentralized sequential hypothesis tests, and (iii) deploys local policy and reward-shaping interventions that realign agents with system-level objectives in near real time. We prove a bounded-compromise theorem showing that whenever the expected intervention cost exceeds an adversary’s payoff, the long-run proportion of compromised interactions is bounded by a constant strictly less than one. Extensive high-performance simulations with up to 100 heterogeneous agents, partial observability, and stochastic communication graphs show that our framework prevents collusion and resource hoarding in at least 90% of configurations, boosts average collective reward by 12-18%, and lowers the Gini inequality index by up to 33% relative to a PPO baseline. These results demonstrate that a theoretically principled accountability layer can induce ethically aligned, self-regulating behavior in complex MAS without sacrificing performance or scalability.

[714] Networked Communication for Mean-Field Games with Function Approximation and Empirical Mean-Field Estimation

Patrick Benjamin, Alessandro Abate

Main category: cs.MA

TL;DR: The paper introduces function approximation to decentralized mean-field game learning, enabling agents to handle large state spaces and population-dependent policies, with communication networks improving performance beyond centralized approaches.

Details

Motivation: Existing decentralized algorithms for mean-field games are limited to tabular settings with small state spaces and cannot handle population-dependent policies that require global mean-field information.

Method: Introduces function approximation using Munchausen Online Mirror Descent, adapts it to non-episodic decentralized settings, and develops algorithms for agents to locally estimate the global empirical distribution via inter-agent communication.

Result: Theoretical proof shows exchanging policy information helps networked agents outperform both independent and centralized agents. Experiments confirm empirical performance gains and demonstrate that communication networks enable accurate mean-field estimation for population-dependent policies.

Conclusion: Function approximation enables scalable decentralized mean-field game learning with population-dependent policies, and communication networks provide superior performance by allowing agents to effectively estimate and leverage global information.

Abstract: Recent algorithms allow decentralised agents, possibly connected via a communication network, to learn equilibria in mean-field games from a non-episodic run of the empirical system. However, these algorithms are for tabular settings: this computationally limits the size of agents’ observation space, meaning the algorithms cannot handle anything but small state spaces, nor generalise beyond policies depending only on the agent’s local state to so-called ‘population-dependent’ policies. We address this limitation by introducing function approximation to the existing setting, drawing on the Munchausen Online Mirror Descent method that has previously been employed only in finite-horizon, episodic, centralised settings. While this permits us to include the mean field in the observation for players’ policies, it is unrealistic to assume decentralised agents have access to this global information: we therefore also provide new algorithms allowing agents to locally estimate the global empirical distribution, and to improve this estimate via inter-agent communication. We prove theoretically that exchanging policy information helps networked agents outperform both independent and even centralised agents in function-approximation settings. Our experiments demonstrate this happening empirically, and show that the communication network allows decentralised agents to estimate the mean field for population-dependent policies.

[715] Networked Communication for Decentralised Cooperative Agents in Mean-Field Control

Patrick Benjamin, Alessandro Abate

Main category: cs.MA

TL;DR: This paper introduces networked communication to cooperative mean-field control (MFC), adapting MFG algorithms and providing theoretical analysis showing networked communication increases social welfare faster than independent or centralized learning.

Details

Motivation: Mean-field control (MFC) has received less attention than mean-field games (MFGs) despite being potentially more useful for engineering large-scale collective behaviors. Recent decentralized communication algorithms in MFGs showed benefits for learning speed and robustness, but this approach hasn't been applied to MFC where populations arguably have broader incentive to communicate.

Method: The authors adapt recent MFG algorithms to MFC with networked communication, where decentralized agents learn online from a single, non-episodic run of the empirical system. They contribute a novel sub-routine allowing networked agents to estimate the global average reward from their local neighborhood, and provide new theoretical analysis for this MFC setting.

Result: Theoretical analysis proves that in MFC, networked communication allows agents to increase social welfare faster than both independent and centralized learning architectures. Experiments support this result across different classes of cooperative games, with additional ablation studies on communication rounds and robustness to failures.

Conclusion: Networked communication in cooperative mean-field control enables faster social welfare improvement compared to traditional architectures, making it a valuable approach for engineering large-scale collective behaviors where agents have strong incentives to cooperate and communicate.

Abstract: The mean-field framework has been used to find approximate solutions to problems involving very large populations of symmetric, anonymous agents, which may be intractable by other methods. The cooperative mean-field control (MFC) problem has received less attention than the non-cooperative mean-field game (MFG), despite the former potentially being more useful as a tool for engineering large-scale collective behaviours. Decentralised communication algorithms have recently been introduced to MFGs, giving benefits to learning speed and robustness. Inspired by this, we introduce networked communication to MFC - where populations arguably have broader incentive to communicate - and in particular to the setting where decentralised agents learn online from a single, non-episodic run of the empirical system. We adapt recent MFG algorithms to this new setting, as well as contributing a novel sub-routine allowing networked agents to estimate the global average reward from their local neighbourhood. Previous theoretical analysis of decentralised communication in MFGs does not extend trivially to MFC. We therefore contribute new theory proving that in MFC the networked communication scheme allows agents to increase social welfare faster than under both of the two typical alternative architectures, namely independent and centralised learning. We provide experiments that support this new result across different classes of cooperative game, and also give numerous ablation studies and additional experiments concerning numbers of communication round and robustness to communication failures.

[716] Personalized and Resilient Distributed Learning Through Opinion Dynamics

Luca Ballotta, Nicola Bastianello, Riccardo M. G. Ferrari, Karl H. Johansson

Main category: cs.MA

TL;DR: A distributed learning algorithm combining gradient descent with opinion dynamics to achieve both personalization (for heterogeneous agents) and resilience (against cyberattacks/anomalous data) in multi-agent networks.

Details

Motivation: Address two practical challenges in distributed multi-agent learning: 1) Personalization - heterogeneous agents need local models tailored to their data/tasks while still generalizing well, and 2) Resilience - learning must withstand cyberattacks or anomalous training data without disruption.

Method: Combines distributed gradient descent with the Friedkin-Johnsen model of opinion dynamics to achieve both personalization and resilience. The algorithm parameters can be tuned to control the trade-off between personalized and resilient behavior.

Result: Quantifies convergence speed and the neighborhood containing final learned models. Shows effectiveness on synthetic and real-world distributed learning tasks, achieving high global accuracy for both personalized models and with malicious agents compared to standard strategies.

Conclusion: The proposed algorithm successfully addresses both personalization and resilience challenges in distributed learning by leveraging the conceptual affinity between these requirements through a combination of gradient descent and opinion dynamics.

Abstract: In this paper, we address two practical challenges of distributed learning in multi-agent network systems, namely personalization and resilience. Personalization is the need of heterogeneous agents to learn local models tailored to their own data and tasks, while still generalizing well; on the other hand, the learning process must be resilient to cyberattacks or anomalous training data to avoid disruption. Motivated by a conceptual affinity between these two requirements, we devise a distributed learning algorithm that combines distributed gradient descent and the Friedkin-Johnsen model of opinion dynamics to fulfill both of them. We quantify its convergence speed and the neighborhood that contains the final learned models, which can be easily controlled by tuning the algorithm parameters to enforce a more personalized/resilient behavior. We numerically showcase the effectiveness of our algorithm on synthetic and real-world distributed learning tasks, where it achieves high global accuracy both for personalized models and with malicious agents compared to standard strategies.

[717] Cooperation as a Black Box: Conceptual Fluctuation and Diagnostic Tools for Misalignment in MAS

Shayak Nandi, Fernanda M. Eliott

Main category: cs.MA

TL;DR: The paper introduces the Misalignment Mosaic framework to diagnose meaning-level misalignment in Multi-Agent Systems, focusing on coordination-cooperation ambiguity and other semantic issues that arise during conceptual design.

Details

Motivation: Misalignment in MAS is often treated as technical failure, but problems originate earlier in conceptual design where semantic ambiguity and normative projection occur. The Rabbit-Duck illusion illustrates how perspective-dependent interpretations of agent behavior create epistemic instability, such as coordinated agents in cooperative MARL being misinterpreted as morally aligned when they're only optimized for shared utility.

Method: The paper introduces the Misalignment Mosaic framework with four components: 1) Terminological Inconsistency, 2) Interpretive Ambiguity, 3) Concept-to-Code Decay, and 4) Morality as Cooperation. This framework diagnoses how misalignment emerges through language, framing, and design assumptions, building on Morality-as-Cooperation Theory.

Result: The framework provides a systematic approach to identify meaning-level misalignment in MAS, particularly addressing coordination-cooperation ambiguity but generalizing to other overloaded terms like alignment, autonomy, and trust.

Conclusion: The paper calls for consistent meaning-level grounding in MAS to ensure systems function as intended both technically and ethically. This need is urgent as MAS principles influence broader AI workflows, amplifying risks in trust, interpretability, and governance.

Abstract: Misalignment in Multi-Agent Systems (MAS) is frequently treated as a technical failure. Yet, issues may arise from the conceptual design phase, where semantic ambiguity and normative projection occur. The Rabbit-Duck illusion illustrates how perspective-dependent readings of agent behavior, such as the conflation of cooperation-coordination, can create epistemic instability; e.g., coordinated agents in cooperative Multi-Agent Reinforcement Learning (MARL) benchmarks being interpreted as morally aligned, despite being optimized for shared utility maximization only. Motivated by three drivers of meaning-level misalignment in MAS (coordination-cooperation ambiguity, conceptual fluctuation, and semantic instability), we introduce the Misalignment Mosaic: a framework for diagnosing how misalignment emerges through language, framing, and design assumptions. The Mosaic comprises four components: 1. Terminological Inconsistency, 2. Interpretive Ambiguity, 3. Concept-to-Code Decay, and 4. Morality as Cooperation. Building on insights from the Morality-as-Cooperation Theory, we call for consistent meaning-level grounding in MAS to ensure systems function as intended: technically and ethically. This need is particularly urgent as MAS principles influence broader Artificial Intelligence (AI) workflows, amplifying risks in trust, interpretability, and governance. While this work focuses on the coordination/cooperation ambiguity, the Mosaic generalizes to other overloaded terms, such as alignment, autonomy, and trust.

[718] SafeSieve: From Heuristics to Experience in Progressive Pruning for LLM-based Multi-Agent Communication

Ruijia Zhang, Xinyan Zhao, Ruixiang Wang, Sigen Chen, Guibin Zhang, An Zhang, Kun Wang, Qingsong Wen

Main category: cs.MA

TL;DR: SafeSieve is a progressive adaptive pruning algorithm for LLM-based multi-agent systems that reduces token overhead by 12.4%-27.8% while maintaining 94.01% average accuracy through dual-mechanism communication refinement.

Details

Motivation: LLM-based multi-agent systems suffer from redundant communication and excessive token overhead. Existing methods isolate pre- and post-task optimization, lacking a unified strategy for efficient collaboration.

Method: SafeSieve uses a progressive adaptive pruning algorithm with dual-mechanism: initial LLM-based semantic evaluation plus accumulated performance feedback, transitioning from heuristic initialization to experience-driven refinement. Employs 0-extension clustering to preserve structurally coherent agent groups while eliminating ineffective links.

Result: Achieves 94.01% average accuracy across benchmarks (SVAMP, HumanEval, etc.) while reducing token usage by 12.4%-27.8%. Shows robustness under prompt injection attacks (1.23% average accuracy drop). In heterogeneous settings, reduces deployment costs by 13.3% while maintaining performance.

Conclusion: SafeSieve establishes an efficient, GPU-free, and scalable framework for practical multi-agent systems, providing a unified strategy for communication optimization that balances performance with resource efficiency.

Abstract: LLM-based multi-agent systems exhibit strong collaborative capabilities but often suffer from redundant communication and excessive token overhead. Existing methods typically enhance efficiency through pretrained GNNs or greedy algorithms, but often isolate pre- and post-task optimization, lacking a unified strategy. To this end, we present SafeSieve, a progressive and adaptive multi-agent pruning algorithm that dynamically refines the inter-agent communication through a novel dual-mechanism. SafeSieve integrates initial LLM-based semantic evaluation with accumulated performance feedback, enabling a smooth transition from heuristic initialization to experience-driven refinement. Unlike existing greedy Top-k pruning methods, SafeSieve employs 0-extension clustering to preserve structurally coherent agent groups while eliminating ineffective links. Experiments across benchmarks (SVAMP, HumanEval, etc.) showcase that SafeSieve achieves 94.01% average accuracy while reducing token usage by 12.4%-27.8%. Results further demonstrate robustness under prompt injection attacks (1.23% average accuracy drop). In heterogeneous settings, SafeSieve reduces deployment costs by 13.3% while maintaining performance. These results establish SafeSieve as an efficient, GPU-free, and scalable framework for practical multi-agent systems. Our code can be found here: https://github.com/csgen/SafeSieve

[719] RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows

Kai Zhang, Corey D Barrett, Jangwon Kim, Lichao Sun, Tara Taghavi, Krishnaram Kenthapadi

Main category: cs.MA

TL;DR: RadAgents is a multi-agent framework for chest X-ray interpretation that addresses limitations in clinical interpretability, multimodal fusion, and inconsistency resolution through radiologist-style workflows and verification mechanisms.

Details

Motivation: Current CXR interpretation systems lack clinical interpretability, fail to properly fuse multimodal evidence, and don't detect or resolve cross-tool inconsistencies, making them unreliable for clinical practice.

Method: A multi-agent framework that encodes radiologist-style workflows into modular pipelines, integrates clinical priors with task-aware multimodal reasoning, and uses grounding and retrieval-augmentation to verify and resolve context conflicts.

Result: The system produces more reliable, transparent, and clinically consistent outputs by addressing the three key limitations of existing methods.

Conclusion: RadAgents bridges important gaps in CXR interpretation by creating an auditable, clinically-aligned framework that improves reliability and transparency through structured multi-agent collaboration.

Abstract: Agentic systems offer a potential path to solve complex clinical tasks through collaboration among specialized agents, augmented by tool use and external knowledge bases. Nevertheless, for chest X-ray (CXR) interpretation, prevailing methods remain limited: (i) reasoning is frequently neither clinically interpretable nor aligned with guidelines, reflecting mere aggregation of tool outputs; (ii) multimodal evidence is insufficiently fused, yielding text-only rationales that are not visually grounded; and (iii) systems rarely detect or resolve cross-tool inconsistencies and provide no principled verification mechanisms. To bridge the above gaps, we present RadAgents, a multi-agent framework that couples clinical priors with task-aware multimodal reasoning and encodes a radiologist-style workflow into a modular, auditable pipeline. In addition, we integrate grounding and multimodal retrieval-augmentation to verify and resolve context conflicts, resulting in outputs that are more reliable, transparent, and consistent with clinical practice.

[720] HyperAgent: Leveraging Hypergraphs for Topology Optimization in Multi-Agent Communication

Heng Zhang, Yuling Shi, Xiaodong Gu, Zijian Zhang, Haochen You, Lubin Gan, Yilei Yuan, Jin Huang

Main category: cs.MA

TL;DR: HyperAgent: A hypergraph-based framework for multi-agent systems that optimizes communication topologies and captures group collaboration patterns using hyperedges instead of pairwise edges, achieving better performance with reduced communication costs.

Details

Motivation: Existing multi-agent systems have two main limitations: (1) ineffective group collaboration modeling due to reliance on pairwise edge representations in graphs, which can't capture relationships among multiple agents, and (2) limited task-adaptiveness in communication topology design, leading to either excessive communication for simple tasks or insufficient coordination for complex scenarios.

Method: HyperAgent uses hyperedges to link multiple agents within the same subtask and employs hypergraph convolutional layers for one-step information aggregation in collaboration groups. It incorporates a variational autoencoder framework with sparsity regularization to dynamically adjust hypergraph topologies based on task complexity.

Result: HyperAgent demonstrates superior performance and efficiency. On GSM8K, it achieves 95.07% accuracy while reducing token consumption by 25.33%, showing the effectiveness of hypergraph-based optimization for multi-agent communication.

Conclusion: Hypergraph-based approaches offer a promising solution for multi-agent collaboration by enabling more effective group modeling and adaptive communication topologies, addressing scalability and practical deployment challenges in multi-agent systems.

Abstract: Recent advances in large language model-powered multi-agent systems have demonstrated remarkable collective intelligence through effective communication. However, existing approaches face two primary challenges: (i) \textit{Ineffective group collaboration modeling}, as they rely on pairwise edge representations in graph structures, limiting their ability to capture relationships among multiple agents; and (ii) \textit{Limited task-adaptiveness in communication topology design}, leading to excessive communication cost for simple tasks and insufficient coordination for complex scenarios. These issues restrict the scalability and practical deployment of adaptive collaboration frameworks. To address these challenges, we propose \textbf{HyperAgent}, a hypergraph-based framework that optimizes communication topologies and effectively captures group collaboration patterns using direct hyperedge representations. Unlike edge-based approaches, HyperAgent uses hyperedges to link multiple agents within the same subtask and employs hypergraph convolutional layers to achieve one-step information aggregation in collaboration groups. Additionally, it incorporates a variational autoencoder framework with sparsity regularization to dynamically adjust hypergraph topologies based on task complexity. Experiments highlight the superiority of HyperAgent in both performance and efficiency. For instance, on GSM8K, HyperAgent achieves 95.07% accuracy while reducing token consumption by 25.33%, demonstrating the potential of hypergraph-based optimization for multi-agent communication.

[721] STAIR: Stability criterion for Time-windowed Assignment and Internal adversarial influence in Routing and decision-making

Roee M. Francos, Daniel Garces, Orhan Eren Akgün, Stephanie Gil

Main category: cs.MA

TL;DR: Proposes STAIR, a new stability criterion for adversarial multi-agent routing problems that addresses limitations of existing queuing theory and RL approaches, particularly for systems with denial-of-service attacks.

Details

Motivation: Existing routing algorithms don't consider adversarial agents that can cause severe performance degradation through coordinated denial-of-service attacks by spoofing locations and refusing service, leading to delays, cancellations, and potential instability.

Method: Introduces STAIR stability criterion that is easier to analyze than queuing-theory-based stability in adversarial settings and doesn’t depend on discount factors like RL. Also identifies degenerate stability phenomenon and introduces time-window constraints to mitigate it.

Result: Demonstrates STAIR’s practical relevance through simulations on real-world San Francisco mobility-on-demand data and shows it provides a metric for monitoring adversarial effects while linking stability to operational metrics like finite rejected requests.

Conclusion: STAIR offers a more suitable stability criterion for adversarial routing problems that directly connects to operational performance and helps address the challenges posed by coordinated denial-of-service attacks in multi-agent systems.

Abstract: A major limitation of existing routing algorithms for multi-agent systems is that they are designed without considering the potential presence of adversarial agents in the decision-making loop, which could lead to severe performance degradation in real-life applications where adversarial agents may be present. We study autonomous pickup-and-delivery routing problems in which adversarial agents launch coordinated denial-of-service attacks by spoofing their locations. This deception causes the central scheduler to assign pickup requests to adversarial agents instead of cooperative agents. Adversarial agents then choose not to service the requests with the goal of disrupting the operation of the system, leading to delays, cancellations, and potential instability in the routing policy. Policy stability in routing problems is typically defined as the cost of the policy being uniformly bounded over time, and it has been studied through two different lenses: queuing theory and reinforcement learning (RL), which are not well suited for routing with adversaries. In this paper, we propose a new stability criterion, STAIR, which is easier to analyze than queuing-theory-based stability in adversarial settings. Furthermore, STAIR does not depend on a chosen discount factor as is the case in discounted RL stability. STAIR directly links stability to desired operational metrics, like a finite number of rejected requests. This characterization is particularly useful in adversarial settings as it provides a metric for monitoring the effect of adversaries in the operation of the system. Furthermore, we demonstrate STAIR’s practical relevance through simulations on real-world San Francisco mobility-on-demand data. We also identify a phenomenon of degenerate stability that arises in the adversarial routing problem, and we introduce time-window constraints in the decision-making algorithm to mitigate it.

[722] MTTR-A: Measuring Cognitive Recovery Latency in Multi-Agent Systems

Barak Or

Main category: cs.MA

TL;DR: MTTR-A is a new metric for measuring cognitive recovery latency in multi-agent systems built on LLMs, addressing reliability limitations from reasoning failures rather than infrastructure faults.

Details

Motivation: Current reliability issues in LLM-based multi-agent systems stem from cognitive failures (reasoning drift) rather than infrastructure faults, and existing tools only describe failures without quantifying recovery speed.

Method: MTTR-A adapts classical dependability theory to agentic orchestration, measuring cognitive recovery latency. The paper also defines complementary metrics (MTBF, normalized recovery ratio) and establishes theoretical bounds linking recovery latency to cognitive uptime. Empirical validation uses a LangGraph-based benchmark with simulated drift and reflex recovery strategies.

Result: The work establishes measurable recovery behavior across multiple reflex strategies and provides a quantitative foundation for runtime cognitive dependability in distributed agentic systems.

Conclusion: MTTR-A provides the first quantitative framework for measuring cognitive recovery in multi-agent systems, addressing a critical gap in agentic system reliability by focusing on reasoning coherence restoration rather than just failure detection.

Abstract: Reliability in multi-agent systems (MAS) built on large language models is increasingly limited by cognitive failures rather than infrastructure faults. Existing observability tools describe failures but do not quantify how quickly distributed reasoning recovers once coherence is lost. We introduce MTTR-A (Mean Time-to-Recovery for Agentic Systems), a runtime reliability metric that measures cognitive recovery latency in MAS. MTTR-A adapts classical dependability theory to agentic orchestration, capturing the time required to detect reasoning drift and restore coherent operation. We further define complementary metrics, including MTBF and a normalized recovery ratio (NRR), and establish theoretical bounds linking recovery latency to long-run cognitive uptime. Using a LangGraph-based benchmark with simulated drift and reflex recovery, we empirically demonstrate measurable recovery behavior across multiple reflex strategies. This work establishes a quantitative foundation for runtime cognitive dependability in distributed agentic systems.

[723] MAPPO-LCR: Multi-Agent Proximal Policy Optimization with Local Cooperation Reward in Spatial Public Goods Games

Zhaoqilin Yang, Axin Xiang, Kedi Yang, Tianjun Liu, Youliang Tian

Main category: cs.MA

TL;DR: MAPPO with Local Cooperation Reward enables stable cooperation emergence in spatial public goods games by addressing payoff coupling through centralized critics and local reward alignment.

Details

Motivation: Existing evolutionary and reinforcement learning methods struggle with payoff coupling and non-stationarity in large interacting populations in spatial public goods games, where individual returns are intrinsically coupled through overlapping group interactions.

Method: Introduces Multi-Agent Proximal Policy Optimization (MAPPO) with a centralized critic for joint strategy evaluation, plus MAPPO-LCR with Local Cooperation Reward that aligns policy updates with surrounding cooperative density without altering game structure.

Result: Extensive simulations show stable cooperation emergence and reliable convergence across enhancement factors. Statistical analyses confirm MAPPO’s learning advantage over standard PPO in spatial public goods games.

Conclusion: MAPPO-LCR effectively addresses payoff coupling in spatial public goods games, enabling decentralized execution while supporting population-level value estimation, leading to stable cooperation emergence where traditional methods fail.

Abstract: Spatial public goods games model collective dilemmas where individual payoffs depend on population-level strategy configurations. Most existing studies rely on evolutionary update rules or value-based reinforcement learning methods. These approaches struggle to represent payoff coupling and non-stationarity in large interacting populations. This work introduces Multi-Agent Proximal Policy Optimization (MAPPO) into spatial public goods games for the first time. In these games, individual returns are intrinsically coupled through overlapping group interactions. Proximal Policy Optimization (PPO) treats agents as independent learners and ignores this coupling during value estimation. MAPPO addresses this limitation through a centralized critic that evaluates joint strategy configurations. To study neighborhood-level cooperation signals under this framework, we propose MAPPO with Local Cooperation Reward, termed MAPPO-LCR. The local cooperation reward aligns policy updates with surrounding cooperative density without altering the original game structure. MAPPO-LCR preserves decentralized execution while enabling population-level value estimation during training. Extensive simulations demonstrate stable cooperation emergence and reliable convergence across enhancement factors. Statistical analyses further confirm the learning advantage of MAPPO over PPO in spatial public goods games.

cs.MM

[724] Layout-Aware Text Editing for Efficient Transformation of Academic PDFs to Markdown

Changxu Duan

Main category: cs.MM

TL;DR: EditTrans: A hybrid editing-generation model that reduces PDF-to-markup transformation latency by 44.5% by identifying editable text from PDFs before generation.

Details

Motivation: Existing end-to-end decoder transformer models for PDF-to-markup conversion are inefficient because they regenerate dense text token-by-token from scratch, wasting inference steps on text that could be directly copied from PDF files.

Method: Introduces EditTrans, a hybrid editing-generation model with a lightweight classifier fine-tuned from a Document Layout Analysis model on 162,127 arXiv document pages. It identifies a queue of to-be-edited text from PDFs before generating markup language.

Result: EditTrans reduced transformation latency up to 44.5% compared to end-to-end decoder transformer models while maintaining transformation quality.

Conclusion: The hybrid editing-generation approach significantly improves efficiency in PDF-to-markup conversion, making academic content more accessible and adaptable for digital library workflows and linguistic corpus compilation.

Abstract: Academic documents stored in PDF format can be transformed into plain text structured markup languages to enhance accessibility and enable scalable digital library workflows. Markup languages allow for easier updates and customization, making academic content more adaptable and accessible to diverse usage, such as linguistic corpus compilation. Such documents, typically delivered in PDF format, contain complex elements including mathematical formulas, figures, headers, and tables, as well as densely layouted text. Existing end-to-end decoder transformer models can transform screenshots of documents into markup language. However, these models exhibit significant inefficiencies; their token-by-token decoding from scratch wastes a lot of inference steps in regenerating dense text that could be directly copied from PDF files. To solve this problem, we introduce EditTrans, a hybrid editing-generation model whose features allow identifying a queue of to-be-edited text from a PDF before starting to generate markup language. EditTrans contains a lightweight classifier fine-tuned from a Document Layout Analysis model on 162,127 pages of documents from arXiv. In our evaluations, EditTrans reduced the transformation latency up to 44.5% compared to end-to-end decoder transformer models, while maintaining transformation quality. Our code and reproducible dataset production scripts are open-sourced.

[725] Accelerating End-to-End PDF to Markdown Conversion Through Assisted Generation

Changxu Duan

Main category: cs.MM

TL;DR: The paper introduces Copy Lookup Decoding (CLD), a method that accelerates PDF-to-Markdown conversion by extracting candidate sequences directly from PDFs instead of decoding token-by-token, achieving up to 1.70× speedup while maintaining quality.

Details

Motivation: Current decoder transformer models for PDF-to-Markdown conversion are inefficient because they decode text token by token from scratch, even though dense text can often be directly copied from PDF files. There's a need to leverage the high n-gram overlap between PDFs and their Markdown equivalents to improve efficiency.

Method: The paper modifies Prompt Lookup Decoding (PLD) to extract candidate sequences directly from PDF files. It introduces Copy Lookup Decoding (CLD), which enhances PLD’s candidate generation mechanism by leveraging the n-gram overlap between PDF content and Markdown output.

Result: Experiments show that CLD can accelerate the PDF-to-Markdown conversion process by up to 1.70× while maintaining original quality. The method is open-sourced on GitHub.

Conclusion: CLD provides an efficient alternative to token-by-token decoding for PDF-to-Markdown conversion by directly extracting text from PDFs, significantly speeding up the process without compromising output quality.

Abstract: Converting data from machine-unreadable formats like PDFs into Markdown has the potential to enhance the accessibility of scientific research. Existing end-to-end decoder transformer models can transform screenshots of PDFs into Markdown, offering more flexibility than pipeline-based methods. Yet, decoding text token by token from scratch is inefficient, especially when dense text can be directly copied from the PDF. To address this challenge, this paper modifies Prompt Lookup Decoding (PLD) to extract candidate sequences directly from PDF files, leveraging the high n-gram overlap between PDFs and their Markdown equivalents. A new method, Copy Lookup Decoding (CLD), is introduced here to enhance PLD’s candidate generation mechanism. Experiments demonstrate that CLD can accelerate the conversion process by up to 1.70$\times$ at original quality. The codebase for this paper is open-source on GitHub (https://github.com/Fireblossom/CopyLookup).

[726] Asynchronous Pipeline Parallelism for Real-Time Multilingual Lip Synchronization in Video Communication Systems

Eren Caglar, Amirkia Rafiei Oskooei, Mehmet Kutanoglu, Mustafa Keles, Mehmet S. Aktas

Main category: cs.MM

TL;DR: Parallel asynchronous Transformer framework for real-time multilingual lip sync in video conferencing, reducing latency 3.1× vs sequential approaches through pipeline parallelism and hardware optimizations.

Details

Motivation: Need for efficient real-time multilingual lip synchronization in video conferencing systems, particularly for resource-constrained IoT scenarios like telemedicine and multilingual kiosks.

Method: Parallel pipeline architecture with translation, speech processing, and lip-sync modules decoupled via message queues; optimized with graph compilation, mixed-precision quantization, kernel fusion; includes context-adaptive silence detection.

Result: 3.1× latency reduction vs sequential pipelines, improved processing speed, synchronization stability, and resource utilization while preserving accuracy and visual quality.

Conclusion: The parallel asynchronous framework enables low-latency, resource-efficient multimodal communication suitable for next-generation AIoT systems in telemedicine, multilingual kiosks, and remote assistance.

Abstract: This paper introduces a parallel and asynchronous Transformer framework designed for efficient and accurate multilingual lip synchronization in real-time video conferencing systems. The proposed architecture integrates translation, speech processing, and lip-synchronization modules within a pipeline-parallel design that enables concurrent module execution through message-queue-based decoupling, reducing end-to-end latency by up to 3.1 times compared to sequential approaches. To enhance computational efficiency and throughput, the inference workflow of each module is optimized through low-level graph compilation, mixed-precision quantization, and hardware-accelerated kernel fusion. These optimizations provide substantial gains in efficiency while preserving model accuracy and visual quality. In addition, a context-adaptive silence-detection component segments the input speech stream at semantically coherent boundaries, improving translation consistency and temporal alignment across languages. Experimental results demonstrate that the proposed parallel architecture outperforms conventional sequential pipelines in processing speed, synchronization stability, and resource utilization. The modular, message-oriented design makes this work applicable to resource-constrained IoT communication scenarios including telemedicine, multilingual kiosks, and remote assistance systems. Overall, this work advances the development of low-latency, resource-efficient multimodal communication frameworks for next-generation AIoT systems.

[727] D$^{2}$Stream: Decoupled Dual-Stream Temporal-Speaker Interaction for Audio-Visual Speaker Detection

Junhao Xiao, Shun Feng, Zhiyu Wu, Jianjun Li, Zhiyuan Ma, Yi Chen

Main category: cs.MM

TL;DR: D²Stream is a decoupled dual-stream framework for audio-visual speaker detection that separates temporal modeling from speaker discrimination, achieving state-of-the-art performance with significantly reduced computation.

Details

Motivation: Existing audio-visual speaker detection methods suffer from computational inefficiency or suboptimal performance due to joint modeling of temporal and speaker interactions, which motivates the need for a more efficient and effective approach.

Method: Proposes D²Stream with cross-modal aligned audio-visual features, two lightweight streams (Temporal Interaction Stream for long-range dependencies and Speaker Interaction Stream for inter-person relationships), cross-attention between streams, and a Voice Gate module to filter non-speech facial movements.

Result: Achieves 95.6% mAP on AVA-ActiveSpeaker (new SOTA), with 80% computation reduction vs GNN-based models and 30% fewer parameters than attention-based alternatives, plus strong generalization on Columbia ASD.

Conclusion: D²Stream demonstrates that decoupling temporal and speaker modeling leads to superior performance and efficiency in audio-visual speaker detection, offering a promising direction for real-world applications.

Abstract: Audio-visual speaker detection aims to identify the active speaker in videos by leveraging complementary audio and visual cues. Existing methods often suffer from computational inefficiency or suboptimal performance due to joint modeling of temporal and speaker interactions. We propose D$^{2}$Stream, a decoupled dual-stream framework that separates cross-frame temporal modeling from within-frame speaker discrimination. Audio and visual features are first aligned via cross-modal attention, then fed into two lightweight streams: a Temporal Interaction Stream captures long-range temporal dependencies, while a Speaker Interaction Stream models per-frame inter-person relationships. The temporal and relational features extracted by the two streams interact via cross-attention to enrich representations. A lightweight Voice Gate module further mitigates false positives from non-speech facial movements. On AVA-ActiveSpeaker, D$^{2}$Stream achieves a new state-of-the-art at 95.6% mAP, with 80% reduction in computation compared to GNN-based models and 30% fewer parameters than attention-based alternatives, while also generalizing well on Columbia ASD. Source code is available at https://anonymous.4open.science/r/D2STREAM.

[728] TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity

Yuzhuo Chen, Zehua Ma, Han Fang, Weiming Zhang, Nenghai Yu

Main category: cs.MM

TL;DR: TAG-WM is a tamper-aware generative image watermarking method that embeds copyright and localization watermarks in latent space, detects tampered regions using diffusion inversion sensitivity, and achieves SOTA robustness while preserving generation quality.

Details

Motivation: AIGC enables efficient visual creation but raises copyright and authenticity risks. Digital image watermarking can help with integrity verification and source tracing, but generative image editing tools increase malicious tampering risks and challenge passive tampering detection and watermark robustness.

Method: Four key modules: 1) Dual-mark joint sampling (DMJS) for embedding copyright and localization watermarks into latent space while preserving generative quality; 2) Watermark latent reconstruction (WLR) using reversed DMJS; 3) Dense variation region detector (DVRD) leveraging diffusion inversion sensitivity to identify tampered areas via statistical deviation analysis; 4) Tamper-aware decoding (TAD) guided by localization results.

Result: TAG-WM achieves state-of-the-art performance in both tampering robustness and localization capability even under distortion, while preserving lossless generation quality and maintaining a watermark capacity of 256 bits.

Conclusion: The proposed TAG-WM method effectively addresses copyright and authenticity challenges in AIGC by combining robust watermarking with tamper detection capabilities, providing a comprehensive solution for integrity verification and source tracing in generative image content.

Abstract: AI-generated content (AIGC) enables efficient visual creation but raises copyright and authenticity risks. As a common technique for integrity verification and source tracing, digital image watermarking is regarded as a potential solution to above issues. However, the widespread adoption and advancing capabilities of generative image editing tools have amplified malicious tampering risks, while simultaneously posing new challenges to passive tampering detection and watermark robustness. To address these challenges, this paper proposes a Tamper-Aware Generative image WaterMarking method named TAG-WM. The proposed method comprises four key modules: a dual-mark joint sampling (DMJS) algorithm for embedding copyright and localization watermarks into the latent space while preserving generative quality, the watermark latent reconstruction (WLR) utilizing reversed DMJS, a dense variation region detector (DVRD) leveraging diffusion inversion sensitivity to identify tampered areas via statistical deviation analysis, and the tamper-aware decoding (TAD) guided by localization results. The experimental results demonstrate that TAG-WM achieves state-of-the-art performance in both tampering robustness and localization capability even under distortion, while preserving lossless generation quality and maintaining a watermark capacity of 256 bits. The code is available at: https://github.com/Suchenl/TAG-WM.

[729] Prompt-Aware Adaptive Elastic Weight Consolidation for Continual Learning in Medical Vision-Language Models

Ziyuan Gao, Philippe Morel

Main category: cs.MM

TL;DR: PA-EWC is a novel continual learning method that uses prompt-guided parameter specialization to prevent catastrophic forgetting in medical vision-language models, achieving up to 17.58% reduction in forgetting across diverse medical imaging datasets.

Details

Motivation: Medical AI systems deployed in clinical settings face catastrophic forgetting when learning new imaging protocols while needing to retain prior diagnostic capabilities. This is especially challenging for medical vision-language models that must preserve complex cross-modal alignments between medical images and clinical terminology across diverse imaging modalities.

Method: Prompt-Aware Adaptive Elastic Weight Consolidation (PA-EWC) uses prompt-guided parameter specialization to categorize model parameters based on their functional roles in processing visual-descriptive, spatial-guided, and medical-semantic information. It incorporates adaptive Fisher Information computation with gradient stability analysis and develops weighted complexity metrics based on medical terminology density.

Result: PA-EWC reduces catastrophic forgetting by up to 17.58% compared to baseline methods, with performance improvements of 4.30% on chest X-ray pathology localization (CheXlocalize) and 6.06% on polyp segmentation (Kvasir-SEG). Evaluated across five medical imaging datasets representing endoscopy, dermoscopy, radiography, and ultrasound modalities.

Conclusion: PA-EWC effectively addresses catastrophic forgetting in medical vision-language models through prompt-guided parameter specialization, enabling targeted protection of critical knowledge while allowing adaptation to new clinical requirements across diverse medical imaging modalities.

Abstract: Medical AI systems face catastrophic forgetting when deployed in clinical settings, where models must learn new imaging protocols while retaining prior diagnostic capabilities. This challenge is particularly acute for medical vision-language models that must preserve complex cross-modal alignments between medical images and clinical terminology across diverse imaging modalities. We introduce Prompt- Aware Adaptive Elastic Weight Consolidation (PA-EWC), a novel continual learning approach that addresses catastrophic forgetting through prompt-guided parameter specialization. Our method systematically categorizes model parameters based on their functional roles in processing visual-descriptive, spatial-guided, and medical-semantic information, enabling targeted protection of critical knowledge while allowing adaptation to new clinical requirements. PA-EWC incorporates adaptive Fisher Information computation with gradient stability analysis and develops weighted complexity metrics based on medical terminology density. We evaluate our approach across five medical imaging datasets (Kvasir-SEG, ISIC 2018, CheXlocalize, BUSI, CAMUS) representing diverse modalities including endoscopy, dermoscopy, radiography, and ultrasound. Experimental results demonstrate that PA-EWC reduces catastrophic forgetting by up to 17.58% compared to baseline methods, with performance improvements of 4.30% on chest X-ray pathology localization and 6.06% on polyp segmentation.

eess.AS

[730] Continual Learning for Acoustic Event Classification

Yang Xiao

Main category: eess.AS

TL;DR: Proposed RK approach: a diversity-aware incremental learning method for acoustic event classification that prevents catastrophic forgetting while maintaining computational efficiency for on-device deployment.

Details

Motivation: On-device acoustic event classification faces challenges in continuously learning new classes without catastrophic forgetting, while being constrained by limited computation resources (model size, memory). Existing methods struggle with efficient memory management and computational overhead.

Method: Two novel diversity-aware incremental learning methods: 1) For Spoken Keyword Spotting: RK approach uses diversity-aware sampler to select diverse historical/incoming data based on classification uncertainty, plus data augmentation and knowledge distillation for memory efficiency. 2) For Environmental Sound Classification: Measures uncertainty by observing classification probability fluctuations against parallel perturbations in classifier embedding (not raw data), reducing computation cost.

Result: RK approach achieves 4.2% absolute improvement in average accuracy over best baseline on Google Speech Command dataset with less memory. Outperforms baseline continual learning methods on DCASE 2019 Task 1 and ESC-50 datasets in both classification accuracy and computational efficiency.

Conclusion: The proposed RK approach enables efficient incremental learning of new classes without catastrophic forgetting for on-device acoustic event classification, balancing accuracy improvements with computational resource constraints.

Abstract: Continuously learning new classes without catastrophic forgetting is a challenging problem for on-device acoustic event classification given the restrictions on computation resources (e.g., model size, running memory). To alleviate such an issue, we propose two novel diversity-aware incremental learning method for Spoken Keyword Spotting and Environmental Sound Classification. Our method selects the historical data for the training by measuring the per-sample classification uncertainty. For the Spoken Keyword Spotting application, the proposed RK approach introduces a diversity-aware sampler to select a diverse set from historical and incoming keywords by calculating classification uncertainty. As a result, the RK approach can incrementally learn new tasks without forgetting prior knowledge. Besides, the RK approach also proposes data augmentation and knowledge distillation loss function for efficient memory management on the edge device. For the Environmental Sound Classification application, we measure the uncertainty by observing how the classification probability of data fluctuates against the parallel perturbations added to the classifier embedding. In this way, the computation cost can be significantly reduced compared with adding perturbation to the raw data. Experimental results show that the proposed RK approach achieves 4.2% absolute improvement in terms of average accuracy over the best baseline on Google Speech Command dataset with less required memory. Experimental results on the DCASE 2019 Task 1 and ESC-50 dataset show that our proposed method outperforms baseline continual learning methods on classification accuracy and computational efficiency, indicating our method can efficiently and incrementally learn new classes without the catastrophic forgetting problem for on-device environmental sound classification

[731] LIWhiz: A Non-Intrusive Lyric Intelligibility Prediction System for the Cadenza Challenge

Ram C. M. C. Shekar, Iván López-Espejo

Main category: eess.AS

TL;DR: LIWhiz is a non-intrusive lyric intelligibility prediction system that uses Whisper for feature extraction and achieves 22.4% relative RMSE reduction over STOI baseline on the Cadenza CLIP evaluation set.

Details

Motivation: The paper addresses the need for improved lyric intelligibility prediction in music, particularly for the ICASSP 2026 Cadenza Challenge, where current STOI-based baselines have limitations in accuracy.

Method: LIWhiz uses Whisper for robust feature extraction from audio, combined with a trainable back-end architecture for predicting lyric intelligibility scores in a non-intrusive manner (no reference signal needed).

Result: The system achieves a 22.4% relative root mean squared error reduction over the STOI-based baseline and shows substantial improvement in normalized cross-correlation on the Cadenza Lyric Intelligibility Prediction (CLIP) evaluation set.

Conclusion: LIWhiz demonstrates that leveraging Whisper’s robust feature extraction capabilities combined with a trainable back-end can significantly improve lyric intelligibility prediction performance over traditional STOI-based approaches.

Abstract: We present LIWhiz, a non-intrusive lyric intelligibility prediction system submitted to the ICASSP 2026 Cadenza Challenge. LIWhiz leverages Whisper for robust feature extraction and a trainable back-end for score prediction. Tested on the Cadenza Lyric Intelligibility Prediction (CLIP) evaluation set, LIWhiz achieves a 22.4% relative root mean squared error reduction over the STOI-based baseline, yielding a substantial improvement in normalized cross-correlation.

[732] SAM Audio: Segment Anything in Audio

Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, Christoph Feichtenhofer, Piotr Dollár, Wei-Ning Hsu, Ann Lee

Main category: eess.AS

TL;DR: SAM Audio is a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single diffusion transformer framework, achieving SOTA across diverse audio domains.

Details

Motivation: Existing audio separation models are either domain-specific (limited to fixed categories like speech/music) or have limited controllability (supporting only single prompting modality like text). There's a need for a general-purpose model that can handle diverse audio types with flexible multimodal prompting.

Method: Built on diffusion transformer architecture, trained with flow matching on large-scale audio data spanning speech, music, and general sounds. Unifies text, visual, and temporal span prompting within a single framework for flexible target source separation.

Result: Achieves state-of-the-art performance across diverse benchmarks including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audios, substantially outperforming prior general-purpose and specialized systems.

Conclusion: SAM Audio represents a significant advancement in general audio separation, offering unified multimodal prompting capabilities and superior performance across diverse audio domains, while also introducing new benchmarks and evaluation methods for real-world separation tasks.

Abstract: General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audios, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.

[733] TICL+: A Case Study On Speech In-Context Learning for Children’s Speech Recognition

Haolong Zheng, Yekaterina Yegorova, Mark Hasegawa-Johnson

Main category: eess.AS

TL;DR: TICL+ improves children’s speech recognition by combining semantic and acoustic information for better in-context example selection in speech foundation models.

Details

Motivation: Children's speech recognition faces challenges due to acoustic/linguistic variability, limited labeled data, and differences from adult speech. Speech foundation models can help via Speech In-Context Learning (SICL), but effectiveness depends on how in-context examples are selected.

Method: Extends existing Text-Embedding KNN for SICL (TICL) by adding an acoustic reranking step to create TICL+. This prioritizes examples that are both semantically and acoustically aligned with test input.

Result: Experiments on four children’s speech corpora show TICL+ achieves up to 53.3% relative WER reduction over zero-shot performance and 37.6% over baseline TICL.

Conclusion: Combining semantic and acoustic information enables robust, scalable ASR for children’s speech through improved in-context example selection in speech foundation models.

Abstract: Children’s speech recognition remains challenging due to substantial acoustic and linguistic variability, limited labeled data, and significant differences from adult speech. Speech foundation models can address these challenges through Speech In-Context Learning (SICL), allowing adaptation to new domains without fine-tuning. However, the effectiveness of SICL depends on how in-context examples are selected. We extend an existing retrieval-based method, Text-Embedding KNN for SICL (TICL), introducing an acoustic reranking step to create TICL+. This extension prioritizes examples that are both semantically and acoustically aligned with the test input. Experiments on four children’s speech corpora show that TICL+ achieves up to a 53.3% relative word error rate reduction over zero-shot performance and 37.6% over baseline TICL, highlighting the value of combining semantic and acoustic information for robust, scalable ASR in children’s speech.

[734] What Does the Speaker Embedding Encode?

Shuai Wang, Yanmin Qian, Kai Yu

Main category: eess.AS

TL;DR: Analysis of speaker embeddings (i-vector, d-vector, s-vector) reveals their distinct encoding properties, leading to a novel i-s-vector that combines i-vector’s speaker discrimination with s-vector’s sequential information capture.

Details

Motivation: Despite widespread use of speaker embeddings like i-vector and d-vector, there's limited understanding of what properties they actually encode. The paper aims to systematically analyze these embeddings across multiple dimensions to understand their strengths and limitations.

Method: Comprehensive analysis of three speaker embedding methods (i-vector, d-vector, RNN/LSTM-based s-vector) through carefully designed classification tasks. Based on insights, propose a multi-task learning framework integrating i-vector and s-vector to create i-s-vector.

Result: i-vector excels at speaker discrimination but encodes limited sequential information; s-vector captures text content and word order effectively but struggles with speaker identity; d-vector shows balanced performance but loses sequential information through averaging. The proposed i-s-vector achieves >50% EER reduction compared to i-vector baseline on content mismatch trials in RSR2015 experiments.

Conclusion: Different speaker embeddings encode complementary information, and combining them through multi-task learning creates more robust representations. The i-s-vector successfully integrates i-vector’s speaker discrimination with s-vector’s sequential information capture, significantly improving performance on challenging content mismatch scenarios.

Abstract: Developing a good speaker embedding has received tremendous interest in the speech community, with representations such as i-vector and d-vector demonstrating remarkable performance across various tasks. Despite their widespread adoption, a fundamental question remains largely unexplored: what properties are actually encoded in these embeddings? To address this gap, we conduct a comprehensive analysis of three prominent speaker embedding methods: i-vector, d-vector, and RNN/LSTM-based sequence-vector (s-vector). Through carefully designed classification tasks, we systematically investigate their encoding capabilities across multiple dimensions, including speaker identity, gender, speaking rate, text content, word order, and channel information. Our analysis reveals distinct strengths and limitations of each embedding type: i-vector excels at speaker discrimination but encodes limited sequential information; s-vector captures text content and word order effectively but struggles with speaker identity; d-vector shows balanced performance but loses sequential information through averaging. Based on these insights, we propose a novel multi-task learning framework that integrates i-vector and s-vector, resulting in a new speaker embedding (i-s-vector) that combines their complementary advantages. Experimental results on RSR2015 demonstrate that the proposed i-s-vector achieves more than 50% EER reduction compared to the i-vector baseline on content mismatch trials, validating the effectiveness of our approach.

[735] Phoneme-based speech recognition driven by large language models and sampling marginalization

Te Ma, Nanjie Li, Hao Huang, Zhijian Ou

Main category: eess.AS

TL;DR: Proposes Sampling-K Marginalized (SKM) training strategy for LLM-P2G speech recognition, replacing beam search with random sampling to improve path diversity, training efficiency, and recognition performance.

Details

Motivation: Existing LLM-P2G uses Top-K Marginalized (TKM) training with beam search, which suffers from insufficient path diversity, low training efficiency, and high resource overhead.

Method: Proposes Sampling-K Marginalized (SKM) training strategy that replaces beam search with random sampling to generate candidate phoneme sequences, improving marginalized modeling and training efficiency.

Result: SKM improves model learning convergence speed and recognition performance while maintaining model complexity. Outperforms SpeechLLM in accuracy and structural simplicity on Polish and German datasets.

Conclusion: SKM-driven LLM-P2G shows practical value and application potential in cross-language speech recognition systems, offering better efficiency and performance than previous approaches.

Abstract: Recently, the Large Language Model-based Phoneme-to-Grapheme (LLM-P2G) method has shown excellent performance in speech recognition tasks and has become a feasible direction to replace the traditional WFST decoding method. This framework takes into account both recognition accuracy and system scalability through two-stage modeling of phoneme prediction and text generation. However, the existing LLM-P2G adopts the Top-K Marginalized (TKM) training strategy, and its candidate phoneme sequences rely on beam search generation, which has problems such as insufficient path diversity, low training efficiency, and high resource overhead. To this end, this paper proposes a sampling marginalized training strategy (Sampling-K Marginalized, SKM), which replaces beam search with random sampling to generate candidate paths, improving marginalized modeling and training efficiency. Experiments were conducted on Polish and German datasets, and the results showed that SKM further improved the model learning convergence speed and recognition performance while maintaining the complexity of the model. Comparative experiments with a speech recognition method that uses a projector combined with a large language model (SpeechLLM) also show that the SKM-driven LLM-P2G has more advantages in recognition accuracy and structural simplicity. The study verified the practical value and application potential of this method in cross-language speech recognition systems.

[736] MeanFlow-TSE: One-Step Generative Target Speaker Extraction with Mean Flow

Riki Shimizu, Xilin Jiang, Nima Mesgarani

Main category: eess.AS

TL;DR: MeanFlow-TSE: A one-step generative target speaker extraction framework using mean-flow objectives for fast, high-quality separation without iterative refinement.

Details

Motivation: Current diffusion and flow-matching TSE methods require multi-step sampling, limiting their practicality in low-latency, real-time applications. There's a need for faster inference while maintaining high separation quality.

Method: Proposes MeanFlow-TSE, a one-step generative framework trained with mean-flow objectives. Builds on AD-FlowTSE paradigm, defining a flow between background and target source governed by the mixing ratio (MR). Enables fast generation without iterative refinement.

Result: Outperforms existing diffusion- and flow-matching-based TSE models on Libri2Mix corpus in separation quality and perceptual metrics while requiring only a single inference step.

Conclusion: Mean-flow-guided one-step generation offers an effective and efficient alternative for real-time target speaker extraction, balancing speed and quality for practical applications.

Abstract: Target speaker extraction (TSE) aims to isolate a desired speaker’s voice from a multi-speaker mixture using auxiliary information such as a reference utterance. Although recent advances in diffusion and flow-matching models have improved TSE performance, these methods typically require multi-step sampling, which limits their practicality in low-latency settings. In this work, we propose MeanFlow-TSE, a one-step generative TSE framework trained with mean-flow objectives, enabling fast and high-quality generation without iterative refinement. Building on the AD-FlowTSE paradigm, our method defines a flow between the background and target source that is governed by the mixing ratio (MR). Experiments on the Libri2Mix corpus show that our approach outperforms existing diffusion- and flow-matching-based TSE models in separation quality and perceptual metrics while requiring only a single inference step. These results demonstrate that mean-flow-guided one-step generation offers an effective and efficient alternative for real-time target speaker extraction. Code is available at https://github.com/rikishimizu/MeanFlow-TSE.

[737] Enhancing Fully Formatted End-to-End Speech Recognition with Knowledge Distillation via Multi-Codebook Vector Quantization

Jian You, Xiangfeng Li, Erwan Zerhouni

Main category: eess.AS

TL;DR: Enhanced E2E ASR model using knowledge distillation with multi-codebook vector quantization achieves SOTA results in WER and PER on LibriSpeech-PC.

Details

Motivation: Traditional ASR models output normalized text without punctuation/capitalization, requiring post-processing models that add complexity and latency. There's a need for end-to-end ASR models that directly predict punctuation and capitalization, but this area remains underexplored.

Method: Proposes an enhanced fully formatted end-to-end ASR model that leverages knowledge distillation through multi-codebook vector quantization (MVQ).

Result: Model significantly outperforms previous works in word error rate (both with and without punctuation/capitalization) and punctuation error rate. Achieves state-of-the-art results on LibriSpeech-PC test-clean and test-other subsets.

Conclusion: The proposed enhanced fully formatted E2E ASR model with knowledge distillation via MVQ effectively addresses the limitations of cascaded systems and demonstrates superior performance in generating properly formatted speech recognition outputs.

Abstract: Conventional automatic speech recognition (ASR) models typically produce outputs as normalized texts lacking punctuation and capitalization, necessitating post-processing models to enhance readability. This approach, however, introduces additional complexity and latency due to the cascaded system design. In response to this challenge, there is a growing trend to develop end-to-end (E2E) ASR models capable of directly predicting punctuation and capitalization, though this area remains underexplored. In this paper, we propose an enhanced fully formatted E2E ASR model that leverages knowledge distillation (KD) through multi-codebook vector quantization (MVQ). Experimental results demonstrate that our model significantly outperforms previous works in word error rate (WER) both with and without punctuation and capitalization, and in punctuation error rate (PER). Evaluations on the LibriSpeech-PC test-clean and test-other subsets show that our model achieves state-of-the-art results.

[738] A Comprehensive Survey on Generative AI for Video-to-Music Generation

Shulei Ji, Songruoyao Wu, Zihao Wang, Shuyu Li, Kejun Zhang

Main category: eess.AS

TL;DR: A comprehensive review paper on video-to-music generation using deep generative AI, covering conditioning input construction, conditioning mechanisms, music generation frameworks, multimodal datasets, and evaluation metrics.

Details

Motivation: The rapid growth of video-to-music generation driven by multimodal generative models lacks comprehensive literature review. This paper aims to fill that gap by systematically analyzing existing approaches and providing a structured overview of the field.

Method: The review organizes video-to-music generation approaches around three key components: 1) conditioning input construction, 2) conditioning mechanisms, and 3) music generation frameworks. It categorizes existing methods based on their designs for each component and provides fine-grained categorization of video and music modalities.

Result: The paper presents a systematic taxonomy of video-to-music generation techniques, clarifies the roles of different strategies, summarizes available multimodal datasets and evaluation metrics, and identifies ongoing challenges in the field.

Conclusion: This comprehensive review provides a structured framework for understanding video-to-music generation, serves as a reference for researchers, and highlights future research directions by identifying current limitations and challenges in the field.

Abstract: The burgeoning growth of video-to-music generation can be attributed to the ascendancy of multimodal generative models. However, there is a lack of literature that comprehensively combs through the work in this field. To fill this gap, this paper presents a comprehensive review of video-to-music generation using deep generative AI techniques, focusing on three key components: conditioning input construction, conditioning mechanism, and music generation frameworks. We categorize existing approaches based on their designs for each component, clarifying the roles of different strategies. Preceding this, we provide a fine-grained categorization of video and music modalities, illustrating how different categories influence the design of components within the generation pipelines. Furthermore, we summarize available multimodal datasets and evaluation metrics while highlighting ongoing challenges in the field.

[739] ASR-Synchronized Speaker-Role Diarization

Arindam Ghosh, Mark Fuhs, Bongjun Kim, Anurag Chowdhury, Monika Woszczyna

Main category: eess.AS

TL;DR: Proposed method improves speaker-role diarization (RD) by adapting joint ASR+SD framework to ASR+RD with frozen ASR transducer and auxiliary RD transducer, achieving better performance than baseline.

Details

Motivation: Speaker-role diarization (identifying specific roles like doctor/patient) is more useful than conventional speaker diarization, but current end-to-end ASR+RD approaches degrade ASR performance. Need to maintain ASR quality while improving RD accuracy.

Method: Adapt joint ASR+SD framework to ASR+RD by: 1) Freezing ASR transducer, 2) Training auxiliary RD transducer in parallel, 3) Using task-specific predictor networks, 4) Using higher-layer ASR encoder features as RD encoder input, 5) Replacing blank-shared RNNT loss with cross-entropy loss along forced-alignment path.

Result: Experiments on doctor-patient conversations show relative reductions of 6.2% and 4.5% in role-based word diarization error rate (R-WDER) on public and private datasets respectively, outperforming best baseline.

Conclusion: The proposed method successfully improves speaker-role diarization performance while maintaining ASR quality, demonstrating the importance of task-specific architectures and leveraging higher-layer ASR features for RD.

Abstract: Speaker-role diarization (RD), such as doctor vs. patient or lawyer vs. client, is practically often more useful than conventional speaker diarization (SD), which assigns only generic labels (speaker-1, speaker-2). The state-of-the-art end-to-end ASR+RD approach uses a single transducer that serializes word and role predictions (role at the end of a speaker’s turn), but at the cost of degraded ASR performance. To address this, we adapt a recent joint ASR+SD framework to ASR+RD by freezing the ASR transducer and training an auxiliary RD transducer in parallel to assign a role to each ASR-predicted word. For this, we first show that SD and RD are fundamentally different tasks, exhibiting different dependencies on acoustic and linguistic information. Motivated by this, we propose (1) task-specific predictor networks and (2) using higher-layer ASR encoder features as input to the RD encoder. Additionally, we replace the blank-shared RNNT loss by cross-entropy loss along the 1-best forced-alignment path to further improve performance while reducing computational and memory requirements during RD training. Experiments on a public and a private dataset of doctor-patient conversations demonstrate that our method outperforms the best baseline with relative reductions of 6.2% and 4.5% in role-based word diarization error rate (R-WDER), respectively

[740] MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows

Guobin Ma, Jixun Yao, Ziqian Ning, Yuepeng Jiang, Lingxin Xiong, Lei Xie, Pengcheng Zhu

Main category: eess.AS

TL;DR: MeanVC is a lightweight streaming zero-shot voice conversion model that combines AR and NAR paradigms using diffusion transformers with mean flows, achieving high-quality conversion with single-step inference.

Details

Motivation: Growing demand for streaming voice conversion models that are simultaneously fast, lightweight, and high-fidelity, as existing methods either require large parameters or struggle with unseen speaker generalization.

Method: Proposes MeanVC with diffusion transformer using chunk-wise autoregressive denoising strategy, mean flows for velocity field regression, and diffusion adversarial post-training to mitigate over-smoothing.

Result: Significantly outperforms existing zero-shot streaming VC systems with superior conversion quality, higher efficiency, and significantly fewer parameters.

Conclusion: MeanVC successfully addresses the need for efficient streaming voice conversion by combining AR/NAR strengths through diffusion transformers with mean flows, enabling high-quality zero-shot conversion with minimal computational requirements.

Abstract: Zero-shot voice conversion (VC) aims to transfer timbre from a source speaker to any unseen target speaker while preserving linguistic content. Growing application scenarios demand models with streaming inference capabilities. This has created a pressing need for models that are simultaneously fast, lightweight, and high-fidelity. However, existing streaming methods typically rely on either autoregressive (AR) or non-autoregressive (NAR) frameworks, which either require large parameter sizes to achieve strong performance or struggle to generalize to unseen speakers. In this study, we propose MeanVC, a lightweight and streaming zero-shot VC approach. MeanVC introduces a diffusion transformer with a chunk-wise autoregressive denoising strategy, combining the strengths of both AR and NAR paradigms for efficient streaming processing. By introducing mean flows, MeanVC regresses the average velocity field during training, enabling zero-shot VC with superior speech quality and speaker similarity in a single sampling step by directly mapping from the start to the endpoint of the flow trajectory. Additionally, we incorporate diffusion adversarial post-training to mitigate over-smoothing and further enhance speech quality. Experimental results demonstrate that MeanVC significantly outperforms existing zero-shot streaming VC systems, achieving superior conversion quality with higher efficiency and significantly fewer parameters. Audio demos and code are publicly available at https://aslp-lab.github.io/MeanVC.

eess.IV

[741] SLIM: Semantic-based Low-bitrate Image compression for Machines by leveraging diffusion

Hyeonjin Lee, Jun-Hyuk Kim, Jong-Seok Lee

Main category: eess.IV

TL;DR: SLIM is a semantic-based low-bitrate image compression framework for machine vision that uses diffusion models to focus on Region-of-Interest areas without needing guide masks at inference.

Details

Motivation: Current image compression models are optimized for human vision, maintaining excessive perceptual details that aren't necessary for machine vision tasks, leading to suboptimal bitrate reduction when compressing for machine analysis.

Method: Uses a pretrained latent diffusion model where the compressor focuses only on Region-of-Interest areas in the image latent space. A pretrained Unet model then enhances decompressed latents using RoI-focused text captions containing semantic information, enabling denoising-based enhancement without needing guide masks at inference.

Result: SLIM achieves higher classification accuracy compared to conventional image compression models for machines at the same bits per pixel condition, while still maintaining perceptual details for human vision.

Conclusion: SLIM provides an effective training framework for machine vision image compression that achieves better performance at low bitrates by focusing on semantically important regions and leveraging diffusion models for enhancement.

Abstract: In recent years, the demand of image compression models for machine vision has increased dramatically. However, the training frameworks of image compression still focus on the vision of human, maintaining the excessive perceptual details, thus have limitations in optimally reducing the bits per pixel in the case of performing machine vision tasks. In this paper, we propose Semantic-based Low-bitrate Image compression for Machines by leveraging diffusion, termed SLIM. This is a new effective training framework of image compression for machine vision, using a pretrained latent diffusion model.The compressor model of our method focuses only on the Region-of-Interest (RoI) areas for machine vision in the image latent, to compress it compactly. Then the pretrained Unet model enhances the decompressed latent, utilizing a RoI-focused text caption which containing semantic information of the image. Therefore, SLIM is able to focus on RoI areas of the image without any guide mask at the inference stage, achieving low bitrate when compressing. And SLIM is also able to enhance a decompressed latent by denoising steps, so the final reconstructed image from the enhanced latent can be optimized for the machine vision task while still containing perceptual details for human vision. Experimental results show that SLIM achieves a higher classification accuracy in the same bits per pixel condition, compared to conventional image compression models for machines.Code will be released upon acceptance.

[742] PSI3D: Plug-and-Play 3D Stochastic Inference with Slice-wise Latent Diffusion Prior

Wenhan Guo, Jinglun Yu, Yaning Wang, Jin U. Kang, Yu Sun

Main category: eess.IV

TL;DR: PSI3D: A plug-and-play algorithm using 2D latent diffusion models with TV regularization for 3D stochastic inference on large-scale volumetric data (1024×1024×128).

Details

Motivation: Diffusion models are powerful image priors but struggle with large-scale, high-dimensional data due to computational costs. Need efficient methods for massive 3D volumes in scientific imaging applications like OCT super-resolution.

Method: Markov chain Monte Carlo approach reconstructs each 2D slice by sampling from 2D latent diffusion models. Incorporates total variation regularization stochastically along concatenation axis to enhance inter-slice consistency in 3D volumes.

Result: Significantly improves reconstruction quality for large-scale scientific imaging compared to traditional and learning-based baselines. Provides robust and credible reconstructions for optical coherence tomography super-resolution.

Conclusion: PSI3D enables efficient 3D stochastic inference with diffusion priors for massive volumetric data, overcoming computational limitations while maintaining slice consistency and reconstruction quality.

Abstract: Diffusion models are highly expressive image priors for Bayesian inverse problems. However, most diffusion models cannot operate on large-scale, high-dimensional data due to high training and inference costs. In this work, we introduce a Plug-and-play algorithm for 3D stochastic inference with latent diffusion prior (PSI3D) to address massive ($1024\times 1024\times 128$) volumes. Specifically, we formulate a Markov chain Monte Carlo approach to reconstruct each two-dimensional (2D) slice by sampling from a 2D latent diffusion model. To enhance inter-slice consistency, we also incorporate total variation (TV) regularization stochastically along the concatenation axis. We evaluate our performance on optical coherence tomography (OCT) super-resolution. Our method significantly improves reconstruction quality for large-scale scientific imaging compared to traditional and learning-based baselines, while providing robust and credible reconstructions.

[743] Image-to-Image Translation with Generative Adversarial Network for Electrical Resistance Tomography Reconstruction

Wejian Yan

Main category: eess.IV

TL;DR: Pix2Pix GAN improves ECT image reconstruction quality with better quantitative metrics and sharper boundaries.

Details

Motivation: Conventional electrical tomography reconstructions struggle to capture fine details, limiting broader adoption despite advantages like non-invasiveness, safety, and low cost.

Method: Introduces a Pix2Pix generative adversarial network (GAN) for electrical capacitance tomography (ECT) enhancement, using comprehensive simulated and experimental databases with multiple baseline reconstruction algorithms for comparison.

Result: The proposed GAN improves quantitative metrics (SSIM, PSNR, PMSE) and produces high-resolution images with sharp boundaries not constrained by mesh discretization.

Conclusion: Deep learning-based GAN approach successfully enhances ECT image reconstruction quality, addressing limitations of conventional methods for multiphase-flow monitoring.

Abstract: Electrical tomography techniques have been widely employed for multiphase-flow monitoring owing to their non invasive nature, intrinsic safety, and low cost. Nevertheless, conventional reconstructions struggle to capture fine details, which hampers broader adoption. Motivated by recent advances in deep learning, this study introduces a Pix2Pix generative adversarial network (GAN) to enhance image reconstruction in electrical capacitance tomography (ECT). Comprehensive simulated and experimental databases were established and multiple baseline reconstruction algorithms were implemented. The proposed GAN demonstrably improves quantitative metrics such as SSIM, PSNR, and PMSE, while qualitatively producing high resolution images with sharp boundaries that are no longer constrained by mesh discretization.

[744] Selective Phase-Aware Training of nnU-Net for Robust Breast Cancer Segmentation in Multi-Center DCE-MRI

Beyza Zayim, Aissiou Ikram, Boukhiar Naima

Main category: eess.IV

TL;DR: A selective, phase-aware training framework for nnU-Net improves breast tumor segmentation in DCE-MRI by focusing on quality data selection, showing that training on cleaner early-phase images outperforms using artifact-prone datasets.

Details

Motivation: Breast cancer diagnosis lacks standardized benchmarks for DCE-MRI analysis, and current segmentation approaches struggle with heterogeneous clinical data quality and variability across imaging centers.

Method: Modified nnU-Net framework with selective training strategy that analyzes image quality and center-specific variability impact. Used controlled experiments on DUKE, NACT, ISPY1, and ISPY2 datasets with preprocessing techniques like CLAHE, focusing on early phase images (0000-0002).

Result: Training on DUKE and NACT data (clearer contrast, fewer artifacts) with early phase images provided stable performance, while ISPY scans with motion artifacts and reduced contrast impaired segmentation even with advanced preprocessing.

Conclusion: Phase-sensitive and quality-aware training strategies are crucial for reliable segmentation in heterogeneous clinical datasets, highlighting limitations of naive dataset expansion and motivating automated quality-based data selection.

Abstract: Breast cancer remains the most common cancer among women and is a leading cause of female mortality. Dynamic contrast-enhanced MRI (DCE-MRI) is a powerful imaging tool for evaluating breast tumors, yet the field lacks a standardized benchmark for analyzing treatment responses and guiding personalized care. We participated in the MAMA-MIA Challenge’s Primary Tumor Segmentation task and this work presents a proposed selective, phase-aware training framework for the nnU-Net architecture, emphasizing quality-focused data selection to strengthen model robustness and generalization. We employed the No New Net (nnU-Net) framework with a selective training strategy that systematically analyzed the impact of image quality and center-specific variability on segmentation performance. Controlled experiments on the DUKE, NACT, ISPY1, and ISPY2 datasets revealed that including ISPY scans with motion artifacts and reduced contrast impaired segmentation performance, even with advanced preprocessing, such as contrast-limited adaptive histogram equalization (CLAHE). In contrast, training on DUKE and NACT data, which exhibited clearer contrast and fewer motion artifacts despite varying resolutions, with early phase images (0000-0002) provided more stable training conditions. Our results demonstrate the importance of phase-sensitive and quality-aware training strategies in achieving reliable segmentation performance in heterogeneous clinical datasets, highlighting the limitations of the expansion of naive datasets and motivating the need for future automation of quality-based data selection strategies.

[745] ForeSpeed: A real-world video dataset of CCTV cameras with different settings for vehicle speed estimation

Massimo Iuliani, Blake Sawyer, Marco Fontani, David Spreadborough, Martino Jerian

Main category: eess.IV

TL;DR: ForeSpeed: A comprehensive CCTV dataset for evaluating vehicle speed estimation methods in forensic scenarios, featuring 322 videos with known speeds, multiple camera types/perspectives, and various compression settings.

Details

Motivation: Vehicle speed estimation from CCTV footage is crucial for video forensics but current methods' accuracy depends heavily on factors like camera specs, compression, and perspective distortion. There's a need for standardized datasets to properly evaluate and improve forensic speed estimation techniques.

Method: Created ForeSpeed dataset with 322 videos of vehicles traveling at known speeds, captured by 3 digital and 3 analog cameras from two perspectives. Includes real-world road metrics for scene geometry restoration, multiple compression factors/settings to simulate real-world export procedures.

Result: Dataset used to benchmark commercial speed estimation algorithm (Amped FIVE). Method reliably estimates average speed across conditions, but uncertainty range significantly increases with strong perspective distortion. Dataset is publicly available to forensic community.

Conclusion: ForeSpeed provides a standardized benchmark for evaluating forensic speed estimation methods, revealing limitations in current approaches (especially with perspective distortion) and enabling development of more robust solutions for collision investigation and forensic analysis.

Abstract: The need to estimate the speed of road vehicles has become increasingly important in the field of video forensics, particularly with the widespread deployment of CCTV cameras worldwide. Despite the development of various approaches, the accuracy of forensic speed estimation from real-world footage remains highly dependent on several factors, including camera specifications, acquisition methods, spatial and temporal resolution, compression methods, and scene perspective, which can significantly influence performance. In this paper, we introduce ForeSpeed, a comprehensive dataset designed to support the evaluation of speed estimation techniques in real-world scenarios using CCTV footage. The dataset includes recordings of a vehicle traveling at known speeds, captured by three digital and three analog cameras from two distinct perspectives. Real-world road metrics are provided to enable the restoration of the scene geometry. Videos were stored with multiple compression factors and settings, to simulate real world scenarios in which export procedures are not always performed according to forensic standards. Overall, ForeSpeed, includes a collection of 322 videos. As a case study, we employed the ForeSpeed dataset to benchmark a speed estimation algorithm available in a commercial product (Amped FIVE). Results demonstrate that while the method reliably estimates average speed across various conditions, its uncertainty range significantly increases when the scene involves strong perspective distortion. The ForeSpeed dataset is publicly available to the forensic community, with the aim of facilitating the evaluation of current methodologies and inspiring the development of new, robust solutions tailored to collision investigation and forensic incident analysis.

[746] Rethinking Coupled Tensor Analysis for Hyperspectral Super-Resolution: Recoverable Modeling Under Endmember Variability

Meng Ding, Xiao Fu

Main category: eess.IV

TL;DR: Proposes LMN tensor decomposition model for hyperspectral super-resolution, balancing expressiveness and interpretability while handling endmember variability.

Details

Motivation: Existing tensor decomposition methods for hyperspectral super-resolution either lack physical interpretability (CPD, Tucker) or rely on restrictive linear mixture model assumptions (LL1) that don't account for real-world nonlinear effects like endmember variability.

Method: Proposes a more flexible block-term tensor decomposition with rank-(L_r, M_r, N_r) terms (LMN model) that retains interpretability while handling nonlinear effects, subsumes existing models as special cases, and provides theoretical recoverability guarantees.

Result: The LMN model offers balanced tradeoff between expressiveness and interpretability, establishes SRI recoverability under proper conditions, and shows effectiveness and robustness in experiments on synthetic and real datasets compared to existing CTD-based approaches.

Conclusion: The proposed LMN tensor decomposition model provides a flexible, interpretable, and theoretically sound framework for hyperspectral super-resolution that addresses limitations of existing methods while maintaining practical applicability.

Abstract: This work revisits the hyperspectral super-resolution (HSR) problem, i.e., fusing a pair of spatially co-registered hyperspectral (HSI) and multispectral (MSI) images to recover a super-resolution image (SRI) that enhances the spatial resolution of the HSI. Coupled tensor decomposition (CTD)-based methods have gained traction in this domain, offering recoverability guarantees under various assumptions. Existing models such as canonical polyadic decomposition (CPD) and Tucker decomposition provide strong expressive power but lack physical interpretability. The block-term decomposition model with rank-$(L_r, L_r, 1)$ terms (the LL1 model) yields interpretable factors under the linear mixture model (LMM) of spectral images, but LMM assumptions are often violated in practice – primarily due to nonlinear effects such as endmember variability (EV). To address this, we propose modeling spectral images using a more flexible block-term tensor decomposition with rank-$(L_r, M_r, N_r)$ terms (the LMN model). This modeling choice retains interpretability, subsumes CPD, Tucker, and LL1 as special cases, and robustly accounts for non-ideal effects such as EV, offering a balanced tradeoff between expressiveness and interpretability for HSR. Importantly, under the LMN model for HSI and MSI, recoverability of the SRI can still be established under proper conditions – providing strong theoretical support. Extensive experiments on synthetic and real datasets further validate the effectiveness and robustness of the proposed method compared with existing CTD-based approaches.

[747] Patlak Parametric Image Estimation from Dynamic PET Using Diffusion Model Prior

Ziqian Huang, Boxiao Yu, Siqi Li, Savas Ozdemir, Sangjin Bae, Jae Sung Lee, Guobao Wang, Kuang Gong

Main category: eess.IV

TL;DR: A diffusion model-based framework improves parametric PET image quality by using pre-trained priors and kinetic model constraints.

Details

Motivation: Parametric PET images suffer from low quality due to ill-posed kinetic model fitting and limited counts from non-continuous whole-body acquisition, requiring better reconstruction methods.

Method: Proposed diffusion model-based kinetic modeling framework using Patlak model as example; pre-trained score function on static PET images serves as prior; incorporates kinetic model as data-consistency constraint during inference.

Result: Framework evaluated on total-body dynamic PET datasets with different dose levels, demonstrating feasibility and promising performance in improving parametric image quality.

Conclusion: Diffusion model-based approach effectively enhances parametric PET image reconstruction by combining learned priors with kinetic modeling constraints.

Abstract: Dynamic PET enables the quantitative estimation of physiology-related parameters and is widely utilized in research and increasingly adopted in clinical settings. Parametric imaging in dynamic PET requires kinetic modeling to estimate voxel-wise physiological parameters based on specific kinetic models. However, parametric images estimated through kinetic model fitting often suffer from low image quality due to the inherently ill-posed nature of the fitting process and the limited counts resulting from non-continuous data acquisition across multiple bed positions in whole-body PET. In this work, we proposed a diffusion model-based kinetic modeling framework for parametric image estimation, using the Patlak model as an example. The score function of the diffusion model was pre-trained on static total-body PET images and served as a prior for both Patlak slope and intercept images by leveraging their patch-wise similarity. During inference, the kinetic model was incorporated as a data-consistency constraint to guide the parametric image estimation. The proposed framework was evaluated on total-body dynamic PET datasets with different dose levels, demonstrating the feasibility and promising performance of the proposed framework in improving parametric image quality.

[748] High Frequency Matters: Uncertainty Guided Image Compression with Wavelet Diffusion

Juan Song, Jiaxiang He, Lijie Yang, Mingtao Feng, Keyan Wang

Main category: eess.IV

TL;DR: UGDiff is a novel image compression method using wavelet diffusion with uncertainty-guided rate-distortion optimization, achieving better perceptual quality and lower distortion than state-of-the-art methods.

Details

Motivation: Diffusion models have shown success in image generation but struggle to balance perceptual quality and distortion in image compression applications. High frequency components are crucial for image detail reconstruction but are challenging to compress effectively.

Method: Proposes UGDiff with wavelet transform for high frequency compression, wavelet conditional diffusion model for high frequency prediction, residual codec for compressing prediction residuals, and uncertainty-weighted rate-distortion loss for residual compression optimization.

Result: UGDiff surpasses state-of-the-art image compression methods in rate-distortion performance, perceptual quality, subjective quality, and inference time on two benchmark datasets.

Conclusion: The proposed uncertainty-guided wavelet diffusion approach effectively addresses the fidelity issues in diffusion-based compression, providing a rational trade-off between rate and distortion while maintaining high perceptual quality.

Abstract: Diffusion probabilistic models have recently achieved remarkable success in generating high-quality images. However, balancing high perceptual quality and low distortion remains challenging in application of diffusion models in image compression. To address this issue, we propose a novel Uncertainty-Guided image compression approach with wavelet Diffusion (UGDiff). Our approach focuses on high frequency compression via the wavelet transform, since high frequency components are crucial for reconstructing image details. We introduce a wavelet conditional diffusion model for high frequency prediction, followed by a residual codec that compresses and transmits prediction residuals to the decoder. This diffusion prediction-then-residual compression paradigm effectively addresses the low fidelity issue common in direct reconstructions by existing diffusion models. Considering the uncertainty from the random sampling of the diffusion model, we further design an uncertainty-weighted rate-distortion (R-D) loss tailored for residual compression, providing a more rational trade-off between rate and distortion. Comprehensive experiments on two benchmark datasets validate the effectiveness of UGDiff, surpassing state-of-the-art image compression methods in R-D performance, perceptual quality, subjective quality, and inference time. Our code is available at: https://github.com/hejiaxiang1/Wavelet-Diffusion/tree/main.

[749] PIGUIQA: A Physical Imaging Guided Perceptual Framework for Underwater Image Quality Assessment

Weizhi Xian, Mingliang Zhou, Leong Hou U, Zhengguo Li

Main category: eess.IV

TL;DR: PIGUIQA is a physics-guided perceptual framework for underwater image quality assessment that integrates radiative transfer theory with local attention mechanisms and global aggregation to predict quality scores.

Details

Motivation: Underwater image quality assessment needs to account for complex physical distortions (direct transmission attenuation and backward scattering) while considering human perceptual sensitivity to spatial variations in image content.

Method: 1) Formulate UIQA using underwater radiative transfer theory to establish physics-based distortion metrics; 2) Design neighborhood attention mechanism for local perception of subtle features; 3) Employ global perceptual aggregator to integrate holistic scene information with distortion data.

Result: PIGUIQA achieves state-of-the-art performance across multiple benchmarks while maintaining robust cross-dataset generalizability.

Conclusion: The proposed physics-guided perceptual framework effectively addresses underwater image quality assessment by combining physical imaging principles with human perceptual mechanisms, demonstrating superior performance and generalization.

Abstract: In this paper, we propose a Physical Imaging Guided perceptual framework for Underwater Image Quality Assessment (UIQA), termed PIGUIQA. First, we formulate UIQA as a comprehensive problem that considers the combined effects of direct transmission attenuation and backward scattering on image perception. By leveraging underwater radiative transfer theory, we systematically integrate physics-based imaging estimations to establish quantitative metrics for these distortions. Second, recognizing spatial variations in image content significance and human perceptual sensitivity to distortions, we design a module built upon a neighborhood attention mechanism for local perception of images. This module effectively captures subtle features in images, thereby enhancing the adaptive perception of distortions on the basis of local information. Third, by employing a global perceptual aggregator that further integrates holistic image scene with underwater distortion information, the proposed model accurately predicts image quality scores. Extensive experiments across multiple benchmarks demonstrate that PIGUIQA achieves state-of-the-art performance while maintaining robust cross-dataset generalizability. The implementation is publicly available at https://github.com/WeizhiXian/PIGUIQA

[750] Cross-organ all-in-one parallel compressed sensing magnetic resonance imaging

Baoshun Shi, Xin Meng, Shuangni Lv, Zheng Liu, Yan Yang

Main category: eess.IV

TL;DR: CAPNet is a unified deep unfolding network for cross-organ parallel compressed sensing MRI that handles varying sampling ratios and anatomical variations using artifact generators and SAM-based structural prompts.

Details

Motivation: Current p-CSMRI methods require training separate DNNs for each organ due to anatomical variations, creating barriers to developing generalized medical image reconstruction systems. There's a need for a unified framework that can handle cross-organ reconstruction with varying sampling ratios.

Method: Proposes CAPNet with three specialized modules: auxiliary variable module, prior module, and data consistency module. Introduces an artifact generator to handle varying sampling ratios, and an organ structure-prompt generation submodule using SAM to extract structural features for cross-organ prompts, integrated via an organ structure-aware Mamba submodule.

Result: Comprehensive evaluations on cross-organ dataset confirm CAPNet achieves state-of-the-art reconstruction performance across multiple anatomical structures using a single unified model.

Conclusion: CAPNet provides a unified solution for cross-organ p-CSMRI reconstruction that eliminates the need for organ-specific models while maintaining high reconstruction quality across different anatomical structures.

Abstract: Recent advances in deep learning-based parallel compressed sensing magnetic resonance imaging (p-CSMRI) have significantly improved reconstruction quality. However, current p-CSMRI methods often require training separate deep neural network (DNN) for each organ due to anatomical variations, creating a barrier to developing generalized medical image reconstruction systems. To address this, we propose CAPNet (cross-organ all-in-one deep unfolding p-CSMRI network), a unified framework that implements a p-CSMRI iterative algorithm via three specialized modules: auxiliary variable module, prior module, and data consistency module. Recognizing that p-CSMRI systems often employ varying sampling ratios for different organs, resulting in organ-specific artifact patterns, we introduce an artifact generator, which extracts and integrates artifact features into the data consistency module to enhance the discriminative ability of the overall network. For the prior module, we design an organ structure-prompt generation submodule that leverages structural features extracted from the segment anything model (SAM) to create cross-organ prompts. These prompts are strategically incorporated into the prior module through an organ structure-aware Mamba submodule. Comprehensive evaluations on a cross-organ dataset confirm that CAPNet achieves state-of-the-art reconstruction performance across multiple anatomical structures using a single unified model. Our code will be published at https://github.com/shibaoshun/CAPNet.

[751] Reproducing and Improving CheXNet: Deep Learning for Chest X-ray Disease Classification

Daniel J. Strick, Carlos Garcia, Anthony Huang, Thomas Gardos

Main category: eess.IV

TL;DR: Researchers reproduced CheXNet and developed improved algorithms for classifying 14 diseases from chest X-rays, achieving 0.85 AUC-ROC and 0.39 F1 score on the NIH ChestX-ray14 dataset.

Details

Motivation: Deep learning for radiologic image analysis is rapidly growing and likely to become standard practice in medicine. The study aims to advance automated disease detection from chest X-rays beyond existing baseline models like CheXNet.

Method: Reproduced CheXNet algorithm and explored other algorithms on the NIH ChestX-ray14 dataset containing X-ray images with 14 disease classifications. Used F1 score and AUC-ROC as primary evaluation metrics for imbalanced, multi-label classification tasks.

Result: The best model achieved an average AUC-ROC score of 0.85 and an average F1 score of 0.39 across all 14 disease classifications, outperforming CheXNet’s baseline metrics.

Conclusion: The research demonstrates successful improvement over existing baseline models for automated chest X-ray analysis, contributing to the advancement of deep learning applications in medical imaging diagnostics.

Abstract: Deep learning for radiologic image analysis is a rapidly growing field in biomedical research and is likely to become a standard practice in modern medicine. On the publicly available NIH ChestX-ray14 dataset, containing X-ray images that are classified by the presence or absence of 14 different diseases, we reproduced an algorithm known as CheXNet, as well as explored other algorithms that outperform CheXNet’s baseline metrics. Model performance was primarily evaluated using the F1 score and AUC-ROC, both of which are critical metrics for imbalanced, multi-label classification tasks in medical imaging. The best model achieved an average AUC-ROC score of 0.85 and an average F1 score of 0.39 across all 14 disease classifications present in the dataset.

Jinzhi Wang, Jiangbo Zhang, Chenzhan Yu, Zhigang Xiu, Duwei Dai, Ziyu xu, Ningyong Wu, Wenhong Zhao

Main category: eess.IV

TL;DR: OrthoInsight is a multi-modal AI framework that combines YOLOv9 for rib fracture detection in CT scans, medical knowledge graphs for clinical context, and fine-tuned LLaVA for automated diagnostic report generation, achieving superior performance over existing models.

Details

Motivation: The growing volume of medical imaging data creates need for automated diagnostic tools, as manual interpretation of CT scans for rib fractures is time-consuming and error-prone.

Method: Multi-modal deep learning framework integrating: 1) YOLOv9 model for fracture detection in CT images, 2) medical knowledge graph for retrieving clinical context, and 3) fine-tuned LLaVA language model for generating diagnostic reports by combining visual features with expert textual data.

Result: Evaluated on 28,675 annotated CT images and expert reports, achieved high performance across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value with average score of 4.28, outperforming models like GPT-4 and Claude-3.

Conclusion: Demonstrates potential of multi-modal learning in transforming medical image analysis and providing effective support for radiologists in clinical practice.

Abstract: The growing volume of medical imaging data has increased the need for automated diagnostic tools, especially for musculoskeletal injuries like rib fractures, commonly detected via CT scans. Manual interpretation is time-consuming and error-prone. We propose OrthoInsight, a multi-modal deep learning framework for rib fracture diagnosis and report generation. It integrates a YOLOv9 model for fracture detection, a medical knowledge graph for retrieving clinical context, and a fine-tuned LLaVA language model for generating diagnostic reports. OrthoInsight combines visual features from CT images with expert textual data to deliver clinically useful outputs. Evaluated on 28,675 annotated CT images and expert reports, it achieves high performance across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value, with an average score of 4.28, outperforming models like GPT-4 and Claude-3. This study demonstrates the potential of multi-modal learning in transforming medical image analysis and providing effective support for radiologists.

[753] Deep Equilibrium Convolutional Sparse Coding for Hyperspectral Image Denoising

Jin Ye, Jingran Wang, Fengchao Xiong, Jingzhou Chen, Yuntao Qian

Main category: eess.IV

TL;DR: DECSC: A Deep Equilibrium Convolutional Sparse Coding framework for hyperspectral image denoising that unifies local spatial-spectral correlations, nonlocal self-similarities, and global spatial consistency with convergence guarantees.

Details

Motivation: Hyperspectral images are often degraded by complex noise patterns, and existing deep unfolding methods lack convergence guarantees despite ensuring physical properties. Deep Equilibrium models offer infinite-depth networks naturally consistent with optimization, providing a better framework for robust HSI denoising.

Method: Proposes DECSC framework combining 2D convolutional sparse coding for global spatial consistency across bands, 3D convolutional sparse coding for local spatial-spectral details, transformer blocks for nonlocal self-similarities, and detail enhancement modules. Formulates proximal gradient descent of CSC model as fixed-point problem and transforms iterative updates into learnable DEQ architecture.

Result: Experimental results demonstrate superior denoising performance compared to state-of-the-art methods.

Conclusion: DECSC effectively unifies multiple HSI characteristics within a DEQ framework, achieving robust denoising with convergence guarantees while outperforming existing methods.

Abstract: Hyperspectral images (HSIs) play a crucial role in remote sensing but are often degraded by complex noise patterns. Ensuring the physical property of the denoised HSIs is vital for robust HSI denoising, giving the rise of deep unfolding-based methods. However, these methods map the optimization of a physical model to a learnable network with a predefined depth, which lacks convergence guarantees. In contrast, Deep Equilibrium (DEQ) models treat the hidden layers of deep networks as the solution to a fixed-point problem and models them as infinite-depth networks, naturally consistent with the optimization. Under the framework of DEQ, we propose a Deep Equilibrium Convolutional Sparse Coding (DECSC) framework that unifies local spatial-spectral correlations, nonlocal spatial self-similarities, and global spatial consistency for robust HSI denoising. Within the convolutional sparse coding (CSC) framework, we enforce shared 2D convolutional sparse representation to ensure global spatial consistency across bands, while unshared 3D convolutional sparse representation captures local spatial-spectral details. To further exploit nonlocal self-similarities, a transformer block is embedded after the 2D CSC. Additionally, a detail enhancement module is integrated with the 3D CSC to promote image detail preservation. We formulate the proximal gradient descent of the CSC model as a fixed-point problem and transform the iterative updates into a learnable network architecture within the framework of DEQ. Experimental results demonstrate that our DECSC method achieves superior denoising performance compared to state-of-the-art methods.

[754] Understanding Benefits and Pitfalls of Current Methods for the Segmentation of Undersampled MRI Data

Jan Nikolas Morshuis, Matthias Hein, Christian F. Baumgartner

Main category: eess.IV

TL;DR: First unified benchmark comparing 7 approaches for segmenting accelerated MRI data, finding simple two-stage methods with data-consistency outperform specialized one-stage methods.

Details

Motivation: MRI acquisition is time-consuming, leading to patient discomfort and costs. While accelerated MRI methods exist, optimal segmentation strategies for undersampled data remain unknown due to isolated development, separate datasets, and lack of unified evaluation standards.

Method: Created first unified benchmark comparing 7 approaches for segmenting undersampled MRI data, focusing on comparing one-stage (combined reconstruction+segmentation) vs two-stage (reconstruction followed by segmentation) methods. Tested on two MRI datasets with multi-coil k-space data and human-annotated segmentation ground-truth.

Result: Simple two-stage methods that incorporate data-consistency achieved the best segmentation scores, outperforming complex specialized methods specifically developed for this task.

Conclusion: For segmenting accelerated MRI data, straightforward two-stage approaches with data-consistency are more effective than complex one-stage methods, providing practical guidance for clinical applications.

Abstract: MR imaging is a valuable diagnostic tool allowing to non-invasively visualize patient anatomy and pathology with high soft-tissue contrast. However, MRI acquisition is typically time-consuming, leading to patient discomfort and increased costs to the healthcare system. Recent years have seen substantial research effort into the development of methods that allow for accelerated MRI acquisition while still obtaining a reconstruction that appears similar to the fully-sampled MR image. However, for many applications a perfectly reconstructed MR image may not be necessary, particularly, when the primary goal is a downstream task such as segmentation. This has led to growing interest in methods that aim to perform segmentation directly on accelerated MRI data. Despite recent advances, existing methods have largely been developed in isolation, without direct comparison to one another, often using separate or private datasets, and lacking unified evaluation standards. To date, no high-quality, comprehensive comparison of these methods exists, and the optimal strategy for segmenting accelerated MR data remains unknown. This paper provides the first unified benchmark for the segmentation of undersampled MRI data comparing 7 approaches. A particular focus is placed on comparing \textit{one-stage approaches}, that combine reconstruction and segmentation into a unified model, with \textit{two-stage approaches}, that utilize established MRI reconstruction methods followed by a segmentation network. We test these methods on two MRI datasets that include multi-coil k-space data as well as a human-annotated segmentation ground-truth. We find that simple two-stage methods that consider data-consistency lead to the best segmentation scores, surpassing complex specialized methods that are developed specifically for this task.

[755] Weakly Supervised Segmentation and Classification of Alpha-Synuclein Aggregates in Brightfield Midbrain Images

Erwan Dereure, Robin Louiset, Laura Parkkinen, David A Menassa, David Holcman

Main category: eess.IV

TL;DR: Automated deep learning pipeline for segmenting and classifying alpha-synuclein aggregates in Parkinson’s disease histopathology images using weakly supervised segmentation and ResNet50 classifier.

Details

Motivation: Parkinson's disease involves misfolded alpha-synuclein aggregates (Lewy bodies/neurites) that need automated analysis for better understanding spatial organization and relationships with surrounding cells.

Method: Developed automated image processing pipeline using weakly supervised segmentation robust to immunohistochemical labeling variability, with ResNet50 classifier to differentiate aggregate morphologies in whole-slide images.

Result: Achieved 80% balanced accuracy in differentiating major aggregate morphologies including Lewy bodies and neurites.

Conclusion: Framework enables large-scale characterization of alpha-synuclein aggregate spatial distribution and heterogeneity, and investigation of relationships with surrounding cells like microglia and astrocytes.

Abstract: Parkinson’s disease (PD) is a neurodegenerative disorder associated with the accumulation of misfolded alpha-synuclein aggregates, forming Lewy bodies and neuritic shape used for pathology diagnostics. Automatic analysis of immunohistochemistry histopathological images with Deep Learning provides a promising tool for better understanding the spatial organization of these aggregates. In this study, we develop an automated image processing pipeline to segment and classify these aggregates in whole-slide images (WSIs) of midbrain tissue from PD and incidental Lewy Body Disease (iLBD) cases based on weakly supervised segmentation, robust to immunohistochemical labelling variability, with a ResNet50 classifier. Our approach allows to differentiate between major aggregate morphologies, including Lewy bodies and neurites with a balanced accuracy of $80%$. This framework paves the way for large-scale characterization of the spatial distribution and heterogeneity of alpha-synuclein aggregates in brightfield immunohistochemical tissue, and for investigating their poorly understood relationships with surrounding cells such as microglia and astrocytes.

[756] Two-Dimensional Tomographic Reconstruction From Projections With Unknown Angles and Unknown Spatial Shifts

Shreyas Jayant Grampurohit, Satish Mulleti, Ajit Rajwade

Main category: eess.IV

TL;DR: A novel method for 2D tomography that simultaneously estimates unknown projection angles and spatial shifts, improving reconstruction quality compared to shift-neglecting baselines.

Details

Motivation: In industrial and biomedical CT applications, projection geometry (angles and spatial shifts) is often unknown or partially known. Existing 2D unknown view tomography (UVT) algorithms typically assume centered projections without spatial shifts, which limits their practical applicability.

Method: 1) Modified an existing graph Laplacian-based 2D UVT algorithm to incorporate spatial shifts as initialization. 2) Proposed a three-way alternating minimization algorithm that jointly estimates the 2D structure, projection angles, and corresponding spatial shifts.

Result: The method was evaluated on noisy projections of ribosome images and demonstrated superior reconstruction quality compared to baseline methods that neglect spatial shifts.

Conclusion: The proposed approach successfully addresses geometric ambiguities in 2D tomography by jointly estimating unknown viewing angles and spatial shifts, providing more accurate reconstructions for practical applications where projection geometry is uncertain.

Abstract: In parallel beam computed tomography (CT), an object is reconstructed from a series of projections taken at different angles. However, in some industrial and biomedical imaging applications, the projection geometry is unknown, completely or partially. In this paper, we present a technique for two-dimensional (2D) tomography in which both viewing angles and spatial shifts associated with the projections are unknown. There exists literature on 2D unknown view tomography (UVT), but most existing 2D UVT algorithms assume that the projections are centered; that is, there are no spatial shifts in the projections. To tackle these geometric ambiguities, we first modify an existing graph Laplacian-based algorithm for 2D UVT to incorporate spatial shifts, and then use it as the initialization for the proposed three-way alternating minimization algorithm that jointly estimates the 2D structure, its projection angles, and the corresponding shifts. We evaluate our method on noisy projections of ribosome images and demonstrate that it achieves superior reconstruction compared to the baseline that neglects shifts.

Today’s Research Highlights

Table of Contents

cs.CL

[1] Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models

[2] Graph-O1 : Monte Carlo Tree Search with Reinforcement Learning for Text-Attributed Graph Reasoning

[3] Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression

[4] Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset

[5] Learning to Prioritize IT Tickets: A Comparative Evaluation of Embedding-based Approaches and Fine-Tuned Transformer Models

[6] KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction

[7] Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression

[8] ReGal: A First Look at PPO-based Legal AI for Judgment Prediction and Summarization in India

[9] CoPE: A Small Language Model for Steerable and Scalable Content Labeling

[10] Narrative Consolidation: Formulating a New Task for Unifying Multi-Perspective Accounts

[11] MauBERT: Universal Phonetic Inductive Biases for Few-Shot Acoustic Units Discovery

[12] Statistical laws and linguistics inform meaning in naturalistic and fictional conversation

[13] Training LLMs with LogicReward for Faithful and Rigorous Reasoning

[14] GeoSense-AI: Fast Location Inference from Crisis Microblogs

[15] InstructNet: A Novel Approach for Multi-Label Instruction Classification through Advanced Deep Learning

[16] MemEvolve: Meta-Evolution of Agent Memory Systems

[17] CTTA-T: Continual Test-Time Adaptation for Text Understanding via Teacher-Student with a Domain-aware and Generalized Teacher

[18] LIR$^3$AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation

[19] Towards Efficient Agents: A Co-Design of Inference Architecture and System

[20] LLM-based Few-Shot Early Rumor Detection with Imitation Agent

[21] DACE For Railway Acronym Disambiguation

[22] LLM Agents Implement an NLG System from Scratch: Building Interpretable Rule-Based RDF-to-Text Generators

[23] SRS-Stories: Vocabulary-constrained multilingual story generation for language learning

[24] AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

[25] An Agentic AI Framework for Training General Practitioner Student Skills

[26] Mitigating Spurious Correlations in NLI via LLM-Synthesized Counterfactuals and Dynamic Balanced Sampling

[27] Research on a hybrid LSTM-CNN-Attention model for text-based web content classification

[28] Teaching and Critiquing Conceptualization and Operationalization in NLP

[29] Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset

[30] LLMs on Drugs: Language Models Are Few-Shot Consumers

[31] Neologism Learning as a Parameter-Efficient Alternative to Fine-Tuning for Model Steering

[32] From Scratch to Fine-Tuned: A Comparative Study of Transformer Training Strategies for Legal Machine Translation

[33] On Finding Inconsistencies in Documents

[34] A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts

[35] LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction

[36] Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital

[37] Solver-Independent Automated Problem Formulation via LLMs for High-Cost Simulation-Driven Design

[38] From Natural Language to Control Signals: A Conceptual Framework for Semantic Channel Finding in Complex Experimental Infrastructure

[39] From Word to World: Can Large Language Models be Implicit Text-based World Models?

[40] AraMix: Recycling, Refiltering, and Deduplicating to Deliver the Largest Arabic Pretraining Corpus

[41] MDToC: Metacognitive Dynamic Tree of Concepts for Boosting Mathematical Problem-Solving of Large Language Models

[42] Toward Human-Centered AI-Assisted Terminology Work

[43] Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

[44] Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations

[45] FASTRIC: Prompt Specification Language for Verifiable LLM Interactions

[46] Evaluating the Challenges of LLMs in Real-world Medical Follow-up: A Comparative Study and An Optimized Framework

[47] Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models

[48] DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation

[49] A Large Language Model Based Method for Complex Logical Reasoning over Knowledge Graphs

[50] Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?

[51] SAP: Syntactic Attention Pruning for Transformer-based Language Models

[52] AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards

[53] QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation

[54] From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs

[55] JEPA-Reasoner: Decoupling Latent Reasoning from Token Generation

[56] CycleChart: A Unified Consistency-Based Learning Framework for Bidirectional Chart Understanding and Generation

[57] Identifying Features Associated with Bias Against 93 Stigmatized Groups in Language Models and Guardrail Model Safety Mitigation

[58] ChemATP: A Training-Free Chemical Reasoning Framework for Large Language Models

[59] Auto-Prompting with Retrieval Guidance for Frame Detection in Logistics

[60] CienaLLM: Generative Climate-Impact Extraction from News Articles with Autoregressive LLMs

[61] HATS: High-Accuracy Triple-Set Watermarking for Large Language Models

[62] Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara

[63] CodeSimpleQA: Scaling Factuality in Code Large Language Models

[64] MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments

[65] SiamGPT: Quality-First Fine-Tuning for Stable Thai Text Generation

[66] Activations as Features: Probing LLMs for Generalizable Essay Scoring Representations

[67] A Large-Language-Model Framework for Automated Humanitarian Situation Reporting

[68] Event Extraction in Large Language Model

[69] Algerian Dialect

[70] Increasing the Thinking Budget is Not All You Need

[71] Exploring the features used for summary evaluation by Human and GPT

[72] Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori

[73] Exploring Zero-Shot ACSA with Unified Meaning Representation in Chain-of-Thought Prompting

[74] GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators

[75] Towards a resource for multilingual lexicons: an MT assisted and human-in-the-loop multilingual parallel corpus with multi-word expression annotation

[76] Structured Language Generation Model: Loss Calibration and Formatted Decoding for Robust Structure Prediction and Knowledge Retrieval

[77] Label Words as Local Task Vectors in In-Context Learning