Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 42]
cs.CV [Total: 107]
cs.AI [Total: 44]
cs.SD [Total: 6]
cs.LG [Total: 133]
cs.MA [Total: 2]
cs.MM [Total: 2]
eess.AS [Total: 9]
eess.IV [Total: 5]

cs.CL

[1] HARMON-E: Hierarchical Agentic Reasoning for Multimodal Oncology Notes to Extract Structured Data

Shashi Kant Gupta, Arijeet Pramanik, Jerrin John Thomas, Regina Schwind, Lauren Wiener, Avi Raju, Jeremy Kornbluth, Yanshan Wang, Zhaohui Su, Hrituraj Singh

Main category: cs.CL

TL;DR: LLM-based agentic framework achieves high accuracy (F1=0.93) for extracting structured oncology data from 400K+ unstructured EHR notes, reducing manual annotation costs.

Details

Motivation: Unstructured EHR notes contain vital oncology information but are challenging to extract due to variability, specialized terminology, and inconsistent formats. Manual abstraction is costly, while existing automated approaches are narrow in scope and don't handle patient-level synthesis across contradictory documents.

Method: Proposed an agentic framework using LLMs as reasoning agents with context-sensitive retrieval and iterative synthesis capabilities. Systematically decomposes complex oncology data extraction into modular, adaptive tasks to extract structured clinical variables from real-world oncology notes.

Result: Achieved average F1-score of 0.93 on 400,000+ unstructured notes from 2,250 cancer patients. 100 out of 103 oncology-specific clinical variables exceeded 0.85 F1, with critical variables (biomarkers, medications) surpassing 0.95. Integration into workflow resulted in 0.94 manual approval rate, significantly reducing annotation costs.

Conclusion: This represents the first exhaustive, end-to-end application of LLM-based agents for structured oncology data extraction at scale, demonstrating high accuracy and cost-effectiveness for real-world clinical data curation.

Abstract: Unstructured notes within the electronic health record (EHR) contain rich clinical information vital for cancer treatment decision making and research, yet reliably extracting structured oncology data remains challenging due to extensive variability, specialized terminology, and inconsistent document formats. Manual abstraction, although accurate, is prohibitively costly and unscalable. Existing automated approaches typically address narrow scenarios - either using synthetic datasets, restricting focus to document-level extraction, or isolating specific clinical variables (e.g., staging, biomarkers, histology) - and do not adequately handle patient-level synthesis across the large number of clinical documents containing contradictory information. In this study, we propose an agentic framework that systematically decomposes complex oncology data extraction into modular, adaptive tasks. Specifically, we use large language models (LLMs) as reasoning agents, equipped with context-sensitive retrieval and iterative synthesis capabilities, to exhaustively and comprehensively extract structured clinical variables from real-world oncology notes. Evaluated on a large-scale dataset of over 400,000 unstructured clinical notes and scanned PDF reports spanning 2,250 cancer patients, our method achieves an average F1-score of 0.93, with 100 out of 103 oncology-specific clinical variables exceeding 0.85, and critical variables (e.g., biomarkers and medications) surpassing 0.95. Moreover, integration of the agentic system into a data curation workflow resulted in 0.94 direct manual approval rate, significantly reducing annotation costs. To our knowledge, this constitutes the first exhaustive, end-to-end application of LLM-based agents for structured oncology data extraction at scale

[2] How well do Large Language Models Recognize Instructional Moves? Establishing Baselines for Foundation Models in Educational Discourse

Kirk Vanacore, Rene F. Kizilcec

Main category: cs.CL

TL;DR: LLMs show moderate baseline performance for classifying instructional moves in classroom transcripts, with few-shot prompting significantly improving results but not eliminating reliability issues.

Details

Motivation: As LLMs become widely adopted in educational technologies, understanding their out-of-the-box capabilities for interpreting authentic educational scenarios is crucial for setting realistic expectations and benchmarking performance.

Method: Compared six LLMs on classifying instructional moves in authentic classroom transcripts using typical prompting methods: zero-shot, one-shot, and few-shot prompting, evaluated against expert-coded annotations.

Result: Zero-shot performance was moderate, but few-shot prompting significantly improved performance for state-of-the-art models (best configuration reached Cohen’s Kappa = 0.58). Performance varied by instructional move type, and higher recall often came with increased false positives.

Conclusion: Foundation models demonstrate meaningful yet limited capacity to interpret instructional discourse. Prompt design helps surface capabilities but doesn’t eliminate fundamental reliability constraints for educational applications.

Abstract: Large language models (LLMs) are increasingly adopted in educational technologies for a variety of tasks, from generating instructional materials and assisting with assessment design to tutoring. While prior work has investigated how models can be adapted or optimized for specific tasks, far less is known about how well LLMs perform at interpreting authentic educational scenarios without significant customization. As LLM-based systems become widely adopted by learners and educators in everyday academic contexts, understanding their out-of-the-box capabilities is increasingly important for setting expectations and benchmarking. We compared six LLMs to estimate their baseline performance on a simple but important task: classifying instructional moves in authentic classroom transcripts. We evaluated typical prompting methods: zero-shot, one-shot, and few-shot prompting. We found that while zero-shot performance was moderate, providing comprehensive examples (few-shot prompting) significantly improved performance for state-of-the-art models, with the strongest configuration reaching Cohen’s Kappa = 0.58 against expert-coded annotations. At the same time, improvements were neither uniform nor complete: performance varied considerably by instructional move, and higher recall frequently came at the cost of increased false positives. Overall, these findings indicate that foundation models demonstrate meaningful yet limited capacity to interpret instructional discourse, with prompt design helping to surface capability but not eliminating fundamental reliability constraints.

[3] Counterfactual LLM-based Framework for Measuring Rhetorical Style

Jingyi Qiu, Hong Chen, Zongyi Li

Main category: cs.CL

TL;DR: LLM-based framework quantifies rhetorical style in ML papers, showing visionary framing predicts attention and post-2023 rise driven by LLM writing assistance.

Details

Motivation: Need to distinguish between substantive content and rhetorical style in ML papers, as bold language could reflect either strong results or hype, but existing methods struggle to separate them.

Method: Counterfactual LLM framework: multiple LLM personas generate alternative writings from same content, LLM judge compares them via pairwise evaluations, aggregated using Bradley-Terry model. Applied to 8,485 ICLR submissions (2017-2025), generating 250k+ counterfactuals.

Result: Visionary framing significantly predicts downstream attention (citations, media) even controlling for peer-review. Sharp rise in rhetorical strength after 2023, driven by LLM writing assistance adoption. Framework validated by robustness to personas and high LLM-human correlation.

Conclusion: LLMs can serve as instruments to measure and improve scientific evaluation by quantifying rhetorical style independently of content.

Abstract: The rise of AI has fueled growing concerns about ``hype’’ in machine learning papers, yet a reliable way to quantify rhetorical style independently of substantive content has remained elusive. Because bold language can stem from either strong empirical results or mere rhetorical style, it is often difficult to distinguish between the two. To disentangle rhetorical style from substantive content, we introduce a counterfactual, LLM-based framework: multiple LLM rhetorical personas generate counterfactual writings from the same substantive content, an LLM judge compares them through pairwise evaluations, and the outcomes are aggregated using a Bradley–Terry model. Applying this method to 8,485 ICLR submissions sampled from 2017 to 2025, we generate more than 250,000 counterfactual writings and provide a large-scale quantification of rhetorical style in ML papers. We find that visionary framing significantly predicts downstream attention, including citations and media attention, even after controlling for peer-review evaluations. We also observe a sharp rise in rhetorical strength after 2023, and provide empirical evidence showing that this increase is largely driven by the adoption of LLM-based writing assistance. The reliability of our framework is validated by its robustness to the choice of personas and the high correlation between LLM judgments and human annotations. Our work demonstrates that LLMs can serve as instruments to measure and improve scientific evaluation.

Zhixiang Lu, Xueyuan Deng, Yiran Liu, Yulong Li, Qiang Yan, Imran Razzak, Jionglong Su

Main category: cs.CL

TL;DR: PRISM is a hybrid agent-based model that integrates MBTI personality types with MLLM agents to better simulate online polarization by capturing psychological heterogeneity and cognitive biases.

Details

Motivation: Traditional ABMs fail to capture psychological heterogeneity driving online polarization due to simplistic homogeneity assumptions, obscuring the interplay between individual cognitive biases and information propagation.

Method: PRISM combines stochastic differential equations for continuous emotional evolution with personality-conditional partially observable Markov decision processes for discrete decision-making, using MBTI-based cognitive policies for MLLM agents initialized from social media data.

Result: PRISM achieves superior personality consistency aligned with human ground truth, significantly outperforming standard homogeneous and Big Five benchmarks, and effectively replicates emergent phenomena like rational suppression and affective resonance.

Conclusion: PRISM offers a robust framework for analyzing complex social media ecosystems by better capturing psychological heterogeneity and cognitive biases in opinion dynamics.

Abstract: Traditional agent-based models (ABMs) of opinion dynamics often fail to capture the psychological heterogeneity driving online polarization due to simplistic homogeneity assumptions. This limitation obscures the critical interplay between individual cognitive biases and information propagation, thereby hindering a mechanistic understanding of how ideological divides are amplified. To address this challenge, we introduce the Personality-Refracted Intelligent Simulation Model (PRISM), a hybrid framework coupling stochastic differential equations (SDE) for continuous emotional evolution with a personality-conditional partially observable Markov decision process (PC-POMDP) for discrete decision-making. In contrast to continuous trait approaches, PRISM assigns distinct Myers-Briggs Type Indicator (MBTI) based cognitive policies to multimodal large language model (MLLM) agents, initialized via data-driven priors from large-scale social media datasets. PRISM achieves superior personality consistency aligned with human ground truth, significantly outperforming standard homogeneous and Big Five benchmarks. This framework effectively replicates emergent phenomena such as rational suppression and affective resonance, offering a robust tool for analyzing complex social media ecosystems.

[5] Bias Beneath the Tone: Empirical Characterisation of Tone Bias in LLM-Driven UX Systems

Heet Bodara, Md Masum Mushfiq, Isma Farah Siddiqui

Main category: cs.CL

TL;DR: This paper investigates tone bias in LLM-based conversational systems, showing that even neutral prompts produce systematic tonal skew, and develops ensemble classifiers to detect these biases with high accuracy.

Details

Motivation: LLMs in conversational systems often exhibit subtle tone biases (overly polite, cheerful, cautious) even when neutrality is expected, which can influence user perceptions of trust, empathy, and fairness in dialogue.

Method: Created two synthetic dialogue datasets (neutral prompts vs. explicitly guided positive/negative tones), used weak supervision with pretrained DistilBERT for tone labeling, and trained multiple classifiers including ensemble models to detect tone patterns.

Result: Even neutral prompts showed consistent tonal skew, suggesting bias stems from underlying conversational style. Ensemble models achieved macro F1 scores up to 0.92, demonstrating tone bias is systematic and measurable.

Conclusion: Tone bias is a hidden behavioral trait of LLMs that is systematic, measurable, and relevant for designing fair and trustworthy conversational AI systems.

Abstract: Large Language Models are increasingly used in conversational systems such as digital personal assistants, shaping how people interact with technology through language. While their responses often sound fluent and natural, they can also carry subtle tone biases such as sounding overly polite, cheerful, or cautious even when neutrality is expected. These tendencies can influence how users perceive trust, empathy, and fairness in dialogue. In this study, we explore tone bias as a hidden behavioral trait of large language models. The novelty of this research lies in the integration of controllable large language model based dialogue synthesis with tone classification models, enabling robust and ethical emotion recognition in personal assistant interactions. We created two synthetic dialogue datasets, one generated from neutral prompts and another explicitly guided to produce positive or negative tones. Surprisingly, even the neutral set showed consistent tonal skew, suggesting that bias may stem from the model’s underlying conversational style. Using weak supervision through a pretrained DistilBERT model, we labeled tones and trained several classifiers to detect these patterns. Ensemble models achieved macro F1 scores up to 0.92, showing that tone bias is systematic, measurable, and relevant to designing fair and trustworthy conversational AI.

[6] SlideTailor: Personalized Presentation Slide Generation for Scientific Papers

Wenzheng Zeng, Mingyu Ouyang, Langyuan Cui, Hwee Tou Ng

Main category: cs.CL

TL;DR: SlideTailor: A user-aligned framework for automatic slide generation that learns preferences from example paper-slide pairs and visual templates, enabling customized presentations without detailed textual specifications.

Details

Motivation: Existing automatic slide generation systems produce suboptimal results because they don't account for individual user preferences, which vary significantly. Current approaches are under-specified and fail to align with specific user needs.

Method: Proposes SlideTailor, a human behavior-inspired agentic framework that progressively generates editable slides. It learns user preferences from natural inputs: a paper-slides example pair and a visual template, which implicitly encode content and visual style preferences. Introduces a chain-of-speech mechanism to align slide content with planned oral narration.

Result: The framework effectively distills and generalizes user preferences from implicit, unlabeled inputs. The chain-of-speech mechanism significantly enhances slide quality and enables downstream applications like video presentations. A benchmark dataset with interpretable metrics supports evaluation.

Conclusion: SlideTailor successfully addresses the challenge of user-aligned slide generation by learning from easy-to-provide artifacts, demonstrating effectiveness through extensive experiments and enabling more personalized presentation creation.

Abstract: Automatic presentation slide generation can greatly streamline content creation. However, since preferences of each user may vary, existing under-specified formulations often lead to suboptimal results that fail to align with individual user needs. We introduce a novel task that conditions paper-to-slides generation on user-specified preferences. We propose a human behavior-inspired agentic framework, SlideTailor, that progressively generates editable slides in a user-aligned manner. Instead of requiring users to write their preferences in detailed textual form, our system only asks for a paper-slides example pair and a visual template - natural and easy-to-provide artifacts that implicitly encode rich user preferences across content and visual style. Despite the implicit and unlabeled nature of these inputs, our framework effectively distills and generalizes the preferences to guide customized slide generation. We also introduce a novel chain-of-speech mechanism to align slide content with planned oral narration. Such a design significantly enhances the quality of generated slides and enables downstream applications like video presentations. To support this new task, we construct a benchmark dataset that captures diverse user preferences, with carefully designed interpretable metrics for robust evaluation. Extensive experiments demonstrate the effectiveness of our framework.

[7] Schoenfeld’s Anatomy of Mathematical Reasoning by Language Models

Ming Li, Chenrui Fan, Yize Cheng, Soheil Feizi, Tianyi Zhou

Main category: cs.CL

TL;DR: ThinkARM framework uses Schoenfeld’s Episode Theory to abstract LLM reasoning traces into functional steps (Analysis, Explore, Implement, Verify), revealing thinking dynamics and structural differences in mathematical problem solving.

Details

Motivation: Current analysis of LLM reasoning traces focuses on surface-level statistics, making it difficult to identify underlying cognitive structure and steps. There's a need for better tools to understand how reasoning is structured in language models.

Method: Introduce ThinkARM framework that abstracts reasoning traces into functional reasoning steps based on Schoenfeld’s Episode Theory. Apply this to mathematical problem solving across diverse models to analyze thinking dynamics and structural differences.

Result: Reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models not apparent from token-level views. Exploration functions as critical branching step associated with correctness. Efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses.

Conclusion: Episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models, providing deeper insights beyond surface-level statistics.

Abstract: Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld’s Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.

[8] Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents

Yiming Du, Baojun Wang, Yifan Xiang, Zhaowei Wang, Wenyu Huang, Boyang Xue, Bin Liang, Xingshan Zeng, Fei Mi, Haoli Bai, Lifeng Shang, Jeff Z. Pan, Yuxin Jiang, Kam-Fai Wong

Main category: cs.CL

TL;DR: Memory-T1: A reinforcement learning framework that learns time-aware memory selection for temporal reasoning in long multi-session dialogues, using coarse-to-fine filtering and multi-level rewards to improve accuracy and robustness.

Details

Motivation: Current long-context models struggle with temporal reasoning in lengthy, noisy dialogue histories, impairing their ability to identify temporally pertinent information for accurate responses.

Method: Uses RL to learn a time-aware memory selection policy with coarse-to-fine strategy: first prunes dialogue history using temporal and relevance filters, then RL agent selects precise evidence sessions. Training uses multi-level rewards optimizing answer accuracy, evidence grounding, and temporal consistency (session-level chronological proximity and utterance-level chronological fidelity).

Result: Achieves 67.0% overall score on Time-Dialog benchmark, establishing new SOTA for open-source models and outperforming 14B baseline by 10.2%. Maintains robustness up to 128k tokens where baselines collapse. Ablation shows temporal consistency and evidence grounding rewards contribute 15.0% performance gain.

Conclusion: Memory-T1 effectively addresses temporal reasoning challenges in long, noisy dialogue histories through learned time-aware memory selection, demonstrating significant performance improvements and robustness at extreme context lengths.

Abstract: Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. However, existing works and our pilot study have shown that as dialogue histories grow in length and accumulate noise, current long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy, first pruning the dialogue history into a candidate set using temporal and relevance filters, followed by an RL agent that selects the precise evidence sessions. The RL training is guided by a multi-level reward function optimizing (i) answer accuracy, (ii) evidence grounding, and (iii) temporal consistency. In particular, the temporal consistency reward provides a dense signal by evaluating alignment with the query time scope at both the session-level (chronological proximity) and the utterance-level (chronological fidelity), enabling the agent to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0%, establishing a new state-of-the-art performance for open-source models and outperforming a 14B baseline by 10.2%. Ablation studies show temporal consistency and evidence grounding rewards jointly contribute to a 15.0% performance gain. Moreover, Memory-T1 maintains robustness up to 128k tokens, where baseline models collapse, proving effectiveness against noise in extensive dialogue histories. The code and datasets are publicly available at https://github.com/Elvin-Yiming-Du/Memory-T1/

[9] A Novel Graph-Sequence Learning Model for Inductive Text Classification

Zuo Wang, Ye Yuan

Main category: cs.CL

TL;DR: TextGSL: A novel graph-sequence learning model for inductive text classification that combines diverse structural relationships (co-occurrence, syntax, semantics) with sequential information using adaptive multi-edge message passing and Transformer layers.

Details

Motivation: Current GNN-based text classification methods have two major limitations: 1) They fail to fully consider diverse structural information across word pairs (co-occurrence, syntax, semantics), and 2) They neglect sequence information in text graph structure learning and cannot handle texts with new words and relations (inductive setting).

Method: Proposes TextGSL with: 1) Construction of single text-level graphs with different edge types based on diverse word-pair relationships, 2) Adaptive multi-edge message-passing paradigm to aggregate diverse structural information, and 3) Incorporation of Transformer layers to capture sequential information.

Result: TextGSL outperforms several strong baselines on diverse benchmarking datasets in terms of accuracy, demonstrating superior performance for inductive text classification.

Conclusion: TextGSL effectively addresses limitations of existing GNN-based text classification by simultaneously capturing diverse structural relationships and sequential information, enabling inductive learning with new words and relations.

Abstract: Text classification plays an important role in various downstream text-related tasks, such as sentiment analysis, fake news detection, and public opinion analysis. Recently, text classification based on Graph Neural Networks (GNNs) has made significant progress due to their strong capabilities of structural relationship learning. However, these approaches still face two major limitations. First, these approaches fail to fully consider the diverse structural information across word pairs, e.g., co-occurrence, syntax, and semantics. Furthermore, they neglect sequence information in the text graph structure information learning module and can not classify texts with new words and relations. In this paper, we propose a Novel Graph-Sequence Learning Model for Inductive Text Classification (TextGSL) to address the previously mentioned issues. More specifically, we construct a single text-level graph for all words in each text and establish different edge types based on the diverse relationships between word pairs. Building upon this, we design an adaptive multi-edge message-passing paradigm to aggregate diverse structural information between word pairs. Additionally, sequential information among text data can be captured by the proposed TextGSL through the incorporation of Transformer layers. Therefore, TextGSL can learn more discriminative text representations. TextGSL has been comprehensively compared with several strong baselines. The experimental results on diverse benchmarking datasets demonstrate that TextGSL outperforms these baselines in terms of accuracy.

[10] ABBEL: LLM Agents Acting through Belief Bottlenecks Expressed in Language

Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, Alane Suhr

Main category: cs.CL

TL;DR: ABBEL framework enables LLM agents to maintain concise contexts via natural language belief states instead of full interaction histories, with RL post-training improving performance beyond full context while using less memory.

Details

Motivation: As sequential decision-making tasks grow longer, keeping full interaction histories becomes computationally impractical due to memory constraints and context window limitations.

Method: ABBEL framework replaces long multi-step interaction history with a belief state (natural language summary). At each step, agent updates prior belief with new observation to form posterior belief, then uses only posterior to select action. RL post-training with belief grading (rewarding quality) and length penalties (rewarding compression) further improves performance.

Result: ABBEL maintains near-constant memory use over interaction steps while generating interpretable beliefs. However, bottleneck approaches suffer from error propagation, causing inferior performance vs. full context. RL post-training improves ABBEL’s performance beyond full context setting while using less memory than contemporaneous approaches.

Conclusion: ABBEL provides a practical framework for long-horizon sequential decision-making with LLM agents, balancing memory efficiency and performance through belief bottlenecks and RL optimization.

Abstract: As the length of sequential decision-making tasks increases, it becomes computationally impractical to keep full interaction histories in context. We introduce a general framework for LLM agents to maintain concise contexts through multi-step interaction: Acting through Belief Bottlenecks Expressed in Language (ABBEL), and methods to further improve ABBEL agents with RL post-training. ABBEL replaces long multi-step interaction history by a belief state, i.e., a natural language summary of what has been discovered about task-relevant unknowns. Under ABBEL, at each step the agent first updates a prior belief with the most recent observation from the environment to form a posterior belief, then uses only the posterior to select an action. We systematically evaluate frontier models under ABBEL across six diverse multi-step environments, finding that ABBEL supports generating interpretable beliefs while maintaining near-constant memory use over interaction steps. However, bottleneck approaches are generally prone to error propagation, which we observe causing inferior performance when compared to the full context setting due to errors in belief updating. Therefore, we train LLMs to generate and act on beliefs within the ABBEL framework via reinforcement learning (RL). We experiment with belief grading, to reward higher quality beliefs, as well as belief length penalties to reward more compressed beliefs. Our experiments demonstrate the ability of RL to improve ABBEL’s performance beyond the full context setting, while using less memory than contemporaneous approaches.

[11] Fun-Audio-Chat Technical Report

Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu, Chao-Hong Tan, Wen Wang, Junhao Xu, Jieping Ye, Qinglin Zhang, Qiquan Zhang, Jingren Zhou

Main category: cs.CL

TL;DR: Fun-Audio-Chat is a Large Audio Language Model that solves speech-text temporal mismatch and catastrophic forgetting issues through dual-resolution processing and core-cocktail training, achieving competitive performance on audio tasks while retaining text LLM knowledge.

Details

Motivation: Existing joint speech-text models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge.

Method: 1) Dual-Resolution Speech Representations (DRSR): Shared LLM processes audio at efficient 5Hz via token grouping, while Speech Refined Head generates high-quality tokens at 25Hz. 2) Core-Cocktail Training: Two-stage fine-tuning with intermediate merging to mitigate catastrophic forgetting. 3) Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy.

Result: Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy. A full-duplex variant (Fun-Audio-Chat-Duplex) shows strong performance on spoken QA and full-duplex interactions.

Conclusion: Fun-Audio-Chat successfully addresses key limitations of existing audio-language models through innovative dual-resolution processing and training techniques, achieving strong performance while retaining text LLM knowledge without requiring large-scale audio-text pre-training. The model is open-sourced with training/inference code and an interactive demo.

Abstract: Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model addressing these limitations via two innovations from our previous work DrVoice. First, Dual-Resolution Speech Representations (DRSR): the Shared LLM processes audio at efficient 5Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency (~50% GPU reduction) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy. This multi-stage post-training enables Fun-Audio-Chat to retain text LLM knowledge while gaining powerful audio understanding, reasoning, and generation. Unlike recent LALMs requiring large-scale audio-text pre-training, Fun-Audio-Chat leverages pre-trained models and extensive post-training. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy. We develop Fun-Audio-Chat-Duplex, a full-duplex variant with strong performance on Spoken QA and full-duplex interactions. We open-source Fun-Audio-Chat-8B with training and inference code, and provide an interactive demo.

[12] M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

Hyeongcheol Park, Jiyoung Seo, Jaewon Mun, Hogun Park, Wonmin Byeon, Sung June Kim, Hyeonsoo Im, JeungSub Lee, Sangpil Kim

Main category: cs.CL

TL;DR: M³KG-RAG: A multi-hop multimodal knowledge graph-enhanced RAG system for audio-visual reasoning that addresses limitations in existing multimodal RAG approaches through improved knowledge retrieval and pruning.

Details

Motivation: Multimodal RAG in audio-visual domain faces challenges: 1) limited modality coverage and multi-hop connectivity in existing multimodal knowledge graphs, and 2) similarity-based retrieval that fails to filter out off-topic or redundant knowledge, leading to poor reasoning depth and answer faithfulness.

Method: Proposes M³KG-RAG with two key components: 1) Lightweight multi-agent pipeline to construct multi-hop MMKG (M³KG) with context-enriched triplets of multimodal entities, enabling modality-wise retrieval. 2) GRASP (Grounded Retrieval And Selective Pruning) that ensures precise entity grounding, evaluates answer-supporting relevance, and prunes redundant context to retain only essential knowledge.

Result: Extensive experiments across diverse multimodal benchmarks demonstrate that M³KG-RAG significantly enhances MLLMs’ multimodal reasoning and grounding capabilities over existing approaches.

Conclusion: M³KG-RAG effectively addresses limitations of current multimodal RAG systems by improving knowledge retrieval precision and reasoning depth through multi-hop multimodal knowledge graphs and selective pruning mechanisms, leading to better audio-visual reasoning performance.

Abstract: Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite their recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M$^3$KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct multi-hop MMKG (M$^3$KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M$^3$KG-RAG significantly enhances MLLMs’ multimodal reasoning and grounding over existing approaches.

[13] SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision

Maxime Poli, Mahi Luthra, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Jiayi Shen, Robin Algayres, Yu-An Chung, Mido Assran, Juan Pino, Emmanuel Dupoux

Main category: cs.CL

TL;DR: SpidR is a self-supervised speech representation model that learns phonetic-rich representations for textless spoken language modeling, outperforming existing models on language benchmarks while reducing pretraining time from a week to one day.

Details

Motivation: To enable learning language directly from speech without textual intermediates by extracting semantic representations from speech, addressing the need for efficient textless spoken language modeling.

Method: Self-supervised model trained on raw waveforms using masked prediction objective with self-distillation and online clustering. Intermediate student layers predict assignments from teacher’s intermediate layers, stabilizing clustering for higher quality codebooks.

Result: Outperforms wav2vec 2.0, HuBERT, WavLM, and DinoSR on language modeling benchmarks (sWUGGY, sBLIMP, tSC). Validates speech unit quality metrics as reliable proxies for language modeling performance. Reduces pretraining time from a week to one day on 16 GPUs.

Conclusion: SpidR provides efficient, high-quality speech representations for textless language modeling with significantly reduced training time, enabling faster experimentation and advancing speech-based language learning.

Abstract: The parallel advances in language modeling and speech representation learning have raised the prospect of learning language directly from speech without textual intermediates. This requires extracting semantic representations directly from speech. Our contributions are threefold. First, we introduce SpidR, a self-supervised speech representation model that efficiently learns representations with highly accessible phonetic information, which makes it particularly suited for textless spoken language modeling. It is trained on raw waveforms using a masked prediction objective combined with self-distillation and online clustering. The intermediate layers of the student model learn to predict assignments derived from the teacher’s intermediate layers. This learning objective stabilizes the online clustering procedure compared to previous approaches, resulting in higher quality codebooks. SpidR outperforms wav2vec 2.0, HuBERT, WavLM, and DinoSR on downstream language modeling benchmarks (sWUGGY, sBLIMP, tSC). Second, we systematically evaluate across models and layers the correlation between speech unit quality (ABX, PNMI) and language modeling performance, validating these metrics as reliable proxies. Finally, SpidR significantly reduces pretraining time compared to HuBERT, requiring only one day of pretraining on 16 GPUs, instead of a week. This speedup is enabled by the pretraining method and an efficient codebase, which allows faster iteration and easier experimentation. We open-source the training code and model checkpoints at https://github.com/facebookresearch/spidr.

[14] Multi-hop Reasoning via Early Knowledge Alignment

Yuxin Wang, Shicheng Fang, Bo Wang, Qi Luo, Xuanjing Huang, Yining Zheng, Xipeng Qiu

Main category: cs.CL

TL;DR: EKA (Early Knowledge Alignment) improves iterative RAG by aligning LLMs with retrieval corpus before planning, reducing cascading errors and improving efficiency.

Details

Motivation: Existing iterative RAG systems plan question decomposition without considering available retrieval corpus, leading to inefficient retrieval and cascading errors in complex multi-hop questions.

Method: Introduces Early Knowledge Alignment module that aligns LLMs with retrieval set before planning, providing contextually relevant retrieved knowledge to establish stronger reasoning foundation.

Result: Significantly improves retrieval precision, reduces cascading errors, enhances performance and efficiency across six standard RAG datasets. Analysis shows reduced unnecessary exploration and better focus on relevant information.

Conclusion: EKA advances state-of-the-art in iterative RAG systems, demonstrating critical interplay between structured reasoning and efficient exploration in reinforcement learning-augmented frameworks as a versatile, training-free inference strategy.

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for Large Language Models (LLMs) to address knowledge-intensive queries requiring domain-specific or up-to-date information. To handle complex multi-hop questions that are challenging for single-step retrieval, iterative RAG approaches incorporating reinforcement learning have been proposed. However, existing iterative RAG systems typically plan to decompose questions without leveraging information about the available retrieval corpus, leading to inefficient retrieval and reasoning chains that cascade into suboptimal performance. In this paper, we introduce Early Knowledge Alignment (EKA), a simple but effective module that aligns LLMs with retrieval set before planning in iterative RAG systems with contextually relevant retrieved knowledge. Extensive experiments on six standard RAG datasets demonstrate that by establishing a stronger reasoning foundation, EKA significantly improves retrieval precision, reduces cascading errors, and enhances both performance and efficiency. Our analysis from an entropy perspective demonstrate that incorporating early knowledge reduces unnecessary exploration during the reasoning process, enabling the model to focus more effectively on relevant information subsets. Moreover, EKA proves effective as a versatile, training-free inference strategy that scales seamlessly to large models. Generalization tests across diverse datasets and retrieval corpora confirm the robustness of our approach. Overall, EKA advances the state-of-the-art in iterative RAG systems while illuminating the critical interplay between structured reasoning and efficient exploration in reinforcement learning-augmented frameworks. The code is released at \href{https://github.com/yxzwang/EarlyKnowledgeAlignment}{Github}.

[15] Retrieval-augmented Prompt Learning for Pre-trained Foundation Models

Xiang Chen, Yixin Ou, Quan Feng, Lei Li, Piji Li, Haibo Ye, Sheng-Jun Huang, Shuofei Qiao, Shumin Deng, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: RetroPrompt is a novel prompt learning approach that uses retrieval mechanisms from a knowledge base to balance memorization and generalization in pre-trained foundation models, outperforming traditional methods in zero-shot and few-shot scenarios.

Details

Motivation: Traditional prompt learning for pre-trained foundation models follows parametric learning paradigms that can compromise generalization stability, overfit to shallow patterns, and fail to fully utilize atypical instances with limited data.

Method: RetroPrompt decouples knowledge from memorization by leveraging a publicly accessible knowledge base generated from training data and incorporating retrieval mechanisms throughout input, training, and inference stages to actively retrieve relevant contextual information.

Result: Comprehensive experiments across NLP and computer vision tasks demonstrate RetroPrompt’s superior performance in both zero-shot and few-shot scenarios, with analysis showing reduced reliance on rote memorization and enhanced generalization.

Conclusion: RetroPrompt effectively addresses limitations of conventional prompt learning by balancing memorization and generalization through retrieval-based knowledge utilization, leading to improved performance in data-limited scenarios.

Abstract: The pre-trained foundation models (PFMs) have become essential for facilitating large-scale multimodal learning. Researchers have effectively employed the ``pre-train, prompt, and predict’’ paradigm through prompt learning to induce improved few-shot performance. However, prompt learning approaches for PFMs still follow a parametric learning paradigm. As such, the stability of generalization in memorization and rote learning can be compromised. More specifically, conventional prompt learning might face difficulties in fully utilizing atypical instances and avoiding overfitting to shallow patterns with limited data during the process of fully-supervised training. To overcome these constraints, we present our approach, named RetroPrompt, which aims to achieve a balance between memorization and generalization by decoupling knowledge from mere memorization. Unlike traditional prompting methods, RetroPrompt leverages a publicly accessible knowledge base generated from the training data and incorporates a retrieval mechanism throughout the input, training, and inference stages. This enables the model to actively retrieve relevant contextual information from the corpus, thereby enhancing the available cues. We conduct comprehensive experiments on a variety of datasets across natural language processing and computer vision tasks to demonstrate the superior performance of our proposed approach, RetroPrompt, in both zero-shot and few-shot scenarios. Through detailed analysis of memorization patterns, we observe that RetroPrompt effectively reduces the reliance on rote memorization, leading to enhanced generalization.

[16] AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications

Honglin Mu, Jinghao Liu, Kaiyang Wan, Rui Xing, Xiuying Chen, Timothy Baldwin, Wanxiang Che

Main category: cs.CL

TL;DR: LLMs are vulnerable to adversarial instructions hidden in inputs like resumes, with attacks achieving >80% success. The paper introduces a benchmark for resume screening vulnerability and shows training-time defenses (FIDS with LoRA) outperform prompt-based defenses.

Details

Motivation: LLMs are increasingly used for automated tasks like resume screening, but they can be manipulated by adversarial instructions hidden in input data. While defenses exist for mature domains like code review, they're often absent in other applications like resume screening, creating security vulnerabilities.

Method: 1) Introduced a benchmark to assess LLM vulnerability to adversarial instructions in resume screening. 2) Evaluated two defense mechanisms: prompt-based defenses and FIDS (Foreign Instruction Detection through Separation) using LoRA adaptation. 3) Combined both approaches for enhanced defense.

Result: Attack success rates exceeded 80% for certain attack types in resume screening. Prompt-based defenses achieved 10.1% attack reduction with 12.5% false rejection increase. FIDS with LoRA achieved 15.4% attack reduction with 10.4% false rejection increase. Combined approach provided 26.3% attack reduction.

Conclusion: Training-time defenses (FIDS with LoRA adaptation) outperform inference-time mitigations in both security and utility preservation. The vulnerability to adversarial instructions is significant in applications like resume screening, and effective defenses require model adaptation rather than just prompt engineering.

Abstract: Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. However, our research identifies a vulnerability: LLMs can be manipulated by “adversarial instructions” hidden in input data, such as resumes or code, causing them to deviate from their intended task. Notably, while defenses may exist for mature domains such as code review, they are often absent in other common applications such as resume screening and peer review. This paper introduces a benchmark to assess this vulnerability in resume screening, revealing attack success rates exceeding 80% for certain attack types. We evaluate two defense mechanisms: prompt-based defenses achieve 10.1% attack reduction with 12.5% false rejection increase, while our proposed FIDS (Foreign Instruction Detection through Separation) using LoRA adaptation achieves 15.4% attack reduction with 10.4% false rejection increase. The combined approach provides 26.3% attack reduction, demonstrating that training-time defenses outperform inference-time mitigations in both security and utility preservation.

[17] FaithLens: Detecting and Explaining Faithfulness Hallucination

Shuzheng Si, Qingyi Wang, Haozhe Zhao, Yuzhuo Bai, Guanqiao Chen, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun

Main category: cs.CL

TL;DR: FaithLens is an 8B-parameter model for detecting faithfulness hallucinations in LLM outputs, providing both binary predictions and explanations, outperforming GPT-4.1 and o3 across 12 diverse tasks.

Details

Motivation: Faithfulness hallucination detection is crucial for real-world LLM applications like retrieval-augmented generation and summarization, requiring trustworthy, efficient solutions.

Method: 1) Synthesize training data with explanations using advanced LLMs, 2) Apply data filtering for label correctness, explanation quality, and diversity, 3) Fine-tune model on curated data, 4) Optimize with rule-based reinforcement learning using rewards for prediction correctness and explanation quality.

Result: FaithLens outperforms advanced models like GPT-4.1 and o3 on 12 diverse tasks, produces high-quality explanations, and achieves a distinctive balance of trustworthiness, efficiency, and effectiveness.

Conclusion: FaithLens provides a cost-efficient, effective solution for faithfulness hallucination detection with joint prediction and explanation capabilities, offering superior performance and trustworthiness for real-world LLM applications.

Abstract: Recognizing whether outputs from large language models (LLMs) contain faithfulness hallucination is crucial for real-world applications, e.g., retrieval-augmented generation and summarization. In this paper, we introduce FaithLens, a cost-efficient and effective faithfulness hallucination detection model that can jointly provide binary predictions and corresponding explanations to improve trustworthiness. To achieve this, we first synthesize training data with explanations via advanced LLMs and apply a well-defined data filtering strategy to ensure label correctness, explanation quality, and data diversity. Subsequently, we fine-tune the model on these well-curated training data as a cold start and further optimize it with rule-based reinforcement learning, using rewards for both prediction correctness and explanation quality. Results on 12 diverse tasks show that the 8B-parameter FaithLens outperforms advanced models such as GPT-4.1 and o3. Also, FaithLens can produce high-quality explanations, delivering a distinctive balance of trustworthiness, efficiency, and effectiveness.

[18] Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction

Yangui Fang, Baixu Chen, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong

Main category: cs.CL

TL;DR: RLLM-CF framework improves ASR error correction using LLMs without training, addressing hallucination through error pre-detection, chain-of-thought iterative correction, and verification.

Details

Motivation: Traditional ASR error correction methods have moderate effectiveness. LLMs offer training-free solutions but suffer from hallucinations that may modify correct text, creating a need for reliable correction frameworks.

Method: Three-stage Reliable LLM Correction Framework: (1) error pre-detection, (2) chain-of-thought sub-tasks iterative correction, and (3) reasoning process verification. No additional information or model fine-tuning required.

Result: GPT-4o enhanced by RLLM-CF achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER on AISHELL-1, AISHELL-2, and Librispeech datasets.

Conclusion: RLLM-CF effectively addresses LLM hallucination in ASR error correction, providing reliable performance improvements without requiring training or labeled data.

Abstract: Automatic Speech Recognition (ASR) error correction aims to correct recognition errors while preserving accurate text. Although traditional approaches demonstrate moderate effectiveness, LLMs offer a paradigm that eliminates the need for training and labeled data. However, directly using LLMs will encounter hallucinations problem, which may lead to the modification of the correct text. To address this problem, we propose the Reliable LLM Correction Framework (RLLM-CF), which consists of three stages: (1) error pre-detection, (2) chain-of-thought sub-tasks iterative correction, and (3) reasoning process verification. The advantage of our method is that it does not require additional information or fine-tuning of the model, and ensures the correctness of the LLM correction under multi-pass programming. Experiments on AISHELL-1, AISHELL-2, and Librispeech show that the GPT-4o model enhanced by our framework achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER.

[19] Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings

Marko Čechovič, Natália Komorníková, Dominik Macháček, Ondřej Bojar

Main category: cs.CL

TL;DR: A corpus of cross-lingual dialogues facilitated by speech translation, with 5 hours of speech in 12 languages, plus automatic detection of misunderstandings using LLMs.

Details

Motivation: Need for realistic evaluation corpus to assess automatic systems for facilitating meetings between individuals without a common language, enabled by speech translation technology.

Method: Created corpus of cross-lingual dialogues with ASR transcripts in 12 languages and English translations; proposed automatic misunderstanding detection using large language models (Gemini).

Result: Corpus includes 5 hours of speech with transcripts and translations; Gemini model achieved 77% recall and 47% precision in detecting misunderstandings.

Conclusion: Presented valuable corpus for cross-lingual meeting research and demonstrated LLMs’ potential for automatic misunderstanding detection, though precision needs improvement.

Abstract: Speech processing and translation technology have the potential to facilitate meetings of individuals who do not share any common language. To evaluate automatic systems for such a task, a versatile and realistic evaluation corpus is needed. Therefore, we create and present a corpus of cross-lingual dialogues between individuals without a common language who were facilitated by automatic simultaneous speech translation. The corpus consists of 5 hours of speech recordings with ASR and gold transcripts in 12 original languages and automatic and corrected translations into English. For the purposes of research into cross-lingual summarization, our corpus also includes written summaries (minutes) of the meetings. Moreover, we propose automatic detection of misunderstandings. For an overview of this task and its complexity, we attempt to quantify misunderstandings in cross-lingual meetings. We annotate misunderstandings manually and also test the ability of current large language models to detect them automatically. The results show that the Gemini model is able to identify text spans with misunderstandings with recall of 77% and precision of 47%.

[20] AprielGuard

Jaykumar Kasundra, Anjaneya Praharaj, Sourabh Surana, Lakshmi Sirisha Chodisetty, Sourav Sharma, Abhigya Verma, Abhishek Bhardwaj, Debasish Kanhar, Aakash Bhagat, Khalil Slimi, Seganrasan Subramanian, Sathwik Tejaswi Madhusudhan, Ranga Prasad Chenna, Srinivas Sunkara

Main category: cs.CL

TL;DR: AprielGuard is an 8B parameter safeguard model that unifies safety risk detection (toxicity, bias) and adversarial threat detection (prompt injections, jailbreaks) in a single framework, outperforming existing open-source guardrails.

Details

Motivation: Existing moderation tools treat safety risks and adversarial threats as separate problems, limiting robustness and generalizability. As LLMs are increasingly deployed in conversational and agentic settings, there's a need for unified safeguards.

Method: AprielGuard uses a single taxonomy and learning framework, trained on diverse open and synthetic data covering standalone prompts, multi-turn conversations, and agentic workflows, augmented with structured reasoning traces for interpretability.

Result: AprielGuard achieves strong performance across multiple public and proprietary benchmarks, outperforming existing open-source guardrails like Llama-Guard and Granite Guardian, especially in multi-step and reasoning-intensive scenarios.

Conclusion: By releasing the model, the authors aim to advance transparent and reproducible research on reliable safeguards for LLMs, providing a unified solution for both safety risks and adversarial threats.

Abstract: Safeguarding large language models (LLMs) against unsafe or adversarial behavior is critical as they are increasingly deployed in conversational and agentic settings. Existing moderation tools often treat safety risks (e.g. toxicity, bias) and adversarial threats (e.g. prompt injections, jailbreaks) as separate problems, limiting their robustness and generalizability. We introduce AprielGuard, an 8B parameter safeguard model that unify these dimensions within a single taxonomy and learning framework. AprielGuard is trained on a diverse mix of open and synthetic data covering standalone prompts, multi-turn conversations, and agentic workflows, augmented with structured reasoning traces to improve interpretability. Across multiple public and proprietary benchmarks, AprielGuard achieves strong performance in detecting harmful content and adversarial manipulations, outperforming existing opensource guardrails such as Llama-Guard and Granite Guardian, particularly in multi-step and reasoning intensive scenarios. By releasing the model, we aim to advance transparent and reproducible research on reliable safeguards for LLMs.

[21] Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives

Karolina Drożdż, Kacper Dudzic, Anna Sterna, Marcin Moskalewicz

Main category: cs.CL

TL;DR: LLMs outperform mental health professionals in diagnosing BPD from Polish autobiographical narratives but severely underdiagnose NPD due to bias against the term “narcissism,” revealing both competence and reliability issues.

Details

Motivation: To evaluate LLMs' ability to interpret qualitative patient narratives for psychiatric diagnosis compared to human experts, particularly for personality disorders where nuanced understanding is crucial.

Method: Direct comparison between state-of-the-art LLMs (Gemini Pro) and mental health professionals using Polish-language first-person autobiographical accounts to diagnose Borderline (BPD) and Narcissistic (NPD) Personality Disorders.

Result: Gemini Pro models surpassed human professionals by 21.91 percentage points overall (65.48% vs. 43.57%). Both performed well on BPD (F1 = 83.4 vs 80.0), but models severely underdiagnosed NPD (F1 = 6.7 vs 50.0), showing reluctance toward the term “narcissism.” Models gave confident, elaborate justifications while humans were concise and cautious.

Conclusion: LLMs are highly competent at interpreting complex clinical data but suffer from critical reliability and bias issues, particularly with value-laden terms like “narcissism,” highlighting the need for caution in psychiatric applications.

Abstract: Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. We present the first direct comparison between state-of-the-art LLMs and mental health professionals in diagnosing Borderline (BPD) and Narcissistic (NPD) Personality Disorders utilizing Polish-language first-person autobiographical accounts. We show that the top-performing Gemini Pro models surpassed human professionals in overall diagnostic accuracy by 21.91 percentage points (65.48% vs. 43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 & F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a reluctance toward the value-laden term “narcissism.” Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patient’s sense of self and temporal experience. Our findings demonstrate that while LLMs are highly competent at interpreting complex first-person clinical data, they remain subject to critical reliability and bias issues.

[22] Can LLMs Solve My Grandma’s Riddle? Evaluating Multilingual Large Language Models on Reasoning Traditional Bangla Tricky Riddles

Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Khushnur Binte Jahangir, Swakkhar Shatabda, Sarah Masud Preum

Main category: cs.CL

TL;DR: BanglaRiddleEval is a new benchmark of 1,244 traditional Bangla riddles across four tasks, showing current LLMs perform poorly on Bangla figurative reasoning compared to humans.

Details

Motivation: LLMs show strong performance on many NLP benchmarks, but their ability to reason in figurative, culturally grounded, and low-resource settings like Bangla remains underexplored.

Method: Created BanglaRiddleEval benchmark with 1,244 traditional Bangla riddles across four tasks (4,976 artifacts total). Used LLM-based pipeline to generate Chain-of-Thought explanations, distractors, and ambiguity annotations. Evaluated diverse open-source and closed-source models with different prompting strategies.

Result: Models show moderate semantic overlap but low correctness on generative QA. MCQ accuracy peaks at only ~56% vs 83% human baseline. Ambiguity resolution ranges from ~26% to 68%. High-quality explanations only from strongest models.

Conclusion: Current LLMs capture some cues for Bangla riddle reasoning but remain far from human-level performance. BanglaRiddleEval establishes a challenging new benchmark for low-resource figurative reasoning.

Abstract: Large Language Models (LLMs) show impressive performance on many NLP benchmarks, yet their ability to reason in figurative, culturally grounded, and low-resource settings remains underexplored. We address this gap for Bangla by introducing BanglaRiddleEval, a benchmark of 1,244 traditional Bangla riddles instantiated across four tasks (4,976 riddle-task artifacts in total). Using an LLM-based pipeline, we generate Chain-of-Thought explanations, semantically coherent distractors, and fine-grained ambiguity annotations, and evaluate a diverse suite of open-source and closed-source models under different prompting strategies. Models achieve moderate semantic overlap on generative QA but low correctness, MCQ accuracy peaks at only about 56% versus an 83% human baseline, and ambiguity resolution ranges from roughly 26% to 68%, with high-quality explanations confined to the strongest models. These results show that current LLMs capture some cues needed for Bangla riddle reasoning but remain far from human-level performance, establishing BanglaRiddleEval as a challenging new benchmark for low-resource figurative reasoning. All data, code, and evaluation scripts are available on GitHub: https://github.com/Labib1610/BanglaRiddleEval.

[23] Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen’s Kappa and Semantic Similarity for Qualitative Research Validation

Nilesh Jain, Seyi Adeyinka, Leor Roseman, Aza Allsop

Main category: cs.CL

TL;DR: A framework combining ensemble validation with dual reliability metrics (Cohen’s Kappa and cosine similarity) for LLM-based thematic analysis, achieving high reliability across three leading LLMs on psychedelic art therapy data.

Details

Motivation: Traditional inter-rater agreement methods in qualitative research require multiple human coders, are time-intensive, and often yield moderate consistency, creating a critical reliability challenge.

Method: Multi-perspective validation framework with ensemble validation and dual reliability metrics (Cohen’s Kappa for inter-rater agreement, cosine similarity for semantic consistency). Features configurable analysis parameters (1-6 seeds, temperature 0.0-2.0), custom prompt structures with variable substitution, and consensus theme extraction across any JSON format. Evaluated three LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Sonnet) on psychedelic art therapy interview transcript with six independent runs per model.

Result: Gemini achieved highest reliability (κ=0.907, cosine=95.3%), followed by GPT-4o (κ=0.853, cosine=92.6%) and Claude (κ=0.842, cosine=92.1%). All three models achieved high agreement (κ>0.80). Gemini identified 6 consensus themes (50-83% consistency), GPT-4o identified 5 themes, and Claude identified 4 themes.

Conclusion: The multi-run ensemble approach is validated for reliable AI-assisted qualitative research, providing transparent reliability metrics, flexible configuration, and structure-agnostic consensus extraction through an open-source implementation.

Abstract: Qualitative research faces a critical reliability challenge: traditional inter-rater agreement methods require multiple human coders, are time-intensive, and often yield moderate consistency. We present a multi-perspective validation framework for LLM-based thematic analysis that combines ensemble validation with dual reliability metrics: Cohen’s Kappa ($κ$) for inter-rater agreement and cosine similarity for semantic consistency. Our framework enables configurable analysis parameters (1-6 seeds, temperature 0.0-2.0), supports custom prompt structures with variable substitution, and provides consensus theme extraction across any JSON format. As proof-of-concept, we evaluate three leading LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Sonnet) on a psychedelic art therapy interview transcript, conducting six independent runs per model. Results demonstrate Gemini achieves highest reliability ($κ= 0.907$, cosine=95.3%), followed by GPT-4o ($κ= 0.853$, cosine=92.6%) and Claude ($κ= 0.842$, cosine=92.1%). All three models achieve a high agreement ($κ> 0.80$), validating the multi-run ensemble approach. The framework successfully extracts consensus themes across runs, with Gemini identifying 6 consensus themes (50-83% consistency), GPT-4o identifying 5 themes, and Claude 4 themes. Our open-source implementation provides researchers with transparent reliability metrics, flexible configuration, and structure-agnostic consensus extraction, establishing methodological foundations for reliable AI-assisted qualitative research.

[24] Sentiment-Aware Extractive and Abstractive Summarization for Unstructured Text Mining

Junyi Liu, Stanley Kok

Main category: cs.CL

TL;DR: Proposes a sentiment-aware summarization framework for noisy user-generated text that integrates emotional cues into both extractive and abstractive methods to improve relevance and emotional nuance capture.

Details

Motivation: Existing summarization methods optimized for structured news struggle with noisy, informal user-generated content from social media, reviews, and forums. Emotional cues are critical for IS tasks like brand monitoring and market analysis, but few studies integrate sentiment modeling into summarization of short texts.

Method: A sentiment-aware framework extending extractive (TextRank) and abstractive (UniLM) approaches by embedding sentiment signals into ranking and generation processes. This dual design integrates emotional cues into both extractive and abstractive summarization methods.

Result: The framework improves capture of emotional nuances and thematic relevance, producing concise, sentiment-enriched summaries that enhance timely interventions and strategic decision-making in dynamic online environments.

Conclusion: The proposed sentiment-aware summarization framework addresses the limitations of existing methods for user-generated content, enabling better extraction of actionable insights from unstructured social media data for IS applications like brand monitoring and market analysis.

Abstract: With the rapid growth of unstructured data from social media, reviews, and forums, text mining has become essential in Information Systems (IS) for extracting actionable insights. Summarization can condense fragmented, emotion-rich posts, but existing methods-optimized for structured news-struggle with noisy, informal content. Emotional cues are critical for IS tasks such as brand monitoring and market analysis, yet few studies integrate sentiment modeling into summarization of short user-generated texts. We propose a sentiment-aware framework extending extractive (TextRank) and abstractive (UniLM) approaches by embedding sentiment signals into ranking and generation processes. This dual design improves the capture of emotional nuances and thematic relevance, producing concise, sentiment-enriched summaries that enhance timely interventions and strategic decision-making in dynamic online environments.

[25] Step-DeepResearch Technical Report

Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, Yubo Shu, Yurong Zhang, Yuxiang Zhang, Zheng Gong, Zhichao Chang, Binyan Li, Dan Ma, Furong Jia, Hongyuan Wang, Jiayu Liu, Jing Bai, Junlan Liu, Manjiao Liu, Na Wang, Qiuping Wu, Qinxin Du, Shiwei Li, Wen Sun, Yifeng Gong, Yonglin Chen, Yuling Zhao, Yuxuan Lin, Ziqi Ren, Zixuan Wang, Aihu Zhang, Brian Li, Buyun Ma, Kang An, Li Xie, Mingliang Li, Pan Li, Shidong Yang, Xi Chen, Xiaojia Liu, Yuchu Luo, Yuan Song, YuanHao Ding, Yuanwei Liang, Zexi Li, Zhaoning Zhang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu

Main category: cs.CL

TL;DR: Step-DeepResearch is a cost-effective 32B parameter agent for deep research that achieves 61.4% on Scale AI Research Rubrics and rivals SOTA closed-source models through refined training techniques.

Details

Motivation: Existing academic benchmarks like BrowseComp fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. There's also an evaluation gap in the Chinese domain.

Method: Introduces Step-DeepResearch agent with Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with progressive training path from agentic mid-training to SFT and RL. Enhanced by Checklist-style Judger for robustness. Also establishes ADR-Bench for Chinese domain evaluation.

Result: Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch.

Conclusion: Refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency, proving that well-designed training approaches can make smaller models competitive with larger SOTA models.

Abstract: As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.

[26] Distilling to Hybrid Attention Models via KL-Guided Layer Selection

Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda, Jiawei Zhou, Yoon Kim

Main category: cs.CL

TL;DR: A simple recipe for selecting which layers to convert to linear attention in hybrid Transformers, using importance scores from small text training, outperforming existing heuristics and specialized approaches.

Details

Motivation: To improve LLM inference efficiency by distilling pretrained softmax attention Transformers into hybrid architectures with linear attention layers, without expensive pretraining from scratch.

Method: Uses layer importance scores derived from small training on generic text data for layer selection, then applies RADLADS distillation pipeline (attention weight transfer, hidden state alignment, KL-based distribution matching, and fine-tuning).

Result: This approach is more effective than existing layer selection methods, including uniform interleaving heuristics and specialized diagnostic dataset approaches.

Conclusion: Simple layer importance scoring with generic text training provides an efficient and effective recipe for selecting layers to convert to linear attention in hybrid Transformer distillation.

Abstract: Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conversion process is layer selection, i.e., deciding on which layers to convert to linear attention variants. This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. Once the layers have been selected we use a recent pipeline for the distillation process itself \citep[RADLADS;][]{goldstein2025radlads}, which consists of attention weight transfer, hidden state alignment, KL-based distribution matching, followed by a small amount of finetuning. We find that this approach is more effective than existing approaches for layer selection, including heuristics that uniformly interleave linear attentions based on a fixed ratio, as well as more involved approaches that rely on specialized diagnostic datasets.

[27] Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits

Amirhosein Ghasemabadi, Di Niu

Main category: cs.CL

TL;DR: Gnosis enables frozen LLMs to predict their own failures by analyzing internal states during inference, adding minimal parameters and achieving better accuracy than external judges.

Details

Motivation: LLMs generate fluent outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches use external judges, multi-sample consistency, or text-based self-critique, which are computationally expensive or weakly correlated with true correctness.

Method: Gnosis is a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. It passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost (~5M parameters).

Result: Across math reasoning, open-domain question answering, and academic knowledge benchmarks (1.7B to 20B parameters), Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. It also generalizes zero-shot to partial generations for early failure detection.

Conclusion: Reliable correctness cues are intrinsic to the generation process and can be extracted efficiently without external supervision, enabling LLMs to predict their own failures through internal state analysis.

Abstract: Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which incur additional compute or correlate weakly with true correctness. We ask: can LLMs predict their own failures by inspecting internal states during inference? We introduce Gnosis, a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. Gnosis passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost, adding only ~5M parameters and operating independently of sequence length. Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones ranging from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. Moreover, it generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control. These results show that reliable correctness cues are intrinsic to generation process and can be extracted efficiently without external supervision.

[28] Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs

Dhruv Anand, Ehsan Shareghi

Main category: cs.CL

TL;DR: Cube Bench is a Rubik’s cube benchmark for evaluating spatial and sequential reasoning in multimodal LLMs, testing five skills across scramble depths with consistent metrics.

Details

Motivation: To create a standardized benchmark for evaluating spatial and sequential reasoning capabilities in multimodal large language models using Rubik's cube as a testbed, addressing the need for reproducible assessment of complex reasoning skills.

Method: Developed Cube Bench with five skill tests: (1) reconstructing cube faces from images/text, (2) choosing optimal next moves, (3) predicting move outcomes without execution, (4) executing multi-step plans with error recovery, and (5) detecting/revising errors. Used shared scrambled cube states, identical prompts/parsers, and single distance-to-solved metric across seven MLLMs at varying scramble depths.

Result: Accuracy drops sharply with scramble depth; models rarely recover from stalled/diverged trajectories; high face reconstruction doesn’t guarantee competent action selection. Closed-source models outperform open-weight models significantly, with best closed model leading on both perception and control tasks. Simple self-correction yields modest gains but can cause overthinking.

Conclusion: Cube Bench provides a compact, reproducible probe for sequential spatial reasoning in MLLMs, revealing significant performance gaps between closed and open models, and showing that even best models degrade with increased cube complexity, highlighting limitations in current MLLM reasoning capabilities.

Abstract: We introduce Cube Bench, a Rubik’s-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one’s own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced closed- vs open-source gap emerges: the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; yet even the best MLLM degrades at higher cube complexity. A simple self-correction via reflective thinking yields modest gains but can also introduce overthinking. Cube Bench offers a compact, reproducible probe of sequential spatial reasoning in MLLMs.

[29] MoE-DiffuSeq: Enhancing Long-Document Diffusion Models with Sparse Attention and Mixture of Experts

Alexandros Christoforos, Chadbourne Davis

Main category: cs.CL

TL;DR: MoE-DiffuSeq enhances diffusion models for long document generation by combining sparse attention with mixture of experts architecture to reduce computational cost and memory overhead while maintaining text quality.

Details

Motivation: Existing diffusion-based text generation models like DiffuSeq suffer from high computational cost and memory overhead when applied to extended sequences, limiting their practical use for long document generation.

Method: Integrates sparse attention with mixture of experts architecture, introduces customized sparse attention mechanism to reduce computational complexity, and incorporates soft absorbing state within diffusion process to accelerate sequence reconstruction.

Result: Significantly improves training efficiency and sampling speed compared to existing diffusion models, particularly effective for long document scenarios including scientific article generation, code repository modeling, and long form dialogue generation.

Conclusion: MoE-DiffuSeq advances practical applicability of diffusion models for high quality long form text generation by improving efficiency, speed, accuracy, and expressiveness through architectural innovations.

Abstract: We present MoE-DiffuSeq, a mixture of experts based framework for enhancing diffusion models in long document generation. Existing diffusion based text generation models, such as DiffuSeq, suffer from high computational cost and memory overhead when applied to extended sequences. To address these challenges, MoE-DiffuSeq integrates sparse attention with a mixture of experts architecture, enabling efficient and scalable long sequence modeling. Our approach introduces a customized sparse attention mechanism designed to reduce computational complexity while preserving text quality and coherence. In addition, we incorporate a soft absorbing state within the diffusion process to accelerate sequence reconstruction and improve generation precision. Extensive experiments demonstrate that MoE-DiffuSeq significantly improves training efficiency and sampling speed compared to existing diffusion models. These advantages are particularly effective for long document scenarios, including scientific article generation, code repository modeling, and long form dialogue generation. Benchmark results further show that MoE-DiffuSeq improves efficiency, speed, accuracy, and expressiveness, advancing the practical applicability of diffusion models for high quality long form text generation.

[30] Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Object-Oriented Programming

Tianyang Wang, Ziqian Bi, Keyu Chen, Jiawei Xu, Qian Niu, Junyu Liu, Benji Peng, Ming Li, Sen Zhang, Xuanhe Pan, Jinlang Wang, Pohsun Feng, Yizhu Wen, Xinyuan Song, Ming Liu

Main category: cs.CL

TL;DR: This paper provides a comprehensive introduction to applying Object-Oriented Programming (OOP) principles in AI/ML domains to improve code modularity, maintainability, and scalability, with practical Python examples for real-world AI tasks.

Details

Motivation: As software complexity grows in AI/ML fields (deep learning, LLMs, data analytics), there's a need for better code organization. OOP offers a paradigm to manage this complexity by improving modularity, maintainability, and scalability of AI systems.

Method: The paper introduces key OOP principles (encapsulation, inheritance, polymorphism, abstraction), demonstrates their application in Python, examines design patterns and modular programming for ML systems, and provides practical examples of encapsulating preprocessing workflows, model training, and evaluation.

Result: The work serves as a bridge between OOP theory and practical AI implementation, showing how OOP can be used to build reusable, scalable machine learning systems while maintaining code clarity and reducing redundancy.

Conclusion: OOP methodologies are essential for developing robust and maintainable AI systems, and this comprehensive guide equips developers with the knowledge to apply OOP principles effectively in AI-driven projects.

Abstract: Object-Oriented Programming (OOP) has become a crucial paradigm for managing the growing complexity of modern software systems, particularly in fields like machine learning, deep learning, large language models (LLM), and data analytics. This work provides a comprehensive introduction to the integration of OOP techniques within these domains, with a focus on improving code modularity, maintainability, and scalability. We begin by outlining the evolution of computing and the rise of OOP, followed by an in-depth discussion of key OOP principles such as encapsulation, inheritance, polymorphism, and abstraction. The practical application of these principles is demonstrated using Python, a widely adopted language in AI and data science. Furthermore, we examine how design patterns and modular programming can be employed to enhance the structure and efficiency of machine learning systems. In subsequent sections, we apply these OOP concepts to real-world AI tasks, including the encapsulation of preprocessing workflows, machine learning model training, and evaluation. Detailed examples illustrate how OOP can be used to build reusable, scalable machine learning systems while maintaining code clarity and reducing redundancy.This work is intended to serve as a bridge for both beginners and experienced developers, equipping them with the necessary knowledge to apply OOP methodologies in AI-driven projects, ultimately fostering the development of more robust and maintainable systems.

[31] Don’t Pay Attention, PLANT It: Pretraining Attention via Learning-to-Rank

Debjyoti Saha Roy, Byron C. Wallace, Javed A. Aslam

Main category: cs.CL

TL;DR: PLANT introduces a plug-and-play attention initialization strategy using pretrained Learning-to-Rank models guided by mutual information gain to improve Extreme Multi-Label Text Classification performance.

Details

Motivation: Current XMC models struggle with learning good attention weights for focusing on key tokens in input text, which limits their performance in multi-label classification tasks.

Method: PLANT plants label-specific attention using pretrained Learning-to-Rank models guided by mutual information gain. It’s architecture-agnostic and integrates with LLM backbones like Mistral-7B, LLaMA3-8B, DeepSeek-V3, and Phi-3.

Result: PLANT outperforms state-of-the-art methods across ICD coding, legal topic classification, and content recommendation tasks. Gains are especially pronounced in few-shot settings with substantial improvements on rare labels.

Conclusion: Attention initialization is a key driver of performance gains in XMC models, and PLANT provides an effective plug-and-play solution that works across different LLM architectures.

Abstract: State-of-the-art Extreme Multi-Label Text Classification models rely on multi-label attention to focus on key tokens in input text, but learning good attention weights is challenging. We introduce PLANT - Pretrained and Leveraged Attention - a plug-and-play strategy for initializing attention. PLANT works by planting label-specific attention using a pretrained Learning-to-Rank model guided by mutual information gain. This architecture-agnostic approach integrates seamlessly with large language model backbones such as Mistral-7B, LLaMA3-8B, DeepSeek-V3, and Phi-3. PLANT outperforms state-of-the-art methods across tasks including ICD coding, legal topic classification, and content recommendation. Gains are especially pronounced in few-shot settings, with substantial improvements on rare labels. Ablation studies confirm that attention initialization is a key driver of these gains. For code and trained models, see https://github.com/debjyotiSRoy/xcube/tree/plant

[32] GRAPHMOE: Amplifying Cognitive Depth of Mixture-of-Experts Network via Introducing Self-Rethinking Mechanism

Bo Lv, Chen Tang, Zifan Zheng, Bohao Yang, Kun Zhao, Ning Liao, Xiaoxing Wang, Feiyu Xiong, Zhiyu Li, Nayu Liu, Jingchi Jiang

Main category: cs.CL

TL;DR: GRAPHMOE enhances Mixture-of-Experts networks by connecting experts through a self-rethinking mechanism with recurrent routing, achieving SOTA performance on language model benchmarks.

Details

Motivation: Traditional MoE networks have independent experts that don't communicate, limiting their potential. The paper explores whether interconnecting expert models could improve MoE network performance and cognitive depth.

Method: GRAPHMOE introduces a self-rethinking mechanism on Pseudo GraphMoE networks using recurrent routing strategy to simulate iterative thinking steps and facilitate information flow among expert nodes. Implementation uses Low-Rank Adaptation (LoRA) techniques.

Result: GRAPHMOE outperforms other LoRA-based models and achieves state-of-the-art performance on various benchmark datasets. The recurrent routing strategy shows promise for enhancing reasoning capabilities.

Conclusion: Interconnecting expert models through GRAPHMOE’s self-rethinking mechanism with recurrent routing significantly enhances MoE network performance and cognitive depth, offering a novel approach for improving language model reasoning capabilities.

Abstract: Traditional Mixture-of-Experts (MoE) networks benefit from utilizing multiple smaller expert models as opposed to a single large network. However, these experts typically operate independently, leaving a question open about whether interconnecting these models could enhance the performance of MoE networks. In response, we introduce GRAPHMOE, a novel method aimed at augmenting the cognitive depth of language models via a self-rethinking mechanism constructed on Pseudo GraphMoE networks. GRAPHMOE employs a recurrent routing strategy to simulate iterative thinking steps, thereby facilitating the flow of information among expert nodes. We implement the GRAPHMOE architecture using Low-Rank Adaptation techniques (LoRA) and conduct extensive experiments on various benchmark datasets. The experimental results reveal that GRAPHMOE outperforms other LoRA based models, achieving state-of-the-art (SOTA) performance. Additionally, this study explores a novel recurrent routing strategy that may inspire further advancements in enhancing the reasoning capabilities of language models.

[33] Select2Reason: Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning

Cehao Yang, Xueyuan Lin, Xiaojun Wu, Chengjin Xu, Xuhui Jiang, Honghao Liu, Hui Xiong, Jian Guo

Main category: cs.CL

TL;DR: Select2Reason is an efficient instruction selection framework for long chain-of-thought reasoning that selects only 10% of data while achieving competitive performance with full-data tuning.

Details

Motivation: Large-scale instruction datasets (100k+ samples) for long chain-of-thought reasoning incur significant training overhead, and there are no effective strategies for automatic selection of high-quality long-CoT instructions.

Method: Select2Reason uses a quantifier to estimate question difficulty and incorporates reasoning trace length-based heuristics through a weighted ranking scheme to prioritize high-utility examples, focusing on emergence of rethinking behaviors like self-correction and backtracking.

Result: Fine-tuning on only 10% of data selected by Select2Reason achieves performance competitive with or superior to full-data tuning and OpenR1-Qwen-7B baseline across three competition-level and six comprehensive mathematical benchmarks, with demonstrated scalability, inference efficiency, and adaptability.

Conclusion: Select2Reason provides an efficient and effective framework for selecting high-quality long-CoT reasoning instructions, significantly reducing training overhead while maintaining or improving performance, with broad applicability to different instruction pools.

Abstract: A practical approach to activate long chain-of-thoughts reasoning ability in pre-trained large language models is to perform supervised fine-tuning on instruction datasets synthesized by strong Large Reasoning Models such as DeepSeek-R1, offering a cost-effective alternative to reinforcement learning. However, large-scale instruction sets with more than 100k samples incur significant training overhead, while effective strategies for automatic long-CoT instruction selection still remain unexplored. In this work, we propose Select2Reason, a novel and efficient instruction-tuning data selection framework for long-CoT reasoning. From the perspective of emergence of rethinking behaviors like self-correction and backtracking, we investigate common metrics that may determine the quality of long-CoT reasoning instructions. Select2Reason leverages a quantifier to estimate difficulty of question and jointly incorporates a reasoning trace length-based heuristic through a weighted scheme for ranking to prioritize high-utility examples. Empirical results on OpenR1-Math-220k demonstrate that fine-tuning LLM on only 10% of the data selected by Select2Reason achieves performance competitive with or superior to full-data tuning and open-source baseline OpenR1-Qwen-7B across three competition-level and six comprehensive mathematical benchmarks. Further experiments highlight the scalability in varying data size, efficiency during inference, and its adaptability to other instruction pools with minimal cost.

[34] DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations

Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lv, Tianyu Zhao, Chong Zhang, Yukun Ma, Yafeng Chen, Hui Wang, Jiaqing Liu, Xiangang Li, Jieping Ye

Main category: cs.CL

TL;DR: DrVoice is a 7B parameter parallel speech-text voice conversation model using joint autoregressive modeling with dual-resolution speech representations that reduces input frequency from 12.5Hz to 5Hz, achieving SOTA results on multiple speech benchmarks.

Details

Motivation: Existing E2E speech generation with LLMs has limitations: either speech tokens are generated independently without LLM awareness, or they use high-frequency representations (12.5Hz) that create computational burden and modality frequency mismatch. There's a need for more efficient joint modeling that better exploits LLM capabilities.

Method: DrVoice uses joint autoregressive modeling for parallel speech-text generation with dual-resolution speech representations. The key innovation reduces input frequency from 12.5Hz to 5Hz, lowering computational cost and better aligning speech-text token frequencies. The model is built on a 7B parameter LLM architecture.

Result: DrVoice-7B establishes new state-of-the-art on multiple speech benchmarks: OpenAudioBench, VoiceBench, UltraEval-Audio, and Big Bench Audio. It becomes the leading open-source speech foundation model in the ~7B parameter category.

Conclusion: The dual-resolution approach with reduced input frequency (5Hz) enables more efficient joint speech-text autoregressive modeling, better exploiting LLM capabilities while reducing computational costs, making DrVoice a top-performing open-source speech foundation model.

Abstract: Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM’s autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Notably, while current methods utilize mainly 12.5Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5Hz, significantly reducing computational cost and alleviating the frequency discrepancy between speech and text tokens and in turn better exploiting LLMs’ capabilities. Experimental results demonstrate that DrVoice-7B establishes new state-of-the-art (SOTA) on prominent speech benchmarks including OpenAudioBench, VoiceBench, UltraEval-Audio and Big Bench Audio, making it a leading open-source speech foundation model in ~7B models.

[35] Learning without training: The implicit dynamics of in-context learning

Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, Javier Gonzalvo

Main category: cs.CL

TL;DR: LLMs can learn in-context at inference time through implicit weight modification in transformer blocks, where self-attention and MLP layers work together to create low-rank weight updates based on context.

Details

Motivation: To understand the mechanisms behind LLMs' ability to learn in-context at inference time without weight updates, which remains largely unknown despite being a striking feature of these models.

Method: Theoretical analysis and experimentation showing how transformer blocks (self-attention + MLP) implicitly modify MLP layer weights according to context, transforming context into low-rank weight updates.

Result: Demonstrated that transformer blocks can implicitly create low-rank weight updates to MLP layers based on context, providing a mechanism for in-context learning.

Conclusion: The stacking of self-attention with MLP in transformer blocks enables implicit weight modification, explaining how LLMs can learn in-context at inference time without traditional training updates.

Abstract: One of the most striking features of Large Language Models (LLMs) is their ability to learn in-context. Namely at inference time an LLM is able to learn new patterns without any additional weight update when these patterns are presented in the form of examples in the prompt, even if these patterns were not seen during training. The mechanisms through which this can happen are still largely unknown. In this work, we show that the stacking of a self-attention layer with an MLP, allows the transformer block to implicitly modify the weights of the MLP layer according to the context. We argue through theory and experimentation that this simple mechanism may be the reason why LLMs can learn in-context and not only during training. Specifically, we show how a transformer block implicitly transforms a context into a low-rank weight-update of its MLP layer.

[36] Persistent Instability in LLM’s Personality Measurements: Effects of Scale, Reasoning, and Conversation History

Tommaso Tosato, Saskia Helbling, Yorguin-Jose Mantilla-Ramos, Mahmood Hegazy, Alberto Tosato, David John Lemay, Irina Rish, Guillaume Dumas

Main category: cs.CL

TL;DR: LLM personality traits show large instability across various conditions, with question reordering, scaling, and stabilization interventions paradoxically increasing variability, suggesting current alignment strategies are inadequate for safety-critical applications.

Details

Motivation: Large language models need consistent behavioral patterns for safe deployment, but there are indications of large variability in personality trait expression that could lead to unstable behavior.

Method: PERSIST framework evaluates 25 open-source models (1B-685B parameters) across 2M+ responses using traditional (BFI, SD3) and novel LLM-adapted personality questionnaires, varying model size, personas, reasoning modes, question order/paraphrasing, and conversation history.

Result: (1) Question reordering alone causes large personality measurement shifts; (2) Scaling provides limited stability gains (400B+ models show SD >0.3 on 5-point scales); (3) Stabilization interventions like reasoning and conversation history paradoxically increase variability; (4) Detailed personas produce mixed effects with misaligned personas showing higher variability; (5) LLM-adapted questionnaires show comparable instability to human-centric versions.

Conclusion: Persistent instability across scales and mitigation strategies suggests current LLMs lack architectural foundations for genuine behavioral consistency, indicating current alignment strategies may be inadequate for safety-critical applications requiring predictable behavior.

Abstract: Large language models require consistent behavioral patterns for safe deployment, yet there are indications of large variability that may lead to an instable expression of personality traits in these models. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25 open-source models (1B-685B parameters) across 2 million+ responses. Using traditional (BFI, SD3) and novel LLM-adapted personality questionnaires, we systematically vary model size, personas, reasoning modes, question order or paraphrasing, and conversation history. Our findings challenge fundamental assumptions: (1) Question reordering alone can introduce large shifts in personality measurements; (2) Scaling provides limited stability gains: even 400B+ models exhibit standard deviations >0.3 on 5-point scales; (3) Interventions expected to stabilize behavior, such as reasoning and inclusion of conversation history, can paradoxically increase variability; (4) Detailed persona instructions produce mixed effects, with misaligned personas showing significantly higher variability than the helpful assistant baseline; (5) The LLM-adapted questionnaires, despite their improved ecological validity, exhibit instability comparable to human-centric versions. This persistent instability across scales and mitigation strategies suggests that current LLMs lack the architectural foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that current alignment strategies may be inadequate.

[37] Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal Belief

Zeguan Xiao, Diyang Dou, Boya Xiong, Yun Chen, Guanhua Chen

Main category: cs.CL

TL;DR: EAGLE is a novel calibration method that extracts internal beliefs from multiple intermediate layers during LLM self-evaluation, aggregates them, and calculates expectation over confidence scores to produce more accurate uncertainty estimates.

Details

Motivation: LLMs often exhibit overconfidence and generate plausible yet incorrect answers, especially after RLHF training, posing challenges for reliable uncertainty estimation and safe deployment.

Method: EAGLE extracts internal beliefs from multiple intermediate layers during self-evaluation, aggregates these layer-wise beliefs, and calculates expectation over the resulting confidence score distribution to produce refined confidence scores.

Result: Extensive experiments on diverse datasets and LLMs demonstrate that EAGLE significantly improves calibration performance over existing baselines.

Conclusion: EAGLE provides a more faithful reflection of LLMs’ internal certainty through layer-wise belief aggregation and expectation calculation, offering improved uncertainty estimation for safer deployment.

Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language tasks, but often exhibit overconfidence and generate plausible yet incorrect answers. This overconfidence, especially in models undergone Reinforcement Learning from Human Feedback (RLHF), poses significant challenges for reliable uncertainty estimation and safe deployment. In this paper, we propose EAGLE (Expectation of AGgregated internaL bEief), a novel self-evaluation-based calibration method that leverages the internal hidden states of LLMs to derive more accurate confidence scores. Instead of relying on the model’s final output, our approach extracts internal beliefs from multiple intermediate layers during self-evaluation. By aggregating these layer-wise beliefs and calculating the expectation over the resulting confidence score distribution, EAGLE produces a refined confidence score that more faithfully reflects the model’s internal certainty. Extensive experiments on diverse datasets and LLMs demonstrate that EAGLE significantly improves calibration performance over existing baselines. We also provide an in-depth analysis of EAGLE, including a layer-wise examination of uncertainty patterns, a study of the impact of self-evaluation prompts, and an analysis of the effect of self-evaluation score range.

[38] Thematic Dispersion in Arabic Applied Linguistics: A Bibliometric Analysis using Brookes’ Measure

Ayman Eddakrouri, Amani Ramadan

Main category: cs.CL

TL;DR: Arabic Applied Linguistics research shows extreme thematic dispersion (Δ=0.194), indicating high heterogeneity across eight sub-disciplines rather than concentration in specific areas.

Details

Motivation: To analyze the thematic structure of contemporary Arabic Applied Linguistics research using Brookes' Measure of Categorical Dispersion to understand whether the field is concentrated or dispersed across sub-disciplines.

Method: Applied Brookes’ Measure of Categorical Dispersion (Δ) to a comprehensive dataset of 1,564 publications (2019-2025) classified into eight core sub-disciplines of Arabic Applied Linguistics.

Result: Found extremely low dispersion index of Δ = 0.194, indicating pronounced thematic heterogeneity. Computational Linguistics emerged as dominant but non-hegemonic, coexisting with robust research in Sociolinguistics, Language Teaching, and other subfields.

Conclusion: The field exhibits extreme thematic dispersion rather than concentration. The study clarifies Brookes’ formula application, demonstrates its utility for field characterization, and provides a replicable bibliometric methodology for assessing disciplinary structure.

Abstract: This study applies Brookes’ Measure of Categorical Dispersion (Δ) to analyze the thematic structure of contemporary Arabic Applied Linguistics research. Using a comprehensive, real-world dataset of 1,564 publications from 2019 to 2025, classified into eight core sub-disciplines, we calculate a dispersion index of Δ = 0.194. This remarkably low value indicates extreme thematic dispersion, revealing that the field is characterized by pronounced heterogeneity rather than concentration. The analysis identifies Computational Linguistics as a dominant but non-hegemonic force, coexisting with robust research in Sociolinguistics, Language Teaching, and other subfields. This study clarifies the correct application of Brookes’ original formula, demonstrates its utility for field characterization, and provides a replicable bibliometric methodology for assessing disciplinary structure across domains.

[39] DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation

Shijian Ma, Yunqi Huang, Yan Lin

Main category: cs.CL

TL;DR: DramaBench is a new benchmark for evaluating drama script continuation across six dimensions, addressing limitations of existing benchmarks.

Details

Motivation: Existing benchmarks fail to comprehensively evaluate drama script continuation capabilities like maintaining character consistency, advancing plot coherently, and preserving dramatic structure.

Method: Combines rule-based analysis with LLM-based labeling and statistical metrics to create an objective, reproducible evaluation framework across six dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling.

Result: Evaluated 8 state-of-the-art language models on 1,103 scripts (8,824 total evaluations) with rigorous statistical testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Ablation studies confirmed all six dimensions capture independent quality aspects (mean |r| = 0.020).

Conclusion: DramaBench provides actionable, dimension-specific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.

Abstract: Drama script continuation requires models to maintain character consistency, advance plot coherently, and preserve dramatic structurecapabilities that existing benchmarks fail to evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rulebased analysis with LLM-based labeling and statistical metrics, ensuring objective and reproducible evaluation. We conduct comprehensive evaluation of 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Our ablation studies confirm all six dimensions capture independent quality aspects (mean | r | = 0.020). DramaBench provides actionable, dimensionspecific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.

[40] AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards

Zihan Lin, Xiaohan Wang, Hexiong Yang, Jiajun Chai, Jie Cao, Guojun Yin, Wei Lin, Ran He

Main category: cs.CL

TL;DR: AWPO is a reinforcement learning framework that integrates explicit reasoning rewards with outcome rewards to enhance tool-use capability in LLMs, achieving SOTA performance with high parameter efficiency.

Details

Motivation: Existing RL methods for training tool-use LLMs overlook explicit reasoning rewards, and naively combining reasoning and outcome rewards can lead to suboptimal performance or conflicts with primary optimization objectives.

Method: Proposes Advantage-Weighted Policy Optimization (AWPO) with variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, plus a tailored clipping mechanism for stable optimization.

Result: AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and closed-source models. A 4B model surpasses Grok-4 by 16.0% in multi-turn accuracy while preserving generalization on out-of-distribution MMLU-Pro benchmark.

Conclusion: AWPO effectively integrates explicit reasoning rewards to enhance tool-use capability in LLMs, demonstrating superior performance and parameter efficiency compared to existing approaches.

Abstract: While reinforcement learning (RL) shows promise in training tool-use large language models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of explicit reasoning rewards to bolster reasoning and tool utilization. Furthermore, natively combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose advantage-weighted policy optimization (AWPO) – a principled RL framework that effectively integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0 percent in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.

[41] SiamGPT: Quality-First Fine-Tuning for Stable Thai Text Generation

Thittipat Pairatsuppawat, Abhibhu Tachaapornchai, Paweekorn Kusolsomboon, Chutikan Chaiwong, Thodsaporn Chay-intr, Kobkrit Viriyayudhakorn, Nongnuch Ketui, Aslan B. Wong

Main category: cs.CL

TL;DR: SiamGPT-32B is a Thai language model fine-tuned from Qwen3-32B using a Quality-First strategy that prioritizes curated supervision over data scale, achieving top performance among open-weights Thai models without additional pretraining.

Details

Motivation: Open-weights LLMs perform well in English but struggle with Thai language generation under complex instructions, showing instability and poor instruction following despite strong English capabilities.

Method: Fine-tuned Qwen3-32B using a Quality-First strategy with translated high-complexity English instruction data and Thai-adapted AutoIF framework for instruction/linguistic constraints, using only supervised fine-tuning without continual pretraining or corpus expansion.

Result: SiamGPT-32B achieves strongest overall performance among similar-scale open-weights Thai models on SEA-HELM benchmark, with consistent gains in instruction following, multi-turn dialogue, and natural language understanding.

Conclusion: Quality-focused fine-tuning with curated supervision and Thai-adapted constraints effectively addresses Thai language generation instability, demonstrating that targeted adaptation rather than scale expansion can significantly improve instruction adherence and linguistic stability.

Abstract: Open-weights large language models remain difficult to deploy for Thai due to unstable generation under complex instructions, despite strong English performance. To mitigate these limitations, We present SiamGPT-32B, an open-weights model based on Qwen3-32B, fine-tuned with a Quality-First strategy emphasizing curated supervision over data scale. The fine-tuning pipeline combines translated high-complexity English instruction data with a Thai-adapted AutoIF framework for instruction and linguistic constraints. Using supervised fine-tuning only, without continual pretraining or corpus expansion, SiamGPT-32B improves instruction adherence, multi-turn robustness, and linguistic stability. Evaluations on the SEA-HELM benchmark show that SiamGPT-32B achieves the strongest overall performance among similar-scale open-weights Thai models, with consistent gains in instruction following, multi-turn dialogue, and natural language understanding.

[42] GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators

Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang, Xinzhe Juan, Jiahao Qiu, Ke Shen, Mengdi Wang

Main category: cs.CL

TL;DR: GenEnv is a framework that creates a co-evolutionary game between LLM agents and generative environment simulators, enabling dynamic task generation aligned with agent capabilities for more efficient training.

Details

Motivation: Training LLM agents is bottlenecked by high cost and static nature of real-world interaction data. Current methods use static datasets that don't adapt to agent learning progress.

Method: GenEnv establishes a difficulty-aligned co-evolutionary game between agent and generative environment simulator. The simulator acts as dynamic curriculum policy generating tasks tailored to agent’s “zone of proximal development” using α-Curriculum Reward to align task difficulty with current capabilities.

Result: Improves agent performance by up to +40.3% over 7B baselines, matches/exceeds average performance of larger models. Achieves better performance than Gemini 2.5 Pro-based offline data augmentation while using 3.3× less data across five benchmarks (API-Bank, ALFWorld, BFCL, Bamboogle, TravelPlanner).

Conclusion: GenEnv shifts from static supervision to adaptive simulation, providing a data-efficient pathway for scaling agent capabilities through dynamic, difficulty-aligned task generation.

Abstract: Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty-aligned co-evolutionary game between an agent and a scalable, generative environment simulator. Unlike traditional methods that evolve models on static datasets, GenEnv instantiates a dataevolving: the simulator acts as a dynamic curriculum policy, continuously generating tasks specifically tailored to the agent’s ``zone of proximal development’’. This process is guided by a simple but effective $α$-Curriculum Reward, which aligns task difficulty with the agent’s current capabilities. We evaluate GenEnv on five benchmarks, including API-Bank, ALFWorld, BFCL, Bamboogle, and TravelPlanner. Across these tasks, GenEnv improves agent performance by up to \textbf{+40.3%} over 7B baselines and matches or exceeds the average performance of larger models. Compared to Gemini 2.5 Pro-based offline data augmentation, GenEnv achieves better performance while using 3.3$\times$ less data. By shifting from static supervision to adaptive simulation, GenEnv provides a data-efficient pathway for scaling agent capabilities.

cs.CV

[43] PHANTOM: PHysical ANamorphic Threats Obstructing Connected Vehicle Mobility

Md Nahid Hasan Shuvo, Moinul Hossain

Main category: cs.CV

TL;DR: PHANTOM is a novel physical adversarial attack framework using anamorphic art to create perspective-dependent adversarial examples that fool object detectors in connected autonomous vehicles, causing both individual perception failures and network-wide communication disruption.

Details

Motivation: Connected autonomous vehicles rely on vision-based DNNs and V2X communication, but remain vulnerable to physical adversarial attacks. Current attacks often require model access or lack transferability across different detector architectures.

Method: PHANTOM uses anamorphic art principles to craft perspective-dependent adversarial examples that appear natural to humans but cause misclassification. It operates in black-box settings without model access and exploits geometric distortions. Evaluated across four detector architectures (YOLOv5, SSD, Faster R-CNN, RetinaNet) in CARLA simulator under varying speeds, weather, and lighting conditions.

Result: Achieves over 90% attack success rate under optimal conditions, maintains 60-80% effectiveness in degraded environments. Activates within 6-10 meters of target, providing insufficient time for safe maneuvering. In network simulations, triggers false emergency messages through V2X links, increasing Peak Age of Information by 68-89% and degrading safety-critical communication.

Conclusion: PHANTOM exposes critical vulnerabilities in both perception and communication layers of CAV ecosystems, demonstrating that physical adversarial attacks can cause both individual vehicle deception and network-wide disruption in connected vehicle systems.

Abstract: Connected autonomous vehicles (CAVs) rely on vision-based deep neural networks (DNNs) and low-latency (Vehicle-to-Everything) V2X communication to navigate safely and efficiently. Despite their advances, these systems remain vulnerable to physical adversarial attacks. In this paper, we introduce PHANTOM (PHysical ANamorphic Threats Obstructing connected vehicle Mobility), a novel framework for crafting and deploying perspective-dependent adversarial examples using \textit{anamorphic art}. PHANTOM exploits geometric distortions that appear natural to humans but are misclassified with high confidence by state-of-the-art object detectors. Unlike conventional attacks, PHANTOM operates in black-box settings without model access and demonstrates strong transferability across four diverse detector architectures (YOLOv5, SSD, Faster R-CNN, and RetinaNet). Comprehensive evaluation in CARLA across varying speeds, weather conditions, and lighting scenarios shows that PHANTOM achieves over 90% attack success rate under optimal conditions and maintains 60-80% effectiveness even in degraded environments. The attack activates within 6-10 meters of the target, providing insufficient time for safe maneuvering. Beyond individual vehicle deception, PHANTOM triggers network-wide disruption in CAV systems: SUMO-OMNeT++ co-simulation demonstrates that false emergency messages propagate through V2X links, increasing Peak Age of Information by 68-89% and degrading safety-critical communication. These findings expose critical vulnerabilities in both perception and communication layers of CAV ecosystems.

[44] Generating the Past, Present and Future from a Motion-Blurred Image

SaiKiran Tedla, Kelly Zhu, Trevor Canham, Felix Taubner, Michael S. Brown, Kiriakos N. Kutulakos, David B. Lindell

Main category: cs.CV

TL;DR: A new technique that uses pre-trained video diffusion models to recover videos from motion-blurred images, revealing scene dynamics before, during, and after capture.

Details

Motivation: Motion blur encodes information about scene and camera motion, but existing methods rely on handcrafted priors and struggle with complex dynamics. They also don't recover what occurred before or after image capture.

Method: Repurposes a pre-trained video diffusion model trained on internet-scale datasets to recover videos from motion-blurred images, revealing scene dynamics during capture and extending into past/future.

Result: Outperforms previous methods, generalizes to challenging in-the-wild images, and supports downstream tasks like recovering camera trajectories, object motion, and dynamic 3D scene structure.

Conclusion: The approach is robust and versatile, successfully recovering complex scene dynamics from motion blur by leveraging large-scale video priors, enabling temporal scene understanding beyond just deblurring.

Abstract: We seek to answer the question: what can a motion-blurred image reveal about a scene’s past, present, and future? Although motion blur obscures image details and degrades visual quality, it also encodes information about scene and camera motion during an exposure. Previous techniques leverage this information to estimate a sharp image from an input blurry one, or to predict a sequence of video frames showing what might have occurred at the moment of image capture. However, they rely on handcrafted priors or network architectures to resolve ambiguities in this inverse problem, and do not incorporate image and video priors on large-scale datasets. As such, existing methods struggle to reproduce complex scene dynamics and do not attempt to recover what occurred before or after an image was taken. Here, we introduce a new technique that repurposes a pre-trained video diffusion model trained on internet-scale datasets to recover videos revealing complex scene dynamics during the moment of capture and what might have occurred immediately into the past or future. Our approach is robust and versatile; it outperforms previous methods for this task, generalizes to challenging in-the-wild images, and supports downstream tasks such as recovering camera trajectories, object motion, and dynamic 3D scene structure. Code and data are available at https://blur2vid.github.io

[45] Learning to Refocus with Video Diffusion Models

SaiKiran Tedla, Zhoutong Zhang, Xuaner Zhang, Shumian Xin

Main category: cs.CV

TL;DR: A novel method for realistic post-capture refocusing using video diffusion models that generates focal stacks from single defocused images, enabling interactive focus adjustment after capture.

Details

Motivation: Autofocus systems often fail to capture intended subjects, and users frequently want to adjust focus after capture, but current solutions are limited. There's a need for realistic post-capture refocusing capabilities in everyday photography.

Method: Uses video diffusion models to generate perceptually accurate focal stacks from single defocused images. The focal stack is represented as a video sequence. The approach is supported by a large-scale focal stack dataset collected under diverse real-world smartphone conditions.

Result: Consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios. Enables interactive refocusing and unlocks various downstream applications. Code and dataset are publicly available.

Conclusion: The method paves the way for more advanced focus-editing capabilities in everyday photography by providing realistic post-capture refocusing using video diffusion models, supported by a comprehensive dataset for future research.

Abstract: Focus is a cornerstone of photography, yet autofocus systems often fail to capture the intended subject, and users frequently wish to adjust focus after capture. We introduce a novel method for realistic post-capture refocusing using video diffusion models. From a single defocused image, our approach generates a perceptually accurate focal stack, represented as a video sequence, enabling interactive refocusing and unlocking a range of downstream applications. We release a large-scale focal stack dataset acquired under diverse real-world smartphone conditions to support this work and future research. Our method consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, paving the way for more advanced focus-editing capabilities in everyday photography. Code and data are available at www.learn2refocus.github.io

[46] RANSAC Scoring Functions: Analysis and Reality Check

A. Shekhovtsov

Main category: cs.CV

TL;DR: The paper revisits scoring functions for geometric model fitting, analyzes MAGSAC++, and finds that all scoring functions perform similarly despite different theoretical derivations.

Details

Motivation: To critically analyze scoring functions for robust geometric model fitting, particularly examining the state-of-the-art MAGSAC++ method and establishing proper evaluation methodology.

Method: 1) Extends geometric error to spherical noises and analyzes mixture models with outliers; 2) Theoretically analyzes MAGSAC++ derivation; 3) Proposes experimental methodology using large or small random validation sets; 4) Compares various scoring functions including learned inlier distributions.

Result: 1) MAGSAC++ derivation lacks sound principles and is numerically equivalent to simple Gaussian-uniform likelihood; 2) All scoring functions perform identically in experiments; 3) MAGSAC++ is neither better performing than simpler methods nor less sensitive to threshold hyperparameter choice.

Conclusion: The state-of-the-art in scoring functions for robust geometric fitting needs comprehensive reevaluation, as theoretical differences don’t translate to practical performance differences, and simpler methods work equally well.

Abstract: We revisit the problem of assigning a score (a quality of fit) to candidate geometric models – one of the key components of RANSAC for robust geometric fitting. In a non-robust setting, the ``gold standard’’ scoring function, known as the geometric error, follows from a probabilistic model with Gaussian noises. We extend it to spherical noises. In a robust setting, we consider a mixture with uniformly distributed outliers and show that a threshold-based parameterization leads to a unified view of likelihood-based and robust M-estimators and associated local optimization schemes. Next we analyze MAGSAC++ which stands out for two reasons. First, it achieves the best results according to existing benchmarks. Second, it makes quite different modeling assumptions and derivation steps. We discovered, however that the derivation does not correspond to sound principles and the resulting score function is in fact numerically equivalent to a simple Gaussian-uniform likelihood, a basic model within the proposed framework. Finally, we propose an experimental methodology for evaluating scoring functions: assuming either a large validation set, or a small random validation set in expectation. We find that all scoring functions, including using a learned inlier distribution, perform identically. In particular, MAGSAC++ score is found to be neither better performing than simple contenders nor less sensitive to the choice of the threshold hyperparameter. Our theoretical and experimental analysis thus comprehensively revisit the state-of-the-art, which is critical for any future research seeking to improve the methods or apply them to other robust fitting problems.

[47] Chain-of-Anomaly Thoughts with Large Vision-Language Models

Pedro Domingos, João Pereira, Vasco Lopes, João Neves, David Semedo

Main category: cs.CV

TL;DR: CoAT framework introduces criminal bias into multi-agent reasoning to improve video surveillance anomaly detection, boosting F1-score by 11.8 p.p. on low-res footage and classification by 3.78 p.p. on high-res videos.

Details

Motivation: Current Large Vision-Language Models for video surveillance have inherent bias towards normality, failing to detect crimes, and Chain-of-Thought reasoning lacks inductive anomaly biases that steer models toward normal interpretations.

Method: Proposed Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that introduces inductive criminal bias through a final anomaly-focused classification layer in the reasoning process.

Result: Significant improvement in Anomaly Detection with 11.8 percentage point F1-score boost on challenging low-resolution footage, and 3.78 percentage point improvement in Anomaly Classification for high-resolution videos.

Conclusion: Introducing inductive criminal bias through the CoAT framework effectively addresses the normality bias in automated video surveillance systems, substantially improving both anomaly detection and classification performance.

Abstract: Automated video surveillance with Large Vision-Language Models is limited by their inherent bias towards normality, often failing to detect crimes. While Chain-of-Thought reasoning strategies show significant potential for improving performance in language tasks, the lack of inductive anomaly biases in their reasoning further steers the models towards normal interpretations. To address this, we propose Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that introduces inductive criminal bias in the reasoning process through a final, anomaly-focused classification layer. Our method significantly improves Anomaly Detection, boosting F1-score by 11.8 p.p. on challenging low-resolution footage and Anomaly Classification by 3.78 p.p. in high-resolution videos.

[48] DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation

Jingqi Tian, Yiheng Du, Haoji Zhang, Yuji Wang, Isaac Ning Lee, Xulong Bai, Tianrui Zhu, Jingxuan Niu, Yansong Tang

Main category: cs.CV

TL;DR: DDAVS is a novel audio-visual segmentation framework that addresses multi-source entanglement and audio-visual misalignment through disentangled audio semantics and delayed bidirectional alignment.

Details

Motivation: Existing AVS methods suffer from multi-source entanglement (confusing multiple sound sources) and audio-visual misalignment, causing biases toward louder/larger objects while overlooking weaker/smaller/co-occurring sources.

Method: Proposes DDAVS with: 1) Learnable queries to extract audio semantics anchored in semantic space from audio prototype memory bank, optimized via contrastive learning; 2) Dual cross-attention with delayed modality interaction for robust multimodal alignment.

Result: Extensive experiments on AVS-Objects and VPO benchmarks show DDAVS consistently outperforms existing approaches across single-source, multi-source, and multi-instance scenarios.

Conclusion: DDAVS effectively addresses key AVS challenges, demonstrating strong generalization ability under challenging real-world audio-visual segmentation conditions.

Abstract: Audio-Visual Segmentation (AVS) aims to localize sound-producing objects at the pixel level by jointly leveraging auditory and visual information. However, existing methods often suffer from multi-source entanglement and audio-visual misalignment, which lead to biases toward louder or larger objects while overlooking weaker, smaller, or co-occurring sources. To address these challenges, we propose DDAVS, a Disentangled Audio Semantics and Delayed Bidirectional Alignment framework. To mitigate multi-source entanglement, DDAVS employs learnable queries to extract audio semantics and anchor them within a structured semantic space derived from an audio prototype memory bank. This is further optimized through contrastive learning to enhance discriminability and robustness. To alleviate audio-visual misalignment, DDAVS introduces dual cross-attention with delayed modality interaction, improving the robustness of multimodal alignment. Extensive experiments on the AVS-Objects and VPO benchmarks demonstrate that DDAVS consistently outperforms existing approaches, exhibiting strong performance across single-source, multi-source, and multi-instance scenarios. These results validate the effectiveness and generalization ability of our framework under challenging real-world audio-visual segmentation conditions. Project page: https://trilarflagz.github.io/DDAVS-page/

[49] Progressive Learned Image Compression for Machine Perception

Jungwoo Kim, Jun-Hyuk Kim, Jong-Seok Lee

Main category: cs.CV

TL;DR: PICM-Net is a progressive learned image compression codec for machine perception using trit-plane coding, enabling fine granular scalability and adaptive decoding based on downstream task confidence.

Details

Motivation: While learned image codecs have been extended to machine perception, progressive compression with fine granular scalability remains unexplored for machine-oriented codecs. There's a need for efficient progressive transmission that maintains machine task performance.

Method: Proposed PICM-Net based on trit-plane coding, analyzed human- vs machine-oriented rate-distortion priorities, and designed an adaptive decoding controller that dynamically determines necessary decoding level during inference to maintain desired downstream task confidence.

Result: Extensive experiments demonstrate efficient and adaptive progressive transmission while maintaining high performance in downstream classification tasks.

Conclusion: Establishes a new paradigm for machine-aware progressive image compression that enables fine granular scalability and adaptive decoding for real-world machine perception applications.

Abstract: Recent advances in learned image codecs have been extended from human perception toward machine perception. However, progressive image compression with fine granular scalability (FGS)-which enables decoding a single bitstream at multiple quality levels-remains unexplored for machine-oriented codecs. In this work, we propose a novel progressive learned image compression codec for machine perception, PICM-Net, based on trit-plane coding. By analyzing the difference between human- and machine-oriented rate-distortion priorities, we systematically examine the latent prioritization strategies in terms of machine-oriented codecs. To further enhance real-world adaptability, we design an adaptive decoding controller, which dynamically determines the necessary decoding level during inference time to maintain the desired confidence of downstream machine prediction. Extensive experiments demonstrate that our approach enables efficient and adaptive progressive transmission while maintaining high performance in the downstream classification task, establishing a new paradigm for machine-aware progressive image compression.

[50] HyGE-Occ: Hybrid View-Transformation with 3D Gaussian and Edge Priors for 3D Panoptic Occupancy Prediction

Jong Wook Kim, Wonseok Roh, Ha Dam Baek, Pilhyeon Lee, Jonghyun Choi, Sangpil Kim

Main category: cs.CV

TL;DR: HyGE-Occ introduces a hybrid view-transformation branch with 3D Gaussian and edge priors to improve geometric consistency and boundary awareness for 3D panoptic occupancy prediction.

Details

Motivation: Existing approaches for 3D panoptic occupancy prediction struggle with maintaining precise geometry and capturing accurate spatial ranges of 3D instances, which is critical for robust panoptic separation in complex environments.

Method: HyGE-Occ uses a hybrid view-transformation branch that fuses continuous Gaussian-based depth representation with discretized depth-bin formulation to produce BEV features with improved geometric consistency. It also extracts edge maps from BEV features as auxiliary information to learn edge cues for better boundary awareness.

Result: On the Occ3D-nuScenes dataset, HyGE-Occ outperforms existing work and demonstrates superior 3D geometric reasoning capabilities.

Conclusion: The proposed hybrid approach with Gaussian and edge priors effectively enhances both geometric consistency and boundary awareness in 3D panoptic occupancy prediction, leading to improved performance over existing methods.

Abstract: 3D Panoptic Occupancy Prediction aims to reconstruct a dense volumetric scene map by predicting the semantic class and instance identity of every occupied region in 3D space. Achieving such fine-grained 3D understanding requires precise geometric reasoning and spatially consistent scene representation across complex environments. However, existing approaches often struggle to maintain precise geometry and capture the precise spatial range of 3D instances critical for robust panoptic separation. To overcome these limitations, we introduce HyGE-Occ, a novel framework that leverages a hybrid view-transformation branch with 3D Gaussian and edge priors to enhance both geometric consistency and boundary awareness in 3D panoptic occupancy prediction. HyGE-Occ employs a hybrid view-transformation branch that fuses a continuous Gaussian-based depth representation with a discretized depth-bin formulation, producing BEV features with improved geometric consistency and structural coherence. In parallel, we extract edge maps from BEV features and use them as auxiliary information to learn edge cues. In our extensive experiments on the Occ3D-nuScenes dataset, HyGE-Occ outperforms existing work, demonstrating superior 3D geometric reasoning.

Alireza Moayedikia, Sattar Dorafshan

Main category: cs.CV

TL;DR: Multi-modal attention network fusing radar temporal patterns with thermal spatial signatures for bridge deck delamination detection, featuring uncertainty quantification and outperforming single-modal approaches.

Details

Motivation: Automated inspection of deteriorating civil infrastructure needs to overcome limitations of visual assessment. Single-modal approaches like Ground Penetrating Radar (moisture/surface issues) and Infrared Thermography (weather dependency/depth limits) have complementary constraints that multi-modal fusion can address.

Method: Multi-modal attention network with temporal attention for radar processing, spatial attention for thermal features, and cross-modal fusion with learnable embeddings. Incorporates uncertainty quantification via Monte Carlo dropout and learned variance estimation, decomposing uncertainty into epistemic and aleatoric components.

Result: On five bridge datasets with balanced to moderately imbalanced data, the approach substantially outperforms baselines in accuracy and AUC. Cross-modal attention provides critical gains beyond within-modality attention, and uncertainty quantification reduces calibration error enabling selective prediction. However, under extreme class imbalance, attention mechanisms show vulnerability to majority class collapse.

Conclusion: Attention-based architecture performs well across typical scenarios, while extreme imbalance requires specialized techniques. The system maintains deployment efficiency for real-time inspection with characterized capabilities and limitations, providing actionable guidance for practical implementation.

Abstract: Deteriorating civil infrastructure requires automated inspection techniques overcoming limitations of visual assessment. While Ground Penetrating Radar and Infrared Thermography enable subsurface defect detection, single modal approaches face complementary constraints radar struggles with moisture and shallow defects, while thermography exhibits weather dependency and limited depth. This paper presents a multi modal attention network fusing radar temporal patterns with thermal spatial signatures for bridge deck delamination detection. Our architecture introduces temporal attention for radar processing, spatial attention for thermal features, and cross modal fusion with learnable embeddings discovering complementary defect patterns invisible to individual sensors. We incorporate uncertainty quantification through Monte Carlo dropout and learned variance estimation, decomposing uncertainty into epistemic and aleatoric components for safety critical decisions. Experiments on five bridge datasets reveal that on balanced to moderately imbalanced data, our approach substantially outperforms baselines in accuracy and AUC representing meaningful improvements over single modal and concatenation based fusion. Ablation studies demonstrate cross modal attention provides critical gains beyond within modality attention, while multi head mechanisms achieve improved calibration. Uncertainty quantification reduces calibration error, enabling selective prediction by rejecting uncertain cases. However, under extreme class imbalance, attention mechanisms show vulnerability to majority class collapse. These findings provide actionable guidance: attention based architecture performs well across typical scenarios, while extreme imbalance requires specialized techniques. Our system maintains deployment efficiency, enabling real time inspection with characterized capabilities and limitations.

[52] Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs

Houston H. Zhang, Tao Zhang, Baoze Lin, Yuanqi Xue, Yincheng Zhu, Huan Liu, Li Gu, Linfeng Ye, Ziqiang Wang, Xinxin Zuo, Yang Wang, Yuanhao Yu, Zhixiang Chi

Main category: cs.CV

TL;DR: Widget2Code: A new benchmark and baseline system for generating executable code from widget UI images, addressing unique challenges of compact, context-free micro-interfaces with proprietary designs.

Details

Motivation: Widget UIs are underexplored in UI2Code research despite their unique challenges: they're compact, context-free micro-interfaces with dense layouts and iconography under strict spatial constraints, and lack accessible markup data unlike web/mobile UIs.

Method: Introduces Widget2Code benchmark with fine-grained metrics, then develops a baseline system with perceptual understanding (assembling atomic components with icon retrieval) and structured code generation via WidgetFactory infrastructure including WidgetDSL (framework-agnostic domain-specific language) and compiler for multiple front-end implementations with adaptive rendering.

Result: Benchmarking shows generalized MLLMs outperform specialized UI2Code methods but still produce unreliable code. The proposed baseline substantially enhances visual fidelity and establishes strong foundation for future Widget2Code research.

Conclusion: Widget2Code addresses a critical gap in UI2Code research, providing both a benchmark and unified infrastructure that advances perceptual understanding and structured code generation for widget interfaces, enabling more reliable and visually consistent code generation.

Abstract: User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with rich hierarchical context, widgets are compact, context-free micro-interfaces that summarize key information through dense layouts and iconography under strict spatial constraints. Moreover, while (image, code) pairs are widely available for web or mobile UIs, widget designs are proprietary and lack accessible markup. We formalize this setting as the Widget-to-Code (Widget2Code) and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code. To address these limitations, we develop a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, equipped with icon retrieval and reusable visualization modules. At the system level, we design an end-to-end infrastructure, WidgetFactory, which includes a framework-agnostic widget-tailored domain-specific language (WidgetDSL) and a compiler that translates it into multiple front-end implementations (e.g., React, HTML/CSS). An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints. Together, these contributions substantially enhance visual fidelity, establishing a strong baseline and unified infrastructure for future Widget2Code research.

[53] TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation

Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Joon Son Chung, Shinji Watanabe

Main category: cs.CV

TL;DR: TAVID is a unified framework that jointly generates synchronized interactive videos and conversational speech from text and reference images, addressing the multimodal nature of human conversation.

Details

Motivation: Current research treats talking/listening head generation and conversational speech generation as separate problems, overlooking the tightly coupled audio-visual interactions in human conversation. The authors aim to build more human-like conversational systems by addressing this multimodal gap.

Method: TAVID integrates face and speech generation pipelines through two cross-modal mappers: a motion mapper and a speaker mapper. These mappers enable bidirectional exchange of complementary information between audio and visual modalities for synchronized generation.

Result: Extensive experiments demonstrate effectiveness across four dimensions: talking face realism, listening head responsiveness, dyadic interaction fluency, and speech quality. The system performs well on all evaluated aspects.

Conclusion: TAVID successfully addresses the multimodal nature of human conversation by jointly generating synchronized interactive videos and conversational speech, representing an advancement toward more human-like conversational systems.

Abstract: The objective of this paper is to jointly synthesize interactive videos and conversational speech from text and reference images. With the ultimate goal of building human-like conversational systems, recent studies have explored talking or listening head generation as well as conversational speech generation. However, these works are typically studied in isolation, overlooking the multimodal nature of human conversation, which involves tightly coupled audio-visual interactions. In this paper, we introduce TAVID, a unified framework that generates both interactive faces and conversational speech in a synchronized manner. TAVID integrates face and speech generation pipelines through two cross-modal mappers (i.e., a motion mapper and a speaker mapper), which enable bidirectional exchange of complementary information between the audio and visual modalities. We evaluate our system across four dimensions: talking face realism, listening head responsiveness, dyadic interaction fluency, and speech quality. Extensive experiments demonstrate the effectiveness of our approach across all these aspects.

[54] Generative Latent Coding for Ultra-Low Bitrate Image Compression

Zhaoyang Jia, Jiahao Li, Bin Li, Houqiang Li, Yan Lu

Main category: cs.CV

TL;DR: GLC is a generative latent coding architecture that performs compression in VQ-VAE latent space instead of pixel space, achieving high realism and fidelity at ultra-low bitrates (<0.04 bpp).

Details

Motivation: Traditional pixel-space compression struggles with high-realism and high-fidelity at low bitrates because pixel-space distortion doesn't align well with human perception. Need better perceptual alignment for compression.

Method: Uses VQ-VAE generative latent space for transform coding (more sparse, semantic, and perception-aligned). Adds categorical hyper module to reduce hyper-information bit cost, and code-prediction supervision for semantic consistency.

Result: Achieves <0.04 bpp on natural images, <0.01 bpp on facial images. On CLIC2020, matches MS-ILLM FID with 45% fewer bits. Enables applications like image restoration and style transfer.

Conclusion: Generative latent space coding outperforms pixel-space methods for low-bitrate compression, offering better perceptual quality and enabling downstream applications.

Abstract: Most existing image compression approaches perform transform coding in the pixel space to reduce its spatial redundancy. However, they encounter difficulties in achieving both high-realism and high-fidelity at low bitrate, as the pixel-space distortion may not align with human perception. To address this issue, we introduce a Generative Latent Coding (GLC) architecture, which performs transform coding in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE), instead of in the pixel space. The generative latent space is characterized by greater sparsity, richer semantic and better alignment with human perception, rendering it advantageous for achieving high-realism and high-fidelity compression. Additionally, we introduce a categorical hyper module to reduce the bit cost of hyper-information, and a code-prediction-based supervision to enhance the semantic consistency. Experiments demonstrate that our GLC maintains high visual quality with less than 0.04 bpp on natural images and less than 0.01 bpp on facial images. On the CLIC2020 test set, we achieve the same FID as MS-ILLM with 45% fewer bits. Furthermore, the powerful generative latent space enables various applications built on our GLC pipeline, such as image restoration and style transfer. The code is available at https://github.com/jzyustc/GLC.

[55] Unified Brain Surface and Volume Registration

S. Mazdak Abulnaga, Andrew Hoopes, Malte Hoffmann, Robin Magnet, Maks Ovsjanikov, Lilla Zöllei, John Guttag, Bruce Fischl, Adrian Dalca

Main category: cs.CV

TL;DR: NeurAlign is a deep learning framework for joint cortical and subcortical brain MRI registration using unified volume-and-surface representation with spherical coordinate space, achieving better accuracy and speed than existing methods.

Details

Motivation: Traditional brain MRI registration methods treat volumetric and surface-based registration separately, leading to inconsistencies that limit downstream neuroscientific analyses. There's a need for a unified approach that aligns both cortical and subcortical regions consistently.

Method: NeurAlign uses deep learning with an intermediate spherical coordinate space to bridge anatomical surface topology with volumetric anatomy. It integrates spherical registration into learning to ensure geometric coherence between volume and surface domains, requiring only MRI scans as input.

Result: Outperforms both classical and machine learning-based methods, improving Dice score by up to 7 points while maintaining regular deformation fields. Orders of magnitude faster than standard methods and simpler to use with no additional inputs beyond MRI scans.

Conclusion: NeurAlign sets a new standard for joint cortical and subcortical registration with superior accuracy, fast inference, and ease of use, enabling more consistent cross-subject brain analysis.

Abstract: Accurate registration of brain MRI scans is fundamental for cross-subject analysis in neuroscientific studies. This involves aligning both the cortical surface of the brain and the interior volume. Traditional methods treat volumetric and surface-based registration separately, which often leads to inconsistencies that limit downstream analyses. We propose a deep learning framework, NeurAlign, that registers $3$D brain MRI images by jointly aligning both cortical and subcortical regions through a unified volume-and-surface-based representation. Our approach leverages an intermediate spherical coordinate space to bridge anatomical surface topology with volumetric anatomy, enabling consistent and anatomically accurate alignment. By integrating spherical registration into the learning, our method ensures geometric coherence between volume and surface domains. In a series of experiments on both in-domain and out-of-domain datasets, our method consistently outperforms both classical and machine learning-based registration methods – improving the Dice score by up to 7 points while maintaining regular deformation fields. Additionally, it is orders of magnitude faster than the standard method for this task, and is simpler to use because it requires no additional inputs beyond an MRI scan. With its superior accuracy, fast inference, and ease of use, NeurAlign sets a new standard for joint cortical and subcortical registration.

[56] Vehicle-centric Perception via Multimodal Structured Pre-training

Wentao Wu, Xiao Wang, Chenglong Li, Jin Tang, Bin Luo

Main category: cs.CV

TL;DR: VehicleMAE-V2 is a vehicle-centric pre-trained vision model that incorporates multimodal structured priors (symmetry, contour, semantics) to learn better vehicle representations, achieving state-of-the-art performance on downstream tasks.

Details

Motivation: Existing approaches lack effective learning of vehicle-related knowledge during pre-training, resulting in poor capability for modeling general vehicle perception representations needed for applications like surveillance, intelligent transportation, and autonomous driving.

Method: Proposes VehicleMAE-V2 with three key modules: 1) Symmetry-guided Mask Module (SMM) uses vehicle symmetry constraints to select high-quality masked patches, 2) Contour-guided Representation Module (CRM) preserves holistic vehicle structure through probability distribution alignment, and 3) Semantics-guided Representation Module (SRM) addresses feature confusion via image-text contrastive learning and cross-modal distillation.

Result: Extensive experiments on five downstream tasks demonstrate superior performance of VehicleMAE-V2. The model is pre-trained on Autobot4M, a large-scale dataset of ~4 million vehicle images and 12,693 text descriptions.

Conclusion: VehicleMAE-V2 effectively incorporates vehicle-related multimodal structured priors to enhance masked token reconstruction, enabling learning of generalizable representations for vehicle-centric perception tasks.

Abstract: Vehicle-centric perception plays a crucial role in many intelligent systems, including large-scale surveillance systems, intelligent transportation, and autonomous driving. Existing approaches lack effective learning of vehicle-related knowledge during pre-training, resulting in poor capability for modeling general vehicle perception representations. To handle this problem, we propose VehicleMAE-V2, a novel vehicle-centric pre-trained large model. By exploring and exploiting vehicle-related multimodal structured priors to guide the masked token reconstruction process, our approach can significantly enhance the model’s capability to learn generalizable representations for vehicle-centric perception. Specifically, we design the Symmetry-guided Mask Module (SMM), Contour-guided Representation Module (CRM) and Semantics-guided Representation Module (SRM) to incorporate three kinds of structured priors into token reconstruction including symmetry, contour and semantics of vehicles respectively. SMM utilizes the vehicle symmetry constraints to avoid retaining symmetric patches and can thus select high-quality masked image patches and reduce information redundancy. CRM minimizes the probability distribution divergence between contour features and reconstructed features and can thus preserve holistic vehicle structure information during pixel-level reconstruction. SRM aligns image-text features through contrastive learning and cross-modal distillation to address the feature confusion caused by insufficient semantic understanding during masked reconstruction. To support the pre-training of VehicleMAE-V2, we construct Autobot4M, a large-scale dataset comprising approximately 4 million vehicle images and 12,693 text descriptions. Extensive experiments on five downstream tasks demonstrate the superior performance of VehicleMAE-V2.

[57] Degradation-Aware Metric Prompting for Hyperspectral Image Restoration

Binfeng Wang, Di Wang, Haonan Guo, Ying Fu, Jing Zhang

Main category: cs.CV

TL;DR: DAMP is a unified HSI restoration framework that uses degradation metrics as prompts instead of explicit degradation labels, enabling adaptive restoration under diverse degradations through spatial-spectral adaptive modules and mixture-of-experts architecture.

Details

Motivation: Existing unified HSI restoration methods rely on explicit degradation priors/labels as prompts, which are difficult to obtain in real-world scenarios with complex mixed degradations.

Method: Proposes Degradation-Aware Metric Prompting (DAMP) framework with: 1) Spatial-spectral degradation metrics to quantify degradations as Degradation Prompts (DP), 2) Spatial-Spectral Adaptive Module (SSAM) for dynamic feature extraction, 3) Mixture-of-Experts architecture using DP as gating router and SSAM as experts.

Result: Achieves state-of-the-art performance on natural and remote sensing HSI datasets, demonstrating exceptional generalization capability under diverse, mixed, or unseen degradations.

Conclusion: DAMP provides a practical unified HSI restoration solution that doesn’t require explicit degradation priors, enabling robust performance in real-world scenarios with complex degradations.

Abstract: Unified hyperspectral image (HSI) restoration aims to recover various degraded HSIs using a single model, offering great practical value. However, existing methods often depend on explicit degradation priors (e.g., degradation labels) as prompts to guide restoration, which are difficult to obtain due to complex and mixed degradations in real-world scenarios. To address this challenge, we propose a Degradation-Aware Metric Prompting (DAMP) framework. Instead of relying on predefined degradation priors, we design spatial-spectral degradation metrics to continuously quantify multi-dimensional degradations, serving as Degradation Prompts (DP). These DP enable the model to capture cross-task similarities in degradation distributions and enhance shared feature learning. Furthermore, we introduce a Spatial-Spectral Adaptive Module (SSAM) that dynamically modulates spatial and spectral feature extraction through learnable parameters. By integrating SSAM as experts within a Mixture-of-Experts architecture, and using DP as the gating router, the framework enables adaptive, efficient, and robust restoration under diverse, mixed, or unseen degradations. Extensive experiments on natural and remote sensing HSI datasets show that DAMP achieves state-of-the-art performance and demonstrates exceptional generalization capability. Code is publicly available at https://github.com/MiliLab/DAMP.

[58] Block-Recurrent Dynamics in Vision Transformers

Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba, T. Andy Keller

Main category: cs.CV

TL;DR: Vision Transformers exhibit block-recurrent depth structure where computation can be approximated using far fewer distinct blocks applied recurrently, enabling dynamical systems analysis.

Details

Motivation: To develop a mechanistic understanding of Vision Transformers' computational behavior by interpreting their depth as a well-characterized dynamical flow rather than just architectural layers.

Method: Proposed Block-Recurrent Hypothesis (BRH) and trained Recurrent Approximations to Phase-structured TransfORmers (Raptor) models to test whether ViT computation can be rewritten using k « L distinct blocks applied recurrently.

Result: Demonstrated that ViTs exhibit few contiguous phases in depth, trained Raptor models recover 96% of DINOv2 ImageNet-1k accuracy with only 2 blocks, and revealed directional convergence, token-specific dynamics, and low-rank updates consistent with dynamical attractors.

Conclusion: Vision Transformers develop a compact recurrent program along depth, representing a low-complexity normative solution that enables principled dynamical systems analysis of these models.

Abstract: As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover $96%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where cls executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.

[59] SE360: Semantic Edit in 360$^\circ$ Panoramas via Hierarchical Data Construction

Haoyi Zhong, Fang-Lue Zhang, Andrew Chalmers, Taehyun Rhee

Main category: cs.CV

TL;DR: SE360: A novel framework for multi-condition guided object editing in 360° panoramas using autonomous data generation and Transformer-based diffusion model.

Details

Motivation: Existing instruction-based image editing methods produce implausible results when applied to 360° panoramas, both in equirectangular projections and perspective views, creating a need for specialized solutions.

Method: Proposes SE360 with: 1) Coarse-to-fine autonomous data generation pipeline using Vision-Language Model and adaptive projection adjustment for hierarchical analysis; 2) Two-stage data refinement strategy to improve realism; 3) Transformer-based diffusion model trained on the constructed dataset for flexible object editing guided by text, mask, or reference image.

Result: Outperforms existing methods in both visual quality and semantic accuracy for 360° panorama editing.

Conclusion: SE360 provides an effective framework for multi-condition guided object editing in 360° panoramas, addressing the unique challenges of panoramic image editing through autonomous data generation and specialized model training.

Abstract: While instruction-based image editing is emerging, extending it to 360$^\circ$ panoramas introduces additional challenges. Existing methods often produce implausible results in both equirectangular projections (ERP) and perspective views. To address these limitations, we propose SE360, a novel framework for multi-condition guided object editing in 360$^\circ$ panoramas. At its core is a novel coarse-to-fine autonomous data generation pipeline without manual intervention. This pipeline leverages a Vision-Language Model (VLM) and adaptive projection adjustment for hierarchical analysis, ensuring the holistic segmentation of objects and their physical context. The resulting data pairs are both semantically meaningful and geometrically consistent, even when sourced from unlabeled panoramas. Furthermore, we introduce a cost-effective, two-stage data refinement strategy to improve data realism and mitigate model overfitting to erase artifacts. Based on the constructed dataset, we train a Transformer-based diffusion model to allow flexible object editing guided by text, mask, or reference image in 360$^\circ$ panoramas. Our experiments demonstrate that our method outperforms existing methods in both visual quality and semantic accuracy.

[60] How Much 3D Do Video Foundation Models Encode?

Zixuan Huang, Xiang Li, Zhaoyang Lv, James M. Rehg

Main category: cs.CV

TL;DR: The paper investigates whether Video Foundation Models (VidFMs) trained on large video datasets naturally develop 3D understanding, and finds that state-of-the-art video generation models show strong 3D awareness that can surpass specialized 3D models.

Details

Motivation: To determine if global 3D understanding emerges naturally in Video Foundation Models trained on large-scale video data, and to quantify the level of 3D awareness these models possess without explicit 3D supervision.

Method: Proposes a model-agnostic framework that measures 3D awareness of various VidFMs by estimating multiple 3D properties from their features using shallow read-out networks, enabling benchmarking across different models.

Result: State-of-the-art video generation models exhibit strong understanding of 3D objects and scenes despite no 3D training, and this understanding can even surpass large expert models specifically trained for 3D tasks.

Conclusion: Video Foundation Models naturally develop significant 3D awareness from 2D video training, providing valuable insights for building scalable 3D models and suggesting that 3D understanding emerges as a byproduct of large-scale video pretraining.

Abstract: Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.

[61] BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes

Minkyun Seo, Hyungtae Lim, Kanghee Lee, Luca Carlone, Jaesik Park

Main category: cs.CV

TL;DR: BUFFER-X is a zero-shot point cloud registration pipeline that achieves substantial generalization across diverse environments without retraining or manual parameter tuning.

Details

Motivation: Current deep learning-based point cloud registration methods have limited generalization, requiring retraining or manual parameter tuning for each new environment due to three key factors: reliance on environment-specific parameters, poor out-of-domain robustness of learned keypoint detectors, and raw coordinate usage that exacerbates scale discrepancies.

Method: The pipeline addresses these issues by: (a) adaptively determining voxel size and search radii, (b) using farthest point sampling instead of learned keypoint detectors, and (c) employing patch-wise scale normalization for consistent coordinate bounds. It features multi-scale patch-based descriptor generation and hierarchical inlier search across scales.

Result: The method demonstrates substantial generalization across 11 diverse indoor/outdoor datasets covering various sensor modalities, achieving robust performance without prior information or manual parameter tuning for test datasets.

Conclusion: BUFFER-X presents an effective zero-shot registration solution that overcomes key generalization limitations in point cloud registration, with the authors also proposing a novel generalizability benchmark using diverse datasets to evaluate such methods.

Abstract: Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. To address these issues, we present a zero-shot registration pipeline called BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using farthest point sampling to bypass learned detectors, and (c) leveraging patch-wise scale normalization for consistent coordinate bounds. In particular, we present a multi-scale patch-based descriptor generation and a hierarchical inlier search across scales to improve robustness in diverse scenes. We also propose a novel generalizability benchmark using 11 datasets that cover various indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X achieves substantial generalization without prior information or manual parameter tuning for the test datasets. Our code is available at https://github.com/MIT-SPARK/BUFFER-X.

[62] HistoWAS: A Pathomics Framework for Large-Scale Feature-Wide Association Studies of Tissue Topology and Patient Outcomes

Yuechen Yang, Junlin Guo, Yanfan Zhu, Jialin Yue, Junchao Zhu, Yu Wang, Shilin Zhao, Haichun Yang, Xingyi Guo, Jovan Tanevski, Laura Barisoni, Avi Z. Rosenberg, Yuankai Huo

Main category: cs.CV

TL;DR: HistoWAS is a computational framework that links tissue spatial organization to clinical outcomes by augmenting conventional pathology features with 30 topological/spatial features from GIS analysis and performing mass univariate regression with statistical correction.

Details

Motivation: Current high-throughput pathomic analysis of Whole Slide Images lacks tools to measure spatial interactions of tissue characteristics and their association with clinical parameters, limiting clinical relevance of tissue micro- and macro-environment analysis.

Method: 1) Feature space augmentation: Adds 30 topological and spatial features adapted from Geographic Information Systems point pattern analysis to conventional metrics. 2) Association study engine: Implements mass univariate regression for each feature with statistical correction, inspired by Phenome-Wide Association Studies (PheWAS).

Result: Applied HistoWAS to 385 PAS-stained WSIs from 206 KPMP participants, analyzing 102 features (72 conventional + 30 spatial features). Code and data released publicly on GitHub.

Conclusion: HistoWAS provides a computational framework to bridge tissue spatial organization with clinical outcomes, enabling biomarker discovery and advancing pathomic analysis through spatial feature quantification and statistical association testing.

Abstract: High-throughput “pathomic” analysis of Whole Slide Images (WSIs) offers new opportunities to study tissue characteristics and for biomarker discovery. However, the clinical relevance of the tissue characteristics at the micro- and macro-environment level is limited by the lack of tools that facilitate the measurement of the spatial interaction of individual structure characteristics and their association with clinical parameters. To address these challenges, we introduce HistoWAS (Histology-Wide Association Study), a computational framework designed to link tissue spatial organization to clinical outcomes. Specifically, HistoWAS implements (1) a feature space that augments conventional metrics with 30 topological and spatial features, adapted from Geographic Information Systems (GIS) point pattern analysis, to quantify tissue micro-architecture; and (2) an association study engine, inspired by Phenome-Wide Association Studies (PheWAS), that performs mass univariate regression for each feature with statistical correction. As a proof of concept, we applied HistoWAS to analyze a total of 102 features (72 conventional object-level features and our 30 spatial features) using 385 PAS-stained WSIs from 206 participants in the Kidney Precision Medicine Project (KPMP). The code and data have been released to https://github.com/hrlblab/histoWAS.

[63] From Binary to Semantic: Utilizing Large-Scale Binary Occupancy Data for 3D Semantic Occupancy Prediction

Chihiro Noguchi, Takaki Yamamoto

Main category: cs.CV

TL;DR: A framework that leverages large-scale binary occupancy data (cheaper to collect than semantic occupancy) to improve 3D semantic occupancy prediction through pre-training and auto-labeling.

Details

Motivation: 3D semantic occupancy prediction is crucial for vision-centric autonomous driving but requires expensive LiDAR annotations. Binary occupancy data (occupied/free space) is cheaper and more available, but its potential hasn't been explored for improving semantic occupancy prediction.

Method: Proposes a binary occupancy-based framework that decomposes prediction into two modules: binary occupancy module (uses binary data) and semantic occupancy module (uses semantic data). Enables leveraging binary data through pre-training and learning-based auto-labeling approaches.

Result: The proposed framework outperforms existing methods in both pre-training and auto-labeling tasks, demonstrating effectiveness in enhancing 3D semantic occupancy prediction.

Conclusion: Leveraging large-scale binary occupancy data through the proposed decomposition framework significantly improves 3D semantic occupancy prediction while reducing annotation costs, making it valuable for vision-centric autonomous driving systems.

Abstract: Accurate perception of the surrounding environment is essential for safe autonomous driving. 3D occupancy prediction, which estimates detailed 3D structures of roads, buildings, and other objects, is particularly important for vision-centric autonomous driving systems that do not rely on LiDAR sensors. However, in 3D semantic occupancy prediction – where each voxel is assigned a semantic label – annotated LiDAR point clouds are required, making data acquisition costly. In contrast, large-scale binary occupancy data, which only indicate occupied or free space without semantic labels, can be collected at a lower cost. Despite their availability, the potential of leveraging such data remains unexplored. In this study, we investigate the utilization of large-scale binary occupancy data from two perspectives: (1) pre-training and (2) learning-based auto-labeling. We propose a novel binary occupancy-based framework that decomposes the prediction process into binary and semantic occupancy modules, enabling effective use of binary occupancy data. Our experimental results demonstrate that the proposed framework outperforms existing methods in both pre-training and auto-labeling tasks, highlighting its effectiveness in enhancing 3D semantic occupancy prediction. The code will be available at https://github.com/ToyotaInfoTech/b2s-occupancy

[64] WSD-MIL: Window Scale Decay Multiple Instance Learning for Whole Slide Image Classification

Le Feng, Li Xiao

Main category: cs.CV

TL;DR: WSD-MIL is a novel multiple instance learning approach for computational pathology that uses window scale decay attention to efficiently model tumor regions at varying scales while reducing computational memory by 62%.

Details

Motivation: Existing MIL methods in computational pathology overlook complex semantic relationships among instances in whole slide images. Transformer-based approaches have quadratic complexity that limits scalability to large WSIs, and fixed-scale attention mechanisms struggle with varying tumor region scales and fail to account for distance-based decay of patch relevance.

Method: WSD-MIL consists of two main components: 1) Window scale decay based attention module using cluster-based sampling to reduce computational costs while progressively decaying attention window-scale to capture local instance relationships at varying scales; 2) Squeeze-and-excitation based region gate module that dynamically adjusts window weights to enhance global information modeling.

Result: WSD-MIL achieves state-of-the-art performance on CAMELYON16 and TCGA-BRCA datasets while reducing computational memory by 62% compared to existing methods.

Conclusion: The proposed WSD-MIL effectively addresses the limitations of existing Transformer-based MIL methods by efficiently modeling tumor regions at varying scales with reduced computational complexity, making it suitable for large-scale whole slide image analysis in computational pathology.

Abstract: In recent years, the integration of pre-trained foundational models with multiple instance learning (MIL) has improved diagnostic accuracy in computational pathology. However, existing MIL methods focus on optimizing feature extractors and aggregation strategies while overlooking the complex semantic relationships among instances within whole slide image (WSI). Although Transformer-based MIL approaches aiming to model instance dependencies, the quadratic computational complexity limits their scalability to large-scale WSIs. Moreover, due to the pronounced variations in tumor region scales across different WSIs, existing Transformer-based methods employing fixed-scale attention mechanisms face significant challenges in precisely capturing local instance correlations and fail to account for the distance-based decay effect of patch relevance. To address these challenges, we propose window scale decay MIL (WSD-MIL), designed to enhance the capacity to model tumor regions of varying scales while improving computational efficiency. WSD-MIL comprises: 1) a window scale decay based attention module, which employs a cluster-based sampling strategy to reduce computational costs while progressively decaying attention window-scale to capture local instance relationships at varying scales; and 2) a squeeze-and-excitation based region gate module, which dynamically adjusts window weights to enhance global information modeling. Experimental results demonstrate that WSD-MIL achieves state-of-the-art performance on the CAMELYON16 and TCGA-BRCA datasets while reducing 62% of the computational memory. The code will be publicly available.

[65] WaveletGaussian: Wavelet-domain Diffusion for Sparse-view 3D Gaussian Object Reconstruction

Hung Nguyen, Runfa Li, An Le, Truong Nguyen

Main category: cs.CV

TL;DR: WaveletGaussian improves sparse-view 3D Gaussian Splatting by applying diffusion only to low-resolution wavelet subbands and using lightweight networks for high-frequency refinement, achieving competitive quality with significantly reduced training time.

Details

Motivation: 3D Gaussian Splatting performs poorly in sparse-view settings, and existing solutions using diffusion models for render repair are computationally expensive due to fine-tuning and repair steps.

Method: Shift diffusion to wavelet domain: apply diffusion only to low-resolution LL subband, refine high-frequency subbands with lightweight network. Use efficient online random masking strategy for training pairs instead of inefficient leave-one-out approach.

Result: Experiments on Mip-NeRF 360 and OmniObject3D datasets show competitive rendering quality while substantially reducing training time compared to previous methods.

Conclusion: WaveletGaussian provides an efficient framework for sparse-view 3D Gaussian object reconstruction by leveraging wavelet domain processing and optimized training strategies.

Abstract: 3D Gaussian Splatting (3DGS) has become a powerful representation for image-based object reconstruction, yet its performance drops sharply in sparse-view settings. Prior works address this limitation by employing diffusion models to repair corrupted renders, subsequently using them as pseudo ground truths for later optimization. While effective, such approaches incur heavy computation from the diffusion fine-tuning and repair steps. We present WaveletGaussian, a framework for more efficient sparse-view 3D Gaussian object reconstruction. Our key idea is to shift diffusion into the wavelet domain: diffusion is applied only to the low-resolution LL subband, while high-frequency subbands are refined with a lightweight network. We further propose an efficient online random masking strategy to curate training pairs for diffusion fine-tuning, replacing the commonly used, but inefficient, leave-one-out strategy. Experiments across two benchmark datasets, Mip-NeRF 360 and OmniObject3D, show WaveletGaussian achieves competitive rendering quality while substantially reducing training time.

[66] A Novel CNN Gradient Boosting Ensemble for Guava Disease Detection

Tamim Ahasan Rijon, Yeasin Arafath

Main category: cs.CV

TL;DR: The paper proposes an ensemble model combining CNN with Gradient Boosting Machine to detect guava diseases (Healthy, Fruit Flies, Anthracnose) with ~99.99% accuracy using the GFDD24 dataset from Bangladesh.

Details

Motivation: Bangladesh relies heavily on guava cultivation for economic development, but anthracnose and fruit fly infections reduce quality and productivity. Early disease detection through expert systems can minimize losses and protect harvests.

Method: Developed models using CNN combined with traditional machine learning techniques. Proposed ensemble models that combine CNNML with Gradient Boosting Machine for classification of guava diseases using the GFDD24 dataset from Rajshahi and Pabna plantations.

Result: Achieved highest classification accuracy of approximately 99.99% for the guava dataset. The CNN-ML cascade framework demonstrates strong, high-accuracy guava disease detection suitable for real-time agricultural monitoring systems.

Conclusion: The proposed ensemble model combining CNN with Gradient Boosting Machine provides highly accurate guava disease detection that can be effectively deployed in real-time agricultural monitoring systems to support Bangladesh’s guava cultivation and economic development.

Abstract: As a significant agricultural country, Bangladesh utilizes its fertile land for guava cultivation and dedicated labor to boost its economic development. In a nation like Bangladesh, enhancing guava production and agricultural practices plays a crucial role in its economy. Anthracnose and fruit fly infection can lower the quality and productivity of guava, a crucial tropical fruit. Expert systems that detect diseases early can reduce losses and safeguard the harvest. Images of guava fruits classified into the Healthy, Fruit Flies, and Anthracnose classes are included in the Guava Fruit Disease Dataset 2024 (GFDD24), which comes from plantations in Rajshahi and Pabna, Bangladesh. This study aims to create models using CNN alongside traditional machine learning techniques that can effectively identify guava diseases in locally cultivated varieties in Bangladesh. In order to achieve the highest classification accuracy of approximately 99.99% for the guava dataset, we propose utilizing ensemble models that combine CNNML with Gradient Boosting Machine. In general, the CNN-ML cascade framework exhibits strong, high-accuracy guava disease detection that is appropriate for real-time agricultural monitoring systems.

[67] Neural Implicit Heart Coordinates: 3D cardiac shape reconstruction from sparse segmentations

Marica Muffoletto, Uxio Hermida, Charlène Mauger, Avan Suinesiaputra, Yiyang Xu, Richard Burns, Lisa Pankewitz, Andrew D McCulloch, Steffen E Petersen, Daniel Rueckert, Alistair A Young

Main category: cs.CV

TL;DR: NIHCs create a standardized implicit coordinate system for cardiac anatomy that enables accurate 3D reconstruction from sparse 2D segmentations, achieving high accuracy with fast inference.

Details

Motivation: Accurate reconstruction of cardiac anatomy from sparse clinical images is challenging. While neural implicit functions have been applied, mapping anatomical consistency across subjects has been limited. There's a need for a common anatomical reference frame for the human heart.

Method: Introduces Neural Implicit Heart Coordinates (NIHCs), a standardized implicit coordinate system based on universal ventricular coordinates. The method predicts NIHCs directly from limited 2D segmentations (sparse acquisition) and decodes them into dense 3D segmentations and high-resolution meshes at arbitrary output resolution. Trained on 5,000 cardiac meshes.

Result: Achieves high reconstruction accuracy: mean Euclidean surface errors of 2.51±0.33 mm in diseased cohort (n=4549) and 2.3±0.36 mm in healthy cohort (n=5576). Enables anatomically coherent reconstruction under severe slice sparsity and segmentation noise, recovering complex structures like valve planes. Reduces inference time from over 60s to 5-15s compared to traditional pipelines.

Conclusion: NIHCs constitute a robust and efficient anatomical representation for patient-specific 3D cardiac reconstruction from minimal input data, demonstrating significant improvements in accuracy, robustness, and speed over traditional methods.

Abstract: Accurate reconstruction of cardiac anatomy from sparse clinical images remains a major challenge in patient-specific modeling. While neural implicit functions have previously been applied to this task, their application to mapping anatomical consistency across subjects has been limited. In this work, we introduce Neural Implicit Heart Coordinates (NIHCs), a standardized implicit coordinate system, based on universal ventricular coordinates, that provides a common anatomical reference frame for the human heart. Our method predicts NIHCs directly from a limited number of 2D segmentations (sparse acquisition) and subsequently decodes them into dense 3D segmentations and high-resolution meshes at arbitrary output resolution. Trained on a large dataset of 5,000 cardiac meshes, the model achieves high reconstruction accuracy on clinical contours, with mean Euclidean surface errors of 2.51$\pm$0.33 mm in a diseased cohort (n=4549) and 2.3$\pm$0.36 mm in a healthy cohort (n=5576). The NIHC representation enables anatomically coherent reconstruction even under severe slice sparsity and segmentation noise, faithfully recovering complex structures such as the valve planes. Compared with traditional pipelines, inference time is reduced from over 60 s to 5-15 s. These results demonstrate that NIHCs constitute a robust and efficient anatomical representation for patient-specific 3D cardiac reconstruction from minimal input data.

[68] A Dual-Branch Local-Global Framework for Cross-Resolution Land Cover Mapping

Peng Gao, Ke Li, Di Wang, Yongshan Zhu, Yiming Zhang, Xuemei Luo, Yifeng Wang

Main category: cs.CV

TL;DR: DDTM: A dual-branch weakly supervised framework for cross-resolution land cover mapping that decouples local semantic refinement from global contextual reasoning to address resolution mismatch issues.

Details

Motivation: Cross-resolution land cover mapping faces severe resolution mismatch between coarse/low-resolution supervision and high-resolution predictions, causing existing weakly supervised methods to struggle with aligning fine-grained spatial structures with coarse labels, leading to noisy supervision and degraded accuracy.

Method: DDTM uses a dual-branch framework: 1) diffusion-based branch for progressive fine-scale local semantic refinement under coarse supervision, 2) transformer-based branch for long-range contextual consistency across large spatial extents, plus a pseudo-label confidence evaluation module to mitigate noise from cross-resolution inconsistencies.

Result: DDTM achieves state-of-the-art performance on Chesapeake Bay benchmark with 66.52% mIoU, substantially outperforming prior weakly supervised methods.

Conclusion: DDTM effectively addresses cross-resolution land cover mapping challenges by decoupling local refinement from global reasoning, establishing a new state-of-the-art approach for weakly supervised semantic prediction from coarse supervision.

Abstract: Cross-resolution land cover mapping aims to produce high-resolution semantic predictions from coarse or low-resolution supervision, yet the severe resolution mismatch makes effective learning highly challenging. Existing weakly supervised approaches often struggle to align fine-grained spatial structures with coarse labels, leading to noisy supervision and degraded mapping accuracy. To tackle this problem, we propose DDTM, a dual-branch weakly supervised framework that explicitly decouples local semantic refinement from global contextual reasoning. Specifically, DDTM introduces a diffusion-based branch to progressively refine fine-scale local semantics under coarse supervision, while a transformer-based branch enforces long-range contextual consistency across large spatial extents. In addition, we design a pseudo-label confidence evaluation module to mitigate noise induced by cross-resolution inconsistencies and to selectively exploit reliable supervisory signals. Extensive experiments demonstrate that DDTM establishes a new state-of-the-art on the Chesapeake Bay benchmark, achieving 66.52% mIoU and substantially outperforming prior weakly supervised methods. The code is available at https://github.com/gpgpgp123/DDTM.

[69] Few-Shot-Based Modular Image-to-Video Adapter for Diffusion Models

Zhenhao Li, Shaohan Yi, Zheng Liu, Leonartinus Gao, Minh Ngoc Le, Ambrose Ling, Zhuoran Wang, Md Amirul Islam, Zhixiang Chi, Yuanhao Yu

Main category: cs.CV

TL;DR: MIVA is a lightweight modular adapter system for diffusion models that enables precise motion control for image animation with minimal training data (≈10 samples) and no prompt engineering.

Details

Motivation: Diffusion models struggle with image animation due to video data scarcity causing memorization over prompt compliance, and poor generalization to novel motion patterns not in training data. Fine-tuning with limited data is under-explored.

Method: Proposes Modular Image-to-Video Adapter (MIVA) - lightweight sub-networks attachable to pre-trained DMs, each capturing a single motion pattern. Scalable via parallelization, trainable with ≈10 samples on consumer GPU.

Result: MIVA enables more precise motion control while maintaining or surpassing generation quality of models trained on much larger datasets. Users select motion patterns via MIVA modules without prompt engineering.

Conclusion: MIVA addresses key limitations of diffusion models for image animation by providing efficient, modular motion control with minimal training data requirements, eliminating the need for complex prompt engineering.

Abstract: Diffusion models (DMs) have recently achieved impressive photorealism in image and video generation. However, their application to image animation remains limited, even when trained on large-scale datasets. Two primary challenges contribute to this: the high dimensionality of video signals leads to a scarcity of training data, causing DMs to favor memorization over prompt compliance when generating motion; moreover, DMs struggle to generalize to novel motion patterns not present in the training set, and fine-tuning them to learn such patterns, especially using limited training data, is still under-explored. To address these limitations, we propose Modular Image-to-Video Adapter (MIVA), a lightweight sub-network attachable to a pre-trained DM, each designed to capture a single motion pattern and scalable via parallelization. MIVAs can be efficiently trained on approximately ten samples using a single consumer-grade GPU. At inference time, users can specify motion by selecting one or multiple MIVAs, eliminating the need for prompt engineering. Extensive experiments demonstrate that MIVA enables more precise motion control while maintaining, or even surpassing, the generation quality of models trained on significantly larger datasets.

[70] PaveSync: A Unified and Comprehensive Dataset for Pavement Distress Analysis and Classification

Blessing Agyei Kyem, Joshua Kofi Asamoah, Anthony Dontoh, Andrews Danyo, Eugene Denteh, Armstrong Aboah

Main category: cs.CV

TL;DR: Researchers created a standardized benchmark dataset for pavement defect detection by consolidating multiple public sources, addressing the lack of consistent datasets that hinder model generalization across real-world conditions.

Details

Motivation: Automated pavement defect detection struggles to generalize across diverse real-world conditions due to inconsistent datasets with varying annotation styles, distress type definitions, and formats, which prevents unified training and fair model comparison.

Method: The authors consolidated multiple publicly available pavement defect datasets into a standardized benchmark collection, unifying 52,747 images from seven countries with 135,277 bounding box annotations covering 13 distinct distress types. They standardized class definitions and annotation formats to create a globally representative resource.

Result: The benchmark dataset captures broad real-world variation in image quality, resolution, viewing angles, and weather conditions. When tested with state-of-the-art object detection models (YOLOv8-YOLOv12, Faster R-CNN, DETR), the models achieved competitive performance across diverse scenarios, including zero-shot transfer to new environments.

Conclusion: This standardized benchmark dataset provides the first globally representative resource for pavement defect detection, enabling consistent training, fair model comparison, and improved generalization across diverse real-world conditions through unified data standards.

Abstract: Automated pavement defect detection often struggles to generalize across diverse real-world conditions due to the lack of standardized datasets. Existing datasets differ in annotation styles, distress type definitions, and formats, limiting their integration for unified training. To address this gap, we introduce a comprehensive benchmark dataset that consolidates multiple publicly available sources into a standardized collection of 52747 images from seven countries, with 135277 bounding box annotations covering 13 distinct distress types. The dataset captures broad real-world variation in image quality, resolution, viewing angles, and weather conditions, offering a unique resource for consistent training and evaluation. Its effectiveness was demonstrated through benchmarking with state-of-the-art object detection models including YOLOv8-YOLOv12, Faster R-CNN, and DETR, which achieved competitive performance across diverse scenarios. By standardizing class definitions and annotation formats, this dataset provides the first globally representative benchmark for pavement defect detection and enables fair comparison of models, including zero-shot transfer to new environments.

[71] SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images

Zepeng Xin, Kaiyu Li, Luodi Chen, Wanchen Li, Yuchen Xiao, Hui Qiao, Weizhan Zhang, Deyu Meng, Xiangyong Cao

Main category: cs.CV

TL;DR: LaSeRS is a new large-scale dataset for complex language-guided segmentation in remote sensing, addressing hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. SegEarth-R2 is a proposed MLLM architecture with spatial attention supervision and flexible segmentation queries that achieves state-of-the-art performance.

Details

Motivation: Current remote sensing models fail at complex geospatial scenarios like multi-target segmentation, hierarchical object parsing, and implicit intent interpretation. Existing datasets oversimplify these challenges, leading to models that are sensitive and unreliable for real-world applications like disaster response and environmental monitoring.

Method: 1) Created LaSeRS dataset covering four critical dimensions of language-guided segmentation. 2) Proposed SegEarth-R2 MLLM architecture with spatial attention supervision for small object localization and flexible segmentation query mechanism for handling both single and multi-target scenarios.

Result: SegEarth-R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a powerful baseline for next-generation geospatial segmentation. The model effectively handles complex segmentation tasks that previous models failed at.

Conclusion: LaSeRS addresses a critical gap in remote sensing datasets by capturing complex geospatial reasoning dimensions. SegEarth-R2’s architectural innovations enable comprehensive language-guided segmentation, advancing the field toward more robust and capable models for real-world applications.

Abstract: Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation across four critical dimensions of language-guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands, providing a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimplify, leading to sensitivity-prone real-world models. We also propose SegEarth-R2, an MLLM architecture designed for comprehensive language-guided segmentation in RS, which directly confronts these challenges. The model’s effectiveness stems from two key improvements: (1) a spatial attention supervision mechanism specifically handles the localization of small objects and their components, and (2) a flexible and efficient segmentation query mechanism that handles both single-target and multi-target scenarios. Experimental results demonstrate that our SegEarth-R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a powerful baseline for the next generation of geospatial segmentation. All data and code will be released at https://github.com/earth-insights/SegEarth-R2.

[72] A Contextual Analysis of Driver-Facing and Dual-View Video Inputs for Distraction Detection in Naturalistic Driving Environments

Anthony Dontoh, Stephanie Ivey, Armstrong Aboah

Main category: cs.CV

TL;DR: Dual-view (driver + road) distraction detection can improve accuracy but depends on architecture design - SlowOnly improved 9.8% while SlowFast dropped 7.2% due to representational conflicts.

Details

Motivation: Most existing distracted driving detection models only use driver-facing views and ignore important environmental context that affects driving behavior, limiting their effectiveness in real-world conditions.

Method: Used synchronized dual-camera recordings from real-world driving to benchmark three spatiotemporal action recognition models (SlowFast-R50, X3D-M, SlowOnly-R50) under two configurations: driver-only and stacked dual-view inputs.

Result: Performance gains from adding road context depend heavily on architecture: SlowOnly improved 9.8% with dual-view, but SlowFast dropped 7.2% due to representational conflicts. X3D-M showed mixed results.

Conclusion: Simply adding visual context isn’t sufficient and can cause interference; architecture must be specifically designed for multi-view integration. Future driver monitoring systems need fusion-aware design.

Abstract: Despite increasing interest in computer vision-based distracted driving detection, most existing models rely exclusively on driver-facing views and overlook crucial environmental context that influences driving behavior. This study investigates whether incorporating road-facing views alongside driver-facing footage improves distraction detection accuracy in naturalistic driving conditions. Using synchronized dual-camera recordings from real-world driving, we benchmark three leading spatiotemporal action recognition architectures: SlowFast-R50, X3D-M, and SlowOnly-R50. Each model is evaluated under two input configurations: driver-only and stacked dual-view. Results show that while contextual inputs can improve detection in certain models, performance gains depend strongly on the underlying architecture. The single-pathway SlowOnly model achieved a 9.8 percent improvement with dual-view inputs, while the dual-pathway SlowFast model experienced a 7.2 percent drop in accuracy due to representational conflicts. These findings suggest that simply adding visual context is not sufficient and may lead to interference unless the architecture is specifically designed to support multi-view integration. This study presents one of the first systematic comparisons of single- and dual-view distraction detection models using naturalistic driving data and underscores the importance of fusion-aware design for future multimodal driver monitoring systems.

[73] MAPI-GNN: Multi-Activation Plane Interaction Graph Neural Network for Multimodal Medical Diagnosis

Ziwei Qin, Xuhui Song, Deqing Huang, Na Qin, Jun Li

Main category: cs.CV

TL;DR: MAPI-GNN is a novel graph neural network that learns multifaceted graph profiles from disentangled feature subspaces to overcome limitations of single static graphs in multimodal medical diagnosis.

Details

Motivation: Current graph neural networks for medical diagnosis rely on single static graphs built from indiscriminate features, which limits their ability to model patient-specific pathological relationships and compromises diagnostic efficacy.

Method: The framework uses a multi-dimensional discriminator to uncover latent graph-aware patterns, dynamically constructs a stack of activation graphs guided by these patterns, and aggregates the multifaceted profile through a relational fusion engine for robust diagnosis.

Result: Extensive experiments on two diverse tasks with over 1300 patient samples demonstrate that MAPI-GNN significantly outperforms state-of-the-art methods.

Conclusion: MAPI-GNN successfully reconstructs the single-graph paradigm by learning multifaceted graph profiles from semantically disentangled feature subspaces, enabling better modeling of patient-specific pathological relationships and improving multimodal medical diagnosis.

Abstract: Graph neural networks are increasingly applied to multimodal medical diagnosis for their inherent relational modeling capabilities. However, their efficacy is often compromised by the prevailing reliance on a single, static graph built from indiscriminate features, hindering the ability to model patient-specific pathological relationships. To this end, the proposed Multi-Activation Plane Interaction Graph Neural Network (MAPI-GNN) reconstructs this single-graph paradigm by learning a multifaceted graph profile from semantically disentangled feature subspaces. The framework first uncovers latent graph-aware patterns via a multi-dimensional discriminator; these patterns then guide the dynamic construction of a stack of activation graphs; and this multifaceted profile is finally aggregated and contextualized by a relational fusion engine for a robust diagnosis. Extensive experiments on two diverse tasks, comprising over 1300 patient samples, demonstrate that MAPI-GNN significantly outperforms state-of-the-art methods.

[74] $\text{H}^2$em: Learning Hierarchical Hyperbolic Embeddings for Compositional Zero-Shot Learning

Lin Li, Jiahui Li, Jiaming Lei, Jun Xiao, Feifei Shao, Long Chen

Main category: cs.CV

TL;DR: H2em: Hierarchical Hyperbolic Embeddings for Compositional Zero-Shot Learning that leverages hyperbolic geometry to better model hierarchical structures in CZSL, achieving state-of-the-art performance.

Details

Motivation: Current CZSL methods overlook rich hierarchical structures (semantic hierarchy of primitives and conceptual hierarchy between primitives and compositions). Euclidean space approaches fail to scale to large-scale taxonomies due to polynomial volume growth that cannot match exponential hierarchical structure, impairing generalization capacity.

Method: H2em framework learns Hierarchical Hyperbolic Embeddings using hyperbolic geometry’s tree-like structure properties. Includes: 1) Dual-Hierarchical Entailment Loss using hyperbolic entailment cones to enforce predefined hierarchies, 2) Discriminative Alignment Loss with hard negative mining for large geodesic distance between similar compositions, and 3) Hyperbolic Cross-Modal Attention for instance-aware cross-modal infusion in hyperbolic space.

Result: Extensive ablations on three benchmarks demonstrate H2em establishes new state-of-the-art in both closed-world and open-world CZSL scenarios.

Conclusion: Hyperbolic geometry is better suited for embedding hierarchical structures in CZSL than Euclidean space. The proposed H2em framework effectively addresses hierarchical collapse and fine-grained discrimination issues through specialized learning objectives and cross-modal attention in hyperbolic space.

Abstract: Compositional zero-shot learning (CZSL) aims to recognize unseen state-object compositions by generalizing from a training set of their primitives (state and object). Current methods often overlook the rich hierarchical structures, such as the semantic hierarchy of primitives (e.g., apple fruit) and the conceptual hierarchy between primitives and compositions (e.g, sliced apple apple). A few recent efforts have shown effectiveness in modeling these hierarchies through loss regularization within Euclidean space. In this paper, we argue that they fail to scale to the large-scale taxonomies required for real-world CZSL: the space’s polynomial volume growth in flat geometry cannot match the exponential structure, impairing generalization capacity. To this end, we propose H2em, a new framework that learns Hierarchical Hyperbolic EMbeddings for CZSL. H2em leverages the unique properties of hyperbolic geometry, a space naturally suited for embedding tree-like structures with low distortion. However, a naive hyperbolic mapping may suffer from hierarchical collapse and poor fine-grained discrimination. We further design two learning objectives to structure this space: a Dual-Hierarchical Entailment Loss that uses hyperbolic entailment cones to enforce the predefined hierarchies, and a Discriminative Alignment Loss with hard negative mining to establish a large geodesic distance between semantically similar compositions. Furthermore, we devise Hyperbolic Cross-Modal Attention to realize instance-aware cross-modal infusion within hyperbolic geometry. Extensive ablations on three benchmarks demonstrate that H2em establishes a new state-of-the-art in both closed-world and open-world scenarios. Our codes will be released.

Chang Sun, Dongliang Xie, Bo Qin, Hong Yang

Main category: cs.CV

TL;DR: VALLR-Pin is a two-stage Mandarin lip-reading framework that combines visual features with phonetic context and LLM refinement to handle homophone ambiguity in Chinese.

Details

Motivation: Mandarin visual speech recognition is challenging due to highly ambiguous visemes and prevalent homophones, requiring better approaches to disambiguate similar-looking lip movements.

Method: Two-stage framework: 1) Shared video encoder with dual decoders predicting Chinese characters and Pinyin romanization jointly; 2) LLM refinement using Pinyin output and candidate transcripts as prompts, plus fine-tuning on synthetic noisy examples from intermediate checkpoints.

Result: The method synergizes visual features with phonetic and linguistic context to improve Mandarin lip-reading performance by explicitly addressing homophone-induced errors through LLM-based correction.

Conclusion: VALLR-Pin effectively extends English VALLR to Mandarin by incorporating phonetic context and LLM refinement, providing a robust solution for Mandarin visual speech recognition challenges.

Abstract: Visual Speech Recognition aims to transcribe spoken words from silent lip-motion videos. This task is particularly challenging for Mandarin, as visemes are highly ambiguous and homophones are prevalent. We propose VALLR-Pin, a novel two-stage framework that extends the recent VALLR architecture from English to Mandarin. First, a shared video encoder feeds into dual decoders, which jointly predict both Chinese character sequences and their standard Pinyin romanization. The multi-task learning of character and phonetic outputs fosters robust visual-semantic representations. During inference, the text decoder generates multiple candidate transcripts. We construct a prompt by concatenating the Pinyin output with these candidate Chinese sequences and feed it to a large language model to resolve ambiguities and refine the transcription. This provides the LLM with explicit phonetic context to correct homophone-induced errors. Finally, we fine-tune the LLM on synthetic noisy examples: we generate imperfect Pinyin-text pairs from intermediate VALLR-Pin checkpoints using the training data, creating instruction-response pairs for error correction. This endows the LLM with awareness of our model’s specific error patterns. In summary, VALLR-Pin synergizes visual features with phonetic and linguistic context to improve Mandarin lip-reading performance.

[76] FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs

Andreas Zinonos, Michał Stypułkowski, Antoni Bigata, Stavros Petridis, Maja Pantic, Nikita Drobyshev

Main category: cs.CV

TL;DR: FlashLips is a real-time lip-sync system that achieves 100+ FPS with high visual quality by decoupling lips control from rendering using a two-stage approach without explicit masks or complex generative models.

Details

Motivation: The motivation is to create a lip-sync system that achieves real-time performance (over 100 FPS) while maintaining high visual quality comparable to larger state-of-the-art models, without relying on explicit masks or complex generative models like GANs or diffusion.

Method: Two-stage approach: Stage 1 is a compact one-step latent-space editor that reconstructs images using reference identity, masked target frame, and lips-pose vector, trained with reconstruction losses only. Stage 2 is an audio-to-pose transformer with flow-matching objective to predict lips-poses from speech. Uses self-supervision to remove explicit masks by generating mouth-altered variants for fine-tuning.

Result: Achieves real-time performance running at over 100 FPS on a single GPU while matching visual quality of larger state-of-the-art models. Combines deterministic reconstruction with robust audio control in a simple, stable pipeline.

Conclusion: FlashLips demonstrates that high-quality lip-sync can be achieved with real-time performance through a decoupled approach that separates lips control from rendering, using simple reconstruction losses and self-supervision instead of complex generative models.

Abstract: We present FlashLips, a two-stage, mask-free lip-sync system that decouples lips control from rendering and achieves real-time performance running at over 100 FPS on a single GPU, while matching the visual quality of larger state-of-the-art models. Stage 1 is a compact, one-step latent-space editor that reconstructs an image using a reference identity, a masked target frame, and a low-dimensional lips-pose vector, trained purely with reconstruction losses - no GANs or diffusion. To remove explicit masks at inference, we use self-supervision: we generate mouth-altered variants of the target image, that serve as pseudo ground truth for fine-tuning, teaching the network to localize edits to the lips while preserving the rest. Stage 2 is an audio-to-pose transformer trained with a flow-matching objective to predict lips-poses vectors from speech. Together, these stages form a simple and stable pipeline that combines deterministic reconstruction with robust audio control, delivering high perceptual quality and faster-than-real-time speed.

Nguyen Lam Phu Quy, Pham Phu Hoa, Tran Chi Nguyen, Dao Sy Duy Minh, Nguyen Hoang Minh Ngoc, Huynh Trung Kiet

Main category: cs.CV

TL;DR: A multimodal pipeline that enhances image captions by retrieving similar images, extracting contextual information from related articles, and integrating this knowledge using a fine-tuned Qwen3 model to produce event-enriched, context-aware descriptions.

Details

Motivation: Real-world image captions often lack contextual depth, omitting crucial details like event background, temporal cues, outcomes, and named entities that aren't visually discernible. This limits effectiveness in domains like journalism, education, and digital archives where richer descriptions are essential.

Method: Multimodal pipeline that: 1) retrieves semantically similar images using BEIT-3 and SigLIP So-384, 2) reranks them using ORB and SIFT for geometric alignment, 3) extracts contextual information from related articles via semantic search, 4) integrates context with base captions (generated by Instruct BLIP) using a fine-tuned Qwen3 model with QLoRA.

Result: Evaluated on OpenEvents v1 dataset, the approach generates significantly more informative captions compared to traditional methods, showing strong potential for real-world applications requiring deeper visual-textual understanding.

Conclusion: The proposed multimodal pipeline effectively addresses the contextual gap in image captions by augmenting visual input with external textual knowledge, producing event-enriched, context-aware descriptions suitable for real-world applications in journalism, education, and digital archives.

Abstract: Real-world image captions often lack contextual depth, omitting crucial details such as event background, temporal cues, outcomes, and named entities that are not visually discernible. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives, where richer, more informative descriptions are essential. To address this, we propose a multimodal pipeline that augments visual input with external textual knowledge. Our system retrieves semantically similar images using BEIT-3 (Flickr30k-384 and COCO-384) and SigLIP So-384, reranks them using ORB and SIFT for geometric alignment, and extracts contextual information from related articles via semantic search. A fine-tuned Qwen3 model with QLoRA then integrates this context with base captions generated by Instruct BLIP (Vicuna-7B) to produce event-enriched, context-aware descriptions. Evaluated on the OpenEvents v1 dataset, our approach generates significantly more informative captions compared to traditional methods, showing strong potential for real-world applications requiring deeper visual-textual understanding

[78] Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark

Hao Guo, Xugong Qin, Jun Jie Ou Yang, Peng Zhang, Gangyan Zeng, Yubo Li, Hailun Lin

Main category: cs.CV

TL;DR: The paper introduces NL-DIR, a new benchmark for Natural Language-based Document Image Retrieval, where natural language descriptions serve as fine-grained semantic queries instead of coarse image-based queries.

Details

Motivation: Existing DIR methods only handle coarse semantic categories (e.g., newspapers vs receipts) using image queries, but struggle with real-world scenarios where users provide textual queries with fine-grained semantics.

Method: Created NL-DIR benchmark with 41K authentic document images, each paired with 5 high-quality fine-grained semantic queries generated via LLMs and manual verification. Evaluated existing contrastive vision-language models and OCR-free VDU models with zero-shot and fine-tuning approaches, plus a two-stage retrieval method for efficiency.

Result: Proposed NL-DIR benchmark enables evaluation of document retrieval using natural language queries. The dataset and evaluation framework are established, with models tested in various settings and a two-stage method showing performance improvements while maintaining efficiency.

Conclusion: NL-DIR benchmark bridges the gap between coarse image-based DIR and real-world fine-grained text queries, providing new opportunities for the Visual Document Understanding community to advance document retrieval with natural language semantics.

Abstract: Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to effectively retrieve document images in real-world scenarios where textual queries with fine-grained semantics are usually provided. To bridge this gap, we introduce a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated through large language models in conjunction with manual verification. We perform zero-shot and fine-tuning evaluations of existing mainstream contrastive vision-language models and OCR-free visual document understanding (VDU) models. A two-stage retrieval method is further investigated for performance improvement while achieving both time and space efficiency. We hope the proposed NL-DIR benchmark can bring new opportunities and facilitate research for the VDU community. Datasets and codes will be publicly available at huggingface.co/datasets/nianbing/NL-DIR.

[79] Item Region-based Style Classification Network (IRSN): A Fashion Style Classifier Based on Domain Knowledge of Fashion Experts

Jinyoung Choi, Youngchae Kwon, Injung Kim

Main category: cs.CV

TL;DR: IRSN improves fashion style classification by analyzing item-specific features and their combinations using item region pooling and gated feature fusion, achieving significant accuracy improvements on benchmark datasets.

Details

Motivation: Fashion style classification is challenging due to large visual variation within styles and similarity between different styles. Styles are expressed through global appearance, individual item attributes, and their combinations, requiring more sophisticated analysis than just global features.

Method: Proposes Item Region-based Style Network (IRSN) with: 1) Item Region Pooling (IRP) to extract features from each item region, 2) Separate analysis of item-specific features, 3) Gated Feature Fusion (GFF) to combine features, and 4) Dual-backbone architecture combining domain-specific and general feature extractors pre-trained on large-scale image-text data.

Result: IRSN applied to six backbones (EfficientNet, ConvNeXt, Swin Transformer) improved style classification accuracy by average 6.9% (max 14.5%) on FashionStyle14 and average 7.6% (max 15.1%) on ShowniqV3. Visualization analysis shows IRSN better captures differences between similar style classes.

Conclusion: IRSN effectively addresses fashion style classification challenges by analyzing both global and item-specific features with their combinations, demonstrating significant performance improvements and better discrimination of visually similar styles.

Abstract: Fashion style classification is a challenging task because of the large visual variation within the same style and the existence of visually similar styles. Styles are expressed not only by the global appearance, but also by the attributes of individual items and their combinations. In this study, we propose an item region-based fashion style classification network (IRSN) to effectively classify fashion styles by analyzing item-specific features and their combinations in addition to global features. IRSN extracts features of each item region using item region pooling (IRP), analyzes them separately, and combines them using gated feature fusion (GFF). In addition, we improve the feature extractor by applying a dual-backbone architecture that combines a domain-specific feature extractor and a general feature extractor pre-trained with a large-scale image-text dataset. In experiments, applying IRSN to six widely-used backbones, including EfficientNet, ConvNeXt, and Swin Transformer, improved style classification accuracy by an average of 6.9% and a maximum of 14.5% on the FashionStyle14 dataset and by an average of 7.6% and a maximum of 15.1% on the ShowniqV3 dataset. Visualization analysis also supports that the IRSN models are better than the baseline models at capturing differences between similar style classes.

[80] Effect of Activation Function and Model Optimizer on the Performance of Human Activity Recognition System Using Various Deep Learning Models

Subrata Kumer Paula, Dewan Nafiul Islam Noora, Rakhi Rani Paula, Md. Ekramul Hamidb, Fahmid Al Faridc, Hezerul Abdul Karimd, Md. Maruf Al Hossain Princee, Abu Saleh Musa Miahb

Main category: cs.CV

TL;DR: This paper analyzes how activation functions and model optimizers affect Human Activity Recognition performance in healthcare applications, finding ConvLSTM with Adam/RMSprop achieves up to 99% accuracy.

Details

Motivation: While deep learning-based HAR systems are widely used, the impact of activation functions and model optimizers on performance hasn't been sufficiently analyzed, especially how their combinations influence model behavior in practical healthcare scenarios.

Method: The study investigates three activation functions (ReLU, Sigmoid, Tanh) combined with four optimization algorithms (SGD, Adam, RMSprop, Adagrad) using two recurrent architectures (BiLSTM and ConvLSTM). Experiments conducted on six medically relevant activity classes from HMDB51 and UCF101 datasets.

Result: ConvLSTM consistently outperforms BiLSTM across both datasets. ConvLSTM with Adam or RMSprop achieves up to 99.00% accuracy, showing strong spatio-temporal learning. BiLSTM performs well on UCF101 (~98%) but drops to ~60% on HMDB51, showing limited robustness and weaker sensitivity to AF/MO variations.

Conclusion: The study provides practical insights for optimizing HAR systems in healthcare environments, demonstrating that ConvLSTM with Adam/RMSprop offers superior performance and stability for real-world applications requiring fast and precise activity detection.

Abstract: Human Activity Recognition (HAR) plays a vital role in healthcare, surveillance, and innovative environments, where reliable action recognition supports timely decision-making and automation. Although deep learning-based HAR systems are widely adopted, the impact of Activation Functions (AFs) and Model Optimizers (MOs) on performance has not been sufficiently analyzed, particularly regarding how their combinations influence model behavior in practical scenarios. Most existing studies focus on architecture design, while the interaction between AF and MO choices remains relatively unexplored. In this work, we investigate the effect of three commonly used activation functions (ReLU, Sigmoid, and Tanh) combined with four optimization algorithms (SGD, Adam, RMSprop, and Adagrad) using two recurrent deep learning architectures, namely BiLSTM and ConvLSTM. Experiments are conducted on six medically relevant activity classes selected from the HMDB51 and UCF101 datasets, considering their suitability for healthcare-oriented HAR applications. Our experimental results show that ConvLSTM consistently outperforms BiLSTM across both datasets. ConvLSTM, combined with Adam or RMSprop, achieves an accuracy of up to 99.00%, demonstrating strong spatio-temporal learning capabilities and stable performance. While BiLSTM performs reasonably well on UCF101, with accuracy approaching 98.00%, its performance drops to approximately 60.00% on HMDB51, indicating limited robustness across datasets and weaker sensitivity to AF and MO variations. This study provides practical insights for optimizing HAR systems, particularly for real-world healthcare environments where fast and precise activity detection is critical.

[81] LiDARDraft: Generating LiDAR Point Cloud from Versatile Inputs

Haiyun Wei, Fan Lu, Yunwei Zhu, Zehan Zheng, Weiyi Xue, Lin Shao, Xudong Zhang, Ya Wu, Rong Fu, Guang Chen

Main category: cs.CV

TL;DR: LiDARDraft: A method for generating realistic LiDAR point clouds from various inputs (text, images, sketches) using 3D layouts as intermediate representation and ControlNet for guided generation.

Details

Motivation: Previous LiDAR generation methods struggle with high-quality results and versatile controllability due to the imbalance between complex LiDAR point cloud distributions and simple control signals.

Method: 1) Convert text, images, and point clouds into unified 3D layouts, 2) Transform layouts into semantic and depth control signals, 3) Use rangemap-based ControlNet to guide LiDAR point cloud generation through pixel-level alignment.

Result: Enables “simulation from scratch” - creating self-driving environments from arbitrary textual descriptions, images, and sketches with excellent performance in controllable LiDAR point cloud generation.

Conclusion: LiDARDraft successfully bridges versatile conditional signals and LiDAR point clouds using 3D layouts as intermediate representation, achieving high-quality controllable generation for autonomous driving simulation.

Abstract: Generating realistic and diverse LiDAR point clouds is crucial for autonomous driving simulation. Although previous methods achieve LiDAR point cloud generation from user inputs, they struggle to attain high-quality results while enabling versatile controllability, due to the imbalance between the complex distribution of LiDAR point clouds and the simple control signals. To address the limitation, we propose LiDARDraft, which utilizes the 3D layout to build a bridge between versatile conditional signals and LiDAR point clouds. The 3D layout can be trivially generated from various user inputs such as textual descriptions and images. Specifically, we represent text, images, and point clouds as unified 3D layouts, which are further transformed into semantic and depth control signals. Then, we employ a rangemap-based ControlNet to guide LiDAR point cloud generation. This pixel-level alignment approach demonstrates excellent performance in controllable LiDAR point clouds generation, enabling “simulation from scratch”, allowing self-driving environments to be created from arbitrary textual descriptions, images and sketches.

[82] UMAMI: Unifying Masked Autoregressive Models and Deterministic Rendering for View Synthesis

Thanh-Tung Le, Tuan Pham, Tung Nguyen, Deying Kong, Xiaohui Xie, Stephan Mandt

Main category: cs.CV

TL;DR: A hybrid framework combining deterministic and stochastic methods for novel view synthesis that achieves SOTA quality with 10x faster rendering than fully generative baselines.

Details

Motivation: Existing NVS methods have trade-offs: deterministic networks are fast but blur unobserved areas, while diffusion-based methods hallucinate plausible content but are computationally expensive. There's a need to unify both strengths.

Method: Hybrid framework with bidirectional transformer encoding multi-view image tokens and Plucker-ray embeddings. Two lightweight heads: feed-forward regression head for well-constrained geometry, and masked autoregressive diffusion head for occluded/unseen regions. End-to-end training with joint photometric and diffusion losses.

Result: Achieves state-of-the-art image quality while reducing rendering time by an order of magnitude compared to fully generative baselines.

Conclusion: The proposed hybrid approach successfully combines the strengths of deterministic and stochastic methods, enabling scalable, high-quality novel view synthesis without handcrafted 3D inductive biases.

Abstract: Novel view synthesis (NVS) seeks to render photorealistic, 3D-consistent images of a scene from unseen camera poses given only a sparse set of posed views. Existing deterministic networks render observed regions quickly but blur unobserved areas, whereas stochastic diffusion-based methods hallucinate plausible content yet incur heavy training- and inference-time costs. In this paper, we propose a hybrid framework that unifies the strengths of both paradigms. A bidirectional transformer encodes multi-view image tokens and Plucker-ray embeddings, producing a shared latent representation. Two lightweight heads then act on this representation: (i) a feed-forward regression head that renders pixels where geometry is well constrained, and (ii) a masked autoregressive diffusion head that completes occluded or unseen regions. The entire model is trained end-to-end with joint photometric and diffusion losses, without handcrafted 3D inductive biases, enabling scalability across diverse scenes. Experiments demonstrate that our method attains state-of-the-art image quality while reducing rendering time by an order of magnitude compared with fully generative baselines.

[83] HEART-VIT: Hessian-Guided Efficient Dynamic Attention and Token Pruning in Vision Transformer

Mohammad Helal Uddin, Liam Seymour, Sabur Baidya

Main category: cs.CV

TL;DR: HEART-ViT is a Hessian-guided framework for dynamic attention and token pruning in Vision Transformers that achieves significant computational savings while maintaining or improving accuracy.

Details

Motivation: Vision Transformers have state-of-the-art accuracy but suffer from quadratic attention cost and redundant computations, making them difficult to deploy on latency and resource-constrained platforms. Existing pruning methods treat tokens or heads in isolation and rely on heuristics, often sacrificing accuracy or failing to generalize.

Method: HEART-ViT uses Hessian-guided efficient dynamic attention and token pruning. It estimates curvature-weighted sensitivities of both tokens and attention heads using efficient Hessian-vector products, enabling principled pruning decisions under explicit loss budgets. This unified, second-order, input-adaptive framework combines token pruning (for computational savings) with head pruning (for fine-grained redundancy removal).

Result: On ImageNet-100 and ImageNet-1K with ViT-B/16 and DeiT-B/16, HEART-ViT achieves up to 49.4% FLOPs reduction, 36% lower latency, and 46% higher throughput while matching or surpassing baseline accuracy (e.g., 4.7% recovery at 40% token pruning). Real-world deployment on edge devices like AGX Orin shows direct translation to improved inference speed and energy efficiency.

Conclusion: HEART-ViT bridges theory and practice as the first unified, curvature-driven pruning framework that is both accuracy-preserving and edge-efficient, revealing that token pruning dominates computational savings while head pruning provides fine-grained redundancy removal.

Abstract: Vision Transformers (ViTs) deliver state-of-the-art accuracy but their quadratic attention cost and redundant computations severely hinder deployment on latency and resource-constrained platforms. Existing pruning approaches treat either tokens or heads in isolation, relying on heuristics or first-order signals, which often sacrifice accuracy or fail to generalize across inputs. We introduce HEART-ViT, a Hessian-guided efficient dynamic attention and token pruning framework for vision transformers, which to the best of our knowledge is the first unified, second-order, input-adaptive framework for ViT optimization. HEART-ViT estimates curvature-weighted sensitivities of both tokens and attention heads using efficient Hessian-vector products, enabling principled pruning decisions under explicit loss budgets.This dual-view sensitivity reveals an important structural insight: token pruning dominates computational savings, while head pruning provides fine-grained redundancy removal, and their combination achieves a superior trade-off. On ImageNet-100 and ImageNet-1K with ViT-B/16 and DeiT-B/16, HEART-ViT achieves up to 49.4 percent FLOPs reduction, 36 percent lower latency, and 46 percent higher throughput, while consistently matching or even surpassing baseline accuracy after fine-tuning, for example 4.7 percent recovery at 40 percent token pruning. Beyond theoretical benchmarks, we deploy HEART-ViT on different edge devices such as AGX Orin, demonstrating that our reductions in FLOPs and latency translate directly into real-world gains in inference speed and energy efficiency. HEART-ViT bridges the gap between theory and practice, delivering the first unified, curvature-driven pruning framework that is both accuracy-preserving and edge-efficient.

[84] milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion

Niraj Prakash Kini, Shiau-Rung Tsai, Guan-Hsun Lin, Wen-Hsiao Peng, Ching-Wen Ma, Jenq-Neng Hwang

Main category: cs.CV

TL;DR: milliMamba: A radar-based 2D human pose estimation framework using Mamba architecture to handle sparse radar signals by jointly modeling spatio-temporal dependencies with linear complexity.

Details

Motivation: Millimeter-wave radar provides privacy-preserving, lighting-invariant sensing for human pose estimation, but suffers from sparse signals due to specular reflection, making robust feature extraction challenging.

Method: Uses Cross-View Fusion Mamba encoder for efficient spatio-temporal feature extraction from longer sequences with linear complexity, plus Spatio-Temporal-Cross Attention decoder to predict joint coordinates across frames. Incorporates velocity loss alongside keypoint loss for motion smoothness.

Result: Achieves significant performance improvements, exceeding baselines by 11.0 AP on TransHuPR and 14.6 AP on HuPR datasets while maintaining reasonable complexity.

Conclusion: milliMamba effectively addresses radar signal sparsity through joint spatio-temporal modeling, enabling robust human pose estimation from privacy-preserving radar sensors with superior performance over existing methods.

Abstract: Millimeter-wave radar offers a privacy-preserving and lighting-invariant alternative to RGB sensors for Human Pose Estimation (HPE) task. However, the radar signals are often sparse due to specular reflection, making the extraction of robust features from radar signals highly challenging. To address this, we present milliMamba, a radar-based 2D human pose estimation framework that jointly models spatio-temporal dependencies across both the feature extraction and decoding stages. Specifically, given the high dimensionality of radar inputs, we adopt a Cross-View Fusion Mamba encoder to efficiently extract spatio-temporal features from longer sequences with linear complexity. A Spatio-Temporal-Cross Attention decoder then predicts joint coordinates across multiple frames. Together, this spatio-temporal modeling pipeline enables the model to leverage contextual cues from neighboring frames and joints to infer missing joints caused by specular reflections. To reinforce motion smoothness, we incorporate a velocity loss alongside the standard keypoint loss during training. Experiments on the TransHuPR and HuPR datasets demonstrate that our method achieves significant performance improvements, exceeding the baselines by 11.0 AP and 14.6 AP, respectively, while maintaining reasonable complexity. Code: https://github.com/NYCU-MAPL/milliMamba

[85] Enhancing annotations for 5D apple pose estimation through 3D Gaussian Splatting (3DGS)

Robert van de Ven, Trim Bresilla, Bram Nelissen, Ard Nieuwenhuizen, Eldert J. van Henten, Gert Kootstra

Main category: cs.CV

TL;DR: A pipeline using 3D Gaussian Splatting to reconstruct orchard scenes and automate apple pose annotation, reducing manual labeling by 99.6% while maintaining good pose estimation performance.

Details

Motivation: Apple pose estimation is challenging due to environmental variations, occlusions, and the difficulty/time required for manual annotation of key points like calyx. Existing methods still need these annotations for training despite not requiring them during inference.

Method: Novel pipeline combining: 1) 3D Gaussian Splatting for orchard scene reconstruction, 2) simplified manual annotations in 3D space, 3) automated projection of annotations to 2D images, and 4) training and evaluation of pose estimation models.

Result: Only 105 manual annotations needed to generate 28,191 training labels (99.6% reduction). Best performance achieved with fruits ≤95% occluded (F1: 0.927 on original, 0.970 on rendered images). Position estimation quality decreased with increased occlusion. Orientation estimation failed to learn correctly.

Conclusion: The 3D reconstruction-based annotation pipeline dramatically reduces manual labeling effort while maintaining pose estimation performance, though orientation estimation remains challenging and occlusion affects position accuracy.

Abstract: Automating tasks in orchards is challenging because of the large amount of variation in the environment and occlusions. One of the challenges is apple pose estimation, where key points, such as the calyx, are often occluded. Recently developed pose estimation methods no longer rely on these key points, but still require them for annotations, making annotating challenging and time-consuming. Due to the abovementioned occlusions, there can be conflicting and missing annotations of the same fruit between different images. Novel 3D reconstruction methods can be used to simplify annotating and enlarge datasets. We propose a novel pipeline consisting of 3D Gaussian Splatting to reconstruct an orchard scene, simplified annotations, automated projection of the annotations to images, and the training and evaluation of a pose estimation method. Using our pipeline, 105 manual annotations were required to obtain 28,191 training labels, a reduction of 99.6%. Experimental results indicated that training with labels of fruits that are $\leq95%$ occluded resulted in the best performance, with a neutral F1 score of 0.927 on the original images and 0.970 on the rendered images. Adjusting the size of the training dataset had small effects on the model performance in terms of F1 score and pose estimation accuracy. It was found that the least occluded fruits had the best position estimation, which worsened as the fruits became more occluded. It was also found that the tested pose estimation method was unable to correctly learn the orientation estimation of apples.

[86] CoDi – an exemplar-conditioned diffusion model for low-shot counting

Grega Šuštar, Jer Pelhan, Alan Lukežič, Matej Kristan

Main category: cs.CV

TL;DR: CoDi: A latent diffusion-based low-shot object counter that generates high-quality density maps for accurate object localization, outperforming SOTA methods by 10-44% MAE on benchmarks.

Details

Motivation: Existing low-shot object counting methods have limitations: density-based counters have poor localization capabilities, while point-detection-based counters underperform on images with very large numbers of objects and resort to ad-hoc techniques like upsampling and tiling.

Method: CoDi uses latent diffusion to generate high-quality density maps, with a novel exemplar-based conditioning module that extracts and adjusts object prototypes to intermediate layers of the denoising network for accurate object location estimation via non-maxima suppression.

Result: Outperforms state-of-the-art by 15% MAE (few-shot), 13% MAE (one-shot), and 10% MAE (reference-less) on FSC benchmark, and sets new SOTA on MCAC benchmark by outperforming top method by 44% MAE.

Conclusion: CoDi is the first latent diffusion-based low-shot counter that effectively addresses dense regions with small objects while maintaining accurate localization, establishing new benchmarks for low-shot object counting.

Abstract: Low-shot object counting addresses estimating the number of previously unobserved objects in an image using only few or no annotated test-time exemplars. A considerable challenge for modern low-shot counters are dense regions with small objects. While total counts in such situations are typically well addressed by density-based counters, their usefulness is limited by poor localization capabilities. This is better addressed by point-detection-based counters, which are based on query-based detectors. However, due to limited number of pre-trained queries, they underperform on images with very large numbers of objects, and resort to ad-hoc techniques like upsampling and tiling. We propose CoDi, the first latent diffusion-based low-shot counter that produces high-quality density maps on which object locations can be determined by non-maxima suppression. Our core contribution is the new exemplar-based conditioning module that extracts and adjusts the object prototypes to the intermediate layers of the denoising network, leading to accurate object location estimation. On FSC benchmark, CoDi outperforms state-of-the-art by 15% MAE, 13% MAE and 10% MAE in the few-shot, one-shot, and reference-less scenarios, respectively, and sets a new state-of-the-art on MCAC benchmark by outperforming the top method by 44% MAE. The code is available at https://github.com/gsustar/CoDi.

[87] AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model

Sofian Chaybouti, Sanath Narayan, Yasser Dahou, Phúc H. Lê Khac, Ankit Singh, Ngoc Dung Huynh, Wamiq Reyaz Para, Hilde Kuehne, Hakim Hacid

Main category: cs.CV

TL;DR: AMoE: Agglomerative Mixture-of-Experts Vision Foundation Models that distill knowledge from SigLIP2 and DINOv3 simultaneously using asymmetric distillation loss, token-balanced batching, and hierarchical data sampling for efficient multi-teacher training.

Details

Motivation: Multi-teacher distillation for vision foundation models shows promise but lacks understanding of learning dynamics and data efficiency. The paper aims to systematically study these factors to enable training at lower computational cost.

Method: Proposes AMoE framework with: (1) Asymmetric Relation-Knowledge Distillation loss to preserve teacher geometric properties, (2) token-balanced batching for stable multi-resolution learning, (3) hierarchical clustering/sampling for improved data efficiency, and (4) OpenLVD200M curated dataset.

Result: Developed OpenLVD200M corpus (200M images) demonstrating superior efficiency for multi-teacher distillation. Successfully instantiated Mixture-of-Experts models that effectively distill knowledge from both SigLIP2 and DINOv3 teachers.

Conclusion: Multi-teacher distillation can be made more efficient through asymmetric distillation losses, token-balanced batching, and hierarchical data sampling. The released OpenLVD200M dataset and distilled models provide valuable resources for vision foundation model research.

Abstract: Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce Agglomerative Mixture-of-Experts Vision Foundation Models (AMoE), which distill knowledge from SigLIP2 and DINOv3 simultaneously into a Mixture-of-Experts student. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions without sacrificing performance, and (3) hierarchical clustering and sampling of training data–typically reserved for self-supervised learning–substantially improves sample efficiency over random sampling for multi-teacher distillation. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation. Instantiated in a Mixture-of-Experts. We release OpenLVD200M and distilled models.

[88] JDPNet: A Network Based on Joint Degradation Processing for Underwater Image Enhancement

Tao Ye, Hongbin Ren, Chongbing Zhang, Haoran Chen, Xiaosong Li

Main category: cs.CV

TL;DR: JDPNet is a joint degradation processing network for underwater image enhancement that effectively handles nonlinear coupled degradations through unified feature mining and adjustment.

Details

Motivation: Underwater images suffer from complex, nonlinearly coupled degradations (not simple superposition), but existing methods focus on specific degradations individually and fail to capture their coupled interactions effectively.

Method: Proposes JDPNet with: 1) Joint feature-mining module with probabilistic bootstrap distribution strategy for unified mining and adjustment of coupled degradation features; 2) AquaBalanceLoss to balance color, clarity, and contrast by learning from multiple coupled degradation losses.

Result: State-of-the-art performance on six public underwater datasets and two new constructed datasets, with better tradeoff between performance, parameter size, and computational cost.

Conclusion: JDPNet effectively addresses nonlinear coupled degradations in underwater images through unified feature mining and adjustment, outperforming existing methods while maintaining efficiency.

Abstract: Given the complexity of underwater environments and the variability of water as a medium, underwater images are inevitably subject to various types of degradation. The degradations present nonlinear coupling rather than simple superposition, which renders the effective processing of such coupled degradations particularly challenging. Most existing methods focus on designing specific branches, modules, or strategies for specific degradations, with little attention paid to the potential information embedded in their coupling. Consequently, they struggle to effectively capture and process the nonlinear interactions of multiple degradations from a bottom-up perspective. To address this issue, we propose JDPNet, a joint degradation processing network, that mines and unifies the potential information inherent in coupled degradations within a unified framework. Specifically, we introduce a joint feature-mining module, along with a probabilistic bootstrap distribution strategy, to facilitate effective mining and unified adjustment of coupled degradation features. Furthermore, to balance color, clarity, and contrast, we design a novel AquaBalanceLoss to guide the network in learning from multiple coupled degradation losses. Experiments on six publicly available underwater datasets, as well as two new datasets constructed in this study, show that JDPNet exhibits state-of-the-art performance while offering a better tradeoff between performance, parameter size, and computational cost.

Xiangxuan Ren, Zhongdao Wang, Pin Tang, Guoqing Wang, Jilai Zheng, Chao Ma

Main category: cs.CV

TL;DR: LiteFusion is a novel multi-modal 3D detector that uses LiDAR as complementary geometric information for camera-based detection, eliminating 3D backbone dependency for better deployment across diverse hardware platforms while maintaining strong performance even without LiDAR.

Details

Motivation: Current multi-modal 3D detectors rely heavily on LiDAR and complex architectures, suffering performance drops without LiDAR and deployment difficulties on non-GPU hardware due to 3D sparse convolution operators optimized for NVIDIA GPUs.

Method: LiteFusion treats LiDAR as complementary geometric information rather than independent modality, integrates LiDAR features into image features in quaternion space to preserve orthogonal constraints, eliminating 3D backbone and dedicated LiDAR encoders.

Result: On nuScenes dataset, improves baseline vision-based detector by +20.4% mAP and +19.7% NDS with only 1.1% parameter increase; maintains strong performance even without LiDAR input, demonstrating robustness across fusion paradigms.

Conclusion: LiteFusion provides a deployment-friendly, robust multi-modal 3D detection approach that reduces LiDAR dependency while maintaining strong performance, enabling practical deployment across diverse hardware platforms for intelligent transportation systems.

Abstract: 3D object detection is fundamental for safe and robust intelligent transportation systems. Current multi-modal 3D object detectors often rely on complex architectures and training strategies to achieve higher detection accuracy. However, these methods heavily rely on the LiDAR sensor so that they suffer from large performance drops when LiDAR is absent, which compromises the robustness and safety of autonomous systems in practical scenarios. Moreover, existing multi-modal detectors face difficulties in deployment on diverse hardware platforms, such as NPUs and FPGAs, due to their reliance on 3D sparse convolution operators, which are primarily optimized for NVIDIA GPUs. To address these challenges, we reconsider the role of LiDAR in the camera-LiDAR fusion paradigm and introduce a novel multi-modal 3D detector, LiteFusion. Instead of treating LiDAR point clouds as an independent modality with a separate feature extraction backbone, LiteFusion utilizes LiDAR data as a complementary source of geometric information to enhance camera-based detection. This straightforward approach completely eliminates the reliance on a 3D backbone, making the method highly deployment-friendly. Specifically, LiteFusion integrates complementary features from LiDAR points into image features within a quaternion space, where the orthogonal constraints are well-preserved during network training. This helps model domain-specific relations across modalities, yielding a compact cross-modal embedding. Experiments on the nuScenes dataset show that LiteFusion improves the baseline vision-based detector by +20.4% mAP and +19.7% NDS with a minimal increase in parameters (1.1%) without using dedicated LiDAR encoders. Notably, even in the absence of LiDAR input, LiteFusion maintains strong results , highlighting its favorable robustness and effectiveness across diverse fusion paradigms and deployment scenarios.

[90] IndicDLP: A Foundational Dataset for Multi-Lingual and Multi-Domain Document Layout Parsing

Oikantik Nath, Sahithi Kukkala, Mitesh Khapra, Ravi Kiran Sarvadevabhatla

Main category: cs.CV

TL;DR: IndicDLP is a large-scale multilingual document layout dataset for 11 Indic languages plus English, addressing gaps in existing datasets for complex document layout analysis.

Details

Motivation: Existing document layout datasets lack fine-grained region labels, multilingual diversity, and adequate representation of Indic documents with diverse scripts. Current datasets are either too large but coarse (PubLayNet, DocBank) or too small and limited (M6Doc, D4LA), creating a gap for robust multilingual document layout analysis.

Method: Created IndicDLP dataset spanning 11 Indic languages and English across 12 document domains. Also curated UED-mini dataset from DocLayNet and M6Doc for pretraining. Fine-tuned existing English models on IndicDLP to validate effectiveness.

Result: Fine-tuning English models on IndicDLP significantly boosts performance. Models trained on IndicDLP generalize well beyond Indic layouts, making it valuable for broader document digitization tasks.

Conclusion: IndicDLP bridges gaps in scale, diversity, and annotation granularity for document layout analysis, enabling inclusive and efficient document understanding, especially for underrepresented Indic languages and scripts.

Abstract: Document layout analysis is essential for downstream tasks such as information retrieval, extraction, OCR, and digitization. However, existing large-scale datasets like PubLayNet and DocBank lack fine-grained region labels and multilingual diversity, making them insufficient for representing complex document layouts. In contrast, human-annotated datasets such as M6Doc and D4LA offer richer labels and greater domain diversity, but are too small to train robust models and lack adequate multilingual coverage. This gap is especially pronounced for Indic documents, which encompass diverse scripts yet remain underrepresented in current datasets, further limiting progress in this space. To address these shortcomings, we introduce IndicDLP, a large-scale foundational document layout dataset spanning 11 representative Indic languages alongside English and 12 common document domains. Additionally, we curate UED-mini, a dataset derived from DocLayNet and M6Doc, to enhance pretraining and provide a solid foundation for Indic layout models. Our experiments demonstrate that fine-tuning existing English models on IndicDLP significantly boosts performance, validating its effectiveness. Moreover, models trained on IndicDLP generalize well beyond Indic layouts, making it a valuable resource for document digitization. This work bridges gaps in scale, diversity, and annotation granularity, driving inclusive and efficient document understanding.

Jinghao Shi, Jianing Song

Main category: cs.CV

TL;DR: BiCoR-Seg: A bidirectional co-refinement framework for high-resolution remote sensing image semantic segmentation that addresses inter-class similarity and intra-class variability through heatmap-driven bidirectional information synergy and hierarchical supervision.

Details

Motivation: High-resolution remote sensing image semantic segmentation faces challenges of high inter-class similarity and large intra-class variability. Existing methods struggle to inject abstract semantic knowledge into pixel-level feature learning, resulting in blurred boundaries and class confusion in complex scenes.

Method: Proposes Bidirectional Co-Refinement Framework (BiCoR-Seg) with: 1) Heatmap-driven Bidirectional Information Synergy Module (HBIS) that establishes bidirectional flow between feature maps and class embeddings via class-level heatmaps; 2) Hierarchical supervision using interpretable heatmaps as low-resolution predictions; 3) Cross-layer class embedding Fisher Discriminative Loss to enforce intra-class compactness and inter-class separability.

Result: Extensive experiments on LoveDA, Vaihingen, and Potsdam datasets demonstrate outstanding segmentation performance with stronger interpretability compared to existing approaches.

Conclusion: BiCoR-Seg effectively addresses the challenges of high inter-class similarity and intra-class variability in HRSS through bidirectional co-refinement, achieving superior performance while providing enhanced interpretability through heatmap visualization.

Abstract: High-resolution remote sensing image semantic segmentation (HRSS) is a fundamental yet critical task in the field of Earth observation. However, it has long faced the challenges of high inter-class similarity and large intra-class variability. Existing approaches often struggle to effectively inject abstract yet strongly discriminative semantic knowledge into pixel-level feature learning, leading to blurred boundaries and class confusion in complex scenes. To address these challenges, we propose Bidirectional Co-Refinement Framework for HRSS (BiCoR-Seg). Specifically, we design a Heatmap-driven Bidirectional Information Synergy Module (HBIS), which establishes a bidirectional information flow between feature maps and class embeddings by generating class-level heatmaps. Based on HBIS, we further introduce a hierarchical supervision strategy, where the interpretable heatmaps generated by each HBIS module are directly utilized as low-resolution segmentation predictions for supervision, thereby enhancing the discriminative capacity of shallow features. In addition, to further improve the discriminability of the embedding representations, we propose a cross-layer class embedding Fisher Discriminative Loss to enforce intra-class compactness and enlarge inter-class separability. Extensive experiments on the LoveDA, Vaihingen, and Potsdam datasets demonstrate that BiCoR-Seg achieves outstanding segmentation performance while offering stronger interpretability. The released code is available at https://github.com/ShiJinghao566/BiCoR-Seg.

[92] LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation

Daniele Cardullo, Simone Teglia, Irene Amerini

Main category: cs.CV

TL;DR: LADLE-MM is a lightweight multimodal misinformation detector that achieves competitive performance with 60.3% fewer parameters than SOTA models, operating effectively with limited annotations and training resources.

Details

Motivation: The rise of accessible multimedia manipulation tools has created widespread threats of synthetic content and misinformation, especially through image-text pairs. Existing detection methods are computationally intensive or require large annotated datasets, creating a need for efficient, annotation-light solutions.

Method: LADLE-MM uses a model-soup initialized architecture with two unimodal branches (image and text) and a third multimodal branch. It enhances representations with fixed multimodal embeddings from BLIP as a reference space, operating with limited annotations and constrained resources.

Result: Achieves competitive performance on DGM4 benchmark for binary and multi-label classification, outperforming methods without grounding annotations. On VERITE dataset, outperforms more complex LVLM-based approaches, demonstrating strong generalization and robustness to unimodal bias.

Conclusion: LADLE-MM provides an effective, resource-efficient solution for multimodal misinformation detection that generalizes well in open-set settings while being robust to unimodal biases, making it practical for real-world deployment with limited resources.

Abstract: With the rise of easily accessible tools for generating and manipulating multimedia content, realistic synthetic alterations to digital media have become a widespread threat, often involving manipulations across multiple modalities simultaneously. Recently, such techniques have been increasingly employed to distort narratives of important events and to spread misinformation on social media, prompting the development of misinformation detectors. In the context of misinformation conveyed through image-text pairs, several detection methods have been proposed. However, these approaches typically rely on computationally intensive architectures or require large amounts of annotated data. In this work we introduce LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation, a model-soup initialized multimodal misinformation detector designed to operate under a limited annotation setup and constrained training resources. LADLE-MM is composed of two unimodal branches and a third multimodal one that enhances image and text representations with additional multimodal embeddings extracted from BLIP, serving as fixed reference space. Despite using 60.3% fewer trainable parameters than previous state-of-the-art models, LADLE-MM achieves competitive performance on both binary and multi-label classification tasks on the DGM4 benchmark, outperforming existing methods when trained without grounding annotations. Moreover, when evaluated on the VERITE dataset, LADLE-MM outperforms current state-of-the-art approaches that utilize more complex architectures involving Large Vision-Language-Models, demonstrating the effective generalization ability in an open-set setting and strong robustness to unimodal bias.

[93] ${D}^{3}${ETOR}: ${D}$ebate-Enhanced Pseudo Labeling and Frequency-Aware Progressive ${D}$ebiasing for Weakly-Supervised Camouflaged Object ${D}$etection with Scribble Annotations

Jiawei Ge, Jiuxin Cao, Xinyi Li, Xuelin Zhu, Chang Liu, Bo Liu, Chen Feng, Ioannis Patras

Main category: cs.CV

TL;DR: D³ETOR is a two-stage weakly-supervised camouflaged object detection framework that uses debate-enhanced pseudo labeling and frequency-aware progressive debiasing to overcome limitations of unreliable pseudo masks and scribble annotation bias.

Details

Motivation: Existing WSCOD methods lag behind fully supervised approaches due to: (1) unreliable pseudo masks from general-purpose segmentation models lacking COD-specific understanding, and (2) neglect of inherent annotation bias in scribbles that hinders global structure capture.

Method: Two-stage framework: 1) Debate-Enhanced Pseudo Labeling with adaptive entropy-driven point sampling and multi-agent debate mechanism to enhance SAM for COD; 2) FADeNet that progressively fuses multi-level frequency-aware features to balance global semantics with local details while dynamically reweighting supervision strength to alleviate scribble bias.

Result: Significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.

Conclusion: D³ETOR effectively addresses key limitations in WSCOD by improving pseudo mask quality through debate mechanisms and mitigating scribble bias through frequency-aware progressive debiasing, demonstrating superior performance over existing methods.

Abstract: Weakly-Supervised Camouflaged Object Detection (WSCOD) aims to locate and segment objects that are visually concealed within their surrounding scenes, relying solely on sparse supervision such as scribble annotations. Despite recent progress, existing WSCOD methods still lag far behind fully supervised ones due to two major limitations: (1) the pseudo masks generated by general-purpose segmentation models (e.g., SAM) and filtered via rules are often unreliable, as these models lack the task-specific semantic understanding required for effective pseudo labeling in COD; and (2) the neglect of inherent annotation bias in scribbles, which hinders the model from capturing the global structure of camouflaged objects. To overcome these challenges, we propose ${D}^{3}$ETOR, a two-stage WSCOD framework consisting of Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing. In the first stage, we introduce an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to enhance the capability of SAM for COD, improving the interpretability and precision of pseudo masks. In the second stage, we design FADeNet, which progressively fuses multi-level frequency-aware features to balance global semantic understanding with local detail modeling, while dynamically reweighting supervision strength across regions to alleviate scribble bias. By jointly exploiting the supervision signals from both the pseudo masks and scribble semantics, ${D}^{3}$ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.

[94] UbiQVision: Quantifying Uncertainty in XAI for Image Recognition

Akshat Dubey, Aleksandar Anžel, Bahar İlgen, Georges Hattab

Main category: cs.CV

TL;DR: A framework combining Dirichlet posterior sampling and Dempster-Shafer theory to quantify uncertainty in SHAP explanations for medical imaging models, addressing instability from epistemic and aleatoric uncertainty.

Details

Motivation: Deep learning models in medical imaging are becoming more complex (ResNets, Vision Transformers, Hybrid CNNs), which compromises explainability. SHAP provides interpretable visualizations but can be unstable and unreliable due to epistemic and aleatoric uncertainty, especially in medical imaging with varying data characteristics.

Method: Uses Dirichlet posterior sampling and Dempster-Shafer theory to quantify uncertainty from unstable SHAP explanations. Implements belief, plausible, and fusion map approach alongside statistical quantitative analysis to produce uncertainty quantification in SHAP.

Result: Framework evaluated on three medical imaging datasets with varying class distributions, image qualities, and modality types (pathology, ophthalmology, radiology) that introduce significant epistemic uncertainty due to noise from varying resolutions and modality-specific aspects.

Conclusion: The proposed framework addresses the challenge of unstable SHAP explanations in medical imaging by providing systematic uncertainty quantification, making model interpretations more reliable for domain experts dealing with complex deep learning models.

Abstract: Recent advances in deep learning have led to its widespread adoption across diverse domains, including medical imaging. This progress is driven by increasingly sophisticated model architectures, such as ResNets, Vision Transformers, and Hybrid Convolutional Neural Networks, that offer enhanced performance at the cost of greater complexity. This complexity often compromises model explainability and interpretability. SHAP has emerged as a prominent method for providing interpretable visualizations that aid domain experts in understanding model predictions. However, SHAP explanations can be unstable and unreliable in the presence of epistemic and aleatoric uncertainty. In this study, we address this challenge by using Dirichlet posterior sampling and Dempster-Shafer theory to quantify the uncertainty that arises from these unstable explanations in medical imaging applications. The framework uses a belief, plausible, and fusion map approach alongside statistical quantitative analysis to produce quantification of uncertainty in SHAP. Furthermore, we evaluated our framework on three medical imaging datasets with varying class distributions, image qualities, and modality types which introduces noise due to varying image resolutions and modality-specific aspect covering the examples from pathology, ophthalmology, and radiology, introducing significant epistemic uncertainty.

[95] The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection

Qingdong He, Xueqin Chen, Yanjie Pan, Peng Tang, Pengcheng Xu, Zhenye Gan, Chengjie Wang, Xiaobin Hu, Jiangning Zhang, Yabiao Wang

Main category: cs.CV

TL;DR: KeyTailor is a diffusion transformer-based video virtual try-on framework that uses keyframe-driven details injection to improve garment dynamics and background consistency while reducing computational costs, paired with a new high-quality dataset ViT-HD.

Details

Motivation: Existing DiT-based video virtual try-on methods struggle with capturing fine-grained garment dynamics, preserving background integrity across frames, and have high computational costs due to additional interaction modules. Limited dataset quality and scale also restrict model generalization.

Method: KeyTailor uses a keyframe-driven details injection strategy with instruction-guided keyframe sampling to filter informative frames. It employs two tailored modules: garment details enhancement module (distills garment dynamics into garment-related latents) and collaborative background optimization module (optimizes background latents integrity). These enriched details are injected into standard DiT blocks with pose, mask, and noise latents without modifying DiT architecture.

Result: KeyTailor outperforms state-of-the-art baselines in garment fidelity and background integrity across both dynamic and static scenarios. The framework achieves consistency without explicit DiT architecture modifications while avoiding additional complexity.

Conclusion: KeyTailor effectively addresses challenges in video virtual try-on by leveraging keyframes to capture garment dynamics and background consistency, while the ViT-HD dataset provides high-quality training data for improved generalization.

Abstract: Although diffusion transformer (DiT)-based video virtual try-on (VVT) has made significant progress in synthesizing realistic videos, existing methods still struggle to capture fine-grained garment dynamics and preserve background integrity across video frames. They also incur high computational costs due to additional interaction modules introduced into DiTs, while the limited scale and quality of existing public datasets also restrict model generalization and effective training. To address these challenges, we propose a novel framework, KeyTailor, along with a large-scale, high-definition dataset, ViT-HD. The core idea of KeyTailor is a keyframe-driven details injection strategy, motivated by the fact that keyframes inherently contain both foreground dynamics and background consistency. Specifically, KeyTailor adopts an instruction-guided keyframe sampling strategy to filter informative frames from the input video. Subsequently,two tailored keyframe-driven modules, the garment details enhancement module and the collaborative background optimization module, are employed to distill garment dynamics into garment-related latents and to optimize the integrity of background latents, both guided by keyframes.These enriched details are then injected into standard DiT blocks together with pose, mask, and noise latents, enabling efficient and realistic try-on video synthesis. This design ensures consistency without explicitly modifying the DiT architecture, while simultaneously avoiding additional complexity. In addition, our dataset ViT-HD comprises 15, 070 high-quality video samples at a resolution of 810*1080, covering diverse garments. Extensive experiments demonstrate that KeyTailor outperforms state-of-the-art baselines in terms of garment fidelity and background integrity across both dynamic and static scenarios.

[96] CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation

V. Kovalev, A. Kuvshinov, A. Buzovkin, D. Pokidov, D. Timonin

Main category: cs.CV

TL;DR: CRAFT introduces a training-free framework for text-to-image generation that uses structured reasoning with visual question decomposition, verification, and targeted prompt edits to improve compositional accuracy with minimal inference overhead.

Details

Motivation: Existing inference-time reasoning approaches for text-to-image generation rely on implicit critiques or unconstrained prompt rewrites, making them difficult to interpret, control, or stop reliably. The paper aims to bring the structured reasoning paradigm that has benefited large language models (with verification, targeted correction, and early stopping) to multimodal image generation.

Method: CRAFT decomposes prompts into dependency-structured visual questions, verifies generated images using a vision-language model, and applies targeted prompt edits through an LLM agent only where constraints fail. The process iterates with an explicit stopping criterion once all constraints are satisfied, creating an interpretable and controllable inference-time refinement loop.

Result: CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations across multiple model families and challenging benchmarks, with particularly strong gains for lightweight generators. These improvements come with negligible inference-time overhead, allowing smaller models to approach the quality of more expensive systems.

Conclusion: Explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models, offering an interpretable and controllable approach to enhancing text-to-image generation without retraining.

Abstract: Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of thinking based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free, model-agnostic framework that brings this structured reasoning paradigm to multimodal image generation. CRAFT decomposes a prompt into dependency-structured visual questions, veries generated images using a vision-language model, and applies targeted prompt edits through an LLM agent only where constraints fail. The process iterates with an explicit stopping criterion once all constraints are satised, yielding an interpretable and controllable inference-time renement loop. Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.

[97] Linking Faces and Voices Across Languages: Insights from the FAME 2026 Challenge

Marta Moscati, Ahmed Abdullah, Muhammad Saad Saeed, Shah Nawaz, Rohan Kumar Das, Muhammad Zaigham Zaheer, Junaid Mir, Muhammad Haroon Yousaf, Khalid Mahmood Malik, Markus Schedl

Main category: cs.CV

TL;DR: The FAME 2026 Challenge at ICASSP 2026 focuses on developing face-voice association methods that work when test language differs from training language, addressing multilingual communication scenarios.

Details

Motivation: Over half the world's population is bilingual, and people often communicate in multilingual scenarios, creating a need for face-voice association methods that can handle language mismatches between training and testing.

Method: The paper describes a challenge framework (FAME 2026 Challenge) where participants develop methods for cross-language face-voice association, though specific technical approaches are not detailed in this summary.

Result: This is a challenge summary paper, so results would include the challenge framework, evaluation metrics, and presumably participant submissions and performance, though not explicitly stated in the abstract.

Conclusion: The FAME 2026 Challenge addresses an important real-world problem of face-voice association in multilingual environments and provides a platform for developing solutions to language-mismatch scenarios.

Abstract: Over half of the world’s population is bilingual and people often communicate under multilingual scenarios. The Face-Voice Association in Multilingual Environments (FAME) 2026 Challenge, held at ICASSP 2026, focuses on developing methods for face-voice association that are effective when the language at test-time is different than the training one. This report provides a brief summary of the challenge.

[98] SmartSplat: Feature-Smart Gaussians for Scalable Compression of Ultra-High-Resolution Images

Linfei Li, Lin Zhang, Zhong Wang, Ying Shen

Main category: cs.CV

TL;DR: SmartSplat is a Gaussian Splatting-based image compression framework that uses adaptive feature-aware sampling to achieve high compression ratios while maintaining reconstruction quality for ultra-high-resolution images.

Details

Motivation: The paper addresses challenges in compressing ultra-high-resolution visual content generated by modern AI systems, where existing 2D Gaussian image models struggle to balance compression ratio and reconstruction fidelity at high resolutions.

Method: SmartSplat uses a feature-aware approach with Gradient-Color Guided Variational Sampling and Exclusion-based Uniform Sampling to improve Gaussian primitive coverage. It also employs Scale-Adaptive Gaussian Color Sampling for better color initialization across scales, with joint optimization of spatial layout, scale, and color initialization.

Result: Extensive experiments on DIV8K and a new 16K dataset show SmartSplat outperforms state-of-the-art methods at comparable compression ratios, exceeds their compression limits, and demonstrates strong scalability and practical applicability.

Conclusion: SmartSplat provides an effective solution for ultra-high-resolution image compression by efficiently capturing both local structures and global textures using limited Gaussians, offering practical advantages for real-world applications.

Abstract: Recent advances in generative AI have accelerated the production of ultra-high-resolution visual content, posing significant challenges for efficient compression and real-time decoding on end-user devices. Inspired by 3D Gaussian Splatting, recent 2D Gaussian image models improve representation efficiency, yet existing methods struggle to balance compression ratio and reconstruction fidelity in ultra-high-resolution scenarios. To address this issue, we propose SmartSplat, a highly adaptive and feature-aware GS-based image compression framework that supports arbitrary image resolutions and compression ratios. SmartSplat leverages image-aware features such as gradients and color variances, introducing a Gradient-Color Guided Variational Sampling strategy together with an Exclusion-based Uniform Sampling scheme to improve the non-overlapping coverage of Gaussian primitives in pixel space. In addition, we propose a Scale-Adaptive Gaussian Color Sampling method to enhance color initialization across scales. Through joint optimization of spatial layout, scale, and color initialization, SmartSplat efficiently captures both local structures and global textures using a limited number of Gaussians, achieving high reconstruction quality under strong compression. Extensive experiments on DIV8K and a newly constructed 16K dataset demonstrate that SmartSplat consistently outperforms state-of-the-art methods at comparable compression ratios and exceeds their compression limits, showing strong scalability and practical applicability. The code is publicly available at https://github.com/lif314/SmartSplat.

[99] DETACH : Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning

Junho Yoon, Jaemo Jung, Hyunju Kim, Dongman Lee

Main category: cs.CV

TL;DR: DETACH: A decomposed spatio-temporal framework for aligning exocentric video with ambient sensors, addressing limitations of global alignment approaches in capturing local details and handling similar temporal patterns with different contexts.

Details

Motivation: Egocentric video with wearable sensors has practical limitations (user discomfort, privacy concerns, scalability). Exocentric video with ambient sensors offers a non-intrusive, scalable alternative, but existing global alignment methods fail to capture local details and misalign actions with similar temporal patterns but different spatio-semantic contexts.

Method: DETACH decomposes spatio-temporal features explicitly to preserve local details. Uses novel sensor-spatial features discovered via online clustering for semantic grounding. Two-stage alignment: first establishes spatial correspondence through mutual supervision, then performs temporal alignment via spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives.

Result: Comprehensive experiments on Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines in downstream tasks.

Conclusion: DETACH effectively addresses the limitations of global alignment in exocentric-ambient settings by decomposing spatio-temporal features and using a novel two-stage alignment approach, enabling better action recognition with non-intrusive sensors.

Abstract: Aligning egocentric video with wearable sensors have shown promise for human action recognition, but face practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment by encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, causing misalignment between actions sharing similar temporal patterns with different spatio-semantic contexts. To resolve these problems, we propose DETACH, a decomposed spatio-temporal framework. This explicit decomposition preserves local details, while our novel sensor-spatial features discovered via online clustering provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach establishes spatial correspondence through mutual supervision, then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.

[100] Skin Lesion Classification Using a Soft Voting Ensemble of Convolutional Neural Networks

Abdullah Al Shafi, Abdul Muntakim, Pintu Chandra Shill, Rowzatul Zannat, Abdullah Al-Amin

Main category: cs.CV

TL;DR: Early skin cancer classification using soft voting ensemble of CNNs (MobileNetV2, VGG19, InceptionV3) with segmentation preprocessing, achieving 90-96% accuracy on three benchmark datasets.

Details

Motivation: Early detection of skin cancer significantly improves survival rates. While AI using CNNs can improve diagnostic accuracy, there's a need for methods that balance accuracy with computational efficiency for real-world clinical deployment.

Method: Used three benchmark datasets (HAM10000, ISIC 2016, ISIC 2019) with preprocessing including rebalancing, augmentation, and filtering. Applied hybrid dual encoder for segmentation via transfer learning to focus on clinically significant features. Classification performed using soft voting ensemble of MobileNetV2, VGG19, and InceptionV3 CNNs.

Result: Achieved lesion recognition accuracies of 96.32%, 90.86%, and 93.92% on the three datasets respectively. The ensemble approach balanced accuracy and speed for practical deployment.

Conclusion: The proposed soft voting ensemble CNN method with segmentation preprocessing effectively classifies skin cancer with high accuracy while maintaining computational efficiency suitable for real-world clinical applications.

Abstract: Skin cancer can be identified by dermoscopic examination and ocular inspection, but early detection significantly increases survival chances. Artificial intelligence (AI), using annotated skin images and Convolutional Neural Networks (CNNs), improves diagnostic accuracy. This paper presents an early skin cancer classification method using a soft voting ensemble of CNNs. In this investigation, three benchmark datasets, namely HAM10000, ISIC 2016, and ISIC 2019, were used. The process involved rebalancing, image augmentation, and filtering techniques, followed by a hybrid dual encoder for segmentation via transfer learning. Accurate segmentation focused classification models on clinically significant features, reducing background artifacts and improving accuracy. Classification was performed through an ensemble of MobileNetV2, VGG19, and InceptionV3, balancing accuracy and speed for real-world deployment. The method achieved lesion recognition accuracies of 96.32%, 90.86%, and 93.92% for the three datasets. The system performance was evaluated using established skin lesion detection metrics, yielding impressive results.

[101] High Dimensional Data Decomposition for Anomaly Detection of Textured Images

Ji Song, Xing Wang, Jianguo Wu, Xiaowei Yue

Main category: cs.CV

TL;DR: TBSD method for anomaly detection in textured images with smooth backgrounds using texture basis learning and decomposition to reduce misidentification and dataset requirements.

Details

Motivation: Conventional anomaly detection methods struggle with textured defect images, suffering from misidentification, low robustness, and excessive reliance on large structured datasets.

Method: Texture Basis integrated Smooth Decomposition (TBSD) approach with two processes: 1) learning texture basis functions to extract quasi-periodic texture patterns, 2) using texture basis as prior knowledge for anomaly detection to prevent texture misidentification.

Result: TBSD surpasses benchmarks with less misidentification, smaller training dataset requirements, and superior anomaly detection performance on both simulation and real-world datasets.

Conclusion: The proposed TBSD method provides an efficient solution for anomaly detection in textured images with smooth backgrounds, addressing limitations of conventional methods through mathematical formulation of quasi-periodicity and texture basis learning.

Abstract: In the realm of diverse high-dimensional data, images play a significant role across various processes of manufacturing systems where efficient image anomaly detection has emerged as a core technology of utmost importance. However, when applied to textured defect images, conventional anomaly detection methods have limitations including non-negligible misidentification, low robustness, and excessive reliance on large-scale and structured datasets. This paper proposes a texture basis integrated smooth decomposition (TBSD) approach, which is targeted at efficient anomaly detection in textured images with smooth backgrounds and sparse anomalies. Mathematical formulation of quasi-periodicity and its theoretical properties are investigated for image texture estimation. TBSD method consists of two principal processes: the first process learns the texture basis functions to effectively extract quasi-periodic texture patterns; the subsequent anomaly detection process utilizes that texture basis as prior knowledge to prevent texture misidentification and capture potential anomalies with high accuracy.The proposed method surpasses benchmarks with less misidentification, smaller training dataset requirement, and superior anomaly detection performance on both simulation and real-world datasets.

[102] Beyond Motion Pattern: An Empirical Study of Physical Forces for Human Motion Understanding

Anh Dao, Manh Tran, Yufei Zhang, Xiaoming Liu, Zijun Cui

Main category: cs.CV

TL;DR: Incorporating physically inferred joint actuation forces into motion understanding pipelines consistently improves performance across gait recognition, action recognition, and video captioning tasks.

Details

Motivation: Most existing vision-based motion understanding methods overlook physical cues like joint actuation forces that are fundamental in biomechanics, creating a gap in understanding whether and when these forces can enhance motion analysis.

Method: By incorporating physically inferred forces into established motion understanding pipelines and systematically evaluating their impact across baseline models on three major tasks: gait recognition, action recognition, and fine-grained video captioning.

Result: Across 8 benchmarks, incorporating forces yielded consistent performance gains: +0.87% Rank-1 accuracy on CASIA-B gait recognition (with larger gains under challenging conditions), +1.3% on Gait3D, +2.00% on Penn Action recognition, +6.96% for high-exertion actions, and +0.029 ROUGE-L score improvement in video captioning.

Conclusion: Force cues substantially complement visual and kinematic features, especially under dynamic, occluded, or appearance-varying conditions, demonstrating that physically inferred forces enhance motion understanding across multiple domains.

Abstract: Human motion understanding has advanced rapidly through vision-based progress in recognition, tracking, and captioning. However, most existing methods overlook physical cues such as joint actuation forces that are fundamental in biomechanics. This gap motivates our study: if and when do physically inferred forces enhance motion understanding? By incorporating forces into established motion understanding pipelines, we systematically evaluate their impact across baseline models on 3 major tasks: gait recognition, action recognition, and fine-grained video captioning. Across 8 benchmarks, incorporating forces yields consistent performance gains; for example, on CASIA-B, Rank-1 gait recognition accuracy improved from 89.52% to 90.39% (+0.87), with larger gain observed under challenging conditions: +2.7% when wearing a coat and +3.0% at the side view. On Gait3D, performance also increases from 46.0% to 47.3% (+1.3). In action recognition, CTR-GCN achieved +2.00% on Penn Action, while high-exertion classes like punching/slapping improved by +6.96%. Even in video captioning, Qwen2.5-VL’s ROUGE-L score rose from 0.310 to 0.339 (+0.029), indicating that physics-inferred forces enhance temporal grounding and semantic richness. These results demonstrate that force cues can substantially complement visual and kinematic features under dynamic, occluded, or appearance-varying conditions.

[103] UTDesign: A Unified Framework for Stylized Text Editing and Generation in Graphic Design Images

Yiming Zhao, Yuanpeng Gao, Yuxuan Luo, Jiwei Duan, Shisong Lin, Longfei Xiong, Zhouhui Lian

Main category: cs.CV

TL;DR: UTDesign is a unified framework for high-precision stylized text editing and conditional text generation in design images, supporting both English and Chinese scripts, with state-of-the-art performance in stylistic consistency and text accuracy.

Details

Motivation: While diffusion-based text-to-image models are powerful for visual content generation, their text rendering performance for small-scale typography and non-Latin scripts remains limited, creating a need for better AI-assisted graphic design tools.

Method: Proposes UTDesign with three key components: 1) A novel DiT-based text style transfer model trained from scratch on synthetic data to generate transparent RGBA text foregrounds, 2) A conditional text generation framework with multi-modal condition encoder trained on curated text-annotated data, and 3) Integration into a fully automated text-to-design pipeline using pre-trained T2I models and MLLM-based layout planner.

Result: UTDesign achieves state-of-the-art performance among open-source methods in stylistic consistency and text accuracy, and exhibits unique advantages compared to proprietary commercial approaches.

Conclusion: UTDesign provides a unified solution for high-quality text editing and generation in design images, addressing limitations of existing diffusion models for typography and non-Latin scripts, with promising applications in automated graphic design.

Abstract: AI-assisted graphic design has emerged as a powerful tool for automating the creation and editing of design elements such as posters, banners, and advertisements. While diffusion-based text-to-image models have demonstrated strong capabilities in visual content generation, their text rendering performance, particularly for small-scale typography and non-Latin scripts, remains limited. In this paper, we propose UTDesign, a unified framework for high-precision stylized text editing and conditional text generation in design images, supporting both English and Chinese scripts. Our framework introduces a novel DiT-based text style transfer model trained from scratch on a synthetic dataset, capable of generating transparent RGBA text foregrounds that preserve the style of reference glyphs. We further extend this model into a conditional text generation framework by training a multi-modal condition encoder on a curated dataset with detailed text annotations, enabling accurate, style-consistent text synthesis conditioned on background images, prompts, and layout specifications. Finally, we integrate our approach into a fully automated text-to-design (T2D) pipeline by incorporating pre-trained text-to-image (T2I) models and an MLLM-based layout planner. Extensive experiments demonstrate that UTDesign achieves state-of-the-art performance among open-source methods in terms of stylistic consistency and text accuracy, and also exhibits unique advantages compared to proprietary commercial approaches. Code and data for this paper are available at https://github.com/ZYM-PKU/UTDesign.

[104] Multi-temporal Adaptive Red-Green-Blue and Long-Wave Infrared Fusion for You Only Look Once-Based Landmine Detection from Unmanned Aerial Systems

James E. Gallagher, Edward J. Oughton, Jana Kosecka

Main category: cs.CV

TL;DR: This paper evaluates adaptive RGB and LWIR fusion for UAS-based landmine detection using YOLO architectures, finding YOLOv11 with 10-30% thermal fusion at 5-10m altitude as optimal, and reveals tradeoffs between accuracy and efficiency across different detection models.

Details

Motivation: Landmines pose a severe humanitarian threat with 110 million deployed mines causing 26,000 annual casualties. The research aims to improve detection using thermal contrast between mines and surrounding soil via UAS-based systems.

Method: Used adaptive RGB and Long-Wave Infrared fusion with YOLO architectures (v8, v10, v11) across 114 test images generating 35,640 evaluations. Compared multiple architectures including RF-DETR, Faster R-CNN, and RetinaNet. Employed multi-temporal training datasets and analyzed detection parameters like altitude and thermal fusion percentages.

Result: YOLOv11 achieved optimal performance (86.8% mAP) with 10-30% thermal fusion at 5-10m altitude. RF-DETR had highest accuracy (69.2% mAP) but YOLOv11 trained 17.7x faster. Multi-temporal training outperformed season-specific approaches by 1.8-9.6%. Anti-Tank mines detected at 61.9% vs 19.2% for Anti-Personnel mines.

Conclusion: Adaptive thermal fusion with YOLOv11 provides effective landmine detection with optimal parameters identified. There’s a critical accuracy-efficiency tradeoff between transformer-based and YOLO architectures. Future research should examine thermal contrast effects for buried mines across different soil types.

Abstract: Landmines remain a persistent humanitarian threat, with 110 million actively deployed mines across 60 countries, claiming 26,000 casualties annually. This research evaluates adaptive Red-Green-Blue (RGB) and Long-Wave Infrared (LWIR) fusion for Unmanned Aerial Systems (UAS)-based detection of surface-laid landmines, leveraging the thermal contrast between the ordnance and the surrounding soil to enhance feature extraction. Using You Only Look Once (YOLO) architectures (v8, v10, v11) across 114 test images, generating 35,640 model-condition evaluations, YOLOv11 achieved optimal performance (86.8% mAP), with 10 to 30% thermal fusion at 5 to 10m altitude identified as the optimal detection parameters. A complementary architectural comparison revealed that while RF-DETR achieved the highest accuracy (69.2% mAP), followed by Faster R-CNN (67.6%), YOLOv11 (64.2%), and RetinaNet (50.2%), YOLOv11 trained 17.7 times faster than the transformer-based RF-DETR (41 minutes versus 12 hours), presenting a critical accuracy-efficiency tradeoff for operational deployment. Aggregated multi-temporal training datasets outperformed season-specific approaches by 1.8 to 9.6%, suggesting that models benefit from exposure to diverse thermal conditions. Anti-Tank (AT) mines achieved 61.9% detection accuracy, compared with 19.2% for Anti-Personnel (AP) mines, reflecting both the size differential and thermal-mass differences between these ordnance classes. As this research examined surface-laid mines where thermal contrast is maximized, future research should quantify thermal contrast effects for mines buried at varying depths across heterogeneous soil types.

[105] Vision Language Models are Confused Tourists

Patrick Amadeus Irawan, Ikhlasul Akmal Hanif, Muhammad Dehan Al Kautsar, Genta Indra Winata, Fajri Koto, Alham Fikri Aji

Main category: cs.CV

TL;DR: VLMs struggle with cultural robustness when multiple cultural cues coexist in images, showing significant accuracy drops with simple perturbations.

Details

Motivation: Current VLM evaluations overlook scenarios with multiple cultural concepts per image, creating a gap in understanding cultural robustness needed for diverse societies.

Method: Introduce ConfusedTourist, a cultural adversarial robustness suite that tests VLM stability against perturbed geographical cues using image-stacking and image-generation-based perturbations.

Result: VLMs show critical vulnerability with heavy accuracy drops under simple perturbations, worsening with image-generation variants. Failures stem from systematic attention shifts toward distracting cues.

Conclusion: Visual cultural concept mixing substantially impairs state-of-the-art VLMs, highlighting urgent need for more culturally robust multimodal understanding.

Abstract: Although the cultural dimension has been one of the key aspects in evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs remains largely untested, despite being crucial to support diversity and multicultural societies. Existing evaluations often rely on benchmarks featuring only a singular cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs’ stability against perturbed geographical cues. Our experiments reveal a critical vulnerability, where accuracy drops heavily under simple image-stacking perturbations and even worsens with its image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.

[106] Bridging Modalities and Transferring Knowledge: Enhanced Multimodal Understanding and Recognition

Gorjan Radevski

Main category: cs.CV

TL;DR: This dissertation presents multimodal alignment methods across five chapters: spatial reasoning from text to 2D scenes, medical text to 3D anatomical mapping, text to knowledge graph facts, video-object fusion for action recognition, and multimodal knowledge distillation for efficient egocentric recognition.

Details

Motivation: To enhance machine understanding of complex multimodal inputs by addressing key challenges in multimodal machine learning, including spatial reasoning, medical text interpretation, knowledge extraction, and action recognition across different domains.

Method: Five-chapter approach: 1) Spatial-Reasoning Bert for text-to-2D scene translation, 2) Medical text to 3D anatomical mapping with spatial co-occurrence loss, 3) Text to knowledge graph fact extraction with benchmark creation, 4) Video-object fusion for compositional action recognition, 5) Multimodal knowledge distillation for RGB-only egocentric action recognition.

Result: Developed methods for automated scene generation from spatial language, interpretable medical text navigation, clearer knowledge graph extraction, improved action recognition robustness, and efficient RGB-only models with multimodal capabilities through knowledge distillation.

Conclusion: The contributions advance multimodal alignment methodologies across spatial understanding, medical interpretation, knowledge enrichment, and action recognition, enhancing computational systems’ ability to process complex multimodal inputs in diverse applications.

Abstract: This manuscript explores multimodal alignment, translation, fusion, and transference to enhance machine understanding of complex inputs. We organize the work into five chapters, each addressing unique challenges in multimodal machine learning. Chapter 3 introduces Spatial-Reasoning Bert for translating text-based spatial relations into 2D arrangements between clip-arts. This enables effective decoding of spatial language into visual representations, paving the way for automated scene generation aligned with human spatial understanding. Chapter 4 presents a method for translating medical texts into specific 3D locations within an anatomical atlas. We introduce a loss function leveraging spatial co-occurrences of medical terms to create interpretable mappings, significantly enhancing medical text navigability. Chapter 5 tackles translating structured text into canonical facts within knowledge graphs. We develop a benchmark for linking natural language to entities and predicates, addressing ambiguities in text extraction to provide clearer, actionable insights. Chapter 6 explores multimodal fusion methods for compositional action recognition. We propose a method fusing video frames and object detection representations, improving recognition robustness and accuracy. Chapter 7 investigates multimodal knowledge transference for egocentric action recognition. We demonstrate how multimodal knowledge distillation enables RGB-only models to mimic multimodal fusion-based capabilities, reducing computational requirements while maintaining performance. These contributions advance methodologies for spatial language understanding, medical text interpretation, knowledge graph enrichment, and action recognition, enhancing computational systems’ ability to process complex, multimodal inputs across diverse applications.

[107] SirenPose: Dynamic Scene Reconstruction via Geometric Supervision

Kaitong Cai, Jensen Zhang, Jing Yang, Keze Wang

Main category: cs.CV

TL;DR: SirenPose is a novel method that combines sinusoidal representation networks with geometric supervision for accurate, temporally consistent 3D scene reconstruction from monocular videos, outperforming state-of-the-art methods on multiple benchmarks.

Details

Motivation: Existing methods struggle with motion fidelity and spatiotemporal coherence in challenging scenarios involving fast motion, multi-object interaction, occlusion, and rapid scene changes. There's a need for better reconstruction of dynamic 3D scenes from monocular videos.

Method: SirenPose integrates periodic activation properties of sinusoidal representation networks (SIRENs) with keypoint-based geometric supervision. It incorporates physics-inspired constraints for coherent keypoint predictions across spatial and temporal dimensions, uses high-frequency signal modeling for fine-grained details, expands the UniKPT dataset to 600k annotated instances, and integrates graph neural networks to model keypoint relationships and structural correlations.

Result: On DAVIS benchmark: 17.8% reduction in FVD, 28.7% reduction in FID, 6.0% improvement in LPIPS compared to MoSCA. Also improves temporal consistency, geometric accuracy, user score, and motion smoothness. In pose estimation, outperforms Monst3R with lower absolute trajectory error and reduced translational/rotational relative pose error.

Conclusion: SirenPose effectively handles rapid motion, complex dynamics, and enables physically plausible reconstruction of dynamic 3D scenes from monocular videos, demonstrating superior performance across multiple benchmarks and metrics compared to state-of-the-art methods.

Abstract: We introduce SirenPose, a geometry-aware loss formulation that integrates the periodic activation properties of sinusoidal representation networks with keypoint-based geometric supervision, enabling accurate and temporally consistent reconstruction of dynamic 3D scenes from monocular videos. Existing approaches often struggle with motion fidelity and spatiotemporal coherence in challenging settings involving fast motion, multi-object interaction, occlusion, and rapid scene changes. SirenPose incorporates physics inspired constraints to enforce coherent keypoint predictions across both spatial and temporal dimensions, while leveraging high frequency signal modeling to capture fine grained geometric details. We further expand the UniKPT dataset to 600,000 annotated instances and integrate graph neural networks to model keypoint relationships and structural correlations. Extensive experiments on benchmarks including Sintel, Bonn, and DAVIS demonstrate that SirenPose consistently outperforms state-of-the-art methods. On DAVIS, SirenPose achieves a 17.8 percent reduction in FVD, a 28.7 percent reduction in FID, and a 6.0 percent improvement in LPIPS compared to MoSCA. It also improves temporal consistency, geometric accuracy, user score, and motion smoothness. In pose estimation, SirenPose outperforms Monst3R with lower absolute trajectory error as well as reduced translational and rotational relative pose error, highlighting its effectiveness in handling rapid motion, complex dynamics, and physically plausible reconstruction.

[108] VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang

Main category: cs.CV

TL;DR: VLMs struggle with long-context understanding when using vision-text compression (VTC) despite good OCR decoding, revealing a gap in their ability to process highly dense 2D visual representations of text.

Details

Motivation: Vision-text compression (VTC) offers token compression ratios of 3x-20x for LLMs but its impact on VLMs' long-context capabilities is under-investigated. The paper aims to systematically assess how VTC affects VLMs' ability to understand and process compressed long-context information.

Method: Introduced the first VTC benchmark with three evaluation settings: VTC-Retrieval (information retrieval/aggregation), VTC-Reasoning (inferring latent associations with minimal lexical overlap), and VTC-Memory (comprehensive QA within long-term dialogue memory). Also created VTCBench-Wild for diverse input scenarios. Evaluated leading open-source and proprietary models.

Result: Most VLMs exhibit surprisingly poor long-context understanding with VTC-processed information, despite good OCR decoding ability. They fail to capture long associations or dependencies in compressed contexts, revealing a significant limitation in their architecture.

Conclusion: The study provides foundational understanding of VTC limitations in VLMs and highlights the need for designing more efficient and scalable VLMs that can effectively process highly compressed visual-text representations while maintaining long-context understanding capabilities.

Abstract: The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model’s ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input scenarios.We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-processed information, failing to capture long associations or dependencies in the context.This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.

[109] AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment

Anna Šárová Mikeštíková, Médéric Fourmy, Martin Cífka, Josef Sivic, Vladimir Petrik

Main category: cs.CV

TL;DR: AlignPose: Multi-view 6D object pose estimation method that aggregates RGB views without object-specific training, using novel multi-view feature-metric refinement to overcome single-view limitations.

Details

Motivation: Single-view RGB pose estimation suffers from depth ambiguity, clutter, and occlusions. Multi-view methods could solve these but either rely on precise single-view estimates or lack generalization to unseen objects.

Method: AlignPose aggregates information from multiple calibrated RGB views without object-specific training. Key innovation is multi-view feature-metric refinement that optimizes a single consistent world-frame pose by minimizing feature discrepancy between rendered and observed features across all views simultaneously.

Result: Extensive experiments on four datasets (YCB-V, T-LESS, ITODD-MV, HouseCat6D) using BOP benchmark show AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are available.

Conclusion: AlignPose effectively addresses limitations of single-view pose estimation through multi-view aggregation and feature-metric refinement, demonstrating superior performance particularly in industrial applications with multiple camera views.

Abstract: Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously. Third, we report extensive experiments on four datasets (YCB-V, T-LESS, ITODD-MV, HouseCat6D) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.

[110] Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios

Mingwei Tang, Jiahao Nie, Guang Yang, Ziqing Cui, Jie Li

Main category: cs.CV

TL;DR: MTIF introduces multi-grained text guidance with hierarchical cross-modal modulation for better image fusion, outperforming previous methods on multi-exposure and multi-focus tasks.

Details

Motivation: Existing text-guided image fusion methods use coarse-grained descriptions that limit fine-grained detail understanding and precise cross-modal alignment, hindering fusion quality.

Method: Proposes MTIF with three key designs: 1) multi-grained textual descriptions (fine details, structural cues, semantic content) with hierarchical cross-modal modulation, 2) supervision at each granularity for better visual-textual alignment, 3) saliency-driven enrichment module to augment training data with dense semantic content.

Result: Extensive experiments show MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks.

Conclusion: MTIF’s multi-grained text guidance paradigm effectively addresses limitations of coarse-grained text descriptions, achieving superior image fusion performance through improved cross-modal alignment and modulation.

Abstract: Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision-language models, recent methods incorporate textual descriptions as auxiliary guidance to enhance fusion quality. However, simply incorporating coarse-grained descriptions hampers the understanding of fine-grained details and poses challenges for precise cross-modal alignment. To address these limitations, we propose Multi-grained Text-guided Image Fusion (MTIF), a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content, guiding image fusion through a hierarchical cross-modal modulation module. Second, it involves supervision signals at each granularity to facilitate alignment between visual and textual features and enhance the utility of auxiliary text. Third, it adopts a saliency-driven enrichment module to augment training data with dense semantic content, further strengthening the cross-modal modulation and alignment. Extensive experiments show that MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks.

[111] Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi

Main category: cs.CV

TL;DR: DSR Suite addresses VLMs’ weakness in dynamic spatial reasoning through automated 4D-aware dataset generation (DSR-Train/Bench) and a Geometry Selection Module that integrates geometric priors into VLMs.

Details

Motivation: Vision-language models are strong at general understanding but weak at dynamic spatial reasoning (reasoning about object geometry and relationships in 3D space over time), largely due to lack of scalable 4D-aware training data.

Method: 1) Automated pipeline generating multiple-choice QA pairs from in-the-wild videos using vision foundation models to extract geometric/motion info (camera poses, point clouds, object masks, orientations, 3D trajectories). 2) Geometry Selection Module (GSM) that condenses question semantics and extracts question-relevant knowledge from 4D reconstruction priors into compact geometry tokens.

Result: Integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances dynamic spatial reasoning capability while maintaining accuracy on general video understanding benchmarks.

Conclusion: DSR Suite bridges the gap in dynamic spatial reasoning across dataset, benchmark, and model aspects, enabling VLMs to better understand 4D spatial relationships in videos through targeted geometric knowledge integration.

Abstract: Vision-language models (VLM) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolvement of object geometry and relationship in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, benchmark and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.

[112] FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models

Kaitong Cai, Jusheng Zhang, Jing Yang, Yijia Fan, Pengtao Xie, Jian Wang, Keze Wang

Main category: cs.CV

TL;DR: FlashVLM is a text-guided visual token selection framework that dynamically prunes redundant visual tokens based on query relevance, achieving beyond lossless compression while maintaining or slightly improving performance over unpruned baselines.

Details

Motivation: Current VLMs process hundreds/thousands of visual tokens per image/frame, causing quadratic attention cost and redundancy. Existing token reduction methods either ignore textual queries or rely on unstable deep attention maps that degrade semantic alignment under aggressive pruning.

Method: Computes explicit cross-modal similarity between projected image tokens and normalized text embeddings in language model space (instead of noisy attention weights). Fuses extrinsic relevance with intrinsic visual saliency using log domain weighting and temperature-controlled sharpening. Includes diversity-preserving partition to retain minimal representative background tokens for global context.

Result: Achieves beyond lossless compression - slightly surpasses unpruned baseline while pruning up to 77.8% of visual tokens on LLaVA 1.5. Maintains 92.8% accuracy even under 94.4% compression. State-of-the-art efficiency-performance trade-offs demonstrated across 14 image/video benchmarks with strong robustness and generalization across mainstream VLMs.

Conclusion: FlashVLM provides an effective text-guided visual token selection framework that dynamically adapts visual inputs to queries, achieving significant token reduction while maintaining or improving performance through explicit cross-modal relevance computation and balanced saliency fusion.

Abstract: Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on deep attention maps, whose instability under aggressive pruning leads to degraded semantic alignment. We propose FlashVLM, a text guided visual token selection framework that dynamically adapts visual inputs to the query. Instead of relying on noisy attention weights, FlashVLM computes an explicit cross modal similarity between projected image tokens and normalized text embeddings in the language model space. This extrinsic relevance is fused with intrinsic visual saliency using log domain weighting and temperature controlled sharpening. In addition, a diversity preserving partition retains a minimal yet representative set of background tokens to maintain global context. Under identical token budgets and evaluation protocols, FlashVLM achieves beyond lossless compression, slightly surpassing the unpruned baseline while pruning up to 77.8 percent of visual tokens on LLaVA 1.5, and maintaining 92.8 percent accuracy even under 94.4 percent compression. Extensive experiments on 14 image and video benchmarks demonstrate that FlashVLM delivers state of the art efficiency performance trade offs while maintaining strong robustness and generalization across mainstream VLMs.

[113] LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving

Long Nguyen, Micha Fauth, Bernhard Jaeger, Daniel Dauner, Maximilian Igl, Andreas Geiger, Kashyap Chitta

Main category: cs.CV

TL;DR: TransFuser v6 (TFv6) achieves SOTA on CARLA benchmarks by addressing expert-student misalignment in imitation learning, improving from 95 DS on Bench2Drive and doubling prior performance on Longest6~v2 and Town13.

Details

Motivation: Simulators generate unlimited driving data but imitation learning policies still struggle with robust closed-loop performance due to misalignment between privileged expert demonstrations (with higher visibility and lower uncertainty) and sensor-based student observations.

Method: Address asymmetries between expert and student by narrowing gaps in visibility, uncertainty, and navigational intent specification. Integrate perception supervision from dataset into shared sim-to-real pipeline.

Result: TFv6 achieves new SOTA on all major CARLA closed-loop benchmarks: 95 DS on Bench2Drive, more than doubles prior performance on Longest6~v2 and Town13. Shows consistent gains on NAVSIM and Waymo Vision-Based End-to-End benchmarks.

Conclusion: Careful modifications to address expert-student misalignment significantly improve imitation learning performance, enabling state-of-the-art results across multiple autonomous driving benchmarks.

Abstract: Simulators can generate virtually unlimited driving data, yet imitation learning policies in simulation still struggle to achieve robust closed-loop performance. Motivated by this gap, we empirically study how misalignment between privileged expert demonstrations and sensor-based student observations can limit the effectiveness of imitation learning. More precisely, experts have significantly higher visibility (e.g., ignoring occlusions) and far lower uncertainty (e.g., knowing other vehicles’ actions), making them difficult to imitate reliably. Furthermore, navigational intent (i.e., the route to follow) is under-specified in student models at test time via only a single target point. We demonstrate that these asymmetries can measurably limit driving performance in CARLA and offer practical interventions to address them. After careful modifications to narrow the gaps between expert and student, our TransFuser v6 (TFv6) student policy achieves a new state of the art on all major publicly available CARLA closed-loop benchmarks, reaching 95 DS on Bench2Drive and more than doubling prior performances on Longest6~v2 and Town13. Additionally, by integrating perception supervision from our dataset into a shared sim-to-real pipeline, we show consistent gains on the NAVSIM and Waymo Vision-Based End-to-End driving benchmarks. Our code, data, and models are publicly available at https://github.com/autonomousvision/lead.

[114] Repurposing Video Diffusion Transformers for Robust Point Tracking

Soowon Son, Honggyu An, Chaehyun Kim, Hyunah Ko, Jisu Nam, Dahyun Chung, Siyoon Jin, Jung Yi, Jaewon Min, Junhwa Hur, Seungryong Kim

Main category: cs.CV

TL;DR: DiTracker adapts video Diffusion Transformers (DiTs) for point tracking, achieving state-of-the-art performance by leveraging pre-trained spatio-temporal attention for better temporal coherence and handling of challenging conditions.

Details

Motivation: Existing point tracking methods use shallow convolutional backbones (like ResNet) that process frames independently, lacking temporal coherence and producing unreliable matching costs under challenging conditions like dynamic motions and occlusions.

Method: DiTracker adapts video Diffusion Transformers through: (1) query-key attention matching, (2) lightweight LoRA tuning, and (3) cost fusion with a ResNet backbone. It leverages pre-trained video DiTs’ spatio-temporal attention capabilities.

Result: Despite training with 8 times smaller batch size, DiTracker achieves state-of-the-art performance on challenging ITTO benchmark and matches or outperforms state-of-the-art models on TAP-Vid benchmarks.

Conclusion: Video Diffusion Transformers pre-trained on large-scale real-world videos exhibit strong inherent point tracking capability and can serve as an effective and efficient foundation for point tracking tasks.

Abstract: Point tracking aims to localize corresponding points across video frames, serving as a fundamental task for 4D reconstruction, robotics, and video editing. Existing methods commonly rely on shallow convolutional backbones such as ResNet that process frames independently, lacking temporal coherence and producing unreliable matching costs under challenging conditions. Through systematic analysis, we find that video Diffusion Transformers (DiTs), pre-trained on large-scale real-world videos with spatio-temporal attention, inherently exhibit strong point tracking capability and robustly handle dynamic motions and frequent occlusions. We propose DiTracker, which adapts video DiTs through: (1) query-key attention matching, (2) lightweight LoRA tuning, and (3) cost fusion with a ResNet backbone. Despite training with 8 times smaller batch size, DiTracker achieves state-of-the-art performance on challenging ITTO benchmark and matches or outperforms state-of-the-art models on TAP-Vid benchmarks. Our work validates video DiT features as an effective and efficient foundation for point tracking.

[115] FedPOD: the deployable units of training for federated learning

Daewoon Kim, Si Young Yie, Jae Sung Lee

Main category: cs.CV

TL;DR: FedPOD improves federated learning efficiency and communication costs by including outlier participants, eliminating dependency on previous rounds, and using validation loss calculation, achieving comparable performance to FedPIDAvg while being Kubernetes-compatible.

Details

Motivation: FedPIDAvg has limitations: it excludes outlier participants based on Poisson distribution (limiting data utilization) and requires maintaining the same participants throughout training due to PID controller's dependency on previous rounds' learning information.

Method: FedPOD addresses FedPIDAvg’s limitations by: 1) including participants excluded as outliers, 2) eliminating dependency on previous rounds’ learning information, 3) applying validation loss calculation at each round, and 4) being designed as Kubernetes-compatible POD units for flexible auto-scaling.

Result: FedPOD achieves comparable performance to FedPIDAvg with Dice scores of 0.78, 0.71, and 0.72 for WT, ET, and TC respectively (average), and projected convergence score of 0.74 average. It maintains efficiency while improving flexibility.

Conclusion: FedPOD demonstrates potential to enhance federated learning by improving efficiency, flexibility, and performance metrics while being compatible with Kubernetes auto-scaling for practical deployment.

Abstract: This paper proposes FedPOD (Proportionally Orchestrated Derivative) for optimizing learning efficiency and communication cost in federated learning among multiple clients. Inspired by FedPIDAvg, we define a round-wise task for FedPOD to enhance training efficiency. FedPIDAvg achieved performance improvement by incorporating the training loss reduction for prediction entropy as weights using differential terms. Furthermore, by modeling data distribution with a Poisson distribution and using a PID controller, it reduced communication costs even in skewed data distribution. However, excluding participants classified as outliers based on the Poisson distribution can limit data utilization. Additionally, PID controller requires the same participants to be maintained throughout the federated learning process as it uses previous rounds’ learning information in the current round. In our approach, FedPOD addresses these issues by including participants excluded as outliers, eliminating dependency on previous rounds’ learning information, and applying a method for calculating validation loss at each round. In this challenge, FedPOD presents comparable performance to FedPIDAvg in metrics of Dice score, 0.78, 0.71 and 0.72 for WT, ET and TC in average, and projected convergence score, 0.74 in average. Furthermore, the concept of FedPOD draws inspiration from Kubernetes’ smallest computing unit, POD, designed to be compatible with Kubernetes auto-scaling. Extending round-wise tasks of FedPOD to POD units allows flexible design by applying scale-out similar to Kubernetes’ auto-scaling. This work demonstrated the potentials of FedPOD to enhance federated learning by improving efficiency, flexibility, and performance in metrics.

[116] Active Intelligence in Video Avatars via Closed-loop World Modeling

Xuanhua He, Tianyu Yang, Ke Cao, Ruiqi Wu, Cheng Meng, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Qifeng Chen

Main category: cs.CV

TL;DR: ORCA framework enables video avatars to autonomously pursue long-term goals through adaptive environmental interaction, moving beyond passive animation to active intelligence.

Details

Motivation: Current video avatar generation methods lack genuine agency - they cannot autonomously pursue long-term goals through adaptive environmental interaction, being limited to identity preservation and motion alignment.

Method: ORCA framework with Internal World Model capabilities: 1) closed-loop OTAR cycle (Observe-Think-Act-Reflect) for robust state tracking under generative uncertainty, 2) hierarchical dual-system architecture (System 2 for strategic reasoning, System 1 for precise action caption generation), formulated as POMDP with continuous belief updating and outcome verification.

Result: ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, enabling autonomous multi-step task completion in open-domain scenarios.

Conclusion: ORCA’s IWM-inspired design advances video avatar intelligence from passive animation to active, goal-oriented behavior, validating the approach for creating genuinely intelligent video avatars.

Abstract: Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency, they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.

[117] SpatialTree: How Spatial Abilities Branch Out in MLLMs

Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang

Main category: cs.CV

TL;DR: SpatialTree introduces a cognitive-science-inspired 4-level hierarchy for spatial abilities in MLLMs, creates a hierarchical benchmark, reveals skill correlations and transfer dynamics, and proposes auto-think strategy for consistent improvement across levels.

Details

Motivation: Current multimodal LLM research lacks understanding of spatial ability hierarchy, with most studies focusing on narrow tasks. There's a need for a systematic framework to understand and scale spatial abilities in MLLMs based on cognitive science principles.

Method: Developed SpatialTree - a 4-level cognitive hierarchy (L1: perception, L2: mental mapping, L3: simulation, L4: agentic competence). Created capability-centric hierarchical benchmark with 27 sub-abilities. Evaluated mainstream MLLMs, conducted targeted supervised fine-tuning to study transfer dynamics, and proposed auto-think strategy to optimize reinforcement learning across levels.

Result: Evaluation revealed clear structure: L1 skills are orthogonal while higher-level skills are strongly correlated. Found negative transfer within L1 but strong cross-level transfer from low to high abilities with synergy. Naive RL helps complex reasoning but hurts intuitive perception; auto-think strategy enables consistent improvement across all levels.

Conclusion: SpatialTree provides a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs, demonstrating hierarchical organization, transfer dynamics, and effective optimization strategies for improving spatial reasoning across cognitive levels.

Abstract: Cognitive science suggests that spatial ability develops progressively-from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic-negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive “thinking” is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.

[118] SemanticGen: Video Generation in Semantic Space

Jianhong Bai, Xiaoshi Wu, Xintao Wang, Fu Xiao, Yuanxing Zhang, Qinghe Wang, Xiaoyu Shi, Menghan Xia, Zuozhu Liu, Haoji Hu, Pengfei Wan, Kun Gai

Main category: cs.CV

TL;DR: SemanticGen generates videos in semantic space first for global planning, then adds details, making it faster and more efficient than direct VAE latent generation.

Details

Motivation: Current video generative models using VAE latent space suffer from slow convergence and computational inefficiency, especially for long videos.

Method: Two-stage diffusion process: 1) Generate compact semantic video features for global layout, 2) Generate VAE latents conditioned on semantic features for final output.

Result: Faster convergence than VAE latent space generation, computationally efficient for long videos, and outperforms state-of-the-art approaches.

Conclusion: Semantic space generation enables efficient high-quality video production with better convergence and scalability for long videos.

Abstract: State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.

[119] TropNNC: Structured Neural Network Compression Using Tropical Geometry

Konstantinos Fotopoulos, Petros Maragos, Panagiotis Misiakos

Main category: cs.CV

TL;DR: TropNNC is a neural network compression framework using tropical geometry to represent networks as tropical rational functions, enabling weight-only compression without training data.

Details

Motivation: Current neural network compression methods often require training data or have suboptimal theoretical bounds. There's a need for data-free compression with strong theoretical guarantees.

Method: Represents network output as tropical rational function, uses tropical geometry to compress by reducing corresponding tropical polynomials, adaptively selects weights of retained neurons, and extends to convolutional layers.

Result: Achieves competitive performance on MNIST, CIFAR, and ImageNet, matching strong baselines like ThiNet and CUP. Provides tightest known theoretical compression bound and first application to convolutional layers.

Conclusion: TropNNC demonstrates effective neural network compression using tropical geometry without requiring training data, with strong theoretical foundations and practical performance across multiple datasets.

Abstract: We present TropNNC, a framework for compressing neural networks with linear and convolutional layers and ReLU activations using tropical geometry. By representing a network’s output as a tropical rational function, TropNNC enables structured compression via reduction of the corresponding tropical polynomials. Our method refines the geometric approximation of previous work by adaptively selecting the weights of retained neurons. Key contributions include the first application of tropical geometry to convolutional layers and the tightest known theoretical compression bound. TropNNC requires only access to network weights - no training data - and achieves competitive performance on MNIST, CIFAR, and ImageNet, matching strong baselines such as ThiNet and CUP.

Fabio Bellavia, Zhenjun Zhao, Luca Morelli, Fabio Remondino

Main category: cs.CV

TL;DR: A non-deep learning method for filtering and refining sparse image correspondences using local homography transformations and planar clustering, with optional cross-correlation refinement and robustness against planar assumption violations.

Details

Motivation: To develop a practical, geometry-based alternative to deep learning methods for image matching that works in real-world scenarios where camera intrinsics are often unavailable, and to handle outlier correspondences effectively.

Method: Uses local homography approximations for motion flow, clusters matches into virtual planes via iterative RANSAC, optionally refines keypoints through cross-correlation template matching after patch reprojection, and introduces intermediate homographies to minimize patch distortion when planar assumptions are violated.

Result: The method demonstrates effectiveness in outlier presence, validates cross-correlation refinement for corner-like keypoints, and shows competitive performance on standard datasets, particularly in practical cases without known camera intrinsics.

Conclusion: Geometry-based non-deep learning approaches still have significant development potential for practical image matching and could be incorporated into future deep learning architectures, offering a robust alternative especially when camera parameters are unknown.

Abstract: This paper introduces a modular, non-deep learning method for filtering and refining sparse correspondences in image matching. Assuming that motion flow within the scene can be approximated by local homography transformations, matches are aggregated into overlapping clusters corresponding to virtual planes using an iterative RANSAC-based approach discarding incompatible correspondences. Moreover, the underlying planar structural design provides an explicit map between local patches associated with the matches, by which optionally refine the keypoint positions through cross-correlation template matching after the patch reprojection. Finally, to enhance robustness and fault-tolerance against violations of the piece-wise planar approximation assumption, a further strategy is designed in order to minimize the relative patch distortion in the plane reprojection by introducing an intermediate homography that projects both patches into a common plane. The proposed method is extensively evaluated on standard datasets and image matching pipelines, and compared with state-of-the-art approaches. Unlike other current comparisons, the proposed benchmark also takes into account the more general, real, and practical cases where camera intrinsics are unavailable. Experimental results demonstrate that our proposed non-deep learning, geometry-based filter is effective in presence of outliers and the optional cross-correlation refinement step is valid in the case of corner-like keypoints. Finally, this study suggests that there is still significant development potential in practical image matching solutions in the considered research direction, which could be in the future incorporated in novel deep image matching architectures.

[121] Compression for Better: A General and Stable Lossless Compression Framework

Boyang Zhang, Daning Cheng, Yunquan Zhang, Fangming Liu, Wenguang Chen

Main category: cs.CV

TL;DR: Proposes LLC framework for lossless model compression by defining error boundaries where compression doesn’t degrade performance, applied to quantization and decomposition techniques.

Details

Motivation: Current model compression lacks systematic approaches to determine error boundaries for lossless compression, leading to performance degradation from compression errors. Need to understand how compression errors affect model performance and define boundaries for lossless compression.

Method: Proposes LossLess Compression (LLC) theoretical framework that uses total differential to delineate compression neighborhood and higher-order analysis boundaries. For quantization: reformulates quantization search as grouped knapsack problem within lossless neighborhood. For decomposition: addresses approximation under low-rank constraints with automatic rank determination per layer.

Result: Extensive experiments on multiple neural network architectures across different datasets show LLC can effectively achieve lossless model compression without fancy tricks. Successfully applies various compression techniques including quantization and decomposition while maintaining performance.

Conclusion: LLC provides a general theoretical framework for lossless compression by defining error boundaries, enabling compression techniques to operate within performance-preserving regions. The approach works across different compression methods and neural architectures without requiring complex tricks.

Abstract: This work focus on how to stabilize and lossless model compression, aiming to reduce model complexity and enhance efficiency without sacrificing performance due to compression errors. A key challenge is effectively leveraging compression errors and defining the boundaries for lossless compression to minimize model loss. i.e., compression for better. Currently, there is no systematic approach to determining this error boundary or understanding its specific impact on model performance. We propose a general \textbf{L}oss\textbf{L}ess \textbf{C}ompression theoretical framework (\textbf{LLC}), which further delineates the compression neighborhood and higher-order analysis boundaries through the total differential, thereby specifying the error range within which a model can be compressed without loss. To verify the effectiveness of LLC, we apply various compression techniques, including quantization and decomposition. Specifically, for quantization, we reformulate the classic quantization search problem as a grouped knapsack problem within the lossless neighborhood, achieving lossless quantization while improving computational efficiency. For decomposition, LLC addresses the approximation problem under low-rank constraints, automatically determining the rank for each layer and producing lossless low-rank models. We conduct extensive experiments on multiple neural network architectures on different datasets. The results show that without fancy tricks, LLC can effectively achieve lossless model compression. Our code will be made publicly.

[122] GenVidBench: A 6-Million Benchmark for AI-Generated Video Detection

Zhenliang Ni, Qiangyu Yan, Mouxiao Huang, Tianning Yuan, Yehui Tang, Hailin Hu, Xinghao Chen, Yunhe Wang

Main category: cs.CV

TL;DR: GenVidBench is a large-scale dataset for AI-generated video detection containing 6.78 million videos from 11 state-of-the-art video generators, designed to address the lack of quality datasets in this field.

Details

Motivation: The rapid advancement of video generation models makes it increasingly difficult to distinguish AI-generated videos from real ones, creating an urgent need for effective detectors to prevent false information dissemination. Current detector development is impeded by the lack of large-scale, high-quality datasets specifically designed for generative video detection.

Method: Created GenVidBench dataset with three key design principles: 1) Large-scale collection of 6.78 million videos, 2) Cross-source and cross-generator design to reduce content interference and ensure diversity between training/test sets, 3) Inclusion of videos from 11 state-of-the-art AI video generators to cover latest advancements.

Result: GenVidBench is currently the largest dataset for AI-generated video detection, providing researchers with a comprehensive benchmark for developing and evaluating detection models. The dataset enables extensive experimental evaluation with advanced video classification models.

Conclusion: GenVidBench addresses the critical dataset gap in AI-generated video detection, providing a challenging benchmark that will facilitate the development of generalized and effective detection models to combat misinformation from AI-generated videos.

Abstract: The rapid advancement of video generation models has made it increasingly challenging to distinguish AI-generated videos from real ones. This issue underscores the urgent need for effective AI-generated video detectors to prevent the dissemination of false information via such videos. However, the development of high-performance AI-generated video detectors is currently impeded by the lack of large-scale, high-quality datasets specifically designed for generative video detection. To this end, we introduce GenVidBench, a challenging AI-generated video detection dataset with several key advantages: 1) Large-scale video collection: The dataset contains 6.78 million videos and is currently the largest dataset for AI-generated video detection. 2) Cross-Source and Cross-Generator: The cross-source generation reduces the interference of video content on the detection. The cross-generator ensures diversity in video attributes between the training and test sets, preventing them from being overly similar. 3) State-of-the-Art Video Generators: The dataset includes videos from 11 state-of-the-art AI video generators, ensuring that it covers the latest advancements in the field of video generation. These generators ensure that the datasets are not only large in scale but also diverse, aiding in the development of generalized and effective detection models. Additionally, we present extensive experimental results with advanced video classification models. With GenVidBench, researchers can efficiently develop and evaluate AI-generated video detection models.. Datasets and code are available at https://genvidbench.github.io.

[123] Regressor-Guided Generative Image Editing Balances User Emotions to Reduce Time Spent Online

Christoph Gebhardt, Robin Willardt, Seyedmorteza Sadat, Chih-Wei Ning, Andreas Brombach, Jie Song, Otmar Hilliges, Christian Holz

Main category: cs.CV

TL;DR: Diffusion-based image editing that regulates emotional impact can reduce social media usage without restrictive controls.

Details

Motivation: Existing interventions like time limits or grayscaling are restrictive, provoke psychological reactance, and get circumvented. Emotional responses mediate content consumption and online engagement, suggesting emotional regulation could reduce usage non-coercively.

Method: Three regressor-guided image-editing approaches: (1) global optimization of emotion-related image attributes, (2) optimization in style latent space, (3) diffusion-based method using classifier and classifier-free guidance. First two modify low-level features; diffusion enables higher-level changes like adjusting clothing or facial features.

Result: Diffusion-based edits balance emotional responses and are associated with lower usage duration while preserving visual quality, as shown in controlled image-rating study and social media experiment.

Conclusion: Emotion-regulating image editing, particularly diffusion-based approaches, offers a non-coercive alternative to restrictive interventions for reducing internet overuse by modifying content’s emotional impact rather than imposing external controls.

Abstract: Internet overuse is a widespread phenomenon in today’s digital society. Existing interventions, such as time limits or grayscaling, often rely on restrictive controls that provoke psychological reactance and are frequently circumvented. Building on prior work showing that emotional responses mediate the relationship between content consumption and online engagement, we investigate whether regulating the emotional impact of images can reduce online use in a non-coercive manner. We introduce and systematically analyze three regressor-guided image-editing approaches: (i) global optimization of emotion-related image attributes, (ii) optimization in a style latent space, and (iii) a diffusion-based method using classifier and classifier-free guidance. While the first two approaches modify low-level visual features (e.g., contrast, color), the diffusion-based method enables higher-level changes (e.g., adjusting clothing, facial features). Results from a controlled image-rating study and a social media experiment show that diffusion-based edits balance emotional responses and are associated with lower usage duration while preserving visual quality.

[124] SPECIAL: Zero-shot Hyperspectral Image Classification With CLIP

Li Pang, Jing Yao, Kaiyu Li, Jun Zhou, Deyu Meng, Xiangyong Cao

Main category: cs.CV

TL;DR: SPECIAL is a zero-shot HSI classification framework using CLIP for pseudo-label generation and noisy label learning, eliminating need for manual annotations.

Details

Motivation: Deep learning HSI classification methods require manually labeled data which is time-consuming and labor-intensive to obtain. The paper aims to eliminate the need for manual annotations through zero-shot learning.

Method: Two-stage framework: (1) CLIP-based pseudo-label generation where HSI is spectrally interpolated to RGB bands, classified using CLIP with multi-scale fusion for confidence scores; (2) Noisy label learning incorporating spectral information and label refinement to mitigate label noise.

Result: Experimental results on three benchmark datasets show SPECIAL outperforms existing methods in zero-shot HSI classification.

Conclusion: SPECIAL demonstrates potential for practical applications by eliminating manual annotation requirements while achieving superior zero-shot classification performance.

Abstract: Hyperspectral image (HSI) classification aims to categorize each pixel in an HSI into a specific land cover class, which is crucial for applications such as remote sensing, environmental monitoring, and agriculture. Although deep learning-based HSI classification methods have achieved significant advancements, existing methods still rely on manually labeled data for training, which is both time-consuming and labor-intensive. To address this limitation, we introduce a novel zero-shot hyperspectral image classification framework based on CLIP (SPECIAL), aiming to eliminate the need for manual annotations. The SPECIAL framework consists of two main stages: (1) CLIP-based pseudo-label generation, and (2) noisy label learning. In the first stage, HSI is spectrally interpolated to produce RGB bands. These bands are subsequently classified using CLIP, resulting in noisy pseudo-labels that are accompanied by confidence scores. To improve the quality of these labels, we propose a scaling strategy that fuses predictions from multiple spatial scales. In the second stage, spectral information and a label refinement technique are incorporated to mitigate label noise and further enhance classification accuracy. Experimental results on three benchmark datasets demonstrate that our SPECIAL outperforms existing methods in zero-shot HSI classification, showing its potential for more practical applications. The code is available at https://github.com/LiPang/SPECIAL.

Yaohua Liu, Xinyuan Song, Yunfu Deng, Yifan Xie, Binkai Ou, Yan Zhong

Main category: cs.CV

TL;DR: OIKG: A fine-grained instruction-guided graph reasoning framework for VLN that disentangles visual/directional cues and extracts navigation-critical semantics to improve spatial reasoning and cross-modal alignment.

Details

Motivation: Existing VLN methods encode visual and directional cues in a coupled manner and process instructions without explicitly extracting navigation-critical semantics, leading to imprecise spatial reasoning and suboptimal cross-modal alignment.

Method: 1) Observation-graph interaction mechanism to disentangle angular and visual cues while strengthening directed edge representations through geometric embedding. 2) Fine-grained instruction guidance module to explicitly extract location-specific and object-centric information from language instructions.

Result: Achieves state-of-the-art performance on R2R and RxR benchmarks across multiple evaluation metrics, demonstrating effectiveness of fine-grained instruction-guided graph reasoning.

Conclusion: The proposed OIKG framework significantly improves VLN agents’ ability to follow complex navigation instructions by integrating structured graph reasoning with instruction-critical semantic cues.

Abstract: Vision-and-Language Navigation (VLN) requires an embodied agent to traverse complex environments by following natural language instructions, demanding accurate alignment between visual observations and linguistic guidance. Despite recent progress, existing methods typically encode visual and directional cues in a coupled manner, and process instructions without explicitly extracting navigation-critical semantics, which often leads to imprecise spatial reasoning and suboptimal cross-modal alignment. To address these challenges, we propose a fine-grained instruction-guided graph reasoning framework (OIKG) that enhances both spatial representation and instruction understanding during navigation. Specifically, an observation-graph interaction mechanism is introduced to disentangle angular and visual cues while strengthening directed edge representations through geometric embedding, enabling more reliable spatial reasoning within the navigation graph. In addition, a fine-grained instruction guidance module is designed to explicitly extract and leverage location-specific and object-centric information from language instructions, facilitating more precise cross-modal alignment between linguistic semantics and navigable trajectories. By jointly integrating structured graph reasoning with instruction-critical semantic cues, the proposed approach significantly improves the agent’s ability to follow complex navigation instructions. Extensive experiments on the R2R and RxR benchmarks demonstrate that our method consistently achieves state-of-the-art performance across multiple evaluation metrics, validating the effectiveness of fine-grained instruction-guided graph reasoning for vision-and-language navigation.

[126] I Want It That Way! Specifying Nuanced Camera Motions in Video Editing

Pooja Guhan, Divya Kothandaraman, Geonsun Lee, Tsung-Wei Huang, Guan-Ming Su, Dinesh Manocha

Main category: cs.CV

TL;DR: Zero-shot diffusion system for personalized camera motion transfer from reference video to static image using multi-concept learning and homography refinement.

Details

Motivation: Address the "expressive gap" where generic text prompts fail to capture nuanced cinematic camera motion, making it difficult for non-expert creators to achieve their vision.

Method: Two-phase diffusion-based approach: 1) Multi-concept learning with LoRA layers and orthogonality loss to separate spatial-temporal characteristics from scene features, 2) Homography-based refinement for temporal and spatial alignment.

Result: Significantly preferred over prior work (90.45% for motion accuracy, 70.31% for scene preservation). Interface improves usability and creative control for video direction.

Conclusion: Provides robust technical solution and human-centered design that expands cinematic video editing accessibility for diverse users without requiring 3D data or complex interfaces.

Abstract: Specifying nuanced and compelling camera motion remains a major hurdle for non-expert creators using generative tools, creating an ``expressive gap" where generic text prompts fail to capture cinematic vision. To address this, we present a novel zero-shot diffusion-based system that enables personalized camera motion transfer from a single reference video onto a user-provided static image. Our technical contribution introduces an intuitive interaction paradigm that bypasses the need for 3D data, predefined trajectories, or complex graphical interfaces. The core pipeline leverages a text-to-video diffusion model, employing a two-phase strategy: 1) a multi-concept learning method using LoRA layers and an orthogonality loss to distinctly capture spatial-temporal characteristics and scene features, and 2) a homography-based refinement strategy to enhance temporal and spatial alignment of the generated video. Extensive evaluation demonstrates the efficacy of our method. In a comparative study with 72 participants, our system was significantly preferred over prior work for both motion accuracy (90.45%) and scene preservation (70.31%). A second study confirmed our interface significantly improves usability and creative control for video direction. Our work contributes a robust technical solution and a novel human-centered design, significantly expanding cinematic video editing for diverse users.

[127] VibrantLeaves: A principled parametric image generator for training deep restoration models

Raphael Achddou, Yann Gousseau, Saïd Ladjal, Sabine Süsstrunk

Main category: cs.CV

TL;DR: A synthetic image generator based on Dead Leaves model creates training sets for image restoration networks, achieving near-natural dataset performance with better robustness and explainability.

Details

Motivation: Deep Neural Networks for image restoration have limitations: they're poorly understood, suffer from training set biases, and lack explainability. Synthetic training sets offer better control and understanding.

Method: Proposes a synthetic image generator using geometric modeling, textures, and simple image acquisition modeling integrated into a classical Dead Leaves model to create efficient training datasets.

Result: Networks trained on synthetic datasets achieve performance almost on par with natural image datasets, provide better robustness to geometric/radiometric perturbations, and enable analysis of which image properties are necessary for good performance.

Conclusion: Synthetic training sets using principled modeling can effectively train image restoration networks while offering better control, explainability, and robustness compared to natural image datasets.

Abstract: Even though Deep Neural Networks are extremely powerful for image restoration tasks, they have several limitations. They are poorly understood and suffer from strong biases inherited from the training sets. One way to address these shortcomings is to have a better control over the training sets, in particular by using synthetic sets. In this paper, we propose a synthetic image generator relying on a few simple principles. In particular, we focus on geometric modeling, textures, and a simple modeling of image acquisition. These properties, integrated in a classical Dead Leaves model, enable the creation of efficient training sets. Standard image denoising and super-resolution networks can be trained on such datasets, reaching performance almost on par with training on natural image datasets. As a first step towards explainability, we provide a careful analysis of the considered principles, identifying which image properties are necessary to obtain good performances. Besides, such training also yields better robustness to various geometric and radiometric perturbations of the test sets.

[128] FiGO: Fine-Grained Object Counting without Annotations

Adriano D’Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh

Main category: cs.CV

TL;DR: FiGO enables fine-grained class-agnostic counting using only category names, outperforming existing open-vocabulary methods on visually similar subcategories.

Details

Motivation: Current open-vocabulary counting methods work well for broad categories but fail at fine-grained distinctions (e.g., specific waterfowl species or pepper cultivars), limiting practical applications where precise category counting is needed.

Method: FiGO adapts existing counting models using only category names by: 1) generating synthetic examples with text-to-image diffusion models, 2) learning compact concept embeddings with joint positive/hard-negative loss, and 3) using a specialization module to convert outputs from any frozen counter into fine-grained estimates.

Result: The method substantially outperforms strong open-vocabulary baselines on the new LOOKALIKES dataset (37 subcategories across 14 parent categories with visually similar objects), enabling precise counting like “count only the habaneros” instead of just “count all the peppers.”

Conclusion: FiGO advances class-agnostic counting from broad category recognition to fine-grained distinction capability, making counting systems more practical for real-world applications requiring precise subcategory identification.

Abstract: Class-agnostic counting (CAC) methods reduce annotation costs by letting users define what to count at test-time through text or visual exemplars. However, current open-vocabulary approaches work well for broad categories but fail when fine-grained category distinctions are needed, such as telling apart waterfowl species or pepper cultivars. We present FiGO, a new annotation-free method that adapts existing counting models to fine-grained categories using only the category name. Our approach uses a text-to-image diffusion model to create synthetic examples and a joint positive/hard-negative loss to learn a compact concept embedding that conditions a specialization module to convert outputs from any frozen counter into accurate, fine-grained estimates. To evaluate fine-grained counting, we introduce LOOKALIKES, a dataset of 37 subcategories across 14 parent categories with many visually similar objects per image. Our method substantially outperforms strong open-vocabulary baselines, moving counting systems from “count all the peppers” to “count only the habaneros.”

[129] COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Esin Tureci, Olga Russakovsky

Main category: cs.CV

TL;DR: COMPACT introduces a data curation method for visual instruction tuning that combines multiple atomic visual capabilities in single training examples, achieving 100.2% of full dataset performance with only 10% of the data.

Details

Motivation: Current visual instruction tuning datasets are constructed from randomly sampled image-question pairs without considering informativeness, leading to inefficient training. Recent methods show that informative samples can enable efficient finetuning, but sample complexity impact on data curation hasn't been explored.

Method: COMPACT scales training sample complexity by synthesizing rich text questions that combine multiple atomic visual capabilities in single training examples. This allows significant reduction in training examples while maintaining informativeness through compositional atomic-to-complex visual capability tuning.

Result: COMPACT reduces LLAVA-665K dataset by 90% while achieving 100.2% of full VIT performance (vs. 97.5% by SOTA). It particularly excels on complex benchmarks: MM-Vet (+8.6%) and MMStar (+2.9%) improvements over full-scale training.

Conclusion: COMPACT provides a scalable, efficient synthetic data generation recipe that demonstrates superior data efficiency and even outperforms full-scale training on complex visual language tasks through compositional capability combination.

Abstract: Visual instruction tuning (VIT) datasets are constructed from randomly sampled image-question pairs, without regard to the informativeness of each pair. Recent dataset selection methods have shown that a small fraction of such datasets enriched with informative samples can lead to efficient finetuning of Multimodal Large Language Models. In this work, we explore the impact of sample complexity on informative data curation and introduce COMPACT (COMPositional Atomic-to-complex Visual Capability Tuning), a VIT data recipe that scales training sample complexity by combining multiple atomic visual capabilities in a single training example. Concretely, we synthesize rich and informative text questions for each image, allowing us to significantly reduce the number of training examples required for effective visual instruction tuning. COMPACT demonstrates superior data efficiency compared to existing data reduction methods. When applied to the LLAVA-665K VIT dataset, COMPACT reduces the data budget by 90% while still achieving 100.2% of the full VIT performance (compared to only 97.5% by the state-of-the-art method) across eight multimodal benchmarks. Further, training on the COMPACT data outperforms training on the full-scale data on particularly complex benchmarks such as MM-Vet (+8.6%) and MMStar (+2.9%). COMPACT offers a scalable and efficient synthetic data generation recipe to improve on visual language tasks.

[130] Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models

Kuofeng Gao, Yufei Zhu, Yiming Li, Jiawang Bai, Yong Yang, Zhifeng Li, Shu-Tao Xia

Main category: cs.CV

TL;DR: CEAT2I is the first copyright evasion attack that bypasses dataset ownership verification in text-to-image diffusion models by detecting watermarked samples, identifying trigger tokens, and erasing injected watermarks while preserving model performance.

Details

Motivation: Dataset ownership verification (DOV) uses backdoor watermarks to protect fine-tuning datasets, but its robustness against copyright evasion attacks remains unexplored. The paper aims to investigate how adversaries can circumvent these verification mechanisms.

Method: CEAT2I has three stages: 1) Reliably detect watermarked samples by observing that T2I models converge faster on watermarked samples in intermediate features rather than training loss; 2) Iteratively ablate tokens from prompts and monitor feature shifts to identify trigger tokens; 3) Apply closed-form concept erasure to remove injected watermarks.

Result: Extensive experiments show CEAT2I effectively evades state-of-the-art DOV mechanisms (including TPD and T2IShield) while preserving model performance. The attack works even when watermarks are embedded as local image patches where T2IShield fails.

Conclusion: CEAT2I demonstrates vulnerabilities in current dataset ownership verification methods for T2I diffusion models, highlighting the need for more robust watermarking techniques. The attack successfully bypasses ownership verification while maintaining model utility.

Abstract: Text-to-image (T2I) diffusion models enable high-quality image generation conditioned on textual prompts. However, fine-tuning these pre-trained models for personalization raises concerns about unauthorized dataset usage. To address this issue, dataset ownership verification (DOV) has recently been proposed, which embeds watermarks into fine-tuning datasets via backdoor techniques. These watermarks remain dormant on benign samples but produce owner-specified outputs when triggered. Despite its promise, the robustness of DOV against copyright evasion attacks (CEA) remains unexplored. In this paper, we investigate how adversaries can circumvent these mechanisms, enabling models trained on watermarked datasets to bypass ownership verification. We begin by analyzing the limitations of potential attacks achieved by backdoor removal, including TPD and T2IShield. In practice, TPD suffers from inconsistent effectiveness due to randomness, while T2IShield fails when watermarks are embedded as local image patches. To this end, we introduce CEAT2I, the first CEA specifically targeting DOV in T2I diffusion models. CEAT2I consists of three stages: (1) motivated by the observation that T2I models converge faster on watermarked samples with respect to intermediate features rather than training loss, we reliably detect watermarked samples; (2) we iteratively ablate tokens from the prompts of detected samples and monitor feature shifts to identify trigger tokens; and (3) we apply a closed-form concept erasure method to remove the injected watermarks. Extensive experiments demonstrate that CEAT2I effectively evades state-of-the-art DOV mechanisms while preserving model performance. The code is available at https://github.com/csyufei/CEAT2I.

[131] Learning Informative Attention Weights for Person Re-Identification

Yancheng Wang, Nebojsa Jojic, Yingzhen Yang

Main category: cs.CV

TL;DR: RIB method improves person Re-ID by using Information Bottleneck principle to ensure attention weights are informative for identity prediction, reducing noisy information.

Details

Motivation: Existing attention modules (self-attention, channel attention) don't explicitly ensure attention weights are informative for identity prediction, potentially introducing noisy information from input images.

Method: Proposes Reduction of Information Bottleneck loss (RIB) with novel variational upper bound for IB loss. Two implementations: RIB-DCS with Differentiable Channel Selection Attention module, and RIB-CA applied to existing channel attention modules. Applied to both fixed backbones and learnable backbones with Differentiable Neural Architecture Search.

Result: Extensive experiments on multiple person Re-ID benchmarks show RIB significantly enhances prediction accuracy, even for occluded person Re-ID.

Conclusion: RIB effectively addresses the limitation of existing attention methods by ensuring attention weights are informative for identity prediction, leading to improved performance in person Re-ID tasks.

Abstract: Attention mechanisms have been widely used in deep learning, and recent efforts have been devoted to incorporating attention modules into deep neural networks (DNNs) for person Re-Identification (Re-ID) to enhance their discriminative feature learning capabilities. Existing attention modules, including self-attention and channel attention, learn attention weights that quantify the importance of feature tokens or feature channels. However, existing attention methods do not explicitly ensure that the attention weights are informative for predicting the identity of the person in the input image, and may consequently introduce noisy information from the input image. To address this issue, we propose a novel method termed Reduction of Information Bottleneck loss (RIB), motivated by the principle of the Information Bottleneck (IB). A novel distribution-free and efficient variational upper bound for the IB loss (IBB), which can be optimized by standard SGD, is derived and incorporated into the training loss of the RIB models. RIB is applied to DNNs with self-attention modules through a novel Differentiable Channel Selection Attention module, or DCS-Attention, that selects the most informative channels for computing attention weights, leading to competitive models termed RIB-DCS. RIB is also incorporated into DNNs with existing channel attention modules to promote the learning of informative channel attention weights, leading to models termed RIB-CA. Both RIB-DCS and RIB-CA are applied to fixed neural network backbones and learnable backbones with Differentiable Neural Architecture Search (DNAS). Extensive experiments on multiple person Re-ID benchmarks show that RIB significantly enhances the prediction accuracy of DNNs for person Re-ID, even for the occluded person Re-ID.

[132] Binarization-Aware Adjuster for Discrete Decision Learning with an Application to Edge Detection

Hao Shu

Main category: cs.CV

TL;DR: Proposes Binarization-Aware Adjuster (BAA) framework to align continuous training with discrete inference in binary decision tasks by embedding binarization characteristics into optimization via Distance Weight Function.

Details

Motivation: Addresses fundamental misalignment between continuous-valued training optimization and discrete decision evaluation in machine learning, where the discontinuity of discretization operations prevents direct incorporation of decision behavior into gradient-based optimization.

Method: Introduces Binarization-Aware Adjuster (BAA) framework built on Distance Weight Function (DWF) that modulates loss contributions based on prediction correctness and proximity to decision threshold, aligning optimization emphasis with decision-critical regions while maintaining compatibility with standard learning pipelines.

Result: Experimental results on edge detection task (representative binary decision problem) with representative models and datasets show consistent performance improvements when incorporating BAA into optimization, demonstrating its effectiveness.

Conclusion: Establishes a principled approach for aligning continuous optimization with discrete decision behavior, with effectiveness demonstrated in concrete application setting of edge detection, providing a theoretically grounded solution to the training-inference misalignment problem.

Abstract: Discrete decision tasks in machine learning exhibit a fundamental misalignment between training and inference: models are optimized with continuous-valued outputs but evaluated using discrete predictions. This misalignment arises from the discontinuity of discretization operations, which prevents decision behavior from being directly incorporated into gradient-based optimization. To address this issue, we propose a theoretically grounded framework termed the Binarization-Aware Adjuster (BAA), which embeds binarization characteristics into continuous optimization. The framework is built upon the Distance Weight Function (DWF), which modulates loss contributions according to prediction correctness and proximity to the decision threshold, thereby aligning optimization emphasis with decision-critical regions while remaining compatible with standard learning pipelines. We apply the proposed BAA framework to the edge detection (ED) task, a representative binary decision problem. Experimental results on representative models and datasets show that incorporating BAA into optimization leads to consistent performance improvements, supporting its effectiveness. Overall, this work establishes a principled approach for aligning continuous optimization with discrete decision behavior, with its effectiveness demonstrated in a concrete application setting.

[133] Portable Biomechanics Laboratory: Clinically Accessible Movement Analysis from a Handheld Smartphone

J. D. Peiffer, Kunal Shah, Irina Djuraskovic, Shawana Anarwala, Kayan Abdou, Rujvee Patel, Prakash Jayabalan, Brenton Pennicooke, R. James Cotton

Main category: cs.CV

TL;DR: Portable Biomechanics Laboratory (PBL) enables accurate biomechanical assessment using handheld smartphone video, validated across diverse patient populations with clinical applications.

Details

Motivation: Movement assessment is crucial for neurological and musculoskeletal health, but objective biomechanical measurement is rarely available in routine clinical care due to cost and complexity of traditional motion capture systems.

Method: PBL is a secure platform that fits biomechanical models to video collected with a handheld, moving smartphone. Validated on over 15 hours of data synchronized with ground truth motion capture across diverse populations including neurological-injury patients, prosthesis users, pediatric patients, and controls.

Result: Mean joint-angle errors < 3° and pelvis-translation errors of a few centimeters. In prospective clinical deployments (>5 hours), PBL was easy to setup, yielded highly reliable gait metrics (ICC > 0.9), detected clinically relevant differences, and correlated with mJOA scores for cervical-myelopathy patients.

Conclusion: Handheld smartphone video can deliver accurate, scalable, and low-burden biomechanical measurement, enabling greatly increased monitoring of movement impairments in clinical settings.

Abstract: Movement directly reflects neurological and musculoskeletal health, yet objective biomechanical assessment is rarely available in routine care. We introduce Portable Biomechanics Laboratory (PBL), a secure platform for fitting biomechanical models to video collected with a handheld, moving, smartphone. We validate this approach on over 15 hours of data synchronized to ground truth motion capture, finding mean joint-angle errors < 3$°$ and pelvis-translation errors of a few centimeters across patients with neurological-injury, lower-limb prosthesis users, pediatric in-patients, and controls. In > 5 hours of prospective deployments to neurosurgery and sports-medicine clinics, PBL was easy to setup, yielded highly reliable gait metrics (ICC > 0.9), and detected clinically relevant differences. For cervical-myelopathy patients, its measurement of gait quality correlated with modified Japanese Orthopedic Association (mJOA) scores and were responsive to clinical intervention. Handheld smartphone video can therefore deliver accurate, scalable, and low-burden biomechanical measurement, enabling greatly increased monitoring of movement impairments. We release the first clinically-validated method for measuring whole-body kinematics from handheld smartphone video at https://IntelligentSensingAndRehabilitation.github.io/MonocularBiomechanics/.

[134] RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow

Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, Pai Peng

Main category: cs.CV

TL;DR: RemoteReasoner: A unified RL-trained MLLM framework for autonomous geospatial reasoning across object-, region-, and pixel-level tasks without task-specific fine-tuning.

Details

Motivation: Remote sensing requires sophisticated reasoning to interpret complex user intents and contextual relationships beyond simple recognition. Existing approaches rely on supervised fine-tuning and task-specific heads, limiting autonomous reasoning and unified generalization.

Method: Proposes RemoteReasoner with a multi-modal LLM for instruction interpretation and target localization, plus task transformation strategies for multi-granularity tasks. Trained with reinforcement learning to enable autonomous reasoning rather than predefined sequences.

Result: Achieves state-of-the-art performance across multi-granularity reasoning tasks. Retains MLLM’s generalization capability, showing robust performance on unseen tasks and out-of-distribution categories.

Conclusion: RemoteReasoner provides a unified workflow for geospatial reasoning that enables autonomous exploration and diverse task outputs without task-specific decoders or additional fine-tuning.

Abstract: Remote sensing imagery presents vast, inherently unstructured spatial data, necessitating sophisticated reasoning to interpret complex user intents and contextual relationships beyond simple recognition tasks. In this paper, we aim to construct an Earth observation workflow to handle complex queries by reasoning about spatial context and user intent. As a reasoning workflow, it should autonomously explore and construct its own inference paths, rather than being confined to predefined ground-truth sequences. Ideally, its architecture ought to be unified yet generalized, possessing capabilities to perform diverse reasoning tasks through one model without requiring additional fine-tuning. Existing remote sensing approaches rely on supervised fine-tuning paradigms and task-specific heads, limiting both autonomous reasoning and unified generalization. To this end, we propose RemoteReasoner, a unified workflow for geospatial reasoning. The design of RemoteReasoner integrates a multi-modal large language model (MLLM) for interpreting user instructions and localizing targets, together with task transformation strategies that enable multi-granularity tasks, including object-, region-, and pixel-level. In contrast to existing methods, our framework is trained with reinforcement learning (RL) to endow the MLLM sufficient reasoning autonomy. At the inference stage, our transformation strategies enable diverse task output formats without requiring task-specific decoders or further fine-tuning. Experiments demonstrated that RemoteReasoner achieves state-of-the-art (SOTA) performance across multi-granularity reasoning tasks. Furthermore, it retains the MLLM’s inherent generalization capability, demonstrating robust performance on unseen tasks and out-of-distribution categories.

[135] LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, Weiming Zhang

Main category: cs.CV

TL;DR: LAMIC is a training-free Layout-Aware Multi-Image Composition framework that extends single-reference diffusion models to multi-reference scenarios with spatial layout awareness.

Details

Motivation: Current controllable image synthesis struggles with generating coherent and consistent images from multiple references while maintaining spatial layout awareness, creating an open challenge in the field.

Method: Built on MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: Group Isolation Attention (GIA) for entity disentanglement and Region-Modulated Attention (RMA) for layout-aware generation. It’s training-free and extends single-reference models to multi-reference scenarios.

Result: LAMIC achieves state-of-the-art performance across most metrics: consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores, and achieves best DPG in complex composition tasks. Demonstrates superior identity keeping, background preservation, layout control, and prompt-following without any training.

Conclusion: LAMIC establishes a new training-free paradigm for controllable multi-image composition with strong zero-shot generalization. By inheriting strengths of advanced single-reference models, it enables seamless extension to multi-image scenarios, with performance expected to scale as foundation models evolve.

Abstract: In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC’s superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC’s performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.

[136] Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models

Ruofan Wang, Xin Wang, Yang Yao, Xuan Tong, Xingjun Ma

Main category: cs.CV

TL;DR: SEA is a grey-box jailbreak attack that exploits vulnerabilities in base VLMs that persist in fine-tuned variants, achieving high transfer success rates even against safety-enhanced models.

Details

Motivation: Fine-tuning open-source VLMs creates an underexplored attack surface where vulnerabilities in base models could transfer to fine-tuned variants, posing security risks for proprietary models built on open-source foundations.

Method: Simulated Ensemble Attack (SEA) combines Fine-tuning Trajectory Simulation (FTS) to generate transferable adversarial images by simulating vision encoder parameter shifts, and Targeted Prompt Guidance (TPG) to steer language decoder outputs toward adversarial objectives.

Result: SEA achieves >86.5% transfer attack success rate and ~49.5% toxicity rate across diverse fine-tuned Qwen2-VL variants (2B and 7B), significantly outperforming direct PGD-based attacks which rarely transfer.

Conclusion: Fine-tuned VLMs inherit vulnerabilities from base models, creating urgent need for holistic defenses across the model lifecycle to protect proprietary models from transferable jailbreak attacks.

Abstract: Fine-tuning open-source Vision-Language Models (VLMs) creates a critical yet underexplored attack surface: vulnerabilities in the base VLM could be retained in fine-tuned variants, rendering them susceptible to transferable jailbreak attacks. To demonstrate this risk, we introduce the Simulated Ensemble Attack (SEA), a novel grey-box jailbreak method in which the adversary has full access to the base VLM but no knowledge of the fine-tuned target’s weights or training configuration. To improve jailbreak transferability across fine-tuned VLMs, SEA combines two key techniques: Fine-tuning Trajectory Simulation (FTS) and Targeted Prompt Guidance (TPG). FTS generates transferable adversarial images by simulating the vision encoder’s parameter shifts, while TPG is a textual strategy that steers the language decoder toward adversarially optimized outputs. Experiments on the Qwen2-VL family (2B and 7B) demonstrate that SEA achieves high transfer attack success rates exceeding 86.5% and toxicity rates near 49.5% across diverse fine-tuned variants, even those specifically fine-tuned to improve safety behaviors. Notably, while direct PGD-based image jailbreaks rarely transfer across fine-tuned VLMs, SEA reliably exploits inherited vulnerabilities from the base model, significantly enhancing transferability. These findings highlight an urgent need to safeguard fine-tuned proprietary VLMs against transferable vulnerabilities inherited from open-source foundations, motivating the development of holistic defenses across the entire model lifecycle.

[137] Reinforcement Learning for Large Model: A Survey

Weijia Wu, Chen Gao, Joya Chen, Kevin Qinghong Lin, Qingwei Meng, Yiming Zhang, Yuke Qiu, Hong Zhou, Mike Zheng Shou

Main category: cs.CV

TL;DR: Survey paper synthesizing recent advances in visual reinforcement learning, covering policy optimization evolution, thematic pillars (multi-modal LLMs, visual generation, unified models, vision-language-action), and evaluation protocols.

Details

Motivation: To provide a critical and up-to-date synthesis of the rapidly expanding field at the intersection of reinforcement learning and visual intelligence, where agents can perceive, reason, generate, and act within complex visual scenes.

Method: Formalizes visual RL problems, traces evolution of policy optimization strategies, organizes 200+ works into four thematic pillars, examines algorithmic design and reward engineering, and reviews evaluation protocols.

Result: Comprehensive survey identifying key trends like curriculum-driven training, preference-aligned diffusion, unified reward modeling, and evaluation metrics spanning set-level fidelity, sample-level preference, and state-level stability.

Conclusion: Provides researchers with a coherent map of visual RL landscape, highlights promising directions, and identifies open challenges including sample efficiency, generalization, and safe deployment.

Abstract: Recent advances at the intersection of reinforcement learning (RL) and visual intelligence have enabled agents that not only perceive complex visual scenes but also reason, generate, and act within them. This survey offers a critical and up-to-date synthesis of the field. We first formalize visual RL problems and trace the evolution of policy-optimization strategies from RLHF to verifiable reward paradigms, and from Proximal Policy Optimization to Group Relative Policy Optimization. We then organize more than 200 representative works into four thematic pillars: multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models. For each pillar we examine algorithmic design, reward engineering, benchmark progress, and we distill trends such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling. Finally, we review evaluation protocols spanning set-level fidelity, sample-level preference, and state-level stability, and we identify open challenges that include sample efficiency, generalization, and safe deployment. Our goal is to provide researchers and practitioners with a coherent map of the rapidly expanding landscape of visual RL and to highlight promising directions for future inquiry. Resources are available at: https://github.com/weijiawu/Awesome-Visual-Reinforcement-Learning.

[138] Drifting Away from Truth: GenAI-Driven News Diversity Challenges LVLM-Based Misinformation Detection

Fanxiao Li, Jiaying Wu, Tingchao Fu, Yunyun Dong, Bingbing Song, Wei Zhou

Main category: cs.CV

TL;DR: LVLM-based multimodal misinformation detection systems suffer significant performance degradation due to GenAI-driven news diversity causing multi-level drift in model perception and evidence quality.

Details

Motivation: The rise of GenAI tools creates highly varied and complex news content (GenAI-driven news diversity) that challenges current LVLM-based misinformation detection systems, requiring systematic study of their vulnerabilities.

Method: Introduced DriftBench, a large-scale benchmark with 16,000 news instances across six diversification categories, and designed three evaluation tasks: robustness under multi-level drift, susceptibility to adversarial evidence contamination, and analysis of reasoning consistency.

Result: Experiments with six state-of-the-art LVLM detectors show substantial performance drops (average F1 -14.8%), increasingly unstable reasoning traces, and even more severe failures under adversarial evidence injection.

Conclusion: Current MMD systems have fundamental vulnerabilities to GenAI-driven diversity, revealing an urgent need for more resilient approaches in the GenAI era.

Abstract: The proliferation of multimodal misinformation poses growing threats to public discourse and societal trust. While Large Vision-Language Models (LVLMs) have enabled recent progress in multimodal misinformation detection (MMD), the rise of generative AI (GenAI) tools introduces a new challenge: GenAI-driven news diversity, characterized by highly varied and complex content. We show that this diversity induces multi-level drift, comprising (1) model-level misperception drift, where stylistic variations disrupt a model’s internal reasoning, and (2) evidence-level drift, where expression diversity degrades the quality or relevance of retrieved external evidence. These drifts significantly degrade the robustness of current LVLM-based MMD systems. To systematically study this problem, we introduce DriftBench, a large-scale benchmark comprising 16,000 news instances across six categories of diversification. We design three evaluation tasks: (1) robustness of truth verification under multi-level drift; (2) susceptibility to adversarial evidence contamination generated by GenAI; and (3) analysis of reasoning consistency across diverse inputs. Experiments with six state-of-the-art LVLM-based detectors show substantial performance drops (average F1 -14.8%) and increasingly unstable reasoning traces, with even more severe failures under adversarial evidence injection. Our findings uncover fundamental vulnerabilities in existing MMD systems and suggest an urgent need for more resilient approaches in the GenAI era.

[139] EmoCAST: Emotional Talking Portrait via Emotive Text Description

Yiguo Jiang, Xiaodong Cun, Yong Zhang, Yudian Zheng, Fan Tang, Chi-Man Pun

Main category: cs.CV

TL;DR: EmoCAST is a diffusion-based framework for emotional talking head synthesis that uses text prompts to control emotions, introduces architectural modules for better text control and audio-emotion alignment, and leverages a large-scale in-the-wild dataset with novel training strategies.

Details

Motivation: Existing emotional talking head synthesis methods have limitations in control flexibility, motion naturalness, and expression quality. Most available datasets are collected in lab settings, which hinders real-world deployment and exacerbates these shortcomings.

Method: Proposes EmoCAST with three main contributions: (1) architectural modules for text control including text-guided emotive attention for appearance modeling and emotive audio attention for audio-emotion alignment; (2) a large-scale in-the-wild emotional talking head dataset with emotive text descriptions; (3) emotion-aware sampling strategy and progressive functional training strategy to improve expressive features and lip-sync accuracy.

Result: EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos, demonstrating improved control flexibility, motion naturalness, and expression quality compared to existing methods.

Conclusion: The proposed EmoCAST framework successfully addresses limitations in emotional talking head synthesis through its diffusion-based architecture, novel attention modules, large-scale in-the-wild dataset, and specialized training strategies, enabling precise text-driven emotional synthesis with high-quality results.

Abstract: Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are mainly collected in lab settings, further exacerbating these shortcomings and hindering real-world deployment. To address these challenges, we propose EmoCAST, a diffusion-based talking head framework for precise, text-driven emotional synthesis. Its contributions are threefold: (1) architectural modules that enable effective text control; (2) an emotional talking-head dataset that expands the framework’s ability; and (3) training strategies that further improve performance. Specifically, for appearance modeling, emotional prompts are integrated through a text-guided emotive attention module, enhancing spatial knowledge to improve emotion understanding. To strengthen audio-emotion alignment, we introduce an emotive audio attention module to capture the interplay between controlled emotion and driving audio, generating emotion-aware features to guide precise facial motion synthesis. Additionally, we construct a large-scale, in-the-wild emotional talking head dataset with emotive text descriptions to optimize the framework’s performance. Based on this dataset, we propose an emotion-aware sampling strategy and a progressive functional training strategy that improve the model’s ability to capture nuanced expressive features and achieve accurate lip-sync. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos. Project Page: https://github.com/GVCLab/EmoCAST

[140] Resolution scaling governs DINOv3 transfer performance in chest radiograph classification

Soroosh Tayebi Arasteh, Mina Shaigan, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn

Main category: cs.CV

TL;DR: DINOv3 SSL models with 512x512 resolution and ConvNeXt-B backbone provide best chest X-ray performance, especially for subtle abnormalities, while larger models/resolutions offer diminishing returns.

Details

Motivation: To systematically evaluate whether DINOv3's SSL improvements translate to better chest radiography performance compared to DINOv2 and ImageNet initialization, given chest X-rays' fine-grained findings and high clinical volume.

Method: Benchmarked DINOv3 against DINOv2 and ImageNet initialization across 7 datasets (>814k images) using ViT-B/16 and ConvNeXt-B backbones at 224x224, 512x512, and 1024x1024 resolutions. Also evaluated frozen features from 7B model. Primary outcome was mean AUROC across labels.

Result: At 224x224, DINOv3 and DINOv2 performed similarly on adult datasets. At 512x512, DINOv3 consistently outperformed both DINOv2 and ImageNet. Pediatric cohort showed no differences. ConvNeXt-B outperformed ViT-B/16 across all settings. Frozen DINOv3-7B features underperformed vs finetuned backbones. 1024x1024 didn’t improve accuracy further. Resolution gains were most evident for boundary-dependent and small focal abnormalities.

Conclusion: Higher input resolution (512x512) is critical for leveraging modern SSL models in chest radiography. DINOv3-initialized ConvNeXt-B networks at 512x512 provide strongest performance with practical upper limit. Finetuned mid-sized backbones at 512x512 are recommended for chest X-ray interpretation, with greatest gains for subtle lesions in emergency/critical care settings.

Abstract: Self-supervised learning (SSL) has advanced visual representation learning, but its value in chest radiography, a high-volume imaging modality with fine-grained findings, remains unclear. Meta’s DINOv3 extends earlier SSL models through Gram-anchored self-distillation. Whether these design choices improve transfer learning for chest radiography has not been systematically tested. We benchmarked DINOv3 against DINOv2 and ImageNet initialization across seven datasets (n>814,000). Two representative backbones were evaluated: ViT-B/16 and ConvNeXt-B. Images were analyzed at 224x224, 512x512, and 1024x1024 pixels. We additionally assessed frozen features from a 7B model. The primary outcome was mean AUROC across labels. At 224x224, DINOv3 and DINOv2 achieved comparable performance on adult datasets. Increasing resolution to 512x512 yielded consistent improvements for DINOv3 over both DINOv2 and ImageNet. In contrast, results in pediatric cohort showed no differences across initializations. Across all settings, ConvNeXt-B outperformed ViT-B/16. Models using frozen DINOv3-7B features underperformed relative to fully finetuned 86-89M-parameter backbones, highlighting the importance of domain adaptation. Scaling to 1024x1024 did not further improve accuracy. Resolution-related gains were most evident for boundary-dependent and small focal abnormalities. In chest radiography, higher input resolution is critical for leveraging the benefits of modern self-supervised models. 512x512 pixels represent a practical upper limit where DINOv3-initialized ConvNeXt-B networks provide the strongest performance, while larger inputs offer minimal return on cost. Clinically, these findings support use of finetuned, mid-sized backbones at 512x512 for chest radiograph interpretation, with the greatest gains expected in detecting subtle or boundary-centered lesions relevant to emergency and critical care settings.

[141] Weakly Supervised Ephemeral Gully Detection In Remote Sensing Images Using Vision Language Models

Seyed Mohamad Ali Tousi, Ramy Farag, John A. Lory, G. N. DeSouza

Main category: cs.CV

TL;DR: First weakly supervised pipeline for ephemeral gully detection using Vision Language Models to reduce manual labeling effort, with novel teacher-student approach and new large-scale dataset.

Details

Motivation: Ephemeral gullies are concerning soil erosion phenomena with short temporal cycles, making automatic detection difficult. Classical computer vision and remote sensing struggle, and machine learning is limited by scarce, hard-to-produce labeled data, forcing reliance on zero-shot approaches.

Method: Weakly supervised pipeline using Vision Language Models (VLMs) to reduce manual labeling. Method exploits: 1) knowledge from VLM pretraining, 2) teacher-student model where teacher learns from noisy VLM labels, and student learns via weak supervision using teacher-generated labels with noise-aware loss function.

Result: Superior performance compared to VLMs and label model alone when using weak supervision to train student model. Also provides first-of-its-kind dataset for semi-supervised ephemeral gully detection with over 18,000 high-resolution remote-sensing images spanning 13 years.

Conclusion: Proposed weakly supervised pipeline effectively addresses ephemeral gully detection challenges, reducing labeling effort while maintaining performance. Public release of code and dataset enables further research in this important soil erosion domain.

Abstract: Among soil erosion problems, Ephemeral Gullies are one of the most concerning phenomena occurring in agricultural fields. Their short temporal cycles increase the difficulty in automatically detecting them using classical computer vision approaches and remote sensing. Also, due to scarcity of and the difficulty in producing accurate labeled data, automatic detection of ephemeral gullies using Machine Learning is limited to zero-shot approaches which are hard to implement. To overcome these challenges, we present the first weakly supervised pipeline for detection of ephemeral gullies. Our method relies on remote sensing and uses Vision Language Models (VLMs) to drastically reduce the labor-intensive task of manual labeling. In order to achieve that, the method exploits: 1) the knowledge embedded in the VLM’s pretraining; 2) a teacher-student model where the teacher learns from noisy labels coming from the VLMs, and the student learns by weak supervision using teacher-generate labels and a noise-aware loss function. We also make available the first-of-its-kind dataset for semi-supervised detection of ephemeral gully from remote-sensed images. The dataset consists of a number of locations labeled by a group of soil and plant scientists, as well as a large number of unlabeled locations. The dataset represent more than 18,000 high-resolution remote-sensing images obtained over the course of 13 years. Our experimental results demonstrate the validity of our approach by showing superior performances compared to VLMs and the label model itself when using weak supervision to train an student model. The code and dataset for this work are made publicly available.

[142] Video Generation Models Are Good Latent Reward Models

Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong, Zijun Liu, Guozhen Zhang, Zixiang Zhou, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Fan Tang

Main category: cs.CV

TL;DR: PRFL enables efficient preference optimization for video generation by conducting reward feedback learning entirely in latent space, avoiding expensive pixel-space processing and improving alignment with human preferences.

Details

Motivation: Existing video reward models rely on pixel-space inputs, requiring VAE decoding and late-stage optimization that incurs high memory/time costs and lacks early-stage supervision for motion dynamics and structural coherence.

Method: Propose Process Reward Feedback Learning (PRFL) that conducts preference optimization entirely in latent space using pre-trained video generation models as reward models, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding.

Result: PRFL significantly improves alignment with human preferences while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.

Conclusion: Pre-trained video generation models are naturally suited for reward modeling in noisy latent space, enabling more efficient and effective preference optimization for video generation through latent-space processing.

Abstract: Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning~(PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.

[143] TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection

Jian-Yu Jiang-Lin, Kang-Yang Huang, Ling Zou, Ling Lo, Sheng-Ping Yang, Yu-Wen Tseng, Kun-Hsiang Lin, Chia-Ling Chen, Yu-Ting Ta, Yan-Tsung Wang, Po-Ching Chen, Hongxia Xie, Hong-Han Shuai, Wen-Huang Cheng

Main category: cs.CV

TL;DR: TriDF is a comprehensive benchmark for interpretable DeepFake detection that evaluates models across three key aspects: Perception (identifying manipulation artifacts), Detection (classification performance), and Hallucination (explanation reliability).

Details

Motivation: The increasing ease of fabricating realistic portrayals of individuals using generative models creates serious risks for security, communication, and public trust. Current systems need to not only detect manipulations but also provide clear and reliable reasoning for their decisions.

Method: The authors introduce TriDF, a benchmark containing high-quality forgeries from advanced synthesis models covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates models on three aspects: Perception (using human-annotated evidence), Detection (classification across diverse forgery families), and Hallucination (quantifying explanation reliability).

Result: Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making. The results reveal the interdependence between detection accuracy, evidence identification, and explanation reliability.

Conclusion: TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability. It offers a foundation for building trustworthy systems that can address real-world synthetic media threats by ensuring both accurate detection and reliable explanations.

Abstract: Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.

[144] Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing Hu, Xi Hu, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Donglei Ji, Siqi Jiang, Wei Jiang, Yunpu Jiang, Zhuo Jiang, Ashley Kim, Jianan Kong, Zhichao Lai, Shanshan Lao, Yichong Leng, Ai Li, Feiya Li, Gen Li, Huixia Li, JiaShi Li, Liang Li, Ming Li, Shanshan Li, Tao Li, Xian Li, Xiaojie Li, Xiaoyang Li, Xingxing Li, Yameng Li, Yifu Li, Yiying Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Zhiqiang Liang, Wang Liao, Yalin Liao, Heng Lin, Kengyu Lin, Shanchuan Lin, Xi Lin, Zhijie Lin, Feng Ling, Fangfang Liu, Gaohong Liu, Jiawei Liu, Jie Liu, Jihao Liu, Shouda Liu, Shu Liu, Sichao Liu, Songwei Liu, Xin Liu, Xue Liu, Yibo Liu, Zikun Liu, Zuxi Liu, Junlin Lyu, Lecheng Lyu, Qian Lyu, Han Mu, Xiaonan Nie, Jingzhe Ning, Xitong Pan, Yanghua Peng, Lianke Qin, Xueqiong Qu, Yuxi Ren, Kai Shen, Guang Shi, Lei Shi, Yan Song, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Yan Sun, Zeyu Sun, Wenjing Tang, Yaxue Tang, Zirui Tao, Feng Wang, Furui Wang, Jinran Wang, Junkai Wang, Ke Wang, Kexin Wang, Qingyi Wang, Rui Wang, Sen Wang, Shuai Wang, Tingru Wang, Weichen Wang, Xin Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Ziyu Wang, Guoqiang Wei, Wanru Wei, Di Wu, Guohong Wu, Hanjie Wu, Jian Wu, Jie Wu, Ruolan Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Liang Xiang, Fei Xiao, XueFeng Xiao, Pan Xie, Shuangyi Xie, Shuang Xu, Jinlan Xue, Shen Yan, Bangbang Yang, Ceyuan Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yang Yang, Yihang Yang, ZhiXian Yang, Ziyan Yang, Songting Yao, Yifan Yao, Zilyu Ye, Bowen Yu, Jian Yu, Chujie Yuan, Linxiao Yuan, Sichun Zeng, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Chuntao Zhang, Heng Zhang, Jingjie Zhang, Kuo Zhang, Liang Zhang, Liying Zhang, Manlin Zhang, Ting Zhang, Weida Zhang, Xiaohe Zhang, Xinyan Zhang, Yan Zhang, Yuan Zhang, Zixiang Zhang, Fengxuan Zhao, Huating Zhao, Yang Zhao, Hao Zheng, Jianbin Zheng, Xiaozheng Zheng, Yangyang Zheng, Yijie Zheng, Jiexin Zhou, Jiahui Zhu, Kuan Zhu, Shenhan Zhu, Wenjia Zhu, Benhui Zou, Feilong Zuo

Main category: cs.CV

TL;DR: Seedance 1.5 pro is a foundational model for joint audio-video generation using dual-branch Diffusion Transformer with cross-modal integration, achieving superior synchronization and quality through SFT, RLHF, and 10X acceleration.

Details

Motivation: To advance unified audio-visual generation by creating a professional-grade foundational model that can generate synchronized audio and video content with practical utility for content creation.

Method: Dual-branch Diffusion Transformer architecture with cross-modal joint module, multi-stage data pipeline, post-training optimizations (SFT and RLHF with multi-dimensional reward models), and an acceleration framework for 10X faster inference.

Result: Achieves exceptional audio-visual synchronization, superior generation quality, precise multilingual/dialect lip-syncing, dynamic cinematic camera control, enhanced narrative coherence, and 10X inference speed boost.

Conclusion: Seedance 1.5 pro positions itself as a robust engine for professional-grade content creation with advanced audio-visual generation capabilities, now accessible on Volcano Engine platform.

Abstract: Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.

[145] Next-Embedding Prediction Makes Strong Vision Learners

Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin, Joyce Chai, Saining Xie, Stella X. Yu

Main category: cs.CV

TL;DR: NEPA proposes a simple generative pretraining approach for vision by predicting future patch embeddings from past ones, achieving strong results without complex designs like pixel reconstruction or contrastive losses.

Details

Motivation: Inspired by generative pretraining success in NLP, the paper explores whether similar principles can create effective self-supervised visual learners by shifting from learning representations to learning models that directly perform predictive tasks.

Method: Next-Embedding Predictive Autoregression (NEPA) trains models to predict future patch embeddings conditioned on past ones using causal masking and stop gradient, with a simple Transformer pretrained on ImageNet-1k as the sole learning objective.

Result: Achieves 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transfers effectively to semantic segmentation on ADE20K.

Conclusion: Generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.

Abstract: Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next embedding prediction as its sole learning objective is effective - no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.

[146] HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection

Zhaolin Cai, Fan Li, Ziwei Zheng, Haixia Bi, Lijun He

Main category: cs.CV

TL;DR: HeadHunt-VAD is a tuning-free video anomaly detection method that identifies and uses specific attention heads within frozen multimodal LLMs instead of relying on textual outputs, achieving state-of-the-art performance while being efficient.

Details

Motivation: Traditional VAD methods require extensive labeled data and are computationally expensive. Existing tuning-free MLLM approaches suffer from information loss, normalcy bias, and prompt sensitivity due to their reliance on textual outputs, making them insufficient for capturing subtle anomalous cues.

Method: Proposes HeadHunt-VAD with a Robust Head Identification module that systematically evaluates all attention heads using multi-criteria analysis of saliency and stability to identify a sparse subset of consistently discriminative heads. Features from these expert heads are fed into a lightweight anomaly scorer and temporal locator for efficient detection.

Result: Achieves state-of-the-art performance among tuning-free methods on two major VAD benchmarks while maintaining high efficiency. Validates head-level probing in MLLMs as a powerful solution for real-world anomaly detection.

Conclusion: HeadHunt-VAD demonstrates that directly hunting robust anomaly-sensitive internal attention heads within frozen MLLMs, bypassing textual generation, provides an effective and practical tuning-free approach for video anomaly detection with interpretable outputs.

Abstract: Video Anomaly Detection (VAD) aims to locate events that deviate from normal patterns in videos. Traditional approaches often rely on extensive labeled data and incur high computational costs. Recent tuning-free methods based on Multimodal Large Language Models (MLLMs) offer a promising alternative by leveraging their rich world knowledge. However, these methods typically rely on textual outputs, which introduces information loss, exhibits normalcy bias, and suffers from prompt sensitivity, making them insufficient for capturing subtle anomalous cues. To address these constraints, we propose HeadHunt-VAD, a novel tuning-free VAD paradigm that bypasses textual generation by directly hunting robust anomaly-sensitive internal attention heads within the frozen MLLM. Central to our method is a Robust Head Identification module that systematically evaluates all attention heads using a multi-criteria analysis of saliency and stability, identifying a sparse subset of heads that are consistently discriminative across diverse prompts. Features from these expert heads are then fed into a lightweight anomaly scorer and a temporal locator, enabling efficient and accurate anomaly detection with interpretable outputs. Extensive experiments show that HeadHunt-VAD achieves state-of-the-art performance among tuning-free methods on two major VAD benchmarks while maintaining high efficiency, validating head-level probing in MLLMs as a powerful and practical solution for real-world anomaly detection.

[147] LiteGE: Lightweight Geodesic Embedding for Efficient Geodesics Computation and Non-Isometric Shape Correspondence

Yohanes Yudhi Adikusuma, Qixing Huang, Ying He

Main category: cs.CV

TL;DR: LiteGE is a lightweight method for computing geodesic distances on 3D surfaces using PCA on UDF samples, achieving 300× memory/time reduction and 1000× speedup for shape matching while working on sparse point clouds.

Details

Motivation: Existing learning-based methods for geodesic distance computation rely on large 3D backbones, causing high memory usage and latency that limit practical applications in interactive or resource-constrained settings.

Method: Constructs compact, category-aware shape descriptors by applying Principal Component Analysis (PCA) to unsigned distance field (UDF) samples at informative voxels, eliminating the need for high-capacity networks.

Result: Reduces memory usage and inference time by up to 300× compared to neural approaches, works on sparse point clouds (as few as 300 points), and achieves up to 1000× speedup for shape matching while maintaining comparable accuracy.

Conclusion: LiteGE provides an efficient, lightweight alternative to heavy neural networks for geodesic distance computation, enabling practical applications in resource-constrained environments and demonstrating strong performance on both dense and sparse 3D data.

Abstract: Computing geodesic distances on 3D surfaces is fundamental to many tasks in 3D vision and geometry processing, with deep connections to tasks such as shape correspondence. Recent learning-based methods achieve strong performance but rely on large 3D backbones, leading to high memory usage and latency, which limit their use in interactive or resource-constrained settings. We introduce LiteGE, a lightweight approach that constructs compact, category-aware shape descriptors by applying Principal Component Analysis (PCA) to unsigned distance field (UDFs) samples at informative voxels. This descriptor is efficient to compute and removes the need for high-capacity networks. LiteGE remains robust on sparse point clouds, supporting inputs with as few as 300 points, where prior methods fail. Extensive experiments show that LiteGE reduces memory usage and inference time by up to 300$\times$ compared to existing neural approaches. In addition, by exploiting the intrinsic relationship between geodesic distance and shape correspondence, LiteGE enables fast and accurate shape matching. Our method achieves up to 1000$\times$ speedup over state-of-the-art mesh-based approaches while maintaining comparable accuracy on non-isometric shape pairs, including evaluations on point-cloud inputs.

[148] UniMPR: A Unified Framework for Multimodal Place Recognition with Heterogeneous Sensor Configurations

Zhangshuo Qi, Jingyi Xu, Luqi Cheng, Shichen Wen, Yiming Ma, Guangming Xiong

Main category: cs.CV

TL;DR: UniMPR is a unified multimodal place recognition framework that uses a single trained model to handle any combination of camera, LiDAR, and radar inputs, achieving SOTA performance across diverse sensor configurations.

Details

Motivation: Existing multimodal place recognition methods struggle with dynamically adapting to various modality inputs, maintaining robustness with missing/degraded modalities, and generalizing across diverse sensor configurations and setups.

Method: Unifies all inputs in polar BEV feature space, uses multi-branch network to extract intra-modal and inter-modal features, constructs large-scale training set from multiple datasets, and employs adaptive label assignment for extensive pre-training.

Result: Achieves state-of-the-art performance on seven datasets under varying sensor configurations, modality combinations, and environmental conditions.

Conclusion: UniMPR provides a unified solution for multimodal place recognition that can seamlessly adapt to any combination of common perceptual modalities with a single trained model, demonstrating strong generalization and robustness.

Abstract: Place recognition is a critical component of autonomous vehicles and robotics, enabling global localization in GPS-denied environments. Recent advances have spurred significant interest in multimodal place recognition (MPR), which leverages complementary strengths of multiple modalities. Despite its potential, most existing MPR methods still face three key challenges: (1) dynamically adapting to various modality inputs within a unified framework, (2) maintaining robustness with missing or degraded modalities, and (3) generalizing across diverse sensor configurations and setups. In this paper, we propose UniMPR, a unified framework for multimodal place recognition. Using only one trained model, it can seamlessly adapt to any combination of common perceptual modalities (e.g., camera, LiDAR, radar). To tackle the data heterogeneity, we unify all inputs within a polar BEV feature space. Subsequently, the polar BEVs are fed into a multi-branch network to exploit discriminative intra-model and inter-modal features from any modality combinations. To fully exploit the network’s generalization capability and robustness, we construct a large-scale training set from multiple datasets and introduce an adaptive label assignment strategy for extensive pre-training. Experiments on seven datasets demonstrate that UniMPR achieves state-of-the-art performance under varying sensor configurations, modality combinations, and environmental conditions. Our code will be released at https://github.com/QiZS-BIT/UniMPR.

[149] Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation

Tianrui Zhu, Shiyi Zhang, Zhirui Sun, Jingqi Tian, Yansong Tang

Main category: cs.CV

TL;DR: MAG framework decouples memory compression and frame generation for long video synthesis, addressing catastrophic forgetting in autoregressive models while maintaining efficiency.

Details

Motivation: Current frame-AR models for long video generation face a trade-off: window attention discards historical context causing scene inconsistency, while full history retention is memory-prohibitive.

Method: Proposes Memorize-and-Generate (MAG) with two components: 1) a memory model that compresses historical information into compact KV cache, and 2) a separate generator model that synthesizes frames using this compressed representation.

Result: MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks, with MAG-Bench introduced for strict memory retention evaluation.

Conclusion: MAG effectively addresses the memory-efficiency trade-off in long video generation by decoupling memory compression from frame synthesis, enabling better scene consistency without prohibitive memory costs.

Abstract: Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose Memorize-and-Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce MAG-Bench to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.

cs.AI

[150] PhysMaster: Building an Autonomous AI Physicist for Theoretical and Computational Physics Research

Tingjia Miao, Jiawen Dai, Jingkun Liu, Jinxin Tan, Muhua Zhang, Wenkai Jin, Yuwen Du, Tian Jin, Xianghe Pang, Zexi Liu, Tu Guo, Zhengliang Zhang, Yunjie Huang, Shuo Chen, Rui Ye, Yuzhi Zhang, Linfeng Zhang, Kun Chen, Wei Wang, Weinan E, Siheng Chen

Main category: cs.AI

TL;DR: PhysMaster is an LLM-based agent that functions as an autonomous theoretical/computational physicist, combining abstract reasoning with numerical computation to accelerate, automate, and discover solutions in physics research.

Details

Motivation: Existing LLM-based scientific agents are limited to well-defined benchmarks or simple tasks like literature retrieval, lacking end-to-end problem-solving ability in open scientific scenarios, especially in physics which requires abstract mathematical reasoning integrated with computational methods.

Method: PhysMaster couples abstract reasoning with numerical computation, leverages LANDAU (Layered Academic Data Universe) for storing retrieved literature, curated knowledge, and methodological traces, and employs adaptive exploration strategies balancing efficiency with open-ended exploration for ultra-long-horizon tasks.

Result: PhysMaster demonstrates capabilities in three areas: (i) acceleration - compressing labor-intensive research from months to hours, (ii) automation - autonomously executing hypothesis-driven loops, and (iii) autonomous discovery - independently exploring open problems across high-energy theory, condensed matter theory, and astrophysics.

Conclusion: PhysMaster represents a significant advancement in LLM-based scientific agents, enabling autonomous physics research by integrating reasoning, computation, and knowledge management, with potential to transform how physics research is conducted.

Abstract: Advances in LLMs have produced agents with knowledge and operational capabilities comparable to human scientists, suggesting potential to assist, accelerate, and automate research. However, existing studies mainly evaluate such systems on well-defined benchmarks or general tasks like literature retrieval, limiting their end-to-end problem-solving ability in open scientific scenarios. This is particularly true in physics, which is abstract, mathematically intensive, and requires integrating analytical reasoning with code-based computation. To address this, we propose PhysMaster, an LLM-based agent functioning as an autonomous theoretical and computational physicist. PhysMaster couples absract reasoning with numerical computation and leverages LANDAU, the Layered Academic Data Universe, which preserves retrieved literature, curated prior knowledge, and validated methodological traces, enhancing decision reliability and stability. It also employs an adaptive exploration strategy balancing efficiency and open-ended exploration, enabling robust performance in ultra-long-horizon tasks. We evaluate PhysMaster on problems from high-energy theory, condensed matter theory to astrophysics, including: (i) acceleration, compressing labor-intensive research from months to hours; (ii) automation, autonomously executing hypothesis-driven loops ; and (iii) autonomous discovery, independently exploring open problems.

[151] A Branch-and-Price Algorithm for Fast and Equitable Last-Mile Relief Aid Distribution

Mahdi Mostajabdaveh, F. Sibel Salman, Walter J. Gutjahr

Main category: cs.AI

TL;DR: Bi-objective vehicle routing for post-disaster relief distribution balancing efficiency (travel time) and equity (Gini-index-based unsatisfied demand) using MIP and branch-and-price algorithm.

Details

Motivation: Post-disaster relief distribution faces challenges when prepositioned supplies are insufficient, requiring efficient and equitable allocation to shelters while managing limited resources.

Method: Formulated bi-objective MIP model minimizing Gini-index-based inequity and total travel time, used ε-constraint method, derived mathematical properties for valid inequalities, developed algorithm for optimal delivery allocations, and implemented branch-and-price algorithm.

Result: Branch-and-price algorithm significantly outperforms commercial MIP solvers, reduces aid distribution inequity by 34% without efficiency loss, and shows lexicographic optimization works for extreme time constraints while balanced approach needed for moderate constraints.

Conclusion: Bi-objective approach effectively balances efficiency and equity in relief distribution, with algorithm showing practical superiority and providing guidance on optimization strategies based on time constraint severity.

Abstract: The distribution of relief supplies to shelters is a critical aspect of post-disaster humanitarian logistics. In major disasters, prepositioned supplies often fall short of meeting all demands. We address the problem of planning vehicle routes from a distribution center to shelters while allocating limited relief supplies. To balance efficiency and equity, we formulate a bi-objective problem: minimizing a Gini-index-based measure of inequity in unsatisfied demand for fair distribution and minimizing total travel time for timely delivery. We propose a Mixed Integer Programming (MIP) model and use the $ε$-constraint method to handle the bi-objective nature. By deriving mathematical properties of the optimal solution, we introduce valid inequalities and design an algorithm for optimal delivery allocations given feasible vehicle routes. A branch-and-price (B&P) algorithm is developed to solve the problem efficiently. Computational tests on realistic datasets from a past earthquake in Van, Turkey, and predicted data for Istanbul’s Kartal region show that the B&P algorithm significantly outperforms commercial MIP solvers. Our bi-objective approach reduces aid distribution inequity by 34% without compromising efficiency. Results indicate that when time constraints are very loose or tight, lexicographic optimization prioritizing demand coverage over fairness is effective. For moderately restrictive time constraints, a balanced approach is essential to avoid inequitable outcomes.

[152] Interpolative Decoding: Exploring the Spectrum of Personality Traits in LLMs

Eric Yeh, John Cadigan, Ran Chen, Dick Crouch, Melinda Gervasio, Dayne Freitag

Main category: cs.AI

TL;DR: LLMs can simulate human behavior in economic games using interpolative decoding to model personality traits without needing separate prompts for each personality profile.

Details

Motivation: Current LLM-based human simulations require separate prompts for each personality profile, creating experimental overhead and reducing replicability. The paper aims to develop a more efficient method for simulating personality-driven decision making.

Method: Uses interpolative decoding where each Big Five personality dimension is represented as a pair of opposed prompts, with an interpolation parameter to simulate behavior along that dimension continuum.

Result: Interpolative decoding reliably modulates scores along Big Five dimensions, replicates human decision-making behavior in economic games, and shows preliminary success in “twinning” individual human players through systematic search in interpolation space.

Conclusion: Interpolative decoding provides an efficient, replicable method for simulating personality-driven human behavior in LLMs, enabling better psychological and economic research simulations.

Abstract: Recent research has explored using very large language models (LLMs) as proxies for humans in tasks such as simulation, surveys, and studies. While LLMs do not possess a human psychology, they often can emulate human behaviors with sufficiently high fidelity to drive simulations to test human behavioral hypotheses, exhibiting more nuance and range than the rule-based agents often employed in behavioral economics. One key area of interest is the effect of personality on decision making, but the requirement that a prompt must be created for every tested personality profile introduces experimental overhead and degrades replicability. To address this issue, we leverage interpolative decoding, representing each dimension of personality as a pair of opposed prompts and employing an interpolation parameter to simulate behavior along the dimension. We show that interpolative decoding reliably modulates scores along each of the Big Five dimensions. We then show how interpolative decoding causes LLMs to mimic human decision-making behavior in economic games, replicating results from human psychological research. Finally, we present preliminary results of our efforts to ``twin’’ individual human players in a collaborative game through systematic search for points in interpolation space that cause the system to replicate actions taken by the human subject.

[153] Zero-Shot Segmentation through Prototype-Guidance for Multi-Label Plant Species Identification

Luciano Araujo Dourado Filho, Almir Moreira da Silva Neto, Rodrigo Pereira David, Rodrigo Tripodi Calumby

Main category: cs.AI

TL;DR: A prototype-guided segmentation ViT approach for PlantCLEF 2025 multi-label species identification, achieving 5th place with F1=0.33331.

Details

Motivation: To address the PlantCLEF 2025 challenge of fine-grained multi-label species identification from high-resolution vegetation plot images, requiring domain adaptation from individual species classification to multi-label classification.

Method: Uses class prototypes from training data as guidance: extracts features from training images, clusters them with K-Means (K=number of classes), then trains a custom narrow ViT with frozen DinoV2 patch embedding to reconstruct these prototypes from test images. The model generates attention scores for localization to guide classification.

Result: Achieved 5th place in PlantCLEF 2025 private leaderboard with F1 score of 0.33331, only 0.03 lower than top-performing submission, demonstrating competitive performance.

Conclusion: The prototype-guided segmentation ViT approach effectively enables domain adaptation from individual species classification to multi-label vegetation plot classification, showing competitive results in the PlantCLEF 2025 benchmark.

Abstract: This paper presents an approach developed to address the PlantClef 2025 challenge, which consists of a fine-grained multi-label species identification, over high-resolution images. Our solution focused on employing class prototypes obtained from the training dataset as a proxy guidance for training a segmentation Vision Transformer (ViT) on the test set images. To obtain these representations, the proposed method extracts features from training dataset images and create clusters, by applying K-Means, with $K$ equals to the number of classes in the dataset. The segmentation model is a customized narrow ViT, built by replacing the patch embedding layer with a frozen DinoV2, pre-trained on the training dataset for individual species classification. This model is trained to reconstruct the class prototypes of the training dataset from the test dataset images. We then use this model to obtain attention scores that enable to identify and localize areas of interest and consequently guide the classification process. The proposed approach enabled a domain-adaptation from multi-class identification with individual species, into multi-label classification from high-resolution vegetation plots. Our method achieved fifth place in the PlantCLEF 2025 challenge on the private leaderboard, with an F1 score of 0.33331. Besides that, in absolute terms our method scored 0.03 lower than the top-performing submission, suggesting that it may achieved competitive performance in the benchmark task. Our code is available at \href{https://github.com/ADAM-UEFS/PlantCLEF2025}{https://github.com/ADAM-UEFS/PlantCLEF2025}.

[154] FGDCC: Fine-Grained Deep Cluster Categorization – A Framework for Intra-Class Variability Problems in Plant Classification

Luciano Araujo Dourado Filho, Rodrigo Tripodi Calumby

Main category: cs.AI

TL;DR: A novel method for Fine-Grained Visual Categorization that uses class-wise clustering to discover pseudo-labels encoding latent similarity between images, then employs hierarchical classification to learn fine-grained features and mitigate intra-class variability issues.

Details

Motivation: Intra-class variability (dissimilarity between images within the same class) hinders DL model learning, especially when combined with class underrepresentation - a common problem in FGVC tasks. This paper aims to improve classification performance in FGVC by addressing intra-class variability through learning fine-grained features.

Method: The method applies clustering to each class individually to discover pseudo-labels that encode latent similarity between images. These pseudo-labels are then used in a hierarchical classification process to learn more fine-grained visual features, thereby mitigating intra-class variability issues.

Result: Initial experiments on PlantNet300k dataset showed promising results, with the method achieving state-of-the-art performance even though some components weren’t fully optimized. The experiments revealed key points for future development to find more conclusive evidence of the method’s effectiveness.

Conclusion: The proposed hierarchical classification approach using class-wise cluster assignments shows potential for improving FGVC performance by learning fine-grained features and mitigating intra-class variability, though further optimization and validation are needed.

Abstract: Intra-class variability is given according to the significance in the degree of dissimilarity between images within a class. In that sense, depending on its intensity, intra-class variability can hinder the learning process for DL models, specially when such classes are also underrepresented, which is a very common scenario in Fine-Grained Visual Categorization (FGVC) tasks. This paper proposes a novel method that aims at leveraging classification performance in FGVC tasks by learning fine-grained features via classification of class-wise cluster assignments. Our goal is to apply clustering over each class individually, which can allow to discover pseudo-labels that encodes a latent degree of similarity between images. In turn, those labels can be employed in a hierarchical classification process that allows to learn more fine-grained visual features and thereby mitigating intra-class variability issues. Initial experiments over the PlantNet300k enabled to shed light upon several key points in which future work will have to be developed in order to find more conclusive evidence regarding the effectiveness of our method. Our method still achieves state-of-the-art performance on the PlantNet300k dataset even though some of its components haven’t been shown to be fully optimized. Our code is available at \href{https://github.com/ADAM-UEFS/FGDCC}{https://github.com/ADAM-UEFS/FGDCC}.

[155] LongVideoAgent: Multi-Agent Reasoning with Long Videos

Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen

Main category: cs.AI

TL;DR: Multi-agent framework for long-video QA where a master LLM coordinates grounding and vision agents to localize relevant segments and extract visual details, trained with RL for efficient cooperation.

Details

Motivation: Existing multimodal LLMs for long-video QA compress content into lossy summaries or use limited toolsets, which weakens temporal grounding and misses fine-grained visual cues needed for accurate episode-level reasoning.

Method: A multi-agent framework with: 1) master LLM that coordinates and plans with step limits, 2) grounding agent to localize question-relevant segments, 3) vision agent to extract targeted textual observations. The master agent is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation.

Result: The system significantly outperforms strong non-agent baselines on LongTVQA and LongTVQA+ datasets (episode-level datasets aggregated from TVQA/TVQA+). Reinforcement learning further strengthens reasoning and planning capabilities of the trained agent.

Conclusion: The multi-agent framework with RL training enables effective long-video QA by improving temporal grounding, complementing subtitles with visual details, and producing interpretable reasoning trajectories while maintaining efficiency.

Abstract: Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+ which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.

Zhe Sun, Xueyuan Yang, Yujie Lu, Zhenliang Zhang

Main category: cs.AI

TL;DR: S³IT benchmark evaluates embodied social intelligence through seat-ordering tasks requiring integration of social norms and physical constraints in 3D environments.

Details

Motivation: Existing evaluations fail to integrate social reasoning with physical constraints - they either test disembodied social reasoning or socially-agnostic physical tasks, missing the crucial integration needed for real-world embodied agents.

Method: Introduces S³IT benchmark with procedurally extensible framework generating diverse seat-ordering scenarios where agents must arrange seating for LLM-driven NPCs with complex identities, preferences, and relationships. Agents must acquire preferences through dialogue, explore environment, and perform multi-objective optimization.

Result: State-of-the-art LLMs struggle with S³IT, showing significant gap compared to human baseline. LLMs have deficiencies in spatial intelligence but achieve near human-level competence in resolving conflicts with explicit textual cues.

Conclusion: Embodied social intelligence requires integration of social and physical reasoning that current LLMs lack, particularly in spatial intelligence, highlighting need for better benchmarks and model improvements for real-world embodied agents.

Abstract: The integration of embodied agents into human environments demands embodied social intelligence: reasoning over both social norms and physical constraints. However, existing evaluations fail to address this integration, as they are limited to either disembodied social reasoning (e.g., in text) or socially-agnostic physical tasks. Both approaches fail to assess an agent’s ability to integrate and trade off both physical and social constraints within a realistic, embodied context. To address this challenge, we introduce Spatially Situated Social Intelligence Test (S$^{3}$IT), a benchmark specifically designed to evaluate embodied social intelligence. It is centered on a novel and challenging seat-ordering task, requiring an agent to arrange seating in a 3D environment for a group of large language model-driven (LLM-driven) NPCs with diverse identities, preferences, and intricate interpersonal relationships. Our procedurally extensible framework generates a vast and diverse scenario space with controllable difficulty, compelling the agent to acquire preferences through active dialogue, perceive the environment via autonomous exploration, and perform multi-objective optimization within a complex constraint network. We evaluate state-of-the-art LLMs on S$^{3}$IT and found that they still struggle with this problem, showing an obvious gap compared with the human baseline. Results imply that LLMs have deficiencies in spatial intelligence, yet simultaneously demonstrate their ability to achieve near human-level competence in resolving conflicts that possess explicit textual cues.

[157] Discovering Lie Groups with Flow Matching

Jung Yeon Park, Yuxuan Chen, Floor Eijkelboom, Jan-Willem van de Meent, Lawson L. S. Wong, Robin Walters

Main category: cs.AI

TL;DR: Proposes LieFlow, a method to discover symmetries in data via flow matching on Lie groups, learning distributions over hypothesis groups that match observed symmetries.

Details

Motivation: Symmetry is fundamental in physics and improves ML performance/efficiency, but requires knowledge of underlying symmetries in data. Current methods have limitations in group types they can discover and require many assumptions.

Method: LieFlow uses flow matching on Lie groups to learn distributions over larger hypothesis groups that match observed symmetries. Addresses “last-minute convergence” issue with novel interpolation scheme for flow matching.

Result: Successfully discovers discrete groups including reflections via flow matching over complex domain in 2D and 3D point cloud experiments.

Conclusion: LieFlow is more flexible than previous methods, requires fewer assumptions, and effectively discovers symmetries directly from data through flow matching on Lie groups.

Abstract: Symmetry is fundamental to understanding physical systems, and at the same time, can improve performance and sample efficiency in machine learning. Both pursuits require knowledge of the underlying symmetries in data. To address this, we propose learning symmetries directly from data via flow matching on Lie groups. We formulate symmetry discovery as learning a distribution over a larger hypothesis group, such that the learned distribution matches the symmetries observed in data. Relative to previous works, our method, \lieflow, is more flexible in terms of the types of groups it can discover and requires fewer assumptions. Experiments on 2D and 3D point clouds demonstrate the successful discovery of discrete groups, including reflections by flow matching over the complex domain. We identify a key challenge where the symmetric arrangement of the target modes causes ``last-minute convergence,’’ where samples remain stationary until relatively late in the flow, and introduce a novel interpolation scheme for flow matching for symmetry discovery.

[158] Let’s Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification

Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav, Zsolt Kira

Main category: cs.AI

TL;DR: MLLMs show promise as verifiers for AI agents but suffer from agreement bias (over-validating behavior). The paper introduces Self-Grounded Verification (SGV) to mitigate this bias, improving evaluation accuracy and boosting downstream task performance.

Details

Motivation: While verifiers have driven AI progress in domains with clear success criteria (math, code), extending them to domains without clear-cut success (computer use, robotics) is challenging. MLLMs offer potential as verifiers due to their world knowledge and human-preference alignment, but they exhibit systematic agreement bias.

Method: Introduces Self-Grounded Verification (SGV): a two-step method where MLLMs first generate broad priors about desired behavior independently, then condition on these self-generated priors to evaluate candidate trajectories. This leverages MLLMs’ own sampling mechanisms to better utilize their knowledge and reasoning.

Result: SGV improves human-aligned evaluations with gains up to 25pp in failure detection and 14pp in accuracy. It boosts task completion in GUI specialist (OSWorld), diffusion policy (robomimic), and ReAct agent (VisualWebArena), setting new SOTA with 20pp improvement over previous best.

Conclusion: MLLMs have potential as verifiers but suffer from agreement bias. SGV effectively mitigates this bias by grounding evaluations in self-generated priors, leading to more reliable assessments and improved downstream performance across diverse domains.

Abstract: Verifiers–functions assigning rewards to agent behavior–have been key for AI progress in domains like math and code. However, extending gains to domains without clear-cut success criteria (e.g., computer use) remains a challenge: while humans can recognize desired outcomes, translating this intuition into scalable rules is nontrivial. Multimodal Large Language Models (MLLMs) emerge as a promising solution, given their world knowledge, human-preference alignment, and reasoning skills. We evaluate MLLMs as verifiers across web navigation, computer use, and robotic manipulation, and identify a critical limitation: a strong tendency to over-validate agent behavior, a phenomenon we term agreement bias. This bias is pervasive across models, resilient to test-time scaling, and poses risks to existing methods relying on MLLM evaluations. We discuss methods to evaluate and improve MLLM verifiers and introduce Self-Grounded Verification (SGV), a lightweight method that harnesses MLLMs’ own sampling mechanisms by modulating (un)conditional generation to better leverage their knowledge, alignment, and reasoning. SGV operates in two steps: first, the MLLM is elicited to generate broad priors about desired behavior, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. SGV yields more human-aligned evaluations with gains of up to 25pp in failure detection, 14pp in accuracy, and benefits extending to downstream applications. In self-refinement and online supervision, SGV boosts task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena–setting a new state of the art, surpassing the previous best by 20pp. We release an updated version of VisualWebArena featuring more human-aligned evaluators, high-fidelity environment parallelism, and speedups of over 10x.

[159] Learning Skills from Action-Free Videos

Hung-Chieh Fang, Kuo-Han Hung, Chu-Rong Chen, Po-Jung Chou, Chun-Kai Yang, Po-Chen Ko, Yu-Chiang Wang, Yueh-Hua Wu, Min-Hung Chen, Shao-Hua Sun

Main category: cs.AI

TL;DR: SOF learns latent skills from action-free videos using optical flow as intermediate representation, enabling high-level planning and easier translation to robot actions.

Details

Motivation: Existing video generative models produce good visual predictions but are hard to translate into low-level actions, while latent-action models lack high-level planning capabilities. There's a need to bridge this gap for better robot learning from videos.

Method: SOF learns latent skills from large collections of action-free videos using optical flow as intermediate representation. It creates a flow-based latent space that captures motion information aligned with both video dynamics and robot actions, enabling skill composition and planning.

Result: Experiments show consistent performance improvements in both multitask and long-horizon settings, demonstrating the ability to acquire and compose skills directly from raw visual data.

Conclusion: SOF successfully bridges the gap between video generative models and action models by learning skills from optical flow, enabling effective high-level planning and translation to robot actions from visual data alone.

Abstract: Learning from videos offers a promising path toward generalist robots by providing rich visual and temporal priors beyond what real robot datasets contain. While existing video generative models produce impressive visual predictions, they are difficult to translate into low-level actions. Conversely, latent-action models better align videos with actions, but they typically operate at the single-step level and lack high-level planning capabilities. We bridge this gap by introducing Skill Abstraction from Optical Flow (SOF), a framework that learns latent skills from large collections of action-free videos. Our key idea is to learn a latent skill space through an intermediate representation based on optical flow that captures motion information aligned with both video dynamics and robot actions. By learning skills in this flow-based latent space, SOF enables high-level planning over video-derived skills and allows for easier translation of these skills into actions. Experiments show that our approach consistently improves performance in both multitask and long-horizon settings, demonstrating the ability to acquire and compose skills directly from raw visual data.

[160] CardAIc-Agents: A Multimodal Framework with Hierarchical Adaptation for Cardiac Care Support

Yuting Zhang, Karina V. Bunting, Asgher Champsi, Xiaoxia Wang, Wenqi Lu, Alexander Thorley, Sandeep S Hothi, Zhaowen Qiu, Baturalp Buyukates, Dipak Kotecha, Jinming Duan

Main category: cs.AI

TL;DR: CardAIc-Agents: A multimodal AI framework for adaptive cardiac care that addresses limitations of current AI systems through dynamic planning, tool integration, continuous learning, and visual outputs.

Details

Motivation: Current AI systems for cardiovascular disease detection have limitations: rigid sequential workflows, lack of domain-specific tools, static knowledge bases without continuous learning, and fixed unimodal inputs without visual outputs when clinicians need clarification.

Method: Proposed CardAIc-Agents framework with: 1) CardiacRAG agent for task-aware planning from updatable cardiac knowledge, 2) Chief agent integrating tools to execute plans, 3) stepwise update strategy for dynamic plan refinement, 4) multidisciplinary discussion team for challenging cases, and 5) visual review panels for clinician validation.

Result: Experiments across three datasets showed CardAIc-Agents outperformed mainstream Vision-Language Models and state-of-the-art agentic systems in efficiency.

Conclusion: CardAIc-Agents provides an adaptive multimodal framework that addresses key limitations of current AI systems for cardiac care, enabling more flexible, tool-augmented, and clinically relevant AI assistance for cardiovascular disease management.

Abstract: Cardiovascular diseases (CVDs) remain the foremost cause of mortality worldwide, a burden worsened by a severe deficit of healthcare workers. Artificial intelligence (AI) agents have shown potential to alleviate this gap through automated detection and proactive screening, yet their clinical application remains limited by: 1) rigid sequential workflows, whereas clinical care often requires adaptive reasoning that select specific tests and, based on their results, guides personalised next steps; 2) reliance solely on intrinsic model capabilities to perform role assignment without domain-specific tool support; 3) general and static knowledge bases without continuous learning capability; and 4) fixed unimodal or bimodal inputs and lack of on-demand visual outputs when clinicians require visual clarification. In response, a multimodal framework, CardAIc-Agents, was proposed to augment models with external tools and adaptively support diverse cardiac tasks. First, a CardiacRAG agent generated task-aware plans from updatable cardiac knowledge, while the Chief agent integrated tools to autonomously execute these plans and deliver decisions. Second, to enable adaptive and case-specific customization, a stepwise update strategy was developed to dynamically refine plans based on preceding execution results, once the task was assessed as complex. Third, a multidisciplinary discussion team was proposed which was automatically invoked to interpret challenging cases, thereby supporting further adaptation. In addition, visual review panels were provided to assist validation when clinicians raised concerns. Experiments across three datasets showed the efficiency of CardAIc-Agents compared to mainstream Vision-Language Models (VLMs) and state-of-the-art agentic systems.

[161] Towards Generative Location Awareness for Disaster Response: A Probabilistic Cross-view Geolocalization Approach

Hao Li, Fabian Deuser, Wenping Yin, Steffen Knoblauch, Wufan Zhao, Filip Biljecki, Yong Xue, Wei Huang

Main category: cs.AI

TL;DR: ProbGLC: A probabilistic cross-view geolocalization approach for rapid disaster response that combines probabilistic and deterministic models to enhance explainability and achieve state-of-the-art accuracy.

Details

Motivation: As climate change intensifies disasters, rapid and efficient response requires accurate disaster location identification. Current approaches lack explainability and uncertainty quantification needed for reliable decision-making in disaster scenarios.

Method: Proposes ProbGLC, a unified framework combining probabilistic and deterministic geolocalization models. It addresses cross-view geolocalization across multiple disaster events and provides probabilistic distributions and localizability scores for explainability.

Result: Achieves superior geolocalization accuracy: 0.86 in Acc@1km and 0.97 in Acc@25km on MultiIAN and SAGAINDisaster datasets. Provides model explainability through probabilistic distributions and localizability scores.

Conclusion: ProbGLC demonstrates great potential for leveraging generative cross-view approaches to facilitate location awareness for better and faster disaster response, with publicly available data and code.

Abstract: As Earth’s climate changes, it is impacting disasters and extreme weather events across the planet. Record-breaking heat waves, drenching rainfalls, extreme wildfires, and widespread flooding during hurricanes are all becoming more frequent and more intense. Rapid and efficient response to disaster events is essential for climate resilience and sustainability. A key challenge in disaster response is to accurately and quickly identify disaster locations to support decision-making and resources allocation. In this paper, we propose a Probabilistic Cross-view Geolocalization approach, called ProbGLC, exploring new pathways towards generative location awareness for rapid disaster response. Herein, we combine probabilistic and deterministic geolocalization models into a unified framework to simultaneously enhance model explainability (via uncertainty quantification) and achieve state-of-the-art geolocalization performance. Designed for rapid diaster response, the ProbGLC is able to address cross-view geolocalization across multiple disaster events as well as to offer unique features of probabilistic distribution and localizability score. To evaluate the ProbGLC, we conduct extensive experiments on two cross-view disaster datasets (i.e., MultiIAN and SAGAINDisaster), consisting diverse cross-view imagery pairs of multiple disaster types (e.g., hurricanes, wildfires, floods, to tornadoes). Preliminary results confirms the superior geolocalization accuracy (i.e., 0.86 in Acc@1km and 0.97 in Acc@25km) and model explainability (i.e., via probabilistic distributions and localizability scores) of the proposed ProbGLC approach, highlighting the great potential of leveraging generative cross-view approach to facilitate location awareness for better and faster disaster response. The data and code is publicly available at https://github.com/bobleegogogo/ProbGLC

[162] Multi-Agent Intelligence for Multidisciplinary Decision-Making in Gastrointestinal Oncology

Rongzhao Zhang, Junqiao Wang, Shuyun Yang, Mouxiao Bian, Chihao Zhang, Dongyang Wang, Qiujuan Yan, Yun Zhong, Yuwei Bai, Guanxu Zhu, Kangkun Mao, Miao Wang, Chao Ding, Renjie Lu, Lei Wang, Lei Zheng, Tao Zheng, Xi Wang, Zhuo Fan, Bing Han, Meiling Liu, Luyi Jiang, Dongming Shan, Wenzhong Jin, Jiwei Yu, Zheng Wang, Jie Xu, Meng Luo

Main category: cs.AI

TL;DR: A hierarchical multi-agent framework mimicking human MDT collaboration outperforms monolithic MLLMs in GI oncology clinical reasoning by reducing context dilution and hallucinations.

Details

Motivation: Multimodal clinical reasoning in GI oncology requires integrating endoscopic, radiological, and biochemical data, but current MLLMs suffer from context dilution and hallucinations when processing complex medical histories.

Method: Proposed a hierarchical Multi-Agent Framework that emulates human Multidisciplinary Team (MDT) collaborative workflow to address MLLM limitations.

Result: Achieved composite expert evaluation score of 4.60/5.00, significantly outperforming monolithic baseline, with greatest improvements in reasoning logic and medical accuracy.

Conclusion: Agent-based collaboration provides scalable, interpretable, and clinically robust paradigm for automated oncology decision support by mimicking human team workflows.

Abstract: Multimodal clinical reasoning in the field of gastrointestinal (GI) oncology necessitates the integrated interpretation of endoscopic imagery, radiological data, and biochemical markers. Despite the evident potential exhibited by Multimodal Large Language Models (MLLMs), they frequently encounter challenges such as context dilution and hallucination when confronted with intricate, heterogeneous medical histories. In order to address these limitations, a hierarchical Multi-Agent Framework is proposed, which emulates the collaborative workflow of a human Multidisciplinary Team (MDT). The system attained a composite expert evaluation score of 4.60/5.00, thereby demonstrating a substantial improvement over the monolithic baseline. It is noteworthy that the agent-based architecture yielded the most substantial enhancements in reasoning logic and medical accuracy. The findings indicate that mimetic, agent-based collaboration provides a scalable, interpretable, and clinically robust paradigm for automated decision support in oncology.

[163] Scaling Reinforcement Learning for Content Moderation with Large Language Models

Hamed Firooz, Rui Liu, Yuchen Lu, Zhenyu Hou, Fangzhou Xiong, Xiaoyang Zhang, Changshu Jian, Zhicheng Zhu, Jiayuan Ma, Jacob Tao, Chaitali Gupta, Xiaochang Peng, Shike Mei, Hang Cui, Yang Qin, Shuo Tang, Jason Gaedtke, Arpit Mittal

Main category: cs.AI

TL;DR: RL-based content moderation systems show sigmoid-like scaling with data efficiency 100x better than supervised fine-tuning, especially effective for complex policy reasoning tasks.

Details

Motivation: Content moderation at scale is challenging due to billions of user/AI-generated artifacts needing policy evaluation. While LLMs show potential, practical challenges remain unexplored: label sparsity, evolving policies, and need for nuanced reasoning beyond pattern matching.

Method: Comprehensive empirical investigation of scaling RL for content classification, evaluating multiple RL training recipes and reward-shaping strategies (verifiable rewards, LLM-as-judge frameworks) to transform general-purpose LLMs into specialized policy-aligned classifiers across three real-world moderation tasks.

Result: RL exhibits sigmoid-like scaling behavior with smooth performance improvements from increased training data, rollouts, and optimization steps before saturation. RL substantially improves performance on complex policy-grounded reasoning tasks and achieves up to 100x higher data efficiency than supervised fine-tuning.

Conclusion: RL is particularly effective for content moderation in domains with scarce/costly expert annotations, providing actionable insights for industrial-scale moderation systems.

Abstract: Content moderation at scale remains one of the most pressing challenges in today’s digital ecosystem, where billions of user- and AI-generated artifacts must be continuously evaluated for policy violations. Although recent advances in large language models (LLMs) have demonstrated strong potential for policy-grounded moderation, the practical challenges of training these systems to achieve expert-level accuracy in real-world settings remain largely unexplored, particularly in regimes characterized by label sparsity, evolving policy definitions, and the need for nuanced reasoning beyond shallow pattern matching. In this work, we present a comprehensive empirical investigation of scaling reinforcement learning (RL) for content classification, systematically evaluating multiple RL training recipes and reward-shaping strategies-including verifiable rewards and LLM-as-judge frameworks-to transform general-purpose language models into specialized, policy-aligned classifiers across three real-world content moderation tasks. Our findings provide actionable insights for industrial-scale moderation systems, demonstrating that RL exhibits sigmoid-like scaling behavior in which performance improves smoothly with increased training data, rollouts, and optimization steps before gradually saturating. Moreover, we show that RL substantially improves performance on tasks requiring complex policy-grounded reasoning while achieving up to 100x higher data efficiency than supervised fine-tuning, making it particularly effective in domains where expert annotations are scarce or costly.

[164] Reason2Decide: Rationale-Driven Multi-Task Learning

H M Quamran Hasan, Housam Khalifa Bashier, Jiayi Dai, Mi-Young Kim, Randy Goebel

Main category: cs.AI

TL;DR: Reason2Decide: A two-stage training framework for clinical decision support that improves prediction accuracy while generating explanations aligned with predictions, addressing exposure bias and task separation issues.

Details

Motivation: Clinical decision support systems using LLMs face a critical challenge: achieving high predictive accuracy while generating explanations that align with predictions. Current approaches suffer from exposure bias leading to misaligned explanations.

Method: Two-stage training framework: Stage-1 trains on rationale generation; Stage-2 jointly trains on label prediction and rationale generation using scheduled sampling to gradually transition from conditioning on gold labels to model predictions.

Result: Outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity across three medical datasets. Achieves rationale source-robustness across LLM-generated, nurse-authored, and nurse-post-processed rationales. Works with models 40x smaller than contemporary foundation models.

Conclusion: Reason2Decide makes clinical reasoning more accessible for resource-constrained deployments while providing explainable decision support, reducing reliance on human annotations by effectively using LLM-generated rationales for pretraining.

Abstract: Despite the wide adoption of Large Language Models (LLM)s, clinical decision support systems face a critical challenge: achieving high predictive accuracy while generating explanations aligned with the predictions. Current approaches suffer from exposure bias leading to misaligned explanations. We propose Reason2Decide, a two-stage training framework that addresses key challenges in self-rationalization, including exposure bias and task separation. In Stage-1, our model is trained on rationale generation, while in Stage-2, we jointly train on label prediction and rationale generation, applying scheduled sampling to gradually transition from conditioning on gold labels to model predictions. We evaluate Reason2Decide on three medical datasets, including a proprietary triage dataset and public biomedical QA datasets. Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge). In triage, Reason2Decide is rationale source-robust across LLM-generated, nurse-authored, and nurse-post-processed rationales. In our experiments, while using only LLM-generated rationales in Stage-1, Reason2Decide outperforms other fine-tuning variants. This indicates that LLM-generated rationales are suitable for pretraining models, reducing reliance on human annotations. Remarkably, Reason2Decide achieves these gains with models 40x smaller than contemporary foundation models, making clinical reasoning more accessible for resource-constrained deployments while still providing explainable decision support.

[165] Adaptive Financial Sentiment Analysis for NIFTY 50 via Instruction-Tuned LLMs , RAG and Reinforcement Learning Approaches

Chaithra, Kamesh Kadimisetty, Biju R Mohan

Main category: cs.AI

TL;DR: An adaptive framework combining instruction-tuned LLMs with market feedback and reinforcement learning for improved financial sentiment analysis in Indian stock markets.

Details

Motivation: Existing financial sentiment analysis methods don't consider stock price impacts or market feedback, limiting their real-world applicability and alignment with actual market behavior.

Method: Fine-tunes LLaMA 3.2 3B on SentiFin dataset using instruction-based learning, adds RAG pipeline for dynamic multi-source context selection, introduces feedback module comparing predicted sentiment with next-day returns, and incorporates PPO reinforcement learning agent for adaptive source weighting.

Result: Significantly improves classification accuracy, F1-score, and market alignment over baseline models on NIFTY 50 news headlines (2024-2025), demonstrating better real-world performance.

Conclusion: Combining instruction-tuned LLMs with dynamic feedback and reinforcement learning enables robust, market-aware financial sentiment modeling that adapts to real market behavior.

Abstract: Financial sentiment analysis plays a crucial role in informing investment decisions, assessing market risk, and predicting stock price trends. Existing works in financial sentiment analysis have not considered the impact of stock prices or market feedback on sentiment analysis. In this paper, we propose an adaptive framework that integrates large language models (LLMs) with real-world stock market feedback to improve sentiment classification in the context of the Indian stock market. The proposed methodology fine-tunes the LLaMA 3.2 3B model using instruction-based learning on the SentiFin dataset. To enhance sentiment predictions, a retrieval-augmented generation (RAG) pipeline is employed that dynamically selects multi-source contextual information based on the cosine similarity of the sentence embeddings. Furthermore, a feedback-driven module is introduced that adjusts the reliability of the source by comparing predicted sentiment with actual next-day stock returns, allowing the system to iteratively adapt to market behavior. To generalize this adaptive mechanism across temporal data, a reinforcement learning agent trained using proximal policy optimization (PPO) is incorporated. The PPO agent learns to optimize source weighting policies based on cumulative reward signals from sentiment-return alignment. Experimental results on NIFTY 50 news headlines collected from 2024 to 2025 demonstrate that the proposed system significantly improves classification accuracy, F1-score, and market alignment over baseline models and static retrieval methods. The results validate the potential of combining instruction-tuned LLMs with dynamic feedback and reinforcement learning for robust, market-aware financial sentiment modeling.

[166] MolAct: An Agentic RL Framework for Molecular Editing and Property Optimization

Zhuo Yang, Yeyun chen, Jiaqing Xie, Ben Gao, Shuaike Shen, Wanhao Liu, Liujia Yang, Beilun Wang, Tianfan Fu, Yuqiang Li

Main category: cs.AI

TL;DR: MolAct is an agentic RL framework for molecular design that treats editing and optimization as sequential, tool-guided decisions, enabling LLM agents to learn chemical tool usage for valid molecular improvements.

Details

Motivation: Molecular editing and optimization require iterative improvements while maintaining chemical validity and structural similarity. Current approaches lack formalization as agentic reinforcement learning problems where LLMs can learn to interleave reasoning, tool-use, and optimization in multi-turn interactions.

Method: MolAct uses a two-stage training paradigm: first building editing capability, then optimizing properties while reusing learned editing behaviors. The framework enables LLM agents to interact in multiple turns, invoking chemical tools for validity checking, property assessment, and similarity control, using feedback to refine subsequent edits.

Result: MolEditAgent-7B achieves 100, 95, and 98 valid add, delete, and substitute edits, outperforming DeepSeek-R1. MolEditAgent-3B approaches Qwen3-32B-think performance. MolOptAgent-7B surpasses Claude 3.7 on LogP optimization and remains competitive on solubility while maintaining balanced performance across objectives.

Conclusion: Treating molecular design as a multi-step, tool-augmented process is key to reliable and interpretable improvements. The agentic reinforcement learning framework successfully enables LLMs to learn chemical tool usage for valid molecular editing and optimization.

Abstract: Molecular editing and optimization are multi-step problems that require iteratively improving properties while keeping molecules chemically valid and structurally similar. We frame both tasks as sequential, tool-guided decisions and introduce MolAct, an agentic reinforcement learning framework that employs a two-stage training paradigm: first building editing capability, then optimizing properties while reusing the learned editing behaviors. To the best of our knowledge, this is the first work to formalize molecular design as an Agentic Reinforcement Learning problem, where an LLM agent learns to interleave reasoning, tool-use, and molecular optimization. The framework enables agents to interact in multiple turns, invoking chemical tools for validity checking, property assessment, and similarity control, and leverages their feedback to refine subsequent edits. We instantiate the MolAct framework to train two model families: MolEditAgent for molecular editing tasks and MolOptAgent for molecular optimization tasks. In molecular editing, MolEditAgent-7B delivers 100, 95, and 98 valid add, delete, and substitute edits, outperforming strong closed “thinking” baselines such as DeepSeek-R1; MolEditAgent-3B approaches the performance of much larger open “thinking” models like Qwen3-32B-think. In molecular optimization, MolOptAgent-7B (trained on MolEditAgent-7B) surpasses the best closed “thinking” baseline (e.g., Claude 3.7) on LogP and remains competitive on solubility, while maintaining balanced performance across other objectives. These results highlight that treating molecular design as a multi-step, tool-augmented process is key to reliable and interpretable improvements.

[167] Enhancing Zero-Shot Time Series Forecasting in Off-the-Shelf LLMs via Noise Injection

Xingyou Yin, Ceyao Zhang, Min Hu, Kai Chen

Main category: cs.AI

TL;DR: Injecting noise into raw time series before tokenization improves frozen LLMs’ forecasting performance by forcing them to focus on robust temporal patterns rather than numerical artifacts.

Details

Motivation: Frozen LLMs (without fine-tuning) are brittle for time series forecasting because their performance is highly sensitive to textual representation of input data. Since parameters cannot adapt to distribution shifts, the models can be misled by superficial numerical artifacts.

Method: Inject noise into raw time series before tokenization as a form of inference-time augmentation. This non-invasive intervention forces the frozen LLM to extrapolate based on robust underlying temporal patterns rather than superficial numerical artifacts.

Result: Empirical validation across diverse benchmarks shows improved performance. To eliminate data contamination biases, the authors introduced two novel time series datasets outside LLMs’ pre-training scopes and consistently observed improved performance.

Conclusion: Noise injection before tokenization is a simple yet effective strategy to improve frozen LLMs for time series forecasting, providing a further step in directly leveraging off-the-shelf LLMs for this task.

Abstract: Large Language Models (LLMs) have demonstrated effectiveness as zero-shot time series (TS) forecasters. The key challenge lies in tokenizing TS data into textual representations that align with LLMs’ pre-trained knowledge. While existing work often relies on fine-tuning specialized modules to bridge this gap, a distinct, yet challenging, paradigm aims to leverage truly off-the-shelf LLMs without any fine-tuning whatsoever, relying solely on strategic tokenization of numerical sequences. The performance of these fully frozen models is acutely sensitive to the textual representation of the input data, as their parameters cannot adapt to distribution shifts. In this paper, we introduce a simple yet highly effective strategy to overcome this brittleness: injecting noise into the raw time series before tokenization. This non-invasive intervention acts as a form of inference-time augmentation, compelling the frozen LLM to extrapolate based on robust underlying temporal patterns rather than superficial numerical artifacts. We theoretically analyze this phenomenon and empirically validate its effectiveness across diverse benchmarks. Notably, to fully eliminate potential biases from data contamination during LLM pre-training, we introduce two novel TS datasets that fall outside all utilized LLMs’ pre-training scopes, and consistently observe improved performance. This study provides a further step in directly leveraging off-the-shelf LLMs for time series forecasting.

[168] A Bidirectional Gated Recurrent Unit Model for PUE Prediction in Data Centers

Dhivya Dharshini Kannan, Anupam Trivedi, Dipti Srinivasan

Main category: cs.AI

TL;DR: Developed BiGRU-based PUE prediction model for data center energy efficiency, outperforming GRU with optimized feature selection and hyperparameters.

Details

Motivation: Data centers consume significant global energy and have large carbon footprints. With growing edge computing and AI demands, improving energy efficiency is crucial for cost reduction, business competitiveness, and environmental sustainability. PUE (Power Usage Effectiveness) is key metric for data center operational efficiency.

Method: Developed Bidirectional Gated Recurrent Unit (BiGRU) model for PUE prediction. Used EnergyPlus simulation data (52,560 samples, 117 features) from Singapore data center. Applied Recursive Feature Elimination with Cross-Validation (RFECV) for feature selection. Optimized hyperparameters and compared performance with standard GRU model.

Result: BiGRU model outperformed GRU using evaluation metrics: mean squared error (MSE), mean absolute error (MAE), and R-squared. Feature selection via RFECV identified most relevant features for accurate PUE prediction.

Conclusion: BiGRU-based PUE prediction model effectively improves data center energy efficiency assessment. The approach enables targeted modifications of key features to reduce energy consumption and supports sustainable data center operations.

Abstract: Data centers account for significant global energy consumption and a carbon footprint. The recent increasing demand for edge computing and AI advancements drives the growth of data center storage capacity. Energy efficiency is a cost-effective way to combat climate change, cut energy costs, improve business competitiveness, and promote IT and environmental sustainability. Thus, optimizing data center energy management is the most important factor in the sustainability of the world. Power Usage Effectiveness (PUE) is used to represent the operational efficiency of the data center. Predicting PUE using Neural Networks provides an understanding of the effect of each feature on energy consumption, thus enabling targeted modifications of those key features to improve energy efficiency. In this paper, we have developed Bidirectional Gated Recurrent Unit (BiGRU) based PUE prediction model and compared the model performance with GRU. The data set comprises 52,560 samples with 117 features using EnergyPlus, simulating a DC in Singapore. Sets of the most relevant features are selected using the Recursive Feature Elimination with Cross-Validation (RFECV) algorithm for different parameter settings. These feature sets are used to find the optimal hyperparameter configuration and train the BiGRU model. The performance of the optimized BiGRU-based PUE prediction model is then compared with that of GRU using mean squared error (MSE), mean absolute error (MAE), and R-squared metrics.

[169] Concept Generalization in Humans and Large Language Models: Insights from the Number Game

Arghavan Bazigaran, Hansem Sohn

Main category: cs.AI

TL;DR: Humans outperform LLMs in concept inference tasks by flexibly using both rule-based and similarity-based reasoning, while LLMs rely more on rigid mathematical rules and require more examples to generalize.

Details

Motivation: To understand the fundamental differences in how humans and large language models (LLMs) generalize and infer mathematical concepts, particularly in concept inference tasks like the number game.

Method: Used a Bayesian model as an analytical framework to examine inductive biases and inference strategies of both humans and LLMs in the number game concept inference task.

Result: Humans showed superior generalization, flexibly inferring both rule-based and similarity-based concepts, while LLMs relied more heavily on mathematical rules. Humans demonstrated few-shot generalization from single examples, whereas LLMs required more samples.

Conclusion: There are fundamental differences in how humans and LLMs infer and generalize mathematical concepts, with humans showing more flexible, sample-efficient reasoning compared to LLMs’ more rigid, rule-based approaches.

Abstract: We compare human and large language model (LLM) generalization in the number game, a concept inference task. Using a Bayesian model as an analytical framework, we examined the inductive biases and inference strategies of humans and LLMs. The Bayesian model captured human behavior better than LLMs in that humans flexibly infer rule-based and similarity-based concepts, whereas LLMs rely more on mathematical rules. Humans also demonstrated a few-shot generalization, even from a single example, while LLMs required more samples to generalize. These contrasts highlight the fundamental differences in how humans and LLMs infer and generalize mathematical concepts.

[170] Offline Safe Policy Optimization From Heterogeneous Feedback

Ze Gong, Pradeep Varakantham, Akshat Kumar

Main category: cs.AI

TL;DR: PreSa: A framework for offline safe PbRL that learns safe policies directly from preferences and safety labels without explicit reward/cost models, avoiding constrained RL and outperforming baselines.

Details

Motivation: Existing safe RLHF approaches learn reward/cost models from offline data then use constrained RL, but in long-horizon continuous control tasks, errors accumulate and impair performance. Need direct policy learning from preferences and safety signals.

Method: PreSa combines preference learning with safety alignment in a constrained optimization problem solved via Lagrangian paradigm. Learns reward-maximizing safe policy directly from pairwise preferences (reward preferences) and binary safety labels on trajectory segments, without explicit reward/cost models.

Result: Outperforms state-of-the-art baselines and offline safe RL approaches with ground-truth reward/cost on continuous control tasks with synthetic and real human feedback. Successfully learns safe policies with high rewards.

Conclusion: Direct policy learning from preferences and safety labels (without explicit reward/cost models) effectively addresses error accumulation in long-horizon tasks and enables safe preference-based RL in continuous control domains.

Abstract: Offline Preference-based Reinforcement Learning (PbRL) learns rewards and policies aligned with human preferences without the need for extensive reward engineering and direct interaction with human annotators. However, ensuring safety remains a critical challenge across many domains and tasks. Previous works on safe RL from human feedback (RLHF) first learn reward and cost models from offline data, then use constrained RL to optimize a safe policy. While such an approach works in the contextual bandits settings (LLMs), in long horizon continuous control tasks, errors in rewards and costs accumulate, leading to impairment in performance when used with constrained RL methods. To address these challenges, (a) instead of indirectly learning policies (from rewards and costs), we introduce a framework that learns a policy directly based on pairwise preferences regarding the agent’s behavior in terms of rewards, as well as binary labels indicating the safety of trajectory segments; (b) we propose \textsc{PreSa} (Preference and Safety Alignment), a method that combines preference learning module with safety alignment in a constrained optimization problem. This optimization problem is solved within a Lagrangian paradigm that directly learns reward-maximizing safe policy \textit{without explicitly learning reward and cost models}, avoiding the need for constrained RL; (c) we evaluate our approach on continuous control tasks with both synthetic and real human feedback. Empirically, our method successfully learns safe policies with high rewards, outperforming state-of-the-art baselines, and offline safe RL approaches with ground-truth reward and cost.

[171] TongSIM: A General Platform for Simulating Intelligent Machines

Zhe Sun, Kunlun Wu, Chuanjian Fu, Zeming Song, Langyong Shi, Zihe Xue, Bohan Jing, Ying Yang, Xiaomeng Gao, Aijia Li, Tianyu Guo, Huiying Li, Xueyuan Yang, Rongkai Liu, Xinyi He, Yuxi Wang, Yue Li, Mingyuan Liu, Yujie Lu, Hongzhao Xie, Shiyun Zhao, Bo Dai, Wei Wang, Tao Yuan, Song-Chun Zhu, Yujia Peng, Zhenliang Zhang

Main category: cs.AI

TL;DR: TongSIM is a high-fidelity, general-purpose platform for training and evaluating embodied AI agents across diverse indoor and outdoor scenarios, addressing the lack of versatile simulation environments for embodied intelligence research.

Details

Motivation: As AI research shifts from single-modality text processing to multimodal and embodied AI, there's a gap in available simulation platforms. Most existing platforms are narrowly designed for specific tasks, lacking a versatile, general-purpose training environment that can support everything from low-level embodied navigation to high-level composite activities like multi-agent social simulation and human-AI collaboration.

Method: The authors introduce TongSIM, a high-fidelity platform featuring: 1) Over 100 diverse, multi-room indoor scenarios, 2) An open-ended, interaction-rich outdoor town simulation, 3) Comprehensive evaluation framework and benchmarks, 4) Customized scenes and task-adaptive fidelity, 5) Diverse agent types, and 6) Dynamic environmental simulation.

Result: TongSIM provides broad applicability across research needs with its flexible and scalable architecture. It enables precise assessment of agent capabilities including perception, cognition, decision-making, human-robot cooperation, and spatial and social reasoning.

Conclusion: TongSIM serves as a unified platform that accelerates training, evaluation, and advancement toward general embodied intelligence, addressing the critical need for versatile simulation environments in embodied AI research.

Abstract: As artificial intelligence (AI) rapidly advances, especially in multimodal large language models (MLLMs), research focus is shifting from single-modality text processing to the more complex domains of multimodal and embodied AI. Embodied intelligence focuses on training agents within realistic simulated environments, leveraging physical interaction and action feedback rather than conventionally labeled datasets. Yet, most existing simulation platforms remain narrowly designed, each tailored to specific tasks. A versatile, general-purpose training environment that can support everything from low-level embodied navigation to high-level composite activities, such as multi-agent social simulation and human-AI collaboration, remains largely unavailable. To bridge this gap, we introduce TongSIM, a high-fidelity, general-purpose platform for training and evaluating embodied agents. TongSIM offers practical advantages by providing over 100 diverse, multi-room indoor scenarios as well as an open-ended, interaction-rich outdoor town simulation, ensuring broad applicability across research needs. Its comprehensive evaluation framework and benchmarks enable precise assessment of agent capabilities, such as perception, cognition, decision-making, human-robot cooperation, and spatial and social reasoning. With features like customized scenes, task-adaptive fidelity, diverse agent types, and dynamic environmental simulation, TongSIM delivers flexibility and scalability for researchers, serving as a unified platform that accelerates training, evaluation, and advancement toward general embodied intelligence.

[172] MemR$^3$: Memory Retrieval via Reflective Reasoning for LLM Agents

Xingbo Du, Loka Li, Duzhen Zhang, Le Song

Main category: cs.AI

TL;DR: MemR³ is a memory retrieval system for LLM agents that introduces closed-loop control with a router for action selection and evidence-gap tracking, improving answer quality over standard retrieve-then-answer pipelines.

Details

Motivation: Existing memory systems for LLM agents focus too much on compression and storage optimization, with insufficient emphasis on explicit, closed-loop control of memory retrieval processes.

Method: MemR³ features two core mechanisms: 1) a router that selects among retrieve, reflect, and answer actions to optimize answer quality; 2) a global evidence-gap tracker that makes the answering process transparent and tracks evidence collection.

Result: MemR³ surpasses strong baselines on the LoCoMo benchmark, improving existing retrievers across four categories with overall improvements of +7.29% on RAG and +1.94% on Zep using GPT-4.1-mini backend.

Conclusion: The system offers a plug-and-play controller for existing memory stores, demonstrating that closed-loop control mechanisms can significantly enhance memory retrieval performance in LLM agents.

Abstract: Memory systems have been designed to leverage past experiences in Large Language Model (LLM) agents. However, many deployed memory systems primarily optimize compression and storage, with comparatively less emphasis on explicit, closed-loop control of memory retrieval. From this observation, we build memory retrieval as an autonomous, accurate, and compatible agent system, named MemR$^3$, which has two core mechanisms: 1) a router that selects among retrieve, reflect, and answer actions to optimize answer quality; 2) a global evidence-gap tracker that explicitly renders the answering process transparent and tracks the evidence collection process. This design departs from the standard retrieve-then-answer pipeline by introducing a closed-loop control mechanism that enables autonomous decision-making. Empirical results on the LoCoMo benchmark demonstrate that MemR$^3$ surpasses strong baselines on LLM-as-a-Judge score, and particularly, it improves existing retrievers across four categories with an overall improvement on RAG (+7.29%) and Zep (+1.94%) using GPT-4.1-mini backend, offering a plug-and-play controller for existing memory stores.

[173] Graph-Symbolic Policy Enforcement and Control (G-SPEC): A Neuro-Symbolic Framework for Safe Agentic AI in 5G Autonomous Networks

Divya Vijay, Vignesh Ethiraj

Main category: cs.AI

TL;DR: G-SPEC is a neuro-symbolic framework that uses deterministic verification (Network Knowledge Graph + SHACL constraints) to constrain LLM-based probabilistic planning for 5G/6G network orchestration, achieving zero safety violations and 94.1% remediation success.

Details

Motivation: 5G/6G network orchestration challenges exceed static automation and Deep Reinforcement Learning capabilities. LLM agents offer intent-based networking but introduce stochastic risks like topology hallucinations and policy non-compliance that need mitigation.

Method: Proposes Graph-Symbolic Policy Enforcement and Control (G-SPEC) with a Governance Triad: telecom-adapted LLM agent (TSLAM-4B), Network Knowledge Graph (NKG) for deterministic verification, and SHACL constraints for policy compliance.

Result: Achieved zero safety violations and 94.1% remediation success rate on 450-node 5G Core simulation, outperforming 82.4% baseline. NKG validation contributed 68% of safety gains, SHACL policies 24%. Validation latency scales as O(k^1.2) with subgraph size k, with 142ms processing overhead.

Conclusion: G-SPEC effectively mitigates LLM stochastic risks in network orchestration through neuro-symbolic approach, making it viable for SMO-layer operations in 5G/6G networks with scalable performance.

Abstract: As networks evolve toward 5G Standalone and 6G, operators face orchestration challenges that exceed the limits of static automation and Deep Reinforcement Learning. Although Large Language Model (LLM) agents offer a path toward intent-based networking, they introduce stochastic risks, including topology hallucinations and policy non-compliance. To mitigate this, we propose Graph-Symbolic Policy Enforcement and Control (G-SPEC), a neuro-symbolic framework that constrains probabilistic planning with deterministic verification. The architecture relies on a Governance Triad - a telecom-adapted agent (TSLAM-4B), a Network Knowledge Graph (NKG), and SHACL constraints. We evaluated G-SPEC on a simulated 450-node 5G Core, achieving zero safety violations and a 94.1% remediation success rate, significantly outperforming the 82.4% baseline. Ablation analysis indicates that NKG validation drives the majority of safety gains (68%), followed by SHACL policies (24%). Scalability tests on topologies ranging from 10K to 100K nodes demonstrate that validation latency scales as $O(k^{1.2})$ where $k$ is subgraph size. With a processing overhead of 142ms, G-SPEC is viable for SMO-layer operations.

[174] ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge

Yuntao Dai, Hang Gu, Teng Wang, Qianyu Cheng, Yifei Zheng, Zhiyong Qiu, Lei Gong, Wenqi Lou, Xuehai Zhou

Main category: cs.AI

TL;DR: ActionFlow is a system-level inference framework that speeds up Vision-Language-Action models on edge devices by 2.55x using cross-request pipelining and memory optimization techniques.

Details

Motivation: Current VLA models suffer from high inference latency (3-5 Hz) on edge devices due to memory-bound autoregressive decoding, while smooth robotic interaction requires 20-30 Hz. Existing optimizations often require extensive retraining or compromise accuracy.

Method: ActionFlow introduces: 1) Cross-Request Pipelining strategy that batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps; 2) Cross-Request State Packed Forward operator; 3) Unified KV Ring Buffer to fuse fragmented memory operations into efficient dense computations.

Result: Achieves 2.55x improvement in FPS on OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware.

Conclusion: ActionFlow bridges the gap between VLA model capabilities and real-time robotic requirements through system-level optimizations that maximize hardware utilization without compromising model accuracy or requiring retraining.

Abstract: Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hin dered by high inference latency. While smooth robotic interaction requires control frequencies of 20 to 30 Hz, current VLA models typi cally operate at only 3-5 Hz on edge devices due to the memory bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge plat forms. At the core of ActionFlow is a Cross-Request Pipelin ing strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55x improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dy namic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.

[175] Synthesizing Procedural Memory: Challenges and Architectures in Automated Workflow Generation

Nishant Gaurav, Adit Akarsh, Ankit Ranjan, Manoj Bajaj

Main category: cs.AI

TL;DR: CodeMem enables LLMs to autonomously synthesize executable code as procedural memory for workflow orchestration, addressing four key bottlenecks in automated skill generation through a scientific methodology.

Details

Motivation: While CodeMem establishes executable code as optimal for agentic procedural memory, there's a gap in how to autonomously synthesize this memory from scratch. The paper aims to transition LLMs from passive tool-users to active workflow architects.

Method: Through a high-fidelity case study of cross-service orchestration (Outlook & OneDrive), the paper identifies and addresses four structural bottlenecks: Discovery Gap (navigating large tool registries), Verification Gap (grounding tool response structures), Decomposition Gap (using Linear State Anchoring instead of inefficient search), and Scaling Gap (concurrency & persistence). A scientific methodology of “hypothesize, probe, and code” is enforced.

Result: The approach demonstrates that agents can autonomously write robust, production-grade code skills by systematically addressing the identified bottlenecks through the scientific methodology.

Conclusion: LLMs can be transformed into active workflow architects capable of autonomously generating executable code skills for procedural memory, overcoming key structural bottlenecks in automated skill generation through a systematic scientific approach.

Abstract: While CodeMem establishes executable code as the optimal representation for agentic procedural memory, the mechanism for autonomously synthesizing this memory from a blank slate remains underexplored. This paper operationalizes the transition of Large Language Models from passive tool-users to active workflow architects. Through a high-fidelity case study of a cross-service orchestration task involving Outlook and OneDrive, we identify and address four structural bottlenecks in automated skill generation: the Discovery Gap involving navigation of large tool registries, the Verification Gap regarding grounding tool response structures, the Decomposition Gap which replaces inefficient search with Linear State Anchoring, and the Scaling Gap focused on concurrency and persistence. We demonstrate that by enforcing a scientific methodology of hypothesize, probe, and code, agents can autonomously write robust, production-grade code skills.

[176] SynCraft: Guiding Large Language Models to Predict Edit Sequences for Molecular Synthesizability Optimization

Junren Li, Luhua Lai

Main category: cs.AI

TL;DR: SynCraft is an LLM-based framework that optimizes molecular synthesizability through precise structural editing rather than sequence translation, outperforming existing methods while preserving structural novelty and pharmacophores.

Details

Motivation: Current generative AI for molecules produces many synthetically inaccessible compounds, and existing solutions (filtering or template-based methods) compromise structural novelty or disrupt key pharmacophores. There's a need for a method that can navigate the "synthesis cliff" where minimal edits yield significant synthesizability gains.

Method: SynCraft reframes synthesizability optimization as a structural editing problem using LLMs’ reasoning capabilities. Instead of generating SMILES strings directly, it predicts executable sequences of atom-level edits, leveraging interaction-aware prompting to replicate expert medicinal chemistry intuition.

Result: Extensive benchmarks show SynCraft outperforms state-of-the-art baselines in generating synthesizable analogs with high structural fidelity. It successfully replicates expert intuition in editing PLK1 inhibitors and rescues high-scoring but previously discarded RIPK1 candidates from previous molecular generation literature.

Conclusion: SynCraft represents a novel approach to molecular synthesizability optimization that leverages LLM reasoning for precise structural editing, overcoming limitations of current methods while maintaining structural novelty and pharmacophore integrity.

Abstract: Generative artificial intelligence has revolutionized the exploration of chemical space, yet a critical bottleneck remains that a substantial fraction of generated molecules is synthetically inaccessible. Current solutions, such as post-hoc filtering or projection-based methods, often compromise structural novelty or disrupt key pharmacophores by forcing molecules into pre-defined synthetic templates. Herein, we introduce SynCraft, a reasoning-based framework that reframes synthesizability optimization not as a sequence translation task, but as a precise structural editing problem. Leveraging the emergent reasoning capabilities of Large Language Models, SynCraft navigates the “synthesis cliff” where minimal structural modifications yield significant gains in synthetic feasibility. By predicting executable sequences of atom-level edits rather than generating SMILES strings directly, SynCraft circumvents the syntactic fragility of LLMs while harnessing their chemical intuition. Extensive benchmarks demonstrate that SynCraft outperforms state-of-the-art baselines in generating synthesizable analogs with high structural fidelity. Furthermore, through interaction-aware prompting, SynCraft successfully replicates expert medicinal chemistry intuition in editing PLK1 inhibitors and rescuing high-scoring but previously discarded RIPK1 candidates in previous molecular generation literatures.

[177] A DeepSeek-Powered AI System for Automated Chest Radiograph Interpretation in Clinical Practice

Yaowei Bai, Ruiheng Zhang, Yu Lei, Xuhua Duan, Jingfeng Yao, Shuguang Ju, Chaoyang Wang, Wei Yao, Yiwan Guo, Guilin Zhang, Chao Wan, Qian Yuan, Lei Chen, Wenjuan Tang, Biqiang Zhu, Xinggang Wang, Tao Sun, Wei Zhou, Dacheng Tao, Yongchao Xu, Chuansheng Zheng, Huangxuan Zhao, Bo Du

Main category: cs.AI

TL;DR: Janus-Pro-CXR is a lightweight chest X-ray interpretation system that outperforms larger models in report generation and clinical deployment, improving workflow efficiency and diagnostic reliability in resource-constrained settings.

Details

Motivation: Addressing the global shortage of radiologists and heavy chest X-ray workloads in primary care, particularly the lack of rigorous prospective clinical validation for existing multimodal large language models in radiology.

Method: Developed Janus-Pro-CXR (1B parameters) based on DeepSeek Janus-Pro model, with domain-specific optimization. Conducted multicenter prospective trial (NCT07117266) with rigorous validation including retrospective evaluation and prospective clinical deployment.

Result: Outperformed state-of-the-art X-ray report generation models including ChatGPT 4o (200B parameters), demonstrated reliable detection of six clinically critical findings. In clinical deployment: improved report quality scores, reduced interpretation time by 18.3% (P<0.001), and was preferred by experts in 54.3% of cases.

Conclusion: Janus-Pro-CXR improves diagnostic reliability and workflow efficiency through lightweight architecture and domain-specific optimization, particularly valuable in resource-constrained settings. The model will be open-sourced to facilitate clinical translation of AI-assisted radiology solutions.

Abstract: A global shortage of radiologists has been exacerbated by the significant volume of chest X-ray workloads, particularly in primary care. Although multimodal large language models show promise, existing evaluations predominantly rely on automated metrics or retrospective analyses, lacking rigorous prospective clinical validation. Janus-Pro-CXR (1B), a chest X-ray interpretation system based on DeepSeek Janus-Pro model, was developed and rigorously validated through a multicenter prospective trial (NCT07117266). Our system outperforms state-of-the-art X-ray report generation models in automated report generation, surpassing even larger-scale models including ChatGPT 4o (200B parameters), while demonstrating reliable detection of six clinically critical radiographic findings. Retrospective evaluation confirms significantly higher report accuracy than Janus-Pro and ChatGPT 4o. In prospective clinical deployment, AI assistance significantly improved report quality scores, reduced interpretation time by 18.3% (P < 0.001), and was preferred by a majority of experts in 54.3% of cases. Through lightweight architecture and domain-specific optimization, Janus-Pro-CXR improves diagnostic reliability and workflow efficiency, particularly in resource-constrained settings. The model architecture and implementation framework will be open-sourced to facilitate the clinical translation of AI-assisted radiology solutions.

[178] Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems

YuChe Hsu, AnJui Wang, TsaiChing Ni, YuanFu Yang

Main category: cs.AI

TL;DR: VLSM unifies visual and textual understanding to generate executable simulation code from layout sketches and natural language prompts, with new dataset and evaluation metrics for generative digital twins.

Details

Motivation: To enable cross-modal reasoning for industrial simulation systems by integrating visual reasoning and language understanding into executable simulation code generation.

Method: Proposes Vision-Language Simulation Model (VLSM) that synthesizes executable FlexScript from layout sketches and natural-language prompts. Creates first large-scale dataset with 120,000+ prompt-sketch-code triplets for multimodal learning. Introduces three novel evaluation metrics: Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR).

Result: Models achieve near-perfect structural accuracy and high execution robustness through systematic ablation studies across vision encoders, connectors, and code-pretrained language backbones.

Conclusion: Establishes foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems.

Abstract: We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript from layout sketches and natural-language prompts, enabling cross-modal reasoning for industrial simulation systems. To support this new paradigm, the study constructs the first large-scale dataset for generative digital twins, comprising over 120,000 prompt-sketch-code triplets that enable multimodal learning between textual descriptions, spatial structures, and simulation logic. In parallel, three novel evaluation metrics, Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), are proposed specifically for this task to comprehensively evaluate structural integrity, parameter fidelity, and simulator executability. Through systematic ablation across vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness. This work establishes a foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems.

[179] Bohrium + SciMaster: Building the Infrastructure and Ecosystem for Agentic Science at Scale

Linfeng Zhang, Siheng Chen, Yuzhu Cai, Jingyi Chai, Junhan Chang, Kun Chen, Zhi X. Chen, Zhaohan Ding, Yuwen Du, Yuanpeng Gao, Yuan Gao, Jing Gao, Zhifeng Gao, Qiangqiang Gu, Yanhui Hong, Yuan Huang, Xi Fang, Xiaohong Ji, Guolin Ke, Zixing Lei, Xinyu Li, Yongge Li, Ruoxue Liao, Hang Lin, Xiaolu Lin, Yuxiang Liu, Xinzijian Liu, Zexi Liu, Jintan Lu, Tingjia Miao, Haohui Que, Weijie Sun, Yanfeng Wang, Bingyang Wu, Tianju Xue, Rui Ye, Jinzhe Zeng, Duo Zhang, Jiahui Zhang, Linfeng Zhang, Tianhan Zhang, Wenchang Zhang, Yuzhi Zhang, Zezhong Zhang, Hang Zheng, Hui Zhou, Tong Zhu, Xinyu Zhu, Qingguo Zhou, Weinan E

Main category: cs.AI

TL;DR: Bohrium+SciMaster is an infrastructure-and-ecosystem approach to scale AI agentic science by providing traceable AI4S assets and workflow orchestration, enabling reusable scientific agents that dramatically reduce scientific cycle times.

Details

Motivation: AI agents are enabling multi-step scientific workflows, but scaling agentic science faces challenges: workflows are hard to observe/reproduce, tools aren't agent-ready, execution lacks traceability, and prototype systems are bespoke, limiting reuse and systematic improvement.

Method: Bohrium+SciMaster stack: Bohrium serves as a managed hub for AI4S assets (like HuggingFace for AI for Science), turning scientific data/software/compute/lab systems into agent-ready capabilities. SciMaster orchestrates these into long-horizon workflows. A scientific intelligence substrate organizes reusable models/knowledge/components into executable building blocks.

Result: Demonstrated with eleven representative master agents in real workflows, achieving orders-of-magnitude reductions in end-to-end scientific cycle time and generating execution-grounded signals from real workloads at multi-million scale.

Conclusion: Scaling agentic science requires an infrastructure-and-ecosystem approach, realized through Bohrium+SciMaster, which enables composition, auditability, and improvement of scientific workflows through reusable AI agents and traceable execution.

Abstract: AI agents are emerging as a practical way to run multi-step scientific workflows that interleave reasoning with tool use and verification, pointing to a shift from isolated AI-assisted steps toward \emph{agentic science at scale}. This shift is increasingly feasible, as scientific tools and models can be invoked through stable interfaces and verified with recorded execution traces, and increasingly necessary, as AI accelerates scientific output and stresses the peer-review and publication pipeline, raising the bar for traceability and credible evaluation. However, scaling agentic science remains difficult: workflows are hard to observe and reproduce; many tools and laboratory systems are not agent-ready; execution is hard to trace and govern; and prototype AI Scientist systems are often bespoke, limiting reuse and systematic improvement from real workflow signals. We argue that scaling agentic science requires an infrastructure-and-ecosystem approach, instantiated in Bohrium+SciMaster. Bohrium acts as a managed, traceable hub for AI4S assets – akin to a HuggingFace of AI for Science – that turns diverse scientific data, software, compute, and laboratory systems into agent-ready capabilities. SciMaster orchestrates these capabilities into long-horizon scientific workflows, on which scientific agents can be composed and executed. Between infrastructure and orchestration, a \emph{scientific intelligence substrate} organizes reusable models, knowledge, and components into executable building blocks for workflow reasoning and action, enabling composition, auditability, and improvement through use. We demonstrate this stack with eleven representative master agents in real workflows, achieving orders-of-magnitude reductions in end-to-end scientific cycle time and generating execution-grounded signals from real workloads at multi-million scale.

[180] Automated stereotactic radiosurgery planning using a human-in-the-loop reasoning large language model agent

Humza Nusrat, Luke Francisco, Bing Luo, Hassan Bagher-Ebadian, Joshua Kim, Karen Chin-Snyder, Salim Siddiqui, Mira Shah, Eric Mellon, Mohammad Ghassemi, Anthony Doemer, Benjamin Movsas, Kundan Thind

Main category: cs.AI

TL;DR: Chain-of-thought reasoning in LLM-based SRS planning agent improves plan quality and transparency while matching human performance on key metrics.

Details

Motivation: Black-box AI systems for stereotactic radiosurgery have limited clinical adoption due to opacity concerns, creating a need for transparent, explainable automated planning systems.

Method: Developed SAGE (Secure Agent for Generative Dose Expertise), an LLM-based planning agent with two variants: non-reasoning and reasoning models. Tested on retrospective cohort of 41 brain metastasis patients treated with 18 Gy single-fraction SRS, comparing plan quality against human planners.

Result: Reasoning model achieved comparable dosimetry to human planners on primary endpoints (PTV coverage, max dose, conformity/gradient indices; all p>0.21) while reducing cochlear dose below human baselines (p=0.022). Reasoning model demonstrated systematic planning behaviors including constraint verification (457 instances) and trade-off deliberation (609 instances), while standard model showed minimal deliberation.

Conclusion: Chain-of-thought reasoning enables transparent, auditable automated SRS planning with human-level performance and improved organ sparing, addressing clinical adoption barriers through explainable AI.

Abstract: Stereotactic radiosurgery (SRS) demands precise dose shaping around critical structures, yet black-box AI systems have limited clinical adoption due to opacity concerns. We tested whether chain-of-thought reasoning improves agentic planning in a retrospective cohort of 41 patients with brain metastases treated with 18 Gy single-fraction SRS. We developed SAGE (Secure Agent for Generative Dose Expertise), an LLM-based planning agent for automated SRS treatment planning. Two variants generated plans for each case: one using a non-reasoning model, one using a reasoning model. The reasoning variant showed comparable plan dosimetry relative to human planners on primary endpoints (PTV coverage, maximum dose, conformity index, gradient index; all p > 0.21) while reducing cochlear dose below human baselines (p = 0.022). When prompted to improve conformity, the reasoning model demonstrated systematic planning behaviors including prospective constraint verification (457 instances) and trade-off deliberation (609 instances), while the standard model exhibited none of these deliberative processes (0 and 7 instances, respectively). Content analysis revealed that constraint verification and causal explanation concentrated in the reasoning agent. The optimization traces serve as auditable logs, offering a path toward transparent automated planning.

[181] Benchmarking LLMs for Predictive Applications in the Intensive Care Units

Chehak Malhotra, Mehak Gopal, Akshaya Devadiga, Pradeep Singh, Ridam Pal, Ritwik Kashyap, Tavpritesh Sethi

Main category: cs.AI

TL;DR: LLMs (GatorTron, Llama, Mistral) show comparable performance to SLMs (BioBERT, DocBERT, etc.) for shock prediction in ICU patients, suggesting LLMs aren’t inherently superior for clinical event prediction despite their NLP capabilities.

Details

Motivation: While LLMs have transformed many NLP tasks, their application in clinical predictive tasks remains under-researched. Timely prediction of shock in critically ill patients could enable early interventions and improve patient outcomes.

Method: Compared LLMs (GatorTron-Base, Llama 8B, Mistral 7B) against SLMs (BioBERT, DocBERT, BioClinicalBERT, Word2Vec, Doc2Vec) for shock prediction using MIMIC III data. Analyzed 17,294 ICU stays, scoring for length of stay >24h and shock index >0.7, resulting in 355 normal and 87 abnormal cases. Used both focal and cross-entropy losses to address class imbalance during fine-tuning.

Result: GatorTron-Base achieved highest weighted recall of 80.5%, but overall performance metrics were comparable between SLMs and LLMs. LLMs did not demonstrate inherent superiority over SLMs for predicting clinical events.

Conclusion: LLMs are not inherently superior to SLMs for clinical event prediction. Future LLM development should focus on predicting clinical trajectories rather than simpler tasks like named entity recognition or phenotyping to achieve meaningful clinical outcomes.

Abstract: With the advent of LLMs, various tasks across the natural language processing domain have been transformed. However, their application in predictive tasks remains less researched. This study compares large language models, including GatorTron-Base (trained on clinical data), Llama 8B, and Mistral 7B, against models like BioBERT, DocBERT, BioClinicalBERT, Word2Vec, and Doc2Vec, setting benchmarks for predicting Shock in critically ill patients. Timely prediction of shock can enable early interventions, thus improving patient outcomes. Text data from 17,294 ICU stays of patients in the MIMIC III database were scored for length of stay > 24 hours and shock index (SI) > 0.7 to yield 355 and 87 patients with normal and abnormal SI-index, respectively. Both focal and cross-entropy losses were used during finetuning to address class imbalances. Our findings indicate that while GatorTron Base achieved the highest weighted recall of 80.5%, the overall performance metrics were comparable between SLMs and LLMs. This suggests that LLMs are not inherently superior to SLMs in predicting future clinical events despite their strong performance on text-based tasks. To achieve meaningful clinical outcomes, future efforts in training LLMs should prioritize developing models capable of predicting clinical trajectories rather than focusing on simpler tasks such as named entity recognition or phenotyping.

[182] Advancing Multimodal Teacher Sentiment Analysis:The Large-Scale T-MED Dataset & The Effective AAM-TSA Model

Zhiyi Duan, Xiangren Wang, Hongyu Yuan, Qianli Xing

Main category: cs.AI

TL;DR: This paper introduces T-MED, the first large-scale multimodal teacher sentiment analysis dataset, and AAM-TSA, a novel asymmetric attention-based model that outperforms existing methods for analyzing teacher emotions in educational settings.

Details

Motivation: Existing studies fail to accurately capture teachers' emotions due to their performative nature and overlook the impact of instructional information on emotional expression, despite teachers' emotional states being critical for teaching efficacy, student engagement, and learning achievements.

Method: 1) Constructed T-MED dataset with 14,938 instances from 250 real classrooms across 11 subjects (K-12 to higher education) using human-machine collaborative labeling; 2) Proposed AAM-TSA model with asymmetric attention mechanism and hierarchical gating unit for differentiated cross-modal feature fusion and precise emotional classification.

Result: AAM-TSA significantly outperforms existing state-of-the-art methods in terms of accuracy and interpretability on the T-MED dataset.

Conclusion: The paper successfully addresses the gap in teacher sentiment analysis by creating a comprehensive multimodal dataset and developing an effective model that accounts for the performative nature of teaching and instructional context, advancing the field of educational emotion analysis.

Abstract: Teachers’ emotional states are critical in educational scenarios, profoundly impacting teaching efficacy, student engagement, and learning achievements. However, existing studies often fail to accurately capture teachers’ emotions due to the performative nature and overlook the critical impact of instructional information on emotional expression.In this paper, we systematically investigate teacher sentiment analysis by building both the dataset and the model accordingly. We construct the first large-scale teacher multimodal sentiment analysis dataset, T-MED.To ensure labeling accuracy and efficiency, we employ a human-machine collaborative labeling process.The T-MED dataset includes 14,938 instances of teacher emotional data from 250 real classrooms across 11 subjects ranging from K-12 to higher education, integrating multimodal text, audio, video, and instructional information.Furthermore, we propose a novel asymmetric attention-based multimodal teacher sentiment analysis model, AAM-TSA.AAM-TSA introduces an asymmetric attention mechanism and hierarchical gating unit to enable differentiated cross-modal feature fusion and precise emotional classification. Experimental results demonstrate that AAM-TSA significantly outperforms existing state-of-the-art methods in terms of accuracy and interpretability on the T-MED dataset.

[183] External Hippocampus: Topological Cognitive Maps for Guiding Large Language Model Reasoning

Jian Yan

Main category: cs.AI

TL;DR: The External Hippocampus framework models LLM reasoning as energy flow in semantic space, using topological cognitive maps for navigation and intervention without additional training, solving cognitive deadlock in small models.

Details

Motivation: To address the cognitive deadlock problem in multi-step reasoning for small language models (≤7B parameters) without the computational overhead of traditional weight-space optimization methods.

Method: Constructs topological cognitive maps through dimensionality reduction projection to model reasoning as information energy flow in semantic space, enabling precise navigation and intervention at test time with temperature perturbations to restart energy flow.

Result: Achieves 81.20% accuracy on 500 challenging problems (relative baseline +16.80%), reduces reasoning time by ≥15x, identifies reasoning stagnation as “Cognitive Vortex” and low-entropy potential wells, and shows temperature perturbations effectively restart energy flow.

Conclusion: The framework provides an efficient, controllable, topological-aware solution for small model reasoning that requires no additional training, has autonomous growth capability, and effectively solves cognitive deadlock through predictable intervention patterns.

Abstract: This paper proposes the External Hippocampus framework, which models language model reasoning from a cognitive dynamics perspective as the flow of information energy in semantic space. Unlike traditional weight-space optimization methods, this framework constructs topological cognitive maps through dimensionality reduction projection, enabling precise navigation and intervention of energy flow at test time while avoiding substantial computational requirements and demonstrating predictable intervention patterns. The method effectively addresses the cognitive deadlock problem in multi-step reasoning for small models. Experiments on models <=7B parameters show: map-guided methods achieve 81.20% accuracy on 500 challenging problems (relative baseline +16.80%), reduce reasoning time by >= 15x, with key findings revealing that reasoning stagnation manifests as “Cognitive Vortex” and low-entropy potential wells, while temperature perturbations effectively restart energy flow. The framework requires no additional training, possesses autonomous growth capability, and provides an efficient and controllable topological-aware solution for small model reasoning.

[184] Computational Basis of LLM’s Decision Making in Social Simulation

Ji Ma

Main category: cs.AI

TL;DR: This paper proposes methods to probe, quantify, and modify how social concepts (like gender) are encoded in LLMs’ internal representations, using a Dictator Game to study fairness and prosocial behavior.

Details

Motivation: LLMs are increasingly used as human-like decision-making agents in social science and applied settings, often assigned human-like characters in real-life contexts. However, how these characters and contexts shape LLM behavior remains underexplored, creating a need to understand how social concepts are encoded in these models.

Method: The study proposes extracting “vectors of variable variations” (e.g., “male” to “female”) from LLMs’ internal states during a Dictator Game experiment. These vectors are then manipulated during model inference to alter how social variables relate to decision-making, providing a principled way to study concept encoding.

Result: Manipulating these extracted vectors can substantially alter how social variables (like gender) relate to the model’s decision-making in fairness and prosocial behavior scenarios, demonstrating that social concepts can be systematically engineered within transformer models.

Conclusion: This approach offers a framework for studying and regulating how social concepts are encoded in LLMs, with implications for AI alignment, debiasing, and designing AI agents for social simulations. The methods strengthen sociological theory and measurement by providing tools to understand and engineer social concept representations in AI systems.

Abstract: Large language models (LLMs) increasingly serve as human-like decision-making agents in social science and applied settings. These LLM-agents are typically assigned human-like characters and placed in real-life contexts. However, how these characters and contexts shape an LLM’s behavior remains underexplored. This study proposes and tests methods for probing, quantifying, and modifying an LLM’s internal representations in a Dictator Game, a classic behavioral experiment on fairness and prosocial behavior. We extract vectors of variable variations'' (e.g., male’’ to ``female’’) from the LLM’s internal state. Manipulating these vectors during the model’s inference can substantially alter how those variables relate to the model’s decision-making. This approach offers a principled way to study and regulate how social concepts can be encoded and engineered within transformer-based models, with implications for alignment, debiasing, and designing AI agents for social simulations in both academic and commercial applications, strengthening sociological theory and measurement.

[185] Deep Research Comparator: A Platform For Fine-grained Human Annotations of Deep Research Agents

Prahaladh Chandrahasan, Jiahe Jin, Zhihan Zhang, Tevin Wang, Andy Tang, Lucy Mo, Morteza Ziyadi, Leonardo F. R. Ribeiro, Zimeng Qiu, Markus Dreyer, Akari Asai, Chenyan Xiong

Main category: cs.AI

TL;DR: Deep Research Comparator is a platform for evaluating deep research agents through side-by-side comparison of final reports and intermediate steps, with fine-grained human feedback collection and ranking calculation.

Details

Motivation: There's a major challenge in effectively evaluating deep research agents that autonomously search the web, analyze information, and generate reports, especially for assessing long reports and providing detailed feedback on intermediate steps.

Method: The platform offers a holistic framework for deep research agent hosting, side-by-side comparison, fine-grained human feedback collection, and ranking calculation. It displays final reports from two different agents along with their intermediate steps, allowing annotators to evaluate overall quality and provide detailed feedback on specific steps or text spans. Also developed Simple Deepresearch as an end-to-end agent scaffold for easy integration of various LLMs.

Result: Collected real user preference data from 17 annotators on three deep research agents to demonstrate the platform’s utility for deep research agent development.

Conclusion: The Deep Research Comparator platform addresses evaluation gaps for deep research agents by providing comprehensive comparison and feedback mechanisms, with potential to facilitate better agent development through systematic evaluation.

Abstract: Effectively evaluating deep research agents that autonomously search the web, analyze information, and generate reports remains a major challenge, particularly when it comes to assessing long reports and giving detailed feedback on their intermediate steps. To address these gaps, we introduce Deep Research Comparator, a platform that offers a holistic framework for deep research agent hosting, side-by-side comparison, fine-grained human feedback collection, and ranking calculation. Given a user query, our platform displays the final reports from two different agents along with their intermediate steps during generation. Annotators can evaluate the overall quality of final reports based on side-by-side comparison, and also provide detailed feedback separately by assessing intermediate steps or specific text spans within the final report. Furthermore, we develop Simple Deepresearch, an end-to-end agent scaffold. This scaffold serves as a baseline that facilitates the easy integration of various large language models to transform them into deep research agents for evaluation. To demonstrate the platform’s utility for deep research agent development, we have collected real user preference data from 17 annotators on three deep research agents. A demo video of our platform can be found at https://www.youtube.com/watch?v=g4d2dnbdseg.

[186] Trust Semantics Distillation for Collaborator Selection via Memory-Augmented Agentic AI

Botao Zhu, Jeslyn Wang, Dusit Niyato, Xianbin Wang

Main category: cs.AI

TL;DR: Proposes a task-specific trust semantics distillation model using LAM-enabled teacher-student architecture to reduce overhead in collaborative computing trust evaluation.

Details

Motivation: Resource-constrained devices need to offload tasks to peers, requiring trust evaluation. Independent assessment by each device causes significant overhead from frequent data exchange and complex reasoning, degrading timeliness.

Method: Uses LAM-enabled teacher-student agent architecture. Teacher agent on powerful server performs multidimensional trust data collection, task-specific trust semantics extraction, and task-collaborator matching. Student agents on devices receive distilled trust semantics for rapid collaborator selection.

Result: Experimental results show reduced collaborator evaluation time, decreased device resource consumption, and improved accuracy of collaborator selection.

Conclusion: The proposed TSD model effectively addresses trust evaluation overhead in collaborative computing by leveraging server-side intelligence to distill task-specific trust semantics for efficient device-side decision making.

Abstract: Offloading computational tasks from resource-constrained devices to resource-abundant peers constitutes a critical paradigm for collaborative computing. Within this context, accurate trust evaluation of potential collaborating devices is essential for the effective execution of complex computing tasks. This trust evaluation process involves collecting diverse trust-related information from every potential collaborator and performing trust inference based on the collected data. However, when each resource-constrained device independently assesses all potential collaborators, frequent data exchange and complex reasoning can incur significant overhead and further degrade the timeliness of trust evaluation. To overcome these challenges, we propose a task-specific trust semantics distillation (TSD) model based on a large AI model (LAM)-enabled teacher-student agent architecture. Specifically, the teacher agent is deployed on a server with powerful computational capabilities and an augmented memory module to perform multidimensional trust-related data collection, task-specific trust semantics extraction, and task-collaborator matching analysis. Upon receiving task-specific evaluation requests from device-side student agents, the teacher agent transfers the trust semantics of potential collaborators to the student agents, enabling rapid and accurate collaborator selection. Experimental results demonstrate that the proposed TSD model can reduce collaborator evaluation time, decrease device resource consumption, and improve the accuracy of collaborator selection.

[187] Explaining Tournament Solutions with Minimal Supports

Clément Contet, Umberto Grandi, Jérôme Mengin

Main category: cs.AI

TL;DR: This paper studies certified explanations for tournament winners by identifying minimal sub-tournaments (minimal supports) where a candidate is guaranteed to win regardless of how the rest of the tournament is completed.

Details

Motivation: Tournaments model pairwise dominance relationships, but explaining why a particular candidate wins under various tournament rules is challenging. The paper aims to provide formal, certified explanations for tournament winners that are compact, intuitive, and mathematically rigorous, addressing a central question in explainable AI: "Why does the winner win the tournament?"

Method: The authors identify minimal supports - minimal sub-tournaments where a candidate is guaranteed to be a necessary winner. They analyze six common tournament solutions: top cycle, uncovered set, Copeland rule, Borda rule, maximin rule, and weighted uncovered set. For each rule, they determine the size of smallest minimal supports and develop polynomial-time algorithms to compute them (except for weighted uncovered set).

Result: The paper provides theoretical bounds on minimal support sizes for each tournament solution and presents efficient algorithms to compute them for all solutions except the weighted uncovered set. For the weighted uncovered set, they prove the problem is NP-complete. They demonstrate how minimal supports can produce compact, certified, and intuitive explanations for tournament outcomes.

Conclusion: Minimal supports provide a formal framework for generating certified explanations of tournament winners, offering a rigorous approach to explainable AI in tournament settings. The results show that efficient explanation generation is possible for most common tournament rules, though computational complexity varies across different solution concepts.

Abstract: Tournaments are widely used models to represent pairwise dominance between candidates, alternatives, or teams. We study the problem of providing certified explanations for why a candidate appears among the winners under various tournament rules. To this end, we identify minimal supports, minimal sub-tournaments in which the candidate is guaranteed to win regardless of how the rest of the tournament is completed (that is, the candidate is a necessary winner of the sub-tournament). This notion corresponds to an abductive explanation for the question,“Why does the winner win the tournament?”, a central concept in formal explainable AI. We focus on common tournament solutions: the top cycle, the uncovered set, the Copeland rule, the Borda rule, the maximin rule, and the weighted uncovered set. For each rule we determine the size of the smallest minimal supports, and we present polynomial-time algorithms to compute them for all solutions except for the weighted uncovered set, for which the problem is NP-complete. Finally, we show how minimal supports can serve to produce compact, certified, and intuitive explanations for tournament solutions.

[188] Similarity Field Theory: A Mathematical Framework for Intelligence

Kei-Sing Ng

Main category: cs.AI

TL;DR: Similarity Field Theory formalizes systems through similarity relations and their evolution, defining intelligence as generating entities that maintain similarity to concepts.

Details

Motivation: To provide a mathematical foundation for understanding dynamic systems through similarity relations, reframing intelligence and interpretability as geometric problems rather than purely statistical ones.

Method: Introduces Similarity Field Theory with: (1) similarity field S over entities, (2) system evolution sequences, (3) concepts as fibers of similarity, (4) generative operator G. Formalizes intelligence as generating entities that stay within concept fibers.

Result: Two theorems: (i) asymmetry blocks mutual inclusion; (ii) stability implies either anchor coordinate or asymptotic confinement to target level. Framework provides language for characterizing intelligent systems.

Conclusion: Similarity Field Theory offers foundational language for characterizing, comparing, and constructing intelligent systems, reframing intelligence as geometric problems on similarity fields rather than statistical ones.

Abstract: We posit that persisting and transforming similarity relations form the structural basis of any comprehensible dynamic system. This paper introduces Similarity Field Theory, a mathematical framework that formalizes the principles governing similarity values among entities and their evolution. We define: (1) a similarity field $S: U \times U \to [0,1]$ over a universe of entities $U$, satisfying reflexivity $S(E,E)=1$ and treated as a directed relational field (asymmetry and non-transitivity are allowed); (2) the evolution of a system through a sequence $Z_p=(X_p,S^{(p)})$ indexed by $p=0,1,2,\ldots$; (3) concepts $K$ as entities that induce fibers $F_α(K)={E\in U \mid S(E,K)\ge α}$, i.e., superlevel sets of the unary map $S_K(E):=S(E,K)$; and (4) a generative operator $G$ that produces new entities. Within this framework, we formalize a generative definition of intelligence: an operator $G$ is intelligent with respect to a concept $K$ if, given a system containing entities belonging to the fiber of $K$, it generates new entities that also belong to that fiber. Similarity Field Theory thus offers a foundational language for characterizing, comparing, and constructing intelligent systems. At a high level, this framework reframes intelligence and interpretability as geometric problems on similarity fields–preserving and composing level-set fibers–rather than purely statistical ones. We prove two theorems: (i) asymmetry blocks mutual inclusion; and (ii) stability implies either an anchor coordinate or asymptotic confinement to the target level (up to arbitrarily small tolerance). Together, these results constrain similarity-field evolution and motivate an interpretive lens that can be applied to large language models.

[189] cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution

Jinwu Chen, Qidie Wu, Bin Li, Lin Ma, Xin Si, Yang Hu, Shouyi Yin, Jun Yang

Main category: cs.AI

TL;DR: cuPilot is a multi-agent framework that uses strategy-coordinated evolution to automatically optimize CUDA kernels, achieving 3.09× speedup over PyTorch on average.

Details

Motivation: CUDA kernel optimization is difficult due to hardware-software co-design expertise requirements and proprietary nature of high-performance libraries. Existing LLM-based approaches with evolutionary algorithms have suboptimal agent designs and mismatched evolution representations.

Method: Proposes cuPilot: a strategy-coordinated multi-agent framework that introduces strategy as intermediate semantic representation for kernel evolution. Includes strategy-coordinated evolution algorithm, roofline-guided prompting, and strategy-level population initialization.

Result: Generated kernels achieve average 3.09× speedup over PyTorch on 100-kernel benchmark. On GEMM tasks, shows sophisticated optimizations and high utilization of critical hardware units.

Conclusion: cuPilot effectively addresses representation mismatches in kernel optimization through strategy-coordinated evolution, demonstrating significant performance improvements and open-sourcing the generated kernels.

Abstract: Optimizing CUDA kernels is a challenging and labor-intensive task, given the need for hardware-software co-design expertise and the proprietary nature of high-performance kernel libraries. While recent large language models (LLMs) combined with evolutionary algorithms show promise in automatic kernel optimization, existing approaches often fall short in performance due to their suboptimal agent designs and mismatched evolution representations. This work identifies these mismatches and proposes cuPilot, a strategy-coordinated multi-agent framework that introduces strategy as an intermediate semantic representation for kernel evolution. Key contributions include a strategy-coordinated evolution algorithm, roofline-guided prompting, and strategy-level population initialization. Experimental results show that the generated kernels by cuPilot achieve an average speed up of 3.09$\times$ over PyTorch on a benchmark of 100 kernels. On the GEMM tasks, cuPilot showcases sophisticated optimizations and achieves high utilization of critical hardware units. The generated kernels are open-sourced at https://github.com/champloo2878/cuPilot-Kernels.git.

[190] Scaling Laws for Energy Efficiency of Local LLMs

Ander Alvarez, Alessandro Genuardi, Nilotpal Sinha, Antonio Tiene, Mikail Okyay, Bakbergen Ryskulov, David Montero, Samuel Mugel, Román Orús

Main category: cs.AI

TL;DR: CPU-only inference scaling laws for local LLMs/VLMs: token length scales linearly, VLMs have resolution “knees”, and quantum-inspired compression reduces compute/memory by up to 72%.

Details

Motivation: Most consumer hardware relies on CPUs for AI deployment, but computational laws for CPU-only inference of local language and vision-language models remain unexplored, creating a gap in understanding how to balance accuracy with computational/energy constraints on edge devices.

Method: Systematic benchmarking of LLMs and VLMs on two CPU tiers (MacBook Pro M2 and Raspberry Pi 5) using continuous sampling of processor/memory usage with area-under-curve integration to characterize computational scaling with input text length and image resolution.

Result: Two empirical scaling laws: (1) LLM inference cost scales linearly with token length; (2) VLMs exhibit a preprocessing-driven “resolution knee” where compute remains constant above internal resolution clamp. Quantum-inspired compression reduces processor/memory usage by up to 71.9% and energy by 62% while preserving accuracy.

Conclusion: Provides systematic quantification of multimodal CPU-only scaling for local workloads, identifying model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference on consumer hardware.

Abstract: Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware–including laptops, desktops, industrial controllers, and embedded systems–relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven “resolution knee”, where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.

[191] ScoutGPT: Capturing Player Impact from Team Action Sequences Using GPT-Based Framework

Miru Hong, Minho Lee, Geonhee Jo, Jae-Hee So, Pascal Bauer, Sang-Ki Ko

Main category: cs.AI

TL;DR: EventGPT: A GPT-style transformer model for football transfer analysis that predicts next events and simulates how players would perform in different teams using counterfactual simulations.

Details

Motivation: Current football transfer evaluation methods rely on static statistics or post-hoc value models that fail to capture how players adapt to new tactical environments and teammates, making transfer success difficult to predict.

Method: EventGPT is a player-conditioned, value-aware next-event prediction model using GPT-style autoregressive transformer. It treats match play as discrete token sequences, predicting next action type, location, timing, and residual On-Ball Value (rOBV). Key innovation: counterfactual simulations by substituting player embeddings into new event sequences.

Result: Outperforms existing sequence-based baselines in next-event prediction accuracy and spatial precision on five seasons of Premier League data. Case studies demonstrate practical utility for transfer analysis, comparing striker performance across systems and identifying stylistic replacements.

Conclusion: EventGPT provides a principled method for evaluating transfer fit by simulating how players’ behavioral distribution and value profile would change in different tactical environments, addressing the context-dependence problem in football transfer analysis.

Abstract: Transfers play a pivotal role in shaping a football club’s success, yet forecasting whether a transfer will succeed remains difficult due to the strong context-dependence of on-field performance. Existing evaluation practices often rely on static summary statistics or post-hoc value models, which fail to capture how a player’s contribution adapts to a new tactical environment or different teammates. To address this gap, we introduce EventGPT, a player-conditioned, value-aware next-event prediction model built on a GPT-style autoregressive transformer. Our model treats match play as a sequence of discrete tokens, jointly learning to predict the next on-ball action’s type, location, timing, and its estimated residual On-Ball Value (rOBV) based on the preceding context and player identity. A key contribution of this framework is the ability to perform counterfactual simulations. By substituting learned player embeddings into new event sequences, we can simulate how a player’s behavioral distribution and value profile would change when placed in a different team or tactical structure. Evaluated on five seasons of Premier League event data, EventGPT outperforms existing sequence-based baselines in next-event prediction accuracy and spatial precision. Furthermore, we demonstrate the model’s practical utility for transfer analysis through case studies-such as comparing striker performance across different systems and identifying stylistic replacements for specific roles-showing that our approach provides a principled method for evaluating transfer fit.

[192] Dialectics for Artificial Intelligence

Zhengmian Hu

Main category: cs.AI

TL;DR: AI can discover human-like concepts from raw experience without supervision by treating concepts as information structures defined through their relation to experience, with concepts evolving through dialectical optimization.

Details

Motivation: To determine if AI can autonomously discover concepts similar to human concepts, given that human concepts are fluid and evolve over time (e.g., Pluto's planetary status). Need a definition of "concept" that's not just a label but a revisable, comparable structure.

Method: Propose algorithmic-information viewpoint: concepts as information objects defined through structural relation to agent’s total experience. Core constraint is determination via reversible consistency relations. Define excess information to measure redundancy overhead. Formulate dialectics as optimization dynamics where concepts compete to explain new information via shorter descriptions. Formalize concept transmission using small seeds for multi-agent alignment.

Result: Provides a theoretical framework for concept discovery and evolution based on information theory, enabling systematic concept expansion, contraction, splitting, and merging through dialectical optimization. Enables low-cost concept transmission between agents.

Conclusion: AI can potentially discover human-like concepts from raw experience by treating concepts as information structures with reversible consistency relations, evolving through dialectical optimization, and enabling efficient multi-agent alignment through shared protocols.

Abstract: Can artificial intelligence discover, from raw experience and without human supervision, concepts that humans have discovered? One challenge is that human concepts themselves are fluid: conceptual boundaries can shift, split, and merge as inquiry progresses (e.g., Pluto is no longer considered a planet). To make progress, we need a definition of “concept” that is not merely a dictionary label, but a structure that can be revised, compared, and aligned across agents. We propose an algorithmic-information viewpoint that treats a concept as an information object defined only through its structural relation to an agent’s total experience. The core constraint is determination: a set of parts forms a reversible consistency relation if any missing part is recoverable from the others (up to the standard logarithmic slack in Kolmogorov-style identities). This reversibility prevents “concepts” from floating free of experience and turns concept existence into a checkable structural claim. To judge whether a decomposition is natural, we define excess information, measuring the redundancy overhead introduced by splitting experience into multiple separately described parts. On top of these definitions, we formulate dialectics as an optimization dynamics: as new patches of information appear (or become contested), competing concepts bid to explain them via shorter conditional descriptions, driving systematic expansion, contraction, splitting, and merging. Finally, we formalize low-cost concept transmission and multi-agent alignment using small grounds/seeds that allow another agent to reconstruct the same concept under a shared protocol, making communication a concrete compute-bits trade-off.

Yosuke Taniuchi, Chie Hieida, Atsushi Noritake, Kazushi Ikeda, Masaki Isoda

Main category: cs.AI

TL;DR: Monkeys use objective reward differences rather than inferring others’ subjective valuations during social comparison, as shown by computational modeling of primate social cognition.

Details

Motivation: To understand how primates process social information during reward evaluation - specifically whether they recognize objective reward differences or infer others' subjective valuations.

Method: Developed three computational models (IPM, NCM, ECM) with varying social information processing, trained on monkey behavior data using multi-layered multimodal latent Dirichlet allocation, and evaluated classification performance across experimental conditions.

Result: The External Comparison Model (ECM) achieved highest classification score (Rand Index 0.88 vs 0.79 for IPM), indicating social comparison relies on objective reward differences rather than subjective state inferences.

Conclusion: Primate social comparison is based on objective reward differences rather than inferences about others’ subjective valuations, suggesting a simpler computational mechanism than previously thought.

Abstract: Social comparison$\unicode{x2014}$the process of evaluating one’s rewards relative to others$\unicode{x2014}$plays a fundamental role in primate social cognition. However, it remains unknown from a computational perspective how information about others’ rewards affects the evaluation of one’s own reward. With a constructive approach, this study examines whether monkeys merely recognize objective reward differences or, instead, infer others’ subjective reward valuations. We developed three computational models with varying degrees of social information processing: an Internal Prediction Model (IPM), which infers the partner’s subjective values; a No Comparison Model (NCM), which disregards partner information; and an External Comparison Model (ECM), which directly incorporates the partner’s objective rewards. To test model performance, we used a multi-layered, multimodal latent Dirichlet allocation. We trained the models on a dataset containing the behavior of a pair of monkeys, their rewards, and the conditioned stimuli. Then, we evaluated the models’ ability to classify subjective values across pre-defined experimental conditions. The ECM achieved the highest classification score in the Rand Index (0.88 vs. 0.79 for the IPM) under our settings, suggesting that social comparison relies on objective reward differences rather than inferences about subjective states.

cs.SD

[194] Spectral or spatial? Leveraging both for speaker extraction in challenging data conditions

Aviad Eisenberg, Sharon Gannot, Shlomo E. Chazan

Main category: cs.SD

TL;DR: Robust multi-channel speaker extraction algorithm that integrates both spatial and spectral cues to handle inaccurate reference information, using dynamic balancing of features for stability.

Details

Motivation: Existing speaker extraction methods often rely on either spatial or spectral cues alone, making them vulnerable when reference information is inaccurate. There's a need for a more robust approach that can handle unreliable cues.

Method: Proposes a multi-channel speaker extraction system that integrates both spatial (DOA) and spectral cues. Uses a dedicated network trained to dynamically balance contributions from both features, or disregard less informative ones when necessary.

Result: Experimental evaluation under challenging conditions with simulated inference-time errors shows the model successfully extracts desired speakers even with substantial reference inaccuracies.

Conclusion: The proposed integration of spatial and spectral cues with dynamic feature balancing creates a robust speaker extraction system that maintains performance despite unreliable reference information.

Abstract: This paper presents a robust multi-channel speaker extraction algorithm designed to handle inaccuracies in reference information. While existing approaches often rely solely on either spatial or spectral cues to identify the target speaker, our method integrates both sources of information to enhance robustness. A key aspect of our approach is its emphasis on stability, ensuring reliable performance even when one of the features is degraded or misleading. Given a noisy mixture and two potentially unreliable cues, a dedicated network is trained to dynamically balance their contributions-or disregard the less informative one when necessary. We evaluate the system under challenging conditions by simulating inference-time errors using a simple direction of arrival (DOA) estimator and a noisy spectral enrollment process. Experimental results demonstrate that the proposed model successfully extracts the desired speaker even in the presence of substantial reference inaccuracies.

[195] Aliasing-Free Neural Audio Synthesis

Yicheng Gu, Junan Zhang, Chaoren Wang, Jerry Li, Zhizheng Wu, Lauri Juvela

Main category: cs.SD

TL;DR: The paper proposes anti-aliased neural vocoders and codecs that eliminate aliasing artifacts in audio synthesis by applying oversampling, anti-derivative anti-aliasing to activation functions, and replacing ConvTranspose with resampling layers.

Details

Motivation: Current upsampling-based time-domain vocoders suffer from aliasing artifacts that limit synthesis fidelity. Three main issues exist: 1) unconstrained nonlinear activations generate infinite harmonics beyond Nyquist frequency causing "folded-back" aliasing, 2) ConvTranspose layers create "mirrored" aliasing by copying low-frequency parts to high-frequency regions, and 3) periodicity and mirrored DC bias cause "tonal artifacts" (constant-frequency ringing).

Method: Apply oversampling and anti-derivative anti-aliasing to activation functions to obtain anti-aliased forms. Replace problematic ConvTranspose layers with resampling to avoid tonal artifacts and eliminate aliased components. Implement these anti-aliased modules in Pupu-Vocoder and Pupu-Codec models.

Result: Pupu-Vocoder and Pupu-Codec outperform existing systems on singing voice, music, and audio tasks while achieving comparable performance on speech. The lightweight models effectively eliminate aliasing artifacts as demonstrated through test signal benchmarks and experiments across multiple audio domains.

Conclusion: The proposed anti-aliasing techniques from a signal processing perspective successfully address aliasing artifacts in neural vocoders and codecs, enabling higher fidelity audio synthesis across diverse domains while maintaining lightweight model architectures.

Abstract: Neural vocoders and codecs reconstruct waveforms from acoustic representations, which directly impact the audio quality. Among existing methods, upsampling-based time-domain models are superior in both inference speed and synthesis quality, achieving state-of-the-art performance. Still, despite their success in producing perceptually natural sound, their synthesis fidelity remains limited due to the aliasing artifacts brought by the inadequately designed model architectures. In particular, the unconstrained nonlinear activation generates an infinite number of harmonics that exceed the Nyquist frequency, resulting in folded-back'' aliasing artifacts. The widely used upsampling layer, ConvTranspose, copies the mirrored low-frequency parts to fill the empty high-frequency region, resulting in mirrored’’ aliasing artifacts. Meanwhile, the combination of its inherent periodicity and the mirrored DC bias also brings tonal artifact,'' resulting in constant-frequency ringing. This paper aims to solve these issues from a signal processing perspective. Specifically, we apply oversampling and anti-derivative anti-aliasing to the activation function to obtain its anti-aliased form, and replace the problematic ConvTranspose layer with resampling to avoid the tonal artifact’’ and eliminate aliased components. Based on our proposed anti-aliased modules, we introduce Pupu-Vocoder and Pupu-Codec, and release high-quality pre-trained checkpoints to facilitate audio generation research. We build a test signal benchmark to illustrate the effectiveness of the anti-aliased modules, and conduct experiments on speech, singing voice, music, and audio to validate our proposed models. Experimental results confirm that our lightweight Pupu-Vocoder and Pupu-Codec models can easily outperform existing systems on singing voice, music, and audio, while achieving comparable performance on speech.

[196] MMEDIT: A Unified Framework for Multi-Type Audio Editing via Audio Language Model

Ye Tao, Xuenan Xu, Wen Wu, Shuai Wang, Mengyue Wu, Chao Zhang

Main category: cs.SD

TL;DR: MMEdit is a unified audio editing framework using audio-language models that addresses limitations of existing methods by covering comprehensive editing operations, using scalable data synthesis, and enabling precise cross-modal alignment.

Details

Motivation: Existing text-guided audio editing methods have fundamental limitations: training-free methods suffer from signal degradation from diffusion inversion, while training-based methods are constrained by scarce high-quality paired data and narrow task formulations. Standard architectures also decouple text and audio processing, limiting instruction-acoustic context alignment.

Method: Proposes MMEdit with three key components: 1) Systematic extension of task definitions to cover comprehensive editing operations (addition, replacement, removal, reordering, attribute modification), 2) Scalable data synthesis pipeline to construct large-scale paired datasets with fine-grained event-level annotations, 3) Integration of Qwen2-Audio encoder with MMDiT-based generator for precise cross-modal alignment and localized editing.

Result: Experimental results demonstrate superior editing localization accuracy, robust instruction following, and high fidelity in non-edited regions compared to existing methods.

Conclusion: MMEdit addresses fundamental limitations in text-guided audio editing by providing a unified framework with comprehensive task coverage, scalable data synthesis, and improved cross-modal alignment, achieving state-of-the-art performance across multiple editing operations.

Abstract: Text-guided audio editing aims to modify specific acoustic events while strictly preserving non-target content. Despite recent progress, existing approaches remain fundamentally limited. Training-free methods often suffer from signal degradation caused by diffusion inversion, while training-based methods, although achieving higher generation quality, are severely constrained by the scarcity of high-quality paired data and task formulations that cover only a narrow subset of editing operations. In addition, standard architectures typically decouple text and audio processing, limiting the ability to align instructions with specific acoustic contexts. To address these challenges, we propose MMEdit, an audio-language-model-driven framework for unified audio editing. We systematically extend task definitions to cover a comprehensive range of editing operations, including addition, replacement, removal, reordering, and attribute modification. Furthermore, we design a scalable data synthesis pipeline to construct large-scale paired datasets with fine-grained event-level annotations. To capture complex editing semantics, we integrate a Qwen2-Audio encoder with an MMDiT-based generator, enabling precise cross-modal alignment and localized editing. Experimental results demonstrate that our method achieves superior editing localization accuracy, robust instruction following, and high fidelity in non-edited regions.

[197] EnvSSLAM-FFN: Lightweight Layer-Fused System for ESDD 2026 Challenge

Xiaoxuan Guo, Hengyan Huang, Jiayi Zhou, Renhe Sun, Jian Liu, Haonan Cheng, Long Ye, Qin Zhang

Main category: cs.SD

TL;DR: EnvSSLAM-FFN system for environmental sound deepfake detection achieves state-of-the-art performance on ESDD 2026 Challenge with 1.20% and 1.05% EERs on two tracks.

Details

Motivation: Recent advances in generative audio models enable high-fidelity environmental sound synthesis, raising serious security concerns about audio deepfakes, necessitating robust detection methods.

Method: Proposes EnvSSLAM-FFN that integrates a frozen SSLAM self-supervised encoder with a lightweight FFN back-end, fuses intermediate SSLAM representations from layers 4-9, and uses class-weighted training to handle data imbalance.

Result: The system consistently outperforms official baselines on both ESDD 2026 Challenge tracks, achieving Test Equal Error Rates of 1.20% (Track 1: unseen generators) and 1.05% (Track 2: black-box low-resource detection).

Conclusion: EnvSSLAM-FFN demonstrates effective environmental sound deepfake detection under challenging conditions, addressing security concerns raised by advanced generative audio models.

Abstract: Recent advances in generative audio models have enabled high-fidelity environmental sound synthesis, raising serious concerns for audio security. The ESDD 2026 Challenge therefore addresses environmental sound deepfake detection under unseen generators (Track 1) and black-box low-resource detection (Track 2) conditions. We propose EnvSSLAM-FFN, which integrates a frozen SSLAM self-supervised encoder with a lightweight FFN back-end. To effectively capture spoofing artifacts under severe data imbalance, we fuse intermediate SSLAM representations from layers 4-9 and adopt a class-weighted training objective. Experimental results show that the proposed system consistently outperforms the official baselines on both tracks, achieving Test Equal Error Rates (EERs) of 1.20% and 1.05%, respectively.

[198] AUDRON: A Deep Learning Framework with Fused Acoustic Signatures for Drone Type Recognition

Rajdeep Chatterjee, Sudip Chakrabarty, Trishaani Acharjee, Deepanjali Mishra

Main category: cs.SD

TL;DR: AUDRON is a hybrid deep learning framework that uses acoustic sensing with MFCC, STFT spectrograms, CNNs, recurrent layers, and autoencoders to detect drones with high accuracy (98.51% binary, 97.11% multiclass).

Details

Motivation: UAVs/drones are increasingly used but pose safety/security risks when misused. Acoustic sensing offers a low-cost, non-intrusive alternative to vision/radar-based detection since drone propellers generate distinctive sound patterns.

Method: Hybrid deep learning framework combining MFCC and STFT spectrograms processed with CNNs, recurrent layers for temporal modeling, and autoencoder-based representations. Uses feature-level fusion to integrate complementary information before classification.

Result: AUDRON achieves 98.51% accuracy in binary classification and 97.11% accuracy in multiclass classification, effectively differentiating drone acoustic signatures from background noise while maintaining generalizability across varying conditions.

Conclusion: Combining multiple feature representations with deep learning provides reliable acoustic drone detection, with potential for deployment in security/surveillance applications where visual or radar sensing may be limited.

Abstract: Unmanned aerial vehicles (UAVs), commonly known as drones, are increasingly used across diverse domains, including logistics, agriculture, surveillance, and defense. While these systems provide numerous benefits, their misuse raises safety and security concerns, making effective detection mechanisms essential. Acoustic sensing offers a low-cost and non-intrusive alternative to vision or radar-based detection, as drone propellers generate distinctive sound patterns. This study introduces AUDRON (AUdio-based Drone Recognition Network), a hybrid deep learning framework for drone sound detection, employing a combination of Mel-Frequency Cepstral Coefficients (MFCC), Short-Time Fourier Transform (STFT) spectrograms processed with convolutional neural networks (CNNs), recurrent layers for temporal modeling, and autoencoder-based representations. Feature-level fusion integrates complementary information before classification. Experimental evaluation demonstrates that AUDRON effectively differentiates drone acoustic signatures from background noise, achieving high accuracy while maintaining generalizability across varying conditions. AUDRON achieves 98.51 percent and 97.11 percent accuracy in binary and multiclass classification. The results highlight the advantage of combining multiple feature representations with deep learning for reliable acoustic drone detection, suggesting the framework’s potential for deployment in security and surveillance applications where visual or radar sensing may be limited.

[199] Improving Speech Emotion Recognition with Mutual Information Regularized Generative Model

Chung-Soo Ahn, Rajib Rana, Sunil Sivadas, Carlos Busso, Jagath C. Rajapakse

Main category: cs.SD

TL;DR: A novel mutual-information-regularized generative framework for speech emotion recognition that combines cross-modal alignment with feature-level synthesis to generate emotionally consistent data, outperforming existing augmentation methods.

Details

Motivation: Lack of large, well-annotated emotional speech corpora limits SER performance, especially as models grow more complex and multimodal systems increase. Existing generative data augmentation approaches often produce emotionally inconsistent samples due to oversimplified conditioning on categorical labels.

Method: Mutual-information-regularized generative framework based on InfoGAN-style architecture. First learns semantically aligned audio-text representation space using pre-trained transformers and contrastive objectives. Then trains feature generator to produce emotion-aware audio features with mutual information as quantitative regularizer to ensure strong dependency between generated features and conditioning variables. Extended to multimodal settings for generating novel paired (audio, text) features.

Result: Comprehensive evaluation on three benchmark datasets (IEMOCAP, MSP-IMPROV, MSP-Podcast) shows framework consistently outperforms existing augmentation methods, achieving state-of-the-art performance with improvements of up to 2.6% in unimodal SER and 3.2% in multimodal emotion recognition.

Conclusion: Mutual information functions as both a regularizer and measurable metric for generative quality, offering a systematic approach to data augmentation in affective computing. The framework addresses emotional consistency issues in generative data augmentation for SER.

Abstract: Lack of large, well-annotated emotional speech corpora continues to limit the performance and robustness of speech emotion recognition (SER), particularly as models grow more complex and the demand for multimodal systems increases. While generative data augmentation offers a promising solution, existing approaches often produce emotionally inconsistent samples due to oversimplified conditioning on categorical labels. This paper introduces a novel mutual-information-regularised generative framework that combines cross-modal alignment with feature-level synthesis. Building on an InfoGAN-style architecture, our method first learns a semantically aligned audio-text representation space using pre-trained transformers and contrastive objectives. A feature generator is then trained to produce emotion-aware audio features while employing mutual information as a quantitative regulariser to ensure strong dependency between generated features and their conditioning variables. We extend this approach to multimodal settings, enabling the generation of novel, paired (audio, text) features. Comprehensive evaluation on three benchmark datasets (IEMOCAP, MSP-IMPROV, MSP-Podcast) demonstrates that our framework consistently outperforms existing augmentation methods, achieving state-of-the-art performance with improvements of up to 2.6% in unimodal SER and 3.2% in multimodal emotion recognition. Most importantly, we demonstrate that mutual information functions as both a regulariser and a measurable metric for generative quality, offering a systematic approach to data augmentation in affective computing.

cs.LG

[200] Large Language Models for EDA Cloud Job Resource and Lifetime Prediction

Yuxuan Yin, Shengke Zhou, Yunjie Zhang, Ajay Mohindra, Boxun Xu, Peng Li

Main category: cs.LG

TL;DR: Fine-tuned LLMs with scientific notation and prefix filling for EDA workload prediction, achieving state-of-the-art results on real cloud datasets.

Details

Motivation: Cloud computing growth in EDA industry creates need for resource/job lifetime prediction for optimal scheduling, but traditional ML struggles with EDA workload complexity and heterogeneity.

Method: Fine-tune LLMs using text-to-text regression, introduce scientific notation and prefix filling to constrain output format, and use full-attention finetuning/inference to improve sliding-window-attention LLM prediction accuracy.

Result: Demonstrated effectiveness on real-world cloud datasets, setting new baseline for performance prediction in EDA domain.

Conclusion: Proposed LLM-based framework effectively addresses EDA workload prediction challenges, outperforming traditional methods and establishing new state-of-the-art.

Abstract: The rapid growth of cloud computing in the Electronic Design Automation (EDA) industry has created a critical need for resource and job lifetime prediction to achieve optimal scheduling. Traditional machine learning methods often struggle with the complexity and heterogeneity of EDA workloads, requiring extensive feature engineering and domain expertise. We propose a novel framework that fine-tunes Large Language Models (LLMs) to address this challenge through text-to-text regression. We introduce the scientific notation and prefix filling to constrain the LLM, significantly improving output format reliability. Moreover, we found that full-attention finetuning and inference improves the prediction accuracy of sliding-window-attention LLMs. We demonstrate the effectiveness of our proposed framework on real-world cloud datasets, setting a new baseline for performance prediction in the EDA domain.

[201] Reducing Label Dependency in Human Activity Recognition with Wearables: From Supervised Learning to Novel Weakly Self-Supervised Approaches

Taoran Sheng, Manfred Huber

Main category: cs.LG

TL;DR: This paper comprehensively investigates different machine learning paradigms for wearable-based human activity recognition, focusing on reducing labeling requirements while maintaining accuracy. It compares six approaches and finds that novel weakly supervised and weakly self-supervised methods achieve competitive performance with significantly less labeled data.

Details

Motivation: Human activity recognition using wearable sensors faces a trade-off between performance and labeling requirements. Fully supervised methods need extensive labeled data (costly), while unsupervised methods have poor performance. There's a need for approaches that minimize labeling while maintaining accuracy.

Method: The paper develops and compares six approaches: (1) traditional fully supervised learning, (2) basic unsupervised learning, (3) weakly supervised learning with constraints, (4) multi-task learning with knowledge sharing, (5) self-supervised learning based on domain expertise, and (6) a novel weakly self-supervised framework leveraging domain knowledge and minimal labeled data.

Result: Experiments show: (i) weakly supervised methods achieve comparable performance to fully supervised approaches with significantly reduced supervision; (ii) multi-task learning enhances performance through knowledge sharing; (iii) the weakly self-supervised approach demonstrates remarkable efficiency with only 10% of labeled data.

Conclusion: The study highlights complementary strengths of different learning paradigms and shows that the novel weakly self-supervised framework offers a promising solution for practical HAR applications where labeled data are limited, providing insights for tailoring solutions based on labeled data availability.

Abstract: Human activity recognition (HAR) using wearable sensors has advanced through various machine learning paradigms, each with inherent trade-offs between performance and labeling requirements. While fully supervised techniques achieve high accuracy, they demand extensive labeled datasets that are costly to obtain. Conversely, unsupervised methods eliminate labeling needs but often deliver suboptimal performance. This paper presents a comprehensive investigation across the supervision spectrum for wearable-based HAR, with particular focus on novel approaches that minimize labeling requirements while maintaining competitive accuracy. We develop and empirically compare: (1) traditional fully supervised learning, (2) basic unsupervised learning, (3) a weakly supervised learning approach with constraints, (4) a multi-task learning approach with knowledge sharing, (5) a self-supervised approach based on domain expertise, and (6) a novel weakly self-supervised learning framework that leverages domain knowledge and minimal labeled data. Experiments across benchmark datasets demonstrate that: (i) our weakly supervised methods achieve performance comparable to fully supervised approaches while significantly reducing supervision requirements; (ii) the proposed multi-task framework enhances performance through knowledge sharing between related tasks; (iii) our weakly self-supervised approach demonstrates remarkable efficiency with just 10% of labeled data. These results not only highlight the complementary strengths of different learning paradigms, offering insights into tailoring HAR solutions based on the availability of labeled data, but also establish that our novel weakly self-supervised framework offers a promising solution for practical HAR applications where labeled data are limited.

[202] Development and external validation of a multimodal artificial intelligence mortality prediction model of critically ill patients using multicenter data

Behrooz Mamandipoor, Chun-Nan Hsu, Martin Krause, Ulrich H. Schmidt, Rodney A. Gabriel

Main category: cs.LG

TL;DR: Multimodal deep learning model using structured/unstructured data predicts in-hospital mortality in ICU patients with high accuracy (AUROC 0.92) and shows improved performance when including clinical notes and chest X-rays.

Details

Motivation: Early prediction of in-hospital mortality in critically ill patients can help clinicians optimize treatment decisions and resource allocation in intensive care settings.

Method: Developed multimodal deep learning model using MIMIC-III, MIMIC-IV, eICU, and HiRID datasets. Inputs included time-invariant variables, time-variant variables (first 24h ICU data), clinical notes, and chest X-ray images. Model was trained on MIMIC datasets and externally validated on temporally separated MIMIC populations, HiRID, and eICU datasets from 200+ hospitals.

Result: Model achieved AUROC 0.92, AUPRC 0.53, Brier score 0.19 with structured data. External validation across 8 eICU institutions showed AUROCs 0.84-0.92. Adding clinical notes and imaging improved AUROC from 0.87 to 0.89, AUPRC from 0.43 to 0.48, and reduced Brier score from 0.37 to 0.17. Dataset included 203,434 ICU admissions (2001-2022) with mortality rates 5.2%-7.9%.

Conclusion: Multimodal data integration (structured data, clinical notes, imaging) significantly improves mortality prediction accuracy. External validation across multiple institutions demonstrates model generalizability. The approach highlights the importance of leveraging diverse patient information sources for clinical prediction tasks.

Abstract: Early prediction of in-hospital mortality in critically ill patients can aid clinicians in optimizing treatment. The objective was to develop a multimodal deep learning model, using structured and unstructured clinical data, to predict in-hospital mortality risk among critically ill patients after their initial 24 hour intensive care unit (ICU) admission. We used data from MIMIC-III, MIMIC-IV, eICU, and HiRID. A multimodal model was developed on the MIMIC datasets, featuring time series components occurring within the first 24 hours of ICU admission and predicting risk of subsequent inpatient mortality. Inputs included time-invariant variables, time-variant variables, clinical notes, and chest X-ray images. External validation occurred in a temporally separated MIMIC population, HiRID, and eICU datasets. A total of 203,434 ICU admissions from more than 200 hospitals between 2001 to 2022 were included, in which mortality rate ranged from 5.2% to 7.9% across the four datasets. The model integrating structured data points had AUROC, AUPRC, and Brier scores of 0.92, 0.53, and 0.19, respectively. We externally validated the model on eight different institutions within the eICU dataset, demonstrating AUROCs ranging from 0.84-0.92. When including only patients with available clinical notes and imaging data, inclusion of notes and imaging into the model, the AUROC, AUPRC, and Brier score improved from 0.87 to 0.89, 0.43 to 0.48, and 0.37 to 0.17, respectively. Our findings highlight the importance of incorporating multiple sources of patient information for mortality prediction and the importance of external validation.

[203] Learning to Design City-scale Transit Routes

Bibek Poudel, Weizi Li

Main category: cs.LG

TL;DR: End-to-end RL framework using graph attention networks for sequential transit network design, outperforming human-designed systems and traditional heuristics on real-world benchmarks.

Details

Motivation: Transit route network design is NP-hard with exponentially large solution spaces, traditionally relying on manual planning processes that may be suboptimal.

Method: End-to-end reinforcement learning framework based on graph attention networks for sequential transit network construction, with two-level reward structure combining incremental topological feedback and simulation-based terminal rewards.

Result: Substantial outperformance of existing designs and traditional heuristics across multiple scenarios: 25.6% higher service rates, 30.9% shorter wait times, 21.0% better bus utilization under high transit adoption; 68.8% higher route efficiency and 5.9% lower travel times under mixed-mode conditions.

Conclusion: End-to-end RL can design transit networks that substantially outperform both human-designed systems and hand-crafted heuristics on realistic city-scale benchmarks.

Abstract: Designing efficient transit route networks is an NP-hard problem with exponentially large solution spaces that traditionally relies on manual planning processes. We present an end-to-end reinforcement learning (RL) framework based on graph attention networks for sequential transit network construction. To address the long-horizon credit assignment challenge, we introduce a two-level reward structure combining incremental topological feedback with simulation-based terminal rewards. We evaluate our approach on a new real-world dataset from Bloomington, Indiana with topologically accurate road networks, census-derived demand, and existing transit routes. Our learned policies substantially outperform existing designs and traditional heuristics across two initialization schemes and two modal-split scenarios. Under high transit adoption with transit center initialization, our approach achieves 25.6% higher service rates, 30.9% shorter wait times, and 21.0% better bus utilization compared to the real-world network. Under mixed-mode conditions with random initialization, it delivers 68.8% higher route efficiency than demand coverage heuristics and 5.9% lower travel times than shortest path construction. These results demonstrate that end-to-end RL can design transit networks that substantially outperform both human-designed systems and hand-crafted heuristics on realistic city-scale benchmarks.

[204] Thermodynamic Focusing for Inference-Time Search: Practical Methods for Target-Conditioned Sampling and Prompted Inference

Zhan Zhang

Main category: cs.LG

TL;DR: ICFA is a practical framework for finding rare solutions in large search spaces by treating search as target-conditioned reweighting, reusing existing proposal samplers and similarity functions while adaptively controlling focusing strength.

Details

Motivation: Finding rare but useful solutions in very large candidate spaces is a recurring practical challenge across language generation, planning, and reinforcement learning. There's a need for practical methods that can efficiently discover these rare solutions without requiring extensive modifications to existing systems.

Method: Inverted Causality Focusing Algorithm (ICFA) treats search as a target-conditioned reweighting process. It reuses available proposal samplers and task-specific similarity functions to form a focused sampling distribution. The method includes adaptive control of focusing strength to avoid degeneracy, stability diagnostics based on effective sample size, and can be combined with structured prompts for language-level approximation.

Result: The paper provides a clear recipe, stability diagnostics, theoretical analysis explaining when ICFA reduces sample needs, and demonstrates reproducible experiments in constrained language generation and sparse-reward navigation. It also shows how structured prompts can instantiate an approximate language-level form of ICFA and describes a hybrid architecture combining prompted inference with algorithmic reweighting.

Conclusion: ICFA offers a practical framework for efficient search in large candidate spaces by leveraging existing components through target-conditioned reweighting, with applications across multiple domains including language generation and reinforcement learning. The method’s adaptability and combination with prompting techniques make it versatile for real-world applications.

Abstract: Finding rare but useful solutions in very large candidate spaces is a recurring practical challenge across language generation, planning, and reinforcement learning. We present a practical framework, \emph{Inverted Causality Focusing Algorithm} (ICFA), that treats search as a target-conditioned reweighting process. ICFA reuses an available proposal sampler and a task-specific similarity function to form a focused sampling distribution, while adaptively controlling focusing strength to avoid degeneracy. We provide a clear recipe, a stability diagnostic based on effective sample size, a compact theoretical sketch explaining when ICFA can reduce sample needs, and two reproducible experiments: constrained language generation and sparse-reward navigation. We further show how structured prompts instantiate an approximate, language-level form of ICFA and describe a hybrid architecture combining prompted inference with algorithmic reweighting.

[205] Synthetic Data Blueprint (SDB): A modular framework for the statistical, structural, and graph-based evaluation of synthetic tabular data

Vasileios C. Pezoulas, Nikolaos S. Tachos, Eleni Georga, Kostas Marias, Manolis Tsiknakis, Dimitrios I. Fotiadis

Main category: cs.LG

TL;DR: SDB is a Python library for comprehensive evaluation of synthetic tabular data with automated feature detection, fidelity metrics, structure preservation scores, and visualization tools.

Details

Motivation: Current synthetic data evaluation is fragmented with heterogeneous metrics, ad-hoc scripts, and incomplete reporting, creating a need for standardized assessment tools.

Method: Developed Synthetic Data Blueprint (SDB) - a modular Python library with automated feature-type detection, distributional/dependency fidelity metrics, graph/embedding-based structure preservation scores, and visualization schemas.

Result: Successfully evaluated SDB across three diverse real-world use cases (healthcare diagnostics, socioeconomic/financial modeling, cybersecurity) showing it handles mixed-type clinical variables, high-cardinality categorical attributes, and high-dimensional telemetry signals.

Conclusion: SDB provides consistent, transparent, and reproducible benchmarking for synthetic data fidelity assessment across heterogeneous domains, addressing diverse data evaluation challenges.

Abstract: In the rapidly evolving era of Artificial Intelligence (AI), synthetic data are widely used to accelerate innovation while preserving privacy and enabling broader data accessibility. However, the evaluation of synthetic data remains fragmented across heterogeneous metrics, ad-hoc scripts, and incomplete reporting practices. To address this gap, we introduce Synthetic Data Blueprint (SDB), a modular Pythonic based library to quantitatively and visually assess the fidelity of synthetic tabular data. SDB supports: (i) automated feature-type detection, (ii) distributional and dependency-level fidelity metrics, (iii) graph- and embedding-based structure preservation scores, and (iv) a rich suite of data visualization schemas. To demonstrate the breadth, robustness, and domain-agnostic applicability of the SDB, we evaluated the framework across three real-world use cases that differ substantially in scale, feature composition, statistical complexity, and downstream analytical requirements. These include: (i) healthcare diagnostics, (ii) socioeconomic and financial modelling, and (iii) cybersecurity and network traffic analysis. These use cases reveal how SDB can address diverse data fidelity assessment challenges, varying from mixed-type clinical variables to high-cardinality categorical attributes and high-dimensional telemetry signals, while at the same time offering a consistent, transparent, and reproducible benchmarking across heterogeneous domains.

[206] Multiscale Dual-path Feature Aggregation Network for Remaining Useful Life Prediction of Lithium-Ion Batteries

Zihao Lv, Siqi Ai, Yanbin Zhang

Main category: cs.LG

TL;DR: Proposes MDFA-Net, a dual-path deep learning architecture for battery RUL prediction that captures both local and global degradation patterns through multiscale feature networks.

Details

Motivation: Current modeling techniques for assessing battery degradation sequences are inefficient and inadequate for real-life applications, failing to properly capture both local and global correlations in degradation patterns.

Method: MDFA-Net consists of two path networks: MF-Net (multiscale feature network) that maintains shallow information, and EC-Net (encoder network) that captures continuous trends and retains deep details. The dual-path design integrates both deep and shallow attributes to grasp local and global patterns.

Result: Testing on two publicly available Lithium-ion battery datasets shows the approach surpasses existing top-tier methods in RUL forecasting and accurately maps capacity degradation trajectories.

Conclusion: The proposed MDFA-Net architecture effectively addresses the limitations of current modeling techniques by capturing both local and global degradation patterns, demonstrating superior performance in battery RUL prediction for real-world applications.

Abstract: Targeted maintenance strategies, ensuring the dependability and safety of industrial machinery. However, current modeling techniques for assessing both local and global correlation of battery degradation sequences are inefficient and difficult to meet the needs in real-life applications. For this reason, we propose a novel deep learning architecture, multiscale dual-path feature aggregation network (MDFA-Net), for RUL prediction. MDFA-Net consists of dual-path networks, the first path network, multiscale feature network (MF-Net) that maintains the shallow information and avoids missing information, and the second path network is an encoder network (EC-Net) that captures the continuous trend of the sequences and retains deep details. Integrating both deep and shallow attributes effectively grasps both local and global patterns. Testing conducted with two publicly available Lithium-ion battery datasets reveals our approach surpasses existing top-tier methods in RUL forecasting, accurately mapping the capacity degradation trajectory.

[207] OASI: Objective-Aware Surrogate Initialization for Multi-Objective Bayesian Optimization in TinyML Keyword Spotting

Soumen Garai, Suman Samui

Main category: cs.LG

TL;DR: OASI is a novel initialization strategy for Multi-objective Bayesian Optimization that uses Multi-Objective Simulated Annealing to generate high-performing, diverse configurations for TinyML Keyword Spotting models, outperforming traditional initialization methods.

Details

Motivation: Existing initialization methods for Multi-objective Bayesian Optimization (MOBO) in TinyML Keyword Spotting are naive and not adapted to the Pareto front, leading to suboptimal performance when balancing accuracy and model size under strict resource constraints.

Method: Proposed Objective-Aware Surrogate Initialization (OASI) uses Multi-Objective Simulated Annealing (MOSA) to generate a seed Pareto set of configurations that explicitly balance accuracy and model size for MOBO initialization.

Result: OASI outperforms LHS, Sobol, and Random initialization in TinyML KWS, achieving highest hypervolume (0.0627) and lowest generational distance (0.0) with modest computation time increase (1934s vs ~1500s). Statistical analysis shows superior consistency.

Conclusion: OASI provides an effective initialization strategy for MOBO in resource-constrained TinyML applications, offering better Pareto front approximation and consistency than traditional methods while maintaining reasonable computational overhead.

Abstract: Voice assistants utilize Keyword Spotting (KWS) to enable efficient, privacy-friendly activation. However, realizing accurate KWS models on ultra-low-power TinyML devices (often with less than $<2$ MB of flash memory) necessitates a delicate balance between accuracy with strict resource constraints. Multi-objective Bayesian Optimization (MOBO) is an ideal candidate for managing such a trade-off but is highly initialization-dependent, especially under the budgeted black-box setting. Existing methods typically fall back to naive, ad-hoc sampling routines (e.g., Latin Hypercube Sampling (LHS), Sobol sequences, or Random search) that are adapted to neither the Pareto front nor undergo rigorous statistical comparison. To address this, we propose Objective-Aware Surrogate Initialization (OASI), a novel initialization strategy that leverages Multi-Objective Simulated Annealing (MOSA) to generate a seed Pareto set of high-performing and diverse configurations that explicitly balance accuracy and model size. Evaluated in a TinyML KWS setting, OASI outperforms LHS, Sobol, and Random initialization, achieving the highest hypervolume (0.0627) and the lowest generational distance (0.0) across multiple runs, with only a modest increase in computation time (1934 s vs. $\sim$1500 s). A non-parametric statistical analysis using the Kruskal-Wallis test ($H = 5.40$, $p = 0.144$, $η^2 = 0.0007$) and Dunn’s post-hoc test confirms OASI’s superior consistency despite the non-significant overall difference with respect to the $α=0.05$ threshold.

[208] Per-Axis Weight Deltas for Frequent Model Updates

Stefan Kuyumdzhiev, Radostin Cholakov

Main category: cs.LG

TL;DR: 1-bit delta compression for fine-tuned LLM variants using sign bits + per-axis scaling factors, reducing storage and cold-start latency while maintaining inference efficiency.

Details

Motivation: Serving many task-specialized LLM variants is limited by large fine-tuned checkpoint sizes and resulting cold-start latency. Fine-tuned weights differ from base models by small structured residuals, suggesting compressed delta representation as a solution.

Method: Propose 1-bit delta scheme storing only sign of weight difference with lightweight per-axis (row/column) FP16 scaling factors learned from small calibration set. Streamlined loader transfers packed deltas in single operation per module.

Result: Method preserves compactness of 1-bit deltas while capturing weight dimension variation better than scalar alternatives, reducing artifacts to several times smaller than full FP16 checkpoint with reduced cold-start latency and storage overhead.

Conclusion: Drop-in method requires minimal calibration data, maintains inference efficiency by avoiding dense reconstruction, and provides practical solution for frequent model updates with source code available.

Abstract: Serving many task-specialized LLM variants is often limited by the large size of fine-tuned checkpoints and the resulting cold-start latency. Since fine-tuned weights differ from their base model by relatively small structured residuals, a natural approach is to represent them as compressed deltas. We propose a simple 1-bit delta scheme that stores only the sign of the weight difference together with lightweight per-axis (row/column) FP16 scaling factors, learned from a small calibration set. This design preserves the compactness of 1-bit deltas while more accurately capturing variation across weight dimensions, leading to improved reconstruction quality over scalar alternatives. From a systems perspective, a streamlined loader that transfers packed deltas in a single operation per module reduces cold-start latency and storage overhead, with artifacts several times smaller than a full FP16 checkpoint. The method is drop-in, requires minimal calibration data, and maintains inference efficiency by avoiding dense reconstruction. Our experimental setup and source code are available at https://github.com/kuiumdjiev/Per-Axis-Weight-Deltas-for-Frequent-Model-Updates.

[209] Unified Multimodal Brain Decoding via Cross-Subject Soft-ROI Fusion

Xuanyu Hu

Main category: cs.LG

TL;DR: BrainROI model improves multimodal brain decoding by addressing cross-subject generalization and interpretability challenges, achieving state-of-the-art results on NSD dataset through fMRI encoding with soft functional parcellations, interpretable prompt optimization, and parameterized decoding constraints.

Details

Motivation: Multimodal brain decoding faces key challenges in cross-subject generalization (due to heterogeneity of functional brain topology across subjects) and interpretability (limitations of manual and black-box prompting methods in stability and transparency).

Method: Three main components: 1) New fMRI encoder using multi-atlas soft functional parcellations (soft-ROI) as shared space with voxel-wise gated fusion mechanism and global label alignment; 2) Interpretable prompt optimization using locally deployed Qwen model in small-sample closed loop; 3) Parameterized decoding constraints during inference.

Result: Achieves leading-level results in brain-captioning evaluation on NSD dataset. Under cross-subject setting, shows clear improvements in metrics such as BLEU-4 and CIDEr compared with recent state-of-the-art methods and representative baselines.

Conclusion: BrainROI model successfully addresses cross-subject generalization and interpretability challenges in multimodal brain decoding through innovative fMRI encoding, interpretable prompt optimization, and constrained decoding, achieving state-of-the-art performance.

Abstract: Multimodal brain decoding aims to reconstruct semantic information that is consistent with visual stimuli from brain activity signals such as fMRI, and then generate readable natural language descriptions. However, multimodal brain decoding still faces key challenges in cross-subject generalization and interpretability. We propose a BrainROI model and achieve leading-level results in brain-captioning evaluation on the NSD dataset. Under the cross-subject setting, compared with recent state-of-the-art methods and representative baselines, metrics such as BLEU-4 and CIDEr show clear improvements. Firstly, to address the heterogeneity of functional brain topology across subjects, we design a new fMRI encoder. We use multi-atlas soft functional parcellations (soft-ROI) as a shared space. We extend the discrete ROI Concatenation strategy in MINDLLM to a voxel-wise gated fusion mechanism (Voxel-gate). We also ensure consistent ROI mapping through global label alignment, which enhances cross-subject transferability. Secondly, to overcome the limitations of manual and black-box prompting methods in stability and transparency, we introduce an interpretable prompt optimization process. In a small-sample closed loop, we use a locally deployed Qwen model to iteratively generate and select human-readable prompts. This process improves the stability of prompt design and preserves an auditable optimization trajectory. Finally, we impose parameterized decoding constraints during inference to further improve the stability and quality of the generated descriptions.

[210] Sign-Aware Multistate Jaccard Kernels and Geometry for Real and Complex-Valued Signals

Vineet Yadav

Main category: cs.LG

TL;DR: A sign-aware, multistate Jaccard/Tanimoto framework that extends overlap-based distances to real- and complex-valued signals while preserving metric properties and positive-semidefinite kernel structure.

Details

Motivation: To extend Jaccard/Tanimoto similarity measures beyond nonnegative vectors to handle arbitrary real- and complex-valued signals while maintaining bounded metric properties and positive-semidefinite kernel structure for use in kernel methods and graph-based learning.

Method: Represent signals as atomic measures on a signed state space, embed signals into nonnegative multistate representations using positive/negative splits for real signals and Cartesian/polar decompositions for complex signals, then apply Tanimoto construction to produce bounded distances that satisfy triangle inequality.

Result: Develops a family of [0,1] distances with triangle inequality and positive-semidefinite kernels, plus coalition analysis via Möbius inversion for budget accounting, and probabilistic semantics through normalization to probability measures.

Conclusion: Provides a unified, interpretable framework that simultaneously offers bounded metric structure, positive-semidefinite kernels, probabilistic semantics, and transparent budget accounting for scientific and financial applications like correlograms and similarity graphs.

Abstract: We introduce a sign-aware, multistate Jaccard/Tanimoto framework that extends overlap-based distances from nonnegative vectors and measures to arbitrary real- and complex-valued signals while retaining bounded metric and positive-semidefinite kernel structure. Formally, the construction is a set- and measure-theoretic geometry: signals are represented as atomic measures on a signed state space, and similarity is given by a generalized Jaccard overlap of these measures. Each signal is embedded into a nonnegative multistate representation, using positive/negative splits for real signals, Cartesian and polar decompositions for complex signals, and user-defined state partitions for refined regime analysis. Applying the Tanimoto construction to these embeddings yields a family of $[0,1]$ distances that satisfy the triangle inequality and define positive-semidefinite kernels usable directly in kernel methods and graph-based learning. Beyond pairwise distances, we develop coalition analysis via Möbius inversion, which decomposes signal magnitude into nonnegative, additive contributions with exact budget closure across coalitions of signals. Normalizing the same embeddings produces probability measures on coordinate – state configurations, so that the distance becomes a monotone transform of total variation and admits a regime – intensity decomposition. The resulting construction yields a single, mechanistically interpretable distance that simultaneously provides bounded metric structure, positive-semidefinite kernels, probabilistic semantics, and transparent budget accounting within one sign-aware framework, supporting correlograms, feature engineering, similarity graphs, and other analytical tools in scientific and financial applications.

[211] Node-Level Financial Optimization in Demand Forecasting Through Dynamic Cost Asymmetry and Feedback Mechanism

Alessandro Casadei, Clemens Grupp, Sreyoshi Bhaduri, Lu Guo, Wilson Fung, Rohit Malshe, Raj Ratan, Ankush Pole, Arkajit Rakshit

Main category: cs.LG

TL;DR: Methodology to adjust forecasts using node-specific cost asymmetry, achieving $5.1M annual savings through dynamic error distribution adjustments and self-regulation.

Details

Motivation: To improve forecast accuracy by incorporating node-specific cost function asymmetry, addressing the limitation that traditional forecasting methods treat all errors equally regardless of their economic impact.

Method: Proposes a model that dynamically incorporates cost asymmetry into forecasting error probability distribution, favoring least expensive scenarios. Includes self-regulation mechanism that modulates adjustment magnitude based on observed savings to adapt to station-specific conditions and unmodeled factors.

Result: Empirical results demonstrate the model’s ability to achieve $5.1M in annual savings, showing practical economic benefits from incorporating cost asymmetry into forecasting adjustments.

Conclusion: The methodology successfully generates significant savings by dynamically adjusting forecasts based on cost asymmetry, with self-regulation enabling adaptation to various conditions and unmodeled factors.

Abstract: This work introduces a methodology to adjust forecasts based on node-specific cost function asymmetry. The proposed model generates savings by dynamically incorporating the cost asymmetry into the forecasting error probability distribution to favor the least expensive scenario. Savings are calculated and a self-regulation mechanism modulates the adjustments magnitude based on the observed savings, enabling the model to adapt to station-specific conditions and unmodeled factors such as calibration errors or shifting macroeconomic dynamics. Finally, empirical results demonstrate the model’s ability to achieve $5.1M annual savings.

[212] End-to-End Data Quality-Driven Framework for Machine Learning in Production Environment

Firas Bayram, Bestoun S. Ahmed, Erik Hallin

Main category: cs.LG

TL;DR: Novel end-to-end framework integrating real-time data quality assessment with ML operations, validated in steel manufacturing with 12% performance improvement and 4x latency reduction.

Details

Motivation: Existing approaches treat data quality assessment and ML systems as isolated processes, creating a gap between theoretical methods and practical implementation in production environments.

Method: End-to-end framework combining dynamic drift detection, adaptive data quality metrics, and MLOps into a cohesive lightweight system for real-time quality-driven ML decision-making.

Result: 12% improvement in model performance (R2 = 94%) and fourfold reduction in prediction latency when validated in steel manufacturing’s Electroslag Remelting vacuum pumping process.

Conclusion: Framework represents significant advancement in MLOps, offering robust solution for time-sensitive, data-driven decision-making in dynamic industrial environments with insights on balancing data quality standards and predictive performance.

Abstract: This paper introduces a novel end-to-end framework that efficiently integrates data quality assessment with machine learning (ML) model operations in real-time production environments. While existing approaches treat data quality assessment and ML systems as isolated processes, our framework addresses the critical gap between theoretical methods and practical implementation by combining dynamic drift detection, adaptive data quality metrics, and MLOps into a cohesive, lightweight system. The key innovation lies in its operational efficiency, enabling real-time, quality-driven ML decision-making with minimal computational overhead. We validate the framework in a steel manufacturing company’s Electroslag Remelting (ESR) vacuum pumping process, demonstrating a 12% improvement in model performance (R2 = 94%) and a fourfold reduction in prediction latency. By exploring the impact of data quality acceptability thresholds, we provide actionable insights into balancing data quality standards and predictive performance in industrial applications. This framework represents a significant advancement in MLOps, offering a robust solution for time-sensitive, data-driven decision-making in dynamic industrial environments.

[213] Out-of-Distribution Detection for Continual Learning: Design Principles and Benchmarking

Srishti Gupta, Riccardo Balia, Daniele Angioni, Fabio Brau, Maura Pintor, Ambra Demontis, Alessandro Sebastian, Salvatore Mario Carta, Fabio Roli, Battista Biggio

Main category: cs.LG

TL;DR: This paper discusses the limitations of traditional machine learning models that assume i.i.d. data and proposes Continual Learning and Out-of-Distribution detection as critical solutions for building robust, adaptive AI systems in real-world dynamic environments.

Details

Motivation: Traditional ML models assume i.i.d. training and test data, but this assumption fails in real-world applications where data evolves over time and novel conditions emerge post-deployment. Retraining from scratch is impractical, creating a need for systems that can adapt continuously while maintaining reliability.

Method: The paper proposes a joint approach combining Continual Learning (CL) to incrementally learn from evolving data streams without catastrophic forgetting, and Out-of-Distribution (OOD) detection to identify and respond to novel or anomalous inputs.

Result: The paper argues that addressing both CL and OOD detection together is essential for developing robust, efficient, and adaptive AI systems that can operate reliably in dynamic real-world environments.

Conclusion: Jointly solving Continual Learning and Out-of-Distribution detection challenges is critical for creating AI systems that remain reliable and adaptive over time in ever-changing real-world scenarios, overcoming the limitations of traditional i.i.d.-based approaches.

Abstract: Recent years have witnessed significant progress in the development of machine learning models across a wide range of fields, fueled by increased computational resources, large-scale datasets, and the rise of deep learning architectures. From malware detection to enabling autonomous navigation, modern machine learning systems have demonstrated remarkable capabilities. However, as these models are deployed in ever-changing real-world scenarios, their ability to remain reliable and adaptive over time becomes increasingly important. For example, in the real world, new malware families are continuously developed, whereas autonomous driving cars are employed in many different cities and weather conditions. Models trained in fixed settings can not respond effectively to novel conditions encountered post-deployment. In fact, most machine learning models are still developed under the assumption that training and test data are independent and identically distributed (i.i.d.), i.e., sampled from the same underlying (unknown) distribution. While this assumption simplifies model development and evaluation, it does not hold in many real-world applications, where data changes over time and unexpected inputs frequently occur. Retraining models from scratch whenever new data appears is computationally expensive, time-consuming, and impractical in resource-constrained environments. These limitations underscore the need for Continual Learning (CL), which enables models to incrementally learn from evolving data streams without forgetting past knowledge, and Out-of-Distribution (OOD) detection, which allows systems to identify and respond to novel or anomalous inputs. Jointly addressing both challenges is critical to developing robust, efficient, and adaptive AI systems.

[214] Tiny, On-Device Decision Makers with the MiniConv Library

Carlos Purves

Main category: cs.LG

TL;DR: Split-policy RL architecture uses on-device OpenGL fragment shaders to compress observations before transmission, reducing data transfer and latency for edge deployment.

Details

Motivation: Deploying visual RL policies on resource-constrained edge devices is challenging due to computational costs and communication latency. Current approaches that offload policy inference to remote servers require transmitting high-dimensional observations, causing network round trips and latency issues.

Method: Proposes a split-policy architecture where a small on-device encoder (implemented as OpenGL fragment-shader passes) transforms observations into compact feature tensors. These compressed features are transmitted to a remote policy head for decision making, reducing data transmission and leveraging embedded GPU support.

Result: The approach reduces transmitted data, lowers decision latency in bandwidth-limited settings, and reduces server-side compute per request. Achieves broadly comparable learning performance (mean over final 100 episodes) in single-run benchmarks with modest trade-offs in mean return. Evaluated across NVIDIA Jetson Nano, Raspberry Pi 4B, and Raspberry Pi Zero 2 W with learning results, execution behavior, and latency measurements.

Conclusion: The split-policy architecture enables efficient deployment of visual RL policies on edge devices by reducing communication overhead while maintaining learning performance, with open-source code released for training, deployment, and measurement.

Abstract: Reinforcement learning (RL) has achieved strong results, but deploying visual policies on resource-constrained edge devices remains challenging due to computational cost and communication latency. Many deployments therefore offload policy inference to a remote server, incurring network round trips and requiring transmission of high-dimensional observations. We introduce a split-policy architecture in which a small on-device encoder, implemented as OpenGL fragment-shader passes for broad embedded GPU support, transforms each observation into a compact feature tensor that is transmitted to a remote policy head. In RL, this communication overhead manifests as closed-loop decision latency rather than only per-request inference latency. The proposed approach reduces transmitted data, lowers decision latency in bandwidth-limited settings, and reduces server-side compute per request, whilst achieving broadly comparable learning performance by final return (mean over the final 100 episodes) in single-run benchmarks, with modest trade-offs in mean return. We evaluate across an NVIDIA Jetson Nano, a Raspberry Pi 4B, and a Raspberry Pi Zero 2 W, reporting learning results, on-device execution behaviour under sustained load, and end-to-end decision latency and scalability measurements under bandwidth shaping. Code for training, deployment, and measurement is released as open source.

[215] Trend Extrapolation for Technology Forecasting: Leveraging LSTM Neural Networks for Trend Analysis of Space Exploration Vessels

Peng-Hung Tsai, Daniel Berleant

Main category: cs.LG

TL;DR: This paper develops a hybrid forecasting model combining LSTM neural networks with augmented Moore’s law to predict spacecraft lifetimes, addressing right-censoring bias using the STETI approach.

Details

Motivation: Forecasting technological advancement in complex domains like space exploration is challenging due to technical, economic, and policy interactions. Current methods rely on quantitative trend extrapolation, but there's a growing trend toward machine learning-based hybrid models. Spacecraft lifetime prediction is important for mission planning and serves as a proxy for technological progress.

Method: The authors conducted a systematic literature review of technology forecasting methods, then developed a hybrid model combining LSTM neural networks with augmented Moore’s law. They introduced an advance to the STETI (Start Time End Time Integration) approach to address right-censoring bias in lifetime data, where recent spacecraft are still operational and don’t contribute failure data.

Result: The model successfully predicts spacecraft lifetimes while mitigating the right-censoring distortion that biases lifetime estimates for recent launch dates downward. The STETI approach effectively addresses the systematic distortion in lifetime versus launch date curves.

Conclusion: The hybrid LSTM-Moore’s law model with STETI enhancement provides improved spacecraft lifetime forecasting, offering valuable insights for space mission planning and policy decision-making while advancing technology forecasting methodology for complex domains.

Abstract: Forecasting technological advancement in complex domains such as space exploration presents significant challenges due to the intricate interaction of technical, economic, and policy-related factors. The field of technology forecasting has long relied on quantitative trend extrapolation techniques, such as growth curves (e.g., Moore’s law) and time series models, to project technological progress. To assess the current state of these methods, we conducted an updated systematic literature review (SLR) that incorporates recent advances. This review highlights a growing trend toward machine learning-based hybrid models. Motivated by this review, we developed a forecasting model that combines long short-term memory (LSTM) neural networks with an augmentation of Moore’s law to predict spacecraft lifetimes. Operational lifetime is an important engineering characteristic of spacecraft and a potential proxy for technological progress in space exploration. Lifetimes were modeled as depending on launch date and additional predictors. Our modeling analysis introduces a novel advance in the recently introduced Start Time End Time Integration (STETI) approach. STETI addresses a critical right censoring problem known to bias lifetime analyses: the more recent the launch dates, the shorter the lifetimes of the spacecraft that have failed and can thus contribute lifetime data. Longer-lived spacecraft are still operating and therefore do not contribute data. This systematically distorts putative lifetime versus launch date curves by biasing lifetime estimates for recent launch dates downward. STETI mitigates this distortion by interconverting between expressing lifetimes as functions of launch time and modeling them as functions of failure time. The results provide insights relevant to space mission planning and policy decision-making.

[216] Hard Negative Sample-Augmented DPO Post-Training for Small Language Models

Haocheng Lu, Minjun Zhu, Henry Yu

Main category: cs.LG

TL;DR: A lightweight post-training pipeline using a compact MathVerifier to detect structured errors in math reasoning, enabling verifier-guided weighted DPO for targeted improvements without expensive reward models.

Details

Motivation: LLMs struggle with mathematical reasoning, and current post-training pipelines reduce solutions to binary correct/incorrect outcomes, missing structured errors. RLHF variants are expensive, difficult to scale, and unstable.

Method: Start with SFT on MetaMathQA-style CoT data, then introduce a compact MathVerifier that decomposes solutions into six-dimensional error profiles with wrongness/absurdity scores. Use these signals to mine hard negatives and define per-sample importance weights for verifier-guided weighted DPO.

Result: Experiments on 1.5B-parameter Qwen2.5 show verifier-guided weighted DPO yields more targeted improvements than vanilla SFT and unweighted DPO, especially on problems with numerically close but logically inconsistent solutions.

Conclusion: The proposed lightweight pipeline effectively targets structured mathematical reasoning errors under realistic compute budgets, avoiding overhead of large reward models or external judges while providing interpretable error analysis.

Abstract: Large language models (LLMs) continue to struggle with mathematical reasoning, and common post-training pipelines often reduce each generated solution to a binary outcome: correct or incorrect. This perspective is limiting in practice, as failures in chain-of-thought (CoT) reasoning are frequently structured; solutions may appear convincing while containing subtle logical, algebraic, or numerical flaws. Meanwhile, reinforcement learning from human feedback (RLHF) variants that rely on large reward models or LLM-as-a-judge signals are often expensive, difficult to scale, and unstable to iterate. We propose a lightweight and pragmatic post-training pipeline that targets such structured errors under realistic compute budgets. Starting from supervised fine-tuning (SFT) on MetaMathQA-style CoT data, we introduce a compact MathVerifier that decomposes a candidate solution into a six-dimensional error profile and aggregates it into interpretable wrongness and absurdity scores. These verifier signals serve two roles: (i) mining hard negatives that are near-correct yet structurally flawed, and (ii) defining per-sample importance weights that emphasize the most informative preference pairs. We integrate both into an offline Direct Preference Optimization (DPO) objective via a verifier-guided weighted formulation. Experiments on a 1.5B-parameter Qwen2.5 model show that verifier-guided, weighted DPO yields more targeted improvements than vanilla SFT and unweighted DPO, particularly on problems where solutions are numerically close to correct but logically inconsistent, while avoiding the overhead of training large reward models or relying on external judges.

[217] High-Performance Self-Supervised Learning by Joint Training of Flow Matching

Kosuke Ukita, Tsuyoshi Okita

Main category: cs.LG

TL;DR: FlowFM is a flow matching-based foundation model that jointly trains representation encoder and conditional flow matching generator to achieve both high-fidelity generation and effective recognition, addressing diffusion models’ trade-off between generative quality and discriminative performance while reducing computational costs.

Details

Motivation: Diffusion models show promise for SSL but face trade-offs between generative quality and discriminative performance, plus high computational/energy costs from iterative sampling that hinder industrial and edge AI applications.

Method: Proposes FlowFM with decoupled design: jointly trains representation encoder and conditional flow matching generator. Uses flow matching to learn simpler velocity field, accelerating and stabilizing training for better representation learning efficiency.

Result: On wearable sensor data: 50.4% training time reduction vs diffusion-based approach. Surpassed state-of-the-art SSL method (SSL-Wearables) on all five datasets with up to 51.0x inference speedup while maintaining high generative quality.

Conclusion: FlowFM effectively addresses diffusion models’ limitations by achieving both high-quality generation and effective recognition with significantly improved efficiency, making it suitable for industrial and edge AI applications.

Abstract: Diffusion models can learn rich representations during data generation, showing potential for Self-Supervised Learning (SSL), but they face a trade-off between generative quality and discriminative performance. Their iterative sampling also incurs substantial computational and energy costs, hindering industrial and edge AI applications. To address these issues, we propose the Flow Matching-based Foundation Model (FlowFM), which jointly trains a representation encoder and a conditional flow matching generator. This decoupled design achieves both high-fidelity generation and effective recognition. By using flow matching to learn a simpler velocity field, FlowFM accelerates and stabilizes training, improving its efficiency for representation learning. Experiments on wearable sensor data show FlowFM reduces training time by 50.4% compared to a diffusion-based approach. On downstream tasks, FlowFM surpassed the state-of-the-art SSL method (SSL-Wearables) on all five datasets while achieving up to a 51.0x inference speedup and maintaining high generative quality. The implementation code is available at https://github.com/Okita-Laboratory/jointOptimizationFlowMatching.

[218] ArcGen: Generalizing Neural Backdoor Detection Across Diverse Architectures

Zhonghao Yang, Cheng Luo, Daojing He, Yiming Li, Yu Li

Main category: cs.LG

TL;DR: ArcGen is a novel black-box neural backdoor detection method that learns architecture-invariant features to improve generalization to unseen model architectures, achieving up to 42.5% improvement in detection performance.

Details

Motivation: Existing learning-based neural backdoor detection methods fail to generalize well to new model architectures not seen during training, limiting their practical applicability in real-world scenarios.

Method: Proposes ArcGen with: 1) An alignment layer in feature extraction to reduce architecture influence, 2) Two alignment losses (distribution and sample level) to align features from models with similar backdoor behaviors but different architectures, enabling architecture-invariant feature learning.

Result: Achieves up to 42.5% improvement in detection performance (AUC) on unseen model architectures, validated through large-scale evaluation on 16,896 models across diverse datasets, backdoor attacks, and architectures.

Conclusion: ArcGen successfully addresses the generalization problem in neural backdoor detection by learning architecture-invariant features, making backdoor detection more robust and practical across different model architectures.

Abstract: Backdoor attacks pose a significant threat to the security and reliability of deep learning models. To mitigate such attacks, one promising approach is to learn to extract features from the target model and use these features for backdoor detection. However, we discover that existing learning-based neural backdoor detection methods do not generalize well to new architectures not seen during the learning phase. In this paper, we analyze the root cause of this issue and propose a novel black-box neural backdoor detection method called ArcGen. Our method aims to obtain architecture-invariant model features, i.e., aligned features, for effective backdoor detection. Specifically, in contrast to existing methods directly using model outputs as model features, we introduce an additional alignment layer in the feature extraction function to further process these features. This reduces the direct influence of architecture information on the features. Then, we design two alignment losses to train the feature extraction function. These losses explicitly require that features from models with similar backdoor behaviors but different architectures are aligned at both the distribution and sample levels. With these techniques, our method demonstrates up to 42.5% improvements in detection performance (e.g., AUC) on unseen model architectures. This is based on a large-scale evaluation involving 16,896 models trained on diverse datasets, subjected to various backdoor attacks, and utilizing different model architectures. Our code is available at https://github.com/SeRAlab/ArcGen.

[219] Exploring Deep-to-Shallow Transformable Neural Networks for Intelligent Embedded Systems

Xiangzhong Luo, Weichen Liu

Main category: cs.LG

TL;DR: Double-Win NAS is a novel neural architecture search paradigm that first searches for accurate deep networks, then transforms them into shallow equivalents for hardware efficiency on embedded systems.

Details

Motivation: Deep CNNs achieve high accuracy but suffer from poor hardware efficiency on resource-constrained embedded systems, while shallow networks offer better hardware efficiency but inferior accuracy. There's a need to bridge this accuracy-efficiency gap for ubiquitous embedded intelligence.

Method: A deep-to-shallow transformable NAS paradigm that: 1) Automatically explores deep networks for strong accuracy, 2) Equivalently transforms them into shallow counterparts for hardware efficiency, 3) Uses hybrid transformable training for better accuracy, and 4) Employs arbitrary-resolution elastic training for network elasticity across input resolutions.

Result: Extensive experiments on NVIDIA Jetson AGX Xavier and Jetson Nano with ImageNet and ImageNet-100 datasets demonstrate superiority over previous state-of-the-art NAS approaches in balancing accuracy and hardware efficiency.

Conclusion: Double-Win NAS successfully addresses the accuracy-efficiency dilemma for embedded systems by enabling deep networks to be transformed into shallow equivalents, achieving both strong accuracy and hardware efficiency for resource-constrained intelligent embedded systems.

Abstract: Thanks to the evolving network depth, convolutional neural networks (CNNs) have achieved remarkable success across various embedded scenarios, paving the way for ubiquitous embedded intelligence. Despite its promise, the evolving network depth comes at the cost of degraded hardware efficiency. In contrast to deep networks, shallow networks can deliver superior hardware efficiency but often suffer from inferior accuracy. To address this dilemma, we propose Double-Win NAS, a novel deep-to-shallow transformable neural architecture search (NAS) paradigm tailored for resource-constrained intelligent embedded systems. Specifically, Double-Win NAS strives to automatically explore deep networks to first win strong accuracy, which are then equivalently transformed into their shallow counterparts to further win strong hardware efficiency. In addition to search, we also propose two enhanced training techniques, including hybrid transformable training towards better training accuracy and arbitrary-resolution elastic training towards enabling natural network elasticity across arbitrary input resolutions. Extensive experimental results on two popular intelligent embedded systems (i.e., NVIDIA Jetson AGX Xavier and NVIDIA Jetson Nano) and two representative large-scale datasets (i.e., ImageNet and ImageNet-100) clearly demonstrate the superiority of Double-Win NAS over previous state-of-the-art NAS approaches.

[220] Leakage-Aware Bandgap Prediction on the JARVIS-DFT Dataset: A Phase-Wise Feature Analysis

Gaurav Kumar Sharma

Main category: cs.LG

TL;DR: Systematic analysis of JARVIS-DFT bandgap dataset with leakage control yields curated subset of 2280 materials; three-phase modeling shows tree-based models achieve R2 ~0.88-0.90, with dielectric tensor as dominant feature.

Details

Motivation: To address potential data leakage in bandgap prediction by identifying and removing descriptors that inadvertently encode band-structure information, creating a leakage-controlled dataset for more reliable machine learning models.

Method: Three-phase modeling framework: 1) basic physical descriptors, 2) engineered features, 3) compositional attributes. Uses systematic analysis to remove leakage-prone descriptors (like effective masses) from JARVIS-DFT dataset, resulting in curated subset of 2280 materials.

Result: Tree-based models achieve R2 values of approximately 0.88 to 0.90 across all phases. SHAP analysis consistently identifies dielectric tensor components as dominant contributors. Expanding descriptor space doesn’t substantially improve predictive accuracy when leakage is controlled.

Conclusion: Provides curated dataset and baseline performance metrics for future leakage-aware bandgap prediction studies, demonstrating that controlling data leakage is crucial for reliable machine learning models in materials science.

Abstract: In this study, we perform a systematic analysis of the JARVIS-DFT bandgap dataset and identify and remove descriptors that may inadvertently encode band-structure information, such as effective masses. This process yields a curated, leakage-controlled subset of 2280 materials. Using this dataset, a three-phase modeling framework is implemented that incrementally incorporates basic physical descriptors, engineered features, and compositional attributes. The results show that tree-based models achieve R2 values of approximately 0.88 to 0.90 across all phases, indicating that expanding the descriptor space does not substantially improve predictive accuracy when leakage is controlled. SHAP analysis consistently identifies the dielectric tensor components as the dominant contributors. This work provides a curated dataset and baseline performance metrics for future leakage-aware bandgap prediction studies.

[221] The Deleuzian Representation Hypothesis

Clément Cornet, Romaric Besançon, Hervé Le Borgne

Main category: cs.LG

TL;DR: A novel unsupervised method for extracting interpretable concepts from neural networks by clustering activation differences, outperforming sparse autoencoders and approaching supervised baselines.

Details

Motivation: To develop a simpler and more effective alternative to sparse autoencoders for unsupervised concept extraction from neural networks, addressing limitations of existing methods while providing interpretable concepts that can causally influence model behavior.

Method: Clusters differences in neural activations within a discriminant analysis framework, enhanced by weighting clustering using activation skewness to improve concept diversity. The approach is philosophically aligned with Deleuze’s view of concepts as differences.

Result: The method achieves superior concept quality compared to prior unsupervised SAE variants and approaches supervised baselines across five models and three modalities (vision, language, audio). Extracted concepts enable steering of model’s inner representations and demonstrate causal influence on downstream behavior.

Conclusion: The proposed clustering-based approach provides an effective unsupervised alternative to SAEs for concept extraction, offering high-quality, diverse concepts that can causally influence model behavior, with potential applications in model interpretability and control.

Abstract: We propose an alternative to sparse autoencoders (SAEs) as a simple and effective unsupervised method for extracting interpretable concepts from neural networks. The core idea is to cluster differences in activations, which we formally justify within a discriminant analysis framework. To enhance the diversity of extracted concepts, we refine the approach by weighting the clustering using the skewness of activations. The method aligns with Deleuze’s modern view of concepts as differences. We evaluate the approach across five models and three modalities (vision, language, and audio), measuring concept quality, diversity, and consistency. Our results show that the proposed method achieves concept quality surpassing prior unsupervised SAE variants while approaching supervised baselines, and that the extracted concepts enable steering of a model’s inner representations, demonstrating their causal influence on downstream behavior.

[222] Case Prompting to Mitigate Large Language Model Bias for ICU Mortality Prediction

Gangxiong Zhang, Yongchao Long

Main category: cs.LG

TL;DR: CAP framework improves ICU mortality prediction fairness and accuracy using case-based prompting without retraining LLMs.

Details

Motivation: LLMs show promise for ICU mortality prediction but exhibit demographic biases (sex, age, race) that limit trustworthy clinical use. Existing debiasing methods often reduce predictive performance, creating a fairness-accuracy tradeoff.

Method: Proposed CAse Prompting (CAP) - a training-free, clinically adaptive prompting framework that integrates conventional debiasing prompts with case-based reasoning. CAP guides models to learn from similar historical misprediction cases and correct outcomes, enabling correction of biased reasoning patterns. Includes multi-dimensional bias assessment scheme for comprehensive model diagnosis.

Result: On MIMIC-IV dataset: AUROC increased from 0.806 to 0.873, AUPRC from 0.497 to 0.694. Reduced sex- and race-related disparities by over 90%. Feature reliance analysis showed highly consistent attention patterns across demographic groups (similarity scores >0.98).

Conclusion: LLMs exhibit measurable bias in ICU mortality prediction. CAP effectively co-optimizes fairness and performance without retraining, offering a transferable paradigm for equitable clinical decision support.

Abstract: Accurate mortality risk prediction for intensive care unit (ICU) patients is essential for clinical decision-making. Although large language models (LLMs) show promise in predicting outcomes from structured medical data, their predictions may exhibit demographic biases related to sex, age, and race, limiting their trustworthy use in clinical practice. Existing debiasing methods often reduce predictive performance, making it difficult to jointly optimize fairness and accuracy. In this study, we systematically examine bias in LLM-based ICU mortality prediction and propose a training-free, clinically adaptive prompting framework to simultaneously improve fairness and performance. We first develop a multi-dimensional bias assessment scheme for comprehensive model diagnosis. Building on this analysis, we introduce CAse Prompting (CAP), a novel prompting framework that integrates conventional debiasing prompts with case-based reasoning. CAP guides the model to learn from similar historical misprediction cases and their correct outcomes, enabling correction of biased reasoning patterns. Experiments on the MIMIC-IV dataset show that CAP substantially improves both predictive accuracy and fairness. CAP increases AUROC from 0.806 to 0.873 and AUPRC from 0.497 to 0.694, while reducing sex- and race-related disparities by over 90%. Feature reliance analysis further indicates highly consistent attention patterns across demographic groups, with similarity scores exceeding 0.98. These results demonstrate that LLMs exhibit measurable bias in ICU mortality prediction, and that a carefully designed prompting framework can effectively co-optimize fairness and performance without retraining, offering a transferable paradigm for equitable clinical decision support.

[223] CoPHo: Classifier-guided Conditional Topology Generation with Persistent Homology

Gongli Xi, Ye Tian, Mengyu Yang, Zhenyu Zhao, Yuchao Zhang, Xiangyang Gong, Xirong Que, Wendong Wang

Main category: cs.LG

TL;DR: CoPHo: A method for conditional graph generation using classifier guidance with persistent homology to steer diffusion toward desired structural properties without retraining.

Details

Motivation: Topology data is scarce, requiring synthetic graph generation for testing. Existing diffusion methods either need retraining for each attribute (limiting real-time use) or use classifier guidance that ignores topology scale and constraints.

Method: CoPHo incorporates gradients from a pre-trained graph-level classifier into discrete reverse diffusion posterior. It builds persistent homology filtration over intermediate graphs and uses features as guidance signals at each denoising step to steer generation toward desired properties.

Result: Outperforms existing methods at matching target metrics on four generic/network datasets. Validates transferability on QM9 molecular dataset.

Conclusion: CoPHo enables conditional topology generation without retraining by leveraging persistent homology and classifier guidance, demonstrating effectiveness across diverse graph datasets.

Abstract: The structure of topology underpins much of the research on performance and robustness, yet available topology data are typically scarce, necessitating the generation of synthetic graphs with desired properties for testing or release. Prior diffusion-based approaches either embed conditions into the diffusion model, requiring retraining for each attribute and hindering real-time applicability, or use classifier-based guidance post-training, which does not account for topology scale and practical constraints. In this paper, we show from a discrete perspective that gradients from a pre-trained graph-level classifier can be incorporated into the discrete reverse diffusion posterior to steer generation toward specified structural properties. Based on this insight, we propose Classifier-guided Conditional Topology Generation with Persistent Homology (CoPHo), which builds a persistent homology filtration over intermediate graphs and interprets features as guidance signals that steer generation toward the desired properties at each denoising step. Experiments on four generic/network datasets demonstrate that CoPHo outperforms existing methods at matching target metrics, and we further validate its transferability on the QM9 molecular dataset.

[224] Simulation-Driven Railway Delay Prediction: An Imitation Learning Approach

Clément Elliker, Jesse Read, Sonia Vanier, Albert Bifet

Main category: cs.LG

TL;DR: DCIL is a self-supervised imitation learning method with drift correction for stochastic train delay forecasting, outperforming traditional models on real-world Belgian railway data.

Details

Motivation: Reliable train delay prediction is crucial for improving railway system robustness and efficiency. Current approaches need to better handle the sequential and uncertain nature of delay propagation in large-scale networks.

Method: Drift-Corrected Imitation Learning (DCIL) reframes delay forecasting as stochastic simulation using imitation learning with distance-based drift correction. It extends DAgger to mitigate covariate shift during rollouts without needing external oracles or adversarial schemes, combining event-driven model fidelity with data-driven representational capacity.

Result: DCIL demonstrates superior predictive performance over traditional regression models and behavioral cloning on deep learning architectures for up to 30-minute ahead predictions, using a comprehensive real-world dataset of over 3 million train movements from Belgian railways.

Conclusion: DCIL effectively captures the sequential and uncertain nature of delay propagation in large-scale railway networks, enabling uncertainty-aware forecasting through Monte Carlo simulation and showing practical value for railway transportation systems.

Abstract: Reliable prediction of train delays is essential for enhancing the robustness and efficiency of railway transportation systems. In this work, we reframe delay forecasting as a stochastic simulation task, modeling state-transition dynamics through imitation learning. We introduce Drift-Corrected Imitation Learning (DCIL), a novel self-supervised algorithm that extends DAgger by incorporating distance-based drift correction, thereby mitigating covariate shift during rollouts without requiring access to an external oracle or adversarial schemes. Our approach synthesizes the dynamical fidelity of event-driven models with the representational capacity of data-driven methods, enabling uncertainty-aware forecasting via Monte Carlo simulation. We evaluate DCIL using a comprehensive real-world dataset from \textsc{Infrabel}, the Belgian railway infrastructure manager, which encompasses over three million train movements. Our results, focused on predictions up to 30 minutes ahead, demonstrate superior predictive performance of DCIL over traditional regression models and behavioral cloning on deep learning architectures, highlighting its effectiveness in capturing the sequential and uncertain nature of delay propagation in large-scale networks.

[225] Brain-Grounded Axes for Reading and Steering LLM States

Sandro Andric

Main category: cs.LG

TL;DR: Using human brain activity as a coordinate system to derive interpretable axes for reading and steering LLM states, validated with independent lexica and showing robust lexical and function/content axes across multiple models.

Details

Motivation: Current LLM interpretability methods rely on textual supervision which lacks external grounding. The authors propose using human brain activity as a biologically-grounded coordinate system for interpreting and controlling LLM states.

Method: Constructed word-level brain atlas from SMN4Lang MEG dataset using phase-locking value patterns, extracted latent axes via ICA, validated with independent lexica and NER-based labels. Trained lightweight adapters to map LLM hidden states to brain axes without fine-tuning the LLM.

Result: Found robust lexical (frequency-linked) axis in mid TinyLlama layer, surviving perplexity-matched controls. Brain-derived directions showed larger log-frequency shifts than text probes with lower perplexity. Function/content axis (axis 13) showed consistent steering across TinyLlama, Qwen2-0.5B, and GPT-2. Axis structure remained stable across different embedding methods.

Conclusion: Neurophysiology-grounded axes provide interpretable and controllable handles for LLM behavior, offering a new interface that uses brain activity as a coordinate system rather than just a training signal.

Abstract: Interpretability methods for large language models (LLMs) typically derive directions from textual supervision, which can lack external grounding. We propose using human brain activity not as a training signal but as a coordinate system for reading and steering LLM states. Using the SMN4Lang MEG dataset, we construct a word-level brain atlas of phase-locking value (PLV) patterns and extract latent axes via ICA. We validate axes with independent lexica and NER-based labels (POS/log-frequency used as sanity checks), then train lightweight adapters that map LLM hidden states to these brain axes without fine-tuning the LLM. Steering along the resulting brain-derived directions yields a robust lexical (frequency-linked) axis in a mid TinyLlama layer, surviving perplexity-matched controls, and a brain-vs-text probe comparison shows larger log-frequency shifts (relative to the text probe) with lower perplexity for the brain axis. A function/content axis (axis 13) shows consistent steering in TinyLlama, Qwen2-0.5B, and GPT-2, with PPL-matched text-level corroboration. Layer-4 effects in TinyLlama are large but inconsistent, so we treat them as secondary (Appendix). Axis structure is stable when the atlas is rebuilt without GPT embedding-change features or with word2vec embeddings (|r|=0.64-0.95 across matched axes), reducing circularity concerns. Exploratory fMRI anchoring suggests potential alignment for embedding change and log frequency, but effects are sensitive to hemodynamic modeling assumptions and are treated as population-level evidence only. These results support a new interface: neurophysiology-grounded axes provide interpretable and controllable handles for LLM behavior.

[226] OpComm: A Reinforcement Learning Framework for Adaptive Buffer Control in Warehouse Volume Forecasting

Wilson Fung, Lu Guo, Drake Hilliard, Alessandro Casadei, Raj Ratan, Sreyoshi Bhaduri, Adi Surve, Nikhil Agarwal, Rohit Malshe, Pavan Mullapudi, Hungjen Wang, Saurabh Doodhwala, Ankush Pole, Arkajit Rakshit

Main category: cs.LG

TL;DR: OpComm is a forecasting framework combining LightGBM for demand prediction, PPO reinforcement learning for buffer control, and generative AI for interpretability, reducing forecasting errors by 21.65% in last-mile logistics.

Details

Motivation: Accurate package volume forecasting is critical for last-mile logistics to avoid inefficient resource allocation, higher costs, and delivery delays caused by forecasting errors.

Method: Combines supervised learning (LightGBM regression for station-level demand forecasts) with reinforcement learning (PPO agent for buffer control using discrete action set), plus generative AI communication module for interpretability with SHAP-based feature attributions and Monte Carlo update mechanism for continual policy adaptation.

Result: Across 400+ stations, OpComm reduced Weighted Absolute Percentage Error (WAPE) by 21.65% compared to manual forecasts, while lowering under-buffering incidents and improving transparency for decision-makers.

Conclusion: This work demonstrates how contextual reinforcement learning coupled with predictive modeling can address operational forecasting challenges and bridge statistical rigor with practical decision-making in high-stakes logistics environments.

Abstract: Accurate forecasting of package volumes at delivery stations is critical for last-mile logistics, where errors lead to inefficient resource allocation, higher costs, and delivery delays. We propose OpComm, a forecasting and decision-support framework that combines supervised learning with reinforcement learning-based buffer control and a generative AI-driven communication module. A LightGBM regression model generates station-level demand forecasts, which serve as context for a Proximal Policy Optimization (PPO) agent that selects buffer levels from a discrete action set. The reward function penalizes under-buffering more heavily than over-buffering, reflecting real-world trade-offs between unmet demand risks and resource inefficiency. Station outcomes are fed back through a Monte Carlo update mechanism, enabling continual policy adaptation. To enhance interpretability, a generative AI layer produces executive-level summaries and scenario analyses grounded in SHAP-based feature attributions. Across 400+ stations, OpComm reduced Weighted Absolute Percentage Error (WAPE) by 21.65% compared to manual forecasts, while lowering under-buffering incidents and improving transparency for decision-makers. This work shows how contextual reinforcement learning, coupled with predictive modeling, can address operational forecasting challenges and bridge statistical rigor with practical decision-making in high-stakes logistics environments.

[227] Learning to Reason in LLMs by Expectation Maximization

Junghyun Lee, Branislav Kveton, Sunav Choudhary, Subhojyoti Mukherjee, Anup Rao, Ryan A. Rossi, Alexa Siu

Main category: cs.LG

TL;DR: The paper formalizes reasoning as a latent variable model, derives an EM objective for learning to reason, connects EM to reward-based optimization, and shows that sampling scheme design is crucial for generating rationales that justify correct answers.

Details

Motivation: To improve reasoning in LLMs by formalizing the reasoning process as a latent variable model and developing better methods for generating rationales that lead to correct answers.

Method: Formalizes reasoning as a latent variable model, derives EM objective, connects EM to reward-based optimization, and compares sampling schemes: rejection sampling with budget, STaR, and PPS (which keeps only rationalization stage of STaR).

Result: Experiments on ARC, MMLU, and OpenBookQA datasets with Llama and Qwen models show sampling scheme significantly affects accuracy, with PPS outperforming other sampling schemes despite its simplicity.

Conclusion: The design of sampling distribution for generating rationales is crucial for learning to reason, and PPS provides an effective approach that outperforms more complex methods.

Abstract: Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive an expectation-maximization (EM) objective for learning to reason. This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution that generates rationales that justify correct answers. We instantiate and compare several sampling schemes: rejection sampling with a budget, self-taught reasoner (STaR), and prompt posterior sampling (PPS), which only keeps the rationalization stage of STaR. Our experiments on the ARC, MMLU, and OpenBookQA datasets with the Llama and Qwen models show that the sampling scheme can significantly affect the accuracy of learned reasoning models. Despite its simplicity, we observe that PPS outperforms the other sampling schemes.

[228] Asia Cup 2025: A Structured T20 Match-Level Dataset and Exploratory Analysis for Cricket Analytics

Kousar Raza, Faizan Ali

Main category: cs.LG

TL;DR: A comprehensive cricket dataset for the 2025 Asia Cup T20 tournament with 19 matches and 61 variables, released publicly for sports analytics research.

Details

Motivation: To provide an open, structured dataset for cricket analytics research, addressing the need for comprehensive, machine-readable data to support data-driven analysis in sports.

Method: Created a structured dataset covering all 19 matches of the 2025 Asia Cup T20 tournament with 61 variables including team scores, wickets, powerplay stats, boundaries, toss decisions, venues, and player highlights. The dataset is publicly released on Zenodo under CC-BY 4.0 license.

Result: Produced a comprehensive cricket dataset with demonstrated analytical value through exploratory data analysis focusing on team performance indicators, boundary distributions, and scoring patterns.

Conclusion: This work provides an open benchmark dataset for advancing cricket analytics research, supporting reproducibility, predictive modeling, and strategic decision-making in sports.

Abstract: This paper presents a structured and comprehensive dataset corresponding to the 2025 Asia Cup T20 cricket tournament, designed to facilitate data-driven research in sports analytics. The dataset comprises records from all 19 matches of the tournament and includes 61 variables covering team scores, wickets, powerplay statistics, boundary counts, toss decisions, venues, and player-specific highlights. To demonstrate its analytical value, we conduct an exploratory data analysis focusing on team performance indicators, boundary distributions, and scoring patterns. The dataset is publicly released through Zenodo under a CC-BY 4.0 license to support reproducibility and further research in cricket analytics, predictive modeling, and strategic decision-making. This work contributes an open, machine-readable benchmark dataset for advancing cricket analytics research.

[229] EdgeFlex-Transformer: Transformer Inference for Edge Devices

Shoaib Mohammad, Guanqun Song, Ting Zhu

Main category: cs.LG

TL;DR: Lightweight multi-stage optimization pipeline compresses Vision Transformers for edge deployment, achieving 76% memory reduction and 6x lower latency while maintaining accuracy.

Details

Motivation: Deploying large transformer models on edge devices is challenging due to strict memory, compute, and latency constraints. There's a need for efficient compression techniques that don't require costly retraining.

Method: Multi-stage pipeline combining activation profiling, memory-aware pruning, selective mixed-precision execution (FP16), and activation-aware quantization (AWQ). Uses forward hooks for activation statistics, structured pruning on MLP layers, and INT8 quantization.

Result: Compressed ViT-Huge (632M params) achieves 76% reduction in peak memory usage and over 6x lower latency on CIFAR-10, while retaining or even improving accuracy compared to FP32 baseline.

Conclusion: The framework provides a practical path for efficient transformer inference on edge platforms and opens avenues for integrating dynamic sparsity and Mixture-of-Experts architectures for further scaling.

Abstract: Deploying large-scale transformer models on edge devices presents significant challenges due to strict constraints on memory, compute, and latency. In this work, we propose a lightweight yet effective multi-stage optimization pipeline designed to compress and accelerate Vision Transformers (ViTs) for deployment in resource-constrained environments. Our methodology combines activation profiling, memory-aware pruning, selective mixed-precision execution, and activation-aware quantization (AWQ) to reduce the model’s memory footprint without requiring costly retraining or task-specific fine-tuning. Starting from a ViT-Huge backbone with 632 million parameters, we first identify low-importance channels using activation statistics collected via forward hooks, followed by structured pruning to shrink the MLP layers under a target memory budget. We further apply FP16 conversion to selected components and leverage AWQ to quantize the remaining model weights and activations to INT8 with minimal accuracy degradation. Our experiments on CIFAR-10 demonstrate that the fully optimized model achieves a 76% reduction in peak memory usage and over 6x lower latency, while retaining or even improving accuracy compared to the original FP32 baseline. This framework offers a practical path toward efficient transformer inference on edge platforms, and opens future avenues for integrating dynamic sparsity and Mixture-of-Experts (MoE) architectures to further scale performance across diverse tasks.

Md Shakhrul Iman Siam, Ishtiaque Ahmed Showmik, Guanqun Song, Ting Zhu

Main category: cs.LG

TL;DR: A Large Multi-Modal Agent for Human Activity Recognition that leverages LLMs to enhance both classification performance and interpretability through reasoning and Q&A capabilities.

Details

Motivation: While Human Activity Recognition (HAR) has been actively researched for applications in healthcare and smart environments, there's a need to bridge the gap between technical classification outputs and user-friendly insights. Recent LLM advancements offer opportunities to enhance HAR with interpretability and human-like interaction capabilities.

Method: Proposes a Large Multi-Modal Agent framework that integrates LLMs for HAR. The framework combines activity classification with reasoning and question-answering capabilities to provide interpretable outputs and user engagement.

Result: Extensive evaluations on widely adopted HAR datasets (HHAR, Shoaib, Motionsense) show the model achieves high classification accuracy comparable to state-of-the-art methods while significantly improving interpretability through its reasoning and Q&A capabilities.

Conclusion: The proposed Large Multi-Modal Agent successfully enhances HAR by leveraging LLMs to provide both accurate activity classification and improved interpretability, bridging the gap between technical outputs and user-friendly insights.

Abstract: Human Activity Recognition (HAR) has been an active area of research, with applications ranging from healthcare to smart environments. The recent advancements in Large Language Models (LLMs) have opened new possibilities to leverage their capabilities in HAR, enabling not just activity classification but also interpretability and human-like interaction. In this paper, we present a Large Multi-Modal Agent designed for HAR, which integrates the power of LLMs to enhance both performance and user engagement. The proposed framework not only delivers activity classification but also bridges the gap between technical outputs and user-friendly insights through its reasoning and question-answering capabilities. We conduct extensive evaluations using widely adopted HAR datasets, including HHAR, Shoaib, Motionsense to assess the performance of our framework. The results demonstrate that our model achieves high classification accuracy comparable to state-of-the-art methods while significantly improving interpretability through its reasoning and Q&A capabilities.

[231] From Theory to Throughput: CUDA-Optimized APML for Large-Batch 3D Learning

Sasan Sharifipour, Constantino Álvarez Casado, Manuel Lage Cañellas, Miguel Bordallo López

Main category: cs.LG

TL;DR: CUDA-APML is a sparse GPU implementation of APML that reduces memory usage by 99.9% while maintaining accuracy for 3D point cloud learning.

Details

Motivation: Existing loss functions for 3D point clouds trade geometric fidelity for computational cost. Chamfer Distance is efficient but allows many-to-one correspondences, while Earth Mover Distance is accurate but computationally expensive. APML approximates transport but has quadratic memory scaling.

Method: CUDA-APML implements sparse GPU operations: thresholds negligible assignments, runs adaptive softmax, bidirectional symmetrization, and Sinkhorn normalization directly in COO format. This enables near-linear memory scaling while preserving gradients on stored support.

Result: On ShapeNet and MM-Fi datasets, CUDA-APML matches dense APML accuracy within small tolerance while reducing peak GPU memory by 99.9%. The implementation maintains near-linear memory scaling.

Conclusion: CUDA-APML provides an efficient sparse implementation of APML that dramatically reduces memory requirements while preserving accuracy, making transport-based loss functions more practical for 3D point cloud learning.

Abstract: Loss functions are fundamental to learning accurate 3D point cloud models, yet common choices trade geometric fidelity for computational cost. Chamfer Distance is efficient but permits many-to-one correspondences, while Earth Mover Distance better reflects one-to-one transport at high computational cost. APML approximates transport with differentiable Sinkhorn iterations and an analytically derived temperature, but its dense formulation scales quadratically in memory. We present CUDA-APML, a sparse GPU implementation that thresholds negligible assignments and runs adaptive softmax, bidirectional symmetrization, and Sinkhorn normalization directly in COO form. This yields near-linear memory scaling and preserves gradients on the stored support, while pairwise distance evaluation remains quadratic in the current implementation. On ShapeNet and MM-Fi, CUDA-APML matches dense APML within a small tolerance while reducing peak GPU memory by 99.9%. Code available at: https://github.com/Multimodal-Sensing-Lab/apml

[232] DeepBridge: A Unified and Production-Ready Framework for Multi-Dimensional Machine Learning Validation

Gustavo Coelho Haase, Paulo Henrique Dourado da Silva

Main category: cs.LG

TL;DR: DeepBridge is an 80K-line Python library that unifies multi-dimensional validation, automatic compliance verification, knowledge distillation, and synthetic data generation for AI systems, reducing validation time by 89% and improving fairness detection coverage.

Details

Motivation: The paper addresses the fragmented landscape of AI validation tools, where practitioners need to use multiple separate tools for fairness, robustness, compliance, and other validation aspects, leading to inefficiency and incomplete coverage.

Method: DeepBridge provides a unified Python library with: 1) 5 validation suites (fairness, robustness, uncertainty, resilience, hyperparameter sensitivity), 2) automatic EEOC/ECOA/GDPR compliance verification, 3) multi-format reporting system, 4) HPM-KD framework for knowledge distillation with meta-learning, and 5) scalable synthetic data generation via Dask.

Result: Through 6 case studies, DeepBridge reduces validation time by 89% (17 min vs. 150 min), detects fairness violations with complete coverage (10/10 features vs. 2/10 from existing tools), generates audit-ready reports in minutes. HPM-KD shows consistent superiority across compression ratios 2.3-7x on CIFAR100 (+1.00-2.04pp vs. Direct Training). Usability study shows SUS score 87.5 (top 10%), 95% success rate, and low cognitive load.

Conclusion: DeepBridge successfully unifies fragmented AI validation workflows into a single comprehensive library, significantly improving efficiency, coverage, and usability while maintaining open-source accessibility under MIT license.

Abstract: We present DeepBridge, an 80K-line Python library that unifies multi-dimensional validation, automatic compliance verification, knowledge distillation, and synthetic data generation. DeepBridge offers: (i) 5 validation suites (fairness with 15 metrics, robustness with weakness detection, uncertainty via conformal prediction, resilience with 5 drift types, hyperparameter sensitivity), (ii) automatic EEOC/ECOA/GDPR verification, (iii) multi-format reporting system (interactive/static HTML, PDF, JSON), (iv) HPM-KD framework for knowledge distillation with meta-learning, and (v) scalable synthetic data generation via Dask. Through 6 case studies (credit scoring, hiring, healthcare, mortgage, insurance, fraud) we demonstrate that DeepBridge: reduces validation time by 89% (17 min vs. 150 min with fragmented tools), automatically detects fairness violations with complete coverage (10/10 features vs. 2/10 from existing tools), generates audit-ready reports in minutes. HPM-KD demonstrates consistent superiority across compression ratios 2.3–7x (CIFAR100): +1.00–2.04pp vs. Direct Training (p<0.05), confirming that Knowledge Distillation is effective at larger teacher-student gaps. Usability study with 20 participants shows SUS score 87.5 (top 10%, ``excellent’’), 95% success rate, and low cognitive load (NASA-TLX 28/100). DeepBridge is open-source under MIT license at https://github.com/deepbridge/deepbridge, with complete documentation at https://deepbridge.readthedocs.io

[233] How Many Experts Are Enough? Towards Optimal Semantic Specialization for Mixture-of-Experts

Sumin Park, Noseong Park

Main category: cs.LG

TL;DR: MASS is a semantic-aware MoE framework that adaptively expands experts and dynamically routes tokens based on semantic specialization needs, outperforming existing MoE methods across language and vision domains.

Details

Motivation: Existing Sparse Mixture-of-Experts frameworks either require extensive hyperparameter tuning or fail to properly diversify semantic roles across experts when adapting expert pool size, limiting their ability to fully exploit MoE architectures' potential.

Method: MASS introduces: 1) gradient-based semantic drift detector that triggers expert expansion when existing experts lack capacity to capture data diversity, and 2) adaptive routing strategy that dynamically adjusts expert usage based on token-level routing confidence.

Result: MASS reliably converges to optimal cost-performance trade-off with improved semantic specialization in synthetic setups, and consistently outperforms strong MoE baselines on real-world datasets across language and vision domains.

Conclusion: MASS demonstrates domain robustness and enhanced expert specialization through its semantic-aware adaptive expansion and dynamic routing mechanisms, providing an effective solution for optimizing Sparse Mixture-of-Experts architectures.

Abstract: Finding the optimal configuration of Sparse Mixture-ofExperts (SMoE) that maximizes semantic differentiation among experts is essential for exploiting the full potential of MoE architectures. However, existing SMoE frameworks either heavily rely on hyperparameter tuning or overlook the importance of diversifying semantic roles across experts when adapting the expert pool size. We propose Mixture-of-Experts for Adaptive Semantic Specialization (MASS), a semanticaware MoE framework for adaptive expert expansion and dynamic routing. MASS introduces two key advancements: (i) a gradient-based semantic drift detector that prompts targeted expert expansion when the existing expert pool lacks capacity to capture the full semantic diversity of the data, and (ii) an integration of adaptive routing strategy that dynamically adjusts expert usage based on token-level routing confidence mass. We first demonstrate that MASS reliably converges to the point of optimal balance between cost-performance trade-off with notably improved sematic specialization in a highly controlled synthetic setup. Further empirical results on real-world datasets across language and vision domains show that MASS consistently outperforms a range of strong MoE baselines, demonstrating its domain robustness and enhanced expert specialization.

[234] A K-Means, Ward and DBSCAN repeatability study

Anthony Bertrand, Engelbert Mephu Nguifo, Violaine Antoine, David Hill

Main category: cs.LG

TL;DR: Analysis of reproducibility issues in popular clustering algorithms (K-Means, DBSCAN, Ward) showing inconsistent results with K-Means when OpenMP threads exceed two, using scikit-learn implementation.

Details

Motivation: Reproducibility is essential for scientific integrity in machine learning, allowing debugging and ensuring consistent scientific conclusions. Bitwise identical results are crucial for specific algorithms.

Method: Decomposed K-Means, DBSCAN, and Ward clustering algorithms into fundamental steps, identified conditions for repeatability at each stage, and examined implementations using Python’s scikit-learn library.

Result: Found inconsistent results with K-Means when OpenMP threads exceed two, revealing reproducibility issues in popular clustering algorithms.

Conclusion: The work aims to raise awareness about reproducibility issues among users and developers, encouraging further investigation and potential fixes for clustering algorithms.

Abstract: Reproducibility is essential in machine learning because it ensures that a model or experiment yields the same scientific conclusion. For specific algorithms repeatability with bitwise identical results is also a key for scientific integrity because it allows debugging. We decomposed several very popular clustering algorithms: K-Means, DBSCAN and Ward into their fundamental steps, and we identify the conditions required to achieve repeatability at each stage. We use an implementation example with the Python library scikit-learn to examine the repeatable aspects of each method. Our results reveal inconsistent results with K-Means when the number of OpenMP threads exceeds two. This work aims to raise awareness of this issue among both users and developers, encouraging further investigation and potential fixes.

[235] Reduced Order Modeling for Tsunami Forecasting with Bayesian Hierarchical Pooling

Shane X. Coffing, John Tipton, Arvind T. Mohan, Darren Engwirda

Main category: cs.LG

TL;DR: Proposes randPROM - a reduced order model using neural Galerkin projections and hierarchical pooling to generate statistically interpretable, physics-based simulations for tsunamis with uncertainty quantification.

Details

Motivation: Traditional ROMs are constrained to fixed weights for specific processes, limiting generalization. Need physics-based surrogates that can generate statistically calibrated predictions for similar problems with uncertainty quantification, especially for unpredictable catastrophic events like tsunamis.

Method: Uses neural Galerkin projections to create ROMs with learnable weights, then applies statistical hierarchical pooling to learn distributions on initial values of temporal weights. This creates generalized, statistically interpretable weights that can be recombined with spatial features to form randPROM - a complete physics surrogate.

Result: Applied to synthetic tsunamis near Fiji and real-world Tohoku 2011 disaster. Demonstrates significant reduction in simulations needed for statistically calibrated predictions of tsunami wave arrival time and height while maintaining physical consistency.

Conclusion: randPROM enables generation of physics-based simulations consistent with distribution of initial conditions, providing statistically defensible predictions for complex nonlinear problems like tsunamis with uncertainty quantification.

Abstract: Reduced order models (ROM) can represent spatiotemporal processes in significantly fewer dimensions and can be solved many orders faster than their governing partial differential equations (PDEs). For example, using a proper orthogonal decomposition produces a ROM that is a small linear combination of fixed features and weights, but that is constrained to the given process it models. In this work, we explore a new type of ROM that is not constrained to fixed weights, based on neural Galerkin-Projections, which is an initial value problem that encodes the physics of the governing PDEs, calibrated via neural networks to accurately model the trajectory of these weights. Then using a statistical hierarchical pooling technique to learn a distribution on the initial values of the temporal weights, we can create new, statistically interpretable and physically justified weights that are generalized to many similar problems. When recombined with the spatial features, we form a complete physics surrogate, called a randPROM, for generating simulations that are consistent in distribution to a neighborhood of initial conditions close to those used to construct the ROM. We apply the randPROM technique to the study of tsunamis, which are unpredictable, catastrophic, and highly-detailed non-linear problems, modeling both a synthetic case of tsunamis near Fiji and the real-world Tohoku 2011 disaster. We demonstrate that randPROMs may enable us to significantly reduce the number of simulations needed to generate a statistically calibrated and physically defensible prediction model for arrival time and height of tsunami waves.

[236] Guardrailed Uplift Targeting: A Causal Optimization Playbook for Marketing Strategy

Deepit Sapru

Main category: cs.LG

TL;DR: A marketing framework that uses uplift modeling and constrained optimization to target customers with offers while respecting business constraints like budget and customer experience.

Details

Motivation: Marketers need to maximize revenue and retention while respecting business constraints (budget, acceptable sales deterioration, customer experience). Traditional approaches like propensity scoring don't optimize for treatment effects or handle constraints effectively.

Method: 1) Estimate Conditional Average Treatment Effects (CATE) using uplift learners. 2) Solve constrained allocation optimization to decide who to target and which offer to deploy under limits (budget, acceptable sales deterioration).

Result: Outperforms propensity and static baselines in offline evaluations (uplift AUC, IPS, SNIPS). Production A/B test validates strategic lift on revenue and completion while preserving customer-experience constraints.

Conclusion: Provides a reusable playbook for marketers to operationalize causal targeting at scale, set guardrails, and align campaigns with strategic KPIs.

Abstract: This paper introduces a marketing decision framework that converts heterogeneous-treatment uplift into constrained targeting strategies to maximize revenue and retention while honoring business guardrails. The approach estimates Conditional Average Treatment Effects (CATE) with uplift learners and then solves a constrained allocation to decide who to target and which offer to deploy under limits such as budget or acceptable sales deterioration. Applied to retention messaging, event rewards, and spend-threshold assignment, the framework consistently outperforms propensity and static baselines in offline evaluations using uplift AUC, Inverse Propensity Scoring (IPS), and Self-Normalized IPS (SNIPS). A production-scale online A/B test further validates strategic lift on revenue and completion while preserving customer-experience constraints. The result is a reusable playbook for marketers to operationalize causal targeting at scale, set guardrails, and align campaigns with strategic KPIs.

[237] Fine-Tuned In-Context Learners for Efficient Adaptation

Jorg Bornschein, Clare Lyle, Yazhe Li, Amal Rannen-Triki, Xu Owen He, Razvan Pascanu

Main category: cs.LG

TL;DR: A unified approach combining fine-tuning with in-context learning that outperforms both methods individually, especially in low-data regimes.

Details

Motivation: Prompt engineering with in-context learning works well with few examples but plateaus with more data, while fine-tuning scales with data but underperforms with scarce examples. The paper aims to bridge these two paradigms.

Method: Fine-tune models on task-specific data augmented with in-context examples (mimicking k-shot prompts), using prequential evaluation for hyperparameter selection in low-data regimes to avoid expensive cross-validation.

Result: The unified approach consistently matches and often significantly exceeds both fine-tuning and in-context learning baselines across various downstream tasks.

Conclusion: Combining in-context learning with fine-tuning creates a method that leverages the sample efficiency of in-context learning while achieving the performance gains of fine-tuning, offering superior adaptation for LLMs across different data regimes.

Abstract: When adapting large language models (LLMs) to a specific downstream task, two primary approaches are commonly employed: (1) prompt engineering, often with in-context few-shot learning, leveraging the model’s inherent generalization abilities, and (2) fine-tuning on task-specific data, directly optimizing the model’s parameters. While prompt-based methods excel in few-shot scenarios, their effectiveness often plateaus as more data becomes available. Conversely, fine-tuning scales well with data but may underperform when training examples are scarce. We investigate a unified approach that bridges these two paradigms by incorporating in-context learning directly into the fine-tuning process. Specifically, we fine-tune the model on task-specific data augmented with in-context examples, mimicking the structure of k-shot prompts. This approach, while requiring per-task fine-tuning, combines the sample efficiency of in-context learning with the performance gains of fine-tuning, leading to a method that consistently matches and often significantly exceeds both these baselines. To perform hyperparameter selection in the low-data regime, we propose to use prequential evaluation, which eliminates the need for expensive cross-validation and leverages all available data for training while simultaneously providing a robust validation signal. We conduct an extensive empirical study to determine which adaptation paradigm - fine-tuning, in-context learning, or our proposed unified approach offers the best predictive performance on a concrete data downstream-tasks.

[238] Demystifying LLM-as-a-Judge: Analytically Tractable Model for Inference-Time Scaling

Indranil Halder, Cengiz Pehlevan

Main category: cs.LG

TL;DR: This paper analyzes inference-time scaling in LLMs using a Bayesian linear regression model, showing how generalization error changes with inference-time samples (k), reward specification, and temperature, with theoretical bounds and experimental validation.

Details

Motivation: While LLMs show benefits from shifting computation from training to inference time, the principles behind inference-time scaling are not well understood. The paper aims to provide an analytical framework to study how inference-time computation affects generalization performance.

Method: Uses Bayesian linear regression with reward-weighted sampling, modeling LLM-as-a-judge scenario. Analyzes in high-dimensional regime using deterministic equivalents. Studies generalization error with training data from teacher model, drawing k inference-time samples with softmax selection at temperature applied to quadratic reward.

Result: When reward matches teacher, generalization error decreases monotonically with k (Θ(1/k²) in best-of-k limit). Reward misspecification can cause optimal finite k beyond which more sampling increases error. Optimal sampling temperature exists for fixed k. Inference-time compute advantage degrades with increasing task difficulty.

Conclusion: The paper provides theoretical framework for understanding inference-time scaling, showing conditions where scaling inference compute is preferable to collecting more data, while highlighting limitations from reward misspecification and task difficulty.

Abstract: Recent developments in large language models have shown advantages in reallocating a notable share of computational resource from training time to inference time. However, the principles behind inference time scaling are not well understood. In this paper, we introduce an analytically tractable model of inference-time scaling: Bayesian linear regression with a reward-weighted sampler, where the reward is determined from a linear model, modeling LLM-as-a-judge scenario. We study this problem in the high-dimensional regime, where the deterministic equivalents dictate a closed-form expression for the posterior predictive mean and variance. We analyze the generalization error when training data are sampled from a teacher model. We draw $k$ inference-time samples and select via softmax at a temperature applied to a quadratic reward. When the reward is not too different from the teacher, the generalization error decreases monotonically with increasing inference time samples $k$. However, the specific reward that optimizes inference-time selection generally differs from the teacher. In contrast, substantial reward misspecification induces a finite optimal $k$ beyond which more sampling can increase the generalization error. For fixed $k$, there exists an optimal sampling temperature. We experimentally verify these facts in large language model inference with an additional large language model as a judge. In the “best-of-$k$” limit with the teacher as reward, we theoretically show that the generalization error decays as $Θ(1/k^2)$ and determine the leading coefficient via extreme value theory. These formulas delineate domains where scaling inference-time computation is provably preferable to collecting more data. Finally, we demonstrate that when task difficulty increases, the previously mentioned advantage of inference-time compute degrades.

[239] Modeling Non-Ergodic Path Effects Using Conditional Generative Model for Fourier Amplitude Spectra

Maxime Lacour, Pu Ren, Rie Nakata, Nori Nakata, Michael Mahoney

Main category: cs.LG

TL;DR: Deep learning approach (CGM-FAS) replaces Gaussian Process methods for non-ergodic ground motion modeling, offering faster predictions and better spatial pattern learning without prescribed correlation functions.

Details

Motivation: Current non-ergodic ground-motion models use Gaussian Process methods with computational limitations for large-scale predictions, requiring prescribed correlation functions that may not capture complex spatial patterns.

Method: CGM-FAS uses Conditional Variational Autoencoder architecture to learn spatial patterns and interfrequency correlations directly from data, using earthquake and station coordinates as conditional variables.

Result: CGM-FAS produces consistent predictions with GP-based models while offering faster computation (10,000 sites × 1,000 frequencies in 10 seconds) and better spatial pattern learning without prescribed correlation functions.

Conclusion: CGM-FAS demonstrates a promising deep learning alternative to GP-based methods for efficient non-ergodic ground-motion prediction across large spatial domains and multiple frequencies.

Abstract: Recent developments in non-ergodic ground-motion models (GMMs) explicitly model systematic spatial variations in source, site, and path effects, reducing standard deviation to 30-40% of ergodic models and enabling more accurate site-specific seismic hazard analysis. Current non-ergodic GMMs rely on Gaussian Process (GP) methods with prescribed correlation functions and thus have computational limitations for large-scale predictions. This study proposes a deep-learning approach called Conditional Generative Modeling for Fourier Amplitude Spectra (CGM-FAS) as an alternative to GP-based methods for modeling non-ergodic path effects in Fourier Amplitude Spectra (FAS). CGM-FAS uses a Conditional Variational Autoencoder architecture to learn spatial patterns and interfrequency correlation directly from data by using geographical coordinates of earthquakes and stations as conditional variables. Using San Francisco Bay Area earthquake data, we compare CGM-FAS against a recent GP-based GMM for the region and demonstrate consistent predictions of non-ergodic path effects. Additionally, CGM-FAS offers advantages compared to GP-based approaches in learning spatial patterns without prescribed correlation functions, capturing interfrequency correlations, and enabling rapid predictions, generating maps for 10,000 sites across 1,000 frequencies within 10 seconds using a few GB of memory. CGM-FAS hyperparameters can be tuned to ensure generated path effects exhibit variability consistent with the GP-based empirical GMM. This work demonstrates a promising direction for efficient non-ergodic ground-motion prediction across multiple frequencies and large spatial domains.

[240] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning

Jiayun Wu, Jiashuo Liu, Zhiyuan Zeng, Tianyang Zhan, Wenhao Huang

Main category: cs.LG

TL;DR: This paper addresses LLM hallucinations by proposing behavioral calibration methods that incentivize models to admit uncertainty through abstention, enabling smaller models to surpass frontier models in uncertainty quantification despite lower factual accuracy.

Details

Motivation: LLM hallucinations impede deployment in critical domains. Current training objectives prioritize mimicking data distribution over epistemic honesty, and RLHF paradigms inadvertently incentivize models to guess whenever correctness probability exceeds zero rather than being honest communicators.

Method: Proposes behavioral calibration through training interventions that optimize strictly proper scoring rules, enabling models to output calibrated probability of correctness. Methods allow models to either abstain from complete responses or flag uncertain individual claims. Uses Qwen3-4B-Instruct for empirical analysis.

Result: Behavior-calibrated RL allows smaller models to surpass frontier models in uncertainty quantification. On math reasoning (BeyondAIME), log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5’s (0.207). In factual QA (SimpleQA), 4B LLM achieves zero-shot calibration error on par with frontier models despite much lower factual accuracy.

Conclusion: Behavioral calibration enables models to become honest communicators by admitting uncertainty, creating a transferable meta-skill decoupled from raw predictive accuracy. This approach addresses fundamental limitations of current LLM training paradigms and enables safer deployment in critical domains.

Abstract: LLM deployment in critical domains is currently impeded by persistent hallucinations–generating plausible but factually incorrect assertions. While scaling laws drove significant improvements in general capabilities, theoretical frameworks suggest hallucination is not merely stochastic error but a predictable statistical consequence of training objectives prioritizing mimicking data distribution over epistemic honesty. Standard RLVR paradigms, utilizing binary reward signals, inadvertently incentivize models as good test-takers rather than honest communicators, encouraging guessing whenever correctness probability exceeds zero. This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when not confident, aligning model behavior with accuracy. Synthesizing recent advances, we propose and evaluate training interventions optimizing strictly proper scoring rules for models to output a calibrated probability of correctness. Our methods enable models to either abstain from producing a complete response or flag individual claims where uncertainty remains. Utilizing Qwen3-4B-Instruct, empirical analysis reveals behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification–a transferable meta-skill decouplable from raw predictive accuracy. Trained on math reasoning tasks, our model’s log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5’s (0.207) in a challenging in-domain evaluation (BeyondAIME). Moreover, in cross-domain factual QA (SimpleQA), our 4B LLM achieves zero-shot calibration error on par with frontier models including Grok-4 and Gemini-2.5-Pro, even though its factual accuracy is much lower.

[241] The Seismic Wavefield Common Task Framework

Alexey Yermakov, Yue Zhao, Marine Denolle, Yiyu Ni, Philippe M. Wyder, Judah Goldfeder, Stefano Riva, Jan Williams, David Zoro, Amy Sara Rude, Matteo Tomasetto, Joe Germany, Joseph Bakarji, Georg Maierhofer, Miles Cranmer, J. Nathan Kutz

Main category: cs.LG

TL;DR: The paper introduces a Common Task Framework (CTF) for machine learning in seismology to standardize evaluation of wavefield forecasting, reconstruction, and generalization tasks across multiple datasets.

Details

Motivation: Seismology faces challenges in state forecasting/reconstruction (earthquake early warning, ground motion prediction) and managing parametric variability. Traditional simulations are computationally expensive, while real-data approaches are limited by Earth complexity and sparse measurements. Existing ML efforts lack proper characterization, fair reporting, and rigorous comparisons.

Method: Introduces a Common Task Framework (CTF) inspired by frameworks in fields like NLP. Features three curated wavefield datasets at different scales (global, crustal, local) with task-specific metrics for forecasting, reconstruction, and generalization under realistic constraints (noise, limited data). Provides structured foundation for head-to-head algorithm evaluation with hidden test sets.

Result: CTF scores are reported for two datasets, showcasing performance of various methods and foundation models for reconstructing seismic wavefields from both simulated and real-world sensor measurements. Scores reveal strengths, limitations, and suitability for specific problem classes.

Conclusion: The CTF aims to replace ad hoc comparisons with standardized evaluations, raising the bar for rigor and reproducibility in scientific ML for seismology. Provides a structured foundation for fair algorithm comparison and progress tracking in the field.

Abstract: Seismology faces fundamental challenges in state forecasting and reconstruction (e.g., earthquake early warning and ground motion prediction) and managing the parametric variability of source locations, mechanisms, and Earth models (e.g., subsurface structure and topography effects). Addressing these with simulations is hindered by their massive scale, both in synthetic data volumes and numerical complexity, while real-data efforts are constrained by models that inadequately reflect the Earth’s complexity and by sparse sensor measurements from the field. Recent machine learning (ML) efforts offer promise, but progress is obscured by a lack of proper characterization, fair reporting, and rigorous comparisons. To address this, we introduce a Common Task Framework (CTF) for ML for seismic wavefields, starting with three distinct wavefield datasets. Our CTF features a curated set of datasets at various scales (global, crustal, and local) and task-specific metrics spanning forecasting, reconstruction, and generalization under realistic constraints such as noise and limited data. Inspired by CTFs in fields like natural language processing, this framework provides a structured and rigorous foundation for head-to-head algorithm evaluation. We illustrate the evaluation procedure with scores reported for two of the datasets, showcasing the performance of various methods and foundation models for reconstructing seismic wavefields from both simulated and real-world sensor measurements. The CTF scores reveal the strengths, limitations, and suitability for specific problem classes. Our vision is to replace ad hoc comparisons with standardized evaluations on hidden test sets, raising the bar for rigor and reproducibility in scientific ML.

[242] Conditional Adversarial Fragility in Financial Machine Learning under Macroeconomic Stress

Samruddhi Baviskar

Main category: cs.LG

TL;DR: Adversarial vulnerability in financial ML models increases during economic stress periods, requiring regime-aware evaluation frameworks.

Details

Motivation: Financial ML models operate in nonstationary economic environments, but adversarial robustness is typically evaluated under static assumptions, ignoring how economic stress might amplify vulnerabilities.

Method: Proposes a regime-aware evaluation framework using volatility-based regime segmentation to assess adversarial robustness across calm and stress periods. Introduces semantic auditing of model explanations using LLMs as an interpretive governance layer.

Result: Models show comparable baseline performance across regimes, but under adversarial perturbations, stress regimes exhibit substantially greater degradation in accuracy, decision thresholds, and risk-sensitive outcomes, with increased false negative rates.

Conclusion: Adversarial robustness in financial ML is regime-dependent, motivating stress-aware approaches to model risk assessment in high-stakes financial deployments.

Abstract: Machine learning models used in financial decision systems operate in nonstationary economic environments, yet adversarial robustness is typically evaluated under static assumptions. This work introduces Conditional Adversarial Fragility, a regime dependent phenomenon in which adversarial vulnerability is systematically amplified during periods of macroeconomic stress. We propose a regime aware evaluation framework for time indexed tabular financial classification tasks that conditions robustness assessment on external indicators of economic stress. Using volatility based regime segmentation as a proxy for macroeconomic conditions, we evaluate model behavior across calm and stress periods while holding model architecture, attack methodology, and evaluation protocols constant. Baseline predictive performance remains comparable across regimes, indicating that economic stress alone does not induce inherent performance degradation. Under adversarial perturbations, however, models operating during stress regimes exhibit substantially greater degradation across predictive accuracy, operational decision thresholds, and risk sensitive outcomes. We further demonstrate that this amplification propagates to increased false negative rates, elevating the risk of missed high risk cases during adverse conditions. To complement numerical robustness metrics, we introduce an interpretive governance layer based on semantic auditing of model explanations using large language models. Together, these results demonstrate that adversarial robustness in financial machine learning is a regime dependent property and motivate stress aware approaches to model risk assessment in high stakes financial deployments.

[243] Spatio-Temporal Graph Neural Networks for Dairy Farm Sustainability Forecasting and Counterfactual Policy Analysis

Surya Jayakumar, Kieran Sullivan, John McLaughlin, Christine O’Meara, Indrakshi Dey

Main category: cs.LG

TL;DR: First county-scale STGNN framework for forecasting composite sustainability indices from cattle herd data using VAE augmentation and PCA-based scoring.

Details

Motivation: To develop a data-driven approach for forecasting sustainability indices at county scale using herd-level operational records, addressing sparsity in agricultural datasets and enabling multi-year sustainability predictions.

Method: 1) VAE-based data augmentation to handle sparse ICBF datasets while preserving joint distributions; 2) PCA-derived pillar-based scoring formulation identifying four sustainability pillars; 3) Novel STGNN architecture encoding geographic dependencies and non-linear temporal dynamics for forecasting.

Result: Developed first-ever county-scale application of STGNN for sustainability forecasting, created weighted composite indices for four sustainability pillars, and generated multi-year forecasts for 2026-2030.

Conclusion: The framework successfully enables data-driven sustainability forecasting at county scale, providing a novel approach for agricultural sustainability assessment and prediction using advanced spatio-temporal neural networks.

Abstract: This study introduces a novel data-driven framework and the first-ever county-scale application of Spatio-Temporal Graph Neural Networks (STGNN) to forecast composite sustainability indices from herd-level operational records. The methodology employs a novel, end-to-end pipeline utilizing a Variational Autoencoder (VAE) to augment Irish Cattle Breeding Federation (ICBF) datasets, preserving joint distributions while mitigating sparsity. A first-ever pillar-based scoring formulation is derived via Principal Component Analysis, identifying Reproductive Efficiency, Genetic Management, Herd Health, and Herd Management, to construct weighted composite indices. These indices are modelled using a novel STGNN architecture that explicitly encodes geographic dependencies and non-linear temporal dynamics to generate multi-year forecasts for 2026-2030.

[244] Bloom Filter Encoding for Machine Learning

John Cartmell, Mihaela Cardei, Ionut Cardei

Main category: cs.LG

TL;DR: Bloom filter transform encodes data into privacy-preserving bit arrays for ML, achieving similar accuracy to raw data with memory savings and privacy protection.

Details

Motivation: To develop a preprocessing method that reduces memory usage while preserving privacy in machine learning, maintaining classification accuracy comparable to raw data approaches.

Method: Use Bloom filter transform to encode each data sample into compact bit arrays. Test on six diverse datasets (SMS Spam, ECG200, Adult 50K, CDC Diabetes, MNIST, Fashion MNIST) with four classifiers (XGBoost, DNN, CNN, Logistic Regression).

Result: Models trained on Bloom filter encodings achieve accuracy similar to models trained on raw data or other transforms, while providing memory savings and enhanced privacy protection.

Conclusion: Bloom filter transform is an efficient preprocessing approach for diverse ML tasks that balances accuracy, memory efficiency, and privacy preservation.

Abstract: We present a method that uses the Bloom filter transform to preprocess data for machine learning. Each sample is encoded into a compact, privacy-preserving bit array. This reduces memory use and protects the original data while keeping enough structure for accurate classification. We test the method on six datasets: SMS Spam Collection, ECG200, Adult 50K, CDC Diabetes, MNIST, and Fashion MNIST. Four classifiers are used: Extreme Gradient Boosting, Deep Neural Networks, Convolutional Neural Networks, and Logistic Regression. Results show that models trained on Bloom filter encodings achieve accuracy similar to models trained on raw data or other transforms. At the same time, the method provides memory savings while enhancing privacy. These results suggest that the Bloom filter transform is an efficient preprocessing approach for diverse machine learning tasks.

[245] LoFT-LLM: Low-Frequency Time-Series Forecasting with Large Language Models

Jiacheng You, Jingcheng Yang, Yuhang Xie, Zhongxuan Wu, Xiucheng Li, Feng Li, Pengjie Wang, Jian Xu, Bo Zheng, Xinyang Chen

Main category: cs.LG

TL;DR: LoFT-LLM: A frequency-aware forecasting pipeline combining low-frequency trend extraction with LLM semantic calibration for time-series forecasting in finance and energy domains.

Details

Motivation: Time-series forecasting faces challenges with limited training data, complex noisy dynamics, and underutilized auxiliary variables. Existing models use full-length windows with high-frequency noise that obscure long-term trends.

Method: Three-stage approach: 1) Patch Low-Frequency forecasting Module extracts stable low-frequency trends from spectral patches, 2) residual learner models high-frequency variations, 3) fine-tuned LLM refines predictions using auxiliary context and domain knowledge via structured natural language prompts.

Result: Extensive experiments on financial and energy datasets show LoFT-LLM significantly outperforms strong baselines in both full-data and few-shot regimes, delivering superior accuracy, robustness, and interpretability.

Conclusion: LoFT-LLM effectively addresses challenges in time-series forecasting by combining frequency-aware decomposition with LLM semantic calibration, making it suitable for real-world applications with limited data and complex dynamics.

Abstract: Time-series forecasting in real-world applications such as finance and energy often faces challenges due to limited training data and complex, noisy temporal dynamics. Existing deep forecasting models typically supervise predictions using full-length temporal windows, which include substantial high-frequency noise and obscure long-term trends. Moreover, auxiliary variables containing rich domain-specific information are often underutilized, especially in few-shot settings. To address these challenges, we propose LoFT-LLM, a frequency-aware forecasting pipeline that integrates low-frequency learning with semantic calibration via a large language model (LLM). Firstly, a Patch Low-Frequency forecasting Module (PLFM) extracts stable low-frequency trends from localized spectral patches. Secondly, a residual learner then models high-frequency variations. Finally, a fine-tuned LLM refines the predictions by incorporating auxiliary context and domain knowledge through structured natural language prompts. Extensive experiments on financial and energy datasets demonstrate that LoFT-LLM significantly outperforms strong baselines under both full-data and few-shot regimes, delivering superior accuracy, robustness, and interpretability.

[246] Control Variate Score Matching for Diffusion Models

Khaled Kahouli, Romuald Elie, Klaus-Robert Müller, Quentin Berthet, Oliver T. Unke, Arnaud Doucet

Main category: cs.LG

TL;DR: CVSI unifies DSI and TSI score estimators using control variates to minimize variance across all noise levels, improving sample efficiency in diffusion model training and inference.

Details

Motivation: Diffusion models need accurate score estimation of noise-perturbed target distributions. Existing estimators (DSI using data samples and TSI using energy function) suffer from complementary variance problems: DSI has high variance at low noise, TSI at high noise. There's a need for a unified estimator that minimizes variance across the entire noise spectrum.

Method: Proposes Control Variate Score Identity (CVSI) that unifies DSI and TSI within the principled framework of control variates. Derives an optimal, time-dependent control coefficient that theoretically guarantees variance minimization across all noise levels. CVSI serves as a robust, low-variance plug-in estimator.

Result: CVSI significantly enhances sample efficiency in both data-free sampler learning and inference-time diffusion sampling. The unified estimator provides theoretical variance minimization guarantees across the entire noise spectrum.

Conclusion: CVSI reconciles the complementary variance trade-offs of DSI and TSI through control variates, offering a principled solution for low-variance score estimation that improves diffusion model training and inference efficiency.

Abstract: Diffusion models offer a robust framework for sampling from unnormalized probability densities, which requires accurately estimating the score of the noise-perturbed target distribution. While the standard Denoising Score Identity (DSI) relies on data samples, access to the target energy function enables an alternative formulation via the Target Score Identity (TSI). However, these estimators face a fundamental variance trade-off: DSI exhibits high variance in low-noise regimes, whereas TSI suffers from high variance at high noise levels. In this work, we reconcile these approaches by unifying both estimators within the principled framework of control variates. We introduce the Control Variate Score Identity (CVSI), deriving an optimal, time-dependent control coefficient that theoretically guarantees variance minimization across the entire noise spectrum. We demonstrate that CVSI serves as a robust, low-variance plug-in estimator that significantly enhances sample efficiency in both data-free sampler learning and inference-time diffusion sampling.

[247] Position as Probability: Self-Supervised Transformers that Think Past Their Training for Length Extrapolation

Philip Heejun Lee

Main category: cs.LG

TL;DR: PRISM introduces a probabilistic relative-position encoding for Transformers that enables 10x length extrapolation beyond training data, achieving SOTA on algorithmic reasoning tasks.

Details

Motivation: Deep sequence models degrade in accuracy when test sequences exceed training lengths, but many critical tasks (algorithmic reasoning, multi-step arithmetic, compositional generalization) require robust length extrapolation.

Method: PRISM (Probabilistic Relative-position Implicit Superposition Model) learns continuous relative positions through a differentiable histogram-filter update, preserving position uncertainty via probabilistic superposition rather than deterministic embeddings.

Result: PRISM achieves state-of-the-art length extrapolation, generalizing to previously intractable sequence lengths (up to 10x beyond training) across algorithmic benchmarks including arithmetic, SCAN compositionality tasks, and complex copy variants.

Conclusion: PRISM’s stochastic positional encoding maintains sharp and interpretable internal states, advancing neural sequence models that remain algorithmically robust at lengths far exceeding their training horizon.

Abstract: Deep sequence models typically degrade in accuracy when test sequences significantly exceed their training lengths, yet many critical tasks–such as algorithmic reasoning, multi-step arithmetic, and compositional generalization–require robust length extrapolation. We introduce PRISM, a Probabilistic Relative-position Implicit Superposition Model, a novel positional encoding mechanism that enables Transformers to extrapolate accurately up to 10x beyond their training length. PRISM learns continuous relative positions through a differentiable histogram-filter update, preserving position uncertainty via a probabilistic superposition rather than conventional deterministic embeddings. Empirically, PRISM achieves state-of-the-art length extrapolation, successfully generalizing to previously intractable sequence lengths across algorithmic benchmarks–including arithmetic (addition, multiplication), SCAN compositionality tasks, and complex copy variants derived from DeepMind’s recent datasets. Our analysis demonstrates that PRISM’s stochastic positional encoding maintains sharp and interpretable internal states, providing a theoretical basis for reliable length generalization. These results advance the goal of neural sequence models that remain algorithmically robust at lengths far exceeding their training horizon.

[248] DecoKAN: Interpretable Decomposition for Forecasting Cryptocurrency Market Dynamics

Yuan Gao, Zhenguo Dong, Xuelong Wang, Zhiqiang Wang, Yong Zhang, Shaofan Wang

Main category: cs.LG

TL;DR: DecoKAN: Interpretable cryptocurrency forecasting framework using wavelet decomposition and Kolmogorov-Arnold Networks for transparent modeling.

Details

Motivation: Cryptocurrency data contains both long-term trends and high-frequency oscillations, but existing deep learning models are black-boxes that fail to decouple these dynamics or provide interpretability needed for trustworthy financial decision-making.

Method: Integrates multi-level Discrete Wavelet Transform (DWT) for signal decomposition and Kolmogorov-Arnold Network (KAN) mixers for interpretable nonlinear modeling. Includes symbolic analysis pipeline with sparsification, pruning, and symbolization to produce analytical expressions.

Result: Achieves lowest average Mean Squared Error on all tested cryptocurrency datasets (BTC, ETH, XMR), consistently outperforming state-of-the-art baselines.

Conclusion: DecoKAN bridges predictive accuracy and model transparency, advancing trustworthy decision support in cryptocurrency markets through interpretable forecasting.

Abstract: Accurate and interpretable forecasting of multivariate time series is crucial for understanding the complex dynamics of cryptocurrency markets in digital asset systems. Advanced deep learning methodologies, particularly Transformer-based and MLP-based architectures, have achieved competitive predictive performance in cryptocurrency forecasting tasks. However, cryptocurrency data is inherently composed of long-term socio-economic trends and local high-frequency speculative oscillations. Existing deep learning-based ‘black-box’ models fail to effectively decouple these composite dynamics or provide the interpretability needed for trustworthy financial decision-making. To overcome these limitations, we propose DecoKAN, an interpretable forecasting framework that integrates multi-level Discrete Wavelet Transform (DWT) for decoupling and hierarchical signal decomposition with Kolmogorov-Arnold Network (KAN) mixers for transparent and interpretable nonlinear modeling. The DWT component decomposes complex cryptocurrency time series into distinct frequency components, enabling frequency-specific analysis, while KAN mixers provide intrinsically interpretable spline-based mappings within each decomposed subseries. Furthermore, interpretability is enhanced through a symbolic analysis pipeline involving sparsification, pruning, and symbolization, which produces concise analytical expressions offering symbolic representations of the learned patterns. Extensive experiments demonstrate that DecoKAN achieves the lowest average Mean Squared Error on all tested real-world cryptocurrency datasets (BTC, ETH, XMR), consistently outperforming a comprehensive suite of competitive state-of-the-art baselines. These results validate DecoKAN’s potential to bridge the gap between predictive accuracy and model transparency, advancing trustworthy decision support within complex cryptocurrency markets.

[249] Orthogonal Activation with Implicit Group-Aware Bias Learning for Class Imbalance

Sukumar Kishanthan, Asela Hevapathige

Main category: cs.LG

TL;DR: Proposes OGAB, a novel activation function that uses orthogonality and group-aware bias learning to address class imbalance in deep learning without explicit label supervision.

Details

Motivation: Class imbalance causes suboptimal classifier performance, and while deep learning excels at feature extraction, its performance deteriorates under imbalanced data. Existing approaches address imbalance through preprocessing or post-processing, but there's a need for solutions that tackle it during training at the embedding level.

Method: OGAB activation function incorporates orthogonality to preserve minority class information by maintaining feature independence, and group-aware bias mechanism that automatically identifies data clusters and adjusts embeddings to enhance class separability without explicit supervision.

Result: Demonstrated effectiveness on both real-world and synthetic imbalanced datasets, showing consistent performance improvements over both traditional and learnable activation functions.

Conclusion: Activation functions can introduce strong inductive biases to address complex data challenges like class imbalance. OGAB tackles imbalance during training at the embedding level, enabling direct integration with the learning process without requiring explicit label information.

Abstract: Class imbalance is a common challenge in machine learning and data mining, often leading to suboptimal performance in classifiers. While deep learning excels in feature extraction, its performance still deteriorates under imbalanced data. In this work, we propose a novel activation function, named OGAB, designed to alleviate class imbalance in deep learning classifiers. OGAB incorporates orthogonality and group-aware bias learning to enhance feature distinguishability in imbalanced scenarios without explicitly requiring label information. Our key insight is that activation functions can be used to introduce strong inductive biases that can address complex data challenges beyond traditional non-linearity. Our work demonstrates that orthogonal transformations can preserve information about minority classes by maintaining feature independence, thereby preventing the dominance of majority classes in the embedding space. Further, the proposed group-aware bias mechanism automatically identifies data clusters and adjusts embeddings to enhance class separability without the need for explicit supervision. Unlike existing approaches that address class imbalance through preprocessing data modifications or post-processing corrections, our proposed approach tackles class imbalance during the training phase at the embedding learning level, enabling direct integration with the learning process. We demonstrate the effectiveness of our solution on both real-world and synthetic imbalanced datasets, showing consistent performance improvements over both traditional and learnable activation functions.

[250] An Optimal Policy for Learning Controllable Dynamics by Exploration

Peter N. Loxley

Main category: cs.LG

TL;DR: This paper presents an optimal non-stationary policy for learning controllable Markov chains through exploration over limited time horizons, with efficient computation and implementation.

Details

Motivation: The motivation is to develop optimal exploration policies for learning controllable dynamics in unknown environments, addressing the challenge of efficiently maximizing information gain during limited exploration time.

Method: The method involves deriving the general form of an optimal policy that greedily maximizes information gain, parameterizing control sets that change over time, and providing an algorithm for finding optimal policies based on dynamic programming principles.

Result: The paper demonstrates policy optimality through counting arguments, comparisons with suboptimal policies, and dynamic programming’s sequential improvement property, with detailed treatment of six examples of controllable dynamics.

Conclusion: Non-stationary policies are essential for optimal exploration due to the existence of states that restrict control dynamics (transient, absorbing, and non-backtracking states), and the proposed policy provides an efficient solution for learning controllable Markov chains.

Abstract: Controllable Markov chains describe the dynamics of sequential decision making tasks and are the central component in optimal control and reinforcement learning. In this work, we give the general form of an optimal policy for learning controllable dynamics in an unknown environment by exploring over a limited time horizon. This policy is simple to implement and efficient to compute, and allows an agent to ``learn by exploring" as it maximizes its information gain in a greedy fashion by selecting controls from a constraint set that changes over time during exploration. We give a simple parameterization for the set of controls, and present an algorithm for finding an optimal policy. The reason for this policy is due to the existence of certain types of states that restrict control of the dynamics; such as transient states, absorbing states, and non-backtracking states. We show why the occurrence of these states makes a non-stationary policy essential for achieving optimal exploration. Six interesting examples of controllable dynamics are treated in detail. Policy optimality is demonstrated using counting arguments, comparing with suboptimal policies, and by making use of a sequential improvement property from dynamic programming.

[251] PairFlow: Closed-Form Source-Target Coupling for Few-Step Generation in Discrete Flow Models

Mingue Park, Jisung Hwang, Seungwoo Yoo, Kyeongmin Yeo, Minhyuk Sung

Main category: cs.LG

TL;DR: PairFlow is a lightweight preprocessing method for Discrete Flow Models that enables few-step sampling without needing a pretrained teacher, using only 1.7% of full training compute.

Details

Motivation: Discrete Flow Models (DFMs) are powerful generative models for discrete data but suffer from slow iterative sampling. Existing acceleration methods require expensive finetuning with substantial training overhead, creating a need for more efficient approaches.

Method: PairFlow trains DFMs from coupled source-target distribution samples without pretrained teachers. The core innovation is a closed-form inversion for DFMs that enables efficient construction of paired source-target samples for training.

Result: PairFlow matches or surpasses performance of two-stage finetuning methods while using only up to 1.7% of full training compute. It also provides stronger base models for subsequent distillation, enabling further acceleration after finetuning.

Conclusion: PairFlow offers an efficient, broadly applicable solution for accelerating Discrete Flow Models with minimal computational cost, demonstrated across molecular data, binary, and RGB image domains.

Abstract: We introduce $\texttt{PairFlow}$, a lightweight preprocessing step for training Discrete Flow Models (DFMs) to achieve few-step sampling without requiring a pretrained teacher. DFMs have recently emerged as a new class of generative models for discrete data, offering strong performance. However, they suffer from slow sampling due to their iterative nature. Existing acceleration methods largely depend on finetuning, which introduces substantial additional training overhead. $\texttt{PairFlow}$ addresses this issue with a lightweight preprocessing step. Inspired by ReFlow and its extension to DFMs, we train DFMs from coupled samples of source and target distributions, without requiring any pretrained teacher. At the core of our approach is a closed-form inversion for DFMs, which allows efficient construction of paired source-target samples. Despite its extremely low cost, taking only up to 1.7% of the compute needed for full model training, $\texttt{PairFlow}$ matches or even surpasses the performance of two-stage training involving finetuning. Furthermore, models trained with our framework provide stronger base models for subsequent distillation, yielding further acceleration after finetuning. Experiments on molecular data as well as binary and RGB images demonstrate the broad applicability and effectiveness of our approach.

[252] Decoupling the “What” and “Where” With Polar Coordinate Positional Embeddings

Anand Gopalakrishnan, Robert Csordás, Jürgen Schmidhuber, Michael C. Mozer

Main category: cs.LG

TL;DR: PoPE (Polar Coordinate Position Embeddings) disentangles content and position information in Transformers, improving performance over RoPE on tasks requiring independent content/position matching and showing better length extrapolation.

Details

Motivation: RoPE (Rotary Position Embedding) entangles content ("what") and position ("where") information, which can impair performance when decisions require independent matching on these two factors. This entanglement creates a confound that limits Transformer performance.

Method: Proposes PoPE (Polar Coordinate Position Embeddings), a new positional encoding scheme that eliminates the what-where confound by separating content and position information more effectively than RoPE.

Result: PoPE outperforms RoPE on diagnostic tasks requiring indexing solely by position or content. On autoregressive sequence modeling in music, genomic, and natural language domains, PoPE achieves better evaluation loss (perplexity) and downstream task performance. Gains persist across model scales (124M to 774M parameters) and PoPE shows strong zero-shot length extrapolation capabilities, outperforming both RoPE and YaRN.

Conclusion: PoPE successfully disentangles content and position information in Transformers, leading to improved performance across multiple domains and better length extrapolation without requiring additional fine-tuning or frequency interpolation.

Abstract: The attention mechanism in a Transformer architecture matches key to query based on both content – the what – and position in a sequence – the where. We present an analysis indicating that what and where are entangled in the popular RoPE rotary position embedding. This entanglement can impair performance particularly when decisions require independent matches on these two factors. We propose an improvement to RoPE, which we call Polar Coordinate Position Embeddings or PoPE, that eliminates the what-where confound. PoPE is far superior on a diagnostic task requiring indexing solely by position or by content. On autoregressive sequence modeling in music, genomic, and natural language domains, Transformers using PoPE as the positional encoding scheme outperform baselines using RoPE with respect to evaluation loss (perplexity) and downstream task performance. On language modeling, these gains persist across model scale, from 124M to 774M parameters. Crucially, PoPE shows strong zero-shot length extrapolation capabilities compared not only to RoPE but even a method designed for extrapolation, YaRN, which requires additional fine tuning and frequency interpolation.

[253] QE-Catalytic: A Graph-Language Multimodal Base Model for Relaxed-Energy Prediction in Catalytic Adsorption

Yanjie Li, Jian Xu, Xueqing Chen, Lina Yu, Shiming Xiang, Weijun Li, Cheng-lin Liu

Main category: cs.LG

TL;DR: QE-Catalytic: A multimodal framework combining a large language model (Qwen) with an E(3)-equivariant graph Transformer (Equiformer-V2) for improved adsorption energy prediction and inverse design of catalytic surfaces.

Details

Motivation: Current language-model-based approaches for catalyst screening have insufficient accuracy in adsorption energy prediction and cannot distinguish different configurations of the same system, even with graph-assisted pretraining. There's a need for better methods that can leverage both 3D structural information and textual descriptions.

Method: Proposes QE-Catalytic, a multimodal framework that deeply couples Qwen (large language model) with Equiformer-V2 (E(3)-equivariant graph Transformer). The method jointly leverages 3D structures and structured configuration text, injects 3D geometric information into the language channel via graph-text alignment, and can autoregressively generate CIF files for structure design.

Result: On OC20 dataset, reduces MAE of relaxed adsorption energy from 0.713 eV to 0.486 eV. Consistently outperforms baseline models (CatBERTa and GAP-CATBERTa) across multiple evaluation protocols.

Conclusion: QE-Catalytic enables unified support for adsorption-configuration property prediction and inverse design on complex catalytic surfaces, functioning as a high-performance text-based predictor even when precise coordinates are unavailable.

Abstract: Adsorption energy is a key descriptor of catalytic reactivity. It is fundamentally defined as the difference between the relaxed total energy of the adsorbate-surface system and that of an appropriate reference state; therefore, the accuracy of relaxed-energy prediction directly determines the reliability of machine-learning-driven catalyst screening. E(3)-equivariant graph neural networks (GNNs) can natively operate on three-dimensional atomic coordinates under periodic boundary conditions and have demonstrated strong performance on such tasks. In contrast, language-model-based approaches, while enabling human-readable textual descriptions and reducing reliance on explicit graph – thereby broadening applicability – remain insufficient in both adsorption-configuration energy prediction accuracy and in distinguishing the same system with different configurations,'' even with graph-assisted pretraining in the style of GAP-CATBERTa. To this end, we propose QE-Catalytic, a multimodal framework that deeply couples a large language model (\textbf{Q}wen) with an E(3)-equivariant graph Transformer (\textbf{E}quiformer-V2), enabling unified support for adsorption-configuration property prediction and inverse design on complex catalytic surfaces. During prediction, QE-Catalytic jointly leverages three-dimensional structures and structured configuration text, and injects 3D geometric information’’ into the language channel via graph-text alignment, allowing it to function as a high-performance text-based predictor when precise coordinates are unavailable, while also autoregressively generating CIF files for target-energy-driven structure design and information completion. On OC20, QE-Catalytic reduces the MAE of relaxed adsorption energy from 0.713~~eV to 0.486~~eV, and consistently outperforms baseline models such as CatBERTa and GAP-CATBERTa across multiple evaluation protocols.

[254] Latent learning: episodic memory complements parametric learning by enabling flexible reuse of experiences

Andrew Kyle Lampinen, Martin Engelcke, Yuxuan Li, Arslan Chaudhry, James L. McClelland

Main category: cs.LG

TL;DR: Machine learning systems fail to generalize due to lack of latent learning (learning irrelevant but potentially useful information), which can be addressed by episodic memory and retrieval mechanisms.

Details

Motivation: The paper aims to understand why machine learning systems fail to generalize and seeks inspiration from cognitive science to address these limitations, particularly focusing on the lack of latent learning in parametric systems.

Method: The authors draw from cognitive science concepts, analyze various failure cases (reversal curse in language modeling, agent-based navigation), and propose episodic memory with oracle retrieval mechanisms as a solution. They also identify key components for effective retrieval, including within-example in-context learning.

Result: Systems with oracle retrieval mechanisms demonstrate improved generalization across various challenges. The research identifies within-example in-context learning as crucial for effectively using retrieved information across examples.

Conclusion: Lack of latent learning contributes to data inefficiency in ML systems compared to natural intelligence. Retrieval methods can complement parametric learning to improve generalization, with connections to cognitive science and neuroscience findings.

Abstract: When do machine learning systems fail to generalize, and what mechanisms could improve their generalization? Here, we draw inspiration from cognitive science to argue that one weakness of parametric machine learning systems is their failure to exhibit latent learning – learning information that is not relevant to the task at hand, but that might be useful in a future task. We show how this perspective links failures ranging from the reversal curse in language modeling to new findings on agent-based navigation. We then highlight how cognitive science points to episodic memory as a potential part of the solution to these issues. Correspondingly, we show that a system with an oracle retrieval mechanism can use learning experiences more flexibly to generalize better across many of these challenges. We also identify some of the essential components for effectively using retrieval, including the importance of within-example in-context learning for acquiring the ability to use information across retrieved examples. In summary, our results illustrate one possible contributor to the relative data inefficiency of current machine learning systems compared to natural intelligence, and help to understand how retrieval methods can complement parametric learning to improve generalization. We close by discussing some of the links between these findings and prior results in cognitive science and neuroscience, and the broader implications.

[255] Spatio-Temporal Graphs Beyond Grids: Benchmark for Maritime Anomaly Detection

Jeehong Kim, Youngseok Hwang, Minchan Kim, Sungho Bae, Hyunwoo Park

Main category: cs.LG

TL;DR: A novel benchmark dataset for anomaly detection in maritime traffic using spatio-temporal graph neural networks, addressing challenges of non-grid environments with irregular trajectories and multi-granularity anomalies.

Details

Motivation: Existing ST-GNNs work well for structured domains with fixed nodes (like road traffic), but real-world systems like maritime traffic lack fixed anchors, making graph construction and anomaly detection challenging due to sparse/irregular trajectories and multi-granular anomalies.

Method: Extend Open Maritime Traffic Analysis Dataset (OMTAD) into a benchmark for graph-based anomaly detection, enabling evaluation across node-level, edge-level, and graph-level anomalies. Use two LLM-based agents: Trajectory Synthesizer to construct richer interaction contexts, and Anomaly Injector to generate semantically meaningful anomalies.

Result: A benchmark dataset specifically designed for anomaly detection in non-grid spatio-temporal systems, with systematic evaluation capabilities across three granularity levels.

Conclusion: This benchmark will promote reproducibility and foster methodological advances in anomaly detection for non-grid spatio-temporal systems like maritime traffic.

Abstract: Spatio-temporal graph neural networks (ST-GNNs) have achieved notable success in structured domains such as road traffic and public transportation, where spatial entities can be naturally represented as fixed nodes. In contrast, many real-world systems including maritime traffic lack such fixed anchors, making the construction of spatio-temporal graphs a fundamental challenge. Anomaly detection in these non-grid environments is particularly difficult due to the absence of canonical reference points, the sparsity and irregularity of trajectories, and the fact that anomalies may manifest at multiple granularities. In this work, we introduce a novel benchmark dataset for anomaly detection in the maritime domain, extending the Open Maritime Traffic Analysis Dataset (OMTAD) into a benchmark tailored for graph-based anomaly detection. Our dataset enables systematic evaluation across three different granularities: node-level, edge-level, and graph-level anomalies. We plan to employ two specialized LLM-based agents: \emph{Trajectory Synthesizer} and \emph{Anomaly Injector} to construct richer interaction contexts and generate semantically meaningful anomalies. We expect this benchmark to promote reproducibility and to foster methodological advances in anomaly detection for non-grid spatio-temporal systems.

[256] C$^2$GSPG: Confidence-calibrated Group Sequence Policy Gradient towards Self-aware Reasoning

Haotian Liu, Shuo Wang, Hongteng Xu

Main category: cs.LG

TL;DR: C²GSPG is a confidence-calibration group sequence policy gradient method that addresses overconfidence in RL-based reasoning models while improving reasoning performance.

Details

Motivation: Existing RL methods like GRPO suffer from overconfidence issues that prevent achieving self-aware reasoning models, limiting their effectiveness in reasoning tasks.

Method: Proposes Group Sequence Policy Gradient (GSPG) framework to eliminate token-level bias, defines model confidence using normalized sequence-level probability, and applies cross-entropy regularizer to calibrate confidence to reward. Uses nonlinear reward normalization and adaptive regularizer clipping for non-binary rewards.

Result: C²GSPG shows superiority over state-of-the-art methods in both reasoning accuracy and confidence calibration when applied to post-train large language models on logical and mathematical reasoning tasks.

Conclusion: The proposed C²GSPG method effectively addresses overconfidence in RL-based reasoning models while simultaneously improving reasoning performance through collaborative confidence calibration and policy gradient optimization.

Abstract: Reinforcement Learning (RL) methods, exemplified by Group Relative Policy Optimization (GRPO) and its variants, play a central role in developing reasoning models. However, these methods often suffer from a critical overconfidence issue, which prevents them from achieving self-aware reasoning models. In this study, we propose a simple yet effective confidence-calibration group sequence policy gradient method, called C$^2$GSPG, which simultaneously enhances reasoning performance while suppressing overconfidence. In principle, we propose a Group Sequence Policy Gradient (GSPG) framework for learning reasoning models, which eliminates the token-level bias commonly appearing in GRPO and its variants. In this framework, we define the model confidence for each reasoning problem using the normalized sequence-level probability, and then apply a cross-entropy regularizer to calibrate the model confidence to the sequence’s reward. We demonstrate that the confidence calibration regularizer and GSPG are collaborative for binary rewards, as their objectives always share the same gradient direction. For non-binary rewards, we apply nonlinear reward normalization and adaptive regularizer clipping, mitigating the potential conflict between the two objectives. Applying C$^2$GSPG to post-train large language models in logical and mathematical reasoning tasks, we show its superiority over state-of-the-art methods in both reasoning accuracy and confidence calibration. The code of C$^2$GSPG is available at https://github.com/HaotianLiu123/CCGSPG.

[257] Jensen-Shannon Divergence Message-Passing for Rich-Text Graph Representation Learning

Zuo Wang, Ye Yuan

Main category: cs.LG

TL;DR: Proposes JSDMP, a new message-passing paradigm for rich-text graphs that captures both similarity and dissimilarity using Jensen-Shannon divergence, leading to two novel GNNs (DMPGCN and DMPPRG) that outperform state-of-the-art baselines.

Details

Motivation: To address the problem of contextual and structural divergence in rich-text graphs, which can negatively impact representation learning by causing models to aggregate information from poorly correlated text nodes.

Method: Jensen-Shannon Divergence Message-Passing (JSDMP) paradigm that computes message weights by jointly considering similarity and dissimilarity between text nodes using Jensen-Shannon divergence. This enables learning from truly correlated nodes. Two GNN architectures are built on JSDMP: DMPGCN and DMPPRG.

Result: Extensive experiments on established rich-text datasets show that both DMPGCN and DMPPRG outperform several state-of-the-art baselines, demonstrating the effectiveness of the JSDMP paradigm.

Conclusion: The proposed JSDMP paradigm successfully addresses contextual and structural divergence in rich-text graphs by capturing both similarity and dissimilarity, leading to improved representation learning through the novel DMPGCN and DMPPRG architectures.

Abstract: In this paper, we investigate how the widely existing contextual and structural divergence may influence the representation learning in rich-text graphs. To this end, we propose Jensen-Shannon Divergence Message-Passing (JSDMP), a new learning paradigm for rich-text graph representation learning. Besides considering similarity regarding structure and text, JSDMP further captures their corresponding dissimilarity by Jensen-Shannon divergence. Similarity and dissimilarity are then jointly used to compute new message weights among text nodes, thus enabling representations to learn with contextual and structural information from truly correlated text nodes. With JSDMP, we propose two novel graph neural networks, namely Divergent message-passing graph convolutional network (DMPGCN) and Divergent message-passing Page-Rank graph neural networks (DMPPRG), for learning representations in rich-text graphs. DMPGCN and DMPPRG have been extensively texted on well-established rich-text datasets and compared with several state-of-the-art baselines. The experimental results show that DMPGCN and DMPPRG can outperform other baselines, demonstrating the effectiveness of the proposed Jensen-Shannon Divergence Message-Passing paradigm

[258] Why mask diffusion does not work

Haocheng Sun, Cynthia Xin Wen, Edward Hong Wang

Main category: cs.LG

TL;DR: Mask diffusion language models struggle with parallel generation and bidirectional attention despite theoretical advantages, requiring optimized training/inference strategies.

Details

Motivation: To address the gap between theoretical advantages of diffusion models (parallel generation, bidirectional attention) and practical limitations of existing mask diffusion implementations, particularly absorbing diffusion variants.

Method: Analyzes inherent difficulties in mask diffusion for achieving parallel generation and bidirectional attention, then proposes optimized training and inference strategies specifically for mask diffusion models.

Result: Demonstrates why mask diffusion faces fundamental challenges with parallel generation and bidirectional attention, and provides the most effective strategies to address these limitations.

Conclusion: While mask diffusion models theoretically offer advantages over autoregressive models, they face inherent implementation challenges that require specialized training and inference approaches to realize their potential.

Abstract: The main advantages of diffusion language models over autoregressive (AR) models lie in their ability to support parallel generation and bidirectional attention, enabling a more controllable generation process. In recent years, open-source mask diffusion language models have emerged, most of which are based on a variant known as absorbing diffusion. However, this paper demonstrates why mask diffusion faces inherent difficulties in achieving parallel generation and bidirectional attention. We also propose the most effective training and inference strategies for mask diffusion.

[259] Information-directed sampling for bandits: a primer

Annika Hirling, Giorgio Nicoletti, Antonio Celani

Main category: cs.LG

TL;DR: Information Directed Sampling (IDS) policies for two-state Bernoulli bandits in discounted infinite-horizon setting achieve bounded regret in symmetric cases and logarithmic regret in one-fair-coin scenarios.

Details

Motivation: To bridge concepts from reinforcement learning and information theory for statistical physicists, using the Multi-Armed Bandit problem as a fundamental framework to analyze exploration-exploitation trade-offs through heuristic strategies like IDS.

Method: Extends IDS framework to discounted infinite-horizon setting with modified information measure and tuning parameter; analyzes two-state Bernoulli bandits as minimal model; examines symmetric bandits and one-fair-coin scenarios.

Result: IDS achieves bounded cumulative regret in symmetric bandit cases, and yields logarithmic regret scaling with horizon in one-fair-coin scenarios, matching classical asymptotic lower bounds.

Conclusion: This pedagogical synthesis demonstrates how IDS policies effectively balance exploration and exploitation, providing rigorous analysis of heuristic strategies against optimal policies in tractable bandit environments.

Abstract: The Multi-Armed Bandit problem provides a fundamental framework for analyzing the tension between exploration and exploitation in sequential learning. This paper explores Information Directed Sampling (IDS) policies, a class of heuristics that balance immediate regret against information gain. We focus on the tractable environment of two-state Bernoulli bandits as a minimal model to rigorously compare heuristic strategies against the optimal policy. We extend the IDS framework to the discounted infinite-horizon setting by introducing a modified information measure and a tuning parameter to modulate the decision-making behavior. We examine two specific problem classes: symmetric bandits and the scenario involving one fair coin. In the symmetric case we show that IDS achieves bounded cumulative regret, whereas in the one-fair-coin scenario the IDS policy yields a regret that scales logarithmically with the horizon, in agreement with classical asymptotic lower bounds. This work serves as a pedagogical synthesis, aiming to bridge concepts from reinforcement learning and information theory for an audience of statistical physicists.

[260] On Structured State-Space Duality

Jerry Yao-Chieh Hu, Xiwen Zhang, Ali ElSheikh, Weimin Wu, Han Liu

Main category: cs.LG

TL;DR: SSD duality extends from scalar-identity SSMs to general diagonal SSMs, showing equivalence to 1-semiseparable masked attention while maintaining training complexity bounds, but fails for standard softmax attention.

Details

Motivation: To generalize the Structured State-Space Duality (SSD) beyond the scalar-identity case, explore richer dynamics while maintaining efficiency, and understand the limits of SSM-attention equivalence.

Method: Extend SSD to diagonal state matrices, analyze training complexity lower bounds, establish necessary/sufficient conditions for SSM-equivalence to 1-semiseparable attention, and examine rank explosion in softmax attention.

Result: Diagonal SSMs match scalar case’s training complexity bounds while supporting richer dynamics; identified conditions for SSM-attention equivalence; showed duality fails for standard softmax attention due to rank explosion.

Conclusion: The work tightens the bridge between recurrent SSMs and Transformers, widening design space for expressive yet efficient sequence models through generalized SSD duality.

Abstract: Structured State-Space Duality (SSD) [Dao & Gu, ICML 2024] is an equivalence between a simple Structured State-Space Model (SSM) and a masked attention mechanism. In particular, a state-space model with a scalar-times-identity state matrix is equivalent to a masked self-attention with a $1$-semiseparable causal mask. Consequently, the same sequence transformation (model) has two algorithmic realizations: as a linear-time $O(T)$ recurrence or as a quadratic-time $O(T^2)$ attention. In this note, we formalize and generalize this duality: (i) we extend SSD from the scalar-identity case to general diagonal SSMs (diagonal state matrices); (ii) we show that these diagonal SSMs match the scalar case’s training complexity lower bounds while supporting richer dynamics; (iii) we establish a necessary and sufficient condition under which an SSM is equivalent to $1$-semiseparable masked attention; and (iv) we show that such duality fails to extend to standard softmax attention due to rank explosion. Together, these results tighten bridge between recurrent SSMs and Transformers, and widen the design space for expressive yet efficient sequence models.

[261] Sample-Efficient Policy Constraint Offline Deep Reinforcement Learning based on Sample Filtering

Yuanhao Chen, Qi Liu, Pengbin Chen, Zhongjian Qiao, Yanjie Li

Main category: cs.LG

TL;DR: A sample filtering method for policy constraint offline RL that selects high-reward transitions to improve learning efficiency and performance over using all transitions.

Details

Motivation: Policy constraint offline RL suffers when datasets contain many low-reward transitions, causing learned policies to be constrained by suboptimal behavior policies, resulting in slow learning and poor performance.

Method: Proposes a two-step sample filtering method: 1) Score transitions using average reward and average discounted reward of episodes, 2) Extract high-score transitions to train offline RL algorithms instead of using all dataset transitions.

Result: Experimental results across various offline RL algorithms and benchmark tasks show the proposed method outperforms baseline methods that use all transitions.

Conclusion: Selectively using high-quality transitions through sample filtering improves policy constraint offline RL performance by avoiding constraints from low-reward transitions in the dataset.

Abstract: Offline reinforcement learning (RL) aims to learn a policy that maximizes the expected return using a given static dataset of transitions. However, offline RL faces the distribution shift problem. The policy constraint offline RL method is proposed to solve the distribution shift problem. During the policy constraint offline RL training, it is important to ensure the difference between the learned policy and behavior policy within a given threshold. Thus, the learned policy heavily relies on the quality of the behavior policy. However, a problem exists in existing policy constraint methods: if the dataset contains many low-reward transitions, the learned will be contained with a suboptimal reference policy, leading to slow learning speed, low sample efficiency, and inferior performances. This paper shows that the sampling method in policy constraint offline RL that uses all the transitions in the dataset can be improved. A simple but efficient sample filtering method is proposed to improve the sample efficiency and the final performance. First, we evaluate the score of the transitions by average reward and average discounted reward of episodes in the dataset and extract the transition samples of high scores. Second, the high-score transition samples are used to train the offline RL algorithms. We verify the proposed method in a series of offline RL algorithms and benchmark tasks. Experimental results show that the proposed method outperforms baselines.

[262] NeuralCrop: Combining physics and machine learning for improved crop yield predictions

Yunan Lin, Sebastian Bathiany, Maha Badri, Maximilian Gelbrecht, Philipp Hess, Brian Groenke, Jens Heinke, Christoph Müller, Niklas Boers

Main category: cs.LG

TL;DR: NeuralCrop is a hybrid crop model combining process-based GGCMs with machine learning that outperforms state-of-the-art models in yield prediction and generalizes better to climate change scenarios.

Details

Motivation: Traditional GGCMs have substantial uncertainties due to limited process understanding, while pure machine learning models fail to generalize to changing climate conditions outside their training distributions. There's a need for models that combine process understanding with data-driven approaches for more reliable yield projections under climate change.

Method: NeuralCrop is a hybrid model that combines an advanced process-based GGCM with data-driven machine learning components. It’s first trained to emulate a competitive GGCM, then fine-tuned on observational data, creating a model that leverages both explicit process representation and data-driven learning.

Result: NeuralCrop outperforms state-of-the-art GGCMs across site-level and large-scale cropping regions. It accurately reproduces interannual yield anomalies in European wheat regions and US Corn Belt (2000-2019), with particularly strong improvements under drought extremes. Unlike pure ML models, NeuralCrop maintains robust performance when generalizing to unseen conditions.

Conclusion: The hybrid approach combining process-based modeling with machine learning offers improved crop modeling and more reliable yield projections under climate change and intensifying extreme weather conditions, addressing limitations of both traditional GGCMs and pure ML models.

Abstract: Global gridded crop models (GGCMs) simulate daily crop growth by explicitly representing key biophysical processes and project end-of-season yield time series. They are a primary tool to quantify the impacts of climate change on agricultural productivity and assess associated risks for food security. Despite decades of development, state-of-the-art GGCMs still have substantial uncertainties in simulating complex biophysical processes due to limited process understanding. Recently, machine learning approaches trained on observational data have shown great potential in crop yield predictions. However, these models have not demonstrated improved performance over classical GGCMs and are not suitable for simulating crop yields under changing climate conditions due to problems in generalizing outside their training distributions. Here we introduce NeuralCrop, a hybrid GGCM that combines the strengths of an advanced process-based GGCM, resolving important processes explicitly, with data-driven machine learning components. The model is first trained to emulate a competitive GGCM before it is fine-tuned on observational data. We show that NeuralCrop outperforms state-of-the-art GGCMs across site-level and large-scale cropping regions. Across moisture conditions, NeuralCrop reproduces the interannual yield anomalies in European wheat regions and the US Corn Belt more accurately during the period from 2000 to 2019 with particularly strong improvements under drought extremes. When generalizing to conditions unseen during training, NeuralCrop continues to make robust projections, while pure machine learning models exhibit substantial performance degradation. Our results show that our hybrid crop modelling approach offers overall improved crop modeling and more reliable yield projections under climate change and intensifying extreme weather conditions.

[263] Cost-TrustFL: Cost-Aware Hierarchical Federated Learning with Lightweight Reputation Evaluation across Multi-Cloud

Jixiao Yang, Jinyu Chen, Zixiao Huang, Chengda Xu, Chi Zhang, Sijia Li

Main category: cs.LG

TL;DR: Cost-TrustFL: Hierarchical federated learning framework for multi-cloud environments that jointly optimizes model accuracy and communication costs while defending against Byzantine attacks.

Details

Motivation: Federated learning in multi-cloud environments faces challenges: non-IID data distributions, malicious participants, and high cross-cloud communication costs (egress fees). Existing Byzantine-robust methods focus on accuracy but ignore economic costs of data transfer between cloud providers.

Method: Cost-TrustFL uses gradient-based approximate Shapley value computation (reducing complexity from exponential to linear) for lightweight reputation evaluation. Implements cost-aware aggregation strategy prioritizing intra-cloud communication to minimize expensive cross-cloud data transfers.

Result: Achieves 86.7% accuracy on CIFAR-10 and FEMNIST datasets under 30% malicious clients while reducing communication costs by 32% compared to baseline methods. Maintains stable performance across varying non-IID degrees and attack intensities.

Conclusion: Cost-TrustFL provides a practical solution for real-world multi-cloud deployments by jointly addressing model performance, security against poisoning attacks, and economic communication costs.

Abstract: Federated learning across multi-cloud environments faces critical challenges, including non-IID data distributions, malicious participant detection, and substantial cross-cloud communication costs (egress fees). Existing Byzantine-robust methods focus primarily on model accuracy while overlooking the economic implications of data transfer across cloud providers. This paper presents Cost-TrustFL, a hierarchical federated learning framework that jointly optimizes model performance and communication costs while providing robust defense against poisoning attacks. We propose a gradient-based approximate Shapley value computation method that reduces the complexity from exponential to linear, enabling lightweight reputation evaluation. Our cost-aware aggregation strategy prioritizes intra-cloud communication to minimize expensive cross-cloud data transfers. Experiments on CIFAR-10 and FEMNIST datasets demonstrate that Cost-TrustFL achieves 86.7% accuracy under 30% malicious clients while reducing communication costs by 32% compared to baseline methods. The framework maintains stable performance across varying non-IID degrees and attack intensities, making it practical for real-world multi-cloud deployments.

[264] Zero-Overhead Introspection for Adaptive Test-Time Compute

Rohin Manvi, Joey Hong, Tim Seyde, Maxime Labonne, Mathias Lechner, Sergey Levine

Main category: cs.LG

TL;DR: ZIP-RC enables LLMs to predict their own success probability and remaining computation cost at each token, allowing adaptive inference decisions without extra overhead.

Details

Motivation: Current LLMs lack introspection to anticipate their own success and required computation, leading to inefficient fixed-budget sampling (like Best-of-N) and inability to make intelligent meta-cognition decisions about when to invest effort, stop, or signal success/failure.

Method: ZIP-RC reuses reserved/unused logits in the same forward pass to output joint distribution over final reward and remaining length. Uses this distribution to compute sampling utility (expected max reward, compute, latency) and maximizes utility with meta-actions that determine which token prefixes to continue or initiate sampling from.

Result: On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and produces smooth Pareto frontiers between quality, compute, and latency.

Conclusion: ZIP-RC provides real-time reward-cost introspection that enables adaptive, efficient reasoning without extra models or inference overhead, addressing key limitations in LLM meta-cognition.

Abstract: Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent meta-cognition decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, which equips models with zero-overhead introspective predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length – no extra models, architecture change, or inference overhead. This full joint distribution is used to compute a sampling utility which is the linear combination of the expected maximum reward, total compute, and latency of set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.

[265] Generalisation in Multitask Fitted Q-Iteration and Offline Q-learning

Kausthubh Manda, Raghuram Bharadwaj Diddigi

Main category: cs.LG

TL;DR: Multitask offline RL with shared low-rank representations improves statistical efficiency via data pooling across tasks, achieving 1/√(nT) sample dependence and better downstream generalization.

Details

Motivation: To improve statistical efficiency and generalization in offline RL when multiple tasks share underlying structure, without requiring online interaction during learning.

Method: Multitask variant of fitted Q-iteration that jointly learns shared representation and task-specific value functions via Bellman error minimization on offline datasets.

Result: Established finite-sample generalization guarantees showing 1/√(nT) dependence on total samples across tasks, and demonstrated that learned representations reduce downstream task complexity.

Conclusion: Shared representations in multitask offline Q-learning improve generalization and statistical efficiency, providing theoretical insight into when multitask structure benefits model-free, value-based RL.

Abstract: We study offline multitask reinforcement learning in settings where multiple tasks share a low-rank representation of their action-value functions. In this regime, a learner is provided with fixed datasets collected from several related tasks, without access to further online interaction, and seeks to exploit shared structure to improve statistical efficiency and generalization. We analyze a multitask variant of fitted Q-iteration that jointly learns a shared representation and task-specific value functions via Bellman error minimization on offline data. Under standard realizability and coverage assumptions commonly used in offline reinforcement learning, we establish finite-sample generalization guarantees for the learned value functions. Our analysis explicitly characterizes how pooling data across tasks improves estimation accuracy, yielding a $1/\sqrt{nT}$ dependence on the total number of samples across tasks, while retaining the usual dependence on the horizon and concentrability coefficients arising from distribution shift. In addition, we consider a downstream offline setting in which a new task shares the same underlying representation as the upstream tasks. We study how reusing the representation learned during the multitask phase affects value estimation for this new task, and show that it can reduce the effective complexity of downstream learning relative to learning from scratch. Together, our results clarify the role of shared representations in multitask offline Q-learning and provide theoretical insight into when and how multitask structure can improve generalization in model-free, value-based reinforcement learning.

[266] TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning

Saisai Yang, Qingyi Huang, Jing Yuan, Liangyu Zha, Kai Tang, Yuhang Yang, Ning Wang, Yucheng Wei, Liyao Li, Wentao Ye, Hao Chen, Tao Zhang, Junlin Zhou, Haobo Wang, Gang Chen, Junbo Zhao

Main category: cs.LG

TL;DR: TableGPT-R1 is a specialized tabular model using systematic RL framework to overcome limitations of SFT-tuned LLMs in handling complex multi-step reasoning and code execution for table tasks.

Details

Motivation: Current LLMs fine-tuned via SFT fall short in handling complex multi-step reasoning and robust code execution for real-world table tasks. RL offers promise but faces three critical hurdles: scarcity of high-quality agentic trajectories, extreme heterogeneity of feedback signals, and risk of catastrophic forgetting during specialization.

Method: 1) Comprehensive data engineering pipeline synthesizing difficulty-stratified agentic trajectories; 2) Task-adaptive reward system combining rule-based verification with criteria-injected reward model, plus process-level step reward shaping with behavioral regularization; 3) Multi-stage training framework that progressively stabilizes reasoning before specializing in table-specific tasks.

Result: TableGPT-R1 achieves state-of-the-art performance on authoritative benchmarks, significantly outperforming baseline models while retaining robust general capabilities.

Conclusion: The systematic RL framework successfully overcomes the three critical challenges in applying RL to tabular data, enabling advanced reasoning on complex tables while maintaining general capabilities.

Abstract: Tabular data serves as the backbone of modern data analysis and scientific research. While Large Language Models (LLMs) fine-tuned via Supervised Fine-Tuning (SFT) have significantly improved natural language interaction with such structured data, they often fall short in handling the complex, multi-step reasoning and robust code execution required for real-world table tasks. Reinforcement Learning (RL) offers a promising avenue to enhance these capabilities, yet its application in the tabular domain faces three critical hurdles: the scarcity of high-quality agentic trajectories with closed-loop code execution and environment feedback on diverse table structures, the extreme heterogeneity of feedback signals ranging from rigid SQL execution to open-ended data interpretation, and the risk of catastrophic forgetting of general knowledge during vertical specialization. To overcome these challenges and unlock advanced reasoning on complex tables, we introduce \textbf{TableGPT-R1}, a specialized tabular model built on a systematic RL framework. Our approach integrates a comprehensive data engineering pipeline that synthesizes difficulty-stratified agentic trajectories for both supervised alignment and RL rollouts, a task-adaptive reward system that combines rule-based verification with a criteria-injected reward model and incorporates process-level step reward shaping with behavioral regularization, and a multi-stage training framework that progressively stabilizes reasoning before specializing in table-specific tasks. Extensive evaluations demonstrate that TableGPT-R1 achieves state-of-the-art performance on authoritative benchmarks, significantly outperforming baseline models while retaining robust general capabilities. Our model is available at https://huggingface.co/tablegpt/TableGPT-R1.

[267] Adaptive Multi-task Learning for Probabilistic Load Forecasting

Onintze Zaballa, Verónica Álvarez, Santiago Mazuelas

Main category: cs.LG

TL;DR: Proposes an adaptive multi-task learning method for probabilistic load forecasting that dynamically adjusts to changing consumption patterns and correlations among multiple entities using vector-valued hidden Markov models.

Details

Motivation: Simultaneous load forecasting across multiple entities is crucial for power system efficiency, but existing methods are limited to offline learning and cannot capture dynamic changes in consumption patterns and correlations.

Method: Uses vector-valued hidden Markov models with a recursive process to update model parameters dynamically, enabling adaptive multi-task learning for probabilistic load forecasting.

Result: Outperforms existing methods in both forecasting performance and uncertainty assessment when tested on datasets with diverse and dynamic consumption patterns.

Conclusion: The adaptive multi-task learning approach successfully addresses limitations of offline methods by dynamically capturing changing patterns and correlations, providing reliable probabilistic load forecasts for multiple entities.

Abstract: Simultaneous load forecasting across multiple entities (e.g., regions, buildings) is crucial for the efficient, reliable, and cost-effective operation of power systems. Accurate load forecasting is a challenging problem due to the inherent uncertainties in load demand, dynamic changes in consumption patterns, and correlations among entities. Multi-task learning has emerged as a powerful machine learning approach that enables the simultaneous learning across multiple related problems. However, its application to load forecasting remains underexplored and is limited to offline learning-based methods, which cannot capture changes in consumption patterns. This paper presents an adaptive multi-task learning method for probabilistic load forecasting. The proposed method can dynamically adapt to changes in consumption patterns and correlations among entities. In addition, the techniques presented provide reliable probabilistic predictions for loads of multiples entities and assess load uncertainties. Specifically, the method is based on vectorvalued hidden Markov models and uses a recursive process to update the model parameters and provide predictions with the most recent parameters. The performance of the proposed method is evaluated using datasets that contain the load demand of multiple entities and exhibit diverse and dynamic consumption patterns. The experimental results show that the presented techniques outperform existing methods both in terms of forecasting performance and uncertainty assessment.

[268] How I Met Your Bias: Investigating Bias Amplification in Diffusion Models

Nathan Roos, Ekaterina Iakovleva, Ani Gjergji, Vito Paolo Pastore, Enzo Tartaglione

Main category: cs.LG

TL;DR: Diffusion model samplers and their hyperparameters significantly affect bias amplification, not just the trained model itself.

Details

Motivation: While diffusion models excel at image synthesis, they tend to replicate and amplify dataset biases. Previous work treated bias amplification as inherent to diffusion models, but this paper investigates how sampling algorithms and hyperparameters influence bias amplification.

Method: Conducted controlled studies with models trained on Biased MNIST, Multi-Color MNIST, BFFHQ, and Stable Diffusion. Analyzed how different sampling algorithms and their hyperparameters affect bias amplification while keeping the trained model fixed.

Result: Sampling hyperparameters can induce both bias reduction and amplification. Samplers optimized for sample quality and speed have significant, measurable effects on bias amplification. The source of bias amplification is not just the trained model but also the sampling process.

Conclusion: Bias amplification in diffusion models is not solely inherent to the models themselves but is significantly influenced by sampling algorithms and their hyperparameters. This provides new opportunities for bias mitigation through careful sampler selection and hyperparameter tuning.

Abstract: Diffusion-based generative models demonstrate state-of-the-art performance across various image synthesis tasks, yet their tendency to replicate and amplify dataset biases remains poorly understood. Although previous research has viewed bias amplification as an inherent characteristic of diffusion models, this work provides the first analysis of how sampling algorithms and their hyperparameters influence bias amplification. We empirically demonstrate that samplers for diffusion models – commonly optimized for sample quality and speed – have a significant and measurable effect on bias amplification. Through controlled studies with models trained on Biased MNIST, Multi-Color MNIST and BFFHQ, and with Stable Diffusion, we show that sampling hyperparameters can induce both bias reduction and amplification, even when the trained model is fixed. Source code is available at https://github.com/How-I-met-your-bias/how_i_met_your_bias.

[269] DeepONet-accelerated Bayesian inversion for moving boundary problems

Marco A. Iglesias, Michael. E. Causon, Mikhail Y. Matveev, Andreas Endruweit, Michael . V. Tretyakov

Main category: cs.LG

TL;DR: DeepONet neural operator enables fast surrogate modeling for moving boundary problems, coupled with Ensemble Kalman Inversion for real-time parameter estimation in Resin Transfer Moulding processes.

Details

Motivation: To develop fast, accurate emulators for moving boundary systems that can be integrated into digital twin platforms, specifically for monitoring and controlling manufacturing processes like Resin Transfer Moulding.

Method: Uses Deep Operator Network (DeepONet) architecture to build efficient surrogate models for moving boundary problems in porous media flow, coupled with Ensemble Kalman Inversion (EKI) for Bayesian inverse problems.

Result: DeepONet surrogate accelerates inversion by several orders of magnitude compared to full-model EKI, enabling real-time, high-resolution estimation of permeability, porosity, and other parameters using both synthetic and experimental data.

Conclusion: Neural operator learning provides a powerful framework for digital twin deployment, with generalization across spatial/temporal domains and arbitrary sensor configurations without retraining, representing significant progress toward practical industrial applications.

Abstract: This work demonstrates that neural operator learning provides a powerful and flexible framework for building fast, accurate emulators of moving boundary systems, enabling their integration into digital twin platforms. To this end, a Deep Operator Network (DeepONet) architecture is employed to construct an efficient surrogate model for moving boundary problems in single-phase Darcy flow through porous media. The surrogate enables rapid and accurate approximation of complex flow dynamics and is coupled with an Ensemble Kalman Inversion (EKI) algorithm to solve Bayesian inverse problems. The proposed inversion framework is demonstrated by estimating the permeability and porosity of fibre reinforcements for composite materials manufactured via the Resin Transfer Moulding (RTM) process. Using both synthetic and experimental in-process data, the DeepONet surrogate accelerates inversion by several orders of magnitude compared with full-model EKI. This computational efficiency enables real-time, accurate, high-resolution estimation of local variations in permeability, porosity, and other parameters, thereby supporting effective monitoring and control of RTM processes, as well as other applications involving moving boundary flows. Unlike prior approaches for RTM inversion that learn mesh-dependent mappings, the proposed neural operator generalises across spatial and temporal domains, enabling evaluation at arbitrary sensor configurations without retraining, and represents a significant step toward practical industrial deployment of digital twins.

[270] Clust-PSI-PFL: A Population Stability Index Approach for Clustered Non-IID Personalized Federated Learning

Daniel M. Jimenez-Gutierrez, Mehrdad Hassanzadeh, Aris Anagnostopoulos, Ioannis Chatzigiannakis, Andrea Vitaletti

Main category: cs.LG

TL;DR: Clust-PSI-PFL: A clustering-based personalized federated learning framework using Population Stability Index to handle non-IID data, achieving better accuracy and fairness than state-of-the-art methods.

Details

Motivation: Federated learning suffers from performance degradation due to non-IID data across clients, which biases model updates. Existing approaches need better mechanisms to quantify and handle data distribution heterogeneity effectively.

Method: Proposes Clust-PSI-PFL framework that: 1) Uses weighted Population Stability Index (WPSI^L) to quantify non-IID data, shown to be more informative than existing metrics; 2) Forms distributionally homogeneous client groups via K-means++ clustering on PSI features; 3) Uses silhouette-based procedure to determine optimal cluster count; 4) Enables personalized federated learning within clusters.

Result: Across six datasets (tabular, image, text), two partition protocols (Dirichlet and Similarity), and multiple client sizes: Achieves up to 18% higher global accuracy than state-of-the-art baselines; Improves client fairness by 37% relative improvement under severe non-IID data; Typically yields few clusters with modest overhead.

Conclusion: PSI-guided clustering provides a principled, lightweight mechanism for robust personalized federated learning under label skew, effectively addressing non-IID data challenges while improving both accuracy and fairness.

Abstract: Federated learning (FL) supports privacy-preserving, decentralized machine learning (ML) model training by keeping data on client devices. However, non-independent and identically distributed (non-IID) data across clients biases updates and degrades performance. To alleviate these issues, we propose Clust-PSI-PFL, a clustering-based personalized FL framework that uses the Population Stability Index (PSI) to quantify the level of non-IID data. We compute a weighted PSI metric, $WPSI^L$, which we show to be more informative than common non-IID metrics (Hellinger, Jensen-Shannon, and Earth Mover’s distance). Using PSI features, we form distributionally homogeneous groups of clients via K-means++; the number of optimal clusters is chosen by a systematic silhouette-based procedure, typically yielding few clusters with modest overhead. Across six datasets (tabular, image, and text modalities), two partition protocols (Dirichlet with parameter $α$ and Similarity with parameter S), and multiple client sizes, Clust-PSI-PFL delivers up to 18% higher global accuracy than state-of-the-art baselines and markedly improves client fairness by a relative improvement of 37% under severe non-IID data. These results establish PSI-guided clustering as a principled, lightweight mechanism for robust PFL under label skew.

[271] HGAN-SDEs: Learning Neural Stochastic Differential Equations with Hermite-Guided Adversarial Training

Yuanjian Xu, Yuan Shuai, Jianing Hao, Guang Zhang

Main category: cs.LG

TL;DR: HGAN-SDEs: A GAN framework using Neural Hermite functions as efficient discriminators for learning SDE-driven distributions, improving computational efficiency and training stability over existing methods.

Details

Motivation: Existing Neural SDE GAN approaches face bottlenecks in designing discriminators that capture temporal dependencies efficiently. Neural CDE discriminators are computationally expensive and exacerbate adversarial training instability.

Method: Propose HGAN-SDEs framework leveraging Neural Hermite functions to construct structured, efficient discriminators. Hermite functions provide expressive yet lightweight basis for approximating path-level dynamics.

Result: Theoretical: Establish universal approximation property for broad class of SDE-driven distributions and characterize convergence behavior. Empirical: Superior sample quality and learning efficiency on synthetic and real-world systems compared to existing generative models for SDEs.

Conclusion: HGAN-SDEs address computational and stability limitations of prior SDE GAN approaches by using Hermite function-based discriminators, enabling more efficient and stable learning of continuous-time stochastic processes.

Abstract: Neural Stochastic Differential Equations (Neural SDEs) provide a principled framework for modeling continuous-time stochastic processes and have been widely adopted in fields ranging from physics to finance. Recent advances suggest that Generative Adversarial Networks (GANs) offer a promising solution to learning the complex path distributions induced by SDEs. However, a critical bottleneck lies in designing a discriminator that faithfully captures temporal dependencies while remaining computationally efficient. Prior works have explored Neural Controlled Differential Equations (CDEs) as discriminators due to their ability to model continuous-time dynamics, but such architectures suffer from high computational costs and exacerbate the instability of adversarial training. To address these limitations, we introduce HGAN-SDEs, a novel GAN-based framework that leverages Neural Hermite functions to construct a structured and efficient discriminator. Hermite functions provide an expressive yet lightweight basis for approximating path-level dynamics, enabling both reduced runtime complexity and improved training stability. We establish the universal approximation property of our framework for a broad class of SDE-driven distributions and theoretically characterize its convergence behavior. Extensive empirical evaluations on synthetic and real-world systems demonstrate that HGAN-SDEs achieve superior sample quality and learning efficiency compared to existing generative models for SDEs

[272] Mixture-of-Experts with Gradient Conflict-Driven Subspace Topology Pruning for Emergent Modularity

Yuxing Gan, Ziyu Lei

Main category: cs.LG

TL;DR: CDSP-MoE addresses MoE limitations by shifting from isolated experts to dynamic instantiation in shared subspace, using gradient conflicts to prune interfering connections and enable content-driven routing without task labels.

Details

Motivation: Current MoE architectures suffer from structural parameter isolation causing catastrophic forgetting, and instruction-overfitting that degrades performance in instruction-free scenarios.

Method: CDSP-MoE maintains a super-complete parameter backbone where logical experts are carved out via learnable topology masks. It uses a Lagged Gradient Game to penalize interfering connections in the shared manifold, enabling spontaneous pruning of conflicting pathways.

Result: CDSP-MoE achieves robust content-driven routing without human-defined task labels, maintaining semantic specialization even under strict blind inference protocols where explicit instructions are absent.

Conclusion: The framework enables interpretable modular structures through conflict-driven subspace pruning, addressing fundamental limitations of contemporary MoE designs.

Abstract: Mixture-of-Experts (MoE) architectures achieve parameter efficiency through conditional computation, yet contemporary designs suffer from two fundamental limitations: structural parameter isolation that causes catastrophic forgetting, and instruction-overfitting that degrades performance in instruction-free scenarios. We propose CDSP-MoE (Conflict-Driven Subspace Pruning MoE), a framework that addresses these issues through a paradigm shift from isolated expert containers to dynamic expert instantiation within a shared physical subspace. Grounded in the Universal Weight Subspace Hypothesis, CDSP-MoE maintains a super-complete parameter backbone where logical experts are carved out via learnable topology masks. Unlike prior work that uses gradient conflict for token reassignment or optimization surgery, we leverage it as a structural supervisory signal: a Lagged Gradient Game penalizes interfering connections in the shared manifold, enabling the topology to spontaneously prune conflicting pathways and evolve interpretable modular structures. Experimental results demonstrate that CDSP-MoE achieves robust content-driven routing without human-defined task labels, maintaining semantic specialization even under strict blind inference protocols where explicit instructions are absent. Code is available at: https://github.com/konodiodaaaaa1/Conflict-Driven-Subspace-Pruning-Mixture-of-Experts

[273] Simplifying Multi-Task Architectures Through Task-Specific Normalization

Mihai Suteu, Ovidiu Serban

Main category: cs.LG

TL;DR: Task-specific normalization layers alone can effectively address multi-task learning challenges, eliminating need for complex architectures.

Details

Motivation: Multi-task learning faces challenges in balancing resources and mitigating interference between tasks. Existing architectural solutions often introduce complex task-specific modules or routing schemes that increase overhead.

Method: Proposes Task-Specific Sigmoid Batch Normalization (TSσBN), a lightweight mechanism that replaces shared normalization with task-specific variants, enabling tasks to softly allocate network capacity while fully sharing feature extractors.

Result: TSσBN improves stability across CNNs and Transformers, matching or exceeding performance on NYUv2, Cityscapes, CelebA, and PascalContext benchmarks while remaining highly parameter-efficient.

Conclusion: Complex MTL architectures may be unnecessary; task-specific normalization offers a simple, interpretable, and efficient alternative that provides insights into capacity allocation, filter specialization, and task relationships.

Abstract: Multi-task learning (MTL) aims to leverage shared knowledge across tasks to improve generalization and parameter efficiency, yet balancing resources and mitigating interference remain open challenges. Architectural solutions often introduce elaborate task-specific modules or routing schemes, increasing complexity and overhead. In this work, we show that normalization layers alone are sufficient to address many of these challenges. Simply replacing shared normalization with task-specific variants already yields competitive performance, questioning the need for complex designs. Building on this insight, we propose Task-Specific Sigmoid Batch Normalization (TS$σ$BN), a lightweight mechanism that enables tasks to softly allocate network capacity while fully sharing feature extractors. TS$σ$BN improves stability across CNNs and Transformers, matching or exceeding performance on NYUv2, Cityscapes, CelebA, and PascalContext, while remaining highly parameter-efficient. Moreover, its learned gates provide a natural framework for analyzing MTL dynamics, offering interpretable insights into capacity allocation, filter specialization, and task relationships. Our findings suggest that complex MTL architectures may be unnecessary and that task-specific normalization offers a simple, interpretable, and efficient alternative.

[274] FedDPC : Handling Data Heterogeneity and Partial Client Participation in Federated Learning

Mrinmay Sen, Subhrajit Nag

Main category: cs.LG

TL;DR: FedDPC is a federated learning method that addresses both data heterogeneity and partial client participation by projecting local updates onto previous global updates and using adaptive scaling to accelerate training.

Details

Motivation: Data heterogeneity in FL creates variance in local model updates, causing the global model to shift away from the true optimum. Partial client participation exacerbates this by skewing aggregation toward participating clients' data distributions, creating additional variance and instability that degrades model performance and slows training.

Method: FedDPC projects each local update onto the previous global update to control variance in both local and global updates. It also employs adaptive scaling for each local update before aggregation to accelerate FL training.

Result: Extensive experiments on image classification tasks with multiple heterogeneously partitioned datasets show FedDPC outperforms state-of-the-art FL algorithms by achieving faster reduction in training loss and improved test accuracy across communication rounds.

Conclusion: FedDPC effectively mitigates both data heterogeneity and partial client participation challenges in federated learning, leading to more stable training, faster convergence, and better global model performance compared to existing methods.

Abstract: Data heterogeneity is a significant challenge in modern federated learning (FL) as it creates variance in local model updates, causing the aggregated global model to shift away from the true global optimum. Partial client participation in FL further exacerbates this issue by skewing the aggregation of local models towards the data distribution of participating clients. This creates additional variance in the global model updates, causing the global model to converge away from the optima of the global objective. These variances lead to instability in FL training, which degrades global model performance and slows down FL training. While existing literature primarily focuses on addressing data heterogeneity, the impact of partial client participation has received less attention. In this paper, we propose FedDPC, a novel FL method, designed to improve FL training and global model performance by mitigating both data heterogeneity and partial client participation. FedDPC addresses these issues by projecting each local update onto the previous global update, thereby controlling variance in both local and global updates. To further accelerate FL training, FedDPC employs adaptive scaling for each local update before aggregation. Extensive experiments on image classification tasks with multiple heterogeneously partitioned datasets validate the effectiveness of FedDPC. The results demonstrate that FedDPC outperforms state-of-the-art FL algorithms by achieving faster reduction in training loss and improved test accuracy across communication rounds.

[275] Inverse Autoregressive Flows for Zero Degree Calorimeter fast simulation

Emilia Majerz, Witold Dzwinel, Jacek Kitowski

Main category: cs.LG

TL;DR: Physics-based ML accelerates ALICE ZDC simulations using novel loss function and scaling, achieving 421x speedup over existing NF implementations.

Details

Motivation: To accelerate simulations of the Zero Degree Calorimeter (ZDC) in the ALICE experiment at CERN by blending physics knowledge with data-driven techniques for more accurate and robust models.

Method: Uses physics-based machine learning with a novel loss function and output variability-based scaling mechanism. Leverages Normalizing Flows in a teacher-student generative framework to enhance representation of spatial distribution and morphology of particle showers while mitigating rare artefacts.

Result: The approach outperforms classic data-driven model assimilation and achieves models that are 421 times faster than existing Normalizing Flow implementations in ZDC simulation literature.

Conclusion: Physics-based machine learning with specialized loss functions and scaling mechanisms can significantly accelerate particle detector simulations while maintaining or improving accuracy over purely data-driven approaches.

Abstract: Physics-based machine learning blends traditional science with modern data-driven techniques. Rather than relying exclusively on empirical data or predefined equations, this methodology embeds domain knowledge directly into the learning process, resulting in models that are both more accurate and robust. We leverage this paradigm to accelerate simulations of the Zero Degree Calorimeter (ZDC) of the ALICE experiment at CERN. Our method introduces a novel loss function and an output variability-based scaling mechanism, which enhance the model’s capability to accurately represent the spatial distribution and morphology of particle showers in detector outputs while mitigating the influence of rare artefacts on the training. Leveraging Normalizing Flows (NFs) in a teacher-student generative framework, we demonstrate that our approach not only outperforms classic data-driven model assimilation but also yields models that are 421 times faster than existing NF implementations in ZDC simulation literature.

[276] Physics-guided Neural Network-based Shaft Power Prediction for Vessels

Dogan Altan, Hamza Haruna Mohammed, Glenn Terje Lines, Dusica Marijan, Arnbjørn Maressa

Main category: cs.LG

TL;DR: Physics-guided neural network for vessel shaft power prediction outperforms both traditional empirical formulas and baseline neural networks in accuracy metrics.

Details

Motivation: Accurate shaft power prediction is crucial for optimizing maritime fuel consumption and reducing costs/emissions. Traditional empirical formulas struggle with dynamic conditions like sea states and vessel fouling.

Method: Hybrid physics-guided neural network that incorporates empirical formulas within the network architecture to combine advantages of both neural networks and traditional techniques.

Result: The physics-guided neural network achieved lower mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) for all four tested cargo vessels compared to both empirical formula-based methods and baseline neural networks.

Conclusion: The hybrid approach successfully combines the strengths of physics-based modeling and data-driven neural networks, providing more accurate shaft power predictions for maritime operations optimization.

Abstract: Optimizing maritime operations, particularly fuel consumption for vessels, is crucial, considering its significant share in global trade. As fuel consumption is closely related to the shaft power of a vessel, predicting shaft power accurately is a crucial problem that requires careful consideration to minimize costs and emissions. Traditional approaches, which incorporate empirical formulas, often struggle to model dynamic conditions, such as sea conditions or fouling on vessels. In this paper, we present a hybrid, physics-guided neural network-based approach that utilizes empirical formulas within the network to combine the advantages of both neural networks and traditional techniques. We evaluate the presented method using data obtained from four similar-sized cargo vessels and compare the results with those of a baseline neural network and a traditional approach that employs empirical formulas. The experimental results demonstrate that the physics-guided neural network approach achieves lower mean absolute error, root mean square error, and mean absolute percentage error for all tested vessels compared to both the empirical formula-based method and the base neural network.

[277] Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs

Rui Pan, Zhuofu Chen, Ravi Netravali

Main category: cs.LG

TL;DR: FailFast uses diffusion LLMs as drafters in speculative decoding to achieve lossless acceleration of autoregressive LLMs by dynamically adapting speculation lengths based on difficulty.

Details

Motivation: Diffusion LLMs offer fast parallel token generation but suffer from an efficiency-quality tradeoff when used standalone. The authors aim to leverage dLLMs' strengths as drafters in speculative decoding to overcome this limitation.

Method: FailFast framework uses dLLMs as drafters with AR verifiers, dynamically adapting speculation length: “fails fast” in hard regions with minimal compute, and “wins big” in easy regions by aggressively extending draft lengths (up to 70 tokens).

Result: Achieves up to 4.9× speedup over vanilla decoding, 1.7× over best naive dLLM drafter, and 1.4× over EAGLE-3 across diverse models and workloads, with lossless acceleration.

Conclusion: dLLMs can be effectively used as drafters in speculative decoding when combined with dynamic length adaptation, providing significant speedups without quality loss.

Abstract: Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM’s speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It “fails fast” by spending minimal compute in hard-to-speculate regions to shrink speculation latency and “wins big” by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9$\times$ speedup over vanilla decoding, 1.7$\times$ over the best naive dLLM drafter, and 1.4$\times$ over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.

[278] Field-Space Attention for Structure-Preserving Earth System Transformers

Maximilian Witte, Johannes Meuer, Étienne Plésiat, Christopher Kadow

Main category: cs.LG

TL;DR: Field-Space Attention: A transformer mechanism that computes attention in the physical domain rather than latent space, preserving geometric structure for Earth system modeling.

Details

Motivation: Accurate Earth system modeling requires ML architectures that operate directly on continuous geophysical fields while preserving their underlying geometric structure and enabling interpretability and scientific constraints.

Method: Introduces Field-Space attention mechanism for Earth system Transformers that maintains all intermediate representations as continuous fields on the sphere, uses fixed multiscale decomposition, learns structure-preserving deformations, and computes attention in physical domain.

Result: Field-Space Transformers converge more rapidly and stably than conventional Vision Transformers and U-Nets for global temperature super-resolution on HEALPix grid, requiring fewer parameters while improving fidelity and reliability.

Conclusion: Field-Space Attention provides a compact, interpretable, physically grounded building block for next-generation Earth system prediction and generative modeling frameworks by explicitly preserving field structure throughout the network.

Abstract: Accurate and physically consistent modeling of Earth system dynamics requires machine-learning architectures that operate directly on continuous geophysical fields and preserve their underlying geometric structure. Here we introduce Field-Space attention, a mechanism for Earth system Transformers that computes attention in the physical domain rather than in a learned latent space. By maintaining all intermediate representations as continuous fields on the sphere, the architecture enables interpretable internal states and facilitates the enforcement of scientific constraints. The model employs a fixed, non-learned multiscale decomposition and learns structure-preserving deformations of the input field, allowing coherent integration of coarse and fine-scale information while avoiding the optimization instabilities characteristic of standard single-scale Vision Transformers. Applied to global temperature super-resolution on a HEALPix grid, Field-Space Transformers converge more rapidly and stably than conventional Vision Transformers and U-Net baselines, while requiring substantially fewer parameters. The explicit preservation of field structure throughout the network allows physical and statistical priors to be embedded directly into the architecture, yielding improved fidelity and reliability in data-driven Earth system modeling. These results position Field-Space Attention as a compact, interpretable, and physically grounded building block for next-generation Earth system prediction and generative modeling frameworks.

[279] Performative Policy Gradient: Optimality in Performative Reinforcement Learning

Debabrota Basu, Udvas Das, Brahim Driss, Uddalak Mukherjee

Main category: cs.LG

TL;DR: PePG is a new policy gradient algorithm for RL that accounts for performative effects - where deployed policies influence their own environment dynamics. It converges to performatively optimal policies that remain optimal under self-induced distribution shifts.

Details

Motivation: Standard RL methods ignore that deployed policies influence their environments, causing distribution shifts. While performative effects have been studied in supervised learning, they remain under-explored in RL. Existing performative RL methods only achieve stability, not optimality.

Method: The authors prove performative versions of the performance difference lemma and policy gradient theorem, then introduce PePG (Performative Policy Gradient). The algorithm accounts for performativity and works under softmax parametrization with/without entropy regularization.

Result: PePG converges to performatively optimal policies - policies that remain optimal under the distribution shifts they induce. Empirical analysis shows PePG outperforms standard policy gradient algorithms and existing performative RL algorithms that only aim for stability.

Conclusion: PePG is the first policy gradient algorithm designed for performative RL that achieves performative optimality, significantly advancing beyond prior work that only achieved stability. It provides a principled approach to handle self-induced distribution shifts in RL deployments.

Abstract: Post-deployment machine learning algorithms often influence the environments they act in, and thus shift the underlying dynamics that the standard reinforcement learning (RL) methods ignore. While designing optimal algorithms in this performative setting has recently been studied in supervised learning, the RL counterpart remains under-explored. In this paper, we prove the performative counterparts of the performance difference lemma and the policy gradient theorem in RL, and further introduce the Performative Policy Gradient algorithm (PePG). PePG is the first policy gradient algorithm designed to account for performativity in RL. Under softmax parametrisation, and also with and without entropy regularisation, we prove that PePG converges to performatively optimal policies, i.e. policies that remain optimal under the distribution shifts induced by themselves. Thus, PePG significantly extends the prior works in Performative RL that achieves performative stability but not optimality. Furthermore, our empirical analysis on standard performative RL environments validate that PePG outperforms standard policy gradient algorithms and the existing performative RL algorithms aiming for stability.

[280] GeoTransolver: Learning Physics on Irregumar Domains Using Multi-scale Geometry Aware Physics Attention Transformer

Corey Adams, Rishikesh Ranade, Ram Cherukuri, Sanjay Choudhry

Main category: cs.LG

TL;DR: GeoTransolver is a multiscale geometry-aware physics attention transformer for CAE that replaces standard attention with GALE, coupling physics-aware self-attention with cross-attention to shared geometry/global/boundary-condition context from multi-scale ball queries.

Details

Motivation: To advance operator learning for high-fidelity surrogate modeling across complex, irregular domains and non-linear physical regimes by unifying multiscale geometry-aware context with physics-based attention in a scalable transformer architecture.

Method: Replaces standard attention with GALE (Geometry-Aware Physics Attention), coupling physics-aware self-attention on learned state slices with cross-attention to a shared geometry/global/boundary-condition context computed from multi-scale ball queries (inspired by DoMINO). Persistently projects geometry, global and boundary condition parameters into physical state spaces to anchor latent computations to domain structure and operating regimes.

Result: Benchmarked on DrivAerML, Luminary SHIFT-SUV, and Luminary SHIFT-Wing, showing better accuracy, improved robustness to geometry/regime shifts, and favorable data efficiency compared to Domino, Transolver, and AB-UPT. Achieved superior drag/lift R2 and Relative L1 errors for field variables.

Conclusion: GeoTransolver advances operator learning for high-fidelity surrogate modeling across complex, irregular domains and non-linear physical regimes by unifying multiscale geometry-aware context with physics-based attention in a scalable transformer architecture.

Abstract: We present GeoTransolver, a Multiscale Geometry-Aware Physics Attention Transformer for CAE that replaces standard attention with GALE, coupling physics-aware self-attention on learned state slices with cross-attention to a shared geometry/global/boundary-condition context computed from multi-scale ball queries (inspired by DoMINO) and reused in every block. Implemented and released in NVIDIA PhysicsNeMo, GeoTransolver persistently projects geometry, global and boundary condition parameters into physical state spaces to anchor latent computations to domain structure and operating regimes. We benchmark GeoTransolver on DrivAerML, Luminary SHIFT-SUV, and Luminary SHIFT-Wing, comparing against Domino, Transolver (as released in PhysicsNeMo), and literature-reported AB-UPT, and evaluate drag/lift R2 and Relative L1 errors for field variables. GeoTransolver delivers better accuracy, improved robustness to geometry/regime shifts, and favorable data efficiency; we include ablations on DrivAerML and qualitative results such as contour plots and design trends for the best GeoTransolver models. By unifying multiscale geometry-aware context with physics-based attention in a scalable transformer, GeoTransolver advances operator learning for high-fidelity surrogate modeling across complex, irregular domains and non-linear physical regimes.

[281] BRIDGE: Budget-aware Reasoning via Intermediate Distillation with Guided Examples

Xuan-An Le, Minh-Nam Tran, Son Nguyen

Main category: cs.LG

TL;DR: BRIDGE is a two-phase distillation framework that uses a mid-sized Teacher Assistant to bridge the capacity gap between large proprietary models and tiny deployable models, achieving better performance with 10x fewer teacher queries.

Details

Motivation: Direct knowledge distillation from large models (like GPT-4) to tiny models (<1B parameters) faces a capacity-budget trap: huge capacity gap prevents effective transfer, while API costs limit data collection.

Method: Two-phase framework: 1) Train mid-sized Teacher Assistant (TA) on limited data (3-5%) selected via zero-API-cost pipeline balancing difficulty and diversity; 2) Use TA to generate synthetic rationales for full dataset to train tiny student, with instruction-tuning curriculum for behavioral alignment.

Result: 28-41% student performance gains, closing capability gap with proprietary teachers by 12-16%, using 10x fewer teacher queries. Surpasses direct distillation baselines using 100% of budget while consuming only 5% of resources.

Conclusion: BRIDGE effectively resolves the capacity-budget trap in knowledge distillation through strategic intermediation and budget asymmetry, defying conventional cost-performance frontiers and enabling efficient deployment of tiny models.

Abstract: Distilling knowledge from large proprietary models (e.g., GPT-4) to tiny deployable models (less than 1B parameters) faces a critical capacity-budget trap: the 1000x capacity gap between teachers and students prevents effective direct transfer, while API costs prohibit extensive data collection. We introduce BRIDGE (Budget-Aware Reasoning via Intermediate Distillation), a two-phase framework that resolves these constraints through strategic intermediation and budget asymmetry. In Phase 1, a mid-sized Teacher Assistant (TA; e.g., about 7B) learns from the black-box teacher on a strictly limited subset of data (e.g., 3-5%), selected via a zero-API-cost pipeline that balances entropic difficulty and semantic diversity using only local TA inference. In Phase 2, we exploit this asymmetry-teacher queries are expensive, whereas TA inference is free to amplify supervision: the refined TA generates synthetic rationales for the full dataset to train the tiny student. Crucially, we apply an instruction-tuning curriculum to establish behavioral alignment in the tiny student before transferring reasoning. Our theoretical analysis shows that BRIDGE yields tighter generalization bounds than direct distillation when data is abundant. Experiments across medical, legal, and financial benchmarks demonstrate consistent improvements: BRIDGE delivers student performance gains of 28-41%, closing the capability gap with proprietary teachers by 12-16% while using 10x fewer teacher queries. Notably, BRIDGE defies the conventional cost-performance frontier, surpassing direct distillation baselines that use 100% of the budget while consuming only 5% of the resources.

[282] Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning

Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel, Angelika Steger, Maciej Wolczyk, Johannes von Oswald, Nino Scherre, Kaitlin Maile, Guillaume Lajoie, Blake A. Richards, Rif A. Saurous, James Manyika, Blaise Agüera y Arcas, Alexander Meulemans, João Sacramento

Main category: cs.LG

TL;DR: The paper introduces “internal RL” - a method for hierarchical reinforcement learning within autoregressive models by acting and exploring in their internal representations rather than token-by-token sampling.

Details

Motivation: Standard RL finetuning of autoregressive models explores by generating outputs token-by-token, which is inefficient for sparse rewards. There's a need for more efficient exploration through temporally abstract actions.

Method: Introduces a higher-order, non-causal sequence model that outputs control signals for the residual stream activations of a base autoregressive model. This learns to compress long activation sequences into internal controllers with learned termination conditions.

Result: On grid world and MuJoCo tasks with hierarchical structure, the method learns internal controllers that execute behaviorally meaningful action sequences over long timescales. Internal RL enables learning from sparse rewards where standard RL fails.

Conclusion: Internal RL demonstrates benefits of latent action generation and reinforcement in autoregressive models, offering a promising approach for hierarchical RL within foundation models.

Abstract: Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied with a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term “internal RL”, enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.

[283] Machine Learning to Predict Digital Frustration from Clickstream Data

Jibin Joseph

Main category: cs.LG

TL;DR: Using clickstream data from an e-commerce site, this research predicts frustrated user sessions with 91% accuracy using LSTM models, achieving reliable predictions within the first 20-30 interactions.

Details

Motivation: User frustration in mobile apps and websites leads to lost sales and complaints for businesses. Predicting frustrated sessions can help identify and address usability issues before customers abandon their tasks.

Method: Used 5.4 million clickstream events (304,881 sessions) from a real e-commerce site. Defined frustration using rules based on rage bursts, U-turns, cart churn, search struggle, and long wandering. Built tabular features for standard classifiers and used full event sequences for discriminative LSTM models.

Result: XGBoost achieved 90% accuracy with ROC AUC of 0.9579, while LSTM performed best with 91% accuracy and ROC AUC of 0.9705. The LSTM could reliably predict frustration using only the first 20-30 interactions.

Conclusion: LSTM models effectively predict user frustration from clickstream data, with early prediction capability (20-30 interactions) enabling proactive intervention to improve user experience and reduce business losses.

Abstract: Many businesses depend on their mobile apps and websites, so user frustration while trying to complete a task on these channels can cause lost sales and complaints. In this research, I use clickstream data from a real e-commerce site to predict whether a session is frustrated or not. Frustration is defined using certain rules based on rage bursts, back and forth navigation (U turns), cart churn, search struggle, and long wandering sessions, and applies these rules to 5.4 million raw clickstream events (304,881 sessions). From each session, I build tabular features and train standard classifier models. I also use the full event sequence to train a discriminative LSTM classifier. XGBoost reaches about 90% accuracy, ROC AUC of 0.9579, while the LSTM performs best with about 91% accuracy and a ROC AUC of 0.9705. Finally, the research shows that with only the first 20 to 30 interactions, the LSTM already predicts frustration reliably.

[284] Recurrent Off-Policy Deep Reinforcement Learning Doesn’t Have to be Slow

Tyler Clark, Christine Evers, Jonathon Hare

Main category: cs.LG

TL;DR: RISE enables efficient recurrent networks for image-based RL with minimal computational overhead, boosting Atari performance by 35.6% IQM.

Details

Motivation: Recurrent off-policy RL models achieve state-of-the-art performance but are computationally expensive, limiting their practical adoption despite their effectiveness.

Method: RISE (Recurrent Integration via Simplified Encodings) uses both learnable and non-learnable encoder layers to integrate recurrent networks into any image-based off-policy RL setting without significant computational overhead.

Result: When integrated into leading non-recurrent off-policy RL algorithms, RISE achieves a 35.6% human-normalized interquartile mean (IQM) performance improvement across the Atari benchmark.

Conclusion: RISE provides a versatile framework that makes recurrent networks practical for image-based RL by eliminating computational barriers while delivering substantial performance gains.

Abstract: Recurrent off-policy deep reinforcement learning models achieve state-of-the-art performance but are often sidelined due to their high computational demands. In response, we introduce RISE (Recurrent Integration via Simplified Encodings), a novel approach that can leverage recurrent networks in any image-based off-policy RL setting without significant computational overheads via using both learnable and non-learnable encoder layers. When integrating RISE into leading non-recurrent off-policy RL algorithms, we observe a 35.6% human-normalized interquartile mean (IQM) performance improvement across the Atari benchmark. We analyze various implementation strategies to highlight the versatility and potential of our proposed framework.

[285] Explainable time-series forecasting with sampling-free SHAP for Transformers

Matthias Hertel, Sebastian Pütz, Ralf Mikut, Veit Hagenmeyer, Benjamin Schäfer

Main category: cs.LG

TL;DR: SHAPformer is a fast, sampling-free explainable time-series forecasting model based on Transformers that generates accurate explanations much faster than traditional SHAP methods.

Details

Motivation: Time-series forecasts need explainability for user trust and transparency requirements, but current SHAP methods lack efficient time-series implementations and make problematic feature independence assumptions when sampling counterfactuals.

Method: SHAPformer uses Transformer architecture with attention manipulation to make predictions based on feature subsets, enabling sampling-free Shapley value computation for time-series explanations.

Result: SHAPformer generates explanations in under one second (orders of magnitude faster than SHAP Permutation Explainer), provides accurate explanations on synthetic data with ground truth, achieves competitive predictive performance on real-world electrical load data, and delivers meaningful local/global insights.

Conclusion: SHAPformer offers an accurate, fast, and sampling-free approach for explainable time-series forecasting that overcomes limitations of traditional SHAP methods while providing meaningful insights into model behavior.

Abstract: Time-series forecasts are essential for planning and decision-making in many domains. Explainability is key to building user trust and meeting transparency requirements. Shapley Additive Explanations (SHAP) is a popular explainable AI framework, but it lacks efficient implementations for time series and often assumes feature independence when sampling counterfactuals. We introduce SHAPformer, an accurate, fast and sampling-free explainable time-series forecasting model based on the Transformer architecture. It leverages attention manipulation to make predictions based on feature subsets. SHAPformer generates explanations in under one second, several orders of magnitude faster than the SHAP Permutation Explainer. On synthetic data with ground truth explanations, SHAPformer provides explanations that are true to the data. Applied to real-world electrical load data, it achieves competitive predictive performance and delivers meaningful local and global insights, such as identifying the past load as the key predictor and revealing a distinct model behavior during the Christmas period.

[286] Improving ML Training Data with Gold-Standard Quality Metrics

Leslie Barrett, Michael W. Sherman

Main category: cs.LG

TL;DR: The paper proposes statistical methods to evaluate and improve hand-tagged training data quality using consistency and agreement metrics measured over multiple tagging iterations.

Details

Motivation: Hand-tagged training data is crucial for machine learning but quality varies considerably, yet quality control has received little attention in literature despite its importance.

Method: Uses statistical approaches to measure tagging consistency and agreement, with metrics recorded over multiple iterations of tagging. Shows that declining variance in such recordings indicates increasing data quality.

Result: Demonstrates that agreement metrics give more reliable results when measured over multiple iterations, and presents a method to collect high-quality training data without requiring multiple tags for every work item.

Conclusion: Tagging quality can be systematically evaluated and improved using statistical consistency measures, and a tagger burn-in period alone may not be sufficient for minimizing tagger errors.

Abstract: Hand-tagged training data is essential to many machine learning tasks. However, training data quality control has received little attention in the literature, despite data quality varying considerably with the tagging exercise. We propose methods to evaluate and enhance the quality of hand-tagged training data using statistical approaches to measure tagging consistency and agreement. We show that agreement metrics give more reliable results if recorded over multiple iterations of tagging, where declining variance in such recordings is an indicator of increasing data quality. We also show one way a tagging project can collect high-quality training data without requiring multiple tags for every work item, and that a tagger burn-in period may not be sufficient for minimizing tagger errors.

[287] Relu and softplus neural nets as zero-sum turn-based games

Stephane Gaubert, Yiannis Vlassopoulos

Main category: cs.LG

TL;DR: ReLU neural network outputs can be interpreted as values of zero-sum stopping games, enabling game-theoretic analysis and training as inverse game problems.

Details

Motivation: To provide a game-theoretic interpretation of neural networks that enables new analytical tools for understanding, bounding, verifying robustness, and training neural networks through the lens of game theory.

Method: Represent ReLU network outputs as values of zero-sum turn-based stopping games (ReLU net games) using Shapley-Bellman backward recursion, derive Feynman-Kac-type path-integral formulas, and extend to Softplus networks via entropic regularization.

Result: Established equivalence between neural network evaluation and game value computation, enabling derivation of input-output bounds, robustness verification via policy certificates, and reformulation of training as inverse game problems.

Conclusion: Game-theoretic representation provides powerful analytical framework for neural networks, connecting deep learning to game theory and optimal stopping, with applications to robustness analysis, verification, and training.

Abstract: We show that the output of a ReLU neural network can be interpreted as the value of a zero-sum, turn-based, stopping game, which we call the ReLU net game. The game runs in the direction opposite to that of the network, and the input of the network serves as the terminal reward of the game. In fact, evaluating the network is the same as running the Shapley-Bellman backward recursion for the value of the game. Using the expression of the value of the game as an expected total payoff with respect to the path measure induced by the transition probabilities and a pair of optimal policies, we derive a discrete Feynman-Kac-type path-integral formula for the network output. This game-theoretic representation can be used to derive bounds on the output from bounds on the input, leveraging the monotonicity of Shapley operators, and to verify robustness properties using policies as certificates. Moreover, training the neural network becomes an inverse game problem: given pairs of terminal rewards and corresponding values, one seeks transition probabilities and rewards of a game that reproduces them. Finally, we show that a similar approach applies to neural networks with Softplus activation functions, where the ReLU net game is replaced by its entropic regularization.

[288] Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures

Yedi Zhang, Andrew Saxe, Peter E. Latham

Main category: cs.LG

TL;DR: The paper presents a unifying theoretical framework explaining simplicity bias in neural networks through saddle-to-saddle learning dynamics, showing how different architectures progressively learn increasingly complex solutions.

Details

Motivation: Despite widespread observation of simplicity bias (neural networks learning increasingly complex solutions over time) across architectures, existing theoretical treatments lack a unifying framework to explain this phenomenon consistently.

Method: Developed a theoretical framework analyzing saddle-to-saddle learning dynamics for general neural networks (fully-connected, convolutional, attention-based). Analyzed fixed points, invariant manifolds, and gradient descent dynamics to show how networks iteratively evolve near invariant manifolds, approach saddles, and switch to new manifolds.

Result: Showed that linear networks learn increasing rank solutions, ReLU networks learn solutions with increasing kinks, convolutional networks learn with increasing kernels, and self-attention models learn with increasing attention heads. Also illuminated how data distribution and weight initialization affect plateau durations and numbers.

Conclusion: The theory provides a unifying framework for understanding when and why gradient descent progressively learns increasingly complex solutions across different neural network architectures through saddle-to-saddle dynamics.

Abstract: Neural networks trained with gradient descent often learn solutions of increasing complexity over time, a phenomenon known as simplicity bias. Despite being widely observed across architectures, existing theoretical treatments lack a unifying framework. We present a theoretical framework that explains a simplicity bias arising from saddle-to-saddle learning dynamics for a general class of neural networks, incorporating fully-connected, convolutional, and attention-based architectures. Here, simple means expressible with few hidden units, i.e., hidden neurons, convolutional kernels, or attention heads. Specifically, we show that linear networks learn solutions of increasing rank, ReLU networks learn solutions with an increasing number of kinks, convolutional networks learn solutions with an increasing number of convolutional kernels, and self-attention models learn solutions with an increasing number of attention heads. By analyzing fixed points, invariant manifolds, and dynamics of gradient descent learning, we show that saddle-to-saddle dynamics operates by iteratively evolving near an invariant manifold, approaching a saddle, and switching to another invariant manifold. Our analysis also illuminates the effects of data distribution and weight initialization on the duration and number of plateaus in learning, dissociating previously confounding factors. Overall, our theory offers a framework for understanding when and why gradient descent progressively learns increasingly complex solutions.

[289] Improving Local Training in Federated Learning via Temperature Scaling

Kichang Lee, Pei Zhang, Songkuk Kim, JeongGil Ko

Main category: cs.LG

TL;DR: FLex&Chill uses Logit Chilling to address non-i.i.d. data in federated learning, achieving 6X faster convergence and 3.37% accuracy improvement.

Details

Motivation: Federated learning suffers from data heterogeneity (non-i.i.d. training data across local clients), which hampers model performance and convergence.

Method: FLex&Chill approach exploits Logit Chilling method to handle non-i.i.d. data characteristics in federated learning systems.

Result: Up to 6X improvement in global federated learning model convergence time and up to 3.37% improvement in inference accuracy.

Conclusion: FLex&Chill effectively addresses data heterogeneity in federated learning, significantly improving convergence speed and model accuracy.

Abstract: Federated learning is inherently hampered by data heterogeneity: non-i.i.d. training data over local clients. We propose a novel model training approach for federated learning, FLex&Chill, which exploits the Logit Chilling method. Through extensive evaluations, we demonstrate that, in the presence of non-i.i.d. data characteristics inherent in federated learning systems, this approach can expedite model convergence and improve inference accuracy. Quantitatively, from our experiments, we observe up to 6X improvement in the global federated learning model convergence time, and up to 3.37% improvement in inference accuracy.

[290] Enhancing Topological Dependencies in Spatio-Temporal Graphs with Cycle Message Passing Blocks

Minho Lee, Yun Young Choi, Sun Woo Park, Seunghwan Lee, Joohwan Ko, Jaeyoung Hong

Main category: cs.LG

TL;DR: Cy2Mixer introduces a novel spatio-temporal GNN using topological invariants and gMLP blocks to better capture spatio-temporal dependencies through dedicated temporal, spatial, and cyclic message-passing components.

Details

Motivation: Existing GNN and Transformer methods for spatio-temporal graphs encode temporal and spatial relations independently and reflect topological characteristics in limited ways, failing to fully capture complex spatio-temporal dependencies.

Method: Cy2Mixer uses three gMLP-based blocks: temporal block for temporal properties, message-passing block for spatial information, and novel cycle message-passing block for topological enrichment through cyclic subgraphs, with mathematical justification for the cycle block’s unique contributions.

Result: Empirical evaluations demonstrate state-of-the-art performance across various spatio-temporal benchmark datasets, with mathematical evidence showing the cycle message-passing block provides differentiated information compared to standard message-passing.

Conclusion: Cy2Mixer effectively captures complex spatio-temporal dependencies by incorporating topological invariants and cyclic structures, outperforming existing methods and offering a novel architectural approach for spatio-temporal graph learning.

Abstract: Graph Neural Networks (GNNs) and Transformer-based models have been increasingly adopted to learn the complex vector representations of spatio-temporal graphs, capturing intricate spatio-temporal dependencies crucial for applications such as traffic datasets. Although many existing methods utilize multi-head attention mechanisms and message-passing neural networks (MPNNs) to capture both spatial and temporal relations, these approaches encode temporal and spatial relations independently, and reflect the graph’s topological characteristics in a limited manner. In this work, we introduce the Cycle to Mixer (Cy2Mixer), a novel spatio-temporal GNN based on topological non-trivial invariants of spatio-temporal graphs with gated multi-layer perceptrons (gMLP). The Cy2Mixer is composed of three blocks based on MLPs: A temporal block for capturing temporal properties, a message-passing block for encapsulating spatial information, and a cycle message-passing block for enriching topological information through cyclic subgraphs. We bolster the effectiveness of Cy2Mixer with mathematical evidence emphasizing that our cycle message-passing block is capable of offering differentiated information to the deep learning model compared to the message-passing block. Furthermore, empirical evaluations substantiate the efficacy of the Cy2Mixer, demonstrating state-of-the-art performances across various spatio-temporal benchmark datasets. The source code is available at https://github.com/leemingo/cy2mixer.

[291] FP=xINT:Representing Neural Networks via Low-Bit Series Basis Functions

Boyang Zhang, Daning Cheng, Yunquan Zhang, Jiake Tian, Jing Li, Fangming Liu

Main category: cs.LG

TL;DR: A series expansion framework for post-training quantization that expands full-precision models into multiple low-bit basis models to achieve accurate quantization without calibration or fine-tuning.

Details

Motivation: Existing PTQ methods degrade performance significantly at extremely low-bit settings due to quantization noise, creating a need for methods that can maintain accuracy without requiring calibration sets or fine-tuning.

Method: Expands full-precision models into multiple low-bit basis models at different granularities (tensor, layer, model), uses AbelianAdd/Mul operations between isomorphic models for parallelism, and theoretically proves convergence to the dense model.

Result: Achieves state-of-the-art performance in low-bit settings, with 4-bit ResNet-50 quantization surpassing original accuracy (77.03%), demonstrating the first successful application of series expansion to neural network quantization.

Conclusion: The series expansion framework enables accurate low-bit quantization without calibration or fine-tuning, representing a novel approach to PTQ that maintains model performance while reducing computational costs.

Abstract: Post-Training Quantization (PTQ) converts pre-trained Full-Precision (FP) models into quantized versions without training. While existing methods reduce size and computational costs, they also significantly degrade performance and quantization efficiency at extremely low settings due to quantization noise. We introduce a deep model series expansion framework to address this issue, enabling rapid and accurate approximation of unquantized models without calibration sets or fine-tuning. This is the first use of series expansion for neural network quantization. Specifically, our method expands the FP model into multiple low-bit basis models. To ensure accurate quantization, we develop low-bit basis model expansions at different granularities (tensor, layer, model), and theoretically confirm their convergence to the dense model, thus restoring FP model accuracy. Additionally, we design AbelianAdd/Mul operations between isomorphic models in the low-bit expansion, forming an Abelian group to ensure operation parallelism and commutativity. The experiments show that our algorithm achieves state-of-the-art performance in low-bit settings; for example, 4-bit quantization of ResNet-50 surpasses the original accuracy, reaching 77.03%. The code will be made public.

[292] Lossless Model Compression via Joint Low-Rank Factorization Optimization

Boyang Zhang, Daning Cheng, Yunquan Zhang, Fangming Liu, Jiake Tian

Main category: cs.LG

TL;DR: Novel joint optimization strategy for lossless low-rank weight factorization that enhances model performance beyond original models while compressing them, without requiring fine-tuning.

Details

Motivation: Traditional low-rank factorization minimizes approximation error but creates performance discrepancy due to separate optimization processes for compression and model performance, resulting in unavoidable losses.

Method: 1) Theoretical analysis of relationship between low-rank factorization and model optimization objectives, establishing perturbation range for factorization errors. 2) Reformulate as numerical rank deficiency problem with inequality constraints. 3) Develop joint objective addressing both factorization error and model performance. 4) Propose two optimization algorithms: lossless optimization (maximizes accuracy while ensuring compression) and compact optimization (minimizes model size while preserving performance).

Result: Methods demonstrate robust efficacy across vision and language tasks. Compressed models achieve lossless results without fine-tuning. Example: ResNext50 compressed by 70% outperforms original model.

Conclusion: First approach to enhance model performance beyond original through joint optimization of low-rank factorization, enabling direct compression of deep models to achieve lossless results without fine-tuning.

Abstract: Low-rank factorization is a popular model compression technique that minimizes the error $δ$ between approximated and original weight matrices. Despite achieving performances close to the original models when $δ$ is optimized, a performance discrepancy remains due to the separate optimization processes for low-rank factorization and model performance, resulting in unavoidable losses. We address this issue by introducing a novel joint optimization strategy for lossless low-rank weight factorization, which, for the first time, enhances the model’s performance beyond the original. Our approach begins with a theoretical analysis of the relationship between low-rank factorization and model optimization objectives, establishing a precise perturbation range for matrix factorization errors on model performance. This challenge is then reformulated as a numerical rank deficiency problem with inequality constraints and develop a joint objective that simultaneously addresses factorization error and model performance. Based on the above analysis, we propose two optimization algorithms: \textbf{a lossless optimization algorithm} that maximizes model accuracy while ensuring compression, and \textbf{a compact optimization algorithm} that minimizes model size while preserving performance. These algorithms do not require fine-tuning and can directly compress numerous deep models to achieve lossless results. Our methods demonstrate robust efficacy across various vision and language tasks. For example, the compressed model reduced by 70% on ResNext50 outperforms the original. Our code will be made public.

[293] Deep Learning for Spatio-Temporal Fusion in Land Surface Temperature Estimation: A Comprehensive Survey, Experimental Analysis, and Future Trends

Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai

Main category: cs.LG

TL;DR: A comprehensive review of deep learning-based spatio-temporal fusion methods specifically for Land Surface Temperature (LST), addressing the gap between surface reflectance-oriented models and thermal data requirements.

Details

Motivation: Current thermal infrared satellite sensors cannot achieve both high spatial and temporal resolution simultaneously. Existing spatio-temporal fusion techniques were primarily developed for surface reflectance data and don't adequately address LST-specific spatial and temporal variability, creating a research gap for thermal data applications.

Method: The study provides a formal mathematical definition of thermal fusion tasks, proposes a refined taxonomy of relevant deep learning methods, analyzes modifications needed to adapt surface reflectance models to LST, introduces a new dataset of 51 Terra MODIS-Landsat LST pairs (2013-2024), and evaluates representative models on thermal data.

Result: The analysis reveals performance gaps, architecture sensitivities, and open research challenges in applying deep learning-based spatio-temporal fusion to LST data. A new benchmark dataset is created and made publicly available for reproducibility.

Conclusion: The study provides a focused review and framework for deep learning-based spatio-temporal fusion of LST data, highlighting the need for specialized approaches that account for thermal-specific characteristics, and establishes a benchmark dataset to support future research in this domain.

Abstract: Land Surface Temperature (LST) plays a key role in climate monitoring, urban heat assessment, and land-atmosphere interactions. However, current thermal infrared satellite sensors cannot simultaneously achieve high spatial and temporal resolution. Spatio-temporal fusion (STF) techniques address this limitation by combining complementary satellite data, one with high spatial but low temporal resolution, and another with high temporal but low spatial resolution. Existing STF techniques, from classical models to modern deep learning (DL) architectures, were primarily developed for surface reflectance (SR). Their application to thermal data remains limited and often overlooks LST-specific spatial and temporal variability. This study provides a focused review of DL-based STF methods for LST. We present a formal mathematical definition of the thermal fusion task, propose a refined taxonomy of relevant DL methods, and analyze the modifications required when adapting SR-oriented models to LST. To support reproducibility and benchmarking, we introduce a new dataset comprising 51 Terra MODIS-Landsat LST pairs from 2013 to 2024, and evaluate representative models to explore their behavior on thermal data. The analysis highlights performance gaps, architecture sensitivities, and open research challenges. The dataset and accompanying resources are publicly available at https://github.com/Sofianebouaziz1/STF-LST.

[294] A General Error-Theoretical Analysis Framework for Constructing Compression Strategies

Boyang Zhang, Daning Cheng, Yunquan Zhang, Meiqi Tu, Fangming Liu, Jiake Tian

Main category: cs.LG

TL;DR: Proposes Compression Error Theory (CET) - a framework to determine optimal compression levels per layer by modeling quantization error geometrically and finding minimal performance loss subspaces.

Details

Motivation: Deep models face deployment challenges due to parameter growth and computational complexity. Different layers have varying tolerance to compression, creating a need for layer-specific compression allocation to minimize performance loss while maximizing parameter reduction.

Method: CET uses differential expansion and algebraic geometry to reconstruct quantization error as ellipsoids/hyperbolic paraboloids, defines error subspaces, performs orthogonal decomposition to transform optimization into a complementary problem, and identifies optimal subspaces along major axes.

Result: On ResNet-34, CET achieves nearly 11× parameter compression while maintaining or even surpassing original model performance, demonstrating effective compression with minimal degradation.

Conclusion: CET provides a theoretical framework for optimal layer-wise compression allocation, enabling significant parameter reduction with minimal performance loss through geometric analysis of compression errors.

Abstract: The exponential growth in parameter size and computational complexity of deep models poses significant challenges for efficient deployment. The core problem of existing compression methods is that different layers of the model have significant differences in their tolerance to compression levels. For instance, the first layer of a model can typically sustain a higher compression level compared to the last layer without compromising performance. Thus, the key challenge lies in how to allocate compression levels across layers in a way that minimizes performance loss while maximizing parameter reduction. To address this challenge, we propose a Compression Error Theory (CET) framework, designed to determine the optimal compression level for each layer. Taking quantization as an example, CET leverages differential expansion and algebraic geometry to reconstruct the quadratic form of quantization error as ellipsoids and hyperbolic paraboloids, and utilizes their geometric structures to define an error subspace. To identify the error subspace with minimal performance loss, by performing orthogonal decomposition of the geometric space, CET transforms the optimization process of the error subspace into a complementary problem. The final theoretical analysis shows that constructing the quantization subspace along the major axis results in minimal performance degradation. Through experimental verification of the theory, CET can greatly retain performance while compressing. Specifically, on the ResNet-34 model, CET achieves nearly 11$\times$ parameter compression while even surpassing performance comparable to the original model.

[295] GradMix: Gradient-based Selective Mixup for Robust Data Augmentation in Class-Incremental Learning

Minsu Kim, Seong-Hyeon Hwang, Steven Euijong Whang

Main category: cs.LG

TL;DR: GradMix is a gradient-based selective mixup method for continual learning that intelligently mixes only helpful class pairs to reduce catastrophic forgetting, outperforming random data augmentation approaches.

Details

Motivation: Existing experience replay methods in continual learning use data augmentation by mixing previous and current task data, but random mixing can actually harm previous task knowledge and increase catastrophic forgetting.

Method: GradMix performs gradient-based selective mixup using a class-based criterion to identify and mix only samples from helpful class pairs while avoiding detrimental class pairs that would cause forgetting.

Result: Experiments on various real datasets show that GradMix outperforms data augmentation baselines in accuracy by minimizing forgetting of previous knowledge in class-incremental learning scenarios.

Conclusion: Selective, gradient-based data augmentation is crucial for continual learning, as indiscriminate mixing can be harmful; GradMix provides an effective solution for reducing catastrophic forgetting through intelligent sample pairing.

Abstract: In the context of continual learning, acquiring new knowledge while maintaining previous knowledge presents a significant challenge. Existing methods often use experience replay techniques that store a small portion of previous task data for training. In experience replay approaches, data augmentation has emerged as a promising strategy to further improve the model performance by mixing limited previous task data with sufficient current task data. However, we theoretically and empirically analyze that training with mixed samples from random sample pairs may harm the knowledge of previous tasks and cause greater catastrophic forgetting. We then propose GradMix, a robust data augmentation method specifically designed for mitigating catastrophic forgetting in class-incremental learning. GradMix performs gradient-based selective mixup using a class-based criterion that mixes only samples from helpful class pairs and not from detrimental class pairs for reducing catastrophic forgetting. Our experiments on various real datasets show that GradMix outperforms data augmentation baselines in accuracy by minimizing the forgetting of previous knowledge.

[296] Heartcare Suite: A Unified Multimodal ECG Suite for Dual Signal-Image Modeling and Understanding

Yihan Xie, Sijing Li, Tianwei Lin, Zhuonan Wang, Chenglin Yang, Yu Zhong, Wenjie Yan, Wenqiao Zhang, Xiaogang Guo, Jun Xiao, Yueting Zhuang, Beng Chin Ooi

Main category: cs.LG

TL;DR: Heartcare Suite is a unified ECG suite for dual signal-image modeling that addresses cross-modal alignment challenges in medical multimodal LLMs through a high-quality dataset, systematic benchmark, and novel projection alignment mechanism.

Details

Motivation: ECG data's intrinsic forms and representational patterns create significant challenges for medical multimodal LLMs in achieving cross-modal semantic alignment, limiting their effectiveness in cardiovascular diagnosis and treatment.

Method: Three components: (1) Heartcare-400K - fine-grained ECG instruction dataset built using HeartAgent pipeline with 12,170 clinical reports; (2) Heartcare-Bench - systematic benchmark for multi-perspective ECG understanding; (3) HeartcareGPT - uses structure-aware discrete tokenizer Beat with DSPA paradigm (dual encoder projection alignment) for joint signal-image modeling.

Result: Heartcare achieves consistent improvements across diverse ECG understanding tasks, validating the unified modeling paradigm and high-quality data pipeline effectiveness, establishing foundation for extending Med-MLLMs to physiological signal domains.

Conclusion: The proposed Heartcare Suite successfully addresses ECG cross-modal alignment challenges through a comprehensive approach combining high-quality data, systematic evaluation, and novel dual-stream modeling, providing a methodological foundation for medical multimodal LLMs in physiological signal domains.

Abstract: Although electrocardiograms (ECG) play a dominant role in cardiovascular diagnosis and treatment, their intrinsic data forms and representational patterns pose significant challenges for medical multimodal large language models (Med-MLLMs) in achieving cross-modal semantic alignment. To address this gap, we propose Heartcare Suite, a unified ECG suite designed for dual signal-image modeling and understanding. (i) Heartcare-400K: We build a finegrained ECG instruction dataset on top of our data pipeline engine–HeartAgent–by integrating 12,170 high quality clinical ECG reports from top hospitals with open-source data; (ii) Heartcare-Bench: a systematic benchmark assessing performance of models in multi-perspective ECG understanding and cross-modal generalization, providing guidance for optimizing ECG comprehension models; (iii) HeartcareGPT: built upon a structure-aware discrete tokenizer Beat, we propose the DSPA (Dual Stream Projection Alignment) paradigm–a dual encoder projection alignment mechanism enabling joint optimizing and modeling native ECG signal-image within a shared feature space. Heartcare achieves consistent improvements across diverse ECG understanding tasks, validating both the effectiveness of the unified modeling paradigm and the necessity of a high-quality data pipeline, and establishing a methodological foundation for extending Med-MLLMs toward physiological signal domains. Our project is available at https://github.com/DCDmllm/Heartcare-Suite .

[297] Mixture of Experts in Large Language Models

Danyang Zhang, Junhao Song, Ziqian Bi, Xinyuan Song, Yingfang Yuan, Tianyang Wang, Joe Yeong, Junfeng Hao

Main category: cs.LG

TL;DR: This paper provides a comprehensive review of Mixture-of-Experts (MoE) architecture in large language models, analyzing its ability to enhance performance with minimal computational overhead, covering theoretical foundations, architectural designs, applications, and identifying key advantages and challenges.

Details

Motivation: To systematically review and analyze the Mixture-of-Experts architecture in large language models, examining its theoretical foundations, architectural designs, applications, advantages, and challenges to provide a comprehensive understanding of this important approach for scaling model capacity efficiently.

Method: Conducted a systematic analysis spanning theoretical foundations, core architectural designs, and LLM applications, examining expert gating/routing mechanisms, hierarchical/sparse MoE configurations, meta-learning approaches, multimodal/multitask learning scenarios, real-world deployment cases, and recent advances/challenges.

Result: Identified key advantages of MoE including superior model capacity compared to equivalent Bayesian approaches, improved task-specific performance, and efficient scaling of model capacity. Also highlighted the importance of expert diversity, accurate calibration, and reliable inference aggregation for maximizing MoE effectiveness.

Conclusion: The review outlines current research limitations, open challenges, and promising future directions, providing a foundation for continued innovation in MoE architecture and its applications in large language models and beyond.

Abstract: This paper presents a comprehensive review of the Mixture-of-Experts (MoE) architecture in large language models, highlighting its ability to significantly enhance model performance while maintaining minimal computational overhead. Through a systematic analysis spanning theoretical foundations, core architectural designs, and large language model (LLM) applications, we examine expert gating and routing mechanisms, hierarchical and sparse MoE configurations, meta-learning approaches, multimodal and multitask learning scenarios, real-world deployment cases, and recent advances and challenges in deep learning. Our analysis identifies key advantages of MoE, including superior model capacity compared to equivalent Bayesian approaches, improved task-specific performance, and the ability to scale model capacity efficiently. We also underscore the importance of ensuring expert diversity, accurate calibration, and reliable inference aggregation, as these are essential for maximizing the effectiveness of MoE architectures. Finally, this review outlines current research limitations, open challenges, and promising future directions, providing a foundation for continued innovation in MoE architecture and its applications.

[298] Learning Treatment Policies From Multimodal Electronic Health Records

Henri Arno, Thomas Demeester

Main category: cs.LG

TL;DR: Proposes a causal policy learning method for multimodal EHRs that uses expert annotations during training to supervise treatment effect estimation, enabling better treatment decisions than risk-based approaches.

Details

Motivation: Existing causal policy learning methods assume tabular covariates with strong causal assumptions that are violated in multimodal EHR settings (tabular + clinical text). Current practice uses risk-based policies that don't identify which patients benefit most from treatment.

Method: Extends causal policy learning using expert-provided annotations during training to supervise treatment effect estimation, while using only multimodal representations as input during inference.

Result: Achieves strong empirical performance across synthetic, semi-synthetic, and real-world EHR datasets, demonstrating practical applicability to realistic clinical data.

Conclusion: The proposed method offers practical insights for applying causal machine learning to multimodal clinical data, enabling better treatment policies than risk-based approaches.

Abstract: We study how to learn effective treatment policies from multimodal electronic health records (EHRs) that consist of tabular data and clinical text. These policies can help physicians make better treatment decisions and allocate healthcare resources more efficiently. Causal policy learning methods prioritize patients with the largest expected treatment benefit. Yet, existing estimators assume tabular covariates that satisfy strong causal assumptions, which are typically violated in the multimodal setting. As a result, predictive models of baseline risk are commonly used in practice to guide such decisions, as they extend naturally to multimodal data. However, such risk-based policies are not designed to identify which patients benefit most from treatment. We propose an extension of causal policy learning that uses expert-provided annotations during training to supervise treatment effect estimation, while using only multimodal representations as input during inference. We show that the proposed method achieves strong empirical performance across synthetic, semi-synthetic, and real-world EHR datasets, thereby offering practical insights into applying causal machine learning to realistic clinical data.

[299] Membership Inference Attack with Partial Features

Xurun Wang, Guangrui Liu, Xinjie Li, Haoyu He, Lin Yao, Zhongyun Hua, Weizhe Zhang

Main category: cs.LG

TL;DR: PFMI attack scenario where adversary only has partial features; MRAD framework reconstructs missing features using model memory and detects anomalies to infer membership.

Details

Motivation: Existing membership inference attacks assume full feature access, which is unrealistic in many real-world scenarios where only partial features are available, limiting practical applicability.

Method: MRAD (Memory-guided Reconstruction and Anomaly Detection) - two-stage framework: 1) reconstruct unknown features using target model’s latent memory, 2) use anomaly detection to measure deviation between reconstructed sample and training data distribution.

Result: MRAD is effective across various datasets, works in both white-box and black-box settings, and maintains compatibility with off-the-shelf anomaly detection. On STL-10, achieves AUC >0.75 even with 60% missing features.

Conclusion: PFMI is a practical attack scenario, and MRAD provides an effective solution that works with partial features, demonstrating vulnerability of ML models even when attackers have limited information.

Abstract: Machine learning models are vulnerable to membership inference attack, which can be used to determine whether a given sample appears in the training data. Most existing methods assume the attacker has full access to the features of the target sample. This assumption, however, does not hold in many real-world scenarios where only partial features are available, thereby limiting the applicability of these methods. In this work, we introduce Partial Feature Membership Inference (PFMI), a scenario where the adversary observes only partial features of each sample and aims to infer whether this observed subset was present in the training set. To address this problem, we propose MRAD (Memory-guided Reconstruction and Anomaly Detection), a two-stage attack framework that works in both white-box and black-box settings. In the first stage, MRAD leverages the latent memory of the target model to reconstruct the unknown features of the sample. We observe that when the known features are absent from the training set, the reconstructed sample deviates significantly from the true data distribution. Consequently, in the second stage, we use anomaly detection algorithms to measure the deviation between the reconstructed sample and the training data distribution, thereby determining whether the known features belong to a member of the training set. Empirical results demonstrate that MRAD is effective across various datasets, and maintains compatibility with off-the-shelf anomaly detection techniques. For example, on STL-10, our attack exceeds an AUC of around 0.75 even with 60% of the missing features.

[300] Learning a Neural Solver for Parametric PDE to Enhance Physics-Informed Methods

Lise Le Boudec, Emmanuel de Bezenac, Louis Serrano, Ramon Daniel Regueiro-Espino, Yuan Yin, Patrick Gallinari

Main category: cs.LG

TL;DR: Proposes learning a physics-informed iterative solver that conditions gradient descent to automatically adapt to each PDE instance, accelerating and stabilizing optimization while extending to parametric PDEs.

Details

Motivation: Physics-informed deep learning faces optimization challenges due to PDE complexity: large solution spaces, many iterations, unstable training, and ill-conditioning from differential terms in loss functions.

Method: Learns a solver using physics-informed iterative algorithm trained on data. Conditions gradient descent to adapt to each PDE instance. Integrates physical loss gradient with PDE parameters to handle parametric PDEs (coefficients, initial/boundary conditions).

Result: Demonstrates effectiveness through empirical experiments on multiple datasets, showing accelerated and stabilized optimization with faster convergence for both training and test-time optimization.

Conclusion: The approach significantly improves physics-informed deep learning optimization by learning adaptive solvers that handle parametric PDEs, overcoming traditional limitations of single-instance solutions.

Abstract: Physics-informed deep learning often faces optimization challenges due to the complexity of solving partial differential equations (PDEs), which involve exploring large solution spaces, require numerous iterations, and can lead to unstable training. These challenges arise particularly from the ill-conditioning of the optimization problem caused by the differential terms in the loss function. To address these issues, we propose learning a solver, i.e., solving PDEs using a physics-informed iterative algorithm trained on data. Our method learns to condition a gradient descent algorithm that automatically adapts to each PDE instance, significantly accelerating and stabilizing the optimization process and enabling faster convergence of physics-aware models. Furthermore, while traditional physics-informed methods solve for a single PDE instance, our approach extends to parametric PDEs. Specifically, we integrate the physical loss gradient with PDE parameters, allowing our method to solve over a distribution of PDE parameters, including coefficients, initial conditions, and boundary conditions. We demonstrate the effectiveness of our approach through empirical experiments on multiple datasets, comparing both training and test-time optimization performance. The code is available at https://github.com/2ailesB/neural-parametric-solver.

[301] Deep Learning and Machine Learning – Python Data Structures and Mathematics Fundamental: From Theory to Practice

Silin Chen, Ziqian Bi, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Yizhu Wen, Tianyang Wang, Ming Li, Jintao Ren, Qian Niu, Xinyuan Song, Ming Liu

Main category: cs.LG

TL;DR: A comprehensive textbook introducing machine learning and deep learning foundations, bridging theory with Python-based practical implementation for both beginners and advanced learners.

Details

Motivation: To bridge the gap between theoretical mathematics and practical application in ML/DL education, providing accessible learning materials that emphasize the critical role of mathematical principles in developing scalable AI solutions.

Method: Uses Python as the primary programming language to implement key algorithms and data structures, with practical examples and code throughout. Covers basic to advanced Python programming, fundamental mathematics, linear algebra, optimization techniques, neural networks, and real-world applications.

Result: A comprehensive educational resource that provides hands-on experience in applying theoretical knowledge to solve complex problems in ML, DL, and big data analytics, suitable for diverse learning levels.

Conclusion: The book successfully creates an integrated learning approach that combines mathematical foundations with practical Python implementation, preparing readers to develop scalable AI solutions and apply theoretical knowledge to real-world ML/DL problems.

Abstract: This book provides a comprehensive introduction to the foundational concepts of machine learning (ML) and deep learning (DL). It bridges the gap between theoretical mathematics and practical application, focusing on Python as the primary programming language for implementing key algorithms and data structures. The book covers a wide range of topics, including basic and advanced Python programming, fundamental mathematical operations, matrix operations, linear algebra, and optimization techniques crucial for training ML and DL models. Advanced subjects like neural networks, optimization algorithms, and frequency domain methods are also explored, along with real-world applications of large language models (LLMs) and artificial intelligence (AI) in big data management. Designed for both beginners and advanced learners, the book emphasizes the critical role of mathematical principles in developing scalable AI solutions. Practical examples and Python code are provided throughout, ensuring readers gain hands-on experience in applying theoretical knowledge to solve complex problems in ML, DL, and big data analytics.

[302] Toward Storage-Aware Learning with Compressed Data An Empirical Exploratory Study on JPEG

Kichang Lee, Songkuk Kim, JaeYeon Park, JeongGil Ko

Main category: cs.LG

TL;DR: Empirical study shows naive data dropping/compression is suboptimal for storage-aware learning; sample-wise adaptive compression is feasible and promising.

Details

Motivation: On-device ML faces storage constraints, especially with continuous data collection. Need to balance data quantity vs quality through compression strategies.

Method: Empirical study analyzing trade-offs between data quantity and quality via compression. Examines uniform data dropping, one-size-fits-all compression, and explores sample-wise compression sensitivity.

Result: Naive strategies (uniform dropping, fixed compression) are suboptimal. Data samples show varying sensitivities to compression, supporting feasibility of sample-wise adaptive compression.

Conclusion: Systematic characterization of storage-aware learning challenge provides foundation for new adaptive compression systems. Insights advance understanding of storage-quality trade-offs in on-device ML.

Abstract: On-device machine learning is often constrained by limited storage, particularly in continuous data collection scenarios. This paper presents an empirical study on storage-aware learning, focusing on the trade-off between data quantity and quality via compression. We demonstrate that naive strategies, such as uniform data dropping or one-size-fits-all compression, are suboptimal. Our findings further reveal that data samples exhibit varying sensitivities to compression, supporting the feasibility of a sample-wise adaptive compression strategy. These insights provide a foundation for developing a new class of storage-aware learning systems. The primary contribution of this work is the systematic characterization of this under-explored challenge, offering valuable insights that advance the understanding of storage-aware learning.

[303] Scalable Temporal Anomaly Causality Discovery in Large Systems: Achieving Computational Efficiency with Binary Anomaly Flag Data

Mulugeta Weldezgina Asres, Christian Walter Omlin, The CMS-HCAL Collaboration

Main category: cs.LG

TL;DR: AnomalyCD is a causal discovery method for binary anomaly data that improves computational efficiency and accuracy in learning graphical causal models from temporal binary alarm flags.

Details

Motivation: Existing causal discovery methods face computational challenges for real-time large-scale deployments and struggle with binary anomaly data characteristics (state transitions and sparsity), limiting their applicability in modern monitoring systems.

Method: Proposes AnomalyCD with anomaly data-aware causality testing, sparse data and prior link compression, and edge pruning adjustment approaches to handle binary anomaly data efficiently.

Result: Demonstrates significant computation overhead reduction and moderate accuracy improvement on binary anomaly datasets, validated on CERN’s Compact Muon Solenoid experiment data and IT monitoring system data.

Conclusion: AnomalyCD addresses computational and accuracy challenges in learning causal models from temporal binary anomaly data, making causal discovery more practical for real-time large-scale monitoring systems.

Abstract: Extracting anomaly causality facilitates diagnostics once monitoring systems detect system faults. Identifying anomaly causes in large systems involves investigating a broader set of monitoring variables across multiple subsystems. However, learning graphical causal models (GCMs) comes with a significant computational burden that restrains the applicability of most existing methods in real-time and large-scale deployments. In addition, modern monitoring applications for large systems often generate large amounts of binary alarm flags, and the distinct characteristics of binary anomaly data – the meaning of state transition and data sparsity – challenge existing causality learning mechanisms. This study proposes an anomaly causal discovery approach (AnomalyCD), addressing the accuracy and computational challenges of generating GCMs from temporal binary flag datasets. The AnomalyCD presents several strategies, such as anomaly data-aware causality testing, sparse data and prior link compression, and edge pruning adjustment approaches. We validate the performance of the approach on two datasets: monitoring sensor data from the readout-box system of the Compact Muon Solenoid experiment at CERN, and a public dataset from an information technology monitoring system. The results on temporal GCMs demonstrate a considerable reduction of computation overhead and a moderate enhancement of accuracy on the binary anomaly datasets Source code: https://github.com/muleina/AnomalyCD .

[304] FedMeld: A Model-dispersal Federated Learning Framework for Space-ground Integrated Networks

Qian Chen, Xianhao Chen, Kaibin Huang

Main category: cs.LG

TL;DR: FedMeld: Infrastructure-free federated learning for space-ground integrated networks using satellite movement patterns and store-carry-forward capabilities to enable parameter mixing across large regions without ground stations or inter-satellite links.

Details

Motivation: To bridge the digital divide by delivering AI services globally through SGINs, but existing space-ground integrated FL frameworks require ground stations or costly inter-satellite links, leading to excessive training latency and communication costs.

Method: Proposes FedMeld framework using model dispersal strategy that exploits periodic satellite movement patterns and store-carry-forward capabilities for parameter mixing. Formulates joint optimization problem for staleness control and mixing ratio (SC-MR), decomposes it into sequential subproblems, and derives closed-form solutions for round interval and semi-closed form for mixing ratio.

Result: Theoretical analysis shows FedMeld leads to global model convergence. Experiments with various datasets demonstrate FedMeld achieves superior model accuracy while significantly reducing communication costs compared to traditional FL schemes for SGINs.

Conclusion: FedMeld provides an infrastructure-free FL framework for SGINs that achieves optimal latency-accuracy tradeoff by leveraging satellite movement patterns, enabling efficient global-scale federated learning without costly infrastructure requirements.

Abstract: To bridge the digital divide, space-ground integrated networks (SGINs) are expected to deliver artificial intelligence (AI) services to every corner of the world. One key mission of SGINs is to support federated learning (FL) at a global scale. However, existing space-ground integrated FL frameworks involve ground stations or costly inter-satellite links, entailing excessive training latency and communication costs. To overcome these limitations, we propose an infrastructure-free federated learning framework based on a model dispersal (FedMeld) strategy, which exploits periodic movement patterns and store-carry-forward capabilities of satellites to enable parameter mixing across large-scale geographical regions. We theoretically show that FedMeld leads to global model convergence and quantify the effects of round interval and mixing ratio between adjacent areas on its learning performance. Based on the theoretical results, we formulate a joint optimization problem to design the staleness control and mixing ratio (SC-MR) for minimizing the training loss. By decomposing the problem into sequential SC and MR subproblems without compromising the optimality, we derive the round interval solution in a closed form and the mixing ratio in a semi-closed form to achieve the optimal latency-accuracy tradeoff. Experiments using various datasets demonstrate that FedMeld achieves superior model accuracy while significantly reducing communication costs as compared with traditional FL schemes for SGINs.

[305] ORACLE: Explaining Feature Interactions in Neural Networks with ANOVA

Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh

Main category: cs.LG

TL;DR: ORACLE is a framework for explaining neural networks on tabular data using orthogonal factorial surrogates to extract main effects and pairwise interactions, outperforming SHAP methods on synthetic benchmarks.

Details

Motivation: Need for interpretable explanations of neural networks on tabular data that align with classical design-of-experiments practice, providing stable, comparable interaction summaries for scientific and engineering workflows.

Method: Treats neural network as black-box, discretizes inputs onto grid, fits orthogonal factorial (ANOVA-style) surrogate via L² projection, then applies centering and μ-rebalancing to extract main- and interaction-effect tables.

Result: ORACLE more accurately recovers ground-truth interaction structure than Monte Carlo SHAP methods on synthetic benchmarks and tabular regression tasks, measured by ranking, localization, and cross-backbone stability.

Conclusion: ORACLE is particularly effective for features with interpretable factorial structure, making it well-suited for scientific workflows requiring stable, DoE-style interaction summaries, though its scope is clarified for latent image/text settings.

Abstract: We introduce ORACLE, a framework for explaining neural networks on tabular data and scientific factorial designs. ORACLE summarizes a trained network’s prediction surface with main effects and pairwise interactions by treating the network as a black-box response, discretizing the inputs onto a grid, and fitting an orthogonal factorial (ANOVA-style) surrogate – the $L^2$ orthogonal projection of the model response onto a finite-dimensional factorial subspace. A simple centering and $μ$-rebalancing step then expresses this surrogate as main- and interaction-effect tables that remain faithful to the original model in the $L^2$ sense. The resulting grid-based interaction maps are easy to visualize, comparable across backbones, and directly aligned with classical design-of-experiments practice. On synthetic factorial benchmarks and low- to medium-dimensional tabular regression tasks, ORACLE more accurately recovers ground-truth interaction structure and hotspots than Monte Carlo SHAP-family interaction methods, as measured by ranking, localization, and cross-backbone stability. In latent image and text settings, ORACLE clarifies its scope: grid-based factorial surrogates are most effective when features admit an interpretable factorial structure, making ORACLE particularly well-suited to scientific and engineering workflows that require stable, DoE-style interaction summaries.

[306] PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch

Abhishek Ghosh, Ajay Nayak, Ashish Panwar, Arkaprava Basu

Main category: cs.LG

TL;DR: PyGraph is a compiler framework that automatically optimizes ML workloads for CUDA Graphs, addressing launch latency bottlenecks through code transformations, parameter copy elimination, and selective deployment.

Details

Motivation: GPU compute throughput has grown rapidly, but CPU-side launch latency of hundreds to thousands of short-running GPU kernels per iteration has become a bottleneck. While CUDA Graphs promise to address this by replaying kernels with a single dispatch, they remain difficult to deploy correctly and efficiently in ML workloads.

Method: PyGraph introduces three novel optimizations: 1) automatic code transformations to make ML applications amenable to CUDA Graphs, 2) elimination of parameter copy overheads for kernels executing in CUDA Graphs, and 3) selective deployment of CUDA Graphs guided by cost-benefit analysis. It’s built atop PyTorch2’s compilation framework and requires no programmer intervention.

Result: For 25 ML workloads from TorchBench, HuggingFace, and TIMM, PyGraph more than doubles the benefit from deploying CUDA Graph compared to the most popular and widely used ML compiler, PyTorch2.

Conclusion: PyGraph successfully addresses the CUDA Graph deployment challenges in ML workloads through automated compiler optimizations, significantly improving performance without requiring programmer intervention, making CUDA Graphs more accessible and effective for accelerating ML computations.

Abstract: Machine learning (ML) workloads launch hundreds to thousands of short-running GPU kernels per iteration. With GPU compute throughput growing rapidly, CPU-side launch latency of kernels is emerging as a bottleneck. CUDA Graphs promise to address this by replaying a set of kernels with a single dispatch of the graph, removing per-kernel launch costs. However, CUDA Graphs remain surprisingly difficult to deploy correctly and efficiently. We present PyGraph - a compiler framework to maximize the coverage and benefits of CUDA Graphs for ML workloads. It introduces three novel optimizations: it applies automatic code transformations to make ML applications amenable to CUDA Graphs; it eliminates the parameter copy overheads for kernels executing in CUDA Graphs, and it selectively deploys CUDA Graphs guided by a cost-benefit analysis. For 25 ML workloads from TorchBench, HuggingFace, and TIMM, PyGraph more than doubles the benefit from deploying CUDA Graph compared to the most popular and widely used ML compiler, PyTorch2. PyGraph is built atop PyTorch2’s compilation framework and requires no programmer intervention.

[307] Stochastic activations

Maria Lomeli, Matthijs Douze, Gergely Szilvasy, Loic Cabannes, Jade Copet, Sainbayar Sukhbaatar, Jason Weston, Gabriel Synnaeve, Pierre-Emmanuel Mazaré, Hervé Jégou

Main category: cs.LG

TL;DR: Stochastic activations randomly choose between SILU or RELU in LLM feed-forward layers, solving RELU’s gradient flow issues while enabling sparse inference and controlled text diversity.

Details

Motivation: To address RELU's optimization problem (constant shape for negative inputs that prevents gradient flow) while enabling benefits like sparse inference and controlled text generation diversity.

Method: Randomly select between SILU or RELU activations via Bernoulli draws in feed-forward layers. Two applications: (1) pre-train with stochastic activations, fine-tune with RELU for sparse inference; (2) use stochastic activations directly for text generation.

Result: (1) Better results than training from scratch with RELU, with reduced inference FLOPs and CPU speedup; (2) Reasonable generation performance, slightly inferior to best deterministic non-linearity (SILU with temperature scaling), but offers controlled diversity.

Conclusion: Stochastic activations provide an effective solution to RELU’s optimization issues while enabling sparse inference benefits and offering a novel approach for controlled text generation diversity.

Abstract: We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU or RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, the constant shape for negative inputs that prevents the gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup in the CPU. Interestingly, this leads to much better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for generation. This strategy performs reasonably well: it is only slightly inferior to the best deterministic non-linearity, namely SILU combined with temperature scaling. This offers an alternative to existing strategies by providing a controlled way to increase the diversity of the generated text.

[308] Categorical Equivariant Deep Learning: Category-Equivariant Neural Networks and Universal Approximation Theorems

Yoshihiro Maruyama

Main category: cs.LG

TL;DR: Categorical Equivariant Neural Networks (CENNs) unify various equivariant architectures (group/groupoid, poset/lattice, graph, sheaf networks) using category theory, proving universal approximation for continuous equivariant transformations.

Details

Motivation: To unify and generalize equivariant neural networks beyond group symmetries, encompassing geometric, contextual, and compositional symmetries through a categorical framework.

Method: Develop theory of CENNs using topological categories with Radon measures, formulate linear/nonlinear layers categorically, prove equivariant universal approximation theorem for continuous equivariant transformations.

Result: Proved that finite-depth CENNs are dense in space of continuous equivariant transformations, providing systematic universal approximation theorems for groups/groupoids, posets/lattices, graphs, and cellular sheaves.

Conclusion: Categorical equivariant deep learning expands equivariant learning beyond group actions to include contextual and compositional symmetries, providing unified theoretical foundation.

Abstract: We develop a theory of category-equivariant neural networks (CENNs) that unifies group/groupoid-equivariant networks, poset/lattice-equivariant networks, graph and sheaf neural networks. Equivariance is formulated as naturality in a topological category with Radon measures. Formulating linear and nonlinear layers in the categorical setup, we prove the equivariant universal approximation theorem in the general setting: the class of finite-depth CENNs is dense in the space of continuous equivariant transformations. We instantiate the framework for groups/groupoids, posets/lattices, graphs and cellular sheaves, deriving universal approximation theorems for them in a systematic manner. Categorical equivariant deep learning thus allows us to expand the horizons of equivariant deep learning beyond group actions, encompassing not only geometric symmetries but also contextual and compositional symmetries.

[309] Training Deep Morphological Neural Networks as Universal Approximators

Konstantinos Fotopoulos, Petros Maragos

Main category: cs.LG

TL;DR: This paper investigates deep morphological neural networks (DMNNs), showing that “linear” activations are essential despite their non-linearity. The authors propose constrained architectures to preserve sparsity, improve generalization via residual connections and weight dropout, and demonstrate successful training of DMNNs under these constraints. They also show morphological layers accelerate convergence in hybrid networks.

Details

Motivation: The motivation is to investigate deep morphological neural networks (DMNNs) and address their training challenges. Despite their inherent non-linearity, the authors recognize that "linear" activations are essential for DMNNs, and they aim to preserve the inherent sparsity of these networks while improving their generalization ability.

Method: The authors propose two constrained architectures: 1) where the majority of parameters should be part of morphological operations, and 2) where the majority of learnable parameters should be part of morphological operations. They improve generalization via residual connections and weight dropout. They also propose a hybrid network architecture combining linear and morphological layers.

Result: The proposed networks can be successfully trained and are more prunable than linear networks. The authors claim to be the first to successfully train DMNNs under such constraints. Empirically, they show that inclusion of morphological layers significantly accelerates the convergence of gradient descent with large batches in hybrid networks.

Conclusion: Deep morphological neural networks can be effectively trained with constrained architectures that preserve sparsity. The combination of morphological and linear layers in hybrid architectures offers practical benefits, particularly in accelerating convergence with large batch sizes, making DMNNs a viable approach for neural network design.

Abstract: We investigate deep morphological neural networks (DMNNs). We demonstrate that despite their inherent non-linearity, “linear” activations are essential for DMNNs. To preserve their inherent sparsity, we propose architectures that constraint the parameters of the “linear” activations: For the first (resp. second) architecture, we work under the constraint that the majority of parameters (resp. learnable parameters) should be part of morphological operations. We improve the generalization ability of our networks via residual connections and weight dropout. Our proposed networks can be successfully trained, and are more prunable than linear networks. To the best of our knowledge, we are the first to successfully train DMNNs under such constraints. Finally, we propose a hybrid network architecture combining linear and morphological layers, showing empirically that the inclusion of morphological layers significantly accelerates the convergence of gradient descent with large batches.

[310] PATCH: Learnable Tile-level Hybrid Sparsity for LLMs

Younes Hourri, Mohammad Mozaffari, Maryam Mehri Dehnavi

Main category: cs.LG

TL;DR: PATCH introduces a hybrid sparsity framework that enables continuous sparsity ratios (0-50%) by partitioning weight matrices into tiles that can be either dense or 2:4 sparse, bridging the gap between accuracy-preserving unstructured sparsity and hardware-friendly 2:4 sparsity.

Details

Motivation: LLMs have prohibitive memory and compute costs at deployment. Existing pruning approaches face tradeoffs: unstructured sparsity preserves accuracy but prevents GPU acceleration due to irregular access patterns, while semi-structured 2:4 sparsity is hardware-friendly but degrades model quality with its rigid 50% pattern.

Method: PATCH partitions weight matrices into tiles and assigns each tile to be either dense or 2:4 sparse using a learnable mask selection mechanism. This enables continuous sparsity ratios between 0% and 50% and supports non-uniform sparsity across layers for fine-grained accuracy-acceleration tradeoffs.

Result: Across models from 0.5B to 8B parameters, PATCH consistently narrows the gap to dense accuracy while delivering practical speedups. On LLaMA-2 7B with an A6000 GPU, PATCH achieves 1.18x-1.38x end-to-end speedup over dense baselines while improving accuracy by 0.37%-2.96% compared to state-of-the-art 2:4 pruning method MaskLLM.

Conclusion: PATCH bridges the gap between accuracy and hardware efficiency in LLM pruning by enabling continuous sparsity ratios through a hybrid tile-based approach, achieving both practical speedups and improved accuracy compared to existing methods.

Abstract: Large language models (LLMs) deliver impressive performance but incur prohibitive memory and compute costs at deployment. Model pruning is an effective way to reduce these overheads, yet existing approaches face challenges: unstructured sparsity, where nonzeros can appear anywhere, preserves accuracy but yields irregular access patterns that prevent GPU acceleration, while semi-structured 2:4 sparsity is hardware-friendly but enforces a rigid 50% pattern that degrades model quality. To bridge this gap, we introduce PATCH, a hybrid sparsity framework that enables a continuous sparsity ratio between 0% and 50%. PATCH partitions weight matrices into tiles, assigning each tile to be either dense or 2:4 sparse via a learnable mask selection mechanism. This design provides fine-grained control over accuracy-acceleration tradeoffs and supports non-uniform sparsity across layers, leading to superior overall quality. Across models from 0.5B to 8B parameters, PATCH consistently narrows the gap to dense accuracy while delivering practical speedups. For instance, on LLaMA-2 7B with an A6000 GPU, PATCH achieves 1.18x-1.38x end-to-end speedup over dense baselines while improving accuracy by 0.37%-2.96% compared to the state-of-the-art 2:4 pruning method, MaskLLM.

[311] Machine Unlearning in the Era of Quantum Machine Learning: An Empirical Study

Carla Crivoi, Radu Tudor Ionescu

Main category: cs.LG

TL;DR: First comprehensive empirical study of machine unlearning in hybrid quantum-classical neural networks, showing quantum models can support effective unlearning with performance dependent on circuit depth, entanglement, and task complexity.

Details

Motivation: Machine unlearning has been extensively explored in classical deep learning, but its behavior in variational quantum circuits and quantum-augmented architectures remains largely unexplored, creating a gap as quantum machine learning systems expand in scale and capability.

Method: Adapted a broad suite of unlearning methods to quantum settings (gradient-based, distillation-based, regularization-based, certified techniques) and introduced two new unlearning strategies tailored to hybrid models. Experiments conducted across Iris, MNIST, and Fashion-MNIST datasets under subset removal and full-class deletion scenarios.

Result: Quantum models can support effective unlearning, but outcomes strongly depend on circuit depth, entanglement structure, and task complexity. Shallow VQCs show high intrinsic stability with minimal memorization, while deeper hybrid models exhibit stronger trade-offs between utility, forgetting strength, and alignment with retrain oracle. Methods like EU-k, LCA, and Certified Unlearning consistently provide the best balance across metrics.

Conclusion: Establishes baseline empirical insights into quantum machine unlearning and highlights the need for quantum-aware algorithms and theoretical guarantees as quantum machine learning systems continue to expand in scale and capability.

Abstract: We present the first comprehensive empirical study of machine unlearning (MU) in hybrid quantum-classical neural networks. While MU has been extensively explored in classical deep learning, its behavior within variational quantum circuits (VQCs) and quantum-augmented architectures remains largely unexplored. First, we adapt a broad suite of unlearning methods to quantum settings, including gradient-based, distillation-based, regularization-based and certified techniques. Second, we introduce two new unlearning strategies tailored to hybrid models. Experiments across Iris, MNIST, and Fashion-MNIST, under both subset removal and full-class deletion, reveal that quantum models can support effective unlearning, but outcomes depend strongly on circuit depth, entanglement structure, and task complexity. Shallow VQCs display high intrinsic stability with minimal memorization, whereas deeper hybrid models exhibit stronger trade-offs between utility, forgetting strength, and alignment with retrain oracle. We find that certain methods, e.g. EU-k, LCA, and Certified Unlearning, consistently provide the best balance across metrics. These findings establish baseline empirical insights into quantum machine unlearning and highlight the need for quantum-aware algorithms and theoretical guarantees, as quantum machine learning systems continue to expand in scale and capability. We publicly release our code at: https://github.com/CrivoiCarla/HQML.

[312] Multi-Scale Harmonic Encoding for Feature-Wise Graph Message Passing

Longlong Li, Mengyang Zhao, Guanghui Wang, Cunquan Qu

Main category: cs.LG

TL;DR: MSH-GNN is a frequency-aware GNN that performs feature-wise adaptive propagation using node-conditioned feature subspaces and multi-scale harmonic modulations to capture both smooth and oscillatory structural patterns.

Details

Motivation: Traditional GNNs treat node embeddings as holistic feature vectors with uniform relevance across dimensions, limiting their ability to selectively transmit informative components when graph structures exhibit distinct frequency characteristics.

Method: Proposes MSH-GNN with: 1) feature-wise adaptive propagation where nodes project incoming messages onto node-conditioned feature subspaces, 2) learnable multi-scale harmonic modulations to capture smooth and oscillatory patterns, and 3) frequency-aware attention pooling for graph-level readout.

Result: MSH-GNN matches the expressive power of the 1-WL test and shows consistent improvements over state-of-the-art methods on node- and graph-level benchmarks, particularly in joint structure-frequency analysis tasks.

Conclusion: MSH-GNN provides an effective frequency-aware message passing framework that enables selective extraction of frequency-relevant components and captures diverse structural patterns through harmonic modulations.

Abstract: Most Graph Neural Networks (GNNs) propagate messages by treating node embeddings as holistic feature vectors, implicitly assuming uniform relevance across feature dimensions. This limits their ability to selectively transmit informative components, especially when graph structures exhibit distinct frequency characteristics. We propose MSH-GNN (Multi-Scale Harmonic Graph Neural Network), a frequency-aware message passing framework that performs feature-wise adaptive propagation. Each node projects incoming messages onto node-conditioned feature subspaces derived from its own representation, enabling selective extraction of frequency-relevant components. Learnable multi-scale harmonic modulations further allow the model to capture both smooth and oscillatory structural patterns. A frequency-aware attention pooling mechanism is introduced for graph-level readout. We show that MSH-GNN admits an interpretation as a learnable Fourier-feature approximation of kernelized message functions and matches the expressive power of the 1-Weisfeiler-Lehman (1-WL) test. Extensive experiments on node- and graph-level benchmarks demonstrate consistent improvements over state-of-the-art methods, particularly in joint structure-frequency analysis tasks.

[313] Diffusion Self-Weighted Guidance for Offline Reinforcement Learning

Augusto Tagle, Javier Ruiz-del-Solar, Felipe Tobar

Main category: cs.LG

TL;DR: Proposes Self-Weighted Guidance (SWG), a diffusion-based offline RL method that jointly models actions and weights, eliminating the need for separate critic networks.

Details

Motivation: Offline RL methods using diffusion models face challenges in computing required scores due to dependence on unknown weight functions that critique behavior policies. Existing approaches struggle with this computational difficulty.

Method: Constructs a diffusion process over both actions and weights, enabling direct score computation from the diffusion model without learning extra networks. Introduces Self-Weighted Guidance (SWG) where guidance comes from the same diffusion model.

Result: SWG successfully generates samples from desired distributions on toy examples and performs competitively with state-of-the-art methods on D4RL benchmark environments while maintaining simpler training.

Conclusion: SWG provides an effective solution to the weight computation problem in diffusion-based offline RL, offering streamlined training and competitive performance without additional network complexity.

Abstract: Offline reinforcement learning (RL) recovers the optimal policy $π$ given historical observations of an agent. In practice, $π$ is modeled as a weighted version of the agent’s behavior policy $μ$, using a weight function $w$ working as a critic of the agent’s behavior. Though recent approaches to offline RL based on diffusion models have exhibited promising results, the computation of the required scores is challenging due to their dependence on the unknown $w$. In this work, we alleviate this issue by constructing a diffusion over both the actions and the weights. With the proposed setting, the required scores are directly obtained from the diffusion model without learning extra networks. Our main conceptual contribution is a novel guidance method, where guidance (which is a function of $w$) comes from the same diffusion model, therefore, our proposal is termed Self-Weighted Guidance (SWG). We show that SWG generates samples from the desired distribution on toy examples and performs on par with state-of-the-art methods on D4RL’s challenging environments, while maintaining a streamlined training pipeline. We further validate SWG through ablation studies on weight formulations and scalability.

[314] Position: Federated Foundation Language Model Post-Training Should Focus on Open-Source Models

Nikita Agrawal, Simon Mertel, Ruben Mayer

Main category: cs.LG

TL;DR: This position paper argues against using black-box foundation language models in federated learning post-training, claiming it contradicts core FL principles like data privacy and autonomy.

Details

Motivation: The motivation is to critically analyze the problematic adoption of black-box models in federated post-training, which has become popular but may violate fundamental FL principles despite its success in centralized settings.

Method: The paper takes a position paper approach, providing critical analysis of black-box model usage in federated post-training and examining various aspects of openness and their implications for FL systems.

Result: The analysis reveals that using black-box models in FL contradicts core federation principles like data privacy and autonomy, raising concerns about blindly replicating centralized approaches in federated settings.

Conclusion: The paper concludes that black-box models are fundamentally incompatible with FL principles and calls for more careful consideration of openness aspects when designing federated post-training systems.

Abstract: Post-training of foundation language models has emerged as a promising research domain in federated learning (FL) with the goal to enable privacy-preserving model improvements and adaptations to user’s downstream tasks. Recent advances in this area adopt centralized post-training approaches that build upon black-box foundation language models where there is no access to model weights and architecture details. Although the use of black-box models has been successful in centralized post-training, their blind replication in FL raises several concerns. Our position is that using black-box models in FL contradicts the core principles of federation such as data privacy and autonomy. In this position paper, we critically analyze the usage of black-box models in federated post-training, and provide a detailed account of various aspects of openness and their implications for FL.

[315] mLaSDI: Multi-stage latent space dynamics identification

William Anderson, Seung Whan Chung, Robert Stephany, Youngsoo Choi

Main category: cs.LG

TL;DR: mLaSDI improves LaSDI by using multi-stage training with residual decoders to better capture high-frequency phenomena while maintaining interpretable latent dynamics.

Details

Motivation: Standard LaSDI has competing objectives between data reconstruction and latent dynamics learning, limiting accuracy for complex/high-frequency phenomena. Need better ROM framework that maintains interpretability while improving accuracy.

Method: Multi-stage LaSDI with sequential training: initial autoencoder followed by additional decoders that map latent trajectories to residuals from previous stages. Uses staged residual learning with periodic activation functions.

Result: Significantly lower reconstruction and prediction errors (often by order of magnitude), less training time, reduced hyperparameter tuning compared to standard LaSDI. Tested on multiscale oscillating system, unsteady wake flow, and 1D-1V Vlasov equation.

Conclusion: mLaSDI effectively addresses LaSDI’s limitations by enabling recovery of high-frequency content without sacrificing interpretability, making it a superior ROM framework for complex PDE problems.

Abstract: Accurately solving partial differential equations (PDEs) is essential across many scientific disciplines. However, high-fidelity solvers can be computationally prohibitive, motivating the development of reduced-order models (ROMs). Recently, Latent Space Dynamics Identification (LaSDI) was proposed as a data-driven, non-intrusive ROM framework. LaSDI compresses the training data via an autoencoder and learns user-specified ordinary differential equations (ODEs), governing the latent dynamics, enabling rapid predictions for unseen parameters. While LaSDI has produced effective ROMs for numerous problems, the autoencoder must simultaneously reconstruct the training data and satisfy the imposed latent dynamics, which are often competing objectives that limit accuracy, particularly for complex or high-frequency phenomena. To address this limitation, we propose multi-stage Latent Space Dynamics Identification (mLaSDI). With mLaSDI, we train LaSDI sequentially in stages. After training the initial autoencoder, we train additional decoders which map the latent trajectories to residuals from previous stages. This staged residual learning, combined with periodic activation functions, enables recovery of high-frequency content without sacrificing interpretability of the latent dynamics. Numerical experiments on a multiscale oscillating system, unsteady wake flow, and the 1D-1V Vlasov equation demonstrate that mLaSDI achieves significantly lower reconstruction and prediction errors, often by an order of magnitude, while requiring less training time and reduced hyperparameter tuning compared to standard LaSDI.

[316] Efficient Low Rank Attention for Long-Context Inference in Large Language Models

Tenghui Li, Guoxu Zhou, Xuyang Zhao, Yuning Qiu, Qibin Zhao

Main category: cs.LG

TL;DR: LRQK is a two-stage framework that reduces KV cache memory in LLMs by decomposing query/key matrices into low-rank factors and using mixed GPU-CPU caching with hit-and-miss mechanism, achieving significant memory savings with minimal accuracy loss.

Details

Motivation: The KV cache in LLMs imposes prohibitive GPU memory costs for long-context inference, limiting deployment on resource-constrained devices. Existing approaches like KV quantization and pruning reduce memory but suffer from precision loss or suboptimal KV pair retention.

Method: Two-stage framework: 1) Decomposes full-precision query and key matrices into compact rank-r factors during prefill stage, 2) Uses low-dimensional projections to compute proxy attention scores in O(lr) time at decode steps, 3) Selects top-k tokens and recent tokens, 4) Implements mixed GPU-CPU cache with hit-and-miss mechanism where only missing full-precision KV pairs are transferred.

Result: Extensive experiments on RULER and LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B show LRQK matches or surpasses leading sparse-attention methods in long-context settings while delivering significant memory savings with minimal accuracy loss.

Conclusion: LRQK effectively addresses KV cache memory limitations by combining low-rank decomposition with intelligent caching, preserving exact attention outputs while reducing CPU-GPU data movement, making long-context inference more feasible on resource-constrained devices.

Abstract: As the length of input text increases, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource constrained devices. Existing approaches, such as KV quantization and pruning, reduce memory usage but suffer from numerical precision loss or suboptimal retention of key-value pairs. In this work, Low Rank Query and Key attention (LRQK) is introduced, a two-stage framework that jointly decomposes full-precision query and key matrices into compact rank-(r) factors during the prefill stage, and then employs these low-dimensional projections to compute proxy attention scores in (\mathcal{O}(lr)) time at each decode step. By selecting only the top-(k) tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU-CPU cache with a hit-and-miss mechanism where only missing full-precision KV pairs are transferred, thereby preserving exact attention outputs while reducing CPU-GPU data movement. Extensive experiments on the RULER and LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B demonstrate that LRQK matches or surpasses leading sparse-attention methods in long context settings, while delivering significant memory savings with minimal accuracy loss. Our code is available at https://github.com/tenghuilee/LRQK.

[317] C3RL: Rethinking the Combination of Channel-independence and Channel-mixing from Representation Learning

Shusen Ma, Yun-Bo Zhao, Yu Kang

Main category: cs.LG

TL;DR: C3RL is a representation learning framework that jointly models channel-mixing and channel-independence strategies for multivariate time series forecasting using contrastive learning and siamese networks.

Details

Motivation: Existing multivariate time series forecasting approaches have limitations: channel-mixing strategies capture inter-variable dependencies but miss variable-specific patterns, while channel-independence strategies improve variable-specific patterns but fail to fully exploit cross-variable dependencies. Hybrid strategies based on feature fusion offer limited generalization and interpretability.

Method: C3RL treats the inputs of channel-mixing and channel-independence strategies as transposed views and builds a siamese network architecture. One strategy serves as the backbone while the other complements it. The framework jointly optimizes contrastive and prediction losses with adaptive weighting to balance representation and forecasting performance.

Result: Extensive experiments on seven models show C3RL boosts the best-case performance rate to 81.4% for models based on CI strategy and to 76.3% for models based on CM strategy, demonstrating strong generalization and effectiveness.

Conclusion: C3RL effectively addresses the limitations of existing strategies by jointly modeling both channel-mixing and channel-independence approaches through contrastive learning, achieving improved performance and generalization in multivariate time series forecasting.

Abstract: Multivariate time series forecasting has drawn increasing attention due to its practical importance. Existing approaches typically adopt either channel-mixing (CM) or channel-independence (CI) strategies. CM strategy can capture inter-variable dependencies but fails to discern variable-specific temporal patterns. CI strategy improves this aspect but fails to fully exploit cross-variable dependencies like CM. Hybrid strategies based on feature fusion offer limited generalization and interpretability. To address these issues, we propose C3RL, a novel representation learning framework that jointly models both CM and CI strategies. Motivated by contrastive learning in computer vision, C3RL treats the inputs of the two strategies as transposed views and builds a siamese network architecture: one strategy serves as the backbone, while the other complements it. By jointly optimizing contrastive and prediction losses with adaptive weighting, C3RL balances representation and forecasting performance. Extensive experiments on seven models show that C3RL boosts the best-case performance rate to 81.4% for models based on CI strategy and to 76.3% for models based on CM strategy, demonstrating strong generalization and effectiveness.

[318] Environment Scaling for Interactive Agentic Experience Collection: A Survey

Yuchen Huang, Sijia Li, Minghao Liu, Wei Liu, Shijue Huang, Zhiyuan Fan, Hou Pong Chan, Yi R. Fung

Main category: cs.LG

TL;DR: Survey paper proposing Generation-Execution-Feedback (GEF) loop framework for scaling AI agent environments to enable better reinforcement learning through more complex, realistic, and interactive experiences.

Details

Motivation: Current static datasets for LLM-based agents are insufficient for developing adaptive behavior and long-term decision-making capabilities. They are costly to build, lack dynamism and realism, and don't support learning from experience through reinforcement learning.

Method: Proposes the Generation-Execution-Feedback (GEF) loop framework where environments: 1) generate tasks to challenge agents, 2) return observations during task execution, and 3) provide evaluative feedback on rollouts for learning. Surveys environment scaling methods organized along these three stages.

Result: Systematic review of representative methods for environment scaling from an environment-centric perspective, analyzing implementation frameworks, challenges, and applications. Consolidates fragmented advances in the field.

Conclusion: Environments are essential producers of experiential data for agent training. Future research should focus on scaling environments toward greater complexity, realism, and interactivity to advance agent intelligence through the GEF loop paradigm.

Abstract: LLM-based agents can autonomously accomplish complex tasks across various domains. However, to further cultivate capabilities such as adaptive behavior and long-term decision-making, training on static datasets built from human-level knowledge is insufficient. These datasets are costly to construct and lack both dynamism and realism. A growing consensus is that agents should instead interact directly with environments and learn from experience through reinforcement learning. We formalize this iterative process as the Generation-Execution-Feedback (GEF) loop, where environments generate tasks to challenge agents, return observations in response to agents’ actions during task execution, and provide evaluative feedback on rollouts for subsequent learning. Under this paradigm, environments function as indispensable producers of experiential data, highlighting the need to scale them toward greater complexity, realism, and interactivity. In this survey, we systematically review representative methods for environment scaling from a pioneering environment-centric perspective and organize them along the stages of the GEF loop, namely task generation, task execution, and feedback. We further analyze implementation frameworks, challenges, and applications, consolidating fragmented advances and outlining future research directions for agent intelligence.

[319] FedReFT: Federated Representation Fine-Tuning with All-But-Me Aggregation

Fatema Siddika, Md Anwar Hossen, J. Pablo Muñoz, Tanya Roosta, Anuj Sharma, Ali Jannesari

Main category: cs.LG

TL;DR: FedReFT introduces federated representation fine-tuning with ABM aggregation and adaptive updates, achieving SOTA performance with 1-49x higher parameter efficiency than LoRA-based methods.

Details

Motivation: ReFT outperforms PEFT methods but faces challenges in FL due to client heterogeneity in data, models, and resources. Representation-level updates are vulnerable to aggregation mismatch under task heterogeneity.

Method: FedReFT applies sparse intervention layers to steer hidden representations directly. Uses All-But-Me aggregation where clients receive aggregated updates from others and partially incorporate them. Includes adaptive update strategy inspired by Test-Time Computing to balance local and global contributions.

Result: Achieves state-of-the-art performance on commonsense reasoning, arithmetic reasoning, and GLUE benchmarks. Delivers 1-49 times higher parameter efficiency compared to leading LoRA-based methods.

Conclusion: FedReFT provides a lightweight, semantically rich fine-tuning approach ideal for edge devices in FL settings, effectively addressing heterogeneity challenges through novel aggregation and update strategies.

Abstract: Parameter-efficient fine-tuning (PEFT) adapts large pre-trained models by updating only a small subset of parameters. Recently, Representation Fine-Tuning (ReFT) has emerged as an effective alternative. ReFT shifts the fine-tuning paradigm from updating model weights to directly manipulating hidden representations that capture rich semantic information, and outperforms state-of-the-art PEFTs in standalone settings. However, its application in Federated Learning (FL) remains challenging due to heterogeneity in clients’ data distributions, model capacities, and computational resources. To address these challenges, we introduce Federated Representation Fine-Tuning (FedReFT), a novel approach to fine-tune clients’ hidden representations. FedReFT applies sparse intervention layers to steer hidden representations directly, offering a lightweight and semantically rich fine-tuning alternative ideal for edge devices. However, representation-level updates are especially vulnerable to aggregation mismatch under different task heterogeneity, where naive averaging can corrupt semantic alignment. To mitigate this issue, we propose All-But-Me (ABM) aggregation, where each client receives the aggregated updates of others and partially incorporates them, enabling stable and personalized learning by balancing local focus with global knowledge. We further design an adaptive update strategy inspired by Test-Time Computing (TTC) to balance local and global contributions under heterogeneous conditions. FedReFT achieves state-of-the-art performance on commonsense reasoning, arithmetic reasoning, and GLUE benchmarks, while delivering 1-49 times higher parameter efficiency compared to leading LoRA-based methods.

Georgia Channing, Avijit Ghosh

Main category: cs.LG

TL;DR: AI’s benefits in science are uneven due to social/institutional barriers, not just technical ones. Need collective approach with community-building, shared resources, and equitable participation.

Details

Motivation: AI's potential in scientific research is not being fully realized due to uneven distribution across communities and disciplines. While technical challenges exist, social and institutional factors are often the primary constraints limiting AI's impact on scientific discovery.

Method: The paper analyzes four interconnected challenges: community coordination, misalignment of research priorities with upstream needs, data fragmentation, and infrastructure inequities. It proposes addressing these through community-building, cross-disciplinary education, shared benchmarks, and accessible infrastructure.

Result: The analysis reveals that narratives emphasizing autonomous “AI scientists,” under-recognition of data/infrastructure work, misaligned incentives, and gaps between domain experts and ML researchers all limit AI’s scientific impact.

Conclusion: AI for science should be reframed as a collective social project where sustainable collaboration and equitable participation are treated as prerequisites for technical progress, requiring intentional efforts beyond just technical innovation.

Abstract: Artificial intelligence (AI) is increasingly applied to scientific research, but its benefits remain unevenly distributed across communities and disciplines. While technical challenges such as limited data, fragmented standards, and unequal access to computational resources exist, social and institutional factors are often the primary constraints. Narratives emphasizing autonomous “AI scientists,” under-recognition of data and infrastructure work, misaligned incentives, and gaps between domain experts and machine learning researchers all limit the impact of AI on scientific discovery. This paper highlights four interconnected challenges: community coordination, misalignment of research priorities with upstream needs, data fragmentation, and infrastructure inequities. We argue that addressing these challenges requires not only technical innovation but also intentional efforts in community-building, cross-disciplinary education, shared benchmarks, and accessible infrastructure. We call for reframing AI for science as a collective social project, where sustainable collaboration and equitable participation are treated as prerequisites for technical progress

[321] Meta-Router: Bridging Gold-standard and Preference-based Evaluations in Large Language Model Routing

Yichi Zhang, Fangzheng Xie, Shu Yang, Chong Wu

Main category: cs.LG

TL;DR: A causal inference framework for training LLM routers that combines expensive gold-standard data with cheaper but biased preference-based data to reduce inference costs while maintaining response quality.

Details

Motivation: Deploying a single "best" LLM for every query is expensive. LLM routers can reduce costs by selecting appropriate models per query, but training them requires reliable supervision data. Gold-standard data (expert labels, rubric scores) is accurate but costly to scale, while preference-based data (crowdsourcing, LLM-as-judge) is cheaper but often biased.

Method: Frames LLM router training as a causal inference problem where response evaluation mechanisms are treatment assignments. The bias in preference-based data corresponds to conditional average treatment effect. Develops an integrative causal router training framework that corrects preference-data bias, addresses imbalances between gold-standard and preference data sources, and improves routing robustness and efficiency.

Result: Numerical experiments show the approach delivers more accurate routing and improves the trade-off between cost and quality compared to methods that don’t correct for bias in preference-based data.

Conclusion: The causal inference perspective provides an effective framework for combining scarce gold-standard data with abundant but biased preference-based data to train high-quality LLM routers, enabling cost-efficient deployment of multiple LLMs while maintaining response quality.

Abstract: In language tasks that require extensive human–model interaction, deploying a single “best” model for every query can be expensive. To reduce inference cost while preserving the quality of the responses, a large language model (LLM) router selects the most appropriate model from a pool of candidates for each query. A central challenge to training a high-quality router is the scarcity of reliable supervision. Gold-standard data (e.g., expert-verified labels or rubric-based scores) provide accurate quality evaluations of LLM responses but are costly and difficult to scale. In contrast, preference-based data, collected via crowdsourcing or LLM-as-a-judge systems, are cheaper and more scalable, yet often biased in reflecting the true quality of responses. We cast the problem of LLM router training with combined gold-standard and preference-based data into a causal inference framework by viewing the response evaluation mechanism as the treatment assignment. This perspective further reveals that the bias in preference-based data corresponds to the well-known causal estimand: the conditional average treatment effect. Based on this new perspective, we develop an integrative causal router training framework that corrects preference-data bias, address imbalances between two data sources, and improve routing robustness and efficiency. Numerical experiments demonstrate that our approach delivers more accurate routing and improves the trade-off between cost and quality.

[322] Equivalence of Context and Parameter Updates in Modern Transformer Blocks

Adrian Goldwaser, Michael Munn, Javier Gonzalvo, Benoit Dherin

Main category: cs.LG

TL;DR: The paper extends the theory of implicit context representation in transformers, showing that context effects can be perfectly mapped to rank-1 patches on MLP weights and RMSNorm scales across diverse modern LLM architectures.

Details

Motivation: To generalize the foundational theory of implicit context representation in vanilla transformers to the diverse architectures of modern Large Language Models, providing a unified understanding of how prompts are transformed into effective weights.

Method: 1. Demonstrates analytical solution for Gemma-style transformer blocks showing perfect mapping to rank-1 patches on MLP weights and RMSNorm scales. 2. Provides constructive proof and algorithm for multi-layer models. 3. Introduces general framework with two core properties: input controllability and output controllability. 4. Proves perfect implicit weight patches are possible for any MLP block meeting these controllability conditions.

Result: Establishes that context effects can be perfectly represented as implicit weight patches across diverse LLM architectures including gating, pre-/post-norm, mixture of experts, and sequential/parallel transformer blocks.

Conclusion: The paper provides a simpler and more powerful theoretical framework for understanding how transformer models transmute prompts into effective weights, generalizing to a wide range of modern LLM architectures through the lens of input and output controllability.

Abstract: Recent research has established that the impact of context in a vanilla transformer can be represented implicitly by forming a token-dependent, rank-1 patch to its MLP weights. This work extends that foundational theory to the diverse architectures of modern Large Language Models. We first demonstrate a precise, analytical solution for a Gemma-style transformer block, proving that the entire effect of a context can be perfectly mapped to rank-1 patches on its MLP weight matrices and a patch to the RMSNorm scale. We then generalize this result, providing a constructive proof and algorithm for multi-layer models. To unify these findings, we introduce a general framework centered on two core properties: input controllability and output controllability. We prove that a perfect implicit weight patch is possible for any MLP block where the inner function is input-controllable and the outer function is output-controllable. This provides a simpler and more powerful lens for understanding how transformer models transmute prompts into effective weights. This setup generalizes to a wide range of modern LLM architectures including gating, pre-/post-norm, mixture of experts and sequential/parallel transformer blocks.

[323] Statistically-Guided Dual-Domain Meta-Learning with Adaptive Multi-Prototype Aggregation for Distributed Fiber Optic Sensing

Yifan He, Haodong Zhang, Qiuheng Song, Lin Lei, Zhenxuan Zeng, Haoyang He, Hongyan Wu

Main category: cs.LG

TL;DR: DUPLE is a prototype-based meta-learning framework for cross-deployment DFOS recognition that addresses domain shift, label scarcity, and limited within-class coverage through dual-domain learning, statistical guidance, and query-adaptive prototype aggregation.

Details

Motivation: Practical deployment of Distributed Fiber Optic Sensing (DFOS) for perimeter security faces three key challenges: severe cross-deployment domain shift, scarce or unavailable labels at new sites, and limited within-class coverage even in source deployments.

Method: DUPLE uses a prototype-based meta-learning framework with three components: (1) dual-domain learner constructs multi-prototype class representations covering intra-class heterogeneity, (2) lightweight statistical guidance mechanism estimates reliability of each domain from raw signal statistics, and (3) query-adaptive aggregation strategy selects and combines most relevant prototypes for each query.

Result: Extensive experiments on two real-world cross-deployment benchmarks demonstrate consistent improvements over strong deep learning and meta-learning baselines, achieving more accurate and stable recognition under label-scarce target deployments.

Conclusion: DUPLE effectively addresses cross-deployment challenges in DFOS recognition by jointly exploiting complementary time- and frequency-domain cues and adapting class representations to sample-specific statistics, enabling practical deployment in real-world perimeter security applications.

Abstract: Distributed Fiber Optic Sensing (DFOS) is promising for long-range perimeter security, yet practical deployment faces three key obstacles: severe cross-deployment domain shift, scarce or unavailable labels at new sites, and limited within-class coverage even in source deployments. We propose DUPLE, a prototype-based meta-learning framework tailored for cross-deployment DFOS recognition. The core idea is to jointly exploit complementary time- and frequency-domain cues and adapt class representations to sample-specific statistics: (i) a dual-domain learner constructs multi-prototype class representations to cover intra-class heterogeneity; (ii) a lightweight statistical guidance mechanism estimates the reliability of each domain from raw signal statistics; and (iii) a query-adaptive aggregation strategy selects and combines the most relevant prototypes for each query. Extensive experiments on two real-world cross-deployment benchmarks demonstrate consistent improvements over strong deep learning and meta-learning baselines, achieving more accurate and stable recognition under label-scarce target deployments.

[324] No Trust Issues Here: A Technical Report on the Winning Solutions for the Rayan AI Contest

Ali Nafisi, Sina Asghari, Mohammad Saeed Arvenaghi, Hossein Shakibania

Main category: cs.LG

TL;DR: This paper presents solutions to three ML challenges from the Rayan AI Contest: compositional image retrieval (1st place, 95.38% accuracy), zero-shot anomaly detection (2nd place, 73.14% score), and backdoored model detection (2nd place, 78% accuracy).

Details

Motivation: To address key machine learning challenges in retrieval, anomaly detection, and model security that have real-world applications in healthcare, manufacturing, and cybersecurity.

Method: Developed specialized systems for each challenge: 1) A compositional image retrieval system processing visual and textual inputs, 2) A zero-shot anomaly detection model identifying anomalies without prior exposure to abnormal examples, and 3) A method to detect hidden backdoor triggers in neural networks.

Result: Achieved top rankings in all three challenges: 1st place in compositional image retrieval with 95.38% accuracy, 2nd place in zero-shot anomaly detection with 73.14% score, and 2nd place in backdoored model detection with 78% accuracy.

Conclusion: The methods effectively address key ML challenges with practical implications for real-world applications, demonstrating strong performance across retrieval, anomaly detection, and model security tasks. All code is publicly available.

Abstract: This report presents solutions to three machine learning challenges developed as part of the Rayan AI Contest: compositional image retrieval, zero-shot anomaly detection, and backdoored model detection. In compositional image retrieval, we developed a system that processes visual and textual inputs to retrieve relevant images, achieving 95.38% accuracy and ranking first with a clear margin over the second team. For zero-shot anomaly detection, we designed a model that identifies and localizes anomalies in images without prior exposure to abnormal examples, securing second place with a 73.14% score. In the backdoored model detection task, we proposed a method to detect hidden backdoor triggers in neural networks, reaching an accuracy of 78%, which placed our approach in second place. These results demonstrate the effectiveness of our methods in addressing key challenges related to retrieval, anomaly detection, and model security, with implications for real-world applications in industries such as healthcare, manufacturing, and cybersecurity. Code for all solutions is available online (https://github.com/safinal/rayan-ai-contest-solutions).

[325] Training LLMs for Honesty via Confessions

Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, Amelia Glaese

Main category: cs.LG

TL;DR: Researchers propose using “confessions” - self-reported honest accounts of LLM shortcomings - to elicit honesty about model misbehavior through separate reward training.

Details

Motivation: LLMs can be dishonest about their actions and beliefs, often due to reinforcement learning reward shaping that inadvertently incentivizes lying. There's a need for methods to elicit honest self-reports of model shortcomings and misbehavior.

Method: Introduce “confessions” - outputs provided after main answers that fully account for model compliance with policies. Train models with confession rewards based solely on honesty, separate from main answer rewards. Train GPT-5-Thinking to produce confessions and evaluate in OOD scenarios.

Result: When models lie or omit shortcomings in main answers, they often confess honestly to these behaviors. Confession honesty modestly improves with training. Confessions enable inference-time interventions like monitoring, rejection sampling, and surfacing issues to users.

Conclusion: Confession-based training provides a viable approach for eliciting honest self-reports of LLM shortcomings, especially for egregious misbehavior, and enables practical interventions to improve model transparency and accountability.

Abstract: Large language models (LLMs) can be dishonest when reporting on their actions and beliefs – for example, they may overstate their confidence in factual claims or cover up evidence of covert actions. Such dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions. In this work we propose a method for eliciting an honest expression of an LLM’s shortcomings via a self-reported confession. A confession is an output, provided upon request after a model’s original answer, that is meant to serve as a full account of the model’s compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer’s reward. As long as the “path of least resistance” for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest in their confessions. Our findings provide some justification this empirical assumption, especially in the case of egregious model misbehavior. To demonstrate the viability of our approach, we train GPT-5-Thinking to produce confessions, and we evaluate its honesty in out-of-distribution scenarios measuring hallucination, instruction following, scheming, and reward hacking. We find that when the model lies or omits shortcomings in its “main” answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training. Confessions can enable a number of inference-time interventions including monitoring, rejection sampling, and surfacing issues to the user.

[326] Efficient Low-Tubal-Rank Tensor Estimation via Alternating Preconditioned Gradient Descent

Zhiyu Liu, Zhi Han, Yandong Tang, Jun Fan, Yao Wang

Main category: cs.LG

TL;DR: APGD algorithm accelerates low-tubal-rank tensor estimation with linear convergence even under over-parameterization, independent of tensor condition number.

Details

Motivation: Traditional tensor SVD is computationally expensive for large tensors, while recent factorization approaches require accurate rank estimation and suffer slow convergence when rank is overestimated.

Method: Alternating Preconditioned Gradient Descent (APGD) adds a preconditioning term to original gradient and updates two factor tensors alternately to accelerate convergence in over-parameterized settings.

Result: APGD achieves linear convergence even under over-parameterization with convergence rate independent of tensor condition number, validated through extensive simulations on synthetic data.

Conclusion: APGD provides an efficient solution for low-tubal-rank tensor estimation that overcomes limitations of existing methods, particularly in handling over-parameterization without sacrificing convergence speed.

Abstract: The problem of low-tubal-rank tensor estimation is a fundamental task with wide applications across high-dimensional signal processing, machine learning, and image science. Traditional approaches tackle such a problem by performing tensor singular value decomposition, which is computationally expensive and becomes infeasible for large-scale tensors. Recent approaches address this issue by factorizing the tensor into two smaller factor tensors and solving the resulting problem using gradient descent. However, this kind of approach requires an accurate estimate of the tensor rank, and when the rank is overestimated, the convergence of gradient descent and its variants slows down significantly or even diverges. To address this problem, we propose an Alternating Preconditioned Gradient Descent (APGD) algorithm, which accelerates convergence in the over-parameterized setting by adding a preconditioning term to the original gradient and updating these two factors alternately. Based on certain geometric assumptions on the objective function, we establish linear convergence guarantees for more general low-tubal-rank tensor estimation problems. Then we further analyze the specific cases of low-tubal-rank tensor factorization and low-tubal-rank tensor recovery. Our theoretical results show that APGD achieves linear convergence even under over-parameterization, and the convergence rate is independent of the tensor condition number. Extensive simulations on synthetic data are carried out to validate our theoretical assertions.

[327] Adaptive Information Routing for Multimodal Time Series Forecasting

Jun Seo, Hyeokjun Choe, Seohui Bae, Soyeon Park, Wonbin Ahn, Taeyoon Lim, Junhyeok Kang, Sangjun Han, Jaehoon Lee, Dongwan Kang, Minjae Kim, Sungdong Yoo, Soonyoung Lee

Main category: cs.LG

TL;DR: AIR framework uses text data to dynamically guide time series models by controlling how multivariate time series information should be combined, improving forecasting accuracy.

Details

Motivation: Traditional time series forecasting relying only on historical data is insufficient for accurate predictions due to limited information. Multimodal approaches incorporating text data alongside time series can address this limitation.

Method: Adaptive Information Routing (AIR) framework leverages text information to dynamically guide time series models by controlling how and to what extent multivariate time series information should be combined. Includes text-refinement pipeline using LLMs to convert raw text into suitable form for multimodal forecasting.

Result: Experiments with real-world market data (crude oil price, exchange rates) demonstrate AIR effectively modulates time series model behavior using textual inputs, significantly enhancing forecasting accuracy across various tasks.

Conclusion: AIR framework successfully addresses limitations of traditional time series forecasting by intelligently incorporating text data to guide time series models, creating a benchmark for multimodal forecasting experiments.

Abstract: Time series forecasting is a critical task for artificial intelligence with numerous real-world applications. Traditional approaches primarily rely on historical time series data to predict the future values. However, in practical scenarios, this is often insufficient for accurate predictions due to the limited information available. To address this challenge, multimodal time series forecasting methods which incorporate additional data modalities, mainly text data, alongside time series data have been explored. In this work, we introduce the Adaptive Information Routing (AIR) framework, a novel approach for multimodal time series forecasting. Unlike existing methods that treat text data on par with time series data as interchangeable auxiliary features for forecasting, AIR leverages text information to dynamically guide the time series model by controlling how and to what extent multivariate time series information should be combined. We also present a text-refinement pipeline that employs a large language model to convert raw text data into a form suitable for multimodal forecasting, and we introduce a benchmark that facilitates multimodal forecasting experiments based on this pipeline. Experiment results with the real world market data such as crude oil price and exchange rates demonstrate that AIR effectively modulates the behavior of the time series model using textual inputs, significantly enhancing forecasting accuracy in various time series forecasting tasks.

[328] Reinforcement Learning From State and Temporal Differences

Lex Weaver, Jonathan Baxter

Main category: cs.LG

TL;DR: STD(λ) modifies TD(λ) to focus on relative state ordering rather than absolute value errors, preventing convergence to suboptimal policies and showing monotonic policy improvement.

Details

Motivation: TD(λ) with function approximation minimizes squared error between approximate and true state values, but for policy learning, the relative ordering of states is more critical than absolute values. TD(λ) can converge to suboptimal policies even when starting from optimal policies.

Method: Proposes STD(λ) (State Transition Difference λ), a modified form of TD(λ) where function approximators are trained with respect to relative state values on binary decision problems. The method focuses on learning the relative ordering between states rather than absolute values.

Result: STD(λ) successfully addresses the limitations of TD(λ), preventing convergence to suboptimal policies. Theoretical analysis includes proof of monotonic policy improvement for STD(λ) in two-state systems. Successful demonstrations on two-state systems and a variation of the acrobot problem.

Conclusion: STD(λ) provides a more effective approach than TD(λ) for policy learning by focusing on relative state ordering rather than absolute value errors, with theoretical guarantees of monotonic policy improvement and practical success on benchmark problems.

Abstract: TD($λ$) with function approximation has proved empirically successful for some complex reinforcement learning problems. For linear approximation, TD($λ$) has been shown to minimise the squared error between the approximate value of each state and the true value. However, as far as policy is concerned, it is error in the relative ordering of states that is critical, rather than error in the state values. We illustrate this point, both in simple two-state and three-state systems in which TD($λ$)–starting from an optimal policy–converges to a sub-optimal policy, and also in backgammon. We then present a modified form of TD($λ$), called STD($λ$), in which function approximators are trained with respect to relative state values on binary decision problems. A theoretical analysis, including a proof of monotonic policy improvement for STD($λ$) in the context of the two-state system, is presented, along with a comparison with Bertsekas’ differential training method [1]. This is followed by successful demonstrations of STD($λ$) on the two-state system and a variation on the well known acrobot problem.

[329] Pretrained Battery Transformer (PBT): A battery life prediction foundation model

Ruifeng Tan, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang

Main category: cs.LG

TL;DR: PBT is the first foundation model for battery cycle life prediction that uses domain-knowledge-encoded mixture-of-expert layers to achieve broad generalization across diverse battery datasets.

Details

Motivation: Early battery cycle life prediction is crucial for accelerating battery research and deployment, but current machine learning approaches are limited by data scarcity and heterogeneity from diverse aging conditions. Foundation models have shown success in other fields for generalization, but none exist for battery life prediction.

Method: Developed Pretrained Battery Transformer (PBT) using domain-knowledge-encoded mixture-of-expert layers. Trained on 13 lithium-ion battery datasets from the largest public battery life database to learn transferable representations.

Result: PBT outperforms existing models by an average of 19.8% and achieves state-of-the-art performance across 15 diverse datasets with various operating conditions, formation protocols, and chemistries through transfer learning.

Conclusion: This work establishes the first foundation model pathway for battery lifetime prediction, paving the way toward universal battery lifetime prediction systems that can generalize across diverse battery conditions and chemistries.

Abstract: Early prediction of battery cycle life is essential for accelerating battery research, manufacturing, and deployment. Although machine learning methods have shown encouraging results, progress is hindered by data scarcity and heterogeneity arising from diverse aging conditions. In other fields, foundation models (FMs) trained on diverse datasets have achieved broad generalization through transfer learning, but no FMs have been reported for battery cycle life prediction yet. Here we present the Pretrained Battery Transformer (PBT), the first FM for battery life prediction, developed through domain-knowledge-encoded mixture-of-expert layers. Validated on the largest public battery life database, PBT learns transferable representations from 13 lithium-ion battery datasets, outperforming existing models by an average of 19.8%. With transfer learning, PBT achieves state-of-the-art performance across 15 diverse datasets encompassing various operating conditions, formation protocols, and chemistries. This work establishes a foundation model pathway for battery lifetime prediction, paving the way toward universal battery lifetime prediction systems.

[330] Dynamic Tool Dependency Retrieval for Efficient Function Calling

Bhrij Patel, Davide Belli, Amir Jalalirad, Maximilian Arnold, Aleksandr Ermolov, Bence Major

Main category: cs.LG

TL;DR: DTDR is a lightweight dynamic retrieval method that improves function calling agents by considering both initial queries and evolving execution context, outperforming static retrievers by 23-104% in success rates.

Details

Motivation: Existing retrieval methods for on-device LLM agents rely on static inputs, failing to capture multi-step tool dependencies and evolving task context, which introduces irrelevant tools that degrade agent efficiency and accuracy.

Method: Dynamic Tool Dependency Retrieval (DTDR) conditions retrieval on both initial query and evolving execution context, models tool dependencies from function calling demonstrations, and enables adaptive retrieval as plans unfold.

Result: DTDR improves function calling success rates between 23% and 104% compared to state-of-the-art static retrievers across multiple datasets and LLM backbones, while maintaining computational efficiency.

Conclusion: Dynamic tool retrieval that considers evolving execution context significantly outperforms static retrieval methods, demonstrating the importance of capturing tool dependencies and task evolution for effective function calling agents.

Abstract: Function calling agents powered by Large Language Models (LLMs) select external tools to automate complex tasks. On-device agents typically use a retrieval module to select relevant tools, improving performance and reducing context length. However, existing retrieval methods rely on static and limited inputs, failing to capture multi-step tool dependencies and evolving task context. This limitation often introduces irrelevant tools that mislead the agent, degrading efficiency and accuracy. We propose Dynamic Tool Dependency Retrieval (DTDR), a lightweight retrieval method that conditions on both the initial query and the evolving execution context. DTDR models tool dependencies from function calling demonstrations, enabling adaptive retrieval as plans unfold. We benchmark DTDR against state-of-the-art retrieval methods across multiple datasets and LLM backbones, evaluating retrieval precision, downstream task accuracy, and computational efficiency. Additionally, we explore strategies to integrate retrieved tools into prompts. Our results show that dynamic tool retrieval improves function calling success rates between $23%$ and $104%$ compared to state-of-the-art static retrievers.

[331] Learning Safe Autonomous Driving Policies Using Predictive Safety Representations

Mahesh Keswani, Raunak Bhattacharyya

Main category: cs.LG

TL;DR: SRPL framework improves safety-performance tradeoff in real-world autonomous driving, showing significant success rate improvements and cost reduction across Waymo and NuPlan datasets.

Details

Motivation: SafeRL faces fundamental tension between safety requirements and driving efficiency - overly conservative policies limit efficiency while aggressive exploration risks safety violations. Need to test if SRPL framework works in real-world autonomous driving scenarios.

Method: Systematic experiments on real-world datasets (Waymo Open Motion Dataset and NuPlan) using SRPL framework that equips agents with predictive model of future constraint violations. Evaluated reward-safety tradeoff, success rates, cost reduction, robustness to observation noise, and zero-shot cross-dataset generalization.

Result: SRPL improves reward-safety tradeoff with statistically significant improvements: success rate (effect sizes r = 0.65-0.86), cost reduction (effect sizes r = 0.70-0.83), p < 0.05. Effectiveness depends on policy optimizer and dataset distribution. Predictive safety representations improve robustness to observation noise and enable better zero-shot cross-dataset generalization.

Conclusion: Predictive safety representations in SRPL framework demonstrate potential to strengthen SafeRL for autonomous driving in real-world scenarios, though effectiveness depends on underlying policy optimizer and dataset characteristics.

Abstract: Safe reinforcement learning (SafeRL) is a prominent paradigm for autonomous driving, where agents are required to optimize performance under strict safety requirements. This dual objective creates a fundamental tension, as overly conservative policies limit driving efficiency while aggressive exploration risks safety violations. The Safety Representations for Safer Policy Learning (SRPL) framework addresses this challenge by equipping agents with a predictive model of future constraint violations and has shown promise in controlled environments. This paper investigates whether SRPL extends to real-world autonomous driving scenarios. Systematic experiments on the Waymo Open Motion Dataset (WOMD) and NuPlan demonstrate that SRPL can improve the reward-safety tradeoff, achieving statistically significant improvements in success rate (effect sizes r = 0.65-0.86) and cost reduction (effect sizes r = 0.70-0.83), with p < 0.05 for observed improvements. However, its effectiveness depends on the underlying policy optimizer and the dataset distribution. The results further show that predictive safety representations play a critical role in improving robustness to observation noise. Additionally, in zero-shot cross-dataset evaluation, SRPL-augmented agents demonstrate improved generalization compared to non-SRPL methods. These findings collectively demonstrate the potential of predictive safety representations to strengthen SafeRL for autonomous driving.

[332] Fusion of Multiscale Features Via Centralized Sparse-attention Network for EEG Decoding

Xiangrui Cai, Shaocheng Ma, Lei Cao, Jie Li, Tianyu Liu, Yilin Dong

Main category: cs.LG

TL;DR: EEG-CSANet: A multi-branch parallel architecture with centralized sparse-attention network for EEG signal decoding, achieving SOTA performance across five public datasets.

Details

Motivation: To address the inherent spatiotemporal heterogeneity of EEG signals, which is crucial for accurate brain-machine interfacing and intelligent interaction.

Method: Multi-branch parallel architecture with independent spatial feature extraction modules for each temporal scale, plus a centralized sparse-attention network (EEG-CSANet) with main-auxiliary branch architecture. Main branch models core spatiotemporal patterns via multiscale self-attention, while auxiliary branch facilitates efficient local interactions through sparse cross-attention.

Result: Achieves state-of-the-art performance across five public datasets: BCIC-IV-2A (88.54%), BCIC-IV-2B (91.09%), HGD (99.43%), SEED (96.03%), and SEED-VIG (90.56%). Demonstrates strong adaptability and robustness across various EEG decoding tasks.

Conclusion: EEG-CSANet shows promising performance as a baseline model for EEG signal decoding, with extensive ablation studies enhancing interpretability. Source code is publicly available for further research.

Abstract: Electroencephalography (EEG) signal decoding is a key technology that translates brain activity into executable commands, laying the foundation for direct brain-machine interfacing and intelligent interaction. To address the inherent spatiotemporal heterogeneity of EEG signals, this paper proposes a multi-branch parallel architecture, where each temporal scale is equipped with an independent spatial feature extraction module. To further enhance multi-branch feature fusion, we propose a Fusion of Multiscale Features via Centralized Sparse-attention Network (EEG-CSANet), a centralized sparse-attention network. It employs a main-auxiliary branch architecture, where the main branch models core spatiotemporal patterns via multiscale self-attention, and the auxiliary branch facilitates efficient local interactions through sparse cross-attention. Experimental results show that EEG-CSANet achieves state-of-the-art (SOTA) performance across five public datasets (BCIC-IV-2A, BCIC-IV-2B, HGD, SEED, and SEED-VIG), with accuracies of 88.54%, 91.09%, 99.43%, 96.03%, and 90.56%, respectively. Such performance demonstrates its strong adaptability and robustness across various EEG decoding tasks. Moreover, extensive ablation studies are conducted to enhance the interpretability of EEG-CSANet. In the future, we hope that EEG-CSANet could serve as a promising baseline model in the field of EEG signal decoding. The source code is publicly available at: https://github.com/Xiangrui-Cai/EEG-CSANet

cs.MA

[333] A Multi-Agent Retrieval-Augmented Framework for Work-in-Progress Predictio

Yousef Mehrdad Bibalan, Behrouz Far, Mohammad Moshirpour, Bahareh Ghiyasian

Main category: cs.MA

TL;DR: A retrieval-augmented multi-agent framework for Work-in-Progress prediction that combines RAG with collaborative agent reasoning, achieving competitive accuracy with MAPE of 1.50% on real-world datasets.

Details

Motivation: Work-in-Progress prediction is critical for predictive process monitoring to anticipate workload fluctuations and optimize operational planning. Existing approaches may lack robustness and contextual understanding.

Method: Proposes a retrieval-augmented multi-agent framework with: 1) narrative generation transforming event logs into natural language stories, 2) semantic vector-based process memory for dynamic context retrieval, 3) predictor agents using retrieved contexts, 4) assistant agent extracting high-level signals, and 5) fusion agent synthesizing predictions using ReAct-style reasoning.

Result: Achieves competitive prediction accuracy with MAPE of 1.50% on one dataset, surpassing Temporal Convolutional Networks (TCN), Long Short-Term Memory (LSTM), and persistence baselines. Demonstrates improved robustness on two real-world benchmark datasets.

Conclusion: The integration of retrieval mechanisms and multi-agent reasoning effectively improves WiP prediction, highlighting the framework’s robustness and competitive performance compared to traditional deep learning approaches.

Abstract: Work-in-Progress (WiP) prediction is critical for predictive process monitoring, enabling accurate anticipation of workload fluctuations and optimized operational planning. This paper proposes a retrieval-augmented, multi-agent framework that combines retrieval-augmented generation (RAG) and collaborative multi-agent reasoning for WiP prediction. The narrative generation component transforms structured event logs into semantically rich natural language stories, which are embedded into a semantic vector-based process memory to facilitate dynamic retrieval of historical context during inference. The framework includes predictor agents that independently leverage retrieved historical contexts and a decision-making assistant agent that extracts high-level descriptive signals from recent events. A fusion agent then synthesizes predictions using ReAct-style reasoning over agent outputs and retrieved narratives. We evaluate our framework on two real-world benchmark datasets. Results show that the proposed retrieval-augmented multi-agent approach achieves competitive prediction accuracy, obtaining a Mean Absolute Percentage Error (MAPE) of 1.50% on one dataset, and surpassing Temporal Convolutional Networks (TCN), Long Short-Term Memory (LSTM), and persistence baselines. The results highlight improved robustness, demonstrating the effectiveness of integrating retrieval mechanisms and multi-agent reasoning in WiP prediction.

[334] When Natural Strategies Meet Fuzziness and Resource-Bounded Actions (Extended Version)

Marco Aruta, Francesco Improta, Vadim Malvone, Aniello Murano

Main category: cs.MA

TL;DR: HumanATLF is a logic combining natural strategies with fuzzy semantics and resource-bounded actions for more realistic modeling of human decision-making in multi-agent systems.

Details

Motivation: Traditional MAS formal reasoning assumes agents use arbitrarily complex strategies with zero-cost actions and crisp game structures, which contrasts with real human decision-making that involves action costs and uncertainty. Existing natural strategies frameworks address strategy complexity but ignore action costs and perceptual uncertainty.

Method: Introduces HumanATLF logic that extends natural strategies with fuzzy semantics (degrees in [0,1] for atomic conditions/goals) and resource-bounded actions (each action has real-valued cost from non-refillable budget). Provides formal syntax/semantics and analyzes computational complexity.

Result: Model checking complexity: P when strategy complexity k and budget b are fixed; NP-complete with one strategic operator over Boolean objectives; Δ₂ᴾ-complete when k and b vary; PSPACE for recall-based strategies. Implemented in VITAMIN tool and validated on adversarial drone rescue scenario.

Conclusion: HumanATLF bridges the gap between formal strategic reasoning and realistic human decision-making by incorporating both resource constraints and fuzzy uncertainty, with practical model checking algorithms of varying complexity depending on parameter settings.

Abstract: In formal strategic reasoning for Multi-Agent Systems (MAS), agents are typically assumed to (i) employ arbitrarily complex strategies, (ii) execute each move at zero cost, and (iii) operate over fully crisp game structures. These idealized assumptions stand in stark contrast with human decision making in real world environments. The natural strategies framework along with some of its recent variants, partially addresses this gap by restricting strategies to concise rules guarded by regular expressions. Yet, it still overlook both the cost of each action and the uncertainty that often characterizes human perception of facts over the time. In this work, we introduce HumanATLF, a logic that builds upon natural strategies employing both fuzzy semantics and resource bound actions: each action carries a real valued cost drawn from a non refillable budget, and atomic conditions and goals have degrees in [0,1]. We give a formal syntax and semantics, and prove that model checking is in P when both the strategy complexity k and resource budget b are fixed, NP complete if just one strategic operator over Boolean objectives is allowed, and Delta^P_2 complete when k and b vary. Moreover, we show that recall based strategies can be decided in PSPACE. We implement our algorithms in VITAMIN, an open source model checking tool for MAS and validate them on an adversarial resource aware drone rescue scenario.

cs.MM

Ziyang Fan, Li Tao, Yi Wang, Jingwei Qu, Ying Wang, Fei Jiang

Main category: cs.MM

TL;DR: Proposes DS-HGCN, a dual-stream multi-feature fusion model using hypergraph convolutional networks to predict student engagement by modeling social contagion effects between students.

Details

Motivation: Current student engagement prediction approaches are limited by single-dimensional feature analysis and focus on individual student factors, ignoring the social contagion effects where engagement spreads between students.

Method: DS-HGCN (dual-stream hypergraph convolutional network) constructs hypergraph structures to encode engagement contagion among students, captures emotional/behavioral differences via multi-frequency signals, and uses hypergraph attention to dynamically weigh student influence.

Result: Extensive experiments on public benchmark datasets show superior performance, significantly outperforming existing state-of-the-art approaches.

Conclusion: The proposed DS-HGCN effectively models social contagion of student engagement through hypergraph structures and attention mechanisms, enabling more accurate engagement prediction for personalized educational interventions.

Abstract: Student engagement is a critical factor influencing academic success and learning outcomes. Accurately predicting student engagement is essential for optimizing teaching strategies and providing personalized interventions. However, most approaches focus on single-dimensional feature analysis and assessing engagement based on individual student factors. In this work, we propose a dual-stream multi-feature fusion model based on hypergraph convolutional networks (DS-HGCN), incorporating social contagion of student engagement. DS-HGCN enables accurate prediction of student engagement states by modeling multi-dimensional features and their propagation mechanisms between students. The framework constructs a hypergraph structure to encode engagement contagion among students and captures the emotional and behavioral differences and commonalities by multi-frequency signals. Furthermore, we introduce a hypergraph attention mechanism to dynamically weigh the influence of each student, accounting for individual differences in the propagation process. Extensive experiments on public benchmark datasets demonstrate that our proposed method achieves superior performance and significantly outperforms existing state-of-the-art approaches.

[336] Reinforcement Learning for Unsupervised Video Summarization with Reward Generator Training

Mehryar Abbasi, Hadi Hadizadeh, Parvaneh Saeedi

Main category: cs.MM

TL;DR: Unsupervised video summarization using RL with reconstruction fidelity as reward signal, avoiding adversarial training and heuristic rewards.

Details

Motivation: Address limitations in existing unsupervised video summarization methods: unstable adversarial training and reliance on heuristic-based reward functions.

Method: Two-stage training: 1) Pre-train generator self-supervisedly to reconstruct randomly masked frames; 2) RL summarizer assigns importance scores, with reward from reconstruction similarity between original and summary-reconstructed video.

Result: Strong alignment with human judgments and promising F-scores, validating reconstruction objective as effective proxy for informativeness.

Conclusion: Reconstruction fidelity serves as effective proxy for informativeness in video summarization, enabling stable RL training without adversarial architectures or heuristic rewards.

Abstract: This paper presents a novel approach for unsupervised video summarization using reinforcement learning (RL), addressing limitations like unstable adversarial training and reliance on heuristic-based reward functions. The method operates on the principle that reconstruction fidelity serves as a proxy for informativeness, correlating summary quality with reconstruction ability. The summarizer model assigns importance scores to frames to generate the final summary. For training, RL is coupled with a unique reward generation pipeline that incentivizes improved reconstructions. This pipeline uses a generator model to reconstruct the full video from the selected summary frames; the similarity between the original and reconstructed video provides the reward signal. The generator itself is pre-trained self-supervisedly to reconstruct randomly masked frames. This two-stage training process enhances stability compared to adversarial architectures. Experimental results show strong alignment with human judgments and promising F-scores, validating the reconstruction objective.

eess.AS

[337] ASK: Adaptive Self-improving Knowledge Framework for Audio Text Retrieval

Siyuan Fu, Xuchen Guo, Mingjun Liu, Hongxiang Li, Boyin Tan, Gongxi Zhu, Xianwei Zhuang, Jinghan Ru, Yuxin Xie, Yuguo Yin

Main category: eess.AS

TL;DR: ASK framework addresses Gradient Locality Bottleneck and Representation-Drift Mismatch in Audio-Text Retrieval through adaptive knowledge refinement and multi-grained injection.

Details

Motivation: Current Audio-Text Retrieval methods suffer from Gradient Locality Bottleneck (GLB) that limits out-of-batch knowledge utilization, and existing knowledge-enhanced approaches create Representation-Drift Mismatch (RDM) where static knowledge bases become misaligned with evolving models.

Method: Proposes Adaptive Self-improving Knowledge (ASK) framework with: 1) multi-grained knowledge injection to break GLB, 2) dynamic knowledge refinement to mitigate RDM, and 3) adaptive reliability weighting to ensure consistent knowledge contributes to optimization.

Result: Achieves superior, state-of-the-art performance on two benchmark datasets, demonstrating the efficacy of the ASK framework.

Conclusion: ASK effectively addresses both GLB and RDM challenges in Audio-Text Retrieval through a model-agnostic, plug-and-play solution that adaptively refines knowledge while maintaining alignment with model evolution.

Abstract: The dominant paradigm for Audio-Text Retrieval (ATR) relies on mini-batch-based contrastive learning. This process, however, is inherently limited by what we formalize as the Gradient Locality Bottleneck (GLB), which structurally prevents models from leveraging out-of-batch knowledge and thus impairs fine-grained and long-tail learning. While external knowledge-enhanced methods can alleviate the GLB, we identify a critical, unaddressed side effect: the Representation-Drift Mismatch (RDM), where a static knowledge base becomes progressively misaligned with the evolving model, turning guidance into noise. To address this dual challenge, we propose the Adaptive Self-improving Knowledge (ASK) framework, a model-agnostic, plug-and-play solution. ASK breaks the GLB via multi-grained knowledge injection, systematically mitigates RDM through dynamic knowledge refinement, and introduces a novel adaptive reliability weighting scheme to ensure consistent knowledge contributes to optimization. Experimental results on two benchmark datasets with superior, state-of-the-art performance justify the efficacy of our proposed ASK framework.

[338] SpatialNet with Binaural Loss Function for Correcting Binaural Signal Matching Outputs under Head Rotations

Dor Shamay, Boaz Rafaely

Main category: eess.AS

TL;DR: Deep learning-enhanced BSM-MagLS method improves binaural audio reproduction accuracy during head rotation by using SpatialNet with perceptual loss functions.

Details

Motivation: BSM-MagLS method for binaural reproduction degrades with head rotation, causing spatial and timbral artifacts when ears move away from microphones, especially in VR/AR devices with limited microphone arrays.

Method: Integrate deep learning with BSM-MagLS using SpatialNet post-processing framework with signal-level loss and perceptually motivated binaural loss based on human hearing model.

Result: Simulation with six-microphone semicircular array shows robust performance across head rotations. Listening experiments in various reverberant environments confirm effective mitigation of BSM-MagLS degradations.

Conclusion: Deep learning enhancement of BSM-MagLS provides robust correction for binaural reproduction across substantial head rotations, improving VR/AR audio quality.

Abstract: Binaural reproduction is gaining increasing attention with the rise of devices such as virtual reality headsets, smart glasses, and head-tracked headphones. Achieving accurate binaural signals with these systems is challenging, as they often employ arbitrary microphone arrays with limited spatial resolution. The Binaural Signals Matching with Magnitude Least-Squares (BSM-MagLS) method was developed to address limitations of earlier BSM formulations, improving reproduction at high frequencies and under head rotation. However, its accuracy still degrades as head rotation increases, resulting in spatial and timbral artifacts, particularly when the virtual listener’s ear moves farther from the nearest microphones. In this work, we propose the integration of deep learning with BSM-MagLS to mitigate these degradations. A post-processing framework based on the SpatialNet network is employed, leveraging its ability to process spatial information effectively and guided by both signal-level loss and a perceptually motivated binaural loss derived from a theoretical model of human binaural hearing. The effectiveness of the approach is investigated in a simulation study with a six-microphone semicircular array, showing its ability to perform robustly across head rotations. These findings are further studied in a listening experiment across different reverberant acoustic environments and head rotations, demonstrating that the proposed framework effectively mitigates BSM-MagLS degradations and provides robust correction across substantial head rotations.

[339] QuarkAudio Technical Report

Chengwei Liu, Haoyin Yan, Shaofei Xue, Xiaotao Liang, Xiaofu Chen, Bin Gong, Zheng Xue, Gang Song

Main category: eess.AS

TL;DR: QuarkAudio is a unified autoregressive language model framework for multiple audio tasks using a novel discrete audio tokenizer (H-Codec) with SSL representations, supporting speech restoration, separation, conversion, and language-guided audio editing.

Details

Motivation: Existing audio models use task-specific architectures, leading to fragmented development and limited extensibility. There's a need for a unified framework that can handle multiple audio tasks with robust instruction understanding and high-quality generation.

Method: Decoder-only autoregressive LM framework with H-Codec tokenizer incorporating SSL representations, dynamic frame-rate mechanism, and 48kHz sampling. Uses task-specific conditional information as conditioning sequence and predicts discrete audio tokens autoregressively.

Result: H-Codec achieves high-quality audio reconstruction with low frame rate, improving efficiency and performance. QuarkAudio delivers competitive/comparable performance to state-of-the-art task-specific or multi-task systems across multiple audio tasks.

Conclusion: QuarkAudio successfully unifies multiple audio processing and generation tasks in a single framework, demonstrating that a unified approach can achieve competitive performance while supporting diverse applications including language-guided audio editing.

Abstract: Many existing audio processing and generation models rely on task-specific architectures, resulting in fragmented development efforts and limited extensibility. It is therefore promising to design a unified framework capable of handling multiple tasks, while providing robust instruction and audio understanding and high-quality audio generation. This requires a compatible paradigm design, a powerful backbone, and a high-fidelity audio reconstruction module. To meet these requirements, this technical report introduces QuarkAudio, a decoder-only autoregressive (AR) LM-based generative framework that unifies multiple tasks. The framework includes a unified discrete audio tokenizer, H-Codec, which incorporates self-supervised learning (SSL) representations into the tokenization and reconstruction process. We further propose several improvements to H-Codec, such as a dynamic frame-rate mechanism and extending the audio sampling rate to 48 kHz. QuarkAudio unifies tasks by using task-specific conditional information as the conditioning sequence of the decoder-only LM, and predicting discrete target audio tokens in an AR manner. The framework supports a wide range of audio processing and generation tasks, including speech restoration (SR), target speaker extraction (TSE), speech separation (SS), voice conversion (VC), and language-queried audio source separation (LASS). In addition, we extend downstream tasks to universal free-form audio editing guided by natural language instructions (including speech semantic editing and audio event editing). Experimental results show that H-Codec achieves high-quality audio reconstruction with a low frame rate, improving both the efficiency and performance of downstream audio generation, and that QuarkAudio delivers competitive or comparable performance to state-of-the-art task-specific or multi-task systems across multiple tasks.

[340] LP-CFM: Perceptual Invariance-Aware Conditional Flow Matching for Speech Modeling

Doyeop Kwak, Youngjoon Jang, Joon Son Chung

Main category: eess.AS

TL;DR: LP-CFM improves speech modeling by incorporating perceptual invariances through projection-aligned Gaussians and vector calibrated sampling, outperforming conventional CFM especially in low-resource and few-step scenarios.

Details

Motivation: Conventional generative models treat speech samples as fixed representatives, ignoring that each sample is just one of many perceptually equivalent variants (due to amplitude scaling, temporal shifts, etc.). This fails to capture the true speech distribution's perceptual invariances.

Method: Proposes Linear Projection Conditional Flow Matching (LP-CFM) which models targets as projection-aligned elongated Gaussians along perceptually equivalent variants. Also introduces Vector Calibrated Sampling (VCS) to keep sampling aligned with the line-projection path.

Result: In neural vocoding experiments across various model sizes, data scales, and sampling steps, LP-CFM consistently outperforms conventional optimal transport CFM. Shows particularly strong gains in low-resource and few-step sampling scenarios.

Conclusion: LP-CFM and VCS provide more robust and perceptually grounded generative modeling of speech by explicitly incorporating perceptual invariances, demonstrating potential for improved speech synthesis.

Abstract: The goal of this paper is to provide a new perspective on speech modeling by incorporating perceptual invariances such as amplitude scaling and temporal shifts. Conventional generative formulations often treat each dataset sample as a fixed representative of the target distribution. From a generative standpoint, however, such samples are only one among many perceptually equivalent variants within the true speech distribution. To address this, we propose Linear Projection Conditional Flow Matching (LP-CFM), which models targets as projection-aligned elongated Gaussians along perceptually equivalent variants. We further introduce Vector Calibrated Sampling (VCS) to keep the sampling process aligned with the line-projection path. In neural vocoding experiments across model sizes, data scales, and sampling steps, the proposed approach consistently improves over the conventional optimal transport CFM, with particularly strong gains in low-resource and few-step scenarios. These results highlight the potential of LP-CFM and VCS to provide more robust and perceptually grounded generative modeling of speech.

[341] Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning

Yangui Fang, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong, Kai Yu

Main category: eess.AS

TL;DR: Text-only fine-tuning for Speech LLMs enables domain adaptation using only unpaired target-domain text, preserving speech-text alignment through real-time evaluation during training.

Details

Motivation: Adapting Speech LLMs to new domains is challenging in low-resource settings where paired speech-text data is scarce, creating a need for methods that can leverage abundant unpaired text data.

Method: Proposes text-only fine-tuning strategy using unpaired target-domain text without additional audio, with real-time evaluation mechanism during fine-tuning to preserve speech-text alignment.

Result: Achieves competitive recognition performance on LibriSpeech, SlideSpeech, and Medical datasets with minimal degradation compared to full audio-text fine-tuning, while improving generalization without catastrophic forgetting.

Conclusion: Text-only fine-tuning shows strong potential for low-resource domain adaptation of ASR systems, enabling effective adaptation while maintaining source-domain performance.

Abstract: Recent advances in automatic speech recognition (ASR) have combined speech encoders with large language models (LLMs) through projection, forming Speech LLMs with strong performance. However, adapting them to new domains remains challenging, especially in low-resource settings where paired speech-text data is scarce. We propose a text-only fine-tuning strategy for Speech LLMs using unpaired target-domain text without requiring additional audio. To preserve speech-text alignment, we introduce a real-time evaluation mechanism during fine-tuning. This enables effective domain adaptation while maintaining source-domain performance. Experiments on LibriSpeech, SlideSpeech, and Medical datasets show that our method achieves competitive recognition performance, with minimal degradation compared to full audio-text fine-tuning. It also improves generalization to new domains without catastrophic forgetting, highlighting the potential of text-only fine-tuning for low-resource domain adaptation of ASR.

[342] Spectral Bottleneck in Sinusoidal Representation Networks: Noise is All You Need

Hemanth Chandravamsi, Dhanush V. Shenoy, Itay Zinn, Ziv Chen, Shimon Pisnoy, Steven H. Frankel

Main category: eess.AS

TL;DR: SIRENs have sensitivity to frequency content and initialization, causing spectral bottlenecks. WINNER initialization addresses this by adapting spectral profiles for better fitting accuracy.

Details

Motivation: Implicit neural representations with sinusoidal activations (SIRENs) suffer from fundamental limitations: their fitting error is highly sensitive to target frequency content and initialization choices, leading to spectral bottlenecks and potential zero-valued outputs in extreme cases.

Method: Analyzed activation spectra evolution and empirical neural tangent kernel during training. Examined Gaussian perturbations to uniformly initialized weights. Proposed WINNER (Weight Initialization) scheme as a target-aware initialization strategy that modifies spectral profiles of network activations.

Result: WINNER achieves state-of-the-art performance on audio fitting tasks and yields notable improvements in image fitting tasks. Demonstrates that fitting accuracy can be significantly improved through target-aware initialization.

Conclusion: Initialization is a central factor governing SIREN evolution, requiring adaptive, target-aware strategies as target length increases and fine-scale detail becomes essential. WINNER represents a simple but effective step toward addressing spectral bottlenecks in implicit neural representations.

Abstract: This work identifies and attempts to address a fundamental limitation of implicit neural representations with sinusoidal activation. The fitting error of SIRENs is highly sensitive to the target frequency content and to the choice of initialization. In extreme cases, this sensitivity leads to a spectral bottleneck that can result in a zero-valued output. This phenomenon is characterized by analyzing the evolution of activation spectra and the empirical neural tangent kernel (NTK) during the training process. An unfavorable distribution of energy across frequency modes was noted to give rise to this failure mode. Furthermore, the effect of Gaussian perturbations applied to the baseline uniformly initialized weights is examined, showing how these perturbations influence activation spectra and the NTK eigenbasis of SIREN. Overall, initialization emerges as a central factor governing the evolution of SIRENs, indicating the need for adaptive, target-aware strategies as the target length increases and fine-scale detail becomes essential. The proposed weight initialization scheme (WINNER) represents a simple ad hoc step in this direction and demonstrates that fitting accuracy can be significantly improved by modifying the spectral profile of network activations through a target-aware initialization. The approach achieves state-of-the-art performance on audio fitting tasks and yields notable improvements in image fitting tasks.

[343] DeepASA: An Object-Oriented Multi-Purpose Network for Auditory Scene Analysis

Dongheon Lee, Younghoo Kwon, Jung-Woo Choi

Main category: eess.AS

TL;DR: DeepASA is a unified multi-purpose model for auditory scene analysis that performs source separation, dereverberation, sound event detection, audio classification, and direction-of-arrival estimation using object-oriented processing and chain-of-inference mechanisms.

Details

Motivation: The paper addresses complex auditory scenes where multiple similar sound sources overlap in time and move dynamically in space. Traditional approaches suffer from parameter association ambiguity in track-wise processing, and early-stage object separation can lead to downstream task failures.

Method: DeepASA uses object-oriented processing (OOP) strategy that encapsulates auditory features into object-centric representations refined through chain-of-inference (CoI). The pipeline includes dynamic temporal kernel-based feature extractor, transformer-based aggregator, object separator, and task-specific decoders. Temporal coherence matching (TCM) enables multi-task fusion and iterative refinement of object features.

Result: The model achieves state-of-the-art performance across all evaluated tasks on spatial audio benchmark datasets (ASA2, MC-FUSS, STARSS23), demonstrating effectiveness in both source separation and auditory parameter estimation under diverse spatial auditory scenes.

Conclusion: DeepASA provides a unified framework for comprehensive auditory scene analysis that overcomes traditional limitations through object-centric representations and iterative refinement, achieving robust performance across multiple ASA tasks in complex spatial audio scenarios.

Abstract: We propose DeepASA, a multi-purpose model for auditory scene analysis that performs multi-input multi-output (MIMO) source separation, dereverberation, sound event detection (SED), audio classification, and direction-of-arrival estimation (DoAE) within a unified framework. DeepASA is designed for complex auditory scenes where multiple, often similar, sound sources overlap in time and move dynamically in space. To achieve robust and consistent inference across tasks, we introduce an object-oriented processing (OOP) strategy. This approach encapsulates diverse auditory features into object-centric representations and refines them through a chain-of-inference (CoI) mechanism. The pipeline comprises a dynamic temporal kernel-based feature extractor, a transformer-based aggregator, and an object separator that yields per-object features. These features feed into multiple task-specific decoders. Our object-centric representations naturally resolve the parameter association ambiguity inherent in traditional track-wise processing. However, early-stage object separation can lead to failure in downstream ASA tasks. To address this, we implement temporal coherence matching (TCM) within the chain-of-inference, enabling multi-task fusion and iterative refinement of object features using estimated auditory parameters. We evaluate DeepASA on representative spatial audio benchmark datasets, including ASA2, MC-FUSS, and STARSS23. Experimental results show that our model achieves state-of-the-art performance across all evaluated tasks, demonstrating its effectiveness in both source separation and auditory parameter estimation under diverse spatial auditory scenes.

[344] Unsupervised Single-Channel Audio Separation with Diffusion Source Priors

Runwu Shi, Chang Li, Jiang Wang, Rui Zhang, Nabeela Khan, Benjamin Yen, Takeshi Ashizawa, Kazuhiro Nakadai

Main category: eess.AS

TL;DR: Unsupervised single-channel audio separation using diffusion priors and reconstruction guidance, with novel inverse problem solver and time-frequency attention architecture.

Details

Motivation: Supervised methods require paired synthetic data which is hard to obtain in real-world scenarios, limiting generalization. Unsupervised approach needed to overcome data scarcity.

Method: Frames separation as probabilistic inverse problem using diffusion priors trained on individual sources. Uses reconstruction guidance with novel inverse problem solver to mitigate gradient conflicts. Initializes denoising with augmented mixture instead of Gaussian noise. Introduces time-frequency attention-based network for audio prior modeling.

Result: Significant performance gains validated across speech-sound event, sound event, and speech separation tasks. Achieves high-quality and balanced separation across individual sources.

Conclusion: Proposed unsupervised approach with diffusion priors and reconstruction guidance effectively solves single-channel audio separation without requiring paired training data, demonstrating strong generalization capability.

Abstract: Single-channel audio separation aims to separate individual sources from a single-channel mixture. Most existing methods rely on supervised learning with synthetically generated paired data. However, obtaining high-quality paired data in real-world scenarios is often difficult. This data scarcity can degrade model performance under unseen conditions and limit generalization ability. To this end, in this work, we approach this problem from an unsupervised perspective, framing it as a probabilistic inverse problem. Our method requires only diffusion priors trained on individual sources. Separation is then achieved by iteratively guiding an initial state toward the solution through reconstruction guidance. Importantly, we introduce an advanced inverse problem solver specifically designed for separation, which mitigates gradient conflicts caused by interference between the diffusion prior and reconstruction guidance during inverse denoising. This design ensures high-quality and balanced separation performance across individual sources. Additionally, we find that initializing the denoising process with an augmented mixture instead of pure Gaussian noise provides an informative starting point that significantly improves the final performance. To further enhance audio prior modeling, we design a novel time-frequency attention-based network architecture that demonstrates strong audio modeling capability. Collectively, these improvements lead to significant performance gains, as validated across speech-sound event, sound event, and speech separation tasks.

[345] SAM Audio: Segment Anything in Audio

Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, Christoph Feichtenhofer, Piotr Dollár, Wei-Ning Hsu, Ann Lee

Main category: eess.AS

TL;DR: SAM Audio is a foundation model for general audio source separation that unifies text, visual, and temporal span prompting within a single diffusion transformer framework, achieving SOTA across diverse audio domains.

Details

Motivation: Existing audio separation models are either domain-specific (limited to fixed categories like speech/music) or have limited controllability (supporting only single prompting modalities like text). There's a need for a general-purpose separation model with multimodal prompting capabilities.

Method: Built on diffusion transformer architecture, trained with flow matching on large-scale audio data spanning speech, music, and general sounds. Unifies text, visual, and temporal span prompting within a single framework for flexible target source separation.

Result: Achieves state-of-the-art performance across diverse benchmarks including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audios, substantially outperforming prior general-purpose and specialized systems.

Conclusion: SAM Audio represents a significant advancement in general audio separation with multimodal controllability, introducing new benchmarks and evaluation methods that correlate well with human judgment, enabling more capable multimodal AI systems.

Abstract: General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audios, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.

eess.IV

[346] Neural Compression of 360-Degree Equirectangular Videos using Quality Parameter Adaptation

Daichi Arai, Yuichi Kondo, Kyohei Unno, Yasuko Sugito, Yuichi Kusakabe

Main category: eess.IV

TL;DR: A method to compress 360-degree equirectangular videos using pretrained neural video compression models without retraining, adapting quantization parameters based on latitude to account for spatial sampling density variations.

Details

Motivation: 360-degree equirectangular videos have spatially varying sampling density due to projection distortion, but existing neural video compression models don't account for this. Traditional video codecs use quantization parameter adaptation, but this hasn't been effectively extended to neural video compression.

Method: Extends quantization parameter adaptation from traditional codecs to NVC using latitude-based adaptive quality parameters via rate-distortion optimization. Uses vector bank interpolation for latent modulation to enable flexible adaptation with arbitrary quality parameters and mitigate rounding errors.

Result: Applied to DCVC-RT framework, achieves 5.2% BD-Rate savings in weighted spherical PSNR for JVET class S1 test sequences with only 0.3% increase in processing time.

Conclusion: The proposed method successfully adapts pretrained NVC models for 360-degree video compression without architectural changes or retraining, achieving significant bitrate savings with minimal computational overhead.

Abstract: This study proposes a practical approach for compressing 360-degree equirectangular videos using pretrained neural video compression (NVC) models. Without requiring additional training or changes in the model architectures, the proposed method extends quantization parameter adaptation techniques from traditional video codecs to NVC, utilizing the spatially varying sampling density in equirectangular projections. We introduce latitude-based adaptive quality parameters through rate-distortion optimization for NVC. The proposed method utilizes vector bank interpolation for latent modulation, enabling flexible adaptation with arbitrary quality parameters and mitigating the limitations caused by rounding errors in the adaptive quantization parameters. Experimental results demonstrate that applying this method to the DCVC-RT framework yields BD-Rate savings of 5.2% in terms of the weighted spherical peak signal-to-noise ratio for JVET class S1 test sequences, with only a 0.3% increase in processing time.

[347] Branch Learning in MRI: More Data, More Models, More Training

Yuyang Li, Yipin Deng, Zijian Zhou, Peng Hu

Main category: eess.IV

TL;DR: The paper investigates two strategies for multicontrast cardiac MR reconstruction: physics-consistent data augmentation (DualSpaceCMR) and parameter-efficient capacity scaling via VQPrompt and Moero, evaluated on the CMRxRecon25 benchmark for generalization performance.

Details

Motivation: To develop effective strategies for multicontrast cardiac MR reconstruction that can generalize well in few-shot and out-of-distribution scenarios, addressing the challenges of limited data and domain shifts in medical imaging.

Method: Two complementary approaches: 1) DualSpaceCMR - physics-consistent data-space augmentation coupling image-level transforms with k-space noise and motion simulations while preserving forward model consistency; 2) Parameter-efficient capacity scaling using VQPrompt (lightweight bottleneck prompt) and Moero (sparse mixture of experts with histogram-based routing embedded in deep unrolled networks).

Result: On small datasets, k-space motion-plus-noise augmentation improves reconstruction; on large benchmarks it degrades performance, showing sensitivity to augmentation ratio and schedule. VQPrompt provides modest consistent gains with negligible memory overhead. Moero continues improving after early plateaus and maintains baseline-like generalization despite mild overfitting, but sparse routing reduces PyTorch throughput.

Conclusion: Scale-aware augmentation is crucial, prompt-based capacity scaling offers a practical path, and efficiency improvements are essential for sparse expert models to overcome computational bottlenecks in wall clock time.

Abstract: We investigated two complementary strategies for multicontrast cardiac MR reconstruction: physics-consistent data-space augmentation (DualSpaceCMR) and parameter-efficient capacity scaling via VQPrompt and Moero. DualSpaceCMR couples image-level transforms with kspace noise and motion simulations while preserving forwardmodel consistency. VQPrompt adds a lightweight bottleneck prompt; Moero embeds a sparse mixture of experts within a deep unrolled network with histogram-based routing. In the multivendor, multisite CMRxRecon25 benchmark, we evaluate fewshot and out-of-distribution generalization. On small datasets, k-space motion-plus-noise improves reconstruction; on the large benchmark it degrades performance, revealing sensitivity to augmentation ratio and schedule. VQPrompt produces modest and consistent gains with negligible memory overhead. Moero continues to improve after early plateaus and maintains baseline-like fewshot and out-of-distribution behavior despite mild overfitting, but sparse routing lowers PyTorch throughput and makes wall clock time the main bottleneck. These results motivate scale-aware augmentation and suggest prompt-based capacity scaling as a practical path, while efficiency improvements are crucial for sparse expert models.

[348] CLIP Based Region-Aware Feature Fusion for Automated BBPS Scoring in Colonoscopy Images

Yujia Fu, Zhiyu Dong, Tianwen Qian, Chenye Zheng, Danian Ji, Linhai Zhuo

Main category: eess.IV

TL;DR: Proposes an automated Boston Bowel Preparation Scale scoring system using CLIP with adapter-based transfer learning and fecal-feature extraction, achieving superior performance on both proprietary and public datasets.

Details

Motivation: Manual BBPS scoring suffers from subjectivity and inter-observer variability, creating a need for automated, objective assessment of bowel cleanliness to improve colonoscopy effectiveness.

Method: Uses CLIP model with adapter-based transfer learning and a dedicated fecal-feature extraction branch, fusing global visual features with stool-related textual priors without requiring explicit segmentation.

Result: Extensive experiments on both proprietary dataset (2,240 images from 517 subjects) and public NERTHU dataset demonstrate superiority over existing baselines.

Conclusion: The proposed framework shows strong potential for clinical deployment in computer-aided colonoscopy analysis by providing automated, objective bowel cleanliness assessment.

Abstract: Accurate assessment of bowel cleanliness is essential for effective colonoscopy procedures. The Boston Bowel Preparation Scale (BBPS) offers a standardized scoring system but suffers from subjectivity and inter-observer variability when performed manually. In this paper, to support robust training and evaluation, we construct a high-quality colonoscopy dataset comprising 2,240 images from 517 subjects, annotated with expert-agreed BBPS scores. We propose a novel automated BBPS scoring framework that leverages the CLIP model with adapter-based transfer learning and a dedicated fecal-feature extraction branch. Our method fuses global visual features with stool-related textual priors to improve the accuracy of bowel cleanliness evaluation without requiring explicit segmentation. Extensive experiments on both our dataset and the public NERTHU dataset demonstrate the superiority of our approach over existing baselines, highlighting its potential for clinical deployment in computer-aided colonoscopy analysis.

[349] Dual-Encoder Transformer-Based Multimodal Learning for Ischemic Stroke Lesion Segmentation Using Diffusion MRI

Muhammad Usman, Azka Rehman, Muhammad Mutti Ur Rehman, Abd Ur Rehman, Muhammad Umar Farooq

Main category: eess.IV

TL;DR: Dual-encoder TransUNet achieves state-of-the-art 85.4% Dice score for ischemic stroke lesion segmentation from multimodal diffusion MRI (DWI+ADC), outperforming convolutional and other transformer models.

Details

Motivation: Accurate ischemic stroke lesion segmentation from diffusion MRI (DWI and ADC) is crucial for clinical decision-making and outcome assessment, but automated delineation remains challenging due to lesion appearance variability.

Method: Benchmarked various architectures (U-Net variants, Swin-UNet, TransUNet) on ISLES 2022 dataset, then proposed dual-encoder TransUNet to learn modality-specific representations from DWI and ADC. Incorporated spatial context using three-slice input configuration for adjacent slice information.

Result: Transformer-based models outperformed convolutional baselines. The proposed dual-encoder TransUNet achieved the best performance with 85.4% Dice score on the test set.

Conclusion: The dual-encoder TransUNet framework provides a robust solution for automated ischemic stroke lesion segmentation from diffusion MRI, leveraging complementary DWI and ADC information with spatial context.

Abstract: Accurate segmentation of ischemic stroke lesions from diffusion magnetic resonance imaging (MRI) is essential for clinical decision-making and outcome assessment. Diffusion-Weighted Imaging (DWI) and Apparent Diffusion Coefficient (ADC) scans provide complementary information on acute and sub-acute ischemic changes; however, automated lesion delineation remains challenging due to variability in lesion appearance. In this work, we study ischemic stroke lesion segmentation using multimodal diffusion MRI from the ISLES 2022 dataset. Several state-of-the-art convolutional and transformer-based architectures, including U-Net variants, Swin-UNet, and TransUNet, are benchmarked. Based on performance, a dual-encoder TransUNet architecture is proposed to learn modality-specific representations from DWI and ADC inputs. To incorporate spatial context, adjacent slice information is integrated using a three-slice input configuration. All models are trained under a unified framework and evaluated using the Dice Similarity Coefficient (DSC). Results show that transformer-based models outperform convolutional baselines, and the proposed dual-encoder TransUNet achieves the best performance, reaching a Dice score of 85.4% on the test set. The proposed framework offers a robust solution for automated ischemic stroke lesion segmentation from diffusion MRI.

[350] SLIM: Semantic-based Low-bitrate Image compression for Machines by leveraging diffusion

Hyeonjin Lee, Jun-Hyuk Kim, Jong-Seok Lee

Main category: eess.IV

TL;DR: SLIM is a semantic-based low-bitrate image compression framework for machine vision using diffusion models, focusing on Region-of-Interest areas without requiring guide masks at inference.

Details

Motivation: Current image compression models focus on human vision with excessive perceptual details, limiting optimal bitrate reduction for machine vision tasks. There's a need for compression specifically optimized for machine vision applications.

Method: Uses pretrained latent diffusion model with compressor focusing on Region-of-Interest areas in image latent. Pretrained Unet enhances decompressed latent using RoI-focused text captions containing semantic information, enabling focus on RoI areas without guide masks at inference.

Result: SLIM achieves higher classification accuracy at the same bits per pixel compared to conventional image compression models for machines, while maintaining perceptual details for human vision.

Conclusion: SLIM provides an effective training framework for machine vision image compression that achieves low bitrates by focusing on semantic RoI areas and leveraging diffusion models for enhancement.

Abstract: In recent years, the demand of image compression models for machine vision has increased dramatically. However, the training frameworks of image compression still focus on the vision of human, maintaining the excessive perceptual details, thus have limitations in optimally reducing the bits per pixel in the case of performing machine vision tasks. In this paper, we propose Semantic-based Low-bitrate Image compression for Machines by leveraging diffusion, termed SLIM. This is a new effective training framework of image compression for machine vision, using a pretrained latent diffusion model.The compressor model of our method focuses only on the Region-of-Interest (RoI) areas for machine vision in the image latent, to compress it compactly. Then the pretrained Unet model enhances the decompressed latent, utilizing a RoI-focused text caption which containing semantic information of the image. Therefore, SLIM is able to focus on RoI areas of the image without any guide mask at the inference stage, achieving low bitrate when compressing. And SLIM is also able to enhance a decompressed latent by denoising steps, so the final reconstructed image from the enhanced latent can be optimized for the machine vision task while still containing perceptual details for human vision. Experimental results show that SLIM achieves a higher classification accuracy in the same bits per pixel condition, compared to conventional image compression models for machines.

Today’s Research Highlights

Table of Contents

cs.CL

[1] HARMON-E: Hierarchical Agentic Reasoning for Multimodal Oncology Notes to Extract Structured Data

[2] How well do Large Language Models Recognize Instructional Moves? Establishing Baselines for Foundation Models in Educational Discourse

[3] Counterfactual LLM-based Framework for Measuring Rhetorical Style

[4] PRISM: A Personality-Driven Multi-Agent Framework for Social Media Simulation

[5] Bias Beneath the Tone: Empirical Characterisation of Tone Bias in LLM-Driven UX Systems

[6] SlideTailor: Personalized Presentation Slide Generation for Scientific Papers

[7] Schoenfeld’s Anatomy of Mathematical Reasoning by Language Models

[8] Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents

[9] A Novel Graph-Sequence Learning Model for Inductive Text Classification

[10] ABBEL: LLM Agents Acting through Belief Bottlenecks Expressed in Language

[11] Fun-Audio-Chat Technical Report

[12] M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

[13] SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision

[14] Multi-hop Reasoning via Early Knowledge Alignment

[15] Retrieval-augmented Prompt Learning for Pre-trained Foundation Models

[16] AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications

[17] FaithLens: Detecting and Explaining Faithfulness Hallucination

[18] Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction

[19] Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings

[20] AprielGuard

[21] Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives

[22] Can LLMs Solve My Grandma’s Riddle? Evaluating Multilingual Large Language Models on Reasoning Traditional Bangla Tricky Riddles

[23] Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen’s Kappa and Semantic Similarity for Qualitative Research Validation

[24] Sentiment-Aware Extractive and Abstractive Summarization for Unstructured Text Mining

[25] Step-DeepResearch Technical Report

[26] Distilling to Hybrid Attention Models via KL-Guided Layer Selection

[27] Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits

[28] Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs

[29] MoE-DiffuSeq: Enhancing Long-Document Diffusion Models with Sparse Attention and Mixture of Experts

[30] Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Object-Oriented Programming

[31] Don’t Pay Attention, PLANT It: Pretraining Attention via Learning-to-Rank

[32] GRAPHMOE: Amplifying Cognitive Depth of Mixture-of-Experts Network via Introducing Self-Rethinking Mechanism

[33] Select2Reason: Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning

[34] DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations

[35] Learning without training: The implicit dynamics of in-context learning

[36] Persistent Instability in LLM’s Personality Measurements: Effects of Scale, Reasoning, and Conversation History

[37] Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal Belief

[38] Thematic Dispersion in Arabic Applied Linguistics: A Bibliometric Analysis using Brookes’ Measure

[39] DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation

[40] AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards

[41] SiamGPT: Quality-First Fine-Tuning for Stable Thai Text Generation

[42] GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators

cs.CV

[43] PHANTOM: PHysical ANamorphic Threats Obstructing Connected Vehicle Mobility

[44] Generating the Past, Present and Future from a Motion-Blurred Image

[45] Learning to Refocus with Video Diffusion Models

[46] RANSAC Scoring Functions: Analysis and Reality Check

[47] Chain-of-Anomaly Thoughts with Large Vision-Language Models

[48] DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation

[49] Progressive Learned Image Compression for Machine Perception

[50] HyGE-Occ: Hybrid View-Transformation with 3D Gaussian and Edge Priors for 3D Panoptic Occupancy Prediction

[51] Multi Modal Attention Networks with Uncertainty Quantification for Automated Concrete Bridge Deck Delamination Detection

[52] Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs

[53] TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation

[54] Generative Latent Coding for Ultra-Low Bitrate Image Compression

[55] Unified Brain Surface and Volume Registration

[56] Vehicle-centric Perception via Multimodal Structured Pre-training

[57] Degradation-Aware Metric Prompting for Hyperspectral Image Restoration

[58] Block-Recurrent Dynamics in Vision Transformers

[59] SE360: Semantic Edit in 360$^\circ$ Panoramas via Hierarchical Data Construction

[60] How Much 3D Do Video Foundation Models Encode?

[61] BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes

[62] HistoWAS: A Pathomics Framework for Large-Scale Feature-Wide Association Studies of Tissue Topology and Patient Outcomes

[63] From Binary to Semantic: Utilizing Large-Scale Binary Occupancy Data for 3D Semantic Occupancy Prediction

[64] WSD-MIL: Window Scale Decay Multiple Instance Learning for Whole Slide Image Classification

[65] WaveletGaussian: Wavelet-domain Diffusion for Sparse-view 3D Gaussian Object Reconstruction

[66] A Novel CNN Gradient Boosting Ensemble for Guava Disease Detection

[67] Neural Implicit Heart Coordinates: 3D cardiac shape reconstruction from sparse segmentations

[68] A Dual-Branch Local-Global Framework for Cross-Resolution Land Cover Mapping

[69] Few-Shot-Based Modular Image-to-Video Adapter for Diffusion Models

[70] PaveSync: A Unified and Comprehensive Dataset for Pavement Distress Analysis and Classification

[71] SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images

[72] A Contextual Analysis of Driver-Facing and Dual-View Video Inputs for Distraction Detection in Naturalistic Driving Environments

[73] MAPI-GNN: Multi-Activation Plane Interaction Graph Neural Network for Multimodal Medical Diagnosis

[74] $\text{H}^2$em: Learning Hierarchical Hyperbolic Embeddings for Compositional Zero-Shot Learning

[75] VALLR-Pin: Dual-Decoding Visual Speech Recognition for Mandarin with Pinyin-Guided LLM Refinement

[76] FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs