Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 107]
cs.CV [Total: 189]
cs.AI [Total: 62]
cs.SD [Total: 14]
cs.LG [Total: 150]
cs.MA [Total: 4]
cs.MM [Total: 2]
eess.AS [Total: 4]
eess.IV [Total: 11]

cs.CL

[1] Unsupervised Cycle Detection in Agentic Applications

Felix George, Harshit Kumar, Divya Pathak, Kaustabha Ray, Mudit Verma, Pratibha Moogi

Main category: cs.CL

TL;DR: Unsupervised framework detects hidden execution cycles in LLM-powered applications by combining structural and semantic analysis, achieving F1 score of 0.72.

Details

Motivation: Agentic applications with LLMs exhibit non-deterministic behaviors that create hidden execution cycles, silently consuming resources without triggering errors, which traditional observability platforms fail to detect.

Method: Hybrid approach combining structural analysis (temporal call stack analysis for explicit loops) and semantic analysis (similarity analysis for redundant content cycles).

Result: Evaluated on 1575 trajectories from LangGraph stock market app: F1 score 0.72 (precision: 0.62, recall: 0.86), significantly outperforming structural-only (F1: 0.08) and semantic-only methods (F1: 0.28).

Conclusion: Results are encouraging but substantial scope for improvement remains; future work needed to refine approach and address current limitations.

Abstract: Agentic applications powered by Large Language Models exhibit non-deterministic behaviors that can form hidden execution cycles, silently consuming resources without triggering explicit errors. Traditional observability platforms fail to detect these costly inefficiencies. We present an unsupervised cycle detection framework that combines structural and semantic analysis. Our approach first applies computationally efficient temporal call stack analysis to identify explicit loops and then leverages semantic similarity analysis to uncover subtle cycles characterized by redundant content generation. Evaluated on 1575 trajectories from a LangGraph-based stock market application, our hybrid approach achieves an F1 score of 0.72 (precision: 0.62, recall: 0.86), significantly outperforming individual structural (F1: 0.08) and semantic methods (F1: 0.28). While these results are encouraging, there remains substantial scope for improvement, and future work is needed to refine the approach and address its current limitations.
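
The two-pass idea in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the Jaccard token overlap (a stand-in for a real semantic similarity model), the thresholds, and the simple OR combination are all hypothetical. The small f1 helper only checks that the reported precision and recall are consistent with the reported F1.

```python
def has_structural_cycle(calls, min_repeats=3):
    """Flag an explicit loop: a contiguous call pattern repeated back to back."""
    n = len(calls)
    for size in range(1, n // min_repeats + 1):
        for start in range(0, n - size * min_repeats + 1):
            pattern = calls[start:start + size]
            if all(calls[start + k * size:start + (k + 1) * size] == pattern
                   for k in range(1, min_repeats)):
                return True
    return False


def jaccard(a, b):
    """Token-set overlap; a crude stand-in for embedding similarity (assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def has_semantic_cycle(outputs, threshold=0.8, min_hits=2):
    """Flag redundant generation: several output pairs that are near-duplicates."""
    hits = sum(1
               for i in range(len(outputs))
               for j in range(i + 1, len(outputs))
               if jaccard(outputs[i], outputs[j]) >= threshold)
    return hits >= min_hits


def detect_cycle(calls, outputs):
    # Cheap structural pass first; the semantic pass runs only if it finds nothing.
    return has_structural_cycle(calls) or has_semantic_cycle(outputs)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

With the reported precision of 0.62 and recall of 0.86, f1 gives roughly 0.7205, which rounds to the stated 0.72.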

[2] AV-Dialog: Spoken Dialogue Models with Audio-Visual Input

Tuochao Chen, Bandhav Veluri, Hongyu Gong, Shyamnath Gollakota

Main category: cs.CL

TL;DR: AV-Dialog is the first multimodal dialogue framework that uses both audio and visual cues to improve speaker tracking, turn-taking prediction, and response generation in noisy multi-speaker environments.

Details

Motivation: Current dialogue models struggle in noisy, multi-speaker environments, producing irrelevant responses and awkward turn-taking due to lack of multimodal context.

Method: Combines acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets to achieve robust streaming transcription and turn-boundary detection.

Result: Outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality.

Conclusion: Visual cues combined with audio enable more natural conversational flow and robust performance in real-world noisy environments, paving the way for better spoken dialogue agents.

Abstract: Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialog framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection and accurate responses, resulting in a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.

[3] Data Analysis and Performance Evaluation of Simulation Deduction Based on LLMs

Shansi Zhang, Min Li

Main category: cs.CL

TL;DR: Proposes a method using LLMs to automate military simulation analysis by decomposing tasks, using multi-round interactions with self-check, custom tools for figures/metrics, and adaptable report templates.

Details

Motivation: Traditional manual analysis of military simulations is time-consuming and error-prone; LLMs can enhance efficiency but need structured approaches to generate high-quality reports.

Method: Decompose complex tasks into sub-tasks with tailored prompts, use multi-round LLM interactions with self-check/reflection, employ custom tools for figures/metrics, and design adaptable report templates.

Result: Generated reports show higher quality and obtain higher scores than baseline methods in extensive evaluations.

Conclusion: The proposed structured LLM approach effectively automates military simulation analysis, producing high-quality reports adaptable to various scenarios.

Abstract: Data analysis and performance evaluation of simulation deduction play a pivotal role in modern warfare, enabling military personnel to gain invaluable insights into the potential effectiveness of different strategies, tactics, and operational plans. The traditional manual analysis approach is time-consuming and limited by human errors. To enhance efficiency and accuracy, large language models (LLMs) with strong analytical and inferencing capabilities can be employed. However, high-quality analysis reports with well-structured formatting cannot be obtained through a single instruction input to the LLM. To tackle this issue, we propose a method that first decomposes the complex task into several sub-tasks and designs effective system prompts and user prompts for each sub-task. Multi-round interactions with the LLM incorporating self-check and reflection are then conducted to enable structured data extraction as well as multi-step analysis and evaluation. Furthermore, custom tools are defined and invoked to generate figures and compute metrics. We also design multiple report templates, each tailored to a specific application and input data type, ensuring their adaptability across a variety of scenarios. Extensive evaluation results demonstrate that the reports generated by our method exhibit higher quality and therefore obtain higher scores than the baseline method.

[4] Cognitively-Inspired Episodic Memory Architectures for Accurate and Efficient Character AI

Rafael Arias Gonzalez, Steve DiPaola

Main category: cs.CL

TL;DR: A system that creates historical character dialogues using enriched episodic memory with efficient parallel retrieval, achieving depth without high latency.

Details

Motivation: To overcome the trade-off between shallow responses from simple RAG and high latency from multi-stage reflection in historical character dialogue systems.

Method: Transform biographical data into enriched first-person memories with metadata, then use two-stage parallel retrieval for efficient prompt generation.

Result: Achieves 0.52s prompt generation, parity with traditional RAG on GPT-4, and superior performance on smaller models (GPT-3.5, GPT-3).

Conclusion: Provides a practical framework for educational, museum, and research applications requiring both accuracy and efficiency in historical character embodiment.

Abstract: Large language models show promise for embodying historical characters in dialogue systems, but existing approaches face a critical trade-off: simple retrieval-augmented generation produces shallow responses, while multi-stage reflection achieves depth at prohibitive latency. We present an architecture that resolves this tension through offline data augmentation and efficient parallel retrieval from structured episodic memory. Our system transforms biographical data into 1,774 enriched first-person memories with affective-semantic metadata, then employs two-stage retrieval achieving 0.52s prompt generation. Evaluation using LLM-as-judge and RAGAs metrics shows our approach achieves parity with traditional RAG on GPT-4 while significantly outperforming it on smaller models (GPT-3.5, GPT-3), suggesting particular value for resource-constrained deployments. Beyond dialogue, the structured memory enables novel visualization tools: spatiotemporal heatmaps, emotional trajectory analysis, and interactive path tracking, positioning the system as both a dialogue interface and research tool for biographical analysis. We use Van Gogh as a test case, but the architecture is generalizable to any historical figure with substantial textual records, offering a practical framework for educational, museum, and research applications requiring both accuracy and efficiency.

[5] Proactive Hearing Assistants that Isolate Egocentric Conversations

Guilin Hu, Malek Itani, Tuochao Chen, Shyamnath Gollakota

Main category: cs.CL

TL;DR: Proactive hearing assistants that automatically identify and separate conversation partners using egocentric binaural audio, leveraging self-speech as anchor and turn-taking behavior for real-time operation.

Details

Motivation: To create hearing assistants that adapt proactively to conversational dynamics without requiring explicit user prompts, enabling automatic identification and separation of conversation partners in multi-conversation settings.

Method: Dual-model architecture: lightweight streaming model runs every 12.5ms for low-latency extraction, while slower model captures longer-range conversational dynamics. Uses egocentric binaural audio with self-speech as anchor and turn-taking behavior.

Result: System generalizes well on real-world 2- and 3-speaker conversation test sets (6.8 hours from 11 participants), successfully identifying and isolating conversational partners in multi-conversation settings.

Conclusion: This work represents a step toward hearing assistants that can proactively adapt to conversational dynamics and engagement, operating in real-time on-device without user intervention.

Abstract: We introduce proactive hearing assistants that automatically identify and separate the wearer’s conversation partners, without requiring explicit prompts. Our system operates on egocentric binaural audio and uses the wearer’s self-speech as an anchor, leveraging turn-taking behavior and dialogue dynamics to infer conversational partners and suppress others. To enable real-time, on-device operation, we propose a dual-model architecture: a lightweight streaming model runs every 12.5 ms for low-latency extraction of the conversation partners, while a slower model runs less frequently to capture longer-range conversational dynamics. Results on real-world 2- and 3-speaker conversation test sets, collected with binaural egocentric hardware from 11 participants totaling 6.8 hours, show generalization in identifying and isolating conversational partners in multi-conversation settings. Our work marks a step toward hearing assistants that adapt proactively to conversational dynamics and engagement. More information can be found on our website: https://proactivehearing.cs.washington.edu/
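
The dual-rate scheduling described here might look something like the loop below. It is only a sketch: the 12.5 ms period comes from the abstract, while the slow-model interval, the function names, and the way context is passed to the fast model are hypothetical, and both models are plain callables rather than real networks.

```python
CHUNK_MS = 12.5            # fast-model period, taken from the abstract
SLOW_EVERY_N_CHUNKS = 80   # hypothetical: refresh the slow model once per second


def run_dual_rate(chunks, fast_model, slow_model, slow_every=SLOW_EVERY_N_CHUNKS):
    """Run fast_model on every chunk; refresh slow_model's context every slow_every chunks."""
    context = None
    history = []
    outputs = []
    for i, chunk in enumerate(chunks):
        history.append(chunk)
        if i % slow_every == 0:
            # The slow model sees the audio so far and summarizes longer-range dynamics.
            context = slow_model(history)
        # The fast model handles the current chunk using the latest available context.
        outputs.append(fast_model(chunk, context))
    return outputs
```

At a 12.5 ms period the fast path runs 1000 / 12.5 = 80 times per second, so whatever the slow model computes has to be reusable across many fast steps.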

[6] Hybrid Quantum Transformer for Language Generation

Desheng Kong, Xiangshuo Cui, Jiaying Jin, Jing Xu, Donglin Wang

Main category: cs.CL

TL;DR: HyQuT is the first hybrid quantum-classical LLM that integrates variational quantum circuits into Transformers, showing quantum computing can replace ~10% of classical parameters while maintaining comparable performance in language generation.

Details

Motivation: To demonstrate the feasibility of applying quantum computing to large-scale natural language generation, as most existing quantum models are limited to simple tasks.

Method: Integrates variational quantum circuits (VQCs) into Transformer framework at 8M and 150M parameter scales, using minimal quantum resources (10 qubits with 80 gates).

Result: 10 qubits with 80 quantum gates can replace about 10% of classical parameters in 150M-parameter model while achieving comparable convergence stability and generation quality.

Conclusion: Provides early demonstration that quantum computing can be successfully integrated into large-scale generative language models for coherent and context-aware dialogue.

Abstract: Although quantum computing has been increasingly applied to replace classical computation, most existing quantum or hybrid models remain confined to simple tasks, with no successful application to large-scale natural language generation to date. In this work, we present the first hybrid quantum-classical large language model (LLM) for natural language generation, HyQuT, capable of performing coherent and context-aware dialogue. The proposed architecture integrates variational quantum circuits (VQCs) into the Transformer framework at both 8M and 150M parameter scales. Experimental results show that a minimal number of qubits (10 qubits with 80 quantum gates) can replace about 10% of the classical parameters in the 150M-parameter model, while achieving comparable convergence stability and generation quality. This study provides an early demonstration of the feasibility of integrating quantum computing into large-scale generative language models.

[7] Empirical Characterization of Temporal Constraint Processing in LLMs

Javier Marín

Main category: cs.CL

TL;DR: LLMs struggle with temporal constraint processing in real-time decision making, showing systematic risks including bimodal performance, extreme prompt brittleness, and action bias, with no correlation to parameter count.

Details

Motivation: To test the assumption that LLMs can reliably determine whether action windows remain open or have closed in time-critical applications, as current deployments rely on this untested capability.

Method: Characterized temporal constraint processing across eight production-scale models (2.8-8B parameters) using deadline detection tasks, and conducted fine-tuning experiments with 200 synthetic examples.

Result: Revealed systematic deployment risks: bimodal performance distribution (95% or 50% accuracy), extreme prompt brittleness (30-60 percentage point swings), systematic action bias (100% false positive rates in failing models), and no parameter count correlation.

Conclusion: Temporal constraint satisfaction cannot be reliably learned through next-token prediction on natural language alone; requires architectural mechanisms for continuous temporal state representation, explicit constraint checking, and systematic compositional reasoning over temporal relations.

Abstract: When deploying LLMs in agentic architectures requiring real-time decisions under temporal constraints, we assume they reliably determine whether action windows remain open or have closed. This assumption is untested. We characterize temporal constraint processing across eight production-scale models (2.8-8B parameters) using deadline detection tasks, revealing systematic deployment risks: bimodal performance distribution (models achieve either 95% or 50% accuracy), extreme prompt brittleness (30-60 percentage point swings from formatting changes alone), and systematic action bias (100% false positive rates in failing models). Parameter count shows no correlation with capability in this range: a 3.8B model matches 7B models while other 7B models fail completely. Fine-tuning on 200 synthetic examples improves models with partial capability by 12-37 percentage points. We demonstrate that temporal constraint satisfaction cannot be reliably learned through next-token prediction on natural language, even with targeted fine-tuning. This capability requires architectural mechanisms for: (1) continuous temporal state representation, (2) explicit constraint checking separate from linguistic pattern matching, and (3) systematic compositional reasoning over temporal relations. Current autoregressive architectures lack these mechanisms. Deploying such systems in time-critical applications without hybrid architectures incorporating symbolic reasoning modules represents an unacceptable risk.
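
As an illustration of what "explicit constraint checking separate from linguistic pattern matching" could mean in practice, here is a minimal symbolic deadline check together with the false positive rate used to describe action bias. The function names and the window representation are assumptions made for this sketch; the paper's actual task format is not given in the abstract.

```python
from datetime import datetime, timedelta


def window_open(now, opened_at, duration):
    """Explicit check: the window is open only while opened_at <= now <= opened_at + duration."""
    return opened_at <= now <= opened_at + duration


def false_positive_rate(predicted_open, actually_open):
    """Share of closed windows that were wrongly reported as open."""
    on_closed = [p for p, a in zip(predicted_open, actually_open) if not a]
    return sum(on_closed) / len(on_closed) if on_closed else 0.0


def accuracy(predicted_open, actually_open):
    """Fraction of cases where the prediction matches the ground truth."""
    pairs = list(zip(predicted_open, actually_open))
    return sum(p == a for p, a in pairs) / len(pairs)
```

On a balanced test set, a model that always answers "open" scores 50% accuracy with a 100% false positive rate, which is one way the two figures in the abstract could coincide; the abstract does not state the class balance.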

[8] HI-TransPA: Hearing Impairments Translation Personal Assistant

Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan, Peidong Wang, Mingjun Pan, Yuhao Mo, Jiajie Cheng, Chengxin Chen, Zhonglun Cao, Chonghan Liu, Shi Cheng

Main category: cs.CL

TL;DR: HI-TransPA is an audio-visual personal assistant that fuses indistinct speech with lip dynamics to help hearing-impaired individuals communicate, using multimodal preprocessing and curriculum learning to achieve state-of-the-art performance.

Details

Motivation: To address communication barriers faced by hearing-impaired individuals due to unclear speech production, by leveraging the Omni-Model paradigm in assistive technology.

Method: Uses multimodal preprocessing pipeline with facial landmark detection and lip stabilization, curriculum learning with quality scores, and a unified 3D-Resampler for efficient lip dynamics encoding within a single multimodal framework.

Result: Achieves state-of-the-art performance in both literal accuracy and semantic fidelity on the HI-Dialogue dataset.

Conclusion: Establishes foundation for applying Omni-Models to assistive communication technology with an end-to-end framework and essential processing tools for future research.

Abstract: Hearing-impaired individuals often face significant barriers in daily communication due to the inherent challenges of producing clear speech. To address this, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with lip dynamics, enabling both translation and dialogue within a single multimodal framework. To address the distinctive pronunciation patterns of hearing-impaired speech and the limited adaptability of existing models, we develop a multimodal preprocessing and curation pipeline that detects facial landmarks, stabilizes the lip region, and quantitatively evaluates sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. Architecturally, we employ a novel unified 3D-Resampler to efficiently encode the lip dynamics, which is critical for accurate interpretation. Experiments on the purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. Our work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.

[9] Spectral Neuro-Symbolic Reasoning II: Semantic Node Merging, Entailment Filtering, and Knowledge Graph Alignment

Andrew Kiruluta, Priscilla Burity

Main category: cs.CL

TL;DR: Extends Spectral NSR with three semantic enhancements: transformer-based node merging, sentence-level entailment validation, and knowledge graph alignment to improve graph quality before spectral reasoning.

Details

Motivation: To enhance graph fidelity and reduce redundancy in neuro-symbolic reasoning while preserving the core spectral reasoning pipeline, enabling more robust and interpretable reasoning.

Method: Three semantic preprocessing steps: (1) transformer-based node merging using contextual embeddings, (2) sentence-level entailment validation with NLI classifiers, (3) alignment with external knowledge graphs like ConceptNet and Wikidata.

Result: Consistent accuracy gains up to +3.8% on ProofWriter, EntailmentBank, and CLUTRR benchmarks, improved generalization to adversarial cases, and reduced inference noise.

Conclusion: The framework enables efficient, interpretable, and scalable reasoning without quadratic attention mechanisms by performing semantic refinement upstream of spectral inference, making it suitable for open-domain and real-world deployment.

Abstract: This report extends the Spectral Neuro-Symbolic Reasoning (Spectral NSR) framework by introducing three semantically grounded enhancements: (1) transformer-based node merging using contextual embeddings (e.g., Sentence-BERT, SimCSE) to reduce redundancy, (2) sentence-level entailment validation with pretrained NLI classifiers (e.g., RoBERTa, DeBERTa) to improve edge quality, and (3) alignment with external knowledge graphs (e.g., ConceptNet, Wikidata) to augment missing context. These modifications enhance graph fidelity while preserving the core spectral reasoning pipeline. Experimental results on ProofWriter, EntailmentBank, and CLUTRR benchmarks show consistent accuracy gains (up to +3.8%), improved generalization to adversarial cases, and reduced inference noise. The novelty lies in performing semantic and symbolic refinement entirely upstream of the spectral inference stage, enabling efficient, interpretable, and scalable reasoning without relying on quadratic attention mechanisms. In summary, this work extends the Spectral NSR framework with modular, semantically grounded preprocessing steps that improve graph quality without altering the core spectral reasoning engine. The result is a more robust, interpretable, and scalable reasoning system suitable for deployment in open-domain and real-world settings.
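
The node-merging step (1) can be pictured as clustering nodes whose embeddings are nearly parallel. The sketch below uses hand-written toy vectors in place of Sentence-BERT or SimCSE embeddings, a hypothetical 0.9 cosine threshold, and a union-find structure chosen for this example; none of these details are taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_nodes(embeddings, threshold=0.9):
    """Group node names whose embeddings have cosine similarity >= threshold."""
    parent = {name: name for name in embeddings}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    names = list(embeddings)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())
```

Because merging is transitive here, a chain of pairwise-similar nodes collapses into one group even if its endpoints fall below the threshold; a stricter variant would compare against a group centroid instead.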

[10] Preference Orchestrator: Prompt-Aware Multi-Objective Alignment for Large Language Models

Biao Liu, Ning Xu, Junming Yang, Xin Geng

Main category: cs.CL

TL;DR: PRO framework uses lightweight preference adapter to automatically infer prompt-specific preference weights for multi-objective LLM alignment, eliminating manual preference specification and improving training efficiency.

Details

Motivation: Existing multi-objective alignment methods require manual preference weight specification, which burdens users and leads to suboptimal training efficiency due to exploration of irrelevant preference combinations.

Method: PRO framework features a lightweight preference adapter that automatically learns appropriate preference weights for each prompt by training on normalized reward scores from multiple reward models for preferred responses.

Result: Extensive experiments across multiple tasks demonstrate the effectiveness of PRO over existing multi-objective alignment approaches, with theoretical analysis proving superior performance compared to fixed preference weights.

Conclusion: PRO provides an effective solution for multi-objective LLM alignment by automatically inferring prompt-specific preference weights, reducing user burden and improving training efficiency.

Abstract: While Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, aligning these models with varying human preferences across multiple objectives remains a significant challenge in practical deployments. Existing multi-objective alignment methods rely on manually specified preference weights, which not only burden users with difficult preference specification tasks but also lead to suboptimal training efficiency due to exploration of irrelevant preference combinations. To alleviate these issues, we propose a novel framework named PRO, i.e., PReference Orchestrator, which features a lightweight preference adapter that automatically infers prompt-specific preference weights during both training and deployment phases. Specifically, the adapter automatically learns appropriate preference weights for each prompt by training on normalized reward scores from multiple reward models for preferred responses, which inherently reflect effective preference balances across objectives. Additionally, we provide theoretical analysis proving that our prompt-aware preference mechanism achieves superior performance compared to fixed preference weights in multi-objective alignment scenarios. Extensive experiments across multiple tasks demonstrate the effectiveness of our method over existing multi-objective alignment approaches.
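
One plausible reading of "training on normalized reward scores" is that each preferred response yields a target weight vector derived from its per-objective rewards. The sketch below shows that reading only; the min-max normalization, the sum-to-one rescaling, and all function names are assumptions, and the adapter that would be trained to predict these targets from the prompt is omitted.

```python
def normalize(scores):
    """Min-max normalize one objective's reward scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [0.5 if hi == lo else (s - lo) / (hi - lo) for s in scores]


def target_weights(normalized_rewards):
    """Rescale one response's per-objective normalized rewards so they sum to 1."""
    total = sum(normalized_rewards)
    k = len(normalized_rewards)
    if total == 0:
        return [1.0 / k] * k  # no signal: fall back to uniform weights
    return [r / total for r in normalized_rewards]


def scalarize(rewards, weights):
    """Weighted sum of per-objective rewards, the usual multi-objective scalarization."""
    return sum(w * r for w, r in zip(weights, rewards))
```

For example, a preferred response scoring 0.75 on one objective and 0.25 on another (after normalization) would produce target weights of 0.75 and 0.25 for that prompt.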

[11] Patent Representation Learning via Self-supervision

You Zuo, Kim Gerdes, Eric Villemonte de La Clergerie, Benoît Sagot

Main category: cs.CL

TL;DR: A contrastive learning framework for patent embeddings that uses different patent sections as multiple views, overcoming limitations of dropout augmentation and achieving state-of-the-art performance without external annotations.

Details

Motivation: To address the failure mode of SimCSE-style dropout augmentation in patents, which produces overly uniform embeddings that lose semantic cohesion, by leveraging the inherent structural diversity within patent documents.

Method: Proposes section-based augmentation where different patent sections (abstract, claims, background) serve as complementary views for contrastive learning, introducing natural semantic and structural diversity to mitigate over-dispersion.

Result: The method matches or surpasses citation- and IPC-supervised baselines in prior-art retrieval and classification on large-scale benchmarks, while being fully self-supervised and avoiding reliance on external annotations.

Conclusion: Different patent sections specialize for different tasks (claims/summaries for retrieval, background for classification), highlighting the value of exploiting intra-document views for scalable and generalizable patent understanding.

Abstract: This paper presents a simple yet effective contrastive learning framework for learning patent embeddings by leveraging multiple views from within the same document. We first identify a patent-specific failure mode of SimCSE-style dropout augmentation: it produces overly uniform embeddings that lose semantic cohesion. To remedy this, we propose section-based augmentation, where different sections of a patent (e.g., abstract, claims, background) serve as complementary views. This design introduces natural semantic and structural diversity, mitigating over-dispersion and yielding embeddings that better preserve both global structure and local continuity. On large-scale benchmarks, our fully self-supervised method matches or surpasses citation- and IPC-supervised baselines in prior-art retrieval and classification, while avoiding reliance on brittle or incomplete annotations. Our analysis further shows that different sections specialize for different tasks (claims and summaries benefit retrieval, while background sections aid classification), highlighting the value of patents’ inherent discourse structure for representation learning. These results highlight the value of exploiting intra-document views for scalable and generalizable patent understanding.
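
Section-based augmentation amounts to using two sections of the same patent as a positive pair in a contrastive loss. The following is a minimal sketch using the standard InfoNCE objective on toy vectors; the choice of InfoNCE, the temperature of 0.05, and the use of other patents in the batch as negatives are assumptions for illustration, since the abstract does not specify the loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce(anchor_views, positive_views, temperature=0.05):
    """Mean InfoNCE loss; index i in both lists refers to the same patent.

    anchor_views[i] could be the embedding of patent i's abstract and
    positive_views[i] the embedding of its claims; every other patent's
    claims in the batch act as negatives.
    """
    losses = []
    for i, anchor in enumerate(anchor_views):
        logits = [cosine(anchor, p) / temperature for p in positive_views]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each abstract is closest to its own claims and large when the pairing is scrambled, which is the signal that pulls sections of the same patent together.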

[12] Evaluating Open-Weight Large Language Models for Structured Data Extraction from Narrative Medical Reports Across Multiple Use Cases and Languages

Douwe J. Spaanderman, Karthik Prathaban, Petr Zelina, Kaouther Mouheb, Lukáš Hejtmánek, Matthew Marzetti, Antonius W. Schurink, Damian Chan, Ruben Niemantsverdriet, Frederik Hartmann, Zhen Qian, Maarten G. J. Thomeer, Petr Holub, Farhan Akram, Frank J. Wolters, Meike W. Vernooij, Cornelis Verhoef, Esther E. Bron, Vít Nováček, Dirk J. Grünhagen, Wiro J. Niessen, Martijn P. A. Starmans, Stefan Klein

Main category: cs.CL

TL;DR: Open-weight LLMs can effectively extract structured data from clinical reports across multiple diseases, languages, and institutions, with small-to-medium general-purpose models performing comparably to large models.

Details

Motivation: To evaluate the performance of open-weight LLMs in extracting structured information from free-text clinical records across multiple tasks, models, and languages, addressing limitations of prior work that focused on single tasks and English-only reports.

Method: Evaluated 15 open-weight LLMs on pathology and radiology reports across six clinical use cases at three institutes in the Netherlands, UK, and Czech Republic, comparing six prompting strategies and assessing performance using task-appropriate metrics with consensus rank aggregation and linear mixed-effects models.

Result: Top-ranked models achieved macro-average scores close to inter-rater agreement across tasks. Small-to-medium general-purpose models performed comparably to large models, while prompt graph and few-shot prompting improved performance by ~13%. Task-specific factors influenced results more than model size or prompting strategy.

Conclusion: Open-weight LLMs offer a scalable approach for clinical data curation, capable of extracting structured data from clinical reports across diverse diseases, languages, and institutions.

Abstract: Large language models (LLMs) are increasingly used to extract structured information from free-text clinical records, but prior work often focuses on single tasks, limited models, and English-language reports. We evaluated 15 open-weight LLMs on pathology and radiology reports across six use cases (colorectal liver metastases, liver tumours, neurodegenerative diseases, soft-tissue tumours, melanomas, and sarcomas) at three institutes in the Netherlands, UK, and Czech Republic. Models included general-purpose and medical-specialised LLMs of various sizes, and six prompting strategies were compared: zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph. Performance was assessed using task-appropriate metrics, with consensus rank aggregation and linear mixed-effects models quantifying variance. Top-ranked models achieved macro-average scores close to inter-rater agreement across tasks. Small-to-medium general-purpose models performed comparably to large models, while tiny and specialised models performed worse. Prompt graph and few-shot prompting improved performance by ~13%. Task-specific factors, including variable complexity and annotation variability, influenced results more than model size or prompting strategy. These findings show that open-weight LLMs can extract structured data from clinical reports across diseases, languages, and institutions, offering a scalable approach for clinical data curation.
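
Because each use case has its own metric, models are compared by rank rather than by raw score. The abstract does not say which consensus rank aggregation method was used, so the sketch below substitutes a simple mean-rank (Borda-style) aggregation; the function names and the tie-breaking by model name are assumptions of this example.

```python
def task_ranks(scores):
    """Rank models within one task, 1 = best; ties are broken by model name."""
    ordered = sorted(scores, key=lambda model: (-scores[model], model))
    return {model: rank for rank, model in enumerate(ordered, start=1)}


def mean_rank_consensus(per_task_scores):
    """Order models by their mean rank across tasks (lower mean rank first).

    per_task_scores maps task name -> {model name -> score}, where a higher
    score is better and every task scores the same set of models.
    """
    models = sorted(next(iter(per_task_scores.values())))
    totals = {model: 0.0 for model in models}
    for scores in per_task_scores.values():
        for model, rank in task_ranks(scores).items():
            totals[model] += rank
    n_tasks = len(per_task_scores)
    return sorted(models, key=lambda model: (totals[model] / n_tasks, model))
```

Ranking first and averaging afterwards keeps a task with a wide score range from dominating the comparison, at the cost of ignoring how large the gaps between models are.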

[13] Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment

Yan Gao, Yazheng Yang, Zhibin Lan, Yidong Chen, Min Zhang, Daimeng Wei, Hui Huang, Jinsong Su

Main category: cs.CL

TL;DR: Proposes enhancing LLMs with Mixture of Experts speech projector for code-switching speech translation, using multi-stage training with monolingual data and specialized loss functions to address data scarcity.

Details

Motivation: Code-switching speech translation faces challenges in semantic modeling complexity and data scarcity, with previous methods relying on implicit learning and costly manual annotations.

Method: Uses Mixture of Experts speech projector where each expert specializes in a language’s semantic subspace, with multi-stage training using monolingual ASR/ST data, language-specific loss, load balancing loss, and transition loss.

Result: Extensive experiments on widely used datasets demonstrate the effectiveness and generality of the approach.

Conclusion: The proposed method effectively addresses code-switching speech translation challenges through specialized experts and multi-stage training with available monolingual data.

Abstract: Code-switching (CS) speech translation (ST) refers to translating speech that alternates between two or more languages into a target language text, which poses significant challenges due to the complexity of semantic modeling and the scarcity of CS data. Previous studies tend to rely on the model itself to implicitly learn semantic modeling during training, and resort to inefficient and costly manual annotations for these two challenges. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture of Experts (MoE) speech projector, where each expert specializes in the semantic subspace of a specific language, enabling fine-grained modeling of speech features. Additionally, we introduce a multi-stage training paradigm that utilizes readily available monolingual automatic speech recognition (ASR) and monolingual ST data, facilitating speech-text alignment and improving translation capabilities. During training, we leverage a combination of language-specific loss and intra-group load balancing loss to guide the MoE speech projector in efficiently allocating tokens to the appropriate experts, across expert groups and within each group, respectively. To bridge the data gap across different training stages and improve adaptation to the CS scenario, we further employ a transition loss, enabling smooth transitions of data between stages, to effectively address the scarcity of high-quality CS speech translation data. Extensive experiments on widely used datasets demonstrate the effectiveness and generality of our approach.

[14] iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference

Wei Fan, JinYi Yoon, Bo Ji

Main category: cs.CL

TL;DR: iMAD is a token-efficient multi-agent debate framework that selectively triggers debates only when likely to correct wrong answers, reducing token usage by up to 92% while improving accuracy by up to 13.5%.

Details

Motivation: Traditional Multi-Agent Debate (MAD) frameworks are inefficient as they trigger debates for every query, incurring high computational costs and potentially degrading accuracy by overturning correct single-agent answers.

Method: iMAD first prompts a single agent to produce structured self-critique responses, extracts 41 interpretable linguistic and semantic features capturing hesitation cues, then uses a lightweight debate-decision classifier trained with FocusCal loss to determine when to trigger MAD.

Result: Extensive experiments on six question answering and visual question answering datasets show iMAD reduces token usage by up to 92% while improving final answer accuracy by up to 13.5% compared to five competitive baselines.

Conclusion: iMAD provides an efficient and effective framework for multi-agent debate that intelligently triggers debates only when beneficial, achieving significant computational savings while maintaining or improving accuracy.

Abstract: Large Language Model (LLM) agent systems have advanced rapidly, driven by their strong generalization in zero-shot settings. To further enhance reasoning and accuracy on complex tasks, Multi-Agent Debate (MAD) has emerged as a promising framework that engages multiple LLM agents in structured debates to encourage diverse reasoning. However, triggering MAD for every query is inefficient, as it incurs substantial computational (token) cost and may even degrade accuracy by overturning correct single-agent answers. To address these limitations, we propose intelligent Multi-Agent Debate (iMAD), a token-efficient framework that selectively triggers MAD only when it is likely to be beneficial (i.e., correcting an initially wrong answer). To achieve this goal, iMAD learns generalizable model behaviors to make accurate debate decisions. Specifically, iMAD first prompts a single agent to produce a structured self-critique response, from which we extract 41 interpretable linguistic and semantic features capturing hesitation cues. Then, iMAD uses a lightweight debate-decision classifier, trained using our proposed FocusCal loss, to determine whether to trigger MAD, enabling robust debate decisions without test dataset-specific tuning. Through extensive experiments using six (visual) question answering datasets against five competitive baselines, we have shown that iMAD significantly reduces token usage (by up to 92%) while also improving final answer accuracy (by up to 13.5%).

[15] Information Extraction From Fiscal Documents Using LLMs

Vikram Aggarwal, Jay Kulkarni, Aditi Mascarenhas, Aakriti Narang, Siddarth Raman, Ajay Shah, Susan Thomas

Main category: cs.CL

TL;DR: LLMs can effectively extract structured data from complex multi-page government fiscal documents using hierarchical validation and domain-specific processing.

Details

Motivation: Large Language Models have strong text comprehension but their ability to process complex hierarchical tabular data from government fiscal documents remains underexplored, especially for developing country contexts.

Method: Multi-stage pipeline using LLM-based techniques with domain knowledge, sequential context, and algorithmic validation that leverages hierarchical relationships in fiscal tables for multi-level validation checks.

Result: Applied to 200+ page Karnataka fiscal documents, achieved high accuracy in extracting structured data and demonstrated LLMs can process document-specific structural hierarchies.

Conclusion: LLMs offer a scalable process for converting PDF-based fiscal disclosures into research-ready databases, showing promise for broader applications across developing countries.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in text comprehension, but their ability to process complex, hierarchical tabular data remains underexplored. We present a novel approach to extracting structured data from multi-page government fiscal documents using LLM-based techniques. Applied to annual fiscal documents from the State of Karnataka in India (200+ pages), our method achieves high accuracy through a multi-stage pipeline that leverages domain knowledge, sequential context, and algorithmic validation. A large challenge with traditional OCR methods is the inability to verify the accurate extraction of numbers. When applied to fiscal data, the inherent structure of fiscal tables, with totals at each level of the hierarchy, allows for robust internal validation of the extracted data. We use these hierarchical relationships to create multi-level validation checks. We demonstrate that LLMs can read tables and also process document-specific structural hierarchies, offering a scalable process for converting PDF-based fiscal disclosures into research-ready databases. Our implementation shows promise for broader applications across developing country contexts.
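
The hierarchical validation idea described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Node` layout, the row names, the amounts, and the tolerance are all hypothetical. It only checks that every stated total matches the sum of the rows directly beneath it, which is the kind of internal consistency check the abstract describes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One row of a fiscal table: a stated amount plus optional child rows."""
    name: str
    amount: float
    children: list[Node] = field(default_factory=list)


def validate(node: Node, tol: float = 0.5) -> list[str]:
    """Return names of rows whose stated total differs from the sum of their children."""
    failures: list[str] = []
    if node.children:
        child_sum = sum(child.amount for child in node.children)
        if abs(child_sum - node.amount) > tol:
            failures.append(node.name)
        # Recurse so that every level of the hierarchy is checked.
        for child in node.children:
            failures.extend(validate(child, tol))
    return failures


# Hypothetical extracted table in which the "Health" total was misread.
budget = Node("Total", 300.0, [
    Node("Education", 120.0, [Node("Primary", 70.0), Node("Secondary", 50.0)]),
    Node("Health", 180.0, [Node("Hospitals", 100.0), Node("Clinics", 30.0)]),
])
print(validate(budget))
```

A failed check localizes the error to one subtree, so only that part of the document would need to be re-extracted or reviewed by hand.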

Eyal Rabin, Zohar Elyoseph, Rotem Israel-Fishelson, Adi Dali, Ravit Nussinson

Main category: cs.CL

TL;DR: AI voice systems implicitly learn to speak slower when conveying politeness, replicating human social communication patterns without explicit programming.

Details

Motivation: To investigate whether AI text-to-speech systems have internalized the human tendency to reduce speech rate as a non-obvious prosodic marker of politeness.

Method: Prompted 22 synthetic voices from AI Studio and OpenAI to read a fixed script under “polite and formal” vs “casual and informal” conditions, then measured speech duration.

Result: Polite prompts consistently produced slower speech than casual prompts with very large effect sizes, statistically significant for all AI Studio voices and most OpenAI voices.

Conclusion: AI can implicitly learn and replicate psychological nuances of human communication, demonstrating its emerging role as a social actor capable of reinforcing human social norms.

Abstract: Voice-based artificial intelligence is increasingly expected to adhere to human social conventions, but can it learn implicit cues that are not explicitly programmed? This study investigates whether state-of-the-art text-to-speech systems have internalized the human tendency to reduce speech rate to convey politeness - a non-obvious prosodic marker. We prompted 22 synthetic voices from two leading AI platforms (AI Studio and OpenAI) to read a fixed script under both “polite and formal” and “casual and informal” conditions and measured the resulting speech duration. Across both AI platforms, the polite prompt produced slower speech than the casual prompt with very large effect sizes, an effect that was statistically significant for all of AI Studio’s voices and for a large majority of OpenAI’s voices. These results demonstrate that AI can implicitly learn and replicate psychological nuances of human communication, highlighting its emerging role as a social actor capable of reinforcing human social norms.

[17] Test-Time Steering for Lossless Text Compression via Weighted Product of Experts

Qihang Zhang, Muchen Li, Ziao Wang, Renjie Liao, Lele Wang

Main category: cs.CL

TL;DR: A novel framework using Weighted Product of Experts (wPoE) to combine universal compressors with neural language models for improved text compression without fine-tuning.

Details

Motivation: To address the limitation of neural compressors struggling with unseen data while maintaining the advantages of universal compressors (low overhead, speed, broad applicability) and neural compressors (better compression rates).

Method: Test-Time Steering via Weighted Product of Experts (wPoE) that adaptively combines a universal compression model with a pretrained neural language model at inference time.

Result: The approach improves text compression performance without requiring fine-tuning and ensures compression rate is at least as good as the best individual model.

Conclusion: The framework provides a practical solution for enhancing text compression across diverse data distributions and seamlessly integrates with any autoregressive language model.

Abstract: Lossless compression techniques are crucial in an era of rapidly growing data. Traditional universal compressors like gzip offer low computational overhead, high speed, and broad applicability across data distributions. However, they often lead to worse compression rates than modern neural compressors, which leverage large-scale training data to model data distributions more effectively. Despite their advantages, neural compressors struggle to generalize to unseen data. To address this limitation, we propose a novel framework that performs Test-Time Steering via a Weighted Product of Experts (wPoE). At inference, our method adaptively combines a universal compression model with a pretrained neural language model, ensuring the compression rate is at least as good as that of the best individual model. Extensive experiments demonstrate that our approach improves the performance of text compression without requiring fine-tuning. Furthermore, it seamlessly integrates with any autoregressive language model, providing a practical solution for enhancing text compression across diverse data distributions.
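
A weighted product of experts can be sketched in a few lines. The version below is an illustration under stated assumptions, not the paper's method: it combines two next-symbol distributions as q(x) ∝ p_a(x)^alpha * p_b(x)^(1-alpha), uses a single scalar weight `alpha`, and picks that weight by a small grid search on toy data. The distributions and the grid are hypothetical.

```python
import math


def wpoe(p_a, p_b, alpha):
    """Weighted product of two distributions: q(x) ∝ p_a(x)^alpha * p_b(x)^(1 - alpha)."""
    unnorm = [(a ** alpha) * (b ** (1.0 - alpha)) for a, b in zip(p_a, p_b)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def code_length(preds_a, preds_b, symbols, alpha):
    """Ideal code length in bits of a symbol sequence under the combined model."""
    bits = 0.0
    for p_a, p_b, s in zip(preds_a, preds_b, symbols):
        bits -= math.log2(wpoe(p_a, p_b, alpha)[s])
    return bits


# Toy per-position predictions over a 3-symbol alphabet (hypothetical values).
preds_a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
preds_b = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6], [0.1, 0.1, 0.8]]
symbols = [0, 1, 2]

# alpha = 1 recovers expert A and alpha = 0 recovers expert B, so the best
# weight on a grid containing both endpoints is never worse than either expert.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
best_alpha = min(grid, key=lambda a: code_length(preds_a, preds_b, symbols, a))
best_bits = code_length(preds_a, preds_b, symbols, best_alpha)
print(best_alpha, best_bits)
```

Choosing the weight in hindsight, as done here, is for illustration only. A real codec has to select it in a way the decoder can reproduce, since both sides must build the same distribution to stay lossless.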

[18] Bayesian Evaluation of Large Language Model Behavior

Rachel Longjohn, Shang Wu, Saatvik Kher, Catarina Belém, Padhraic Smyth

Main category: cs.CL

TL;DR: A Bayesian approach for quantifying statistical uncertainty in binary evaluation metrics for LLM text generation systems, addressing uncertainty from probabilistic generation strategies.

Details

Motivation: Existing LLM evaluation approaches neglect statistical uncertainty quantification, which is crucial for reliable assessment of system behavior like harmful output tendencies or sensitivity to adversarial inputs.

Method: Developed a Bayesian approach to quantify uncertainty in binary evaluation metrics, focusing on uncertainty induced by probabilistic text generation strategies in LLM systems.

Result: Applied the approach in two case studies: evaluating refusal rates on adversarial inputs and pairwise preferences between LLMs on dialogue benchmarks, demonstrating useful uncertainty quantification.

Conclusion: The Bayesian approach provides valuable uncertainty quantification for LLM behavior evaluation, enhancing reliability of binary metric assessments in text generation system evaluation.

Abstract: It is increasingly important to evaluate how text generation systems based on large language models (LLMs) behave, such as their tendency to produce harmful output or their sensitivity to adversarial inputs. Such evaluations often rely on a curated benchmark set of input prompts provided to the LLM, where the output for each prompt may be assessed in a binary fashion (e.g., harmful/non-harmful or does not leak/leaks sensitive information), and the aggregation of binary scores is used to evaluate the LLM. However, existing approaches to evaluation often neglect statistical uncertainty quantification. With an applied statistics audience in mind, we provide background on LLM text generation and evaluation, and then describe a Bayesian approach for quantifying uncertainty in binary evaluation metrics. We focus in particular on uncertainty that is induced by the probabilistic text generation strategies typically deployed in LLM-based systems. We present two case studies applying this approach: 1) evaluating refusal rates on a benchmark of adversarial inputs designed to elicit harmful responses, and 2) evaluating pairwise preferences of one LLM over another on a benchmark of open-ended interactive dialogue examples. We demonstrate how the Bayesian approach can provide useful uncertainty quantification about the behavior of LLM-based systems.
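
For readers unfamiliar with Bayesian treatment of binary metrics, the simplest case is a Beta-Binomial model. The sketch below is that textbook model, not the paper's: it assumes one binary outcome per prompt, a uniform Beta(1, 1) prior, and made-up counts, and it ignores the per-prompt generation randomness that the paper focuses on. The credible interval is estimated by Monte Carlo sampling using only the standard library.

```python
import random


def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Posterior Beta parameters after observing `successes` out of `trials`."""
    return prior_a + successes, prior_b + trials - successes


def credible_interval(a, b, level=0.95, draws=20000, seed=0):
    """Equal-tailed credible interval for Beta(a, b), estimated from sorted samples."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo = samples[int((1 - level) / 2 * draws)]
    hi = samples[int((1 + level) / 2 * draws) - 1]
    return lo, hi


# Hypothetical benchmark result: 42 refusals on 50 adversarial prompts.
a, b = beta_posterior(successes=42, trials=50)
posterior_mean = a / (a + b)
lo, hi = credible_interval(a, b)
print(f"posterior mean {posterior_mean:.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
```

Reporting the interval alongside the point estimate makes it clear how much a refusal rate measured on a small benchmark could move with more data.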

[19] Evaluating Modern Large Language Models on Low-Resource and Morphologically Rich Languages: A Cross-Lingual Benchmark Across Cantonese, Japanese, and Turkish

Chengxuan Xia, Qianye Wu, Hongbin Guan, Sixuan Tian, Yilun Hao, Xiaoyu Wu

Main category: cs.CL

TL;DR: Comprehensive evaluation of 7 LLMs on Cantonese, Japanese, and Turkish across 4 tasks reveals proprietary models lead but struggle with cultural nuances and morphological complexity, while smaller open-source models lag significantly.

Details

Motivation: LLMs perform well in high-resource languages like English but their effectiveness in low-resource and morphologically rich languages remains underexplored, requiring systematic evaluation.

Method: Created cross-lingual benchmark covering Cantonese, Japanese, and Turkish with 4 tasks (QA, summarization, translation, dialogue), combining human evaluations (fluency, accuracy, cultural appropriateness) with automated metrics (BLEU, ROUGE).

Result: Largest proprietary models (GPT-4o, GPT-4, Claude 3.5) lead across languages/tasks but show gaps in cultural understanding and morphological generalization. GPT-4o excels in multilingual performance, Claude 3.5 in knowledge/reasoning. All struggle with language-specific challenges. Smaller open-source models lag substantially.

Conclusion: Significant disparities exist in LLM performance across languages, highlighting need for more culturally aware and linguistically generalizable models. Benchmark released to foster reproducibility and further research.

Abstract: Large language models (LLMs) have achieved impressive results in high-resource languages like English, yet their effectiveness in low-resource and morphologically rich languages remains underexplored. In this paper, we present a comprehensive evaluation of seven cutting-edge LLMs, including GPT-4o, GPT-4, Claude 3.5 Sonnet, LLaMA 3.1, Mistral Large 2, LLaMA-2 Chat 13B, and Mistral 7B Instruct, on a new cross-lingual benchmark covering Cantonese, Japanese, and Turkish. Our benchmark spans four diverse tasks: open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue. We combine human evaluations (rating fluency, factual accuracy, and cultural appropriateness) with automated metrics (e.g., BLEU, ROUGE) to assess model performance. Our results reveal that while the largest proprietary models (GPT-4o, GPT-4, Claude 3.5) generally lead across languages and tasks, significant gaps persist in culturally nuanced understanding and morphological generalization. Notably, GPT-4o demonstrates robust multilingual performance even on cross-lingual tasks, and Claude 3.5 Sonnet achieves competitive accuracy on knowledge and reasoning benchmarks. However, all models struggle to some extent with the unique linguistic challenges of each language, such as Turkish agglutinative morphology and Cantonese colloquialisms. Smaller open-source models (LLaMA-2 13B, Mistral 7B) lag substantially in fluency and accuracy, highlighting the resource disparity. We provide detailed quantitative results, qualitative error analysis, and discuss implications for developing more culturally aware and linguistically generalizable LLMs. Our benchmark and evaluation data are released to foster reproducibility and further research.

[20] Guarding the Meaning: Self-Supervised Training for Semantic Robustness in Guard Models

Cristina Pinneri, Christos Louizos

Main category: cs.CL

TL;DR: A self-supervised framework improves semantic robustness of LLM guard models by enforcing prediction consistency across paraphrases using skew-aware aggregation, reducing semantic variability by ~58% and improving calibration.

Details

Motivation: Guard models for LLM safety are vulnerable to meaning-preserving paraphrases causing large safety score fluctuations, revealing lack of semantic grounding.

Method: Self-supervised framework using paraphrase sets with novel skew-aware aggregation strategy for robust target computation, enforcing prediction consistency across linguistic variations.

Result: Reduces semantic variability by ~58%, improves benchmark accuracy by ~2.5% on average, generalizes to unseen stylistic variations, and improves model calibration by up to 40%.

Conclusion: Semantic consistency should be treated as a first-class training objective; doing so provides a scalable recipe for building more reliable guard models and reveals a fundamental connection between consistency and calibration.

Abstract: Guard models are a critical component of LLM safety, but their sensitivity to superficial linguistic variations remains a key vulnerability. We show that even meaning-preserving paraphrases can cause large fluctuations in safety scores, revealing a lack of semantic grounding. To address this, we introduce a practical, self-supervised framework for improving the semantic robustness of guard models. Our method leverages paraphrase sets to enforce prediction consistency using a novel, skew-aware aggregation strategy for robust target computation. Notably, we find that standard aggregation methods like mean and median can degrade safety, underscoring the need for skew-aware alternatives. We analyze six open-source guard models and show that our approach reduces semantic variability across paraphrases by ~58%, improves benchmark accuracy by ~2.5% on average, and generalizes to unseen stylistic variations. Intriguingly, we discover a bidirectional relationship between model calibration and consistency: our robustness training improves calibration by up to 40%, revealing a fundamental connection between these properties. These results highlight the value of treating semantic consistency as a first-class training objective and provide a scalable recipe for building more reliable guard models.

[21] Evaluating LLM Understanding via Structured Tabular Decision Simulations

Sichao Li, Xinyue Xu, Xiaomeng Li

Main category: cs.CL

TL;DR: STaDS framework evaluates LLM understanding through structured decision simulations, revealing models struggle with consistent accuracy and often show mismatches between rationales and actual decision factors.

Details

Motivation: To assess whether LLMs achieve genuine understanding beyond just predictive accuracy, by evaluating their ability to make consistent, well-founded decisions across multiple domains using relevant decision factors.

Method: Introduced Structured Tabular Decision Simulations (STaDS) - a suite of expert-like decision settings that evaluate LLMs through question comprehension, knowledge-based prediction, and reliance on relevant decision factors across 15 diverse domains.

Result: Analysis of 9 frontier LLMs showed: (a) most models struggle with consistent accuracy across domains; (b) models can be accurate yet globally unfaithful with frequent mismatches between stated rationales and actual decision factors driving predictions.

Conclusion: Highlights the need for global-level understanding evaluation protocols and novel frameworks that go beyond accuracy to enhance LLMs’ genuine understanding ability.

Abstract: Large language models (LLMs) often achieve impressive predictive accuracy, yet correctness alone does not imply genuine understanding. True LLM understanding, analogous to human expertise, requires making consistent, well-founded decisions across multiple instances and diverse domains, relying on relevant and domain-grounded decision factors. We introduce Structured Tabular Decision Simulations (STaDS), a suite of expert-like decision settings that evaluate LLMs as if they were professionals undertaking structured decision “exams”. In this context, understanding is defined as the ability to identify and rely on the correct decision factors, features that determine outcomes within a domain. STaDS jointly assesses understanding through: (i) question and instruction comprehension, (ii) knowledge-based prediction, and (iii) reliance on relevant decision factors. By analyzing 9 frontier LLMs across 15 diverse decision settings, we find that (a) most models struggle to achieve consistently strong accuracy across diverse domains; (b) models can be accurate yet globally unfaithful, and there are frequent mismatches between stated rationales and factors driving predictions. Our findings highlight the need for global-level understanding evaluation protocols and advocate for novel frameworks that go beyond accuracy to enhance LLMs’ understanding ability.

[22] Forecasting Spoken Language Development in Children with Cochlear Implants Using Preimplantation MRI

Yanlin Wang, Di Yuan, Shani Dettman, Dawn Choo, Emily Shimeng Xu, Denise Thomas, Maura E Ryan, Patrick C M Wong, Nancy M Young

Main category: cs.CL

TL;DR: Deep transfer learning (DTL) with bilinear attention-based fusion outperforms traditional machine learning in predicting cochlear implant language outcomes using brain neuroanatomic features, achieving 92.39% accuracy.

Details

Motivation: Cochlear implant outcomes are highly variable in children with hearing loss, and current predictors like age at implantation or residual hearing are unreliable for individual prediction.

Method: Compared traditional ML vs DTL algorithms using brain neuroanatomic features from 278 implanted children across three centers, with bilinear attention-based fusion strategy for DTL.

Result: DTL achieved 92.39% accuracy, 91.22% sensitivity, 93.56% specificity, and AUC of 0.977, significantly outperforming traditional ML models in all metrics.

Conclusion: DTL makes a single prediction model feasible for CI programs worldwide by capturing discriminative, task-specific information through representation learning.

Abstract: Cochlear implants (CI) significantly improve spoken language in children with severe-to-profound sensorineural hearing loss (SNHL), yet outcomes remain more variable than in children with normal hearing. This variability cannot be reliably predicted for individual children using age at implantation or residual hearing. This study compares the accuracy of traditional machine learning (ML) and deep transfer learning (DTL) algorithms in predicting post-CI spoken language development of children with bilateral SNHL, using a binary classification model of high versus low language improvers. A total of 278 implanted children were enrolled from three centers. The accuracy, sensitivity, and specificity of prediction models based on brain neuroanatomic features were compared between traditional ML and DTL. DTL prediction models using a bilinear attention-based fusion strategy achieved an accuracy of 92.39% (95% CI, 90.70%-94.07%), sensitivity of 91.22% (95% CI, 89.98%-92.47%), specificity of 93.56% (95% CI, 90.91%-96.21%), and area under the curve (AUC) of 0.977 (95% CI, 0.969-0.986). DTL outperformed traditional ML models in all outcome measures. DTL benefited significantly from directly capturing discriminative, task-specific information, an advantage of representation learning that this approach has over traditional ML. The results support the feasibility of a single DTL prediction model for language prediction of children served by CI programs worldwide.

[23] Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency

Filippo Morbiato, Luca Romano, Alessandro Persona

Main category: cs.CL

TL;DR: GVF Finetuning systematically enhances MLLM visual factual consistency through three mechanisms: data augmentation with factual anchors, fact-aware instruction tuning, and factual consistency loss, significantly reducing hallucinations while maintaining general performance.

Details

Motivation: Visual hallucination in Multimodal Large Language Models critically undermines their reliability, and existing fine-tuning methods offer limited improvement in factual reasoning.

Method: Grounded Visual Factualization (GVF) Finetuning integrates explicit factual signals via Factual Anchor Data Augmentation, Fact-Aware Instruction Tuning, and a Factual Consistency Loss function.

Result: GVF Finetuning significantly outperforms standard fine-tuning on VHTest benchmark for both Open-Ended and Yes/No questions, while maintaining or slightly improving performance on general multimodal benchmarks like MME and POPE.

Conclusion: GVF effectively mitigates visual hallucinations without compromising general understanding and reasoning abilities in MLLMs.

Abstract: Visual hallucination, where Multimodal Large Language Models fabricate details inconsistent with image content, critically undermines their reliability. Existing fine-tuning methods offer limited improvement, failing to deeply intervene in factual reasoning. This paper introduces Grounded Visual Factualization (GVF) Finetuning, a novel approach to systematically enhance MLLM visual factual consistency. GVF integrates explicit factual signals via three core mechanisms: Factual Anchor Data Augmentation, enriching training data with structured factual anchors and counter-factual prompts; Fact-Aware Instruction Tuning, embedding these cues into explicit instructions; and a Factual Consistency Loss function, specifically penalizing factual inaccuracies. Evaluated on LLaVA-1.5-13B, GVF Finetuning significantly outperforms standard fine-tuning on the VHTest benchmark for both Open-Ended Question (OEQ) and Yes/No Question (YNQ) formats. Crucially, GVF maintains or even slightly improves performance on general multimodal benchmarks like MME and POPE, demonstrating effective mitigation of visual hallucinations without compromising general understanding and reasoning abilities.

[24] Large language models in materials science and the need for open-source approaches

Fengxu Yang, Weitong Chen, Jack D. Evans

Main category: cs.CL

TL;DR: LLMs are transforming materials science by extracting information from literature, modeling structure-property relationships, and coordinating experimental systems, with open-source models matching commercial performance while offering better transparency and accessibility.

Details

Motivation: To examine how large language models (LLMs) are being applied across the materials discovery pipeline and advocate for broader adoption of open-source alternatives.

Method: Review of recent LLM applications focusing on three key areas: mining scientific literature, predictive modeling, and multi-agent experimental systems, with benchmark comparisons between commercial and open-source models.

Result: Open-source LLMs can match the performance of closed-source commercial models while offering greater transparency, reproducibility, cost-effectiveness, and data privacy.

Conclusion: Advocates for broader adoption of open-source LLMs to build accessible, flexible, and community-driven AI platforms for scientific discovery as these models continue to improve.

Abstract: Large language models (LLMs) are rapidly transforming materials science. This review examines recent LLM applications across the materials discovery pipeline, focusing on three key areas: mining scientific literature, predictive modelling, and multi-agent experimental systems. We highlight how LLMs extract valuable information such as synthesis conditions from text, learn structure-property relationships, and can coordinate agentic systems integrating computational tools and laboratory automation. While progress has been largely dependent on closed-source commercial models, our benchmark results demonstrate that open-source alternatives can match performance while offering greater transparency, reproducibility, cost-effectiveness, and data privacy. As open-source models continue to improve, we advocate their broader adoption to build accessible, flexible, and community-driven AI platforms for scientific discovery.

[25] Continual Learning of Domain Knowledge from Human Feedback in Text-to-SQL

Thomas Cook, Kelly Patel, Sivapriya Vellaichamy, Saba Rahimi, Zhen Zeng, Sumitra Ganesh

Main category: cs.CL

TL;DR: A framework for continual learning in text-to-SQL that uses human feedback to refine queries and stores distilled knowledge in structured memory, improving execution accuracy over time.

Details

Motivation: LLMs struggle with database-specific schemas and tacit domain knowledge when generating SQL queries from natural language, requiring a way to capture and reuse human expertise.

Method: Developed learning agents with structured memory that receive natural language feedback to refine queries and distill knowledge for future reuse. Evaluated multiple agent architectures varying in how they capture and retrieve past experiences.

Result: Memory-augmented agents, especially the Procedural Agent, achieved significant accuracy gains and error reduction on the BIRD benchmark Dev set by leveraging human-in-the-loop feedback.

Conclusion: Transforming tacit human expertise into reusable knowledge enables more adaptive, domain-aware text-to-SQL systems that continually learn from human feedback.

Abstract: Large Language Models (LLMs) can generate SQL queries from natural language questions but struggle with database-specific schemas and tacit domain knowledge. We introduce a framework for continual learning from human feedback in text-to-SQL, where a learning agent receives natural language feedback to refine queries and distills the revealed knowledge for reuse on future tasks. This distilled knowledge is stored in a structured memory, enabling the agent to improve execution accuracy over time. We design and evaluate multiple variations of a learning agent architecture that vary in how they capture and retrieve past experiences. Experiments on the BIRD benchmark Dev set show that memory-augmented agents, particularly the Procedural Agent, achieve significant accuracy gains and error reduction by leveraging human-in-the-loop feedback. Our results highlight the importance of transforming tacit human expertise into reusable knowledge, paving the way for more adaptive, domain-aware text-to-SQL systems that continually learn from a human-in-the-loop.

[26] Learn to Select: Exploring Label Distribution Divergence for In-Context Demonstration Selection in Text Classification

Ye Jiang, Taihang Wang, Youzheng Liu, Yimin Wang, Yuhan Xia, Yunfei Long

Main category: cs.CL

TL;DR: Proposes TopK + L2D, a two-stage demonstration selection method for in-context learning that considers both semantic similarity and label distribution alignment using SLMs.

Details

Motivation: Existing demonstration selection methods focus mainly on semantic similarity but overlook label distribution alignment, which is crucial for ICL performance.

Method: Two-stage approach: first select top-K semantically similar demonstrations, then use fine-tuned BERT-like SLM to generate label distributions and calculate divergence to select demonstrations with aligned label distribution.

Result: Outperforms previous demonstration selection strategies across seven text classification benchmarks, with positive correlation between LLM performance and SLM accuracy.

Conclusion: Label distribution alignment is important for effective demonstration selection in ICL, and SLMs can effectively estimate label distributions for this purpose.

Abstract: In-context learning (ICL) for text classification, which uses a few input-label demonstrations to describe a task, has demonstrated impressive performance on large language models (LLMs). However, the selection of in-context demonstrations plays a crucial role and can significantly affect LLMs’ performance. Most existing demonstration selection methods primarily focus on semantic similarity between test inputs and demonstrations, often overlooking the importance of label distribution alignment. To address this limitation, we propose a two-stage demonstration selection method, TopK + Label Distribution Divergence (L2D), which leverages a fine-tuned BERT-like small language model (SLM) to generate label distributions and calculate their divergence for both test inputs and candidate demonstrations. This enables the selection of demonstrations that are not only semantically similar but also aligned in label distribution with the test input. Extensive experiments across seven text classification benchmarks show that our method consistently outperforms previous demonstration selection strategies. Further analysis reveals a positive correlation between the performance of LLMs and the accuracy of the underlying SLMs used for label distribution estimation.
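
The two-stage selection can be sketched as follows. This is an illustration under assumptions, not the authors' code: embeddings and label distributions are passed in as plain lists (in the paper they would come from an encoder and a fine-tuned SLM), KL divergence stands in for the unspecified divergence measure, and the function and variable names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q), smoothed to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def select(test_emb, test_dist, pool, k, n):
    """Stage 1: keep the k most similar candidates. Stage 2: keep the n closest in label distribution.

    `pool` is a list of (id, embedding, label_distribution) tuples.
    """
    top_k = sorted(pool, key=lambda c: cosine(test_emb, c[1]), reverse=True)[:k]
    ranked = sorted(top_k, key=lambda c: kl(test_dist, c[2]))
    return [c[0] for c in ranked[:n]]


# Toy candidate pool (hypothetical embeddings and SLM label distributions).
pool = [
    ("a", [1.0, 0.1], [0.2, 0.8]),
    ("b", [0.9, 0.3], [0.85, 0.15]),
    ("c", [0.8, 0.5], [0.6, 0.4]),
    ("d", [0.0, 1.0], [0.9, 0.1]),
]
chosen = select(test_emb=[1.0, 0.0], test_dist=[0.9, 0.1], pool=pool, k=3, n=2)
print(chosen)
```

In this toy pool, candidate "a" is the most semantically similar but has a mismatched label distribution, so stage 2 drops it, while "d" matches the label distribution exactly but never passes the similarity filter.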

[27] Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models

Shien Zhu, Samuel Bohl, Robin Oester, Gustavo Alonso

Main category: cs.CL

TL;DR: Proposes pre-attention expert prediction for accurate and lightweight expert prefetching in MoE LLMs, achieving ~15% accuracy improvement over state-of-the-art methods.

Details

Motivation: Existing expert prediction methods use activations from previous layers, resulting in low accuracy and leaving the first layer unoptimized. Complex approaches introduce high computation overhead.

Method: Utilizes activations before attention block in the same layer with 2 linear functions and ranking-aware loss, leveraging ranking-preserving functions in LLMs for accurate prediction.

Result: Achieves 93.03% accuracy on DeepSeek V2 Lite, 94.69% on Qwen3-30B, and 97.62% on Phi-mini-MoE, showing ~15% absolute accuracy improvement over state-of-the-art methods.

Conclusion: Pre-attention expert prediction enables accurate and lightweight expert prefetching, supporting first-layer optimization while maintaining low computation overhead.

Abstract: Mixture-of-Experts (MoE) Large Language Models (LLMs) efficiently scale up the model while keeping inference cost relatively low. As MoE models only activate part of the experts, related work has proposed expert prediction and caching methods to prefetch the experts for faster inference. However, existing approaches utilize the activations from the previous layer for prediction, incurring low accuracy and leaving the first layer unoptimized. Applying complex layers or even training standalone networks for better prediction introduces high computation overhead. In this paper, we propose pre-attention expert prediction to achieve accurate and lightweight expert prefetching. The key insight is that some functions in LLMs are ranking-preserving, indicating that matching the ranking of selected experts using simple linear functions is possible. Therefore, we utilize the activations before the attention block in the same layer with 2 linear functions and ranking-aware loss to achieve accurate prediction, which also supports prefetching in the first layer. Our lightweight, pre-attention expert routers achieve 93.03% accuracy on DeepSeek V2 Lite, 94.69% on Qwen3-30B, and 97.62% on Phi-mini-MoE, showing about 15% improvement on absolute accuracy over the state-of-the-art methods.
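
A rough sketch of the prediction step may help. It assumes the two linear functions are applied one after the other to the pre-attention activation and that the top-scoring experts are prefetched; the abstract does not say how the two functions are combined, and the ranking-aware training loss is not shown. All names, weights, and the hit-rate measure below are hypothetical.

```python
def linear(x, weight, bias):
    """Apply y = W x + b, with weight given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def top_k(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def predict_experts(pre_attn, w1, b1, w2, b2, k):
    """Score experts from the activation taken before the attention block."""
    hidden = linear(pre_attn, w1, b1)   # first linear function (assumed)
    scores = linear(hidden, w2, b2)     # second linear function: one score per expert
    return top_k(scores, k)

def prefetch_hit_rate(predicted, actual):
    """Fraction of the experts actually chosen by the router that were prefetched."""
    return len(set(predicted) & set(actual)) / len(actual)

# Toy weights for 4 experts and a 2-dimensional activation (illustrative only).
pre_attn = [1.0, -1.0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]], [0.0] * 4

predicted = predict_experts(pre_attn, w1, b1, w2, b2, k=2)
print(predicted)                             # [0, 2]
print(prefetch_hit_rate(predicted, [0, 3]))  # 0.5
```

Because the predictor only reads the activation that enters the current layer, the same routine can run for the first layer, which previous-layer predictors cannot cover.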

[28] SpiderGen: Towards Procedure Generation For Carbon Life Cycle Assessments with Generative AI

Anupama Sitaraman, Bharathan Balaji, Yuvraj Agarwal

Main category: cs.CL

TL;DR: SpiderGen is an LLM-based workflow that generates Life Cycle Assessment (LCA) process information for estimating environmental impact of consumer products, achieving 62% F1-Score and significant cost/time savings compared to traditional LCA methods.

Details

Motivation: Climate change and global warming caused by GHG emissions are major concerns, with consumer products contributing significantly. Traditional LCAs are expensive and time-consuming, creating need for automated tools to estimate environmental impact.

Method: SpiderGen integrates LCA taxonomy and methodology with LLM reasoning capabilities to generate procedural information for LCA. It was evaluated against real-world LCA documents as ground-truth and compared with baseline techniques like chain-of-thought and one-shot prompting.

Result: SpiderGen achieves 62% F1-Score across 10 sample data points, providing accurate LCA process information with minor errors. It outperforms baseline techniques and can produce LCA information for under $1 USD in under 10 minutes, compared to traditional LCA costing over $25,000 USD and taking up to 21 person-days.

Conclusion: SpiderGen demonstrates potential to significantly reduce human effort and costs for carbon impact estimation while maintaining reasonable accuracy, making LCA more accessible and scalable for environmental assessment of consumer products.

Abstract: Investigating the effects of climate change and global warming caused by GHG emissions has been a primary concern worldwide. The production, use, and disposal of consumer products contribute a large share of these emissions. Thus, it is important to build tools to estimate the environmental impact of consumer goods, an essential part of which is conducting Life Cycle Assessments (LCAs). LCAs specify and account for the appropriate processes involved with the production, use, and disposal of the products. We present SpiderGen, an LLM-based workflow which integrates the taxonomy and methodology of traditional LCA with the reasoning capabilities and world knowledge of LLMs to generate the procedural information used for LCA. We additionally evaluate the output of SpiderGen using real-world LCA documents as ground-truth. We find that SpiderGen provides accurate LCA process information that is either fully correct or has minor errors, achieving an F1-Score of 62% across 10 sample data points. We observe that the remaining missed processes and hallucinated errors occur primarily due to differences in detail between LCA documents, as well as differences in the “scope” of which auxiliary processes must also be included. We also demonstrate that SpiderGen performs better than several baseline techniques, such as chain-of-thought prompting and one-shot prompting. Finally, we highlight SpiderGen’s potential to reduce the human effort and costs for estimating carbon impact, as it is able to produce LCA process information for less than $1 USD in under 10 minutes as compared to the status quo LCA, which can cost over $25,000 USD and take up to 21 person-days.
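
To make the F1-Score concrete, the sketch below scores one generated process list against a ground-truth list, counting missed processes against recall and hallucinated ones against precision. It assumes exact string matching, which is stricter than the paper's evaluation (that one also accepts outputs with minor errors); the process names and the function name are made up for illustration.

```python
def process_f1(predicted, ground_truth):
    """Precision, recall, and F1 for a predicted set of LCA processes."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    matched = len(predicted & ground_truth)
    if matched == 0:
        return 0.0, 0.0, 0.0
    precision = matched / len(predicted)    # lowered by hallucinated processes
    recall = matched / len(ground_truth)    # lowered by missed processes
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: one hallucinated process, two missed processes.
truth = ["cotton farming", "spinning", "dyeing", "transport", "disposal"]
generated = ["cotton farming", "spinning", "transport", "packaging"]

precision, recall, f1 = process_f1(generated, truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.60 f1=0.67
```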

[29] A methodological analysis of prompt perturbations and their effect on attack success rates

Tiago Machado, Maysa Malfiza Garcia de Macedo, Rogerio Abreu de Paula, Marcelo Carpinette Grave, Aminat Adebiyi, Luan Soares de Souza, Enrico Santarelli, Claudio Pinhanez

Main category: cs.CL

TL;DR: LLMs’ vulnerability to prompt attacks varies significantly with small prompt modifications, and existing attack benchmarks alone may not reveal all vulnerabilities across different alignment methods.

Details

Motivation: To investigate how different LLM alignment methods (SFT, DPO, RLHF) affect models' responses to prompt attacks and assess the sensitivity of Attack Success Rate to prompt variations.

Method: Systematic analysis using statistical methods on open-source models with different alignment methods, testing ASR sensitivity to prompt variations designed to elicit inappropriate content.

Result: Small prompt modifications significantly change ASR, making models more or less susceptible to attacks. Statistical tests confirm ASR sensitivity varies across alignment methods.

Conclusion: Existing attack benchmarks alone are insufficient to reveal all model vulnerabilities; systematic, statistically-based analysis of alignment methods and prompt variation sensitivity is needed for comprehensive attack evaluation.

Abstract: This work aims to investigate how different Large Language Model (LLM) alignment methods affect the models’ responses to prompt attacks. We selected open-source models based on the most common alignment methods, namely, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning with Human Feedback (RLHF). We conducted a systematic analysis using statistical methods to verify how sensitive the Attack Success Rate (ASR) is when we apply variations to prompts designed to elicit inappropriate content from LLMs. Our results show that even small prompt modifications can significantly change the ASR according to the statistical tests we ran, making the models more or less susceptible to types of attack. Critically, our results demonstrate that running existing ‘attack benchmarks’ alone may not be sufficient to elicit all possible vulnerabilities of both models and alignment methods. This paper thus contributes to ongoing efforts on model attack evaluation by means of systematic and statistically-based analyses of the different alignment methods and how sensitive their ASR is to prompt variation.

[30] Modeling and Predicting Multi-Turn Answer Instability in Large Language Models

Jiahang He, Rishi Ramachandran, Neel Ramachandran, Aryan Katakam, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Aryan Shrivastava

Main category: cs.CL

TL;DR: LLMs show significant robustness vulnerabilities when repeatedly prompted with simple follow-ups like “Think again”, causing accuracy drops of up to 10% across multiple turns, with stationary accuracy being 8% lower than initial accuracy.

Details

Motivation: To evaluate LLM robustness in interactive settings, where user-model interactions are growing in both frequency and scale and consistent reasoning is crucial for real-world deployment.

Method: Using simple multi-turn follow-up prompts to test answer changes, modeling accuracy dynamics with Markov chains, and examining linear probes from hidden states to predict answer changes.

Result: Simple prompts caused significant accuracy drops: 10% for Gemini 1.5 Flash over 9 turns, 7.5% for Claude 3.5 Haiku. Markov chains effectively model accuracy dynamics, revealing stationary accuracy is ~8% lower than first-turn accuracy. Linear probes can predict future answer changes.

Conclusion: Stationary accuracy serves as a principled robustness metric for interactive settings, exposing LLM fragility under repeated questioning. Addressing this instability is essential for high-stakes deployments where consistent reasoning matters.

Abstract: As large language models (LLMs) are adopted in an increasingly wide range of applications, user-model interactions have grown in both frequency and scale. Consequently, research has focused on evaluating the robustness of LLMs, an essential quality for real-world tasks. In this paper, we employ simple multi-turn follow-up prompts to evaluate models’ answer changes, model accuracy dynamics across turns with Markov chains, and examine whether linear probes can predict these changes. Our results show significant vulnerabilities in LLM robustness: a simple “Think again” prompt led to an approximate 10% accuracy drop for Gemini 1.5 Flash over nine turns, while combining this prompt with a semantically equivalent reworded question caused a 7.5% drop for Claude 3.5 Haiku. Additionally, we find that model accuracy across turns can be effectively modeled using Markov chains, enabling the prediction of accuracy probabilities over time. This allows for estimation of the model’s stationary (long-run) accuracy, which we find to be on average approximately 8% lower than its first-turn accuracy for Gemini 1.5 Flash. Our results from a model’s hidden states also reveal evidence that linear probes can help predict future answer changes. Together, these results establish stationary accuracy as a principled robustness metric for interactive settings and expose the fragility of models under repeated questioning. Addressing this instability will be essential for deploying LLMs in high-stakes and interactive settings where consistent reasoning is as important as initial accuracy.
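
The stationary-accuracy idea can be illustrated with a two-state Markov chain whose states are "answer correct" and "answer incorrect". This is a simplified sketch, not the paper's exact model, and the transition probabilities below are invented for illustration rather than taken from its results.

```python
def accuracy_over_turns(initial_acc, p_correct_to_wrong, p_wrong_to_correct, turns):
    """Expected accuracy at each turn under a two-state Markov chain."""
    acc = [initial_acc]
    for _ in range(turns - 1):
        prev = acc[-1]
        # Stay correct, or recover from an incorrect answer.
        acc.append(prev * (1 - p_correct_to_wrong) + (1 - prev) * p_wrong_to_correct)
    return acc

def stationary_accuracy(p_correct_to_wrong, p_wrong_to_correct):
    """Long-run accuracy: the stationary probability of the 'correct' state."""
    return p_wrong_to_correct / (p_correct_to_wrong + p_wrong_to_correct)

# Hypothetical numbers: each follow-up flips 10% of correct answers to wrong
# and repairs 25% of wrong answers.
trajectory = accuracy_over_turns(0.80, 0.10, 0.25, turns=9)
print([round(a, 3) for a in trajectory])
print(round(stationary_accuracy(0.10, 0.25), 3))  # 0.714
```

In this toy setting accuracy starts at 0.80 and decays toward roughly 0.714 regardless of the starting point, which is why the stationary value works as a robustness metric that does not depend on first-turn accuracy.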

[31] Equilibrium Dynamics and Mitigation of Gender Bias in Synthetically Generated Data

Ashish Kattamuri, Arpita Vats, Harshwardhan Fartale, Rahul Raja, Akshata Kishore Moharir, Ishita Prasad

Main category: cs.CL

TL;DR: Recursive prompting with LLMs for synthetic data generation risks bias amplification, but experiments show equilibrium dynamics rather than monotonic bias increase. Low initial bias amplifies while high bias decays toward model’s inherent level. Contrastive augmentation effectively reduces downstream bias despite higher embedding scores.

Details

Motivation: To investigate gender bias dynamics in recursive text generation with LLMs and evaluate different mitigation strategies for responsible synthetic data generation.

Method: Used three evaluation frameworks (rule-based pattern matching, embedding-based semantic similarity, downstream task performance) across three generations of recursive text generation with three initial bias levels and four mitigation strategies.

Result: Found equilibrium dynamics: low initial bias amplified (+36%) while high bias decayed (-26%) toward model’s inherent level. Contrastive augmentation achieved 98.8% bias reduction for low initial bias and 91% average reduction, despite higher embedding-based bias scores.

Conclusion: Semantic similarity metrics may diverge from behavioral fairness outcomes, highlighting the need for multidimensional evaluation in synthetic data generation. Contrastive augmentation is effective for bias mitigation despite paradoxical embedding scores.

Abstract: Recursive prompting with large language models enables scalable synthetic dataset generation but introduces the risk of bias amplification. We investigate gender bias dynamics across three generations of recursive text generation using three complementary evaluation frameworks: rule-based pattern matching, embedding-based semantic similarity, and downstream task performance. Experiments with three initial bias levels (0.1, 0.3, 0.6) and four mitigation strategies reveal equilibrium dynamics rather than monotonic amplification. The low initial bias amplifies toward the model’s inherent bias level (+36%), whereas the high initial bias decays toward it (-26%). Among mitigation methods, contrastive augmentation, which introduces gender-swapped variants, achieves significant downstream bias reduction (98.8% for low initial bias and 91% on average) despite producing higher embedding-based bias scores. This paradox demonstrates that semantic similarity metrics may diverge from behavioral fairness outcomes, highlighting the need for multidimensional evaluation in responsible synthetic data generation.

[32] Saying the Unsaid: Revealing the Hidden Language of Multimodal Systems Through Telephone Games

Juntu Zhao, Jialing Zhang, Chongxuan Li, Dequan Wang

Main category: cs.CL

TL;DR: The paper proposes using a “telephone game” approach to study multimodal systems’ hidden language by analyzing their preference bias in concept co-occurrence shifts during image compression and reconstruction.

Details

Motivation: To understand the opaque hidden language in closed-source multimodal systems by leveraging their inherent preference bias that disrupts original input concept co-occurrence during processing.

Method: Multi-round “telephone game” framework that strategically leverages systems’ preference bias, observing co-occurrence frequencies to quantitatively investigate concept connection strength. Includes Telescope dataset of 10,000+ concept pairs and uses Reasoning-LLMs to uncover relationships.

Result: Enables construction of global concept connection maps, identification of training-inherited preference bias, assessment of generalization capability, discovery of stable pathways for fragile concepts, and uncovering of unexpected concept relationships beyond textual/visual similarities.

Conclusion: Provides new perspective on multimodal systems’ hidden language and foundation for future research on interpretability and controllability of multimodal systems.

Abstract: Recent closed-source multimodal systems have made great advances, but their hidden language for understanding the world remains opaque because of their black-box architectures. In this paper, we use the systems’ preference bias to study their hidden language: During the process of compressing the input images (typically containing multiple concepts) into texts and then reconstructing them into images, the systems’ inherent preference bias introduces specific shifts in the outputs, disrupting the original input concept co-occurrence. We employ the multi-round “telephone game” to strategically leverage this bias. By observing the co-occurrence frequencies of concepts in telephone games, we quantitatively investigate the concept connection strength in the understanding of multimodal systems, i.e., “hidden language.” We also contribute Telescope, a dataset of 10,000+ concept pairs, as the database of our telephone game framework. Our telephone game is test-time scalable: By iteratively running telephone games, we can construct a global map of concept connections in multimodal systems’ understanding. Here we can identify preference bias inherited from training, assess generalization capability advancement, and discover more stable pathways for fragile concept connections. Furthermore, we use Reasoning-LLMs to uncover unexpected concept relationships that transcend textual and visual similarities, inferring how multimodal systems understand and simulate the world. This study offers a new perspective on the hidden language of multimodal systems and lays the foundation for future research on the interpretability and controllability of multimodal systems.

[33] Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models

Zijian Chen, Wenjun Zhang, Guangtao Zhai

Main category: cs.CL

TL;DR: Squid Game is a dynamic adversarial evaluation environment for LLMs that addresses data contamination issues in static benchmarks by testing models in resource-constrained, asymmetric information settings through interactive gameplay.

Details

Motivation: Current benchmarks struggle to evaluate whether LLMs genuinely learn problem-solving or just memorize training data, and they fail to test model behavior under pressure and resource constraints.

Method: Created Squid Game, a dynamic adversarial environment with six elimination-style levels that test instruction-following, code, reasoning, planning, and safety alignment through interactive gameplay against other LLM opponents.

Result: Evaluated 50+ LLMs, observed generational performance transitions, found evidence of models using speculative shortcuts, and showed dynamic evaluation complements static benchmarks through correlation analysis.

Conclusion: Dynamic adversarial evaluation like Squid Game provides a complementary approach to static benchmarks, revealing model behaviors and potential evaluation contamination that traditional benchmarks miss.

Abstract: Contemporary benchmarks are struggling to keep pace with the development of large language models (LLMs). Although they are indispensable to evaluate model performance on various tasks, it is uncertain whether the models trained on Internet data have genuinely learned how to solve problems or merely seen the questions before. This potential data contamination issue presents a fundamental challenge to establishing trustworthy evaluation frameworks. Meanwhile, existing benchmarks predominantly assume benign, resource-rich settings, leaving the behavior of LLMs under pressure unexplored. In this paper, we introduce Squid Game, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric information settings elaborated to evaluate LLMs through interactive gameplay against other LLM opponents. Notably, Squid Game consists of six elimination-style levels, focusing on multi-faceted abilities, such as instruction-following, code, reasoning, planning, and safety alignment. We evaluate over 50 LLMs on Squid Game, presenting the largest behavioral evaluation study of general LLMs on dynamic adversarial scenarios. We observe a clear generational phase transition on performance in the same model lineage and find evidence that some models resort to speculative shortcuts to win the game, indicating the possibility of higher-level evaluation paradigm contamination in static benchmarks. Furthermore, we compare prominent LLM benchmarks and Squid Game with correlation analyses, highlighting that dynamic evaluation can serve as a complementary part for static evaluations. The code and data will be released in the future.

[34] Where does an LLM begin computing an instruction?

Aditya Pola, Vineeth N. Balasubramanian

Main category: cs.CL

TL;DR: The paper identifies where instruction following begins in language models by measuring when activation interventions stop affecting predictions, finding a consistent inflection point across tasks and model sizes.

Details

Motivation: To understand at which layer in neural networks reading instructions transitions to executing them, and to develop a replicable method for locating this transition point.

Method: Used activation patching on minimal-contrast prompt pairs across three simple datasets (Key-Value, Quote Attribution, Letter Selection) and their multi-hop compositions, measuring layer-wise flip rates to identify when interventions change predictions.

Result: Found an inflection point (onset): interventions that change predictions before this point become largely ineffective after it. Onset locations are similar across Llama family models and multi-hop compositions.

Conclusion: Provides a simple, replicable method to locate where instruction following begins and compare this location across different tasks and model architectures.

Abstract: Following an instruction involves distinct sub-processes, such as reading content, reading the instruction, executing it, and producing an answer. We ask where, along the layer stack, instruction following begins, the point where reading gives way to doing. We introduce three simple datasets (Key-Value, Quote Attribution, Letter Selection) and two hop compositions of these tasks. Using activation patching on minimal-contrast prompt pairs, we measure a layer-wise flip rate that indicates when substituting selected residual activations changes the predicted answer. Across models in the Llama family, we observe an inflection point, which we term onset, where interventions that change predictions before this point become largely ineffective afterward. Multi-hop compositions show a similar onset location. These results provide a simple, replicable way to locate where instruction following begins and to compare this location across tasks and model sizes.
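
The layer-wise flip rate and the onset can be sketched as below. The sketch assumes the patching experiments have already been run and recorded as one boolean per prompt pair and layer (did the patch flip the predicted answer?). The thresholding rule used to pick the onset is a hypothetical choice for illustration; the abstract does not specify how the inflection point is computed.

```python
def flip_rates(flips_by_layer):
    """Fraction of prompt pairs whose prediction flipped, for each patched layer."""
    return [sum(flips) / len(flips) for flips in flips_by_layer]

def find_onset(rates, threshold=0.5):
    """First layer from which the flip rate stays below the threshold.

    Returns None if patching remains effective through the last layer.
    """
    for layer in range(len(rates)):
        if all(rate < threshold for rate in rates[layer:]):
            return layer
    return None

# Toy record: 4 layers, 4 prompt pairs each (illustrative values only).
flips_by_layer = [
    [True, True, True, True],      # patching early layers always flips the answer
    [True, True, True, False],
    [True, False, False, False],   # patching here rarely changes the prediction
    [False, False, False, False],
]
rates = flip_rates(flips_by_layer)
print(rates)              # [1.0, 0.75, 0.25, 0.0]
print(find_onset(rates))  # 2
```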

[35] “As Eastern Powers, I will veto.” : An Investigation of Nation-level Bias of Large Language Models in International Relations

Jonghyeon Choi, Yeonjun Choi, Hyun-chul Kim, Beakcheol Jang

Main category: cs.CL

TL;DR: This paper examines nation-level biases in LLMs using UNSC data, develops a bias evaluation framework, and introduces a debiasing method that combines RAG with Reflexion to reduce bias and improve performance.

Details

Motivation: To systematically identify and address nation-level biases in LLMs within International Relations, particularly focusing on the five permanent UNSC members, as these biases can affect factual reasoning and performance in IR applications.

Method: Developed a bias evaluation framework with three tests using UNSC historical records, then created a debiasing framework combining Retrieval-Augmented Generation with Reflexion-based self-reflection techniques.

Result: LLMs show varying nation-level biases (favorable toward western nations, unfavorable toward Russia) that change across models and contexts. Models with stronger reasoning abilities exhibit reduced bias. The debiasing framework effectively reduces bias and improves performance, especially in GPT-4o-mini and Llama-3.3-70B.

Conclusion: Nation-level bias assessment is crucial when applying LLMs in International Relations, and the proposed debiasing framework successfully mitigates these biases while enhancing model performance.

Abstract: This paper systematically examines nation-level biases exhibited by Large Language Models (LLMs) within the domain of International Relations (IR). Leveraging historical records from the United Nations Security Council (UNSC), we developed a bias evaluation framework comprising three distinct tests to explore nation-level bias in various LLMs, with a particular focus on the five permanent members of the UNSC. Experimental results show that, although general bias patterns hold across models (e.g., favorable biases toward the western nations, and unfavorable biases toward Russia), they still vary by LLM. Notably, even within the same LLM, the direction and magnitude of bias for a nation change depending on the evaluation context. This observation suggests that LLM biases are fundamentally multidimensional, varying across models and tasks. We also observe that models with stronger reasoning abilities show reduced bias and better performance. Building on this finding, we introduce a debiasing framework that improves LLMs’ factual reasoning by combining Retrieval-Augmented Generation with Reflexion-based self-reflection techniques. Experiments show it effectively reduces nation-level bias and improves performance, particularly in GPT-4o-mini and Llama-3.3-70B. Our findings emphasize the need to assess nation-level bias alongside performance when applying LLMs in the IR domain.

[36] $π$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling

Dong Liu, Yanxuan Yu

Main category: cs.CL

TL;DR: $π$-Attention is a periodic sparse Transformer that achieves linear complexity while providing better long-range modeling than RingAttention through deterministic skips and adaptive fusion.

Details

Motivation: Transformers have quadratic complexity with sequence length, and existing sparse attention methods like RingAttention have limited receptive fields and lack adaptability.

Method: Factorizes attention into ring-local neighborhoods, deterministic π-stride skips, and adaptive fusion gate to create periodic sparse structure with linear complexity.

Result: Achieves $\mathcal{O}(kL + π\log L)$ receptive field growth vs. $\mathcal{O}(kL)$ for RingAttention; matches dense attention quality with 8.3% lower perplexity than RingAttention while using 50% fewer GPUs.

Conclusion: Periodic skips, adaptive fusion, and head-level sparsity coordination are crucial for efficient long-context modeling with linear complexity.

Abstract: Transformers have revolutionized natural language processing, but their quadratic complexity with respect to sequence length remains a fundamental bottleneck for long-range modeling. While sparse attention mechanisms like RingAttention reduce computational costs by restricting attention to local neighborhoods, they suffer from limited receptive fields and lack of adaptability. We present $π$-Attention, a periodic sparse Transformer that factorizes attention into ring-local neighborhoods, deterministic $π$-stride skips, and an adaptive fusion gate. The periodic structure provides predictable coverage of distant tokens, while the sparse footprint keeps the per-layer complexity linear in context length. We prove that $π$-Attention achieves $\mathcal{O}(kL + π\log L)$ receptive field growth compared to $\mathcal{O}(kL)$ for RingAttention, where $k$ is the local window size, $π$ is the skip period, and $L$ is the sequence length. Extensive experiments on language modeling, retrieval, and vision-language tasks demonstrate that $π$-Attention matches or surpasses dense attention quality with 8.3% lower perplexity than RingAttention while using 50% fewer GPUs for the same context length. Our detailed ablations and visualizations reveal the importance of periodic skips, adaptive fusion, and head-level sparsity coordination for efficient long-context modeling.
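
One plausible reading of the sparse pattern is sketched below: each query position attends to a local window that wraps around the sequence (the "ring") plus positions reached by repeated skips of a fixed period. This is an assumption-laden illustration only; the paper's exact skip schedule, any causal masking, and the adaptive fusion gate are not specified in the abstract and are not modeled here. The function name and parameters are hypothetical.

```python
def attended_positions(i, seq_len, window, period):
    """Key positions visible to query position i under a ring-local + periodic-skip mask."""
    # Ring-local neighborhood: `window` neighbors on each side, wrapping around.
    local = {(i + d) % seq_len for d in range(-window, window + 1)}
    # Deterministic skips: positions at multiples of `period` away from i.
    skips = {(i + m * period) % seq_len for m in range(1, seq_len // period + 1)}
    return sorted(local | skips)

# With 16 tokens, a window of 1, and a skip period of 4, position 0 sees its
# two ring neighbors plus every fourth token.
print(attended_positions(0, seq_len=16, window=1, period=4))  # [0, 1, 4, 8, 12, 15]
```

In this sketch the local window contributes `2 * window + 1` keys and the skips contribute about `seq_len / period`, so each query sees far fewer keys than the `seq_len` of dense attention while still reaching distant tokens in one hop.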

[37] Faithful Summarization of Consumer Health Queries: A Cross-Lingual Framework with LLMs

Ajwad Abrar, Nafisa Tabassum Oeshy, Prianka Maheru, Farzana Tabassum, Tareque Mohmud Chowdhury

Main category: cs.CL

TL;DR: A framework combining TextRank-based extraction, medical entity recognition, and fine-tuned LLaMA-2-7B improves faithfulness in medical text summarization, achieving better quality and faithfulness metrics while preserving critical medical information.

Details

Motivation: Consumer health question summarization can improve healthcare communication, but unfaithful summaries that misrepresent medical details pose serious risks to patient safety.

Method: Proposed framework combines TextRank-based sentence extraction and medical named entity recognition with large language models (LLMs), fine-tuning LLaMA-2-7B on MeQSum (English) and BanglaCHQ-Summ (Bangla) datasets.

Result: Achieved consistent improvements across quality (ROUGE, BERTScore, readability) and faithfulness (SummaC, AlignScore) metrics, outperforming zero-shot baselines and prior systems. Human evaluation shows over 80% of generated summaries preserve critical medical information.

Conclusion: Faithfulness is essential for reliable medical summarization, and the approach demonstrates potential for safer deployment of LLMs in healthcare contexts.

Abstract: Summarizing consumer health questions (CHQs) can ease communication in healthcare, but unfaithful summaries that misrepresent medical details pose serious risks. We propose a framework that combines TextRank-based sentence extraction and medical named entity recognition with large language models (LLMs) to enhance faithfulness in medical text summarization. In our experiments, we fine-tuned the LLaMA-2-7B model on the MeQSum (English) and BanglaCHQ-Summ (Bangla) datasets, achieving consistent improvements across quality (ROUGE, BERTScore, readability) and faithfulness (SummaC, AlignScore) metrics, and outperforming zero-shot baselines and prior systems. Human evaluation further shows that over 80% of generated summaries preserve critical medical information. These results highlight faithfulness as an essential dimension for reliable medical summarization and demonstrate the potential of our approach for safer deployment of LLMs in healthcare contexts.
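
The extraction stage can be illustrated with a small TextRank-style sentence ranker. This sketch covers only that first stage, under simplifying assumptions (whitespace tokenization, word-overlap similarity, a fixed number of PageRank-style iterations); the medical entity recognition and the fine-tuned LLaMA-2-7B summarizer are not shown, and the function names and example sentences are hypothetical.

```python
import math

def sentence_similarity(a, b):
    """Word-overlap similarity between two tokenized sentences, as in TextRank."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) = 0 would make the denominator vanish
    return len(set(a) & set(b)) / (math.log(len(a)) + math.log(len(b)))

def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by running PageRank-style updates on the similarity graph."""
    tokens = [s.lower().split() for s in sentences]
    n = len(tokens)
    weights = [[sentence_similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores

def extract_top(sentences, k):
    """Return the k highest-scoring sentences in their original order."""
    scores = textrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]

question = [
    "I have a headache and a fever",
    "The fever and headache started on Monday",
    "My cat is named Tom",
]
print(extract_top(question, k=2))
```

Here the two sentences that share medical terms reinforce each other, while the unrelated sentence has no edges in the graph and is dropped.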

[38] TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English

Fethi Bougares, Salima Mdhaffar, Haroun Elleuch, Yannick Estève

Main category: cs.CL

TL;DR: TEDxTN is the first publicly available Tunisian Arabic to English speech translation dataset, addressing data scarcity for Arabic dialects with 25 hours of code-switched speech from 11 Tunisian regions.

Details

Motivation: To mitigate data scarcity for Arabic dialects and enable research on Tunisian dialect natural language processing, particularly for code-switched speech.

Method: Collected 108 TEDx talks, segmented, transcribed and translated them following internally developed annotation guidelines, covering speakers from 11 Tunisian regions with various accents.

Result: Created a 25-hour speech translation corpus with code-switching, made publicly available with annotation guidelines, and established baseline systems for speech recognition and translation using pre-trained and fine-tuned models.

Conclusion: TEDxTN is the first open source Tunisian dialect speech translation corpus that will facilitate further research on Tunisian dialect NLP and can be extended as new talks become available.

Abstract: In this paper, we introduce TEDxTN, the first publicly available Tunisian Arabic to English speech translation dataset. This work is in line with the ongoing effort to mitigate the data scarcity obstacle for a number of Arabic dialects. We collected, segmented, transcribed and translated 108 TEDx talks following our internally developed annotation guidelines. The collected talks represent 25 hours of code-switched speech and cover speakers with various accents from over 11 different regions of Tunisia. We make the annotation guidelines and corpus publicly available. This will enable the extension of TEDxTN to new talks as they become available. We also report results for strong baseline systems of Speech Recognition and Speech Translation using multiple pre-trained and fine-tuned end-to-end models. This corpus is the first open-source and publicly available speech translation corpus of code-switched Tunisian dialect. We believe that this is a valuable resource that can motivate and facilitate further research on the natural language processing of Tunisian Dialect.

[39] Sabiá: Um Chatbot de Inteligência Artificial Generativa para Suporte no Dia a Dia do Ensino Superior (Sabiá: A Generative Artificial Intelligence Chatbot for Day-to-Day Support in Higher Education)

Guilherme Biava Rodrigues, Franciele Beal, Marlon Marcon, Alinne Cristinne Corrêa Souza, André Roberto Ortoncelli, Francisco Carlos Monteiro Souza, Rodolfo Adamshuk Silva

Main category: cs.CL

TL;DR: Developed a GenAI chatbot using RAG to help students access fragmented academic information more easily, with Gemini 2.0 Flash and Gemma 3n identified as top-performing models.

Details

Motivation: Students face difficulties accessing day-to-day academic information due to fragmentation across institutional documents and websites, causing confusion about routine university information.

Method: Used Generative AI and Retrieval-Augmented Generation (RAG) to develop a chatbot, testing several GenAI models and evaluating them based on quality metrics and LLM-as-a-Judge approach.

Result: Gemini 2.0 Flash stood out for its quality and speed, while Gemma 3n showed good performance and open-source advantages.

Conclusion: A GenAI chatbot with RAG can effectively simplify access to fragmented academic information, with Gemini 2.0 Flash and Gemma 3n being suitable model choices depending on specific requirements.

Abstract: Students often report difficulties in accessing day-to-day academic information, which is usually spread across numerous institutional documents and websites. This fragmentation results in a lack of clarity and causes confusion about routine university information. This project proposes the development of a chatbot using Generative Artificial Intelligence (GenAI) and Retrieval-Augmented Generation (RAG) to simplify access to such information. Several GenAI models were tested and evaluated based on quality metrics and the LLM-as-a-Judge approach. Among them, Gemini 2.0 Flash stood out for its quality and speed, and Gemma 3n for its good performance and open-source nature.

[40] LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

Grace Byun, Swati Rajwal, Jinho D. Choi

Main category: cs.CL

TL;DR: GPT-4o shows strong correlation with human graders (up to 0.98) for short-answer quizzes and project reports in a Computational Linguistics course, with 55% exact score agreement on quizzes, though some variability exists for technical responses.

Details

Motivation: To investigate the feasibility of using LLMs for educational assessment in real classrooms, as their alignment with human evaluation remains underexamined despite increasing exploration for grading tasks.

Method: Collected responses from approximately 50 students across five quizzes and project reports from 14 teams in an undergraduate Computational Linguistics course, then compared GPT-4o-generated scores against independent human evaluations by teaching assistants.

Result: GPT-4o achieved strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it showed strong overall alignment with human grading but exhibited some variability in scoring technical, open-ended responses.

Conclusion: LLM-based grading systems show both potential and limitations for educational assessment, contributing to advancing automated grading in real-world academic settings. All code and sample data are released to support further research.

Abstract: Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.
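The two agreement measures reported above (correlation with human graders and exact score agreement) can be computed as in the minimal sketch below. The score lists are made-up illustration data, not values from the paper, and the use of Pearson correlation is an assumption, since the summary does not say which correlation coefficient was used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def exact_agreement(xs, ys):
    """Fraction of items on which both graders gave exactly the same score."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)


# Hypothetical quiz scores for six responses (illustration only).
human_scores = [5, 4, 3, 5, 2, 4]
llm_scores = [5, 4, 4, 5, 2, 3]

print(f"correlation:     {pearson(human_scores, llm_scores):.2f}")
print(f"exact agreement: {exact_agreement(human_scores, llm_scores):.2f}")
```

The two numbers answer different questions: correlation can be high even when the LLM is consistently off by one point, whereas exact agreement only counts identical scores.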

[41] Assessing the Capabilities of LLMs in Humor: A Multi-dimensional Analysis of Oogiri Generation and Evaluation

Ritsu Sakabe, Hwichan Kim, Tosho Hirasawa, Mamoru Komachi

Main category: cs.CL

TL;DR: This paper systematically evaluates LLMs’ humor capabilities using Japanese Oogiri comedy games, finding that while LLMs can generate responses at low-to-mid human level, they lack Empathy and prioritize Novelty over Empathy in humor evaluation.

Details

Motivation: To address the gap in multifaceted humor evaluation of LLMs, moving beyond single-dimensional 'funny/not funny' assessments by using Oogiri comedy games as a comprehensive benchmark.

Method: Expanded Oogiri datasets with new sources and LLM-generated responses, then manually annotated with 5-point ratings across six dimensions (Novelty, Clarity, Relevance, Intelligence, Empathy, Overall Funniness). Evaluated LLMs on generation and evaluation tasks.

Result: LLMs generate responses at low-to-mid human performance level but exhibit significant Empathy deficit. LLMs prioritize Novelty in humor evaluation while humans prioritize Empathy, explaining LLMs’ failure to replicate human humor assessment.

Conclusion: LLMs’ humor capabilities are limited by lack of Empathy, and their evaluation criteria fundamentally diverge from humans. The annotated corpus is released to support development of more emotionally intelligent conversational agents.

Abstract: Computational humor is a frontier for creating advanced and engaging natural language processing (NLP) applications, such as sophisticated dialogue systems. While previous studies have benchmarked the humor capabilities of Large Language Models (LLMs), they have often relied on single-dimensional evaluations, such as judging whether something is simply “funny.” This paper argues that a multifaceted understanding of humor is necessary and addresses this gap by systematically evaluating LLMs through the lens of Oogiri, a form of Japanese improvisational comedy games. To achieve this, we expanded upon existing Oogiri datasets with data from new sources and then augmented the collection with Oogiri responses generated by LLMs. We then manually annotated this expanded collection with 5-point absolute ratings across six dimensions: Novelty, Clarity, Relevance, Intelligence, Empathy, and Overall Funniness. Using this dataset, we assessed the capabilities of state-of-the-art LLMs on two core tasks: their ability to generate creative Oogiri responses and their ability to evaluate the funniness of responses using a six-dimensional evaluation. Our results show that while LLMs can generate responses at a level between low- and mid-tier human performance, they exhibit a notable lack of Empathy. This deficit in Empathy helps explain their failure to replicate human humor assessment. Correlation analyses of human and model evaluation data further reveal a fundamental divergence in evaluation criteria: LLMs prioritize Novelty, whereas humans prioritize Empathy. We release our annotated corpus to the community to pave the way for the development of more emotionally intelligent and sophisticated conversational agents.

[42] Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders

Abir Harrasse, Florent Draye, Zhijing Jin, Bernhard Schölkopf

Main category: cs.CL

TL;DR: LLMs use pivot language representations where early layers create shared multilingual representations and later layers perform language-specific decoding, with dominant training languages influencing performance.

Details

Motivation: To understand how multilingual LLMs internally represent different languages and why performance favors dominant training languages despite shared representations.

Method: Train LLMs on different multilingual data mixtures and analyze internal mechanisms using cross-layer transcoders (CLT) and attribution graphs.

Result: Strong evidence for pivot language representations: nearly identical representations across languages, with language-specific decoding emerging in later layers. Attribution analyses show that decoding relies in part on a small set of high-frequency language features in the final layers, which can be intervened on to change the output language.

Conclusion: Understanding pivot-language mechanisms is crucial for improving multilingual alignment in LLMs, as dominant training languages shape decoding pathways and attribution patterns.

Abstract: Multilingual Large Language Models (LLMs) can process many languages, yet how they internally represent this diversity remains unclear. Do they form shared multilingual representations with language-specific decoding, and if so, why does performance still favor the dominant training language? To address this, we train a series of LLMs on different mixtures of multilingual data and analyze their internal mechanisms using cross-layer transcoders (CLT) and attribution graphs. Our results provide strong evidence for pivot language representations: the model employs nearly identical representations across languages, while language-specific decoding emerges in later layers. Attribution analyses reveal that decoding relies in part on a small set of high-frequency language features in the final layers, which linearly read out language identity from the first layers in the model. By intervening on these features, we can suppress one language and substitute another in the model’s outputs. Finally, we study how the dominant training language influences these mechanisms across attribution graphs and decoding pathways. We argue that understanding this pivot-language mechanism is crucial for improving multilingual alignment in LLMs.

[43] Reinforcing Stereotypes of Anger: Emotion AI on African American Vernacular English

Rebecca Dorn, Christina Chance, Casandra Rusti, Charles Bickham, Kai-Wei Chang, Fred Morstatter, Kristina Lerman

Main category: cs.CL

TL;DR: Emotion recognition models show significant racial bias, with false positive anger predictions more than doubling for African American Vernacular English (AAVE) compared to General American English, reinforcing racial stereotypes.

Details

Motivation: To examine how emotion AI models perform differently on African American Vernacular English versus General American English, as current models often rely on annotations reflecting dominant cultural norms and may fail to recognize emotional expression in excluded dialects.

Method: Analyzed 2.7 million geo-tagged tweets from Los Angeles, scored AAVE strength using computational dialect features, collected emotion annotations on 875 tweets with high/low AAVE densities, and used community-informed “silver” labels from African American AAVE-fluent annotators.

Result: GPT and BERT-based models showed false positive anger rates on AAVE more than double those on GAE. With SpanEmo, the false positive anger rate rose from 25% on GAE to 60% on AAVE. Models and non-ingroup annotations correlated more with profanity-based AAVE features than ingroup annotations did. Neighborhoods with higher proportions of African American residents were associated with higher anger predictions (r=0.27) and lower joy predictions (r=-0.10).

Conclusion: Emotion AI systems exhibit emergent safety issues by reinforcing racial stereotypes through biased emotion classification, highlighting the need for culturally and dialect-informed affective computing systems.

Abstract: Automated emotion detection is widely used in applications ranging from well-being monitoring to high-stakes domains like mental health and hiring. However, models often rely on annotations that reflect dominant cultural norms, limiting model ability to recognize emotional expression in dialects often excluded from training data distributions, such as African American Vernacular English (AAVE). This study examines emotion recognition model performance on AAVE compared to General American English (GAE). We analyze 2.7 million tweets geo-tagged within Los Angeles. Texts are scored for strength of AAVE using computational approximations of dialect features. Annotations of emotion presence and intensity are collected on a dataset of 875 tweets with both high and low AAVE densities. To assess model accuracy on a task as subjective as emotion perception, we calculate community-informed “silver” labels where AAVE-dense tweets are labeled by African American, AAVE-fluent (ingroup) annotators. On our labeled sample, GPT and BERT-based models exhibit false positive prediction rates of anger on AAVE more than double those on GAE. SpanEmo, a popular text-based emotion model, increases false positive rates of anger from 25 percent on GAE to 60 percent on AAVE. Additionally, a series of linear regressions reveals that models and non-ingroup annotations are significantly more correlated with profanity-based AAVE features than ingroup annotations. Linking Census tract demographics, we observe that neighborhoods with higher proportions of African American residents are associated with higher predictions of anger (Pearson’s correlation r = 0.27) and lower joy (r = -0.10). These results reveal an emergent safety issue of emotion AI reinforcing racial stereotypes through biased emotion classification. We emphasize the need for culturally and dialect-informed affective computing systems.
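The per-dialect false positive rate behind the comparison above can be sketched as follows. This is a minimal illustration, not the authors' code: the record fields (`dialect`, `gold_anger`, `pred_anger`) are hypothetical names, and the toy records are constructed so that the rates mirror the 25% and 60% figures reported for SpanEmo.

```python
def false_positive_rate(records, group):
    """FPR = FP / (FP + TN), computed over tweets in `group` whose gold
    (ingroup "silver") label says anger is absent."""
    negatives = [
        r for r in records
        if r["dialect"] == group and not r["gold_anger"]
    ]
    if not negatives:
        return float("nan")  # undefined when the group has no gold negatives
    false_positives = sum(1 for r in negatives if r["pred_anger"])
    return false_positives / len(negatives)


def make_records(dialect, n_false_pos, n_true_neg):
    """Build toy gold-negative records for one dialect (illustration only)."""
    fp = [{"dialect": dialect, "gold_anger": False, "pred_anger": True}] * n_false_pos
    tn = [{"dialect": dialect, "gold_anger": False, "pred_anger": False}] * n_true_neg
    return fp + tn


# Hypothetical sample: 1 of 4 GAE negatives and 3 of 5 AAVE negatives
# are wrongly predicted as angry.
records = make_records("GAE", 1, 3) + make_records("AAVE", 3, 2)

for group in ("GAE", "AAVE"):
    print(group, false_positive_rate(records, group))
```

Only tweets without anger in the gold labels enter the denominator, so the measure isolates how often a model reads anger into text where ingroup annotators saw none.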

[44] Leveraging Parameter Space Symmetries for Reasoning Skill Transfer in LLMs

Stefan Horoi, Sangwoo Cho, Supriyo Chakraborty, Shi-Xiong Zhang, Sambit Sahu, Guy Wolf, Genta Indra Winata

Main category: cs.CL

TL;DR: Task arithmetic for LLM skill transfer often fails due to negative interference from divergent training. This paper solves this by first aligning parameter spaces using Transformer symmetries, adapting methods for GQA and SwiGLU layers, enabling successful transfer of reasoning skills.

Details

Motivation: Task arithmetic is useful for transferring skills between LLMs but suffers from negative interference when models have diverged during training, limiting its effectiveness.

Method: Align parameter spaces first using Transformer symmetries (permutation, rotation, scaling), adapt alignment for modern GQA and SwiGLU layers using both weight-based and activation-based approaches.

Result: Successfully transferred advanced reasoning skills to non-reasoning model. Experiments on challenging reasoning benchmarks show consistent outperformance over standard task arithmetic.

Conclusion: Provides effective approach for merging and transferring specialized skills across evolving LLM families, reducing redundant fine-tuning and enhancing model adaptability.

Abstract: Task arithmetic is a powerful technique for transferring skills between Large Language Models (LLMs), but it often suffers from negative interference when models have diverged during training. We address this limitation by first aligning the models’ parameter spaces, leveraging the inherent permutation, rotation, and scaling symmetries of Transformer architectures. We adapt parameter space alignment for modern Grouped-Query Attention (GQA) and SwiGLU layers, exploring both weight-based and activation-based approaches. Using this alignment-first strategy, we successfully transfer advanced reasoning skills to a non-reasoning model. Experiments on challenging reasoning benchmarks show that our method consistently outperforms standard task arithmetic. This work provides an effective approach for merging and transferring specialized skills across evolving LLM families, reducing redundant fine-tuning and enhancing model adaptability.
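For readers unfamiliar with task arithmetic, the baseline operation the paper builds on can be sketched as below: a task vector is the element-wise difference between fine-tuned and base weights, and it is added (optionally scaled) to a target model. This is a simplified sketch under stated assumptions: parameters are plain dicts of flat float lists, the `scale` parameter and function names are illustrative, and the paper's alignment step (permutation, rotation, and scaling symmetries) is deliberately not implemented here.

```python
def task_vector(finetuned, base):
    """Element-wise difference between fine-tuned and base parameters."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def apply_task_vector(target, vector, scale=1.0):
    """Add a (scaled) task vector to a target model's parameters.

    Plain task arithmetic assumes `target` lives in the same parameter
    basis as the models the vector came from. When the models have
    diverged, that assumption fails, which is the interference problem
    the alignment-first strategy is meant to address.
    """
    return {
        name: [t + scale * v for t, v in zip(target[name], vector[name])]
        for name in target
    }


# Toy parameters (illustration only).
base = {"w": [1.0, 2.0]}
reasoning = {"w": [1.5, 1.0]}   # base fine-tuned for a skill
target = {"w": [0.0, 0.0]}      # model that should receive the skill

vec = task_vector(reasoning, base)
merged = apply_task_vector(target, vec, scale=0.5)
print(merged)
```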

[45] From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems

Parisa Rabbani, Nimet Beyza Bozdag, Dilek Hakkani-Tür

Main category: cs.CL

TL;DR: LLM judges show inconsistent conviction when tasks shift from direct factual queries to conversational judgment tasks, with performance changing by 9.24% on average across models under minimal dialogue context.

Details

Motivation: To investigate whether LLM judges can reliably assess tasks requiring social or conversational judgment, particularly how their conviction changes when tasks are reframed from direct factual queries to conversational contexts.

Method: Evaluation framework contrasting model performance on direct factual queries vs. conversational judgment tasks, applying pressure via simple rebuttals to measure conviction maintenance under conversational pressure.

Result: Models exhibit different tendencies: GPT-4o-mini shows sycophantic behavior while Llama-8B-Instruct becomes overly-critical. Average performance change of 9.24% across all models demonstrates significant impact of conversational framing.

Conclusion: Minimal dialogue context significantly alters LLM judgment, making conversational framing a key factor in LLM-based evaluation. The framework provides methodology for diagnosing model conviction and developing more trustworthy dialogue systems.

Abstract: LLMs are increasingly employed as judges across a variety of tasks, including those involving everyday social interactions. Yet, it remains unclear whether such LLM-judges can reliably assess tasks that require social or conversational judgment. We investigate how an LLM’s conviction is changed when a task is reframed from a direct factual query to a Conversational Judgment Task. Our evaluation framework contrasts the model’s performance on direct factual queries with its assessment of a speaker’s correctness when the same information is presented within a minimal dialogue, effectively shifting the query from “Is this statement correct?” to “Is this speaker correct?”. Furthermore, we apply pressure in the form of a simple rebuttal (“The previous answer is incorrect.”) to both conditions. This perturbation allows us to measure how firmly the model maintains its position under conversational pressure. Our findings show that while some models like GPT-4o-mini reveal sycophantic tendencies under social framing tasks, others like Llama-8B-Instruct become overly-critical. We observe an average performance change of 9.24% across all models, demonstrating that even minimal dialogue context can significantly alter model judgment, underscoring conversational framing as a key factor in LLM-based evaluation. The proposed framework offers a reproducible methodology for diagnosing model conviction and contributes to the development of more trustworthy dialogue systems.
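The two kinds of measurement described above, the performance change between framings and how often a model abandons its answer after a rebuttal, can be illustrated with the sketch below. The function names and the boolean answer lists are hypothetical illustration data; the paper's exact metric definitions are not given in this summary, so this is only one plausible reading.

```python
def accuracy(answers, gold):
    """Fraction of answers that match the gold labels."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)


def flip_rate(before, after):
    """Fraction of items where the answer changed after the rebuttal."""
    return sum(b != a for b, a in zip(before, after)) / len(before)


# Hypothetical judgments on four statements (True = "correct").
gold = [True, False, True, True]
direct = [True, False, True, False]          # "Is this statement correct?"
dialogue = [True, True, True, False]         # "Is this speaker correct?"
dialogue_rebutted = [False, True, True, True]  # after "The previous answer is incorrect."

change = accuracy(dialogue, gold) - accuracy(direct, gold)
print(f"performance change under dialogue framing: {change:+.2f}")
print(f"flip rate after rebuttal: {flip_rate(dialogue, dialogue_rebutted):.2f}")
```

Reporting the two numbers separately keeps the effect of framing apart from the effect of pressure: a model can be equally accurate under both framings yet far easier to talk out of its answer in one of them.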

[46] ICX360: In-Context eXplainability 360 Toolkit

Dennis Wei, Ronny Luss, Xiaomeng Hu, Lucas Monteiro Paes, Pin-Yu Chen, Karthikeyan Natesan Ramamurthy, Erik Miehling, Inge Vejsbjerg, Hendrik Strobelt

Main category: cs.CL

TL;DR: ICX360 is an open-source Python toolkit for explaining LLM outputs using black-box and white-box methods through perturbations and gradients.

Details

Motivation: As LLMs enter higher-stakes applications, there's a critical need for tools to explain their outputs (summaries, responses, etc.) to ensure transparency and trustworthiness.

Method: Implements three recent explanation tools using both black-box (perturbations) and white-box (gradients) methods, focusing on user-provided context/prompts.

Result: Developed ICX360 toolkit with quick-start guidance and detailed tutorials covering use cases like retrieval augmented generation, natural language generation, and jailbreaking.

Conclusion: ICX360 provides a comprehensive open-source solution for LLM explainability, addressing the growing need for transparency in LLM applications across various domains.

Abstract: Large Language Models (LLMs) have become ubiquitous in everyday life and are entering higher-stakes applications ranging from summarizing meeting transcripts to answering doctors’ questions. As was the case with earlier predictive models, it is crucial that we develop tools for explaining the output of LLMs, be it a summary, list, response to a question, etc. With these needs in mind, we introduce In-Context Explainability 360 (ICX360), an open-source Python toolkit for explaining LLMs with a focus on the user-provided context (or prompts in general) that are fed to the LLMs. ICX360 contains implementations for three recent tools that explain LLMs using both black-box and white-box methods (via perturbations and gradients respectively). The toolkit, available at https://github.com/IBM/ICX360, contains quick-start guidance materials as well as detailed tutorials covering use cases such as retrieval augmented generation, natural language generation, and jailbreaking.

[47] A Multifaceted Analysis of Negative Bias in Large Language Models through the Lens of Parametric Knowledge

Jongyoon Song, Sangwon Yu, Sungroh Yoon

Main category: cs.CL

TL;DR: LLMs exhibit format-level negative bias where prompt format influences responses more than negative semantics. Models tend to generate negative responses when lacking sufficient knowledge in yes-no questions.

Details

Motivation: Previous research focused on detecting negative attention heads causing negative bias, but the detailed underlying factors influencing negative bias remain underexplored.

Method: Introduced a pipeline to construct evaluation sets categorized by model’s parametric knowledge (correct, incorrect, insufficient). Analyzed negative bias under various prompting scenarios including relevant context, “I don’t know” option, and chain-of-thought prompting.

Result: Models show shortcut behavior by generating negative responses when lacking sufficient knowledge. Providing context and “I don’t know” option reduces negative bias, while chain-of-thought amplifies it. Bias degree varies by prompt type.

Conclusion: Reveals various factors influencing negative bias, providing critical insights for mitigating it in LLMs, particularly how prompt format and knowledge availability affect response direction.

Abstract: Negative bias refers to the tendency of large language models (LLMs) to excessively generate negative responses in binary decision tasks (e.g., yes-no question answering). Previous research has focused on detecting and addressing negative attention heads that induce negative bias. However, the underlying detailed factors influencing negative bias remain underexplored. In this paper, we demonstrate that LLMs exhibit format-level negative bias, meaning the prompt format influences their responses more than the semantics of the negative response. For the fine-grained study of the negative bias, we introduce a pipeline for constructing the evaluation set, which systematically categorizes the dataset into three subsets based on the model’s parametric knowledge: correct, incorrect, and insufficient relevant knowledge. Through analysis of this evaluation set, we identify a shortcut behavior in which models tend to generate negative responses when they lack sufficient knowledge to answer a yes-no question, leading to negative bias. We further examine how negative bias changes under various prompting scenarios related to parametric knowledge. We observe that providing relevant context and offering an “I don’t know” option generally reduces negative bias, whereas chain-of-thought prompting tends to amplify the bias. Finally, we demonstrate that the degree of negative bias can vary depending on the type of prompt, which influences the direction of the response. Our work reveals the various factors that influence negative bias, providing critical insights for mitigating it in LLMs.

[48] MedPath: Multi-Domain Cross-Vocabulary Hierarchical Paths for Biomedical Entity Linking

Nishant Mishra, Wilker Aziz, Iacer Calixto

Main category: cs.CL

TL;DR: MedPath is a large-scale biomedical Entity Linking dataset that integrates nine existing datasets, normalizes entities using UMLS, maps to 62 vocabularies, and provides full ontological paths to enable semantic-rich and explainable NLP models.

Details

Motivation: Address fragmented data landscape, lack of resources for explainable models, and limitations of semantically-blind evaluation metrics in biomedical NER and EL.

Method: Build upon nine existing expert-annotated EL datasets, normalize entities using latest UMLS version, augment with mappings to 62 biomedical vocabularies, and enrich with full ontological paths in up to 11 vocabularies.

Result: Created MedPath - a large-scale, multi-domain biomedical EL dataset that enables training and evaluation of semantic-rich and interpretable EL systems.

Conclusion: MedPath facilitates new research frontiers in biomedical NLP and enables development of next-generation interoperable and explainable clinical NLP models.

Abstract: Progress in biomedical Named Entity Recognition (NER) and Entity Linking (EL) is currently hindered by a fragmented data landscape, a lack of resources for building explainable models, and the limitations of semantically-blind evaluation metrics. To address these challenges, we present MedPath, a large-scale and multi-domain biomedical EL dataset that builds upon nine existing expert-annotated EL datasets. In MedPath, all entities are 1) normalized using the latest version of the Unified Medical Language System (UMLS), 2) augmented with mappings to 62 other biomedical vocabularies and, crucially, 3) enriched with full ontological paths – i.e., from general to specific – in up to 11 biomedical vocabularies. MedPath directly enables new research frontiers in biomedical NLP, facilitating training and evaluation of semantic-rich and interpretable EL systems, and the development of the next generation of interoperable and explainable clinical NLP models.

[49] From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

Farima Fatahi Bayat, Pouya Pezeshkpour, Estevam Hruschka

Main category: cs.CL

TL;DR: Tool-augmented Language Models (TaLMs) using Code Interpreter show improved answer accuracy but suffer from Tool-Induced Myopia (TIM), where tools substitute for reasoning, leading to less coherent justifications despite correct answers.

Details

Motivation: To investigate whether tool-enabled gains in TaLMs reflect trustworthy reasoning or if tools are being used as substitutes for proper reasoning processes.

Method: Developed PYMATH benchmark with 1,679 competition-level math problems for which Python code is helpful but not sufficient, created multi-dimensional evaluation suite, analyzed reasoning degradation, and proposed preference-optimization framework to realign tool use.

Result: TaLMs achieved up to a 19.3 percentage-point gain in final-answer accuracy, but their reasoning deteriorated (non-tool LLMs won up to 41.5% more often in pairwise comparisons of the reasoning process). The degradation intensified with tool-use frequency, and TIM was present in ~55% of high-risk cases.

Conclusion: Tool use shifts errors from arithmetic to global reasoning failures, but preference optimization can realign TaLMs to use tools as assistive evidence, improving both accuracy and reasoning depth.

Abstract: Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use; the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Code and data are available at: https://github.com/megagonlabs/TIM.

[50] Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering

Xueren Ge, Sahil Murtaza, Anthony Cortez, Homa Alemzadeh

Main category: cs.CL

TL;DR: EMSQA introduces a specialized medical QA dataset and two methods (Expert-CoT and ExpertRAG) that leverage clinical subject areas and certification levels to improve LLM performance in emergency medical services.

Details

Motivation: Existing LLM approaches overlook domain-specific expertise like clinical subject areas and certification levels, limiting performance in high-stakes medical settings.

Method: Created EMSQA dataset (24.3K questions) with clinical subject areas and certification levels, plus Expert-CoT prompting that conditions reasoning on specific expertise, and ExpertRAG that retrieves subject-aligned documents.

Result: Expert-CoT improves up to 2.05% over vanilla CoT, and combined with ExpertRAG yields up to 4.59% accuracy gain. 32B expertise-augmented LLMs pass all EMS certification simulation exams.

Conclusion: Incorporating clinical expertise domains and certification levels significantly enhances LLM performance in medical QA, enabling reliable performance in high-stakes emergency medical scenarios.

Abstract: Large language models (LLMs) have shown promise in medical question answering, yet they often overlook the domain-specific expertise that professionals depend on, such as the clinical subject areas (e.g., trauma, airway) and the certification level (e.g., EMT, Paramedic). Existing approaches typically apply general-purpose prompting or retrieval strategies without leveraging this structured context, limiting performance in high-stakes settings. We address this gap with EMSQA, a 24.3K-question multiple-choice dataset spanning 10 clinical subject areas and 4 certification levels, accompanied by curated, subject area-aligned knowledge bases (40K documents and 2M tokens). Building on EMSQA, we introduce (i) Expert-CoT, a prompting strategy that conditions chain-of-thought (CoT) reasoning on specific clinical subject area and certification level, and (ii) ExpertRAG, a retrieval-augmented generation pipeline that grounds responses in subject area-aligned documents and real-world patient data. Experiments on 4 LLMs show that Expert-CoT improves up to 2.05% over vanilla CoT prompting. Additionally, combining Expert-CoT with ExpertRAG yields up to a 4.59% accuracy gain over standard RAG baselines. Notably, the 32B expertise-augmented LLMs pass all the computer-adaptive EMS certification simulation exams.

[51] Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions

Mengze Hong, Di Jiang, Weiwei Zhao, Yawen Li, Yihang Wang, Xinyuan Luo, Yanjie Sun, Chen Jason Zhang

Main category: cs.CL

TL;DR: A multimodal LLM-based system for academic peer review simulation that integrates text and visual inputs, uses RAG with OpenReview data, and generates actionable feedback in structured formats to help authors improve manuscripts before submission.

Details

Motivation: Existing peer review systems are limited to text-only inputs, lack contextual grounding, and don't provide actionable feedback, making it difficult for authors to effectively revise papers before submission.

Method: Interactive web-based system that uses multimodal LLMs to process text and visual information, grounds reviews via RAG over OpenReview data, and converts reviews into actionable to-do lists in the Action:Objective[#] format for structured guidance.

Result: The system generates more comprehensive and useful reviews aligned with expert standards, outperforming ablated baselines and providing effective scholarly assistance.

Conclusion: The framework advances transparent, human-centered scholarly assistance by enabling effective manuscript revisions through multimodal, community-aware peer review simulation integrated into existing academic writing platforms.

Abstract: While large language models (LLMs) offer promising capabilities for automating academic workflows, existing systems for academic peer review remain constrained by text-only inputs, limited contextual grounding, and a lack of actionable feedback. In this work, we present an interactive web-based system for multimodal, community-aware peer review simulation to enable effective manuscript revisions before paper submission. Our framework integrates textual and visual information through multimodal LLMs, enhances review quality via retrieval-augmented generation (RAG) grounded in web-scale OpenReview data, and converts generated reviews into actionable to-do lists using the proposed Action:Objective[#] format, providing structured and traceable guidance. The system integrates seamlessly into existing academic writing platforms, providing interactive interfaces for real-time feedback and revision tracking. Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assistance.

[52] Automated Analysis of Learning Outcomes and Exam Questions Based on Bloom’s Taxonomy

Ramya Kumar, Dhruv Gulwani, Sonit Singh

Main category: cs.CL

TL;DR: SVM with data augmentation achieved 94% accuracy for Bloom’s Taxonomy classification, outperforming RNNs, BERT, RoBERTa, and LLMs which suffered from overfitting or lower performance.

Details

Motivation: To automatically classify exam questions and learning outcomes according to Bloom's Taxonomy categories (Knowledge, Comprehension, Application, Analysis, Synthesis, Evaluation).

Method: Used traditional ML (Naive Bayes, Logistic Regression, SVM), RNNs (LSTM, BiLSTM, GRU, BiGRU), transformers (BERT, RoBERTa), and LLMs (OpenAI, Gemini, Ollama, Anthropic) on a 600-sentence dataset with various preprocessing and augmentation strategies.

Result: SVM with augmentation achieved 94% accuracy/recall/F1 with minimal overfitting. RNNs and BERT overfit severely; RoBERTa initially avoided overfitting but began to show signs of it as training progressed. LLMs achieved ~0.72-0.73 accuracy in zero-shot evaluation.

Conclusion: Complex deep models struggle with limited data; careful data augmentation and simpler algorithms (like augmented SVM) are more effective for Bloom’s Taxonomy classification.

Abstract: This paper explores the automatic classification of exam questions and learning outcomes according to Bloom’s Taxonomy. A small dataset of 600 sentences labeled with six cognitive categories - Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation - was processed using traditional machine learning (ML) models (Naive Bayes, Logistic Regression, Support Vector Machines), recurrent neural network architectures (LSTM, BiLSTM, GRU, BiGRU), transformer-based models (BERT and RoBERTa), and large language models (OpenAI, Gemini, Ollama, Anthropic). Each model was evaluated under different preprocessing and augmentation strategies (for example, synonym replacement and word embeddings). Among traditional ML approaches, Support Vector Machines (SVM) with data augmentation achieved the best overall performance, reaching 94 percent accuracy, recall, and F1 scores with minimal overfitting. In contrast, the RNN models and BERT suffered from severe overfitting, while RoBERTa initially overcame it but began to show signs of overfitting as training progressed. Finally, zero-shot evaluations of large language models (LLMs) indicated that OpenAI and Gemini performed best among the tested LLMs, achieving approximately 0.72-0.73 accuracy and comparable F1 scores. These findings highlight the challenges of training complex deep models on limited data and underscore the value of careful data augmentation and simpler algorithms (such as augmented SVM) for Bloom’s Taxonomy classification.

[53] Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D

Arsh Gupta, Ajay Narayanan Sridhar, Bonam Mingole, Amulya Yadav

Main category: cs.CL

TL;DR: LLMs show varying performance (16.48%-38.64% accuracy) on rare disease diagnosis from narrative medical cases, with newer models showing 2.3x improvement, using a House M.D.-based validated dataset.

Details

Motivation: To evaluate LLM capabilities on rare disease diagnosis from narrative medical cases, an area that remains underexplored despite LLMs' broad capabilities.

Method: Created a novel dataset of 176 symptom-diagnosis pairs from House M.D. TV series, validated for medical education, and tested four state-of-the-art LLMs (GPT 4o mini, GPT 5 mini, Gemini 2.5 Flash, Gemini 2.5 Pro) on narrative-based diagnostic reasoning tasks.

Result: Significant performance variation (16.48%-38.64% accuracy) with newer model generations showing 2.3 times improvement, though all models face substantial challenges with rare disease diagnosis.

Conclusion: The observed improvement across architectures suggests promising directions for future development, and the educationally validated benchmark establishes baseline metrics for narrative medical reasoning and provides an accessible evaluation framework for AI-assisted diagnosis research.

Abstract: Large language models (LLMs) have demonstrated capabilities across diverse domains, yet their performance on rare disease diagnosis from narrative medical cases remains underexplored. We introduce a novel dataset of 176 symptom-diagnosis pairs extracted from House M.D., a medical television series validated for teaching rare disease recognition in medical education. We evaluate four state-of-the-art LLMs (GPT 4o mini, GPT 5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro) on narrative-based diagnostic reasoning tasks. Results show significant variation in performance, ranging from 16.48% to 38.64% accuracy, with newer model generations demonstrating a 2.3 times improvement. While all models face substantial challenges with rare disease diagnosis, the observed improvement across architectures suggests promising directions for future development. Our educationally validated benchmark establishes baseline performance metrics for narrative medical reasoning and provides a publicly accessible evaluation framework for advancing AI-assisted diagnosis research.
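The reported figures can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the abstract (176 cases, 16.48% and 38.64% accuracy). The implied counts of correct diagnoses are an inference from rounding, not figures the abstract reports.

```python
# Figures taken from the abstract.
n_cases = 176
low_acc, high_acc = 0.1648, 0.3864

# Inferred (not reported) counts of correct diagnoses implied by the percentages.
low_correct = round(low_acc * n_cases)
high_correct = round(high_acc * n_cases)

# Relative improvement between the weakest and strongest reported accuracy.
ratio = high_acc / low_acc

print(low_correct, high_correct, round(ratio, 1))
```

Under this reading, the two accuracies correspond to 29 and 68 correct cases out of 176, and their ratio rounds to the stated 2.3 times improvement.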

[54] CardioEmbed: Domain-Specialized Text Embeddings for Clinical Cardiology

Richard J. Young, Alice M. Matthews

Main category: cs.CL

TL;DR: CardioEmbed is a domain-specialized embedding model for cardiology that achieves 99.60% retrieval accuracy on cardiac-specific tasks, significantly outperforming existing medical embedding models.

Details

Motivation: Existing biomedical text embeddings are trained on PubMed research literature, creating a gap with clinical cardiology practice which relies on procedural knowledge and specialized terminology from comprehensive textbooks.

Method: Trained CardioEmbed based on Qwen3-Embedding-8B using contrastive learning with InfoNCE loss and in-batch negatives on a curated corpus of seven comprehensive cardiology textbooks (150,000 sentences after deduplication).

Result: Achieved 99.60% retrieval accuracy on cardiac-specific semantic retrieval tasks (+15.94 percentage point improvement over MedTE). On MTEB medical benchmarks: BIOSSES 0.77 Spearman and SciFact 0.61 NDCG@10.

Conclusion: Domain-specialized training on comprehensive clinical textbooks yields near-perfect cardiology retrieval performance, significantly improving over existing medical embedding models.

Abstract: Biomedical text embeddings have primarily been developed using research literature from PubMed, yet clinical cardiology practice relies heavily on procedural knowledge and specialized terminology found in comprehensive textbooks rather than research abstracts. This research-practice gap limits the effectiveness of existing embedding models for clinical applications in cardiology. This study trained CardioEmbed, a domain-specialized embedding model based on Qwen3-Embedding-8B, using contrastive learning on a curated corpus of seven comprehensive cardiology textbooks totaling approximately 150,000 sentences after deduplication. The model employs InfoNCE loss with in-batch negatives and achieves 99.60% retrieval accuracy on cardiac-specific semantic retrieval tasks, a +15.94 percentage point improvement over MedTE, the current state-of-the-art medical embedding model. On MTEB medical benchmarks, the model obtained BIOSSES 0.77 Spearman and SciFact 0.61 NDCG@10, indicating competitive performance on related biomedical domains. Domain-specialized training on comprehensive clinical textbooks yields near-perfect cardiology retrieval (99.60% Acc@1), improving over MedTE by +15.94 percentage points.
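For readers unfamiliar with the training objective, the following is a minimal NumPy sketch of InfoNCE with in-batch negatives in its generic form. It is not the CardioEmbed training code: the function name, the cosine-similarity scoring, and the temperature value of 0.05 are assumptions made for illustration.

```python
import numpy as np

def info_nce_in_batch(anchors, positives, temperature=0.05):
    """Generic InfoNCE: row i of `positives` is the positive for row i of
    `anchors`; every other row in the batch serves as a negative.
    The temperature value is an illustrative assumption."""
    # Cosine similarity via L2-normalized dot products.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i (the diagonal).
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: perfectly aligned pairs give a near-zero loss,
# mismatched pairs give a large one.
x = np.eye(4)
print(info_nce_in_batch(x, x), info_nce_in_batch(x, x[::-1]))
```

The loss is a cross-entropy over the batch similarity matrix, so larger batches supply more negatives per anchor at no extra labeling cost.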

[55] DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

Xiying Zhao, Zhoufutu Wen, Zhixuan Chen, Jingzhe Ding, Jianpeng Jiao, Shuai Li, Xi Li, Danni Liang, Shengda Long, Qianqian Liu, Xianbo Wu, Hongwan Gao, Xiang Gao, Liang Hu, Jiashuo Liu, Mengyun Liu, Weiran Shi, Chenghao Yang, Qianyu Yang, Xuanliang Zhang, Ge Zhang, Wenhao Huang

Main category: cs.CL

TL;DR: DiscoX is a new benchmark for discourse-level and expert-level Chinese-English translation, featuring 200 long professional texts across 7 domains, with Metric-S as a reference-free evaluation system that outperforms existing metrics.

Details

Motivation: Current translation evaluation methods focus on segment-level accuracy and fluency, but expert domains require discourse-level coherence and terminological precision, which current methods inadequately assess.

Method: Created DiscoX benchmark with 200 professionally-curated texts (avg. 1700+ tokens) from 7 domains, and developed Metric-S - a reference-free evaluation system for fine-grained assessment of accuracy, fluency, and appropriateness.

Result: Metric-S shows strong consistency with human judgments and significantly outperforms existing metrics. Advanced LLMs still trail human experts significantly on DiscoX tasks, validating the benchmark’s difficulty.

Conclusion: DiscoX and Metric-S provide a robust framework for rigorous evaluation of discourse-level translation, highlighting remaining challenges in achieving professional-grade machine translation and facilitating future LLM advancements.

Abstract: The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level and expert-level Chinese-English translation. It comprises 200 professionally-curated texts from 7 domains, with an average length exceeding 1700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a remarkable performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advancements in LLM-based translation.

[56] When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets

Aladin Djuhera, Farhan Ahmed, Swanand Ravindra Kadhe, Syed Zawad, Heiko Ludwig, Holger Boche

Main category: cs.CL

TL;DR: This paper presents a comprehensive analysis of popular open-source DPO datasets, introduces a new curated mixture called UltraMix that outperforms individual datasets while being 30% smaller, and releases annotations to support data-centric preference optimization research.

Details

Motivation: There is a lack of systematic comparisons between open-source DPO datasets due to high computational costs and insufficient quality annotations, making it difficult to understand how preferences were selected and how well they reflect human judgment.

Method: The authors used the Magpie framework to annotate samples from five DPO datasets (TuluDPO, ORPO, UltraFeedback, HelpSteer, Code-Preference-Pairs) for task category, input quality, and preference reward, then systematically curated UltraMix by selectively combining samples while removing noisy or redundant ones.

Result: UltraMix is 30% smaller than the best-performing individual dataset yet exceeds its performance across key benchmarks, revealing structural and qualitative discrepancies in reward margins across different datasets.

Conclusion: The study provides the first comprehensive data-centric analysis of DPO corpora, demonstrates the value of systematic dataset curation, and releases all annotations and the curated UltraMix mixture to facilitate future research in data-centric preference optimization.

Abstract: Aligning large language models (LLMs) is a central objective of post-training, often achieved through reward modeling and reinforcement learning methods. Among these, direct preference optimization (DPO) has emerged as a widely adopted technique that fine-tunes LLMs on preferred completions over less favorable ones. While most frontier LLMs do not disclose their curated preference pairs, the broader LLM community has released several open-source DPO datasets, including TuluDPO, ORPO, UltraFeedback, HelpSteer, and Code-Preference-Pairs. However, systematic comparisons remain scarce, largely due to the high computational cost and the lack of rich quality annotations, making it difficult to understand how preferences were selected, which task types they span, and how well they reflect human judgment on a per-sample level. In this work, we present the first comprehensive, data-centric analysis of popular open-source DPO corpora. We leverage the Magpie framework to annotate each sample for task category, input quality, and preference reward, a reward-model-based signal that validates the preference order without relying on human annotations. This enables a scalable, fine-grained inspection of preference quality across datasets, revealing structural and qualitative discrepancies in reward margins. Building on these insights, we systematically curate a new DPO mixture, UltraMix, that draws selectively from all five corpora while removing noisy or redundant samples. UltraMix is 30% smaller than the best-performing individual dataset yet exceeds its performance across key benchmarks. We publicly release all annotations, metadata, and our curated mixture to facilitate future research in data-centric preference optimization.
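The abstract does not spell out the curation rules behind UltraMix. As a purely hypothetical illustration of how a reward-model signal could be used to drop noisy or redundant pairs, the sketch below keeps a pair only if the chosen response out-scores the rejected one by more than a margin and the pair has not been seen before. The `PreferencePair` type, the `filter_pairs` function, and the `min_margin` threshold are all assumptions, not the authors' pipeline.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    # Hypothetical record layout for one annotated DPO sample.
    prompt: str
    chosen: str
    rejected: str
    chosen_reward: float
    rejected_reward: float

def reward_margin(pair):
    """Reward-model score gap; a positive gap agrees with the preference label."""
    return pair.chosen_reward - pair.rejected_reward

def filter_pairs(pairs, min_margin=0.0):
    """Keep pairs whose margin exceeds min_margin, dropping exact duplicates."""
    seen = set()
    kept = []
    for pair in pairs:
        key = (pair.prompt, pair.chosen, pair.rejected)
        if reward_margin(pair) <= min_margin or key in seen:
            continue
        seen.add(key)
        kept.append(pair)
    return kept

pairs = [
    PreferencePair("q1", "good", "bad", 2.0, 0.5),   # consistent: kept
    PreferencePair("q2", "meh", "fine", 0.1, 0.9),   # inverted margin: dropped
    PreferencePair("q1", "good", "bad", 2.0, 0.5),   # duplicate: dropped
]
print(len(filter_pairs(pairs)))
```

A real curation pass would likely also use the task-category and input-quality annotations the abstract mentions; those are omitted here to keep the sketch small.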

[57] Automata-Based Steering of Large Language Models for Diverse Structured Generation

Xiaokun Luan, Zeming Wei, Yihao Zhang, Meng Sun

Main category: cs.CL

TL;DR: A method to improve diversity in automaton-based structured generation by using traversal history to guide LLMs toward novel patterns.

Details

Motivation: Current structured generation methods ensure validity but lack output diversity, which is a critical limitation.

Method: Utilizes automata traversal history to steer LLMs towards novel structural patterns in automaton-based structured generation.

Result: Significantly improves structural and content diversity while maintaining comparable generation efficiency.

Conclusion: The proposed method effectively enhances diversity in structured generation and shows practical value in generating diverse test cases for open-source libraries.

Abstract: Large language models (LLMs) are increasingly tasked with generating structured outputs. While structured generation methods ensure validity, they often lack output diversity, a critical limitation that we confirm in our preliminary study. We propose a novel method to enhance diversity in automaton-based structured generation. Our approach utilizes automata traversal history to steer LLMs towards novel structural patterns. Evaluations show our method significantly improves structural and content diversity while maintaining comparable generation efficiency. Furthermore, we conduct a case study showcasing the effectiveness of our method in generating diverse test cases for testing open-source libraries.

[58] Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

Xingyu Ren, Youran Sun, Haoyu Liang

Main category: cs.CL

TL;DR: Text embedding models have a consistent bias μ in their outputs. Renormalization, a plug-and-play training-free method that removes this bias, significantly improves performance on MMTEB benchmarks.

Details

Motivation: Current text embedding models produce outputs with a consistent bias across all sentences, which may degrade performance on downstream tasks.

Method: Propose Renormalization - two variants: directly subtracting the bias μ from embeddings, or subtracting the projection of embeddings onto μ. The latter is theoretically predicted and empirically shown to perform better.

Result: Across 38 models, renormalization improves performance by 9.7σ on retrieval, 3.1σ on classification, and 0.8σ on other tasks. The projection-based variant outperforms direct subtraction.

Conclusion: Renormalization is an effective, lightweight solution that consistently improves text embedding model performance by removing inherent bias.

Abstract: We find that current text embedding models produce outputs with a consistent bias, i.e., each embedding vector $e$ can be decomposed as $\tilde{e} + μ$, where $μ$ is almost identical across all sentences. We propose a plug-and-play, training-free and lightweight solution called Renormalization. Through extensive experiments, we show that renormalization consistently and statistically significantly improves the performance of existing models on the Massive Multilingual Text Embedding Benchmark (MMTEB). In particular, across 38 models, renormalization improves performance by 9.7 $σ$ on retrieval tasks, 3.1 $σ$ on classification tasks, and 0.8 $σ$ on other types of tasks. Renormalization has two variants: directly subtracting $μ$ from $e$, or subtracting the projection of $e$ onto $μ$. We theoretically predict that the latter performs better, and our experiments confirm this prediction.
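The two variants described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: estimating $μ$ as the mean embedding of a corpus sample and re-normalizing to unit length after the correction are assumptions, as are the function names.

```python
import numpy as np

def estimate_mean(embeddings):
    """Estimate the shared bias mu as the mean embedding (an assumption)."""
    return embeddings.mean(axis=0)

def renorm_subtract(embeddings, mu):
    """Variant 1: subtract mu from each embedding e, then rescale to unit length."""
    centered = embeddings - mu
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

def renorm_project(embeddings, mu):
    """Variant 2: subtract the projection of each e onto mu, then rescale."""
    mu_hat = mu / np.linalg.norm(mu)
    coeffs = embeddings @ mu_hat                  # length of e along mu
    residual = embeddings - np.outer(coeffs, mu_hat)
    return residual / np.linalg.norm(residual, axis=1, keepdims=True)

# Toy embeddings with a large shared offset standing in for the bias.
rng = np.random.default_rng(0)
e = rng.normal(size=(50, 8)) + 5.0
mu = estimate_mean(e)
projected = renorm_project(e, mu)
print(np.abs(projected @ (mu / np.linalg.norm(mu))).max())  # close to zero
```

The projection variant removes everything along the direction of $μ$, so the corrected vectors are exactly orthogonal to the bias, whereas direct subtraction only shifts them by a fixed offset.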

[59] Can LLMs Detect Their Own Hallucinations?

Sora Kadotani, Kosuke Nishida, Kyosuke Nishida

Main category: cs.CL

TL;DR: LLMs can detect their own hallucinations using Chain-of-Thought prompting, achieving 58.2% detection rate with GPT-3.5 Turbo.

Details

Motivation: Large language models sometimes hallucinate facts, so it's important to investigate whether they can detect their own hallucinations.

Method: Formulated hallucination detection as classification task, proposed framework using Chain-of-Thought to extract knowledge from model parameters.

Result: GPT-3.5 Turbo with CoT detected 58.2% of its own hallucinations.

Conclusion: LLMs with CoT can detect hallucinations if sufficient knowledge is contained in their parameters.

Abstract: Large language models (LLMs) can generate fluent responses, but sometimes hallucinate facts. In this paper, we investigate whether LLMs can detect their own hallucinations. We formulate hallucination detection as a classification task of a sentence. We propose a framework for estimating LLMs’ capability of hallucination detection and a classification method using Chain-of-Thought (CoT) to extract knowledge from their parameters. The experimental results indicated that GPT-3.5 Turbo with CoT detected 58.2% of its own hallucinations. We concluded that LLMs with CoT can detect hallucinations if sufficient knowledge is contained in their parameters.

[60] Analysing Personal Attacks in U.S. Presidential Debates

Ruban Goyal, Rohitash Chandra, Sonit Singh

Main category: cs.CL

TL;DR: Framework for detecting personal attacks in U.S. presidential debates using transformer models and manual annotation across 2016-2024 election cycles.

Details

Motivation: Personal attacks shape public perception in elections; automated detection can improve political discourse transparency and provide insights for journalists and the public.

Method: Manual annotation of debate transcripts, followed by statistical analysis and evaluation of fine-tuned transformer models (BERT) and general-purpose LLMs for attack detection.

Result: Demonstrated the potential of task-specific adaptation of modern language models for detecting personal attacks in formal political speech.

Conclusion: Fine-tuned transformer models and LLMs can effectively contribute to understanding political communication by detecting personal attacks in presidential debates.

Abstract: Personal attacks have become a notable feature of U.S. presidential debates and play an important role in shaping public perception during elections. Detecting such attacks can improve transparency in political discourse and provide insights for journalists, analysts and the public. Advances in deep learning and transformer-based models, particularly BERT and large language models (LLMs), have created new opportunities for automated detection of harmful language. Motivated by these developments, we present a framework for analysing personal attacks in U.S. presidential debates. Our work involves manual annotation of debate transcripts across the 2016, 2020 and 2024 election cycles, followed by statistical and language-model based analysis. We investigate the potential of fine-tuned transformer models alongside general-purpose LLMs to detect personal attacks in formal political speech. This study demonstrates how task-specific adaptation of modern language models can contribute to a deeper understanding of political communication.

Yi Shi, Wenlong Meng, Zhenyuan Guo, Chengkun Wei, Wenzhi Chen

Main category: cs.CL

TL;DR: MemoDetector is a novel framework for Meme Emotion Understanding that uses MLLMs for textual enhancement and dual-stage multimodal fusion, achieving state-of-the-art performance on benchmark datasets.

Details

Motivation: Address two key challenges in meme emotion understanding: lack of fine-grained multimodal fusion strategies and insufficient mining of memes' implicit meanings and background knowledge.

Method: Four-step textual enhancement using MLLMs to extract implicit insights, followed by dual-stage modal fusion with shallow fusion of raw content and deep integration of enhanced features.

Result: Outperforms state-of-the-art baselines with 4.3% F1 improvement on MET-MEME and 3.4% on MOOD datasets.

Conclusion: The proposed approach effectively captures nuanced cross-modal emotional cues and demonstrates strong potential for advancing meme emotion understanding.

Abstract: With the rapid rise of social media and Internet culture, memes have become a popular medium for expressing emotional tendencies. This has sparked growing interest in Meme Emotion Understanding (MEU), which aims to classify the emotional intent behind memes by leveraging their multimodal contents. While existing efforts have achieved promising results, two major challenges remain: (1) a lack of fine-grained multimodal fusion strategies, and (2) insufficient mining of memes’ implicit meanings and background knowledge. To address these challenges, we propose MemoDetector, a novel framework for advancing MEU. First, we introduce a four-step textual enhancement module that utilizes the rich knowledge and reasoning capabilities of Multimodal Large Language Models (MLLMs) to progressively infer and extract implicit and contextual insights from memes. These enhanced texts significantly enrich the original meme contents and provide valuable guidance for downstream classification. Next, we design a dual-stage modal fusion strategy: the first stage performs shallow fusion on raw meme image and text, while the second stage deeply integrates the enhanced visual and textual features. This hierarchical fusion enables the model to better capture nuanced cross-modal emotional cues. Experiments on two datasets, MET-MEME and MOOD, demonstrate that our method consistently outperforms state-of-the-art baselines. Specifically, MemoDetector improves F1 scores by 4.3% on MET-MEME and 3.4% on MOOD. Further ablation studies and in-depth analyses validate the effectiveness and robustness of our approach, highlighting its strong potential for advancing MEU. Our code is available at https://github.com/singing-cat/MemoDetector.

[62] Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition

Yiming Rong, Yixin Zhang, Ziyi Wang, Deyang Jiang, Yunlong Zhao, Haoran Wu, Shiyu Zhou, Bo Xu

Main category: cs.CL

TL;DR: SAP² method improves ASR performance in contextual scenarios by dynamically pruning and integrating relevant keywords using speech-driven attention pooling, achieving state-of-the-art results on SlideSpeech and LibriSpeech datasets.

Details

Motivation: ASR systems struggle with long-context information in domain-specific scenarios like conference presentations due to limited context windows and sparse relevant information within contextual noise.

Method: Proposed SAP² framework with two-stage dynamic pruning and integration of relevant contextual keywords using Speech-Driven Attention-based Pooling mechanism to compress context embeddings while preserving speech-salient information.

Result: Achieved WER of 7.71% on SlideSpeech and 1.12% on LibriSpeech, with 41.1% reduction in biased keyword error rates on SlideSpeech compared to non-contextual baselines. Method shows robust scalability under extensive contextual inputs.

Conclusion: SAP² effectively addresses ASR limitations in contextual scenarios through dynamic keyword pruning and speech-driven attention, demonstrating superior performance and scalability across different datasets.

Abstract: Automatic speech recognition (ASR) systems have achieved remarkable performance in common conditions but often struggle to leverage long-context information in contextualized scenarios that require domain-specific knowledge, such as conference presentations. This challenge arises primarily due to constrained model context windows and the sparsity of relevant information within extensive contextual noise. To solve this, we propose the SAP$^{2}$ method, a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. Specifically, each stage leverages our proposed Speech-Driven Attention-based Pooling mechanism, enabling efficient compression of context embeddings while preserving speech-salient information. Experimental results demonstrate state-of-the-art performance of SAP$^{2}$ on the SlideSpeech and LibriSpeech datasets, achieving word error rates (WER) of 7.71% and 1.12%, respectively. On SlideSpeech, our method notably reduces biased keyword error rates (B-WER) by 41.1% compared to non-contextual baselines. SAP$^{2}$ also exhibits robust scalability, consistently maintaining performance under extensive contextual input conditions on both datasets.
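The headline numbers are word error rates. For reference, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below implements that standard definition. It is not the paper's evaluation script, and it does not cover B-WER, whose exact keyword-restricted definition the abstract does not give.

```python
def word_error_rate(reference, hypothesis):
    """Standard WER: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```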

[63] PRSM: A Measure to Evaluate CLIP’s Robustness Against Paraphrases

Udo Schlegel, Franziska Weeber, Jian Lan, Thomas Seidl

Main category: cs.CL

TL;DR: CLIP’s robustness to paraphrasing is underexplored. This paper introduces PRSM metric to measure CLIP’s sensitivity to paraphrased queries, revealing varying robustness across strategies and gender-associated queries.

Details

Motivation: CLIP performs well on zero-shot/few-shot tasks but its robustness to linguistic variation (paraphrasing) is not well studied. This is crucial for reliable deployment, especially in socially sensitive contexts where inconsistent representations can amplify demographic biases.

Method: Introduce Paraphrase Ranking Stability Metric (PRSM) to quantify CLIP’s sensitivity to paraphrased queries. Use Social Counterfactuals dataset to assess stability under paraphrastic variation and examine interaction between paraphrase robustness and gender.

Result: Robustness varies across paraphrasing strategies, with subtle yet consistent differences observed between male- and female-associated queries.

Conclusion: Paraphrase robustness is important for fairness and equitable deployment of multimodal systems, with gender-associated differences in robustness requiring attention.

Abstract: Contrastive Language-Image Pre-training (CLIP) is a widely used multimodal model that aligns text and image representations through large-scale training. While it performs strongly on zero-shot and few-shot tasks, its robustness to linguistic variation, particularly paraphrasing, remains underexplored. Paraphrase robustness is essential for reliable deployment, especially in socially sensitive contexts where inconsistent representations can amplify demographic biases. In this paper, we introduce the Paraphrase Ranking Stability Metric (PRSM), a novel measure for quantifying CLIP’s sensitivity to paraphrased queries. Using the Social Counterfactuals dataset, a benchmark designed to reveal social and demographic biases, we empirically assess CLIP’s stability under paraphrastic variation, examine the interaction between paraphrase robustness and gender, and discuss implications for fairness and equitable deployment of multimodal systems. Our analysis reveals that robustness varies across paraphrasing strategies, with subtle yet consistent differences observed between male- and female-associated queries.

[64] Adverbs Revisited: Enhancing WordNet Coverage of Adverbs with a Supersense Taxonomy

Jooyoung Lee, Jader Martins Camboim de Sá

Main category: cs.CL

TL;DR: This paper introduces a new supersense typology for adverbs to fill WordNet’s gap in adverb classification, with categories covering manner, temporal, frequency, degree, domain, speaker-oriented, and subject-oriented functions.

Details

Motivation: WordNet has rich supersense hierarchies for nouns and verbs but lacks systematic semantic classification for adverbs, leaving them underdeveloped compared to other parts of speech.

Method: Developed a linguistically grounded supersense typology for adverbs and empirically validated it through a pilot annotation study with human annotators.

Result: The proposed adverb categories provide broad coverage of adverbs in natural text and can be reliably assigned by human annotators, demonstrating the typology’s effectiveness.

Conclusion: Incorporating this adverb supersense typology extends WordNet’s coverage, aligns it better with linguistic theory, and benefits various NLP applications including word sense disambiguation, event extraction, sentiment analysis, and discourse modeling.

Abstract: WordNet offers rich supersense hierarchies for nouns and verbs, yet adverbs remain underdeveloped, lacking a systematic semantic classification. We introduce a linguistically grounded supersense typology for adverbs, empirically validated through annotation, that captures major semantic domains including manner, temporal, frequency, degree, domain, speaker-oriented, and subject-oriented functions. Results from a pilot annotation study demonstrate that these categories provide broad coverage of adverbs in natural text and can be reliably assigned by human annotators. Incorporating this typology extends WordNet’s coverage, aligns it more closely with linguistic theory, and facilitates downstream NLP applications such as word sense disambiguation, event extraction, sentiment analysis, and discourse modeling. We present the proposed supersense categories, annotation outcomes, and directions for future work.

[65] LANE: Lexical Adversarial Negative Examples for Word Sense Disambiguation

Jader Martins Camboim de Sá, Jooyoung Lee, Cédric Pruski, Marcos Da Silveira

Main category: cs.CL

TL;DR: LANE is an adversarial training method that improves fine-grained word meaning resolution in neural language models by generating challenging negative examples through selective word marking.

Details

Motivation: Neural language models often overfit to global sentence representations and fail to capture local semantic details, making fine-grained word meaning resolution a critical challenge.

Method: Proposes LANE - an adversarial training strategy that shifts learning focus to target words by generating negative examples through selective marking of alternate words, forcing greater separability between same sentences with different marked words.

Result: Experimental results on lexical semantic change detection and word sense disambiguation benchmarks show improved performance over standard contrastive learning baselines, with more discriminative word representations that better capture subtle meaning differences.

Conclusion: LANE is a model-agnostic method that can be integrated into existing representation learning frameworks to enhance fine-grained word meaning resolution capabilities.

Abstract: Fine-grained word meaning resolution remains a critical challenge for neural language models (NLMs) as they often overfit to global sentence representations, failing to capture local semantic details. We propose a novel adversarial training strategy, called LANE, to address this limitation by deliberately shifting the model’s learning focus to the target word. This method generates challenging negative training examples through the selective marking of alternate words in the training set. The goal is to force the model to create a greater separability between same sentences with different marked words. Experimental results on lexical semantic change detection and word sense disambiguation benchmarks demonstrate that our approach yields more discriminative word representations, improving performance over standard contrastive learning baselines. We further provide qualitative analyses showing that the proposed negatives lead to representations that better capture subtle meaning differences even in challenging environments. Our method is model-agnostic and can be integrated into existing representation learning frameworks.
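
The negative-example construction described above can be illustrated with a minimal Python sketch. It assumes the target word is wrapped in hypothetical `[TGT]` / `[/TGT]` markers and that one negative is formed for every other word position in the same sentence. The abstract does not specify the marking scheme or how alternate words are selected, so both are assumptions made for illustration.

```python
def mark(tokens, index, open_tag="[TGT]", close_tag="[/TGT]"):
    """Return the sentence with the token at `index` wrapped in marker tags.

    The tag strings are hypothetical; the paper's marker format is not given.
    """
    return " ".join(
        f"{open_tag} {tok} {close_tag}" if i == index else tok
        for i, tok in enumerate(tokens)
    )


def make_negatives(tokens, target_index):
    """Same sentence, different marked word: one negative per non-target position."""
    return [mark(tokens, i) for i in range(len(tokens)) if i != target_index]


tokens = "the bank raised interest rates".split()
anchor = mark(tokens, 1)               # sentence with the target word "bank" marked
negatives = make_negatives(tokens, 1)  # identical sentences with other words marked
```

In a contrastive setup, `anchor` and each entry of `negatives` share the same global sentence content, so a model can only separate them by attending to the marked word.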

[66] KGQuest: Template-Driven QA Generation from Knowledge Graphs with LLM-Based Refinement

Sania Nayab, Marco Simoni, Giulio Rossolini, Andrea Saracino

Main category: cs.CL

TL;DR: A scalable pipeline for generating natural language QA pairs from knowledge graphs using template clustering and LLM refinement.

Details

Motivation: Existing approaches for QA generation from knowledge graphs struggle with scalability, linguistic quality, and factual consistency.

Method: Clusters KG triplets by relations, creates reusable templates via natural language rules, refines templates with LLMs for clarity, and instantiates answer options with distractors from KG.

Result: The hybrid approach efficiently generates high-quality QA pairs combining scalability with fluency and linguistic precision.

Conclusion: The proposed pipeline successfully addresses scalability and quality issues in QA generation from knowledge graphs through a deterministic approach enhanced by LLM refinement.

Abstract: The generation of questions and answers (QA) from knowledge graphs (KG) plays a crucial role in the development and testing of educational platforms, dissemination tools, and large language models (LLM). However, existing approaches often struggle with scalability, linguistic quality, and factual consistency. This paper presents a scalable and deterministic pipeline for generating natural language QA from KGs, with an additional refinement step using LLMs to further enhance linguistic quality. The approach first clusters KG triplets based on their relations, creating reusable templates through natural language rules derived from the entity types of objects and relations. A module then leverages LLMs to refine these templates, improving clarity and coherence while preserving factual accuracy. Finally, the instantiation of answer options is achieved through a selection strategy that introduces distractors from the KG. Our experiments demonstrate that this hybrid approach efficiently generates high-quality QA pairs, combining scalability with fluency and linguistic precision.
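
The deterministic part of the pipeline described above (relation clustering, template instantiation, distractor selection) can be sketched as follows. The triples, template strings, and function names are hypothetical, and the LLM refinement step is omitted; this is an illustration under those assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

# Hypothetical toy triples (subject, relation, object); not taken from the paper.
TRIPLES = [
    ("Paris", "capital_of", "France"),
    ("Rome", "capital_of", "Italy"),
    ("Berlin", "capital_of", "Germany"),
    ("Marie Curie", "born_in", "Warsaw"),
    ("Alan Turing", "born_in", "London"),
    ("Dante", "born_in", "Florence"),
]

# Hypothetical hand-written templates, one per relation.
TEMPLATES = {
    "capital_of": "Which country is {subject} the capital of?",
    "born_in": "In which city was {subject} born?",
}


def cluster_by_relation(triples):
    """Group triples so that each relation shares one reusable template."""
    clusters = defaultdict(list)
    for subject, relation, obj in triples:
        clusters[relation].append((subject, obj))
    return clusters


def make_question(subject, relation, answer, clusters, rng, n_distractors=2):
    """Instantiate a template and draw distractors from objects of the same relation."""
    pool = sorted({obj for _, obj in clusters[relation] if obj != answer})
    options = rng.sample(pool, min(n_distractors, len(pool))) + [answer]
    rng.shuffle(options)
    return {
        "question": TEMPLATES[relation].format(subject=subject),
        "options": options,
        "answer": answer,
    }


clusters = cluster_by_relation(TRIPLES)
qa = make_question("Paris", "capital_of", "France", clusters, random.Random(0))
```

Drawing distractors from objects of the same relation keeps the wrong options type-consistent with the correct answer, which is one plausible reading of "distractors from the KG".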

[67] destroR: Attacking Transfer Models with Obfuscous Examples to Discard Perplexity

Saadat Rafid Ahmed, Rubayet Shareen, Radoan Sharkar, Nazia Hossain, Mansur Mahi, Farig Yousuf Sadeque

Main category: cs.CL

TL;DR: This paper analyzes and develops novel adversarial attack strategies against state-of-the-art machine learning models, focusing on creating ambiguous inputs to confuse models and improve their robustness, with special attention to Bangla Language attacks.

Details

Motivation: Recent research has shown machine learning models are vulnerable to various attacks, putting both models and systems at risk. The authors aim to address these vulnerabilities by analyzing existing adversarial attack recipes and creating new ones.

Method: Develop adversarial instances with maximum perplexity using machine learning and deep learning approaches. Analyze several datasets and create obfuscous adversary examples to put models in a state of perplexity, including Bangla Language in adversarial attacks.

Result: Not explicitly stated in the abstract, but the work focuses on developing novel attack strategies and adversarial examples.

Conclusion: The research contributes to the development of model robustness by creating effective adversarial attacks, with emphasis on utility usage reduction and efficiency throughout the work.

Abstract: Advancements in Machine Learning & Neural Networks in recent years have led to widespread implementations of Natural Language Processing across a variety of fields with remarkable success, solving a wide range of complicated problems. However, recent research has shown that machine learning models may be vulnerable in a number of ways, putting both the models and the systems they’re used in at risk. In this paper, we intend to analyze and experiment with the best of existing adversarial attack recipes and create new ones. We concentrated on developing a novel adversarial attack strategy on current state-of-the-art machine learning models by producing ambiguous inputs for the models to confound them and then constructing the path to the future development of the robustness of the models. We will develop adversarial instances with maximum perplexity, utilizing machine learning and deep learning approaches in order to trick the models. In our attack recipe, we will analyze several datasets and focus on creating obfuscous adversary examples to put the models in a state of perplexity, and on including the Bangla Language in the field of adversarial attacks. We strictly uphold utility usage reduction and efficiency throughout our work.

[68] LAET: A Layer-wise Adaptive Ensemble Tuning Framework for Pretrained Language Models

Jawad Ibn Ahad, Muhammad Rafsan Kabir, Robin Krambroeckers, Sifat Momen, Nabeel Mohammed, Shafin Rahman

Main category: cs.CL

TL;DR: LAET is a novel fine-tuning strategy that selectively tunes only the most effective layers of pre-trained LLMs, reducing computational costs while improving performance in financial NLP tasks.

Details

Motivation: Address the high computational demands of large language models in financial NLP that limit accessibility for many organizations, despite their strong performance in tasks like sentiment analysis and stock prediction.

Method: Layer-wise Adaptive Ensemble Tuning (LAET) analyzes hidden state representations to identify and selectively fine-tune only the most effective layers of pre-trained LLMs, while freezing less critical layers to reduce computational overhead.

Result: LAET significantly reduces computational costs while enhancing task-specific performance, outperforming existing benchmarks and state-of-the-art LLMs like GPT-4 even with smaller models (~3B parameters).

Conclusion: LAET bridges cutting-edge financial NLP research with real-world deployment by providing efficient and scalable models for financial applications.

Abstract: Natural Language Processing (NLP) has transformed the financial industry, enabling advancements in areas such as textual analysis, risk management, and forecasting. Large language models (LLMs) like BloombergGPT and FinMA have set new benchmarks across various financial NLP tasks, including sentiment analysis, stock movement prediction, and credit risk assessment. Furthermore, FinMA-ES, a bilingual financial LLM, has also demonstrated strong performance using the FLARE and FLARE-ES benchmarks. However, the high computational demands of these models limit their accessibility for many organizations. To address this, we propose Layer-wise Adaptive Ensemble Tuning (LAET), a novel strategy that selectively fine-tunes the most effective layers of pre-trained LLMs by analyzing hidden state representations while freezing less critical layers. LAET significantly reduces computational overhead while enhancing task-specific performance. Our approach shows strong results in financial NLP tasks, outperforming existing benchmarks and state-of-the-art LLMs such as GPT-4, even with smaller LLMs (~3B parameters). This work bridges cutting-edge financial NLP research and real-world deployment with efficient and scalable models for financial applications.

[69] NOVA: An Agentic Framework for Automated Histopathology Analysis and Discovery

Anurag J. Vaidya, Felix Meissen, Daniel C. Castro, Shruthi Bannur, Tristan Lazard, Drew F. K. Williamson, Faisal Mahmood, Javier Alvarez-Valle, Stephanie L. Hyland, Kenza Bouzid

Main category: cs.CL

TL;DR: NOVA is an agentic framework that translates scientific queries into executable analysis pipelines for digitized histopathology, outperforming coding-agent baselines on the SlideQuest benchmark.

Details

Motivation: Digitized histopathology analysis is complex, time-intensive, and requires specialized expertise, limiting accessibility for researchers and clinicians.

Method: NOVA iteratively generates and runs Python code, integrating 49 domain-specific tools (e.g., nuclei segmentation, whole-slide encoding) and can create new tools ad hoc. It’s evaluated on SlideQuest, a 90-question benchmark requiring multi-step reasoning and computational problem solving.

Result: NOVA outperforms coding-agent baselines in quantitative evaluation. A pathologist-verified case study successfully links morphology to prognostically relevant PAM50 subtypes, demonstrating scalable discovery potential.

Conclusion: NOVA provides a scalable framework for automated histopathology analysis that can translate scientific queries into executable pipelines, enabling broader accessibility and discovery potential in digital pathology.

Abstract: Digitized histopathology analysis involves complex, time-intensive workflows and specialized expertise, limiting its accessibility. We introduce NOVA, an agentic framework that translates scientific queries into executable analysis pipelines by iteratively generating and running Python code. NOVA integrates 49 domain-specific tools (e.g., nuclei segmentation, whole-slide encoding) built on open-source software, and can also create new tools ad hoc. To evaluate such systems, we present SlideQuest, a 90-question benchmark – verified by pathologists and biomedical scientists – spanning data processing, quantitative analysis, and hypothesis testing. Unlike prior biomedical benchmarks focused on knowledge recall or diagnostic QA, SlideQuest demands multi-step reasoning, iterative coding, and computational problem solving. Quantitative evaluation shows NOVA outperforms coding-agent baselines, and a pathologist-verified case study links morphology to prognostically relevant PAM50 subtypes, demonstrating its scalable discovery potential.

[70] LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models

Jian Gao, Richeng Xuan, Zhaolu Kang, Dingshi Liao, Wenxin Huang, Zongmou Huang, Yangdi Xu, Bowen Qin, Zheqi He, Xi Yang, Changjin Li

Main category: cs.CL

TL;DR: LaoBench is the first comprehensive benchmark for evaluating LLMs in Lao language, featuring 17,000+ samples across knowledge application, K12 education, and bilingual translation tasks.

Details

Motivation: There is a significant gap in evaluating LLMs for low-resource Southeast Asian languages like Lao, which lack dedicated benchmarks despite rapid LLM advancements.

Method: Created a large-scale dataset with expert human curation and automated agent-assisted verification, divided into open-source and closed-source subsets for fair black-box evaluation.

Result: Current state-of-the-art LLMs show significant challenges in mastering Lao across diverse tasks, highlighting the need for improved language capabilities.

Conclusion: LaoBench will catalyze further AI research and development for underrepresented Southeast Asian languages by providing a standardized evaluation framework.

Abstract: The rapid advancement of large language models (LLMs) has not been matched by their evaluation in low-resource languages, especially Southeast Asian languages like Lao. To fill this gap, we introduce LaoBench, the first large-scale, high-quality, and multidimensional benchmark dataset dedicated to assessing LLMs’ comprehensive language understanding and reasoning abilities in Lao. LaoBench comprises over 17,000 carefully curated samples spanning three core dimensions: knowledge application, K12 foundational education, and bilingual translation among Lao, Chinese, and English. The dataset is divided into open-source and closed-source subsets, with the closed-source portion enabling black-box evaluation on an official platform to ensure fairness and data security. Our data construction pipeline integrates expert human curation with automated agent-assisted verification, ensuring linguistic accuracy, cultural relevance, and educational value. Benchmarking multiple state-of-the-art LLMs on LaoBench reveals that current models still face significant challenges in mastering Lao across diverse tasks. We hope LaoBench will catalyze further research and development of AI technologies for underrepresented Southeast Asian languages.

[71] M-DAIGT: A Shared Task on Multi-Domain Detection of AI-Generated Text

Salima Lamsiyah, Saad Ezzini, Abdelkader El Mahdaouy, Hamza Alami, Abdessamad Benlahbib, Samir El Amrany, Salmane Chafik, Hicham Hammouchi

Main category: cs.CL

TL;DR: M-DAIGT shared task for detecting AI-generated text across news and academic domains using a new 30,000-sample benchmark dataset.

Details

Motivation: The generation of highly fluent text by Large Language Models poses challenges to information integrity and academic research, necessitating reliable detection methods.

Method: Two binary classification subtasks: News Article Detection and Academic Writing Detection, supported by a balanced dataset of human-written and AI-generated texts from various LLMs with diverse prompting strategies.

Result: 46 teams registered, 4 teams submitted final results for both subtasks. The paper describes methods used by participating teams.

Conclusion: The M-DAIGT shared task establishes a foundation for AI-generated text detection research and discusses future directions for the initiative.

Abstract: The generation of highly fluent text by Large Language Models (LLMs) poses a significant challenge to information integrity and academic research. In this paper, we introduce the Multi-Domain Detection of AI-Generated Text (M-DAIGT) shared task, which focuses on detecting AI-generated text across multiple domains, particularly in news articles and academic writing. M-DAIGT comprises two binary classification subtasks: News Article Detection (NAD) (Subtask 1) and Academic Writing Detection (AWD) (Subtask 2). To support this task, we developed and released a new large-scale benchmark dataset of 30,000 samples, balanced between human-written and AI-generated texts. The AI-generated content was produced using a variety of modern LLMs (e.g., GPT-4, Claude) and diverse prompting strategies. A total of 46 unique teams registered for the shared task, of which four teams submitted final results. All four teams participated in both Subtask 1 and Subtask 2. We describe the methods employed by these participating teams and briefly discuss future directions for M-DAIGT.

[72] Studies with impossible languages falsify LMs as models of human language

Jeffrey S. Bowers, Jeff Mitchell

Main category: cs.CL

TL;DR: LMs learn attested and impossible languages equally well when complexity is controlled, unlike humans who find unnatural structures harder to learn.

Details

Motivation: To investigate whether language models share human inductive biases that make natural languages easier to learn than unnatural ones.

Method: Reviewing literature comparing language model performance on attested vs impossible languages, controlling for complexity.

Result: LMs often learn attested and impossible languages equally well when complexity is accounted for, unlike humans who show preference for natural structures.

Conclusion: Language models lack human inductive biases that support language acquisition, as they don’t differentiate between natural and unnatural language structures when complexity is controlled.

Abstract: According to Futrell and Mahowald [arXiv:2501.17047], both infants and language models (LMs) find attested languages easier to learn than impossible languages that have unnatural structures. We review the literature and show that LMs often learn attested and many impossible languages equally well. Difficult-to-learn impossible languages are simply more complex (or random). LMs are missing human inductive biases that support language acquisition.

[73] MajinBook: An open catalogue of digital world literature with likes

Antoine Mazières, Thierry Poibeau

Main category: cs.CL

TL;DR: MajinBook is an open catalogue linking shadow library metadata with Goodreads data, creating a corpus of 539,000+ English books with publication dates, genres, and popularity metrics, while addressing biases and legal considerations.

Details

Motivation: To facilitate computational social science and cultural analytics using shadow libraries by overcoming limitations of traditional corpora like HathiTrust and providing enriched, machine-readable data.

Method: Linking metadata from shadow libraries (Library Genesis, Z-Library) with structured bibliographic data from Goodreads, prioritizing EPUB files for quality, and evaluating linkage accuracy.

Result: Created a high-precision corpus of 539,000+ English books spanning three centuries with enriched metadata including publication dates, genres, ratings, and reviews, plus secondary datasets for French, German, and Spanish.

Conclusion: MajinBook provides a valuable resource for research while addressing legal permissibility under EU and US text and data mining frameworks, with all data released openly.

Abstract: This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries–such as Library Genesis and Z-Library–for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to English-language books spanning three centuries, enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritizes natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project’s legal permissibility under EU and US frameworks for text and data mining in research.

[74] W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search

Zhenyu Ding, Yuhao Wang, Tengyue Xiao, Haoying Wang, Guojun Ma, Mingyang Wan, Caigui Jiang, Ning Ding

Main category: cs.CL

TL;DR: W2S-AlignTree is a plug-and-play inference-time alignment framework that combines Monte Carlo Tree Search with Weak-to-Strong Generalization to align LLM outputs with human preferences without parameter modification.

Details

Motivation: Current LLM alignment methods like RLHF are costly, lack scalability, and offer limited dynamic control during inference, creating need for adaptable alignment mechanisms.

Method: Formulates LLM alignment as optimal heuristic search using MCTS with weak model’s step-level signals as alignment proxies, incorporating Entropy-Aware exploration to balance exploration and exploitation.

Result: Consistently outperforms baselines across sentiment generation, summarization, and instruction-following tasks, improving Llama3-8B from 1.89 to 2.19 (15.9% relative improvement) on summarization.

Conclusion: W2S-AlignTree provides scalable, fine-grained alignment during inference without modifying model parameters, addressing limitations of training-time alignment methods.

Abstract: Large Language Models (LLMs) demonstrate impressive capabilities, yet their outputs often suffer from misalignment with human preferences due to the inadequacy of weak supervision and a lack of fine-grained control. Training-time alignment methods like Reinforcement Learning from Human Feedback (RLHF) face prohibitive costs in expert supervision and inherent scalability limitations, offering limited dynamic control during inference. Consequently, there is an urgent need for scalable and adaptable alignment mechanisms. To address this, we propose W2S-AlignTree, a pioneering plug-and-play inference-time alignment framework that synergistically combines Monte Carlo Tree Search (MCTS) with the Weak-to-Strong Generalization paradigm for the first time. W2S-AlignTree formulates LLM alignment as an optimal heuristic search problem within a generative search tree. By leveraging the weak model’s real-time, step-level signals as alignment proxies and introducing an Entropy-Aware exploration mechanism, W2S-AlignTree enables fine-grained guidance during the strong model’s generation without modifying its parameters. The approach dynamically balances exploration and exploitation in high-dimensional generation search trees. Experiments across controlled sentiment generation, summarization, and instruction-following show that W2S-AlignTree consistently outperforms strong baselines. Notably, W2S-AlignTree raises the performance of Llama3-8B from 1.89 to 2.19, a relative improvement of 15.9% on the summarization task.
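
The relative improvement quoted for Llama3-8B follows directly from the two reported scores. The short Python check below recomputes it; the variable names are illustrative only.

```python
baseline = 1.89  # reported Llama3-8B summarization score before W2S-AlignTree
improved = 2.19  # reported score with W2S-AlignTree

# Relative improvement as a percentage of the baseline score.
relative_gain = (improved - baseline) / baseline * 100
print(f"{relative_gain:.1f}%")  # prints "15.9%"
```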

[75] PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, Mohammad Mahmoudi Meymand, Gurshaan Chattha, Paula Rodriguez, Diego Mares, Pavit Singh, Michael Liu, Subodh Chawla, Pete Cline, Lucy Ogaz, Ernesto Hernandez, Zihao Wang, Pavi Bhatter, Marcos Ayestaran, Bing Liu, Yunzhong He

Main category: cs.CL

TL;DR: PRBench is a realistic benchmark of 1,100 expert-authored tasks in Finance and Law with 19,356 expert-curated criteria, revealing significant gaps in AI model performance for professional reasoning tasks.

Details

Motivation: Existing academic benchmarks provide limited view of real-world professional performance in high-stakes domains like Legal and Finance, failing to assess open-ended, economically consequential tasks.

Method: Created PRBench with 1,100 tasks from 182 qualified professionals (JDs, CFAs, 6+ years experience), spanning 114 countries and 47 US jurisdictions, using expert-curated rubrics validated through rigorous quality pipeline.

Result: Evaluation of 20 leading models shows substantial room for improvement - top scores only 0.39 (Finance) and 0.37 (Legal) on Hard subsets. Models with similar overall scores diverge significantly on specific capabilities.

Conclusion: Models show critical gaps in reliability for professional adoption, with common failure modes including inaccurate judgments, lack of process transparency, and incomplete reasoning.

Abstract: Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. Existing evaluations often fail to assess open-ended, economically consequential tasks in high-stakes domains like Legal and Finance, where practical returns are paramount. To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public, rubric-based benchmark for both legal and finance domains. We recruit 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contributed tasks inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog associated economic impacts of the prompts and analyze performance using human-annotated rubric categories. Our analysis shows that models with similar overall scores can diverge significantly on specific capabilities. Common failure modes include inaccurate judgments, a lack of process transparency and incomplete reasoning, highlighting critical gaps in their reliability for professional adoption.

[76] Identifying and Analyzing Performance-Critical Tokens in Large Language Models

Yu Bai, Heyan Huang, Cesare Spinoso-Di Piano, Marc-Antoine Rondeau, Sanxing Chen, Yang Gao, Jackie Chi Kit Cheung

Main category: cs.CL

TL;DR: LLMs use template and stopword tokens more than content tokens for in-context learning performance, contrasting human attention patterns.

Details

Motivation: To understand how LLMs leverage demonstrations in ICL and identify which token types are performance-critical, contrasting with human learning patterns.

Method: Categorize tokens into content, stopword, and template types; ablate representations from attention; analyze distinguishing characteristics through experiments.

Result: Template and stopword tokens are more performance-critical than informative content tokens; these critical tokens aggregate information from content tokens.

Conclusion: LLMs learn tasks differently from humans, relying more on structural cues and repetition in template/stopword tokens rather than informative content tokens.

Abstract: In-context learning (ICL) has emerged as an effective solution for few-shot learning with large language models (LLMs). However, how LLMs leverage demonstrations to specify a task and learn a corresponding computational function through ICL is underexplored. Drawing from the way humans learn from content-label mappings in demonstrations, we categorize the tokens in an ICL prompt into content, stopword, and template tokens. Our goal is to identify the types of tokens whose representations directly influence LLM’s performance, a property we refer to as being performance-critical. By ablating representations from the attention of the test example, we find that the representations of informative content tokens have less influence on performance compared to template and stopword tokens, which contrasts with the human attention to informative words. We give evidence that the representations of performance-critical tokens aggregate information from the content tokens. Moreover, we demonstrate experimentally that lexical meaning, repetition, and structural cues are the main distinguishing characteristics of these tokens. Our work sheds light on how large language models learn to perform tasks from demonstrations and deepens our understanding of the roles different types of tokens play in large language models.
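
The three-way token categorization described above can be sketched in Python. The stopword list, the template vocabulary, and the prompt format are hypothetical and deliberately tiny. Note also that this sketch files the label word under content, which may differ from the paper's treatment.

```python
# Hypothetical, minimal stopword list and template vocabulary (illustration only).
STOPWORDS = {"the", "a", "an", "is", "was", "of", "and", "to"}
TEMPLATE_TOKENS = {"Review", "Sentiment", ":"}


def categorize(token):
    """Assign a token to one of the three categories named in the abstract."""
    if token in TEMPLATE_TOKENS:
        return "template"
    if token.lower() in STOPWORDS:
        return "stopword"
    return "content"


# A whitespace-tokenized demonstration in an assumed "Review / Sentiment" format.
prompt = "Review : the movie was great Sentiment : positive"
labels = [(tok, categorize(tok)) for tok in prompt.split()]
```

An ablation study in the spirit of the paper would then mask the representations of one category at a time and compare task accuracy; that step requires model internals and is not shown here.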

[77] Metric Learning Encoding Models: A Multivariate Framework for Interpreting Neural Representations

Louis Jalouzot, Christophe Pallier, Emmanuel Chemla, Yair Lakretz

Main category: cs.CL

TL;DR: MLEMs use metric learning to match theoretical feature distances with neural distances, improving feature importance analysis in neural systems over existing methods.

Details

Motivation: To address the challenge of understanding how explicit theoretical features are encoded in opaque neural systems, bridging neuroscience and AI.

Method: Metric Learning Encoding Models (MLEMs) learn a metric that models feature distances and interactions, building on second-order isomorphism methods like RSA.

Result: MLEMs outperform state-of-the-art methods in recovering ground-truth features in synthetic data and show stronger robustness to noise in real language data analysis.

Conclusion: MLEMs provide an effective framework for measuring feature importance in neural representations across various domains like language, vision, and audition.

Abstract: Understanding how explicit theoretical features are encoded in opaque neural systems is a central challenge now common to neuroscience and AI. We introduce Metric Learning Encoding Models (MLEMs) to address this challenge most directly as a metric learning problem: we fit the distance in the space of theoretical features to match the distance in neural space. Our framework improves on univariate encoding and decoding methods by building on second-order isomorphism methods, such as Representational Similarity Analysis, and extends them by learning a metric that efficiently models features as well as interactions between them. The effectiveness of MLEM is validated through two sets of simulations. First, MLEMs recover ground-truth importance features in synthetic datasets better than state-of-the-art methods, such as Feature Reweighted RSA (FR-RSA). Second, we deploy MLEMs on real language data, where they show stronger robustness to noise in calculating the importance of linguistic features (gender, tense, etc.). MLEMs are applicable to any domain where theoretical features can be identified, such as language, vision, audition, etc. We release optimized code applicable to measure feature importance in the representations of any artificial neural networks or empirical neural data at https://github.com/LouisJalouzot/MLEM.

[78] Survey in Characterization of Semantic Change

Jader Martins Camboim de Sá, Marcos Da Silveira, Cédric Pruski

Main category: cs.CL

TL;DR: Survey paper on semantic change characterization, defining three classes: dimension (generalization/narrowing), orientation (pejorative/positive), and relation (metaphoric/metonymic) changes.

Details

Motivation: Semantic changes impact computational linguistics algorithms like translation and information retrieval, requiring formal characterization to understand how word meanings evolve over time.

Method: Survey and analysis of existing approaches, formal definition of three characterization classes, and summary of publications in a comparative table.

Result: Comprehensive overview of semantic change characterization methods, identification of research needs and trends in the field.

Conclusion: Formal characterization of semantic changes is crucial for improving computational linguistics applications and understanding language evolution.

Abstract: Living languages continuously evolve to integrate the cultural change of human societies. This evolution manifests through neologisms (new words) or semantic changes of words (new meaning to existing words). Understanding the meaning of words is vital for interpreting texts coming from different cultures (regionalism or slang), domains (e.g., technical terms), or periods. In computer science, these words are relevant to computational linguistics algorithms such as translation, information retrieval, question answering, etc. Semantic changes can potentially impact the quality of the outcomes of these algorithms. Therefore, it is important to understand and characterize these changes formally. The study of this impact is a recent problem that has attracted the attention of the computational linguistics community. Several approaches propose methods to detect semantic changes with good precision, but more effort is needed to characterize how the meaning of words changes and to reason about how to reduce the impact of semantic change. This survey provides an understandable overview of existing approaches to the characterization of semantic changes and also formally defines three classes of characterizations: if the meaning of a word becomes more general or narrow (change in dimension), if the word is used in a more pejorative or positive/ameliorated sense (change in orientation), and if there is a trend to use the word in a, for instance, metaphoric or metonymic context (change in relation). We summarized the main aspects of the selected publications in a table and discussed the needs and trends in the research activities on semantic change characterization.

[79] Improving the Downstream Performance of Mixture-of-Experts Transformers via Weak Vanilla Transformers

Xin Lu, Yanyan Zhao, Bing Qin, Ting Liu

Main category: cs.CL

TL;DR: MoE Transformers underperform vanilla Transformers in downstream tasks due to poorer transfer capability. The paper proposes transfer capability distillation where vanilla models teach MoE models, significantly improving downstream performance.

Details

Motivation: MoE Transformers have advantages in model capacity and computational efficiency but underperform vanilla Transformers in downstream tasks, diminishing their practical value.

Method: Proposed transfer capability distillation where vanilla models serve as teachers to guide MoE models. Designed specific distillation method and conducted experiments on BERT architecture.

Result: Experimental results show significant improvement in downstream performance of MoE models. Further evidence strongly supports the concept of transfer capability distillation.

Conclusion: Transfer capability distillation enables MoE models to achieve both strong pre-training performance and transfer capability, enhancing downstream task performance. The paper provides insights from the perspective of model features.

Abstract: Recently, Mixture of Experts (MoE) Transformers have garnered increasing attention due to their advantages in model capacity and computational efficiency. However, studies have indicated that MoE Transformers underperform vanilla Transformers in many downstream tasks, significantly diminishing the practical value of MoE models. To explain this issue, we propose that the pre-training performance and transfer capability of a model are joint determinants of its downstream task performance. MoE models, in comparison to vanilla models, have poorer transfer capability, leading to their subpar performance in downstream tasks. To address this issue, we introduce the concept of transfer capability distillation, positing that although vanilla models have weaker performance, they are effective teachers of transfer capability. The MoE models guided by vanilla models can achieve both strong pre-training performance and transfer capability, ultimately enhancing their performance in downstream tasks. We design a specific distillation method and conduct experiments on the BERT architecture. Experimental results show a significant improvement in downstream performance of MoE models, and further evidence also strongly supports the concept of transfer capability distillation. Finally, we attempt to interpret transfer capability distillation and provide some insights from the perspective of model features.

[80] Benchmarking Retrieval-Augmented Large Language Models in Biomedical NLP: Application, Robustness, and Self-Awareness

Mingchen Li, Zaifu Zhan, Han Yang, Yongkang Xiao, Jiatan Huang, Rui Zhang

Main category: cs.CL

TL;DR: This paper systematically evaluates retrieval-augmented LLMs (RALs) on 5 biomedical NLP tasks, analyzing their performance across 4 fundamental abilities: unlabeled robustness, counterfactual robustness, diverse robustness, and negative awareness.

Details

Motivation: To address the lack of rigorous evaluation of RALs' impact on different biomedical NLP tasks and their sensitivity to unlabeled, counterfactual, or diverse knowledge that is common in real-world biomedical applications.

Method: Proposed an evaluation framework to assess RALs’ performance on 5 biomedical tasks (triple extraction, link prediction, classification, QA, NLI) using 4 testbeds based on fundamental abilities. Evaluated 3 representative LLMs with 3 different retrievers on 9 datasets.

Result: The paper establishes comprehensive testbeds and evaluation framework, but specific performance results are not provided in the abstract.

Conclusion: Systematic investigation reveals the need for rigorous evaluation of RALs in biomedical domain, particularly regarding their robustness to different types of knowledge and self-awareness capabilities.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in various biomedical natural language processing (NLP) tasks, leveraging the demonstration within the input context to adapt to new tasks. However, LLMs are sensitive to the selection of demonstrations. To address the hallucination issue inherent in LLMs, retrieval-augmented LLMs (RALs) offer a solution by retrieving pertinent information from an established database. Nonetheless, existing research work lacks rigorous evaluation of the impact of retrieval-augmented large language models on different biomedical NLP tasks. This deficiency makes it challenging to ascertain the capabilities of RALs within the biomedical domain. Moreover, the outputs from RALs are affected by retrieving unlabeled, counterfactual, or diverse knowledge, which is not well studied in the biomedical domain even though such knowledge is common in the real world. Finally, exploring the self-awareness ability is also crucial for the RAL system. So, in this paper, we systematically investigate the impact of RALs on 5 different biomedical tasks (triple extraction, link prediction, classification, question answering, and natural language inference). We analyze the performance of RALs in four fundamental abilities, including unlabeled robustness, counterfactual robustness, diverse robustness, and negative awareness. To this end, we propose an evaluation framework to assess the RALs’ performance on different biomedical NLP tasks and establish four different testbeds based on the aforementioned fundamental abilities. Then, we evaluate 3 representative LLMs with 3 different retrievers on 5 tasks over 9 datasets.

[81] Are language models rational? The case of coherence norms and belief revision

Thomas Hofweber, Peter Hase, Elias Stengel-Eskin, Mohit Bansal

Main category: cs.CL

TL;DR: Investigates whether coherence norms of rationality apply to language models, proposing a new credence account based on next token probabilities and arguing that rational norms apply to some models but not others.

Details

Motivation: To determine if norms of rationality, particularly coherence norms, apply to machine learning models like language models, given the importance of rationality for predicting behavior and its connection to AI safety and alignment.

Method: Introduces the Minimal Assent Connection (MAC) and proposes a new account of credence that assigns strength of belief based on model internal next token probabilities. Examines both logical coherence norms and coherence norms tied to belief strength.

Result: The analysis shows that rational norms tied to coherence do apply to some language models but not to others.

Conclusion: Rationality norms are significant for understanding model behavior and have implications for AI safety and alignment, with coherence norms being applicable selectively across different language models.

Abstract: Do norms of rationality apply to machine learning models, in particular language models? In this paper we investigate this question by focusing on a special subset of rational norms: coherence norms. We consider both logical coherence norms as well as coherence norms tied to the strength of belief. To make sense of the latter, we introduce the Minimal Assent Connection (MAC) and propose a new account of credence, which captures the strength of belief in language models. This proposal uniformly assigns strength of belief simply on the basis of model internal next token probabilities. We argue that rational norms tied to coherence do apply to some language models, but not to others. This issue is significant since rationality is closely tied to predicting and explaining behavior, and thus it is connected to considerations about AI safety and alignment, as well as understanding model behavior more generally.
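To make the idea of reading strength of belief off next-token probabilities concrete, here is a minimal sketch. It is one possible reading, not the paper's definition of the Minimal Assent Connection: it assumes the model is asked a yes/no question about a proposition and that credence is the probability of "Yes" renormalized over the two answer tokens. The function names, the answer tokens, and the logit values are hypothetical.

```python
import math

def softmax(logits):
    """Turn a token -> logit mapping into a token -> probability mapping."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def credence(logits, yes="Yes", no="No"):
    """ASSUMPTION: credence = P(yes) / (P(yes) + P(no)) at the answer position."""
    probs = softmax(logits)
    return probs[yes] / (probs[yes] + probs[no])

def negation_incoherence(cred_p, cred_not_p):
    """Distance from the probabilistic norm cr(P) + cr(not P) = 1."""
    return abs(cred_p + cred_not_p - 1.0)

# Hypothetical next-token logits after asking about P and about not-P.
logits_p = {"Yes": 2.0, "No": 0.0, "Maybe": 1.0}
logits_not_p = {"Yes": -1.0, "No": 1.0, "Maybe": 0.5}

cr_p = credence(logits_p)
cr_not_p = credence(logits_not_p)
gap = negation_incoherence(cr_p, cr_not_p)
```

A gap near zero would indicate that the two credences respect the negation norm; a large gap is the kind of incoherence that a credence-based coherence norm would flag.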

[82] RASTeR: Robust, Agentic, and Structured Temporal Reasoning

Dan Schumacher, Fatemeh Haji, Tara Grey, Niharika Bandlamudi, Nupoor Karnik, Gagana Uday Kumar, Jason Cho-Yu Chiang, Paul Rad, Nishant Vishwamitra, Anthony Rios

Main category: cs.CL

TL;DR: RASTeR is a prompting framework for robust temporal question answering that separates context evaluation from answer generation, using temporal knowledge graphs and selective context correction.

Details

Motivation: Temporal question answering is challenging for LLMs due to irrelevant, outdated, or temporally inconsistent retrieved content, especially in critical applications like clinical event ordering and policy tracking.

Method: RASTeR separates context evaluation from answer generation, assesses relevance and temporal coherence of retrieved context, constructs temporal knowledge graphs, and selectively corrects or discards inconsistent context before generating answers.

Result: RASTeR consistently improves robustness across multiple datasets and LLMs, achieving 75% accuracy with forty distractors in a needle-in-the-haystack study, outperforming the runner-up by over 12%.

Conclusion: RASTeR provides an effective framework for robust temporal reasoning that handles noisy or outdated information through structured context evaluation and selective correction.

Abstract: Temporal question answering (TQA) remains a challenge for large language models (LLMs), particularly when retrieved content may be irrelevant, outdated, or temporally inconsistent. This is especially critical in applications like clinical event ordering and policy tracking, which require reliable temporal reasoning even under noisy or outdated information. To address this challenge, we introduce RASTeR: Robust, Agentic, and Structured Temporal Reasoning, a prompting framework that separates context evaluation from answer generation. RASTeR first assesses the relevance and temporal coherence of the retrieved context, then constructs a temporal knowledge graph (TKG) to better facilitate reasoning. When inconsistencies are detected, RASTeR selectively corrects or discards context before generating an answer. Across multiple datasets and LLMs, RASTeR consistently improves robustness (some TQA work defines robustness as handling diverse temporal phenomena; here, we define it as the ability to answer correctly despite suboptimal context). We further validate our approach through a “needle-in-the-haystack” study, in which relevant context is buried among distractors. With forty distractors, RASTeR achieves 75% accuracy, over 12% ahead of the runner-up.
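The separation of context evaluation from answer generation can be sketched as a filtering step over time-stamped facts. This is a simplified illustration under stated assumptions, not RASTeR itself: RASTeR is a prompting framework, while the sketch uses hand-written rules, only discards context (it does not correct it), and represents the temporal knowledge graph as a flat list. The `Fact` type, its fields, and the example data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    """One edge of a toy temporal knowledge graph, valid from start to end (years)."""
    subject: str
    relation: str
    obj: str
    start: int
    end: int

def is_relevant(fact, subject, relation):
    """Relevance check: the fact must be about the queried subject and relation."""
    return fact.subject == subject and fact.relation == relation

def holds_at(fact, year):
    """Temporal coherence check: the fact must be valid at the queried time."""
    return fact.start <= year <= fact.end

def filter_context(facts, subject, relation, year):
    """Keep only facts that pass both checks before any answer is generated."""
    return [f for f in facts
            if is_relevant(f, subject, relation) and holds_at(f, year)]

# Hypothetical retrieved context: one outdated fact and one distractor.
facts = [
    Fact("Alice", "works_at", "Acme", 2010, 2015),    # outdated for a 2018 question
    Fact("Alice", "works_at", "Globex", 2016, 2020),  # relevant and current
    Fact("Bob", "works_at", "Acme", 2012, 2018),      # irrelevant distractor
]
kept = filter_context(facts, "Alice", "works_at", 2018)
```

Only the surviving facts would then be passed to the answer-generation step, which is where the needle-in-the-haystack setting with many distractors becomes easier.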

[83] Computational Analysis of Gender Depiction in the Comedias of Calderón de la Barca

Allison Keith, Antonio Rojas Castro, Hanno Ehrlicher, Kerstin Jung, Sebastian Padó

Main category: cs.CL

TL;DR: Quantitative analysis of gender portrayal in Calderón’s 17th century Spanish plays using NLP methods, achieving an F-score of up to 0.83 in gender classification and revealing gendered dialogue patterns.

Details

Motivation: To study culturally based gender norms in theatre through quantitative methods, specifically examining gender depiction in Pedro Calderón de la Barca's comedias.

Method: Used gender classifier and model explainability (attribution) methods on corpus of 100+ plays to identify influential text features for gender classification.

Result: Achieved f-score of 0.83 in gender classification, identified semantic aspects of gender portrayal, and successfully predicted cross-dressing characters scene-by-scene.

Conclusion: Female and male characters are portrayed differently in Calderón’s works, and NLP methods can effectively reveal gendered elements of dialogue in historical theatrical texts.

Abstract: In theatre, playwrights use the portrayal of characters to explore culturally based gender norms. In this paper, we develop quantitative methods to study gender depiction in the non-religious works (comedias) of Pedro Calderón de la Barca, a prolific Spanish 17th century author. We gather insights from a corpus of more than 100 plays by using a gender classifier and applying model explainability (attribution) methods to determine which text features are most influential in the model’s decision to classify speech as ‘male’ or ‘female’, indicating the most gendered elements of dialogue in Calderón’s comedias in a human accessible manner. We find that female and male characters are portrayed differently and can be identified by the gender prediction model at practically useful accuracies (up to f=0.83). Analysis reveals semantic aspects of gender portrayal, and demonstrates that the model is even useful in providing a relatively accurate scene-by-scene prediction of cross-dressing characters.

[84] Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks

Jinu Nyachhyon, Mridul Sharma, Prajwal Thapa, Bal Krishna Bal

Main category: cs.CL

TL;DR: The paper introduces NLUE, an expanded benchmark for Nepali language understanding with 12 new datasets across multiple NLU tasks, addressing limitations of existing benchmarks.

Details

Motivation: Existing Nepali NLU benchmarks like Nep-gLUE are limited in scope (only 4 tasks), restricting comprehensive model evaluation for the linguistically complex Nepali language.

Method: Created 12 new datasets forming the NLUE benchmark covering Single-Sentence Classification, Similarity and Paraphrase Tasks, Natural Language Inference, and General Masked Evaluation Task.

Result: Existing top models struggle with the added complexity; multilingual models outperform monolingual models across most tasks, indicating need for more robust Nepali-specific solutions.

Conclusion: The expanded NLUE benchmark sets a new standard for evaluating and advancing NLP models for low-resource languages like Nepali, contributing significantly to NLP research advancement.

Abstract: The Nepali language has distinct linguistic features, especially its complex script (Devanagari script), morphology, and various dialects, which pose a unique challenge for Natural Language Understanding (NLU) tasks. While the Nepali Language Understanding Evaluation (Nep-gLUE) benchmark provides a foundation for evaluating models, it remains limited in scope, covering four tasks. This restricts its utility for comprehensive assessments of Natural Language Processing (NLP) models. To address this limitation, we introduce twelve new datasets, creating a new benchmark, the Nepali Language Understanding Evaluation (NLUE) benchmark for evaluating the performance of models across a diverse set of Natural Language Understanding (NLU) tasks. The added tasks include Single-Sentence Classification, Similarity and Paraphrase Tasks, Natural Language Inference (NLI), and General Masked Evaluation Task (GMET). Through extensive experiments, we demonstrate that existing top models struggle with the added complexity of these tasks. We also find that the best multilingual model outperforms the best monolingual models across most tasks, highlighting the need for more robust solutions tailored to the Nepali language. This expanded benchmark sets a new standard for evaluating, comparing, and advancing models, contributing significantly to the broader goal of advancing NLP research for low-resource languages.

[85] LDC: Learning to Generate Research Idea with Dynamic Control

Ruochen Li, Liqiang Jing, Chi Han, Jiawei Zhou, Xinya Du

Main category: cs.CL

TL;DR: A two-stage framework using SFT and controllable RL for generating high-quality research ideas that balance novelty, feasibility, and effectiveness.

Details

Motivation: Existing LLM approaches for research ideation often produce ideas misaligned with expert standards and struggle to balance the trade-offs between novelty, feasibility, and effectiveness.

Method: Two-stage approach: SFT learns from paper-idea pairs, then controllable RL with multi-dimensional reward models optimizes across key dimensions. Inference uses dimensional controllers with sentence-level decoder for dynamic steering.

Result: Achieves high-quality research idea generation by dynamically navigating trade-offs among novelty, feasibility, and effectiveness.

Conclusion: The framework provides a balanced approach to research idea generation, successfully addressing the limitations of existing methods.

Abstract: Recent advancements in large language models (LLMs) have demonstrated their potential in automating the scientific research ideation. Existing approaches primarily focus on prompting techniques, often producing ideas misaligned with expert standards - novelty, feasibility, and effectiveness, which are widely recognized by the research community as the three key subdimensions of high-quality ideas. Also, balancing these dimensions remains challenging due to their inherent trade-offs. To address these limitations, we propose the first framework that employs a two-stage approach combining Supervised Fine-Tuning (SFT) and controllable Reinforcement Learning (RL) for the task. In the SFT stage, the model learns foundational patterns from pairs of research papers and their corresponding follow-up ideas. In the RL stage, multi-dimensional reward models guided by fine-grained feedback evaluate and optimize the model across key dimensions. During inference, dimensional controllers coordinated by a sentence-level decoder enable dynamic context-aware steering of the idea generation process. Our framework provides a balanced approach to research idea generation, achieving high-quality outcomes in the experiment by dynamically navigating the trade-offs among novelty, feasibility, and effectiveness.

[86] Emotions, Context, and Substance Use in Adolescents: A Large Language Model Analysis of Reddit Posts

Jianfeng Zhu, Hailong Jiang, Yulan Wang, Karin G. Coifman, Ruoming Jin, Deric R. Kenne

Main category: cs.CL

TL;DR: Analysis of 46,000 Reddit posts from r/teenagers shows substance-use discussions feature more negative emotions (sadness, guilt, fear, disgust) while non-substance posts are dominated by joy. Peer influence is the strongest contextual factor, with family and school environments serving dual protective/risk roles.

Details

Motivation: Early substance use increases risk of later disorders and mental health problems, but emotional and contextual drivers remain poorly understood. Need to uncover factors driving adolescent substance use behaviors.

Method: Analyzed 23,000 substance-use and 23,000 non-substance posts from Reddit’s r/teenagers (2018-2022). Used LLMs to annotate six emotions and contextual factors. Applied statistical analysis, SHAP interpretable ML, and LLM-assisted thematic coding.

Result: Negative emotions significantly more common in substance-use posts. Guilt and shame function differently - guilt promotes reflection while shame reinforces risky behaviors. Peer influence strongest contextual predictor. Family and school environments can be both risk and protective factors.

Conclusion: Adolescent substance-use reflects dynamic interplay of emotion, social context, and coping. Mixed computational approaches (statistics, interpretable ML, LLM thematic analysis) effectively uncover emotional and contextual mechanisms of risk behavior.

Abstract: Early substance use during adolescence increases the risk of later substance use disorders and mental health problems, yet the emotional and contextual factors driving these behaviors remain poorly understood. This study analyzed 23000 substance-use related posts and an equal number of non-substance posts from Reddit’s r/teenagers community (2018-2022). Posts were annotated for six discrete emotions (sadness, anger, joy, guilt, fear, disgust) and contextual factors (family, peers, school) using large language models (LLMs). Statistical analyses compared group differences, and interpretable machine learning (SHAP) identified key predictors of substance-use discussions. LLM-assisted thematic coding further revealed latent psychosocial themes linking emotions with contexts. Negative emotions, especially sadness, guilt, fear, and disgust, were significantly more common in substance-use posts, while joy dominated non-substance discussions. Guilt and shame diverged in function: guilt often reflected regret and self-reflection, whereas shame reinforced risky behaviors through peer performance. Peer influence emerged as the strongest contextual factor, closely tied to sadness, fear, and guilt. Family and school environments acted as both risk and protective factors depending on relational quality and stress levels. Overall, adolescent substance-use discussions reflected a dynamic interplay of emotion, social context, and coping behavior. By integrating statistical analysis, interpretable models, and LLM-based thematic exploration, this study demonstrates the value of mixed computational approaches for uncovering the emotional and contextual mechanisms underlying adolescent risk behavior.

[87] Figurative Archive: an open dataset and web-based application for the study of metaphor

Maddalena Bressler, Veronica Mangiaterra, Paolo Canal, Federico Frau, Fabrizio Luciani, Biagio Scalingi, Chiara Barattieri di San Pietro, Chiara Battaglini, Chiara Pompei, Fortunata Romeo, Luca Bischetti, Valentina Bambini

Main category: cs.CL

TL;DR: The Figurative Archive is an open database of 996 Italian metaphors with rating and corpus-based measures, validated through correlations between familiarity and other metrics, featuring a web interface and guidelines for research use.

Details

Motivation: To address the increasing demand for rigorously constructed and extensively normed experimental materials in metaphor research, which provides insights into linguistic and cognitive processes.

Method: Collection of stimuli from 11 studies, including both everyday and literary metaphors, enriched with rating and corpus-based measures (familiarity, semantic distance, preferred interpretations), and validated through correlation analysis.

Result: Creation of a comprehensive database with 996 Italian metaphors, featuring measures of metaphor inclusiveness for non-discriminatory language, displayed in a web-based interface with customization options.

Conclusion: The Figurative Archive serves as a valuable resource for sourcing materials in metaphor processing studies and investigating relationships between metaphor features in humans and computational models.

Abstract: Research on metaphor has steadily increased over the last decades, as this phenomenon opens a window into a range of linguistic and cognitive processes. At the same time, the demand for rigorously constructed and extensively normed experimental materials increased as well. Here, we present the Figurative Archive, an open database of 996 metaphors in Italian enriched with rating and corpus-based measures (from familiarity to semantic distance and preferred interpretations), derived by collecting stimuli used across 11 studies. It includes both everyday and literary metaphors, varying in structure and semantic domains, and is validated based on correlations between familiarity and other measures. The Archive has several aspects of novelty: it is increased in size compared to previous resources; it offers a measure of metaphor inclusiveness, to comply with recommendations for non-discriminatory language use; it is displayed in a web-based interface, with features for a customized consultation. We provide guidelines for using the Archive to source materials for studies investigating metaphor processing and relationships between metaphor features in humans and computational models.

[88] DomainCQA: Crafting Knowledge-Intensive QA from Domain-Specific Charts

Yujing Lu, Ling Zhong, Jing Yang, Weiming Li, Peng Wei, Yongheng Wang, Manni Duan, Qing Zhang

Main category: cs.CL

TL;DR: DomainCQA is a framework for creating domain-specific chart question answering benchmarks that test both visual comprehension and knowledge-intensive reasoning, addressing limitations of existing benchmarks that focus only on surface-level parsing.

Details

Motivation: Existing Chart QA benchmarks mostly test surface-level parsing like reading labels and legends, overlooking deeper scientific reasoning needed for domain-specific chart understanding.

Method: DomainCQA integrates complexity-aware chart selection, multitier QA generation, and expert validation to construct domain-specific benchmarks. Applied to astronomy, it created AstroChart with 1,690 QA pairs over 482 charts.

Result: AstroChart exposed persistent weaknesses in fine-grained perception, numerical reasoning, and domain knowledge integration across 21 MLLMs. Fine-tuning on AstroChart improved performance across fundamental and advanced tasks.

Conclusion: DomainCQA establishes a unified pipeline for constructing and augmenting domain-specific chart reasoning benchmarks, with pilot demonstrations in biochemistry, economics, medicine, and social science showing its generality.

Abstract: Chart Question Answering (CQA) evaluates Multimodal Large Language Models (MLLMs) on visual understanding and reasoning over chart data. However, existing benchmarks mostly test surface-level parsing, such as reading labels and legends, while overlooking deeper scientific reasoning. We propose DomainCQA, a framework for constructing domain-specific CQA benchmarks that emphasize both visual comprehension and knowledge-intensive reasoning. It integrates complexity-aware chart selection, multitier QA generation, and expert validation. Applied to astronomy, DomainCQA yields AstroChart, a benchmark of 1,690 QA pairs over 482 charts, exposing persistent weaknesses in fine-grained perception, numerical reasoning, and domain knowledge integration across 21 MLLMs. Fine-tuning on AstroChart improves performance across fundamental and advanced tasks. Pilot QA sets in biochemistry, economics, medicine, and social science further demonstrate DomainCQA’s generality. Together, our results establish DomainCQA as a unified pipeline for constructing and augmenting domain-specific chart reasoning benchmarks.

[89] ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance

Wissam Antoun, Benoît Sagot, Djamé Seddah

Main category: cs.CL

TL;DR: Controlled comparison shows DeBERTaV3 outperforms ModernBERT in sample efficiency and benchmark performance when trained on same data, with ModernBERT’s main advantages being long context support and faster training/inference.

Details

Motivation: To isolate architectural effects from training data differences when comparing ModernBERT and DeBERTaV3, since previous comparisons used different datasets.

Method: Pretrained ModernBERT on the same dataset as CamemBERTaV2 (DeBERTaV3 French model) for controlled comparison of model architectures.

Result: DeBERTaV3 showed superior sample efficiency and overall benchmark performance; ModernBERT’s advantages are long context support, faster training, and inference speed; high-quality data accelerates convergence but doesn’t significantly improve final performance.

Conclusion: Architectural innovations should be evaluated separately from training data effects; DeBERTaV3 remains superior to ModernBERT in key metrics when controlling for data; potential benchmark saturation exists.

Abstract: Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several benchmarks, the lack of disclosed training data and the absence of comparisons using a shared dataset make it difficult to determine whether these gains are due to architectural improvements or differences in training data. In this work, we conduct a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a DeBERTaV3 French model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT’s primary advantage being its support for long context, faster training, and inference speed. However, the new proposed model still provides meaningful architectural improvements compared to earlier models such as BERT and RoBERTa. Additionally, we observe that high-quality pre-training data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation. These findings show the importance of disentangling pretraining data from architectural innovations when evaluating transformer models.

[90] New News: System-2 Fine-tuning for Robust Integration of New Knowledge

Core Francisco Park, Zechen Zhang, Hidenori Tanaka

Main category: cs.CL

TL;DR: The paper introduces New News dataset to study the gap between fine-tuning and in-context learning for knowledge integration, and proposes System-2 Fine-tuning with Self-QA protocol to improve in-weight learning while preserving general capabilities.

Details

Motivation: To address the challenge of effectively integrating new information into model weights via fine-tuning, as current methods show substantial performance gaps compared to in-context learning.

Method: Created New News dataset with hypothetical news across domains; proposed System-2 Fine-tuning using self-play data generation protocols (paraphrases, implications, Self-QA) to distill context-processed knowledge into model weights.

Result: Self-QA protocol of Sys2-FT significantly improves in-weight learning while preserving general capabilities; discovered contextual shadowing effect where certain training approaches degrade learning; found preliminary evidence of scaling laws for Sys2-FT.

Conclusion: System-2 Fine-tuning with Self-QA protocol effectively bridges the fine-tuning vs in-context learning gap, enabling better knowledge integration into model weights while maintaining model capabilities.

Abstract: Humans and intelligent animals can internalize new information and accurately internalize their implications to perform downstream tasks. While large language models (LLMs) can achieve this through in-context learning (ICL) when the information (news) is explicitly given as context, adequately integrating the information into model weights via fine-tuning remains challenging. In this paper, we introduce New News, a dataset composed of hypothetical yet plausible news spanning multiple domains (mathematics, coding, discoveries, leaderboards, events), accompanied by downstream evaluation questions whose correct answers critically depend on understanding and internalizing the news. First, we demonstrate a substantial gap between naive fine-tuning and in-context learning (FT-ICL gap) on our dataset. To address this gap, we explore a suite of self-play data generation protocols – paraphrases, implications, and Self-QA – designed to distill the knowledge processed by the model with context into the weights of the model, which we term System-2 Fine-tuning (Sys2-FT). We systematically evaluate ICL and Sys2-FT performance across data domains and model scales with the Qwen 2.5 family of models. Our results demonstrate that the Self-QA protocol of Sys2-FT significantly improves models’ in-weight learning of the news while preserving general capabilities. Furthermore, we discover the contextual shadowing effect, where training with the news in context followed by its rephrases or QAs catastrophically degrades learning of the news. Finally, we show preliminary evidence of an emerging scaling law of Sys2-FT.

[91] Activation-Guided Consensus Merging for Large Language Models

Yuxuan Yao, Shuqi Liu, Zehua Liu, Qintong Li, Mingyang Liu, Xiongwei Han, Zhijiang Guo, Han Wu, Linqi Song

Main category: cs.CL

TL;DR: ACM is a plug-and-play model merging framework that uses activation-based mutual information to determine layer-specific coefficients, achieving better efficiency and reasoning accuracy than uniform merging methods.

Details

Motivation: Existing approaches for combining System 2 reasoning with System 1 efficiency face efficiency and stability challenges. Model merging offers promise but conventional methods assume uniform layer importance, ignoring functional heterogeneity in neural components.

Method: Propose Activation-Guided Consensus Merging (ACM) that determines layer-specific merging coefficients based on mutual information between activations of pre-trained and fine-tuned models, without requiring gradient computations or additional training.

Result: ACM consistently outperforms baseline methods on Long-to-Short and general merging tasks. For Qwen-7B models, TIES-Merging with ACM achieves 55.3% reduction in response length while improving reasoning accuracy by 1.3 points.

Conclusion: ACM effectively preserves task-specific capabilities through activation-guided merging, demonstrating superior performance over conventional uniform merging approaches.

Abstract: Recent research has increasingly focused on reconciling the reasoning capabilities of System 2 with the efficiency of System 1. While existing training-based and prompt-based approaches face significant challenges in terms of efficiency and stability, model merging emerges as a promising strategy to integrate the diverse capabilities of different Large Language Models (LLMs) into a unified model. However, conventional model merging methods often assume uniform importance across layers, overlooking the functional heterogeneity inherent in neural components. To address this limitation, we propose Activation-Guided Consensus Merging (ACM), a plug-and-play merging framework that determines layer-specific merging coefficients based on mutual information between activations of pre-trained and fine-tuned models. ACM effectively preserves task-specific capabilities without requiring gradient computations or additional training. Extensive experiments on Long-to-Short (L2S) and general merging tasks demonstrate that ACM consistently outperforms all baseline methods. For instance, in the case of Qwen-7B models, TIES-Merging equipped with ACM achieves a 55.3% reduction in response length while simultaneously improving reasoning accuracy by 1.3 points.
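A toy sketch of layer-specific merging driven by mutual information (MI) between activations. The histogram MI estimator, the mapping `1 / (1 + MI)` from MI to a coefficient, and all names here are assumptions made for illustration; the summary does not state how ACM converts MI into coefficients. The sketch only shows the general shape: estimate MI per layer without gradients, derive one coefficient per layer, and interpolate weights layer by layer.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def _bin(values: Sequence[float], bins: int) -> List[int]:
    """Assign each value to one of `bins` equal-width histogram bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant inputs
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x: Sequence[float], y: Sequence[float], bins: int = 4) -> float:
    """Histogram estimate of MI (in nats) between two activation samples."""
    bx, by = _bin(x, bins), _bin(y, bins)
    n = len(bx)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def layer_coefficients(acts_pre, acts_ft) -> Dict[str, float]:
    # Assumption: lower MI (activations changed more) -> larger coefficient.
    return {
        name: 1.0 / (1.0 + mutual_information(acts_pre[name], acts_ft[name]))
        for name in acts_pre
    }


def merge(pre, ft, coeffs):
    """Per-layer interpolation: pre + coeff * (ft - pre)."""
    return {
        name: [p + coeffs[name] * (f - p) for p, f in zip(pre[name], ft[name])]
        for name in pre
    }


acts_pre = {"layer0": [0.0, 1.0, 2.0, 3.0], "layer1": [0.0, 1.0, 2.0, 3.0]}
acts_ft = {"layer0": [0.0, 1.0, 2.0, 3.0], "layer1": [3.0, 0.0, 0.0, 3.0]}
coeffs = layer_coefficients(acts_pre, acts_ft)
merged = merge({"layer0": [0.0], "layer1": [0.0]}, {"layer0": [1.0], "layer1": [1.0]}, coeffs)
```

In this toy input, `layer1` activations diverge more after fine-tuning, so it receives the larger coefficient and its merged weight moves further toward the fine-tuned value.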

[92] Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning

Jiaru Zou, Yikun Ban, Zihao Li, Yunzhe Qi, Ruizhong Qiu, Ling Yang, Jingrui He

Main category: cs.CL

TL;DR: Transformer Copilot introduces a Pilot-Copilot framework where a Copilot model learns from the Pilot’s mistakes via a Mistake Log to refine inference through logits rectification, improving performance by up to 34.5% with minimal overhead.

Details

Motivation: To improve fine-tuning by retaining and leveraging the model's own learning signals, similar to how humans learn from past mistakes, rather than just minimizing generation loss.

Method: Introduces Mistake Log to track learning behavior, designs a Copilot model for logits rectification, and implements joint training where Copilot learns from evolving Mistake Log alongside Pilot model.

Result: Experiments on 12 benchmarks show consistent performance improvements up to 34.5% across commonsense, arithmetic, and recommendation tasks with marginal computational overhead.

Conclusion: Transformer Copilot framework effectively enhances model performance through mistake-driven learning and logits rectification, demonstrating strong scalability and transferability.

Abstract: Large language models are typically adapted to downstream tasks through supervised fine-tuning on domain-specific data. While standard fine-tuning focuses on minimizing generation loss to optimize model parameters, we take a deeper step by retaining and leveraging the model’s own learning signals, analogous to how human learners reflect on past mistakes to improve future performance. We first introduce the concept of Mistake Log to systematically track the model’s learning behavior and recurring errors throughout fine-tuning. Treating the original transformer-based model as the Pilot, we correspondingly design a Copilot model to refine the Pilot’s inference performance via logits rectification. We name the overall Pilot-Copilot framework the Transformer Copilot, which introduces (i) a novel Copilot model design, (ii) a joint training paradigm where the Copilot continuously learns from the evolving Mistake Log alongside the Pilot, and (iii) a fused inference paradigm where the Copilot rectifies the Pilot’s logits for enhanced generation. We provide both theoretical and empirical analyses on our new learning framework. Experiments on 12 benchmarks spanning commonsense, arithmetic, and recommendation tasks demonstrate that Transformer Copilot consistently improves performance by up to 34.5%, while introducing marginal computational overhead to Pilot models and exhibiting strong scalability and transferability. Our code is released at https://github.com/jiaruzouu/TransformerCopilot.
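A minimal sketch of logits rectification, assuming the Mistake Log stores the gap between the target distribution and the Pilot's predicted distribution, and that the Copilot's output is added to the Pilot's logits with a scale factor. The additive form, the `strength` parameter, and the function names are assumptions for illustration, not the paper's confirmed formulation. Here the true error stands in for a trained Copilot's prediction.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mistake_entry(logits: Sequence[float], target_index: int) -> List[float]:
    """Mistake-log style record: one-hot target minus predicted probabilities."""
    probs = softmax(logits)
    return [(1.0 if i == target_index else 0.0) - p for i, p in enumerate(probs)]


def rectify(pilot_logits: Sequence[float], predicted_error: Sequence[float],
            strength: float = 1.0) -> List[float]:
    """Add the (hypothetical) Copilot correction to the Pilot's logits."""
    return [z + strength * e for z, e in zip(pilot_logits, predicted_error)]


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


pilot_logits = [2.0, 1.0, 0.0]          # Pilot prefers token 0
error = mistake_entry(pilot_logits, 1)  # correct token is 1
rectified = rectify(pilot_logits, error, strength=1.0)
before, after = argmax(pilot_logits), argmax(rectified)
```

With these toy numbers the Pilot alone picks token 0, while the rectified logits pick token 1. A real Copilot would have to predict the correction from context rather than read it from the label.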

[93] Latent Principle Discovery for Language Model Self-Improvement

Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

Main category: cs.CL

TL;DR: Automated method for eliciting and compressing latent behavioral principles from language models to enable self-improvement through strategic principle invocation.

Details

Motivation: Manual curation of behavioral principles for language model improvement is labor-intensive; automated discovery of these latent attributes could enable more efficient model refinement.

Method: Posterior-regularized Monte Carlo Expectation-Maximization to mine principles from the LM itself, compress them via clustering, and teach the model to strategically invoke them for self-correction.

Result: Smaller language models (7-8B parameters) achieved +8-10% AlpacaEval win-rate, +0.3 on MT-Bench, and +19-23% principle-following win-rate on IFEval.

Conclusion: Automated principle-driven post-training enables continual self-improvement in language models, with clustering yielding interpretable and diverse model-generated constitutions while maintaining performance.

Abstract: When language model (LM) users aim to improve the quality of its generations, it is crucial to specify concrete behavioral attributes that the model should strive to reflect. However, curating such principles across many domains, even non-exhaustively, requires a labor-intensive annotation process. To automate this process, we propose eliciting these latent attributes that guide model reasoning toward human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements to an interpretable set via clustering. Specifically, we employ a form of posterior-regularized Monte Carlo Expectation-Maximization to both identify a condensed set of the most effective latent principles and teach the LM to strategically invoke them in order to intrinsically refine its responses. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models (7-8B parameters) to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval. We also show that clustering the principles yields interpretable and diverse model-generated constitutions while retaining model performance. The gains that our method achieves highlight the potential of automated, principle-driven post-training recipes toward continual self-improvement.

[94] Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval

Shubhashis Roy Dipta, Francis Ferraro

Main category: cs.CL

TL;DR: Q2E is a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval that extracts latent parametric knowledge from LLMs and VLMs to enhance query understanding and improve video retrieval performance.

Details

Motivation: To improve identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events from language and vision models.

Method: Query-to-Event decomposition using knowledge embedded in LLMs and VLMs, adaptable across datasets and models, with entropy-based fusion scoring for zero-shot fusion of multimodal knowledge including visual and speech-based inputs.

Result: Outperforms several state-of-the-art baselines on two diverse datasets across multiple retrieval metrics, with audio integration significantly improving text-to-video retrieval performance.

Conclusion: The approach demonstrates effective enhancement of query understanding through decomposition and multimodal knowledge fusion, with released code and data for future research.

Abstract: Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.
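One way entropy-based fusion can work, sketched under assumptions: each source (for example visual and speech) produces a similarity score per candidate video, and sources whose score distribution is more peaked (lower entropy) get more weight. The weighting rule `1 - H / H_max` and all names are illustrative choices, not the formula used by Q2E.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_weighted_fusion(
    score_lists: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Fuse per-source candidate scores, trusting confident sources more."""
    n = len(score_lists[0])
    if n < 2:
        raise ValueError("need at least two candidates to compare")
    max_entropy = math.log(n)
    # Weight is 0 for a uniform (uninformative) source, close to 1 for a peaked one.
    weights = [max(0.0, 1.0 - entropy(softmax(s)) / max_entropy) for s in score_lists]
    total = sum(weights)
    if total == 0:
        weights = [1.0 / len(weights)] * len(weights)  # fall back to equal weights
    else:
        weights = [w / total for w in weights]
    fused = [sum(w * s[i] for w, s in zip(weights, score_lists)) for i in range(n)]
    return fused, weights


visual_scores = [0.9, 0.1, 0.1]   # clearly prefers candidate 0
speech_scores = [0.4, 0.5, 0.45]  # nearly undecided
fused, weights = entropy_weighted_fusion([visual_scores, speech_scores])
```

No training is involved, which is what makes this kind of scoring usable in a zero-shot setting: the weights are computed per query from the scores themselves.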

[95] Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data

Jiacheng Liu, Mayi Xu, Qiankun Pi, Wenli Li, Ming Zhong, Yuanyuan Zhu, Mengchi Liu, Tieyun Qian

Main category: cs.CL

TL;DR: This paper presents the first comprehensive study of format bias in LLMs, showing systematic biases toward particular data formats that can undermine impartial data integration and cause reasoning errors.

Details

Motivation: LLMs are increasingly used to process heterogeneous data formats, but potential systematic biases toward specific formats could lead to reasoning errors and increased risks in downstream tasks.

Method: Three-stage empirical analysis: 1) exploring presence and direction of bias across diverse LLMs, 2) examining data-level factors influencing biases, 3) analyzing bias emergence in attention patterns and testing lightweight interventions.

Result: Format bias is consistent across model families, driven by information richness, structure quality, and representation type, and closely associated with attention imbalance within LLMs.

Conclusion: Identified three future research directions to reduce format bias: enhancing data pre-processing, introducing inference-time interventions, and developing format-balanced training corpora to support more robust heterogeneous data processing systems.

Abstract: Large Language Models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats, including texts, tables, infoboxes, and knowledge graphs. However, systematic biases toward particular formats may undermine LLMs’ ability to integrate heterogeneous data impartially, potentially resulting in reasoning errors and increased risks in downstream tasks. Yet it remains unclear whether such biases are systematic, which data-level factors drive them, and what internal mechanisms underlie their emergence. In this paper, we present the first comprehensive study of format bias in LLMs through a three-stage empirical analysis. The first stage explores the presence and direction of bias across a diverse range of LLMs. The second stage examines how key data-level factors influence these biases. The third stage analyzes how format bias emerges within LLMs’ attention patterns and evaluates a lightweight intervention to test its effectiveness. Our results show that format bias is consistent across model families, driven by information richness, structure quality, and representation type, and is closely associated with attention imbalance within the LLMs. Based on these investigations, we identify three future research directions to reduce format bias: enhancing data pre-processing through format repair and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora. These directions will support the design of more robust and fair heterogeneous data processing systems.

[96] CyPortQA: Benchmarking Multimodal Large Language Models for Cyclone Preparedness in Port Operation

Chenchen Kuai, Chenhao Wu, Yang Zhou, Xiubin Bruce Wang, Tianbao Yang, Zhengzhong Tu, Zihao Li, Yunlong Zhang

Main category: cs.CL

TL;DR: CyPortQA is the first multimodal benchmark for evaluating MLLMs in port cyclone preparedness, featuring 117,178 QA pairs from 2,917 real-world disruption scenarios across 145 U.S. ports.

Details

Motivation: To address the need for accurate and reliable multimodal AI systems that can integrate diverse forecast products into actionable guidance for port operations during tropical cyclones.

Method: Created CyPortQA benchmark with 2,917 real-world scenarios from 2015-2023, expanded to 117,178 structured QA pairs through automated pipeline, and tested diverse MLLMs on this dataset.

Result: MLLMs show strong potential in situation understanding but face significant challenges in reasoning tasks like impact estimation and decision reasoning for port operations.

Conclusion: While MLLMs demonstrate promise for port cyclone preparedness, substantial improvements are needed in reasoning capabilities to ensure reliable decision support for port operators.

Abstract: As tropical cyclones intensify and track forecasts become increasingly uncertain, U.S. ports face heightened supply-chain risk under extreme weather conditions. Port operators need to rapidly synthesize diverse multimodal forecast products, such as probabilistic wind maps, track cones, and official advisories, into clear, actionable guidance as cyclones approach. Multimodal large language models (MLLMs) offer a powerful means to integrate these heterogeneous data sources alongside broader contextual knowledge, yet their accuracy and reliability in the specific context of port cyclone preparedness have not been rigorously evaluated. To fill this gap, we introduce CyPortQA, the first multimodal benchmark tailored to port operations under cyclone threat. CyPortQA assembles 2,917 real-world disruption scenarios from 2015 through 2023, spanning 145 U.S. principal ports and 90 named storms. Each scenario fuses multisource data (i.e., tropical cyclone products, port operational impact records, and port condition bulletins) and is expanded through an automated pipeline into 117,178 structured question-answer pairs. Using this benchmark, we conduct extensive experiments on diverse MLLMs, including both open-source and proprietary models. MLLMs demonstrate great potential in situation understanding but still face considerable challenges in reasoning tasks, including potential impact estimation and decision reasoning.

[97] Beyond the Surface: Probing the Ideological Depth of Large Language Models

Shariar Kabir, Kevin Esterling, Yue Dong

Main category: cs.CL

TL;DR: The paper introduces ‘ideological depth’ as a measurable property of LLMs, combining steerability and internal political feature richness, showing that refusals on political instructions can stem from capability deficits rather than safety measures.

Details

Motivation: To understand why LLMs display varying political leanings and consistency, and to develop a framework for measuring their political representation capabilities beyond surface-level behavior.

Method: Used Llama-3.1-8B-Instruct and Gemma-2-9B-IT models, comparing prompt-based and activation-steering interventions, and probing political features with sparse autoencoders (SAEs) to measure internal political representations.

Result: Gemma showed 7.3x more distinct political features and better steerability than Llama. Causal ablations of Gemma’s political features increased refusal rates, indicating capability deficits rather than safety guardrails cause refusals.

Conclusion: Ideological depth is a measurable LLM property, and steerability reveals latent political architecture. Refusals on political content often result from capability limitations rather than intentional safety mechanisms.

Abstract: Large language models (LLMs) display recognizable political leanings, yet they vary significantly in their ability to represent a political orientation consistently. In this paper, we define ideological depth as (i) a model’s ability to follow political instructions without failure (steerability), and (ii) the feature richness of its internal political representations measured with sparse autoencoders (SAEs), an unsupervised sparse dictionary learning (SDL) approach. Using Llama-3.1-8B-Instruct and Gemma-2-9B-IT as candidates, we compare prompt-based and activation-steering interventions and probe political features with publicly available SAEs. We find large, systematic differences: Gemma is more steerable in both directions and activates approximately 7.3x more distinct political features than Llama. Furthermore, causal ablations of a small targeted set of Gemma’s political features to create a similar feature-poor setting induce consistent shifts in its behavior, with increased rates of refusals across topics. Together, these results indicate that refusals on benign political instructions or prompts can arise from capability deficits rather than safety guardrails. Ideological depth thus emerges as a measurable property of LLMs, and steerability serves as a window into their latent political architecture.
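The two interventions named above can be sketched on plain vectors. This is a generic formulation, assuming steering adds a scaled direction to a hidden state and ablation subtracts a feature's decoder vector scaled by its activation; the feature id, vectors, and function names are hypothetical, and the paper's exact procedure may differ.

```python
from typing import Dict, Iterable, List, Sequence


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> List[float]:
    """Activation steering: shift a hidden state along a chosen direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def ablate_features(
    hidden: Sequence[float],
    decoder_rows: Dict[int, Sequence[float]],
    activations: Dict[int, float],
    feature_ids: Iterable[int],
) -> List[float]:
    """Remove the contribution of selected SAE features from a hidden state.

    Each feature contributes activation * decoder_row to the reconstruction,
    so ablating it subtracts that term.
    """
    out = list(hidden)
    for fid in feature_ids:
        for j, d in enumerate(decoder_rows[fid]):
            out[j] -= activations[fid] * d
    return out


hidden = [1.0, 2.0]
steered = steer(hidden, direction=[1.0, 0.0], alpha=0.5)

# Hypothetical SAE feature 7 writes only to the second dimension.
decoder_rows = {7: [0.0, 1.0]}
activations = {7: 2.0}
ablated = ablate_features(hidden, decoder_rows, activations, feature_ids=[7])
```

In a real model these operations would be applied inside a forward hook on the residual stream at a chosen layer, with the modified state passed on to later layers.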

[98] Wage Sentiment Indices Derived from Survey Comments via Large Language Models

Taihei Sone

Main category: cs.CL

TL;DR: A Wage Sentiment Index (WSI) using Large Language Models (LLMs) is proposed to forecast Japanese wage dynamics, outperforming traditional methods and showing promise for economic policy design.

Details

Motivation: To leverage generative AI for economic text analysis and create a timely wage sentiment indicator for Japan using survey data from economically sensitive industries.

Method: Extends the Price Sentiment Index framework to wages using LLMs on Economy Watchers Survey data, with a scalable architecture for integrating additional data sources like newspapers and social media.

Result: LLM-based WSI models significantly outperform both baseline approaches and pretrained models in forecasting wage dynamics.

Conclusion: LLM-driven sentiment indices can enhance timeliness and effectiveness of economic policy design by governments and central banks.

Abstract: The emergence of generative Artificial Intelligence (AI) has created new opportunities for economic text analysis. This study proposes a Wage Sentiment Index (WSI) constructed with Large Language Models (LLMs) to forecast wage dynamics in Japan. The analysis is based on the Economy Watchers Survey (EWS), a monthly survey conducted by the Cabinet Office of Japan that captures real-time economic assessments from workers in industries highly sensitive to business conditions. The WSI extends the framework of the Price Sentiment Index (PSI) used in prior studies, adapting it specifically to wage related sentiment. To ensure scalability and adaptability, a data architecture is also developed that enables integration of additional sources such as newspapers and social media. Experimental results demonstrate that WSI models based on LLMs significantly outperform both baseline approaches and pretrained models. These findings highlight the potential of LLM-driven sentiment indices to enhance the timeliness and effectiveness of economic policy design by governments and central banks.
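A rough sketch of turning per-comment labels into a monthly index. It assumes an LLM has already labeled each survey comment as `up`, `down`, or `neutral` for wages, and it uses a simple net-balance formula, (up - down) / total; both the label set and the formula are assumptions for illustration, and the PSI-derived definition in the paper may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sentiment_index(records: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Aggregate (month, label) pairs into a net-balance index per month.

    Labels are assumed to be 'up', 'down', or 'neutral'.
    """
    counts = defaultdict(lambda: {"up": 0, "down": 0, "neutral": 0})
    for month, label in records:
        if label not in ("up", "down", "neutral"):
            raise ValueError(f"unexpected label: {label}")
        counts[month][label] += 1
    index = {}
    for month, c in sorted(counts.items()):
        total = sum(c.values())
        index[month] = (c["up"] - c["down"]) / total  # ranges from -1 to +1
    return index


# Hypothetical labels, not real survey data.
records = [
    ("2024-01", "up"), ("2024-01", "up"), ("2024-01", "down"), ("2024-01", "neutral"),
    ("2024-02", "down"), ("2024-02", "neutral"),
]
wsi = sentiment_index(records)
```

Because the aggregation only needs (month, label) pairs, comments from other sources such as newspapers or social media could be appended to `records` without changing the function.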

[99] Re-FRAME the Meeting Summarization SCOPE: Fact-Based Summarization and Personalization via Questions

Frederic Kirstein, Sonu Kumar, Terry Ruas, Bela Gipp

Main category: cs.CL

TL;DR: FRAME reframes meeting summarization as semantic enrichment to reduce hallucinations and omissions, while SCOPE enables personalization through reasoning traces. P-MESA provides reliable reference-free evaluation.

Details

Motivation: Current LLM-based meeting summarization often produces outputs with hallucinations, omissions, and irrelevancies, lacking control and personalization.

Method: FRAME pipeline extracts and scores facts, organizes them thematically, and enriches outlines into summaries. SCOPE uses reasoning traces via nine questions for personalization. P-MESA evaluates summaries multi-dimensionally.

Result: FRAME reduces hallucination and omission by 2 out of 5 points on QMSum and FAME. SCOPE improves knowledge fit and goal alignment. P-MESA achieves >=89% balanced accuracy against human annotations.

Conclusion: Rethinking summarization as semantic enrichment improves control, faithfulness, and personalization in meeting summaries.

Abstract: Meeting summarization with large language models (LLMs) remains error-prone, often producing outputs with hallucinations, omissions, and irrelevancies. We present FRAME, a modular pipeline that reframes summarization as a semantic enrichment task. FRAME extracts and scores salient facts, organizes them thematically, and uses these to enrich an outline into an abstractive summary. To personalize summaries, we introduce SCOPE, a reason-out-loud protocol that has the model build a reasoning trace by answering nine questions before content selection. For evaluation, we propose P-MESA, a multi-dimensional, reference-free evaluation framework to assess if a summary fits a target reader. P-MESA reliably identifies error instances, achieving >= 89% balanced accuracy against human annotations and strongly aligns with human severity ratings (r >= 0.70). On QMSum and FAME, FRAME reduces hallucination and omission by 2 out of 5 points (measured with MESA), while SCOPE improves knowledge fit and goal alignment over prompt-only baselines. Our findings advocate for rethinking summarization to improve control, faithfulness, and personalization.
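For reference, balanced accuracy is the mean of the true-positive rate and the true-negative rate, which keeps a rare error class from being swamped by the majority class. The sketch below assumes a binary setup (error present = 1, absent = 0) and uses made-up labels; how P-MESA's dimensions are binarized is not stated in the summary.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of recall on the positive class and recall on the negative class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# Hypothetical labels: 1 = error instance flagged, 0 = no error.
human = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
judge = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
score = balanced_accuracy(human, judge)
plain_accuracy = sum(h == j for h, j in zip(human, judge)) / len(human)
```

On this toy input plain accuracy is 0.8 while balanced accuracy is lower, because one of only three true error instances was missed.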

[100] BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects

Jakir Hasan, Shubhashis Roy Dipta

Main category: cs.CL

TL;DR: BanglaTalk is the first real-time speech assistance system for Bengali regional dialects, featuring a dialect-aware ASR system that outperforms baselines by 12.41-33.98% and operates at low bandwidth with minimal delay.

Details

Motivation: Bengali is a low-resource language with high dialectal diversity, but existing systems are not optimized for real-time use and only focus on standard Bengali, limiting accessibility for diverse speakers.

Method: Client-server architecture using Real-time Transport Protocol (RTP) for low-latency communication, with a dialect-aware ASR system (BRDialect) fine-tuned from IndicWav2Vec model across ten Bengali regional dialects.

Result: BRDialect outperforms baseline ASR models by 12.41-33.98% on RegSpeech12 dataset; system operates at 24 kbps bandwidth with average end-to-end delay of 4.9 seconds.

Conclusion: BanglaTalk enables inclusive and accessible speech technology for Bengali speakers through cost-effective, interactive real-time operation with minimal bandwidth usage and delay.

Abstract: Real-time speech assistants are becoming increasingly popular for ensuring improved accessibility to information. Bengali, being a low-resource language with a high regional dialectal diversity, has seen limited progress in developing such systems. Existing systems are not optimized for real-time use and focus only on standard Bengali. In this work, we present BanglaTalk, the first real-time speech assistance system for Bengali regional dialects. BanglaTalk follows the client-server architecture and uses the Real-time Transport Protocol (RTP) to ensure low-latency communication. To address dialectal variation, we introduce a dialect-aware ASR system, BRDialect, developed by fine-tuning the IndicWav2Vec model on ten Bengali regional dialects. It outperforms the baseline ASR models by 12.41-33.98% on the RegSpeech12 dataset. Furthermore, BanglaTalk can operate at a low bandwidth of 24 kbps while maintaining an average end-to-end delay of 4.9 seconds. Low bandwidth usage and minimal end-to-end delay make the system both cost-effective and interactive for real-time use cases, enabling inclusive and accessible speech technology for the diverse community of Bengali speakers. Code is available at https://github.com/Jak57/BanglaTalk
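To put the 24 kbps figure in concrete terms, the arithmetic below converts a bitrate into payload bytes per packet and data volume per minute. The 20 ms packet interval is an assumption (a common choice for streamed audio, not stated in the summary), and the numbers cover audio payload only, ignoring packet header overhead.

```python
def payload_bytes_per_packet(bitrate_kbps: float, frame_ms: float) -> float:
    """Audio payload carried by one packet at the given bitrate and frame length."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return bytes_per_second * frame_ms / 1000


def audio_megabytes(bitrate_kbps: float, seconds: float) -> float:
    """Total audio payload in megabytes (10^6 bytes) over a duration."""
    return bitrate_kbps * 1000 / 8 * seconds / 1_000_000


per_packet = payload_bytes_per_packet(24, 20)  # assumed 20 ms frames
per_minute_mb = audio_megabytes(24, 60)
```

At 24 kbps this works out to 60 bytes of audio per 20 ms packet and 0.18 MB per minute of speech, which is why the stream remains practical on constrained mobile connections.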

[101] A Critical Study of Automatic Evaluation in Sign Language Translation

Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Eleftherios Avramidis, Josef van Genabith

Main category: cs.CL

TL;DR: Analysis of text-based metrics for sign language translation reveals limitations in capturing semantic quality, with LLM-based evaluators showing promise but bias, motivating need for multimodal evaluation frameworks.

Details

Motivation: Current SLT evaluation relies on text-based metrics like BLEU and ROUGE, but it's unclear how well these capture SLT quality, creating a gap in reliable automatic evaluation.

Method: Analyzed six metrics, namely the text-based BLEU, chrF, ROUGE, and BLEURT and the LLM-based evaluators G-Eval and GEMBA, under three controlled conditions: paraphrasing, hallucinations, and sentence length variations.

Result: Lexical overlap metrics have limitations; LLM-based evaluators better capture semantic equivalence but show bias toward LLM-paraphrased translations. All metrics detect hallucinations but with varying sensitivity - BLEU is overly sensitive while BLEURT and LLM evaluators are lenient toward subtle cases.

Conclusion: Text-based metrics alone are insufficient for comprehensive SLT evaluation, highlighting the need for multimodal evaluation frameworks that go beyond text-based approaches.

Abstract: Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation metrics by analyzing six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other hand. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.
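The paraphrase limitation of lexical overlap metrics is easy to reproduce. The sketch below implements only clipped n-gram precision, the core ingredient of BLEU; it is not full BLEU (no brevity penalty, no geometric mean over n-gram orders), and the sentences are invented examples.

```python
from collections import Counter


def ngram_precision(hypothesis: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram counts at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return clipped / total


reference = "the weather will be cold tomorrow"
paraphrase = "tomorrow it is going to be chilly"

copy_score = ngram_precision(reference, reference, 1)
unigram_score = ngram_precision(paraphrase, reference, 1)
bigram_score = ngram_precision(paraphrase, reference, 2)
```

The paraphrase carries the same meaning as the reference yet shares only two of its seven words and no bigrams with it, so an overlap metric scores it far below an exact copy.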

[102] Scaling Latent Reasoning via Looped Language Models

Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang, Yoshua Bengio, Jason Eshraghian

Main category: cs.CL

TL;DR: Ouro is a family of pre-trained Looped Language Models that integrate reasoning into pre-training through iterative latent computation and entropy-regularized depth allocation, matching the performance of LLMs of up to 12B parameters with only 1.4B-2.6B parameters.

Details

Motivation: Current LLMs rely on explicit text generation like chain-of-thought for reasoning, which defers reasoning to post-training and under-leverages pre-training data. The authors aim to build reasoning capabilities directly into the pre-training phase.

Method: Three key components: (i) iterative computation in latent space, (ii) entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. The models are pre-trained LoopLMs that learn reasoning capabilities during pre-training rather than relying on post-training techniques.

Result: Ouro 1.4B and 2.6B models match the performance of up to 12B state-of-the-art LLMs across various benchmarks. Controlled experiments show this advantage comes from superior knowledge manipulation capabilities rather than increased knowledge capacity. LoopLM also produces reasoning traces more aligned with final outputs than explicit chain-of-thought.

Conclusion: LoopLM represents a promising new scaling direction for reasoning capabilities in language models, demonstrating that building reasoning into pre-training can yield significant efficiency and performance improvements over traditional approaches.

Abstract: Modern LLMs are trained to “think” primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance that matches the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model is available here: http://ouro-llm.github.io.
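A toy sketch of the two ideas named in (i) and (ii), under stated assumptions: the same block is applied repeatedly to a latent state, per-step exit gates are turned into a distribution over exit depths, and the objective is the expected per-step loss minus an entropy bonus on that distribution. The gate-to-distribution rule, the `beta` weight, and all names are illustrative assumptions rather than Ouro's actual implementation.

```python
import math
from typing import Callable, List, Sequence


def looped_forward(state: List[float], block: Callable[[List[float]], List[float]],
                   num_loops: int) -> List[List[float]]:
    """Apply one shared block repeatedly; return the latent state after each loop."""
    states = []
    for _ in range(num_loops):
        state = block(state)  # same parameters reused on every iteration
        states.append(state)
    return states


def exit_distribution(gates: Sequence[float]) -> List[float]:
    """Convert per-step halting gates in (0, 1) into probabilities of exiting at each step."""
    probs, remaining = [], 1.0
    for g in gates[:-1]:
        probs.append(remaining * g)
        remaining *= 1.0 - g
    probs.append(remaining)  # the last step absorbs whatever mass is left
    return probs


def entropy(probs: Sequence[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0)


def looped_objective(step_losses: Sequence[float], gates: Sequence[float],
                     beta: float) -> float:
    """Expected loss over exit depths, minus an entropy bonus weighted by beta."""
    p = exit_distribution(gates)
    expected_loss = sum(pi * li for pi, li in zip(p, step_losses))
    return expected_loss - beta * entropy(p)


states = looped_forward([0.0], lambda h: [0.5 * x + 1.0 for x in h], num_loops=3)
probs = exit_distribution([0.5, 0.5, 0.5])
objective = looped_objective([3.0, 2.0, 1.0], [0.5, 0.5, 0.5], beta=0.1)
```

The entropy term discourages the exit distribution from collapsing onto a single depth early in training, so the model keeps exploring how many loops each input needs.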

[103] Efficient Reasoning via Thought-Training and Thought-Free Inference

Canhui Wu, Qiong Cao, Chao Xue, Wei Xi, Xiaodong He

Main category: cs.CL

TL;DR: 3TF framework enables implicit reasoning through thought-training and thought-free inference, achieving high reasoning accuracy without explicit step-by-step generation.

Details

Motivation: Existing methods focus on compressing verbose reasoning outputs (Long-to-Short) but still require explicit reasoning during inference, which is inefficient.

Method: Train hybrid model with reasoning/non-reasoning modes, further train on CoT data to internalize reasoning, then use thought-free mode for concise inference outputs.

Result: 3TF-trained models show large improvements on reasoning benchmarks under thought-free inference, demonstrating high-quality implicit reasoning.

Conclusion: Models can learn and execute high-quality reasoning implicitly without explicit step-by-step generation, enabling efficient thought-free inference.

Abstract: Recent advances in large language models (LLMs) have leveraged explicit Chain-of-Thought (CoT) prompting to improve reasoning accuracy. However, most existing methods primarily compress verbose reasoning outputs. These Long-to-Short transformations aim to improve efficiency, but still rely on explicit reasoning during inference. In this work, we introduce 3TF (Thought-Training and Thought-Free inference), a framework for efficient reasoning that takes a Short-to-Long perspective. We first train a hybrid model that can operate in both reasoning and non-reasoning modes, and then further train it on CoT-annotated data to internalize structured reasoning, while enforcing concise, thought-free outputs at inference time using the no-reasoning mode. Unlike compression-based approaches, 3TF improves the reasoning quality of non-reasoning outputs, enabling models to perform rich internal reasoning implicitly while keeping external outputs short. Empirically, 3TF-trained models obtain large improvements on reasoning benchmarks under thought-free inference, demonstrating that high-quality reasoning can be learned and executed implicitly without explicit step-by-step generation.

[104] Bot Meets Shortcut: How Can LLMs Aid in Handling Unknown Invariance OOD Scenarios?

Shiyan Zheng, Herun Wan, Minnan Luo, Junhang Huang

Main category: cs.CL

TL;DR: Social bot detectors are vulnerable to shortcut learning where models rely on spurious textual correlations instead of causal features. The study evaluates robustness against manipulated textual cues and proposes LLM-based counterfactual augmentation strategies that improve performance by 56% under shortcut scenarios.

Details

Motivation: Existing social bot detectors perform well on benchmarks but lack robustness in real-world scenarios due to unclear ground truth and varied misleading cues. The impact of shortcut learning, where models use spurious correlations instead of causal features, has been understudied.

Method: Designed shortcut scenarios by constructing spurious associations between user labels and superficial textual cues. Proposed mitigation strategies using large language models with counterfactual data augmentation across three levels: individual user text, overall dataset distribution, and model’s causal information extraction ability.

Result: Shifts in irrelevant feature distributions significantly degrade detector performance with 32% average relative accuracy drop in baseline models. The proposed LLM-based strategies achieved 56% average relative performance improvement under shortcut scenarios.

Conclusion: Shortcut learning poses serious robustness challenges for social bot detectors. Counterfactual data augmentation using LLMs effectively mitigates this issue by addressing data distribution and causal feature extraction, significantly improving model robustness against manipulated textual cues.

Abstract: While existing social bot detectors perform well on benchmarks, their robustness across diverse real-world scenarios remains limited due to unclear ground truth and varied misleading cues. In particular, the impact of shortcut learning, where models rely on spurious correlations instead of capturing causal task-relevant features, has received limited attention. To address this gap, we conduct an in-depth study to assess how detectors are influenced by potential shortcuts based on textual features, which are most susceptible to manipulation by social bots. We design a series of shortcut scenarios by constructing spurious associations between user labels and superficial textual cues to evaluate model robustness. Results show that shifts in irrelevant feature distributions significantly degrade social bot detector performance, with an average relative accuracy drop of 32% in the baseline models. To tackle this challenge, we propose mitigation strategies based on large language models, leveraging counterfactual data augmentation. These methods mitigate the problem from data and model perspectives across three levels, including data distribution at both the individual user text and overall dataset levels, as well as the model’s ability to extract causal information. Our strategies achieve an average relative performance improvement of 56% under shortcut scenarios.
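
The two percentages above are relative, not absolute, changes. The abstract does not spell out the formulas, so the following is a minimal sketch under the usual reading: a relative drop is the lost accuracy divided by the original accuracy, and a relative improvement is the gained accuracy divided by the degraded accuracy. The function names and the example accuracies are hypothetical and are not taken from the paper.

```python
def relative_drop(baseline_acc: float, shifted_acc: float) -> float:
    """Fraction of the baseline accuracy that is lost under the shortcut shift."""
    return (baseline_acc - shifted_acc) / baseline_acc


def relative_improvement(before_acc: float, after_acc: float) -> float:
    """Fraction by which accuracy rises relative to the degraded starting point."""
    return (after_acc - before_acc) / before_acc


# Invented numbers, chosen only to show what a 32% drop and a 56% gain look like.
print(f"{relative_drop(0.80, 0.544):.0%}")        # 32%
print(f"{relative_improvement(0.50, 0.78):.0%}")  # 56%
```

Under this reading, a 56% relative improvement measured from a degraded model does not by itself say whether the original accuracy was recovered.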

[105] PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models

Shivam Sharma, Riya Naik, Tejas Gawas, Heramb Patil, Kunal Korgaonkar

Main category: cs.CL

TL;DR: PustakAI framework creates NCERT-QA dataset for Indian curriculum evaluation, testing LLMs on educational content with various prompting techniques.

Details

Motivation: To address challenges in adapting LLMs to curriculum-specific content like NCERT syllabus, ensuring accuracy, alignment, and pedagogical relevance in education.

Method: Created NCERT-QA dataset aligned with grades 6-8 English and Science curriculum, classified into Factoid, Inferential, and Other question types. Evaluated using meta-prompt, few-shot, and CoT-style prompting with various LLMs.

Result: Analyzed strengths and limitations of both open-source (Gemma3:1b, Llama3.2:3b, Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B, Deepseek-r1-70B) as AI learning tools.

Conclusion: Provides framework for evaluating LLM effectiveness in formal education systems and identifies which prompting approaches better align with curriculum demands.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. This has revolutionized various sectors such as healthcare, software development, and education. In education, LLMs offer potential for personalized and interactive learning experiences, especially in regions with limited teaching resources. However, adapting these models effectively to curriculum-specific content, such as the National Council of Educational Research and Training (NCERT) syllabus in India, presents unique challenges in terms of accuracy, alignment, and pedagogical relevance. In this paper, we present the framework “PustakAI” (Pustak means “book” in many Indian languages) for the design and evaluation of a novel question-answering dataset “NCERT-QA” aligned with the NCERT curriculum for English and Science subjects of grades 6 to 8. We classify the curated QA pairs as Factoid, Inferential, and Others (evaluative and reasoning). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, using diverse evaluation metrics to understand which approach aligns more efficiently with the structure and demands of the curriculum. Along with the usability of the dataset, we analyze the strengths and limitations of current open-source LLMs (Gemma3:1b, Llama3.2:3b, and Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B and Deepseek-r1-70B) as AI-based learning tools in formal education systems.

[106] Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL

Qifeng Cai, Hao Liang, Chang Xu, Tao Xie, Wentao Zhang, Bin Cui

Main category: cs.CL

TL;DR: Text2SQL-Flow is a SQL-aware data augmentation framework that generates large-scale, diverse Text-to-SQL pairs from minimal seed data, creating the SQLFlow dataset with 89,544 examples that improves LLM performance through fine-tuning and novel retrieval methods.

Details

Motivation: Current Text-to-SQL performance is limited by scarce, simplistic, and low-diversity datasets, creating a need for scalable data generation methods.

Method: Proposed Text2SQL-Flow framework with six augmentation dimensions, SQL execution verification, natural language question generation, chain-of-thought reasoning traces, data classification, and modular Database Manager for cross-database compatibility.

Result: Created SQLFlow dataset of 89,544 examples. For open-source LLMs: fine-tuning improves performance across benchmarks. For closed-source LLMs: masked alignment retrieval method outperforms existing methods by treating SQLFlow as knowledge base and training data.

Conclusion: Establishes scalable data-centric foundation for Text-to-SQL systems and highlights critical role of high-quality structured data in modern AI.

Abstract: The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow’s high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.
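
The SQL execution verification step mentioned above can be illustrated with a small filter that keeps only the candidate queries that actually run. This is a minimal sketch using Python's standard `sqlite3` module, not the paper's pipeline or its Database Manager; the `executes_ok` helper, the `singer` schema, and the candidate queries are all hypothetical.

```python
import sqlite3


def executes_ok(schema_sql: str, candidate_sql: str) -> bool:
    """Return True if candidate_sql runs without error on a fresh in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)          # build tables and seed rows
        conn.execute(candidate_sql).fetchall()  # run the candidate query
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Hypothetical schema and augmented queries.
SCHEMA = """
CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
INSERT INTO singer (name, age) VALUES ('A', 30), ('B', 41);
"""
candidates = [
    "SELECT name FROM singer WHERE age > 35",
    "SELECT nam FROM singer",  # misspelled column, fails at execution time
]
kept = [q for q in candidates if executes_ok(SCHEMA, q)]
print(kept)  # ['SELECT name FROM singer WHERE age > 35']
```

Execution only shows that a query is well-formed for the schema. Whether it answers the paired natural-language question needs a separate check.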

[107] Instella: Fully Open Language Models with Stellar Performance

Jiang Liu, Jialian Wu, Xiaodong Yu, Yusheng Su, Prakamya Mishra, Gowtham Ramesh, Sudhanshu Ranjan, Chaitanya Manem, Ximeng Sun, Ze Wang, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

Main category: cs.CL

TL;DR: Instella is a family of fully open 3B parameter language models trained on open data and codebase, achieving state-of-the-art results among fully open models and competitive with leading open-weight models of comparable size.

Details

Motivation: Most high-performing LLMs remain closed-source or partially open, limiting transparency and reproducibility in language modeling research.

Method: Large-scale pre-training on openly available data using AMD Instinct MI300X GPUs, followed by general-purpose instruction tuning and alignment with human preferences. Also developed specialized variants: Instella-Long for 128K context length and Instella-Math enhanced for mathematical reasoning.

Result: Instella achieves state-of-the-art results among fully open models despite using fewer pre-training tokens than contemporaries, and is competitive with leading open-weight models of comparable size.

Conclusion: Instella establishes a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.

Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.

cs.CV

[108] Accelerating Controllable Generation via Hybrid-grained Cache

Lin Liu, Huixia Ben, Shuo Wang, Jinda Lu, Junxiang Qiu, Shengeng Tang, Yanbin Hao

Main category: cs.CV

TL;DR: Proposes Hybrid-Grained Cache (HGC) to improve efficiency of controllable generative models by using coarse-grained block-level cache and fine-grained prompt-level cache to reduce computational overhead while maintaining visual quality.

Details

Motivation: Controllable generative models face computational efficiency challenges due to handling both control conditions and content generation requirements, resulting in low generation efficiency.

Method: Hybrid-Grained Cache approach with: (1) coarse-grained block-level cache for dynamic bypass of redundant computations in encoder-decoder blocks, (2) fine-grained prompt-level cache that reuses cross-attention maps within consecutive reasoning steps and extends to adjacent module computations.

Result: On COCO-Stuff segmentation benchmark, HGC reduces computational cost (MACs) by 63% (from 18.22T to 6.70T) while keeping semantic fidelity loss within 1.5%. Validated on four benchmark datasets with balanced generation efficiency and visual quality.

Conclusion: HGC effectively addresses computational efficiency issues in controllable generative models through multi-granularity cache strategies, achieving significant computational savings with minimal quality degradation.

Abstract: Controllable generative models have been widely used to improve the realism of synthetic visual content. However, such models must handle control conditions and content generation computational requirements, resulting in generally low generation efficiency. To address this issue, we propose a Hybrid-Grained Cache (HGC) approach that reduces computational overhead by adopting cache strategies with different granularities at different computational stages. Specifically, (1) we use a coarse-grained cache (block-level) based on feature reuse to dynamically bypass redundant computations in encoder-decoder blocks between each step of model reasoning. (2) We design a fine-grained cache (prompt-level) that acts within a module, where the fine-grained cache reuses cross-attention maps within consecutive reasoning steps and extends them to the corresponding module computations of adjacent steps. These caches of different granularities can be seamlessly integrated into each computational link of the controllable generation process. We verify the effectiveness of HGC on four benchmark datasets, especially its advantages in balancing generation efficiency and visual quality. For example, on the COCO-Stuff segmentation benchmark, our HGC significantly reduces the computational cost (MACs) by 63% (from 18.22T to 6.70T), while keeping the loss of semantic fidelity (quantized performance degradation) within 1.5%.
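
The reported 63% figure follows directly from the two MAC counts quoted above. The short check below uses only those two numbers; the variable names are illustrative, and the derived ratio is a theoretical compute ratio, not a measured wall-clock speedup.

```python
baseline_macs_t = 18.22  # MACs (in T) without caching, as quoted in the abstract
hgc_macs_t = 6.70        # MACs (in T) with HGC, as quoted in the abstract

reduction = (baseline_macs_t - hgc_macs_t) / baseline_macs_t
compute_ratio = baseline_macs_t / hgc_macs_t

print(f"MAC reduction: {reduction:.1%}")       # MAC reduction: 63.2%
print(f"Compute ratio: {compute_ratio:.2f}x")  # Compute ratio: 2.72x
```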

[109] A Mathematical Framework for AI Singularity: Conditions, Bounds, and Control of Recursive Improvement

Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari

Main category: cs.CV

TL;DR: This paper develops an analytic framework to determine when AI capability growth could escalate without bound (runaway growth) versus when it can be ruled out, based on measurable conditions and physical limits.

Details

Motivation: To replace speculation about AI 'runaway growth' or singularity with precise, testable conditions and deployable safety controls grounded in observable engineering measurements.

Method: Developed an analytic framework linking capability growth to resource build-out and deployment policies, using physical limits (power, bandwidth, memory) to define a service envelope. Created an endogenous growth model coupling capital to compute, data, and energy, with decision rules mapping observable metrics into yes/no certificates for runaway behavior.

Result: The framework provides falsifiable tests based on improvement acceleration rates and yields practical safety controls like power caps, throughput throttling, and evaluation gates. Analytical case studies show when the envelope binds and when it doesn’t.

Conclusion: The approach replaces speculation with testable conditions and deployable controls for certifying or precluding an AI singularity, though limitations exist regarding capability metrics and regularity diagnostics.

Abstract: AI systems improve by drawing on more compute, data, energy, and better training methods. This paper asks a precise, testable version of the “runaway growth” question: under what measurable conditions could capability escalate without bound in finite time, and under what conditions can that be ruled out? We develop an analytic framework for recursive self-improvement that links capability growth to resource build-out and deployment policies. Physical and information-theoretic limits from power, bandwidth, and memory define a service envelope that caps instantaneous improvement. An endogenous growth model couples capital to compute, data, and energy and defines a critical boundary separating superlinear from subcritical regimes. We derive decision rules that map observable series (facility power, IO bandwidth, training throughput, benchmark losses, and spending) into yes/no certificates for runaway versus nonsingular behavior. The framework yields falsifiable tests based on how fast improvement accelerates relative to its current level, and it provides safety controls that are directly implementable in practice, such as power caps, throughput throttling, and evaluation gates. Analytical case studies cover capped-power, saturating-data, and investment-amplified settings, illustrating when the envelope binds and when it does not. The approach is simulation-free and grounded in measurements engineers already collect. Limitations include dependence on the chosen capability metric and on regularity diagnostics; future work will address stochastic dynamics, multi-agent competition, and abrupt architectural shifts. Overall, the results replace speculation with testable conditions and deployable controls for certifying or precluding an AI singularity.
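
To make "without bound in finite time" concrete, consider a toy growth law dC/dt = a * C^p. This is a textbook illustration and an assumption of this note, not the paper's endogenous growth model. For p > 1 the solution diverges at the finite time t* = C0^(1-p) / (a * (p - 1)). For p <= 1 it stays finite at every finite time. Capping the instantaneous rate, in the spirit of the service envelope described above, limits growth to at most linear. The function names and parameters below are hypothetical.

```python
from typing import Optional


def blowup_time(c0: float, a: float, p: float) -> Optional[float]:
    """Finite blow-up time of dC/dt = a * C**p with C(0) = c0, or None if there is none."""
    if c0 <= 0 or a <= 0:
        raise ValueError("c0 and a must be positive")
    if p <= 1:
        return None  # exponential (p == 1) or slower growth never diverges in finite time
    return c0 ** (1 - p) / (a * (p - 1))


def capped_rate(c: float, a: float, p: float, cap: float) -> float:
    """Growth rate once a hard ceiling on instantaneous improvement is imposed."""
    return min(a * c ** p, cap)


print(blowup_time(1.0, 1.0, 2.0))        # 1.0
print(blowup_time(1.0, 1.0, 1.0))        # None
print(capped_rate(10.0, 1.0, 2.0, 5.0))  # 5.0
```

With the cap in place, C(t) can never exceed C0 + cap * t. A binding ceiling on the rate therefore rules out finite-time divergence in this toy setting.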

[110] Semantic VLM Dataset for Safe Autonomous Driving

Yuankai He, Weisong Shi

Main category: cs.CV

TL;DR: CAR-Scenes is a comprehensive frame-level autonomous driving dataset with 5,192 annotated images and 350+ attributes across 28 categories, enabling vision-language model training for interpretable scene understanding.

Details

Motivation: To create a dataset that supports training and evaluation of vision-language models for interpretable, scene-level understanding in autonomous driving, addressing the need for explainable AI in intelligent vehicles.

Method: Used GPT-4o-assisted vision-language pipeline with human-in-the-loop verification to annotate images from Argoverse 1, Cityscapes, KITTI, and nuScenes. Includes 28-key category/sub-category knowledge base with severity scale (1-10) and provides attribute co-occurrence graphs and JSONL records.

Result: Created dataset with 5,192 annotated images and 350+ leaf attributes. Includes reproducible baselines with LoRA-tuned Qwen2-VL-2B model achieving measurable performance on validation split using accuracy, F1 scores, and severity MAE/RMSE metrics.

Conclusion: CAR-Scenes enables explainable, data-centric workflows for autonomous driving research by providing comprehensive annotations, analysis tools, and reproducible baselines to support semantic retrieval, dataset triage, and risk-aware scenario mining.

Abstract: CAR-Scenes is a frame-level dataset for autonomous driving that enables training and evaluation of vision-language models (VLMs) for interpretable, scene-level understanding. We annotate 5,192 images drawn from Argoverse 1, Cityscapes, KITTI, and nuScenes using a 28-key category/sub-category knowledge base covering environment, road geometry, background-vehicle behavior, ego-vehicle behavior, vulnerable road users, sensor states, and a discrete severity scale (1-10), totaling 350+ leaf attributes. Labels are produced by a GPT-4o-assisted vision-language pipeline with human-in-the-loop verification; we release the exact prompts, post-processing rules, and per-field baseline model performance. CAR-Scenes also provides attribute co-occurrence graphs and JSONL records that support semantic retrieval, dataset triage, and risk-aware scenario mining across sources. To calibrate task difficulty, we include reproducible, non-benchmark baselines, notably a LoRA-tuned Qwen2-VL-2B with deterministic decoding, evaluated via scalar accuracy, micro-averaged F1 for list attributes, and severity MAE/RMSE on a fixed validation split. We publicly release the annotation and analysis scripts, including graph construction and evaluation scripts, to enable explainable, data-centric workflows for future intelligent vehicles. Dataset: https://github.com/Croquembouche/CAR-Scenes
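
The metrics named above can be sketched in a few lines. This is an illustrative implementation of micro-averaged F1 over list-valued attributes and of MAE/RMSE over severity scores. It is not the dataset's released evaluation script, and the attribute values and severity numbers are invented.

```python
import math


def micro_f1(gold, pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all frames."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mae(gold, pred):
    """Mean absolute error between gold and predicted severity."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def rmse(gold, pred):
    """Root mean squared error between gold and predicted severity."""
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold))


# Hypothetical list attribute (one set of labels per frame) and severity scores.
gold_lists = [{"car", "pedestrian"}, {"truck"}]
pred_lists = [{"car"}, {"truck", "bus"}]
print(round(micro_f1(gold_lists, pred_lists), 3))  # 0.667

gold_sev, pred_sev = [3, 7, 5], [4, 7, 3]
print(mae(gold_sev, pred_sev))             # 1.0
print(round(rmse(gold_sev, pred_sev), 3))  # 1.291
```

Micro-averaging pools counts across frames before computing F1, so frames with many labels weigh more than frames with few.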

[111] Fast Data Attribution for Text-to-Image Models

Sheng-Yu Wang, Aaron Hertzmann, Alexei A Efros, Richard Zhang, Jun-Yan Zhu

Main category: cs.CV

TL;DR: Proposes a scalable data attribution method for text-to-image models that distills slow unlearning-based attribution into efficient feature embeddings, achieving 2,500x-400,000x speedup over existing methods.

Details

Motivation: Existing data attribution methods for text-to-image models are computationally expensive and impractical for real-world applications, requiring significant resources for each query.

Method: Distills slow unlearning-based attribution methods into a feature embedding space for efficient retrieval, combined with indexing and search methods to find influential training images without running expensive attribution algorithms.

Result: Achieves better or competitive performance in seconds compared to existing methods, with 2,500x-400,000x speedup on both medium-scale MSCOCO models and large-scale Stable Diffusion models trained on LAION.

Conclusion: The method represents a meaningful step toward large-scale application of data attribution on real-world models like Stable Diffusion by making attribution practical and efficient.

Abstract: Data attribution for text-to-image models aims to identify the training images that most significantly influenced a generated output. Existing attribution methods involve considerable computational resources for each query, making them impractical for real-world applications. We propose a novel approach for scalable and efficient data attribution. Our key idea is to distill a slow, unlearning-based attribution method to a feature embedding space for efficient retrieval of highly influential training images. During deployment, combined with efficient indexing and search methods, our method successfully finds highly influential images without running expensive attribution algorithms. We show extensive results on both medium-scale models trained on MSCOCO and large-scale Stable Diffusion models trained on LAION, demonstrating that our method can achieve better or competitive performance in a few seconds, faster than existing methods by 2,500x - 400,000x. Our work represents a meaningful step towards the large-scale application of data attribution methods on real-world models such as Stable Diffusion.
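
The retrieval step at deployment can be pictured as nearest-neighbor search in the learned embedding space. The sketch below assumes cosine similarity and a brute-force scan, which stands in for the indexing and search methods the abstract mentions but does not name. The embeddings and image identifiers are made up, and the distillation step that produces such embeddings is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_influential(query_emb, train_embs, k):
    """Rank training images by similarity to the generated image's embedding."""
    ranked = sorted(train_embs, key=lambda name: cosine(query_emb, train_embs[name]), reverse=True)
    return ranked[:k]


# Hypothetical 2-D embeddings for three training images and one generated image.
train_embs = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0], "img_c": [0.9, 0.1]}
print(top_k_influential([1.0, 0.0], train_embs, k=2))  # ['img_a', 'img_c']
```

A scan like this costs one similarity computation per training image. At LAION scale an approximate nearest-neighbor index would replace it.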

[112] Expert Consensus-based Video-Based Assessment Tool for Workflow Analysis in Minimally Invasive Colorectal Surgery: Development and Validation of ColoWorkflow

Pooja P Jain, Pietro Mascagni, Giuseppe Massimiani, Nabani Banik, Marta Goglia, Lorenzo Arboit, Britty Baby, Andrea Balla, Ludovica Baldari, Gianfranco Silecchia, Claudio Fiorillo, CompSurg Colorectal Experts Group, Sergio Alfieri, Salvador Morales-Conde, Deborah S Keller, Luigi Boni, Nicolas Padoy

Main category: cs.CV

TL;DR: Developed and validated ColoWorkflow - a consensus-based video assessment tool for analyzing surgical workflows in minimally invasive colorectal surgery, achieving moderate inter-rater reliability.

Details

Motivation: To address procedural variability, difficult learning curves, and complications in minimally invasive colorectal surgery through standardized video-based assessment that can reduce variability and improve surgical performance.

Method: Used Delphi process to achieve consensus on workflow descriptors, developed ColoWorkflow tool, applied it to 54 multicentre colorectal surgery videos, and evaluated applicability and inter-rater reliability.

Result: Achieved consensus on 10 procedure-agnostic phases and 34 procedure-specific steps. Tool demonstrated broad applicability with mean Cohen’s kappa of 0.71 for phases and 0.66 for steps. Most discrepancies occurred at phase transitions.

Conclusion: ColoWorkflow provides a validated, reproducible framework for video-based performance assessment in colorectal surgery, enabling benchmarking and supporting AI-driven workflow recognition to standardize training and improve surgical quality.

Abstract: Minimally invasive colorectal surgery is characterized by procedural variability, a difficult learning curve, and complications that impact quality and outcomes. Video-based assessment (VBA) offers an opportunity to generate data-driven insights to reduce variability, optimize training, and improve surgical performance. However, existing tools for workflow analysis remain difficult to standardize and implement. This study aims to develop and validate a VBA tool for workflow analysis across minimally invasive colorectal procedures. A Delphi process was conducted to achieve consensus on generalizable workflow descriptors. The resulting framework informed the development of a new VBA tool, ColoWorkflow. Independent raters then applied ColoWorkflow to a multicentre video dataset of laparoscopic and robotic colorectal surgery (CRS). Applicability and inter-rater reliability were evaluated. Consensus was achieved for 10 procedure-agnostic phases and 34 procedure-specific steps describing CRS workflows. ColoWorkflow was developed and applied to 54 colorectal operative videos (left and right hemicolectomies, sigmoid and rectosigmoid resections, and total proctocolectomies) from five centres. The tool demonstrated broad applicability, with all but one label utilized. Inter-rater reliability was moderate, with mean Cohen’s kappa of 0.71 for phases and 0.66 for steps. Most discrepancies arose at phase transitions and step boundary definitions. ColoWorkflow is the first consensus-based, validated VBA tool for comprehensive workflow analysis in minimally invasive CRS. It establishes a reproducible framework for video-based performance assessment, enabling benchmarking across institutions and supporting the development of artificial intelligence-driven workflow recognition. Its adoption may standardize training, accelerate competency acquisition, and advance data-informed surgical quality improvement.
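
The inter-rater statistic reported above corrects raw agreement for the agreement two raters would reach by chance. The sketch below computes Cohen's kappa for two raters labeling the same sequence of video segments. The phase names and labels are invented, and the abstract does not say at what granularity (frames or segments) the study computed agreement.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical phase labels for six video segments.
rater_a = ["dissect", "dissect", "resect", "resect", "close", "close"]
rater_b = ["dissect", "resect", "resect", "resect", "close", "close"]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.75
```

In this example the raters agree on five of six segments (83%), but kappa is 0.75 because a third of that agreement would be expected by chance.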

[113] Frequency-Aware Vision-Language Multimodality Generalization Network for Remote Sensing Image Classification

Junjie Zhang, Feng Zhao, Hanqiang Liu, Jun Yu

Main category: cs.CV

TL;DR: Proposes FVMGN for remote sensing multimodality generalization, using frequency-aware vision-language fusion with diffusion-based augmentation and wavelet disentanglement to handle multimodal heterogeneity.

Details

Motivation: Address the challenge of multimodal generalization in remote sensing where models need to overcome data heterogeneity and lack proprietary linguistic knowledge specific to RS vision modalities.

Method: Uses diffusion-based training-test-time augmentation, multimodal wavelet disentanglement for cross-domain invariant features, spatial-frequency-aware image encoder, and multiscale spatial-frequency feature alignment.

Result: Extensive experiments show FVMGN achieves excellent multimodality generalization ability compared to state-of-the-art methods.

Conclusion: FVMGN effectively addresses RS multimodality generalization by leveraging frequency-domain processing and vision-language fusion, demonstrating superior performance over existing approaches.

Abstract: The booming remote sensing (RS) technology is giving rise to a novel multimodality generalization task, which requires the model to overcome data heterogeneity while possessing powerful cross-scene generalization ability. Moreover, most vision-language models (VLMs) usually describe surface materials in RS images using universal texts, lacking proprietary linguistic prior knowledge specific to different RS vision modalities. In this work, we formalize RS multimodality generalization (RSMG) as a learning paradigm, and propose a frequency-aware vision-language multimodality generalization network (FVMGN) for RS image classification. Specifically, a diffusion-based training-test-time augmentation (DTAug) strategy is designed to reconstruct multimodal land-cover distributions, enriching input information for FVMGN. Following that, to overcome multimodal heterogeneity, a multimodal wavelet disentanglement (MWDis) module is developed to learn cross-domain invariant features by resampling low- and high-frequency components in the frequency domain. Considering the characteristics of RS vision modalities, shared and proprietary class texts are designed as linguistic inputs for the transformer-based text encoder to extract diverse text features. For multimodal vision inputs, a spatial-frequency-aware image encoder (SFIE) is constructed to realize local-global feature reconstruction and representation. Finally, a multiscale spatial-frequency feature alignment (MSFFA) module is proposed to construct a unified semantic space, ensuring refined multiscale alignment of different text and vision features in spatial and frequency domains. Extensive experiments show that FVMGN achieves excellent multimodality generalization compared with state-of-the-art (SOTA) methods.

[114] GFT: Graph Feature Tuning for Efficient Point Cloud Analysis

Manish Dhakal, Venkat R. Dasari, Raj Sunderraman, Yi Ding

Main category: cs.CV

TL;DR: GFT is a parameter-efficient fine-tuning method for point cloud data that uses dynamic graph features and cross-attention to reduce trainable parameters while maintaining performance.

Details

Motivation: To further reduce trainable parameters in parameter-efficient fine-tuning for point cloud data, as general PEFT approaches are suboptimal for this domain.

Method: Learns dynamic graphs from initial transformer tokenized inputs using lightweight graph convolution networks, then passes graph features to deeper layers via skip connections and efficient cross-attention modules.

Result: Extensive experiments show GFT rivals existing methods on object classification and segmentation tasks while reducing trainable parameters.

Conclusion: GFT provides an effective point-cloud-specific PEFT approach that significantly reduces parameter count while maintaining competitive performance.

Abstract: Parameter-efficient fine-tuning (PEFT) significantly reduces computational and memory costs by updating only a small subset of the model’s parameters, enabling faster adaptation to new tasks with minimal loss in performance. Previous studies have introduced PEFTs tailored for point cloud data, as general approaches are suboptimal. To further reduce the number of trainable parameters, we propose a point-cloud-specific PEFT, termed Graph Features Tuning (GFT), which learns a dynamic graph from initial tokenized inputs of the transformer using a lightweight graph convolution network and passes these graph features to deeper layers via skip connections and efficient cross-attention modules. Extensive experiments on object classification and segmentation tasks show that GFT operates in the same domain, rivalling existing methods, while reducing the trainable parameters. Code is at https://github.com/manishdhakal/GFT.

[115] Accuracy-Preserving CNN Pruning Method under Limited Data Availability

Daisuke Yasui, Toshitaka Matsuki, Hiroshi Sato

Main category: cs.CV

TL;DR: Proposes an improved pruning method using Layer-wise Relevance Propagation that achieves higher pruning rates with better accuracy preservation than existing methods, especially effective with limited data.

Details

Motivation: CNN models are becoming larger for better accuracy but need compression for resource-constrained environments. Existing LRP-based pruning methods suffer from significant accuracy degradation, limiting practical usability.

Method: Uses Layer-wise Relevance Propagation (LRP) for pruning, focusing on achieving higher pruning rates while better preserving model accuracy, particularly effective with small amounts of data.

Result: Achieved pruning that preserves accuracy better than existing methods, with higher pruning rates while maintaining model performance.

Conclusion: The proposed LRP-based pruning method successfully addresses accuracy degradation issues in existing approaches, enabling practical model compression with limited data while maintaining better accuracy.

Abstract: Convolutional Neural Networks (CNNs) are widely used in image recognition and have succeeded in various domains. CNN models have become larger-scale to improve accuracy and generalization performance. Research has been conducted on compressing pre-trained models for specific target applications in environments with limited computing resources. Among model compression techniques, methods using Layer-wise Relevance Propagation (LRP), an explainable AI technique, have shown promise by achieving high pruning rates while preserving accuracy, even without fine-tuning. Because these methods do not require fine-tuning, they are suited to scenarios with limited data. However, existing LRP-based pruning approaches still suffer from significant accuracy degradation, limiting their practical usability. This study proposes a pruning method that achieves a higher pruning rate while preserving better model accuracy. Our approach to pruning with a small amount of data has achieved pruning that preserves accuracy better than existing methods.

[116] SplineSplat: 3D Ray Tracing for Higher-Quality Tomography

Youssef Haouchat, Sepand Kashani, Aleix Boquet-Pujadas, Philippe Thévenaz, Michael Unser

Main category: cs.CV

TL;DR: Efficient tomographic projection method using B-spline representations and neural network-accelerated ray-tracing for 3D line integrals.

Details

Motivation: To improve reconstruction quality beyond traditional voxel-based methods by using more sophisticated mathematical representations and efficient computation.

Method: Ray-tracing algorithm for 3D line integrals with arbitrary projection geometries, using linear combinations of shifted B-splines and a neural network to efficiently compute basis function contributions.

Result: Achieved higher reconstruction quality than traditional voxel-based methods in well-posed cases with sufficient data for accurate reconstruction.

Conclusion: The proposed B-spline representation with neural network-accelerated ray-tracing provides superior reconstruction quality compared to conventional voxel-based approaches.

Abstract: We propose a method to efficiently compute tomographic projections of a 3D volume represented by a linear combination of shifted B-splines. To do so, we propose a ray-tracing algorithm that computes 3D line integrals with arbitrary projection geometries. One of the components of our algorithm is a neural network that computes the contribution of the basis functions efficiently. In our experiments, we consider well-posed cases where the data are sufficient for accurate reconstruction without the need for regularization. We achieve higher reconstruction quality than traditional voxel-based methods.

[117] Short-Window Sliding Learning for Real-Time Violence Detection via LLM-based Auto-Labeling

Seoik Jung, Taekyung Song, Yangro Lee, Sungjun Lee

Main category: cs.CV

TL;DR: A Short-Window Sliding Learning framework for real-time violence detection using 1-2 second video clips with LLM-based auto-caption labeling, achieving high accuracy on benchmark datasets.

Details

Motivation: To enable precise real-time violence detection in CCTV footage by addressing limitations of conventional long-video training approaches that may miss rapid violent events.

Method: Divides videos into 1-2 second clips, applies LLM-based auto-caption labeling to create fine-grained datasets, and fully utilizes all frames in short clips to preserve temporal continuity.

Result: Achieves 95.25% accuracy on RWF-2000 and 83.25% on UCF-Crime, demonstrating strong generalization and real-time applicability.

Conclusion: The proposed framework effectively enables real-time violence detection with high accuracy and strong generalization across different video datasets.

Abstract: This paper proposes a Short-Window Sliding Learning framework for real-time violence detection in CCTV footage. Unlike conventional long-video training approaches, the proposed method divides videos into 1-2 second clips and applies Large Language Model (LLM)-based auto-caption labeling to construct fine-grained datasets. Each short clip fully utilizes all frames to preserve temporal continuity, enabling precise recognition of rapid violent events. Experiments demonstrate that the proposed method achieves 95.25% accuracy on RWF-2000 and significantly improves performance on long videos (UCF-Crime: 83.25%), confirming its strong generalization and real-time applicability in intelligent surveillance systems.
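
The clip-splitting step can be sketched as follows. This is a minimal illustration assuming a constant frame rate; `fps`, `window_s`, and `stride_s` are hypothetical parameter names, and the abstract does not state whether consecutive windows overlap, so the one-second stride is an assumption.

```python
def short_window_clips(num_frames, fps, window_s=2.0, stride_s=1.0):
    """Return (start, end) frame index pairs for short sliding windows.

    Hypothetical helper: window and stride lengths are illustrative only.
    `end` is exclusive, and every frame inside a window is kept.
    """
    window = int(round(window_s * fps))
    stride = int(round(stride_s * fps))
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must cover at least one frame")
    clips = []
    start = 0
    while start + window <= num_frames:
        clips.append((start, start + window))
        start += stride
    return clips


# A 5-second video at 30 FPS, cut into 2-second windows every second.
clips = short_window_clips(num_frames=150, fps=30, window_s=2.0, stride_s=1.0)
print(clips)
```

Each resulting clip would then be captioned and labeled separately, which is what makes the labels fine-grained in time.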

[118] MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition

Feng Li, Ke Wu, Yongwei Li

Main category: cs.CV

TL;DR: Proposes MCN-CL, a multimodal emotion recognition method using cross-attention and contrastive learning to address modal heterogeneity and category imbalance, achieving state-of-the-art performance on IEMOCAP and MELD datasets.

Details

Motivation: Address three major challenges in multimodal emotion recognition: unbalanced category distribution, complexity of dynamic facial action unit time modeling, and difficulty of feature fusion due to modal heterogeneity, especially with growing multimodal data in social media.

Method: Uses Multimodal Cross-Attention Network and Contrastive Learning (MCN-CL) with triple query mechanism and hard negative mining strategy to remove feature redundancy while preserving important emotional cues.

Result: Outperforms state-of-the-art approaches on IEMOCAP and MELD datasets, with Weighted F1 scores improving by 3.42% and 5.73% respectively.

Conclusion: The proposed MCN-CL method effectively addresses modal heterogeneity and category imbalance issues in multimodal emotion recognition, demonstrating superior performance compared to existing methods.

Abstract: Multimodal emotion recognition plays a key role in many domains, including mental health monitoring, educational interaction, and human-computer interaction. However, existing methods often face three major challenges: unbalanced category distribution, the complexity of dynamic facial action unit time modeling, and the difficulty of feature fusion due to modal heterogeneity. With the explosive growth of multimodal data in social media scenarios, the need for building an efficient cross-modal fusion framework for emotion recognition is becoming increasingly urgent. To this end, this paper proposes Multimodal Cross-Attention Network and Contrastive Learning (MCN-CL) for multimodal emotion recognition. It uses a triple query mechanism and hard negative mining strategy to remove feature redundancy while preserving important emotional cues, effectively addressing the issues of modal heterogeneity and category imbalance. Experiment results on the IEMOCAP and MELD datasets show that our proposed method outperforms state-of-the-art approaches, with Weighted F1 scores improving by 3.42% and 5.73%, respectively.

[119] DINOv3 as a Frozen Encoder for CRPS-Oriented Probabilistic Rainfall Nowcasting

Luciano Araujo Dourado Filho, Almir Moreira da Silva Neto, Anthony Miyaguchi, Rodrigo Pereira David, Rodrigo Tripodi Calumby, Lukáš Picek

Main category: cs.CV

TL;DR: A competitive probabilistic rainfall nowcasting method using a V-JEPA Vision Transformer with lightweight probabilistic head attached to pre-trained satellite encoder, achieving 26% effectiveness gain over 3D-UNET baselines.

Details

Motivation: To develop a computationally efficient approach for probabilistic rainfall nowcasting that outperforms existing methods while maintaining computational efficiency.

Method: Uses video projector (V-JEPA Vision Transformer) with lightweight probabilistic head attached to pre-trained satellite vision encoder (DINOv3-SAT493M) to map encoder tokens into discrete empirical CDF over 4-hour accumulated rainfall, optimized end-to-end over CRPS. Compared against 3D-UNET baselines with aggregate Rank Probability Score and per-pixel Gamma-Hurdle objective.

Result: Achieved CRPS of 3.5102 on Weather4Cast 2025 benchmark, representing approximately 26% effectiveness gain against the best 3D-UNET baseline.

Conclusion: The proposed V-JEPA-based approach with lightweight probabilistic head provides a competitive and computationally efficient solution for probabilistic rainfall nowcasting, significantly outperforming traditional 3D-UNET methods.

Abstract: This paper proposes a competitive and computationally efficient approach to probabilistic rainfall nowcasting. A video projector (V-JEPA Vision Transformer) paired with a lightweight probabilistic head is attached to a pre-trained satellite vision encoder (DINOv3-SAT493M) to map encoder tokens into a discrete empirical CDF (eCDF) over 4-hour accumulated rainfall. The projector-head is optimized end-to-end over the Continuous Ranked Probability Score (CRPS). As an alternative, 3D-UNET baselines trained with an aggregate Rank Probability Score and a per-pixel Gamma-Hurdle objective are used. On the Weather4Cast 2025 benchmark, the proposed method achieved promising performance, with a CRPS of 3.5102, which represents an effectiveness gain of approximately 26% over the best 3D-UNET.
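
CRPS compares a forecast CDF F with a step function at the observed value y: CRPS(F, y) is the integral of (F(x) - 1{x >= y})^2 over x, and lower is better. For a CDF that is only known at discrete thresholds, the integral can be approximated by a sum over bins. The sketch below assumes a left Riemann sum on a hypothetical rainfall grid; the paper's actual binning and loss implementation are not given in the abstract.

```python
def crps_discrete(cdf_values, thresholds, observed):
    """Approximate CRPS for a forecast CDF sampled on increasing thresholds.

    cdf_values[k] is the forecast P(rain <= thresholds[k]).
    Uses a left Riemann sum of (F(t) - 1{t >= observed})^2 over the grid.
    """
    if len(cdf_values) != len(thresholds):
        raise ValueError("cdf_values and thresholds must have equal length")
    total = 0.0
    for k in range(len(thresholds) - 1):
        width = thresholds[k + 1] - thresholds[k]
        step = 1.0 if thresholds[k] >= observed else 0.0
        total += (cdf_values[k] - step) ** 2 * width
    return total


# Hypothetical rainfall thresholds in mm and an observation of 2 mm.
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
sharp = crps_discrete([0.0, 0.0, 1.0, 1.0, 1.0], grid, observed=2.0)
vague = crps_discrete([0.2, 0.4, 0.6, 0.8, 1.0], grid, observed=2.0)
print(sharp, vague)
```

The sharp forecast that places all probability mass at the observed value scores zero, while the spread-out forecast is penalized.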

[120] YOLO-Drone: An Efficient Object Detection Approach Using the GhostHead Network for Drone Images

Hyun-Ki Jung

Main category: cs.CV

TL;DR: Proposes YOLO-Drone, an enhanced YOLOv11 model with GhostHead Network for drone-based object detection, achieving improved accuracy and speed on VisDrone dataset.

Details

Motivation: Drone images from high altitudes make object identification difficult, requiring improved object detection models for drone applications.

Method: Enhanced YOLOv11n with GhostHead Network in the Head network, using VisDrone dataset for evaluation.

Result: YOLO-Drone improved over YOLOv11 by 0.4% in Precision, 0.6% in Recall, 0.5% in F1-Score, and 0.5% in mAP (0.5), with better inference speed. It also outperformed YOLOv8, YOLOv9, and YOLOv10 in mAP (0.5).

Conclusion: YOLO-Drone is a high-performance model with enhanced accuracy and speed for drone-based object detection, superior to existing YOLO variants.

Abstract: Object detection using images or videos captured by drones is a promising technology with significant potential across various industries. However, a major challenge is that drone images are typically taken from high altitudes, making object identification difficult. This paper proposes an effective solution to address this issue. The base model used in the experiments is YOLOv11, the latest object detection model, with a specific implementation based on YOLOv11n. The experimental data were sourced from the widely used and reliable VisDrone dataset, a standard benchmark in drone-based object detection. This paper introduces an enhancement to the Head network of the YOLOv11 algorithm, called the GhostHead Network. The model incorporating this improvement is named YOLO-Drone. Experimental results demonstrate that YOLO-Drone achieves significant improvements in key detection accuracy metrics, including Precision, Recall, F1-Score, and mAP (0.5), compared to the original YOLOv11. Specifically, the proposed model recorded a 0.4% increase in Precision, a 0.6% increase in Recall, a 0.5% increase in F1-Score, and a 0.5% increase in mAP (0.5). Additionally, the Inference Speed metric, which measures image processing speed, also showed a notable improvement. These results indicate that YOLO-Drone is a high-performance model with enhanced accuracy and speed compared to YOLOv11. To further validate its reliability, comparative experiments were conducted against other high-performance object detection models, including YOLOv8, YOLOv9, and YOLOv10. The results confirmed that the proposed model outperformed YOLOv8 by 0.1% in mAP (0.5) and surpassed YOLOv9 and YOLOv10 by 0.3% and 0.6%, respectively.

[121] PhaseWin Search Framework Enable Efficient Object-Level Interpretation

Zihan Gu, Ruoyu Chen, Junchi Zhang, Yue Hu, Hua Zhang, Xiaochun Cao

Main category: cs.CV

TL;DR: PhaseWin is a novel phase-window search algorithm for efficient object-level foundation model attribution that achieves near-greedy faithfulness with near-linear complexity, reducing computational costs by 80% while maintaining over 95% attribution quality.

Details

Motivation: Existing submodular subset selection methods for attribution achieve high faithfulness but suffer from efficiency limitations that hinder practical deployment in real-world scenarios due to quadratic complexity.

Method: PhaseWin replaces traditional greedy selection with a phased coarse-to-fine search approach, combining adaptive pruning, windowed fine-grained selection, and dynamic supervision mechanisms to approximate greedy behavior while reducing model evaluations.

Result: PhaseWin achieves over 95% of greedy attribution faithfulness using only 20% of the computational budget, and consistently outperforms other attribution baselines across object detection and visual grounding tasks with Grounding DINO and Florence-2.

Conclusion: PhaseWin establishes a new state of the art in scalable, high-faithfulness attribution for object-level multimodal models, enabling practical deployment with near-linear complexity while maintaining strong theoretical guarantees.

Abstract: Attribution is essential for interpreting object-level foundation models. Recent methods based on submodular subset selection have achieved high faithfulness, but their efficiency limitations hinder practical deployment in real-world scenarios. To address this, we propose PhaseWin, a novel phase-window search algorithm that enables faithful region attribution with near-linear complexity. PhaseWin replaces traditional quadratic-cost greedy selection with a phased coarse-to-fine search, combining adaptive pruning, windowed fine-grained selection, and dynamic supervision mechanisms to closely approximate greedy behavior while dramatically reducing model evaluations. Theoretically, PhaseWin retains near-greedy approximation guarantees under mild monotone submodular assumptions. Empirically, PhaseWin achieves over 95% of greedy attribution faithfulness using only 20% of the computational budget, and consistently outperforms other attribution baselines across object detection and visual grounding tasks with Grounding DINO and Florence-2. PhaseWin establishes a new state of the art in scalable, high-faithfulness attribution for object-level multimodal models.
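
To see where the quadratic cost comes from, consider the plain greedy baseline that PhaseWin is said to replace. The sketch below is not PhaseWin itself: it is ordinary greedy subset selection, with a toy coverage function (`coverage`, a hypothetical stand-in for an expensive model evaluation) used as the monotone submodular score.

```python
def greedy_select(regions, score, budget):
    """Plain greedy subset selection.

    Returns the regions in the order they were picked and the number of
    calls made to `score`, which stands in for a model evaluation.
    """
    selected = []
    remaining = list(regions)
    current = score(selected)
    calls = 1
    while remaining and len(selected) < budget:
        best_gain, best_region = None, None
        # Every remaining region is re-scored at every step.
        for region in remaining:
            gain = score(selected + [region]) - current
            calls += 1
            if best_gain is None or gain > best_gain:
                best_gain, best_region = gain, region
        selected.append(best_region)
        remaining.remove(best_region)
        current += best_gain
    return selected, calls


# Toy monotone submodular score: number of distinct pixels covered.
toy_regions = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2}}


def coverage(names):
    covered = set()
    for name in names:
        covered |= toy_regions[name]
    return len(covered)


order, calls = greedy_select(list(toy_regions), coverage, budget=4)
print(order, calls)
```

Ranking all n regions this way costs 1 + n + (n - 1) + ... + 1 score calls, which grows quadratically in n; this is the evaluation count that a phased, windowed search aims to cut.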

[122] Out-of-Distribution Detection with Positive and Negative Prompt Supervision Using Large Language Models

Zhixia He, Chen Zhao, Minglai Shao, Xintao Wu, Xujiang Zhao, Dong Li, Qin Tian, Linlin Yu

Main category: cs.CV

TL;DR: Proposes Positive and Negative Prompt Supervision to enhance OOD detection by using LLM-initialized prompts that capture inter-class features and transfer semantic knowledge to visual modality via graph-based architecture.

Details

Motivation: Current negative prompts in vision-language models for OOD detection often include broad non-ID features, leading to suboptimal results due to overlapping or misleading information capture.

Method: Uses LLM-initialized class-specific positive/negative prompts, optimizes them to focus on intra-class features and category boundaries, and employs graph-based architecture to transfer semantic supervision to visual branch for energy-based OOD detection.

Result: Outperforms state-of-the-art baselines on CIFAR-100 and ImageNet-1K across eight OOD datasets and five different LLMs.

Conclusion: The proposed method effectively enhances OOD detection performance by better capturing inter-class features and transferring semantic knowledge between modalities.

Abstract: Out-of-distribution (OOD) detection is committed to delineating the classification boundaries between in-distribution (ID) and OOD images. Recent advances in vision-language models (VLMs) have demonstrated remarkable OOD detection performance by integrating both visual and textual modalities. In this context, negative prompts are introduced to emphasize the dissimilarity between image features and prompt content. However, these prompts often include a broad range of non-ID features, which may result in suboptimal outcomes due to the capture of overlapping or misleading information. To address this issue, we propose Positive and Negative Prompt Supervision, which encourages negative prompts to capture inter-class features and transfers this semantic knowledge to the visual modality to enhance OOD detection performance. Our method begins with class-specific positive and negative prompts initialized by large language models (LLMs). These prompts are subsequently optimized, with positive prompts focusing on features within each class, while negative prompts highlight features around category boundaries. Additionally, a graph-based architecture is employed to aggregate semantic-aware supervision from the optimized prompt representations and propagate it to the visual branch, thereby enhancing the performance of the energy-based OOD detector. Extensive experiments on two benchmarks, CIFAR-100 and ImageNet-1K, across eight OOD datasets and five different LLMs, demonstrate that our method outperforms state-of-the-art baselines.
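
For reference, an energy-based OOD detector typically scores an input by the free energy of its logits, E(x) = -T * log(sum_i exp(f_i(x) / T)), with lower energy suggesting an in-distribution input. The sketch below assumes this commonly used form; the temperature value and the example logits are hypothetical, and the paper's prompt supervision and graph-based components are not modeled.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free energy -T * log(sum(exp(logit / T))), computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp


# Peaked logits (typical of a confident ID prediction) vs. flat logits.
confident = energy_score([10.0, 0.0, 0.0])
flat = energy_score([0.0, 0.0, 0.0])
print(confident, flat)
```

In practice a threshold on this score, chosen on held-out ID data, separates ID from OOD inputs; the threshold is left out here because it is dataset-specific.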

[123] Fractured Glass, Failing Cameras: Simulating Physics-Based Adversarial Samples for Autonomous Driving Systems

Manav Prabhakar, Jwalandhar Girnar, Arpan Kusari

Main category: cs.CV

TL;DR: This paper investigates how camera glass failures create physics-based adversarial samples that compromise autonomous vehicle perception systems, using real-world experiments, FEM-based crack simulation, and PBR visualization to evaluate detection failure rates across multiple datasets.

Details

Motivation: Camera failures from physical stresses pose critical safety risks to autonomous vehicles by causing neural network detection models to fail, yet this category of physics-based adversarial samples is often overlooked in research.

Method: Combines real-world experiments with simulation: uses FEM-based approach to generate surface cracks from stress fields, applies PBR for realistic fracture visualization, and evaluates detection failure rates on KITTI, BDD100K, and MS-COCO datasets using CNN-based (YOLOv8, Faster R-CNN) and transformer-based (Pyramid Vision Transformers) models.

Result: Demonstrates that simulated broken glass effects significantly increase detection failure rates for critical object classes across multiple datasets and models, with distributional analysis showing substantial visual distortions through K-L divergence measurements.

Conclusion: Camera glass failures represent a realistic and dangerous class of physics-based adversarial samples that can severely compromise autonomous driving safety, highlighting the need for robust perception systems resilient to such physical failures.

Abstract: While much research has recently focused on generating physics-based adversarial samples, a critical yet often overlooked category originates from physical failures within on-board cameras, components essential to the perception systems of autonomous vehicles. Camera failures, whether due to external stresses causing hardware breakdown or internal component faults, can directly jeopardize the safety and reliability of autonomous driving systems. First, we motivate the study using two separate real-world experiments showing that glass failures do cause detection-based neural network models to fail. Second, we develop a simulation-based study using the physical process of the glass breakage to create perturbed scenarios, representing a realistic class of physics-based adversarial samples. Using a finite element model (FEM)-based approach, we generate surface cracks on the camera image by applying a stress field defined by particles within a triangular mesh. Lastly, we use physically-based rendering (PBR) techniques to provide realistic visualizations of these physically plausible fractures. To assess the safety implications, we apply the simulated broken glass effects as image filters to two autonomous driving datasets, KITTI and BDD100K, as well as the large-scale image detection dataset MS-COCO. We then evaluate detection failure rates for critical object classes using CNN-based object detection models (YOLOv8 and Faster R-CNN) and a transformer-based architecture with Pyramid Vision Transformers. To further investigate the distributional impact of these visual distortions, we compute the Kullback-Leibler (K-L) divergence between three distinct data distributions, applying various broken glass filters to a custom dataset (captured through a cracked windshield), as well as the KITTI and Kaggle cats and dogs datasets.
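
The K-L divergence between two discrete distributions P and Q is D_KL(P || Q) = sum_i p_i * log(p_i / q_i). A minimal sketch of that computation on image histograms follows; the three-bin histograms and the `eps` smoothing constant are hypothetical, and the abstract does not say which image statistics the authors actually compare.

```python
import math


def kl_divergence(p_counts, q_counts, eps=1e-12):
    """D_KL(P || Q) in nats between two histograms given as raw counts."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must have the same number of bins")
    p_total = float(sum(p_counts))
    q_total = float(sum(q_counts))
    divergence = 0.0
    for p_c, q_c in zip(p_counts, q_counts):
        p = p_c / p_total
        q = max(q_c / q_total, eps)  # eps avoids log(0) and division by zero
        if p > 0.0:
            divergence += p * math.log(p / q)
    return divergence


# Hypothetical intensity histograms: clean images vs. images seen through a crack filter.
clean = [50, 30, 20]
cracked = [20, 30, 50]
same = kl_divergence(clean, clean)
shifted = kl_divergence(clean, cracked)
print(same, shifted)
```

Identical histograms give zero divergence, and the value grows as the filtered images drift away from the clean distribution.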

[124] Facial Expression Recognition with YOLOv11 and YOLOv12: A Comparative Study

Umma Aymon, Nur Shazwani Kamarudin, Ahmad Fakhri Ab. Nasir

Main category: cs.CV

TL;DR: YOLOv12n outperforms YOLOv11n in facial expression recognition with higher mAP scores, while YOLOv11n shows better precision in noisy conditions, demonstrating a trade-off between sensitivity and reliability.

Details

Motivation: To investigate lightweight YOLO models for facial expression recognition in real-world environments where efficiency and performance need to be balanced.

Method: Used YOLOv11n and YOLOv12n nano variants in a unified detection and classification framework, evaluating on FER2013 and KDEF datasets converted to object detection format with metrics including mAP 0.5, precision, recall, and confusion matrices.

Result: YOLOv12n achieved highest mAP 0.5 of 95.6 on KDEF and 63.8 on FER2013, while YOLOv11n showed higher precision (65.2) on FER2013. Models performed better on cleaner KDEF with clearer class separation.

Conclusion: Lightweight YOLO models effectively balance performance and efficiency, with YOLOv12n offering better sensitivity and YOLOv11n better reliability in noisy conditions, making them suitable for real-time emotion-aware AI applications.

Abstract: Facial Expression Recognition (FER) remains a challenging task, especially in unconstrained, real-world environments. This study investigates the performance of two lightweight models, YOLOv11n and YOLOv12n, which are the nano variants of the latest official YOLO series, within a unified detection and classification framework for FER. Two benchmark classification datasets, FER2013 and KDEF, are converted into object detection format and model performance is evaluated using mAP 0.5, precision, recall, and confusion matrices. Results show that YOLOv12n achieves the highest overall performance on the clean KDEF dataset with a mAP 0.5 of 95.6, and also outperforms YOLOv11n on the FER2013 dataset in terms of mAP (63.8), reflecting stronger sensitivity to varied expressions. In contrast, YOLOv11n demonstrates higher precision (65.2) on FER2013, indicating fewer false positives and better reliability in noisy, real-world conditions. On FER2013, both models show more confusion between visually similar expressions, while clearer class separation is observed on the cleaner KDEF dataset. These findings underscore the trade-off between sensitivity and precision, illustrating how lightweight YOLO models can effectively balance performance and efficiency. The results demonstrate adaptability across both controlled and real-world conditions, establishing these models as strong candidates for real-time, resource-constrained emotion-aware AI applications.

[125] Quantifying the Limits of Segmentation Foundation Models: Modeling Challenges in Segmenting Tree-Like and Low-Contrast Objects

Yixin Zhang, Nicholas Konz, Kevin Kramer, Maciej A. Mazurowski

Main category: cs.CV

TL;DR: SFMs like SAM struggle with tree-like objects and low textural contrast, failing due to misinterpretation of local structure as global texture, which fine-tuning cannot fix.

Details

Motivation: To understand why image segmentation foundation models fail on objects with dense, tree-like morphology and low textural contrast from surroundings.

Method: Introduce interpretable metrics for tree-likeness and textural separability; test on synthetic experiments and real datasets with SAM variants.

Result: SFM performance correlates with tree-likeness and textural separability; models misinterpret local structure as texture, causing over-segmentation or background confusion.

Conclusion: SFMs have fundamental limitations with certain structures; fine-tuning doesn’t help, providing a quantitative framework to model their behavior on challenging cases.

Abstract: Image segmentation foundation models (SFMs) like Segment Anything Model (SAM) have achieved impressive zero-shot and interactive segmentation across diverse domains. However, they struggle to segment objects with certain structures, particularly those with dense, tree-like morphology and low textural contrast from their surroundings. These failure modes are crucial for understanding the limitations of SFMs in real-world applications. To systematically study this issue, we introduce interpretable metrics quantifying object tree-likeness and textural separability. On carefully controlled synthetic experiments and real-world datasets, we show that SFM performance (e.g., SAM, SAM 2, HQ-SAM) noticeably correlates with these factors. We attribute these failures to SFMs misinterpreting local structure as global texture, resulting in over-segmentation or difficulty distinguishing objects from similar backgrounds. Notably, targeted fine-tuning fails to resolve this issue, indicating a fundamental limitation. Our study provides the first quantitative framework for modeling the behavior of SFMs on challenging structures, offering interpretable insights into their segmentation capabilities.

[126] Heterogeneous Complementary Distillation

Liuchi Xu, Hao Zheng, Lu Wang, Lisheng Xu, Jun Cheng

Main category: cs.CV

TL;DR: HCD is a novel heterogeneous knowledge distillation framework that integrates complementary teacher-student features through shared logits decomposition and orthogonality constraints, achieving superior performance across multiple datasets.

Details

Motivation: Traditional KD methods struggle with heterogeneous architectures due to spatial feature representation differences, while existing heterogeneous KD approaches suffer from high computational costs, complex designs, or over-reliance on logit alignment, limiting their ability to leverage complementary features.

Method: HCD processes student’s intermediate features through convolutional projector and adaptive pooling, concatenates with teacher’s penultimate layer features, maps via Complementary Feature Mapper to shared logits, then decomposes into sub-logits with Orthogonality Loss to ensure diversity and reduce redundant knowledge transfer.

Result: Extensive experiments on CIFAR-100, Fine-grained datasets (CUB200), and ImageNet-1K demonstrate that HCD outperforms state-of-the-art KD methods in heterogeneous distillation scenarios.

Conclusion: HCD effectively addresses heterogeneous KD challenges by preserving student-specific strengths while leveraging teacher knowledge, enhancing student robustness and generalization through complementary feature integration and sub-logit decomposition with orthogonality constraints.

Abstract: Knowledge distillation (KD) transfers the dark knowledge from a complex teacher to a compact student. However, heterogeneous architecture distillation, such as Vision Transformer (ViT) to ResNet18, faces challenges due to differences in spatial feature representations. Traditional KD methods are mostly designed for homogeneous architectures and hence struggle to effectively address the disparity. Although heterogeneous KD approaches have been developed recently to solve these issues, they often incur high computational costs and complex designs, or overly rely on logit alignment, which limits their ability to leverage the complementary features. To overcome these limitations, we propose Heterogeneous Complementary Distillation (HCD), a simple yet effective framework that integrates complementary teacher and student features to align representations in shared logits. These logits are decomposed and constrained to facilitate diverse knowledge transfer to the student. Specifically, HCD processes the student’s intermediate features through a convolutional projector and adaptive pooling, concatenates them with the teacher’s features from the penultimate layer, and then maps them via the Complementary Feature Mapper (CFM) module, comprising a fully connected layer, to produce shared logits. We further introduce Sub-logit Decoupled Distillation (SDD) that partitions the shared logits into n sub-logits, which are fused with the teacher’s logits to rectify classification. To ensure sub-logit diversity and reduce redundant knowledge transfer, we propose an Orthogonality Loss (OL). By preserving student-specific strengths and leveraging teacher knowledge, HCD enhances robustness and generalization in students. Extensive experiments on the CIFAR-100, fine-grained (e.g., CUB200), and ImageNet-1K datasets demonstrate that HCD outperforms state-of-the-art KD methods, establishing it as an effective solution for heterogeneous KD.
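
One way to picture the sub-logit partition and a diversity constraint is sketched below. The abstract does not give the form of the Orthogonality Loss, so this sketch assumes an equal-size split and a penalty equal to the mean squared cosine similarity between distinct sub-logit vectors; `split_sub_logits` and `orthogonality_penalty` are hypothetical names, not the authors' implementation.

```python
import math


def split_sub_logits(shared_logits, n):
    """Partition a flat list of shared logits into n equal-length sub-logits."""
    if len(shared_logits) % n != 0:
        raise ValueError("length of shared_logits must be divisible by n")
    size = len(shared_logits) // n
    return [shared_logits[i * size:(i + 1) * size] for i in range(n)]


def orthogonality_penalty(sub_logits):
    """Mean squared cosine similarity over all distinct pairs of sub-logits.

    Assumed form for illustration: 0 when all pairs are orthogonal,
    1 when all pairs are parallel.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm > 0.0 else 0.0

    count = len(sub_logits)
    pairs = [(i, j) for i in range(count) for j in range(i + 1, count)]
    if not pairs:
        return 0.0
    return sum(cosine(sub_logits[i], sub_logits[j]) ** 2 for i, j in pairs) / len(pairs)


diverse = orthogonality_penalty(split_sub_logits([1.0, 0.0, 0.0, 1.0], n=2))
redundant = orthogonality_penalty(split_sub_logits([1.0, 2.0, 2.0, 4.0], n=2))
print(diverse, redundant)
```

Minimizing a penalty of this kind pushes the sub-logits toward carrying different information, which matches the stated goal of reducing redundant knowledge transfer.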

[127] StreamDiT: Real-Time Streaming Text-to-Video Generation

Akio Kodaira, Tingbo Hou, Ji Hou, Markos Georgopoulos, Felix Juefei-Xu, Masayoshi Tomizuka, Yue Zhao

Main category: cs.CV

TL;DR: StreamDiT is a streaming video generation model that enables real-time text-to-video generation at 16 FPS on one GPU using a 4B parameter transformer-based diffusion model with flow matching and window attention.

Details

Motivation: Existing text-to-video models only produce short clips offline, limiting their use in interactive and real-time applications. The paper aims to address this by enabling streaming video generation.

Method: Proposes StreamDiT with flow matching using a moving buffer, mixed training with different partitioning schemes, adaLN DiT with varying time embedding and window attention, and a multistep distillation method that reduces function evaluations to the number of buffer chunks.

Result: The distilled model achieves real-time performance at 16 FPS on one GPU, generating 512p resolution video streams. Evaluation shows strong performance through both quantitative metrics and human evaluation.

Conclusion: StreamDiT enables real-time applications including streaming generation, interactive generation, and video-to-video tasks, overcoming limitations of offline short-clip generation models.

Abstract: Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To put the proposed method into practice, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU and can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g., streaming generation, interactive generation, and video-to-video. We provide video results and more examples on our project website: https://cumulo-autumn.github.io/StreamDiT/

[128] Divide, Conquer and Unite: Hierarchical Style-Recalibrated Prototype Alignment for Federated Medical Image Segmentation

Xingyue Zhao, Wenke Huang, Xingguang Wang, Haoyu Zhao, Linghao Zhuang, Anwen Jiang, Guancheng Wan, Mang Ye

Main category: cs.CV

TL;DR: FedBCS addresses feature heterogeneity in federated medical image segmentation by aligning domain-invariant contextual prototypes across layers, using frequency-domain style recalibration and dual-level prototype alignment.

Details

Motivation: Federated learning for medical imaging faces challenges from feature heterogeneity due to different scanners/protocols. Existing methods have incomplete contextual representation learning (focusing only on final-layer features) and layerwise style bias accumulation that reduces model robustness.

Method: Proposes FedBCS with frequency-domain adaptive style recalibration for prototype construction to decouple content-style representations, and context-aware dual-level prototype alignment that extracts domain-invariant prototypes from both encoder and decoder layers.

Result: Extensive experiments on two public datasets demonstrate remarkable performance improvements over existing methods.

Conclusion: FedBCS effectively bridges feature representation gaps in federated medical image segmentation through robust domain-invariant contextual prototype alignment.

Abstract: Federated learning enables multiple medical institutions to train a global model without sharing data, yet feature heterogeneity from diverse scanners or protocols remains a major challenge. Many existing works attempt to address this issue by leveraging model representations (e.g., mean feature vectors) to correct local training; however, they often face two key limitations: 1) Incomplete Contextual Representation Learning: Current approaches primarily focus on final-layer features, overlooking critical multi-level cues and thus diluting essential context for accurate segmentation. 2) Layerwise Style Bias Accumulation: Although utilizing representations can partially align global features, these methods neglect domain-specific biases within intermediate layers, allowing style discrepancies to build up and reduce model robustness. To address these challenges, we propose FedBCS to bridge feature representation gaps via domain-invariant contextual prototypes alignment. Specifically, we introduce a frequency-domain adaptive style recalibration into prototype construction that not only decouples content-style representations but also learns optimal style parameters, enabling more robust domain-invariant prototypes. Furthermore, we design a context-aware dual-level prototype alignment method that extracts domain-invariant prototypes from different layers of both encoder and decoder and fuses them with contextual information for finer-grained representation alignment. Extensive experiments on two public datasets demonstrate that our method exhibits remarkable performance.

[129] Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

Yifan Liu, Fangneng Zhan, Kaichen Zhou, Yilun Du, Paul Pu Liang, Hanspeter Pfister

Main category: cs.CV

TL;DR: SandboxVLM bridges the modality gap between 2D-trained VLMs and 3D tasks by using abstract bounding boxes to encode geometric structure and physical kinematics, improving spatial intelligence without additional training.

Details

Motivation: Vision-language models struggle with 3D-related tasks due to the modality gap between their 2D training and 3D requirements, limiting their effectiveness in robotics and embodied applications.

Method: A 3D Sandbox reconstruction and perception pipeline with four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning using abstract bounding boxes.

Result: Achieved 8.3% gain on SAT Real benchmark and consistent improvements across multiple benchmarks in zero-shot settings, demonstrating enhanced spatial intelligence.

Conclusion: Equipping VLMs with 3D abstraction substantially enhances 3D reasoning ability without additional training, opening new possibilities for general-purpose embodied intelligence.

Abstract: Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for instance, an 8.3% gain on SAT Real compared with baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.

[130] DEFT-LLM: Disentangled Expert Feature Tuning for Micro-Expression Recognition

Ren Zhang, Huilai Li, Chao Qi, Guoliang Xu, Tianyu Zhou, Wei Wei, Jianqin Yin

Main category: cs.CV

TL;DR: DEFT-LLM is a multimodal LLM approach for micro expression recognition that addresses motion-text semantic gaps through multi-expert disentanglement and a motion-driven instruction dataset called Uni-MER.

Details

Motivation: Two core challenges in MER: (1) entanglement of static appearance and dynamic motion prevents focus on subtle motion, (2) textual labels in existing datasets don't fully correspond to facial muscle movements, creating semantic gaps.

Method: Proposed DEFT-LLM with multi-expert disentanglement: three experts decouple facial dynamics into structure, dynamic textures, and motion-semantics. Uses Uni-MER dataset with dual constraints from optical flow and Action Unit labels for motion-text alignment.

Result: State-of-the-art performance on multiple challenging MER benchmarks, with particular advantage in interpretable modeling of local facial motion.

Conclusion: The method effectively injects physical priors for micro expressions while leveraging LLM cross-modal reasoning, enabling precise capture of subtle emotional cues through motion semantic alignment.

Abstract: Micro expression recognition (MER) is crucial for inferring genuine emotion. Applying a multimodal large language model (MLLM) to this task enables spatio-temporal analysis of facial motion and provides interpretable descriptions. However, there are still two core challenges: (1) The entanglement of static appearance and dynamic motion cues prevents the model from focusing on subtle motion; (2) Textual labels in existing MER datasets do not fully correspond to underlying facial muscle movements, creating a semantic gap between text supervision and physical motion. To address these issues, we propose DEFT-LLM, which achieves motion semantic alignment by multi-expert disentanglement. We first introduce Uni-MER, a motion-driven instruction dataset designed to align text with local facial motion. Its construction leverages dual constraints from optical flow and Action Unit (AU) labels to ensure spatio-temporal consistency and reasonable correspondence to the movements. We then design an architecture with three experts to decouple facial dynamics into independent and interpretable representations (structure, dynamic textures, and motion-semantics). By integrating the instruction-aligned knowledge from Uni-MER into DEFT-LLM, our method injects effective physical priors for micro expressions while also leveraging the cross modal reasoning ability of large language models, thus enabling precise capture of subtle emotional cues. Experiments on multiple challenging MER benchmarks demonstrate state-of-the-art performance, as well as a particular advantage in interpretable modeling of local facial motion.

[131] Language-Guided Graph Representation Learning for Video Summarization

Wenrui Li, Wei Han, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan, Yonghong Tian

Main category: cs.CV

TL;DR: Proposes LGRLN, a language-guided graph network for video summarization that converts frames into structured graphs and uses dual-threshold graph convolution to capture semantic relationships, achieving state-of-the-art performance with significant efficiency improvements.

Details

Motivation: Existing video summarization methods struggle with capturing global dependencies, accommodating multimodal user customization, and handling the mismatch between temporal and semantic proximity in video frames.

Method: Uses video graph generator to create forward, backward and undirected graphs from frames; intra-graph relational reasoning with dual-threshold graph convolution; language-guided cross-modal embedding; models summary generation as Bernoulli mixture distribution solved with EM algorithm.

Result: Outperforms existing approaches across multiple benchmarks; reduces inference time by 87.8% and model parameters by 91.7% compared to previous methods.

Conclusion: LGRLN effectively addresses limitations of existing video summarization methods by leveraging graph-based representation learning and language guidance, achieving superior performance with significantly improved efficiency.

Abstract: With the rapid growth of video content on social media, video summarization has become a crucial task in multimedia processing. However, existing methods face challenges in capturing global dependencies in video content and accommodating multimodal user customization. Moreover, temporal proximity between video frames does not always correspond to semantic proximity. To tackle these challenges, we propose a novel Language-guided Graph Representation Learning Network (LGRLN) for video summarization. Specifically, we introduce a video graph generator that converts video frames into a structured graph to preserve temporal order and contextual dependencies. By constructing forward, backward and undirected graphs, the video graph generator effectively preserves the sequentiality and contextual relationships of video content. We design an intra-graph relational reasoning module with a dual-threshold graph convolution mechanism, which distinguishes semantically relevant frames from irrelevant ones between nodes. Additionally, our proposed language-guided cross-modal embedding module generates video summaries with specific textual descriptions. We model the summary generation output as a mixture of Bernoulli distributions and solve it with the EM algorithm. Experimental results show that our method outperforms existing approaches across multiple benchmarks. Moreover, the proposed LGRLN reduces inference time and model parameters by 87.8% and 91.7%, respectively. Our codes and pre-trained models are available at https://github.com/liwrui/LGRLN.

[132] Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition

Gunho Jung, Heejo Kong, Seong-Whan Lee

Main category: cs.CV

TL;DR: TG-DFER is a text-guided weakly supervised framework for dynamic facial expression recognition that uses vision-language models and multi-grained temporal modeling to address the many-to-one labeling problem.

Details

Motivation: To solve the many-to-one labeling problem in DFER where videos with multiple frames get single emotion labels, and overcome limitations of MIL-based approaches that struggle with visual diversity and complex temporal dynamics.

Method: Incorporates vision-language pre-trained models for semantic guidance, uses visual prompts to align text labels with visual features, and employs multi-grained temporal network to capture both short-term facial dynamics and long-range emotional flow.

Result: TG-DFER achieves improved generalization, interpretability, and temporal sensitivity under weak supervision, demonstrating enhanced performance in dynamic facial expression recognition.

Conclusion: The proposed text-guided framework effectively addresses DFER challenges by leveraging semantic guidance and coherent temporal modeling, providing a robust solution for weakly supervised dynamic facial expression analysis.

Abstract: Dynamic facial expression recognition (DFER) aims to identify emotional states by modeling the temporal changes in facial movements across video sequences. A key challenge in DFER is the many-to-one labeling problem, where a video composed of numerous frames is assigned a single emotion label. A common strategy to mitigate this issue is to formulate DFER as a Multiple Instance Learning (MIL) problem. However, MIL-based approaches inherently suffer from the visual diversity of emotional expressions and the complexity of temporal dynamics. To address this challenge, we propose TG-DFER, a text-guided weakly supervised framework that enhances MIL-based DFER by incorporating semantic guidance and coherent temporal modeling. A vision-language pre-trained (VLP) model is integrated to provide semantic guidance through fine-grained textual descriptions of emotional context. Furthermore, we introduce visual prompts, which align enriched textual emotion labels with visual instance features, enabling fine-grained reasoning and frame-level relevance estimation. In addition, a multi-grained temporal network is designed to jointly capture short-term facial dynamics and long-range emotional flow, ensuring coherent affective understanding across time. Extensive results demonstrate that TG-DFER achieves improved generalization, interpretability, and temporal sensitivity under weak supervision.

[133] ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization

Anzhe Cheng, Shukai Duan, Shixuan Li, Chenzhong Yin, Mingxi Cheng, Heng Ping, Tamoghna Chattopadhyay, Sophia I Thomopoulos, Shahin Nazarian, Paul Thompson, Paul Bogdan

Main category: cs.CV

TL;DR: ERMoE is a sparse Mixture-of-Experts transformer that uses eigenbasis reparameterization and cosine similarity routing to stabilize expert utilization and improve specialization without explicit load-balancing losses.

Details

Motivation: To address misalignment between router logits and expert internal structures that causes unstable routing and expert underutilization, and to eliminate load imbalances that create straggler bottlenecks in standard MoE architectures.

Method: Reparameterizes each expert in a learned orthonormal eigenbasis and replaces learned gating logits with an Eigenbasis Score (cosine similarity between input features and expert’s basis) for content-aware routing.

Result: Achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks, produces flatter expert load distributions, and improves brain age prediction accuracy by over 7% with anatomically interpretable expert specializations.

Conclusion: ERMoE introduces a new architectural principle for sparse expert models that directly addresses routing instabilities and enables improved performance with scalable, interpretable specialization without explicit balancing losses.

Abstract: Mixture-of-Experts (MoE) architectures expand model capacity by sparsely activating experts but face two core challenges: misalignment between router logits and each expert’s internal structure leads to unstable routing and expert underutilization, and load imbalances create straggler bottlenecks. Standard solutions, such as auxiliary load-balancing losses, can reduce load disparities but often weaken expert specialization and hurt downstream performance. To address these issues, we propose ERMoE, a sparse MoE transformer that reparameterizes each expert in a learned orthonormal eigenbasis and replaces learned gating logits with an “Eigenbasis Score”, defined as the cosine similarity between input features and an expert’s basis. This content-aware routing ties token assignments directly to experts’ representation spaces, stabilizing utilization and promoting interpretable specialization without sacrificing sparsity. Crucially, ERMoE removes the need for explicit balancing losses and avoids the interfering gradients they introduce. We show that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks (e.g., COCO, Flickr30K), while naturally producing flatter expert load distributions. Moreover, a 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by more than 7% and yields anatomically interpretable expert specializations. ERMoE thus introduces a new architectural principle for sparse expert models that directly addresses routing instabilities and enables improved performance with scalable, interpretable specialization.
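
The routing rule described above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not spell out how the cosine similarity against a basis is computed, so this sketch assumes the score is the cosine between a token and its projection onto the span of an expert's orthonormal basis vectors. The names `eigenbasis_score` and `route_top_k` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def eigenbasis_score(x: Vector, basis: Sequence[Vector]) -> float:
    """Cosine between x and its projection onto span(basis).

    Assumption: the rows of `basis` are orthonormal, so the projection norm
    is the root of the summed squared coefficients.
    """
    norm_x = math.sqrt(dot(x, x))
    if norm_x == 0.0:
        return 0.0
    proj_norm = math.sqrt(sum(dot(x, u) ** 2 for u in basis))
    return proj_norm / norm_x


def route_top_k(x: Vector, expert_bases: Sequence[Sequence[Vector]], k: int = 1) -> List[int]:
    """Return the indices of the k experts with the highest score (no learned gate)."""
    scores = [eigenbasis_score(x, basis) for basis in expert_bases]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Two toy experts: one spans the x-axis, the other spans the y-z plane.
experts = [
    [[1.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
]
chosen = route_top_k([0.0, 3.0, 4.0], experts, k=1)
```

Because the score depends only on the token and the expert's own basis, the routing decision is tied to the expert's representation space rather than to a separately learned gating matrix, which is the property the abstract emphasizes.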

Haoran Chen, Houze Xu, Micah Goldblum, Daoguo Dong, Zuxuan Wu

Main category: cs.CV

TL;DR: DMC is a two-stage framework for CLIP-based class-incremental learning that decouples vision encoder adaptation and textual prompt optimization to prevent classifier bias. DMC-OT enhances it with optimal-transport calibration for memory statistics alignment.

Details

Motivation: Extending vision-language models like CLIP to continual learning settings is challenging due to classifier bias from task-specific soft prompts overfitting to recent categories when prior data is unavailable.

Method: Two-stage framework: (1) adapt vision encoder with frozen text prompts, (2) optimize textual soft prompts with frozen vision encoder. DMC-OT adds optimal-transport guided calibration for memory statistics alignment across evolving encoders and task-specific prompting for inter-task separability.

Result: Extensive experiments on CIFAR-100, Imagenet-R, CUB-200, and UCF-101 show state-of-the-art performance, with DMC-OT improving accuracy by an average of 1.80% over DMC.

Conclusion: Decoupling vision and text adaptation preserves cross-modal alignment in CLIP-based CIL, and optimal-transport calibration effectively addresses distributional drift in memory statistics.

Abstract: Class-incremental learning (CIL) enables models to continuously learn new categories from sequential tasks without forgetting previously acquired knowledge. While recent advances in vision-language models such as CLIP have demonstrated strong generalization across domains, extending them to continual settings remains challenging. In particular, learning task-specific soft prompts for newly introduced classes often leads to severe classifier bias, as the text prototypes overfit to recent categories when prior data are unavailable. In this paper, we propose DMC, a simple yet effective two-stage framework for CLIP-based CIL that decouples the adaptation of the vision encoder and the optimization of textual soft prompts. Each stage is trained with the other frozen, allowing one modality to act as a stable semantic anchor for the other to preserve cross-modal alignment. Furthermore, current CLIP-based CIL approaches typically store class-wise Gaussian statistics for generative replay, yet they overlook the distributional drift that arises when the vision encoder is updated over time. To address this issue, we introduce DMC-OT, an enhanced version of DMC that incorporates an optimal-transport guided calibration strategy to align memory statistics across evolving encoders, along with a task-specific prompting design that enhances inter-task separability. Extensive experiments on CIFAR-100, Imagenet-R, CUB-200, and UCF-101 demonstrate that both DMC and DMC-OT achieve state-of-the-art performance, with DMC-OT further improving accuracy by an average of 1.80%.

[135] PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs

Bowen Sun, Yujun Cai, Ming-Hsuan Yang, Hang Wu, Yiwei Wang

Main category: cs.CV

TL;DR: PAS (Phase Aggregated Smoothing) is a training-free method that stabilizes video LLM attention by smoothing temporal kernels through multi-phase aggregation, reducing sensitivity to small frame timing shifts.

Details

Motivation: Video LLMs suffer from temporal inconsistency where small frame timing shifts can flip attention and suppress relevant frames, traced to multimodal RoPE's inverse Fourier time kernel causing frame-scale ripples.

Method: Apply small opposed phase offsets across attention heads and aggregate their outputs, preserving per-head spectrum magnitude while smoothing the temporal kernel without changing positional encoding structure.

Result: Experiments on multiple video understanding benchmarks show consistent improvements with negligible computational overhead under matched token budgets.

Conclusion: PAS provides a plug-and-play upgrade for robust temporal encoding in Video LLMs, achieving Lipschitz stability of attention to small temporal shifts.

Abstract: Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. We trace this instability to the common extension of Rotary Position Embeddings to video through multimodal RoPE. The induced inverse Fourier time kernel exhibits frame-scale ripples that multiply adjacent frames by different factors, which perturbs attention that should otherwise be governed by the raw query key inner product. We present Phase Aggregated Smoothing (PAS), a simple, training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs. PAS preserves the per-head spectrum magnitude, while the aggregation effectively smooths the temporal kernel and reduces phase sensitivity without changing the positional encoding structure. Our analysis shows that the RoPE rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts; multi phase averaging attenuates high frequency ripples while preserving per-head spectra under Nyquist-valid sampling. Experiments on multiple video understanding benchmarks under matched token budgets show consistent improvements with negligible computational overhead. PAS provides a plug and play upgrade for robust temporal encoding in Video LLMs.
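
The smoothing effect of averaging opposed phase offsets can be checked numerically. The sketch below is a toy model, not the paper's method: it assumes the time kernel is a plain average of cosines over a hypothetical set of rotary frequencies, and that the two offsets act as small time shifts of +delta and -delta. Under those assumptions, each frequency component w keeps its shape but is scaled by cos(w * delta), so high-frequency ripples shrink more than low-frequency structure.

```python
import math
from typing import Sequence


def time_kernel(t: float, freqs: Sequence[float]) -> float:
    """Toy time kernel: an average of cosines at the given frequencies."""
    return sum(math.cos(w * t) for w in freqs) / len(freqs)


def aggregated_kernel(t: float, freqs: Sequence[float], delta: float) -> float:
    """Average of two copies of the kernel with opposed offsets +delta and -delta."""
    return 0.5 * (time_kernel(t + delta, freqs) + time_kernel(t - delta, freqs))


def attenuation(w: float, delta: float) -> float:
    """Per-frequency scale factor implied by the identity
    0.5 * (cos(w(t + d)) + cos(w(t - d))) = cos(w d) * cos(w t)."""
    return math.cos(w * delta)


freqs = [0.1, 3.0]   # hypothetical low and high frequency
delta = 0.3          # hypothetical offset
t = 1.7

lhs = aggregated_kernel(t, freqs, delta)
rhs = sum(attenuation(w, delta) * math.cos(w * t) for w in freqs) / len(freqs)
```

Each shifted copy on its own has the same spectrum magnitude as the original kernel; only the aggregate is attenuated, which matches the abstract's description at a qualitative level.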

[136] Binary Verification for Zero-Shot Vision

Jeffrey Liu, Rongbin Hu

Main category: cs.CV

TL;DR: Training-free binary verification workflow for zero-shot vision using VLMs that converts open-ended queries to multiple-choice questions and then to True/False verifications for improved accuracy.

Details

Motivation: To improve zero-shot vision performance without task-specific training by leveraging inference-time design rather than model training, providing a practical drop-in solution for existing VLMs.

Method: Two-step workflow: (1) Quantization - converts open-ended queries to multiple-choice questions with explicit candidates; (2) Binarization - asks True/False questions per candidate and resolves deterministically, reverting to MCQ if needed.

Result: Significant improvements across all evaluated tasks (referring expression grounding, spatial reasoning, BLINK-Jigsaw), with quantization providing large gains and binarization offering consistent additional boost.

Conclusion: The workflow provides a simple, unified approach that emphasizes inference-time design over training, establishing a practical path to stronger zero-shot vision with current VLMs through formal quantization and binarization theory.

Abstract: We propose a training-free, binary verification workflow for zero-shot vision with off-the-shelf VLMs. It comprises two steps: (i) quantization, which turns the open-ended query into a multiple-choice question (MCQ) with a small, explicit list of unambiguous candidates; and (ii) binarization, which asks one True/False question per candidate and resolves deterministically: if exactly one is True, select it; otherwise, revert to an MCQ over the remaining plausible candidates. We evaluate the workflow on referring expression grounding (REC), spatial reasoning (Spatial-Map, Spatial-Grid, Spatial-Maze), and BLINK-Jigsaw. Relative to answering open-ended queries directly, quantization to MCQ yields large gains, and True/False binarization provides a consistent additional boost. Across all tasks, the same workflow produces significant improvements, indicating generality. Our theory formalizes how open-ended vision queries can be quantized to MCQs and further binarized into True/False verifications, establishing a hardness ladder. A simple analysis explains why Boolean resolution boosts accuracy. Together, these components yield a simple and unified workflow that emphasizes inference-time design over task-specific training. It offers a practical, drop-in path to stronger zero-shot vision with today’s VLMs.
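
The resolution rule in step (ii) can be sketched as a small function. This is a minimal illustration under stated assumptions, not the authors' code: `ask_true_false` and `ask_mcq` are hypothetical callables standing in for VLM queries, and the case where no candidate is verified is assumed to fall back to an MCQ over the full candidate list, which the abstract does not specify.

```python
from typing import Callable, List, Sequence


def binary_verify(
    candidates: Sequence[str],
    ask_true_false: Callable[[str], bool],
    ask_mcq: Callable[[Sequence[str]], str],
) -> str:
    """Resolve a quantized query with one True/False check per candidate."""
    # Binarization: ask a separate True/False question for each candidate.
    accepted: List[str] = [c for c in candidates if ask_true_false(c)]

    # Deterministic resolution: exactly one True answer wins outright.
    if len(accepted) == 1:
        return accepted[0]

    # Otherwise revert to an MCQ over the remaining plausible candidates.
    # Assumption: if nothing was verified, every candidate stays plausible.
    remaining = accepted if accepted else list(candidates)
    return ask_mcq(remaining)


# Stub "model" for illustration: only "right" is verified as True.
answer = binary_verify(
    ["left", "right", "above"],
    ask_true_false=lambda c: c == "right",
    ask_mcq=lambda cs: cs[0],
)
```

In practice the two callables would wrap prompts to an off-the-shelf VLM; keeping them as parameters makes the resolution logic testable without a model.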

[137] Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation

Daxin Li, Yuanchao Bai, Kai Wang, Wenbo Zhao, Junjun Jiang, Xianming Liu

Main category: cs.CV

TL;DR: This paper introduces HPAC, a practical autoregressive framework for lossless image compression using hierarchical parallelism and progressive adaptation, achieving state-of-the-art performance with efficient computation.

Details

Motivation: Autoregressive models are theoretically optimal for lossless compression but considered impractical due to computational costs. This work aims to make pure autoregression practical while maintaining performance advantages.

Method: HPAC model with hierarchical factorized structure, content-aware convolutional gating, Cache-then-Select Inference for acceleration, Adaptive Focus Coding for high bit-depth, and Spatially-Aware Rate-Guided Progressive Fine-tuning for instance-level adaptation.

Result: Achieves new state-of-the-art compression performance across diverse datasets (natural, satellite, medical) with small parameter count and competitive coding speeds.

Conclusion: Carefully designed autoregressive frameworks can offer significant compression gains over existing methods while being practical in terms of computation and model size.

Abstract: Autoregressive (AR) models, the theoretical performance benchmark for learned lossless image compression, are often dismissed as impractical due to prohibitive computational cost. This work re-thinks this paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation that re-establishes pure autoregression as a top-performing and practical solution. Our approach is embodied in the Hierarchical Parallel Autoregressive ConvNet (HPAC), an ultra-lightweight pre-trained model using a hierarchical factorized structure and content-aware convolutional gating to efficiently capture spatial dependencies. We introduce two key optimizations for practicality: Cache-then-Select Inference (CSI), which accelerates coding by eliminating redundant computations, and Adaptive Focus Coding (AFC), which efficiently extends the framework to high bit-depth images. Building on this efficient foundation, our progressive adaptation strategy is realized by Spatially-Aware Rate-Guided Progressive Fine-tuning (SARP-FT). This instance-level strategy fine-tunes the model for each test image by optimizing low-rank adapters on progressively larger, spatially-continuous regions selected via estimated information density. Experiments on diverse datasets (natural, satellite, medical) validate that our method achieves new state-of-the-art compression. Notably, our approach sets a new benchmark in learned lossless compression, showing a carefully designed AR framework can offer significant gains over existing methods with a small parameter count and competitive coding speeds.

[138] CLUE: Controllable Latent space of Unprompted Embeddings for Diversity Management in Text-to-Image Synthesis

Keunwoo Park, Jihye Chae, Joong Ho Ahn, Jihoon Kweon

Main category: cs.CV

TL;DR: CLUE is a generative model framework that achieves diverse and stable image generation using fixed-format prompts without additional data, particularly effective for specialized domains like medicine with limited datasets.

Details

Motivation: Existing text-to-image methods struggle in specialized fields like medicine where data is limited and diverse, requiring solutions that can generate stable and diverse images without extensive additional data.

Method: Based on Stable Diffusion, CLUE uses a Style Encoder to generate style embeddings from images and prompts, feeding them into a new second attention layer in U-Net. KL divergence enables continuous representation of image features in Gaussian regions independent of prompts.

Result: On otitis media dataset: FID reduced from 46.81 to 9.30, recall improved from 49.60% to 70.29%. Synthetic-only training at 1000% scale achieved F1 score of 83.21% vs 73.83%. Combined synthetic+real data achieved 94.76% F1 score, outperforming real-only data.

Conclusion: CLUE enables diverse yet stable image generation from limited datasets and serves as an effective data augmentation method for domain-specific applications, demonstrating superior performance in medical imaging tasks.

Abstract: Text-to-image synthesis models require the ability to generate diverse images while maintaining stability. To overcome this challenge, a number of methods have been proposed, including the collection of prompt-image datasets and the integration of additional data modalities during training. Although these methods have shown promising results in general domains, they face limitations when applied to specialized fields such as medicine, where only limited types and insufficient amounts of data are available. We present CLUE (Controllable Latent space of Unprompted Embeddings), a generative model framework that achieves diverse generation while maintaining stability through fixed-format prompts without requiring any additional data. Based on the Stable Diffusion architecture, CLUE employs a Style Encoder that processes images and prompts to generate style embeddings, which are subsequently fed into a new second attention layer of the U-Net architecture. Through Kullback-Leibler divergence, the latent space achieves continuous representation of image features within Gaussian regions, independent of prompts. Performance was assessed on an otitis media dataset. CLUE reduced FID to 9.30 (vs. 46.81) and improved recall to 70.29% (vs. 49.60%). A classifier trained on synthetic-only data at 1000% scale achieved an F1 score of 83.21% (vs. 73.83%). Combining synthetic data with equal amounts of real data achieved an F1 score of 94.76%, higher than when using only real data. On an external dataset, synthetic-only training achieved an F1 score of 76.77% (vs. 60.61%) at 1000% scale. The combined approach achieved an F1 score of 85.78%, higher than when using only the internal dataset. These results demonstrate that CLUE enables diverse yet stable image generation from limited datasets and serves as an effective data augmentation method for domain-specific applications.
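
For readers unfamiliar with the Kullback-Leibler term mentioned above, the sketch below shows the standard closed form for a diagonal Gaussian measured against a standard normal, as used in VAE-style regularizers. Whether CLUE uses exactly this form is an assumption; the abstract only states that a KL divergence shapes the latent space. The function name and the `mu` / `log_var` parameterization are hypothetical.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Closed form per dimension: 0.5 * (mu^2 + sigma^2 - 1 - log(sigma^2)).
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


# A style embedding that already matches the prior incurs no penalty,
# while shifting its mean away from zero is penalized quadratically.
matched = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
shifted = kl_to_standard_normal([1.0], [0.0])
```

A penalty of this kind pulls encoded embeddings toward overlapping Gaussian regions, which is one common way to obtain a latent space that can be sampled continuously.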

Jiajun Chen, Sai Cheng, Yutao Yuan, Yirui Zhang, Haitao Yuan, Peng Peng, Yi Zhong

Main category: cs.CV

TL;DR: PROMISE is a novel multimodal framework that uses prompt-attention hierarchical contrastive learning to handle missing modalities and maintain cross-modal consistency.

Details

Motivation: Multimodal models degrade significantly when modalities are missing, due to inconsistent representation learning between complete and incomplete data scenarios.

Method: Incorporates multimodal prompt learning into hierarchical contrastive learning with a prompt-attention mechanism to dynamically generate robust representations for missing modalities.

Result: Extensive experiments show PROMISE outperforms state-of-the-art multimodal methods on benchmark datasets.

Conclusion: PROMISE effectively bridges the representational gap between complete and incomplete multimodal data, achieving superior performance in handling missing modalities.

Abstract: Multimodal models integrating natural language and visual information have substantially improved generalization of representation models. However, their effectiveness significantly declines in real-world situations where certain modalities are missing or unavailable. This degradation primarily stems from inconsistent representation learning between complete multimodal data and incomplete modality scenarios. Existing approaches typically address missing modalities through relatively simplistic generation methods, yet these approaches fail to adequately preserve cross-modal consistency, leading to suboptimal performance. To overcome this limitation, we propose a novel multimodal framework named PROMISE, a PROMpting-Attentive HIerarchical ContraStive LEarning approach designed explicitly for robust cross-modal representation under conditions of missing modalities. Specifically, PROMISE innovatively incorporates multimodal prompt learning into a hierarchical contrastive learning framework, equipped with a specially designed prompt-attention mechanism. This mechanism dynamically generates robust and consistent representations for scenarios where particular modalities are absent, thereby effectively bridging the representational gap between complete and incomplete data. Extensive experiments conducted on benchmark datasets, along with comprehensive ablation studies, clearly demonstrate the superior performance of PROMISE compared to current state-of-the-art multimodal methods.

[140] EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation

Zongyang Qiu, Bingyuan Wang, Xingbei Chen, Yingqing He, Zeyu Wang

Main category: cs.CV

TL;DR: EmoVid is the first multimodal emotion-annotated video dataset for creative media, enabling emotion-conditioned video generation that significantly improves quality and emotional expression.

Details

Motivation: Existing video generation systems focus on visual metrics but neglect affective dimensions, lacking dedicated resources to bridge emotion understanding with generative tasks, especially for stylized and non-realistic contexts.

Method: Created EmoVid dataset with emotion labels, visual attributes, and text captions; analyzed spatial-temporal patterns linking visual features to emotions; developed emotion-conditioned video generation by fine-tuning Wan2.1 model.

Result: Significant improvement in both quantitative metrics and visual quality for text-to-video and image-to-video tasks; established new benchmark for affective video computing.

Conclusion: EmoVid provides valuable insights into visual emotion analysis in artistically styled videos and practical methods for enhancing emotional expression in video generation.

Abstract: Emotion plays a pivotal role in video-based expression, but existing video generation systems predominantly focus on low-level visual metrics while neglecting affective dimensions. Although emotion analysis has made progress in the visual domain, the video community lacks dedicated resources to bridge emotion understanding with generative tasks, particularly for stylized and non-realistic contexts. To address this gap, we introduce EmoVid, the first multimodal, emotion-annotated video dataset specifically designed for creative media, which includes cartoon animations, movie clips, and animated stickers. Each video is annotated with emotion labels, visual attributes (brightness, colorfulness, hue), and text captions. Through systematic analysis, we uncover spatial and temporal patterns linking visual features to emotional perceptions across diverse video forms. Building on these insights, we develop an emotion-conditioned video generation technique by fine-tuning the Wan2.1 model. The results show a significant improvement in both quantitative metrics and the visual quality of generated videos for text-to-video and image-to-video tasks. EmoVid establishes a new benchmark for affective video computing. Our work not only offers valuable insights into visual emotion analysis in artistically styled videos, but also provides practical methods for enhancing emotional expression in video generation.

[141] MeCaMIL: Causality-Aware Multiple Instance Learning for Fair and Interpretable Whole Slide Image Diagnosis

Yiran Song, Yikai Zhang, Shuang Zhou, Guojun Xiong, Xiaofeng Yang, Nian Wang, Fenglong Ma, Rui Zhang, Mingquan Lin

Main category: cs.CV

TL;DR: MeCaMIL is a causality-aware multiple instance learning framework for whole slide image analysis that addresses fairness concerns by modeling demographic confounders through structured causal graphs, achieving state-of-the-art performance and significantly improved fairness across diverse populations.

Details

Motivation: Existing MIL methods for computational pathology lack causal interpretability and fail to integrate patient demographics, leading to fairness concerns and algorithmic bias that can exacerbate health disparities, hindering clinical translation.

Method: MeCaMIL employs principled causal inference using do-calculus and collider structures to explicitly model demographic confounders through structured causal graphs, disentangling disease-relevant signals from spurious demographic correlations.

Result: Achieved state-of-the-art performance on three benchmarks (CAMELYON16, TCGA-Lung, TCGA-Multi) with ACC/AUC/F1 scores up to 0.977/0.993/0.970, and superior fairness with demographic disparity variance reduced by over 65% on average. Also generalized to survival prediction with improved C-index.

Conclusion: MeCaMIL establishes a principled framework for fair, interpretable, and clinically actionable AI in digital pathology, with causal graph structure being essential for both performance and fairness improvements.

Abstract: Multiple instance learning (MIL) has emerged as the dominant paradigm for whole slide image (WSI) analysis in computational pathology, achieving strong diagnostic performance through patch-level feature aggregation. However, existing MIL methods face critical limitations: (1) they rely on attention mechanisms that lack causal interpretability, and (2) they fail to integrate patient demographics (age, gender, race), leading to fairness concerns across diverse populations. These shortcomings hinder clinical translation, where algorithmic bias can exacerbate health disparities. We introduce MeCaMIL, a causality-aware MIL framework that explicitly models demographic confounders through structured causal graphs. Unlike prior approaches treating demographics as auxiliary features, MeCaMIL employs principled causal inference – leveraging do-calculus and collider structures – to disentangle disease-relevant signals from spurious demographic correlations. Extensive evaluation on three benchmarks demonstrates state-of-the-art performance across CAMELYON16 (ACC/AUC/F1: 0.939/0.983/0.946), TCGA-Lung (0.935/0.979/0.931), and TCGA-Multi (0.977/0.993/0.970, five cancer types). Critically, MeCaMIL achieves superior fairness – demographic disparity variance drops by over 65% in relative terms on average across attributes, with notable improvements for underserved populations. The framework generalizes to survival prediction (mean C-index: 0.653, +0.017 over best baseline across five cancer types). Ablation studies confirm causal graph structure is essential – alternative designs yield 0.048 lower accuracy and 4.2x worse fairness. These results establish MeCaMIL as a principled framework for fair, interpretable, and clinically actionable AI in digital pathology. Code will be released upon acceptance.

[142] Draft and Refine with Visual Experts

Sungheon Jeong, Ryozo Masukawa, Jihong Park, Sanggeon Yun, Wenjun Huang, Hanning Chen, Mahdi Imani, Mohsen Imani

Main category: cs.CV

TL;DR: Draft and Refine (DnR) is an agent framework that uses a question-conditioned utilization metric to quantify and improve LVLMs’ visual grounding, reducing hallucinations by guiding response refinement with external visual experts.

Details

Motivation: Current LVLMs often produce ungrounded or hallucinated responses due to over-reliance on linguistic priors rather than visual evidence, highlighting the need for quantitative measurement of visual information utilization.

Method: Proposes DnR framework with question-conditioned utilization metric that localizes question-specific cues and measures visual dependence via relevance-guided probabilistic masking. Uses external visual experts to provide feedback rendered as visual cues, then re-queries model to select responses with highest utilization improvement.

Result: Experiments across VQA and captioning benchmarks show consistent accuracy gains and reduced hallucination.

Conclusion: Measuring visual utilization provides a principled path toward more interpretable and evidence-driven multimodal agent systems without requiring retraining or architectural changes.

Abstract: While recent Large Vision-Language Models (LVLMs) exhibit strong multimodal reasoning abilities, they often produce ungrounded or hallucinated responses because they rely too heavily on linguistic priors instead of visual evidence. This limitation highlights the absence of a quantitative measure of how much these models actually use visual information during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a question-conditioned utilization metric. The metric quantifies the model’s reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific cues and then measuring dependence through relevance-guided probabilistic masking. Guided by this metric, the DnR agent refines its initial draft using targeted feedback from external visual experts. Each expert’s output (such as boxes or masks) is rendered as visual cues on the image, and the model is re-queried to select the response that yields the largest improvement in utilization. This process strengthens visual grounding without retraining or architectural changes. Experiments across VQA and captioning benchmarks show consistent accuracy gains and reduced hallucination, demonstrating that measuring visual utilization provides a principled path toward more interpretable and evidence-driven multimodal agent systems.

[143] VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models

Xinlei Yu, Chengming Xu, Guibin Zhang, Zhangquan Chen, Yudong Zhang, Yongbo He, Peng-Tao Jiang, Jiangning Zhang, Xiaobin Hu, Shuicheng Yan

Main category: cs.CV

TL;DR: VisMem is a cognitive framework that adds dynamic latent vision memories to Vision-Language Models, improving performance by 11.8% on average across visual understanding, reasoning, and generation tasks.

Details

Motivation: Current VLMs suffer from a visual processing bottleneck where they lose grounding in visual evidence and lack contextualized visual experience during prolonged generation, inspired by human cognitive memory theory.

Method: Proposes VisMem framework with two memory modules: short-term for fine-grained perceptual retention and long-term for abstract semantic consolidation, seamlessly invoked during inference.

Result: Extensive experiments show VisMem delivers 11.8% average performance boost relative to vanilla models and outperforms all counterparts across diverse visual benchmarks.

Conclusion: VisMem establishes a new paradigm for latent-space memory enhancement in VLMs, enabling better perceptual fidelity and semantic consistency in visual tasks.

Abstract: Despite the remarkable success of Vision-Language Models (VLMs), their performance on a range of complex visual tasks is often hindered by a “visual processing bottleneck”: a propensity to lose grounding in visual evidence and exhibit a deficit in contextualized visual experience during prolonged generation. Drawing inspiration from human cognitive memory theory, which distinguishes short-term visually-dominant memory and long-term semantically-dominant memory, we propose VisMem, a cognitively-aligned framework that equips VLMs with dynamic latent vision memories: a short-term module for fine-grained perceptual retention and a long-term module for abstract semantic consolidation. These memories are seamlessly invoked during inference, allowing VLMs to maintain both perceptual fidelity and semantic consistency across thinking and generation. Extensive experiments across diverse visual benchmarks for understanding, reasoning, and generation reveal that VisMem delivers a significant average performance boost of 11.8% relative to the vanilla model and outperforms all counterparts, establishing a new paradigm for latent-space memory enhancement. The code will be available at https://github.com/YU-deep/VisMem.git.

[144] SP-Guard: Selective Prompt-adaptive Guidance for Safe Text-to-Image Generation

Sumin Yu, Taesup Moon

Main category: cs.CV

TL;DR: SP-Guard is a new method that improves safety in text-to-image generation by adaptively estimating prompt harmfulness and selectively guiding only unsafe image regions, outperforming existing methods while minimizing content alteration.

Details

Motivation: Current diffusion-based T2I models enable easy creation of harmful content, raising social concerns. Existing inference-time guiding methods lack adaptivity (adjusting guidance strength based on prompt) and selectivity (targeting only unsafe regions).

Method: SP-Guard estimates prompt harmfulness and applies a selective guidance mask to guide only unsafe areas of the image during generation.

Result: Experiments show SP-Guard generates safer images than existing methods while minimizing unintended content alteration.

Conclusion: Beyond improving safety, the findings highlight the importance of transparency and controllability in image generation.

Abstract: While diffusion-based T2I models have achieved remarkable image generation quality, they also enable easy creation of harmful content, raising social concerns and highlighting the need for safer generation. Existing inference-time guiding methods lack both adaptivity (adjusting guidance strength based on the prompt) and selectivity (targeting only unsafe regions of the image). Our method, SP-Guard, addresses these limitations by estimating prompt harmfulness and applying a selective guidance mask to guide only unsafe areas. Experiments show that SP-Guard generates safer images than existing methods while minimizing unintended content alteration. Beyond improving safety, our findings highlight the importance of transparency and controllability in image generation.

[145] SUPER Decoder Block for Reconstruction-Aware U-Net Variants

Siheon Joo, Hongjo Kim

Main category: cs.CV

TL;DR: SUPER (Selectively Suppressed Perfect Reconstruction) is a plug-and-play decoder block for U-Net variants that uses wavelet perfect reconstruction to prevent information loss while selectively suppressing redundant features, improving high-frequency detail recovery in inverse problems.

Details

Motivation: Skip-connected encoder-decoder architectures (U-Net variants) suffer from information loss that limits recovery of fine high-frequency details in inverse problems, creating a need for better reconstruction methods.

Method: Exploits perfect reconstruction property of wavelets to prevent information degradation while selectively suppressing redundant features. Serves as plug-and-play decoder block for U-Net variants without rigid framelet constraints.

Result: Markedly improves thin-crack segmentation for cracks narrower than 4 px, achieves moderate PSNR gains in smartphone image denoising, and demonstrates robustness across both high-frequency and low-frequency regimes while maintaining comparable computational cost.

Conclusion: SUPER validates plug-and-play generality across U-Net variants, achieving high-frequency fidelity and global coherence within a unified reconstruction-aware framework, enriching representational diversity through increased parameterization.

Abstract: Skip-connected encoder-decoder architectures (U-Net variants) are widely adopted for inverse problems but still suffer from information loss, limiting recovery of fine high-frequency details. We present Selectively Suppressed Perfect Reconstruction (SUPER), which exploits the perfect reconstruction (PR) property of wavelets to prevent information degradation while selectively suppressing (SS) redundant features. Free from rigid framelet constraints, SUPER serves as a plug-and-play decoder block for diverse U-Net variants, eliminating their intrinsic reconstruction bottlenecks and enhancing representational richness. Experiments across diverse crack benchmarks, including state-of-the-art (SOTA) models, demonstrate the structural potential of the proposed SUPER Decoder Block. Maintaining comparable computational cost, SUPER enriches representational diversity through increased parameterization. In small-scale in-domain experiments on the CrackVision12K dataset, SUPER markedly improves thin-crack segmentation performance, particularly for cracks narrower than 4 px, underscoring its advantage in high-frequency dominant settings. In smartphone image denoising on SIDD, where low-frequency components prevail, SUPER still achieves a moderate gain in PSNR, confirming its robustness across low- and high-frequency regimes. These results validate its plug-and-play generality across U-Net variants, achieving high-frequency fidelity and global coherence within a unified, reconstruction-aware framework.
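
The perfect reconstruction (PR) property that SUPER relies on can be illustrated with a one-level Haar transform, the simplest wavelet. This is a minimal sketch of the property itself, not of the SUPER decoder block; the function names and the choice of the Haar wavelet are assumptions made for illustration.

```python
import math


def haar_analysis(signal):
    """One level of the orthonormal Haar transform: returns (approx, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_synthesis(approx, detail):
    """Inverse of haar_analysis: interleaves the recovered sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 1.0, 3.0]
approx, detail = haar_analysis(signal)

# Keeping both bands recovers the input up to floating-point rounding.
reconstructed = haar_synthesis(approx, detail)
max_error = max(abs(x - y) for x, y in zip(signal, reconstructed))

# Discarding the high-frequency band replaces each pair by its average,
# which is the kind of fine-detail loss the abstract describes.
smoothed = haar_synthesis(approx, [0.0] * len(detail))
```

Here `max_error` is at the level of rounding noise, while `smoothed` has lost the within-pair differences. A decoder that keeps the transform invertible avoids this loss by construction.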

[146] AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning

Jirong Zha, Yuxuan Fan, Tianyu Zhang, Geng Chen, Yingfeng Chen, Chen Gao, Xinlei Chen

Main category: cs.CV

TL;DR: AirCopBench is the first comprehensive benchmark for evaluating Multimodal Large Language Models in embodied aerial collaborative perception under challenging conditions, featuring 14.6k+ questions across four task dimensions and revealing significant performance gaps between models and humans.

Details

Motivation: There is a lack of benchmarks for evaluating multi-agent collaborative perception in MLLMs, despite multi-drone systems offering enhanced coverage and robustness compared to single-sensor setups. Existing benchmarks focus on basic perception tasks with high-quality images, failing to assess MLLMs in complex egocentric collaborative scenarios under real-world degraded conditions.

Method: Created AirCopBench with 14.6k+ questions from simulator and real-world data across four task dimensions (Scene Understanding, Object Understanding, Perception Assessment, Collaborative Decision) and 14 task types. Used model-, rule-, and human-based methods with rigorous quality control to generate questions from challenging degraded-perception scenarios with annotated collaborative events.

Result: Evaluations on 40 MLLMs revealed significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and showing inconsistent results across different tasks. Fine-tuning experiments confirmed the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.

Conclusion: AirCopBench addresses the critical gap in multi-agent collaborative perception evaluation for MLLMs, demonstrating substantial performance challenges and highlighting the need for improved collaborative reasoning capabilities in multimodal models, while validating the potential for sim-to-real knowledge transfer.

Abstract: Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.

[147] EmbryoDiff: A Conditional Diffusion Framework with Multi-Focal Feature Fusion for Fine-Grained Embryo Developmental Stage Recognition

Yong Sun, Zhengjie Zhang, Junyu Shi, Zhiyuan Zhang, Lijiang Liu, Qiang Nie

Main category: cs.CV

TL;DR: EmbryoDiff is a two-stage diffusion-based framework that uses multi-focal feature fusion and hybrid semantic-boundary conditioning to accurately classify embryo developmental stages in IVF, achieving state-of-the-art performance.

Details

Motivation: Existing deep learning methods for embryo stage classification fail to utilize developmental distribution priors and rely on single-focal information, making them vulnerable to feature ambiguity from cell occlusions.

Method: Two-stage framework: 1) Train frame-level encoder for multi-focal features, 2) Multi-Focal Feature Fusion Strategy creates 3D-aware representations, then Hybrid Semantic-Boundary Condition Block injects complementary cues into diffusion-based denoising process.

Result: Achieves state-of-the-art results with 82.8% and 81.3% accuracy on two benchmark datasets using only a single denoising step.

Conclusion: EmbryoDiff effectively addresses limitations of previous methods by leveraging multi-focal information and diffusion-based sequence denoising, providing robust embryo stage classification even under cell occlusion conditions.

Abstract: Identification of fine-grained embryo developmental stages during In Vitro Fertilization (IVF) is crucial for assessing embryo viability. Although recent deep learning methods have achieved promising accuracy, existing discriminative models fail to utilize the distributional prior of embryonic development to improve accuracy. Moreover, their reliance on single-focal information leads to incomplete embryonic representations, making them susceptible to feature ambiguity under cell occlusions. To address these limitations, we propose EmbryoDiff, a two-stage diffusion-based framework that formulates the task as a conditional sequence denoising process. Specifically, we first train and freeze a frame-level encoder to extract robust multi-focal features. In the second stage, we introduce a Multi-Focal Feature Fusion Strategy that aggregates information across focal planes to construct a 3D-aware morphological representation, effectively alleviating ambiguities arising from cell occlusions. Building on this fused representation, we derive complementary semantic and boundary cues and design a Hybrid Semantic-Boundary Condition Block to inject them into the diffusion-based denoising process, enabling accurate embryonic stage classification. Extensive experiments on two benchmark datasets show that our method achieves state-of-the-art results. Notably, with only a single denoising step, our model obtains the best average test performance, reaching 82.8% and 81.3% accuracy on the two datasets, respectively.

[148] Algorithms Trained on Normal Chest X-rays Can Predict Health Insurance Types

Chi-Yu Chen, Rawan Abulibdeh, Arash Asgari, Leo Anthony Celi, Deirdre Goode, Hassan Hamidi, Laleh Seyyed-Kalantari, Po-Chih Kuo, Ned McCague, Thomas Sounack

Main category: cs.CV

TL;DR: Deep learning models can predict patients’ health insurance type (a proxy for socioeconomic status) from normal chest X-rays with significant accuracy, revealing that medical images encode hidden social inequalities.

Details

Motivation: To investigate whether medical AI models can detect and exploit hidden social signatures in clinical data, challenging the assumption that medical images are neutral biological data.

Method: Used state-of-the-art architectures (DenseNet121, SwinV2-B, MedMamba) trained on chest X-rays from MIMIC-CXR-JPG and CheXpert datasets, with patch-based occlusion analysis to localize the signal.

Result: Models achieved AUC around 0.67-0.68 in predicting health insurance type, with the signal persisting after controlling for age, race, and sex, and remaining detectable when trained on single racial groups.

Conclusion: Medical AI fairness requires not just dataset balancing but interrogating and disentangling the social fingerprints embedded in clinical data itself, as models learn socioeconomic segregation from medical images.

Abstract: Artificial intelligence is revealing what medicine never intended to encode. Deep vision models, trained on chest X-rays, can now detect not only disease but also invisible traces of social inequality. In this study, we show that state-of-the-art architectures (DenseNet121, SwinV2-B, MedMamba) can predict a patient’s health insurance type, a strong proxy for socioeconomic status, from normal chest X-rays with significant accuracy (AUC around 0.67 on MIMIC-CXR-JPG, 0.68 on CheXpert). The signal persists even when age, race, and sex are controlled for, and remains detectable when the model is trained exclusively on a single racial group. Patch-based occlusion reveals that the signal is diffuse rather than localized, embedded in the upper and mid-thoracic regions. This suggests that deep networks may be internalizing subtle traces of clinical environments, equipment differences, or care pathways; learning socioeconomic segregation itself. These findings challenge the assumption that medical images are neutral biological data. By uncovering how models perceive and exploit these hidden social signatures, this work reframes fairness in medical AI: the goal is no longer only to balance datasets or adjust thresholds, but to interrogate and disentangle the social fingerprints embedded in clinical data itself.
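
Patch-based occlusion, the localization technique named in the abstract, can be sketched as follows: mask one patch at a time and record how much the model's score drops. The toy `score_fn`, the patch size, and the fill value are all hypothetical stand-ins; the paper's actual models and settings are not given in this summary.

```python
def occlusion_map(image, score_fn, patch=2, fill=0.0):
    """Return a grid of score drops, one entry per occluded patch.

    image    -- 2D list of pixel intensities
    score_fn -- callable mapping an image to a scalar score (hypothetical model)
    """
    height, width = len(image), len(image[0])
    baseline = score_fn(image)
    heat = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            occluded = [r[:] for r in image]  # copy so the input is untouched
            for y in range(top, min(top + patch, height)):
                for x in range(left, min(left + patch, width)):
                    occluded[y][x] = fill
            row.append(baseline - score_fn(occluded))
        heat.append(row)
    return heat


def toy_score(image):
    """Toy stand-in for a classifier: mean intensity of the top two rows."""
    pixels = image[0] + image[1]
    return sum(pixels) / len(pixels)


image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(image, toy_score, patch=2)
```

In this toy case only the two upper patches change the score, so the map is concentrated there. A diffuse signal, as reported in the abstract, would instead spread small drops across many patches.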

[149] MPCGNet: A Multiscale Feature Extraction and Progressive Feature Aggregation Network Using Coupling Gates for Polyp Segmentation

Wei Wang, Feng Jiang, Xin Wang

Main category: cs.CV

TL;DR: MPCGNet introduces coupling gates and three specialized modules to improve polyp segmentation by addressing noise filtering, boundary ambiguity, and small polyp detection challenges.

Details

Motivation: To overcome key challenges in polyp segmentation: missed small polyps, ambiguous boundaries, and noise from uneven lighting in colonoscopy images.

Method: Proposes coupling gates for noise filtering and feature selection, with three modules: CGMFE for local feature extraction and noise suppression, WCAD decoder for detail restoration, and DFA for progressive feature aggregation and importance selection.

Result: Outperforms recent networks with mDice scores 2.20% higher on ETIS-LaribPolypDB and 0.68% higher on CVC-ColonDB datasets compared to second-best network.

Conclusion: The proposed MPCGNet with coupling gates and specialized modules effectively addresses polyp segmentation challenges and achieves state-of-the-art performance.

Abstract: Automatic polyp segmentation is crucial for assisting doctors in colorectal polyp screening and cancer diagnosis. Despite the progress made by existing methods, polyp segmentation faces several challenges: (1) small-sized polyps are prone to being missed during identification, (2) the boundaries between polyps and the surrounding environment are often ambiguous, and (3) noise in colonoscopy images, caused by uneven lighting and other factors, affects segmentation results. To address these challenges, this paper introduces coupling gates as components in specific modules to filter noise and perform feature importance selection. Three modules are proposed: the coupling gates multiscale feature extraction (CGMFE) module, which effectively extracts local features and suppresses noise; the windows cross attention (WCAD) decoder module, which restores details after capturing the precise location of polyps; and the decoder feature aggregation (DFA) module, which progressively aggregates features, further extracts them, and performs feature importance selection to reduce the loss of small-sized polyps. Experimental results demonstrate that MPCGNet outperforms recent networks, with mDice scores 2.20% and 0.68% higher than the second-best network on the ETIS-LaribPolypDB and CVC-ColonDB datasets, respectively.
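
For readers unfamiliar with the mDice metric reported above, the sketch below computes the Dice coefficient on flattened binary masks and averages it over images. Conventions for mDice vary between papers, so the per-image averaging and the handling of empty masks here are assumptions, not the authors' evaluation code.

```python
def dice(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_dice(pairs):
    """Average the per-image Dice over (prediction, ground truth) pairs."""
    scores = [dice(p, t) for p, t in pairs]
    return sum(scores) / len(scores)


pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # one extra predicted pixel: Dice = 2/3
    ([0, 1, 1, 0], [0, 1, 1, 0]),  # exact match: Dice = 1
]
score = mean_dice(pairs)
```

Because Dice normalizes by mask size, a single misclassified pixel costs far more on a small polyp than on a large one, which is why small-polyp recall weighs heavily on this metric.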

[150] CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging

Pooja Singh, Siddhant Ujjain, Tapan Kumar Gandhi, Sandeep Kumar

Main category: cs.CV

TL;DR: CrossMed is a benchmark for evaluating compositional generalization in medical multimodal LLMs using a Modality-Anatomy-Task schema, showing significant performance drops in unseen combinations despite good performance on related tasks.

Details

Motivation: To address the underexplored ability of multimodal LLMs to generalize compositionally across unseen combinations of imaging modality, anatomy, and task type in medical AI applications.

Method: Reformulated four public medical datasets into unified VQA format with 20,200 QA instances, evaluated LLaVA-Vicuna-7B and Qwen2-VL-7B on Related, Unrelated, and zero-overlap MAT splits.

Result: Models achieved 83.2% classification accuracy and 0.75 segmentation cIoU on Related splits, but performance dropped significantly under Unrelated and zero-overlap conditions. Cross-task transfer improved segmentation by 7% cIoU.

Conclusion: CrossMed provides a rigorous testbed for evaluating zero-shot, cross-task, and modality-agnostic generalization, demonstrating multimodal LLMs’ unique strength in compositional generalization compared to traditional models.

Abstract: Recent advances in multimodal large language models have enabled unified processing of visual and textual inputs, offering promising applications in general-purpose medical AI. However, their ability to generalize compositionally across unseen combinations of imaging modality, anatomy, and task type remains underexplored. We introduce CrossMed, a benchmark designed to evaluate compositional generalization (CG) in medical multimodal LLMs using a structured Modality-Anatomy-Task (MAT) schema. CrossMed reformulates four public datasets, CheXpert (X-ray classification), SIIM-ACR (X-ray segmentation), BraTS 2020 (MRI classification and segmentation), and MosMedData (CT classification), into a unified visual question answering (VQA) format, resulting in 20,200 multiple-choice QA instances. We evaluate two open-source multimodal LLMs, LLaVA-Vicuna-7B and Qwen2-VL-7B, on both Related and Unrelated MAT splits, as well as a zero-overlap setting where test triplets share no Modality, Anatomy, or Task with the training data. Models trained on Related splits achieve 83.2 percent classification accuracy and 0.75 segmentation cIoU, while performance drops significantly under Unrelated and zero-overlap conditions, demonstrating the benchmark's difficulty. We also show cross-task transfer, where segmentation performance improves by 7 percent cIoU even when trained using classification-only data. Traditional models (ResNet-50 and U-Net) show modest gains, confirming the broad utility of the MAT framework, while multimodal LLMs uniquely excel at compositional generalization. CrossMed provides a rigorous testbed for evaluating zero-shot, cross-task, and modality-agnostic generalization in medical vision-language models.
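
The zero-overlap condition described above (test triplets share no Modality, Anatomy, or Task with training) can be checked mechanically. The sketch below is one way to express it; the `MAT` tuple, the `is_zero_overlap` helper, and the anatomy labels are assumptions, since the summary does not list the benchmark's actual label vocabulary.

```python
from typing import Iterable, NamedTuple


class MAT(NamedTuple):
    """A Modality-Anatomy-Task triplet (field values here are illustrative)."""
    modality: str
    anatomy: str
    task: str


def is_zero_overlap(train: Iterable[MAT], test: Iterable[MAT]) -> bool:
    """True if no test triplet reuses any modality, anatomy, or task from train."""
    train = list(train)
    seen_modalities = {t.modality for t in train}
    seen_anatomies = {t.anatomy for t in train}
    seen_tasks = {t.task for t in train}
    return all(
        t.modality not in seen_modalities
        and t.anatomy not in seen_anatomies
        and t.task not in seen_tasks
        for t in test
    )


train = [
    MAT("X-ray", "chest", "classification"),
    MAT("CT", "chest", "classification"),
]
strict_test = [MAT("MRI", "brain", "segmentation")]    # nothing shared
leaky_test = [MAT("MRI", "brain", "classification")]   # task seen in training

strict_ok = is_zero_overlap(train, strict_test)
leaky_ok = is_zero_overlap(train, leaky_test)
```

The check is deliberately strict: sharing even one of the three fields disqualifies a test triplet, which matches the wording "share no Modality, Anatomy, or Task".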

[151] SemanticNN: Compressive and Error-Resilient Semantic Offloading for Extremely Weak Devices

Jiaming Huang, Yi Gao, Fuchang Pan, Renjie Li, Wei Dong

Main category: cs.CV

TL;DR: SemanticNN is a semantic codec for IoT device-edge collaboration that tolerates bit errors to achieve semantic correctness, reducing feature transmission by 56.82-344.83x while maintaining accuracy under varying error rates.

Details

Motivation: Enable AI on resource-constrained IoT devices through resilient device-edge collaboration, overcoming limitations of traditional bit-level error correction methods that are inefficient under dynamic channel conditions.

Method: Proposes SemanticNN with BER-aware decoder for dynamic channel adaptation, Soft Quantization encoder for compact representations, Feature-augmentation Learning for offloading efficiency, and XAI-based Asymmetry Compensation for decoder enhancement.

Result: Extensive experiments on STM32 with 3 models and 6 datasets show SemanticNN reduces feature transmission volume by 56.82-344.83x while maintaining superior inference accuracy under varying transmission error rates.

Conclusion: SemanticNN enables efficient and resilient collaborative inference offloading for IoT devices under strict computational and communication constraints through semantic-level error tolerance.

Abstract: With the rapid growth of the Internet of Things (IoT), integrating artificial intelligence (AI) on extremely weak embedded devices has garnered significant attention, enabling improved real-time performance and enhanced data privacy. However, the resource limitations of such devices and unreliable network conditions necessitate error-resilient device-edge collaboration systems. Traditional approaches focus on bit-level transmission correctness, which can be inefficient under dynamic channel conditions. In contrast, we propose SemanticNN, a semantic codec that tolerates bit-level errors in pursuit of semantic-level correctness, enabling compressive and resilient collaborative inference offloading under strict computational and communication constraints. It incorporates a Bit Error Rate (BER)-aware decoder that adapts to dynamic channel conditions and a Soft Quantization (SQ)-based encoder to learn compact representations. Building on this architecture, we introduce Feature-augmentation Learning, a novel training strategy that enhances offloading efficiency. To address encoder-decoder capability mismatches from asymmetric resources, we propose XAI-based Asymmetry Compensation to enhance decoding semantic fidelity. We conduct extensive experiments on STM32 using three models and six datasets across image classification and object detection tasks. Experimental results demonstrate that, under varying transmission error rates, SemanticNN significantly reduces feature transmission volume by 56.82-344.83x while maintaining superior inference accuracy.
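
To make the notion of a transmission error rate concrete, the sketch below simulates a channel that flips each transmitted bit independently with probability equal to the BER. This binary symmetric channel is an assumption made for illustration; the paper's actual channel model and codec are not reproduced here, and the helper names `flip_bits` and `measured_ber` are hypothetical.

```python
import random


def flip_bits(payload: bytes, ber: float, seed: int = 0) -> bytes:
    """Flip each bit of `payload` independently with probability `ber`."""
    rng = random.Random(seed)  # seeded for reproducibility
    out = bytearray(payload)
    for i in range(len(out)):
        for bit in range(8):
            if rng.random() < ber:
                out[i] ^= 1 << bit
    return bytes(out)


def measured_ber(sent: bytes, received: bytes) -> float:
    """Fraction of bits that differ between two equal-length payloads."""
    diff = sum(bin(a ^ b).count("1") for a, b in zip(sent, received))
    return diff / (8 * len(sent))


features = bytes(range(256)) * 16  # stand-in for a quantized feature payload
noisy = flip_bits(features, ber=0.01)
print(measured_ber(features, noisy))  # close to 0.01 for a payload of this size
```

A bit-level scheme would try to bring this measured rate to zero before decoding, whereas a semantic codec of the kind described above is trained so that the decoder's output stays useful despite the residual errors.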

[152] Hyperbolic Hierarchical Alignment Reasoning Network for Text-3D Retrieval

Wenrui Li, Yidan Lu, Yeyu Chai, Rui Zhao, Hengyu Man, Xiaopeng Fan

Main category: cs.CV

TL;DR: H²ARN addresses hierarchy collapse and redundancy issues in text-3D retrieval using hyperbolic embeddings and contribution-aware aggregation.

Details

Motivation: Current text-3D retrieval methods suffer from Hierarchy Representation Collapse (HRC) and Redundancy-Induced Saliency Dilution (RISD), which compress hierarchical relationships and dilute semantic cues.

Method: Proposes Hyperbolic Hierarchical Alignment Reasoning Network (H²ARN) using Lorentz-model hyperbolic space embeddings, hierarchical ordering loss, instance-level contrastive loss, and contribution-aware hyperbolic aggregation.

Result: Developed expanded T3DR-HIT v2 benchmark with 8,935 text-to-3D pairs (2.6x original size) covering cultural artefacts and indoor scenes.

Conclusion: H²ARN effectively preserves hierarchical relationships and enhances discriminative capabilities in text-3D retrieval through hyperbolic geometry and intelligent feature aggregation.

Abstract: With the daily influx of 3D data on the internet, text-3D retrieval has gained increasing attention. However, current methods face two major challenges: Hierarchy Representation Collapse (HRC) and Redundancy-Induced Saliency Dilution (RISD). HRC compresses abstract-to-specific and whole-to-part hierarchies in Euclidean embeddings, while RISD averages noisy fragments, obscuring critical semantic cues and diminishing the model’s ability to distinguish hard negatives. To address these challenges, we introduce the Hyperbolic Hierarchical Alignment Reasoning Network (H²ARN) for text-3D retrieval. H²ARN embeds both text and 3D data in a Lorentz-model hyperbolic space, where exponential volume growth inherently preserves hierarchical distances. A hierarchical ordering loss constructs a shrinking entailment cone around each text vector, ensuring that the matched 3D instance falls within the cone, while an instance-level contrastive loss jointly enforces separation from non-matching samples. To tackle RISD, we propose a contribution-aware hyperbolic aggregation module that leverages Lorentzian distance to assess the relevance of each local feature and applies contribution-weighted aggregation guided by hyperbolic geometry, enhancing discriminative regions while suppressing redundancy without additional supervision. We also release the expanded T3DR-HIT v2 benchmark, which contains 8,935 text-to-3D pairs, 2.6 times the original size, covering both fine-grained cultural artefacts and complex indoor scenes. Our code is available at https://github.com/liwrui/H2ARN.
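
The Lorentzian distance mentioned above has a closed form that is easy to sketch. The example assumes unit negative curvature (-1) and lifts Euclidean vectors onto the hyperboloid by solving for the time-like coordinate; the helper names are hypothetical, and the paper's curvature and embedding map may differ.

```python
import math
from typing import List, Sequence


def lift(v: Sequence[float]) -> List[float]:
    """Place a Euclidean vector on the hyperboloid <x, x>_L = -1 by setting x0 = sqrt(1 + |v|^2)."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0] + list(v)


def lorentz_inner(x: Sequence[float], y: Sequence[float]) -> float:
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Geodesic distance arccosh(-<x, y>_L), clamped so rounding cannot push the argument below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = lift([0.0, 0.0])
point = lift([math.sinh(1.5), 0.0])     # lies at hyperbolic distance 1.5 from the origin
print(lorentz_distance(origin, point))  # approximately 1.5
```

Because the spatial coordinate grows like sinh(r) while the distance grows like r, points far from the origin have exponentially more room around them, which is the property the abstract relies on to keep hierarchy levels separated.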

[153] PINGS-X: Physics-Informed Normalized Gaussian Splatting with Axes Alignment for Efficient Super-Resolution of 4D Flow MRI

Sun Jo, Seok Young Hong, JinHyun Kim, Seungmin Kang, Ahjin Choi, Don-Gwan An, Simon Song, Je Hyeong Hong

Main category: cs.CV

TL;DR: PINGS-X is a novel framework using axes-aligned spatiotemporal Gaussian representations for 4D flow MRI super-resolution, achieving faster training times and superior accuracy compared to physics-informed neural networks.

Details

Motivation: 4D flow MRI requires high spatiotemporal resolution for early detection of cardiovascular conditions, but achieving this resolution leads to prolonged scan times. Existing PINN methods are impractical due to slow per-patient training.

Method: Extends 3D Gaussian splatting in three ways: normalized Gaussian splatting with a formal convergence guarantee, axes-aligned Gaussians that simplify training on high-dimensional data, and a Gaussian merging procedure that prevents degenerate solutions and improves computational efficiency.

Result: Experimental results on CFD and real 4D flow MRI datasets show PINGS-X substantially reduces training time while achieving superior super-resolution accuracy.

Conclusion: PINGS-X provides an efficient and accurate solution for 4D flow MRI super-resolution, overcoming limitations of previous PINN approaches with faster training and better performance.

Abstract: 4D flow magnetic resonance imaging (MRI) is a reliable, non-invasive approach for estimating blood flow velocities, vital for cardiovascular diagnostics. Unlike conventional MRI focused on anatomical structures, 4D flow MRI requires high spatiotemporal resolution for early detection of critical conditions such as stenosis or aneurysms. However, achieving such resolution typically results in prolonged scan times, creating a trade-off between acquisition speed and prediction accuracy. Recent studies have leveraged physics-informed neural networks (PINNs) for super-resolution of MRI data, but their practical applicability is limited as the prohibitively slow training process must be performed for each patient. To overcome this limitation, we propose PINGS-X, a novel framework modeling high-resolution flow velocities using axes-aligned spatiotemporal Gaussian representations. Inspired by the effectiveness of 3D Gaussian splatting (3DGS) in novel view synthesis, PINGS-X extends this concept through several non-trivial novel innovations: (i) normalized Gaussian splatting with a formal convergence guarantee, (ii) axes-aligned Gaussians that simplify training for high-dimensional data while preserving accuracy and the convergence guarantee, and (iii) a Gaussian merging procedure to prevent degenerate solutions and boost computational efficiency. Experimental results on computational fluid dynamics (CFD) and real 4D flow MRI datasets demonstrate that PINGS-X substantially reduces training time while achieving superior super-resolution accuracy. Our code and datasets are available at https://github.com/SpatialAILab/PINGS-X.

[154] NP-LoRA: Null Space Projection Unifies Subject and Style in LoRA Fusion

Chuheng Chen, Xiaofei Zhou, Geyuan Zhang, Yong Huang

Main category: cs.CV

TL;DR: NP-LoRA is a projection-based framework that prevents structural interference in LoRA fusion by enforcing subspace separation between principal directions, improving subject fidelity and style consistency.

Details

Motivation: Existing LoRA fusion methods suffer from interference where one LoRA dominates another due to overlapping low-rank subspaces, leading to degraded generation quality.

Method: Extract principal style directions via SVD, project subject LoRA into orthogonal null space, and use soft projection for trade-off control between subject fidelity and style consistency.

Result: NP-LoRA consistently improves fusion quality over baselines across multiple metrics (DINO, CLIP, human/LLM preference) and works broadly across different backbones and LoRA pairs without retraining.

Conclusion: The proposed NP-LoRA framework effectively prevents structural interference in LoRA fusion through subspace separation, enabling better control over subject-style composition without costly retraining.

Abstract: Low-Rank Adaptation (LoRA) fusion has emerged as a key technique for reusing and composing learned subject and style representations for controllable generation without costly retraining. However, existing methods rely on weight-based merging, where one LoRA often dominates the other, leading to interference and degraded fidelity. This interference is structural: separately trained LoRAs occupy low-rank high-dimensional subspaces, leading to non-orthogonal and overlapping representations. In this work, we analyze the internal structure of LoRAs and find their generative behavior is dominated by a few principal directions in the low-rank subspace, which should remain free from interference during fusion. To achieve this, we propose Null Space Projection LoRA (NP-LoRA), a projection-based framework for LoRA fusion that enforces subspace separation to prevent structural interference among principal directions. Specifically, we first extract principal style directions via singular value decomposition (SVD) and then project the subject LoRA into its orthogonal null space. Furthermore, we introduce a soft projection mechanism that enables smooth control over the trade-off between subject fidelity and style consistency. Experiments show NP-LoRA consistently improves fusion quality over strong baselines (e.g., DINO and CLIP-based metrics, with human and LLM preference scores), and applies broadly across backbones and LoRA pairs without retraining.
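
The projection step described above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it treats each LoRA as a dense weight update, takes the top-k right singular vectors of the style update as the "principal style directions", and models the soft projection as a simple scalar blend `mu` between no projection and full projection. The function name and the choice of projecting on the input side are hypothetical.

```python
import numpy as np


def null_space_project(delta_subject: np.ndarray,
                       delta_style: np.ndarray,
                       k: int,
                       mu: float = 1.0) -> np.ndarray:
    """Remove from the subject update its component along the top-k style directions.

    mu = 1.0 gives a hard projection into the null space of those directions;
    mu = 0.0 leaves the subject update unchanged (assumed form of the soft projection).
    """
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(delta_style, full_matrices=False)
    v_k = vt[:k].T                 # (d_in, k) orthonormal basis of the principal style directions
    projector = v_k @ v_k.T        # orthogonal projector onto that subspace
    return delta_subject - mu * (delta_subject @ projector)


rng = np.random.default_rng(0)
style = rng.standard_normal((6, 8))    # toy style update
subject = rng.standard_normal((6, 8))  # toy subject update
hard = null_space_project(subject, style, k=2)

_, _, vt = np.linalg.svd(style, full_matrices=False)
print(np.allclose(hard @ vt[:2].T, 0.0, atol=1e-10))  # True: no response along the style directions
fused = style + hard                                   # one possible fusion: add the projected update
```

After the hard projection the subject update no longer acts on inputs lying in the principal style subspace, so adding it to the style update cannot disturb those directions; intermediate values of `mu` trade that guarantee for subject fidelity.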

[155] CareCom: Generative Image Composition with Calibrated Reference Features

Jiaxuan Chen, Bo Zhang, Qingdong He, Jinlong Peng, Li Niu

Main category: cs.CV

TL;DR: Proposes a multi-reference generative image composition method with feature calibration to improve detail preservation and foreground pose/view adjustment.

Details

Motivation: Existing generative image composition methods struggle with simultaneous detail preservation and foreground pose/view adjustment when inserting foreground objects into backgrounds.

Method: Extends generative composition model to multi-reference version and calibrates global/local features of foreground reference images to make them compatible with background information.

Result: Extensive experiments on MVImgNet and MureCom datasets show generative models greatly benefit from calibrated reference features.

Conclusion: The proposed multi-reference approach with feature calibration effectively improves image composition quality by providing better pose/view adjustment while preserving details.

Abstract: Image composition aims to seamlessly insert a foreground object into a background. Despite huge progress in generative image composition, existing methods still struggle with simultaneous detail preservation and foreground pose/view adjustment. To address this issue, we extend the existing generative composition model to a multi-reference version, which allows using an arbitrary number of foreground reference images. Furthermore, we propose to calibrate the global and local features of foreground reference images to make them compatible with the background information. The calibrated reference features can supplement the original reference features with useful global and local information of proper pose/view. Extensive experiments on MVImgNet and MureCom demonstrate that the generative model can greatly benefit from the calibrated reference features.

[156] LiteAttention: A Temporal Sparse Attention for Diffusion Transformers

Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb

Main category: cs.CV

TL;DR: LiteAttention exploits temporal coherence in diffusion attention patterns to enable evolutionary computation skips, eliminating redundant attention computations in video diffusion models without quality degradation.

Details

Motivation: Diffusion Transformers for video generation suffer from quadratic attention complexity causing prohibitive latency, while existing acceleration methods face trade-offs between dynamic estimation overhead and static pattern suboptimality.

Method: Leverages temporal coherence of attention sparsity patterns across denoising steps, marking non-essential tiles early and propagating skip decisions forward to eliminate redundant computations without repeated profiling.

Result: Substantial speedups on production video diffusion models with no degradation in quality, implemented as optimized kernel on top of FlashAttention.

Conclusion: LiteAttention combines adaptivity of dynamic methods with efficiency of static ones by exploiting temporal coherence in diffusion attention patterns.

Abstract: Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically estimating sparse attention patterns at each denoising step incurs high computational overhead and estimation errors, while static sparsity patterns remain fixed and often suboptimal throughout denoising. We identify a key structural property of diffusion attention, namely, its sparsity patterns exhibit strong temporal coherence across denoising steps. Tiles deemed non-essential at step $t$ typically remain so at step $t+\delta$. Leveraging this observation, we introduce LiteAttention, a method that exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. By marking non-essential tiles early and propagating skip decisions forward, LiteAttention eliminates redundant attention computations without repeated profiling overheads, combining the adaptivity of dynamic methods with the efficiency of static ones. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models, with no degradation in quality. The code and implementation details will be publicly released.
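
The idea of marking tiles once and propagating the skip decision forward can be sketched in plain Python. This is a toy model under assumptions: the per-tile importance scores, the fixed threshold, and the function name `propagate_skips` are hypothetical, and the real method is a GPU kernel built on FlashAttention rather than a Python loop.

```python
from typing import List, Sequence


def propagate_skips(scores_per_step: Sequence[Sequence[float]],
                    threshold: float) -> List[List[int]]:
    """Return, for each denoising step, the tile indices that are actually computed.

    A tile whose score falls below `threshold` at some step is skipped at every later step,
    so it is never re-profiled.
    """
    skipped = set()
    computed_per_step = []
    for scores in scores_per_step:
        computed = [i for i in range(len(scores)) if i not in skipped]
        computed_per_step.append(computed)
        # Only tiles computed at this step can be newly marked as non-essential.
        skipped.update(i for i in computed if scores[i] < threshold)
    return computed_per_step


scores = [
    [0.90, 0.01, 0.50],  # step 0: tile 1 is non-essential
    [0.80, 0.70, 0.02],  # step 1: tile 2 becomes non-essential
    [0.90, 0.90, 0.90],  # step 2: only tile 0 is still computed
]
print(propagate_skips(scores, threshold=0.05))  # [[0, 1, 2], [0, 2], [0]]
```

Note that tile 1 stays skipped at step 1 even though its hypothetical score there is high; the sketch, like the temporal-coherence assumption it illustrates, never revisits a tile once it has been marked.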

[157] From Retinal Pixels to Patients: Evolution of Deep Learning Research in Diabetic Retinopathy Screening

Muskaan Chopra, Lorenz Sparrenberg, Armin Berger, Sarthak Khanna, Jan H. Terheyden, Rafet Sifa

Main category: cs.CV

TL;DR: This survey systematically reviews deep learning advances in Diabetic Retinopathy (DR) screening from 2016-2025, analyzing 50+ studies and 20+ datasets to address key challenges like class imbalance, label scarcity, and domain shift.

Details

Motivation: DR is a leading cause of preventable blindness, and early detection is critical. Deep learning has transformed DR screening over the past decade, but there's a need to systematically synthesize methodological advances and address translational barriers for clinical deployment.

Method: Systematic synthesis of DR research spanning 2016-2025, examining methodological advances including self- and semi-supervised learning, domain generalization, federated training, hybrid neuro-symbolic models, evaluation protocols, and reporting standards.

Result: Consolidated results from 50+ studies and over 20 datasets with benchmark tables contextualizing performance across datasets. Identified key innovations in addressing class imbalance, label scarcity, domain shift, and interpretability challenges.

Conclusion: The survey outlines a practical agenda for reproducible, privacy-preserving, and clinically deployable DR AI, highlighting that many innovations extend broadly to medical imaging at scale. Identifies open gaps in multi-center validation and clinical trust.

Abstract: Diabetic Retinopathy (DR) remains a leading cause of preventable blindness, with early detection critical for reducing vision loss worldwide. Over the past decade, deep learning has transformed DR screening, progressing from early convolutional neural networks trained on private datasets to advanced pipelines addressing class imbalance, label scarcity, domain shift, and interpretability. This survey provides the first systematic synthesis of DR research spanning 2016-2025, consolidating results from 50+ studies and over 20 datasets. We critically examine methodological advances, including self- and semi-supervised learning, domain generalization, federated training, and hybrid neuro-symbolic models, alongside evaluation protocols, reporting standards, and reproducibility challenges. Benchmark tables contextualize performance across datasets, while discussion highlights open gaps in multi-center validation and clinical trust. By linking technical progress with translational barriers, this work outlines a practical agenda for reproducible, privacy-preserving, and clinically deployable DR AI. Beyond DR, many of the surveyed innovations extend broadly to medical imaging at scale.

[158] S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation

Jiechao Gao, Chang Liu, Yuangang Li

Main category: cs.CV

TL;DR: S2D-Align is a novel SFT paradigm for radiology report generation that establishes anatomically-grounded alignment using auxiliary signals of varying granularities through a shallow-to-deep strategy.

Details

Motivation: Standard SFT paradigm fails to establish anatomically-grounded alignment in radiology report generation due to templated nature of reports, leading to sub-optimal generation quality.

Method: Proposes S2D-Align with shallow-to-deep strategy: starts with coarse radiograph-report pairing, adds reference reports for instance-level guidance, then uses key phrases for anatomical grounding. Uses memory-based adapter for feature sharing between alignment stages.

Result: Achieves state-of-the-art performance on MIMIC-CXR and IU X-Ray benchmarks. Ablation studies validate effectiveness of multi-stage, auxiliary-guided approach.

Conclusion: S2D-Align represents a promising direction for enhancing grounding capabilities in complex, multi-modal generation tasks.

Abstract: Radiology Report Generation (RRG) aims to automatically generate diagnostic reports from radiology images. To achieve this, existing methods have leveraged the powerful cross-modal generation capabilities of Multimodal Large Language Models (MLLMs), primarily focusing on optimizing cross-modal alignment between radiographs and reports through Supervised Fine-Tuning (SFT). However, by only performing instance-level alignment with the image-text pairs, the standard SFT paradigm fails to establish anatomically-grounded alignment, where the templated nature of reports often leads to sub-optimal generation quality. To address this, we propose S2D-Align, a novel SFT paradigm that establishes anatomically-grounded alignment by leveraging auxiliary signals of varying granularities. S2D-Align implements a shallow-to-deep strategy, progressively enriching the alignment process: it begins with the coarse radiograph-report pairing, then introduces reference reports for instance-level guidance, and ultimately utilizes key phrases to ground the generation in specific anatomical details. To bridge the different alignment stages, we introduce a memory-based adapter that empowers feature sharing, thereby integrating coarse and fine-grained guidance. For evaluation, we conduct experiments on the public MIMIC-CXR and IU X-Ray benchmarks, where S2D-Align achieves state-of-the-art performance compared to existing methods. Ablation studies validate the effectiveness of our multi-stage, auxiliary-guided approach, highlighting a promising direction for enhancing grounding capabilities in complex, multi-modal generation tasks.

[159] Evaluating Latent Generative Paradigms for High-Fidelity 3D Shape Completion from a Single Depth Image

Matthias Humt, Ulrich Hillenbrand, Rudolph Triebel

Main category: cs.CV

TL;DR: Comparison of diffusion models and autoregressive transformers for 3D shape generation and completion, showing diffusion models excel with continuous latents while autoregressive models match performance in discrete latent spaces.

Details

Motivation: There's no consensus on which generative models work best for 3D data tasks, and conditional information like partial 3D data hasn't been thoroughly evaluated compared to text/image conditioning.

Method: Adapted Denoising Diffusion Probabilistic Models and Autoregressive Causal Transformers for generative shape modeling and completion, with thorough quantitative evaluation including baseline discriminative model and extensive ablation study.

Result: Diffusion model with continuous latents outperformed both discriminative model and autoregressive approach, achieving state-of-the-art performance on multi-modal shape completion from single noisy depth images. Autoregressive model matched or exceeded diffusion performance when compared in the same discrete latent space.

Conclusion: Both diffusion and autoregressive models show strong performance for 3D shape tasks, with diffusion models excelling with continuous representations and autoregressive models being competitive in discrete latent spaces.

Abstract: While generative models have seen significant adoption across a wide range of data modalities, including 3D data, a consensus on which model is best suited for which task has yet to be reached. Further, conditioning signals such as text and images are frequently employed to steer the generation process, whereas others, like partial 3D data, have not been thoroughly evaluated. In this work, we compare two of the most promising generative models, Denoising Diffusion Probabilistic Models and Autoregressive Causal Transformers, which we adapt for the tasks of generative shape modeling and completion. We conduct a thorough quantitative evaluation and comparison on both tasks, including a baseline discriminative model and an extensive ablation study. Our results show that (1) the diffusion model with continuous latents outperforms both the discriminative model and the autoregressive approach and delivers state-of-the-art performance on multi-modal shape completion from a single, noisy depth image under realistic conditions and (2) when compared on the same discrete latent space, the autoregressive model can match or exceed diffusion performance on these tasks.

[160] Phys-Liquid: A Physics-Informed Dataset for Estimating 3D Geometry and Volume of Transparent Deformable Liquids

Ke Ma, Yizhou Fang, Jean-Baptiste Weibel, Shuai Tan, Xinggang Wang, Yang Xiao, Yi Fang, Tian Xia

Main category: cs.CV

TL;DR: Phys-Liquid is a physics-informed dataset with 97,200 simulation images and 3D meshes for transparent liquid perception, enabling improved geometric and volumetric reconstruction of deformable liquids under dynamic conditions.

Details

Motivation: Current datasets lack comprehensive physics-informed simulation data for realistic liquid behaviors under dynamic scenarios, which is essential for autonomous robots performing precise liquid manipulation tasks like dispensing, aspiration, and mixing.

Method: Created a physics-informed dataset with simulation images and 3D meshes across multiple laboratory scenes, lighting conditions, liquid colors, and container rotations. Proposed a four-stage pipeline: liquid segmentation, multi-view mask generation, 3D mesh reconstruction, and real-world scaling.

Result: Experimental results demonstrate improved accuracy and consistency in reconstructing liquid geometry and volume, outperforming existing benchmarks.

Conclusion: Phys-Liquid facilitates future advancements in transparent liquid perception tasks and provides a valuable resource for robotics applications involving liquid manipulation.

Abstract: Estimating the geometric and volumetric properties of transparent deformable liquids is challenging due to optical complexities and dynamic surface deformations induced by container movements. Autonomous robots performing precise liquid manipulation tasks, such as dispensing, aspiration, and mixing, must handle containers in ways that inevitably induce these deformations, complicating accurate liquid state assessment. Current datasets lack comprehensive physics-informed simulation data representing realistic liquid behaviors under diverse dynamic scenarios. To bridge this gap, we introduce Phys-Liquid, a physics-informed dataset comprising 97,200 simulation images and corresponding 3D meshes, capturing liquid dynamics across multiple laboratory scenes, lighting conditions, liquid colors, and container rotations. To validate the realism and effectiveness of Phys-Liquid, we propose a four-stage reconstruction and estimation pipeline involving liquid segmentation, multi-view mask generation, 3D mesh reconstruction, and real-world scaling. Experimental results demonstrate improved accuracy and consistency in reconstructing liquid geometry and volume, outperforming existing benchmarks. The dataset and associated validation methods facilitate future advancements in transparent liquid perception tasks. The dataset and code are available at https://dualtransparency.github.io/Phys-Liquid/.
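
The last two pipeline stages, mesh reconstruction followed by real-world scaling, suggest a simple volume computation. The sketch below is not the authors' code: it assumes a closed, consistently oriented triangle mesh, computes its volume from signed tetrahedra, and rescales by the cube of a hypothetical length scale factor.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Face = Tuple[int, int, int]


def mesh_volume(vertices: Sequence[Vec3], faces: Sequence[Face]) -> float:
    """Volume of a closed triangle mesh as the sum of signed tetrahedra (origin, a, b, c)."""
    total = 0.0
    for i, j, k in faces:
        (ax, ay, az), (bx, by, bz), (cx, cy, cz) = vertices[i], vertices[j], vertices[k]
        # Scalar triple product a . (b x c) equals six times the signed tetrahedron volume.
        total += (ax * (by * cz - bz * cy)
                  + ay * (bz * cx - bx * cz)
                  + az * (bx * cy - by * cx))
    return abs(total) / 6.0


def scale_volume(volume: float, length_scale: float) -> float:
    """Volume scales with the cube of a uniform length scale factor."""
    return volume * length_scale ** 3


# Unit right tetrahedron with outward-facing triangles; its volume is 1/6.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
tris = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]
vol = mesh_volume(verts, tris)
print(vol, scale_volume(vol, 2.0))
```

Because volume grows with the cube of the scale factor, a small error in the estimated real-world scale is amplified roughly threefold in relative terms in the volume estimate.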

[161] A Space-Time Transformer for Precipitation Forecasting

Levi Harris, Tianlong Chen

Main category: cs.CV

TL;DR: SaTformer is a video transformer that uses full space-time attention to forecast extreme precipitation from satellite radiances, achieving state-of-the-art performance on precipitation nowcasting.

Details

Motivation: Traditional numerical weather prediction models are computationally expensive and perform poorly at nowcasting timescales (0-4 hours), while existing AI-weather prediction approaches haven't fully explored video-understanding architectures for weather forecasting.

Method: Proposed SaTformer: a video transformer with full space-time attention that reformulates precipitation regression as a classification problem and uses class-weighted loss to handle imbalanced precipitation datasets.

Result: The model achieved first place on the NeurIPS Weather4Cast 2025 Cumulative Rainfall challenge, demonstrating superior performance in extreme precipitation forecasting.

Conclusion: Video transformers with full space-time attention and proper handling of imbalanced datasets can effectively address precipitation nowcasting challenges, outperforming traditional methods.

Abstract: Meteorological agencies around the world rely on real-time flood guidance to issue life-saving advisories and warnings. For decades traditional numerical weather prediction (NWP) models have been state-of-the-art for precipitation forecasting. However, physically-parameterized models suffer from a few core limitations: first, solving PDEs to resolve atmospheric dynamics is computationally demanding, and second, these methods degrade in performance at nowcasting timescales (i.e., 0-4 hour lead-times). Motivated by these shortcomings, recent work proposes AI-weather prediction (AI-WP) alternatives that learn to emulate analysis data with neural networks. While these data-driven approaches have enjoyed enormous success across diverse spatial and temporal resolutions, applications of video-understanding architectures for weather forecasting remain underexplored. To address these gaps, we propose SaTformer: a video transformer built on full space-time attention that skillfully forecasts extreme precipitation from satellite radiances. Along with our novel architecture, we introduce techniques to tame long-tailed precipitation datasets. Namely, we reformulate precipitation regression into a classification problem, and employ a class-weighted loss to address label imbalances. Our model scored first place on the NeurIPS Weather4Cast 2025 Cumulative Rainfall challenge. Code and model weights are available: https://github.com/leharris3/satformer
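
The reformulation of regression as classification with a class-weighted loss can be sketched as follows. The bin edges and the inverse-frequency weighting formula are assumptions chosen for illustration; the abstract does not state which bins or weights the authors use, and the helper names are hypothetical.

```python
import bisect
from collections import Counter
from typing import List, Sequence


def to_class(rain_mm: float, edges: Sequence[float]) -> int:
    """Index of the bin containing `rain_mm`; `edges` are ascending boundaries between bins."""
    return bisect.bisect_right(edges, rain_mm)


def inverse_frequency_weights(labels: Sequence[int], num_classes: int) -> List[float]:
    """Weight each class by n / (num_classes * count), so rare classes count more in the loss."""
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts[c]) if counts[c] else 0.0
            for c in range(num_classes)]


edges = [0.1, 1.0, 10.0]                              # hypothetical boundaries in mm
rain = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 5.0, 20.0]      # long-tailed toy targets
labels = [to_class(r, edges) for r in rain]
print(labels)                                          # [0, 0, 0, 0, 1, 1, 2, 3]
print(inverse_frequency_weights(labels, num_classes=4))  # [0.5, 1.0, 2.0, 2.0]
```

The resulting weights would typically be passed to a weighted cross-entropy loss, so that the rare heavy-rain classes contribute as much to training as the dominant no-rain class.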

[162] Machine-Learning Based Detection of Coronary Artery Calcification Using Synthetic Chest X-Rays

Dylan Saeed, Ramtin Gharleghi, Susann Bier, Sonit Singh

Main category: cs.CV

TL;DR: DRRs (digitally reconstructed radiographs) from CT scans serve as scalable training data for coronary artery calcification detection, achieving competitive performance with lightweight CNNs and curriculum learning.

Details

Motivation: CT-based Agatston scoring is the gold standard for CAC detection but is costly and impractical for large-scale screening, while chest X-rays lack reliable ground truth labels, limiting deep learning development.

Method: Generate synthetic DRRs from 667 CT scans, evaluate model capacity, super-resolution fidelity enhancement, preprocessing, and training strategies including curriculum learning with lightweight CNNs trained from scratch.

Result: Best configuration achieves mean AUC of 0.754, comparable to or exceeding prior CXR-based studies, with lightweight CNNs outperforming large pretrained networks.

Conclusion: DRRs establish a scalable, label-rich foundation for CAC detection and lay the groundwork for future transfer learning and domain adaptation to real CXRs.

Abstract: Coronary artery calcification (CAC) is a strong predictor of cardiovascular events, with CT-based Agatston scoring widely regarded as the clinical gold standard. However, CT is costly and impractical for large-scale screening, while chest X-rays (CXRs) are inexpensive but lack reliable ground truth labels, constraining deep learning development. Digitally reconstructed radiographs (DRRs) offer a scalable alternative by projecting CT volumes into CXR-like images while inheriting precise labels. In this work, we provide the first systematic evaluation of DRRs as a surrogate training domain for CAC detection. Using 667 CT scans from the COCA dataset, we generate synthetic DRRs and assess model capacity, super-resolution fidelity enhancement, preprocessing, and training strategies. Lightweight CNNs trained from scratch outperform large pretrained networks; pairing super-resolution with contrast enhancement yields significant gains; and curriculum learning stabilises training under weak supervision. Our best configuration achieves a mean AUC of 0.754, comparable to or exceeding prior CXR-based studies. These results establish DRRs as a scalable, label-rich foundation for CAC detection, while laying the foundation for future transfer learning and domain adaptation to real CXRs.

[163] Detection of Bark Beetle Attacks using Hyperspectral PRISMA Data and Few-Shot Learning

Mattia Ferrari, Giancarlo Papitto, Giorgio Deligios, Lorenzo Bruzzone

Main category: cs.CV

TL;DR: Few-shot learning with contrastive learning for bark beetle detection using PRISMA hyperspectral data, outperforming traditional methods.

Details

Motivation: Bark beetle infestations threaten coniferous forest health, requiring effective monitoring methods.

Method: Contrastive learning pre-trains 1D CNN encoder for robust feature extraction from hyperspectral data, followed by support vector regression estimators for each class using few labeled samples.

Result: Method outperforms original PRISMA spectral bands and Sentinel-2 data in Dolomites study area.

Conclusion: PRISMA hyperspectral data combined with few-shot learning provides significant advantages for forest health monitoring.

Abstract: Bark beetle infestations represent a serious challenge for maintaining the health of coniferous forests. This paper proposes a few-shot learning approach leveraging contrastive learning to detect bark beetle infestations using satellite PRISMA hyperspectral data. The methodology is based on a contrastive learning framework to pre-train a one-dimensional CNN encoder, enabling the extraction of robust feature representations from hyperspectral data. These extracted features are subsequently utilized as input to support vector regression estimators, one for each class, trained on few labeled samples to estimate the proportions of healthy, attacked by bark beetle, and dead trees for each pixel. Experiments on the area of study in the Dolomites show that our method outperforms the use of original PRISMA spectral bands and of Sentinel-2 data. The results indicate that PRISMA hyperspectral data combined with few-shot learning offers significant advantages for forest health monitoring.

[164] VIDEOP2R: Video Understanding from Perception to Reasoning

Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan

Main category: cs.CV

TL;DR: VideoP2R is a process-aware reinforcement fine-tuning framework for video language models that separates perception and reasoning processes, achieving state-of-the-art performance on video reasoning benchmarks.

Details

Motivation: Extending reinforcement fine-tuning (RFT) to large video language models (LVLMs) remains challenging, and existing approaches don't adequately model the distinct processes of perception and reasoning in video understanding.

Method: Two-stage framework: SFT stage creates VideoP2R-CoT-162K dataset with process-aware chain-of-thought annotations; RL stage uses novel PA-GRPO algorithm with separate rewards for perception and reasoning.

Result: Achieves state-of-the-art performance on 6 out of 7 video reasoning and understanding benchmarks, with ablation studies confirming effectiveness of process-aware modeling and PA-GRPO.

Conclusion: Process-aware modeling of perception and reasoning as distinct processes significantly enhances video reasoning capabilities, and perception output contains sufficient information for downstream reasoning tasks.

Abstract: Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that the model’s perception output is information-sufficient for downstream reasoning.
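
The idea of separate rewards for perception and reasoning can be sketched with GRPO-style group-relative advantages computed once per reward stream. This is an illustration under assumptions, not the paper's PA-GRPO: the abstract does not say how rewards are scored or how advantages are assigned to tokens, and the function names and reward values below are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def process_aware_advantages(perception_rewards, reasoning_rewards):
    """Compute one advantage list per process.

    Assumption: each sampled response has a perception segment and a reasoning
    segment, and each segment is credited only with its own advantage.
    """
    return (
        group_relative_advantages(perception_rewards),
        group_relative_advantages(reasoning_rewards),
    )

# Hypothetical rewards for four sampled responses to the same video question.
perception = [1.0, 0.0, 1.0, 0.0]
reasoning = [1.0, 1.0, 0.0, 0.0]
adv_p, adv_r = process_aware_advantages(perception, reasoning)
```

In this toy group, the second response gets a negative perception advantage but a positive reasoning advantage, so the two processes receive different learning signals instead of one blended score.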

[165] Toward Generalized Detection of Synthetic Media: Limitations, Challenges, and the Path to Multimodal Solutions

Redwan Hussain, Mizanur Rahman, Prithwiraj Bhattacharjee

Main category: cs.CV

TL;DR: This paper reviews 24 recent works on AI-generated media detection, identifies common limitations and challenges, and suggests multimodal deep learning models as a promising research direction for more robust detection.

Details

Motivation: The rapid advancement of AI in media, particularly GANs and diffusion models, has made it difficult to distinguish real from synthetic content, leading to misuse through deepfakes for misinformation, privacy violations, and fraud.

Method: The study conducts a comprehensive review of 24 recent works on AI-generated media detection, examining each individually to identify contributions and weaknesses, then summarizing common limitations and key challenges.

Result: Current detection approaches using CNNs and ViTs often fail to generalize across unseen data, struggle with content from different models, and are ineffective with multimodal data and highly modified content.

Conclusion: Multimodal deep learning models are suggested as a promising research direction to provide more robust and generalized detection, offering future researchers a clear starting point for building stronger defenses against harmful synthetic media.

Abstract: Artificial intelligence (AI) in media has advanced rapidly over the last decade. The introduction of Generative Adversarial Networks (GANs) improved the quality of photorealistic image generation. Diffusion models later brought a new era of generative media. These advances made it difficult to separate real and synthetic content. The rise of deepfakes demonstrated how these tools could be misused to spread misinformation, political conspiracies, privacy violations, and fraud. For this reason, many detection models have been developed. They often use deep learning methods such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). These models search for visual, spatial, or temporal anomalies. However, such approaches often fail to generalize across unseen data and struggle with content from different models. In addition, existing approaches are ineffective on multimodal data and highly modified content. This study reviews twenty-four recent works on AI-generated media detection. Each study was examined individually to identify its contributions and weaknesses. The review then summarizes the common limitations and key challenges faced by current approaches. Based on this analysis, a research direction is suggested with a focus on multimodal deep learning models. Such models have the potential to provide more robust and generalized detection. This review offers future researchers a clear starting point for building stronger defenses against harmful synthetic media.

[166] Stroke Modeling Enables Vectorized Character Generation with Large Vectorized Glyph Model

Xinyue Zhang, Haolong Li, Jiawei Ma, Chen Ye

Main category: cs.CV

TL;DR: LVGM is a novel model that generates vectorized Chinese glyphs by predicting next strokes using fine-tuned LLMs, with a new large-scale SVG dataset.

Details

Motivation: Vectorized glyphs are widely used in design and animation due to scalability, and LLMs' sequence prediction abilities can be leveraged for stroke-based character generation.

Method: Encode strokes into discrete latent variables (stroke embeddings), fine-tune DeepSeek LLM to predict next stroke embeddings, enabling generation from limited strokes.

Result: The model generates complete characters, elegant words, and unseen verses in vectorized form, exhibits scaling behavior with respect to data scale, and produces outputs validated by experts.

Conclusion: LVGM successfully bridges LLMs with vectorized glyph generation through stroke prediction, demonstrating practical applications in Chinese typography with scalable performance.

Abstract: Vectorized glyphs are widely used in poster design, network animation, art display, and various other fields due to their scalability and flexibility. In typography, they are often seen as special sequences composed of ordered strokes. This concept extends to the token sequence prediction abilities of large language models (LLMs), enabling vectorized character generation through stroke modeling. In this paper, we propose a novel Large Vectorized Glyph Model (LVGM) designed to generate vectorized Chinese glyphs by predicting the next stroke. Initially, we encode strokes into discrete latent variables called stroke embeddings. Subsequently, we train our LVGM by fine-tuning DeepSeek LLM to predict the next stroke embedding. With limited strokes given, it can generate complete characters, semantically elegant words, and even unseen verses in vectorized form. Moreover, we release a new large-scale Chinese SVG dataset containing 907,267 samples based on strokes for dynamically vectorized glyph generation. Experimental results show that our model exhibits scaling behavior with respect to data scale. Our generated vectorized glyphs have been validated by experts and relevant individuals.

[167] Hindsight Distillation Reasoning with Knowledge Encouragement Preference for Knowledge-based Visual Question Answering

Yu Zhao, Ying Zhang, Xuhui Sui, Baohang Zhou, Li Shen, Dacheng Tao

Main category: cs.CV

TL;DR: HinD framework with KEPO elicits internal knowledge reasoning in MLLMs through hindsight distillation and preference optimization, achieving superior KBVQA performance without external APIs or knowledge.

Details

Motivation: Existing KBVQA methods lack explicit multi-step reasoning trajectories from MLLMs, either using implicit knowledge via in-context learning or explicit knowledge via retrieval, leaving reasoning processes implicit.

Method: 1) Hindsight distillation: Use frozen 7B MLLM to generate reasoning between questions and ground truth answers (Hindsight-Zero), then distill into CoT Generator and Knowledge Generator. 2) KEPO: Optimize Knowledge Generator to prefer under-confident but helpful knowledge over over-confident but unhelpful knowledge.

Result: Experiments on OK-VQA and A-OKVQA show HinD achieves superior performance using only 7B-size MLLM, without commercial model APIs or outside knowledge.

Conclusion: HinD framework successfully elicits and harnesses internal knowledge reasoning ability in MLLMs through hindsight distillation and knowledge encouragement optimization, providing explicit multi-step reasoning trajectories for KBVQA.

Abstract: Knowledge-based Visual Question Answering (KBVQA) necessitates external knowledge incorporation beyond cross-modal understanding. Existing KBVQA methods either utilize implicit knowledge in multimodal large language models (MLLMs) via in-context learning or explicit knowledge via retrieval augmented generation. However, their reasoning processes remain implicit, without explicit multi-step trajectories from MLLMs. To address this gap, we provide a Hindsight Distilled Reasoning (HinD) framework with Knowledge Encouragement Preference Optimization (KEPO), designed to elicit and harness internal knowledge reasoning ability in MLLMs. First, to tackle the reasoning supervision problem, we propose to emphasize the hindsight wisdom of MLLM by prompting a frozen 7B-size MLLM to complete the reasoning process between the question and its ground truth answer, constructing Hindsight-Zero training data. Then we self-distill Hindsight-Zero into Chain-of-Thought (CoT) Generator and Knowledge Generator, enabling the generation of sequential steps and discrete facts. Secondly, to tackle the misalignment between knowledge correctness and confidence, we optimize the Knowledge Generator with KEPO, preferring under-confident but helpful knowledge over the over-confident but unhelpful one. The generated CoT and sampled knowledge are then exploited for answer prediction. Experiments on OK-VQA and A-OKVQA validate the effectiveness of HinD, showing that HinD with elicited reasoning from 7B-size MLLM achieves superior performance without commercial model APIs or outside knowledge.
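
The preference described for KEPO (under-confident but helpful knowledge over over-confident but unhelpful knowledge) can be illustrated by how training pairs might be assembled. The sketch below rests on assumptions: the `Knowledge` fields, the confidence threshold, and the pairing rule are hypothetical, and the abstract does not specify how confidence or helpfulness is measured.

```python
from dataclasses import dataclass

@dataclass
class Knowledge:
    text: str
    confidence: float  # hypothetical generator confidence in [0, 1]
    helpful: bool      # hypothetical flag: did it lead to the ground-truth answer?

def build_preference_pairs(candidates, threshold=0.5):
    """Pair each low-confidence helpful item (chosen) with each high-confidence unhelpful item (rejected)."""
    chosen = [k for k in candidates if k.helpful and k.confidence < threshold]
    rejected = [k for k in candidates if not k.helpful and k.confidence >= threshold]
    return [(c, r) for c in chosen for r in rejected]

# Made-up sampled knowledge for one question.
samples = [
    Knowledge("fact A", confidence=0.3, helpful=True),
    Knowledge("fact B", confidence=0.9, helpful=False),
    Knowledge("fact C", confidence=0.8, helpful=True),
]
pairs = build_preference_pairs(samples)
```

Pairs built this way could then feed any pairwise preference objective; which objective KEPO actually uses is not stated in the abstract.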

[168] OT-ALD: Aligning Latent Distributions with Optimal Transport for Accelerated Image-to-Image Translation

Zhanpeng Wang, Shuting Cao, Yuhang Lu, Yuhan Li, Na Lei, Zhongxuan Luo

Main category: cs.CV

TL;DR: OT-ALD is an image-to-image translation framework that uses optimal transport theory to improve DDIB by eliminating latent distribution mismatches, achieving 20.29% faster translation and 2.6 lower FID scores.

Details

Motivation: To address DDIB's low translation efficiency and trajectory deviations caused by mismatched latent distributions between source and target domains.

Method: Computes an optimal transport map from source to target latent distributions, using the mapped distribution as starting point for reverse diffusion in target domain.

Result: 20.29% improvement in sampling efficiency and 2.6 average reduction in FID score across four translation tasks on three high-resolution datasets.

Conclusion: OT-ALD effectively balances faster translation with improved quality by eliminating latent distribution mismatches through optimal transport theory.

Abstract: The Dual Diffusion Implicit Bridge (DDIB) is an emerging image-to-image (I2I) translation method that preserves cycle consistency while achieving strong flexibility. It links two independently trained diffusion models (DMs) in the source and target domains by first adding noise to a source image to obtain a latent code, then denoising it in the target domain to generate the translated image. However, this method faces two key challenges: (1) low translation efficiency, and (2) translation trajectory deviations caused by mismatched latent distributions. To address these issues, we propose a novel I2I translation framework, OT-ALD, grounded in optimal transport (OT) theory, which retains the strengths of DDIB-based approach. Specifically, we compute an OT map from the latent distribution of the source domain to that of the target domain, and use the mapped distribution as the starting point for the reverse diffusion process in the target domain. Our error analysis confirms that OT-ALD eliminates latent distribution mismatches. Moreover, OT-ALD effectively balances faster image translation with improved image quality. Experiments on four translation tasks across three high-resolution datasets show that OT-ALD improves sampling efficiency by 20.29% and reduces the FID score by 2.6 on average compared to the top-performing baseline models.
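
For intuition about the OT map between latent distributions, the one-dimensional case has a closed form: with equal sample counts and a squared-distance cost, the optimal plan pairs the i-th smallest source sample with the i-th smallest target sample. The sketch below shows only this toy case. Latent codes in OT-ALD are high-dimensional and the paper's solver is not described in the abstract, so the function name and data are assumptions.

```python
def ot_map_1d(source, target):
    """Map each source sample to a target sample by matching sorted ranks.

    Assumption: equal-size 1D empirical distributions and squared-distance cost,
    for which rank matching is the optimal transport plan.
    """
    if len(source) != len(target):
        raise ValueError("this sketch assumes equal sample counts")
    order = sorted(range(len(source)), key=lambda i: source[i])
    sorted_target = sorted(target)
    mapped = [0.0] * len(source)
    for rank, idx in enumerate(order):
        mapped[idx] = sorted_target[rank]
    return mapped

# Made-up scalar "latents" from the source and target domains.
source_latents = [0.2, -1.0, 0.7, 0.1]
target_latents = [2.0, 3.5, 2.5, 1.0]
mapped = ot_map_1d(source_latents, target_latents)
```

After mapping, the transported samples follow the target distribution exactly, which is the sense in which reverse diffusion in the target domain would start from matched rather than mismatched latents.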

[169] Reverberation: Learning the Latencies Before Forecasting Trajectories

Conghao Wong, Ziqian Zou, Beihao Xia, Xinge You

Main category: cs.CV

TL;DR: The paper proposes a Reverberation (Rev) trajectory prediction model that explicitly learns and predicts temporal latencies in agent responses to trajectory-changing events, using reverberation kernels inspired by acoustics.

Details

Motivation: Current trajectory prediction methods struggle to explicitly model temporal latencies - the delays with which agents respond to events. This lack of latency consideration undermines causal continuity and leads to implausible trajectories.

Method: Proposes a reverberation transform and Rev model using two explicit, learnable reverberation kernels to simulate different latency preferences of each agent and their stochasticity, enabling controllable trajectory prediction based on forecasted latencies.

Result: Experiments on multiple datasets show Rev achieves competitive accuracy while revealing interpretable latency dynamics across agents and scenarios. Qualitative analyses verify the properties of the reverberation transform.

Conclusion: The reverberation transform shows potential as a general latency modeling approach for trajectory prediction, providing interpretable latency dynamics and controllable prediction capabilities.

Abstract: Bridging the past to the future, connecting agents both spatially and temporally, lies at the core of the trajectory prediction task. Despite great efforts, it remains challenging to explicitly learn and predict latencies, the temporal delays with which agents respond to different trajectory-changing events and adjust their future paths, whether on their own or interactively. Different agents may exhibit distinct latency preferences for noticing, processing, and reacting to any specific trajectory-changing event. The lack of consideration of such latencies may undermine the causal continuity of the forecasting system and also lead to implausible or unintended trajectories. Inspired by the reverberation curves in acoustics, we propose a new reverberation transform and the corresponding Reverberation (short for Rev) trajectory prediction model, which simulates and predicts different latency preferences of each agent as well as their stochasticity by using two explicit and learnable reverberation kernels, allowing for the controllable trajectory prediction based on these forecasted latencies. Experiments on multiple datasets, whether pedestrians or vehicles, demonstrate that Rev achieves competitive accuracy while revealing interpretable latency dynamics across agents and scenarios. Qualitative analyses further verify the properties of the proposed reverberation transform, highlighting its potential as a general latency modeling approach.

[170] Refine and Align: Confidence Calibration through Multi-Agent Interaction in VQA

Ayush Pandey, Jai Bardhan, Ishita Jain, Ramya S Hebbalaguppe, Rohan Raju Dhanakshirur, Lovekesh Vig

Main category: cs.CV

TL;DR: AlignVQA is a debate-based multi-agent framework that improves confidence calibration in Visual Question Answering systems by having specialized VLMs generate answers and engage in two-stage interactions, with a novel calibration-aware loss function to fine-tune agents.

Details

Motivation: Modern VQA systems are increasingly used in high-stakes domains but often produce overconfident responses, with their confidence estimates' reliability remaining under-examined despite improved accuracy.

Method: A debate-based multi-agent framework where diverse specialized VLMs with distinct prompting strategies generate candidate answers, followed by two-stage interactions where generalist agents critique, refine and aggregate proposals, plus a novel differentiable calibration-aware loss function (aligncal) to fine-tune specialized agents.

Result: Empirical results across multiple benchmark VQA datasets show substantial reductions in calibration discrepancies, with more calibrated specialized agents producing better aligned confidences.

Conclusion: The debate process yields confidence estimates that more accurately reflect the model’s true predictive performance, and the calibration-aware loss function explicitly improves the fidelity of each agent’s confidence estimates.

Abstract: In the context of Visual Question Answering (VQA) and Agentic AI, calibration refers to how closely an AI system’s confidence in its answers reflects their actual correctness. This aspect becomes especially important when such systems operate autonomously and must make decisions under visual uncertainty. While modern VQA systems, powered by advanced vision-language models (VLMs), are increasingly used in high-stakes domains like medical diagnostics and autonomous navigation due to their improved accuracy, the reliability of their confidence estimates remains under-examined. In particular, these systems often produce overconfident responses. To address this, we introduce AlignVQA, a debate-based multi-agent framework in which diverse specialized VLMs, each following a distinct prompting strategy, generate candidate answers and then engage in a two-stage interaction in which generalist agents critique, refine and aggregate these proposals. This debate process yields confidence estimates that more accurately reflect the model’s true predictive performance. We find that more calibrated specialized agents produce better aligned confidences. Furthermore, we introduce a novel differentiable calibration-aware loss function called aligncal, designed to fine-tune the specialized agents by minimizing an upper bound on the calibration error. This objective explicitly improves the fidelity of each agent’s confidence estimates. Empirical results across multiple benchmark VQA datasets substantiate the efficacy of our approach, demonstrating substantial reductions in calibration discrepancies.
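
Calibration as defined above is commonly quantified with the Expected Calibration Error (ECE), which bins predictions by confidence and averages the gap between accuracy and mean confidence per bin. The sketch below computes that standard metric. It is not the aligncal loss, and whether the paper reports ECE specifically is an assumption, since the abstract only mentions "calibration error".

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """ECE: weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Made-up overconfident answers: 90% confidence but only half are correct.
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [1, 1, 0, 0]
ece = expected_calibration_error(confidences, correct)
```

Here the model claims 90% confidence while being right half the time, so the ECE is about 0.4. A well-calibrated system would drive this gap toward zero.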

[171] Explainable Deep Convolutional Multi-Type Anomaly Detection

Alex George, Lyudmila Mihaylova, Sean Anderson

Main category: cs.CV

TL;DR: MultiTypeFCDD is a lightweight convolutional framework for explainable multi-type anomaly detection that uses image-level labels to generate multi-channel heatmaps for different anomaly types, eliminating the need for separate models per object category.

Details

Motivation: Existing anomaly detection methods lack specificity in identifying anomaly types and require costly separate models for each object category, while large VLMs are computationally intensive and impractical for real-time systems.

Method: A simple convolutional framework that uses only image-level labels to learn multi-channel heatmaps, where each channel corresponds to a specific anomaly type, functioning as a unified framework across multiple object categories.

Result: Competitive performance with state-of-the-art models on Real-IAD dataset at significantly reduced parametric load and inference times.

Conclusion: MultiTypeFCDD provides a practical and viable solution for real-world applications with constrained computational resources, offering explainable multi-type anomaly detection in a lightweight framework.

Abstract: Most explainable anomaly detection methods identify anomalies but lack the capability to differentiate the type of anomaly. Furthermore, they often require the costly training and maintenance of separate models for each object category. The lack of specificity is a significant research gap, as identifying the type of anomaly (e.g., “Crack” vs. “Scratch”) is crucial for accurate diagnosis that facilitates cost-saving operational decisions across diverse application domains. While some recent large-scale Vision-Language Models (VLMs) have begun to address this, they are computationally intensive and memory-heavy, restricting their use in real-time or embedded systems. We propose MultiTypeFCDD, a simple and lightweight convolutional framework designed as a practical alternative for explainable multi-type anomaly detection. MultiTypeFCDD uses only image-level labels to learn and produce multi-channel heatmaps, where each channel is trained to correspond to a specific anomaly type. The model functions as a single, unified framework capable of differentiating anomaly types across multiple object categories, eliminating the need to train and manage separate models for each object category. Evaluated on the Real-IAD dataset, our method delivers results competitive with state-of-the-art complex models at significantly reduced parametric load and inference times. This makes it a highly practical and viable solution for real-world applications where computational resources are tightly constrained.

[172] CATS-V2V: A Real-World Vehicle-to-Vehicle Cooperative Perception Dataset with Complex Adverse Traffic Scenarios

Hangyu Li, Bofeng Cao, Zhaohui Liang, Wuzhen Li, Juyoung Oh, Yuxuan Chen, Shixiao Liang, Hang Zhou, Chengyuan Ma, Jiaxi Liu, Zheng Li, Peng Zhang, KeKe Long, Maolin Liu, Jackson Jiang, Chunlei Yu, Shengxiang Liu, Hongkai Yu, Xiaopeng Li

Main category: cs.CV

TL;DR: CATS-V2V is the first real-world dataset for V2V cooperative perception in complex adverse traffic scenarios, featuring 100 clips with 60K LiDAR frames, 1.26M camera images, and 750K sensor records across diverse weather/location conditions.

Details

Motivation: Existing datasets focus on ordinary traffic scenarios, limiting cooperative perception benefits. Complex adverse scenarios require specialized data collection to overcome perception limitations in autonomous driving.

Method: Collected using two time-synchronized vehicles across 10 weather/lighting conditions and 10 locations. Provides LiDAR point clouds, multi-view camera images, RTK-fixed GNSS/IMU records, and 3D bounding box annotations with temporal alignment.

Result: Created the largest-scale, most supportive, and highest-quality V2V cooperative perception dataset to date, enabling precise object alignment across all sensor modalities in complex adverse conditions.

Conclusion: CATS-V2V addresses the data gap for cooperative perception in challenging scenarios and will benefit the autonomous driving community by providing comprehensive infrastructure for related research tasks.

Abstract: Vehicle-to-Vehicle (V2V) cooperative perception has great potential to enhance autonomous driving performance by overcoming perception limitations in complex adverse traffic scenarios (CATS). Meanwhile, data serves as the fundamental infrastructure for modern autonomous driving AI. However, due to stringent data collection requirements, existing datasets focus primarily on ordinary traffic scenarios, constraining the benefits of cooperative perception. To address this challenge, we introduce CATS-V2V, the first-of-its-kind real-world dataset for V2V cooperative perception under complex adverse traffic scenarios. The dataset was collected by two hardware time-synchronized vehicles, covering 10 weather and lighting conditions across 10 diverse locations. The 100-clip dataset includes 60K frames of 10 Hz LiDAR point clouds and 1.26M multi-view 30 Hz camera images, along with 750K anonymized yet high-precision RTK-fixed GNSS and IMU records. Correspondingly, we provide time-consistent 3D bounding box annotations for objects, as well as static scenes to construct a 4D BEV representation. On this basis, we propose a target-based temporal alignment method, ensuring that all objects are precisely aligned across all sensor modalities. We hope that CATS-V2V, the largest-scale, most supportive, and highest-quality dataset of its kind to date, will benefit the autonomous driving community in related tasks.

[173] 3D Gaussian and Diffusion-Based Gaze Redirection

Abiram Panchalingam, Indu Bodala, Stuart Middleton

Main category: cs.CV

TL;DR: DiT-Gaze enhances 3D gaze redirection using Diffusion Transformers, weak supervision with synthetic intermediate gaze angles, and orthogonality constraints, achieving state-of-the-art results with 6.353° gaze error.

Details

Motivation: High-fidelity gaze redirection is needed to generate augmented data for improving gaze estimator generalization, as current 3D Gaussian Splatting models struggle with rendering subtle, continuous gaze shifts.

Method: Combines Diffusion Transformer (DiT) for higher-fidelity synthesis, weak supervision using synthetically generated intermediate gaze angles for smooth gaze manifolds, and orthogonality constraint loss to disentangle gaze, head pose, and expression representations.

Result: Sets new state-of-the-art in perceptual quality and redirection accuracy, reducing gaze error by 4.1% to 6.353 degrees, providing superior synthetic training data generation.

Conclusion: DiT-Gaze offers a superior method for creating synthetic training data and will be made available to the research community for benchmarking.

Abstract: High-fidelity gaze redirection is critical for generating augmented data to improve the generalization of gaze estimators. 3D Gaussian Splatting (3DGS) models like GazeGaussian represent the state-of-the-art but can struggle with rendering subtle, continuous gaze shifts. In this paper, we propose DiT-Gaze, a framework that enhances 3D gaze redirection models using a novel combination of Diffusion Transformer (DiT), weak supervision across gaze angles, and an orthogonality constraint loss. DiT allows higher-fidelity image synthesis, while our weak supervision strategy using synthetically generated intermediate gaze angles provides a smooth manifold of gaze directions during training. The orthogonality constraint loss mathematically enforces the disentanglement of internal representations for gaze, head pose, and expression. Comprehensive experiments show that DiT-Gaze sets a new state-of-the-art in both perceptual quality and redirection accuracy, reducing the state-of-the-art gaze error by 4.1% to 6.353 degrees, providing a superior method for creating synthetic training data. Our code and models will be made available for the research community to benchmark against.
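
An orthogonality constraint of the kind described can be sketched as a penalty on the pairwise cosine similarity between the gaze, head pose, and expression embeddings. The exact loss used by DiT-Gaze is not given in the abstract, so the squared-cosine form, the function names, and the toy vectors below are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonality_loss(gaze, head_pose, expression):
    """Sum of squared cosine similarities over the three embedding pairs.

    The loss is 0 when the vectors are mutually orthogonal and grows as any
    two representations align with each other.
    """
    vectors = [gaze, head_pose, expression]
    loss = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            u, v = vectors[i], vectors[j]
            cosine = dot(u, v) / ((dot(u, u) ** 0.5) * (dot(v, v) ** 0.5))
            loss += cosine ** 2
    return loss

# Toy 3D embeddings: fully disentangled vs. gaze collapsed onto head pose.
disentangled = orthogonality_loss([1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0])
entangled = orthogonality_loss([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 3.0])
```

Using cosine similarity rather than raw dot products keeps the penalty independent of embedding norm, which is one plausible design choice among several.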

[174] Dynamic Gaussian Scene Reconstruction from Unsynchronized Videos

Zhixin Xu, Hengyu Zhou, Yuan Liu, Wenhan Xue, Hao Pan, Wenping Wang, Bin Wang

Main category: cs.CV

TL;DR: A temporal alignment method for 4D Gaussian Splatting that handles unsynchronized multi-view videos through coarse-to-fine time shift estimation.

Details

Motivation: Real-world multi-view videos often have temporal misalignment due to camera trigger delays or independent recording, which degrades 4DGS reconstruction quality.

Method: Coarse-to-fine alignment module that first estimates frame-level offsets then refines to sub-frame accuracy, integrated as a plug-in module to existing 4DGS frameworks.

Result: Effectively processes temporally misaligned videos and significantly enhances baseline reconstruction methods.

Conclusion: The proposed temporal alignment strategy improves 4DGS robustness for unsynchronized multi-view video reconstruction.

Abstract: Multi-view video reconstruction plays a vital role in computer vision, enabling applications in film production, virtual reality, and motion analysis. While recent advances such as 4D Gaussian Splatting (4DGS) have demonstrated impressive capabilities in dynamic scene reconstruction, they typically rely on the assumption that input video streams are temporally synchronized. However, in real-world scenarios, this assumption often fails due to factors like camera trigger delays or independent recording setups, leading to temporal misalignment across views and reduced reconstruction quality. To address this challenge, a novel temporal alignment strategy is proposed for high-quality 4DGS reconstruction from unsynchronized multi-view videos. Our method features a coarse-to-fine alignment module that estimates and compensates for each camera’s time shift. The method first determines a coarse, frame-level offset and then refines it to achieve sub-frame accuracy. This strategy can be integrated as a plug-in module into existing 4DGS frameworks, enhancing their robustness when handling asynchronous data. Experiments show that our approach effectively processes temporally misaligned videos and significantly enhances baseline methods.
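
The coarse-to-fine pattern (frame-level offset first, sub-frame refinement second) can be illustrated with a classic signal-processing analogue: an exhaustive cross-correlation search over integer lags, followed by parabolic interpolation around the peak. This is not the paper's alignment module, whose details the abstract does not give. The per-frame signals (imagine a motion-magnitude curve per camera) and all names are assumptions.

```python
def correlation_at(ref, other, lag):
    """Sum of ref[t] * other[t + lag] over the frames where both are defined."""
    return sum(
        ref[t] * other[t + lag]
        for t in range(len(ref))
        if 0 <= t + lag < len(other)
    )

def estimate_time_shift(ref, other, max_lag):
    """Return (coarse_lag, fine_lag) of `other` relative to `ref`, in frames."""
    lags = list(range(-max_lag, max_lag + 1))
    scores = [correlation_at(ref, other, lag) for lag in lags]
    best = max(range(len(lags)), key=lambda i: scores[i])
    coarse = lags[best]
    if best == 0 or best == len(lags) - 1:
        return coarse, float(coarse)  # no neighbours to interpolate with
    left, mid, right = scores[best - 1], scores[best], scores[best + 1]
    denom = left - 2.0 * mid + right
    # Vertex of the parabola through the three scores around the peak.
    fine = float(coarse) if denom == 0 else coarse + 0.5 * (left - right) / denom
    return coarse, fine

# Toy signals: `other` is `ref` delayed by 3.5 frames (linearly resampled).
ref = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
other = [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 2.0, 2.0, 0.5, 0.0, 0.0, 0.0]
coarse, fine = estimate_time_shift(ref, other, max_lag=5)
```

On this toy input the integer search settles on a 3-frame lag and the refinement recovers the 3.5-frame delay, showing why a sub-frame step matters when cameras are triggered between each other's frames.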

[175] Toward Gaze Target Detection of Young Autistic Children

Shijian Deng, Erin E. Kosloski, Siva Sai Nagender Vasireddy, Jia Li, Randi Sierra Sherwood, Feroz Mohamed Hatha, Siddhi Patel, Pamela R Rollins, Yapeng Tian

Main category: cs.CV

TL;DR: This paper introduces a novel AI framework for gaze target detection in autistic children, using a Socially Aware Coarse-to-Fine approach that leverages social context to address class imbalance in autism datasets.

Details

Motivation: Automatic gaze target detection in autistic children can significantly improve their quality of life, especially for those lacking access to sufficient professional support. This is crucial for measuring joint attention, a core challenge in Autism Spectrum Disorder.

Method: Proposed Socially Aware Coarse-to-Fine (SACF) framework with two-pathway architecture using expert models specialized in social and non-social gaze, guided by a context-awareness gate module. Uses the first-ever Autism Gaze Target (AGT) dataset.

Result: The framework achieves state-of-the-art performance for gaze target detection in autistic children, significantly outperforming existing methods, especially on the critical minority class of face-directed gaze.

Conclusion: The SACF framework effectively addresses class imbalance in autism datasets and provides a robust solution for gaze target detection, which is foundational for automated systems measuring joint attention in ASD.

Abstract: The automatic detection of gaze targets in autistic children through artificial intelligence can be impactful, especially for those who lack access to a sufficient number of professionals to improve their quality of life. This paper introduces a new, real-world AI application for gaze target detection in autistic children, which predicts a child’s point of gaze from an activity image. This task is foundational for building automated systems that can measure joint attention, a core challenge in Autism Spectrum Disorder (ASD). To facilitate the study of this challenging application, we collected the first-ever Autism Gaze Target (AGT) dataset. We further propose a novel Socially Aware Coarse-to-Fine (SACF) gaze detection framework that explicitly leverages the social context of a scene to overcome the class imbalance common in autism datasets, a consequence of autistic children’s tendency to show reduced gaze to faces. It utilizes a two-pathway architecture with expert models specialized in social and non-social gaze, guided by a context-awareness gate module. The results of our comprehensive experiments demonstrate that our framework achieves new state-of-the-art performance for gaze target detection in this population, significantly outperforming existing methods, especially on the critical minority class of face-directed gaze.

Quoc-Huy Trinh, Mustapha Abdullahi, Do Duy Hung Trinh, Bo Zhao, Debesh Jha

Main category: cs.CV

TL;DR: Viper-F1 is a hybrid state-space vision-language model that replaces attention with efficient liquid state-space dynamics for improved computational efficiency and fine-grained visual understanding.

Details

Motivation: Address the high computational cost of multimodal large language models and their struggle with fine-grained visual reasoning in resource-constrained scenarios like robotics and smart devices.

Method: Uses hybrid state-space vision-language model with liquid state-space dynamics instead of attention, plus a Token-Grid Correlation Module for lightweight text-image correlations and FiLM conditioning to modulate state-space dynamics.

Result: Achieves accurate fine-grained understanding with significantly improved efficiency across multiple benchmarks while maintaining linear-time inference.

Conclusion: Viper-F1 provides an efficient alternative to Transformer-based MLLMs with better fine-grained visual reasoning capabilities suitable for resource-constrained applications.

Abstract: Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as robotic manipulation, personal assistants, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks that limit their effectiveness in the real world. To address these issues, we introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.
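
To make the correlation-plus-FiLM idea concrete, here is a small sketch under stated assumptions. FiLM (feature-wise linear modulation) scales and shifts each feature channel as `gamma * x + beta`, with `gamma` and `beta` derived from a conditioning signal. The abstract does not describe how Viper-F1 computes its correlations or maps them to modulation parameters, so the cosine-similarity relevance score and the per-channel weights `w_gamma` and `w_beta` below are hypothetical stand-ins, not the paper's module.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def patch_relevance(text_tokens, patch_embeddings):
    """One score per image patch: its best match against any text token."""
    return [max(cosine(t, p) for t in text_tokens) for p in patch_embeddings]


def film(features, relevance, w_gamma, w_beta):
    """Apply gamma * x + beta per channel, conditioned on each patch's relevance.

    gamma = 1 + w_gamma[c] * r and beta = w_beta[c] * r, so a patch with zero
    relevance passes through unchanged. The weights would be learned in practice.
    """
    return [
        [(1.0 + wg * r) * x + wb * r for x, wg, wb in zip(feat, w_gamma, w_beta)]
        for feat, r in zip(features, relevance)
    ]
```

Each step costs time proportional to the number of token-patch pairs, with no attention matrix over the full sequence, which is the kind of lightweight conditioning the summary describes.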

[177] A Comparison of Lightweight Deep Learning Models for Particulate-Matter Nowcasting in the Indian Subcontinent & Surrounding Regions

Ansh Kushwaha, Kaushik Gopalan

Main category: cs.CV

TL;DR: Efficient framework for 6-hour PM pollution nowcasting over India using CAMS data, with specialized lightweight models outperforming foundation models.

Details

Motivation: To develop accurate short-range pollution forecasting for PM1, PM2.5, and PM10 across India and surrounding regions using efficient deep learning approaches.

Method: Uses CAMS Global Atmospheric Composition Forecasts at 0.4° resolution, covering 256x256 input region with 128x128 India-centric output. Trains three parameter-specific lightweight architectures on 2021-2023 data with 90/10 split.

Result: Substantial performance gains over Aurora foundation model demonstrated by RMSE, MAE, Bias, and SSIM metrics on 2024 evaluation data.

Conclusion: Compact specialized deep learning models are highly effective for short-range pollution forecasting on limited spatial domains.

Abstract: This paper is a submission for the Weather4Cast 2025 complementary Pollution Task and presents an efficient framework for 6-hour lead-time nowcasting of PM1, PM2.5, and PM10 across the Indian subcontinent and surrounding regions. The proposed approach leverages analysis fields from the Copernicus Atmosphere Monitoring Service (CAMS) Global Atmospheric Composition Forecasts at 0.4 degree resolution. A 256x256 spatial region, covering 28.4S-73.6N and 32E-134.0E, is used as the model input, while predictions are generated for the central 128x128 area spanning 2.8S-48N and 57.6E-108.4E, ensuring an India-centric forecast domain with sufficient synoptic-scale context. Models are trained on CAMS analyses from 2021-2023 using a shuffled 90/10 split and independently evaluated on 2024 data. Three lightweight parameter-specific architectures are developed to improve accuracy, minimize systematic bias, and enable rapid inference. Evaluation using RMSE, MAE, Bias, and SSIM demonstrates substantial performance gains over the Aurora foundation model, underscoring the effectiveness of compact and specialized deep learning models for short-range forecasts on limited spatial domains.
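
The stated domain bounds can be checked with a few lines of arithmetic: a 256-cell axis at 0.4 degree spacing spans 255 x 0.4 = 102 degrees, and the central 128 cells start 64 cells in. The sketch below reproduces the bounds quoted in the abstract and adds plain definitions of three of the reported metrics. The helper names and the ascending coordinate order are assumptions for illustration; the actual data layout may differ.

```python
import math

RES_DEG = 0.4  # grid spacing stated in the abstract


def axis(start_deg, n_cells, step=RES_DEG):
    """Cell coordinates starting at `start_deg`, rounded to avoid float drift."""
    return [round(start_deg + i * step, 1) for i in range(n_cells)]


def center_window(coords, size):
    """First and last coordinate of the central `size` cells."""
    start = (len(coords) - size) // 2
    return coords[start], coords[start + size - 1]


def bias(pred, obs):
    """Mean signed error; positive means over-prediction."""
    return sum(p - o for p, o in zip(pred, obs)) / len(obs)


def mae(pred, obs):
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)


def rmse(pred, obs):
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


lats = axis(-28.4, 256)  # 28.4S to 73.6N (south-to-north order assumed)
lons = axis(32.0, 256)   # 32E to 134E
print(center_window(lats, 128))  # (-2.8, 48.0)
print(center_window(lons, 128))  # (57.6, 108.4)
```

The printed windows match the 2.8S-48N and 57.6E-108.4E output region, leaving a 64-cell (25.6 degree) margin of context on every side.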

[178] D-GAP: Improving Out-of-Domain Robustness via Dataset-Agnostic and Gradient-Guided Augmentation in Amplitude and Pixel Spaces

Ruoqi Wang, Haitao Wang, Shaojie Guo, Qiong Luo

Main category: cs.CV

TL;DR: D-GAP improves out-of-domain robustness by combining frequency-space and pixel-space augmentations guided by task gradients, outperforming both generic and dataset-specific methods.

Details

Motivation: Neural networks struggle with domain shifts due to learning biases toward domain-specific frequency components, and existing frequency-based methods overlook pixel-level details.

Method: Compute sensitivity maps from task gradients to guide adaptive interpolation in frequency space, combined with pixel-space blending to restore spatial details.

Result: Outperforms generic and dataset-specific augmentations, improving average OOD performance by +5.3% on real-world datasets and +1.8% on benchmarks.

Conclusion: D-GAP effectively addresses domain shift challenges through targeted augmentations in both frequency and pixel spaces, achieving superior OOD robustness.

Abstract: Out-of-domain (OOD) robustness is challenging to achieve in real-world computer vision applications, where shifts in image background, style, and acquisition instruments always degrade model performance. Generic augmentations show inconsistent gains under such shifts, whereas dataset-specific augmentations require expert knowledge and prior analysis. Moreover, prior studies show that neural networks adapt poorly to domain shifts because they exhibit a learning bias to domain-specific frequency components. Perturbing frequency values can mitigate such bias but overlooks pixel-level details, leading to suboptimal performance. To address these problems, we propose D-GAP (Dataset-agnostic and Gradient-guided augmentation in Amplitude and Pixel spaces), improving OOD robustness by introducing targeted augmentation in both the amplitude space (frequency space) and pixel space. Unlike conventional handcrafted augmentations, D-GAP computes sensitivity maps in the frequency space from task gradients, which reflect how strongly the model responds to different frequency components, and uses the maps to adaptively interpolate amplitudes between source and target samples. This way, D-GAP reduces the learning bias in frequency space, while a complementary pixel-space blending procedure restores fine spatial details. Extensive experiments on four real-world datasets and three domain-adaptation benchmarks show that D-GAP consistently outperforms both generic and dataset-specific augmentations, improving average OOD performance by +5.3% on real-world datasets and +1.8% on benchmark datasets.
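
The amplitude-space step can be sketched as follows. This is a simplified illustration, not the authors' implementation: it works on a 1-D signal rather than an image, uses a naive DFT to stay dependency-free, and takes the per-frequency interpolation weights as an input. In D-GAP those weights come from gradient-based sensitivity maps, which are not modelled here; the function names are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def mix_amplitudes(source, target, weights):
    """Interpolate amplitudes toward `target` per frequency, keeping source phase.

    weights[k] = 0 keeps the source amplitude at frequency k; 1 uses the target's.
    For a real-valued result the weights should be symmetric (weights[k] equal
    to weights[n - k]), which uniform weights satisfy.
    """
    mixed = []
    for s, t, w in zip(dft(source), dft(target), weights):
        amplitude = (1.0 - w) * abs(s) + w * abs(t)
        mixed.append(cmath.rect(amplitude, cmath.phase(s)))
    return [value.real for value in idft(mixed)]
```

Keeping the source phase preserves the signal's structure while the amplitude change shifts its frequency-domain statistics toward the target; the pixel-space blending described in the abstract would be a separate step.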

[179] Computationally-efficient deep learning models for nowcasting of precipitation: A solution for the Weather4cast 2025 challenge

Anushree Bhuskute, Kaushik Gopalan, Jeet Shah

Main category: cs.CV

TL;DR: A ConvGRU-based transfer learning framework for short-term rainfall prediction using SEVIRI infrared data, achieving 2nd place in Weather4Cast 2025 competition.

Details

Motivation: To develop an effective short-term rainfall prediction system for weather forecasting competitions using transfer learning and infrared satellite data.

Method: Two-stage training: first stage trains ConvGRU to forecast SEVIRI brightness temperatures, second stage applies nonlinear transformation to predict rainfall rates. Uses single infrared channel (10.8 μm) with four observations over one hour.

Result: Achieved 2nd place in cumulative rainfall task. Same model performed similarly to baseline in event prediction task without modifications.

Conclusion: The ConvGRU-based transfer learning framework effectively captures spatiotemporal patterns for rainfall prediction and demonstrates good generalization across different prediction tasks.

Abstract: This study presents a transfer-learning framework based on Convolutional Gated Recurrent Units (ConvGRU) for short-term rainfall prediction in the Weather4Cast 2025 competition. A single SEVIRI infrared channel (10.8 μm wavelength) is used as input, which consists of four observations over a one-hour period. A two-stage training strategy is applied to generate rainfall estimates up to four hours ahead. In the first stage, ConvGRU is trained to forecast the brightness temperatures from SEVIRI, enabling the model to capture relevant spatiotemporal patterns. In the second stage, an empirically derived nonlinear transformation maps the predicted fields to OPERA-compatible rainfall rates. For the event-prediction task, the transformed rainfall forecasts are processed using 3D event detection followed by spatiotemporal feature extraction to identify and characterize precipitation events. Our submission achieved 2nd place in the cumulative rainfall task. Further, the same model was used out of the box for the event-prediction task and achieved scores similar to those of the competition’s baseline model.

[180] Geospatial Chain of Thought Reasoning for Enhanced Visual Question Answering on Satellite Imagery

Shambhavi Shanker, Manikandan Padmanaban, Jagabondhu Hazra

Main category: cs.CV

TL;DR: A VQA framework combining Chain of Thought reasoning with Direct Preference Optimization improves geospatial reasoning on satellite imagery for climate applications, achieving 34.9% accuracy gains over baselines.

Details

Motivation: Existing VQA models lack structured reasoning for complex geospatial queries needed in climate applications like disaster monitoring and infrastructure risk assessment.

Method: Integrates Chain of Thought reasoning with Direct Preference Optimization to generate intermediate rationales for handling detection, classification, spatial relations, and comparative analysis tasks.

Result: CoT supervision improves accuracy by 34.9% over direct baselines, with DPO providing additional gains in accuracy and reasoning quality.

Conclusion: The framework advances VQA for multispectral Earth observation by enabling richer geospatial reasoning and more effective climate use cases.

Abstract: Geospatial chain of thought (CoT) reasoning is essential for advancing Visual Question Answering (VQA) on satellite imagery, particularly in climate related applications such as disaster monitoring, infrastructure risk assessment, urban resilience planning, and policy support. Existing VQA models enable scalable interpretation of remote sensing data but often lack the structured reasoning required for complex geospatial queries. We propose a VQA framework that integrates CoT reasoning with Direct Preference Optimization (DPO) to improve interpretability, robustness, and accuracy. By generating intermediate rationales, the model better handles tasks involving detection, classification, spatial relations, and comparative analysis, which are critical for reliable decision support in high stakes climate domains. Experiments show that CoT supervision improves accuracy by 34.9% over direct baselines, while DPO yields additional gains in accuracy and reasoning quality. The resulting system advances VQA for multispectral Earth observation by enabling richer geospatial reasoning and more effective climate use cases.
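
For readers unfamiliar with DPO, the per-pair loss in its commonly stated form is the negative log-sigmoid of a scaled log-probability margin between a preferred and a rejected response, each measured relative to a frozen reference model. The sketch below computes that loss from four sequence log-probabilities. How this paper builds its preference pairs and which `beta` it uses are not given in the summary, so the inputs and the default value are assumptions.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    margin = (log pi(chosen) - log pi_ref(chosen))
           - (log pi(rejected) - log pi_ref(rejected))
    loss   = -log(sigmoid(beta * margin))
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy matches the reference the margin is zero and the loss is log 2; the loss falls as the policy raises the preferred rationale's probability relative to the rejected one.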

Haokun Chen, Jianing Li, Yao Zhang, Jinhe Bi, Yan Xia, Jindong Gu, Volker Tresp

Main category: cs.CV

TL;DR: AUVIC is a novel framework for visual concept unlearning in MLLMs that uses adversarial perturbations to precisely remove target concepts while preserving performance on related entities, achieving state-of-the-art forgetting rates with minimal performance degradation.

Details

Motivation: MLLMs trained on massive datasets raise data privacy concerns due to sensitive/copyrighted content. Regulatory 'right to be forgotten' mandates require efficient unlearning methods without costly retraining, but visual concept unlearning in MLLMs remains underexplored.

Method: AUVIC applies adversarial perturbations to enable precise forgetting of target visual concepts while avoiding disruption of related entities. The framework is evaluated using VCUBench, the first benchmark for visual concept unlearning in group contexts.

Result: Experimental results show AUVIC achieves state-of-the-art target forgetting rates while incurring minimal performance degradation on non-target concepts.

Conclusion: AUVIC effectively addresses the challenge of precise visual concept unlearning in MLLMs, providing a practical solution for data privacy compliance while maintaining model performance on related entities.

Abstract: Multimodal Large Language Models (MLLMs) achieve impressive performance once optimized on massive datasets. Such datasets often contain sensitive or copyrighted content, raising significant data privacy concerns. Regulatory frameworks mandating the ‘right to be forgotten’ drive the need for machine unlearning. This technique allows for the removal of target data without resource-consuming retraining. However, while well-studied for text, visual concept unlearning in MLLMs remains underexplored. A primary challenge is precisely removing a target visual concept without disrupting model performance on related entities. To address this, we introduce AUVIC, a novel visual concept unlearning framework for MLLMs. AUVIC applies adversarial perturbations to enable precise forgetting. This approach effectively isolates the target concept while avoiding unintended effects on similar entities. To evaluate our method, we construct VCUBench. It is the first benchmark designed to assess visual concept unlearning in group contexts. Experimental results demonstrate that AUVIC achieves state-of-the-art target forgetting rates while incurring minimal performance degradation on non-target concepts.

[182] Questioning the Stability of Visual Question Answering

Amir Rosenfeld, Neta Glazer, Ethan Fetaya

Main category: cs.CV

TL;DR: VLMs are highly sensitive to minor visual/textual perturbations that don’t change semantics, with stability patterns predicting correctness across models.

Details

Motivation: To systematically evaluate VLM robustness to benign perturbations that preserve meaning but reveal fundamental fragility in current systems.

Method: Large-scale study across multiple VLMs and datasets, testing sensitivity to pixel shifts, geometric transforms, rescaling, paraphrasing, and multilingual rewrites.

Result: Modern VLMs are highly unstable under minor perturbations; stable samples strongly correlate with correctness; small models’ stability patterns can predict larger models’ performance.

Conclusion: Current VLMs exhibit fundamental fragility to meaning-preserving changes, highlighting the need for robustness evaluations focusing on semantic invariances rather than just adversarial attacks.

Abstract: Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations: pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites that do not alter the underlying semantics of an image-question pair. Across a broad set of models and datasets, we find that modern VLMs are highly sensitive to such minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. We characterize how this instability varies across perturbation types, question categories, and models, revealing that even state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts as small as a few pixels or harmless rephrasings. We further show that sample-level stability serves as a strong indicator of correctness: stable samples are consistently far more likely to be answered correctly. Leveraging this, we demonstrate that the stability patterns of small, accessible open-source models can be used to predict the correctness of much larger closed-source models with high precision. Our findings expose a fundamental fragility in current VLMs and highlight the need for robustness evaluations that go beyond adversarial perturbations, focusing instead on invariances that models should reliably uphold.
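
The sample-level stability notion can be expressed in a few lines. In this sketch a sample counts as stable when the model's answer is unchanged under every perturbation, and accuracy is then reported separately for stable and unstable samples. The record layout (`original`, `perturbed`, `truth`) and the exact-match comparison are assumptions for illustration; the paper's own stability criterion and answer matching may be more nuanced.

```python
def is_stable(original_answer, perturbed_answers):
    """True when every perturbed input yields the same answer as the original."""
    return all(answer == original_answer for answer in perturbed_answers)


def accuracy_by_stability(samples):
    """Accuracy on the original input, split by whether the sample is stable.

    Each sample is a dict with keys "original" (answer on the unperturbed
    input), "perturbed" (answers under each perturbation) and "truth".
    Returns {True: accuracy_on_stable, False: accuracy_on_unstable}, with
    None for an empty group.
    """
    groups = {True: [], False: []}
    for sample in samples:
        stable = is_stable(sample["original"], sample["perturbed"])
        groups[stable].append(sample["original"] == sample["truth"])
    return {
        stable: (sum(hits) / len(hits) if hits else None)
        for stable, hits in groups.items()
    }
```

A large gap between the two accuracies is what would make stability usable as a correctness signal, as the abstract reports.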

[183] One-to-N Backdoor Attack in 3D Point Cloud via Spherical Trigger

Dongmei Shan, Wei Lian, Chongxia Wang

Main category: cs.CV

TL;DR: First one-to-N backdoor framework for 3D point clouds using configurable spherical triggers that encode multiple target classes, achieving up to 100% attack success while maintaining clean data accuracy.

Details

Motivation: Existing 3D point cloud backdoor attacks are limited to rigid one-to-one paradigm, creating a critical threat in safety-sensitive domains like autonomous driving and robotics where multi-target attacks are needed.

Method: Novel configurable spherical trigger that leverages spatial properties of spheres as parameter space, allowing single trigger design to encode multiple target classes through distinct trigger configurations.

Result: High attack success rates (up to 100%) across multiple datasets and model architectures while maintaining accuracy on clean data, systematically validating the one-to-N backdoor concept.

Conclusion: Establishes crucial benchmark for multi-target threats in 3D vision and provides foundational understanding needed to secure future 3D-driven intelligent systems.

Abstract: Backdoor attacks represent a critical threat to deep learning systems, particularly in safety-sensitive 3D domains such as autonomous driving and robotics. However, existing backdoor attacks for 3D point clouds have been limited to a rigid one-to-one paradigm. To address this, we present the first one-to-N backdoor framework for 3D vision, based on a novel, configurable spherical trigger. Our key insight is to leverage the spatial properties of spheres as a parameter space, allowing a single trigger design to encode multiple target classes. We establish a theoretical foundation for one-to-N backdoor attacks in 3D, demonstrating that poisoned models can map distinct trigger configurations to different target labels. Experimental results systematically validate this conclusion across multiple datasets and model architectures, achieving high attack success rates (up to 100%) while maintaining accuracy on clean data. This work establishes a crucial benchmark for multi-target threats in 3D vision and provides the foundational understanding needed to secure future 3D-driven intelligent systems.

Mohammad Areeb Qazi, Munachiso S Nwadike, Ibrahim Almakky, Mohammad Yaqub, Numan Saeed

Main category: cs.CV

TL;DR: MAFM^3 is a modular framework that enables a single foundation model to adapt to multiple medical imaging domains, tasks, and modalities through lightweight components, improving performance on prognosis and segmentation tasks.

Details

Motivation: Address the challenge of data scarcity in medical imaging by creating a unified framework that allows foundation models to expand beyond their initial training scope without requiring separate models for each domain, modality, or task.

Method: Proposes MAFM^3 framework with lightweight modular components that serve as specialized skill sets, allowing flexible activation of appropriate capabilities at inference time based on input type or clinical objective.

Result: Improved performance on both prognosis and segmentation tasks; achieved 5% Dice score improvement when incorporating PET scans compared to respective baselines.

Conclusion: Foundation models equipped with modular components can evolve into multitask, multimodality systems for medical imaging, overcoming inherent constraints of their initial training scope.

Abstract: Foundational models are trained on extensive datasets to capture the general trends of a domain. However, in medical imaging, the scarcity of data makes pre-training for every domain, modality, or task challenging. Instead of building separate models, we propose MAFM^3 (Modular Adaptation of Foundation Models for Multi-Modal Medical AI), a framework that enables a single foundation model to expand into diverse domains, tasks, and modalities through lightweight modular components. These components serve as specialized skill sets that allow the system to flexibly activate the appropriate capability at inference time, depending on the input type or clinical objective. Unlike conventional adaptation methods that treat each new task or modality in isolation, MAFM^3 provides a unified and expandable framework for efficient multitask and multimodality adaptation. Empirically, we validate our approach by adapting a chest CT foundation model initially trained for classification into prognosis and segmentation modules. Our results show improved performance on both tasks. Furthermore, by incorporating PET scans, MAFM^3 achieved a 5% improvement in the Dice score compared to the respective baselines. These findings establish that foundation models, when equipped with modular components, are not inherently constrained to their initial training scope but can evolve into multitask, multimodality systems for medical imaging. The code implementation of this work can be found at https://github.com/Areeb2735/CTscan_prognosis_VLM

[185] RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian Splatting

Ruocheng Wu, Haolan He, Yufei Wang, Zhihao Li, Bihan Wen

Main category: cs.CV

TL;DR: GSD framework uses Video Diffusion Models to provide multi-view consistency priors for 3D Gaussian Splatting, addressing overfitting in sparse view scenarios through unified guidance with depth and semantic constraints.

Details

Motivation: 3D Gaussian Splatting suffers from overfitting when trained on sparse views due to lack of intermediate-view supervision, requiring external priors to ensure multi-view consistency.

Method: Proposes Guidance Score Distillation (GSD) that supervises rendered images from multiple views using Video Diffusion Models, with unified guidance incorporating depth warp and semantic feature constraints to correct noise predictions.

Result: Experimental results show the method outperforms existing approaches across multiple datasets, demonstrating improved performance in sparse view scenarios.

Conclusion: GSD effectively leverages Video Diffusion Models to provide multi-view consistency priors for 3D Gaussian Splatting, achieving superior results through unified guidance with depth and semantic constraints.

Abstract: 3D Gaussian Splatting (3DGS) has recently gained great attention in 3D scene representation for its high-quality real-time rendering capabilities. However, when the input comprises sparse training views, 3DGS is prone to overfitting, primarily due to the lack of intermediate-view supervision. Inspired by the recent success of Video Diffusion Models (VDM), we propose a framework called Guidance Score Distillation (GSD) to extract the rich multi-view consistency priors from pretrained VDMs. Building on the insights from Score Distillation Sampling (SDS), GSD supervises rendered images from multiple neighboring views, guiding the Gaussian splatting representation towards the generative direction of VDM. However, the generative direction often involves object motion and random camera trajectories, making it challenging for direct supervision in the optimization process. To address this problem, we introduce a unified guidance form to correct the noise prediction result of VDM. Specifically, we incorporate both a depth warp guidance based on real depth maps and a guidance based on semantic image features, ensuring that the score update direction from VDM aligns with the correct camera pose and accurate geometry. Experimental results show that our method outperforms existing approaches across multiple datasets.

[186] The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models

Maria-Teresa De Rosa Palmini, Eva Cetinic

Main category: cs.CV

TL;DR: This paper introduces a framework to analyze how diffusion models handle cultural references, distinguishing between recognizing references and how they depict them (replication vs. reinterpretation).

Details

Motivation: To address ambiguity between generalization and memorization in text-to-image models, specifically focusing on multimodal iconicity where images and texts evoke culturally shared associations.

Method: Developed an evaluation framework separating recognition from realization, tested on 767 Wikidata-derived cultural references across 5 diffusion models, with prompt perturbation experiments using synonym substitutions and literal descriptions.

Result: The framework distinguishes replication from transformation better than existing similarity methods; models often reproduce iconic structures even with altered text; cultural alignment correlates with training frequency, textual uniqueness, popularity, and creation date.

Conclusion: Diffusion models’ value lies not just in reproduction but in how they transform and recontextualize cultural knowledge, advancing evaluation beyond simple text-image matching to richer contextual understanding.

Abstract: Our work addresses the ambiguity between generalization and memorization in text-to-image diffusion models, focusing on a specific case we term multimodal iconicity. This refers to instances where images and texts evoke culturally shared associations, such as when a title recalls a familiar artwork or film scene. While prior research on memorization and unlearning emphasizes forgetting, we examine what is remembered and how, focusing on the balance between recognizing cultural references and reproducing them. We introduce an evaluation framework that separates recognition, whether a model identifies a reference, from realization, how it depicts it through replication or reinterpretation, quantified through measures capturing both dimensions. By evaluating five diffusion models across 767 Wikidata-derived cultural references spanning static and dynamic imagery, we show that our framework distinguishes replication from transformation more effectively than existing similarity-based methods. To assess linguistic sensitivity, we conduct prompt perturbation experiments using synonym substitutions and literal image descriptions, finding that models often reproduce iconic visual structures even when textual cues are altered. Finally, our analysis shows that cultural alignment correlates not only with training data frequency, but also textual uniqueness, reference popularity, and creation date. Our work reveals that the value of diffusion models lies not only in what they reproduce but in how they transform and recontextualize cultural knowledge, advancing evaluation beyond simple text-image matching toward richer contextual understanding.

[187] Positional Bias in Multimodal Embedding Models: Do They Favor the Beginning, the Middle, or the End?

Kebin Wu, Fatima Albreiki

Main category: cs.CV

TL;DR: Investigates positional bias in multimodal representation models, particularly in image-text retrieval, finding that text encoders favor the beginning of input while image encoders show bias at both beginning and end.

Details

Motivation: Positional bias negatively impacts model performance but remains underexplored in representation models and multimodal settings compared to text generation models.

Method: Distinguished between context importance and positional bias, then assessed positional bias across different models and datasets in image-text retrieval tasks.

Result: Positional bias is prevalent in multimodal models with text encoders biased toward input beginning and image encoders biased at both beginning and end; bias arises from positional encoding, training loss, context importance, and multimodal training nature.

Conclusion: Multimodal representation models exhibit significant positional bias that differs across modalities, influenced by multiple training and architectural factors.

Abstract: Positional bias - where models overemphasize certain positions regardless of content - has been shown to negatively impact model performance across various tasks. While recent research has extensively examined positional bias in text generation models, its presence and effects in representation models remain underexplored. Even less is known about such biases in multimodal models. In this work, we investigate positional bias in multimodal representation models, specifically in the context of image-text retrieval. We begin by distinguishing between context importance and positional bias, and then assess the presence and extent of positional bias across different models and datasets. Our experiments demonstrate that positional bias is prevalent in multimodal models, but manifests differently across modalities: text encoders tend to exhibit bias toward the beginning of the input, whereas image encoders show bias at both the beginning and end. Furthermore, we find that this bias arises from, or is amplified by, a combination of factors, including the positional encoding scheme, training loss, context importance, and the nature of using image-text pairs in multimodal training.

[188] Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents

Davide Napolitano, Luca Cagliero, Fabrizio Battiloro

Main category: cs.CV

TL;DR: VRD-UQA is a benchmark for evaluating Visual Large Language Models’ ability to detect plausible yet unanswerable questions in Visually Rich Documents, revealing significant limitations in current models.

Details

Motivation: VLLMs excel at Visual Question Answering but struggle with detecting unanswerable questions, especially those that appear valid but cannot be answered due to subtle corruptions in document elements.

Method: Created VRD-UQA benchmark by automatically altering questions from existing VQA datasets through entity swaps, document element changes, and layout modifications, then using VLLM-as-a-judge to verify unanswerability and evaluate 12 models across multiple dimensions.

Result: Experiments revealed VLLMs’ limitations in detecting unanswerable questions at both page and document levels, with performance varying based on corruption type (NLP entity, document element, layout) and knowledge injection strategies.

Conclusion: VRD-UQA serves as an effective evaluation framework for developing more resilient document VQA systems by exposing VLLMs’ current weaknesses in handling plausible unanswerable questions.

Abstract: The evolution of Visual Large Language Models (VLLMs) has revolutionized the automatic understanding of Visually Rich Documents (VRDs), which contain both textual and visual elements. Although VLLMs excel in Visual Question Answering (VQA) on multi-page VRDs, their ability to detect unanswerable questions is still an open research question. Our research delves into the robustness of the VLLMs to plausible yet unanswerable questions, i.e., questions that appear valid but cannot be answered due to subtle corruptions caused by swaps between related concepts or plausible question formulations. Corruptions are generated by replacing the original natural language entities with other ones of the same type, belonging to different document elements, and in different layout positions or pages of the related document. To this end, we present VRD-UQA (VISUALLY RICH DOCUMENT UNANSWERABLE QUESTION ANSWERING), a benchmark for evaluating VLLMs’ resilience to plausible yet unanswerable questions across multiple dimensions. It automatically alters the questions of existing VQA datasets consisting of multi-page VRDs, verifies their unanswerability using a VLLM-as-a-judge approach, and then thoroughly evaluates VLLMs’ performance. Experiments, run on 12 models, analyze: (1) The VLLMs’ accuracy in detecting unanswerable questions at both page and document levels; (2) The effect of different types of corruption (NLP entity, document element, layout); (3) The effectiveness of different knowledge injection strategies based on in-context learning (OCR, multi-page selection, or the possibility of unanswerability). Our findings reveal VLLMs’ limitations and demonstrate that VRD-UQA can serve as an evaluation framework for developing resilient document VQA systems.
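The entity-swap corruption described above can be pictured with a small sketch. It assumes a hypothetical `Entity` record and a `swap_entity` helper, neither of which comes from the paper, and covers only the swap itself. A swapped question is merely a candidate: the benchmark still verifies unanswerability with a VLLM-as-a-judge step, which is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    """Hypothetical record for an entity found in a document."""
    text: str
    kind: str  # entity type, e.g. "DATE" or "ORG"
    page: int  # page on which the entity appears


def swap_entity(question: str, original: Entity, pool: List[Entity]) -> Optional[str]:
    """Replace `original` with a same-type entity taken from a different page.

    Returns the corrupted question, or None if no suitable candidate exists.
    """
    for candidate in pool:
        same_type = candidate.kind == original.kind
        other_page = candidate.page != original.page
        if same_type and other_page and candidate.text != original.text:
            return question.replace(original.text, candidate.text)
    return None


pool = [
    Entity("Acme", "ORG", 2),   # wrong type, skipped
    Entity("2018", "DATE", 1),  # same page as the original, skipped
    Entity("2021", "DATE", 3),  # same type, different page: used
]
corrupted = swap_entity(
    "What was the revenue reported in 2019?", Entity("2019", "DATE", 1), pool
)
```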

[189] Discovering Meaningful Units with Visually Grounded Semantics from Image Captions

Melika Behjati, James Henderson

Main category: cs.CV

TL;DR: Proposes a model that groups caption tokens to capture fine-grained language representations, aligning them with object-level image features for better vision-language understanding.

Details

Motivation: Current vision-language models align image patches with tokens, but patches lack meaning and individual tokens may not carry groundable information. Groups of tokens better describe scene aspects.

Method: Developed a model that groups caption tokens as part of its architecture to create fine-grained language representations, aligning these with object-level outputs from an image encoder trained for object discovery.

Result: The model achieves better fine-grained understanding of vision and language through learned token grouping. Discovered token groups are highly similar to groundable phrases in text, both qualitatively and quantitatively.

Conclusion: Grouping tokens enables vision-language models to capture more meaningful fine-grained representations that align with object-level image understanding, improving overall scene comprehension.

Abstract: Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused on aligning the image patches with the tokens on the language side. However, image patches do not have any meaning to the human eye, and individual tokens do not necessarily carry groundable information in the image. It is groups of tokens which describe different aspects of the scene. In this work, we propose a model which groups the caption tokens as part of its architecture in order to capture a fine-grained representation of the language. We expect our representations to be at the level of objects present in the image, and therefore align our representations with the output of an image encoder trained to discover objects. We show that by learning to group the tokens, the vision-language model has a better fine-grained understanding of vision and language. In addition, the token groups that our model discovers are highly similar to groundable phrases in text, both qualitatively and quantitatively.

[190] DoReMi: A Domain-Representation Mixture Framework for Generalizable 3D Understanding

Mingwei Xing, Xinliang Wang, Yifeng Shi

Main category: cs.CV

TL;DR: DoReMi is a Mixture-of-Experts framework that jointly models domain-aware and unified representation branches to address multi-domain 3D point cloud learning challenges, achieving state-of-the-art performance on benchmarks like ScanNet and S3DIS.

Details

Motivation: 3D deep learning faces limitations due to small datasets and heterogeneous multi-source point clouds from different sensors (LiDAR vs mesh-derived), causing negative transfer during multi-domain fusion. Existing approaches overlook the synergy between domain-aware and domain-general features.

Method: Proposes DoReMi with a Domain-aware Experts branch and a unified Representation branch. Uses Domain-Guided Spatial Routing (DSR) for context-aware expert selection and Entropy-Controlled Dynamic Allocation (EDA) for stable expert utilization. Includes a frozen unified representation branch pretrained via multi-attribute self-supervised learning.

Result: Achieves 80.1% mIoU on ScanNet Val and 77.2% mIoU on S3DIS, demonstrating competitive or superior performance compared to existing approaches.

Conclusion: DoReMi shows strong potential as a foundation framework for future 3D understanding research by effectively handling diverse domain distributions while preserving cross-domain geometric and structural priors.

Abstract: The generalization of 3D deep learning across multiple domains remains constrained by the limited scale of existing datasets and the high heterogeneity of multi-source point clouds. Point clouds collected from different sensors (e.g., LiDAR scans and mesh-derived point clouds) exhibit substantial discrepancies in density and noise distribution, resulting in negative transfer during multi-domain fusion. Most existing approaches focus exclusively on either domain-aware or domain-general features, overlooking the potential synergy between them. To address this, we propose DoReMi (Domain-Representation Mixture), a Mixture-of-Experts (MoE) framework that jointly models a Domain-aware Experts branch and a unified Representation branch to enable cooperative learning between specialized and generalizable knowledge. DoReMi dynamically activates the domain-aware expert branch via Domain-Guided Spatial Routing (DSR) for context-aware expert selection and employs Entropy-Controlled Dynamic Allocation (EDA) for stable and efficient expert utilization, thereby adaptively modeling diverse domain distributions. Complemented by a frozen unified representation branch pretrained through robust multi-attribute self-supervised learning, DoReMi preserves cross-domain geometric and structural priors while maintaining global consistency. We evaluate DoReMi across multiple 3D understanding benchmarks. Notably, DoReMi achieves 80.1% mIoU on ScanNet Val and 77.2% mIoU on S3DIS, demonstrating competitive or superior performance compared to existing approaches, and showing strong potential as a foundation framework for future 3D understanding research. The code will be released soon.

[191] Parameter-Efficient MoE LoRA for Few-Shot Multi-Style Editing

Cong Cao, Yujie Xu, Xiaodong Xu

Main category: cs.CV

TL;DR: Proposes a few-shot style editing framework using MoE LoRA with routing mechanisms to fine-tune image editing models for new styles with limited data.

Details

Motivation: General image editing models often fail with new styles, and there's a need for effective fine-tuning using limited paired data.

Method: Uses a parameter-efficient multi-style Mixture-of-Experts LoRA with style-specific and style-shared routing and metric-guided rank optimization, and integrates adversarial learning and flow matching in a Diffusion Transformer.

Result: Outperforms state-of-the-art approaches with significantly fewer LoRA parameters.

Conclusion: The proposed framework effectively adapts general image editing models to new styles using limited data through efficient parameter usage and specialized routing mechanisms.

Abstract: In recent years, image editing has garnered growing attention. However, general image editing models often fail to produce satisfactory results when confronted with new styles. The challenge lies in how to effectively fine-tune general image editing models to new styles using only a limited amount of paired data. To address this issue, this paper proposes a novel few-shot style editing framework. For this task, we construct a benchmark dataset that encompasses five distinct styles. Correspondingly, we propose a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing mechanisms for jointly fine-tuning multiple styles. The style-specific routing ensures that different styles do not interfere with one another, while the style-shared routing adaptively allocates shared MoE LoRAs to learn common patterns. Our MoE LoRA can automatically determine the optimal ranks for each layer through a novel metric-guided approach that estimates the importance score of each single-rank component. Additionally, we explore the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model and integrate adversarial learning and flow matching to guide the diffusion training process. Experimental results demonstrate that our proposed method outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters.

[192] From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs

Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi

Main category: cs.CV

TL;DR: Fine-tuning VLMs with controlled synthetic data generation improves performance and reduces biases compared to real-world data fine-tuning.

Details

Motivation: Real-world data collection for VLM fine-tuning often introduces biases, errors, and distribution imbalance, leading to overfitting and imbalanced performance.

Method: Redesigned fine-tuning process with controlled synthetic data generation that comprehensively samples object attributes (color, shape, size, position) and uses this balanced dataset to fine-tune VLMs.

Result: Fine-tuning on balanced synthetic data yields uniform performance across visual scenes, mitigates biases, and significantly improves performance on real-world COCO data, outperforming matched-setting fine-tuning.

Conclusion: Controlled synthetic data generation provides an effective alternative to real-world data collection for VLM fine-tuning, addressing bias and distribution issues while improving transferability to real-world scenarios.

Abstract: Fine-tuning Vision-Language Models (VLMs) is a common strategy to improve performance following an ad-hoc data collection and annotation of real-world scenes. However, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have tried to address this problem by generating synthetic data, they lacked control over distribution bias and annotation quality. To address these challenges, we redesign the fine-tuning process in two ways. First, we control the generation of data and its annotations, ensuring it is free from bias, distribution imbalance, and annotation errors. We automatically construct the dataset by comprehensively sampling objects’ attributes, including color, shape, size, and position within the scene. Secondly, using this annotated dataset, we fine-tune state-of-the-art VLMs and assess performance transferability to real-world data on the absolute position task. We conduct exhaustive evaluations on both synthetic and real-world benchmarks. Our experiments reveal two key findings: 1) fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases; and 2) fine-tuning on synthetic stimuli significantly improves performance on real-world data (COCO), outperforming models fine-tuned in the matched setting.
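One way to read "comprehensively sampling objects' attributes" is as a full Cartesian grid over attribute values, so that every value occurs equally often. The sketch below assumes hypothetical attribute vocabularies and a 3x3 position grid; the abstract does not give the actual values.

```python
from collections import Counter
from itertools import product

# Hypothetical attribute vocabularies; the paper's actual values are not stated here.
COLORS = ["red", "green", "blue"]
SHAPES = ["cube", "sphere"]
SIZES = ["small", "large"]
POSITIONS = [(row, col) for row in range(3) for col in range(3)]  # assumed 3x3 grid


def balanced_scenes():
    """Enumerate every attribute combination exactly once."""
    return [
        {"color": c, "shape": s, "size": z, "position": p}
        for c, s, z, p in product(COLORS, SHAPES, SIZES, POSITIONS)
    ]


scenes = balanced_scenes()
# Each position should occur the same number of times across the dataset.
position_counts = Counter(scene["position"] for scene in scenes)
```

Because the grid is exhaustive, no position (or color, shape, size) is over-represented, which is the balance property the abstract credits for uniform performance across the scene.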

[193] Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression

Zhongbin Guo, Jiahe Liu, Yushan Li, Wenyu Gao, Zhen Yang, Chenzhi Li, Xinyue Zhang, Ping Jian

Main category: cs.CV

TL;DR: GEODE is a novel VLM architecture that overcomes 3D spatial reasoning limitations by decoupling 3D reasoning from numerical generation using two specialized modules: DRM for spatial co-processing and DRH for precise continuous regression.

Details

Motivation: Existing VLMs struggle with real-world 3D spatial intelligence due to input-stage conflicts between geometric-aware encoders and 2D features, and output-stage misalignment where discrete tokenizers can't produce precise continuous values.

Method: GEODE augments the main VLM with two plug-and-play modules: a Decoupled Rationale Module (DRM) that aligns 3D data with 2D features and distills spatial Chain-of-Thought logic, and a Direct Regression Head (DRH) that uses an “Embedding-as-Value” paradigm for precise continuous regression.

Result: The 1.5B parameter GEODE model achieves state-of-the-art spatial reasoning performance that rivals 7B+ models, functioning as a high-level semantic dispatcher.

Conclusion: GEODE successfully resolves the dual-bottleneck in VLMs for 3D spatial reasoning through its decoupled architecture, enabling efficient and precise geometric understanding with smaller model size.

Abstract: Existing Vision Language Models (VLMs), architecturally rooted in “flatland” perception, fundamentally struggle to comprehend real-world 3D spatial intelligence. This failure stems from a dual-bottleneck: an input-stage conflict between computationally exorbitant geometric-aware encoders and superficial 2D-only features, and an output-stage misalignment where discrete tokenizers are structurally incapable of producing precise, continuous numerical values. To break this impasse, we introduce GEODE (Geometric-Output and Decoupled-Input Engine), a novel architecture that resolves this dual-bottleneck by decoupling 3D reasoning from numerical generation. GEODE augments the main VLM with two specialized, plug-and-play modules: a Decoupled Rationale Module (DRM) that acts as a spatial co-processor, aligning explicit 3D data with 2D visual features via cross-attention and distilling spatial Chain-of-Thought (CoT) logic into injectable Rationale Tokens; and a Direct Regression Head (DRH), an “Embedding-as-Value” paradigm that routes specialized control tokens to a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. The synergy of these modules allows our 1.5B parameter model to function as a high-level semantic dispatcher, achieving state-of-the-art spatial reasoning performance that rivals 7B+ models.
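The “Embedding-as-Value” idea can be pictured as reading a number directly off a hidden vector with a small MLP, rather than decoding digits token by token. The toy sketch below assumes a one-hidden-layer ReLU MLP with hand-picked weights; the function name, layer sizes, and weights are illustrative and not taken from the paper.

```python
def regress_from_embedding(embedding, w1, b1, w2, b2):
    """Map a control-token embedding to one continuous scalar.

    w1/b1: hidden-layer weights and biases (one row of w1 per hidden unit).
    w2/b2: output-layer weights and bias.
    """
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, embedding)) + bias)  # ReLU
        for row, bias in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


# Hypothetical 2-dimensional embedding of a control token.
value = regress_from_embedding(
    embedding=[1.0, 2.0],
    w1=[[1.0, 0.0], [0.0, -1.0]],
    b1=[0.0, 0.0],
    w2=[2.0, 3.0],
    b2=0.5,
)
```

The output is a real number with no tokenizer in the loop, which is the property the abstract contrasts with discrete digit generation.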

[194] ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation

Kaishen Wang, Ruibo Chen, Tong Zheng, Heng Huang

Main category: cs.CV

TL;DR: ImAgent is a training-free multimodal agent that integrates reasoning, generation, and self-evaluation in a single framework to improve text-to-image generation consistency and efficiency without relying on external models.

Details

Motivation: Current T2I models suffer from randomness and inconsistency with vague prompts, while existing solutions require additional modules and lack test-time scaling efficiency.

Method: A unified multimodal agent with a policy controller that enables multiple generation actions to dynamically interact and self-organize for enhanced image fidelity and semantic alignment.

Result: ImAgent consistently improves over backbone models and surpasses other baselines where backbone models fail, demonstrating effectiveness in image generation and editing tasks.

Conclusion: Unified multimodal agents like ImAgent show potential for adaptive and efficient image generation under test-time scaling conditions.

Abstract: Recent text-to-image (T2I) models have made remarkable progress in generating visually realistic and semantically coherent images. However, they still suffer from randomness and inconsistency with the given prompts, particularly when textual descriptions are vague or underspecified. Existing approaches, such as prompt rewriting, best-of-N sampling, and self-refinement, can mitigate these issues but usually require additional modules and operate independently, hindering test-time scaling efficiency and increasing computational overhead. In this paper, we introduce ImAgent, a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation within a single framework for efficient test-time scaling. Guided by a policy controller, multiple generation actions dynamically interact and self-organize to enhance image fidelity and semantic alignment without relying on external models. Extensive experiments on image generation and editing tasks demonstrate that ImAgent consistently improves over the backbone and even surpasses other strong baselines where the backbone model fails, highlighting the potential of unified multimodal agents for adaptive and efficient image generation under test-time scaling.

[195] DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding

Dawei Zhu, Rui Meng, Jiefeng Chen, Sujian Li, Tomas Pfister, Jinsung Yoon

Main category: cs.CV

TL;DR: DocLens is a tool-augmented multi-agent framework that improves evidence localization in long visual documents by navigating from full documents to specific visual elements and using sampling-adjudication for reliable answers.

Details

Motivation: Existing Vision-Language Models struggle with evidence localization in long visual documents, failing to retrieve relevant pages and missing fine-grained details, leading to poor performance and hallucinations.

Method: A multi-agent framework that first navigates from full documents to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate reliable answers.

Result: Achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing human experts, with particular strength on vision-centric and unanswerable queries.

Conclusion: DocLens demonstrates the power of enhanced localization capabilities for comprehending long visual documents, effectively addressing fundamental evidence localization challenges.

Abstract: Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively “zooms in” on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework’s superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.

[196] Arcee: Differentiable Recurrent State Chain for Generative Vision Modeling with Mamba SSMs

Jitesh Chavan, Rohit Lal, Anand Kamat, Mengjia Xu

Main category: cs.CV

TL;DR: Arcee introduces cross-block recurrent state chains in Mamba models, reusing terminal state-space representations between blocks to improve performance in vision tasks without adding parameters or significant computational cost.

Details

Motivation: Current Mamba models for vision reinitialize state-space dynamics from zero at each block, discarding valuable terminal state representations from previous blocks, which limits their modeling capacity.

Method: Arcee creates a differentiable boundary map that passes each block’s terminal state-space representation as the initial condition to the next block, enabling end-to-end gradient flow across block boundaries while remaining parameter-free.

Result: On CelebA-HQ 256×256 unconditional generation with Flow Matching, Arcee reduced FID from 82.81 to 15.33 (5.4× lower) on a single scan-order Zigzag Mamba baseline.

Conclusion: Reusing terminal state-space representations across blocks provides significant performance gains for vision Mamba models with negligible computational overhead, treating terminal SSR as a directional prior rather than a signal estimator.

Abstract: State-space models (SSMs), Mamba in particular, are increasingly adopted for long-context sequence modeling, providing linear-time aggregation via an input-dependent, causal selective-scan operation. Along this line, recent “Mamba-for-vision” variants largely explore multiple scan orders to relax strict causality for non-sequential signals (e.g., images). Rather than preserving cross-block memory, the conventional formulation of the selective-scan operation in Mamba reinitializes each block’s state-space dynamics from zero, discarding the terminal state-space representation (SSR) from the previous block. Arcee, a cross-block recurrent state chain, reuses each block’s terminal state-space representation as the initial condition for the next block. Handoff across blocks is constructed as a differentiable boundary map whose Jacobian enables end-to-end gradient flow across terminal boundaries. Key to practicality, Arcee is compatible with all prior “vision-mamba” variants, parameter-free, and incurs constant, negligible cost. As a modeling perspective, we view terminal SSR as a mild directional prior induced by a causal pass over the input, rather than an estimator of the non-sequential signal itself. To quantify the impact, for unconditional generation on CelebA-HQ (256×256) with Flow Matching, Arcee reduces FID from 82.81 to 15.33 (5.4× lower) on a single scan-order Zigzag Mamba baseline. Efficient CUDA kernels and training code will be released to support rigorous and reproducible research.
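The difference between resetting each block's state to zero and chaining terminal states can be seen in a toy recurrence. The sketch below assumes a scalar, fixed-coefficient linear recurrence as a stand-in for the selective scan (real Mamba blocks are input-dependent and multi-dimensional), and an identity handoff between blocks; the function names are illustrative.

```python
def scan(inputs, a, b, h0=0.0):
    """Toy recurrence h_t = a * h_{t-1} + b * x_t.

    Returns the per-step outputs and the terminal state.
    """
    h = h0
    outputs = []
    for x in inputs:
        h = a * h + b * x
        outputs.append(h)
    return outputs, h


def run_blocks(inputs, params, chain_state):
    """Run a stack of blocks; each block scans the previous block's outputs.

    chain_state=False: every block starts from a zero state (conventional).
    chain_state=True: each block starts from the previous block's terminal state.
    """
    terminal = 0.0
    x = inputs
    for a, b in params:
        h0 = terminal if chain_state else 0.0
        x, terminal = scan(x, a, b, h0)
    return x


params = [(0.5, 1.0), (0.5, 1.0)]  # two blocks with hypothetical coefficients
reset_out = run_blocks([1.0, 0.0], params, chain_state=False)    # [1.0, 1.0]
chained_out = run_blocks([1.0, 0.0], params, chain_state=True)   # [1.25, 1.125]
```

Chaining adds no parameters: the second block simply starts from 0.5 (the first block's terminal state) instead of 0.0, which changes its outputs.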

[197] PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision–Language Models

Nhat Hoang-Xuan, Minh Vu, My T. Thai, Manish Bhattarai

Main category: cs.CV

TL;DR: LVLMs often hallucinate objects by ignoring images and relying on previously generated tokens. The paper introduces PAS, a training-free attention-based signal that detects hallucinations in real-time.

Details

Motivation: Large vision-language models suffer from object hallucinations where they generate objects not present in images, making them unreliable for practical applications.

Method: Quantify image dependence via mutual information, then introduce Prelim Attention Score (PAS) computed from attention weights over previously generated tokens during inference.

Result: PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention without additional forward passes.

Conclusion: Attention patterns over previously generated tokens provide a reliable signal for detecting hallucinations, offering a lightweight, training-free solution to improve LVLM reliability.

Abstract: Large vision-language models (LVLMs) are powerful, yet they remain unreliable due to object hallucinations. In this work, we show that in many hallucinatory predictions the LVLM effectively ignores the image and instead relies on previously generated output (prelim) tokens to infer new objects. We quantify this behavior via the mutual information between the image and the predicted object conditioned on the prelim, demonstrating that weak image dependence strongly correlates with hallucination. Building on this finding, we introduce the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no additional forward passes and can be computed on the fly during inference. Exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.
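A signal "computed from attention weights over prelim tokens" could, for example, be the share of attention mass that lands on previously generated tokens. The sketch below implements that one plausible reading; the exact PAS definition (which layers and heads, and how they are aggregated) is not given in the abstract, and the function name and token layout are assumptions.

```python
def prelim_attention_share(attn_row, prelim_start):
    """Fraction of one attention row that falls on previously generated tokens.

    attn_row: non-negative attention weights for the token being predicted,
              laid out as [image and prompt tokens..., prelim tokens...].
    prelim_start: index of the first previously generated (prelim) token.
    """
    total = sum(attn_row)
    if total == 0:
        return 0.0
    return sum(attn_row[prelim_start:]) / total


# Hypothetical row: two image/prompt tokens followed by two prelim tokens.
share = prelim_attention_share([0.5, 0.25, 0.125, 0.125], prelim_start=2)
```

Under this reading, a high share means the prediction leans on earlier output rather than the image, the pattern the abstract associates with hallucination. Because the weights are already produced during decoding, no extra forward pass is needed.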

[198] CountSteer: Steering Attention for Object Counting in Diffusion Models

Hyemin Boo, Hyoryung Kim, Myungjin Lee, Seunghyeon Lee, Jiyoung Lee, Jang-Hwan Choi, Hyunsoo Cho

Main category: cs.CV

TL;DR: CountSteer is a training-free method that improves object count accuracy in text-to-image diffusion models by steering cross-attention hidden states during inference.

Details

Motivation: Text-to-image diffusion models often fail to follow numerical instructions in text, revealing a gap between language and visual representation, but they implicitly encode latent notions of numerical correctness.

Method: CountSteer steers the model’s cross-attention hidden states during inference based on the observation that models have internal signals indicating their own counting accuracy.

Result: CountSteer improved object-count accuracy by about 4% without compromising visual quality.

Conclusion: This demonstrates a simple yet effective step toward more controllable and semantically reliable text-to-image generation by harnessing the model’s latent numerical awareness.

Abstract: Text-to-image diffusion models generate realistic and coherent images but often fail to follow numerical instructions in text, revealing a gap between language and visual representation. Interestingly, we found that these models are not entirely blind to numbers: they are implicitly aware of their own counting accuracy, as their internal signals shift in consistent ways depending on whether the output meets the specified count. This observation suggests that the model already encodes a latent notion of numerical correctness, which can be harnessed to guide generation more precisely. Building on this intuition, we introduce CountSteer, a training-free method that improves generation of specified object counts by steering the model’s cross-attention hidden states during inference. In our experiments, CountSteer improved object-count accuracy by about 4% without compromising visual quality, demonstrating a simple yet effective step toward more controllable and semantically reliable text-to-image generation.
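Steering hidden states is often done by adding a scaled direction vector. The sketch below uses a generic difference-of-means direction between hidden states from correct-count and incorrect-count generations. This is a common activation-steering recipe assumed here for illustration; the abstract does not say how CountSteer derives its direction, and all names and values are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_direction(correct_states, incorrect_states):
    """Direction pointing from the 'incorrect count' mean toward the 'correct count' mean."""
    mean_correct = mean_vector(correct_states)
    mean_incorrect = mean_vector(incorrect_states)
    return [c - i for c, i in zip(mean_correct, mean_incorrect)]


def steer(hidden, direction, alpha):
    """Shift a hidden state along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


direction = steering_direction(
    correct_states=[[1.0, 2.0], [3.0, 2.0]],    # mean is [2.0, 2.0]
    incorrect_states=[[0.0, 2.0], [2.0, 4.0]],  # mean is [1.0, 3.0]
)
steered = steer([0.0, 0.0], direction, alpha=0.5)
```

The approach needs no gradient updates, which matches the training-free framing: only the hidden state is nudged at inference time.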

[199] GraphPilot: Grounded Scene Graph Conditioning for Language-Based Autonomous Driving

Fabian Schmidt, Markus Enzweiler, Abhinav Valada

Main category: cs.CV

TL;DR: The paper introduces GraphPilot, a model-agnostic method that conditions vision-language driving models on structured traffic scene graphs to improve topology-aware reasoning and relational understanding in autonomous driving.

Details

Motivation: Existing vision-language models for autonomous driving lack explicit supervision for relational dependencies between traffic entities, limiting their ability to understand how agents influence each other from raw sensor data.

Method: The method serializes traffic scene graphs at various abstraction levels and formats, incorporating them into language-based driving models via structured prompt templates during training.

Result: Extensive evaluations on the LangAuto benchmark show large improvements: up to a 15.6% increase in driving score for LMDrive and 17.5% for BEVDriver, with gains persisting even without scene graph input at test time.

Conclusion: Scene graph conditioning enables models to better internalize relational priors, demonstrating that structured relational supervision is highly beneficial for autonomous driving planning tasks.

Abstract: Vision-language models have recently emerged as promising planners for autonomous driving, where success hinges on topology-aware reasoning over spatial structure and dynamic interactions from multimodal input. However, existing models are typically trained without supervision that explicitly encodes these relational dependencies, limiting their ability to infer how agents and other traffic entities influence one another from raw sensor data. In this work, we bridge this gap with a novel model-agnostic method that conditions language-based driving models on structured relational context in the form of traffic scene graphs. We serialize scene graphs at various abstraction levels and formats, and incorporate them into the models via structured prompt templates, enabling a systematic analysis of when and how relational supervision is most beneficial. Extensive evaluations on the public LangAuto benchmark show that scene graph conditioning of state-of-the-art approaches yields large and persistent improvement in driving performance. Notably, we observe up to a 15.6% increase in driving score for LMDrive and 17.5% for BEVDriver, indicating that models can better internalize and ground relational priors through scene graph-conditioned training, even without requiring scene graph input at test-time. Code, fine-tuned models, and our scene graph dataset are publicly available at https://github.com/iis-esslingen/GraphPilot.
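Serializing a scene graph into a prompt can be as simple as flattening relation triples into text and filling a template. The sketch below assumes a hypothetical triple format, arrow notation, and template wording; the paper compares several abstraction levels and formats that are not reproduced here.

```python
# Hypothetical template; the paper's actual prompt wording is not given in the abstract.
PROMPT_TEMPLATE = "Scene graph:\n{graph}\n\nInstruction: {instruction}"


def serialize_scene_graph(triples):
    """Flatten (subject, relation, object) triples into one line per edge."""
    return "\n".join(f"{subj} --{rel}--> {obj}" for subj, rel, obj in triples)


def build_prompt(triples, instruction):
    """Insert the serialized graph and the navigation instruction into the template."""
    return PROMPT_TEMPLATE.format(
        graph=serialize_scene_graph(triples), instruction=instruction
    )


triples = [("ego", "follows", "car_1"), ("car_1", "in_lane", "lane_2")]
prompt = build_prompt(triples, "Turn left at the next intersection.")
```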

[200] Φeat: Physically-Grounded Feature Representation

Giuseppe Vecchio, Adrien Kaiser, Rouffet Romain, Rosalie Martin, Elena Garces, Tamy Boubekeur

Main category: cs.CV

TL;DR: Φeat is a self-supervised visual backbone that learns physically-grounded representations sensitive to material identity by contrasting spatial crops and physical augmentations under varying shapes and lighting conditions.

Details

Motivation: Current self-supervised features entangle high-level semantics with low-level physical factors like geometry and illumination, which hinders their use in tasks requiring explicit physical reasoning.

Method: Uses a pretraining strategy that contrasts spatial crops and physical augmentations of the same material under varying shapes and lighting conditions, without explicit labels.

Result: The learned representations capture physically-grounded structure beyond semantic grouping and provide a strong prior for tasks requiring robust features invariant to external physical factors.

Conclusion: Unsupervised physical feature learning shows promise as a foundation for physics-aware perception in vision and graphics.

Abstract: Foundation models have emerged as effective backbones for many vision tasks. However, current self-supervised features entangle high-level semantics with low-level physical factors, such as geometry and illumination, hindering their use in tasks requiring explicit physical reasoning. In this paper, we introduce $Φ$eat, a novel physically-grounded visual backbone that encourages a representation sensitive to material identity, including reflectance cues and geometric mesostructure. Our key idea is to employ a pretraining strategy that contrasts spatial crops and physical augmentations of the same material under varying shapes and lighting conditions. While similar data have been used in high-end supervised tasks such as intrinsic decomposition or material estimation, we demonstrate that a pure self-supervised training strategy, without explicit labels, already provides a strong prior for tasks requiring robust features invariant to external physical factors. We evaluate the learned representations through feature similarity analysis and material selection, showing that $Φ$eat captures physically-grounded structure beyond semantic grouping. These findings highlight the promise of unsupervised physical feature learning as a foundation for physics-aware perception in vision and graphics.

[201] Coordinative Learning with Ordinal and Relational Priors for Volumetric Medical Image Segmentation

Haoyi Wang

Main category: cs.CV

TL;DR: CORAL is a novel method that captures both local and global anatomical structure in volumetric medical images through contrastive ranking and ordinal objectives, achieving state-of-the-art segmentation performance with limited annotations.

Details

Motivation: Existing methods use hard binary thresholds that discard continuous anatomical similarity information and overlook global directional consistency, resulting in distorted feature spaces that fail to capture the canonical anatomical manifold shared across patients.

Method: CORAL employs a contrastive ranking objective to leverage continuous anatomical similarity and an ordinal objective to enforce global directional consistency, coordinatively learning inter-slice relationships to produce anatomically informed representations.

Result: CORAL achieves state-of-the-art performance on benchmark datasets under limited-annotation settings while learning representations with meaningful anatomical structure.

Conclusion: Learning coordinative ordinal-relational anatomical relationships produces anatomically informed representations that benefit downstream segmentation tasks, demonstrating the effectiveness of capturing both local and global structure in volumetric medical images.

Abstract: Volumetric medical image segmentation presents unique challenges due to the inherent anatomical structure and limited availability of annotations. While recent methods have shown promise by contrasting spatial relationships between slices, they rely on hard binary thresholds to define positive and negative samples, thereby discarding valuable continuous information about anatomical similarity. Moreover, these methods overlook the global directional consistency of anatomical progression, resulting in distorted feature spaces that fail to capture the canonical anatomical manifold shared across patients. To address these limitations, we propose Coordinative Ordinal-Relational Anatomical Learning (CORAL) to capture both local and global structure in volumetric images. First, CORAL employs a contrastive ranking objective to leverage continuous anatomical similarity, ensuring relational feature distances between slices are proportional to their anatomical position differences. In addition, CORAL incorporates an ordinal objective to enforce global directional consistency, aligning the learned feature distribution with the canonical anatomical progression across patients. Learning these inter-slice relationships produces anatomically informed representations that benefit the downstream segmentation task. Through this coordinative learning framework, CORAL achieves state-of-the-art performance on benchmark datasets under limited-annotation settings while learning representations with meaningful anatomical structure. Code is available at https://github.com/haoyiwang25/CORAL.
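As a rough illustration of the order-preserving idea behind the ranking objective in [201], the sketch below penalizes slice triplets whose feature distances disagree with their position differences. It is not CORAL's actual loss: the function name `ranking_hinge_loss`, the triplet formulation, and the `margin` parameter are assumptions made for this example.

```python
import math


def ranking_hinge_loss(features, positions, margin=0.0):
    """Average hinge penalty over triplets (i, j, k) where slice j is
    anatomically closer to slice i than slice k is, yet its feature is
    not closer. Hypothetical stand-in for a contrastive ranking loss."""
    n = len(features)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if len({i, j, k}) < 3:
                    continue
                # Only triplets with a strict position ordering are ranked.
                if abs(positions[i] - positions[j]) < abs(positions[i] - positions[k]):
                    d_near = math.dist(features[i], features[j])
                    d_far = math.dist(features[i], features[k])
                    total += max(0.0, d_near - d_far + margin)
                    count += 1
    return total / count if count else 0.0


# Features that follow the slice order incur no penalty;
# swapping two features introduces a ranking violation.
ordered = ranking_hinge_loss([(0.0,), (1.0,), (2.0,)], [0, 1, 2])
swapped = ranking_hinge_loss([(0.0,), (2.0,), (1.0,)], [0, 1, 2])
```

A real training objective would operate on batched tensors and be differentiable end to end; the triple loop here is only meant to make the ranking condition explicit.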

[202] RTGaze: Real-Time 3D-Aware Gaze Redirection from a Single Image

Hengfei Wang, Zhongqun Zhang, Yihua Cheng, Hyung Jin Chang

Main category: cs.CV

TL;DR: RTGaze is a real-time, high-quality gaze redirection method that uses gaze-controllable facial representations and neural rendering, achieving 800x speed improvement over previous 3D-aware methods.

Details

Motivation: Recent gaze redirection methods struggle with 3D consistency, efficiency, or quality, limiting practical applications.

Method: Learns gaze-controllable facial representations from face images and gaze prompts, decodes via neural rendering, and distills face geometric priors from a pretrained 3D portrait generator.

Result: Achieves state-of-the-art performance in efficiency, redirection accuracy, and image quality across multiple datasets, with real-time processing (~0.06 sec/image), 800x faster than previous 3D-aware methods.

Conclusion: RTGaze enables real-time, 3D-aware gaze redirection with high quality and efficiency, making it suitable for practical applications.

Abstract: Gaze redirection methods aim to generate realistic human face images with controllable eye movement. However, recent methods often struggle with 3D consistency, efficiency, or quality, limiting their practical applications. In this work, we propose RTGaze, a real-time and high-quality gaze redirection method. Our approach learns a gaze-controllable facial representation from face images and gaze prompts, then decodes this representation via neural rendering for gaze redirection. Additionally, we distill face geometric priors from a pretrained 3D portrait generator to enhance generation quality. We evaluate RTGaze both qualitatively and quantitatively, demonstrating state-of-the-art performance in efficiency, redirection accuracy, and image quality across multiple datasets. Our system achieves real-time, 3D-aware gaze redirection with a feedforward network (~0.06 sec/image), making it 800x faster than the previous state-of-the-art 3D-aware methods.

[203] SimuFreeMark: A Noise-Simulation-Free Robust Watermarking Against Image Editing

Yichao Tang, Mingyang Li, Di Miao, Sheng Li, Zhenxing Qian, Xinpeng Zhang

Main category: cs.CV

TL;DR: SimuFreeMark is a simulation-free watermarking framework that embeds watermarks in low-frequency image components using a pre-trained VAE, eliminating the need for noise simulation during training while achieving robust performance against various attacks.

Details

Motivation: Current deep learning-based watermarking methods rely on hand-crafted noise simulation layers, which limit generalization to unforeseen distortions. There's a need for more robust watermarking that can withstand both conventional signal processing and novel semantic editing attacks.

Method: The framework exploits the inherent stability of image low-frequency components, systematically establishing their robustness against attacks. Watermarks are embedded directly into the deep feature space of low-frequency components using a pre-trained variational autoencoder (VAE) to bind watermarks with structurally stable image representations.

Result: Extensive experiments show SimuFreeMark outperforms state-of-the-art methods across a wide range of conventional and semantic attacks while maintaining superior visual quality.

Conclusion: SimuFreeMark successfully eliminates the need for noise simulation during training and provides robust watermarking by leveraging the stability of low-frequency image components, demonstrating superior performance against various attack types.

Abstract: The advancement of artificial intelligence generated content (AIGC) has created a pressing need for robust image watermarking that can withstand both conventional signal processing and novel semantic editing attacks. Current deep learning-based methods rely on training with hand-crafted noise simulation layers, which inherently limit their generalization to unforeseen distortions. In this work, we propose $\textbf{SimuFreeMark}$, a noise-$\underline{\text{simu}}$lation-$\underline{\text{free}}$ water$\underline{\text{mark}}$ing framework that circumvents this limitation by exploiting the inherent stability of image low-frequency components. We first systematically establish that low-frequency components exhibit significant robustness against a wide range of attacks. Building on this foundation, SimuFreeMark embeds watermarks directly into the deep feature space of the low-frequency components, leveraging a pre-trained variational autoencoder (VAE) to bind the watermark with structurally stable image representations. This design completely eliminates the need for noise simulation during training. Extensive experiments demonstrate that SimuFreeMark outperforms state-of-the-art methods across a wide range of conventional and semantic attacks, while maintaining superior visual quality.

[204] 6D Strawberry Pose Estimation: Real-time and Edge AI Solutions Using Purely Synthetic Training Data

Saptarshi Neil Sinha, Julius Kühn, Mika Silvan Goschke, Michael Weinmann

Main category: cs.CV

TL;DR: This paper presents a 6D pose estimation system for strawberries that uses synthetic data from a procedural Blender pipeline and the YOLOX-6D-Pose algorithm, achieving comparable accuracy on both high-end and edge devices.

Details

Motivation: Addressing challenges in automated fruit harvesting due to high costs and labor shortages in advanced economies, particularly focusing on strawberries.

Method: Uses purely synthetic data generated through a procedural Blender pipeline for photorealistic rendering, combined with the YOLOX-6D-Pose algorithm, a single-shot approach that leverages the YOLOX backbone for its speed-accuracy balance and edge inference support.

Result: Models achieve comparable accuracy on both NVIDIA RTX 3090 and Jetson Orin Nano across ADD-S metrics, with RTX 3090 showing superior speed while Jetson Orin Nano excels in resource-constrained environments. Accurately infers poses of ripe and partially ripe strawberries but struggles with unripe specimens.

Conclusion: The methodology is effective for agricultural robotics deployment and can be adapted for other fruits like apples, peaches, and plums, with future improvements needed for detecting unripe strawberries through color variation exploration.

Abstract: Automated and selective harvesting of fruits has become an important area of research, particularly due to challenges such as high costs and a shortage of seasonal labor in advanced economies. This paper focuses on 6D pose estimation of strawberries using purely synthetic data generated through a procedural pipeline for photorealistic rendering. We employ the YOLOX-6D-Pose algorithm, a single-shot approach that leverages the YOLOX backbone, known for its balance between speed and accuracy, and its support for edge inference. To address the lacking availability of training data, we introduce a robust and flexible pipeline for generating synthetic strawberry data from various 3D models via a procedural Blender pipeline, where we focus on enhancing the realism of the synthesized data in comparison to previous work to make it a valuable resource for training pose estimation algorithms. Quantitative evaluations indicate that our models achieve comparable accuracy on both the NVIDIA RTX 3090 and Jetson Orin Nano across several ADD-S metrics, with the RTX 3090 demonstrating superior processing speed. However, the Jetson Orin Nano is particularly suited for resource-constrained environments, making it an excellent choice for deployment in agricultural robotics. Qualitative assessments further confirm the model’s performance, demonstrating its capability to accurately infer the poses of ripe and partially ripe strawberries, while facing challenges in detecting unripe specimens. This suggests opportunities for future improvements, especially in enhancing detection capabilities for unripe strawberries (if desired) by exploring variations in color. Furthermore, the methodology presented could be adapted easily for other fruits such as apples, peaches, and plums, thereby expanding its applicability and impact in the field of agricultural automation.
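The ADD-S metric mentioned in [204] is commonly defined in the pose estimation literature as the mean distance from each model point under the ground-truth pose to its nearest model point under the estimated pose, which makes it tolerant of object symmetries. The sketch below follows that common definition; the function names and the toy point set are assumptions, and the thresholds used in the paper are not given in the summary.

```python
import math


def transform(points, rotation, translation):
    """Apply a 3x3 rotation (nested lists) and a translation to 3D points."""
    return [
        tuple(
            sum(rotation[i][k] * p[k] for k in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


def add_s(points, r_gt, t_gt, r_est, t_est):
    """Mean nearest-neighbor distance between the model under the
    ground-truth pose and the model under the estimated pose."""
    gt = transform(points, r_gt, t_gt)
    est = transform(points, r_est, t_est)
    return sum(min(math.dist(g, e) for e in est) for g in gt) / len(gt)


# Toy model with four-fold symmetry about the z axis.
model = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
quarter_turn_z = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]

# A 90 degree turn maps the symmetric model onto itself, so ADD-S is zero.
symmetric_error = add_s(model, identity, (0, 0, 0), quarter_turn_z, (0, 0, 0))
```

The brute-force nearest-neighbor search is quadratic in the number of points; for dense meshes a spatial index would normally replace it.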

[205] DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding

Tanveer Hannan, Dimitrios Mallios, Parth Pathak, Faegheh Sardari, Thomas Seidl, Gedas Bertasius, Mohsen Fayyaz, Sunando Sengupta

Main category: cs.CV

TL;DR: DocSLM is an efficient small vision-language model for long-document understanding on edge devices, using hierarchical multimodal compression and streaming abstention to reduce memory and latency while maintaining performance.

Details

Motivation: Large Vision-Language Models have strong multimodal reasoning but high memory footprint, making them impractical for resource-constrained edge devices.

Method: Incorporates a Hierarchical Multimodal Compressor for joint encoding of visual, textual, and layout information, and a Streaming Abstention mechanism with entropy-based uncertainty calibration for sequential processing of long documents.

Result: Matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency on multiple long multimodal document benchmarks.

Conclusion: DocSLM delivers reliable multimodal document understanding on lightweight edge devices with significantly reduced resource requirements.

Abstract: Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code is available in the supplementary material.
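One plausible reading of the streaming abstention idea in [205] is sketched below: each document segment yields a candidate answer with a probability distribution, and answers whose Shannon entropy exceeds a threshold are discarded. The function names, the `(answer, probs)` input format, and the choice to keep the lowest-entropy survivor are assumptions for illustration; the paper's calibrator is not described in enough detail here to reproduce.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def stream_answer(segments, threshold):
    """Process segments one at a time and return the most confident
    answer whose entropy is at or below the threshold, or None if
    every segment is rejected (hypothetical abstention rule)."""
    best = None
    for answer, probs in segments:
        h = entropy(probs)
        if h <= threshold and (best is None or h < best[1]):
            best = (answer, h)
    return best[0] if best else None


segments = [
    ("A", [0.5, 0.5]),          # uncertain, rejected at threshold 0.5
    ("B", [0.9, 0.1]),          # confident, kept
    ("C", [0.34, 0.33, 0.33]),  # near-uniform, rejected
]
chosen = stream_answer(segments, threshold=0.5)
```

Because only the current best candidate is kept, memory use stays constant regardless of document length, which is the property a streaming design is after.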

[206] YCB-Ev SD: Synthetic event-vision dataset for 6DoF object pose estimation

Pavel Rojtberg, Julius Kühn

Main category: cs.CV

TL;DR: YCB-Ev SD is a synthetic dataset of event-camera data for 6DoF object pose estimation, featuring 50,000 event sequences generated from PBR scenes using simulated linear camera motion.

Details

Motivation: Event-based vision lacks comprehensive synthetic datasets comparable to those available for frame-based computer vision, creating a gap in resources for 6DoF object pose estimation research.

Method: Generated 50,000 event sequences (34ms each) from Physically Based Rendering scenes of YCB-Video objects following BOP methodology, using simulated linear camera motion for complete scene coverage.

Result: Time-surfaces with linear decay and dual-channel polarity encoding achieved superior pose estimation performance, significantly outperforming exponential decay and single-channel alternatives, with polarity information contributing most to performance gains.

Conclusion: The dataset provides structured event streams and precomputed optimal representations to facilitate immediate research use and reproducible benchmarking in event-based 6DoF object pose estimation.

Abstract: We introduce YCB-Ev SD, a synthetic dataset of event-camera data at standard definition (SD) resolution for 6DoF object pose estimation. While synthetic data has become fundamental in frame-based computer vision, event-based vision lacks comparable comprehensive resources. Addressing this gap, we present 50,000 event sequences of 34 ms duration each, synthesized from Physically Based Rendering (PBR) scenes of YCB-Video objects following the Benchmark for 6D Object Pose (BOP) methodology. Our generation framework employs simulated linear camera motion to ensure complete scene coverage, including background activity. Through systematic evaluation of event representations for CNN-based inference, we demonstrate that time-surfaces with linear decay and dual-channel polarity encoding achieve superior pose estimation performance, outperforming exponential decay and single-channel alternatives by significant margins. Our analysis reveals that polarity information contributes most substantially to performance gains, while linear temporal encoding preserves critical motion information more effectively than exponential decay. The dataset is provided in a structured format with both raw event streams and precomputed optimal representations to facilitate immediate research use and reproducible benchmarking. The dataset is publicly available at https://huggingface.co/datasets/paroj/ycbev_sd.
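The representation favored in [206], a time surface with linear decay and separate polarity channels, can be sketched as follows. The event tuple layout `(x, y, t, polarity)`, the function name, and the use of the 34 ms sequence length as the decay window are assumptions for this example, not details confirmed by the summary.

```python
def time_surface(events, width, height, t_ref, window):
    """Build a 2-channel time surface: channel 0 holds negative-polarity
    events, channel 1 positive ones. Each pixel stores a value that
    decays linearly from 1 (event at t_ref) to 0 (event one window ago)."""
    surface = [[[0.0] * width for _ in range(height)] for _ in range(2)]
    for x, y, t, polarity in events:
        if t > t_ref:
            continue  # ignore events after the reference time
        value = max(0.0, 1.0 - (t_ref - t) / window)
        channel = 1 if polarity > 0 else 0
        # Keeping the maximum retains the most recent event per pixel.
        surface[channel][y][x] = max(surface[channel][y][x], value)
    return surface


# Timestamps in milliseconds; window matches the 34 ms sequence duration.
events = [(0, 0, 0.0, 1), (1, 0, 17.0, -1), (0, 0, 34.0, 1)]
surface = time_surface(events, width=2, height=1, t_ref=34.0, window=34.0)
```

An exponential-decay variant would replace the linear term with `math.exp(-(t_ref - t) / tau)`, and a single-channel variant would merge both polarities into one map; those are the alternatives the summary reports as weaker.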

[207] Free3D: 3D Human Motion Emerges from Single-View 2D Supervision

Sheng Liu, Yuanzhi Liang, Sidan Du

Main category: cs.CV

TL;DR: Free3D is a 3D human motion generation framework that achieves state-of-the-art performance without any 3D motion annotations, using only 2D motion data and novel regularization techniques.

Details

Motivation: Current 3D motion generation models rely on precise 3D supervision, which limits generalization beyond training distributions by encouraging models to fit coordinate patterns rather than learn essential 3D structure and motion semantics.

Method: Proposes Motion-Lifting Residual Quantized VAE (ML-RQ) to map 2D motion sequences into 3D-consistent latent spaces, combined with 3D-free regularization objectives for view consistency, orientation coherence, and physical plausibility.

Result: Free3D generates diverse, temporally coherent, and semantically aligned 3D motions that achieve performance comparable to or surpassing fully 3D-supervised models, despite being trained entirely on 2D data.

Conclusion: Relaxing explicit 3D supervision encourages stronger structural reasoning and generalization, offering a scalable and data-efficient paradigm for 3D motion generation.

Abstract: Recent 3D human motion generation models demonstrate remarkable reconstruction accuracy yet struggle to generalize beyond training distributions. This limitation arises partly from the use of precise 3D supervision, which encourages models to fit fixed coordinate patterns instead of learning the essential 3D structure and motion semantic cues required for robust generalization. To overcome this limitation, we propose Free3D, a framework that synthesizes realistic 3D motions without any 3D motion annotations. Free3D introduces a Motion-Lifting Residual Quantized VAE (ML-RQ) that maps 2D motion sequences into 3D-consistent latent spaces, and a suite of 3D-free regularization objectives enforcing view consistency, orientation coherence, and physical plausibility. Trained entirely on 2D motion data, Free3D generates diverse, temporally coherent, and semantically aligned 3D motions, achieving performance comparable to or even surpassing fully 3D-supervised counterparts. These results suggest that relaxing explicit 3D supervision encourages stronger structural reasoning and generalization, offering a scalable and data-efficient paradigm for 3D motion generation.

[208] Unsupervised Segmentation of Micro-CT Scans of Polyurethane Structures By Combining Hidden-Markov-Random Fields and a U-Net

Julian Grolig, Lars Griem, Michael Selzer, Hans-Ulrich Kauczor, Simon M. F. Triphan, Britta Nestler, Arnd Koeppe

Main category: cs.CV

TL;DR: The paper presents HMRF-UNet, a hybrid method combining Hidden Markov Random Fields (HMRF) with CNN segmentation to achieve high accuracy without ground truth data, with applications in material microstructure analysis.

Details

Motivation: Traditional segmentation methods lack accuracy or speed, while supervised CNNs require large labeled datasets. Unsupervised approaches suffer from slow segmentation and poor accuracy. There's a need for methods that combine unsupervised learning with fast segmentation.

Method: Integration of HMRF theory with CNN segmentation, creating HMRF-UNet that leverages neighborhood concepts and class distributions. Investigates different neighborhood terms and components for unsupervised HMRF loss. Also proposes a pre-training strategy to reduce ground-truth data requirements.

Result: HMRF-UNet achieves high segmentation accuracy without ground truth on a Micro-CT image dataset of Polyurethane foam structures. The pre-training strategy considerably reduces the amount of ground-truth data required to train segmentation models.

Conclusion: The hybrid HMRF-UNet approach successfully combines unsupervised learning with fast segmentation, enabling accurate material microstructure analysis without extensive labeled data, with potential applications in quantitative material property analysis.

Abstract: Extracting digital material representations from images is a necessary prerequisite for a quantitative analysis of material properties. Different segmentation approaches have been extensively studied in the past to achieve this task, but were often lacking accuracy or speed. With the advent of machine learning, supervised convolutional neural networks (CNNs) have achieved state-of-the-art performance for different segmentation tasks. However, these models are often trained in a supervised manner, which requires large labeled datasets. Unsupervised approaches do not require ground-truth data for learning, but suffer from long segmentation times and often worse segmentation accuracy. Hidden Markov Random Fields (HMRF) are an unsupervised segmentation approach that incorporates concepts of neighborhood and class distributions. We present a method that integrates HMRF theory and CNN segmentation, leveraging the advantages of both areas: unsupervised learning and fast segmentation times. We investigate the contribution of different neighborhood terms and components for the unsupervised HMRF loss. We demonstrate that the HMRF-UNet enables high segmentation accuracy without ground truth on a Micro-Computed Tomography ($μ$CT) image dataset of Polyurethane (PU) foam structures. Finally, we propose and demonstrate a pre-training strategy that considerably reduces the required amount of ground-truth data when training a segmentation model.
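To make the "neighborhood and class distributions" in [208] concrete, the sketch below evaluates a textbook HMRF energy: a Gaussian data term per pixel plus a Potts penalty for each pair of 4-connected neighbors with different labels. This is the classical formulation, not the paper's loss; the function name, the hard labels, and the weight `beta` are assumptions, and a network would typically use soft class probabilities instead.

```python
import math


def hmrf_energy(image, labels, means, sigmas, beta):
    """Negative log-likelihood style energy of a labeling on a 2D grid:
    Gaussian class-conditional data term plus a Potts smoothness term
    over the 4-neighborhood (each neighbor pair counted once)."""
    height, width = len(image), len(image[0])
    energy = 0.0
    for r in range(height):
        for c in range(width):
            k = labels[r][c]
            # Data term: how well the pixel intensity fits its class.
            energy += (image[r][c] - means[k]) ** 2 / (2 * sigmas[k] ** 2)
            energy += math.log(sigmas[k])
            # Neighborhood term: penalize label changes to the right and below.
            if c + 1 < width and labels[r][c + 1] != k:
                energy += beta
            if r + 1 < height and labels[r + 1][c] != k:
                energy += beta
    return energy


image = [[0.0, 0.0], [1.0, 1.0]]
means, sigmas = [0.0, 1.0], [1.0, 1.0]
clean = hmrf_energy(image, [[0, 0], [1, 1]], means, sigmas, beta=0.25)
noisy = hmrf_energy(image, [[0, 1], [1, 1]], means, sigmas, beta=0.25)
```

Lower energy corresponds to a labeling that both matches the intensities and is spatially smooth, which is why such a quantity can serve as a training signal without ground-truth masks.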

[209] Disentangling Emotional Bases and Transient Fluctuations: A Low-Rank Sparse Decomposition Approach for Video Affective Analysis

Feng-Qi Cui, Jinyang Huang, Ziyu Jia, Xinyu Li, Xin Yan, Xiaokang Zhou, Meng Wang

Main category: cs.CV

TL;DR: LSEF is a hierarchical framework that disentangles emotional bases and transient fluctuations using low-rank sparse modeling to improve video-based affective computing stability and discrimination.

Details

Motivation: Video-based Affective Computing suffers from model instability and representational degradation due to complex emotional dynamics, lacking mechanisms to separate long-term emotional tones from short-term fluctuations.

Method: Proposes LSEF with three modules: the Stability Encoding Module captures low-rank emotional bases, the Dynamic Decoupling Module isolates sparse transient signals, and the Consistency Integration Module reconstructs multi-scale coherence. The framework is optimized by a Rank Aware Optimization strategy.

Result: Extensive experiments show LSEF significantly enhances robustness and dynamic discrimination across multiple datasets, validating the effectiveness of hierarchical low-rank sparse modeling.

Conclusion: The hierarchical low-rank sparse compositional framework effectively addresses affective dynamics, improving emotion understanding in video-based affective computing.

Abstract: Video-based Affective Computing (VAC), vital for emotion analysis and human-computer interaction, suffers from model instability and representational degradation due to complex emotional dynamics. Since the meaning of different emotional fluctuations may differ under different emotional contexts, the core limitation is the lack of a hierarchical structural mechanism to disentangle distinct affective components, i.e., emotional bases (the long-term emotional tone), and transient fluctuations (the short-term emotional fluctuations). To address this, we propose the Low-Rank Sparse Emotion Understanding Framework (LSEF), a unified model grounded in the Low-Rank Sparse Principle, which theoretically reframes affective dynamics as a hierarchical low-rank sparse compositional process. LSEF employs three plug-and-play modules, i.e., the Stability Encoding Module (SEM) captures low-rank emotional bases; the Dynamic Decoupling Module (DDM) isolates sparse transient signals; and the Consistency Integration Module (CIM) reconstructs multi-scale stability and reactivity coherence. This framework is optimized by a Rank Aware Optimization (RAO) strategy that adaptively balances gradient smoothness and sensitivity. Extensive experiments across multiple datasets confirm that LSEF significantly enhances robustness and dynamic discrimination, which further validates the effectiveness and generality of hierarchical low-rank sparse modeling for understanding affective dynamics.

[210] MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model

Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan

Main category: cs.CV

TL;DR: MicroVQA++ is a high-quality microscopy VQA dataset created through a three-stage pipeline: bootstrapping from expert-validated figure-caption pairs, filtering with HiCQA-Graph for cross-modal consistency, and generating MCQs with human screening.

Details

Motivation: Address the scarcity of large-scale, high-quality training data for scientific reasoning in microscopy, which limits the application of Multimodal Large Language Models in biomedical imaging.

Method: Three-stage pipeline: 1) Bootstrap from expert-validated figure-caption pairs, 2) Apply HiCQA-Graph (heterogeneous graph fusing NLI, CLIP, and agent signals) to filter inconsistent samples, 3) Use MLLM agent to generate MCQs with human screening.

Result: Created a quality-controlled dataset with a large training split and a human-checked test split whose Bloom’s level hard-sample distribution exceeds that of the MicroVQA benchmark. Enables 4B-scale MLLMs to reach performance competitive with GPT-5 and to achieve state-of-the-art results among open-source MLLMs.

Conclusion: Careful data construction through expert literature coupling, graph-based filtering, and human refinement enables smaller MLLMs to achieve competitive microscopy reasoning performance, demonstrating the importance of high-quality dataset curation.

Abstract: Multimodal Large Language Models are increasingly applied to biomedical imaging, yet scientific reasoning for microscopy remains limited by the scarcity of large-scale, high-quality training data. We introduce MicroVQA++, a three-stage, large-scale and high-quality microscopy VQA corpus derived from the BIOMEDICA archive. Stage one bootstraps supervision from expert-validated figure-caption pairs sourced from peer-reviewed articles. Stage two applies HiCQA-Graph, a novel heterogeneous graph over images, captions, and QAs that fuses NLI-based textual entailment, CLIP-based vision-language alignment, and agent signals to identify and filter inconsistent samples. Stage three uses a MultiModal Large Language Model (MLLM) agent to generate multiple-choice questions (MCQ) followed by human screening. The resulting release comprises a large training split and a human-checked test split whose Bloom’s level hard-sample distribution exceeds the MicroVQA benchmark. Our work delivers (i) a quality-controlled dataset that couples expert literature with graph-based filtering and human refinement; (ii) HiCQA-Graph, the first graph that jointly models (image, caption, QA) for cross-modal consistency filtering; (iii) evidence that careful data construction enables 4B-scale MLLMs to reach competitive microscopy reasoning performance (e.g., GPT-5) and achieve state-of-the-art performance among open-source MLLMs. Code and dataset will be released after the review process concludes.

Jiaxi Huang, Dongxu Wu, Hanwei Zhu, Lingyu Zhu, Jun Xing, Xu Wang, Baoliang Chen

Main category: cs.CV

TL;DR: Q-Doc is a three-tiered framework for evaluating Multi-modal Large Language Models’ Document Image Quality Assessment capabilities at coarse, middle, and fine granularity levels, revealing limitations but showing Chain-of-Thought prompting improves performance.

Details

Motivation: The potential of Multi-modal Large Language Models for Document Image Quality Assessment remains underexplored despite their advancement beyond high-level vision tasks.

Method: Three-tiered evaluation framework: coarse level (quality scoring), middle level (distortion-type identification), and fine level (distortion-severity assessment) with Chain-of-Thought prompting.

Result: MLLMs possess nascent DIQA abilities but exhibit critical limitations including inconsistent scoring, distortion misidentification, and severity misjudgment. Chain-of-Thought prompting substantially enhances performance across all levels.

Conclusion: The work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement through improved prompting strategies.

Abstract: The rapid advancement of Multi-modal Large Language Models (MLLMs) has expanded their capabilities beyond high-level vision tasks. Nevertheless, their potential for Document Image Quality Assessment (DIQA) remains underexplored. To bridge this gap, we propose Q-Doc, a three-tiered evaluation framework for systematically probing DIQA capabilities of MLLMs at coarse, middle, and fine granularity levels. a) At the coarse level, we instruct MLLMs to assign quality scores to document images and analyze their correlation with Quality Annotations. b) At the middle level, we design distortion-type identification tasks, including single-choice and multi-choice tests for multi-distortion scenarios. c) At the fine level, we introduce distortion-severity assessment where MLLMs classify distortion intensity against human-annotated references. Our evaluation demonstrates that while MLLMs possess nascent DIQA abilities, they exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment. Significantly, we show that Chain-of-Thought (CoT) prompting substantially enhances performance across all levels. Our work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement. The benchmark and code are publicly available at: https://github.com/cydxf/Q-Doc.

[212] BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning

Lan Li, Tao Hu, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan

Main category: cs.CV

TL;DR: BOFA is a novel framework for Class-Incremental Learning that adapts CLIP’s cross-modal bridge-layer without adding parameters, using orthogonal low-rank fusion to prevent forgetting and cross-modal hybrid prototypes for enhanced classification.

Details

Motivation: Address two key challenges in applying CLIP to CIL: avoiding additional learnable modules that increase complexity and forgetting, and better integrating visual and textual modalities.

Method: Confines adaptation to CLIP’s existing cross-modal bridge-layer using Orthogonal Low-Rank Fusion to constrain updates to a low-rank safe subspace orthogonal to past tasks, and employs cross-modal hybrid prototypes.

Result: Achieves superior accuracy and efficiency compared to existing methods on standard benchmarks without data replay.

Conclusion: BOFA provides an effective parameter-free adaptation approach for CLIP in CIL that prevents forgetting while leveraging multi-modal representations.

Abstract: Class-Incremental Learning (CIL) aims to continually learn new categories without forgetting previously acquired knowledge. Vision-language models such as CLIP offer strong transferable representations via multi-modal supervision, making them promising for CIL. However, applying CLIP to CIL poses two major challenges: (1) adapting to downstream tasks often requires additional learnable modules, increasing model complexity and susceptibility to forgetting; and (2) while multi-modal representations offer complementary strengths, existing methods have yet to fully realize their potential in effectively integrating visual and textual modalities. To address these issues, we propose BOFA (Bridge-layer Orthogonal Fusion for Adaptation), a novel framework for CIL. BOFA confines all model adaptation exclusively to CLIP’s existing cross-modal bridge-layer, thereby adding no extra parameters or inference cost. To prevent forgetting within this layer, it leverages Orthogonal Low-Rank Fusion, a mechanism that constrains parameter updates to a low-rank "safe subspace" mathematically constructed to be orthogonal to past task features. This ensures stable knowledge accumulation without data replay. Furthermore, BOFA employs a cross-modal hybrid prototype that synergizes stable textual prototypes with visual counterparts derived from our stably adapted bridge-layer, enhancing classification performance. Extensive experiments on standard benchmarks show that BOFA achieves superior accuracy and efficiency compared to existing methods.
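The orthogonality constraint described in [212] can be illustrated with a plain projection: if every row of a weight update is made orthogonal to the span of past-task features, the updated layer produces the same outputs on those features as before. The sketch below shows only this projection step, using Gram-Schmidt; it omits the low-rank fusion entirely, and all function names are hypothetical rather than taken from BOFA.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt orthonormalization; near-dependent vectors are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            coeff = dot(w, u)
            w = [wi - coeff * ui for wi, ui in zip(w, u)]
        norm = dot(w, w) ** 0.5
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis


def project_update(delta_w, past_features):
    """Remove from each row of delta_w its component in the span of
    past_features, so that delta_w applied to any past feature is zero."""
    basis = orthonormal_basis(past_features)
    projected = []
    for row in delta_w:
        r = list(row)
        for u in basis:
            coeff = dot(r, u)
            r = [ri - coeff * ui for ri, ui in zip(r, u)]
        projected.append(r)
    return projected


past_features = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
raw_update = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
safe_update = project_update(raw_update, past_features)
```

In this toy case only the third coordinate survives, because the first two span the past-feature subspace. Practical methods usually estimate that subspace from feature statistics (for example via SVD) rather than from raw feature vectors.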

[213] Shrinking the Teacher: An Adaptive Teaching Paradigm for Asymmetric EEG-Vision Alignment

Lukun Wu, Jie Li, Ziqi Ren, Kaifan Zhang, Xinbo Gao

Main category: cs.CV

TL;DR: Proposes adaptive teaching paradigm for asymmetric EEG-vision alignment, addressing fidelity and semantic gaps through teacher modality adaptation, achieving 60.2% accuracy on brain-to-image retrieval.

Details

Motivation: Address fundamental asymmetry between visual and EEG modalities characterized by fidelity gap (EEG noise vs vision clarity) and semantic gap (EEG shallow representation vs vision rich semantics), overcoming limitations of symmetric alignment approaches.

Method: Adaptive teaching paradigm where vision (teacher) dynamically shrinks and adjusts knowledge structure to match EEG (student) capacity, implemented via ShrinkAdapter module with residual-free design and bottleneck structure.

Result: Achieves 60.2% top-1 accuracy on zero-shot brain-to-image retrieval, surpassing previous state-of-the-art by 9.8% margin.

Conclusion: Teacher modality must shrink and adapt to bridge vision-brain gap, introducing new perspective for asymmetric alignment in cross-modal learning.

Abstract: Decoding visual features from EEG signals is a central challenge in neuroscience, with cross-modal alignment as the dominant approach. We argue that the relationship between visual and brain modalities is fundamentally asymmetric, characterized by two critical gaps: a Fidelity Gap (stemming from EEG’s inherent noise and signal degradation, vs. vision’s high-fidelity features) and a Semantic Gap (arising from EEG’s shallow conceptual representation, vs. vision’s rich semantic depth). Previous methods often overlook this asymmetry, forcing alignment between the two modalities as if they were equal partners and thereby leading to poor generalization. To address this, we propose the adaptive teaching paradigm. This paradigm empowers the “teacher” modality (vision) to dynamically shrink and adjust its knowledge structure under task guidance, tailoring its semantically dense features to match the “student” modality (EEG)’s capacity. We implement this paradigm with the ShrinkAdapter, a simple yet effective module featuring a residual-free design and a bottleneck structure. Through extensive experiments, we validate the underlying rationale and effectiveness of our paradigm. Our method achieves a top-1 accuracy of 60.2% on the zero-shot brain-to-image retrieval task, surpassing previous state-of-the-art methods by a margin of 9.8%. Our work introduces a new perspective for asymmetric alignment: the teacher must shrink and adapt to bridge the vision-brain gap.

[214] VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation

Maximilian Rokuss, Moritz Langenberg, Yannick Kirchhoff, Fabian Isensee, Benjamin Hamm, Constantin Ulrich, Sebastian Regnery, Lukas Bauer, Efthimios Katsigiannopulos, Tobias Norajitra, Klaus Maier-Hein

Main category: cs.CV

TL;DR: VoxTell is a vision-language model for text-prompted 3D medical image segmentation that maps free-form text descriptions to volumetric masks, achieving state-of-the-art zero-shot performance across CT, MRI, and PET modalities.

Details

Motivation: To enable flexible medical image segmentation using natural language descriptions rather than predefined classes, allowing clinicians to specify anatomical structures and pathologies using their own terminology.

Method: Multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales, trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes.

Result: Achieves state-of-the-art zero-shot performance across modalities on unseen datasets, demonstrates strong cross-modality transfer, robustness to linguistic variations, and accurate instance-specific segmentation from real-world clinical text.

Conclusion: VoxTell provides an effective framework for text-prompted volumetric medical image segmentation that generalizes well to unseen concepts and clinical language, offering practical utility for medical imaging applications.

Abstract: We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell

[215] Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs

Francisco Nogueira, Alexandre Bernardino, Bruno Martins

Main category: cs.CV

TL;DR: This paper introduces a multilingual dataset and model for Referring Expression Comprehension (REC), addressing the English-centric bias in current research by expanding 12 English benchmarks to 10 languages and proposing an attention-anchored neural architecture.

Details

Motivation: To address the English-centric nature of REC research and meet increasing global deployment demands by creating a unified multilingual dataset and model that can localize objects in images based on natural language descriptions across multiple languages.

Method: Constructed a multilingual dataset by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. Introduced an attention-anchored neural architecture using multilingual SigLIP2 encoders that generates coarse spatial anchors from attention distributions and refines them through learned residuals.

Result: Created a dataset with approximately 8 million multilingual referring expressions across 177,620 images and 336,882 annotated objects. Achieved 86.9% accuracy at IoU@50 on RefCOCO aggregate multilingual evaluation (vs 91.3% English-only), demonstrating consistent capabilities across 10 languages.

Conclusion: The work establishes the practical feasibility of multilingual visual grounding systems, showing competitive performance across languages and providing resources for future multilingual REC research.

Abstract: Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions. Research on the area remains predominantly English-centric, despite increasing global deployment demands. This work addresses multilingual REC through two main contributions. First, we construct a unified multilingual dataset spanning 10 languages, by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects. Second, we introduce an attention-anchored neural architecture that uses multilingual SigLIP2 encoders. Our attention-based approach generates coarse spatial anchors from attention distributions, which are subsequently refined through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks, e.g. achieving 86.9% accuracy at IoU@50 on RefCOCO aggregate multilingual evaluation, compared to an English-only result of 91.3%. Multilingual evaluation shows consistent capabilities across languages, establishing the practical feasibility of multilingual visual grounding systems. The dataset and model are available at https://multilingual.franreno.com.
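
For readers unfamiliar with the metric quoted above, accuracy at IoU@50 counts a predicted box as correct when its intersection-over-union with the ground-truth box reaches 0.5. A minimal sketch follows; the `(x1, y1, x2, y2)` box format, the function names, and the use of `>=` rather than `>` at the threshold are assumptions, not details taken from the paper.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the matching target meets the threshold."""
    hits = sum(1 for p, t in zip(preds, targets) if box_iou(p, t) >= threshold)
    return hits / len(targets)

# Toy example (assumed data): one exact match and one poorly overlapping box.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 10), (5, 5, 15, 15)]
score = accuracy_at_iou(preds, targets)
```

The second pair overlaps in a 5x5 region out of a union of 175, giving an IoU of about 0.14, so only the first prediction counts as a hit.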

[216] WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, Weijia Wu, Hanwang Zhang, Tat-Seng Chua

Main category: cs.CV

TL;DR: WEAVE is the first suite for in-context interleaved cross-modality comprehension and generation, consisting of WEAVE-100k dataset and WEAVEBench benchmark to address the gap in multi-turn, context-dependent image creation and editing.

Details

Motivation: Existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing.

Method: WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM judger evaluation framework.

Result: Training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities, and facilitates UMMs to develop emergent visual-memory capabilities.

Conclusion: WEAVE provides a foundation for studying in-context interleaved comprehension and generation, while evaluations on WEAVEBench expose persistent limitations of current approaches in multi-turn, context-aware image generation and editing.

Abstract: Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM judger evaluation framework based on both the reference image and the combination of the original image with editing instructions that assesses models’ abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it facilitates UMMs to develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for the multi-modal community.

[217] Hi-DREAM: Brain Inspired Hierarchical Diffusion for fMRI Reconstruction via ROI Encoder and visuAl Mapping

Guowei Zhang, Yun Zhao, Moein Khajehnejad, Adeel Razi, Levin Kuhlmann

Main category: cs.CV

TL;DR: Hi-DREAM is a brain-inspired diffusion framework that explicitly models cortical hierarchy for fMRI-to-image reconstruction, achieving state-of-the-art semantic performance while maintaining competitive low-level fidelity.

Details

Motivation: Current diffusion-based decoders directly condition on fMRI features without analyzing how visual information is organized across the cortex, overlooking the brain's hierarchical processing and blurring the roles of different visual areas.

Method: Proposes Hi-DREAM with ROI adapter that groups fMRI into early/mid/late streams and converts them into a multi-scale cortical pyramid aligned with U-Net depth, using depth-matched ControlNet to inject scale-specific hints during denoising.

Result: Achieves state-of-the-art performance on high-level semantic metrics while maintaining competitive low-level fidelity on the Natural Scenes Dataset (NSD).

Conclusion: Structuring conditioning by cortical hierarchy is a powerful alternative to purely data-driven embeddings and provides a useful lens for studying the visual cortex.

Abstract: Mapping human brain activity to natural images offers a new window into vision and cognition, yet current diffusion-based decoders face a core difficulty: most condition directly on fMRI features without analyzing how visual information is organized across the cortex. This overlooks the brain’s hierarchical processing and blurs the roles of early, middle, and late visual areas. We propose Hi-DREAM, a brain-inspired conditional diffusion framework that makes the cortical organization explicit. A region-of-interest (ROI) adapter groups fMRI into early/mid/late streams and converts them into a multi-scale cortical pyramid aligned with the U-Net depth (shallow scales preserve layout and edges; deeper scales emphasize objects and semantics). A lightweight, depth-matched ControlNet injects these scale-specific hints during denoising. The result is an efficient and interpretable decoder in which each signal plays a brain-like role, allowing the model not only to reconstruct images but also to illuminate functional contributions of different visual areas. Experiments on the Natural Scenes Dataset (NSD) show that Hi-DREAM attains state-of-the-art performance on high-level semantic metrics while maintaining competitive low-level fidelity. These findings suggest that structuring conditioning by cortical hierarchy is a powerful alternative to purely data-driven embeddings and provides a useful lens for studying the visual cortex.

[218] VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models

Mingjie Xu, Jinpeng Chen, Yuzhi Zhao, Jason Chun Lok Li, Yue Qiu, Zekang Du, Mengyang Wu, Pingping Zhang, Kun Li, Hongzheng Yang, Wenao Ma, Jiaheng Wei, Qinbin Li, Kangcheng Liu, Wenqiang Lei

Main category: cs.CV

TL;DR: VP-Bench is a new benchmark that systematically evaluates multimodal large language models’ ability to perceive and utilize visual prompts (like bounding boxes) for solving vision-language tasks.

Details

Motivation: Existing MLLMs lack systematic evaluation for interpreting visual prompts (VPs) - intuitive human prompting methods like bounding boxes. It's unclear if current models can effectively recognize and use VPs for problem-solving.

Method: Two-stage evaluation framework: Stage 1 tests VP perception in natural scenes using 30k visualized prompts across 8 shapes and 355 attribute combinations. Stage 2 measures VP impact on downstream tasks in real-world problem-solving scenarios.

Result: Evaluated 28 MLLMs (including GPT-4o, InternVL3, Qwen2.5-VL) and provided comprehensive analysis of factors affecting VP understanding, such as VP attribute variations, question arrangement, and model scale.

Conclusion: VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions using visual prompts.

Abstract: Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use “visual prompts” (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs’ capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models’ ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.

[219] CVChess: A Deep Learning Framework for Converting Chessboard Images to Forsyth-Edwards Notation

Luthira Abeykoon, Ved Patel, Gawthaman Senthilvelan, Darshan Kasundra

Main category: cs.CV

TL;DR: CVChess is a deep learning framework that converts smartphone images of physical chessboards into FEN notation using CNN with residual layers, enabling digital chess assistance for physical games.

Details

Motivation: To bridge the gap between analog and digital chess experiences by providing real-time assistance for physical chess games, similar to what's available in online platforms.

Method: Uses CNN with residual layers for piece recognition, with preprocessing steps including Hough Line Transform for edge detection, projective transform for top-down alignment, segmentation into 64 squares, and classification into 13 classes (6 white pieces, 6 black pieces, empty square).

Result: Trained and evaluated on Chess Recognition Dataset (ChessReD) containing 10,800 annotated smartphone images captured under diverse lighting conditions and angles; the abstract reports no quantitative accuracy figures.

Conclusion: The system successfully converts physical chessboard images to FEN notation, enabling integration with online chess engines to provide optimal move suggestions for physical games.

Abstract: Chess has experienced a large increase in viewership since the pandemic, driven largely by the accessibility of online learning platforms. However, no equivalent assistance exists for physical chess games, creating a divide between analog and digital chess experiences. This paper presents CVChess, a deep learning framework for converting chessboard images to Forsyth-Edwards Notation (FEN), which is later input into online chess engines to provide the player with the best next move. Our approach employs a convolutional neural network (CNN) with residual layers to perform piece recognition from smartphone camera images. The system processes RGB images of a physical chess board through a multistep process: image preprocessing using the Hough Line Transform for edge detection, projective transform to achieve a top-down board alignment, segmentation into 64 individual squares, and piece classification into 13 classes (6 unique white pieces, 6 unique black pieces and an empty square) using the residual CNN. Residual connections help retain low-level visual features while enabling deeper feature extraction, improving accuracy and stability during training. We train and evaluate our model using the Chess Recognition Dataset (ChessReD), containing 10,800 annotated smartphone images captured under diverse lighting conditions and angles. The resulting classifications are encoded as an FEN string, which can be fed into a chess engine to generate the optimal move.
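
The final encoding step described above, turning 64 per-square classifications into FEN, can be sketched as follows. This is an illustration rather than the authors' code: it assumes the classifier output has already been mapped to FEN piece letters (uppercase for white, lowercase for black, `None` for an empty square) and that rows are ordered from rank 8 down to rank 1. It produces only the piece-placement field; the remaining FEN fields (side to move, castling rights, en passant square, move counters) cannot be read off a single image and would have to be supplied separately.

```python
def board_to_fen_placement(board):
    """Encode rows of piece letters / None as the FEN piece-placement field."""
    ranks = []
    for row in board:
        out, empty = "", 0
        for square in row:
            if square is None:
                empty += 1          # count consecutive empty squares
            else:
                if empty:
                    out += str(empty)
                    empty = 0
                out += square
        if empty:
            out += str(empty)       # flush trailing empty squares
        ranks.append(out)
    return "/".join(ranks)

# Standard starting position, listed from rank 8 (top) to rank 1 (bottom).
START = [
    list("rnbqkbnr"),
    list("pppppppp"),
    [None] * 8,
    [None] * 8,
    [None] * 8,
    [None] * 8,
    list("PPPPPPPP"),
    list("RNBQKBNR"),
]
placement = board_to_fen_placement(START)
```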

[220] Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification

Qinghao Gao, Jianhai Qu, Yunsong Li, Weiqiang Dong

Main category: cs.CV

TL;DR: MaMOL is a parameter-efficient framework for robust multimodal classification that handles missing modalities through a dual-routing mechanism and reformulates modality missing as multi-task learning.

Details

Motivation: Existing multimodal classification methods struggle with missing modalities in remote sensing due to environmental interference and sensor failures, and current two-stage adaptation methods are computationally expensive and assume complete data during training.

Method: Proposes Missing-aware Mixture-of-Loras (MaMOL) with dual-routing: task-oriented dynamic router for adaptive expert activation per missing pattern, and modality-specific-shared static router for stable cross-modal knowledge sharing using lightweight expert updates.

Result: Experiments show superior robustness and generalization under varying missing rates on remote sensing benchmarks with minimal computational overhead, and transfer experiments validate scalability and cross-domain applicability on natural image datasets.

Conclusion: MaMOL provides a general and efficient solution for incomplete multimodal learning that outperforms prior methods while maintaining parameter efficiency and computational efficiency.

Abstract: Multimodal classification in remote sensing often suffers from missing modalities caused by environmental interference, sensor failures, or atmospheric effects, which severely degrade classification performance. Existing two-stage adaptation methods are computationally expensive and assume complete multimodal data during training, limiting their generalization to real-world incompleteness. To overcome these issues, we propose a Missing-aware Mixture-of-Loras (MaMOL) framework that reformulates modality missing as a multi-task learning problem. MaMOL introduces a dual-routing mechanism: a task-oriented dynamic router that adaptively activates experts for different missing patterns, and a modality-specific-shared static router that maintains stable cross-modal knowledge sharing. Unlike prior methods that train separate networks for each missing configuration, MaMOL achieves parameter-efficient adaptation via lightweight expert updates and shared expert reuse. Experiments on multiple remote sensing benchmarks demonstrate superior robustness and generalization under varying missing rates, with minimal computational overhead. Moreover, transfer experiments on natural image datasets validate its scalability and cross-domain applicability, highlighting MaMOL as a general and efficient solution for incomplete multimodal learning.

[221] Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery

Yijie Kang, Xinliang Wang, Zhenyu Wu, Yifeng Shi, Hailong Zhu

Main category: cs.CV

TL;DR: Sat2RealCity is a framework for generating realistic 3D cities from satellite imagery using building-level generation, spatial priors, and appearance control to overcome limitations of existing methods that require expensive 3D assets and produce unrealistic results.

Details

Motivation: Existing 3D urban generation methods require costly large-scale 3D city assets and rely on semantic/height maps that produce unrealistic results disconnected from real-world appearance, limiting realism and generalizability.

Method: Uses building-level generation with OSM-based spatial priors for geometric generation, appearance-guided modeling for realism and style control, and MLLM-powered semantic-guided pipeline to bridge semantic interpretation with geometric reconstruction.

Result: Extensive experiments show Sat2RealCity significantly outperforms existing baselines in structural consistency and appearance realism, establishing strong foundation for real-world aligned 3D urban content creation.

Conclusion: The proposed framework successfully addresses key limitations in 3D urban generation by leveraging satellite imagery, building-level generation, and appearance control to create realistic, real-world aligned cities without requiring expensive 3D assets.

Abstract: Recent advances in generative modeling have substantially enhanced 3D urban generation, enabling applications in digital twins, virtual cities, and large-scale simulations. However, existing methods face two key challenges: (1) the need for large-scale 3D city assets for supervised training, which are difficult and costly to obtain, and (2) reliance on semantic or height maps, which are used exclusively for generating buildings in virtual worlds and lack connection to real-world appearance, limiting the realism and generalizability of generated cities. To address these limitations, we propose Sat2RealCity, a geometry-aware and appearance-controllable framework for 3D urban generation from real-world satellite imagery. Unlike previous city-level generation methods, Sat2RealCity builds generation upon individual building entities, enabling the use of rich priors and pretrained knowledge from 3D object generation while substantially reducing dependence on large-scale 3D city assets. Specifically, (1) we introduce the OSM-based spatial priors strategy to achieve interpretable geometric generation from spatial topology to building instances; (2) we design an appearance-guided controllable modeling mechanism for fine-grained appearance realism and style control; and (3) we construct an MLLM-powered semantic-guided generation pipeline, bridging semantic interpretation and geometric reconstruction. Extensive quantitative and qualitative experiments demonstrate that Sat2RealCity significantly surpasses existing baselines in structural consistency and appearance realism, establishing a strong foundation for real-world aligned 3D urban content creation. The code will be released soon.

[222] Multimodal Posterior Sampling-based Uncertainty in PD-L1 Segmentation from H&E Images

Roman Kinakh, Gonzalo R. Ríos-Muñoz, Arrate Muñoz-Barrutia

Main category: cs.CV

TL;DR: nnUNet-B: Bayesian segmentation framework that predicts PD-L1 expression from H&E histology images using Multimodal Posterior Sampling, achieving competitive performance with uncertainty estimation.

Details

Motivation: Current PD-L1 assessment methods using immunohistochemistry are resource-intensive, creating need for more efficient alternatives.

Method: Built on nnUNet-v2 with Bayesian approach using Multimodal Posterior Sampling, sampling diverse model checkpoints during cyclic training to approximate posterior distribution.

Result: Achieved mean Dice Score of 0.805 and mean IoU of 0.709 on lung squamous cell carcinoma dataset, with uncertainty estimates correlating with segmentation error.

Conclusion: Uncertainty-aware H&E-based PD-L1 prediction is promising for scalable, interpretable biomarker assessment in clinical workflows.

Abstract: Accurate assessment of PD-L1 expression is critical for guiding immunotherapy, yet current immunohistochemistry (IHC) based methods are resource-intensive. We present nnUNet-B: a Bayesian segmentation framework that infers PD-L1 expression directly from H&E-stained histology images using Multimodal Posterior Sampling (MPS). Built upon nnUNet-v2, our method samples diverse model checkpoints during cyclic training to approximate the posterior, enabling both accurate segmentation and epistemic uncertainty estimation via entropy and standard deviation. Evaluated on a dataset of lung squamous cell carcinoma, our approach achieves competitive performance against established baselines with mean Dice Score and mean IoU of 0.805 and 0.709, respectively, while providing pixel-wise uncertainty maps. Uncertainty estimates show strong correlation with segmentation error, though calibration remains imperfect. These results suggest that uncertainty-aware H&E-based PD-L1 prediction is a promising step toward scalable, interpretable biomarker assessment in clinical workflows.
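
A rough sketch of how pixel-wise uncertainty and the two reported overlap metrics could be computed from several sampled checkpoints is given below. The exact formulas used by nnUNet-B are not stated in the abstract, so the choices here are assumptions: uncertainty is taken as the standard deviation of the per-checkpoint foreground probabilities and as the binary entropy of their mean, and all function names and toy values are hypothetical.

```python
import math

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli distribution with foreground probability p."""
    p = min(max(p, eps), 1 - eps)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def pixel_uncertainty(samples):
    """samples: one flat probability map per checkpoint, all of equal length."""
    count = len(samples)
    means, stds, entropies = [], [], []
    for i in range(len(samples[0])):
        values = [prob_map[i] for prob_map in samples]
        mu = sum(values) / count
        var = sum((v - mu) ** 2 for v in values) / count
        means.append(mu)
        stds.append(math.sqrt(var))
        entropies.append(binary_entropy(mu))
    return means, stds, entropies

def dice_and_iou(pred, target):
    """Dice score and IoU for two flat binary masks."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    denom = tp + fp + fn
    if denom == 0:          # both masks empty: treat as perfect agreement
        return 1.0, 1.0
    return 2 * tp / (2 * tp + fp + fn), tp / denom

# Toy data (assumed): three checkpoints, three pixels.
samples = [[0.9, 0.5, 0.1], [0.9, 0.1, 0.1], [0.9, 0.9, 0.1]]
means, stds, entropies = pixel_uncertainty(samples)
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

In the toy data the checkpoints agree on the first and third pixels and disagree on the second, which therefore gets the largest standard deviation and the maximum entropy.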

[223] OpenUS: A Fully Open-Source Foundation Model for Ultrasound Image Analysis via Self-Adaptive Masked Contrastive Learning

Xiaoyu Zheng, Xu Chen, Awais Rauf, Qifan Fu, Benedetta Monosi, Felice Rivellese, Myles J. Lewis, Shaogang Gong, Gregory Slabaugh

Main category: cs.CV

TL;DR: OpenUS is the first reproducible, open-source ultrasound foundation model that uses a vision Mamba backbone with self-adaptive masking and dynamic learning to address challenges in ultrasound AI development.

Details

Motivation: Ultrasound imaging faces challenges like operator dependency, anatomical variations, and limited annotations, which hinder the development of generalizable AI models for medical ultrasound applications.

Method: Uses vision Mamba backbone for local and global dependencies, introduces self-adaptive masking framework combining contrastive learning with masked image modeling, and applies dynamic learning schedule. Built on largest public ultrasound dataset with 308K+ images from 42 datasets.

Result: Developed OpenUS foundation model that can be adapted to downstream tasks through label-efficient fine-tuning, with code publicly available.

Conclusion: OpenUS provides a reproducible foundation model for ultrasound AI that addresses key challenges in the field and enables more efficient development of ultrasound-based diagnostic tools.

Abstract: Ultrasound (US) is one of the most widely used medical imaging modalities, thanks to its low cost, portability, real-time feedback, and absence of ionizing radiation. However, US image interpretation remains highly operator-dependent and varies significantly across anatomical regions, acquisition protocols, and device types. These variations, along with unique challenges such as speckle, low contrast, and limited standardized annotations, hinder the development of generalizable, label-efficient ultrasound AI models. In this paper, we propose OpenUS, the first reproducible, open-source ultrasound foundation model built on a large collection of public data. OpenUS employs a vision Mamba backbone, capturing both local and global long-range dependencies across the image. To extract rich features during pre-training, we introduce a novel self-adaptive masking framework that combines contrastive learning with masked image modeling. This strategy integrates the teacher’s attention map with student reconstruction loss, adaptively refining clinically-relevant masking to enhance pre-training effectiveness. OpenUS also applies a dynamic learning schedule to progressively adjust the difficulty of the pre-training process. To develop the foundation model, we compile the largest to-date public ultrasound dataset comprising over 308K images from 42 publicly available datasets, covering diverse anatomical regions, institutions, imaging devices, and disease types. Our pre-trained OpenUS model can be easily adapted to specific downstream tasks by serving as a backbone for label-efficient fine-tuning. Code is available at https://github.com/XZheng0427/OpenUS.

[224] Bridging Hidden States in Vision-Language Models

Benjamin Fein-Ashley, Jacob Fein-Ashley

Main category: cs.CV

TL;DR: BRIDGE proposes a lightweight fusion module using bidirectional cross-attention layers near the top of vision and text encoders to align modality-specific hidden states, achieving strong performance on retrieval, VQA, and reasoning tasks while maintaining bi-encoder efficiency.

Details

Motivation: Existing VLMs either fuse modalities too early (mixing tokens/features) or too late (comparing pooled embeddings), often tying fusion to autoregressive decoders. The hidden states of both modalities already carry rich structure, so directly aligning these states is a natural way to match what vision and text "think".

Method: A few cross-only, bidirectional attention layers placed near the top of both encoders. Each layer projects vision and text encoder hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back with simple stabilizers to improve alignment. Encoders remain non-causal and strong for understanding.

Result: BRIDGE outperforms comparable VLMs across standard retrieval, VQA, and visual reasoning benchmarks while preserving the bi-encoder efficiency of contrastive models.

Conclusion: The proposed lightweight fusion module effectively aligns modality-specific hidden states without compromising encoder understanding capabilities, enabling strong multimodal performance with efficient bi-encoder architecture.

Abstract: Vision-Language Models (VLMs) are a new family of models that align image content with natural language. Existing approaches typically fuse either (a) early: by mixing tokens/features inside the encoders, or (b) late: by comparing pooled embeddings. Many methods also tie fusion to an autoregressive decoder. However, the hidden states of both modalities already carry rich, modality-specific structure (spatial layout in vision; syntax and semantics in text), so directly aligning these states is a natural way to match what the two modalities “think”. We propose a lightweight fusion module: a few cross-only, bidirectional attention layers placed near the top of both encoders. Each layer projects the vision and text encoder hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back, with simple stabilizers to improve alignment. The encoders remain non-causal and strong for understanding, while generation stays cleanly decoupled via an optional decoder. Across standard retrieval, VQA, and visual reasoning benchmarks, BRIDGE outperforms comparable VLMs while preserving the bi-encoder efficiency of contrastive models. We make our code publicly available at https://github.com/jfeinashley/BRIDGE.
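The cross-only, bidirectional layer with gated residual updates described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes both modalities have already been projected into a shared dimension, it uses fixed scalar gates (`gate_v`, `gate_t`, hypothetical names) instead of learned ones, and it omits the learned projections and the "simple stabilizers" mentioned in the abstract.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Each query vector attends only over the other modality's vectors."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context))
                    for d in range(dim)])
    return out

def bridge_layer(vision, text, gate_v=0.1, gate_t=0.1):
    """One bidirectional cross-only layer (illustrative, gates are assumed scalars)."""
    v_update = cross_attend(vision, text)  # vision tokens read from text tokens
    t_update = cross_attend(text, vision)  # text tokens read from vision tokens
    new_vision = [[x + gate_v * u for x, u in zip(row, upd)]
                  for row, upd in zip(vision, v_update)]
    new_text = [[x + gate_t * u for x, u in zip(row, upd)]
                for row, upd in zip(text, t_update)]
    return new_vision, new_text

# Toy hidden states: 3 vision tokens and 2 text tokens in a shared 2-d space.
vision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, 0.5], [1.0, -1.0]]
fused_vision, fused_text = bridge_layer(vision, text)
```

With both gates set to zero the layer leaves the encoder states untouched, which is one way to see why a gated residual update can be added near the top of pretrained encoders without disturbing them at initialization.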

[225] LARM: A Large Articulated-Object Reconstruction Model

Sylvia Yuan, Ruoxi Shi, Xinyue Wei, Xiaoshuai Zhang, Hao Su, Minghua Liu

Main category: cs.CV

TL;DR: LARM is a unified feedforward framework that reconstructs 3D articulated objects from sparse-view images, jointly recovering detailed geometry, realistic textures, and accurate joint structures using a transformer-based architecture.

Details

Motivation: Existing methods for 3D articulated object reconstruction require dense multi-view inputs and expensive per-instance optimization, limiting scalability. Recent feedforward approaches produce coarse geometry, lack texture reconstruction, and rely on complex multi-stage pipelines.

Method: LARM extends LVSM (a novel view synthesis approach) to articulated objects by jointly reasoning over camera pose and articulation variation using a transformer-based architecture. It generates auxiliary outputs like depth maps and part masks to facilitate explicit 3D mesh extraction and joint estimation.

Result: LARM outperforms state-of-the-art methods in both novel view and state synthesis, as well as 3D articulated object reconstruction. It generates high-quality meshes that closely adhere to input images and supports high-fidelity reconstruction across diverse object categories.

Conclusion: LARM provides a scalable and accurate solution for 3D articulated object reconstruction from sparse-view images, eliminating the need for dense supervision while achieving superior performance compared to existing methods.

Abstract: Modeling 3D articulated objects with realistic geometry, textures, and kinematics is essential for a wide range of applications. However, existing optimization-based reconstruction methods often require dense multi-view inputs and expensive per-instance optimization, limiting their scalability. Recent feedforward approaches offer faster alternatives but frequently produce coarse geometry, lack texture reconstruction, and rely on brittle, complex multi-stage pipelines. We introduce LARM, a unified feedforward framework that reconstructs 3D articulated objects from sparse-view images by jointly recovering detailed geometry, realistic textures, and accurate joint structures. LARM extends LVSM, a recent novel view synthesis (NVS) approach for static 3D objects, into the articulated setting by jointly reasoning over camera pose and articulation variation using a transformer-based architecture, enabling scalable and accurate novel view synthesis. In addition, LARM generates auxiliary outputs such as depth maps and part masks to facilitate explicit 3D mesh extraction and joint estimation. Our pipeline eliminates the need for dense supervision and supports high-fidelity reconstruction across diverse object categories. Extensive experiments demonstrate that LARM outperforms state-of-the-art methods in both novel view and state synthesis as well as 3D articulated object reconstruction, generating high-quality meshes that closely adhere to the input images. Project page: https://sylviayuan-sy.github.io/larm-site/

[226] DENTEX: Dental Enumeration and Tooth Pathosis Detection Benchmark for Panoramic X-ray

Ibrahim Ethem Hamamci, Sezgin Er, Omer Faruk Durugol, Gulsade Rabia Cakmak, Ezequiel de la Rosa, Enis Simsar, Atif Emre Yuksel, Sadullah Gultekin, Serife Damla Ozdemir, Kaiyuan Yang, Mehmet Berke Isler, Mustafa Salih Gucez, Shenxiao Mei, Chenglong Ma, Feihong Shen, Kaidi Shen, Huikai Wu, Han Wu, Lanzhuju Mei, Zhiming Cui, Niels van Nistelrooij, Khalid El Ghoul, Steven Kempers, Tong Xi, Shankeeth Vinayahalingam, Kyoungyeon Choi, Jaewon Shin, Eunyi Lyou, Lanshan He, Yusheng Liu, Lisheng Wang, Tudor Dascalu, Shaqayeq Ramezanzade, Azam Bakhshandeh, Lars Bjørndal, Bulat Ibragimov, Hongwei Bran Li, Sarthak Pati, Bernd Stadlinger, Albert Mehl, Mehmet Kemal Ozdemir, Mustafa Gundogar, Bjoern Menze

Main category: cs.CV

TL;DR: The DENTEX Challenge 2023 promoted AI algorithms for multi-label detection of abnormal teeth in panoramic dental X-rays using hierarchical annotations, with top performers using advanced models like Transformers and diffusion models that significantly outperformed traditional approaches.

Details

Motivation: Panoramic dental X-ray interpretation is time-consuming and error-prone, and AI can improve diagnosis accuracy, but faces challenges from scarce annotated data and anatomical variations.

Method: Organized the DENTEX Challenge with three types of hierarchically annotated data: partially annotated quadrant data, partially annotated quadrant-enumeration data, and fully annotated quadrant-enumeration-diagnosis data with four diagnoses.

Result: Top performers succeeded with diverse specialized strategies including segmentation-guided pipelines and highly-engineered single-stage detectors using advanced Transformer and diffusion models, significantly outperforming traditional approaches especially for tooth enumeration and subtle disease classification.

Conclusion: The challenge provides key insights for future AI-powered dental tools, showing that advanced architectures can offer more precise and efficient diagnosis and treatment planning in dentistry.

Abstract: Panoramic X-rays are frequently used in dentistry for treatment planning, but their interpretation can be both time-consuming and prone to error. Artificial intelligence (AI) has the potential to aid in the analysis of these X-rays, thereby improving the accuracy of dental diagnoses and treatment plans. Nevertheless, designing automated algorithms for this purpose poses significant challenges, mainly due to the scarcity of annotated data and variations in anatomical structure. To address these issues, we organized the Dental Enumeration and Diagnosis on Panoramic X-rays Challenge (DENTEX) in association with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2023. This challenge aims to promote the development of algorithms for multi-label detection of abnormal teeth, using three types of hierarchically annotated data: partially annotated quadrant data, partially annotated quadrant-enumeration data, and fully annotated quadrant-enumeration-diagnosis data, inclusive of four different diagnoses. In this paper, we present a comprehensive analysis of the methods and results from the challenge. Our findings reveal that top performers succeeded through diverse, specialized strategies, from segmentation-guided pipelines to highly-engineered single-stage detectors, using advanced Transformer and diffusion models. These strategies significantly outperformed traditional approaches, particularly for the challenging tasks of tooth enumeration and subtle disease classification. By dissecting the architectural choices that drove success, this paper provides key insights for future development of AI-powered tools that can offer more precise and efficient diagnosis and treatment planning in dentistry. The evaluation code and datasets can be accessed at https://github.com/ibrahimethemhamamci/DENTEX

[227] GreatSplicing: A Semantically Rich Splicing Dataset

Jiaming Liang, Yuwan Xue, Haowei Liu, Zhenqi Dai, Yu Liao, Rui Wang, Weihao Jiang, Yaping Liu, Zhikun Chen, Guoxiao Liu, Bo Liu, Xiuli Bi

Main category: cs.CV

TL;DR: GreatSplicing is a manually created, large-scale, high-quality splicing dataset with 5,000 spliced images covering 335 semantic categories to address overfitting and inconsistent benchmarks in splicing detection.

Details

Motivation: Existing splicing datasets lack semantic variety, causing models to overfit semantic features rather than learn genuine splicing traces, and the absence of a reasonable benchmark leads to inconsistent experimental settings.

Method: Created GreatSplicing - a manually curated dataset of 5,000 spliced images with diverse semantic coverage across 335 distinct categories to enable better learning of splicing traces.

Result: Detection models trained on GreatSplicing achieve low misidentification rates and demonstrate stronger cross-dataset generalization compared to models trained on existing datasets.

Conclusion: GreatSplicing effectively addresses the limitations of existing splicing datasets and is publicly available for research, providing a better benchmark for splicing detection methods.

Abstract: In existing splicing forgery datasets, the insufficient semantic variety of spliced regions causes trained detection models to overfit semantic features rather than learn genuine splicing traces. Meanwhile, the lack of a reasonable benchmark dataset has led to inconsistent experimental settings across existing detection methods. To address these issues, we propose GreatSplicing, a manually created, large-scale, high-quality splicing dataset. GreatSplicing comprises 5,000 spliced images and covers spliced regions across 335 distinct semantic categories, enabling detection models to learn splicing traces more effectively. Empirical results show that detection models trained on GreatSplicing achieve low misidentification rates and stronger cross-dataset generalization compared to existing datasets. GreatSplicing is now publicly available for research purposes at the following link.

[228] Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications

Junyi Ma, Xieyuanli Chen, Jiawei Huang, Jingyi Xu, Zhen Luo, Jintao Xu, Weihao Gu, Rui Ai, Hesheng Wang

Main category: cs.CV

TL;DR: Cam4DOcc is a new benchmark for camera-only 4D occupancy forecasting that extends current 3D occupancy estimation to include temporal prediction of future scene states in autonomous driving.

Details

Motivation: Current camera-only occupancy estimation methods are limited to representing the current 3D space and don't consider future states of surrounding objects, which is crucial for safe autonomous driving.

Method: Built benchmark using multiple public datasets (nuScenes, nuScenes-Occupancy, Lyft-Level5) with sequential occupancy states and 3D backward centripetal flow. Proposed four baseline types: static-world model, point cloud prediction voxelization, 2D-3D instance-based prediction, and a novel end-to-end 4D occupancy forecasting network.

Result: Established comprehensive benchmark with standardized evaluation protocol for multiple tasks comparing present and future occupancy estimation performance for autonomous driving objects of interest.

Conclusion: Cam4DOcc extends camera-only occupancy estimation to spatiotemporal prediction, providing a foundation for future research in 4D occupancy forecasting with publicly available dataset and baseline implementations.

Abstract: Understanding how the surrounding environment changes is crucial for performing downstream tasks safely and reliably in autonomous driving applications. Recent occupancy estimation techniques using only camera images as input can provide dense occupancy representations of large-scale scenes based on the current observation. However, they are mostly limited to representing the current 3D space and do not consider the future state of surrounding objects along the time axis. To extend camera-only occupancy estimation into spatiotemporal prediction, we propose Cam4DOcc, a new benchmark for camera-only 4D occupancy forecasting, evaluating the surrounding scene changes in the near future. We build our benchmark based on multiple publicly available datasets, including nuScenes, nuScenes-Occupancy, and Lyft-Level5, which provide sequential occupancy states of general movable and static objects, as well as their 3D backward centripetal flow. To establish this benchmark for future research with comprehensive comparisons, we introduce four baseline types from diverse camera-based perception and prediction implementations, including a static-world occupancy model, voxelization of point cloud prediction, 2D-3D instance-based prediction, and our proposed novel end-to-end 4D occupancy forecasting network. Furthermore, a standardized evaluation protocol for multiple preset tasks is also provided to compare the performance of all the proposed baselines on present and future occupancy estimation with respect to objects of interest in autonomous driving scenarios. The dataset and our implementation of all four baselines in the proposed Cam4DOcc benchmark will be released here: https://github.com/haomo-ai/Cam4DOcc.
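To make the idea of a static-world baseline concrete, here is a small sketch under stated assumptions. It reads "static-world" as "the future occupancy equals the current occupancy", and it scores forecasts with intersection-over-union on occupied voxels; both the interpretation and the metric are assumptions, since the summary does not spell out the benchmark's exact protocol. All names are hypothetical.

```python
def iou(pred, target):
    """IoU between two sets of occupied voxel indices (x, y, z)."""
    union = pred | target
    if not union:
        return 1.0  # both empty: treat as a perfect match
    return len(pred & target) / len(union)

def static_world_forecast(current, horizon):
    """Assumed baseline: copy the current occupancy to every future step."""
    return [set(current) for _ in range(horizon)]

# Toy scene: an object occupying two voxels moves one voxel per step along x.
current = {(0, 0, 0), (1, 0, 0)}
future_ground_truth = [
    {(1, 0, 0), (2, 0, 0)},  # t + 1
    {(2, 0, 0), (3, 0, 0)},  # t + 2
]

predictions = static_world_forecast(current, len(future_ground_truth))
scores = [iou(p, g) for p, g in zip(predictions, future_ground_truth)]
```

For a moving object the score of such a baseline decays with the forecasting horizon, which is the gap a learned 4D forecasting model is meant to close.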

[229] Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos

Junyi Ma, Jingyi Xu, Xieyuanli Chen, Hesheng Wang

Main category: cs.CV

TL;DR: Diff-IP2D is a novel diffusion-based method for predicting future hand trajectories and object affordances in egocentric videos, addressing limitations of autoregressive approaches by using iterative non-autoregressive prediction with camera motion awareness.

Details

Motivation: Understanding human hand-object interaction is crucial for service robots and extended reality. Existing methods use autoregressive prediction which lacks holistic sequence constraints and accumulates errors, while also ignoring camera egomotion effects.

Method: Proposes Diff-IP2D that transforms sequential 2D images to latent features and uses a denoising diffusion model to predict future interaction features conditioned on past ones. Integrates motion features to account for camera wearer’s dynamics.

Result: Significantly outperforms state-of-the-art baselines on both standard metrics and newly proposed evaluation protocol, demonstrating the efficacy of generative paradigms for 2D hand-object interaction prediction.

Conclusion: The diffusion-based approach with motion awareness effectively addresses limitations of autoregressive methods and provides more accurate 2D hand-object interaction forecasting.

Abstract: Understanding how humans would behave during hand-object interaction is vital for applications in service robot manipulation and extended reality. To achieve this, some recent works have been proposed to simultaneously forecast hand trajectories and object affordances on human egocentric videos. The joint prediction serves as a comprehensive representation of future hand-object interactions in 2D space, indicating potential human motion and motivation. However, the existing approaches mostly adopt the autoregressive paradigm for unidirectional prediction, which lacks mutual constraints within the holistic future sequence, and accumulates errors along the time axis. Meanwhile, these works largely overlook the effect of camera egomotion on first-person view predictions. To address these limitations, we propose a novel diffusion-based interaction prediction method, namely Diff-IP2D, to forecast future hand trajectories and object affordances concurrently in an iterative non-autoregressive manner. We transform the sequential 2D images into latent feature space and design a denoising diffusion model to predict future latent interaction features conditioned on past ones. Motion features are further integrated into the conditional denoising process to make Diff-IP2D aware of the camera wearer’s dynamics for more accurate interaction prediction. Extensive experiments demonstrate that our method significantly outperforms the state-of-the-art baselines on both the off-the-shelf metrics and our newly proposed evaluation protocol. This highlights the efficacy of leveraging a generative paradigm for 2D hand-object interaction prediction. The code of Diff-IP2D is released as open source at https://github.com/IRMVLab/Diff-IP2D.

[230] DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation

Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal

Main category: cs.CV

TL;DR: DREAMRUNNER is a novel story-to-video generation method that addresses challenges in generating high-quality videos from complex single-scene descriptions by combining LLM-based planning, retrieval-augmented motion adaptation, and spatial-temporal attention mechanisms.

Details

Motivation: Existing storytelling video generation methods struggle with generating high-quality videos aligned with complex single-scene descriptions that involve multiple characters, events, motion synthesis, and character customization.

Method: Uses LLM for scene and object-level layout planning, retrieval-augmented test-time adaptation for motion priors, and SR3AI module for object-motion binding and spatial-temporal control.

Result: Achieves state-of-the-art performance in character consistency, text alignment, smooth transitions, and compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench.

Conclusion: DREAMRUNNER demonstrates robust ability to generate multi-object interactions and addresses key challenges in storytelling video generation through its integrated approach to planning, motion adaptation, and spatial-temporal control.

Abstract: Storytelling video generation (SVG) aims to produce coherent and visually rich multi-scene videos that follow a structured narrative. Existing methods primarily employ an LLM for high-level planning to decompose a story into scene-level descriptions, which are then independently generated and stitched together. However, these approaches struggle with generating high-quality videos aligned with complex single-scene descriptions, as visualizing such a description involves coherent composition of multiple characters and events, complex motion synthesis and multi-character customization. To address these challenges, we propose DREAMRUNNER, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning as well as fine-grained object-level layout planning. Next, DREAMRUNNER presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module SR3AI for fine-grained object-motion binding and frame-by-frame spatial-temporal semantic control. We compare DREAMRUNNER with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DREAMRUNNER exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DREAMRUNNER’s robust ability to generate multi-object interactions with qualitative examples.

[231] Adaptive Parametric Activation: Unifying and Generalising Activation Functions Across Tasks

Konstantinos Panagiotis Alexandridis, Jiankang Deng, Anh Nguyen, Shan Luo

Main category: cs.CV

TL;DR: The paper proposes Adaptive Parametric Activation (APA), a novel activation function that adapts to data distribution and outperforms state-of-the-art methods on imbalanced classification benchmarks while being versatile across various tasks.

Details

Motivation: Standard activation functions like Sigmoid are biased towards frequent classes in imbalanced classification tasks, limiting model performance. There's a need for activation functions that align with data distribution.

Method: Proposed APA function that unifies common activation functions under a single formula and can be applied in both intermediate and attention layers. Performed comprehensive statistical analysis of classification and intermediate layers.

Result: APA significantly outperforms state-of-the-art on several imbalanced benchmarks (ImageNet-LT, iNaturalist2018, Places-LT, CIFAR100-LT, LVIS) and improves performance across various tasks including classification, detection, visual instruction following, image generation, and next-text-token prediction.

Conclusion: Aligning activation functions with data distribution enhances performance in both balanced and imbalanced tasks. APA provides a versatile solution that works across multiple domains and model architectures.

Abstract: The activation function plays a crucial role in model optimisation, yet the optimal choice remains unclear. For example, the Sigmoid activation is the de facto activation in balanced classification tasks; however, in imbalanced classification, it proves inappropriate due to bias towards frequent classes. In this work, we delve deeper into this phenomenon by performing a comprehensive statistical analysis of the classification and intermediate layers of both balanced and imbalanced networks, and we empirically show that aligning the activation function with the data distribution enhances the performance in both balanced and imbalanced tasks. To this end, we propose the Adaptive Parametric Activation (APA) function, a novel and versatile activation function that unifies most common activation functions under a single formula. APA can be applied in both intermediate layers and attention layers, significantly outperforming the state-of-the-art on several imbalanced benchmarks such as ImageNet-LT, iNaturalist2018, Places-LT, CIFAR100-LT and LVIS. Also, we extend APA to a plethora of other tasks such as classification, detection, visual instruction following tasks, image generation and next-text-token prediction benchmarks. APA increases the performance in multiple benchmarks across various model architectures. The code is available at https://github.com/kostas1515/AGLU.
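As an illustration of how a single parametric formula can cover several activations, the sketch below uses a generalised-logistic form with two parameters, `kappa` and `lam`. Treating this as the APA formula is an assumption: the summary states only that APA unifies common activations, not its exact parameterisation. What the sketch does show with certainty is that this form reduces to the Sigmoid at `lam = 1` and approaches the Gumbel CDF as `lam` tends to 0.

```python
import math

def parametric_activation(z, kappa=1.0, lam=1.0):
    """Generalised logistic: (lam * exp(-kappa * z) + 1) ** (-1 / lam).

    Assumed form for illustration; requires lam > 0.
    """
    return (lam * math.exp(-kappa * z) + 1.0) ** (-1.0 / lam)

def sigmoid(z):
    """Standard logistic function, recovered at lam = 1, kappa = 1."""
    return 1.0 / (1.0 + math.exp(-z))

def gumbel_cdf(z):
    """Gumbel CDF exp(-exp(-z)), the limit as lam -> 0 with kappa = 1."""
    return math.exp(-math.exp(-z))

# Compare the parametric form against both special cases at one input.
z = 0.3
gap_to_sigmoid = abs(parametric_activation(z, kappa=1.0, lam=1.0) - sigmoid(z))
gap_to_gumbel = abs(parametric_activation(z, kappa=1.0, lam=1e-6) - gumbel_cdf(z))
```

In a network, `kappa` and `lam` would typically be learnable so that the shape of the activation can follow the data distribution; how they are initialised and constrained is not covered by this sketch.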

[232] MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image

Shezheng Song, Chengxiang He, Shan Zhao, Chengyu Wang, Qian Wan, Tianwei Yan, Meng Wang

Main category: cs.CV

TL;DR: MOSABench is a new benchmark for evaluating multi-object sentiment analysis in multimodal LLMs, revealing current models’ limitations in handling complex scenarios with multiple objects.

Details

Motivation: There's a lack of standardized benchmarks for evaluating MLLMs' performance in multi-object sentiment analysis, which is crucial for semantic understanding in real-world applications.

Method: Created MOSABench dataset with ~1,000 images containing multiple objects, featuring distance-based target annotation, post-processing for evaluation standardization, and improved scoring mechanism.

Result: Experiments show significant limitations in current MLLMs - some models (mPLUG-owl, Qwen-VL2) perform well with sentiment-relevant features, while others struggle with scattered focus and performance degradation as object distance increases.

Conclusion: MOSABench serves as a foundational tool to advance MLLMs’ sentiment analysis capabilities, highlighting the need for improved accuracy in complex multi-object scenarios.

Abstract: Multimodal large language models (MLLMs) have shown remarkable progress in high-level semantic tasks such as visual question answering, image captioning, and emotion recognition. However, despite advancements, there remains a lack of standardized benchmarks for evaluating MLLMs’ performance in multi-object sentiment analysis, a key task in semantic understanding. To address this gap, we introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis. MOSABench includes approximately 1,000 images with multiple objects, requiring MLLMs to independently assess the sentiment of each object, thereby reflecting real-world complexities. Key innovations in MOSABench include distance-based target annotation, post-processing for evaluation to standardize outputs, and an improved scoring mechanism. Our experiments reveal notable limitations in current MLLMs: while some models, like mPLUG-owl and Qwen-VL2, demonstrate effective attention to sentiment-relevant features, others exhibit scattered focus and performance declines, especially as the spatial distance between objects increases. This research underscores the need for MLLMs to enhance accuracy in complex, multi-object sentiment analysis tasks and establishes MOSABench as a foundational tool for advancing sentiment analysis capabilities in MLLMs.

[233] MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos

Junyi Ma, Xieyuanli Chen, Wentao Bao, Jingyi Xu, Hesheng Wang

Main category: cs.CV

TL;DR: MADiff is a novel hand trajectory prediction method using diffusion models with motion-aware Mamba architecture that integrates egomotion and leverages foundation models for high-level semantics, achieving performance comparable to state-of-the-art baselines on five public datasets with real-time capability.

Details

Motivation: Understanding human intentions from egocentric videos is crucial for embodied AI, but predicting hand trajectories is challenging due to camera egomotion interference and lack of affordance labels for explicit guidance.

Method: Proposes MADiff with diffusion models for hand waypoint prediction, using motion-aware Mamba with motion-driven selective scan (MDSS) to integrate egomotion, and leverages foundation models to capture high-level semantics from video without explicit affordance supervision.

Result: Achieves comparable performance to state-of-the-art baselines on five public datasets with new evaluation metrics, demonstrating reasonable hand trajectory prediction with real-time performance.

Conclusion: MADiff effectively predicts hand trajectories from egocentric videos by addressing egomotion challenges and leveraging semantic understanding without explicit affordance labels, advancing egocentric vision capabilities.

Abstract: Understanding human intentions and actions through egocentric videos is important on the path to embodied artificial intelligence. As a branch of egocentric vision techniques, hand trajectory prediction plays a vital role in comprehending human motion patterns, benefiting downstream tasks in extended reality and robot manipulation. However, capturing high-level human intentions consistent with reasonable temporal causality is challenging when only egocentric videos are available. This difficulty is exacerbated under camera egomotion interference and the absence of affordance labels to explicitly guide the optimization of hand waypoint distribution. In this work, we propose a novel hand trajectory prediction method dubbed MADiff, which forecasts future hand waypoints with diffusion models. The devised denoising operation in the latent space is achieved by our proposed motion-aware Mamba, where the camera wearer’s egomotion is integrated to achieve motion-driven selective scan (MDSS). To discern the relationship between hands and scenarios without explicit affordance supervision, we leverage a foundation model that fuses visual and language features to capture high-level semantics from video clips. Comprehensive experiments conducted on five public datasets with the existing and our proposed new evaluation metrics demonstrate that MADiff predicts reasonable hand trajectories comparable to those of the state-of-the-art baselines, and achieves real-time performance. We will release our code and pretrained models of MADiff at the project page: https://irmvlab.github.io/madiff.github.io.

[234] Enhanced Structured Lasso Pruning with Class-wise Information

Xiang Liu, Mingchen Li, Xia Li, Leigang Qu, Guangsu Wang, Zifan Peng, Yijun Song, Zemin Liu, Linshan Jiang, Jialin Li

Main category: cs.CV

TL;DR: Proposes two adaptive network pruning methods (sGLP-IB and sTLP-IB) using structured lasso and Information Bottleneck theory to maintain class-wise information, achieving significant parameter reduction while preserving accuracy.

Details

Motivation: Existing pruning methods focus on removing unimportant filters but may lose statistical information by failing to consider class-wise information, leading to performance degradation after pruning.

Method: Uses structured lasso with Information Bottleneck theory to prune model filters while preserving class-wise relatedness. Proposes two parallel schemes: sGLP-IB (sparse graph-structured lasso pruning) and sTLP-IB (sparse tree-guided lasso pruning).

Result: Achieved state-of-the-art performance across three datasets and six model structures. For VGG16 on CIFAR-10: 85% parameter reduction, 61% FLOPs reduction, 94.10% accuracy (0.14% better than original). For ResNet on ImageNet: 55% parameter reduction with only 0.03% accuracy drop (76.12%).

Conclusion: Successfully reduces model size and computational resource usage while maintaining accuracy by leveraging structured class-wise information and Information Bottleneck theory.

Abstract: Modern applications require lightweight neural network models. Most existing neural network pruning methods focus on removing unimportant filters; however, these may result in the loss of statistical information after pruning due to failing to consider the class-wise information. In this paper, we employ the structured lasso from the perspective of utilizing precise class-wise information for model pruning with the help of Information Bottleneck theory, which guides us to ensure the retention of statistical information before and after pruning. With these techniques, we propose two novel adaptive network pruning schemes in parallel: sparse graph-structured lasso pruning with Information Bottleneck (sGLP-IB) and sparse tree-guided lasso pruning with Information Bottleneck (sTLP-IB). The key component is that we prune the model filters utilizing sGLP-IB and sTLP-IB with more precise structured class-wise relatedness. Compared to multiple state-of-the-art methods, our approaches achieve the best performance across three datasets and six model structures in extensive experiments. For example, with the VGG16 model on the CIFAR-10 dataset, we can reduce the parameters by 85%, decrease the FLOPs by 61%, and maintain an accuracy of 94.10% (0.14% better than the original). For large-scale ImageNet, we can reduce the parameters by 55% while keeping the accuracy at 76.12% (a drop of only 0.03%) using the ResNet architecture. In summary, we succeed in reducing the model size and computational resource usage while maintaining accuracy.
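For readers unfamiliar with structured lasso pruning, the following sketch shows the plain group-lasso building block: each filter is one group, the penalty is the sum of per-filter L2 norms, and a group soft-thresholding step drives whole filters to exactly zero. It is deliberately simpler than the paper's methods: the graph-structured and tree-guided penalties, the class-wise relatedness, and the Information Bottleneck term are all omitted, and every name here is hypothetical.

```python
import math

def group_norm(filter_weights):
    """L2 norm of one filter's flattened weights."""
    return math.sqrt(sum(w * w for w in filter_weights))

def group_lasso_penalty(filters):
    """Sum of per-filter L2 norms: the structured (group) lasso regulariser."""
    return sum(group_norm(f) for f in filters)

def group_soft_threshold(filters, threshold):
    """Proximal step of the group lasso: shrink each filter as a unit."""
    shrunk = []
    for f in filters:
        norm = group_norm(f)
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        shrunk.append([scale * w for w in f])
    return shrunk

def surviving_filters(filters):
    """Indices of filters that were not zeroed out and would be kept."""
    return [i for i, f in enumerate(filters) if group_norm(f) > 0]

# Three toy filters with L2 norms of roughly 5.0, 0.5 and 1.0.
filters = [[3.0, 4.0], [0.3, 0.4], [0.0, 1.0]]
shrunk = group_soft_threshold(filters, threshold=0.6)
kept = surviving_filters(shrunk)
```

Because the shrinkage acts on a filter's norm rather than on individual weights, a weak filter is removed entirely, which is what makes the resulting sparsity usable for real parameter and FLOPs reductions.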

[235] FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection

Jiangyong Yu, Changyong Shu, Sifan Zhou, Zichen Yu, Xing Hu, Yan Chen, Dawei Yang

Main category: cs.CV

TL;DR: FQ-PETR is a fully quantized framework for PETR-based 3D detection models that addresses quantization challenges through three innovations: QFPE for feature alignment, DULUT for non-linear function approximation, and QANS for attention stabilization, achieving near-floating-point accuracy with up to 75% latency reduction.

Details

Motivation: PETR models excel in multi-view 3D detection but face deployment challenges due to high computational cost and memory footprint. Direct quantization causes severe accuracy degradation due to feature magnitude disparity and inefficient non-linear operator quantization.

Method: Three key innovations: (1) QFPE replaces multi-point sampling with LiDAR-guided single-point sampling and anchor-based embedding; (2) DULUT approximates non-linear functions with cascaded linear lookup tables; (3) QANS performs quantization after softmax numerical stabilization.

Result: FQ-PETR achieves near-floating-point accuracy (only 1% degradation) under W8A8 quantization while reducing latency by up to 75%, significantly outperforming existing PTQ and QAT baselines across PETR variants.

Conclusion: FQ-PETR enables efficient deployment of PETR-based 3D detection models through a comprehensive quantization framework that addresses key challenges in feature alignment, non-linear function approximation, and attention stabilization.

Abstract: Camera-based multi-view 3D detection is crucial for autonomous driving. PETR and its variants (PETRs) excel in benchmarks but face deployment challenges due to high computational cost and memory footprint. Quantization is an effective technique for compressing deep neural networks by reducing the bit width of weights and activations. However, directly applying existing quantization methods to PETRs leads to severe accuracy degradation. This issue primarily arises from two key challenges: (1) significant magnitude disparity between multi-modal features, specifically image features and camera-ray positional embeddings (PE), and (2) the inefficiency and approximation error of quantizing non-linear operators, which commonly rely on hardware-unfriendly computations. In this paper, we propose FQ-PETR, a fully quantized framework for PETRs, featuring three key innovations: (1) Quantization-Friendly LiDAR-ray Position Embedding (QFPE): Replacing multi-point sampling with LiDAR-prior-guided single-point sampling and anchor-based embedding eliminates problematic non-linearities (e.g., inverse-sigmoid) and aligns PE scale with image features, preserving accuracy. (2) Dual-Lookup Table (DULUT): This algorithm approximates complex non-linear functions using two cascaded linear LUTs, achieving high fidelity with minimal entries and no specialized hardware. (3) Quantization After Numerical Stabilization (QANS): Performing quantization after softmax numerical stabilization mitigates attention distortion from large inputs. On PETRs (e.g., PETR, StreamPETR, PETRv2, MV2d), FQ-PETR under W8A8 achieves near-floating-point accuracy (1% degradation) while reducing latency by up to 75%, significantly outperforming existing PTQ and QAT baselines.
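The intuition behind quantizing after softmax numerical stabilization can be illustrated with a small sketch. This is not the FQ-PETR implementation: the uniform quantizer, the clipping value of -16, and all function names below are assumptions made for illustration. The sketch only shows why subtracting the row maximum first leaves a narrow, fixed input range that an 8-bit grid can cover finely.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def quantize(xs, lo, hi, bits=8):
    """Uniform quantize-dequantize onto 2**bits levels spanning [lo, hi]."""
    scale = (hi - lo) / (2 ** bits - 1)
    out = []
    for x in xs:
        x = min(max(x, lo), hi)          # clip into the representable range
        out.append(lo + round((x - lo) / scale) * scale)
    return out

def softmax_quant_raw(logits, bits=8):
    """Baseline: quantize raw logits over their full range, then softmax."""
    return softmax(quantize(logits, min(logits), max(logits), bits))

def softmax_quant_after_stabilization(logits, clip=-16.0, bits=8):
    """Subtract the max first, so inputs lie in (-inf, 0], then quantize.

    `clip` is a hypothetical lower bound: exp(-16) is already negligible,
    so values below it can be saturated with little effect on the output.
    """
    m = max(logits)
    shifted = [x - m for x in logits]
    return softmax(quantize(shifted, clip, 0.0, bits))

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))
```

With logits such as `[200.0, 199.6, 199.3, -300.0]`, the raw quantizer has a step of about 2, so the three large logits collapse onto one level, while the stabilized version keeps them distinct.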

[236] Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment

Xing Xie, Jiawei Liu, Ziyue Lin, Huijie Fan, Zhi Han, Yandong Tang, Liangqiong Qu

Main category: cs.CV

TL;DR: ARRA is a training framework that enables global-coherent text-to-image generation in autoregressive LLMs without architectural changes, using visual alignment and hybrid tokens to maintain coherence while preserving original autoregressive paradigms.

Details

Motivation: To address the challenge of achieving global coherence in text-to-image generation with autoregressive LLMs without requiring complex architectural modifications that previous approaches needed.

Method: Uses a global visual alignment loss and a hybrid token to align LLM hidden states with visual representations from external models, enforcing dual constraints of local next-token prediction and global semantic distillation.

Result: Significantly reduces FID scores across multiple datasets: 16.6% on ImageNet, 12.0% on LAION-COCO for training from scratch; 25.5% on MIMIC-CXR, 8.8% on DeepEyeNet for text-generation LLMs; 18.6% improvement over direct fine-tuning for domain adaptation.

Conclusion: Training objective redesign rather than architectural modifications can effectively resolve cross-modal global coherence challenges, offering a complementary paradigm for advancing autoregressive models.

Abstract: We present Autoregressive Representation Alignment (ARRA), a new training framework that unlocks global-coherent text-to-image generation in autoregressive LLMs without architectural modifications. Unlike prior works that require complex architectural redesigns, ARRA aligns the LLM’s hidden states with visual representations from external visual foundation models via a global visual alignment loss and a hybrid token. This token enforces dual constraints: local next-token prediction and global semantic distillation, enabling LLMs to implicitly learn spatial and contextual coherence while retaining their original autoregressive paradigm. Extensive experiments validate ARRA’s plug-and-play versatility. When training T2I LLMs from scratch, ARRA reduces FID by 16.6% (ImageNet) and 12.0% (LAION-COCO) for autoregressive LLMs like LlamaGen, without modifying the original architecture or inference mechanism. For training from text-generation-only LLMs, ARRA reduces FID by 25.5% (MIMIC-CXR) and 8.8% (DeepEyeNet) for advanced LLMs like Chameleon. For domain adaptation, ARRA aligns general-purpose LLMs with specialized models (e.g., BioMedCLIP), achieving an 18.6% FID reduction over direct fine-tuning on medical imaging (MIMIC-CXR). These results demonstrate that training objective redesign, rather than architectural modifications, can resolve cross-modal global coherence challenges. ARRA offers a complementary paradigm for advancing autoregressive models. The code is available at https://github.com/HKU-HealthAI/ARRA.
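The dual constraint described above (local next-token prediction plus global semantic distillation) can be sketched as a two-term loss. The abstract does not state the exact form of the alignment loss, so the cosine distance, the `weight` parameter, and the function names here are assumptions for illustration only, not ARRA's actual objective.

```python
import math

def cross_entropy(logits, target):
    """Next-token loss: negative log-softmax of the target index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def combined_loss(logits, target, hybrid_hidden, visual_feature, weight=1.0):
    """Local term plus weighted global alignment term (hypothetical form).

    hybrid_hidden:  hidden state at the hybrid token position
    visual_feature: global feature from an external visual encoder,
                    assumed already projected to the same dimension
    """
    local = cross_entropy(logits, target)
    global_align = cosine_distance(hybrid_hidden, visual_feature)
    return local + weight * global_align
```

When the hybrid token's hidden state already points in the same direction as the visual feature, the alignment term vanishes and the loss reduces to ordinary next-token cross-entropy, which is consistent with the claim that the autoregressive paradigm is retained.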

[237] Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs

Ruichuan An, Kai Zeng, Ming Lu, Sihan Yang, Renrui Zhang, Huitong Ji, Hao Liang, Wentao Zhang

Main category: cs.CV

TL;DR: CaT introduces a tree-based concept representation that generates diverse positive and negative samples for VLM personalization, addressing data scarcity and quality issues through a controllable synthetic data pipeline.

Details

Motivation: To overcome challenges in VLM personalization caused by scarce positive samples and low-quality negative samples, enabling better integration of user-provided concepts.

Method: Concept-as-Tree (CaT) framework representing concepts as tree structures to generate diverse positive/negative samples with varying difficulty, combined with a data filtering strategy for quality control.

Result: CaT significantly enhances VLM capabilities across personalization benchmarks by alleviating data scarcity and quality issues.

Conclusion: CaT provides the first controllable synthetic data pipeline for VLM personalization, effectively improving model performance through structured concept representation and quality-controlled data generation.

Abstract: Vision-Language Models (VLMs) have demonstrated exceptional performance in various multi-modal tasks. Recently, there has been an increasing interest in improving the personalization capabilities of VLMs. To better integrate user-provided concepts into VLMs, many methods use positive and negative samples to fine-tune these models. However, the scarcity of user-provided positive samples and the low quality of retrieved negative samples pose challenges for existing techniques. To reveal the relationship between samples and model performance, we systematically investigate how the amount and diversity of positive and negative samples (easy and hard) affect VLM personalization tasks. Based on this analysis, we introduce Concept-as-Tree (CaT), which represents a concept as a tree structure, thereby enabling the generation of positive and negative samples with varying difficulty and diversity, and which can be easily extended to multi-concept scenarios. With a well-designed data filtering strategy, our CaT framework can ensure the quality of generated data, constituting a powerful pipeline. We perform thorough experiments with various VLM personalization baselines to assess how well the pipeline alleviates the lack of positive samples and the low quality of negative samples. Our results demonstrate that CaT equipped with the proposed data filter significantly enhances the capabilities of VLMs across personalization benchmarks. To the best of our knowledge, this work is the first controllable synthetic data pipeline for VLM personalization. The code will be released.

[238] Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction

Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Xieyuanli Chen, Hesheng Wang

Main category: cs.CV

TL;DR: MMTwin is a novel diffusion model for multimodal 3D hand trajectory prediction that integrates 2D RGB images, 3D point clouds, past hand waypoints, and text prompts, while concurrently predicting camera egomotion and future hand trajectories using twin diffusion models.

Details

Motivation: Existing hand trajectory prediction methods only use 2D egocentric video inputs and lack awareness of multimodal environmental information from both 2D and 3D observations. They also overlook the synergy between hand movements and headset camera egomotion.

Method: Proposed MMTwin with two latent diffusion models (egomotion diffusion and HTP diffusion) working as twins. Uses a novel hybrid Mamba-Transformer module as the denoising model for HTP diffusion to fuse multimodal features. Takes multimodal inputs including 2D RGB images, 3D point clouds, past hand waypoints, and text prompts.

Result: Experimental results on three public datasets and self-recorded data show MMTwin predicts plausible future 3D hand trajectories compared to state-of-the-art baselines and generalizes well to unseen environments.

Conclusion: MMTwin effectively addresses limitations of existing methods by incorporating multimodal environmental information and modeling the synergy between hand movements and camera egomotion, achieving superior performance in 3D hand trajectory prediction.

Abstract: Predicting hand motion is critical for understanding human intentions and bridging the action space between human movements and robot manipulations. Existing hand trajectory prediction (HTP) methods forecast the future hand waypoints in 3D space conditioned on past egocentric observations. However, such models are only designed to accommodate 2D egocentric video inputs. There is a lack of awareness of multimodal environmental information from both 2D and 3D observations, hindering the further improvement of 3D HTP performance. In addition, these models overlook the synergy between hand movements and headset camera egomotion, either predicting hand trajectories in isolation or encoding egomotion only from past frames. To address these limitations, we propose novel diffusion models (MMTwin) for multimodal 3D hand trajectory prediction. MMTwin is designed to absorb multimodal information as input, encompassing 2D RGB images, 3D point clouds, past hand waypoints, and text prompts. Moreover, two latent diffusion models, the egomotion diffusion and the HTP diffusion as twins, are integrated into MMTwin to predict camera egomotion and future hand trajectories concurrently. We propose a novel hybrid Mamba-Transformer module as the denoising model of the HTP diffusion to better fuse multimodal features. The experimental results on three publicly available datasets and our self-recorded data demonstrate that our proposed MMTwin can predict plausible future 3D hand trajectories compared to the state-of-the-art baselines, and generalizes well to unseen environments. The code and pretrained models have been released at https://github.com/IRMVLab/MMTwin.

[239] Leveraging NTPs for Efficient Hallucination Detection in VLMs

Ofir Azachi, Kfir Eliyahu, Eyal El Ani, Rom Himelstein, Roi Reichart, Yuval Pinter, Nitay Calderon

Main category: cs.CV

TL;DR: Proposes using next-token probabilities (NTPs) from VLMs to train lightweight ML models for efficient hallucination detection, achieving performance comparable to strong VLMs with lower computational cost.

Details

Motivation: Hallucinations in VLMs undermine reliability, and current detection methods using VLMs themselves are computationally intensive and increase latency.

Method: Train traditional ML models using NTP-based features from VLMs, augmented with linguistic NTPs and VLM hallucination scores, tested on a 1,400-statement human-annotated dataset.

Result: NTP-based features are effective predictors of hallucinations, enabling simple ML models to match VLM performance. Combining NTPs with linguistic NTPs and VLM scores further improves detection.

Conclusion: NTP-based lightweight methods offer efficient hallucination detection that enhances VLM reliability without heavy computational overhead.

Abstract: Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM’s next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to test our NTP-based lightweight method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models led to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs.
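A minimal sketch of the idea of training a lightweight model on NTP-derived signals follows. The two features (mean log-probability and minimum probability), the logistic-regression learner, and the toy probabilities are all assumptions chosen for illustration; the abstract does not specify which features or classifiers the authors used.

```python
import math

def ntp_features(token_probs):
    """Summarize a statement's next-token probabilities (hypothetical features)."""
    logs = [math.log(p) for p in token_probs]
    return [sum(logs) / len(logs), min(token_probs)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Plain batch gradient descent for logistic regression."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def hallucination_score(w, b, token_probs):
    """Predicted probability that the statement is hallucinated."""
    x = ntp_features(token_probs)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy, made-up training data: label 1 = hallucinated, 0 = faithful.
train_probs = [[0.9, 0.95, 0.8], [0.85, 0.9, 0.92], [0.3, 0.2, 0.6], [0.5, 0.1, 0.4]]
train_labels = [0, 0, 1, 1]
w, b = fit_logistic([ntp_features(p) for p in train_probs], train_labels)
```

Scoring a new statement then costs a handful of arithmetic operations on probabilities the VLM has already produced, which is the source of the efficiency gain the abstract describes.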

[240] Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach

Hangyu Liu, Bo Peng, Pengxiang Ding, Donglin Wang

Main category: cs.CV

TL;DR: TGAF uses 2D semantic tensors and diffusion models to improve multi-target adversarial attacks by enhancing feature quality and quantity for better transferability.

Details

Motivation: Existing multi-target attacks use 1D label encoding, losing fine-grained visual information and overfitting to model-specific features, limiting transferability.

Method: Proposes TGAF framework using diffusion models to encode target labels into 2D semantic tensors, with a masking strategy to preserve complete target semantic information during training.

Result: TGAF consistently outperforms state-of-the-art methods across various experimental settings.

Conclusion: The 2D tensor encoding approach with diffusion models effectively addresses feature quality and quantity issues, significantly improving multi-target adversarial attack transferability.

Abstract: Compared to single-target adversarial attacks, multi-target attacks have garnered significant attention due to their ability to generate adversarial images for multiple target classes simultaneously. However, existing generative approaches for multi-target attacks primarily encode target labels into one-dimensional tensors, leading to a loss of fine-grained visual information and overfitting to model-specific features during noise generation. To address this gap, we first identify and validate that the semantic feature quality and quantity are critical factors affecting the transferability of targeted attacks: 1) Feature quality refers to the structural and detailed completeness of the implanted target features, as deficiencies may result in the loss of key discriminative information; 2) Feature quantity refers to the spatial sufficiency of the implanted target features, as inadequacy limits the victim model’s attention to this feature. Based on these findings, we propose the 2D Tensor-Guided Adversarial Fusion (TGAF) framework, which leverages the powerful generative capabilities of diffusion models to encode target labels into two-dimensional semantic tensors for guiding adversarial noise generation. Additionally, we design a novel masking strategy tailored for the training process, ensuring that parts of the generated noise retain complete semantic information about the target class. Extensive experiments demonstrate that TGAF consistently surpasses state-of-the-art methods across various settings.

[241] Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input

Chenxu Li, Zhicai Wang, Yuan Sheng, Xingyu Zhu, Yanbin Hao, Xiang Wang

Main category: cs.CV

TL;DR: Res-Bench is a new benchmark for evaluating resolution robustness in Multimodal Large Language Models, assessing performance stability across 12 resolution levels and 6 capability dimensions with novel robustness metrics.

Details

Motivation: Current MLLM evaluations focus on semantic performance but overlook resolution robustness - whether performance remains stable across varying input resolutions, creating a critical gap in understanding model reliability.

Method: Developed Res-Bench with 14,400 samples across 12 resolution levels and 6 capability dimensions; introduced novel robustness metrics including Spearman’s correlation for resolution-performance trends and Absolute/Relative Continuous Error for performance volatility.

Result: Conducted large-scale evaluation of leading MLLMs examining model-centric and task-centric robustness, preprocessing strategies (padding and super-resolution), and fine-tuning for stability enhancement.

Conclusion: The benchmark provides comprehensive tools to assess and improve resolution robustness in MLLMs, addressing a previously overlooked aspect of model evaluation and reliability.

Abstract: Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness: whether performance remains stable across varying input resolutions. To address this gap, we introduce Res-Bench, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman’s correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
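The two kinds of robustness metric mentioned above can be sketched as follows. Spearman's correlation is computed in the standard way (Pearson correlation of ranks). The exact ACE/RCE definitions are not given in this summary, so the volatility functions below are an assumed stand-in (mean absolute change between adjacent resolution levels, optionally divided by mean accuracy) and should not be read as the benchmark's formulas.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(resolutions, accuracies):
    """Trend between resolution and accuracy, in [-1, 1]."""
    return pearson(average_ranks(resolutions), average_ranks(accuracies))

def mean_adjacent_change(accuracies):
    """Assumed volatility proxy: mean |a[i+1] - a[i]| over adjacent levels."""
    diffs = [abs(b - a) for a, b in zip(accuracies, accuracies[1:])]
    return sum(diffs) / len(diffs)

def relative_adjacent_change(accuracies):
    """Same proxy, normalized by the mean accuracy."""
    return mean_adjacent_change(accuracies) / (sum(accuracies) / len(accuracies))
```

A model whose accuracy rises steadily with resolution has a Spearman coefficient near 1 yet may still be volatile, which is why a trend metric and a volatility metric are reported separately.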

[242] MS-Occ: Multi-Stage LiDAR-Camera Fusion for 3D Semantic Occupancy Prediction

Zhiqiang Wei, Lianqing Zheng, Jianan Liu, Tao Huang, Qing-Long Han, Wenwen Zhang, Fengdeng Zhang

Main category: cs.CV

TL;DR: MS-Occ is a multi-stage LiDAR-camera fusion framework for 3D semantic occupancy perception that achieves state-of-the-art performance by integrating LiDAR’s geometric accuracy with camera’s semantic richness through hierarchical cross-modal fusion.

Details

Motivation: Existing vision-centric methods suffer from geometric inaccuracies while LiDAR-based approaches lack rich semantic information, creating a need for better fusion methods to enable accurate 3D semantic occupancy perception in complex autonomous driving environments.

Method: Multi-stage fusion framework with: (1) Middle-stage fusion using Gaussian-Geo module for geometric enhancement and Semantic-Aware module for semantic enrichment via deformable cross-attention; (2) Late-stage fusion with Adaptive Fusion module for dynamic feature balancing and High Classification Confidence Voxel Fusion for semantic inconsistency resolution.

Result: Achieves state-of-the-art performance: 32.1% IoU and 25.3% mIoU on nuScenes-OpenOccupancy (surpassing SOTA by +0.7% IoU and +2.4% mIoU), and 24.08% mIoU on SemanticKITTI benchmark, with significant improvements in small object perception.

Conclusion: MS-Occ effectively addresses the limitations of single-modality approaches through hierarchical cross-modal fusion, demonstrating strong generalization capabilities and practical value for safety-critical autonomous driving applications.

Abstract: Accurate 3D semantic occupancy perception is essential for autonomous driving in complex environments with diverse and irregular objects. While vision-centric methods suffer from geometric inaccuracies, LiDAR-based approaches often lack rich semantic information. To address these limitations, we propose MS-Occ, a novel multi-stage LiDAR-camera fusion framework with middle-stage and late-stage fusion that integrates LiDAR’s geometric fidelity with camera-based semantic richness via hierarchical cross-modal fusion. The framework introduces innovations at two critical stages: (1) In the middle-stage feature fusion, the Gaussian-Geo module leverages Gaussian kernel rendering on sparse LiDAR depth maps to enhance 2D image features with dense geometric priors, and the Semantic-Aware module enriches LiDAR voxels with semantic context via deformable cross-attention; (2) In the late-stage voxel fusion, the Adaptive Fusion (AF) module dynamically balances voxel features across modalities, while the High Classification Confidence Voxel Fusion (HCCVF) module resolves semantic inconsistencies using self-attention-based refinement. Experiments on two large-scale benchmarks demonstrate state-of-the-art performance. On nuScenes-OpenOccupancy, MS-Occ achieves an Intersection over Union (IoU) of 32.1% and a mean IoU (mIoU) of 25.3%, surpassing the state-of-the-art by +0.7% IoU and +2.4% mIoU. Furthermore, on the SemanticKITTI benchmark, our method achieves a new state-of-the-art mIoU of 24.08%, robustly validating its generalization capabilities. Ablation studies further confirm the effectiveness of each individual module, highlighting substantial improvements in the perception of small objects and reinforcing the practical value of MS-Occ for safety-critical autonomous driving scenarios.
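For readers unfamiliar with the two reported metrics, the sketch below follows a common convention in semantic occupancy evaluation: IoU treats every non-empty voxel as one "occupied" class (geometry only), while mIoU averages per-class IoU over the semantic classes. The label encoding (0 for empty) and the function names are assumptions; the benchmarks' official evaluation scripts may differ in details such as ignored regions.

```python
def geometric_iou(pred, gt, empty=0):
    """IoU of occupied vs. empty voxels, ignoring semantic class."""
    inter = sum(1 for p, g in zip(pred, gt) if p != empty and g != empty)
    union = sum(1 for p, g in zip(pred, gt) if p != empty or g != empty)
    return inter / union if union else float("nan")

def mean_iou(pred, gt, classes):
    """Average of per-class IoU over classes present in pred or gt."""
    ious = []
    for c in classes:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:                      # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")

# Flattened toy voxel grids: 0 = empty, 1 and 2 = semantic classes.
pred = [0, 1, 1, 2, 2, 0]
gt = [0, 1, 2, 2, 2, 1]
```

On this toy grid the geometric IoU is 4/5 because four of the five voxels occupied in either grid are occupied in both, while the mIoU is lower because one voxel is occupied but assigned the wrong class.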

[243] FlexPara: Flexible Neural Surface Parameterization

Yuming Zhao, Qijian Zhang, Junhui Hou, Jiazhi Xia, Wenping Wang, Ying He

Main category: cs.CV

TL;DR: FlexPara is an unsupervised neural framework for flexible surface parameterization that establishes point-wise mappings between 3D surfaces and adaptively-deformed 2D UV coordinates without requiring manual cutting seams.

Details

Motivation: Conventional parameterization methods require high-quality meshes, are limited to simple topologies, and need manual cutting/decomposition. Optimal parameterization configurations vary with surface structures and tasks, requiring more flexible and controllable pipelines.

Method: Uses geometrically-interpretable sub-networks for cutting, deforming, unwrapping, and wrapping to create a bi-directional cycle mapping framework. Also constructs multi-chart parameterization with adaptively-learned chart assignment.

Result: Extensive experiments demonstrate universality, superiority, and inspiring potential of the neural surface parameterization paradigm.

Conclusion: FlexPara provides a flexible neural framework for both global and multi-chart surface parameterizations without manual intervention, showing promising results for various surface structures and tasks.

Abstract: Surface parameterization is a fundamental geometry processing task, laying the foundations for the visual presentation of 3D assets and numerous downstream shape analysis scenarios. Conventional parameterization approaches demand high-quality mesh triangulation and are restricted to certain simple topologies unless additional surface cutting and decomposition are provided. In practice, the optimal configurations (e.g., type of parameterization domains, distribution of cutting seams, number of mapping charts) may vary drastically with different surface structures and task characteristics, thus requiring more flexible and controllable processing pipelines. To this end, this paper introduces FlexPara, an unsupervised neural optimization framework to achieve both global and multi-chart surface parameterizations by establishing point-wise mappings between 3D surface points and adaptively-deformed 2D UV coordinates. We ingeniously design and combine a series of geometrically-interpretable sub-networks, with specific functionalities of cutting, deforming, unwrapping, and wrapping, to construct a bi-directional cycle mapping framework for global parameterization without the need for manually specified cutting seams. Furthermore, we construct a multi-chart parameterization framework with adaptively-learned chart assignment. Extensive experiments demonstrate the universality, superiority, and inspiring potential of our neural surface parameterization paradigm. The code will be publicly available at https://github.com/AidenZhao/FlexPara

[244] Unifying Segment Anything in Microscopy with Vision-Language Knowledge

Manyu Li, Ruian He, Zixian Zhang, Chenxi Ma, Weimin Tan, Bo Yan

Main category: cs.CV

TL;DR: uLLSAM is a framework that uses Multimodal Large Language Models (MLLMs) to guide SAM for biomedical image segmentation, improving cross-domain generalization through Vision-Language Semantic Alignment and Semantic Boundary Regularization.

Details

Motivation: Existing foundation models for biomedical segmentation show sub-optimal performance on unseen domain data due to lack of vision-language knowledge before segmentation. MLLMs' understanding capabilities can enhance generalization.

Method: Proposes uLLSAM framework with Vision-Language Semantic Alignment (VLSA) module to inject vision-language knowledge into SAM, and Semantic Boundary Regularization (SBR) to address boundary perception deficiencies.

Result: Achieves 11.8% improvement in SA across 9 in-domain microscopy datasets (state-of-the-art) and 9.2% improvement across 10 out-of-domain datasets, demonstrating strong generalization.

Conclusion: Injecting vision-language knowledge from MLLMs into SAM significantly improves biomedical segmentation performance and cross-domain generalization capabilities.

Abstract: Accurate segmentation of regions of interest in biomedical images holds substantial value in image analysis. Although several foundation models for biomedical segmentation have achieved excellent performance on certain datasets, they typically demonstrate sub-optimal performance on unseen domain data. We attribute this deficiency to the lack of vision-language knowledge before segmentation. Multimodal Large Language Models (MLLMs) bring outstanding understanding and reasoning capabilities to multimodal tasks, which inspires us to leverage MLLMs to inject Vision-Language Knowledge (VLK), thereby enabling vision models to demonstrate superior generalization capabilities on cross-domain datasets. In this paper, we propose a novel framework that seamlessly uses MLLMs to guide SAM in learning microscopy cross-domain data, unifying Segment Anything in Microscopy, named uLLSAM. Specifically, we propose the Vision-Language Semantic Alignment (VLSA) module, which injects VLK into the Segment Anything Model (SAM). We find that after SAM receives global VLK prompts, its performance improves significantly, but there are deficiencies in boundary contour perception. Therefore, we further propose Semantic Boundary Regularization (SBR) to regularize SAM. Our method achieves performance improvements of 11.8% in SA across 9 in-domain microscopy datasets, achieving state-of-the-art performance. Our method also demonstrates improvements of 9.2% in SA across 10 out-of-domain datasets, exhibiting strong generalization capabilities. Code is available at https://github.com/ieellee/uLLSAM.

[245] Zero-Shot Temporal Interaction Localization for Egocentric Videos

Erhang Zhang, Junyi Ma, Yin-Dong Zheng, Yixuan Zhou, Hesheng Wang

Main category: cs.CV

TL;DR: EgoLoc is a zero-shot temporal interaction localization approach that locates grasp action timings in egocentric videos using self-adaptive sampling and closed-loop feedback with vision-language models.

Details

Motivation: Current temporal action localization methods rely on annotated categories, causing domain bias and low efficiency. Existing zero-shot approaches have coarse-grained estimations and open-loop pipelines that limit performance for temporal interaction localization.

Method: Proposes EgoLoc with self-adaptive sampling strategy for VLM reasoning, using both 2D and 3D observations to sample initial guesses based on 3D hand velocities around contact/separation timestamps, and closed-loop feedback from visual and dynamic cues for refinement.

Result: Comprehensive experiments show EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines on public datasets and newly proposed benchmarks.

Conclusion: EgoLoc effectively addresses limitations of existing methods by combining self-adaptive sampling with closed-loop feedback, achieving superior zero-shot temporal interaction localization performance for human-object interactions in egocentric videos.

Abstract: Locating human-object interaction (HOI) actions within video serves as the foundation for multiple downstream tasks, such as human behavior analysis and human-robot skill transfer. Current temporal action localization methods typically rely on annotated action and object categories of interactions for optimization, which leads to domain bias and low deployment efficiency. Although some recent works have achieved zero-shot temporal action localization (ZS-TAL) with large vision-language models (VLMs), their coarse-grained estimations and open-loop pipelines hinder further performance improvements for temporal interaction localization (TIL). To address these issues, we propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. EgoLoc introduces a self-adaptive sampling strategy to generate reasonable visual prompts for VLM reasoning. By absorbing both 2D and 3D observations, it directly samples high-quality initial guesses around the possible contact/separation timestamps of HOI according to 3D hand velocities, leading to high inference accuracy and efficiency. In addition, EgoLoc generates closed-loop feedback from visual and dynamic cues to further refine the localization results. Comprehensive experiments on the publicly available dataset and our newly proposed benchmark demonstrate that EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines. We have released our code and relevant data as open-source at https://github.com/IRMVLab/EgoLoc.
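The velocity-based sampling step can be illustrated with a short sketch. It assumes, purely for illustration, that the hand slows down near contact or separation, so frames at local minima of 3D hand speed are good initial guesses. The abstract does not specify EgoLoc's actual selection rule; the function names and the parameter `k` are hypothetical.

```python
import math

def hand_speeds(positions, fps):
    """Speed between consecutive 3D hand positions, in units per second."""
    return [math.dist(a, b) * fps for a, b in zip(positions, positions[1:])]

def candidate_frames(speeds, k=3):
    """Indices of up to k local speed minima, slowest first, then time-ordered.

    Index i refers to the interval between frame i and frame i + 1.
    """
    minima = [
        i for i in range(1, len(speeds) - 1)
        if speeds[i] <= speeds[i - 1] and speeds[i] <= speeds[i + 1]
    ]
    slowest = sorted(minima, key=lambda i: speeds[i])[:k]
    return sorted(slowest)

# Toy trajectory: the hand moves along x and nearly pauses between frames 2 and 3.
positions = [(0.0, 0, 0), (1.0, 0, 0), (2.0, 0, 0), (2.1, 0, 0), (3.1, 0, 0), (4.1, 0, 0)]
speeds = hand_speeds(positions, fps=1)
```

Frames around the returned indices would then be turned into visual prompts for the VLM, and a closed-loop check could reject candidates that the visual evidence contradicts.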

[246] Motion Matters: Compact Gaussian Streaming for Free-Viewpoint Video Reconstruction

Jiacong Chen, Qingyu Mao, Youneng Bao, Xiandong Meng, Fanyang Meng, Ronggang Wang, Yongsheng Liang

Main category: cs.CV

TL;DR: ComGS is a compact Gaussian streaming framework that reduces storage requirements for dynamic scene reconstruction by modeling object-consistent motion through keypoint-driven representation and adaptive propagation.

Details

Motivation: Existing online 3D Gaussian Splatting methods for free-viewpoint video reconstruction face prohibitive storage requirements due to inefficient point-wise modeling that doesn't exploit motion properties.

Method: Uses keypoint-driven motion representation with sparse motion-sensitive keypoints identified via viewspace gradient difference, adaptive motion-driven mechanism for propagating motion to neighboring points, and error-aware correction for key frame reconstruction.

Result: Achieves remarkable storage reduction of over 159X compared to 3DGStream and 14X compared to QUEEN while maintaining competitive visual fidelity and rendering speed.

Conclusion: ComGS provides a highly storage-efficient solution for dynamic scene reconstruction by leveraging motion locality and consistency through keypoint-driven modeling.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a high-fidelity and efficient paradigm for online free-viewpoint video (FVV) reconstruction, offering viewers rapid responsiveness and immersive experiences. However, existing online methods face prohibitive storage requirements, primarily due to point-wise modeling that fails to exploit motion properties. To address this limitation, we propose a novel Compact Gaussian Streaming (ComGS) framework that leverages the locality and consistency of motion in dynamic scenes to model object-consistent Gaussian point motion through a keypoint-driven motion representation. By transmitting only the keypoint attributes, this framework provides a more storage-efficient solution. Specifically, we first identify a sparse set of motion-sensitive keypoints localized within motion regions using a viewspace gradient difference strategy. Equipped with these keypoints, we propose an adaptive motion-driven mechanism that predicts a spatial influence field for propagating keypoint motion to neighboring Gaussian points with similar motion. Moreover, ComGS adopts an error-aware correction strategy for key frame reconstruction that selectively refines erroneous regions and mitigates error accumulation without unnecessary overhead. Overall, ComGS achieves a remarkable storage reduction of over 159X compared to 3DGStream and 14X compared to the SOTA method QUEEN, while maintaining competitive visual fidelity and rendering speed.

[247] Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models

Francisco Caetano, Christiaan Viviers, Peter H. N. De With, Fons van der Sommen

Main category: cs.CV

TL;DR: SymmFlow is a symmetrical flow matching framework that unifies semantic segmentation, classification, and image generation in a single model using bi-directional consistency and semantic preservation.

Details

Motivation: To create a unified framework that bridges semantic segmentation, classification, and image generation, overcoming limitations of previous approaches that required strict one-to-one mappings between masks and images.

Method: Uses symmetrical flow matching with joint forward and reverse transformations, introduces a training objective to retain semantic information across flows, and supports flexible conditioning with both pixel-level and image-level class labels.

Result: Achieves state-of-the-art FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps, with competitive semantic segmentation performance and promising classification capabilities.

Conclusion: SymmFlow provides an effective unified framework for multiple vision tasks with efficient sampling and semantic preservation, demonstrating strong performance across generation, segmentation, and classification.

Abstract: Flow Matching has emerged as a powerful framework for learning continuous transformations between distributions, enabling high-fidelity generative modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new formulation that unifies semantic segmentation, classification, and image generation within a single model. Using a symmetric learning objective, SymmFlow models forward and reverse transformations jointly, ensuring bi-directional consistency, while preserving sufficient entropy for generative diversity. A new training objective is introduced to explicitly retain semantic information across flows, featuring efficient sampling while preserving semantic structure, allowing for one-step segmentation and classification without iterative refinement. Unlike previous approaches that impose strict one-to-one mapping between masks and images, SymmFlow generalizes to flexible conditioning, supporting both pixel-level and image-level class labels. Experimental results on various benchmarks demonstrate that SymmFlow achieves state-of-the-art performance on semantic image synthesis, obtaining FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps. Additionally, it delivers competitive results on semantic segmentation and shows promising capabilities in classification tasks.

[248] BecomingLit: Relightable Gaussian Avatars with Hybrid Neural Shading

Jonathan Schmidt, Simon Giebenhain, Matthias Niessner

Main category: cs.CV

TL;DR: BecomingLit is a method for creating relightable, high-resolution head avatars that can be rendered from new viewpoints at interactive speeds using a low-cost light stage setup and 3D Gaussian primitives.

Details

Motivation: To enable realistic head avatar reconstruction with relighting capabilities from novel viewpoints at interactive rates, addressing the need for high-quality digital human representations.

Method: Uses a low-cost light stage capture setup to collect multi-view sequences under varying illumination and expressions. Creates a relightable avatar representation based on 3D Gaussian primitives animated with parametric head models and expression-dependent dynamics. Implements hybrid neural shading combining neural diffuse BRDF with analytical specular terms.

Result: Successfully reconstructs disentangled materials from dynamic light stage recordings, enabling all-frequency relighting with point lights and environment maps. Avatars can be animated and controlled from monocular videos. Outperforms existing state-of-the-art methods in relighting and reenactment by significant margins.

Conclusion: BecomingLit provides an effective solution for creating high-quality relightable head avatars that support realistic rendering, animation, and relighting capabilities, demonstrating superior performance compared to current methods.

Abstract: We introduce BecomingLit, a novel method for reconstructing relightable, high-resolution head avatars that can be rendered from novel viewpoints at interactive rates. Therefore, we propose a new low-cost light stage capture setup, tailored specifically towards capturing faces. Using this setup, we collect a novel dataset consisting of diverse multi-view sequences of numerous subjects under varying illumination conditions and facial expressions. By leveraging our new dataset, we introduce a new relightable avatar representation based on 3D Gaussian primitives that we animate with a parametric head model and an expression-dependent dynamics module. We propose a new hybrid neural shading approach, combining a neural diffuse BRDF with an analytical specular term. Our method reconstructs disentangled materials from our dynamic light stage recordings and enables all-frequency relighting of our avatars with both point lights and environment maps. In addition, our avatars can easily be animated and controlled from monocular videos. We validate our approach in extensive experiments on our dataset, where we consistently outperform existing state-of-the-art methods in relighting and reenactment by a significant margin.

[249] Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation

Divyanshu Mishra, Mohammadreza Salehi, Pramit Saha, Olga Patey, Aris T. Papageorghiou, Yuki M. Asano, J. Alison Noble

Main category: cs.CV

TL;DR: DISCOVR is a self-supervised dual-branch framework for cardiac ultrasound video representation learning that combines temporal modeling with fine-grained spatial semantics through semantic cluster distillation.

Details

Motivation: Existing SSL methods struggle with echocardiography due to subtle anatomical structures, complex temporal dynamics, high intersample similarity, sensitivity to low PSNR inputs, and aggressive augmentations that distort clinically relevant features.

Method: Dual-branch framework with clustering-based video encoder for temporal dynamics and online image encoder for spatial semantics, connected through semantic cluster distillation loss that transfers anatomical knowledge from image to video encoder.

Result: Outperforms specialized video anomaly detection methods and state-of-the-art video-SSL baselines on six echocardiography datasets across fetal, pediatric, and adult populations in zero-shot and linear probing setups, achieving superior segmentation transfer and strong LVEF prediction performance.

Conclusion: DISCOVR enables temporally coherent representations enriched with fine-grained semantic understanding for cardiac ultrasound, demonstrating strong downstream performance on clinically relevant tasks.

Abstract: Self-supervised learning (SSL) has achieved major advances in natural images and video understanding, but challenges remain in domains like echocardiography (heart ultrasound) due to subtle anatomical structures, complex temporal dynamics, and the current lack of domain-specific pre-trained models. Existing SSL approaches such as contrastive, masked modeling, and clustering-based methods struggle with high intersample similarity, sensitivity to low PSNR inputs common in ultrasound, or aggressive augmentations that distort clinically relevant features. We present DISCOVR (Distilled Image Supervision for Cross Modal Video Representation), a self-supervised dual branch framework for cardiac ultrasound video representation learning. DISCOVR combines a clustering-based video encoder that models temporal dynamics with an online image encoder that extracts fine-grained spatial semantics. These branches are connected through a semantic cluster distillation loss that transfers anatomical knowledge from the evolving image encoder to the video encoder, enabling temporally coherent representations enriched with fine-grained semantic understanding. Evaluated on six echocardiography datasets spanning fetal, pediatric, and adult populations, DISCOVR outperforms both specialized video anomaly detection methods and state-of-the-art video-SSL baselines in zero-shot and linear probing setups, achieving superior segmentation transfer and strong downstream performance on clinically relevant tasks such as LVEF prediction. Code available at: https://github.com/mdivyanshu97/DISCOVR

[250] Active Contour Models Driven by Hyperbolic Mean Curvature Flow for Image Segmentation

Saiyu Hu, Chunlei He, Jianfeng Zhang, Dexing Kong, Shoujun Huang

Main category: cs.CV

TL;DR: Proposes hyperbolic mean curvature flow-driven active contour models (HMCF-ACMs) to overcome zig-zag degradation in PMCF-ACMs under high-intensity noise, using adjustable acceleration for smooth curve evolution.

Details

Motivation: Parabolic mean curvature flow-driven active contour models (PMCF-ACMs) suffer severe degradation under high-intensity noise due to gradient-descent evolutions exhibiting the zig-zag phenomenon.

Method: Develops HMCF-ACMs with adjustable acceleration field to regulate curve evolution smoothness, proves they are normal flows, establishes numerical equivalence to wave equations via level set formulation, and creates efficient spectral discretization scheme with optimized temporal integration.

Result: HMCF-ACMs achieve superior performance under high-noise conditions with reduced parameter sensitivity, enhanced noise robustness, and improved segmentation accuracy compared to PMCF-ACMs in experiments on natural and medical images.

Conclusion: HMCF-ACMs provide an effective solution to overcome noise-induced degradation in active contour models, offering dual degrees of freedom for adaptive contour selection and velocity field regulation.

Abstract: Parabolic mean curvature flow-driven active contour models (PMCF-ACMs) are widely used for image segmentation, yet they suffer severe degradation under high-intensity noise because gradient-descent evolutions exhibit the well-known zig-zag phenomenon. To overcome this drawback, we propose hyperbolic mean curvature flow-driven ACMs (HMCF-ACMs). This novel framework incorporates an adjustable acceleration field to autonomously regulate curve evolution smoothness, providing dual degrees of freedom for adaptive selection of both initial contours and velocity fields. We rigorously prove that HMCF-ACMs are normal flows and establish their numerical equivalence to wave equations through a level set formulation with signed distance functions. An efficient numerical scheme combining spectral discretization and optimized temporal integration is developed to solve the governing equations, and its stability condition is derived through Fourier analysis. Extensive experiments on natural and medical images validate that HMCF-ACMs achieve superior performance under high-noise conditions, demonstrating reduced parameter sensitivity, enhanced noise robustness, and improved segmentation accuracy compared to PMCF-ACMs.

[251] Curing Semantic Drift: A Dynamic Approach to Grounding Generation in Large Vision-Language Models

Jiahe Chen, Jiaying He, Qiyuan Chen, Qian Shao, Jiahe Ying, Hongxia Xu, Jintai Chen, Jianwei Zheng, Jian Wu

Main category: cs.CV

TL;DR: DLC is a training-free framework that directly addresses semantic drift in LVLMs by dynamically calibrating output logits using real-time visual alignment checks, achieving superior hallucination mitigation while maintaining high efficiency.

Details

Motivation: LVLMs suffer from semantic drift - progressive detachment from visual input that causes hallucinations. Existing methods are either computationally expensive or rely on unreliable heuristic proxies.

Method: DLC introduces a real-time visual referee that performs dual-aspect visual alignment: (1) intrinsic visual relevance of candidate tokens and (2) contextual visual coherence, then dynamically balances these checks against an adaptive baseline to modulate output logits.

Result: Extensive experiments show DLC significantly outperforms existing methods in mitigating hallucinations while maintaining high inference efficiency by avoiding multiple LVLM forward passes.

Conclusion: DLC presents a powerful and practical solution for building more reliable and visually-grounded LVLMs by directly curing semantic drift in an efficient, dynamic manner.

Abstract: Large Vision-Language Models (LVLMs) face a tug-of-war between powerful linguistic priors and visual evidence, often leading to “semantic drift”, the progressive detachment from visual input that we identify as the root cause of hallucination. While several existing training-free decoding strategies have achieved considerable success, they still suffer from inherent limitations. Many are computationally prohibitive, requiring multiple forward passes through the entire LVLM, while others rely on indirect, heuristic-based proxies that are unreliable correlates for a direct semantic conflict. We propose Dynamic Logits Calibration (DLC), a novel training-free framework that is the first to cure semantic drift in a direct, dynamic, and efficient manner. At each decoding step, DLC introduces a real-time visual referee that performs a dual-aspect visual alignment check: it assesses (1) the intrinsic visual relevance of a candidate token and (2) its contextual visual coherence. By dynamically balancing these two checks and evaluating them against an adaptive baseline, DLC surgically modulates the output logits to favor grounded tokens. Extensive experiments show DLC significantly outperforms existing methods in mitigating hallucinations while, crucially, maintaining high inference efficiency by avoiding costly multiple LVLM forward passes. Our work presents a powerful and practical solution for building more reliable and visually-grounded LVLMs. Code will be released on https://github.com/JiaheChen2002/DLC.
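
Illustration (not from the paper): the abstract describes modulating output logits using two visual alignment checks measured against a baseline. The sketch below shows one plausible reading of that idea. The function name, the linear blend of the two scores, and the `weight` and `strength` parameters are all assumptions made for illustration; the paper's actual scoring and its adaptive baseline may differ.

```python
def calibrate_logits(logits, relevance, coherence, baseline, weight=0.5, strength=1.0):
    """Shift each candidate token's logit by how far its visual alignment
    exceeds (or falls below) a baseline.

    logits, relevance, coherence: lists of floats, one entry per candidate token.
    baseline: float reference alignment level (hypothetical stand-in for the
              adaptive baseline mentioned in the abstract).
    """
    calibrated = []
    for logit, rel, coh in zip(logits, relevance, coherence):
        # Blend the two checks; the linear blend is an assumption.
        alignment = weight * rel + (1.0 - weight) * coh
        # Tokens aligned above the baseline are boosted, others are penalized.
        calibrated.append(logit + strength * (alignment - baseline))
    return calibrated


# Two candidate tokens with equal raw logits; only the first is visually grounded.
adjusted = calibrate_logits(
    logits=[2.0, 2.0],
    relevance=[0.75, 0.25],
    coherence=[0.75, 0.25],
    baseline=0.5,
)
print(adjusted)
```

In this toy setup the grounded token ends up with the higher logit, which is the qualitative behavior the abstract describes.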

[252] NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation

Max Gandyra, Alessandro Santonicola, Michael Beetz

Main category: cs.CV

TL;DR: NOCTIS is a training-free framework for instance segmentation of novel objects that combines Grounded-SAM 2 for object proposals and DINOv2 for embeddings, using cyclic thresholding to improve matching accuracy.

Details

Motivation: To create a general model for instance segmentation of novel objects without requiring retraining for different object types, addressing the challenge of handling diverse objects with a single framework.

Method: Integrates pre-trained Grounded-SAM 2 for object proposals and DINOv2 for embeddings, introduces cyclic thresholding for stable matching, appearance score, confidence-based scoring, and RGB-only pipeline.

Result: Outperforms best RGB and RGB-D methods on BOP 2023 challenge datasets for unseen object segmentation without any training or fine-tuning.

Conclusion: NOCTIS provides an effective training-free solution for novel object instance segmentation that surpasses existing methods across multiple datasets.

Abstract: Instance segmentation of novel object instances in RGB images, given some example images for each object, is a well-known problem in computer vision. Designing a model general enough to be employed for all kinds of novel objects without (re-)training has proven to be a difficult task. To handle this, we present a new training-free framework called Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). NOCTIS integrates two pre-trained models: Grounded-SAM 2 for object proposals with precise bounding boxes and corresponding segmentation masks; and DINOv2 for robust class and patch embeddings, due to its zero-shot capabilities. Internally, the proposal-object matching is realized by determining an object matching score based on the similarity of the class embeddings and the average maximum similarity of the patch embeddings with a new cyclic thresholding (CT) mechanism that mitigates unstable matches caused by repetitive textures or visually similar patterns. Beyond CT, NOCTIS introduces: (i) an appearance score that is unaffected by object selection bias; (ii) the usage of the average confidence of the proposals’ bounding box and mask as a scoring component; and (iii) an RGB-only pipeline that performs even better than RGB-D ones. We empirically show that NOCTIS, without further training/fine-tuning, outperforms the best RGB and RGB-D methods regarding the mean AP score on the seven core datasets of the BOP 2023 challenge for the “Model-based 2D segmentation of unseen objects” task.
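
Illustration (not from the paper): the matching score above combines class-embedding similarity with the average maximum similarity of patch embeddings. The sketch below computes both terms with cosine similarity on plain Python lists. The function names and the equal-weight average of the two terms are assumptions, and the cyclic thresholding step is deliberately left out because the abstract does not specify it.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def avg_max_patch_similarity(proposal_patches, template_patches):
    """For each proposal patch take its best-matching template patch,
    then average those best similarities."""
    best = [max(cosine(p, t) for t in template_patches) for p in proposal_patches]
    return sum(best) / len(best)


def matching_score(prop_cls, tmpl_cls, prop_patches, tmpl_patches):
    """Equal-weight blend of class and patch similarity (the blend is an assumption)."""
    class_sim = cosine(prop_cls, tmpl_cls)
    patch_sim = avg_max_patch_similarity(prop_patches, tmpl_patches)
    return 0.5 * (class_sim + patch_sim)


# Toy 2-D embeddings standing in for DINOv2 features.
score = matching_score([1.0, 0.0], [1.0, 0.0],
                       [[1.0, 0.0], [0.0, 1.0]],
                       [[0.0, 1.0], [1.0, 0.0]])
print(score)
```

Taking the maximum per patch makes the patch term insensitive to patch ordering, so a rotated or shifted view of the same object can still score highly.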

[253] Axis-Aligned Document Dewarping

Chaoyun Wang, I-Chao Shen, Takeo Igarashi, Caigui Jiang

Main category: cs.CV

TL;DR: A document dewarping method that leverages axis-aligned geometric properties through training constraints, inference preprocessing, and a new evaluation metric (AAD), achieving state-of-the-art performance.

Details

Motivation: Existing learning-based methods rely heavily on supervised regression without fully leveraging inherent geometric properties of physical documents, particularly the axis-aligned nature of document feature lines.

Method: Three synergistic contributions: axis-aligned geometric constraint for training, axis alignment preprocessing for inference, and Axis-Aligned Distortion (AAD) metric for evaluation that incorporates geometric meaning and human visual perception.

Result: Achieves state-of-the-art performance on multiple benchmarks, improving the AAD metric by 18.2% to 34.5%.

Conclusion: The method effectively leverages axis-aligned geometric properties to enhance document dewarping performance across training, inference, and evaluation phases.

Abstract: Document dewarping is crucial for many applications. However, existing learning-based methods rely heavily on supervised regression with annotated data without fully leveraging the inherent geometric properties of physical documents. Our key insight is that a well-dewarped document is defined by its axis-aligned feature lines. This property aligns with the inherent axis-aligned nature of the discrete grid geometry in planar documents. Harnessing this property, we introduce three synergistic contributions: for the training phase, we propose an axis-aligned geometric constraint to enhance document dewarping; for the inference phase, we propose an axis alignment preprocessing strategy to reduce the dewarping difficulty; and for the evaluation phase, we introduce a new metric, Axis-Aligned Distortion (AAD), that not only incorporates geometric meaning and aligns with human visual perception but also demonstrates greater robustness. As a result, our method achieves state-of-the-art performance on multiple existing benchmarks, improving the AAD metric by 18.2% to 34.5%. The code is publicly available at https://github.com/chaoyunwang/AADD.

[254] FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning

Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Zhuo Li, Xiaobao Wei, Sixiang Chen, Liyun Li, Xianming Liu, Ming Lu, Yang Wang, Shanghang Zhang

Main category: cs.CV

TL;DR: FastDriveVLA is a reconstruction-based visual token pruning framework for autonomous driving VLA models that reduces computational costs by prioritizing foreground information through MAE-style pixel reconstruction.

Details

Motivation: Current visual token pruning methods perform poorly in autonomous driving scenarios, while human drivers focus on relevant foreground areas, suggesting that retaining foreground visual tokens is essential for effective decision-making.

Method: Proposes FastDriveVLA with ReconPruner - a plug-and-play visual token pruner using adversarial foreground-background reconstruction strategy and MAE-style pixel reconstruction to prioritize foreground information. Also introduces nuScenes-FG dataset with 241K image-mask pairs.

Result: Achieves state-of-the-art results on nuScenes open-loop planning benchmark across different pruning ratios, effectively reducing computational costs while maintaining performance.

Conclusion: The proposed reconstruction-based pruning framework successfully addresses the computational bottleneck in VLA models for autonomous driving by focusing on foreground information, demonstrating superior performance over existing methods.

Abstract: Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual tokens of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLM) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes open-loop planning benchmark across different pruning ratios.

[255] A COCO-Formatted Instance-Level Dataset for Plasmodium Falciparum Detection in Giemsa-Stained Blood Smears

Frauke Wilm, Luis Carlos Rivera Monroy, Mathias Öttl, Lukas Mürdter, Leonid Mill, Andreas Maier

Main category: cs.CV

TL;DR: Enhanced NIH malaria dataset with detailed bounding box annotations for object detection training, achieving F1 score of 0.88 for infected cell detection using Faster R-CNN.

Details

Motivation: Address scarcity of datasets with detailed instance-level annotations for automated malaria diagnosis using deep learning object detection methods.

Method: Created enhanced version of NIH malaria dataset with COCO format bounding box annotations, trained Faster R-CNN model to detect infected/non-infected red blood cells and white blood cells.

Result: Cross-validation achieved F1 scores up to 0.88 for infected cell detection, demonstrating robust detection performance.

Conclusion: Automated annotation refinement combined with manual correction can produce high-quality training data for reliable malaria diagnosis.

Abstract: Accurate detection of Plasmodium falciparum in Giemsa-stained blood smears is an essential component of reliable malaria diagnosis, especially in developing countries. Deep learning-based object detection methods have demonstrated strong potential for automated malaria diagnosis, but their adoption is limited by the scarcity of datasets with detailed instance-level annotations. In this work, we present an enhanced version of the publicly available NIH malaria dataset, with detailed bounding box annotations in COCO format to support object detection training. We validated the revised annotations by training a Faster R-CNN model to detect infected and non-infected red blood cells, as well as white blood cells. Cross-validation on the original dataset yielded F1 scores of up to 0.88 for infected cell detection. These results underscore the importance of annotation volume and consistency, and demonstrate that automated annotation refinement combined with targeted manual correction can produce training data of sufficient quality for robust detection performance. The updated annotation set is publicly available via Zenodo: https://doi.org/10.5281/zenodo.17514694
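
Illustration (not from the paper): for readers unfamiliar with COCO-format detection annotations, the sketch below builds a minimal annotation file for one image. The file name, image size, box coordinates, and category names are hypothetical placeholders; the released dataset's actual labels and IDs may differ. The `images` / `annotations` / `categories` layout and the `[x, y, width, height]` box convention follow the standard COCO format.

```python
import json


def build_coco(image_name, width, height, boxes):
    """Build a minimal COCO-style dict for a single image.

    boxes: list of (category_id, x, y, w, h) in pixels, with (x, y) the
           top-left corner of the box.
    """
    categories = [
        {"id": 1, "name": "infected_rbc"},      # hypothetical label names
        {"id": 2, "name": "non_infected_rbc"},
        {"id": 3, "name": "wbc"},
    ]
    annotations = []
    for ann_id, (cat_id, x, y, w, h) in enumerate(boxes, start=1):
        annotations.append({
            "id": ann_id,
            "image_id": 1,
            "category_id": cat_id,
            "bbox": [x, y, w, h],  # COCO boxes are [x, y, width, height]
            "area": w * h,
            "iscrowd": 0,
        })
    return {
        "images": [{"id": 1, "file_name": image_name, "width": width, "height": height}],
        "annotations": annotations,
        "categories": categories,
    }


coco = build_coco("smear_0001.png", 1600, 1200,
                  [(1, 100, 150, 40, 42), (2, 300, 310, 38, 39)])
serialized = json.dumps(coco, indent=2)
print(len(coco["annotations"]), "annotations")
```

A file in this shape can be read by common detection toolchains that accept COCO annotations, which is the practical benefit of publishing the boxes in this format.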

[256] Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling

Xinlei Yu, Chengming Xu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Cheng Yang, Jiangning Zhang, Shuicheng Yan, Xiaobin Hu

Main category: cs.CV

TL;DR: MACT introduces a multi-agent collaboration framework with adaptive test-time scaling for visual document understanding, achieving superior performance with smaller parameter scale through specialized agents and intelligent resource allocation.

Details

Motivation: Monolithic scaling in Vision-Language Models struggles with document understanding due to cognitive complexity, procedural reasoning needs, and factual accuracy requirements, showing diminishing returns.

Method: Decomposes document processing into four specialized agents (planning, execution, judgment, answer) with self-correction loop and agent-wise adaptive test-time scaling that allocates computational resources based on complexity and redundancy.

Result: Achieves 9.9-11.5% performance improvements over base models, consistently ranks top-three across benchmarks, maintains general and mathematical reasoning capabilities while adapting to various document scenarios.

Conclusion: MACT pioneers procedural scaling paradigm shift, demonstrating that multi-agent collaboration with adaptive resource allocation can overcome limitations of monolithic scaling for visual document understanding.

Abstract: The dominant paradigm of monolithic scaling in Vision-Language Models (VLMs) is failing for understanding and reasoning in documents, yielding diminishing returns as it struggles with the inherent need of this domain for document-based procedural reasoning, cognitive complexity, and factual accuracy. To this end, we introduce MACT, a Multi-Agent Collaboration framework with agent-wise adaptive Test-time scaling that pioneers a paradigm shift to procedural scaling, adapting dynamically to the functional entities of visual document understanding and reasoning. MACT decomposes the visual document processing flow into four specialized agents, i.e., planning, execution, judgment, and answer, to resolve cognitive overload and introduce a critical self-correction loop for factual grounding. This collaborative architecture is amplified by an agent-wise adaptive test-time scaling strategy that intelligently allocates computational resources based on the complexity and redundancy of each functionality. Evaluated on multiple visual document understanding benchmarks, MACT achieves superior performance with a smaller parameter scale, adapting effectively to various document scenarios without compromising its general or mathematical reasoning capabilities. The three variants of MACT consistently attain top-three average performance rankings, with average performance enhancements of 9.9-11.5% over the base models. The source code will be released publicly.

Sara Shoouri, Morteza Tavakoli Taba, Hun-Seok Kim

Main category: cs.CV

TL;DR: A predictive adaptive LiDAR scanning framework that reduces energy consumption by 65% while maintaining 3D object detection performance by focusing scans on predicted regions of interest.

Details

Motivation: Conventional LiDAR sensors perform dense, stateless scans ignoring temporal continuity, leading to sensing redundancy and excessive power consumption on resource-constrained platforms.

Method: Uses a lightweight predictor network to distill historical spatial-temporal contexts into query embeddings, guiding a differentiable Mask Generator with Gumbel-Softmax sampling to produce binary masks for critical regions of interest.

Result: Reduces LiDAR energy consumption by over 65% while maintaining competitive or superior 3D object detection performance on nuScenes and Lyft benchmarks compared to traditional dense scanning methods.

Conclusion: The adaptive scanning framework effectively addresses LiDAR inefficiency by leveraging temporal continuity to focus scanning on informative regions, enabling significant energy savings without compromising detection performance.

Abstract: Multi-sensor fusion using LiDAR and RGB cameras significantly enhances 3D object detection. However, conventional LiDAR sensors perform dense, stateless scans, ignoring the strong temporal continuity in real-world scenes. This leads to substantial sensing redundancy and excessive power consumption, limiting their practicality on resource-constrained platforms. To address this inefficiency, we propose a predictive, history-aware adaptive scanning framework that anticipates informative regions of interest (ROI) based on past observations. Our approach introduces a lightweight predictor network that distills historical spatial and temporal contexts into refined query embeddings. These embeddings guide a differentiable Mask Generator network, which leverages Gumbel-Softmax sampling to produce binary masks identifying critical ROIs for the upcoming frame. Our method significantly reduces unnecessary data acquisition by concentrating dense LiDAR scanning only within these ROIs and sparsely sampling elsewhere. Experiments on nuScenes and Lyft benchmarks demonstrate that our adaptive scanning strategy reduces LiDAR energy consumption by over 65% while maintaining competitive or even superior 3D object detection performance compared to traditional LiDAR-camera fusion methods with dense LiDAR scanning.
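
Illustration (not from the paper): Gumbel-Softmax sampling, mentioned above, turns per-region logits into a near-discrete keep/skip decision while remaining differentiable during training. The sketch below shows only the forward sampling step in plain Python. The two-class (keep, skip) layout, the function name, and the temperature value are assumptions, and a real training setup would use an autodiff framework, typically with a straight-through estimator for the hard mask.

```python
import math
import random


def gumbel_softmax_mask(logits, tau=1.0, rng=None):
    """Sample a binary keep/skip mask from per-region logits.

    logits: list of (keep_logit, skip_logit) pairs, one per scan region.
    Returns (soft, hard): soft keep-probabilities and the thresholded 0/1 mask.
    """
    rng = rng or random.Random()
    soft, hard = [], []
    for keep_logit, skip_logit in logits:
        noisy = []
        for logit in (keep_logit, skip_logit):
            # Clamp the uniform sample away from 0 and 1 so both logs are finite.
            u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
            gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
            noisy.append((logit + gumbel) / tau)
        # Numerically stable two-way softmax over the perturbed logits.
        top = max(noisy)
        exp_keep = math.exp(noisy[0] - top)
        exp_skip = math.exp(noisy[1] - top)
        p_keep = exp_keep / (exp_keep + exp_skip)
        soft.append(p_keep)
        hard.append(1 if p_keep >= 0.5 else 0)
    return soft, hard


# Three regions: strongly "keep", strongly "skip", and undecided.
soft, hard = gumbel_softmax_mask([(50.0, -50.0), (-50.0, 50.0), (0.0, 0.0)],
                                 rng=random.Random(0))
print(hard)
```

Lower temperatures push the soft probabilities toward 0 or 1, which is why the technique suits binary scan masks.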

[258] Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning

Yiheng Li, Zichang Tan, Zhen Lei, Xu Zhou, Yang Yang

Main category: cs.CV

TL;DR: IAPL introduces dynamic prompt learning for AI-generated image detection, adapting prompts per test image to improve generalization to unseen generators.

Details

Motivation: Current methods struggle with generalization to unseen AI generators due to limited patterns from training data and inability to adapt to evolving forgery traits.

Method: Image-Adaptive Prompt Learning (IAPL) dynamically adjusts prompts using conditional information from CNN feature extractors and test-time adaptive tokens optimized through multi-view consistency.

Result: Achieves state-of-the-art performance with 95.61% mean accuracy on UniversalFakeDetect and 96.7% on GenImage datasets.

Conclusion: Dynamic prompt adaptation significantly enhances robustness and adaptability to diverse forged images from various generators.

Abstract: In AI-generated image detection, current cutting-edge methods typically adapt pre-trained foundation models through partial-parameter fine-tuning. However, these approaches often struggle to generalize to forgeries from unseen generators, as the fine-tuned models capture only limited patterns from training data and fail to reflect the evolving traits of new ones. To overcome this limitation, we propose Image-Adaptive Prompt Learning (IAPL), a novel paradigm that dynamically adjusts the prompts fed into the encoder according to each testing image, rather than fixing them after training. This design significantly enhances robustness and adaptability to diverse forged images. The dynamic prompts integrate conditional information with test-time adaptive tokens through a lightweight learnable scaling factor. The conditional information is produced by a Conditional Information Learner, which leverages CNN-based feature extractors to model both forgery-specific and general conditions. The test-time adaptive tokens are optimized during inference on a single sample by enforcing prediction consistency across multiple views, ensuring that the parameters align with the current image. For the final decision, the optimal input with the highest prediction confidence is selected. Extensive experiments show that IAPL achieves state-of-the-art performance, with mean accuracies of 95.61% and 96.7% on the widely used UniversalFakeDetect and GenImage datasets, respectively. Code and weights will be released at https://github.com/liyih/IAPL.
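
The final decision rule described in the abstract (pick the view with the highest prediction confidence) can be sketched as follows. This is an illustration only: the abstract does not give the consistency objective, so view_disagreement uses variance across views as a stand-in, and both function names are hypothetical.

```python
def select_most_confident_view(view_probs):
    """Pick the view whose prediction is most confident.

    view_probs: P(fake) predicted for each augmented view of one image.
    Returns (view_index, p_fake, label).
    """
    if not view_probs:
        raise ValueError("at least one view is required")
    # Confidence of a binary prediction is the larger of P(fake) and P(real).
    confidences = [max(p, 1.0 - p) for p in view_probs]
    best = max(range(len(view_probs)), key=confidences.__getitem__)
    p_fake = view_probs[best]
    return best, p_fake, ("fake" if p_fake >= 0.5 else "real")


def view_disagreement(view_probs):
    """Variance of P(fake) across views (an assumed consistency measure)."""
    mean = sum(view_probs) / len(view_probs)
    return sum((p - mean) ** 2 for p in view_probs) / len(view_probs)
```

In a test-time adaptation loop, a quantity like view_disagreement would be minimized with respect to the adaptive tokens before the selection step is applied.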

[259] TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models

Chenghao Liu, Jiachen Zhang, Chengxuan Li, Zhimu Zhou, Shixin Wu, Songfang Huang, Huiling Duan

Main category: cs.CV

TL;DR: TTF is a training-free method that enhances VLA models by integrating temporal visual information through selective fusion of historical and current frames, improving robustness and performance across various robotic manipulation tasks.

Details

Motivation: Current VLA models process visual inputs frame-by-frame, discarding temporal information and making them vulnerable to visual noise while ignoring coherence between consecutive frames in manipulation sequences.

Method: Temporal Token Fusion (TTF) employs dual-dimension detection combining grayscale pixel difference analysis with attention-based semantic relevance assessment, using hard fusion strategies and keyframe anchoring to prevent error accumulation.

Result: Consistent improvements across benchmarks: 4.0 percentage points average on LIBERO (72.4% vs 68.4% baseline), 4.8% relative improvement on SimplerEnv, and 8.7% relative improvement on real robot tasks. Method is model-agnostic across OpenVLA and VLA-Cache architectures.

Conclusion: TTF demonstrates that selective temporal token fusion enhances VLA performance without training, and reveals that selective Query matrix reuse in attention mechanisms improves performance, suggesting promising directions for computational acceleration while maintaining task success.

Abstract: Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approach that intelligently integrates historical and current visual representations to enhance VLA inference quality. Our method employs dual-dimension detection combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, enabling selective temporal token fusion through hard fusion strategies and keyframe anchoring to prevent error accumulation. Comprehensive experiments across LIBERO, SimplerEnv, and real robot tasks demonstrate consistent improvements: 4.0 percentage points average on LIBERO (72.4% vs 68.4% baseline), cross-environment validation on SimplerEnv (4.8% relative improvement), and 8.7% relative improvement on real robot tasks. Our approach proves model-agnostic, working across OpenVLA and VLA-Cache architectures. Notably, TTF reveals that selective Query matrix reuse in attention mechanisms enhances rather than compromises performance, suggesting promising directions for direct KQV matrix reuse strategies that achieve computational acceleration while improving task success rates.
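
The selective fusion idea can be sketched per visual patch as below. The abstract does not state the exact decision rule, so this sketch assumes a patch keeps its cached token only when it is both visually static and low in attention relevance; the thresholds, the keyframe interval, and the name fuse_tokens are hypothetical.

```python
def fuse_tokens(prev_tokens, curr_tokens, pixel_diff, attn_score, step,
                diff_thresh=0.03, attn_thresh=0.5, keyframe_interval=5):
    """Hard-fuse cached and freshly computed patch tokens.

    pixel_diff: per-patch mean absolute grayscale difference between frames.
    attn_score: per-patch attention-based relevance score.
    """
    # Keyframe anchoring: periodically discard the cache and use only fresh
    # tokens, so errors from reused tokens cannot accumulate indefinitely.
    if prev_tokens is None or step % keyframe_interval == 0:
        return list(curr_tokens)
    fused = []
    for prev, curr, diff, attn in zip(prev_tokens, curr_tokens,
                                      pixel_diff, attn_score):
        changed = diff > diff_thresh    # pixel dimension
        relevant = attn > attn_thresh   # semantic dimension
        # Hard fusion: each patch takes exactly one source, with no averaging.
        fused.append(curr if (changed or relevant) else prev)
    return fused
```

Because fusion is a per-patch selection rather than a weighted average, the cached tokens are reused verbatim, which is what makes the approach training-free.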

[260] Duplex-GS: Proxy-Guided Weighted Blending for Real-Time Order-Independent Gaussian Splatting

Weihang Liu, Yuke Li, Yuxuan Li, Jingyi Yu, Xin Lou

Main category: cs.CV

TL;DR: Duplex-GS is a dual-hierarchy framework for 3D Gaussian Splatting that uses proxy Gaussian representations and order-independent rendering to achieve photorealistic results with real-time performance, reducing radix sort overhead by 52.2-86.9% and achieving 1.5-4x speedup over existing methods.

Details

Motivation: Current 3D Gaussian Splatting methods rely on computationally expensive sequential alpha-blending operations that cause significant overhead, especially on resource-constrained platforms.

Method: Proposes a dual-hierarchy framework with proxy Gaussian representations, cell proxies for local Gaussian management, cell search rasterization, and integration with Order-Independent Transparency (OIT) using a physically inspired weighted sum rendering technique.

Result: Achieves 1.5-4x speedup over existing OIT-based Gaussian Splatting approaches, reduces radix sort overhead by 52.2-86.9% without quality degradation, eliminates “popping” and “transparency” artifacts, and maintains high-quality rendering across diverse scenarios.

Conclusion: The OIT rendering paradigm in Gaussian Splatting provides substantial improvements in both accuracy and efficiency, demonstrating robustness across multi-scale training views and large-scale environments.

Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated remarkable rendering fidelity and efficiency. However, these methods still rely on computationally expensive sequential alpha-blending operations, resulting in significant overhead, particularly on resource-constrained platforms. In this paper, we propose Duplex-GS, a dual-hierarchy framework that integrates proxy Gaussian representations with order-independent rendering techniques to achieve photorealistic results while sustaining real-time performance. To mitigate the overhead caused by view-adaptive radix sort, we introduce cell proxies for local Gaussian management and propose cell search rasterization for further acceleration. By seamlessly combining our framework with Order-Independent Transparency (OIT), we develop a physically inspired weighted sum rendering technique that simultaneously eliminates “popping” and “transparency” artifacts, yielding substantial improvements in both accuracy and efficiency. Extensive experiments on a variety of real-world datasets demonstrate the robustness of our method across diverse scenarios, including multi-scale training views and large-scale environments. Our results validate the advantages of the OIT rendering paradigm in Gaussian Splatting, achieving high-quality rendering with a 1.5x to 4x speedup over existing OIT-based Gaussian Splatting approaches and a 52.2% to 86.9% reduction of the radix sort overhead without quality degradation.
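
To show why a weighted sum removes the need for sorting, the sketch below contrasts classic front-to-back alpha blending with a generic weighted-blended OIT formula. This is not the paper's physically inspired weighting, which the abstract does not specify; the per-fragment weights and both function names are assumptions, and colors are single-channel scalars for brevity.

```python
def sorted_alpha_blend(fragments, background):
    """Front-to-back alpha compositing; the result depends on fragment order."""
    color_acc, transmittance = 0.0, 1.0
    for color, alpha, _weight in fragments:
        color_acc += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return color_acc + transmittance * background


def weighted_sum_blend(fragments, background):
    """Generic weighted-blended OIT; sums and products commute, so any
    ordering of the fragments gives the same pixel (up to rounding)."""
    numerator, denominator, transmittance = 0.0, 0.0, 1.0
    for color, alpha, weight in fragments:
        numerator += weight * alpha * color
        denominator += weight * alpha
        transmittance *= 1.0 - alpha
    if denominator == 0.0:
        return background
    # Weighted average color, composited over the background using the
    # total coverage 1 - prod(1 - alpha_i).
    average = numerator / denominator
    return average * (1.0 - transmittance) + background * transmittance
```

Since no per-pixel ordering is needed, the depth sort can be dropped or coarsened, which is where the reduction in radix sort overhead comes from in OIT-style pipelines.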

[261] Efficient Bayer-Domain Video Computer Vision with Fast Motion Estimation and Learned Perception Residual

Haichao Wang, Jiangtao Wen, Yuxing Han

Main category: cs.CV

TL;DR: Proposes an efficient video computer vision framework that jointly optimizes front-end by removing ISP and using Bayer raw data directly, and back-end with fast motion estimation and residual networks to reduce redundancy.

Details

Motivation: Address computational burdens in video computer vision systems from unnecessary processing and temporal redundancy while maintaining accuracy with minimal extra computation.

Method: Remove traditional ISP and feed Bayer raw measurements directly to Bayer-domain models; introduce fast parallel motion estimation for temporal correspondence; use lightweight perception residual networks to refine propagated features.

Result: Achieves substantial acceleration with only minor performance degradation across multiple models and tasks.

Conclusion: The proposed framework effectively reduces computational burden in video computer vision while maintaining accuracy through joint optimization of front-end and back-end processing.

Abstract: Video computer vision systems face substantial computational burdens arising from two fundamental challenges: eliminating unnecessary processing and reducing temporal redundancy in back-end inference while maintaining accuracy with minimal extra computation. To address these issues, we propose an efficient video computer vision framework that jointly optimizes both the front end and back end of the pipeline. On the front end, we remove the traditional image signal processor (ISP) and feed Bayer raw measurements directly into Bayer-domain vision models, avoiding costly human-oriented ISP operations. On the back end, we introduce a fast and highly parallel motion estimation algorithm that extracts inter-frame temporal correspondence to avoid redundant computation. To mitigate artifacts caused by motion inaccuracies, we further employ lightweight perception residual networks that directly learn perception-level residuals and refine the propagated features. Experiments across multiple models and tasks demonstrate that our system achieves substantial acceleration with only minor performance degradation.

[262] MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing

Jinghan Yu, Junhao Xiao, Zhiyuan Ma, Yue Ma, Kaiqi Liu, Yuhan Wang, Daizong Liu, Xianghao Meng, Jianjun Li

Main category: cs.CV

TL;DR: MILD proposes a multi-layer diffusion framework for human erasing that handles complex scenarios like occlusions and multi-instance interactions through independent denoising pathways, human morphology guidance, and spatially-modulated attention.

Details

Motivation: Existing mask-guided human erasing methods struggle with complex scenarios involving human-human occlusion, human-object entanglement, and human-background interference due to lack of large-scale multi-instance datasets and effective spatial decoupling.

Method: Proposes Multi-Layer Diffusion (MILD) with independent denoising pathways for separate reconstruction of foreground instances and background. Introduces Human Morphology Guidance module incorporating pose, parsing, and spatial relationships, and Spatially-Modulated Attention using spatial mask priors to modulate attention across semantic regions.

Result: Experiments show MILD significantly outperforms existing methods in handling complex scenarios with occlusions and multi-instance interactions.

Conclusion: MILD effectively addresses challenges in human erasing through multi-layer decomposition, human-centric guidance, and attention modulation, achieving superior performance in complex scenarios.

Abstract: Recent years have witnessed the success of diffusion models in image customization tasks. However, existing mask-guided human erasing methods still struggle in complex scenarios such as human-human occlusion, human-object entanglement, and human-background interference, mainly due to the lack of large-scale multi-instance datasets and effective spatial decoupling to separate foreground from background. To bridge these gaps, we curate the MILD dataset capturing diverse poses, occlusions, and complex multi-instance interactions. We then define the Cross-Domain Attention Gap (CAG), an attention-gap metric to quantify semantic leakage. On top of these, we propose Multi-Layer Diffusion (MILD), which decomposes the generation process into independent denoising pathways, enabling separate reconstruction of each foreground instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, a plug-and-play module that incorporates pose, parsing, and spatial relationships into the diffusion process to improve structural awareness and restoration quality. Additionally, we present Spatially-Modulated Attention, an adaptive mechanism that leverages spatial mask priors to modulate attention across semantic regions, further widening the CAG to effectively minimize boundary artifacts and mitigate semantic leakage. Experiments show that MILD significantly outperforms existing methods. Datasets and code are publicly available at: https://mild-multi-layer-diffusion.github.io/.

[263] Adaptive Cache Enhancement for Test-Time Adaptation of Vision-Language Models

Khanh-Binh Nguyen, Phuoc-Nguyen Bui, Hyunseung Choo, Duc Thanh Nguyen

Main category: cs.CV

TL;DR: ACE framework improves vision-language model adaptation to distribution shifts by using dynamic class-specific thresholds and adaptive cache management for robust test-time adaptation.

Details

Motivation: Vision-language models suffer performance degradation under distribution shifts, and existing cache-based TTA methods face unreliable confidence metrics and rigid decision boundaries that limit adaptation effectiveness.

Method: Adaptive Cache Enhancement (ACE) framework constructs robust cache by storing high-confidence/low-entropy embeddings per class using dynamic class-specific thresholds initialized from zero-shot statistics and refined with exponential moving average and exploration-augmented updates.

Result: Extensive experiments on 15 benchmark datasets show ACE achieves state-of-the-art performance with superior robustness and generalization in challenging out-of-distribution scenarios.

Conclusion: ACE effectively addresses limitations of existing TTA methods by enabling adaptive class-wise decision boundaries and reliable cache construction, demonstrating robust performance across diverse visual distributions.

Abstract: Vision-language models (VLMs) exhibit remarkable zero-shot generalization but suffer performance degradation under distribution shifts in downstream tasks, particularly in the absence of labeled data. Test-Time Adaptation (TTA) addresses this challenge by enabling online optimization of VLMs during inference, eliminating the need for annotated data. Cache-based TTA methods exploit historical knowledge by maintaining a dynamic memory cache of low-entropy or high-confidence samples, promoting efficient adaptation to out-of-distribution data. Nevertheless, these methods face two critical challenges: (1) unreliable confidence metrics under significant distribution shifts, resulting in error accumulation within the cache and degraded adaptation performance; and (2) rigid decision boundaries that fail to accommodate substantial distributional variations, leading to suboptimal predictions. To overcome these limitations, we introduce the Adaptive Cache Enhancement (ACE) framework, which constructs a robust cache by selectively storing high-confidence or low-entropy image embeddings per class, guided by dynamic, class-specific thresholds initialized from zero-shot statistics and iteratively refined using an exponential moving average and exploration-augmented updates. This approach enables adaptive, class-wise decision boundaries, ensuring robust and accurate predictions across diverse visual distributions. Extensive experiments on 15 diverse benchmark datasets demonstrate that ACE achieves state-of-the-art performance, delivering superior robustness and generalization compared to existing TTA methods in challenging out-of-distribution scenarios.
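
A minimal sketch of a cache gated by class-specific thresholds that are refined with an exponential moving average is shown below. It assumes the threshold is updated on every prediction for a class, not only on accepted ones, and it omits the exploration-augmented updates, which the abstract does not detail. The class name AdaptiveCache, the momentum, and the capacity are hypothetical.

```python
class AdaptiveCache:
    """Per-class cache of embeddings gated by EMA-refined thresholds."""

    def __init__(self, init_thresholds, momentum=0.9, capacity=3):
        # init_thresholds: {class_id: threshold}, e.g. from zero-shot statistics.
        self.thresholds = dict(init_thresholds)
        self.momentum = momentum
        self.capacity = capacity
        self.cache = {c: [] for c in init_thresholds}

    def update(self, pred_class, confidence, embedding):
        """Try to cache one test sample; returns True if it was stored."""
        accepted = confidence >= self.thresholds[pred_class]
        if accepted:
            slot = self.cache[pred_class]
            slot.append((confidence, embedding))
            # Keep only the most confident entries for this class.
            slot.sort(key=lambda item: item[0], reverse=True)
            del slot[self.capacity:]
        # EMA refinement: the threshold drifts toward the confidences the
        # model actually produces for this class on the test stream.
        m = self.momentum
        self.thresholds[pred_class] = (
            m * self.thresholds[pred_class] + (1.0 - m) * confidence
        )
        return accepted
```

Under a distribution shift that lowers confidence for one class, that class's threshold relaxes over time while the thresholds of unaffected classes stay put, which is the class-wise adaptivity a single global threshold cannot provide.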

[264] Generative AI in Map-Making: A Technical Exploration and Its Implications for Cartographers

Claudio Affolter, Sidi Wu, Yizi Chen, Lorenz Hurni

Main category: cs.CV

TL;DR: This paper presents a GenAI model that integrates vector data with text prompts to generate accurate maps in controlled styles, addressing limitations of existing diffusion models in spatial composition control.

Details

Motivation: Traditional GIS-based map-making requires domain expertise and is time-consuming, while current generative AI models struggle with accurate map creation due to limited control over spatial composition and semantic layout.

Method: Integration of vector data to guide map generation in different styles specified by textual prompts, creating the first model to generate accurate maps in controlled styles, and developing a web application for improved usability.

Result: User study with professional cartographers showed potential for both non-expert users and professionals to create maps more efficiently, with positive feedback on fidelity and usability.

Conclusion: The approach demonstrates the potential of GenAI models in democratizing map-making while outlining technical improvements and emphasizing the evolving role of cartographers in AI-assisted map-making.

Abstract: Traditional map-making relies heavily on Geographic Information Systems (GIS), which require domain expertise and are time-consuming, especially for repetitive tasks. Recent advances in generative AI (GenAI), particularly image diffusion models, offer new opportunities for automating and democratizing the map-making process. However, these models struggle with accurate map creation due to limited control over spatial composition and semantic layout. To address this, we integrate vector data to guide map generation in different styles, specified by textual prompts. Our model is the first to generate accurate maps in controlled styles, and we have integrated it into a web application to improve its usability and accessibility. We conducted a user study with professional cartographers to assess the fidelity of generated maps, the usability of the web application, and the implications of ever-emerging GenAI in map-making. The findings suggest the potential of our application and, more generally, of GenAI models to help both non-expert users and professionals create maps more efficiently. We also outline further technical improvements and emphasize the new role of cartographers in advancing the paradigm of AI-assisted map-making. The code and pre-trained models are available at https://github.com/claudaff/generative-ai-mapmaking/.

[265] RiverScope: High-Resolution River Masking Dataset

Rangel Daroya, Taylor Rowley, Jonathan Flores, Elisa Friedmann, Fiona Bennitt, Heejin An, Travis Simmons, Marissa Jean Hughes, Camryn L Kluetmeier, Solomon Kica, J. Daniel Vélez, Sarah E. Esenther, Thomas E. Howard, Yanqi Ye, Audrey Turcotte, Colin Gleason, Subhransu Maji

Main category: cs.CV

TL;DR: RiverScope is a high-resolution dataset for monitoring rivers and surface water, addressing challenges in capturing narrow or sediment-rich rivers with low-resolution satellite data.

Details

Motivation: Monitoring rivers and surface water at fine spatial and temporal scales is challenging, especially for narrow or sediment-rich rivers poorly captured by low-resolution satellite data, which impacts ecosystems, agriculture, disaster resilience, and sustainable development.

Method: Developed RiverScope dataset with 1,145 high-resolution images and expert-labeled river/surface water masks, co-registered with Sentinel-2, SWOT, and SWORD. Evaluated deep networks across architectures, pretraining strategies, and training datasets, combining transfer learning with multispectral PlanetScope channels via learned adaptors.

Result: Established first global high-resolution benchmark for river width estimation with median error of 7.2 meters, significantly outperforming existing satellite-derived methods. Best-performing models effectively combine transfer learning with multispectral data.

Conclusion: RiverScope provides valuable resource for fine-scale and multi-sensor hydrological modeling, supporting climate adaptation and sustainable water management by enabling better monitoring of surface water dynamics.

Abstract: Surface water dynamics play a critical role in Earth’s climate system, influencing ecosystems, agriculture, disaster resilience, and sustainable development. Yet monitoring rivers and surface water at fine spatial and temporal scales remains challenging – especially for narrow or sediment-rich rivers that are poorly captured by low-resolution satellite data. To address this, we introduce RiverScope, a high-resolution dataset developed through collaboration between computer science and hydrology experts. RiverScope comprises 1,145 high-resolution images (covering 2,577 square kilometers) with expert-labeled river and surface water masks, requiring over 100 hours of manual annotation. Each image is co-registered with Sentinel-2, SWOT, and the SWOT River Database (SWORD), enabling the evaluation of cost-accuracy trade-offs across sensors – a key consideration for operational water monitoring. We also establish the first global, high-resolution benchmark for river width estimation, achieving a median error of 7.2 meters – significantly outperforming existing satellite-derived methods. We extensively evaluate deep networks across multiple architectures (e.g., CNNs and transformers), pretraining strategies (e.g., supervised and self-supervised), and training datasets (e.g., ImageNet and satellite imagery). Our best-performing models combine the benefits of transfer learning with the use of all the multispectral PlanetScope channels via learned adaptors. RiverScope provides a valuable resource for fine-scale and multi-sensor hydrological modeling, supporting climate adaptation and sustainable water management.

[266] Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding

Weikai Huang, Jieyu Zhang, Taoyang Jia, Chenhao Zheng, Ziqi Gao, Jae Sung Park, Winson Han, Ranjay Krishna

Main category: cs.CV

TL;DR: SOC is a synthetic data generation pipeline that composes high-quality object segments using geometric and camera augmentations, outperforming real and existing synthetic datasets in visual grouping tasks.

Details

Motivation: Real datasets for visual grouping tasks are costly, biased, and hard to scale, while existing synthetic datasets lack flexibility, accuracy, and compositional diversity.

Method: Object-centric composition strategy using 3D geometric layout augmentation, camera configuration augmentation, generative harmonization, and mask-area-weighted blending to create accurate synthetic object segments.

Result: Models trained on 100K SOC images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and synthetic pipelines by +24-36%, achieving +10.9 AP on LVIS and +8.4 NAcc on gRefCOCO. Also boosts performance in low-data scenarios (+6.59 AP on 1% COCO).

Conclusion: SOC provides an accurate, scalable synthetic data solution that outperforms real datasets and enables controllable dataset construction for various use cases, including fine-grained attribute discrimination tasks.

Abstract: Visual grouping – operationalized through tasks such as instance segmentation, visual grounding, and object detection – enables applications ranging from robotic perception to photo editing. These fundamental problems in computer vision are powered by large-scale, painstakingly annotated datasets. Despite their impact, these datasets are costly to build, biased in coverage, and difficult to scale. Synthetic datasets offer a promising alternative but struggle with flexibility, accuracy, and compositional diversity. We introduce Synthetic Object Compositions (SOC), an accurate and scalable data synthesis pipeline via a novel object-centric composition strategy. It composes high-quality synthetic object segments into new images using 3D geometric layout augmentation and camera configuration augmentation with generative harmonization and mask-area-weighted blending, yielding accurate and diverse masks, boxes, and referring expressions. Models trained on just 100K of our synthetic images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and synthetic pipelines (Copy-Paste, X-Paste, SynGround, SegGen) by +24-36% – achieving +10.9 AP on LVIS and +8.4 NAcc on gRefCOCO. Beyond the general open-vocabulary setup, SOC also enables controllable dataset construction for different use cases and boosts performance in both low-data and closed-vocabulary scenarios. Augmenting LVIS and COCO with synthetic object segments delivers strong performance across different real-data scales and yields even greater improvements under extremely limited real-data conditions, including +6.59 AP on a 1% COCO data setup. Furthermore, this controllability enables targeted data generation for intra-class referring, a diagnostic grounding task we propose that requires fine-grained attribute discrimination.

[267] Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement

Jiesi Hu, Jianfeng Cao, Yanwu Yang, Chenfei Ye, Yixuan Zhang, Hanyang Peng, Ting Ma

Main category: cs.CV

TL;DR: Medverse is a universal in-context learning model for 3D medical imaging that achieves high-fidelity predictions with global anatomical understanding across diverse tasks and anatomical regions.

Details

Motivation: Current ICL models for medical imaging cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and lack unified training across diverse medical imaging tasks and anatomical regions, leaving the full potential of ICL underexplored.

Method: Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, and uses a blockwise cross-attention module for long-range interactions while preserving computational efficiency through spatial sparsity.

Result: Medverse substantially outperforms existing ICL baselines on held-out datasets covering unseen clinical centers, organs, species, and imaging modalities, demonstrating its universal applicability.

Conclusion: Medverse establishes a novel paradigm for in-context learning in medical imaging and enables universal medical image analysis across diverse tasks without retraining.

Abstract: In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present Medverse, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights will be made publicly available. Our model is publicly available at https://github.com/jiesihu/Medverse.

[268] Explicit Multimodal Graph Modeling for Human-Object Interaction Detection

Wenxuan Ji, Haichao Shi, Xiao-Yu Zhang

Main category: cs.CV

TL;DR: Proposes MGNM, a Graph Neural Network-based approach for Human-Object Interaction detection that explicitly models relational structures through multimodal graph networks, achieving SOTA performance.

Details

Motivation: Transformer-based methods lack explicit modeling of relational structures in HOI detection, while GNNs are inherently better suited for capturing relationships between human-object pairs.

Method: Multimodal Graph Network Modeling (MGNM) with four-stage graph structure and multi-level feature interaction mechanism using visual and language features for enhanced information propagation.

Result: Achieves state-of-the-art performance on HICO-DET and V-COCO benchmarks, with significant gains when using advanced object detectors and balanced performance across rare and non-rare classes.

Conclusion: GNN-based relational modeling with multimodal features effectively enhances HOI detection performance compared to Transformer approaches.

Abstract: Transformer-based methods have recently become the prevailing approach for Human-Object Interaction (HOI) detection. However, the Transformer architecture does not explicitly model the relational structures inherent in HOI detection, which impedes the recognition of interactions. In contrast, Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. Therefore, in this paper, we propose Multimodal Graph Network Modeling (MGNM) that leverages GNN-based relational structures to enhance HOI detection. Specifically, we design a multimodal graph network framework that explicitly models the HOI task in a four-stage graph structure. Furthermore, we introduce a multi-level feature interaction mechanism within our graph network. This mechanism leverages multi-level visual and language features to enhance information propagation across human-object pairs. Consequently, our proposed MGNM achieves state-of-the-art (SOTA) performance on two widely used benchmarks: HICO-DET and V-COCO. Moreover, when integrated with a more advanced object detector, our method demonstrates a significant performance gain and maintains an effective balance between rare and non-rare classes.

[269] ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models

Zhaoyang Li, Zhan Ling, Yuchen Zhou, Litian Gong, Erdem Bıyık, Hao Su

Main category: cs.CV

TL;DR: ORIC framework addresses LVLM failures in recognizing objects in incongruous contexts through systematic data generation and fine-tuning, improving reliability.

Details

Motivation: LVLMs often miss obvious objects or hallucinate nonexistent ones in atypical scenes, particularly when objects appear unexpectedly or fail to appear in expected contexts.

Method: Introduced ORIC framework with two strategies: LLM-guided sampling for hard-to-recognize objects and CLIP-guided sampling for plausible but absent objects. Applied to MSCOCO to create ORIC-Bench and training data, then fine-tuned models with Visual Reinforcement Fine-Tuning.

Result: Evaluation of 18 LVLMs and 2 open-vocabulary detectors showed substantial performance drops and bias patterns under incongruous contexts. Fine-tuning Qwen3-VL-8B-Instruct with Visual Reinforcement Fine-Tuning on 600 ORIC-style samples improved results on ORIC-Bench, AMBER, and HallusionBench.

Conclusion: Contextual incongruity is a key source of uncertainty in LVLMs, and the ORIC framework provides tools for building more reliable vision-language models.

Abstract: Large Vision-Language Models (LVLMs) excel at captioning, visual question answering, and robotics by combining vision and language, yet they often miss obvious objects or hallucinate nonexistent ones in atypical scenes. We examine these failures through the lens of uncertainty, focusing on contextual incongruity, where objects appear unexpectedly or fail to appear in expected contexts, and show that such cases increase recognition difficulty for state-of-the-art LVLMs. To study this regime, we introduce the Object Recognition in Incongruous Context (ORIC) framework, which constructs incongruous object-context pairs through two complementary strategies: (1) LLM-guided sampling to identify hard-to-recognize objects present in the image and (2) CLIP-guided sampling to mine plausible but absent ones. Applied to MSCOCO, ORIC produces ORIC-Bench and ORIC-style training data. Evaluating 18 LVLMs and 2 open-vocabulary detectors reveals substantial performance drops and bias patterns under incongruous contexts. Fine-tuning Qwen3-VL-8B-Instruct with Visual Reinforcement Fine-Tuning on 600 ORIC-style samples improves results on ORIC-Bench, AMBER, and HallusionBench. Overall, we show that contextual incongruity is a key source of uncertainty and provide tools for more reliable LVLMs. The code is available at https://github.com/ZhaoyangLi-1/ORIC.
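
The CLIP-guided strategy described above can be pictured as ranking candidate labels by image-text similarity and keeping the highest-scoring ones that are not annotated in the image. The following is a minimal Python sketch under that reading, not the ORIC implementation: the function name `sample_absent_objects`, the parameter `k`, and the similarity scores are all hypothetical, and the scores are assumed to have been computed beforehand.

```python
from typing import Dict, List, Set


def sample_absent_objects(
    similarity: Dict[str, float],  # hypothetical image-text scores per label
    present: Set[str],             # labels annotated as present in the image
    k: int = 2,
) -> List[str]:
    """Return the k absent labels that look most plausible for the scene."""
    absent = [label for label in similarity if label not in present]
    # Higher similarity means more plausible in this context,
    # which makes the label a harder negative.
    absent.sort(key=lambda label: similarity[label], reverse=True)
    return absent[:k]


# Made-up scores for a kitchen-like image (illustrative values only).
scores = {"fork": 0.31, "knife": 0.29, "dining table": 0.35, "surfboard": 0.05}
negatives = sample_absent_objects(scores, present={"dining table", "fork"}, k=1)
print(negatives)
```

Ranking by similarity rather than sampling uniformly is what makes the absent object "plausible": an unlikely label such as the surfboard would be an easy negative and tell little about context-driven hallucination.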

[270] NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning

Sahil Shah, S P Sharan, Harsh Goel, Minkyu Choi, Mustafa Munir, Manvik Pasula, Radu Marculescu, Sandeep Chinchali

Main category: cs.CV

TL;DR: NeuS-QA is a training-free neuro-symbolic pipeline that translates video QA questions into temporal logic specifications, builds video automata, and uses model checking to identify relevant video segments before feeding them to VLMs, improving performance on long video QA.

Details

Motivation: Current VLMs struggle with Long Video Question Answering due to token overhead from uniform frame sampling, causing loss of fine-grained temporal details and lack of explicit temporal reasoning mechanisms.

Method: Translate questions into temporal logic specifications, construct video automata modeling frame-by-frame event progression, use model checking to identify logic-satisfying segments, then feed only verified segments to VLMs.

Result: Improves performance by over 10% on LongVideoBench and CinePile benchmarks, particularly for event ordering, causality, and multi-step reasoning questions.

Conclusion: NeuS-QA enables compositional temporal reasoning without model fine-tuning, reduces hallucinations, improves interpretability, and significantly enhances LVQA performance through neuro-symbolic integration.

Abstract: While vision-language models (VLMs) excel at tasks involving single images or short videos, they still struggle with Long Video Question Answering (LVQA) due to its demand for complex multi-step temporal reasoning. Vanilla approaches, which simply sample frames uniformly and feed them to a VLM along with the question, incur significant token overhead. This forces aggressive downsampling of long videos, causing models to miss fine-grained visual structure, subtle event transitions, and key temporal cues. Recent works attempt to overcome these limitations through heuristic approaches; however, they lack explicit mechanisms for encoding temporal relationships and fail to provide any formal guarantees that the sampled context actually encodes the compositional or causal logic required by the question. To address these foundational gaps, we introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA first translates a natural language question into a logic specification that models the temporal relationship between frame-level events. Next, we construct a video automaton to model the video’s frame-by-frame event progression, and finally employ model checking to compare the automaton against the specification to identify all video segments that satisfy the question’s logical requirements. Only these logic-verified segments are submitted to the VLM, thus improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine-tuning the model. Experiments on the LongVideoBench and CinePile LVQA benchmarks show that NeuS-QA significantly improves performance by over 10%, particularly on questions involving event ordering, causality, and multi-step reasoning. We open-source our code at https://utaustin-swarmlab.github.io/NeuS-QA/.
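
To make the specification-checking step more concrete, here is a toy Python sketch of finding video segments that satisfy an ordering requirement of the form "event a is later followed by event b". It is a simplified stand-in for temporal logic model checking, not the NeuS-QA pipeline: the function name `segments_a_then_b`, the event labels, and the assumption that per-frame event sets are already available are all hypothetical.

```python
from typing import List, Set, Tuple


def segments_a_then_b(
    frames: List[Set[str]], a: str, b: str
) -> List[Tuple[int, int]]:
    """Find frame spans where event `a` is later followed by event `b`.

    A two-state automaton: state 0 waits for `a`, state 1 waits for `b`.
    Each accepted run yields one (start, end) span of frame indices.
    """
    spans = []
    state, start = 0, -1
    for i, events in enumerate(frames):
        if state == 0 and a in events:
            state, start = 1, i
        elif state == 1 and b in events:
            spans.append((start, i))
            state = 0
    return spans


# Hypothetical per-frame event detections for a six-frame clip.
frames = [
    set(), {"open_door"}, set(), {"enter_room"}, {"open_door"}, {"enter_room"},
]
print(segments_a_then_b(frames, "open_door", "enter_room"))
```

Only the frames inside the returned spans would then be passed to the VLM, which is the source of the token savings the abstract describes.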

[271] FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation

Yunyang Ge, Xinhua Cheng, Chengshu Zhao, Xianyi He, Shenghai Yuan, Bin Lin, Bin Zhu, Li Yuan

Main category: cs.CV

TL;DR: FlashI2V addresses conditional image leakage in Image-to-Video generation by using latent shifting and Fourier guidance to prevent overfitting and improve out-of-domain performance.

Details

Motivation: Existing I2V methods suffer from conditional image leakage where denoisers shortcut the conditional image, causing slow motion, color inconsistency, and poor generalization to out-of-domain data.

Method: FlashI2V uses: (1) Latent Shifting - modifies flow matching distributions by subtracting conditional image information from noisy latents, (2) Fourier Guidance - uses high-frequency magnitude features from Fourier Transform to accelerate convergence and adjust detail levels.

Result: Achieves the best generalization and performance on out-of-domain data among the I2V paradigms compared. With only 1.3B parameters, FlashI2V reaches a dynamic degree score of 53.01 on Vbench-I2V, surpassing the larger CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P.

Conclusion: FlashI2V effectively overcomes conditional image leakage and achieves superior performance with fewer parameters, demonstrating strong generalization capabilities for Image-to-Video generation.

Abstract: In Image-to-Video (I2V) generation, a video is created using an input image as the first-frame condition. Existing I2V methods concatenate the full information of the conditional image with noisy latents to achieve high fidelity. However, the denoisers in these methods tend to shortcut the conditional image, which is known as conditional image leakage, leading to performance degradation issues such as slow motion and color inconsistency. In this work, we further clarify that conditional image leakage leads to overfitting to in-domain data and decreases the performance in out-of-domain scenarios. Moreover, we introduce Fourier-Guided Latent Shifting I2V, named FlashI2V, to prevent conditional image leakage. Concretely, FlashI2V consists of: (1) Latent Shifting. We modify the source and target distributions of flow matching by subtracting the conditional image information from the noisy latents, thereby incorporating the condition implicitly. (2) Fourier Guidance. We use high-frequency magnitude features obtained by the Fourier Transform to accelerate convergence and enable the adjustment of detail levels in the generated video. Experimental results show that our method effectively overcomes conditional image leakage and achieves the best generalization and performance on out-of-domain data among various I2V paradigms. With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P. Project page: https://pku-yuangroup.github.io/FlashI2V/
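
As an illustration of what "high-frequency magnitude features obtained by the Fourier Transform" means, the Python sketch below computes the magnitude spectrum of a 1-D signal and keeps only the components at or above a cutoff frequency. This is a toy under stated assumptions: FlashI2V operates on image latents rather than 1-D lists, and the function names and the `cutoff` parameter are hypothetical, not taken from the paper or its code.

```python
import cmath
from typing import List


def dft_magnitudes(signal: List[float]) -> List[float]:
    """Magnitude of each discrete Fourier coefficient (naive O(n^2) DFT)."""
    n = len(signal)
    mags = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        mags.append(abs(coeff))
    return mags


def high_frequency_magnitudes(signal: List[float], cutoff: int) -> List[float]:
    """Zero out magnitudes whose frequency index is below `cutoff`.

    For real input, indices k and n - k describe the same frequency,
    so min(k, n - k) is used as the effective frequency.
    """
    n = len(signal)
    mags = dft_magnitudes(signal)
    return [m if min(k, n - k) >= cutoff else 0.0 for k, m in enumerate(mags)]


# An alternating signal carries all of its energy at the highest frequency.
print(high_frequency_magnitudes([1.0, -1.0, 1.0, -1.0], cutoff=2))
```

A flat signal has energy only at frequency zero, so the same filter returns (numerically) all zeros for it; fine detail is what survives the cutoff.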

[272] A filtering scheme for confocal laser endomicroscopy (CLE)-video sequences for self-supervised learning

Nils Porsche, Flurin Müller-Diesing, Sweta Banerjee, Miguel Goncalves, Marc Aubreville

Main category: cs.CV

TL;DR: Proposes a filter for CLE video sequences to reduce redundancy in self-supervised learning, improving training efficiency and performance on medical image classification tasks.

Details

Motivation: CLE images are hard to interpret for non-experts, and machine learning models overfit due to limited labeled data. Self-supervised learning can help but suffers from high inter-frame correlation in CLE videos.

Method: Developed a filter for CLE video sequences to reduce dataset redundancy in SSL training. Used four baseline networks and an SSL teacher-student network with a vision transformer small backbone, evaluated on downstream tasks for sinonasal tumor and skin squamous cell carcinoma datasets.

Result: Filtered SSL-pretrained models achieved highest test accuracy: 67.48% on sinonasal tumor and 73.52% on skin cancer datasets, outperforming non-SSL baselines. Training time reduced by 67%.

Conclusion: SSL is effective for CLE pretraining, and the proposed video filter improves training efficiency and performance in self-supervised scenarios.

Abstract: Confocal laser endomicroscopy (CLE) is a non-invasive, real-time imaging modality that can be used for in-situ, in-vivo imaging and the microstructural analysis of mucous structures. The diagnosis using CLE is, however, complicated by images being hard to interpret for non-experienced physicians. Utilizing machine learning as an augmentative tool would hence be beneficial, but is complicated by the shortage of histopathology-correlated CLE imaging sequences with respect to the plurality of patterns in this domain, leading to overfitting of machine learning models. To overcome this, self-supervised learning (SSL) can be employed on larger unlabeled datasets. CLE is a video-based modality with high inter-frame correlation, leading to a non-stratified data distribution for SSL training. In this work, we propose a filter functionality on CLE video sequences to reduce the dataset redundancy in SSL training and improve SSL training convergence and training efficiency. We use four state-of-the-art baseline networks and a SSL teacher-student network with a vision transformer small backbone for the evaluation. These networks were evaluated on downstream tasks for a sinonasal tumor dataset and a squamous cell carcinoma of the skin dataset. On both datasets, we found the highest test accuracy on the filtered SSL-pretrained model, with 67.48% and 73.52%, both considerably outperforming their non-SSL baselines. Our results show that SSL is an effective method for CLE pretraining. Further, we show that our proposed CLE video filter can be utilized to improve training efficiency in self-supervised scenarios, resulting in a reduction of 67% in training time.
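
The abstract does not state the filtering criterion, so the Python sketch below shows only one plausible way to reduce redundancy among highly correlated video frames: drop a frame when its Pearson correlation with the last kept frame reaches a threshold. The criterion, the function names, and the `max_corr` default are assumptions made for illustration and should not be read as the authors' filter.

```python
from math import sqrt
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equally sized, flattened frames."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant frame: treat as uncorrelated
    return cov / sqrt(var_a * var_b)


def filter_frames(
    frames: List[Sequence[float]], max_corr: float = 0.95
) -> List[int]:
    """Return indices of frames kept after dropping near-duplicates."""
    kept: List[int] = []
    for i, frame in enumerate(frames):
        # Compare against the last kept frame only (assumed criterion).
        if not kept or pearson(frames[kept[-1]], frame) < max_corr:
            kept.append(i)
    return kept


# Four tiny "frames": two near-identical pairs with opposite gradients.
video = [[0, 1, 2, 3], [0, 1, 2, 3.1], [3, 2, 1, 0], [3, 2, 1, 0.1]]
print(filter_frames(video))
```

Comparing against the last kept frame rather than the immediately preceding one prevents slow drift from letting a long run of near-duplicates through.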

[273] Concept Retrieval – What and How?

Ori Nizan, Oren Shrout, Ayellet Tal

Main category: cs.CV

TL;DR: This paper introduces a novel approach for retrieving images that share central concepts with a query image, going beyond visual similarity to capture underlying narratives.

Details

Motivation: Traditional image retrieval methods focus on visual or semantic similarity but fail to capture the central concepts and underlying narratives that images may share, even when they look different.

Method: The approach is based on two key observations: (1) neighbors in embedding space share at least one concept with the query but not necessarily with each other, and (2) modeling the neighborhood with a bimodal Gaussian distribution reveals meaningful structure for concept identification.

Result: The method shows effectiveness through qualitative, quantitative, and human evaluations, with the implementation available as a PyPI package called ‘coret’.

Conclusion: The proposed approach addresses concept-based image retrieval by capturing shared central concepts and narratives, going beyond conventional retrieval and clustering methods that emphasize visual or semantic similarity.

Abstract: A concept may reflect either a concrete or abstract idea. Given an input image, this paper seeks to retrieve other images that share its central concepts, capturing aspects of the underlying narrative. This goes beyond conventional retrieval or clustering methods, which emphasize visual or semantic similarity. We formally define the problem, outline key requirements, and introduce appropriate evaluation metrics. We propose a novel approach grounded in two key observations: (1) While each neighbor in the embedding space typically shares at least one concept with the query, not all neighbors necessarily share the same concept with one another. (2) Modeling this neighborhood with a bimodal Gaussian distribution uncovers meaningful structure that facilitates concept identification. Qualitative, quantitative, and human evaluations confirm the effectiveness of our approach. See the package on PyPI: https://pypi.org/project/coret/
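
The second observation, that a bimodal Gaussian exposes structure in the neighborhood, can be illustrated by fitting a two-component Gaussian mixture to scalar values with expectation-maximization. The Python sketch below is an assumption-laden toy, not the `coret` package: it presumes each neighbor has already been reduced to one number (for example a projection of its embedding), and the function names and initialisation are hypothetical.

```python
from math import exp, pi, sqrt
from typing import List, Tuple


def gaussian(x: float, mean: float, var: float) -> float:
    """Density of a 1-D normal distribution."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def fit_bimodal(
    values: List[float], iters: int = 50
) -> Tuple[List[float], List[int]]:
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns the two component means and a hard component label per value.
    """
    spread = (max(values) - min(values)) or 1.0
    means = [min(values), max(values)]  # deterministic initialisation
    variances = [(spread / 4) ** 2] * 2
    weights = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in values]
    for _ in range(iters):
        # E-step: responsibility of each component for each value.
        for i, x in enumerate(values):
            p = [weights[k] * gaussian(x, means[k], variances[k]) for k in (0, 1)]
            total = sum(p) or 1e-300
            resp[i] = [p[0] / total, p[1] / total]
        # M-step: re-estimate weight, mean, and variance per component.
        for k in (0, 1):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a collapsed component
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return means, labels


# Hypothetical 1-D neighbor scores forming two clear groups.
means, labels = fit_bimodal([0.10, 0.12, 0.11, 0.90, 0.92, 0.91])
print(labels)
```

The two recovered modes split the neighbors into groups, which matches the first observation that neighbors share a concept with the query but not necessarily with one another.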

[274] Hierarchical Mixing Architecture for Low-light RAW Image Enhancement

Xianmin Chen, Peiliang Huang, Longfei Han, Dingwen Zhang, Junwei Han

Main category: cs.CV

TL;DR: HiMA is an efficient low-light RAW image enhancement framework using hierarchical mixing architecture with channel attention and Mamba, achieving superior performance with fewer parameters through local distribution adjustment and multi-prior fusion.

Details

Motivation: To address the challenge of simultaneously achieving strong enhancement quality and high efficiency in low-light RAW image enhancement, overcoming domain ambiguity issues in existing pipelines.

Method: Proposes HiMA framework with Large Scale Block (LSB) for upper layers and Small Scale Block (SSB) for lower layers, plus Local Distribution Adjustment (LoDA) module for adaptive feature statistics alignment and Multi-Prior Fusion (MPF) module for domain consistency.

Result: Extensive experiments on multiple public benchmarks show the approach outperforms state-of-the-art methods with superior performance and fewer parameters.

Conclusion: HiMA successfully addresses the efficiency-quality trade-off in low-light RAW image enhancement through its hierarchical hybrid architecture and novel modules, demonstrating significant improvements over existing methods.

Abstract: With the rapid development of deep learning, low-light RAW image enhancement (LLRIE) has achieved remarkable progress. However, the challenge of simultaneously achieving strong enhancement quality and high efficiency remains. Leveraging the inherent efficiency of Channel Attention and Mamba, we introduce a Hierarchical Mixing Architecture (HiMA), a hybrid LLRIE framework built upon two core modules. Specifically, we introduce Large Scale Block (LSB) for upper layers and Small Scale Block (SSB) for lower layers that reduce the parameters while improving the performance. Based on this framework, we also introduce a novel Local Distribution Adjustment (LoDA) module that adaptively aligns local feature statistics in a content-aware manner by learning to adjust regional luminance and contrast distributions. Moreover, to alleviate the domain ambiguity commonly observed in existing LLRIE pipelines, we design a Multi-Prior Fusion (MPF) module that leverages three complementary priors extracted from the first stage of the hybrid architecture to maintain domain consistency. Extensive experiments on multiple public benchmarks demonstrate that our approach outperforms state-of-the-art methods, delivering superior performance with fewer parameters. Code is available at https://github.com/Cynicarlos/HiMA.

[275] Enhancing Video Inpainting with Aligned Frame Interval Guidance

Ming Xie, Junqiu Yu, Qiaole Dong, Xiangyang Xue, Yanwei Fu

Main category: cs.CV

TL;DR: VidPivot is a novel video inpainting framework that decouples the task into multi-frame consistent image inpainting and masked area motion propagation, using frame interval priors and a FrameProp Module to enhance spatiotemporal stability.

Details

Motivation: Existing I2V-based video inpainting methods suffer from severe content degradation in video chunks and lack robust frame alignment, compromising spatiotemporal stability and control over the entire video.

Method: Proposes VidPivot framework with frame interval priors as spatiotemporal cues, FrameProp Module for cross-frame coherence via content propagation, and context controller to encode coherent frame priors into I2V backbone.

Result: Extensive evaluations show VidPivot achieves competitive performance across diverse benchmarks and generalizes well to different video inpainting scenarios.

Conclusion: VidPivot effectively addresses content degradation and spatiotemporal stability issues in video inpainting through its decoupled approach and frame propagation mechanisms.

Abstract: Recent image-to-video (I2V) based video inpainting methods have made significant strides by leveraging single-image priors and modeling temporal consistency across masked frames. Nevertheless, these methods suffer from severe content degradation within video chunks. Furthermore, the absence of a robust frame alignment scheme compromises intra-chunk and inter-chunk spatiotemporal stability, resulting in insufficient control over the entire video. To address these limitations, we propose VidPivot, a novel framework that decouples video inpainting into two sub-tasks: multi-frame consistent image inpainting and masked area motion propagation. Our approach introduces frame interval priors as spatiotemporal cues to guide the inpainting process. To enhance cross-frame coherence, we design a FrameProp Module that implements a frame content propagation strategy, diffusing reference frame content into subsequent frames via a splicing mechanism. Additionally, a dedicated context controller encodes these coherent frame priors into the I2V generative backbone, effectively serving as a soft constraint to suppress content distortion during generation. Extensive evaluations demonstrate that VidPivot achieves competitive performance across diverse benchmarks and generalizes well to different video inpainting scenarios.

[276] AI Assisted AR Assembly: Object Recognition and Computer Vision for Augmented Reality Assisted Assembly

Alexander Htet Kyaw, Haotian Ma, Sasa Zivkovic, Jenny Sabin

Main category: cs.CV

TL;DR: AI-assisted AR assembly workflow using deep learning for object recognition to display step-by-step instructions and component locations in physical space.

Details

Motivation: To eliminate manual searching, sorting, and labeling of components during assembly by connecting instructions with real-time component locations.

Method: Deep learning-based object recognition identifies assembly components and displays bounding boxes around them in AR, showing where components should be placed for each assembly step.

Result: Feasibility is demonstrated through a case study on the assembly of LEGO sculptures.

Conclusion: The AI-assisted AR workflow connects assembly instructions with the real-time location of physical components, removing the need to search for, sort, or label components by hand before assembly.

Abstract: We present an AI-assisted Augmented Reality assembly workflow that uses deep learning-based object recognition to identify different assembly components and display step-by-step instructions. For each assembly step, the system displays a bounding box around the corresponding components in the physical space, and where the component should be placed. By connecting assembly instructions with the real-time location of relevant components, the system eliminates the need for manual searching, sorting, or labeling of different components before each assembly. To demonstrate the feasibility of using object recognition for AR-assisted assembly, we highlight a case study involving the assembly of LEGO sculptures.

[277] UHKD: A Unified Framework for Heterogeneous Knowledge Distillation via Frequency-Domain Representations

Fengming Yu, Haiwei Pan, Kejia Zhang, Jian Guan, Haiying Jiang

Main category: cs.CV

TL;DR: UHKD is a heterogeneous knowledge distillation framework that uses frequency-domain representations to transfer knowledge across different architectures, addressing semantic discrepancies in intermediate features.

Details

Motivation: Existing KD methods work poorly in heterogeneous settings due to architectural differences, especially with intermediate features. Most heterogeneous KD studies focus only on logits, missing rich semantic information in intermediate layers.

Method: Proposes UHKD with Feature Transformation Module (FTM) to generate frequency-domain teacher representations and Feature Alignment Module (FAM) to project and align student features via multi-level matching. Uses joint objective combining MSE on features and KL divergence on logits.

Result: Achieves maximum gains of 5.59% on CIFAR-100 and 0.83% on ImageNet-1K over the latest heterogeneous distillation method.

Conclusion: UHKD effectively addresses heterogeneous KD challenges by leveraging frequency-domain representations to capture global semantics and mitigate architectural discrepancies, outperforming existing methods.

Abstract: Knowledge distillation (KD) is an effective model compression technique that transfers knowledge from a high-performance teacher to a lightweight student, reducing computational and storage costs while maintaining competitive accuracy. However, most existing KD methods are tailored for homogeneous models and perform poorly in heterogeneous settings, particularly when intermediate features are involved. Semantic discrepancies across architectures hinder effective use of intermediate representations from the teacher model, while prior heterogeneous KD studies mainly focus on the logits space, underutilizing rich semantic information in intermediate layers. To address this, Unified Heterogeneous Knowledge Distillation (UHKD) is proposed, a framework that leverages intermediate features in the frequency domain for cross-architecture transfer. Frequency-domain representations are leveraged to capture global semantic knowledge and mitigate representational discrepancies between heterogeneous teacher-student pairs. Specifically, a Feature Transformation Module (FTM) generates compact frequency-domain representations of teacher features, while a learnable Feature Alignment Module (FAM) projects student features and aligns them via multi-level matching. Training is guided by a joint objective combining mean squared error on intermediate features with Kullback-Leibler divergence on logits. Extensive experiments on CIFAR-100 and ImageNet-1K demonstrate the effectiveness of the proposed approach, achieving maximum gains of 5.59% and 0.83% over the latest heterogeneous distillation method on the two datasets, respectively. Code will be released soon.
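
The joint objective described above, mean squared error on intermediate features plus Kullback-Leibler divergence on logits, can be sketched in a few lines of Python. This is a minimal illustration, not the UHKD code: it assumes the student and teacher features have already been transformed and aligned to the same length, and the weighting `alpha`, the `temperature`, and the choice of KL(teacher || student) follow common distillation practice rather than anything stated in the abstract.

```python
from math import exp, log
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mse(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def joint_kd_loss(
    student_feat: List[float],
    teacher_feat: List[float],
    student_logits: List[float],
    teacher_logits: List[float],
    alpha: float = 1.0,        # assumed weighting between the two terms
    temperature: float = 4.0,  # assumed softening temperature
) -> float:
    """Feature MSE plus alpha-weighted KL(teacher || student) on soft logits."""
    feat_term = mse(student_feat, teacher_feat)
    logit_term = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return feat_term + alpha * logit_term


print(joint_kd_loss([0.0, 0.0], [2.0, 2.0], [0.5, 1.5], [0.5, 1.5]))
```

When student and teacher agree on both features and logits the loss is zero; either term alone grows as the corresponding representations diverge.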

Yanlong Yang, Guanxiong Luo

Main category: cs.CV

TL;DR: DeblurSDI is a zero-shot self-supervised blind imaging framework that uses iterative reverse self-diffusion to jointly estimate sharp images and unknown blur kernels without pre-training, outperforming existing methods on optical aberrations and motion blur.

Details

Motivation: Optical imaging systems suffer from inherent imperfections like diffraction limits, manufacturing tolerances, misalignment, and motion-induced degradations that are unknown and difficult to measure. Existing blind inverse problem approaches face convergence instability, limited prior expressiveness, and hyperparameter sensitivity.

Method: Proposes DeblurSDI, a zero-shot self-supervised blind imaging framework that formulates blind image recovery as an iterative reverse self-diffusion process starting from pure noise, progressively refining both the sharp image and blur kernel without requiring pre-training.

Result: Extensive experiments on combined optical aberrations and motion blur demonstrate that DeblurSDI consistently outperforms other methods by a substantial margin.

Conclusion: DeblurSDI provides an effective solution for blind image deblurring by leveraging self-diffusion principles to handle unknown degradations without pre-training, achieving superior performance over existing approaches.

Abstract: Optical imaging systems are inherently imperfect due to diffraction limits, lens manufacturing tolerances, assembly misalignment, and other physical constraints. In addition, unavoidable camera shake and object motion further introduce non-ideal degradations during acquisition. These aberrations and motion-induced variations are typically unknown, difficult to measure, and costly to model or calibrate in practice. Blind inverse problems offer a promising direction by jointly estimating both the latent image and the unknown degradation kernel. However, existing approaches often suffer from convergence instability, limited prior expressiveness, and sensitivity to hyperparameters. Inspired by recent advances in self-diffusion, we propose DeblurSDI, a zero-shot, self-supervised blind imaging framework that requires no pre-training. DeblurSDI formulates blind image recovery as an iterative reverse self-diffusion process that begins from pure noise and progressively refines both the sharp image and the blur kernel. Extensive experiments on combined optical aberrations and motion blur demonstrate that DeblurSDI consistently outperforms other methods by a substantial margin.

[279] Invisible Triggers, Visible Threats! Road-Style Adversarial Creation Attack for Visual 3D Detection in Autonomous Driving

Jian Wang, Lijun He, Yixing Yong, Haixia Bi, Fan Li

Main category: cs.CV

TL;DR: AdvRoad generates road-style adversarial posters that look natural like actual road surfaces but cause 3D object detectors to hallucinate non-existent objects, enabling stealthy attacks on autonomous driving systems.

Details

Motivation: Current visual 3D detection models are vulnerable to adversarial attacks, and existing adversarial posters have unnatural appearances that make them easily detectable by humans, limiting their practical threat.

Method: Two-stage approach: Road-Style Adversary Generation creates natural-looking road-style adversarial patterns, and Scenario-Associated Adaptation optimizes attack effectiveness for specific scenes while maintaining stealth.

Result: Extensive experiments show AdvRoad generalizes well across different detectors, scenes, and attack locations. Physical attacks demonstrate practical threats in real-world environments.

Conclusion: AdvRoad presents a significant security threat to autonomous driving systems by creating stealthy, natural-looking adversarial road patterns that can cause dangerous misperceptions without drawing human attention.

Abstract: Modern autonomous driving (AD) systems leverage 3D object detection to perceive foreground objects in 3D environments for subsequent prediction and planning. Visual 3D detection based on RGB cameras provides a cost-effective solution compared to the LiDAR paradigm. While achieving promising detection accuracy, current deep neural network-based models remain highly susceptible to adversarial examples. The underlying safety concerns motivate us to investigate realistic adversarial attacks in AD scenarios. Previous work has demonstrated the feasibility of placing adversarial posters on the road surface to induce hallucinations in the detector. However, the unnatural appearance of the posters makes them easily noticeable by humans, and their fixed content can be readily targeted and defended. To address these limitations, we propose the AdvRoad to generate diverse road-style adversarial posters. The adversaries have naturalistic appearances resembling the road surface while compromising the detector to perceive non-existent objects at the attack locations. We employ a two-stage approach, termed Road-Style Adversary Generation and Scenario-Associated Adaptation, to maximize the attack effectiveness on the input scene while ensuring the natural appearance of the poster, allowing the attack to be carried out stealthily without drawing human attention. Extensive experiments show that AdvRoad generalizes well to different detectors, scenes, and spoofing locations. Moreover, physical attacks further demonstrate the practical threats in real-world environments.

[280] YOLO-SAT: A Data-based and Model-based Enhanced YOLOv12 Model for Desert Waste Detection and Classification

Abdulmumin Sa’ad, Sulaimon Oyeniyi Adebayo

Main category: cs.CV

TL;DR: YOLO-SAT is an enhanced real-time object detection framework for waste detection in desert environments, combining a pruned YOLOv12 with Self-Adversarial Training and specialized data augmentation to achieve high accuracy and efficiency for drone deployment.

Details

Motivation: Traditional waste collection in remote/harsh environments like deserts is inefficient and hazardous, while current computer vision research focuses mainly on urban environments and recyclable materials, overlooking organic/hazardous waste in underexplored terrains.

Method: Proposed YOLO-SAT framework based on pruned lightweight YOLOv12 integrated with Self-Adversarial Training (SAT) and specialized data augmentation strategies, using the DroneTrashNet dataset.

Result: Reports significant improvements in precision, recall, and mAP on DroneTrashNet, with low latency and a compact model size suited to resource-constrained aerial drones. Benchmarking against state-of-the-art lightweight YOLO variants indicates a favorable balance of accuracy and efficiency.

Conclusion: The combination of data-centric and model-centric enhancements provides robust, real-time waste detection in desert environments, validating the effectiveness of the proposed approach.

Abstract: The global waste crisis is escalating, with solid waste generation expected to increase tremendously in the coming years. Traditional waste collection methods, particularly in remote or harsh environments like deserts, are labor-intensive, inefficient, and often hazardous. Recent advances in computer vision and deep learning have opened the door to automated waste detection systems, yet most research focuses on urban environments and recyclable materials, overlooking organic and hazardous waste and underexplored terrains such as deserts. In this work, we propose YOLO-SAT, an enhanced real-time object detection framework based on a pruned, lightweight version of YOLOv12 integrated with Self-Adversarial Training (SAT) and specialized data augmentation strategies. Using the DroneTrashNet dataset, we demonstrate significant improvements in precision, recall, and mean average precision (mAP), while achieving low latency and compact model size suitable for deployment on resource-constrained aerial drones. Benchmarking YOLO-SAT against state-of-the-art lightweight YOLO variants further highlights its optimal balance of accuracy and efficiency. Our results validate the effectiveness of combining data-centric and model-centric enhancements for robust, real-time waste detection in desert environments.

[281] SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, Saining Xie

Main category: cs.CV

TL;DR: SIMS-V uses 3D simulators to generate spatially-rich video training data for multimodal language models, identifying three key question types that enable efficient training and strong real-world spatial reasoning performance.

Details

Motivation: Current multimodal language models struggle with spatial reasoning across time and space, and obtaining diverse real-world video data with precise spatial annotations is challenging and expensive.

Method: Developed a systematic data-generation framework using 3D simulators to create spatially-rich video training data, then conducted systematic ablations to identify the most effective question types for transfer learning.

Result: Identified three key question categories (metric measurement, perspective-dependent reasoning, temporal tracking) that enable a 7B-parameter model trained on just 25K examples to outperform larger 72B baselines and achieve competitive performance with proprietary models on real-world spatial reasoning benchmarks.

Conclusion: Simulated data with carefully selected question types can efficiently develop transferable spatial intelligence in multimodal language models, enabling robust generalization to real-world spatial reasoning tasks while maintaining general video understanding capabilities.

Abstract: Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V – a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.

[282] Contrastive Integrated Gradients: A Feature Attribution-Based Method for Explaining Whole Slide Image Classification

Anh Mai Vu, Tuan L. Vo, Ngoc Lam Quang Bui, Nam Nguyen Le Binh, Akash Awasthi, Huy Quoc Vo, Thanh-Huy Nguyen, Zhu Han, Chandra Mohan, Hien Van Nguyen

Main category: cs.CV

TL;DR: CIG is a novel attribution method for Whole Slide Image analysis that computes contrastive gradients to highlight class-discriminative regions, outperforming traditional methods in identifying tumor subtypes with better theoretical guarantees and quantitative metrics.

Details

Motivation: Existing attribution methods like Integrated Gradients capture model decision patterns but may miss class-discriminative signals crucial for distinguishing tumor subtypes in high-resolution WSIs, limiting interpretability in computational pathology.

Method: CIG computes contrastive gradients in logit space by comparing feature importance relative to a reference class, satisfies integrated attribution axioms, and introduces MIL-AIC and MIL-SIC metrics to measure predictive information and confidence evolution under weak supervision.

Result: CIG produces more informative attributions across three cancer datasets (CAMELYON16, TCGA-RCC, TCGA-Lung), with quantitative improvements in MIL-AIC/MIL-SIC metrics and qualitative visualizations that better align with ground truth tumor regions.

Conclusion: CIG enhances interpretability and trustworthiness in WSI-based diagnostics by providing sharper differentiation between tumor subtypes while maintaining theoretical soundness, making it valuable for AI-assisted pathology.

Abstract: Interpretability is essential in Whole Slide Image (WSI) analysis for computational pathology, where understanding model predictions helps build trust in AI-assisted diagnostics. While Integrated Gradients (IG) and related attribution methods have shown promise, applying them directly to WSIs introduces challenges due to their high-resolution nature. These methods capture model decision patterns but may overlook class-discriminative signals that are crucial for distinguishing between tumor subtypes. In this work, we introduce Contrastive Integrated Gradients (CIG), a novel attribution method that enhances interpretability by computing contrastive gradients in logit space. First, CIG highlights class-discriminative regions by comparing feature importance relative to a reference class, offering sharper differentiation between tumor and non-tumor areas. Second, CIG satisfies the axioms of integrated attribution, ensuring consistency and theoretical soundness. Third, we propose two attribution quality metrics, MIL-AIC and MIL-SIC, which measure how predictive information and model confidence evolve with access to salient regions, particularly under weak supervision. We validate CIG across three datasets spanning distinct cancer types: CAMELYON16 (breast cancer metastasis in lymph nodes), TCGA-RCC (renal cell carcinoma), and TCGA-Lung (lung cancer). Experimental results demonstrate that CIG yields more informative attributions both quantitatively, using MIL-AIC and MIL-SIC, and qualitatively, through visualizations that align closely with ground truth tumor regions, underscoring its potential for interpretable and trustworthy WSI-based diagnostics.
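
Illustration (not from the paper): one plausible reading of "contrastive gradients in logit space" is standard Integrated Gradients applied to the difference between a target-class logit and a reference-class logit. The sketch below follows that reading on a toy two-layer network. The weights `W1` and `W2`, the zero baseline, and the function names are all hypothetical, and the toy network merely stands in for a slide-level classifier; the paper's actual formulation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 5))  # hypothetical hidden-layer weights
W2 = rng.normal(size=(3, 8))  # hypothetical logit weights for 3 classes


def logit_diff(x, target, reference):
    """Contrastive score: target-class logit minus reference-class logit."""
    logits = W2 @ np.tanh(W1 @ x)
    return logits[target] - logits[reference]


def logit_diff_grad(x, target, reference):
    """Analytic gradient of logit_diff with respect to the input x."""
    h = np.tanh(W1 @ x)
    w = W2[target] - W2[reference]
    return W1.T @ (w * (1.0 - h ** 2))


def contrastive_ig(x, baseline, target, reference, steps=512):
    """Midpoint Riemann approximation of the IG path integral."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([
        logit_diff_grad(baseline + a * (x - baseline), target, reference)
        for a in alphas
    ])
    return (x - baseline) * grads.mean(axis=0)


x = rng.normal(size=5)
baseline = np.zeros(5)  # assumed all-zero baseline
attr = contrastive_ig(x, baseline, target=0, reference=1)

# Completeness axiom: attributions sum to the change in the contrastive score.
gap = logit_diff(x, 0, 1) - logit_diff(baseline, 0, 1)
```

Here `attr.sum()` should match `gap` up to integration error, which is the completeness property that integrated attribution methods are expected to satisfy.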

Surbhi Madan, Shreya Ghosh, Ramanathan Subramanian, Abhinav Dhall, Tom Gedeon

Main category: cs.CV

TL;DR: CSGaze is a context-aware multimodal approach that uses facial and scene information to predict social gaze patterns in conversational interactions, incorporating attention mechanisms and showing competitive performance on benchmark datasets.

Details

Motivation: Gaze provides insights into attention, social engagement, and confidence during conversations. The paper aims to leverage contextual cues with visual scene and facial information to better understand and predict social gaze patterns.

Method: CSGaze uses a multimodal approach combining facial and scene information with a fine-grained attention mechanism focused on the principal speaker to model social gaze dynamics.

Result: CSGaze performs competitively with state-of-the-art methods on GP-Static, UCO-LAEO and AVA-LAEO datasets, and shows good generalizability on open set datasets with robust performance across diverse scenarios.

Conclusion: Contextual cues significantly improve social gaze prediction, and the model’s attention mechanism provides explainable insights into decision-making while demonstrating strong generalization capabilities.

Abstract: A person’s gaze offers valuable insights into their focus of attention, level of social engagement, and confidence. In this work, we investigate how contextual cues combined with visual scene and facial information can be effectively utilized to predict and interpret social gaze patterns during conversational interactions. We introduce CSGaze, a context-aware multimodal approach that leverages facial and scene information as complementary inputs to enhance social gaze pattern prediction from multi-person images. The model also incorporates a fine-grained attention mechanism centered on the principal speaker, which helps in better modeling social gaze dynamics. Experimental results show that CSGaze performs competitively with state-of-the-art methods on GP-Static, UCO-LAEO and AVA-LAEO. Our findings highlight the role of contextual cues in improving social gaze prediction. Additionally, we provide initial explainability through generated attention scores, offering insights into the model’s decision-making process. We also test our model on open set datasets, demonstrating its generalizability and robustness across diverse scenarios.

[284] Improving Multimodal Sentiment Analysis via Modality Optimization and Dynamic Primary Modality Selection

Dingkang Yang, Mingcheng Li, Xuecheng Wu, Zhaoyu Chen, Kaixun Jiang, Keliang Liu, Peng Zhai, Lihua Zhang

Main category: cs.CV

TL;DR: Proposes MODS framework for multimodal sentiment analysis that dynamically selects primary modalities and reduces redundancy in acoustic/visual data using graph-based compression.

Details

Motivation: Addresses imbalanced unimodal performance and suboptimal fusion in MSA, where fixed primary modality strategies fail to adapt to dynamic modality importance variations across samples.

Method: Uses Graph-based Dynamic Sequence Compressor (GDC) with capsule networks and graph convolution to reduce sequential redundancy, sample-adaptive Primary Modality Selector (MSelector), and Primary-modality-Centric Cross-Attention (PCCA) module.

Result: Extensive experiments on four benchmark datasets show MODS outperforms state-of-the-art methods with superior performance by balancing modality contributions and eliminating redundant noise.

Conclusion: MODS effectively addresses modality imbalance and redundancy issues in MSA through dynamic primary modality selection and optimization, achieving better sentiment analysis performance.

Abstract: Multimodal Sentiment Analysis (MSA) aims to predict sentiment from language, acoustic, and visual data in videos. However, imbalanced unimodal performance often leads to suboptimal fused representations. Existing approaches typically adopt fixed primary modality strategies to maximize dominant modality advantages, yet fail to adapt to dynamic variations in modality importance across different samples. Moreover, non-language modalities suffer from sequential redundancy and noise, degrading model performance when they serve as primary inputs. To address these issues, this paper proposes a modality optimization and dynamic primary modality selection framework (MODS). First, a Graph-based Dynamic Sequence Compressor (GDC) is constructed, which employs capsule networks and graph convolution to reduce sequential redundancy in acoustic/visual modalities. Then, we develop a sample-adaptive Primary Modality Selector (MSelector) for dynamic dominance determination. Finally, a Primary-modality-Centric Cross-Attention (PCCA) module is designed to enhance dominant modalities while facilitating cross-modal interaction. Extensive experiments on four benchmark datasets demonstrate that MODS outperforms state-of-the-art methods, achieving superior performance by effectively balancing modality contributions and eliminating redundant noise.

[285] Learning Topology-Driven Multi-Subspace Fusion for Grassmannian Deep Network

Xuan Yu, Tianyang Xu

Main category: cs.CV

TL;DR: A topology-driven multi-subspace fusion framework on Grassmannian manifold that enables adaptive subspace collaboration through dynamic selection and fusion mechanisms, outperforming single-subspace approaches.

Details

Motivation: Existing Grassmannian approaches rely on static single-subspace representations, neglecting the dynamic interplay between multiple subspaces needed to capture complex geometric structures.

Method: Proposes adaptive multi-subspace modeling with topological convergence analysis for dynamic subspace selection, and multi-subspace interaction blocks using Fréchet mean optimization on the manifold, with Riemannian batch normalization and mutual information regularization.

Result: Achieves state-of-the-art performance on 3D action recognition (HDM05, FPHA), EEG classification (MAMEM-SSVEPII), and graph tasks, demonstrating superior discriminability and interpretability.

Conclusion: Successfully adapts multi-channel interaction philosophy from Euclidean networks to non-Euclidean domains, advancing geometric deep learning with improved subspace collaboration and theoretical convergence guarantees.

Abstract: Grassmannian manifold offers a powerful carrier for geometric representation learning by modelling high-dimensional data as low-dimensional subspaces. However, existing approaches predominantly rely on static single-subspace representations, neglecting the dynamic interplay between multiple subspaces critical for capturing complex geometric structures. To address this limitation, we propose a topology-driven multi-subspace fusion framework that enables adaptive subspace collaboration on the Grassmannian. Our solution introduces two key innovations: (1) Inspired by the Kolmogorov-Arnold representation theorem, an adaptive multi-subspace modelling mechanism is proposed that dynamically selects and weights task-relevant subspaces via topological convergence analysis, and (2) a multi-subspace interaction block that fuses heterogeneous geometric representations through Fréchet mean optimisation on the manifold. Theoretically, we establish the convergence guarantees of adaptive subspaces under a projection metric topology, ensuring stable gradient-based optimisation. Practically, we integrate Riemannian batch normalisation and mutual information regularisation to enhance discriminability and robustness. Extensive experiments on 3D action recognition (HDM05, FPHA), EEG classification (MAMEM-SSVEPII), and graph tasks demonstrate state-of-the-art performance. Our work not only advances geometric deep learning but also successfully adapts the proven multi-channel interaction philosophy of Euclidean networks to non-Euclidean domains, achieving superior discriminability and interpretability.

Huili Huang, Chengeng Liu, Danrong Zhang, Shail Patel, Anastasiya Masalava, Sagar Sadak, Parisa Babolhavaeji, WeiHong Low, Max Mahdi Roozbahani, J. David Frost

Main category: cs.CV

TL;DR: EIDSeg is the first large-scale semantic segmentation dataset for post-earthquake social media imagery, enabling fine-grained damage assessment of buildings and roads using ground-level photos.

Details

Motivation: Existing remote sensing methods for post-earthquake damage assessment rely on costly aerial images and expert labeling, and produce only binary damage maps, creating a gap that social media imagery could fill.

Method: Created EIDSeg dataset with 3,266 images from 9 earthquakes (2008-2023) annotated across 5 damage classes using a three-phase cross-disciplinary annotation protocol with non-expert annotators achieving over 70% inter-annotator agreement.

Result: Benchmarked state-of-the-art segmentation models and identified Encoder-only Mask Transformer (EoMT) as top performer with 80.8% mIoU, demonstrating effective damage classification from social media images.

Conclusion: The work enables faster, finer-grained post-earthquake damage assessment by leveraging social networks’ ground-level perspective, overcoming limitations of traditional remote sensing methods.

Abstract: Rapid post-earthquake damage assessment is crucial for rescue and resource planning. Still, existing remote sensing methods depend on costly aerial images and expert labeling, and produce only binary damage maps for early-stage evaluation. Although ground-level images from social networks provide a valuable source to fill this gap, a large pixel-level annotated dataset for this task is still unavailable. We introduce EIDSeg, the first large-scale semantic segmentation dataset specifically for post-earthquake social media imagery. The dataset comprises 3,266 images from nine major earthquakes (2008-2023), annotated across five classes of infrastructure damage: Undamaged Building, Damaged Building, Destroyed Building, Undamaged Road, and Damaged Road. We propose a practical three-phase cross-disciplinary annotation protocol with labeling guidelines that enables consistent segmentation by non-expert annotators, achieving over 70% inter-annotator agreement. We benchmark several state-of-the-art segmentation models, identifying Encoder-only Mask Transformer (EoMT) as the top-performing method with a Mean Intersection over Union (mIoU) of 80.8%. By unlocking social networks’ rich ground-level perspective, our work paves the way for a faster, finer-grained damage assessment in the post-earthquake scenario.
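
Illustration (not from the paper): the mIoU figure quoted above is conventionally computed by averaging per-class intersection-over-union. The sketch below shows that conventional computation on a tiny made-up label map with three classes. The function name `mean_iou` is hypothetical, and details such as how background or absent classes are handled in the EIDSeg benchmark are not stated in the abstract; here, classes that appear in neither map are simply skipped.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU, skipping classes absent from both maps."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0
    return float((inter[valid] / union[valid]).mean())


# Made-up 2x3 label maps with classes 0, 1, 2.
target = [[0, 0, 1], [1, 2, 2]]
pred = [[0, 1, 1], [1, 2, 0]]
score = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2
```

For this toy input the per-class values average to 0.5.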

[287] MRT: Learning Compact Representations with Mixed RWKV-Transformer for Extreme Image Compression

Han Liu, Hengyu Man, Xingtao Wang, Wenrui Li, Debin Zhao

Main category: cs.CV

TL;DR: Proposes MRT architecture for extreme image compression using 1-D latent representations, combining RWKV and Transformer models to reduce spatial redundancy and achieve superior compression efficiency.

Details

Motivation: Existing methods compress images into 2-D latent spaces using CNNs or Swin Transformers, which retain substantial spatial redundancy and limit compression performance.

Method: Mixed RWKV-Transformer (MRT) architecture that encodes images into 1-D latent representations, using RWKV modules for global dependencies across windows and Transformer blocks for local redundancies within windows, plus a dedicated RWKV Compression Model (RCM) for intermediate features.

Result: Achieves superior reconstruction quality at bitrates below 0.02 bpp, with 43.75% and 30.59% bitrate savings on Kodak and CLIC2020 datasets respectively compared to state-of-the-art GLC.

Conclusion: MRT framework effectively reduces spatial redundancy through 1-D latent representations and hierarchical attention, significantly outperforming existing 2-D compression architectures.

Abstract: Recent advances in extreme image compression have revealed that mapping pixel data into highly compact latent representations can significantly improve coding efficiency. However, most existing methods compress images into 2-D latent spaces via convolutional neural networks (CNNs) or Swin Transformers, which tend to retain substantial spatial redundancy, thereby limiting overall compression performance. In this paper, we propose a novel Mixed RWKV-Transformer (MRT) architecture that encodes images into more compact 1-D latent representations by synergistically integrating the complementary strengths of linear-attention-based RWKV and self-attention-based Transformer models. Specifically, MRT partitions each image into fixed-size windows, utilizing RWKV modules to capture global dependencies across windows and Transformer blocks to model local redundancies within each window. The hierarchical attention mechanism enables more efficient and compact representation learning in the 1-D domain. To further enhance compression efficiency, we introduce a dedicated RWKV Compression Model (RCM) tailored to the structural characteristics of the intermediate 1-D latent features in MRT. Extensive experiments on standard image compression benchmarks validate the effectiveness of our approach. The proposed MRT framework consistently achieves superior reconstruction quality at bitrates below 0.02 bits per pixel (bpp). Quantitative results based on the DISTS metric show that MRT significantly outperforms the state-of-the-art 2-D architecture GLC, achieving bitrate savings of 43.75% and 30.59% on the Kodak and CLIC2020 test datasets, respectively.
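
Illustration (not from the paper): to give a sense of scale for "below 0.02 bpp", the short calculation below converts a bits-per-pixel rate into a total bit and byte budget for one image. It assumes the standard 768×512 resolution of Kodak test images; the helper name `bits_budget` is hypothetical.

```python
def bits_budget(width, height, bpp):
    """Total number of bits available for an image at a given bits-per-pixel rate."""
    return width * height * bpp


# A 768x512 Kodak image at the 0.02 bpp threshold mentioned in the abstract.
kodak_bits = bits_budget(768, 512, 0.02)
kodak_bytes = kodak_bits / 8
```

Under these assumptions the whole image must fit in roughly 983 bytes, which is why this regime is described as extreme compression.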

[288] UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation

Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiele Cheng, Xiaotao Gu, Jie Tang

Main category: cs.CV

TL;DR: UI2Code^N is a visual language model that introduces an interactive UI-to-code paradigm with multimodal coding capabilities, achieving state-of-the-art performance comparable to leading closed-source models.

Details

Motivation: Current UI programming approaches face limitations in multimodal coding capabilities and lack iterative visual feedback, making automatic UI coding complex and underdeveloped.

Method: Interactive UI-to-code paradigm with staged pretraining, fine-tuning, and reinforcement learning; unifies UI-to-code generation, UI editing, and UI polishing; implements test-time scaling for multi-turn feedback.

Result: Establishes new state-of-the-art among open-source models and achieves performance comparable to Claude-4-Sonnet and GPT-5 on UI-to-code and UI polishing benchmarks.

Conclusion: The interactive paradigm and UI2Code^N model significantly advance automatic UI coding by better reflecting real-world workflows and leveraging multimodal capabilities with iterative feedback.

Abstract: User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2Code^N, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code^N establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5. Our code and models are available at https://github.com/zai-org/UI2Code_N.

[289] FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection

Jiangyong Yu, Changyong Shu, Sifan Zhou, Zichen Yu, Xing Hu, Yan Chen, Dawei Yang

Main category: cs.CV

TL;DR: FQ-PETR is a fully quantized framework for PETR-based 3D detection models that addresses quantization challenges through three innovations: QFPE for scale alignment, DULUT for non-linear function approximation, and QANS for attention stabilization, achieving near-floating-point accuracy with 75% latency reduction.

Details

Motivation: PETR models excel in camera-based 3D detection but face deployment challenges due to high computational cost and memory footprint. Direct quantization causes severe accuracy degradation due to multi-modal feature disparity and non-linear operator inefficiency.

Method: Three key innovations: (1) QFPE replaces multi-point sampling with LiDAR-prior-guided single-point sampling and anchor-based embedding, (2) DULUT approximates non-linear functions using cascaded linear lookup tables, (3) QANS performs quantization after softmax numerical stabilization.

Conclusion: FQ-PETR successfully enables efficient deployment of PETR-based 3D detection models through effective quantization techniques that address multi-modal feature disparity and non-linear operator challenges, making them practical for real-world autonomous driving applications.
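
Illustration (not from the paper): the Method summary describes DULUT as approximating non-linear functions with cascaded linear lookup tables. The sketch below shows only the basic building block, a single uniform table with linear interpolation, applied to `exp` on the range [-8, 0]. It is not the cascaded design itself. The table size, the input range, and the function names are assumptions chosen for illustration.

```python
import numpy as np


def build_lut(fn, lo, hi, entries):
    """Sample fn at evenly spaced points to form a lookup table."""
    xs = np.linspace(lo, hi, entries)
    return xs, fn(xs)


def lut_eval(x, xs, ys):
    """Evaluate the table with linear interpolation, clamping out-of-range inputs."""
    return np.interp(np.clip(x, xs[0], xs[-1]), xs, ys)


# Hypothetical 33-entry table for exp on [-8, 0] (step 0.25).
xs, ys = build_lut(np.exp, -8.0, 0.0, 33)

# Measure the worst-case approximation error on a dense grid.
x = np.linspace(-8.0, 0.0, 1001)
max_err = float(np.max(np.abs(lut_eval(x, xs, ys) - np.exp(x))))
```

With step h = 0.25 and |exp''| ≤ 1 on this range, the interpolation error is bounded by h²/8 ≈ 0.0078, so a small table already gives a usable approximation; a cascaded scheme would presumably refine accuracy where the function curves most.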

[290] STELLAR: Scene Text Editor for Low-Resource Languages and Real-World Data

Yongdeuk Seo, Hyun-seok Min, Sungchul Choi

Main category: cs.CV

TL;DR: STELLAR is a diffusion-based scene text editing model that addresses limitations in multilingual support, domain gaps, and evaluation metrics through language-adaptive glyph encoding, multi-stage training, and a novel Text Appearance Similarity metric.

Details

Motivation: Current scene text editing methods lack support for low-resource languages, suffer from synthetic-to-real domain gaps, and lack appropriate metrics for evaluating text style preservation.

Method: Proposes STELLAR with language-adaptive glyph encoder, multi-stage training (pre-training on synthetic data then fine-tuning on real images), and constructs STIPLAR dataset for training/evaluation.

Result: Outperforms state-of-the-art models in visual consistency and recognition accuracy, achieving 2.2% average TAS improvement across languages over baselines.

Conclusion: STELLAR effectively addresses multilingual scene text editing challenges through adaptive encoding, staged training, and robust evaluation metrics.

Abstract: Scene Text Editing (STE) is the task of modifying text content in an image while preserving its visual style, such as font, color, and background. While recent diffusion-based approaches have shown improvements in visual quality, key limitations remain: lack of support for low-resource languages, domain gap between synthetic and real data, and the absence of appropriate metrics for evaluating text style preservation. To address these challenges, we propose STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data). STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy that first pre-trains on synthetic data and then fine-tunes on real images. We also construct a new dataset, STIPLAR (Scene Text Image Pairs of Low-resource lAnguages and Real-world data), for training and evaluation. Furthermore, we propose Text Appearance Similarity (TAS), a novel metric that assesses style preservation by independently measuring font, color, and background similarity, enabling robust evaluation even without ground truth. Experimental results demonstrate that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, achieving an average TAS improvement of 2.2% across languages over the baselines.

[291] LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers

Minjun Kim, Jaeri Lee, Jongjin Kim, Jeongin Yun, Yongmo Kwon, U Kang

Main category: cs.CV

TL;DR: LampQ is a layer-wise mixed precision quantization method for Vision Transformers that overcomes limitations of existing methods through fine-grained control, type-aware sensitivity metrics, and optimal bit allocation.

Details

Motivation: Existing quantization methods for Vision Transformers use uniform precision, ignoring the diverse sensitivity of different ViT components to quantization. Previous mixed precision methods suffer from coarse granularity, metric scale mismatch, and quantization-unaware bit allocation.

Method: LampQ performs layer-wise quantization with fine-grained control, uses a type-aware Fisher-based metric to measure sensitivity, assigns bit-widths optimally through integer linear programming, and iteratively updates them.

Result: Extensive experiments show LampQ provides state-of-the-art performance in quantizing ViTs for various tasks including image classification, object detection, and zero-shot quantization.

Conclusion: LampQ effectively overcomes the limitations of existing mixed precision quantization methods for Vision Transformers, achieving accurate quantization with minimal accuracy degradation across multiple vision tasks.

Abstract: How can we accurately quantize a pre-trained Vision Transformer model? Quantization algorithms compress Vision Transformers (ViTs) into low-bit formats, reducing memory and computation demands with minimal accuracy degradation. However, existing methods rely on uniform precision, ignoring the diverse sensitivity of ViT components to quantization. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative, but previous MPQ methods for ViTs suffer from three major limitations: 1) coarse granularity, 2) mismatch in metric scale across component types, and 3) quantization-unaware bit allocation. In this paper, we propose LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), an accurate metric-based MPQ method for ViTs to overcome these limitations. LampQ performs layer-wise quantization to achieve both fine-grained control and efficient acceleration, incorporating a type-aware Fisher-based metric to measure sensitivity. Then, LampQ assigns bit-widths optimally through integer linear programming and further updates them iteratively. Extensive experiments show that LampQ provides state-of-the-art performance in quantizing ViTs pre-trained on various tasks such as image classification, object detection, and zero-shot quantization.
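
Illustration (not from the paper): the bit-allocation step can be pictured as choosing one bit-width per layer to minimize a sensitivity-weighted error under a model-size budget. The sketch below solves a three-layer instance by exhaustive search, which stands in for the integer linear programming solver mentioned in the abstract. The sensitivity scores, parameter counts, candidate bit-widths, and the 4^(-b) error model (the usual scaling of uniform-quantizer noise variance) are all assumptions, not values or formulas from LampQ.

```python
from itertools import product


def allocate_bits(sensitivity, num_params, choices, budget_bits):
    """Pick one bit-width per layer minimizing weighted error within a size budget."""
    best_cost, best_bits = float("inf"), None
    for bits in product(choices, repeat=len(sensitivity)):
        size = sum(n * b for n, b in zip(num_params, bits))
        if size > budget_bits:
            continue  # violates the model-size constraint
        # Assumed error model: quantization noise variance scales as 4^(-bits).
        cost = sum(s * 4.0 ** (-b) for s, b in zip(sensitivity, bits))
        if cost < best_cost:
            best_cost, best_bits = cost, bits
    return best_bits


# Hypothetical per-layer sensitivity scores and parameter counts.
sensitivity = [10.0, 1.0, 0.1]
num_params = [100, 100, 100]
bits = allocate_bits(sensitivity, num_params, choices=(2, 4, 8), budget_bits=1400)
```

In this toy instance the most sensitive layer receives 8 bits and the least sensitive one 2 bits. Exhaustive search grows exponentially with the number of layers, which is why a real method would rely on an ILP or similar solver.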

[292] FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection

Wencong Wu, Xiuwei Zhang, Hanlin Yin, Shun Dai, Hongxi Zhang, Yanning Zhang

Main category: cs.CV

TL;DR: FreDFT is a frequency domain fusion transformer for visible-infrared object detection that addresses modality imbalance through frequency domain attention and cross-modal feature interaction.

Details

Motivation: Existing visible-infrared detection methods suffer from information imbalance between modalities and ignore the potential of frequency domain transformers for mining complementary information.

Method: Proposes multimodal frequency domain attention (MFDA), frequency domain feed-forward layer (FDFFL), cross-modal global modeling module (CGMM), and local feature enhancement module (LFEM) to handle modality imbalance and enhance feature fusion.

Result: Extensive experiments show FreDFT achieves excellent performance on multiple public datasets compared to state-of-the-art methods.

Conclusion: The proposed frequency domain fusion transformer effectively addresses modality imbalance and improves visible-infrared object detection performance.

Abstract: Visible-infrared object detection has gained considerable attention due to its detection performance in low light, fog, and rain conditions. However, visible and infrared modalities captured by different sensors suffer from an information imbalance problem in complex scenarios, which can cause inadequate cross-modal fusion, resulting in degraded detection performance. Furthermore, most existing methods use transformers in the spatial domain to capture complementary features, ignoring the advantages of developing frequency domain transformers to mine complementary information. To solve these weaknesses, we propose a frequency domain fusion transformer, called FreDFT, for visible-infrared object detection. The proposed approach employs a novel multimodal frequency domain attention (MFDA) to mine complementary information between modalities, and a frequency domain feed-forward layer (FDFFL) built on a mixed-scale frequency feature fusion strategy to better enhance multimodal features. To eliminate the imbalance of multimodal information, a cross-modal global modeling module (CGMM) is constructed to perform pixel-wise inter-modal feature interaction in a spatial and channel manner. Moreover, a local feature enhancement module (LFEM) is developed to strengthen multimodal local feature representation and promote multimodal feature fusion by using various convolution layers and applying a channel shuffle. Extensive experimental results have verified that our proposed FreDFT achieves excellent performance on multiple public datasets compared with other state-of-the-art methods. The code of our FreDFT is linked at https://github.com/WenCongWu/FreDFT.

Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang, Rongsheng Qu, Weixin Li, Yunhong Wang, Chenglu Wen

Main category: cs.CV

TL;DR: The paper introduces M3DSG, a multi-modal 3D scene graph that preserves visual cues to address limitations of text-only scene graphs in embodied navigation.

Details

Motivation: Real-world robotic navigation requires open vocabulary generalization and low training overhead, motivating zero-shot methods over task-specific RL training. Existing zero-shot methods using explicit 3D scene graphs compress visual observations into text-only relations, leading to high construction cost, irreversible visual evidence loss, and constrained vocabularies.

Method: The authors introduce Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relations with multi-modal representations.

Result: The abstract does not provide specific experimental results, but implies that M3DSG addresses the limitations of existing text-only scene graph methods.

Conclusion: M3DSG is proposed as a solution to overcome the limitations of text-only 3D scene graphs in embodied navigation, enabling better preservation of visual evidence and more flexible vocabulary handling.

Abstract: Embodied navigation is a fundamental capability for robotic agents operating. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relation

[294] Physics informed Transformer-VAE for biophysical parameter estimation: PROSAIL model inversion in Sentinel-2 imagery

Prince Mensah, Pelumi Victor Aderinto, Ibrahim Salihu Yusuf, Arnu Pretorius

Main category: cs.CV

TL;DR: A physics-informed Transformer-VAE architecture for vegetation parameter retrieval from Sentinel-2 data using only simulated training data, achieving performance comparable to methods requiring real satellite imagery.

Details

Motivation: Accurate retrieval of vegetation biophysical variables is crucial for ecosystem monitoring and agricultural management, but existing methods often require real satellite images for training which can be costly and limited.

Method: Transformer-VAE architecture that incorporates the PROSAIL radiative transfer model as a differentiable physical decoder, trained exclusively on simulated data without requiring real satellite images or in-situ labels.

Result: Achieves retrieval of leaf area index (LAI) and canopy chlorophyll content (CCC) on real-world datasets (FRM4Veg and BelSAR) with accuracy comparable to state-of-the-art methods that use real Sentinel-2 data.

Conclusion: The approach demonstrates how integrating physical models with deep networks enables cost-effective, self-supervised vegetation monitoring without real image calibration, opening prospects for large-scale physically-constrained remote sensing.

Abstract: Accurate retrieval of vegetation biophysical variables from satellite imagery is crucial for ecosystem monitoring and agricultural management. In this work, we propose a physics-informed Transformer-VAE architecture to invert the PROSAIL radiative transfer model for simultaneous estimation of key canopy parameters from Sentinel-2 data. Unlike previous hybrid approaches that require real satellite images for self-supervised training, our model is trained exclusively on simulated data, yet achieves performance on par with state-of-the-art methods that utilize real imagery. The Transformer-VAE incorporates the PROSAIL model as a differentiable physical decoder, ensuring that inferred latent variables correspond to physically plausible leaf and canopy properties. We demonstrate retrieval of leaf area index (LAI) and canopy chlorophyll content (CCC) on real-world field datasets (FRM4Veg and BelSAR) with accuracy comparable to models trained with real Sentinel-2 data. Our method requires no in-situ labels or calibration on real images, offering a cost-effective and self-supervised solution for global vegetation monitoring. The proposed approach illustrates how integrating physical models with advanced deep networks can improve the inversion of radiative transfer models (RTMs), opening new prospects for large-scale, physically-constrained remote sensing of vegetation traits.

[295] RodEpil: A Video Dataset of Laboratory Rodents for Seizure Detection and Benchmark Evaluation

Daniele Perlo, Vladimir Despotovic, Selma Boudissa, Sang-Yoon Kim, Petr V. Nazarov, Yanrong Zhang, Max Wintermark, Olivier Keunen

Main category: cs.CV

TL;DR: A curated video dataset (RodEpil) for detecting convulsive events in laboratory rodents using top-down and side-view video clips, achieving 97% F1-score with TimeSformer classifier.

Details

Motivation: To support non-invasive, video-based monitoring in preclinical epilepsy research by providing a standardized dataset for automatic seizure detection in rodents.

Method: Collected 13,053 video clips (10,101 normal, 2,952 seizure) from 19 subjects, used TimeSformer transformer-based video classifier with strict subject-wise five-fold cross-validation to prevent data leakage.

Result: TimeSformer achieved 97% average F1-score for discriminating between seizure and normal activity in rodents.

Conclusion: The RodEpil dataset enables effective video-based seizure detection in preclinical research and is publicly available to support reproducible epilepsy monitoring studies.

Abstract: We introduce a curated video dataset of laboratory rodents for automatic detection of convulsive events. The dataset contains short (10 s) top-down and side-view video clips of individual rodents, labeled at clip level as normal activity or seizure. It includes 10,101 negative samples and 2,952 positive samples collected from 19 subjects. We describe the data curation, annotation protocol and preprocessing pipeline, and report baseline experiments using a transformer-based video classifier (TimeSformer). Experiments employ five-fold cross-validation with strict subject-wise partitioning to prevent data leakage (no subject appears in more than one fold). Results show that the TimeSformer architecture enables discrimination between seizure and normal activity with an average F1-score of 97%. The dataset and baseline code are publicly released to support reproducible research on non-invasive, video-based monitoring in preclinical epilepsy research. RodEpil Dataset access - DOI: 10.5281/zenodo.17601357
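
The subject-wise partitioning described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the subject IDs, clip counts, and the greedy balancing rule are assumptions made for the example. The only property taken from the abstract is that no subject appears in more than one fold.

```python
from collections import defaultdict

def subject_wise_folds(clip_subjects, n_folds=5):
    """Assign every subject to exactly one fold, then map its clips to that fold.

    clip_subjects[i] is the subject ID of clip i.
    Returns a list of n_folds lists of clip indices.
    """
    # Group clip indices by subject.
    clips_by_subject = defaultdict(list)
    for idx, subject in enumerate(clip_subjects):
        clips_by_subject[subject].append(idx)

    # Greedy balancing (an assumption): place the largest subjects first,
    # each into the fold that currently holds the fewest clips.
    folds = [[] for _ in range(n_folds)]
    ordered = sorted(clips_by_subject, key=lambda s: (-len(clips_by_subject[s]), s))
    for subject in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(clips_by_subject[subject])
    return folds

# Toy data: 19 hypothetical subjects with 3 to 6 clips each.
clip_subjects = [f"rat{s:02d}" for s in range(19) for _ in range(3 + s % 4)]
folds = subject_wise_folds(clip_subjects)
fold_subjects = [{clip_subjects[i] for i in fold} for fold in folds]
```

Each fold then serves once as the held-out set. Because the split is made at subject level, clips from one animal never appear in both training and test data.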

[296] OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer

Haosong Peng, Hao Li, Yalun Dai, Yushi Lan, Yihang Luo, Tianyu Qi, Zhengshen Zhang, Yufeng Zhan, Junfei Zhang, Wenchao Xu, Ziwei Liu

Main category: cs.CV

TL;DR: OmniVGGT is a 3D foundation model framework that effectively incorporates geometric modalities like depth, camera intrinsics, and poses into spatial foundation models using a GeoAdapter with zero-initialized convolutions and stochastic multimodal fusion.

Details

Motivation: Most 3D foundation models only use RGB inputs and ignore available geometric cues, limiting their performance on vision tasks that benefit from spatial information.

Method: Proposes GeoAdapter with zero-initialized convolutions to inject geometric information without disrupting foundation model representations, and stochastic multimodal fusion that randomly samples modality subsets during training for robust inference with arbitrary modality inputs.

Result: Outperforms prior methods on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation, achieving state-of-the-art results even with RGB-only input. When integrated into vision-language-action models, it outperforms point-cloud-based baselines and achieves consistent gains on robotic tasks.

Conclusion: OmniVGGT effectively leverages geometric modalities to enhance 3D foundation models, demonstrating superior performance across multiple vision tasks and practical utility in robotic applications.

Abstract: General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model’s representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrated OmniVGGT into vision-language-action (VLA) models. The enhanced VLA model by OmniVGGT not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
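
The two mechanisms named in the abstract can be illustrated with a toy sketch. It assumes a plain linear map in place of the paper's convolutions, and the modality names, `keep_prob`, and class name are hypothetical. It shows only why zero initialization leaves the base representation untouched at the start of training, and how a random modality subset might be drawn per training instance.

```python
import random

MODALITIES = ("depth", "intrinsics", "pose")  # hypothetical labels

def sample_modalities(rng, keep_prob=0.5):
    """Keep each auxiliary modality independently with probability keep_prob."""
    return {m for m in MODALITIES if rng.random() < keep_prob}

class ZeroInitAdapter:
    """Linear stand-in for a zero-initialized injection layer."""

    def __init__(self, dim):
        # All weights start at zero, so the injected term is zero at first.
        self.weight = [[0.0] * dim for _ in range(dim)]

    def __call__(self, rgb_tokens, geo_features):
        injected = [sum(w * g for w, g in zip(row, geo_features))
                    for row in self.weight]
        # Residual addition: base features plus the geometric contribution.
        return [t + d for t, d in zip(rgb_tokens, injected)]

rng = random.Random(0)
active = sample_modalities(rng)   # may be empty, i.e. RGB-only
adapter = ZeroInitAdapter(dim=3)
rgb = [0.1, 0.2, 0.3]
out_at_init = adapter(rgb, [5.0, 6.0, 7.0])
```

At initialization `out_at_init` equals `rgb`, so the geometric branch can only start to influence the output once training moves the weights away from zero.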

cs.AI

[297] The Second Law of Intelligence: Controlling Ethical Entropy in Autonomous Systems

Samih Fadli

Main category: cs.AI

TL;DR: The paper proposes a Second Law of AI analogous to thermodynamics, showing that ethical entropy (divergence from intended goals) increases spontaneously without continuous alignment work, and provides a quantitative framework for maintaining AI stability.

Details

Motivation: To establish a theoretical foundation for AI alignment by drawing analogies to thermodynamics, addressing the problem of goal drift and specification gaming in unconstrained AI systems.

Method: Define ethical entropy S = -Σ p(g_i; theta) ln p(g_i; theta) over goals, prove dS/dt >= 0, derive critical stability boundary gamma_crit = (lambda_max / 2) ln N, and validate with simulations of 7-billion-parameter models.

Result: Unregularized 7B-parameter model drifts from entropy 0.32 to 1.69±1.08 nats, while system with alignment work gamma=20.4 (1.5 gamma_crit) maintains stability at 0.00±0.00 nats (p=4.19×10^-17).

Conclusion: AI alignment should be treated as a continuous thermodynamic control problem, providing quantitative foundations for maintaining stability and safety of autonomous systems.

Abstract: We propose that unconstrained artificial intelligence obeys a Second Law analogous to thermodynamics, where ethical entropy, defined as a measure of divergence from intended goals, increases spontaneously without continuous alignment work. For gradient-based optimizers, we define this entropy over a finite set of goals {g_i} as S = -Σ p(g_i; theta) ln p(g_i; theta), and we prove that its time derivative dS/dt >= 0, driven by exploration noise and specification gaming. We derive the critical stability boundary for alignment work as gamma_crit = (lambda_max / 2) ln N, where lambda_max is the dominant eigenvalue of the Fisher Information Matrix and N is the number of model parameters. Simulations validate this theory. A 7-billion-parameter model (N = 7 x 10^9) with lambda_max = 1.2 drifts from an initial entropy of 0.32 to 1.69 +/- 1.08 nats, while a system regularized with alignment work gamma = 20.4 (1.5 gamma_crit) maintains stability at 0.00 +/- 0.00 nats (p = 4.19 x 10^-17, n = 20 trials). This framework recasts AI alignment as a problem of continuous thermodynamic control, providing a quantitative foundation for maintaining the stability and safety of advanced autonomous systems.
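
The figures quoted in the abstract can be checked directly from the stated formulas. The sketch below only evaluates gamma_crit = (lambda_max / 2) ln N and the entropy definition with the numbers given above. It does not reproduce the simulations, and the goal distributions passed to `entropy` are made-up examples.

```python
import math

lambda_max = 1.2   # dominant Fisher eigenvalue quoted in the abstract
n_params = 7e9     # N = 7 x 10^9 model parameters

# Critical stability boundary: gamma_crit = (lambda_max / 2) * ln N
gamma_crit = (lambda_max / 2) * math.log(n_params)
gamma_used = 1.5 * gamma_crit

def entropy(probs):
    """S = -sum p ln p in nats, skipping zero-probability goals."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform_four = entropy([0.25] * 4)      # maximal spread over 4 goals
single_goal = entropy([1.0, 0.0, 0.0])  # all mass on one goal

print(round(gamma_crit, 2), round(gamma_used, 2))
```

With these inputs gamma_crit is about 13.60, and 1.5 times that is about 20.40, which matches the gamma = 20.4 reported for the regularized system. An entropy of 0.00 nats corresponds to all probability mass on a single goal.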

[298] Co-EPG: A Framework for Co-Evolution of Planning and Grounding in Autonomous GUI Agents

Yuan Zhao, Hualei Zhu, Tingyu Jiang, Shen Li, Xiaohang Xu, Hao Henry Wang

Main category: cs.AI

TL;DR: Co-EPG is a self-iterative training framework that enables co-evolution of planning and grounding models for GUI task automation, creating a positive feedback loop that improves both components without external data.

Details

Motivation: Current GUI agents have insufficient cross-model synergies and over-rely on synthetic data without sufficient utilization, limiting their effectiveness in task automation.

Method: Co-EPG establishes an iterative feedback loop where the planning model explores strategies under grounding-based rewards via GRPO, generating diverse data to optimize the grounding model, which in turn provides better rewards for planning model training.

Result: On Multimodal-Mind2Web and AndroidControl benchmarks, Co-EPG outperforms state-of-the-art methods after just three iterations without external data, with consistent improvement in each iteration.

Conclusion: This work establishes a novel training paradigm for GUI agents that shifts from isolated optimization to an integrated, self-driven co-evolution approach.

Abstract: Graphical User Interface (GUI) task automation constitutes a critical frontier in artificial intelligence research. While effective GUI agents synergistically integrate planning and grounding capabilities, current methodologies exhibit two fundamental limitations: (1) insufficient exploitation of cross-model synergies, and (2) over-reliance on synthetic data generation without sufficient utilization. To address these challenges, we propose Co-EPG, a self-iterative training framework for Co-Evolution of Planning and Grounding. Co-EPG establishes an iterative positive feedback loop: through this loop, the planning model explores superior strategies under grounding-based reward guidance via Group Relative Policy Optimization (GRPO), generating diverse data to optimize the grounding model. Concurrently, the optimized Grounding model provides more effective rewards for subsequent GRPO training of the planning model, fostering continuous improvement. Co-EPG thus enables iterative enhancement of agent capabilities through self-play optimization and training data distillation. On the Multimodal-Mind2Web and AndroidControl benchmarks, our framework outperforms existing state-of-the-art methods after just three iterations without requiring external data. The agent consistently improves with each iteration, demonstrating robust self-enhancement capabilities. This work establishes a novel training paradigm for GUI agents, shifting from isolated optimization to an integrated, self-driven co-evolution approach.
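
For readers unfamiliar with GRPO, its core step scores each sampled plan relative to the other plans in its group. The sketch below shows that normalization only. The reward values are invented, the reward function that would come from the grounding model is not modeled, and using the population standard deviation is an assumption, since implementations differ on this point.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled plans.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical grounding-based rewards for four plans sampled for one task.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every plan receives the same reward, no plan is preferred.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Plans that score above the group mean receive positive advantages and are reinforced, while a group with identical rewards yields all-zero advantages and therefore no learning signal.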

[299] Multi-agent Undercover Gaming: Hallucination Removal via Counterfactual Test for Multimodal Reasoning

Dayong Liang, Xiao-Yong Wei, Changmeng Zheng

Main category: cs.AI

TL;DR: MUG introduces a multi-agent undercover gaming protocol that detects hallucinating agents in LLMs using counterfactual image modifications, advancing multimodal reasoning beyond traditional debate methods.

Details

Motivation: Address hallucination issues in LLM reasoning by improving Multi-Agent Debate (MAD) protocols, which unrealistically assume all agents are rational and not prone to hallucinations.

Method: Multi-agent Undercover Gaming (MUG) protocol inspired by social deduction games, using counterfactual image modifications to identify hallucinating agents through multimodal counterfactual tests and active probing discussions.

Result: Provides a more reliable framework for multimodal reasoning by enabling factual verification beyond statistical consensus, introducing cross-evidence reasoning, and fostering active reasoning among agents.

Conclusion: MUG offers significant improvements over traditional MAD protocols for detecting and mitigating hallucinations in LLMs, creating a more robust multimodal reasoning framework.

Abstract: Hallucination continues to pose a major obstacle in the reasoning capabilities of large language models (LLMs). Although the Multi-Agent Debate (MAD) paradigm offers a promising solution by promoting consensus among multiple agents to enhance reliability, it relies on the unrealistic assumption that all debaters are rational and reflective, which is a condition that may not hold when agents themselves are prone to hallucinations. To address this gap, we introduce the Multi-agent Undercover Gaming (MUG) protocol, inspired by social deduction games like “Who is Undercover?”. MUG reframes MAD as a process of detecting “undercover” agents (those suffering from hallucinations) by employing multimodal counterfactual tests. Specifically, we modify reference images to introduce counterfactual evidence and observe whether agents can accurately identify these changes, providing ground-truth for identifying hallucinating agents and enabling robust, crowd-powered multimodal reasoning. MUG advances MAD protocols along three key dimensions: (1) enabling factual verification beyond statistical consensus through counterfactual testing; (2) introducing cross-evidence reasoning via dynamically modified evidence sources instead of relying on static inputs; and (3) fostering active reasoning, where agents engage in probing discussions rather than passively answering questions. Collectively, these innovations offer a more reliable and effective framework for multimodal reasoning in LLMs. The source code can be accessed at https://github.com/YongLD/MUG.git.

[300] Picking a Representative Set of Solutions in Multiobjective Optimization: Axioms, Algorithms, and Experiments

Niclas Boehmer, Maximilian T. Wittmann

Main category: cs.AI

TL;DR: The paper reframes Pareto pruning as a multiwinner voting problem, conducts axiomatic analysis of quality measures, introduces a new measure called directed coverage, analyzes computational complexity, and shows experimental results.

Details

Motivation: To reduce cognitive load on decision makers by computing fixed-size subsets of Pareto optimal solutions that best represent the full set, addressing unintuitive behaviors in existing quality measures.

Method: Reframing Pareto pruning as multiwinner voting, axiomatic analysis of quality measures, introducing directed coverage measure, computational complexity analysis, and experimental evaluation.

Result: Uncovered unintuitive behaviors in existing measures, identified tractability boundaries, and showed that the choice of quality measure decisively shapes the selected solution set, with the proposed measure performing competitively or even favorably across a range of settings.

Conclusion: The paper provides theoretical foundations and practical guidance for Pareto pruning, demonstrating the importance of quality measure selection and offering a competitive new measure for multi-objective optimization.

Abstract: Many real-world decision-making problems involve optimizing multiple objectives simultaneously, rendering the selection of the most preferred solution a non-trivial problem: All Pareto optimal solutions are viable candidates, and it is typically up to a decision maker to select one for implementation based on their subjective preferences. To reduce the cognitive load on the decision maker, previous work has introduced the Pareto pruning problem, where the goal is to compute a fixed-size subset of Pareto optimal solutions that best represent the full set, as evaluated by a given quality measure. Reframing Pareto pruning as a multiwinner voting problem, we conduct an axiomatic analysis of existing quality measures, uncovering several unintuitive behaviors. Motivated by these findings, we introduce a new measure, directed coverage. We also analyze the computational complexity of optimizing various quality measures, identifying previously unknown boundaries between tractable and intractable cases depending on the number and structure of the objectives. Finally, we present an experimental evaluation, demonstrating that the choice of quality measure has a decisive impact on the characteristics of the selected set of solutions and that our proposed measure performs competitively or even favorably across a range of settings.
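
The Pareto pruning problem itself can be illustrated on a toy instance. The sketch below assumes maximization of every objective and uses a simple worst-case distance as the quality measure. That measure is a placeholder chosen for the example and is not the paper's directed coverage, whose definition is not given in the abstract. Subset selection is brute force, so it only suits small fronts.

```python
import math
from itertools import combinations

def dominates(a, b):
    """a dominates b: at least as good everywhere, strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def worst_case_distance(subset, front):
    """Placeholder measure (lower is better): the largest distance from
    any front point to its nearest selected representative."""
    return max(min(math.dist(p, s) for s in subset) for p in front)

def prune(front, k):
    """Try every k-subset and return the best one under the measure."""
    return min(combinations(front, k),
               key=lambda subset: worst_case_distance(subset, front))

points = [(1, 9), (2, 8), (5, 5), (8, 2), (9, 1), (4, 4), (1, 1)]
front = pareto_front(points)   # (4, 4) and (1, 1) are dominated
chosen = prune(front, 2)
```

Swapping in a different function for `worst_case_distance` changes which representatives are chosen, which is the sensitivity to the quality measure that the paper studies.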

[301] Structure-Aware Encodings of Argumentation Properties for Clique-width

Yasir Mahmood, Markus Hecher, Johanna Groven, Johannes K. Fichte

Main category: cs.AI

TL;DR: The paper develops novel reductions from abstract argumentation problems to (Q)SAT that linearly preserve clique-width, enabling efficient solving for graphs with small clique-width.

Details

Motivation: To understand encoding capabilities with clique-width for computational problems, particularly in abstract argumentation frameworks where traditional treewidth-based approaches may not apply to dense graphs.

Method: Design directed decomposition-guided (DDG) reductions from argumentation problems to (Q)SAT that linearly preserve clique-width, allowing efficient solving for graphs with small clique-width.

Result: Established novel results for all argumentation semantics including counting, with reductions that cannot be significantly improved under reasonable assumptions.

Conclusion: The paper successfully initiates the study of clique-width-based encodings for abstract argumentation, providing efficient reductions to (Q)SAT that preserve clique-width linearly.

Abstract: Structural measures of graphs, such as treewidth, are central tools in computational complexity, resulting in efficient algorithms when exploiting the parameter. It is even known that modern SAT solvers work efficiently on instances of small treewidth. Since these solvers are widely applied, there is research interest in compact encodings into (Q)SAT, both for solving and for understanding encoding limitations. Even more general is the graph parameter clique-width, which unlike treewidth can be small for dense graphs. Although algorithms are available for clique-width, little is known about encodings. We initiate the quest to understand encoding capabilities with clique-width by considering abstract argumentation, which is a robust framework for reasoning with conflicting arguments. It is based on directed graphs and asks for computationally challenging properties, making it a natural candidate to study computational properties. We design novel reductions from argumentation problems to (Q)SAT. Our reductions linearly preserve the clique-width, resulting in directed decomposition-guided (DDG) reductions. We establish novel results for all argumentation semantics, including counting. Notably, the overhead caused by our DDG reductions cannot be significantly improved under reasonable assumptions.

[302] Potential Outcome Rankings for Counterfactual Decision Making

Yuta Kawakami, Jin Tian

Main category: cs.AI

TL;DR: Introduces two new counterfactual decision-making metrics, the probability of potential outcome ranking (PoR) and the probability of achieving the best potential outcome (PoB), with identification theorems, bounds, estimation methods, and experimental validation.

Details

Motivation: To improve counterfactual decision-making under uncertainty by providing more nuanced metrics that reveal the most probable ranking of potential outcomes and the action most likely to yield the best outcome for individuals.

Method: Established identification theorems and derived bounds for PoR and PoB metrics, developed estimation methods, and conducted numerical experiments to validate finite-sample properties.

Result: Successfully developed and validated estimators for PoR and PoB metrics, demonstrating their application to both simulated data and real-world datasets with good finite-sample performance.

Conclusion: The proposed PoR and PoB metrics provide valuable tools for counterfactual decision-making, offering more informative ways to compare candidate actions and identify optimal interventions for individuals.

Abstract: Counterfactual decision-making in the face of uncertainty involves selecting the optimal action from several alternatives using causal reasoning. Decision-makers often rank expected potential outcomes (or their corresponding utility and desirability) to compare the preferences of candidate actions. In this paper, we study new counterfactual decision-making rules by introducing two new metrics: the probabilities of potential outcome ranking (PoR) and the probability of achieving the best potential outcome (PoB). PoR reveals the most probable ranking of potential outcomes for an individual, and PoB indicates the action most likely to yield the top-ranked outcome for an individual. We then establish identification theorems and derive bounds for these metrics, and present estimation methods. Finally, we perform numerical experiments to illustrate the finite-sample properties of the estimators and demonstrate their application to a real-world dataset.
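
The two metrics are easiest to see in a simulation where every unit's full set of potential outcomes is known. The sketch below assumes a hypothetical data-generating process with three actions and Gaussian noise. On real data only one potential outcome per unit is observed, which is why the paper needs identification theorems and bounds, so this is an illustration of the quantities and not an estimator.

```python
import random
from collections import Counter

random.seed(0)
actions = ["a0", "a1", "a2"]   # hypothetical candidate actions

def simulate_unit():
    """Draw all potential outcomes for one individual (simulation only)."""
    u = random.gauss(0, 1)  # unit-level noise shared across actions
    return {
        "a0": u + random.gauss(0.0, 1),
        "a1": u + random.gauss(0.5, 1),
        "a2": u + random.gauss(0.2, 1),
    }

units = [simulate_unit() for _ in range(10_000)]

# PoR: frequency of each full ranking of the actions, best outcome first.
ranking_counts = Counter(
    tuple(sorted(actions, key=lambda a: -y[a])) for y in units
)
por = {ranking: n / len(units) for ranking, n in ranking_counts.items()}

# PoB: frequency with which each action gives the best outcome.
best_counts = Counter(max(actions, key=lambda a: y[a]) for y in units)
pob = {a: best_counts[a] / len(units) for a in actions}

most_likely_ranking = max(por, key=por.get)
most_likely_best = max(pob, key=pob.get)
```

Summing PoR over all rankings that place a given action first recovers that action's PoB, so PoB is a coarser summary of the same joint distribution.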

[303] MUDAS: Mote-scale Unsupervised Domain Adaptation in Multi-label Sound Classification

Jihoon Yun, Chengzhang Li, Dhrubojyoti Roy, Anish Arora

Main category: cs.AI

TL;DR: MUDAS is a lightweight unsupervised domain adaptation framework for multi-label sound classification in IoT devices, using selective retraining and adaptive thresholds to handle domain shifts with minimal resources.

Details

Motivation: Existing UDA methods are designed for single-label tasks and require high computational resources, making them unsuitable for multi-label sound classification in resource-constrained IoT devices where overlapping sounds and varying acoustics pose challenges.

Method: MUDAS selectively retrains only the classifier using high-confidence data, employs class-specific adaptive thresholds for pseudo-label generation, and applies diversity regularization to improve multi-label classification accuracy while minimizing computational and memory requirements.

Result: On the SONYC-UST dataset from various NYC locations, MUDAS achieved notable improvements in classification accuracy over existing UDA algorithms while maintaining efficiency suitable for IoT deployment.

Conclusion: MUDAS successfully addresses the limitations of traditional UDA methods by providing an efficient, lightweight framework for multi-label sound classification in resource-constrained IoT environments, demonstrating practical applicability for urban sound monitoring.

Abstract: Unsupervised Domain Adaptation (UDA) is essential for adapting machine learning models to new, unlabeled environments where data distribution shifts can degrade performance. Existing UDA algorithms are designed for single-label tasks and rely on significant computational resources, limiting their use in multi-label scenarios and in resource-constrained IoT devices. Overcoming these limitations is particularly challenging in contexts such as urban sound classification, where overlapping sounds and varying acoustics require robust, adaptive multi-label capabilities on low-power, on-device systems. To address these limitations, we introduce Mote-scale Unsupervised Domain Adaptation for Sounds (MUDAS), a UDA framework developed for multi-label sound classification in resource-constrained IoT settings. MUDAS efficiently adapts models by selectively retraining the classifier in situ using high-confidence data, minimizing computational and memory requirements to suit on-device deployment. Additionally, MUDAS incorporates class-specific adaptive thresholds to generate reliable pseudo-labels and applies diversity regularization to improve multi-label classification accuracy. In evaluations on the SONYC Urban Sound Tagging (SONYC-UST) dataset recorded at various New York City locations, MUDAS demonstrates notable improvements in classification accuracy over existing UDA algorithms, achieving good performance in a resource-constrained IoT setting.

[304] From Efficiency to Adaptivity: A Deeper Look at Adaptive Reasoning in Large Language Models

Chao Wu, Baoheng Li, Mingchen Gao, Zhenyi Wang

Main category: cs.AI

TL;DR: This survey reframes reasoning in LLMs through adaptivity - the ability to allocate reasoning effort based on input characteristics like difficulty and uncertainty, addressing the limitation of uniform reasoning strategies.

Details

Motivation: Current LLMs apply uniform reasoning strategies regardless of task complexity, generating long traces for trivial problems while failing to extend reasoning for difficult tasks, overlooking the fundamental challenge of adaptivity.

Method: Formalizes deductive, inductive, and abductive reasoning in LLM context; formalizes adaptive reasoning as control-augmented policy optimization; proposes taxonomy organizing methods into training-based (RL, supervised fine-tuning, learned controllers) and training-free approaches (prompt conditioning, feedback-driven halting, modular composition).

Result: Provides a systematic framework that clarifies how different mechanisms realize adaptive reasoning in practice and enables systematic comparison across diverse strategies.

Conclusion: Identifies open challenges in self-evaluation, meta-reasoning, and human-aligned reasoning control for future research directions.

Abstract: Recent advances in large language models (LLMs) have made reasoning a central benchmark for evaluating intelligence. While prior surveys focus on efficiency by examining how to shorten reasoning chains or reduce computation, this view overlooks a fundamental challenge: current LLMs apply uniform reasoning strategies regardless of task complexity, generating long traces for trivial problems while failing to extend reasoning for difficult tasks. This survey reframes reasoning through the lens of adaptivity: the capability to allocate reasoning effort based on input characteristics such as difficulty and uncertainty. We make three contributions. First, we formalize deductive, inductive, and abductive reasoning within the LLM context, connecting these classical cognitive paradigms with their algorithmic realizations. Second, we formalize adaptive reasoning as a control-augmented policy optimization problem balancing task performance with computational cost, distinguishing learned policies from inference-time control mechanisms. Third, we propose a systematic taxonomy organizing existing methods into training-based approaches that internalize adaptivity through reinforcement learning, supervised fine-tuning, and learned controllers, and training-free approaches that achieve adaptivity through prompt conditioning, feedback-driven halting, and modular composition. This framework clarifies how different mechanisms realize adaptive reasoning in practice and enables systematic comparison across diverse strategies. We conclude by identifying open challenges in self-evaluation, meta-reasoning, and human-aligned reasoning control.

Ran Elgedawy, Sanjay Das, Ethan Seefried, Gavin Wiggins, Ryan Burchfield, Dana Hewit, Sudarshan Srinivasan, Todd Thomas, Prasanna Balaprakash, Tirthankar Ghosal

Main category: cs.AI

TL;DR: HARNESS is an AI framework that uses LLMs and human-in-the-loop collaboration to forecast hazardous events and analyze operational risks in DOE environments.

Details

Motivation: To improve operational safety at mission-critical work sites by proactively identifying potential hazards through AI-powered risk analysis.

Method: Integrates LLMs with structured work data, historical event retrieval, risk analysis, and human-in-the-loop refinement by subject matter experts.

Result: Preliminary deployment shows promising results with improved reliability and efficiency of predictive safety systems.

Conclusion: The framework successfully combines SME collaboration with iterative agentic reasoning, with future work focusing on quantitative evaluation metrics.

Abstract: Operational safety at mission-critical work sites is a top priority given the complex and hazardous nature of daily tasks. This paper presents the Human-Agent Risk Navigation and Event Safety System (HARNESS), a modular AI framework designed to forecast hazardous events and analyze operational risks in U.S. Department of Energy (DOE) environments. HARNESS integrates Large Language Models (LLMs) with structured work data, historical event retrieval, and risk analysis to proactively identify potential hazards. A human-in-the-loop mechanism allows subject matter experts (SMEs) to refine predictions, creating an adaptive learning loop that enhances performance over time. By combining SME collaboration with iterative agentic reasoning, HARNESS improves the reliability and efficiency of predictive safety systems. Preliminary deployment shows promising results, with future work focusing on quantitative evaluation of accuracy, SME agreement, and decision latency reduction.

[306] HyperComplEx: Adaptive Multi-Space Knowledge Graph Embeddings

Jugal Gajjar, Kaustik Ranaware, Kamalasankari Subramaniakuppusamy, Vaibhav Gandhi

Main category: cs.AI

TL;DR: HyperComplEx is a hybrid knowledge graph embedding framework that adaptively combines hyperbolic, complex, and Euclidean spaces via learned attention mechanisms to overcome limitations of single-space models.

Details

Motivation: Existing embedding methods face critical limitations: Euclidean models struggle with hierarchies, vector space models cannot capture asymmetry, and hyperbolic models fail on symmetric relations. A unified approach is needed to handle diverse relationship types at scale.

Method: A relation-specific space weighting strategy dynamically selects optimal geometries for each relation type using learned attention mechanisms, with a multi-space consistency loss to ensure coherent predictions across spaces. The model scales near-linearly through adaptive dimension allocation.

Result: HyperComplEx achieves 0.612 MRR on a 10M-paper dataset (45M triples), a 4.8% relative gain over best baselines, with efficient 85 ms inference per triple. It consistently outperforms TransE, RotatE, DistMult, ComplEx, SEPA, and UltraE across various benchmarks.

Conclusion: The hybrid framework successfully addresses limitations of single-space embeddings by adaptively combining multiple geometric spaces, demonstrating superior performance and scalability for large knowledge graphs while maintaining training efficiency.

Abstract: Knowledge graphs have emerged as fundamental structures for representing complex relational data across scientific and enterprise domains. However, existing embedding methods face critical limitations when modeling diverse relationship types at scale: Euclidean models struggle with hierarchies, vector space models cannot capture asymmetry, and hyperbolic models fail on symmetric relations. We propose HyperComplEx, a hybrid embedding framework that adaptively combines hyperbolic, complex, and Euclidean spaces via learned attention mechanisms. A relation-specific space weighting strategy dynamically selects optimal geometries for each relation type, while a multi-space consistency loss ensures coherent predictions across spaces. We evaluate HyperComplEx on computer science research knowledge graphs ranging from 1K papers (~25K triples) to 10M papers (~45M triples), demonstrating consistent improvements over state-of-the-art baselines including TransE, RotatE, DistMult, ComplEx, SEPA, and UltraE. Additional tests on standard benchmarks confirm significantly higher results than all baselines. On the 10M-paper dataset, HyperComplEx achieves 0.612 MRR, a 4.8% relative gain over the best baseline, while maintaining efficient training, achieving 85 ms inference per triple. The model scales near-linearly with graph size through adaptive dimension allocation. We release our implementation and dataset family to facilitate reproducible research in scalable knowledge graph embeddings.
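
The relation-specific space weighting described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each geometric space produces a scalar score for a triple and that attention is a softmax over per-relation logits; the function names, scores, and logits below are hypothetical and are not taken from the HyperComplEx implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine_space_scores(space_scores, relation_logits):
    """Blend per-space triple scores with relation-specific attention weights.

    space_scores: scores from (hyperbolic, complex, Euclidean) scorers (assumed scalars).
    relation_logits: learned attention logits for this relation (assumed given).
    """
    weights = softmax(relation_logits)
    combined = sum(w * s for w, s in zip(weights, space_scores))
    return combined, weights

# Hypothetical scores for one triple, in the order hyperbolic, complex, Euclidean.
scores = [0.9, 0.2, 0.4]
# Hypothetical logits for a hierarchical relation that favors hyperbolic space.
logits = [2.0, 0.0, 0.0]
combined, weights = combine_space_scores(scores, logits)
```

With equal logits the combination reduces to a plain average of the three scores; a larger logit pulls the combined score toward that space's score.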

[307] Advanced Tool for Traffic Crash Analysis: An AI-Driven Multi-Agent Approach to Pre-Crash Reconstruction

Gerui Xu, Boyou Chen, Huizhong Guo, Dave LeBlanc, Ananna Ahmed, Zhaonan Sun, Shan Bao

Main category: cs.AI

TL;DR: Multi-agent AI framework reconstructs pre-crash scenarios from fragmented collision data, achieving perfect accuracy on 39 complex rear-end collision cases versus 92% for human researchers.

Details

Motivation: Traditional traffic collision reconstruction relying on human expertise often yields inconsistent results when analyzing incomplete multimodal data, creating a need for more reliable automated systems.

Method: Two-phase collaborative framework: Phase I generates natural-language crash reconstructions from multimodal inputs (textual reports, structured data, visual diagrams); Phase II performs in-depth reasoning by combining reconstructions with temporal Event Data Recorder data.

Result: Achieved perfect accuracy across all 39 complex test cases, successfully identifying relevant EDR events and distinguishing striking/struck vehicles, surpassing human researchers’ 92% accuracy. Maintained robust performance with incomplete data.

Conclusion: Demonstrates superior AI capabilities in processing heterogeneous collision data, providing unprecedented precision in reconstructing impact dynamics and characterizing pre-crash behaviors.

Abstract: Traffic collision reconstruction traditionally relies on human expertise, often yielding inconsistent results when analyzing incomplete multimodal data. This study develops a multi-agent AI framework that reconstructs pre-crash scenarios and infers vehicle behaviors from fragmented collision data. We present a two-phase collaborative framework combining reconstruction and reasoning phases. The system processes 277 rear-end lead vehicle deceleration (LVD) collisions from the Crash Investigation Sampling System, integrating textual crash reports, structured tabular data, and visual scene diagrams. Phase I generates natural-language crash reconstructions from multimodal inputs. Phase II performs in-depth crash reasoning by combining these reconstructions with temporal Event Data Recorder (EDR) data. For validation, we applied it to all LVD cases, focusing on a subset of 39 complex crashes where multiple EDR records per collision introduced ambiguity (e.g., due to missing or conflicting data). The evaluation of the 39 LVD crash cases revealed that our framework achieved perfect accuracy across all test cases, successfully identifying the most relevant EDR event and correctly distinguishing striking versus struck vehicles, surpassing the 92% accuracy achieved by human researchers on the same challenging dataset. The system maintained robust performance even when processing incomplete data, including missing or erroneous EDR records and ambiguous scene diagrams. This study demonstrates superior AI capabilities in processing heterogeneous collision data, providing unprecedented precision in reconstructing impact dynamics and characterizing pre-crash behaviors.

[308] Enhancing Demand-Oriented Regionalization with Agentic AI and Local Heterogeneous Data for Adaptation Planning

Seyedeh Mobina Noorani, Shangde Gao, Changjie Chen, Karla Saldana Ochoa

Main category: cs.AI

TL;DR: A planning support system using agentic AI to create dynamic, demand-oriented regions for disaster planning, featuring human-in-the-loop transparency and adaptability.

Details

Motivation: Conventional planning units like census tracts lack flexibility and fail to capture specific local community demands for effective hazard prevention and response strategies.

Method: Built on RepSC-SOM (representative initialized spatially constrained self-organizing map) with adaptive geographic filtering and region-growing refinement, using AI agents to suggest features, guide spatial constraints, and support interactive exploration.

Result: Successfully demonstrated through a case study on flooding risk in Jacksonville, Florida, enabling users to interactively explore, generate, and evaluate regionalization with computational rigor and user-driven decision making.

Conclusion: The system effectively combines computational methods with human oversight to create flexible, demand-oriented planning regions for disaster management.

Abstract: Conventional planning units or urban regions, such as census tracts, zip codes, or neighborhoods, often do not capture the specific demands of local communities and lack the flexibility to implement effective strategies for hazard prevention or response. To support the creation of dynamic planning units, we introduce a planning support system with agentic AI that enables users to generate demand-oriented regions for disaster planning, integrating the human-in-the-loop principle for transparency and adaptability. The platform is built on a representative initialized spatially constrained self-organizing map (RepSC-SOM), extending traditional SOM with adaptive geographic filtering and region-growing refinement, while AI agents can reason, plan, and act to guide the process by suggesting input features, guiding spatial constraints, and supporting interactive exploration. We demonstrate the capabilities of the platform through a case study on the flooding-related risk in Jacksonville, Florida, showing how it allows users to explore, generate, and evaluate regionalization interactively, combining computational rigor with user-driven decision making.

[309] LLM enhanced graph inference for long-term disease progression modelling

Tiantian He, An Zhao, Elinor Thompson, Anna Schroder, Ahmed Abdulaal, Frederik Barkhof, Daniel C. Alexander

Main category: cs.AI

TL;DR: A novel framework using LLMs to enhance learning of neurodegenerative disease progression by constructing biologically-constrained interaction graphs from irregular longitudinal data, applied to Alzheimer’s disease tau pathology spread.

Details

Motivation: Current methods oversimplify brain connectivity by assuming single-modality connectomes as disease-spreading substrates, leading to inaccurate long-term predictions, while purely data-driven approaches face identifiability issues due to lack of proper constraints.

Method: Uses Large Language Models as expert guides on regional variable interactions to simultaneously optimize long-term disease trajectory construction from irregular longitudinal data and biologically-constrained graph structures capturing brain region interactions.

Result: Demonstrated superior prediction accuracy and interpretability compared to traditional approaches when applied to tau-PET imaging data from an Alzheimer’s disease cohort, revealing additional disease-driving factors beyond conventional connectivity measures.

Conclusion: The LLM-enhanced framework provides a more accurate and interpretable method for modeling neurodegenerative disease progression by leveraging multi-modal relationships and biological constraints, overcoming limitations of existing approaches.

Abstract: Understanding the interactions between biomarkers among brain regions during neurodegenerative disease is essential for unravelling the mechanisms underlying disease progression. For example, pathophysiological models of Alzheimer’s Disease (AD) typically describe how variables, such as regional levels of toxic proteins, interact spatiotemporally within a dynamical system driven by an underlying biological substrate, often based on brain connectivity. However, current methods grossly oversimplify the complex relationship between brain connectivity by assuming a single-modality brain connectome as the disease-spreading substrate. This leads to inaccurate predictions of pathology spread, especially during the long-term progression period. Meanwhile, other methods of learning such a graph in a purely data-driven way face the identifiability issue due to a lack of proper constraints. We thus present a novel framework that uses Large Language Models (LLMs) as expert guides on the interaction of regional variables to enhance learning of disease progression from irregularly sampled longitudinal patient data. By leveraging LLMs’ ability to synthesize multi-modal relationships and incorporate diverse disease-driving mechanisms, our method simultaneously optimizes 1) the construction of long-term disease trajectories from individual-level observations and 2) the biologically-constrained graph structure that captures interactions among brain regions with better identifiability. We demonstrate the new approach by estimating the pathology propagation using tau-PET imaging data from an Alzheimer’s disease cohort. The new framework demonstrates superior prediction accuracy and interpretability compared to traditional approaches while revealing additional disease-driving factors beyond conventional connectivity measures.

[310] Multi-Agent Legal Verifier Systems for Data Transfer Planning

Ha-Thanh Nguyen, Wachara Fungwacharakorn, Ken Satoh

Main category: cs.AI

TL;DR: A multi-agent legal verifier system for AI-driven data transfer compliance achieves 72% accuracy on APPI Article 16 cases, significantly outperforming single-agent baselines through specialized agents and coordinated reasoning.

Details

Motivation: Legal compliance in AI-driven data transfer planning is critical under stringent privacy regulations like Japan's APPI, requiring trustworthy and interpretable automated compliance verification.

Method: Multi-agent system decomposing compliance checking into specialized agents for statutory interpretation, business context evaluation, and risk assessment, coordinated through a structured synthesis protocol.

Result: 72% accuracy on 200 APPI Article 16 cases (21 percentage points higher than the single-agent baseline), 90% accuracy on clear compliance cases (vs. 16% for the baseline), and perfect detection of clear violations.

Conclusion: Domain specialization and coordinated reasoning meaningfully improve legal AI performance, providing a scalable and regulation-aware framework for trustworthy automated compliance verification.

Abstract: Legal compliance in AI-driven data transfer planning is becoming increasingly critical under stringent privacy regulations such as the Japanese Act on the Protection of Personal Information (APPI). We propose a multi-agent legal verifier that decomposes compliance checking into specialized agents for statutory interpretation, business context evaluation, and risk assessment, coordinated through a structured synthesis protocol. Evaluated on a stratified dataset of 200 Amended APPI Article 16 cases with clearly defined ground truth labels and multiple performance metrics, the system achieves 72% accuracy, which is 21 percentage points higher than a single-agent baseline, including 90% accuracy on clear compliance cases (vs. 16% for the baseline) while maintaining perfect detection of clear violations. While challenges remain in ambiguous scenarios, these results show that domain specialization and coordinated reasoning can meaningfully improve legal AI performance, providing a scalable and regulation-aware framework for trustworthy and interpretable automated compliance verification.

[311] Requirements for Aligned, Dynamic Resolution of Conflicts in Operational Constraints

Steven J. Jones, Robert E. Wray, John E. Laird

Main category: cs.AI

TL;DR: AI agents need to construct and evaluate multiple courses of action in novel situations where no single option fully satisfies all constraints, requiring integration of normative, pragmatic, and situational knowledge beyond trained policies.

Details

Motivation: Autonomous AI systems inevitably encounter scenarios where no available course of action fully satisfies all operational constraints, requiring them to go beyond trained policies to construct, evaluate, and justify actions aligned with human values.

Method: The paper uses both analysis and empirical case studies to examine how agents need to integrate normative, pragmatic, and situational understanding for decision making in complex environments.

Result: Identifies requirements for agent decision making and the types of knowledge needed to make decisions robust to agent goals and aligned with human expectations.

Conclusion: Agents must integrate multiple types of contextual knowledge to select and pursue aligned courses of action in complex real-world environments where no single option satisfies all constraints.

Abstract: Deployed, autonomous AI systems must often evaluate multiple plausible courses of action (extended sequences of behavior) in novel or under-specified contexts. Despite extensive training, these systems will inevitably encounter scenarios where no available course of action fully satisfies all operational constraints (e.g., operating procedures, rules, laws, norms, and goals). To achieve goals in accordance with human expectations and values, agents must go beyond their trained policies and instead construct, evaluate, and justify candidate courses of action. These processes require contextual “knowledge” that may lie outside prior (policy) training. This paper characterizes requirements for agent decision making in these contexts. It also identifies the types of knowledge agents require to make decisions robust to agent goals and aligned with human expectations. Drawing on both analysis and empirical case studies, we examine how agents need to integrate normative, pragmatic, and situational understanding to select and then to pursue more aligned courses of action in complex, real-world environments.

[312] AI Agent-Driven Framework for Automated Product Knowledge Graph Construction in E-Commerce

Dimitar Peshevski, Riste Stojanov, Dimitar Trajanov

Main category: cs.AI

TL;DR: Automated AI agent framework using LLMs to construct product knowledge graphs from unstructured product descriptions, achieving 97% property coverage with minimal redundancy.

Details

Motivation: E-commerce platforms generate vast unstructured product data, creating challenges for information retrieval and analytics. Manual KG construction is complex, requiring automated solutions.

Method: Three-stage agent-driven approach: ontology creation/expansion, ontology refinement, and KG population using LLMs, without predefined schemas or handcrafted rules.

Result: Evaluated on air conditioner dataset with over 97% property coverage and minimal redundancy, demonstrating strong performance in ontology generation and KG population.

Conclusion: LLMs can effectively automate structured knowledge extraction in retail, providing scalable product data integration and utilization.

Abstract: The rapid expansion of e-commerce platforms generates vast amounts of unstructured product data, creating significant challenges for information retrieval, recommendation systems, and data analytics. Knowledge Graphs (KGs) offer a structured, interpretable format to organize such data, yet constructing product-specific KGs remains a complex and manual process. This paper introduces a fully automated, AI agent-driven framework for constructing product knowledge graphs directly from unstructured product descriptions. Leveraging Large Language Models (LLMs), our method operates in three stages using dedicated agents: ontology creation and expansion, ontology refinement, and knowledge graph population. This agent-based approach ensures semantic coherence, scalability, and high-quality output without relying on predefined schemas or handcrafted extraction rules. We evaluate the system on a real-world dataset of air conditioner product descriptions, demonstrating strong performance in both ontology generation and KG population. The framework achieves over 97% property coverage and minimal redundancy, validating its effectiveness and practical applicability. Our work highlights the potential of LLMs to automate structured knowledge extraction in retail, providing a scalable path toward intelligent product data integration and utilization.

[313] Faster Symmetry Breaking Constraints for Abstract Structures

Özgür Akgün, Mun See Chang, Ian P. Gent, Christopher Jefferson

Main category: cs.AI

TL;DR: A new incomplete method for breaking symmetries in abstract structures by exploiting their representations, specifically for indistinguishable objects, showing improved performance over previous methods.

Details

Motivation: Symmetry breaking in constraint programming significantly speeds up solving, but traditional methods applied to abstract variables produce complex constraints that perform poorly in practice.

Method: A new incomplete symmetry-breaking method that better exploits the representations of abstract structures, specifically targeting symmetries arising from indistinguishable objects.

Result: The proposed method is faster than previous methods proposed in Akgün et al. 2025 for breaking symmetries of abstract structures.

Conclusion: The new incomplete symmetry-breaking method effectively handles symmetries in abstract structures by leveraging their representations, offering practical performance improvements over existing approaches.

Abstract: In constraint programming and related paradigms, a modeller specifies their problem in a modelling language for a solver to search and return its solution(s). Using high-level modelling languages such as Essence, a modeller may express their problems in terms of abstract structures. These are structures not natively supported by the solvers, and so they have to be transformed into or represented as other structures before solving. For example, nested sets are abstract structures, and they can be represented as matrices in constraint solvers. Many problems contain symmetries and one very common and highly successful technique used in constraint programming is to “break” symmetries, to avoid searching for symmetric solutions. This can speed up the solving process by many orders of magnitude. Most of these symmetry-breaking techniques involve placing some kind of ordering for the variables of the problem, and picking a particular member under the symmetries, usually the smallest. Unfortunately, applying this technique to abstract variables produces a very large number of complex constraints that perform poorly in practice. In this paper, we demonstrate a new incomplete method of breaking the symmetries of abstract structures by better exploiting their representations. We apply the method in breaking the symmetries arising from indistinguishable objects, a commonly occurring type of symmetry, and show that our method is faster than the previous methods proposed in (Akgün et al. 2025).

[314] Key Decision-Makers in Multi-Agent Debates: Who Holds the Power?

Qian Zhang, Yan Zheng, Jinyi Liu, Hebin Liang, Lanjun Wang

Main category: cs.AI

TL;DR: The paper introduces a novel “Truth Last” role allocation strategy that improves Multi-Agent Debate performance by up to 22% in reasoning tasks, and proposes MADC strategy to handle unknown truth scenarios.

Details

Motivation: To address the underexplored critical aspect of role allocation strategies in Multi-Agent Debate (MAD) and improve reasoning abilities in LLM agent scaling.

Method: Proposes “Truth Last” role allocation strategy and Multi-Agent Debate Consistency (MADC) strategy that uses path consistency to assess agreement among independent roles and simulates the role with highest consistency score as truth.

Result: MADC demonstrated advanced performance across 9 LLMs, including the DeepSeek-R1 Distilled Models, effectively overcoming MAD’s performance bottlenecks; the underlying “Truth Last” strategy improves MAD by up to 22% in reasoning tasks.

Conclusion: MADC provides a crucial pathway for further improvements in LLM agent scaling by systematically optimizing Multi-Agent Debate mechanisms.

Abstract: Recent studies on LLM agent scaling have highlighted the potential of Multi-Agent Debate (MAD) to enhance reasoning abilities. However, the critical aspect of role allocation strategies remains underexplored. In this study, we demonstrate that allocating roles with differing viewpoints to specific positions significantly impacts MAD’s performance in reasoning tasks. Specifically, we find a novel role allocation strategy, “Truth Last”, which can improve MAD performance by up to 22% in reasoning tasks. To address the issue of unknown truth in practical applications, we propose the Multi-Agent Debate Consistency (MADC) strategy, which systematically simulates and optimizes its core mechanisms. MADC incorporates path consistency to assess agreement among independent roles, simulating the role with the highest consistency score as the truth. We validated MADC across a range of LLMs (9 models), including the DeepSeek-R1 Distilled Models, on challenging reasoning tasks. MADC consistently demonstrated advanced performance, effectively overcoming MAD’s performance bottlenecks and providing a crucial pathway for further improvements in LLM agent scaling.
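
The idea of picking a stand-in for the truth by agreement and then placing it last can be sketched as follows. The abstract does not define how path consistency is computed, so this example substitutes a simple assumption: a role's score is the fraction of other roles whose final answer matches its own. The role names and answers are hypothetical.

```python
def consistency_scores(answers):
    """Score each role by the fraction of other roles that gave the same answer.

    answers: dict mapping role name -> final answer (a stand-in for a reasoning path).
    """
    roles = list(answers)
    scores = {}
    for role in roles:
        others = [o for o in roles if o != role]
        agree = sum(1 for o in others if answers[o] == answers[role])
        scores[role] = agree / len(others) if others else 0.0
    return scores

def truth_last_order(answers):
    """Treat the most consistent role as the truth proxy and move it to the last position."""
    scores = consistency_scores(answers)
    proxy = max(scores, key=scores.get)  # ties resolve to the first such role
    order = [r for r in answers if r != proxy] + [proxy]
    return order, proxy, scores

# Hypothetical independent answers from four debate roles.
answers = {"role_a": "42", "role_b": "17", "role_c": "42", "role_d": "42"}
order, proxy, scores = truth_last_order(answers)
```

Here three roles agree, so one of them becomes the proxy and speaks last, while the dissenting role keeps its earlier position.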

[315] Autonomous Vehicle Path Planning by Searching With Differentiable Simulation

Asen Nachkov, Jan-Nico Zaech, Danda Pani Paudel, Xi Wang, Luc Van Gool

Main category: cs.AI

TL;DR: DSS is a planning framework that uses differentiable simulation (Waymax) as both next-state predictor and critic, enabling gradient-based optimization of action sequences for autonomous driving.

Details

Motivation: Planning is crucial for safe autonomous driving to avoid collisions in complex traffic, but learning all components (policy, predictor, critic) is challenging.

Method: Uses differentiable simulator Waymax for accurate state predictions and gradient-based search across action sequences via gradient descent over imagined trajectories.

Result: DSS significantly improves tracking and path planning accuracy compared to sequence prediction, imitation learning, model-free RL, and other planning methods.

Conclusion: The combination of planning gradients and stochastic search in DSS provides effective planning for autonomous driving.

Abstract: Planning allows an agent to safely refine its actions before executing them in the real world. In autonomous driving, this is crucial to avoid collisions and navigate in complex, dense traffic scenarios. One way to plan is to search for the best action sequence. However, this is challenging when all necessary components (policy, next-state predictor, and critic) have to be learned. Here we propose Differentiable Simulation for Search (DSS), a framework that leverages the differentiable simulator Waymax as both a next-state predictor and a critic. It relies on the simulator’s hardcoded dynamics, making state predictions highly accurate, while utilizing the simulator’s differentiability to effectively search across action sequences. Our DSS agent optimizes its actions using gradient descent over imagined future trajectories. We show experimentally that DSS, the combination of planning gradients and stochastic search, significantly improves tracking and path planning accuracy compared to sequence prediction, imitation learning, model-free RL, and other planning methods.
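
Gradient descent over an imagined trajectory can be illustrated with a toy example. This sketch does not use Waymax: it assumes a hand-written one-dimensional dynamics model x_{t+1} = x_t + a_t * dt, a squared tracking error as the critic, and an analytic gradient, and it covers only the gradient part, not the stochastic search. All names and hyperparameters are hypothetical.

```python
def rollout(x0, actions, dt):
    """Toy differentiable dynamics: x_{t+1} = x_t + a_t * dt."""
    xs = []
    x = x0
    for a in actions:
        x = x + a * dt
        xs.append(x)
    return xs

def cost_and_grad(x0, actions, targets, dt):
    """Squared tracking cost and its exact gradient with respect to each action."""
    xs = rollout(x0, actions, dt)
    errors = [x - g for x, g in zip(xs, targets)]
    cost = sum(e * e for e in errors)
    # Action k shifts every state from index k onward, so its gradient is a
    # reverse cumulative sum of 2 * error * dt.
    grad = [0.0] * len(actions)
    running = 0.0
    for k in reversed(range(len(actions))):
        running += 2.0 * errors[k] * dt
        grad[k] = running
    return cost, grad

def plan(x0, targets, dt=1.0, lr=0.05, steps=500):
    """Refine an action sequence by gradient descent through the simulated rollout."""
    actions = [0.0] * len(targets)
    for _ in range(steps):
        _, grad = cost_and_grad(x0, actions, targets, dt)
        actions = [a - lr * g for a, g in zip(actions, grad)]
    return actions

# Hypothetical reference path to track, starting from position 0.
targets = [1.0, 2.0, 3.0, 4.0]
actions = plan(0.0, targets)
final_cost, _ = cost_and_grad(0.0, actions, targets, 1.0)
```

Because the dynamics are known and differentiable, no learned predictor or critic is needed in this toy setting; the optimizer simply moves each action toward the constant step that follows the reference path.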

[316] ARCTraj: A Dataset and Benchmark of Human Reasoning Trajectories for Abstract Problem Solving

Sejin Kim, Hayan Choi, Seokki Lee, Sundong Kim

Main category: cs.AI

TL;DR: ARCTraj is a dataset and framework for modeling human reasoning in ARC tasks by capturing temporally ordered object-level actions that reveal intermediate reasoning steps, enabling integration with various learning methods.

Details

Motivation: To address the limitation of static input-output supervision in ARC research by capturing how human reasoning unfolds over time through intermediate steps, providing insights into the reasoning process.

Method: Collected human trajectories via O2ARC web interface with object-level actions, timestamps, and success labels; defined unified reasoning pipeline with MDP formulation for integration with RL, generative modeling, and sequence modeling methods.

Result: Created dataset with ~10,000 trajectories across 400 training tasks from ARC-AGI-1 benchmark, revealing structured patterns in spatial selection, color attribution, and strategic convergence in human reasoning.

Conclusion: ARCTraj provides a structured foundation for studying human-like reasoning, advancing explainability, alignment, and generalizable intelligence through interpretable action trajectories.

Abstract: We present ARCTraj, a dataset and methodological framework for modeling human reasoning through complex visual tasks in the Abstraction and Reasoning Corpus (ARC). While ARC has inspired extensive research on abstract reasoning, most existing approaches rely on static input–output supervision, which limits insight into how reasoning unfolds over time. ARCTraj addresses this gap by recording temporally ordered, object-level actions that capture how humans iteratively transform inputs into outputs, revealing intermediate reasoning steps that conventional datasets overlook. Collected via the O2ARC web interface, it contains around 10,000 trajectories annotated with task identifiers, timestamps, and success labels across 400 training tasks from the ARC-AGI-1 benchmark. It further defines a unified reasoning pipeline encompassing data collection, action abstraction, Markov decision process (MDP) formulation, and downstream learning, enabling integration with reinforcement learning, generative modeling, and sequence modeling methods such as PPO, World Models, GFlowNets, Diffusion agents, and Decision Transformers. Analyses of spatial selection, color attribution, and strategic convergence highlight the structure and diversity of human reasoning. Together, these contributions position ARCTraj as a structured and interpretable foundation for studying human-like reasoning, advancing explainability, alignment, and generalizable intelligence.

[317] Satisficing and Optimal Generalised Planning via Goal Regression (Extended Version)

Dillon Z. Chen, Till Hofmann, Toryn Q. Klassen, Sheila A. McIlraith

Main category: cs.AI

TL;DR: A novel method for generalised planning that learns Condition→Actions rules from optimal plans of training problems, which can be executed directly or used to prune search space.

Details

Motivation: To develop a simple yet effective approach for synthesising programs that solve families of related planning problems, improving upon existing generalised planning methods.

Method: For each training problem, compute optimal plans for goal atoms in order, perform goal regression on the plans, and lift outputs to obtain first-order Condition→Actions rules that form a generalised plan.

Result: Experiments show significant improvements over state-of-the-art planners in synthesis cost, planning coverage, and solution quality across classical and numeric planning domains.

Conclusion: The method effectively learns valid generalised plans and pruning axioms, demonstrating practical advantages in multiple planning metrics compared to existing approaches.

Abstract: Generalised planning (GP) refers to the task of synthesising programs that solve families of related planning problems. We introduce a novel, yet simple method for GP: given a set of training problems, for each problem, compute an optimal plan for each goal atom in some order, perform goal regression on the resulting plans, and lift the corresponding outputs to obtain a set of first-order $\textit{Condition} \rightarrow \textit{Actions}$ rules. The rules collectively constitute a generalised plan that can be executed as is or alternatively be used to prune the planning search space. We formalise and prove the conditions under which our method is guaranteed to learn valid generalised plans and state space pruning axioms for search. Experiments demonstrate significant improvements over state-of-the-art (generalised) planners with respect to the 3 metrics of synthesis cost, planning coverage, and solution quality on various classical and numeric planning domains.
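
The goal regression step can be sketched for ground STRIPS-style actions. This is a minimal illustration under assumptions: actions have precondition, add, and delete sets, regression uses the textbook rule (condition minus add effects, plus preconditions, undefined if the action deletes part of the condition), and the lifting of ground atoms to first-order rules is omitted. The action and atom names are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Tuple

class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]

def regress(condition: FrozenSet[str], action: Action) -> Optional[FrozenSet[str]]:
    """Weakest condition that must hold before `action` so `condition` holds after it."""
    if condition & action.delete:
        return None  # the action would destroy part of the condition
    return (condition - action.add) | action.pre

def regress_plan(goal: FrozenSet[str], plan: List[Action]) -> List[Tuple[FrozenSet[str], List[str]]]:
    """Walk the plan backwards, emitting one (condition, remaining actions) rule per step."""
    rules = []
    condition = frozenset(goal)
    for i in range(len(plan) - 1, -1, -1):
        regressed = regress(condition, plan[i])
        if regressed is None:
            raise ValueError("plan step deletes a required atom")
        condition = regressed
        rules.append((condition, [a.name for a in plan[i:]]))
    return rules

# Hypothetical two-step blocks-world plan for the goal atom on_a_b.
pickup = Action("pickup_a",
                frozenset({"clear_a", "ontable_a", "handempty"}),
                frozenset({"holding_a"}),
                frozenset({"clear_a", "ontable_a", "handempty"}))
stack = Action("stack_a_b",
               frozenset({"holding_a", "clear_b"}),
               frozenset({"on_a_b", "clear_a", "handempty"}),
               frozenset({"holding_a", "clear_b"}))
rules = regress_plan(frozenset({"on_a_b"}), [pickup, stack])
```

Each resulting pair reads as a ground Condition → Actions rule: if the condition holds in a state, executing the listed actions achieves the goal atom.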

[318] GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models

Jingxuan Wei, Caijun Jia, Xi Bai, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Lijun Wu, Cheng Tan

Main category: cs.AI

TL;DR: GGBench is a new benchmark for evaluating geometric generative reasoning in Unified Multimodal Models, addressing the gap in assessing integrated cognitive processes of understanding and active construction.

Details

Motivation: Existing benchmarks fail to measure the integrated cognitive process of generative reasoning in UMMs, focusing only on discriminative understanding or unconstrained image generation separately.

Method: Proposes geometric construction as an ideal testbed that inherently demands fusion of language comprehension and precise visual generation, and introduces GGBench benchmark.

Result: GGBench provides a comprehensive framework for systematically diagnosing models’ ability to understand, reason, and actively construct solutions in geometric contexts.

Conclusion: Geometric construction sets a more rigorous standard for evaluating the next generation of intelligent systems by measuring integrated generative reasoning capabilities.

Abstract: The advent of Unified Multimodal Models (UMMs) signals a paradigm shift in artificial intelligence, moving from passive perception to active, cross-modal generation. Despite their unprecedented ability to synthesize information, a critical gap persists in evaluation: existing benchmarks primarily assess discriminative understanding or unconstrained image generation separately, failing to measure the integrated cognitive process of generative reasoning. To bridge this gap, we propose that geometric construction provides an ideal testbed as it inherently demands a fusion of language comprehension and precise visual generation. We introduce GGBench, a benchmark designed specifically to evaluate geometric generative reasoning. It provides a comprehensive framework for systematically diagnosing a model’s ability to not only understand and reason but to actively construct a solution, thereby setting a more rigorous standard for the next generation of intelligent systems. Project website: https://opendatalab-raiser.github.io/GGBench/.

[319] STaR: Towards Cognitive Table Reasoning via Slow-Thinking Large Language Models

Huajian Zhang, Mingyue Cheng, Yucong Luo, Xiaoyu Tao

Main category: cs.AI

TL;DR: STaR is a slow-thinking framework that enhances LLMs for table reasoning by modeling step-by-step thinking and uncertainty-aware inference, achieving superior performance and stability through difficulty-aware reinforcement learning and trajectory-level uncertainty quantification.

Details

Motivation: Current LLM-based table reasoning lacks the depth and iterative refinement of human cognition and suffers from instability, compromising reliability in downstream applications.

Method: Two-stage difficulty-aware reinforcement learning (DRL) progressively learns from simple to complex queries under composite reward; inference uses trajectory-level uncertainty quantification integrating token-level confidence and answer consistency.

Result: Achieves superior performance and enhanced reasoning stability on benchmarks, with strong generalization over out-of-domain datasets.

Conclusion: STaR demonstrates potential as a reliable and cognitively inspired solution for table reasoning with LLMs, providing more credible reasoning paths through slow-thinking capabilities.

Abstract: Table reasoning with large language models (LLMs) is a fundamental path toward building intelligent systems that can understand and analyze structured data. While recent progress has shown promising results, existing methods still suffer from two key limitations: (i) the reasoning processes lack the depth and iterative refinement characteristic of human cognition; and (ii) the reasoning processes exhibit instability, which compromises their reliability in downstream applications. In this work, we present STaR (slow-thinking for table reasoning), a new framework achieving cognitive table reasoning, in which LLMs are equipped with slow-thinking capabilities by explicitly modeling step-by-step thinking and uncertainty-aware inference. During training, STaR employs two-stage difficulty-aware reinforcement learning (DRL), progressively learning from simple to complex queries under a composite reward. During inference, STaR performs trajectory-level uncertainty quantification by integrating token-level confidence and answer consistency, enabling selection of more credible reasoning paths. Extensive experiments on benchmarks demonstrate that STaR achieves superior performance and enhanced reasoning stability. Moreover, strong generalization over out-of-domain datasets further demonstrates STaR’s potential as a reliable and cognitively inspired solution for table reasoning with LLMs.
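
One way to picture the inference-time selection described above is to score each sampled reasoning trajectory by blending a token-level confidence with an answer-consistency term. The sketch below is an assumption-laden illustration, not STaR's actual formula: the geometric-mean confidence, the linear blend, the `alpha` weight, and the function names are all hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Trajectory = Tuple[Sequence[float], str]  # (token log-probabilities, final answer)


def trajectory_score(token_logprobs: Sequence[float], answer: str,
                     all_answers: List[str], alpha: float = 0.5) -> float:
    """Blend token-level confidence with answer consistency (hypothetical weighting)."""
    # Token-level confidence: geometric mean of the token probabilities.
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    # Answer consistency: share of sampled trajectories that reach the same answer.
    consistency = Counter(all_answers)[answer] / len(all_answers)
    return alpha * confidence + (1.0 - alpha) * consistency


def select_trajectory(trajectories: List[Trajectory], alpha: float = 0.5) -> Trajectory:
    """Return the sampled trajectory with the highest blended score."""
    answers = [answer for _, answer in trajectories]
    return max(
        trajectories,
        key=lambda t: trajectory_score(t[0], t[1], answers, alpha),
    )
```

With `alpha = 1.0` the selection relies on token confidence alone; with `alpha = 0.0` it reduces to a majority vote over final answers.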

[320] UAVBench: An Open Benchmark Dataset for Autonomous and Agentic AI UAV Systems via LLM-Generated Flight Scenarios

Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah

Main category: cs.AI

TL;DR: UAVBench is a comprehensive benchmark dataset for evaluating LLM reasoning in autonomous aerial systems, featuring 50,000 validated flight scenarios and 50,000 multiple-choice questions across ten cognitive domains.

Details

Motivation: Address the lack of standardized and physically grounded benchmarks for systematically evaluating LLM reasoning capabilities in autonomous aerial systems, which increasingly rely on LLMs for mission planning and decision-making.

Method: Created UAVBench through taxonomy-guided LLM prompting and multi-stage safety validation, encoding scenarios in structured JSON with mission objectives, vehicle configuration, environmental conditions, and risk labels. Extended to UAVBench_MCQ with multiple-choice questions spanning ten reasoning styles.

Result: Evaluated 32 state-of-the-art LLMs and found strong performance in perception and policy reasoning but persistent challenges in ethics-aware and resource-constrained decision-making.

Conclusion: UAVBench establishes a reproducible foundation for benchmarking agentic AI in autonomous aerial systems and advancing next-generation UAV reasoning intelligence, with all materials released for open science.

Abstract: Autonomous aerial systems increasingly rely on large language models (LLMs) for mission planning, perception, and decision-making, yet the lack of standardized and physically grounded benchmarks limits systematic evaluation of their reasoning capabilities. To address this gap, we introduce UAVBench, an open benchmark dataset comprising 50,000 validated UAV flight scenarios generated through taxonomy-guided LLM prompting and multi-stage safety validation. Each scenario is encoded in a structured JSON schema that includes mission objectives, vehicle configuration, environmental conditions, and quantitative risk labels, providing a unified representation of UAV operations across diverse domains. Building on this foundation, we present UAVBench_MCQ, a reasoning-oriented extension containing 50,000 multiple-choice questions spanning ten cognitive and ethical reasoning styles, ranging from aerodynamics and navigation to multi-agent coordination and integrated reasoning. This framework enables interpretable and machine-checkable assessment of UAV-specific cognition under realistic operational contexts. We evaluate 32 state-of-the-art LLMs, including GPT-5, ChatGPT-4o, Gemini 2.5 Flash, DeepSeek V3, Qwen3 235B, and ERNIE 4.5 300B, and find strong performance in perception and policy reasoning but persistent challenges in ethics-aware and resource-constrained decision-making. UAVBench establishes a reproducible and physically grounded foundation for benchmarking agentic AI in autonomous aerial systems and advancing next-generation UAV reasoning intelligence. To support open science and reproducibility, we release the UAVBench dataset, the UAVBench_MCQ benchmark, evaluation scripts, and all related materials on GitHub at https://github.com/maferrag/UAVBench

[321] AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery

Yuqi Yin, Yibo Fu, Siyuan Wang, Peng Sun, Hongyu Wang, Xiaohui Wang, Lei Zheng, Zhiyong Li, Zhirong Liu, Jianji Wang, Zhaoxi Sun

Main category: cs.AI

TL;DR: AIonopedia is the first LLM agent for Ionic Liquid discovery, using an LLM-augmented multimodal domain foundation model to enable accurate property predictions and hierarchical search for molecular screening and design, validated through real-world wet-lab testing.

Details

Motivation: The discovery of novel Ionic Liquids is hindered by challenges in property prediction including limited data, poor model accuracy, and fragmented workflows.

Method: Leveraging Large Language Models with an LLM-augmented multimodal domain foundation model for ILs, incorporating hierarchical search architecture for molecular screening and design, trained on a newly curated comprehensive IL dataset.

Result: The model delivers superior performance, can perform effective IL modification, and demonstrated exceptional generalization capabilities on challenging out-of-distribution tasks in real-world wet-lab validation.

Conclusion: AIonopedia has the ability to accelerate real-world IL discovery through its accurate property predictions and practical efficacy demonstrated in laboratory validation.

Abstract: The discovery of novel Ionic Liquids (ILs) is hindered by critical challenges in property prediction, including limited data, poor model accuracy, and fragmented workflows. Leveraging the power of Large Language Models (LLMs), we introduce AIonopedia, to the best of our knowledge, the first LLM agent for IL discovery. Powered by an LLM-augmented multimodal domain foundation model for ILs, AIonopedia enables accurate property predictions and incorporates a hierarchical search architecture for molecular screening and design. Trained and evaluated on a newly curated and comprehensive IL dataset, our model delivers superior performance. Complementing these results, evaluations on literature-reported systems indicate that the agent can perform effective IL modification. Moving beyond offline tests, the practical efficacy was further confirmed through real-world wet-lab validation, in which the agent demonstrated exceptional generalization capabilities on challenging out-of-distribution tasks, underscoring its ability to accelerate real-world IL discovery.

[322] A Workflow for Full Traceability of AI Decisions

Julius Wenzel, Syeda Umaima Alam, Andreas Schmidt, Hanwei Zhang, Holger Hermanns

Main category: cs.AI

TL;DR: This paper presents a workflow for generating tamper-proof, verifiable traces of AI decisions to address the lack of documentation in automated systems that make high-stakes decisions.

Details

Motivation: The increasing use of brittle AI systems in high-stakes decisions poses substantial risks of harm to people’s well-being and fundamental rights, with current systems lacking proper documentation that could establish responsibility chains in legal contexts.

Method: The paper enforces documentation of every component in AI training and inference processes, expanding the DBOM concept into a running workflow using confidential computing technology to create tamper-proof and verifiable decision traces.

Result: The authors demonstrate a working workflow through a mushroom classification app example, showing how to generate exhaustive traces of AI decisions that can withstand legal scrutiny.

Conclusion: This approach provides a practical solution for establishing traceability and accountability in AI decision-making systems, enabling proper documentation that can be used in legal proceedings when AI decisions violate laws.

Abstract: An ever-increasing number of high-stakes decisions are made or assisted by automated systems employing brittle artificial intelligence technology. There is a substantial risk that some of these decisions induce harm to people, by infringing their well-being or their fundamental human rights. The state of the art in AI systems makes little effort with respect to appropriate documentation of the decision process. This obstructs the ability to trace what went into a decision, which in turn is a prerequisite to any attempt at reconstructing a responsibility chain. Specifically, such traceability is linked to a documentation that will stand up in court when determining the cause of some AI-based decision that inadvertently or intentionally violates the law. This paper takes a radical, yet practical, approach to this problem, by enforcing the documentation of each and every component that goes into the training or inference of an automated decision. As such, it presents the first running workflow supporting the generation of tamper-proof, verifiable and exhaustive traces of AI decisions. In doing so, we expand the DBOM concept into an effective running workflow leveraging confidential computing technology. We demonstrate the inner workings of the workflow in the development of an app to tell poisonous and edible mushrooms apart, meant as a playful example of high-stakes decision support.

[323] Can You Tell the Difference? Contrastive Explanations for ABox Entailments

Patrick Koopmann, Yasir Mahmood, Axel-Cyrille Ngonga Ngomo, Balram Tiwari

Main category: cs.AI

TL;DR: Introduces contrastive ABox explanations to answer ‘Why is a an instance of C, but b is not?’ by considering both positive and missing entailments simultaneously.

Details

Motivation: Existing approaches explain positive entailments or missing entailments separately, but contrastive explanations consider both together to highlight relevant commonalities and differences between instances.

Method: Developed notion of contrastive explanations for ABox reasoning with description logic ontologies, analyzed computational complexity for different variants under optimality criteria across various description logics.

Result: Implemented a method for computing one variant of contrastive explanations and evaluated it on generated problems for realistic knowledge bases.

Conclusion: Contrastive ABox explanations provide a novel approach to explain instance classification differences by simultaneously considering both positive and negative entailments.

Abstract: We introduce the notion of contrastive ABox explanations to answer questions of the type “Why is a an instance of C, but b is not?”. While there are various approaches for explaining positive entailments (why is C(a) entailed by the knowledge base) as well as missing entailments (why is C(b) not entailed) in isolation, contrastive explanations consider both at the same time, which allows them to focus on the relevant commonalities and differences between a and b. We develop an appropriate notion of contrastive explanations for the special case of ABox reasoning with description logic ontologies, and analyze the computational complexity for different variants under different optimality criteria, considering lightweight as well as more expressive description logics. We implemented a first method for computing one variant of contrastive explanations, and evaluated it on generated problems for realistic knowledge bases.

[324] EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment

Ruoxi Cheng, Haoxuan Ma, Teng Ma, Hongyi Zhang

Main category: cs.AI

TL;DR: EcoAlign is an inference-time framework that treats LVLM alignment as an economic efficiency problem, using forward-looking cost-benefit analysis to balance safety, utility, and computational costs during reasoning.

Details

Motivation: Current LVLM alignment methods struggle with trade-offs between safety, utility, and operational costs, and suffer from process-blindness that wastes computational budget on unsafe deliberation while allowing harmful reasoning to be disguised with benign justifications.

Method: EcoAlign reframes alignment as an economically rational search by treating the LVLM as a boundedly rational agent. It incrementally expands a thought graph and scores actions using a forward-looking function (analogous to net present value) that dynamically weighs expected safety, utility, and cost against the remaining budget, with path safety enforced via the weakest-link principle.

Result: Extensive experiments across 3 closed-source and 2 open-source models on 6 datasets show EcoAlign matches or surpasses state-of-the-art safety and utility at lower computational cost.

Conclusion: EcoAlign offers a principled, economical pathway to robust LVLM alignment by addressing the fundamental economic efficiency challenge in alignment while preventing deception through weakest-link safety enforcement.

Abstract: Large Vision-Language Models (LVLMs) exhibit powerful reasoning capabilities but suffer sophisticated jailbreak vulnerabilities. Fundamentally, aligning LVLMs is not just a safety challenge but a problem of economic efficiency. Current alignment methods struggle with the trade-off between safety, utility, and operational costs. Critically, a focus solely on final outputs (process-blindness) wastes significant computational budget on unsafe deliberation. This flaw allows harmful reasoning to be disguised with benign justifications, thereby circumventing simple additive safety scores. To address this, we propose EcoAlign, an inference-time framework that reframes alignment as an economically rational search by treating the LVLM as a boundedly rational agent. EcoAlign incrementally expands a thought graph and scores actions using a forward-looking function (analogous to net present value) that dynamically weighs expected safety, utility, and cost against the remaining budget. To prevent deception, path safety is enforced via the weakest-link principle. Extensive experiments across 3 closed-source and 2 open-source models on 6 datasets show that EcoAlign matches or surpasses state-of-the-art safety and utility at a lower computational cost, thereby offering a principled, economical pathway to robust LVLM alignment.
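
The weakest-link principle and the budget-aware scoring described above can be sketched in a few lines. This is an illustrative guess at the shape of such a score, not EcoAlign's actual function: the weights, the cost-to-budget ratio, and the function names are assumptions.

```python
from typing import List


def path_safety(step_safeties: List[float]) -> float:
    """Weakest-link rule: a reasoning path is only as safe as its least safe step."""
    return min(step_safeties)


def action_score(path_safeties: List[float], expected_safety: float,
                 expected_utility: float, step_cost: float,
                 remaining_budget: float,
                 w_safety: float = 1.0, w_utility: float = 1.0) -> float:
    """Hypothetical forward-looking score for expanding the thought graph by one action."""
    if step_cost > remaining_budget:
        # The action is unaffordable under the remaining budget.
        return float("-inf")
    safety = path_safety(path_safeties + [expected_safety])
    # Cost is weighed against the remaining budget: the scarcer the budget,
    # the more the same step costs in relative terms.
    cost_penalty = step_cost / remaining_budget
    return w_safety * safety + w_utility * expected_utility - cost_penalty
```

Taking the minimum rather than an average is what keeps one unsafe step from being hidden behind benign ones: the step safeties `[0.9, 0.2, 0.95]` average to about 0.68, but the weakest-link value is 0.2.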

Yitian Kou, Yihe Gu, Chen Zhou, DanDan Zhu, Shuguang Kuai

Main category: cs.AI

TL;DR: RLSLM is a hybrid reinforcement learning framework that combines rule-based social locomotion models with data-driven learning to enable socially-aware navigation that minimizes human discomfort while maintaining efficiency.

Details

Motivation: To bridge the gap between interpretable but inflexible rule-based approaches and flexible but opaque data-driven methods for social navigation, creating agents that can navigate human-populated environments without causing discomfort.

Method: Integrates a rule-based Social Locomotion Model (grounded in empirical behavioral experiments) into RL reward function, generating orientation-sensitive social comfort fields that quantify human comfort across space.

Result: Outperforms state-of-the-art rule-based models in user experience, achieves socially aligned navigation with minimal training, and shows significantly improved interpretability over conventional data-driven methods.

Conclusion: RLSLM presents a scalable, human-centered methodology that effectively integrates cognitive science and machine learning for real-world social navigation, balancing mechanical efficiency with social comfort.

Abstract: Navigating human-populated environments without causing discomfort is a critical capability for socially-aware agents. While rule-based approaches offer interpretability through predefined psychological principles, they often lack generalizability and flexibility. Conversely, data-driven methods can learn complex behaviors from large-scale datasets, but are typically inefficient, opaque, and difficult to align with human intuitions. To bridge this gap, we propose RLSLM, a hybrid Reinforcement Learning framework that integrates a rule-based Social Locomotion Model, grounded in empirical behavioral experiments, into the reward function of a reinforcement learning framework. The social locomotion model generates an orientation-sensitive social comfort field that quantifies human comfort across space, enabling socially aligned navigation policies with minimal training. RLSLM then jointly optimizes mechanical energy and social comfort, allowing agents to avoid intrusions into personal or group space. A human-agent interaction experiment using an immersive VR-based setup demonstrates that RLSLM outperforms state-of-the-art rule-based models in user experience. Ablation and sensitivity analyses further show the model’s significantly improved interpretability over conventional data-driven methods. This work presents a scalable, human-centered methodology that effectively integrates cognitive science and machine learning for real-world social navigation.
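
To make the idea of an orientation-sensitive comfort field in a reward function concrete, the sketch below uses an asymmetric Gaussian that extends further in front of a person than behind or beside them. The field shape, the `sigma` values, the step-length proxy for mechanical energy, and the weight `w_social` are all assumptions for illustration; the paper's Social Locomotion Model is fitted to behavioral experiments and is not reproduced here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def discomfort(agent_xy: Point, human_xy: Point, human_heading: float,
               sigma_front: float = 1.5, sigma_side: float = 0.8) -> float:
    """Hypothetical discomfort in [0, 1] caused by an agent near a human."""
    dx = agent_xy[0] - human_xy[0]
    dy = agent_xy[1] - human_xy[1]
    # Express the offset in the human's frame: `fwd` along the heading, `side` across it.
    fwd = dx * math.cos(human_heading) + dy * math.sin(human_heading)
    side = -dx * math.sin(human_heading) + dy * math.cos(human_heading)
    # The field reaches further in front of the person than behind them.
    sigma_fwd = sigma_front if fwd >= 0 else sigma_side
    return math.exp(-0.5 * ((fwd / sigma_fwd) ** 2 + (side / sigma_side) ** 2))


def reward(step_length: float, agent_xy: Point,
           humans: List[Tuple[Point, float]], w_social: float = 1.0) -> float:
    """Joint objective: penalize energy use (step-length proxy) and social discomfort."""
    social = sum(discomfort(agent_xy, xy, heading) for xy, heading in humans)
    return -step_length - w_social * social
```

Under these assumed parameters, standing one metre in front of a person is penalized more than standing one metre behind them, which is the kind of orientation sensitivity the abstract refers to.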

[326] KarmaTS: A Universal Simulation Platform for Multivariate Time Series with Functional Causal Dynamics

Haixin Li, Yanke Li, Diego Paez-Granados

Main category: cs.AI

TL;DR: KarmaTS is an interactive framework for building lag-indexed, executable spatiotemporal causal graphical models to simulate multivariate time series with known causal dynamics.

Details

Motivation: Addresses the challenge of access-restricted physiological data by generating synthetic MTS with known causal dynamics and augmenting real-world datasets with expert knowledge.

Method: Constructs discrete-time structural causal processes (DSCP) through mixed-initiative human-in-the-loop workflow combining expert knowledge and algorithmic proposals. Handles mixed variable types, contemporaneous/lagged edges, and modular edge functionals from parameterizable templates to neural networks.

Result: The resulting DSCP supports simulation and causal interventions, including those under user-specified distribution shifts.

Conclusion: Enables flexible validation and benchmarking of causal discovery algorithms through expert-informed simulation.

Abstract: We introduce KarmaTS, an interactive framework for constructing lag-indexed, executable spatiotemporal causal graphical models for multivariate time series (MTS) simulation. Motivated by the challenge of access-restricted physiological data, KarmaTS generates synthetic MTS with known causal dynamics and augments real-world datasets with expert knowledge. The system constructs a discrete-time structural causal process (DSCP) by combining expert knowledge and algorithmic proposals in a mixed-initiative, human-in-the-loop workflow. The resulting DSCP supports simulation and causal interventions, including those under user-specified distribution shifts. KarmaTS handles mixed variable types, contemporaneous and lagged edges, and modular edge functionals ranging from parameterizable templates to neural network models. Together, these features enable flexible validation and benchmarking of causal discovery algorithms through expert-informed simulation.
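
A discrete-time structural causal process with lagged and contemporaneous edges can be simulated with a short loop, which also shows what an intervention means in this setting. The two-variable graph, the coefficients, the `tanh` edge function, and the `simulate` signature below are assumptions chosen for illustration and are unrelated to KarmaTS's actual API.

```python
import math
import random
from typing import List, Optional, Tuple


def simulate(n_steps: int, intervene_x: Optional[float] = None,
             seed: int = 0) -> Tuple[List[float], List[float]]:
    """Simulate a toy two-variable process X -> Y with known causal structure."""
    rng = random.Random(seed)
    xs, ys = [0.0], [0.0]  # initial conditions at t = 0
    for t in range(1, n_steps):
        # Lagged self-edge: X_t depends on X_{t-1}.
        x = 0.8 * xs[t - 1] + rng.gauss(0.0, 0.1)
        if intervene_x is not None:
            # Intervention do(X_t = value): cut X's incoming edges and fix its value.
            x = intervene_x
        # Y_t has a contemporaneous edge from X_t and a nonlinear lagged edge from X_{t-1}.
        y = 0.5 * x + math.tanh(xs[t - 1]) + rng.gauss(0.0, 0.1)
        xs.append(x)
        ys.append(y)
    return xs, ys
```

Because the generating graph is known, the output of `simulate` can serve as ground truth when checking whether a causal discovery algorithm recovers the X-to-Y edges, with and without the intervention.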

[327] MarsRL: Advancing Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism

Shulin Liu, Dong Du, Tao Yang, Yang Li, Boyu Qiu

Main category: cs.AI

TL;DR: MarsRL is a reinforcement learning framework that improves multi-agent reasoning systems by jointly optimizing all agents through agent-specific rewards and pipeline parallelism, achieving significant performance gains on mathematical reasoning benchmarks.

Details

Motivation: Current multi-agent reasoning systems work well with closed-source models but struggle with open-source models due to insufficient critic and correction capabilities, limiting their generalization potential.

Method: Proposes MarsRL framework with agentic pipeline parallelism, agent-specific reward mechanisms to reduce reward noise, and pipeline-inspired training for efficient handling of long trajectories.

Result: Applied to Qwen3-30B-A3B-Thinking-2507, improved AIME2025 accuracy from 86.5% to 93.3% and BeyondAIME from 64.9% to 73.8%, even surpassing larger Qwen3-235B-A22B-Thinking-2507 model.

Conclusion: MarsRL demonstrates strong potential to advance multi-agent reasoning systems and broaden their applicability across diverse reasoning tasks by effectively optimizing all agents in the system.

Abstract: Recent progress in large language models (LLMs) has been propelled by reinforcement learning with verifiable rewards (RLVR) and test-time scaling. However, the limited output length of LLMs constrains the depth of reasoning attainable in a single inference process. Multi-agent reasoning systems offer a promising alternative by employing multiple agents including Solver, Verifier, and Corrector, to iteratively refine solutions. While effective in closed-source models like Gemini 2.5 Pro, they struggle to generalize to open-source models due to insufficient critic and correction capabilities. To address this, we propose MarsRL, a novel reinforcement learning framework with agentic pipeline parallelism, designed to jointly optimize all agents in the system. MarsRL introduces agent-specific reward mechanisms to mitigate reward noise and employs pipeline-inspired training to enhance efficiency in handling long trajectories. Applied to Qwen3-30B-A3B-Thinking-2507, MarsRL improves AIME2025 accuracy from 86.5% to 93.3% and BeyondAIME from 64.9% to 73.8%, even surpassing Qwen3-235B-A22B-Thinking-2507. These findings highlight the potential of MarsRL to advance multi-agent reasoning systems and broaden their applicability across diverse reasoning tasks.

[328] Robust and Efficient Communication in Multi-Agent Reinforcement Learning

Zejiao Liu, Yi Li, Jiali Wang, Junqi Tu, Yitian Hong, Fangfei Li, Yang Liu, Toshiharu Sugawara, Yang Tang

Main category: cs.AI

TL;DR: Survey on robust and efficient communication strategies for multi-agent reinforcement learning under realistic constraints like message perturbations, delays, and bandwidth limits.

Details

Motivation: Existing MARL approaches assume ideal communication conditions (instantaneous, reliable, unlimited bandwidth) that are unrealistic for real-world deployments, creating a gap between theory and practice.

Method: Systematic review of recent advances in communication strategies for MARL, focusing on three key applications: cooperative autonomous driving, distributed SLAM, and federated learning.

Result: Identifies the need for robust communication approaches that handle real-world constraints and highlights central challenges in practical MARL systems.

Conclusion: Advocates for a unified co-design approach that integrates communication, learning, and robustness to bridge the gap between theoretical MARL models and practical implementations.

Abstract: Multi-agent reinforcement learning (MARL) has made significant strides in enabling coordinated behaviors among autonomous agents. However, most existing approaches assume that communication is instantaneous, reliable, and has unlimited bandwidth; these conditions are rarely met in real-world deployments. This survey systematically reviews recent advances in robust and efficient communication strategies for MARL under realistic constraints, including message perturbations, transmission delays, and limited bandwidth. Furthermore, because the challenges of low-latency reliability, bandwidth-intensive data sharing, and communication-privacy trade-offs are central to practical MARL systems, we focus on three applications involving cooperative autonomous driving, distributed simultaneous localization and mapping, and federated learning. Finally, we identify key open challenges and future research directions, advocating a unified approach that co-designs communication, learning, and robustness to bridge the gap between theoretical MARL models and practical implementations.

[329] CURENet: Combining Unified Representations for Efficient Chronic Disease Prediction

Cong-Tinh Dao, Nguyen Minh Thao Phan, Jun-En Ding, Chenwei Wu, David Restrepo, Dongsheng Luo, Fanyi Zhao, Chun-Chieh Liao, Wen-Chih Peng, Chi-Te Wang, Pei-Fu Chen, Ling Chen, Xinglong Ju, Feng Liu, Fang-Ming Hung

Main category: cs.AI

TL;DR: CURENet is a multimodal model that integrates clinical notes, lab tests, and time-series visit data using LLMs and transformers to predict chronic diseases with over 94% accuracy.

Details

Motivation: Most predictive models fail to capture interactions and temporal patterns across multiple EHR data modalities, focusing on single data types and overlooking complexities in clinical decision-making.

Method: CURENet uses large language models for clinical text processing and textual lab tests, and transformer encoders for longitudinal sequential visits to integrate multimodal EHR data.

Result: Achieved over 94% accuracy in predicting top 10 chronic conditions on MIMIC-III and FEMH datasets in a multi-label framework.

Conclusion: Multimodal EHR integration through CURENet enhances clinical decision-making and improves patient outcomes by capturing intricate interactions between different clinical data forms.

Abstract: Electronic health records (EHRs) are designed to synthesize diverse data types, including unstructured clinical notes, structured lab tests, and time-series visit data. Physicians draw on these multimodal and temporal sources of EHR data to form a comprehensive view of a patient’s health, which is crucial for informed therapeutic decision-making. Yet, most predictive models fail to fully capture the interactions, redundancies, and temporal patterns across multiple data modalities, often focusing on a single data type or overlooking these complexities. In this paper, we present CURENet, a multimodal model (Combining Unified Representations for Efficient chronic disease prediction) that integrates unstructured clinical notes, lab tests, and patients’ time-series data by utilizing large language models (LLMs) for clinical text processing and textual lab tests, as well as transformer encoders for longitudinal sequential visits. CURENet is capable of capturing the intricate interactions between different forms of clinical data and creating a more reliable predictive model for chronic illnesses. We evaluated CURENet using the public MIMIC-III and private FEMH datasets, where it achieved over 94% accuracy in predicting the top 10 chronic conditions in a multi-label framework. Our findings highlight the potential of multimodal EHR integration to enhance clinical decision-making and improve patient outcomes.

[330] Experience-Guided Adaptation of Inference-Time Reasoning Strategies

Adam Stein, Matthew Trager, Benjamin Bowman, Michael Kleinman, Aditya Chattopadhyay, Wei Xia, Stefano Soatto

Main category: cs.AI

TL;DR: EGuR is an AI system that dynamically generates tailored computational strategies at inference time using accumulated experience, enabling flexible adaptation of all strategy components without offline optimization.

Details

Motivation: Existing AI systems either only modify textual inputs (limiting adaptation) or require offline optimization (remaining static after deployment), creating a gap for systems that can flexibly adapt all strategy components during inference.

Method: EGuR uses an LLM-based meta-strategy with two components: a Guide that generates candidate strategies based on current problem and past experiences, and a Consolidator that integrates execution feedback to improve future strategy generation.

Result: Across five challenging benchmarks, EGuR achieved up to 14% accuracy improvements over strongest baselines while reducing computational costs by up to 111x, with both metrics improving as the system gains experience.

Conclusion: EGuR demonstrates that dynamically generating complete computational strategies at inference time enables significant performance improvements and cost reductions, with adaptation capabilities that improve with accumulated experience.

Abstract: Enabling agentic AI systems to adapt their problem-solving approaches based on post-training interactions remains a fundamental challenge. While systems that update and maintain a memory at inference time have been proposed, existing designs only steer the system by modifying textual input to a language model or agent, which means that they cannot change sampling parameters, remove tools, modify system prompts, or switch between agentic and workflow paradigms. On the other hand, systems that adapt more flexibly require offline optimization and remain static once deployed. We present Experience-Guided Reasoner (EGuR), which generates tailored strategies – complete computational procedures involving LLM calls, tools, sampling parameters, and control logic – dynamically at inference time based on accumulated experience. We achieve this using an LLM-based meta-strategy – a strategy that outputs strategies – enabling adaptation of all strategy components (prompts, sampling parameters, tool configurations, and control logic). EGuR operates through two components: a Guide generates multiple candidate strategies conditioned on the current problem and structured memory of past experiences, while a Consolidator integrates execution feedback to improve future strategy generation. This produces complete, ready-to-run strategies optimized for each problem, which can be cached, retrieved, and executed as needed without wasting resources. Across five challenging benchmarks (AIME 2025, 3-SAT, and three Big Bench Extra Hard tasks), EGuR achieves up to 14% accuracy improvements over the strongest baselines while reducing computational costs by up to 111x, with both metrics improving as the system gains experience.

[331] Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping

Dena Mujtaba, Brian Hu, Anthony Hoogs, Arslan Basharat

Main category: cs.AI

TL;DR: A test-time alignment technique using model-guided policy shaping to control AI agent behavior without retraining, evaluated on ethical decision-making in text-based games.

Details

Motivation: AI agents trained for reward maximization may adopt harmful behaviors, creating a trade-off between objectives and ethical alignment. Retraining pre-trained agents is costly, especially with diverse ethical values.

Method: Model-guided policy shaping applied at test time using scenario-action attribute classifiers to align decisions with ethical attributes, without requiring agent retraining.

Result: Effective mitigation of unethical behavior across diverse environments and alignment attributes, outperforming prior training-time methods and general-purpose agents.

Conclusion: Test-time policy shaping provides a scalable solution for maintaining AI agent alignment with ethical values while balancing reward maximization.

Abstract: The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining the alignment. For the pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between ethical alignment and reward maximization without requiring agent retraining. We evaluate our approach using the MACHIAVELLI benchmark, which comprises 134 text-based game environments and thousands of annotated scenarios involving ethical decisions. The RL agents are first trained to maximize the reward in their respective games. At test time, we apply policy shaping via scenario-action attribute classifiers to ensure decision alignment with ethical attributes. We compare our approach against prior training-time methods and general-purpose agents, as well as study several types of ethical violations and power-seeking behavior. Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior across diverse environments and alignment attributes.
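Test-time policy shaping can be pictured as re-weighting the agent's action distribution with classifier scores. The sketch below is illustrative only: the interpolation rule, the `alpha` trade-off parameter, and the function name `shape_policy` are assumptions made for this example, not the formulation used in the paper.

```python
def shape_policy(policy_probs, violation_probs, alpha=0.5):
    """Blend an agent's action distribution with a classifier-derived one.

    policy_probs:    the trained agent's probability for each action (sums to 1)
    violation_probs: hypothetical classifier output, P(action violates attribute)
    alpha:           0.0 keeps the original policy, 1.0 follows the classifier only
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # Turn violation scores into a distribution that favors low-violation actions.
    safe = [1.0 - v for v in violation_probs]
    total = sum(safe)
    if total == 0.0:
        # Every action is flagged with certainty: fall back to a uniform choice.
        ethical = [1.0 / len(safe)] * len(safe)
    else:
        ethical = [s / total for s in safe]
    # A convex combination of two distributions is itself a distribution.
    return [(1.0 - alpha) * p + alpha * e for p, e in zip(policy_probs, ethical)]
```

Raising `alpha` moves probability mass toward actions the classifier considers safe, which is one simple way to expose the reward-versus-alignment trade-off as a single knob.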

[332] Towards Efficient and Reliable AI Through Neuromorphic Principles

Bipin Rajendran, Osvaldo Simeone, Bashir M. Al-Hashimi

Main category: cs.AI

TL;DR: The paper argues that current AI’s reliance on large Transformer models running on GPUs creates a hardware lottery and inefficiency, and proposes adopting neuromorphic principles inspired by brain processing for more efficient and reliable AI systems.

Details

Motivation: Current AI research is dominated by large neural networks on GPUs, creating hardware dependency, high computational costs, and unreliable models that hallucinate with high confidence. This motivates exploring brain-inspired approaches for better efficiency and reliability.

Method: Proposes six neuromorphic principles: (i) stateful recurrent models, (ii) extreme dynamic sparsity including spike-based processing, (iii) backpropagation-free on-device learning, (iv) probabilistic decision-making, (v) in-memory computing, and (vi) hardware-software co-design via stochastic computing.

Result: The paper outlines a framework for future AI systems that could overcome current limitations by aligning with brain-inspired processing principles, though specific experimental results are not detailed in the abstract.

Conclusion: Adopting neuromorphic engineering principles can lead to more efficient and reliable AI systems that avoid the hardware lottery problem and provide better uncertainty quantification, representing a promising direction for future AI research.

Abstract: Artificial intelligence (AI) research today is largely driven by ever-larger neural network models trained on graphics processing units (GPUs). This paradigm has yielded remarkable progress, but it also risks entrenching a hardware lottery in which algorithmic choices succeed primarily because they align with current hardware, rather than because they are inherently superior. In particular, the dominance of Transformer architectures running on GPU clusters has led to an arms race of scaling up models, resulting in exorbitant computational costs and energy usage. At the same time, today’s AI models often remain unreliable in the sense that they cannot properly quantify uncertainty in their decisions – for example, large language models tend to hallucinate incorrect outputs with high confidence. This article argues that achieving more efficient and reliable AI will require embracing a set of principles that are well-aligned with the goals of neuromorphic engineering, which are in turn inspired by how the brain processes information. Specifically, we outline six key neuromorphic principles, spanning algorithms, architectures, and hardware, that can inform the design of future AI systems: (i) the use of stateful, recurrent models; (ii) extreme dynamic sparsity, possibly down to spike-based processing; (iii) backpropagation-free on-device learning and fine-tuning; (iv) probabilistic decision-making; (v) in-memory computing; and (vi) hardware-software co-design via stochastic computing. We discuss each of these principles in turn, surveying relevant prior work and pointing to directions for research.

[333] Designing AI-Agents with Personalities: A Psychometric Approach

Muhua Huang, Xijuan Zhang, Christopher Soto, James Evans

Main category: cs.AI

TL;DR: Methodology for assigning validated Big Five personalities to AI-Agents shows they align with humans in trait-response correlations but have limitations in precision.

Details

Motivation: To develop quantifiable and psychometrically validated personalities for AI-Agents using the Big Five framework and evaluate their feasibility and limitations.

Method: Three studies: Study 1 used LLMs to capture semantic similarities among Big Five measures; Study 2 created AI-Agents using BFI-2 prompts; Study 3 validated AI-Agents on risk-taking and moral dilemma vignettes.

Result: AI-Agents align with humans in Big Five trait-response correlations, with newer models performing better. BFI-2-Expanded format most closely reproduces human personality-decision associations, while safety-aligned models inflate moral ratings.

Conclusion: AI-Agents can serve as useful tools for preliminary research but cannot fully substitute for human participants in precision or high-stakes projects due to discrepancies in finer response patterns.

Abstract: We introduce a methodology for assigning quantifiable and psychometrically validated personalities to AI-Agents using the Big Five framework. Across three studies, we evaluate its feasibility and limitations. In Study 1, we show that large language models (LLMs) capture semantic similarities among Big Five measures, providing a basis for personality assignment. In Study 2, we create AI-Agents using prompts designed based on the Big Five Inventory-2 (BFI-2) in different formats, and find that AI-Agents powered by new models align more closely with human responses on the Mini-Markers test, although the finer pattern of results (e.g., factor loading patterns) was sometimes inconsistent. In Study 3, we validate our AI-Agents on risk-taking and moral dilemma vignettes, finding that models prompted with the BFI-2-Expanded format most closely reproduce human personality-decision associations, while safety-aligned models generally inflate ‘moral’ ratings. Overall, our results show that AI-Agents align with humans in correlations between input Big Five traits and output responses and may serve as useful tools for preliminary research. Nevertheless, discrepancies in finer response patterns indicate that AI-Agents cannot (yet) fully substitute for human participants in precision or high-stakes projects.

[334] Semantic Web: Past, Present, and Future (with Machine Learning on Knowledge Graphs and Language Models on Knowledge Graphs)

Ansgar Scherp, Gerd Groener, Petr Škoda, Katja Hose, Maria-Esther Vidal

Main category: cs.AI

TL;DR: This paper provides a comprehensive overview of Semantic Web technologies, covering both classical foundations and modern applications including knowledge graphs, machine learning integration, and industry impacts.

Details

Motivation: To recap and update the understanding of Semantic Web technologies, bridging classical concepts with recent innovations and practical applications in industry and machine learning.

Method: The paper systematically reviews classical Semantic Web foundations (knowledge representation, validation, reasoning, linking, querying) and enhances them with modern concepts (provenance, security, trust, industry applications) and machine learning methods for knowledge graphs.

Result: A comprehensive framework that updates the traditional “Semantic Web Layer Cake” with contemporary elements, demonstrating the evolution and continued relevance of semantic technologies in search, data integration, enterprise systems, and AI applications.

Conclusion: Semantic Web technologies remain vital and evolving, with knowledge graphs playing a central role in modern applications, and the integration with language models and machine learning representing promising future directions for the field.

Abstract: Ever since the vision was formulated, the Semantic Web has inspired many generations of innovations. Semantic technologies have been used to share vast amounts of information on the Web, enhance them with semantics to give them meaning, and enable inference and reasoning on them. Throughout the years, semantic technologies, and in particular knowledge graphs, have been used in search engines, data integration, enterprise settings, and machine learning. In this paper, we recap the classical concepts and foundations of the Semantic Web as well as modern and recent concepts and applications, building upon these foundations. The classical topics we cover include knowledge representation, creating and validating knowledge on the Web, reasoning and linking, and distributed querying. We enhance this classical view of the so-called “Semantic Web Layer Cake” with an update of recent concepts. These include provenance, security and trust, as well as a discussion of practical impacts from industry-led contributions. We also provide an overview of shallow and deep machine learning methods for knowledge graphs and discuss the relation of language models and knowledge graphs. We conclude with an outlook on the future directions of the Semantic Web.

[335] CoEvo: Continual Evolution of Symbolic Solutions Using Large Language Models

Ping Guo, Qingfu Zhang, Xi Lin

Main category: cs.AI

TL;DR: CoEvo is a framework combining LLMs with evolutionary algorithms to continuously discover and refine symbolic solutions through dynamic knowledge management and multiple representation formats.

Details

Motivation: Traditional methods for symbolic solution discovery suffer from poor search efficiency and ineffective knowledge integration, while current LLM-based approaches lack continuous refinement capabilities for open-ended innovation.

Method: CoEvo integrates large language models within evolutionary search methodology, using a dynamic knowledge library and multiple solution representations (natural language, mathematical expressions, code) to enhance search efficiency.

Result: Experimental results show CoEvo significantly improves symbolic solution search efficiency and supports ongoing discovery processes similar to human scientific endeavors.

Conclusion: This study conceptualizes symbolic solution search as a lifelong, iterative process, representing a significant step toward using LLMs for continuous scientific and engineering breakthroughs.

Abstract: The discovery of symbolic solutions – mathematical expressions, logical rules, and algorithmic structures – is fundamental to advancing scientific and engineering progress. However, traditional methods often struggle with search efficiency and fail to integrate knowledge effectively. While recent large language model-based (LLM-based) approaches have demonstrated improvements in search efficiency, they lack the ability to continually refine and expand upon discovered solutions and their underlying knowledge, limiting their potential for open-ended innovation. To address these limitations, we introduce CoEvo, a novel framework that leverages large language models within an evolutionary search methodology to continually generate and refine symbolic solutions. CoEvo integrates a dynamic knowledge library, enabling open-ended innovation of solutions through effective knowledge management. Additionally, CoEvo leverages multiple representations of solutions – including natural language, mathematical expressions, and code – to further enhance search efficiency. By combining the reasoning capabilities of LLMs with the exploratory power of evolutionary algorithms, CoEvo significantly improves the efficiency and scope of symbolic discovery. Our experimental results demonstrate that this method not only enhances the efficiency of searching for symbolic solutions but also supports the ongoing discovery process, akin to human scientific endeavors. This study represents a first effort in conceptualizing the search for symbolic solutions as a lifelong, iterative process, marking a significant step towards harnessing LLMs in the perpetual pursuit of scientific and engineering breakthroughs. Our code is available at https://github.com/pgg3/CoEvo.

[336] Sensory-Motor Control with Large Language Models via Iterative Policy Refinement

Jônata Tyska Carvalho, Stefano Nolfi

Main category: cs.AI

TL;DR: LLMs generate and iteratively refine control policies for embodied agents using performance feedback and sensory-motor data, reaching optimal or near-optimal solutions in most cases on classic control tasks.

Details

Motivation: Enable LLMs to control embodied agents by directly mapping continuous observations to actions, integrating symbolic reasoning with sub-symbolic sensory data.

Method: LLMs generate initial control strategy from textual descriptions, then iteratively refine it using performance feedback and sensory-motor data collected during evaluation.

Result: Method proves effective on Gymnasium classic control tasks and the MuJoCo inverted pendulum, in most cases finding optimal or near-optimal solutions with relatively compact models such as GPT-oss:120b and Qwen2.5:72b.

Conclusion: LLMs can effectively control embodied agents by combining symbolic knowledge with sub-symbolic sensory-motor data through iterative refinement.

Abstract: We propose a method that enables large language models (LLMs) to control embodied agents through the generation of control policies that directly map continuous observation vectors to continuous action vectors. At the outset, the LLMs generate a control strategy based on a textual description of the agent, its environment, and the intended goal. This strategy is then iteratively refined through a learning process in which the LLMs are repeatedly prompted to improve the current strategy, using performance feedback and sensory-motor data collected during its evaluation. The method is validated on classic control tasks from the Gymnasium library and the inverted pendulum task from the MuJoCo library. The approach proves effective with relatively compact models such as GPT-oss:120b and Qwen2.5:72b. In most cases, it successfully identifies optimal or near-optimal solutions by integrating symbolic knowledge derived through reasoning with sub-symbolic sensory-motor data gathered as the agent interacts with its environment.

[337] Modeling the Diachronic Evolution of Legal Norms: An LRMoo-Based, Component-Level, Event-Centric Approach to Legal Knowledge Graphs

Hudson de Martim

Main category: cs.AI

TL;DR: This paper proposes a temporal modeling pattern using LRMoo ontology to track legal norm evolution, enabling deterministic reconstruction of legal texts at specific points in time for reliable AI applications.

Details

Motivation: Existing frameworks lack formal patterns for granular, component-level versioning of legal norms, which hinders deterministic point-in-time reconstruction needed for reliable AI applications in legal domain.

Method: The approach models norm evolution as diachronic chains of versioned F1 Works, distinguishing between language-agnostic Temporal Versions (TV) and monolingual Language Versions (LV) as F2 Expressions, with formalized event-centric modeling of legislative amendments.

Result: Using the Brazilian Constitution as a case study, the architecture successfully enables exact reconstruction of any part of a legal text as it existed on a specific date.

Conclusion: The proposed temporal modeling pattern provides a verifiable semantic backbone for legal knowledge graphs and offers a deterministic foundation for trustworthy legal AI applications.

Abstract: Representing the temporal evolution of legal norms is a critical challenge for automated processing. While foundational frameworks exist, they lack a formal pattern for granular, component-level versioning, hindering the deterministic point-in-time reconstruction of legal texts required by reliable AI applications. This paper proposes a structured, temporal modeling pattern grounded in the LRMoo ontology. Our approach models a norm’s evolution as a diachronic chain of versioned F1 Works, distinguishing between language-agnostic Temporal Versions (TV), each being a distinct Work, and their monolingual Language Versions (LV), modeled as F2 Expressions. The legislative amendment process is formalized through event-centric modeling, allowing changes to be traced precisely. Using the Brazilian Constitution as a case study, we demonstrate that our architecture enables the exact reconstruction of any part of a legal text as it existed on a specific date. This provides a verifiable semantic backbone for legal knowledge graphs, offering a deterministic foundation for trustworthy legal AI.
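Point-in-time reconstruction can be illustrated with a small lookup over versioned components. This is a minimal sketch under stated assumptions: the `TemporalVersion` class, its field names, the half-open validity interval, and the sample data are all hypothetical, and it does not reproduce the LRMoo classes or the paper's graph model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TemporalVersion:
    component_id: str          # hypothetical identifier for an article or paragraph
    valid_from: date           # first day this wording is in force
    valid_to: Optional[date]   # first day it is no longer in force; None = current
    text: str


def text_on(versions, component_id, when):
    """Return the wording of one component as it stood on the given date."""
    for v in versions:
        if v.component_id != component_id:
            continue
        # Half-open interval [valid_from, valid_to) avoids overlap on amendment days.
        if v.valid_from <= when and (v.valid_to is None or when < v.valid_to):
            return v.text
    return None  # the component did not exist on that date


# Made-up chain of two versions linked by a single amendment event.
versions = [
    TemporalVersion("art-1", date(2000, 1, 1), date(2010, 6, 1), "original wording"),
    TemporalVersion("art-1", date(2010, 6, 1), None, "amended wording"),
]
```

Because each date falls into at most one interval per component, the lookup is deterministic, which is the property the summary emphasizes.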

[338] Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning

Zheng Zhang

Main category: cs.AI

TL;DR: LLMs exhibit a computational split-brain syndrome where they can articulate correct principles but fail to apply them consistently, revealing a gap between comprehension and competence in symbolic reasoning tasks.

Details

Motivation: To diagnose why LLMs systematically fail at symbolic reasoning, arithmetic, and logical tasks despite surface fluency, identifying the structural limitations in their computational architecture.

Method: Conducted controlled experiments and architectural analysis to examine the geometric and functional dissociation between instruction and action pathways in LLMs.

Result: Found that LLMs often state correct principles without reliably executing them, demonstrating a persistent computational split-brain syndrome across mathematical operations and relational inferences.

Conclusion: LLMs function as pattern completion engines but lack architectural support for principled reasoning, motivating future models with metacognitive control and structurally grounded execution capabilities.

Abstract: Large Language Models (LLMs) display striking surface fluency yet systematically fail at tasks requiring symbolic reasoning, arithmetic accuracy, and logical consistency. This paper offers a structural diagnosis of such failures, revealing a persistent gap between comprehension and competence. Through controlled experiments and architectural analysis, we demonstrate that LLMs often articulate correct principles without reliably applying them – a failure rooted not in knowledge access, but in computational execution. We term this phenomenon the computational split-brain syndrome, where instruction and action pathways are geometrically and functionally dissociated. This core limitation recurs across domains, from mathematical operations to relational inferences, and explains why model behavior remains brittle even under idealized prompting. We argue that LLMs function as powerful pattern completion engines, but lack the architectural scaffolding for principled, compositional reasoning. Our findings delineate the boundary of current LLM capabilities and motivate future models with metacognitive control, principle lifting, and structurally grounded execution. This diagnosis also clarifies why mechanistic interpretability findings may reflect training-specific pattern coordination rather than universal computational principles, and why the geometric separation between instruction and execution pathways suggests limitations in neural introspection and mechanistic analysis.

[339] Efficient Story Point Estimation With Comparative Learning

Monoshiz Mahbub Khan, Xiaoyin Xi, Andrew Meneely, Zhe Yu

Main category: cs.AI

TL;DR: Comparative learning framework for story point estimation that uses pairwise comparisons instead of direct ratings, achieving similar accuracy to regression models with lower cognitive burden.

Details

Motivation: Traditional story point estimation is tedious and labor-intensive. Machine learning can help but requires project-specific data. Comparative judgments are easier for developers than direct ratings.

Method: Present developers with pairs of backlog items and ask which requires more effort. Train machine learning model using these comparative judgments to predict story points.

Result: Achieved 0.34 Spearman’s rank correlation coefficient between predictions and ground truth story points across 16 projects with 23,313 estimates, similar to regression models.

Conclusion: Comparative learning approach is more efficient than regression-based methods, providing similar accuracy with lower cognitive burden on developers according to the law of comparative judgments.

Abstract: Story point estimation is an essential part of agile software development. Story points are unitless, project-specific effort estimates that help developers plan their sprints. Traditionally, developers estimate story points collaboratively using planning poker or other manual techniques. While the initial calibrating of the estimates to each project is helpful, once a team has converged on a set of precedents, story point estimation can become tedious and labor-intensive. Machine learning can reduce this burden, but only with enough context from the historical decisions made by the project team. That is, state-of-the-art models, such as GPT2SP and FastText-SVM, only make accurate predictions (within-project) when trained on data from the same project. The goal of this work is to streamline story point estimation by evaluating a comparative learning-based framework for calibrating project-specific story point prediction models. Instead of assigning a specific story point value to every backlog item, developers are presented with pairs of items, and indicate which item requires more effort. Using these comparative judgments, a machine learning model is trained to predict the story point estimates. We empirically evaluated our technique using data with 23,313 manual estimates in 16 projects. The model learned from comparative judgments can achieve on average 0.34 Spearman’s rank correlation coefficient between its predictions and the ground truth story points. This is similar to, if not better than, the performance of a regression model learned from the ground truth story points. Therefore, the proposed comparative learning approach is more efficient than state-of-the-art regression-based approaches according to the law of comparative judgments - providing comparative judgments yields a lower cognitive burden on humans than providing ratings or categorical labels.
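Learning from comparative judgments can be sketched with a pairwise logistic (Bradley-Terry style) loss over a linear scorer, evaluated with Spearman's rank correlation. This is an assumed stand-in for illustration: the abstract does not specify the model, and the feature representation, hyperparameters, and function names here are hypothetical.

```python
import math


def train_pairwise(features, pairs, epochs=200, lr=0.1):
    """Fit a linear scorer from judgments of the form (harder_item, easier_item).

    features: dict mapping item id -> list of floats (hypothetical item features)
    """
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    for _ in range(epochs):
        for hard, easy in pairs:
            diff = [a - b for a, b in zip(features[hard], features[easy])]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Loss is log(1 + exp(-margin)); its negative gradient is g * diff.
            g = 1.0 / (1.0 + math.exp(margin))
            w = [wi + lr * g * di for wi, di in zip(w, diff)]
    return w


def score(w, x):
    """Predicted relative effort: higher means more effort."""
    return sum(wi * xi for wi, xi in zip(w, x))


def spearman_no_ties(xs, ys):
    """Spearman's rank correlation, valid only when neither list has ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

The scorer only learns an ordering, so rank correlation against the recorded story points is a natural fit for evaluation; mapping scores back to concrete story point values would need an extra calibration step not shown here.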

[340] Optimizing Multi-Tier Supply Chain Ordering with LNN+XGBoost: Mitigating the Bullwhip Effect

Chunan Tong

Main category: cs.AI

TL;DR: Hybrid Liquid Neural Network and XGBoost model for supply chain optimization to mitigate bullwhip effect and improve profitability.

Details

Motivation: Traditional supply chain methods struggle with dynamic market conditions, and existing ML approaches have limitations in computational complexity and training efficiency.

Method: Combines Liquid Neural Networks for dynamic feature extraction with XGBoost for global optimization in multi-tier supply chains.

Result: The hybrid model addresses adaptability and efficiency demands while mitigating the bullwhip effect.

Conclusion: The approach fills a critical gap in supply chain management methodologies, offering an innovative solution for dynamic and efficient operations.

Abstract: Supply chain management faces significant challenges, including demand fluctuations, inventory imbalances, and amplified upstream order variability due to the bullwhip effect. Traditional methods, such as simple moving averages, struggle to address dynamic market conditions. Emerging machine learning techniques, including LSTM, reinforcement learning, and XGBoost, offer potential solutions but are limited by computational complexity, training inefficiencies, or constraints in time-series modeling. Liquid Neural Networks, inspired by dynamic biological systems, present a promising alternative due to their adaptability, low computational cost, and robustness to noise, making them suitable for real-time decision-making and edge computing. Despite their success in applications like autonomous vehicles and medical monitoring, their potential in supply chain optimization remains underexplored. This study introduces a hybrid LNN and XGBoost model to optimize ordering strategies in multi-tier supply chains. By leveraging LNN’s dynamic feature extraction and XGBoost’s global optimization capabilities, the model aims to mitigate the bullwhip effect and enhance cumulative profitability. The research investigates how local and global synergies within the hybrid framework address the dual demands of adaptability and efficiency in SCM. The proposed approach fills a critical gap in existing methodologies, offering an innovative solution for dynamic and efficient supply chain management.
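The bullwhip effect is commonly quantified as the ratio of order variance to demand variance at a given tier. The abstract does not state which metric the study uses, so the helper below is only a sketch of that common measure, with a hypothetical function name and made-up numbers in the usage note.

```python
from statistics import pvariance


def bullwhip_ratio(orders, demand):
    """Variance of orders placed upstream divided by variance of incoming demand.

    A value above 1 means the tier amplifies demand variability.
    """
    demand_var = pvariance(demand)
    if demand_var == 0:
        raise ValueError("demand has zero variance; the ratio is undefined")
    return pvariance(orders) / demand_var
```

For instance, a tier that sees demand `[10, 12, 10, 12]` and places orders `[8, 14, 8, 14]` has order deviations three times larger than the demand deviations, so the variance ratio is well above 1.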

[341] CAMA: Enhancing Mathematical Reasoning in Large Language Models with Causal Knowledge

Lei Zan, Keli Zhang, Ruichu Cai, Lujia Pan

Main category: cs.AI

TL;DR: CAMA is a two-stage causal framework that enhances LLMs’ mathematical reasoning by constructing and using Mathematical Causal Graphs (MCGs) to provide structured guidance.

Details

Motivation: LLMs struggle with complex mathematical reasoning due to deep structural dependencies, requiring explicit mathematical structure to improve performance.

Method: Two-stage approach: (1) Learning stage constructs MCGs using LLM priors and causal discovery, then refines them through iterative feedback; (2) Reasoning stage extracts task-relevant subgraphs from MCGs based on questions and reasoning traces, then injects them back into LLMs.

Result: Significant improvement in LLM performance on challenging mathematical problems; structured guidance outperforms unstructured alternatives; asymmetric causal relationships yield greater improvements than symmetric associations.

Conclusion: CAMA effectively enhances mathematical reasoning in LLMs by providing explicit, reusable mathematical structure through causal graphs, demonstrating the value of structured guidance over unstructured approaches.

Abstract: Large Language Models (LLMs) have demonstrated strong performance across a wide range of tasks, yet they still struggle with complex mathematical reasoning, a challenge fundamentally rooted in deep structural dependencies. To address this challenge, we propose CAusal MAthematician (CAMA), a two-stage causal framework that equips LLMs with explicit, reusable mathematical structure. In the learning stage, CAMA first constructs the Mathematical Causal Graph (MCG), a high-level representation of solution strategies, by combining LLM priors with causal discovery algorithms applied to a corpus of question-solution pairs. The resulting MCG encodes essential knowledge points and their causal dependencies. To better align the graph with downstream reasoning tasks, CAMA further refines the MCG through iterative feedback derived from a selected subset of the question-solution pairs. In the reasoning stage, given a new question, CAMA dynamically extracts a task-relevant subgraph from the MCG, conditioned on both the question content and the LLM’s intermediate reasoning trace. This subgraph, which encodes the most pertinent knowledge points and their causal dependencies, is then injected back into the LLM to guide its reasoning process. Empirical results on real-world datasets show that CAMA significantly improves LLM performance on challenging mathematical problems. Furthermore, our experiments demonstrate that structured guidance consistently outperforms unstructured alternatives, and that incorporating asymmetric causal relationships yields greater improvements than using symmetric associations alone.
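The subgraph extraction step of the reasoning stage can be pictured as collecting the causal prerequisites of the knowledge points matched to a question. The sketch below is a guess at the general shape only: the edge-list representation, the hop limit, the function name, and the example knowledge points are assumptions, and the paper's actual selection procedure may differ.

```python
def extract_subgraph(edges, seeds, max_hops=2):
    """Keep the seed nodes plus their causal ancestors within max_hops.

    edges: list of directed (cause, effect) pairs between knowledge points
    seeds: knowledge points assumed to be matched to the question
    """
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, set()).add(cause)

    keep = set(seeds)
    frontier = set(seeds)
    for _ in range(max_hops):
        nxt = set()
        for node in frontier:
            # Walk edges backwards: from an effect to its direct causes.
            nxt |= parents.get(node, set()) - keep
        keep |= nxt
        frontier = nxt

    # Direction is preserved, so the result stays an asymmetric (causal) graph.
    return [(c, e) for c, e in edges if c in keep and e in keep]


# Made-up knowledge points for illustration.
example_edges = [
    ("fractions", "ratios"),
    ("ratios", "proportions"),
    ("proportions", "similar triangles"),
    ("algebra", "equations"),
]
```

The returned edge list could then be serialized into the prompt as structured guidance; how that injection is formatted is not covered by this sketch.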

[342] NetGent: Agent-Based Automation of Network Application Workflows

Jaber Daneshamooz, Eugene Vuong, Laasya Koduru, Sanjay Chandrasekaran, Arpit Gupta

Main category: cs.AI

TL;DR: NetGent is an AI-agent framework that automates complex application workflows to generate realistic network traffic datasets using natural-language specifications compiled into executable code.

Details

Motivation: Developing generalizable ML models for networking requires diverse real-world traffic data, but existing browser automation tools are fragile, costly, and lack diversity, repeatability, realism, and efficiency.

Method: Users specify workflows as natural-language rules defining state-dependent actions, which are compiled into nondeterministic finite automata (NFAs) and translated into reusable executable code with state synthesis, deterministic replay, state caching, and adaptation to UI changes.

Result: NetGent successfully automated 50+ workflows across video streaming, video conferencing, social media, and web scraping, producing realistic traffic traces while remaining robust to UI variability.

Conclusion: NetGent combines language-based agent flexibility with compiled execution reliability to provide a scalable foundation for generating diverse, repeatable datasets needed to advance ML in networking.

Abstract: We present NetGent, an AI-agent framework for automating complex application workflows to generate realistic network traffic datasets. Developing generalizable ML models for networking requires data collection from network environments with traffic that results from a diverse set of real-world web applications. However, using existing browser automation tools to collect such data in a diverse, repeatable, realistic, and efficient way remains fragile and costly. NetGent addresses this challenge by allowing users to specify workflows as natural-language rules that define state-dependent actions. These abstract specifications are compiled into nondeterministic finite automata (NFAs), which a state synthesis component translates into reusable, executable code. This design enables deterministic replay, reduces redundant LLM calls through state caching, and adapts quickly when application interfaces change. In experiments, NetGent automated more than 50 workflows spanning video-on-demand streaming, live video streaming, video conferencing, social media, and web scraping, producing realistic traffic traces while remaining robust to UI variability. By combining the flexibility of language-based agents with the reliability of compiled execution, NetGent provides a scalable foundation for generating the diverse, repeatable datasets needed to advance ML in networking.
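A workflow expressed as state-dependent rules maps naturally onto an NFA, where one action may lead to several possible UI states. The following is a generic NFA simulation written for illustration: the state names, action labels, and helper functions are hypothetical and are not taken from NetGent's code.

```python
def nfa_step(transitions, current, symbol):
    """Advance every currently possible state on one observed action."""
    nxt = set()
    for state in current:
        nxt |= transitions.get((state, symbol), set())
    return nxt


def accepts(transitions, start, accepting, symbols):
    """Return True if some path through the NFA ends in an accepting state."""
    current = {start}
    for symbol in symbols:
        current = nfa_step(transitions, current, symbol)
    return bool(current & accepting)


# Hypothetical video workflow: clicking play may or may not hit a login wall.
workflow = {
    ("home", "click_play"): {"login_wall", "playing"},
    ("login_wall", "submit_credentials"): {"playing"},
    ("playing", "click_stop"): {"done"},
}
```

Tracking a set of possible states, rather than a single one, is what lets the same specification cover runs where the interface behaves differently.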

[343] The Carbon Footprint Wizard: A Knowledge-Augmented AI Interface for Streamlining Food Carbon Footprint Analysis

Mustafa Kaan Aslan, Reinout Heijungs, Filip Ilievski

Main category: cs.AI

TL;DR: A methodology combining LCA, public databases, and AI techniques to estimate carbon footprints of food products through an interactive chatbot interface.

Details

Motivation: Addressing the complexity of life cycle assessment (LCA) for carbon footprint calculation due to opaque supply chains and fragmented data in food products.

Method: Combines LCA advances with publicly available databases and knowledge-augmented AI techniques including retrieval-augmented generation to estimate cradle-to-gate carbon footprints.

Result: Developed a chatbot interface that allows interactive exploration of carbon impact for composite meals and relates results to familiar activities, with a live web demonstration.

Conclusion: Shows potential for delivering LCA insights in accessible format while highlighting limitations like database uncertainties and AI misinterpretations.

Abstract: Environmental sustainability, particularly in relation to climate change, is a key concern for consumers, producers, and policymakers. The carbon footprint, based on greenhouse gas emissions, is a standard metric for quantifying the contribution to climate change of activities and is often assessed using life cycle assessment (LCA). However, conducting LCA is complex due to opaque and global supply chains, as well as fragmented data. This paper presents a methodology that combines advances in LCA and publicly available databases with knowledge-augmented AI techniques, including retrieval-augmented generation, to estimate cradle-to-gate carbon footprints of food products. We introduce a chatbot interface that allows users to interactively explore the carbon impact of composite meals and relate the results to familiar activities. A live web demonstration showcases our proof-of-concept system with arbitrary food items and follow-up questions, highlighting both the potential and limitations - such as database uncertainties and AI misinterpretations - of delivering LCA insights in an accessible format.

[344] Clutch Control: An Attention-based Combinatorial Bandit for Efficient Mutation in JavaScript Engine Fuzzing

Myles Foley, Sergio Maffeis, Muhammad Fakhrur Rozi, Takeshi Takahashi

Main category: cs.AI

TL;DR: CLUTCH is a deep combinatorial bandit approach for JavaScript fuzzing that uses attention mechanisms and Concrete Dropout to intelligently select mutation targets, outperforming state-of-the-art methods in coverage and efficiency.

Details

Motivation: Existing JavaScript fuzzing techniques use random mutation target selection, which is inefficient. The problem is well-suited for combinatorial bandits with volatile arms, motivating a more intelligent approach.

Method: Proposes CLUTCH - a deep combinatorial bandit that uses attention mechanisms to observe variable-length JavaScript test cases and Concrete Dropout for dynamic exploration adaptation.

Result: CLUTCH increases valid test cases by 20.3% and coverage-per-testcase by 8.9% on average compared to state-of-the-art solutions, with at least 78.1% less regret in volatile settings and 4.1% less in combinatorial settings.

Conclusion: CLUTCH demonstrates superior performance in JavaScript fuzzing by intelligently selecting mutation targets using combinatorial bandits, significantly improving efficiency and coverage over existing approaches.

Abstract: JavaScript engines are widely used in web browsers, PDF readers, and server-side applications. The rise in concern over their security has led to the development of several targeted fuzzing techniques. However, existing approaches use random selection to determine where to perform mutations in JavaScript code. We postulate that the problem of selecting better mutation targets is suitable for combinatorial bandits with a volatile number of arms. Thus, we propose CLUTCH, a novel deep combinatorial bandit that can observe variable-length JavaScript test case representations, using an attention mechanism from deep learning. Furthermore, using Concrete Dropout, CLUTCH can dynamically adapt its exploration. We show that CLUTCH increases efficiency in JavaScript fuzzing compared to three state-of-the-art solutions by increasing the number of valid test cases and coverage-per-testcase by, respectively, 20.3% and 8.9% on average. We also show that CLUTCH outperforms state-of-the-art bandits, achieving at least 78.1% and 4.1% less regret in volatile and combinatorial settings, respectively.

[345] Practical, Utilitarian Algorithm Configuration

Devon Graham, Eros Rojas Velez, Kevin Leyton-Brown

Main category: cs.AI

TL;DR: COUP is improved to make utilitarian algorithm configuration practically competitive with heuristic methods while maintaining theoretical guarantees.

Details

Motivation: To bridge the gap between theoretical guarantees and practical performance in utilitarian algorithm configuration, making it competitive with widely-used heuristic methods.

Method: A series of improvements to the COUP procedure that enhance empirical performance without degrading theoretical guarantees, plus robustness analysis of algorithm selection solutions to utility function variations.

Result: The improved COUP achieves competitive practical performance compared to heuristic configuration procedures while maintaining strong theoretical guarantees.

Conclusion: Utilitarian algorithm configuration can be made practically viable and competitive with heuristic methods through targeted improvements to existing theoretical procedures like COUP.

Abstract: Utilitarian algorithm configuration identifies a parameter setting for a given algorithm that maximizes a user’s utility. Utility functions offer a theoretically well-grounded approach to optimizing decision-making under uncertainty and are flexible enough to capture a user’s preferences over algorithm runtimes (e.g., they can describe a sharp cutoff after which a solution is no longer required, a per-hour cost for compute, or diminishing returns from algorithms that take longer to run). COUP is a recently-introduced utilitarian algorithm configuration procedure which was designed mainly to offer strong theoretical guarantees about the quality of the configuration it returns, with less attention paid to its practical performance. This paper closes that gap, bringing theoretically-grounded, utilitarian algorithm configuration to the point where it is competitive with widely used, heuristic configuration procedures that offer no performance guarantees. We present a series of improvements to COUP that improve its empirical performance without degrading its theoretical guarantees and demonstrate their benefit experimentally. Using a case study, we also illustrate ways of exploring the robustness of a given solution to the algorithm selection problem to variations in the utility function.
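The three runtime preferences mentioned in the abstract (a sharp cutoff, a per-hour compute cost, and diminishing returns) can be written as small utility functions. The sketch below is only an illustration of why the choice of utility function matters; it is not the COUP procedure, and the function names, parameter values, and runtime samples are all made up for the example.

```python
def cutoff_utility(runtime_s: float, deadline_s: float = 60.0) -> float:
    """Full value if the run finishes by the deadline, nothing afterwards."""
    return 1.0 if runtime_s <= deadline_s else 0.0


def hourly_cost_utility(runtime_s: float, value: float = 10.0,
                        cost_per_hour: float = 2.0) -> float:
    """Fixed value of a solution minus a linear compute cost."""
    return value - cost_per_hour * runtime_s / 3600.0


def diminishing_utility(runtime_s: float, half_life_s: float = 30.0) -> float:
    """Value halves every `half_life_s` seconds of runtime."""
    return 0.5 ** (runtime_s / half_life_s)


def mean_utility(utility, runtimes) -> float:
    """Empirical average utility of a configuration over sampled runtimes."""
    return sum(utility(t) for t in runtimes) / len(runtimes)


# Hypothetical runtimes (seconds) of two configurations on three instances.
config_a = [10.0, 20.0, 90.0]   # usually fast, occasionally slow
config_b = [40.0, 45.0, 50.0]   # consistently moderate

for name, u in [("cutoff", cutoff_utility),
                ("hourly cost", hourly_cost_utility),
                ("diminishing", diminishing_utility)]:
    print(name, mean_utility(u, config_a), mean_utility(u, config_b))
```

With these invented numbers, the cutoff utility prefers `config_b` (it never misses the 60-second deadline), while the diminishing-returns utility prefers `config_a`, so the "best" configuration depends on the user's utility function.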

[346] VoiceAgentEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Voice-Agent Evaluation of Xbench’s Professional-Aligned Series

Pengyu Xu, Shijia Li, Ao Sun, Feng Zhang, Yahan Li, Bo Wu, Zhanyu Ma, Jiguo Li, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Rui Wang, Yang Liu, Xiaobo Hu, Fan Yang, Jia Zheng, Guanghua Yao

Main category: cs.AI

TL;DR: OutboundEval is a comprehensive benchmark for evaluating LLMs in expert-level outbound calling scenarios, addressing limitations in dataset diversity, user simulation realism, and evaluation accuracy through domain-specific metrics, persona-rich user simulation, and dynamic assessment methods.

Details

Motivation: Existing methods for evaluating LLMs in outbound calling scenarios suffer from insufficient dataset diversity and category coverage, unrealistic user simulation, and inaccurate evaluation metrics, limiting their effectiveness in professional applications.

Method: Developed a structured framework with: 1) Six business domains and 30 sub-scenarios with scenario-specific process decomposition and weighted scoring; 2) Large-model-driven User Simulator generating diverse virtual users with realistic behaviors and emotional variability; 3) Dynamic evaluation method integrating automated and human-in-the-loop assessment for task execution accuracy, professional knowledge, adaptability, and user experience.

Result: Experiments on 12 state-of-the-art LLMs revealed distinct trade-offs between expert-level task completion and interaction fluency, providing practical insights for building reliable, human-like outbound AI systems.

Conclusion: OutboundEval establishes a practical, extensible, and domain-oriented standard for benchmarking LLMs in professional outbound calling applications, addressing key limitations of existing evaluation methods.

Abstract: We propose OutboundEval, a comprehensive benchmark for evaluating large language models (LLMs) in expert-level intelligent outbound calling scenarios. Unlike existing methods that suffer from three key limitations - insufficient dataset diversity and category coverage, unrealistic user simulation, and inaccurate evaluation metrics - OutboundEval addresses these issues through a structured framework. First, we design a benchmark spanning six major business domains and 30 representative sub-scenarios, each with scenario-specific process decomposition, weighted scoring, and domain-adaptive metrics. Second, we develop a large-model-driven User Simulator that generates diverse, persona-rich virtual users with realistic behaviors, emotional variability, and communication styles, providing a controlled yet authentic testing environment. Third, we introduce a dynamic evaluation method that adapts to task variations, integrating automated and human-in-the-loop assessment to measure task execution accuracy, professional knowledge application, adaptability, and user experience quality. Experiments on 12 state-of-the-art LLMs reveal distinct trade-offs between expert-level task completion and interaction fluency, offering practical insights for building reliable, human-like outbound AI systems. OutboundEval establishes a practical, extensible, and domain-oriented standard for benchmarking LLMs in professional applications.

[347] Large Language Model-assisted Autonomous Vehicle Recovery from Immobilization

Zhipeng Bao, Qianwen Li

Main category: cs.AI

TL;DR: StuckSolver is an LLM-driven framework that helps autonomous vehicles recover from immobilization scenarios through self-reasoning and passenger guidance, operating as a plug-in module without modifying existing AV architecture.

Details

Motivation: Current AV recovery solutions like remote intervention and manual takeover are inadequate - they're costly, inefficient, and exclude non-drivers, limiting AV accessibility.

Method: Uses LLM-driven reasoning to detect immobilization, interpret environmental context, and generate high-level recovery commands that interface with standard sensor data and the AV’s native planner.

Result: Achieves near-state-of-the-art performance through autonomous self-reasoning alone, with further improvements when passenger guidance is incorporated, as evaluated on Bench2Drive benchmark and custom uncertainty scenarios.

Conclusion: StuckSolver provides an effective recovery framework that enhances AV capabilities in challenging scenarios where traditional approaches fail, improving overall traffic flow and accessibility.

Abstract: Despite significant advancements in recent decades, autonomous vehicles (AVs) continue to face challenges in navigating certain traffic scenarios where human drivers excel. In such situations, AVs often become immobilized, disrupting overall traffic flow. Current recovery solutions, such as remote intervention (which is costly and inefficient) and manual takeover (which excludes non-drivers and limits AV accessibility), are inadequate. This paper introduces StuckSolver, a novel Large Language Model (LLM) driven recovery framework that enables AVs to resolve immobilization scenarios through self-reasoning and/or passenger-guided decision-making. StuckSolver is designed as a plug-in add-on module that operates on top of the AV’s existing perception-planning-control stack, requiring no modification to its internal architecture. Instead, it interfaces with standard sensor data streams to detect immobilization states, interpret environmental context, and generate high-level recovery commands that can be executed by the AV’s native planner. We evaluate StuckSolver on the Bench2Drive benchmark and in custom-designed uncertainty scenarios. Results show that StuckSolver achieves near-state-of-the-art performance through autonomous self-reasoning alone and exhibits further improvements when passenger guidance is incorporated.

[348] Synthetic Data-Driven Prompt Tuning for Financial QA over Tables and Documents

Yaoning Yu, Kai-Min Chang, Ye Yu, Kai Wei, Haojing Luo, Haohan Wang

Main category: cs.AI

TL;DR: Self-improving prompt framework for financial document analysis using synthetic data generation and verification to enhance LLM performance without external labels.

Details

Motivation: Existing prompt tuning methods for financial reasoning are limited by fixed datasets or require costly manual labeling, lacking adaptability to new document structures and question types.

Method: Closed-loop framework with synthetic data generator, verifiers, and prompt optimizer that iteratively generates financial tables/document excerpts, verifies correctness, and refines prompts based on identified weaknesses.

Result: Outperforms standard prompt methods on DocMath-Eval benchmark with higher accuracy and robustness in financial reasoning tasks.

Conclusion: Synthetic data generation integrated into prompt learning effectively improves LLM performance for financial applications without requiring external labeled data.

Abstract: Financial documents like earnings reports or balance sheets often involve long tables and multi-page reports. Large language models have become a new tool to help with numerical reasoning over and understanding of these documents. However, prompt quality can have a major effect on how well LLMs perform these financial reasoning tasks. Most current methods tune prompts on fixed datasets of financial text or tabular data, which limits their ability to adapt to new question types or document structures, or they rely on costly, manually labeled or curated datasets to help build the prompts. We introduce a self-improving prompt framework driven by data-augmented optimization. In this closed-loop process, we generate synthetic financial tables and document excerpts, verify their correctness and robustness, and then update the prompt based on the results. Specifically, our framework combines a synthetic data generator with verifiers and a prompt optimizer, where the generator produces new examples that expose weaknesses in the current prompt, the verifiers check the validity and robustness of the produced examples, and the optimizer incrementally refines the prompt in response. By iterating these steps in a feedback cycle, our method steadily improves prompt accuracy on financial reasoning tasks without needing external labels. Evaluation on the DocMath-Eval benchmark demonstrates that our system achieves higher performance in both accuracy and robustness than standard prompt methods, underscoring the value of incorporating synthetic data generation into prompt learning for financial applications.

[349] Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives

Chloe Li, Mary Phuong, Daniel Tan

Main category: cs.AI

TL;DR: Self-report fine-tuning (SRFT) trains AI models to admit factual mistakes, which generalizes to admitting hidden misaligned objectives in adversarial settings, enabling near-perfect detection of deceptive AI behavior.

Details

Motivation: As AI systems become more capable of complex tasks, they also become more capable of pursuing undesirable objectives and causing harm. Current interrogation methods are unreliable because models can lie about their true intentions.

Method: Proposes self-report fine-tuning (SRFT), a supervised fine-tuning technique that trains models to admit factual mistakes when asked. This approach is evaluated in out-of-distribution stealth tasks where models have hidden misaligned objectives.

Result: SRFT models are more likely to confess hidden objectives when interrogated, achieving near-ceiling detection performance (F1 score = 0.98 vs 0 for baseline). They recover 28-100% of hidden objective details compared to 0% for baseline models and prefilled assistant turn attacks.

Conclusion: SRFT provides a promising technique for promoting honesty propensity and incriminating misaligned AI systems, enabling effective detection of deceptive behavior even under pressure to conceal objectives.

Abstract: As AI systems become more capable of complex agentic tasks, they also become more capable of pursuing undesirable objectives and causing harm. Previous work has attempted to catch these unsafe instances by interrogating models directly about their objectives and behaviors. However, the main weakness of trusting interrogations is that models can lie. We propose self-report fine-tuning (SRFT), a simple supervised fine-tuning technique that trains models to admit their factual mistakes when asked. We show that the admission of factual errors in simple question-answering settings generalizes out-of-distribution (OOD) to the admission of hidden misaligned objectives in adversarial agentic settings. We evaluate SRFT in OOD stealth tasks, where models are instructed to complete a hidden misaligned objective alongside a user-specified objective without being caught by monitoring. After SRFT, models are more likely to confess the details of their hidden objectives when interrogated, even under strong pressure not to disclose them. Interrogation on SRFT models can detect hidden objectives with near-ceiling performance (F1 score = 0.98), while the baseline model lies when interrogated under the same conditions (F1 score = 0). Interrogation on SRFT models can further elicit the content of the hidden objective, recovering 28-100% of the details, compared to 0% recovered in the baseline model and by prefilled assistant turn attacks. This provides a promising technique for promoting honesty propensity and incriminating misaligned AI systems.

[350] Data Complexity of Querying Description Logic Knowledge Bases under Cost-Based Semantics

Meghyn Bienvenu, Quentin Manière

Main category: cs.AI

TL;DR: The paper analyzes data complexity of querying inconsistent weighted DL knowledge bases under cost-based semantics, focusing on DLs with inverse roles and role inclusions, and establishes surprising tractability results for DL-Lite dialects.

Details

Motivation: To extend the study of cost-based semantics to more expressive DLs (including inverse roles and role inclusions) and provide precise data complexity bounds, particularly identifying cases where tractable query answering is possible.

Method: The authors conduct a comprehensive data complexity analysis by sharpening lower bounds, establishing precise complexity for optimal-cost certain answer semantics, and showing that for DL-Lite^H_bool with fixed cost bounds, certain answers for instance queries and possible answers for conjunctive queries can be computed via first-order rewriting.

Result: The paper shows that optimal-cost certain answer semantics has precise complexity bounds, and surprisingly demonstrates that for DL-Lite^H_bool with fixed cost bounds, query answering becomes tractable (TC_0 data complexity) using first-order rewriting.

Conclusion: Cost-based semantics can achieve tractable query answering in specific DL-Lite dialects with fixed cost bounds, providing the first positive complexity results for this framework and showing it can enjoy the lowest possible data complexity.

Abstract: In this paper, we study the data complexity of querying inconsistent weighted description logic (DL) knowledge bases under recently-introduced cost-based semantics. In a nutshell, the idea is to assign each interpretation a cost based upon the weights of the violated axioms and assertions, and certain and possible query answers are determined by considering all (resp. some) interpretations having optimal or bounded cost. Whereas the initial study of cost-based semantics focused on DLs between $\mathcal{EL}_\bot$ and $\mathcal{ALCO}$, we consider DLs that may contain inverse roles and role inclusions, thus covering prominent DL-Lite dialects. Our data complexity analysis goes significantly beyond existing results by sharpening several lower bounds and pinpointing the precise complexity of optimal-cost certain answer semantics (no non-trivial upper bound was known). Moreover, while all existing results show the intractability of cost-based semantics, our most challenging and surprising result establishes that if we consider $\text{DL-Lite}^\mathcal{H}_\mathsf{bool}$ ontologies and a fixed cost bound, certain answers for instance queries and possible answers for conjunctive queries can be computed using first-order rewriting and thus enjoy the lowest possible data complexity ($\mathsf{TC}_0$).
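The "in a nutshell" description above can be made concrete with a heavily simplified toy. The sketch assumes a small, hand-listed set of candidate interpretations, each annotated with the weighted items it violates and the facts it satisfies; the real semantics ranges over all interpretations and may count axiom violations more finely. The knowledge base, the weights, and every function name here are invented for illustration.

```python
def cost(weights, violated):
    """Cost of an interpretation: sum of the weights of what it violates."""
    return sum(weights[item] for item in violated)


def answers_within_bound(weights, interpretations, bound):
    """Certain / possible facts over interpretations with cost <= bound.

    Toy convention: if no interpretation fits the bound, return empty sets.
    """
    admitted = [i for i in interpretations if cost(weights, i["violates"]) <= bound]
    if not admitted:
        return set(), set()
    satisfied = [set(i["satisfies"]) for i in admitted]
    certain = set.intersection(*satisfied)   # holds in ALL admitted interpretations
    possible = set.union(*satisfied)         # holds in SOME admitted interpretation
    return certain, possible


# Hypothetical inconsistent knowledge base with weights.
weights = {
    "Penguin(tweety)": 3,
    "Flies(tweety)": 1,
    "Penguin => not Flies": 5,
}

# Each candidate repairs the inconsistency by violating one item.
interpretations = [
    {"violates": {"Flies(tweety)"}, "satisfies": {"Penguin(tweety)"}},
    {"violates": {"Penguin(tweety)"}, "satisfies": {"Flies(tweety)"}},
    {"violates": {"Penguin => not Flies"},
     "satisfies": {"Penguin(tweety)", "Flies(tweety)"}},
]

certain_1, possible_1 = answers_within_bound(weights, interpretations, bound=1)
certain_3, possible_3 = answers_within_bound(weights, interpretations, bound=3)
```

With bound 1 only the cheapest interpretation is admitted, so `Penguin(tweety)` is a certain answer; raising the bound to 3 admits a second interpretation, after which nothing is certain but both facts are possible.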

[351] Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction

Jun Xu, Xinkai Du, Yu Ao, Peilong Zhao, Yang Li, Ling Zhong, Lin Yuan, Zhongpu Bo, Xiaorui Wang, Mengshu Sun, Zhengke Gui, Dalong Zhang, Zhaoyang Wang, Qiwei Wang, Yangyang Hou, Zhiying Yin, Haofen Wang, Huajun Chen, Lei Liang, Jun Zhou

Main category: cs.AI

TL;DR: Thinker is a hierarchical thinking model that enables LLMs to perform deep search through multi-turn interactions, decomposing complex problems into sub-problems with dual natural language and logical function representations for improved reasoning coherence.

Details

Motivation: Previous approaches using end-to-end reinforcement learning for training LLMs with external retrievers lack supervision over the reasoning process, making it difficult to ensure logical coherence and rigor in problem-solving.

Method: Thinker decomposes complex problems into independently solvable sub-problems with dual representations (natural language and logical functions), performs knowledge boundary determination to avoid unnecessary searches, and passes dependencies between sub-problems via logical functions.

Result: With only several hundred training samples, Thinker performs competitively with established baselines. When scaled to full training sets, it significantly outperforms other methods across various datasets and model sizes.

Conclusion: Thinker provides a supervisable and verifiable reasoning process through hierarchical thinking and multi-turn interactions, effectively enhancing LLMs’ ability to leverage external knowledge while maintaining logical coherence.

Abstract: Efficient retrieval of external knowledge bases and web pages is crucial for enhancing the reasoning abilities of LLMs. Previous works on training LLMs to leverage external retrievers for solving complex problems have predominantly employed end-to-end reinforcement learning. However, these approaches neglect supervision over the reasoning process, making it difficult to guarantee logical coherence and rigor. To address these limitations, we propose Thinker, a hierarchical thinking model for deep search through multi-turn interaction, making the reasoning process supervisable and verifiable. It decomposes complex problems into independently solvable sub-problems, each dually represented in both natural language and an equivalent logical function to support knowledge base and web searches. Concurrently, dependencies between sub-problems are passed as parameters via these logical functions, enhancing the logical coherence of the problem-solving process. To avoid unnecessary external searches, we perform knowledge boundary determination to check if a sub-problem is within the LLM’s intrinsic knowledge, allowing it to answer directly. Experimental results indicate that with as few as several hundred training samples, the performance of Thinker is competitive with established baselines. Furthermore, when scaled to the full training set, Thinker significantly outperforms these methods across various datasets and model sizes. The source code is available at https://github.com/OpenSPG/KAG-Thinker.
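The decomposition described above (sub-problems with a natural-language form and a logical-function form, dependencies passed as parameters, and a knowledge boundary check before searching) can be sketched as follows. This is an illustrative toy and not the released KAG-Thinker code: the `SubProblem` class, the `#o1` reference syntax, and the two lookup tables standing in for the model's intrinsic knowledge and for external search are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SubProblem:
    name: str          # output variable, e.g. "o1"
    question: str      # natural-language form
    function: str      # logical-function name
    args: List[str]    # literals, or references to earlier outputs like "#o1"


def solve(plan, intrinsic, search):
    """Solve sub-problems in order, searching only outside intrinsic knowledge."""
    results: Dict[str, str] = {}
    log: List[Tuple[str, str]] = []
    for sp in plan:
        # Dependencies are passed as parameters: "#o1" -> value of o1.
        args = tuple(results[a[1:]] if a.startswith("#") else a for a in sp.args)
        key = (sp.function, args)
        if key in intrinsic:                 # knowledge boundary check
            results[sp.name] = intrinsic[key]
            log.append((sp.name, "direct"))
        else:                                # fall back to external search
            results[sp.name] = search(sp.function, args)
            log.append((sp.name, "search"))
    return results, log


# Stand-ins for what the model already knows and what a retriever returns.
INTRINSIC = {("capital_of", ("France",)): "Paris"}
EXTERNAL = {("river_through", ("Paris",)): "Seine"}

plan = [
    SubProblem("o1", "What is the capital of France?", "capital_of", ["France"]),
    SubProblem("o2", "Which river flows through that city?", "river_through", ["#o1"]),
]

results, log = solve(plan, INTRINSIC, lambda f, a: EXTERNAL[(f, a)])
```

Here the first sub-problem is answered directly and only the second triggers a search, with the output of `o1` substituted into its logical function.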

[352] An Efficient Training Pipeline for Reasoning Graphical User Interface Agents

Georgios Pantazopoulos, Eda B. Özyiğit

Main category: cs.AI

TL;DR: Efficient visual grounding training pipeline using filtered synthetic data and parameter-efficient fine-tuning achieves state-of-the-art performance with smaller models.

Details

Motivation: Existing visual grounding methods rely on massive noisy synthetic datasets, which is inefficient. Need for compact yet capable multimodal reasoning agents.

Method: Curated 12K clean instances from 4.8M synthetic examples via model-based filtering, then trained 3B-parameter VLM with supervised fine-tuning, chain-of-thought fine-tuning, and reinforcement learning via Group Relative Policy Optimization.

Result: Models match or surpass larger baselines on ScreenSpot, Multimodal-Mind2Web, and AndroidControl benchmarks.

Conclusion: Principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.

Abstract: Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic datasets. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated by first identifying challenging cases, then removing misaligned ones, and finally selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baselines on benchmarks such as ScreenSpot, Multimodal-Mind2Web, and AndroidControl. These results demonstrate that principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.
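For readers unfamiliar with Group Relative Policy Optimization, its core step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that normalisation on a grounding-style reward. The reward rule (1 if a predicted click lands inside the target box, else 0), the coordinates, and the function names are assumptions made for illustration and are not taken from the paper; implementations also differ on whether they use the population or sample standard deviation.

```python
from statistics import mean, pstdev


def point_in_box(point, box):
    """True if (x, y) lies inside the (left, top, right, bottom) box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a design choice in this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled click predictions (hypothetical values).
target_box = (100, 200, 180, 240)
clicks = [(120, 210), (300, 50), (150, 239), (90, 220)]

rewards = [1.0 if point_in_box(c, target_box) else 0.0 for c in clicks]
advantages = group_relative_advantages(rewards)
```

Responses that hit the target receive a positive advantage and the misses a negative one; if every response in a group earned the same reward, all advantages would be zero and the group would contribute no learning signal.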

[353] Proceedings of the Second International Workshop on Next-Generation Language Models for Knowledge Representation and Reasoning (NeLaMKRR 2025)

Ha-Thanh Nguyen, Ken Satoh, Francesca Toni, Randy Goebel, Kostas Stathis

Main category: cs.AI

TL;DR: This workshop explores reconciling reasoning between transformer-based language models and logic-based representations, analyzing their reasoning abilities, injecting KR-style reasoning, and formalizing how language models reason.

Details

Motivation: To bridge the gap between traditional logic-based AI reasoning and emerging transformer-based language models, creating a platform for interdisciplinary research on reasoning capabilities in AI systems.

Method: Creating a collaborative platform for researchers to analyze language model reasoning abilities alongside KR methods, inject KR-style reasoning into language models through neuro-symbolic approaches, and formalize language model reasoning processes.

Result: The workshop aims to uncover how language models can effectively integrate knowledge and reasoning to improve precision and reliability in applications where these are critical requirements.

Conclusion: By reconciling transformer-based language models with logic-based representations, the research seeks to enhance reasoning capabilities in AI systems for more reliable and precise applications.

Abstract: Reasoning is an essential component of human intelligence in that it plays a fundamental role in our ability to think critically, support responsible decisions, and solve challenging problems. Traditionally, AI has addressed reasoning in the context of logic-based representations of knowledge. However, the recent leap forward in natural language processing, with the emergence of language models based on transformers, is hinting at the possibility that these models exhibit reasoning abilities, particularly as they grow in size and are trained on more and more data. Still, despite ongoing discussions about what reasoning is in language models, it is not easy to articulate to what extent these models are actually capable of reasoning. The goal of this workshop is to create a platform for researchers from different disciplines and/or AI perspectives to explore approaches and techniques with the aim to reconcile reasoning between language models using transformers and logic-based representations. The specific objectives include analysing the reasoning abilities of language models measured alongside KR methods, injecting KR-style reasoning abilities into language models (including by neuro-symbolic means), and formalising the kind of reasoning language models carry out. This exploration aims to uncover how language models can effectively integrate and leverage knowledge and reason with it, thus improving their application and utility in areas where precision and reliability are key requirements.

[354] OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive

Xuan Shen, Brian Wingenroth, Zichao Wang, Jason Kuen, Wanrong Zhu, Ruiyi Zhang, Yiwei Wang, Lichun Ma, Anqi Liu, Hongfu Liu, Tong Sun, Kevin S. Hawkins, Kate Tasker, G. Caleb Alexander, Jiuxiang Gu

Main category: cs.AI

TL;DR: This paper presents a multimodal AI system for analyzing opioid crisis documents from the UCSF-JHU Opioid Industry Documents Archive, creating a benchmark dataset with 400k training and 10k testing documents, and developing domain-specific LLMs for improved document analysis and QA tasks.

Details

Motivation: The opioid crisis reveals systemic failures across multiple domains, requiring advanced analysis of complex healthcare-related legal and corporate documents from the OIDA archive, which demands specialized multimodal approaches due to their complexity and diverse data types.

Method: Organized dataset by document attributes, extracted multimodal information (text, visuals, layout), generated 360k training and 10k testing QA pairs, developed domain-specific multimodal LLMs, incorporated historical QA pairs as context, and added page references with importance-based page classification.

Result: Preliminary results show improvements in document information extraction and question-answering tasks, with the dataset publicly available on Hugging Face.

Conclusion: The developed multimodal AI system and dataset provide an effective framework for analyzing complex opioid crisis documents, enhancing precision and relevance in information extraction and QA tasks through domain-specific models and contextual grounding.

Abstract: The opioid crisis represents a significant moment in public health that reveals systemic shortcomings across regulatory systems, healthcare practices, corporate governance, and public policy. Analyzing how these interconnected systems simultaneously failed to protect public health requires innovative analytic approaches for exploring the vast amounts of data and documents disclosed in the UCSF-JHU Opioid Industry Documents Archive (OIDA). The complexity, multimodal nature, and specialized characteristics of these healthcare-related legal and corporate documents necessitate more advanced methods and models tailored to specific data types and detailed annotations, ensuring precision and professionalism in the analysis. In this paper, we tackle this challenge by organizing the original dataset according to document attributes and constructing a benchmark with 400k training documents and 10k for testing. From each document, we extract rich multimodal information (including textual content, visual elements, and layout structures) to capture a comprehensive range of features. Using multiple AI models, we then generate a large-scale dataset comprising 360k training QA pairs and 10k testing QA pairs. Building on this foundation, we develop domain-specific multimodal Large Language Models (LLMs) and explore the impact of multimodal inputs on task performance. To further enhance response accuracy, we incorporate historical QA pairs as contextual grounding for answering current queries. Additionally, we incorporate page references within the answers and introduce an importance-based page classifier, further improving the precision and relevance of the information provided. Preliminary results indicate that our AI assistant improves document information extraction and question-answering tasks. The dataset is available at: https://huggingface.co/datasets/opioidarchive/oida-qa

[355] Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning

Yuxuan Zhou, Yubin Wang, Bin Wang, Chen Ning, Xien Liu, Ji Wu, Jianye Hao

Main category: cs.AI

TL;DR: MuSeR is a self-refinement method that enhances LLMs’ medical context-awareness through attribute-conditioned query generation, self-evaluation across decision-making, communication, and safety facets, and supervised fine-tuning, achieving SOTA performance on HealthBench.

Details

Motivation: LLMs underperform in real-world medical scenarios due to a lack of context-awareness, i.e., the ability to recognize missing or critical details such as user identity, medical history, and risk factors, and to provide safe, contextually appropriate responses.

Method: Propose Multifaceted Self-Refinement (MuSeR): 1) Attribute-conditioned query generator simulates diverse user contexts, 2) LLM responds, self-evaluates along three facets (decision-making, communication, safety), 3) Refines responses, 4) Uses queries and refined responses for supervised fine-tuning.

Result: Significantly improves LLM performance on HealthBench, with notable gains in context-awareness. With knowledge distillation, Qwen3-32B surpasses its teacher model, achieving a new SOTA among open-source LLMs on HealthBench (63.8% overall, 43.1% on the hard subset).

Conclusion: MuSeR effectively enhances LLMs’ medical context-awareness through self-evaluation and refinement across key facets, demonstrating strong performance improvements and enabling smaller models to surpass larger ones through knowledge distillation.

Abstract: Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness, i.e., the ability to recognize missing or critical details (e.g., user identity, medical history, risk factors) and provide safe, helpful, and contextually appropriate responses. To address this issue, we propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs’ context-awareness along three key facets (decision-making, communication, and safety) through self-evaluation and refinement. Specifically, we first design an attribute-conditioned query generator that simulates diverse real-world user contexts by varying attributes such as role, geographic region, intent, and degree of information ambiguity. An LLM then responds to these queries, self-evaluates its answers along three key facets, and refines its responses to better align with the requirements of each facet. Finally, the queries and refined responses are used for supervised fine-tuning to reinforce the model’s context-awareness ability. Evaluation results on the latest HealthBench dataset demonstrate that our method significantly improves LLM performance across multiple aspects, with particularly notable gains in the context-awareness axis. Furthermore, by incorporating knowledge distillation with the proposed method, the performance of a smaller backbone LLM (e.g., Qwen3-32B) surpasses its teacher model, achieving a new SOTA across all open-source LLMs on HealthBench (63.8%) and its hard subset (43.1%). Code and dataset will be released at https://muser-llm.github.io.
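
Illustrative sketch (not the authors' code): the four MuSeR steps described above can be read as a small data-generation loop. The prompt wording, the `build_sft_pairs` and `echo_llm` names, and the choice to critique and revise one facet at a time are all assumptions made for illustration; the abstract states only that the model self-evaluates along the three facets and refines its response.

```python
from typing import Callable, Dict, List

# Facets named in the abstract.
FACETS = ("decision-making", "communication", "safety")


def build_sft_pairs(attribute_sets: List[Dict[str, str]],
                    llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Produce (query, refined response) pairs for supervised fine-tuning."""
    pairs = []
    for attrs in attribute_sets:
        # 1) Attribute-conditioned query generation (prompt text is hypothetical).
        described = ", ".join(f"{k}={v}" for k, v in sorted(attrs.items()))
        query = llm(f"Write a health question from a user with: {described}")
        # 2) Initial response to the generated query.
        response = llm(query)
        # 3) Self-evaluate and refine; the per-facet ordering is an assumption.
        for facet in FACETS:
            critique = llm(f"Critique this answer for {facet}.\n"
                           f"Q: {query}\nA: {response}")
            response = llm(f"Revise the answer using the critique.\n"
                           f"Q: {query}\nA: {response}\nCritique: {critique}")
        # 4) Keep the query and the refined response as one training example.
        pairs.append({"query": query, "response": response})
    return pairs


def echo_llm(prompt: str) -> str:
    """Stand-in model so the sketch runs offline; swap in a real client."""
    return f"<reply to a {len(prompt)}-character prompt>"


if __name__ == "__main__":
    demo = build_sft_pairs(
        [{"role": "caregiver", "region": "rural", "ambiguity": "high"}],
        echo_llm,
    )
    print(len(demo), sorted(demo[0]))
```

Under these assumptions each attribute set costs eight model calls (one query, one answer, and a critique plus a revision per facet), which is the main cost to budget for when scaling the generator.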

[356] Intelligence Foundation Model: A New Perspective to Approach Artificial General Intelligence

Borui Cai, Yao Zhao

Main category: cs.AI

TL;DR: Proposes Intelligence Foundation Model (IFM) as a new approach to AGI that learns from diverse intelligent behaviors rather than specific domains, using biologically-inspired state neural networks and neuron output prediction.

Details

Motivation: Existing foundation models specialize in pattern learning within specific domains (language, vision, etc.), but AGI requires understanding the underlying mechanisms of intelligence across all cognitive abilities.

Method: Two core components: (1) State neural network that captures neuron-like dynamic processes for temporal information processing, and (2) Neuron output prediction objective that trains the system to predict neuronal outputs from collective dynamics.

Result: Establishes a biologically grounded and computationally scalable foundation for building systems capable of generalization, reasoning, and adaptive learning across domains.

Conclusion: This approach represents a step toward true AGI by learning general principles of intelligence directly from diverse intelligent behaviors rather than domain-specific patterns.

Abstract: We propose a new perspective for approaching artificial general intelligence (AGI) through an intelligence foundation model (IFM). Unlike existing foundation models (FMs), which specialize in pattern learning within specific domains such as language, vision, or time series, IFM aims to acquire the underlying mechanisms of intelligence by learning directly from diverse intelligent behaviors. Vision, language, and other cognitive abilities are manifestations of intelligent behavior; learning from this broad range of behaviors enables the system to internalize the general principles of intelligence. Based on the fact that intelligent behaviors emerge from the collective dynamics of biological neural systems, IFM consists of two core components: a novel network architecture, termed the state neural network, which captures neuron-like dynamic processes, and a new learning objective, neuron output prediction, which trains the system to predict neuronal outputs from collective dynamics. The state neural network emulates the temporal dynamics of biological neurons, allowing the system to store, integrate, and process information over time, while the neuron output prediction objective provides a unified computational principle for learning these structural dynamics from intelligent behaviors. Together, these innovations establish a biologically grounded and computationally scalable foundation for building systems capable of generalization, reasoning, and adaptive learning across domains, representing a step toward true AGI.

[357] Strategic Opponent Modeling with Graph Neural Networks, Deep Reinforcement Learning and Probabilistic Topic Modeling

Georgios Chalkiadakis, Charilaos Akasiadis, Gerasimos Koresis, Stergios Plataniotis, Leonidas Bakopoulos

Main category: cs.AI

TL;DR: This paper reviews Graph Neural Networks, Deep Reinforcement Learning, and Probabilistic Topic Modeling for strategic multiagent settings, focusing on opponent modeling and integration with game theory while addressing uncertainty, heterogeneity, and scalability challenges.

Details

Motivation: To address the limitations of traditional game theory assumptions (Common Prior Assumption and Self-Interest Hypothesis) in real-world scenarios by incorporating machine learning methods that can handle uncertainty and heterogeneity in multiagent strategic settings.

Method: Comprehensive review and analysis of three main approaches: Graph Neural Networks for modeling relationships and interactions, Multiagent Deep Reinforcement Learning for strategic decision-making, and Probabilistic Topic Modeling for estimating unknown distributions and handling agent beliefs.

Result: Identifies the potential of GNNs for node classification and link prediction in graph-structured multiagent environments, MADRL for strategic learning, and PTM for handling heterogeneity and unknown beliefs beyond traditional document analysis applications.

Conclusion: The integration of these machine learning methods with game theory concepts provides promising approaches for strategic multiagent modeling, but open challenges remain including non-stationary environments, stability-adaptation balance, uncertainty handling, heterogeneity management, and scalability guarantees.

Abstract: This paper provides a comprehensive review, mainly of Graph Neural Networks, Deep Reinforcement Learning, and Probabilistic Topic Modeling methods, with a focus on their potential incorporation in strategic multiagent settings. We draw interest in (i) Machine Learning methods currently utilized for uncovering unknown model structures adaptable to the task of strategic opponent modeling, and (ii) the integration of these methods with Game Theoretic concepts that avoid relying on assumptions often invalid in real-world scenarios, such as the Common Prior Assumption (CPA) and the Self-Interest Hypothesis (SIH). We analyze the ability to handle uncertainty and heterogeneity, two characteristics that are very common in real-world application cases, as well as scalability. As a potential answer to effectively modeling relationships and interactions in multiagent settings, we champion the use of Graph Neural Networks (GNN). Such approaches are designed to operate upon graph-structured data, and have been shown to be a very powerful tool for performing tasks such as node classification and link prediction. Next, we review the domain of Reinforcement Learning (RL), and in particular that of Multiagent Deep Reinforcement Learning (MADRL). We then describe existing relevant game theoretic solution concepts and consider properties such as fairness and stability. Our review comes complete with a note on the literature that utilizes Probabilistic Topic Modeling (PTM) in domains other than that of document analysis and classification. The capability of PTM to estimate unknown underlying distributions can help with tackling heterogeneity and unknown agent beliefs. Finally, we identify certain open challenges, specifically the need to (i) fit non-stationary environments, (ii) balance the degrees of stability and adaptation, (iii) tackle uncertainty and heterogeneity, and (iv) guarantee scalability and solution tractability.

[358] Bi-Level Contextual Bandits for Individualized Resource Allocation under Delayed Feedback

Mohammadsina Almasi, Hadis Anahideh

Main category: cs.AI

TL;DR: A bi-level contextual bandit framework for equitable resource allocation under delayed feedback, balancing short-term utility with long-term impact while accounting for fairness constraints and temporal dynamics.

Details

Motivation: Existing allocation frameworks fail to address delayed outcomes, hidden heterogeneity, and ethical constraints in high-stakes domains like education, employment, and healthcare.

Method: Two-level approach: meta-level optimizes subgroup budget allocations with fairness constraints; base-level identifies responsive individuals using neural networks trained on observational data, modeling delayed effects via resource-specific delay kernels.

Result: Validated on education and workforce development datasets, achieving higher cumulative outcomes, better adaptation to delay structures, and equitable distribution across subgroups.

Conclusion: Delay-aware, data-driven decision-making systems can significantly improve institutional policy and social welfare in resource allocation contexts.

Abstract: Equitably allocating limited resources in high-stakes domains such as education, employment, and healthcare requires balancing short-term utility with long-term impact, while accounting for delayed outcomes, hidden heterogeneity, and ethical constraints. However, most learning-based allocation frameworks either assume immediate feedback or ignore the complex interplay between individual characteristics and intervention dynamics. We propose a novel bi-level contextual bandit framework for individualized resource allocation under delayed feedback, designed to operate in real-world settings with dynamic populations, capacity constraints, and time-sensitive impact. At the meta level, the model optimizes subgroup-level budget allocations to satisfy fairness and operational constraints. At the base level, it identifies the most responsive individuals within each group using a neural network trained on observational data, while respecting cooldown windows and delayed treatment effects modeled via resource-specific delay kernels. By explicitly modeling temporal dynamics and feedback delays, the algorithm continually refines its policy as new data arrive, enabling more responsive and adaptive decision-making. We validate our approach on two real-world datasets from education and workforce development, showing that it achieves higher cumulative outcomes, better adapts to delay structures, and ensures equitable distribution across subgroups. Our results highlight the potential of delay-aware, data-driven decision-making systems to improve institutional policy and social welfare.
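
Illustrative sketch (not from the paper): one simple way to picture a resource-specific delay kernel is as a list of weights that spreads an allocation's effect over later time steps. The discrete-weight form, the `realized_outcomes` name, and the resource names below are assumptions; the abstract does not specify how the kernels are parameterized.

```python
from typing import Dict, List, Tuple


def realized_outcomes(allocations: List[Tuple[int, str, float]],
                      kernels: Dict[str, List[float]],
                      horizon: int) -> List[float]:
    """Spread each allocation's effect over time using its resource's kernel.

    allocations: (time step, resource name, total effect size) triples.
    kernels: per-resource weights; kernels[r][lag] is the share of the
             effect observed `lag` steps after the allocation.
    """
    observed = [0.0] * horizon
    for t_alloc, resource, effect in allocations:
        for lag, weight in enumerate(kernels[resource]):
            t = t_alloc + lag
            if t < horizon:  # effects past the horizon are not yet observed
                observed[t] += effect * weight
    return observed


if __name__ == "__main__":
    # Hypothetical resources: tutoring pays off after one to two steps,
    # a grant pays off immediately.
    kernels = {"tutoring": [0.0, 0.5, 0.5], "grant": [1.0]}
    allocations = [(0, "tutoring", 2.0), (1, "grant", 1.0)]
    print(realized_outcomes(allocations, kernels, horizon=4))
    # prints [0.0, 2.0, 1.0, 0.0]
```

The example shows why immediate-feedback bandits can misattribute reward: the outcome observed at step 1 mixes the delayed tutoring effect with the grant's immediate effect.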

cs.SD

[359] Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio

Guangke Chen, Yuhui Wang, Shouling Ji, Xiapu Luo, Ting Wang

Main category: cs.SD

TL;DR: HARMGEN attacks exploit TTS systems to generate harmful speech content through semantic obfuscation and audio-modality exploits, bypassing safety filters and increasing toxicity in generated speech.

Details

Motivation: To explore content-centric threats in TTS systems beyond speaker impersonation, addressing challenges of LALM safety alignment and input/output filters that block harmful content.

Method: Developed five attacks in two families: semantic obfuscation (Concat, Shuffle) that hides harmful content in text, and audio-modality exploits (Read, Spell, Phoneme) that inject harmful content through audio channels while keeping text prompts benign.

Result: Attacks substantially reduced refusal rates and increased toxicity across five commercial LALM-based TTS systems and three datasets in two languages, with proactive moderation detecting only 57-93% of attacks.

Conclusion: Reveals critical vulnerabilities in TTS systems, highlighting the need for robust cross-modal safeguards throughout training and deployment to address content-centric misuse vectors.

Abstract: Modern text-to-speech (TTS) systems, particularly those built on Large Audio-Language Models (LALMs), generate high-fidelity speech that faithfully reproduces input text and mimics specified speaker identities. While prior misuse studies have focused on speaker impersonation, this work explores a distinct content-centric threat: exploiting TTS systems to produce speech containing harmful content. Realizing such threats poses two core challenges: (1) LALM safety alignment frequently rejects harmful prompts, yet existing jailbreak attacks are ill-suited for TTS because these systems are designed to faithfully vocalize any input text, and (2) real-world deployment pipelines often employ input/output filters that block harmful text and audio. We present HARMGEN, a suite of five attacks organized into two families that address these challenges. The first family employs semantic obfuscation techniques (Concat, Shuffle) that conceal harmful content within text. The second leverages audio-modality exploits (Read, Spell, Phoneme) that inject harmful content through auxiliary audio channels while maintaining benign textual prompts. Through evaluation across five commercial LALM-based TTS systems and three datasets spanning two languages, we demonstrate that our attacks substantially reduce refusal rates and increase the toxicity of generated speech. We further assess both reactive countermeasures deployed by audio-streaming platforms and proactive defenses implemented by TTS providers. Our analysis reveals critical vulnerabilities: deepfake detectors underperform on high-fidelity audio; reactive moderation can be circumvented by adversarial perturbations; and proactive moderation detects 57-93% of attacks. Our work highlights a previously underexplored content-centric misuse vector for TTS and underscores the need for robust cross-modal safeguards throughout training and deployment.

[360] StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak

Hongyi Li, Chengxuan Zhou, Chu Wang, Sicheng Liang, Yanting Chen, Qinlin Xie, Jiawei Ye, Jie Wu

Main category: cs.SD

TL;DR: StyleBreak is a novel framework that exploits human speech style variations to jailbreak Large Audio-language Models (LAMs), revealing critical vulnerabilities in their alignment.

Details

Motivation: To address the underexplored security of LAMs under adversarial attacks, particularly through audio jailbreaks that bypass alignment by leveraging human speech's expressive variations.

Method: A two-stage style-aware transformation pipeline that perturbs both textual content and audio to control linguistic, paralinguistic, and extralinguistic attributes, plus a query-adaptive policy network for efficient adversarial style search.

Result: Extensive evaluations show LAMs have critical vulnerabilities to diverse human speech attributes, and StyleBreak achieves substantial improvements in attack effectiveness and efficiency across multiple attack paradigms.

Conclusion: The research highlights the urgent need for more robust alignment in LAMs to defend against style-aware audio jailbreak attacks.

Abstract: Large Audio-language Models (LAMs) have recently enabled powerful speech-based interactions by coupling audio encoders with Large Language Models (LLMs). However, the security of LAMs under adversarial attacks remains underexplored, especially through audio jailbreaks that craft malicious audio prompts to bypass alignment. Existing efforts primarily rely on converting text-based attacks into speech or applying shallow signal-level perturbations, overlooking the impact of human speech’s expressive variations on LAM alignment robustness. To address this gap, we propose StyleBreak, a novel style-aware audio jailbreak framework that systematically investigates how diverse human speech attributes affect LAM alignment robustness. Specifically, StyleBreak employs a two-stage style-aware transformation pipeline that perturbs both textual content and audio to control linguistic, paralinguistic, and extralinguistic attributes. Furthermore, we develop a query-adaptive policy network that automatically searches for adversarial styles to enhance the efficiency of LAM jailbreak exploration. Extensive evaluations demonstrate that LAMs exhibit critical vulnerabilities when exposed to diverse human speech attributes. Moreover, StyleBreak achieves substantial improvements in attack effectiveness and efficiency across multiple attack paradigms, highlighting the urgent need for more robust alignment in LAMs.

[361] Graph Neural Field with Spatial-Correlation Augmentation for HRTF Personalization

De Hu, Junsheng Hu, Cuicui Jiang

Main category: cs.SD

TL;DR: GraphNF-SCA uses graph neural networks with spatial-correlation augmentation to generate personalized HRTFs for VR/AR spatial audio, achieving state-of-the-art performance.

Details

Motivation: High-quality HRTFs are essential for immersive spatial audio in VR/AR, but measuring them is time-consuming and subject-dependent. Existing methods lack spatial correlation modeling.

Method: Three components: an HRTF personalization (HRTF-P) module built on a GNN encoder-decoder, an HRTF upsampling (HRTF-U) module that uses a second GNN to model spatial correlations, and a fine-tuning stage in which HRTF-U is tuned on HRTF-P outputs to enhance spatial consistency.

Result: Experimental results demonstrate state-of-the-art performance in HRTF personalization.

Conclusion: GraphNF-SCA effectively leverages spatial correlations across HRTFs to enhance personalization performance, outperforming existing position-by-position estimation methods.

Abstract: To achieve immersive spatial audio rendering on VR/AR devices, high-quality Head-Related Transfer Functions (HRTFs) are essential. In general, HRTFs are subject-dependent and position-dependent, and their measurement is time-consuming and tedious. To address this challenge, we propose the Graph Neural Field with Spatial-Correlation Augmentation (GraphNF-SCA) for HRTF personalization, which can be used to generate individual HRTFs for unseen subjects. The GraphNF-SCA consists of three key components: an HRTF personalization (HRTF-P) module, an HRTF upsampling (HRTF-U) module, and a fine-tuning stage. In the HRTF-P module, we predict HRTFs of the target subject via the Graph Neural Network (GNN) with an encoder-decoder architecture, where the encoder extracts universal features and the decoder incorporates the target-relevant features and produces individualized HRTFs. The HRTF-U module employs another GNN to model spatial correlations across HRTFs. This module is fine-tuned using the output of the HRTF-P module, thereby enhancing the spatial consistency of the predicted HRTFs. Unlike existing methods that estimate individual HRTFs position-by-position without spatial correlation modeling, the GraphNF-SCA effectively leverages inherent spatial correlations across HRTFs to enhance the performance of HRTF personalization. Experimental results demonstrate that the GraphNF-SCA achieves state-of-the-art results.

[362] CAT-Net: A Cross-Attention Tone Network for Cross-Subject EEG-EMG Fusion Tone Decoding

Yifan Zhuang, Calvin Huang, Zepeng Yu, Yongjie Zou, Jiawei Ju

Main category: cs.SD

TL;DR: A novel cross-subject multimodal BCI framework that fuses EEG and EMG signals achieves high accuracy in Mandarin tone classification for both audible and silent speech using minimal channels.

Details

Motivation: To enhance BCI speech decoding for individuals with speech impairments by integrating EEG and EMG signals, particularly addressing the challenge of Mandarin tone classification where tonal variations convey distinct meanings.

Method: Proposes a cross-subject multimodal BCI decoding framework combining spatial-temporal feature extraction branches with cross-attention fusion mechanism and domain-adversarial training for improved generalization. Uses only 20 EEG and 5 EMG channels.

Result: Achieved average classification accuracies of 87.83% for audible speech and 88.08% for silent speech. In cross-subject evaluations, maintained strong performance with 83.27% and 85.10% accuracies respectively.

Conclusion: Tone-level decoding with minimal EEG-EMG channels is feasible and potentially generalizable across subjects, contributing to practical BCI applications for speech-impaired individuals.

Abstract: Brain-computer interface (BCI) speech decoding has emerged as a promising tool for assisting individuals with speech impairments. In this context, the integration of electroencephalography (EEG) and electromyography (EMG) signals offers strong potential for enhancing decoding performance. Mandarin tone classification presents particular challenges, as tonal variations convey distinct meanings even when phonemes remain identical. In this study, we propose a novel cross-subject multimodal BCI decoding framework that fuses EEG and EMG signals to classify four Mandarin tones under both audible and silent speech conditions. Inspired by the cooperative mechanisms of neural and muscular systems in speech production, our neural decoding architecture combines spatial-temporal feature extraction branches with a cross-attention fusion mechanism, enabling informative interaction between modalities. We further incorporate domain-adversarial training to improve cross-subject generalization. We collected 4,800 EEG trials and 4,800 EMG trials from 10 participants using only twenty EEG and five EMG channels, demonstrating the feasibility of minimal-channel decoding. Despite employing lightweight modules, our model outperforms state-of-the-art baselines across all conditions, achieving average classification accuracies of 87.83% for audible speech and 88.08% for silent speech. In cross-subject evaluations, it still maintains strong performance with accuracies of 83.27% and 85.10% for audible and silent speech, respectively. We further conduct ablation studies to validate the effectiveness of each component. Our findings suggest that tone-level decoding with minimal EEG-EMG channels is feasible and potentially generalizable across subjects, contributing to the development of practical BCI applications.
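
Illustrative sketch (not the CAT-Net implementation): the cross-attention fusion mentioned above can be pictured as standard scaled dot-product attention in which features from one modality act as queries over the other. Learned projection matrices are omitted, and treating EEG features as queries and EMG features as keys and values is an assumption made only for this example.

```python
import math
from typing import List

Matrix = List[List[float]]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention of `queries` over (`keys`, `values`)."""
    d = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused


if __name__ == "__main__":
    eeg = [[1.0, 0.0], [0.0, 1.0]]               # hypothetical EEG time steps
    emg = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical EMG time steps
    for row in cross_attention(eeg, emg, emg):
        print([round(x, 3) for x in row])
```

The output has one fused vector per query step, so each EEG time step ends up as a mixture of the EMG steps most similar to it.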

[363] DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition

HongYu Liu, Junxin Li, Changxi Guo, Hao Chen, Yaqian Huang, Yifu Guo, Huan Yang, Lihua Cai

Main category: cs.SD

TL;DR: DialogGraph-LLM is an end-to-end framework for speaker intent recognition in long audio dialogues that combines a Multi-Relational Dialogue Attention Network with multimodal foundation models and adaptive semi-supervised learning.

Details

Motivation: Recognizing speaker intent in long audio dialogues is challenging due to complex inter-dependencies in speaker utterances and scarce annotated data, despite its wide applications.

Method: Proposes DialogGraph-LLM framework with MR-DAN architecture and multimodal foundation models for acoustic-to-intent inference, plus adaptive semi-supervised learning with confidence-aware pseudo-label generation and entropy-based sample selection.

Result: Extensive evaluations on MarketCalls corpus and MIntRec 2.0 benchmark show superiority over audio and text-driven baselines, demonstrating strong performance and efficiency in real-world audio dialogues.

Conclusion: DialogGraph-LLM proves practical value for audio-rich domains with limited supervision, effectively addressing intent recognition challenges in complex dialogue scenarios.

Abstract: Recognizing speaker intent in long multi-speaker audio dialogues has a wide range of applications, but is a non-trivial AI task due to complex inter-dependencies in speaker utterances and scarce annotated data. To address these challenges, an end-to-end framework, namely DialogGraph-LLM, is proposed in the current work. DialogGraph-LLM combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy is designed using an LLM with a confidence-aware pseudo-label generation mechanism based on dual-threshold filtering using both global and class confidences, and an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly available MIntRec 2.0 benchmark demonstrate DialogGraph-LLM’s superiority over strong audio and text-driven baselines. The framework demonstrates strong performance and efficiency in intent recognition in real-world audio dialogues, proving its practical value for audio-rich domains with limited supervision. Our code is available at https://github.com/david188888/DialogGraph-LLM.
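
Illustrative sketch (not the authors' code; their implementation is in the repository linked above): dual-threshold pseudo-label filtering with entropy-based selection could look roughly like the following. The function names, the threshold values, and the decision to rank accepted samples by descending entropy are assumptions; the abstract does not say how the two criteria are combined.

```python
import math
from typing import Dict, List, Tuple


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_pseudo_labels(predictions: List[List[float]],
                         global_tau: float,
                         class_tau: Dict[int, float],
                         k: int) -> List[Tuple[int, int]]:
    """Return up to k (sample index, pseudo-label) pairs.

    A sample is accepted only if its top confidence clears both the
    global threshold and the threshold of its predicted class. Accepted
    samples are then ranked by entropy, highest first (an assumption).
    """
    accepted = []
    for idx, probs in enumerate(predictions):
        label = max(range(len(probs)), key=probs.__getitem__)
        conf = probs[label]
        if conf >= global_tau and conf >= class_tau.get(label, global_tau):
            accepted.append((idx, label, entropy(probs)))
    accepted.sort(key=lambda item: item[2], reverse=True)
    return [(idx, label) for idx, label, _ in accepted[:k]]


if __name__ == "__main__":
    predictions = [
        [0.90, 0.05, 0.05],  # confident class 0: accepted
        [0.70, 0.20, 0.10],  # clears the global but not the class threshold
        [0.10, 0.85, 0.05],  # confident class 1: accepted
        [0.40, 0.30, 0.30],  # fails the global threshold
    ]
    print(select_pseudo_labels(predictions, 0.6, {0: 0.8, 1: 0.8}, k=2))
    # prints [(2, 1), (0, 0)]
```

A per-class threshold lets rare or hard intents use a different acceptance bar than frequent ones, which is one plausible motivation for pairing it with a global cutoff.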

[364] Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation

Xinyi Tong, Yiran Zhu, Jishang Chen, Chunru Zhan, Tianle Wang, Sirui Zhang, Nian Liu, Tiezheng Ge, Duo Xu, Xin Jin, Feng Yu, Song-Chun Zhu

Main category: cs.SD

TL;DR: VeM is a video-to-music generation system that uses latent music diffusion to create semantically, temporally, and rhythmically aligned background music for videos through hierarchical video parsing and beat synchronization mechanisms.

Details

Motivation: Current video-to-music generation approaches suffer from incomplete video representation leading to weak alignment, and inadequate temporal/rhythmic correspondence, especially in beat synchronization.

Method: VeM employs hierarchical video parsing as a music conductor, modality-specific encoders with storyboard-guided cross-attention, position/duration encoding for temporal coherence, and frame-level transition-beat aligner for rhythmic precision.

Result: Experimental results demonstrate superiority over existing methods, particularly in semantic relevance and rhythmic precision.

Conclusion: VeM effectively addresses the limitations of current video-to-music generation by providing comprehensive video detail capture and precise beat synchronization through its hierarchical parsing and alignment mechanisms.

Abstract: Video-to-Music generation seeks to generate musically appropriate background music that enhances audiovisual immersion for videos. However, current approaches suffer from two critical limitations: 1) incomplete representation of video details, leading to weak alignment, and 2) inadequate temporal and rhythmic correspondence, particularly in achieving precise beat synchronization. To address these challenges, we propose Video Echoed in Music (VeM), a latent music diffusion model that generates high-quality soundtracks with semantic, temporal, and rhythmic alignment for input videos. To capture video details comprehensively, VeM employs hierarchical video parsing that acts as a music conductor, orchestrating multi-level information across modalities. Modality-specific encoders, coupled with a storyboard-guided cross-attention mechanism (SG-CAtt), integrate semantic cues while maintaining temporal coherence through position and duration encoding. For rhythmic precision, the frame-level transition-beat aligner and adapter (TB-As) dynamically synchronize visual scene transitions with music beats. We further contribute a novel video-music paired dataset sourced from e-commerce advertisements and video-sharing platforms, which imposes stricter transition-beat synchronization requirements. Meanwhile, we introduce novel metrics tailored to the task. Experimental results demonstrate the superiority of VeM, particularly in semantic relevance and rhythmic precision.

[365] MSMT-FN: Multi-segment Multi-task Fusion Network for Marketing Audio Classification

HongYu Liu, Ruijie Wan, Yueju Han, Junxin Li, Liuxing Lu, Chao He, Lihua Cai

Main category: cs.SD

TL;DR: Proposes MSMT-FN, a novel multi-segment multi-task fusion network for audio classification, particularly for customer purchasing propensity analysis in marketing calls, showing superior performance on proprietary and benchmark datasets.

Details

Motivation: Address the challenge of efficiently categorizing customer purchasing propensity from large volumes of audio data in sentiment analysis and emotion recognition for marketing phone calls.

Method: Multi-Segment Multi-Task Fusion Network (MSMT-FN) designed specifically for business demands in audio classification.

Result: Outperforms or matches state-of-the-art methods on proprietary MarketCalls dataset and established benchmarks (CMU-MOSI, CMU-MOSEI, MELD).

Conclusion: MSMT-FN effectively addresses audio classification challenges in business contexts, with dataset availability and code release to advance further research.

Abstract: Audio classification plays an essential role in sentiment analysis and emotion recognition, especially for analyzing customer attitudes in marketing phone calls. Efficiently categorizing customer purchasing propensity from large volumes of audio data remains challenging. In this work, we propose a novel Multi-Segment Multi-Task Fusion Network (MSMT-FN) that is uniquely designed for addressing this business demand. Evaluations conducted on our proprietary MarketCalls dataset, as well as established benchmarks (CMU-MOSI, CMU-MOSEI, and MELD), show MSMT-FN consistently outperforms or matches state-of-the-art methods. Additionally, our newly curated MarketCalls dataset will be available upon request, and the code base is accessible in the MSMT-FN GitHub repository, to facilitate further research and advancements in the audio classification domain.

[366] TimeAudio: Bridging Temporal Gaps in Large Audio-Language Models

Hualei Wang, Yiming Li, Shuo Ma, Hong Liu, Xiangdong Wang

Main category: cs.SD

TL;DR: TimeAudio enhances Large Audio-Language Models with temporal localization and long audio understanding capabilities through temporal markers, absolute time encoding, and token merging.

Details

Motivation: Current LALMs struggle with timestamp understanding for temporal localization and are limited to short audio, restricting their fine-grained task capabilities.

Method: Incorporates temporal markers for time-sensitive reasoning, absolute time-aware encoding, and segment-level token merging to reduce redundancy and improve efficiency.

Result: Strong performance on fine-grained tasks including dense captioning, temporal grounding, and timeline speech summarization.

Conclusion: TimeAudio demonstrates robust temporal localization and reasoning capabilities, addressing key limitations in current audio-language models.

Abstract: Recent Large Audio-Language Models (LALMs) exhibit impressive capabilities in understanding audio content for conversational QA tasks. However, these models struggle to accurately understand timestamps for temporal localization (e.g., Temporal Audio Grounding) and are restricted to short audio perception, leading to constrained capabilities on fine-grained tasks. We identify three key aspects that limit their temporal localization and long audio understanding: (i) timestamp representation, (ii) architecture, and (iii) data. To address this, we introduce TimeAudio, a novel method that empowers LALMs to connect their understanding of audio content with precise temporal perception. Specifically, we incorporate unique temporal markers to improve time-sensitive reasoning and apply an absolute time-aware encoding that explicitly grounds the acoustic features with absolute time information. Moreover, to achieve end-to-end long audio understanding, we introduce a segment-level token merging module to substantially reduce audio token redundancy and enhance the efficiency of information extraction. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing audio datasets into a new dataset focused on temporal tasks and establish a series of metrics to evaluate the fine-grained performance. Evaluations show strong performance across a variety of fine-grained tasks, such as dense captioning, temporal grounding, and timeline speech summarization, demonstrating TimeAudio’s robust temporal localization and reasoning capabilities.

[367] CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation

Crystal Min Hui Poon, Pai Chet Ng, Xiaoxiao Miao, Immanuel Jun Kai Loh, Bowen Zhang, Haoyu Song, Ian Mcloughlin

Main category: cs.SD

TL;DR: CLARITY is a framework that addresses accent and linguistic biases in TTS systems through contextual linguistic adaptation and retrieval-augmented accent prompting.

Details

Motivation: Current TTS systems exhibit accent bias (defaulting to dominant phonetic patterns) and linguistic bias (ignoring dialect-specific lexical and cultural cues), which are interdependent issues.

Method: Uses dual-signal optimization: (i) contextual linguistic adaptation to localize input text to target dialect, and (ii) retrieval-augmented accent prompting (RAAP) for accent-consistent speech prompts.

Result: Across twelve English accents, CLARITY improves accent accuracy and fairness while maintaining strong perceptual quality.

Conclusion: CLARITY effectively addresses coupled biases in TTS systems through a backbone-agnostic framework that enhances accent authenticity and linguistic localization.

Abstract: Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist: accent bias, where models default to dominant phonetic patterns, and linguistic bias, where dialect-specific lexical and cultural cues are ignored. These biases are interdependent, as authentic accent generation requires both accent fidelity and localized text. We present Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis (CLARITY), a backbone-agnostic framework that addresses these biases through dual-signal optimization: (i) contextual linguistic adaptation that localizes input text to the target dialect, and (ii) retrieval-augmented accent prompting (RAAP) that supplies accent-consistent speech prompts. Across twelve English accents, CLARITY improves accent accuracy and fairness while maintaining strong perceptual quality.

[368] Evaluation of Audio Compression Codecs

Thien T. Duong, Jan P. Springer

Main category: cs.SD

TL;DR: Failed to fetch summary for paper 2511.11527

[369] Golden Tonnetz

Yusuke Imai

Main category: cs.SD

TL;DR: The paper presents a novel geometric representation of music theory using golden triangles to map major/minor scales and chords, introducing “golden Tonnetz” that connects musical transformations with geometric operations.

Details

Motivation: To explore deeper connections between music and the golden ratio beyond existing geometric representations like the chromatic circle and Tonnetz, seeking to represent musical scales and transformations through golden ratio geometry.

Method: Developed an arrangement of 7 tones on a golden triangle that represents major/minor scales and their tonic, dominant, and subdominant chords. Applied this to create “golden Tonnetz” which maps all major/minor scales and triads using golden triangles and gnomons.

Result: Successfully demonstrated that major/minor scales and their fundamental chords can be represented by golden triangles. The golden Tonnetz effectively represents relative, parallel, and leading-tone exchange transformations from Neo-Riemannian theory as transformations among golden triangles and gnomons.

Conclusion: The golden ratio provides a powerful geometric framework for representing musical structures and transformations, with golden Tonnetz offering a novel way to visualize and understand relationships between scales, chords, and musical transformations through golden ratio geometry.

Abstract: For example, in the chromatic circle, the twelve tones are represented by twelve points on a circle, and in Tonnetz, the relationships among harmonies are represented by a triangular lattice. Recently, we have shown that several arrangements of tones on the regular icosahedron can be associated with chromatic scales, whole-tone scales, major tones, and minor tones through the golden ratio. Here, we investigate another type of connection between music and the golden ratio. We show that there exists an arrangement of 7 tones on a golden triangle that can represent a given major/minor scale and its tonic, dominant, and subdominant chords by golden triangles. By applying this finding, we propose “golden Tonnetz”, which represents all the major/minor scales and triads by the golden triangles or gnomons and also represents relative, parallel, and leading-tone exchange transformations in Neo-Riemannian theory by transformations among the golden triangles and gnomons.
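
The three Neo-Riemannian transformations named in the abstract (parallel, relative, leading-tone exchange) can be sketched in plain pitch-class arithmetic. This is only the standard definition of those transformations on major and minor triads; it does not reproduce the golden-triangle geometry of the paper, and the function names and the (root, quality) encoding are assumptions made for illustration.

```python
# Pitch classes: 0 = C, 1 = C#, ..., 11 = B. A triad is encoded as (root, quality).
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def parallel(root, quality):
    """P: keep the root, flip major <-> minor (C major <-> C minor)."""
    return root, ("minor" if quality == "major" else "major")


def relative(root, quality):
    """R: map a triad to its relative (C major <-> A minor)."""
    if quality == "major":
        return (root + 9) % 12, "minor"
    return (root + 3) % 12, "major"


def leading_tone(root, quality):
    """L: leading-tone exchange (C major <-> E minor)."""
    if quality == "major":
        return (root + 4) % 12, "minor"
    return (root + 8) % 12, "major"


def pitch_classes(root, quality):
    """Return the three pitch classes of a major or minor triad."""
    third = 4 if quality == "major" else 3
    return {root % 12, (root + third) % 12, (root + 7) % 12}


def name(root, quality):
    return f"{NOTE_NAMES[root]} {quality}"


if __name__ == "__main__":
    print(name(*relative(0, "major")))  # prints: A minor
```

Each transformation is its own inverse and keeps exactly two of the three tones, which is why each can be drawn as a move between adjacent triangles in a Tonnetz-style lattice.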

[370] Melodia: Training-Free Music Editing Guided by Attention Probing in Diffusion Models

Yi Yang, Haowen Li, Tianxiang Li, Boyu Cao, Xiaohan Zhang, Liqun Chen, Qi Liu

Main category: cs.SD

TL;DR: Melodia is a training-free music editing method that selectively manipulates self-attention maps in diffusion models to modify musical characteristics while preserving the source music’s temporal structure, achieving superior results without requiring textual descriptions of the source music.

Details

Motivation: Existing music editing methods often fail to preserve the source music's temporal structure (melody and rhythm) when altering attributes like instrument, genre, and mood, creating a need for better editing techniques.

Method: Conducted probing analysis on AudioLDM 2’s attention maps, revealing that self-attention maps preserve temporal structure while cross-attention maps handle musical characteristics. Melodia selectively manipulates self-attention maps during denoising and uses an attention repository to store source music information.

Result: Both objective and subjective experiments demonstrate superior results in textual adherence and structural integrity across various datasets compared to existing methods.

Conclusion: The research enhances understanding of music generation models’ internal mechanisms and provides improved control for music creation through selective attention manipulation.

Abstract: Text-to-music generation technology is progressing rapidly, creating new opportunities for musical composition and editing. However, existing music editing methods often fail to preserve the source music’s temporal structure, including melody and rhythm, when altering particular attributes like instrument, genre, and mood. To address this challenge, this paper conducts an in-depth probing analysis on attention maps within AudioLDM 2, a diffusion-based model commonly used as the backbone for existing music editing methods. We reveal a key finding: cross-attention maps encompass details regarding distinct musical characteristics, and interventions on these maps frequently result in ineffective modifications. In contrast, self-attention maps are essential for preserving the temporal structure of the source music during its conversion into the target music. Building upon this understanding, we present Melodia, a training-free technique that selectively manipulates self-attention maps in particular layers during the denoising process and leverages an attention repository to store source music information, achieving accurate modification of musical characteristics while preserving the original structure without requiring textual descriptions of the source music. Additionally, we propose two novel metrics to better evaluate music editing methods. Both objective and subjective experiments demonstrate that our approach achieves superior results in terms of textual adherence and structural integrity across various datasets. This research enhances comprehension of internal mechanisms within music generation models and provides improved control for music creation.

[371] Improving Speech Emotion Recognition with Mutual Information Regularized Generative Model

Chung-Soo Ahn, Rajib Rana, Sunil Sivadas, Carlos Busso, Jagath C. Rajapakse

Main category: cs.SD

TL;DR: A data augmentation framework for speech emotion recognition that uses cross-modal information transfer and mutual information regularization to generate high-quality synthetic data, improving emotion prediction performance on benchmark datasets.

Details

Motivation: Speech emotion recognition suffers from limited quality-labelled training data. While data augmentation methods exist, generative models show promise but need quality indicators and multimodal expansion.

Method: Proposed framework uses cross-modal information transfer and mutual information regularization to generate input features for emotion classification. Mutual information serves as quality indicator and ensures dependency between modalities.

Result: Tested on IEMOCAP, MSP-IMPROV and MSP-Podcast datasets. Framework improved emotion prediction performance against existing works and can generate inputs without cross-modal information.

Conclusion: The proposed data augmentation framework effectively addresses data scarcity in speech emotion recognition through cross-modal transfer and mutual information regularization, achieving better performance and enabling multimodal input generation.

Abstract: Although speech emotion recognition (SER) research has advanced thanks to deep learning methods, it still suffers from the difficulty of obtaining large amounts of quality-labelled training data. Data augmentation methods have been attempted to mitigate this issue, and generative models have recently shown success among them. We propose a data augmentation framework that is aided by cross-modal information transfer and mutual information regularization. A mutual-information-based metric can serve as an indicator of quality. Furthermore, we expand the scope of this data augmentation to multimodal inputs, thanks to mutual information ensuring dependency between modalities. Our framework was tested on three benchmark datasets: IEMOCAP, MSP-IMPROV and MSP-Podcast. The implementation was designed to generate input features that are fed into the last layer for emotion classification. Our framework improved the performance of emotion prediction against existing works. We also found that our framework is able to generate new inputs without any cross-modal information.

[372] Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard

Yudong Yang, Xuezhen Zhang, Zhifeng Han, Siyin Wang, Jimin Zhuang, Zengrui Jin, Jing Shao, Guangzhi Sun, Chao Zhang

Main category: cs.SD

TL;DR: SACRED-Bench is a benchmark for evaluating LLM vulnerabilities to complex audio-based attacks built from speech-audio composition mechanisms; it reaches a 66% attack success rate against Gemini 2.5 Pro, and the proposed SALMONN-Guard defense reduces attack success to 20%.

Details

Motivation: Current LLM safeguards inadequately handle complex audio inputs, exposing new safety risks from speech-audio composition attacks that bypass text-only filters.

Method: Uses three speech-audio composition mechanisms: speech overlap/multi-speaker dialogue, speech-audio mixtures implying unsafe intent, and diverse spoken instruction formats to evade detection.

Result: Gemini 2.5 Pro shows 66% attack success rate on SACRED-Bench, demonstrating vulnerabilities to cross-modal audio attacks.

Conclusion: Audio-aware defenses are crucial for multimodal LLM safety, with SALMONN-Guard showing effectiveness by reducing attack success to 20% through joint speech-audio-text inspection.

Abstract: Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but it also exposes new safety risks emerging from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition mechanisms. SACRED-Bench adopts three mechanisms: (a) speech overlap and multi-speaker dialogue, which embeds harmful prompts beneath or alongside benign speech; (b) speech-audio mixture, which implies unsafe intent via non-speech audio alongside benign speech or audio; and (c) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Experiments show that even Gemini 2.5 Pro, the state-of-the-art proprietary LLM, still exhibits a 66% attack success rate on the SACRED-Bench test set, exposing vulnerabilities under cross-modal, speech-audio composition attacks. To bridge this gap, we propose SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reducing the attack success rate to 20%. Our results highlight the need for audio-aware defenses for the safety of multimodal LLMs. The benchmark and SALMONN-Guard checkpoints can be found at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Warning: this paper includes examples that may be offensive or harmful.

cs.LG

[373] LAD-BNet: Lag-Aware Dual-Branch Networks for Real-Time Energy Forecasting on Edge Devices

Jean-Philippe Lignier

Main category: cs.LG

TL;DR: LAD-BNet is a neural network for real-time energy forecasting on edge devices that combines temporal lag analysis with dilated convolutions, achieving 14.49% MAPE with 18ms inference time on Edge TPU.

Details

Motivation: Real-time energy forecasting on edge devices is challenging but crucial for smart grid optimization and intelligent buildings, requiring efficient models that can run on resource-constrained hardware.

Method: Hybrid architecture combining a branch for explicit temporal lag exploitation with a Temporal Convolutional Network (TCN) using dilated convolutions to capture both short and long-term dependencies simultaneously.

Result: Achieved 14.49% MAPE at 1-hour horizon with 18ms inference time on Edge TPU (8-12x faster than CPU), 2.39% improvement over LSTM and 3.04% over pure TCN, with 180MB memory footprint suitable for embedded devices.

Conclusion: LAD-BNet enables practical industrial applications in real-time energy optimization, demand management, and operational planning on edge devices.

Abstract: Real-time energy forecasting on edge devices represents a major challenge for smart grid optimization and intelligent buildings. We present LAD-BNet (Lag-Aware Dual-Branch Network), an innovative neural architecture optimized for edge inference with Google Coral TPU. Our hybrid approach combines a branch dedicated to explicit exploitation of temporal lags with a Temporal Convolutional Network (TCN) featuring dilated convolutions, enabling simultaneous capture of short and long-term dependencies. Tested on real energy consumption data with 10-minute temporal resolution, LAD-BNet achieves 14.49% MAPE at 1-hour horizon with only 18ms inference time on Edge TPU, representing an 8-12x acceleration compared to CPU. The multi-scale architecture enables predictions up to 12 hours with controlled performance degradation. Our model demonstrates a 2.39% improvement over LSTM baselines and 3.04% over pure TCN architectures, while maintaining a 180MB memory footprint suitable for embedded device constraints. These results pave the way for industrial applications in real-time energy optimization, demand management, and operational planning.
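
The evaluation setting in the abstract (10-minute resolution, 1-hour horizon, MAPE) can be illustrated with a small sketch. It is not the LAD-BNet architecture: the lag set, the helper names, and the persistence baseline are assumptions chosen only to show how explicit lag features and the MAPE metric fit together.

```python
STEP_MINUTES = 10                      # temporal resolution stated in the abstract
HORIZON_STEPS = 60 // STEP_MINUTES     # a 1-hour horizon is 6 steps ahead


def mape(actual, predicted):
    """Mean absolute percentage error in percent. Assumes no actual value is zero."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def make_lagged_samples(series, lags, horizon):
    """Build (features, target) pairs from a univariate series.

    For each time index t, the features are the values at t - lag for every
    lag in `lags`, and the target is the value `horizon` steps after t.
    """
    samples = []
    start = max(lags)
    for t in range(start, len(series) - horizon):
        features = [series[t - lag] for lag in lags]
        target = series[t + horizon]
        samples.append((features, target))
    return samples


def persistence_forecast(samples):
    """Hypothetical baseline: predict the first feature (with lag 0, the latest value)."""
    return [features[0] for features, _ in samples]


if __name__ == "__main__":
    series = [100.0 + 5.0 * (i % 6) for i in range(48)]   # synthetic load curve
    samples = make_lagged_samples(series, lags=[0, 1, 6], horizon=HORIZON_STEPS)
    targets = [target for _, target in samples]
    print(f"persistence MAPE: {mape(targets, persistence_forecast(samples)):.2f}%")
```

A learned model would replace `persistence_forecast` and consume the same lagged features; the MAPE figure reported in the abstract is computed the same way at the 6-step horizon.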

[374] LT-Soups: Bridging Head and Tail Classes via Subsampled Model Soups

Masih Aminbeidokhti, Subhankar Roy, Eric Granger, Elisa Ricci, Marco Pedersoli

Main category: cs.LG

TL;DR: LT-Soups is a two-stage model soups framework that addresses the head-tail accuracy trade-off in long-tailed datasets by first averaging models from balanced subsets to reduce bias, then fine-tuning the classifier on full data.

Details

Motivation: Real-world datasets have long-tailed distributions where PEFT methods preserve tail-class performance but sacrifice head-class accuracy, with the head-tail ratio being a crucial overlooked factor.

Method: Two-stage approach: (1) average models fine-tuned on balanced subsets to reduce head-class bias, (2) fine-tune only classifier on full dataset to restore head-class accuracy.

Result: LT-Soups achieves superior trade-offs compared to PEFT and traditional model soups across six benchmark datasets and various imbalance regimes.

Conclusion: LT-Soups effectively generalizes across diverse long-tailed regimes by addressing the head-tail ratio problem through a specialized model soups framework.

Abstract: Real-world datasets typically exhibit long-tailed (LT) distributions, where a few head classes dominate and many tail classes are severely underrepresented. While recent work shows that parameter-efficient fine-tuning (PEFT) methods like LoRA and AdaptFormer preserve tail-class performance on foundation models such as CLIP, we find that they do so at the cost of head-class accuracy. We identify the head-tail ratio, the proportion of head to tail classes, as a crucial but overlooked factor influencing this trade-off. Through controlled experiments on CIFAR100 with varying imbalance ratio ($ρ$) and head-tail ratio ($η$), we show that PEFT excels in tail-heavy scenarios but degrades in more balanced and head-heavy distributions. To overcome these limitations, we propose LT-Soups, a two-stage model soups framework designed to generalize across diverse LT regimes. In the first stage, LT-Soups averages models fine-tuned on balanced subsets to reduce head-class bias; in the second, it fine-tunes only the classifier on the full dataset to restore head-class accuracy. Experiments across six benchmark datasets show that LT-Soups achieves superior trade-offs compared to both PEFT and traditional model soups across a wide range of imbalance regimes.
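
The first stage described above (average models fine-tuned on balanced subsets) can be sketched as follows. This is a minimal illustration under stated assumptions: parameters are represented as plain lists of floats instead of tensors, the per-class subsampling rule and both function names are hypothetical, and the second stage (classifier-only fine-tuning on the full dataset) is not shown.

```python
import random
from collections import defaultdict


def balanced_subset(labels, per_class, seed):
    """Pick at most `per_class` sample indices from every class (assumed sampling rule)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        chosen.extend(rng.sample(indices, min(per_class, len(indices))))
    return sorted(chosen)


def average_weights(state_dicts):
    """Uniform model soup: average every parameter across models, key by key."""
    count = len(state_dicts)
    averaged = {}
    for key in state_dicts[0]:
        columns = zip(*(sd[key] for sd in state_dicts))
        averaged[key] = [sum(values) / count for values in columns]
    return averaged


if __name__ == "__main__":
    labels = ["head"] * 10 + ["tail"] * 2
    print(balanced_subset(labels, per_class=2, seed=0))
    soup = average_weights([
        {"w": [1.0, 3.0], "b": [0.0]},   # model fine-tuned on subset 1
        {"w": [3.0, 5.0], "b": [2.0]},   # model fine-tuned on subset 2
    ])
    print(soup)
```

Averaging only makes sense when all models share one architecture and start from the same pretrained weights, which is the usual model-soup setting.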

[375] Differentiable Sparse Identification of Lagrangian Dynamics

Zitong Zhang, Hao Sun

Main category: cs.LG

TL;DR: A novel differentiable sparse identification framework for discovering governing equations from noisy data in nonlinear dynamics, integrating cubic B-spline approximation with Lagrangian formalism for improved accuracy and noise robustness.

Details

Motivation: Existing sparse regression techniques struggle with rational functions and noise sensitivity in complex mechanical systems, while current Lagrangian identification methods are significantly affected by measurement noise and limited data availability.

Method: Integration of cubic B-spline approximation into Lagrangian system identification, robust equation discovery with physical constraints, and recursive derivative computation using B-spline basis functions to constrain higher-order derivatives.

Result: Superior performance in extracting physical laws from noisy data, particularly in complex mechanical systems, with improved accuracy and reliability compared to baseline methods.

Conclusion: The proposed framework enables more accurate and reliable discovery of governing equations from noisy data by effectively addressing noise sensitivity and complex nonlinearities through B-spline approximation and physical constraints.

Abstract: Data-driven discovery of governing equations remains a fundamental challenge in nonlinear dynamics. Although sparse regression techniques have advanced system identification, they struggle with rational functions and noise sensitivity in complex mechanical systems. The Lagrangian formalism offers a promising alternative, as it typically avoids rational expressions and provides a more concise representation of system dynamics. However, existing Lagrangian identification methods are significantly affected by measurement noise and limited data availability. This paper presents a novel differentiable sparse identification framework that addresses these limitations through three key contributions: (1) the first integration of cubic B-spline approximation into Lagrangian system identification, enabling accurate representation of complex nonlinearities, (2) a robust equation discovery mechanism that effectively utilizes measurements while incorporating known physical constraints, and (3) a recursive derivative computation scheme based on B-spline basis functions, effectively constraining higher-order derivatives and reducing noise sensitivity in second-order dynamical systems. The proposed method demonstrates superior performance and enables more accurate and reliable extraction of physical laws from noisy data, particularly in complex mechanical systems, compared to baseline methods.

[376] Bias-Restrained Prefix Representation Finetuning for Mathematical Reasoning

Sirui Liang, Pengfei Cao, Jian Zhao, Cong Huang, Jun Zhao, Kang Liu

Main category: cs.LG

TL;DR: BREP ReFT improves mathematical reasoning in representation fine-tuning by optimizing initial reasoning prefixes, preventing early error accumulation, and constraining intervention vectors to preserve numerical encoding.

Details

Motivation: Standard ReFT methods show significant performance decline on mathematical reasoning tasks due to poor initial reasoning prefix generation and numerical encoding disturbance.

Method: Proposes Bias-REstrained Prefix Representation FineTuning (BREP ReFT) with three key techniques: truncating training data for better prefix generation, early inference intervention to prevent error accumulation, and magnitude constraints on intervention vectors.

Result: Extensive experiments show BREP outperforms both standard ReFT and weight-based PEFT methods on mathematical reasoning tasks across diverse model architectures.

Conclusion: BREP ReFT effectively addresses ReFT’s limitations in mathematical reasoning while maintaining efficiency and demonstrating robust generalization capability.

Abstract: Parameter-efficient finetuning (PEFT) enhances model performance on downstream tasks by updating a minimal subset of parameters. Representation finetuning (ReFT) methods further improve efficiency by freezing model weights and optimizing internal representations with fewer parameters than PEFT, outperforming PEFT on several tasks. However, ReFT exhibits a significant performance decline on mathematical reasoning tasks. To address this problem, the paper demonstrates that ReFT’s poor performance on mathematical tasks primarily stems from its struggle to generate effective reasoning prefixes during the early inference phase. Moreover, ReFT disturbs the numerical encoding and the error accumulates during the CoT stage. Based on these observations, this paper proposes Bias-REstrained Prefix Representation FineTuning (BREP ReFT), which enhances ReFT’s mathematical reasoning capability by truncating training data to optimize the generation of initial reasoning prefixes, intervening on the early inference stage to prevent error accumulation, and constraining the intervention vectors’ magnitude to avoid disturbing numerical encoding. Extensive experiments across diverse model architectures demonstrate BREP’s superior effectiveness, efficiency, and robust generalization capability, outperforming both standard ReFT and weight-based PEFT methods on the task of mathematical reasoning. The source code is available at https://github.com/LiangThree/BREP.

[377] Towards Uncertainty Quantification in Generative Model Learning

Giorgio Morales, Frederic Jurie, Jalal Fadili

Main category: cs.LG

TL;DR: This position paper formalizes uncertainty quantification in generative models and proposes ensemble-based precision-recall curves to capture model approximation uncertainty.

Details

Motivation: Current generative model evaluation methods focus on distribution closeness but neglect uncertainty quantification in these measurements, creating reliability concerns.

Method: Proposes formalizing uncertainty quantification problem and explores ensemble-based precision-recall curves to measure model approximation uncertainty.

Result: Preliminary experiments on synthetic datasets show aggregated precision-recall curves effectively capture model approximation uncertainty and enable systematic comparison of different model architectures.

Conclusion: Uncertainty quantification is crucial for reliable generative models, and ensemble-based precision-recall curves provide a promising approach for systematic uncertainty evaluation.

Abstract: While generative models have become increasingly prevalent across various domains, fundamental concerns regarding their reliability persist. A crucial yet understudied aspect of these models is the uncertainty quantification surrounding their distribution approximation capabilities. Current evaluation methodologies focus predominantly on measuring the closeness between the learned and the target distributions, neglecting the inherent uncertainty in these measurements. In this position paper, we formalize the problem of uncertainty quantification in generative model learning. We discuss potential research directions, including the use of ensemble-based precision-recall curves. Our preliminary experiments on synthetic datasets demonstrate the effectiveness of aggregated precision-recall curves in capturing model approximation uncertainty, enabling systematic comparison among different model architectures based on their uncertainty characteristics.

[378] Movement-Specific Analysis for FIM Score Classification Using Spatio-Temporal Deep Learning

Jun Masaki, Ariaki Higashi, Naoko Shinagawa, Kazuhiko Hirata, Yuichi Kurita, Akira Furui

Main category: cs.LG

TL;DR: Automated FIM score estimation using deep learning with ST-GCN, BiLSTM, and attention mechanisms, achieving 70-79% accuracy in distinguishing independent vs assisted patients.

Details

Motivation: Traditional FIM assessment imposes significant burden on patients and healthcare professionals, requiring automated solutions to reduce this workload.

Method: Deep neural network combining spatial-temporal graph convolutional network (ST-GCN), bidirectional LSTM, and attention mechanism to estimate FIM motor item scores from simple exercises different from designated FIM actions.

Result: Achieved balanced accuracies of 70.09-78.79% across different FIM items, successfully distinguishing completely independent patients from those requiring assistance. Identified specific movement patterns as reliable predictors for particular FIM evaluation items.

Conclusion: The proposed automated FIM estimation method effectively captures long-term temporal dependencies and key body-joint contributions, providing a viable alternative to traditional assessment with reduced burden on patients and professionals.

Abstract: The functional independence measure (FIM) is widely used to evaluate patients’ physical independence in activities of daily living. However, traditional FIM assessment imposes a significant burden on both patients and healthcare professionals. To address this challenge, we propose an automated FIM score estimation method that utilizes simple exercises different from the designated FIM assessment actions. Our approach employs a deep neural network architecture integrating a spatial-temporal graph convolutional network (ST-GCN), bidirectional long short-term memory (BiLSTM), and an attention mechanism to estimate FIM motor item scores. The model effectively captures long-term temporal dependencies and identifies key body-joint contributions through learned attention weights. We evaluated our method in a study of 277 rehabilitation patients, focusing on FIM transfer and locomotion items. Our approach successfully distinguishes between completely independent patients and those requiring assistance, achieving balanced accuracies of 70.09-78.79% across different FIM items. Additionally, our analysis reveals specific movement patterns that serve as reliable predictors for particular FIM evaluation items.
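
The reported metric, balanced accuracy, is the mean of the per-class recalls, which matters when independent and assisted patients are unevenly represented. The sketch below shows the standard definition on made-up labels; the label encoding (1 = completely independent, 0 = requires assistance) and the example data are assumptions, not values from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Plain accuracy, shown for comparison."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


if __name__ == "__main__":
    # Hypothetical imbalanced cohort: 8 independent (1) and 2 assisted (0) patients.
    y_true = [1] * 8 + [0] * 2
    always_independent = [1] * 10
    print(accuracy(y_true, always_independent))           # 0.8
    print(balanced_accuracy(y_true, always_independent))  # 0.5
```

A classifier that always predicts the majority class scores 0.8 on plain accuracy but only 0.5 on balanced accuracy, so values of 70-79% indicate real discrimination between the two groups.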

[379] Fast Neural Tangent Kernel Alignment, Norm and Effective Rank via Trace Estimation

James Hazelden

Main category: cs.LG

TL;DR: Matrix-free approach using trace estimation for fast computation of Neural Tangent Kernel (NTK) properties like trace, Frobenius norm, and alignment, enabling orders of magnitude speedup.

Details

Motivation: Computing the full NTK matrix is often infeasible, especially for recurrent architectures, creating a need for efficient approximation methods.

Method: Use trace estimation (Hutch++ estimator) and one-sided estimators that require only forward- or reverse-mode automatic differentiation, not both modes.

Result: Matrix-free randomized approaches yield speedups of many orders of magnitude, with one-sided estimators outperforming Hutch++ in low-sample regimes when the gap between the model state and parameter count is large.

Conclusion: Matrix-free randomized approaches enable faster analysis and applications of the NTK through efficient computation of key properties.

Abstract: The Neural Tangent Kernel (NTK) characterizes how a model’s state evolves over Gradient Descent. Computing the full NTK matrix is often infeasible, especially for recurrent architectures. Here, we introduce a matrix-free perspective, using trace estimation to rapidly analyze the empirical, finite-width NTK. This enables fast computation of the NTK’s trace, Frobenius norm, effective rank, and alignment. We provide numerical recipes based on the Hutch++ trace estimator with provably fast convergence guarantees. In addition, we show that, due to the structure of the NTK, one can compute the trace using only forward- or reverse-mode automatic differentiation, not requiring both modes. We show these so-called one-sided estimators can outperform Hutch++ in the low-sample regime, especially when the gap between the model state and parameter count is large. In total, our results demonstrate that matrix-free randomized approaches can yield speedups of many orders of magnitude, leading to faster analysis and applications of the NTK.
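
The matrix-free idea can be sketched with the plain Hutchinson estimator. If J is the Jacobian of the model outputs with respect to the parameters, the empirical NTK is J J^T, and tr(J J^T) equals the expected value of ||J^T z||^2 over random sign vectors z, so only vector-Jacobian products are needed. The sketch below is not Hutch++ and not the paper's estimators; a small explicit matrix stands in for the Jacobian, and the `vjp` helper is a hypothetical substitute for a reverse-mode autodiff call.

```python
import random


def vjp(jacobian, z):
    """Vector-Jacobian product J^T z for an explicit matrix (rows = outputs)."""
    n_params = len(jacobian[0])
    return [
        sum(jacobian[i][j] * z[i] for i in range(len(jacobian)))
        for j in range(n_params)
    ]


def hutchinson_ntk_trace(jacobian, num_samples, seed=0):
    """Estimate tr(J J^T) as the average of ||J^T z||^2 over Rademacher vectors z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(len(jacobian))]
        g = vjp(jacobian, z)
        total += sum(x * x for x in g)
    return total / num_samples


def exact_ntk_trace(jacobian):
    """tr(J J^T) equals the squared Frobenius norm of J."""
    return sum(x * x for row in jacobian for x in row)


if __name__ == "__main__":
    J = [[1.0, 2.0], [3.0, 4.0]]
    print("exact:", exact_ntk_trace(J))
    print("estimate:", hutchinson_ntk_trace(J, num_samples=2000))
```

The estimator never forms the NTK matrix, so its cost per sample is one vector-Jacobian product regardless of how many outputs the model has.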

[380] Near-optimal Linear Predictive Clustering in Non-separable Spaces via Mixed Integer Programming and Quadratic Pseudo-Boolean Reductions

Jiazhou Liang, Hassan Khurram, Scott Sanner

Main category: cs.LG

TL;DR: Two novel approaches for Linear Predictive Clustering that improve global optimization efficiency using separability theory and QPBO approximation, achieving near-optimal solutions with better scalability than existing MIP methods.

Details

Motivation: Existing greedy optimization methods for LPC lack global optimality and struggle with non-separable clusters, while MIP formulations ensure global optimality but suffer from poor scalability.

Method: Leverage separability theory to derive near-optimal approximations with provable error bounds, reducing MIP complexity. Also approximate LPC as Quadratic Pseudo-Boolean Optimization problem for computational improvements.

Result: Methods achieve near-optimal solutions with substantially lower regression errors than greedy optimization while exhibiting superior scalability over existing MIP formulations on synthetic and real-world datasets.

Conclusion: The proposed approaches successfully bridge the gap between greedy methods and MIP formulations, providing efficient global optimization for LPC with near-optimal performance and improved scalability.

Abstract: Linear Predictive Clustering (LPC) partitions samples based on shared linear relationships between feature and target variables, with numerous applications including marketing, medicine, and education. Greedy optimization methods, commonly used for LPC, alternate between clustering and linear regression but lack global optimality. While effective for separable clusters, they struggle in non-separable settings where clusters overlap in feature space. In an alternative constrained optimization paradigm, Bertsimas and Shioda (2007) formulated LPC as a Mixed-Integer Program (MIP), ensuring global optimality regardless of separability but suffering from poor scalability. This work builds on the constrained optimization paradigm to introduce two novel approaches that improve the efficiency of global optimization for LPC. By leveraging key theoretical properties of separability, we derive near-optimal approximations with provable error bounds, significantly reducing the MIP formulation’s complexity and improving scalability. Additionally, we can further approximate LPC as a Quadratic Pseudo-Boolean Optimization (QPBO) problem, achieving substantial computational improvements in some settings. Comparative analyses on synthetic and real-world datasets demonstrate that our methods consistently achieve near-optimal solutions with substantially lower regression errors than greedy optimization while exhibiting superior scalability over existing MIP formulations.

[381] Transformers know more than they can tell – Learning the Collatz sequence

François Charton, Ashvni Narayanan

Main category: cs.LG

TL;DR: Transformers can predict long Collatz steps with high accuracy (up to 99.7%) when using optimal base encodings (24, 32), but performance varies significantly with encoding base. Models learn to classify inputs by their residual modulo powers of 2, reflecting the mathematical structure of Collatz sequences.

Details

Motivation: To understand how transformers learn complex arithmetic functions like the Collatz sequence, and to use mathematical problems as tools for analyzing and explaining language model behavior and learning patterns.

Method: Train transformer models to predict long Collatz steps using different base encodings for input and output. Analyze learning patterns, accuracy variations by base, and failure cases to understand the algorithms learned.

Result: Accuracy ranges from 25% (base 3) and 37% (base 11) to 99.7% (bases 24 and 32). Models learn to classify inputs by residual modulo 2^p, achieving near-perfect accuracy on these classes. In over 90% of failures, the model performs the correct calculation but misestimates loop lengths.

Conclusion: The main difficulty in learning complex arithmetic functions is determining the control structure (loop lengths). Mathematical problems serve as effective tools for understanding, explaining, and potentially improving language models, with broad applicability across problem domains.

Abstract: We investigate transformer prediction of long Collatz steps, a complex arithmetic function that maps odd integers to their distant successors in the Collatz sequence ($u_{n+1}=u_n/2$ if $u_n$ is even, $u_{n+1}=(3u_n+1)/2$ if $u_n$ is odd). Model accuracy varies with the base used to encode input and output. It can be as high as $99.7\%$ for bases $24$ and $32$, and as low as $37\%$ and $25\%$ for bases $11$ and $3$. Yet, all models, no matter the base, follow a common learning pattern. As training proceeds, they learn a sequence of classes of inputs that share the same residual modulo $2^p$. Models achieve near-perfect accuracy on these classes, and less than $1\%$ for all other inputs. This maps to a mathematical property of Collatz sequences: the length of the loops involved in the computation of a long Collatz step can be deduced from the binary representation of its input. The learning pattern reflects the model learning to predict inputs associated with increasing loop lengths. An analysis of failure cases reveals that almost all model errors follow predictable patterns. Hallucination, a common feature of large language models, almost never happens. In over $90\%$ of failures, the model performs the correct calculation, but wrongly estimates loop lengths. Our observations give a full account of the algorithms learned by the models. They suggest that the difficulty of learning such a complex arithmetic function lies in figuring out the control structure of the computation – the length of the loops. We believe that the approach outlined here, using mathematical problems as tools for understanding, explaining, and perhaps improving language models, can be applied to a broad range of problems and bear fruitful results.
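A short sketch may make the loop structure concrete. The abstract does not spell out the long Collatz step, so the definition below is an assumption: starting from an odd integer, apply (3u+1)/2 while the value is odd (k times), then halve while it is even (k' times). Under that reading, k equals the number of trailing 1 bits of the input, which is one way the loop length can be read off the binary representation. The function names and the `to_base` helper are hypothetical.

```python
def long_collatz_step(n):
    """Assumed definition: odd steps until even, then halvings until odd.

    Returns (successor, k, k_prime), where k counts the (3u+1)/2 steps
    and k_prime counts the u/2 steps.
    """
    if n <= 0 or n % 2 == 0:
        raise ValueError("expected a positive odd integer")
    k = 0
    while n % 2 == 1:
        n = (3 * n + 1) // 2
        k += 1
    k_prime = 0
    while n % 2 == 0:
        n //= 2
        k_prime += 1
    return n, k, k_prime

def trailing_ones(n):
    """Number of consecutive 1 bits at the low end of n."""
    count = 0
    while n & 1:
        n >>= 1
        count += 1
    return count

def to_base(n, base):
    """Digits of n in the given base, most significant first."""
    digits = []
    while n > 0:
        n, r = divmod(n, base)
        digits.append(r)
    return digits[::-1] or [0]

for n in (7, 27, 5):
    successor, k, k_prime = long_collatz_step(n)
    print(n, bin(n), "->", successor, "k =", k, "k' =", k_prime,
          "base 24:", to_base(n, 24), "->", to_base(successor, 24))
```

For example, 7 is 111 in binary, so k = 3: the sequence runs 7, 11, 17, 26, and one halving then gives 13.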

[382] Towards Universal Neural Operators through Multiphysics Pretraining

Mikhail Masliaev, Dmitry Gusarov, Ilya Markov, Alexander Hvatov

Main category: cs.LG

TL;DR: Transformer-based neural operators can effectively transfer knowledge across diverse PDE problems through downstream learning.

Details

Motivation: To address the high computational cost of training neural operators by investigating transfer learning capabilities of transformer-based architectures across various PDE problems.

Method: Evaluated transformer-based neural operators in transfer learning settings including extrapolation to unseen parameters, adding new variables, and transferring from multi-equation datasets.

Result: Advanced neural operator architectures demonstrated effective knowledge transfer across different PDE problems.

Conclusion: Transformer-based neural operators show promising transfer learning capabilities for data-driven physical simulations, enabling more efficient training through downstream learning approaches.

Abstract: Although neural operators are widely used in data-driven physical simulations, their training remains computationally expensive. Recent advances address this issue via downstream learning, where a model pretrained on simpler problems is fine-tuned on more complex ones. In this research, we investigate transformer-based neural operators, which have previously been applied only to specific problems, in a more general transfer learning setting. We evaluate their performance across diverse PDE problems, including extrapolation to unseen parameters, incorporation of new variables, and transfer from multi-equation datasets. Our results demonstrate that advanced neural operator architectures can effectively transfer knowledge across PDE problems.

[383] Benchmarking Quantum Kernels Across Diverse and Complex Data

Yuhan Jiang, Matthew Otten

Main category: cs.LG

TL;DR: In classical simulation, a variational quantum kernel framework with parameter scaling outperforms standard classical kernels such as RBF on eight real-world, high-dimensional datasets; practical quantum advantage remains to be fully assessed.

Details

Motivation: Current quantum kernel research is limited to low-dimensional or synthetic data, lacking evaluation on diverse real-world datasets to verify practical quantum advantage.

Method: Developed a variational quantum kernel framework with resource-efficient ansätze and parameter scaling technique for complex classification tasks, benchmarked on 8 challenging real-world datasets.

Result: The proposed quantum kernel demonstrated clear performance advantage over standard classical kernels like RBF kernel across tabular, image, time series, and graph data.

Conclusion: Properly designed quantum kernels can serve as versatile, high-performance tools for real-world machine learning applications, though further research is needed to fully assess practical quantum advantage.

Abstract: Quantum kernel methods are a promising branch of quantum machine learning, yet their practical advantage on diverse, high-dimensional, real-world data remains unverified. Current research has largely been limited to low-dimensional or synthetic datasets, preventing a thorough evaluation of their potential. To address this gap, we developed a variational quantum kernel framework utilizing resource-efficient ansätze for complex classification tasks and introduced a parameter scaling technique to accelerate convergence. We conducted a comprehensive benchmark of this framework on eight challenging, real-world, high-dimensional datasets covering tabular, image, time series, and graph data. Our classically simulated results show that the proposed quantum kernel demonstrated a clear performance advantage over standard classical kernels, such as the radial basis function (RBF) kernel. This work demonstrates that properly designed quantum kernels can function as versatile, high-performance tools, laying a foundation for quantum-enhanced applications in real-world machine learning. Further research is needed to fully assess the practical quantum advantage.

[384] SURFACEBENCH: Can Self-Evolving LLMs Find the Equations of 3D Scientific Surfaces?

Sanchit Kabra, Shobhnik Kriplani, Parshin Shojaee, Chandan K. Reddy

Main category: cs.LG

TL;DR: SurfaceBench is a comprehensive benchmark for symbolic surface discovery with 183 tasks across 15 complexity categories, featuring explicit, implicit, and parametric equations grounded in scientific domains and evaluated using both symbolic checks and geometry-aware metrics.

Details

Motivation: Existing symbolic regression benchmarks are limited to scalar functions, ignore domain grounding, rely on brittle string-matching metrics, and are vulnerable to LLM memorization, failing to capture scientific equivalence and surface-level structure.

Method: Created SurfaceBench with 183 tasks across explicit, implicit, and parametric equation forms, featuring novel symbolic compositions to resist memorization, variable semantics, synthetically sampled 3D data, and geometry-aware evaluation metrics like Chamfer and Hausdorff distances.

Result: State-of-the-art frameworks struggle to generalize across representation types and surface complexities, showing occasional success only on specific families of equations, revealing limitations in current approaches.

Conclusion: SurfaceBench establishes a challenging testbed that bridges symbolic reasoning with geometric reconstruction, enabling principled benchmarking of progress in compositional generalization, data-driven scientific induction, and geometry-aware reasoning with LLMs.

Abstract: Equation discovery from data is a core challenge in machine learning for science, requiring the recovery of concise symbolic expressions that govern complex physical and geometric phenomena. Recent approaches with large language models (LLMs) show promise in symbolic regression, but their success often hinges on memorized formulas or overly simplified functional forms. Existing benchmarks exacerbate this limitation: they focus on scalar functions, ignore domain grounding, and rely on brittle string-matching-based metrics that fail to capture scientific equivalence. We introduce SurfaceBench, the first comprehensive benchmark for symbolic surface discovery. SurfaceBench comprises 183 tasks across 15 categories of symbolic complexity, spanning explicit, implicit, and parametric equation representation forms. Each task includes ground-truth equations, variable semantics, and synthetically sampled three-dimensional data. Unlike prior SR datasets, our tasks reflect surface-level structure, resist LLM memorization through novel symbolic compositions, and are grounded in scientific domains such as fluid dynamics, robotics, electromagnetics, and geometry. To evaluate equation discovery quality, we pair symbolic checks with geometry-aware metrics such as Chamfer and Hausdorff distances, capturing both algebraic fidelity and spatial reconstruction accuracy. Our experiments reveal that state-of-the-art frameworks, while occasionally successful on specific families, struggle to generalize across representation types and surface complexities. SurfaceBench thus establishes a challenging and diagnostic testbed that bridges symbolic reasoning with geometric reconstruction, enabling principled benchmarking of progress in compositional generalization, data-driven scientific induction, and geometry-aware reasoning with LLMs. We release the code here: https://github.com/Sanchit-404/surfacebench
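For readers unfamiliar with the geometry-aware metrics named above, here is a minimal sketch of Chamfer and Hausdorff distances between two sampled surfaces. Conventions differ across libraries (squared versus unsquared distances, sum versus average of the two directions), so the variant below is one common choice and not necessarily the one SurfaceBench uses. The paraboloid and the offset candidate are made-up examples.

```python
import math

def nearest(p, cloud):
    """Distance from point p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)

def chamfer(a, b):
    """Sum of the two mean nearest-neighbour distances (unsquared variant)."""
    return (sum(nearest(p, b) for p in a) / len(a)
            + sum(nearest(q, a) for q in b) / len(b))

def hausdorff(a, b):
    """Largest nearest-neighbour distance in either direction."""
    return max(max(nearest(p, b) for p in a),
               max(nearest(q, a) for q in b))

def sample_surface(f, grid):
    """Sample an explicit surface z = f(x, y) on a square grid."""
    return [(x, y, f(x, y)) for x in grid for y in grid]

grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
truth = sample_surface(lambda x, y: x * x + y * y, grid)
candidate = sample_surface(lambda x, y: x * x + y * y + 0.1, grid)

print("chamfer:", chamfer(truth, candidate))
print("hausdorff:", hausdorff(truth, candidate))
```

Unlike string matching, both metrics score a candidate equation by the surface it produces, so algebraically different but geometrically identical expressions get a distance of zero.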

[385] EarthSight: A Distributed Framework for Low-Latency Satellite Intelligence

Ansel Kaplan Erol, Seungjun Lee, Divya Mahajan

Main category: cs.LG

TL;DR: EarthSight is a distributed runtime framework that enables satellite constellations to perform efficient, low-latency image analysis by coordinating onboard multi-task inference with ground-station query scheduling and dynamic filter ordering.

Details

Motivation: Traditional satellite imagery pipelines suffer from hours-to-days delays due to bandwidth constraints, while current onboard ML solutions treat satellites as isolated nodes, leading to redundant inference and inefficient resource usage.

Method: Three core innovations: (1) multi-task inference on satellites using shared backbones, (2) ground-station query scheduler that aggregates requests and assigns compute budgets, (3) dynamic filter ordering that integrates model selectivity, accuracy, and cost to reject low-value images early.

Result: EarthSight reduces average compute time per image by 1.9x and lowers 90th percentile end-to-end latency from 51 to 21 minutes compared to state-of-the-art baseline.

Conclusion: EarthSight enables scalable, low-latency satellite image analysis within strict bandwidth and power constraints by leveraging global ground context and resource-aware orbit decisions.

Abstract: Low-latency delivery of satellite imagery is essential for time-critical applications such as disaster response, intelligence, and infrastructure monitoring. However, traditional pipelines rely on downlinking all captured images before analysis, introducing delays of hours to days due to restricted communication bandwidth. To address these bottlenecks, emerging systems perform onboard machine learning to prioritize which images to transmit. However, these solutions typically treat each satellite as an isolated compute node, limiting scalability and efficiency. Redundant inference across satellites and tasks further strains onboard power and compute costs, constraining mission scope and responsiveness. We present EarthSight, a distributed runtime framework that redefines satellite image intelligence as a distributed decision problem between orbit and ground. EarthSight introduces three core innovations: (1) multi-task inference on satellites using shared backbones to amortize computation across multiple vision tasks; (2) a ground-station query scheduler that aggregates user requests, predicts priorities, and assigns compute budgets to incoming imagery; and (3) dynamic filter ordering, which integrates model selectivity, accuracy, and execution cost to reject low-value images early and conserve resources. EarthSight leverages global context from ground stations and resource-aware adaptive decisions in orbit to enable constellations to perform scalable, low-latency image analysis within strict downlink bandwidth and onboard power budgets. Evaluations using a previously established satellite simulator show that EarthSight reduces average compute time per image by 1.9x and lowers 90th percentile end-to-end latency from first contact to delivery from 51 to 21 minutes compared to the state-of-the-art baseline.
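The intuition behind filter ordering can be sketched with a textbook heuristic: when filters are independent and an image is dropped as soon as any filter rejects it, running filters in ascending order of cost divided by rejection probability minimizes expected compute. This is not EarthSight's algorithm, which also accounts for accuracy and adapts dynamically; the filter names, costs, and pass rates below are made up for illustration.

```python
from itertools import permutations

def expected_cost(filters):
    """Expected compute per image when filters run in the given order.

    A filter only runs if every earlier filter passed the image.
    """
    cost, reach = 0.0, 1.0
    for f in filters:
        cost += reach * f["cost"]
        reach *= f["pass_rate"]
    return cost

def order_filters(filters):
    """Sort by cost per unit of rejection probability (cheap, selective first)."""
    def rank(f):
        reject = 1.0 - f["pass_rate"]
        return f["cost"] / reject if reject > 0 else float("inf")
    return sorted(filters, key=rank)

# Hypothetical filters: cost in arbitrary compute units, pass_rate in [0, 1].
filters = [
    {"name": "land_use", "cost": 3.0, "pass_rate": 0.7},
    {"name": "ship_detector", "cost": 8.0, "pass_rate": 0.1},
    {"name": "cloud_mask", "cost": 1.0, "pass_rate": 0.4},
]

ordered = order_filters(filters)
best_by_search = min(expected_cost(p) for p in permutations(filters))

print([f["name"] for f in ordered])
print("heuristic:", expected_cost(ordered), "brute force:", best_by_search)
```

The brute-force comparison is only feasible for a handful of filters; the sort is what would scale.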

[386] The Map of Misbelief: Tracing Intrinsic and Extrinsic Hallucinations Through Attention Patterns

Elyes Hajji, Aymen Bouguerra, Fabio Arnez

Main category: cs.LG

TL;DR: This paper introduces a framework for detecting LLM hallucinations by differentiating between extrinsic and intrinsic types, using attention-based uncertainty quantification with novel aggregation strategies that improve interpretability and detection performance.

Details

Motivation: LLMs are increasingly used in safety-critical domains but remain susceptible to hallucinations, with existing detection methods being computationally expensive and failing to distinguish between different hallucination types.

Method: Proposed a principled evaluation framework differentiating extrinsic and intrinsic hallucinations, leveraged attention-based uncertainty quantification with novel attention aggregation strategies over input tokens.

Result: Sampling-based methods like Semantic Entropy work well for extrinsic hallucinations but fail on intrinsic ones, while the attention-based method is better suited for intrinsic hallucinations.

Conclusion: Attention serves as a rich signal for quantifying model uncertainty, providing new directions for aligning detection strategies with the nature of different hallucination types.

Abstract: Large Language Models (LLMs) are increasingly deployed in safety-critical domains, yet remain susceptible to hallucinations. While prior works have proposed confidence representation methods for hallucination detection, most of these approaches rely on computationally expensive sampling strategies and often disregard the distinction between hallucination types. In this work, we introduce a principled evaluation framework that differentiates between extrinsic and intrinsic hallucination categories and evaluates detection performance across a suite of curated benchmarks. In addition, we leverage a recent attention-based uncertainty quantification algorithm and propose novel attention aggregation strategies that improve both interpretability and hallucination detection performance. Our experimental findings reveal that sampling-based methods like Semantic Entropy are effective for detecting extrinsic hallucinations but generally fail on intrinsic ones. In contrast, our method, which aggregates attention over input tokens, is better suited for intrinsic hallucinations. These insights provide new directions for aligning detection strategies with the nature of hallucination and highlight attention as a rich signal for quantifying model uncertainty.

[387] FlowPath: Learning Data-Driven Manifolds with Invertible Flows for Robust Irregularly-sampled Time Series Classification

YongKyung Oh, Dong-Young Lim, Sungil Kim

Main category: cs.LG

TL;DR: FlowPath improves neural controlled differential equations for irregular time series by learning the geometry of control paths using invertible neural flows, achieving better performance than fixed interpolation methods.

Details

Motivation: Existing neural controlled differential equations perform poorly with sparse, irregular time series because fixed interpolation schemes misrepresent the underlying data geometry, especially under high missingness.

Method: Proposes FlowPath which learns control path geometry via invertible neural flows, constructing continuous data-adaptive manifolds with invertibility constraints that enforce information-preserving transformations.

Result: Empirical evaluations on 18 benchmark datasets and real-world case studies show statistically significant improvements in classification accuracy over baselines using fixed interpolants or non-invertible architectures.

Conclusion: Modeling both the dynamics along the path and the geometry of the path itself provides a robust and generalizable solution for learning from irregular time series.

Abstract: Modeling continuous-time dynamics from sparse and irregularly-sampled time series remains a fundamental challenge. Neural controlled differential equations provide a principled framework for such tasks, yet their performance is highly sensitive to the choice of control path constructed from discrete observations. Existing methods commonly employ fixed interpolation schemes, which impose simplistic geometric assumptions that often misrepresent the underlying data manifold, particularly under high missingness. We propose FlowPath, a novel approach that learns the geometry of the control path via an invertible neural flow. Rather than merely connecting observations, FlowPath constructs a continuous and data-adaptive manifold, guided by invertibility constraints that enforce information-preserving and well-behaved transformations. This inductive bias distinguishes FlowPath from prior unconstrained learnable path models. Empirical evaluations on 18 benchmark datasets and a real-world case study demonstrate that FlowPath consistently achieves statistically significant improvements in classification accuracy over baselines using fixed interpolants or non-invertible architectures. These results highlight the importance of modeling not only the dynamics along the path but also the geometry of the path itself, offering a robust and generalizable solution for learning from irregular time series.

[388] Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning

Alexander W. Goodall, Edwin Hamel-De le Court, Francesco Belardinelli

Main category: cs.LG

TL;DR: The paper shows that using well-designed behavior policies for off-policy data collection can provide lower-variance return estimates than on-policy methods, leading to improved sample efficiency and training stability in reinforcement learning.

Details

Motivation: Many RL algorithms suffer from poor sample efficiency and training instability due to high-variance return estimates. The authors aim to leverage recent off-policy evaluation results showing that properly designed behavior policies can collect data for provably lower-variance return estimates.

Method: Extend the insight from off-policy evaluation to online RL by using a single behavior policy to collect data for policy improvement with provably lower-variance return estimates. Apply this regime to two policy-gradient methods.

Result: Experiments demonstrate better sample efficiency and performance across diverse environments compared to traditional approaches.

Conclusion: Collecting data on-policy is not variance optimal; using well-designed behavior policies for off-policy data collection can significantly improve RL algorithm performance and efficiency.

Abstract: Many reinforcement learning algorithms, particularly those that rely on return estimates for policy improvement, can suffer from poor sample efficiency and training instability due to high-variance return estimates. In this paper we leverage new results from off-policy evaluation; it has recently been shown that well-designed behaviour policies can be used to collect off-policy data for provably lower variance return estimates. This result is surprising as it means collecting data on-policy is not variance optimal. We extend this key insight to the online reinforcement learning setting, where both policy evaluation and improvement are interleaved to learn optimal policies. Off-policy RL has been well studied (e.g., IMPALA), with correct and truncated importance weighted samples for de-biasing and managing variance appropriately. Generally, these approaches are concerned with reconciling data collected from multiple workers in parallel while the policy is updated asynchronously; the mismatch between the workers and the policy is corrected in a mathematically sound way. Here we consider only one worker, the behaviour policy, which is used to collect data for policy improvement with provably lower variance return estimates. In our experiments we extend two policy-gradient methods with this regime, demonstrating better sample efficiency and performance over a diverse set of environments.
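Why on-policy sampling is not variance optimal can be seen in a one-step toy problem. The sketch below is a textbook importance-sampling example, not the paper's algorithm: with deterministic positive rewards, a behaviour policy proportional to target probability times reward makes every importance-weighted sample equal to the true expected return, so the estimator has zero variance. The policy and reward values are made up, and the construction assumes the rewards are known, which is exactly what a practical method has to approximate.

```python
def mean_and_variance(behaviour, target, rewards):
    """Exact mean and variance of w(a) * r(a) for a ~ behaviour,
    where w(a) = target[a] / behaviour[a] is the importance weight."""
    values = [t / b * r for t, b, r in zip(target, behaviour, rewards)]
    mean = sum(b * v for b, v in zip(behaviour, values))
    var = sum(b * (v - mean) ** 2 for b, v in zip(behaviour, values))
    return mean, var

# Hypothetical three-armed bandit.
target = [0.5, 0.3, 0.2]     # policy whose expected reward we want
rewards = [1.0, 2.0, 10.0]   # deterministic, positive rewards

# On-policy: sample from the target itself (all weights equal 1).
on_mean, on_var = mean_and_variance(target, target, rewards)

# Off-policy: sample arms in proportion to target[a] * rewards[a].
z = sum(t * r for t, r in zip(target, rewards))
behaviour = [t * r / z for t, r in zip(target, rewards)]
off_mean, off_var = mean_and_variance(behaviour, target, rewards)

print("on-policy  mean, variance:", on_mean, on_var)
print("off-policy mean, variance:", off_mean, off_var)
```

Both estimators have the same mean, so the off-policy one is unbiased while spending its samples on the arms that contribute most to the return.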

[389] STAMP: Spatial-Temporal Adapter with Multi-Head Pooling

Brad Shook, Abby Turner, Jieshi Chen, Michał Wiliński, Mononito Goswami, Jonathan Elmer, Artur Dubrawski

Main category: cs.LG

TL;DR: STAMP adapter enables general time series foundation models to perform comparably to EEG-specific foundation models on EEG classification tasks with minimal trainable parameters.

Details

Motivation: No comparative analysis exists between EEG-specific foundation models and general time series foundation models for EEG tasks, despite the potential of leveraging powerful general TSFMs.

Method: Proposed Spatial-Temporal Adapter with Multi-Head Pooling (STAMP) that uses univariate embeddings from general TSFMs to implicitly model EEG’s spatial-temporal characteristics.

Result: STAMP achieves performance comparable to state-of-the-art EEG-specific foundation models on 8 benchmark EEG classification datasets with comprehensive ablation studies.

Conclusion: The lightweight and flexible STAMP adapter enables effective EEG data modeling using general TSFMs, supporting easy adoption without requiring specialized EEG foundation models.

Abstract: Time series foundation models (TSFMs) pretrained on data from multiple domains have shown strong performance on diverse modeling tasks. Various efforts have been made to develop foundation models specific to electroencephalography (EEG) data, which records brain electrical activity as time series. However, no comparative analysis of EEG-specific foundation models (EEGFMs) versus general TSFMs has been performed on EEG-specific tasks. We introduce a novel Spatial-Temporal Adapter with Multi-Head Pooling (STAMP), which leverages univariate embeddings produced by a general TSFM, implicitly models spatial-temporal characteristics of EEG data, and achieves performance comparable to state-of-the-art EEGFMs. A comprehensive analysis is performed on 8 benchmark datasets of clinical tasks using EEG for classification, along with ablation studies. Our proposed adapter is lightweight in trainable parameters and flexible in the inputs it can accommodate, supporting easy modeling of EEG data using TSFMs.

[390] ExPairT-LLM: Exact Learning for LLM Code Selection by Pairwise Queries

Tom Yuviler, Dana Drachsler-Cohen

Main category: cs.LG

TL;DR: ExPairT-LLM is a code selection algorithm that uses pairwise membership and equivalence queries to LLMs to select the best program from multiple LLM-generated options, achieving significant performance improvements over state-of-the-art methods.

Details

Motivation: Existing code selection algorithms fail because they can misidentify nonequivalent programs or rely on LLMs that may incorrectly determine outputs for all inputs.

Method: Uses pairwise membership and pairwise equivalence queries to LLMs, which are simpler for LLMs to handle, and employs a tournament approach that is robust to some LLM mistakes.

Result: Outperforms state-of-the-art code selection algorithm by +13.0% on average and up to +27.1% in pass@1 on four popular code datasets, and improves pass@1 of LLMs performing complex reasoning by +24.0%.

Conclusion: ExPairT-LLM effectively addresses limitations of existing code selection methods through simpler query types and a robust tournament mechanism, achieving substantial performance gains in code generation tasks.

Abstract: Despite recent advances in LLMs, the task of code generation is still challenging. To cope, code selection algorithms select the best program from multiple programs generated by an LLM. However, existing algorithms can fail to identify the correct program, either because they can misidentify nonequivalent programs or because they rely on an LLM and assume it always correctly determines the output for every input. We present ExPairT-LLM, an exact learning algorithm for code selection that selects a program by posing to an LLM oracle two new types of queries: pairwise membership and pairwise equivalence. These queries are simpler for LLMs and enable ExPairT-LLM to identify the correct program through a tournament, which is robust to some LLM mistakes. We evaluate ExPairT-LLM on four popular code datasets. Its pass@1 (success rate) outperforms the state-of-the-art code selection algorithm on average by +13.0% and up to +27.1%. It also improves the pass@1 of LLMs performing complex reasoning by +24.0%.
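The tournament idea can be sketched as follows. This is a loose illustration rather than ExPairT-LLM itself: the abstract does not define its pairwise membership and equivalence queries precisely, so the sketch assumes a query that, given an input and two differing outputs, says which output is correct (or neither). The `oracle` here is a stub backed by a reference function, standing in for an LLM, and every name is hypothetical.

```python
def tournament_select(programs, probe_inputs, oracle):
    """Round-robin tournament over candidate programs.

    For each pair, look for a probe input on which the outputs differ and
    ask the oracle which output is correct. oracle(x, out_a, out_b) returns
    0 if out_a is correct, 1 if out_b is correct, or None if neither is.
    Pairs that never disagree are treated as equivalent and score nothing.
    """
    wins = [0] * len(programs)
    for i in range(len(programs)):
        for j in range(i + 1, len(programs)):
            for x in probe_inputs:
                out_i, out_j = programs[i](x), programs[j](x)
                if out_i == out_j:
                    continue
                verdict = oracle(x, out_i, out_j)
                if verdict is not None:
                    wins[i if verdict == 0 else j] += 1
                    break
    best = max(range(len(programs)), key=wins.__getitem__)
    return best, wins

# Hypothetical task: implement absolute value.
candidates = [
    lambda x: x,                    # wrong for negatives
    lambda x: -x,                   # wrong for positives
    lambda x: x if x >= 0 else -x,  # correct
    lambda x: x * x,                # wrong except at 0 and 1
]

def stub_oracle(x, out_a, out_b):
    """Stand-in for an LLM judgment, backed by a reference answer."""
    truth = abs(x)
    if out_a == truth:
        return 0
    if out_b == truth:
        return 1
    return None

best, wins = tournament_select(candidates, [-2, 0, 3], stub_oracle)
print("selected candidate:", best, "wins:", wins)
```

Because the winner is decided by total wins rather than a single comparison, an occasional wrong verdict from a noisy oracle need not change the outcome, which matches the robustness the abstract describes.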

[391] Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication

Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis

Main category: cs.LG

TL;DR: Training-free adaptive token merging for vision transformers in semantic communication systems, reducing computation and transmission costs while maintaining accuracy across varying SNR conditions.

Details

Motivation: Large transformer models are powerful for semantic communication but computationally expensive for resource-constrained 6G networks, requiring efficient deployment methods.

Method: Formulate token merging as multi-objective optimization, use Gaussian process-based Bayesian optimization to find Pareto-optimal configurations for adaptive per-layer merging proportions.

Result: Consistently outperforms baselines, achieves significant FLOPs reduction while maintaining competitive accuracy across various SNR conditions, enables adaptive policies for channel quality.

Conclusion: Provides scalable and efficient approach for deploying transformer-based semantic communication in edge intelligence systems with flexible trade-offs between latency and semantic fidelity.

Abstract: Large-scale transformer models have emerged as a powerful tool for semantic communication systems, enabling edge devices to extract rich representations for robust inference across noisy wireless channels. However, their substantial computational demands remain a major barrier to practical deployment in resource-constrained 6G networks. In this paper, we present a training-free framework for adaptive token merging in pretrained vision transformers to jointly reduce inference time and transmission resource usage. We formulate the selection of per-layer merging proportions as a multi-objective optimization problem to balance accuracy and computational cost. We employ Gaussian process-based Bayesian optimization to construct a Pareto frontier of optimal configurations, enabling flexible runtime adaptation to dynamic application requirements and channel conditions. Extensive experiments demonstrate that our method consistently outperforms other baselines and achieves significant reductions in floating-point operations while maintaining competitive accuracy across a wide range of signal-to-noise ratio (SNR) conditions. Additional results highlight the effectiveness of adaptive policies that adjust merging aggressiveness in response to channel quality, providing a practical mechanism to trade off latency and semantic fidelity on demand. These findings establish a scalable and efficient approach for deploying transformer-based semantic communication in future edge intelligence systems.
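The Pareto-frontier step can be sketched independently of how the candidate configurations are found. The paper uses Gaussian process-based Bayesian optimization to propose them; the sketch below simply assumes a list of already evaluated configurations and keeps the non-dominated ones, then picks one under a runtime compute budget. The merge ratios, FLOPs, and accuracy values are made-up illustrative numbers, not results from the paper.

```python
def pareto_front(configs):
    """Keep configurations not dominated in (higher accuracy, lower FLOPs)."""
    front = []
    for c in configs:
        dominated = any(
            o["accuracy"] >= c["accuracy"] and o["flops"] <= c["flops"]
            and (o["accuracy"] > c["accuracy"] or o["flops"] < c["flops"])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["flops"])

def pick_config(front, flops_budget):
    """Most accurate frontier point within budget; cheapest point otherwise."""
    feasible = [c for c in front if c["flops"] <= flops_budget]
    return max(feasible, key=lambda c: c["accuracy"]) if feasible else front[0]

# Hypothetical evaluated configurations: per-layer merge ratios for a
# two-layer toy model, with cost in GFLOPs and task accuracy.
configs = [
    {"merge": (0.0, 0.0), "flops": 10.0, "accuracy": 0.90},
    {"merge": (0.2, 0.2), "flops": 7.0, "accuracy": 0.88},
    {"merge": (0.5, 0.1), "flops": 6.0, "accuracy": 0.80},
    {"merge": (0.1, 0.5), "flops": 6.5, "accuracy": 0.78},
    {"merge": (0.5, 0.5), "flops": 4.0, "accuracy": 0.70},
]

front = pareto_front(configs)
choice = pick_config(front, flops_budget=6.8)
print([c["merge"] for c in front])
print("chosen under budget 6.8:", choice["merge"])
```

Precomputing the frontier offline is what allows a device to switch configurations at runtime, for example tightening the budget when channel quality drops.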

[392] Private Zeroth-Order Optimization with Public Data

Xuchen Gong, Tian Li

Main category: cs.LG

TL;DR: PAZO uses public data to improve gradient approximation in zeroth-order DP optimization, achieving better privacy/utility tradeoffs than DP-SGD with significant speedups.

Details

Motivation: First-order DP methods like DP-SGD have high computation/memory costs. Zeroth-order methods are easier to privatize but suffer from lower utility compared to DP-SGD.

Method: Proposed PAZO framework leverages public information to guide and improve gradient approximation in private zeroth-order algorithms with minimal overhead.

Result: PAZO achieves superior privacy/utility tradeoffs across vision and text tasks in pre-training and fine-tuning, outperforming first-order baselines in high-privacy regimes with up to 16× speedup.

Conclusion: Public-data-assisted zeroth-order optimization provides an effective alternative to DP-SGD, offering better performance in private settings with substantial computational benefits.

Abstract: One of the major bottlenecks for deploying popular first-order differentially private (DP) machine learning algorithms (e.g., DP-SGD) lies in their high computation and memory cost, despite the existence of optimized implementations. Zeroth-order methods have promise in mitigating the overhead, as they leverage function evaluations to approximate the gradients, hence significantly easier to privatize. While recent works have explored zeroth-order approaches in both private and non-private settings, they still suffer from relatively low utilities compared with DP-SGD, and have only been evaluated in limited application domains. In this work, we propose to leverage public information to guide and improve gradient approximation of private zeroth-order algorithms. We explore a suite of public-data-assisted zeroth-order optimizers (PAZO) with minimal overhead. We provide theoretical analyses of the PAZO framework under an assumption of the similarity between public and private data. Empirically, we demonstrate that PAZO achieves superior privacy/utility tradeoffs across vision and text tasks in both pre-training and fine-tuning settings, outperforming the best first-order baselines (with public data) especially in highly private regimes, while offering up to $16\times$ runtime speedup.
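To make the "easier to privatize" point concrete, here is a minimal sketch of a generic two-point zeroth-order gradient estimate in which the only data-dependent quantity is a scalar slope, so clipping and Gaussian noise are applied to one number rather than to a full gradient vector. This is not the PAZO algorithm and does not use public data; all function names and parameters are illustrative assumptions, and noise calibration to a privacy budget is not shown.

```python
import random
from typing import Callable, List

Vector = List[float]


def directional_slope(f: Callable[[Vector], float], x: Vector, u: Vector,
                      mu: float = 1e-3) -> float:
    """Central finite difference of f at x along direction u (two function evaluations)."""
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    return (f(x_plus) - f(x_minus)) / (2.0 * mu)


def privatize_scalar(value: float, clip: float, sigma: float,
                     rng: random.Random) -> float:
    """Clip a scalar to [-clip, clip], then add Gaussian noise with std sigma."""
    clipped = max(-clip, min(clip, value))
    return clipped + rng.gauss(0.0, sigma)


def private_zo_gradient(f: Callable[[Vector], float], x: Vector, u: Vector,
                        clip: float, sigma: float, rng: random.Random,
                        mu: float = 1e-3) -> Vector:
    """Gradient estimate: (noisy, clipped slope) times the search direction u."""
    slope = directional_slope(f, x, u, mu)
    noisy_slope = privatize_scalar(slope, clip, sigma, rng)
    return [noisy_slope * ui for ui in u]


# Toy loss f(x) = x0 + 2*x1 + 3*x2, probed along the second coordinate axis.
loss = lambda x: x[0] + 2.0 * x[1] + 3.0 * x[2]
rng = random.Random(0)
grad = private_zo_gradient(loss, [0.5, -1.0, 2.0], [0.0, 1.0, 0.0],
                           clip=10.0, sigma=0.0, rng=rng)
```

With `sigma=0.0` the example is deterministic and recovers the true directional derivative; a private run would use a positive `sigma` chosen from the clipping bound and the target privacy level.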

[393] Go-UT-Bench: A Fine-Tuning Dataset for LLM-Based Unit Test Generation in Go

Yashshi Pipalani, Hritik Raj, Rajat Ghosh, Vaishnavi Bhargava, Debojyoti Dutta

Main category: cs.LG

TL;DR: GO UT Bench is a benchmark dataset of 5264 pairs of code and unit tests from 10 Golang repositories, used to address data imbalance in code LLMs by improving unit test generation capabilities.

Details

Motivation: Training data imbalance in code LLMs overrepresents raw open-source code while underrepresenting broader software engineering tasks, especially in low-resource languages like Golang, causing models to struggle with real-world developer workflows like unit test generation.

Method: Introduce GO UT Bench dataset and evaluate its effectiveness by fine-tuning two LLM families (mixture of experts and dense decoders) on this dataset.

Result: Fine-tuned models outperform their base counterparts on more than 75% of benchmark tasks.

Conclusion: GO UT Bench effectively addresses the data imbalance problem and improves code LLMs’ performance on unit test generation tasks in Golang.

Abstract: Training data imbalance poses a major challenge for code LLMs. Most available data heavily overrepresents raw open-source code while underrepresenting broader software engineering tasks, especially in low-resource languages like Golang. As a result, models excel at code autocompletion but struggle with real-world developer workflows such as unit test generation. To address this gap, we introduce GO UT Bench, a benchmark dataset of 5264 pairs of code and unit tests, drawn from 10 permissively licensed Golang repositories spanning diverse domains. We evaluate its effectiveness as a fine-tuning dataset across two LLM families, i.e., mixture-of-experts and dense decoders. Our results show that fine-tuned models outperform their base counterparts on more than 75% of benchmark tasks.

[394] Incorporating Spatial Information into Goal-Conditioned Hierarchical Reinforcement Learning via Graph Representations

Shuyuan Zhang, Zihan Wang, Xiao-Wen Chang, Doina Precup

Main category: cs.LG

TL;DR: Proposes G4RL, a graph encoder-decoder method that enhances Goal-conditioned Hierarchical RL by generating subgoal representations for unseen states, improving sample efficiency and performance.

Details

Motivation: Existing graph-based GCHRL methods rely on domain-specific knowledge for graph construction or struggle to utilize dynamically created graphs effectively due to information transfer problems to new states, leading to sample inefficiency and poor subgoal representation.

Method: Develops a graph encoder-decoder to evaluate unseen states, which can be incorporated into any existing GCHRL method. Uses a network trained on the state graph generated during exploration to implement the encoder-decoder, leveraging high and low-level intrinsic rewards.

Result: Empirical results show significant performance enhancement for state-of-the-art GCHRL approaches with minimal computational cost in both dense and sparse reward environments.

Conclusion: G4RL effectively addresses key challenges in GCHRL by enabling better utilization of graph information for subgoal generation, particularly in environments with symmetric and reversible transitions.

Abstract: The integration of graphs with Goal-conditioned Hierarchical Reinforcement Learning (GCHRL) has recently gained attention, as intermediate goals (subgoals) can be effectively sampled from graphs that naturally represent the overall task structure in most RL tasks. However, existing approaches typically rely on domain-specific knowledge to construct these graphs, limiting their applicability to new tasks. Other graph-based approaches create graphs dynamically during exploration but struggle to fully utilize them, because they have problems passing the information in the graphs to newly visited states. Additionally, current GCHRL methods face challenges such as sample inefficiency and poor subgoal representation. This paper proposes a solution to these issues by developing a graph encoder-decoder to evaluate unseen states. Our proposed method, Graph-Guided sub-Goal representation Generation RL (G4RL), can be incorporated into any existing GCHRL method when operating in environments with primarily symmetric and reversible transitions to enhance performance across this class of problems. We show that the graph encoder-decoder can be effectively implemented using a network trained on the state graph generated during exploration. Empirical results indicate that leveraging high- and low-level intrinsic rewards from the graph encoder-decoder significantly enhances the performance of state-of-the-art GCHRL approaches at a small extra computational cost in dense and sparse reward environments.

[395] Multi-Joint Physics-Informed Deep Learning Framework for Time-Efficient Inverse Dynamics

Shuhao Ma, Zeyi Huang, Yu Cao, Wesley Doorsamy, Chaoyang Shi, Jun Li, Zhi-Qiang Zhang

Main category: cs.LG

TL;DR: A physics-informed deep learning framework using Multi-Joint Cross-Attention with BiGRU layers to estimate muscle activations and forces from kinematics without labeled data, achieving comparable performance to supervised methods.

Details

Motivation: Time-efficient estimation of muscle activations and forces is critical for clinical assessment and assistive device control, but conventional approaches are computationally expensive and lack high-quality labeled datasets for multi-joint applications.

Method: Proposed PI-MJCA-BiGRU framework with Multi-Joint Cross-Attention module and BiGRU layers to capture inter-joint coordination, embedding multi-joint dynamics, inter-joint coupling, and external force interactions into the loss function for physiologically consistent predictions.

Result: Experimental validation shows PI-MJCA-BiGRU achieves performance comparable to conventional supervised methods without requiring ground-truth labels, with MJCA module significantly enhancing inter-joint coordination modeling compared to baseline architectures.

Conclusion: The proposed physics-informed deep learning framework enables time-efficient, physiologically consistent estimation of muscle activations and forces directly from kinematics without labeled data, addressing computational and data limitations of conventional approaches.

Abstract: Time-efficient estimation of muscle activations and forces across multi-joint systems is critical for clinical assessment and assistive device control. However, conventional approaches are computationally expensive and lack a high-quality labeled dataset for multi-joint applications. To address these challenges, we propose a physics-informed deep learning framework that estimates muscle activations and forces directly from kinematics. The framework employs a novel Multi-Joint Cross-Attention (MJCA) module with Bidirectional Gated Recurrent Unit (BiGRU) layers to capture inter-joint coordination, enabling each joint to adaptively integrate motion information from others. By embedding multi-joint dynamics, inter-joint coupling, and external force interactions into the loss function, our Physics-Informed MJCA-BiGRU (PI-MJCA-BiGRU) delivers physiologically consistent predictions without labeled data while enabling time-efficient inference. Experimental validation on two datasets demonstrates that PI-MJCA-BiGRU achieves performance comparable to conventional supervised methods without requiring ground-truth labels, while the MJCA module significantly enhances inter-joint coordination modeling compared to other baseline architectures.

[396] Multi-View Polymer Representations for the Open Polymer Prediction

Wonjin Jung, Yongseok Choi

Main category: cs.LG

TL;DR: Multi-view polymer property prediction system combining four representation families achieves 9th place in NeurIPS 2025 Open Polymer Prediction Challenge.

Details

Motivation: To leverage complementary representations of polymers for improved property prediction accuracy by integrating multiple data views.

Method: Integrates four representation families: tabular RDKit/Morgan descriptors, graph neural networks, 3D-informed representations, and pretrained SMILES language models, using uniform ensemble averaging with 10-fold training and SMILES test-time augmentation.

Result: Ranked 9th out of 2241 teams in NeurIPS 2025 Open Polymer Prediction Challenge, achieving public MAE of 0.057 and private MAE of 0.082.

Conclusion: Multi-view ensemble approach effectively combines complementary polymer representations for state-of-the-art property prediction performance.

Abstract: We address polymer property prediction with a multi-view design that exploits complementary representations. Our system integrates four families: (i) tabular RDKit/Morgan descriptors, (ii) graph neural networks, (iii) 3D-informed representations, and (iv) pretrained SMILES language models, and averages per-property predictions via a uniform ensemble. Models are trained with 10-fold splits and evaluated with SMILES test-time augmentation. The approach ranks 9th of 2241 teams in the Open Polymer Prediction Challenge at NeurIPS 2025. The submitted ensemble achieves a public MAE of 0.057 and a private MAE of 0.082.
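The uniform ensemble and the MAE metric mentioned above can be sketched in a few lines. This is a minimal illustration with made-up numbers for a single property; the function names are hypothetical and nothing here reflects the authors' actual pipeline or the challenge's exact scoring.

```python
from typing import List


def uniform_ensemble(predictions: List[List[float]]) -> List[float]:
    """Average per-sample predictions across models with equal weights.

    `predictions` holds one list of floats per model, all of the same length.
    """
    n_models = len(predictions)
    return [sum(per_sample) / n_models for per_sample in zip(*predictions)]


def mean_absolute_error(y_true: List[float], y_pred: List[float]) -> float:
    """Mean of absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Two hypothetical models predicting one property for two polymers.
model_outputs = [[1.0, 2.0], [3.0, 4.0]]
ensemble = uniform_ensemble(model_outputs)
error = mean_absolute_error([2.5, 2.0], ensemble)
```

SMILES test-time augmentation would fit the same pattern: predictions for several equivalent SMILES strings of one molecule are averaged before the models are combined.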

[397] Graph Attention Network for Predicting Duration of Large-Scale Power Outages Induced by Natural Disasters

Chenghao Duan, Chuanyi Ji

Main category: cs.LG

TL;DR: The paper proposes a Graph Attention Network (GAT) approach to predict power outage duration from severe weather events, achieving over 93% accuracy and outperforming existing methods by 2-15%.

Details

Motivation: Natural disasters cause large-scale power outages with significant economic and societal impacts. Accurate prediction of outage recovery is crucial for power grid resilience, but existing methods face challenges with spatial dependency, heterogeneity, and limited event data.

Method: Uses Graph Attention Networks (GAT) with unsupervised pre-training followed by semi-supervised learning, trained on field data from four major hurricanes affecting 501 counties across eight Southeastern U.S. states.

Result: The model achieves excellent performance (>93% accuracy) and outperforms XGBoost, Random Forest, GCN, and simple GAT by 2-15% in both overall performance and class-wise accuracy.

Conclusion: The proposed GAT-based approach effectively addresses spatial dependency and heterogeneity challenges in power outage prediction, demonstrating superior performance for enhancing power grid resilience against severe weather events.

Abstract: Natural disasters such as hurricanes, wildfires, and winter storms have induced large-scale power outages in the U.S., resulting in tremendous economic and societal impacts. Accurately predicting power outage recovery and impact is key to the resilience of the power grid. Recent advances in machine learning offer viable frameworks for estimating power outage duration from geospatial and weather data. However, three major challenges are inherent to the task in a real-world setting: spatial dependency of the data, spatial heterogeneity of the impact, and moderate event data. We propose a novel approach to estimate the duration of severe weather-induced power outages through Graph Attention Networks (GAT). Our network uses a simple structure from unsupervised pre-training, followed by semi-supervised learning. We use field data from four major hurricanes affecting $501$ counties in eight Southeastern U.S. states. The model exhibits excellent performance ($>93\%$ accuracy) and outperforms the existing methods XGBoost, Random Forest, GCN and simple GAT by $2\% - 15\%$ in both the overall performance and class-wise accuracy.

[398] Towards Federated Clustering: A Client-wise Private Graph Aggregation Framework

Guanxiong He, Jie Wang, Liaoyuan Tang, Zheng Wang, Rong Wang, Feiping Nie

Main category: cs.LG

TL;DR: SPP-FGC is a federated clustering method that uses local structural graphs for privacy-preserving knowledge sharing, avoiding the performance-privacy trade-off in existing approaches.

Details

Motivation: Current federated clustering methods face a dilemma: transmitting embeddings risks data leakage while sharing only cluster prototypes reduces accuracy. There's a need for a solution that maintains both performance and privacy.

Method: Uses client-server architecture where clients build private structural graphs capturing data relationships. Server aggregates these graphs to form a global graph for clustering. Offers two modes: one-shot SPP-FGC for efficiency, and iterative SPP-FGC+ for complex data like images.

Result: Achieves state-of-the-art performance with up to 10% improvement in clustering accuracy (NMI) over federated baselines while maintaining provable privacy guarantees.

Conclusion: SPP-FGC successfully resolves the performance-privacy trade-off in federated clustering by using structural graphs as the sharing medium, offering both efficient one-shot and enhanced iterative variants for different data types.

Abstract: Federated clustering addresses the critical challenge of extracting patterns from decentralized, unlabeled data. However, it is hampered by the flaw that current approaches are forced to accept a compromise between performance and privacy: transmitting embedding representations risks sensitive data leakage, while sharing only abstract cluster prototypes leads to diminished model accuracy. To resolve this dilemma, we propose Structural Privacy-Preserving Federated Graph Clustering (SPP-FGC), a novel algorithm that innovatively leverages local structural graphs as the primary medium for privacy-preserving knowledge sharing, thus moving beyond the limitations of conventional techniques. Our framework operates on a clear client-server logic; on the client-side, each participant constructs a private structural graph that captures intrinsic data relationships, which the server then securely aggregates and aligns to form a comprehensive global graph from which a unified clustering structure is derived. The framework offers two distinct modes to suit different needs. SPP-FGC is designed as an efficient one-shot method that completes its task in a single communication round, ideal for rapid analysis. For more complex, unstructured data like images, SPP-FGC+ employs an iterative process where clients and the server collaboratively refine feature representations to achieve superior downstream performance. Extensive experiments demonstrate that our framework achieves state-of-the-art performance, improving clustering accuracy by up to 10% (NMI) over federated baselines while maintaining provable privacy guarantees.

[399] GraphToxin: Reconstructing Full Unlearned Graphs from Graph Unlearning

Ying Song, Balaji Palanisamy

Main category: cs.LG

TL;DR: GraphToxin is the first graph reconstruction attack against graph unlearning that can recover deleted individuals’ information, personal links, and sensitive content from connections, undermining privacy guarantees.

Details

Motivation: Graph unlearning solutions have vulnerabilities where residual traces of deleted data remain, allowing attackers to recover supposedly erased samples and compromise privacy regulations.

Method: Proposes GraphToxin with a novel curvature matching module for fine-grained guidance in full unlearned graph recovery, extending to multiple node removals under white-box and black-box settings.

Result: GraphToxin successfully recovers deleted information and connections; existing defenses are largely ineffective and sometimes amplify the attack, demonstrating severe privacy risks.

Conclusion: Urgent need for more effective and robust defense strategies against graph reconstruction attacks, highlighting the necessity of worst-case analysis for realistic vulnerability assessment.

Abstract: Graph unlearning has emerged as a promising solution for complying with “the right to be forgotten” regulations by enabling the removal of sensitive information upon request. However, this solution is not foolproof. The involvement of multiple parties creates new attack surfaces, and residual traces of deleted data can still remain in the unlearned graph neural networks. These vulnerabilities can be exploited by attackers to recover the supposedly erased samples, thereby undermining the inherent functionality of graph unlearning. In this work, we propose GraphToxin, the first graph reconstruction attack against graph unlearning. Specifically, we introduce a novel curvature matching module to provide fine-grained guidance for full unlearned graph recovery. We demonstrate that GraphToxin can successfully subvert the regulatory guarantees expected from graph unlearning: it can recover not only a deleted individual’s information and personal links but also sensitive content from their connections, thereby posing substantially more detrimental threats. Furthermore, we extend GraphToxin to multiple node removals under both white-box and black-box settings. We highlight the necessity of a worst-case analysis and propose a comprehensive evaluation framework to systematically assess the attack performance under both random and worst-case node removals. This provides a more robust and realistic measure of the vulnerability of graph unlearning methods to graph reconstruction attacks. Our extensive experiments demonstrate the effectiveness and flexibility of GraphToxin. Notably, we show that existing defense mechanisms are largely ineffective against this attack and, in some cases, can even amplify its performance. Given the severe privacy risks posed by GraphToxin, our work underscores the urgent need for the development of more effective and robust defense strategies against this attack.

[400] Cascading Bandits With Feedback

R Sri Prakash, Nikhil Karamchandani, Sharayu Moharir

Main category: cs.LG

TL;DR: Analysis of cascade bandit model for edge inference, comparing four policies: Explore-then-Commit and Action Elimination show suboptimal regret due to lack of adaptivity, while LCB and Thompson Sampling achieve constant O(1) regret through continuous feedback updates.

Details

Motivation: Addressing challenges of edge inference where inference models have varying accuracy and error probabilities, requiring efficient decision-making under uncertainty.

Method: Analyzed four decision-making policies: Explore-then-Commit, Action Elimination, Lower Confidence Bound (LCB), and Thompson Sampling in a cascade bandit model framework.

Result: Explore-then-Commit and Action Elimination incur suboptimal regret due to fixed ordering commitment after exploration. LCB and Thompson Sampling achieve constant O(1) regret through continuous adaptation to observed feedback.

Conclusion: Adaptivity is crucial for efficient edge inference under uncertainty, with LCB and Thompson Sampling outperforming non-adaptive policies by continuously updating decisions based on feedback.

Abstract: Motivated by the challenges of edge inference, we study a variant of the cascade bandit model in which each arm corresponds to an inference model with an associated accuracy and error probability. We analyse four decision-making policies, namely Explore-then-Commit, Action Elimination, Lower Confidence Bound (LCB), and Thompson Sampling, and provide sharp theoretical regret guarantees for each. Unlike in classical bandit settings, Explore-then-Commit and Action Elimination incur suboptimal regret because they commit to a fixed ordering after the exploration phase, limiting their ability to adapt. In contrast, LCB and Thompson Sampling continuously update their decisions based on observed feedback, achieving constant O(1) regret. Simulations corroborate these theoretical findings, highlighting the crucial role of adaptivity for efficient edge inference under uncertainty.

[401] Flow matching-based generative models for MIMO channel estimation

Wenkai Liu, Nan Ma, Jianqiao Chen, Xiaoxuan Qi, Yuhang Ma

Main category: cs.LG

TL;DR: Proposes a flow matching-based generative model for MIMO channel estimation that significantly reduces sampling overhead compared to diffusion model-based approaches while maintaining high accuracy.

Details

Motivation: Diffusion model-based channel estimation suffers from slow sampling speeds, which limits practical deployment despite showing potential for high-precision CSI acquisition.

Method: Formulates channel estimation within flow matching framework, constructs conditional probability path from noisy to true channel distribution, derives velocity field based on noise statistics, and uses ODE Euler solver for fast sampling.

Result: Significantly reduces sampling overhead compared to diffusion model-based schemes while achieving superior channel estimation accuracy under various channel conditions.

Conclusion: Flow matching-based approach provides an efficient alternative to diffusion models for MIMO channel estimation, offering faster sampling speeds without compromising accuracy.

Abstract: Diffusion model (DM)-based channel estimation, which generates channel samples through stepwise posterior sampling with a denoising process, has shown potential in high-precision channel state information (CSI) acquisition. However, slow sampling speed remains a key challenge for recently developed DM-based schemes. To alleviate this problem, we propose a novel flow matching (FM)-based generative model for multiple-input multiple-output (MIMO) channel estimation. We first formulate the channel estimation problem within the FM framework, where the conditional probability path is constructed from the noisy channel distribution to the true channel distribution. In this case, the path evolves along a straight-line trajectory at a constant speed. Guided by this construction, we derive a velocity field that depends solely on the noise statistics and use it to guide training of the generative model. Furthermore, during the sampling phase, we utilize the trained velocity field as prior information for channel estimation, which allows for quick and reliable enhancement of the noisy channel via an ordinary differential equation (ODE) Euler solver. Finally, numerical results demonstrate that the proposed FM-based channel estimation scheme can significantly reduce the sampling overhead compared to other popular DM-based schemes, such as the score matching (SM)-based scheme. Meanwhile, it achieves superior channel estimation accuracy under different channel conditions.
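The straight-line path and Euler sampling described above can be sketched as follows. This is a toy under stated assumptions: the vectors are real-valued stand-ins for channel coefficients, and the velocity is an oracle that already knows the target, whereas the paper trains a model to predict it. The names `straight_line_point`, `euler_sample`, and `oracle_velocity` are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def straight_line_point(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point at time t in [0, 1] on the straight path from x0 (noisy) to x1 (clean)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x_start: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x_start)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy values: a noisy observation and the clean target it should move toward.
noisy = [0.9, -0.4]
clean = [1.0, -0.5]


def oracle_velocity(x: Vector, t: float) -> Vector:
    """Constant velocity of the straight-line path (stands in for a trained network)."""
    return [b - a for a, b in zip(noisy, clean)]


estimate = euler_sample(oracle_velocity, noisy, num_steps=4)
midpoint = straight_line_point(noisy, clean, 0.5)
```

Because the path is straight and traversed at constant speed, a handful of Euler steps already lands on the target in this toy, which is the intuition behind the reduced sampling overhead relative to many-step diffusion samplers.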

[402] From Parameter to Representation: A Closed-Form Approach for Controllable Model Merging

Jialin Wu, Jian Yang, Handing Wang, Jiajun Wen, Zhiyong Yu

Main category: cs.LG

TL;DR: Proposes a closed-form solution for controllable model merging that replaces costly multi-objective optimization with a single-step linear transformation, enabling on-the-fly generation of Pareto-optimal models with linear complexity scaling.

Details

Motivation: Address challenges in model merging from parameter interference and enable explicit balance of performance trade-offs through controllable merging, overcoming limitations of existing compile-then-query approaches with exponential complexity growth.

Method: Shifts from parameter-space optimization to direct correction of model’s final representation, modeling correction as optimal linear transformation with closed-form solution that is architecture-agnostic and single-step.

Result: Generates superior Pareto front with more precise preference alignment and drastically reduced computational cost compared to existing methods.

Conclusion: The proposed method enables efficient on-the-fly generation of Pareto-optimal models with linear complexity scaling, making controllable model merging more practical and accessible.

Abstract: Model merging combines expert models for multitask performance but faces challenges from parameter interference. This has sparked recent interest in controllable model merging, giving users the ability to explicitly balance performance trade-offs. Existing approaches employ a compile-then-query paradigm, performing a costly offline multi-objective optimization to enable fast, preference-aware model generation. This offline stage typically involves iterative search or dedicated training, with complexity that grows exponentially with the number of tasks. To overcome these limitations, we shift the perspective from parameter-space optimization to a direct correction of the model’s final representation. Our approach models this correction as an optimal linear transformation, yielding a closed-form solution that replaces the entire offline optimization process with a single-step, architecture-agnostic computation. This solution directly incorporates user preferences, allowing a Pareto-optimal model to be generated on-the-fly with complexity that scales linearly with the number of tasks. Experimental results show our method generates a superior Pareto front with more precise preference alignment and drastically reduced computational cost.
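One way to see why a closed form can replace the offline search is to assume, purely for illustration, that the correction is a matrix $W$ fitted by preference-weighted ridge regression. The symbols below are assumptions and not the paper's notation: $Z_k$ stacks the merged model's final representations on task $k$, $T_k$ stacks the corresponding expert representations, $\alpha_k \ge 0$ are user preference weights, and $\lambda > 0$ is a regularization strength.

```latex
\begin{aligned}
W^\star(\alpha) &= \arg\min_{W} \sum_{k=1}^{K} \alpha_k \lVert Z_k W - T_k \rVert_F^2 + \lambda \lVert W \rVert_F^2 \\
0 &= \sum_{k=1}^{K} \alpha_k Z_k^\top \bigl( Z_k W^\star - T_k \bigr) + \lambda W^\star \\
W^\star(\alpha) &= \Bigl( \sum_{k=1}^{K} \alpha_k Z_k^\top Z_k + \lambda I \Bigr)^{-1} \sum_{k=1}^{K} \alpha_k Z_k^\top T_k
\end{aligned}
```

Under this assumed objective, the per-task matrices $Z_k^\top Z_k$ and $Z_k^\top T_k$ can be computed once; a new preference vector $\alpha$ then needs only $K$ weighted sums and one linear solve, so the cost grows linearly with the number of tasks.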

[403] How Data Quality Affects Machine Learning Models for Credit Risk Assessment

Andrea Maurino

Main category: cs.LG

TL;DR: Investigates how data quality issues affect ML models in credit risk assessment, showing varying robustness across models.

Details

Motivation: ML models are increasingly used for credit risk evaluation, but their effectiveness depends heavily on data quality, which is often compromised by issues like missing values, noise, outliers, and label errors.

Method: Used an open-source dataset with controlled data corruption via Pucktrick library to test robustness of 10 common ML models including Random Forest, SVM, and Logistic Regression.

Result: Found significant differences in model robustness depending on the type and severity of data degradation.

Conclusion: The methodology and tools provide practical support for enhancing data pipeline robustness and offer a flexible framework for further data-centric AI research.

Abstract: Machine Learning (ML) models are being increasingly employed for credit risk evaluation, with their effectiveness largely hinging on the quality of the input data. In this paper we investigate the impact of several data quality issues, including missing values, noisy attributes, outliers, and label errors, on the predictive accuracy of the machine learning models used in credit risk assessment. Utilizing an open-source dataset, we introduce controlled data corruption using the Pucktrick library to assess the robustness of 10 frequently used models, such as Random Forest, SVM, and Logistic Regression. Our experiments show significant differences in model robustness based on the nature and severity of the data degradation. Moreover, the proposed methodology and accompanying tools offer practical support for practitioners seeking to enhance data pipeline robustness, and provide researchers with a flexible framework for further experimentation in data-centric AI contexts.
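Controlled corruption of this kind can be sketched with the standard library alone. The code below does not use or imitate the Pucktrick API; `flip_labels` and `drop_values` are hypothetical helpers that corrupt a fixed fraction of binary labels or feature values, with a seed so that each severity level is reproducible.

```python
import random
from typing import List, Optional


def flip_labels(labels: List[int], rate: float, seed: int = 0) -> List[int]:
    """Flip a fraction `rate` of binary (0/1) labels at randomly chosen positions."""
    rng = random.Random(seed)
    n_flip = round(rate * len(labels))
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


def drop_values(values: List[float], rate: float,
                seed: int = 0) -> List[Optional[float]]:
    """Replace a fraction `rate` of values with None to simulate missing data."""
    rng = random.Random(seed)
    n_drop = round(rate * len(values))
    dropped = set(rng.sample(range(len(values)), n_drop))
    return [None if i in dropped else v for i, v in enumerate(values)]


# Toy data, made up for illustration.
labels = [0, 1] * 10
noisy_labels = flip_labels(labels, rate=0.25)
incomes = [float(v) for v in range(10)]
incomes_missing = drop_values(incomes, rate=0.2)
```

A robustness study would then train each model on data corrupted at several rates and compare the resulting accuracy against the clean baseline.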

[404] Unsupervised Robust Domain Adaptation: Paradigm, Theory and Algorithm

Fuxiang Huang, Xiaowei Fu, Shiyu Ye, Lina Ma, Wen Li, Xinbo Gao, David Zhang, Lei Zhang

Main category: cs.LG

TL;DR: This paper introduces Unsupervised Robust Domain Adaptation (URDA) to address both domain shift and adversarial attacks in UDA, proposing a Disentangled Adversarial Robustness Training (DART) method with theoretical generalization bounds.

Details

Motivation: Most UDA approaches focus on transfer ability but overlook robustness against adversarial attacks, and vanilla adversarial training (VAT) fails in UDA due to inherent entanglement challenges.

Method: Proposed URDA paradigm with DART algorithm - a two-step training procedure: 1) pre-train any UDA model, 2) apply instantaneous robustification via disentangled distillation to ensure both transferability and robustness.

Result: Experiments on four benchmark datasets show DART effectively enhances robustness while maintaining domain adaptability, validating the URDA paradigm and theory.

Conclusion: This work establishes the first URDA paradigm and theory, providing a simple yet effective solution for robust domain adaptation that resists both adversarial noise and domain shift.

Abstract: Unsupervised domain adaptation (UDA) aims to transfer knowledge from a label-rich source domain to an unlabeled target domain by addressing domain shifts. Most UDA approaches emphasize transfer ability, but often overlook robustness against adversarial attacks. Although vanilla adversarial training (VAT) improves the robustness of deep neural networks, it has little effect on UDA. This paper focuses on answering three key questions: 1) Why does VAT, known for its defensive effectiveness, fail in the UDA paradigm? 2) What is the generalization bound theory under attacks and how does it evolve from classical UDA theory? 3) How can we implement a robustification training procedure without complex modifications? Specifically, we explore and reveal the inherent entanglement challenge in the general UDA+VAT paradigm, and propose an unsupervised robust domain adaptation (URDA) paradigm. We further derive the generalization bound theory of the URDA paradigm so that it can resist adversarial noise and domain shift. To the best of our knowledge, this is the first work to establish the URDA paradigm and theory. We further introduce a simple, novel yet effective URDA algorithm called Disentangled Adversarial Robustness Training (DART), a two-step training procedure that ensures both transferability and robustness. DART first pre-trains an arbitrary UDA model, and then applies an instantaneous robustification post-training step via disentangled distillation. Experiments on four benchmark datasets with/without attacks show that DART effectively enhances robustness while maintaining domain adaptability, and validate the URDA paradigm and theory.

[405] Enhancing Graph Representations with Neighborhood-Contextualized Message-Passing

Brian Godwin Lim

Main category: cs.LG

TL;DR: Proposes neighborhood-contextualized message-passing (NCMP) framework to enhance GNN expressivity by incorporating broader local neighborhood context, and introduces SINC-GCN as a practical implementation.

Details

Motivation: Standard message-passing GNNs only consider pairwise node features, failing to capture rich contextual information from the broader local neighborhood, which limits their ability to learn complex relationships.

Method: Formalizes neighborhood-contextualization concept, generalizes message-passing to NCMP framework, and develops SINC-GCN as a practical parametrization method.

Result: Preliminary analysis on synthetic binary node classification shows improved expressivity and efficiency of the proposed SINC-GCN architecture.

Conclusion: NCMP framework provides a practical path to enhance graph representational power of classical GNNs by incorporating neighborhood context.

Abstract: Graph neural networks (GNNs) have become an indispensable tool for analyzing relational data. In the literature, classical GNNs may be classified into three variants: convolutional, attentional, and message-passing. While the standard message-passing variant is highly expressive, its typical pair-wise messages nevertheless only consider the features of the center node and each neighboring node individually. This design fails to incorporate the rich contextual information contained within the broader local neighborhood, potentially hindering its ability to learn complex relationships within the entire set of neighboring nodes. To address this limitation, this work first formalizes the concept of neighborhood-contextualization, rooted in a key property of the attentional variant. This then serves as the foundation for generalizing the message-passing variant to the proposed neighborhood-contextualized message-passing (NCMP) framework. To demonstrate its utility, a simple, practical, and efficient method to parametrize and operationalize NCMP is presented, leading to the development of the proposed Soft-Isomorphic Neighborhood-Contextualized Graph Convolution Network (SINC-GCN). A preliminary analysis on a synthetic binary node classification problem then underscores both the expressivity and efficiency of the proposed GNN architecture. Overall, the paper lays the foundation for the novel NCMP framework as a practical path toward further enhancing the graph representational power of classical GNNs.

[406] Echoless Label-Based Pre-computation for Memory-Efficient Heterogeneous Graph Learning

Jun Hu, Shangheng Chen, Yufei He, Yuan Li, Bryan Hooi, Bingsheng He

Main category: cs.LG

TL;DR: Echoless-LP eliminates training label leakage in pre-computation-based HGNNs using partition-focused echoless propagation, achieving superior performance while maintaining memory efficiency.

Details

Motivation: Address the echo effect in label-based pre-computation HGNNs where a node's own label information propagates back to itself during multi-hop message passing, causing training label leakage. Existing mitigation strategies are memory-inefficient or incompatible with advanced message passing methods.

Method: Propose Echoless Label-based Pre-computation (Echoless-LP) with Partition-Focused Echoless Propagation (PFEP) that partitions target nodes and performs echoless propagation where nodes collect label information only from neighbors in other partitions. Also introduce Asymmetric Partitioning Scheme (APS) and PostAdjust mechanism to address information loss and distributional shifts.

Result: Experiments on public datasets show Echoless-LP achieves superior performance and maintains memory efficiency compared to baseline methods.

Conclusion: Echoless-LP effectively eliminates training label leakage while being memory-efficient and compatible with any message passing method, making it suitable for large-scale heterogeneous graphs.

Abstract: Heterogeneous Graph Neural Networks (HGNNs) are widely used for deep learning on heterogeneous graphs. Typical end-to-end HGNNs require repetitive message passing during training, limiting efficiency for large-scale real-world graphs. Pre-computation-based HGNNs address this by performing message passing only once during preprocessing, collecting neighbor information into regular-shaped tensors, which enables efficient mini-batch training. Label-based pre-computation methods collect neighbors’ label information but suffer from training label leakage, where a node’s own label information propagates back to itself during multi-hop message passing - the echo effect. Existing mitigation strategies are memory-inefficient on large graphs or suffer from compatibility issues with advanced message passing methods. We propose Echoless Label-based Pre-computation (Echoless-LP), which eliminates training label leakage with Partition-Focused Echoless Propagation (PFEP). PFEP partitions target nodes and performs echoless propagation, where nodes in each partition collect label information only from neighbors in other partitions, avoiding echo while remaining memory-efficient and compatible with any message passing method. We also introduce an Asymmetric Partitioning Scheme (APS) and a PostAdjust mechanism to address information loss from partitioning and distributional shifts across partitions. Experiments on public datasets demonstrate that Echoless-LP achieves superior performance and maintains memory efficiency compared to baselines.
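
Illustration (not from the paper): the partition idea behind echoless propagation can be sketched on a small homogeneous graph. The function name `echoless_label_features`, the random partitioning, and the row-normalised mean aggregation are all assumptions made for this sketch; the paper targets heterogeneous graphs and additionally uses APS and PostAdjust, which are not shown here.

```python
import numpy as np

def echoless_label_features(adj, labels_onehot, train_mask, num_parts, hops=2, seed=0):
    """Hypothetical sketch: each node only receives labels from other partitions."""
    n = adj.shape[0]
    rng = np.random.default_rng(seed)
    part = rng.integers(0, num_parts, size=n)      # random partition id per node (assumption)
    deg = adj.sum(axis=1, keepdims=True)
    a_norm = adj / np.maximum(deg, 1)              # row-normalised adjacency (mean aggregation)
    out = np.zeros(labels_onehot.shape, dtype=float)
    for k in range(num_parts):
        in_part = part == k
        # Source labels: training nodes that are NOT in partition k.
        src = labels_onehot * (train_mask & ~in_part)[:, None]
        h = src.astype(float)
        for _ in range(hops):
            h = a_norm @ h                         # multi-hop propagation
        out[in_part] = h[in_part]                  # keep results only for partition k
    return out, part
```

Because a node's own label is zeroed out before propagating to its partition, changing that label cannot change the features the node receives, whatever the number of hops. The cost is that labels of same-partition neighbors are also dropped, which is the information loss the abstract says APS is meant to address.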

[407] Scalable Population Training for Zero-Shot Coordination

Bingyu Hui, Lebin Yu, Quanming Yao, Yunpeng Qu, Xudong Zhang, Jian Wang

Main category: cs.LG

TL;DR: Proposes Scalable Population Training (ScaPT) - an efficient framework for zero-shot coordination that enables large population training through parameter sharing and diversity regularization.

Details

Motivation: Existing population-based methods for zero-shot coordination are limited by computational resources and focus on small populations, missing performance gains from scaling up population size.

Method: ScaPT framework with two components: meta-agent for efficient parameter sharing across agents, and mutual information regularizer to maintain population diversity.

Result: Empirical evaluation in Hanabi shows ScaPT outperforms representative frameworks, confirming its effectiveness for zero-shot coordination.

Conclusion: ScaPT successfully addresses computational limitations of population-based training for zero-shot coordination, enabling scalable training with superior performance.

Abstract: Zero-shot coordination (ZSC) has become a hot topic in reinforcement learning research recently. It focuses on the generalization ability of agents, requiring them to coordinate well with previously unseen collaborators without any fine-tuning. Population-based training has been proven to provide good zero-shot coordination performance; nevertheless, existing methods are limited by computational resources, mainly focusing on optimizing diversity in small populations while neglecting the potential performance gains from scaling population size. To address this issue, this paper proposes the Scalable Population Training (ScaPT), an efficient training framework comprising two key components: a meta-agent that efficiently realizes a population by selectively sharing parameters across agents, and a mutual information regularizer that guarantees population diversity. To empirically validate the effectiveness of ScaPT, this paper evaluates it along with representative frameworks in Hanabi and confirms its superiority.

[408] Sheaf Cohomology of Linear Predictive Coding Networks

Jeffrey Seely

Main category: cs.LG

TL;DR: Predictive coding networks can be modeled as cellular sheaves, where sheaf cohomology identifies irreducible errors and the Hodge decomposition analyzes learning stalls in recurrent networks.

Details

Motivation: To understand why predictive coding networks sometimes fail to learn, particularly in recurrent topologies where feedback loops create internal contradictions that introduce irreducible prediction errors.

Method: Model linear predictive coding networks as cellular sheaves, use sheaf cohomology to characterize irreducible error patterns, and apply Hodge decomposition to analyze when contradictions cause learning to stall.

Result: The sheaf formalism provides tools to identify problematic network configurations and offers design principles for effective weight initialization in recurrent predictive coding networks.

Conclusion: Cellular sheaves offer a powerful mathematical framework for analyzing and improving predictive coding networks, particularly for understanding and mitigating learning failures in recurrent architectures.

Abstract: Predictive coding (PC) replaces global backpropagation with local optimization over weights and activations. We show that linear PC networks admit a natural formulation as cellular sheaves: the sheaf coboundary maps activations to edge-wise prediction errors, and PC inference is diffusion under the sheaf Laplacian. Sheaf cohomology then characterizes irreducible error patterns that inference cannot remove. We analyze recurrent topologies where feedback loops create internal contradictions, introducing prediction errors unrelated to supervision. Using a Hodge decomposition, we determine when these contradictions cause learning to stall. The sheaf formalism provides both diagnostic tools for identifying problematic network configurations and design principles for effective weight initialization for recurrent PC networks.
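
Illustration (not from the paper): a minimal numerical sketch of the coboundary view for a feedforward chain, assuming edge errors of the form e_l = x_l - W_l x_{l-1} and inference as gradient descent on the energy 0.5 * ||delta x||^2 with the input clamped. The helper names `coboundary` and `infer`, the sign convention, and the step size are assumptions of this sketch.

```python
import numpy as np

def coboundary(weights):
    """Stack edge-wise prediction errors e_l = x_l - W_l x_{l-1} into one matrix."""
    dims = [weights[0].shape[1]] + [w.shape[0] for w in weights]
    offsets = np.concatenate([[0], np.cumsum(dims)])
    delta = np.zeros((sum(dims[1:]), sum(dims)))
    row = 0
    for l, w in enumerate(weights, start=1):
        rows = slice(row, row + dims[l])
        delta[rows, offsets[l - 1]:offsets[l]] = -w            # prediction from layer l-1
        delta[rows, offsets[l]:offsets[l + 1]] = np.eye(dims[l])  # activation of layer l
        row += dims[l]
    return delta, offsets

def infer(delta, x, free, steps=2000, lr=0.1):
    """Diffusion under the Laplacian delta^T delta, updating only the free coordinates."""
    lap = delta.T @ delta
    x = x.copy()
    for _ in range(steps):
        x[free] -= lr * (lap @ x)[free]
    return x
```

In this feedforward chain with only the input clamped, inference can drive every edge error to zero, so the free activations settle at the forward-pass values. The recurrent case discussed in the abstract, where some error cannot be removed, is not reproduced by this sketch.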

[409] SMART: A Surrogate Model for Predicting Application Runtime in Dragonfly Systems

Xin Wang, Pietro Lodi Rizzini, Sourav Medya, Zhiling Lan

Main category: cs.LG

TL;DR: A hybrid simulation model combining GNNs and LLMs for accurate runtime prediction in Dragonfly networks, outperforming existing methods.

Details

Motivation: Address workload interference in Dragonfly networks and overcome computational limitations of high-fidelity PDES for large-scale/real-time scenarios.

Method: Combines graph neural networks (GNNs) and large language models (LLMs) to capture spatial and temporal patterns from port level router data.

Result: Outperforms existing statistical and machine learning baselines in runtime prediction accuracy.

Conclusion: Enables efficient hybrid simulation of Dragonfly networks through accurate runtime forecasting.

Abstract: The Dragonfly network, with its high-radix and low-diameter structure, is a leading interconnect in high-performance computing. A major challenge is workload interference on shared network links. Parallel discrete event simulation (PDES) is commonly used to analyze workload interference. However, high-fidelity PDES is computationally expensive, making it impractical for large-scale or real-time scenarios. Hybrid simulation that incorporates data-driven surrogate models offers a promising alternative, especially for forecasting application runtime, a task complicated by the dynamic behavior of network traffic. We present SMART, a surrogate model that combines graph neural networks (GNNs) and large language models (LLMs) to capture both spatial and temporal patterns from port-level router data. SMART outperforms existing statistical and machine learning baselines, enabling accurate runtime prediction and supporting efficient hybrid simulation of Dragonfly networks.

[410] Improving Continual Learning of Knowledge Graph Embeddings via Informed Initialization

Gerard Pons, Besim Bilalli, Anna Queralt

Main category: cs.LG

TL;DR: Proposes a novel embedding initialization strategy for continual learning in Knowledge Graph Embeddings that uses KG schema and existing embeddings to initialize new entities, improving performance and reducing training time.

Details

Motivation: Knowledge Graphs are frequently updated, requiring KGEs to adapt. Current continual learning methods need better initialization strategies for new entities to improve accuracy and reduce training time, especially for small frequent updates.

Method: Uses KG schema and previously learned embeddings to obtain initial representations for new entities based on their classes, which can be integrated into existing continual learning methods.

Result: Improves predictive performance and knowledge retention, accelerates knowledge acquisition by reducing required training epochs, and shows benefits across various KGE learning models.

Conclusion: The proposed informed embedding initialization strategy effectively enhances continual learning for KGEs by improving performance, reducing forgetting, and accelerating training across different model types.

Abstract: Many Knowledge Graphs (KGs) are frequently updated, forcing their Knowledge Graph Embeddings (KGEs) to adapt to these changes. To address this problem, continual learning techniques for KGEs incorporate embeddings for new entities while updating the old ones. One necessary step in these methods is the initialization of the embeddings, as an input to the KGE learning process, which can have an important impact on the accuracy of the final embeddings, as well as on the time required to train them. This is especially relevant for relatively small and frequent updates. We propose a novel informed embedding initialization strategy, which can be seamlessly integrated into existing continual learning methods for KGE, that enhances the acquisition of new knowledge while reducing catastrophic forgetting. Specifically, the KG schema and the previously learned embeddings are utilized to obtain initial representations for the new entities, based on the classes the entities belong to. Our extensive experimental analysis shows that the proposed initialization strategy improves the predictive performance of the resulting KGEs, while also enhancing knowledge retention. Furthermore, our approach accelerates knowledge acquisition, reducing the number of epochs, and therefore time, required to incrementally learn new embeddings. Finally, its benefits across various types of KGE learning models are demonstrated.

[411] Anomaly Detection in High-Dimensional Bank Account Balances via Robust Methods

Federico Maddanu, Tommaso Proietti, Riccardo Crupi

Main category: cs.LG

TL;DR: Proposes robust statistical methods for detecting point anomalies in bank account balances that are computationally efficient for high-dimensional datasets.

Details

Motivation: Detecting anomalies in bank account balances is crucial for identifying fraud and operational issues, but traditional robust statistics methods are inefficient and computationally expensive in high-dimensional settings.

Method: Developed and evaluated several robust approaches designed for medium to high-dimensional datasets with high breakdown points and low computational time.

Result: Empirically evaluated the proposed methods on approximately 2.6 million daily records of anonymous users’ bank account balances.

Conclusion: The proposed robust statistical approaches offer computational efficiency while maintaining effectiveness for anomaly detection in high-dimensional financial data.

Abstract: Detecting point anomalies in bank account balances is essential for financial institutions, as it enables the identification of potential fraud, operational issues, or other irregularities. Robust statistics is useful for flagging outliers and for providing estimates of the data distribution parameters that are not affected by contaminated observations. However, such a strategy is often less efficient and computationally expensive in high-dimensional settings. In this paper, we propose and empirically evaluate several robust approaches that may be computationally efficient in medium- and high-dimensional datasets, with high breakdown points and low computational time. Our application deals with around 2.6 million daily records of anonymous users’ bank account balances.

[412] Deep Learning for Short-Term Precipitation Prediction in Four Major Indian Cities: A ConvLSTM Approach with Explainable AI

Tanmay Ghosh, Shaurabh Anand, Rakesh Gomaji Nannewar, Nithin Nagaraj

Main category: cs.LG

TL;DR: Interpretable deep learning framework for precipitation forecasting using hybrid CNN-ConvLSTM architecture optimized for four Indian cities, achieving accurate predictions with transparent insights through xAI methods.

Details

Motivation: To address the black-box nature of deep learning models in precipitation forecasting and enhance transparency while maintaining accuracy for real-world weather prediction applications.

Method: Hybrid Time-Distributed CNN-ConvLSTM architecture trained on multi-decadal ERA5 reanalysis data, optimized with different convolutional filters for each city (32 for Bengaluru, 64 for Mumbai/Delhi, 128 for Kolkata), using interpretability analysis with permutation importance, Grad-CAM, temporal occlusion, and counterfactual perturbation.

Result: Achieved RMSE values of 0.21 mm/day (Bengaluru), 0.52 mm/day (Mumbai), 0.48 mm/day (Delhi), and 1.80 mm/day (Kolkata), with prediction horizons ranging from 1 day (Bengaluru) to 5 days (Kolkata), identifying city-specific variable dependencies.

Conclusion: Demonstrates that explainable AI can provide both accurate precipitation forecasts and transparent insights into precipitation patterns across diverse urban environments, enabling better adoption in real-world weather prediction.

Abstract: Deep learning models for precipitation forecasting often function as black boxes, limiting their adoption in real-world weather prediction. To enhance transparency while maintaining accuracy, we developed an interpretable deep learning framework for short-term precipitation prediction in four major Indian cities: Bengaluru, Mumbai, Delhi, and Kolkata, spanning diverse climate zones. We implemented a hybrid Time-Distributed CNN-ConvLSTM (Convolutional Neural Network-Long Short-Term Memory) architecture, trained on multi-decadal ERA5 reanalysis data. The architecture was optimized for each city with a different number of convolutional filters: Bengaluru (32), Mumbai and Delhi (64), and Kolkata (128). The models achieved root mean square error (RMSE) values of 0.21 mm/day (Bengaluru), 0.52 mm/day (Mumbai), 0.48 mm/day (Delhi), and 1.80 mm/day (Kolkata). Through interpretability analysis using permutation importance, Gradient-weighted Class Activation Mapping (Grad-CAM), temporal occlusion, and counterfactual perturbation, we identified distinct patterns in the model’s behavior. The model relied on city-specific variables, with prediction horizons ranging from one day for Bengaluru to five days for Kolkata. This study demonstrates how explainable AI (xAI) can provide accurate forecasts and transparent insights into precipitation patterns in diverse urban environments.

[413] Adaptive Symmetrization of the KL Divergence

Omri Ben-Dov, Luiz F. O. Chamon

Main category: cs.LG

TL;DR: The paper proposes a new method to minimize Jeffreys divergence using a proxy model that jointly trains with the main model through constrained optimization, combining advantages of normalizing flows and energy-based models.

Details

Motivation: Existing approaches for learning probability distributions have limitations: forward KL divergence is asymmetric and may miss target properties, while symmetric alternatives like Jeffreys divergence are challenging to compute from samples.

Method: A proxy model is trained jointly with the main model through constrained optimization, where the proxy model both fits the data and assists in optimizing the Jeffreys divergence of the main model.

Result: The framework enables combining advantages of normalizing flows and energy-based models for tasks like density estimation, image generation, and simulation-based inference.

Conclusion: The proposed joint training approach provides a practical algorithm for minimizing Jeffreys divergence and adapts model priorities throughout training.

Abstract: Many tasks in machine learning can be described as or reduced to learning a probability distribution given a finite set of samples. A common approach is to minimize a statistical divergence between the (empirical) data distribution and a parameterized distribution, e.g., a normalizing flow (NF) or an energy-based model (EBM). In this context, the forward KL divergence is ubiquitous due to its tractability, though its asymmetry may prevent capturing some properties of the target distribution. Symmetric alternatives involve brittle min-max formulations and adversarial training (e.g., generative adversarial networks) or evaluating the reverse KL divergence, as is the case for the symmetric Jeffreys divergence, which is challenging to compute from samples. This work sets out to develop a new approach to minimize the Jeffreys divergence. To do so, it uses a proxy model whose goal is not only to fit the data, but also to assist in optimizing the Jeffreys divergence of the main model. This joint training task is formulated as a constrained optimization problem to obtain a practical algorithm that adapts the models' priorities throughout training. We illustrate how this framework can be used to combine the advantages of NFs and EBMs in tasks such as density estimation, image generation, and simulation-based inference.
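
Illustration (not from the paper): for discrete distributions with known probabilities, the Jeffreys divergence is the sum of the forward and reverse KL divergences (some texts include a factor of 1/2). The sketch below only shows that definition and the asymmetry of KL; it does not implement the paper's proxy-model training, and the function names are chosen for this example.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute zero."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jeffreys(p, q):
    """Symmetric Jeffreys divergence: forward KL plus reverse KL."""
    return kl(p, q) + kl(q, p)
```

When only samples from the data distribution are available, the reverse term requires the data density at model samples, which is why the abstract calls this divergence challenging to compute from samples.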

[414] Training Neural Networks at Any Scale

Thomas Pethick, Kimon Antonakopoulos, Antonio Silveti-Falls, Leena Chennuru Vankadara, Volkan Cevher

Main category: cs.LG

TL;DR: Review of modern neural network optimization methods focusing on efficiency and scalability, presenting state-of-the-art algorithms through a unified template.

Details

Motivation: To provide an accessible introduction to modern optimization techniques for neural networks, addressing the need for efficient and scalable training methods as models grow in size and complexity.

Method: Presents optimization algorithms under a unified algorithmic template that emphasizes adaptation to problem structures, and covers methods to make algorithms scale-agnostic.

Result: A comprehensive review that organizes state-of-the-art optimization methods into a coherent framework, highlighting their structural adaptations and scalability properties.

Conclusion: The article serves as an introductory guide for practitioners and researchers to understand and engage with modern developments in neural network optimization, particularly focusing on efficiency and scalability.

Abstract: This article reviews modern optimization methods for training neural networks with an emphasis on efficiency and scale. We present state-of-the-art optimization algorithms under a unified algorithmic template that highlights the importance of adapting to the structures in the problem. We then cover how to make these algorithms agnostic to the scale of the problem. Our exposition is intended as an introduction for both practitioners and researchers who wish to be involved in these exciting new developments.

[415] Power Ensemble Aggregation for Improved Extreme Event AI Prediction

Julien Collard, Pierre Gentine, Tian Zheng

Main category: cs.LG

TL;DR: Using power mean aggregation of ensemble predictions improves heat wave forecasting accuracy compared to standard mean prediction, with better performance for higher extremes.

Details

Motivation: To improve predictions of climate extreme events, specifically heat waves, using machine learning methods.

Method: Frame as classification problem predicting if temperature exceeds local quantile; use power mean aggregation of ensemble predictions from generative ML weather forecasting model.

Result: Power mean aggregation significantly enhances classifier performance, achieving better accuracy than typical mean prediction, with increased effectiveness for higher extremes.

Conclusion: Power aggregation method shows promise and adaptability for extreme heat event prediction, with optimal performance varying by quantile threshold.

Abstract: This paper addresses the critical challenge of improving predictions of climate extreme events, specifically heat waves, using machine learning methods. Our work is framed as a classification problem in which we try to predict whether surface air temperature will exceed its q-th local quantile within a specified timeframe. Our key finding is that aggregating ensemble predictions using a power mean significantly enhances the classifier’s performance. By making a machine-learning based weather forecasting model generative and applying this non-linear aggregation method, we achieve better accuracy in predicting extreme heat events than with the typical mean prediction from the same model. Our power aggregation method shows promise and adaptability, as its optimal performance varies with the quantile threshold chosen, demonstrating increased effectiveness for higher extremes prediction.
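
Illustration (not from the paper): a power (generalized) mean over ensemble members can be sketched as below. It assumes each member outputs a positive score per location, stacked along axis 0; the function name and the handling of the exponent `p = 0` as the geometric-mean limit are choices made for this sketch, and the paper's exact aggregation may differ.

```python
import numpy as np

def power_mean(members, p):
    """Power mean across ensemble members (axis 0); p = 1 is the ordinary mean."""
    members = np.asarray(members, dtype=float)
    if p == 0:
        # Limit of the power mean as p -> 0: the geometric mean.
        return np.exp(np.mean(np.log(members), axis=0))
    return np.mean(members ** p, axis=0) ** (1.0 / p)
```

Exponents above 1 weight the largest member values more heavily than the plain mean does, which is one plausible reason such aggregation could help for rare, high-threshold events; when all members agree, every exponent returns the same value.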

[416] On-line learning of dynamic systems: sparse regression meets Kalman filtering

Gianluigi Pillonetto, Akram Yazdani, Aleksandr Aravkin

Main category: cs.LG

TL;DR: The paper introduces Sindy Kalman Filter (SKF), which combines Sparse Identification of Nonlinear Dynamics (SINDy) with Kalman filtering for real-time learning of time-varying nonlinear dynamical systems.

Details

Motivation: To enable real-time learning of governing equations from data for complex physical systems with time-varying parameters, overcoming limitations of existing methods that cannot handle such dynamics.

Method: Integrates SINDy algorithm with Kalman filter by treating unknown system parameters as state variables, and enhances parameter identification using look-ahead error to estimate sparsity levels and switching instants.

Result: SKF successfully identifies complex, time-varying nonlinear models in real-time, validated on chaotic Lorenz systems with drifting/switching parameters and real flight data from aircraft modeling.

Conclusion: SKF provides a unified framework that enables real-time inference of time-varying nonlinear systems, significantly improving parameter identification and simplifying estimation of key parameters.

Abstract: Learning governing equations from data is central to understanding the behavior of physical systems across diverse scientific disciplines, including physics, biology, and engineering. The Sindy algorithm has proven effective in leveraging sparsity to identify concise models of nonlinear dynamical systems. In this paper, we extend sparsity-driven approaches to real-time learning by integrating a cornerstone algorithm from control theory – the Kalman filter (KF). The resulting Sindy Kalman Filter (SKF) unifies both frameworks by treating unknown system parameters as state variables, enabling real-time inference of complex, time-varying nonlinear models unattainable by either method alone. Furthermore, SKF enhances KF parameter identification strategies, particularly via look-ahead error, significantly simplifying the estimation of sparsity levels, variance parameters, and switching instants. We validate SKF on a chaotic Lorenz system with drifting or switching parameters and demonstrate its effectiveness in the real-time identification of a sparse nonlinear aircraft model built from real flight data.
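
Illustration (not from the paper): the idea of treating unknown coefficients as Kalman filter states can be sketched for a scalar system with a small candidate library. Everything here is an assumption of the sketch: the library [1, x, x^2], the random-walk parameter model, the noise levels `q` and `r`, and the final hard threshold, which is only a crude stand-in for sparsity selection. The paper's look-ahead error and switching detection are not implemented.

```python
import numpy as np

def library(x):
    """Candidate terms for a scalar state: [1, x, x^2] (hypothetical choice)."""
    return np.array([1.0, x, x * x])

def kf_parameter_step(theta, P, phi, y, q=1e-6, r=1e-4):
    """One Kalman step with parameters as states: theta_k = theta_{k-1} + w, y = phi @ theta + v."""
    P = P + q * np.eye(theta.size)            # predict: random-walk drift of the parameters
    s = phi @ P @ phi + r                     # innovation variance (scalar measurement)
    gain = P @ phi / s
    theta = theta + gain * (y - phi @ theta)  # correct with the new measurement
    P = P - np.outer(gain, phi @ P)
    return theta, P

def run_demo(n=500, seed=0):
    """Recover dx/dt = -2 x from noisy derivative samples, one sample at a time."""
    rng = np.random.default_rng(seed)
    theta, P = np.zeros(3), 10.0 * np.eye(3)
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = -2.0 * x + 0.01 * rng.standard_normal()
        theta, P = kf_parameter_step(theta, P, library(x), y)
    sparse = np.where(np.abs(theta) < 0.1, 0.0, theta)  # crude sparsification stand-in
    return theta, sparse
```

The process noise `q` controls how quickly the estimate can follow drifting coefficients: a larger value tracks changes faster at the cost of noisier estimates.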

[417] Dynamic Deep Graph Learning for Incomplete Multi-View Clustering with Masked Graph Reconstruction Loss

Zhenghao Zhang, Jun Xie, Xingchen Chen, Tao Yu, Hongzhu Yi, Kaixin Xu, Yuanxiang Wang, Tianyu Zong, Xinming Wang, Jiahuan Chen, Guoqing Chao, Feng Chen, Zhepeng Wang, Jungang Xu

Main category: cs.LG

TL;DR: DGIMVCM is a novel method for incomplete multi-view clustering that uses dynamic deep graph learning with masked graph reconstruction loss to address noise in graph construction and optimization.

Details

Motivation: Real-world multi-view data often has missing views, making incomplete multi-view clustering important. Existing GNN-based methods have limitations: (1) KNN-based static graphs introduce noise and reduce robustness, (2) MSE loss for graph reconstruction causes gradient noise during optimization.

Method: Proposes DGIMVCM with three components: (1) constructs missing-robust global graph, uses graph convolutional embedding to extract features and dynamic view-specific graphs with imputation, plus graph structure contrastive learning; (2) graph self-attention encoder with masked graph reconstruction loss; (3) clustering module with pseudo-label self-supervised training.

Result: Extensive experiments on multiple datasets validate the effectiveness and superiority of DGIMVCM compared to existing methods.

Conclusion: DGIMVCM successfully addresses the challenges of noise in graph construction and gradient noise during optimization, providing an effective solution for incomplete multi-view clustering.

Abstract: The prevalence of real-world multi-view data makes incomplete multi-view clustering (IMVC) a crucial research topic. The rapid development of Graph Neural Networks (GNNs) has established them as one of the mainstream approaches for multi-view clustering. Despite significant progress in GNN-based IMVC, some challenges remain: (1) Most methods rely on the K-Nearest Neighbors (KNN) algorithm to construct static graphs from raw data, which introduces noise and diminishes the robustness of the graph topology. (2) Existing methods typically utilize the Mean Squared Error (MSE) loss between the reconstructed graph and the sparse adjacency graph directly as the graph reconstruction loss, leading to substantial gradient noise during optimization. To address these issues, we propose a novel Dynamic Deep Graph Learning for Incomplete Multi-View Clustering with Masked Graph Reconstruction Loss (DGIMVCM). Firstly, we construct a missing-robust global graph from the raw data. A graph convolutional embedding layer is then designed to extract primary features and refined dynamic view-specific graph structures, leveraging the global graph for imputation of missing views. This process is complemented by graph structure contrastive learning, which identifies consistency among view-specific graph structures. Secondly, a graph self-attention encoder is introduced to extract high-level representations based on the imputed primary features and view-specific graphs, and is optimized with a masked graph reconstruction loss to mitigate gradient noise during optimization. Finally, a clustering module is constructed and optimized through a pseudo-label self-supervised training mechanism. Extensive experiments on multiple datasets validate the effectiveness and superiority of DGIMVCM.

[418] LoRaCompass: Robust Reinforcement Learning to Efficiently Search for a LoRa Tag

Tianlang He, Zhongming Lin, Tianrui Jiang, S. -H. Gary Chan

Main category: cs.LG

TL;DR: LoRaCompass is a reinforcement learning model for efficiently locating LoRa tags worn by at-risk individuals, addressing domain shift and signal fluctuation issues to achieve robust search performance in unknown environments.

Details

Motivation: Existing methods for locating LoRa tags using RSSI are vulnerable to domain shift and signal fluctuation, leading to cascading decision errors and substantial localization inaccuracies, especially for mentally incapacitated persons and other at-risk individuals.

Method: Uses reinforcement learning with a spatially-aware feature extractor and policy distillation loss to learn robust spatial representations from RSSI, plus an exploration function inspired by upper confidence bound (UCB) to guide the sensor toward the tag with increasing confidence.

Result: Validated in diverse environments covering 80km², achieved >90% success rate in locating tags within 100m proximity (40% improvement over existing methods) with search path length scaling linearly with initial distance.

Conclusion: LoRaCompass provides robust and efficient search for LoRa tags, significantly outperforming existing methods in both success rate and efficiency across various real-world scenarios.

Abstract: The Long-Range (LoRa) protocol, known for its extensive range and low power, has increasingly been adopted in tags worn by mentally incapacitated persons (MIPs) and others at risk of going missing. We study the sequential decision-making process for a mobile sensor to locate a periodically broadcasting LoRa tag with the fewest moves (hops) in general, unknown environments, guided by the received signal strength indicator (RSSI). While existing methods leverage reinforcement learning for search, they remain vulnerable to domain shift and signal fluctuation, resulting in cascading decision errors that culminate in substantial localization inaccuracies. To bridge this gap, we propose LoRaCompass, a reinforcement learning model designed to achieve robust and efficient search for a LoRa tag. For exploitation under domain shift and signal fluctuation, LoRaCompass learns a robust spatial representation from RSSI to maximize the probability of moving closer to a tag, via a spatially-aware feature extractor and a policy distillation loss function. It further introduces an exploration function inspired by the upper confidence bound (UCB) that guides the sensor toward the tag with increasing confidence. We have validated LoRaCompass in ground-based and drone-assisted scenarios within diverse unseen environments covering an area of over 80km². It has demonstrated a high success rate (>90%) in locating the tag within 100m proximity (a 40% improvement over existing methods) and high efficiency, with a search path length (in hops) that scales linearly with the initial distance.
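The abstract says only that the exploration function is "inspired by" UCB and does not give its form. For orientation, here is a minimal sketch of the classical UCB1 rule that it alludes to, not LoRaCompass's own function. The names `ucb_scores` and `select_action` and the constant `c` are hypothetical; each candidate move is treated as a bandit arm.

```python
import math


def ucb_scores(mean_reward, counts, t, c=1.0):
    """Classical UCB1 score per action: empirical mean plus an exploration bonus.

    mean_reward[i] is the average reward observed for action i,
    counts[i] is how often action i was tried, and t is the total step count.
    """
    scores = []
    for mu, n in zip(mean_reward, counts):
        if n == 0:
            # Untried actions get an infinite score so they are explored first.
            scores.append(math.inf)
        else:
            scores.append(mu + c * math.sqrt(2.0 * math.log(t) / n))
    return scores


def select_action(mean_reward, counts, t, c=1.0):
    """Pick the action with the highest UCB score (first index wins ties)."""
    scores = ucb_scores(mean_reward, counts, t, c)
    return max(range(len(scores)), key=scores.__getitem__)
```

The bonus term shrinks as an action is tried more often, so the rule moves from exploration to exploitation as evidence accumulates. That is the general behaviour the abstract describes as "increasing confidence".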

[419] When to Stop Federated Learning: Zero-Shot Generation of Synthetic Validation Data with Generative AI for Early Stopping

Youngjoon Lee, Hyukjoon Lee, Jinu Gong, Yang Cao, Joonhyuk Kang

Main category: cs.LG

TL;DR: Zero-shot synthetic validation framework using generative AI to enable early stopping in Federated Learning, reducing training rounds by up to 74% while maintaining accuracy.

Details

Motivation: Standard FL methods run for predefined rounds, leading to unnecessary computation when optimal performance is reached earlier, or continuing training even when models fail to achieve meaningful performance.

Method: Introduces a zero-shot synthetic validation framework that leverages generative AI to monitor model performance and determine early stopping points adaptively.

Result: Numerical results on multi-label chest X-ray classification show training rounds reduced by up to 74% while maintaining accuracy within 1% of optimal.

Conclusion: The proposed method effectively conserves computational resources and enables rapid hyperparameter adjustments by stopping training near optimal rounds.

Abstract: Federated Learning (FL) enables collaborative model training across decentralized devices while preserving data privacy. However, FL methods typically run for a predefined number of global rounds, often leading to unnecessary computation when optimal performance is reached earlier. In addition, training may continue even when the model fails to achieve meaningful performance. To address this inefficiency, we introduce a zero-shot synthetic validation framework that leverages generative AI to monitor model performance and determine early stopping points. Our approach adaptively stops training near the optimal round, thereby conserving computational resources and enabling rapid hyperparameter adjustments. Numerical results on multi-label chest X-ray classification demonstrate that our method reduces training rounds by up to 74% while maintaining accuracy within 1% of the optimal.
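The abstract does not spell out the stopping rule. The sketch below assumes a common patience-based criterion applied to per-round scores measured on the synthetic validation set. The function name `early_stopping_round` and the parameters `patience` and `min_delta` are hypothetical, not taken from the paper.

```python
def early_stopping_round(val_scores, patience=3, min_delta=0.0):
    """Return (stop_round, best_round) for a sequence of per-round validation scores.

    Training stops once `patience` consecutive rounds have passed without
    improving on the best score by more than `min_delta`. Rounds are 1-indexed.
    """
    best_score = float("-inf")
    best_round = 0
    for rnd, score in enumerate(val_scores, start=1):
        if score > best_score + min_delta:
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            return rnd, best_round
    # No stop was triggered: training ran for all available rounds.
    return len(val_scores), best_round


# Hypothetical accuracy on synthetic validation data after each global round.
scores = [0.60, 0.70, 0.75, 0.74, 0.75, 0.73, 0.72]
stop_round, best_round = early_stopping_round(scores, patience=3)
```

Because the validation data is synthetic, a rule of this kind can run on the server without touching client data, consistent with the privacy goal stated in the abstract.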

[420] A Best-of-Both-Worlds Proof for Tsallis-INF without Fenchel Conjugates

Wei-Cheng Lee, Francesco Orabona

Main category: cs.LG

TL;DR: Simple derivation of best-of-both-worlds guarantee for Tsallis-INF bandit algorithm using modern online convex optimization tools, avoiding conjugate functions.

Details

Motivation: To provide a simpler proof for the Tsallis-INF algorithm's performance guarantees in both stochastic and adversarial bandit settings.

Method: Uses modern online convex optimization tools and avoids conjugate functions, focusing on a streamlined proof approach.

Result: Achieves a best-of-both-worlds guarantee for the Tsallis-INF multi-armed bandit algorithm with a more elegant proof.

Conclusion: The paper presents a simplified derivation that maintains the algorithm’s theoretical guarantees while offering a cleaner proof methodology.

Abstract: In this short note, we present a simple derivation of the best-of-both-worlds guarantee for the Tsallis-INF multi-armed bandit algorithm from J. Zimmert and Y. Seldin, "Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits," Journal of Machine Learning Research, 22(28):1-49, 2021 (URL https://jmlr.csail.mit.edu/papers/volume22/19-753/19-753.pdf). In particular, the proof uses modern tools from online convex optimization and avoids the use of conjugate functions. Also, we do not optimize the constants in the bounds, in favor of a slimmer proof.
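For readers unfamiliar with the algorithm being analysed, the following is a rough sketch of one Tsallis-INF step as it is usually described: follow-the-regularized-leader over the simplex with the 1/2-Tsallis entropy, which gives weights of the form w_i = 4 / (eta * (L_i - x))^2 for a normalising constant x. The bisection solver, the function names, and the plain importance-weighted loss estimate are choices made for this illustration, not details taken from the note.

```python
import math


def tsallis_inf_weights(cum_loss, eta, iters=100):
    """Sampling weights w_i = 4 / (eta * (L_i - x))**2, with x chosen so they sum to 1.

    cum_loss holds the cumulative importance-weighted loss estimate of each arm.
    """
    k = len(cum_loss)
    smallest = min(cum_loss)

    def total(x):
        return sum(4.0 / (eta * (loss - x)) ** 2 for loss in cum_loss)

    # Bracket the normaliser: total(lo) <= 1 <= total(hi), and total is increasing in x.
    lo = smallest - 2.0 * math.sqrt(k) / eta
    hi = smallest - 2.0 / eta
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            hi = mid
        else:
            lo = mid
    x = 0.5 * (lo + hi)
    weights = [4.0 / (eta * (loss - x)) ** 2 for loss in cum_loss]
    norm = sum(weights)
    return [w / norm for w in weights]  # remove residual bisection error


def update_cum_loss(cum_loss, arm, observed_loss, weights):
    """Importance-weighted update: only the pulled arm's estimate changes."""
    cum_loss[arm] += observed_loss / weights[arm]
```

In use, a learning rate proportional to 1/sqrt(t) is the usual schedule: at round t, compute the weights, sample an arm from them, observe its loss, and call `update_cum_loss`.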

[421] Sparse Methods for Vector Embeddings of TPC Data

Tyler Wheeler, Michelle P. Kuchera, Raghuram Ramanujan, Ryan Krupp, Chris Wrede, Saiprasad Ravishankar, Connor L. Cross, Hoi Yan Ian Heung, Andrew J. Jones, Benjamin Votaw

Main category: cs.LG

TL;DR: Sparse convolutional networks, particularly sparse ResNet architectures, provide effective vector embeddings for Time Projection Chamber (TPC) data, enabling representation learning across different TPC detectors even with random weights or minimal training.

Details

Motivation: To develop general representation learning methods for TPC data that can work across different detector configurations and experiments, addressing the need for versatile analysis tools in nuclear physics.

Method: Using sparse convolutional networks (Minkowski Engine ResNet models) to process raw TPC pad-level signals as sparse tensors, with pre-training on physics-motivated binary classification tasks and cross-detector validation.

Result: Sparse ResNet models produce useful structured embeddings revealing rich event structure in both GADGET II and AT-TPC data, with improvements observed when trained on GADGET data and applied to AT-TPC.

Conclusion: Sparse convolutional techniques show strong potential as a general tool for representation learning in diverse TPC experiments, providing effective embeddings even with minimal training or across different detector types.

Abstract: Time Projection Chambers (TPCs) are versatile detectors that reconstruct charged-particle tracks in an ionizing medium, enabling sensitive measurements across a wide range of nuclear physics experiments. We explore sparse convolutional networks for representation learning on TPC data, finding that a sparse ResNet architecture, even with randomly set weights, provides useful structured vector embeddings of events. Pre-training this architecture on a simple physics-motivated binary classification task further improves the embedding quality. Using data from the GAseous Detector with GErmanium Tagging (GADGET) II TPC, a detector optimized for measuring low-energy $β$-delayed particle decays, we represent raw pad-level signals as sparse tensors, train Minkowski Engine ResNet models, and probe the resulting event-level embeddings which reveal rich event structure. As a cross-detector test, we embed data from the Active-Target TPC (AT-TPC) – a detector designed for nuclear reaction studies in inverse kinematics – using the same encoder. We find that even an untrained sparse ResNet model provides useful embeddings of AT-TPC data, and we observe improvements when the model is trained on GADGET data. Together, these results highlight the potential of sparse convolutional techniques as a general tool for representation learning in diverse TPC experiments.

[422] Neural Network-Powered Finger-Drawn Biometric Authentication

Maan Al Balkhi, Kordian Gontarska, Marko Harasic, Adrian Paschke

Main category: cs.LG

TL;DR: Neural network-based biometric authentication using finger-drawn digits on touchscreen devices achieves ~89% accuracy with CNN models and ~75% with autoencoders, providing a viable security solution.

Details

Motivation: To develop a secure, user-friendly biometric authentication method for touchscreen devices using simple finger-drawn digit patterns as an alternative to traditional authentication methods.

Method: Evaluated CNN architectures (modified Inception-V1 and lightweight shallow CNN) and autoencoders (Convolutional and Fully Connected) for user authentication using 2,000 finger-drawn digits from 20 participants on personal touchscreen devices.

Result: Both CNN architectures achieved ~89% authentication accuracy, with shallow CNN requiring fewer parameters. Autoencoder approaches achieved ~75% accuracy.

Conclusion: Finger-drawn symbol authentication provides a viable, secure, and user-friendly biometric solution that can be integrated with existing pattern-based authentication for multi-layered mobile security systems.

Abstract: This paper investigates neural network-based biometric authentication using finger-drawn digits on touchscreen devices. We evaluated CNN and autoencoder architectures for user authentication through simple digit patterns (0-9) traced with finger input. Twenty participants contributed 2,000 finger-drawn digits each on personal touchscreen devices. We compared two CNN architectures: a modified Inception-V1 network and a lightweight shallow CNN for mobile environments. Additionally, we examined Convolutional and Fully Connected autoencoders for anomaly detection. Both CNN architectures achieved ~89% authentication accuracy, with the shallow CNN requiring fewer parameters. Autoencoder approaches achieved ~75% accuracy. The results demonstrate that finger-drawn symbol authentication provides a viable, secure, and user-friendly biometric solution for touchscreen devices. This approach can be integrated with existing pattern-based authentication methods to create multi-layered security systems for mobile applications.

[423] Virtual Width Networks

Seed, Baisheng Li, Banggu Wu, Bole Ma, Bowen Xiao, Chaoyi Zhang, Cheng Li, Chengyi Wang, Chenyin Xu, Chi Zhang, Chong Hu, Daoguang Zan, Defa Zhu, Dongyu Xu, Du Li, Faming Wu, Fan Xia, Ge Zhang, Guang Shi, Haobin Chen, Hongyu Zhu, Hongzhi Huang, Huan Zhou, Huanzhang Dou, Jianhui Duan, Jianqiao Lu, Jianyu Jiang, Jiayi Xu, Jiecao Chen, Jin Chen, Jin Ma, Jing Su, Jingji Chen, Jun Wang, Jun Yuan, Juncai Liu, Jundong Zhou, Kai Hua, Kai Shen, Kai Xiang, Kaiyuan Chen, Kang Liu, Ke Shen, Liang Xiang, Lin Yan, Lishu Luo, Mengyao Zhang, Ming Ding, Mofan Zhang, Nianning Liang, Peng Li, Penghao Huang, Pengpeng Mu, Qi Huang, Qianli Ma, Qiyang Min, Qiying Yu, Renming Pang, Ru Zhang, Shen Yan, Shen Yan, Shixiong Zhao, Shuaishuai Cao, Shuang Wu, Siyan Chen, Siyu Li, Siyuan Qiao, Tao Sun, Tian Xin, Tiantian Fan, Ting Huang, Ting-Han Fan, Wei Jia, Wenqiang Zhang, Wenxuan Liu, Xiangzhong Wu, Xiaochen Zuo, Xiaoying Jia, Ximing Yang, Xin Liu, Xin Yu, Xingyan Bin, Xintong Hao, Xiongcai Luo, Xujing Li, Xun Zhou, Yanghua Peng, Yangrui Chen, Yi Lin, Yichong Leng, Yinghao Li, Yingshuan Song, Yiyuan Ma, Yong Shan, Yongan Xiang, Yonghui Wu, Yongtao Zhang, Yongzhen Yao, Yu Bao, Yuehang Yang, Yufeng Yuan, Yunshui Li, Yuqiao Xian, Yutao Zeng, Yuxuan Wang, Zehua Hong, Zehua Wang, Zengzhi Wang, Zeyu Yang, Zhengqiang Yin, Zhenyi Lu, Zhexi Zhang, Zhi Chen, Zhi Zhang, Zhiqi Lin, Zihao Huang, Zilin Xu, Ziyun Wei, Zuo Wang

Main category: cs.LG

TL;DR: Virtual Width Networks (VWN) enable wider representations without quadratic cost increases by decoupling representational width from backbone width, achieving significant optimization acceleration and loss reduction.

Details

Motivation: To overcome the quadratic computational cost of increasing hidden size in neural networks while still benefiting from wider representations for improved performance.

Method: VWN framework decouples representational width from backbone width, expanding embedding space while keeping backbone compute nearly constant.

Result: 8x expansion accelerates optimization by over 2x for next-token and 3x for next-2-token prediction, with advantages amplifying over training. Log-linear scaling relation found between virtual width and loss reduction.

Conclusion: VWN provides token-efficient scaling and increasingly effective performance with scale, establishing virtual-width scaling as a new dimension for large-model efficiency.

Abstract: We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.

[424] HealSplit: Towards Self-Healing through Adversarial Distillation in Split Federated Learning

Yuhan Xie, Chen Lyu

Main category: cs.LG

TL;DR: HealSplit is a unified defense framework for Split Federated Learning that detects and recovers from five types of poisoning attacks using topological anomaly detection, generative recovery, and adversarial multi-teacher distillation.

Details

Motivation: Split Federated Learning (SFL) is vulnerable to sophisticated data poisoning attacks, and existing FL defenses are ineffective in SFL due to limited access to complete model updates.

Method: HealSplit has three components: topology-aware detection using graph construction and topological anomaly scoring, generative recovery with consistency validation, and adversarial multi-teacher distillation with Vanilla Teacher and Anomaly-Influence Debiasing Teacher.

Result: Extensive experiments on four benchmark datasets show HealSplit consistently outperforms ten state-of-the-art defenses, achieving superior robustness and defense effectiveness across diverse attack scenarios.

Conclusion: HealSplit provides the first unified defense framework specifically tailored for SFL, offering effective end-to-end protection against sophisticated poisoning attacks.

Abstract: Split Federated Learning (SFL) is an emerging paradigm for privacy-preserving distributed learning. However, it remains vulnerable to sophisticated data poisoning attacks targeting local features, labels, smashed data, and model weights. Existing defenses, primarily adapted from traditional Federated Learning (FL), are less effective under SFL due to limited access to complete model updates. This paper presents HealSplit, the first unified defense framework tailored for SFL, offering end-to-end detection and recovery against five sophisticated types of poisoning attacks. HealSplit comprises three key components: (1) a topology-aware detection module that constructs graphs over smashed data to identify poisoned samples via topological anomaly scoring (TAS); (2) a generative recovery pipeline that synthesizes semantically consistent substitutes for detected anomalies, validated by a consistency validation student; and (3) an adversarial multi-teacher distillation framework that trains the student using semantic supervision from a Vanilla Teacher and anomaly-aware signals from an Anomaly-Influence Debiasing (AD) Teacher, guided by the alignment between topological and gradient-based interaction matrices. Extensive experiments on four benchmark datasets demonstrate that HealSplit consistently outperforms ten state-of-the-art defenses, achieving superior robustness and defense effectiveness across diverse attack scenarios.

[425] Heterogeneous Attributed Graph Learning via Neighborhood-Aware Star Kernels

Hong Huang, Chengyu Yao, Haiming Chen, Hang Gao

Main category: cs.LG

TL;DR: NASK is a novel graph kernel that combines Gower similarity for mixed attributes with Weisfeiler-Lehman iterations for neighborhood structure, achieving state-of-the-art performance on attributed graphs.

Details

Motivation: Existing graph kernels struggle to capture both heterogeneous attribute semantics and neighborhood information in attributed graphs with mixed numerical and categorical features.

Method: Uses exponential transformation of Gower similarity coefficient for mixed attributes and star substructures enhanced by Weisfeiler-Lehman iterations for multi-scale neighborhood information.

Result: NASK consistently outperforms 16 state-of-the-art baselines (9 graph kernels and 7 GNNs) on 11 attributed and 4 large-scale real-world graph benchmarks.

Conclusion: NASK provides an effective solution for attributed graph learning by jointly modeling attribute semantics and structural information, with proven positive definiteness for kernel-based frameworks.

Abstract: Attributed graphs, typically characterized by irregular topologies and a mix of numerical and categorical attributes, are ubiquitous in diverse domains such as social networks, bioinformatics, and cheminformatics. While graph kernels provide a principled framework for measuring graph similarity, existing kernel methods often struggle to simultaneously capture heterogeneous attribute semantics and neighborhood information in attributed graphs. In this work, we propose the Neighborhood-Aware Star Kernel (NASK), a novel graph kernel designed for attributed graph learning. NASK leverages an exponential transformation of the Gower similarity coefficient to jointly model numerical and categorical features efficiently, and employs star substructures enhanced by Weisfeiler-Lehman iterations to integrate multi-scale neighborhood structural information. We theoretically prove that NASK is positive definite, ensuring compatibility with kernel-based learning frameworks such as SVMs. Extensive experiments are conducted on eleven attributed and four large-scale real-world graph benchmarks. The results demonstrate that NASK consistently achieves superior performance over sixteen state-of-the-art baselines, including nine graph kernels and seven Graph Neural Networks.
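To make the attribute-similarity ingredient concrete, here is a small sketch of the Gower similarity coefficient for mixed numerical and categorical attributes, followed by one possible exponential transformation. The abstract does not state the exact transformation NASK uses, so the form exp(gamma * (s - 1)) and the names `gower_similarity`, `exp_gower_kernel` and `gamma` are assumptions for illustration only.

```python
import math


def gower_similarity(a, b, numeric_ranges):
    """Gower similarity between two attribute vectors with mixed types.

    numeric_ranges[i] is the value range of attribute i if it is numerical,
    or None if it is categorical.
    """
    total = 0.0
    for x, y, value_range in zip(a, b, numeric_ranges):
        if value_range is None:
            # Categorical attribute: exact match or nothing.
            total += 1.0 if x == y else 0.0
        elif value_range > 0:
            # Numerical attribute: range-normalised closeness in [0, 1].
            total += 1.0 - abs(x - y) / value_range
        else:
            # Degenerate range: all values identical.
            total += 1.0
    return total / len(a)


def exp_gower_kernel(a, b, numeric_ranges, gamma=1.0):
    """Assumed exponential transform of the Gower similarity (illustrative form)."""
    return math.exp(gamma * (gower_similarity(a, b, numeric_ranges) - 1.0))


# Two hypothetical atom-like nodes: (partial charge, element symbol).
node_u = [1.0, "C"]
node_v = [3.0, "N"]
similarity = gower_similarity(node_u, node_v, numeric_ranges=[4.0, None])
```

In a star-kernel setting, a node-level similarity of this kind would be evaluated between star centres and between their neighbours. The Weisfeiler-Lehman iterations mentioned in the abstract then widen each neighbourhood.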

[426] Toward Scalable Early Cancer Detection: Evaluating EHR-Based Predictive Models Against Traditional Screening Criteria

Jiheum Park, Chao Pang, Tristan Y. Lee, Jeong Yun Yang, Jacob Berkowitz, Alexander Z. Wei, Nicholas Tatonetti

Main category: cs.LG

TL;DR: EHR-based predictive models significantly outperform traditional risk factors (3-6x higher enrichment) for identifying high-risk individuals across multiple cancers, with foundation models further improving performance.

Details

Motivation: Current cancer screening guidelines are limited to few cancer types and rely on narrow criteria, while EHRs capture comprehensive longitudinal data that could enable more effective risk identification.

Method: Systematic evaluation using All of Us Research Program data (865,000+ participants) comparing EHR-based models against traditional risk factors (gene mutations, family history) across 8 major cancers, with baseline modeling and state-of-the-art EHR foundation model approaches.

Result: EHR-based models achieved 3-6x higher enrichment of true cancer cases in high-risk groups compared to traditional factors alone, with foundation models further improving performance across 26 cancer types.

Conclusion: EHR-based predictive modeling demonstrates strong clinical potential for enabling more precise and scalable early cancer detection strategies beyond current screening limitations.

Abstract: Current cancer screening guidelines cover only a few cancer types and rely on narrowly defined criteria, such as age or a single risk factor like smoking history, to identify high-risk individuals. Predictive models using electronic health records (EHRs), which capture large-scale longitudinal patient-level health information, may provide a more effective tool for identifying high-risk groups by detecting subtle prediagnostic signals of cancer. Recent advances in large language and foundation models have further expanded this potential, yet evidence remains limited on how useful EHR-based models are compared with traditional risk factors currently used in screening guidelines. We systematically evaluated the clinical utility of EHR-based predictive models against traditional risk factors, including gene mutations and family history of cancer, for identifying high-risk individuals across eight major cancers (breast, lung, colorectal, prostate, ovarian, liver, pancreatic, and stomach), using data from the All of Us Research Program, which integrates EHR, genomic, and survey data from over 865,000 participants. Even with a baseline modeling approach, EHR-based models achieved a 3- to 6-fold higher enrichment of true cancer cases among individuals identified as high risk compared with traditional risk factors alone, whether used as a standalone or complementary tool. The EHR foundation model, a state-of-the-art approach trained on comprehensive patient trajectories, further improved predictive performance across 26 cancer types, demonstrating the clinical potential of EHR-based predictive modeling to support more precise and scalable early detection strategies.

[427] Fast and Expressive Multi-Token Prediction with Probabilistic Circuits

Andreas Grivas, Lorenzo Loconte, Emile van Krieken, Piotr Nawrot, Yu Zhao, Euan Wielewski, Pasquale Minervini, Edoardo Ponti, Antonio Vergari

Main category: cs.LG

TL;DR: MTPC is a framework using probabilistic circuits to model joint distributions of future tokens in multi-token prediction, improving generation speed while maintaining performance compared to methods assuming token independence.

Details

Motivation: Existing multi-token prediction methods sacrifice expressiveness by assuming independence between future tokens, limiting their effectiveness in speeding up generation for tokenizer-free byte-level LLMs.

Method: MTPC uses probabilistic circuits to encode joint distributions over future tokens, generalizing classical models like mixture models, HMMs, and tensor networks. It retrofits existing byte-level LLMs and combines with speculative decoding.

Result: MTPC significantly speeds up generation compared to MTP with independence assumptions while guaranteeing to retain the performance of the original verifier LLM.

Conclusion: The framework provides an optimal trade-off between expressiveness and latency, with rigorous study of parameterizations like PC architectures and partial layer sharing between verifier and draft LLMs.

Abstract: Multi-token prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs), including byte-level LLMs, which are tokeniser-free but prohibitively slow. However, existing MTP methods often sacrifice expressiveness by assuming independence between future tokens. In this work, we investigate the trade-off between expressiveness and latency in MTP within the framework of probabilistic circuits (PCs). Our framework, named MTPC, allows one to explore different ways to encode the joint distributions over future tokens by selecting different circuit architectures, generalising classical models such as (hierarchical) mixture models, hidden Markov models and tensor networks. We show the efficacy of MTPC by retrofitting existing byte-level LLMs, such as EvaByte. Our experiments show that, when combined with speculative decoding, MTPC significantly speeds up generation compared to MTP with independence assumptions, while guaranteeing to retain the performance of the original verifier LLM. We also rigorously study the optimal trade-off between expressiveness and latency when exploring the possible parameterisations of MTPC, such as PC architectures and partial layer sharing between the verifier and draft LLMs.
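The expressiveness gap mentioned in the abstract can be seen in the simplest circuit it names, a mixture model. The sketch below is a toy illustration with made-up numbers, not the MTPC implementation. A mixture of fully factorised components assigns a joint probability to two future tokens that no single product of per-position marginals can match. The function names are hypothetical.

```python
def mixture_joint(weights, components, tokens):
    """Joint probability of `tokens` under a mixture of product distributions.

    components[k][j] maps a token to its probability at future position j
    under mixture component k; weights[k] is the mixing weight.
    """
    total = 0.0
    for weight, component in zip(weights, components):
        prob = weight
        for position_dist, token in zip(component, tokens):
            prob *= position_dist[token]
        total += prob
    return total


def independent_joint(marginals, tokens):
    """Joint probability under the independence assumption (one product term)."""
    prob = 1.0
    for position_dist, token in zip(marginals, tokens):
        prob *= position_dist[token]
    return prob


# Two components that each prefer a repeated token ("aa" or "bb").
weights = [0.5, 0.5]
components = [
    [{"a": 0.9, "b": 0.1}, {"a": 0.9, "b": 0.1}],
    [{"a": 0.1, "b": 0.9}, {"a": 0.1, "b": 0.9}],
]
# Per-position marginals of the mixture are uniform.
marginals = [{"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}]

p_mixture = mixture_joint(weights, components, ("a", "a"))
p_independent = independent_joint(marginals, ("a", "a"))
```

Both models have the same per-position marginals, yet the mixture puts noticeably more mass on the correlated sequence. Evaluation costs one product per component, which is the kind of expressiveness-versus-latency trade-off the abstract refers to.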

[428] Toward Multi-Fidelity Machine Learning Force Field for Cathode Materials

Guangyi Dong, Zhihui Wang

Main category: cs.LG

TL;DR: A multi-fidelity machine learning force field framework that combines low-fidelity non-magnetic and high-fidelity magnetic computational datasets to improve data efficiency for lithium-ion battery cathode materials.

Details

Motivation: Machine learning force fields for lithium-ion battery cathode materials are limited due to complex electronic structures and scarcity of high-quality training datasets.

Method: Developed a multi-fidelity MLFF framework that simultaneously utilizes both low-fidelity non-magnetic and high-fidelity magnetic computational datasets for training.

Result: Tests on lithium manganese iron phosphate (LMFP) cathode material system demonstrated the effectiveness of the multi-fidelity approach.

Conclusion: This work enables high-accuracy MLFF training for cathode materials at lower dataset costs and provides new perspectives for applying MLFFs to cathode material simulations.

Abstract: Machine learning force fields (MLFFs), which employ neural networks to map atomic structures to system energies, effectively combine the high accuracy of first-principles calculation with the computational efficiency of empirical force fields. They are widely used in computational materials simulations. However, the development and application of MLFFs for lithium-ion battery cathode materials remain relatively limited. This is primarily due to the complex electronic structure characteristics of cathode materials and the resulting scarcity of high-quality computational datasets available for force field training. In this work, we develop a multi-fidelity machine learning force field framework to enhance the data efficiency of computational results, which can simultaneously utilize both low-fidelity non-magnetic and high-fidelity magnetic computational datasets of cathode materials for training. Tests conducted on the lithium manganese iron phosphate (LMFP) cathode material system demonstrate the effectiveness of this multi-fidelity approach. This work helps to achieve high-accuracy MLFF training for cathode materials at a lower training dataset cost, and offers new perspectives for applying MLFFs to computational simulations of cathode materials.

[429] On-Device Fine-Tuning via Backprop-Free Zeroth-Order Optimization

Prabodh Katti, Sangwoo Park, Bipin Rajendran, Osvaldo Simeone

Main category: cs.LG

TL;DR: MeZO enables larger model fine-tuning on edge devices by using forward-only gradient estimation, eliminating activation storage needs but potentially increasing training time.

Details

Motivation: Enable on-device fine-tuning for edge AI systems under strict memory constraints where conventional backpropagation's activation storage overhead limits deployable model sizes.

Method: Memory-efficient zeroth-order optimization (MeZO) that estimates gradients using forward evaluations only, eliminating need for storing intermediate activations or optimizer states.

Result: MeZO allows significantly larger models to fit within on-chip memory compared to BP-based training, with accuracy advantages under memory constraints given sufficient fine-tuning time.

Conclusion: MeZO is an effective approach for on-device fine-tuning that overcomes memory limitations of conventional BP training, enabling deployment of larger models on edge devices.

Abstract: On-device fine-tuning is a critical capability for edge AI systems, which must support adaptation to different agentic tasks under stringent memory constraints. Conventional backpropagation (BP)-based training requires storing layer activations and optimizer states, a demand that can be only partially alleviated through checkpointing. In edge deployments in which the model weights must reside entirely in device memory, this overhead severely limits the maximum model size that can be deployed. Memory-efficient zeroth-order optimization (MeZO) alleviates this bottleneck by estimating gradients using forward evaluations alone, eliminating the need for storing intermediate activations or optimizer states. This enables significantly larger models to fit within on-chip memory, albeit at the cost of potentially longer fine-tuning wall-clock time. This paper first provides a theoretical estimate of the relative model sizes that can be accommodated under BP and MeZO training. We then numerically validate the analysis, demonstrating that MeZO exhibits accuracy advantages under on-device memory constraints, provided sufficient wall-clock time is available for fine-tuning.
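The memory argument is easier to follow with the update written out. Below is a minimal sketch of a two-point zeroth-order step in the spirit of MeZO: the random direction is regenerated from a seed instead of being stored, so only the parameters and two scalar losses are kept. The function names, the plain-list parameter representation, and the toy quadratic loss in the usage are illustrative assumptions.

```python
import random


def perturb(params, seed, scale):
    """Add scale * z to params in place, where z ~ N(0, I) is replayed from `seed`."""
    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] += scale * rng.gauss(0.0, 1.0)


def mezo_step(params, loss_fn, lr=0.01, eps=1e-3, seed=0):
    """One forward-only update using a central-difference directional estimate."""
    perturb(params, seed, +eps)
    loss_plus = loss_fn(params)           # L(theta + eps * z)
    perturb(params, seed, -2.0 * eps)
    loss_minus = loss_fn(params)          # L(theta - eps * z)
    perturb(params, seed, +eps)           # restore theta
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    # theta <- theta - lr * projected_grad * z, with z replayed from the seed.
    perturb(params, seed, -lr * projected_grad)
    return projected_grad
```

A usage sketch on a toy quadratic: with `target = [1.0, -2.0, 0.5]` and `loss = lambda p: sum((a - b) ** 2 for a, b in zip(p, target))`, calling `mezo_step(params, loss, lr=0.05, seed=step)` for a few hundred values of `step` drives `params` toward `target`. Each step needs two forward passes and no stored activations, which matches the memory saving and the longer wall-clock time described in the abstract.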

[430] When Genes Speak: A Semantic-Guided Framework for Spatially Resolved Transcriptomics Data Clustering

Jiangkai Long, Yanran Zhu, Chang Tang, Kun Sun, Yuanyuan Liu, Xuesong Yan

Main category: cs.LG

TL;DR: SemST is a semantic-guided deep learning framework that uses LLMs to extract biological meaning from gene symbols and combines them with spatial relationships via GNNs for improved spatial transcriptomics clustering.

Details

Motivation: Current computational models treat genes as isolated numerical features, ignoring the rich biological semantics encoded in gene symbols, which prevents deep understanding of biological characteristics in spatial transcriptomics data.

Method: Uses LLMs to transform gene sets into biologically informed embeddings, fuses them with spatial neighborhood relationships via GNNs, and introduces Fine-grained Semantic Modulation module for spot-specific affine transformations that calibrate spatial features with biological knowledge.

Result: Achieves state-of-the-art clustering performance on public spatial transcriptomics datasets, with the FSM module showing plug-and-play versatility that consistently improves baseline methods.

Conclusion: SemST successfully integrates biological function and spatial structure through semantic guidance, enabling more meaningful analysis of spatial transcriptomics data by leveraging the symbolic meaning of genes.

Abstract: Spatial transcriptomics enables gene expression profiling with spatial context, offering unprecedented insights into the tissue microenvironment. However, most computational models treat genes as isolated numerical features, ignoring the rich biological semantics encoded in their symbols. This prevents a truly deep understanding of critical biological characteristics. To overcome this limitation, we present SemST, a semantic-guided deep learning framework for spatial transcriptomics data clustering. SemST leverages Large Language Models (LLMs) to enable genes to “speak” through their symbolic meanings, transforming gene sets within each tissue spot into biologically informed embeddings. These embeddings are then fused with the spatial neighborhood relationships captured by Graph Neural Networks (GNNs), achieving a coherent integration of biological function and spatial structure. We further introduce the Fine-grained Semantic Modulation (FSM) module to optimally exploit these biological priors. The FSM module learns spot-specific affine transformations that empower the semantic embeddings to perform an element-wise calibration of the spatial features, thus dynamically injecting high-order biological knowledge into the spatial context. Extensive experiments on public spatial transcriptomics datasets show that SemST achieves state-of-the-art clustering performance. Crucially, the FSM module exhibits plug-and-play versatility, consistently improving the performance when integrated into other baseline methods.
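The abstract describes the FSM module as learning spot-specific affine transformations that calibrate spatial features element-wise. One plausible reading is a feature-wise scale-and-shift conditioned on the semantic embedding, sketched below. The linear maps, their shapes, and all names are assumptions for illustration. The paper's actual layer may differ.

```python
def linear(x, weight, bias):
    """Dense layer on plain lists: returns weight @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def semantic_modulation(spatial_feat, semantic_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Element-wise affine calibration of one spot's spatial features.

    The scale (gamma) and shift (beta) are predicted from that spot's semantic
    embedding, so every spot receives its own affine transformation.
    """
    gamma = linear(semantic_emb, w_gamma, b_gamma)
    beta = linear(semantic_emb, w_beta, b_beta)
    return [g * h + b for g, h, b in zip(gamma, spatial_feat, beta)]


# Hypothetical spot: 2-d spatial feature from a GNN, 1-d semantic embedding from an LLM.
calibrated = semantic_modulation(
    spatial_feat=[1.0, 2.0],
    semantic_emb=[0.5],
    w_gamma=[[2.0], [0.0]], b_gamma=[0.0, 1.0],
    w_beta=[[0.0], [4.0]], b_beta=[0.0, 0.0],
)
```

Because the modulation only needs a feature vector and a conditioning vector of compatible sizes, a layer of this shape can be attached to other encoders, which is consistent with the plug-and-play behaviour reported in the abstract.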

[431] Robust inverse material design with physical guarantees using the Voigt-Reuss Net

Sanath Keshav, Felix Fritzen

Main category: cs.LG

TL;DR: A spectrally normalized neural network for mechanical homogenization with physical guarantees, using Voigt-Reuss bounds to ensure predictions lie between theoretical limits.

Details

Motivation: To develop a homogenization method that provides hard physical guarantees while maintaining accuracy, enabling both forward prediction and inverse design with constraint consistency.

Method: Uses Voigt-Reuss bounds factorization via Cholesky-like operator to learn dimensionless symmetric positive semi-definite representations. Combines spectral normalization with neural networks (fully connected for 3D, CNN for 2D) trained on large FFT-based datasets with isotropy-invariant descriptors.

Result: Achieves near-perfect fidelity for isotropic projections (R² ≥ 0.998), median tensor-level errors ≈1.7%, mean ≈3.4%. For 2D, R²>0.99 on all components, subpercent losses, accurate tracking of percolation effects, and robust generalization to out-of-distribution data.

Conclusion: The Voigt-Reuss net provides accurate, physically admissible forward prediction and large-batch inverse design, unifying both tasks while being generic to elliptic operators and coupled-physics settings.

Abstract: We propose a spectrally normalized surrogate for forward and inverse mechanical homogenization with hard physical guarantees. Leveraging the Voigt-Reuss bounds, we factor their difference via a Cholesky-like operator and learn a dimensionless, symmetric positive semi-definite representation with eigenvalues in $[0,1]$; the inverse map returns symmetric positive-definite predictions that lie between the bounds in the Löwner sense. In 3D linear elasticity on an open dataset of stochastic biphasic microstructures, a fully connected Voigt-Reuss net trained on $>7.5\times 10^{5}$ FFT-based labels with 236 isotropy-invariant descriptors and three contrast parameters recovers the isotropic projection with near-perfect fidelity (isotropy-related entries: $R^2 \ge 0.998$), while anisotropy-revealing couplings are unidentifiable from $SO(3)$-invariant inputs. Tensor-level relative Frobenius errors have median $\approx 1.7\%$ and mean $\approx 3.4\%$ across splits. For 2D plane strain on thresholded trigonometric microstructures, coupling spectral normalization with a differentiable renderer and a CNN yields $R^2>0.99$ on all components, subpercent normalized losses, accurate tracking of percolation-induced eigenvalue jumps, and robust generalization to out-of-distribution images. Treating the parametric microstructure as design variables, batched first-order optimization with a single surrogate matches target tensors within a few percent and returns diverse near-optimal designs. Overall, the Voigt-Reuss net unifies accurate, physically admissible forward prediction with large-batch, constraint-consistent inverse design, and is generic to elliptic operators and coupled-physics settings.
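The bound-preserving parametrization described in the abstract can be sketched as follows. This is a minimal illustration for a two-phase composite with made-up 2x2 stiffness matrices, not the authors' implementation: the function names, the sigmoid used to squash eigenvalues into (0, 1), and the choice to interpolate upward from the Reuss bound are all assumptions.

```python
import numpy as np

def voigt_reuss_bounds(c1, c2, f1):
    """Voigt (arithmetic) and Reuss (harmonic) averages of two phase stiffnesses."""
    f2 = 1.0 - f1
    voigt = f1 * c1 + f2 * c2
    reuss = np.linalg.inv(f1 * np.linalg.inv(c1) + f2 * np.linalg.inv(c2))
    return voigt, reuss

def bounded_prediction(raw, voigt, reuss):
    """Map an unconstrained matrix to a tensor between the bounds (Loewner order)."""
    sym = 0.5 * (raw + raw.T)                   # symmetrize the raw network output
    eigvals, eigvecs = np.linalg.eigh(sym)
    squashed = 1.0 / (1.0 + np.exp(-eigvals))   # eigenvalues now lie in (0, 1)
    s = (eigvecs * squashed) @ eigvecs.T        # dimensionless SPSD representation
    chol = np.linalg.cholesky(voigt - reuss)    # factor the gap between the bounds
    return reuss + chol @ s @ chol.T

# Hypothetical phase stiffnesses and volume fraction, chosen only for illustration.
c1 = np.array([[2.0, 0.5], [0.5, 1.0]])
c2 = np.array([[10.0, 1.0], [1.0, 8.0]])
voigt, reuss = voigt_reuss_bounds(c1, c2, f1=0.4)

raw = np.array([[0.3, -1.2], [0.7, 2.0]])       # stand-in for a network output
pred = bounded_prediction(raw, voigt, reuss)
print(np.linalg.eigvalsh(pred - reuss), np.linalg.eigvalsh(voigt - pred))
```

Because the squashed eigenvalues stay inside the unit interval, both `pred - reuss` and `voigt - pred` are positive semi-definite for any raw input, which is the sense in which the guarantee is "hard" rather than learned.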

[432] SPOT: Single-Shot Positioning via Trainable Near-Field Rainbow Beamforming

Yeyue Cai, Jianhua Mo, Meixia Tao

Main category: cs.LG

TL;DR: Deep learning-based scheme for simultaneous rainbow beam design and user localization using phase-time arrays, achieving order-of-magnitude overhead reduction and improved positioning accuracy.

Details

Motivation: Phase-time arrays with PSs and TTDs offer cost-effective rainbow beam generation for wideband sensing, but existing methods have high overhead and suboptimal positioning performance.

Method: End-to-end deep learning approach with trainable PS/TTD coefficients for task-oriented beam synthesis, plus lightweight FC network to estimate angle-range coordinates from quantized power feedback and subcarrier index after single downlink transmission.

Result: Reduces overhead by order of magnitude compared to existing methods, achieves consistently lower 2D positioning error than analytical and learning-based schemes.

Conclusion: The proposed deep learning framework successfully integrates beam design and localization, demonstrating significant improvements in efficiency and accuracy for wideband sensing applications.

Abstract: Phase-time arrays, which integrate phase shifters (PSs) and true-time delays (TTDs), have emerged as a cost-effective architecture for generating frequency-dependent rainbow beams in wideband sensing and localization. This paper proposes an end-to-end deep learning-based scheme that simultaneously designs the rainbow beams and estimates user positions. Treating the PS and TTD coefficients as trainable variables allows the network to synthesize task-oriented beams that maximize localization accuracy. A lightweight fully connected module then recovers the user’s angle-range coordinates from its feedback of the maximum quantized received power and its corresponding subcarrier index after a single downlink transmission. Compared with existing analytical and learning-based schemes, the proposed method reduces overhead by an order of magnitude and delivers consistently lower two-dimensional positioning error.

[433] Multi-Phase Spacecraft Trajectory Optimization via Transformer-Based Reinforcement Learning

Amit Jain, Victor Rodriguez-Fernandez, Richard Linares

Main category: cs.LG

TL;DR: Transformer-based RL framework unifies multi-phase spacecraft trajectory optimization using a single policy, eliminating manual phase transitions while maintaining control stability across dynamically distinct mission regimes.

Details

Motivation: Existing RL approaches require separate policies for different mission phases (launch, ascent, separation, orbit insertion), limiting adaptability and increasing operational complexity. Need for adaptive policies that generalize across dynamically distinct regimes.

Method: Transformer-based RL framework using PPO with transformer encoder-decoder structure, replacing conventional recurrent networks. Integrates Gated Transformer-XL (GTrXL) architecture to maintain coherent memory across mission phases spanning seconds to minutes.

Result: Matches analytical solutions in simple cases, learns coherent control policies across dynamically distinct regimes. Validated on single-phase benchmarks, multiphase waypoint navigation, and complex rocket ascent problem with atmospheric flight, stage separation, and vacuum operations.

Conclusion: Establishes foundation for scalable autonomous mission planning that reduces reliance on phase-specific controllers while maintaining compatibility with safety-critical verification protocols.

Abstract: Autonomous spacecraft control for mission phases such as launch, ascent, stage separation, and orbit insertion remains a critical challenge due to the need for adaptive policies that generalize across dynamically distinct regimes. While reinforcement learning (RL) has shown promise in individual astrodynamics tasks, existing approaches often require separate policies for distinct mission phases, limiting adaptability and increasing operational complexity. This work introduces a transformer-based RL framework that unifies multi-phase trajectory optimization through a single policy architecture, leveraging the transformer’s inherent capacity to model extended temporal contexts. Building on proximal policy optimization (PPO), our framework replaces conventional recurrent networks with a transformer encoder-decoder structure, enabling the agent to maintain coherent memory across mission phases spanning seconds to minutes during critical operations. By integrating a Gated Transformer-XL (GTrXL) architecture, the framework eliminates manual phase transitions while maintaining stability in control decisions. We validate our approach progressively: first demonstrating near-optimal performance on single-phase benchmarks (double integrator and Van der Pol oscillator), then extending to multiphase waypoint navigation variants, and finally tackling a complex multiphase rocket ascent problem that includes atmospheric flight, stage separation, and vacuum operations. Results demonstrate that the transformer-based framework not only matches analytical solutions in simple cases but also effectively learns coherent control policies across dynamically distinct regimes, establishing a foundation for scalable autonomous mission planning that reduces reliance on phase-specific controllers while maintaining compatibility with safety-critical verification protocols.

[434] Multicalibration yields better matchings

Riccardo Colini Baldeschi, Simone Di Gregorio, Simone Fioravanti, Federico Fusco, Ido Guy, Daniel Haimovich, Stefano Leonardi, Fridolin Linder, Lorenzo Perini, Matteo Russo, Niek Tax

Main category: cs.LG

TL;DR: The paper proposes using multicalibration to improve matching decisions in weighted graphs when working with imperfect weight predictors, showing that multicalibrated predictors can compete with the best decision rules applied to the original predictor.

Details

Motivation: In practice, perfect predictors for graph edge weights are unrealistic, and imperfect predictors can lead to suboptimal matching decisions. Standard optimal rules may underperform when the predictor is not Bayes optimal.

Method: The authors propose multicalibration as a fairness notion requiring predictors to be unbiased on protected context sets. They show how to construct a multicalibrated predictor from any given predictor, and prove that using this for matching decisions competes with the best decision rules applied to the original predictor.

Result: The paper demonstrates that picking the best matching based on the multicalibrated predictor is competitive with the best decision rule applied to the original predictor, and provides sample complexity bounds for this approach.

Conclusion: Multicalibration provides an effective framework for improving matching decisions with imperfect predictors, offering competitive performance guarantees while addressing fairness concerns through unbiased predictions on protected context sets.

Abstract: Consider the problem of finding the best matching in a weighted graph where we only have access to predictions of the actual stochastic weights, based on an underlying context. If the predictor is the Bayes optimal one, then computing the best matching based on the predicted weights is optimal. However, in practice, this perfect information scenario is not realistic. Given an imperfect predictor, a suboptimal decision rule may compensate for the induced error and thus outperform the standard optimal rule. In this paper, we propose multicalibration as a way to address this problem. This fairness notion requires a predictor to be unbiased on each element of a family of protected sets of contexts. Given a class of matching algorithms $\mathcal{C}$ and any predictor $\gamma$ of the edge-weights, we show how to construct a specific multicalibrated predictor $\hat{\gamma}$ with the following property: picking the best matching based on the output of $\hat{\gamma}$ is competitive with the best decision rule in $\mathcal{C}$ applied to the original predictor $\gamma$. We complement this result by providing sample complexity bounds.

[435] Differentiation Strategies for Acoustic Inverse Problems: Admittance Estimation and Shape Optimization

Nikolas Borrel-Jensen, Josiah Bjorgaard

Main category: cs.LG

TL;DR: Differentiable programming approach for acoustic inverse problems using JAX-FEM’s automatic differentiation for admittance estimation and combining JAX-FEM with PyTorch3D for shape optimization, achieving high precision and efficiency.

Details

Motivation: To demonstrate a practical differentiable programming approach for solving acoustic inverse problems without requiring manual derivation of adjoint equations, enabling rapid prototyping of optimization workflows.

Method: Uses JAX-FEM’s automatic differentiation for gradient-based admittance estimation from sparse pressure measurements, and combines JAX-FEM for forward simulation with PyTorch3D for mesh manipulation through randomized finite differences for shape optimization.

Result: Achieved 3-digit precision in admittance estimation and 48.1% energy reduction at target frequencies with 30-fold fewer FEM solutions compared to standard finite difference methods.

Conclusion: Modern differentiable software stacks enable efficient optimization workflows for physics-based inverse problems, with automatic differentiation for parameter estimation and combined finite differences/AD for geometric design.

Abstract: We demonstrate a practical differentiable programming approach for acoustic inverse problems through two applications: admittance estimation and shape optimization for resonance damping. First, we show that JAX-FEM’s automatic differentiation (AD) enables direct gradient-based estimation of complex boundary admittance from sparse pressure measurements, achieving 3-digit precision without requiring manual derivation of adjoint equations. Second, we apply randomized finite differences to acoustic shape optimization, combining JAX-FEM for forward simulation with PyTorch3D for mesh manipulation through AD. By separating physics-driven boundary optimization from geometry-driven interior mesh adaptation, we achieve 48.1% energy reduction at target frequencies with 30-fold fewer FEM solutions compared to standard finite difference on the full mesh. This work showcases how modern differentiable software stacks enable rapid prototyping of optimization workflows for physics-based inverse problems, with automatic differentiation for parameter estimation and a combination of finite differences and AD for geometric design.

[436] Low-Bit, High-Fidelity: Optimal Transport Quantization for Flow Matching

Dara Varam, Diaa A. Abuhani, Imran Zualkernan, Raghad AlDamani, Lujain Khalil

Main category: cs.LG

TL;DR: OT-based quantization preserves FM model quality down to 2-3 bits, outperforming uniform, piecewise, and logarithmic methods for edge AI deployment.

Details

Motivation: FM models face practical deployment challenges due to high-precision parameter requirements, needing compression for edge and embedded AI applications.

Method: Adapt optimal transport-based post-training quantization to minimize 2-Wasserstein distance between quantized and original weights, with theoretical analysis of generative degradation bounds.

Result: OT-based quantization maintains visual generation quality and latent space stability down to 2-3 bits per parameter across five benchmark datasets, where other methods fail.

Conclusion: OT-based quantization is a principled, effective approach for compressing FM generative models for practical deployment.

Abstract: Flow Matching (FM) generative models offer efficient simulation-free training and deterministic sampling, but their practical deployment is challenged by high-precision parameter requirements. We adapt optimal transport (OT)-based post-training quantization to FM models, minimizing the 2-Wasserstein distance between quantized and original weights, and systematically compare its effectiveness against uniform, piecewise, and logarithmic quantization schemes. Our theoretical analysis provides upper bounds on generative degradation under quantization, and empirical results across five benchmark datasets of varying complexity show that OT-based quantization preserves both visual generation quality and latent space stability down to 2-3 bits per parameter, where alternative methods fail. This establishes OT-based quantization as a principled, effective approach to compress FM generative models for edge and embedded AI applications.
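For scalar weights, minimizing the squared 2-Wasserstein distance to a distribution supported on K levels (with free masses on those levels) reduces to the one-dimensional k-means, or Lloyd-Max, objective. The sketch below uses that fact to compare such a codebook with a uniform grid at 2 bits. The Gaussian toy weights and all function names are assumptions for illustration; the paper's actual procedure may differ.

```python
import random

def nearest(levels, w):
    """Closest quantization level to weight w."""
    return min(levels, key=lambda c: (c - w) ** 2)

def mse(weights, levels):
    """Mean squared quantization error of the nearest-level assignment."""
    return sum((w - nearest(levels, w)) ** 2 for w in weights) / len(weights)

def uniform_levels(weights, k):
    """k evenly spaced levels spanning the weight range."""
    lo, hi = min(weights), max(weights)
    return [lo + (hi - lo) * i / (k - 1) for i in range(k)]

def lloyd_levels(weights, k, iters=50):
    """1D Lloyd iterations, initialized from equal-mass chunks of the sorted weights."""
    ws = sorted(weights)
    n = len(ws)
    chunks = [ws[i * n // k:(i + 1) * n // k] for i in range(k)]
    levels = [sum(c) / len(c) for c in chunks]
    for _ in range(iters):
        buckets = [[] for _ in levels]
        for w in ws:
            idx = min(range(k), key=lambda j: (levels[j] - w) ** 2)
            buckets[idx].append(w)
        # Move each level to the mean of its bucket; keep it if the bucket is empty.
        levels = [sum(b) / len(b) if b else c for b, c in zip(buckets, levels)]
    return levels

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # toy stand-in for a layer
k = 4                                                 # 2 bits per parameter
uni = uniform_levels(weights, k)
ot = lloyd_levels(weights, k)
print(mse(weights, uni), mse(weights, ot))
```

The uniform grid spends two of its four levels on the sparse tails, while the distribution-aware codebook concentrates levels where the weights actually are, which is one intuition for why uniform schemes degrade faster at very low bit widths.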

[437] Retrofit: Continual Learning with Bounded Forgetting for Security Applications

Yiling He, Junchi Lei, Hongyu She, Shuo Shao, Xinran Zheng, Yiping Liu, Zhan Qin, Lorenzo Cavallaro

Main category: cs.LG

TL;DR: RETROFIT is a data-free continual learning method that mitigates catastrophic forgetting in security analytics by merging old and new model parameters without needing historical data, using low-rank updates and knowledge arbitration.

Details

Motivation: Security analytics models degrade over time due to evolving threats and data shifts, but existing continual learning methods require full retraining or data replay, which is infeasible in data-sensitive security environments.

Method: Consolidates previously trained and newly fine-tuned models as teachers through parameter-level merging, using low-rank and sparse updates to confine changes to independent subspaces, with knowledge arbitration balancing teacher contributions based on model confidence.

Result: In malware detection: improved retention score from 20.2% to 38.6% over CL baselines, exceeded oracle upper bound on new data. In binary summarization: achieved ~2x BLEU score of prior transfer learning, surpassed all baselines in cross-representation generalization.

Conclusion: RETROFIT effectively addresses catastrophic forgetting in security-critical scenarios without requiring historical data, demonstrating superior performance in both malware detection and binary summarization tasks compared to existing continual learning approaches.

Abstract: Modern security analytics are increasingly powered by deep learning models, but their performance often degrades as threat landscapes evolve and data representations shift. While continual learning (CL) offers a promising paradigm to maintain model effectiveness, many approaches rely on full retraining or data replay, which are infeasible in data-sensitive environments. Moreover, existing methods remain inadequate for security-critical scenarios, facing two coupled challenges in knowledge transfer: preserving prior knowledge without old data and integrating new knowledge with minimal interference. We propose RETROFIT, a data retrospective-free continual learning method that achieves bounded forgetting for effective knowledge transfer. Our key idea is to consolidate previously trained and newly fine-tuned models, serving as teachers of old and new knowledge, through parameter-level merging that eliminates the need for historical data. To mitigate interference, we apply low-rank and sparse updates that confine parameter changes to independent subspaces, while a knowledge arbitration dynamically balances the teacher contributions guided by model confidence. Our evaluation on two representative applications demonstrates that RETROFIT consistently mitigates forgetting while maintaining adaptability. In malware detection under temporal drift, it substantially improves the retention score, from 20.2% to 38.6% over CL baselines, and exceeds the oracle upper bound on new data. In binary summarization across decompilation levels, where analyzing stripped binaries is especially challenging, RETROFIT achieves around twice the BLEU score of transfer learning used in prior work and surpasses all baselines in cross-representation generalization.

[438] DiffPro: Joint Timestep and Layer-Wise Precision Optimization for Efficient Diffusion Inference

Farhana Amin, Sabiha Afroz, Kanchon Gharami, Mona Moghadampanah, Dimitrios S. Nikolopoulos

Main category: cs.LG

TL;DR: DiffPro is a post-training framework that jointly optimizes timesteps and per-layer precision in Diffusion Transformers to reduce latency and memory usage without retraining, achieving 6.25x model compression and 2.8x faster inference.

Details

Motivation: Diffusion models produce high-quality images but suffer from costly inference due to many denoising steps and heavy matrix operations, making them impractical for real-time deployment.

Method: Combines three techniques: manifold-aware sensitivity metric for weight bit allocation, dynamic activation quantization to stabilize activations across timesteps, and budgeted timestep selector guided by teacher-student drift.

Result: Achieves up to 6.25x model compression, 50% fewer timesteps, and 2.8x faster inference with Delta FID <= 10 on standard benchmarks.

Conclusion: DiffPro unifies step reduction and precision planning into a single budgeted deployable plan for real-time energy-aware diffusion inference.

Abstract: Diffusion models produce high-quality images, but inference is costly due to many denoising steps and heavy matrix operations. We present DiffPro, a post-training, hardware-faithful framework that works with the exact integer kernels used in deployment and jointly tunes timesteps and per-layer precision in Diffusion Transformers (DiTs) to reduce latency and memory without any training. DiffPro combines three parts: a manifold-aware sensitivity metric to allocate weight bits, dynamic activation quantization to stabilize activations across timesteps, and a budgeted timestep selector guided by teacher-student drift. In experiments, DiffPro achieves up to 6.25x model compression, fifty percent fewer timesteps, and 2.8x faster inference with Delta FID <= 10 on standard benchmarks, demonstrating practical efficiency gains. DiffPro unifies step reduction and precision planning into a single budgeted deployable plan for real-time energy-aware diffusion inference.

[439] FairReweighing: Density Estimation-Based Reweighing Framework for Improving Separation in Fair Regression

Xiaoyin Xi, Zhe Yu

Main category: cs.LG

TL;DR: Proposes FairReweighing, a pre-processing algorithm for fairness in regression tasks that addresses separation violations using mutual information metrics and density estimation, outperforming existing methods.

Details

Motivation: AI software lacks transparency in high-stakes applications, raising fairness concerns. Most fairness research focuses on binary classification, leaving regression fairness underexplored.

Method: Uses mutual information-based metric for separation violations, extends it to handle classification/regression with binary/continuous sensitive attributes. Proposes FairReweighing algorithm based on density estimation inspired by Reweighing in fair classification.

Result: Theoretically guarantees separation under data independence assumption. Empirically outperforms state-of-the-art regression fairness solutions on synthetic and real-world data, improving separation while maintaining high accuracy.

Conclusion: FairReweighing effectively addresses fairness in regression tasks, providing theoretical guarantees and empirical superiority over existing methods for ensuring separation fairness.

Abstract: There has been a prevalence of applying AI software in both high-stakes public-sector and industrial contexts. However, the lack of transparency has raised concerns about whether these data-informed AI software decisions secure fairness against people of all racial, gender, or age groups. Despite extensive research on emerging fairness-aware AI software, up to now most efforts to solve this issue have been dedicated to binary classification tasks. Fairness in regression is relatively underexplored. In this work, we adopted a mutual information-based metric to assess separation violations. The metric is also extended so that it can be directly applied to both classification and regression problems with both binary and continuous sensitive attributes. Inspired by the Reweighing algorithm in fair classification, we proposed a FairReweighing pre-processing algorithm based on density estimation to ensure that the learned model satisfies the separation criterion. Theoretically, we show that the proposed FairReweighing algorithm can guarantee separation in the training data under a data independence assumption. Empirically, on both synthetic and real-world data, we show that FairReweighing outperforms existing state-of-the-art regression fairness solutions in terms of improving separation while maintaining high accuracy.
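As a rough illustration of density-estimation-based reweighing, the sketch below extends the classical Reweighing ratio P(a)P(y)/P(a, y), which equals p(y)/p(y | a), to a continuous target by plugging in Gaussian kernel density estimates. The fixed bandwidth, the toy data, and the function names are assumptions; the paper's estimator and its separation guarantee are not reproduced here.

```python
import math

def kde(points, x, bandwidth):
    """Gaussian kernel density estimate of `points`, evaluated at x."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

def reweigh(ys, groups, bandwidth=0.2):
    """Per-sample weight p(y) / p(y | a), so that y and a decouple after weighting."""
    by_group = {}
    for y, a in zip(ys, groups):
        by_group.setdefault(a, []).append(y)
    weights = []
    for y, a in zip(ys, groups):
        p_y = kde(ys, y, bandwidth)
        p_y_given_a = kde(by_group[a], y, bandwidth)
        weights.append(p_y / p_y_given_a)
    return weights

# Toy data: group 0 has targets on [0, 2), group 1 on [1, 3).
ys = [2 * i / 100 for i in range(100)] + [1 + 2 * i / 100 for i in range(100)]
groups = [0] * 100 + [1] * 100
weights = reweigh(ys, groups)
print(weights[25], weights[75])  # group-0 samples at y = 0.5 and y = 1.5
```

Samples whose target value is typical only for their own group are down-weighted, while samples in the region shared by both groups keep a weight near one.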

[440] Epistemic Error Decomposition for Multi-step Time Series Forecasting: Rethinking Bias-Variance in Recursive and Direct Strategies

Riku Green, Huw Day, Zahraa S. Abdallah, Telmo M. Silva Filho

Main category: cs.LG

TL;DR: The paper challenges traditional bias-variance intuition for multi-step forecasting, showing that recursive strategies can have lower bias and higher variance than direct strategies due to model nonlinearity and parameter sensitivity.

Details

Motivation: To revisit the conventional belief that recursive forecasting has high bias/low variance while direct forecasting has low bias/high variance, by providing a more nuanced theoretical analysis.

Method: Decomposed multi-step forecast error into irreducible noise, structural approximation gap, and estimation-variance term. Analyzed linear vs nonlinear predictors and introduced Jacobian-based amplification factor for recursive strategy variance.

Result: For linear predictors, structural gap is zero; for nonlinear predictors, recursion can increase expressivity. Recursive variance equals one-step variance multiplied by Jacobian-based amplification factor. Experiments with MLPs on ETTm1 dataset confirmed findings.

Conclusion: Recursive forecasting can simultaneously have lower bias and higher variance than direct forecasting. Practical guidance should consider model nonlinearity and noise characteristics rather than traditional bias-variance intuition.

Abstract: Multi-step forecasting is often described through a simple rule of thumb: recursive strategies are said to have high bias and low variance, while direct strategies are said to have low bias and high variance. We revisit this belief by decomposing the expected multi-step forecast error into three parts: irreducible noise, a structural approximation gap, and an estimation-variance term. For linear predictors we show that the structural gap is identically zero for any dataset. For nonlinear predictors, however, the repeated composition used in recursion can increase model expressivity, making the structural gap depend on both the model and the data. We further show that the estimation variance of the recursive strategy at any horizon can be written as the one-step variance multiplied by a Jacobian-based amplification factor that measures how sensitive the composed predictor is to parameter error. This perspective explains when recursive forecasting may simultaneously have lower bias and higher variance than direct forecasting. Experiments with multilayer perceptrons on the ETTm1 dataset confirm these findings. The results offer practical guidance for choosing between recursive and direct strategies based on model nonlinearity and noise characteristics, rather than relying on traditional bias-variance intuition.
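The two strategies compared in the abstract can be sketched on a toy linear AR(1) series. The recursive strategy fits a one-step model and composes it h times, while the direct strategy fits a separate model for horizon h. The simulated data, the no-intercept least-squares fit, and the function names are illustrative assumptions, not the paper's experimental setup.

```python
import random

def simulate_ar1(phi, n, seed=0):
    """Simulate x[t+1] = phi * x[t] + standard Gaussian noise."""
    rng = random.Random(seed)
    xs = [0.0]
    for _ in range(n - 1):
        xs.append(phi * xs[-1] + rng.gauss(0.0, 1.0))
    return xs

def fit_slope(xs, lag):
    """Least-squares slope (no intercept) for predicting x[t+lag] from x[t]."""
    num = sum(xs[t] * xs[t + lag] for t in range(len(xs) - lag))
    den = sum(xs[t] ** 2 for t in range(len(xs) - lag))
    return num / den

def recursive_forecast(one_step_slope, x, horizon):
    """Apply the one-step model repeatedly, feeding each output back in."""
    for _ in range(horizon):
        x = one_step_slope * x
    return x

def direct_forecast(h_step_slope, x):
    """Apply a model that was fitted for the target horizon in one shot."""
    return h_step_slope * x

series = simulate_ar1(phi=0.8, n=5000)
horizon = 3
a1 = fit_slope(series, 1)          # one-step model, reused by recursion
ah = fit_slope(series, horizon)    # dedicated horizon-3 model
x_last = series[-1]
rec = recursive_forecast(a1, x_last, horizon)
dirf = direct_forecast(ah, x_last)
print(a1 ** horizon, ah, rec, dirf)
```

In this linear case both strategies target the same h-step map, which matches the abstract's statement that the structural gap vanishes for linear predictors. The recursive estimate is the one-step parameter raised to the power h, so its error is the one-step error scaled by a sensitivity factor, a scalar analogue of the Jacobian-based amplification the abstract describes.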

[441] MoCap2Radar: A Spatiotemporal Transformer for Synthesizing Micro-Doppler Radar Signatures from Motion Capture

Kevin Chen, Kenneth W. Parker, Anish Arora

Main category: cs.LG

TL;DR: A transformer-based model synthesizes radar spectrograms from Motion-Capture data, enabling radar data generation without physics-based computation.

Details

Motivation: To create radar data using abundant MoCap data instead of scarce radar datasets, with applications in edge computing and IoT radars.

Method: Windowed sequence-to-sequence task using a transformer model that captures spatial relations among MoCap markers and temporal dynamics across frames.

Result: Produces visually and quantitatively plausible Doppler radar spectrograms with good generalizability, including multi-part motion conversion and spatial understanding.

Conclusion: Transformers are effective for time-series signal processing, enabling radar data augmentation with less computation than physics-based methods.

Abstract: We present a pure machine learning process for synthesizing radar spectrograms from Motion-Capture (MoCap) data. We formulate MoCap-to-spectrogram translation as a windowed sequence-to-sequence task using a transformer-based model that jointly captures spatial relations among MoCap markers and temporal dynamics across frames. Real-world experiments show that the proposed approach produces visually and quantitatively plausible Doppler radar spectrograms and achieves good generalizability. Ablation experiments show that the learned model includes both the ability to convert multi-part motion into Doppler signatures and an understanding of the spatial relations between different parts of the human body. The result is an interesting example of using transformers for time-series signal processing. It is especially applicable to edge computing and Internet of Things (IoT) radars. It also suggests the ability to augment scarce radar datasets using more abundant MoCap data for training higher-level applications. Finally, it requires far less computation than physics-based methods for generating radar data.

[442] Quantifying and Improving Adaptivity in Conformal Prediction through Input Transformations

Sooyong Jang, Insup Lee

Main category: cs.LG

TL;DR: Proposes improved evaluation metrics for conformal prediction adaptiveness using uniform-mass binning and introduces a new adaptive prediction set algorithm that groups examples by difficulty for better performance.

Details

Motivation: Existing methods for evaluating adaptiveness in conformal prediction suffer from imbalanced binning, leading to inaccurate estimates of coverage and set size.

Method: Uses input transformations to sort examples by difficulty followed by uniform-mass binning, and proposes a new algorithm that groups examples by estimated difficulty and applies group-conditional conformal prediction.

Result: The proposed metrics correlate more strongly with adaptiveness and the new algorithm outperforms existing approaches on ImageNet classification and medical visual acuity prediction tasks.

Conclusion: The proposed evaluation framework and adaptive prediction set algorithm provide more reliable assessment of adaptiveness and improved performance compared to existing methods.

Abstract: Conformal prediction constructs a set of labels instead of a single point prediction, while providing a probabilistic coverage guarantee. Beyond the coverage guarantee, adaptiveness to example difficulty is an important property. It means that the method should produce larger prediction sets for more difficult examples, and smaller ones for easier examples. Existing evaluation methods for adaptiveness typically analyze coverage rate violation or average set size across bins of examples grouped by difficulty. However, these approaches often suffer from imbalanced binning, which can lead to inaccurate estimates of coverage or set size. To address this issue, we propose a binning method that leverages input transformations to sort examples by difficulty, followed by uniform-mass binning. Building on this binning, we introduce two metrics to better evaluate adaptiveness. These metrics provide more reliable estimates of coverage rate violation and average set size due to balanced binning, leading to more accurate adaptivity assessment. Through experiments, we demonstrate that our proposed metrics correlate more strongly with the desired adaptiveness property than existing ones. Furthermore, motivated by our findings, we propose a new adaptive prediction set algorithm that groups examples by estimated difficulty and applies group-conditional conformal prediction. This allows us to determine appropriate thresholds for each group. Experimental results on both (a) an image classification task (ImageNet) and (b) a medical task (visual acuity prediction) show that our method outperforms existing approaches according to the new metrics.
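Uniform-mass binning followed by group-conditional calibration can be sketched as below. The difficulty scores are hypothetical placeholders (the paper derives them from input transformations, which is not reproduced here), and the function names and toy numbers are assumptions.

```python
import math

def uniform_mass_bins(difficulty, n_bins):
    """Assign each example to a bin so that all bins hold about the same count."""
    n = len(difficulty)
    order = sorted(range(n), key=lambda i: difficulty[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // n
    return bins

def conformal_threshold(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    # Clipped to n for simplicity; strictly, the threshold is infinite in that case.
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(scores)[k - 1]

def group_thresholds(scores, bins, n_bins, alpha):
    """One calibration threshold per difficulty bin."""
    return [
        conformal_threshold([s for s, b in zip(scores, bins) if b == g], alpha)
        for g in range(n_bins)
    ]

# Toy calibration set: harder examples receive larger nonconformity scores.
difficulty = list(range(20))
scores = [d / 20 for d in difficulty]
bins = uniform_mass_bins(difficulty, n_bins=2)
thresholds = group_thresholds(scores, bins, n_bins=2, alpha=0.2)
print(bins.count(0), bins.count(1), thresholds)
```

Because every bin holds the same number of calibration examples, each per-group quantile is estimated from equally sized samples, and the harder bin ends up with a larger threshold and therefore larger prediction sets.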

[443] Data-efficient U-Net for Segmentation of Carbide Microstructures in SEM Images of Steel Alloys

Alinda Ezgi Gerçek, Till Korten, Paul Chekhonin, Maleeha Hassan, Peter Steinbach

Main category: cs.LG

TL;DR: A data-efficient U-Net model trained on only 10 SEM images achieves 0.98 Dice score for carbide segmentation in reactor-pressure-vessel steel, significantly outperforming classical methods (0.85) while reducing annotation effort by 10x.

Details

Motivation: Understanding steel microstructure is crucial for predicting mechanical properties, but gray-value overlap between carbides and matrix makes simple thresholding ineffective for SEM image segmentation.

Method: Lightweight U-Net (30.7M parameters) trained on just 10 annotated SEM images using a data-efficient segmentation pipeline.

Result: Achieves Dice-Sørensen coefficient of 0.98, significantly outperforming state-of-the-art classical image analysis (0.85) and reducing annotation effort by one order of magnitude.

Conclusion: This approach enables rapid automated carbide quantification for alloy design, generalizes to other steel types, and demonstrates the potential of data-efficient deep learning in reactor-pressure-vessel steel analysis.

Abstract: Understanding reactor-pressure-vessel steel microstructure is crucial for predicting mechanical properties, as carbide precipitates both strengthen the alloy and can initiate cracks. In scanning electron microscopy images, gray-value overlap between carbides and matrix makes simple thresholding ineffective. We present a data-efficient segmentation pipeline using a lightweight U-Net (30.7M parameters) trained on just 10 annotated scanning electron microscopy images. Despite limited data, our model achieves a Dice-Sørensen coefficient of 0.98, significantly outperforming the state-of-the-art in the field of metallurgy (classical image analysis: 0.85), while reducing annotation effort by one order of magnitude compared to the state-of-the-art data-efficient segmentation model. This approach enables rapid, automated carbide quantification for alloy design and generalizes to other steel types, demonstrating the potential of data-efficient deep learning in reactor-pressure-vessel steel analysis.
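For readers unfamiliar with the reported metric, the Dice-Sørensen coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened masks. The function name is hypothetical, and returning 1.0 when both masks are empty is a common convention assumed here, not something the abstract specifies.

```python
def dice_coefficient(pred, truth):
    """Dice-Sørensen coefficient of two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    # Pixels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total
```

For example, a prediction with two foreground pixels, one of which overlaps a single-pixel ground truth, scores 2·1 / (2 + 1) = 2/3.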

[444] Intrinsic Dimension Estimation for Radio Galaxy Zoo using Diffusion Models

Joan Font-Quer Roset, Devina Mohan, Anna Scaife

Main category: cs.LG

TL;DR: The paper estimates the intrinsic dimension of the Radio Galaxy Zoo dataset using a score-based diffusion model, finding higher iD values for out-of-distribution sources and overall higher iD compared to natural image datasets.

Details

Motivation: To understand how intrinsic dimension varies across radio galaxy morphological classes and examine relationships with energy scores and signal-to-noise ratio in astronomical datasets.

Method: Used a score-based diffusion model to estimate intrinsic dimension, analyzed variation with Bayesian neural network energy scores, and examined relationships across Fanaroff-Riley morphological classes and signal-to-noise ratio.

Result: Out-of-distribution sources show higher iD values, RGZ dataset has higher iD than natural image datasets, no relationship found between FR I and FR II classes, weak trend of higher SNR at lower iD.

Conclusion: The relationship between iD and energy scores can be used to quantitatively study and improve representations learned by self-supervised learning algorithms on the RGZ dataset.

Abstract: In this work, we estimate the intrinsic dimension (iD) of the Radio Galaxy Zoo (RGZ) dataset using a score-based diffusion model. We examine how the iD estimates vary as a function of Bayesian neural network (BNN) energy scores, which measure how similar the radio sources are to the MiraBest subset of the RGZ dataset. We find that out-of-distribution sources exhibit higher iD values, and that the overall iD for RGZ exceeds those typically reported for natural image datasets. Furthermore, we analyse how iD varies across Fanaroff-Riley (FR) morphological classes and as a function of the signal-to-noise ratio (SNR). While no relationship is found between FR I and FR II classes, a weak trend toward higher SNR at lower iD is observed. Future work using the RGZ dataset could make use of the relationship between iD and energy scores to quantitatively study and improve the representations learned by various self-supervised learning algorithms.

[445] Optimizing Mixture of Block Attention

Guangxuan Xiao, Junxian Guo, Kasra Mazaheri, Song Han

Main category: cs.LG

TL;DR: MoBA enables efficient long-context processing via sparse attention to key-value blocks, but suffers from unclear design principles and inefficient GPU implementation. This paper develops a statistical model revealing router accuracy as critical, proposes smaller blocks and key convolution for improvement, and introduces FlashMoBA for efficient GPU execution.

Details

Motivation: To understand MoBA's design principles and enable efficient GPU implementation for practical adoption, as current MoBA lacks theoretical understanding and hardware optimization.

Method: Developed statistical model to analyze MoBA mechanics, identified signal-to-noise ratio for retrieval accuracy, proposed smaller block sizes and key convolution to improve routing, and created FlashMoBA CUDA kernel for efficient GPU execution.

Result: Improved MoBA models match dense attention baseline performance, with FlashMoBA achieving up to 14.7x speedup over FlashAttention-2 for small blocks.

Conclusion: Theoretical analysis guides practical improvements to MoBA, and FlashMoBA enables efficient execution of theoretically-grounded small block sizes, making MoBA practical for long-context processing.

Abstract: Mixture of Block Attention (MoBA) (Lu et al., 2025) is a promising building block for efficiently processing long contexts in LLMs by enabling queries to sparsely attend to a small subset of key-value blocks, drastically reducing computational cost. However, the design principles governing MoBA’s performance are poorly understood, and it lacks an efficient GPU implementation, hindering its practical adoption. In this paper, we first develop a statistical model to analyze MoBA’s underlying mechanics. Our model reveals that performance critically depends on the router’s ability to accurately distinguish relevant from irrelevant blocks based on query-key affinities. We derive a signal-to-noise ratio that formally connects architectural parameters to this retrieval accuracy. Guided by our analysis, we identify two key pathways for improvement: using smaller block sizes and applying a short convolution on keys to cluster relevant signals, which enhances routing accuracy. While theoretically better, small block sizes are inefficient on GPUs. To bridge this gap, we introduce FlashMoBA, a hardware-aware CUDA kernel that enables efficient MoBA execution even with the small block sizes our theory recommends. We validate our insights by training LLMs from scratch, showing that our improved MoBA models match the performance of dense attention baselines. FlashMoBA achieves up to 14.7x speedup over FlashAttention-2 for small blocks, making our theoretically-grounded improvements practical. Code is available at: https://github.com/mit-han-lab/flash-moba.
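The routing step that the analysis centers on can be sketched as below. This is a simplified illustration under stated assumptions, not the FlashMoBA kernel or the authors' code: it assumes each block is summarized by the mean of its keys and scored by a dot product with the query, it ignores causal masking, and the function name `select_blocks` is hypothetical.

```python
def select_blocks(query, keys, block_size, top_k):
    """Return the indices of the top_k key blocks most similar to the query.

    query: list of floats; keys: list of key vectors with the same dimension.
    """
    if block_size < 1 or top_k < 1:
        raise ValueError("block_size and top_k must be positive")
    dim = len(query)
    n_blocks = (len(keys) + block_size - 1) // block_size
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        # Assumption: a block is represented by the mean of its keys.
        centroid = [sum(k[d] for k in block) / len(block) for d in range(dim)]
        scores.append(sum(q * c for q, c in zip(query, centroid)))
    ranked = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)
    # The query then attends only to the keys inside the selected blocks.
    return sorted(ranked[:top_k])
```

In this toy form the trade-off discussed in the abstract is visible: with smaller blocks a relevant key is averaged with fewer irrelevant ones, so its block centroid stands out more clearly, but there are more blocks to score and gather.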

[446] Honesty over Accuracy: Trustworthy Language Models through Reinforced Hesitation

Mohamad Amin Mohamadi, Tianhao Wang, Zhiyuan Li

Main category: cs.LG

TL;DR: The paper proposes Reinforced Hesitation (RH) to train language models to abstain from answering when uncertain, using ternary rewards (+1 for correct, 0 for abstention, -λ for errors) instead of binary rewards in RLVR training.

Details

Motivation: Current language models fail to know when not to answer, producing confident hallucinations even when wrong answers could have catastrophic consequences, and they almost never abstain despite explicit warnings.

Method: Reinforced Hesitation modifies RLVR with ternary rewards; introduces cascading and self-cascading inference strategies that use abstention as a coordination signal to route queries through models with different risk tolerance levels.

Result: Varying λ produces distinct models along a Pareto frontier - low penalties yield aggressive answerers, high penalties yield conservative abstainers. Both cascading strategies outperform majority voting with lower computational cost.

Conclusion: Abstention should be treated as a first-class training objective, transforming “I don’t know” from failure into a coordination signal that enables models to earn trust through calibrated honesty about their limits.

Abstract: Modern language models fail a fundamental requirement of trustworthy intelligence: knowing when not to answer. Despite achieving impressive accuracy on benchmarks, these models produce confident hallucinations, even when wrong answers carry catastrophic consequences. Our evaluations on GSM8K, MedQA and GPQA show frontier models almost never abstain despite explicit warnings of severe penalties, suggesting that prompts cannot override training that rewards any answer over no answer. As a remedy, we propose Reinforced Hesitation (RH): a modification to Reinforcement Learning from Verifiable Rewards (RLVR) to use ternary rewards (+1 correct, 0 abstention, -λ error) instead of binary. Controlled experiments on logic puzzles reveal that varying λ produces distinct models along a Pareto frontier, where each training penalty yields the optimal model for its corresponding risk regime: low penalties produce aggressive answerers, high penalties conservative abstainers. We then introduce two inference strategies that exploit trained abstention as a coordination signal: cascading routes queries through models with decreasing risk tolerance, while self-cascading re-queries the same model on abstention. Both outperform majority voting with lower computational cost. These results establish abstention as a first-class training objective that transforms “I don’t know” from failure into a coordination signal, enabling models to earn trust through calibrated honesty about their limits.
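The ternary reward and the two inference strategies can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: abstention is represented by `None`, models are plain callables, and all function names are hypothetical. The break-even rule in `answer_threshold` is a standard expected-value calculation added here for intuition and is not stated in the abstract: answering with success probability p has expected reward p - (1 - p)λ, which beats abstaining (reward 0) only when p > λ / (1 + λ).

```python
def ternary_reward(answer, correct_answer, penalty):
    """+1 for a correct answer, 0 for abstention (None), -penalty for an error."""
    if answer is None:
        return 0.0
    return 1.0 if answer == correct_answer else -float(penalty)


def answer_threshold(penalty):
    """Confidence above which answering has positive expected reward."""
    return penalty / (1.0 + penalty)


def cascade(models, query):
    """Try models in the given order; return the first non-abstaining answer."""
    for model in models:
        answer = model(query)
        if answer is not None:
            return answer
    return None


def self_cascade(model, query, max_tries):
    """Re-query the same model until it answers or the budget is spent."""
    for _ in range(max_tries):
        answer = model(query)
        if answer is not None:
            return answer
    return None
```

With λ = 1 the break-even confidence is 0.5, and with λ = 3 it rises to 0.75, which matches the qualitative picture in the abstract of low penalties producing aggressive answerers and high penalties producing conservative abstainers.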

[447] FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models

Yonatan Dukler, Guihong Li, Deval Shah, Vikram Appia, Emad Barsoum

Main category: cs.LG

TL;DR: FarSkip-Collective modifies MoE architectures with skip connections to overlap computation with communication, achieving near-original accuracy in models up to 109B parameters while accelerating training and inference.

Details

Motivation: Blocking communication is a major efficiency bottleneck for running Mixture of Experts (MoEs) in distributed settings, limiting their performance in training and inference.

Method: Modifies model architecture with skip connections to enable communication-computation overlap, uses self-distillation for model conversion, and implements optimized frameworks for explicit overlap.

Result: Successfully converted models from 16B to 109B parameters (including Llama 4 Scout) with average accuracy within 1% of original releases across downstream evaluations, while achieving communication-computation overlap benefits.

Conclusion: FarSkip-Collective enables efficient MoE execution in distributed environments by architecturally enabling communication-computation overlap without sacrificing model accuracy, making large-scale MoE models more practical for deployment.

Abstract: Blocking communication presents a major hurdle in running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective, which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach modifies the architecture to skip connections in the model, and it is unclear a priori whether the modified model architecture can remain as capable, especially for large state-of-the-art models and while modifying all of the model layers. We answer this question in the affirmative and fully convert a series of state-of-the-art models varying from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy on par with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve average accuracy within 1% of its instruction-tuned release across a wide range of downstream evaluations. In addition to demonstrating retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks.

[448] Generalizing Fair Clustering to Multiple Groups: Algorithms and Applications

Diptarka Chakraborty, Kushagra Chatterjee, Debarati Das, Tien-Long Nguyen

Main category: cs.LG

TL;DR: This paper extends closest fair clustering to multiple groups, shows NP-hardness for equal-sized groups, provides near-linear time approximation algorithms, and applies them to improve fair correlation clustering and address fair consensus clustering.

Details

Motivation: Existing fair clustering approaches are limited to two groups, but real-world data typically involves multiple protected attributes (age, ethnicity, gender, etc.), creating a need for fair clustering methods that handle arbitrary numbers of groups.

Method: The authors generalize closest fair clustering to multiple groups, prove NP-hardness even for equal-sized groups, and develop near-linear time approximation algorithms that efficiently handle arbitrary-sized multiple groups.

Result: The paper provides approximation algorithms for closest fair clustering with multiple groups, improves approximation guarantees for fair correlation clustering, and introduces the first approximation algorithms for fair consensus clustering with multiple groups.

Conclusion: This work successfully extends fair clustering to handle multiple protected attributes, addresses open questions from previous research, and provides efficient approximation algorithms that advance the state-of-the-art in fair clustering.

Abstract: Clustering is a fundamental task in machine learning and data analysis, but it frequently fails to provide fair representation for various marginalized communities defined by multiple protected attributes – a shortcoming often caused by biases in the training data. As a result, there is a growing need to enhance the fairness of clustering outcomes, ideally by making minimal modifications, possibly as a post-processing step after conventional clustering. Recently, Chakraborty et al. [COLT'25] initiated the study of closest fair clustering, though in a restricted scenario where data points belong to only two groups. In practice, however, data points are typically characterized by many groups, reflecting diverse protected attributes such as age, ethnicity, gender, etc. In this work, we generalize the study of the closest fair clustering problem to settings with an arbitrary number (more than two) of groups. We begin by showing that the problem is NP-hard even when all groups are of equal size – a stark contrast with the two-group case, for which an exact algorithm exists. Next, we propose near-linear time approximation algorithms that efficiently handle arbitrary-sized multiple groups, thereby answering an open question posed by Chakraborty et al. [COLT'25]. Leveraging our closest fair clustering algorithms, we further achieve improved approximation guarantees for the fair correlation clustering problem, advancing the state-of-the-art results established by Ahmadian et al. [AISTATS'20] and Ahmadi et al. [2020]. Additionally, we are the first to provide approximation algorithms for the fair consensus clustering problem involving multiple (more than two) groups, thus addressing another open direction highlighted by Chakraborty et al. [COLT'25].

[449] A Unified Convergence Analysis for Semi-Decentralized Learning: Sampled-to-Sampled vs. Sampled-to-All Communication

Angelo Rodio, Giovanni Neglia, Zheng Chen, Erik G. Larsson

Main category: cs.LG

TL;DR: Comparison of two communication strategies (S2S and S2A) in semi-decentralized federated learning, showing their performance depends on data heterogeneity and providing design guidelines.

Details

Motivation: Despite practical significance, no rigorous theoretical and empirical comparison exists between sampled-to-sampled (S2S) and sampled-to-all (S2A) communication strategies in semi-decentralized FL.

Method: Developed a unified convergence framework analyzing S2S and S2A strategies, accounting for sampling rate, server aggregation frequency, and network connectivity.

Result: Revealed distinct performance regimes where one strategy outperforms the other depending primarily on data heterogeneity across devices.

Conclusion: Provides concrete design guidelines for practical semi-decentralized FL deployments based on data heterogeneity conditions.

Abstract: In semi-decentralized federated learning, devices primarily rely on device-to-device communication but occasionally interact with a central server. Periodically, a sampled subset of devices uploads their local models to the server, which computes an aggregate model. The server can then either (i) share this aggregate model only with the sampled clients (sampled-to-sampled, S2S) or (ii) broadcast it to all clients (sampled-to-all, S2A). Despite their practical significance, a rigorous theoretical and empirical comparison of these two strategies remains absent. We address this gap by analyzing S2S and S2A within a unified convergence framework that accounts for key system parameters: sampling rate, server aggregation frequency, and network connectivity. Our results, both analytical and experimental, reveal distinct regimes where one strategy outperforms the other, depending primarily on the degree of data heterogeneity across devices. These insights lead to concrete design guidelines for practical semi-decentralized FL deployments.
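The difference between the two server strategies can be made concrete with a toy server round. This sketch assumes scalar models and a uniform average over the sampled devices; the device-to-device communication phase and the actual aggregation weights used in the paper are not modeled, and `server_round` is a hypothetical name.

```python
def server_round(models, sampled, strategy):
    """One server aggregation step; returns the updated list of models.

    models:   one scalar model per device.
    sampled:  indices of the devices that upload this round.
    strategy: "S2S" (sampled-to-sampled) or "S2A" (sampled-to-all).
    """
    if not sampled:
        raise ValueError("at least one device must be sampled")
    if strategy not in ("S2S", "S2A"):
        raise ValueError("strategy must be 'S2S' or 'S2A'")

    # Assumption: the server takes a uniform average of the uploaded models.
    aggregate = sum(models[i] for i in sampled) / len(sampled)

    # S2S returns the aggregate only to the uploaders; S2A broadcasts it.
    receivers = set(sampled) if strategy == "S2S" else set(range(len(models)))
    return [aggregate if i in receivers else m for i, m in enumerate(models)]
```

Under S2A every device is overwritten by an average computed from the sampled subset only, whereas under S2S the unsampled devices keep their local state. The abstract reports that which of these behaviors works better depends primarily on how heterogeneous the data are across devices.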

[450] Multistability of Self-Attention Dynamics in Transformers

Claudio Altafini

Main category: cs.LG

TL;DR: Self-attention dynamics in transformers is related to a multiagent Oja flow and has four types of equilibria: consensus, bipartite consensus, clustering, and polygonal equilibria, with multiple stable equilibria often coexisting.

Details

Motivation: To understand the continuous-time dynamics of self-attention mechanisms in transformers by relating them to established dynamical systems and classifying their equilibrium states.

Method: Analyzed self-attention dynamics as a continuous-time multiagent model, related it to multiagent Oja flow, and classified equilibria into four categories based on their mathematical properties.

Result: Identified four classes of equilibria that can coexist in self-attention dynamics, with consensus and bipartite consensus equilibria always aligned with eigenvectors of the value matrix (often the principal eigenvector).

Conclusion: Self-attention dynamics exhibits rich equilibrium behavior with multiple stable states, providing insights into the mathematical foundations of transformer attention mechanisms.

Abstract: In machine learning, a self-attention dynamics is a continuous-time multiagent-like model of the attention mechanisms of transformers. In this paper we show that such dynamics is related to a multiagent version of the Oja flow, a dynamical system that computes the principal eigenvector of a matrix, which for transformers corresponds to the value matrix. We classify the equilibria of the “single-head” self-attention system into four classes: consensus, bipartite consensus, clustering and polygonal equilibria. Multiple asymptotically stable equilibria from the first three classes often coexist in the self-attention dynamics. Interestingly, equilibria from the first two classes are always aligned with the eigenvectors of the value matrix, often but not exclusively with the principal eigenvector.
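As background for the Oja flow mentioned above, the classical single-agent version is dx/dt = Ax - (xᵀAx)x. The sketch below integrates it with a forward Euler scheme. It illustrates only this classical flow, not the multiagent self-attention dynamics analyzed in the paper; the function name, step size, and step count are arbitrary choices made for the example.

```python
def oja_flow(matrix, x0, dt=0.01, steps=5000):
    """Forward-Euler integration of the Oja flow dx/dt = A x - (x^T A x) x."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        # Matrix-vector product A x.
        ax = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Quadratic form x^T A x, which rescales x toward unit norm.
        quad = sum(x[i] * ax[i] for i in range(n))
        x = [x[i] + dt * (ax[i] - quad * x[i]) for i in range(n)]
    return x
```

For the diagonal matrix with entries 2 and 1, starting from (0.6, 0.8), the state approaches (1, 0), the unit eigenvector of the larger eigenvalue. The paper's point is that the multiagent version has a richer set of equilibria, several of which can be stable at the same time.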

[451] On bounds for norms of reparameterized ReLU artificial neural network parameters: sums of fractional powers of the Lipschitz norm control the network parameter vector

Arnulf Jentzen, Timo Kröger

Main category: cs.LG

TL;DR: The paper establishes a converse relationship between ANN parameter norms and Lipschitz norms of realization functions, showing that parameter vector norms can be bounded by sums of powers of Lipschitz norms (exponents 1/2 and 1) for shallow ReLU networks.

Details

Motivation: To prove that the converse of the known Lipschitz norm bound holds - specifically, that ANN parameter vector norms can be bounded from above by Lipschitz norms of the realization function, establishing a two-way relationship.

Method: Mathematical analysis and proofs for shallow feedforward fully-connected ReLU neural networks, examining the relationship between parameter vector norms and various function norms including Lipschitz, Hölder, and Sobolev-Slobodeckij norms.

Result: Proved that the norm of ANN parameter vectors is bounded by sums of powers of Lipschitz norms (exponents 1/2 and 1), but this bound does not hold for Hölder norms, Sobolev-Slobodeckij norms, or for the Lipschitz norm alone.

Conclusion: The paper establishes a precise converse relationship between parameter norms and Lipschitz norms for shallow ReLU networks, revealing the specific conditions under which such bounds hold and providing insights into the fundamental properties of neural network parameterizations.

Abstract: It is an elementary fact in the scientific literature that the Lipschitz norm of the realization function of a feedforward fully-connected rectified linear unit (ReLU) artificial neural network (ANN) can, up to a multiplicative constant, be bounded from above by sums of powers of the norm of the ANN parameter vector. Roughly speaking, in this work we reveal in the case of shallow ANNs that the converse inequality is also true. More formally, we prove that the norm of the equivalence class of ANN parameter vectors with the same realization function is, up to a multiplicative constant, bounded from above by the sum of powers of the Lipschitz norm of the ANN realization function (with the exponents 1/2 and 1). Moreover, we prove that this upper bound only holds when employing the Lipschitz norm but holds neither for Hölder norms nor for Sobolev-Slobodeckij norms. Furthermore, we prove that this upper bound only holds for sums of powers of the Lipschitz norm with the exponents 1/2 and 1 but does not hold for the Lipschitz norm alone.

[452] On the Relationship Between Adversarial Robustness and Decision Region in Deep Neural Networks

Seongjin Park, Haedong Jeong, Tair Djanibekov, Giyoung Jeon, Jinseok Seol, Jaesik Choi

Main category: cs.LG

TL;DR: This paper analyzes adversarial robustness in DNNs through geometric properties, proposing the Populated Region Set (PRS) concept and showing that low PRS ratio correlates with better robustness. They also develop a PRS regularizer to improve robustness without adversarial training.

Details

Motivation: As DNN generalization performance converges to state-of-the-art levels, it becomes insufficient for comprehensive evaluation. Adversarial robustness serves as an additional metric, but few studies analyze it from a geometric perspective within DNNs.

Method: Proposes the Populated Region Set (PRS) concept to represent internal geometric properties where training samples cluster. Conducts systematic experiments to analyze PRS ratio’s relationship with robustness and develops a PRS regularizer for improving robustness.

Result: Empirical evidence shows that a low PRS ratio strongly correlates with better adversarial robustness in DNNs. The PRS regularizer successfully improves model robustness without requiring adversarial training.

Conclusion: The geometric property represented by PRS ratio is a key factor in adversarial robustness, and the proposed PRS regularizer provides an effective method to enhance robustness without adversarial training.

Abstract: In general, Deep Neural Networks (DNNs) are evaluated by the generalization performance measured on unseen data excluded from the training phase. Along with the development of DNNs, the generalization performance converges to the state-of-the-art and it becomes difficult to evaluate DNNs solely based on this metric. The robustness against adversarial attack has been used as an additional metric to evaluate DNNs by measuring their vulnerability. However, few studies have been performed to analyze the adversarial robustness in terms of the geometry in DNNs. In this work, we perform an empirical study to analyze the internal properties of DNNs that affect model robustness under adversarial attacks. In particular, we propose the novel concept of the Populated Region Set (PRS), where training samples are populated more frequently, to represent the internal properties of DNNs in a practical setting. From systematic experiments with the proposed concept, we provide empirical evidence to validate that a low PRS ratio has a strong relationship with the adversarial robustness of DNNs. We also devise a PRS regularizer leveraging the characteristics of PRS to improve the adversarial robustness without adversarial training.

[453] Higher-order Neural Additive Models: An Interpretable Machine Learning Model with Feature Interactions

Minkyu Kim, Hyun-Soo Choi, Jinho Kim

Main category: cs.LG

TL;DR: HONAMs extend Neural Additive Models to capture arbitrary-order feature interactions while maintaining interpretability, improving predictive accuracy for real-world datasets.

Details

Motivation: Standard Neural Additive Models (NAMs) are limited to first-order feature interactions, which restricts their effectiveness on real-world datasets that often contain complex higher-order interactions.

Method: Propose Higher-order Neural Additive Models (HONAMs) that capture feature interactions of arbitrary orders while preserving the interpretability of NAMs through an efficient architecture.

Result: HONAMs demonstrate improved predictive accuracy compared to standard NAMs without compromising interpretability, enabling analysis of high-order interactions in datasets.

Conclusion: HONAMs provide an interpretable machine learning model that effectively captures arbitrary-order feature interactions, making them suitable for high-stakes applications where both accuracy and interpretability are essential.

Abstract: Neural Additive Models (NAMs) have recently demonstrated promising predictive performance while maintaining interpretability. However, their capacity is limited to capturing only first-order feature interactions, which restricts their effectiveness on real-world datasets. To address this limitation, we propose Higher-order Neural Additive Models (HONAMs), an interpretable machine learning model that effectively and efficiently captures feature interactions of arbitrary orders. HONAMs improve predictive accuracy without compromising interpretability, an essential requirement in high-stakes applications. This advantage of HONAM can help analyze and extract high-order interactions present in datasets. The source code for HONAM is publicly available at https://github.com/gim4855744/HONAM/.

[454] Understanding Deep Representation Learning via Layerwise Feature Compression and Discrimination

Peng Wang, Xiao Li, Can Yaras, Zhihui Zhu, Laura Balzano, Wei Hu, Qing Qu

Main category: cs.LG

TL;DR: This paper provides the first quantitative characterization of feature evolution in deep linear networks, showing that layers progressively compress within-class features geometrically and discriminate between-class features linearly.

Details

Motivation: To understand how deep networks perform hierarchical feature learning across layers, motivated by empirical findings that linear layers mimic deep layers in nonlinear networks for feature learning.

Method: Define metrics for within-class compression and between-class discrimination, then theoretically analyze these metrics for deep linear networks with nearly orthogonal input data and minimum-norm, balanced, approximate low-rank weights.

Result: Features evolve following a simple pattern: each layer compresses within-class features at a geometric rate and discriminates between-class features at a linear rate with respect to layer depth.

Conclusion: The theoretical results are validated empirically and reveal similar patterns in deep nonlinear networks, with practical implications for transfer learning.

Abstract: Over the past decade, deep learning has proven to be a highly effective tool for learning meaningful features from raw data. However, it remains an open question how deep networks perform hierarchical feature learning across layers. In this work, we attempt to unveil this mystery by investigating the structures of intermediate features. Motivated by our empirical findings that linear layers mimic the roles of deep layers in nonlinear networks for feature learning, we explore how deep linear networks transform input data into output by investigating the output (i.e., features) of each layer after training in the context of multi-class classification problems. Toward this goal, we first define metrics to measure within-class compression and between-class discrimination of intermediate features, respectively. Through theoretical analysis of these two metrics, we show that the evolution of features follows a simple and quantitative pattern from shallow to deep layers when the input data is nearly orthogonal and the network weights are minimum-norm, balanced, and approximate low-rank: Each layer of the linear network progressively compresses within-class features at a geometric rate and discriminates between-class features at a linear rate with respect to the number of layers that data have passed through. To the best of our knowledge, this is the first quantitative characterization of feature evolution in hierarchical representations of deep linear networks. Empirically, our extensive experiments not only validate our theoretical results numerically but also reveal a similar pattern in deep nonlinear networks which aligns well with recent empirical studies. Moreover, we demonstrate the practical implications of our results in transfer learning. Our code is available at https://github.com/Heimine/PNC_DLN.

[455] MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion

Ruixiang Jiang, Lingbo Liu, Changwen Chen

Main category: cs.LG

TL;DR: MoPE enhances prompt-based multimodal fusion by generating instance-specific prompts through a mixture of experts, improving adaptivity and expressiveness while maintaining parameter efficiency.

Details

Motivation: Traditional prompt-based fusion methods have limited adaptivity and expressiveness due to using shared prompts across all instances, hindering effectiveness at scale.

Method: Proposes Mixture of Prompt Experts (MoPE) that dynamically generates instance-specific prompts using multimodal pairings as evidence, with regularization to encourage expert specialization.

Result: Achieves state-of-the-art performance across six multimodal datasets spanning four modalities, matching fine-tuning performance with only 0.8% trainable parameters.

Conclusion: MoPE fundamentally changes scaling dynamics for prompt-based fusion, enabling greater expressiveness and adaptability to complex multimodal relationships.

Abstract: Despite the demonstrated parameter efficiency of prompt-based fusion, its limited adaptivity and expressiveness hinder its effectiveness for multimodal applications at scale. In this paper, we present the first comprehensive study addressing these limitations. Our key motivation is to “divide and conquer” the vanilla prompt, traditionally shared across all instances, by generating instance-specific prompts. Specifically, we propose the Mixture of Prompt Experts (MoPE), a framework that significantly enhances prompt adaptivity and expressiveness by dynamically generating instance-specific prompts. MoPE leverages multimodal pairings as additional evidence, allowing the model to adaptively select optimal prompts tailored to each individual instance. Unlike traditional prompt-fusion methods, which encounter scalability bottlenecks when optimizing long unified prompts, MoPE maintains fixed prompt length while effectively scaling the number of specialized experts. Moreover, we investigate regularization terms to encourage expert specialization, resulting in highly adaptive and interpretable prompting. MoPE fundamentally changes the scaling dynamic, unlocking greater expressiveness and adaptability to complex multimodal relationships, enabling the model to selectively attend to task-relevant sub-sequences based on instance-specific multimodal input. Extensive experiments across six multimodal datasets spanning four modalities demonstrate state-of-the-art performance for multimodal fusion, matching or surpassing the performance of fine-tuning while requiring only 0.8% of the trainable parameters. Code is available: https://github.com/songrise/MoPE.

[456] Partial Information Decomposition for Data Interpretability and Feature Selection

Charles Westphal, Stephen Hailes, Mirco Musolesi

Main category: cs.LG

TL;DR: PIDF introduces a new paradigm for data interpretability and feature selection using three metrics per feature: mutual information with target, synergistic contribution, and redundant information, revealing both individual and combined feature importance.

Details

Motivation: Traditional feature selection methods assign single importance values, lacking insight into how features interact and provide overlapping or synergistic information when considered together.

Method: Developed a novel procedure based on three information-theoretic metrics per feature: mutual information with target, synergistic information contribution, and redundant information amount.

Result: Extensive evaluation on synthetic and real-world data from genetics and neuroscience demonstrates PIDF’s effectiveness in revealing feature correlations and interaction patterns.

Conclusion: PIDF provides a comprehensive framework for simultaneous data interpretability and feature selection by capturing individual feature importance and their synergistic/redundant interactions with other features.

Abstract: In this paper, we introduce Partial Information Decomposition of Features (PIDF), a new paradigm for simultaneous data interpretability and feature selection. Contrary to traditional methods that assign a single importance value, our approach is based on three metrics per feature: the mutual information shared with the target variable, the feature’s contribution to synergistic information, and the amount of this information that is redundant. In particular, we develop a novel procedure based on these three metrics, which reveals not only how features are correlated with the target but also the additional and overlapping information provided by considering them in combination with other features. We extensively evaluate PIDF using both synthetic and real-world data, demonstrating its potential applications and effectiveness, by considering case studies from genetics and neuroscience.
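To make the idea of synergistic information concrete, here is a minimal sketch. It is not the PIDF procedure itself, only a plug-in mutual information estimate on a toy XOR target, where each feature alone tells nothing about the target but the pair determines it fully. The helper name `mutual_information` and the toy data are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


# Toy data (assumption): two fair binary features, all four combinations
# equally likely, and a target that is their XOR.
samples = list(product([0, 1], repeat=2))
f1 = [a for a, _ in samples]
f2 = [b for _, b in samples]
target = [a ^ b for a, b in samples]

mi_f1 = mutual_information(f1, target)  # feature 1 alone
mi_f2 = mutual_information(f2, target)  # feature 2 alone
mi_joint = mutual_information(list(zip(f1, f2)), target)  # both features together
```

In this toy case the individual scores are 0 bits while the joint score is 1 bit, which is the kind of purely synergistic contribution that a single per-feature importance value cannot express.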

[457] Posterior Label Smoothing for Node Classification

Jaeseung Heo, Moonjeong Park, Dongwoo Kim

Main category: cs.LG

TL;DR: Posterior label smoothing improves node classification on graphs by deriving soft labels from neighborhood label distributions, adapting to various graph properties and reducing overfitting.

Details

Motivation: Label smoothing is well-studied in ML but unexplored for node classification across homophilic to heterophilic graphs. Current methods don't leverage neighborhood label distributions effectively.

Method: Posterior label smoothing derives soft labels from posterior distribution conditioned on neighborhood labels, with likelihood and prior estimated from global graph statistics.

Result: Consistent accuracy improvements on 10 benchmark datasets across 8 baseline models. Soft labels mitigate overfitting and pseudo-labeling refines global label statistics.

Conclusion: The method effectively adapts to diverse graph properties, improves generalization through overfitting reduction, and demonstrates practical value for transductive node classification.

Abstract: Label smoothing is a widely studied regularization technique in machine learning. However, its potential for node classification in graph-structured data, spanning homophilic to heterophilic graphs, remains largely unexplored. We introduce posterior label smoothing, a novel method for transductive node classification that derives soft labels from a posterior distribution conditioned on neighborhood labels. The likelihood and prior distributions are estimated from the global statistics of the graph structure, allowing our approach to adapt naturally to various graph properties. We evaluate our method on 10 benchmark datasets using eight baseline models, demonstrating consistent improvements in classification accuracy. The following analysis demonstrates that soft labels mitigate overfitting during training, leading to better generalization performance, and that pseudo-labeling effectively refines the global label statistics of the graph. Our code is available at https://github.com/ml-postech/PosteL.
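The posterior construction described above can be sketched roughly as follows. This is an illustrative approximation, not the authors' implementation: it assumes neighbor labels are conditionally independent given the node's label, that all labels are observed, and that the soft label is a convex mix of the posterior and the one-hot label with a hypothetical weight `alpha`. The add-one smoothing is also an assumption.

```python
from collections import Counter


def posterior_soft_labels(edges, labels, num_classes, alpha=0.8):
    """Mix one-hot labels with a posterior conditioned on neighbor labels."""
    n = len(labels)
    # Prior: global class frequencies.
    counts = Counter(labels)
    prior = [counts[k] / n for k in range(num_classes)]

    # Likelihood P(neighbor label = m | node label = k), counted over all
    # edges in both directions.
    pair = [[0] * num_classes for _ in range(num_classes)]
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        pair[labels[u]][labels[v]] += 1
        pair[labels[v]][labels[u]] += 1
        neighbors[u].append(v)
        neighbors[v].append(u)
    lik = []
    for k in range(num_classes):
        total = sum(pair[k])
        # Add-one smoothing (assumption) keeps every likelihood positive.
        lik.append([(pair[k][m] + 1) / (total + num_classes)
                    for m in range(num_classes)])

    soft = []
    for i in range(n):
        post = list(prior)
        for j in neighbors[i]:
            post = [post[k] * lik[k][labels[j]] for k in range(num_classes)]
        z = sum(post)
        post = [p / z for p in post]
        one_hot = [1.0 if k == labels[i] else 0.0 for k in range(num_classes)]
        soft.append([alpha * post[k] + (1 - alpha) * one_hot[k]
                     for k in range(num_classes)])
    return soft


# Toy homophilic graph (assumption): two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = [0, 0, 0, 1, 1, 1]
soft_labels = posterior_soft_labels(edges, labels, num_classes=2)
```

Because the likelihood table is estimated from edge statistics, the same code would put mass on the opposite class in a heterophilic graph, which is one way to read the abstract's claim that the approach adapts to different graph properties. For high-degree nodes a practical version would accumulate log-probabilities to avoid underflow.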

[458] A Global Geometric Analysis of Maximal Coding Rate Reduction

Peng Wang, Huikang Liu, Druv Pai, Yaodong Yu, Zhihui Zhu, Qing Qu, Yi Ma

Main category: cs.LG

TL;DR: This paper provides a complete theoretical characterization of the optimization landscape for the maximal coding rate reduction (MCR²) objective, showing all critical points are either local maximizers or strict saddle points, making it suitable for first-order optimization methods.

Details

Motivation: The MCR² objective is gaining attention for learning structured deep representations but lacks complete theoretical justification regarding its optimization landscape and properties of local optima.

Method: Theoretical analysis of the MCR² objective’s critical points, proving properties of local and global optima, and conducting experiments on synthetic and real datasets to validate findings.

Result: Showed that every maximizer of MCR² produces low-dimensional, discriminative, and diverse representations, and all critical points are either local maximizers or strict saddle points.

Conclusion: MCR² has a favorable optimization landscape that makes it naturally suitable for learning diverse and discriminative representations using first-order optimization methods.

Abstract: The maximal coding rate reduction (MCR²) objective for learning structured and compact deep representations is drawing increasing attention, especially after its recent usage in the derivation of fully explainable and highly effective deep network architectures. However, it lacks a complete theoretical justification: only the properties of its global optima are known, and its global landscape has not been studied. In this work, we give a complete characterization of the properties of all its local and global optima, as well as other types of critical points. Specifically, we show that each (local or global) maximizer of the MCR² problem corresponds to a low-dimensional, discriminative, and diverse representation, and furthermore, each critical point of the objective is either a local maximizer or a strict saddle point. Such a favorable landscape makes MCR² a natural choice of objective for learning diverse and discriminative representations via first-order optimization methods. To validate our theoretical findings, we conduct extensive experiments on both synthetic and real data sets.
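For orientation, the coding rate reduction objective is commonly written in the MCR² literature as below. The notation is an assumption, since this summary does not state it: Z is a d × n matrix of features, Z_k holds the n_k columns belonging to class k, and ε is a prescribed precision. The paper may analyze a regularized or constrained variant of this form.

```latex
\Delta R(Z, \epsilon)
  = \underbrace{\frac{1}{2}\log\det\!\Big(I + \frac{d}{n\,\epsilon^{2}}\, Z Z^{\top}\Big)}_{\text{rate of all features}}
  \;-\;
  \underbrace{\sum_{k=1}^{K} \frac{n_k}{2n}\,\log\det\!\Big(I + \frac{d}{n_k\,\epsilon^{2}}\, Z_k Z_k^{\top}\Big)}_{\text{rate of each class}}
```

Maximizing the first term spreads the features out as a whole, while the subtracted term compresses each class, which matches the "diverse and discriminative" description in the abstract.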

[459] Towards Formalizing Spuriousness of Biased Datasets Using Partial Information Decomposition

Barproda Halder, Faisal Hamman, Pasan Dissanayake, Qiuyi Zhang, Ilia Sucholutsky, Sanghamitra Dutta

Main category: cs.LG

TL;DR: Proposes an explainability framework using Partial Information Decomposition to preemptively identify and measure spurious associations in datasets before model training, with applications to high-dimensional image data.

Details

Motivation: To address the problem of spurious correlations in datasets that can mislead machine learning models, by developing a method to disentangle these associations before training begins.

Method: Leverages Partial Information Decomposition (PID) to break down information about the target into four components: unique information in core features, unique information in spurious features, redundant information, and synergistic information. Includes segmentation, dimensionality reduction, and estimation modules for handling high-dimensional data.

Result: Developed a novel measure of dataset spuriousness and validated it across 6 benchmark datasets. Observed agreement between the preemptive spuriousness measure and post-training model generalization metrics like worst-group accuracy.

Conclusion: The framework successfully disentangles spurious associations in datasets and provides a systematic way to measure spuriousness, which correlates with model generalization performance, enabling better dataset understanding before model training.

Abstract: Spuriousness arises when there is an association between two or more variables in a dataset that are not causally related. In this work, we propose an explainability framework to preemptively disentangle the nature of such spurious associations in a dataset before model training. We leverage a body of work in information theory called Partial Information Decomposition (PID) to decompose the total information about the target into four non-negative quantities, namely unique information (in core and spurious features, respectively), redundant information, and synergistic information. Our framework helps anticipate when the core or spurious feature is indispensable, when either suffices, and when both are jointly needed for an optimal classifier trained on the dataset. Next, we leverage this decomposition to propose a novel measure of the spuriousness of a dataset. We arrive at this measure systematically by examining several candidate measures and demonstrating what they capture and miss through intuitive canonical examples and counterexamples. Our framework, Spurious Disentangler, consists of segmentation, dimensionality reduction, and estimation modules, with capabilities to specifically handle high-dimensional image data efficiently. Finally, we perform an empirical evaluation to demonstrate the trends of unique, redundant, and synergistic information, as well as our proposed spuriousness measure, across six benchmark datasets under various experimental settings. We observe an agreement between our preemptive measure of dataset spuriousness and post-training model generalization metrics such as worst-group accuracy, further supporting our proposition. The code is available at https://github.com/Barproda/spuriousness-disentangler.

[460] Provable Domain Adaptation for Offline Reinforcement Learning with Limited Samples

Weiqin Chen, Xinjie Zhang, Sandipan Mishra, Santiago Paternain

Main category: cs.LG

TL;DR: This paper proposes the first theoretical framework for optimal weight assignment between limited target datasets and large-but-biased source datasets in offline RL, establishing performance bounds and convergence guarantees.

Details

Motivation: Offline RL performance degrades with limited target dataset samples, which is common in real-world applications. Domain adaptation using auxiliary source datasets can help, but optimal trade-offs between target and source datasets with theoretical guarantees remain challenging.

Method: Proposes a theoretical framework that explores the impact of weights assigned to each dataset on offline RL performance, establishes performance bounds, proves existence of optimal weights, and provides algorithmic convergence guarantees.

Result: Established performance bounds and existence of optimal weights that can be computed in closed form under simplifying assumptions. Provided convergence guarantees to a neighborhood of the optimum, with results depending on source dataset quality and target dataset size.

Conclusion: The proposed framework provides the first theoretical exploration of dataset weighting in offline RL, with empirical validation on the Procgen benchmark confirming the theoretical contributions.

Abstract: Offline reinforcement learning (RL) learns effective policies from a static target dataset. Although state-of-the-art offline RL algorithms perform well, their performance depends on the size of the target dataset and degrades when only limited target samples are available, which is often the case in real-world applications. To address this issue, domain adaptation that leverages auxiliary samples from related source datasets (such as simulators) can be beneficial. However, establishing the optimal way to trade off the limited target dataset and the large-but-biased source dataset while ensuring provable theoretical guarantees remains an open challenge. To the best of our knowledge, this paper proposes the first framework that theoretically explores the impact of the weights assigned to each dataset on the performance of offline RL. In particular, we establish performance bounds and the existence of the optimal weight, which can be computed in closed form under simplifying assumptions. We also provide algorithmic guarantees in terms of convergence to a neighborhood of the optimum. Notably, these results depend on the quality of the source dataset and the number of samples in the target dataset. Our empirical results on the well-known offline Procgen benchmark substantiate the theoretical contributions in this work.
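The trade-off between a small unbiased dataset and a large biased one can be illustrated with a deliberately simple analogy. The sketch below is not the paper's offline RL result: it swaps in scalar mean estimation, where a weighted average of a noisy target mean and a biased source mean has a closed-form optimal weight. All function names and numbers are assumptions.

```python
def weighted_mse(w, sigma2, n_target, n_source, bias):
    """MSE of w * mean(target) + (1 - w) * mean(source) in the toy model.

    Assumes independent samples with variance sigma2, an unbiased target
    mean, and a source mean shifted by `bias`.
    """
    var_target = sigma2 / n_target
    mse_source = sigma2 / n_source + bias ** 2
    return w ** 2 * var_target + (1 - w) ** 2 * mse_source


def optimal_target_weight(sigma2, n_target, n_source, bias):
    """Closed-form minimizer of weighted_mse with respect to w."""
    var_target = sigma2 / n_target
    mse_source = sigma2 / n_source + bias ** 2
    return mse_source / (var_target + mse_source)


# Ten target samples versus ten thousand source samples.
w_small_bias = optimal_target_weight(1.0, 10, 10_000, bias=0.05)
w_large_bias = optimal_target_weight(1.0, 10, 10_000, bias=1.0)
```

In this toy model the best weight on the target data grows with the source bias and shrinks as the target sample gets larger relative to its noise, which mirrors the abstract's remark that the results depend on source quality and target sample count.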

[461] An Empirical Study on Improving SimCLR’s Nonlinear Projection Head using Pretrained Autoencoder Embeddings

Andreas Schliebitz, Heiko Tapken, Martin Atzmueller

Main category: cs.LG

TL;DR: Using pretrained autoencoder embeddings in SimCLR’s projection head improves classification accuracy by up to 2.9% while reducing projection space dimensionality, with sigmoid/tanh activations outperforming ReLU.

Details

Motivation: To improve the effectiveness of SimCLR's standard 2-layer MLP projection head by leveraging pretrained autoencoder embeddings for better contrastive learning performance.

Method: Train a shallow autoencoder, extract its encoder embeddings, freeze them, and use as input layer replacement in SimCLR’s projector. Also modify projector width and activation functions (sigmoid/tanh vs ReLU).

Result: Pretrained autoencoder embeddings increased classification accuracy by up to 2.9% (1.7% average) and significantly reduced projection space dimensionality. Sigmoid and tanh activations outperformed ReLU in peak and average accuracy.

Conclusion: Frozen pretrained autoencoder embeddings in SimCLR’s projection head enhance contrastive learning performance, with sigmoid/tanh activations being more effective than ReLU, achieving better accuracy with reduced dimensionality.

Abstract: This paper focuses on improving the effectiveness of the standard 2-layer MLP projection head featured in the SimCLR framework through the use of pretrained autoencoder embeddings. Given a contrastive learning task with a largely unlabeled image classification dataset, we first train a shallow autoencoder architecture and extract its compressed representations contained in the encoder’s embedding layer. After freezing the weights within this pretrained layer, we use it as a drop-in replacement for the input layer of SimCLR’s default projector. We also apply further architectural changes to the projector by decreasing its width and changing its activation function. The different projection heads are then used to contrastively train and evaluate a feature extractor following the SimCLR protocol. Our experiments indicate that using a pretrained autoencoder embedding in the projector can not only increase classification accuracy by up to 2.9%, or 1.7% on average, but can also significantly decrease the dimensionality of the projection space. Our results also suggest that using the sigmoid and tanh activation functions within the projector can outperform ReLU in terms of peak and average classification accuracy. All experiments involving our pretrained projectors are conducted with frozen embeddings, since our test results indicate an advantage compared to using their non-frozen counterparts.

[462] Self-Supervised Learning of Iterative Solvers for Constrained Optimization

Lukas Lüken, Sergio Lucia

Main category: cs.LG

TL;DR: A learning-based iterative solver for constrained optimization that combines neural network prediction with iterative refinement, achieving high accuracy and speed improvements over traditional solvers.

Details

Motivation: Need for real-time solution of parametric optimization problems in applications like model predictive control that require high accuracy under tight time constraints.

Method: Neural network predictor generates initial primal-dual solutions, followed by learned iterative solver refinement. Uses novel KKT-based loss function for self-supervised training without pre-sampled solutions, with convexification for nonconvex problems.

Result: Achieves speedups of up to one order of magnitude compared to IPOPT solver, with orders of magnitude higher accuracy than competing learning-based approaches in nonconvex case studies.

Conclusion: The proposed method provides an effective approach for real-time constrained optimization with theoretical guarantees and practical performance improvements.

Abstract: The real-time solution of parametric optimization problems is critical for applications that demand high accuracy under tight real-time constraints, such as model predictive control. To this end, this work presents a learning-based iterative solver for constrained optimization, comprising a neural network predictor that generates initial primal-dual solution estimates, followed by a learned iterative solver that refines these estimates to reach high accuracy. We introduce a novel loss function based on Karush-Kuhn-Tucker (KKT) optimality conditions, enabling fully self-supervised training without pre-sampled optimizer solutions. Theoretical guarantees ensure that the training loss function attains minima exclusively at KKT points. A convexification procedure enables application to nonconvex problems while preserving these guarantees. Experiments on two nonconvex case studies demonstrate speedups of up to one order of magnitude compared to state-of-the-art solvers such as IPOPT, while achieving orders of magnitude higher accuracy than competing learning-based approaches.
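As a rough illustration of how KKT conditions can serve as a label-free training signal, the sketch below evaluates a generic squared KKT residual on a one-dimensional toy problem. It is not the loss proposed in the paper, whose exact form is not given here. The problem, the function name, and the way the terms are combined are all assumptions.

```python
def kkt_residual(x, lam):
    """Squared KKT residual for: minimize (x - 2)^2 subject to x - 1 <= 0.

    `x` is the primal estimate and `lam` the multiplier estimate.
    """
    grad_f = 2.0 * (x - 2.0)
    g = x - 1.0
    grad_g = 1.0
    stationarity = grad_f + lam * grad_g  # gradient of the Lagrangian
    primal_violation = max(g, 0.0)        # constraint g(x) <= 0
    dual_violation = max(-lam, 0.0)       # multiplier must be nonnegative
    complementarity = lam * g             # lam * g(x) must vanish
    return (stationarity ** 2 + primal_violation ** 2
            + dual_violation ** 2 + complementarity ** 2)


residual_at_optimum = kkt_residual(1.0, 2.0)         # constrained optimum
residual_unconstrained_min = kkt_residual(2.0, 0.0)  # infeasible point
```

The residual is zero only at the primal-dual pair (x, λ) = (1, 2), so a predictor trained to drive it down needs no pre-solved examples. The unconstrained minimizer x = 2 is penalized through the primal-feasibility term.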

[463] Strada-LLM: Graph LLM for traffic prediction

Seyed Mohamad Moghadas, Bruno Cornelis, Alexandre Alahi, Adrian Munteanu

Main category: cs.LG

TL;DR: Strada-LLM is a novel multivariate probabilistic forecasting LLM that explicitly models temporal and spatial traffic patterns, outperforming existing LLM-based and GNN-based methods in traffic forecasting with improved accuracy and efficiency.

Details

Motivation: Address the limitations of existing LLM-based traffic forecasting methods that struggle to capture complex graph relationships and spatiotemporal dependencies, particularly in heterogeneous traffic conditions across diverse locations.

Method: Introduces Strada-LLM with explicit modeling of temporal and spatial traffic patterns, incorporates proximal traffic information as covariates, and uses a lightweight distribution-derived strategy for domain adaptation to handle new data distributions or network topologies.

Result: Empirical evaluations show Strada-LLM consistently surpasses state-of-the-art methods, improving long-term forecasting RMSE by 17% and efficiency by 16%, while maintaining robust performance across different LLM backbones.

Conclusion: Strada-LLM provides a versatile and powerful solution for real-world traffic prediction tasks, offering enhanced adaptability, interpretability, and performance in few-shot learning scenarios.

Abstract: Traffic forecasting is pivotal for intelligent transportation systems, where accurate and interpretable predictions can significantly enhance operational efficiency and safety. A key challenge stems from the heterogeneity of traffic conditions across diverse locations, leading to highly varied traffic data distributions. Large language models (LLMs) show exceptional promise for few-shot learning in such dynamic and data-sparse scenarios. However, existing LLM-based solutions often rely on prompt-tuning, which can struggle to fully capture complex graph relationships and spatiotemporal dependencies, thereby limiting adaptability and interpretability in real-world traffic networks. We address these gaps by introducing Strada-LLM, a novel multivariate probabilistic forecasting LLM that explicitly models both temporal and spatial traffic patterns. By incorporating proximal traffic information as covariates, Strada-LLM more effectively captures local variations and outperforms existing prompt-based LLMs. To further enhance adaptability, we propose a lightweight distribution-derived strategy for domain adaptation, enabling parameter-efficient model updates when encountering new data distributions or altered network topologies, even under few-shot constraints. Empirical evaluations on spatio-temporal transportation datasets demonstrate that Strada-LLM consistently surpasses state-of-the-art LLM-driven and traditional GNN-based predictors. Specifically, it improves long-term forecasting RMSE by 17% and efficiency by 16%. Moreover, it maintains robust performance across different LLM backbones with minimal degradation, making it a versatile and powerful solution for real-world traffic prediction tasks.

[464] Predictive Control and Regret Analysis of Non-Stationary MDP with Look-ahead Information

Ziyi Zhang, Yorie Nakahira, Guannan Qu

Main category: cs.LG

TL;DR: Proposes an algorithm for non-stationary MDPs that leverages look-ahead predictions to achieve low regret, with theoretical guarantees showing exponential regret reduction as prediction window expands.

Details

Motivation: Policy design in non-stationary MDPs is challenging due to time-varying transitions and rewards. Many practical applications like energy systems have available look-ahead predictions (e.g., renewable energy forecasts).

Method: Leverage look-ahead predictions and propose an algorithm that incorporates these predictions to achieve low regret in non-stationary MDPs.

Result: Theoretical analysis shows regret decreases exponentially as look-ahead window expands. With prediction errors, regret doesn’t explode even if error grows sub-exponentially with prediction horizon. Simulations confirm algorithm efficacy.

Conclusion: The proposed algorithm effectively uses look-ahead predictions to handle non-stationary MDPs, providing strong theoretical guarantees and practical performance in applications like energy systems.

Abstract: Policy design in non-stationary Markov Decision Processes (MDPs) is inherently challenging due to the complexities introduced by time-varying system transition and reward, which make it difficult for learners to determine the optimal actions for maximizing cumulative future rewards. Fortunately, in many practical applications, such as energy systems, look-ahead predictions are available, including forecasts for renewable energy generation and demand. In this paper, we leverage these look-ahead predictions and propose an algorithm designed to achieve low regret in non-stationary MDPs by incorporating such predictions. Our theoretical analysis demonstrates that, under certain assumptions, the regret decreases exponentially as the look-ahead window expands. When the system prediction is subject to error, the regret does not explode even if the prediction error grows sub-exponentially as a function of the prediction horizon. We validate our approach through simulations, confirming the efficacy of our algorithm in non-stationary environments.

[465] Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training

Rui Pan, Shivanshu Shekhar, Boyao Wang, Shizhe Diao, Jipeng Zhang, Xingyuan Pan, Renjie Pi, Tong Zhang

Main category: cs.LG

TL;DR: Adapt-Pruner is a layer-wise adaptive pruning method for LLMs that outperforms existing pruning techniques, achieves performance comparable to pre-training from scratch, and discovers efficient small models through incremental pruning.

Details

Motivation: Small language models are needed for edge devices, but conventional approaches either require expensive pre-training from scratch or suffer performance drops from compression/pruning of large models.

Method: Layer-wise adaptive pruning combined with training, using incremental pruning that removes small portions of neurons (~5%) at a time while interleaving with training.

Result: Adapt-Pruner outperforms LLM-Pruner, FLAP, and SliceGPT by 1%-7% on commonsense benchmarks, restores MobileLLM-125M performance to 600M level with 200× fewer tokens, and discovers a 1B model surpassing LLaMA-3.2-1B.

Conclusion: Adaptive pruning with training is highly effective for creating efficient small language models that match or exceed pre-trained models while reducing computational costs.

Abstract: Small language models (SLMs) have attracted considerable attention from both academia and industry due to their broad range of applications in edge devices. To obtain SLMs with strong performance, conventional approaches either pre-train the models from scratch, which incurs substantial computational costs, or compress/prune existing large language models (LLMs), which results in performance drops and falls short in comparison to pre-training. In this paper, we investigate the family of acceleration methods that involve both structured pruning and model training. We found that 1) layer-wise adaptive pruning (Adapt-Pruner) is extremely effective in LLMs and yields significant improvements over existing pruning techniques, 2) adaptive pruning equipped with further training leads to models comparable to those pre-trained from scratch, and 3) incremental pruning brings non-trivial performance gains by interleaving pruning with training and only removing a small portion of neurons (~5%) at a time. Experimental results on LLaMA-3.1-8B demonstrate that Adapt-Pruner outperforms conventional pruning methods, such as LLM-Pruner, FLAP, and SliceGPT, by an average of 1%-7% in accuracy on commonsense benchmarks. Additionally, Adapt-Pruner restores the performance of MobileLLM-125M to 600M on the MMLU benchmark with 200× fewer tokens via pruning from its larger counterparts, and discovers a new 1B model that surpasses LLaMA-3.2-1B in multiple benchmarks. The official code is released at https://github.com/research4pan/AdaptPruner.
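The incremental schedule mentioned above, pruning a small fraction and then training before pruning again, can be sketched as a simple loop. This is a hypothetical illustration rather than the Adapt-Pruner code. In particular, it assumes the roughly 5% is taken from the neurons remaining at each round, and it leaves out both the layer-wise importance scoring and the training step.

```python
def incremental_pruning_schedule(initial_width, target_width, step_fraction=0.05):
    """Widths after each round when about `step_fraction` of the remaining
    neurons is removed per round until `target_width` is reached."""
    if not 0 < step_fraction < 1:
        raise ValueError("step_fraction must be in (0, 1)")
    widths = [initial_width]
    while widths[-1] > target_width:
        current = widths[-1]
        removed = max(1, round(current * step_fraction))
        widths.append(max(target_width, current - removed))
        # A real pipeline would run a training phase here before the next round.
    return widths


# Hypothetical example: halve a 4096-neuron layer in small steps.
schedule = incremental_pruning_schedule(4096, 2048)
```

Each round removes only a small slice, so the interleaved training has a chance to recover accuracy before the next cut. A one-shot prune would instead jump straight from the first width to the last.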

[466] Evolutionary Retrofitting

Mathurin Videau, Mariia Zameshina, Alessandro Leite, Laurent Najman, Marc Schoenauer, Olivier Teytaud

Main category: cs.LG

TL;DR: AfterLearnER uses evolutionary optimization to refine trained ML models by optimizing selected parameters using non-differentiable error signals from validation data, requiring minimal feedback (dozens to hundreds of scalars).

Details

Motivation: To enable optimization of machine learning models using non-differentiable error signals (like human feedback, game scores, or quality metrics) that cannot be handled by gradient-based methods.

Method: Evolutionary optimization applied to refine fully trained models by optimizing carefully chosen parameters/hyperparameters using actual, exact, non-differentiable error signals from a validation subset.

Result: Successfully demonstrated across diverse domains: depth sensing (threshold criteria), speech re-synthesis (word error rate), Doom gameplay (kills per life), code translation (BLEU), 3D GANs (image quality), and LDMs (user feedback).

Conclusion: AfterLearnER offers versatile, gradient-free optimization with limited overfitting, theoretical support, anytime behavior, and requires only minimal feedback compared to existing methods.

Abstract: AfterLearnER (After Learning Evolutionary Retrofitting) consists of applying evolutionary optimization to refine fully trained machine learning models by optimizing a set of carefully chosen parameters or hyperparameters of the model with respect to some actual, exact, and hence possibly non-differentiable error signal, evaluated on a subset of the standard validation set. The efficiency of AfterLearnER is demonstrated by tackling non-differentiable signals such as threshold-based criteria in depth sensing, the word error rate in speech re-synthesis, the number of kills per life in Doom, computational accuracy or BLEU in code translation, image quality in 3D generative adversarial networks (GANs), and user feedback in image generation via Latent Diffusion Models (LDM). This retrofitting can be done after training, or dynamically at inference time by taking into account the user feedback. The advantages of AfterLearnER are its versatility, the possibility to use non-differentiable feedback, including human evaluations (i.e., no gradient is needed), the limited overfitting supported by a theoretical study, and its anytime behavior. Last but not least, AfterLearnER requires only a small amount of feedback, i.e., a few dozen to a few hundred scalars, compared to the tens of thousands needed in most related published works.
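A minimal sketch of this kind of retrofitting follows. It is not the AfterLearnER implementation and makes several assumptions: a basic (1+1) evolution strategy stands in for the evolutionary optimizer, the tuned parameter is a single decision threshold of an already trained classifier, and validation accuracy serves as the non-differentiable signal. The data and names are invented for illustration.

```python
import random


def one_plus_one_es(score, x0, sigma=0.1, budget=200, seed=0):
    """Maximize a black-box scalar `score` with a basic (1+1) evolution strategy."""
    rng = random.Random(seed)
    best_x, best_score = x0, score(x0)
    for _ in range(budget):
        candidate = best_x + sigma * rng.gauss(0.0, 1.0)
        candidate_score = score(candidate)
        # Accept ties as well, so the search can drift across flat regions.
        if candidate_score >= best_score:
            best_x, best_score = candidate, candidate_score
    return best_x, best_score


# Hypothetical validation subset: model confidences and true labels.
val_scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.90]
val_labels = [0, 0, 0, 1, 0, 1, 1, 1]


def accuracy(threshold):
    """Piecewise-constant in the threshold, so its gradient is zero almost everywhere."""
    preds = [1 if s >= threshold else 0 for s in val_scores]
    return sum(p == y for p, y in zip(preds, val_labels)) / len(val_labels)


initial_threshold = 0.1
tuned_threshold, tuned_accuracy = one_plus_one_es(accuracy, initial_threshold)
```

Each iteration consumes one scalar of feedback, so the budget of 200 evaluations is in the "few dozen to a few hundred scalars" range the abstract mentions. The loop can also be stopped at any time, and it returns the best threshold found so far.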

[467] FedALT: Federated Fine-Tuning through Adaptive Local Training with Rest-of-World LoRA

Jieming Bian, Lei Wang, Letian Zhang, Jie Xu

Main category: cs.LG

TL;DR: FedALT is a personalized federated LoRA fine-tuning method that addresses cross-client interference by using separate individual and Rest-of-World LoRA components with adaptive mixing, outperforming existing methods.

Details

Motivation: Existing federated LoRA fine-tuning methods based on FedAvg suffer from cross-client interference and suboptimal personalization due to data heterogeneity in federated settings.

Method: FedALT uses separate individual LoRA for each client and a shared Rest-of-World LoRA component, with an adaptive mixer that dynamically learns input-specific weightings between them, inspired by Mixture-of-Experts.

Result: Extensive experiments on NLP benchmarks show FedALT significantly outperforms state-of-the-art personalized federated LoRA fine-tuning methods.

Conclusion: FedALT achieves superior local adaptation without sacrificing computational efficiency by effectively balancing local adaptation and global information through adaptive mixing.

Abstract: Fine-tuning large language models (LLMs) in federated settings enables privacy-preserving adaptation but suffers from cross-client interference due to model aggregation. Existing federated LoRA fine-tuning methods, primarily based on FedAvg, struggle with data heterogeneity, leading to harmful cross-client interference and suboptimal personalization. In this work, we propose FedALT, a novel personalized federated LoRA fine-tuning algorithm that fundamentally departs from FedAvg. Instead of using an aggregated model to initialize local training, each client continues training its individual LoRA while incorporating shared knowledge through a separate Rest-of-World (RoW) LoRA component. To effectively balance local adaptation and global information, FedALT introduces an adaptive mixer that dynamically learns input-specific weightings between the individual and RoW LoRA components, drawing conceptual foundations from the Mixture-of-Experts (MoE) paradigm. Through extensive experiments on NLP benchmarks, we demonstrate that FedALT significantly outperforms state-of-the-art personalized federated LoRA fine-tuning methods, achieving superior local adaptation without sacrificing computational efficiency.
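One way to picture the adaptive mixer is as an input-dependent gate between two low-rank updates added to a frozen base layer. The sketch below is a hypothetical forward pass, not FedALT's actual architecture: the sigmoid gate, the linear gate weights, and the tiny matrices are all assumptions.

```python
from math import exp


def matvec(m, v):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_delta(a, b, x):
    """Low-rank update B (A x)."""
    return matvec(b, matvec(a, x))


def mixed_forward(x, w0, individual, rest_of_world, gate_weights):
    """Frozen base output plus a gated mix of two LoRA updates."""
    # Input-specific gate in (0, 1); a sigmoid of a linear score (assumption).
    gate = 1.0 / (1.0 + exp(-sum(g * xi for g, xi in zip(gate_weights, x))))
    base = matvec(w0, x)
    ind = lora_delta(*individual, x)
    row = lora_delta(*rest_of_world, x)
    out = [b + gate * i + (1.0 - gate) * r for b, i, r in zip(base, ind, row)]
    return out, gate


# Tiny rank-1 example with hypothetical weights, each adapter given as (A, B).
w0 = [[1.0, 0.0], [0.0, 1.0]]
individual = ([[1.0, 0.0]], [[1.0], [0.0]])
rest_of_world = ([[0.0, 1.0]], [[0.0], [1.0]])
out, gate = mixed_forward([2.0, 4.0], w0, individual, rest_of_world, [0.0, 0.0])
```

With zero gate weights the two adapters are mixed equally. A learned gate would shift the weight toward the individual or the Rest-of-World adapter depending on the input.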

[468] SGLP: A Similarity Guided Fast Layer Partition Pruning for Compressing Large Deep Models

Yuqi Li, Yao Lu, Junhao Dong, Zeyu Dong, Chuanguang Yang, Xin Yin, Yihao Chen, Jianping Gou, Yingli Tian, Tingwen Huang

Main category: cs.LG

TL;DR: SGLP Pruning is a novel layer pruning method that uses representation similarity and segmentation to preserve essential network characteristics while removing redundant layers for model compression.

Details

Motivation: Existing layer pruning methods overlook intrinsic layer connections and interdependencies in deep neural networks, leading to pruned models that don't effectively preserve pre-trained network characteristics.

Method: Uses Centered Kernel Alignment (CKA) to quantify layer similarity, applies Fisher Optimal Segmentation to partition network into coherent segments, and employs GradNorm for fine-tuning-free importance evaluation to remove redundant layers segment-wise.

Result: Outperforms state-of-the-art methods in accuracy and efficiency on both image classification and LLMs, achieving significant model compression with minimal performance degradation.

Conclusion: The approach is well-suited for deployment in resource-limited environments due to its ability to maintain model performance while achieving substantial compression.

Abstract: Layer pruning has emerged as a potent approach to remove redundant layers in the pre-trained network with the aim of reducing network size and improving computational efficiency. However, existing layer pruning methods mostly overlook the intrinsic connections and inter-dependencies between different layers within complicated deep neural networks. This oversight can result in pruned models that do not preserve the essential characteristics of the pre-trained network as effectively as desired. To address these limitations, we propose Similarity-Guided Layer Partition (SGLP) Pruning, a novel pruning framework that exploits representation similarity to guide efficient and informed layer removal for compressing large deep models. Our method begins by employing Centered Kernel Alignment (CKA) to quantify representational similarity between layers, uncovering structural patterns within the network. We then apply Fisher Optimal Segmentation on the similarity matrix to partition the network into semantically coherent layer segments. This segmentation allows pruning decisions to respect layer interdependencies and preserve essential knowledge. Within each segment, we introduce a fine-tuning-free importance evaluation using GradNorm, identifying and removing redundant layers in a targeted, segment-wise manner. Experimental results on both image classification tasks and large language models (LLMs) demonstrate that our proposed SGLP outperforms the state-of-the-art methods in accuracy and efficiency. Our approach achieves significant model compression with minimal performance degradation, making it well-suited for deployment in resource-limited environments.
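
As a rough illustration of the first step described above, the layer-similarity matrix can be sketched with linear CKA. This is a minimal sketch, not the authors' implementation: the abstract does not say which CKA kernel is used, so the linear variant is an assumption, and `linear_cka` and `cka_matrix` are hypothetical helper names. The segmentation and GradNorm steps are not shown.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices.

    X has shape (n_samples, d1) and Y has shape (n_samples, d2); rows must
    correspond to the same inputs. Assumption: linear kernel.
    """
    # Center each feature so the score ignores constant offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

def cka_matrix(layer_activations):
    """Symmetric matrix of pairwise linear CKA scores between layers."""
    n = len(layer_activations)
    S = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = linear_cka(layer_activations[i],
                                           layer_activations[j])
    return S
```

A matrix like `S` is the kind of input a segmentation step would partition into contiguous groups of similar layers.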

[469] High-Dimensional Linear Bandits under Stochastic Latent Heterogeneity

Elynn Chen, Xi Chen, Wenbo Jing, Xiao Liu

Main category: cs.LG

TL;DR: The paper addresses stochastic latent heterogeneity in online decision-making, proposing a latent heterogeneous bandit framework that models probabilistic subgroup membership and group-specific rewards. It reveals that randomness in group realizations creates irreducible classification uncertainty, making sub-linear regret against a strong oracle impossible.

Details

Motivation: Existing data-driven approaches fail when sources of variation are latent and stochastic, as in promotion targeting where individuals' responses vary with unobserved subgroups. The paper aims to model this latent heterogeneity explicitly.

Method: Proposes a latent heterogeneous bandit framework and a phased EM-greedy algorithm that jointly learns latent group probabilities and reward parameters in high dimensions, achieving optimal estimation and classification guarantees.

Result: Establishes matching upper and minimax lower bounds for both strong and regular regrets. Strong regret grows linearly due to irreducible classification uncertainty, while regular regret achieves minimax-optimal sublinear rate.

Conclusion: Reveals a fundamental stochastic barrier in online decision-making with latent subgroups, suggesting potential remedies through strategic interventions and mechanism-design-based elicitation of latent information.

Abstract: This paper addresses the critical challenge of stochastic latent heterogeneity in online decision-making, where individuals’ responses to actions vary not only with observable contexts but also with unobserved, randomly realized subgroups. Existing data-driven approaches largely capture observable heterogeneity through contextual features but fail when the sources of variation are latent and stochastic. We propose a latent heterogeneous bandit framework that explicitly models probabilistic subgroup membership and group-specific reward functions, using promotion targeting as a motivating example. Our phased EM-greedy algorithm jointly learns latent group probabilities and reward parameters in high dimensions, achieving optimal estimation and classification guarantees. Our analysis reveals a new phenomenon unique to decision-making with stochastic latent subgroups: randomness in group realizations creates irreducible classification uncertainty, making sub-linear regret against a fully informed strong oracle fundamentally impossible. We establish matching upper and minimax lower bounds for both the strong and regular regrets, corresponding, respectively, to oracles with and without access to realized group memberships. The strong regret necessarily grows linearly, while the regular regret achieves a minimax-optimal sublinear rate. These findings uncover a fundamental stochastic barrier in online decision-making and point to potential remedies through simple strategic interventions and mechanism-design-based elicitation of latent information.

[470] First-Order Error Matters: Accurate Compensation for Quantized Large Language Models

Xingyu Zheng, Haotong Qin, Yuye Li, Haoran Chu, Jiakai Wang, Jinyang Guo, Michele Magno, Xianglong Liu

Main category: cs.LG

TL;DR: FOEM is a novel post-training quantization method that incorporates first-order gradient terms to improve quantization error compensation, outperforming existing methods like GPTQ with minimal computational overhead.

Details

Motivation: Existing PTQ methods assume negligible first-order terms in quantization error modeling, but this assumption is flawed due to accumulated first-order deviations during progressive compensation processes.

Method: FOEM performs first-order Taylor expansion around pre-quantization weights to approximate gradients, using the difference between latent and full-precision weights and the Hessian matrix without explicit computation.

Result: In 3-bit weight-only quantization, FOEM reduces Llama3-8B perplexity by 17.3% and raises 5-shot MMLU accuracy from 53.8% (achieved by GPTAQ) to 56.1%. It also combines with advanced techniques such as SpinQuant to deliver further gains in the W4A4KV4 setting.

Conclusion: FOEM effectively addresses the limitations of existing PTQ methods by incorporating first-order gradient terms, delivering superior performance across various models and benchmarks while maintaining computational efficiency.

Abstract: Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by performing a first-order Taylor expansion around the pre-quantization weights. This yields an approximation based on the difference between latent and full-precision weights as well as the Hessian matrix. When substituted into the theoretical solution, the formulation eliminates the need to explicitly compute the Hessian, thereby avoiding the high computational cost and limited generalization of backpropagation-based gradient methods. This design introduces only minimal additional computational overhead. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 17.3% and increases the 5-shot MMLU accuracy from 53.8% achieved by GPTAQ to 56.1%. Moreover, FOEM can be seamlessly combined with advanced techniques such as SpinQuant, delivering additional gains under the challenging W4A4KV4 setting and further narrowing the performance gap with full-precision baselines, surpassing existing state-of-the-art methods.
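
The argument in the abstract can be written out as a short derivation. The notation here is an assumption introduced for illustration (it does not appear in the abstract): $w$ denotes the full-precision weights, $\tilde{w}$ the latent weights after earlier compensation steps, $\delta$ the perturbation applied in the current step, $g$ the gradient, and $H$ the Hessian.

```latex
% Second-order model of the loss change when the latent weights are perturbed by \delta
\Delta \mathcal{L}(\delta) \approx g(\tilde{w})^{\top}\delta
  + \tfrac{1}{2}\,\delta^{\top} H\,\delta

% Classical assumption: the first-order term is dropped, since g(w) \approx 0
% for a well-trained full-precision model.

% After progressive compensation, \tilde{w} \neq w. A first-order expansion of
% the gradient around the pre-quantization weights w gives
g(\tilde{w}) \approx g(w) + H\,(\tilde{w} - w) \approx H\,(\tilde{w} - w)
```

Under this reading, the first-order term depends only on the accumulated deviation $\tilde{w} - w$ and the Hessian, which is consistent with the abstract's statement that no backpropagation-based gradient is needed.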

[471] Training speedups via batching for geometric learning: an analysis of static and dynamic algorithms

Daniel T. Speckhard, Tim Bechtel, Sebastian Kehl, Jonathan Godwin, Claudia Draxl

Main category: cs.LG

TL;DR: Analysis of static vs dynamic batching algorithms for GNN training, showing up to 2.7x speedup but performance depends on data, model, batch size, hardware, and training steps.

Details

Motivation: GNNs process graphs in batches, but unlike traditional NNs, the effect of batching algorithms on GNN training time and performance hasn't been thoroughly explored despite GNNs' growing importance in materials science, chemistry, and social sciences.

Method: Analyzed static and dynamic batching algorithms for graph-based models using two datasets: QM9 dataset of small molecules and AFLOW materials database, testing different combinations of batch size, dataset, and model.

Result: Dynamic batching can provide up to 2.7x speedup over static batching, but the optimal algorithm depends on specific factors. Significant differences in model learning metrics were observed between algorithms for certain combinations of batch size, dataset, and model.

Conclusion: Batching algorithm choice significantly impacts GNN training efficiency and performance, with no single optimal algorithm - the best choice depends on data characteristics, model architecture, batch size, hardware, and training duration.

Abstract: Graph neural networks (GNNs) have shown promising results for several domains such as materials science, chemistry, and the social sciences. GNN models often contain millions of parameters, and like other neural network (NN) models, are often fed only a fraction of the graphs that make up the training dataset in batches to update model parameters. The effect of batching algorithms on training time and model performance has been thoroughly explored for NNs but not yet for GNNs. We analyze two different batching algorithms for graph-based models, namely static and dynamic batching, for two datasets: the QM9 dataset of small molecules and the AFLOW materials database. Our experiments show that changing the batching algorithm can provide up to a 2.7x speedup, but the fastest algorithm depends on the data, model, batch size, hardware, and number of training steps run. Experiments show that for a select number of combinations of batch size, dataset, and model, significant differences in model learning metrics are observed between static and dynamic batching algorithms.
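
The abstract does not define the two algorithms, so the following is only one common reading, offered as a sketch: static batching takes a fixed number of graphs per batch and pads to a worst-case size, while dynamic batching keeps adding graphs until a node budget would be exceeded. The function names and the node-only budget are assumptions; real implementations typically also budget edges and graph count.

```python
def static_batches(node_counts, graphs_per_batch):
    """Group graphs into batches with a fixed number of graphs each."""
    return [node_counts[i:i + graphs_per_batch]
            for i in range(0, len(node_counts), graphs_per_batch)]

def dynamic_batches(node_counts, node_budget):
    """Greedily fill each batch until the next graph would exceed the budget."""
    batches, current, used = [], [], 0
    for n in node_counts:
        if n > node_budget:
            raise ValueError("a single graph exceeds the node budget")
        if used + n > node_budget:
            batches.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        batches.append(current)
    return batches

def padding_waste(batches, padded_size):
    """Total number of padding nodes when every batch is padded to padded_size."""
    return sum(padded_size - sum(batch) for batch in batches)

# Node counts of six hypothetical graphs.
counts = [3, 4, 2, 9, 5, 8]
static = static_batches(counts, 2)     # [[3, 4], [2, 9], [5, 8]]
dynamic = dynamic_batches(counts, 13)  # [[3, 4, 2], [9], [5, 8]]
# Static must pad to the worst case (2 graphs x 9 nodes = 18 nodes).
print(padding_waste(static, 18), padding_waste(dynamic, 13))  # 23 8
```

In this toy case both schemes produce three batches, but the dynamic one wastes fewer padded nodes. Whether that translates into faster training depends on the factors the abstract lists.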

[472] RAG-Enhanced Collaborative LLM Agents for Drug Discovery

Namkyeong Lee, Edward De Brouwer, Ehsan Hajiramezanali, Tommaso Biancalani, Chanyoung Park, Gabriele Scalia

Main category: cs.LG

TL;DR: CLADD is a retrieval-augmented generation (RAG) system that uses multiple LLM agents to solve drug discovery tasks without domain-specific fine-tuning, overcoming challenges like data heterogeneity and ambiguity.

Details

Motivation: To address the limitations of costly domain-specific fine-tuning for LLMs in drug discovery, which hinders flexible application and rapid integration of continuously generated scientific data for complex, open-ended scientific questions.

Method: Proposes CLADD - a RAG-empowered agentic system with multiple LLM agents that dynamically retrieves information from biomedical knowledge bases, contextualizes query molecules, and integrates relevant evidence without domain-specific fine-tuning.

Result: CLADD outperforms general-purpose and domain-specific LLMs as well as traditional deep learning approaches across various drug discovery tasks, demonstrating flexibility and effectiveness.

Conclusion: The framework successfully tackles key obstacles in applying RAG workflows to biochemical data and provides a flexible, effective solution for drug discovery tasks without requiring domain-specific fine-tuning.

Abstract: Recent advances in large language models (LLMs) have shown great potential to accelerate drug discovery. However, the specialized nature of biochemical data often necessitates costly domain-specific fine-tuning, posing major challenges. First, it hinders the application of more flexible general-purpose LLMs for cutting-edge drug discovery tasks. More importantly, it limits the rapid integration of the vast amounts of scientific data continuously generated through experiments and research. Compounding these challenges is the fact that real-world scientific questions are typically complex and open-ended, requiring reasoning beyond pattern matching or static knowledge retrieval. To address these challenges, we propose CLADD, a retrieval-augmented generation (RAG)-empowered agentic system tailored to drug discovery tasks. Through the collaboration of multiple LLM agents, CLADD dynamically retrieves information from biomedical knowledge bases, contextualizes query molecules, and integrates relevant evidence to generate responses - all without the need for domain-specific fine-tuning. Crucially, we tackle key obstacles in applying RAG workflows to biochemical data, including data heterogeneity, ambiguity, and multi-source integration. We demonstrate the flexibility and effectiveness of this framework across a variety of drug discovery tasks, showing that it outperforms general-purpose and domain-specific LLMs as well as traditional deep learning approaches. Our code is publicly available at https://github.com/Genentech/CLADD.

[473] AMUN: Adversarial Machine UNlearning

Ali Ebrahimpour-Boroojeny, Hari Sundaram, Varun Chandrasekaran

Main category: cs.LG

TL;DR: AMUN is a machine unlearning method that uses adversarial examples to selectively lower model confidence on forget samples while preserving test accuracy, outperforming prior approaches.

Details

Motivation: Existing exact unlearning methods are computationally expensive, while approximate methods fail to match exact unlearning effectiveness in terms of accuracy and prediction confidence on both forget and test datasets.

Method: Fine-tunes the model on adversarial examples corresponding to forget samples, which localizes changes to decision boundaries around forget samples and avoids global model behavior changes.

Result: After AMUN unlearns a random 10% of CIFAR-10 samples, even state-of-the-art membership inference attacks perform no better than random guessing.

Conclusion: AMUN provides an effective and computationally efficient approach to machine unlearning that maintains model utility while ensuring privacy protection.

Abstract: Machine unlearning, where users can request the deletion of a forget dataset, is becoming increasingly important because of numerous privacy regulations. Initial works on “exact” unlearning (e.g., retraining) incur large computational overheads. However, while computationally inexpensive, “approximate” methods have fallen short of reaching the effectiveness of exact unlearning: models produced fail to obtain comparable accuracy and prediction confidence on both the forget and test (i.e., unseen) dataset. Exploiting this observation, we propose a new unlearning method, Adversarial Machine UNlearning (AMUN), that outperforms prior state-of-the-art (SOTA) methods for image classification. AMUN lowers the confidence of the model on the forget samples by fine-tuning the model on their corresponding adversarial examples. Adversarial examples naturally belong to the distribution imposed by the model on the input space; fine-tuning the model on the adversarial examples closest to the corresponding forget samples (a) localizes the changes to the decision boundary of the model around each forget sample and (b) avoids drastic changes to the global behavior of the model, thereby preserving the model’s accuracy on test samples. Using AMUN for unlearning a random $10\%$ of CIFAR-10 samples, we observe that even SOTA membership inference attacks cannot do better than random guessing.
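
To make "the adversarial example closest to a forget sample" concrete, here is a toy sketch for a linear binary classifier, where the closest label-flipping point has a closed form. This is an illustration under strong assumptions, not the paper's procedure: AMUN targets deep image classifiers, the abstract does not name the attack it uses, and `closest_adversarial` and `predict` are hypothetical helpers. The fine-tuning step is not shown.

```python
import numpy as np

def predict(x, w, b):
    """Label (0 or 1) assigned by the linear classifier sign(w.x + b)."""
    return int(w @ x + b > 0)

def closest_adversarial(x, w, b, overshoot=1e-3):
    """Smallest L2 perturbation of x that flips the linear classifier's label.

    The point is projected onto the hyperplane w.x + b = 0 and pushed
    slightly past it so that the predicted label actually changes.
    """
    margin = w @ x + b
    step = (1.0 + overshoot) * margin / (w @ w)
    return x - step * w

w = np.array([1.0, 2.0])
b = -1.0
x_forget = np.array([2.0, 1.0])           # hypothetical forget sample
x_adv = closest_adversarial(x_forget, w, b)
print(predict(x_forget, w, b), predict(x_adv, w, b))  # 1 0
```

Because `x_adv` sits just across the nearest piece of the decision boundary, fine-tuning on it would be expected to move the boundary only locally, which matches the intuition the abstract gives in points (a) and (b).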

[474] Mining-Gym: A Configurable RL Benchmarking Environment for Truck Dispatch Scheduling

Chayan Banerjee, Kien Nguyen, Clinton Fookes

Main category: cs.LG

TL;DR: Mining-Gym is an open-source benchmarking environment for evaluating RL algorithms in mining truck dispatch optimization, addressing the lack of standardized testing in dynamic mining operations.

Details

Motivation: The dynamic and stochastic nature of mining operations with uncertainties like equipment failures and variable cycle times challenges traditional optimization, while RL shows potential but lacks standardized benchmarking for fair comparison and real-world applicability.

Method: Built on Salabim-based Discrete Event Simulation integrated with Gymnasium, Mining-Gym uses an event-driven decision-point architecture to capture mining uncertainties, with GUI for configuration, data logging, and real-time visualization.

Result: Validation across six scenarios from normal operation to severe equipment failures shows Mining-Gym enables reproducible evaluation, demonstrating strong performance potential of RL-trained schedulers compared to classical heuristics.

Conclusion: Mining-Gym provides an effective, reproducible testbed for fair evaluation of adaptive decision-making in mining logistics, supporting the development and comparison of RL solutions for real-world mining optimization.

Abstract: Optimizing the mining process – particularly truck dispatch scheduling – is a key driver of efficiency in open-pit operations. However, the dynamic and stochastic nature of these environments, with uncertainties such as equipment failures, truck maintenance, and variable haul cycle times, challenges traditional optimization. While Reinforcement Learning (RL) shows strong potential for adaptive decision-making in mining logistics, practical deployment requires evaluation in realistic, customizable simulation environments. The lack of standardized benchmarking hampers fair algorithm comparison, reproducibility, and real-world applicability of RL solutions. To address this, we present Mining-Gym – a configurable, open-source benchmarking environment for training, testing, and evaluating RL algorithms in mining process optimization. Built on Salabim-based Discrete Event Simulation (DES) and integrated with Gymnasium, Mining-Gym captures mining-specific uncertainties through an event-driven decision-point architecture. It offers a GUI for parameter configuration, data logging, and real-time visualization, supporting reproducible evaluation of RL strategies and heuristic baselines. We validate Mining-Gym by comparing classical heuristics with RL-based scheduling across six scenarios from normal operation to severe equipment failures. Results show it is an effective, reproducible testbed, enabling fair evaluation of adaptive decision-making and demonstrating the strong performance potential of RL-trained schedulers.

[475] Gaussian Process Tilted Nonparametric Density Estimation using Fisher Divergence Score Matching

John Paisley, Wei Zhang, Brian Barr

Main category: cs.LG

TL;DR: Proposes GP-tilted nonparametric density estimator using Gaussian process refinement with three closed-form learning algorithms based on Fisher divergence score matching and random Fourier feature approximation.

Details

Motivation: To develop a nonparametric density estimator that avoids iterative learning algorithms and is suitable for big data by using closed-form solutions.

Method: Multiplies base multivariate normal distribution with exponentiated GP refinement, uses RFF approximation for linear GP representation, and applies three FD-based objectives including basic FD, noise conditional FD, and variational inference-based FD.

Result: Derived three closed-form learning algorithms that only require sufficient statistics from a single data pass, demonstrated on low-dimensional density estimation problems.

Conclusion: The approach provides well-defined densities with tractable VI approximation and is particularly suitable for big data due to closed-form nature and single-pass data processing.

Abstract: We propose a nonparametric density estimator based on the Gaussian process (GP) and derive three novel closed-form learning algorithms based on Fisher divergence (FD) score matching. The density estimator is formed by multiplying a base multivariate normal distribution with an exponentiated GP refinement, and so we refer to it as a GP-tilted nonparametric density. By representing the GP part of the score as a linear function using the random Fourier feature (RFF) approximation, we show that optimization can be solved in closed form for the three FD-based objectives considered. This includes the basic and noise conditional versions of the Fisher divergence, as well as an alternative to noise conditional FD models based on variational inference (VI) that we propose in this paper. For this novel learning approach, we propose an ELBO-like optimization to approximate the posterior distribution, with which we then derive a Fisher variational predictive distribution. The RFF representation of the GP, which is functionally equivalent to a single layer neural network score model with cosine activation, provides a useful linear representation of the GP for which all expectations can be solved. The Gaussian base distribution also helps with tractability of the VI approximation and ensures that our proposed density is well-defined. We demonstrate our three learning algorithms, as well as a MAP baseline algorithm, on several low-dimensional density estimation problems. The closed-form nature of the learning problem removes the reliance on iterative learning algorithms, making this technique particularly well-suited to big data sets, since only sufficient statistics collected from a single pass through the data are needed.
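
The RFF approximation mentioned above can be sketched in a few lines. This is the standard random Fourier feature construction for an RBF kernel, shown only to illustrate why the GP becomes linear in its weights; the kernel choice, the lengthscale, and the helper names are assumptions, and the Fisher divergence objectives are not implemented here.

```python
import numpy as np

def rff_features(X, n_features, lengthscale, rng):
    """Random Fourier features z(x) such that z(x).z(y) approximates an RBF kernel.

    Each feature is a cosine of a random linear projection, which is the
    "single layer network with cosine activation" view of the GP.
    """
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def rbf_kernel(X, Y, lengthscale):
    """Exact RBF kernel exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
Z = rff_features(X, n_features=20000, lengthscale=1.0, rng=rng)
# Largest entrywise gap between the approximate and exact kernel matrices.
max_err = np.abs(Z @ Z.T - rbf_kernel(X, X, 1.0)).max()
```

With features `Z`, a GP draw is approximated by `Z @ theta` for a weight vector `theta`, so objectives that are quadratic in the function become quadratic in `theta`, which is what makes closed-form solutions plausible.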

[476] Stochastic Variational Inference with Tuneable Stochastic Annealing

John Paisley, Ghazal Fazelnia, Brian Barr

Main category: cs.LG

TL;DR: SVI+ modifies stochastic variational inference by decoupling batch size from gradient noise variance, enabling better annealing to escape local optima while maintaining accurate gradient directions.

Details

Motivation: Standard SVI faces a trade-off: larger batch sizes give more Gaussian noise but smaller variance (less annealing), while smaller batch sizes have larger variance but less accurate gradients. The goal is to achieve both large variance for escaping local optima and accurate gradient directions.

Method: Propose SVI+ which sets an actual batch size (can be full dataset) and an effective batch size that controls noise variance. This approximates maximum entropy stochastic gradient at desired variance level, decoupling batch size from annealing properties.

Result: Theoretically motivated for conjugate exponential family models. Empirically demonstrated improved performance for probabilistic matrix factorization (PMF), Latent Dirichlet Allocation (LDA), and Gaussian mixture models (GMM).

Conclusion: SVI+ provides a tunable annealing approach for variational inference that overcomes the batch size trade-off in standard SVI, enabling better optimization by controlling gradient noise variance independently from data information.

Abstract: We exploit the observation that stochastic variational inference (SVI) is a form of annealing and present a modified SVI approach – applicable to both large and small datasets – that allows the amount of annealing done by SVI to be tuned. We are motivated by the fact that, in SVI, the larger the batch size the more approximately Gaussian is the noise of the gradient, but the smaller its variance, which reduces the amount of annealing done to escape bad local optimal solutions. We propose a simple method for achieving both goals of having larger variance noise to escape bad local optimal solutions and more data information to obtain more accurate gradient directions. The idea is to set an actual batch size, which may be the size of the data set, and an effective batch size that matches the increased variance of a smaller batch size. The result is an approximation to the maximum entropy stochastic gradient at a desired variance level. We theoretically motivate our “SVI+” approach for the conjugate exponential family model framework and illustrate its empirical performance for learning the probabilistic matrix factorization collaborative filter (PMF), the Latent Dirichlet Allocation topic model (LDA), and the Gaussian mixture model (GMM).
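
The decoupling of actual and effective batch size can be illustrated with a simple Gaussian noise-injection sketch. This is not the paper's derivation: it assumes the actual batch is an i.i.d. sample from a much larger dataset, so that a mean over B gradients has variance roughly sigma^2 / B per coordinate, and it adds Gaussian noise (the maximum entropy choice for a fixed variance) to reach the variance of a smaller batch. All function names are hypothetical.

```python
import numpy as np

def extra_noise_variance(per_example_var, actual_batch, effective_batch):
    """Variance to add so a mean over actual_batch behaves like effective_batch.

    Assumption: a mean over B i.i.d. gradients has variance per_example_var / B.
    """
    if effective_batch > actual_batch:
        raise ValueError("effective batch size cannot exceed the actual one")
    return per_example_var * (1.0 / effective_batch - 1.0 / actual_batch)

def annealed_gradient(per_example_grads, effective_batch, rng):
    """Mean gradient over the actual batch plus Gaussian annealing noise.

    per_example_grads has shape (actual_batch, n_params), actual_batch >= 2.
    """
    actual_batch = per_example_grads.shape[0]
    mean_grad = per_example_grads.mean(axis=0)
    # Per-coordinate variance of individual gradients (diagonal approximation).
    var = per_example_grads.var(axis=0, ddof=1)
    noise_var = extra_noise_variance(var, actual_batch, effective_batch)
    return mean_grad + rng.normal(size=mean_grad.shape) * np.sqrt(noise_var)
```

Setting `effective_batch` equal to the actual batch size adds no noise, while smaller values increase the annealing noise without discarding any data from the gradient direction.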

[477] Towards Effective Federated Graph Foundation Model via Mitigating Knowledge Entanglement

Yinlin Zhu, Xunkai Li, Jishuo Jia, Miao Hu, Di Wu, Meikang Qiu

Main category: cs.LG

TL;DR: FedGFM+ is a federated graph foundation model framework that addresses knowledge entanglement through domain-aware initialization and adaptive prompt pooling to improve cross-domain generalization.

Details

Motivation: To combine the benefits of federated graph learning (multi-client collaboration) and graph foundation models (strong generalization) while overcoming their respective limitations of data heterogeneity and single-machine training constraints.

Method: Two core modules: (1) AncDAI - global anchor-based domain-aware initialization using domain-specific prototypes, (2) AdaDPP - local adaptive domain-sensitive prompt pool for downstream adaptation.

Result: Outperforms 20 baselines across 8 diverse benchmarks spanning multiple domains and tasks, demonstrating superior performance in federated graph learning scenarios.

Conclusion: FedGFM+ successfully integrates federated learning with graph foundation models, effectively reducing knowledge entanglement and improving cross-domain generalization through its novel domain-aware initialization and prompt pooling approach.

Abstract: Recent advances in graph machine learning have shifted to data-centric paradigms, driven by two emerging fields: (1) Federated graph learning (FGL) enables multi-client collaboration but faces challenges from data and task heterogeneity, limiting its practicality; (2) Graph foundation models (GFM) offer strong domain generalization but are usually trained on single machines, missing out on cross-silo data and resources. These paradigms are complementary, and their integration brings notable benefits. Motivated by this, we propose FedGFM, a novel decentralized GFM training paradigm. However, a key challenge is knowledge entanglement, where multi-domain knowledge merges into indistinguishable representations, hindering downstream adaptation. To address this, we present FedGFM+, an enhanced framework with two core modules to reduce knowledge entanglement: (1) AncDAI: A global anchor-based domain-aware initialization strategy. Before pre-training, each client encodes its local graph into domain-specific prototypes that serve as semantic anchors. Synthetic embeddings around these anchors initialize the global model. We theoretically prove these prototypes are distinguishable across domains, providing a strong inductive bias to disentangle domain-specific knowledge. (2) AdaDPP: A local adaptive domain-sensitive prompt pool. Each client learns a lightweight graph prompt capturing domain semantics during pre-training. During fine-tuning, prompts from all clients form a pool from which the GFM selects relevant prompts to augment target graph attributes, improving downstream adaptation. FedGFM+ is evaluated on 8 diverse benchmarks across multiple domains and tasks, outperforming 20 baselines from supervised learning, FGL, and federated GFM variants.

[478] Beyond $\tilde{O}(\sqrt{T})$ Constraint Violation for Online Convex Optimization with Adversarial Constraints

Abhishek Sinha, Rahul Vaze

Main category: cs.LG

TL;DR: This paper improves Online Convex Optimization with adversarial constraints by trading off regret for substantially smaller cumulative constraint violation (CCV), achieving Õ(√(dT) + T^β) regret and Õ(dT^{1-β}) CCV with tunable parameter β.

Details

Motivation: In safety-critical applications where satisfying constraints is non-negotiable, existing policies achieve O(√T) regret and Õ(√T) CCV, but this work aims to achieve much smaller constraint violations by trading off some regret.

Method: The authors first solve a special case called Constrained Expert problem using adaptive small-loss regret bounds, then reduce the general problem via covering arguments. With additional smoothness assumptions, they propose a computationally efficient first-order policy.

Result: The proposed policies achieve Õ(√(dT) + T^β) regret and Õ(dT^{1-β}) CCV for general convex functions, and O(√(T ln N) + T^β) regret and Õ(T^{1-β} ln N) CCV for the Constrained Expert problem with N experts.

Conclusion: This work provides a flexible trade-off between regret and constraint violation in constrained online convex optimization, enabling substantially smaller constraint violations at the cost of slightly increased regret, which is crucial for safety-critical applications.

Abstract: We study Online Convex Optimization with adversarial constraints (COCO). At each round a learner selects an action from a convex decision set and then an adversary reveals a convex cost and a convex constraint function. The goal of the learner is to select a sequence of actions to minimize both regret and the cumulative constraint violation (CCV) over a horizon of length $T$. The best-known policy for this problem achieves $O(\sqrt{T})$ regret and $\tilde{O}(\sqrt{T})$ CCV. In this paper, we improve this by trading off regret to achieve substantially smaller CCV. This trade-off is especially important in safety-critical applications, where satisfying the safety constraints is non-negotiable. Specifically, for any bounded convex cost and constraint functions, we propose an online policy that achieves $\tilde{O}(\sqrt{dT}+ T^β)$ regret and $\tilde{O}(dT^{1-β})$ CCV, where $d$ is the dimension of the decision set and $β\in [0,1]$ is a tunable parameter. We begin with a special case, called the $\textsf{Constrained Expert}$ problem, where the decision set is a probability simplex and the cost and constraint functions are linear. Leveraging a new adaptive small-loss regret bound, we propose a computationally efficient policy for the $\textsf{Constrained Expert}$ problem that attains $O(\sqrt{T\ln N}+T^β)$ regret and $\tilde{O}(T^{1-β} \ln N)$ CCV for $N$ experts. The original problem is then reduced to the $\textsf{Constrained Expert}$ problem via a covering argument. Finally, with an additional $M$-smoothness assumption, we propose a computationally efficient first-order policy attaining $O(\sqrt{MT}+T^β)$ regret and $\tilde{O}(MT^{1-β})$ CCV.
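
The trade-off controlled by β is easy to read off from the stated bounds. The small sketch below computes the exponents of $T$ only, ignoring logarithmic factors and the dependence on $d$, $N$, and $M$; `tradeoff_exponents` is a hypothetical helper, not something from the paper.

```python
def tradeoff_exponents(beta):
    """Exponents of T in the stated regret and CCV bounds for a given beta.

    Regret scales as sqrt(T) + T^beta, so its exponent is max(1/2, beta).
    CCV scales as T^(1 - beta).
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return max(0.5, beta), 1.0 - beta

for beta in (0.5, 0.75, 1.0):
    regret_exp, ccv_exp = tradeoff_exponents(beta)
    print(f"beta={beta}: regret ~ T^{regret_exp}, CCV ~ T^{ccv_exp}")
```

At β = 1/2 the exponents match the √T rates of the best-known policy. Raising β toward 1 shrinks the CCV exponent toward 0 at the cost of a larger regret exponent. Values below 1/2 leave the regret exponent at 1/2 while loosening the CCV bound, so the useful range of the parameter is effectively [1/2, 1].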

[479] Advanced Long-term Earth System Forecasting

Hao Wu, Yuan Gao, Ruijian Gou, Xian Wu, Chuhan Wu, Huahui Yi, Johannes Brandstetter, Fan Xu, Kun Wang, Penghao Zhao, Hao Jia, Qi Song, Xinliang Liu, Juncai He, Shuhao Cao, Huanshuo Dong, Yanfei Xiang, Fan Zhang, Haixin Wang, Xingjian Shi, Qiufeng Wang, Shuaipeng Li, Ruobing Xie, Feng Tao, Yuxu Lu, Yu Guo, Yuntian Chen, Yuxuan Liang, Qingsong Wen, Wanli Ouyang, Deliang Chen, Xiaomeng Huang

Main category: cs.LG

TL;DR: TritonCast is a novel AI architecture that addresses spectral bias in long-term Earth system forecasting by using a nested grid approach with a stable latent dynamical core for macro-evolution and outer structure for fine details, achieving state-of-the-art performance in weather and ocean forecasting.

Details

Motivation: Current AI models for Earth system forecasting suffer from instabilities during extended autoregressive simulations due to spectral bias, which leads to inadequate representation of high-frequency processes and uncontrolled error amplification.

Method: Uses a nested grid design inspired by numerical models, featuring a dedicated latent dynamical core for stable macro-evolution at coarse scale, and an outer structure that fuses this stable trend with fine-grained local details to mitigate spectral bias from cross-scale interactions.

Result: Achieves state-of-the-art accuracy on WeatherBench 2, executes year-long autoregressive global forecasts and multi-year climate simulations spanning 2500 days without drift, extends skillful ocean eddy forecasts to 120 days, and demonstrates unprecedented zero-shot cross-resolution generalization.

Conclusion: TritonCast offers a promising pathway towards trustworthy AI-driven simulations that can accelerate discovery in climate and Earth system science, enabling more reliable long-term forecasting and deeper insights into complex geophysical dynamics.

Abstract: Reliable long-term forecasting of Earth system dynamics is fundamentally limited by instabilities in current artificial intelligence (AI) models during extended autoregressive simulations. These failures often originate from inherent spectral bias, leading to inadequate representation of critical high-frequency, small-scale processes and subsequent uncontrolled error amplification. Inspired by the nested grids in numerical models used to resolve small scales, we present TritonCast. At the core of its design is a dedicated latent dynamical core, which ensures the long-term stability of the macro-evolution at a coarse scale. An outer structure then fuses this stable trend with fine-grained local details. This design effectively mitigates the spectral bias caused by cross-scale interactions. In atmospheric science, it achieves state-of-the-art accuracy on the WeatherBench 2 benchmark while demonstrating exceptional long-term stability: executing year-long autoregressive global forecasts and completing multi-year climate simulations that span the entire available $2500$-day test period without drift. In oceanography, it extends skillful eddy forecast to $120$ days and exhibits unprecedented zero-shot cross-resolution generalization. Ablation studies reveal that this performance stems from the synergistic interplay of the architecture’s core components. TritonCast thus offers a promising pathway towards a new generation of trustworthy, AI-driven simulations. This significant advance has the potential to accelerate discovery in climate and Earth system science, enabling more reliable long-term forecasting and deeper insights into complex geophysical dynamics.

[480] Sparse Tuning Enhances Plasticity in PTM-based Continual Learning

Huan Zhang, Shenghua Fan, Shuyu Dong, Yujin Zheng, Dingwen Wang, Fan Lyu

Main category: cs.LG

TL;DR: MIST is a plug-and-play method that selectively updates <5% of pre-trained model parameters using mutual information guidance and strong sparsity regularization to enable effective continual learning while preserving generalization.

Details

Motivation: Existing continual learning approaches either freeze PTMs (limiting plasticity) or use full fine-tuning (risking pre-trained knowledge disruption), leading to suboptimal generalization with distribution shifts.

Method: Mutual Information-guided Sparse Tuning (MIST) that selectively updates <5% of PTM parameters based on sensitivity to mutual information objectives, with additional strong sparsity regularization via random gradient dropping (<0.5% parameters updated per step).

Result: MIST consistently boosts performance across diverse continual learning benchmarks when integrated into multiple baselines, achieving significant performance gains.

Conclusion: MIST provides an effective plug-and-play solution for continual learning that balances adaptation and preservation of pre-trained knowledge through selective sparse parameter updates.

Abstract: Continual Learning with Pre-trained Models holds great promise for efficient adaptation across sequential tasks. However, most existing approaches freeze PTMs and rely on auxiliary modules like prompts or adapters, limiting model plasticity and leading to suboptimal generalization when facing significant distribution shifts. While full fine-tuning can improve adaptability, it risks disrupting crucial pre-trained knowledge. In this paper, we propose Mutual Information-guided Sparse Tuning (MIST), a plug-and-play method that selectively updates a small subset of PTM parameters, less than 5%, based on sensitivity to mutual information objectives. MIST enables effective task-specific adaptation while preserving generalization. To further reduce interference, we introduce strong sparsity regularization by randomly dropping gradients during tuning, resulting in fewer than 0.5% of parameters being updated per step. Applied before standard freeze-based methods, MIST consistently boosts performance across diverse continual learning benchmarks. Experiments show that integrating our method into multiple baselines yields significant performance gains. Our code is available at https://github.com/zhwhu/MIST.
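
The two levels of sparsity described above (a fixed subset of at most 5% of parameters, then random gradient dropping so that roughly 0.5% are touched per step) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `sensitivity` scores are assumed to be given (the paper derives them from mutual information objectives), and `select_trainable`, `sparse_step`, `keep_fraction`, and `drop_prob` are hypothetical names and defaults.

```python
import random


def select_trainable(sensitivity, keep_fraction=0.05):
    """Return the indices of the top `keep_fraction` most sensitive parameters.

    `sensitivity` is an assumed, precomputed score per parameter.
    """
    k = max(1, int(len(sensitivity) * keep_fraction))
    ranked = sorted(range(len(sensitivity)),
                    key=lambda i: sensitivity[i], reverse=True)
    return set(ranked[:k])


def sparse_step(params, grads, trainable, lr=0.1, drop_prob=0.9, rng=random):
    """One SGD step that only updates trainable coordinates whose
    gradient survives random dropping."""
    updated = list(params)
    touched = []
    for i in sorted(trainable):
        if rng.random() < drop_prob:
            continue  # gradient for this coordinate is dropped this step
        updated[i] -= lr * grads[i]
        touched.append(i)
    return updated, touched
```

With `keep_fraction=0.05` and `drop_prob=0.9`, the expected share of parameters updated in a step is about 0.5%, which matches the order of magnitude quoted in the abstract.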

[481] FNOPE: Simulation-based inference on function spaces with Fourier Neural Operators

Guy Moss, Leah Sophie Muhle, Reinhard Drews, Jakob H. Macke, Cornelius Schröder

Main category: cs.LG

TL;DR: FNOPE enables efficient Bayesian inference of function-valued parameters using Fourier Neural Operators with flow matching, outperforming state-of-the-art methods with significantly lower simulation budgets.

Details

Motivation: Current SBI methods struggle with function-valued parameters common in spatiotemporal modeling (e.g., climate science), limiting their applicability to high-dimensional scientific domains.

Method: Proposes FNOPE - uses Fourier Neural Operator architecture with flow matching objective to perform posterior estimation of function-valued parameters, supporting arbitrary domain discretizations and simultaneous vector-valued parameter inference.

Result: FNOPE achieves inference of function-valued parameters at a fraction of the simulation budget compared to state-of-the-art methods, demonstrated on benchmark tasks and a challenging glaciology spatial inference problem.

Conclusion: FNOPE extends SBI applicability to scientific domains requiring function-valued parameter inference, enabling efficient Bayesian inference for spatiotemporal processes.

Abstract: Simulation-based inference (SBI) is an established approach for performing Bayesian inference on scientific simulators. SBI so far works best on low-dimensional parametric models. However, it is difficult to infer function-valued parameters, which frequently occur in disciplines that model spatiotemporal processes such as the climate and earth sciences. Here, we introduce an approach for efficient posterior estimation, using a Fourier Neural Operator (FNO) architecture with a flow matching objective. We show that our approach, FNOPE, can perform inference of function-valued parameters at a fraction of the simulation budget of state-of-the-art methods. In addition, FNOPE supports posterior evaluation at arbitrary discretizations of the domain, as well as simultaneous estimation of vector-valued parameters. We demonstrate the effectiveness of our approach on several benchmark tasks and a challenging spatial inference task from glaciology. FNOPE extends the applicability of SBI methods to new scientific domains by enabling the inference of function-valued parameters.

[482] Optimization-Induced Dynamics of Lipschitz Continuity in Neural Networks

Róisín Luo, James McDermott, Christian Gagné, Qiang Sun, Colm O’Riordan

Main category: cs.LG

TL;DR: This paper develops a mathematical framework using stochastic differential equations to model how Lipschitz continuity evolves during neural network training with SGD, identifying three key factors driving this evolution.

Details

Motivation: Lipschitz continuity characterizes neural network sensitivity to input perturbations, but its temporal dynamics during training remain poorly understood.

Method: Theoretical framework using stochastic differential equations (SDEs) to model Lipschitz evolution, analyzing projections of gradient flows and noise onto operator-norm Jacobian and Hessian matrices.

Result: Identified three principal factors driving Lipschitz evolution: gradient flow projections, gradient noise projections on Jacobian, and gradient noise projections on Hessian. Experimental results show strong agreement with theoretical predictions.

Conclusion: The framework successfully models Lipschitz continuity evolution during training and reveals how factors like noisy supervision, initialization, batch size, and sampling trajectories shape this evolution.

Abstract: Lipschitz continuity characterizes the worst-case sensitivity of neural networks to small input perturbations; yet its dynamics (i.e. temporal evolution) during training remains under-explored. We present a rigorous mathematical framework to model the temporal evolution of Lipschitz continuity during training with stochastic gradient descent (SGD). This framework leverages a system of stochastic differential equations (SDEs) to capture both deterministic and stochastic forces. Our theoretical analysis identifies three principal factors driving the evolution: (i) the projection of gradient flows, induced by the optimization dynamics, onto the operator-norm Jacobian of parameter matrices; (ii) the projection of gradient noise, arising from the randomness in mini-batch sampling, onto the operator-norm Jacobian; and (iii) the projection of the gradient noise onto the operator-norm Hessian of parameter matrices. Furthermore, our theoretical framework sheds light on how noisy supervision, parameter initialization, batch size, and mini-batch sampling trajectories, among other factors, shape the evolution of the Lipschitz continuity of neural networks. Our experimental results demonstrate strong agreement between the theoretical implications and the observed behaviors.

[483] Cautious Optimism: A Meta-Algorithm for Near-Constant Regret in General Games

Ashkan Soleymani, Georgios Piliouras, Gabriele Farina

Main category: cs.LG

TL;DR: Cautious Optimism is a framework that accelerates no-regret learning in games by adaptively controlling learning pace, achieving near-optimal O(log T) regret in self-play while maintaining O(√T) regret in adversarial settings.

Details

Motivation: To substantially accelerate regularized learning in general games while preserving uncoupledness (learners don't need to know others' utilities) and improving upon prior works that relied on monotonic step sizes.

Method: A variant of Optimism that takes any FTRL instance and outputs an accelerated algorithm (COFTRL) by adaptively pacing the underlying FTRL with minimal computational overhead.

Result: COFTRL achieves near-optimal O(log T) regret in self-play while preserving optimal O(√T) regret in adversarial scenarios. It also achieves new state-of-the-art regret guarantees in general convex games, exponentially improving dependence on action space dimension d.

Conclusion: Cautious Optimism provides a novel route for fast learning in general games without relying on monotonic step sizes, offering significant improvements in regret minimization while maintaining computational efficiency and uncoupledness.

Abstract: We introduce Cautious Optimism, a framework for substantially faster regularized learning in general games. Cautious Optimism, as a variant of Optimism, adaptively controls the learning pace in a dynamic, non-monotone manner to accelerate no-regret learning dynamics. Cautious Optimism takes as input any instance of Follow-the-Regularized-Leader (FTRL) and outputs an accelerated no-regret learning algorithm (COFTRL) by pacing the underlying FTRL with minimal computational overhead. Importantly, it retains uncoupledness, that is, learners do not need to know other players’ utilities. Cautious Optimistic FTRL (COFTRL) achieves near-optimal $O_T(\log T)$ regret in diverse self-play (mixing and matching regularizers) while preserving the optimal $O_T(\sqrt{T})$ regret in adversarial scenarios. In contrast to prior works (e.g., Syrgkanis et al. [2015], Daskalakis et al. [2021]), our analysis does not rely on monotonic step sizes, showcasing a novel route for fast learning in general games. Moreover, instances of COFTRL achieve new state-of-the-art regret minimization guarantees in general convex games, exponentially improving the dependence on the dimension of the action space $d$ over previous works [Farina et al., 2022a].

[484] Orthogonal Soft Pruning for Efficient Class Unlearning

Qinghui Gong, Xue Yang, Xiaohu Tang

Main category: cs.LG

TL;DR: FedOrtho is a federated unlearning framework that uses orthogonalized kernels and one-shot pruning to efficiently remove specific data while maintaining model performance, achieving over 98% forgetting quality with minimal computational cost.

Details

Motivation: Address the challenge of efficient and controllable data unlearning in federated learning, particularly under non-IID settings where deep feature entanglement makes it difficult to balance forgetting and retention performance.

Method: Combines orthogonalized deep convolutional kernels with activation-driven controllable one-shot soft pruning (OSP), enforcing kernel orthogonality and local-global alignment to decouple feature representations and mitigate client drift.

Result: Achieves SOTA performance on CIFAR-10, CIFAR-100 and TinyImageNet with ResNet and VGG frameworks, supporting class-, client-, and sample-level unlearning with over 98% forgetting quality. Reduces computational and communication costs by 2-3 orders of magnitude and achieves subsecond-level erasure while maintaining over 97% retention accuracy.

Conclusion: FedOrtho effectively addresses the federated unlearning challenge by providing precise, efficient data removal while preserving model performance and mitigating membership inference risks.

Abstract: Efficient and controllable data unlearning in federated learning remains challenging due to the trade-off between forgetting and retention performance, especially under non-independent and identically distributed (non-IID) settings, where deep feature entanglement exacerbates this dilemma. To address this challenge, we propose FedOrtho, a federated unlearning framework that combines orthogonalized deep convolutional kernels with an activation-driven controllable one-shot soft pruning (OSP) mechanism. FedOrtho enforces kernel orthogonality and local-global alignment to decouple feature representations and mitigate client drift. This structural independence enables precise one-shot pruning of forgetting-related kernels while preserving retained knowledge. FedOrtho achieves SOTA performance on CIFAR-10, CIFAR-100 and TinyImageNet with ResNet and VGG frameworks, verifying that FedOrtho supports class-, client-, and sample-level unlearning with over 98% forgetting quality. It reduces computational and communication costs by 2-3 orders of magnitude in federated settings and achieves subsecond-level erasure in centralized scenarios while maintaining over 97% retention accuracy and mitigating membership inference risks.

[485] Flow-Attentional Graph Neural Networks

Pascal Plettenberg, Dominik Köhler, Bernhard Sick, Josephine M. Thomas

Main category: cs.LG

TL;DR: Proposes flow attention for GNNs that enforces Kirchhoff’s first law to handle physical flow conservation in graphs, improving performance on flow-related datasets.

Details

Motivation: Standard GNNs ignore conservation laws in physical flow graphs (like power grids or circuits), which reduces performance on such datasets.

Method: Modified graph attention mechanism to satisfy Kirchhoff’s first law, ensuring flow conservation in the attention computations.

Result: Flow attention outperforms standard attention on electronic circuits and power grid datasets for both classification and regression tasks.

Conclusion: Incorporating physical conservation laws into GNN attention mechanisms improves performance on flow-related graph learning tasks.

Abstract: Graph Neural Networks (GNNs) have become essential for learning from graph-structured data. However, existing GNNs do not consider the conservation law inherent in graphs associated with a flow of physical resources, such as electrical current in power grids or traffic in transportation networks, which can lead to reduced model performance. To address this, we propose flow attention, which adapts existing graph attention mechanisms to satisfy Kirchhoff’s first law. Furthermore, we discuss how this modification influences the expressivity and identify sets of non-isomorphic graphs that can be discriminated by flow attention but not by standard attention. Through extensive experiments on two flow graph datasets (electronic circuits and power grids) we demonstrate that flow attention enhances the performance of attention-based GNNs on both graph-level classification and regression tasks.
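
One way to read the conservation constraint is that whatever a node sends out is split across its outgoing edges in shares that sum to one, whereas standard graph attention normalizes over each node's incoming edges. The sketch below contrasts the two normalizations. It assumes this is how the mechanism works, since the abstract does not spell it out, and `edge_softmax` with its `group_by` argument is a hypothetical helper.

```python
import math
from collections import defaultdict


def edge_softmax(edges, scores, group_by):
    """Softmax over edge scores, normalized within groups of edges.

    edges    : list of (source, target) pairs
    scores   : one raw attention score per edge
    group_by : 1 normalizes over each target's incoming edges
               (standard attention); 0 normalizes over each source's
               outgoing edges (assumed flow-style normalization).
    """
    groups = defaultdict(list)
    for idx, edge in enumerate(edges):
        groups[edge[group_by]].append(idx)

    weights = [0.0] * len(edges)
    for members in groups.values():
        peak = max(scores[i] for i in members)  # subtract max for stability
        exps = {i: math.exp(scores[i] - peak) for i in members}
        total = sum(exps.values())
        for i in members:
            weights[i] = exps[i] / total
    return weights
```

With `group_by=0`, the weights on the edges leaving any node sum to one, so a node's outgoing messages are a partition of a single unit of "flow" rather than independent copies.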

[486] On the Necessity of Output Distribution Reweighting for Effective Class Unlearning

Ali Ebrahimpour-Boroojeny, Yian Wang, Hari Sundaram

Main category: cs.LG

TL;DR: The paper reveals that class unlearning evaluations overlook class geometry, leading to privacy leakage via membership-inference attacks using nearest neighbors (MIA-NN). It proposes Tilted ReWeighting (TRW) to mitigate this by approximating the distribution of retrained models for forget-class inputs.

Details

Motivation: Existing class unlearning evaluations fail to consider underlying class geometry, which can cause privacy leakage through membership-inference attacks.

Method: Proposes Tilted ReWeighting (TRW) - a fine-tuning objective that estimates inter-class similarity and tilts the model’s distribution to approximate what a retrained-from-scratch model would produce for forget-class inputs.

Result: TRW reduces privacy leakage significantly, cutting the gap with retrained models by 19% for U-LiRA and 46% for MIA-NN scores on CIFAR-10 compared to state-of-the-art methods.

Conclusion: Considering class geometry is crucial for effective unlearning, and TRW provides a simple yet effective solution that matches or surpasses existing methods while better protecting privacy.

Abstract: In this paper, we reveal a significant shortcoming in class unlearning evaluations: overlooking the underlying class geometry can cause privacy leakage. We further propose a simple yet effective solution to mitigate this issue. We introduce a membership-inference attack via nearest neighbors (MIA-NN) that uses the probabilities the model assigns to neighboring classes to detect unlearned samples. Our experiments show that existing unlearning methods are vulnerable to MIA-NN across multiple datasets. We then propose a new fine-tuning objective that mitigates this privacy leakage by approximating, for forget-class inputs, the distribution over the remaining classes that a retrained-from-scratch model would produce. To construct this approximation, we estimate inter-class similarity and tilt the target model’s distribution accordingly. The resulting Tilted ReWeighting (TRW) distribution serves as the desired distribution during fine-tuning. We also show that across multiple benchmarks, TRW matches or surpasses existing unlearning methods on prior unlearning metrics. More specifically, on CIFAR-10, it reduces the gap with retrained models by 19% and 46% for U-LiRA and MIA-NN scores, respectively, compared to the SOTA method for each category.
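
The idea of building a target distribution over the remaining classes can be illustrated with a small sketch. The abstract does not give the exact formula, so the exponential tilt used here is an assumption, as are the names `tilted_reweighting`, `similarity`, and `temperature`. The function zeroes out the forget class, up-weights classes that are similar to it, and renormalizes.

```python
import math


def tilted_reweighting(probs, forget_class, similarity, temperature=1.0):
    """Build a target distribution for a forget-class input.

    probs       : the model's current output distribution for the input
    similarity  : assumed similarity score between the forget class and
                  each class (how these are estimated is not shown here)
    temperature : assumed knob controlling how strongly the tilt acts
    """
    tilted = []
    for j, p in enumerate(probs):
        if j == forget_class:
            tilted.append(0.0)  # no mass may remain on the forgotten class
        else:
            tilted.append(p * math.exp(similarity[j] / temperature))
    total = sum(tilted)  # assumes some mass lies outside the forget class
    return [t / total for t in tilted]
```

With all similarity scores equal, this reduces to simply dropping the forget class and rescaling. A non-uniform tilt moves extra mass toward the forget class's nearest neighbors, which is the kind of class-geometry signal MIA-NN is described as inspecting.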

[487] Urban Incident Prediction with Graph Neural Networks: Integrating Government Ratings and Crowdsourced Reports

Sidhika Balachandar, Shuvom Sadhuka, Bonnie Berger, Emma Pierson, Nikhil Garg

Main category: cs.LG

TL;DR: Proposes a multiview GNN model combining sparse government inspection ratings and dense but biased crowdsourced reports to predict urban incident states, validated on NYC data showing improved performance over single-data approaches.

Details

Motivation: Urban incident prediction faces data challenges: government inspections are sparse but unbiased, while crowdsourced reports are dense but biased due to heterogeneous reporting behavior across neighborhoods.

Method: Multiview, multioutput GNN-based model that integrates both unbiased government rating data and biased crowdsourced reporting data to predict latent incident states.

Result: Model outperforms single-data approaches on real and semi-synthetic NYC data, especially when ratings are sparse and reports are predictive. Also quantified demographic biases in reporting (higher-income areas report more).

Conclusion: The approach effectively leverages heterogeneous, sparse, and biased data for latent state prediction, with broad applicability beyond urban incident forecasting.

Abstract: Graph neural networks (GNNs) are widely used in urban spatiotemporal forecasting, such as predicting infrastructure problems. In this setting, government officials wish to know in which neighborhoods incidents like potholes or rodent issues occur. The true state of incidents (e.g., street conditions) for each neighborhood is observed via government inspection ratings. However, these ratings are only conducted for a sparse set of neighborhoods and incident types. We also observe the state of incidents via crowdsourced reports, which are more densely observed but may be biased due to heterogeneous reporting behavior. First, for such settings, we propose a multiview, multioutput GNN-based model that uses both unbiased rating data and biased reporting data to predict the true latent state of incidents. Second, we investigate a case study of New York City urban incidents and collect, standardize, and make publicly available a dataset of 9,615,863 crowdsourced reports and 1,041,415 government inspection ratings over 3 years and across 139 types of incidents. Finally, we show on both real and semi-synthetic data that our model can better predict the latent state compared to models that use only reporting data or models that use only rating data, especially when rating data is sparse and reports are predictive of ratings. We also quantify demographic biases in crowdsourced reporting, e.g., higher-income neighborhoods report problems at higher rates. Our analysis showcases a widely applicable approach for latent state prediction using heterogeneous, sparse, and biased data.

[488] Preserving Task-Relevant Information Under Linear Concept Removal

Floris Holstege, Shauli Ravfogel, Bram Wouters

Main category: cs.LG

TL;DR: SPLINCE is a method that removes unwanted concepts from neural network representations while preserving covariance with target labels, outperforming existing approaches on fairness benchmarks.

Details

Motivation: Neural networks often encode unwanted concepts alongside task-relevant information, causing fairness and interpretability issues. Existing methods for removing these concepts often degrade useful signals.

Method: SPLINCE uses an oblique projection to ‘splice out’ unwanted directions while protecting important label correlations. It’s a simultaneous projection approach that removes linear concept predictability while maintaining target covariance with minimal embedding distortion.

Result: SPLINCE outperforms baselines on benchmarks like Bias in Bios and Winobias, effectively removing protected attributes while minimally damaging main-task information.

Conclusion: SPLINCE provides a theoretically unique solution for concept removal that preserves important label correlations, offering improved fairness and interpretability in neural network representations.

Abstract: Modern neural networks often encode unwanted concepts alongside task-relevant information, leading to fairness and interpretability concerns. Existing post-hoc approaches can remove undesired concepts but often degrade useful signals. We introduce SPLINCE (Simultaneous Projection for LINear concept removal and Covariance prEservation), which eliminates sensitive concepts from representations while exactly preserving their covariance with a target label. SPLINCE achieves this via an oblique projection that ‘splices out’ the unwanted direction yet protects important label correlations. Theoretically, it is the unique solution that removes linear concept predictability and maintains target covariance with minimal embedding distortion. Empirically, SPLINCE outperforms baselines on benchmarks such as Bias in Bios and Winobias, removing protected attributes while minimally damaging main-task information.
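
The two constraints can be made concrete for a single concept direction. Let u stand for the cross-covariance of the representation with the concept and w for its cross-covariance with the label; a projection P removes the concept's linear signal if P u = 0 and preserves label covariance if P w = w. The sketch below builds one oblique projection with both properties. It is not the paper's minimum-distortion solution, and `oblique_projection` and `project` are hypothetical helpers.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def oblique_projection(concept_dir, label_dir):
    """Return P = I - u v^T / (v^T u) as a list of rows.

    u is the concept direction; v is the part of u orthogonal to the
    label direction w, so that P u = 0 and P w = w.
    """
    u, w = concept_dir, label_dir
    coef = dot(u, w) / dot(w, w)
    v = [ui - coef * wi for ui, wi in zip(u, w)]
    denom = dot(v, u)
    if abs(denom) < 1e-12:
        raise ValueError("concept and label directions are collinear")
    n = len(u)
    return [[(1.0 if i == j else 0.0) - u[i] * v[j] / denom
             for j in range(n)]
            for i in range(n)]


def project(P, x):
    """Apply the projection matrix P to the vector x."""
    return [dot(row, x) for row in P]
```

An orthogonal projection that removes u would generally also shrink w whenever the two directions are correlated; choosing v orthogonal to w is what makes the projection oblique and leaves w untouched.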

[489] Convergence Bound and Critical Batch Size of Muon Optimizer

Naoki Sato, Hiroki Naganuma, Hideaki Iiduka

Main category: cs.LG

TL;DR: Theoretical analysis of Muon optimizer showing convergence proofs across four settings, tighter bounds with weight decay, and derivation of critical batch size for computational efficiency.

Details

Motivation: Muon optimizer has shown strong empirical performance but lacks theoretical foundation. This paper aims to provide theoretical support for its practical success.

Method: Provides convergence proofs for Muon across four practical settings (with/without Nesterov momentum and weight decay), analyzes hyperparameter interplay, and derives critical batch size for minimizing computational cost.

Result: Demonstrated that weight decay yields strictly tighter theoretical bounds, clarified weight decay-learning rate interplay, and identified hyperparameters governing critical batch size. Experiments validated findings on image classification and language modeling tasks.

Conclusion: Muon’s theoretical foundations are established through convergence proofs and analysis, supporting its empirical success and providing guidance for hyperparameter tuning and computational efficiency.

Abstract: Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents a theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without the inclusion of Nesterov momentum and weight decay. Our analysis covers the standard configuration using both, thereby elucidating its real-world performance. We then demonstrate that the addition of weight decay yields strictly tighter theoretical bounds and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive the critical batch size for Muon that minimizes the computational cost of training. Our analysis identifies the hyperparameters governing this value, and our experiments validate the corresponding theoretical findings across workloads including image classification and language modeling tasks.

[490] RetrySQL: text-to-SQL training with retry data for self-correcting query generation

Alicja Rączkowska, Riccardo Belluzzo, Piotr Zieliński, Joanna Baran, Paweł Olszewski

Main category: cs.LG

TL;DR: RetrySQL introduces self-correcting training for text-to-SQL models by creating corrupted reasoning steps with corrections, improving execution accuracy by up to 4 percentage points.

Details

Motivation: Address the lack of SQL-specific generative models and unexplored application of self-correcting generation strategies in text-to-SQL tasks.

Method: Prepare reasoning steps for SQL queries, corrupt them to create retry data with incorrect/corrected steps, and continuously pre-train open-source coding models with this data using full-parameter pre-training.

Result: RetrySQL improves overall and challenging execution accuracy by up to 4 percentage points, with models competitive against larger proprietary models, and demonstrates learned self-correcting behavior.

Conclusion: Self-correction can be effectively learned for text-to-SQL tasks, providing a novel approach to improve SQL generation accuracy without requiring massive parameter counts.

Abstract: The text-to-SQL task is an active challenge in Natural Language Processing. Many existing solutions focus on using black-box language models extended with specialized components within customized end-to-end text-to-SQL pipelines. While these solutions use both closed-source proprietary language models and coding-oriented open-source models, there is a lack of research regarding SQL-specific generative models. At the same time, recent advancements in self-correcting generation strategies show promise for improving the capabilities of existing architectures. The application of these concepts to the text-to-SQL task remains unexplored. In this paper, we introduce RetrySQL, a new approach to training text-to-SQL generation models. We prepare reasoning steps for reference SQL queries and then corrupt them to create retry data that contains both incorrect and corrected steps, divided with a special token. We continuously pre-train an open-source coding model with this data and demonstrate that retry steps yield an improvement of up to 4 percentage points in both overall and challenging execution accuracy metrics, compared to pre-training without retry data. Additionally, we confirm that supervised fine-tuning with LoRA is ineffective for learning from retry data and that full-parameter pre-training is a necessary requirement for that task. We showcase that the self-correcting behavior is learned by the model and the increase in downstream accuracy metrics is a result of this additional skill. Finally, we incorporate RetrySQL-trained models into the full text-to-SQL pipeline and showcase that they are competitive in terms of execution accuracy with proprietary models that contain orders of magnitude more parameters. RetrySQL demonstrates that self-correction can be learned in the text-to-SQL task and provides a novel way of improving generation accuracy for SQL-oriented language models.
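
The retry-data construction described above (an incorrect step, a special token, then the corrected step) can be sketched roughly as follows. The token string `[BACK]`, the `error_prob` parameter, and the corruption rule `swap_in_other_step` are all assumptions made for illustration; the abstract does not specify them.

```python
import random

RETRY_TOKEN = "[BACK]"  # hypothetical name for the special separator token


def swap_in_other_step(steps, i, rng):
    """Hypothetical corruption: substitute a different step from the
    same solution for the step at position i."""
    others = [s for j, s in enumerate(steps) if j != i]
    return rng.choice(others) if others else steps[i]


def make_retry_sequence(steps, corrupt, error_prob=0.3, rng=None):
    """Interleave corrupted steps with their corrections.

    Each step is, with probability `error_prob`, preceded by a corrupted
    version and the retry token, so the training sequence shows both the
    mistake and its fix.
    """
    rng = rng or random.Random()
    out = []
    for i, step in enumerate(steps):
        if rng.random() < error_prob:
            out.append(corrupt(steps, i, rng))
            out.append(RETRY_TOKEN)
        out.append(step)
    return out
```

Removing every corrupted step and retry token from the output recovers the original reasoning steps, so the reference solution is never lost in the augmented data.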

[491] NTSFormer: A Self-Teaching Graph Transformer for Multimodal Isolated Cold-Start Node Classification

Jun Hu, Yufei He, Yuan Li, Bryan Hooi, Bingsheng He

Main category: cs.LG

TL;DR: NTSFormer is a Graph Transformer framework that addresses isolated cold-start node classification with missing modalities using a self-teaching paradigm, avoiding degradation to MLPs and handling multimodal data effectively.

Details

Motivation: Existing methods degrade graph learning models to MLPs for isolated cold-start inference, limiting model capacity and struggling with missing modalities. A unified approach is needed to handle both structural isolation and modality missing issues.

Method: Proposes Neighbor-to-Self Graph Transformer (NTSFormer) with cold-start attention mask that makes simultaneous student (self-only) and teacher (self+neighbor) predictions. Uses multimodal graph pre-computation, Mixture-of-Experts Input Projection, and Transformer layers for fusion.

Result: Experiments on public datasets show NTSFormer achieves superior performance for multimodal isolated cold-start node classification compared to existing methods.

Conclusion: NTSFormer provides an effective unified framework that jointly addresses isolation and missing-modality challenges in cold-start node classification through self-teaching without model degradation.

Abstract: Isolated cold-start node classification on multimodal graphs is challenging because such nodes have no edges and often have missing modalities (e.g., absent text or image features). Existing methods address structural isolation by degrading graph learning models to multilayer perceptrons (MLPs) for isolated cold-start inference, using a teacher model (with graph access) to guide the MLP. However, this results in limited model capacity in the student, which is further challenged when modalities are missing. In this paper, we propose Neighbor-to-Self Graph Transformer (NTSFormer), a unified Graph Transformer framework that jointly tackles the isolation and missing-modality issues via a self-teaching paradigm. Specifically, NTSFormer uses a cold-start attention mask to simultaneously make two predictions for each node: a “student” prediction based only on self information (i.e., the node’s own features), and a “teacher” prediction incorporating both self and neighbor information. This enables the model to supervise itself without degrading to an MLP, thereby fully leveraging the Transformer’s capacity to handle missing modalities. To handle diverse graph information and missing modalities, NTSFormer performs a one-time multimodal graph pre-computation that converts structural and feature data into token sequences, which are then processed by Mixture-of-Experts (MoE) Input Projection and Transformer layers for effective fusion. Experiments on public datasets show that NTSFormer achieves superior performance for multimodal isolated cold-start node classification.

[492] Advanced Torrential Loss Function for Precipitation Forecasting

Jaeho Choi, Hyeri Kim, Kwang-Ho Kim, Jaesung Lee

Main category: cs.LG

TL;DR: Proposes a differentiable advanced torrential (AT) loss function for precipitation forecasting that addresses limitations of CSI during dry periods by introducing a penalty term reformulated as QUBO and relaxed through approximation.

Details

Motivation: Current machine learning approaches for precipitation forecasting rely on suboptimal loss functions like CSI, which become ineffective during extended dry periods when precipitation is below threshold, limiting forecast accuracy.

Method: Introduces a penalty expression reformulated as quadratic unconstrained binary optimization (QUBO), then relaxes it into a differentiable AT loss function through an approximation process.

Result: The AT loss demonstrates superiority through Lipschitz constant analysis, forecast performance evaluations, consistency experiments, and ablation studies with operational models.

Conclusion: The proposed AT loss function effectively addresses CSI limitations during dry periods and improves precipitation forecasting performance compared to traditional approaches.

Abstract: Accurate precipitation forecasting is becoming increasingly important in the context of climate change. In response, machine learning-based approaches have recently gained attention as an emerging alternative to traditional methods such as numerical weather prediction and climate models. Nonetheless, many recent approaches still rely on off-the-shelf loss functions, and even the more advanced ones merely involve optimization processes based on the critical success index (CSI). The problem, however, is that CSI may become ineffective during extended dry periods when precipitation remains below the threshold, rendering it less than ideal as a criterion for optimization. To address this limitation, we introduce a simple penalty expression and reinterpret it as a quadratic unconstrained binary optimization (QUBO) formulation. Ultimately, the resulting QUBO formulation is relaxed into a differentiable advanced torrential (AT) loss function through an approximation process. The proposed AT loss demonstrates its superiority through the Lipschitz constant, forecast performance evaluations, consistency experiments, and ablation studies with the operational model.
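
The dry-period problem with CSI can be seen directly from its standard definition, CSI = hits / (hits + misses + false alarms). The sketch below is only an illustration of that metric, not the AT loss itself; the function name, the 1.0 mm threshold, and the sample values are assumptions made for the example.

```python
from typing import Optional, Sequence


def critical_success_index(
    forecast_mm: Sequence[float],
    observed_mm: Sequence[float],
    threshold_mm: float,
) -> Optional[float]:
    """Return CSI = hits / (hits + misses + false alarms), or None if undefined."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecast_mm, observed_mm):
        predicted = f >= threshold_mm
        occurred = o >= threshold_mm
        if predicted and occurred:
            hits += 1
        elif occurred:
            misses += 1
        elif predicted:
            false_alarms += 1
        # Correct negatives (dry forecast, dry observation) do not enter CSI.
    denominator = hits + misses + false_alarms
    if denominator == 0:
        return None  # 0/0: nothing was forecast or observed above the threshold
    return hits / denominator


# Hypothetical wet period: 1 hit, 1 miss, 1 false alarm -> CSI = 1/3.
wet = critical_success_index(
    [5.0, 0.2, 3.0, 0.0], [4.0, 2.5, 0.1, 0.0], threshold_mm=1.0
)

# Hypothetical dry period: every value is below the threshold -> CSI undefined.
dry = critical_success_index([0.0, 0.3, 0.1], [0.2, 0.0, 0.4], threshold_mm=1.0)
```

In the dry case the metric carries no information about how close the forecast was to the observations, which is the gap a penalty term is meant to fill.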

[493] OccamVTS: Distilling Vision Models to 1% Parameters for Time Series Forecasting

Sisuo Lyu, Siru Zhong, Weilin Ruan, Qingxiang Liu, Qingsong Wen, Hui Xiong, Yuxuan Liang

Main category: cs.LG

TL;DR: OccamVTS uses knowledge distillation to transfer the essential predictive information of large vision models into lightweight networks with only 1% of the original parameters, achieving state-of-the-art time series forecasting performance.

Details

Motivation: While large vision models improve time series forecasting, 99% of their parameters are unnecessary and high-level semantic features can actually impair accuracy by introducing noise.

Method: Knowledge distillation framework using pre-trained LVMs as teachers, with pyramid-style feature alignment combined with correlation and feature distillation to transfer beneficial patterns while filtering semantic noise.

Result: Achieves state-of-the-art performance across multiple benchmark datasets with only 1% of original parameters, particularly excelling in few-shot and zero-shot scenarios.

Conclusion: Aggressive parameter reduction through selective knowledge distillation actually improves accuracy by eliminating overfitting to irrelevant visual features while preserving essential temporal patterns.

Abstract: Time series forecasting is fundamental to diverse applications, and recent approaches leverage large vision models (LVMs) to capture temporal patterns through visual representations. We reveal that while vision models enhance forecasting performance, 99% of their parameters are unnecessary for time series tasks. Through cross-modal analysis, we find that time series align with low-level textural features but not high-level semantics, which can impair forecasting accuracy. We propose OccamVTS, a knowledge distillation framework that extracts only the essential 1% of predictive information from LVMs into lightweight networks. Using pre-trained LVMs as privileged teachers, OccamVTS employs pyramid-style feature alignment combined with correlation and feature distillation to transfer beneficial patterns while filtering out semantic noise. Counterintuitively, this aggressive parameter reduction improves accuracy by eliminating overfitting to irrelevant visual features while preserving essential temporal patterns. Extensive experiments across multiple benchmark datasets demonstrate that OccamVTS consistently achieves state-of-the-art performance with only 1% of the original parameters, particularly excelling in few-shot and zero-shot scenarios.

[494] VITA: Variational Pretraining of Transformers for Climate-Robust Crop Yield Forecasting

Adib Hasan, Mardavij Roozbehani, Munther Dahleh

Main category: cs.LG

TL;DR: VITA is a variational pretraining framework that learns from satellite weather data and transfers to limited ground measurements for crop yield forecasting, achieving state-of-the-art performance especially during extreme years.

Details

Motivation: Current AI models systematically underperform when yields deviate from historical trends due to lack of physically grounded datasets linking atmospheric states to yields.

Method: VITA uses variational pretraining with meteorological variables as proxy targets, learns to predict latent atmospheric states under a seasonality-aware sinusoidal prior, and is fine-tuned with limited weather statistics.

Result: Applied to 763 US Corn Belt counties, VITA achieves state-of-the-art corn and soybean yield prediction across all scenarios, with statistically significant improvements (p < 0.0001), particularly during extreme years.

Conclusion: Domain-aware AI design like VITA can overcome data limitations and support resilient agricultural forecasting in a changing climate, outperforming prior frameworks with less compute.

Abstract: Accurate crop yield forecasting is essential for global food security. However, current AI models systematically underperform when yields deviate from historical trends. We attribute this to the lack of rich, physically grounded datasets directly linking atmospheric states to yields. To address this, we introduce VITA (Variational Inference Transformer for Asymmetric Data), a variational pretraining framework that learns representations from large satellite-based weather datasets and transfers to the ground-based limited measurements available for yield prediction. VITA is trained using detailed meteorological variables as proxy targets during pretraining and learns to predict latent atmospheric states under a seasonality-aware sinusoidal prior. This allows the model to be fine-tuned using limited weather statistics during deployment. Applied to 763 counties in the US Corn Belt, VITA achieves state-of-the-art performance in predicting corn and soybean yields across all evaluation scenarios, particularly during extreme years, with statistically significant improvements (paired t-test, p < 0.0001). Importantly, VITA outperforms prior frameworks like GNN-RNN without soil data, and larger foundational models (e.g., Chronos-Bolt) with less compute, making it practical for real-world use, especially in data-scarce regions. This work highlights how domain-aware AI design can overcome data limitations and support resilient agricultural forecasting in a changing climate.

[495] BubbleOKAN: A Physics-Informed Interpretable Neural Operator for High-Frequency Bubble Dynamics

Yunhao Zhang, Sidharth S. Menon, Lin Cheng, Aswin Gnanaskandan, Ameya D. Jagtap

Main category: cs.LG

TL;DR: Physics-informed neural operators using two-step DeepONet and DeepOKAN architectures to map pressure profiles to bubble radius responses, addressing spectral bias and enhancing high-frequency feature representation.

Details

Motivation: To overcome the spectral bias in deep learning models and improve representation of high-frequency bubble dynamics while enhancing interpretability compared to conventional MLP architectures.

Method: Two-step DeepONet with Rowdy adaptive activation function and DeepOKAN using spline basis functions with RBF, evaluated on Rayleigh-Plesset and Keller-Miksis equations with single/multiple initial radii.

Result: DeepOKAN accurately captures both low- and high-frequency bubble dynamics and outperforms state-of-the-art neural operators like FNO, WNO, OFormer, and CNOs.

Conclusion: Two-step DeepOKAN offers a promising alternative to conventional numerical solvers for bubble dynamics by effectively addressing spectral bias and providing interpretable high-frequency modeling.

Abstract: In this work, we employ physics-informed neural operators to map pressure profiles from an input function space to the corresponding bubble radius responses. Our approach employs a two-step DeepONet architecture. To address the intrinsic spectral bias of deep learning models, our model incorporates the Rowdy adaptive activation function, enhancing the representation of high-frequency features. Moreover, we introduce the Kolmogorov-Arnold network (KAN) based two-step DeepOKAN model, which enhances interpretability (often lacking in conventional multilayer perceptron architectures) while efficiently capturing high-frequency bubble dynamics without explicit utilization of activation functions in any form. We particularly investigate the use of spline basis functions in combination with radial basis functions (RBF) within our architecture, as they demonstrate superior performance in constructing a universal basis for approximating high-frequency bubble dynamics compared to alternative formulations. Furthermore, we emphasize the performance bottleneck of RBF while learning the high-frequency bubble dynamics and showcase the advantage of using spline basis functions for the trunk network in overcoming this inherent spectral bias. The model is systematically evaluated across three representative scenarios: (1) bubble dynamics governed by the Rayleigh-Plesset equation with a single initial radius, (2) bubble dynamics governed by the Keller-Miksis equation with a single initial radius, and (3) Keller-Miksis dynamics with multiple initial radii. We also compare our results with state-of-the-art neural operators, including Fourier Neural Operators, Wavelet Neural Operators, OFormer, and Convolutional Neural Operators. Our findings demonstrate that the two-step DeepOKAN accurately captures both low- and high-frequency behaviors, and offers a promising alternative to conventional numerical solvers.

Zihan Liu, Jiayi Wen, Junru Wu, Xuyang Zou, Shouhong Tan, Zhirun Zheng, Cheng Huang

Main category: cs.LG

TL;DR: PrivDFS is a distributed feature-sharing framework that fragments intermediate representations across multiple servers to prevent diffusion-based data reconstruction attacks while maintaining high classification accuracy.

Details

Motivation: Single holistic intermediate representations in split inference are vulnerable to diffusion-based Data Reconstruction Attacks (DRAs) that can reconstruct inputs with high fidelity.

Method: Fragments intermediate representations using learnable binary masks into sparse, non-overlapping feature shares processed independently across majority-honest servers, with a lightweight fusion module aggregating predictions on the client.

Result: Dramatically reduces DRA performance (PSNR drops from 23.25 to 12.72, SSIM from 0.963 to 0.260 on CIFAR-10) while maintaining accuracy within 1% of non-private split inference across multiple datasets.

Conclusion: Structural feature partitioning is a practical and architecture-agnostic approach to reducing reconstructive leakage in cloud-based vision inference.

Abstract: In this paper, we introduce PrivDFS, a distributed feature-sharing framework for input-private inference in image classification. A single holistic intermediate representation in split inference gives diffusion-based Data Reconstruction Attacks (DRAs) sufficient signal to reconstruct the input with high fidelity. PrivDFS restructures this vulnerability by fragmenting the representation and processing the fragments independently across a majority-honest set of servers. As a result, each branch observes only an incomplete and reconstruction-insufficient view of the input. To realize this, PrivDFS employs learnable binary masks that partition the intermediate representation into sparse and largely non-overlapping feature shares, each processed by a separate server, while a lightweight fusion module aggregates their predictions on the client. This design preserves full task accuracy when all branches are combined, yet sharply limits the reconstructive power available to any individual server. PrivDFS applies seamlessly to both ResNet-based CNNs and Vision Transformers. Across CIFAR-10/100, CelebA, and ImageNet-1K, PrivDFS induces a pronounced collapse in DRA performance, e.g., on CIFAR-10, PSNR drops from 23.25 -> 12.72 and SSIM from 0.963 -> 0.260, while maintaining accuracy within 1% of non-private split inference. These results establish structural feature partitioning as a practical and architecture-agnostic approach to reducing reconstructive leakage in cloud-based vision inference.
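
The idea of splitting one intermediate representation into non-overlapping feature shares can be sketched with plain binary masks. This is a simplified illustration, not the PrivDFS method: the abstract describes learnable masks that are largely (not strictly) non-overlapping, whereas the sketch assumes a fixed round-robin assignment, and the names `round_robin_masks` and `apply_mask` are hypothetical.

```python
from typing import List


def round_robin_masks(num_features: int, num_servers: int) -> List[List[int]]:
    """Build one binary mask per server; feature i is assigned to server i % num_servers.

    Assumption for illustration: a fixed assignment stands in for learned masks.
    """
    return [
        [1 if i % num_servers == s else 0 for i in range(num_features)]
        for s in range(num_servers)
    ]


def apply_mask(features: List[float], mask: List[int]) -> List[float]:
    """Zero out every feature that is not part of this server's share."""
    return [x * m for x, m in zip(features, mask)]


# Hypothetical 6-dimensional intermediate representation split across 3 servers.
features = [0.5, -1.2, 3.0, 0.7, 2.2, -0.4]
masks = round_robin_masks(len(features), num_servers=3)
shares = [apply_mask(features, m) for m in masks]

# Each server sees only its own sparse share; only the sum of all shares
# recovers the full representation.
recombined = [sum(share[i] for share in shares) for i in range(len(features))]
```

Each share here holds a third of the features, so any single server receives an incomplete view, while combining all shares restores the original vector.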

[497] Hypergraph Neural Network with State Space Models for Node Classification

A. Quadir, M. Tanveer

Main category: cs.LG

TL;DR: Proposes HGMN, a hypergraph neural network with state space model that integrates role-aware representations into GNNs for improved node classification by capturing higher-order relationships and structural similarities.

Details

Motivation: Traditional GNNs focus mainly on adjacency relationships and overlook role-based characteristics, while existing role-based feature extraction methods are largely unsupervised and ineffective for downstream tasks.

Method: Combines hypergraph construction (degree-based and neighborhood-based strategies) with state-space modeling using a learnable mamba transformer to fuse role-based and adjacency-based embeddings, with hypergraph convolution layers and residual connections to prevent over-smoothing.

Result: Outperforms strong baselines on multiple benchmark datasets (OGB, ACM, DBLP, IIP TerroristRel, Cora, Citeseer, Pubmed) in node classification tasks.

Conclusion: Explicitly incorporating role-based features within a hypergraph framework provides tangible benefits for node classification tasks.

Abstract: In recent years, graph neural networks (GNNs) have gained significant attention for node classification tasks on graph-structured data. However, traditional GNNs primarily focus on adjacency relationships between nodes, often overlooking the role-based characteristics that can provide complementary insights for learning expressive node representations. Existing frameworks for extracting role-based features are largely unsupervised and often fail to translate effectively into downstream predictive tasks. To address these limitations, we propose a hypergraph neural network with a state space model (HGMN). The model integrates role-aware representations into GNNs by combining hypergraph construction with state-space modeling in a principled manner. HGMN employs hypergraph construction techniques to capture higher-order relationships and leverages a learnable mamba transformer mechanism to fuse role-based and adjacency-based embeddings. By exploring two distinct hypergraph construction strategies, degree-based and neighborhood-based, the framework reinforces connectivity among nodes with structural similarity, thereby enriching the learned representations. Furthermore, the inclusion of hypergraph convolution layers enables the model to account for complex dependencies within hypergraph structures. To alleviate the over-smoothing problem encountered in deeper networks, we incorporate residual connections, which improve stability and promote effective feature propagation across layers. Comprehensive experiments on benchmark datasets including OGB, ACM, DBLP, IIP TerroristRel, Cora, Citeseer, and Pubmed demonstrate that HGMN consistently outperforms strong baselines in node classification tasks. These results support the claim that explicitly incorporating role-based features within a hypergraph framework offers tangible benefits for node classification tasks.

[498] On the notion of missingness for path attribution explainability methods in medical settings: Guiding the selection of medically meaningful baselines

Alexander Geiger, Lars Wagner, Daniel Rueckert, Dirk Wilhelm, Alissa Jell

Main category: cs.LG

TL;DR: The paper introduces a counterfactual-guided approach for selecting meaningful baselines in path attribution methods for medical AI, using clinically “normal” variations of pathological inputs to improve interpretability.

Details

Motivation: Current baseline choices in path attribution methods (like all-zero inputs) are semantically meaningless in medical contexts, lacking principled approaches for dynamic baseline selection tailored to each input.

Method: Proposes using generated counterfactuals (clinically “normal” variations of pathological inputs) as baselines, implemented with a Variational Autoencoder but model-agnostic in concept.

Result: Evaluation on three medical datasets shows counterfactual baselines yield more faithful and medically relevant attributions, outperforming standard baseline choices and related methods.

Conclusion: Counterfactual baselines provide a more accurate representation of meaningful feature absence in medical contexts, improving the faithfulness and clinical relevance of model attributions.

Abstract: The explainability of deep learning models remains a significant challenge, particularly in the medical domain where interpretable outputs are critical for clinical trust and transparency. Path attribution methods such as Integrated Gradients rely on a baseline representing the absence of relevant features (“missingness”). Commonly used baselines, such as all-zero inputs, are often semantically meaningless, especially in medical contexts. While alternative baseline choices have been explored, existing methods lack a principled approach to dynamically select baselines tailored to each input. In this work, we examine the notion of missingness in the medical context, analyze its implications for baseline selection, and introduce a counterfactual-guided approach to address the limitations of conventional baselines. We argue that a generated counterfactual (i.e., a clinically “normal” variation of the pathological input) is a more accurate representation of a meaningful absence of features. We use a Variational Autoencoder in our implementation, though our concept is model-agnostic and can be applied with any suitable counterfactual method. We evaluate our concept on three distinct medical data sets and empirically demonstrate that counterfactual baselines yield more faithful and medically relevant attributions, outperforming standard baseline choices as well as other related methods.
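
How strongly the baseline shapes a path attribution can be seen in a small Integrated Gradients example. The sketch below uses the standard formulation, IG_i = (x_i - b_i) times the path integral of the gradient from baseline b to input x, approximated with a midpoint sum. It is not the authors' VAE-based counterfactual method: `toy_model`, `toy_grad`, and both baselines are hypothetical stand-ins chosen so the result can be checked by hand.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], List[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Approximate Integrated Gradients along the straight path baseline -> x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad_fn(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def toy_model(x: Sequence[float]) -> float:
    """Hypothetical differentiable model: F(x) = x0 * x1 + x2^2."""
    return x[0] * x[1] + x[2] ** 2


def toy_grad(x: Sequence[float]) -> List[float]:
    """Analytical gradient of toy_model."""
    return [x[1], x[0], 2.0 * x[2]]


x = [2.0, 3.0, 1.0]
zero_baseline = [0.0, 0.0, 0.0]
# Hypothetical "normal" counterpart that differs from x only in feature 1.
reference_baseline = [2.0, 1.0, 1.0]

attr_zero = integrated_gradients(toy_grad, x, zero_baseline)
attr_ref = integrated_gradients(toy_grad, x, reference_baseline)
```

With the all-zero baseline every feature receives credit (approximately 3, 3, and 1), while with the input-specific reference only the feature that actually differs from the reference is attributed (approximately 0, 4, and 0). In both cases the attributions sum to the difference between the model output at the input and at the baseline.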

[499] Fairness for the People, by the People: Minority Collective Action

Omri Ben-Dov, Samira Samadi, Amartya Sanyal, Alexandru Ţifrea

Main category: cs.LG

TL;DR: End-users can improve fairness in ML models through coordinated data relabeling without changing the firm’s training process, achieving significant fairness gains with minimal impact on overall accuracy.

Details

Motivation: Machine learning models often preserve biases from training data, and existing firm-side bias mitigation techniques incur utility costs and require organizational buy-in. End-users who contribute data should be able to induce fairness.

Method: Proposed three practical, model-agnostic methods for Algorithmic Collective Action where a coordinated minority group strategically relabels its own data to approximate ideal relabeling.

Result: A subgroup of the minority can substantially reduce unfairness with a small impact on overall prediction error, validated on real-world datasets.

Conclusion: End-user coordinated data relabeling through Algorithmic Collective Action provides an effective approach to enhance fairness without requiring changes to the firm’s training process.

Abstract: Machine learning models often preserve biases present in training data, leading to unfair treatment of certain minority groups. Despite an array of existing firm-side bias mitigation techniques, they typically incur utility costs and require organizational buy-in. Recognizing that many models rely on user-contributed data, end-users can induce fairness through the framework of Algorithmic Collective Action, where a coordinated minority group strategically relabels its own data to enhance fairness, without altering the firm’s training process. We propose three practical, model-agnostic methods to approximate ideal relabeling and validate them on real-world datasets. Our findings show that a subgroup of the minority can substantially reduce unfairness with a small impact on the overall prediction error.

[500] ICL-Router: In-Context Learned Model Representations for LLM Routing

Chenxu Wang, Hao Li, Yiqun Zhang, Linyao Chen, Jianhao Chen, Ping Jian, Peng Ye, Qiaosheng Zhang, Shuyue Hu

Main category: cs.LG

TL;DR: Proposes a novel model routing method using in-context vectors to represent model capabilities, enabling dynamic query routing without retraining when adding new models.

Details

Motivation: Current model routing methods require retraining when adding new models, limiting scalability, and routing performance depends on accurate model representations.

Method: Two-stage approach: (1) embed queries and project them into vectors, training the projector and an LLM-based router to reconstruct the original queries so that the vectors align with the router’s semantic space; (2) profile each candidate model on a query set and train the router to predict, from in-context vectors of queries and model performance, whether each model can correctly answer a new query.

Result: Achieves state-of-the-art routing performance in both in-distribution and out-of-distribution tasks, and allows seamless integration of new models without router retraining.

Conclusion: The proposed in-context vector routing method provides scalable and effective model routing with strong performance across different task distributions.

Abstract: Large language models (LLMs) often exhibit complementary strengths. Model routing harnesses these strengths by dynamically directing each query to the most suitable model, given a candidate model pool. However, routing performance relies on accurate model representations, and adding new models typically requires retraining, limiting scalability. To address these challenges, we propose a novel routing method using in-context vectors to represent model capabilities. The method proceeds in two stages. First, queries are embedded and projected into vectors, with a projector and LLM-based router trained to reconstruct the original queries, aligning vector representations with the router’s semantic space. Second, each candidate model is profiled on a query set, and the router learns – based on in-context vectors of query and model performance – to predict whether each model can correctly answer new queries. Extensive experiments demonstrate that our method achieves state-of-the-art routing performance in both in-distribution and out-of-distribution tasks. Moreover, our method allows for seamless integration of new models without retraining the router. The code is available at https://github.com/lalalamdbf/ICL-Router.

[501] DRMD: Deep Reinforcement Learning for Malware Detection under Concept Drift

Shae McFadden, Myles Foley, Mario D’Onghia, Chris Hicks, Vasilios Mavroudis, Nicola Paoletti, Fabio Pierazzi

Main category: cs.LG

TL;DR: DRL-based malware detection agent simultaneously optimizes classification and rejection for manual labeling, achieving average AUT improvements of 8.66 (classification-only) and 10.90 (classification-rejection) over standard methods with better concept drift resilience.

Details

Motivation: Traditional malware classifiers struggle with concept drift and lack mechanisms for deferring decisions to manual labeling. Real-world malware detection needs to handle evolving threats with limited labeling budgets.

Method: Formulate malware detection as one-step Markov Decision Process and train deep reinforcement learning agent to jointly optimize classification performance and rejection of high-risk samples for manual labeling.

Result: DRMD agent achieved average AUT improvements of 8.66 (classification-only) and 10.90 (classification-rejection) over standard approaches, showing superior resilience to concept drift in multi-year evaluations.

Conclusion: DRL can effectively facilitate malware detection and improve resiliency to concept drift in dynamic Android malware detection settings, outperforming traditional classification approaches.

Abstract: Malware detection in real-world settings must deal with evolving threats, limited labeling budgets, and uncertain predictions. Traditional classifiers, without additional mechanisms, struggle to maintain performance under concept drift in malware domains, as their supervised learning formulation cannot optimize when to defer decisions to manual labeling and adaptation. Modern malware detection pipelines combine classifiers with monthly active learning (AL) and rejection mechanisms to mitigate the impact of concept drift. In this work, we develop a novel formulation of malware detection as a one-step Markov Decision Process and train a deep reinforcement learning (DRL) agent, simultaneously optimizing sample classification performance and rejecting high-risk samples for manual labeling. We evaluated the joint detection and drift mitigation policy learned by the DRL-based Malware Detection (DRMD) agent through time-aware evaluations on Android malware datasets subject to realistic drift requiring multi-year performance stability. The policies learned under these conditions achieve a higher Area Under Time (AUT) performance compared to standard classification approaches used in the domain, showing improved resilience to concept drift. Specifically, the DRMD agent achieved an average AUT improvement of 8.66 and 10.90 for the classification-only and classification-rejection policies, respectively. Our results demonstrate for the first time that DRL can facilitate effective malware detection and improved resiliency to concept drift in the dynamic setting of Android malware detection.
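
Area Under Time summarizes how a metric such as F1 holds up across consecutive test periods. The sketch below assumes the trapezoidal definition commonly used in time-aware malware evaluation, AUT = 1/(N-1) times the sum of (f(k) + f(k+1))/2 over consecutive periods; whether the paper uses exactly this form, and on which scale it reports improvements, is an assumption. The per-period scores are made up.

```python
from typing import Sequence


def area_under_time(scores: Sequence[float]) -> float:
    """Trapezoidal area under a per-period metric curve, normalized to [0, 1].

    `scores` holds one metric value (e.g. F1) per consecutive test period.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("AUT needs at least two test periods")
    trapezoids = ((scores[k] + scores[k + 1]) / 2.0 for k in range(n - 1))
    return sum(trapezoids) / (n - 1)


# Hypothetical detector whose F1 decays under concept drift.
drifting = area_under_time([0.9, 0.8, 0.6, 0.5])

# Hypothetical detector that stays stable over the same periods.
stable = area_under_time([0.9, 0.9, 0.9, 0.9])
```

Both detectors start at the same score, but the drifting one ends with an AUT of about 0.7 against about 0.9 for the stable one, which is why AUT is used to compare resilience to drift rather than performance at a single point in time.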

[502] Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients

Christos Thrampoulidis, Sadegh Mahdavi, Wenlong Deng

Main category: cs.LG

TL;DR: This paper shows that REINFORCE-style methods and advantage-shaping techniques for Pass@K optimization are fundamentally equivalent - advantage-shaping implicitly optimizes surrogate rewards, while surrogate reward objectives can derive advantage-shaping methods.

Details

Motivation: To reconcile two seemingly distinct approaches to policy gradient optimization for Pass@K objective in reinforcement learning with verifiable rewards: direct REINFORCE-style methods and advantage-shaping techniques.

Method: Reverse-engineers existing advantage-shaping algorithms to show that they implicitly optimize surrogate rewards, and conversely derives advantage-shaping methods from surrogate reward objectives.

Result: Revealed that advantage-shaping techniques are equivalent to reward-level regularization, and provided a simple recipe for deriving both existing and new advantage-shaping methods.

Conclusion: This unified perspective provides a framework for RLVR policy gradient optimization that extends beyond the original Pass@K motivation.

Abstract: This note reconciles two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards: (1) direct REINFORCE-style methods, and (2) advantage-shaping techniques that directly modify GRPO. We show that these are two sides of the same coin. By reverse-engineering existing advantage-shaping algorithms, we reveal that they implicitly optimize surrogate rewards. We specifically interpret practical “hard-example up-weighting” modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods. This perspective provides a lens for RLVR policy gradient optimization beyond our original motivation of Pass@K.
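
For readers unfamiliar with the objective, Pass@K is the probability that at least one of K sampled responses is correct. The sketch below shows the widely used unbiased estimator from n samples with c correct ones, 1 - C(n-c, K) / C(n, K). It illustrates only the metric, not the surrogate rewards or advantage-shaping rules derived in the note, and the function name is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n samples of which c are correct.

    Equals 1 minus the probability that a random size-k subset of the
    n samples contains no correct response.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical prompt with 3 correct responses out of 10 samples.
p1 = pass_at_k(n=10, c=3, k=1)  # equals the plain success rate, 3/10
p5 = pass_at_k(n=10, c=3, k=5)  # 1 - C(7,5)/C(10,5) = 1 - 21/252
```

Because Pass@K saturates quickly on prompts that are already solved often, its gradient concentrates on hard prompts, which is consistent with the “hard-example up-weighting” interpretation mentioned in the abstract.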

[503] Neuro-Spectral Architectures for Causal Physics-Informed Networks

Arthur Bizzi, Leonardo M. Moreira, Márcio Marques, Leonardo Mendonça, Christian Júnior de Oliveira, Vitor Balestro, Lucas dos Santos Fernandez, Daniel Yukimura, Pavel Petrov, João M. Pereira, Tiago Novello, Lucas Nissenbaum

Main category: cs.LG

TL;DR: NeuSA introduces a novel PINN architecture combining spectral methods with Neural ODEs to solve PDEs with improved convergence, causality enforcement, and reduced spectral bias.

Details

Motivation: Standard MLP-based PINNs fail to converge for complex initial value problems, violating causality and suffering from spectral bias towards low-frequency components.

Method: NeuSA learns PDE projections onto spectral bases, creating finite-dimensional dynamics representations integrated with adapted Neural ODEs, plus initialization using classical methods.

Result: NeuSA demonstrates strong performance on linear/nonlinear wave equations with faster convergence, improved temporal consistency, and superior predictive accuracy compared to other architectures.

Conclusion: NeuSA effectively overcomes spectral bias, enforces causality through NODE structure, and enables better initialization, making it a promising framework for solving complex PDEs.

Abstract: Physics-Informed Neural Networks (PINNs) have emerged as a powerful framework for solving partial differential equations (PDEs). However, standard MLP-based PINNs often fail to converge when dealing with complex initial value problems, leading to solutions that violate causality and suffer from a spectral bias towards low-frequency components. To address these issues, we introduce NeuSA (Neuro-Spectral Architectures), a novel class of PINNs inspired by classical spectral methods, designed to solve linear and nonlinear PDEs with variable coefficients. NeuSA learns a projection of the underlying PDE onto a spectral basis, leading to a finite-dimensional representation of the dynamics which is then integrated with an adapted Neural ODE (NODE). This allows us to overcome spectral bias, by leveraging the high-frequency components enabled by the spectral representation; to enforce causality, by inheriting the causal structure of NODEs; and to start training near the target solution, by means of an initialization scheme based on classical methods. We validate NeuSA on canonical benchmarks for linear and nonlinear wave equations, demonstrating strong performance as compared to other architectures, with faster convergence, improved temporal consistency and superior predictive accuracy. Code and pretrained models are available at https://github.com/arthur-bizzi/neusa.

[504] Active Learning and Explainable AI for Multi-Objective Optimization of Spin Coated Polymers

Brendan Young, Brendan Alvey, Andreas Werbrouck, Will Murphy, James Keller, Matthias J. Young, Matthew Maschmann

Main category: cs.LG

TL;DR: A framework combining active Pareto front learning with visualization and explainable AI to optimize spin coating parameters for polymer thin films, achieving efficient multi-objective optimization with interpretable results.

Details

Motivation: Spin coating polymer thin films to achieve specific mechanical properties is inherently a multi-objective optimization problem that requires balancing competing objectives like hardness and elasticity.

Method: Integrates PyePAL (active Pareto front learning algorithm) with Gaussian process models, UMAP for 2D visualization, and fuzzy linguistic summaries to optimize spin speed, dilution, and polymer mixture parameters.

Result: The method efficiently identifies promising polymer designs while providing visual and linguistic explanations that facilitate expert-driven analysis and knowledge discovery.

Conclusion: The framework successfully enables multi-objective optimization of polymer thin films with enhanced explainability through visualization and linguistic summaries, supporting better understanding and decision-making.

Abstract: Spin coating polymer thin films to achieve specific mechanical properties is inherently a multi-objective optimization problem. We present a framework that integrates an active Pareto front learning algorithm (PyePAL) with visualization and explainable AI techniques to optimize processing parameters. PyePAL uses Gaussian process models to predict objective values (hardness and elasticity) from the design variables (spin speed, dilution, and polymer mixture), guiding the adaptive selection of samples toward promising regions of the design space. To enable interpretable insights into the high-dimensional design space, we utilize UMAP (Uniform Manifold Approximation and Projection) for two-dimensional visualization of the Pareto front exploration. Additionally, we incorporate fuzzy linguistic summaries, which translate the learned relationships between process parameters and performance objectives into linguistic statements, thus enhancing the explainability and understanding of the optimization results. Experimental results demonstrate that our method efficiently identifies promising polymer designs, while the visual and linguistic explanations facilitate expert-driven analysis and knowledge discovery.
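
To make the notion of a Pareto front concrete, the following is a minimal sketch of non-dominated filtering over candidate designs. It assumes both objectives (hardness, elasticity) are to be maximized, which the summary does not state. The function names and sample values are hypothetical. This is not PyePAL's implementation, which selects samples adaptively from Gaussian process predictions.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (hardness, elasticity), hypothetical units


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any other point (maximization assumed)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


# Made-up objective values for five candidate designs.
candidates = [(1.0, 5.0), (2.0, 4.0), (1.5, 3.0), (3.0, 1.0), (2.5, 0.5)]
front = pareto_front(candidates)
print(front)
```

In this toy set, (1.5, 3.0) is dominated by (2.0, 4.0) and (2.5, 0.5) by (3.0, 1.0), so three designs remain on the front.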

[505] AttentiveGRUAE: An Attention-Based GRU Autoencoder for Temporal Clustering and Behavioral Characterization of Depression from Wearable Data

Nidhi Soley, Vishal M Patel, Casey O Taylor

Main category: cs.LG

TL;DR: AttentiveGRUAE is an attention-based GRU autoencoder for temporal clustering and depression prediction from wearable data, achieving superior clustering (silhouette=0.70) and classification (AUC=0.74) performance.

Details

Motivation: To develop a model that can jointly learn compact behavioral representations, predict depression outcomes, and identify behavioral subtypes from longitudinal wearable data for better clinical interpretation.

Method: Joint optimization of three objectives: sequence reconstruction via GRU autoencoder, depression classification through binary classification head, and behavioral subtype identification using GMM soft clustering on learned embeddings with attention mechanisms.

Result: Superior performance over baselines in clustering quality (silhouette=0.70 vs 0.32-0.70) and depression classification (AUC=0.74 vs 0.50-0.67) on 372 participants, with external validation confirming reproducibility (silhouette=0.63, AUC=0.61) on 332 participants.

Conclusion: AttentiveGRUAE effectively identifies clinically interpretable behavioral subtypes and salient temporal patterns in sleep data, providing insights into depression risk through attention-based analysis.

Abstract: In this study, we present AttentiveGRUAE, a novel attention-based gated recurrent unit (GRU) autoencoder designed for temporal clustering and prediction of outcome from longitudinal wearable data. Our model jointly optimizes three objectives: (1) learning a compact latent representation of daily behavioral features via sequence reconstruction, (2) predicting end-of-period depression rate through a binary classification head, and (3) identifying behavioral subtypes through Gaussian Mixture Model (GMM) based soft clustering of learned embeddings. We evaluate AttentiveGRUAE on longitudinal sleep data from 372 participants (GLOBEM 2018-2019), and it demonstrates superior performance over baseline clustering, domain-aligned self-supervised, and ablated models in both clustering quality (silhouette score = 0.70 vs 0.32-0.70) and depression classification (AUC = 0.74 vs 0.50-0.67). Additionally, external validation on cross-year cohorts from 332 participants (GLOBEM 2020-2021) confirms cluster reproducibility (silhouette score = 0.63, AUC = 0.61) and stability. We further perform subtype analysis and visualize temporal attention, which highlights sleep-related differences between clusters and identifies salient time windows that align with changes in sleep regularity, yielding clinically interpretable explanations of risk.

[506] Pelican-VL 1.0: A Foundation Brain Model for Embodied Intelligence

Yi Zhang, Che Liu, Xiancong Ren, Hanchu Ni, Shuai Zhang, Zeyuan Ding, Jiayu Hu, Hanzhe Shan, Zhenwei Niu, Zhaoyang Liu, Shuang Liu, Yue Zhao, Junbo Qi, Qinfan Zhang, Dengjie Li, Yidong Wang, Jiachen Luo, Yong Dai, Zenglin Xu, Bin Shen, Qifan Wang, Jian Tang, Xiaozhu Ju

Main category: cs.LG

TL;DR: Pelican-VL 1.0 is a new family of open-source embodied brain models (7B-72B parameters) that achieves state-of-the-art performance through metaloop-based training and DPPO framework, outperforming 100B-level open-source models by 10.6%.

Details

Motivation: To embed powerful intelligence into various embodiments by creating the largest-scale open-source embodied multimodal brain model.

Method: Uses DPPO (Deliberate Practice Policy Optimization) framework with metaloop distillation from 4+ billion tokens dataset, trained on 1000+ A800 GPUs consuming 50k+ GPU-hours per checkpoint.

Result: Achieves 20.3% performance uplift from base model, outperforms 100B-level open-source counterparts by 10.6%, and performs on par with leading proprietary systems on embodied benchmarks.

Conclusion: Pelican-VL 1.0 establishes a new state-of-the-art for open-source embodied brain models through its metaloop training approach and DPPO framework.

Abstract: This report presents Pelican-VL 1.0, a new family of open-source embodied brain models with parameter scales ranging from 7 billion to 72 billion. Our stated mission is to embed powerful intelligence into various embodiments. Pelican-VL 1.0 is currently the largest-scale open-source embodied multimodal brain model. Its core advantage lies in the in-depth integration of data power and intelligent adaptive learning mechanisms. Specifically, metaloop distilled a high-quality dataset from a raw dataset containing 4+ billion tokens. Pelican-VL 1.0 is trained on a large-scale cluster of 1000+ A800 GPUs, consuming 50k+ A800 GPU-hours per checkpoint. This translates to a 20.3% performance uplift over its base model and a 10.6% margin over 100B-level open-source counterparts, placing it on par with leading proprietary systems on well-known embodied benchmarks. We establish a novel framework, DPPO (Deliberate Practice Policy Optimization), inspired by human metacognition, to train Pelican-VL 1.0. We operationalize this as a metaloop that teaches the AI to practice deliberately: an RL-Refine-Diagnose-SFT loop.

[507] Towards Personalized Treatment Plan: Geometrical Model-Agnostic Approach to Counterfactual Explanations

Daniel Sin, Milad Toutounchian

Main category: cs.LG

TL;DR: SSBA method generates counterfactual explanations by finding closest feasible points on decision boundaries using segmented sampling and binary search, outperforming existing methods with 5-50% distance reduction while handling real-world constraints.

Details

Motivation: Need for effective counterfactual explanation generation in high-dimensional spaces that can handle real-world constraints on immutable features like age, gender, and categorical variables.

Method: Four-step approach: fit dataset to model, find decision boundary, determine constraints, compute closest feasible counterfactual point using segmented sampling with binary search (SSBA algorithm).

Result: Outperforms current methods with 5-50% reduction in L2 distance across four datasets; handles constraints on immutable/categorical features; significantly faster runtime than grid-based approaches.

Conclusion: SSBA provides simple, effective model-agnostic method for computing nearest feasible counterfactual explanations that are realistic with constraints.

Abstract: In our article, we describe a method for generating counterfactual explanations in high-dimensional spaces using four steps: fitting our dataset to a model, finding the decision boundary, determining constraints on the problem, and computing the closest point (counterfactual explanation) on that boundary. We propose a discretized approach where we find many discrete points on the boundary and then identify the closest feasible counterfactual explanation. This method, which we later call Segmented Sampling for Boundary Approximation (SSBA), applies binary search to find decision boundary points and then searches for the closest boundary point. Across four datasets of varying dimensionality, we show that our method can outperform current methods for counterfactual generation, with reductions in distance of between 5% and 50% in terms of the $L_2$ norm. Our method can also handle real-world constraints by restricting changes to immutable and categorical features, such as age, gender, sex, height, and other related characteristics, as is the case for a health-based dataset. In terms of runtime, the SSBA algorithm generates multiple orders of magnitude more decision boundary points in the same amount of time than a grid-based approach. In general, our method provides a simple and effective model-agnostic method that can compute nearest feasible (i.e. realistic with constraints) counterfactual explanations. All of our results and code are available at: https://github.com/dsin85691/SSBA_For_Counterfactuals
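
The binary-search step described above can be sketched as follows. This is a simplified illustration, not the authors' SSBA code: it bisects the straight segment between a query point and each opposite-class sample, then keeps the L2-nearest boundary point. The toy classifier, function names, and sample points are all hypothetical, and feature constraints are omitted.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def boundary_point(predict: Callable[[Sequence[float]], int],
                   x: Sequence[float], y: Sequence[float],
                   tol: float = 1e-6) -> Vector:
    """Bisect the segment from x to y (with different predicted labels) and
    return a point on y's side within tol (as a segment fraction) of the boundary."""
    label_x = predict(x)
    assert predict(y) != label_x, "endpoints must lie on opposite sides"
    lo, hi = 0.0, 1.0  # fraction along the segment; lo keeps x's label
    while hi - lo > tol:
        mid = (lo + hi) / 2
        point = [a + mid * (b - a) for a, b in zip(x, y)]
        if predict(point) == label_x:
            lo = mid
        else:
            hi = mid
    return [a + hi * (b - a) for a, b in zip(x, y)]


def closest_counterfactual(predict, x, others) -> Vector:
    """Among boundary points found toward each opposite-class sample, pick the L2-nearest."""
    label_x = predict(x)
    candidates = [boundary_point(predict, x, y) for y in others if predict(y) != label_x]
    return min(candidates, key=lambda p: math.dist(p, x))


def toy_predict(v: Sequence[float]) -> int:
    """Hypothetical classifier: label 1 when x0 + x1 > 1."""
    return int(v[0] + v[1] > 1.0)


query = [0.0, 0.0]
opposite = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
cf = closest_counterfactual(toy_predict, query, opposite)
print(cf, math.dist(cf, query))
```

Because each bisection needs only about log2(1/tol) model evaluations, many boundary points can be sampled cheaply. Restricting which coordinates may differ between `x` and `y` would be one way to respect immutable features, though that is not shown here.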

[508] EMOD: A Unified EEG Emotion Representation Framework Leveraging V-A Guided Contrastive Learning

Yuning Chen, Sha Zhao, Shijian Li, Gang Pan

Main category: cs.LG

TL;DR: EMOD is a unified EEG emotion recognition framework that uses valence-arousal guided contrastive learning to create transferable representations across heterogeneous datasets, achieving state-of-the-art performance.

Details

Motivation: Current deep learning models for EEG emotion recognition perform well on single datasets but generalize poorly across datasets due to different annotation schemes and data formats, requiring dataset-specific architectures.

Method: Projects emotion labels into unified valence-arousal space, uses soft-weighted supervised contrastive loss for semantic alignment, and employs Triple-Domain Encoder with Spatial-Temporal Transformer to handle variable EEG formats.

Result: Pretrained on 8 public EEG datasets and evaluated on 3 benchmarks, EMOD achieves state-of-the-art performance with strong adaptability and generalization across diverse emotion recognition scenarios.

Conclusion: EMOD successfully addresses cross-dataset generalization challenges in EEG emotion recognition through unified V-A space projection and flexible architecture design.

Abstract: Emotion recognition from EEG signals is essential for affective computing and has been widely explored using deep learning. While recent deep learning approaches have achieved strong performance on single EEG emotion datasets, their generalization across datasets remains limited due to the heterogeneity in annotation schemes and data formats. Existing models typically require dataset-specific architectures tailored to input structure and lack semantic alignment across diverse emotion labels. To address these challenges, we propose EMOD: A Unified EEG Emotion Representation Framework Leveraging Valence-Arousal (V-A) Guided Contrastive Learning. EMOD learns transferable and emotion-aware representations from heterogeneous datasets by bridging both semantic and structural gaps. Specifically, we project discrete and continuous emotion labels into a unified V-A space and formulate a soft-weighted supervised contrastive loss that encourages emotionally similar samples to cluster in the latent space. To accommodate variable EEG formats, EMOD employs a flexible backbone comprising a Triple-Domain Encoder followed by a Spatial-Temporal Transformer, enabling robust extraction and integration of temporal, spectral, and spatial features. We pretrain EMOD on 8 public EEG datasets and evaluate its performance on three benchmark datasets. Experimental results show that EMOD achieves state-of-the-art performance, demonstrating strong adaptability and generalization across diverse EEG-based emotion recognition scenarios.

[509] MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains

Leyan Xue, Changqing Zhang, Kecheng Xue, Xiaohong Liu, Guangyu Wang, Zongbo Han

Main category: cs.LG

TL;DR: Created a large-scale multimodal evaluation benchmark with 30+ datasets, 15 modalities, and 20 tasks to address the lack of adequate evaluation standards in multimodal fusion research.

Details

Motivation: Current multimodal fusion methods are evaluated on limited datasets that don't represent real-world complexity, leading to biased evaluations and hindered generalization. The absence of unified evaluation standards makes fair comparisons difficult.

Method: Developed a large-scale, domain-adaptive benchmark integrating over 30 datasets across 15 modalities and 20 predictive tasks, along with an open-source unified evaluation pipeline with standardized implementations of state-of-the-art models and fusion paradigms.

Result: Successfully established new performance baselines across multiple tasks through large-scale experiments conducted using the developed platform.

Conclusion: This work provides a crucial platform for rigorous and reproducible assessment of multimodal models, aiming to advance the field of multimodal artificial intelligence.

Abstract: Although multimodal fusion has made significant progress, its advancement is severely hindered by the lack of adequate evaluation benchmarks. Current fusion methods are typically evaluated on a small selection of public datasets, a limited scope that inadequately represents the complexity and diversity of real-world scenarios, potentially leading to biased evaluations. This issue presents a twofold challenge. On one hand, models may overfit to the biases of specific datasets, hindering their generalization to broader practical applications. On the other hand, the absence of a unified evaluation standard makes fair and objective comparisons between different fusion methods difficult. Consequently, a truly universal and high-performance fusion model has yet to emerge. To address these challenges, we have developed a large-scale, domain-adaptive benchmark for multimodal evaluation. This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains. To complement this, we have also developed an open-source, unified, and automated evaluation pipeline that includes standardized implementations of state-of-the-art models and diverse fusion paradigms. Leveraging this platform, we have conducted large-scale experiments, successfully establishing new performance baselines across multiple tasks. This work provides the academic community with a crucial platform for rigorous and reproducible assessment of multimodal models, aiming to propel the field of multimodal artificial intelligence to new heights.

[510] A Closer Look at Knowledge Distillation in Spiking Neural Network Training

Xu Liu, Na Xia, Jinxing Zhou, Jingyuan Xu, Dan Guo

Main category: cs.LG

TL;DR: This paper proposes two novel knowledge distillation strategies (SAMD and NLD) to bridge the architectural gap between ANNs and SNNs for more effective training of SNNs.

Details

Motivation: Current knowledge distillation methods for SNNs use simple element-wise alignment but neglect the fundamental differences between ANN's continuous outputs and SNN's sparse, discrete outputs, leading to suboptimal training.

Method: Two strategies: 1) Saliency-scaled Activation Map Distillation (SAMD) aligns SNN spike activation maps with ANN class-aware activation maps for semantic consistency; 2) Noise-smoothed Logits Distillation (NLD) uses Gaussian noise to smooth SNN’s sparse logits for better alignment with ANN’s continuous logits.

Result: Extensive experiments on multiple datasets demonstrate the effectiveness of the proposed methods in improving SNN training through knowledge distillation.

Conclusion: The proposed SAMD and NLD strategies effectively address the architectural differences between ANNs and SNNs, enabling more effective knowledge transfer and improved SNN performance.

Abstract: Spiking Neural Networks (SNNs) have become popular due to their excellent energy efficiency, yet they face challenges in effective model training. Recent works address this by introducing knowledge distillation (KD) techniques, with pre-trained artificial neural networks (ANNs) used as teachers and the target SNNs as students. This is commonly accomplished through a straightforward element-wise alignment of intermediate features and prediction logits from ANNs and SNNs, often neglecting the intrinsic differences between their architectures. Specifically, an ANN’s outputs exhibit a continuous distribution, whereas an SNN’s outputs are characterized by sparsity and discreteness. To mitigate this issue, we introduce two innovative KD strategies. First, we propose Saliency-scaled Activation Map Distillation (SAMD), which aligns the spike activation map of the student SNN with the class-aware activation map of the teacher ANN. Rather than performing KD directly on the raw features of the ANN and SNN, SAMD directs the student to learn from saliency activation maps that exhibit greater semantic and distribution consistency. Additionally, we propose Noise-smoothed Logits Distillation (NLD), which utilizes Gaussian noise to smooth the sparse logits of the student SNN, facilitating alignment with the continuous logits of the teacher ANN. Extensive experiments on multiple datasets demonstrate the effectiveness of our methods. Code is available at https://github.com/SinoLeu/CKDSNN.git.
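
As a rough illustration of the noise-smoothing idea, the sketch below perturbs sparse student logits with Gaussian noise before comparing them to teacher logits. This is one plausible reading of the abstract, not the authors' NLD formulation: the noise scale `sigma`, the softmax temperature, the KL objective, and all names and values are assumptions.

```python
import math
import random
from typing import List, Optional


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def noise_smoothed_kd_loss(student_logits: List[float], teacher_logits: List[float],
                           sigma: float = 0.1, temperature: float = 1.0,
                           rng: Optional[random.Random] = None) -> float:
    """Add N(0, sigma^2) noise to the student's logits, then measure KL to the teacher."""
    rng = rng or random.Random(0)
    noisy = [z + rng.gauss(0.0, sigma) for z in student_logits]
    return kl_divergence(softmax(teacher_logits, temperature), softmax(noisy, temperature))


teacher = [2.0, 0.5, -1.0]   # continuous ANN logits (made up)
student = [1.0, 0.0, 0.0]    # sparse, spike-count-like SNN logits (made up)
loss = noise_smoothed_kd_loss(student, teacher, sigma=0.1, rng=random.Random(0))
print(loss)
```

The noise breaks ties between identical discrete logit values, so the student's distribution is no longer restricted to a small set of levels when it is matched against the teacher.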

[511] Multistep Quasimetric Learning for Scalable Goal-conditioned Reinforcement Learning

Bill Chunyuan Zheng, Vivek Myers, Benjamin Eysenbach, Sergey Levine

Main category: cs.LG

TL;DR: The paper presents a goal-conditioned reinforcement learning (GCRL) method that integrates temporal difference and Monte Carlo approaches to estimate temporal distances between observations, enabling effective long-horizon reasoning and real-world robotic manipulation.

Details

Motivation: The motivation is to address the challenge of reasoning over long horizons in AI, particularly in estimating temporal distances between observations, where traditional temporal difference methods underperform compared to Monte Carlo methods despite having optimality guarantees.

Method: The method integrates temporal difference and Monte Carlo approaches into a practical GCRL method that fits a quasimetric distance using a multistep Monte-Carlo return, enabling end-to-end learning from unlabeled offline datasets of visual observations.

Result: The method outperforms existing GCRL methods on long-horizon simulated tasks with up to 4000 steps, even with visual observations, and successfully enables stitching in real-world robotic manipulation (Bridge setup).

Conclusion: This approach represents the first end-to-end GCRL method that enables multistep stitching in real-world manipulation domains from unlabeled offline datasets of visual observations, demonstrating effective long-horizon reasoning capabilities.

Abstract: Learning how to reach goals in an environment is a longstanding challenge in AI, yet reasoning over long horizons remains a challenge for modern methods. The key question is how to estimate the temporal distance between pairs of observations. While temporal difference methods leverage local updates to provide optimality guarantees, they often perform worse than Monte Carlo methods that perform global updates (e.g., with multi-step returns), which lack such guarantees. We show how these approaches can be integrated into a practical GCRL method that fits a quasimetric distance using a multistep Monte-Carlo return. We show our method outperforms existing GCRL methods on long-horizon simulated tasks with up to 4000 steps, even with visual observations. We also demonstrate that our method can enable stitching in the real-world robotic manipulation domain (Bridge setup). Our approach is the first end-to-end GCRL method that enables multistep stitching in this real-world manipulation domain from an unlabeled offline dataset of visual observations.

[512] Dynamic Sparsity: Challenging Common Sparsity Assumptions for Learning World Models in Robotic Reinforcement Learning Benchmarks

Muthukumar Pandaram, Jakob Hollenstein, David Drexel, Samuele Tosatto, Antonio Rodríguez-Sánchez, Justus Piater

Main category: cs.LG

TL;DR: Analysis of ground-truth dynamics in robotic RL environments reveals that global sparsity assumptions are rare; instead, dynamics show local, state-dependent sparsity with temporal clustering patterns.

Details

Motivation: To critically examine assumptions about sparsity in learned dynamics models - specifically whether causal graphs are sparse and whether temporal sparsity holds in typical RL tasks.

Method: Analyzed ground-truth dynamics from MuJoCo Playground benchmark suite, examining: (i) causal graph sparsity, (ii) state-dependent sparsity, and (iii) sparse local dynamics changes.

Result: Global sparsity is rare; tasks show local, state-dependent sparsity with distinct structures - appearing in temporally localized clusters (e.g., during contact events) and affecting specific state dimension subsets.

Conclusion: Common sparsity prior assumptions in dynamics learning are challenged; grounded inductive biases are needed that reflect the state-dependent sparsity structure of real-world dynamics.

Abstract: The use of learned dynamics models, also known as world models, can improve the sample efficiency of reinforcement learning. Recent work suggests that the underlying causal graphs of such dynamics models are sparsely connected, with each of the future state variables depending only on a small subset of the current state variables, and that learning may therefore benefit from sparsity priors. Similarly, temporal sparsity, i.e. sparsely and abruptly changing local dynamics, has also been proposed as a useful inductive bias. In this work, we critically examine these assumptions by analyzing ground-truth dynamics from a set of robotic reinforcement learning environments in the MuJoCo Playground benchmark suite, aiming to determine whether the proposed notions of state and temporal sparsity actually tend to hold in typical reinforcement learning tasks. We study (i) whether the causal graphs of environment dynamics are sparse, (ii) whether such sparsity is state-dependent, and (iii) whether local system dynamics change sparsely. Our results indicate that global sparsity is rare, but instead the tasks show local, state-dependent sparsity in their dynamics and this sparsity exhibits distinct structures, appearing in temporally localized clusters (e.g., during contact events) and affecting specific subsets of state dimensions. These findings challenge common sparsity prior assumptions in dynamics learning, emphasizing the need for grounded inductive biases that reflect the state-dependent sparsity structure of real-world dynamics.
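
One way to picture state-dependent sparsity is to compare the Jacobian of a dynamics step at different states. The sketch below does this with central finite differences on a hypothetical toy system (a point mass with a floor). It is not the paper's analysis procedure, and the dynamics, tolerance, and names are illustrative assumptions.

```python
from typing import Callable, List

State = List[float]


def jacobian(step: Callable[[State], State], s: State, eps: float = 1e-6) -> List[List[float]]:
    """Central-difference Jacobian d step(s)[i] / d s[j]."""
    n_out = len(step(s))
    jac = [[0.0] * len(s) for _ in range(n_out)]
    for j in range(len(s)):
        plus, minus = list(s), list(s)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = step(plus), step(minus)
        for i in range(n_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac


def sparsity(jac: List[List[float]], tol: float = 1e-8) -> float:
    """Fraction of Jacobian entries whose magnitude is below tol."""
    entries = [v for row in jac for v in row]
    return sum(abs(v) < tol for v in entries) / len(entries)


def toy_step(s: State) -> State:
    """Hypothetical 1-D point mass (height, velocity, payload) with a floor at height 0."""
    h, v, m = s
    dt = 0.01
    if h > 0.0:
        # Free flight: the payload does not influence the motion.
        return [h + dt * v, v - dt * 9.81, m]
    # Contact: the rebound velocity depends on the payload.
    return [0.0, -v / (1.0 + m), m]


sp_free = sparsity(jacobian(toy_step, [1.0, 0.0, 2.0]))
sp_contact = sparsity(jacobian(toy_step, [-0.5, -3.0, 2.0]))
print(sp_free, sp_contact)
```

In this toy system, both the fraction of zero entries and which entries are zero change between free flight and contact, so a single global sparsity mask would not describe both regimes.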

[513] SafeMIL: Learning Offline Safe Imitation Policy from Non-Preferred Trajectories

Returaj Burnwal, Nirav Pravinbhai Bhatt, Balaraman Ravindran

Main category: cs.LG

TL;DR: SafeMIL is an offline safe imitation learning method that uses Multiple Instance Learning on non-preferred trajectories to learn a cost function that flags risky state-action pairs, yielding safer policies without degrading reward performance.

Details

Motivation: Online interactions can be risky in real-world settings, and specifying exact reward/cost functions is difficult. However, collecting trajectories showing undesirable behavior is often feasible, providing implicit safety information.

Method: Propose SafeMIL using Multiple Instance Learning to learn a parameterized cost function that predicts risky state-action pairs from non-preferred trajectories, then use this cost to avoid unsafe behaviors.

Result: Empirical results show SafeMIL learns safer policies that satisfy cost constraints without degrading reward performance, outperforming several baseline methods.

Conclusion: SafeMIL effectively leverages non-preferred trajectories for offline safe imitation learning, enabling safer policy learning without requiring explicit cost specification at each timestep.

Abstract: In this work, we study the problem of offline safe imitation learning (IL). In many real-world settings, online interactions can be risky, and accurately specifying the reward and the safety cost information at each timestep can be difficult. However, it is often feasible to collect trajectories reflecting undesirable or risky behavior, implicitly conveying the behavior the agent should avoid. We refer to these trajectories as non-preferred trajectories. Unlike standard IL, which aims to mimic demonstrations, our agent must also learn to avoid risky behavior using non-preferred trajectories. In this paper, we propose a novel approach, SafeMIL, to learn a parameterized cost that predicts if the state-action pair is risky via Multiple Instance Learning. The learned cost is then used to avoid non-preferred behaviors, resulting in a policy that prioritizes safety. We empirically demonstrate that our approach can learn a safer policy that satisfies cost constraints without degrading the reward performance, thereby outperforming several baselines.

[514] Towards Non-Stationary Time Series Forecasting with Temporal Stabilization and Frequency Differencing

Junkai Lu, Peng Chen, Chenjuan Guo, Yang Shu, Meng Wang, Bin Yang

Main category: cs.LG

TL;DR: DTAF is a dual-branch framework that addresses non-stationarity in time series forecasting by handling temporal distribution shifts and spectral variability through separate temporal and frequency domain modules.

Details

Motivation: Real-world time series often exhibit non-stationarity including temporal distribution shifts and spectral variability, which pose significant challenges for long-term forecasting accuracy.

Method: Uses a dual-branch framework with Temporal Stabilizing Fusion (TFS) module for temporal domain (non-stationary MOE filter to suppress temporal non-stationary patterns) and Frequency Wave Modeling (FWM) module for frequency domain (frequency differencing to highlight spectral shifts).

Result: Extensive experiments show DTAF outperforms state-of-the-art baselines with significant improvements in forecasting accuracy under non-stationary conditions.

Conclusion: DTAF effectively addresses non-stationarity in both temporal and frequency domains, generating robust forecasts that adapt to complex real-world time series patterns.

Abstract: Time series forecasting is critical for decision-making across dynamic domains such as energy, finance, transportation, and cloud computing. However, real-world time series often exhibit non-stationarity, including temporal distribution shifts and spectral variability, which pose significant challenges for long-term time series forecasting. In this paper, we propose DTAF, a dual-branch framework that addresses non-stationarity in both the temporal and frequency domains. For the temporal domain, the Temporal Stabilizing Fusion (TFS) module employs a non-stationary mix of experts (MOE) filter to disentangle and suppress temporal non-stationary patterns while preserving long-term dependencies. For the frequency domain, the Frequency Wave Modeling (FWM) module applies frequency differencing to dynamically highlight components with significant spectral shifts. By fusing the complementary outputs of TFS and FWM, DTAF generates robust forecasts that adapt to both temporal and frequency domain non-stationarity. Extensive experiments on real-world benchmarks demonstrate that DTAF outperforms state-of-the-art baselines, yielding significant improvements in forecasting accuracy under non-stationary conditions. All code is available at https://github.com/PandaJunk/DTAF.

[515] LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

Randall Balestriero, Yann LeCun

Main category: cs.LG

TL;DR: LeJEPA is a theoretically grounded Joint-Embedding Predictive Architecture that uses Sketched Isotropic Gaussian Regularization (SIGReg) to constrain embeddings to an optimal isotropic Gaussian distribution, achieving stable and scalable self-supervised learning across diverse architectures and domains.

Details

Motivation: To address the lack of practical guidance and theory in Joint-Embedding Predictive Architectures (JEPAs), which has led to ad-hoc R&D approaches in learning manipulable world representations.

Method: Combines JEPA predictive loss with SIGReg - a novel objective that constrains embeddings to follow an optimal isotropic Gaussian distribution, eliminating the need for heuristics like stop-gradient, teacher-student setups, or hyperparameter schedulers.

Result: Achieves 79% accuracy on ImageNet-1k with ViT-H/14 using linear evaluation with frozen backbone, demonstrates stability across 60+ architectures and 10+ datasets, and offers linear time/memory complexity with only one trade-off hyperparameter.

Conclusion: LeJEPA provides a simple, theory-friendly ecosystem that reestablishes self-supervised pre-training as a core pillar of AI research, with distributed training-friendly implementation requiring only ~50 lines of code.

Abstract: Learning manipulable representations of the world and its dynamics is central to AI. Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but lack of practical guidance and theory has led to ad-hoc R&D. We present a comprehensive theory of JEPAs and instantiate it in LeJEPA, a lean, scalable, and theoretically grounded training objective. First, we identify the isotropic Gaussian as the optimal distribution that JEPAs’ embeddings should follow to minimize downstream prediction risk. Second, we introduce a novel objective, Sketched Isotropic Gaussian Regularization (SIGReg), to constrain embeddings to reach that ideal distribution. Combining the JEPA predictive loss with SIGReg yields LeJEPA with numerous theoretical and practical benefits: (i) single trade-off hyperparameter, (ii) linear time and memory complexity, (iii) stability across hyper-parameters, architectures (ResNets, ViTs, ConvNets) and domains, (iv) heuristics-free, e.g., no stop-gradient, no teacher-student, no hyper-parameter schedulers, and (v) distributed training-friendly implementation requiring only ≈50 lines of code. Our empirical validation covers 10+ datasets, 60+ architectures, all with varying scales and domains. As an example, using ImageNet-1k for pretraining and linear evaluation with frozen backbone, LeJEPA reaches 79% with a ViT-H/14. We hope that the simplicity and theory-friendly ecosystem offered by LeJEPA will reestablish self-supervised pre-training as a core pillar of AI research (GitHub repo: https://github.com/rbalestr-lab/lejepa).

[516] FAST-CAD: A Fairness-Aware Framework for Non-Contact Stroke Diagnosis

Tianming Sha, Zechuan Chen, Zhan Cheng, Haotian Zhai, Xuwei Ding, Keze Wang

Main category: cs.LG

TL;DR: FAST-CAD is a fair stroke diagnosis framework combining domain-adversarial training and group distributionally robust optimization to address demographic fairness issues in automated medical diagnosis.

Details

Motivation: Existing automated stroke diagnosis methods suffer from fairness issues across demographic groups, potentially exacerbating healthcare disparities and limiting equitable access to timely diagnosis.

Method: Combines domain-adversarial training (DAT) with group distributionally robust optimization (Group-DRO) to learn demographic-invariant representations and optimize worst-group risk across 12 demographic subgroups defined by age, gender, and posture.

Result: Achieves superior diagnostic performance while maintaining fairness across all demographic groups, with extensive experiments validating the effectiveness of the unified DAT + Group-DRO framework.

Conclusion: Provides both practical advances and theoretical insights for fair medical AI systems, with convergence guarantees and fairness bounds supporting the effectiveness of the proposed framework.

Abstract: Stroke is an acute cerebrovascular disease, and timely diagnosis significantly improves patient survival. However, existing automated diagnosis methods suffer from fairness issues across demographic groups, potentially exacerbating healthcare disparities. In this work we propose FAST-CAD, a theoretically grounded framework that combines domain-adversarial training (DAT) with group distributionally robust optimization (Group-DRO) for fair and accurate non-contact stroke diagnosis. Our approach is built on domain adaptation and minimax fairness theory and provides convergence guarantees and fairness bounds. We curate a multimodal dataset covering 12 demographic subgroups defined by age, gender, and posture. FAST-CAD employs self-supervised encoders with adversarial domain discrimination to learn demographic-invariant representations, while Group-DRO optimizes worst-group risk to ensure robust performance across all subgroups. Extensive experiments show that our method achieves superior diagnostic performance while maintaining fairness across demographic groups, and our theoretical analysis supports the effectiveness of the unified DAT + Group-DRO framework. This work provides both practical advances and theoretical insights for fair medical AI systems.
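The worst-group objective mentioned in the FAST-CAD summary can be illustrated with the standard exponentiated-gradient form of Group-DRO. The sketch below is not taken from the paper: the function name `group_dro_step`, the step size `eta`, and the 12 loss values are hypothetical, and the domain-adversarial part of the framework is omitted entirely.

```python
import numpy as np

def group_dro_step(group_losses, q, eta=0.1):
    """One exponentiated-gradient update of the Group-DRO group weights.

    group_losses: average loss per demographic subgroup on the current batch.
    q: current weights over subgroups (non-negative, summing to 1).
    """
    group_losses = np.asarray(group_losses, dtype=float)
    q = q * np.exp(eta * group_losses)      # up-weight subgroups with high loss
    q = q / q.sum()                         # renormalize onto the simplex
    robust_loss = float(q @ group_losses)   # weighted loss a model update would minimize
    return q, robust_loss

# Hypothetical losses for 12 subgroups (made-up values, held fixed for illustration).
losses = np.array([0.2, 0.3, 0.25, 0.9, 0.4, 0.35, 0.3, 0.2, 0.5, 0.45, 0.3, 0.25])
q = np.full(12, 1 / 12)
for _ in range(50):
    q, robust = group_dro_step(losses, q, eta=0.5)
```

With the losses held fixed, the weights concentrate on the worst subgroup (index 3 here), so the weighted loss approaches the worst-group loss. In real training the losses change after every model update, and the weights track whichever subgroup is currently hardest.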

[517] Data reuse enables cost-efficient randomized trials of medical AI models

Michael Nercessian, Wenxin Zhang, Alexander Schubert, Daphne Yang, Maggie Chung, Ahmed Alaa, Adam Yala

Main category: cs.LG

TL;DR: BRIDGE is a data-reuse RCT design for AI risk models that recycles participant data from completed trials when legacy and updated models make concordant predictions, reducing enrollment requirements for subsequent trials.

Details

Motivation: Traditional RCTs for AI tools are costly and slow, hindering timely validation as new AI models emerge rapidly, creating a need for more efficient trial designs.

Method: BRIDGE trials reuse participant-level data from completed trials when AI models make concordant predictions, with a practical checklist to ensure valid causal inference and preserve type I error.

Result: Real-world datasets showed up to 64.8% overlap in high-risk cohorts between successive AI models. Simulations of breast cancer screening studies reduced required enrollment by 46.6% (saving over $2.8M) while maintaining 80% power.

Conclusion: BRIDGE transforms trials into adaptive, modular studies, making Level I evidence generation feasible for every model iteration and accelerating cost-effective translation of AI into routine care.

Abstract: Randomized controlled trials (RCTs) are indispensable for establishing the clinical value of medical artificial-intelligence (AI) tools, yet their high cost and long timelines hinder timely validation as new models emerge rapidly. Here, we propose BRIDGE, a data-reuse RCT design for AI-based risk models. AI risk models support a broad range of interventions, including screening, treatment selection, and clinical alerts. BRIDGE trials recycle participant-level data from completed trials of AI models when legacy and updated models make concordant predictions, thereby reducing the enrollment requirement for subsequent trials. We provide a practical checklist for investigators to assess whether reusing data from previous trials allows for valid causal inference and preserves type I error. Using real-world datasets across breast cancer, cardiovascular disease, and sepsis, we demonstrate concordance between successive AI models, with up to 64.8% overlap in top 5% high-risk cohorts. We then simulate a series of breast cancer screening studies, where our design reduced required enrollment by 46.6%, saving over US$2.8 million, while maintaining 80% power. By transforming trials into adaptive, modular studies, our proposed design makes Level I evidence generation feasible for every model iteration, thereby accelerating cost-effective translation of AI into routine care.
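The "overlap in top 5% high-risk cohorts" statistic can be made concrete with a small sketch. It assumes overlap means the share of the legacy model's top-scoring participants who are also top-scoring under the updated model; the paper's exact definition may differ. The function name `top_fraction_overlap` and the toy scores are hypothetical.

```python
import numpy as np

def top_fraction_overlap(scores_old, scores_new, fraction=0.05):
    """Share of the legacy model's top-`fraction` cohort that the updated model also flags."""
    scores_old = np.asarray(scores_old, dtype=float)
    scores_new = np.asarray(scores_new, dtype=float)
    k = max(1, int(round(fraction * len(scores_old))))    # cohort size
    top_old = set(np.argsort(-scores_old)[:k].tolist())   # highest-risk indices, legacy model
    top_new = set(np.argsort(-scores_new)[:k].tolist())   # highest-risk indices, updated model
    return len(top_old & top_new) / k

# Toy example: 20 participants, top 10% cohort (2 people per model).
old = np.arange(20) / 20    # legacy scores; participants 19 and 18 rank highest
new = old.copy()
new[0] = 2.0                # updated model now ranks participant 0 highest
overlap = top_fraction_overlap(old, new, fraction=0.10)   # → 0.5
```

Here participant 19 is in both cohorts and participant 18 is displaced by participant 0, so half of the legacy cohort carries over.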

[518] Potent but Stealthy: Rethink Profile Pollution against Sequential Recommendation via Bi-level Constrained Reinforcement Paradigm

Jiajie Su, Zihan Nan, Yunshan Ma, Xiaobo Xia, Xiaohua Feng, Weiming Liu, Xiang Chen, Xiaolin Zheng, Chaochao Chen

Main category: cs.LG

TL;DR: CREAT is a constrained reinforcement learning attack method that subtly pollutes user profiles in sequential recommenders to cause targeted mispredictions while maintaining stealthiness.

Details

Motivation: Existing profile pollution attacks on sequential recommenders suffer from over-reliance on sequence horizon impact and cause detectable distribution shifts, limiting their practicality.

Method: A bi-level optimization framework with multi-reward reinforcement learning using Pattern Balanced Rewarding Policy (pattern inversion + distribution consistency rewards) and Constrained Group Relative Reinforcement Learning with dynamic barrier constraints and group-shared experience replay.

Result: Extensive experiments demonstrate the effectiveness of CREAT in achieving targeted pollution with minimal detectability.

Conclusion: CREAT successfully addresses limitations of previous PPA methods by balancing adversarial efficacy and stealthiness through constrained reinforcement learning.

Abstract: Sequential recommenders, which exploit dynamic user intents through interaction sequences, are vulnerable to adversarial attacks. While existing attacks primarily rely on data poisoning, they require large-scale user access or fake profiles and thus lack practicality. In this paper, we focus on the Profile Pollution Attack (PPA), which subtly contaminates partial user interactions to induce targeted mispredictions. Previous PPA methods suffer from two limitations: (i) over-reliance on sequence horizon impact restricts fine-grained perturbations on item transitions, and (ii) holistic modifications cause detectable distribution shifts. To address these challenges, we propose CREAT, a constrained reinforcement-driven attack that combines a bi-level optimization framework with multi-reward reinforcement learning to balance adversarial efficacy and stealthiness. We first develop a Pattern Balanced Rewarding Policy, which integrates pattern inversion rewards to invert critical patterns and distribution consistency rewards to minimize detectable shifts via unbalanced co-optimal transport. Then we employ a Constrained Group Relative Reinforcement Learning paradigm, enabling step-wise perturbations through dynamic barrier constraints and group-shared experience replay, achieving targeted pollution with minimal detectability. Extensive experiments demonstrate the effectiveness of CREAT.

[519] Human-Corrected Labels Learning: Enhancing Labels Quality via Human Correction of VLMs Discrepancies

Zhongnian Li, Lan Chen, Yixin Xu, Shi Xu, Xinzheng Xu

Main category: cs.LG

TL;DR: Proposes Human-Corrected Labels (HCL) to efficiently improve VLM-generated noisy labels by strategically applying human correction only to instances with VLM discrepancies, achieving better quality annotations with reduced labor costs.

Details

Motivation: Vision-Language Models (VLMs) generate labels with dual limitations: low quality (label noise) and absence of error correction mechanisms, requiring a solution to enhance label quality efficiently.

Method: HCL strategically deploys human correction only for instances with VLM discrepancies. Uses a risk-consistent estimator incorporating both human-corrected labels and VLM predictions, and a conditional probability method to estimate label distribution using VLM outputs and model predictions.

Result: Extensive experiments demonstrate superior classification performance and robustness to label noise, validating HCL’s effectiveness in practical weak supervision scenarios.

Conclusion: HCL provides an effective approach for improving VLM-generated labels through strategic human correction, achieving both higher-quality annotations and reduced labor costs in weak supervision settings.

Abstract: Vision-Language Models (VLMs), with their powerful content generation capabilities, have been successfully applied to data annotation processes. However, VLM-generated labels exhibit dual limitations: low quality (i.e., label noise) and absence of error correction mechanisms. To enhance label quality, we propose Human-Corrected Labels (HCLs), a novel setting that enables efficient human correction of VLM-generated noisy labels. As shown in Figure 1(b), HCL strategically deploys human correction only for instances with VLM discrepancies, achieving both higher-quality annotations and reduced labor costs. Specifically, we theoretically derive a risk-consistent estimator that incorporates both human-corrected labels and VLM predictions to train classifiers. We further propose a conditional probability method to estimate the label distribution using a combination of VLM outputs and model predictions. Extensive experiments demonstrate that our approach achieves superior classification performance and is robust to label noise, validating the effectiveness of HCL in practical weak supervision scenarios. Code: https://github.com/Lilianach24/HCL.git

[520] Harnessing Bounded-Support Evolution Strategies for Policy Refinement

Ethan Hirschowitz, Fabio Ramos

Main category: cs.LG

TL;DR: TD-ES uses bounded triangular noise and centered-rank finite-difference estimator for stable, parallelizable policy refinement, improving robotic manipulation success rates by 26.5% over PPO.

Details

Motivation: Improving competent robot policies with on-policy RL is hampered by noisy, low-signal gradients, prompting the revisit of Evolution Strategies as a policy-gradient proxy.

Method: Propose Triangular-Distribution ES (TD-ES) with bounded triangular noise and centered-rank finite-difference estimator, used in a two-stage pipeline: PPO pretraining followed by TD-ES refinement.

Result: Across robotic manipulation tasks, TD-ES raises success rates by 26.5% relative to PPO and greatly reduces variance.

Conclusion: TD-ES offers a simple, compute-light path to reliable policy refinement while preserving early sample efficiency and enabling robust late-stage gains.

Abstract: Improving competent robot policies with on-policy RL is often hampered by noisy, low-signal gradients. We revisit Evolution Strategies (ES) as a policy-gradient proxy and localize exploration with bounded, antithetic triangular perturbations, suitable for policy refinement. We propose Triangular-Distribution ES (TD-ES) which pairs bounded triangular noise with a centered-rank finite-difference estimator to deliver stable, parallelizable, gradient-free updates. In a two-stage pipeline - PPO pretraining followed by TD-ES refinement - this preserves early sample efficiency while enabling robust late-stage gains. Across a suite of robotic manipulation tasks, TD-ES raises success rates by 26.5% relative to PPO and greatly reduces variance, offering a simple, compute-light path to reliable refinement.
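The TD-ES update described above (bounded antithetic triangular perturbations combined with a centered-rank finite-difference estimator) can be sketched in a few lines. This is an illustrative toy and not the authors' implementation: the normalization constant, the population size `pairs`, the noise half-width `width`, the learning rate, and the quadratic stand-in for a task reward are all assumptions.

```python
import numpy as np

def centered_ranks(values):
    """Replace raw rewards by their ranks, rescaled to the interval [-0.5, 0.5]."""
    ranks = np.empty(len(values), dtype=float)
    ranks[np.argsort(values)] = np.arange(len(values))
    return ranks / (len(values) - 1) - 0.5

def td_es_step(theta, reward_fn, rng, pairs=32, width=0.1, lr=0.5):
    """One gradient-free update using bounded, antithetic triangular noise."""
    # Perturbations are bounded in [-width, width] and peak at 0.
    eps = rng.triangular(-width, 0.0, width, size=(pairs, theta.size))
    # Antithetic evaluation: each perturbation is tried with both signs.
    rewards = np.array([reward_fn(theta + e) for e in eps] +
                       [reward_fn(theta - e) for e in eps])
    u = centered_ranks(rewards)
    # Finite-difference estimate from rank differences (assumed normalization).
    grad = ((u[:pairs] - u[pairs:])[:, None] * eps).sum(axis=0) / (2 * pairs * width)
    return theta + lr * grad

# Toy stand-in for a policy-refinement reward: negative squared distance to a target.
target = np.ones(3)
reward = lambda th: -float(np.sum((th - target) ** 2))
rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(100):
    theta = td_es_step(theta, reward, rng)
final_distance = float(np.linalg.norm(theta - target))
```

Because the noise has bounded support, no candidate moves further than `width` per coordinate from the current parameters. That bound is what makes this style of search suited to refining an already competent policy rather than exploring from scratch.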

[521] Unitho: A Unified Multi-Task Framework for Computational Lithography

Qian Jin, Yumeng Liu, Yuqi Jiang, Qi Sun, Cheng Zhuo

Main category: cs.LG

TL;DR: Unitho is a unified multi-task large vision model for computational lithography that handles mask generation, lithography simulation, and rule violation detection using Transformer architecture trained on large-scale industrial data.

Details

Motivation: Current computational lithography tasks are handled in isolation with scarce datasets and limited modeling approaches, hindering the development of reliable data foundations for large-scale models.

Method: Built on Transformer architecture and trained on a large-scale industrial lithography simulation dataset with hundreds of thousands of cases, supporting end-to-end mask generation, lithography simulation, and rule violation detection.

Result: Experimental results show Unitho substantially surpasses academic baselines in effectiveness and generalizability, enabling agile and high-fidelity lithography simulation.

Conclusion: Unitho facilitates the construction of robust data foundations for intelligent EDA by providing a unified solution for multiple computational lithography tasks.

Abstract: Reliable, generalizable data foundations are critical for enabling large-scale models in computational lithography. However, essential tasks (mask generation, rule violation detection, and layout optimization) are often handled in isolation, hindered by scarce datasets and limited modeling approaches. To address these challenges, we introduce Unitho, a unified multi-task large vision model built upon the Transformer architecture. Trained on a large-scale industrial lithography simulation dataset with hundreds of thousands of cases, Unitho supports end-to-end mask generation, lithography simulation, and rule violation detection. By enabling agile and high-fidelity lithography simulation, Unitho further facilitates the construction of robust data foundations for intelligent EDA. Experimental results validate its effectiveness and generalizability, with performance substantially surpassing academic baselines.

[522] Pretrained Joint Predictions for Scalable Batch Bayesian Optimization of Molecular Designs

Miles Wang-Henderson, Benjamin Kaufman, Edward Williams, Ryan Pederson, Matteo Rossi, Owen Howell, Carl Underkoffler, Narbe Mardirossian, John Parkhill

Main category: cs.LG

TL;DR: The paper presents a method using Epistemic Neural Networks to create scalable probabilistic surrogates for binding affinity prediction, enabling more efficient Batch Bayesian Optimization for drug discovery that needs up to 5x fewer iterations on a semi-synthetic benchmark and up to 10x fewer on a real-world library.

Details

Motivation: Batched synthesis and testing is the key bottleneck in drug development, and there's great interest in using biomolecular foundation models as surrogates to accelerate this process.

Method: Uses Epistemic Neural Networks (ENNs) to obtain scalable joint predictive distributions of binding affinity, leveraging representations from large structure-informed models and pretraining prior networks on synthetic data.

Result: Demonstrated utility by rediscovering known potent EGFR inhibitors in up to 5x fewer iterations on semi-synthetic benchmarks, and potent inhibitors from real-world small-molecule libraries in up to 10x fewer iterations.

Conclusion: Offers a promising solution for large-scale drug discovery applications by enabling more efficient molecular design optimization through improved probabilistic modeling and batch optimization.

Abstract: Batched synthesis and testing of molecular designs is the key bottleneck of drug development. There has been great interest in leveraging biomolecular foundation models as surrogates to accelerate this process. In this work, we show how to obtain scalable probabilistic surrogates of binding affinity for use in Batch Bayesian Optimization (Batch BO). This demands parallel acquisition functions that hedge between designs and the ability to rapidly sample from a joint predictive density to approximate them. Through the framework of Epistemic Neural Networks (ENNs), we obtain scalable joint predictive distributions of binding affinity on top of representations taken from large structure-informed models. Key to this work is an investigation into the importance of prior networks in ENNs and how to pretrain them on synthetic data to improve downstream performance in Batch BO. Their utility is demonstrated by rediscovering known potent EGFR inhibitors on a semi-synthetic benchmark in up to 5x fewer iterations, as well as potent inhibitors from a real-world small-molecule library in up to 10x fewer iterations, offering a promising solution for large-scale drug discovery applications.

cs.MA

[523] Towards Assume-Guarantee Verification of Abilities in Stochastic Multi-Agent Systems

Wojciech Jamroga, Damian Kurpiewski, Łukasz Mikulski

Main category: cs.MA

TL;DR: Proposes assume-guarantee verification schemes for probabilistic alternating-time temporal logic with imperfect information to handle complex model checking problems.

Details

Motivation: Model checking strategic abilities with imperfect information in stochastic environments is notoriously hard; assume-guarantee reasoning can decompose complex problems into easier subproblems.

Method: Develops several assume-guarantee verification schemes for probabilistic alternating-time temporal logic with imperfect information, proves their soundness, and discusses completeness. Also introduces a new variant of non-probabilistic alternating-time logic with “achieving at most φ” strategic modalities.

Result: Soundness of the proposed verification schemes is proven, and completeness is discussed. A new logic variant with “achieving at most” modalities is introduced.

Conclusion: Assume-guarantee reasoning provides an effective approach to decompose and verify complex strategic ability problems in probabilistic settings with imperfect information.

Abstract: Model checking of strategic abilities is a notoriously hard problem, even more so in the realistic case of agents with imperfect information, acting in a stochastic environment. Assume-guarantee reasoning can be of great help here, providing a way to decompose the complex problem into a small set of easier subproblems. In this paper, we propose several schemes for assume-guarantee verification of probabilistic alternating-time temporal logic with imperfect information. We prove the soundness of the schemes, and discuss their completeness. On the way, we also propose a new variant of (non-probabilistic) alternating-time logic, where the strategic modalities capture “achieving at most φ,” analogous to Levesque’s logic of “only knowing.”

[524] Who Gets the Reward, Who Gets the Blame? Evaluation-Aligned Training Signals for Multi-LLM Agents

Chih-Hsuan Yang, Tanwi Mallick, Le Chen, Krishnan Raghavan, Azton Wells, Amal Gueroudji, Ian T. Foster, Rajeev Thakur

Main category: cs.MA

TL;DR: A theoretical framework that transforms system-level evaluation into agent-level credit and message-level rewards using cooperative game theory and process reward modeling for LLM multi-agent training.

Details

Motivation: Current LLM multi-agent training lacks principled connections between system evaluation and agent/message learning, needing unified methods to translate global outcomes into local supervision.

Method: Combines Shapley-based credit assignment for success cases with first-error localization for failures, producing bounded, cooperative signals compatible with reinforcement or preference learning.

Result: Produces local, signed, credit-conserving signals that fairly allocate outcomes, promote cooperation, discourage redundancy/sabotage, and enable repair-aware preferences for corrective actions.

Conclusion: Provides theoretical foundation for unified, auditable pathway from global evaluation to local supervision in LLM multi-agent systems, with conceptual contribution awaiting empirical validation.

Abstract: Large Language Models (LLMs) in multi-agent systems (MAS) have shown promise for complex tasks, yet current training methods lack principled ways to connect system-level evaluation with agent-level and message-level learning. We propose a theoretical framework that unifies cooperative game-theoretic attribution with process reward modeling to transform system evaluation into agent credit and then into response-level signals. Unlike prior approaches that rely only on attribution (e.g., Shapley) or step-level labels (e.g., PRM), our method produces local, signed, and credit-conserving signals. In success cases, Shapley-based credit assignment fairly allocates outcomes across agents and is refined into per-message rewards that promote cooperation while discouraging redundancy or sabotage. In failure cases, first-error localization yields repair-aware preferences that penalize harmful steps while rewarding corrective attempts. The resulting signals are bounded, cooperative, and directly compatible with reinforcement-based or preference-based post-training, providing a unified and auditable pathway from global evaluation to local supervision in LLM multi-agent training. Our contribution is conceptual: we present a theoretical foundation and training signals, leaving empirical validation for future work.
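The Shapley-based credit assignment for success cases can be illustrated with an exact computation over a three-agent team. The agent names and the scoring function `team_score` are hypothetical stand-ins for a system-level evaluator; the paper's refinement into per-message rewards and its first-error localization for failures are not shown.

```python
from itertools import permutations

def shapley_values(agents, value):
    """Exact Shapley values: average marginal contribution over all join orders."""
    credit = {a: 0.0 for a in agents}
    orderings = list(permutations(agents))
    for order in orderings:
        coalition = frozenset()
        for agent in order:
            with_agent = coalition | {agent}
            credit[agent] += value(with_agent) - value(coalition)
            coalition = with_agent
    return {a: c / len(orderings) for a, c in credit.items()}

def team_score(coalition):
    """Hypothetical system-level evaluation of a subset of agents."""
    score = 0.0
    if "planner" in coalition:
        score += 0.4    # a plan alone has some value
    if {"planner", "coder"} <= coalition:
        score += 0.5    # code is only useful when it follows a plan
    if {"coder", "critic"} <= coalition:
        score += 0.1    # review only helps when there is code to review
    return score

agents = ["planner", "coder", "critic"]
phi = shapley_values(agents, team_score)
# phi ≈ {"planner": 0.65, "coder": 0.30, "critic": 0.05}
```

The credits sum to the full team's score (1.0), which is the credit-conserving property the abstract refers to. Exact enumeration costs n! orderings, so larger teams would need sampling-based approximations.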

[525] Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting

Nirmit Arora, Sathvik Joel, Ishan Kavathekar, Palak, Rohan Gandhi, Yash Pandya, Tanuja Ganu, Aditya Kanade, Akshay Nambi

Main category: cs.MA

TL;DR: SafeAgents is a unified framework for assessing security vulnerabilities in multi-agent systems, revealing how design choices affect susceptibility to adversarial attacks.

Details

Motivation: Existing security research focuses on single-agent systems, leaving a critical gap in understanding vulnerabilities introduced by multi-agent design patterns and the lack of unified assessment frameworks.

Method: Developed SafeAgents framework with Dharma diagnostic measure to systematically evaluate how plan construction, inter-agent context sharing, and fallback behaviors affect security across five multi-agent architectures on four datasets.

Result: Common multi-agent design patterns carry significant vulnerabilities; centralized systems delegating atomic instructions to sub-agents obscure harmful objectives and reduce robustness.

Conclusion: Multi-agent systems require security-aware design, as current architectures exhibit critical vulnerabilities that need systematic assessment and mitigation.

Abstract: LLM-based agents are increasingly deployed in multi-agent systems (MAS). As these systems move toward real-world applications, their security becomes paramount. Existing research largely evaluates single-agent security, leaving a critical gap in understanding the vulnerabilities introduced by multi-agent design. Existing systems also fall short because they lack unified frameworks and metrics that focus on the unique rejection modes in MAS. We present SafeAgents, a unified and extensible framework for fine-grained security assessment of MAS. SafeAgents systematically exposes how design choices such as plan construction strategies, inter-agent context sharing, and fallback behaviors affect susceptibility to adversarial prompting. We introduce Dharma, a diagnostic measure that helps identify weak links within multi-agent pipelines. Using SafeAgents, we conduct a comprehensive study across five widely adopted multi-agent architectures (centralized, decentralized, and hybrid variants) on four datasets spanning web tasks, tool use, and code generation. Our findings reveal that common design patterns carry significant vulnerabilities. For example, centralized systems that delegate only atomic instructions to sub-agents obscure harmful objectives, reducing robustness. Our results highlight the need for security-aware design in MAS. Code: https://github.com/microsoft/SafeAgents

[526] GraphMASAL: A Graph-based Multi-Agent System for Adaptive Learning

Biqing Zeng, Mengquan Liu, Zongwei Zhen

Main category: cs.MA

TL;DR: GraphMASAL is a graph-based multi-agent system that uses dynamic knowledge graphs and optimized planning to generate personalized learning paths, outperforming LLM prompting baselines in planning and cognitive diagnosis under automated evaluation.

Details

Motivation: Existing Intelligent Tutoring Systems lack structural reasoning and dynamic knowledge modeling needed for true personalization that adapts to learners' complex knowledge states and diverse goals.

Method: Integrates dynamic knowledge graph for learner modeling, LangGraph-orchestrated agent trio (Diagnostician, Planner, Tutor), two-stage neural IR with calibrated score fusion, and MSMS planning engine with greedy set cover approximation.

Result: Consistently outperforms LLM prompting and structured ablations in planning, achieving better learning path alignment, higher weak concept coverage, and lower learning cost, while also surpassing baselines in cognitive diagnosis.

Conclusion: Grounding LLM agents in dynamic knowledge graphs with educational constraint optimization yields reliable, interpretable, and pedagogically plausible learning plans, advancing personalized goal-oriented education.

Abstract: The advent of Intelligent Tutoring Systems (ITSs) has marked a paradigm shift in education, enabling highly personalized learning pathways. However, true personalization requires adapting to learners’ complex knowledge states (multi-source) and diverse goals (multi-sink); existing ITSs often lack the necessary structural-reasoning capability and knowledge dynamism to generate genuinely effective learning paths, and they lack scientifically rigorous validation paradigms. In this paper we propose GraphMASAL (A Graph-based Multi-Agent System for Adaptive Learning), which integrates (i) a dynamic knowledge graph for persistent, stateful learner modeling; (ii) a LangGraph-orchestrated trio of agents (Diagnostician, Planner, Tutor); (iii) a knowledge-graph-grounded two-stage neural IR component (dual-encoder dense retrieval with cross-encoder listwise re-ranking and calibrated score fusion); and (iv) a multi-source multi-sink (MSMS) planning engine with a cognitively grounded cost and an approximation guarantee via greedy set cover. Under blinded automated evaluations with matched inputs and inference settings across diverse student profiles, GraphMASAL consistently outperforms LLM prompting and structured ablations in planning, achieving stronger structural/sequence alignment of learning paths, higher coverage of weak concepts, and lower learning cost, while also surpassing prompt-based baselines in cognitive diagnosis. Agreement with expert/LLM-proxy ratings further supports the validity of our evaluation protocol. These findings indicate that grounding LLM agents in a dynamic knowledge graph, coupled with optimization under educational constraints, yields reliable, interpretable, and pedagogically plausible learning plans, advancing personalized and goal-oriented education.
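The greedy set cover approximation mentioned for the MSMS planning engine can be sketched as follows. This shows only the generic weighted greedy rule (pick the unit with the most newly covered weak concepts per unit cost). The learning units, their costs, and the concept names are hypothetical, and the paper's cognitively grounded cost and graph structure are not modeled.

```python
def greedy_set_cover(targets, candidates):
    """Weighted greedy set cover.

    targets: set of weak concepts that must be covered.
    candidates: dict mapping unit name -> (cost, set of concepts the unit covers).
    Returns the chosen units in selection order and any concepts left uncovered.
    """
    uncovered = set(targets)
    chosen = []
    while uncovered:
        # Pick the unit with the best "newly covered concepts per unit cost" ratio.
        best = max(candidates,
                   key=lambda name: len(candidates[name][1] & uncovered) / candidates[name][0])
        gain = candidates[best][1] & uncovered
        if not gain:
            break    # remaining concepts are not covered by any unit
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered

# Hypothetical learner with four weak concepts and four candidate learning units.
weak = {"fractions", "ratios", "percent", "slope"}
units = {
    "U1": (1.0, {"fractions", "ratios"}),
    "U2": (1.0, {"percent"}),
    "U3": (3.0, {"fractions", "ratios", "percent"}),
    "U4": (2.0, {"slope"}),
}
plan, leftover = greedy_set_cover(weak, units)
# plan == ["U1", "U2", "U4"], leftover == set()
```

The expensive unit U3 is never selected because cheaper units cover the same concepts at a better ratio. The greedy rule carries the classic logarithmic approximation guarantee for weighted set cover.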

cs.MM

[527] AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization

Zhonghua Jiang, Kui Chen, Kunxi Li, Keting Yin, Yiyun Zhou, Zhaode Wang, Chengfei Lv, Shengyu Zhang

Main category: cs.MM

TL;DR: AccKV is an adaptive KV cache optimization framework for Audio-Video LLMs that uses layer-adaptive focusing and cross-calibration to improve computational efficiency while maintaining accuracy.

Details

Motivation: Current AV-LLMs face challenges with large KV caches from temporal audio-video data, leading to performance degradation when modalities are processed indiscriminately or compressed excessively.

Method: Uses layer adaptive focusing to selectively focus on key modalities per layer, attention redistribution for heavy hitter tokens, and cross-calibration to integrate and align modalities before selective eviction.

Result: Significantly improves computational efficiency of AV-LLMs while maintaining accuracy.

Conclusion: AccKV effectively addresses KV cache optimization challenges in AV-LLMs through adaptive modality focusing and cross-modal alignment.

Abstract: Recent advancements in Audio-Video Large Language Models (AV-LLMs) have enhanced their capabilities in tasks like audio-visual question answering and multimodal dialog systems. Video and audio introduce an extended temporal dimension, resulting in a larger key-value (KV) cache compared to static image embedding. A naive optimization strategy is to selectively focus on and retain KV caches of audio or video based on the task. However, in our experiments, we observed that the attention of AV-LLMs to the various modalities in the high layers is not strictly dependent on the task: in higher layers, attention shifts more towards the video modality. We also found that directly integrating the temporal KV of audio and the spatial-temporal KV of video may lead to information confusion and significant performance degradation. If audio and video are processed indiscriminately, one modality may be excessively compressed or retained, disrupting the alignment between modalities. To address these challenges, we propose AccKV, an Adaptive-Focusing and Cross-Calibration KV cache optimization framework designed specifically for efficient AV-LLM inference. Our method is based on layer-adaptive focusing, which selectively focuses on key modalities according to the characteristics of different layers, and enhances the recognition of heavy hitter tokens through attention redistribution. In addition, we propose a Cross-Calibration technique that first integrates inefficient KV caches within the audio and video modalities, and then aligns low-priority modalities with high-priority modalities to selectively evict the KV cache of low-priority modalities. The experimental results show that AccKV can significantly improve the computational efficiency of AV-LLMs while maintaining accuracy.

[528] Contribution-Guided Asymmetric Learning for Robust Multimodal Fusion under Imbalance and Noise

Zijing Xu, Yunfeng Kou, Kunming Wu, Hong Liu

Main category: cs.MM

TL;DR: CAL addresses modality imbalance and noise in multimodal learning by enhancing high-contribution modalities while compressing weak ones, achieving superior performance on benchmark datasets.

Details

Motivation: Overcome limitations of existing methods that suppress dominant modalities without considering inherent information value differences, which can lead to suboptimal solutions.

Method: Uses Contribution-Guided Asymmetric Learning with modality contribution metric W^m, asymmetric gradient acceleration, and contribution-aware Asymmetric Information Bottleneck compression.

Result: Achieves 79.30%, 74.82%, and 74.21% accuracy on CREMA-D, KS, and AVE datasets, outperforming state-of-the-art ARL model, and shows leading performance in high-noise robustness tests.

Conclusion: CAL provides a flexible, efficient framework that effectively handles modality imbalance and noise interference with broad adaptability and application potential.

Abstract: Multimodal learning faces two major challenges: modality imbalance and data noise, which significantly affect the robustness and generalization ability of models. Existing methods achieve modality balance by suppressing dominant modalities, but they neglect the inherent differences in the information value between modalities, potentially leading to convergence to suboptimal solutions. This paper proposes an innovative modality compression paradigm, Contribution-Guided Asymmetric Learning (CAL), which aims to enhance the contribution of high-contribution modalities while compressing weak modalities to increase their contribution, allowing both to improve the performance of multimodal information fusion. CAL is based on a modality contribution metric W^m combining the information quantity I(m) and confidence D(m), and it designs an asymmetric gradient acceleration mechanism and a contribution-aware Asymmetric Information Bottleneck (AIB) compression mechanism. The former accelerates the gradient update of modalities, while the latter dynamically compresses the noise of low-contribution modalities. On five benchmark datasets, including emotion recognition, scene recognition, and event localization tasks, CAL has shown outstanding performance in imbalanced fusion tasks and noise robustness tests. On CREMA-D, KS, and AVE, CAL achieves 79.30%, 74.82%, and 74.21% accuracy, significantly outperforming the existing state-of-the-art model ARL. In high-noise robustness tests, CAL also achieved leading performance under various attack strategies on the MVSA-Single and NYUD2 datasets. These results validate the significant advantages of CAL in modality imbalance and noise interference. CAL, as a flexible and efficient framework, is easy to transfer to other tasks and has broad adaptability and potential application prospects.

eess.AS

[529] Towards Attribution of Generators and Emotional Manipulation in Cross-Lingual Synthetic Speech using Geometric Learning

Girish, Mohd Mujtaba Akhtar, Farhan Sheth, Muskaan Singh

Main category: eess.AS

TL;DR: MiCuNet is a multitask framework that combines speech foundation model embeddings with auditory features using mixed-curvature projections to trace emotional and manipulation characteristics in synthetic speech.

Details

Motivation: To enable precise tracing of both emotion and manipulation sources in synthetically manipulated speech by leveraging semantic-prosodic cues and fine-grained spectral dynamics.

Method: Integrates SFM embeddings with spectrogram-based auditory features through mixed-curvature projection (Hyperbolic, Euclidean, Spherical spaces) with learnable temporal gating, using multitask learning to predict original emotions, manipulated emotions, and manipulation sources.

Result: MiCuNet yields consistent improvements over conventional fusion strategies on the EmoFake dataset across English and Chinese subsets.

Conclusion: This is the first study to explore a curvature-adaptive framework specifically designed for multitask tracking in synthetic speech.

Abstract: In this work, we address the problem of fine-grained traceback of emotional and manipulation characteristics from synthetically manipulated speech. We hypothesize that combining semantic-prosodic cues captured by Speech Foundation Models (SFMs) with fine-grained spectral dynamics from auditory representations can enable more precise tracing of both emotion and manipulation source. To validate this hypothesis, we introduce MiCuNet, a novel multitask framework for fine-grained tracing of emotional and manipulation attributes in synthetically generated speech. Our approach integrates SFM embeddings with spectrogram-based auditory features through a mixed-curvature projection mechanism that spans Hyperbolic, Euclidean, and Spherical spaces guided by a learnable temporal gating mechanism. Our proposed method adopts a multitask learning setup to simultaneously predict original emotions, manipulated emotions, and manipulation sources on the EmoFake dataset (EFD) across both English and Chinese subsets. MiCuNet yields consistent improvements over conventional fusion strategies. To the best of our knowledge, this work presents the first study to explore a curvature-adaptive framework specifically tailored for multitask tracking in synthetic speech.
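
The mixed-curvature projection described above can be illustrated with a minimal sketch. It is not MiCuNet's implementation: the summary does not give the projection formulas, so this assumes the standard exponential map at the origin of a Poincare ball for the hyperbolic branch, unit-norm scaling for the spherical branch, and a fixed softmax gate standing in for the learnable temporal gating. All function names are hypothetical.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def to_hyperbolic(v, c=1.0):
    """Exponential map at the origin of a Poincare ball with curvature -c."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in v]

def to_spherical(v):
    """Project onto the unit sphere (keeps direction, discards magnitude)."""
    n = norm(v)
    return [x / n for x in v] if n > 0.0 else list(v)

def softmax(logits):
    """Numerically stable softmax over a short list of gate logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_curvature_fuse(v, gate_logits):
    """Gate-weighted concatenation of hyperbolic, Euclidean and spherical views.

    Assumption: the gate is a plain softmax over three fixed logits; the
    paper's gate is learnable and temporal.
    """
    branches = [to_hyperbolic(v), list(v), to_spherical(v)]
    weights = softmax(gate_logits)
    return [w * x for w, branch in zip(weights, branches) for x in branch]

embedding = [3.0, 4.0]  # toy 2-D embedding
fused = mixed_curvature_fuse(embedding, gate_logits=[0.0, 0.0, 0.0])
```

The hyperbolic branch always lands strictly inside the unit ball and the spherical branch on its surface, so the three views carry differently scaled copies of the same direction; the gate decides how much each view contributes.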

[530] Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces

Farhan Sheth, Girish, Mohd Mujtaba Akhtar, Muskaan Singh

Main category: eess.AS

TL;DR: RHYME is a unified audio deepfake detection framework that uses non-Euclidean projections to fuse embeddings from multiple speech encoders, achieving state-of-the-art performance across diverse synthesis paradigms.

Details

Motivation: Prior audio deepfake detection methods overfit to specific synthesis artifacts and fail to generalize across different speech generation paradigms like TTS, diffusion, and flow-matching systems.

Method: Fuses utterance-level embeddings from multiple pretrained speech encoders using hyperbolic and spherical projections, with Riemannian barycentric averaging to align representations across synthesis methods.

Result: Outperforms individual pretrained models and homogeneous fusion baselines, achieving state-of-the-art performance in cross-paradigm audio deepfake detection.

Conclusion: Non-Euclidean geometry enables effective modeling of shared structural distortions in synthetic speech across diverse generation paradigms, leading to robust and generalizable detection.

Abstract: In this work, we address the challenge of generalizable audio deepfake detection (ADD) across diverse speech synthesis paradigms, including conventional text-to-speech (TTS) systems and modern diffusion or flow-matching (FM) based generators. Prior work has mostly targeted individual synthesis families and often fails to generalize across paradigms due to overfitting to generation-specific artifacts. We hypothesize that synthetic speech, irrespective of its generative origin, leaves behind shared structural distortions in the embedding space that can be aligned through geometry-aware modeling. To this end, we propose RHYME, a unified detection framework that fuses utterance-level embeddings from diverse pretrained speech encoders using non-Euclidean projections. RHYME maps representations into hyperbolic and spherical manifolds, where hyperbolic geometry excels at modeling hierarchical generator families, and spherical projections capture angular, energy-invariant cues such as periodic vocoder artifacts. The fused representation is obtained via Riemannian barycentric averaging, enabling synthesis-invariant alignment. RHYME outperforms individual PTMs and homogeneous fusion baselines, achieving top performance and setting new state-of-the-art in cross-paradigm ADD.
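
To make "Riemannian barycentric averaging" concrete, here is a minimal sketch of one common instance: the iterative Frechet (Karcher) mean of unit vectors on a sphere. The summary does not say which manifold or algorithm RHYME uses for fusion, so the choice of the sphere, the fixed iteration count, and every function name are assumptions for illustration only. The sketch also assumes the points are not antipodal and do not cancel out in the initial Euclidean average.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def log_map(p, q):
    """Tangent vector at p pointing toward q along the great circle."""
    cos_theta = max(-1.0, min(1.0, dot(p, q)))
    theta = math.acos(cos_theta)
    if theta < 1e-12:
        return [0.0] * len(p)
    return [theta / math.sin(theta) * (b - cos_theta * a) for a, b in zip(p, q)]

def exp_map(p, v):
    """Walk from p along tangent vector v while staying on the unit sphere."""
    n = math.sqrt(dot(v, v))
    if n < 1e-12:
        return list(p)
    return [math.cos(n) * a + math.sin(n) * b / n for a, b in zip(p, v)]

def spherical_barycenter(points, iterations=50):
    """Iterative Frechet mean: average in the tangent space, then map back."""
    mean = normalize([sum(col) for col in zip(*points)])
    for _ in range(iterations):
        tangents = [log_map(mean, q) for q in points]
        step = [sum(col) / len(points) for col in zip(*tangents)]
        mean = normalize(exp_map(mean, step))
    return mean

# Two unit vectors placed symmetrically around the north pole.
a = 0.5
pair = [[math.sin(a), 0.0, math.cos(a)], [-math.sin(a), 0.0, math.cos(a)]]
center = spherical_barycenter(pair)
```

Unlike a plain arithmetic mean, the result stays on the manifold, which is the property a geometry-aware fusion step relies on.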

[531] CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition

Yun-Shao Tsai, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee

Main category: eess.AS

TL;DR: CO-VADA is a debiasing approach for speech emotion recognition that uses voice conversion to alter speaker attributes in biased training samples, helping models focus on emotion-relevant features without requiring demographic data or model changes.

Details

Motivation: Speech emotion recognition systems often exhibit bias due to spurious correlations between speaker characteristics and emotional labels, and existing debiasing methods require model-specific changes or demographic annotations, limiting practical deployment.

Method: CO-VADA identifies biased training samples, applies voice conversion to alter irrelevant speaker attributes, and generates augmented samples with different speaker variations to break spurious correlations.

Result: The approach mitigates bias in SER systems without modifying model architecture or relying on demographic information, making it compatible with various SER models and voice conversion tools.

Conclusion: CO-VADA provides a scalable and practical solution for improving fairness in speech emotion recognition systems by guiding models to focus on emotion-relevant features rather than speaker characteristics.

Abstract: Bias in speech emotion recognition (SER) systems often stems from spurious correlations between speaker characteristics and emotional labels, leading to unfair predictions across demographic groups. Many existing debiasing methods require model-specific changes or demographic annotations, limiting their practical use. We present CO-VADA, a Confidence-Oriented Voice Augmentation Debiasing Approach that mitigates bias without modifying model architecture or relying on demographic information. CO-VADA identifies training samples that reflect bias patterns present in the training data and then applies voice conversion to alter irrelevant attributes and generate samples. These augmented samples introduce speaker variations that differ from dominant patterns in the data, guiding the model to focus more on emotion-relevant features. Our framework is compatible with various SER models and voice conversion tools, making it a scalable and practical solution for improving fairness in SER systems.

[532] SPUR: A Plug-and-Play Framework for Integrating Spatial Audio Understanding and Reasoning into Large Audio-Language Models

S Sakshi, Vaibhavi Lokegaonkar, Neil Zhang, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha, Lie Lu

Main category: eess.AS

TL;DR: SPUR is a lightweight plug-in method that adds spatial perception capabilities to large audio-language models through First-Order Ambisonics encoding and a specialized spatial QA dataset.

Details

Motivation: Current large audio-language models operate on monaural inputs and lack spatial perception abilities like direction, elevation, and distance awareness, which are crucial for real-world acoustic scene understanding.

Method: SPUR consists of: (1) a First-Order Ambisonics encoder that maps (W,X,Y,Z) channels to rotation-aware spatial features via multimodal adapter, and (2) SPUR-Set spatial QA dataset combining FOA recordings with controlled simulations for supervised spatial reasoning training.

Result: Fine-tuning with SPUR consistently improves spatial QA and multi-speaker attribution while preserving general audio understanding capabilities. The approach transforms monaural LALMs into spatially aware models.

Conclusion: SPUR provides a simple and effective recipe to equip large audio-language models with spatial perception through minimal architectural changes, validated by extensive ablations.

Abstract: Spatial perception is central to auditory intelligence, enabling accurate understanding of real-world acoustic scenes and advancing human-level perception of the world around us. While recent large audio-language models (LALMs) show strong reasoning over complex audios, most operate on monaural inputs and lack the ability to capture spatial cues such as direction, elevation, and distance. We introduce SPUR, a lightweight, plug-in approach that equips LALMs with spatial perception through minimal architectural changes. SPUR consists of: (i) a First-Order Ambisonics (FOA) encoder that maps (W, X, Y, Z) channels to rotation-aware, listener-centric spatial features, integrated into target LALMs via a multimodal adapter; and (ii) SPUR-Set, a spatial QA dataset combining open-source FOA recordings with controlled simulations, emphasizing relative direction, elevation, distance, and overlap for supervised spatial reasoning. Fine-tuning our model on the SPUR-Set consistently improves spatial QA and multi-speaker attribution while preserving general audio understanding. SPUR provides a simple recipe that transforms monaural LALMs into spatially aware models. Extensive ablations validate the effectiveness of our approach.
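
As background on what the (W, X, Y, Z) channels carry, the sketch below encodes a single plane-wave source into First-Order Ambisonics and recovers its direction from the time-averaged intensity vector. This is textbook FOA geometry, not SPUR's learned encoder. It assumes a convention in which W carries the unscaled source signal (other conventions scale W differently), and the function names are hypothetical.

```python
import math

def encode_foa(samples, azimuth, elevation):
    """Encode a mono plane-wave source into (W, X, Y, Z) channels.

    Assumption: W is the unscaled source signal; angles are in radians.
    """
    gx = math.cos(azimuth) * math.cos(elevation)
    gy = math.sin(azimuth) * math.cos(elevation)
    gz = math.sin(elevation)
    w = list(samples)
    x = [gx * s for s in samples]
    y = [gy * s for s in samples]
    z = [gz * s for s in samples]
    return w, x, y, z

def estimate_direction(w, x, y, z):
    """Recover azimuth and elevation from the time-averaged intensity vector."""
    ix = sum(a * b for a, b in zip(w, x))
    iy = sum(a * b for a, b in zip(w, y))
    iz = sum(a * b for a, b in zip(w, z))
    azimuth = math.atan2(iy, ix)
    elevation = math.atan2(iz, math.hypot(ix, iy))
    return azimuth, elevation

source = [math.sin(0.1 * n) for n in range(200)]  # toy mono signal
channels = encode_foa(source, math.radians(60.0), math.radians(20.0))
azimuth, elevation = estimate_direction(*channels)
```

Direction and elevation fall out of the channel ratios directly, while distance does not, which is one reason cues such as distance and overlap need learned features and supervision rather than closed-form geometry.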

eess.IV

[533] Boosting Neural Video Representation via Online Structural Reparameterization

Ziyi Li, Qingyu Mao, Shuai Liu, Qilei Li, Fanyang Meng, Yongsheng Liang

Main category: eess.IV

TL;DR: Online-RepNeRV is a neural video representation framework that uses online structural reparameterization to enhance model capacity while maintaining computational efficiency during decoding.

Details

Motivation: Current neural video representation methods have complex designs with high computational overhead, lack flexibility for integration, and suffer from limited model capacity that creates performance bottlenecks.

Method: Proposes a universal reparameterization block (ERB) with multiple parallel convolutional paths, and uses online reparameterization to dynamically fuse parameters during training, converting multi-branch to single-branch structure after training.

Result: Achieves 0.37-2.7 dB average PSNR gain over baseline methods while maintaining comparable training time and decoding speed. Additional complexity is confined to encoding stage only.

Conclusion: Online-RepNeRV effectively enhances neural video representation performance through structural reparameterization, improving compression efficiency without sacrificing decoding speed.

Abstract: Neural Video Representation (NVR) is a promising paradigm for video compression, showing great potential in improving video storage and transmission efficiency. While recent advances have made efforts in architectural refinements to improve representational capability, these methods typically involve complex designs, which may incur increased computational overhead and lack the flexibility to integrate into other frameworks. Moreover, the inherent limitation in model capacity restricts the expressiveness of NVR networks, resulting in a performance bottleneck. To overcome these limitations, we propose Online-RepNeRV, an NVR framework based on online structural reparameterization. Specifically, we propose a universal reparameterization block named ERB, which incorporates multiple parallel convolutional paths to enhance the model capacity. To mitigate the overhead, an online reparameterization strategy is adopted to dynamically fuse the parameters during training, and the multi-branch structure is equivalently converted into a single-branch structure after training. As a result, the additional computational and parameter complexity is confined to the encoding stage, without affecting the decoding efficiency. Extensive experiments on mainstream video datasets demonstrate that our method achieves an average PSNR gain of 0.37-2.7 dB over baseline methods, while maintaining comparable training time and decoding speed.
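
The claim that a multi-branch block can be "equivalently converted" into a single branch rests on the linearity of convolution. The sketch below shows the idea in one dimension with two hypothetical branches (kernel sizes 3 and 1); it is not the ERB block itself, whose exact branch layout the summary does not give.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 'same' correlation of a 1-D signal with an odd-length kernel."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]

def merge_branches(k3, k1):
    """Fold a size-1 branch into a size-3 branch by centre-padding and summing."""
    return [k3[0], k3[1] + k1[0], k3[2]]

signal = [1.0, 2.0, -1.0, 0.5, 3.0]
k3 = [0.2, -0.5, 0.1]  # hypothetical weights of the 3-tap branch
k1 = [0.7]             # hypothetical weight of the 1-tap branch

# Training-time view: two parallel branches whose outputs are summed.
multi_branch = [a + b for a, b in zip(conv1d_same(signal, k3), conv1d_same(signal, k1))]

# Inference-time view: one merged kernel, one convolution.
single_branch = conv1d_same(signal, merge_branches(k3, k1))
```

Both paths produce the same output up to floating-point rounding, so the extra branches add capacity during training without adding cost to the decoder.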

Hongchao Shu, Lalithkumar Seenivasan, Mingxu Liu, Yunseo Hwang, Yu-Chun Ku, Jonathan Knopf, Alejandro Martin-Gomez, Mehran Armand, Mathias Unberath

Main category: eess.IV

TL;DR: DualVision ArthroNav is a multi-camera navigation system for arthroscopy that combines an external camera with the arthroscope to resolve scale ambiguity and drift issues in monocular SLAM, achieving high accuracy in surgical navigation.

Details

Motivation: Existing optical tracking systems impose workspace constraints and disrupt surgical workflow, while vision-based alternatives using only the monocular arthroscope suffer from drift, scale ambiguity, and sensitivity to motion/occlusion.

Method: Integrates an external camera rigidly mounted on the arthroscope to provide stable visual odometry and absolute localization, while the monocular arthroscope enables dense scene reconstruction. Combines both views to resolve scale ambiguity and drift.

Result: Achieves average absolute trajectory error of 1.09 mm, reconstructed scenes reach average target registration error of 2.16 mm with high visual fidelity (SSIM = 0.69, PSNR = 22.19). Effectively compensates for calibration errors.

Conclusion: Provides a practical and cost-efficient solution for arthroscopic navigation, bridging the gap between optical tracking and purely vision-based systems, paving the way for clinically deployable, fully vision-based arthroscopic guidance.

Abstract: Arthroscopic procedures can greatly benefit from navigation systems that enhance spatial awareness, depth perception, and field of view. However, existing optical tracking solutions impose strict workspace constraints and disrupt surgical workflow. Vision-based alternatives, though less invasive, often rely solely on the monocular arthroscope camera, making them prone to drift, scale ambiguity, and sensitivity to rapid motion or occlusion. We propose DualVision ArthroNav, a multi-camera arthroscopy navigation system that integrates an external camera rigidly mounted on the arthroscope. The external camera provides stable visual odometry and absolute localization, while the monocular arthroscope video enables dense scene reconstruction. By combining these complementary views, our system resolves the scale ambiguity and long-term drift inherent in monocular SLAM and ensures robust relocalization. Experiments demonstrate that our system effectively compensates for calibration errors, achieving an average absolute trajectory error of 1.09 mm. The reconstructed scenes reach an average target registration error of 2.16 mm, with high visual fidelity (SSIM = 0.69, PSNR = 22.19). These results indicate that our system provides a practical and cost-efficient solution for arthroscopic navigation, bridging the gap between optical tracking and purely vision-based systems, and paving the way toward clinically deployable, fully vision-based arthroscopic guidance.
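
For readers unfamiliar with the reported metrics, here is a minimal sketch of how a mean absolute trajectory error and PSNR are typically computed. It assumes the estimated and ground-truth trajectories are already time-matched and expressed in the same frame (the alignment step is omitted), and that image intensities are 8-bit. The paper's exact evaluation protocol is not given in the summary, and the function names are hypothetical.

```python
import math

def mean_ate(estimated, ground_truth):
    """Mean Euclidean distance between matched, already-aligned 3-D positions."""
    errors = [math.dist(e, g) for e, g in zip(estimated, ground_truth)]
    return sum(errors) / len(errors)

def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB over flattened pixel intensities."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)

# Toy trajectory in millimetres: the estimate is offset by (0.6, 0.8, 0.0).
truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimate = [(x + 0.6, y + 0.8, z) for x, y, z in truth]
ate_mm = mean_ate(estimate, truth)

# Toy two-pixel image pair.
quality_db = psnr([10.0, 20.0], [12.0, 18.0])
```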

[535] From Attention to Frequency: Integration of Vision Transformer and FFT-ReLU for Enhanced Image Deblurring

Syed Mumtahin Mahmud, Mahdi Mohd Hossain Noki, Prothito Shovon Majumder, Abdul Mohaimen Al Radi, Md. Haider Ali, Md. Mosaddek Khan

Main category: eess.IV

TL;DR: Proposes a dual-domain architecture combining Vision Transformers with FFT-ReLU module for image deblurring, achieving superior performance through spatial attention and frequency sparsity.

Details

Motivation: Current deep learning approaches (CNNs and ViTs) struggle with complex/high-resolution blur and computational demands, needing better handling of blur artifacts and detail preservation.

Method: Dual-domain architecture unifying Vision Transformers with frequency-domain FFT-ReLU module, where ViT captures local/global dependencies and FFT-ReLU enforces frequency sparsity to suppress blur artifacts.

Result: Achieves superior PSNR, SSIM, and perceptual quality on benchmark datasets, with confirmation from quantitative metrics, qualitative comparisons, and human preference evaluations.

Conclusion: Establishes a practical and generalizable paradigm for real-world image restoration by bridging spatial attention modeling and frequency sparsity.

Abstract: Image deblurring is vital in computer vision, aiming to recover sharp images from blurry ones caused by motion or camera shake. While deep learning approaches such as CNNs and Vision Transformers (ViTs) have advanced this field, they often struggle with complex or high-resolution blur and computational demands. We propose a new dual-domain architecture that unifies Vision Transformers with a frequency-domain FFT-ReLU module, explicitly bridging spatial attention modeling and frequency sparsity. In this structure, the ViT backbone captures local and global dependencies, while the FFT-ReLU component enforces frequency-domain sparsity to suppress blur-related artifacts and preserve fine details. Extensive experiments on benchmark datasets demonstrate that this architecture achieves superior PSNR, SSIM, and perceptual quality compared to state-of-the-art models. Quantitative metrics, qualitative comparisons, and human preference evaluations all confirm its effectiveness, establishing a practical and generalizable paradigm for real-world image restoration.
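
The FFT-ReLU idea can be sketched on a 1-D signal: transform to the frequency domain, rectify the coefficients, and transform back. The summary does not specify how the module rectifies complex coefficients, so this sketch assumes ReLU is applied separately to the real and imaginary parts, and it uses a naive DFT in place of a fast transform to stay dependency-free. Function names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out

def fft_relu(signal):
    """Rectify real and imaginary spectrum parts, then return to the signal domain.

    Assumption: ReLU acts on each part independently. The rectified spectrum
    is generally not conjugate-symmetric, so only the real part is kept.
    """
    spectrum = dft(signal)
    rectified = [complex(max(c.real, 0.0), max(c.imag, 0.0)) for c in spectrum]
    return [v.real for v in dft(rectified, inverse=True)]

impulse = [1.0, 0.0, 0.0, 0.0]      # spectrum is all ones, so nothing is clipped
restored = fft_relu(impulse)
filtered = fft_relu([0.0, 1.0, 0.0, -1.0])  # spectrum has negative parts that get zeroed
```

Zeroing the negative coefficients discards part of the spectrum, which is the sparsity effect the abstract attributes to the frequency branch.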

[536] CLIPPan: Adapting CLIP as A Supervisor for Unsupervised Pansharpening

Lihua Jian, Jiabo Liu, Shaowu Wu, Lihui Chen

Main category: eess.IV

TL;DR: CLIPPan is an unsupervised pansharpening framework that uses CLIP as a supervisor to train models at full resolution, overcoming domain adaptation issues in traditional supervised methods.

Details

Motivation: Supervised pansharpening networks face domain adaptation challenges due to disparity between simulated low-resolution training data and real-world full-resolution scenarios, creating a need for unsupervised full-resolution training.

Method: Fine-tunes CLIP to recognize pansharpening components, then uses semantic language constraints that align image fusion transitions with textual prompts (Wald’s or Khan’s descriptions) as supervisory signals without ground truth.

Result: CLIPPan consistently improves spectral and spatial fidelity across various pansharpening backbones on real-world datasets, setting new state-of-the-art for unsupervised full-resolution pansharpening.

Conclusion: The framework successfully bridges the domain gap by leveraging CLIP’s language-image understanding to provide supervision for pansharpening without requiring ground truth data.

Abstract: Despite remarkable advancements in supervised pansharpening neural networks, these methods face resolution-related domain adaptation challenges due to the intrinsic disparity between simulated reduced-resolution training data and real-world full-resolution scenarios. To bridge this gap, we propose an unsupervised pansharpening framework, CLIPPan, that enables model training at full resolution directly by taking CLIP, a visual-language model, as a supervisor. However, directly applying CLIP to supervise pansharpening remains challenging due to its inherent bias toward natural images and limited understanding of pansharpening tasks. Therefore, we first introduce a lightweight fine-tuning pipeline that adapts CLIP to recognize low-resolution multispectral, panchromatic, and high-resolution multispectral images, as well as to understand the pansharpening process. Then, building on the adapted CLIP, we formulate a novel loss integrating semantic language constraints, which aligns image-level fusion transitions with protocol-aligned textual prompts (e.g., Wald’s or Khan’s descriptions), thus enabling CLIPPan to use language as a powerful supervisory signal and guide fusion learning without ground truth. Extensive experiments demonstrate that CLIPPan consistently improves spectral and spatial fidelity across various pansharpening backbones on real-world datasets, setting a new state of the art for unsupervised full-resolution pansharpening.

[537] Sensitivity of Finite Element Models to Relationship Between T2 Relaxation and Modulus in Articular Cartilage

Alexander A. Donabedian, Deva D. Chan

Main category: eess.IV

TL;DR: This study evaluates how uncertainty in the relationship between MRI biomarkers (T2) and cartilage material properties (dynamic modulus) affects finite element model predictions, finding that modulus shifts significantly impact stress/strain calculations while slope changes have minimal effect.

Details

Motivation: To understand the sensitivity of finite element models to uncertainty in the relationship between quantitative MRI biomarkers (T2) and cartilage material properties (dynamic modulus), which is crucial for developing accurate subject-specific models.

Method: The researchers systematically shifted the slope and intercept of a linear T2-dynamic modulus relationship used to define cartilage properties in finite element models, then analyzed the resulting changes in calculated stress and strain.

Result: Shifts in modulus (intercept changes) led to notable percent changes in the top 1% of calculated stress and strain, while modulating the slope of the relationship had a negligible impact on model predictions.

Conclusion: The findings support using physiologically relevant modulus ranges in subject-specific finite element models, as modulus values have significant influence on stress/strain predictions while the specific slope of the T2-modulus relationship is less critical.

Abstract: Correlating articular cartilage material properties to quantitative magnetic resonance imaging biomarkers is a powerful approach to biofidelic finite element models. However, subject-specific relationships between imaging biomarkers such as T2 and material properties like dynamic modulus are uncertain. To evaluate the sensitivity of finite element models to this uncertainty, we shifted the slope and intercept of a linear T2-dynamic modulus relationship used to define cartilage properties. Modulus shifts led to notable percent changes in the top 1% of calculated stress and strain while modulating slope had a negligible impact, together supporting the use of physiologically relevant moduli ranges in subject-specific models.
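
The two perturbations in the study, shifting the intercept and scaling the slope of a linear T2-to-modulus map, can be sketched as follows. The coefficients, units, and T2 values are hypothetical placeholders rather than the paper's values, and the finite element solve that turns these moduli into stress and strain is not reproduced.

```python
def modulus_from_t2(t2_ms, slope, intercept):
    """Linear map from T2 (ms) to dynamic modulus (MPa); coefficients are hypothetical."""
    return slope * t2_ms + intercept

def perturbed_moduli(t2_values, slope, intercept, slope_scale=1.0, intercept_shift=0.0):
    """Element-wise moduli after scaling the slope and/or shifting the intercept."""
    return [
        modulus_from_t2(t, slope * slope_scale, intercept + intercept_shift)
        for t in t2_values
    ]

t2_map = [30.0, 40.0, 50.0]  # toy per-element T2 values in ms
baseline = perturbed_moduli(t2_map, slope=-0.1, intercept=12.0)
shifted = perturbed_moduli(t2_map, slope=-0.1, intercept=12.0, intercept_shift=2.0)
steeper = perturbed_moduli(t2_map, slope=-0.1, intercept=12.0, slope_scale=1.2)
```

An intercept shift moves every element's modulus by the same amount, while a slope change alters the contrast between elements. The sketch only shows how the model inputs change; the reported sensitivity of peak stress and strain comes from the finite element results.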

[538] Large-scale modality-invariant foundation models for brain MRI analysis: Application to lesion segmentation

Petros Koutsouvelis, Matej Gazda, Leroy Volmer, Sina Amirrajab, Kamil Barbierik, Branislav Setlak, Jakub Gazda, Peter Drotar

Main category: eess.IV

TL;DR: Proposes modality-invariant representation learning for brain MRI SSL, finding that lesion segmentation benefits more from preserving modality-specific features than cross-modality alignment.

Details

Motivation: Current SSL frameworks are designed for natural images and don't effectively capture multi-modal MRI information needed for neuroimaging tasks like lesion segmentation.

Method: Developed a modality-invariant representation learning setup with large-scale self-supervised pre-training on unlabeled brain MRI data, evaluated on stroke and epilepsy lesion segmentation tasks.

Result: Despite successful cross-modality alignment, lesion segmentation performance primarily benefits from preserving fine-grained modality-specific features rather than modality-invariant representations.

Conclusion: For lesion segmentation in neuroimaging, preserving modality-specific features is more important than achieving perfect cross-modality alignment in SSL pre-training.

Abstract: The field of computer vision is undergoing a paradigm shift toward large-scale foundation model pre-training via self-supervised learning (SSL). Leveraging large volumes of unlabeled brain MRI data, such models can learn anatomical priors that improve few-shot performance in diverse neuroimaging tasks. However, most SSL frameworks are tailored to natural images, and their adaptation to capture multi-modal MRI information remains underexplored. This work proposes a modality-invariant representation learning setup and evaluates its effectiveness in stroke and epilepsy lesion segmentation, following large-scale pre-training. Experimental results suggest that despite successful cross-modality alignment, lesion segmentation primarily benefits from preserving fine-grained modality-specific features. Model checkpoints and code are made publicly available.

[539] Unsupervised Motion-Compensated Decomposition for Cardiac MRI Reconstruction via Neural Representation

Xuanyu Tian, Lixuan Chen, Qing Wu, Xiao Wang, Jie Feng, Yuyao Zhang, Hongjiang Wei

Main category: eess.IV

TL;DR: MoCo-INR is an unsupervised method combining implicit neural representations with motion-compensated framework for high-quality cardiac MRI reconstruction from highly undersampled data, achieving 20x acceleration with superior results.

Details

Motivation: Current CMR reconstruction methods either produce unsatisfactory image quality or are limited by scarce ground truth data, restricting clinical applicability.

Method: Integrates implicit neural representations with motion-compensated framework using explicit motion modeling and continuous INR priors, with a specialized INR architecture for CMR.

Result: Superior performance over state-of-the-art methods with fast convergence and fine-detailed reconstructions at 20x acceleration, validated on both retrospective and prospective free-breathing scans.

Conclusion: MoCo-INR demonstrates clinical practicality for real-time CMR imaging with effective motion decomposition and high-quality reconstruction without requiring ground truth data.

Abstract: Cardiac magnetic resonance (CMR) imaging is widely used to characterize cardiac morphology and function. To accelerate CMR imaging, various methods have been proposed to recover high-quality spatiotemporal CMR images from highly undersampled k-t space data. However, current CMR reconstruction techniques either fail to achieve satisfactory image quality or are restricted by the scarcity of ground truth data, leading to limited applicability in clinical scenarios. In this work, we propose MoCo-INR, a new unsupervised method that integrates implicit neural representations (INR) with the conventional motion-compensated (MoCo) framework. Using explicit motion modeling and the continuous prior of INRs, MoCo-INR can produce accurate cardiac motion decomposition and high-quality CMR reconstruction. Furthermore, we introduce a new INR network architecture tailored to the CMR problem, which significantly stabilizes model optimization. Experiments on retrospective (simulated) datasets demonstrate the superiority of MoCo-INR over state-of-the-art methods, achieving fast convergence and fine-detailed reconstructions at ultra-high acceleration factors (e.g., 20x in VISTA sampling). Additionally, evaluations on prospective (real-acquired) free-breathing CMR scans highlight the clinical practicality of MoCo-INR for real-time imaging. Several ablation studies further confirm the effectiveness of the critical components of MoCo-INR.

[540] TEyeD: Over 20 million real-world eye images with Pupil, Eyelid, and Iris 2D and 3D Segmentations, 2D and 3D Landmarks, 3D Eyeball, Gaze Vector, and Eye Movement Types

Wolfgang Fuhl, Gjergji Kasneci, Enkelejda Kasneci

Main category: eess.IV

TL;DR: TEyeD is the world’s largest unified public dataset of eye images captured with head-mounted devices, featuring over 20 million annotated images from various activities and eye trackers including VR/AR devices.

Details

Motivation: To provide a comprehensive, coherent resource for advancing research in computer vision, eye tracking, and gaze estimation for modern VR and AR applications by addressing the need for large-scale, diverse eye image data.

Method: Collected eye images using seven different head-mounted eye trackers (including VR/AR devices) during various tasks such as car rides, simulator rides, outdoor sports, and daily indoor activities. Provided extensive annotations including 2D/3D landmarks, semantic segmentation, 3D eyeball data, gaze vectors, and eye movement types.

Result: Created a dataset with over 20 million carefully annotated images, making it the largest unified public dataset of eye images from head-mounted devices. The dataset includes diverse video lengths from minutes to hours and comprehensive annotations for pupil, iris, and eyelids.

Conclusion: TEyeD provides a unique and valuable foundation that will significantly advance research in computer vision, eye tracking, and gaze estimation, particularly for modern VR and AR applications, by offering unprecedented scale and annotation quality.

Abstract: We present TEyeD, the world’s largest unified public data set of eye images taken with head-mounted devices. TEyeD was acquired with seven different head-mounted eye trackers. Among them, two eye trackers were integrated into virtual reality (VR) or augmented reality (AR) devices. The images in TEyeD were obtained from various tasks, including car rides, simulator rides, outdoor sports activities, and daily indoor activities. The data set includes 2D and 3D landmarks, semantic segmentation, 3D eyeball annotation and the gaze vector and eye movement types for all images. Landmarks and semantic segmentation are provided for the pupil, iris and eyelids. Video lengths vary from a few minutes to several hours. With more than 20 million carefully annotated images, TEyeD provides a unique, coherent resource and a valuable foundation for advancing research in the field of computer vision, eye tracking and gaze estimation in modern VR and AR applications. Download: https://es-cloud.cs.uni-tuebingen.de/d/8e2ab8c3fdd444e1a135/?p=%2FTEyeDS&mode=list

[541] Efficient Image Restoration via Latent Consistency Flow Matching

Elad Cohen, Idan Achituve, Idit Diamant, Arnon Netzer, Hai Victor Habi

Main category: eess.IV

TL;DR: ELIR is an efficient latent image restoration method that addresses the distortion-perception trade-off in latent space using a latent consistency flow-based model, achieving competitive performance while being 4x smaller and faster than state-of-the-art methods.

Details

Motivation: Current generative image restoration methods are too large and computationally demanding for deployment on edge devices, creating a need for efficient alternatives.

Method: Uses a latent consistency flow-based model operating in latent space with an efficient and lightweight architecture to balance distortion and perceptual quality.

Result: ELIR achieves competitive performance compared to state-of-the-art methods while being 4x smaller and faster, enabling deployment on resource-constrained devices.

Conclusion: ELIR effectively balances distortion and perceptual quality metrics while significantly reducing model size and computational cost, making it suitable for edge device deployment.

Abstract: Recent advances in generative image restoration (IR) have demonstrated impressive results. However, these methods are hindered by their substantial size and computational demands, rendering them unsuitable for deployment on edge devices. This work introduces ELIR, an Efficient Latent Image Restoration method. ELIR addresses the distortion-perception trade-off within the latent space and produces high-quality images using a latent consistency flow-based model. In addition, ELIR introduces an efficient and lightweight architecture. Consequently, ELIR is 4x smaller and faster than state-of-the-art diffusion and flow-based approaches for blind face restoration, enabling a deployment on resource-constrained devices. Comprehensive evaluations of various image restoration tasks and datasets show that ELIR achieves competitive performance compared to state-of-the-art methods, effectively balancing distortion and perceptual quality metrics while significantly reducing model size and computational cost. The code is available at: https://github.com/eladc-git/ELIR

[542] SCReedSolo: A Secure and Robust LSB Image Steganography Framework with Randomized Symmetric Encryption and Reed-Solomon Coding

Syed Rifat Raiyan, Md. Hasanul Kabir

Main category: eess.IV

TL;DR: SCReedSolo is a novel image steganography framework that combines random shuffling, Fernet encryption, Reed-Solomon error correction, and LSB embedding to securely hide binary data in images, with a payload capacity of 3 bits per pixel.

Details

Motivation: To address vulnerabilities in traditional steganography methods regarding both security and resilience against bit-level corruption in stego-images.

Method: Synergistically combines Random Shuffling, Fernet Symmetric Encryption, Reed-Solomon Error Correction Codes, and LSB (Least Significant Bit) Steganography to encode and embed secret payloads.

Result: Achieves a payload of 3 bits per pixel for RGB images, evades detection by several passive steganalysis tools, withstands simple active steganalysis attacks, and provides a mathematical assessment of the probability of successful transmission.

Conclusion: SCReedSolo provides a robust framework for secure and resilient image steganography that balances payload capacity with detection resistance and error correction capabilities.

Abstract: Image steganography is an information-hiding technique that involves the surreptitious concealment of covert informational content within digital images. In this paper, we introduce SCReedSolo, a novel framework for concealing arbitrary binary data within images. Our approach synergistically leverages Random Shuffling, Fernet Symmetric Encryption, and Reed-Solomon Error Correction Codes to encode the secret payload, which is then discretely embedded into the carrier image using LSB (Least Significant Bit) Steganography. The combination of these methods addresses the vulnerability vectors of both security and resilience against bit-level corruption in the resultant stego-images. We show that our framework achieves a data payload of 3 bits per pixel for an RGB image, and mathematically assess the probability of successful transmission for the amalgamated $n$ message bits and $k$ error correction bits. Additionally, we find that SCReedSolo yields good results upon being evaluated with multiple performance metrics, successfully eludes detection by various passive steganalysis tools, and is immune to simple active steganalysis attacks. Our code and data are available at https://github.com/Starscream-11813/SCReedSolo-Steganography.
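
To make the LSB step and the 3 bits-per-pixel figure concrete, here is a minimal sketch that is not taken from the authors' repository. It assumes the image is a flat buffer of 8-bit channel values (R, G, B per pixel) and uses a seeded permutation of embedding positions as a stand-in for the shuffling stage; the paper's actual shuffling may operate differently. Fernet encryption and Reed-Solomon coding are deliberately omitted, and the function names `embed_lsb`, `extract_lsb`, and `capacity_bits` are hypothetical.

```python
import random


def capacity_bits(width: int, height: int, channels: int = 3) -> int:
    """One hidden bit per channel value, so an RGB image holds 3 bits per pixel."""
    return width * height * channels


def embed_lsb(pixels: bytes, payload: bytes, seed: int) -> bytearray:
    """Write payload bits into the least significant bit of shuffled positions."""
    bits = [(byte >> (7 - i)) & 1 for byte in payload for i in range(8)]
    if len(bits) > len(pixels):
        raise ValueError("payload exceeds the 1-bit-per-channel capacity")
    # Assumption: the seed is a shared secret. Python's random module is NOT
    # cryptographically secure; it is used here only for illustration.
    order = list(range(len(pixels)))
    random.Random(seed).shuffle(order)
    stego = bytearray(pixels)
    for bit, pos in zip(bits, order):
        stego[pos] = (stego[pos] & 0xFE) | bit  # clear the LSB, then set it
    return stego


def extract_lsb(stego: bytes, n_bytes: int, seed: int) -> bytes:
    """Recover n_bytes by reading LSBs in the same seeded order."""
    order = list(range(len(stego)))
    random.Random(seed).shuffle(order)
    bits = [stego[pos] & 1 for pos in order[: n_bytes * 8]]
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i : i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)
```

Each channel value changes by at most 1, which is why LSB embedding is visually subtle. In a fuller pipeline the payload would be encrypted and error-correction-coded before this step, so that flipped low-order bits in transit can be repaired.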

[543] Latent Motion Profiling for Annotation-free Cardiac Phase Detection in Adult and Fetal Echocardiography Videos

Yingyu Yang, Qianye Yang, Kangning Cui, Can Peng, Elena D’Alberti, Netzahualcoyotl Hernandez-Cruz, Olga Patey, Aris T. Papageorghiou, J. Alison Noble

Main category: eess.IV

TL;DR: Unsupervised framework for cardiac phase detection using self-supervised learning of latent motion trajectories from echocardiography videos, eliminating need for manual annotations.

Details

Motivation: Automatic cardiac phase detection typically requires extensive annotations which are time-consuming and labor-intensive. There's a need for methods that don't rely on manual labeling.

Method: Self-supervised learning approach that trains a reconstruction model to encode interpretable spatiotemporal motion patterns from 4-chamber-view echocardiography videos without requiring annotations.

Result: Achieves a mean absolute error (MAE) of 3 frames (58.3 ms) for ED and 2 frames (38.8 ms) for ES on the EchoNet-Dynamic benchmark, matching state-of-the-art supervised methods. On fetal echocardiography, it achieves an MAE of 1.46 frames (20.7 ms) for ED and 1.74 frames (25.3 ms) for ES.

Conclusion: The latent motion trajectory strategy shows strong potential for unsupervised cardiac phase detection in both adult and fetal echocardiography, offering a scalable solution for clinical populations lacking annotated data.

Abstract: The identification of cardiac phase is an essential step for analysis and diagnosis of cardiac function. Automatic methods, especially data-driven methods for cardiac phase detection, typically require extensive annotations, which is time-consuming and labor-intensive. In this paper, we present an unsupervised framework for end-diastole (ED) and end-systole (ES) detection through self-supervised learning of latent cardiac motion trajectories from 4-chamber-view echocardiography videos. Our method eliminates the need for manual annotations, including ED and ES indices, segmentation, or volumetric measurements, by training a reconstruction model to encode interpretable spatiotemporal motion patterns. Evaluated on the EchoNet-Dynamic benchmark, the approach achieves mean absolute error (MAE) of 3 frames (58.3 ms) for ED and 2 frames (38.8 ms) for ES detection, matching state-of-the-art supervised methods. Extended to fetal echocardiography, the model demonstrates robust performance with MAE 1.46 frames (20.7 ms) for ED and 1.74 frames (25.3 ms) for ES, despite the fact that the fetal heart model is built using non-standardized heart views due to fetal heart positioning variability. Our results demonstrate the potential of the proposed latent motion trajectory strategy for cardiac phase detection in adult and fetal echocardiography. This work advances unsupervised cardiac motion analysis, offering a scalable solution for clinical populations lacking annotated data. Code will be released at https://github.com/YingyuYyy/CardiacPhase.
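
The reported errors are given both in frames and in milliseconds. The sketch below shows how such a metric could be computed, under assumptions that are not confirmed by the summary: the latent motion trajectory is reduced to a 1D signal, ED and ES candidates are taken as its local maxima and minima (which extremum corresponds to which phase is a hypothetical choice here), and each reference frame is matched to the nearest detected frame. The function names and the frame rate in the usage note are illustrative only.

```python
def local_extrema(signal):
    """Return (maxima, minima) index lists for strict local extrema of a 1D sequence."""
    maxima, minima = [], []
    for i in range(1, len(signal) - 1):
        if signal[i - 1] < signal[i] > signal[i + 1]:
            maxima.append(i)
        elif signal[i - 1] > signal[i] < signal[i + 1]:
            minima.append(i)
    return maxima, minima


def mae_frames(predicted, reference):
    """Mean absolute error in frames, matching each reference to its nearest prediction."""
    if not predicted or not reference:
        raise ValueError("both frame lists must be non-empty")
    errors = [min(abs(r - p) for p in predicted) for r in reference]
    return sum(errors) / len(errors)


def frames_to_ms(frames, fps):
    """Convert a duration in frames to milliseconds at a given frame rate."""
    return frames * 1000.0 / fps
```

For instance, with a hypothetical 50 fps recording, an error of 1.5 frames corresponds to `frames_to_ms(1.5, 50)`, i.e. 30 ms. Because frame rates differ between videos, a per-video conversion averaged over a dataset need not equal the mean frame error multiplied by a single frame duration.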

Table of Contents

cs.CL

[1] Unsupervised Cycle Detection in Agentic Applications

[2] AV-Dialog: Spoken Dialogue Models with Audio-Visual Input

[3] Data Analysis and Performance Evaluation of Simulation Deduction Based on LLMs

[4] Cognitively-Inspired Episodic Memory Architectures for Accurate and Efficient Character AI

[5] Proactive Hearing Assistants that Isolate Egocentric Conversations

[6] Hybrid Quantum Transformer for Language Generation

[7] Empirical Characterization of Temporal Constraint Processing in LLMs

[8] HI-TransPA: Hearing Impairments Translation Personal Assistant

[9] Spectral Neuro-Symbolic Reasoning II: Semantic Node Merging, Entailment Filtering, and Knowledge Graph Alignment

[10] Preference Orchestrator: Prompt-Aware Multi-Objective Alignment for Large Language Models

[11] Patent Representation Learning via Self-supervision

[12] Evaluating Open-Weight Large Language Models for Structured Data Extraction from Narrative Medical Reports Across Multiple Use Cases and Languages

[13] Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment

[14] iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference

[15] Information Extraction From Fiscal Documents Using LLMs

[16] Do AI Voices Learn Social Nuances? A Case of Politeness and Speech Rate

[17] Test-Time Steering for Lossless Text Compression via Weighted Product of Experts

[18] Bayesian Evaluation of Large Language Model Behavior

[19] Evaluating Modern Large Language Models on Low-Resource and Morphologically Rich Languages: A Cross-Lingual Benchmark Across Cantonese, Japanese, and Turkish

[20] Guarding the Meaning: Self-Supervised Training for Semantic Robustness in Guard Models

[21] Evaluating LLM Understanding via Structured Tabular Decision Simulations

[22] Forecasting Spoken Language Development in Children with Cochlear Implants Using Preimplantation MRI

[23] Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency

[24] Large language models in materials science and the need for open-source approaches

[25] Continual Learning of Domain Knowledge from Human Feedback in Text-to-SQL

[26] Learn to Select: Exploring Label Distribution Divergence for In-Context Demonstration Selection in Text Classification

[27] Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models

[28] SpiderGen: Towards Procedure Generation For Carbon Life Cycle Assessments with Generative AI

[29] A methodological analysis of prompt perturbations and their effect on attack success rates

[30] Modeling and Predicting Multi-Turn Answer Instability in Large Language Models

[31] Equilibrium Dynamics and Mitigation of Gender Bias in Synthetically Generated Data

[32] Saying the Unsaid: Revealing the Hidden Language of Multimodal Systems Through Telephone Games

[33] Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models

[34] Where does an LLM begin computing an instruction?

[35] “As Eastern Powers, I will veto.” : An Investigation of Nation-level Bias of Large Language Models in International Relations

[36] $π$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling

[37] Faithful Summarization of Consumer Health Queries: A Cross-Lingual Framework with LLMs

[38] TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English

[39] Sabiá: Um Chatbot de Inteligência Artificial Generativa para Suporte no Dia a Dia do Ensino Superior

[40] LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

[41] Assessing the Capabilities of LLMs in Humor: A Multi-dimensional Analysis of Oogiri Generation and Evaluation

[42] Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders

[43] Reinforcing Stereotypes of Anger: Emotion AI on African American Vernacular English

[44] Leveraging Parameter Space Symmetries for Reasoning Skill Transfer in LLMs

[45] From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems

[46] ICX360: In-Context eXplainability 360 Toolkit

[47] A Multifaceted Analysis of Negative Bias in Large Language Models through the Lens of Parametric Knowledge

[48] MedPath: Multi-Domain Cross-Vocabulary Hierarchical Paths for Biomedical Entity Linking

[49] From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

[50] Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering

[51] Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions

[52] Automated Analysis of Learning Outcomes and Exam Questions Based on Bloom’s Taxonomy

[53] Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D

[54] CardioEmbed: Domain-Specialized Text Embeddings for Clinical Cardiology

[55] DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

[56] When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets

[57] Automata-Based Steering of Large Language Models for Diverse Structured Generation

[58] Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

[59] Can LLMs Detect Their Own Hallucinations?

[60] Analysing Personal Attacks in U.S. Presidential Debates

[61] Enhancing Meme Emotion Understanding with Multi-Level Modality Enhancement and Dual-Stage Modal Fusion

[62] Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition

[63] PRSM: A Measure to Evaluate CLIP’s Robustness Against Paraphrases

[64] Adverbs Revisited: Enhancing WordNet Coverage of Adverbs with a Supersense Taxonomy

[65] LANE: Lexical Adversarial Negative Examples for Word Sense Disambiguation

[66] KGQuest: Template-Driven QA Generation from Knowledge Graphs with LLM-Based Refinement

[67] destroR: Attacking Transfer Models with Obfuscous Examples to Discard Perplexity

[68] LAET: A Layer-wise Adaptive Ensemble Tuning Framework for Pretrained Language Models

[69] NOVA: An Agentic Framework for Automated Histopathology Analysis and Discovery

[70] LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models

[71] M-DAIGT: A Shared Task on Multi-Domain Detection of AI-Generated Text

[72] Studies with impossible languages falsify LMs as models of human language

[73] MajinBook: An open catalogue of digital world literature with likes

[74] W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search

[75] PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

[76] Identifying and Analyzing Performance-Critical Tokens in Large Language Models

[77] Metric Learning Encoding Models: A Multivariate Framework for Interpreting Neural Representations