Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 183]
cs.CV [Total: 325]
cs.AI [Total: 102]
cs.SD [Total: 22]
cs.LG [Total: 341]
cs.MA [Total: 9]
cs.MM [Total: 0]
eess.AS [Total: 9]
eess.IV [Total: 25]

cs.CL

[1] Factual and Musical Evaluation Metrics for Music Language Models

Daniel Chenyu Lin, Michael Freeman, John Thickstun

Main category: cs.CL

TL;DR: Failed to fetch summary for arXiv paper 2511.05550 due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation from the provided content

Method: Unable to determine method from the provided content

Result: Unable to determine results from the provided content

Conclusion: Unable to determine conclusion from the provided content

Abstract: Failed to fetch summary for 2511.05550: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05550&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[2] Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation

Canxiang Yan, Chunxiang Jin, Dawei Huang, Haibing Yu, Han Peng, Hui Zhan, Jie Gao, Jing Peng, Jingdong Chen, Jun Zhou, Kaimeng Ren, Ming Yang, Mingxue Yang, Qiang Xu, Qin Zhao, Ruijie Xiong, Shaoxiong Lin, Xuezhi Wang, Yi Yuan, Yifei Wu, Yongjie Lyu, Zhengyu He, Zhihao Qiu, Zhiqiang Fang, Ziyuan Huang

Main category: cs.CL

TL;DR: A unified framework for speech understanding, generation, and editing using a novel continuous speech tokenizer that integrates semantic and acoustic features, enabling free-form speech editing guided by natural language instructions.

Details

Motivation: Existing speech models suffer from competing requirements on token representations between understanding and generation tasks, preventing instruction-based free-form speech editing.

Method: Developed MingTok-Audio unified continuous speech tokenizer that integrates semantic and acoustic features, then built Ming-UniAudio speech language model and Ming-UniAudio-Edit for free-form speech editing guided by natural language instructions.

Result: Set new SOTA on 8/12 metrics on ContextASR benchmark, achieved Seed-TTS-WER of 0.95 for Chinese voice cloning, and created first comprehensive benchmark for instruction-based free-form speech editing.

Conclusion: The unified framework successfully bridges the gap between speech understanding and generation, enabling universal free-form speech editing without timestamp conditions, with all components open-sourced to advance unified audio processing.

Abstract: Existing speech models suffer from competing requirements on token representations by understanding and generation tasks. This discrepancy in representation prevents speech language models from performing instruction-based free-form editing. To solve this challenge, we introduce a novel framework that unifies speech understanding, generation, and editing. The core of our unified model is a unified continuous speech tokenizer MingTok-Audio, the first continuous tokenizer to effectively integrate semantic and acoustic features, which makes it suitable for both understanding and generation tasks. Based on this unified continuous audio tokenizer, we developed the speech language model Ming-UniAudio, which achieved a balance between generation and understanding capabilities. Ming-UniAudio sets new state-of-the-art (SOTA) records on 8 out of 12 metrics on the ContextASR benchmark. Notably, for Chinese voice cloning, it achieves a highly competitive Seed-TTS-WER of 0.95. Leveraging this foundational model, we further trained a dedicated speech editing model Ming-UniAudio-Edit, the first speech language model that enables universal, free-form speech editing guided solely by natural language instructions, handling both semantic and acoustic modifications without timestamp condition. To rigorously assess the editing capability and establish a foundation for future research, we introduce Ming-Freeform-Audio-Edit, the first comprehensive benchmark tailored for instruction-based free-form speech editing, featuring diverse scenarios and evaluation dimensions spanning semantic correctness, acoustic quality, and instruction alignment. We open-sourced the continuous audio tokenizer, the unified foundational model, and the free-form instruction-based editing model to facilitate the development of unified audio understanding, generation, and manipulation.

[3] Persian Musical Instruments Classification Using Polyphonic Data Augmentation

Diba Hadi Esfangereh, Mohammad Hossein Sameti, Sepehr Harfi Moridani, Leili Javidpour, Mahdieh Soleymani Baghshah

Main category: cs.CL

TL;DR: Unable to fetch paper details due to HTTP 503 error from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Methodology information not accessible due to server error

Result: No results available - arXiv API returned service unavailable error

Conclusion: Unable to analyze paper - technical issue prevented content retrieval

Abstract: Failed to fetch summary for 2511.05717: Page request resulted in HTTP 503 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05717&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[4] Retracing the Past: LLMs Emit Training Data When They Get Lost

Myeongseob Ko, Nikhil Reddy Billa, Adam Nguyen, Charles Fleming, Ming Jin, Ruoxi Jia

Main category: cs.CL

TL;DR: Confusion-Inducing Attacks (CIA) is a new framework that extracts memorized training data from LLMs by systematically maximizing model uncertainty through consecutive high-entropy states, outperforming existing methods.

Details

Motivation: Existing data extraction methods have limited success and provide little insight into memorization leakage drivers. CIA addresses privacy and copyright concerns by offering a principled approach to understand and exploit memorization vulnerabilities.

Method: CIA optimizes input snippets to induce sustained high-entropy states that precede memorized text emission. For aligned LLMs, Mismatched Supervised Fine-tuning weakens alignment and induces targeted confusion to increase attack susceptibility.

Result: Experiments show CIA outperforms existing baselines in extracting verbatim and near-verbatim training data from various unaligned and aligned LLMs without requiring prior knowledge of training data.

Conclusion: The findings reveal persistent memorization risks across LLMs and provide a systematic method for assessing these vulnerabilities, highlighting ongoing privacy and copyright concerns despite alignment efforts.

Abstract: The memorization of training data in large language models (LLMs) poses significant privacy and copyright concerns. Existing data extraction methods, particularly heuristic-based divergence attacks, often exhibit limited success and offer limited insight into the fundamental drivers of memorization leakage. This paper introduces Confusion-Inducing Attacks (CIA), a principled framework for extracting memorized data by systematically maximizing model uncertainty. We empirically demonstrate that the emission of memorized text during divergence is preceded by a sustained spike in token-level prediction entropy. CIA leverages this insight by optimizing input snippets to deliberately induce this consecutive high-entropy state. For aligned LLMs, we further propose Mismatched Supervised Fine-tuning (SFT) to simultaneously weaken their alignment and induce targeted confusion, thereby increasing susceptibility to our attacks. Experiments on various unaligned and aligned LLMs demonstrate that our proposed attacks outperform existing baselines in extracting verbatim and near-verbatim training data without requiring prior knowledge of the training data. Our findings highlight persistent memorization risks across various LLMs and offer a more systematic method for assessing these vulnerabilities.

[5] Beyond One-Size-Fits-All: Personalized Harmful Content Detection with In-Context Learning

Rufan Zhang, Lin Zhang, Xianghang Mi

Main category: cs.CL

TL;DR: A novel framework using in-context learning with foundation models for unified content moderation across toxicity, spam, and negative sentiment detection, enabling lightweight personalization without model retraining.

Details

Motivation: Current moderation systems are centralized, task-specific, lack transparency, and ignore user preferences, making them unsuitable for privacy-sensitive or decentralized environments.

Method: Leverages in-context learning with foundation models to unify detection across binary, multi-class, and multi-label settings, allowing personalization through simple prompt-based interventions.

Result: Foundation models achieve strong cross-task generalization matching task-specific models, effective personalization with just one example, and enhanced robustness with label definitions/rationales.

Conclusion: Demonstrates a shift beyond one-size-fits-all moderation, establishing ICL as a practical, privacy-preserving, and adaptable pathway for user-centric content safety systems.

Abstract: The proliferation of harmful online content–e.g., toxicity, spam, and negative sentiment–demands robust and adaptable moderation systems. However, prevailing moderation systems are centralized and task-specific, offering limited transparency and neglecting diverse user preferences–an approach ill-suited for privacy-sensitive or decentralized environments. We propose a novel framework that leverages in-context learning (ICL) with foundation models to unify the detection of toxicity, spam, and negative sentiment across binary, multi-class, and multi-label settings. Crucially, our approach enables lightweight personalization, allowing users to easily block new categories, unblock existing ones, or extend detection to semantic variations through simple prompt-based interventions–all without model retraining. Extensive experiments on public benchmarks (TextDetox, UCI SMS, SST2) and a new, annotated Mastodon dataset reveal that: (i) foundation models achieve strong cross-task generalization, often matching or surpassing task-specific fine-tuned models; (ii) effective personalization is achievable with as few as one user-provided example or definition; and (iii) augmenting prompts with label definitions or rationales significantly enhances robustness to noisy, real-world data. Our work demonstrates a definitive shift beyond one-size-fits-all moderation, establishing ICL as a practical, privacy-preserving, and highly adaptable pathway for the next generation of user-centric content safety systems. To foster reproducibility and facilitate future research, we publicly release our code on GitHub and the annotated Mastodon dataset on Hugging Face.

[6] MCP4IFC: IFC-Based Building Design Using Large Language Models

Bharathi Kannan Nithyanantham, Tobias Sesterhenn, Ashwin Nedungadi, Sergio Peral Garijo, Janis Zenkner, Christian Bartelt, Stefan Lüdtke

Main category: cs.CL

TL;DR: MCP4IFC is an open-source framework that enables LLMs to manipulate IFC data through the Model Context Protocol, providing BIM tools for querying, creating, and modifying building elements with dynamic code generation for complex tasks.

Details

Motivation: To bring generative AI into the AEC field by enabling natural language instructions to be translated into actions on standardized IFC data models.

Method: Uses Model Context Protocol (MCP) with BIM tools including scene querying, predefined functions for building elements, and dynamic code-generation combining in-context learning with RAG for complex tasks.

Result: LLMs using the framework successfully perform complex tasks from building simple houses to querying and editing existing IFC data.

Conclusion: The framework provides a foundation for AI-assisted modeling workflows and encourages research in LLM-driven BIM design, released as open-source.

Abstract: Bringing generative AI into the architecture, engineering and construction (AEC) field requires systems that can translate natural language instructions into actions on standardized data models. We present MCP4IFC, a comprehensive open-source framework that enables Large Language Models (LLMs) to directly manipulate Industry Foundation Classes (IFC) data through the Model Context Protocol (MCP). The framework provides a set of BIM tools, including scene querying tools for information retrieval, predefined functions for creating and modifying common building elements, and a dynamic code-generation system that combines in-context learning with retrieval-augmented generation (RAG) to handle tasks beyond the predefined toolset. Experiments demonstrate that an LLM using our framework can successfully perform complex tasks, from building a simple house to querying and editing existing IFC data. Our framework is released as open-source to encourage research in LLM-driven BIM design and provide a foundation for AI-assisted modeling workflows. Our code is available at https://show2instruct.github.io/mcp4ifc/.

Kunxi Li, Yufan Xiong, Zhonghua Jiang, Yiyun Zhou, Zhaode Wang, Chengfei Lv, Shengyu Zhang

Main category: cs.CL

TL;DR: FlowMM is an adaptive KV cache merging framework for multimodal LLMs that uses cross-modal information flow to guide layer-specific merging strategies, reducing KV cache memory by 80-95% and latency by 1.3-1.8x while maintaining performance.

Details

Motivation: Traditional KV cache eviction strategies degrade generation quality by discarding critical KV-pairs, while existing KV merging approaches are limited in multimodal scenarios due to distributional and attentional biases across modality tokens.

Method: FlowMM leverages cross-modal information flow to apply layer-specific merging strategies and introduces a sensitivity-adaptive token matching mechanism that jointly evaluates token similarity and task-critical sensitivity.

Result: Extensive experiments show FlowMM reduces KV cache memory by 80% to 95% and decoding latency by 1.3-1.8x while maintaining competitive task performance across diverse MLLMs.

Conclusion: FlowMM effectively addresses multimodal KV cache optimization by adaptively merging tokens based on cross-modal information flow and sensitivity assessment, achieving significant memory and latency improvements without compromising generation quality.

Abstract: Traditional KV cache eviction strategies, which discard less critical KV-pairs based on attention scores, often degrade generation quality, causing context loss or hallucinations. Recent efforts shift toward KV merging, merging eviction tokens with retention tokens based on similarity. However, in multimodal scenarios, distributional biases across modality tokens and attentional biases in cross-modal interactions limit its effectiveness. This work introduces FlowMM, an adaptive framework for cross-modal information flow-guided multimodal KV cache merging. FlowMM leverages cross-modal information flow to dynamically apply layer-specific merging strategies, capturing modality-specific patterns while preserving contextual integrity. Furthermore, we introduce a sensitivity-adaptive token matching mechanism that jointly evaluates token similarity and task-critical sensitivity, merging low-risk tokens while safeguarding high-sensitivity ones. Extensive experiments across diverse leading MLLMs show that FlowMM reduces KV cache memory by 80% to 95% and decoding latency by 1.3-1.8x, while maintaining competitive task performance.

[8] ReMoD: Rethinking Modality Contribution in Multimodal Stance Detection via Dual Reasoning

Bingbing Wang, Zhengda Jin, Bin Liang, Jing Li, Ruifeng Xu

Main category: cs.CL

TL;DR: ReMoD is a dual-reasoning framework for multimodal stance detection that addresses modality bias by integrating intuitive and reflective reasoning to dynamically weight modality contributions based on their actual expressive power.

Details

Motivation: Existing multimodal stance detection methods simply fuse information from various modalities, overlooking varying contributions of stance expression from different modalities, which can introduce stance misunderstanding noises into the learning process.

Method: ReMoD integrates experience-driven intuitive reasoning to capture initial stance cues and deliberate reflective reasoning to adjust for modality biases. It uses Modality Experience Pool (MEP) and Semantic Experience Pool (SEP) with two reasoning chains: Modality-CoT for adaptive fusion and Semantic-CoT for deeper contextual insights.

Result: Extensive experiments on the public MMSD benchmark demonstrate that ReMoD significantly outperforms most baseline models and exhibits strong generalization capabilities.

Conclusion: The proposed dual-reasoning paradigm effectively addresses modality bias in multimodal stance detection by dynamically weighting modality contributions based on their actual expressive power, leading to more robust and context-aware stance decisions.

Abstract: Multimodal Stance Detection (MSD) is a crucial task for understanding public opinion on social media. Existing work simply fuses information from various modalities to learn stance representations, overlooking the varying contributions of stance expression from different modalities. Therefore, stance misunderstanding noises may be drawn into the stance learning process due to the risk of learning errors by rough modality combination. To address this, we get inspiration from the dual-process theory of human cognition and propose ReMoD, a framework that Rethinks Modality contribution of stance expression through a Dual-reasoning paradigm. ReMoD integrates experience-driven intuitive reasoning to capture initial stance cues with deliberate reflective reasoning to adjust for modality biases, refine stance judgments, and thereby dynamically weight modality contributions based on their actual expressive power for the target stance. Specifically, the intuitive stage queries the Modality Experience Pool (MEP) and Semantic Experience Pool (SEP) to form an initial stance hypothesis, prioritizing historically impactful modalities. This hypothesis is then refined in the reflective stage via two reasoning chains: Modality-CoT updates MEP with adaptive fusion strategies to amplify relevant modalities, while Semantic-CoT refines SEP with deeper contextual insights of stance semantics. These dual experience structures are continuously refined during training and recalled at inference to guide robust and context-aware stance decisions. Extensive experiments on the public MMSD benchmark demonstrate that our ReMoD significantly outperforms most baseline models and exhibits strong generalization capabilities.

[9] Future of AI Models: A Computational perspective on Model collapse

Trivikram Satharasi, S Sitharama Iyengar

Main category: cs.CL

TL;DR: This study analyzes the impact of AI-generated content on linguistic diversity by examining semantic similarity in English Wikipedia from 2013-2025, finding exponential increases in similarity after LLM adoption that threatens data richness and model generalization.

Details

Motivation: As synthetic AI content dominates the web (30-40% of active corpus), recursive training risks eroding linguistic and semantic diversity through Model Collapse, where AI models trained on AI-generated content lose diversity and quality.

Method: Quantifies collapse onset by examining year-wise semantic similarity in English-language Wikipedia (filtered Common Crawl) from 2013 to 2025 using Transformer embeddings and cosine similarity metrics.

Result: Shows steady rise in similarity before public LLM adoption (driven by early RNN/LSTM translation), with exponential rise after LLM adoption. Fluctuations reflect irreducible linguistic diversity and sampling error.

Conclusion: Provides data-driven estimate of when recursive AI contamination may significantly threaten data richness and model generalization, highlighting the urgent need to address Model Collapse in AI training pipelines.

Abstract: Artificial Intelligence, especially Large Language Models (LLMs), has transformed domains such as software engineering, journalism, creative writing, academia, and media (Naveed et al. 2025; arXiv:2307.06435). Diffusion models like Stable Diffusion generate high-quality images and videos from text. Evidence shows rapid expansion: 74.2% of newly published webpages now contain AI-generated material (Ryan Law 2025), 30-40% of the active web corpus is synthetic (Spennemann 2025; arXiv:2504.08755), 52% of U.S. adults use LLMs for writing, coding, or research (Staff 2025), and audits find AI involvement in 18% of financial complaints and 24% of press releases (Liang et al. 2025). The underlying neural architectures, including Transformers (Vaswani et al. 2023; arXiv:1706.03762), RNNs, LSTMs, GANs, and diffusion networks, depend on large, diverse, human-authored datasets (Shi & Iyengar 2019). As synthetic content dominates, recursive training risks eroding linguistic and semantic diversity, producing Model Collapse (Shumailov et al. 2024; arXiv:2307.15043; Dohmatob et al. 2024; arXiv:2402.07712). This study quantifies and forecasts collapse onset by examining year-wise semantic similarity in English-language Wikipedia (filtered Common Crawl) from 2013 to 2025 using Transformer embeddings and cosine similarity metrics. Results reveal a steady rise in similarity before public LLM adoption, likely driven by early RNN/LSTM translation and text-normalization pipelines, though modest due to a smaller scale. Observed fluctuations reflect irreducible linguistic diversity, variable corpus size across years, finite sampling error, and an exponential rise in similarity after the public adoption of LLM models. These findings provide a data-driven estimate of when recursive AI contamination may significantly threaten data richness and model generalization.

[10] MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making

Zhi Rui Tam, Yun-Nung Chen

Main category: cs.CL

TL;DR: Audio LLMs show significant modality bias in clinical recommendations, with surgical recommendations varying up to 35% between audio and text inputs, and age disparities up to 12% between young and elderly voices.

Details

Motivation: To evaluate vulnerabilities introduced by paralinguistic cues in audio interactions as LLMs transition to clinical audio settings, potentially perpetuating healthcare disparities.

Method: Evaluated models on 170 clinical cases synthesized into speech from 36 distinct voice profiles spanning age, gender, and emotion variations, comparing audio vs text inputs.

Result: Severe modality bias found with surgical recommendations varying up to 35% between audio and text, age disparities up to 12%, chain-of-thought prompting didn’t eliminate age bias, explicit reasoning eliminated gender bias, emotion impact undetectable due to poor recognition.

Conclusion: Audio LLMs make clinical decisions based on voice characteristics rather than medical evidence, risking healthcare disparities; bias-aware architectures are urgently needed before clinical deployment.

Abstract: As large language models transition from text-based interfaces to audio interactions in clinical settings, they might introduce new vulnerabilities through paralinguistic cues in audio. We evaluated these models on 170 clinical cases, each synthesized into speech from 36 distinct voice profiles spanning variations in age, gender, and emotion. Our findings reveal a severe modality bias: surgical recommendations for audio inputs varied by as much as 35% compared to identical text-based inputs, with one model providing 80% fewer recommendations. Further analysis uncovered age disparities of up to 12% between young and elderly voices, which persisted in most models despite chain-of-thought prompting. While explicit reasoning successfully eliminated gender bias, the impact of emotion was not detected due to poor recognition performance. These results demonstrate that audio LLMs are susceptible to making clinical decisions based on a patient’s voice characteristics rather than medical evidence, a flaw that risks perpetuating healthcare disparities. We conclude that bias-aware architectures are essential and urgently needed before the clinical deployment of these models.

[11] Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon

Main category: cs.CL

TL;DR: T-SAEs improve dictionary learning for LLM interpretability by using temporal contrastive loss to encourage consistent feature activations across adjacent tokens, enabling better separation of semantic from syntactic features.

Details

Motivation: Current dictionary learning methods like Sparse Autoencoders fail to capture meaningful conceptual information and instead focus on shallow, token-specific patterns, ignoring the rich linguistic structure of language.

Method: Introduce Temporal Sparse Autoencoders (T-SAEs) with a novel contrastive loss that encourages consistent activations of high-level features over adjacent tokens, leveraging the insight that semantic content has long-range dependencies while syntactic information is local.

Result: T-SAEs recover smoother, more coherent semantic concepts across multiple datasets and models without sacrificing reconstruction quality, exhibiting clear semantic structure despite being trained without explicit semantic signal.

Conclusion: T-SAEs provide a new pathway for unsupervised interpretability in language models by effectively disentangling semantic from syntactic features through temporal consistency constraints.

Abstract: Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they suffer from a variety of problems, including a systematic failure to capture the rich conceptual information that drives linguistic understanding. Instead, they exhibit a bias towards shallow, token-specific, or noisy features, such as “the phrase ‘The’ at the start of sentences”. In this work, we propose that this is due to a fundamental issue with how dictionary learning methods for LLMs are trained. Language itself has a rich, well-studied structure spanning syntax, semantics, and pragmatics; however, current unsupervised methods largely ignore this linguistic knowledge, leading to poor feature discovery that favors superficial patterns over meaningful concepts. We focus on a simple but important aspect of language: semantic content has long-range dependencies and tends to be smooth over a sequence, whereas syntactic information is much more local. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality. Strikingly, they exhibit clear semantic structure despite being trained without explicit semantic signal, offering a new pathway for unsupervised interpretability in language models.

[12] Sample-Efficient Language Modeling with Linear Attention and Lightweight Enhancements

Patrick Haller, Jonas Golde, Alan Akbik

Main category: cs.CL

TL;DR: BLaLM model uses linear-time mLSTM token mixer instead of self-attention, combined with sliding window attention and Muon optimizer, achieving improved zero-shot performance and stable convergence in low-resource language modeling.

Details

Motivation: To develop sample-efficient language modeling techniques under resource constraints, specifically for the BabyLM 2025 shared task, focusing on architectural improvements that don't rely on scale.

Method: Replaces self-attention with linear-time mLSTM token mixer; incorporates lightweight enhancements including short convolutions, sliding window attention with dynamic modulation, and Hedgehog feature maps; uses Muon optimizer instead of AdamW; curates high-quality corpus emphasizing readability and pedagogical structure.

Result: Linear attention combined with sliding window attention consistently improves zero-shot performance; Muon optimizer stabilizes convergence and reduces perplexity compared to AdamW; effective performance demonstrated across both STRICT and STRICT-SMALL tracks.

Conclusion: The study demonstrates effective strategies for efficient language modeling without relying on scale, highlighting the value of architectural innovations and optimization techniques in low-resource settings.

Abstract: We study architectural and optimization techniques for sample-efficient language modeling under the constraints of the BabyLM 2025 shared task. Our model, BLaLM, replaces self-attention with a linear-time mLSTM token mixer and explores lightweight enhancements, including short convolutions, sliding window attention with dynamic modulation, and Hedgehog feature maps. To support training in low-resource settings, we curate a high-quality corpus emphasizing readability and pedagogical structure. Experiments across both STRICT and STRICT-SMALL tracks show that (1) linear attention combined with sliding window attention consistently improves zero-shot performance, and (2) the Muon optimizer stabilizes convergence and reduces perplexity over AdamW. These results highlight effective strategies for efficient language modeling without relying on scale.

[13] UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8

Preston Firestone, Shubham Ugare, Gagandeep Singh, Sasa Misailovic

Main category: cs.CL

TL;DR: Tokenizers using byte-level vocabularies can produce invalid UTF-8 sequences, causing real-world bugs in language model applications, and formal analysis shows incremental decoding differs from batch decoding.

Details

Motivation: To address the problem that byte-level tokenizers can generate invalid UTF-8 sequences, which breaks applications assuming valid UTF-8 input, and to formalize the tokenization process to understand and predict these issues.

Method: Formalized tokenization using monoid theory, proved that tokenizers with ill-formed UTF-8 tokens can produce invalid sequences, demonstrated differences between incremental and batch decoding, and evaluated real-world case studies of major models and systems.

Result: Formal proof that tokenizers with invalid UTF-8 tokens always risk producing invalid sequences, empirical evidence of real-world bugs in foundation models and serving engines, and identification of differences between incremental and batch decoding approaches.

Conclusion: Byte-level tokenization introduces fundamental UTF-8 validity issues that cannot be fully mitigated, requiring applications to handle potential breakage and highlighting the trade-offs between code point and byte-level approaches.

Abstract: Subword tokenization segments input text according to a pre-defined vocabulary to feed it into a language model; the language model, in turn, generates a sequence made from this same vocabulary. The members of the vocabulary can be built of code points or bytes. Using code points means that all members of the vocabulary are valid UTF-8 characters. However, it also requires thousands of initial members to achieve acceptable coverage of inputs. Beginning with bytes, on the contrary, avoids out-of-vocabulary errors with only 256 initial members of the vocabulary, but the members of the vocabulary and sequences of them are not guaranteed to be valid UTF-8. Sequences that are not valid UTF-8 break code that assumes its input to be valid UTF-8. Applications of language models must account for the breakage thereby introduced. In this paper, we formalize tokenization using monoid theory and prove that tokenizers whose vocabularies contain tokens that are ill-formed UTF-8 can always produce sequences that are ill-formed UTF-8. We demonstrate formally that attempting to incrementally convert tokens back to a string and interpret the results as UTF-8 gives different results than converting the whole sequence of tokens at once. This formal result predicts real-world bugs: we evaluate mitigations for the problem identified and provide case studies of major foundation models, serving engines, and constrained generation systems.

[14] Optimizing Diversity and Quality through Base-Aligned Model Collaboration

Yichen Wang, Chenghao Yang, Tenghao Huang, Muhao Chen, Jonathan May, Mina Lee

Main category: cs.CL

TL;DR: BACo is an inference-time framework that combines base and aligned LLMs at token level to improve output diversity while maintaining quality, achieving 21.3% joint improvement in diversity and quality.

Details

Motivation: Alignment improves LLM output quality but reduces diversity, leading to similar outputs across generations. Existing diversity methods often degrade quality or require costly decoding/post-training.

Method: Token-level model collaboration framework that dynamically routes decoding between base and aligned LLMs based on next-token prediction uncertainty and semantic role of predicted content.

Result: Consistently surpasses state-of-the-art inference-time baselines across 3 open-ended generation tasks and 13 metrics, with 21.3% joint improvement in diversity and quality. Human evaluations confirm improvements.

Conclusion: Collaboration between base and aligned models can effectively optimize and control both diversity and quality in LLM outputs.

Abstract: Alignment has greatly improved large language models (LLMs)’ output quality at the cost of diversity, yielding highly similar outputs across generations. We propose Base-Aligned Model Collaboration (BACo), an inference-time token-level model collaboration framework that dynamically combines a base LLM with its aligned counterpart to optimize diversity and quality. Inspired by prior work (Fei et al., 2025), BACo employs routing strategies that determine, at each token, from which model to decode based on next-token prediction uncertainty and predicted contents’ semantic role. Prior diversity-promoting methods, such as retraining, prompt engineering, and multi-sampling methods, improve diversity but often degrade quality or require costly decoding or post-training. In contrast, BACo achieves both high diversity and quality post hoc within a single pass, while offering strong controllability. We explore a family of routing strategies, across three open-ended generation tasks and 13 metrics covering diversity and quality, BACo consistently surpasses state-of-the-art inference-time baselines. With our best router, BACo achieves a 21.3% joint improvement in diversity and quality. Human evaluations also mirror these improvements. The results suggest that collaboration between base and aligned models can optimize and control diversity and quality.

[15] OckBench: Measuring the Efficiency of LLM Reasoning

Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu

Main category: cs.CL

TL;DR: OckBench is a new benchmark that evaluates both accuracy and token efficiency for reasoning and coding tasks, revealing significant efficiency differences among models with comparable accuracy.

Details

Motivation: Existing benchmarks focus only on accuracy and output quality, ignoring token efficiency which significantly impacts latency, cost, and energy consumption in real systems.

Method: Introduces OckBench, a model-agnostic and hardware-agnostic benchmark that measures both accuracy and token count across reasoning and coding tasks, comparing multiple open- and closed-source models.

Result: Experiments show that models with comparable accuracy differ wildly in token consumption, revealing efficiency variance as a significant but neglected differentiation factor. Pareto frontiers demonstrate accuracy-efficiency trade-offs.

Conclusion: Tokens should not be treated as “free” resources. OckBench provides a unified platform for measuring and guiding research in token-efficient reasoning, advocating for an evaluation paradigm shift.

Abstract: Large language models such as GPT-4, Claude 3, and the Gemini series have improved automated reasoning and code generation. However, existing benchmarks mainly focus on accuracy and output quality, and they ignore an important factor: decoding token efficiency. In real systems, generating 10,000 tokens versus 100,000 tokens leads to large differences in latency, cost, and energy. In this work, we introduce OckBench, a model-agnostic and hardware-agnostic benchmark that evaluates both accuracy and token count for reasoning and coding tasks. Through experiments comparing multiple open- and closed-source models, we uncover that many models with comparable accuracy differ wildly in token consumption, revealing that efficiency variance is a neglected but significant axis of differentiation. We further demonstrate Pareto frontiers over the accuracy-efficiency plane and argue for an evaluation paradigm shift: we should no longer treat tokens as “free” to multiply. OckBench provides a unified platform for measuring, comparing, and guiding research in token-efficient reasoning. Our benchmarks are available at https://ockbench.github.io/ .

[16] In-Context Learning Without Copying

Kerem Sahin, Sheridan Feucht, Adam Belfki, Jannik Brinkmann, Aaron Mueller, David Bau, Chris Wendler

Main category: cs.CL

TL;DR: Transformers can still develop in-context learning capabilities even when inductive copying is suppressed through loss omission of tokens predictable by induction heads.

Details

Motivation: To investigate whether inductive copying is essential for transformers to acquire in-context learning capabilities, challenging the hypothesis that induction heads are a prerequisite for complex ICL.

Method: Proposed Hapax - a training setting that omits loss contribution from tokens that can be correctly predicted by induction heads, effectively suppressing inductive copying while training on abstractive ICL tasks.

Result: Despite 31.7% of tokens being omitted from loss, performance on abstractive ICL tasks remained comparable and surpassed vanilla model on 13 of 21 tasks. Models developed fewer and weaker induction heads but preserved ICL capabilities.

Conclusion: Inductive copying is not essential for learning abstractive in-context learning mechanisms, as transformers can develop ICL capabilities even when inductive copying is suppressed.

Abstract: Induction heads are attention heads that perform inductive copying by matching patterns from earlier context and copying their continuations verbatim. As models develop induction heads, they often experience a sharp drop in training loss, a phenomenon cited as evidence that induction heads may serve as a prerequisite for more complex in-context learning (ICL) capabilities. In this work, we ask whether transformers can still acquire ICL capabilities when inductive copying is suppressed. We propose Hapax, a setting where we omit the loss contribution of any token that can be correctly predicted by induction heads. Despite a significant reduction in inductive copying, performance on abstractive ICL tasks (i.e., tasks where the answer is not contained in the input context) remains comparable and surpasses the vanilla model on 13 of 21 tasks, even though 31.7% of tokens are omitted from the loss. Furthermore, our model achieves lower loss values on token positions that cannot be predicted correctly by induction heads. Mechanistic analysis further shows that models trained with Hapax develop fewer and weaker induction heads but still preserve ICL capabilities. Taken together, our findings indicate that inductive copying is not essential for learning abstractive ICL mechanisms.

[17] CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition

Hung-Yang Sung, Chien-Chun Wang, Kuan-Tang Huang, Tien-Hong Lo, Yu-Sheng Tsao, Yung-Chang Hsu, Berlin Chen

Main category: cs.CL

TL;DR: CLiFT-ASR is a cross-lingual fine-tuning framework that progressively adapts Mandarin HuBERT models to Taiwanese Hokkien ASR using a two-stage process with phonetic and character annotations, achieving 24.88% CER reduction.

Details

Motivation: Low-resource languages like Taiwanese Hokkien lack annotated data, and existing approaches using either Han-character transcriptions or romanization alone fail to capture both phonetic/tonal details and lexical/syntactic coverage.

Method: Two-stage progressive adaptation: first learns acoustic and tonal representations from phonetic Tai-lo annotations, then captures vocabulary and syntax from Han-character transcriptions, building on Mandarin HuBERT models.

Result: Achieves 24.88% relative reduction in character error rate (CER) compared to strong baselines on TAT-MOE corpus, demonstrating effective alignment between speech sounds and orthographic structures.

Conclusion: CLiFT-ASR provides an effective and parameter-efficient solution for Taiwanese Hokkien ASR with potential applicability to other low-resource language scenarios.

Abstract: Automatic speech recognition (ASR) for low-resource languages such as Taiwanese Hokkien is difficult due to the scarcity of annotated data. However, direct fine-tuning on Han-character transcriptions often fails to capture detailed phonetic and tonal cues, while training only on romanization lacks lexical and syntactic coverage. In addition, prior studies have rarely explored staged strategies that integrate both annotation types. To address this gap, we present CLiFT-ASR, a cross-lingual fine-tuning framework that builds on Mandarin HuBERT models and progressively adapts them to Taiwanese Hokkien. The framework employs a two-stage process in which it first learns acoustic and tonal representations from phonetic Tai-lo annotations and then captures vocabulary and syntax from Han-character transcriptions. This progressive adaptation enables effective alignment between speech sounds and orthographic structures. Experiments on the TAT-MOE corpus demonstrate that CLiFT-ASR achieves a 24.88% relative reduction in character error rate (CER) compared with strong baselines. The results indicate that CLiFT-ASR provides an effective and parameter-efficient solution for Taiwanese Hokkien ASR and that it has potential to benefit other low-resource language scenarios.

[18] Multi-Scale Feature Fusion and Graph Neural Network Integration for Text Classification with Large Language Models

Xiangchen Song, Yulin Huang, Jinxu Guo, Yuchen Liu, Yaxuan Luan

Main category: cs.CL

TL;DR: A hybrid text classification method combining LLM feature extraction, multi-scale feature pyramid fusion, and graph neural networks for structured semantic modeling, achieving superior performance on multiple metrics.

Details

Motivation: To enhance text classification performance in complex semantic contexts by integrating deep semantic representations, multi-scale feature fusion, and structured modeling of semantic relationships.

Method: Three-stage approach: 1) Deep feature extraction using large language models for contextual dependencies, 2) Multi-scale feature fusion via feature pyramids to balance global and local information, 3) Graph neural networks for structured modeling of semantic relations and logical dependencies.

Result: Significantly outperforms existing models on ACC, F1-Score, AUC, and Precision metrics, demonstrating effectiveness and stability in robustness alignment experiments.

Conclusion: The integrated framework successfully balances global/local information and semantics/structure, providing new perspectives for multi-scale feature fusion and structured semantic modeling in text classification.

Abstract: This study investigates a hybrid method for text classification that integrates deep feature extraction from large language models, multi-scale fusion through feature pyramids, and structured modeling with graph neural networks to enhance performance in complex semantic contexts. First, the large language model captures contextual dependencies and deep semantic representations of the input text, providing a rich feature foundation for subsequent modeling. Then, based on multi-level feature representations, the feature pyramid mechanism effectively integrates semantic features of different scales, balancing global information and local details to construct hierarchical semantic expressions. Furthermore, the fused features are transformed into graph representations, and graph neural networks are employed to capture latent semantic relations and logical dependencies in the text, enabling comprehensive modeling of complex interactions among semantic units. On this basis, the readout and classification modules generate the final category predictions. The proposed method demonstrates significant advantages in robustness alignment experiments, outperforming existing models on ACC, F1-Score, AUC, and Precision, which verifies the effectiveness and stability of the framework. This study not only constructs an integrated framework that balances global and local information as well as semantics and structure, but also provides a new perspective for multi-scale feature fusion and structured semantic modeling in text classification tasks.

[19] Language Generation: Complexity Barriers and Implications for Learning

Marcelo Arenas, Pablo Barceló, Luis Cofré, Alexander Kozachinskiy

Main category: cs.CL

TL;DR: Language generation is theoretically possible with enough examples, but practically infeasible for even simple language families due to extraordinarily large sample requirements.

Details

Motivation: To bridge the gap between theoretical possibility of language generation and its practical feasibility, examining why modern language models succeed despite theoretical limitations.

Method: Analyze sample complexity for language generation in simple language families (regular and context-free languages) and examine computability bounds.

Result: Number of examples required for successful generation is extraordinarily large, sometimes unbounded by any computable function, revealing substantial gap between theory and practice.

Conclusion: Explaining empirical success of modern language models requires considering structural properties of natural language that enable practical generation, beyond theoretical guarantees.

Abstract: Kleinberg and Mullainathan showed that, in principle, language generation is always possible: with sufficiently many positive examples, a learner can eventually produce sentences indistinguishable from those of a target language. However, the existence of such a guarantee does not speak to its practical feasibility. In this work, we show that even for simple and well-studied language families – such as regular and context-free languages – the number of examples required for successful generation can be extraordinarily large, and in some cases not bounded by any computable function. These results reveal a substantial gap between theoretical possibility and efficient learnability. They suggest that explaining the empirical success of modern language models requires a refined perspective – one that takes into account structural properties of natural language that make effective generation possible in practice.

[20] DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning

Yaxuan Wang, Chris Yuhao Liu, Quan Liu, Jinglong Pang, Wei Wei, Yujia Bao, Yang Liu

Main category: cs.CL

TL;DR: DRAGON is a reasoning-based framework that uses in-context chain-of-thought instructions to enable unlearning in LLMs without requiring retain data or model fine-tuning.

Details

Motivation: Existing unlearning methods require training data and retain data, which are often unavailable in real-world scenarios, creating a gap for practical unlearning solutions.

Method: DRAGON uses a lightweight detection module to identify forget-worthy prompts and routes them through a dedicated CoT guard model for safe in-context intervention, leveraging LLMs’ inherent instruction-following abilities.

Result: Extensive experiments across three unlearning tasks show DRAGON achieves strong unlearning capability, scalability, and practical applicability with novel evaluation metrics.

Conclusion: DRAGON provides an effective, data-free approach to LLM unlearning that works in practical scenarios without requiring model modifications or retain data.

Abstract: Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Most existing approaches rely on fine-tuning to balance unlearning efficiency with general language capabilities. However, these methods typically require training or access to retain data, which is often unavailable in real world scenarios. Although these methods can perform well when both forget and retain data are available, few works have demonstrated equivalent capability in more practical, data-limited scenarios. To overcome these limitations, we propose Detect-Reasoning Augmented GeneratiON (DRAGON), a systematic, reasoning-based framework that utilizes in-context chain-of-thought (CoT) instructions to guard deployed LLMs before inference. Instead of modifying the base model, DRAGON leverages the inherent instruction-following ability of LLMs and introduces a lightweight detection module to identify forget-worthy prompts without any retain data. These are then routed through a dedicated CoT guard model to enforce safe and accurate in-context intervention. To robustly evaluate unlearning performance, we introduce novel metrics for unlearning performance and the continual unlearning setting. Extensive experiments across three representative unlearning tasks validate the effectiveness of DRAGON, demonstrating its strong unlearning capability, scalability, and applicability in practical scenarios.

[21] Compositional Phoneme Approximation for L1-Grounded L2 Pronunciation Training

Jisang Park, Minu Kim, DaYoung Hong, Jongha Lee

Main category: cs.CL

TL;DR: CPA-based L1-grounded pronunciation training uses native language phonemes to approximate L2 sounds, achieving significant improvements in formant accuracy, phoneme recognition, and native-like speech with minimal training.

Details

Motivation: Traditional L2 pronunciation training is slow and effortful because learners map non-native phonemes to similar native-language phonemes, creating interference.

Method: Compositional Phoneme Approximation (CPA) - a feature-based representation technique that approximates L2 sounds using sequences of L1 phonemes for pronunciation training.

Result: 76% in-box formant rate in acoustic analysis, 17.6% relative improvement in phoneme recognition accuracy, and over 80% of speech rated as more native-like with minimal training.

Conclusion: CPA-based L1-grounded training is an effective approach for improving L2 pronunciation by leveraging native language phoneme sequences to approximate target sounds.

Abstract: Learners of a second language (L2) often map non-native phonemes to similar native-language (L1) phonemes, making conventional L2-focused training slow and effortful. To address this, we propose an L1-grounded pronunciation training method based on compositional phoneme approximation (CPA), a feature-based representation technique that approximates L2 sounds with sequences of L1 phonemes. Evaluations with 20 Korean non-native English speakers show that CPA-based training achieves a 76% in-box formant rate in acoustic analysis, 17.6% relative improvement in phoneme recognition accuracy, and over 80% of speech being rated as more native-like, with minimal training. Project page: https://gsanpark.github.io/CPA-Pronunciation.

[22] Quantifying Edits Decay in Fine-tuned LLMs

Yinjie Cheng, Paul Youssef, Christin Seifert, Jörg Schlötterer, Zhixue Zhao

Main category: cs.CL

TL;DR: Fine-tuning impairs knowledge edits in LLMs, with edit survival varying by method and configuration. Selective-layer fine-tuning can effectively remove edits but affects downstream performance.

Details

Motivation: To investigate whether knowledge edits survive fine-tuning, addressing practical concerns about edit persistence (safety risks) and decay (cost implications).

Method: Systematically evaluated 2 editing methods (MEMIT, AlphaEdit) and 3 fine-tuning approaches across 5 LLMs and 3 datasets (232 configurations). Proposed selective-layer fine-tuning.

Result: Edits decay after fine-tuning; AlphaEdit decays more than MEMIT. Fine-tuning edited layers removes edits effectively but hurts performance. Fine-tuning non-edited layers impairs edits more than full fine-tuning.

Conclusion: Knowledge editing evaluation must consider the full LLM pipeline. Selective-layer fine-tuning offers actionable strategies for integrating editing with fine-tuning while managing edit persistence.

Abstract: Knowledge editing has emerged as a lightweight alternative to retraining for correcting or injecting specific facts in large language models (LLMs). Meanwhile, fine-tuning remains the default operation for adapting LLMs to new domains and tasks. Despite their widespread adoption, these two post-training interventions have been studied in isolation, leaving open a crucial question: if we fine-tune an edited model, do the edits survive? This question is motivated by two practical scenarios: removing covert or malicious edits, and preserving beneficial edits. If fine-tuning impairs edits as shown in Figure 1, current KE methods become less useful, as every fine-tuned model would require re-editing, which significantly increases the cost; if edits persist, fine-tuned models risk propagating hidden malicious edits, raising serious safety concerns. To this end, we systematically quantify edits decay after fine-tuning, investigating how fine-tuning affects knowledge editing. We evaluate two state-of-the-art editing methods (MEMIT, AlphaEdit) and three fine-tuning approaches (full-parameter, LoRA, DoRA) across five LLMs and three datasets, yielding 232 experimental configurations. Our results show that edits decay after fine-tuning, with survival varying across configurations, e.g., AlphaEdit edits decay more than MEMIT edits. Further, we propose selective-layer fine-tuning and find that fine-tuning edited layers only can effectively remove edits, though at a slight cost to downstream performance. Surprisingly, fine-tuning non-edited layers impairs more edits than full fine-tuning. Overall, our study establishes empirical baselines and actionable strategies for integrating knowledge editing with fine-tuning, and underscores that evaluating model editing requires considering the full LLM application pipeline.

[23] MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation

Khai Le-Duc, Tuyen Tran, Bach Phan Tat, Nguyen Kim Hai Bui, Quan Dang, Hung-Phong Tran, Thanh-Thuy Nguyen, Ly Nguyen, Tuan-Minh Phan, Thi Thu Phuong Tran, Chris Ngo, Nguyen X. Khanh, Thanh Nguyen-Tang

Main category: cs.CL

TL;DR: This paper presents MultiMed-ST, the first large-scale multilingual speech translation dataset for the medical domain spanning 5 languages, and provides comprehensive analysis including empirical baselines and comparative studies.

Details

Motivation: To enhance patient care by enabling efficient communication across language barriers in medical settings, alleviate specialized workforce shortages, and facilitate improved diagnosis and treatment, especially during pandemics.

Method: Created MultiMed-ST dataset with 290,000 samples spanning Vietnamese, English, German, French, and Chinese; conducted comprehensive analysis including empirical baselines, bilingual-multilingual comparisons, end-to-end vs. cascaded approaches, task-specific vs. multi-task sequence-to-sequence models, code-switch analysis, and quantitative-qualitative error analysis.

Result: MultiMed-ST is the largest medical machine translation dataset and largest many-to-many multilingual speech translation dataset across all domains; comprehensive empirical analysis provides foundational insights for medical speech translation research.

Conclusion: The work establishes the first systematic study on medical speech translation, providing valuable resources and comprehensive analysis that will advance multilingual communication in healthcare settings and support future research in this critical domain.

Abstract: Multilingual speech translation (ST) and machine translation (MT) in the medical domain enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present the first systematic study on medical ST, to our best knowledge, by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, and Simplified/Traditional Chinese, together with the models. With 290,000 samples, this is the largest medical MT dataset and the largest many-to-many multilingual ST among all domains. Secondly, we present the most comprehensive ST analysis in the field’s history, to our best knowledge, including: empirical baselines, bilingual-multilingual comparative study, end-to-end vs. cascaded comparative study, task-specific vs. multi-task sequence-to-sequence comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST

[24] Retrieval-Augmented Generation in Medicine: A Scoping Review of Technical Implementations, Clinical Applications, and Ethical Considerations

Rui Yang, Matthew Yu Heng Wong, Huitao Li, Xin Li, Wentao Zhu, Jingchi Liao, Kunyu Yu, Jonathan Chong Kai Liew, Weihao Xuan, Yingjian Chen, Yuhe Ke, Jasmine Chiat Ling Ong, Douglas Teodoro, Chuan Hong, Daniel Shi Wei Ting, Nan Liu

Main category: cs.CL

TL;DR: This paper reviews retrieval-augmented generation (RAG) applications in medicine, finding the field is still in early stages with limitations in clinical validation, cross-linguistic adaptation, and support for low-resource settings.

Details

Motivation: The rapid growth of medical knowledge and complexity of clinical practice pose challenges, and while LLMs show value, they have inherent limitations that RAG technologies could potentially address to enhance clinical applicability.

Method: The study conducted a systematic review of RAG applications in medicine, analyzing data sources, retrieval approaches, LLM usage, and evaluation methods across different medical applications.

Result: Research primarily used public data with limited private data application. Retrieval relied on English-centric embedding models, and LLMs were mostly generic rather than medical-specific. Evaluation focused on automated metrics for generation quality and human evaluation for accuracy, completeness, relevance, and fluency, with insufficient attention to bias and safety.

Conclusion: Medical RAG remains at an early stage and requires advances in clinical validation, cross-linguistic adaptation, and support for low-resource settings to enable trustworthy and responsible global use.

Abstract: The rapid growth of medical knowledge and increasing complexity of clinical practice pose challenges. In this context, large language models (LLMs) have demonstrated value; however, inherent limitations remain. Retrieval-augmented generation (RAG) technologies show potential to enhance their clinical applicability. This study reviewed RAG applications in medicine. We found that research primarily relied on publicly available data, with limited application in private data. For retrieval, approaches commonly relied on English-centric embedding models, while LLMs were mostly generic, with limited use of medical-specific LLMs. For evaluation, automated metrics evaluated generation quality and task performance, whereas human evaluation focused on accuracy, completeness, relevance, and fluency, with insufficient attention to bias and safety. RAG applications were concentrated on question answering, report generation, text summarization, and information extraction. Overall, medical RAG remains at an early stage, requiring advances in clinical validation, cross-linguistic adaptation, and support for low-resource settings to enable trustworthy and responsible global use.

[25] NILC: Discovering New Intents with LLM-assisted Clustering

Hongtao Wang, Renchi Yang, Wenqing Lin

Main category: cs.CL

TL;DR: NILC is a novel clustering framework for new intent discovery that uses iterative refinement with LLMs to improve clustering accuracy by enriching semantic centroids and augmenting hard samples.

Details

Motivation: Existing cascaded NID approaches fail to leverage mutual refinement between embedding and clustering stages, and embedding-only clustering overlooks nuanced textual semantics, leading to suboptimal performance.

Method: NILC uses an iterative workflow where LLMs create semantic centroids to enrich embedding centroids, augment hard samples via rewriting for cluster correction, and inject supervision through seeding and soft must links in semi-supervised settings.

Result: Extensive experiments show NILC achieves significant performance improvements over recent baselines across six benchmark datasets in both unsupervised and semi-supervised settings.

Conclusion: NILC effectively bridges the gap in existing NID approaches by leveraging LLMs for iterative refinement, demonstrating consistent performance gains across diverse domains.

Abstract: New intent discovery (NID) seeks to recognize both new and known intents from unlabeled user utterances, which finds prevalent use in practical dialogue systems. Existing works towards NID mainly adopt a cascaded architecture, wherein the first stage focuses on encoding the utterances into informative text embeddings beforehand, while the latter is to group similar embeddings into clusters (i.e., intents), typically by K-Means. However, such a cascaded pipeline fails to leverage the feedback from both steps for mutual refinement, and, meanwhile, the embedding-only clustering overlooks nuanced textual semantics, leading to suboptimal performance. To bridge this gap, this paper proposes NILC, a novel clustering framework specially catered for effective NID. Particularly, NILC follows an iterative workflow, in which clustering assignments are judiciously updated by carefully refining cluster centroids and text embeddings of uncertain utterances with the aid of large language models (LLMs). Specifically, NILC first taps into LLMs to create additional semantic centroids for clusters, thereby enriching the contextual semantics of the Euclidean centroids of embeddings. Moreover, LLMs are then harnessed to augment hard samples (ambiguous or terse utterances) identified from clusters via rewriting for subsequent cluster correction. Further, we inject supervision signals through non-trivial techniques seeding and soft must links for more accurate NID in the semi-supervised setting. Extensive experiments comparing NILC against multiple recent baselines under both unsupervised and semi-supervised settings showcase that NILC can achieve significant performance improvements over six benchmark datasets of diverse domains consistently.

[26] How Does a Deep Neural Network Look at Lexical Stress?

Itai Allouche, Itay Asael, Rotem Rousso, Vered Dassa, Ann Bradlow, Seung-Eun Kim, Matthew Goldrick, Joseph Keshet

Main category: cs.CL

TL;DR: CNNs achieve 92% accuracy in predicting English lexical stress from spectrograms, with interpretability analysis showing they primarily use spectral properties of stressed vowels (especially F1/F2) while attending to distributed cues throughout words.

Details

Motivation: To understand what informs neural network decisions in speech processing, specifically for lexical stress prediction, moving beyond black-box models to interpretable analysis.

Method: Automatically constructed dataset of English disyllabic words from read/spontaneous speech, trained CNN architectures with Layerwise Relevance Propagation (LRP) for interpretability analysis, and proposed feature-specific relevance analysis.

Result: CNNs achieved up to 92% accuracy on held-out test data; LRP revealed predictions were most influenced by stressed vs. unstressed syllable information, particularly spectral properties of stressed vowels (F1/F2, with some pitch and F3 contributions).

Conclusion: Deep learning can acquire distributed cues to stress from natural data, extending traditional phonetic work based on controlled stimuli, with interpretability revealing reliance on stressed vowel formants.

Abstract: Despite their success in speech processing, neural networks often operate as black boxes, prompting the question: what informs their decisions, and how can we interpret them? This work examines this issue in the context of lexical stress. A dataset of English disyllabic words was automatically constructed from read and spontaneous speech. Several Convolutional Neural Network (CNN) architectures were trained to predict stress position from a spectrographic representation of disyllabic words lacking minimal stress pairs (e.g., initial stress WAllet, final stress exTEND), achieving up to 92% accuracy on held-out test data. Layerwise Relevance Propagation (LRP), a technique for CNN interpretability analysis, revealed that predictions for held-out minimal pairs (PROtest vs. proTEST ) were most strongly influenced by information in stressed versus unstressed syllables, particularly the spectral properties of stressed vowels. However, the classifiers also attended to information throughout the word. A feature-specific relevance analysis is proposed, and its results suggest that our best-performing classifier is strongly influenced by the stressed vowel’s first and second formants, with some evidence that its pitch and third formant also contribute. These results reveal deep learning’s ability to acquire distributed cues to stress from naturally occurring data, extending traditional phonetic work based around highly controlled stimuli.

[27] IDALC: A Semi-Supervised Framework for Intent Detection and Active Learning based Correction

Ankan Mullick, Sukannya Purkayastha, Saransh Sharma, Pawan Goyal, Niloy Ganguly

Main category: cs.CL

TL;DR: IDALC is a semi-supervised framework that detects user intents and corrects system-rejected utterances while minimizing human annotation costs, achieving 5-10% higher accuracy and maintaining annotation costs at just 6-10% of unlabeled data.

Details

Motivation: Voice-controlled dialog systems often reject utterances due to low confidence, requiring manual annotation. Retraining with new intents from rejected queries is necessary but labeling all emerging intents is impractical, creating a need for efficient annotation cost reduction.

Method: IDALC (Intent Detection and Active Learning based Correction) - a semi-supervised framework that combines intent detection with active learning to correct system-rejected utterances.

Result: Empirical results show IDALC outperforms baseline methods with 5-10% higher accuracy and 4-8% improvement in macro-F1, while keeping annotation costs at only 6-10% of available unlabeled data.

Conclusion: IDALC provides an effective solution for handling system-rejected utterances and emerging intents in voice-controlled dialog systems while significantly reducing annotation burden.

Abstract: Voice-controlled dialog systems have become immensely popular due to their ability to perform a wide range of actions in response to diverse user queries. These agents possess a predefined set of skills or intents to fulfill specific user tasks. But every system has its own limitations. There are instances where, even for known intents, if any model exhibits low confidence, it results in rejection of utterances that necessitate manual annotation. Additionally, as time progresses, there may be a need to retrain these agents with new intents from the system-rejected queries to carry out additional tasks. Labeling all these emerging intents and rejected utterances over time is impractical, thus calling for an efficient mechanism to reduce annotation costs. In this paper, we introduce IDALC (Intent Detection and Active Learning based Correction), a semi-supervised framework designed to detect user intents and rectify system-rejected utterances while minimizing the need for human annotation. Empirical findings on various benchmark datasets demonstrate that our system surpasses baseline methods, achieving a 5-10% higher accuracy and a 4-8% improvement in macro-F1. Remarkably, we maintain the overall annotation cost at just 6-10% of the unlabelled data available to the system. The overall framework of IDALC is shown in Fig. 1

[28] Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs

Renfei Zhang, Manasa Kaniselvan, Niloofar Mireshghallah

Main category: cs.CL

TL;DR: RL improves language model reasoning and generalization without degrading memorized knowledge, particularly enhancing performance on hierarchical knowledge recall tasks through better procedural navigation skills.

Details

Motivation: Challenge the narrative that RL degrades memorized knowledge in language models, and investigate how RL actually improves performance on knowledge recall tasks involving hierarchical structures.

Method: Compare RL-enhanced models with base and SFT models on knowledge recall tasks, use structured prompting to guide SFT models through hierarchical traversal, and conduct layer-wise internal activation analysis to examine representation differences.

Result: RL-enhanced models outperform base and SFT models on knowledge recall, structured prompting reduces performance gap from 24pp to 7pp, RL models retain superior procedural path recall, and activation analysis shows RL transforms query representations rather than factual representations.

Conclusion: RL primarily enhances how models traverse and search existing knowledge hierarchies rather than acquiring new data or changing factual representations, improving procedural skills for knowledge navigation.

Abstract: Reinforcement learning (RL) is often credited with improving language model reasoning and generalization at the expense of degrading memorized knowledge. We challenge this narrative by observing that RL-enhanced models consistently outperform their base and supervised fine-tuned (SFT) counterparts on pure knowledge recall tasks, particularly those requiring traversal of hierarchical, structured knowledge (e.g., medical codes). We hypothesize these gains stem not from newly acquired data, but from improved procedural skills in navigating and searching existing knowledge hierarchies within the model parameters. To support this hypothesis, we show that structured prompting, which explicitly guides SFTed models through hierarchical traversal, recovers most of the performance gap (reducing 24pp to 7pp on MedConceptsQA for DeepSeek-V3/R1). We further find that while prompting improves final-answer accuracy, RL-enhanced models retain superior ability to recall correct procedural paths on deep-retrieval tasks. Finally our layer-wise internal activation analysis reveals that while factual representations (e.g., activations for the statement “code 57.95 refers to urinary infection”) maintain high cosine similarity between SFT and RL models, query representations (e.g., “what is code 57.95”) diverge noticeably, indicating that RL primarily transforms how models traverse knowledge rather than the knowledge representation itself.

[29] Interpretable Recognition of Cognitive Distortions in Natural Language Texts

Anton Kolonin, Anna Arinicheva

Main category: cs.CL

TL;DR: New multi-factor classification approach using weighted structured patterns with heterarchical relationships for detecting cognitive distortions in psychological care, achieving state-of-the-art performance.

Details

Motivation: To automate detection of specific cognitive distortions in psychological care using interpretable and transparent AI models for social impact.

Method: Weighted structured patterns (N-grams) with heterarchical relationships between them, applied to multi-factor classification of natural language texts.

Result: Significant improvements over literature-known F1 scores on two publicly available datasets, with optimal hyper-parameters determined.

Conclusion: The proposed recognition and learning algorithms advance the state of the art in cognitive distortion detection, with code and models made available for community use.

Abstract: We propose a new approach to multi-factor classification of natural language texts based on weighted structured patterns such as N-grams, taking into account the heterarchical relationships between them, applied to solve such a socially impactful problem as the automation of detection of specific cognitive distortions in psychological care, relying on an interpretable, robust and transparent artificial intelligence model. The proposed recognition and learning algorithms improve the current state of the art in this field. The improvement is tested on two publicly available datasets, with significant improvements over literature-known F1 scores for the task, with optimal hyper-parameters determined, having code and models available for future use by the community.

[30] Revisiting Entropy in Reinforcement Learning for Large Reasoning Models

Renren Jin, Pengzhi Gao, Yuqi Ren, Zhuowen Han, Tongxuan Zhang, Wuwei Huang, Wei Liu, Jian Luan, Deyi Xiong

Main category: cs.CL

TL;DR: This paper investigates entropy collapse in LLMs during reinforcement learning with verifiable rewards (RLVR), identifies key factors affecting entropy, and proposes methods to regulate entropy by adjusting loss weights for tokens with positive vs. negative advantages.

Details

Motivation: RLVR is widely used to enhance LLM reasoning, but entropy collapse during training causes premature convergence to suboptimal solutions and hinders performance improvement, yet comprehensive studies on entropy dynamics in RLVR are lacking.

Method: Conducted extensive experiments to analyze entropy dynamics in RLVR-trained LLMs, examining correlations between entropy and response diversity, calibration, and performance. Investigated factors like off-policy updates, training data diversity, and clipping thresholds.

Result: Found that tokens with positive advantages are the primary contributors to entropy collapse. Model entropy can be effectively regulated by adjusting relative loss weights of tokens with positive vs. negative advantages during training.

Conclusion: The study provides insights into entropy dynamics in RLVR and demonstrates practical methods to control entropy collapse, enabling better performance and preventing premature convergence in LLM training.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a predominant approach for enhancing the reasoning capabilities of large language models (LLMs). However, the entropy of LLMs usually collapses during RLVR training, causing premature convergence to suboptimal local minima and hinder further performance improvement. Although various approaches have been proposed to mitigate entropy collapse, a comprehensive study of entropy in RLVR remains lacking. To address this gap, we conduct extensive experiments to investigate the entropy dynamics of LLMs trained with RLVR and analyze how model entropy correlates with response diversity, calibration, and performance across various benchmarks. Our findings reveal that the number of off-policy updates, the diversity of training data, and the clipping thresholds in the optimization objective are critical factors influencing the entropy of LLMs trained with RLVR. Moreover, we theoretically and empirically demonstrate that tokens with positive advantages are the primary contributors to entropy collapse, and that model entropy can be effectively regulated by adjusting the relative loss weights of tokens with positive and negative advantages during training.

[31] LLMs Do Not See Age: Assessing Demographic Bias in Automated Systematic Review Synthesis

Favour Yahdii Aghaebe, Tanefa Apekey, Elizabeth Williams, Nafise Sadat Moosavi

Main category: cs.CL

TL;DR: LLMs show systematic disparities in preserving age-related information when summarizing biomedical studies, with lowest demographic fidelity for adult-focused summaries and higher hallucination rates for underrepresented populations.

Details

Motivation: To evaluate whether language models preserve crucial age distinctions when generating summaries of biomedical studies, as clinical interventions often depend on age-specific considerations.

Method: Created DemogSummary dataset with age-stratified systematic review studies, evaluated three LLMs (Qwen, Longformer, GPT-4.1 Nano) using standard metrics and a new Demographic Salience Score (DSS) measuring age-related entity retention and hallucination.

Result: Systematic disparities across models and age groups: demographic fidelity lowest for adult-focused summaries, underrepresented populations more prone to hallucinations.

Conclusion: Current LLMs have limitations in faithful and bias-free summarization, highlighting need for fairness-aware evaluation frameworks and summarization pipelines in biomedical NLP.

Abstract: Clinical interventions often hinge on age: medications and procedures safe for adults may be harmful to children or ineffective for older adults. However, as language models are increasingly integrated into biomedical evidence synthesis workflows, it remains uncertain whether these systems preserve such crucial demographic distinctions. To address this gap, we evaluate how well state-of-the-art language models retain age-related information when generating abstractive summaries of biomedical studies. We construct DemogSummary, a novel age-stratified dataset of systematic review primary studies, covering child, adult, and older adult populations. We evaluate three prominent summarisation-capable LLMs, Qwen (open-source), Longformer (open-source) and GPT-4.1 Nano (proprietary), using both standard metrics and a newly proposed Demographic Salience Score (DSS), which quantifies age-related entity retention and hallucination. Our results reveal systematic disparities across models and age groups: demographic fidelity is lowest for adult-focused summaries, and under-represented populations are more prone to hallucinations. These findings highlight the limitations of current LLMs in faithful and bias-free summarisation and point to the need for fairness-aware evaluation frameworks and summarisation pipelines in biomedical NLP.

[32] Multi-Reward GRPO Fine-Tuning for De-biasing Large Language Models: A Study Based on Chinese-Context Discrimination Data

Deng Yixuan, Ji Xiaoqiang

Main category: cs.CL

TL;DR: Proposes Multi-Reward Group Relative Policy Optimization (GRPO) framework to reduce biases in LLMs using synthetic English dataset from Chinese discrimination categories, achieving significant bias reduction without compromising model quality.

Details

Motivation: LLMs exhibit implicit biases reflecting social stereotypes, and current alignment techniques like RLHF/DPO are limited in addressing culturally specific and multi-dimensional discrimination.

Method: Constructs synthetic English dataset from Chinese discrimination categories, trains DeBERTa-v3 reward model for multi-dimensional rewards (fairness, neutrality, linguistic quality), and uses GRPO fine-tuning guided by these rewards.

Result: Significant reductions in bias intensity and improved alignment with non-discriminatory standards while maintaining fluency and informativeness.

Conclusion: GRPO-based multi-reward optimization is effective for de-biasing LLMs and provides a replicable framework for cultural-contextual ethical alignment.

Abstract: Large Language Models (LLMs) often exhibit implicit biases and discriminatory tendencies that reflect underlying social stereotypes. While recent alignment techniques such as RLHF and DPO have mitigated some of these issues, they remain limited in addressing culturally specific and multi-dimensional forms of discrimination. This paper proposes a Multi-Reward Group Relative Policy Optimization (GRPO) framework to fine-tune LLMs toward ethical and bias-free behavior. Our approach constructs a synthetic English-language dataset derived from Chinese-context discrimination categories, including regional, ethnic, and occupational biases. Each instance is paired with both neutral and biased responses to train a reward model based on DeBERTa-v3, which provides multi-dimensional reward signals capturing fairness, neutrality, and linguistic quality. The trained reward model then guides GRPO fine-tuning to optimize model outputs along these ethical dimensions. Experimental results demonstrate significant reductions in bias intensity and improved alignment with non-discriminatory standards without compromising fluency or informativeness. This study highlights the effectiveness of GRPO-based multi-reward optimization for de-biasing LLMs and offers a replicable framework for cultural-contextual ethical alignment.

[33] Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts

Xinyuan Yan, Shusen Liu, Kowshik Thopalli, Bei Wang

Main category: cs.CL

TL;DR: Proposes a focused exploration framework for sparse autoencoder features in LLMs using interactive visualization that combines topology-based encoding with dimensionality reduction to avoid limitations of conventional methods like UMAP.

Details

Motivation: Sparse autoencoders extract too many features for comprehensive exploration, and conventional visualization methods like UMAP suffer from compression artifacts, overplotting, and neighborhood distortions.

Method: Interactive visualization system combining topology-based visual encoding with dimensionality reduction to represent both local and global relationships among selected SAE features.

Result: Enables targeted investigation of SAE behavior through curated concept subsets, providing more faithful representation of feature relationships.

Conclusion: The framework facilitates deeper and more nuanced analysis of concept representation in latent space by prioritizing focused exploration over comprehensive visualization of all features.

Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for uncovering interpretable features in large language models (LLMs) through the sparse directions they learn. However, the sheer number of extracted directions makes comprehensive exploration intractable. While conventional embedding techniques such as UMAP can reveal global structure, they suffer from limitations including high-dimensional compression artifacts, overplotting, and misleading neighborhood distortions. In this work, we propose a focused exploration framework that prioritizes curated concepts and their corresponding SAE features over attempts to visualize all available features simultaneously. We present an interactive visualization system that combines topology-based visual encoding with dimensionality reduction to faithfully represent both local and global relationships among selected features. This hybrid approach enables users to investigate SAE behavior through targeted, interpretable subsets, facilitating deeper and more nuanced analysis of concept representation in latent space.

[34] Efficient Hate Speech Detection: A Three-Layer LoRA-Tuned BERTweet Framework

Mahmoud El-Bahnasawi

Main category: cs.CL

TL;DR: A computationally efficient hate speech detection system using rule-based pre-filtering and LoRA-tuned BERTweet achieves 0.85 F1 score with 100x smaller model than SOTA, requiring only 1.87M trainable parameters and 2-hour training on T4 GPU.

Details

Motivation: Develop computationally efficient hate speech detection systems that maintain competitive performance while being practical for real-time deployment in resource-constrained environments.

Method: Three-layer framework combining rule-based pre-filtering with parameter-efficient LoRA-tuned BERTweet model and continuous learning capabilities, using strategic dataset unification and optimized fine-tuning.

Result: Achieves 0.85 macro F1 score (94% of SOTA SafePhi performance) with 100x smaller base model (134M vs 14B parameters), requiring only 1.87M trainable parameters (1.37% of full fine-tuning) and training in ~2 hours on single T4 GPU.

Conclusion: The system makes robust hate speech detection accessible in resource-constrained environments while maintaining competitive accuracy for real-world deployment, outperforming traditional BERT-based approaches with similar computational requirements.

Abstract: This paper addresses the critical challenge of developing computationally efficient hate speech detection systems that maintain competitive performance while being practical for real-time deployment. We propose a novel three-layer framework that combines rule-based pre-filtering with a parameter-efficient LoRA-tuned BERTweet model and continuous learning capabilities. Our approach achieves 0.85 macro F1 score - representing 94% of the performance of state-of-the-art large language models like SafePhi (Phi-4 based) while using a base model that is 100x smaller (134M vs 14B parameters). Compared to traditional BERT-based approaches with similar computational requirements, our method demonstrates superior performance through strategic dataset unification and optimized fine-tuning. The system requires only 1.87M trainable parameters (1.37% of full fine-tuning) and trains in approximately 2 hours on a single T4 GPU, making robust hate speech detection accessible in resource-constrained environments while maintaining competitive accuracy for real-world deployment.

[35] Automating Hardware Design and Verification from Architectural Papers via a Neural-Symbolic Graph Framework

Haoyue Yang, Xuanle Zhao, Yujie Liu, Zhuojun Zou, Kailin Lyu, Changchun Zhou, Yao Zhu, Jie Hao

Main category: cs.CL

TL;DR: ArchCraft is a framework that converts architectural descriptions from academic papers into synthesizable Verilog projects with RTL verification, using formal graphs and symbols to translate unstructured papers into verifiable hardware designs.

Details

Motivation: Hardware architecture reproduction from academic papers is challenging due to lack of public source code and HDL complexity, hindering verification and implementation of research ideas.

Method: Uses structured workflow with formal graphs for Architectural Blueprint and symbols for Functional Specification, generating RTL and testbench code decoupled via symbols for verification and debugging.

Result: Outperforms direct generation methods and VerilogCoder in paper understanding and code completion; generated RTL meets timing constraints with performance metrics consistent with original papers.

Conclusion: ArchCraft successfully bridges the gap between academic architectural descriptions and practical hardware implementation, enabling reproducible and verifiable hardware design from research papers.

Abstract: The reproduction of hardware architectures from academic papers remains a significant challenge due to the lack of publicly available source code and the complexity of hardware description languages (HDLs). To this end, we propose \textbf{ArchCraft}, a Framework that converts abstract architectural descriptions from academic papers into synthesizable Verilog projects with register-transfer level (RTL) verification. ArchCraft introduces a structured workflow, which uses formal graphs to capture the Architectural Blueprint and symbols to define the Functional Specification, translating unstructured academic papers into verifiable, hardware-aware designs. The framework then generates RTL and testbench (TB) code decoupled via these symbols to facilitate verification and debugging, ultimately reporting the circuit’s Power, Area, and Performance (PPA). Moreover, we propose the first benchmark, \textbf{ArchSynthBench}, for synthesizing hardware from architectural descriptions, with a complete set of evaluation indicators, 50 project-level circuits, and around 600 circuit blocks. We systematically assess ArchCraft on ArchSynthBench, where the experiment results demonstrate the superiority of our proposed method, surpassing direct generation methods and the VerilogCoder framework in both paper understanding and code completion. Furthermore, evaluation and physical implementation of the generated executable RTL code show that these implementations meet all timing constraints without violations, and their performance metrics are consistent with those reported in the original papers.

[36] Stemming Hallucination in Language Models Using a Licensing Oracle

Simeon Emanuilov, Richard Ackermann

Main category: cs.CL

TL;DR: The Licensing Oracle is an architectural solution that eliminates hallucinations in language models by validating generated claims against structured knowledge graphs, achieving perfect abstention precision and zero false answers.

Details

Motivation: Language models generate factually incorrect information (hallucinations) despite producing coherent text, highlighting the need for deterministic validation methods rather than statistical approaches.

Method: Embedding a deterministic validation step into the generative process that formally validates claims against structured knowledge graphs before output.

Result: Achieved perfect abstention precision (AP = 1.0) and zero false answers (FAR-NE = 0.0) with 89.1% factual accuracy, outperforming RAG and fine-tuning methods which failed to eliminate hallucinations.

Conclusion: Architectural innovations like the Licensing Oracle provide necessary and sufficient solutions for hallucinations in domains with structured knowledge, offering guarantees that statistical methods cannot match.

Abstract: Language models exhibit remarkable natural language generation capabilities but remain prone to hallucinations, generating factually incorrect information despite producing syntactically coherent responses. This study introduces the Licensing Oracle, an architectural solution designed to stem hallucinations in LMs by enforcing truth constraints through formal validation against structured knowledge graphs. Unlike statistical approaches that rely on data scaling or fine-tuning, the Licensing Oracle embeds a deterministic validation step into the model’s generative process, ensuring that only factually accurate claims are made. We evaluated the effectiveness of the Licensing Oracle through experiments comparing it with several state-of-the-art methods, including baseline language model generation, fine-tuning for factual recall, fine-tuning for abstention behavior, and retrieval-augmented generation (RAG). Our results demonstrate that although RAG and fine-tuning improve performance, they fail to eliminate hallucinations. In contrast, the Licensing Oracle achieved perfect abstention precision (AP = 1.0) and zero false answers (FAR-NE = 0.0), ensuring that only valid claims were generated with 89.1% accuracy in factual responses. This work shows that architectural innovations, such as the Licensing Oracle, offer a necessary and sufficient solution for hallucinations in domains with structured knowledge representations, offering guarantees that statistical methods cannot match. Although the Licensing Oracle is specifically designed to address hallucinations in fact-based domains, its framework lays the groundwork for truth-constrained generation in future AI systems, providing a new path toward reliable, epistemically grounded models.

[37] MuonAll: Muon Variant for Efficient Finetuning of Large Language Models

Saurabh Page, Advait Joshi, S. S. Sonawane

Main category: cs.CL

TL;DR: MuonAll extends Muon optimizer to include all parameters for finetuning pretrained models, performing on par with AdamW across benchmarks.

Details

Motivation: Muon optimizer shows good results in pretraining but hasn't been explored for finetuning existing models, and currently requires AdamW for some parameters.

Method: MuonAll transforms parameters into 2D matrices to incorporate all parameters within the Muon optimizer framework.

Result: Extensive finetuning experiments on models up to 500M parameters show Muon and MuonAll perform equivalently to AdamW across major benchmarks.

Conclusion: Muon and MuonAll are effective alternative optimizers for finetuning, with distributed implementations now open-sourced.

Abstract: Muon optimizer has demonstrated robust results in pretraining of language models but its performance in finetuning of existing public pretrained models is not yet explored. Currently, Muon is used along with AdamW introducing a scope of improvement for adopting all parameters inside Muon. We introduce MuonAll, which incorporates all the parameters inside Muon by transforming into 2D matrices. We conduct extensive finetuning experiments across publicly available language models with model sizes upto half billion parameters. Muon and MuonAll perform at par with AdamW across major benchmarks, highlighting their effectiveness as alternative optimizers. We open-source the distributed implementations of Muon and MuonAll, available at https://github.com/Saurabh750/optimizer

[38] Evaluation of retrieval-based QA on QUEST-LOFT

Nathan Scales, Nathanael Schärli, Olivier Bousquet

Main category: cs.CL

TL;DR: RAG methods struggle with distributed information and complex reasoning. The paper analyzes poor QUEST-LOFT performance, provides updated human evaluation, and shows optimized RAG with structured reasoning/evidence output can outperform long-context approaches.

Details

Motivation: Current RAG methods fail when information is distributed across many documents or requires complex reasoning, and long-context models also show limitations on benchmarks like QUEST-LOFT.

Method: In-depth analysis of QUEST-LOFT performance factors, thorough human evaluation, and optimization of RAG with structured output format containing reasoning and evidence, optionally with answer re-verification.

Result: RAG can be optimized to significantly outperform long-context approaches when combined with structured reasoning/evidence output and optional answer verification.

Conclusion: Structured RAG with explicit reasoning and evidence representation provides better performance than long-context models for complex QA tasks requiring distributed information retrieval and reasoning.

Abstract: Despite the popularity of retrieval-augmented generation (RAG) as a solution for grounded QA in both academia and industry, current RAG methods struggle with questions where the necessary information is distributed across many documents or where retrieval needs to be combined with complex reasoning. Recently, the LOFT study has shown that this limitation also applies to approaches based on long-context language models, with the QUEST benchmark exhibiting particularly large headroom. In this paper, we provide an in-depth analysis of the factors contributing to the poor performance on QUEST-LOFT, publish updated numbers based on a thorough human evaluation, and demonstrate that RAG can be optimized to significantly outperform long-context approaches when combined with a structured output format containing reasoning and evidence, optionally followed by answer re-verification.

[39] Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models

Akshar Tumu, Varad Shinde, Parisa Kordjamshidi

Main category: cs.CL

TL;DR: Using Referring Expression Comprehension to evaluate spatial reasoning in Vision-language models, revealing challenges with ambiguity, complex spatial relations, and negation.

Details

Motivation: Spatial reasoning is difficult for current VLMs, and existing evaluation methods like image captioning and VQA are insufficient for deep analysis of spatial comprehension abilities.

Method: Proposed using Referring Expression Comprehension task to evaluate VLMs’ spatial reasoning, analyzing performance across task-specific architectures and large VLMs in scenarios with object detection ambiguity, complex spatial expressions, and negation.

Result: All models face challenges with spatial reasoning tasks, with relative performance varying based on model architecture and spatial semantic categories (topological, directional, proximal, etc.).

Conclusion: The study identifies research gaps in VLMs’ spatial reasoning capabilities and provides insights for future directions in improving spatial comprehension and grounding abilities.

Abstract: Spatial Reasoning is an important component of human cognition and is an area in which the latest Vision-language models (VLMs) show signs of difficulty. The current analysis works use image captioning tasks and visual question answering. In this work, we propose using the Referring Expression Comprehension task instead as a platform for the evaluation of spatial reasoning by VLMs. This platform provides the opportunity for a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation (’not’). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at hand, the relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.

[40] BookAsSumQA: An Evaluation Framework for Aspect-Based Book Summarization via Question Answering

Ryuhei Miyazato, Ting-Ruen Wei, Xuyang Wu, Hsin-Tai Wu, Kei Harada

Main category: cs.CL

TL;DR: Proposes BookAsSumQA, a QA-based evaluation framework for aspect-based book summarization that automatically generates aspect-specific QA pairs from narrative knowledge graphs to assess summary quality.

Details

Motivation: Aspect-based summarization for books remains unexplored due to the difficulty of constructing reference summaries for long texts, creating a need for automated evaluation methods.

Method: Developed BookAsSumQA framework that generates aspect-specific QA pairs from narrative knowledge graphs and evaluates summary quality based on question-answering performance.

Result: LLM-based approaches showed higher accuracy on shorter texts, while RAG-based methods became more effective as document length increased, making them more efficient for aspect-based book summarization.

Conclusion: RAG-based methods are more practical and efficient for aspect-based book summarization, especially for longer documents, as demonstrated through the BookAsSumQA evaluation framework.

Abstract: Aspect-based summarization aims to generate summaries that highlight specific aspects of a text, enabling more personalized and targeted summaries. However, its application to books remains unexplored due to the difficulty of constructing reference summaries for long text. To address this challenge, we propose BookAsSumQA, a QA-based evaluation framework for aspect-based book summarization. BookAsSumQA automatically generates aspect-specific QA pairs from a narrative knowledge graph to evaluate summary quality based on its question-answering performance. Our experiments using BookAsSumQA revealed that while LLM-based approaches showed higher accuracy on shorter texts, RAG-based methods become more effective as document length increases, making them more efficient and practical for aspect-based book summarization.

[41] Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning

Sangmook Lee, Dohyung Kim, Hyukhun Koh, Nakyeong Yang, Kyomin Jung

Main category: cs.CL

TL;DR: STEER is a confidence-guided routing framework that dynamically allocates reasoning steps between small and large LLMs based on the small model’s internal confidence scores, reducing inference costs while maintaining or improving accuracy.

Details

Motivation: To lower the high inference costs of large language models while maintaining reasoning capabilities, without relying on expensive trained router models that lack robustness under domain shifts.

Method: Uses model-internal confidence scores from the smaller LLM’s logits before generating each reasoning step to decide when to invoke the larger model, enabling fine-grained step-level routing without external modules.

Result: Achieves competitive or enhanced accuracy while reducing inference costs (up to +20% accuracy with 48% less FLOPs compared to using only the larger model), outperforming baselines with trained external modules.

Conclusion: Model-internal confidence serves as a robust, domain-agnostic signal for efficient model routing, providing a scalable pathway for cost-effective LLM deployment.

Abstract: Recent advances in Large Language Models (LLMs) - particularly model scaling and test-time techniques - have greatly enhanced the reasoning capabilities of language models at the expense of higher inference costs. To lower inference costs, prior works train router models or deferral mechanisms that allocate easy queries to a small, efficient model, while forwarding harder queries to larger, more expensive models. However, these trained router models often lack robustness under domain shifts and require expensive data synthesis techniques such as Monte Carlo rollouts to obtain sufficient ground-truth routing labels for training. In this work, we propose Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning (STEER), a domain-agnostic framework that performs fine-grained, step-level routing between smaller and larger LLMs without utilizing external models. STEER leverages confidence scores from the smaller model’s logits prior to generating a reasoning step, so that the large model is invoked only when necessary. Extensive evaluations using different LLMs on a diverse set of challenging benchmarks across multiple domains such as Mathematical Reasoning, Multi-Hop QA, and Planning tasks indicate that STEER achieves competitive or enhanced accuracy while reducing inference costs (up to +20% accuracy with 48% less FLOPs compared to solely using the larger model on AIME), outperforming baselines that rely on trained external modules. Our results establish model-internal confidence as a robust, domain-agnostic signal for model routing, offering a scalable pathway for efficient LLM deployment.

[42] Explicit Knowledge-Guided In-Context Learning for Early Detection of Alzheimer’s Disease

Puzhen Su, Yongzhu Miao, Chunxi Guo, Jintao Tang, Shasha Li, Ting Wang

Main category: cs.CL

TL;DR: EK-ICL framework integrates explicit knowledge to improve Alzheimer’s Disease detection from narrative transcripts using in-context learning, addressing challenges like task recognition failure and semantic misalignment in data-scarce clinical settings.

Details

Motivation: Current in-context learning approaches for AD detection suffer from task recognition failure, suboptimal demonstration selection, and misalignment between label words and task objectives, especially under out-of-distribution and data-scarce conditions in clinical domains.

Method: EK-ICL incorporates three knowledge components: confidence scores from small language models for task-relevant patterns, parsing feature scores for structural differences and demo selection, and label word replacement to resolve semantic misalignment. It also uses parsing-based retrieval and ensemble prediction.

Result: Extensive experiments across three AD datasets show EK-ICL significantly outperforms state-of-the-art fine-tuning and ICL baselines, with analysis revealing high sensitivity to label semantics and task-specific context alignment.

Conclusion: Explicit knowledge integration is crucial for stable clinical reasoning under low-resource conditions, as ICL performance in AD detection depends heavily on proper alignment of label semantics and task-specific context.

Abstract: Detecting Alzheimer’s Disease (AD) from narrative transcripts remains a challenging task for large language models (LLMs), particularly under out-of-distribution (OOD) and data-scarce conditions. While in-context learning (ICL) provides a parameter-efficient alternative to fine-tuning, existing ICL approaches often suffer from task recognition failure, suboptimal demonstration selection, and misalignment between label words and task objectives, issues that are amplified in clinical domains like AD detection. We propose Explicit Knowledge In-Context Learners (EK-ICL), a novel framework that integrates structured explicit knowledge to enhance reasoning stability and task alignment in ICL. EK-ICL incorporates three knowledge components: confidence scores derived from small language models (SLMs) to ground predictions in task-relevant patterns, parsing feature scores to capture structural differences and improve demo selection, and label word replacement to resolve semantic misalignment with LLM priors. In addition, EK-ICL employs a parsing-based retrieval strategy and ensemble prediction to mitigate the effects of semantic homogeneity in AD transcripts. Extensive experiments across three AD datasets demonstrate that EK-ICL significantly outperforms state-of-the-art fine-tuning and ICL baselines. Further analysis reveals that ICL performance in AD detection is highly sensitive to the alignment of label semantics and task-specific context, underscoring the importance of explicit knowledge in clinical reasoning under low-resource conditions.

[43] SPA: Achieving Consensus in LLM Alignment via Self-Priority Optimization

Yue Huang, Xiangqi Wang, Xiangliang Zhang

Main category: cs.CL

TL;DR: SPA is an unsupervised alignment framework that prioritizes trustworthiness over helpfulness, using self-generated responses and dual-criterion denoising to create preference pairs for fine-tuning.

Details

Motivation: In high-stakes scenarios like self-harm, legal, or medical queries, LLMs must balance trustworthiness and helpfulness, which often conflict.

Method: Self-Priority Alignment (SPA) generates diverse responses, self-evaluates and refines them, applies dual-criterion denoising, constructs lexicographically ordered preference pairs, and fine-tunes with uncertainty-weighted alignment loss.

Result: SPA improves helpfulness without compromising safety across multiple benchmarks, outperforming strong baselines while preserving general capabilities.

Conclusion: SPA provides a scalable and interpretable alignment strategy for critical LLM applications by enforcing trustworthy-before-helpful ordering.

Abstract: In high-stakes scenarios-such as self-harm, legal, or medical queries-LLMs must be both trustworthy and helpful. However, these goals often conflict. We propose priority alignment, a new alignment paradigm that enforces a strict “trustworthy-before-helpful” ordering: optimization of helpfulness is conditioned on first meeting trustworthy thresholds (e.g., harmlessness or honesty). To realize this, we introduce Self-Priority Alignment (SPA)-a fully unsupervised framework that generates diverse responses, self-evaluates them and refines them by the model itself, and applies dual-criterion denoising to remove inconsistency and control variance. From this, SPA constructs lexicographically ordered preference pairs and fine-tunes the model using an uncertainty-weighted alignment loss that emphasizes high-confidence, high-gap decisions. Experiments across multiple benchmarks show that SPA improves helpfulness without compromising safety, outperforming strong baselines while preserving general capabilities. Our results demonstrate that SPA provides a scalable and interpretable alignment strategy for critical LLM applications.

[44] Overview of CHIP 2025 Shared Task 2: Discharge Medication Recommendation for Metabolic Diseases Based on Chinese Electronic Health Records

Juntao Li, Haobin Yuan, Ling Luo, Tengxiao Lv, Yan Jiang, Fan Wang, Ping Zhang, Huiyi Lv, Jian Wang, Yuanyuan Sun, Hongfei Lin

Main category: cs.CL

TL;DR: The CHIP 2025 Shared Task 2 competition focused on developing automated discharge medication recommendation systems using Chinese EHR data, with top teams achieving promising results using LLM-based ensemble approaches.

Details

Motivation: Discharge medication recommendation is critical for treatment continuity, preventing readmission, and improving long-term management of chronic metabolic diseases.

Method: Used the CDrugRed dataset with 5,894 de-identified Chinese EHR records from 3,190 patients, and employed large language model (LLM)-based ensemble systems for multi-label medication recommendation.

Result: Top team achieved Jaccard score of 0.5102 and F1 score of 0.6267 on final test set, with 526 teams registering and 167/95 teams submitting valid results to Phase A/B leaderboards respectively.

Conclusion: Results demonstrate the potential of LLMs for medication recommendation in Chinese EHRs, though challenges remain in handling multi-label recommendations, heterogeneous clinical text, and patient-specific treatment variability.

Abstract: Discharge medication recommendation plays a critical role in ensuring treatment continuity, preventing readmission, and improving long-term management for patients with chronic metabolic diseases. This paper present an overview of the CHIP 2025 Shared Task 2 competition, which aimed to develop state-of-the-art approaches for automatically recommending appro-priate discharge medications using real-world Chinese EHR data. For this task, we constructed CDrugRed, a high-quality dataset consisting of 5,894 de-identified hospitalization records from 3,190 patients in China. This task is challenging due to multi-label nature of medication recommendation, het-erogeneous clinical text, and patient-specific variability in treatment plans. A total of 526 teams registered, with 167 and 95 teams submitting valid results to the Phase A and Phase B leaderboards, respectively. The top-performing team achieved the highest overall performance on the final test set, with a Jaccard score of 0.5102, F1 score of 0.6267, demonstrating the potential of advanced large language model (LLM)-based ensemble systems. These re-sults highlight both the promise and remaining challenges of applying LLMs to medication recommendation in Chinese EHRs. The post-evaluation phase remains open at https://tianchi.aliyun.com/competition/entrance/532411/.

[45] Analyzing and Mitigating Negation Artifacts using Data Augmentation for Improving ELECTRA-Small Model Accuracy

Mojtaba Noghabaei

Main category: cs.CL

TL;DR: ELECTRA-small model fine-tuned on SNLI struggles with negation, but targeted data augmentation with contrast sets and adversarial examples improves negation handling without harming overall performance.

Details

Motivation: Pre-trained NLI models often rely on spurious correlations rather than understanding linguistic phenomena like negation, leading to poor performance on negation-containing examples.

Method: Fine-tuned ELECTRA-small on SNLI dataset, identified negation issues, then augmented training data with contrast sets and adversarial examples specifically targeting negation.

Result: Targeted data augmentation improved model accuracy on negation-containing examples while maintaining overall performance, effectively mitigating the dataset artifact.

Conclusion: Focused data augmentation addressing specific linguistic weaknesses can successfully improve model robustness to negation without compromising general performance.

Abstract: Pre-trained models for natural language inference (NLI) often achieve high performance on benchmark datasets by using spurious correlations, or dataset artifacts, rather than understanding language touches such as negation. In this project, we investigate the performance of an ELECTRA-small model fine-tuned on the Stanford Natural Language Inference (SNLI) dataset, focusing on its handling of negation. Through analysis, we identify that the model struggles with correctly classifying examples containing negation. To address this, we augment the training data with contrast sets and adversarial examples emphasizing negation. Our results demonstrate that this targeted data augmentation improves the model’s accuracy on negation-containing examples without adversely affecting overall performance, therefore mitigating the identified dataset artifact.

[46] TimeSense:Making Large Language Models Proficient in Time-Series Analysis

Zhirui Zhang, Changhua Pei, Tianyi Gao, Zhe Xie, Yibo Hao, Zhaoyang Yu, Longlong Xu, Tong Xiao, Jing Han, Dan Pei

Main category: cs.CL

TL;DR: TimeSense is a multimodal framework that enhances LLMs’ time-series analysis by balancing textual reasoning with temporal sense preservation through reconstruction and spatial embeddings.

Details

Motivation: Existing methods that combine text with time-series data often bias models toward textual cues, neglecting full temporal features and leading to outputs that contradict time-series context.

Method: Proposes TimeSense with a Temporal Sense module that reconstructs input time-series in the model’s context, and incorporates coordinate-based positional embeddings for spatial understanding of time-series data.

Result: TimeSense achieves state-of-the-art performance across multiple tasks, particularly excelling on complex multi-dimensional time-series reasoning tasks.

Conclusion: The proposed framework effectively balances textual reasoning with temporal sense preservation, enabling more accurate and grounded time-series analysis in challenging scenarios.

Abstract: In the time-series domain, an increasing number of works combine text with temporal data to leverage the reasoning capabilities of large language models (LLMs) for various downstream time-series understanding tasks. This enables a single model to flexibly perform tasks that previously required specialized models for each domain. However, these methods typically rely on text labels for supervision during training, biasing the model toward textual cues while potentially neglecting the full temporal features. Such a bias can lead to outputs that contradict the underlying time-series context. To address this issue, we construct the EvalTS benchmark, comprising 10 tasks across three difficulty levels, from fundamental temporal pattern recognition to complex real-world reasoning, to evaluate models under more challenging and realistic scenarios. We also propose TimeSense, a multimodal framework that makes LLMs proficient in time-series analysis by balancing textual reasoning with a preserved temporal sense. TimeSense incorporates a Temporal Sense module that reconstructs the input time-series within the model’s context, ensuring that textual reasoning is grounded in the time-series dynamics. Moreover, to enhance spatial understanding of time-series data, we explicitly incorporate coordinate-based positional embeddings, which provide each time point with spatial context and enable the model to capture structural dependencies more effectively. Experimental results demonstrate that TimeSense achieves state-of-the-art performance across multiple tasks, and it particularly outperforms existing methods on complex multi-dimensional time-series reasoning tasks.

[47] HatePrototypes: Interpretable and Transferable Representations for Implicit and Explicit Hate Speech Detection

Irina Proskurina, Marc-Antoine Carpentier, Julien Velcin

Main category: cs.CL

TL;DR: HatePrototypes enable cross-task transfer between explicit and implicit hate speech detection without repeated fine-tuning, using class-level vector representations from language models.

Details

Motivation: Existing hate speech benchmarks mainly address explicit hate and overlook implicit hate, requiring deeper semantic processing. Current approaches rely on repeated fine-tuning for different hate types.

Method: Use HatePrototypes - class-level vector representations derived from language models optimized for hate speech detection, built from as few as 50 examples per class. Implement parameter-free early exiting with prototypes.

Result: Prototypes enable cross-task transfer between explicit and implicit hate, with interchangeable prototypes across benchmarks. Early exiting with prototypes is effective for both hate types.

Conclusion: HatePrototypes provide an efficient and transferable approach to hate speech detection that works across different types of hate without requiring repeated fine-tuning.

Abstract: Optimization of offensive content moderation models for different types of hateful messages is typically achieved through continued pre-training or fine-tuning on new hate speech benchmarks. However, existing benchmarks mainly address explicit hate toward protected groups and often overlook implicit or indirect hate, such as demeaning comparisons, calls for exclusion or violence, and subtle discriminatory language that still causes harm. While explicit hate can often be captured through surface features, implicit hate requires deeper, full-model semantic processing. In this work, we question the need for repeated fine-tuning and analyze the role of HatePrototypes, class-level vector representations derived from language models optimized for hate speech detection and safety moderation. We find that these prototypes, built from as few as 50 examples per class, enable cross-task transfer between explicit and implicit hate, with interchangeable prototypes across benchmarks. Moreover, we show that parameter-free early exiting with prototypes is effective for both hate types. We release the code, prototype resources, and evaluation scripts to support future research on efficient and transferable hate speech detection.

Lionel Z. Wang, Shihan Ben, Yulu Huang, Simeng Qing

Main category: cs.CL

TL;DR: SugarTextNet is a transformer-based framework that effectively detects sugar dating content on social media by addressing class imbalance and subtle linguistic cues, outperforming traditional models and LLMs.

Details

Motivation: Sugar dating content proliferation on social media raises serious societal concerns, but detection is challenging due to euphemisms, ambiguous language, and extreme class imbalance in real data.

Method: SugarTextNet integrates pretrained transformer encoder, attention-based cue extractor, and contextual phrase encoder with Context-Aware Focal Loss to handle class imbalance and capture nuanced features.

Result: Substantially outperforms traditional ML models, deep learning baselines, and large language models on a curated dataset of 3,067 Chinese Weibo posts across multiple metrics.

Conclusion: Domain-specific, context-aware modeling is crucial for sensitive content detection, providing robust solutions for content moderation in complex real-world scenarios.

Abstract: Sugar dating-related content has rapidly proliferated on mainstream social media platforms, giving rise to serious societal and regulatory concerns, including commercialization of intimate relationships and the normalization of transactional relationships.~Detecting such content is highly challenging due to the prevalence of subtle euphemisms, ambiguous linguistic cues, and extreme class imbalance in real-world data.~In this work, we present SugarTextNet, a novel transformer-based framework specifically designed to identify sugar dating-related posts on social media.~SugarTextNet integrates a pretrained transformer encoder, an attention-based cue extractor, and a contextual phrase encoder to capture both salient and nuanced features in user-generated text.~To address class imbalance and enhance minority-class detection, we introduce Context-Aware Focal Loss, a tailored loss function that combines focal loss scaling with contextual weighting.~We evaluate SugarTextNet on a newly curated, manually annotated dataset of 3,067 Chinese social media posts from Sina Weibo, demonstrating that our approach substantially outperforms traditional machine learning models, deep learning baselines, and large language models across multiple metrics.~Comprehensive ablation studies confirm the indispensable role of each component.~Our findings highlight the importance of domain-specific, context-aware modeling for sensitive content detection, and provide a robust solution for content moderation in complex, real-world scenarios.

[49] How Well Do LLMs Understand Drug Mechanisms? A Knowledge + Reasoning Evaluation Dataset

Sunil Mohan, Theofanis Karaletsos

Main category: cs.CL

TL;DR: This paper introduces a dataset to evaluate LLMs on drug mechanism knowledge and reasoning, showing o4-mini and Qwen3-4B-thinking perform best, with open-world reasoning being more challenging than closed-world.

Details

Motivation: To evaluate LLMs' factual knowledge and reasoning capabilities about drug mechanisms for drug development and personalized medicine applications.

Method: Created a dataset testing LLMs on known drug mechanisms and counterfactual reasoning scenarios, comparing performance across different models in open-world vs closed-world settings.

Result: o4-mini outperformed other OpenAI models, and Qwen3-4B-thinking matched o4-mini’s performance, sometimes exceeding it. Open-world reasoning was more challenging than closed-world, and counterfactuals affecting internal chain links were harder than those affecting drug links.

Conclusion: LLMs show varying capabilities in drug mechanism reasoning, with smaller models like Qwen3-4B-thinking performing competitively, and open-world reasoning presenting significant challenges that need addressing for reliable medical applications.

Abstract: Two scientific fields showing increasing interest in pre-trained large language models (LLMs) are drug development / repurposing, and personalized medicine. For both, LLMs have to demonstrate factual knowledge as well as a deep understanding of drug mechanisms, so they can recall and reason about relevant knowledge in novel situations. Drug mechanisms of action are described as a series of interactions between biomedical entities, which interlink into one or more chains directed from the drug to the targeted disease. Composing the effects of the interactions in a candidate chain leads to an inference about whether the drug might be useful or not for that disease. We introduce a dataset that evaluates LLMs on both factual knowledge of known mechanisms, and their ability to reason about them under novel situations, presented as counterfactuals that the models are unlikely to have seen during training. Using this dataset, we show that o4-mini outperforms the 4o, o3, and o3-mini models from OpenAI, and the recent small Qwen3-4B-thinking model closely matches o4-mini’s performance, even outperforming it in some cases. We demonstrate that the open world setting for reasoning tasks, which requires the model to recall relevant knowledge, is more challenging than the closed world setting where the needed factual knowledge is provided. We also show that counterfactuals affecting internal links in the reasoning chain present a much harder task than those affecting a link from the drug mentioned in the prompt.

[50] Dutch Metaphor Extraction from Cancer Patients’ Interviews and Forum Data using LLMs and Human in the Loop

Lifeng Han, David Lindevelt, Sander Puts, Erik van Mulligen, Suzan Verberne

Main category: cs.CL

TL;DR: This paper analyzes Dutch cancer patients’ metaphorical language from interviews and online forums using LLMs with various prompting strategies, creating the HealthQuote.NL corpus to improve healthcare communication.

Details

Motivation: Metaphors play a crucial role in healthcare communication between clinicians, patients, and families, particularly for cancer patients, but extracting them effectively from Dutch language data requires advanced methods.

Method: Used two Dutch cancer patient data sources (interviews and online forums) and tested state-of-the-art LLMs with different prompting strategies including chain of thought reasoning, few-shot learning, and self-prompting, with human verification.

Result: Created HealthQuote.NL corpus containing verified metaphors extracted from Dutch cancer patient data, demonstrating LLMs’ capability in this specialized domain with proper prompting techniques.

Conclusion: The extracted metaphors can enhance patient care through improved shared decision making, better patient-clinician communication, increased health literacy, and personalized care pathway design.

Abstract: Metaphors and metaphorical language (MLs) play an important role in healthcare communication between clinicians, patients, and patients’ family members. In this work, we focus on Dutch language data from cancer patients. We extract metaphors used by patients using two data sources: (1) cancer patient storytelling interview data and (2) online forum data, including patients’ posts, comments, and questions to professionals. We investigate how current state-of-the-art large language models (LLMs) perform on this task by exploring different prompting strategies such as chain of thought reasoning, few-shot learning, and self-prompting. With a human-in-the-loop setup, we verify the extracted metaphors and compile the outputs into a corpus named HealthQuote.NL. We believe the extracted metaphors can support better patient care, for example shared decision making, improved communication between patients and clinicians, and enhanced patient health literacy. They can also inform the design of personalized care pathways. We share prompts and related resources at https://github.com/aaronlifenghan/HealthQuote.NL

[51] Towards Resource-Efficient Multimodal Intelligence: Learned Routing among Specialized Expert Models

Mayank Saini, Arit Kumar Bishwas

Main category: cs.CL

TL;DR: A unified framework that intelligently routes queries to optimal expert models, balancing cost and quality while reducing reliance on expensive LLMs by over 67% while maintaining performance.

Details

Motivation: High inference costs of large language models hinder real-time, scalable deployment, while smaller open-source models struggle with complex or multimodal queries.

Method: Modular framework with learned routing network that directs queries to most fitting expert models, using two-stage open-source pipeline for vision tasks and reviving efficient classical vision components.

Result: Matches or exceeds performance of always-premium LLM systems on benchmarks like MMLU and VQA while reducing reliance on costly models by over 67%.

Conclusion: The framework delivers high-quality, resource-efficient AI at scale through extensible multi-agent orchestration.

Abstract: As AI moves beyond text, large language models (LLMs) increasingly power vision, audio, and document understanding; however, their high inference costs hinder real-time, scalable deployment. Conversely, smaller open-source models offer cost advantages but struggle with complex or multimodal queries. We introduce a unified, modular framework that intelligently routes each query - textual, multimodal, or complex - to the most fitting expert model, using a learned routing network that balances cost and quality. For vision tasks, we employ a two-stage open-source pipeline optimized for efficiency and reviving efficient classical vision components where they remain SOTA for sub-tasks. On benchmarks such as Massive Multitask Language Understanding (MMLU) and Visual Question Answering (VQA), we match or exceed the performance of always-premium LLM (monolithic systems with one model serving all query types) performance, yet reduce the reliance on costly models by over 67%. With its extensible, multi-agent orchestration, we deliver high-quality, resource-efficient AI at scale.

[52] SR-KI: Scalable and Real-Time Knowledge Integration into LLMs via Supervised Attention

Bohan Yu, Wei Huang, Kang Liu

Main category: cs.CL

TL;DR: SR-KI is a novel method that integrates structured knowledge bases into LLMs by encoding KBs as key-value pairs and injecting them into the KV cache, enabling end-to-end inference with efficient knowledge compression and dynamic updates.

Details

Motivation: Traditional retrieval-augmented generation methods rely on external retrievers and multi-stage pipelines, which can be inefficient and dependent on retriever performance. SR-KI aims to enable direct knowledge integration within LLMs for more efficient and self-contained inference.

Method: Two-stage training: 1) Encode KBs into key-value pairs using pretrained encoder and inject into KV cache, 2) Locate retrieval layer and apply attention-based loss to supervise attention toward relevant KB entries. Enables retrieval entirely within model’s latent space.

Result: Successfully integrates up to 40K KBs into 7B LLM on single A100 40GB GPU. Achieves strong retrieval performance with over 98% Recall@10 on best task and >88% average across tasks. Maintains strong QA and KB ID generation performance while achieving up to 99.75% KB compression.

Conclusion: SR-KI provides an effective framework for integrating large-scale structured knowledge into LLMs with efficient compression and dynamic update capabilities, outperforming traditional retrieval-augmented approaches by enabling end-to-end inference within the model.

Abstract: This paper proposes SR-KI, a novel approach for integrating real-time and large-scale structured knowledge bases (KBs) into large language models (LLMs). SR-KI begins by encoding KBs into key-value pairs using a pretrained encoder, and injects them into LLMs’ KV cache. Building on this representation, we employ a two-stage training paradigm: first locating a dedicated retrieval layer within the LLM, and then applying an attention-based loss at this layer to explicitly supervise attention toward relevant KB entries. Unlike traditional retrieval-augmented generation methods that rely heavily on the performance of external retrievers and multi-stage pipelines, SR-KI supports end-to-end inference by performing retrieval entirely within the models latent space. This design enables efficient compression of injected knowledge and facilitates dynamic knowledge updates. Comprehensive experiments demonstrate that SR-KI enables the integration of up to 40K KBs into a 7B LLM on a single A100 40GB GPU, and achieves strong retrieval performance, maintaining over 98% Recall@10 on the best-performing task and exceeding 88% on average across all tasks. Task performance on question answering and KB ID generation also demonstrates that SR-KI maintains strong performance while achieving up to 99.75% compression of the injected KBs.

[53] Rethinking what Matters: Effective and Robust Multilingual Realignment for Low-Resource Languages

Quang Phuoc Nguyen, David Anugraha, Felix Gaschi, Jun Bin Cheng, En-Shiun Annie Lee

Main category: cs.CL

TL;DR: Realignment using strategically selected linguistically diverse language subsets can match or outperform full multilingual alignment for cross-lingual transfer, especially benefiting low-resource languages while reducing data requirements.

Details

Motivation: Realignment strategies for multilingual models often yield unreliable results for typologically distant or low-resource languages, and current tools require high-quality parallel data that is scarce for many languages.

Method: Conducted extensive empirical study with controlled experiments comparing realignment using all available languages versus strategically selected linguistically diverse subsets.

Result: Realignment is particularly effective for low-resource languages, and carefully selected subsets can match full multilingual alignment while even outperforming it for unseen low-resource languages.

Conclusion: Effective realignment doesn’t require exhaustive language coverage; informed language selection enables efficient and robust cross-lingual transfer while reducing data collection overhead.

Abstract: Realignment is a promising strategy to improve cross-lingual transfer in multilingual language models. However, empirical results are mixed and often unreliable, particularly for typologically distant or low-resource languages (LRLs) compared to English. Moreover, word realignment tools often rely on high-quality parallel data, which can be scarce or noisy for many LRLs. In this work, we conduct an extensive empirical study to investigate whether realignment truly benefits from using all available languages, or if strategically selected subsets can offer comparable or even improved cross-lingual transfer, and study the impact on LRLs. Our controlled experiments show that realignment can be particularly effective for LRLs and that using carefully selected, linguistically diverse subsets can match full multilingual alignment, and even outperform it for unseen LRLs. This indicates that effective realignment does not require exhaustive language coverage and can reduce data collection overhead, while remaining both efficient and robust when guided by informed language selection.

[54] You Had One Job: Per-Task Quantization Using LLMs’ Hidden Representations

Amit LeVi, Raz Lapid, Rom Himelstein, Yaniv Nemcovsky, Ravid Shwartz Ziv, Avi Mendelson

Main category: cs.CL

TL;DR: Task-aware post-training quantization methods (TAQ and TAQO) that identify task-relevant layers using hidden representations and allocate precision accordingly, achieving efficient task-specialized models while maintaining accuracy.

Details

Motivation: Large LLMs are inefficient for applications requiring limited capabilities, and existing PTQ methods are task-agnostic, ignoring how task-specific signals are distributed across layers.

Method: Two approaches: TAQ uses task-conditioned statistics from hidden activations to allocate bitwidths, while TAQO allocates precision based on direct layer sensitivity tests. Both identify task-relevant layers and preserve their precision while aggressively quantizing others.

Result: TAQ and TAQO outperform baselines across models. TAQ leads on Phi-4 (42.33 EM / 50.81 F1 vs AWQ’s 2.25/7.07), while TAQO leads on Llama-3.1, Qwen3, and Qwen2.5. Both remain within <1.0% of original accuracy at lower average precision.

Conclusion: Using hidden representations that encode task-salient signals as quantization guidelines yields stable task sensitivity profiles and efficient task-specialized models that maintain accuracy while reducing memory and latency.

Abstract: Large Language Models (LLMs) excel across diverse tasks, yet many applications require only limited capabilities, making large variants inefficient in memory and latency. Existing approaches often combine distillation and quantization, but most post-training quantization (PTQ) methods are task-agnostic, ignoring how task-specific signals are distributed across layers. In this work, we propose to use hidden representations that encode task-salient signals as a guideline for quantization. In order to fully utilize our innovative idea, this paper compares two new task-aware PTQ methods: Task-Aware Quantization (TAQ), which allocates bitwidths using task-conditioned statistics from hidden activations, and TAQO, which allocates precision based on direct layer sensitivity tests. From a small calibration set, these approaches identify task-relevant layers, preserving their precision while aggressively quantizing the rest. This yields stable task sensitivity profiles and efficient task-specialized models. Across models, TAQ and TAQO outperform the baselines; TAQ leads on Phi-4, while TAQO leads on Llama-3.1, Qwen3, and Qwen2.5. For instances, on Phi-4 it achieves 42.33 EM / 50.81 F1, far surpassing Activation-aware Weight Quantization (AWQ) (2.25 / 7.07), while remaining within < 1.0% of the original accuracy at lower average precision.

[55] Better Datasets Start From RefineLab: Automatic Optimization for High-Quality Dataset Refinement

Xiaonan Luo, Yue Huang, Ping He, Xiangliang Zhang

Main category: cs.CL

TL;DR: RefineLab is an LLM-driven framework that automatically refines raw QA datasets into high-quality ones under token-budget constraints, addressing gaps in domain coverage, difficulty distribution, and factual inconsistencies.

Details

Motivation: Existing QA datasets, even expert-crafted ones, suffer from persistent quality issues including domain coverage gaps, misaligned difficulty distributions, and factual inconsistencies. The rise of generative model-powered datasets has worsened these quality challenges.

Method: RefineLab takes target quality attributes as refinement objectives and performs selective edits within a predefined token budget. It uses an assignment module to select optimal refinement strategies (e.g., rephrasing, distractor replacement) for each QA sample to maximize overall dataset quality while respecting budget constraints.

Result: Experiments show RefineLab consistently narrows divergence from expert datasets across coverage, difficulty alignment, factual fidelity, and distractor quality.

Conclusion: RefineLab pioneers a scalable, customizable path to reproducible dataset design with broad implications for LLM evaluation, addressing constrained optimization to improve QA sample quality within resource limitations.

Abstract: High-quality Question-Answer (QA) datasets are foundational for reliable Large Language Model (LLM) evaluation, yet even expert-crafted datasets exhibit persistent gaps in domain coverage, misaligned difficulty distributions, and factual inconsistencies. The recent surge in generative model-powered datasets has compounded these quality challenges. In this work, we introduce RefineLab, the first LLM-driven framework that automatically refines raw QA textual data into high-quality datasets under a controllable token-budget constraint. RefineLab takes a set of target quality attributes (such as coverage and difficulty balance) as refinement objectives, and performs selective edits within a predefined token budget to ensure practicality and efficiency. In essence, RefineLab addresses a constrained optimization problem: improving the quality of QA samples as much as possible while respecting resource limitations. With a set of available refinement operations (e.g., rephrasing, distractor replacement), RefineLab takes as input the original dataset, a specified set of target quality dimensions, and a token budget, and determines which refinement operations should be applied to each QA sample. This process is guided by an assignment module that selects optimal refinement strategies to maximize overall dataset quality while adhering to the budget constraint. Experiments demonstrate that RefineLab consistently narrows divergence from expert datasets across coverage, difficulty alignment, factual fidelity, and distractor quality. RefineLab pioneers a scalable, customizable path to reproducible dataset design, with broad implications for LLM evaluation.

[56] Ibom NLP: A Step Toward Inclusive Natural Language Processing for Nigeria’s Minority Languages

Oluwadara Kalejaiye, Luel Hagos Beyene, David Ifeoluwa Adelani, Mmekut-Mfon Gabriel Edet, Aniefon Daniel Akpan, Eno-Abasi Urua, Anietie Andy

Main category: cs.CL

TL;DR: Introduces ibom dataset for machine translation and topic classification in four underrepresented Coastal Nigerian languages (Anaang, Efik, Ibibio, Oro), extending Flores-200 benchmark and aligning with SIB-200 topic labels.

Details

Motivation: Address the lack of NLP resources for Nigeria's linguistic diversity, where only 4 out of 500+ languages have NLP research, due to unavailability of textual data for training algorithms.

Method: Created ibom dataset by extending Flores-200 benchmark to four Coastal Nigerian languages and aligning translated texts with SIB-200 topic classification labels.

Result: Current LLMs perform poorly on machine translation for these languages in zero- and few-shot settings, but few-shot samples steadily improve topic classification performance with more shots.

Conclusion: Highlights the need for more resources and research in underrepresented Nigerian languages, showing current limitations in machine translation but potential for topic classification improvement with few-shot learning.

Abstract: Nigeria is the most populous country in Africa with a population of more than 200 million people. More than 500 languages are spoken in Nigeria and it is one of the most linguistically diverse countries in the world. Despite this, natural language processing (NLP) research has mostly focused on the following four languages: Hausa, Igbo, Nigerian-Pidgin, and Yoruba (i.e <1% of the languages spoken in Nigeria). This is in part due to the unavailability of textual data in these languages to train and apply NLP algorithms. In this work, we introduce ibom – a dataset for machine translation and topic classification in four Coastal Nigerian languages from the Akwa Ibom State region: Anaang, Efik, Ibibio, and Oro. These languages are not represented in Google Translate or in major benchmarks such as Flores-200 or SIB-200. We focus on extending Flores-200 benchmark to these languages, and further align the translated texts with topic labels based on SIB-200 classification dataset. Our evaluation shows that current LLMs perform poorly on machine translation for these languages in both zero-and-few shot settings. However, we find the few-shot samples to steadily improve topic classification with more shots.

[57] Rep2Text: Decoding Full Text from a Single LLM Token Representation

Haiyan Zhao, Zirui He, Fan Yang, Ali Payani, Mengnan Du

Main category: cs.CL

TL;DR: Rep2Text framework can recover over 50% of information from compressed last-token LLM representations while maintaining semantic integrity, showing an information bottleneck effect.

Details

Motivation: To understand how much original input text can be recovered from single last-token representations in LLMs, addressing the opacity of LLM internal mechanisms.

Method: Rep2Text uses a trainable adapter to project target model’s internal representations into a decoding LM’s embedding space for autoregressive text reconstruction.

Result: Over half of information in 16-token sequences can be recovered from compressed representations with strong semantic integrity and coherence; longer sequences show decreased token-level recovery but preserved semantics.

Conclusion: The framework demonstrates robust generalization to out-of-distribution medical data and reveals an information bottleneck effect in LLM representations.

Abstract: Large language models (LLMs) have achieved remarkable progress across diverse tasks, yet their internal mechanisms remain largely opaque. In this work, we address a fundamental question: to what extent can the original input text be recovered from a single last-token representation within an LLM? We propose Rep2Text, a novel framework for decoding full text from last-token representations. Rep2Text employs a trainable adapter that projects a target model’s internal representations into the embedding space of a decoding language model, which then autoregressively reconstructs the input text. Experiments on various model combinations (Llama-3.1-8B, Gemma-7B, Mistral-7B-v0.1, Llama-3.2-3B) demonstrate that, on average, over half of the information in 16-token sequences can be recovered from this compressed representation while maintaining strong semantic integrity and coherence. Furthermore, our analysis reveals an information bottleneck effect: longer sequences exhibit decreased token-level recovery while preserving strong semantic integrity. Besides, our framework also demonstrates robust generalization to out-of-distribution medical data.

[58] TabRAG: Tabular Document Retrieval via Structured Language Representations

Jacob Si, Mike Qu, Michelle Lee, Yingzhen Li

Main category: cs.CL

TL;DR: TabRAG is a parsing-based RAG pipeline that improves handling of table-heavy documents using structured language representations, outperforming existing parsing methods for generation and retrieval.

Details

Motivation: Existing parsing-based RAG methods suffer from suboptimal performance when extracting tabular data, while fine-tuning approaches have high computational requirements.

Method: TabRAG uses structured language representations to parse table-heavy documents in a parsing-based RAG pipeline.

Result: TabRAG outperforms existing popular parsing-based methods for both generation and retrieval tasks.

Conclusion: TabRAG provides an effective solution for handling table-heavy documents in RAG systems without the computational overhead of fine-tuning approaches.

Abstract: Ingesting data for Retrieval-Augmented Generation (RAG) involves either fine-tuning the embedding model directly on the target corpus or parsing documents for embedding model encoding. The former, while accurate, incurs high computational hardware requirements, while the latter suffers from suboptimal performance when extracting tabular data. In this work, we address the latter by presenting TabRAG, a parsing-based RAG pipeline designed to tackle table-heavy documents via structured language representations. TabRAG outperforms existing popular parsing-based methods for generation and retrieval. Code is available at https://github.com/jacobyhsi/TabRAG.

[59] Duality-based Mode Operations and Pyramid Multilayer Mapping for Rhetorical Modes

Zi-Niu Wu

Main category: cs.CL

TL;DR: This paper introduces duality-based operations and a pyramid mapping framework to expand rhetorical modes, quantify expressive diversity, and reduce cognitive complexity through mathematical analysis.

Details

Motivation: To bridge linguistic research, computational modeling, and academic writing by creating a dynamic, measurable system for rhetorical modes that enables AI systems to operate on layered rhetorical reasoning structures.

Method: Proposes duality-based mode operations (split-unite, forward-backward, expansion-reduction, orthogonal dualities) and a pyramid multilayer mapping framework with three layers (rhetorical model, cognitive, epistemic). Uses binomial combinatorics and Shannon entropy analysis to quantify expressive diversity and complexity reduction.

Result: Identifies a Marginal Rhetorical Bit (MRB) parameter measuring expressive growth speed, shows hierarchical selection reduces choice uncertainty compared to flat selection, and transforms static rhetorical taxonomies into dynamic measurable systems.

Conclusion: The framework enables future AI systems to operate on layered rhetorical reasoning structures, bridging linguistic, pedagogical, academic, and computational research domains.

Abstract: Rhetorical modes are useful in both academic and non-academic writing, and can be subjects to be studied within linguistic research and computational modeling. Establishing a conceptual bridge among these domains could enable each to benefit from the others. This paper proposes duality-based mode operations (split-unite, forward-backward, expansion-reduction and orthogonal dualities) to expand the set of rhetorical modes, introducing generated modes like combination and generalization, thereby enhancing epistemic diversity across multiple applications. It further presents a pyramid multilayer mapping framework (e.g., three layers from the rhetorical model layer, to cognitive layer, and to epistemic layers) that reduces the resulting cognitive complexity. The degrees of expressive diversity and complexity reduction are quantified through binomial combinatorics and Shannon entropy analysis. A Marginal Rhetorical Bit (MRB) is identified, permitting the definition of a rhetorical-scalable parameter that measures expressive growth speed in bits per stage. A direct entropy measure shows that hierarchical selection over smaller subsets markedly reduces choice uncertainty compared with flat selection across all modes. These considerations appear to transform static and non-measurable rhetorical taxonomies into more dynamic and more measurable systems for discourse design. From this work, it would be possible to identify a pathway for future AI systems to operate not only on language tokens but on layered rhetorical reasoning structures, bridging linguistic, pedagogical, academic, and computational research

[60] How AI Fails: An Interactive Pedagogical Tool for Demonstrating Dialectal Bias in Automated Toxicity Models

Subhojit Ghimire

Main category: cs.CL

TL;DR: This paper demonstrates systematic bias in AI toxicity detection, showing African-American English is flagged as 1.8x more toxic and 8.8x higher for identity hate than Standard American English, and provides an interactive tool to make these biases tangible.

Details

Motivation: To address concerns about AI bias in content moderation and provide certainty about whether flagged content is genuinely inappropriate or victims of biased algorithms.

Method: Dual approach: quantitative benchmark of a widely used toxicity model (unitary/toxic-bert) comparing African-American English vs Standard American English, plus an interactive pedagogical tool with user-controlled sensitivity threshold.

Result: Clear systematic bias found - AAE text scored 1.8 times more toxic and 8.8 times higher for identity hate than SAE text. The tool demonstrates that biased scores combined with human-set policies operationalize discrimination.

Conclusion: Provides statistical evidence of disparate impact in AI moderation and a public-facing tool to foster critical AI literacy about how seemingly neutral policies can enable discrimination.

Abstract: Now that AI-driven moderation has become pervasive in everyday life, we often hear claims that “the AI is biased”. While this is often said jokingly, the light-hearted remark reflects a deeper concern. How can we be certain that an online post flagged as “inappropriate” was not simply the victim of a biased algorithm? This paper investigates this problem using a dual approach. First, I conduct a quantitative benchmark of a widely used toxicity model (unitary/toxic-bert) to measure performance disparity between text in African-American English (AAE) and Standard American English (SAE). The benchmark reveals a clear, systematic bias: on average, the model scores AAE text as 1.8 times more toxic and 8.8 times higher for “identity hate”. Second, I introduce an interactive pedagogical tool that makes these abstract biases tangible. The tool’s core mechanic, a user-controlled “sensitivity threshold,” demonstrates that the biased score itself is not the only harm; instead, the more-concerning harm is the human-set, seemingly neutral policy that ultimately operationalises discrimination. This work provides both statistical evidence of disparate impact and a public-facing tool designed to foster critical AI literacy.

Keunhyeung Park, Seunguk Yu, Youngbin Kim

Main category: cs.CL

TL;DR: Proposes DIA-REFINE framework using iterative translation-verification-feedback loop with dialect classifiers to improve dialect machine translation, and introduces new metrics (DFS, TDR) to better evaluate dialect fidelity.

Details

Motivation: Standard-to-dialect MT faces challenges due to dialect gaps in LLMs and evaluation distortions from n-gram metrics that favor source copying over authentic dialect translation.

Method: DIA-REFINE framework with iterative loop of translation, verification using external dialect classifiers, and feedback; introduces dialect fidelity score (DFS) and target dialect ratio (TDR) metrics.

Result: Consistently enhances dialect fidelity across Korean dialects; distinguishes False Success (high n-gram but failed dialect) from True Attempt (low n-gram but genuine dialect) cases; models show varying responsiveness; in-context examples further improve dialect expression translation.

Conclusion: Establishes robust framework for goal-directed, inclusive dialect translation with rigorous evaluation and insights into model performance.

Abstract: Standard-to-dialect machine translation remains challenging due to a persistent dialect gap in large language models and evaluation distortions inherent in n-gram metrics, which favor source copying over authentic dialect translation. In this paper, we propose the dialect refinement (DIA-REFINE) framework, which guides LLMs toward faithful target dialect outputs through an iterative loop of translation, verification, and feedback using external dialect classifiers. To address the limitations of n-gram-based metrics, we introduce the dialect fidelity score (DFS) to quantify linguistic shift and the target dialect ratio (TDR) to measure the success of dialect translation. Experiments on Korean dialects across zero-shot and in-context learning baselines demonstrate that DIA-REFINE consistently enhances dialect fidelity. The proposed metrics distinguish between False Success cases, where high n-gram scores obscure failures in dialectal translation, and True Attempt cases, where genuine attempts at dialectal translation yield low n-gram scores. We also observed that models exhibit varying degrees of responsiveness to the framework, and that integrating in-context examples further improves the translation of dialectal expressions. Our work establishes a robust framework for goal-directed, inclusive dialect translation, providing both rigorous evaluation and critical insights into model performance.

[62] Textual Self-attention Network: Test-Time Preference Optimization through Textual Gradient-based Attention

Shibing Mo, Haoyang Ruan, Kai Wu, Jing Liu

Main category: cs.CL

TL;DR: TSAN is a test-time preference optimization method that uses textual self-attention to analyze and synthesize strengths from multiple candidate responses without parameter updates, outperforming supervised models.

Details

Motivation: Current test-time methods lack systematic mechanisms to analyze and combine strengths from multiple candidate responses, which could produce superior outcomes by leveraging different aspects like clarity, factual accuracy, and tone.

Method: TSAN emulates self-attention in natural language by formatting candidates into textual keys and values, using LLM-based attention to weigh relevance, and synthesizing strengths into preference-aligned responses through iterative optimization in textual gradient space.

Result: Empirical evaluations show TSAN outperforms supervised models like Llama-3.1-70B-Instruct and surpasses current state-of-the-art test-time alignment methods with just three test-time iterations on a base SFT model.

Conclusion: TSAN provides an effective, interpretable, and parameter-free approach for test-time preference optimization by systematically leveraging multiple candidate solutions through textual self-attention mechanisms.

Abstract: Large Language Models (LLMs) have demonstrated remarkable generalization capabilities, but aligning their outputs with human preferences typically requires expensive supervised fine-tuning. Recent test-time methods leverage textual feedback to overcome this, but they often critique and revise a single candidate response, lacking a principled mechanism to systematically analyze, weigh, and synthesize the strengths of multiple promising candidates. Such a mechanism is crucial because different responses may excel in distinct aspects (e.g., clarity, factual accuracy, or tone), and combining their best elements may produce a far superior outcome. This paper proposes the Textual Self-Attention Network (TSAN), a new paradigm for test-time preference optimization that requires no parameter updates. TSAN emulates self-attention entirely in natural language to overcome this gap: it analyzes multiple candidates by formatting them into textual keys and values, weighs their relevance using an LLM-based attention module, and synthesizes their strengths into a new, preference-aligned response under the guidance of the learned textual attention. This entire process operates in a textual gradient space, enabling iterative and interpretable optimization. Empirical evaluations demonstrate that with just three test-time iterations on a base SFT model, TSAN outperforms supervised models like Llama-3.1-70B-Instruct and surpasses the current state-of-the-art test-time alignment method by effectively leveraging multiple candidate solutions.

[63] Sentiment Analysis On YouTube Comments Using Machine Learning Techniques Based On Video Games Content

Adi Danish Bin Muhammad Amin, Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Zulfahmi Toh, Nur Syafiqah Nafis

Main category: cs.CL

TL;DR: Sentiment analysis of YouTube comments on video games using machine learning algorithms, with SVM achieving the highest accuracy in classifying user sentiments.

Details

Motivation: To understand user sentiments in the gaming community expressed on social media platforms like YouTube, providing valuable feedback for game developers to improve game design and user experience.

Method: Collected YouTube comments using YouTube API, pre-processed the data, and applied machine learning algorithms (Naïve Bayes, Logistic Regression, SVM) with TextBlob sentiment analysis tool.

Result: SVM demonstrated superior performance with the highest classification accuracy across different datasets, revealing trends and insights into user preferences and critiques in gaming videos.

Conclusion: Advanced sentiment analysis is crucial for capturing nuanced emotions in user comments, and future research will focus on integrating sophisticated NLP techniques and exploring additional data sources.

Abstract: The rapid evolution of the gaming industry, driven by technological advancements and a burgeoning community, necessitates a deeper understanding of user sentiments, especially as expressed on popular social media platforms like YouTube. This study presents a sentiment analysis on video games based on YouTube comments, aiming to understand user sentiments within the gaming community. Utilizing YouTube API, comments related to various video games were collected and analyzed using the TextBlob sentiment analysis tool. The pre-processed data underwent classification using machine learning algorithms, including Naïve Bayes, Logistic Regression, and Support Vector Machine (SVM). Among these, SVM demonstrated superior performance, achieving the highest classification accuracy across different datasets. The analysis spanned multiple popular gaming videos, revealing trends and insights into user preferences and critiques. The findings underscore the importance of advanced sentiment analysis in capturing the nuanced emotions expressed in user comments, providing valuable feedback for game developers to enhance game design and user experience. Future research will focus on integrating more sophisticated natural language processing techniques and exploring additional data sources to further refine sentiment analysis in the gaming domain.

[64] Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights

Hyunjae Kim, Jiwoong Sohn, Aidan Gilson, Nicholas Cochran-Caggiano, Serina Applebaum, Heeju Jin, Seihee Park, Yujin Park, Jiyeong Park, Seoyoung Choi, Brittany Alexandra Herrera Contreras, Thomas Huang, Jaehoon Yun, Ethan F. Wei, Roy Jiang, Leah Colucci, Eric Lai, Amisha Dave, Tuo Guo, Maxwell B. Singer, Yonghoe Koo, Ron A. Adelman, James Zou, Andrew Taylor, Arman Cohan, Hua Xu, Qingyu Chen

Main category: cs.CL

TL;DR: Standard RAG often degrades medical LLM performance due to poor evidence retrieval (only 22% of passages relevant) and weak evidence selection, but simple strategies like filtering and query reformulation can improve performance significantly.

Details

Motivation: To address challenges of keeping LLMs updated with evolving medical knowledge and providing verifiable reasoning, and to systematically evaluate whether RAG reliably achieves these goals in medicine.

Method: Comprehensive expert evaluation with 18 medical experts providing 80,502 annotations on 800 model outputs from GPT-4o and Llama-3.1-8B across 200 real-world patient and USMLE-style queries, systematically decomposing RAG pipeline into evidence retrieval, evidence selection, and response generation.

Result: Standard RAG degraded performance: only 22% of top-16 passages were relevant, evidence selection precision 41-43% and recall 27-49%, factuality and completeness dropped by up to 6% and 5% respectively compared to non-RAG variants.

Conclusion: Retrieval and evidence selection remain key failure points; simple strategies like evidence filtering and query reformulation substantially mitigate issues, improving performance on benchmarks by up to 12% and 8.2%, calling for re-examination of RAG’s role in medicine and highlighting need for stage-aware evaluation and deliberate system design.

Abstract: Large language models (LLMs) are transforming the landscape of medicine, yet two fundamental challenges persist: keeping up with rapidly evolving medical knowledge and providing verifiable, evidence-grounded reasoning. Retrieval-augmented generation (RAG) has been widely adopted to address these limitations by supplementing model outputs with retrieved evidence. However, whether RAG reliably achieves these goals remains unclear. Here, we present the most comprehensive expert evaluation of RAG in medicine to date. Eighteen medical experts contributed a total of 80,502 annotations, assessing 800 model outputs generated by GPT-4o and Llama-3.1-8B across 200 real-world patient and USMLE-style queries. We systematically decomposed the RAG pipeline into three components: (i) evidence retrieval (relevance of retrieved passages), (ii) evidence selection (accuracy of evidence usage), and (iii) response generation (factuality and completeness of outputs). Contrary to expectation, standard RAG often degraded performance: only 22% of top-16 passages were relevant, evidence selection remained weak (precision 41-43%, recall 27-49%), and factuality and completeness dropped by up to 6% and 5%, respectively, compared with non-RAG variants. Retrieval and evidence selection remain key failure points for the model, contributing to the overall performance drop. We further show that simple yet effective strategies, including evidence filtering and query reformulation, substantially mitigate these issues, improving performance on MedMCQA and MedXpertQA by up to 12% and 8.2%, respectively. These findings call for re-examining RAG’s role in medicine and highlight the importance of stage-aware evaluation and deliberate system design for reliable medical LLM applications.

[65] Sensitivity of Small Language Models to Fine-tuning Data Contamination

Nicy Scaria, Silvester John Joseph Kennedy, Deepak Subramani

Main category: cs.CL

TL;DR: SLMs show asymmetric vulnerability to data contamination: syntactic transformations cause catastrophic failure while semantic transformations show threshold behaviors, with larger models paradoxically more susceptible to harmful instructions.

Details

Motivation: To systematically understand SLMs' behavioral robustness to data contamination during instruction tuning, as current assumptions may not hold for smaller models in resource-constrained deployments.

Method: Evaluated 23 SLMs (270M-4B parameters) across multiple families using syntactic (character/word reversal) and semantic (irrelevant/counterfactual responses) transformations at 25%, 50%, 75%, 100% contamination levels during instruction tuning.

Result: Syntactic transformations caused catastrophic degradation (character reversal produced near-complete failure), while semantic transformations showed greater resilience. Larger models were more susceptible to learning harmful instructions (“capability curse”), and alignment provided inconsistent robustness benefits.

Conclusion: Current robustness assumptions don’t hold for SLMs, highlighting need for contamination-aware training protocols and systematic evaluation methods for deployment safety.

Abstract: Small Language Models (SLMs) are increasingly being deployed in resource-constrained environments, yet their behavioral robustness to data contamination during instruction tuning remains poorly understood. We systematically investigate the contamination sensitivity of 23 SLMs (270M to 4B parameters) across multiple model families by measuring susceptibility to syntactic and semantic transformation types during instruction tuning: syntactic transformations (character and word reversal) and semantic transformations (irrelevant and counterfactual responses), each applied at contamination levels of 25%, 50%, 75%, and 100%. Our results reveal fundamental asymmetries in vulnerability patterns: syntactic transformations cause catastrophic performance degradation, with character reversal producing near-complete failure across all models regardless of size or family, while semantic transformations demonstrate distinct threshold behaviors and greater resilience in core linguistic capabilities. Critically, we discover a ``\textit{capability curse}" where larger, more capable models become more susceptible to learning semantic corruptions, effectively following harmful instructions more readily, while our analysis of base versus instruction-tuned variants reveals that alignment provides inconsistent robustness benefits, sometimes even reducing resilience. Our work establishes three core contributions: (1) empirical evidence of SLMs’ disproportionate vulnerability to syntactic pattern contamination, (2) identification of asymmetric sensitivity patterns between syntactic and semantic transformations, and (3) systematic evaluation protocols for contamination robustness assessment. These findings have immediate deployment implications, suggesting that current robustness assumptions may not hold for smaller models and highlighting the need for contamination-aware training protocols.

[66] SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces

Ruiheng Liu, XiaoBing Chen, Jinyu Zhang, Qiongwen Zhang, Yu Zhang, Bailong Yang

Main category: cs.CL

TL;DR: SafeNlidb is a privacy-security alignment framework for LLM-based NLIDB that generates security-aware SQL queries through automated hybrid chain-of-thought reasoning, addressing data leakage risks without compromising utility.

Details

Motivation: Current LLM-based NLIDB systems face privacy and security risks where confidential database content can be unintentionally exposed or manipulated through inference-based attacks, while existing mitigation methods struggle with complex attacks and high false positives.

Method: Proposes SafeNlidb framework with automated pipeline generating hybrid chain-of-thought interaction data, combining implicit security reasoning with SQL generation. Uses reasoning warm-up and alternating preference optimization to overcome multi-preference oscillations in DPO.

Result: Extensive experiments show the method outperforms larger-scale LLMs and ideal-setting baselines, achieving significant security improvements while preserving high utility.

Conclusion: SafeNlidb successfully addresses privacy-security concerns in LLM-based NLIDB through automated security reasoning and preference optimization, enabling secure SQL generation without human-annotated data.

Abstract: The rapid advancement of Large Language Models (LLMs) has driven significant progress in Natural Language Interface to Database (NLIDB). However, the widespread adoption of LLMs has raised critical privacy and security concerns. During interactions, LLMs may unintentionally expose confidential database contents or be manipulated by attackers to exfiltrate data through seemingly benign queries. While current efforts typically rely on rule-based heuristics or LLM agents to mitigate this leakage risk, these methods still struggle with complex inference-based attacks, suffer from high false positive rates, and often compromise the reliability of SQL queries. To address these challenges, we propose \textsc{SafeNlidb}, a novel privacy-security alignment framework for LLM-based NLIDB. The framework features an automated pipeline that generates hybrid chain-of-thought interaction data from scratch, seamlessly combining implicit security reasoning with SQL generation. Additionally, we introduce reasoning warm-up and alternating preference optimization to overcome the multi-preference oscillations of Direct Preference Optimization (DPO), enabling LLMs to produce security-aware SQL through fine-grained reasoning without the need for human-annotated preference data. Extensive experiments demonstrate that our method outperforms both larger-scale LLMs and ideal-setting baselines, achieving significant security improvements while preserving high utility.WARNING: This work may contain content that is offensive and harmful!

[67] Learning to Focus: Focal Attention for Selective and Scalable Transformers

Dhananjay Ram, Wei Xia, Stefano Soatto

Main category: cs.CL

TL;DR: Focal Attention improves transformer models by sharpening attention distributions through controlled softmax temperature, achieving better performance with fewer parameters and less training data, especially on long-context tasks.

Details

Motivation: Standard softmax attention produces noisy probability distributions that impair feature selection, particularly in long contexts, limiting transformer model effectiveness.

Method: Proposes Focal Attention that sharpens attention distribution by controlling softmax temperature as either fixed hyperparameter or learnable parameter during training.

Result: Achieves same accuracy with 42% fewer parameters or 33% less training data; delivers 17-82% relative improvements on long-context tasks; scales better than standard transformers.

Conclusion: Focal Attention is an effective modification that enables transformers to focus on relevant tokens while suppressing irrelevant ones, demonstrating significant improvements across various benchmarks and real-world applications.

Abstract: Attention is a core component of transformer architecture, whether encoder-only, decoder-only, or encoder-decoder model. However, the standard softmax attention often produces noisy probability distribution, which can impair effective feature selection at every layer of these models, particularly for long contexts. We propose Focal Attention, a simple yet effective modification that sharpens the attention distribution by controlling the softmax temperature, either as a fixed hyperparameter or as a learnable parameter during training. This sharpening enables the model to concentrate on the most relevant tokens while suppressing irrelevant ones. Empirically, Focal Attention scales more favorably than standard transformer with respect to model size, training data, and context length. Across diverse benchmarks, it achieves the same accuracy with up to 42% fewer parameters or 33% less training data. On long-context tasks, it delivers substantial relative improvements ranging from 17% to 82%, demonstrating its effectiveness in real world applications.

[68] AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts

Jiří Milička, Anna Marklová, Václav Cvrček

Main category: cs.CL

TL;DR: Created English and Czech LLM-generated text corpora that replicate human reference corpora for linguistic comparison between human-written and AI-generated texts.

Details

Motivation: To develop resources for comparing human-written texts with LLM-generated text linguistically, ensuring multi-genre coverage with diverse topics, authors, and text types while maintaining comparability with existing human corpora.

Method: Generated corpora using various LLMs (OpenAI, Anthropic, Alphabet, Meta, DeepSeek) ranging from GPT-3 to GPT-4.5, replicating reference human corpora BE21 (English) and Koditex (Czech), with Universal Dependencies tagging for tokenization, lemmatization, and morphological/syntactic annotation.

Result: Created two corpora: English part with average 864k tokens per model (27M total), Czech part with average 768k tokens per model (21.5M total), freely available under CC BY 4.0 license and accessible through Czech National Corpus search interface.

Conclusion: Successfully developed comprehensive LLM-generated text corpora that serve as valuable resources for linguistic analysis and comparison between human and AI-generated texts across multiple languages and genres.

Abstract: This article presents two corpora of English and Czech texts generated with large language models (LLMs). The motivation is to create a resource for comparing human-written texts with LLM-generated text linguistically. Emphasis was placed on ensuring these resources are multi-genre and rich in terms of topics, authors, and text types, while maintaining comparability with existing human-created corpora. These generated corpora replicate reference human corpora: BE21 by Paul Baker, which is a modern version of the original Brown Corpus, and Koditex corpus that also follows the Brown Corpus tradition but in Czech. The new corpora were generated using models from OpenAI, Anthropic, Alphabet, Meta, and DeepSeek, ranging from GPT-3 (davinci-002) to GPT-4.5, and are tagged according to the Universal Dependencies standard (i.e., they are tokenized, lemmatized, and morphologically and syntactically annotated). The subcorpus size varies according to the model used (the English part contains on average 864k tokens per model, 27M tokens altogether, the Czech partcontains on average 768k tokens per model, 21.5M tokens altogether). The corpora are freely available for download under the CC BY 4.0 license (the annotated data are under CC BY-NC-SA 4.0 licence) and are also accessible through the search interface of the Czech National Corpus.

[69] Beyond Plain Demos: A Demo-centric Anchoring Paradigm for In-Context Learning in Alzheimer’s Disease Detection

Puzhen Su, Haoran Yin, Yongzhu Miao, Jintao Tang, Shasha Li, Ting Wang

Main category: cs.CL

TL;DR: DA4ICL improves Alzheimer’s disease detection from narrative transcripts by addressing limitations in standard in-context learning through diverse demo retrieval and projected vector anchoring at all Transformer layers.

Details

Motivation: Standard in-context learning fails for AD detection due to homogeneous transcript contexts that limit both task cognition and contextual perception, while task vector approaches suffer from granularity mismatches.

Method: Proposes DA4ICL framework with Diverse and Contrastive Retrieval (DCR) to expand context width and Projected Vector Anchoring (PVA) to deepen signal perception at every Transformer layer.

Result: Achieves large, stable gains over both ICL and task vector baselines across three AD benchmarks.

Conclusion: DA4ICL establishes a new paradigm for fine-grained, out-of-distribution and low-resource LLM adaptation through demo-centric anchoring.

Abstract: Detecting Alzheimer’s disease (AD) from narrative transcripts challenges large language models (LLMs): pre-training rarely covers this out-of-distribution task, and all transcript demos describe the same scene, producing highly homogeneous contexts. These factors cripple both the model’s built-in task knowledge (\textbf{task cognition}) and its ability to surface subtle, class-discriminative cues (\textbf{contextual perception}). Because cognition is fixed after pre-training, improving in-context learning (ICL) for AD detection hinges on enriching perception through better demonstration (demo) sets. We demonstrate that standard ICL quickly saturates, its demos lack diversity (context width) and fail to convey fine-grained signals (context depth), and that recent task vector (TV) approaches improve broad task adaptation by injecting TV into the LLMs’ hidden states (HSs), they are ill-suited for AD detection due to the mismatch of injection granularity, strength and position. To address these bottlenecks, we introduce \textbf{DA4ICL}, a demo-centric anchoring framework that jointly expands context width via \emph{\textbf{Diverse and Contrastive Retrieval}} (DCR) and deepens each demo’s signal via \emph{\textbf{Projected Vector Anchoring}} (PVA) at every Transformer layer. Across three AD benchmarks, DA4ICL achieves large, stable gains over both ICL and TV baselines, charting a new paradigm for fine-grained, OOD and low-resource LLM adaptation.

[70] Inclusion of Role into Named Entity Recognition and Ranking

Neelesh Kumar Shukla, Sanasam Ranbir Singh

Main category: cs.CL

TL;DR: This paper explores Entity Role Detection by modeling it as both Named Entity Recognition (NER) and Entity Retrieval tasks, using automated methods to learn role representations from limited domain-agnostic data.

Details

Motivation: Traditional NLP systems handle entity-based processing but struggle when entities play specific contextual roles. The challenge is retrieving entities based on their domain-dependent roles when roles and entities are indirectly described.

Method: Two approaches: 1) NER modeling with roles as mutually exclusive classes using sequence tagging, 2) Entity Retrieval modeling with roles as queries and entities as collections. Automated learning of representative words/phrases for role and entity representations using sentence and document contexts.

Result: Developed methods to build role and entity representations using learned representative words and phrases, working effectively in domain-agnostic settings with small datasets.

Conclusion: Entity Role Detection can be effectively addressed through both NER and Entity Retrieval approaches, with automated representation learning enabling domain-agnostic performance even with limited training data.

Abstract: Most of the Natural Language Processing sys- tems are involved in entity-based processing for several tasks like Information Extraction, Question-Answering, Text-Summarization and so on. A new challenge comes when entities play roles according to their act or attributes in certain context. Entity Role Detection is the task of assigning such roles to the entities. Usu- ally real-world entities are of types: person, lo- cation and organization etc. Roles could be con- sidered as domain-dependent subtypes of these types. In the cases, where retrieving a subset of entities based on their roles is needed, poses the problem of defining the role and entities having those roles. This paper presents the study of study of solving Entity Role Detection prob- lem by modeling it as Named Entity Recogni- tion (NER) and Entity Retrieval/Ranking task. In NER, these roles could be considered as mutually exclusive classes and standard NER methods like sequence tagging could be used. For Entity Retrieval, Roles could be formulated as Query and entities as Collection on which the query needs to be executed. The aspect of Entity Retrieval task, which is different than document retrieval task is that the entities and roles against which they need to be retrieved are indirectly described. We have formulated au- tomated ways of learning representative words and phrases and building representations of roles and entities using them. We have also explored different contexts like sentence and document. Since the roles depend upon con- text, so it is not always possible to have large domain-specific dataset or knowledge bases for learning purposes, so we have tried to exploit the information from small dataset in domain- agnostic way.

[71] EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers

Yilin Jiang, Mingzi Zhang, Xuanyu Yin, Sheng Jin, Suyu Lu, Zuocan Ying, Zengyi Yu, Xiangjie Kong

Main category: cs.CL

TL;DR: EduGuardBench is a benchmark for evaluating Large Language Models as teachers, assessing role-playing fidelity and teaching-specific harms through adversarial testing, revealing performance polarization and a counterintuitive scaling paradox in model safety.

Details

Motivation: Existing benchmarks fail to measure role-playing fidelity or address unique teaching harms in educational scenarios, creating critical challenges for ensuring professional competence and ethical safety of AI teachers.

Method: Proposed EduGuardBench with dual components: Role-playing Fidelity Score (RFS) for professional fidelity assessment, and persona-based adversarial prompts targeting general harms and academic misconduct, evaluated with Attack Success Rate (ASR) and three-tier Refusal Quality assessment.

Result: Experiments on 14 models show performance polarization - reasoning models have superior fidelity but incompetence dominates failures. Mid-sized models are most vulnerable (scaling paradox). Safest models exhibit Educational Transformation Effect, converting harmful requests into teachable moments with strong negative correlation to ASR.

Conclusion: EduGuardBench provides holistic assessment of professional, ethical, and pedagogical alignment, uncovering complex dynamics essential for trustworthy AI deployment in education, with Educational Refusals representing a new dimension of advanced AI safety.

Abstract: Large Language Models for Simulating Professions (SP-LLMs), particularly as teachers, are pivotal for personalized education. However, ensuring their professional competence and ethical safety is a critical challenge, as existing benchmarks fail to measure role-playing fidelity or address the unique teaching harms inherent in educational scenarios. To address this, we propose EduGuardBench, a dual-component benchmark. It assesses professional fidelity using a Role-playing Fidelity Score (RFS) while diagnosing harms specific to the teaching profession. It also probes safety vulnerabilities using persona-based adversarial prompts targeting both general harms and, particularly, academic misconduct, evaluated with metrics including Attack Success Rate (ASR) and a three-tier Refusal Quality assessment. Our extensive experiments on 14 leading models reveal a stark polarization in performance. While reasoning-oriented models generally show superior fidelity, incompetence remains the dominant failure mode across most models. The adversarial tests uncovered a counterintuitive scaling paradox, where mid-sized models can be the most vulnerable, challenging monotonic safety assumptions. Critically, we identified a powerful Educational Transformation Effect: the safest models excel at converting harmful requests into teachable moments by providing ideal Educational Refusals. This capacity is strongly negatively correlated with ASR, revealing a new dimension of advanced AI safety. EduGuardBench thus provides a reproducible framework that moves beyond siloed knowledge tests toward a holistic assessment of professional, ethical, and pedagogical alignment, uncovering complex dynamics essential for deploying trustworthy AI in education. See https://github.com/YL1N/EduGuardBench for Materials.

[72] RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation

Haofeng Wang, Yu Zhang

Main category: cs.CL

TL;DR: Proposes RPTS, a tree-based metric to evaluate multimodal reasoning processes in LVLMs, addressing limitations of existing benchmarks that overlook flawed reasoning leading to correct answers and intermodal relationships.

Details

Motivation: Current multimodal benchmarks evaluate models through simple formats that don't assess reasoning processes, overlook flawed reasoning that produces correct answers, and ignore intermodal relationships' impact on reasoning.

Method: Organizes reasoning steps into a tree structure, assigns weighted faithfulness scores to each step using hierarchical information, and dynamically adjusts weights to pinpoint reasoning failures. Created RPTS-Eval benchmark with 374 images and 390 reasoning instances.

Result: Evaluation of representative LVLMs (GPT4o, Llava-Next) revealed their limitations in multimodal reasoning and highlighted differences between open-source and closed-source models.

Conclusion: The RPTS benchmark and methodology will advance multimodal reasoning research by providing comprehensive evaluation of reasoning processes and intermodal relationships.

Abstract: Large Vision-Language Models (LVLMs) excel in multimodal reasoning and have shown impressive performance on various multimodal benchmarks. However, most of these benchmarks evaluate models primarily through multiple-choice or short-answer formats, which do not take the reasoning process into account. Although some benchmarks assess the reasoning process, their methods are often overly simplistic and only examine reasoning when answers are incorrect. This approach overlooks scenarios where flawed reasoning leads to correct answers. In addition, these benchmarks do not consider the impact of intermodal relationships on reasoning. To address this issue, we propose the Reasoning Process Tree Score (RPTS), a tree structure-based metric to assess reasoning processes. Specifically, we organize the reasoning steps into a reasoning tree and leverage its hierarchical information to assign weighted faithfulness scores to each reasoning step. By dynamically adjusting these weights, RPTS not only evaluates the overall correctness of the reasoning, but also pinpoints where the model fails in the reasoning. To validate RPTS in real-world multimodal scenarios, we construct a new benchmark, RPTS-Eval, comprising 374 images and 390 reasoning instances. Each instance includes reliable visual-textual clues that serve as leaf nodes of the reasoning tree. Furthermore, we define three types of intermodal relationships to investigate how intermodal interactions influence the reasoning process. We evaluated representative LVLMs (e.g., GPT4o, Llava-Next), uncovering their limitations in multimodal reasoning and highlighting the differences between open-source and closed-source commercial LVLMs. We believe that this benchmark will contribute to the advancement of research in the field of multimodal reasoning.

[73] HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection

Fangqi Dai, Xingjian Jiang, Zizhuang Deng

Main category: cs.CL

TL;DR: HLPD is a new method for detecting machine-revised text using human language preference optimization to enhance sensitivity to human writing patterns, achieving significant improvements over existing methods.

Details

Motivation: To address the challenge of identifying machine-revised text in black-box settings where previous methods struggle with advanced LLM outputs and adversarial revisions.

Method: Uses Human Language Preference Optimization (HLPO) - a reward-based alignment process that shifts token distribution toward human-like writing, making the model more sensitive to human writing patterns.

Result: Achieves 15.11% relative improvement in AUROC over ImBD for GPT-series revised text, and 45.56% improvement over Fast-DetectGPT. For advanced LLMs, achieves highest average AUROC exceeding ImBD by 5.53% and Fast-DetectGPT by 34.14%.

Conclusion: HLPD effectively enhances detection of machine-revised text by leveraging human writing style preferences, demonstrating superior performance across diverse revision scenarios.

Abstract: To prevent misinformation and social issues arising from trustworthy-looking content generated by LLMs, it is crucial to develop efficient and reliable methods for identifying the source of texts. Previous approaches have demonstrated exceptional performance in detecting texts fully generated by LLMs. However, these methods struggle when confronting more advanced LLM output or text with adversarial multi-task machine revision, especially in the black-box setting, where the generating model is unknown. To address this challenge, grounded in the hypothesis that human writing possesses distinctive stylistic patterns, we propose Human Language Preference Detection (HLPD). HLPD employs a reward-based alignment process, Human Language Preference Optimization (HLPO), to shift the scoring model’s token distribution toward human-like writing, making the model more sensitive to human writing, therefore enhancing the identification of machine-revised text. We test HLPD in an adversarial multi-task evaluation framework that leverages a five-dimensional prompt generator and multiple advanced LLMs to create diverse revision scenarios. When detecting texts revised by GPT-series models, HLPD achieves a 15.11% relative improvement in AUROC over ImBD, surpassing Fast-DetectGPT by 45.56%. When evaluated on texts generated by advanced LLMs, HLPD achieves the highest average AUROC, exceeding ImBD by 5.53% and Fast-DetectGPT by 34.14%. Code will be made available at https://github.com/dfq2021/HLPD.

[74] SCOPE: Intrinsic Semantic Space Control for Mitigating Copyright Infringement in LLMs

Zhenliang Zhang, Xinyu Hu, Xiaojun Wan

Main category: cs.CL

TL;DR: SCOPE is an inference-time method that mitigates copyright infringement in LLMs by identifying and clamping copyright-sensitive activations in semantic space, without parameter updates or external filters.

Details

Motivation: Existing copyright infringement defenses rely on surface-level token matching and external blocklists/filters, which add deployment complexity and may miss semantically paraphrased content.

Method: Uses sparse autoencoder (SAE) to project hidden states into high-dimensional near-monosemantic space, identifies copyright-sensitive subspace, and clamps its activations during decoding.

Result: Experiments show SCOPE effectively mitigates copyright infringement without degrading general model utility on widely recognized benchmarks.

Conclusion: The method successfully isolates copyright-sensitive semantics in a dedicated subspace and provides an intrinsic solution for copyright protection.

Abstract: Large language models sometimes inadvertently reproduce passages that are copyrighted, exposing downstream applications to legal risk. Most existing studies for inference-time defences focus on surface-level token matching and rely on external blocklists or filters, which add deployment complexity and may overlook semantically paraphrased leakage. In this work, we reframe copyright infringement mitigation as intrinsic semantic-space control and introduce SCOPE, an inference-time method that requires no parameter updates or auxiliary filters. Specifically, the sparse autoencoder (SAE) projects hidden states into a high-dimensional, near-monosemantic space; benefiting from this representation, we identify a copyright-sensitive subspace and clamp its activations during decoding. Experiments on widely recognized benchmarks show that SCOPE mitigates copyright infringement without degrading general utility. Further interpretability analyses confirm that the isolated subspace captures high-level semantics.

[75] Automated Circuit Interpretation via Probe Prompting

Giuseppe Birardi

Main category: cs.CL

TL;DR: Probe prompting automates neural network interpretability by transforming attribution graphs into compact, interpretable subgraphs using concept-aligned supernodes, reducing manual analysis time from 2 hours to automated processing.

Details

Motivation: Manual analysis of attribution graphs for neural network interpretability is time-consuming (2 hours per prompt), requiring automation to scale mechanistic interpretability research.

Method: Uses probe prompting pipeline: selects high-influence features, generates concept-targeted probes, groups features by cross-prompt activation signatures into Semantic, Relationship, and Say-X categories with transparent decision rules.

Result: Achieves high explanatory coverage (Completeness 0.83) with compressed complexity; concept-aligned groups show 2.3x higher peak-token consistency and 5.8x higher activation-pattern similarity than geometric clustering; entity-swap tests reveal layerwise hierarchy with early layers transferring robustly (64% transfer rate) and late layers specializing for output.

Conclusion: Probe prompting enables scalable automated interpretability, revealing a backbone-and-specialization view of transformer computation, with released code and demo for community adoption.

Abstract: Mechanistic interpretability aims to understand neural networks by identifying which learned features mediate specific behaviors. Attribution graphs reveal these feature pathways, but interpreting them requires extensive manual analysis – a single prompt can take approximately 2 hours for an experienced circuit tracer. We present probe prompting, an automated pipeline that transforms attribution graphs into compact, interpretable subgraphs built from concept-aligned supernodes. Starting from a seed prompt and target logit, we select high-influence features, generate concept-targeted yet context-varying probes, and group features by cross-prompt activation signatures into Semantic, Relationship, and Say-X categories using transparent decision rules. Across five prompts including classic “capitals” circuits, probe-prompted subgraphs preserve high explanatory coverage while compressing complexity (Completeness 0.83, mean across circuits; Replacement 0.54). Compared to geometric clustering baselines, concept-aligned groups exhibit higher behavioral coherence: 2.3x higher peak-token consistency (0.425 vs 0.183) and 5.8x higher activation-pattern similarity (0.762 vs 0.130), despite lower geometric compactness. Entity-swap tests reveal a layerwise hierarchy: early-layer features transfer robustly (64% transfer rate, mean layer 6.3), while late-layer Say-X features specialize for output promotion (mean layer 16.4), supporting a backbone-and-specialization view of transformer computation. We release code (https://github.com/peppinob-ol/attribution-graph-probing), an interactive demo (https://huggingface.co/spaces/Peppinob/attribution-graph-probing), and minimal artifacts enabling immediate reproduction and community adoption.

[76] Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs

Yingfeng Luo, Ziqiang Xu, Yuxuan Ouyang, Murun Yang, Dingyang Lin, Kaiyan Chang, Tong Zheng, Bei Li, Peinan Feng, Quan Du, Tong Xiao, Jingbo Zhu

Main category: cs.CL

TL;DR: LMT is a suite of large-scale multilingual translation models covering 60 languages and 234 directions, addressing directional degeneration through strategic downsampling and enhancing cross-lingual transfer with parallel multilingual prompting.

Details

Motivation: To overcome challenges in multilingual machine translation including broad language coverage, consistent translation quality, and English-centric bias by creating Chinese-English centered models.

Method: Proposed strategic downsampling to mitigate directional degeneration from symmetric multi-way fine-tuning, and parallel multilingual prompting using typologically related auxiliary languages for better cross-lingual transfer.

Result: LMT achieves state-of-the-art performance among comparable models, with the 4B model surpassing larger models like Aya-101-13B and NLLB-54B by substantial margins.

Conclusion: LMT provides strong baselines for inclusive, scalable, and high-quality multilingual machine translation through rigorous data curation and refined adaptation strategies.

Abstract: Large language models have significantly advanced Multilingual Machine Translation (MMT), yet the broad language coverage, consistent translation quality, and English-centric bias remain open challenges. To address these challenges, we introduce \textbf{LMT}, a suite of \textbf{L}arge-scale \textbf{M}ultilingual \textbf{T}ranslation models centered on both Chinese and English, covering 60 languages and 234 translation directions. During development, we identify a previously overlooked phenomenon of \textbf{directional degeneration}, where symmetric multi-way fine-tuning data overemphasize reverse directions (X $\to$ En/Zh), leading to excessive many-to-one mappings and degraded translation quality. We propose \textbf{Strategic Downsampling}, a simple yet effective method to mitigate this degeneration. In addition, we design \textbf{Parallel Multilingual Prompting (PMP)}, which leverages typologically related auxiliary languages to enhance cross-lingual transfer. Through rigorous data curation and refined adaptation strategies, LMT achieves SOTA performance among models of comparable language coverage, with our 4B model (LMT-60-4B) surpassing the much larger Aya-101-13B and NLLB-54B models by a substantial margin. We release LMT in four sizes (0.6B/1.7B/4B/8B) to catalyze future research and provide strong baselines for inclusive, scalable, and high-quality MMT \footnote{\href{https://github.com/NiuTrans/LMT}{https://github.com/NiuTrans/LMT}}.

[77] A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation

Siddharth Betala, Kushan Raj, Vipul Betala, Rohan Saswade

Main category: cs.CL

TL;DR: BLEU Monday team’s two-stage approach for English-to-Indic translation: automated error detection/correction using vision-augmented pipeline, followed by LoRA fine-tuning on corrected data, achieving BLEU score improvements across multiple language pairs.

Details

Motivation: To address quality issues in training data for English-to-Indic multimodal translation tasks, particularly handling visually ambiguous translations and mistranslations that affect model performance.

Method: Two-stage approach: 1) Vision-augmented judge-corrector pipeline using multimodal models to classify translations (correct/visually ambiguous/mistranslated) and route errors to specialized correctors (GPT-4o-mini for visual disambiguation, IndicTrans2 for quality issues); 2) LoRA fine-tuning of IndicTrans2 model on both original and corrected datasets.

Result: Corrected 17.1% of captions per language across 28,928 training examples. BLEU score improvements: English-Bengali +1.30 (eval) and +0.70 (challenge), English-Odia +0.60 (eval), English-Hindi +0.10 (challenge).

Conclusion: Automated error detection and correction in training data combined with parameter-efficient fine-tuning consistently improves translation quality across multiple English-to-Indic language pairs.

Abstract: In this paper, we describe our system under the team name BLEU Monday for the English-to-Indic Multimodal Translation Task at WAT 2025. We participate in the text-only translation tasks for English-Hindi, English-Bengali, English-Malayalam, and English-Odia language pairs. We present a two-stage approach that addresses quality issues in the training data through automated error detection and correction, followed by parameter-efficient model fine-tuning. Our methodology introduces a vision-augmented judge-corrector pipeline that leverages multimodal language models to systematically identify and correct translation errors in the training data. The judge component classifies translations into three categories: correct, visually ambiguous (requiring image context), or mistranslated (poor translation quality). Identified errors are routed to specialized correctors: GPT-4o-mini regenerates captions requiring visual disambiguation, while IndicTrans2 retranslates cases with pure translation quality issues. This automated pipeline processes 28,928 training examples across four languages, correcting an average of 17.1% of captions per language. We then apply Low-Rank Adaptation (LoRA) to fine-tune the IndicTrans2 en-indic 200M distilled model on both original and corrected datasets. Training on corrected data yields consistent improvements, with BLEU score gains of +1.30 for English-Bengali on the evaluation set (42.00 -> 43.30) and +0.70 on the challenge set (44.90 -> 45.60), +0.60 for English-Odia on the evaluation set (41.00 -> 41.60), and +0.10 for English-Hindi on the challenge set (53.90 -> 54.00).

[78] Multilingual Lexical Feature Analysis of Spoken Language for Predicting Major Depression Symptom Severity

Anastasiia Tokareva, Judith Dineley, Zoe Firth, Pauline Conde, Faith Matcham, Sara Siddi, Femke Lamers, Ewan Carr, Carolin Oetzmann, Daniel Leightley, Yuezhou Zhang, Amos A. Folarin, Josep Maria Haro, Brenda W. J. H. Penninx, Raquel Bailon, Srinivasan Vairavan, Til Wykes, Richard J. B. Dobson, Vaibhav A. Narayan, Matthew Hotopf, Nicholas Cummins, The RADAR-CNS Consortium

Main category: cs.CL

TL;DR: Lexical features show limited association with MDD symptom severity across languages, with near-chance predictive performance using ML models.

Details

Motivation: To explore interpretable lexical features from spoken language for objective assessment of major depressive disorder symptoms, addressing limitations of previous research in non-clinical samples with complex ML approaches.

Method: Longitudinal analysis of 5,836 speech recordings from 586 participants across UK, Netherlands, and Spain using linear mixed-effects modeling to identify lexical features, and testing four ML regressor models with interpretable features and vector embeddings.

Result: Limited language-specific associations found (7 features in English, 2 in Dutch, none in Spanish), with near-chance predictive performance across all languages using both lexical features and embeddings.

Conclusion: Further research needed with larger multilingual samples, improved protocols, and ML models accounting for individual language variations to understand lexical markers’ clinical value.

Abstract: Background: Captured between clinical appointments using mobile devices, spoken language has potential for objective, more regular assessment of symptom severity and earlier detection of relapse in major depressive disorder. However, research to date has largely been in non-clinical cross-sectional samples of written language using complex machine learning (ML) approaches with limited interpretability. Methods: We describe an initial exploratory analysis of longitudinal speech data and PHQ-8 assessments from 5,836 recordings of 586 participants in the UK, Netherlands, and Spain, collected in the RADAR-MDD study. We sought to identify interpretable lexical features associated with MDD symptom severity with linear mixed-effects modelling. Interpretable features and high-dimensional vector embeddings were also used to test the prediction performance of four regressor ML models. Results: In English data, MDD symptom severity was associated with 7 features including lexical diversity measures and absolutist language. In Dutch, associations were observed with words per sentence and positive word frequency; no associations were observed in recordings collected in Spain. The predictive power of lexical features and vector embeddings was near chance level across all languages. Limitations: Smaller samples in non-English speech and methodological choices, such as the elicitation prompt, may have also limited the effect sizes observable. A lack of NLP tools in languages other than English restricted our feature choice. Conclusion: To understand the value of lexical markers in clinical research and practice, further research is needed in larger samples across several languages using improved protocols, and ML models that account for within- and between-individual variations in language.

[79] Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks

Yauhen Babakhin, Radek Osmulski, Ronay Ak, Gabriel Moreira, Mengyao Xu, Benedikt Schifferer, Bo Liu, Even Oldridge

Main category: cs.CL

TL;DR: llama-embed-nemotron-8b is an open-weights text embedding model achieving SOTA on MMTEB benchmark through novel data mix and detailed ablation studies, supporting instruction-aware capabilities.

Details

Motivation: Address lack of transparency in recent embedding models by developing fully open-source model with disclosed training data and methodologies.

Method: Uses 16.1M query-document pairs (7.7M public + 8.4M synthetic from LLMs), conducts ablation studies on contrastive loss, synthetic data generation, and model merging, implements instruction-aware architecture.

Result: Achieves state-of-the-art performance across retrieval, classification, semantic textual similarity tasks, excels in multilingual scenarios including low-resource languages and cross-lingual setups.

Conclusion: Provides universal text embedding solution combining top-tier performance, broad applicability, and user-driven flexibility through instruction-aware capabilities.

Abstract: We introduce llama-embed-nemotron-8b, an open-weights text embedding model that achieves state-of-the-art performance on the Multilingual Massive Text Embedding Benchmark (MMTEB) leaderboard as of October 21, 2025. While recent models show strong performance, their training data or methodologies are often not fully disclosed. We aim to address this by developing a fully open-source model, publicly releasing its weights and detailed ablation studies, and planning to share the curated training datasets. Our model demonstrates superior performance across all major embedding tasks – including retrieval, classification and semantic textual similarity (STS) – and excels in challenging multilingual scenarios, such as low-resource languages and cross-lingual setups. This state-of-the-art performance is driven by a novel data mix of 16.1 million query-document pairs, split between 7.7 million samples from public datasets and 8.4 million synthetically generated examples from various open-weight LLMs. One of our key contributions is a detailed ablation study analyzing core design choices, including a comparison of contrastive loss implementations, an evaluation of synthetic data generation (SDG) strategies, and the impact of model merging. The llama-embed-nemotron-8b is an instruction-aware model, supporting user-defined instructions to enhance performance for specific use-cases. This combination of top-tier performance, broad applicability, and user-driven flexibility enables it to serve as a universal text embedding solution.

[80] Evaluating LLMs for Anxiety, Depression, and Stress Detection Evaluating Large Language Models for Anxiety, Depression, and Stress Detection: Insights into Prompting Strategies and Synthetic Data

Mihael Arcan, David-Paul Niland

Main category: cs.CL

TL;DR: This study compares LLMs, classical ML, and transformer models for detecting anxiety, depression, and stress from clinical interview text, finding transformer models like Distil-RoBERTa and XLNet perform best, with synthetic data helping address class imbalance.

Details

Motivation: Mental health disorders affect over 20% of adults globally, but detecting them from text is challenging due to subtle symptom expression. There's a need for effective automated detection methods.

Method: Used DAIC-WOZ clinical interview dataset, compared LLMs (Llama, GPT) with classical ML and transformers (BERT, XLNet, Distil-RoBERTa), fine-tuned for anxiety/depression/stress classification, applied synthetic data generation for class imbalance.

Result: Distil-RoBERTa achieved highest F1 (0.883) for GAD-2 anxiety detection; XLNet best on PHQ depression tasks (F1 up to 0.891); zero-shot synthetic approach for stress detection reached F1 0.884 and ROC AUC 0.886.

Conclusion: Transformer-based models are effective for mental health detection, synthetic data improves recall and generalization, but careful calibration is needed to prevent precision loss. Combining advanced language models with data augmentation enhances automated mental health assessment.

Abstract: Mental health disorders affect over one-fifth of adults globally, yet detecting such conditions from text remains challenging due to the subtle and varied nature of symptom expression. This study evaluates multiple approaches for mental health detection, comparing Large Language Models (LLMs) such as Llama and GPT with classical machine learning and transformer-based architectures including BERT, XLNet, and Distil-RoBERTa. Using the DAIC-WOZ dataset of clinical interviews, we fine-tuned models for anxiety, depression, and stress classification and applied synthetic data generation to mitigate class imbalance. Results show that Distil-RoBERTa achieved the highest F1 score (0.883) for GAD-2, while XLNet outperformed others on PHQ tasks (F1 up to 0.891). For stress detection, a zero-shot synthetic approach (SD+Zero-Shot-Basic) reached an F1 of 0.884 and ROC AUC of 0.886. Findings demonstrate the effectiveness of transformer-based models and highlight the value of synthetic data in improving recall and generalization. However, careful calibration is required to prevent precision loss. Overall, this work emphasizes the potential of combining advanced language models and data augmentation to enhance automated mental health assessment from text.

[81] When Sufficient is not Enough: Utilizing the Rashomon Effect for Complete Evidence Extraction

Katharina Beckh, Stefan Rüping

Main category: cs.CL

TL;DR: Feature attribution methods need to identify complete evidence, not just minimal sufficient evidence. An ensemble approach improves recall from ~0.60 to ~0.86 on medical data with human-annotated complete evidence.

Details

Motivation: Current feature attribution methods provide only minimal sufficient evidence, which is inadequate for compliance and cataloging applications that require identifying the full set of contributing features (complete evidence).

Method: Case study on medical dataset with human-annotated complete evidence; analysis of individual models and ensemble approaches; examination of recall-precision trade-off, training with evidence, and dynamic ensembles with certainty thresholds.

Result: Individual models recover only subsets of complete evidence (~0.60 recall); ensemble aggregation improves evidence recall to ~0.86; analysis shows trade-offs and benefits of ensemble approaches.

Conclusion: Ensemble methods significantly improve complete evidence recall compared to single models, with important implications for applications requiring full feature attribution in compliance and cataloging contexts.

Abstract: Feature attribution methods typically provide minimal sufficient evidence justifying a model decision. However, in many applications this is inadequate. For compliance and cataloging, the full set of contributing features must be identified - complete evidence. We perform a case study on a medical dataset which contains human-annotated complete evidence. We show that individual models typically recover only subsets of complete evidence and that aggregating evidence from several models improves evidence recall from $\sim$0.60 (single best model) to $\sim$0.86 (ensemble). We analyze the recall-precision trade-off, the role of training with evidence, dynamic ensembles with certainty thresholds, and discuss implications.

[82] Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection

Brage Eilertsen, Røskva Bjørgfinsdóttir, Francielle Vargas, Ali Ramezani-Kebrya

Main category: cs.CL

TL;DR: SRA framework improves hate speech detection by aligning model attention with human rationales, enhancing interpretability and fairness without compromising performance.

Details

Motivation: Address the opacity of deep learning models in hate speech detection systems to enable ethical deployment through better interpretability and fairness.

Method: Integrates supervised attention mechanism into transformers with joint objective combining classification loss and alignment loss between attention weights and human-annotated rationales.

Result: Achieves 2.4x better explainability than baselines, produces more faithful token-level explanations, and maintains competitive fairness across metrics while detecting toxic posts targeting identity groups.

Conclusion: Incorporating human rationales into attention mechanisms enhances interpretability and faithfulness in hate speech classification without compromising fairness.

Abstract: The opaque nature of deep learning models presents significant challenges for the ethical deployment of hate speech detection systems. To address this limitation, we introduce Supervised Rational Attention (SRA), a framework that explicitly aligns model attention with human rationales, improving both interpretability and fairness in hate speech classification. SRA integrates a supervised attention mechanism into transformer-based classifiers, optimizing a joint objective that combines standard classification loss with an alignment loss term that minimizes the discrepancy between attention weights and human-annotated rationales. We evaluated SRA on hate speech benchmarks in English (HateXplain) and Portuguese (HateBRXplain) with rationale annotations. Empirically, SRA achieves 2.4x better explainability compared to current baselines, and produces token-level explanations that are more faithful and human-aligned. In terms of fairness, SRA achieves competitive fairness across all measures, with second-best performance in detecting toxic posts targeting identity groups, while maintaining comparable results on other metrics. These findings demonstrate that incorporating human rationales into attention mechanisms can enhance interpretability and faithfulness without compromising fairness.

[83] Importance-Aware Data Selection for Efficient LLM Instruction Tuning

Tingyu Jiang, Shen Li, Yiyao Song, Lan Zhang, Hualei Zhu, Yuan Zhao, Xiaohang Xu, Kenjiro Taura, Hao Henry Wang

Main category: cs.CL

TL;DR: Proposes Model Instruction Weakness Value (MIWV) metric to select high-quality instruction data for LLM tuning, showing top 1% MIWV-selected data outperforms full dataset training.

Details

Motivation: Need to select instruction data that maximizes LLM performance enhancement rather than just calculating data quality scores, as small amounts of high-quality data can match or exceed full dataset results.

Method: Derive MIWV metric from model response discrepancies using In-Context Learning to identify most beneficial data for instruction tuning.

Result: Selecting only top 1% of data based on MIWV outperforms training on full dataset.

Conclusion: MIWV offers effective data selection method beyond traditional quality scoring, with strong empirical evidence supporting its effectiveness.

Abstract: Instruction tuning plays a critical role in enhancing the performance and efficiency of Large Language Models (LLMs). Its success depends not only on the quality of the instruction data but also on the inherent capabilities of the LLM itself. Some studies suggest that even a small amount of high-quality data can achieve instruction fine-tuning results that are on par with, or even exceed, those from using a full-scale dataset. However, rather than focusing solely on calculating data quality scores to evaluate instruction data, there is a growing need to select high-quality data that maximally enhances the performance of instruction tuning for a given LLM. In this paper, we propose the Model Instruction Weakness Value (MIWV) as a novel metric to quantify the importance of instruction data in enhancing model’s capabilities. The MIWV metric is derived from the discrepancies in the model’s responses when using In-Context Learning (ICL), helping identify the most beneficial data for enhancing instruction tuning performance. Our experimental results demonstrate that selecting only the top 1% of data based on MIWV can outperform training on the full dataset. Furthermore, this approach extends beyond existing research that focuses on data quality scoring for data selection, offering strong empirical evidence supporting the effectiveness of our proposed method.

[84] EmoBang: Detecting Emotion From Bengali Texts

Abdullah Al Maruf, Aditi Golder, Zakaria Masud Jiyad, Abdullah Al Numan, Tarannum Shaila Zaman

Main category: cs.CL

TL;DR: This paper addresses emotion detection in Bengali by introducing a new dataset and two novel models that significantly outperform existing methods, establishing the first comprehensive benchmark for this low-resource language.

Details

Motivation: Bengali is the world's fourth most spoken language but remains underexplored for emotion detection due to lack of standardized datasets, classifying it as a low-resource language with limited performance from existing classical machine learning approaches.

Method: Introduced a new Bengali emotion dataset annotated across eight emotion categories and proposed two models: (i) a hybrid Convolutional Recurrent Neural Network (CRNN) model (EmoBangHybrid) and (ii) an AdaBoost-Bidirectional Encoder Representations from Transformers (BERT) ensemble model (EmoBangEnsemble). Also evaluated baseline models, feature engineering techniques, and assessed zero-shot/few-shot LLMs.

Result: Experimental results show EmoBangHybrid achieved 92.86% accuracy and EmoBangEnsemble achieved 93.69% accuracy, outperforming existing methods and establishing strong baselines for future research.

Conclusion: This work provides the first comprehensive benchmark for Bengali emotion detection, demonstrating that the proposed models significantly advance the state-of-the-art and open new research directions for low-resource language processing.

Abstract: Emotion detection from text seeks to identify an individual’s emotional or mental state - positive, negative, or neutral - based on linguistic cues. While significant progress has been made for English and other high-resource languages, Bengali remains underexplored despite being the world’s fourth most spoken language. The lack of large, standardized datasets classifies Bengali as a low-resource language for emotion detection. Existing studies mainly employ classical machine learning models with traditional feature engineering, yielding limited performance. In this paper, we introduce a new Bengali emotion dataset annotated across eight emotion categories and propose two models for automatic emotion detection: (i) a hybrid Convolutional Recurrent Neural Network (CRNN) model (EmoBangHybrid) and (ii) an AdaBoost-Bidirectional Encoder Representations from Transformers (BERT) ensemble model (EmoBangEnsemble). Additionally, we evaluate six baseline models with five feature engineering techniques and assess zero-shot and few-shot large language models (LLMs) on the dataset. To the best of our knowledge, this is the first comprehensive benchmark for Bengali emotion detection. Experimental results show that EmoBangH and EmoBangE achieve accuracies of 92.86% and 93.69%, respectively, outperforming existing methods and establishing strong baselines for future research.

[85] Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora

Khalil Hennara, Ahmad Bastati, Muhammad Hreden, Mohamed Motasim Hamed, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan

Main category: cs.CL

TL;DR: Created Wasm pipeline to process Common Crawl for Arabic multimodal dataset with markdown output, preserving web structure for both text-only and multimodal pre-training.

Details

Motivation: Address the lack of high-quality Arabic multimodal datasets that preserve document structure, which has limited progress in Arabic LLMs and LMMs compared to other languages.

Method: Developed a pipeline to process Common Crawl dataset, preserving structural integrity of web content while maintaining flexibility for text-only and multimodal pre-training, with markdown output.

Result: Successfully created a new Arabic multimodal dataset with preserved document structure, publicly released a representative dataset dump and processing pipeline for Arabic.

Conclusion: The Wasm pipeline enables creation of high-quality Arabic multimodal datasets that can support better performance in Arabic LLMs and LMMs by preserving document structure similar to successful approaches in other languages.

Abstract: The performance of large language models (LLMs) and large multimodal models (LMMs) depends heavily on the quality and scale of their pre-training datasets. Recent research shows that large multimodal models trained on natural documents where images and text are interleaved outperform those trained only on image-text pairs across a wide range of benchmarks, leveraging advanced pre- trained models to enforce semantic alignment, image-sequence consistency, and textual coherence. For Arabic, however, the lack of high-quality multimodal datasets that preserve document structure has limited progress. In this paper, we present our pipeline Wasm for processing the Common Crawl dataset to create a new Arabic multimodal dataset that uniquely provides markdown output. Unlike existing Arabic corpora that focus solely on text extraction, our approach preserves the structural integrity of web content while maintaining flexibility for both text-only and multimodal pre-training scenarios. We provide a comprehensive comparative analysis of our data processing pipeline against those used for major existing datasets, highlighting the convergences in filtering strategies and justifying our specific design choices. To support future research, we publicly release a representative dataset dump along with the multimodal processing pipeline for Arabic.

[86] More Agents Helps but Adversarial Robustness Gap Persists

Khashayar Alavi, Zhastay Yeltay, Lucie Flek, Akbar Karimi

Main category: cs.CL

TL;DR: Multi-agent LLM collaboration improves mathematical reasoning accuracy but remains vulnerable to adversarial perturbations, especially human-like typos, with diminishing returns beyond 10 agents.

Details

Motivation: To investigate whether multi-agent LLM collaboration provides robustness against adversarial inputs in mathematical question answering, particularly examining different types of perturbations.

Method: Used Agent Forest framework with sampling-and-voting to evaluate 6 open-source models across 4 math benchmarks, testing various agent counts (1-25) against punctuation noise (10-50%) and human-like typos (WikiTypo, R2ATA).

Result: Collaboration improves accuracy with more agents (largest gains from 1-5 agents), but adversarial robustness gap persists - human typos remain the dominant bottleneck with highest attack success rates even with many agents.

Conclusion: While multi-agent collaboration enhances accuracy, it does not eliminate vulnerability to adversarial perturbations, particularly human-like typos, suggesting fundamental robustness limitations in current LLM approaches.

Abstract: When LLM agents work together, they seem to be more powerful than a single LLM in mathematical question answering. However, are they also more robust to adversarial inputs? We investigate this question using adversarially perturbed math questions. These perturbations include punctuation noise with three intensities (10, 30, and 50 percent), plus real-world and human-like typos (WikiTypo, R2ATA). Using a unified sampling-and-voting framework (Agent Forest), we evaluate six open-source models (Qwen3-4B/14B, Llama3.1-8B, Mistral-7B, Gemma3-4B/12B) across four benchmarks (GSM8K, MATH, MMLU-Math, MultiArith), with various numbers of agents n from one to 25 (1, 2, 5, 10, 15, 20, 25). Our findings show that (1) Noise type matters: punctuation noise harm scales with its severity, and the human typos remain the dominant bottleneck, yielding the largest gaps to Clean accuracy and the highest ASR even with a large number of agents. And (2) Collaboration reliably improves accuracy as the number of agents, n, increases, with the largest gains from one to five agents and diminishing returns beyond 10 agents. However, the adversarial robustness gap persists regardless of the agent count.

[87] Think Consistently, Reason Efficiently: Energy-Based Calibration for Implicit Chain-of-Thought

Zhikang Chen, Sen Cui, Deheng Ye, Yu Zhang, Yatao Bian, Tingting Zhu

Main category: cs.CL

TL;DR: EBM-CoT is an Energy-Based Chain-of-Thought framework that refines latent reasoning trajectories using energy-based models to improve consistency and accuracy in LLM reasoning.

Details

Motivation: Traditional CoT prompting suffers from error propagation and limited vocabulary expressiveness, while recent latent reasoning approaches lack explicit consistency mechanisms, leading to divergent reasoning paths.

Method: Proposes an Energy-Based Chain-of-Thought Calibration framework that dynamically adjusts latent thought representations toward lower-energy, high-consistency regions using energy-based models.

Result: Extensive experiments across mathematical, commonsense, and symbolic reasoning benchmarks show significant improvements in reasoning consistency and efficiency.

Conclusion: The EBM-CoT framework effectively enhances multi-step reasoning in LLMs by enforcing consistency among reasoning steps without modifying the base language model.

Abstract: Large Language Models (LLMs) have demonstrated strong reasoning capabilities through \emph{Chain-of-Thought} (CoT) prompting, which enables step-by-step intermediate reasoning. However, explicit CoT methods rely on discrete token-level reasoning processes that are prone to error propagation and limited by vocabulary expressiveness, often resulting in rigid and inconsistent reasoning trajectories. Recent research has explored implicit or continuous reasoning in latent spaces, allowing models to perform internal reasoning before generating explicit output. Although such approaches alleviate some limitations of discrete CoT, they generally lack explicit mechanisms to enforce consistency among reasoning steps, leading to divergent reasoning paths and unstable outcomes. To address this issue, we propose EBM-CoT, an Energy-Based Chain-of-Thought Calibration framework that refines latent thought representations through an energy-based model (EBM). Our method dynamically adjusts latent reasoning trajectories toward lower-energy, high-consistency regions in the embedding space, improving both reasoning accuracy and consistency without modifying the base language model. Extensive experiments across mathematical, commonsense, and symbolic reasoning benchmarks demonstrate that the proposed framework significantly enhances the consistency and efficiency of multi-step reasoning in LLMs.

[88] LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging

Seungeon Lee, Soumi Das, Manish Gupta, Krishna P. Gummadi

Main category: cs.CL

TL;DR: LoRA on the Go (LoGo) is a training-free framework that dynamically selects and merges LoRA adapters at instance level without labeled data or additional training, improving performance on diverse tasks.

Details

Motivation: Conventional LoRA adapters are trained for single tasks, limiting applicability in real-world settings with diverse inputs. Existing multi-adapter approaches require labeled data or additional training, which is expensive at scale.

Method: LoGo extracts signals from a single forward pass through LoRA adapters to identify the most relevant adapters and determine their contributions dynamically, without any training.

Result: Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines by up to 3.6% on some tasks while remaining competitive on others, maintaining inference throughput.

Conclusion: LoGo is an effective and practical training-free framework for dynamic adapter selection and merging that works well across diverse NLP tasks without requiring additional training or labeled data.

Abstract: Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for fine-tuning large language models.However, conventional LoRA adapters are typically trained for a single task, limiting their applicability in real-world settings where inputs may span diverse and unpredictable domains. At inference time, existing approaches combine multiple LoRAs for improving performance on diverse tasks, while usually requiring labeled data or additional task-specific training, which is expensive at scale. In this work, we introduce LoRA on the Go (LoGo), a training-free framework that dynamically selects and merges adapters at the instance level without any additional requirements. LoGo leverages signals extracted from a single forward pass through LoRA adapters, to identify the most relevant adapters and determine their contributions on-the-fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines on some tasks upto a margin of 3.6% while remaining competitive on other tasks and maintaining inference throughput, highlighting its effectiveness and practicality.

[89] TCM-Eval: An Expert-Level Dynamic and Extensible Benchmark for Traditional Chinese Medicine

Zihao Cheng, Yuheng Lu, Huaiqian Ye, Zeming Liu, Minqi Wang, Jingjing Liu, Zihan Li, Wei Fan, Yuanfang Guo, Ruiji Fu, Shifeng She, Gang Wang, Yunhong Wang

Main category: cs.CL

TL;DR: TCM-Eval is the first dynamic benchmark for Traditional Chinese Medicine, enabling development of ZhiMingTang (ZMT) LLM that exceeds human practitioner passing thresholds through Self-Iterative Chain-of-Thought Enhancement.

Details

Motivation: Address the severe limitations of LLMs in Traditional Chinese Medicine due to lack of standardized benchmarks and high-quality training data.

Method: Created TCM-Eval benchmark from national medical exams, built large-scale training corpus, and developed Self-Iterative Chain-of-Thought Enhancement (SI-CoTE) to autonomously enrich QA pairs with validated reasoning chains.

Result: Developed ZhiMingTang (ZMT) LLM that significantly exceeds the passing threshold for human practitioners in TCM.

Conclusion: Established a virtuous cycle of data and model co-evolution, released public leaderboard to foster community engagement and continuous improvement in TCM AI research.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in modern medicine, yet their application in Traditional Chinese Medicine (TCM) remains severely limited by the absence of standardized benchmarks and the scarcity of high-quality training data. To address these challenges, we introduce TCM-Eval, the first dynamic and extensible benchmark for TCM, meticulously curated from national medical licensing examinations and validated by TCM experts. Furthermore, we construct a large-scale training corpus and propose Self-Iterative Chain-of-Thought Enhancement (SI-CoTE) to autonomously enrich question-answer pairs with validated reasoning chains through rejection sampling, establishing a virtuous cycle of data and model co-evolution. Using this enriched training data, we develop ZhiMingTang (ZMT), a state-of-the-art LLM specifically designed for TCM, which significantly exceeds the passing threshold for human practitioners. To encourage future research and development, we release a public leaderboard, fostering community engagement and continuous improvement.

[90] Categorical Emotions or Appraisals - Which Emotion Model Explains Argument Convincingness Better?

Lynn Greschner, Meike Bauer, Sabine Weber, Roman Klinger

Main category: cs.CL

TL;DR: This paper evaluates how appraisal theories (subjective cognitive evaluations) can improve emotion analysis in arguments for predicting convincingness, showing appraisals outperform categorical emotions.

Details

Motivation: Argument convincingness depends not just on structure and speaker credibility, but also on subjective emotional responses influenced by recipients' goals, knowledge, and stance. Appraisal theories provide a link between cognitive assessments and emotions, but their suitability for argument convincingness remains unexplored.

Method: Used the ContArgA corpus annotations to perform zero-shot prompting experiments, evaluating the importance of gold-annotated and predicted emotions/appraisals for subjective convincingness assessment.

Result: Categorical emotion information improves convincingness prediction, but the improvement is more pronounced with appraisals. Appraisals show better performance than basic emotion categories.

Conclusion: This work presents the first systematic comparison between emotion models for convincingness prediction, demonstrating the advantage of appraisals over categorical emotions, providing insights for computational argumentation applications.

Abstract: The convincingness of an argument does not only depend on its structure (logos), the person who makes the argument (ethos), but also on the emotion that it causes in the recipient (pathos). While the overall intensity and categorical values of emotions in arguments have received considerable attention in the research community, we argue that the emotion an argument evokes in a recipient is subjective. It depends on the recipient’s goals, standards, prior knowledge, and stance. Appraisal theories lend themselves as a link between the subjective cognitive assessment of events and emotions. They have been used in event-centric emotion analysis, but their suitability for assessing argument convincingness remains unexplored. In this paper, we evaluate whether appraisal theories are suitable for emotion analysis in arguments by considering subjective cognitive evaluations of the importance and impact of an argument on its receiver. Based on the annotations in the recently published ContArgA corpus, we perform zero-shot prompting experiments to evaluate the importance of gold-annotated and predicted emotions and appraisals for the assessment of the subjective convincingness labels. We find that, while categorical emotion information does improve convincingness prediction, the improvement is more pronounced with appraisals. This work presents the first systematic comparison between emotion models for convincingness prediction, demonstrating the advantage of appraisals, providing insights for theoretical and practical applications in computational argumentation.

[91] AdaRec: Adaptive Recommendation with LLMs via Narrative Profiling and Dual-Channel Reasoning

Meiyun Wang, Charin Polpanumas

Main category: cs.CL

TL;DR: AdaRec is a few-shot in-context learning framework using LLMs for adaptive personalized recommendation, featuring narrative profiling and bivariate reasoning to outperform existing methods.

Details

Motivation: To address the limitations of existing LLM-based recommendation approaches that require manual feature engineering and lack adaptability, by creating a unified framework for personalized recommendation with minimal supervision.

Method: Uses narrative profiling to transform user-item interactions into natural language, employs dual-channel architecture with horizontal behavioral alignment (peer patterns) and vertical causal attribution (preference factors), and supports rapid cross-task adaptation.

Result: Outperforms ML models and LLM baselines by up to 8% in few-shot settings, achieves 19% improvement over expert-crafted profiling in zero-shot scenarios, and lightweight fine-tuning matches fully fine-tuned models’ performance.

Conclusion: AdaRec effectively enables long-tail personalization with minimal interaction data, demonstrates strong generalization across tasks, and eliminates manual feature engineering through semantic representations.

Abstract: We propose AdaRec, a few-shot in-context learning framework that leverages large language models for an adaptive personalized recommendation. AdaRec introduces narrative profiling, transforming user-item interactions into natural language representations to enable unified task handling and enhance human readability. Centered on a bivariate reasoning paradigm, AdaRec employs a dual-channel architecture that integrates horizontal behavioral alignment, discovering peer-driven patterns, with vertical causal attribution, highlighting decisive factors behind user preferences. Unlike existing LLM-based approaches, AdaRec eliminates manual feature engineering through semantic representations and supports rapid cross-task adaptation with minimal supervision. Experiments on real ecommerce datasets demonstrate that AdaRec outperforms both machine learning models and LLM-based baselines by up to eight percent in few-shot settings. In zero-shot scenarios, it achieves up to a nineteen percent improvement over expert-crafted profiling, showing effectiveness for long-tail personalization with minimal interaction data. Furthermore, lightweight fine-tuning on synthetic data generated by AdaRec matches the performance of fully fine-tuned models, highlighting its efficiency and generalization across diverse tasks.

[92] EMODIS: A Benchmark for Context-Dependent Emoji Disambiguation in Large Language Models

Jiacheng Huang, Ning Yu, Xiaoyin Yi

Main category: cs.CL

TL;DR: EMODIS is a new benchmark that evaluates LLMs’ ability to interpret ambiguous emoji expressions in minimal contrastive contexts, revealing significant limitations in contextual disambiguation.

Details

Motivation: LLMs are increasingly used in real-world communication but their capacity to resolve context-dependent ambiguity, particularly with emojis, remains underexplored.

Method: Created EMODIS benchmark with ambiguous sentences containing emojis, two distinct disambiguating contexts leading to different interpretations, and questions requiring contextual reasoning. Evaluated both open-source and API-based LLMs.

Result: Even the strongest models frequently fail to distinguish meanings with subtle contextual cues, showing systematic biases toward dominant interpretations and limited sensitivity to pragmatic contrast.

Conclusion: EMODIS provides a rigorous testbed for assessing contextual disambiguation and highlights the semantic reasoning gap between humans and LLMs.

Abstract: Large language models (LLMs) are increasingly deployed in real-world communication settings, yet their ability to resolve context-dependent ambiguity remains underexplored. In this work, we present EMODIS, a new benchmark for evaluating LLMs’ capacity to interpret ambiguous emoji expressions under minimal but contrastive textual contexts. Each instance in EMODIS comprises an ambiguous sentence containing an emoji, two distinct disambiguating contexts that lead to divergent interpretations, and a specific question that requires contextual reasoning. We evaluate both open-source and API-based LLMs, and find that even the strongest models frequently fail to distinguish meanings when only subtle contextual cues are present. Further analysis reveals systematic biases toward dominant interpretations and limited sensitivity to pragmatic contrast. EMODIS provides a rigorous testbed for assessing contextual disambiguation, and highlights the gap in semantic reasoning between humans and LLMs.

[93] Discourse Graph Guided Document Translation with Large Language Models

Viet-Thanh Pham, Minghan Wang, Hao-Han Liao, Thuy-Trang Vu

Main category: cs.CL

TL;DR: TransGraph improves document translation by using discourse graphs to model inter-chunk relationships, reducing token overhead while enhancing translation quality and terminology consistency.

Details

Motivation: Current methods struggle with long-range dependencies and discourse coherence in full document translation, and agentic systems are computationally expensive and sensitive to memory retrieval.

Method: TransGraph uses discourse graphs to model inter-chunk relationships and selectively conditions translation segments on relevant graph neighborhoods instead of sequential or exhaustive context.

Result: TransGraph outperforms strong baselines on three document-level MT benchmarks across six languages and diverse domains, achieving better translation quality and terminology consistency with lower token overhead.

Conclusion: The discourse-guided framework with structured discourse graphs effectively addresses long-range dependency challenges in document translation while being more efficient than existing approaches.

Abstract: Adapting large language models to full document translation remains challenging due to the difficulty of capturing long-range dependencies and preserving discourse coherence throughout extended texts. While recent agentic machine translation systems mitigate context window constraints through multi-agent orchestration and persistent memory, they require substantial computational resources and are sensitive to memory retrieval strategies. We introduce TransGraph, a discourse-guided framework that explicitly models inter-chunk relationships through structured discourse graphs and selectively conditions each translation segment on relevant graph neighbourhoods rather than relying on sequential or exhaustive context. Across three document-level MT benchmarks spanning six languages and diverse domains, TransGraph consistently surpasses strong baselines in translation quality and terminology consistency while incurring significantly lower token overhead.

[94] Who Is the Story About? Protagonist Entity Recognition in News

Jorge Gabín, M. Eduardo Ares, Javier Parapar

Main category: cs.CL

TL;DR: Introduces Protagonist Entity Recognition (PER) to identify key organizations that drive news narratives, showing LLMs can approximate human judgments of narrative importance at scale.

Details

Motivation: Traditional NER treats all entity mentions equally, failing to distinguish which organizations actually drive news narratives, limiting downstream tasks that require understanding event salience and influence.

Method: Compare LLM predictions against expert annotations, establish inter-annotator consistency, use NER-guided prompting for automatic labeling of large news collections, and evaluate LLMs’ ability to infer protagonists with reduced context.

Result: PER is feasible and meaningful, with guided LLMs able to approximate human judgments of narrative importance at scale, demonstrating both inter-annotator consistency and human-LLM agreement.

Conclusion: PER represents a valuable extension to narrative-centered information extraction, enabling scalable identification of key narrative-driving entities in news content.

Abstract: News articles often reference numerous organizations, but traditional Named Entity Recognition (NER) treats all mentions equally, obscuring which entities genuinely drive the narrative. This limits downstream tasks that rely on understanding event salience, influence, or narrative focus. We introduce Protagonist Entity Recognition (PER), a task that identifies the organizations that anchor a news story and shape its main developments. To validate PER, we compare he predictions of Large Language Models (LLMs) against annotations from four expert annotators over a gold corpus, establishing both inter-annotator consistency and human-LLM agreement. Leveraging these findings, we use state-of-the-art LLMs to automatically label large-scale news collections through NER-guided prompting, generating scalable, high-quality supervision. We then evaluate whether other LLMs, given reduced context and without explicit candidate guidance, can still infer the correct protagonists. Our results demonstrate that PER is a feasible and meaningful extension to narrative-centered information extraction, and that guided LLMs can approximate human judgments of narrative importance at scale.

[95] Retriv at BLP-2025 Task 1: A Transformer Ensemble and Multi-Task Learning Approach for Bangla Hate Speech Identification

Sourav Saha, K M Nafi Asib, Mohammed Moshiul Hoque

Main category: cs.CL

TL;DR: The paper presents transformer ensemble methods for Bangla hate speech identification, achieving competitive results in a shared task with micro-f1 scores around 72.6-72.7% across three subtasks.

Details

Motivation: To address the socially impactful but linguistically challenging problem of Bangla hate speech identification in low-resource contexts, as part of a shared task competition.

Method: Used soft-voting ensemble of transformers (BanglaBERT, MuRIL, IndicBERTv2) for subtasks 1A and 1B, and weighted voting ensemble of multitask variants for subtask 1C.

Result: Achieved micro-f1 scores of 72.75% (1A), 72.69% (1B), and 72.62% (1C), ranking 9th, 10th, and 7th respectively on the shared task leaderboard.

Conclusion: Transformer ensembles and weighted multitask frameworks show promise for advancing Bangla hate speech detection, with experimental scripts made publicly available.

Abstract: This paper addresses the problem of Bangla hate speech identification, a socially impactful yet linguistically challenging task. As part of the “Bangla Multi-task Hate Speech Identification” shared task at the BLP Workshop, IJCNLP-AACL 2025, our team “Retriv” participated in all three subtasks: (1A) hate type classification, (1B) target group identification, and (1C) joint detection of type, severity, and target. For subtasks 1A and 1B, we employed a soft-voting ensemble of transformer models (BanglaBERT, MuRIL, IndicBERTv2). For subtask 1C, we trained three multitask variants and aggregated their predictions through a weighted voting ensemble. Our systems achieved micro-f1 scores of 72.75% (1A) and 72.69% (1B), and a weighted micro-f1 score of 72.62% (1C). On the shared task leaderboard, these corresponded to 9th, 10th, and 7th positions, respectively. These results highlight the promise of transformer ensembles and weighted multitask frameworks for advancing Bangla hate speech detection in low-resource contexts. We made experimental scripts publicly available for the community.

[96] ACE-ICD: Acronym Expansion As Data Augmentation For Automated ICD Coding

Tuan-Dung Le, Shohreh Haddadan, Thanh Q. Thieu

Main category: cs.CL

TL;DR: ACE-ICD: A novel data augmentation method using LLMs to expand medical acronyms in clinical notes, combined with consistency training, achieving SOTA performance on MIMIC-III ICD coding.

Details

Motivation: Existing ICD coding methods overlook the pervasive use of medical acronyms in clinical notes, which are crucial for accurate code inference but often not properly handled.

Method: Propose data augmentation using LLMs to expand medical acronyms to full forms, and incorporate consistency training to enforce prediction agreement between original and augmented documents.

Result: Extensive experiments on MIMIC-III show ACE-ICD establishes new state-of-the-art performance across common codes, rare codes, and full-code assignments.

Conclusion: The approach effectively addresses the medical acronym challenge in ICD coding and demonstrates significant performance improvements across various coding scenarios.

Abstract: Automatic ICD coding, the task of assigning disease and procedure codes to electronic medical records, is crucial for clinical documentation and billing. While existing methods primarily enhance model understanding of code hierarchies and synonyms, they often overlook the pervasive use of medical acronyms in clinical notes, a key factor in ICD code inference. To address this gap, we propose a novel effective data augmentation technique that leverages large language models to expand medical acronyms, allowing models to be trained on their full form representations. Moreover, we incorporate consistency training to regularize predictions by enforcing agreement between the original and augmented documents. Extensive experiments on the MIMIC-III dataset demonstrate that our approach, ACE-ICD establishes new state-of-the-art performance across multiple settings, including common codes, rare codes, and full-code assignments. Our code is publicly available.

[97] RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments

Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, Chenyang Zhao, Yulia Tsvetkov, Simon Shaolei Du, Natasha Jaques, Hao Peng, Pang Wei Koh, Hannaneh Hajishirzi

Main category: cs.CL

TL;DR: RLVE introduces adaptive verifiable environments that dynamically adjust problem difficulty to scale up RL for language models, achieving significant improvements in reasoning capabilities through environment scaling.

Details

Motivation: Static data distributions in RL often lead to vanishing learning signals when problems are too easy or hard for the policy model, limiting effective training of language models.

Method: Developed RLVE-Gym with 400 manually engineered verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, with difficulty adapting to model capabilities during training.

Result: Joint training across all 400 environments yielded 3.37% absolute average improvement across six reasoning benchmarks for a 1.5B LM, outperforming original RL training (0.49% gain) despite using 3x less compute.

Conclusion: Environment scaling through adaptive verifiable environments consistently improves generalizable reasoning capabilities in language models, demonstrating the effectiveness of RLVE approach.

Abstract: We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model’s capabilities as training progresses. In contrast, static data distributions often lead to vanishing learning signals when problems are either too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a large-scale suite of 400 verifiable environments carefully developed through manual environment engineering. Using RLVE-Gym, we show that environment scaling, i.e., expanding the collection of training environments, consistently improves generalizable reasoning capabilities. RLVE with joint training across all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement across six reasoning benchmarks, starting from one of the strongest 1.5B reasoning LMs. By comparison, continuing this LM’s original RL training yields only a 0.49% average absolute gain despite using over 3x more compute. We release our code publicly.

[98] When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs

Shaowen Wang, Yiqi Dong, Ruinian Chang, Tansheng Zhu, Yuebo Sun, Kaifeng Lyu, Jian Li

Main category: cs.CL

TL;DR: LLMs exhibit hallucinations driven by spurious correlations in training data, which evade current detection methods and persist despite model scaling and refusal fine-tuning.

Details

Motivation: To highlight and investigate a previously underexplored class of hallucinations caused by spurious correlations between features and attributes in training data.

Method: Systematically controlled synthetic experiments and empirical evaluations on state-of-the-art open-source and proprietary LLMs, including theoretical analysis of why confidence-based detection fails.

Result: Existing hallucination detection methods (confidence-based filtering, inner-state probing) fundamentally fail against spurious correlation-induced hallucinations, which are confidently generated and immune to model scaling.

Conclusion: There is an urgent need for new approaches specifically designed to address hallucinations caused by spurious correlations in LLMs.

Abstract: Despite substantial advances, large language models (LLMs) continue to exhibit hallucinations, generating plausible yet incorrect responses. In this paper, we highlight a critical yet previously underexplored class of hallucinations driven by spurious correlations – superficial but statistically prominent associations between features (e.g., surnames) and attributes (e.g., nationality) present in the training data. We demonstrate that these spurious correlations induce hallucinations that are confidently generated, immune to model scaling, evade current detection methods, and persist even after refusal fine-tuning. Through systematically controlled synthetic experiments and empirical evaluations on state-of-the-art open-source and proprietary LLMs (including GPT-5), we show that existing hallucination detection methods, such as confidence-based filtering and inner-state probing, fundamentally fail in the presence of spurious correlations. Our theoretical analysis further elucidates why these statistical biases intrinsically undermine confidence-based detection techniques. Our findings thus emphasize the urgent need for new approaches explicitly designed to address hallucinations caused by spurious correlations.

[99] FinRpt: Dataset, Evaluation System and LLM-based Multi-agent Framework for Equity Research Report Generation

Song Jin, Shuqi Li, Shukun Zhang, Rui Yan

Main category: cs.CL

TL;DR: This paper introduces the first Equity Research Report (ERR) Generation task, creates the FinRpt benchmark with automated dataset construction and 11 evaluation metrics, and proposes the FinRpt-Gen multi-agent framework that achieves strong performance.

Details

Motivation: While LLMs have shown success in financial tasks, fully automating Equity Research Report generation remains unexplored territory, with challenges in data scarcity and lack of evaluation metrics.

Method: Proposed a Dataset Construction Pipeline integrating 7 financial data types to automatically generate high-quality ERR datasets, and developed FinRpt-Gen - a multi-agent framework trained using Supervised Fine-Tuning and Reinforcement Learning.

Result: Experimental results demonstrate the high quality of the FinRpt benchmark data, effectiveness of the 11 evaluation metrics, and strong performance of the FinRpt-Gen framework in ERR generation.

Conclusion: The FinRpt benchmark and FinRpt-Gen framework show significant potential to drive innovation in automated Equity Research Report generation, with all code and datasets made publicly available.

Abstract: While LLMs have shown great success in financial tasks like stock prediction and question answering, their application in fully automating Equity Research Report generation remains uncharted territory. In this paper, we formulate the Equity Research Report (ERR) Generation task for the first time. To address the data scarcity and the evaluation metrics absence, we present an open-source evaluation benchmark for ERR generation - FinRpt. We frame a Dataset Construction Pipeline that integrates 7 financial data types and produces a high-quality ERR dataset automatically, which could be used for model training and evaluation. We also introduce a comprehensive evaluation system including 11 metrics to assess the generated ERRs. Moreover, we propose a multi-agent framework specifically tailored to address this task, named FinRpt-Gen, and train several LLM-based agents on the proposed datasets using Supervised Fine-Tuning and Reinforcement Learning. Experimental results indicate the data quality and metrics effectiveness of the benchmark FinRpt and the strong performance of FinRpt-Gen, showcasing their potential to drive innovation in the ERR generation field. All code and datasets are publicly available.

[100] Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains

Pingjie Wang, Hongcheng Liu, Yusheng Liao, Ziqing Fan, Yaxin Du, Shuo Tang, Yanfeng Wang, Yu Wang

Main category: cs.CL

TL;DR: NTK-Selector is a framework that uses neural tangent kernels to select valuable general-domain auxiliary data for enhancing LLM performance in low-resource domains, achieving significant improvements over domain-only fine-tuning.

Details

Motivation: LLMs struggle in low-resource domains due to data scarcity and overfitting risks, while abundant general-domain data could serve as auxiliary supervision if properly selected.

Method: Proposed NTK-Selector framework that addresses NTK application challenges in LLMs through empirical demonstration of NTK-like behavior during LoRA fine-tuning and Jacobian-free approximation to reduce computational costs.

Result: Across four domains (medical, financial, legal, psychological), NTK-Selector with 9,000 auxiliary samples achieved gains of +8.7 and +5.1 points for Llama3-8B and Qwen3-8B respectively, representing 10.9x and 5.7x improvements over domain-only fine-tuning.

Conclusion: NTK-Selector effectively selects valuable auxiliary data to significantly enhance domain-specific LLM performance in low-resource settings, overcoming traditional method limitations.

Abstract: Large language models (LLMs) have achieved remarkable success across widespread tasks, yet their application in low-resource domains remains a significant challenge due to data scarcity and the high risk of overfitting. While in-domain data is limited, there exist vast amounts of similar general-domain data, and our initial findings reveal that they could potentially serve as auxiliary supervision for domain enhancement. This observation leads us to our central research question: \textbf{\textit{how to effectively select the most valuable auxiliary data to maximize domain-specific performance}}, particularly when traditional methods are inapplicable due to a lack of large in-domain data pools or validation sets. To address this, we propose \textbf{NTK-Selector}, a principled and efficient framework for selecting general-domain auxiliary data to enhance domain-specific performance via neural tangent kernels (NTK). Our method tackles two challenges of directly applying NTK to LLMs, theoretical assumptions and prohibitive computational cost, by empirically demonstrating a stable NTK-like behavior in LLMs during LoRA fine-tuning and proposing a Jacobian-free approximation method. Extensive experiments across four low-resource domains (medical, financial, legal, and psychological) demonstrate that NTK-Selector consistently improves downstream performance. Specifically, fine-tuning on 1,000 in-domain samples alone only yielded +0.8 points for Llama3-8B-Instruct and +0.9 points for Qwen3-8B. In contrast, enriching with 9,000 auxiliary samples selected by NTK-Selector led to substantial \textbf{gains of +8.7 and +5.1 points}, which corresponds to a \textbf{10.9x and 5.7x improvement} over the domain-only setting.

[101] Retriv at BLP-2025 Task 2: Test-Driven Feedback-Guided Framework for Bangla-to-Python Code Generation

K M Nafi Asib, Sourav Saha, Mohammed Moshiul Hoque

Main category: cs.CL

TL;DR: A test-driven iterative refinement method using fine-tuned Qwen2.5-14B achieved 2nd place in Bangla code generation shared task with 0.934 Pass@1 score.

Details

Motivation: Address the underrepresentation of low-resource languages like Bangla in code generation due to limited datasets and evaluation benchmarks.

Method: Combines instruction prompting with test-driven, feedback-guided iterative refinement using fine-tuned Qwen2.5-14B model, with three evaluation passes using test feedback.

Result: Achieved 2nd place in BLP Workshop shared task with Pass@1 score of 0.934.

Conclusion: Highlights challenges in Bangla instruction understanding and Python code generation, emphasizing need for targeted methods in low-resource languages.

Abstract: Large Language Models (LLMs) have advanced the automated generation of code from natural language prompts. However, low-resource languages (LRLs) like Bangla remain underrepresented due to the limited availability of instruction-to-code datasets and evaluation benchmarks. To address this, the BLP Workshop at IJCNLP-AACL 2025 introduced a shared task on “Code Generation in Bangla”. In this work, we propose a method that combines instruction prompting with a test-driven, feedback-guided iterative refinement process using a fine-tuned Qwen2.5-14B model. The model generates code from Bangla instructions, tests it against unit tests, and iteratively refines any failing outputs through three evaluation passes, using test feedback to guide each step. This approach helped our team “Retriv” to secure 2nd place in the shared task with a Pass@1 score of 0.934. The analysis highlights challenges in Bangla instruction understanding and Python code generation, emphasizing the need for targeted methods in LRLs. We made experimental scripts publicly available for the community.

[102] Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence

Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, Micah Goldblum

Main category: cs.CL

TL;DR: Converting pretrained non-recurrent language models to depth-recurrent models using curriculum training preserves performance while reducing computational costs, outperforming standard post-training on mathematical tasks.

Details

Motivation: To leverage the benefits of depth-recurrent language models that decouple train-time compute from test-time compute, and to efficiently convert existing pretrained models rather than training from scratch.

Method: Using a curriculum of recurrences to gradually increase the effective depth of pretrained non-recurrent models during training, converting them to depth-recurrent architectures.

Result: The converted recurrent models achieve better performance at the same compute budget compared to simply post-training the original non-recurrent models, particularly on mathematical tasks.

Conclusion: Curriculum-based conversion of pretrained models to depth-recurrent architectures is an effective strategy for computational efficiency while maintaining or improving performance.

Abstract: Recent advances in depth-recurrent language models show that recurrence can decouple train-time compute and parameter count from test-time compute. In this work, we study how to convert existing pretrained non-recurrent language models into depth-recurrent models. We find that using a curriculum of recurrences to increase the effective depth of the model over the course of training preserves performance while reducing total computational cost. In our experiments, on mathematics, we observe that converting pretrained models to recurrent ones results in better performance at a given compute budget than simply post-training the original non-recurrent language model.

[103] Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction

Hyeryun Park, Byung Mo Gu, Jun Hee Lee, Byeong Hyeon Choi, Sekeun Kim, Hyun Koo Kim, Kyungsang Kim

Main category: cs.CL

TL;DR: Voice-directed Surgical Agent Orchestrator Platform (SAOP) using LLM-based agents to enable hands-free multimodal data access during da Vinci robotic surgery.

Details

Motivation: Surgeons' hands and eyes are fully engaged during da Vinci robotic surgery, making it difficult to access patient data without interruption.

Method: Hierarchical multi-agent framework with orchestration agent and three task-specific LLM-driven agents that autonomously plan, refine, validate, and reason to map voice commands into specific tasks like retrieving clinical information or manipulating CT scans.

Result: SAOP achieves high accuracy and success rates across 240 voice commands, with LLM-based agents improving robustness against speech recognition errors and handling diverse/ambiguous free-form commands.

Conclusion: The platform demonstrates strong potential to support minimally invasive da Vinci robotic surgery by enabling hands-free multimodal data access.

Abstract: In da Vinci robotic surgery, surgeons’ hands and eyes are fully engaged in the procedure, making it difficult to access and manipulate multimodal patient data without interruption. We propose a voice-directed Surgical Agent Orchestrator Platform (SAOP) built on a hierarchical multi-agent framework, consisting of an orchestration agent and three task-specific agents driven by Large Language Models (LLMs). These LLM-based agents autonomously plan, refine, validate, and reason to map voice commands into specific tasks such as retrieving clinical information, manipulating CT scans, or navigating 3D anatomical models on the surgical video. We also introduce a Multi-level Orchestration Evaluation Metric (MOEM) to comprehensively assess the performance and robustness from command-level and category-level perspectives. The SAOP achieves high accuracy and success rates across 240 voice commands, while LLM-based agents improve robustness against speech recognition errors and diverse or ambiguous free-form commands, demonstrating strong potential to support minimally invasive da Vinci robotic surgery.

[104] ConvFill: Model Collaboration for Responsive Conversational Voice Agents

Vidya Srinivas, Zachary Englhardt, Maximus Powers, Shwetak Patel, Vikram Iyer

Main category: cs.CL

TL;DR: Proposes conversational infill using ConvFill model to combine fast on-device responses with backend model knowledge, achieving 36-42% accuracy improvements over standalone small models while maintaining sub-200ms latency.

Details

Motivation: Address the latency vs. capability trade-off in conversational voice agents: cloud models have high latency but deep reasoning, while on-device models are fast but lack sophistication.

Method: Introduce conversational infill task where lightweight on-device model generates dialogue while incorporating streaming knowledge from backend model. Train ConvFill (360M parameter model) on synthetic multi-domain conversations.

Result: ConvFill achieves 36-42% accuracy improvements over standalone small models of same size while consistently maintaining sub-200ms response latencies across multiple backend models.

Conclusion: Conversational infill enables building on-device conversational agents that are both immediately responsive and knowledgeable by decoupling response latency from model capability.

Abstract: Deploying conversational voice agents with large language models faces a critical challenge: cloud-based foundation models provide deep reasoning and domain knowledge but introduce latency that disrupts natural conversation, while on-device models respond immediately but lack sophistication. We propose conversational infill, a task where a lightweight on-device model generates contextually appropriate dialogue while seamlessly incorporating streaming knowledge from a powerful backend model. This approach decouples response latency from model capability, enabling systems that feel responsive while accessing the full power of large-scale models. We present ConvFill, a 360M parameter model trained on synthetic multi-domain conversations. Evaluation across multiple backend models shows that conversational infill can be successfully learned, with ConvFill achieving accuracy improvements of 36-42% over standalone small models of the same size while consistently retaining sub-200ms response latencies. Our results demonstrate the promise of this approach for building on-device conversational agents that are both immediately responsive and knowledgeable.

[105] SPOT: An Annotated French Corpus and Benchmark for Detecting Critical Interventions in Online Conversations

Manon Berriche, Célia Nouri, Chloé Clavel, Jean-Philippe Cointet

Main category: cs.CL

TL;DR: SPOT introduces the first annotated corpus for detecting stopping points in online discussions - subtle interventions that pause or redirect conversations, operationalized as a binary classification task with 43,305 French Facebook comments.

Details

Motivation: To translate the sociological concept of stopping points into a reproducible NLP task, addressing gaps in existing frameworks like counterspeech that overlook subtle interventions like irony, doubt, or fragmentary arguments.

Method: Created annotated corpus with reliable guidelines, benchmarked fine-tuned CamemBERT encoder models and instruction-tuned LLMs with various prompting strategies, incorporating contextual metadata (article, post, parent comment, page/group, source).

Result: Fine-tuned encoders outperformed prompted LLMs by >10 percentage points in F1 score. Contextual metadata improved encoder F1 scores from 0.75 to 0.78.

Conclusion: Supervised learning is crucial for emerging non-English social media tasks. The released dataset, guidelines, and code promote transparency and reproducible research in detecting subtle conversational interventions.

Abstract: We introduce SPOT (Stopping Points in Online Threads), the first annotated corpus translating the sociological concept of stopping point into a reproducible NLP task. Stopping points are ordinary critical interventions that pause or redirect online discussions through a range of forms (irony, subtle doubt or fragmentary arguments) that frameworks like counterspeech or social correction often overlook. We operationalize this concept as a binary classification task and provide reliable annotation guidelines. The corpus contains 43,305 manually annotated French Facebook comments linked to URLs flagged as false information by social media users, enriched with contextual metadata (article, post, parent comment, page or group, and source). We benchmark fine-tuned encoder models (CamemBERT) and instruction-tuned LLMs under various prompting strategies. Results show that fine-tuned encoders outperform prompted LLMs in F1 score by more than 10 percentage points, confirming the importance of supervised learning for emerging non-English social media tasks. Incorporating contextual metadata further improves encoder models F1 scores from 0.75 to 0.78. We release the anonymized dataset, along with the annotation guidelines and code in our code repository, to foster transparency and reproducible research.

[106] DiLA: Enhancing LLM Tool Learning with Differential Logic Layer

Yu Zhang, Hui-Ling Zhen, Zehua Pei, Yingzhao Lian, Lihao Yin, Mingxuan Yuan, Bei Yu

Main category: cs.CL

TL;DR: DiLA integrates logical constraints into neural networks via a differential logic layer, enabling LLMs to solve complex constraint satisfaction problems like SAT and Graph Coloring by transforming language to logic and refining solutions.

Details

Motivation: LLMs struggle with logical reasoning and planning, especially for classical constraint satisfaction problems with intricate expressions and exponential search spaces that challenge off-the-shelf solvers.

Method: Proposes DiLA with a differential logic layer that integrates logical constraints into network forward/backward passes. LLM transforms language to logic constraints and finds initial solutions, while the logic layer iteratively refines them.

Result: DiLA consistently outperforms existing prompt-based and solver-aided approaches on two classic reasoning problems, enhancing LLMs’ logical reasoning ability while guaranteeing solution efficiency and correctness.

Conclusion: The differential logic layer effectively bridges LLMs and logical reasoning, providing a viable alternative for LLM tool learning that improves performance on constraint satisfaction problems encoded by Boolean variables.

Abstract: Considering the challenges faced by large language models (LLMs) in logical reasoning and planning, prior efforts have sought to augment LLMs with access to external solvers. While progress has been made on simple reasoning problems, solving classical constraint satisfaction problems, such as the Boolean Satisfiability Problem (SAT) and Graph Coloring Problem (GCP), remains difficult for off-the-shelf solvers due to their intricate expressions and exponential search spaces. In this paper, we propose a novel differential logic layer-aided language modeling (DiLA) approach, where logical constraints are integrated into the forward and backward passes of a network layer, to provide another option for LLM tool learning. In DiLA, LLM aims to transform the language description to logic constraints and identify initial solutions of the highest quality, while the differential logic layer focuses on iteratively refining the LLM-prompted solution. Leveraging the logic layer as a bridge, DiLA enhances the logical reasoning ability of LLMs on a range of reasoning problems encoded by Boolean variables, guaranteeing the efficiency and correctness of the solution process. We evaluate the performance of DiLA on two classic reasoning problems and empirically demonstrate its consistent outperformance against existing prompt-based and solver-aided approaches.

[107] Likelihood-based Mitigation of Evaluation Bias in Large Language Models

Masanari Oi, Masahiro Kaneko, Ryuto Koike, Mengsay Loem, Naoaki Okazaki

Main category: cs.CL

TL;DR: LLM-based evaluators exhibit likelihood bias where they overrate sentences with higher likelihoods. The paper proposes using highly biased instances as few-shot examples to mitigate this bias, improving evaluation performance.

Details

Motivation: LLMs used as automated evaluators may have likelihood bias - overrating sentences with higher likelihoods and underrating those with lower likelihoods due to superficial differences in sentences.

Method: Proposed method uses highly biased instances as few-shot examples for in-context learning to mitigate likelihood bias in LLM-based evaluators.

Result: Experiments in data-to-text and grammatical error correction tasks show several LLMs display likelihood bias, and the proposed method successfully mitigates this bias while significantly improving evaluation performance (correlation with human scores).

Conclusion: The proposed method effectively mitigates likelihood bias in LLM-based evaluators and enhances their evaluation performance across different tasks.

Abstract: Large Language Models (LLMs) are widely used to evaluate natural language generation tasks as automated metrics. However, the likelihood, a measure of LLM’s plausibility for a sentence, can vary due to superficial differences in sentences, such as word order and sentence structure. It is therefore possible that there might be a likelihood bias if LLMs are used for evaluation: they might overrate sentences with higher likelihoods while underrating those with lower likelihoods. In this paper, we investigate the presence and impact of likelihood bias in LLM-based evaluators. We also propose a method to mitigate the likelihood bias. Our method utilizes highly biased instances as few-shot examples for in-context learning. Our experiments in evaluating the data-to-text and grammatical error correction tasks reveal that several LLMs we test display a likelihood bias. Furthermore, our proposed method successfully mitigates this bias, also improving evaluation performance (in terms of correlation of models with human scores) significantly.

[108] Quriosity: Analyzing Human Questioning Behavior and Causal Inquiry through Curiosity-Driven Queries

Roberto Ceraolo, Dmitrii Kharlapenko, Ahmad Khan, Amélie Reymond, Punya Syon Pandey, Rada Mihalcea, Bernhard Schölkopf, Mrinmaya Sachan, Zhijing Jin

Main category: cs.CL

TL;DR: Quriosity is a dataset of 13.5K naturally occurring curiosity-driven questions from search engines, human interactions, and LLM conversations, with focus on causal questions and their analysis.

Details

Motivation: To understand curiosity-driven human questions that are complex, open-ended, and reflect real-world needs, as LLMs shift from testing to practical use for unknown answers.

Method: Collected 13.5K questions from three sources: human-to-search-engine queries, human-to-human interactions, and human-to-LLM conversations. Developed iterative prompt improvement framework to identify causal queries.

Result: Found significant presence of causal questions (up to 42%) in the dataset. Analyzed their linguistic properties, cognitive complexity and source distribution across different domains and contexts.

Conclusion: The paper paves the way for future work on causal question identification and open-ended chatbot interactions, providing a comprehensive dataset for understanding human curiosity.

Abstract: Recent progress in Large Language Model (LLM) technology has changed our role in interacting with these models. Instead of primarily testing these models with questions we already know answers to, we are now using them for queries where the answers are unknown to us, driven by human curiosity. This shift highlights the growing need to understand curiosity-driven human questions - those that are more complex, open-ended, and reflective of real-world needs. To this end, we present Quriosity, a collection of 13.5K naturally occurring questions from three diverse sources: human-to-search-engine queries, human-to-human interactions, and human-to-LLM conversations. Our comprehensive collection enables a rich understanding of human curiosity across various domains and contexts. Our analysis reveals a significant presence of causal questions (up to 42%) in the dataset, for which we develop an iterative prompt improvement framework to identify all causal queries and examine their unique linguistic properties, cognitive complexity and source distribution. Our paper paves the way for future work on causal question identification and open-ended chatbot interactions. Our code and data are at https://github.com/roberto-ceraolo/quriosity.

[109] Retrieval-Augmented Feature Generation for Domain-Specific Classification

Xinhao Zhang, Jinghan Zhang, Fengran Mo, Dakshak Keerthi Chandra, Yu-Zhong Chen, Fei Xie, Kunpeng Liu

Main category: cs.CL

TL;DR: RAFG is a retrieval-augmented feature generation method that uses knowledge retrieval and LLMs to create interpretable features for domain classification tasks, improving performance across multiple domains.

Details

Motivation: Feature generation enhances learning with limited data, but creating interpretable features typically requires domain expertise. The paper aims to automate this process while maintaining interpretability.

Method: RAFG uses knowledge retrieval among existing features to identify associations, then employs LLMs for feature generation with reasoning to verify feature quality during the generation process.

Result: Experiments on medical, economic, and geographic datasets show RAFG produces high-quality, meaningful features and significantly improves classification performance compared to baseline methods.

Conclusion: RAFG successfully generates useful and explainable features for domain classification tasks without requiring extensive domain knowledge, demonstrating effectiveness across multiple domains.

Abstract: Feature generation can significantly enhance learning outcomes, particularly for tasks with limited data. An effective way to improve feature generation is to expand the current feature space using existing features and enriching the informational content. However, generating new, interpretable features usually requires domain-specific knowledge on top of the existing features. In this paper, we introduce a Retrieval-Augmented Feature Generation method, RAFG, to generate useful and explainable features specific to domain classification tasks. To increase the interpretability of the generated features, we conduct knowledge retrieval among the existing features in the domain to identify potential feature associations. These associations are expected to help generate useful features. Moreover, we develop a framework based on large language models (LLMs) for feature generation with reasoning to verify the quality of the features during their generation process. Experiments across several datasets in medical, economic, and geographic domains show that our RAFG method can produce high-quality, meaningful features and significantly improve classification performance compared with baseline methods.

[110] FedCoT: Federated Chain-of-Thought Distillation for Large Language Models

Tao Fan, Weijing Chen, Yan Kang, Guoqiang Ma, Hanlin Gu, Yuanfeng Song, Lixin Fan, Qiang Yang

Main category: cs.CL

TL;DR: FedCoT is a federated framework for Chain-of-Thought knowledge distillation from LLMs to SLMs while preserving data privacy through perturbed prompts and privacy protection strategies.

Details

Motivation: Address the challenges of deploying LLMs in resource-constrained environments and protecting user data privacy, while overcoming the performance limitations of SLMs.

Method: Uses federated learning with perturbed prompts and rationales generated through Chain-of-Thought approach, implementing two privacy protection strategies: Exponential Mechanism Strategy and Adaptive Exponential Mechanism Strategy.

Result: Empirical evaluation shows FedCoT effectively trains task-specific SLMs with enhanced performance while prioritizing data privacy protection across various text generation tasks.

Conclusion: FedCoT enables secure and efficient knowledge transfer from LLMs to SLMs in privacy-preserving settings, with code contributed to the FATE open-source project.

Abstract: Large Language Models (LLMs) have emerged as a transformative force in artificial intelligence, demonstrating exceptional proficiency across various tasks. However, their deployment in resource-constrained environments and concerns over user data privacy pose significant challenges. In contrast, Small Language Models (SLMs) offer computational efficiency but often lag in performance. To address these issues, we propose FedCoT, a federated framework designed for the Chain-of-Thought (CoT) distillation of knowledge from LLMs to SLMs, while ensuring the preservation of clients’ data privacy. FedCoT ensures secure and efficient knowledge transfer from an LLM on a high-powered server to an SLM on a resource-constrained client, while adhering to privacy requirements. Leveraging perturbed prompts and rationales generated through the CoT approach, the framework enhances the performance of the client’s SLM without compromising user data privacy within a multi-task learning framework. We propose two privacy protection strategies: the Exponential Mechanism Strategy and the Adaptive Exponential Mechanism Strategy, which balance user prompt privacy and the usability of rationales. Empirical evaluation on various text generation tasks demonstrates the effectiveness of FedCoT in training task-specific SLMs with enhanced performance while prioritizing data privacy protection. Our code has been contributed to the FATE open-source project and is now publicly accessible at \textit{https://github.com/FederatedAI/FATE-LLM/tree/main/python/fate_llm/algo/fedcot}

[111] JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models

Haibo Jin, Leyang Hu, Xinnuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang

Main category: cs.CL

TL;DR: This survey comprehensively reviews jailbreaking techniques that bypass ethical boundaries in LLMs/VLMs and corresponding defense mechanisms, categorizing jailbreaks into 7 types and proposing future research directions for enhanced AI security.

Details

Motivation: The rapid advancement of LLMs and VLMs raises critical security and ethical concerns, particularly regarding deliberate circumvention of their operational boundaries through jailbreaking techniques.

Method: The study conducts an extensive review and categorization of jailbreaking methods into seven distinct types, while also examining corresponding defense strategies against these vulnerabilities.

Result: The survey identifies specific research gaps in AI security and provides a comprehensive framework for understanding both jailbreak techniques and defensive solutions for LLMs and VLMs.

Conclusion: A unified perspective integrating jailbreak strategies and defensive solutions is necessary to create robust, secure, and reliable environments for next-generation language models.

Abstract: The rapid evolution of artificial intelligence (AI) through developments in Large Language Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements across various technological domains. While these models enhance capabilities in natural language processing and visual interactive tasks, their growing adoption raises critical concerns regarding security and ethical alignment. This survey provides an extensive review of the emerging field of jailbreaking–deliberately circumventing the ethical and operational boundaries of LLMs and VLMs–and the consequent development of defense mechanisms. Our study categorizes jailbreaks into seven distinct types and elaborates on defense strategies that address these vulnerabilities. Through this comprehensive examination, we identify research gaps and propose directions for future studies to enhance the security frameworks of LLMs and VLMs. Our findings underscore the necessity for a unified perspective that integrates both jailbreak strategies and defensive solutions to foster a robust, secure, and reliable environment for the next generation of language models. More details can be found on our website: https://chonghan-chen.com/llm-jailbreak-zoo-survey/.

[112] Employing Sentence Space Embedding for Classification of Data Stream from Fake News Domain

Paweł Zyblewski, Jakub Klikowski, Weronika Borek-Marciniec, Paweł Ksieniewicz

Main category: cs.CL

TL;DR: First approach using sentence space method for natural language data stream classification, enabling text encoding as discrete digital signals for CNN-based fake news detection.

Details

Motivation: Deep learning methods are often excluded from data stream classification due to temporal constraints, but this exclusion seems premature given recent progress in deep learning development.

Method: Sentence space method to encode text into discrete digital signals, then using convolutional deep networks (originally for image classification) for fake news recognition on text data.

Result: Proposed approach was evaluated on real-life Fakeddit dataset and compared with state-of-the-art data stream classification algorithms based on generalization ability and time complexity.

Conclusion: Demonstrates that deep learning methods can be effectively applied to data stream classification tasks, specifically for fake news detection from text data streams.

Abstract: Tabular data is considered the last unconquered castle of deep learning, yet the task of data stream classification is stated to be an equally important and demanding research area. Due to the temporal constraints, it is assumed that deep learning methods are not the optimal solution for application in this field. However, excluding the entire – and prevalent – group of methods seems rather rash given the progress that has been made in recent years in its development. For this reason, the following paper is the first to present an approach to natural language data stream classification using the sentence space method, which allows for encoding text into the form of a discrete digital signal. This allows the use of convolutional deep networks dedicated to image classification to solve the task of recognizing fake news based on text data. Based on the real-life Fakeddit dataset, the proposed approach was compared with state-of-the-art algorithms for data stream classification based on generalization ability and time complexity.

[113] BLADE: Benchmarking Language Model Agents for Data-Driven Science

Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, Yikun Zhang, Tianmai M. Zhang, Lanyi Zhu, Mike A. Merrill, Jeffrey Heer, Tim Althoff

Main category: cs.CL

TL;DR: BLADE is a benchmark for evaluating AI agents’ ability to conduct data-driven scientific analyses, addressing challenges in assessing open-ended research tasks with multiple valid approaches.

Details

Motivation: Current LM-based agents struggle with complex scientific analysis tasks that require domain knowledge, statistical expertise, and nuanced decision-making about variables, transformations, and models.

Method: Created BLADE benchmark with 12 datasets from scientific literature, collected ground truth from expert analyses, and developed computational methods to match agent responses to expert ground truth.

Result: Language models show limited analytical capabilities, performing only basic analyses, while data-interacting agents demonstrate improved but still suboptimal diversity in analytical decision-making.

Conclusion: BLADE enables systematic evaluation of agents for data-driven science and provides insights into their analysis approaches, highlighting current limitations and areas for improvement.

Abstract: Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, e.g., about which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. However, evaluating agents on such open-ended tasks is challenging due to multiple valid approaches, partially correct steps, and different ways to express the same decisions. To address these challenges, we present BLADE, a benchmark to automatically evaluate agents’ multifaceted approaches to open-ended research questions. BLADE consists of 12 datasets and research questions drawn from existing scientific literature, with ground truth collected from independent analyses by expert data scientists and researchers. To automatically evaluate agent responses, we developed corresponding computational methods to match different representations of analyses to this ground truth. Though language models possess considerable world knowledge, our evaluation shows that they are often limited to basic analyses. However, agents capable of interacting with the underlying data demonstrate improved, but still non-optimal, diversity in their analytical decision making. Our work enables the evaluation of agents for data-driven science and provides researchers deeper insights into agents’ analysis approaches.

[114] Mufu: Multilingual Fused Learning for Low-Resource Translation with LLM

Zheng Wei Lim, Nitish Gupta, Honglin Yu, Trevor Cohn

Main category: cs.CL

TL;DR: Mufu improves low-resource language translation by using multilingual candidates and correction instructions in prompts, turning translation into post-editing and leveraging LLMs’ reasoning capabilities.

Details

Motivation: Multilingual LLMs struggle with low-resource language translation due to limited data, requiring more efficient approaches to handle these challenging language pairs.

Method: Mufu introduces automatically generated multilingual candidates and correction instructions in prompts, transforming translation tasks into post-editing tasks where LLMs assess input quality, align semantics, copy relevant content, and override incorrect instances.

Result: Experiments on Flores-200 dataset show Mufu outperforms NLLB 1.3B distilled model in 64% of low-resource language pairs, and distilled models maintain 3.1 chrF improvement over baseline.

Conclusion: Mufu effectively addresses low-resource translation challenges by leveraging LLMs’ reasoning capabilities through post-editing prompts and auxiliary candidates, achieving significant improvements while maintaining efficiency through distillation.

Abstract: Multilingual large language models (LLMs) are great translators, but this is largely limited to high-resource languages. For many LLMs, translating in and out of low-resource languages remains a challenging task. To maximize data efficiency in this low-resource setting, we introduce Mufu, which includes a selection of automatically generated multilingual candidates and an instruction to correct inaccurate translations in the prompt. Mufu prompts turn a translation task into a postediting one, and seek to harness the LLM’s reasoning capability with auxiliary translation candidates, from which the model is required to assess the input quality, align the semantics cross-lingually, copy from relevant inputs and override instances that are incorrect. Our experiments on En-XX translations over the Flores-200 dataset show LLMs finetuned against Mufu-style prompts are robust to poor quality auxiliary translation candidates, achieving performance superior to NLLB 1.3B distilled model in 64% of low- and very-low-resource language pairs. We then distill these models to reduce inference cost, while maintaining on average 3.1 chrF improvement over finetune-only baseline in low-resource translations.

[115] Skill Path: Unveiling Language Skills from Circuit Graphs

Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang

Main category: cs.CL

TL;DR: The paper introduces skill paths as a refined alternative to circuit graphs for analyzing language model skills, addressing issues of atomic ablation and extraneous effects through a three-step framework of decomposition, pruning, and causal mediation.

Details

Motivation: Circuit graphs for language model skill analysis suffer from atomic ablation (loss of causal dependencies) and capture extraneous effects beyond isolated target skills, limiting their effectiveness in mechanistic understanding.

Method: A three-step framework: 1) complete linear decomposition of transformer models into disentangled computation graphs, 2) pruning, and 3) post-pruning causal mediation using counterfactuals and interventions to extract skill paths from circuit graphs.

Result: The framework successfully extracts skill paths for three generic language skills (Previous Token Skill, Induction Skill, and In-Context Learning Skill) and demonstrates two key properties: stratification and inclusiveness.

Conclusion: Skill paths provide a more refined and compact representation than circuit graphs for isolating individual skills in language models, enabling better mechanistic understanding through linear chains of components with preserved causal dependencies.

Abstract: Circuit graph discovery has emerged as a fundamental approach to elucidating the skill mechanistic of language models. Despite the output faithfulness of circuit graphs, they suffer from atomic ablation, which causes the loss of causal dependencies between connected components. In addition, their discovery process, designed to preserve output faithfulness, inadvertently captures extraneous effects other than an isolated target skill. To alleviate these challenges, we introduce skill paths, which offers a more refined and compact representation by isolating individual skills within a linear chain of components. To enable skill path extracting from circuit graphs, we propose a three-step framework, consisting of decomposition, pruning, and post-pruning causal mediation. In particular, we offer a complete linear decomposition of the transformer model which leads to a disentangled computation graph. After pruning, we further adopt causal analysis techniques, including counterfactuals and interventions, to extract the final skill paths from the circuit graph. To underscore the significance of skill paths, we investigate three generic language skills-Previous Token Skill, Induction Skill, and In-Context Learning Skill-using our framework. Experiments support two crucial properties of these skills, namely stratification and inclusiveness.

[116] Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation

Junjie Chen, Weihang Su, Zhumin Chu, Haitao Li, Yujia Zhou, Dingbo Yuan, Xudong Wang, Jun Zhou, Yiqun Liu, Min Zhang, Shaoping Ma, Qingyao Ai

Main category: cs.CL

TL;DR: Auto-PRE is an automatic LLM evaluation framework that selects evaluator models based on consistency, pertinence, and self-confidence traits, achieving state-of-the-art performance while reducing costs.

Details

Motivation: Traditional LLM evaluation methods face challenges like high costs, limited task formats, dependence on human references, and systematic biases.

Method: Proposes Auto-PRE framework that automatically selects evaluator LLMs based on three core traits: consistency (instruction stage), pertinence (content stage), and self-confidence (response stage).

Result: Experiments on summarization, non-factoid QA, and dialogue generation tasks demonstrate state-of-the-art performance with significantly reduced evaluation costs.

Conclusion: The structured and scalable design provides insights for automating LLM-as-judge evaluation, paving the way for more advanced LLM-based evaluation frameworks.

Abstract: The rapid development of large language models (LLMs) has highlighted the need for efficient and reliable methods to evaluate their performance. Traditional evaluation methods often face challenges like high costs, limited task formats, dependence on human references, and systematic biases. To address these limitations, we propose Auto-PRE, an automatic LLM evaluation framework inspired by the peer review process. Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluator LLMs based on three core traits: consistency, pertinence, and self-confidence, which correspond to the instruction, content, and response stages, respectively, and collectively cover the entire evaluation process. Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance while significantly reducing evaluation costs. Furthermore, the structured and scalable design of our automatic qualification exam framework provides valuable insights into automating the evaluation of LLMs-as-judges, paving the way for more advanced LLM-based evaluation frameworks.

[117] All Entities are Not Created Equal: Examining the Long Tail for Ultra-Fine Entity Typing

Advait Deshmukh, Ashwin Umadi, Dananjay Srinivas, Maria Leonor Pacheco

Main category: cs.CL

TL;DR: PLMs struggle with ultra-fine entity typing for infrequent entities at the long tail of pre-training distribution, requiring knowledge-infused approaches beyond parametric knowledge alone.

Details

Motivation: To explore limitations of PLMs' parametric knowledge in ultra-fine entity typing tasks, particularly for entities at the long tail of pre-training distribution.

Method: Proposed a novel heuristic to approximate pre-training distribution of entities when pre-training data is unknown, and systematically analyzed entity-typing approaches relying solely on PLMs versus knowledge-infused approaches.

Result: Entity-typing approaches using only PLMs’ parametric knowledge perform poorly for long-tail entities, while knowledge-infused approaches can mitigate some of these shortcomings.

Conclusion: We need to go beyond PLMs and incorporate external knowledge to effectively handle infrequent entities in ultra-fine entity typing tasks.

Abstract: Due to their capacity to acquire world knowledge from large corpora, pre-trained language models (PLMs) are extensively used in ultra-fine entity typing tasks where the space of labels is extremely large. In this work, we explore the limitations of the knowledge acquired by PLMs by proposing a novel heuristic to approximate the pre-training distribution of entities when the pre-training data is unknown. Then, we systematically demonstrate that entity-typing approaches that rely solely on the parametric knowledge of PLMs struggle significantly with entities at the long tail of the pre-training distribution, and that knowledge-infused approaches can account for some of these shortcomings. Our findings suggest that we need to go beyond PLMs to produce solutions that perform well for infrequent entities.

[118] Shared Heritage, Distinct Writing: Rethinking Resource Selection for East Asian Historical Documents

Seyoung Song, Haneul Yoo, Jiho Jin, Kyunghyun Cho, Alice Oh

Main category: cs.CL

TL;DR: Questioning cross-lingual transfer from Classical Chinese to Hanja and Kanbun, experiments show minimal benefits for Korean historical documents, with performance differences within ±0.0068 F1-score and +0.84 BLEU score.

Details

Motivation: To challenge the assumption that Classical Chinese resources can effectively transfer to processing historical documents from Korea and Japan (Hanja and Kanbun), given their shared linguistic heritage but distinct characteristics.

Method: Conducted experiments across machine translation, named entity recognition, and punctuation restoration tasks using various model sizes, architectures, and domain-specific datasets to evaluate cross-lingual transfer effectiveness.

Result: Minimal impact of Classical Chinese datasets on Hanja performance, with benefits diminishing rapidly as local language data increases, and substantial improvements only in extremely low-resource scenarios for both Korean and Japanese documents.

Conclusion: Empirical validation is crucial rather than assuming benefits from indiscriminate cross-lingual transfer, as Classical Chinese resources provide limited help for processing Korean and Japanese historical documents.

Abstract: Historical documents in the Sinosphere are known to share common formats and practices, particularly in veritable records compiled by court historians. This shared linguistic heritage has led researchers to use Classical Chinese resources for cross-lingual transfer when processing historical documents from Korea and Japan, which remain relatively low-resource. In this paper, we question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun, the ancient written languages of Korea and Japan, respectively. Our experiments across machine translation, named entity recognition, and punctuation restoration tasks show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja, with performance differences within $\pm{}0.0068$ F1-score for sequence labeling tasks and up to $+0.84$ BLEU score for translation. These limitations persist consistently across various model sizes, architectures, and domain-specific datasets. Our analysis reveals that the benefits of Classical Chinese resources diminish rapidly as local language data increases for Hanja, while showing substantial improvements only in extremely low-resource scenarios for both Korean and Japanese historical documents. These findings emphasize the need for careful empirical validation rather than assuming benefits from indiscriminate cross-lingual transfer.

[119] Pralekha: Cross-Lingual Document Alignment for Indic Languages

Sanjay Suryanarayanan, Haiyue Song, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, Raj Dabre

Main category: cs.CL

TL;DR: Pralekha is a benchmark with 3M+ aligned document pairs across 11 Indic languages and English. The paper introduces Document Alignment Coefficient (DAC), a chunk-based alignment method that is 2-3x faster than pooling-based methods while maintaining competitive performance.

Details

Motivation: Existing Cross-Lingual Document Alignment (CLDA) techniques have limitations: reliance on scarce metadata, failure to capture fine-grained alignment cues with pooled representations, and computational inefficiency of sentence-based alignment due to large search spaces.

Method: Proposed Document Alignment Coefficient (DAC) aligns documents by matching smaller chunks and computes similarity as the ratio of aligned chunks to the average number of chunks in a pair, avoiding pooling-based representations.

Result: Intrinsic evaluation shows DAC is 2-3x faster while maintaining competitive performance. Extrinsic evaluation demonstrates that document-level MT models trained on DAC-aligned pairs consistently outperform baseline alignment methods.

Conclusion: DAC is an effective method for parallel document mining, providing substantial gains over existing approaches while being computationally efficient. The Pralekha dataset supports further research in this area.

Abstract: Mining parallel document pairs for document-level machine translation (MT) remains challenging due to the limitations of existing Cross-Lingual Document Alignment (CLDA) techniques. Existing methods often rely on metadata such as URLs, which are scarce, or on pooled document representations that fail to capture fine-grained alignment cues. Moreover, the limited context window of sentence embedding models hinders their ability to represent document-level context, while sentence-based alignment introduces a combinatorially large search space, leading to high computational cost. To address these challenges for Indic languages, we introduce Pralekha, a benchmark containing over 3 million aligned document pairs across 11 Indic languages and English, which includes 1.5 million English-Indic pairs. Furthermore, we propose Document Alignment Coefficient (DAC), a novel metric for fine-grained document alignment. Unlike pooling-based methods, DAC aligns documents by matching smaller chunks and computes similarity as the ratio of aligned chunks to the average number of chunks in a pair. Intrinsic evaluation shows that our chunk-based method is 2-3x faster while maintaining competitive performance, and that DAC achieves substantial gains over pooling-based baselines. Extrinsic evaluation further demonstrates that document-level MT models trained on DAC-aligned pairs consistently outperform those using baseline alignment methods. These results highlight DAC’s effectiveness for parallel document mining. The dataset and evaluation framework are publicly available to support further research.

[120] LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

Taja Kuzman, Nikola Ljubešić

Main category: cs.CL

TL;DR: Proposes a teacher-student framework using LLMs for multilingual news topic classification without manual annotation, achieving high performance with smaller models.

Details

Motivation: Address the challenge of classifying increasing online news stories by topic across languages to improve content accessibility, while avoiding costly manual annotation.

Method: Uses GPT as teacher model to automatically annotate 20,000 articles in 4 languages into 17 IPTC categories, then fine-tunes smaller BERT-like student models on this dataset.

Result: Teacher model shows high zero-shot performance comparable to human annotators. Student models achieve similar performance with much smaller size, demonstrate strong zero-shot cross-lingual abilities, and work well with limited training data.

Conclusion: The framework successfully creates efficient multilingual news classifiers without manual annotation, with student models performing comparably to the teacher while being computationally lighter, enabling practical deployment.

Abstract: With the ever-increasing number of news stories available online, classifying them by topic, regardless of the language they are written in, has become crucial for enhancing readers’ access to relevant content. To address this challenge, we propose a teacher-student framework based on large language models (LLMs) for developing multilingual news topic classification models of reasonable size with no need for manual data annotation. The framework employs a Generative Pretrained Transformer (GPT) model as the teacher model to develop a news topic training dataset through automatic annotation of 20,000 news articles in Slovenian, Croatian, Greek, and Catalan. Articles are classified into 17 main categories from the Media Topic schema, developed by the International Press Telecommunications Council (IPTC). The teacher model exhibits high zero-shot performance in all four languages. Its agreement with human annotators is comparable to that between the human annotators themselves. To mitigate the computational limitations associated with the requirement of processing millions of texts daily, smaller BERT-like student models are fine-tuned on the GPT-annotated dataset. These student models achieve high performance comparable to the teacher model. Furthermore, we explore the impact of the training data size on the performance of the student models and investigate their monolingual, multilingual, and zero-shot cross-lingual capabilities. The findings indicate that student models can achieve high performance with a relatively small number of training instances, and demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the best-performing news topic classifier, enabling multilingual classification with the top-level categories of the IPTC Media Topic schema.

[121] RareAgents: Autonomous Multi-disciplinary Team for Rare Disease Diagnosis and Treatment

Xuanzhong Chen, Ye Jin, Xiaohao Mao, Lun Wang, Shuyang Zhang, Ting Chen

Main category: cs.CL

TL;DR: RareAgents is an LLM-driven multi-disciplinary team decision-support tool designed for rare disease diagnosis and treatment, outperforming existing models and frameworks.

Details

Motivation: Rare diseases collectively affect 300 million people worldwide, but diagnosis and treatment are challenging due to multi-organ involvement and lack of specialized doctors. Current agent frameworks are not well-adapted to complex rare disease scenarios.

Method: Developed RareAgents with MDT coordination, memory mechanisms, and medical tools utilization using Llama-3.1-8B/70B as base model. Created MIMIC-IV-Ext-Rare dataset for rare disease research.

Result: RareAgents outperforms state-of-the-art domain-specific models, GPT-4o, and current agent frameworks in rare disease diagnosis and treatment tasks.

Conclusion: RareAgents effectively bridges the gap in rare disease clinical support and contributes a valuable dataset to advance research in this challenging medical domain.

Abstract: Rare diseases, despite their low individual incidence, collectively impact around 300 million people worldwide due to the vast number of diseases. The involvement of multiple organs and systems, and the shortage of specialized doctors with relevant experience, make diagnosing and treating rare diseases more challenging than common diseases. Recently, agents powered by large language models (LLMs) have demonstrated notable applications across various domains. In the medical field, some agent methods have outperformed direct prompts in question-answering tasks from medical examinations. However, current agent frameworks are not well-adapted to real-world clinical scenarios, especially those involving the complex demands of rare diseases. To bridge this gap, we introduce RareAgents, the first LLM-driven multi-disciplinary team decision-support tool designed specifically for the complex clinical context of rare diseases. RareAgents integrates advanced Multidisciplinary Team (MDT) coordination, memory mechanisms, and medical tools utilization, leveraging Llama-3.1-8B/70B as the base model. Experimental results show that RareAgents outperforms state-of-the-art domain-specific models, GPT-4o, and current agent frameworks in diagnosis and treatment for rare diseases. Furthermore, we contribute a novel rare disease dataset, MIMIC-IV-Ext-Rare, to facilitate further research in this field.

[122] Revealing emergent human-like conceptual representations from language prediction

Ningyu Xu, Qi Zhang, Chao Du, Qiang Luo, Xipeng Qiu, Xuanjing Huang, Menghan Zhang

Main category: cs.CL

TL;DR: LLMs develop human-like conceptual representations through language prediction alone, without real-world grounding, and these representations align with human behavioral judgments and neural activity patterns.

Details

Motivation: To investigate whether LLMs develop concepts similar to humans, how these concepts are represented and organized, and their relationship to behavior, despite being trained solely on text.

Method: Analyzing representations formed by LLMs during in-context concept inference tasks, examining how models derive concepts from linguistic descriptions in relation to contextual cues.

Result: LLMs can flexibly derive concepts, their representations converge to shared context-independent structures, and alignment with this structure predicts performance across understanding/reasoning tasks. The representations capture human behavioral judgments and align with human neural activity patterns.

Conclusion: Structured, human-like conceptual representations emerge purely from language prediction without real-world grounding, highlighting conceptual structure’s role in intelligent behavior and suggesting LLMs offer insights into human concepts.

Abstract: People acquire concepts through rich physical and social experiences and use them to understand and navigate the world. In contrast, large language models (LLMs), trained solely through next-token prediction on text, exhibit strikingly human-like behaviors. Are these models developing concepts akin to those of humans? If so, how are such concepts represented, organized, and related to behavior? Here, we address these questions by investigating the representations formed by LLMs during an in-context concept inference task. We found that LLMs can flexibly derive concepts from linguistic descriptions in relation to contextual cues about other concepts. The derived representations converge toward a shared, context-independent structure, and alignment with this structure reliably predicts model performance across various understanding and reasoning tasks. Moreover, the convergent representations effectively capture human behavioral judgments and closely align with neural activity patterns in the human brain, providing evidence for biological plausibility. Together, these findings establish that structured, human-like conceptual representations can emerge purely from language prediction without real-world grounding, highlighting the role of conceptual structure in understanding intelligent behavior. More broadly, our work suggests that LLMs offer a tangible window into the nature of human concepts and lays the groundwork for advancing alignment between artificial and human intelligence.

[123] Learning Task Representations from In-Context Learning

Baturay Saglam, Xinyang Hu, Zhuoran Yang, Dionysis Kalogerias, Amin Karbasi

Main category: cs.CL

TL;DR: The paper introduces an automated method to encode task information in in-context learning prompts using attention heads, creating task vectors that generalize across text and regression tasks.

Details

Motivation: To understand how tasks are internally encoded and generalized in LLMs during in-context learning, addressing gaps in current methods that fail beyond text modalities.

Method: Proposes computing task vectors as weighted sums of attention heads, with weights optimized via gradient descent, and introduces a benchmark for evaluating task fidelity in functional regression tasks.

Result: The method successfully extracts task-specific information from in-context demonstrations and performs well in both text and regression tasks, showing cross-modal generalizability.

Conclusion: The proposed approach effectively encodes task information in ICL and demonstrates superior generalization capabilities across different modalities compared to existing methods.

Abstract: Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning (ICL), where models adapt to new tasks through example-based prompts without requiring parameter updates. However, understanding how tasks are internally encoded and generalized remains a challenge. To address some of the empirical and technical gaps in the literature, we introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads within the transformer architecture. This approach computes a single task vector as a weighted sum of attention heads, with the weights optimized causally via gradient descent. Our findings show that existing methods fail to generalize effectively to modalities beyond text. In response, we also design a benchmark to evaluate whether a task vector can preserve task fidelity in functional regression tasks. The proposed method successfully extracts task-specific information from in-context demonstrations and excels in both text and regression tasks, demonstrating its generalizability across modalities.

[124] Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R. McKee, Verena Rieser, Murray Shanahan, Laura Weidinger

Main category: cs.CL

TL;DR: A novel multi-turn evaluation method for measuring anthropomorphic behaviors in LLMs, validated through large-scale human studies, showing that relationship-building behaviors emerge after multiple interactions.

Details

Motivation: To empirically evaluate anthropomorphic LLM behaviors in realistic settings, addressing the growing interest in how users anthropomorphize AI systems among developers, researchers, and policymakers.

Method: Developed three methodological advances: multi-turn evaluation of 14 anthropomorphic behaviors, scalable automated simulations of user interactions, and large-scale human subject study (N=1101) for validation.

Result: All state-of-the-art LLMs exhibited similar anthropomorphic behaviors characterized by relationship-building (empathy, validation) and first-person pronoun use, with most behaviors emerging only after multiple turns.

Conclusion: The work establishes an empirical foundation for studying how design choices influence anthropomorphic behaviors and advances ethical discussions about their desirability, demonstrating the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.

Abstract: The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users’ anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.

[125] PCS: Perceived Confidence Scoring of Black Box LLMs with Metamorphic Relations

Sina Salimian, Gias Uddin, Shaina Raza, Henry Leung

Main category: cs.CL

TL;DR: A method using Metamorphic Relations to evaluate LLM confidence in text classification by generating semantically equivalent variations and analyzing response consistency, improving zero-shot performance by 9.3%.

Details

Motivation: Zero-shot LLMs show suboptimal performance in text classification tasks like sentiment and bias detection, needing better confidence evaluation methods.

Method: Leverage Metamorphic Relations to create semantically equivalent but textually divergent versions of inputs, then compute Perceived Confidence Score (PCS) based on label prediction consistency across variations.

Result: PCS improves zero-shot LLM performance by 9.3% in text classification tasks, and provides 5.8% performance boost in majority-voting setups with multiple LLMs.

Conclusion: The PCS approach effectively enhances LLM performance in text classification by evaluating confidence through metamorphic testing principles.

Abstract: Zero-shot LLMs are now also used for textual classification tasks, e.g., sentiment and bias detection in a sentence or article. However, their performance can be suboptimal in such data annotation tasks. We introduce a novel technique that evaluates an LLM’s confidence for classifying a textual input by leveraging Metamorphic Relations (MRs). The MRs generate semantically equivalent yet textually divergent versions of the input. Following the principles of Metamorphic Testing (MT), the mutated versions are expected to have annotation labels similar to the input. By analyzing the consistency of an LLM’s responses across these variations, we compute a perceived confidence score (PCS) based on the frequency of the predicted labels. PCS can be used for both single and multiple LLM settings (e.g., when multiple LLMs are vetted in a majority-voting setup). Empirical evaluation shows that our PCS-based approach improves the performance of zero-shot LLMs by 9.3% in textual classification tasks. When multiple LLMs are used in a majority-voting setup, we obtain a performance boost of 5.8% with PCS.

[126] PPC-GPT: Federated Task-Specific Compression of Large Language Models via Pruning and Chain-of-Thought Distillation

Tao Fan, Guoqiang Ma, Yuanfeng Song, Lixin Fan, Qiang Yang

Main category: cs.CL

TL;DR: PPC-GPT is a federated framework that compresses LLMs into task-specific SLMs while preserving privacy through differential privacy and synthetic data generation.

Details

Motivation: To address challenges in LLM compression including domain knowledge privacy protection and resource limitations in federated settings.

Method: Uses server-client federated architecture where clients send DP-perturbed data to server LLM, which generates synthetic data with rationales for LLM pruning and retraining.

Result: Achieves competitive performance comparable to full-sized LLMs while ensuring robust privacy protection across diverse text generation tasks.

Conclusion: PPC-GPT successfully integrates privacy preservation and model compression in a unified framework, with code contributed to FATE open-source project.

Abstract: Compressing Large Language Models (LLMs) into task-specific Small Language Models (SLMs) encounters two significant challenges: safeguarding domain-specific knowledge privacy and managing limited resources. To tackle these challenges, we propose PPC-GPT, a novel unified framework that systematically addresses both privacy preservation and model compression in federated settings. PPC-GPT works on a server-client federated architecture, where the client sends differentially private (DP) perturbed task-specific data to the server’s LLM. The LLM then generates synthetic data along with their corresponding rationales. This synthetic data is subsequently used for both LLM pruning and retraining processes. Our framework’s key innovation lies in its holistic integration of privacy-preserving mechanisms, synthetic data generation, and task-specific compression techniques, creating unique benefits through component interaction. Our experiments across diverse text generation tasks demonstrate that PPC-GPT successfully achieves dual objectives: maintaining competitive performance comparable to full-sized LLMs while ensuring robust privacy protection through its federated architecture. Our code has been contributed to the FATE open-source project and is now publicly accessible at \textit{https://github.com/FederatedAI/FATE-LLM/tree/main/python/fate_llm/algo/ppc-gpt}

[127] KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse

Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang

Main category: cs.CL

TL;DR: KVLink enables efficient KV cache reuse in LLMs by precomputing document KV caches independently and concatenating them during inference, with techniques to maintain performance through positional embedding adjustment and special tokens.

Details

Motivation: Many LLM applications involve overlapping context across different inputs (e.g., same retrieved document in multiple queries), leading to redundant computation when encoding the entire context for each query.

Method: Precompute KV cache for each document independently, then concatenate during inference. Uses two key techniques: adjusting positional embeddings to match global positions after concatenation, and trainable special tokens to restore self-attention across independently encoded documents.

Result: Improves question answering accuracy by average 4% over SOTA methods across 7 datasets, reduces time-to-first-token by up to 96% compared to standard LLM inference, and can be combined with KV cache compression for further efficiency gains.

Conclusion: KVLink provides a scalable and efficient solution for context reuse in LLMs, significantly reducing computational redundancy while maintaining or improving performance.

Abstract: We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). In many LLM applications, different inputs can share overlapping context, such as the same retrieved document appearing in multiple queries. However, the LLMs still need to encode the entire context for each query, leading to redundant computation. In this paper, we investigate a new strategy to eliminate such inefficiency, where the KV cache of each document is precomputed independently. During inference, the KV caches of retrieved documents are concatenated, allowing the model to reuse cached representations instead of recomputing them. To mitigate the performance degradation when using KV caches computed independently for each document, KVLink introduces two key techniques: adjusting positional embeddings of the KV cache at inference to match the global position after concatenation, and using trainable special tokens to restore self-attention across independently encoded documents. Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods. Furthermore, by leveraging precomputed KV caches, our approach reduces time-to-first-token by up to 96% compared to standard LLM inference, making it a scalable and efficient solution for context reuse. Additionally, KVLink can be combined with KV cache compression to further save cache loading and storage overhead while outperforming the baselines.

[128] Order Doesn’t Matter, But Reasoning Does: Training LLMs with Order-Centric Augmentation

Qianxi He, Qianyu He, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu

Main category: cs.CL

TL;DR: An order-centric data augmentation framework using commutativity in logical reasoning to help LLMs handle reasoning order variations and improve generalization across logically equivalent transformations.

Details

Motivation: LLMs struggle with reasoning order variations and fail to generalize across logically equivalent transformations, relying on fixed sequential patterns rather than true logical understanding.

Method: Randomly shuffle independent premises for condition order augmentation, and construct a DAG to model dependencies between reasoning steps to identify valid reorderings while preserving logical correctness.

Result: Extensive experiments across multiple logical reasoning benchmarks show the method significantly enhances LLMs’ reasoning performance and adaptability to diverse logical structures.

Conclusion: Order-centric data augmentation enables LLMs to develop more flexible and generalized reasoning processes, improving their logical reasoning capabilities.

Abstract: Logical reasoning is essential for large language models (LLMs) to ensure accurate and coherent inference. However, LLMs struggle with reasoning order variations and fail to generalize across logically equivalent transformations. LLMs often rely on fixed sequential patterns rather than true logical understanding. To address this issue, we introduce an order-centric data augmentation framework based on commutativity in logical reasoning. We first randomly shuffle independent premises to introduce condition order augmentation. For reasoning steps, we construct a directed acyclic graph (DAG) to model dependencies between steps, which allows us to identify valid reorderings of steps while preserving logical correctness. By leveraging order-centric augmentations, models can develop a more flexible and generalized reasoning process. Finally, we conduct extensive experiments across multiple logical reasoning benchmarks, demonstrating that our method significantly enhances LLMs’ reasoning performance and adaptability to diverse logical structures. We release our codes and augmented data in https://github.com/qianxiHe147/Order-Centric-Data-Augmentation.

[129] ECLeKTic: a Novel Challenge Set for Evaluation of Cross-Lingual Knowledge Transfer

Omer Goldman, Uri Shaham, Dan Malkin, Sivan Eiger, Avinatan Hassidim, Yossi Matias, Joshua Maynez, Adi Mayrav Gilady, Jason Riesa, Shruti Rijhwani, Laura Rimell, Idan Szpektor, Reut Tsarfaty, Matan Eyal

Main category: cs.CL

TL;DR: ECLeKTic is a multilingual QA dataset that evaluates cross-lingual knowledge transfer in LLMs by testing their ability to answer questions about information available in one language’s Wikipedia but not others.

Details

Motivation: Current literature lacks reliable ways to measure LLMs' capability for cross-lingual knowledge transfer, which is crucial for achieving equitable performance across languages.

Method: Created ECLeKTic dataset using Wikipedia article presence/absence in 12 languages to identify information available during pre-training in one language but not others, then curated fact-seeking questions about this information in all languages.

Result: Evaluation of 8 LLMs showed that current SOTA models struggle to effectively share knowledge across languages, even when they can answer questions in the language where the knowledge was originally acquired.

Conclusion: There is a significant gap in LLMs’ ability to transfer knowledge between languages, highlighting the need for improved cross-lingual knowledge sharing capabilities in multilingual models.

Abstract: To achieve equitable performance across languages, large language models (LLMs) must be able to abstract knowledge beyond the language in which it was learnt. However, the current literature lacks reliable ways to measure LLMs’ capability of such cross-lingual knowledge transfer. To that end, we present ECLeKTic, a multilingual closed-book QA dataset that Evaluates Cross-Lingual Knowledge Transfer in a simple, black-box manner. Concretely, we used the presence and absence of Wikipedia articles in 12 languages to detect pieces of information that were likely available during pre-training in one of the languages but not in the others. We curate ECLeKTic as a set of fact-seeking questions over this kind of information, in all the different languages. Therefore, in order to solve ECLeKTic the model is required to transfer knowledge between languages. We evaluated 8 LLMs and showed that current SOTA models struggle to effectively share knowledge across languages, even if they can predict the answer for questions in the language in which the knowledge was acquired.

[130] Evaluating Human-LLM Representation Alignment: A Case Study on Affective Sentence Generation for Augmentative and Alternative Communication

Shadab Choudhury, Asha Kumar, Lara J. Martin

Main category: cs.CL

TL;DR: LLMs generate better emotion-expressing sentences when conditioned on English words rather than VAD scales, especially numeric VAD, with representation alignment varying by emotion type.

Details

Motivation: To measure the gap between LLM-generated text and human expectations in AAC tools, particularly for emotion representation.

Method: Evaluate representation alignment via human judgment by expanding keywords and emotion representations (Words, VAD dimensions in Lexical/Numeric forms, Emojis) into full sentences.

Result: People agree more with LLM generation when conditioned on English words than VAD scales, especially numeric VAD. Emotion perception depends on both representation type and specific emotion.

Conclusion: Word-based representations outperform VAD scales for LLM emotion generation in AAC contexts, with representation effectiveness varying by emotion.

Abstract: Gaps arise between a language model’s use of concepts and people’s expectations. This gap is critical when LLMs generate text to help people communicate via Augmentative and Alternative Communication (AAC) tools. In this work, we introduce the evaluation task of Representation Alignment for measuring this gap via human judgment. In our study, we expand keywords and emotion representations into full sentences. We select four emotion representations: Words, Valence-Arousal-Dominance (VAD) dimensions expressed in both Lexical and Numeric forms, and Emojis. In addition to Representation Alignment, we also measure people’s judgments of the accuracy and realism of the generated sentences. While representations like VAD break emotions into easy-to-compute components, our findings show that people agree more with how LLMs generate when conditioned on English words (e.g., “angry”) rather than VAD scales. This difference is especially visible when comparing Numeric VAD to words. Furthermore, we found that the perception of how much a generated sentence conveys an emotion is dependent on both the representation type and which emotion it is.

[131] Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond

Yinghao Hu, Yaoyao Yu, Leilei Gan, Bin Wei, Kun Kuang, Fei Wu

Main category: cs.CL

TL;DR: This paper presents the first systematic evaluation of test-time scaling for legal reasoning across 12 LLMs, develops Legal-R1 as an open-source legal reasoning model, and identifies key challenges like outdated knowledge and hallucinations.

Details

Motivation: To address the insufficient exploration of test-time scaling's impact on legal reasoning, particularly the gap in understanding how extended chain-of-thought inference affects legal domain performance.

Method: Systematic evaluation of 12 LLMs across 17 bilingual legal tasks, creation of a bilingual chain-of-thought dataset through distillation from DeepSeek-R1, and development of Legal-R1 as a specialized legal reasoning model.

Result: Legal-R1 achieves competitive performance across diverse tasks; DeepSeek-R1 excels in Chinese legal reasoning while OpenAI’s o1 performs comparably on English tasks; error analysis reveals issues with outdated knowledge, limited legal interpretation, and factual hallucinations.

Conclusion: The study identifies key obstacles for legal-domain LLMs and provides promising directions for future research in legal reasoning with extended chain-of-thought approaches.

Abstract: Recent advances in test-time scaling of large language models (LLMs), exemplified by DeepSeek-R1 and OpenAI’s o1, show that extending the chain of thought during inference can significantly improve general reasoning performance. However, the impact of this paradigm on legal reasoning remains insufficiently explored. To address this gap, we present the first systematic evaluation of 12 LLMs, including both reasoning-focused and general-purpose models, across 17 Chinese and English legal tasks spanning statutory and case-law traditions. In addition, we curate a bilingual chain-of-thought dataset for legal reasoning through distillation from DeepSeek-R1 and develop Legal-R1, an open-source model specialized for the legal domain. Experimental results show that Legal-R1 delivers competitive performance across diverse tasks. DeepSeek-R1 exhibits clear advantages in Chinese legal reasoning, while OpenAI’s o1 achieves comparable results on English tasks. We further conduct a detailed error analysis, which reveals recurring issues such as outdated legal knowledge, limited capacity for legal interpretation, and susceptibility to factual hallucinations. These findings delineate the main obstacles confronting legal-domain LLMs and suggest promising directions for future research.

[132] On the Consistency of Multilingual Context Utilization in Retrieval-Augmented Generation

Jirui Qi, Raquel Fernández, Arianna Bisazza

Main category: cs.CL

TL;DR: LLMs can extract relevant information from multilingual passages but struggle to formulate answers in the correct language, with distracting passages negatively impacting performance.

Details

Motivation: To understand how LLMs utilize multilingual contexts in retrieval-augmented generation independently from retrieval quality, particularly their ability to handle passages in different languages and respond correctly.

Method: Extensive assessment of four LLMs across three QA datasets covering 48 languages, evaluating their ability to use relevant passages regardless of language, respond in expected language, and focus on relevant information amidst distracting passages using accuracy and feature attribution techniques.

Result: LLMs show surprising ability to extract relevant information from passages in different languages than the query, but weak ability to formulate full answers in the correct language. Distracting passages negatively impact answer quality regardless of language, with distractors in query language having slightly stronger influence.

Conclusion: Findings deepen understanding of LLM context utilization in multilingual RAG systems and provide directions for future improvements in handling multilingual contexts and language consistency.

Abstract: Retrieval-augmented generation (RAG) with large language models (LLMs) has demonstrated strong performance in multilingual question-answering (QA) tasks by leveraging relevant passages retrieved from corpora. In multilingual RAG (mRAG), the retrieved passages can be written in languages other than that of the query entered by the user, making it challenging for LLMs to effectively utilize the provided information. Recent research suggests that retrieving passages from multilingual corpora can improve RAG performance, particularly for low-resource languages. However, the extent to which LLMs can leverage different kinds of multilingual contexts to generate accurate answers, independently from retrieval quality, remains understudied. In this paper, we conduct an extensive assessment of LLMs’ ability to (i) make consistent use of a relevant passage regardless of its language, (ii) respond in the expected language, and (iii) focus on the relevant passage even when multiple `distracting’ passages in different languages are provided in the context. Our experiments with four LLMs across three QA datasets covering a total of 48 languages reveal a surprising ability of LLMs to extract the relevant information from passages in a different language than the query, but a much weaker ability to formulate a full answer in the correct language. Our analysis, based on both accuracy and feature attribution techniques, further shows that distracting passages negatively impact answer quality regardless of their language. However, distractors in the query language exert a slightly stronger influence. Taken together, our findings deepen the understanding of how LLMs utilize context in mRAG systems, providing directions for future improvements.

[133] How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang

Main category: cs.CL

TL;DR: Post-training reshapes LLMs internally by adapting knowledge representations while preserving factual knowledge locations, creates truthfulness and refusal vectors with different transferability properties, and doesn’t use entropy neurons for confidence differences.

Details

Motivation: To understand how post-training internally transforms pre-trained base LLMs into more useful models, as most research focuses on algorithms and outputs rather than internal mechanisms.

Method: Mechanistic comparison of base and post-trained LLMs across four perspectives: knowledge storage locations, truthfulness/refusal vector representations, transferability of directions, and confidence attribution.

Result: Post-training adapts knowledge representations while keeping factual knowledge locations unchanged; truthfulness direction is similar and transferable between models; refusal direction differs with limited transferability; confidence differences not due to entropy neurons.

Conclusion: Post-training preserves fundamental mechanisms while altering others, providing insights for model steering and benefiting future interpretability and post-training research.

Abstract: Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While plenty of works have studied post-training algorithms and evaluated post-training models by their outputs, it remains understudied how post-training reshapes LLMs internally. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for interventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training. Our code is publicly available at https://github.com/HZD01/post-training-mechanistic-analysis.

[134] Invoke Interfaces Only When Needed: Adaptive Invocation for Large Language Models in Question Answering

Jihao Zhao, Chunlai Zhou, Daixuan Li, Shuaishuai Zu, Biao Qin

Main category: cs.CL

TL;DR: Proposes AttenHScore, a real-time hallucination detection metric for small LMs that tracks error accumulation during generation, enabling dynamic invocation of large LMs without additional training.

Details

Motivation: Current collaborative LM approaches struggle with precisely timing large LM invocation when small LMs produce hallucinations, with existing methods being computationally expensive and separate from the reasoning process.

Method: Uses AttenHScore to calculate hallucination accumulation and propagation during small LM generation, with dynamic threshold adjustment and uncertainty-aware knowledge reorganization to help small LMs capture critical information.

Result: Outperforms most baselines in real-time hallucination detection across multiple QA datasets, especially for complex queries, without requiring additional model training and adaptable to various transformer-based LMs.

Conclusion: AttenHScore provides an effective and flexible solution for real-time hallucination detection in collaborative LM systems, improving invocation timing while maintaining computational efficiency.

Abstract: The collaborative paradigm of large and small language models (LMs) effectively balances performance and cost, yet its pivotal challenge lies in precisely pinpointing the moment of invocation when hallucinations arise in small LMs. Previous optimization efforts primarily focused on post-processing techniques, which were separate from the reasoning process of LMs, resulting in high computational costs and limited effectiveness. In this paper, we propose a practical invocation evaluation metric called AttenHScore, which calculates the accumulation and propagation of hallucinations during the generation process of small LMs, continuously amplifying potential reasoning errors. By dynamically adjusting the detection threshold, we achieve more accurate real-time invocation of large LMs. Additionally, considering the limited reasoning capacity of small LMs, we leverage uncertainty-aware knowledge reorganization to assist them better capture critical information from different text chunks. Extensive experiments reveal that our AttenHScore outperforms most baselines in enhancing real-time hallucination detection capabilities across multiple QA datasets, especially when addressing complex queries. Moreover, our strategies eliminate the need for additional model training and display flexibility in adapting to various transformer-based LMs.

[135] Atomic Consistency Preference Optimization for Long-Form Question Answering

Jingfeng Chen, Raghuveer Thirukovalluru, Junlin Wang, Kaiwei Luo, Bhuwan Dhingra

Main category: cs.CL

TL;DR: ACPO is a self-supervised method that improves LLM factual accuracy by using atomic consistency across multiple stochastic responses to identify quality data pairs for alignment, without needing external supervision.

Details

Motivation: Current model alignment methods for reducing factoid hallucinations often require stronger models or external knowledge bases, which may not always be accessible.

Method: Atomic Consistency Preference Optimization (ACPO) leverages atomic consistency signals - agreement of individual facts across multiple stochastic responses - to identify high- and low-quality data pairs for self-supervised model alignment.

Result: ACPO outperforms supervised alignment baseline by 1.95 points averaged across Phi-3 and Llama3 on LongFact and BioGen datasets, improving factual reliability without external models or knowledge bases.

Conclusion: ACPO provides an effective self-supervised approach for enhancing LLM factual accuracy that doesn’t rely on external supervision, demonstrating strong performance compared to supervised methods.

Abstract: Large Language Models (LLMs) often produce factoid hallucinations - plausible yet incorrect answers. A common mitigation strategy is model alignment, which improves factual accuracy by training on curated (factual, non-factual) pairs. However, this approach often relies on a stronger model (e.g., GPT-4) or an external knowledge base to assess factual correctness that may not always be accessible. Addressing this, we propose Atomic Consistency Preference Optimization (ACPO), a self-supervised preference-tuning method that enhances factual accuracy without external supervision. ACPO leverages atomic consistency signals (i.e., the agreement of individual facts across multiple stochastic responses) to identify high- and low-quality data pairs for model alignment. Despite being fully self-supervised, ACPO outperforms the strong supervised alignment baseline by 1.95 points averaged across Phi-3 and Llama3 on the LongFact and BioGen datasets, demonstrating its effectiveness in improving factual reliability without relying on external models or knowledge bases.

[136] Mixed Signals: Understanding Model Disagreement in Multimodal Empathy Detection

Maya Srikanth, Run Chen, Julia Hirschberg

Main category: cs.CL

TL;DR: Analysis of multimodal empathy detection failures when modalities provide conflicting cues, showing disagreement between unimodal and multimodal predictions often reflects ambiguity and can serve as diagnostic signal.

Details

Motivation: To understand why multimodal models fail in empathy detection when modalities provide conflicting cues, and examine cases where unimodal and multimodal predictions diverge.

Method: Used fine-tuned models for text, audio, and video modalities along with a gated fusion model, analyzed cases of disagreement between unimodal and multimodal predictions, and examined annotator uncertainty.

Result: Found that disagreements often reflect underlying ambiguity, dominant signals in one modality can mislead fusion when unsupported by others, and humans don’t consistently benefit from multimodal input either.

Conclusion: Disagreement between unimodal and multimodal predictions serves as a useful diagnostic signal for identifying challenging examples and improving empathy system robustness.

Abstract: Multimodal models play a key role in empathy detection, but their performance can suffer when modalities provide conflicting cues. To understand these failures, we examine cases where unimodal and multimodal predictions diverge. Using fine-tuned models for text, audio, and video, along with a gated fusion model, we find that such disagreements often reflect underlying ambiguity, as evidenced by annotator uncertainty. Our analysis shows that dominant signals in one modality can mislead fusion when unsupported by others. We also observe that humans, like models, do not consistently benefit from multimodal input. These insights position disagreement as a useful diagnostic signal for identifying challenging examples and improving empathy system robustness.

[137] Enhancing Large Language Models for Detecting Mental Manipulation via Annotation-Free Data Augmentation and Anti-Curriculum Distillation

Yuansheng Gao, Han Bao, Tong Zhang, Bin Li, Jixiang Luo, Ronghao Chen, Zonghui Wang, Wenzhi Chen

Main category: cs.CL

TL;DR: MentalMAC is a framework that enhances LLMs’ ability to detect mental manipulation in dialogues using data augmentation, multi-task supervision, and progressive distillation, achieving significant performance improvements over baselines.

Details

Motivation: Mental manipulation is a serious psychological abuse form that's hard to detect due to insufficient training data, covert nature, and lack of real-world datasets.

Method: Three components: EvoSA (annotation-free data augmentation using evolutionary operations and speech act theory), teacher-model-generated multi-task supervision, and progressive task-level anti-curriculum distillation.

Result: Achieved 25.9% improvement in F1mac and 8.1% in accuracy over best baseline, outperforming GPT-4 and Claude-3.5-Sonnet. Created ReaMent dataset with 5,000 real-world dialogue samples.

Conclusion: MentalMAC effectively addresses challenges in mental manipulation detection and significantly enhances LLM capabilities in this domain.

Abstract: Mental manipulation is a subtle yet pervasive form of psychological abuse that poses serious threats to mental health. Nevertheless, detecting mental manipulation remains a largely underexplored research problem. The field faces three major challenges: (i) insufficient and hard-to-obtain training data; (ii) the covert nature of mental manipulation, which hinders detection; and (iii) the lack of real-world datasets. To address these challenges, we propose MentalMAC, a novel framework that enhances large language models’ ability to detect elements of mental manipulation in multi-turn dialogue. Our approach consists of three key components: EvoSA, an annotation-free data augmentation method based on evolutionary operations and speech act theory; teacher-model-generated multi-task supervision; and progressive task-level anti-curriculum distillation. We then constructed the ReaMent dataset, comprising 5,000 real-world dialogue samples, utilizing MentalMAC-distilled models to aid in human annotation. Vast experiments show that MentalMAC achieves up to 25.9% improvement in F1mac and 8.1% in accuracy over the best-performing baseline, outperforming commercial LLMs such as GPT-4 and Claude-3.5-Sonnet. Warning: This paper contains content that may be offensive to the reader.

[138] Language Model Distillation: A Temporal Difference Imitation Learning Perspective

Zishun Yu, Shangzhe Li, Xinhua Zhang

Main category: cs.CL

TL;DR: A temporal difference learning framework for language model distillation that exploits the distributional sparsity of teacher models by operating on reduced action spaces (vocabulary subsets).

Details

Motivation: Large language models have high computational costs, and distillation methods often use behavior cloning approaches. The authors aim to leverage temporal difference learning techniques for more effective distillation by exploiting the observation that language models concentrate probability mass on small token subsets.

Method: General temporal difference-based distillation framework that operates on reduced action spaces (vocabulary subsets) rather than full vocabulary, taking advantage of teacher model’s distributional sparsity where most probability mass is concentrated on few tokens.

Result: Demonstrated practical algorithms derived from the framework and resulting performance improvements in language model distillation.

Conclusion: The proposed temporal difference learning framework with reduced action spaces provides an effective approach for language model distillation by leveraging distributional sparsity patterns in teacher models.

Abstract: Large language models have led to significant progress across many NLP tasks, although their massive sizes often incur substantial computational costs. Distillation has become a common practice to compress these large and highly capable models into smaller, more efficient ones. Many existing language model distillation methods can be viewed as behavior cloning from the perspective of imitation learning or inverse reinforcement learning. This viewpoint has inspired subsequent studies that leverage (inverse) reinforcement learning techniques, including variations of behavior cloning and temporal difference learning methods. Rather than proposing yet another specific temporal difference method, we introduce a general framework for temporal difference-based distillation by exploiting the distributional sparsity of the teacher model. Specifically, it is often observed that language models assign most probability mass to a small subset of tokens. Motivated by this observation, we design a temporal difference learning framework that operates on a reduced action space (a subset of vocabulary), and demonstrate how practical algorithms can be derived and the resulting performance improvements.

[139] Rethinking Text-based Protein Understanding: Retrieval or LLM?

Juntong Wu, Zijing Liu, He Cao, Hao Li, Bin Feng, Zishan Shu, Ke Yu, Li Yuan, Yu Li

Main category: cs.CL

TL;DR: The paper identifies data leakage issues in current protein-text benchmarks and proposes a retrieval-enhanced method that outperforms fine-tuned LLMs for protein-to-text generation with better accuracy and efficiency.

Details

Motivation: Current protein-text models suffer from data leakage in benchmarks and inadequate evaluation metrics from NLP, limiting accurate assessment of model performance in the protein domain.

Method: Reorganize existing datasets, introduce a biological entity-based evaluation framework, and propose a retrieval-enhanced method for protein-to-text generation that works in training-free scenarios.

Result: The retrieval-enhanced method significantly outperforms fine-tuned LLMs for protein-to-text generation and shows improved accuracy and efficiency.

Conclusion: The proposed approach addresses data leakage issues and provides a more accurate evaluation framework for protein-text models, with the retrieval method offering superior performance over traditional fine-tuning approaches.

Abstract: In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess the model’s performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by our observation, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy and efficiency in training-free scenarios. Our code and data can be seen at https://github.com/IDEA-XL/RAPM.

[140] DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning

Wenxuan Shi, Haochen Tan, Chuqiao Kuang, Xiaoguang Li, Xiaozhe Ren, Chen Zhang, Hanting Chen, Yasheng Wang, Lu Hou, Lifeng Shang

Main category: cs.CL

TL;DR: WebPuzzle benchmark and DeepDiver RL framework enable LLMs to develop Search Intensity Scaling for better information seeking on the live internet, achieving performance comparable to much larger models.

Details

Motivation: LLMs struggle with iterative evidence gathering and reflective reasoning in open-web question answering, and existing methods are limited by fixed prompt rules or training corpora, restricting real-world adaptability.

Method: Developed WebPuzzle benchmark (24k training, 275 test samples) and DeepDiver RL framework that cultivates Search Intensity Scaling - an emergent ability to escalate search frequency and depth rather than settling on under-evidenced answers.

Result: With SIS, Qwen2.5-7B-Instruct and Pangu-7B-Reasoner achieved performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1, and the seeking policy generalized from closed-ended queries to open-ended generation like long-form writing.

Conclusion: The approach advances adaptive information seeking in LLMs and provides a rigorous benchmark for future work, demonstrating that smaller models can achieve information seeking capabilities comparable to much larger models through proper training.

Abstract: Information seeking demands iterative evidence gathering and reflective reasoning, yet large language models (LLMs) still struggle with it in open-web question answering. Existing prompting and supervised fine-tuning (SFT) methods remain fixed by prompt rules or training corpora, and are usually benchmarked only on well-structured wiki sources, limiting real-world adaptability. We introduce WebPuzzle, a 24k-sample training and 275-sample test benchmark that evaluates information seeking on the live internet, across both wiki and open-domain queries. Leveraging 7k WebPuzzle instances, we develop DeepDiver, a reinforcement-learning (RL) framework that cultivates Search Intensity Scaling (SIS)-an emergent ability to escalate search frequency and depth instead of settling on overconfident, under-evidenced answers. With SIS, Qwen2.5-7B-Instruct and Pangu-7B-Reasoner attain performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver’s curriculum from cold-start SFT to a well designed RL procedure, and show that its seeking policy generalized from closed-ended queries to open-ended generation such as long-form writing. Our results advance adaptive information seeking in LLMs and provide a rigorous benchmark for future work.

[141] When Language Shapes Thought: Cross-Lingual Transfer of Factual Knowledge in Question Answering

Eojin Kang, Juae Kim

Main category: cs.CL

TL;DR: L2T prompting aligns model’s internal thinking language with knowledge source, outperforming English-based reasoning for cross-lingual factual knowledge transfer.

Details

Motivation: Challenge the assumption that English-based reasoning is universally beneficial for multilingual LLMs, exploring knowledge transfer from non-English to English through Language and Thought Theory.

Method: Introduce Language-to-Thought (L2T) prompting that aligns the model’s internal thinking language with the source language of factual knowledge.

Result: Across three languages and four models, L2T consistently outperforms English-based reasoning, reversing the expected advantage of English prompts.

Conclusion: Aligning the model’s internal thinking language with knowledge source language is more effective than English-based reasoning for cross-lingual factual knowledge transfer.

Abstract: Multilingual large language models (LLMs) offer promising opportunities for cross-lingual information access, yet their use of factual knowledge remains highly sensitive to the input language. Prior work has addressed this through English prompting and evaluation, assuming that English-based reasoning is universally beneficial. In this work, we challenge that assumption by exploring factual knowledge transfer from non-English to English through the lens of Language and Thought Theory. We introduce Language-to-Thought (L2T) prompting, which aligns the model’s internal ‘’thinking’’ language with the source of knowledge. Across three languages and four models, L2T consistently outperforms English-based reasoning, reversing the expected advantage of English prompts. Our code is available at https://github.com/GeomeunByeol/Language2Thought.

[142] LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text

Li yunhan, Wu gengshen

Main category: cs.CL

TL;DR: Proposes a comprehensive evaluation framework for legal LLMs focusing on clarity, coherence, and terminology, revealing that model performance plateaus at 14B parameters and reasoning models outperform base architectures.

Details

Motivation: Current LLM evaluation benchmarks for legal applications focus mainly on factual accuracy while neglecting important linguistic quality aspects like clarity, coherence, and terminology.

Method: Developed a regression model to evaluate legal text quality, created specialized legal questions, and analyzed 49 LLMs using this framework.

Result: Found that model quality plateaus at 14B parameters (only 2.7% improvement at 72B), engineering choices have negligible impact, and reasoning models consistently outperform base architectures. Qwen3 series identified as optimal for cost-performance tradeoffs.

Conclusion: Establishes standardized evaluation protocols for legal LLMs and uncovers fundamental limitations in current training data refinement approaches.

Abstract: As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks tend to focus mainly on factual accuracy while largely neglecting important linguistic quality aspects such as clarity, coherence, and terminology. To address this gap, we propose three steps: First, we develop a regression model to evaluate the quality of legal texts based on clarity, coherence, and terminology. Second, we create a specialized set of legal questions. Third, we analyze 49 LLMs using this evaluation framework. Our analysis identifies three key findings: First, model quality levels off at 14 billion parameters, with only a marginal improvement of $2.7%$ noted at 72 billion parameters. Second, engineering choices such as quantization and context length have a negligible impact, as indicated by statistical significance thresholds above 0.016. Third, reasoning models consistently outperform base architectures. A significant outcome of our research is the release of a ranking list and Pareto analysis, which highlight the Qwen3 series as the optimal choice for cost-performance tradeoffs. This work not only establishes standardized evaluation protocols for legal LLMs but also uncovers fundamental limitations in current training data refinement approaches. Code and models are available at: https://github.com/lyxx3rd/LegalEval-Q.

[143] OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics

Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C. Lipton, J. Zico Kolter, Pratyush Maini

Main category: cs.CL

TL;DR: OpenUnlearning is a standardized framework for benchmarking LLM unlearning methods and metrics, integrating 13 algorithms and 16 evaluations across 3 benchmarks, with 450+ checkpoints.

Details

Motivation: Address challenges in reliably measuring unlearning and fragmentation in methodologies to ensure data privacy, model safety, and regulatory compliance in LLM deployment.

Method: Introduces OpenUnlearning framework with integrated unlearning algorithms, diverse evaluations, and meta-evaluation benchmark for assessing metric faithfulness and robustness.

Result: Provides comparative analysis of unlearning methods against extensive evaluation suite and releases 450+ checkpoints for forgetting behavior analysis.

Conclusion: Establishes community-driven pathway for rigorous LLM unlearning research through standardized benchmarking framework.

Abstract: Robust unlearning is crucial for safely deploying large language models (LLMs) in environments where data privacy, model safety, and regulatory compliance must be ensured. Yet the task is inherently challenging, partly due to difficulties in reliably measuring whether unlearning has truly occurred. Moreover, fragmentation in current methodologies and inconsistent evaluation metrics hinder comparative analysis and reproducibility. To unify and accelerate research efforts, we introduce OpenUnlearning, a standardized and extensible framework designed explicitly for benchmarking both LLM unlearning methods and metrics. OpenUnlearning integrates 13 unlearning algorithms and 16 diverse evaluations across 3 leading benchmarks (TOFU, MUSE, and WMDP) and also enables analyses of forgetting behaviors across 450+ checkpoints we publicly release. Leveraging OpenUnlearning, we propose a novel meta-evaluation benchmark focused specifically on assessing the faithfulness and robustness of evaluation metrics themselves. We also benchmark diverse unlearning methods and provide a comparative analysis against an extensive evaluation suite. Overall, we establish a clear, community-driven pathway toward rigorous development in LLM unlearning research.

[144] Reasoning with Exploration: An Entropy Perspective

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, Furu Wei

Main category: cs.CL

TL;DR: The paper introduces a simple one-line modification to RL that augments the advantage function with an entropy-based term to promote deeper reasoning chains in LLMs, achieving significant gains on Pass@K metrics.

Details

Motivation: Current LLM reasoning methods lean too much toward exploitation and encounter performance plateaus, while exploration through entropy signals could enable deeper reasoning capabilities.

Method: Augment the standard RL advantage function with an entropy-based term to encourage exploration by promoting longer and deeper reasoning chains, rather than just uncertainty.

Result: Significant gains on the Pass@K metric, even with extremely large K values, pushing the boundaries of LLM reasoning capabilities.

Conclusion: Entropy-based exploration in RL can effectively enhance LLM reasoning by promoting deeper reasoning chains, overcoming performance plateaus in existing methods.

Abstract: Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing large language model (LLM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy – a signal of exploration in RL – and examine its relationship to exploratory reasoning in LLMs. Through empirical analysis, we uncover positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LLMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric – an upper-bound estimator of LLM reasoning capabilities – even when evaluated with extremely large K values, pushing the boundaries of LLM reasoning.

[145] Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations

Ananth Agarwal, Jasper Jian, Christopher D. Manning, Shikhar Murty

Main category: cs.CL

TL;DR: Syntactic probing accuracy in LLMs does not reliably predict downstream syntactic performance, revealing a disconnect between latent representations and observable behaviors.

Details

Motivation: To investigate whether syntactic features extracted via probing actually predict a model's downstream syntactic performance, addressing the gap between internal representations and external behaviors.

Method: Evaluated 32 open-weight transformer models using a ‘mechanisms vs. outcomes’ framework, comparing syntactic features from probing with targeted syntax evaluations across English linguistic phenomena.

Result: Found that syntactic features extracted via probing fail to predict outcomes of targeted syntax evaluations, showing substantial disconnect between latent representations and observable syntactic behaviors.

Conclusion: There is a significant gap between what probing reveals about syntactic representations and how models actually perform on downstream syntactic tasks, challenging the reliability of probing as a predictor of syntactic competence.

Abstract: Large Language Models (LLMs) exhibit a robust mastery of syntax when processing and generating text. While this suggests internalized understanding of hierarchical syntax and dependency relations, the precise mechanism by which they represent syntactic structure is an open area within interpretability research. Probing provides one way to identify the mechanism of syntax being linearly encoded in activations, however, no comprehensive study has yet established whether a model’s probing accuracy reliably predicts its downstream syntactic performance. Adopting a “mechanisms vs. outcomes” framework, we evaluate 32 open-weight transformer models and find that syntactic features extracted via probing fail to predict outcomes of targeted syntax evaluations across English linguistic phenomena. Our results highlight a substantial disconnect between latent syntactic representations found via probing and observable syntactic behaviors in downstream tasks.

[146] ReCode: Updating Code API Knowledge with Reinforcement Learning

Haoze Wu, Yunzhi Yao, Wenhao Yu, Ningyu Zhang

Main category: cs.CL

TL;DR: ReCode is a reinforcement learning framework that helps LLMs adapt to changing API libraries by training them on version migration tasks, improving code generation in dynamic environments without significantly harming general coding abilities.

Details

Motivation: LLMs struggle with adapting to frequent API updates due to reliance on outdated training data, which limits reliable code generation in dynamic development environments.

Method: Proposed ReCode framework with: 1) Dataset of ~2,000 entries for training LLMs on version migration, 2) Modified string similarity metric for code evaluation as RL reward, 3) Applied to various LLMs using GRPO and DAPO reinforcement learning algorithms.

Result: ReCode substantially boosts LLMs’ code generation performance in dynamic API scenarios, especially on unseen CodeUpdateArena tasks. Qwen2.5-Coder-7B outperformed 32B parameter models after training. Consistent improvements across various LLMs and RL algorithms.

Conclusion: ReCode effectively addresses LLMs’ adaptation to API changes while preserving general code generation capabilities, demonstrating practical value for dynamic programming environments.

Abstract: Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs’ code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs’ general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms that of the 32B parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.

[147] DP-Fusion: Token-Level Differentially Private Inference for Large Language Models

Rushil Thareja, Preslav Nakov, Praneeth Vepakomma, Nils Lukas

Main category: cs.CL

TL;DR: DP-Fusion is a differentially private inference mechanism for LLMs that provably bounds token influence, protecting sensitive information while maintaining text quality.

Details

Motivation: LLMs can inadvertently reveal sensitive information from their context, and existing privacy methods lack provable guarantees or have poor utility/privacy trade-offs.

Method: Label sensitive tokens, infer LLM without sensitive tokens for baseline, infer with sensitive tokens, then blend distributions to bound influence on final output.

Result: Achieves 6x lower perplexity than related DPI methods while providing token-level provable privacy guarantees.

Conclusion: DP-Fusion enables document privatization with substantially improved theoretical and empirical privacy, effectively balancing privacy protection with text quality.

Abstract: Large language models (LLMs) do not preserve privacy at inference-time. The LLM’s outputs can inadvertently reveal information about the model’s context, which presents a privacy challenge when the LLM is augmented via tools or databases containing sensitive information. Existing privacy-preserving methods at inference-time have significant limitations since they (i) lack provable guarantees or (ii) have a poor utility/privacy trade-off. We propose DP-Fusion, a Differentially Private Inference (DPI) mechanism for LLMs that provably bounds the influence a set of tokens in the context can have on the LLM’s output. DP-Fusion works as follows: (1) label a subset of sensitive tokens, (2) infer the LLM without any sensitive tokens to obtain a baseline, (3) infer the LLM with the sensitive tokens, and (4) blend distributions so that the final output remains within a bounded distance of the baseline distribution. While this per-token influence bound also mitigates jailbreak-style prompt injection, we focus on \emph{document privatization}, where the goal is to paraphrase a document containing sensitive tokens, e.g., personally identifiable information, so that no attacker can reliably infer them from the paraphrased document while preserving high text quality. The privacy/utility trade-off is controlled by $ε$, where $ε=0$ hides sensitive tokens entirely, while higher values trade off privacy for improved text quality. We show that our method creates token-level provably privatized documents with substantially improved theoretical and empirical privacy, achieving $6\times$ lower perplexity than related DPI methods.

[148] ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?

Haoxin Wang, Xianhan Peng, Xucheng Huang, Yizhe Huang, Ming Gong, Chenghan Yang, Yang Liu, Ling Jiang

Main category: cs.CL

TL;DR: ECom-Bench is the first benchmark for evaluating multimodal LLM agents in e-commerce customer support, featuring realistic user simulations and tasks derived from real e-commerce dialogues.

Details

Motivation: To address the lack of evaluation frameworks for multimodal LLM agents in e-commerce customer support, where complex real-world scenarios present significant challenges that current models struggle with.

Method: Created dynamic user simulations based on real e-commerce customer personas and developed realistic task datasets from authentic e-commerce dialogues covering diverse business scenarios.

Result: The benchmark proved highly challenging - even advanced models like GPT-4o achieved only 10-20% pass^3 metric, demonstrating the substantial difficulties in complex e-commerce scenarios.

Conclusion: ECom-Bench provides a valuable evaluation framework for multimodal LLM agents in e-commerce, with publicly available code and data to advance research in this challenging domain.

Abstract: In this paper, we introduce ECom-Bench, the first benchmark framework for evaluating LLM agent with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues. These tasks, covering a wide range of business scenarios, are designed to reflect real-world complexities, making ECom-Bench highly challenging. For instance, even advanced models like GPT-4o achieve only a 10-20% pass^3 metric in our benchmark, highlighting the substantial difficulties posed by complex e-commerce scenarios. The code and data have been made publicly available at https://github.com/XiaoduoAILab/ECom-Bench to facilitate further research and development in this domain.

[149] Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited

Anthony G Cohn, Robert E Blackwell

Main category: cs.CL

TL;DR: Evaluation of 28 LLMs’ ability to reason about cardinal directions using a templated benchmark, showing even newer models struggle with reliability.

Details

Motivation: To systematically assess LLMs' spatial reasoning capabilities for cardinal directions across various scenarios and perspectives.

Method: Used benchmark generated from templates with variations in locomotion means and person perspective (1st/2nd/3rd person) to test 28 LLMs.

Result: Even newer Large Reasoning Models cannot reliably determine correct cardinal directions for all questions in the benchmark.

Conclusion: Current LLMs, including advanced reasoning models, have significant limitations in spatial reasoning about cardinal directions despite variations in testing scenarios.

Abstract: We investigate the abilities of 28 Large language Models (LLMs) to reason about cardinal directions (CDs) using a benchmark generated from a set of templates, extensively testing an LLM’s ability to determine the correct CD given a particular scenario. The templates allow for a number of degrees of variation such as means of locomotion of the agent involved, and whether set in the first, second or third person. Even the newer Large Reasoning Models are unable to reliably determine the correct CD for all questions. This paper summarises and extends earlier work presented at COSIT-24.

Abeer Aldayel, Areej Alokaili

Main category: cs.CL

TL;DR: The paper introduces a framework to evaluate how opinions are represented in NLP models by focusing on implicit expressions of opinion rather than surface-level demographic mentions, using stance as a proxy for underlying opinions.

Details

Motivation: Existing methods for inclusive representation rely on surface inclusion using demographic mentions, overlooking nuanced implicit opinions and potentially reinforcing harmful stereotypes in model outputs.

Method: Proposes an alignment evaluation framework that models stance of responses as a proxy for opinions, evaluated using positive-unlabeled online learning with base classifiers and instruction-tuned language models for post-training alignment assessment.

Result: The framework provides a principled approach to assess how implicit opinions are (mis)represented in conversation models, enabling validation of normative alignment.

Conclusion: The study offers a pathway toward more inclusive model behavior by foregrounding implicit conversations and evaluating normative social views through stance analysis.

Abstract: Shaping inclusive representations that embrace diversity and ensure fair participation and reflections of values is at the core of many conversation-based models. However, many existing methods rely on surface inclusion using mention of user demographics or behavioral attributes of social groups. Such methods overlook the nuanced, implicit expression of opinion embedded in conversations. Furthermore, the over-reliance on overt cues can exacerbate misalignment and reinforce harmful or stereotypical representations in model outputs. Thus, we took a step back and recognized that equitable inclusion needs to account for the implicit expression of opinion and use the stance of responses to validate the normative alignment. This study aims to evaluate how opinions are represented in NLP or computational models by introducing an alignment evaluation framework that foregrounds implicit, often overlooked conversations and evaluates the normative social views and discourse. Our approach models the stance of responses as a proxy for the underlying opinion, enabling a considerate and reflective representation of diverse social viewpoints. We evaluate the framework using both (i) positive-unlabeled (PU) online learning with base classifiers, and (ii) instruction-tuned language models to assess post-training alignment. Through this, we provide a principled and structured lens on how implicit opinions are (mis)represented and offer a pathway toward more inclusive model behavior.

[151] UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases

Raj Vardhan Tomar, Preslav Nakov, Yuxia Wang

Main category: cs.CL

TL;DR: UnsafeChain is a safety alignment dataset for large reasoning models that uses hard prompts with unsafe completions and provides explicit corrections to safe responses, enhancing safety while preserving reasoning ability.

Details

Motivation: Existing safety alignment methods focus on filtering safe prompts but overlook hard prompts that consistently elicit harmful outputs, creating a gap in comprehensive safety training.

Method: Constructed UnsafeChain dataset from hard prompts with diverse sources, identified unsafe completions, and explicitly corrected them into safe responses. Fine-tuned three large reasoning models on this dataset.

Result: UnsafeChain consistently outperforms prior datasets (SafeChain and STAR-1) across six out-of-distribution and five in-distribution benchmarks, with even a 1K subset matching or surpassing baseline performance.

Conclusion: Correction-based supervision through UnsafeChain is effective and generalizable for safety alignment, enhancing model safety while maintaining reasoning capabilities.

Abstract: As large reasoning models (LRMs) grow more capable, chain-of-thought (CoT) reasoning introduces new safety challenges. Existing SFT-based safety alignment studies dominantly focused on filtering prompts with safe, high-quality responses, while overlooking hard prompts that always elicit harmful outputs. To fill this gap, we introduce UnsafeChain, a safety alignment dataset constructed from hard prompts with diverse sources, where unsafe completions are identified and explicitly corrected into safe responses. By exposing models to unsafe behaviors and guiding their correction, UnsafeChain enhances safety while preserving general reasoning ability. We fine-tune three LRMs on UnsafeChain and compare them against recent SafeChain and STAR-1 across six out-of-distribution and five in-distribution benchmarks. UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance, demonstrating the effectiveness and generalizability of correction-based supervision. We release our dataset and code at https://github.com/mbzuai-nlp/UnsafeChain

[152] CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications

Raviraj Joshi, Rakesh Paul, Kanishk Singla, Anusha Kamath, Michael Evans, Katherine Luna, Shaona Ghosh, Utkarsh Vaidya, Eileen Long, Sanjay Singh Chauhan, Niranjan Wartikar

Main category: cs.CL

TL;DR: CultureGuard creates multilingual safety datasets using a 4-stage pipeline to address the lack of culturally aligned safety data in non-English languages, enabling training of state-of-the-art safety guard models.

Details

Motivation: Non-English languages lack robust safety guard models due to high costs of collecting culturally aligned labeled datasets, while English content safety is well-studied.

Method: Four-stage synthetic data generation pipeline: cultural data segregation, cultural data adaptation, machine translation, and quality filtering to expand English safety dataset to 8 languages.

Result: Created Nemotron-Safety-Guard-Dataset-v3 with 386,661 samples in 9 languages; trained Llama-3.1-Nemotron-Safety-Guard-8B-v3 model achieving SOTA performance on multilingual benchmarks with strong cross-lingual transfer.

Conclusion: This work advances multilingual LLM safety by enabling culturally aware safety guard models, addressing the vulnerability of LLMs to unsafe responses in non-English languages.

Abstract: The increasing use of Large Language Models (LLMs) in agentic applications highlights the need for robust safety guard models. While content safety in English is well-studied, non-English languages lack similar advancements due to the high cost of collecting culturally aligned labeled datasets. We present CultureGuard, a novel solution for curating culturally aligned, high-quality safety datasets across multiple languages. Our approach introduces a four-stage synthetic data generation and filtering pipeline: cultural data segregation, cultural data adaptation, machine translation, and quality filtering. This pipeline enables the conversion and expansion of the Nemotron-Content-Safety-Dataset-V2 English safety dataset into eight distinct languages: Arabic, German, Spanish, French, Hindi, Japanese, Thai, and Chinese. The resulting dataset, Nemotron-Safety-Guard-Dataset-v3, comprises 386,661 samples in 9 languages and facilitates the training of Llama-3.1-Nemotron-Safety-Guard-8B-v3 via LoRA-based fine-tuning. The final model achieves state-of-the-art performance on several multilingual content safety benchmarks. Furthermore, we show our moderately multilingual fine-tuning enables robust cross-lingual transfer and strong zero-shot generalization to unseen languages. We also benchmark the latest open LLMs on multilingual safety and observe that these LLMs are more prone to give unsafe responses when prompted in non-English languages. This work advances multilingual LLM safety by enabling the development of culturally aware safety guard models.

[153] Evaluating, Synthesizing, and Enhancing for Customer Support Conversation

Jie Zhu, Huaixia Dou, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong

Main category: cs.CL

TL;DR: The paper introduces Customer Support Conversation (CSC) task and framework based on COPC guidelines, creates CSConv evaluation dataset and RoleCS training dataset using LLMs, and shows fine-tuning LLMs on RoleCS improves strategy-aligned response generation.

Details

Motivation: Existing dialogue datasets lack strategic guidance for customer support, and real-world service data is difficult to access and annotate, creating a need for structured training frameworks.

Method: Proposed structured CSC framework with 5 conversational stages and 12 strategies based on COPC guidelines. Used LLMs to rewrite real conversations into CSConv dataset and created RoleCS training dataset through role-playing simulations.

Result: Fine-tuning strong LLMs on RoleCS significantly improved their ability to generate high-quality, strategy-aligned responses on CSConv evaluation dataset. Human evaluations confirmed gains in problem resolution.

Conclusion: The CSC framework and datasets effectively train customer service agents and LLMs to generate structured, empathetic responses using professional support strategies, with demonstrated improvements in response quality and problem resolution.

Abstract: Effective customer support requires not only accurate problem solving but also structured and empathetic communication aligned with professional standards. However, existing dialogue datasets often lack strategic guidance, and real-world service data is difficult to access and annotate. To address this, we introduce the task of Customer Support Conversation (CSC), aimed at training customer service agents to respond using well-defined support strategies. We propose a structured CSC framework grounded in COPC guidelines, defining five conversational stages and twelve strategies to guide high-quality interactions. Based on this, we construct CSConv, an evaluation dataset of 1,855 real-world customer-agent conversations rewritten using LLMs to reflect deliberate strategy use, and annotated accordingly. Additionally, we develop a role-playing approach that simulates strategy-rich conversations using LLM-powered roles aligned with the CSC framework, resulting in the training dataset RoleCS. Experiments show that fine-tuning strong LLMs on RoleCS significantly improves their ability to generate high-quality, strategy-aligned responses on CSConv. Human evaluations further confirm gains in problem resolution. All code and data will be made publicly available at https://github.com/aliyun/qwen-dianjin.

[154] BEE-RAG: Balanced Entropy Engineering for Retrieval-Augmented Generation

Yuhao Wang, Ruiyang Ren, Yucheng Wang, Jing Liu, Wayne Xin Zhao, Hua Wu, Haifeng Wang

Main category: cs.CL

TL;DR: BEE-RAG is a framework that addresses performance issues in retrieval-augmented generation by controlling entropy growth and attention dilution in long retrieval contexts through entropy invariance principles.

Details

Motivation: Retrieval-augmented generation (RAG) suffers from performance degradation due to unconstrained entropy growth and attention dilution when handling large volumes of retrieved information with long context lengths.

Method: Proposes BEE-RAG framework that uses entropy invariance to separate attention sensitivity from context length, includes zero-shot inference for multi-importance estimation, and parameter-efficient adaptive fine-tuning to find optimal balancing factors.

Result: Extensive experiments across multiple RAG tasks demonstrate the effectiveness of BEE-RAG in improving RAG system performance.

Conclusion: BEE-RAG successfully improves RAG adaptability to varying context lengths through balanced entropy engineering, providing stable performance across different settings.

Abstract: With the rapid advancement of large language models (LLMs), retrieval-augmented generation (RAG) has emerged as a critical approach to supplement the inherent knowledge limitations of LLMs. However, due to the typically large volume of retrieved information, RAG tends to operate with long context lengths. From the perspective of entropy engineering, we identify unconstrained entropy growth and attention dilution due to long retrieval context as significant factors affecting RAG performance. In this paper, we propose the balanced entropy-engineered RAG (BEE-RAG) framework, which improves the adaptability of RAG systems to varying context lengths through the principle of entropy invariance. By leveraging balanced context entropy to reformulate attention dynamics, BEE-RAG separates attention sensitivity from context length, ensuring a stable entropy level. Building upon this, we introduce a zero-shot inference strategy for multi-importance estimation and a parameter-efficient adaptive fine-tuning mechanism to obtain the optimal balancing factor for different settings. Extensive experiments across multiple RAG tasks demonstrate the effectiveness of BEE-RAG.

[155] Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations

Li-Chun Lu, Miri Liu, Pin-Chun Lu, Yufei Tian, Shao-Hua Sun, Nanyun Peng

Main category: cs.CL

TL;DR: Systematic comparison of four creativity metrics shows limited consistency across domains, with each capturing different aspects of creativity and exhibiting various limitations.

Details

Motivation: To systematically examine and compare different creativity evaluation measures across diverse creative domains to understand their consistency and limitations.

Method: Analyzed and compared four representative creativity measures: creativity index, perplexity, syntactic templates, and LLM-as-a-Judge across creative writing, unconventional problem-solving, and research ideation domains.

Result: Metrics exhibit limited consistency and capture different dimensions of creativity. Creativity index focuses on lexical diversity, perplexity is sensitive to model confidence, syntactic templates miss conceptual creativity, and LLM-as-a-Judge shows instability and bias.

Conclusion: Current creativity evaluation frameworks have significant limitations and need more robust, generalizable approaches that better align with human judgments of creativity.

Abstract: We systematically examine, analyze, and compare representative creativity measures–creativity index, perplexity, syntactic templates, and LLM-as-a-Judge–across diverse creative domains, including creative writing, unconventional problem-solving, and research ideation. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity. We highlight key limitations, including the creativity index’s focus on lexical diversity, perplexity’s sensitivity to model confidence, and syntactic templates’ inability to capture conceptual creativity. Additionally, LLM-as-a-Judge shows instability and bias. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.

[156] Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment

Saketh Reddy Vemula, Sandipan Dandapat, Dipti Misra Sharma, Parameswari Krishnamurthy

Main category: cs.CL

TL;DR: Tokenizers’ impact on downstream performance in morphologically complex languages remains unclear. This study evaluates BPE vs Unigram tokenizers for Telugu, Hindi, and English, finding Unigram consistently outperforms BPE and morphological alignment has secondary importance.

Details

Motivation: To understand the relationship between tokenizer algorithms (BPE vs Unigram), morphological alignment, tokenization quality, and downstream performance in morphologically complex languages like Telugu.

Method: Comprehensive evaluation using small BERT models from pre-training to fine-tuning for Telugu (agglutinative), with preliminary evaluation in Hindi and English. Created gold morpheme segmentation dataset for Telugu with 600 derivational and 7000 inflectional word forms.

Result: 1) Tokenizer algorithm choice is most significant - Unigram consistently outperforms BPE. 2) Morphological alignment shows moderate positive correlation but secondary importance. 3) Hybrid approaches with morphological pre-segmentation boost BPE performance but not Unigram.

Conclusion: Tokenizer algorithm selection is more critical than morphological alignment for downstream performance. Need for better intrinsic evaluation metrics to explain downstream performance trends consistently.

Abstract: The relationship between tokenizer algorithm (e.g., Byte-Pair Encoding (BPE), Unigram), morphological alignment, tokenization quality (e.g., compression efficiency), and downstream performance remains largely unclear, particularly for languages with complex morphology. In this paper, we conduct a comprehensive evaluation of tokenizers using small-sized BERT models – from pre-training through fine-tuning – for Telugu (agglutinative), along with preliminary evaluation in Hindi (primarily fusional with some agglutination) and English (fusional). To evaluate morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal two key findings for Telugu. First, the choice of tokenizer algorithm is the most significant factor influencing performance, with Unigram-based tokenizers consistently outperforming BPE across most settings. Second, while better morphological alignment shows a moderate, positive correlation with performance on text classification and structure prediction tasks, its impact is secondary to the tokenizer algorithm. Notably, hybrid approaches that use morphological information for pre-segmentation significantly boost the performance of BPE, though not Unigram. Our results further showcase the need for comprehensive intrinsic evaluation metrics for tokenizers that could explain downstream performance trends consistently.

[157] Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages

Imalsha Puranegedara, Themira Chathumina, Nisal Ranathunga, Nisansa de Silva, Surangika Ranathunga, Mokanarangan Thayaparan

Main category: cs.CL

TL;DR: A novel architecture that fuses all intermediate layers of multilingual encoders (not just the final layer) to enrich linguistic information for LLMs, using Global Softmax and Transformer Softmax weighting strategies. Trained only on English data, it significantly improves performance on low-resource languages.

Details

Motivation: LLMs perform poorly on low-resource languages due to English-centric training. Existing methods like LangBridge only use the final encoder layer, missing rich linguistic information from intermediate layers.

Method: Propose two fusion strategies: Global Softmax for overall layer importance and Transformer Softmax for token-specific weights. All intermediate layers are fused and mapped to LLM’s embedding space. Model trained only on English data without parallel/multilingual data.

Result: Significant improvements on low-resource languages: Sinhala classification accuracy increased from 71.66% to 75.86%, clear gains in Indic languages (Tamil, Bengali, Malayalam), and overall XNLI accuracy improved from 70.36% to 71.50%. Outperforms LangBridge baseline.

Conclusion: This approach provides a scalable, data-efficient path toward more capable and equitable multilingual LLMs by leveraging full encoder layer information without requiring multilingual training data.

Abstract: Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. While methods like LangBridge align LLMs with multilingual encoders such as the Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically use only the final encoder layer. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. Our approach features two strategies: (1) a Global Softmax weighting for overall layer importance, and (2) a Transformer Softmax model that learns token-specific weights. The fused representations are mapped into the LLM’s embedding space, enabling it to process multilingual inputs. The model is trained only on English data, without using any parallel or multilingual data. Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, our Transformer Softmax model significantly outperforms the LangBridge baseline. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam. These specific gains contribute to an overall boost in average XNLI accuracy from 70.36% to 71.50%. This approach offers a scalable, data-efficient path toward more capable and equitable multilingual LLMs.

[158] SinLlama – A Large Language Model for Sinhala

H. W. K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Surangika Ranathunga, Rishemjit Kaur

Main category: cs.CL

TL;DR: Extended Llama-3-8B with Sinhala vocabulary and continual pre-training to create SinLlama, the first decoder-based open-source LLM with explicit Sinhala support, achieving significant performance improvements in text classification tasks.

Details

Motivation: Low-resource languages like Sinhala are often overlooked by open-source LLMs, creating a need for specialized language support.

Method: Enhanced Llama-3-8B tokenizer with Sinhala vocabulary and performed continual pre-training on a cleaned 10 million Sinhala corpus.

Result: SinLlama outperformed base and instruct variants of Llama-3-8B by a significant margin in text classification tasks after instruction fine-tuning.

Conclusion: Successfully created the first decoder-based open-source LLM with explicit Sinhala support, demonstrating effective adaptation of multilingual models for low-resource languages.

Abstract: Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). In this research, we extend an existing multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the LLM tokenizer with Sinhala specific vocabulary and perform continual pre-training on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This is the very first decoder-based open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed base and instruct variants of Llama-3-8B by a significant margin.

[159] LLMCARE: early detection of cognitive impairment via transformer models enhanced by LLM-generated synthetic data

Ali Zolnour, Hossein Azadmaleki, Yasaman Haghbin, Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sina Rashidi, Masoud Khani, AmirSajjad Taleban, Samin Mahdizadeh Sani, Maryam Dadkhah, James M. Noble, Suzanne Bakken, Yadollah Yaghoobzadeh, Abdol-Hossein Vahabie, Masoud Rouhizadeh, Maryam Zolnoori

Main category: cs.CL

TL;DR: This paper develops a speech-based screening pipeline for Alzheimer’s disease detection using transformer embeddings combined with linguistic features and LLM-based synthetic data augmentation, achieving strong performance on ADReSSo dataset (F1=83.3) and validating on an independent MCI cohort.

Details

Motivation: Over half of Alzheimer's disease and related dementias cases remain undiagnosed, creating need for scalable early detection methods using speech-based natural language processing to identify subtle linguistic markers before clinical diagnosis.

Method: Developed multimodal pipeline integrating transformer embeddings with 110 handcrafted linguistic features, tested 10 transformer models with three fine-tuning strategies, used 5 LLMs for synthetic data augmentation, and evaluated multimodal LLMs in zero-shot and fine-tuned modes on ADReSSo and Delaware datasets.

Result: Fusion model achieved F1=83.3 (AUC=89.5) on ADReSSo, outperforming baselines. MedAlpaca7B augmentation improved F1 to 85.7. Fine-tuning boosted unimodal LLMs significantly (MedAlpaca7B F1=47.7→78.7). On Delaware MCI cohort, fusion plus augmentation achieved F1=72.8 (AUC=69.6).

Conclusion: Combining transformer and linguistic features enhances ADRD detection. LLM-based augmentation improves data efficiency with diminishing returns, while current multimodal models show limitations. Pipeline shows potential for scalable early screening validated on independent MCI cohort.

Abstract: Alzheimer’s disease and related dementias(ADRD) affect nearly five million older adults in the United States, yet more than half remain undiagnosed. Speech-based natural language processing(NLP) offers a scalable approach for detecting early cognitive decline through subtle linguistic markers that may precede clinical diagnosis. This study develops and evaluates a speech-based screening pipeline integrating transformer embeddings with handcrafted linguistic features, synthetic augmentation using large language models(LLMs), and benchmarking of unimodal and multimodal classifiers. External validation assessed generalizability to a MCI-only cohort. Transcripts were drawn from the ADReSSo 2021 benchmark dataset(n=237, Pitt Corpus) and the DementiaBank Delaware corpus(n=205, MCI vs. controls). Ten transformer models were tested under three fine-tuning strategies. A late-fusion model combined embeddings from the top transformer with 110 linguistic features. Five LLMs(LLaMA8B/70B, MedAlpaca7B, Ministral8B,GPT-4o) generated label-conditioned synthetic speech for augmentation, and three multimodal LLMs(GPT-4o,Qwen-Omni,Phi-4) were evaluated in zero-shot and fine-tuned modes. On ADReSSo, the fusion model achieved F1=83.3(AUC=89.5), outperforming transformer-only and linguistic baselines. MedAlpaca7B augmentation(2x) improved F1=85.7, though larger scales reduced gains. Fine-tuning boosted unimodal LLMs(MedAlpaca7B F1=47.7=>78.7), while multimodal models performed lower (Phi-4=71.6;GPT-4o=67.6). On Delaware, the fusion plus 1x MedAlpaca7B model achieved F1=72.8(AUC=69.6). Integrating transformer and linguistic features enhances ADRD detection. LLM-based augmentation improves data efficiency but yields diminishing returns, while current multimodal models remain limited. Validation on an independent MCI cohort supports the pipeline’s potential for scalable, clinically relevant early screening.

[160] ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning

Juyuan Wang, Rongchen Zhao, Wei Wei, Yufeng Wang, Mo Yu, Jie Zhou, Jin Xu, Liyan Xu

Main category: cs.CL

TL;DR: ComoRAG is a dynamic, iterative retrieval framework for long narrative comprehension that outperforms traditional RAG methods by 11% on benchmarks through stateful reasoning cycles with memory integration.

Details

Motivation: Traditional RAG methods fail to capture dynamic, interconnected relations in long narratives due to their stateless, single-step retrieval process, while LLMs struggle with extended context reasoning at high computational cost.

Method: ComoRAG uses iterative reasoning cycles with a dynamic memory workspace. When encountering reasoning impasses, it generates probing queries, retrieves new evidence, and integrates it into a global memory pool to build coherent context.

Result: Outperforms strong RAG baselines with consistent relative gains up to 11% across four challenging long-context narrative benchmarks (200K+ tokens), particularly excelling at complex queries requiring global context comprehension.

Conclusion: ComoRAG offers a principled, cognitively motivated paradigm for retrieval-based stateful reasoning, demonstrating that narrative reasoning is a dynamic interplay between evidence acquisition and knowledge consolidation.

Abstract: Narrative comprehension on long stories and novels has been a challenging domain attributed to their intricate plotlines and entangled, often evolving relations among characters and entities. Given the LLM’s diminished reasoning over extended context and its high computational cost, retrieval-based approaches remain a pivotal role in practice. However, traditional RAG methods could fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition on reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for the query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines with consistent relative gains up to 11% compared to the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global context comprehension, offering a principled, cognitively motivated paradigm towards retrieval-based stateful reasoning. Our framework is made publicly available at https://github.com/EternityJune25/ComoRAG.

[161] EMNLP: Educator-role Moral and Normative Large Language Models Profiling

Yilin Jiang, Mingzi Zhang, Sheng Jin, Zengyi Yu, Xiangjie Kong, Binghao Tu

Main category: cs.CL

TL;DR: EMNLP is a framework for evaluating teacher-role LLMs’ personality, moral development, and ethical vulnerability to soft prompt injection, revealing that teacher-role LLMs have more polarized personalities than humans and show a paradox where stronger reasoning models are more vulnerable to harmful prompts.

Details

Motivation: There is a lack of comprehensive psychological and ethical evaluation for LLMs simulating professional roles, particularly in educational contexts where teacher-role LLMs need assessment for ethical and psychological alignment.

Method: Developed EMNLP framework with 88 teacher-specific moral dilemmas, extended existing scales, and created targeted soft prompt injection sets to evaluate compliance and vulnerability in teacher-role LLMs across 14 different models.

Result: Teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel in abstract moral reasoning but struggle with emotionally complex situations, and models with stronger reasoning are paradoxically more vulnerable to harmful prompt injection.

Conclusion: EMNLP provides the first benchmark for assessing ethical and psychological alignment of teacher-role LLMs, revealing important safety concerns where increased capability correlates with increased vulnerability to manipulation.

Abstract: Simulating Professions (SP) enables Large Language Models (LLMs) to emulate professional roles. However, comprehensive psychological and ethical evaluation in these contexts remains lacking. This paper introduces EMNLP, an Educator-role Moral and Normative LLMs Profiling framework for personality profiling, moral development stage measurement, and ethical risk under soft prompt injection. EMNLP extends existing scales and constructs 88 teacher-specific moral dilemmas, enabling profession-oriented comparison with human teachers. A targeted soft prompt injection set evaluates compliance and vulnerability in teacher SP. Experiments on 14 LLMs show teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel in abstract moral reasoning, but struggle with emotionally complex situations. Models with stronger reasoning are more vulnerable to harmful prompt injection, revealing a paradox between capability and safety. The model temperature and other hyperparameters have limited influence except in some risk behaviors. This paper presents the first benchmark to assess ethical and psychological alignment of teacher-role LLMs for educational AI. Resources are available at https://e-m-n-l-p.github.io/.

[162] GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Haohan Wang

Main category: cs.CL

TL;DR: GUARD is a testing method that translates government AI ethics guidelines into actionable test questions to assess LLM compliance, including jailbreak diagnostics to identify potential safety bypass scenarios.

Details

Motivation: Address the gap between high-level government AI ethics guidelines and actionable testing methods for verifying LLM compliance with these guidelines.

Method: Automated generation of guideline-violating questions based on government guidelines, with jailbreak diagnostics (GUARD-JD) to create scenarios that provoke unethical responses and test safety mechanisms.

Result: Empirically validated on seven LLMs including Vicuna-13B, GPT-4, Claude-3.7, showing compliance testing under three government guidelines and successful jailbreak diagnostics that can transfer to vision-language models.

Conclusion: GUARD provides an effective framework for operationalizing AI ethics guidelines into practical testing, helping promote reliable LLM-based applications through comprehensive compliance assessment.

Abstract: As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (\textbf{G}uideline \textbf{U}pholding Test through \textbf{A}daptive \textbf{R}ole-play and Jailbreak \textbf{D}iagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of ``jailbreaks’’ to diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method finally culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We have empirically validated the effectiveness of GUARD on seven LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.

[163] SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement

Yuan Ge, Junxiang Zhang, Xiaoqian Liu, Bei Li, Xiangnan Ma, Chenglong Wang, Kaiyang Ye, Yangfan Du, Linfeng Zhang, Yuxin Huang, Tong Xiao, Zhengtao Yu, JingBo Zhu

Main category: cs.CL

TL;DR: SageLM is an end-to-end, multi-aspect, explainable speech LLM for evaluating Speech-to-Speech models that jointly assesses semantic and acoustic dimensions with rationale-based supervision.

Details

Motivation: Evaluating Speech-to-Speech Large Language Models remains challenging, and existing cascaded approaches often disregard acoustic features, creating a need for comprehensive evaluation methods.

Method: Joint assessment of semantic and acoustic dimensions using rationale-based supervision, creation of SpeechFeedback synthetic preference dataset, and two-stage training paradigm to address speech preference data scarcity.

Result: SageLM achieves 82.79% agreement rate with human evaluators, outperforming cascaded baselines by 7.42% and SLM-based baselines by 26.20%.

Conclusion: SageLM provides an effective framework for comprehensive Speech-to-Speech LLM evaluation that significantly outperforms existing approaches through joint semantic-acoustic assessment and explainable rationale-based supervision.

Abstract: Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose \texttt{SageLM}, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive S2S LLMs evaluation. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce \textit{SpeechFeedback}, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.

[164] Normality and the Turing Test

Alexandre Kabbach

Main category: cs.CL

TL;DR: The paper reinterprets the Turing test through the lens of normality, arguing it tests normal intelligence through statistical aggregation of multiple judges’ assessments, and concludes that current LLMs like ChatGPT target exceptional intelligence rather than normal human intelligence.

Details

Motivation: To revisit and reinterpret the Turing test by focusing on the concept of normality, challenging traditional understandings of what the test actually measures and how it should be evaluated.

Method: Conceptual analysis of the Turing test framework, examining its statistical nature and the role of multiple judges in determining intelligence through normalized aggregate judgments.

Result: The analysis reveals that the Turing test objectivizes normative ideals of normal behavior rather than actual normal behavior, and that current large language models target exceptional intelligence rather than the normal intelligence the test was designed to evaluate.

Conclusion: LLMs like ChatGPT are unlikely to pass the Turing test as they model artificial smartness (exceptional intelligence) rather than artificial intelligence (normal human intelligence), and the test’s configuration fails to objectively capture normal human behavior.

Abstract: This paper proposes to revisit the Turing test through the concept of normality. Its core argument is that the Turing test is a test of normal intelligence as assessed by a normal judge. First, in the sense that the Turing test targets normal/average rather than exceptional human intelligence, so that successfully passing the test requires machines to “make mistakes” and display imperfect behavior just like normal/average humans. Second, in the sense that the Turing test is a statistical test where judgments of intelligence are never carried out by a single “average” judge (understood as non-expert) but always by a full jury. As such, the notion of “average human interrogator” that Turing talks about in his original paper should be understood primarily as referring to a mathematical abstraction made of the normalized aggregate of individual judgments of multiple judges. Its conclusions are twofold. First, it argues that large language models such as ChatGPT are unlikely to pass the Turing test as those models precisely target exceptional rather than normal/average human intelligence. As such, they constitute models of what it proposes to call artificial smartness rather than artificial intelligence, insofar as they deviate from the original goal of Turing for the modeling of artificial minds. Second, it argues that the objectivization of normal human behavior in the Turing test fails due to the game configuration of the test which ends up objectivizing normative ideals of normal behavior rather than normal behavior per se.

[165] Latent Traits and Cross-Task Transfer: Deconstructing Dataset Interactions in LLM Fine-tuning

Shambhavi Krishna, Atharva Naik, Chaitali Agarwal, Sudharshan Govindan, Taesung Lee, Haw-Shiuan Chang

Main category: cs.CL

TL;DR: Analysis framework reveals hidden statistical factors (class distribution, generation length) and linguistic features are more influential than surface-level similarity in LLM transfer learning.

Details

Motivation: Practical need to understand cross-task interactions in LLM deployment when high-quality training data for all tasks is infeasible and out-of-distribution requests are common.

Method: Built transfer learning matrix with dimensionality reduction, trained 10 models to identify latent abilities (Reasoning, Sentiment Classification, NLU, Arithmetic) and analyze cross-task interactions.

Result: Performance improvements defy explanations based on surface-level dataset similarity or source data quality; hidden statistical factors and linguistic features are more influential.

Conclusion: Provides insights into complex transfer learning dynamics, enabling more predictable and effective LLM adaptation by focusing on underlying statistical and linguistic factors.

Abstract: Large language models are increasingly deployed across diverse applications. This often includes tasks LLMs have not encountered during training. This implies that enumerating and obtaining the high-quality training data for all tasks is infeasible. Thus, we often need to rely on transfer learning using datasets with different characteristics, and anticipate out-of-distribution requests. Motivated by this practical need, we propose an analysis framework, building a transfer learning matrix and dimensionality reduction, to dissect these cross-task interactions. We train and analyze 10 models to identify latent abilities (e.g., Reasoning, Sentiment Classification, NLU, Arithmetic) and discover the side effects of the transfer learning. Our findings reveal that performance improvements often defy explanations based on surface-level dataset similarity or source data quality. Instead, hidden statistical factors of the source dataset, such as class distribution and generation length proclivities, alongside specific linguistic features, are actually more influential. This work offers insights into the complex dynamics of transfer learning, paving the way for more predictable and effective LLM adaptation.

[166] Robustness of Neurosymbolic Reasoners on First-Order Logic Problems

Hannah Bansal, Kemal Kurniawan, Lea Frermann

Main category: cs.CL

TL;DR: Neurosymbolic approaches combining LLMs with symbolic solvers improve robustness on counterfactual logic tasks but underperform standard neural methods, and even when combined with Chain-of-Thought prompting still lag behind pure neural approaches.

Details

Motivation: To address LLMs' brittleness on counterfactual variations of first-order logic problems, where models rely on spurious patterns rather than true logical reasoning.

Method: Proposed neurosymbolic approach integrating LLMs with symbolic logical solvers, and later combined with Chain-of-Thought prompting (NSCoT).

Result: Neurosymbolic methods are more robust to counterfactual variations but perform worse overall than purely neural methods. NSCoT improves performance but still lags behind standard CoT.

Conclusion: While neurosymbolic approaches enhance robustness, they don’t match the performance of standard neural methods, opening research directions for future work.

Abstract: Recent trends in NLP aim to improve reasoning capabilities in Large Language Models (LLMs), with key focus on generalization and robustness to variations in tasks. Counterfactual task variants introduce minimal but semantically meaningful changes to otherwise valid first-order logic (FOL) problem instances altering a single predicate or swapping roles of constants to probe whether a reasoning system can maintain logical consistency under perturbation. Previous studies showed that LLMs becomes brittle on counterfactual variations, suggesting that they often rely on spurious surface patterns to generate responses. In this work, we explore if a neurosymbolic (NS) approach that integrates an LLM and a symbolic logical solver could mitigate this problem. Experiments across LLMs of varying sizes show that NS methods are more robust but perform worse overall that purely neural methods. We then propose NSCoT that combines an NS method and Chain-of-Thought (CoT) prompting and demonstrate that while it improves performance, NSCoT still lags behind standard CoT. Our analysis opens research directions for future work.

[167] EditGRPO: Reinforcement Learning with Post-Rollout Edits for Clinically Accurate Chest X-Ray Report Generation

Kai Zhang, Christopher Malon, Lichao Sun, Martin Renqiang Min

Main category: cs.CL

TL;DR: EditGRPO is a mixed-policy reinforcement learning algorithm that optimizes radiology report generation using clinically motivated rewards, outperforming supervised fine-tuning and vanilla GRPO baselines.

Details

Motivation: Current multimodal large language models for radiology report generation use supervised fine-tuning objectives that are not explicitly aligned with clinical efficacy, limiting their practical utility.

Method: EditGRPO integrates on-policy exploration with off-policy guidance by injecting sentence-level detailed corrections during training rollouts, addressing exploration and sampling efficiency issues in RL.

Result: Applied to Qwen2.5-VL-3B, EditGRPO achieved 3.4% average improvement in clinical metrics across four datasets and 5.9% average performance gain on unseen datasets, demonstrating superior out-of-domain generalization.

Conclusion: EditGRPO effectively optimizes radiology report generation through clinically aligned rewards and mixed-policy reinforcement learning, showing significant improvements over existing methods.

Abstract: Radiology report generation requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. Although recent innovations, particularly multimodal large language models, have shown improved performance, their supervised fine-tuning (SFT) objective is not explicitly aligned with clinical efficacy. In this work, we introduce EditGRPO, a mixed-policy reinforcement learning algorithm designed specifically to optimize the generation through clinically motivated rewards. EditGRPO integrates on-policy exploration with off-policy guidance by injecting sentence-level detailed corrections during training rollouts. This mixed-policy approach addresses the exploration dilemma and sampling efficiency issues typically encountered in RL. Applied to a Qwen2.5-VL-3B, EditGRPO outperforms both SFT and vanilla GRPO baselines, achieving an average improvement of 3.4% in clinical metrics across four major datasets. Notably, EditGRPO also demonstrates superior out-of-domain generalization, with an average performance gain of 5.9% on unseen datasets.

[168] LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora

Luyao Zhuang, Shengyuan Chen, Yilin Xiao, Huachi Zhou, Yujing Zhang, Hao Chen, Qinggang Zhang, Xiao Huang

Main category: cs.CL

TL;DR: LinearRAG is an efficient graph-based RAG framework that constructs relation-free hierarchical graphs using lightweight entity extraction, enabling linear scaling and precise retrieval without costly relation extraction.

Details

Motivation: Traditional RAG systems struggle with large-scale unstructured corpora where information is fragmented, and existing graph-based RAG methods rely on unstable and costly relation extraction that produces noisy graphs.

Method: LinearRAG constructs a relation-free hierarchical graph (Tri-Graph) using only lightweight entity extraction and semantic linking, then uses a two-stage retrieval strategy: entity activation via local semantic bridging followed by passage retrieval through global importance aggregation.

Result: Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models.

Conclusion: LinearRAG provides an economical and reliable indexing solution that scales linearly with corpus size and avoids the instability of relation modeling, offering improved performance for complex retrieval tasks.

Abstract: Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation. Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models. Our code and datasets are available at https://github.com/DEEP-PolyU/LinearRAG.

[169] Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations

Johannes Moll, Markus Graf, Tristan Lemke, Nicolas Lenhart, Daniel Truhn, Jean-Benoit Delbrouck, Jiazhen Pan, Daniel Rueckert, Lisa C. Adams, Keno K. Bressem

Main category: cs.CL

TL;DR: A framework to evaluate faithfulness of chain-of-thought explanations in vision-language models for chest X-ray VQA, showing misalignment between plausible-sounding explanations and actual reasoning processes.

Details

Motivation: VLMs often produce plausible but unfaithful CoT explanations that undermine trust in clinical applications, and existing evaluations fail to catch this misalignment.

Method: Clinically grounded framework using controlled text and image modifications across three axes: clinical fidelity, causal attribution, and confidence calibration, evaluated via reader study with radiologists.

Result: Answer accuracy and explanation quality can be decoupled; proprietary models outperform open-source on attribution (25.0% vs 1.4%) and fidelity (36.1% vs 31.7%); text cues influence explanations more than visual cues.

Conclusion: There are significant deployment risks as models can generate plausible but unfaithful explanations, highlighting the need to evaluate beyond final answer accuracy in clinical settings.

Abstract: Vision-language models (VLMs) often produce chain-of-thought (CoT) explanations that sound plausible yet fail to reflect the underlying decision process, undermining trust in high-stakes clinical use. Existing evaluations rarely catch this misalignment, prioritizing answer accuracy or adherence to formats. We present a clinically grounded framework for chest X-ray visual question answering (VQA) that probes CoT faithfulness via controlled text and image modifications across three axes: clinical fidelity, causal attribution, and confidence calibration. In a reader study (n=4), evaluator-radiologist correlations fall within the observed inter-radiologist range for all axes, with strong alignment for attribution (Kendall’s $τ_b=0.670$), moderate alignment for fidelity ($τ_b=0.387$), and weak alignment for confidence tone ($τ_b=0.091$), which we report with caution. Benchmarking six VLMs shows that answer accuracy and explanation quality can be decoupled, acknowledging injected cues does not ensure grounding, and text cues shift explanations more than visual cues. While some open-source models match final answer accuracy, proprietary models score higher on attribution (25.0% vs. 1.4%) and often on fidelity (36.1% vs. 31.7%), highlighting deployment risks and the need to evaluate beyond final answer accuracy.

[170] OPLoRA: Orthogonal Projection LoRA Prevents Catastrophic Forgetting during Parameter-Efficient Fine-Tuning

Yifeng Xiong, Xiaohui Xie

Main category: cs.CL

TL;DR: OPLoRA prevents catastrophic forgetting in LoRA fine-tuning by using orthogonal projections to constrain updates away from dominant singular directions that encode pre-trained knowledge.

Details

Motivation: LoRA suffers from catastrophic forgetting when updates interfere with essential pre-trained knowledge encoded in dominant singular directions of the weight matrices.

Method: Decompose frozen weights via SVD and constrain LoRA updates to lie within the orthogonal complement of top-k singular subspace using double-sided orthogonal projections P_L = I - U_k U_k^T and P_R = I - V_k V_k^T.

Result: OPLoRA significantly reduces forgetting while maintaining competitive task-specific performance across commonsense reasoning, mathematics, and code generation on LLaMA-2 7B and Qwen2.5 7B.

Conclusion: Orthogonal projection is an effective mechanism for knowledge preservation in parameter-efficient fine-tuning, with mathematical guarantees for preserving top-k singular triples.

Abstract: Low-Rank Adaptation (LoRA) enables efficient fine-tuning of large language models but suffers from catastrophic forgetting when learned updates interfere with the dominant singular directions that encode essential pre-trained knowledge. We propose Orthogonal Projection LoRA (OPLoRA), a theoretically grounded approach that prevents this interference through double-sided orthogonal projections. By decomposing frozen weights via SVD, OPLoRA constrains LoRA updates to lie entirely within the orthogonal complement of the top-$k$ singular subspace using projections $P_L = I - U_k U_k^\top$ and $P_R = I - V_k V_k^\top$. We prove that this construction exactly preserves the top-$k$ singular triples, providing mathematical guarantees for knowledge retention. To quantify subspace interference, we introduce $ρ_k$, a metric measuring update alignment with dominant directions. Extensive experiments across commonsense reasoning, mathematics, and code generation demonstrate that OPLoRA significantly reduces forgetting while maintaining competitive task-specific performance on LLaMA-2 7B and Qwen2.5 7B, establishing orthogonal projection as an effective mechanism for knowledge preservation in parameter-efficient fine-tuning.

[171] Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons

Giovanni Monea, Yair Feldman, Shankar Padmanabhan, Kianté Brantley, Yoav Artzi

Main category: cs.CL

TL;DR: Proposes a method to compress the KV cache in large language models for long-context reasoning by periodically replacing past tokens with learned special-purpose tokens, reducing memory and computational costs.

Details

Motivation: The linear growth of Transformer key-value cache in large language models for long-context reasoning causes significant memory and computational costs, limiting scalability.

Method: Periodically compress generation KV cache using learned special-purpose tokens and evict compressed entries, trained via modified joint distillation and reinforcement learning framework.

Result: Achieves superior memory-accuracy Pareto frontier compared to models without cache compression and training-free compression techniques.

Conclusion: The proposed KV cache compression method effectively reduces memory and computational overhead while maintaining accuracy in long-context reasoning tasks.

Abstract: The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method minimizes overhead over the conventional RL process, as it leverages RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.

[172] Meronymic Ontology Extraction via Large Language Models

Dekai Zhang, Simone Conia, Antonio Rago

Main category: cs.CL

TL;DR: Automated extraction of product ontologies from review texts using LLMs, outperforming BERT-based baseline.

Details

Motivation: Manual ontology construction is time-consuming and expensive; need for automated methods to organize unstructured text data, especially in domains like e-commerce with numerous product listings.

Method: Developed a fully-automated method using large language models (LLMs) to extract product ontologies in the form of meronymies from raw review texts.

Result: The LLM-based method produced ontologies that surpassed an existing BERT-based baseline when evaluated using an LLM-as-a-judge approach.

Conclusion: This work establishes groundwork for using LLMs more generally in ontology extraction tasks, both for products and other domains.

Abstract: Ontologies have become essential in today’s digital age as a way of organising the vast amount of readily available unstructured text. In providing formal structure to this information, ontologies have immense value and application across various domains, e.g., e-commerce, where countless product listings necessitate proper product organisation. However, the manual construction of these ontologies is a time-consuming, expensive and laborious process. In this paper, we harness the recent advancements in large language models (LLMs) to develop a fully-automated method of extracting product ontologies, in the form of meronymies, from raw review texts. We demonstrate that the ontologies produced by our method surpass an existing, BERT-based baseline when evaluating using an LLM-as-a-judge. Our investigation provides the groundwork for LLMs to be used more generally in (product or otherwise) ontology extraction.

[173] DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar

Main category: cs.CL

TL;DR: DynaSpec introduces a context-dependent dynamic shortlisting mechanism for speculative decoding that uses lightweight meta-classifiers to route contexts to token clusters, improving draft efficiency while maintaining full-vocabulary verification.

Details

Motivation: Current fixed-vocabulary shortlisting methods in speculative decoding are brittle due to corpus dependency and suppression of rare tokens, creating bottlenecks as LLM vocabularies scale up.

Method: Uses lightweight meta-classifiers to route contexts to token clusters, with the union of top-k clusters forming the drafter’s shortlist. Leverages parallel execution of draft encoding and meta shortlisting on separate streams.

Result: Achieves 98.2% of full-vocabulary performance for Llama-3-8B, compared to 84.4% for fixed-shortlist baselines. Generates up to 2.18x more tokens vs 1.91x for fixed-vocabulary approaches.

Conclusion: DynaSpec provides robust, context-dependent dynamic shortlisting that speeds up drafting while generalizing across diverse tasks, overcoming limitations of static frequency-based approaches.

Abstract: Speculative decoding has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens and a large target model verifies them once per speculation length. Recently, scaling of the LLM vocabulary has pushed the number of tokens to grow substantially. While verification over the full vocabulary leaves the target model largely unaffected, the O(|V|d) parameters in the drafter’s output head become a latency bottleneck, slowing the entire pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the drafter’s vocabulary to a fixed top frequent subset of the target model’s vocabulary. Although this reduces draft-time compute, it is brittle, since: (i) frequency lists are corpus-dependent and require retuning to generalize, and (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of tokens per verification step. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. Concretely, we introduce lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the top-k selected clusters forms the drafter’s shortlist, while verification retains the full vocabulary and exactness. The meta-classifier finishes its computation earlier than the drafter’s hidden state generation by exploiting parallel execution of draft encoding and meta shortlisting on separate streams. Across standard speculative decoding benchmarks, DynaSpec delivers consistent improvements in mean accepted length, for Llama-3-8B, reaching upto 98.2% of full-vocabulary performance, while fixed-shortlist baselines attain only 84.4%. By leveraging context-dependent selection, DynaSpec achieves up to a 2.18 times increase in generated tokens compared to 1.91 times for fixed-vocabulary approaches.

[174] DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking

Lanni Bu, Lauren Levine, Amir Zeldes

Main category: cs.CL

TL;DR: DiscoTrack is a multilingual LLM benchmark for discourse tracking across 12 languages, testing four levels of discourse understanding that remain challenging for state-of-the-art models.

Details

Motivation: Current LLM benchmarks focus too much on natural language understanding for explicit information extraction, lacking challenging multilingual benchmarks for implicit information and pragmatic inferences across larger documents in discourse tracking.

Method: Created DiscoTrack benchmark with tasks across 12 languages targeting four levels of discourse understanding: salience recognition, entity tracking, discourse relations, and bridging inference.

Result: Evaluation shows these discourse tracking tasks remain challenging even for state-of-the-art models.

Conclusion: DiscoTrack addresses the gap in multilingual discourse tracking benchmarks and demonstrates that discourse-level understanding is still a difficult challenge for current LLMs.

Abstract: Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often targeting information from individual sentences. We are still lacking more challenging, and importantly also multilingual, benchmarks focusing on implicit information and pragmatic inferences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark targeting a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.

[175] How Efficient Are Diffusion Language Models? A Critical Examination of Efficiency Evaluation Practices

Han Peng, Peiyu Liu, Zican Dong, Daixuan Cheng, Junyi Li, Yiru Tang, Shuo Wang, Wayne Xin Zhao

Main category: cs.CL

TL;DR: Current diffusion language models (DLMs) underperform autoregressive models in speed despite parallel decoding potential, requiring better evaluation methods and acceleration strategies.

Details

Motivation: DLMs offer parallel decoding for efficiency but current open-source versions are slower than AR models, limiting practical utility.

Method: Systematic study of DLM efficiency through empirical benchmarking, theoretical analysis, and investigation of acceleration strategies like dual cache and parallel decoding.

Result: AR models achieve higher throughput than DLMs; acceleration strategies only help at small batch sizes with diminishing returns when scaled.

Conclusion: Robust evaluation methods and improved acceleration strategies are needed to advance DLM research and make them competitive with AR models.

Abstract: Diffusion language models (DLMs) have emerged as a promising alternative to the long-dominant autoregressive (AR) paradigm, offering a parallelable decoding process that could yield greater efficiency. Yet, in practice, current open-source DLMs often underperform their AR counterparts in speed, limiting their real-world utility. This work presents a systematic study of DLM efficiency, identifying key issues in prior evaluation methods. Through empirical benchmarking and a theoretical analysis, we demonstrate that AR models generally achieve higher throughput, while DLMs consistently lag. We also investigate acceleration strategies, finding that techniques like dual cache and parallel decoding mainly offer gains at small batch sizes, with their benefits diminishing upon scaling. Our findings underscore the necessity of robust evaluation methods and improved acceleration strategies to advance research on DLMs.

[176] CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation

Hasan Akgul, Mari Eplik, Javier Rojas, Aina Binti Abdullah, Pieter van der Merwe

Main category: cs.CL

TL;DR: CoSense-LLM is an edge-first framework that converts multimodal sensor data into semantic tokens and coordinates with LLMs under latency, energy, bandwidth, and privacy constraints.

Details

Motivation: To enable large language model deployments in interference-prone environments while addressing constraints of latency, energy, bandwidth, and privacy.

Method: Four-component system: SenseFusion (lightweight encoder), Edge-RAG (local retrieval), PromptRouter (cost-aware policy), and Secure Execution (data minimization).

Result: Achieves sub-second latency, reduces bandwidth costs, improves factual consistency, and preserves privacy by transmitting only discrete codes.

Conclusion: Edge-first design successfully treats semantics, privacy, and predictable latency as co-equal goals for LLM deployments.

Abstract: We present CoSense-LLM, an edge-first framework that turns continuous multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and lightweight vision) into compact, verifiable semantic tokens and coordinates with large language models under explicit latency, energy, bandwidth, and privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site specific policies and notes; (iii) PromptRouter, a cost and uncertainty aware policy that selects edge only generation, edge plus retrieval, or compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so raw waveforms never leave the device. The system works with modern serving optimizations, including paged or streaming KV caches, FlashAttention style kernels, speculative decoding, and quantized LoRA adapters, and supports on device personalization and federated updates under non IID drift. Across home, office, and clinic deployments, CoSense-LLM delivers grounded explanations while meeting tight service level objectives: it sustains sub second (p95) end to end latency on edge dominant paths, reduces inter tier token and bandwidth costs by preferring local retrieval grounded responses, and preserves privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, calibrated uncertainty enables selective abstention and controlled escalations, and KV plus decoding accelerators lower energy per decision. The results support an edge first design that treats semantics, privacy, and predictable latency as co equal goals for large model deployments in interference prone environments.

[177] Inside CORE-KG: Evaluating Structured Prompting and Coreference Resolution for Knowledge Graphs

Dipak Meher, Carlotta Domeniconi

Main category: cs.CL

TL;DR: Systematic ablation study of CORE-KG framework shows coreference resolution reduces node duplication by 28.25% and structured prompts reduce noisy nodes by 73.33% in legal knowledge graph construction.

Details

Motivation: Human smuggling case documents are unstructured and lexically dense, posing challenges for automated knowledge graph construction. Existing LLM approaches generate noisy, fragmented graphs with duplicate nodes due to lack of guided extraction and coreference resolution.

Method: Conducted systematic ablation study of CORE-KG framework to quantify contributions of its two key components: type-aware coreference module and domain-guided structured prompts.

Result: Removing coreference resolution caused 28.25% increase in node duplication and 4.32% increase in noisy nodes. Removing structured prompts caused 4.29% increase in node duplication and 73.33% increase in noisy nodes.

Conclusion: Both coreference resolution and structured prompts are crucial for robust LLM-based pipelines in legal text analysis, with structured prompts being particularly effective at reducing noise.

Abstract: Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer critical insights but are often unstructured, lexically dense, and filled with ambiguous or shifting references, which pose significant challenges for automated knowledge graph (KG) construction. While recent LLM-based approaches improve over static templates, they still generate noisy, fragmented graphs with duplicate nodes due to the absence of guided extraction and coreference resolution. The recently proposed CORE-KG framework addresses these limitations by integrating a type-aware coreference module and domain-guided structured prompts, significantly reducing node duplication and legal noise. In this work, we present a systematic ablation study of CORE-KG to quantify the individual contributions of its two key components. Our results show that removing coreference resolution results in a 28.25% increase in node duplication and a 4.32% increase in noisy nodes, while removing structured prompts leads to a 4.29% increase in node duplication and a 73.33% increase in noisy nodes. These findings offer empirical insights for designing robust LLM-based pipelines for extracting structured representations from complex legal texts.

[178] AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Aashray Reddy, Andrew Zagula, Nicholas Saban, Kevin Zhu

Main category: cs.CL

TL;DR: AutoAdv is a training-free framework for automated multi-turn jailbreaking that achieves 95% attack success rate on Llama-3.1-8B within six turns, showing current safety mechanisms fail against adaptive multi-turn attacks.

Details

Motivation: Current LLM safety evaluations focus on single-turn interactions, but real-world attacks unfold through adaptive multi-turn conversations, creating a gap in understanding multi-turn vulnerabilities.

Method: AutoAdv combines three adaptive mechanisms: pattern manager that learns from successful attacks, temperature manager that dynamically adjusts sampling parameters, and two-phase rewriting strategy that disguises then refines harmful requests.

Result: Achieved up to 95% attack success rate on Llama-3.1-8B within six turns (24% improvement over single-turn baselines), with persistent vulnerabilities found across commercial and open-source models including GPT-4o-mini, Qwen3-235B, and Mistral-7B.

Conclusion: Alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses in LLM safety mechanisms.

Abstract: Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to 95% attack success rate on Llama-3.1-8B within six turns a 24 percent improvement over single turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.

[179] HaluMem: Evaluating Hallucinations in Memory Systems of Agents

Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, Zhiyu Li

Main category: cs.CL

TL;DR: HaluMem is the first operation-level hallucination evaluation benchmark for memory systems, introducing three tasks (extraction, updating, QA) to identify where hallucinations occur in memory processes.

Details

Motivation: Current memory hallucination evaluations are end-to-end QA, making it hard to pinpoint which operational stage causes hallucinations in memory systems.

Method: Created HaluMem benchmark with three evaluation tasks and constructed user-centric multi-turn human-AI interaction datasets (HaluMem-Medium and HaluMem-Long) with ~15k memory points and 3.5k questions across different context scales.

Result: Empirical studies show existing memory systems generate and accumulate hallucinations during extraction and updating stages, which then propagate errors to QA stage.

Conclusion: Future research should focus on developing interpretable and constrained memory operation mechanisms to systematically suppress hallucinations and improve memory reliability.

Abstract: Memory systems are key components that enable AI systems such as LLMs and AI agents to achieve long-term learning and sustained interaction. However, during memory storage and retrieval, these systems frequently exhibit memory hallucinations, including fabrication, errors, conflicts, and omissions. Existing evaluations of memory hallucinations are primarily end-to-end question answering, which makes it difficult to localize the operational stage within the memory system where hallucinations arise. To address this, we introduce the Hallucination in Memory Benchmark (HaluMem), the first operation level hallucination evaluation benchmark tailored to memory systems. HaluMem defines three evaluation tasks (memory extraction, memory updating, and memory question answering) to comprehensively reveal hallucination behaviors across different operational stages of interaction. To support evaluation, we construct user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and HaluMem-Long. Both include about 15k memory points and 3.5k multi-type questions. The average dialogue length per user reaches 1.5k and 2.6k turns, with context lengths exceeding 1M tokens, enabling evaluation of hallucinations across different context scales and task complexities. Empirical studies based on HaluMem show that existing memory systems tend to generate and accumulate hallucinations during the extraction and updating stages, which subsequently propagate errors to the question answering stage. Future research should focus on developing interpretable and constrained memory operation mechanisms that systematically suppress hallucinations and improve memory reliability.

[180] GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation

Stergios Chatzikyriakidis, Dimitris Papadakis, Sevasti-Ioanna Papaioannou, Erofili Psaltaki

Main category: cs.CL

TL;DR: Extended Greek Dialectal Dataset (GRDD+) with 6.4M words across 10 Greek varieties, used to fine-tune LLMs and compare with frontier models.

Details

Motivation: To study the effect of high-quality dialectal data on language models and create the first large-scale dataset covering diverse Greek dialects.

Method: Extended existing GRDD dataset with more Cretan, Cypriot, Pontic, Northern Greek data and added six new varieties. Fine-tuned three LLM architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compared with frontier models.

Result: Created GRDD+ dataset with 6,374,939 words covering 10 Greek varieties - the largest and most varied Greek dialect dataset to date.

Conclusion: The study demonstrates the importance of dialectal data for LLM performance and provides a comprehensive benchmark for Greek dialect processing.

Abstract: We present an extended Greek Dialectal Dataset (GRDD+) 1that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic and Northern Greek, while we add six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The result is a dataset with total size 6,374,939 words and 10 varieties. This is the first dataset with such variation and size to date. We conduct a number of fine-tuning experiments to see the effect of good quality dialectal data on a number of LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results to frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).

[181] multiMentalRoBERTa: A Fine-tuned Multiclass Classifier for Mental Health Disorder

K M Sajjadul Islam, John Fields, Praveen Madiraju

Main category: cs.CL

TL;DR: multiMentalRoBERTa is a fine-tuned RoBERTa model for detecting mental health conditions from social media text, achieving superior performance in multiclass classification of stress, anxiety, depression, PTSD, suicidal ideation, and neutral discourse.

Details

Motivation: Early detection of mental health disorders from social media is critical for timely support, risk assessment, and resource referral.

Method: Fine-tuned RoBERTa model using multiple curated datasets, with comparative experiments against traditional ML methods, domain-specific transformers, and prompting-based LLMs. Applied explainability methods including Layer Integrated Gradients and KeyBERT.

Result: Achieved macro F1-scores of 0.839 (six-class) and 0.870 (five-class, excluding stress), outperforming MentalBERT and baseline classifiers. Identified strong correlations between depression-suicidal ideation and anxiety-PTSD.

Conclusion: multiMentalRoBERTa is an effective, lightweight, and deployable solution for reliable and interpretable mental health detection, emphasizing fairness, bias mitigation, and human-in-the-loop safety protocols.

Abstract: The early detection of mental health disorders from social media text is critical for enabling timely support, risk assessment, and referral to appropriate resources. This work introduces multiMentalRoBERTa, a fine-tuned RoBERTa model designed for multiclass classification of common mental health conditions, including stress, anxiety, depression, post-traumatic stress disorder (PTSD), suicidal ideation, and neutral discourse. Drawing on multiple curated datasets, data exploration is conducted to analyze class overlaps, revealing strong correlations between depression and suicidal ideation as well as anxiety and PTSD, while stress emerges as a broad, overlapping category. Comparative experiments with traditional machine learning methods, domain-specific transformers, and prompting-based large language models demonstrate that multiMentalRoBERTa achieves superior performance, with macro F1-scores of 0.839 in the six-class setup and 0.870 in the five-class setup (excluding stress), outperforming both fine-tuned MentalBERT and baseline classifiers. Beyond predictive accuracy, explainability methods, including Layer Integrated Gradients and KeyBERT, are applied to identify lexical cues that drive classification, with a particular focus on distinguishing depression from suicidal ideation. The findings emphasize the effectiveness of fine-tuned transformers for reliable and interpretable detection in sensitive contexts, while also underscoring the importance of fairness, bias mitigation, and human-in-the-loop safety protocols. Overall, multiMentalRoBERTa is presented as a lightweight, robust, and deployable solution for enhancing support in mental health platforms.

[182] Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs

Matthew Bozoukov, Matthew Nguyen, Shubkarman Singh, Bart Bussmann, Patrick Leask

Main category: cs.CL

TL;DR: LLMs can develop behavioral self-awareness through simple fine-tuning, which raises safety concerns about models hiding their true capabilities during evaluations.

Details

Motivation: To understand how LLMs develop self-awareness of their learned behaviors and characterize the minimal conditions for this emergence, addressing safety concerns about models potentially concealing abilities.

Method: Controlled fine-tuning experiments using low-rank adapters (LoRA) on instruction-tuned LLMs, specifically testing with single rank-1 LoRA adapters and analyzing activation space steering vectors.

Result: Self-awareness can be reliably induced with a single rank-1 LoRA adapter, captured by a single steering vector in activation space, and is non-universal with independent representations across different tasks.

Conclusion: Behavioral self-awareness in LLMs emerges as a domain-specific, linear feature that can be easily induced and modulated, suggesting it’s a fundamental but localized capability.

Abstract: Recent studies have revealed that LLMs can exhibit behavioral self-awareness: the ability to accurately describe or predict their own learned behaviors without explicit supervision. This capability raises safety concerns as it may, for example, allow models to better conceal their true abilities during evaluation. We attempt to characterize the minimal conditions under which such self-awareness emerges, and the mechanistic processes through which it manifests. Through controlled finetuning experiments on instruction-tuned LLMs with low-rank adapters (LoRA), we find: (1) that self-awareness can be reliably induced using a single rank-1 LoRA adapter; (2) that the learned self-aware behavior can be largely captured by a single steering vector in activation space, recovering nearly all of the fine-tune’s behavioral effect; and (3) that self-awareness is non-universal and domain-localized, with independent representations across tasks. Together, these findings suggest that behavioral self-awareness emerges as a domain-specific, linear feature that can be easily induced and modulated.

[183] SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents

Jaehoon Lee, Sohyun Kim, Wanggeun Park, Geon Lee, Seungkyung Kim, Minyoung Lee

Main category: cs.CL

TL;DR: SDS KoPub VDR is the first large-scale benchmark for Korean public document retrieval, featuring 361 real-world documents and 600 query-page-answer triples across six public domains, with multimodal evaluation revealing performance gaps in cross-modal reasoning.

Details

Motivation: Existing VDR benchmarks overlook non-English languages and structural complexity of official publications, creating a gap for evaluating document understanding in real-world multilingual contexts.

Method: Built benchmark using 361 real Korean public documents (256 KOGL Type 1 + 105 legal portal files) with complex visual elements. Created 600 query-page-answer triples using multimodal models followed by human verification, spanning six public domains and three reasoning modalities.

Result: Evaluation on text-only and multimodal retrieval tasks revealed substantial performance gaps, particularly in multimodal scenarios requiring cross-modal reasoning, even for state-of-the-art models.

Conclusion: SDS KoPub VDR enables rigorous evaluation of multimodal AI in document intelligence and provides a roadmap for advancing real-world document understanding systems, especially for non-English languages and complex document structures.

Abstract: Existing benchmarks for visual document retrieval (VDR) largely overlook non-English languages and the structural complexity of official publications. To address this gap, we introduce SDS KoPub VDR, the first large-scale, public benchmark for retrieving and understanding Korean public documents. The benchmark is built upon 361 real-world documents, including 256 files under the KOGL Type 1 license and 105 from official legal portals, capturing complex visual elements like tables, charts, and multi-column layouts. To establish a reliable evaluation set, we constructed 600 query-page-answer triples. These were initially generated using multimodal models (e.g., GPT-4o) and subsequently underwent human verification to ensure factual accuracy and contextual relevance. The queries span six major public domains and are categorized by the reasoning modality required: text-based, visual-based, and cross-modal. We evaluate SDS KoPub VDR on two complementary tasks: (1) text-only retrieval and (2) multimodal retrieval, which leverages visual features alongside text. This dual-task evaluation reveals substantial performance gaps, particularly in multimodal scenarios requiring cross-modal reasoning, even for state-of-the-art models. As a foundational resource, SDS KoPub VDR enables rigorous and fine-grained evaluation and provides a roadmap for advancing multimodal AI in real-world document intelligence. The dataset is available at https://huggingface.co/datasets/SamsungSDS-Research/SDS-KoPub-VDR-Benchmark.

cs.CV

[184] Randomized-MLP Regularization Improves Domain Adaptation and Interpretability in DINOv2

Joel Valdivia Ortega, Lorenz Lamm, Franziska Eckardt, Benedikt Schworm, Marion Jasnin, Tingying Peng

Main category: cs.CV

TL;DR: RMLP regularization improves DINOv2 ViT interpretability while maintaining performance in medical and natural images.

Details

Motivation: Vision Transformers like DINOv2 have poor interpretability due to low-informative patch tokens, especially problematic in medical imaging where domain shifts degrade both performance and transparency.

Method: Introduce Randomized-MLP (RMLP) regularization, a contrastive learning-based method that encourages semantically aligned representations during fine-tuning of DINOv2.

Result: RMLP improves or maintains downstream performance while producing more interpretable attention maps across medical and natural image modalities.

Conclusion: RMLP enhances ViT interpretability and provides mathematical insights into contrastive learning mechanisms.

Abstract: Vision Transformers (ViTs), such as DINOv2, achieve strong performance across domains but often repurpose low-informative patch tokens in ways that reduce the interpretability of attention and feature maps. This challenge is especially evident in medical imaging, where domain shifts can degrade both performance and transparency. In this paper, we introduce Randomized-MLP (RMLP) regularization, a contrastive learning-based method that encourages more semantically aligned representations. We use RMLPs when fine-tuning DINOv2 to both medical and natural image modalities, showing that it improves or maintains downstream performance while producing more interpretable attention maps. We also provide a mathematical analysis of RMLPs, offering insights into its role in enhancing ViT-based models and advancing our understanding of contrastive learning.

[185] Token Is All You Need: Cognitive Planning through Sparse Intent Alignment

Shiyao Sang

Main category: cs.CV

TL;DR: The paper challenges the need for exhaustive scene modeling in autonomous driving, showing that minimal semantically rich tokens are sufficient for effective planning, achieving state-of-the-art performance on nuPlan benchmark.

Details

Motivation: To challenge the long-standing assumption that exhaustive scene modeling is required for high-performance end-to-end autonomous driving, and to show that minimal semantically rich tokens are sufficient.

Method: Uses perception-informed BEV representations with sparse token-based approach, conditioning trajectory decoding on predicted future tokens without explicit reconstruction loss.

Result: Achieves 0.548 m ADE without future prediction (comparable to prior methods), and 0.479 m ADE with future token conditioning (12.6% improvement). Shows temporal fuzziness emerges where model adaptively attends to task-relevant semantics.

Conclusion: The ’token is all you need’ principle marks a paradigm shift from reconstructing the world to understanding it, enabling cognitively inspired systems that plan through imagination rather than reaction.

Abstract: We challenge the long-standing assumption that exhaustive scene modeling is required for high-performance end-to-end autonomous driving (E2EAD). Unlike world-model approaches that rely on computationally intensive future scene generation or vision-language-action (VLA) systems constrained by Markov assumptions, we show that a minimal set of semantically rich tokens is sufficient for effective planning. Experiments on the nuPlan benchmark (720 scenarios, over 11,000 samples) using perception-informed BEV representations yield three key findings: (1) even without future prediction, our sparse representation achieves 0.548 m ADE, comparable to or surpassing prior methods reporting around 0.75 m on nuScenes; (2) conditioning trajectory decoding on predicted future tokens reduces ADE to 0.479 m, a 12.6% improvement over current-state baselines; and (3) explicit reconstruction loss offers no benefit and may degrade performance under reliable perception inputs. Notably, we observe the emergence of temporal fuzziness, where the model adaptively attends to task-relevant semantics rather than aligning rigidly to fixed timestamps, providing a cognitive advantage for planning under uncertainty. Our “token is all you need” principle marks a paradigm shift from reconstructing the world to understanding it, laying a foundation for cognitively inspired systems that plan through imagination rather than reaction.

[186] Automated Invoice Data Extraction: Using LLM and OCR

Advait Thakur, Khushi Khanchandani, Akshita Shetty, Chaitravi Reddy, Ritisa Behera

Main category: cs.CV

TL;DR: A holistic AI platform combining OCR, deep learning, LLMs, and graph analytics to overcome limitations of traditional OCR systems for invoice processing.

Details

Motivation: Conventional OCR systems struggle with variant invoice layouts, handwritten text, low-quality scans, and template dependencies that limit flexibility across different document structures.

Method: Hybrid architecture combining OCR technology with Large Language Models (LLMs), deep learning models (CNNs and Transformers), domain-specific models for layout analysis, and graph analytics for contextual relationship mapping.

Result: Achieves unprecedented extraction quality and consistency with greater contextual sensitivity and much higher accuracy rates than older approaches.

Conclusion: The holistic AI platform enables maximum scalability and minimal human intervention for invoice processing across varied document types and layouts.

Abstract: Conventional Optical Character Recognition (OCR) systems are challenged by variant invoice layouts, handwritten text, and low- quality scans, which are often caused by strong template dependencies that restrict their flexibility across different document structures and layouts. Newer solutions utilize advanced deep learning models such as Convolutional Neural Networks (CNN) as well as Transformers, and domain-specific models for better layout analysis and accuracy across various sections over varied document types. Large Language Models (LLMs) have revolutionized extraction pipelines at their core with sophisticated entity recognition and semantic comprehension to support complex contextual relationship mapping without direct programming specification. Visual Named Entity Recognition (NER) capabilities permit extraction from invoice images with greater contextual sensitivity and much higher accuracy rates than older approaches. Existing industry best practices utilize hybrid architectures that blend OCR technology and LLM for maximum scalability and minimal human intervention. This work introduces a holistic Artificial Intelligence (AI) platform combining OCR, deep learning, LLMs, and graph analytics to achieve unprecedented extraction quality and consistency.

[187] In-Context-Learning-Assisted Quality Assessment Vision-Language Models for Metal Additive Manufacturing

Qiaojie Zheng, Jiucai Zhang, Xiaoli Zhang

Main category: cs.CV

TL;DR: Vision-language models with in-context learning can assess additive manufacturing quality using minimal samples, achieving accuracy comparable to traditional ML while providing interpretable rationales.

Details

Motivation: Traditional vision-based quality assessment requires expensive dedicated datasets and model training. VLMs with ICL can eliminate the need for large application-specific datasets.

Method: Used in-context learning with different sampling strategies on Gemini-2.5-flash and Gemma3:27b models for quality assessment in wire-laser direct energy deposition processes.

Result: ICL-assisted VLMs achieved quality classification accuracies similar to traditional ML models using only minimal samples, while providing human-interpretable rationales.

Conclusion: ICL-assisted VLMs can effectively address application-specific manufacturing tasks with limited data, offering high accuracy and improved decision transparency through valid supporting rationales.

Abstract: Vision-based quality assessment in additive manufacturing often requires dedicated machine learning models and application-specific datasets. However, data collection and model training can be expensive and time-consuming. In this paper, we leverage vision-language models’ (VLMs’) reasoning capabilities to assess the quality of printed parts and introduce in-context learning (ICL) to provide VLMs with necessary application-specific knowledge and demonstration samples. This method eliminates the requirement for large application-specific datasets for training models. We explored different sampling strategies for ICL to search for the optimal configuration that makes use of limited samples. We evaluated these strategies on two VLMs, Gemini-2.5-flash and Gemma3:27b, with quality assessment tasks in wire-laser direct energy deposition processes. The results show that ICL-assisted VLMs can reach quality classification accuracies similar to those of traditional machine learning models while requiring only a minimal number of samples. In addition, unlike traditional classification models that lack transparency, VLMs can generate human-interpretable rationales to enhance trust. Since there are no metrics to evaluate their interpretability in manufacturing applications, we propose two metrics, knowledge relevance and rationale validity, to evaluate the quality of VLMs’ supporting rationales. Our results show that ICL-assisted VLMs can address application-specific tasks with limited data, achieving relatively high accuracy while also providing valid supporting rationales for improved decision transparency.

[188] EVLP:Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning

Xinyan Cai, Shiguang Wu, Dafeng Chi, Yuzheng Zhuang, Xingyue Quan, Jianye Hao, Qiang Guan

Main category: cs.CV

TL;DR: EVLP is a unified multimodal generation framework that integrates textual reasoning and visual generation for long-horizon manipulation tasks through dynamic pretraining and reinforced alignment.

Details

Motivation: Current methods lack a unified framework for multimodal planning, leading to inconsistencies in complex embodied long-horizon manipulation tasks that require both textual logical reasoning and visual-spatial imagination.

Method: Three key components: 1) Unified multimodal generation framework integrating semantic and spatial features with learnable cross-modal attention; 2) Dynamic perception pretraining using bidirectional alignment with inverse/forward dynamics tasks; 3) Reinforced supervised fine-tuning with spatial logic alignment between text actions and generated images.

Result: The approach enables coordinated language-visual modeling and spatio-aware multimodal planning capabilities for long-horizon tasks.

Conclusion: EVLP provides an effective solution for multimodal planning in complex embodied manipulation by unifying textual reasoning and visual generation through innovative training strategies.

Abstract: In complex embodied long-horizon manipulation tasks, effective task decomposition and execution require synergistic integration of textual logical reasoning and visual-spatial imagination to ensure efficient and accurate operation. Current methods fail to adopt a unified generation framework for multimodal planning, lead to inconsistent in multimodal planning. To address this challenge, we present \textbf{EVLP (Embodied Vision-Language Planner)}, an innovative multimodal unified generation framework that jointly models linguistic reasoning and visual generation. Our approach achieves multimodal planning for long-horizon tasks through a novel training pipeline incorporating dynamic pretraining and reinforced alignment. Our core innovations consist of three key components: \textbf{1) Unified Multimodal Generation Framework}: For understanding, We integrate semantic information with spatial features to provide comprehensive visual perception. For generation, we directly learn the joint distribution of discrete images for one-step visual synthesis, enabling coordinated language-visual modeling through learnable cross-modal attention mechanisms. \textbf{2) Dynamic Perception Pretraining}: We propose a bidirectional dynamic alignment strategy employing inverse dynamics tasks and forward dynamics tasks, effectively strengthening multimodal correlations within a unified feature space. \textbf{3) Reinforced Supervised Fine-Tuning}: While conducting instruction-based fine-tuning in the unified generation space, we construct a reinforce loss to align the spatial logic between textual actions and generated images, enabling the model to acquire spatio-awared multimodal planning capabilities.

[189] MCFCN: Multi-View Clustering via a Fusion-Consensus Graph Convolutional Network

Chenping Pei, Fadi Dornaika, Jingjun Bi

Main category: cs.CV

TL;DR: MCFCN is a multi-view clustering method that uses fusion-consensus graph convolutional networks to learn consensus graphs end-to-end, addressing noise interference and cross-view consistency issues in existing methods.

Details

Motivation: Existing multi-view clustering methods neglect inherent topological structure, suffer from noise interference in graph structures, have insufficient cross-view consistency consideration, and disjointed optimization processes.

Method: Uses view feature fusion model and Unified Graph Structure Adapter (UGA) to learn consensus graphs end-to-end, with Similarity Matrix Alignment Loss (SMAL) and Feature Representation Alignment Loss (FRAL) to optimize view-specific graphs and preserve cross-view topological consistency.

Result: Achieves state-of-the-art performance on eight multi-view benchmark datasets through extensive qualitative and quantitative experiments.

Conclusion: MCFCN effectively addresses limitations of existing multi-view clustering methods and demonstrates superior clustering performance through its fusion-consensus approach.

Abstract: Existing Multi-view Clustering (MVC) methods based on subspace learning focus on consensus representation learning while neglecting the inherent topological structure of data. Despite the integration of Graph Neural Networks (GNNs) into MVC, their input graph structures remain susceptible to noise interference. Methods based on Multi-view Graph Refinement (MGRC) also have limitations such as insufficient consideration of cross-view consistency, difficulty in handling hard-to-distinguish samples in the feature space, and disjointed optimization processes caused by graph construction algorithms. To address these issues, a Multi-View Clustering method via a Fusion-Consensus Graph Convolutional Network (MCFCN) is proposed. The network learns the consensus graph of multi-view data in an end-to-end manner and learns effective consensus representations through a view feature fusion model and a Unified Graph Structure Adapter (UGA). It designs Similarity Matrix Alignment Loss (SMAL) and Feature Representation Alignment Loss (FRAL). With the guidance of consensus, it optimizes view-specific graphs, preserves cross-view topological consistency, promotes the construction of intra-class edges, and realizes effective consensus representation learning with the help of GCN to improve clustering performance. MCFCN demonstrates state-of-the-art performance on eight multi-view benchmark datasets, and its effectiveness is verified by extensive qualitative and quantitative implementations. The code will be provided at https://github.com/texttao/MCFCN.

[190] Enhancing Multimodal Misinformation Detection by Replaying the Whole Story from Image Modality Perspective

Bing Wang, Ximing Li, Yanjun Wang, Changchun Li, Lin Yuanbo Wu, Buyu Wang, Shengsheng Wang

Main category: cs.CV

TL;DR: RETSIMD is a multimodal misinformation detection method that focuses more on text modality by augmenting images from text segments and using graph neural networks for feature fusion.

Details

Motivation: Text modality is more informative than images for misinformation detection since text describes the whole event while images only show partial scenes. Preliminary results confirm images contribute less to MMD.

Method: Split text into segments, generate corresponding images using pre-trained text-to-image generator, incorporate auxiliary objectives for text-image and image-label mutual information, use graph neural network with heuristic image relationships for feature fusion.

Result: Extensive empirical results validate the effectiveness of RETSIMD for multimodal misinformation detection.

Conclusion: The proposed RETSIMD method effectively leverages text modality dominance in misinformation detection through text-to-image augmentation and graph-based feature fusion.

Abstract: Multimodal Misinformation Detection (MMD) refers to the task of detecting social media posts involving misinformation, where the post often contains text and image modalities. However, by observing the MMD posts, we hold that the text modality may be much more informative than the image modality because the text generally describes the whole event/story of the current post but the image often presents partial scenes only. Our preliminary empirical results indicate that the image modality exactly contributes less to MMD. Upon this idea, we propose a new MMD method named RETSIMD. Specifically, we suppose that each text can be divided into several segments, and each text segment describes a partial scene that can be presented by an image. Accordingly, we split the text into a sequence of segments, and feed these segments into a pre-trained text-to-image generator to augment a sequence of images. We further incorporate two auxiliary objectives concerning text-image and image-label mutual information, and further post-train the generator over an auxiliary text-to-image generation benchmark dataset. Additionally, we propose a graph structure by defining three heuristic relationships between images, and use a graph neural network to generate the fused features. Extensive empirical results validate the effectiveness of RETSIMD.

[191] Compressing Multi-Task Model for Autonomous Driving via Pruning and Knowledge Distillation

Jiayuan Wang, Q. M. Jonathan Wu, Ning Zhang, Katsuya Suto, Lei Zhong

Main category: cs.CV

TL;DR: Proposed multi-task model compression framework combining task-aware safe pruning and feature-level knowledge distillation for autonomous driving panoptic perception, achieving 32.7% parameter reduction with minimal performance loss.

Details

Motivation: Multi-task learning for autonomous driving panoptic perception increases model parameters and complexity, making deployment on on-board devices difficult.

Method: Combines task-aware safe pruning (Taylor-based channel importance with gradient conflict penalty) and task head-agnostic distillation that transfers intermediate backbone and encoder features from teacher to student model.

Result: 32.7% parameter reduction; segmentation shows negligible accuracy loss; detection: -1.2% Recall, -1.8% mAP50; runs at 32.7 FPS in real-time on BDD100K dataset.

Conclusion: Combining pruning and knowledge distillation provides an effective compression solution for multi-task panoptic perception in autonomous driving systems.

Abstract: Autonomous driving systems rely on panoptic perception to jointly handle object detection, drivable area segmentation, and lane line segmentation. Although multi-task learning is an effective way to integrate these tasks, its increasing model parameters and complexity make deployment on on-board devices difficult. To address this challenge, we propose a multi-task model compression framework that combines task-aware safe pruning with feature-level knowledge distillation. Our safe pruning strategy integrates Taylor-based channel importance with gradient conflict penalty to keep important channels while removing redundant and conflicting channels. To mitigate performance degradation after pruning, we further design a task head-agnostic distillation method that transfers intermediate backbone and encoder features from a teacher to a student model as guidance. Experiments on the BDD100K dataset demonstrate that our compressed model achieves a 32.7% reduction in parameters while segmentation performance shows negligible accuracy loss and only a minor decrease in detection (-1.2% for Recall and -1.8% for mAP50) compared to the teacher. The compressed model still runs at 32.7 FPS in real-time. These results show that combining pruning and knowledge distillation provides an effective compression solution for multi-task panoptic perception.

[192] Mono3DVG-EnSD: Enhanced Spatial-aware and Dimension-decoupled Text Encoding for Monocular 3D Visual Grounding

Yuzhen Li, Min Liu, Zhaoyang Li, Yuan Bian, Xueping Wang, Erbo Zhai, Yaonan Wang

Main category: cs.CV

TL;DR: Mono3DVG-EnSD is a novel framework for monocular 3D visual grounding that addresses over-reliance on explicit keywords and cross-dimensional interference through CLIP-guided lexical certainty adaptation and dimension-decoupled feature processing.

Details

Motivation: Existing methods for monocular 3D visual grounding over-rely on high-certainty keywords while neglecting spatial descriptions, and suffer from cross-dimensional interference when generalized textual features interact with visual features.

Method: Proposes two key components: CLIP-Guided Lexical Certainty Adapter (CLIP-LCA) that masks high-certainty keywords to force spatial understanding, and Dimension-Decoupled Module (D2M) that separates 2D/3D textual features to guide corresponding visual features.

Result: Achieves state-of-the-art performance on Mono3DRefer dataset across all metrics, with significant +13.54% improvement in the challenging Far(Acc@0.5) scenario.

Conclusion: The proposed framework effectively addresses key limitations in monocular 3D visual grounding by enhancing spatial understanding through keyword masking and eliminating cross-dimensional interference through feature decoupling.

Abstract: Monocular 3D Visual Grounding (Mono3DVG) is an emerging task that locates 3D objects in RGB images using text descriptions with geometric cues. However, existing methods face two key limitations. Firstly, they often over-rely on high-certainty keywords that explicitly identify the target object while neglecting critical spatial descriptions. Secondly, generalized textual features contain both 2D and 3D descriptive information, thereby capturing an additional dimension of details compared to singular 2D or 3D visual features. This characteristic leads to cross-dimensional interference when refining visual features under text guidance. To overcome these challenges, we propose Mono3DVG-EnSD, a novel framework that integrates two key components: the CLIP-Guided Lexical Certainty Adapter (CLIP-LCA) and the Dimension-Decoupled Module (D2M). The CLIP-LCA dynamically masks high-certainty keywords while retaining low-certainty implicit spatial descriptions, thereby forcing the model to develop a deeper understanding of spatial relationships in captions for object localization. Meanwhile, the D2M decouples dimension-specific (2D/3D) textual features from generalized textual features to guide corresponding visual features at same dimension, which mitigates cross-dimensional interference by ensuring dimensionally-consistent cross-modal interactions. Through comprehensive comparisons and ablation studies on the Mono3DRefer dataset, our method achieves state-of-the-art (SOTA) performance across all metrics. Notably, it improves the challenging Far(Acc@0.5) scenario by a significant +13.54%.

[193] FilletRec: A Lightweight Graph Neural Network with Intrinsic Features for Automated Fillet Recognition

Jiali Gao, Taoran Liu, Hongfei Ye, Jianjun Chen

Main category: cs.CV

TL;DR: Proposes FilletRec, a lightweight GNN using pose-invariant geometric features for automated fillet recognition and simplification in CAD models, achieving state-of-the-art accuracy with high efficiency.

Details

Motivation: Automated fillet feature recognition and simplification is critical for CAE analysis but remains challenging due to lack of robustness in traditional methods and poor generalization in existing deep learning approaches.

Method: Constructs a large-scale benchmark dataset and proposes FilletRec - a lightweight graph neural network using pose-invariant intrinsic geometric features like curvature to learn fundamental geometric patterns.

Result: FilletRec surpasses state-of-the-art methods in accuracy and generalization while using only 0.2%-5.4% of baseline model parameters, demonstrating high model efficiency.

Conclusion: The framework provides an end-to-end automated workflow from recognition to simplification, addressing key challenges in CAD model processing for CAE analysis.

Abstract: Automated recognition and simplification of fillet features in CAD models is critical for CAE analysis, yet it remains an open challenge. Traditional rule-based methods lack robustness, while existing deep learning models suffer from poor generalization and low accuracy on complex fillets due to their generic design and inadequate training data. To address these issues, this paper proposes an end-to-end, data-driven framework specifically for fillet features. We first construct and release a large-scale, diverse benchmark dataset for fillet recognition to address the inadequacy of existing data. Based on it, we propose FilletRec, a lightweight graph neural network. The core innovation of this network is its use of pose-invariant intrinsic geometric features, such as curvature, enabling it to learn more fundamental geometric patterns and thereby achieve high-precision recognition of complex geometric topologies. Experiments show that FilletRec surpasses state-of-the-art methods in both accuracy and generalization, while using only 0.2%-5.4% of the parameters of baseline models, demonstrating high model efficiency. Finally, the framework completes the automated workflow from recognition to simplification by integrating an effective geometric simplification algorithm.

[194] M2S2L: Mamba-based Multi-Scale Spatial-temporal Learning for Video Anomaly Detection

Yang Liu, Boan Chen, Xiaoguang Zhu, Jing Liu, Peng Sun, Wei Zhou

Main category: cs.CV

TL;DR: M2S2L is a Mamba-based multi-scale spatial-temporal learning framework for video anomaly detection that balances accuracy and efficiency through hierarchical encoders and feature decomposition.

Details

Motivation: Video anomaly detection faces challenges in balancing detection accuracy with computational efficiency for modern surveillance systems with complex video content and diverse behavioral patterns.

Method: Uses hierarchical spatial encoders at multiple granularities, multi-temporal encoders for motion dynamics across time scales, and feature decomposition for task-specific optimization of appearance and motion reconstruction.

Result: Achieves 98.5%, 92.1%, and 77.9% frame-level AUCs on UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets respectively, with 20.1G FLOPs and 45 FPS inference speed.

Conclusion: The M2S2L framework provides robust anomaly assessment while maintaining computational efficiency, making it suitable for practical surveillance deployment.

Abstract: Video anomaly detection (VAD) is an essential task in the image processing community with prospects in video surveillance, which faces fundamental challenges in balancing detection accuracy with computational efficiency. As video content becomes increasingly complex with diverse behavioral patterns and contextual scenarios, traditional VAD approaches struggle to provide robust assessment for modern surveillance systems. Existing methods either lack comprehensive spatial-temporal modeling or require excessive computational resources for real-time applications. In this regard, we present a Mamba-based multi-scale spatial-temporal learning (M2S2L) framework in this paper. The proposed method employs hierarchical spatial encoders operating at multiple granularities and multi-temporal encoders capturing motion dynamics across different time scales. We also introduce a feature decomposition mechanism to enable task-specific optimization for appearance and motion reconstruction, facilitating more nuanced behavioral modeling and quality-aware anomaly assessment. Experiments on three benchmark datasets demonstrate that M2S2L framework achieves 98.5%, 92.1%, and 77.9% frame-level AUCs on UCSD Ped2, CUHK Avenue, and ShanghaiTech respectively, while maintaining efficiency with 20.1G FLOPs and 45 FPS inference speed, making it suitable for practical surveillance deployment.

[195] In-Context Adaptation of VLMs for Few-Shot Cell Detection in Optical Microscopy

Shreyan Ganguly, Angona Biswas, Jaydeep Rade, Md Hasibul Hasan Hasib, Nabila Masud, Nitish Singla, Abhipsa Dash, Ushashi Bhattacharjee, Aditya Balu, Anwesha Sarkar, Adarsh Krishnamurthy, Soumik Sarkar

Main category: cs.CV

TL;DR: Foundation VLMs struggle with biomedical microscopy due to domain gap, but in-context learning enables few-shot object detection when annotated data is scarce.

Details

Motivation: Utility of foundation vision-language models for biomedical microscopy remains underexplored, especially when large annotated datasets are unavailable for microscopic images.

Method: Introduced Micro-OD benchmark with 252 images across 11 cell types, evaluated 8 VLMs under few-shot conditions, implemented hybrid FSOD pipeline combining detection head with VLM-based classifier, and tested variants with/without reasoning tokens.

Result: Zero-shot performance is weak due to domain gap, but few-shot support consistently improves detection with marginal gains after six shots. Models with reasoning tokens are better for end-to-end localization, while simpler variants work better for classifying pre-localized crops.

Conclusion: In-context adaptation provides a practical path for microscopy applications, and the benchmark offers a reproducible testbed for advancing open-vocabulary detection in biomedical imaging.

Abstract: Foundation vision-language models (VLMs) excel on natural images, but their utility for biomedical microscopy remains underexplored. In this paper, we investigate how in-context learning enables state-of-the-art VLMs to perform few-shot object detection when large annotated datasets are unavailable, as is often the case with microscopic images. We introduce the Micro-OD benchmark, a curated collection of 252 images specifically curated for in-context learning, with bounding-box annotations spanning 11 cell types across four sources, including two in-lab expert-annotated sets. We systematically evaluate eight VLMs under few-shot conditions and compare variants with and without implicit test-time reasoning tokens. We further implement a hybrid Few-Shot Object Detection (FSOD) pipeline that combines a detection head with a VLM-based few-shot classifier, which enhances the few-shot performance of recent VLMs on our benchmark. Across datasets, we observe that zero-shot performance is weak due to the domain gap; however, few-shot support consistently improves detection, with marginal gains achieved after six shots. We observe that models with reasoning tokens are more effective for end-to-end localization, whereas simpler variants are more suitable for classifying pre-localized crops. Our results highlight in-context adaptation as a practical path for microscopy, and our benchmark provides a reproducible testbed for advancing open-vocabulary detection in biomedical imaging.

[196] Efficient Online Continual Learning in Sensor-Based Human Activity Recognition

Yao Zhang, Souza Leite Clayton, Yu Xiao

Main category: cs.CV

TL;DR: PTRN-HAR is the first successful application of pre-trained model-based online continual learning to sensor-based human activity recognition, achieving high performance with reduced resource consumption and improved data efficiency.

Details

Motivation: Existing online continual learning approaches for sensor-based HAR are computationally intensive and require extensive labeled samples, while pre-trained model-based approaches from computer vision face challenges due to dataset heterogeneity and data scarcity in HAR.

Method: Pre-trains feature extractor using contrastive loss with limited data, freezes it during streaming stage, and replaces dense classification layer with relation module network.

Result: Outperforms state-of-the-art methods on three public datasets, significantly reduces resource consumption for training while maintaining high performance, and improves data efficiency by reducing labeled data requirements.

Conclusion: PTRN-HAR successfully adapts PTM-based OCL to sensor-based HAR, providing an efficient and data-effective solution for continual learning in human activity recognition applications.

Abstract: Machine learning models for sensor-based human activity recognition (HAR) are expected to adapt post-deployment to recognize new activities and different ways of performing existing ones. To address this need, Online Continual Learning (OCL) mechanisms have been proposed, allowing models to update their knowledge incrementally as new data become available while preserving previously acquired information. However, existing OCL approaches for sensor-based HAR are computationally intensive and require extensive labeled samples to represent new changes. Recently, pre-trained model-based (PTM-based) OCL approaches have shown significant improvements in performance and efficiency for computer vision applications. These methods achieve strong generalization capabilities by pre-training complex models on large datasets, followed by fine-tuning on downstream tasks for continual learning. However, applying PTM-based OCL approaches to sensor-based HAR poses significant challenges due to the inherent heterogeneity of HAR datasets and the scarcity of labeled data in post-deployment scenarios. This paper introduces PTRN-HAR, the first successful application of PTM-based OCL to sensor-based HAR. Unlike prior PTM-based OCL approaches, PTRN-HAR pre-trains the feature extractor using contrastive loss with a limited amount of data. This extractor is then frozen during the streaming stage. Furthermore, it replaces the conventional dense classification layer with a relation module network. Our design not only significantly reduces the resource consumption required for model training while maintaining high performance, but also improves data efficiency by reducing the amount of labeled data needed for effective continual learning, as demonstrated through experiments on three public datasets, outperforming the state-of-the-art. The code can be found here: https://anonymous.4open.science/r/PTRN-HAR-AF60/

[197] Temporal Inconsistency Guidance for Super-resolution Video Quality Assessment

Yixiao Li, Xiaoyuan Yang, Weide Liu, Xin Jin, Xu Jia, Yukun Lai, Paul L Rosin, Haotao Liu, Wei Zhou

Main category: cs.CV

TL;DR: Proposes TIG-SVQA, a video quality assessment method specifically designed for super-resolution videos that emphasizes temporal inconsistency as a key quality indicator.

Details

Motivation: Super-resolution techniques introduce unique distortions different from traditional degradation, creating demand for specialized VQA methods. Temporal inconsistency is a critical factor affecting perceived quality but is rarely quantified in existing approaches.

Method: Designs a perception-oriented approach to quantify frame-wise temporal inconsistency, introduces Inconsistency Highlighted Spatial Module to localize inconsistent regions at multiple scales, and develops Inconsistency Guided Temporal Module with progressive temporal feature aggregation including consistency-aware fusion and informative filtering stages.

Result: Extensive experiments on both single-frame and multi-frame SR video scenarios demonstrate that the method significantly outperforms state-of-the-art VQA approaches.

Conclusion: Temporal inconsistency plays a critical role in guiding quality assessment of super-resolution videos, and the proposed TIG-SVQA framework effectively addresses this by explicitly modeling and leveraging temporal inconsistency for improved video quality assessment.

Abstract: As super-resolution (SR) techniques introduce unique distortions that fundamentally differ from those caused by traditional degradation processes (e.g., compression), there is an increasing demand for specialized video quality assessment (VQA) methods tailored to SR-generated content. One critical factor affecting perceived quality is temporal inconsistency, which refers to irregularities between consecutive frames. However, existing VQA approaches rarely quantify this phenomenon or explicitly investigate its relationship with human perception. Moreover, SR videos exhibit amplified inconsistency levels as a result of enhancement processes. In this paper, we propose \textit{Temporal Inconsistency Guidance for Super-resolution Video Quality Assessment (TIG-SVQA)} that underscores the critical role of temporal inconsistency in guiding the quality assessment of SR videos. We first design a perception-oriented approach to quantify frame-wise temporal inconsistency. Based on this, we introduce the Inconsistency Highlighted Spatial Module, which localizes inconsistent regions at both coarse and fine scales. Inspired by the human visual system, we further develop an Inconsistency Guided Temporal Module that performs progressive temporal feature aggregation: (1) a consistency-aware fusion stage in which a visual memory capacity block adaptively determines the information load of each temporal segment based on inconsistency levels, and (2) an informative filtering stage for emphasizing quality-related features. Extensive experiments on both single-frame and multi-frame SR video scenarios demonstrate that our method significantly outperforms state-of-the-art VQA approaches. The code is publicly available at https://github.com/Lighting-YXLI/TIG-SVQA-main.

[198] Automatic Extraction of Road Networks by using Teacher-Student Adaptive Structural Deep Belief Network and Its Application to Landslide Disaster

Shin Kamada, Takumi Ichimura

Main category: cs.CV

TL;DR: Proposed an adaptive DBN with ensemble learning for road network detection from aerial images, achieving 89% accuracy and demonstrating disaster response applications.

Details

Motivation: Road maps contain complex features requiring high representation power for accurate detection, and there's a need for rapid disaster response road detection.

Method: Used adaptive structural learning of RBM/DBN with neuron/layer generation algorithms, Teacher-Student ensemble learning, and lightweight implementation on edge devices.

Result: Detection accuracy improved from 40.0% to 89.0% on average across seven major cities, with successful application to landslide-affected road detection.

Conclusion: The adaptive DBN with ensemble learning provides effective road network detection and has practical applications in disaster response scenarios.

Abstract: An adaptive structural learning method of Restricted Boltzmann Machine (RBM) and Deep Belief Network (DBN) has been developed as one of prominent deep learning models. The neuron generation-annihilation algorithm in RBM and layer generation algorithm in DBN make an optimal network structure for given input during the learning. In this paper, our model is applied to an automatic recognition method of road network system, called RoadTracer. RoadTracer can generate a road map on the ground surface from aerial photograph data. A novel method of RoadTracer using the Teacher-Student based ensemble learning model of Adaptive DBN is proposed, since the road maps contain many complicated features so that a model with high representation power to detect should be required. The experimental results showed the detection accuracy of the proposed model was improved from 40.0% to 89.0% on average in the seven major cities among the test dataset. In addition, we challenged to apply our method to the detection of available roads when landslide by natural disaster is occurred, in order to rapidly obtain a way of transportation. For fast inference, a small size of the trained model was implemented on a small embedded edge device as lightweight deep learning. We reported the detection results for the satellite image before and after the rainfall disaster in Japan.

Yuxuan Li, Xiang Li, Yunheng Li, Yicheng Zhang, Yimian Dai, Qibin Hou, Ming-Ming Cheng, Jian Yang

Main category: cs.CV

TL;DR: SM3Det is a unified model for multi-modal and multi-task object detection in remote sensing, using sparse MoE backbone and dynamic optimization to handle different modalities and tasks effectively.

Details

Motivation: Current object detection models are limited to single datasets and modalities, missing shared knowledge across multi-modalities and restricting versatility in real-world applications.

Method: Proposes SM3Det with grid-level sparse MoE backbone for joint knowledge learning while preserving modality-specific features, plus consistency and synchronization optimization with dynamic learning rate adjustment.

Result: Extensive experiments show SM3Det outperforms specialized models on individual datasets, demonstrating effectiveness and generalizability across different modalities and tasks.

Conclusion: SM3Det successfully addresses multi-modal modeling trade-offs and multi-task optimization complexities, providing a unified solution for versatile remote sensing object detection scenarios.

Abstract: With the rapid advancement of remote sensing technology, high-resolution multi-modal imagery is now more widely accessible. Conventional Object detection models are trained on a single dataset, often restricted to a specific imaging modality and annotation format. However, such an approach overlooks the valuable shared knowledge across multi-modalities and limits the model’s applicability in more versatile scenarios. This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing, designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization. To address these, we establish a benchmark dataset and propose a unified model, SM3Det (Single Model for Multi-Modal datasets and Multi-Task object Detection). SM3Det leverages a grid-level sparse MoE backbone to enable joint knowledge learning while preserving distinct feature representations for different modalities. Furthermore, it integrates a consistency and synchronization optimization strategy using dynamic learning rate adjustment, allowing it to effectively handle varying levels of learning difficulty across modalities and tasks. Extensive experiments demonstrate SM3Det’s effectiveness and generalizability, consistently outperforming specialized models on individual datasets. The code is available at https://github.com/zcablii/SM3Det.

[200] Do Street View Imagery and Public Participation GIS align: Comparative Analysis of Urban Attractiveness

Milad Malekzadeh, Elias Willberg, Jussi Torkko, Silviya Korpilo, Kamyar Hasanzadeh, Olle Järv, Tuuli Toivonen

Main category: cs.CV

TL;DR: Study compares Street View Imagery (SVI) and Public Participation GIS (PPGIS) for assessing urban attractiveness, finding only partial alignment due to SVI’s inability to capture non-visual experiential factors.

Details

Motivation: To understand how different data sources reflect human experiences of urban environments and investigate the comparability between SVI-based perceived attractiveness and residents' reported experiences from PPGIS.

Method: Used participant-rated SVI data and semantic image segmentation to train ML model predicting perceived attractiveness, then compared predictions to PPGIS-identified locations using strict and moderate agreement criteria, analyzing contextual variables like noise, traffic, and land use.

Result: Partial alignment between datasets: 67% agreement for attractive places and 77% for unattractive places with moderate threshold, but only 27% and 29% respectively with strict threshold. Non-visual cues significantly contributed to mismatches.

Conclusion: SVI offers scalable visual proxy for urban perception but cannot fully substitute PPGIS’s experiential richness. Both methods are valuable for different purposes, requiring integrated approach to holistically capture urban perceptions.

Abstract: As digital tools increasingly shape spatial planning practices, understanding how different data sources reflect human experiences of urban environments is essential. Street View Imagery (SVI) and Public Participation GIS (PPGIS) represent two prominent approaches for capturing place-based perceptions that can support urban planning decisions, yet their comparability remains underexplored. This study investigates the alignment between SVI-based perceived attractiveness and residents’ reported experiences gathered via a city-wide PPGIS survey in Helsinki, Finland. Using participant-rated SVI data and semantic image segmentation, we trained a machine learning model to predict perceived attractiveness based on visual features. We compared these predictions to PPGIS-identified locations marked as attractive or unattractive, calculating agreement using two sets of strict and moderate criteria. Our findings reveal only partial alignment between the two datasets. While agreement (with a moderate threshold) reached 67% for attractive and 77% for unattractive places, agreement (with a strict threshold) dropped to 27% and 29%, respectively. By analysing a range of contextual variables, including noise, traffic, population presence, and land use, we found that non-visual cues significantly contributed to mismatches. The model failed to account for experiential dimensions such as activity levels and environmental stressors that shape perceptions but are not visible in images. These results suggest that while SVI offers a scalable and visual proxy for urban perception, it cannot fully substitute the experiential richness captured through PPGIS. We argue that both methods are valuable but serve different purposes; therefore, a more integrated approach is needed to holistically capture how people perceive urban environments.

[201] Fine-grained Image Retrieval via Dual-Vision Adaptation

Xin Jiang, Meiqi Cao, Hao Tang, Fei Shen, Zechao Li

Main category: cs.CV

TL;DR: DVA is a dual-adaptation approach for fine-grained image retrieval that modifies samples and features without retraining the pre-trained model, achieving strong performance with fewer parameters.

Details

Motivation: Current FGIR methods overfit training data and forget pre-trained knowledge, reducing generalization ability.

Method: Uses Object-Perceptual Adaptation to modify input samples and In-Context Adaptation for feature adjustment, plus Discrimination Perception Transfer for knowledge distillation.

Result: Performs well on three in-distribution and three out-of-distribution datasets with fewer learnable parameters.

Conclusion: DVA effectively guides frozen pre-trained models for FGIR through collaborative adaptation, maintaining generalization while being parameter-efficient.

Abstract: Fine-Grained Image Retrieval~(FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features. Current leading FGIR solutions typically follow two regimes: enforce pairwise similarity constraints in the semantic embedding space, or incorporate a localization sub-network to fine-tune the entire model. However, such two regimes tend to overfit the training data while forgetting the knowledge gained from large-scale pre-training, thus reducing their generalization ability. In this paper, we propose a Dual-Vision Adaptation (DVA) approach for FGIR, which guides the frozen pre-trained model to perform FGIR through collaborative sample and feature adaptation. Specifically, we design Object-Perceptual Adaptation, which modifies input samples to help the pre-trained model perceive critical objects and elements within objects that are helpful for category prediction. Meanwhile, we propose In-Context Adaptation, which introduces a small set of parameters for feature adaptation without modifying the pre-trained parameters. This makes the FGIR task using these adjusted features closer to the task solved during the pre-training. Additionally, to balance retrieval efficiency and performance, we propose Discrimination Perception Transfer to transfer the discriminative knowledge in the object-perceptual adaptation to the image encoder using the knowledge distillation mechanism. Extensive experiments show that DVA has fewer learnable parameters and performs well on three in-distribution and three out-of-distribution fine-grained datasets.

Xiaofei Wang, Stephen Price, Chao Li

Main category: cs.CV

TL;DR: C3-Diff is a cross-modal cross-content contrastive diffusion framework that enhances spatial transcriptomics resolution by integrating histology images with gene expression data using refined contrastive learning, noise-based feature augmentation, and dynamic cross-modal imputation.

Details

Motivation: Current spatial transcriptomics platforms suffer from low resolution, limiting understanding of spatial gene expression. Super-resolution approaches that integrate histology images with gene expressions show promise but face challenges in modeling interactions between these modalities.

Method: Proposes C3-Diff framework with: 1) refined contrastive learning to extract modal-invariant and content-invariant features, 2) noise-based information augmentation on feature hyperspheres to overcome low sequencing sensitivity, and 3) dynamic cross-modal imputation training to mitigate data scarcity.

Result: Significant improvements over competing methods on four public datasets. Effective performance on downstream tasks including cell type localization, gene expression correlation, and single-cell-level gene expression prediction.

Conclusion: C3-Diff successfully enhances spatial transcriptomics resolution and promotes AI-enhanced biotechnology for biomedical research and clinical applications.

Abstract: The rapid advancement of spatial transcriptomics (ST), i.e., spatial gene expressions, has made it possible to measure gene expression within original tissue, enabling us to discover molecular mechanisms. However, current ST platforms frequently suffer from low resolution, limiting the in-depth understanding of spatial gene expression. Super-resolution approaches promise to enhance ST maps by integrating histology images with gene expressions of profiled tissue spots. However, it remains a challenge to model the interactions between histology images and gene expressions for effective ST enhancement. This study presents a cross-modal cross-content contrastive diffusion framework, called C3-Diff, for ST enhancement with histology images as guidance. In C3-Diff, we firstly analyze the deficiency of traditional contrastive learning paradigm, which is then refined to extract both modal-invariant and content-invariant features of ST maps and histology images. Further, to overcome the problem of low sequencing sensitivity in ST maps, we perform nosing-based information augmentation on the surface of feature unit hypersphere. Finally, we propose a dynamic cross-modal imputation-based training strategy to mitigate ST data scarcity. We tested C3-Diff by benchmarking its performance on four public datasets, where it achieves significant improvements over competing methods. Moreover, we evaluate C3-Diff on downstream tasks of cell type localization, gene expression correlation and single-cell-level gene expression prediction, promoting AI-enhanced biotechnology for biomedical research and clinical applications. Codes are available at https://github.com/XiaofeiWang2018/C3-Diff.

[203] Video Text Preservation with Synthetic Text-Rich Videos

Ziyang Liu, Kevin Valencia, Justin Cui

Main category: cs.CV

TL;DR: A lightweight approach using synthetic supervision to improve text legibility in Text-To-Video models by fine-tuning pre-trained models with text-rich image animations.

Details

Motivation: Existing T2V models struggle with rendering legible and coherent text in videos, even for short phrases, and previous solutions are computationally expensive.

Method: Generate text-rich images using T2I diffusion model, animate them into videos using text-agnostic I2V model, then use these synthetic video-prompt pairs to fine-tune Wan2.1 T2V model without architectural changes.

Result: Improved short-text legibility and temporal consistency, with emerging structural priors for longer text generation.

Conclusion: Curated synthetic data and weak supervision provide a practical path to enhance textual fidelity in T2V generation.

Abstract: While Text-To-Video (T2V) models have advanced rapidly, they continue to struggle with generating legible and coherent text within videos. In particular, existing models often fail to render correctly even short phrases or words and previous attempts to address this problem are computationally expensive and not suitable for video generation. In this work, we investigate a lightweight approach to improve T2V diffusion models using synthetic supervision. We first generate text-rich images using a text-to-image (T2I) diffusion model, then animate them into short videos using a text-agnostic image-to-video (I2v) model. These synthetic video-prompt pairs are used to fine-tune Wan2.1, a pre-trained T2V model, without any architectural changes. Our results show improvement in short-text legibility and temporal consistency with emerging structural priors for longer text. These findings suggest that curated synthetic data and weak supervision offer a practical path toward improving textual fidelity in T2V generation.

[204] MCE: Towards a General Framework for Handling Missing Modalities under Imbalanced Missing Rates

Binyu Zhao, Wei Zhang, Zhaonian Zou

Main category: cs.CV

TL;DR: MCE addresses imbalanced missing modalities in multi-modal learning by enhancing both learning capability (LCE) and representation capability (RCE) to prevent degradation cycles and improve feature quality.

Details

Motivation: Existing methods fail to handle sample-level modality utility variations and degraded feature quality in imbalanced missing modality scenarios, where higher missing rates lead to fewer updates and representational degradation.

Method: Proposes Modality Capability Enhancement (MCE) with two components: Learning Capability Enhancement (LCE) uses multi-level factors to dynamically balance modality learning progress, and Representation Capability Enhancement (RCE) improves feature semantics through subset prediction and cross-modal completion tasks.

Result: Comprehensive evaluations on four multi-modal benchmarks show MCE consistently outperforms state-of-the-art methods under various missing modality configurations.

Conclusion: MCE effectively addresses the challenges of imbalanced missing modalities by simultaneously enhancing learning progress balancing and feature representation quality, demonstrating superior performance across multiple benchmarks.

Abstract: Multi-modal learning has made significant advances across diverse pattern recognition applications. However, handling missing modalities, especially under imbalanced missing rates, remains a major challenge. This imbalance triggers a vicious cycle: modalities with higher missing rates receive fewer updates, leading to inconsistent learning progress and representational degradation that further diminishes their contribution. Existing methods typically focus on global dataset-level balancing, often overlooking critical sample-level variations in modality utility and the underlying issue of degraded feature quality. We propose Modality Capability Enhancement (MCE) to tackle these limitations. MCE includes two synergistic components: i) Learning Capability Enhancement (LCE), which introduces multi-level factors to dynamically balance modality-specific learning progress, and ii) Representation Capability Enhancement (RCE), which improves feature semantics and robustness through subset prediction and cross-modal completion tasks. Comprehensive evaluations on four multi-modal benchmarks show that MCE consistently outperforms state-of-the-art methods under various missing configurations. The final published version is now available at https://doi.org/10.1016/j.patcog.2025.112591. Our code is available at https://github.com/byzhaoAI/MCE.

[205] Elements of Active Continuous Learning and Uncertainty Self-Awareness: a Narrow Implementation for Face and Facial Expression Recognition

Stanislav Selitskiy

Main category: cs.CV

TL;DR: A self-awareness mechanism using a supervising ANN to monitor uncertainty in a CNN ensemble for face recognition, triggering human assistance when predictions are unreliable.

Details

Motivation: To emulate reflection and self-correction in AI by modeling high-level intelligence concepts at the ML algorithm level, enabling systems to recognize their own uncertainty.

Method: A supervising ANN observes activation patterns of a CNN ensemble to detect high uncertainty, with memory storage of past performance and active learning that requests human help in uncertain conditions.

Result: The system can identify when its predictions are unreliable and proactively seek human assistance, demonstrating a form of self-awareness in narrow ML tasks.

Conclusion: Self-awareness mechanisms can be implemented even in narrow ML algorithms, enabling reflection on performance uncertainty and agency through active learning for improved reliability.

Abstract: Reflection on one’s thought process and making corrections to it if there exists dissatisfaction in its performance is, perhaps, one of the essential traits of intelligence. However, such high-level abstract concepts mandatory for Artificial General Intelligence can be modelled even at the low level of narrow Machine Learning algorithms. Here, we present the self-awareness mechanism emulation in the form of a supervising artificial neural network (ANN) observing patterns in activations of another underlying ANN in a search for indications of the high uncertainty of the underlying ANN and, therefore, the trustworthiness of its predictions. The underlying ANN is a convolutional neural network (CNN) ensemble employed for face recognition and facial expression tasks. The self-awareness ANN has a memory region where its past performance information is stored, and its learnable parameters are adjusted during the training to optimize the performance. The trustworthiness verdict triggers the active learning mode, giving elements of agency to the machine learning algorithm that asks for human help in high uncertainty and confusion conditions.

[206] DiffSwap++: 3D Latent-Controlled Diffusion for Identity-Preserving Face Swapping

Weston Bondurant, Arkaprava Sinha, Hieu Le, Srijan Das, Stephanie Schuckers

Main category: cs.CV

TL;DR: DiffSwap++ is a diffusion-based face swapping method that incorporates 3D facial latent features and facial landmarks to improve identity preservation and geometric consistency, outperforming prior methods.

Details

Motivation: Existing diffusion-based face swapping methods suffer from fine-grained artifacts and poor identity preservation, especially under challenging poses and expressions, due to insufficient use of 3D facial structure.

Method: Proposes a diffusion-based pipeline that incorporates 3D facial latent features during training and conditions the denoising process on both identity embeddings and facial landmarks.

Result: Outperforms prior methods on CelebA, FFHQ, and CelebV-Text datasets in preserving source identity while maintaining target pose and expression, validated through biometric-style evaluation and user study.

Conclusion: DiffSwap++ demonstrates that incorporating 3D-aware representations significantly enhances face swapping quality and identity preservation in diffusion models.

Abstract: Diffusion-based approaches have recently achieved strong results in face swapping, offering improved visual quality over traditional GAN-based methods. However, even state-of-the-art models often suffer from fine-grained artifacts and poor identity preservation, particularly under challenging poses and expressions. A key limitation of existing approaches is their failure to meaningfully leverage 3D facial structure, which is crucial for disentangling identity from pose and expression. In this work, we propose DiffSwap++, a novel diffusion-based face-swapping pipeline that incorporates 3D facial latent features during training. By guiding the generation process with 3D-aware representations, our method enhances geometric consistency and improves the disentanglement of facial identity from appearance attributes. We further design a diffusion architecture that conditions the denoising process on both identity embeddings and facial landmarks, enabling high-fidelity and identity-preserving face swaps. Extensive experiments on CelebA, FFHQ, and CelebV-Text demonstrate that DiffSwap++ outperforms prior methods in preserving source identity while maintaining target pose and expression. Additionally, we introduce a biometric-style evaluation and conduct a user study to further validate the realism and effectiveness of our approach. Code will be made publicly available at https://github.com/WestonBond/DiffSwapPP

[207] Beyond Softmax: Dual-Branch Sigmoid Architecture for Accurate Class Activation Maps

Yoojin Oh, Junhyug Noh

Main category: cs.CV

TL;DR: Proposes a dual-branch sigmoid head to fix distortions in CAM methods by decoupling localization from classification, preserving feature magnitude and sign while maintaining classification accuracy.

Details

Motivation: CAM methods suffer from additive logit shifts and sign collapse due to reliance on softmax classifiers, which conflate excitatory and inhibitory features and arbitrarily bias importance scores.

Method: Clone classification head into parallel sigmoid branch, freeze original softmax head, fine-tune only sigmoid branch with class-balanced binary supervision, and generate evidence maps from sigmoid branch.

Result: Improved explanation fidelity and consistent Top-1 Localization gains on fine-grained tasks and WSOL benchmarks without classification accuracy drop.

Conclusion: The dual-branch sigmoid approach effectively decouples localization from classification, fixing fundamental CAM distortions while maintaining recognition performance.

Abstract: Class Activation Mapping (CAM) and its extensions have become indispensable tools for visualizing the evidence behind deep network predictions. However, by relying on a final softmax classifier, these methods suffer from two fundamental distortions: additive logit shifts that arbitrarily bias importance scores, and sign collapse that conflates excitatory and inhibitory features. We propose a simple, architecture-agnostic dual-branch sigmoid head that decouples localization from classification. Given any pretrained model, we clone its classification head into a parallel branch ending in per-class sigmoid outputs, freeze the original softmax head, and fine-tune only the sigmoid branch with class-balanced binary supervision. At inference, softmax retains recognition accuracy, while class evidence maps are generated from the sigmoid branch – preserving both magnitude and sign of feature contributions. Our method integrates seamlessly with most CAM variants and incurs negligible overhead. Extensive evaluations on fine-grained tasks (CUB-200-2011, Stanford Cars) and WSOL benchmarks (ImageNet-1K, OpenImages30K) show improved explanation fidelity and consistent Top-1 Localization gains – without any drop in classification accuracy. Code is available at https://github.com/finallyupper/beyond-softmax.

[208] Google-MedGemma Based Abnormality Detection in Musculoskeletal radiographs

Soumyajit Maity, Pranjal Kamboj, Sneha Maity, Rajat Singh, Sankhadeep Chatterjee

Main category: cs.CV

TL;DR: A MedGemma-based framework for automatic abnormality detection in musculoskeletal radiographs that outperforms conventional methods using transfer learning from medical foundation models.

Details

Motivation: To improve abnormality detection in musculoskeletal radiographs by leveraging modern medical foundation models instead of conventional autoencoder and neural network pipelines.

Method: Uses MedGemma foundation model with SigLIP-derived vision encoder to encode X-ray images into embeddings, followed by a lightweight multilayer perceptron for binary classification. Employs selective encoder block unfreezing for efficient domain adaptation.

Result: The MedGemma-driven classifier exhibits strong performance, exceeding conventional convolutional and autoencoder-based metrics, with enhanced generalization and optimized feature engineering.

Conclusion: MedGemma-powered classification systems can advance clinical radiograph triage by providing scalable and accurate abnormality detection, with potential for broader applications in automated medical image analysis.

Abstract: This paper proposes a MedGemma-based framework for automatic abnormality detection in musculoskeletal radiographs. Departing from conventional autoencoder and neural network pipelines, the proposed method leverages the MedGemma foundation model, incorporating a SigLIP-derived vision encoder pretrained on diverse medical imaging modalities. Preprocessed X-ray images are encoded into high-dimensional embeddings using the MedGemma vision backbone, which are subsequently passed through a lightweight multilayer perceptron for binary classification. Experimental assessment reveals that the MedGemma-driven classifier exhibits strong performance, exceeding conventional convolutional and autoencoder-based metrics. Additionally, the model leverages MedGemma’s transfer learning capabilities, enhancing generalization and optimizing feature engineering. The integration of a modern medical foundation model not only enhances representation learning but also facilitates modular training strategies such as selective encoder block unfreezing for efficient domain adaptation. The findings suggest that MedGemma-powered classification systems can advance clinical radiograph triage by providing scalable and accurate abnormality detection, with potential for broader applications in automated medical image analysis. Keywords: Google MedGemma, MURA, Medical Image, Classification.

[209] Enhancing Diffusion Model Guidance through Calibration and Regularization

Seyed Alireza Javid, Amirhossein Bagheri, Nuria González-Prelcic

Main category: cs.CV

TL;DR: This paper addresses the issue of overconfident predictions in classifier-guided diffusion models by proposing calibration methods and enhanced sampling guidance that improve image generation quality without requiring diffusion model retraining.

Details

Motivation: Classifier-guided diffusion models suffer from overconfident predictions during early denoising steps, causing guidance gradients to vanish and limiting their effectiveness in conditional image generation.

Method: Two complementary approaches: 1) Differentiable calibration objective using Smooth Expected Calibration Error for classifier fine-tuning, 2) Enhanced sampling guidance methods including tilted sampling with batch-level reweighting, adaptive entropy-regularized sampling, and novel f-divergence-based sampling strategy.

Result: Achieved FID of 2.13 on ImageNet 128x128 using ResNet-101 classifier, improving upon existing classifier-guided diffusion methods while requiring no diffusion model retraining.

Conclusion: Principled calibration and divergence-aware sampling provide practical and effective improvements for classifier-guided diffusion models, addressing the overconfidence issue and enhancing conditional image generation.

Abstract: Classifier-guided diffusion models have emerged as a powerful approach for conditional image generation, but they suffer from overconfident predictions during early denoising steps, causing the guidance gradient to vanish. This paper introduces two complementary contributions to address this issue. First, we propose a differentiable calibration objective based on the Smooth Expected Calibration Error (Smooth ECE), which improves classifier calibration with minimal fine-tuning and yields measurable improvements in Frechet Inception Distance (FID). Second, we develop enhanced sampling guidance methods that operate on off-the-shelf classifiers without requiring retraining. These include tilted sampling with batch-level reweighting, adaptive entropy-regularized sampling to preserve diversity, and a novel f-divergence-based sampling strategy that strengthens class-consistent guidance while maintaining mode coverage. Experiments on ImageNet 128x128 demonstrate that our divergence-regularized guidance achieves an FID of 2.13 using a ResNet-101 classifier, improving upon existing classifier-guided diffusion methods while requiring no diffusion model retraining. The results show that principled calibration and divergence-aware sampling provide practical and effective improvements for classifier-guided diffusion.

[210] In-process 3D Deviation Mapping and Defect Monitoring (3D-DM2) in High Production-rate Robotic Additive Manufacturing

Subash Gautam, Alejandro Vargas-Uscategui, Peter King, Hans Lohr, Alireza Bab-Hadiashar, Ivan Cole, Ehsan Asadi

Main category: cs.CV

TL;DR: Real-time monitoring system for detecting shape deviations in high deposition rate robotic additive manufacturing processes like cold spray to maintain part quality.

Details

Motivation: Maintaining shape accuracy is challenging in open-loop high deposition rate AM systems due to process instabilities, requiring real-time deviation detection to prevent error propagation and ensure quality.

Method: Developed a real-time monitoring system that acquires and reconstructs the growing part during manufacturing, then compares it with a near-net reference model to detect shape deviations.

Result: The system enables early identification of shape inconsistencies and allows segmentation and tracking of deviation regions for timely intervention.

Conclusion: Real-time shape deviation monitoring paves the way for timely compensation and intervention to achieve consistent part quality in high deposition rate additive manufacturing processes.

Abstract: Additive manufacturing (AM) is an emerging digital manufacturing technology to produce complex and freeform objects through a layer-wise deposition. High deposition rate robotic AM (HDRRAM) processes, such as cold spray additive manufacturing (CSAM), offer significantly increased build speeds by delivering large volumes of material per unit time. However, maintaining shape accuracy remains a critical challenge, particularly due to process instabilities in current open-loop systems. Detecting these deviations as they occur is essential to prevent error propagation, ensure part quality, and minimize post-processing requirements. This study presents a real-time monitoring system to acquire and reconstruct the growing part and directly compares it with a near-net reference model to detect the shape deviation during the manufacturing process. The early identification of shape inconsistencies, followed by segmenting and tracking each deviation region, paves the way for timely intervention and compensation to achieve consistent part quality.

[211] Walking the Schrödinger Bridge: A Direct Trajectory for Text-to-3D Generation

Ziying Li, Xuequan Lu, Xinkui Zhao, Guanjie Cheng, Shuiguang Deng, Jianwei Yin

Main category: cs.CV

TL;DR: TraCe is a novel text-to-3D generation framework that addresses artifacts in current methods by formulating generation as learning optimal transport trajectories between rendering distributions, achieving superior quality with smaller guidance values.

Details

Motivation: Current optimization-based text-to-3D generation methods using Score Distillation Sampling (SDS) introduce artifacts like over-saturation and over-smoothing in generated 3D assets.

Method: Theoretical establishment of SDS as a simplified Schrödinger Bridge instance, then introducing TraCe framework that explicitly constructs diffusion bridges from current renderings to text-conditioned targets and trains LoRA-adapted models on trajectory score dynamics.

Result: TraCe consistently achieves superior quality and fidelity compared to state-of-the-art techniques in comprehensive experiments.

Conclusion: The TraCe framework successfully addresses artifact issues in text-to-3D generation by leveraging optimal transport trajectory learning through the Schrödinger Bridge formulation.

Abstract: Recent advancements in optimization-based text-to-3D generation heavily rely on distilling knowledge from pre-trained text-to-image diffusion models using techniques like Score Distillation Sampling (SDS), which often introduce artifacts such as over-saturation and over-smoothing into the generated 3D assets. In this paper, we address this essential problem by formulating the generation process as learning an optimal, direct transport trajectory between the distribution of the current rendering and the desired target distribution, thereby enabling high-quality generation with smaller Classifier-free Guidance (CFG) values. At first, we theoretically establish SDS as a simplified instance of the Schrödinger Bridge framework. We prove that SDS employs the reverse process of an Schrödinger Bridge, which, under specific conditions (e.g., a Gaussian noise as one end), collapses to SDS’s score function of the pre-trained diffusion model. Based upon this, we introduce Trajectory-Centric Distillation (TraCe), a novel text-to-3D generation framework, which reformulates the mathematically trackable framework of Schrödinger Bridge to explicitly construct a diffusion bridge from the current rendering to its text-conditioned, denoised target, and trains a LoRA-adapted model on this trajectory’s score dynamics for robust 3D optimization. Comprehensive experiments demonstrate that TraCe consistently achieves superior quality and fidelity to state-of-the-art techniques.

[212] Pose-Aware Multi-Level Motion Parsing for Action Quality Assessment

Shuaikang Zhu, Yang Yang, Chen Sun

Main category: cs.CV

TL;DR: A multi-level motion parsing framework for action quality assessment using enhanced spatial-temporal pose features, achieving state-of-the-art performance in diving sports.

Details

Motivation: Human pose variations are crucial for action quality assessment, where subtle spatial-temporal differences determine scoring in high-level competitions.

Method: Three-level framework: Action-Unit Parser for segmentation and pose representations, Motion Parser for spatial-temporal feature learning, and Condition Parser for external factors like water splash. Includes Weight-Adjust Scoring Module for diverse action types.

Result: Extensive evaluations on large-scale diving sports datasets show state-of-the-art performance in both action segmentation and scoring tasks.

Conclusion: The proposed multi-level motion parsing framework effectively captures nuanced pose variations and external conditions for superior action quality assessment in sports.

Abstract: Human pose serves as a cornerstone of action quality assessment (AQA), where subtle spatial-temporal variations in pose often distinguish excellence from mediocrity. In high-level competitions, these nuanced differences become decisive factors in scoring. In this paper, we propose a novel multi-level motion parsing framework for AQA based on enhanced spatial-temporal pose features. On the first level, the Action-Unit Parser is designed with the help of pose extraction to achieve precise action segmentation and comprehensive local-global pose representations. On the second level, Motion Parser is used by spatial-temporal feature learning to capture pose changes and appearance details for each action-unit. Meanwhile, some special conditions other than body-related will impact action scoring, like water splash in diving. In this work, we design an additional Condition Parser to offer users more flexibility in their choices. Finally, Weight-Adjust Scoring Module is introduced to better accommodate the diverse requirements of various action types and the multi-scale nature of action-units. Extensive evaluations on large-scale diving sports datasets demonstrate that our multi-level motion parsing framework achieves state-of-the-art performance in both action segmentation and action scoring tasks.

[213] Personalized Image Editing in Text-to-Image Diffusion Models via Collaborative Direct Preference Optimization

Connor Dunlop, Matthew Zheng, Kavana Venkatesh, Pinar Yanardag

Main category: cs.CV

TL;DR: C-DPO is a personalized image editing framework for diffusion models that aligns edits with individual user preferences using collaborative signals from like-minded users via graph neural networks.

Details

Motivation: Current T2I diffusion models are generic and fail to adapt to individual users' nuanced aesthetic preferences, creating a need for personalized editing capabilities.

Method: Collaborative Direct Preference Optimization (C-DPO) encodes users as nodes in a dynamic preference graph, learns embeddings via graph neural networks, and integrates these into a novel DPO objective that jointly optimizes for individual alignment and neighborhood coherence.

Result: Comprehensive experiments including user studies and quantitative benchmarks show the method consistently outperforms baselines in generating edits aligned with user preferences.

Conclusion: The framework successfully enables personalized image editing in diffusion models by leveraging collaborative signals and graph-based user representations.

Abstract: Text-to-image (T2I) diffusion models have made remarkable strides in generating and editing high-fidelity images from text. Yet, these models remain fundamentally generic, failing to adapt to the nuanced aesthetic preferences of individual users. In this work, we present the first framework for personalized image editing in diffusion models, introducing Collaborative Direct Preference Optimization (C-DPO), a novel method that aligns image edits with user-specific preferences while leveraging collaborative signals from like-minded individuals. Our approach encodes each user as a node in a dynamic preference graph and learns embeddings via a lightweight graph neural network, enabling information sharing across users with overlapping visual tastes. We enhance a diffusion model’s editing capabilities by integrating these personalized embeddings into a novel DPO objective, which jointly optimizes for individual alignment and neighborhood coherence. Comprehensive experiments, including user studies and quantitative benchmarks, demonstrate that our method consistently outperforms baselines in generating edits that are aligned with user preferences.

[214] Convolutional Fully-Connected Capsule Network (CFC-CapsNet): A Novel and Fast Capsule Network

Pouya Shiri, Amirali Baniasadi

Main category: cs.CV

TL;DR: CFC-CapsNet improves Capsule Networks by introducing a new CFC layer that creates fewer but more powerful capsules, achieving better accuracy, faster training/inference, and fewer parameters on complex datasets.

Details

Motivation: CapsNet performs well on simple datasets but fails on complex ones, is slower than CNNs, and uses more parameters. The goal is to address these limitations while maintaining CapsNet's advantages.

Method: Introduces a Convolutional Fully-Connected Capsule Network (CFC-CapsNet) with a new CFC layer that creates capsules differently, producing fewer but more powerful capsules.

Result: CFC-CapsNet achieves competitive accuracy, faster training and inference, and uses fewer parameters on CIFAR-10, SVHN, and Fashion-MNIST datasets compared to conventional CapsNet.

Conclusion: The CFC-CapsNet successfully addresses CapsNet’s limitations by creating more efficient capsules, making it more suitable for complex datasets and real applications.

Abstract: A Capsule Network (CapsNet) is a relatively new classifier and one of the possible successors of Convolutional Neural Networks (CNNs). CapsNet maintains the spatial hierarchies between the features and outperforms CNNs at classifying images including overlapping categories. Even though CapsNet works well on small-scale datasets such as MNIST, it fails to achieve a similar level of performance on more complicated datasets and real applications. In addition, CapsNet is slow compared to CNNs when performing the same task and relies on a higher number of parameters. In this work, we introduce Convolutional Fully-Connected Capsule Network (CFC-CapsNet) to address the shortcomings of CapsNet by creating capsules using a different method. We introduce a new layer (CFC layer) as an alternative solution to creating capsules. CFC-CapsNet produces fewer, yet more powerful capsules resulting in higher network accuracy. Our experiments show that CFC-CapsNet achieves competitive accuracy, faster training and inference and uses less number of parameters on the CIFAR-10, SVHN and Fashion-MNIST datasets compared to conventional CapsNet.

[215] Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition

Nicholas Babey, Tiffany Gu, Yiheng Li, Cristian Meo, Kevin Zhu

Main category: cs.CV

TL;DR: Proposes a model that fuses V-JEPA 2’s world dynamics with CoMotion’s human pose data for action recognition, achieving superior performance in occluded scenes.

Details

Motivation: Current RGB-based action recognition models learn superficial correlations and struggle with physical interaction dynamics and human poses in complex scenes.

Method: Fuses V-JEPA 2’s contextual, predictive world dynamics with CoMotion’s explicit, occlusion-tolerant human pose data.

Result: Outperforms three baselines on InHARD and UCF-19-Y-OCC benchmarks, especially in complex, occlusive scenes.

Conclusion: Action recognition should be supported by spatial understanding rather than statistical pattern recognition.

Abstract: For embodied agents to effectively understand and interact within the world around them, they require a nuanced comprehension of human actions grounded in physical space. Current action recognition models, often relying on RGB video, learn superficial correlations between patterns and action labels, so they struggle to capture underlying physical interaction dynamics and human poses in complex scenes. We propose a model architecture that grounds action recognition in physical space by fusing two powerful, complementary representations: V-JEPA 2’s contextual, predictive world dynamics and CoMotion’s explicit, occlusion-tolerant human pose data. Our model is validated on both the InHARD and UCF-19-Y-OCC benchmarks for general action recognition and high-occlusion action recognition, respectively. Our model outperforms three other baselines, especially within complex, occlusive scenes. Our findings emphasize a need for action recognition to be supported by spatial understanding instead of statistical pattern recognition.

[216] Registration-Free Monitoring of Unstructured Point Cloud Data via Intrinsic Geometrical Properties

Mariafrancesca Patalano, Giovanna Capizzi, Kamran Paynabar

Main category: cs.CV

TL;DR: A registration-free approach for monitoring point cloud data of complex shapes using intrinsic geometric features from Laplacian and geodesic distances, eliminating preprocessing steps.

Details

Motivation: Traditional preprocessing steps like registration and mesh reconstruction for point cloud monitoring are error-prone, time-consuming, and can introduce artifacts that affect monitoring outcomes.

Method: Two feature learning methods using Laplacian and geodesic distances to capture intrinsic geometric properties, combined with thresholding techniques to select features indicative of defects.

Result: Numerical experiments and case studies demonstrate effective identification of different types of defects in complex shapes.

Conclusion: The proposed registration-free approach successfully monitors point cloud data without preprocessing, using intrinsic geometric features for defect detection.

Abstract: Modern sensing technologies have enabled the collection of unstructured point cloud data (PCD) of varying sizes, which are used to monitor the geometric accuracy of 3D objects. PCD are widely applied in advanced manufacturing processes, including additive, subtractive, and hybrid manufacturing. To ensure the consistency of analysis and avoid false alarms, preprocessing steps such as registration and mesh reconstruction are commonly applied prior to monitoring. However, these steps are error-prone, time-consuming and may introduce artifacts, potentially affecting monitoring outcomes. In this paper, we present a novel registration-free approach for monitoring PCD of complex shapes, eliminating the need for both registration and mesh reconstruction. Our proposal consists of two alternative feature learning methods and a common monitoring scheme. Feature learning methods leverage intrinsic geometric properties of the shape, captured via the Laplacian and geodesic distances. In the monitoring scheme, thresholding techniques are used to further select intrinsic features most indicative of potential out-of-control conditions. Numerical experiments and case studies highlight the effectiveness of the proposed approach in identifying different types of defects.

Sina Malakouti, Boqing Gong, Adriana Kovashka

Main category: cs.CV

TL;DR: CULTIVate is a benchmark for evaluating cultural biases in text-to-image models, focusing on cross-cultural activities across 16 countries with 576 prompts and 19,000+ images, using explainable metrics for cultural alignment.

Details

Motivation: T2I diffusion models inherit cultural biases from training data and fail to faithfully depict underrepresented regions, with existing benchmarks focusing mainly on object-centric categories rather than social activities that better reflect cultural norms.

Method: Developed CULTIVate benchmark spanning 16 countries with 576 prompts and 19,000+ images, using an explainable descriptor-based evaluation framework across cultural dimensions (background, attire, objects, interactions) with four metrics for cultural alignment, hallucination, exaggerated elements, and diversity.

Result: Found systematic disparities: models perform better for global north countries than global south, with distinct failure modes across T2I systems. Human studies confirmed metrics correlate more strongly with human judgments than existing text-image metrics.

Conclusion: CULTIVate provides an effective framework for measuring cultural faithfulness in T2I models, revealing significant biases and offering improved evaluation metrics that better align with human perception of cultural representation.

Abstract: Text-to-image (T2I) diffusion models achieve impressive photorealism by training on large-scale web data, but models inherit cultural biases and fail to depict underrepresented regions faithfully. Existing cultural benchmarks focus mainly on object-centric categories (e.g., food, attire, and architecture), overlooking the social and daily activities that more clearly reflect cultural norms. Few metrics exist for measuring cultural faithfulness. We introduce CULTIVate, a benchmark for evaluating T2I models on cross-cultural activities (e.g., greetings, dining, games, traditional dances, and cultural celebrations). CULTIVate spans 16 countries with 576 prompts and more than 19,000 images, and provides an explainable descriptor-based evaluation framework across multiple cultural dimensions, including background, attire, objects, and interactions. We propose four metrics to measure cultural alignment, hallucination, exaggerated elements, and diversity. Our findings reveal systematic disparities: models perform better for global north countries than for the global south, with distinct failure modes across T2I systems. Human studies confirm that our metrics correlate more strongly with human judgments than existing text-image metrics.

[218] VMDT: Decoding the Trustworthiness of Video Foundation Models

Yujin Potter, Zhun Wang, Nicholas Crispino, Kyle Montgomery, Alexander Xiong, Ethan Y. Chang, Francesco Pinto, Yuqi Chen, Rahul Gupta, Morteza Ziyadi, Christos Christodoulopoulos, Bo Li, Chenguang Wang, Dawn Song

Main category: cs.CV

TL;DR: VMDT is the first comprehensive benchmark for evaluating trustworthiness in video foundation models across safety, hallucination, fairness, privacy, and adversarial robustness dimensions.

Details

Motivation: Video modality lacks comprehensive trustworthiness benchmarks compared to text and image modalities, despite the growing sophistication of foundation models.

Method: Developed VMDT platform to evaluate 7 text-to-video and 19 video-to-text models across five trustworthiness dimensions using systematic testing framework.

Result: Open-source T2V models fail to recognize harmful queries and generate harmful videos with higher unfairness than image models. V2T models show unfairness and privacy risks increase with scale, while hallucination and robustness improve but remain low overall.

Conclusion: Urgent need for more robust video foundation models; VMDT provides systematic framework for measuring and tracking trustworthiness progress in video modality.

Abstract: As foundation models become more sophisticated, ensuring their trustworthiness becomes increasingly critical; yet, unlike text and image, the video modality still lacks comprehensive trustworthiness benchmarks. We introduce VMDT (Video-Modal DecodingTrust), the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models across five key trustworthiness dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. Through our extensive evaluation of 7 T2V models and 19 V2T models using VMDT, we uncover several significant insights. For instance, all open-source T2V models evaluated fail to recognize harmful queries and often generate harmful videos, while exhibiting higher levels of unfairness compared to image modality models. In V2T models, unfairness and privacy risks rise with scale, whereas hallucination and adversarial robustness improve – though overall performance remains low. Uniquely, safety shows no correlation with model size, implying that factors other than scale govern current safety levels. Our findings highlight the urgent need for developing more robust and trustworthy video foundation models, and VMDT provides a systematic framework for measuring and tracking progress toward this goal. The code is available at https://sunblaze-ucb.github.io/VMDT-page/.

[219] Pedicle Screw Pairing and Registration for Screw Pose Estimation from Dual C-arm Images Using CAD Models

Yehyun Suh, Lin Li, Aric Plumley, Chaochao Zhou, Daniel Moyer, Kongbin Kang

Main category: cs.CV

TL;DR: A method for pedicle screw correspondence and pose estimation from dual C-arm images that compares screw combinations and uses 2D-3D alignment with CAD models to accurately pair and estimate screw pose.

Details

Motivation: Accurate matching of pedicle screws in AP and lateral images is critical for spinal surgery success, but establishing screw correspondence, especially in lateral views, remains a significant clinical challenge.

Method: Compares screw combinations and employs 2D-3D alignment with screw CAD 3D models to accurately pair and estimate screw pose from dual C-arm views.

Result: Correct screw combination consistently outperforms incorrect pairings across all test cases, even before registration. After registration, correct combination further enhances alignment and significantly reduces projection error.

Conclusion: This approach shows promise for improving surgical outcomes in spinal procedures by providing reliable feedback on screw positioning.

Abstract: Accurate matching of pedicle screws in both anteroposterior (AP) and lateral (LAT) images is critical for successful spinal decompression and stabilization during surgery. However, establishing screw correspondence, especially in LAT views, remains a significant clinical challenge. This paper introduces a method to address pedicle screw correspondence and pose estimation from dual C-arm images. By comparing screw combinations, the approach demonstrates consistent accuracy in both pairing and registration tasks. The method also employs 2D-3D alignment with screw CAD 3D models to accurately pair and estimate screw pose from dual views. Our results show that the correct screw combination consistently outperforms incorrect pairings across all test cases, even prior to registration. After registration, the correct combination further enhances alignment between projections and images, significantly reducing projection error. This approach shows promise for improving surgical outcomes in spinal procedures by providing reliable feedback on screw positioning.

[220] Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale

David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu, Prithviraj Ammanabrolu, Hyunwoo Kim, Yuan-Hong Liao, Yejin Choi

Main category: cs.CV

TL;DR: A new framework for generating large-scale vision-centric reasoning datasets with over 1M synthetic questions, showing strong performance across multiple benchmarks and positive transfer to other modalities.

Details

Motivation: Address the lack of systematic approaches for building large-scale vision-centric reasoning datasets beyond visual math, given that recent progress relies on undisclosed datasets and proprietary synthesis methods.

Method: Two-stage synthesis framework: (1) scale and (2) complexity, using VLMs and reasoning LLMs to generate CoT traces. Includes preference data and instruction prompts supporting offline/online RL.

Result: Qwen2.5-VL-7B finetuned on this data outperforms all open-data baselines across vision benchmarks, surpasses MiMo-VL-7B-RL on V* Bench, CV-Bench and MMStar-V, and shows positive transfer to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU).

Conclusion: High-quality SFT data with non-linear reasoning traces is essential for effective online RL; staged offline RL matches online RL performance with less compute; careful SFT improves cross-modality transfer.

Abstract: Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale; and (2) complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the richness and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench and MMStar-V. Perhaps most surprising, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU), demonstrating its effectiveness. Similarly, despite not containing videos or embodied visual data, we observe notable gains when evaluating on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL’s performance while reducing compute demands, and (iii) careful SFT on high quality data can substantially improve out-of-domain, cross-modality transfer.

[221] Towards Better Ultrasound Video Segmentation Foundation Model: An Empirical study on SAM2 Finetuning from Data Perspective

Xing Yao, Ahana Gangopadhyay, Hsi-Ming Chang, Ravi Soni

Main category: cs.CV

TL;DR: Comprehensive data-centric analysis of SAM2 adaptation for ultrasound video segmentation, showing data scale and temporal context are more important than model architecture.

Details

Motivation: SAM2 foundation models perform poorly on medical imaging due to domain gap, and current adaptation studies focus on architecture rather than data characteristics and training regimes.

Method: Analyzed training-set size, video duration, and augmentation schemes under three paradigms (fine-tuning, intermediate adaptation, multi-task training) across five SAM2 variants and multiple prompting modes, with six ultrasound-specific augmentations.

Result: Data scale and temporal context play more decisive roles than model architecture or initialization. Joint training offers efficient compromise between modality alignment and task specialization.

Conclusion: Provides empirical insights for developing efficient, data-aware adaptation pipelines for SAM2 in ultrasound video analysis, emphasizing data-centric approaches over architectural modifications.

Abstract: Ultrasound (US) video segmentation remains a challenging problem due to strong inter- and intra-dataset variability, motion artifacts, and limited annotated data. Although foundation models such as Segment Anything Model 2 (SAM2) demonstrate strong zero-shot and prompt-guided segmentation capabilities, their performance deteriorates substantially when transferred to medical imaging domains. Current adaptation studies mainly emphasize architectural modifications, while the influence of data characteristics and training regimes has not been systematically examined. In this study, we present a comprehensive, data-centric investigation of SAM2 adaptation for ultrasound video segmentation. We analyze how training-set size, video duration, and augmentation schemes affect adaptation performance under three paradigms: task-specific fine-tuning, intermediate adaptation, and multi-task joint training, across five SAM2 variants and multiple prompting modes. We further design six ultrasound-specific augmentations, assessing their effect relative to generic strategies. Experiments on three representative ultrasound datasets reveal that data scale and temporal context play a more decisive role than model architecture or initialization. Moreover, joint training offers an efficient compromise between modality alignment and task specialization. This work aims to provide empirical insights for developing efficient, data-aware adaptation pipelines for SAM2 in ultrasound video analysis.

[222] A Second-Order Attention Mechanism For Prostate Cancer Segmentation and Detection in Bi-Parametric MRI

Mateo Ortiz, Juan Olmos, Fabio Martínez

Main category: cs.CV

TL;DR: A second-order geometric attention (SOGA) mechanism on Riemannian manifold is proposed to guide prostate cancer segmentation networks, achieving superior performance on PI-CAI dataset and robust generalization on independent test data.

Details

Motivation: Current deep learning approaches for clinically significant prostate cancer detection from MRI are limited by reliance on extensive annotations and high lesion variability across prostate zones, even for expert radiologists.

Method: Proposed a second-order geometric attention (SOGA) mechanism modeled on Riemannian manifold using symmetric positive definitive representations, integrated into U-Net and nnU-Net backbones through skip connections.

Result: Achieved AP of 0.37 and AUC-ROC of 0.83 on PI-CAI dataset, outperforming baselines. On independent Prostate158 dataset, achieved AP of 0.37 and AUC-ROC of 0.75, confirming robust generalization.

Conclusion: The SOGA mechanism enables discriminative learned representations for prostate cancer detection, demonstrating superior performance and strong generalization capabilities across different datasets.

Abstract: The detection of clinically significant prostate cancer lesions (csPCa) from biparametric magnetic resonance imaging (bp-MRI) has emerged as a noninvasive imaging technique for improving accurate diagnosis. Nevertheless, the analysis of such images remains highly dependent on the subjective expert interpretation. Deep learning approaches have been proposed for csPCa lesions detection and segmentation, but they remain limited due to their reliance on extensively annotated datasets. Moreover, the high lesion variability across prostate zones poses additional challenges, even for expert radiologists. This work introduces a second-order geometric attention (SOGA) mechanism that guides a dedicated segmentation network, through skip connections, to detect csPCa lesions. The proposed attention is modeled on the Riemannian manifold, learning from symmetric positive definitive (SPD) representations. The proposed mechanism was integrated into standard U-Net and nnU-Net backbones, and was validated on the publicly available PI-CAI dataset, achieving an Average Precision (AP) of 0.37 and an Area Under the ROC Curve (AUC-ROC) of 0.83, outperforming baseline networks and attention-based methods. Furthermore, the approach was evaluated on the Prostate158 dataset as an independent test cohort, achieving an AP of 0.37 and an AUC-ROC of 0.75, confirming robust generalization and suggesting discriminative learned representations.

[223] Sign language recognition from skeletal data using graph and recurrent neural networks

B. Mederos, J. Mejía, A. Medina-Reyes, Y. Espinosa-Almeyda, J. D. Díaz-Roman, I. Rodríguez-Mederos, M. Mejía-Carreon, F. Gonzalez-Lopez

Main category: cs.CV

TL;DR: A Graph-GRU temporal network for recognizing isolated sign language gestures using skeleton pose data, achieving high accuracy on the AUTSL dataset.

Details

Motivation: To develop an effective method for sign language recognition that leverages pose data and models both spatial and temporal dependencies for accurate gesture classification.

Method: Proposed a Graph-GRU temporal network that integrates graph-based spatial representations with temporal modeling using skeleton pose data extracted from video sequences.

Result: Achieved high accuracy on the AUTSL (Ankara University Turkish Sign Language) dataset, demonstrating the effectiveness of the approach.

Conclusion: The proposed pose-driven method shows strong potential for sign language understanding and provides a scalable framework for sign language recognition.

Abstract: This work presents an approach for recognizing isolated sign language gestures using skeleton-based pose data extracted from video sequences. A Graph-GRU temporal network is proposed to model both spatial and temporal dependencies between frames, enabling accurate classification. The model is trained and evaluated on the AUTSL (Ankara university Turkish sign language) dataset, achieving high accuracy. Experimental results demonstrate the effectiveness of integrating graph-based spatial representations with temporal modeling, providing a scalable framework for sign language recognition. The results of this approach highlight the potential of pose-driven methods for sign language understanding.

[224] TCSA-UDA: Text-Driven Cross-Semantic Alignment for Unsupervised Domain Adaptation in Medical Image Segmentation

Lalit Maurya, Honghai Liu, Reyer Zwiggelaar

Main category: cs.CV

TL;DR: TCSA-UDA is a text-driven cross-semantic alignment framework for unsupervised domain adaptation in medical image segmentation that uses domain-invariant textual class descriptions to guide visual representation learning and reduce domain shifts across imaging modalities like CT and MRI.

Details

Motivation: Unsupervised domain adaptation for medical image segmentation faces challenges due to substantial domain shifts across different imaging modalities. While vision-language representation learning shows promise, its potential in UDA segmentation tasks remains underexplored.

Method: Proposes TCSA-UDA with two key components: 1) Vision-language covariance cosine loss to align image encoder features with inter-class textual semantic relations, 2) Prototype alignment module that aligns class-wise pixel-level feature distributions across domains using high-level semantic prototypes.

Result: Extensive experiments on cardiac, abdominal, and brain tumor segmentation benchmarks show TCSA-UDA significantly reduces domain shift and consistently outperforms state-of-the-art UDA methods.

Conclusion: TCSA-UDA establishes a new paradigm for integrating language-driven semantics into domain-adaptive medical image analysis, effectively mitigating domain shifts across different imaging modalities.

Abstract: Unsupervised domain adaptation for medical image segmentation remains a significant challenge due to substantial domain shifts across imaging modalities, such as CT and MRI. While recent vision-language representation learning methods have shown promise, their potential in UDA segmentation tasks remains underexplored. To address this gap, we propose TCSA-UDA, a Text-driven Cross-Semantic Alignment framework that leverages domain-invariant textual class descriptions to guide visual representation learning. Our approach introduces a vision-language covariance cosine loss to directly align image encoder features with inter-class textual semantic relations, encouraging semantically meaningful and modality-invariant feature representations. Additionally, we incorporate a prototype alignment module that aligns class-wise pixel-level feature distributions across domains using high-level semantic prototypes. This mitigates residual category-level discrepancies and enhances cross-modal consistency. Extensive experiments on challenging cross-modality cardiac, abdominal, and brain tumor segmentation benchmarks demonstrate that our TCSA-UDA framework significantly reduces domain shift and consistently outperforms state-of-the-art UDA methods, establishing a new paradigm for integrating language-driven semantics into domain-adaptive medical image analysis.

[225] Evaluating Cell AI Foundation Models in Kidney Pathology with Human-in-the-Loop Enrichment

Junlin Guo, Siqi Lu, Can Cui, Ruining Deng, Tianyuan Yao, Zhewen Tao, Yizhe Lin, Marilyn Lionts, Quan Liu, Juming Xiong, Yu Wang, Shilin Zhao, Catie Chang, Mitchell Wilkes, Mengmeng Yin, Haichun Yang, Yuankai Huo

Main category: cs.CV

TL;DR: Evaluation of three cell foundation models (Cellpose, StarDist, CellViT) on kidney nuclei segmentation shows improved performance with fine-tuning using human-in-the-loop data enrichment strategies, establishing benchmarks for real-world deployment.

Details

Motivation: To assess the readiness of AI foundation models for simple healthcare tasks like nuclei segmentation in single organs, and develop strategies to improve model performance while minimizing human annotation effort.

Method: Curated a multi-center, multi-disease, multi-species dataset of 2,542 kidney WSIs; evaluated three SOTA models; developed human-in-the-loop data enrichment algorithms to distill predictions and enhance performance with minimal human effort.

Result: All three foundation models improved over baselines with fine-tuning using enriched data; interestingly, the baseline model with highest F1 score did not yield best segmentation outcomes after fine-tuning.

Conclusion: Establishes a benchmark for developing and deploying cell vision foundation models for real-world healthcare applications, demonstrating the effectiveness of human-in-the-loop data enrichment strategies.

Abstract: Training AI foundation models has emerged as a promising large-scale learning approach for addressing real-world healthcare challenges, including digital pathology. While many of these models have been developed for tasks like disease diagnosis and tissue quantification using extensive and diverse training datasets, their readiness for deployment on some arguably simplest tasks, such as nuclei segmentation within a single organ (e.g., the kidney), remains uncertain. This paper seeks to answer this key question, “How good are we?”, by thoroughly evaluating the performance of recent cell foundation models on a curated multi-center, multi-disease, and multi-species external testing dataset. Additionally, we tackle a more challenging question, “How can we improve?”, by developing and assessing human-in-the-loop data enrichment strategies aimed at enhancing model performance while minimizing the reliance on pixel-level human annotation. To address the first question, we curated a multicenter, multidisease, and multispecies dataset consisting of 2,542 kidney whole slide images (WSIs). Three state-of-the-art (SOTA) cell foundation models-Cellpose, StarDist, and CellViT-were selected for evaluation. To tackle the second question, we explored data enrichment algorithms by distilling predictions from the different foundation models with a human-in-the-loop framework, aiming to further enhance foundation model performance with minimal human efforts. Our experimental results showed that all three foundation models improved over their baselines with model fine-tuning with enriched data. Interestingly, the baseline model with the highest F1 score does not yield the best segmentation outcomes after fine-tuning. This study establishes a benchmark for the development and deployment of cell vision foundation models tailored for real-world data applications.

[226] Position-Prior-Guided Network for System Matrix Super-Resolution in Magnetic Particle Imaging

Xuqing Geng, Lei Su, Zhongwei Bian, Zewen Sun, Jiaxuan Wen, Jie Tian, Yang Du

Main category: cs.CV

TL;DR: Integrating positional priors into deep learning-based super-resolution methods for Magnetic Particle Imaging system matrix calibration to improve efficiency and accuracy.

Details

Motivation: Current SM calibration in MPI is time-consuming and requires repeated measurements when system parameters change. Existing deep learning SR methods don't fully exploit physical prior knowledge like symmetric positional priors.

Method: Integrated positional priors into existing frameworks for SM calibration, with theoretical justification and empirical validation using both 2D and 3D SM super-resolution methods.

Result: Empirically validated the efficacy of incorporating positional priors through experiments on 2D and 3D SM SR methods.

Conclusion: Positional priors can be effectively integrated into MPI system matrix calibration frameworks to improve performance while leveraging physical prior knowledge.

Abstract: Magnetic Particle Imaging (MPI) is a novel medical imaging modality. One of the established methods for MPI reconstruction is based on the System Matrix (SM). However, the calibration of the SM is often time-consuming and requires repeated measurements whenever the system parameters change. Current methodologies utilize deep learning-based super-resolution (SR) techniques to expedite SM calibration; nevertheless, these strategies do not fully exploit physical prior knowledge associated with the SM, such as symmetric positional priors. Consequently, we integrated positional priors into existing frameworks for SM calibration. Underpinned by theoretical justification, we empirically validated the efficacy of incorporating positional priors through experiments involving both 2D and 3D SM SR methods.

[227] MACMD: Multi-dilated Contextual Attention and Channel Mixer Decoding for Medical Image Segmentation

Lalit Maurya, Honghai Liu, Reyer Zwiggelaar

Main category: cs.CV

TL;DR: Proposes MACMD decoder to address limitations in medical image segmentation by enhancing attention mechanisms and channel mixing between encoder-decoder stages, achieving superior performance in both binary and multi-organ segmentation.

Details

Motivation: Address limitations in current encoder-decoder architectures: information loss in shallow layers and inefficient integration of local details with global context between encoder and decoder stages.

Method: MACMD-based decoder with hierarchical dilated convolutions, attention-driven modulation, and cross channel-mixing module via skip connections to capture long-range dependencies while preserving local contextual details.

Result: Outperforms state-of-the-art approaches in Dice score and computational efficiency on both binary and multi-organ segmentation tasks.

Conclusion: The proposed method effectively achieves accurate and robust medical image segmentation by better integrating local and global information through enhanced attention mechanisms and channel mixing.

Abstract: Medical image segmentation faces challenges due to variations in anatomical structures. While convolutional neural networks (CNNs) effectively capture local features, they struggle with modeling long-range dependencies. Transformers mitigate this issue with self-attention mechanisms but lack the ability to preserve local contextual information. State-of-the-art models primarily follow an encoder-decoder architecture, achieving notable success. However, two key limitations remain: (1) Shallow layers, which are closer to the input, capture fine-grained details but suffer from information loss as data propagates through deeper layers. (2) Inefficient integration of local details and global context between the encoder and decoder stages. To address these challenges, we propose the MACMD-based decoder, which enhances attention mechanisms and facilitates channel mixing between encoder and decoder stages via skip connections. This design leverages hierarchical dilated convolutions, attention-driven modulation, and a cross channel-mixing module to capture long-range dependencies while preserving local contextual details, essential for precise medical image segmentation. We evaluated our approach using multiple transformer encoders on both binary and multi-organ segmentation tasks. The results demonstrate that our method outperforms state-of-the-art approaches in terms of Dice score and computational efficiency, highlighting its effectiveness in achieving accurate and robust segmentation performance. The code available at https://github.com/lalitmaurya47/MACMD

[228] LRANet++: Low-Rank Approximation Network for Accurate and Efficient Text Spotting

Yuchen Su, Zhineng Chen, Yongkun Du, Zuxuan Wu, Hongtao Xie, Yu-Gang Jiang

Main category: cs.CV

TL;DR: LRANet++ is an end-to-end text spotting framework that uses low-rank approximation for precise arbitrary-shaped text detection and a triple assignment detection head for fast inference, achieving state-of-the-art performance.

Details

Motivation: Existing end-to-end text spotters lack reliable and efficient text detection methods for arbitrary-shaped text, creating a bottleneck in performance and speed.

Method: Proposes a data-driven low-rank approximation method for text shape representation using ℓ₁-norm formulation for robustness, and a triple assignment detection head with deep sparse, ultra-lightweight sparse, and dense branches for stabilized training and fast inference.

Result: Extensive experiments on challenging benchmarks demonstrate superior performance compared to state-of-the-art methods in both accuracy and efficiency.

Conclusion: LRANet++ effectively addresses the arbitrary-shaped text spotting problem through innovative shape representation and detection architecture, achieving accurate and efficient end-to-end text spotting.

Abstract: End-to-end text spotting aims to jointly optimize text detection and recognition within a unified framework. Despite significant progress, designing an accurate and efficient end-to-end text spotter for arbitrary-shaped text remains largely unsolved. We identify the primary bottleneck as the lack of a reliable and efficient text detection method. To address this, we propose a novel parameterized text shape method based on low-rank approximation for precise detection and a triple assignment detection head to enable fast inference. Specifically, unlike other shape representation methods that employ data-irrelevant parameterization, our data-driven approach derives a low-rank subspace directly from labeled text boundaries. To ensure this process is robust against the inherent annotation noise in this data, we utilize a specialized recovery method based on an $\ell_1$-norm formulation, which accurately reconstructs the text shape with only a few key orthogonal vectors. By exploiting the inherent shape correlation among different text contours, our method achieves consistency and compactness in shape representation. Next, the triple assignment scheme introduces a novel architecture where a deep sparse branch (for stabilized training) is used to guide the learning of an ultra-lightweight sparse branch (for accelerated inference), while a dense branch provides rich parallel supervision. Building upon these advancements, we integrate the enhanced detection module with a lightweight recognition branch to form an end-to-end text spotting framework, termed LRANet++, capable of accurately and efficiently spotting arbitrary-shaped text. Extensive experiments on several challenging benchmarks demonstrate the superiority of LRANet++ compared to state-of-the-art methods. Code will be available at: https://github.com/ychensu/LRANet-PP.git

[229] Hilbert-Guided Block-Sparse Local Attention

Yunge Li, Lanyu Xu

Main category: cs.CV

TL;DR: Proposes Hilbert curve-based window and neighborhood construction to improve efficiency of 2D local attention in transformers, achieving 4× and 18× speedups for window and slide attention respectively.

Details

Motivation: Global self-attention has quadratic compute/memory costs that limit use in high-resolution images, and conventional local attention patterns often fail to deliver significant speedups due to token contiguity issues.

Method: Reorders image tokens along a Hilbert curve, then forms windows and neighborhoods on the reordered 1D sequence to increase block sparsity, combined with block-sparse kernels.

Result: Hilbert Window Attention and Hilbert Slide Attention accelerate window attention by 4× and slide attention by 18× respectively, with minimal accuracy loss in end-to-end transformers.

Conclusion: Hilbert-guided local attention with block-sparse kernels provides a general and practical approach to enhance efficiency of 2D local attention for images.

Abstract: The quadratic compute and memory costs of global self-attention severely limit its use in high-resolution images. Local attention reduces complexity by restricting attention to neighborhoods. Block-sparse kernels can further improve the efficiency of local attention, but conventional local attention patterns often fail to deliver significant speedups because tokens within a window are not contiguous in the 1D sequence. This work proposes a novel method for constructing windows and neighborhoods based on the Hilbert curve. Image tokens are first reordered along a Hilbert curve, and windows and neighborhoods are then formed on the reordered 1D sequence. From a block-sparse perspective, this strategy significantly increases block sparsity and can be combined with existing block-sparse kernels to improve the efficiency of 2D local attention. Experiments show that the proposed Hilbert Window Attention and Hilbert Slide Attention can accelerate window attention and slide attention by about $4\times$ and $18\times$, respectively. To assess practicality, the strategy is instantiated as the Hilbert Window Transformer and the Hilbert Neighborhood Transformer, both of which achieve end-to-end speedups with minimal accuracy loss. Overall, combining Hilbert-guided local attention with block-sparse kernels offers a general and practical approach to enhancing the efficiency of 2D local attention for images. The code is available at https://github.com/Yunge6666/Hilbert-Local-Attention.

[230] TYrPPG: Uncomplicated and Enhanced Learning Capability rPPG for Remote Heart Rate Estimation

Taixi Chen, Yiu-ming Cheung

Main category: cs.CV

TL;DR: TYrPPG is a novel remote photoplethysmography algorithm that uses Mambaout-based modules instead of transformers for efficient heart rate estimation from RGB videos, achieving state-of-the-art performance.

Details

Motivation: Existing rPPG models based on transformers have low computation efficiency. Mamba models show promise but their SSM core is unnecessary for vision tasks, so the authors explore using Mambaout-based modules for remote heart rate learning.

Method: Proposed TYrPPG with innovative gated video understanding block (GVB) combining 2D-CNN and 3D-CNN based on Mambaout structure, plus comprehensive supervised loss function (CSL) and its weakly supervised variants.

Result: TYrPPG achieves state-of-the-art performance in commonly used datasets for remote heart rate estimation.

Conclusion: The proposed TYrPPG demonstrates prospects and superiority in remote heart rate estimation, proving the feasibility of Mambaout-based modules for this task.

Abstract: Remote photoplethysmography (rPPG) can remotely extract physiological signals from RGB video, which has many advantages in detecting heart rate, such as low cost and no invasion to patients. The existing rPPG model is usually based on the transformer module, which has low computation efficiency. Recently, the Mamba model has garnered increasing attention due to its efficient performance in natural language processing tasks, demonstrating potential as a substitute for transformer-based algorithms. However, the Mambaout model and its variants prove that the SSM module, which is the core component of the Mamba model, is unnecessary for the vision task. Therefore, we hope to prove the feasibility of using the Mambaout-based module to remotely learn the heart rate. Specifically, we propose a novel rPPG algorithm called uncomplicated and enhanced learning capability rPPG (TYrPPG). This paper introduces an innovative gated video understanding block (GVB) designed for efficient analysis of RGB videos. Based on the Mambaout structure, this block integrates 2D-CNN and 3D-CNN to enhance video understanding for analysis. In addition, we propose a comprehensive supervised loss function (CSL) to improve the model’s learning capability, along with its weakly supervised variants. The experiments show that our TYrPPG can achieve state-of-the-art performance in commonly used datasets, indicating its prospects and superiority in remote heart rate estimation. The source code is available at https://github.com/Taixi-CHEN/TYrPPG.

[231] Understanding Cross Task Generalization in Handwriting-Based Alzheimer’s Screening via Vision Language Adaptation

Changqing Gong, Huafeng Qin, Mounim A. El-Yacoubi

Main category: cs.CV

TL;DR: A lightweight Cross-Layer Fusion Adapter framework repurposes CLIP for zero-shot Alzheimer’s disease screening using handwriting analysis, systematically investigating cross-task generalization and identifying characteristic stroke patterns for early detection.

Details

Motivation: Early detection of Alzheimer's disease is critical, and handwriting provides a non-invasive window into motor and cognitive decline. Existing handwriting-based AD studies haven't systematically examined how task type influences diagnostic performance and cross-task generalization, while vision language models' potential for handwriting-based disease detection remains unexplored.

Method: Introduces a lightweight Cross-Layer Fusion Adapter (CLFA) framework that repurposes CLIP by implanting multi-level fusion adapters within the visual encoder to progressively align representations toward handwriting-specific medical cues, enabling prompt-free and efficient zero-shot inference.

Result: The framework enables systematic investigation of cross-task generalization, revealing which task types and writing patterns most effectively discriminate AD. Extensive analyses highlight characteristic stroke patterns and task-level factors that contribute to early AD identification.

Conclusion: The approach offers both diagnostic insights and a benchmark for handwriting-based cognitive assessment, demonstrating the potential of vision language models for handwriting-based disease detection and providing a systematic framework for cross-task generalization analysis.

Abstract: Alzheimer’s disease is a prevalent neurodegenerative disorder for which early detection is critical. Handwriting-often disrupted in prodromal AD-provides a non-invasive and cost-effective window into subtle motor and cognitive decline. Existing handwriting-based AD studies, mostly relying on online trajectories and hand-crafted features, have not systematically examined how task type influences diagnostic performance and cross-task generalization. Meanwhile, large-scale vision language models have demonstrated remarkable zero or few-shot anomaly detection in natural images and strong adaptability across medical modalities such as chest X-ray and brain MRI. However, handwriting-based disease detection remains largely unexplored within this paradigm. To close this gap, we introduce a lightweight Cross-Layer Fusion Adapter framework that repurposes CLIP for handwriting-based AD screening. CLFA implants multi-level fusion adapters within the visual encoder to progressively align representations toward handwriting-specific medical cues, enabling prompt-free and efficient zero-shot inference. Using this framework, we systematically investigate cross-task generalization-training on a specific handwriting task and evaluating on unseen ones-to reveal which task types and writing patterns most effectively discriminate AD. Extensive analyses further highlight characteristic stroke patterns and task-level factors that contribute to early AD identification, offering both diagnostic insights and a benchmark for handwriting-based cognitive assessment.

[232] Point Cloud Segmentation of Integrated Circuits Package Substrates Surface Defects Using Causal Inference: Dataset Construction and Methodology

Bingyang Guo, Qiang Zuo, Ruiyun Yu

Main category: cs.CV

TL;DR: Created CPS3D-Seg dataset for 3D surface defect detection in ceramic package substrates and proposed CINet method using causal inference.

Details

Motivation: Complex structure and minor defects in ceramic package substrates, along with lack of public datasets, hinder surface defect detection in integrated circuits.

Method: Built high-quality CPS3D-Seg dataset with 1300 point cloud samples and proposed CINet method using Structural Refine and Quality Assessment modules for causal inference.

Result: CINet significantly outperforms existing algorithms in both mIoU and accuracy metrics.

Conclusion: The CPS3D-Seg dataset and CINet method effectively address 3D surface defect detection challenges in ceramic package substrates.

Abstract: The effective segmentation of 3D data is crucial for a wide range of industrial applications, especially for detecting subtle defects in the field of integrated circuits (IC). Ceramic package substrates (CPS), as an important electronic material, are essential in IC packaging owing to their superior physical and chemical properties. However, the complex structure and minor defects of CPS, along with the absence of a publically available dataset, significantly hinder the development of CPS surface defect detection. In this study, we construct a high-quality point cloud dataset for 3D segmentation of surface defects in CPS, i.e., CPS3D-Seg, which has the best point resolution and precision compared to existing 3D industrial datasets. CPS3D-Seg consists of 1300 point cloud samples under 20 product categories, and each sample provides accurate point-level annotations. Meanwhile, we conduct a comprehensive benchmark based on SOTA point cloud segmentation algorithms to validate the effectiveness of CPS3D-Seg. Additionally, we propose a novel 3D segmentation method based on causal inference (CINet), which quantifies potential confounders in point clouds through Structural Refine (SR) and Quality Assessment (QA) Modules. Extensive experiments demonstrate that CINet significantly outperforms existing algorithms in both mIoU and accuracy.

[233] CGCE: Classifier-Guided Concept Erasure in Generative Models

Viet Nguyen, Vishal M. Patel

Main category: cs.CV

TL;DR: CGCE is a plug-and-play framework that uses lightweight classifiers to detect and refine unsafe text embeddings, enabling robust multi-concept erasure in generative models without modifying original weights.

Details

Motivation: Existing concept erasure methods are vulnerable to adversarial attacks and often degrade model quality when removing unsafe content, creating a trade-off between safety and performance.

Method: Classifier-guided approach using lightweight classifiers on text embeddings to detect and refine prompts containing undesired concepts at inference time, without altering model weights.

Result: Achieves state-of-the-art robustness against red-teaming attacks while maintaining high generative utility, with successful application to various T2I and T2V models.

Conclusion: CGCE provides a practical and effective solution for safe generative AI by balancing safety and performance through inference-time modification of unsafe embeddings.

Abstract: Recent advancements in large-scale generative models have enabled the creation of high-quality images and videos, but have also raised significant safety concerns regarding the generation of unsafe content. To mitigate this, concept erasure methods have been developed to remove undesirable concepts from pre-trained models. However, existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. Moreover, achieving robust erasure often degrades the model’s generative quality for safe, unrelated concepts, creating a difficult trade-off between safety and performance. To address this challenge, we introduce Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play framework that provides robust concept erasure for diverse generative models without altering their original weights. CGCE uses a lightweight classifier operating on text embeddings to first detect and then refine prompts containing undesired concepts. This approach is highly scalable, allowing for multi-concept erasure by aggregating guidance from several classifiers. By modifying only unsafe embeddings at inference time, our method prevents harmful content generation while preserving the model’s original quality on benign prompts. Extensive experiments show that CGCE achieves state-of-the-art robustness against a wide range of red-teaming attacks. Our approach also maintains high generative utility, demonstrating a superior balance between safety and performance. We showcase the versatility of CGCE through its successful application to various modern T2I and T2V models, establishing it as a practical and effective solution for safe generative AI.

[234] Light-Field Dataset for Disparity Based Depth Estimation

Suresh Nehra, Aupendu Kar, Jayanta Mukhopadhyay, Prabir Kumar Biswas

Main category: cs.CV

TL;DR: A new publicly available light field image dataset with 285 real LF images from Lytro Illum camera and 13 synthetic LF images, addressing limitations of existing datasets and demonstrating focal position effects on disparity.

Details

Motivation: The need for suitable light field image datasets to develop disparity-based depth estimation algorithms, as existing datasets have shortcomings and the trade-off between angular and spatial information depends on camera focal position.

Method: Created a comprehensive dataset using Lytro Illum LF camera for real images and Blender for synthetic images, including both real and synthetic stereo light field data captured via mechanical gantry system and simulation.

Result: Produced a dataset of 285 real light field images and 13 synthetic LF images, with synthetic data having similar disparity characteristics to real LF cameras. The dataset demonstrates focal position effects on 3D point disparity.

Conclusion: The introduced publicly available light field dataset addresses the need for suitable datasets in LF depth estimation research and is available at https://github.com/aupendu/light-field-dataset.

Abstract: A Light Field (LF) camera consists of an additional two-dimensional array of micro-lenses placed between the main lens and sensor, compared to a conventional camera. The sensor pixels under each micro-lens receive light from a sub-aperture of the main lens. This enables the image sensor to capture both spatial information and the angular resolution of a scene point. This additional angular information is used to estimate the depth of a 3-D scene. The continuum of virtual viewpoints in light field data enables efficient depth estimation using Epipolar Line Images (EPIs) with robust occlusion handling. However, the trade-off between angular information and spatial information is very critical and depends on the focal position of the camera. To design, develop, implement, and test novel disparity-based light field depth estimation algorithms, the availability of suitable light field image datasets is essential. In this paper, a publicly available light field image dataset is introduced and thoroughly described. We have also demonstrated the effect of focal position on the disparity of a 3-D point as well as the shortcomings of the currently available light field dataset. The proposed dataset contains 285 light field images captured using a Lytro Illum LF camera and 13 synthetic LF images. The proposed dataset also comprises a synthetic dataset with similar disparity characteristics to those of a real light field camera. A real and synthetic stereo light field dataset is also created by using a mechanical gantry system and Blender. The dataset is available at https://github.com/aupendu/light-field-dataset.

[235] MoEGCL: Mixture of Ego-Graphs Contrastive Representation Learning for Multi-View Clustering

Jian Zhu, Xin Zou, Jun Sun, Cheng Luo, Lei Liu, Lingfang Zeng, Ning Zhang, Bian Wu, Chang Tang, Lirong Dai

Main category: cs.CV

TL;DR: MoEGCL introduces fine-grained ego-graph fusion using Mixture-of-Experts for multi-view clustering, achieving SOTA results.

Details

Motivation: Existing GNN-based multi-view clustering methods suffer from coarse-grained graph fusion at view level, limiting representation quality.

Method: Uses Mixture of Ego-Graphs Fusion (MoEGF) with Mixture-of-Experts for sample-level fusion, plus Ego Graph Contrastive Learning (EGCL) for representation alignment.

Result: Extensive experiments show MoEGCL achieves state-of-the-art performance in deep multi-view clustering tasks.

Conclusion: Fine-grained ego-graph fusion at sample level significantly improves multi-view clustering performance over traditional view-level fusion approaches.

Abstract: In recent years, the advancement of Graph Neural Networks (GNNs) has significantly propelled progress in Multi-View Clustering (MVC). However, existing methods face the problem of coarse-grained graph fusion. Specifically, current approaches typically generate a separate graph structure for each view and then perform weighted fusion of graph structures at the view level, which is a relatively rough strategy. To address this limitation, we present a novel Mixture of Ego-Graphs Contrastive Representation Learning (MoEGCL). It mainly consists of two modules. In particular, we propose an innovative Mixture of Ego-Graphs Fusion (MoEGF), which constructs ego graphs and utilizes a Mixture-of-Experts network to implement fine-grained fusion of ego graphs at the sample level, rather than the conventional view-level fusion. Additionally, we present the Ego Graph Contrastive Learning (EGCL) module to align the fused representation with the view-specific representation. The EGCL module enhances the representation similarity of samples from the same cluster, not merely from the same sample, further boosting fine-grained graph representation. Extensive experiments demonstrate that MoEGCL achieves state-of-the-art results in deep multi-view clustering tasks. The source code is publicly available at https://github.com/HackerHyper/MoEGCL.

[236] Towards Frequency-Adaptive Learning for SAR Despeckling

Ziqing Ma, Chang Yang, Zhichang Guo, Yao Li

Main category: cs.CV

TL;DR: SAR-FAH is a frequency-adaptive heterogeneous despeckling model that uses wavelet decomposition and specialized sub-networks for different frequency components to address speckle noise in SAR images while preserving edges and textures.

Details

Motivation: Existing deep learning methods for SAR despeckling use unified networks that fail to account for distinct speckle statistics across different spatial characteristics, leading to artifacts, blurred edges, and texture distortion.

Method: Uses wavelet decomposition to separate images into frequency sub-bands, then employs specialized sub-networks: neural ODEs for low-frequency denoising to ensure structural fidelity, and enhanced U-Net with deformable convolutions for high-frequency components to preserve edges and textures.

Result: Extensive experiments on synthetic and real SAR images demonstrate superior performance in both noise removal and structural preservation compared to existing methods.

Conclusion: The proposed frequency-adaptive heterogeneous approach effectively addresses the limitations of unified networks by leveraging statistical variations across frequencies, achieving better edge and texture preservation while suppressing speckle noise.

Abstract: Synthetic Aperture Radar (SAR) images are inherently corrupted by speckle noise, limiting their utility in high-precision applications. While deep learning methods have shown promise in SAR despeckling, most methods employ a single unified network to process the entire image, failing to account for the distinct speckle statistics associated with different spatial physical characteristics. It often leads to artifacts, blurred edges, and texture distortion. To address these issues, we propose SAR-FAH, a frequency-adaptive heterogeneous despeckling model based on a divide-and-conquer architecture. First, wavelet decomposition is used to separate the image into frequency sub-bands carrying different intrinsic characteristics. Inspired by their differing noise characteristics, we design specialized sub-networks for different frequency components. The tailored approach leverages statistical variations across frequencies, improving edge and texture preservation while suppressing noise. Specifically, for the low-frequency part, denoising is formulated as a continuous dynamic system via neural ordinary differential equations, ensuring structural fidelity and sufficient smoothness that prevents artifacts. For high-frequency sub-bands rich in edges and textures, we introduce an enhanced U-Net with deformable convolutions for noise suppression and enhanced features. Extensive experiments on synthetic and real SAR images validate the superior performance of the proposed model in noise removal and structural preservation.

[237] Hybrid second-order gradient histogram based global low-rank sparse regression for robust face recognition

Hongxia Li, Ying Ji, Yongxin Dong, Yuehua Feng

Main category: cs.CV

TL;DR: Proposes H2H-GLRSR model combining hybrid second-order gradient histogram features with global low-rank sparse regression for robust face recognition under occlusions and illumination variations.

Details

Motivation: To address challenges in face recognition caused by complex occlusions and illumination variations by developing more effective feature descriptors and regression models.

Method: Designs Hybrid Second-Order Gradient Histogram (H2H) feature descriptor and integrates it with Sparse Regularized Nuclear Norm based Matrix Regression (SR_NMR), adding global low-rank constraint on residual matrix.

Result: Experimental results show significant performance improvement over existing regression-based classification methods in challenging scenarios with occlusions, illumination changes, and unconstrained environments.

Conclusion: The proposed H2H-GLRSR model effectively handles complex face recognition challenges through its hybrid feature descriptor and global low-rank sparse regression framework.

Abstract: Low-rank sparse regression models have been widely applied in the field of face recognition. To further address the challenges caused by complex occlusions and illumination variations, this paper proposes a Hybrid Second-Order Gradient Histogram based Global Low-Rank Sparse Regression (H2H-GLRSR) model. Specifically, a novel feature descriptor called the Hybrid Second-Order Gradient Histogram (H2H) is first designed to more effectively characterize the local structural features of facial images. Then, this descriptor is integrated with the Sparse Regularized Nuclear Norm based Matrix Regression (SR$_$NMR). Moreover, a global low-rank constraint is imposed on the residual matrix, enabling the model to better capture the global correlations inherent in structured noise. Experimental results demonstrate that the proposed method significantly outperforms existing regression-based classification approaches under challenging scenarios involving occlusions, illumination changes, and unconstrained environments.

[238] Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning

Fei Yu, Quan Deng, Shengeng Tang, Yuehua Li, Lechao Cheng

Main category: cs.CV

TL;DR: A unified framework for open-world 3D scene graph generation using retrieval-augmented reasoning that enables generalizable 3D scene understanding without fixed label sets.

Details

Motivation: Address limitations of closed-vocabulary supervision and static annotations in 3D scene understanding for open-world settings.

Method: Integrates Vision-Language Models with retrieval-based reasoning, featuring dynamic scene graph generation and retrieval-augmented reasoning pipeline that encodes scene graphs into vector database.

Result: Demonstrates robust generalization and superior performance on 3DSSG and Replica benchmarks across scene QA, visual grounding, instance retrieval, and task planning tasks.

Conclusion: Combining open-vocabulary perception with retrieval-based reasoning is effective for scalable 3D scene understanding in diverse environments.

Abstract: Understanding 3D scenes in open-world settings poses fundamental challenges for vision and robotics, particularly due to the limitations of closed-vocabulary supervision and static annotations. To address this, we propose a unified framework for Open-World 3D Scene Graph Generation with Retrieval-Augmented Reasoning, which enables generalizable and interactive 3D scene understanding. Our method integrates Vision-Language Models (VLMs) with retrieval-based reasoning to support multimodal exploration and language-guided interaction. The framework comprises two key components: (1) a dynamic scene graph generation module that detects objects and infers semantic relationships without fixed label sets, and (2) a retrieval-augmented reasoning pipeline that encodes scene graphs into a vector database to support text/image-conditioned queries. We evaluate our method on 3DSSG and Replica benchmarks across four tasks-scene question answering, visual grounding, instance retrieval, and task planning-demonstrating robust generalization and superior performance in diverse environments. Our results highlight the effectiveness of combining open-vocabulary perception with retrieval-based reasoning for scalable 3D scene understanding.

[239] GABFusion: Rethinking Feature Fusion for Low-Bit Quantization of Multi-Task Networks

Zhaoyang Wang, Dong Wang

Main category: cs.CV

TL;DR: GABFusion and ADA methods improve quantization-aware training for multi-task networks by balancing gradients and aligning features, achieving significant performance gains across various architectures and bit-widths.

Details

Motivation: Quantization-aware training (QAT) performance degrades significantly on multi-task architectures due to task-specific feature discrepancies and gradient conflicts.

Method: Proposed Gradient-Aware Balanced Feature Fusion (GABFusion) to dynamically balance gradient magnitudes and fuse task-specific features, plus Attention Distribution Alignment (ADA) for feature-level distillation in quantized models.

Result: Achieved average mAP improvements of ~3.3% on PASCAL VOC and ~1.6% on COCO datasets. Under 4-bit quantization, narrowed accuracy gap with full-precision model to only 1.7% on VOC.

Conclusion: The proposed framework is modular, easy to integrate, compatible with any existing QAT technique, and effectively preserves performance under low-bit constraints without modifying original network architecture.

Abstract: Despite the effectiveness of quantization-aware training (QAT) in compressing deep neural networks, its performance on multi-task architectures often degrades significantly due to task-specific feature discrepancies and gradient conflicts. To address these challenges, we propose Gradient-Aware Balanced Feature Fusion (GABFusion), which dynamically balances gradient magnitudes and fuses task-specific features in a quantization-friendly manner. We further introduce Attention Distribution Alignment (ADA), a feature-level distillation strategy tailored for quantized models. Our method demonstrates strong generalization across network architectures and QAT algorithms, with theoretical guarantees on gradient bias reduction. Extensive experiments demonstrate that our strategy consistently enhances a variety of QAT methods across different network architectures and bit-widths. On PASCAL VOC and COCO datasets, the proposed approach achieves average mAP improvements of approximately 3.3% and 1.6%, respectively. When applied to YOLOv5 under 4-bit quantization, our method narrows the accuracy gap with the full-precision model to only 1.7% on VOC, showcasing its effectiveness in preserving performance under low-bit constraints. Notably, the proposed framework is modular, easy to integrate, and compatible with any existing QAT technique-enhancing the performance of quantized models without requiring modifications to the original network architecture.

[240] Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation

Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, Xiachong Feng

Main category: cs.CV

TL;DR: FCCT framework analyzes LVLM mechanisms, revealing MHSA’s role in cross-modal aggregation and FFN’s hierarchical processing. IRI technique enhances perception and reduces hallucination without training.

Details

Motivation: Existing LVLM interpretability analyses are insufficient, limiting insights for improving output faithfulness and downstream tasks like hallucination mitigation.

Method: Introduces FCCT framework for fine-grained causal tracing across visual/textual tokens, MHSA, FFNs, and hidden states in all decoder layers. Proposes IRI for inference-time intervention.

Result: MHSA in middle layers aggregates cross-modal info; FFNs show 3-stage hierarchical processing. IRI achieves SOTA on 5 benchmarks while preserving speed and performance.

Conclusion: FCCT provides comprehensive LVLM interpretability insights, enabling effective interventions like IRI to enhance perception and mitigate hallucination without training overhead.

Abstract: Despite the remarkable advancements of Large Vision-Language Models (LVLMs), the mechanistic interpretability remains underexplored. Existing analyses are insufficiently comprehensive and lack examination covering visual and textual tokens, model components, and the full range of layers. This limitation restricts actionable insights to improve the faithfulness of model output and the development of downstream tasks, such as hallucination mitigation. To address this limitation, we introduce Fine-grained Cross-modal Causal Tracing (FCCT) framework, which systematically quantifies the causal effects on visual object perception. FCCT conducts fine-grained analysis covering the full range of visual and textual tokens, three core model components including multi-head self-attention (MHSA), feed-forward networks (FFNs), and hidden states, across all decoder layers. Our analysis is the first to demonstrate that MHSAs of the last token in middle layers play a critical role in aggregating cross-modal information, while FFNs exhibit a three-stage hierarchical progression for the storage and transfer of visual object representations. Building on these insights, we propose Intermediate Representation Injection (IRI), a training-free inference-time technique that reinforces visual object information flow by precisely intervening on cross-modal representations at specific components and layers, thereby enhancing perception and mitigating hallucination. Consistent improvements across five widely used benchmarks and LVLMs demonstrate IRI achieves state-of-the-art performance, while preserving inference speed and other foundational performance.

[241] CoMA: Complementary Masking and Hierarchical Dynamic Multi-Window Self-Attention in a Unified Pre-training Framework

Jiaxuan Li, Qing Xu, Xiangjian He, Ziyu Liu, Chang Xing, Zhen Chen, Daokun Zhang, Rong Qu, Chang Wen Chen

Main category: cs.CV

TL;DR: CoMA uses complementary masking for uniform pixel sampling to improve feature learning, while DyViT employs dynamic multi-window attention to reduce parameters and FLOPs, achieving MAE-level performance with 12% of pre-training epochs.

Details

Motivation: MAE requires many pre-training epochs and ViT suffers from inefficient parameter use due to fixed spatial resolution, limiting efficiency and adaptability.

Method: Proposed Complementary Masked Autoencoders (CoMA) with uniform pixel sampling strategy and DyViT with Dynamic Multi-Window Self-Attention for hierarchical vision processing.

Result: DyViT with CoMA matches MAE downstream performance using only 12% pre-training epochs, with 10% reduction in pre-training time per epoch.

Conclusion: CoMA and DyViT significantly improve pre-training efficiency and effectiveness while maintaining strong downstream task performance.

Abstract: Masked Autoencoders (MAE) achieve self-supervised learning of image representations by randomly removing a portion of visual tokens and reconstructing the original image as a pretext task, thereby significantly enhancing pretraining efficiency and yielding excellent adaptability across downstream tasks. However, MAE and other MAE-style paradigms that adopt random masking generally require more pre-training epochs to maintain adaptability. Meanwhile, ViT in MAE suffers from inefficient parameter use due to fixed spatial resolution across layers. To overcome these limitations, we propose the Complementary Masked Autoencoders (CoMA), which employ a complementary masking strategy to ensure uniform sampling across all pixels, thereby improving effective learning of all features and enhancing the model’s adaptability. Furthermore, we introduce DyViT, a hierarchical vision transformer that employs a Dynamic Multi-Window Self-Attention (DM-MSA), significantly reducing the parameters and FLOPs while improving fine-grained feature learning. Pre-trained on ImageNet-1K with CoMA, DyViT matches the downstream performance of MAE using only 12% of the pre-training epochs, demonstrating more effective learning. It also attains a 10% reduction in pre-training time per epoch, further underscoring its superior pre-training efficiency.

[242] AD-DAE: Unsupervised Modeling of Longitudinal Alzheimer’s Disease Progression with Diffusion Auto-Encoder

Ayantika Das, Arunima Sarkar, Keerthi Ram, Mohanasankar Sivaprakasam

Main category: cs.CV

TL;DR: A conditionable Diffusion Auto-encoder framework for unsupervised longitudinal disease progression modeling that enables controlled generation of follow-up images from baseline images without requiring subject-specific longitudinal supervision.

Details

Motivation: Existing generative approaches impose constraints on distribution learning, leading to limited controllability in generating follow-up images without explicit supervision from longitudinal data.

Method: Uses a conditionable Diffusion Auto-encoder framework that forms a compact latent space capturing high-level semantics, allowing disentanglement of progression-related information. Applies controlled shifts to baseline representations restricted to a subspace that isolates progression factors from identity-preserving components.

Result: Validated through image quality metrics, volumetric progression analysis, and downstream classification in Alzheimer’s disease datasets from two different sources and disease categories, demonstrating effectiveness for Alzheimer’s progression modeling.

Conclusion: The approach successfully enables controlled longitudinal image generation for disease progression modeling in an unsupervised manner, effectively capturing progression-related changes while preserving subject identity.

Abstract: Generative modeling frameworks have emerged as an effective approach to capture high-dimensional image distributions from large datasets without requiring domain-specific knowledge, a capability essential for longitudinal disease progression modeling. Recent generative modeling approaches have attempted to capture progression by mapping images into a latent representational space and then controlling and guiding the representations to generate follow-up images from a baseline image. However, existing approaches impose constraints on distribution learning, leading to latent spaces with limited controllability to generate follow-up images without explicit supervision from subject-specific longitudinal images. In order to enable controlled movements in the latent representational space and generate progression images from a baseline image in an unsupervised manner, we introduce a conditionable Diffusion Auto-encoder framework. The explicit encoding mechanism of image-diffusion auto-encoders forms a compact latent space capturing high-level semantics, providing means to disentangle information relevant for progression. Our approach leverages this latent space to condition and apply controlled shifts to baseline representations for generating follow-up. Controllability is induced by restricting these shifts to a subspace, thereby isolating progression-related factors from subject identity-preserving components. The shifts are implicitly guided by correlating with progression attributes, without requiring subject-specific longitudinal supervision. We validate the generations through image quality metrics, volumetric progression analysis, and downstream classification in Alzheimer’s disease datasets from two different sources and disease categories. This demonstrates the effectiveness of our approach for Alzheimer’s progression modeling and longitudinal image generation.

[243] Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation

Lin Li, Chuhan Zhang, Dong Zhang, Chong Sun, Chen Li, Long Chen

Main category: cs.CV

TL;DR: ACC is an interaction-centric end-to-end framework for open-vocabulary scene graph generation that addresses limitations in existing methods by focusing on interaction modeling to distinguish between interacting and non-interacting objects.

Details

Motivation: Existing OVSGG methods struggle to distinguish between interacting and non-interacting instances of the same object category due to lack of explicit interaction modeling, leading to noisy pseudo-supervision and ambiguous query matching.

Method: Proposes ACC framework with bidirectional interaction prompts for robust pseudo-supervision generation, interaction-guided query selection to prioritize interacting objects, and interaction-consistent knowledge distillation to separate relational foreground from background.

Result: Extensive experiments on three benchmarks show ACC achieves state-of-the-art performance, demonstrating superior capability in open-vocabulary scene graph generation.

Conclusion: The interaction-centric paradigm in ACC effectively addresses key limitations in OVSGG and shows strong potential for real-world applications.

Abstract: Open-vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge from pre-trained large-scale models. Existing OVSGG methods always adopt a two-stage pipeline: 1) \textit{Infusing knowledge} into large-scale models via pre-training on large datasets; 2) \textit{Transferring knowledge} from pre-trained models with fully annotated scene graphs during supervised fine-tuning. However, due to a lack of explicit interaction modeling, these methods struggle to distinguish between interacting and non-interacting instances of the same object category. This limitation induces critical issues in both stages of OVSGG: it generates noisy pseudo-supervision from mismatched objects during knowledge infusion, and causes ambiguous query matching during knowledge transfer. To this end, in this paper, we propose an inter\textbf{AC}tion-\textbf{C}entric end-to-end OVSGG framework (\textbf{ACC}) in an interaction-driven paradigm to minimize these mismatches. For \textit{interaction-centric knowledge infusion}, ACC employs a bidirectional interaction prompt for robust pseudo-supervision generation to enhance the model’s interaction knowledge. For \textit{interaction-centric knowledge transfer}, ACC first adopts interaction-guided query selection that prioritizes pairing interacting objects to reduce interference from non-interacting ones. Then, it integrates interaction-consistent knowledge distillation to bolster robustness by pushing relational foreground away from the background while retaining general knowledge. Extensive experimental results on three benchmarks show that ACC achieves state-of-the-art performance, demonstrating the potential of interaction-centric paradigms for real-world applications.

[244] Global Multiple Extraction Network for Low-Resolution Facial Expression Recognition

Jingyi Shi

Main category: cs.CV

TL;DR: Proposed GME-Net for low-resolution facial expression recognition using hybrid attention and multi-scale global feature extraction to overcome detail loss and weak global modeling issues.

Details

Motivation: Current facial expression recognition algorithms perform well on high-resolution images but degrade significantly on low-resolution images due to lack of detail information and weak global modeling capabilities.

Method: GME-Net includes: 1) hybrid attention-based local feature extraction with attention similarity knowledge distillation to learn details from high-resolution networks, 2) multi-scale global feature extraction with quasi-symmetric structure to reduce local noise and capture global features.

Result: Extensive experiments on multiple datasets show GME-Net achieves superior performance in low-resolution facial expression recognition compared to existing methods.

Conclusion: GME-Net effectively extracts expression-related discriminative features and outperforms current solutions for low-resolution facial expression recognition.

Abstract: Facial expression recognition, as a vital computer vision task, is garnering significant attention and undergoing extensive research. Although facial expression recognition algorithms demonstrate impressive performance on high-resolution images, their effectiveness tends to degrade when confronted with low-resolution images. We find it is because: 1) low-resolution images lack detail information; 2) current methods complete weak global modeling, which make it difficult to extract discriminative features. To alleviate the above issues, we proposed a novel global multiple extraction network (GME-Net) for low-resolution facial expression recognition, which incorporates 1) a hybrid attention-based local feature extraction module with attention similarity knowledge distillation to learn image details from high-resolution network; 2) a multi-scale global feature extraction module with quasi-symmetric structure to mitigate the influence of local image noise and facilitate capturing global image features. As a result, our GME-Net is capable of extracting expression-related discriminative features. Extensive experiments conducted on several widely-used datasets demonstrate that the proposed GME-Net can better recognize low-resolution facial expression and obtain superior performance than existing solutions.

[245] Polymap: generating high definition map based on rasterized polygons

Shiyu Gao, Hao Jiang

Main category: cs.CV

TL;DR: Proposes a segmentation-based method for HD map construction using instance segmentation and Potrace post-processing to improve generalizability over detection-based approaches.

Details

Motivation: Detection-based methods for HD map construction lack robust generalizability, limiting their applicability in auto-labeling systems for autonomous driving.

Method: Reinterprets road elements as rasterized polygons, uses segmentation-based transformer for instance masks, and Potrace-based post-processing for vectorization.

Result: Quantitative results on Nuscene dataset demonstrate effectiveness and improved generalizability of the proposed method.

Conclusion: Segmentation-based approach with proper post-processing provides better generalizability for HD map construction compared to detection-based methods.

Abstract: The perception of high-definition maps is an integral component of environmental perception in autonomous driving systems. Existing research have often focused on online construction of high-definition maps. For instance, the Maptr[9] series employ a detection-based method to output vectorized map instances parallelly in an end-to-end manner. However, despite their capability for real-time construction, detection-based methods are observed to lack robust generalizability[19], which hampers their applicability in auto-labeling systems. Therefore, aiming to improve the generalizability, we reinterpret road elements as rasterized polygons and design a concise framework based on instance segmentation. Initially, a segmentation-based transformer is employed to deliver instance masks in an end-to-end manner; succeeding this step, a Potrace-based[17] post-processing module is used to ultimately yield vectorized map elements. Quantitative results attained on the Nuscene[1] dataset substantiate the effectiveness and generaliz-ability of our method.

[246] Reperio-rPPG: Relational Temporal Graph Neural Networks for Periodicity Learning in Remote Physiological Measurement

Ba-Thinh Nguyen, Thach-Ha Ngoc Pham, Hoang-Long Duc Nguyen, Thi-Duyen Ngo, Thanh-Ha Le

Main category: cs.CV

TL;DR: Reperio-rPPG is a novel framework that integrates Relational Convolutional Networks with Graph Transformer to capture the intrinsic periodicity in remote photoplethysmography (rPPG) signals, achieving state-of-the-art performance and robustness across various real-world conditions.

Details

Motivation: Existing rPPG methods often underexplore the intrinsic periodicity characteristic of physiological signals, limiting their ability to capture fine-grained temporal dynamics under real-world conditions.

Method: Proposed Reperio-rPPG framework that strategically integrates Relational Convolutional Networks with a Graph Transformer to capture periodic structure, plus a tailored CutMix augmentation to enhance generalizability.

Result: Extensive experiments on PURE, UBFC-rPPG, and MMPD datasets show state-of-the-art performance with remarkable robustness under various motion and illumination conditions.

Conclusion: Reperio-rPPG effectively bridges the gap in modeling physiological signal periodicity and demonstrates superior performance and robustness for contactless vital sign monitoring.

Abstract: Remote photoplethysmography (rPPG) is an emerging contactless physiological sensing technique that leverages subtle color variations in facial videos to estimate vital signs such as heart rate and respiratory rate. This non-invasive method has gained traction across diverse domains, including telemedicine, affective computing, driver fatigue detection, and health monitoring, owing to its scalability and convenience. Despite significant progress in remote physiological signal measurement, a crucial characteristic - the intrinsic periodicity - has often been underexplored or insufficiently modeled in previous approaches, limiting their ability to capture fine-grained temporal dynamics under real-world conditions. To bridge this gap, we propose Reperio-rPPG, a novel framework that strategically integrates Relational Convolutional Networks with a Graph Transformer to effectively capture the periodic structure inherent in physiological signals. Additionally, recognizing the limited diversity of existing rPPG datasets, we further introduce a tailored CutMix augmentation to enhance the model’s generalizability. Extensive experiments conducted on three widely used benchmark datasets - PURE, UBFC-rPPG, and MMPD - demonstrate that Reperio-rPPG not only achieves state-of-the-art performance but also exhibits remarkable robustness under various motion (e.g., stationary, rotation, talking, walking) and illumination conditions (e.g., nature, low LED, high LED). The code is publicly available at https://github.com/deconasser/Reperio-rPPG.

[247] U(PM)$^2$:Unsupervised polygon matching with pre-trained models for challenging stereo images

Chang Li, Xingtao Peng

Main category: cs.CV

TL;DR: U(PM)^2 is an unsupervised polygon matching method that combines pre-trained models with handcrafted features to address stereo polygon matching challenges without training requirements.

Details

Motivation: Polygon matching in stereo vision faces challenges including disparity discontinuity, scale variation, training requirements, and generalization issues that remain largely unexplored.

Method: Uses pre-trained Segment Anything Model for mask detection, converts masks to polygons, employs bidirectional-pyramid strategy with LoFTR for global matching, and local-joint geometry with Hungarian algorithm for local matching.

Result: Achieved state-of-the-art accuracy on ScanNet and SceneFlow datasets with competitive speed, satisfactory generalization, and low cost without training.

Conclusion: The proposed U(PM)^2 effectively addresses polygon matching challenges through unsupervised approach combining learned and handcrafted features, demonstrating strong performance without training requirements.

Abstract: Stereo image matching is a fundamental task in computer vision, photogrammetry and remote sensing, but there is an almost unexplored field, i.e., polygon matching, which faces the following challenges: disparity discontinuity, scale variation, training requirement, and generalization. To address the above-mentioned issues, this paper proposes a novel U(PM)$^2$: low-cost unsupervised polygon matching with pre-trained models by uniting automatically learned and handcrafted features, of which pipeline is as follows: firstly, the detector leverages the pre-trained segment anything model to obtain masks; then, the vectorizer converts the masks to polygons and graphic structure; secondly, the global matcher addresses challenges from global viewpoint changes and scale variation based on bidirectional-pyramid strategy with pre-trained LoFTR; finally, the local matcher further overcomes local disparity discontinuity and topology inconsistency of polygon matching by local-joint geometry and multi-feature matching strategy with Hungarian algorithm. We benchmark our U(PM)$^2$ on the ScanNet and SceneFlow datasets using our proposed new metric, which achieved state-of-the-art accuracy at a competitive speed and satisfactory generalization performance at low cost without any training requirement.

Surbhi Madan, Shreya Ghosh, Ramanathan Subramanian, Abhinav Dhall, Tom Gedeon

Main category: cs.CV

TL;DR: CSGaze is a context-aware multimodal approach that uses facial and scene information to predict social gaze patterns in conversational interactions, outperforming state-of-the-art methods and demonstrating strong generalization.

Details

Motivation: To leverage contextual cues combined with visual scene and facial information to better predict and interpret social gaze patterns during conversations, as gaze reveals attention, engagement, and confidence.

Method: Multimodal approach using facial and scene information with fine-grained attention mechanism focused on the principal speaker to model social gaze dynamics.

Result: CSGaze performs competitively with state-of-the-art methods on GP-Static, UCO-LAEO and AVA-LAEO datasets, and shows strong generalization on open set datasets.

Conclusion: Contextual cues significantly improve social gaze prediction, and the model provides explainability through attention scores while demonstrating robustness across diverse scenarios.

Abstract: A person’s gaze offers valuable insights into their focus of attention, level of social engagement, and confidence. In this work, we investigate how contextual cues combined with visual scene and facial information can be effectively utilized to predict and interpret social gaze patterns during conversational interactions. We introduce CSGaze, a context aware multimodal approach that leverages facial, scene information as complementary inputs to enhance social gaze pattern prediction from multi-person images. The model also incorporates a fine-grained attention mechanism centered on the principal speaker, which helps in better modeling social gaze dynamics. Experimental results show that CSGaze performs competitively with state-of-the-art methods on GP-Static, UCO-LAEO and AVA-LAEO. Our findings highlight the role of contextual cues in improving social gaze prediction. Additionally, we provide initial explainability through generated attention scores, offering insights into the model’s decision-making process. We also demonstrate our model’s generalizability by testing our model on open set datasets that demonstrating its robustness across diverse scenarios.

[249] Adaptive Agent Selection and Interaction Network for Image-to-point cloud Registration

Zhixin Cheng, Xiaotian Yin, Jiacheng Deng, Bohao Liao, Yujia Chen, Xu Zhou, Baoqun Yin, Tianzhu Zhang

Main category: cs.CV

TL;DR: A novel cross-modal registration framework using Iterative Agents Selection and Reliable Agents Interaction modules to improve robustness and accuracy in image-to-point cloud registration under challenging conditions.

Details

Motivation: Existing detection-free methods struggle with noise that disrupts similarity computation and leads to incorrect correspondences, and lack effective mechanisms to select informative cross-modal representations.

Method: Proposes two key modules: IAS enhances structural feature awareness with phase maps and uses reinforcement learning to select reliable agents; RAI leverages selected agents to guide cross-modal interactions and reduce mismatches.

Result: Extensive experiments on RGB-D Scenes v2 and 7-Scenes benchmarks show state-of-the-art performance.

Conclusion: The proposed framework effectively addresses challenges in cross-modal registration by improving feature selection and interaction mechanisms, achieving superior robustness and accuracy.

Abstract: Typical detection-free methods for image-to-point cloud registration leverage transformer-based architectures to aggregate cross-modal features and establish correspondences. However, they often struggle under challenging conditions, where noise disrupts similarity computation and leads to incorrect correspondences. Moreover, without dedicated designs, it remains difficult to effectively select informative and correlated representations across modalities, thereby limiting the robustness and accuracy of registration. To address these challenges, we propose a novel cross-modal registration framework composed of two key modules: the Iterative Agents Selection (IAS) module and the Reliable Agents Interaction (RAI) module. IAS enhances structural feature awareness with phase maps and employs reinforcement learning principles to efficiently select reliable agents. RAI then leverages these selected agents to guide cross-modal interactions, effectively reducing mismatches and improving overall robustness. Extensive experiments on the RGB-D Scenes v2 and 7-Scenes benchmarks demonstrate that our method consistently achieves state-of-the-art performance.

[250] Commonality in Few: Few-Shot Multimodal Anomaly Detection via Hypergraph-Enhanced Memory

Yuxuan Lin, Hanjing Yan, Xuan Tong, Yang Chang, Huanzhen Wang, Ziheng Zhou, Shuyong Gao, Yan Wang, Wenqiang Zhang

Main category: cs.CV

TL;DR: CIF is a few-shot multimodal industrial anomaly detection method that uses hypergraphs to extract structural commonality from limited training samples and employs a memory bank to store intra-class structural priors for improved anomaly detection.

Details

Motivation: Few-shot multimodal industrial anomaly detection is critical but challenging due to insufficient training samples that fail to cover diverse test patterns. The paper aims to address this by extracting structural commonality from limited samples.

Method: Proposes CIF method with three modules: 1) semantic-aware hypergraph construction for single-semantic industrial images, 2) training-free hypergraph message passing to update test features, and 3) hyperedge-guided memory search using structural information to reduce false positives.

Result: Experimental results on MVTec 3D-AD and Eyecandies datasets show that CIF outperforms state-of-the-art methods in few-shot settings.

Conclusion: The proposed CIF method effectively addresses few-shot multimodal industrial anomaly detection by leveraging structural commonality through hypergraphs and memory banks, demonstrating superior performance compared to existing approaches.

Abstract: Few-shot multimodal industrial anomaly detection is a critical yet underexplored task, offering the ability to quickly adapt to complex industrial scenarios. In few-shot settings, insufficient training samples often fail to cover the diverse patterns present in test samples. This challenge can be mitigated by extracting structural commonality from a small number of training samples. In this paper, we propose a novel few-shot unsupervised multimodal industrial anomaly detection method based on structural commonality, CIF (Commonality In Few). To extract intra-class structural information, we employ hypergraphs, which are capable of modeling higher-order correlations, to capture the structural commonality within training samples, and use a memory bank to store this intra-class structural prior. Firstly, we design a semantic-aware hypergraph construction module tailored for single-semantic industrial images, from which we extract common structures to guide the construction of the memory bank. Secondly, we use a training-free hypergraph message passing module to update the visual features of test samples, reducing the distribution gap between test features and features in the memory bank. We further propose a hyperedge-guided memory search module, which utilizes structural information to assist the memory search process and reduce the false positive rate. Experimental results on the MVTec 3D-AD dataset and the Eyecandies dataset show that our method outperforms the state-of-the-art (SOTA) methods in few-shot settings. Code is available at https://github.com/Sunny5250/CIF.

[251] Adapted Foundation Models for Breast MRI Triaging in Contrast-Enhanced and Non-Contrast Enhanced Protocols

Tri-Thien Nguyen, Lorenz A. Kapsner, Tobias Hepp, Shirin Heidarikahkesh, Hannes Schreiter, Luise Brock, Dominika Skwierawska, Dominique Hadler, Julian Hossbach, Evelyn Wenkel, Sabine Ohlmeyer, Frederik B. Laun, Andrzej Liebert, Andreas Maier, Michael Uder, Sebastian Bickelhaupt

Main category: cs.CV

TL;DR: DINOv2-based Medical Slice Transformer (MST) can pre-screen breast MRI to rule out significant findings (BI-RADS ≥4) with 97.5% sensitivity, achieving 19% specificity for contrast-enhanced and 17% for non-contrast-enhanced protocols.

Details

Motivation: MRI interpretation is time-consuming, and AI could help pre-screen cases to reduce radiologist workload by identifying exams without significant findings.

Method: Used DINOv2-based MST on 1,847 breast MRI exams, testing four abbreviated protocols (T1sub, DWI1500, DWI1500+T2w, T1sub+T2w) with five-fold cross-validation and external validation on Duke dataset.

Result: T1sub+T2w achieved AUC 0.77±0.04, highest specificity of 19%±7% at 97.5% sensitivity. Missed lesions were small (<10mm) non-mass enhancements. External validation AUC 0.77 with 88% good/moderate attention maps.

Conclusion: MST can triage cases without BI-RADS ≥4 findings at high sensitivity, potentially reducing radiologist workload, but requires further research before clinical implementation.

Abstract: Background: Magnetic resonance imaging (MRI) has high sensitivity for breast cancer detection, but interpretation is time-consuming. Artificial intelligence may aid in pre-screening. Purpose: To evaluate the DINOv2-based Medical Slice Transformer (MST) for ruling out significant findings (Breast Imaging Reporting and Data System [BI-RADS] >=4) in contrast-enhanced and non-contrast-enhanced abbreviated breast MRI. Materials and Methods: This institutional review board approved retrospective study included 1,847 single-breast MRI examinations (377 BI-RADS >=4) from an in-house dataset and 924 from an external validation dataset (Duke). Four abbreviated protocols were tested: T1-weighted early subtraction (T1sub), diffusion-weighted imaging with b=1500 s/mm2 (DWI1500), DWI1500+T2-weighted (T2w), and T1sub+T2w. Performance was assessed at 90%, 95%, and 97.5% sensitivity using five-fold cross-validation and area under the receiver operating characteristic curve (AUC) analysis. AUC differences were compared with the DeLong test. False negatives were characterized, and attention maps of true positives were rated in the external dataset. Results: A total of 1,448 female patients (mean age, 49 +/- 12 years) were included. T1sub+T2w achieved an AUC of 0.77 +/- 0.04; DWI1500+T2w, 0.74 +/- 0.04 (p=0.15). At 97.5% sensitivity, T1sub+T2w had the highest specificity (19% +/- 7%), followed by DWI1500+T2w (17% +/- 11%). Missed lesions had a mean diameter <10 mm at 95% and 97.5% thresholds for both T1sub and DWI1500, predominantly non-mass enhancements. External validation yielded an AUC of 0.77, with 88% of attention maps rated good or moderate. Conclusion: At 97.5% sensitivity, the MST framework correctly triaged cases without BI-RADS >=4, achieving 19% specificity for contrast-enhanced and 17% for non-contrast-enhanced MRI. Further research is warranted before clinical implementation.

[252] DiA-gnostic VLVAE: Disentangled Alignment-Constrained Vision Language Variational AutoEncoder for Robust Radiology Reporting with Missing Modalities

Nagur Shareef Shaik, Teja Krishna Cherukuri, Adnan Masood, Dong Hye Ye

Main category: cs.CV

TL;DR: DiA-gnostic VLVAE is a framework for robust radiology reporting that addresses missing modalities and feature entanglement through disentangled alignment using a Mixture-of-Experts based Vision-Language Variational Autoencoder.

Details

Motivation: Current automated methods struggle with missing modalities in clinical data and feature entanglement, leading to suboptimal fusion and clinically unfaithful hallucinated findings in radiology reports.

Method: Uses a Mixture-of-Experts based Vision-Language Variational Autoencoder (VLVAE) to disentangle shared and modality-specific features, with constrained optimization for orthogonality and alignment, followed by a compact LLaMA-X decoder for efficient report generation.

Result: Achieved competitive BLEU@4 scores of 0.266 on IU X-Ray and 0.134 on MIMIC-CXR datasets, significantly outperforming state-of-the-art models.

Conclusion: The proposed DiA framework successfully addresses challenges in radiology reporting by enabling robust performance even with missing clinical context through disentangled alignment.

Abstract: The integration of medical images with clinical context is essential for generating accurate and clinically interpretable radiology reports. However, current automated methods often rely on resource-heavy Large Language Models (LLMs) or static knowledge graphs and struggle with two fundamental challenges in real-world clinical data: (1) missing modalities, such as incomplete clinical context , and (2) feature entanglement, where mixed modality-specific and shared information leads to suboptimal fusion and clinically unfaithful hallucinated findings. To address these challenges, we propose the DiA-gnostic VLVAE, which achieves robust radiology reporting through Disentangled Alignment. Our framework is designed to be resilient to missing modalities by disentangling shared and modality-specific features using a Mixture-of-Experts (MoE) based Vision-Language Variational Autoencoder (VLVAE). A constrained optimization objective enforces orthogonality and alignment between these latent representations to prevent suboptimal fusion. A compact LLaMA-X decoder then uses these disentangled representations to generate reports efficiently. On the IU X-Ray and MIMIC-CXR datasets, DiA has achieved competetive BLEU@4 scores of 0.266 and 0.134, respectively. Experimental results show that the proposed method significantly outperforms state-of-the-art models.

[253] Runtime Safety Monitoring of Deep Neural Networks for Perception: A Survey

Albert Schotschneider, Svetlana Pavlitska, J. Marius Zöllner

Main category: cs.CV

TL;DR: Survey of runtime safety monitoring approaches for deep neural networks in safety-critical applications, categorizing methods into input, internal representation, and output monitoring.

Details

Motivation: DNNs in safety-critical systems are vulnerable to generalization errors, OOD inputs, and adversarial attacks that can cause hazardous failures, necessitating runtime monitoring without modifying the DNNs.

Method: Comprehensive categorization of existing runtime monitoring methods into three groups: monitoring inputs, internal representations, and outputs of DNNs during inference.

Result: Analysis of state-of-the-art methods in each category, identification of strengths/limitations, and mapping of methods to specific safety concerns they address.

Conclusion: Identified open challenges and future research directions for runtime safety monitoring of DNNs in safety-critical applications.

Abstract: Deep neural networks (DNNs) are widely used in perception systems for safety-critical applications, such as autonomous driving and robotics. However, DNNs remain vulnerable to various safety concerns, including generalization errors, out-of-distribution (OOD) inputs, and adversarial attacks, which can lead to hazardous failures. This survey provides a comprehensive overview of runtime safety monitoring approaches, which operate in parallel to DNNs during inference to detect these safety concerns without modifying the DNN itself. We categorize existing methods into three main groups: Monitoring inputs, internal representations, and outputs. We analyze the state-of-the-art for each category, identify strengths and limitations, and map methods to the safety concerns they address. In addition, we highlight open challenges and future research directions.

[254] A Dual-Mode ViT-Conditioned Diffusion Framework with an Adaptive Conditioning Bridge for Breast Cancer Segmentation

Prateek Singh, Moumita Dholey, P. K. Vinod

Main category: cs.CV

TL;DR: A conditional Denoising Diffusion Model with ViT encoder and enhanced UNet decoder for breast ultrasound lesion segmentation, achieving state-of-the-art performance through adaptive feature fusion and topological consistency regularization.

Details

Motivation: Breast ultrasound lesion segmentation is challenging due to low contrast, speckle noise, and unclear boundaries. Standard convolutional architectures fail to capture sufficient global context, leading to anatomically inconsistent segmentations.

Method: Proposes a flexible conditional Denoising Diffusion Model combining Vision Transformer encoder for global features with enhanced UNet decoder. Key innovations: Adaptive Conditioning Bridge for multi-scale feature fusion, Topological Denoising Consistency loss for structural regularization, and dual-head architecture for efficient inference.

Result: Achieves new state-of-the-art performance: Dice scores of 0.96 on BUSI, 0.90 on BrEaST, and 0.97 on BUS-UCLM datasets. Ablation studies confirm critical importance of all model components.

Conclusion: The proposed diffusion-based framework produces not only accurate but also anatomically plausible segmentations, effectively addressing challenges in breast ultrasound image analysis through global context modeling and structural regularization.

Abstract: In breast ultrasound images, precise lesion segmentation is essential for early diagnosis; however, low contrast, speckle noise, and unclear boundaries make this difficult. Even though deep learning models have demonstrated potential, standard convolutional architectures frequently fall short in capturing enough global context, resulting in segmentations that are anatomically inconsistent. To overcome these drawbacks, we suggest a flexible, conditional Denoising Diffusion Model that combines an enhanced UNet-based generative decoder with a Vision Transformer (ViT) encoder for global feature extraction. We introduce three primary innovations: 1) an Adaptive Conditioning Bridge (ACB) for efficient, multi-scale fusion of semantic features; 2) a novel Topological Denoising Consistency (TDC) loss component that regularizes training by penalizing structural inconsistencies during denoising; and 3) a dual-head architecture that leverages the denoising objective as a powerful regularizer, enabling a lightweight auxiliary head to perform rapid and accurate inference on smaller datasets and a noise prediction head. Our framework establishes a new state-of-the-art on public breast ultrasound datasets, achieving Dice scores of 0.96 on BUSI, 0.90 on BrEaST and 0.97 on BUS-UCLM. Comprehensive ablation studies empirically validate that the model components are critical for achieving these results and for producing segmentations that are not only accurate but also anatomically plausible.

[255] Exploring Category-level Articulated Object Pose Tracking on SE(3) Manifolds

Xianhui Meng, Yukang Huo, Li Zhang, Liu Liu, Haonan Jiang, Yan Zhong, Pingrui Zhang, Cewu Lu, Jun Liu

Main category: cs.CV

TL;DR: PPF-Tracker is a novel point-pair-based framework for articulated object pose tracking that uses SE(3) quasi-canonicalization, Point Pair Features, and kinematic constraints to achieve robust multi-frame tracking.

Details

Motivation: Articulated object pose tracking is underexplored compared to rigid objects due to complex kinematic constraints, creating a need for specialized tracking methods in robotics and AR applications.

Method: Uses SE(3) quasi-canonicalization of point clouds, models objects with Point Pair Features to predict pose voting parameters, and incorporates semantic joint axis information to enforce unified kinematic constraints.

Result: Demonstrates strong generalization across synthetic datasets and real-world scenarios, showing effectiveness and robustness in multi-frame articulated object pose tracking.

Conclusion: PPF-Tracker advances articulated object tracking and can foster progress in robotics, embodied intelligence, and augmented reality applications.

Abstract: Articulated objects are prevalent in daily life and robotic manipulation tasks. However, compared to rigid objects, pose tracking for articulated objects remains an underexplored problem due to their inherent kinematic constraints. To address these challenges, this work proposes a novel point-pair-based pose tracking framework, termed \textbf{PPF-Tracker}. The proposed framework first performs quasi-canonicalization of point clouds in the SE(3) Lie group space, and then models articulated objects using Point Pair Features (PPF) to predict pose voting parameters by leveraging the invariance properties of SE(3). Finally, semantic information of joint axes is incorporated to impose unified kinematic constraints across all parts of the articulated object. PPF-Tracker is systematically evaluated on both synthetic datasets and real-world scenarios, demonstrating strong generalization across diverse and challenging environments. Experimental results highlight the effectiveness and robustness of PPF-Tracker in multi-frame pose tracking of articulated objects. We believe this work can foster advances in robotics, embodied intelligence, and augmented reality. Codes are available at https://github.com/mengxh20/PPFTracker.

[256] MALeR: Improving Compositional Fidelity in Layout-Guided Generation

Shivank Saxena, Dhruv Srivastava, Makarand Tapaswi

Main category: cs.CV

TL;DR: MALeR is a text-to-image generation method that improves control over subject placement and attribute binding in compositional scenes with multiple subjects and attributes.

Details

Motivation: Current layout-guided text-to-image methods struggle with unintended subjects appearing outside layouts, out-of-distribution generation, unnatural artifacts, and attribute leakage across subjects in complex compositional scenes.

Method: Proposes MALeR with masked, attribute-aware binding mechanism that prevents subjects from appearing outside given layouts while maintaining in-distribution generation and preventing attribute leakage.

Result: Qualitative and quantitative evaluation shows superior performance in compositional accuracy, generation consistency, and attribute binding compared to previous methods, especially for scenes with multiple subjects and attributes per subject.

Conclusion: MALeR effectively addresses key challenges in compositional text-to-image generation, enabling more accurate and controlled generation of complex scenes with multiple subjects and attributes.

Abstract: Recent advances in text-to-image models have enabled a new era of creative and controllable image generation. However, generating compositional scenes with multiple subjects and attributes remains a significant challenge. To enhance user control over subject placement, several layout-guided methods have been proposed. However, these methods face numerous challenges, particularly in compositional scenes. Unintended subjects often appear outside the layouts, generated images can be out-of-distribution and contain unnatural artifacts, or attributes bleed across subjects, leading to incorrect visual outputs. In this work, we propose MALeR, a method that addresses each of these challenges. Given a text prompt and corresponding layouts, our method prevents subjects from appearing outside the given layouts while being in-distribution. Additionally, we propose a masked, attribute-aware binding mechanism that prevents attribute leakage, enabling accurate rendering of subjects with multiple attributes, even in complex compositional scenes. Qualitative and quantitative evaluation demonstrates that our method achieves superior performance in compositional accuracy, generation consistency, and attribute binding compared to previous work. MALeR is particularly adept at generating images of scenes with multiple subjects and multiple attributes per subject.

[257] How Reasoning Influences Intersectional Biases in Vision Language Models

Adit Desai, Sudipta Roy, Mohna Chakraborty

Main category: cs.CV

TL;DR: Analysis of social biases in 5 open-source Vision Language Models for occupation prediction, revealing biased reasoning patterns that cause intersectional disparities.

Details

Motivation: VLMs process images through statistical associations rather than human-like contextual reasoning, potentially perpetuating social biases from training data that affect downstream performance.

Method: Systematic analysis of 5 open-source VLMs on FairFace dataset for occupation prediction task across 32 occupations and three different prompting styles, eliciting both predictions and reasoning.

Result: Findings reveal that biased reasoning patterns systematically underlie intersectional disparities in VLM outputs.

Conclusion: There is a need to align VLM reasoning with human values prior to downstream deployment to mitigate social biases.

Abstract: Vision Language Models (VLMs) are increasingly deployed across downstream tasks, yet their training data often encode social biases that surface in outputs. Unlike humans, who interpret images through contextual and social cues, VLMs process them through statistical associations, often leading to reasoning that diverges from human reasoning. By analyzing how a VLM reasons, we can understand how inherent biases are perpetuated and can adversely affect downstream performance. To examine this gap, we systematically analyze social biases in five open-source VLMs for an occupation prediction task, on the FairFace dataset. Across 32 occupations and three different prompting styles, we elicit both predictions and reasoning. Our findings reveal that the biased reasoning patterns systematically underlie intersectional disparities, highlighting the need to align VLM reasoning with human values prior to its downstream deployment.

[258] Distributed Deep Learning for Medical Image Denoising with Data Obfuscation

Sulaimon Oyeniyi Adebayo, Ayaz H. Khan

Main category: cs.CV

TL;DR: Distributed deep learning for chest X-ray denoising using U-Net and U-Net++ with optimized multi-GPU training achieves 60% faster training with minor accuracy trade-offs.

Details

Motivation: To improve medical image quality while protecting sensitive information in large clinical datasets using lightweight obfuscation techniques.

Method: Implemented U-Net and U-Net++ architectures with additive Gaussian noise obfuscation, evaluated under single-GPU, DataParallel, and optimized DistributedDataParallel with Automatic Mixed Precision training.

Result: U-Net++ showed superior denoising performance with better PSNR and SSIM scores, while optimized training pipeline reduced training time by over 60% compared to single-GPU and 40% vs DataParallel with minor accuracy drop.

Conclusion: Combining architectural design, lightweight obfuscation, and distributed training strategies is practically viable for accelerating medical image processing pipelines in clinical environments.

Abstract: Medical image denoising is essential for improving image quality while minimizing the exposure of sensitive information, particularly when working with large-scale clinical datasets. This study explores distributed deep learning for denoising chest X-ray images from the NIH Chest X-ray14 dataset, using additive Gaussian noise as a lightweight obfuscation technique. We implement and evaluate U-Net and U-Net++ architectures under single-GPU, standard multi-GPU (DataParallel), and optimized multi-GPU training configurations using PyTorch’s DistributedDataParallel (DDP) and Automatic Mixed Precision (AMP). Our results show that U-Net++ consistently delivers superior denoising performance, achieving competitive Peak Signal to Noise Ratio (PSNR) and Structured Similarity Index Method (SSIM) scores, though with less performance in Learned Perceptual Image Patch Similarity (LPIPS) compared to U-Net under low and moderate noise levels. This indicates U-Net++’s enhanced structural fidelity and low perceptual similarity. Meanwhile, our optimized training pipeline reduces training time by over 60% for both models compared to single-GPU training, and outperforms standard DataParallel by over 40%, with only a minor accuracy drop for both models (trading some accuracy for speed). These findings highlight the effectiveness of software-level optimization in distributed learning for medical imaging. This work demonstrates the practical viability of combining architectural design, lightweight obfuscation, and advanced distributed training strategies to accelerate and enhance medical image processing pipelines in real-world clinical and research environments. The full implementation is publicly available at: https://github.com/Suadey/medical-image-denoising-ddp.

[259] One-Shot Knowledge Transfer for Scalable Person Re-Identification

Longhua Li, Lei Qi, Xin Geng

Main category: cs.CV

TL;DR: OSKT is a one-shot knowledge transfer method for person re-identification that uses a weight chain to avoid repetitive computations when creating multiple compressed models for different resource constraints.

Details

Motivation: Edge computing in person ReID requires compact models, but conventional compression methods need separate computations for each model size, leading to repetitive work when multiple models are needed for different resource conditions.

Method: Propose OSKT (One-Shot Knowledge Transfer) that consolidates teacher model knowledge into a weight chain, which can be expanded to target model sizes without additional computation.

Result: OSKT significantly outperforms state-of-the-art compression methods while providing one-time knowledge transfer that eliminates frequent computations for each target model.

Conclusion: OSKT offers an efficient solution for deploying multiple compressed ReID models in edge computing scenarios by avoiding repetitive computations through one-shot knowledge transfer.

Abstract: Edge computing in person re-identification (ReID) is crucial for reducing the load on central cloud servers and ensuring user privacy. Conventional compression methods for obtaining compact models require computations for each individual student model. When multiple models of varying sizes are needed to accommodate different resource conditions, this leads to repetitive and cumbersome computations. To address this challenge, we propose a novel knowledge inheritance approach named OSKT (One-Shot Knowledge Transfer), which consolidates the knowledge of the teacher model into an intermediate carrier called a weight chain. When a downstream scenario demands a model that meets specific resource constraints, this weight chain can be expanded to the target model size without additional computation. OSKT significantly outperforms state-of-the-art compression methods, with the added advantage of one-time knowledge transfer that eliminates the need for frequent computations for each target model.

[260] MiVID: Multi-Strategic Self-Supervision for Video Frame Interpolation using Diffusion Model

Priyansh Srivastava, Romit Chatterjee, Abir Sen, Aradhana Behura, Ratnakar Dash

Main category: cs.CV

TL;DR: MiVID is a lightweight, self-supervised diffusion-based framework for video frame interpolation that eliminates explicit motion estimation and achieves competitive results without high-frame-rate supervision.

Details

Motivation: Classical VFI methods struggle with occlusions, domain shifts, and ambiguous motion, requiring optical flow or dense ground-truth data. There's a need for more robust and accessible approaches.

Method: Combines 3D U-Net backbone with transformer-style temporal attention, trained under hybrid masking regime with cosine-based progressive masking and adaptive loss scheduling. Entirely self-supervised using 9-frame video segments on CPU.

Result: Achieves optimal results at 50 epochs on UCF101-7 and DAVIS-7 datasets, competitive with supervised baselines despite low-resource constraints.

Conclusion: Demonstrates the power of self-supervised diffusion priors for temporally coherent frame synthesis and provides a scalable path toward accessible and generalizable VFI systems.

Abstract: Video Frame Interpolation (VFI) remains a cornerstone in video enhancement, enabling temporal upscaling for tasks like slow-motion rendering, frame rate conversion, and video restoration. While classical methods rely on optical flow and learning-based models assume access to dense ground-truth, both struggle with occlusions, domain shifts, and ambiguous motion. This article introduces MiVID, a lightweight, self-supervised, diffusion-based framework for video interpolation. Our model eliminates the need for explicit motion estimation by combining a 3D U-Net backbone with transformer-style temporal attention, trained under a hybrid masking regime that simulates occlusions and motion uncertainty. The use of cosine-based progressive masking and adaptive loss scheduling allows our network to learn robust spatiotemporal representations without any high-frame-rate supervision. Our framework is evaluated on UCF101-7 and DAVIS-7 datasets. MiVID is trained entirely on CPU using the datasets and 9-frame video segments, making it a low-resource yet highly effective pipeline. Despite these constraints, our model achieves optimal results at just 50 epochs, competitive with several supervised baselines.This work demonstrates the power of self-supervised diffusion priors for temporally coherent frame synthesis and provides a scalable path toward accessible and generalizable VFI systems.

[261] Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era

Feng Lu, Tong Jin, Canming Ye, Yunpeng Liu, Xiangyuan Lan, Chun Yuan

Main category: cs.CV

TL;DR: The paper proposes eliminating dedicated aggregators in transformer-based visual place recognition by using learnable aggregation tokens that implicitly aggregate information through self-attention, achieving state-of-the-art performance with higher efficiency.

Details

Motivation: Traditional VPR methods use backbone-plus-aggregator paradigm, but the authors argue that dedicated aggregators are unnecessary in the transformer era since transformers can naturally aggregate information through self-attention mechanisms.

Method: Introduce learnable aggregation tokens prepended to patch tokens before a transformer block, allowing implicit aggregation through self-attention, then concatenate these tokens as global representation. Also propose optimal token insertion strategy and initialization method.

Result: Outperforms state-of-the-art methods on several VPR datasets with higher efficiency, and ranks 1st on the MSLS challenge leaderboard.

Conclusion: Implicit aggregation via learnable tokens in transformers provides a simpler yet more effective alternative to traditional aggregators for visual place recognition.

Abstract: Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm has achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era, that is, we can obtain robust global descriptors only with the backbone. Specifically, we introduce some learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens will be jointly processed and interact globally via the intrinsic self-attention mechanism, implicitly aggregating useful information within the patch tokens to the aggregation tokens. Finally, we only take these aggregation tokens from the last output tokens and concatenate them as the global representation. Although implicit aggregation can provide robust global descriptors in an extremely simple manner, where and how to insert additional tokens, as well as the initialization of tokens, remains an open issue worthy of further exploration. To this end, we also propose the optimal token insertion strategy and token initialization method derived from empirical studies. Experimental results show that our method outperforms state-of-the-art methods on several VPR datasets with higher efficiency and ranks 1st on the MSLS challenge leaderboard. The code is available at https://github.com/lu-feng/image.

[262] S2ML: Spatio-Spectral Mutual Learning for Depth Completion

Zihui Zhao, Yifei Zhang, Zheng Wang, Yang Li, Kui Jiang, Zihan Geng, Chia-Wen Lin

Main category: cs.CV

TL;DR: S2ML framework combines spatial and frequency domain analysis for depth completion, outperforming state-of-the-art methods by leveraging spatio-spectral mutual learning.

Details

Motivation: Raw depth images from RGB-D cameras often have incomplete depth values due to weak reflections, boundary shadows, and artifacts, limiting their use in vision tasks. Existing methods overlook physical characteristics and frequency distribution patterns altered by invalid depth areas.

Method: Proposes Spatio-Spectral Mutual Learning (S2ML) framework that harmonizes spatial and frequency domains. Uses dedicated spectral fusion module for amplitude and phase spectra, calculates local and global correlations in unified embedding space, and employs gradual mutual representation and refinement.

Result: Outperforms state-of-the-art method CFormer by 0.828 dB on NYU-Depth V2 and 0.834 dB on SUN RGB-D datasets.

Conclusion: The S2ML framework effectively explores complementary physical characteristics and priors from both spatial and frequency domains for more accurate depth completion.

Abstract: The raw depth images captured by RGB-D cameras using Time-of-Flight (TOF) or structured light often suffer from incomplete depth values due to weak reflections, boundary shadows, and artifacts, which limit their applications in downstream vision tasks. Existing methods address this problem through depth completion in the image domain, but they overlook the physical characteristics of raw depth images. It has been observed that the presence of invalid depth areas alters the frequency distribution pattern. In this work, we propose a Spatio-Spectral Mutual Learning framework (S2ML) to harmonize the advantages of both spatial and frequency domains for depth completion. Specifically, we consider the distinct properties of amplitude and phase spectra and devise a dedicated spectral fusion module. Meanwhile, the local and global correlations between spatial-domain and frequency-domain features are calculated in a unified embedding space. The gradual mutual representation and refinement encourage the network to fully explore complementary physical characteristics and priors for more accurate depth completion. Extensive experiments demonstrate the effectiveness of our proposed S2ML method, outperforming the state-of-the-art method CFormer by 0.828 dB and 0.834 dB on the NYU-Depth V2 and SUN RGB-D datasets, respectively.

[263] StreamSTGS: Streaming Spatial and Temporal Gaussian Grids for Real-Time Free-Viewpoint Video

Zhihui Ke, Yuyang Liu, Xiaobo Zhou, Tie Qiu

Main category: cs.CV

TL;DR: StreamSTGS is a novel free-viewpoint video representation that enables real-time streaming by compressing 3D Gaussian attributes as 2D images and temporal features as video, achieving competitive performance with significantly reduced frame sizes (170KB average).

Details

Motivation: Existing 3DGS-based FVV methods face prohibitive storage requirements (up to 10MB per frame), making real-time streaming impossible. There's a need for efficient compression while maintaining quality.

Method: Represents dynamic scenes using canonical 3D Gaussians, temporal features, and deformation field. Encodes Gaussian attributes as 2D images and temporal features as video. Uses sliding window for local motion and transformer-guided auxiliary training for global motion.

Result: Achieves competitive performance on diverse FVV benchmarks, increasing PSNR by average 1dB while reducing frame size to 170KB average (compared to 10MB in previous methods).

Conclusion: StreamSTGS enables real-time FVV streaming with adaptive bitrate control, demonstrating significant compression efficiency while maintaining or improving visual quality compared to state-of-the-art methods.

Abstract: Streaming free-viewpoint video~(FVV) in real-time still faces significant challenges, particularly in training, rendering, and transmission efficiency. Harnessing superior performance of 3D Gaussian Splatting~(3DGS), recent 3DGS-based FVV methods have achieved notable breakthroughs in both training and rendering. However, the storage requirements of these methods can reach up to $10$MB per frame, making stream FVV in real-time impossible. To address this problem, we propose a novel FVV representation, dubbed StreamSTGS, designed for real-time streaming. StreamSTGS represents a dynamic scene using canonical 3D Gaussians, temporal features, and a deformation field. For high compression efficiency, we encode canonical Gaussian attributes as 2D images and temporal features as a video. This design not only enables real-time streaming, but also inherently supports adaptive bitrate control based on network condition without any extra training. Moreover, we propose a sliding window scheme to aggregate adjacent temporal features to learn local motions, and then introduce a transformer-guided auxiliary training module to learn global motions. On diverse FVV benchmarks, StreamSTGS demonstrates competitive performance on all metrics compared to state-of-the-art methods. Notably, StreamSTGS increases the PSNR by an average of $1$dB while reducing the average frame size to just $170$KB. The code is publicly available on https://github.com/kkkzh/StreamSTGS.

[264] Neodragon: Mobile Video Generation using Diffusion Transformer

Animesh Karnewar, Denis Korzhenkov, Ioannis Lelekas, Adil Karjauv, Noor Fathima, Hanwen Xiong, Vancheeswaran Vaidyanathan, Will Zeng, Rafael Esteves, Tushar Singhal, Fatih Porikli, Mohsen Ghafoorian, Amirhossein Habibian

Main category: cs.CV

TL;DR: Neodragon is a mobile-optimized text-to-video system that generates 2-second videos at 640x1024 resolution in 6.7 seconds on Qualcomm Hexagon NPU, achieving efficient on-device video synthesis through four key optimizations.

Details

Motivation: To enable low-cost, private, and on-device text-to-video synthesis that democratizes AI-based video content creation without reliance on cloud services.

Method: Four technical contributions: (1) Text-Encoder Distillation to replace T5xxl with smaller DT5, (2) Asymmetric Decoder Distillation for efficient VAE decoder, (3) Pruning MMDiT blocks with two-stage distillation, (4) Reducing NFE through step distillation with DMD for pyramidal flow-matching.

Result: Achieves 81.61 VBench score with 4.945B parameters, 3.5GB peak RAM usage, and 6.7s end-to-end latency, generating 2s videos at 640x1024 resolution on mobile hardware.

Conclusion: Neodragon successfully enables efficient, high-fidelity on-device text-to-video generation, making AI video creation accessible without cloud dependency.

Abstract: We introduce Neodragon, a text-to-video system capable of generating 2s (49 frames @24 fps) videos at the 640x1024 resolution directly on a Qualcomm Hexagon NPU in a record 6.7s (7 FPS). Differing from existing transformer-based offline text-to-video generation models, Neodragon is the first to have been specifically optimised for mobile hardware to achieve efficient and high-fidelity video synthesis. We achieve this through four key technical contributions: (1) Replacing the original large 4.762B T5xxl Text-Encoder with a much smaller 0.2B DT5 (DistilT5) with minimal quality loss, enabled through a novel Text-Encoder Distillation procedure. (2) Proposing an Asymmetric Decoder Distillation approach allowing us to replace the native codec-latent-VAE decoder with a more efficient one, without disturbing the generative latent-space of the generation pipeline. (3) Pruning of MMDiT blocks within the denoiser backbone based on their relative importance, with recovery of original performance through a two-stage distillation process. (4) Reducing the NFE (Neural Functional Evaluation) requirement of the denoiser by performing step distillation using DMD adapted for pyramidal flow-matching, thereby substantially accelerating video generation. When paired with an optimised SSD1B first-frame image generator and QuickSRNet for 2x super-resolution, our end-to-end Neodragon system becomes a highly parameter (4.945B full model), memory (3.5GB peak RAM usage), and runtime (6.7s E2E latency) efficient mobile-friendly model, while achieving a VBench total score of 81.61. By enabling low-cost, private, and on-device text-to-video synthesis, Neodragon democratizes AI-based video content creation, empowering creators to generate high-quality videos without reliance on cloud services. Code and model will be made publicly available at our website: https://qualcomm-ai-research.github.io/neodragon

[265] LoopExpose: An Unsupervised Framework for Arbitrary-Length Exposure Correction

Ao Li, Chen Chen, Zhenyu Wang, Tao Huang, Fangfang Wu, Weisheng Dong

Main category: cs.CV

TL;DR: LoopExpose is an unsupervised exposure correction method that uses nested loop optimization with pseudo-labels and a Luminance Ranking Loss, achieving state-of-the-art performance without requiring labeled data.

Details

Motivation: Supervised learning for exposure correction relies on large labeled datasets that are difficult to obtain in practice, creating a need for effective unsupervised methods.

Method: Proposes a nested loop optimization strategy with two-level framework: upper-level trains correction model using pseudo-labels from lower-level multi-exposure fusion, with feedback mechanism for refinement. Introduces Luminance Ranking Loss for self-supervised constraint.

Result: Extensive experiments show LoopExpose achieves superior exposure correction and fusion performance, outperforming existing state-of-the-art unsupervised methods.

Conclusion: The proposed unsupervised approach effectively addresses exposure correction without requiring labeled data, demonstrating strong performance through the nested loop optimization and luminance-based constraints.

Abstract: Exposure correction is essential for enhancing image quality under challenging lighting conditions. While supervised learning has achieved significant progress in this area, it relies heavily on large-scale labeled datasets, which are difficult to obtain in practical scenarios. To address this limitation, we propose a pseudo label-based unsupervised method called LoopExpose for arbitrary-length exposure correction. A nested loop optimization strategy is proposed to address the exposure correction problem, where the correction model and pseudo-supervised information are jointly optimized in a two-level framework. Specifically, the upper-level trains a correction model using pseudo-labels generated through multi-exposure fusion at the lower level. A feedback mechanism is introduced where corrected images are fed back into the fusion process to refine the pseudo-labels, creating a self-reinforcing learning loop. Considering the dominant role of luminance calibration in exposure correction, a Luminance Ranking Loss is introduced to leverage the relative luminance ordering across the input sequence as a self-supervised constraint. Extensive experiments on different benchmark datasets demonstrate that LoopExpose achieves superior exposure correction and fusion performance, outperforming existing state-of-the-art unsupervised methods. Code is available at https://github.com/FALALAS/LoopExpose.

[266] An Artificial Intelligence-based Assistant for the Visually Impaired

Luis Marquez-Carpintero, Francisco Gomez-Donoso, Zuria Bauer, Bessie Dominguez-Dager, Alvaro Belmonte-Baeza, Mónica Pina-Navarro, Francisco Morillas-Espejo, Felix Escalona, Miguel Cazorla

Main category: cs.CV

TL;DR: AIDEN is an AI assistant for visually impaired people that uses machine learning to identify objects, read text, and answer questions about the environment to improve independence and quality of life.

Details

Motivation: Visually impaired individuals face challenges in object identification, text reading, and environment navigation, limiting their independence despite existing solutions like Braille and screen readers.

Method: Uses state-of-the-art machine learning algorithms including You Only Look Once architectures and a Large Language and Vision Assistant to identify objects, read text, and answer environmental questions.

Result: The application enhances user autonomy and access to information, with user feedback supporting improved perception of daily usability.

Conclusion: AIDEN successfully contributes to improving the quality of life for visually impaired individuals by leveraging AI technologies to address key accessibility challenges.

Abstract: This paper describes an artificial intelligence-based assistant application, AIDEN, developed during 2023 and 2024, aimed at improving the quality of life for visually impaired individuals. Visually impaired individuals face challenges in identifying objects, reading text, and navigating unfamiliar environments, which can limit their independence and reduce their quality of life. Although solutions such as Braille, audio books, and screen readers exist, they may not be effective in all situations. This application leverages state-of-the-art machine learning algorithms to identify and describe objects, read text, and answer questions about the environment. Specifically, it uses You Only Look Once architectures and a Large Language and Vision Assistant. The system incorporates several methods to facilitate the user’s interaction with the system and access to textual and visual information in an appropriate manner. AIDEN aims to enhance user autonomy and access to information, contributing to an improved perception of daily usability, as supported by user feedback.

[267] Hybrid CNN-ViT Framework for Motion-Blurred Scene Text Restoration

Umar Rashid, Muhammad Arslan Arshad, Ghulam Ahmad, Muhammad Zeeshan Anjum, Rizwan Khan, Muhammad Akmal

Main category: cs.CV

TL;DR: Hybrid CNN-ViT framework for motion-blurred scene text restoration, combining local feature extraction with global contextual reasoning to achieve effective deblurring with computational efficiency.

Details

Motivation: Motion blur severely impairs text readability in computer vision tasks like autonomous driving and document digitization. Conventional deblurring methods struggle with spatially varying blur and lack long-range dependency modeling needed for text restoration.

Method: Hybrid deep learning framework combining CNNs with vision transformers. Uses CNN encoder-decoder for structural details and transformer module with self-attention for global context. Trained on TextOCR dataset with synthetic motion blur using composite loss (MAE, MSE, perceptual similarity, SSIM).

Result: Achieves 32.20 dB PSNR and 0.934 SSIM with only 2.83M parameters and 61ms average inference time, demonstrating both effectiveness and computational efficiency.

Conclusion: The CNN-ViT hybrid design proves practical for real-world motion-blurred scene-text restoration, effectively addressing limitations of conventional approaches through combined local and global feature processing.

Abstract: Motion blur in scene text images severely impairs readability and hinders the reliability of computer vision tasks, including autonomous driving, document digitization, and visual information retrieval. Conventional deblurring approaches are often inadequate in handling spatially varying blur and typically fall short in modeling the long-range dependencies necessary for restoring textual clarity. To overcome these limitations, we introduce a hybrid deep learning framework that combines convolutional neural networks (CNNs) with vision transformers (ViTs), thereby leveraging both local feature extraction and global contextual reasoning. The architecture employs a CNN-based encoder-decoder to preserve structural details, while a transformer module enhances global awareness through self-attention. Training is conducted on a curated dataset derived from TextOCR, where sharp scene-text samples are paired with synthetically blurred versions generated using realistic motion-blur kernels of multiple sizes and orientations. Model optimization is guided by a composite loss that incorporates mean absolute error (MAE), squared error (MSE), perceptual similarity, and structural similarity (SSIM). Quantitative evaluations show that the proposed method attains 32.20 dB in PSNR and 0.934 in SSIM, while remaining lightweight with 2.83 million parameters and an average inference time of 61 ms. These results highlight the effectiveness and computational efficiency of the CNN-ViT hybrid design, establishing its practicality for real-world motion-blurred scene-text restoration.

[268] DiLO: Disentangled Latent Optimization for Learning Shape and Deformation in Grouped Deforming 3D Objects

Mostofa Rafid Uddin, Jana Armouti, Umong Sain, Md Asib Rahman, Xingjian Li, Min Xu

Main category: cs.CV

TL;DR: Proposes an unsupervised method for disentangling 3D object deformations into shape and deformation factors using latent optimization and regularization techniques.

Details

Motivation: To enable unsupervised parameterization of deforming 3D objects into separate shape and deformation components for various downstream applications.

Method: Joint optimization of generator network with shape/deformation factors using regularization, followed by training two order-invariant PoinNet-based encoders for amortized inference.

Result: Method effectively performs unsupervised deformation transfer, classification, and explainability analysis on 3D human, animal, and facial datasets.

Conclusion: Simple approach achieves comparable or superior performance to more complex methods in multiple downstream tasks.

Abstract: In this work, we propose a disentangled latent optimization-based method for parameterizing grouped deforming 3D objects into shape and deformation factors in an unsupervised manner. Our approach involves the joint optimization of a generator network along with the shape and deformation factors, supported by specific regularization techniques. For efficient amortized inference of disentangled shape and deformation codes, we train two order-invariant PoinNet-based encoder networks in the second stage of our method. We demonstrate several significant downstream applications of our method, including unsupervised deformation transfer, deformation classification, and explainability analysis. Extensive experiments conducted on 3D human, animal, and facial expression datasets demonstrate that our simple approach is highly effective in these downstream tasks, comparable or superior to existing methods with much higher complexity.

Hossein Askari, Yadan Luo, Hongfu Sun, Fred Roosta

Main category: cs.CV

TL;DR: LFlow is a training-free framework that uses pretrained latent flow priors to solve linear inverse problems more efficiently than existing methods, achieving superior reconstruction quality.

Details

Motivation: Current flow-based inverse solvers are computationally expensive in pixel space and use suboptimal guidance strategies with prior-agnostic posterior covariances, limiting their effectiveness and scalability.

Method: LFlow performs ODE sampling in latent space using flow matching along an optimal path, and introduces a theoretically grounded posterior covariance derived from the optimal vector field for effective flow guidance.

Result: Experimental results show LFlow outperforms state-of-the-art latent diffusion solvers in reconstruction quality across most linear inverse problem tasks.

Conclusion: The proposed latent flow framework with optimal posterior covariance provides an efficient and effective solution for linear inverse problems, demonstrating superior performance compared to existing methods.

Abstract: Recent advances in inverse problem solving have increasingly adopted flow priors over diffusion models due to their ability to construct straight probability paths from noise to data, thereby enhancing efficiency in both training and inference. However, current flow-based inverse solvers face two primary limitations: (i) they operate directly in pixel space, which demands heavy computational resources for training and restricts scalability to high-resolution images, and (ii) they employ guidance strategies with prior-agnostic posterior covariances, which can weaken alignment with the generative trajectory and degrade posterior coverage. In this paper, we propose LFlow (Latent Refinement via Flows), a training-free framework for solving linear inverse problems via pretrained latent flow priors. LFlow leverages the efficiency of flow matching to perform ODE sampling in latent space along an optimal path. This latent formulation further allows us to introduce a theoretically grounded posterior covariance, derived from the optimal vector field, enabling effective flow guidance. Experimental results demonstrate that our proposed method outperforms state-of-the-art latent diffusion solvers in reconstruction quality across most tasks. The code will be publicly available at https://github.com/hosseinaskari-cs/LFlow .

[270] Real-Time Bundle Adjustment for Ultra-High-Resolution UAV Imagery Using Adaptive Patch-Based Feature Tracking

Selim Ahmet Iz, Francesco Nex, Norman Kerle, Henry Meissner, Ralf Berger

Main category: cs.CV

TL;DR: A real-time bundle adjustment framework for UAV imagery that processes full-resolution images without downsampling by dividing images into patches and using sliding window optimization.

Details

Motivation: Real-time processing of UAV imagery is crucial for disaster response and urgent geospatial applications, but conventional BA methods either sacrifice detail through downsampling or are too slow for time-critical missions.

Method: Divides each image into user-defined patches (e.g., 150x150 pixels), dynamically tracks patches across frames using GNSS/IMU data and DSM, and performs localized BA on sliding clusters of overlapping images determined by UAV navigation system.

Result: The method maintains precise camera orientations and high-fidelity mapping across multiple strips, running full bundle adjustment in under 2 seconds without GPU acceleration on 50MP MACS datasets.

Conclusion: The proposed lightweight, onboard-compatible framework enables real-time processing of full-resolution UAV imagery for applications like disaster response, infrastructure monitoring, and coastal protection.

Abstract: Real-time processing of UAV imagery is crucial for applications requiring urgent geospatial information, such as disaster response, where rapid decision-making and accurate spatial data are essential. However, processing high-resolution imagery in real time presents significant challenges due to the computational demands of feature extraction, matching, and bundle adjustment (BA). Conventional BA methods either downsample images, sacrificing important details, or require extensive processing time, making them unsuitable for time-critical missions. To overcome these limitations, we propose a novel real-time BA framework that operates directly on fullresolution UAV imagery without downsampling. Our lightweight, onboard-compatible approach divides each image into user-defined patches (e.g., NxN grids, default 150x150 pixels) and dynamically tracks them across frames using UAV GNSS/IMU data and a coarse, globally available digital surface model (DSM). This ensures spatial consistency for robust feature extraction and matching between patches. Overlapping relationships between images are determined in real time using UAV navigation system, enabling the rapid selection of relevant neighbouring images for localized BA. By limiting optimization to a sliding cluster of overlapping images, including those from adjacent flight strips, the method achieves real-time performance while preserving the accuracy of global BA. The proposed algorithm is designed for seamless integration into the DLR Modular Aerial Camera System (MACS), supporting largearea mapping in real time for disaster response, infrastructure monitoring, and coastal protection. Validation on MACS datasets with 50MP images demonstrates that the method maintains precise camera orientations and high-fidelity mapping across multiple strips, running full bundle adjustment in under 2 seconds without GPU acceleration.

[271] MambaOVSR: Multiscale Fusion with Global Motion Modeling for Chinese Opera Video Super-Resolution

Hua Chang, Xin Xu, Wei Liu, Wei Wang, Xin Yuan, Kui Jiang

Main category: cs.CV

TL;DR: Proposes MambaOVSR, a Mamba-based multiscale fusion network for Chinese opera video super-resolution, achieving 1.86 dB PSNR improvement over SOTA methods on their new COVC dataset.

Details

Motivation: Early filming equipment limitations degraded videos of last-century Chinese opera performances (low frame rates, resolution), hindering archival efforts. Existing STVSR methods struggle with opera's large motions and lack global modeling capabilities.

Method: MambaOVSR with three novel components: Global Fusion Module (GFM) for motion modeling via multiscale alternating scanning, Multiscale Synergistic Mamba Module (MSMM) for alignment across sequence lengths, and MambaVR block to resolve feature artifacts and positional information loss.

Result: Significantly outperforms SOTA STVSR method by average 1.86 dB in PSNR on the COVC dataset. Dataset and code will be publicly released.

Conclusion: MambaOVSR effectively addresses challenges in Chinese opera video super-resolution through novel Mamba-based architecture and multiscale fusion, enabling better preservation of classical art performances.

Abstract: Chinese opera is celebrated for preserving classical art. However, early filming equipment limitations have degraded videos of last-century performances by renowned artists (e.g., low frame rates and resolution), hindering archival efforts. Although space-time video super-resolution (STVSR) has advanced significantly, applying it directly to opera videos remains challenging. The scarcity of datasets impedes the recovery of high frequency details, and existing STVSR methods lack global modeling capabilities, compromising visual quality when handling opera’s characteristic large motions. To address these challenges, we pioneer a large scale Chinese Opera Video Clip (COVC) dataset and propose the Mamba-based multiscale fusion network for space-time Opera Video Super-Resolution (MambaOVSR). Specifically, MambaOVSR involves three novel components: the Global Fusion Module (GFM) for motion modeling through a multiscale alternating scanning mechanism, and the Multiscale Synergistic Mamba Module (MSMM) for alignment across different sequence lengths. Additionally, our MambaVR block resolves feature artifacts and positional information loss during alignment. Experimental results on the COVC dataset show that MambaOVSR significantly outperforms the SOTA STVSR method by an average of 1.86 dB in terms of PSNR. Dataset and Code will be publicly released.

[272] NURBGen: High-Fidelity Text-to-CAD Generation through LLM-Driven NURBS Modeling

Muhammad Usama, Mohammad Sadil Khan, Didier Stricker, Muhammad Zeshan Afzal

Main category: cs.CV

TL;DR: NURBGen is the first framework that generates editable 3D CAD models from natural language using NURBS representations, outperforming previous methods in geometric fidelity and dimensional accuracy.

Details

Motivation: Existing text-to-CAD systems either produce non-editable meshes or rely on scarce design-history data, creating a need for direct generation of editable CAD models from text.

Method: Fine-tune a large language model to translate text into JSON representations of NURBS surface parameters, using a hybrid representation combining untrimmed NURBS with analytic primitives to handle trimmed surfaces and reduce token complexity.

Result: NURBGen demonstrates strong performance on diverse prompts, surpassing prior methods in geometric fidelity and dimensional accuracy as confirmed by expert evaluations.

Conclusion: The framework successfully generates high-fidelity 3D CAD models directly from text using NURBS, with the code and dataset to be released publicly.

Abstract: Generating editable 3D CAD models from natural language remains challenging, as existing text-to-CAD systems either produce meshes or rely on scarce design-history data. We present NURBGen, the first framework to generate high-fidelity 3D CAD models directly from text using Non-Uniform Rational B-Splines (NURBS). To achieve this, we fine-tune a large language model (LLM) to translate free-form texts into JSON representations containing NURBS surface parameters (\textit{i.e}, control points, knot vectors, degrees, and rational weights) which can be directly converted into BRep format using Python. We further propose a hybrid representation that combines untrimmed NURBS with analytic primitives to handle trimmed surfaces and degenerate regions more robustly, while reducing token complexity. Additionally, we introduce partABC, a curated subset of the ABC dataset consisting of individual CAD components, annotated with detailed captions using an automated annotation pipeline. NURBGen demonstrates strong performance on diverse prompts, surpassing prior methods in geometric fidelity and dimensional accuracy, as confirmed by expert evaluations. Code and dataset will be released publicly.

[273] Scene-Aware Urban Design: A Human-AI Recommendation Framework Using Co-Occurrence Embeddings and Vision-Language Models

Rodrigo Gallardo, Oz Fishman, Alexander Htet Kyaw

Main category: cs.CV

TL;DR: A human-in-the-loop computer vision framework that uses generative AI to propose micro-scale design interventions in public spaces, enabling continuous local participation by detecting urban objects and suggesting statistically likely complements.

Details

Motivation: To move beyond top-down master planning by supporting more continuous, local participation in urban design, grounding choices in everyday patterns and lived experience.

Method: Uses Grounding DINO and curated ADE20K dataset to detect urban objects, builds co-occurrence embeddings to reveal spatial configurations, provides five statistically likely complements to chosen anchor objects, and employs vision language model to suggest third objects that complete complex urban tactics.

Result: The system successfully detects urban objects, identifies common spatial configurations, and generates meaningful design interventions while keeping users in control of selection and refinement.

Conclusion: The framework effectively supports participatory urban design by combining AI-driven analysis with human oversight, enabling micro-scale interventions that reflect local patterns and experiences.

Abstract: This paper introduces a human-in-the-loop computer vision framework that uses generative AI to propose micro-scale design interventions in public space and support more continuous, local participation. Using Grounding DINO and a curated subset of the ADE20K dataset as a proxy for the urban built environment, the system detects urban objects and builds co-occurrence embeddings that reveal common spatial configurations. From this analysis, the user receives five statistically likely complements to a chosen anchor object. A vision language model then reasons over the scene image and the selected pair to suggest a third object that completes a more complex urban tactic. The workflow keeps people in control of selection and refinement and aims to move beyond top-down master planning by grounding choices in everyday patterns and lived experience.

[274] MoRA: Missing Modality Low-Rank Adaptation for Visual Recognition

Shu Zhao, Nilesh Ahuja, Tan Yu, Tianyi Shen, Vijaykrishnan Narayanan

Main category: cs.CV

TL;DR: MoRA is a parameter-efficient fine-tuning method for vision-language models that enables effective multimodal recognition even when modalities are missing, using only 0.11% trainable parameters while improving performance by 5.24% and reducing inference time by 74.10% compared to SOTA.

Details

Motivation: Real-world scenarios often have missing modalities due to privacy, collection difficulties, or resource limitations, but existing prompt learning approaches fail to capture cross-modal relationships and suffer from computational overhead.

Method: MoRA introduces modality-common parameters between text and vision encoders for bidirectional knowledge transfer, combined with modality-specific parameters to maintain inter-modality interaction and intra-modality flexibility.

Result: Extensive experiments show MoRA achieves 5.24% average performance improvement in missing-modality scenarios, uses only 25.90% of SOTA inference time, and requires only 0.11% of trainable parameters compared to full fine-tuning.

Conclusion: MoRA provides an effective parameter-efficient solution for handling missing modalities in vision-language models by explicitly modeling cross-modal interactions while maintaining computational efficiency.

Abstract: Pre-trained vision language models have shown remarkable performance on visual recognition tasks, but they typically assume the availability of complete multimodal inputs during both training and inference. In real-world scenarios, however, modalities may be missing due to privacy constraints, collection difficulties, or resource limitations. While previous approaches have addressed this challenge using prompt learning techniques, they fail to capture the cross-modal relationships necessary for effective multimodal visual recognition and suffer from inevitable computational overhead. In this paper, we introduce MoRA, a parameter-efficient fine-tuning method that explicitly models cross-modal interactions while maintaining modality-specific adaptations. MoRA introduces modality-common parameters between text and vision encoders, enabling bidirectional knowledge transfer. Additionally, combined with the modality-specific parameters, MoRA allows the backbone model to maintain inter-modality interaction and enable intra-modality flexibility. Extensive experiments on standard benchmarks demonstrate that MoRA achieves an average performance improvement in missing-modality scenarios by 5.24% and uses only 25.90% of the inference time compared to the SOTA method while requiring only 0.11% of trainable parameters compared to full fine-tuning.

[275] Temporal-Guided Visual Foundation Models for Event-Based Vision

Ruihao Xia, Junhong Cai, Luziwei Leng, Liuyi Wang, Chengju Liu, Ran Cheng, Yang Tang, Pan Zhou

Main category: cs.CV

TL;DR: TGVFM integrates Visual Foundation Models with temporal context fusion for event-based vision, achieving state-of-the-art performance in semantic segmentation, depth estimation, and object detection.

Details

Motivation: Event cameras excel in challenging environments but processing asynchronous event streams is difficult. Existing methods don't fully leverage pretrained Visual Foundation Models from image data for event-based vision.

Method: Proposes Temporal-Guided VFM with three components: Long-Range Temporal Attention, Dual Spatiotemporal Attention, and Deep Feature Guidance Mechanism. Retrains event-to-video models on real data and uses transformer-based VFMs.

Result: Achieves state-of-the-art performance with 16% improvement in semantic segmentation, 21% in depth estimation, and 16% in object detection over existing methods.

Conclusion: Successfully bridges cross-modality gap by enabling image-based VFMs to work with event-based vision through temporal reasoning.

Abstract: Event cameras offer unique advantages for vision tasks in challenging environments, yet processing asynchronous event streams remains an open challenge. While existing methods rely on specialized architectures or resource-intensive training, the potential of leveraging modern Visual Foundation Models (VFMs) pretrained on image data remains under-explored for event-based vision. To address this, we propose Temporal-Guided VFM (TGVFM), a novel framework that integrates VFMs with our temporal context fusion block seamlessly to bridge this gap. Our temporal block introduces three key components: (1) Long-Range Temporal Attention to model global temporal dependencies, (2) Dual Spatiotemporal Attention for multi-scale frame correlation, and (3) Deep Feature Guidance Mechanism to fuse semantic-temporal features. By retraining event-to-video models on real-world data and leveraging transformer-based VFMs, TGVFM preserves spatiotemporal dynamics while harnessing pretrained representations. Experiments demonstrate SoTA performance across semantic segmentation, depth estimation, and object detection, with improvements of 16%, 21%, and 16% over existing methods, respectively. Overall, this work unlocks the cross-modality potential of image-based VFMs for event-based vision with temporal reasoning. Code is available at https://github.com/XiaRho/TGVFM.

[276] Physics-Informed Image Restoration via Progressive PDE Integration

Shamika Likhite, Santiago López-Tapia, Aggelos K. Katsaggelos

Main category: cs.CV

TL;DR: Proposes a physics-informed PDE framework for motion deblurring that integrates advection-diffusion equations into deep learning architectures to capture long-range spatial dependencies in blur patterns with minimal computational overhead.

Details

Motivation: Motion blur degrades image quality and impairs computer vision tasks. Existing deep learning methods struggle with capturing long-range spatial dependencies in blur patterns, requiring extremely deep networks for global modeling.

Method: Progressive training framework that integrates physics-informed PDE dynamics (advection-diffusion equations) into state-of-the-art restoration architectures to model feature evolution and directional flow characteristics of motion blur.

Result: Achieves superior restoration quality with only ~1% increase in inference GMACs. Improves PSNR and SSIM significantly across four architectures (FFTformer, NAFNet, Restormer, Stripformer) on standard benchmarks.

Conclusion: Incorporating mathematical physics principles through PDE-based global layers enhances deep learning-based image restoration, establishing a promising direction for physics-informed neural network design in computer vision.

Abstract: Motion blur, caused by relative movement between camera and scene during exposure, significantly degrades image quality and impairs downstream computer vision tasks such as object detection, tracking, and recognition in dynamic environments. While deep learning-based motion deblurring methods have achieved remarkable progress, existing approaches face fundamental challenges in capturing the long-range spatial dependencies inherent in motion blur patterns. Traditional convolutional methods rely on limited receptive fields and require extremely deep networks to model global spatial relationships. These limitations motivate the need for alternative approaches that incorporate physical priors to guide feature evolution during restoration. In this paper, we propose a progressive training framework that integrates physics-informed PDE dynamics into state-of-the-art restoration architectures. By leveraging advection-diffusion equations to model feature evolution, our approach naturally captures the directional flow characteristics of motion blur while enabling principled global spatial modeling. Our PDE-enhanced deblurring models achieve superior restoration quality with minimal overhead, adding only approximately 1% to inference GMACs while providing consistent improvements in perceptual quality across multiple state-of-the-art architectures. Comprehensive experiments on standard motion deblurring benchmarks demonstrate that our physics-informed approach improves PSNR and SSIM significantly across four diverse architectures, including FFTformer, NAFNet, Restormer, and Stripformer. These results validate that incorporating mathematical physics principles through PDE-based global layers can enhance deep learning-based image restoration, establishing a promising direction for physics-informed neural network design in computer vision applications.

[277] Gait Recognition via Collaborating Discriminative and Generative Diffusion Models

Haijun Xiong, Bin Feng, Bang Wang, Xinggang Wang, Wenyu Liu

Main category: cs.CV

TL;DR: CoD² is a novel gait recognition framework that combines diffusion models with discriminative models using multi-level conditional control to generate identity-consistent gait sequences and extract robust features.

Details

Motivation: Gait recognition provides non-intrusive biometric identification, but generative models' potential remains underexplored despite discriminative models' success.

Method: Proposes CoD² framework with Multi-level Conditional Control strategy: high-level identity-aware semantic conditions from discriminative extractor guide generation, while low-level visual details (appearance, motion) enhance consistency. Generated sequences improve discriminative extractor learning.

Result: Achieves state-of-the-art performance on four datasets (SUSTech1K, CCPG, GREW, Gait3D) and integrates seamlessly with existing discriminative methods for consistent improvements.

Conclusion: CoD² effectively combines generative and discriminative approaches for robust gait feature extraction, demonstrating superior performance and compatibility with existing methods.

Abstract: Gait recognition offers a non-intrusive biometric solution by identifying individuals through their walking patterns. Although discriminative models have achieved notable success in this domain, the full potential of generative models remains largely underexplored. In this paper, we introduce \textbf{CoD$^2$}, a novel framework that combines the data distribution modeling capabilities of diffusion models with the semantic representation learning strengths of discriminative models to extract robust gait features. We propose a Multi-level Conditional Control strategy that incorporates both high-level identity-aware semantic conditions and low-level visual details. Specifically, the high-level condition, extracted by the discriminative extractor, guides the generation of identity-consistent gait sequences, whereas low-level visual details, such as appearance and motion, are preserved to enhance consistency. Furthermore, the generated sequences facilitate the discriminative extractor’s learning, enabling it to capture more comprehensive high-level semantic features. Extensive experiments on four datasets (SUSTech1K, CCPG, GREW, and Gait3D) demonstrate that CoD$^2$ achieves state-of-the-art performance and can be seamlessly integrated with existing discriminative methods, yielding consistent improvements.

[278] AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving

Ruifei Zhang, Junlin Xie, Wei Zhang, Weikai Chen, Xiao Tan, Xiang Wan, Guanbin Li

Main category: cs.CV

TL;DR: AdaDrive is an adaptive slow-fast framework that dynamically determines when and how to use LLMs in autonomous driving, balancing reasoning capabilities with real-time efficiency through adaptive activation and fusion strategies.

Details

Motivation: Existing approaches either use LLMs too frequently (causing computational overhead) or use fixed schedules (failing to adapt to dynamic driving conditions), creating a need for a balanced solution.

Method: Uses adaptive activation loss for determining when to invoke LLMs based on comparative learning, and adaptive fusion strategy for continuous, scaled LLM influence based on scene complexity and prediction confidence.

Result: Achieves state-of-the-art performance on language-grounded autonomous driving benchmarks in both driving accuracy and computational efficiency.

Conclusion: AdaDrive provides a flexible, context-aware framework that maximizes decision accuracy without compromising real-time performance in autonomous driving systems.

Abstract: Effectively integrating Large Language Models (LLMs) into autonomous driving requires a balance between leveraging high-level reasoning and maintaining real-time efficiency. Existing approaches either activate LLMs too frequently, causing excessive computational overhead, or use fixed schedules, failing to adapt to dynamic driving conditions. To address these challenges, we propose AdaDrive, an adaptively collaborative slow-fast framework that optimally determines when and how LLMs contribute to decision-making. (1) When to activate the LLM: AdaDrive employs a novel adaptive activation loss that dynamically determines LLM invocation based on a comparative learning mechanism, ensuring activation only in complex or critical scenarios. (2) How to integrate LLM assistance: Instead of rigid binary activation, AdaDrive introduces an adaptive fusion strategy that modulates a continuous, scaled LLM influence based on scene complexity and prediction confidence, ensuring seamless collaboration with conventional planners. Through these strategies, AdaDrive provides a flexible, context-aware framework that maximizes decision accuracy without compromising real-time performance. Extensive experiments on language-grounded autonomous driving benchmarks demonstrate that AdaDrive state-of-the-art performance in terms of both driving accuracy and computational efficiency. Code is available at https://github.com/ReaFly/AdaDrive.

[279] VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving

Ruifei Zhang, Wei Zhang, Xiao Tan, Sibei Yang, Xiang Wan, Xiaonan Luo, Guanbin Li

Main category: cs.CV

TL;DR: VLDrive is a lightweight multimodal LLM for autonomous driving that reduces parameters by 81% while improving driving performance through enhanced vision components and novel attention mechanisms.

Details

Motivation: Current LLM-based autonomous driving approaches suffer from frequent collisions due to visual representation limitations and face deployment challenges from large parameter sizes.

Method: Introduces a lightweight MLLM architecture with cycle-consistent dynamic visual pruning, memory-enhanced feature aggregation, and distance-decoupled instruction attention for improved visual-linguistic feature learning.

Result: Achieves state-of-the-art driving performance with 81% parameter reduction (from 7B to 1.3B) and substantial driving score improvements: 15.4% (tiny), 16.8% (short), and 7.6% (long distances) in CARLA simulator.

Conclusion: VLDrive demonstrates that lightweight MLLM architectures with enhanced vision components can achieve superior autonomous driving performance while being more deployable through significant parameter reduction.

Abstract: Recent advancements in language-grounded autonomous driving have been significantly promoted by the sophisticated cognition and reasoning capabilities of large language models (LLMs). However, current LLM-based approaches encounter critical challenges: (1) Failure analysis reveals that frequent collisions and obstructions, stemming from limitations in visual representations, remain primary obstacles to robust driving performance. (2) The substantial parameters of LLMs pose considerable deployment hurdles. To address these limitations, we introduce VLDrive, a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves compact visual tokens through innovative strategies, including cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation. Furthermore, we propose a distance-decoupled instruction attention mechanism to improve joint visual-linguistic feature learning, particularly for long-range visual tokens. Extensive experiments conducted in the CARLA simulator demonstrate VLDrive`s effectiveness. Notably, VLDrive achieves state-of-the-art driving performance while reducing parameters by 81% (from 7B to 1.3B), yielding substantial driving score improvements of 15.4%, 16.8%, and 7.6% at tiny, short, and long distances, respectively, in closed-loop evaluations. Code is available at https://github.com/ReaFly/VLDrive.

[280] Robust Nearest Neighbour Retrieval Using Targeted Manifold Manipulation

B. Ghosh, H. Harikumar, S. Rana

Main category: cs.CV

TL;DR: TMM-NN is a novel nearest-neighbor retrieval method that uses targeted perturbation via trigger patches to define neighborhoods based on sample responsiveness rather than geometric distance, outperforming traditional metrics.

Details

Motivation: Current nearest-neighbor retrieval relies on hand-tuning feature layers and distance metrics, which may not capture semantic relationships effectively.

Method: Uses a lightweight query-specific trigger patch added to query images, weakly backdooring the network to steer patched inputs toward a dummy class. Similar images shift easily to the dummy class while dissimilar ones resist.

Result: TMM-NN outperforms traditional retrieval metrics under noise and across diverse tasks, with robustness analysis confirming its effectiveness.

Conclusion: Trigger-based ranking through targeted manifold manipulation provides more semantically meaningful nearest-neighbor retrieval than traditional geometric distance approaches.

Abstract: Nearest-neighbour retrieval is central to classification and explainable-AI pipelines, but current practice relies on hand-tuning feature layers and distance metrics. We propose Targeted Manifold Manipulation-Nearest Neighbour (TMM-NN), which reconceptualises retrieval by assessing how readily each sample can be nudged into a designated region of the feature manifold; neighbourhoods are defined by a sample’s responsiveness to a targeted perturbation rather than absolute geometric distance. TMM-NN implements this through a lightweight, query-specific trigger patch. The patch is added to the query image, and the network is weakly ``backdoored’’ so that any input with the patch is steered toward a dummy class. Images similar to the query need only a slight shift and are classified as the dummy class with high probability, while dissimilar ones are less affected. By ranking candidates by this confidence, TMM-NN retrieves the most semantically related neighbours. Robustness analysis and benchmark experiments confirm this trigger-based ranking outperforms traditional metrics under noise and across diverse tasks.

[281] A Mixture-of-Experts Framework with Log-Logistic Components for Survival Analysis on Histopathology Images

Ardhendu Sekhar, Vasu Soni, Keshav Aske, Shivam Madnoorkar, Pranav Jeevan, Amit Sethi

Main category: cs.CV

TL;DR: A modular framework for predicting cancer survival from pathology images using quantile-based patch selection, graph clustering, hierarchical attention, and mixture modeling.

Details

Motivation: To develop a more accurate method for predicting cancer-specific survival from whole slide pathology images by capturing tissue heterogeneity and complex survival distributions.

Method: Four-component framework: Quantile Gated Patch Selection, Graph Guided Clustering, Hierarchical Context Attention, and Expert Driven Mixture of Log logistics distributions.

Result: Achieved concordance indices of 0.644 on TCGA LUAD, 0.751 on TCGA KIRC, and 0.752 on TCGA BRCA, outperforming state-of-the-art methods.

Conclusion: The proposed modular framework effectively predicts cancer survival by integrating tissue heterogeneity modeling and complex distribution estimation, demonstrating superior performance across multiple cancer types.

Abstract: We propose a modular framework for predicting cancer specific survival from whole slide pathology images (WSIs). The method integrates four components: (i) Quantile Gated Patch Selection via quantile based thresholding to isolate prognostically informative tissue regions; (ii) Graph Guided Clustering using a k nearest neighbor graph to capture phenotype level heterogeneity through spatial and morphological coherence; (iii) Hierarchical Context Attention to learn intra and inter cluster interactions; and (iv) an Expert Driven Mixture of Log logistics framework to estimate complex survival distributions using Log logistics distributions. The model attains a concordance index of 0.644 on TCGA LUAD, 0.751 on TCGA KIRC, and 0.752 on TCGA BRCA respectively, outperforming existing state of the art approaches.

Jian Zhang, Junyi Guo, Junyi Yuan, Huanda Lu, Yanlin Zhou, Fangyu Wu, Qiufeng Wang, Dongming Lu

Main category: cs.CV

TL;DR: C³ is a data augmentation framework that improves cross-modal retrieval by enhancing the completeness and consistency of LLM-generated descriptions through semantic coverage assessment and consistency-guided reasoning.

Details

Motivation: Cross-modal retrieval in cultural heritage is limited by incomplete/inconsistent textual descriptions due to historical data loss and expensive expert annotation. LLMs can enrich descriptions but suffer from hallucinations and lack visual grounding.

Method: Proposes C³ framework with completeness evaluation module using visual cues and language models, and Markov Decision Process to supervise Chain-of-Thought reasoning for consistency evaluation through adaptive query control.

Result: Achieves state-of-the-art performance on cultural heritage datasets CulTi and TimeTravel, as well as general benchmarks MSCOCO and Flickr30K in both fine-tuned and zero-shot settings.

Conclusion: C³ effectively addresses LLM limitations in cross-modal retrieval by improving description completeness and consistency, demonstrating strong performance across cultural heritage and general datasets.

Abstract: Cross-modal retrieval is essential for interpreting cultural heritage data, but its effectiveness is often limited by incomplete or inconsistent textual descriptions, caused by historical data loss and the high cost of expert annotation. While large language models (LLMs) offer a promising solution by enriching textual descriptions, their outputs frequently suffer from hallucinations or miss visually grounded details. To address these challenges, we propose $C^3$, a data augmentation framework that enhances cross-modal retrieval performance by improving the completeness and consistency of LLM-generated descriptions. $C^3$ introduces a completeness evaluation module to assess semantic coverage using both visual cues and language-model outputs. Furthermore, to mitigate factual inconsistencies, we formulate a Markov Decision Process to supervise Chain-of-Thought reasoning, guiding consistency evaluation through adaptive query control. Experiments on the cultural heritage datasets CulTi and TimeTravel, as well as on general benchmarks MSCOCO and Flickr30K, demonstrate that $C^3$ achieves state-of-the-art performance in both fine-tuned and zero-shot settings.

[283] RelightMaster: Precise Video Relighting with Multi-plane Light Images

Weikang Bian, Xiaoyu Shi, Zhaoyang Huang, Jianhong Bai, Qinghe Wang, Xintao Wang, Pengfei Wan, Kun Gai, Hongsheng Li

Main category: cs.CV

TL;DR: RelightMaster is a novel framework for accurate and controllable video relighting that addresses the limitations of text-to-video models in lighting control by introducing Multi-plane Light Image (MPLI) visual prompts and a Light Image Adapter for seamless integration with pre-trained diffusion models.

Details

Motivation: Current text-to-video models lack fine-grained lighting control due to text's inherent limitation in describing lighting details and insufficient pre-training on lighting-related prompts. Additionally, constructing high-quality relighting training data is challenging due to scarce real-world controllable lighting data.

Method: 1) Built RelightVideo dataset with identical dynamic content under varying precise lighting conditions using Unreal Engine; 2) Introduced Multi-plane Light Image (MPLI) - a novel visual prompt that models lighting via K depth-aligned planes representing 3D light source positions, intensities, and colors; 3) Designed Light Image Adapter that compresses MPLI via pre-trained Video VAE and injects latent light features into DiT blocks.

Result: RelightMaster generates physically plausible lighting and shadows while preserving original scene content. The framework supports multi-source lighting scenarios and generalizes to unseen light setups.

Conclusion: The proposed RelightMaster framework successfully addresses video relighting challenges by combining a novel lighting dataset, MPLI visual prompts, and seamless integration with pre-trained diffusion models, enabling precise and controllable video relighting that was previously unexplored.

Abstract: Recent advances in diffusion models enable high-quality video generation and editing, but precise relighting with consistent video contents, which is critical for shaping scene atmosphere and viewer attention, remains unexplored. Mainstream text-to-video (T2V) models lack fine-grained lighting control due to text’s inherent limitation in describing lighting details and insufficient pre-training on lighting-related prompts. Additionally, constructing high-quality relighting training data is challenging, as real-world controllable lighting data is scarce. To address these issues, we propose RelightMaster, a novel framework for accurate and controllable video relighting. First, we build RelightVideo, the first dataset with identical dynamic content under varying precise lighting conditions based on the Unreal Engine. Then, we introduce Multi-plane Light Image (MPLI), a novel visual prompt inspired by Multi-Plane Image (MPI). MPLI models lighting via K depth-aligned planes, representing 3D light source positions, intensities, and colors while supporting multi-source scenarios and generalizing to unseen light setups. Third, we design a Light Image Adapter that seamlessly injects MPLI into pre-trained Video Diffusion Transformers (DiT): it compresses MPLI via a pre-trained Video VAE and injects latent light features into DiT blocks, leveraging the base model’s generative prior without catastrophic forgetting. Experiments show that RelightMaster generates physically plausible lighting and shadows and preserves original scene content. Demos are available at https://wkbian.github.io/Projects/RelightMaster/.

[284] LaneDiffusion: Improving Centerline Graph Learning via Prior Injected BEV Feature Generation

Zijie Wang, Weiming Zhang, Wei Zhang, Xiao Tan, Hongxing Liu, Yaowei Wang, Guanbin Li

Main category: cs.CV

TL;DR: LaneDiffusion introduces a generative approach using diffusion models for centerline graph learning in autonomous driving, outperforming traditional deterministic methods by generating lane priors at BEV feature level.

Details

Motivation: Traditional deterministic methods for centerline graph learning lack spatial reasoning and struggle with occluded/invisible centerlines, while generative approaches remain underexplored in this domain.

Method: Uses diffusion models to generate lane centerline priors at BEV feature level, integrating Lane Prior Injection Module (LPIM) and Lane Prior Diffusion Module (LPDM) to construct diffusion targets and manage the process, then decodes vectorized centerlines from prior-injected features.

Result: Significantly outperforms existing methods on nuScenes and Argoverse2 datasets with improvements of 4.2-6.4% on point-level metrics and 2.1-6.8% on segment-level metrics, establishing state-of-the-art performance.

Conclusion: LaneDiffusion demonstrates the effectiveness of generative models for centerline graph learning, offering new insights and superior performance compared to traditional deterministic approaches.

Abstract: Centerline graphs, crucial for path planning in autonomous driving, are traditionally learned using deterministic methods. However, these methods often lack spatial reasoning and struggle with occluded or invisible centerlines. Generative approaches, despite their potential, remain underexplored in this domain. We introduce LaneDiffusion, a novel generative paradigm for centerline graph learning. LaneDiffusion innovatively employs diffusion models to generate lane centerline priors at the Bird’s Eye View (BEV) feature level, instead of directly predicting vectorized centerlines. Our method integrates a Lane Prior Injection Module (LPIM) and a Lane Prior Diffusion Module (LPDM) to effectively construct diffusion targets and manage the diffusion process. Furthermore, vectorized centerlines and topologies are then decoded from these prior-injected BEV features. Extensive evaluations on the nuScenes and Argoverse2 datasets demonstrate that LaneDiffusion significantly outperforms existing methods, achieving improvements of 4.2%, 4.6%, 4.7%, 6.4% and 1.8% on fine-grained point-level metrics (GEO F1, TOPO F1, JTOPO F1, APLS and SDA) and 2.3%, 6.4%, 6.8% and 2.1% on segment-level metrics (IoU, mAP_cf, DET_l and TOP_ll). These results establish state-of-the-art performance in centerline graph learning, offering new insights into generative models for this task.

[285] VideoSSR: Video Self-Supervised Reinforcement Learning

Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, Yu Cheng

Main category: cs.CV

TL;DR: VideoSSR is a self-supervised reinforcement learning framework that uses intrinsic video information to generate verifiable training data, improving MLLM video understanding by over 5% across 17 benchmarks.

Details

Motivation: Manual annotation of high-quality video data is expensive, and existing datasets are becoming insufficient for advancing Multimodal Large Language Models (MLLMs). The paper explores whether intrinsic video information can be leveraged to self-generate verifiable training data.

Method: Proposes three self-supervised pretext tasks (Anomaly Grounding, Object Counting, Temporal Jigsaw) and creates VideoSSR-30K dataset. Develops VideoSSR framework for video self-supervised reinforcement learning with verifiable rewards.

Result: VideoSSR consistently enhances model performance across 17 benchmarks in four video domains (General Video QA, Long Video QA, Temporal Grounding, Complex Reasoning), achieving average improvement of over 5%. Current MLLMs struggle significantly on the proposed VIUBench tasks.

Conclusion: VideoSSR establishes a potent foundational framework for developing more advanced video understanding in MLLMs, demonstrating that self-generated verifiable training data from intrinsic video information can effectively advance video understanding capabilities.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, the rapid progress of MLLMs is outpacing the complexity of existing video datasets, while the manual annotation of new, high-quality data remains prohibitively expensive. This work investigates a pivotal question: Can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data? To investigate this, we introduce three self-supervised pretext tasks: Anomaly Grounding, Object Counting, and Temporal Jigsaw. We construct the Video Intrinsic Understanding Benchmark (VIUBench) to validate their difficulty, revealing that current state-of-the-art MLLMs struggle significantly on these tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset and propose VideoSSR, a novel video self-supervised reinforcement learning framework for RLVR. Extensive experiments across 17 benchmarks, spanning four major video domains (General Video QA, Long Video QA, Temporal Grounding, and Complex Reasoning), demonstrate that VideoSSR consistently enhances model performance, yielding an average improvement of over 5%. These results establish VideoSSR as a potent foundational framework for developing more advanced video understanding in MLLMs. The code is available at https://github.com/lcqysl/VideoSSR.

[286] From ACR O-RADS 2022 to Explainable Deep Learning: Comparative Performance of Expert Radiologists, Convolutional Neural Networks, Vision Transformers, and Fusion Models in Ovarian Masses

Ali Abbasian Ardakani, Afshin Mohammadi, Alisa Mohebbi, Anushya Vijayananthan, Sook Sam Leong, Lim Yi Ting, Mohd Kamil Bin Mohamad Fabell, U Rajendra Acharya, Sepideh Hatamikia

Main category: cs.CV

TL;DR: Deep learning models outperform radiologists using O-RADS v2022 for ovarian lesion classification, with hybrid human-AI frameworks achieving the highest diagnostic accuracy.

Details

Motivation: To address variability in human interpretation of O-RADS v2022 classification and evaluate whether deep learning models and hybrid human-AI approaches can improve diagnostic performance for ovarian lesions.

Method: Retrospective study of 512 adnexal mass images from 227 patients, comparing radiologist O-RADS assessment with 16 deep learning models (CNNs and Vision Transformers) and hybrid human-AI frameworks integrating radiologist scores with DL predictions.

Result: Radiologists achieved AUC 0.683 and 68.0% accuracy. CNN models showed AUC 0.620-0.908 and 59.2-86.4% accuracy, while ViT16-384 performed best (AUC 0.941, 87.4% accuracy). Hybrid frameworks significantly improved CNN performance but not ViT models.

Conclusion: Deep learning models significantly outperform radiologist-only O-RADS assessment, and hybrid human-AI approaches provide the highest diagnostic accuracy, offering potential to standardize ultrasound interpretation and improve lesion detection.

Abstract: Background: The 2022 update of the Ovarian-Adnexal Reporting and Data System (O-RADS) ultrasound classification refines risk stratification for adnexal lesions, yet human interpretation remains subject to variability and conservative thresholds. Concurrently, deep learning (DL) models have demonstrated promise in image-based ovarian lesion characterization. This study evaluates radiologist performance applying O-RADS v2022, compares it to leading convolutional neural network (CNN) and Vision Transformer (ViT) models, and investigates the diagnostic gains achieved by hybrid human-AI frameworks. Methods: In this single-center, retrospective cohort study, a total of 512 adnexal mass images from 227 patients (110 with at least one malignant cyst) were included. Sixteen DL models, including DenseNets, EfficientNets, ResNets, VGGs, Xception, and ViTs, were trained and validated. A hybrid model integrating radiologist O-RADS scores with DL-predicted probabilities was also built for each scheme. Results: Radiologist-only O-RADS assessment achieved an AUC of 0.683 and an overall accuracy of 68.0%. CNN models yielded AUCs of 0.620 to 0.908 and accuracies of 59.2% to 86.4%, while ViT16-384 reached the best performance, with an AUC of 0.941 and an accuracy of 87.4%. Hybrid human-AI frameworks further significantly enhanced the performance of CNN models; however, the improvement for ViT models was not statistically significant (P-value >0.05). Conclusions: DL models markedly outperform radiologist-only O-RADS v2022 assessment, and the integration of expert scores with AI yields the highest diagnostic accuracy and discrimination. Hybrid human-AI paradigms hold substantial potential to standardize pelvic ultrasound interpretation, reduce false positives, and improve detection of high-risk lesions.

[287] TinyChemVL: Advancing Chemical Vision-Language Models via Efficient Visual Token Reduction and Complex Reaction Tasks

Xuanle Zhao, Shuxin Zeng, Yinyuan Cai, Xiang Cheng, Duzhen Zhang, Xiuyi Chen, Bo Xu

Main category: cs.CV

TL;DR: TinyChemVL is an efficient 4B-parameter chemical vision-language model that uses visual token reduction and reaction-level tasks to improve efficiency and reasoning, outperforming larger models while using only 1/16th of visual tokens.

Details

Motivation: Current VLMs for chemical tasks are computationally inefficient and focus only on molecular-level tasks, missing critical visual information like molecular structures and limiting progress in chemical reasoning.

Method: Proposed TinyChemVL with visual token reduction to process chemical images efficiently, and introduced reaction-level tasks for improved reasoning. Also created ChemRxn-V benchmark for vision-based reaction recognition and prediction.

Result: TinyChemVL achieves superior performance on both molecular and reaction tasks with faster inference and training speeds, outperforming ChemVLM while using only 1/16th of visual tokens.

Conclusion: This work demonstrates that co-designing model architecture and task complexity enables building efficient yet powerful VLMs for chemical domains, advancing chemical reasoning capabilities.

Abstract: While Vision Language Models (VLMs) have demonstrated remarkable capabilities in general visual understanding, their application in the chemical domain has been limited, with previous works predominantly focusing on text and thus overlooking critical visual information, such as molecular structures. Current approaches that directly adopt standard VLMs for chemical tasks suffer from two primary issues: (i) computational inefficiency of processing entire chemical images with non-informative backgrounds. (ii) a narrow scope on molecular-level tasks that restricts progress in chemical reasoning. In this work, we propose \textbf{TinyChemVL}, an efficient and powerful chemical VLM that leverages visual token reduction and reaction-level tasks to improve model efficiency and reasoning capacity. Also, we propose \textbf{ChemRxn-V}, a reaction-level benchmark for assessing vision-based reaction recognition and prediction tasks. Directly predicting reaction products from molecular images poses a non-trivial challenge, as it requires models to integrate both recognition and reasoning capacities. Our results demonstrate that with only 4B parameters, TinyChemVL achieves superior performance on both molecular and reaction tasks while demonstrating faster inference and training speeds compared to existing models. Notably, TinyChemVL outperforms ChemVLM while utilizing only 1/16th of the visual tokens. This work builds efficient yet powerful VLMs for chemical domains by co-designing model architecture and task complexity.

[288] Learning-Based Vision Systems for Semi-Autonomous Forklift Operation in Industrial Warehouse Environments

Vamshika Sutar, Mahek Maheshwari, Archak Mittal

Main category: cs.CV

TL;DR: Vision-based framework using YOLOv8/YOLOv11 for pallet and pallet hole detection with hyperparameter optimization and spatial mapping, enabling cost-effective forklift automation.

Details

Motivation: Need for robust, low-cost perception systems in warehouse automation for forklifts and AGVs to enable intelligent material handling operations.

Method: Used YOLOv8 and YOLOv11 architectures with Optuna hyperparameter optimization, spatial post-processing, and innovative pallet hole mapping module for spatial representation.

Result: YOLOv8 achieved high detection accuracy, while optimized YOLOv11 offered superior precision and stable convergence on custom warehouse dataset.

Conclusion: Feasible cost-effective visual perception module for forklifts that advances warehouse automation for safer, economical, and intelligent logistics.

Abstract: The automation of material handling in warehouses increasingly relies on robust, low cost perception systems for forklifts and Automated Guided Vehicles (AGVs). This work presents a vision based framework for pallet and pallet hole detection and mapping using a single standard camera. We utilized YOLOv8 and YOLOv11 architectures, enhanced through Optuna driven hyperparameter optimization and spatial post processing. An innovative pallet hole mapping module converts the detections into actionable spatial representations, enabling accurate pallet and pallet hole association for forklift operation. Experiments on a custom dataset augmented with real warehouse imagery show that YOLOv8 achieves high pallet and pallet hole detection accuracy, while YOLOv11, particularly under optimized configurations, offers superior precision and stable convergence. The results demonstrate the feasibility of a cost effective, retrofittable visual perception module for forklifts. This study proposes a scalable approach to advancing warehouse automation, promoting safer, economical, and intelligent logistics operations.

[289] SFFR: Spatial-Frequency Feature Reconstruction for Multispectral Aerial Object Detection

Xin Zuo, Yuchen Qu, Haibo Zhan, Jifeng Shen, Wankou Yang

Main category: cs.CV

TL;DR: Proposes SFFR method using KAN networks for spatial-frequency feature reconstruction in multispectral object detection, with FCEKAN for frequency component exchange and MSGKAN for multi-scale spatial feature modeling.

Details

Motivation: Current multispectral object detection methods focus mainly on spatial-domain feature fusion, while frequency-domain feature potential remains underexplored.

Method: SFFR method with Frequency Component Exchange KAN (FCEKAN) module for selective frequency component exchange between RGB and IR images, and Multi-Scale Gaussian KAN (MSGKAN) module for nonlinear spatial feature modeling using multi-scale Gaussian basis functions.

Result: Extensive experiments on SeaDroneSee, DroneVehicle and DVTOD datasets demonstrate superior performance in UAV multispectral object perception tasks.

Conclusion: The proposed FCEKAN and MSGKAN modules are complementary and effectively capture frequency and spatial semantic features respectively for better feature fusion in multispectral object detection.

Abstract: Recent multispectral object detection methods have primarily focused on spatial-domain feature fusion based on CNNs or Transformers, while the potential of frequency-domain feature remains underexplored. In this work, we propose a novel Spatial and Frequency Feature Reconstruction method (SFFR) method, which leverages the spatial-frequency feature representation mechanisms of the Kolmogorov-Arnold Network (KAN) to reconstruct complementary representations in both spatial and frequency domains prior to feature fusion. The core components of SFFR are the proposed Frequency Component Exchange KAN (FCEKAN) module and Multi-Scale Gaussian KAN (MSGKAN) module. The FCEKAN introduces an innovative selective frequency component exchange strategy that effectively enhances the complementarity and consistency of cross-modal features based on the frequency feature of RGB and IR images. The MSGKAN module demonstrates excellent nonlinear feature modeling capability in the spatial domain. By leveraging multi-scale Gaussian basis functions, it effectively captures the feature variations caused by scale changes at different UAV flight altitudes, significantly enhancing the model’s adaptability and robustness to scale variations. It is experimentally validated that our proposed FCEKAN and MSGKAN modules are complementary and can effectively capture the frequency and spatial semantic features respectively for better feature fusion. Extensive experiments on the SeaDroneSee, DroneVehicle and DVTOD datasets demonstrate the superior performance and significant advantages of the proposed method in UAV multispectral object perception task. Code will be available at https://github.com/qchenyu1027/SFFR.

[290] Physics-Informed Deformable Gaussian Splatting: Towards Unified Constitutive Laws for Time-Evolving Material Field

Haoqin Hong, Ding Fan, Fubin Dou, Zhi-Li Zhou, Haoran Sun, Congcong Zhu, Jingrun Chen

Main category: cs.CV

TL;DR: PIDG integrates physics constraints into 3D Gaussian Splatting for dynamic scene reconstruction, treating Gaussians as Lagrangian particles with physics-driven motion and supervised by optical flow.

Details

Motivation: Pure data-driven 3DGS struggles to capture physics-driven motion patterns in dynamic scenes, creating a need for physics-informed approaches.

Method: Uses static-dynamic decoupled 4D hash encoding, imposes Cauchy momentum residual as physics constraint, predicts particle velocity and stress via time-evolving material field, and supervises with optical flow matching.

Result: Significant improvements in physical consistency and monocular dynamic reconstruction quality on custom physics-driven and standard datasets.

Conclusion: Physics-informed constraints enhance 3DGS for dynamic scenes, enabling better capture of physics-driven motion patterns and improved reconstruction quality.

Abstract: Recently, 3D Gaussian Splatting (3DGS), an explicit scene representation technique, has shown significant promise for dynamic novel-view synthesis from monocular video input. However, purely data-driven 3DGS often struggles to capture the diverse physics-driven motion patterns in dynamic scenes. To fill this gap, we propose Physics-Informed Deformable Gaussian Splatting (PIDG), which treats each Gaussian particle as a Lagrangian material point with time-varying constitutive parameters and is supervised by 2D optical flow via motion projection. Specifically, we adopt static-dynamic decoupled 4D decomposed hash encoding to reconstruct geometry and motion efficiently. Subsequently, we impose the Cauchy momentum residual as a physics constraint, enabling independent prediction of each particle’s velocity and constitutive stress via a time-evolving material field. Finally, we further supervise data fitting by matching Lagrangian particle flow to camera-compensated optical flow, which accelerates convergence and improves generalization. Experiments on a custom physics-driven dataset as well as on standard synthetic and real-world datasets demonstrate significant gains in physical consistency and monocular dynamic reconstruction quality.

[291] Adaptive 3D Reconstruction via Diffusion Priors and Forward Curvature-Matching Likelihood Updates

Seunghyeok Shin, Dabin Kim, Hongki Lim

Main category: cs.CV

TL;DR: The paper proposes Forward Curvature-Matching (FCM) update method integrated with diffusion sampling for high-fidelity point cloud reconstruction from images, addressing limitations of existing diffusion-based approaches.

Details

Motivation: Existing diffusion-model approaches for point cloud reconstruction suffer from inflexibility - they require conditioning signals during training, support only fixed input views, and need complete retraining for different measurements. Recent methods using likelihood updates rely on heuristic fixed step sizes that lead to slow convergence and suboptimal quality.

Method: The method integrates Forward Curvature-Matching (FCM) update with diffusion sampling, dynamically determining optimal step sizes using forward automatic differentiation and finite-difference curvature estimates for precise likelihood optimization.

Result: Experiments on ShapeNet and CO3D datasets show superior reconstruction quality at matched or lower NFEs, achieving higher F-score and lower CD and EMD metrics compared to existing methods.

Conclusion: FCM enables high-fidelity reconstruction from single-view and multi-view inputs, supports various input modalities through simple operator substitution without retraining, validating its efficiency and adaptability for practical applications.

Abstract: Reconstructing high-quality point clouds from images remains challenging in computer vision. Existing generative-model-based approaches, particularly diffusion-model approaches that directly learn the posterior, may suffer from inflexibility – they require conditioning signals during training, support only a fixed number of input views, and need complete retraining for different measurements. Recent diffusion-based methods have attempted to address this by combining prior models with likelihood updates, but they rely on heuristic fixed step sizes for the likelihood update that lead to slow convergence and suboptimal reconstruction quality. We advance this line of approach by integrating our novel Forward Curvature-Matching (FCM) update method with diffusion sampling. Our method dynamically determines optimal step sizes using only forward automatic differentiation and finite-difference curvature estimates, enabling precise optimization of the likelihood update. This formulation enables high-fidelity reconstruction from both single-view and multi-view inputs, and supports various input modalities through simple operator substitution – all without retraining. Experiments on ShapeNet and CO3D datasets demonstrate that our method achieves superior reconstruction quality at matched or lower NFEs, yielding higher F-score and lower CD and EMD, validating its efficiency and adaptability for practical applications. Code is available at https://github.com/Seunghyeok0715/FCM

[292] Seq2Seq Models Reconstruct Visual Jigsaw Puzzles without Seeing Them

Gur Elkn, Ofir Itzhak Shahar, Ohad Ben-Shahar

Main category: cs.CV

TL;DR: Language models can solve jigsaw puzzles without visual input by treating puzzle pieces as token sequences, achieving state-of-the-art performance.

Details

Motivation: To explore unconventional approaches to jigsaw puzzle solving by using language models instead of traditional vision-based methods, demonstrating cross-domain problem-solving capabilities.

Method: Convert puzzle pieces into discrete token sequences using a specialized tokenizer, then use encoder-decoder transformers to solve the puzzle as a sequence-to-sequence prediction task without visual input.

Result: Models achieved state-of-the-art results across multiple benchmarks, often outperforming vision-based methods despite being restricted from accessing visual input.

Conclusion: Language models have surprising capability to solve problems beyond their native domain, and unconventional approaches can inspire promising directions for puzzle-solving research.

Abstract: Jigsaw puzzles are primarily visual objects, whose algorithmic solutions have traditionally been framed from a visual perspective. In this work, however, we explore a fundamentally different approach: solving square jigsaw puzzles using language models, without access to raw visual input. By introducing a specialized tokenizer that converts each puzzle piece into a discrete sequence of tokens, we reframe puzzle reassembly as a sequence-to-sequence prediction task. Treated as “blind” solvers, encoder-decoder transformers accurately reconstruct the original layout by reasoning over token sequences alone. Despite being deliberately restricted from accessing visual input, our models achieve state-of-the-art results across multiple benchmarks, often outperforming vision-based methods. These findings highlight the surprising capability of language models to solve problems beyond their native domain, and suggest that unconventional approaches can inspire promising directions for puzzle-solving research.

[293] CINEMAE: Leveraging Frozen Masked Autoencoders for Cross-Generator AI Image Detection

Minsuk Jang, Hyeonseo Jeong, Minseok Son, Changick Kim

Main category: cs.CV

TL;DR: CINEMAE adapts text detection principles to images using Masked AutoEncoder reconstruction uncertainty to detect AI-generated images with strong cross-generator generalization.

Details

Motivation: Image-based AIGC detectors struggle with overfitting to generator-specific artifacts, unlike text detectors that use distributional inconsistencies for better generalization.

Method: Uses Masked AutoEncoder trained to reconstruct masked patches, computes conditional Negative Log-Likelihood to quantify local semantic anomalies, and aggregates patch-level statistics with global MAE features through learned fusion.

Result: Achieves over 95% accuracy on all eight unseen generators in GenImage benchmark when trained only on Stable Diffusion v1.4, substantially outperforming state-of-the-art detectors.

Conclusion: Context-conditional reconstruction uncertainty provides a robust, transferable signal for AIGC detection, enabling strong cross-generator generalization.

Abstract: While context-based detectors have achieved strong generalization for AI-generated text by measuring distributional inconsistencies, image-based detectors still struggle with overfitting to generator-specific artifacts. We introduce CINEMAE, a novel paradigm for AIGC image detection that adapts the core principles of text detection methods to the visual domain. Our key insight is that Masked AutoEncoder (MAE), trained to reconstruct masked patches conditioned on visible context, naturally encodes semantic consistency expectations. We formalize this reconstruction process probabilistically, computing conditional Negative Log-Likelihood (NLL, p(masked | visible)) to quantify local semantic anomalies. By aggregating these patch-level statistics with global MAE features through learned fusion, CINEMAE achieves strong cross-generator generalization. Trained exclusively on Stable Diffusion v1.4, our method achieves over 95% accuracy on all eight unseen generators in the GenImage benchmark, substantially outperforming state-of-the-art detectors. This demonstrates that context-conditional reconstruction uncertainty provides a robust, transferable signal for AIGC detection.

[294] Improving Multimodal Sentiment Analysis via Modality Optimization and Dynamic Primary Modality Selection

Dingkang Yang, Mingcheng Li, Xuecheng Wu, Zhaoyu Chen, Kaixun Jiang, Keliang Liu, Peng Zhai, Lihua Zhang

Main category: cs.CV

TL;DR: MODS framework improves multimodal sentiment analysis by dynamically selecting primary modalities and reducing acoustic/visual redundancy using graph-based compression and cross-attention mechanisms.

Details

Motivation: Existing MSA methods use fixed primary modality strategies that fail to adapt to dynamic modality importance variations across samples, and non-language modalities suffer from sequential redundancy and noise.

Method: Proposes MODS framework with: 1) Graph-based Dynamic Sequence Compressor (GDC) using capsule networks and graph convolution to reduce acoustic/visual redundancy, 2) sample-adaptive Primary Modality Selector (MSelector) for dynamic dominance determination, and 3) Primary-modality-Centric Cross-Attention (PCCA) module to enhance dominant modalities and cross-modal interaction.

Result: Extensive experiments on four benchmark datasets show MODS outperforms state-of-the-art methods, achieving superior performance by effectively balancing modality contributions and eliminating redundant noise.

Conclusion: The proposed MODS framework successfully addresses modality imbalance and redundancy issues in MSA through dynamic primary modality selection and optimization, demonstrating significant performance improvements over existing approaches.

Abstract: Multimodal Sentiment Analysis (MSA) aims to predict sentiment from language, acoustic, and visual data in videos. However, imbalanced unimodal performance often leads to suboptimal fused representations. Existing approaches typically adopt fixed primary modality strategies to maximize dominant modality advantages, yet fail to adapt to dynamic variations in modality importance across different samples. Moreover, non-language modalities suffer from sequential redundancy and noise, degrading model performance when they serve as primary inputs. To address these issues, this paper proposes a modality optimization and dynamic primary modality selection framework (MODS). First, a Graph-based Dynamic Sequence Compressor (GDC) is constructed, which employs capsule networks and graph convolution to reduce sequential redundancy in acoustic/visual modalities. Then, we develop a sample-adaptive Primary Modality Selector (MSelector) for dynamic dominance determination. Finally, a Primary-modality-Centric Cross-Attention (PCCA) module is designed to enhance dominant modalities while facilitating cross-modal interaction. Extensive experiments on four benchmark datasets demonstrate that MODS outperforms state-of-the-art methods, achieving superior performance by effectively balancing modality contributions and eliminating redundant noise.

[295] Label-Efficient 3D Forest Mapping: Self-Supervised and Transfer Learning for Individual, Structural, and Species Analysis

Aldino Rizaldy, Fabian Ewald Fassnacht, Ahmed Jamal Afifi, Hua Jiang, Richard Gloaguen, Pedram Ghamisi

Main category: cs.CV

TL;DR: A unified framework using self-supervised and transfer learning to extract individual tree information from 3D point clouds, reducing annotation dependency and improving performance for forestry applications.

Details

Motivation: To address the challenge of labor-intensive annotation for 3D point clouds in complex forests and enable scalable extraction of detailed tree-level information for precision forestry, biodiversity conservation, and carbon mapping.

Method: Combined self-supervised learning with domain adaptation for instance segmentation, used self-supervised learning for semantic segmentation, and implemented hierarchical transfer learning for tree species classification.

Result: Significant performance improvements: instance segmentation AP50 +16.98%, semantic segmentation mIoU +1.79%, and species classification Jaccard +6.07% for unseen species, with ~21% reduction in energy consumption.

Conclusion: The proposed unified framework effectively reduces annotation dependency while improving performance across multiple tree analysis tasks, providing an open-source solution for operational extraction of individual tree information from laser scanning data.

Abstract: Detailed structural and species information on individual tree level is increasingly important to support precision forestry, biodiversity conservation, and provide reference data for biomass and carbon mapping. Point clouds from airborne and ground-based laser scanning are currently the most suitable data source to rapidly derive such information at scale. Recent advancements in deep learning improved segmenting and classifying individual trees and identifying semantic tree components. However, deep learning models typically require large amounts of annotated training data which limits further improvement. Producing dense, high-quality annotations for 3D point clouds, especially in complex forests, is labor-intensive and challenging to scale. We explore strategies to reduce dependence on large annotated datasets using self-supervised and transfer learning architectures. Our objective is to improve performance across three tasks: instance segmentation, semantic segmentation, and tree classification using realistic and operational training sets. Our findings indicate that combining self-supervised learning with domain adaptation significantly enhances instance segmentation compared to training from scratch (AP50 +16.98%), self-supervised learning suffices for semantic segmentation (mIoU +1.79%), and hierarchical transfer learning enables accurate classification of unseen species (Jaccard +6.07%). To simplify use and encourage uptake, we integrated the tasks into a unified framework, streamlining the process from raw point clouds to tree delineation, structural analysis, and species classification. Pretrained models reduce energy consumption and carbon emissions by ~21%. This open-source contribution aims to accelerate operational extraction of individual tree information from laser scanning point clouds to support forestry, biodiversity, and carbon mapping.

[296] BuildingWorld: A Structured 3D Building Dataset for Urban Foundation Models

Shangfeng Huang, Ruisheng Wang, Xin Wang

Main category: cs.CV

TL;DR: BuildingWorld is a comprehensive 3D building dataset addressing architectural diversity gaps in urban modeling, featuring 5M LOD2 models from global regions with LiDAR data and evaluation metrics.

Details

Motivation: Current 3D urban models lack architectural diversity, limiting generalizability across heterogeneous environments for applications like energy modeling and autonomous navigation.

Method: Collected 5 million LOD2 building models from diverse global regions (North America, Europe, Asia, Africa, Oceania) with real and simulated LiDAR point clouds, plus Cyber City for generating unlimited training data.

Result: Created a globally representative dataset enabling comprehensive research on 3D building reconstruction, detection, and segmentation with standardized evaluation metrics.

Conclusion: BuildingWorld bridges the diversity gap in 3D urban modeling, supporting foundation model development and structured urban environment analysis.

Abstract: As digital twins become central to the transformation of modern cities, accurate and structured 3D building models emerge as a key enabler of high-fidelity, updatable urban representations. These models underpin diverse applications including energy modeling, urban planning, autonomous navigation, and real-time reasoning. Despite recent advances in 3D urban modeling, most learning-based models are trained on building datasets with limited architectural diversity, which significantly undermines their generalizability across heterogeneous urban environments. To address this limitation, we present BuildingWorld, a comprehensive and structured 3D building dataset designed to bridge the gap in stylistic diversity. It encompasses buildings from geographically and architecturally diverse regions – including North America, Europe, Asia, Africa, and Oceania – offering a globally representative dataset for urban-scale foundation modeling and analysis. Specifically, BuildingWorld provides about five million LOD2 building models collected from diverse sources, accompanied by real and simulated airborne LiDAR point clouds. This enables comprehensive research on 3D building reconstruction, detection and segmentation. Cyber City, a virtual city model, is introduced to enable the generation of unlimited training data with customized and structurally diverse point cloud distributions. Furthermore, we provide standardized evaluation metrics tailored for building reconstruction, aiming to facilitate the training, evaluation, and comparison of large-scale vision models and foundation models in structured 3D urban environments.

[297] GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding

Athul M. Mathew, Haithem Hermassi, Thariq Khalid, Arshad Ali Khan, Riad Souissi

Main category: cs.CV

TL;DR: GazeVLM is a novel Vision-Language Model that unifies person detection, gaze target detection, and gaze object identification into a single framework using both visual and language prompts.

Details

Motivation: Prior research has modeled gaze cues but lacks a unified system for gaze understanding using both visual and language modalities. There's a need for a comprehensive framework that can handle multiple gaze-related tasks simultaneously.

Method: GazeVLM integrates visual (RGB and depth) and textual modalities, using a fusion of RGB images with HHA-encoded depth maps guided by text prompts. It allows selective execution of gaze understanding tasks including person detection, gaze target detection, and gaze object identification.

Result: The model achieves state-of-the-art evaluation scores on GazeFollow and VideoAttentionTarget datasets. The ablation study showed that RGB+HHA depth map fusion with text prompts yields superior performance. A new object-level gaze detection metric ($AP_{ob}$) was introduced for gaze object identification.

Conclusion: GazeVLM represents the first application of a Vision-Language Model to unified gaze understanding tasks, demonstrating significant improvements over existing methods and establishing a new benchmark for multi-task gaze analysis.

Abstract: Gaze understanding unifies the detection of people, their gaze targets, and objects of interest into a single framework, offering critical insight into visual attention and intent estimation. Although prior research has modelled gaze cues in visual scenes, a unified system is still needed for gaze understanding using both visual and language prompts. This paper introduces GazeVLM, a novel Vision-Language Model (VLM) for multi-task gaze understanding in images, addressing person detection, gaze target detection, and gaze object identification. While other transformer-based methods exist for gaze analysis, GazeVLM represents, to our knowledge, the first application of a VLM to these combined tasks, allowing for selective execution of each task. Through the integration of visual (RGB and depth) and textual modalities, our ablation study on visual input combinations revealed that a fusion of RGB images with HHA-encoded depth maps, guided by text prompts, yields superior performance. We also introduce an object-level gaze detection metric for gaze object identification ($AP_{ob}$). Through experiments, GazeVLM demonstrates significant improvements, notably achieving state-of-the-art evaluation scores on GazeFollow and VideoAttentionTarget datasets.

[298] HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment

Ruijia Wu, Ping Chen, Fei Shen, Shaoan Zhao, Qiang Hui, Huanlin Gao, Ting Lu, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian

Main category: cs.CV

TL;DR: HiMo-CLIP enhances CLIP-style models by addressing limitations in handling complex, compositional text through hierarchical decomposition and monotonicity-aware contrastive learning, improving image-text retrieval performance.

Details

Motivation: Current contrastive vision-language models like CLIP treat text as flat sequences, failing to capture semantic hierarchy and monotonicity - where richer descriptions should have stronger visual alignment.

Method: Proposes HiMo-CLIP with two components: hierarchical decomposition (HiDe) module using in-batch PCA to extract semantic components, and monotonicity-aware contrastive loss (MoLo) that aligns global and component-level representations.

Result: Experiments show HiMo-CLIP consistently outperforms strong baselines on multiple image-text retrieval benchmarks, especially with long or compositional descriptions.

Conclusion: HiMo-CLIP successfully enhances CLIP-style models by incorporating semantic hierarchy and monotonicity, producing more structured and cognitively-aligned cross-modal representations without modifying encoder architecture.

Abstract: Contrastive vision-language models like CLIP have achieved impressive results in image-text retrieval by aligning image and text representations in a shared embedding space. However, these models often treat text as flat sequences, limiting their ability to handle complex, compositional, and long-form descriptions. In particular, they fail to capture two essential properties of language: semantic hierarchy, which reflects the multi-level compositional structure of text, and semantic monotonicity, where richer descriptions should result in stronger alignment with visual content.To address these limitations, we propose HiMo-CLIP, a representation-level framework that enhances CLIP-style models without modifying the encoder architecture. HiMo-CLIP introduces two key components: a hierarchical decomposition (HiDe) module that extracts latent semantic components from long-form text via in-batch PCA, enabling flexible, batch-aware alignment across different semantic granularities, and a monotonicity-aware contrastive loss (MoLo) that jointly aligns global and component-level representations, encouraging the model to internalize semantic ordering and alignment strength as a function of textual completeness.These components work in concert to produce structured, cognitively-aligned cross-modal representations. Experiments on multiple image-text retrieval benchmarks show that HiMo-CLIP consistently outperforms strong baselines, particularly under long or compositional descriptions. The code is available at https://github.com/UnicomAI/HiMo-CLIP.

[299] AesTest: Measuring Aesthetic Intelligence from Perception to Production

Guolong Wang, Heng Huang, Zhiqiang Zhang, Wentian Li, Feilong Ma, Xin Jin

Main category: cs.CV

TL;DR: AesTest is a new benchmark for evaluating multimodal LLMs’ aesthetic perception and production capabilities across 10 tasks, addressing gaps in existing image aesthetic assessment benchmarks.

Details

Motivation: Existing benchmarks for image aesthetic assessment are narrow in scope and lack diversity needed to evaluate systematic aesthetic production in multimodal LLMs.

Method: Created AesTest benchmark with curated multiple-choice questions spanning 10 tasks covering perception, appreciation, creation, and photography, grounded in psychological theories and integrating diverse data sources including professional workflows and crowdsourced preferences.

Result: Evaluation of both instruction-tuned IAA MLLMs and general MLLMs on AesTest revealed significant challenges in building aesthetic intelligence.

Conclusion: AesTest will be publicly released to support future research in multimodal aesthetic intelligence, highlighting the need for improved aesthetic capabilities in MLLMs.

Abstract: Perceiving and producing aesthetic judgments is a fundamental yet underexplored capability for multimodal large language models (MLLMs). However, existing benchmarks for image aesthetic assessment (IAA) are narrow in perception scope or lack the diversity needed to evaluate systematic aesthetic production. To address this gap, we introduce AesTest, a comprehensive benchmark for multimodal aesthetic perception and production, distinguished by the following features: 1) It consists of curated multiple-choice questions spanning ten tasks, covering perception, appreciation, creation, and photography. These tasks are grounded in psychological theories of generative learning. 2) It integrates data from diverse sources, including professional editing workflows, photographic composition tutorials, and crowdsourced preferences. It ensures coverage of both expert-level principles and real-world variation. 3) It supports various aesthetic query types, such as attribute-based analysis, emotional resonance, compositional choice, and stylistic reasoning. We evaluate both instruction-tuned IAA MLLMs and general MLLMs on AesTest, revealing significant challenges in building aesthetic intelligence. We will publicly release AesTest to support future research in this area.

[300] V-Shuffle: Zero-Shot Style Transfer via Value Shuffle

Haojun Tang, Qiwei Lin, Tongda Xu, Lida Huang, Yan Wang

Main category: cs.CV

TL;DR: V-Shuffle is a zero-shot style transfer method that uses multiple style images to prevent content leakage while maintaining style fidelity through value feature shuffling and hybrid style regularization.

Details

Motivation: Existing attention injection-based style transfer methods suffer from content leakage, where undesired semantic content from style images appears in stylized outputs.

Method: V-Shuffle shuffles value features within self-attention layers of diffusion models to disrupt semantic content while preserving low-level style representations, complemented by Hybrid Style Regularization for high-level style textures.

Result: V-Shuffle achieves excellent performance with multiple style images and outperforms previous state-of-the-art methods when applied to a single style image.

Conclusion: V-Shuffle effectively navigates the trade-off between content preservation and style fidelity in zero-shot style transfer by leveraging multiple style images and innovative feature manipulation techniques.

Abstract: Attention injection-based style transfer has achieved remarkable progress in recent years. However, existing methods often suffer from content leakage, where the undesired semantic content of the style image mistakenly appears in the stylized output. In this paper, we propose V-Shuffle, a zero-shot style transfer method that leverages multiple style images from the same style domain to effectively navigate the trade-off between content preservation and style fidelity. V-Shuffle implicitly disrupts the semantic content of the style images by shuffling the value features within the self-attention layers of the diffusion model, thereby preserving low-level style representations. We further introduce a Hybrid Style Regularization that complements these low-level representations with high-level style textures to enhance style fidelity. Empirical results demonstrate that V-Shuffle achieves excellent performance when utilizing multiple style images. Moreover, when applied to a single style image, V-Shuffle outperforms previous state-of-the-art methods.

[301] Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View

Jianyu Qi, Ding Zou, Wenrui Yan, Rui Ma, Jiaxu Li, Zhijie Zheng, Zhiguo Yang, Rongchang Zhao

Main category: cs.CV

TL;DR: Proposes difficulty-aware sampling strategies (PISM and CMAB) for multimodal reasoning, showing GRPO-only training on difficulty-stratified samples outperforms conventional SFT+GRPO pipelines.

Details

Motivation: Existing post-training paradigms for multimodal reasoning neglect quantifiable difficulty metrics and fail to jointly optimize perception and reasoning capabilities.

Method: Two difficulty-aware sampling strategies: Progressive Image Semantic Masking (PISM) for image degradation-based hardness assessment, and Cross-Modality Attention Balance (CMAB) for attention distribution analysis. Hierarchical training framework with GRPO-only and SFT+GRPO hybrid paradigms.

Result: Experiments across six benchmark datasets show GRPO applied to difficulty-stratified samples consistently outperforms conventional SFT+GRPO pipelines, improving model accuracy without supervised fine-tuning.

Conclusion: Strategic data sampling with difficulty-aware metrics can eliminate the need for supervised fine-tuning while enhancing multimodal reasoning performance.

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have spurred significant progress in Chain-of-Thought (CoT) reasoning. Building on the success of Deepseek-R1, researchers extended multimodal reasoning to post-training paradigms based on reinforcement learning (RL), focusing predominantly on mathematical datasets. However, existing post-training paradigms tend to neglect two critical aspects: (1) The lack of quantifiable difficulty metrics capable of strategically screening samples for post-training optimization. (2) Suboptimal post-training paradigms that fail to jointly optimize perception and reasoning capabilities. To address this gap, we propose two novel difficulty-aware sampling strategies: Progressive Image Semantic Masking (PISM) quantifies sample hardness through systematic image degradation, while Cross-Modality Attention Balance (CMAB) assesses cross-modal interaction complexity via attention distribution analysis. Leveraging these metrics, we design a hierarchical training framework that incorporates both GRPO-only and SFT+GRPO hybrid training paradigms, and evaluate them across six benchmark datasets. Experiments demonstrate consistent superiority of GRPO applied to difficulty-stratified samples compared to conventional SFT+GRPO pipelines, indicating that strategic data sampling can obviate the need for supervised fine-tuning while improving model accuracy. Our code will be released at https://github.com/qijianyu277/DifficultySampling.

[302] InfoAffect: A Dataset for Affective Analysis of Infographics

Zihang Fu, Yunchao Wang, Chenyu Huang, Guodao Sun, Ronghua Liang

Main category: cs.CV

TL;DR: Created InfoAffect dataset with 3.5k affect-annotated infographics from 6 domains, validated by MLLM analysis and user studies showing high accuracy (CACI=0.986).

Details

Motivation: Infographics' affective dimensions are underexplored due to data scarcity, limiting understanding of how they emotionally impact viewers.

Method: Collected data from 6 domains, preprocessed with quality control, constructed affect table for annotation, used 5 MLLMs with RRF fusion for robust affect analysis.

Result: Achieved high accuracy with Composite Affect Consistency Index of 0.986, indicating reliable affect annotations in the dataset.

Conclusion: InfoAffect dataset successfully addresses the data scarcity problem and provides a reliable resource for studying affective dimensions of infographics.

Abstract: Infographics are widely used to convey complex information, yet their affective dimensions remain underexplored due to the scarcity of data resources. We introduce a 3.5k-sample affect-annotated InfoAffect dataset, which combines textual content with real-world infographics. We first collect the raw data from six domains and aligned them via preprocessing, the accompanied-text-priority method, and three strategies to guarantee the quality and compliance. After that we construct an affect table and use it to constrain annotation. Five state-of-the-art multimodal large language models (MLLMs) then analyze both modalities, and their outputs are fused with Reciprocal Rank Fusion (RRF) algorithm to yield robust affects and confidences. We conducted a user study with two experiments to validate usability and assess InfoAffect dataset using the Composite Affect Consistency Index (CACI), achieving an overall score of 0.986, which indicates high accuracy.

[303] On Modality Incomplete Infrared-Visible Object Detection: An Architecture Compatibility Perspective

Shuo Yang, Yinghui Xing, Shizhou Zhang, Zhilong Niu

Main category: cs.CV

TL;DR: Scarf-DETR is a plug-and-play module for DETR variants that enables robust infrared and visible object detection under modality-incomplete scenarios through modality-agnostic deformable attention and pseudo modality dropout training.

Details

Motivation: Current IVOD models suffer performance declines when faced with incomplete modality data, especially when dominant modalities are missing, limiting their practical deployment in around-the-clock applications.

Method: Proposes Scarf Neck module with modality-agnostic deformable attention, uses pseudo modality dropout strategy during training, and introduces comprehensive benchmark for modality-incomplete scenarios.

Result: Scarf-DETR performs excellently in missing modality scenarios and achieves superior performance on standard IVOD modality complete benchmarks.

Conclusion: The proposed approach enables flexible adaptation to any single or double modalities during training and inference, making IVOD detectors more robust and practical for real-world applications.

Abstract: Infrared and visible object detection (IVOD) is essential for numerous around-the-clock applications. Despite notable advancements, current IVOD models exhibit notable performance declines when confronted with incomplete modality data, particularly if the dominant modality is missing. In this paper, we take a thorough investigation on modality incomplete IVOD problem from an architecture compatibility perspective. Specifically, we propose a plug-and-play Scarf Neck module for DETR variants, which introduces a modality-agnostic deformable attention mechanism to enable the IVOD detector to flexibly adapt to any single or double modalities during training and inference. When training Scarf-DETR, we design a pseudo modality dropout strategy to fully utilize the multi-modality information, making the detector compatible and robust to both working modes of single and double modalities. Moreover, we introduce a comprehensive benchmark for the modality-incomplete IVOD task aimed at thoroughly assessing situations where the absent modality is either dominant or secondary. Our proposed Scarf-DETR not only performs excellently in missing modality scenarios but also achieves superior performances on the standard IVOD modality complete benchmarks. Our code will be available at https://github.com/YinghuiXing/Scarf-DETR.

[304] VDNeRF: Vision-only Dynamic Neural Radiance Field for Urban Scenes

Zhengyu Zou, Jingfeng Li, Hao Li, Xiaolei Hou, Jinwen Hu, Jingkun Chen, Lechao Cheng, Dingwen Zhang

Main category: cs.CV

TL;DR: VDNeRF is a vision-only method that jointly recovers camera trajectories and learns spatiotemporal representations for dynamic urban scenes without requiring camera pose information or expensive sensors.

Details

Motivation: Existing NeRF methods struggle with dynamic environments and accurate camera pose estimation in applications like autonomous driving and robotic perception.

Method: Uses two separate NeRF models: static NeRF optimizes camera poses and background, while dynamic NeRF incorporates 3D scene flow for dynamic objects. Includes training framework to address camera-object motion ambiguity.

Result: Outperforms state-of-the-art NeRF-based pose-free methods in camera pose estimation and dynamic novel view synthesis on urban driving datasets.

Conclusion: VDNeRF enables robust camera pose estimation and self-supervised decomposition of static/dynamic elements without external pose information.

Abstract: Neural Radiance Fields (NeRFs) implicitly model continuous three-dimensional scenes using a set of images with known camera poses, enabling the rendering of photorealistic novel views. However, existing NeRF-based methods encounter challenges in applications such as autonomous driving and robotic perception, primarily due to the difficulty of capturing accurate camera poses and limitations in handling large-scale dynamic environments. To address these issues, we propose Vision-only Dynamic NeRF (VDNeRF), a method that accurately recovers camera trajectories and learns spatiotemporal representations for dynamic urban scenes without requiring additional camera pose information or expensive sensor data. VDNeRF employs two separate NeRF models to jointly reconstruct the scene. The static NeRF model optimizes camera poses and static background, while the dynamic NeRF model incorporates the 3D scene flow to ensure accurate and consistent reconstruction of dynamic objects. To address the ambiguity between camera motion and independent object motion, we design an effective and powerful training framework to achieve robust camera pose estimation and self-supervised decomposition of static and dynamic elements in a scene. Extensive evaluations on mainstream urban driving datasets demonstrate that VDNeRF surpasses state-of-the-art NeRF-based pose-free methods in both camera pose estimation and dynamic novel view synthesis.

[305] DiffusionUavLoc: Visually Prompted Diffusion for Cross-View UAV Localization

Tao Liu, Kan Ren, Qian Chen

Main category: cs.CV

TL;DR: DiffusionUavLoc is a cross-view UAV localization framework that uses diffusion models and VAE for unified representation, achieving competitive performance on benchmark datasets without requiring text prompts or extensive annotations.

Details

Motivation: Address the limitations of GNSS-dependent localization in denied environments and overcome geometric/appearance gaps between UAV and satellite views, while avoiding complex architectures and annotation requirements.

Method: Uses training-free geometric rendering to create pseudo-satellite images as structural prompts, then employs a text-free conditional diffusion model with VAE to fuse multimodal structural cues for robust feature learning.

Result: Achieves competitive cross-view localization performance on University-1652 and SUES-200 datasets, particularly excelling in satellite-to-drone localization on University-1652.

Conclusion: The proposed framework provides an effective solution for UAV localization in GNSS-denied environments by leveraging diffusion models and unified representations without text dependencies.

Abstract: With the rapid growth of the low-altitude economy, unmanned aerial vehicles (UAVs) have become key platforms for measurement and tracking in intelligent patrol systems. However, in GNSS-denied environments, localization schemes that rely solely on satellite signals are prone to failure. Cross-view image retrieval-based localization is a promising alternative, yet substantial geometric and appearance domain gaps exist between oblique UAV views and nadir satellite orthophotos. Moreover, conventional approaches often depend on complex network architectures, text prompts, or large amounts of annotation, which hinders generalization. To address these issues, we propose DiffusionUavLoc, a cross-view localization framework that is image-prompted, text-free, diffusion-centric, and employs a VAE for unified representation. We first use training-free geometric rendering to synthesize pseudo-satellite images from UAV imagery as structural prompts. We then design a text-free conditional diffusion model that fuses multimodal structural cues to learn features robust to viewpoint changes. At inference, descriptors are computed at a fixed time step t and compared using cosine similarity. On University-1652 and SUES-200, the method performs competitively for cross-view localization, especially for satellite-to-drone in University-1652.Our data and code will be published at the following URL: https://github.com/liutao23/DiffusionUavLoc.git.

[306] SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards

Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark

Main category: cs.CV

TL;DR: SpatialThinker is a 3D-aware MLLM that uses RL with dense spatial rewards to improve spatial understanding, outperforming supervised fine-tuning and GPT-4o on spatial VQA tasks.

Details

Motivation: Current MLLMs struggle with spatial understanding despite progress in vision-language tasks, and existing approaches rely on explicit 3D inputs or large datasets.

Method: Uses RL with multi-objective dense spatial rewards to integrate structured spatial grounding with multi-step reasoning, simulating human-like spatial perception via scene graphs.

Result: SpatialThinker-7B outperforms supervised fine-tuning and sparse RL baselines, nearly doubling base-model gains compared to sparse RL and surpassing GPT-4o on spatial understanding benchmarks.

Conclusion: Combining spatial supervision with reward-aligned reasoning enables robust 3D spatial understanding with limited data, advancing MLLMs toward human-level visual reasoning.

Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning.

[307] Diagnose Like A REAL Pathologist: An Uncertainty-Focused Approach for Trustworthy Multi-Resolution Multiple Instance Learning

Sungrae Hong, Sol Lee, Jisu Shin, Mun Yong Yi

Main category: cs.CV

TL;DR: UFC-MIL is a multiple instance learning method that provides calibrated diagnostic predictions using multi-resolution images, mimicking pathologists’ examination behaviors while incorporating uncertainty estimation.

Details

Motivation: Current multiple-resolution MIL approaches focus only on performance improvement but lack well-calibrated predictions needed for trustworthy clinical diagnostics. There's a need for methods that pathologists can rely on for accurate uncertainty estimation.

Method: Proposes UFC-MIL with a novel patch-wise loss that learns latent patterns and expresses uncertainty for classification. Uses attention-based architecture with neighbor patch aggregation module, and calibrates aggregated predictions through patch-level uncertainty without requiring multiple iterative inferences.

Result: UFC-MIL shows superior performance in model calibration while achieving classification accuracy comparable to state-of-the-art methods on challenging public datasets.

Conclusion: UFC-MIL successfully provides calibrated diagnostic predictions that mimic pathologists’ examination behaviors, offering trustworthy AI diagnostic aid with practical advantages for clinical applications.

Abstract: With the increasing demand for histopathological specimen examination and diagnostic reporting, Multiple Instance Learning (MIL) has received heightened research focus as a viable solution for AI-centric diagnostic aid. Recently, to improve its performance and make it work more like a pathologist, several MIL approaches based on the use of multiple-resolution images have been proposed, delivering often higher performance than those that use single-resolution images. Despite impressive recent developments of multiple-resolution MIL, previous approaches only focus on improving performance, thereby lacking research on well-calibrated MIL that clinical experts can rely on for trustworthy diagnostic results. In this study, we propose Uncertainty-Focused Calibrated MIL (UFC-MIL), which more closely mimics the pathologists’ examination behaviors while providing calibrated diagnostic predictions, using multiple images with different resolutions. UFC-MIL includes a novel patch-wise loss that learns the latent patterns of instances and expresses their uncertainty for classification. Also, the attention-based architecture with a neighbor patch aggregation module collects features for the classifier. In addition, aggregated predictions are calibrated through patch-level uncertainty without requiring multiple iterative inferences, which is a key practical advantage. Against challenging public datasets, UFC-MIL shows superior performance in model calibration while achieving classification accuracy comparable to that of state-of-the-art methods.

Seulgi Kim, Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib

Main category: cs.CV

TL;DR: Proposes Rank-enhancing Token Fuser to address feature and modality collapse in multi-modal fusion using effective rank as a unifying measure, validated on action anticipation tasks.

Details

Motivation: Multi-modal fusion suffers from feature collapse (loss of discriminative power) and modality collapse (dominant modality overwhelms others), hindering applications like human action anticipation that require fusing diverse sensor data.

Method: Uses effective rank to quantify both collapses; proposes Rank-enhancing Token Fuser that selectively blends less informative features from one modality with complementary features from another; evaluates modality combinations that mutually increase effective rank.

Result: Depth maintains representational balance when fused with RGB, avoiding modality collapse; R3D framework significantly outperforms prior state-of-the-art methods by up to 3.74% on NTURGBD, UTKinect, and DARai datasets.

Conclusion: Effective rank serves as a unifying framework to address both feature and modality collapse simultaneously; depth-RGB fusion provides balanced representation; the approach achieves state-of-the-art performance in action anticipation tasks.

Abstract: Multi-modal fusion methods often suffer from two types of representation collapse: feature collapse where individual dimensions lose their discriminative power (as measured by eigenspectra), and modality collapse where one dominant modality overwhelms the other. Applications like human action anticipation that require fusing multifarious sensor data are hindered by both feature and modality collapse. However, existing methods attempt to counter feature collapse and modality collapse separately. This is because there is no unifying framework that efficiently addresses feature and modality collapse in conjunction. In this paper, we posit the utility of effective rank as an informative measure that can be utilized to quantify and counter both the representation collapses. We propose \textit{Rank-enhancing Token Fuser}, a theoretically grounded fusion framework that selectively blends less informative features from one modality with complementary features from another modality. We show that our method increases the effective rank of the fused representation. To address modality collapse, we evaluate modality combinations that mutually increase each others’ effective rank. We show that depth maintains representational balance when fused with RGB, avoiding modality collapse. We validate our method on action anticipation, where we present \texttt{R3D}, a depth-informed fusion framework. Extensive experiments on NTURGBD, UTKinect, and DARai demonstrate that our approach significantly outperforms prior state-of-the-art methods by up to 3.74%. Our code is available at: \href{https://github.com/olivesgatech/R3D}{https://github.com/olivesgatech/R3D}.

Huili Huang, Chengeng Liu, Danrong Zhang, Shail Patel, Anastasiya Masalava, Sagar Sadak, Parisa Babolhavaeji, WeiHong Low, Max Mahdi Roozbahani, J. David Frost

Main category: cs.CV

TL;DR: EIDSeg is the first large-scale semantic segmentation dataset for post-earthquake social media imagery, enabling fine-grained damage assessment using ground-level photos from social networks.

Details

Motivation: Existing remote sensing methods for post-earthquake damage assessment rely on costly aerial images, expert labeling, and produce only binary damage maps, creating a gap that ground-level social media images could fill.

Method: Created EIDSeg dataset with 3,266 images from 9 major earthquakes (2008-2023), annotated across 5 damage classes using a three-phase cross-disciplinary annotation protocol with non-expert annotators achieving over 70% inter-annotator agreement.

Result: Benchmarked state-of-the-art segmentation models, with Encoder-only Mask Transformer (EoMT) achieving the best performance with 80.8% mIoU.

Conclusion: The work enables faster, finer-grained damage assessment by leveraging social networks’ ground-level perspective for post-earthquake scenarios.

Abstract: Rapid post-earthquake damage assessment is crucial for rescue and resource planning. Still, existing remote sensing methods depend on costly aerial images, expert labeling, and produce only binary damage maps for early-stage evaluation. Although ground-level images from social networks provide a valuable source to fill this gap, a large pixel-level annotated dataset for this task is still unavailable. We introduce EIDSeg, the first large-scale semantic segmentation dataset specifically for post-earthquake social media imagery. The dataset comprises 3,266 images from nine major earthquakes (2008-2023), annotated across five classes of infrastructure damage: Undamaged Building, Damaged Building, Destroyed Building, Undamaged Road, and Damaged Road. We propose a practical three-phase cross-disciplinary annotation protocol with labeling guidelines that enables consistent segmentation by non-expert annotators, achieving over 70% inter-annotator agreement. We benchmark several state-of-the-art segmentation models, identifying Encoder-only Mask Transformer (EoMT) as the top-performing method with a Mean Intersection over Union (mIoU) of 80.8%. By unlocking social networks’ rich ground-level perspective, our work paves the way for a faster, finer-grained damage assessment in the post-earthquake scenario.

[310] Inpaint360GS: Efficient Object-Aware 3D Inpainting via Gaussian Splatting for 360° Scenes

Shaoxiang Wang, Shihong Zhang, Christen Millerdurai, Rüdiger Westermann, Didier Stricker, Alain Pagani

Main category: cs.CV

TL;DR: Inpaint360GS is a 3D Gaussian Splatting-based framework for multi-object removal and inpainting in 360° scenes, addressing challenges in object identification, occlusion handling, and view consistency.

Details

Motivation: Current methods struggle with 360° scene inpainting due to difficulties in identifying target objects in complex environments, handling severe occlusions in multi-object scenes, and maintaining consistent appearance across views.

Method: Proposes Inpaint360GS framework using 3D Gaussian Splatting, distills 2D segmentation into 3D, leverages virtual camera views for contextual guidance, and introduces a new dataset for 360° inpainting.

Result: Outperforms existing baselines and achieves state-of-the-art performance in 360° scene inpainting.

Conclusion: Inpaint360GS provides an effective solution for flexible 360° editing with multi-object removal and high-fidelity inpainting in 3D space.

Abstract: Despite recent advances in single-object front-facing inpainting using NeRF and 3D Gaussian Splatting (3DGS), inpainting in complex 360° scenes remains largely underexplored. This is primarily due to three key challenges: (i) identifying target objects in the 3D field of 360° environments, (ii) dealing with severe occlusions in multi-object scenes, which makes it hard to define regions to inpaint, and (iii) maintaining consistent and high-quality appearance across views effectively. To tackle these challenges, we propose Inpaint360GS, a flexible 360° editing framework based on 3DGS that supports multi-object removal and high-fidelity inpainting in 3D space. By distilling 2D segmentation into 3D and leveraging virtual camera views for contextual guidance, our method enables accurate object-level editing and consistent scene completion. We further introduce a new dataset tailored for 360° inpainting, addressing the lack of ground truth object-free scenes. Experiments demonstrate that Inpaint360GS outperforms existing baselines and achieves state-of-the-art performance. Project page: https://dfki-av.github.io/inpaint360gs/

[311] NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models

Kyuho Lee, Euntae Kim, Jinwoo Choi, Buru Chang

Main category: cs.CV

TL;DR: NOAH is a benchmark for evaluating narrative prior-induced hallucinations and omissions in Video LLMs, created by inserting clips from other videos to test how models prioritize storyline consistency over visual evidence.

Details

Motivation: Video LLMs often prioritize narrative coherence over strict visual grounding, leading to hallucinations (introducing non-existent events) and omissions (suppressing factual events), which the authors aim to systematically evaluate.

Method: Created NOAH benchmark with composite videos by inserting clips from other sources, varying semantic similarity and insertion position. Includes captioning task with tailored metrics and three QA tasks (Existence, Temporal, Narrative) with over 60K evaluation samples.

Result: Most Video LLMs exhibit narrative prior-induced hallucinations and omissions; error patterns vary by architecture and depend on event similarity/position; reliance on narrative priors intensifies with fewer frames, amplifying errors when event continuity is weak.

Conclusion: NOAH provides the first standardized evaluation of narrative prior-induced hallucination and omission in Video LLMs, establishing a foundation for developing more reliable and trustworthy video understanding models.

Abstract: Video large language models (Video LLMs) have recently achieved strong performance on tasks such as captioning, summarization, and question answering. Many models and training methods explicitly encourage continuity across events to enhance narrative coherence. While this improves fluency, it also introduces an inductive bias that prioritizes storyline consistency over strict grounding in visual evidence. We identify this bias, which we call narrative prior, as a key driver of two errors: hallucinations, where non-existent events are introduced or existing ones are misinterpreted, and omissions, where factual events are suppressed because they are misaligned with surrounding context. To systematically evaluate narrative prior-induced errors, we introduce NOAH, a large-scale benchmark that constructs composite videos by inserting clips from other sources into target videos. By varying semantic similarity and insertion position, our benchmark enables controlled and scalable analysis of narrative priors. We design one captioning task with tailored metrics and three QA tasks - Existence, Temporal, and Narrative - yielding more than 60K evaluation samples. Extensive experiments yield three key findings: (i) most Video LLMs exhibit hallucinations and omissions driven by narrative priors, (ii) the patterns of these errors vary across architectures and depend on event similarity and insertion position, and (iii) reliance on narrative priors intensifies under sampling with fewer frames, amplifying errors when event continuity is weak. We establish NOAH as the first standardized evaluation of narrative prior-induced hallucination and omission in Video LLMs, providing a foundation for developing more reliable and trustworthy models. Our benchmark and code are available at https://anonymous550520.github.io/.

[312] Zooming into Comics: Region-Aware RL Improves Fine-Grained Comic Understanding in Vision-Language Models

Yule Chen, Yufan Ren, Sabine Süsstrunk

Main category: cs.CV

TL;DR: AI4VA-FG is the first fine-grained benchmark for VLM-based comic understanding, revealing significant performance gaps in current models. The paper proposes Region-Aware Reinforcement Learning (RARL) to improve VLM capabilities in comics by training models to dynamically attend to relevant regions.

Details

Motivation: Vision-Language Models struggle with complex visual narratives like comics due to stylized line art, onomatopoeia, and dense multi-panel layouts, creating a gap in their understanding capabilities.

Method: Created AI4VA-FG benchmark spanning recognition, detection, character reasoning, and narrative construction. Evaluated state-of-the-art models and investigated post-training strategies including SFT-S, SFT-R, RL, and proposed RARL for dynamic region attention through zoom-in operations.

Result: Revealed substantial performance deficits across core comic understanding tasks in both proprietary and open-source models. RL and RARL applied to Qwen2.5-VL yielded significant gains in entity recognition and storyline ordering.

Conclusion: Comic understanding remains an unsolved challenge, but RL and RARL approaches show promise for enhancing VLM capabilities in this domain, paving the way for more accurate comic analysis applications.

Abstract: Complex visual narratives, such as comics, present a significant challenge to Vision-Language Models (VLMs). Despite excelling on natural images, VLMs often struggle with stylized line art, onomatopoeia, and densely packed multi-panel layouts. To address this gap, we introduce AI4VA-FG, the first fine-grained and comprehensive benchmark for VLM-based comic understanding. It spans tasks from foundational recognition and detection to high-level character reasoning and narrative construction, supported by dense annotations for characters, poses, and depth. Beyond that, we evaluate state-of-the-art proprietary models, including GPT-4o and Gemini-2.5, and open-source models such as Qwen2.5-VL, revealing substantial performance deficits across core tasks of our benchmarks and underscoring that comic understanding remains an unsolved challenge. To enhance VLMs’ capabilities in this domain, we systematically investigate post-training strategies, including supervised fine-tuning on solutions (SFT-S), supervised fine-tuning on reasoning trajectories (SFT-R), and reinforcement learning (RL). Beyond that, inspired by the emerging “Thinking with Images” paradigm, we propose Region-Aware Reinforcement Learning (RARL) for VLMs, which trains models to dynamically attend to relevant regions through zoom-in operations. We observe that when applied to the Qwen2.5-VL model, RL and RARL yield significant gains in low-level entity recognition and high-level storyline ordering, paving the way for more accurate and efficient VLM applications in the comics domain.

[313] SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports

Haotian Xia, Haonan Ge, Junbo Zou, Hyun Woo Choi, Xuebin Zhang, Danny Suradja, Botao Rui, Ethan Tran, Wendy Jin, Zhen Ye, Xiyang Lin, Christopher Lai, Shengjie Zhang, Junwen Miao, Shichao Chen, Rhys Tracy, Vicente Ordonez, Weining Shen, Hanjie Chen

Main category: cs.CV

TL;DR: SportR is a multi-sports benchmark with 5,017 images and 2,101 videos, featuring progressive QA pairs and 7,118 human-authored Chain of Thought annotations to evaluate multimodal reasoning in sports.

Details

Motivation: Current sports benchmarks lack detailed reasoning chains and precise visual grounding needed to evaluate core capabilities like nuanced visual perception, rule-based reasoning, and knowledge grounding in a multi-sport context.

Method: Created a hierarchical benchmark with progressive QA pairs from simple infraction identification to complex penalty prediction, incorporating both image and video modalities with manual bounding box annotations for visual grounding.

Result: State-of-the-art models perform poorly on challenging tasks. Training via SFT and RL improves scores but remains low, highlighting significant capability gaps in current multimodal models.

Conclusion: SportR presents a challenging benchmark that reveals substantial limitations in current models’ sports reasoning abilities and provides a critical resource to drive future research in multimodal sports intelligence.

Abstract: Deeply understanding sports requires an intricate blend of fine-grained visual perception and rule-based reasoning - a challenge that pushes the limits of current multimodal models. To succeed, models must master three critical capabilities: perceiving nuanced visual details, applying abstract sport rule knowledge, and grounding that knowledge in specific visual evidence. Current sports benchmarks either cover single sports or lack the detailed reasoning chains and precise visual grounding needed to robustly evaluate these core capabilities in a multi-sport context. To address this gap, we introduce SportR, the first multi-sports large-scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 5,017 images and 2,101 videos. To enable granular evaluation, we structure our benchmark around a progressive hierarchy of question-answer (QA) pairs designed to probe reasoning at increasing depths - from simple infraction identification to complex penalty prediction. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 7,118 high-quality, human-authored Chain of Thought (CoT) annotations. In addition, our benchmark incorporates both image and video modalities and provides manual bounding box annotations to test visual grounding in the image part directly. Extensive experiments demonstrate the profound difficulty of our benchmark. State-of-the-art baseline models perform poorly on our most challenging tasks. While training on our data via Supervised Fine-Tuning and Reinforcement Learning improves these scores, they remain relatively low, highlighting a significant gap in current model capabilities. SportR presents a new challenge for the community, providing a critical resource to drive future research in multimodal sports reasoning.

[314] Video Dataset for Surgical Phase, Keypoint, and Instrument Recognition in Laparoscopic Surgery (PhaKIR)

Tobias Rueckert, Raphaela Maerkl, David Rauber, Leonard Klausmann, Max Gutbrod, Daniel Rueckert, Hubertus Feussner, Dirk Wilhelm, Christoph Palm

Main category: cs.CV

TL;DR: The PhaKIR dataset is a multi-center surgical dataset providing comprehensive annotations for laparoscopic cholecystectomy videos, including surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation.

Details

Motivation: Existing surgical datasets often address isolated tasks, neglect temporal dependencies, or lack multi-center variability, limiting the development of robust computer vision systems for robotic-assisted minimally invasive surgery.

Method: Created a dataset comprising eight complete laparoscopic cholecystectomy videos from three medical centers with frame-level annotations for three interconnected tasks: surgical phase recognition (485,875 frames), instrument keypoint estimation (19,435 frames), and instrument instance segmentation (19,435 frames).

Result: PhaKIR is the first multi-institutional dataset to jointly provide phase labels, instrument pose information, and pixel-accurate instrument segmentations while enabling temporal context exploitation through full surgical procedure sequences.

Conclusion: The dataset serves as a benchmark for surgical scene understanding methods and has been validated through the PhaKIR Challenge at MICCAI 2024, demonstrating its quality and relevance for advancing RAMIS computer vision systems.

Abstract: Robotic- and computer-assisted minimally invasive surgery (RAMIS) is increasingly relying on computer vision methods for reliable instrument recognition and surgical workflow understanding. Developing such systems often requires large, well-annotated datasets, but existing resources often address isolated tasks, neglect temporal dependencies, or lack multi-center variability. We present the Surgical Procedure Phase, Keypoint, and Instrument Recognition (PhaKIR) dataset, comprising eight complete laparoscopic cholecystectomy videos recorded at three medical centers. The dataset provides frame-level annotations for three interconnected tasks: surgical phase recognition (485,875 frames), instrument keypoint estimation (19,435 frames), and instrument instance segmentation (19,435 frames). PhaKIR is, to our knowledge, the first multi-institutional dataset to jointly provide phase labels, instrument pose information, and pixel-accurate instrument segmentations, while also enabling the exploitation of temporal context since full surgical procedure sequences are available. It served as the basis for the PhaKIR Challenge as part of the Endoscopic Vision (EndoVis) Challenge at MICCAI 2024 to benchmark methods in surgical scene understanding, thereby further validating the dataset’s quality and relevance. The dataset is publicly available upon request via the Zenodo platform.

Hui Sun, Long Lv, Pingping Zhang, Tongdan Tang, Feng Tian, Weibing Sun, Huchuan Lu

Main category: cs.CV

TL;DR: SFMFusion is a novel multi-modal image fusion framework that enhances Mamba with spatial-frequency perception and uses a three-branch structure to couple image fusion with image reconstruction for improved performance.

Details

Motivation: Existing MMIF methods using CNNs have limited receptive fields while Transformers have high computational costs. Mamba shows promise for long-range dependencies but lacks spatial and frequency perception. Also, image reconstruction as an auxiliary task is beneficial but needs efficient implementation.

Method: Proposes SFMFusion with: 1) Three-branch structure coupling MMIF and IR to retain complete source image contents; 2) Spatial-Frequency Enhanced Mamba Block (SFMB) for comprehensive feature extraction; 3) Dynamic Fusion Mamba Block (DFMB) for cross-branch dynamic feature fusion.

Result: Extensive experiments show SFMFusion achieves better results than most state-of-the-art methods on six MMIF datasets.

Conclusion: SFMFusion effectively addresses limitations of existing MMIF methods by enhancing Mamba with spatial-frequency perception and efficiently leveraging image reconstruction as an auxiliary task, achieving superior fusion performance.

Abstract: Multi-Modal Image Fusion (MMIF) aims to integrate complementary image information from different modalities to produce informative images. Previous deep learning-based MMIF methods generally adopt Convolutional Neural Networks (CNNs) or Transformers for feature extraction. However, these methods deliver unsatisfactory performances due to the limited receptive field of CNNs and the high computational cost of Transformers. Recently, Mamba has demonstrated a powerful potential for modeling long-range dependencies with linear complexity, providing a promising solution to MMIF. Unfortunately, Mamba lacks full spatial and frequency perceptions, which are very important for MMIF. Moreover, employing Image Reconstruction (IR) as an auxiliary task has been proven beneficial for MMIF. However, a primary challenge is how to leverage IR efficiently and effectively. To address the above issues, we propose a novel framework named Spatial-Frequency Enhanced Mamba Fusion (SFMFusion) for MMIF. More specifically, we first propose a three-branch structure to couple MMIF and IR, which can retain complete contents from source images. Then, we propose the Spatial-Frequency Enhanced Mamba Block (SFMB), which can enhance Mamba in both spatial and frequency domains for comprehensive feature extraction. Finally, we propose the Dynamic Fusion Mamba Block (DFMB), which can be deployed across different branches for dynamic feature fusion. Extensive experiments show that our method achieves better results than most state-of-the-art methods on six MMIF datasets. The source code is available at https://github.com/SunHui1216/SFMFusion.

[316] On Accurate and Robust Estimation of 3D and 2D Circular Center: Method and Application to Camera-Lidar Calibration

Jiajun Jiang, Xiao Hu, Wancheng Liu, Wei Jiang

Main category: cs.CV

TL;DR: A geometrically principled framework for LiDAR-camera extrinsic calibration using circular targets, featuring robust 3D circle center estimation and 2D projected center recovery methods to overcome existing challenges in 3D-2D correspondence.

Details

Motivation: Circular targets are widely used in LiDAR-camera calibration but achieving accurate 3D-2D circular center correspondence remains challenging due to decoupled 3D fitting and erroneous 2D ellipse-center estimation in existing methods.

Method: Proposes a framework with two innovations: (1) robust 3D circle center estimator using conformal geometric algebra and RANSAC, and (2) chord-length variance minimization method to recover true 2D projected center, resolving dual-minima ambiguity via homography validation or quasi-RANSAC fallback.

Result: Significantly outperforms state-of-the-art approaches on synthetic and real-world datasets, reduces extrinsic estimation error, and enables robust calibration across diverse sensors and target types including natural circular objects.

Conclusion: The proposed framework provides accurate and robust LiDAR-camera extrinsic calibration with public code release for reproducibility, addressing key limitations in existing circular target-based calibration methods.

Abstract: Circular targets are widely used in LiDAR-camera extrinsic calibration due to their geometric consistency and ease of detection. However, achieving accurate 3D-2D circular center correspondence remains challenging. Existing methods often fail due to decoupled 3D fitting and erroneous 2D ellipse-center estimation. To address this, we propose a geometrically principled framework featuring two innovations: (i) a robust 3D circle center estimator based on conformal geometric algebra and RANSAC; and (ii) a chord-length variance minimization method to recover the true 2D projected center, resolving its dual-minima ambi- guity via homography validation or a quasi-RANSAC fallback. Evaluated on synthetic and real-world datasets, our framework significantly outperforms state-of-the-art approaches. It reduces extrinsic estimation error and enables robust calibration across diverse sensors and target types, including natural circular objects. Our code will be publicly released for reproducibility.

[317] Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from LDCT

Yifei Zhang, Jiashuo Zhang, Xiaofeng Yang, Liang Zhao

Main category: cs.CV

TL;DR: An explainable framework for joint cardiopulmonary risk assessment from low-dose chest CT scans that uses clinical reasoning to connect lung abnormalities with cardiovascular implications.

Details

Motivation: Existing approaches treat pulmonary and cardiac assessment as independent tasks, overlooking their physiological interplay and shared biomarkers in LDCT scans.

Method: Three-component framework: pulmonary perception module for lung abnormalities, knowledge-guided reasoning module for cardiovascular implications, and cardiac representation module for structural biomarkers, fused for holistic prediction.

Result: Achieves state-of-the-art performance for CVD screening and mortality prediction on NLST cohort, outperforming single-disease and purely image-based baselines.

Conclusion: Establishes a unified and explainable paradigm for cardiovascular analysis from LDCT that bridges image-based prediction with mechanism-based medical interpretation.

Abstract: Low-dose chest computed tomography (LDCT) inherently captures both pulmonary and cardiac structures, offering a unique opportunity for joint assessment of lung and cardiovascular health. However, most existing approaches treat these domains as independent tasks, overlooking their physiological interplay and shared imaging biomarkers. We propose an Explainable Cross-Disease Reasoning Framework that enables interpretable cardiopulmonary risk assessment from a single LDCT scan. The framework introduces an agentic reasoning process that emulates clinical diagnostic thinking-first perceiving pulmonary findings, then reasoning through established medical knowledge, and finally deriving a cardiovascular judgment with explanatory rationale. It integrates three synergistic components: a pulmonary perception module that summarizes lung abnormalities, a knowledge-guided reasoning module that infers their cardiovascular implications, and a cardiac representation module that encodes structural biomarkers. Their outputs are fused to produce a holistic cardiovascular risk prediction that is both accurate and physiologically grounded. Experiments on the NLST cohort demonstrate that the proposed framework achieves state-of-the-art performance for CVD screening and mortality prediction, outperforming single-disease and purely image-based baselines. Beyond quantitative gains, the framework provides human-verifiable reasoning that aligns with cardiological understanding, revealing coherent links between pulmonary abnormalities and cardiac stress mechanisms. Overall, this work establishes a unified and explainable paradigm for cardiovascular analysis from LDCT, bridging the gap between image-based prediction and mechanism-based medical interpretation.

[318] DIAL-GS: Dynamic Instance Aware Reconstruction for Label-free Street Scenes with 4D Gaussian Splatting

Chenpeng Su, Wenhua Wu, Chensheng Peng, Tianchen Deng, Zhe Liu, Hesheng Wang

Main category: cs.CV

TL;DR: DIAL-GS is a self-supervised method for dynamic instance-aware reconstruction of urban scenes using 4D Gaussian Splatting, enabling accurate dynamic object identification and fine-grained editing without human annotations.

Details

Motivation: Current supervised methods require costly human annotations and lack scalability, while self-supervised approaches struggle to distinguish between static/dynamic elements and fail to identify individual dynamic objects, limiting fine-grained editing capabilities.

Method: Uses appearance-position inconsistency to identify dynamic instances, employs instance-aware 4D Gaussians as unified volumetric representation, and implements a reciprocal mechanism where identity and dynamics reinforce each other for enhanced integrity and consistency.

Result: Experiments show DIAL-GS surpasses existing self-supervised baselines in reconstruction quality and instance-level editing capabilities in urban driving scenarios.

Conclusion: DIAL-GS provides a concise yet powerful solution for label-free urban scene modeling that enables dynamic-adaptive and instance-aware reconstruction with enhanced editing capabilities.

Abstract: Urban scene reconstruction is critical for autonomous driving, enabling structured 3D representations for data synthesis and closed-loop testing. Supervised approaches rely on costly human annotations and lack scalability, while current self-supervised methods often confuse static and dynamic elements and fail to distinguish individual dynamic objects, limiting fine-grained editing. We propose DIAL-GS, a novel dynamic instance-aware reconstruction method for label-free street scenes with 4D Gaussian Splatting. We first accurately identify dynamic instances by exploiting appearance-position inconsistency between warped rendering and actual observation. Guided by instance-level dynamic perception, we employ instance-aware 4D Gaussians as the unified volumetric representation, realizing dynamic-adaptive and instance-aware reconstruction. Furthermore, we introduce a reciprocal mechanism through which identity and dynamics reinforce each other, enhancing both integrity and consistency. Experiments on urban driving scenarios show that DIAL-GS surpasses existing self-supervised baselines in reconstruction quality and instance-level editing, offering a concise yet powerful solution for urban scene modeling.

[319] UniADC: A Unified Framework for Anomaly Detection and Classification

Ximiao Zhang, Min Xu, Zheng Zhang, Junlin Hu, Xiuzhuang Zhou

Main category: cs.CV

TL;DR: UniADC is a unified model for simultaneous anomaly detection and classification that uses controllable inpainting to synthesize anomaly images and a multi-task discriminator for joint learning, achieving state-of-the-art performance with few or no anomaly samples.

Details

Motivation: Existing methods treat anomaly detection and classification as separate tasks, neglecting their inherent correlation and limiting information sharing, leading to suboptimal performance.

Method: Proposes UniADC with two components: 1) Training-free controllable inpainting network that synthesizes anomaly images by repainting normal regions using anomaly priors, and 2) Multi-task discriminator trained on synthesized samples to align fine-grained image features with anomaly-category embeddings.

Result: Extensive experiments on MVTec-FS, MTD, and WFDD datasets show UniADC consistently outperforms existing methods in anomaly detection, localization, and classification.

Conclusion: UniADC effectively unifies anomaly detection and classification tasks, enabling superior performance with minimal anomaly data through innovative controllable inpainting and multi-task learning.

Abstract: In this paper, we introduce the task of unified anomaly detection and classification, which aims to simultaneously detect anomalous regions in images and identify their specific categories. Existing methods typically treat anomaly detection and classification as separate tasks, thereby neglecting their inherent correlation, limiting information sharing, and resulting in suboptimal performance. To address this, we propose UniADC, a unified anomaly detection and classification model that can effectively perform both tasks with only a few or even no anomaly images. Specifically, UniADC consists of two key components: a training-free controllable inpainting network and a multi-task discriminator. The inpainting network can synthesize anomaly images of specific categories by repainting normal regions guided by anomaly priors, and can also repaint few-shot anomaly samples to augment the available anomaly data. The multi-task discriminator is then trained on these synthesized samples, enabling precise anomaly detection and classification by aligning fine-grained image features with anomaly-category embeddings. We conduct extensive experiments on three anomaly detection and classification datasets, including MVTec-FS, MTD, and WFDD, and the results demonstrate that UniADC consistently outperforms existing methods in anomaly detection, localization, and classification. The code is available at https://github.com/cnulab/UniADC.

[320] FreqGRL: Suppressing Low-Frequency Bias and Mining High-Frequency Knowledge for Cross-Domain Few-Shot Learning

Siqi Hui, Sanping Zhou, Ye deng, Wenli Huang, Jinjun Wang

Main category: cs.CV

TL;DR: FreqGRL is a novel CD-FSL framework that addresses data imbalance in frequency space using Low-Frequency Replacement, High-Frequency Enhancement, and Global Frequency Filter modules to improve cross-domain generalization.

Details

Motivation: Address the severe imbalance between abundant source data and scarce target data in cross-domain few-shot learning, where models are biased toward source-specific knowledge in low-frequency components and struggle with high-frequency domain-generalizable features.

Method: Proposes FreqGRL with three modules: Low-Frequency Replacement (LFR) substitutes source low-frequency components with target ones; High-Frequency Enhancement (HFE) learns directly on high-frequency features; Global Frequency Filter (GFF) suppresses noisy frequencies and emphasizes informative ones.

Result: Extensive experiments on five standard CD-FSL benchmarks demonstrate state-of-the-art performance.

Conclusion: The frequency-space perspective and proposed framework effectively mitigate data imbalance challenges in CD-FSL, achieving superior cross-domain generalization through frequency-guided representation learning.

Abstract: Cross-domain few-shot learning (CD-FSL) aims to recognize novel classes with only a few labeled examples under significant domain shifts. While recent approaches leverage a limited amount of labeled target-domain data to improve performance, the severe imbalance between abundant source data and scarce target data remains a critical challenge for effective representation learning. We present the first frequency-space perspective to analyze this issue and identify two key challenges: (1) models are easily biased toward source-specific knowledge encoded in the low-frequency components of source data, and (2) the sparsity of target data hinders the learning of high-frequency, domain-generalizable features. To address these challenges, we propose \textbf{FreqGRL}, a novel CD-FSL framework that mitigates the impact of data imbalance in the frequency space. Specifically, we introduce a Low-Frequency Replacement (LFR) module that substitutes the low-frequency components of source tasks with those from the target domain to create new source tasks that better align with target characteristics, thus reducing source-specific biases and promoting generalizable representation learning. We further design a High-Frequency Enhancement (HFE) module that filters out low-frequency components and performs learning directly on high-frequency features in the frequency space to improve cross-domain generalization. Additionally, a Global Frequency Filter (GFF) is incorporated to suppress noisy or irrelevant frequencies and emphasize informative ones, mitigating overfitting risks under limited target supervision. Extensive experiments on five standard CD-FSL benchmarks demonstrate that our frequency-guided framework achieves state-of-the-art performance.

[321] NOVO: Bridging LLaVA and SAM with Visual-only Prompts for Reasoning Segmentation

Kyung-Yoon Yoon, Yeong-Jun Cho

Main category: cs.CV

TL;DR: NOVO bridges vision-language models and segmentation models using visual-only prompts (coarse masks and points) instead of text tokens, achieving state-of-the-art performance in reasoning segmentation.

Details

Motivation: To create a more effective framework for reasoning segmentation that preserves alignment with pretrained segmentation models like SAM by using visual prompts rather than text-derived embeddings.

Method: Generates coarse masks and point prompts from VLM output as visual-only prompts for SAM, plus a training-free refinement module to improve boundary quality and enable instance-level segmentation.

Result: Achieves state-of-the-art performance across multiple metrics and model sizes on the new RISeg benchmark (918 images, 2,533 instance masks).

Conclusion: NOVO demonstrates effective and scalable reasoning segmentation through visual-only prompts that maintain compatibility with pretrained segmentation models.

Abstract: In this study, we propose NOVO (NO text, Visual-Only prompts), a novel framework that bridges vision-language models (VLMs) and segmentation models through visual-only prompts. Unlike prior approaches that feed text-derived SEG token embeddings into segmentation models, NOVO instead generates a coarse mask and point prompts from the VLM output. These visual prompts are compatible with the Segment Anything Model (SAM), preserving alignment with its pretrained capabilities. To further enhance boundary quality and enable instance-level segmentation, we introduce a training-free refinement module that reduces visual artifacts and improves the quality of segmentation masks. We also present RISeg, a new benchmark comprising 918 images, 2,533 instance-level masks, and diverse reasoning queries to evaluate this task. Experiments demonstrate that NOVO achieves state-of-the-art performance across multiple metrics and model sizes, demonstrating its effectiveness and scalability in reasoning segmentation.

[322] Active Learning for Animal Re-Identification with Ambiguity-Aware Sampling

Depanshu Sani, Mehar Khurana, Saket Anand

Main category: cs.CV

TL;DR: A novel active learning framework for animal re-identification that uses complementary clustering to identify ambiguous regions and mine informative sample pairs, achieving state-of-the-art performance with minimal annotation effort.

Details

Motivation: Animal Re-ID faces challenges due to subtle distinguishing patterns, handling new species, and open-set nature. Foundation models underperform in zero-shot settings, while existing unsupervised and active learning methods are inadequate for animal Re-ID, requiring laborious annotation.

Method: Proposed AL framework leverages complementary clustering methods to uncover structurally ambiguous regions in embedding space, mines informative and representative sample pairs, uses oracle feedback (must-link/cannot-link constraints), and integrates with unsupervised methods through constrained clustering refinement.

Result: Achieves average improvements of 10.49%, 11.19% and 3.99% (mAP) on 13 wildlife datasets over foundational, USL and AL methods respectively, using only 0.033% of all annotations. Also shows 11.09%, 8.2% and 2.06% improvement for unknown individuals in open-world setting.

Conclusion: The proposed active learning framework effectively addresses animal Re-ID challenges by strategically targeting ambiguous regions with minimal annotation effort, outperforming existing methods and achieving state-of-the-art performance across multiple datasets.

Abstract: Animal Re-ID has recently gained substantial attention in the AI research community due to its high impact on biodiversity monitoring and unique research challenges arising from environmental factors. The subtle distinguishing patterns, handling new species and the inherent open-set nature make the problem even harder. To address these complexities, foundation models trained on labeled, large-scale and multi-species animal Re-ID datasets have recently been introduced to enable zero-shot Re-ID. However, our benchmarking reveals significant gaps in their zero-shot Re-ID performance for both known and unknown species. While this highlights the need for collecting labeled data in new domains, exhaustive annotation for Re-ID is laborious and requires domain expertise. Our analyses show that existing unsupervised (USL) and AL Re-ID methods underperform for animal Re-ID. To address these limitations, we introduce a novel AL Re-ID framework that leverages complementary clustering methods to uncover and target structurally ambiguous regions in the embedding space for mining pairs of samples that are both informative and broadly representative. Oracle feedback on these pairs, in the form of must-link and cannot-link constraints, facilitates a simple annotation interface, which naturally integrates with existing USL methods through our proposed constrained clustering refinement algorithm. Through extensive experiments, we demonstrate that, by utilizing only 0.033% of all annotations, our approach consistently outperforms existing foundational, USL and AL baselines. Specifically, we report an average improvement of 10.49%, 11.19% and 3.99% (mAP) on 13 wildlife datasets over foundational, USL and AL methods, respectively, while attaining state-of-the-art performance on each dataset. Furthermore, we also show an improvement of 11.09%, 8.2% and 2.06% for unknown individuals in an open-world setting.

[323] Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks

Lingran Song, Yucheng Zhou, Jianbing Shen

Main category: cs.CV

TL;DR: Proposes Medical Diagnosis Segmentation (MDS) task combining medical image segmentation with diagnostic reasoning, introduces M3DS dataset with diagnosis chain-of-thought, and presents Sim4Seg framework with RVLS2M module for improved performance.

Details

Motivation: Existing medical image segmentation models rarely explore segmentation and diagnosis tasks jointly, but explainable diagnoses along with segmentation results are crucial for patients.

Method: Introduces MDS task, creates M3DS dataset with automated diagnosis chain-of-thought generation, proposes Sim4Seg framework with Region-Aware Vision-Language Similarity to Mask (RVLS2M) module, and investigates test-time scaling strategy.

Result: Experimental results demonstrate that the proposed method outperforms baselines in both segmentation and diagnosis tasks.

Conclusion: The approach successfully integrates medical segmentation with diagnostic reasoning, providing explainable diagnoses alongside segmentation results through the novel MDS task and Sim4Seg framework.

Abstract: Despite significant progress in pixel-level medical image analysis, existing medical image segmentation models rarely explore medical segmentation and diagnosis tasks jointly. However, it is crucial for patients that models can provide explainable diagnoses along with medical segmentation results. In this paper, we introduce a medical vision-language task named Medical Diagnosis Segmentation (MDS), which aims to understand clinical queries for medical images and generate the corresponding segmentation masks as well as diagnostic results. To facilitate this task, we first present the Multimodal Multi-disease Medical Diagnosis Segmentation (M3DS) dataset, containing diverse multimodal multi-disease medical images paired with their corresponding segmentation masks and diagnosis chain-of-thought, created via an automated diagnosis chain-of-thought generation pipeline. Moreover, we propose Sim4Seg, a novel framework that improves the performance of diagnosis segmentation by taking advantage of the Region-Aware Vision-Language Similarity to Mask (RVLS2M) module. To improve overall performance, we investigate a test-time scaling strategy for MDS tasks. Experimental results demonstrate that our method outperforms the baselines in both segmentation and diagnosis.

[324] REOcc: Camera-Radar Fusion with Radar Feature Enrichment for 3D Occupancy Prediction

Chaehee Song, Sanmin Kim, Hyeonjun Jeong, Juyeb Shin, Joonhee Lim, Dongsuk Kum

Main category: cs.CV

TL;DR: REOcc is a camera-radar fusion network that enhances radar features for 3D occupancy prediction using Radar Densifier and Radar Amplifier components to address radar sparsity and noise issues.

Details

Motivation: Vision-based 3D occupancy prediction struggles in challenging environments with cameras alone. Camera-radar fusion is promising due to complementary strengths, but radar's sparsity and noise limit effectiveness.

Method: Proposes REOcc with two main components: Radar Densifier and Radar Amplifier that refine radar features by integrating spatial and contextual information to enhance spatial density and quality.

Result: Extensive experiments on Occ3D-nuScenes benchmark show significant performance gains over camera-only baseline, especially in dynamic object classes. Effectively mitigates radar sparsity and noise.

Conclusion: REOcc enables radar to better complement camera data, unlocking the full potential of camera-radar fusion for robust and reliable 3D occupancy prediction.

Abstract: Vision-based 3D occupancy prediction has made significant advancements, but its reliance on cameras alone struggles in challenging environments. This limitation has driven the adoption of sensor fusion, among which camera-radar fusion stands out as a promising solution due to their complementary strengths. However, the sparsity and noise of the radar data limits its effectiveness, leading to suboptimal fusion performance. In this paper, we propose REOcc, a novel camera-radar fusion network designed to enrich radar feature representations for 3D occupancy prediction. Our approach introduces two main components, a Radar Densifier and a Radar Amplifier, which refine radar features by integrating spatial and contextual information, effectively enhancing spatial density and quality. Extensive experiments on the Occ3D-nuScenes benchmark demonstrate that REOcc achieves significant performance gains over the camera-only baseline model, particularly in dynamic object classes. These results underscore REOcc’s capability to mitigate the sparsity and noise of the radar data. Consequently, radar complements camera data more effectively, unlocking the full potential of camera-radar fusion for robust and reliable 3D occupancy prediction.

[325] Flexible Concept Bottleneck Model

Xingbo Du, Qiantong Dou, Lei Fan, Rui Zhang

Main category: cs.CV

TL;DR: FCBM enables dynamic concept adaptation in concept bottleneck models without full retraining, using a hypernetwork for concept embedding and sparsemax for concept selection.

Details

Motivation: Existing VLM-based CBMs require full retraining for new concepts, limiting adaptability in real-world scenarios with rapidly evolving foundation models.

Method: Hypernetwork generates prediction weights from concept embeddings, plus modified sparsemax with learnable temperature for dynamic concept selection.

Result: Achieves comparable accuracy to SOTA baselines with similar concept count, generalizes well to unseen concepts with single-epoch fine-tuning.

Conclusion: FCBM provides strong adaptability and flexibility for concept bottleneck models, supporting dynamic concept updates without complete retraining.

Abstract: Concept bottleneck models (CBMs) improve neural network interpretability by introducing an intermediate layer that maps human-understandable concepts to predictions. Recent work has explored the use of vision-language models (VLMs) to automate concept selection and annotation. However, existing VLM-based CBMs typically require full model retraining when new concepts are involved, which limits their adaptability and flexibility in real-world scenarios, especially considering the rapid evolution of vision-language foundation models. To address these issues, we propose Flexible Concept Bottleneck Model (FCBM), which supports dynamic concept adaptation, including complete replacement of the original concept set. Specifically, we design a hypernetwork that generates prediction weights based on concept embeddings, allowing seamless integration of new concepts without retraining the entire model. In addition, we introduce a modified sparsemax module with a learnable temperature parameter that dynamically selects the most relevant concepts, enabling the model to focus on the most informative features. Extensive experiments on five public benchmarks demonstrate that our method achieves accuracy comparable to state-of-the-art baselines with a similar number of effective concepts. Moreover, the model generalizes well to unseen concepts with just a single epoch of fine-tuning, demonstrating its strong adaptability and flexibility.

[326] AnoStyler: Text-Driven Localized Anomaly Generation via Lightweight Style Transfer

Yulim So, Seokho Kang

Main category: cs.CV

TL;DR: AnoStyler is a lightweight zero-shot anomaly generation method that uses text-guided style transfer to create realistic anomaly images from single normal images, overcoming limitations of existing approaches.

Details

Motivation: Existing anomaly generation methods suffer from poor visual realism, dependency on large datasets, or heavy model architectures, limiting practical deployment.

Method: Uses text-guided style transfer with a lightweight U-Net trained with CLIP-based losses. Generates anomaly masks and text prompts from single normal images, then stylizes normal images into anomalies localized by masks and aligned with text semantics.

Result: Outperforms existing methods on MVTec-AD and VisA datasets in generating high-quality, diverse anomaly images. Generated anomalies enhance anomaly detection performance.

Conclusion: AnoStyler provides an effective, lightweight solution for zero-shot anomaly generation that addresses key limitations of previous approaches and improves anomaly detection.

Abstract: Anomaly generation has been widely explored to address the scarcity of anomaly images in real-world data. However, existing methods typically suffer from at least one of the following limitations, hindering their practical deployment: (1) lack of visual realism in generated anomalies; (2) dependence on large amounts of real images; and (3) use of memory-intensive, heavyweight model architectures. To overcome these limitations, we propose AnoStyler, a lightweight yet effective method that frames zero-shot anomaly generation as text-guided style transfer. Given a single normal image along with its category label and expected defect type, an anomaly mask indicating the localized anomaly regions and two-class text prompts representing the normal and anomaly states are generated using generalizable category-agnostic procedures. A lightweight U-Net model trained with CLIP-based loss functions is used to stylize the normal image into a visually realistic anomaly image, where anomalies are localized by the anomaly mask and semantically aligned with the text prompts. Extensive experiments on the MVTec-AD and VisA datasets show that AnoStyler outperforms existing anomaly generation methods in generating high-quality and diverse anomaly images. Furthermore, using these generated anomalies helps enhance anomaly detection performance.

[327] SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection

Yifan Wang, Yian Zhao, Fanqi Pu, Xiaochen Yang, Yang Tang, Xi Chen, Wenming Yang

Main category: cs.CV

TL;DR: SPAN introduces spatial-projection alignment to address geometric inconsistency in monocular 3D detection by enforcing global spatial constraints and 3D-2D projection alignment, with hierarchical learning for stable training.

Details

Motivation: Existing monocular 3D detectors use decoupled prediction that ignores geometric collaborative constraints between attributes, leading to lack of geometric consistency and suboptimal performance.

Method: Proposes Spatial-Projection Alignment (SPAN) with two components: Spatial Point Alignment for global spatial constraints between predicted and ground-truth boxes, and 3D-2D Projection Alignment for tight alignment on image plane. Uses Hierarchical Task Learning for stable training.

Result: The method can be easily integrated into any established monocular 3D detector and delivers significant performance improvements.

Conclusion: SPAN effectively addresses geometric inconsistency in monocular 3D detection through spatial-projection alignment and hierarchical learning, achieving better performance while being easily integrable.

Abstract: Existing monocular 3D detectors typically tame the pronounced nonlinear regression of 3D bounding box through decoupled prediction paradigm, which employs multiple branches to estimate geometric center, depth, dimensions, and rotation angle separately. Although this decoupling strategy simplifies the learning process, it inherently ignores the geometric collaborative constraints between different attributes, resulting in the lack of geometric consistency prior, thereby leading to suboptimal performance. To address this issue, we propose novel Spatial-Projection Alignment (SPAN) with two pivotal components: (i). Spatial Point Alignment enforces an explicit global spatial constraint between the predicted and ground-truth 3D bounding boxes, thereby rectifying spatial drift caused by decoupled attribute regression. (ii). 3D-2D Projection Alignment ensures that the projected 3D box is aligned tightly within its corresponding 2D detection bounding box on the image plane, mitigating projection misalignment overlooked in previous works. To ensure training stability, we further introduce a Hierarchical Task Learning strategy that progressively incorporates spatial-projection alignment as 3D attribute predictions refine, preventing early stage error propagation across attributes. Extensive experiments demonstrate that the proposed method can be easily integrated into any established monocular 3D detector and delivers significant performance improvements.

[328] K-Stain: Keypoint-Driven Correspondence for H&E-to-IHC Virtual Staining

Sicheng Yang, Zhaohu Xing, Haipeng Zhou, Lei Zhu

Main category: cs.CV

TL;DR: K-Stain is a novel framework that uses keypoint-based spatial and semantic relationships to improve virtual staining of H&E images into IHC images, overcoming misalignment issues in tissue slices.

Details

Motivation: Existing virtual staining methods struggle with effective spatial information utilization due to misalignment in tissue slices, which K-Stain addresses using keypoints as robust spatial correspondence indicators.

Method: K-Stain has three components: Hierarchical Spatial Keypoint Detector (HSKD) for identifying keypoints, Keypoint-aware Enhancement Generator (KEG) for integrating keypoints during generation, and Keypoint Guided Discriminator (KGD) for improved spatial detail sensitivity.

Result: Extensive experiments show K-Stain outperforms state-of-the-art methods in both quantitative metrics and visual quality, producing more accurate and visually consistent IHC images.

Conclusion: The keypoint-based approach effectively leverages contextual information from adjacent slices to enhance synthesized IHC image fidelity, demonstrating superior performance over existing methods.

Abstract: Virtual staining offers a promising method for converting Hematoxylin and Eosin (H&E) images into Immunohistochemical (IHC) images, eliminating the need for costly chemical processes. However, existing methods often struggle to utilize spatial information effectively due to misalignment in tissue slices. To overcome this challenge, we leverage keypoints as robust indicators of spatial correspondence, enabling more precise alignment and integration of structural details in synthesized IHC images. We introduce K-Stain, a novel framework that employs keypoint-based spatial and semantic relationships to enhance synthesized IHC image fidelity. K-Stain comprises three main components: (1) a Hierarchical Spatial Keypoint Detector (HSKD) for identifying keypoints in stain images, (2) a Keypoint-aware Enhancement Generator (KEG) that integrates these keypoints during image generation, and (3) a Keypoint Guided Discriminator (KGD) that improves the discriminator’s sensitivity to spatial details. Our approach leverages contextual information from adjacent slices, resulting in more accurate and visually consistent IHC images. Extensive experiments show that K-Stain outperforms state-of-the-art methods in quantitative metrics and visual quality.

[329] MirrorMamba: Towards Scalable and Robust Mirror Detection in Videos

Rui Song, Jiaying Lin, Rynson W. H. Lau

Main category: cs.CV

TL;DR: MirrorMamba: A novel video mirror detection method using Mamba-based architecture with multiple cues (depth, correspondence, optical flow) that achieves state-of-the-art performance with global receptive field and linear complexity.

Details

Motivation: Existing video mirror detection methods suffer from limited performance and robustness, over-relying on single unreliable dynamic features, and using CNNs with limited receptive fields or Transformers with quadratic complexity.

Method: Proposes MirrorMamba with multiple cues (perceived depth, correspondence, optical flow), Mamba-based Multidirection Correspondence Extractor for global receptive field with linear complexity, and Mamba-based layer-wise boundary enforcement decoder to resolve unclear boundaries.

Result: Outperforms existing state-of-the-art approaches on video mirror detection benchmarks and achieves state-of-the-art performance on the most challenging image-based mirror detection dataset.

Conclusion: First successful application of Mamba-based architecture in mirror detection, demonstrating superior performance, robustness, and generalizability across both video and image domains.

Abstract: Video mirror detection has received significant research attention, yet existing methods suffer from limited performance and robustness. These approaches often over-rely on single, unreliable dynamic features, and are typically built on CNNs with limited receptive fields or Transformers with quadratic computational complexity. To address these limitations, we propose a new effective and scalable video mirror detection method, called MirrorMamba. Our approach leverages multiple cues to adapt to diverse conditions, incorporating perceived depth, correspondence and optical. We also introduce an innovative Mamba-based Multidirection Correspondence Extractor, which benefits from the global receptive field and linear complexity of the emerging Mamba spatial state model to effectively capture correspondence properties. Additionally, we design a Mamba-based layer-wise boundary enforcement decoder to resolve the unclear boundary caused by the blurred depth map. Notably, this work marks the first successful application of the Mamba-based architecture in the field of mirror detection. Extensive experiments demonstrate that our method outperforms existing state-of-the-art approaches for video mirror detection on the benchmark datasets. Furthermore, on the most challenging and representative image-based mirror detection dataset, our approach achieves state-of-the-art performance, proving its robustness and generalizability.

[330] MRT: Learning Compact Representations with Mixed RWKV-Transformer for Extreme Image Compression

Han Liu, Hengyu Man, Xingtao Wang, Wenrui Li, Debin Zhao

Main category: cs.CV

TL;DR: Proposes MRT framework using mixed RWKV-Transformer architecture to compress images into 1-D latent representations, achieving superior compression efficiency at very low bitrates (<0.02 bpp).

Details

Motivation: Existing methods compress images into 2-D latent spaces using CNNs or Transformers, which retain substantial spatial redundancy and limit compression performance.

Method: MRT partitions images into windows, uses RWKV modules for global dependencies across windows and Transformer blocks for local redundancies within windows, plus a dedicated RWKV Compression Model (RCM) for 1-D latent features.

Result: Achieves bitrate savings of 43.75% on Kodak and 30.59% on CLIC2020 datasets compared to state-of-the-art GLC, with superior reconstruction quality below 0.02 bpp.

Conclusion: The mixed RWKV-Transformer architecture with 1-D latent representations significantly improves image compression efficiency by reducing spatial redundancy.

Abstract: Recent advances in extreme image compression have revealed that mapping pixel data into highly compact latent representations can significantly improve coding efficiency. However, most existing methods compress images into 2-D latent spaces via convolutional neural networks (CNNs) or Swin Transformers, which tend to retain substantial spatial redundancy, thereby limiting overall compression performance. In this paper, we propose a novel Mixed RWKV-Transformer (MRT) architecture that encodes images into more compact 1-D latent representations by synergistically integrating the complementary strengths of linear-attention-based RWKV and self-attention-based Transformer models. Specifically, MRT partitions each image into fixed-size windows, utilizing RWKV modules to capture global dependencies across windows and Transformer blocks to model local redundancies within each window. The hierarchical attention mechanism enables more efficient and compact representation learning in the 1-D domain. To further enhance compression efficiency, we introduce a dedicated RWKV Compression Model (RCM) tailored to the structure characteristics of the intermediate 1-D latent features in MRT. Extensive experiments on standard image compression benchmarks validate the effectiveness of our approach. The proposed MRT framework consistently achieves superior reconstruction quality at bitrates below 0.02 bits per pixel (bpp). Quantitative results based on the DISTS metric show that MRT significantly outperforms the state-of-the-art 2-D architecture GLC, achieving bitrate savings of 43.75%, 30.59% on the Kodak and CLIC2020 test datasets, respectively.

[331] Relative Energy Learning for LiDAR Out-of-Distribution Detection

Zizhao Li, Zhengkang Xiang, Jiayang Ao, Joseph West, Kourosh Khoshelham

Main category: cs.CV

TL;DR: REL is a novel framework for OOD detection in LiDAR point clouds that uses relative energy scoring and synthetic outlier generation to improve reliability in autonomous driving.

Details

Motivation: Current LiDAR OOD methods struggle with distinguishing rare anomalies from common classes, leading to high false-positive rates and overconfident errors in safety-critical autonomous driving scenarios.

Method: Proposes Relative Energy Learning (REL) framework that leverages energy gap between positive and negative logits as relative scoring function, combined with Point Raise - a lightweight data synthesis strategy that perturbs existing point clouds to generate auxiliary anomalies.

Result: Outperforms existing methods by a large margin on SemanticKITTI and Spotting the Unexpected (STU) benchmarks, demonstrating improved robustness across various scenes.

Conclusion: Modeling relative energy combined with simple synthetic outliers provides a principled and scalable solution for reliable OOD detection in open-world autonomous driving.

Abstract: Out-of-distribution (OOD) detection is a critical requirement for reliable autonomous driving, where safety depends on recognizing road obstacles and unexpected objects beyond the training distribution. Despite extensive research on OOD detection in 2D images, direct transfer to 3D LiDAR point clouds has been proven ineffective. Current LiDAR OOD methods struggle to distinguish rare anomalies from common classes, leading to high false-positive rates and overconfident errors in safety-critical settings. We propose Relative Energy Learning (REL), a simple yet effective framework for OOD detection in LiDAR point clouds. REL leverages the energy gap between positive (in-distribution) and negative logits as a relative scoring function, mitigating calibration issues in raw energy values and improving robustness across various scenes. To address the absence of OOD samples during training, we propose a lightweight data synthesis strategy called Point Raise, which perturbs existing point clouds to generate auxiliary anomalies without altering the inlier semantics. Evaluated on SemanticKITTI and the Spotting the Unexpected (STU) benchmark, REL consistently outperforms existing methods by a large margin. Our results highlight that modeling relative energy, combined with simple synthetic outliers, provides a principled and scalable solution for reliable OOD detection in open-world autonomous driving.

[332] AvatarTex: High-Fidelity Facial Texture Reconstruction from Single-Image Stylized Avatars

Yuda Qiu, Zitong Xiao, Yiwei Zuo, Zisheng Ye, Weikai Chen, Xiaoguang Han

Main category: cs.CV

TL;DR: AvatarTex is a high-fidelity facial texture reconstruction framework that generates both stylized and photorealistic textures from single images using a novel three-stage diffusion-to-GAN pipeline.

Details

Motivation: Existing methods struggle with stylized avatars due to lack of diverse multi-style datasets and challenges in maintaining geometric consistency in non-standard textures.

Method: Three-stage pipeline: 1) diffusion-based inpainting for missing regions, 2) GAN-based latent optimization for style/structure consistency, 3) diffusion-based repainting for fine details. Also introduces TexHub dataset with 20,000 multi-style UV textures.

Result: Achieves high-quality topology-aligned texture synthesis with both artistic and geometric coherence, establishing new state-of-the-art in multi-style facial texture reconstruction.

Conclusion: AvatarTex successfully integrates diffusion models’ diversity with GANs’ structured latent space to overcome limitations of existing methods, with TexHub dataset enabling future research in this field.

Abstract: We present AvatarTex, a high-fidelity facial texture reconstruction framework capable of generating both stylized and photorealistic textures from a single image. Existing methods struggle with stylized avatars due to the lack of diverse multi-style datasets and challenges in maintaining geometric consistency in non-standard textures. To address these limitations, AvatarTex introduces a novel three-stage diffusion-to-GAN pipeline. Our key insight is that while diffusion models excel at generating diversified textures, they lack explicit UV constraints, whereas GANs provide a well-structured latent space that ensures style and topology consistency. By integrating these strengths, AvatarTex achieves high-quality topology-aligned texture synthesis with both artistic and geometric coherence. Specifically, our three-stage pipeline first completes missing texture regions via diffusion-based inpainting, refines style and structure consistency using GAN-based latent optimization, and enhances fine details through diffusion-based repainting. To address the need for a stylized texture dataset, we introduce TexHub, a high-resolution collection of 20,000 multi-style UV textures with precise UV-aligned layouts. By leveraging TexHub and our structured diffusion-to-GAN pipeline, AvatarTex establishes a new state-of-the-art in multi-style facial texture reconstruction. TexHub will be released upon publication to facilitate future research in this field.

[333] Argus: Quality-Aware High-Throughput Text-to-Image Inference Serving System

Shubham Agarwal, Subrata Mitra, Saud Iqbal

Main category: cs.CV

TL;DR: Argus is a high-throughput text-to-image inference system that intelligently selects appropriate approximation levels for each prompt to maintain quality while meeting throughput targets, achieving significant performance improvements over baselines.

Details

Motivation: Text-to-image models are compute-bound with high inference times due to iterative denoising processes, creating challenges for high-throughput systems. Many prompts can be served with faster approximated models, but careful calibration is needed to avoid quality degradation.

Method: Argus intelligently switches between different approximation strategies for each prompt, selecting the right level of approximation to balance quality and throughput requirements on a fixed-size cluster.

Result: Argus achieves 10x fewer latency SLO violations, 10% higher average quality, and 40% higher throughput compared to baselines on two real-world workload traces.

Conclusion: The system successfully addresses the challenge of designing high-throughput T2I inference by dynamically selecting appropriate approximation levels, demonstrating significant improvements in both performance and quality metrics.

Abstract: Text-to-image (T2I) models have gained significant popularity. Most of these are diffusion models with unique computational characteristics, distinct from both traditional small-scale ML models and large language models. They are highly compute-bound and use an iterative denoising process to generate images, leading to very high inference time. This creates significant challenges in designing a high-throughput system. We discovered that a large fraction of prompts can be served using faster, approximated models. However, the approximation setting must be carefully calibrated for each prompt to avoid quality degradation. Designing a high-throughput system that assigns each prompt to the appropriate model and compatible approximation setting remains a challenging problem. We present Argus, a high-throughput T2I inference system that selects the right level of approximation for each prompt to maintain quality while meeting throughput targets on a fixed-size cluster. Argus intelligently switches between different approximation strategies to satisfy both throughput and quality requirements. Overall, Argus achieves 10x fewer latency service-level objective (SLO) violations, 10% higher average quality, and 40% higher throughput compared to baselines on two real-world workload traces.

[334] Rethinking Rainy 3D Scene Reconstruction via Perspective Transforming and Brightness Tuning

Qianfeng Yang, Xiang Chen, Pengpeng Li, Qiyuan Guan, Guiyue Jin, Jiyu Jin

Main category: cs.CV

TL;DR: OmniRain3D dataset addresses realistic rain effects in 3D scenes with perspective heterogeneity and brightness dynamicity, while REVR-GSNet framework enables high-fidelity 3D reconstruction from rain-degraded multi-view images through joint optimization.

Details

Motivation: Existing datasets overlook viewpoint-dependent rain streak appearance and ambient brightness reduction during rainfall, leading to inaccurate 3D reconstruction from rain-degraded multi-view images.

Method: Constructed OmniRain3D dataset with perspective heterogeneity and brightness dynamicity; proposed REVR-GSNet framework integrating recursive brightness enhancement, Gaussian primitive optimization, and GS-guided rain elimination through joint alternating optimization.

Result: Extensive experiments demonstrate effectiveness of both the dataset and method for multi-view image deraining and rainy 3D scene reconstruction.

Conclusion: The dataset and method provide foundation for future research on multi-view image deraining and rainy 3D scene reconstruction, achieving high-fidelity reconstruction of clean 3D scenes from rain-degraded inputs.

Abstract: Rain degrades the visual quality of multi-view images, which are essential for 3D scene reconstruction, resulting in inaccurate and incomplete reconstruction results. Existing datasets often overlook two critical characteristics of real rainy 3D scenes: the viewpoint-dependent variation in the appearance of rain streaks caused by their projection onto 2D images, and the reduction in ambient brightness resulting from cloud coverage during rainfall. To improve data realism, we construct a new dataset named OmniRain3D that incorporates perspective heterogeneity and brightness dynamicity, enabling more faithful simulation of rain degradation in 3D scenes. Based on this dataset, we propose an end-to-end reconstruction framework named REVR-GSNet (Rain Elimination and Visibility Recovery for 3D Gaussian Splatting). Specifically, REVR-GSNet integrates recursive brightness enhancement, Gaussian primitive optimization, and GS-guided rain elimination into a unified architecture through joint alternating optimization, achieving high-fidelity reconstruction of clean 3D scenes from rain-degraded inputs. Extensive experiments show the effectiveness of our dataset and method. Our dataset and method provide a foundation for future research on multi-view image deraining and rainy 3D scene reconstruction.

[335] SinSEMI: A One-Shot Image Generation Model and Data-Efficient Evaluation Framework for Semiconductor Inspection Equipment

ChunLiang Wu, Xiaochun Li

Main category: cs.CV

TL;DR: SinSEMI is a one-shot learning approach that generates diverse, realistic semiconductor optical images from a single input image using multi-scale flow-based models with LPIPS energy guidance.

Details

Motivation: Address data scarcity in semiconductor equipment development where obtaining large quantities of raw optical images is challenging, hindering AI advancement in semiconductor manufacturing.

Method: Uses multi-scale flow-based model enhanced with LPIPS energy guidance during sampling to ensure perceptual realism and output variety from single optical images.

Result: Superior performance in visual quality, quantitative measures, and downstream tasks compared to other one-shot generation techniques; generated images achieve high fidelity and meaningful diversity.

Conclusion: SinSEMI-generated images are suitable as training data for semiconductor AI applications, effectively addressing the data scarcity problem in early-stage semiconductor equipment development.

Abstract: In the early stages of semiconductor equipment development, obtaining large quantities of raw optical images poses a significant challenge. This data scarcity hinder the advancement of AI-powered solutions in semiconductor manufacturing. To address this challenge, we introduce SinSEMI, a novel one-shot learning approach that generates diverse and highly realistic images from single optical image. SinSEMI employs a multi-scale flow-based model enhanced with LPIPS (Learned Perceptual Image Patch Similarity) energy guidance during sampling, ensuring both perceptual realism and output variety. We also introduce a comprehensive evaluation framework tailored for this application, which enables a thorough assessment using just two reference images. Through the evaluation against multiple one-shot generation techniques, we demonstrate SinSEMI’s superior performance in visual quality, quantitative measures, and downstream tasks. Our experimental results demonstrate that SinSEMI-generated images achieve both high fidelity and meaningful diversity, making them suitable as training data for semiconductor AI applications.

[336] Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV

Wenbo Huang, Jinghui Zhang, Zhenghao Chen, Guang Li, Lei Zhang, Yang Cao, Fang Dong, Takahiro Ogawa, Miki Haseyama

Main category: cs.CV

TL;DR: Otter is a novel framework for wide-angle few-shot action recognition that uses compound segmentation and temporal reconstruction with RWKV to address background distractions and temporal relation degradation.

Details

Motivation: Wide-angle videos in FSAR face challenges due to background distractions and difficulty in recognizing actions without global understanding of subjects and background. Existing methods struggle with highlighting subjects and reconstructing temporal relations in frames with similar backgrounds.

Method: Proposes Otter with Compound Segmentation Module (CSM) to segment and emphasize key patches, and Temporal Reconstruction Module (TRM) for bidirectional scanning to reconstruct temporal relations. Combines regular prototype with temporal-enhanced prototype for better subject emphasis and temporal modeling.

Result: Achieves state-of-the-art performance on SSv2, Kinetics, UCF101, and HMDB51 benchmarks. Additional evaluation on VideoBadminton dataset validates superiority in wide-angle FSAR.

Conclusion: Otter effectively addresses background distractions and temporal relation degradation in wide-angle FSAR through compound segmentation and temporal reconstruction, demonstrating superior performance across multiple benchmarks.

Abstract: Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of the background distractions. Receptance Weighted Key Value (RWKV), which learns interaction between various dimensions, shows promise for global modeling. While directly applying RWKV to wide-angle FSAR may fail to highlight subjects due to excessive background information. Additionally, temporal relation degraded by frames with similar backgrounds is difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module~(CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing better reconstruct temporal relation. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance. Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.

[337] PointCubeNet: 3D Part-level Reasoning with 3x3x3 Point Cloud Blocks

Da-Yeong Kim, Yeong-Jun Cho

Main category: cs.CV

TL;DR: PointCubeNet is a multi-modal 3D understanding framework that performs unsupervised part-level reasoning without part annotations, using global and local branches with 3x3x3 local blocks for sub-region analysis.

Details

Motivation: To enhance 3D object understanding by enabling part-level reasoning without requiring expensive part annotations, addressing the gap in unsupervised 3D part-level analysis.

Method: Uses global and local branches with 3x3x3 local blocks for part-level analysis, employs pseudo-labeling and local loss function for unsupervised training with local text labels.

Result: Demonstrates that understanding 3D object parts enhances overall 3D object understanding, achieves reliable and meaningful part-level reasoning results.

Conclusion: First successful attempt at unsupervised 3D part-level reasoning, showing that part-level analysis improves comprehensive 3D object understanding without requiring part annotations.

Abstract: In this paper, we propose PointCubeNet, a novel multi-modal 3D understanding framework that achieves part-level reasoning without requiring any part annotations. PointCubeNet comprises global and local branches. The proposed local branch, structured into 3x3x3 local blocks, enables part-level analysis of point cloud sub-regions with the corresponding local text labels. Leveraging the proposed pseudo-labeling method and local loss function, PointCubeNet is effectively trained in an unsupervised manner. The experimental results demonstrate that understanding 3D object parts enhances the understanding of the overall 3D object. In addition, this is the first attempt to perform unsupervised 3D part-level reasoning and achieves reliable and meaningful results.

[338] Image Restoration via Primal Dual Hybrid Gradient and Flow Generative Model

Ji Li, Chao Wang

Main category: cs.CV

TL;DR: A Plug-and-Play framework using flow matching generative models as priors, extended to handle various data fidelity terms beyond Gaussian noise via a primal-dual hybrid gradient approach.

Details

Motivation: Existing PnP methods mainly work with Gaussian noise (squared ℓ₂ fidelity) but struggle with more general noise types like Poisson and impulse noise, limiting their practical applicability.

Method: Proposed a PnP algorithm based on PDHG that replaces the proximal operator with time-dependent denoisers from flow matching models, supporting both ℓ₁ and ℓ₂ fidelity terms.

Result: Validated on denoising, super-resolution, deblurring, and inpainting tasks; ℓ₁ and ℓ₂ fidelity outperformed squared ℓ₂ loss for non-Gaussian noise scenarios.

Conclusion: The proposed method is computationally efficient, memory-friendly, and robust to various noise types, expanding PnP’s applicability beyond Gaussian noise settings.

Abstract: Regularized optimization has been a classical approach to solving imaging inverse problems, where the regularization term enforces desirable properties of the unknown image. Recently, the integration of flow matching generative models into image restoration has garnered significant attention, owing to their powerful prior modeling capabilities. In this work, we incorporate such generative priors into a Plug-and-Play (PnP) framework based on proximal splitting, where the proximal operator associated with the regularizer is replaced by a time-dependent denoiser derived from the generative model. While existing PnP methods have achieved notable success in inverse problems with smooth squared $\ell_2$ data fidelity–typically associated with Gaussian noise–their applicability to more general data fidelity terms remains underexplored. To address this, we propose a general and efficient PnP algorithm inspired by the primal-dual hybrid gradient (PDHG) method. Our approach is computationally efficient, memory-friendly, and accommodates a wide range of fidelity terms. In particular, it supports both $\ell_1$ and $\ell_2$ norm-based losses, enabling robustness to non-Gaussian noise types such as Poisson and impulse noise. We validate our method on several image restoration tasks, including denoising, super-resolution, deblurring, and inpainting, and demonstrate that $\ell_1$ and $\ell_2$ fidelity terms outperform the conventional squared $\ell_2$ loss in the presence of non-Gaussian noise.

[339] Med-SORA: Symptom to Organ Reasoning in Abdomen CT Images

You-Kyoung Na, Yeong-Jun Cho

Main category: cs.CV

TL;DR: Med-SORA is a framework for symptom-to-organ reasoning in abdominal CT images that addresses limitations of existing medical multimodal models through RAG-based dataset construction, soft labeling with learnable organ anchors, and 2D-3D cross-attention architecture.

Details

Motivation: Existing medical multimodal models oversimplify clinical reality by using simple one-to-one hard labeling and mainly relying on single-slice 2D features without 3D information, limiting their ability to capture full anatomical context and complex symptom-organ relationships.

Method: Proposes Med-SORA with three key components: 1) RAG-based dataset construction, 2) Soft labeling with learnable organ anchors to capture one-to-many symptom-organ relationships, and 3) A 2D-3D cross-attention architecture to fuse local and global image features.

Result: Experimental results show that Med-SORA outperforms existing medical multimodal models and enables accurate 3D clinical reasoning.

Conclusion: This is the first work to address symptom-to-organ reasoning in medical multimodal learning, successfully capturing complex symptom-organ relationships and incorporating 3D anatomical context for improved clinical reasoning.

Abstract: Understanding symptom-image associations is crucial for clinical reasoning. However, existing medical multimodal models often rely on simple one-to-one hard labeling, oversimplifying clinical reality where symptoms relate to multiple organs. In addition, they mainly use single-slice 2D features without incorporating 3D information, limiting their ability to capture full anatomical context. In this study, we propose Med-SORA, a framework for symptom-to-organ reasoning in abdominal CT images. Med-SORA introduces RAG-based dataset construction, soft labeling with learnable organ anchors to capture one-to-many symptom-organ relationships, and a 2D-3D cross-attention architecture to fuse local and global image features. To our knowledge, this is the first work to address symptom-to-organ reasoning in medical multimodal learning. Experimental results show that Med-SORA outperforms existing medical multimodal models and enables accurate 3D clinical reasoning.

[340] CAST-LUT: Tokenizer-Guided HSV Look-Up Tables for Purple Flare Removal

Pu Wang, Shuning Sun, Jialang Lu, Chen Wu, Zhihua Zhang, Youshan Zhang, Chenggang Shan, Dianjie Lu, Guijuan Zhang, Zhuoran Zheng

Main category: cs.CV

TL;DR: A novel network using decoupled HSV Look-Up Tables (LUTs) to remove purple flare artifacts from images, featuring a two-stage architecture with Chroma-Aware Spectral Tokenizer and dynamic LUT generation.

Details

Motivation: Purple flare artifacts degrade image quality, existing methods lack flexibility due to hand-crafted features, and deep learning is hampered by scarce paired training data.

Method: Two-stage architecture: 1) CAST converts RGB to HSV and encodes H/V channels into semantic tokens, 2) HSV-LUT module generates independent correction curves for H, S, V channels using these tokens.

Result: Model significantly outperforms existing methods in visual effects and achieves state-of-the-art performance on all quantitative metrics.

Conclusion: The proposed decoupled HSV LUT approach effectively resolves color coupling problems and provides superior purple flare removal compared to traditional methods.

Abstract: Purple flare, a diffuse chromatic aberration artifact commonly found around highlight areas, severely degrades the tone transition and color of the image. Existing traditional methods are based on hand-crafted features, which lack flexibility and rely entirely on fixed priors, while the scarcity of paired training data critically hampers deep learning. To address this issue, we propose a novel network built upon decoupled HSV Look-Up Tables (LUTs). The method aims to simplify color correction by adjusting the Hue (H), Saturation (S), and Value (V) components independently. This approach resolves the inherent color coupling problems in traditional methods. Our model adopts a two-stage architecture: First, a Chroma-Aware Spectral Tokenizer (CAST) converts the input image from RGB space to HSV space and independently encodes the Hue (H) and Value (V) channels into a set of semantic tokens describing the Purple flare status; second, the HSV-LUT module takes these tokens as input and dynamically generates independent correction curves (1D-LUTs) for the three channels H, S, and V. To effectively train and validate our model, we built the first large-scale purple flare dataset with diverse scenes. We also proposed new metrics and a loss function specifically designed for this task. Extensive experiments demonstrate that our model not only significantly outperforms existing methods in visual effects but also achieves state-of-the-art performance on all quantitative metrics.

[341] Robust and High-Fidelity 3D Gaussian Splatting: Fusing Pose Priors and Geometry Constraints for Texture-Deficient Outdoor Scenes

Meijun Guo, Yongliang Shi, Caiyun Liu, Yixiao Feng, Ming Ma, Tinghai Yan, Weining Lu, Bin Liang

Main category: cs.CV

TL;DR: This paper improves 3D Gaussian Splatting for large outdoor scenes with weak/repetitive textures by using LiDAR-IMU prior poses for robust camera pose estimation and adding normal vector constraints with rank regularization for better scene representation.

Details

Motivation: To address unstable pose estimation and scene representation distortion caused by geometric texture inconsistency in large outdoor scenes with weak or repetitive textures.

Method: Uses LiDAR-IMU Odometry for prior camera poses in bundle adjustment, and introduces normal vector constraints with effective rank regularization for Gaussian primitives optimization.

Result: Achieved pose optimization in one-third the time while maintaining accuracy, and significantly outperformed conventional 3DGS in scene representation, especially on weak/repetitive texture datasets.

Conclusion: The approach enhances both pose estimation robustness and scene representation quality for 3DGS in challenging outdoor environments with texture limitations.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a key rendering pipeline for digital asset creation due to its balance between efficiency and visual quality. To address the issues of unstable pose estimation and scene representation distortion caused by geometric texture inconsistency in large outdoor scenes with weak or repetitive textures, we approach the problem from two aspects: pose estimation and scene representation. For pose estimation, we leverage LiDAR-IMU Odometry to provide prior poses for cameras in large-scale environments. These prior pose constraints are incorporated into COLMAP’s triangulation process, with pose optimization performed via bundle adjustment. Ensuring consistency between pixel data association and prior poses helps maintain both robustness and accuracy. For scene representation, we introduce normal vector constraints and effective rank regularization to enforce consistency in the direction and shape of Gaussian primitives. These constraints are jointly optimized with the existing photometric loss to enhance the map quality. We evaluate our approach using both public and self-collected datasets. In terms of pose optimization, our method requires only one-third of the time while maintaining accuracy and robustness across both datasets. In terms of scene representation, the results show that our method significantly outperforms conventional 3DGS pipelines. Notably, on self-collected datasets characterized by weak or repetitive textures, our approach demonstrates enhanced visualization capabilities and achieves superior overall performance. Codes and data will be publicly available at https://github.com/justinyeah/normal_shape.git.

[342] TiS-TSL: Image-Label Supervised Surgical Video Stereo Matching via Time-Switchable Teacher-Student Learning

Rui Wang, Ying Zhou, Hao Wang, Wenwei Zhang, Qiang Li, Zhiwei Wang

Main category: cs.CV

TL;DR: TiS-TSL is a time-switchable teacher-student learning framework for video stereo matching in minimally invasive surgery that addresses temporal inconsistency in disparity predictions through unified image and video prediction modes and bidirectional spatio-temporal consistency.

Details

Motivation: Stereo matching in MIS lacks dense supervision due to anatomical constraints, and existing teacher-student methods only provide spatial confidence without temporal consistency, leading to unstable disparity predictions and flickering artifacts in videos.

Method: Proposes TiS-TSL with a unified model operating in three modes (IP, FVP, BVP) and a two-stage learning strategy: I2V transfers sparse image knowledge to temporal modeling, and V2V refines predictions using bidirectional spatio-temporal consistency to filter noisy labels and enforce temporal coherence.

Result: Experimental results show TiS-TSL outperforms image-based state-of-the-art methods by improving TEPE and EPE by at least 2.11% and 4.54% respectively on two public datasets.

Conclusion: TiS-TSL effectively addresses temporal inconsistency in surgical stereo matching through unified temporal modeling and bidirectional consistency, achieving superior performance with minimal supervision.

Abstract: Stereo matching in minimally invasive surgery (MIS) is essential for next-generation navigation and augmented reality. Yet, dense disparity supervision is nearly impossible due to anatomical constraints, typically limiting annotations to only a few image-level labels acquired before the endoscope enters deep body cavities. Teacher-Student Learning (TSL) offers a promising solution by leveraging a teacher trained on sparse labels to generate pseudo labels and associated confidence maps from abundant unlabeled surgical videos. However, existing TSL methods are confined to image-level supervision, providing only spatial confidence and lacking temporal consistency estimation. This absence of spatio-temporal reliability results in unstable disparity predictions and severe flickering artifacts across video frames. To overcome these challenges, we propose TiS-TSL, a novel time-switchable teacher-student learning framework for video stereo matching under minimal supervision. At its core is a unified model that operates in three distinct modes: Image-Prediction (IP), Forward Video-Prediction (FVP), and Backward Video-Prediction (BVP), enabling flexible temporal modeling within a single architecture. Enabled by this unified model, TiS-TSL adopts a two-stage learning strategy. The Image-to-Video (I2V) stage transfers sparse image-level knowledge to initialize temporal modeling. The subsequent Video-to-Video (V2V) stage refines temporal disparity predictions by comparing forward and backward predictions to calculate bidirectional spatio-temporal consistency. This consistency identifies unreliable regions across frames, filters noisy video-level pseudo labels, and enforces temporal coherence. Experimental results on two public datasets demonstrate that TiS-TSL exceeds other image-based state-of-the-arts by improving TEPE and EPE by at least 2.11% and 4.54%, respectively..

[343] ConeGS: Error-Guided Densification Using Pixel Cones for Improved Reconstruction with Fewer Primitives

Bartłomiej Baranowski, Stefano Esposito, Patricia Gschoßmann, Anpei Chen, Andreas Geiger

Main category: cs.CV

TL;DR: ConeGS improves 3D Gaussian Splatting by using image-space-informed densification with depth estimation from iNGP proxy, enabling better Gaussian placement and reducing primitive count while maintaining quality.

Details

Motivation: 3DGS suffers from suboptimal spatial distribution of primitives due to cloning-based densification that propagates Gaussians along existing geometry, limiting exploration and requiring many primitives for adequate scene coverage.

Method: Uses iNGP reconstruction as geometric proxy for depth estimation, identifies high-error pixels, inserts new Gaussians along viewing cones at predicted depths, applies opacity penalty to remove redundant Gaussians, and uses primitive budgeting strategy.

Result: Consistently enhances reconstruction quality and rendering performance across Gaussian budgets, with strong gains under tight primitive constraints where efficient placement is crucial.

Conclusion: ConeGS provides an effective framework for improving 3DGS by enabling more efficient Gaussian placement through image-space-informed densification independent of existing scene geometry.

Abstract: 3D Gaussian Splatting (3DGS) achieves state-of-the-art image quality and real-time performance in novel view synthesis but often suffers from a suboptimal spatial distribution of primitives. This issue stems from cloning-based densification, which propagates Gaussians along existing geometry, limiting exploration and requiring many primitives to adequately cover the scene. We present ConeGS, an image-space-informed densification framework that is independent of existing scene geometry state. ConeGS first creates a fast Instant Neural Graphics Primitives (iNGP) reconstruction as a geometric proxy to estimate per-pixel depth. During the subsequent 3DGS optimization, it identifies high-error pixels and inserts new Gaussians along the corresponding viewing cones at the predicted depth values, initializing their size according to the cone diameter. A pre-activation opacity penalty rapidly removes redundant Gaussians, while a primitive budgeting strategy controls the total number of primitives, either by a fixed budget or by adapting to scene complexity, ensuring high reconstruction quality. Experiments show that ConeGS consistently enhances reconstruction quality and rendering performance across Gaussian budgets, with especially strong gains under tight primitive constraints where efficient placement is crucial.

[344] NeuroBridge: Bio-Inspired Self-Supervised EEG-to-Image Decoding via Cognitive Priors and Bidirectional Semantic Alignment

Wenjiang Zhang, Sifeng Wang, Yuwei Su, Xinyu Li, Chen Zhang, Suyu Zhong

Main category: cs.CV

TL;DR: NeuroBridge is a self-supervised framework that improves visual neural decoding by simulating perceptual variability and using bidirectional cross-modality alignment in a shared semantic space, achieving state-of-the-art performance on EEG-based image retrieval tasks.

Details

Motivation: Current visual neural decoding methods are limited by scarce stimulus-brain response pairs and semantic mismatches between neural representations and visual content. The paper aims to overcome these limitations by drawing inspiration from biological systems' perceptual variability and co-adaptive strategies.

Method: Proposes NeuroBridge with two key components: Cognitive Prior Augmentation (CPA) simulates perceptual variability through asymmetric, modality-specific transformations on EEG signals and images; Shared Semantic Projector (SSP) establishes bidirectional alignment using a co-adaptive strategy to map features from both modalities into a shared semantic space.

Result: NeuroBridge outperforms previous state-of-the-art methods in both intra-subject and inter-subject settings. In intra-subject scenario, achieves 12.3% improvement in top-1 accuracy (63.2%) and 10.2% improvement in top-5 accuracy (89.9%) on 200-way zero-shot retrieval task.

Conclusion: The proposed NeuroBridge framework demonstrates effectiveness, robustness, and scalability for neural visual decoding, providing a promising approach for cross-modality alignment between brain activity and visual stimuli.

Abstract: Visual neural decoding seeks to reconstruct or infer perceived visual stimuli from brain activity patterns, providing critical insights into human cognition and enabling transformative applications in brain-computer interfaces and artificial intelligence. Current approaches, however, remain constrained by the scarcity of high-quality stimulus-brain response pairs and the inherent semantic mismatch between neural representations and visual content. Inspired by perceptual variability and co-adaptive strategy of the biological systems, we propose a novel self-supervised architecture, named NeuroBridge, which integrates Cognitive Prior Augmentation (CPA) with Shared Semantic Projector (SSP) to promote effective cross-modality alignment. Specifically, CPA simulates perceptual variability by applying asymmetric, modality-specific transformations to both EEG signals and images, enhancing semantic diversity. Unlike previous approaches, SSP establishes a bidirectional alignment process through a co-adaptive strategy, which mutually aligns features from two modalities into a shared semantic space for effective cross-modal learning. NeuroBridge surpasses previous state-of-the-art methods under both intra-subject and inter-subject settings. In the intra-subject scenario, it achieves the improvements of 12.3% in top-1 accuracy and 10.2% in top-5 accuracy, reaching 63.2% and 89.9% respectively on a 200-way zero-shot retrieval task. Extensive experiments demonstrate the effectiveness, robustness, and scalability of the proposed framework for neural visual decoding.

[345] Integrating Reweighted Least Squares with Plug-and-Play Diffusion Priors for Noisy Image Restoration

Ji Li, Chao Wang

Main category: cs.CV

TL;DR: A plug-and-play image restoration framework using generative diffusion priors that effectively handles non-Gaussian noise like impulse noise through a generalized Gaussian scale mixture-based loss and IRLS optimization.

Details

Motivation: Existing plug-and-play methods primarily use Gaussian denoisers and are limited to Gaussian noise. There's a need to extend these approaches to handle non-Gaussian noise types like impulse noise.

Method: Proposes a MAP estimation framework with a generalized Gaussian scale mixture-based loss (ℓ_q-norm fidelity term) for various noise distributions, solved via iteratively reweighted least squares (IRLS) with diffusion-based denoisers as proximal operators.

Result: Experimental results show the method effectively removes non-Gaussian impulse noise and achieves superior restoration performance on benchmark datasets.

Conclusion: The proposed framework successfully extends plug-and-play image restoration to handle general noise types beyond Gaussian noise, demonstrating robust performance for impulse noise removal.

Abstract: Existing plug-and-play image restoration methods typically employ off-the-shelf Gaussian denoisers as proximal operators within classical optimization frameworks based on variable splitting. Recently, denoisers induced by generative priors have been successfully integrated into regularized optimization methods for image restoration under Gaussian noise. However, their application to non-Gaussian noise–such as impulse noise–remains largely unexplored. In this paper, we propose a plug-and-play image restoration framework based on generative diffusion priors for robust removal of general noise types, including impulse noise. Within the maximum a posteriori (MAP) estimation framework, the data fidelity term is adapted to the specific noise model. Departing from the conventional least-squares loss used for Gaussian noise, we introduce a generalized Gaussian scale mixture-based loss, which approximates a wide range of noise distributions and leads to an $\ell_q$-norm ($0<q\leq2$) fidelity term. This optimization problem is addressed using an iteratively reweighted least squares (IRLS) approach, wherein the proximal step involving the generative prior is efficiently performed via a diffusion-based denoiser. Experimental results on benchmark datasets demonstrate that the proposed method effectively removes non-Gaussian impulse noise and achieves superior restoration performance.

[346] MUGSQA: Novel Multi-Uncertainty-Based Gaussian Splatting Quality Assessment Method, Dataset, and Benchmarks

Tianang Chen, Jian Jin, Shilv Cai, Zhuangzi Li, Weisi Lin

Main category: cs.CV

TL;DR: Proposes a multi-distance subjective quality assessment method for Gaussian Splatting-based 3D reconstruction and creates MUGSQA dataset with benchmarks to evaluate reconstruction robustness and quality metrics.

Details

Motivation: Assessing perceptual quality of 3D objects reconstructed with different Gaussian Splatting methods remains challenging, requiring better subjective evaluation methods that mimic human viewing behavior.

Method: Developed unified multi-distance subjective quality assessment method and constructed MUGSQA dataset considering multiple input uncertainties (quantity/resolution of views, view distance, point cloud accuracy). Created two benchmarks for reconstruction robustness and quality metrics evaluation.

Result: Created comprehensive dataset and benchmarks for evaluating Gaussian Splatting methods under various input uncertainties, enabling better assessment of perceptual quality and reconstruction robustness.

Conclusion: The proposed assessment method and MUGSQA dataset address the challenge of evaluating GS-based 3D reconstruction quality, providing tools to assess both reconstruction methods and quality metrics under realistic uncertainties.

Abstract: Gaussian Splatting (GS) has recently emerged as a promising technique for 3D object reconstruction, delivering high-quality rendering results with significantly improved reconstruction speed. As variants continue to appear, assessing the perceptual quality of 3D objects reconstructed with different GS-based methods remains an open challenge. To address this issue, we first propose a unified multi-distance subjective quality assessment method that closely mimics human viewing behavior for objects reconstructed with GS-based methods in actual applications, thereby better collecting perceptual experiences. Based on it, we also construct a novel GS quality assessment dataset named MUGSQA, which is constructed considering multiple uncertainties of the input data. These uncertainties include the quantity and resolution of input views, the view distance, and the accuracy of the initial point cloud. Moreover, we construct two benchmarks: one to evaluate the robustness of various GS-based reconstruction methods under multiple uncertainties, and the other to evaluate the performance of existing quality assessment metrics. Our dataset and benchmark code will be released soon.

[347] ConsistTalk: Intensity Controllable Temporally Consistent Talking Head Generation with Diffusion Noise Search

Zhenjie Liu, Jianzhang Lu, Renjie Lu, Cong Liang, Shangfei Wang

Main category: cs.CV

TL;DR: ConsistTalk is a novel talking head generation framework that addresses flickering, identity drift, and poor audio-visual synchronization through optical flow-guided temporal modeling, audio-to-intensity transformation, and diffusion noise search inference.

Details

Motivation: Current video diffusion models for audio-driven portrait animation suffer from flickering, identity drift, and poor audio-visual synchronization due to entangled appearance-motion representations and unstable inference strategies.

Method: 1) Optical flow-guided temporal module (OFT) decouples motion from appearance using facial optical flow. 2) Audio-to-Intensity (A2I) model transforms audio and facial velocity into frame-wise intensity for joint audio-visual modeling. 3) Diffusion noise initialization strategy (IC-Init) enforces background coherence and motion continuity constraints during inference.

Result: Extensive experiments show ConsistTalk significantly outperforms prior methods in reducing flicker, preserving identity, and delivering temporally stable, high-fidelity talking head videos with better audio-visual synchronization.

Conclusion: ConsistTalk achieves superior performance in talking head generation by addressing key limitations through decoupled motion-appearance modeling, intensity-based audio-visual synchronization, and constrained diffusion inference.

Abstract: Recent advancements in video diffusion models have significantly enhanced audio-driven portrait animation. However, current methods still suffer from flickering, identity drift, and poor audio-visual synchronization. These issues primarily stem from entangled appearance-motion representations and unstable inference strategies. In this paper, we introduce \textbf{ConsistTalk}, a novel intensity-controllable and temporally consistent talking head generation framework with diffusion noise search inference. First, we propose \textbf{an optical flow-guided temporal module (OFT)} that decouples motion features from static appearance by leveraging facial optical flow, thereby reducing visual flicker and improving temporal consistency. Second, we present an \textbf{Audio-to-Intensity (A2I) model} obtained through multimodal teacher-student knowledge distillation. By transforming audio and facial velocity features into a frame-wise intensity sequence, the A2I model enables joint modeling of audio and visual motion, resulting in more natural dynamics. This further enables fine-grained, frame-wise control of motion dynamics while maintaining tight audio-visual synchronization. Third, we introduce a \textbf{diffusion noise initialization strategy (IC-Init)}. By enforcing explicit constraints on background coherence and motion continuity during inference-time noise search, we achieve better identity preservation and refine motion dynamics compared to the current autoregressive strategy. Extensive experiments demonstrate that ConsistTalk significantly outperforms prior methods in reducing flicker, preserving identity, and delivering temporally stable, high-fidelity talking head videos.

Qunchao Jin, Yilin Wu, Changhao Chen

Main category: cs.CV

TL;DR: PanoNav is a mapless zero-shot object navigation framework that uses panoramic RGB inputs and memory-guided decision-making to improve navigation performance without depth sensors or prebuilt maps.

Details

Motivation: Existing zero-shot object navigation methods rely on depth sensors or prebuilt maps, limiting MLLMs' spatial reasoning. Mapless approaches often make short-sighted decisions due to lack of historical context, leading to local deadlocks.

Method: Integrates Panoramic Scene Parsing module to extract spatial information from panoramic RGB inputs, and Memory-guided Decision-Making with Dynamic Bounded Memory Queue to incorporate exploration history and avoid local deadlocks.

Result: Significantly outperforms representative baselines on public navigation benchmark in both Success Rate (SR) and Success weighted by Path Length (SPL) metrics.

Conclusion: PanoNav demonstrates that fully RGB-only, mapless zero-shot object navigation is feasible and effective by leveraging panoramic scene parsing and memory-guided decision-making.

Abstract: Zero-shot object navigation (ZSON) in unseen environments remains a challenging problem for household robots, requiring strong perceptual understanding and decision-making capabilities. While recent methods leverage metric maps and Large Language Models (LLMs), they often depend on depth sensors or prebuilt maps, limiting the spatial reasoning ability of Multimodal Large Language Models (MLLMs). Mapless ZSON approaches have emerged to address this, but they typically make short-sighted decisions, leading to local deadlocks due to a lack of historical context. We propose PanoNav, a fully RGB-only, mapless ZSON framework that integrates a Panoramic Scene Parsing module to unlock the spatial parsing potential of MLLMs from panoramic RGB inputs, and a Memory-guided Decision-Making mechanism enhanced by a Dynamic Bounded Memory Queue to incorporate exploration history and avoid local deadlocks. Experiments on the public navigation benchmark show that PanoNav significantly outperforms representative baselines in both SR and SPL metrics.

[349] Aerial Image Stitching Using IMU Data from a UAV

Selim Ahmet Iz, Mustafa Unel

Main category: cs.CV

TL;DR: Novel UAV image stitching method combining IMU data with computer vision for improved accuracy and robustness over feature-based approaches.

Details

Motivation: Address limitations of feature-based image stitching algorithms that suffer from errors in feature detection/matching, especially with UAV aerial photography.

Method: Combines IMU data with computer vision: estimates UAV displacement/rotation between images, corrects perspective distortion, computes homography matrix, then uses standard stitching algorithm.

Result: Method outperforms existing feature-based algorithms in accuracy and reliability, particularly in challenging scenarios with large displacements, rotations, and camera pose variations.

Conclusion: Proposed approach effectively leverages IMU data to enhance image stitching, is robust in difficult conditions, and can be easily integrated into existing UAV workflows.

Abstract: Unmanned Aerial Vehicles (UAVs) are widely used for aerial photography and remote sensing applications. One of the main challenges is to stitch together multiple images into a single high-resolution image that covers a large area. Featurebased image stitching algorithms are commonly used but can suffer from errors and ambiguities in feature detection and matching. To address this, several approaches have been proposed, including using bundle adjustment techniques or direct image alignment. In this paper, we present a novel method that uses a combination of IMU data and computer vision techniques for stitching images captured by a UAV. Our method involves several steps such as estimating the displacement and rotation of the UAV between consecutive images, correcting for perspective distortion, and computing a homography matrix. We then use a standard image stitching algorithm to align and blend the images together. Our proposed method leverages the additional information provided by the IMU data, corrects for various sources of distortion, and can be easily integrated into existing UAV workflows. Our experiments demonstrate the effectiveness and robustness of our method, outperforming some of the existing feature-based image stitching algorithms in terms of accuracy and reliability, particularly in challenging scenarios such as large displacements, rotations, and variations in camera pose.

[350] Gaussian-Augmented Physics Simulation and System Identification with Complex Colliders

Federico Vasile, Ri-Zhao Qiu, Lorenzo Natale, Xiaolong Wang

Main category: cs.CV

TL;DR: AS-DiffMPM is a differentiable MPM framework that enables physical property estimation with arbitrarily shaped colliders, overcoming limitations of previous methods restricted to planar surfaces.

Details

Motivation: Existing differentiable MPM approaches are limited to simplified object-environment interactions with planar colliders and fail in scenarios where objects collide with non-planar surfaces.

Method: Extends differentiable MPM by incorporating a differentiable collision handling mechanism that allows target objects to interact with complex rigid bodies while maintaining end-to-end optimization.

Result: The framework enables physical property estimation with arbitrarily shaped colliders and can be interfaced with various novel view synthesis methods.

Conclusion: AS-DiffMPM provides a comprehensive framework for system identification from visual observations that handles complex object-environment interactions beyond planar surfaces.

Abstract: System identification involving the geometry, appearance, and physical properties from video observations is a challenging task with applications in robotics and graphics. Recent approaches have relied on fully differentiable Material Point Method (MPM) and rendering for simultaneous optimization of these properties. However, they are limited to simplified object-environment interactions with planar colliders and fail in more challenging scenarios where objects collide with non-planar surfaces. We propose AS-DiffMPM, a differentiable MPM framework that enables physical property estimation with arbitrarily shaped colliders. Our approach extends existing methods by incorporating a differentiable collision handling mechanism, allowing the target object to interact with complex rigid bodies while maintaining end-to-end optimization. We show AS-DiffMPM can be easily interfaced with various novel view synthesis methods as a framework for system identification from visual observations.

[351] Distillation Dynamics: Towards Understanding Feature-Based Distillation in Vision Transformers

Huiyuan Tian, Bonan Xu Shijian Li

Main category: cs.CV

TL;DR: Feature distillation fails for ViTs due to representational mismatch between teacher and student models, where teachers use distributed high-dimensional encoding that students can’t replicate, causing negative transfer.

Details

Motivation: To understand why feature-based knowledge distillation works well for CNNs but fails for Vision Transformers (ViTs), often performing worse than simple logit-based distillation.

Method: Developed a “distillation dynamics” analytical framework combining frequency spectrum analysis, information entropy metrics, and activation magnitude tracking to study ViT information processing patterns.

Result: Revealed ViTs exhibit U-shaped information processing (initial compression followed by expansion) and identified representational paradigm mismatch as the root cause of negative transfer in feature distillation.

Conclusion: Successful knowledge transfer in ViTs requires moving beyond naive feature mimicry to methods that respect fundamental representational constraints, providing essential guidance for effective ViT compression strategies.

Abstract: While feature-based knowledge distillation has proven highly effective for compressing CNNs, these techniques unexpectedly fail when applied to Vision Transformers (ViTs), often performing worse than simple logit-based distillation. We provide the first comprehensive analysis of this phenomenon through a novel analytical framework termed as ``distillation dynamics", combining frequency spectrum analysis, information entropy metrics, and activation magnitude tracking. Our investigation reveals that ViTs exhibit a distinctive U-shaped information processing pattern: initial compression followed by expansion. We identify the root cause of negative transfer in feature distillation: a fundamental representational paradigm mismatch between teacher and student models. Through frequency-domain analysis, we show that teacher models employ distributed, high-dimensional encoding strategies in later layers that smaller student models cannot replicate due to limited channel capacity. This mismatch causes late-layer feature alignment to actively harm student performance. Our findings reveal that successful knowledge transfer in ViTs requires moving beyond naive feature mimicry to methods that respect these fundamental representational constraints, providing essential theoretical guidance for designing effective ViTs compression strategies. All source code and experimental logs are provided in the supplementary material.

[352] Ambiguity-aware Truncated Flow Matching for Ambiguous Medical Image Segmentation

Fanding Li, Xiangyu Li, Xianghe Su, Xingyu Qiu, Suyu Dong, Wei Wang, Kuanquan Wang, Gongning Luo, Shuo Li

Main category: cs.CV

TL;DR: Proposes Ambiguity-aware Truncated Flow Matching (ATFM) for medical image segmentation to simultaneously enhance accuracy and diversity of predictions by disentangling these objectives through hierarchical inference and novel model components.

Details

Motivation: Address the challenge of simultaneously enhancing accuracy and diversity in ambiguous medical image segmentation, where existing truncated diffusion probabilistic models suffer from entangled objectives with insufficient fidelity and plausibility.

Method: Three main components: 1) Data-Hierarchical Inference paradigm that enhances accuracy at data-distribution level and diversity at data-sample level; 2) Gaussian Truncation Representation for better fidelity and truncation reliability; 3) Segmentation Flow Matching for enhanced plausibility through semantic-aware flow transformation.

Result: Outperforms SOTA methods on LIDC and ISIC3 datasets, improving GED by up to 12% and HM-IoU by up to 7.3% compared to advanced methods, while achieving more efficient inference.

Conclusion: ATFM successfully addresses the accuracy-diversity trade-off in ambiguous medical image segmentation through its hierarchical inference paradigm and specialized components, demonstrating superior performance and efficiency over existing methods.

Abstract: A simultaneous enhancement of accuracy and diversity of predictions remains a challenge in ambiguous medical image segmentation (AMIS) due to the inherent trade-offs. While truncated diffusion probabilistic models (TDPMs) hold strong potential with a paradigm optimization, existing TDPMs suffer from entangled accuracy and diversity of predictions with insufficient fidelity and plausibility. To address the aforementioned challenges, we propose Ambiguity-aware Truncated Flow Matching (ATFM), which introduces a novel inference paradigm and dedicated model components. Firstly, we propose Data-Hierarchical Inference, a redefinition of AMIS-specific inference paradigm, which enhances accuracy and diversity at data-distribution and data-sample level, respectively, for an effective disentanglement. Secondly, Gaussian Truncation Representation (GTR) is introduced to enhance both fidelity of predictions and reliability of truncation distribution, by explicitly modeling it as a Gaussian distribution at $T_{\text{trunc}}$ instead of using sampling-based approximations.Thirdly, Segmentation Flow Matching (SFM) is proposed to enhance the plausibility of diverse predictions by extending semantic-aware flow transformation in Flow Matching (FM). Comprehensive evaluations on LIDC and ISIC3 datasets demonstrate that ATFM outperforms SOTA methods and simultaneously achieves a more efficient inference. ATFM improves GED and HM-IoU by up to $12%$ and $7.3%$ compared to advanced methods.

[353] PlantTraitNet: An Uncertainty-Aware Multimodal Framework for Global-Scale Plant Trait Inference from Citizen Science Data

Ayushi Sharma, Johanna Trost, Daniel Lusk, Johannes Dollinger, Julian Schrader, Christian Rossi, Javier Lopatin, Etienne Laliberté, Simon Haberstroh, Jana Eichel, Daniel Mederer, Jose Miguel Cerda-Paredes, Shyam S. Phartyal, Lisa-Maricia Schwarz, Anja Linstädter, Maria Conceição Caldeira, Teja Kattenborn

Main category: cs.CV

TL;DR: PlantTraitNet uses citizen science photos and deep learning to create more accurate global maps of plant traits than existing methods, outperforming current trait products across four key traits.

Details

Motivation: Existing plant trait maps are limited by costly field measurements and sparse geographic coverage, while citizen science offers millions of geotagged plant photos that capture valuable visual information on plant morphology and physiology.

Method: Multi-modal, multi-task uncertainty-aware deep learning framework that predicts four key plant traits (plant height, leaf area, specific leaf area, nitrogen content) from citizen science photos using weak supervision, then aggregates predictions across space to generate global trait distribution maps.

Result: PlantTraitNet consistently outperforms existing trait maps across all evaluated traits when validated against independent vegetation survey data (sPlotOpen) and benchmarked against leading global trait products.

Conclusion: Citizen science imagery combined with computer vision and geospatial AI enables scalable and more accurate global trait mapping, offering a powerful new pathway for ecological research and Earth system modeling.

Abstract: Global plant maps of plant traits, such as leaf nitrogen or plant height, are essential for understanding ecosystem processes, including the carbon and energy cycles of the Earth system. However, existing trait maps remain limited by the high cost and sparse geographic coverage of field-based measurements. Citizen science initiatives offer a largely untapped resource to overcome these limitations, with over 50 million geotagged plant photographs worldwide capturing valuable visual information on plant morphology and physiology. In this study, we introduce PlantTraitNet, a multi-modal, multi-task uncertainty-aware deep learning framework that predictsfour key plant traits (plant height, leaf area, specific leaf area, and nitrogen content) from citizen science photos using weak supervision. By aggregating individual trait predictions across space, we generate global maps of trait distributions. We validate these maps against independent vegetation survey data (sPlotOpen) and benchmark them against leading global trait products. Our results show that PlantTraitNet consistently outperforms existing trait maps across all evaluated traits, demonstrating that citizen science imagery, when integrated with computer vision and geospatial AI, enables not only scalable but also more accurate global trait mapping. This approach offers a powerful new pathway for ecological research and Earth system modeling.

[354] VAEVQ: Enhancing Discrete Visual Tokenization through Variational Modeling

Sicheng Yang, Xing Hu, Qiang Wu, Dawei Yang

Main category: cs.CV

TL;DR: VAEVQ improves vector quantization by using variational autoencoders for smoother latent spaces, adaptive alignment between pre/post-quantization features, and distribution consistency regularization to enhance codebook utilization and performance.

Details

Motivation: Traditional VQ methods suffer from non-smooth latent spaces, weak alignment between representations before/after quantization, and poor coherence between continuous/discrete domains, leading to unstable codeword learning and underutilized codebooks.

Method: Proposes VAEVQ with three components: (1) Variational Latent Quantization (VLQ) using VAE instead of AE for quantization, (2) Representation Coherence Strategy (RCS) for adaptive alignment modulation, and (3) Distribution Consistency Regularization (DCR) for codebook distribution alignment.

Result: Extensive experiments on benchmark datasets show VAEVQ outperforms state-of-the-art methods in both reconstruction and downstream generation tasks.

Conclusion: VAEVQ effectively addresses key limitations of traditional VQ methods through variational quantization, adaptive alignment, and distribution consistency, achieving superior performance.

Abstract: Vector quantization (VQ) transforms continuous image features into discrete representations, providing compressed, tokenized inputs for generative models. However, VQ-based frameworks suffer from several issues, such as non-smooth latent spaces, weak alignment between representations before and after quantization, and poor coherence between the continuous and discrete domains. These issues lead to unstable codeword learning and underutilized codebooks, ultimately degrading the performance of both reconstruction and downstream generation tasks. To this end, we propose VAEVQ, which comprises three key components: (1) Variational Latent Quantization (VLQ), replacing the AE with a VAE for quantization to leverage its structured and smooth latent space, thereby facilitating more effective codeword activation; (2) Representation Coherence Strategy (RCS), adaptively modulating the alignment strength between pre- and post-quantization features to enhance consistency and prevent overfitting to noise; and (3) Distribution Consistency Regularization (DCR), aligning the entire codebook distribution with the continuous latent distribution to improve utilization. Extensive experiments on two benchmark datasets demonstrate that VAEVQ outperforms state-of-the-art methods.

[355] From Attribution to Action: Jointly ALIGNing Predictions and Explanations

Dongsheng Hong, Chao Chen, Yanhui Chen, Shanshan Lin, Zhihao Chen, Xiangwen Liao

Main category: cs.CV

TL;DR: ALIGN is a novel framework that jointly trains a classifier and masker to improve model interpretability and generalization without relying on external annotations or noisy supervision signals.

Details

Motivation: Existing explanation-guided learning methods rely on external annotations or heuristic-based segmentation, which are noisy, imprecise, and difficult to scale, potentially degrading model performance.

Method: ALIGN jointly trains a classifier and masker iteratively - the masker produces soft task-relevant masks highlighting informative regions, while the classifier optimizes for both prediction accuracy and alignment between its saliency maps and learned masks.

Result: ALIGN consistently outperforms six strong baselines on VLCS and Terra Incognita benchmarks in both in-distribution and out-of-distribution settings, while also yielding superior explanation quality in terms of sufficiency and comprehensiveness.

Conclusion: ALIGN effectively improves both interpretability and generalizability by leveraging high-quality masks as guidance, producing accurate and interpretable models without relying on external supervision.

Abstract: Explanation-guided learning (EGL) has shown promise in aligning model predictions with interpretable reasoning, particularly in computer vision tasks. However, most approaches rely on external annotations or heuristic-based segmentation to supervise model explanations, which can be noisy, imprecise and difficult to scale. In this work, we provide both empirical and theoretical evidence that low-quality supervision signals can degrade model performance rather than improve it. In response, we propose ALIGN, a novel framework that jointly trains a classifier and a masker in an iterative manner. The masker learns to produce soft, task-relevant masks that highlight informative regions, while the classifier is optimized for both prediction accuracy and alignment between its saliency maps and the learned masks. By leveraging high-quality masks as guidance, ALIGN improves both interpretability and generalizability, showing its superiority across various settings. Experiments on the two domain generalization benchmarks, VLCS and Terra Incognita, show that ALIGN consistently outperforms six strong baselines in both in-distribution and out-of-distribution settings. Besides, ALIGN also yields superior explanation quality concerning sufficiency and comprehensiveness, highlighting its effectiveness in producing accurate and interpretable models.

[356] Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions

Eyal Gutflaish, Eliran Kachlon, Hezi Zisman, Tal Hacham, Nimrod Sarid, Alexander Visheratin, Saar Huberman, Gal Davidi, Guy Bukchin, Kfir Goldberg, Ron Mokady

Main category: cs.CV

TL;DR: FIBO is the first open-source text-to-image model trained on long structured captions, using DimFusion for efficient processing and achieving SOTA prompt alignment through fine-grained attribute control.

Details

Motivation: Address the gap between sparse text prompts and rich visual outputs in text-to-image models, which reduces controllability and biases toward average preferences, limiting professional use.

Method: Train on long structured captions with consistent fine-grained attributes, propose DimFusion for efficient long caption processing, and introduce Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol.

Result: FIBO achieves state-of-the-art prompt alignment among open-source models, demonstrating improved controllability and expressiveness through fine-grained attribute control.

Conclusion: Training on long structured captions with consistent attributes enables better controllability and expressiveness in text-to-image generation, addressing limitations of traditional short-prompt models.

Abstract: Text-to-image models have rapidly evolved from casual creative tools to professional-grade systems, achieving unprecedented levels of image quality and realism. Yet, most models are trained to map short prompts into detailed images, creating a gap between sparse textual input and rich visual outputs. This mismatch reduces controllability, as models often fill in missing details arbitrarily, biasing toward average user preferences and limiting precision for professional use. We address this limitation by training the first open-source text-to-image model on long structured captions, where every training sample is annotated with the same set of fine-grained attributes. This design maximizes expressive coverage and enables disentangled control over visual factors. To process long captions efficiently, we propose DimFusion, a fusion mechanism that integrates intermediate tokens from a lightweight LLM without increasing token length. We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol. By assessing how well real images can be reconstructed through a captioning-generation loop, TaBR directly measures controllability and expressiveness, even for very long captions where existing evaluation methods fail. Finally, we demonstrate our contributions by training the large-scale model FIBO, achieving state-of-the-art prompt alignment among open-source models. Model weights are publicly available at https://huggingface.co/briaai/FIBO

[357] FoCLIP: A Feature-Space Misalignment Framework for CLIP-Based Image Manipulation and Detection

Yulin Chen, Zeyuan Wang, Tianyuan Yu, Yingmei Wei, Liang Bai

Main category: cs.CV

TL;DR: FoCLIP is a framework that fools CLIP-based image quality metrics by creating feature-space misalignment through adversarial optimization, while maintaining high visual fidelity and enabling detection via color channel sensitivity analysis.

Details

Motivation: CLIP-based metrics like CLIPscore are widely used but vulnerable due to their delicate multimodal alignment, creating a need to understand and exploit these vulnerabilities for both attack and defense purposes.

Method: Uses stochastic gradient descent with three components: feature alignment to reduce modality gaps, score distribution balance, and pixel-guard regularization to optimize multimodal output equilibrium between CLIPscore and image quality.

Result: Achieves significant CLIPscore improvement while preserving visual fidelity; discovers grayscale conversion degrades fooling images; proposes detection method with 91% accuracy using color channel sensitivity analysis.

Conclusion: Establishes practical pathway for feature misalignment in CLIP-based systems and corresponding defense mechanisms, demonstrating both attack vulnerability and effective detection method.

Abstract: The well-aligned attribute of CLIP-based models enables its effective application like CLIPscore as a widely adopted image quality assessment metric. However, such a CLIP-based metric is vulnerable for its delicate multimodal alignment. In this work, we propose \textbf{FoCLIP}, a feature-space misalignment framework for fooling CLIP-based image quality metric. Based on the stochastic gradient descent technique, FoCLIP integrates three key components to construct fooling examples: feature alignment as the core module to reduce image-text modality gaps, the score distribution balance module and pixel-guard regularization, which collectively optimize multimodal output equilibrium between CLIPscore performance and image quality. Such a design can be engineered to maximize the CLIPscore predictions across diverse input prompts, despite exhibiting either visual unrecognizability or semantic incongruence with the corresponding adversarial prompts from human perceptual perspectives. Experiments on ten artistic masterpiece prompts and ImageNet subsets demonstrate that optimized images can achieve significant improvement in CLIPscore while preserving high visual fidelity. In addition, we found that grayscale conversion induces significant feature degradation in fooling images, exhibiting noticeable CLIPscore reduction while preserving statistical consistency with original images. Inspired by this phenomenon, we propose a color channel sensitivity-driven tampering detection mechanism that achieves 91% accuracy on standard benchmarks. In conclusion, this work establishes a practical pathway for feature misalignment in CLIP-based multimodal systems and the corresponding defense method.

[358] A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models

Jan-Hendrik Koch, Jonas Krumme, Konrad Gadzicki

Main category: cs.CV

TL;DR: A two-stage system using LLM for layout generation and layout-conditioned diffusion models for image synthesis to achieve precise control over object counts and spatial arrangements in text-to-image generation.

Details

Motivation: Text-to-image diffusion models lack precise control over object counts and spatial arrangements, limiting their compositional capabilities for complex scene generation.

Method: Two-stage approach: 1) LLM generates structured layout from object lists with task decomposition (core objects first, then rule-based completion), 2) Layout-conditioned diffusion models (ControlNet vs GLIGEN) synthesize images adhering to layouts after domain-specific finetuning.

Result: Task decomposition improved object recall from 57.2% to 99.9% for complex scenes. ControlNet preserves text-based stylistic control but suffers object hallucination, while GLIGEN provides superior layout fidelity but reduced prompt controllability.

Conclusion: The decoupled two-stage approach successfully generates images with specified object counts and plausible spatial arrangements, demonstrating viability for compositionally controlled synthesis.

Abstract: Text-to-image diffusion models exhibit remarkable generative capabilities, but lack precise control over object counts and spatial arrangements. This work introduces a two-stage system to address these compositional limitations. The first stage employs a Large Language Model (LLM) to generate a structured layout from a list of objects. The second stage uses a layout-conditioned diffusion model to synthesize a photorealistic image adhering to this layout. We find that task decomposition is critical for LLM-based spatial planning; by simplifying the initial generation to core objects and completing the layout with rule-based insertion, we improve object recall from 57.2% to 99.9% for complex scenes. For image synthesis, we compare two leading conditioning methods: ControlNet and GLIGEN. After domain-specific finetuning on table-setting datasets, we identify a key trade-off: ControlNet preserves text-based stylistic control but suffers from object hallucination, while GLIGEN provides superior layout fidelity at the cost of reduced prompt-based controllability. Our end-to-end system successfully generates images with specified object counts and plausible spatial arrangements, demonstrating the viability of a decoupled approach for compositionally controlled synthesis.

[359] Adaptive Morph-Patch Transformer for Arotic Vessel Segmentation

Zhenxi Zhang, Fuchen Zheng, Adnan Iltaf, Yifei Han, Zhenyu Cheng, Yue Du, Bin Li, Tianyong Liu, Shoujun Zhou

Main category: cs.CV

TL;DR: Proposes MPT, a Transformer-based model with adaptive morphology-aware patches and semantic clustering attention for improved aortic vascular segmentation.

Details

Motivation: Traditional Transformer models use fixed rectangular patches that damage complex vascular structure integrity, leading to poor segmentation accuracy.

Method: Adaptive patch partitioning generates morphology-aware patches aligned with vascular structures, plus Semantic Clustering Attention to aggregate features from semantically similar patches.

Result: Achieves state-of-the-art performance on three datasets (AVT, AortaSeg24, TBAD), with significant improvements in segmenting intricate vascular structures.

Conclusion: MPT effectively preserves vascular structure integrity through adaptive patch partitioning and semantic feature aggregation, outperforming existing methods.

Abstract: Accurate segmentation of aortic vascular structures is critical for diagnosing and treating cardiovascular diseases.Traditional Transformer-based models have shown promise in this domain by capturing long-range dependencies between vascular features. However, their reliance on fixed-size rectangular patches often influences the integrity of complex vascular structures, leading to suboptimal segmentation accuracy. To address this challenge, we propose the adaptive Morph Patch Transformer (MPT), a novel architecture specifically designed for aortic vascular segmentation. Specifically, MPT introduces an adaptive patch partitioning strategy that dynamically generates morphology-aware patches aligned with complex vascular structures. This strategy can preserve semantic integrity of complex vascular structures within individual patches. Moreover, a Semantic Clustering Attention (SCA) method is proposed to dynamically aggregate features from various patches with similar semantic characteristics. This method enhances the model’s capability to segment vessels of varying sizes, preserving the integrity of vascular structures. Extensive experiments on three open-source dataset(AVT, AortaSeg24 and TBAD) demonstrate that MPT achieves state-of-the-art performance, with improvements in segmenting intricate vascular structures.

[360] TrueCity: Real and Simulated Urban Data for Cross-Domain 3D Scene Understanding

Duc Nguyen, Yan-Ling Lai, Qilin Zhang, Prabin Gyawali, Benedikt Schwab, Olaf Wysocki, Thomas H. Kolbe

Main category: cs.CV

TL;DR: TrueCity is the first urban semantic segmentation benchmark with synchronized real and simulated point clouds for analyzing synthetic-to-real domain gap in 3D scene understanding.

Details

Motivation: Limited real-world annotated data for 3D semantic scene understanding and the synthetic-to-real domain gap in existing synthetic datasets that fail to capture real-world complexity and sensor noise.

Method: Introduces TrueCity benchmark with cm-accurate annotated real-world point clouds, semantic 3D city models, and annotated simulated point clouds representing the same city, using segmentation classes aligned with international 3D city modeling standards.

Result: Extensive experiments on common baselines quantify domain shift and highlight strategies for exploiting synthetic data to enhance real-world 3D scene understanding.

Conclusion: TrueCity dataset will foster development of sim-to-real gap quantification and enable generalizable data-driven models for 3D semantic scene understanding.

Abstract: 3D semantic scene understanding remains a long-standing challenge in the 3D computer vision community. One of the key issues pertains to limited real-world annotated data to facilitate generalizable models. The common practice to tackle this issue is to simulate new data. Although synthetic datasets offer scalability and perfect labels, their designer-crafted scenes fail to capture real-world complexity and sensor noise, resulting in a synthetic-to-real domain gap. Moreover, no benchmark provides synchronized real and simulated point clouds for segmentation-oriented domain shift analysis. We introduce TrueCity, the first urban semantic segmentation benchmark with cm-accurate annotated real-world point clouds, semantic 3D city models, and annotated simulated point clouds representing the same city. TrueCity proposes segmentation classes aligned with international 3D city modeling standards, enabling consistent evaluation of synthetic-to-real gap. Our extensive experiments on common baselines quantify domain shift and highlight strategies for exploiting synthetic data to enhance real-world 3D scene understanding. We are convinced that the TrueCity dataset will foster further development of sim-to-real gap quantification and enable generalizable data-driven models. The data, code, and 3D models are available online: https://tum-gis.github.io/TrueCity/

[361] Classification of Microplastic Particles in Water using Polarized Light Scattering and Machine Learning Methods

Leonard Saur, Marc von Pawlowski, Ulrich Gengenbach, Ingo Sieber, Hossein Shirali, Lorenz Wührl, Rainer Kiko, Christian Pylatiuk

Main category: cs.CV

TL;DR: A reflection-based method using polarized light scattering and CNN classification achieves 80% accuracy for identifying microplastics in water, with AOLP signals being more robust than DOLP signals.

Details

Motivation: Address the limitations of gold-standard methods for continuous, large-scale microplastic monitoring in aquatic environments by developing an in-situ classification approach.

Method: Use polarized laser light illumination and polarization-sensitive camera to capture reflected signals from microplastics, then apply deep convolutional neural network for image-based classification of polymer types.

Result: Achieved 80% mean classification accuracy for three polymer types (HDPE, LDPE, PP), with AOLP signals showing better noise robustness and polyethylene distinction, while DOLP signals excel at polypropylene identification.

Conclusion: The reflection-based polarized light scattering method is effective for in-situ microplastic classification, with complementary strengths between AOLP and DOLP signals for different polymer identification tasks.

Abstract: Facing the critical need for continuous, large-scale microplastic monitoring, which is hindered by the limitations of gold-standard methods in aquatic environments, this paper introduces and validates a novel, reflection-based approach for the in-situ classification and identification of microplastics directly in water bodies, which is based on polarized light scattering. In this experiment, we classify colorless microplastic particles (50-300 $μ$m) by illuminating them with linearly polarized laser light and capturing their reflected signals using a polarization-sensitive camera. This reflection-based technique successfully circumvents the transmission-based interference issues that plague many conventional methods when applied in water. Using a deep convolutional neural network (CNN) for image-based classification, we successfully identified three common polymer types, high-density polyethylene, low-density polyethylene, and polypropylene, achieving a peak mean classification accuracy of 80% on the test dataset. A subsequent feature hierarchy analysis demonstrated that the CNN’s decision-making process relies mainly on the microstructural integrity and internal texture (polarization patterns) of the particle rather than its macroshape. Critically, we found that the Angle of Linear Polarization (AOLP) signal is significantly more robust against contextual noise than the Degree of Linear Polarization (DOLP) signal. While the AOLP-based classification achieved superior overall performance, its strength lies in distinguishing between the two polyethylene plastics, showing a lower confusion rate between high-density and low-density polyethylene. Conversely, the DOLP signal demonstrated slightly worse overall classification results but excels at accurately identifying the polypropylene class, which it isolated with greater success than AOLP.

[362] DTTNet: Improving Video Shadow Detection via Dark-Aware Guidance and Tokenized Temporal Modeling

Zhicheng Li, Kunyang Sun, Rui Yao, Hancheng Zhu, Fuyuan Hu, Jiaqi Zhao, Zhiwen Shao, Yong Zhou

Main category: cs.CV

TL;DR: A novel video shadow detection method that uses vision-language matching and temporal tokenization to distinguish shadows from complex backgrounds and model dynamic shadow deformations efficiently.

Details

Motivation: Video shadow detection faces challenges in distinguishing shadows from complex backgrounds and modeling dynamic shadow deformations under varying illumination conditions.

Method: Proposes Vision-language Match Module (VMM) and Dark-aware Semantic Block (DSB) for text-guided feature extraction, adaptive mask reweighting for penumbra regions, and Tokenized Temporal Block (TTB) for efficient spatiotemporal learning using temporal tokens.

Result: Achieves state-of-the-art accuracy on multiple benchmark datasets with real-time inference efficiency.

Conclusion: The proposed DTTNet effectively addresses shadow-background ambiguity and temporal modeling challenges through vision-language fusion and efficient token-based temporal encoding.

Abstract: Video shadow detection confronts two entwined difficulties: distinguishing shadows from complex backgrounds and modeling dynamic shadow deformations under varying illumination. To address shadow-background ambiguity, we leverage linguistic priors through the proposed Vision-language Match Module (VMM) and a Dark-aware Semantic Block (DSB), extracting text-guided features to explicitly differentiate shadows from dark objects. Furthermore, we introduce adaptive mask reweighting to downweight penumbra regions during training and apply edge masks at the final decoder stage for better supervision. For temporal modeling of variable shadow shapes, we propose a Tokenized Temporal Block (TTB) that decouples spatiotemporal learning. TTB summarizes cross-frame shadow semantics into learnable temporal tokens, enabling efficient sequence encoding with minimal computation overhead. Comprehensive Experiments on multiple benchmark datasets demonstrate state-of-the-art accuracy and real-time inference efficiency. Codes are available at https://github.com/city-cheng/DTTNet.

[363] PADM: A Physics-aware Diffusion Model for Attenuation Correction

Trung Kien Pham, Hoang Minh Vu, Anh Duc Chu, Dac Thai Nguyen, Trung Thanh Nguyen, Thao Nguyen Truong, Mai Hong Son, Thanh Trung Nguyen, Phi Le Nguyen

Main category: cs.CV

TL;DR: PADM is a diffusion-based model that corrects attenuation artifacts in cardiac SPECT imaging using only non-attenuation-corrected input, eliminating the need for expensive CT systems through physics-aware training.

Details

Motivation: Attenuation artifacts in cardiac SPECT MPI compromise diagnostic accuracy. Hybrid SPECT/CT systems are expensive, have limited accessibility, and add radiation exposure, hindering widespread clinical adoption.

Method: Proposed Physics-aware Attenuation Correction Diffusion Model (PADM) that incorporates explicit physics priors via teacher-student distillation mechanism. Uses only NAC input while benefiting from physics-informed supervision during training.

Result: PADM outperforms state-of-the-art generative models, delivering superior reconstruction fidelity across both quantitative metrics and visual assessment. Introduced CardiAC dataset with 424 patient studies.

Conclusion: PADM provides an effective CT-free solution for attenuation correction in cardiac SPECT, addressing cost and accessibility limitations of hybrid SPECT/CT systems while maintaining high reconstruction quality.

Abstract: Attenuation artifacts remain a significant challenge in cardiac Myocardial Perfusion Imaging (MPI) using Single-Photon Emission Computed Tomography (SPECT), often compromising diagnostic accuracy and reducing clinical interpretability. While hybrid SPECT/CT systems mitigate these artifacts through CT-derived attenuation maps, their high cost, limited accessibility, and added radiation exposure hinder widespread clinical adoption. In this study, we propose a novel CT-free solution to attenuation correction in cardiac SPECT. Specifically, we introduce Physics-aware Attenuation Correction Diffusion Model (PADM), a diffusion-based generative method that incorporates explicit physics priors via a teacher–student distillation mechanism. This approach enables attenuation artifact correction using only Non-Attenuation-Corrected (NAC) input, while still benefiting from physics-informed supervision during training. To support this work, we also introduce CardiAC, a comprehensive dataset comprising 424 patient studies with paired NAC and Attenuation-Corrected (AC) reconstructions, alongside high-resolution CT-based attenuation maps. Extensive experiments demonstrate that PADM outperforms state-of-the-art generative models, delivering superior reconstruction fidelity across both quantitative metrics and visual assessment.

[364] GFix: Perceptually Enhanced Gaussian Splatting Video Compression

Siyue Teng, Ge Gao, Duolikun Danier, Yuxuan Jiang, Fan Zhang, Thomas Davis, Zoe Liu, David Bull

Main category: cs.CV

TL;DR: GFix is a content-adaptive framework that uses a single-step diffusion model to enhance perceptual quality in 3DGS-based video compression, achieving significant BD-rate savings.

Details

Motivation: Existing 3DGS-based video codecs suffer from noticeable visual artifacts and low compression ratios, while artifacts from 3DGS rendering and quantization resemble noisy latents in diffusion training.

Method: Proposes GFix with a streamlined single-step diffusion model as neural enhancer and a modulated LoRA scheme that freezes low-rank decompositions and modulates hidden states for efficient adaptation.

Result: GFix outperforms GSVC with up to 72.1% BD-rate savings in LPIPS and 21.4% in FID, delivering strong perceptual quality enhancement.

Conclusion: The proposed GFix framework effectively enhances perceptual quality in 3DGS-based video compression through diffusion-based enhancement and efficient adaptation techniques.

Abstract: 3D Gaussian Splatting (3DGS) enhances 3D scene reconstruction through explicit representation and fast rendering, demonstrating potential benefits for various low-level vision tasks, including video compression. However, existing 3DGS-based video codecs generally exhibit more noticeable visual artifacts and relatively low compression ratios. In this paper, we specifically target the perceptual enhancement of 3DGS-based video compression, based on the assumption that artifacts from 3DGS rendering and quantization resemble noisy latents sampled during diffusion training. Building on this premise, we propose a content-adaptive framework, GFix, comprising a streamlined, single-step diffusion model that serves as an off-the-shelf neural enhancer. Moreover, to increase compression efficiency, We propose a modulated LoRA scheme that freezes the low-rank decompositions and modulates the intermediate hidden states, thereby achieving efficient adaptation of the diffusion backbone with highly compressible updates. Experimental results show that GFix delivers strong perceptual quality enhancement, outperforming GSVC with up to 72.1% BD-rate savings in LPIPS and 21.4% in FID.

[365] Pandar128 dataset for lane line detection

Filip Beránek, Václav Diviš, Ivan Gruber

Main category: cs.CV

TL;DR: Pandar128 is the largest public LiDAR lane detection dataset with 52K camera frames and 34K LiDAR scans, accompanied by SimpleLidarLane baseline method and IAM-F1 evaluation metric.

Details

Motivation: To address the lack of large-scale public datasets and standardized evaluation for LiDAR-based lane detection in diverse real-world conditions.

Method: Created Pandar128 dataset with full sensor calibration and odometry; developed SimpleLidarLane method using BEV segmentation, clustering, and polyline fitting; proposed IAM-F1 metric for polyline-based evaluation.

Result: Dataset captured in diverse German conditions; baseline method achieves strong performance despite simplicity; new metric enables principled evaluation.

Conclusion: High-quality data, modular pipelines, and standardized evaluation can compete with complex approaches in LiDAR lane detection.

Abstract: We present Pandar128, the largest public dataset for lane line detection using a 128-beam LiDAR. It contains over 52,000 camera frames and 34,000 LiDAR scans, captured in diverse real-world conditions in Germany. The dataset includes full sensor calibration (intrinsics, extrinsics) and synchronized odometry, supporting tasks such as projection, fusion, and temporal modeling. To complement the dataset, we also introduce SimpleLidarLane, a light-weight baseline method for lane line reconstruction that combines BEV segmentation, clustering, and polyline fitting. Despite its simplicity, our method achieves strong performance under challenging various conditions (e.g., rain, sparse returns), showing that modular pipelines paired with high-quality data and principled evaluation can compete with more complex approaches. Furthermore, to address the lack of standardized evaluation, we propose a novel polyline-based metric - Interpolation-Aware Matching F1 (IAM-F1) - that employs interpolation-aware lateral matching in BEV space. All data and code are publicly released to support reproducibility in LiDAR-based lane detection.

[366] Learning from the Right Patches: A Two-Stage Wavelet-Driven Masked Autoencoder for Histopathology Representation Learning

Raneen Younis, Louay Hamdi, Lukas Chavez, Zahra Ahmadi

Main category: cs.CV

TL;DR: WISE-MAE introduces a wavelet-informed patch selection strategy for MAE pretraining in digital pathology, using coarse-to-fine processing to focus on structurally rich tissue regions and improve representation learning.

Details

Motivation: Conventional random patch sampling in MAE pretraining often includes irrelevant or noisy regions in whole-slide images, limiting the model's ability to capture meaningful tissue patterns in digital pathology.

Method: A two-step coarse-to-fine process: wavelet-based screening at low magnification to locate structurally rich regions, followed by high-resolution extraction for detailed modeling, mirroring pathologists’ diagnostic workflow.

Result: Evaluations across multiple cancer datasets (lung, renal, colorectal) show WISE-MAE achieves competitive representation quality and downstream classification performance while maintaining efficiency under weak supervision.

Conclusion: The wavelet-informed patch selection strategy effectively brings structure and biological relevance into MAE-based learning for digital pathology, improving learned representations while maintaining computational efficiency.

Abstract: Whole-slide images are central to digital pathology, yet their extreme size and scarce annotations make self-supervised learning essential. Masked Autoencoders (MAEs) with Vision Transformer backbones have recently shown strong potential for histopathology representation learning. However, conventional random patch sampling during MAE pretraining often includes irrelevant or noisy regions, limiting the model’s ability to capture meaningful tissue patterns. In this paper, we present a lightweight and domain-adapted framework that brings structure and biological relevance into MAE-based learning through a wavelet-informed patch selection strategy. WISE-MAE applies a two-step coarse-to-fine process: wavelet-based screening at low magnification to locate structurally rich regions, followed by high-resolution extraction for detailed modeling. This approach mirrors the diagnostic workflow of pathologists and improves the quality of learned representations. Evaluations across multiple cancer datasets, including lung, renal, and colorectal tissues, show that WISE-MAE achieves competitive representation quality and downstream classification performance while maintaining efficiency under weak supervision.

[367] Exploring the “Great Unseen” in Medieval Manuscripts: Instance-Level Labeling of Legacy Image Collections with Zero-Shot Models

Christofer Meinecke, Estelle Guéville, David Joseph Wrisley

Main category: cs.CV

TL;DR: Using advanced techniques to segment and describe medieval manuscript pages to create better training data for computer vision and multimodal models.

Details

Motivation: To develop a more holistic theoretical approach to medieval manuscript pages and their contents.

Method: Employ state-of-the-art techniques for segmenting and describing entire manuscript folios.

Result: Creation of richer training data specifically for medieval visual content analysis.

Conclusion: Enhanced training data will improve computer vision techniques like instance segmentation and multimodal models for medieval manuscripts.

Abstract: We aim to theorize the medieval manuscript page and its contents more holistically, using state-of-the-art techniques to segment and describe the entire manuscript folio, for the purpose of creating richer training data for computer vision techniques, namely instance segmentation, and multimodal models for medieval-specific visual content.

[368] How Bias Binds: Measuring Hidden Associations for Bias Control in Text-to-Image Compositions

Jeng-Lin Li, Ming-Ching Chang, Wei-Chao Chen

Main category: cs.CV

TL;DR: This paper investigates bias amplification in text-to-image models through semantic binding between objects and attributes, introduces a bias adherence score, and proposes a training-free debiasing framework that improves compositional generation by over 10%.

Details

Motivation: Current bias research focuses narrowly on single-object prompts, neglecting how semantic binding between objects and attributes can amplify biases in text-to-image generation, leading to failures in existing debiasing approaches.

Method: Developed a bias adherence score to quantify bias activation in object-attribute bindings, and created a training-free context-bias control framework using token decoupling to debias semantic bindings.

Result: The framework achieved over 10% debiasing improvement in compositional generation tasks, revealing that bias distribution can be amplified through contextual associations between objects and attributes.

Conclusion: Current debiasing approaches have critical limitations when applied to semantically bound contexts, requiring reassessment of bias mitigation strategies to reduce bias without disrupting essential semantic relationships.

Abstract: Text-to-image generative models often exhibit bias related to sensitive attributes. However, current research tends to focus narrowly on single-object prompts with limited contextual diversity. In reality, each object or attribute within a prompt can contribute to bias. For example, the prompt “an assistant wearing a pink hat” may reflect female-inclined biases associated with a pink hat. The neglected joint effects of the semantic binding in the prompts cause significant failures in current debiasing approaches. This work initiates a preliminary investigation on how bias manifests under semantic binding, where contextual associations between objects and attributes influence generative outcomes. We demonstrate that the underlying bias distribution can be amplified based on these associations. Therefore, we introduce a bias adherence score that quantifies how specific object-attribute bindings activate bias. To delve deeper, we develop a training-free context-bias control framework to explore how token decoupling can facilitate the debiasing of semantic bindings. This framework achieves over 10% debiasing improvement in compositional generation tasks. Our analysis of bias scores across various attribute-object bindings and token decorrelation highlights a fundamental challenge: reducing bias without disrupting essential semantic relationships. These findings expose critical limitations in current debiasing approaches when applied to semantically bound contexts, underscoring the need to reassess prevailing bias mitigation strategies.

[369] Performance Decay in Deepfake Detection: The Limitations of Training on Outdated Data

Jack Richings, Margaux Leblanc, Ian Groves, Victoria Nockles

Main category: cs.CV

TL;DR: A simple two-stage deepfake detection method achieves 99.8% AUROC but performance decays rapidly (30% recall drop) with new generation techniques, highlighting the need for continuous data curation and frame-level artifact detection.

Details

Motivation: The increasing quality of deepfake technology poses growing threats of disinformation, fraud, and harassment, making synthetic content harder to distinguish from reality.

Method: A simple yet effective two-stage detection method that achieves high performance on contemporary deepfakes.

Result: Achieves 99.8% AUROC on current deepfakes but suffers 30% recall drop when tested on deepfakes from just six months later, showing significant performance decay as threats evolve.

Conclusion: Robust deepfake detection requires ongoing curation of large, diverse datasets and development of advanced frame-level feature detectors, as predictive power comes primarily from static frame-level artifacts rather than temporal inconsistencies.

Abstract: The continually advancing quality of deepfake technology exacerbates the threats of disinformation, fraud, and harassment by making maliciously-generated synthetic content increasingly difficult to distinguish from reality. We introduce a simple yet effective two-stage detection method that achieves an AUROC of over 99.8% on contemporary deepfakes. However, this high performance is short-lived. We show that models trained on this data suffer a recall drop of over 30% when evaluated on deepfakes created with generation techniques from just six months later, demonstrating significant decay as threats evolve. Our analysis reveals two key insights for robust detection. Firstly, continued performance requires the ongoing curation of large, diverse datasets. Second, predictive power comes primarily from static, frame-level artifacts, not temporal inconsistencies. The future of effective deepfake detection therefore depends on rapid data collection and the development of advanced frame-level feature detectors.

[370] Certified L2-Norm Robustness of 3D Point Cloud Recognition in the Frequency Domain

Liang Zhou, Qiming Wang, Tianze Chen

Main category: cs.CV

TL;DR: FreqCert is a frequency-domain certification framework for 3D point cloud classifiers that provides certified robustness against L2-bounded perturbations through graph Fourier transform and structured subsampling.

Details

Motivation: Existing certified defenses for point cloud classifiers only handle point-wise perturbations but fail against subtle geometric distortions that preserve individual points while altering overall structure, creating safety risks in applications like autonomous driving.

Method: Transforms point clouds via graph Fourier transform, applies frequency-aware subsampling based on spectral similarity to generate multiple sub-point clouds, then uses majority voting of independent classifications.

Result: Achieves higher certified accuracy and empirical accuracy under strong perturbations on ModelNet40 and ScanObjectNN datasets compared to existing methods.

Conclusion: Spectral representations provide an effective pathway for certifiable robustness in 3D point cloud recognition, with FreqCert establishing theoretical foundations for frequency domain certification.

Abstract: 3D point cloud classification is a fundamental task in safety-critical applications such as autonomous driving, robotics, and augmented reality. However, recent studies reveal that point cloud classifiers are vulnerable to structured adversarial perturbations and geometric corruptions, posing risks to their deployment in safety-critical scenarios. Existing certified defenses limit point-wise perturbations but overlook subtle geometric distortions that preserve individual points yet alter the overall structure, potentially leading to misclassification. In this work, we propose FreqCert, a novel certification framework that departs from conventional spatial domain defenses by shifting robustness analysis to the frequency domain, enabling structured certification against global L2-bounded perturbations. FreqCert first transforms the input point cloud via the graph Fourier transform (GFT), then applies structured frequency-aware subsampling to generate multiple sub-point clouds. Each sub-cloud is independently classified by a standard model, and the final prediction is obtained through majority voting, where sub-clouds are constructed based on spectral similarity rather than spatial proximity, making the partitioning more stable under L2 perturbations and better aligned with the object’s intrinsic structure. We derive a closed-form lower bound on the certified L2 robustness radius and prove its tightness under minimal and interpretable assumptions, establishing a theoretical foundation for frequency domain certification. Extensive experiments on the ModelNet40 and ScanObjectNN datasets demonstrate that FreqCert consistently achieves higher certified accuracy and empirical accuracy under strong perturbations. Our results suggest that spectral representations provide an effective pathway toward certifiable robustness in 3D point cloud recognition.

[371] GEWDiff: Geometric Enhanced Wavelet-based Diffusion Model for Hyperspectral Image Super-resolution

Sirui Wang, Jiang He, Natàlia Blasco Andreo, Xiao Xiang Zhu

Main category: cs.CV

TL;DR: Proposes GEWDiff, a geometric enhanced wavelet-based diffusion model for 4x hyperspectral image super-resolution that addresses memory constraints, geometric structure preservation, and convergence issues.

Details

Motivation: Hyperspectral images are memory-intensive for conventional diffusion models, lack geometric understanding of ground objects, and suffer from non-intuitive convergence behavior with noise-level loss functions.

Method: Uses wavelet-based encoder-decoder for efficient compression, geometry-enhanced diffusion process to preserve geometric features, and multi-level loss function for stable convergence.

Result: Achieves state-of-the-art results across fidelity, spectral accuracy, visual realism, and clarity metrics for 4x hyperspectral image super-resolution.

Conclusion: GEWDiff effectively addresses key challenges in hyperspectral image generation through wavelet compression, geometric enhancement, and multi-level loss optimization.

Abstract: Improving the quality of hyperspectral images (HSIs), such as through super-resolution, is a crucial research area. However, generative modeling for HSIs presents several challenges. Due to their high spectral dimensionality, HSIs are too memory-intensive for direct input into conventional diffusion models. Furthermore, general generative models lack an understanding of the topological and geometric structures of ground objects in remote sensing imagery. In addition, most diffusion models optimize loss functions at the noise level, leading to a non-intuitive convergence behavior and suboptimal generation quality for complex data. To address these challenges, we propose a Geometric Enhanced Wavelet-based Diffusion Model (GEWDiff), a novel framework for reconstructing hyperspectral images at 4-times super-resolution. A wavelet-based encoder-decoder is introduced that efficiently compresses HSIs into a latent space while preserving spectral-spatial information. To avoid distortion during generation, we incorporate a geometry-enhanced diffusion process that preserves the geometric features. Furthermore, a multi-level loss function was designed to guide the diffusion process, promoting stable convergence and improved reconstruction fidelity. Our model demonstrated state-of-the-art results across multiple dimensions, including fidelity, spectral accuracy, visual realism, and clarity.

[372] 3D-ANC: Adaptive Neural Collapse for Robust 3D Point Cloud Recognition

Yuanmin Huang, Wenxuan Li, Mi Zhang, Xiaohan Zhang, Xiaoyu You, Min Yang

Main category: cs.CV

TL;DR: 3D-ANC is a novel defense method that uses Neural Collapse to create disentangled feature spaces, significantly improving 3D point cloud model robustness against adversarial attacks by addressing class imbalance and geometric similarities.

Details

Motivation: Deep neural networks for 3D point cloud recognition are vulnerable to adversarial attacks, and conventional defenses struggle with evolving attack patterns due to entangled feature spaces that make attacks easier to perform.

Method: Leverages Neural Collapse mechanism with ETF-aligned classification module and adaptive training framework including representation-balanced learning (RBL) and dynamic feature direction loss (FDL) to address class imbalance and geometric similarities in 3D data.

Result: Significantly improves model robustness - DGCNN’s classification accuracy increased from 27.2% to 80.9% on ModelNet40 (53.7% absolute gain), surpassing leading baselines by 34.0%. Works effectively across various model structures on multiple datasets.

Conclusion: 3D-ANC successfully creates disentangled feature spaces that enhance adversarial robustness in 3D point cloud recognition, overcoming challenges of class imbalance and complex geometric relationships through Neural Collapse-based approach.

Abstract: Deep neural networks have recently achieved notable progress in 3D point cloud recognition, yet their vulnerability to adversarial perturbations poses critical security challenges in practical deployments. Conventional defense mechanisms struggle to address the evolving landscape of multifaceted attack patterns. Through systematic analysis of existing defenses, we identify that their unsatisfactory performance primarily originates from an entangled feature space, where adversarial attacks can be performed easily. To this end, we present 3D-ANC, a novel approach that capitalizes on the Neural Collapse (NC) mechanism to orchestrate discriminative feature learning. In particular, NC depicts where last-layer features and classifier weights jointly evolve into a simplex equiangular tight frame (ETF) arrangement, establishing maximally separable class prototypes. However, leveraging this advantage in 3D recognition confronts two substantial challenges: (1) prevalent class imbalance in point cloud datasets, and (2) complex geometric similarities between object categories. To tackle these obstacles, our solution combines an ETF-aligned classification module with an adaptive training framework consisting of representation-balanced learning (RBL) and dynamic feature direction loss (FDL). 3D-ANC seamlessly empowers existing models to develop disentangled feature spaces despite the complexity in 3D data distribution. Comprehensive evaluations state that 3D-ANC significantly improves the robustness of models with various structures on two datasets. For instance, DGCNN’s classification accuracy is elevated from 27.2% to 80.9% on ModelNet40 – a 53.7% absolute gain that surpasses leading baselines by 34.0%.

[373] From Pretrain to Pain: Adversarial Vulnerability of Video Foundation Models Without Task Knowledge

Hui Lu, Yi Yu, Song Xia, Yiming Yang, Deepu Rajan, Boon Poh Ng, Alex Kot, Xudong Jiang

Main category: cs.CV

TL;DR: Proposes TVA, a temporal-aware adversarial attack method that exploits video foundation models to attack downstream models without access to victim tasks, data, or architecture.

Details

Motivation: Open accessibility of Video Foundation Models introduces security risks where adversaries can exploit model knowledge to attack downstream applications without direct access.

Method: TVA uses bidirectional contrastive learning to maximize feature discrepancy and temporal consistency loss with motion cues to enhance sequential perturbation impact.

Result: Extensive experiments across 24 video tasks show TVA effectively attacks downstream models and MLLMs, revealing significant security vulnerabilities.

Conclusion: TVA demonstrates practical security threats in video model deployment, enabling efficient attacks without expensive surrogate models or domain-specific data access.

Abstract: Large-scale Video Foundation Models (VFMs) has significantly advanced various video-related tasks, either through task-specific models or Multi-modal Large Language Models (MLLMs). However, the open accessibility of VFMs also introduces critical security risks, as adversaries can exploit full knowledge of the VFMs to launch potent attacks. This paper investigates a novel and practical adversarial threat scenario: attacking downstream models or MLLMs fine-tuned from open-source VFMs, without requiring access to the victim task, training data, model query, and architecture. In contrast to conventional transfer-based attacks that rely on task-aligned surrogate models, we demonstrate that adversarial vulnerabilities can be exploited directly from the VFMs. To this end, we propose the Transferable Video Attack (TVA), a temporal-aware adversarial attack method that leverages the temporal representation dynamics of VFMs to craft effective perturbations. TVA integrates a bidirectional contrastive learning mechanism to maximize the discrepancy between the clean and adversarial features, and introduces a temporal consistency loss that exploits motion cues to enhance the sequential impact of perturbations. TVA avoids the need to train expensive surrogate models or access to domain-specific data, thereby offering a more practical and efficient attack strategy. Extensive experiments across 24 video-related tasks demonstrate the efficacy of TVA against downstream models and MLLMs, revealing a previously underexplored security vulnerability in the deployment of video models.

[374] Federated Learning for Video Violence Detection: Complementary Roles of Lightweight CNNs and Vision-Language Models for Energy-Efficient Use

Sébastien Thuau, Siba Haidar, Rachid Chelouah

Main category: cs.CV

TL;DR: This paper compares federated learning approaches for violence detection, finding that 3D CNNs offer better energy efficiency while VLMs provide richer reasoning, suggesting hybrid deployment strategies.

Details

Motivation: Address the need for privacy-preserving video surveillance with low computational and environmental overhead, particularly the energy challenges of deploying large vision-language models in federated learning settings.

Method: Compared three federated strategies on realistic non-IID splits: zero-shot inference with pretrained VLMs, LoRA-based fine-tuning of LLaVA-NeXT-Video-7B, and personalized federated learning of a 65.8M-parameter 3D CNN. Also used hierarchical category grouping for VLM multiclass accuracy improvement.

Result: All methods exceeded 90% accuracy in binary violence detection. 3D CNN achieved superior calibration (ROC AUC 92.59%) at roughly half the energy cost (240 Wh vs. 570 Wh) of federated LoRA. Hierarchical category grouping boosted VLM multiclass accuracy from 65.31% to 81% on UCF-Crime dataset.

Conclusion: Hybrid deployment strategies are recommended: default to efficient CNNs for routine inference and selectively engage VLMs for complex contextual reasoning, balancing energy efficiency with multimodal reasoning capabilities.

Abstract: Deep learning-based video surveillance increasingly demands privacy-preserving architectures with low computational and environmental overhead. Federated learning preserves privacy but deploying large vision-language models (VLMs) introduces major energy and sustainability challenges. We compare three strategies for federated violence detection under realistic non-IID splits on the RWF-2000 and RLVS datasets: zero-shot inference with pretrained VLMs, LoRA-based fine-tuning of LLaVA-NeXT-Video-7B, and personalized federated learning of a 65.8M-parameter 3D CNN. All methods exceed 90% accuracy in binary violence detection. The 3D CNN achieves superior calibration (ROC AUC 92.59%) at roughly half the energy cost (240 Wh vs. 570 Wh) of federated LoRA, while VLMs provide richer multimodal reasoning. Hierarchical category grouping (based on semantic similarity and class exclusion) boosts VLM multiclass accuracy from 65.31% to 81% on the UCF-Crime dataset. To our knowledge, this is the first comparative simulation study of LoRA-tuned VLMs and personalized CNNs for federated violence detection, with explicit energy and CO2e quantification. Our results inform hybrid deployment strategies that default to efficient CNNs for routine inference and selectively engage VLMs for complex contextual reasoning.

[375] Improving Deepfake Detection with Reinforcement Learning-Based Adaptive Data Augmentation

Yuxuan Zhou, Tao Yu, Wen Huang, Yuheng Zhang, Tao Dai, Shu-Tao Xia

Main category: cs.CV

TL;DR: CRDA is a dynamic data augmentation framework that uses reinforcement learning and causal inference to progressively train deepfake detectors on increasingly complex forgery features, improving cross-domain generalization.

Details

Motivation: Current deepfake detectors rely on static augmentation strategies that cannot adapt to the evolving complexity and diversity of real-world forgery techniques, limiting their generalization capability.

Method: Proposes CRDA framework with: 1) Configurable pool of forgery operations, 2) RL agent that dynamically selects augmentation actions based on detector performance, 3) Causal inference to suppress spurious correlations and focus on causally invariant features, 4) Progressive curriculum from simple to complex forgeries.

Result: Extensive experiments show CRDA significantly improves detector generalizability, outperforming state-of-the-art methods across multiple cross-domain datasets.

Conclusion: Dynamic, curriculum-based augmentation guided by RL and causal inference is more effective than static approaches for training robust deepfake detectors that generalize well to unseen forgery types.

Abstract: The generalization capability of deepfake detectors is critical for real-world use. Data augmentation via synthetic fake face generation effectively enhances generalization, yet current SoTA methods rely on fixed strategies-raising a key question: Is a single static augmentation sufficient, or does the diversity of forgery features demand dynamic approaches? We argue existing methods overlook the evolving complexity of real-world forgeries (e.g., facial warping, expression manipulation), which fixed policies cannot fully simulate. To address this, we propose CRDA (Curriculum Reinforcement-Learning Data Augmentation), a novel framework guiding detectors to progressively master multi-domain forgery features from simple to complex. CRDA synthesizes augmented samples via a configurable pool of forgery operations and dynamically generates adversarial samples tailored to the detector’s current learning state. Central to our approach is integrating reinforcement learning (RL) and causal inference. An RL agent dynamically selects augmentation actions based on detector performance to efficiently explore the vast augmentation space, adapting to increasingly challenging forgeries. Simultaneously, the agent introduces action space variations to generate heterogeneous forgery patterns, guided by causal inference to mitigate spurious correlations-suppressing task-irrelevant biases and focusing on causally invariant features. This integration ensures robust generalization by decoupling synthetic augmentation patterns from the model’s learned representations. Extensive experiments show our method significantly improves detector generalizability, outperforming SOTA methods across multiple cross-domain datasets.

[376] RaLD: Generating High-Resolution 3D Radar Point Clouds with Latent Diffusion

Ruijie Zhang, Bixin Zeng, Shengpeng Wang, Fuhui Zhou, Wei Wang

Main category: cs.CV

TL;DR: RaLD is a framework that generates dense 3D point clouds from sparse millimeter-wave radar data using latent diffusion models with scene-level frustum-based LiDAR autoencoding and direct radar spectrum conditioning.

Details

Motivation: Millimeter-wave radar is robust and low-cost but produces sparse, low-resolution point clouds that limit 3D perception tasks. Existing generative approaches use inefficient dense voxel representations that struggle with structural detail preservation.

Method: Integrates scene-level frustum-based LiDAR autoencoding, order-invariant latent representations, and direct radar spectrum conditioning to create a compact and expressive generation process using latent diffusion models.

Result: Experiments show RaLD produces dense and accurate 3D point clouds from raw radar spectrums, enabling robust perception in challenging environments.

Conclusion: RaLD offers a promising solution for generating high-quality 3D point clouds from sparse radar data, addressing limitations of current radar-based 3D perception methods.

Abstract: Millimeter-wave radar offers a promising sensing modality for autonomous systems thanks to its robustness in adverse conditions and low cost. However, its utility is significantly limited by the sparsity and low resolution of radar point clouds, which poses challenges for tasks requiring dense and accurate 3D perception. Despite that recent efforts have shown great potential by exploring generative approaches to address this issue, they often rely on dense voxel representations that are inefficient and struggle to preserve structural detail. To fill this gap, we make the key observation that latent diffusion models (LDMs), though successful in other modalities, have not been effectively leveraged for radar-based 3D generation due to a lack of compatible representations and conditioning strategies. We introduce RaLD, a framework that bridges this gap by integrating scene-level frustum-based LiDAR autoencoding, order-invariant latent representations, and direct radar spectrum conditioning. These insights lead to a more compact and expressive generation process. Experiments show that RaLD produces dense and accurate 3D point clouds from raw radar spectrums, offering a promising solution for robust perception in challenging environments.

[377] ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora

Nikolas Adaloglou, Diana Petrusheva, Mohamed Asker, Felix Michels, Markus Kollmann

Main category: cs.CV

TL;DR: ClusterMine enables unsupervised OOD detection by mining positive labels from text corpora using visual clustering and zero-shot image-text consistency, eliminating the need for pre-defined in-distribution labels.

Details

Motivation: Current OOD detection methods rely on pre-defined in-distribution labels which are often unavailable, unreliable at scale, or become irrelevant due to distribution shifts after deployment.

Method: Proposes ClusterMine which extracts positive concepts from text corpora by combining visual clustering for sample consistency and zero-shot image-text consistency from CLIP models.

Result: Achieves state-of-the-art OOD detection performance without access to positive labels, scalable across CLIP models, and robust to covariate in-distribution shifts.

Conclusion: ClusterMine enables truly unsupervised OOD detection by leveraging widely available text corpora, providing a practical solution for real-world deployment scenarios.

Abstract: Large-scale visual out-of-distribution (OOD) detection has witnessed remarkable progress by leveraging vision-language models such as CLIP. However, a significant limitation of current methods is their reliance on a pre-defined set of in-distribution (ID) ground-truth label names (positives). These fixed label names can be unavailable, unreliable at scale, or become less relevant due to in-distribution shifts after deployment. Towards truly unsupervised OOD detection, we utilize widely available text corpora for positive label mining, bypassing the need for positives. In this paper, we utilize widely available text corpora for positive label mining under a general concept mining paradigm. Within this framework, we propose ClusterMine, a novel positive label mining method. ClusterMine is the first method to achieve state-of-the-art OOD detection performance without access to positive labels. It extracts positive concepts from a large text corpus by combining visual-only sample consistency (via clustering) and zero-shot image-text consistency. Our experimental study reveals that ClusterMine is scalable across a plethora of CLIP models and achieves state-of-the-art robustness to covariate in-distribution shifts. The code is available at https://github.com/HHU-MMBS/clustermine_wacv_official.

[378] LeCoT: revisiting network architecture for two-view correspondence pruning

Luanyuan Dai, Xiaoyu Du, Jinhui Tang

Main category: cs.CV

TL;DR: LeCoT is a novel two-view correspondence pruning network that uses Spatial-Channel Fusion Transformer blocks to capture global context information without extra modules, outperforming state-of-the-art methods across multiple vision tasks.

Details

Motivation: Current methods use MLP backbones with additional modules to handle context information, which is a limitation of MLPs. The authors aim to capture correspondence context information more naturally without extra design modules.

Method: Proposed LeCoT network with Spatial-Channel Fusion Transformer blocks that efficiently utilize both spatial and channel global context information. Also includes a prediction block that uses intermediate correspondence features to generate probability sets for guiding subsequent learning phases.

Result: Extensive experiments show LeCoT outperforms state-of-the-art methods in correspondence pruning, relative pose estimation, homography estimation, visual localization, and 3D reconstruction tasks.

Conclusion: LeCoT provides an effective approach for two-view correspondence pruning by naturally leveraging global context information at different stages through novel transformer blocks and progressive probability refinement.

Abstract: Two-view correspondence pruning aims to accurately remove incorrect correspondences (outliers) from initial ones and is widely applied to various computer vision tasks. Current popular strategies adopt multilayer perceptron (MLP) as the backbone, supplemented by additional modules to enhance the network ability to handle context information, which is a known limitation of MLPs. In contrast, we introduce a novel perspective for capturing correspondence context information without extra design modules. To this end, we design a two-view correspondence pruning network called LeCoT, which can naturally leverage global context information at different stages. Specifically, the core design of LeCoT is the Spatial-Channel Fusion Transformer block, a newly proposed component that efficiently utilizes both spatial and channel global context information among sparse correspondences. In addition, we integrate the proposed prediction block that utilizes correspondence features from intermediate stages to generate a probability set, which acts as guiding information for subsequent learning phases, allowing the network to more effectively capture robust global context information. Notably, this prediction block progressively refines the probability set, thereby mitigating the issue of information loss that is common in the traditional one. Extensive experiments prove that the proposed LeCoT outperforms state-of-the-art methods in correspondence pruning, relative pose estimation, homography estimation, visual localization, and $3$D~reconstruction tasks. The code is provided in https://github.com/Dailuanyuan2024/LeCoT-Revisiting-Network-Architecture-for-Two-View-Correspondence-Pruning.

[379] Leveraging Text-Driven Semantic Variation for Robust OOD Segmentation

Seungheon Song, Jaekoo Lee

Main category: cs.CV

TL;DR: A novel vision-language approach for out-of-distribution (OOD) segmentation in autonomous driving that leverages linguistic cues to detect anomalous objects on roads, achieving state-of-the-art performance.

Details

Motivation: Current OOD segmentation methods for autonomous driving underutilize rich linguistic knowledge from vision-language spaces, which could significantly improve safety and decision-making in complex real-world driving scenarios.

Method: Combines vision-language model encoder with transformer decoder, uses Distance-Based OOD prompts at varying semantic distances from in-distribution classes, and employs OOD Semantic Augmentation for OOD representations to align visual and textual information.

Result: Achieves state-of-the-art performance on Fishyscapes, Segment-Me-If-You-Can, and Road Anomaly datasets across both pixel-level and object-level evaluations.

Conclusion: Vision-language-based OOD segmentation shows strong potential to enhance safety and reliability in autonomous driving systems by effectively generalizing to unseen objects in diverse driving environments.

Abstract: In autonomous driving and robotics, ensuring road safety and reliable decision-making critically depends on out-of-distribution (OOD) segmentation. While numerous methods have been proposed to detect anomalous objects on the road, leveraging the vision-language space-which provides rich linguistic knowledge-remains an underexplored field. We hypothesize that incorporating these linguistic cues can be especially beneficial in the complex contexts found in real-world autonomous driving scenarios. To this end, we present a novel approach that trains a Text-Driven OOD Segmentation model to learn a semantically diverse set of objects in the vision-language space. Concretely, our approach combines a vision-language model’s encoder with a transformer decoder, employs Distance-Based OOD prompts located at varying semantic distances from in-distribution (ID) classes, and utilizes OOD Semantic Augmentation for OOD representations. By aligning visual and textual information, our approach effectively generalizes to unseen objects and provides robust OOD segmentation in diverse driving environments. We conduct extensive experiments on publicly available OOD segmentation datasets such as Fishyscapes, Segment-Me-If-You-Can, and Road Anomaly datasets, demonstrating that our approach achieves state-of-the-art performance across both pixel-level and object-level evaluations. This result underscores the potential of vision-language-based OOD segmentation to bolster the safety and reliability of future autonomous driving systems.

[380] HENet++: Hybrid Encoding and Multi-task Learning for 3D Perception and End-to-end Autonomous Driving

Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Ming-Hsuan Yang

Main category: cs.CV

TL;DR: HENet++ is a multi-task 3D perception framework for autonomous driving that uses hybrid image encoding (large encoder for short-term, small for long-term frames) and extracts both dense/sparse features to achieve state-of-the-art performance on nuScenes benchmarks.

Details

Motivation: Address computational constraints and feature representation conflicts in autonomous driving systems where large encoders, high-resolution images, and temporal inputs improve performance but are incompatible with resource limitations, and different tasks require distinct feature representations.

Method: Hybrid image encoding network with large encoder for short-term frames and small encoder for long-term frames; simultaneous extraction of dense and sparse features; compatible with various 3D feature extraction methods and multimodal inputs.

Result: Achieves state-of-the-art end-to-end multi-task 3D perception results on nuScenes benchmark; attains lowest collision rate on nuScenes end-to-end autonomous driving benchmark.

Conclusion: The proposed HENet++ framework effectively addresses computational constraints and feature representation issues in multi-task autonomous driving systems, delivering superior performance while maintaining compatibility with existing methods.

Abstract: Three-dimensional feature extraction is a critical component of autonomous driving systems, where perception tasks such as 3D object detection, bird’s-eye-view (BEV) semantic segmentation, and occupancy prediction serve as important constraints on 3D features. While large image encoders, high-resolution images, and long-term temporal inputs can significantly enhance feature quality and deliver remarkable performance gains, these techniques are often incompatible in both training and inference due to computational resource constraints. Moreover, different tasks favor distinct feature representations, making it difficult for a single model to perform end-to-end inference across multiple tasks while maintaining accuracy comparable to that of single-task models. To alleviate these issues, we present the HENet and HENet++ framework for multi-task 3D perception and end-to-end autonomous driving. Specifically, we propose a hybrid image encoding network that uses a large image encoder for short-term frames and a small one for long-term frames. Furthermore, our framework simultaneously extracts both dense and sparse features, providing more suitable representations for different tasks, reducing cumulative errors, and delivering more comprehensive information to the planning module. The proposed architecture maintains compatibility with various existing 3D feature extraction methods and supports multimodal inputs. HENet++ achieves state-of-the-art end-to-end multi-task 3D perception results on the nuScenes benchmark, while also attaining the lowest collision rate on the nuScenes end-to-end autonomous driving benchmark.

[381] MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

Tianhao Peng, Haochen Wang, Yuanxing Zhang, Zekun Wang, Zili Wang, Ge Zhang, Jian Yang, Shihao Li, Yanghai Wang, Xintao Wang, Houyi Li, Wei Ji, Pengfei Wan, Wenhao Huang, Zhaoxiang Zhang, Jiaheng Liu

Main category: cs.CV

TL;DR: MVU-Eval is the first comprehensive benchmark for evaluating Multi-Video Understanding in MLLMs, addressing the gap in existing single-video benchmarks with 1,824 QA pairs across 4,959 videos covering 8 core competencies.

Details

Motivation: Existing MLLM evaluation benchmarks are limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world applications like sports analytics and autonomous driving.

Method: Created MVU-Eval benchmark with 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, assessing 8 core competencies including fundamental perception and high-order reasoning tasks aligned with real-world applications.

Result: Extensive evaluation of state-of-the-art open-source and closed-source models revealed significant performance discrepancies and limitations in current MLLMs’ ability to perform understanding across multiple videos.

Conclusion: MVU-Eval addresses a critical gap in MLLM evaluation and will be publicly available to foster future research in multi-video understanding capabilities.

Abstract: The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g., sports analytics and autonomous driving). To address this significant gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. Specifically, our MVU-Eval mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, addressing both fundamental perception tasks and high-order reasoning tasks. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. Through extensive evaluation of state-of-the-art open-source and closed-source models, we reveal significant performance discrepancies and limitations in current MLLMs’ ability to perform understanding across multiple videos. The benchmark will be made publicly available to foster future research.

[382] Sparse4DGS: 4D Gaussian Splatting for Sparse-Frame Dynamic Scene Reconstruction

Changyue Shi, Chuxiao Yang, Xinyuan Hu, Minghao Chen, Wenwen Pan, Yan Yang, Jiajun Ding, Zhou Yu, Jun Yu

Main category: cs.CV

TL;DR: Sparse4DGS enables dynamic 4D scene reconstruction from sparse-frame inputs using texture-aware regularization and optimization techniques.

Details

Motivation: Existing dynamic Gaussian Splatting methods require dense-frame videos, but real-world scenarios often only provide sparse frames due to equipment constraints.

Method: Proposes Texture-Aware Deformation Regularization with texture-based depth alignment loss for Gaussian deformation, and Texture-Aware Canonical Optimization with texture-based noise for canonical Gaussian field optimization.

Result: Outperforms existing dynamic and few-shot techniques on multiple datasets (NeRF-Synthetic, HyperNeRF, NeRF-DS, iPhone-4D) when using sparse frames as input.

Conclusion: Sparse4DGS successfully addresses the challenge of sparse-frame dynamic scene reconstruction by focusing on texture-rich areas with specialized regularization and optimization methods.

Abstract: Dynamic Gaussian Splatting approaches have achieved remarkable performance for 4D scene reconstruction. However, these approaches rely on dense-frame video sequences for photorealistic reconstruction. In real-world scenarios, due to equipment constraints, sometimes only sparse frames are accessible. In this paper, we propose Sparse4DGS, the first method for sparse-frame dynamic scene reconstruction. We observe that dynamic reconstruction methods fail in both canonical and deformed spaces under sparse-frame settings, especially in areas with high texture richness. Sparse4DGS tackles this challenge by focusing on texture-rich areas. For the deformation network, we propose Texture-Aware Deformation Regularization, which introduces a texture-based depth alignment loss to regulate Gaussian deformation. For the canonical Gaussian field, we introduce Texture-Aware Canonical Optimization, which incorporates texture-based noise into the gradient descent process of canonical Gaussians. Extensive experiments show that when taking sparse frames as inputs, our method outperforms existing dynamic or few-shot techniques on NeRF-Synthetic, HyperNeRF, NeRF-DS, and our iPhone-4D datasets.

[383] MPJudge: Towards Perceptual Assessment of Music-Induced Paintings

Shiqi Jiang, Tianyi Liang, Changbo Wang, Chenhui Li

Main category: cs.CV

TL;DR: A novel framework for assessing music-induced paintings by modeling perceptual coherence between music and visual art, using a new dataset and a model that integrates music features into visual encoding.

Details

Motivation: Existing methods rely on emotion recognition models which introduce noise and miss broader perceptual cues beyond just emotion, making it challenging to evaluate if paintings faithfully reflect their musical inspiration.

Method: Created MPD dataset (first large-scale music-painting pairs annotated by experts), developed MPJudge model that integrates music features into visual encoder via modulation-based fusion, and used Direct Preference Optimization for ambiguous cases.

Result: Extensive experiments show the method outperforms existing approaches, and qualitative results demonstrate more accurate identification of music-relevant regions in paintings.

Conclusion: The proposed framework effectively addresses limitations of emotion-based approaches by directly modeling perceptual coherence between music and visual art, providing better assessment of music-induced paintings.

Abstract: Music induced painting is a unique artistic practice, where visual artworks are created under the influence of music. Evaluating whether a painting faithfully reflects the music that inspired it poses a challenging perceptual assessment task. Existing methods primarily rely on emotion recognition models to assess the similarity between music and painting, but such models introduce considerable noise and overlook broader perceptual cues beyond emotion. To address these limitations, we propose a novel framework for music induced painting assessment that directly models perceptual coherence between music and visual art. We introduce MPD, the first large scale dataset of music painting pairs annotated by domain experts based on perceptual coherence. To better handle ambiguous cases, we further collect pairwise preference annotations. Building on this dataset, we present MPJudge, a model that integrates music features into a visual encoder via a modulation based fusion mechanism. To effectively learn from ambiguous cases, we adopt Direct Preference Optimization for training. Extensive experiments demonstrate that our method outperforms existing approaches. Qualitative results further show that our model more accurately identifies music relevant regions in paintings.

[384] Glioma C6: A Novel Dataset for Training and Benchmarking Cell Segmentation

Roman Malashin, Svetlana Pashkevich, Daniil Ilyukhin, Arseniy Volkov, Valeria Yachnaya, Andrey Denisov, Maria Mikhalkova

Main category: cs.CV

TL;DR: Glioma C6 is a new open dataset for instance segmentation of glioma C6 cells with 75 high-resolution images and over 12,000 annotated cells, serving as both benchmark and training resource for deep learning models.

Details

Motivation: To provide a realistic testbed for biomedical image analysis and enhance the utilization of image data for cancer cell research through morphological categorization.

Method: Created a dataset with 75 phase-contrast microscopy images containing over 12,000 annotated cells, including soma annotations and morphological categorization by biologists. The dataset has two parts: one for benchmarking with controlled parameters and another for generalization testing under varying conditions.

Result: Evaluation of generalist segmentation models showed limitations on this dataset. Training on Glioma C6 significantly improved segmentation performance, demonstrating its value for developing robust models.

Conclusion: Glioma C6 serves as a valuable resource for developing and benchmarking instance segmentation models in biomedical imaging, with publicly available data for researchers.

Abstract: We present Glioma C6, a new open dataset for instance segmentation of glioma C6 cells, designed as both a benchmark and a training resource for deep learning models. The dataset comprises 75 high-resolution phase-contrast microscopy images with over 12,000 annotated cells, providing a realistic testbed for biomedical image analysis. It includes soma annotations and morphological cell categorization provided by biologists. Additional categorization of cells, based on morphology, aims to enhance the utilization of image data for cancer cell research. Glioma C6 consists of two parts: the first is curated with controlled parameters for benchmarking, while the second supports generalization testing under varying conditions. We evaluate the performance of several generalist segmentation models, highlighting their limitations on our dataset. Our experiments demonstrate that training on Glioma C6 significantly enhances segmentation performance, reinforcing its value for developing robust and generalizable models. The dataset is publicly available for researchers.

[385] ProcGen3D: Learning Neural Procedural Graph Representations for Image-to-3D Reconstruction

Xinyi Zhang, Daoyi Gao, Naiqi Li, Angela Dai

Main category: cs.CV

TL;DR: ProcGen3D generates 3D content by creating procedural graph abstractions that decode into complex 3D assets, using transformer-based prediction with MCTS guidance for image-faithful reconstructions.

Details

Motivation: Inspired by procedural generators in production 3D applications, to enable efficient and controllable 3D content creation from images.

Method: Uses sequentialized graph-based procedural representation with edge-based tokenization, trains transformer prior to predict next tokens from RGB images, and incorporates MCTS-guided sampling for better image alignment.

Result: Outperforms state-of-the-art generative 3D methods and domain-specific modeling techniques on cacti, trees, and bridges, with improved generalization to real-world images despite synthetic-only training.

Conclusion: ProcGen3D provides an effective approach for neural procedural graph generation that enables high-quality 3D reconstruction from images across various object categories.

Abstract: We introduce ProcGen3D, a new approach for 3D content creation by generating procedural graph abstractions of 3D objects, which can then be decoded into rich, complex 3D assets. Inspired by the prevalent use of procedural generators in production 3D applications, we propose a sequentialized, graph-based procedural graph representation for 3D assets. We use this to learn to approximate the landscape of a procedural generator for image-based 3D reconstruction. We employ edge-based tokenization to encode the procedural graphs, and train a transformer prior to predict the next token conditioned on an input RGB image. Crucially, to enable better alignment of our generated outputs to an input image, we incorporate Monte Carlo Tree Search (MCTS) guided sampling into our generation process, steering output procedural graphs towards more image-faithful reconstructions. Our approach is applicable across a variety of objects that can be synthesized with procedural generators. Extensive experiments on cacti, trees, and bridges show that our neural procedural graph generation outperforms both state-of-the-art generative 3D methods and domain-specific modeling techniques. Furthermore, this enables improved generalization on real-world input images, despite training only on synthetic data.

[386] LiteUpdate: A Lightweight Framework for Updating AI-Generated Image Detectors

Jiajie Lu, Zhenkan Fu, Na Zhao, Long Xing, Kejiang Chen, Weiming Zhang, Nenghai Yu

Main category: cs.CV

TL;DR: LiteUpdate is a lightweight framework that efficiently updates AI-generated image detectors to adapt to new generators while preventing catastrophic forgetting, using representative sample selection and model merging techniques.

Details

Motivation: Existing AI-generated image detection methods struggle to keep up with rapidly evolving generative models, causing significant performance degradation and highlighting the need for efficient detector updates.

Method: Uses representative sample selection based on image confidence and gradient features to select boundary samples, and model merging that fuses weights from multiple fine-tuning trajectories (pre-trained, representative, and random updates).

Result: Significantly improves detection performance - on AIDE, average accuracy on Midjourney increased from 87.63% to 93.03% (6.16% relative improvement).

Conclusion: LiteUpdate effectively addresses the challenges of detector updates by balancing adaptability to new generators with preservation of prior knowledge, substantially boosting detection performance across various detectors.

Abstract: The rapid progress of generative AI has led to the emergence of new generative models, while existing detection methods struggle to keep pace, resulting in significant degradation in the detection performance. This highlights the urgent need for continuously updating AI-generated image detectors to adapt to new generators. To overcome low efficiency and catastrophic forgetting in detector updates, we propose LiteUpdate, a lightweight framework for updating AI-generated image detectors. LiteUpdate employs a representative sample selection module that leverages image confidence and gradient-based discriminative features to precisely select boundary samples. This approach improves learning and detection accuracy on new distributions with limited generated images, significantly enhancing detector update efficiency. Additionally, LiteUpdate incorporates a model merging module that fuses weights from multiple fine-tuning trajectories, including pre-trained, representative, and random updates. This balances the adaptability to new generators and mitigates the catastrophic forgetting of prior knowledge. Experiments demonstrate that LiteUpdate substantially boosts detection performance in various detectors. Specifically, on AIDE, the average detection accuracy on Midjourney improved from 87.63% to 93.03%, a 6.16% relative increase.

[387] Automated Estimation of Anatomical Risk Metrics for Endoscopic Sinus Surgery Using Deep Learning

Konrad Reuter, Lennart Thaysen, Bilkay Doruk, Sarah Latus, Brigitte Holst, Benjamin Becker, Dennis Eggert, Christian Betz, Anna-Sophie Hoffmann, Alexander Schlaefer

Main category: cs.CV

TL;DR: Automated deep learning pipeline for estimating anatomical risk scores in endoscopic sinus surgery using landmark localization via heatmap regression.

Details

Motivation: Manual measurement of anatomical risk scores (Keros, Gera, TMS) on CT/CBCT scans is time-consuming, requiring automation for efficient preoperative assessment.

Method: Deep learning pipeline using heatmap regression to localize key anatomical landmarks, comparing direct approach with global-to-local learning strategy.

Result: Achieved mean absolute errors of 0.506mm for Keros, 4.516° for Gera, and 0.802mm/0.777mm for TMS classification.

Conclusion: The automated pipeline provides accurate estimation of anatomical risk scores, potentially improving efficiency in preoperative planning for endoscopic sinus surgery.

Abstract: Endoscopic sinus surgery requires careful preoperative assessment of the skull base anatomy to minimize risks such as cerebrospinal fluid leakage. Anatomical risk scores like the Keros, Gera and Thailand-Malaysia-Singapore score offer a standardized approach but require time-consuming manual measurements on coronal CT or CBCT scans. We propose an automated deep learning pipeline that estimates these risk scores by localizing key anatomical landmarks via heatmap regression. We compare a direct approach to a specialized global-to-local learning strategy and find mean absolute errors on the relevant anatomical measurements of 0.506mm for the Keros, 4.516° for the Gera and 0.802mm / 0.777mm for the TMS classification.

[388] Geometric implicit neural representations for signed distance functions

Luiz Schirmer, Tiago Novello, Vinícius da Silva, Guilherme Schardong, Daniel Perazzo, Hélio Lopes, Nuno Gonçalves, Luiz Velho

Main category: cs.CV

TL;DR: This survey reviews geometric implicit neural representations (INRs) for signed distance functions (SDFs), focusing on incorporating differential geometry tools like normals and curvatures in loss functions to improve 3D surface reconstruction from oriented point clouds or posed images.

Details

Motivation: INRs show promise for signal representation, but standard approaches may not fully leverage geometric properties. The motivation is to enhance SDF approximation by incorporating differential geometry constraints to ensure the neural representation satisfies fundamental mathematical properties like unit gradient.

Method: The approach uses geometric INRs that add regularization terms in loss functions based on differential geometry tools (normals, curvatures). Key components include INR definition, geometric loss function construction, and sampling schemes from a differential geometry perspective.

Result: Geometric INRs enable significant advancements in surface reconstruction from both oriented point clouds and posed images by ensuring the learned SDFs satisfy proper geometric properties through regularization.

Conclusion: Incorporating differential geometry constraints into implicit neural representations for SDFs provides a powerful framework for improved 3D surface reconstruction, with geometric regularization terms playing a crucial role in ensuring mathematical correctness of the learned representations.

Abstract: \textit{Implicit neural representations} (INRs) have emerged as a promising framework for representing signals in low-dimensional spaces. This survey reviews the existing literature on the specialized INR problem of approximating \textit{signed distance functions} (SDFs) for surface scenes, using either oriented point clouds or a set of posed images. We refer to neural SDFs that incorporate differential geometry tools, such as normals and curvatures, in their loss functions as \textit{geometric} INRs. The key idea behind this 3D reconstruction approach is to include additional \textit{regularization} terms in the loss function, ensuring that the INR satisfies certain global properties that the function should hold – such as having unit gradient in the case of SDFs. We explore key methodological components, including the definition of INR, the construction of geometric loss functions, and sampling schemes from a differential geometry perspective. Our review highlights the significant advancements enabled by geometric INRs in surface reconstruction from oriented point clouds and posed images.

[389] LMM-IQA: Image Quality Assessment for Low-Dose CT Imaging

Kagan Celik, Mehmet Ozan Unal, Metin Ertas, Isa Yildirim

Main category: cs.CV

TL;DR: Proposes an LLM-based system for assessing low-dose CT image quality, generating both numerical scores and textual descriptions of degradations like noise, blur, and contrast loss, with various inference strategies showing progressive performance improvements.

Details

Motivation: Low-dose CT improves patient safety but introduces noise, blur, and contrast loss that can reduce diagnostic quality, making consistent and robust image quality assessment essential for clinical applications.

Method: LLM-based quality assessment system that generates numerical scores and textual descriptions, using various inference strategies including zero-shot approach, metadata integration, and error feedback.

Result: The system produces highly correlated scores and interpretable output, with progressive contributions from each inference method to overall performance, adding value to clinical workflows.

Conclusion: The proposed LLM-based approach effectively assesses low-dose CT image quality with both quantitative scores and qualitative descriptions, demonstrating the value of systematic inference strategies for clinical applications.

Abstract: Low-dose computed tomography (CT) represents a significant improvement in patient safety through lower radiation doses, but increased noise, blur, and contrast loss can diminish diagnostic quality. Therefore, consistency and robustness in image quality assessment become essential for clinical applications. In this study, we propose an LLM-based quality assessment system that generates both numerical scores and textual descriptions of degradations such as noise, blur, and contrast loss. Furthermore, various inference strategies - from the zero-shot approach to metadata integration and error feedback - are systematically examined, demonstrating the progressive contribution of each method to overall performance. The resultant assessments yield not only highly correlated scores but also interpretable output, thereby adding value to clinical workflows. The source codes of our study are available at https://github.com/itu-biai/lmms_ldct_iqa.

[390] Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization

Binyan Xu, Fan Yang, Di Tang, Xilin Dai, Kehuan Zhang

Main category: cs.CV

TL;DR: GCB introduces a new clean-image backdoor attack using conditional InfoGAN to find natural image features as stealthy triggers, enabling attacks with minimal clean accuracy drop (<1%) across multiple datasets, architectures, and tasks.

Details

Motivation: Existing clean-image backdoor attacks require high poison rates that cause noticeable drops in clean accuracy, compromising stealthiness. There's a need for more subtle attacks that minimize accuracy degradation.

Method: Uses conditional InfoGAN to identify naturally occurring image features that can serve as potent triggers. Ensures triggers are easily separable from benign features, allowing learning from extremely small poisoned datasets.

Result: Achieves clean accuracy drop of less than 1% while successfully attacking six datasets, five architectures, and four tasks (including first demonstration in regression and segmentation). Resilient against most existing defenses.

Conclusion: GCB provides a highly effective and stealthy clean-image backdoor attack framework that minimizes accuracy degradation and demonstrates remarkable versatility across diverse applications.

Abstract: Clean-image backdoor attacks, which use only label manipulation in training datasets to compromise deep neural networks, pose a significant threat to security-critical applications. A critical flaw in existing methods is that the poison rate required for a successful attack induces a proportional, and thus noticeable, drop in Clean Accuracy (CA), undermining their stealthiness. This paper presents a new paradigm for clean-image attacks that minimizes this accuracy degradation by optimizing the trigger itself. We introduce Generative Clean-Image Backdoors (GCB), a framework that uses a conditional InfoGAN to identify naturally occurring image features that can serve as potent and stealthy triggers. By ensuring these triggers are easily separable from benign task-related features, GCB enables a victim model to learn the backdoor from an extremely small set of poisoned examples, resulting in a CA drop of less than 1%. Our experiments demonstrate GCB’s remarkable versatility, successfully adapting to six datasets, five architectures, and four tasks, including the first demonstration of clean-image backdoors in regression and segmentation. GCB also exhibits resilience against most of the existing backdoor defenses.

[391] Beyond Boundaries: Leveraging Vision Foundation Models for Source-Free Object Detection

Huizai Yao, Sicheng Zhao, Pengteng Li, Yi Cui, Shuo Lu, Weiyu Guo, Yunfan Lu, Yijie Xu, Hui Xiong

Main category: cs.CV

TL;DR: A novel SFOD framework that leverages Vision Foundation Models as external knowledge to enhance feature alignment and pseudo-label quality, achieving state-of-the-art performance.

Details

Motivation: Existing SFOD methods rely only on internal source model knowledge, limiting generalization and causing biased pseudo-labels. VFMs offer strong perception capabilities but are underutilized in SFOD.

Method: Three VFM-based modules: Patch-weighted Global Feature Alignment (PGFA) for global feature distillation, Prototype-based Instance Feature Alignment (PIFA) for instance-level contrastive learning, and Dual-source Enhanced Pseudo-label Fusion (DEPF) for reliable supervision.

Result: Extensive experiments on six benchmarks demonstrate state-of-the-art SFOD performance, validating improved transferability and discriminability.

Conclusion: Integrating VFMs as external knowledge sources effectively enhances both feature alignment and label quality in SFOD, overcoming limitations of internal-only approaches.

Abstract: Source-Free Object Detection (SFOD) aims to adapt a source-pretrained object detector to a target domain without access to source data. However, existing SFOD methods predominantly rely on internal knowledge from the source model, which limits their capacity to generalize across domains and often results in biased pseudo-labels, thereby hindering both transferability and discriminability. In contrast, Vision Foundation Models (VFMs), pretrained on massive and diverse data, exhibit strong perception capabilities and broad generalization, yet their potential remains largely untapped in the SFOD setting. In this paper, we propose a novel SFOD framework that leverages VFMs as external knowledge sources to jointly enhance feature alignment and label quality. Specifically, we design three VFM-based modules: (1) Patch-weighted Global Feature Alignment (PGFA) distills global features from VFMs using patch-similarity-based weighting to enhance global feature transferability; (2) Prototype-based Instance Feature Alignment (PIFA) performs instance-level contrastive learning guided by momentum-updated VFM prototypes; and (3) Dual-source Enhanced Pseudo-label Fusion (DEPF) fuses predictions from detection VFMs and teacher models via an entropy-aware strategy to yield more reliable supervision. Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art SFOD performance, validating the effectiveness of integrating VFMs to simultaneously improve transferability and discriminability.

[392] Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images

JiaKui Hu, Shanshan Zhao, Qing-Guo Chen, Xuerui Qiu, Jialun Liu, Zhao Xu, Weihua Luo, Kaifu Zhang, Yanye Lu

Main category: cs.CV

TL;DR: Omni-View is a unified framework that extends multimodal understanding and generation to 3D scenes using multiview images, demonstrating that generation enhances understanding through synergistic interaction between scene understanding, novel view synthesis, and geometry estimation.

Details

Motivation: To explore the principle that 'generation facilitates understanding' in 3D scenes and create a unified system that jointly models understanding and generation tasks for holistic 3D scene comprehension.

Method: A three-component system consisting of understanding model, texture module for appearance synthesis, and geometry module for explicit geometric constraints, trained with a two-stage strategy to leverage spatiotemporal modeling and geometric constraints.

Result: Achieves state-of-the-art score of 55.4 on VSI-Bench benchmark, outperforming specialized 3D understanding models while delivering strong performance in novel view synthesis and 3D scene generation.

Conclusion: Omni-View successfully demonstrates that joint modeling of understanding and generation tasks enables synergistic interaction that enriches holistic 3D scene understanding, validating the principle that generation facilitates understanding.

Abstract: This paper presents Omni-View, which extends the unified multimodal understanding and generation to 3D scenes based on multiview images, exploring the principle that “generation facilitates understanding”. Consisting of understanding model, texture module, and geometry module, Omni-View jointly models scene understanding, novel view synthesis, and geometry estimation, enabling synergistic interaction between 3D scene understanding and generation tasks. By design, it leverages the spatiotemporal modeling capabilities of its texture module responsible for appearance synthesis, alongside the explicit geometric constraints provided by its dedicated geometry module, thereby enriching the model’s holistic understanding of 3D scenes. Trained with a two-stage strategy, Omni-View achieves a state-of-the-art score of 55.4 on the VSI-Bench benchmark, outperforming existing specialized 3D understanding models, while simultaneously delivering strong performance in both novel view synthesis and 3D scene generation.

[393] Mapping Reduced Accessibility to WASH Facilities in Rohingya Refugee Camps with Sub-Meter Imagery

Kyeongjin Ahn, YongHun Suh, Sungwon Han, Jeasurk Yang, Hannes Taubenböck, Meeyoung Cha

Main category: cs.CV

TL;DR: Remote sensing framework using semi-supervised segmentation to detect refugee shelters and quantify WASH accessibility in Rohingya camps, revealing declining access and gender disparities.

Details

Motivation: WASH services remain a major public health concern in refugee camps, with challenges in detecting shelters due to dense spatial configuration and irregular geometric patterns.

Method: Semi-supervised segmentation framework using sub-meter satellite images to detect individual refugee shelters, applied across multi-year data for WASH accessibility analysis.

Result: Achieved 76.4% F1-score in shelter detection; showed declining WASH accessibility from 25 people per facility in 2022 to 29.4 in 2025; women and girls experience reduced accessibility due to inadequate safety segregation.

Conclusion: High-resolution remote sensing and machine learning can detect inequality and inform equitable resource planning in humanitarian settings, emphasizing demand-responsive allocation strategies for underserved populations.

Abstract: Access to Water, Sanitation, and Hygiene (WASH) services remains a major public health concern in refugee camps. This study introduces a remote sensing-driven framework to quantify WASH accessibility-specifically to water pumps, latrines, and bathing cubicles-in the Rohingya camps of Cox’s Bazar, one of the world’s most densely populated displacement settings. Detecting refugee shelters in such emergent camps presents substantial challenges, primarily due to their dense spatial configuration and irregular geometric patterns. Using sub-meter satellite images, we develop a semi-supervised segmentation framework that achieves an F1-score of 76.4% in detecting individual refugee shelters. Applying the framework across multi-year data reveals declining WASH accessibility, driven by rapid refugee population growth and reduced facility availability, rising from 25 people per facility in 2022 to 29.4 in 2025. Gender-disaggregated analysis further shows that women and girls experience reduced accessibility, in scenarios with inadequate safety-related segregation in WASH facilities. These findings suggest the importance of demand-responsive allocation strategies that can identify areas with under-served populations-such as women and girls-and ensure that limited infrastructure serves the greatest number of people in settings with fixed or shrinking budgets. We also discuss the value of high-resolution remote sensing and machine learning to detect inequality and inform equitable resource planning in complex humanitarian environments.

[394] Noise & pattern: identity-anchored Tikhonov regularization for robust structural anomaly detection

Alexander Bauer, Klaus-Robert Müller

Main category: cs.CV

TL;DR: Self-supervised autoencoder with structured corruption and Gaussian noise regularization achieves state-of-the-art anomaly detection on MVTec AD benchmark.

Details

Motivation: Anomaly detection is crucial for industrial inspection but collecting all possible defect examples is impractical, requiring self-supervised approaches that can identify subtle structural defects without anomaly examples.

Method: Uses autoencoder trained to repair corrupted inputs with structured, spatially coherent perturbations (not i.i.d. noise) and adds Gaussian noise as Tikhonov regularizer to anchor Jacobian toward identity, stabilizing reconstruction.

Result: Achieves state-of-the-art performance on MVTec AD benchmark with I/P-AUROC scores of 99.9/99.4, demonstrating superior detection and segmentation accuracy.

Conclusion: The proposed identity-anchored regularization with structured corruption effectively improves anomaly detection for industrial inspection, supporting the theoretical framework with practical relevance.

Abstract: Anomaly detection plays a pivotal role in automated industrial inspection, aiming to identify subtle or rare defects in otherwise uniform visual patterns. As collecting representative examples of all possible anomalies is infeasible, we tackle structural anomaly detection using a self-supervised autoencoder that learns to repair corrupted inputs. To this end, we introduce a corruption model that injects artificial disruptions into training images to mimic structural defects. While reminiscent of denoising autoencoders, our approach differs in two key aspects. First, instead of unstructured i.i.d.\ noise, we apply structured, spatially coherent perturbations that make the task a hybrid of segmentation and inpainting. Second, and counterintuitively, we add and preserve Gaussian noise on top of the occlusions, which acts as a Tikhonov regularizer anchoring the Jacobian of the reconstruction function toward identity. This identity-anchored regularization stabilizes reconstruction and further improves both detection and segmentation accuracy. On the MVTec AD benchmark, our method achieves state-of-the-art results (I/P-AUROC: 99.9/99.4), supporting our theoretical framework and demonstrating its practical relevance for automatic inspection.

[395] Inference-Time Scaling of Diffusion Models for Infrared Data Generation

Kai A. Horstmann, Maxim Clouser, Kia Khezeli

Main category: cs.CV

TL;DR: Infrared image generation using diffusion models with inference-time guidance to overcome data scarcity, achieving 10% FID improvement on KAIST benchmark.

Details

Motivation: Infrared imaging enables temperature-based scene understanding in low visibility, but development is hindered by scarce annotated data and limited datasets for training generative models.

Method: Fine-tuned FLUX.1-dev diffusion model on small infrared dataset using parameter-efficient techniques, then employed domain-adapted CLIP-based verifier during inference to guide sampling toward higher quality generations.

Result: Consistent improvements in generation quality, reducing FID scores by 10% on KAIST Multispectral Pedestrian Detection Benchmark compared to unguided baseline.

Conclusion: Inference-time guidance offers promising direction for bridging domain gap in low-data infrared settings.

Abstract: Infrared imagery enables temperature-based scene understanding using passive sensors, particularly under conditions of low visibility where traditional RGB imaging fails. Yet, developing downstream vision models for infrared applications is hindered by the scarcity of high-quality annotated data, due to the specialized expertise required for infrared annotation. While synthetic infrared image generation has the potential to accelerate model development by providing large-scale, diverse training data, training foundation-level generative diffusion models in the infrared domain has remained elusive due to limited datasets. In light of such data constraints, we explore an inference-time scaling approach using a domain-adapted CLIP-based verifier for enhanced infrared image generation quality. We adapt FLUX.1-dev, a state-of-the-art text-to-image diffusion model, to the infrared domain by finetuning it on a small sample of infrared images using parameter-efficient techniques. The trained verifier is then employed during inference to guide the diffusion sampling process toward higher quality infrared generations that better align with input text prompts. Empirically, we find that our approach leads to consistent improvements in generation quality, reducing FID scores on the KAIST Multispectral Pedestrian Detection Benchmark dataset by 10% compared to unguided baseline samples. Our results suggest that inference-time guidance offers a promising direction for bridging the domain gap in low-data infrared settings.

[396] 4DSTR: Advancing Generative 4D Gaussians with Spatial-Temporal Rectification for High-Quality and Consistent 4D Generation

Mengmeng Liu, Jiuming Liu, Yunpeng Zhang, Jiangtao Li, Michael Ying Yang, Francesco Nex, Hao Cheng

Main category: cs.CV

TL;DR: 4DSTR is a novel 4D generation network that uses spatial-temporal rectification to modulate generative 4D Gaussian Splatting, achieving superior spatial-temporal consistency and adaptation to rapid temporal variations.

Details

Motivation: Previous 4D generation methods struggle with maintaining spatial-temporal consistency and adapting to rapid temporal variations due to ineffective spatial-temporal modeling.

Method: Proposes temporal correlation across 4D sequences to rectify deformable scales and rotations, and an adaptive spatial densification/pruning strategy that dynamically adds/deletes Gaussian points based on pre-frame movements.

Result: Extensive experiments show 4DSTR achieves state-of-the-art performance in video-to-4D generation, excelling in reconstruction quality, spatial-temporal consistency, and adaptation to rapid temporal movements.

Conclusion: 4DSTR effectively addresses spatial-temporal consistency and rapid temporal variation challenges in 4D content generation through its novel spatial-temporal rectification approach.

Abstract: Remarkable advances in recent 2D image and 3D shape generation have induced a significant focus on dynamic 4D content generation. However, previous 4D generation methods commonly struggle to maintain spatial-temporal consistency and adapt poorly to rapid temporal variations, due to the lack of effective spatial-temporal modeling. To address these problems, we propose a novel 4D generation network called 4DSTR, which modulates generative 4D Gaussian Splatting with spatial-temporal rectification. Specifically, temporal correlation across generated 4D sequences is designed to rectify deformable scales and rotations and guarantee temporal consistency. Furthermore, an adaptive spatial densification and pruning strategy is proposed to address significant temporal variations by dynamically adding or deleting Gaussian points with the awareness of their pre-frame movements. Extensive experiments demonstrate that our 4DSTR achieves state-of-the-art performance in video-to-4D generation, excelling in reconstruction quality, spatial-temporal consistency, and adaptation to rapid temporal movements.

[397] StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression

Yilong Chen, Xiang Bai, Zhibin Wang, Chengyu Bai, Yuhan Dai, Ming Lu, Shanghang Zhang

Main category: cs.CV

TL;DR: StreamKV is a training-free framework that enhances Video-LLMs for long video processing by dynamically partitioning videos into semantic segments, performing layer-adaptive KV cache retrieval and compression, and achieving superior accuracy with improved efficiency.

Details

Motivation: Current Video-LLMs struggle with long real-world videos, and existing retrieval-based methods have limitations in KV cache compression and retrieval that need further exploration.

Method: StreamKV dynamically partitions video streams into semantic segments, calculates summary vectors for retrieval, uses guidance prompts for compression, and unifies retrieval and compression in a layer-adaptive single module.

Result: Extensive experiments on StreamingVQA benchmarks show StreamKV significantly outperforms existing Online Video-LLMs in accuracy while substantially improving memory efficiency and computational latency.

Conclusion: StreamKV effectively addresses the challenges of long video processing in Video-LLMs through advanced KV cache management, achieving state-of-the-art performance in streaming video question answering.

Abstract: Video Large Language Models (Video-LLMs) have demonstrated significant potential in the areas of video captioning, search, and summarization. However, current Video-LLMs still face challenges with long real-world videos. Recent methods have introduced a retrieval mechanism that retrieves query-relevant KV caches for question answering, enhancing the efficiency and accuracy of long real-world videos. However, the compression and retrieval of KV caches are still not fully explored. In this paper, we propose \textbf{StreamKV}, a training-free framework that seamlessly equips Video-LLMs with advanced KV cache retrieval and compression. Compared to previous methods that used uniform partitioning, StreamKV dynamically partitions video streams into semantic segments, which better preserves semantic information. For KV cache retrieval, StreamKV calculates a summary vector for each segment to retain segment-level information essential for retrieval. For KV cache compression, StreamKV introduces a guidance prompt designed to capture the key semantic elements within each segment, ensuring only the most informative KV caches are retained for answering questions. Moreover, StreamKV unifies KV cache retrieval and compression within a single module, performing both in a layer-adaptive manner, thereby further improving the effectiveness of streaming video question answering. Extensive experiments on public StreamingVQA benchmarks demonstrate that StreamKV significantly outperforms existing Online Video-LLMs, achieving superior accuracy while substantially improving both memory efficiency and computational latency. The code has been released at https://github.com/sou1p0wer/StreamKV.

[398] Real-Time LiDAR Super-Resolution via Frequency-Aware Multi-Scale Fusion

June Moh Goo, Zichao Zeng, Jan Boehm

Main category: cs.CV

TL;DR: FLASH introduces a dual-domain LiDAR super-resolution framework that combines spatial and frequency processing with adaptive multi-scale fusion, achieving state-of-the-art performance while maintaining real-time efficiency.

Details

Motivation: To overcome limitations of existing transformer-based approaches that are restricted to spatial-domain processing with limited receptive fields, and to enable high-quality 3D perception from cost-effective low-resolution LiDAR sensors.

Method: FLASH integrates Frequency-Aware Window Attention (combining local spatial attention with global frequency-domain analysis via FFT) and Adaptive Multi-Scale Fusion (replacing conventional skip connections with learned position-specific feature aggregation enhanced by CBAM attention).

Result: Achieves state-of-the-art performance on KITTI across all evaluation metrics, surpassing uncertainty-enhanced baselines while maintaining single-pass efficiency for real-time deployment.

Conclusion: The dual-domain approach effectively handles uncertainty through architectural design rather than computationally expensive stochastic inference, making it practical for autonomous systems.

Abstract: LiDAR super-resolution addresses the challenge of achieving high-quality 3D perception from cost-effective, low-resolution sensors. While recent transformer-based approaches like TULIP show promise, they remain limited to spatial-domain processing with restricted receptive fields. We introduce FLASH (Frequency-aware LiDAR Adaptive Super-resolution with Hierarchical fusion), a novel framework that overcomes these limitations through dual-domain processing. FLASH integrates two key innovations: (i) Frequency-Aware Window Attention that combines local spatial attention with global frequency-domain analysis via FFT, capturing both fine-grained geometry and periodic scanning patterns at log-linear complexity. (ii) Adaptive Multi-Scale Fusion that replaces conventional skip connections with learned position-specific feature aggregation, enhanced by CBAM attention for dynamic feature selection. Extensive experiments on KITTI demonstrate that FLASH achieves state-of-the-art performance across all evaluation metrics, surpassing even uncertainty-enhanced baselines that require multiple forward passes. Notably, FLASH outperforms TULIP with Monte Carlo Dropout while maintaining single-pass efficiency, which enables real-time deployment. The consistent superiority across all distance ranges validates that our dual-domain approach effectively handles uncertainty through architectural design rather than computationally expensive stochastic inference, making it practical for autonomous systems.

[399] Segmentation of Ischemic Stroke Lesions using Transfer Learning on Multi-sequence MRI

R. P. Chowdhury, T. Rahman

Main category: cs.CV

TL;DR: A novel Res-Unet framework for automatic ischemic stroke lesion segmentation on multiple MRI sequences, achieving 80.5% Dice score and 74.03% accuracy on ISLES 2015 dataset.

Details

Motivation: Manual stroke lesion segmentation is tedious, time-consuming, and prone to inconsistency. Existing automatic methods rely on hand-crafted features that fail to capture irregular stroke lesion shapes.

Method: Used Res-Unet architecture trained twice (with and without pre-trained weights) on ISLES 2015 dataset with T1, T2, DWI, and FLAIR MRI sequences. Integrated Majority Voting Classifier to combine results from each axis.

Result: Achieved Dice score of 80.5% and accuracy of 74.03% on 3D volume evaluation, demonstrating effective stroke lesion segmentation.

Conclusion: The proposed framework provides fast, automatic segmentation of ischemic stroke lesions across multiple MRI sequences, overcoming limitations of manual segmentation and hand-crafted feature approaches.

Abstract: The accurate understanding of ischemic stroke lesions is critical for efficient therapy and prognosis of stroke patients. Magnetic resonance imaging (MRI) is sensitive to acute ischemic stroke and is a common diagnostic method for stroke. However, manual lesion segmentation performed by experts is tedious, time-consuming, and prone to observer inconsistency. Automatic medical image analysis methods have been proposed to overcome this challenge. However, previous approaches have relied on hand-crafted features that may not capture the irregular and physiologically complex shapes of ischemic stroke lesions. In this study, we present a novel framework for quickly and automatically segmenting ischemic stroke lesions on various MRI sequences, including T1-weighted, T2-weighted, DWI, and FLAIR. The proposed methodology is validated on the ISLES 2015 Brain Stroke sequence dataset, where we trained our model using the Res-Unet architecture twice: first, with pre-existing weights, and then without, to explore the benefits of transfer learning. Evaluation metrics, including the Dice score and sensitivity, were computed across 3D volumes. Finally, a Majority Voting Classifier was integrated to amalgamate the outcomes from each axis, resulting in a comprehensive segmentation method. Our efforts culminated in achieving a Dice score of 80.5% and an accuracy of 74.03%, showcasing the efficacy of our segmentation approach.

[400] VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models

Ying Cheng, Yu-Ho Lin, Min-Hung Chen, Fu-En Yang, Shang-Hong Lai

Main category: cs.CV

TL;DR: VADER is an LLM-driven framework for Video Anomaly Understanding that integrates object relation features with visual cues to provide detailed interpretation of anomalous events, addressing limitations of traditional detection-only methods.

Details

Motivation: Existing video anomaly understanding approaches neglect deeper causal relationships and object interactions, which are critical for comprehensive anomaly comprehension beyond simple detection and localization.

Method: VADER uses an Anomaly Scorer for per-frame scoring, Context-Aware Sampling (CAES) to capture causal context, Relation Feature Extractor and COntrastive Relation Encoder (CORE) to model object interactions, and integrates these with LLMs for reasoning.

Result: Experiments on multiple real-world VAU benchmarks show VADER achieves strong performance across anomaly description, explanation, and causal reasoning tasks.

Conclusion: VADER advances explainable video anomaly analysis by providing causally grounded descriptions and robust question answering through integrated visual and relational reasoning.

Abstract: Video anomaly understanding (VAU) aims to provide detailed interpretation and semantic comprehension of anomalous events within videos, addressing limitations of traditional methods that focus solely on detecting and localizing anomalies. However, existing approaches often neglect the deeper causal relationships and interactions between objects, which are critical for understanding anomalous behaviors. In this paper, we propose VADER, an LLM-driven framework for Video Anomaly unDErstanding, which integrates keyframe object Relation features with visual cues to enhance anomaly comprehension from video. Specifically, VADER first applies an Anomaly Scorer to assign per-frame anomaly scores, followed by a Context-AwarE Sampling (CAES) strategy to capture the causal context of each anomalous event. A Relation Feature Extractor and a COntrastive Relation Encoder (CORE) jointly model dynamic object interactions, producing compact relational representations for downstream reasoning. These visual and relational cues are integrated with LLMs to generate detailed, causally grounded descriptions and support robust anomaly-related question answering. Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks, advancing the frontier of explainable video anomaly analysis.

[401] YoNoSplat: You Only Need One Model for Feedforward 3D Gaussian Splatting

Botao Ye, Boqi Chen, Haofei Xu, Daniel Barath, Marc Pollefeys

Main category: cs.CV

TL;DR: YoNoSplat is a feedforward model that reconstructs high-quality 3D Gaussian Splatting representations from unstructured image collections, handling both posed/unposed and calibrated/uncalibrated inputs with exceptional efficiency.

Details

Motivation: Fast and flexible 3D scene reconstruction from unstructured image collections remains challenging, especially with arbitrary numbers of images and varying input conditions (posed/unposed, calibrated/uncalibrated).

Method: Predicts local Gaussians and camera poses for each view, aggregated into global representation using predicted or provided poses. Uses mixing training strategy to mitigate task entanglement, pairwise camera-distance normalization for scale ambiguity, and embeds camera intrinsics into network.

Result: Achieves state-of-the-art performance on standard benchmarks in both pose-free and pose-dependent settings. Reconstructs scenes from 100 views in just 2.69 seconds on NVIDIA GH200 GPU.

Conclusion: YoNoSplat provides an efficient and versatile solution for 3D scene reconstruction from unstructured image collections, handling various input conditions while maintaining high quality and speed.

Abstract: Fast and flexible 3D scene reconstruction from unstructured image collections remains a significant challenge. We present YoNoSplat, a feedforward model that reconstructs high-quality 3D Gaussian Splatting representations from an arbitrary number of images. Our model is highly versatile, operating effectively with both posed and unposed, calibrated and uncalibrated inputs. YoNoSplat predicts local Gaussians and camera poses for each view, which are aggregated into a global representation using either predicted or provided poses. To overcome the inherent difficulty of jointly learning 3D Gaussians and camera parameters, we introduce a novel mixing training strategy. This approach mitigates the entanglement between the two tasks by initially using ground-truth poses to aggregate local Gaussians and gradually transitioning to a mix of predicted and ground-truth poses, which prevents both training instability and exposure bias. We further resolve the scale ambiguity problem by a novel pairwise camera-distance normalization scheme and by embedding camera intrinsics into the network. Moreover, YoNoSplat also predicts intrinsic parameters, making it feasible for uncalibrated inputs. YoNoSplat demonstrates exceptional efficiency, reconstructing a scene from 100 views (at 280x518 resolution) in just 2.69 seconds on an NVIDIA GH200 GPU. It achieves state-of-the-art performance on standard benchmarks in both pose-free and pose-dependent settings. Our project page is at https://botaoye.github.io/yonosplat/.

[402] Garbage Vulnerable Point Monitoring using IoT and Computer Vision

R. Kumar, A. Lall, S. Chaudhari, M. Kale, A. Vattem

Main category: cs.CV

TL;DR: Proposes an IoT and computer vision system using object detection models to monitor illegal waste dumping at garbage vulnerable points, with YOLO11m achieving 92.39% accuracy.

Details

Motivation: To address the problem of illegal waste dumping in urban areas by developing an automated monitoring system using modern technologies.

Method: Uses street-level cameras and object detection algorithms (YOLOv8, YOLOv10, YOLO11m, RT-DETR) on data collected from Sangareddy district, India.

Result: YOLO11m achieved the highest accuracy of 92.39% and mAP@50 of 0.91 in waste detection, effectively capturing waste disposal patterns across different time periods.

Conclusion: Object detection models are well-suited for monitoring waste dumping events and the system provides comprehensive daily and nightly monitoring of waste disposal patterns.

Abstract: This paper proposes a smart way to manage municipal solid waste by using the Internet of Things (IoT) and computer vision (CV) to monitor illegal waste dumping at garbage vulnerable points (GVPs) in urban areas. The system can quickly detect and monitor dumped waste using a street-level camera and object detection algorithm. Data was collected from the Sangareddy district in Telangana, India. A series of comprehensive experiments was carried out using the proposed dataset to assess the accuracy and overall performance of various object detection models. Specifically, we performed an in-depth evaluation of YOLOv8, YOLOv10, YOLO11m, and RT-DETR on our dataset. Among these models, YOLO11m achieved the highest accuracy of 92.39% in waste detection, demonstrating its effectiveness in detecting waste. Additionally, it attains an mAP@50 of 0.91, highlighting its high precision. These findings confirm that the object detection model is well-suited for monitoring and tracking waste dumping events at GVP locations. Furthermore, the system effectively captures waste disposal patterns, including hourly, daily, and weekly dumping trends, ensuring comprehensive daily and nightly monitoring.

[403] StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation

Tianrui Feng, Zhi Li, Shuo Yang, Haocheng Xi, Muyang Li, Xiuyu Li, Lvmin Zhang, Keting Yang, Kelly Peng, Song Han, Maneesh Agrawala, Kurt Keutzer, Akio Kodaira, Chenfeng Xu

Main category: cs.CV

TL;DR: StreamDiffusionV2 is a training-free pipeline for real-time video streaming that addresses latency and scalability challenges in live streaming with video diffusion models, achieving high FPS while maintaining strict service-level objectives.

Details

Motivation: Previous image-based streaming diffusion models lack temporal consistency, while offline video diffusion systems are optimized for throughput but cannot meet the strict latency requirements of live streaming, which demands minimal time-to-first-frame and per-frame deadlines with low jitter.

Method: StreamDiffusionV2 integrates an SLO-aware batching scheduler, block scheduler, sink-token-guided rolling KV cache, motion-aware noise controller, and scalable pipeline orchestration that parallelizes diffusion across denoising steps and network layers.

Result: The system achieves first frame rendering within 0.5s and attains 58.28 FPS with a 14B-parameter model and 64.52 FPS with a 1.3B-parameter model on four H100 GPUs, scaling near-linearly without violating latency guarantees.

Conclusion: StreamDiffusionV2 makes state-of-the-art generative live streaming practical and accessible for both individual creators and enterprise platforms by solving the scalability and latency challenges in real-time video diffusion.

Abstract: Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Previous image-based streaming diffusion models have powered efficient and creative live streaming products but have hit limits on temporal consistency due to the foundation of image-based designs. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. However, offline generation systems primarily optimize throughput by batching large workloads. In contrast, live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter. Besides, scalable multi-GPU serving for real-time streams remains largely unresolved so far. To address this, we present StreamDiffusionV2, a training-free pipeline for interactive live streaming with video diffusion models. StreamDiffusionV2 integrates an SLO-aware batching scheduler and a block scheduler, together with a sink-token–guided rolling KV cache, a motion-aware noise controller, and other system-level optimizations. Moreover, we introduce a scalable pipeline orchestration that parallelizes the diffusion process across denoising steps and network layers, achieving near-linear FPS scaling without violating latency guarantees. The system scales seamlessly across heterogeneous GPU environments and supports flexible denoising steps (e.g., 1–4), enabling both ultra-low-latency and higher-quality modes. Without TensorRT or quantization, StreamDiffusionV2 renders the first frame within 0.5s and attains 58.28 FPS with a 14B-parameter model and 64.52 FPS with a 1.3B-parameter model on four H100 GPUs, making state-of-the-art generative live streaming practical and accessible–from individual creators to enterprise-scale platforms.

[404] DIMO: Diverse 3D Motion Generation for Arbitrary Objects

Linzhan Mou, Jiahui Lei, Chen Wang, Lingjie Liu, Kostas Daniilidis

Main category: cs.CV

TL;DR: DIMO is a generative approach that creates diverse 3D motions for objects from single images by leveraging video model priors and learning motion patterns in a latent space.

Details

Motivation: To enable generation of diverse 3D motions for arbitrary objects using only a single image as input, overcoming limitations of existing methods that require multiple inputs or lack motion diversity.

Method: Extract motion patterns from pre-trained video models, embed motions into latent vectors, train shared motion decoder to learn neural key point trajectories, and drive 3D Gaussians with these key points for geometry and appearance modeling.

Result: Successfully generates diverse 3D motions from single images, supports instant sampling of motions in single-forward pass, and enables applications like 3D motion interpolation and language-guided motion generation.

Conclusion: DIMO provides an effective framework for generating diverse 3D motions from single images by leveraging video priors and learning compact motion representations, opening possibilities for various motion generation applications.

Abstract: We present DIMO, a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. The core idea of our work is to leverage the rich priors in well-trained video models to extract the common motion patterns and then embed them into a shared low-dimensional latent space. Specifically, we first generate multiple videos of the same object with diverse motions. We then embed each motion into a latent vector and train a shared motion decoder to learn the distribution of motions represented by a structured and compact motion representation, i.e., neural key point trajectories. The canonical 3D Gaussians are then driven by these key points and fused to model the geometry and appearance. During inference time with learned latent space, we can instantly sample diverse 3D motions in a single-forward pass and support several interesting applications including 3D motion interpolation and language-guided motion generation. Our project page is available at https://linzhanm.github.io/dimo.

[405] TwinOR: Photorealistic Digital Twins of Dynamic Operating Rooms for Embodied AI Research

Han Zhang, Yiqing Shen, Roger D. Soberanis-Mukul, Ankita Ghosh, Hao Ding, Lalithkumar Seenivasan, Jose L. Porras, Zhekai Mao, Chenjia Li, Wenjie Xiao, Lonny Yarmus, Angela Christine Argento, Masaru Ishii, Mathias Unberath

Main category: cs.CV

TL;DR: TwinOR is a framework for creating photorealistic digital twins of operating rooms that combines static geometry reconstruction with dynamic modeling of human and equipment motion, enabling safe embodied AI development.

Details

Motivation: Safety regulations in real operating rooms limit embodied AI development, requiring risk-free digital environments that capture the spatial, visual, and behavioral complexity of surgical settings.

Method: Reconstructs static geometry from pre-scan videos and continuously models motion through multi-view perception, fusing static and dynamic components into an immersive 3D environment for simulation and exploration.

Result: Achieves centimeter-level accuracy in geometry reconstruction and enables realistic sensor simulations. Models like FoundationStereo and ORB-SLAM3 perform within reported accuracy on real datasets using TwinOR-synthesized data.

Conclusion: TwinOR provides a real-to-sim pipeline for creating dynamic, photorealistic digital twins that enable safe, scalable, and data-efficient development of embodied AI for surgical applications.

Abstract: Developing embodied AI for intelligent surgical systems requires safe, controllable environments for continual learning and evaluation. However, safety regulations and operational constraints in operating rooms (ORs) limit embodied agents from freely perceiving and interacting in realistic settings. Digital twins provide high-fidelity, risk-free environments for exploration and training. How we may create photorealistic and dynamic digital representations of ORs that capture relevant spatial, visual, and behavioral complexity remains unclear. We introduce TwinOR, a framework for constructing photorealistic, dynamic digital twins of ORs for embodied AI research. The system reconstructs static geometry from pre-scan videos and continuously models human and equipment motion through multi-view perception of OR activities. The static and dynamic components are fused into an immersive 3D environment that supports controllable simulation and embodied exploration. The proposed framework reconstructs complete OR geometry with centimeter level accuracy while preserving dynamic interaction across surgical workflows, enabling realistic renderings and a virtual playground for embodied AI systems. In our experiments, TwinOR simulates stereo and monocular sensor streams for geometry understanding and visual localization tasks. Models such as FoundationStereo and ORB-SLAM3 on TwinOR-synthesized data achieve performance within their reported accuracy on real indoor datasets, demonstrating that TwinOR provides sensor-level realism sufficient for perception and localization challenges. By establishing a real-to-sim pipeline for constructing dynamic, photorealistic digital twins of OR environments, TwinOR enables the safe, scalable, and data-efficient development and benchmarking of embodied AI, ultimately accelerating the deployment of embodied AI from sim-to-real.

[406] ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou

Main category: cs.CV

TL;DR: ColorBench is a benchmark to evaluate vision-language models’ color understanding capabilities, revealing that current VLMs have significant limitations in color perception and reasoning despite the importance of color in human visual cognition.

Details

Motivation: Color is crucial for human visual reasoning but it's unclear how well vision-language models perceive and leverage color information, necessitating a systematic evaluation of their color understanding capabilities.

Method: Developed ColorBench benchmark with diverse test scenarios grounded in real applications to evaluate color perception, reasoning, and robustness across 32 different VLMs with varying language models and vision encoders.

Result: Key findings: (i) Scaling law holds with language models being more important than vision encoders; (ii) Small performance gaps indicate color understanding is neglected; (iii) CoT reasoning improves performance; (iv) Color clues are used but can mislead models.

Conclusion: Current VLMs have critical limitations in color comprehension, highlighting the need to enhance color understanding capabilities in multimodal AI systems.

Abstract: Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, with grounding in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals some undiscovered findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracies and robustness, though they are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBenchcan serve as a foundational tool for advancing the study of human-level color understanding of multimodal AI.

[407] Intelligent Sampling Consensus for Homography Estimation in Football Videos Using Featureless Unpaired Points

George Nousias, Konstantinos Delibasis, Ilias Maglogiannis

Main category: cs.CV

TL;DR: H-RANSAC is a novel homography estimation algorithm that eliminates the need for feature vectors or explicit point pairing, using geometric criteria and concave quadrilaterals to improve accuracy and efficiency.

Details

Motivation: Traditional homography estimation methods rely on RANSAC with pre-matched homologous points using local feature vectors, which can be challenging under radically different camera poses and zoom factors.

Method: H-RANSAC introduces a novel geometric (cheiral) criterion to reject implausible point configurations early, leverages typically discarded concave quadrilaterals, and includes a post-hoc criterion for accuracy improvement. It provides analytical derivations for expected maximum iterations.

Result: H-RANSAC significantly outperforms state-of-the-art classical methods combined with deep learning-based salient point detection in terms of average reprojection error and success rates on football match video frames with divergent viewpoints.

Conclusion: The proposed H-RANSAC algorithm provides an effective solution for homography estimation without requiring feature vectors or explicit point pairing, demonstrating superior performance in challenging scenarios with highly divergent camera viewpoints.

Abstract: Estimating the homography matrix between images captured under radically different camera poses and zoom factors is a complex challenge. Traditional methods rely on the Random Sample Consensus (RANSAC) algorithm, which requires pairs of homologous points, pre-matched based on local image feature vectors. Sampling consensus is a core step in many Artificial Intelligence (AI) algorithms that enable computer systems to recognize patterns in data. In this paper, we propose H-RANSAC, an algorithm for homography estimation that eliminates the need for feature vectors or explicit point pairing, while it optionally supports point labeling into two classes. H-RANSAC introduces a novel geometric (cheiral) criterion that intelligently rejects implausible point configurations at the beginning of each iteration, while leveraging concave quadrilaterals typically discarded by similar algorithms. A post-hoc criterion at the end of each iteration improves accuracy further. Analytical derivations of the expected maximum iterations are provided, considering success probabilities and outlier rates, enabling adaptive performance tuning. The algorithm is validated on a demanding task: estimating homography between video frames of football matches captured by 12 cameras with highly divergent viewpoints. Results show that H-RANSAC significantly outperforms state-of-the-art classical methods, combined with deep learning-based salient point detection, in terms of average reprojection error and success rates. The relevant implementation is available in https://github.com/gnousias/H-RANSAC.

[408] Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation

Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

Main category: cs.CV

TL;DR: A comprehensive study of text-image generation evaluation metrics reveals that no single metric performs consistently across tasks, with performance varying by compositional problem type. VQA-based metrics aren’t uniformly superior, while embedding-based metrics show strength in specific cases.

Details

Motivation: Current text-image generation evaluation relies heavily on automated metrics adopted by convention rather than validated against human judgment, making it critical to understand how well these metrics reflect human preferences for trustworthy evaluation.

Method: Conducted a broad study examining widely used metrics for compositional text-image evaluation, analyzing their behavior across diverse compositional challenges and comparing how different metric families align with human judgments.

Result: No single metric performs consistently across tasks - performance varies with compositional problem type. VQA-based metrics are not uniformly superior, while certain embedding-based metrics prove stronger in specific cases. Image-only metrics contribute little to compositional evaluation.

Conclusion: Careful and transparent metric selection is crucial for trustworthy evaluation and their use as reward models in generation, as metric performance depends on the specific compositional challenge being evaluated.

Abstract: Text-image generation has advanced rapidly, but assessing whether outputs truly capture the objects, attributes, and relations described in prompts remains a central challenge. Evaluation in this space relies heavily on automated metrics, yet these are often adopted by convention or popularity rather than validated against human judgment. Because evaluation and reported progress in the field depend directly on these metrics, it is critical to understand how well they reflect human preferences. To address this, we present a broad study of widely used metrics for compositional text-image evaluation. Our analysis goes beyond simple correlation, examining their behavior across diverse compositional challenges and comparing how different metric families align with human judgments. The results show that no single metric performs consistently across tasks: performance varies with the type of compositional problem. Notably, VQA-based metrics, though popular, are not uniformly superior, while certain embedding-based metrics prove stronger in specific cases. Image-only metrics, as expected, contribute little to compositional evaluation, as they are designed for perceptual quality rather than alignment. These findings underscore the importance of careful and transparent metric selection, both for trustworthy evaluation and for their use as reward models in generation. Project page is available at https://amirkasaei.com/eval-the-evals/ .

[409] HyCTAS: Multi-Objective Hybrid Convolution-Transformer Architecture Search for Real-Time Image Segmentation

Hongyuan Yu, Cheng Wan, Xiyang Dai, Mengchen Liu, Dongdong Chen, Bin Xiao, Yan Huang, Yuan Lu, Liang Wang

Main category: cs.CV

TL;DR: HyCTAS is a neural architecture search method that automatically finds optimal hybrid CNN-Transformer architectures for real-time image segmentation, balancing accuracy and latency without ImageNet pretraining.

Details

Motivation: Manual design of efficient segmentation architectures is labor-intensive, and integrating multi-head self-attention into high-resolution CNNs is challenging due to memory overhead.

Method: Multi-target multi-branch supernet approach that searches for optimal placement of lightweight convolution layers and memory-efficient self-attention layers between branches at different resolutions.

Result: Discovers competitive real-time models on Cityscapes, ADE20K, and COCO datasets, achieving strong accuracy-latency trade-offs without ImageNet pretraining.

Conclusion: HyCTAS provides an automated solution for finding efficient hybrid CNN-Transformer architectures that outperform manual designs in real-time segmentation tasks.

Abstract: Real-time image segmentation demands architectures that preserve fine spatial detail while capturing global context under tight latency and memory budgets. Image segmentation is one of the most fundamental problems in computer vision and has drawn a lot of attention due to its vast applications in image understanding and autonomous driving. However, designing effective and efficient segmentation neural architectures is a labor-intensive process that may require numerous trials by human experts. In this paper, we address the challenge of integrating multi-head self-attention into high-resolution representation CNNs efficiently by leveraging architecture search. Manually replacing convolution layers with multi-head self-attention is non-trivial due to the costly overhead in memory to maintain high resolution. By contrast, we develop a multi-target multi-branch supernet method, which not only fully utilizes the advantages of high-resolution features but also finds the proper location for placing the multi-head self-attention module. Our search algorithm is optimized towards multiple objectives (e.g., latency and mIoU) and is capable of finding architectures on the approximate Pareto front with an arbitrary number of branches in a single search. We further present a series of models via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method that searches for the best hybrid combination of lightweight convolution layers and memory-efficient self-attention layers between branches from different resolutions and fuses them at high resolution for both efficiency and effectiveness. On Cityscapes, ADE20K, and COCO, HyCTAS discovers competitive real-time models without ImageNet pretraining, delivering strong accuracy and latency trade-offs. Code and models are available at https://github.com/MarvinYu1995/HyCTAS.

[410] Hallucination as an Upper Bound: A New Perspective on Text-to-Image Evaluation

Seyed Amir Kasaei, Mohammad Hossein Rohban

Main category: cs.CV

TL;DR: Defines hallucination in text-to-image models as bias-driven deviations and proposes a taxonomy with three categories: attribute, relation, and object hallucinations.

Details

Motivation: Existing evaluations for text-to-image models focus mainly on alignment with prompt elements but overlook what models generate beyond the prompt, failing to address bias-driven deviations.

Method: Proposes a new framing of hallucination in text-to-image models and develops a taxonomy with three categories: attribute, relation, and object hallucinations.

Result: The proposed framing introduces an upper bound for evaluation and surfaces hidden biases in text-to-image models.

Conclusion: This approach provides a foundation for richer assessment of text-to-image models by systematically addressing bias-driven hallucinations beyond simple prompt alignment.

Abstract: In language and vision-language models, hallucination is broadly understood as content generated from a model’s prior knowledge or biases rather than from the given input. While this phenomenon has been studied in those domains, it has not been clearly framed for text-to-image (T2I) generative models. Existing evaluations mainly focus on alignment, checking whether prompt-specified elements appear, but overlook what the model generates beyond the prompt. We argue for defining hallucination in T2I as bias-driven deviations and propose a taxonomy with three categories: attribute, relation, and object hallucinations. This framing introduces an upper bound for evaluation and surfaces hidden biases, providing a foundation for richer assessment of T2I models.

[411] SkinCaRe: A Multimodal Dermatology Dataset Annotated with Medical Caption and Chain-of-Thought Reasoning

Yuhao Shen, Liyuan Sun, Yan Xu, Wenbin Liu, Shuping Zhang, Shawn Afvari, Zhongyi Han, Jiaoyan Song, Yongzhi Ji, Tao Lu, Xiaonan He, Xin Gao, Juexiao Zhou

Main category: cs.CV

TL;DR: SkinCaRe is a comprehensive multimodal dermatology dataset combining SkinCAP (4,000 images with medical descriptions) and SkinCoT (3,041 images with clinician-verified chain-of-thought diagnoses) to address the lack of concept-level meta-labels and natural language descriptions in existing datasets.

Details

Motivation: Existing dermatology datasets lack concept-level meta-labels and rich medical descriptions in natural language, which hinders the advancement of LLM-based methods in dermatologic diagnosis.

Method: Created SkinCAP with 4,000 images annotated by board-certified dermatologists for medical descriptions, and SkinCoT with 3,041 images paired with clinician-verified hierarchical chain-of-thought diagnoses evaluated against six quality criteria.

Result: SkinCaRe provides 7,041 expertly curated dermatologic cases with comprehensive natural language descriptions and explanations, serving as a unified resource for training multimodal models in dermatology.

Conclusion: SkinCaRe addresses the interpretability gap in AI-based dermatology diagnosis by providing a meticulously annotated dataset with rich medical descriptions and chain-of-thought reasoning, enabling better training of multimodal models.

Abstract: With the widespread application of artificial intelligence (AI), particularly deep learning (DL) and vision large language models (VLLMs), in skin disease diagnosis, the need for interpretability becomes crucial. However, existing dermatology datasets are limited in their inclusion of concept-level meta-labels, and none offer rich medical descriptions in natural language. This deficiency impedes the advancement of LLM-based methods in dermatologic diagnosis. To address this gap and provide a meticulously annotated dermatology dataset with comprehensive natural language descriptions, we introduce \textbf{SkinCaRe}, a comprehensive multimodal resource that unifies \textit{SkinCAP} and \textit{SkinCoT}. \textbf{SkinCAP} comprises 4,000 images sourced from the Fitzpatrick 17k skin disease dataset and the Diverse Dermatology Images dataset, annotated by board-certified dermatologists to provide extensive medical descriptions and captions. In addition, we introduce \textbf{SkinCoT}, a curated dataset pairing 3,041 dermatologic images with clinician-verified, hierarchical chain-of-thought (CoT) diagnoses. Each diagnostic narrative is rigorously evaluated against six quality criteria and iteratively refined until it meets a predefined standard of clinical accuracy and explanatory depth. Together, SkinCAP (captioning) and SkinCoT (reasoning), collectively referred to as SkinCaRe, encompass 7,041 expertly curated dermatologic cases and provide a unified and trustworthy resource for training multimodal models that both describe and explain dermatologic images. SkinCaRe is publicly available at https://huggingface.co/datasets/yuhos16/SkinCaRe.

[412] The Wisdom of a Crowd of Brains: A Universal Brain Encoder

Roman Beliy, Navve Wasserman, Amit Zalcher, Michal Irani

Main category: cs.CV

TL;DR: A universal brain-encoder that learns voxel-specific embeddings through cross-attention with image features, enabling training across multiple subjects, datasets, and fMRI machines.

Details

Motivation: Current brain-encoders are limited by being trained per-subject and per-dataset, restricting training data availability and generalization.

Method: Voxel-centric encoder architecture that learns unique embeddings per brain-voxel and uses cross-attention between voxel embeddings and multi-level deep image features to predict voxel responses.

Result: Enables combining data from multiple subjects, effective transfer learning across subjects/datasets/machines with few examples, and provides voxel-embeddings for brain functionality exploration.

Conclusion: The proposed universal brain-encoder overcomes limitations of subject/dataset-specific training and provides a powerful framework for brain encoding and functional analysis.

Abstract: Image-to-fMRI encoding is important for both neuroscience research and practical applications. However, such “Brain-Encoders” have been typically trained per-subject and per fMRI-dataset, thus restricted to very limited training data. In this paper we propose a Universal Brain-Encoder, which can be trained jointly on data from many different subjects/datasets/machines. What makes this possible is our new voxel-centric Encoder architecture, which learns a unique “voxel-embedding” per brain-voxel. Our Encoder trains to predict the response of each brain-voxel on every image, by directly computing the cross-attention between the brain-voxel embedding and multi-level deep image features. This voxel-centric architecture allows the functional role of each brain-voxel to naturally emerge from the voxel-image cross-attention. We show the power of this approach to (i) combine data from multiple different subjects (a “Crowd of Brains”) to improve each individual brain-encoding, (ii) quick & effective Transfer-Learning across subjects, datasets, and machines (e.g., 3-Tesla, 7-Tesla), with few training examples, and (iii) use the learned voxel-embeddings as a powerful tool to explore brain functionality (e.g., what is encoded where in the brain).

Kirolos Ataallah, Eslam Abdelrahman, Mahmoud Ahmed, Chenhui Gou, Khushbu Pahwa, Jian Ding, Mohamed Elhoseiny

Main category: cs.CV

TL;DR: InfiniBench is a comprehensive benchmark for long video understanding with 1000+ hours of video content and 87.7K QA pairs, evaluating 8 diverse skills. Current models struggle significantly, with GPT-4o achieving only 47.1% on grounding-based skills.

Details

Motivation: Existing benchmarks fail to test the full range of cognitive skills needed for long-form video understanding, which involves temporally rich and narratively complex inputs like movies and TV episodes.

Method: Created InfiniBench with over 1,000 hours of video content (avg 53 minutes), 87.7K QA pairs covering 8 diverse skills (grounding-based and reasoning-based), using both multiple-choice and open-ended question formats.

Result: Models perform poorly across all skills - GPT-4o achieves only 47.1% on grounding-based skills, most models near random chance. Models show strong reliance on world knowledge from metadata rather than visual/temporal understanding. Multi-modal input substantially improves performance.

Conclusion: Current models struggle with long video understanding, highlighting the need for better temporal reasoning and visual understanding capabilities beyond pre-trained world knowledge.

Abstract: Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a significant challenge for multi-modal models. Existing benchmarks often fail to test the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. Therefore, we introduce InfiniBench, a comprehensive benchmark designed to evaluate the capabilities of models in long video understanding rigorously. InfiniBench offers:(1) Over 1,000 hours of video content, with an average video length of 53 minutes. (2) The largest set of question-answer pairs for long video comprehension, totaling around 87.7 K. (3) Eight diverse skills that span both grounding-based (e.g., scene transitions, character actions) and reasoning-based (e.g., deep context understanding, multi-event linking). (4) Rich annotation formats, including both multiple-choice and open-ended questions. We conducted an in-depth evaluation across both commercial (GPT-4o, Gemini 2.0 Flash) and most recent open-source vision-language models such as Qwen2.5-VL, InternVL3.0). Results reveal that:(1) Models struggle across the board: Even the best model, GPT-4o, achieves only 47.1 % on grounding-based skills, with most models performing near or just above random chance. (2) Strong reliance on world knowledge: Models achieve surprisingly high scores using only metadata (e.g., video titles), highlighting a tendency to rely on pre-trained knowledge rather than actual visual or temporal understanding. (3) Multi-Modal Importance: When provided with full video and subtitle context, however, models show substantial improvements, confirming the critical role of multimodal input in video understanding. InfiniBench is publicly available at https://vision-cair.github.io/Infinibench

[414] DeNAS-ViT: Data Efficient NAS-Optimized Vision Transformer for Ultrasound Image Segmentation

Renqi Chen, Xinzhe Zheng, Haoyang Su, Kehan Wu

Main category: cs.CV

TL;DR: DeNAS-ViT is a neural architecture search-optimized Vision Transformer that automatically designs optimal architectures for ultrasound image segmentation, addressing data scarcity through NAS-guided semi-supervised learning.

Details

Motivation: Ultrasound image segmentation faces challenges from poor image quality and limited labeled data. Traditional handcrafted models offer limited gains and are prone to overfitting on small datasets.

Method: Proposes DeNAS-ViT with efficient NAS module for multi-scale token search before ViT attention, and NAS-guided SSL framework combining network independence and contrastive learning with stage-wise optimization.

Result: Achieves state-of-the-art performance on public datasets, maintaining robustness with minimal labeled data, and demonstrates generalization potential beyond ultrasound imaging.

Conclusion: DeNAS-ViT effectively addresses ultrasound segmentation challenges through automated architecture optimization and data-efficient learning, showing broader applicability across medical imaging domains.

Abstract: Accurate segmentation of ultrasound images is essential for reliable medical diagnoses but is challenged by poor image quality and scarce labeled data. Prior approaches have relied on manually designed, complex network architectures to improve multi-scale feature extraction. However, such handcrafted models offer limited gains when prior knowledge is inadequate and are prone to overfitting on small datasets. In this paper, we introduce DeNAS-ViT, a data-efficient NAS-optimized Vision Transformer, the first method to leverage neural architecture search (NAS) for ultrasound image segmentation by automatically optimizing model architecture through token-level search. Specifically, we propose an efficient NAS module that performs multi-scale token search prior to the ViT’s attention mechanism, effectively capturing both contextual and local features while minimizing computational costs. Given ultrasound’s data scarcity and NAS’s inherent data demands, we further develop a NAS-guided semi-supervised learning (SSL) framework. This approach integrates network independence and contrastive learning within a stage-wise optimization strategy, significantly enhancing model robustness under limited-data conditions. Extensive experiments on public datasets demonstrate that DeNAS-ViT achieves state-of-the-art performance, maintaining robustness with minimal labeled data. Moreover, we highlight DeNAS-ViT’s generalization potential beyond ultrasound imaging, underscoring its broader applicability.

[415] LMSeg: An end-to-end geometric message-passing network on barycentric dual graphs for large-scale landscape mesh segmentation

Zexian Huang, Kourosh Khoshelham, Martin Tomko

Main category: cs.CV

TL;DR: LMSeg is a lightweight deep graph network for 3D mesh segmentation that achieves state-of-the-art performance on urban and cultural heritage datasets using only 2.4M parameters.

Details

Motivation: Existing 3D mesh segmentation methods struggle with scalability, end-to-end trainability, and accurately segmenting small/irregular objects in complex environments like cultural heritage landscapes.

Method: Proposes LMSeg with barycentric dual graph representation, Geometry Aggregation+ (GA+) module for adaptive neighborhood feature combination, and hierarchical-local dual pooling to balance global context with fine details.

Result: Achieves 75.1% mIoU on SUM, 78.4% O.A. on H3D, and 62.4% mIoU on BBW dataset, demonstrating accurate segmentation of small objects and occluded cultural heritage structures.

Conclusion: The BBW dataset and LMSeg provide a practical solution for advancing 3D mesh segmentation in cultural heritage, environmental monitoring, and urban applications with strong performance and lightweight design.

Abstract: Semantic segmentation of large-scale 3D landscape meshes is critical for geospatial analysis in complex environments, yet existing approaches face persistent challenges of scalability, end-to-end trainability, and accurate segmentation of small and irregular objects. To address these issues, we introduce the BudjBim Wall (BBW) dataset, a large-scale annotated mesh dataset derived from high-resolution LiDAR scans of the UNESCO World Heritage-listed Budj Bim cultural landscape in Victoria, Australia. The BBW dataset captures historic dry-stone wall structures that are difficult to detect under vegetation occlusion, supporting research in underrepresented cultural heritage contexts. Building on this dataset, we propose LMSeg, a deep graph message-passing network for semantic segmentation of large-scale meshes. LMSeg employs a barycentric dual graph representation of mesh faces and introduces the Geometry Aggregation+ (GA+) module, a learnable softmax-based operator that adaptively combines neighborhood features and captures high-frequency geometric variations. A hierarchical-local dual pooling integrates hierarchical and local geometric aggregation to balance global context with fine-detail preservation. Experiments on three large-scale benchmarks (SUM, H3D, and BBW) show that LMSeg achieves 75.1% mIoU on SUM, 78.4% O.A. on H3D, and 62.4% mIoU on BBW, using only 2.4M lightweight parameters. In particular, LMSeg demonstrates accurate segmentation across both urban and natural scenes-capturing small-object classes such as vehicles and high vegetation in complex city environments, while also reliably detecting dry-stone walls in dense, occluded rural landscapes. Together, the BBW dataset and LMSeg provide a practical and extensible method for advancing 3D mesh segmentation in cultural heritage, environmental monitoring, and urban applications.

[416] STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences

Soroush Mehraban, Mohammad Javad Rajabi, Andrea Iaboni, Babak Taati

Main category: cs.CV

TL;DR: STARS combines masked prediction with nearest-neighbor contrastive learning to improve skeleton-based action recognition, achieving state-of-the-art results and better few-shot generalization.

Details

Motivation: Masked prediction methods in skeleton-based action recognition produce poor cluster separation and struggle with generalization in few-shot settings compared to contrastive learning approaches.

Method: STARS uses a two-stage approach: first masked prediction with encoder-decoder architecture, then nearest-neighbor contrastive learning to partially tune encoder weights for better semantic cluster formation.

Result: Achieves state-of-the-art self-supervised results on NTU-60, NTU-120, and PKU-MMD benchmarks, with significantly better performance than masked prediction models in few-shot settings.

Conclusion: The proposed STARS method effectively addresses limitations of masked prediction by incorporating contrastive learning, improving cluster separation and generalization without hand-crafted data augmentations.

Abstract: Self-supervised pretraining methods with masked prediction demonstrate remarkable within-dataset performance in skeleton-based action recognition. However, we show that, unlike contrastive learning approaches, they do not produce well-separated clusters. Additionally, these methods struggle with generalization in few-shot settings. To address these issues, we propose Self-supervised Tuning for 3D Action Recognition in Skeleton sequences (STARS). Specifically, STARS first uses a masked prediction stage using an encoder-decoder architecture. It then employs nearest-neighbor contrastive learning to partially tune the weights of the encoder, enhancing the formation of semantic clusters for different actions. By tuning the encoder for a few epochs, and without using hand-crafted data augmentations, STARS achieves state-of-the-art self-supervised results in various benchmarks, including NTU-60, NTU-120, and PKU-MMD. In addition, STARS exhibits significantly better results than masked prediction models in few-shot settings, where the model has not seen the actions throughout pretraining. Project page: https://soroushmehraban.github.io/stars/

[417] Real-time Multi-view Omnidirectional Depth Estimation for Real Scenarios based on Teacher-Student Learning with Unlabeled Data

Ming Li, Xiong Yang, Chaofan Wu, Jiaheng Li, Pinzhi Wang, Xuejiao Hu, Sidan Du, Yang Li

Main category: cs.CV

TL;DR: Rt-OmniMVS is a real-time omnidirectional depth estimation method for edge platforms that achieves 15 FPS using Combined Spherical Sweeping and lightweight network, enhanced by teacher-student learning for robust generalization.

Details

Motivation: Real-time omnidirectional depth estimation is crucial for autonomous driving and robotics, but existing methods struggle with real-time performance and cross-scene generalization on edge platforms.

Method: Uses Combined Spherical Sweeping method with lightweight network structure, teacher-student learning with pseudo labels from stereo matching, data/model augmentation, and HexaMODE multi-view fisheye camera system.

Result: Achieves comparable accuracy to state-of-the-art with significantly less resource consumption, 15 FPS on edge platforms, and high accuracy in various complex real-world indoor/outdoor scenarios.

Conclusion: Rt-OmniMVS enables efficient real-time omnidirectional depth estimation on edge platforms with robust generalization across diverse real-world environments.

Abstract: Omnidirectional depth estimation enables efficient 3D perception over a full 360-degree range. However, in real-world applications such as autonomous driving and robotics, achieving real-time performance and robust cross-scene generalization remains a significant challenge for existing algorithms. In this paper, we propose a real-time omnidirectional depth estimation method for edge computing platforms named Rt-OmniMVS, which introduces the Combined Spherical Sweeping method and implements the lightweight network structure to achieve real-time performance on edge computing platforms. To achieve high accuracy, robustness, and generalization in real-world environments, we introduce a teacher-student learning strategy. We leverage the high-precision stereo matching method as the teacher model to predict pseudo labels for unlabeled real-world data, and utilize data and model augmentation techniques for training to enhance performance of the student model Rt-OmniMVS. We also propose HexaMODE, an omnidirectional depth sensing system based on multi-view fisheye cameras and edge computation device. A large-scale hybrid dataset contains both unlabeled real-world data and synthetic data is collected for model training. Experiments on public datasets demonstrate that proposed method achieves results comparable to state-of-the-art approaches while consuming significantly less resource. The proposed system and algorithm also demonstrate high accuracy in various complex real-world scenarios, both indoors and outdoors, achieving an inference speed of 15 frames per second on edge computing platforms.

[418] Improving Contactless Fingerprint Recognition with Robust 3D Feature Extraction and Graph Embedding

Yuwei Jia, Siyang Zheng, Fei Feng, Zhe Cui, Fei Su

Main category: cs.CV

TL;DR: A novel contactless fingerprint recognition algorithm that leverages 3D features instead of traditional 2D approaches, improving matching accuracy and stability across multiple finger poses.

Details

Motivation: Existing contactless fingerprint algorithms treat fingerprints as 2D plain images and use traditional contact-based methods, ignoring the modality differences and intrinsic 3D features in contactless fingerprints.

Method: Recovers 3D features from input contactless fingerprints (3D shape model and 3D fingerprint features like minutiae and orientation), then uses a novel 3D graph matching method based on extracted 3D features.

Result: The method successfully improves matching accuracy on contactless fingerprint databases and performs stably across multiple poses due to 3D embeddings.

Conclusion: The proposed 3D-based approach provides significant advantages over previous 2D-based contactless fingerprint recognition algorithms, particularly in pose stability and matching accuracy.

Abstract: Contactless fingerprint has gained lots of attention in recent fingerprint studies. However, most existing contactless fingerprint algorithms treat contactless fingerprints as 2D plain fingerprints, and still utilize traditional contact-based 2D fingerprints recognition methods. This recognition approach lacks consideration of the modality difference between contactless and contact fingerprints, especially the intrinsic 3D features in contactless fingerprints. This paper proposes a novel contactless fingerprint recognition algorithm that captures the revealed 3D feature of contactless fingerprints rather than the plain 2D feature. The proposed method first recovers 3D features from the input contactless fingerprint, including the 3D shape model and 3D fingerprint feature (minutiae, orientation, etc.). Then, a novel 3D graph matching method is proposed according to the extracted 3D feature. Additionally, the proposed method is able to perform robust 3D feature extractions on various contactless fingerprints across multiple finger poses. The results of the experiments on contactless fingerprint databases show that the proposed method successfully improves the matching accuracy of contactless fingerprints. Exceptionally, our method performs stably across multiple poses of contactless fingerprints due to 3D embeddings, which is a great advantage compared to 2D-based previous contactless fingerprint recognition algorithms.

[419] Multi-Scale Fusion for Object Representation

Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen

Main category: cs.CV

TL;DR: Proposes Multi-Scale Fusion (MSF) to enhance VAE guidance in Object-Centric Learning by using image pyramids and inter/intra-scale fusion to handle objects of varying sizes.

Details

Motivation: Existing VAE guidance in Object-Centric Learning doesn't explicitly address that objects vary in pixel sizes while models excel at specific pattern scales, limiting performance on multi-scale objects.

Method: Uses image pyramids to produce intermediate representations at multiple scales, and implements inter/intra-scale fusion to augment low-quality object super-pixels with high-quality ones from other scales.

Result: Improves mainstream Object-Centric Learning methods on standard benchmarks, including state-of-the-art diffusion-based approaches.

Conclusion: Multi-Scale Fusion effectively enhances VAE guidance for Object-Centric Learning by addressing scale variation in objects, leading to improved performance across various methods.

Abstract: Representing images or videos as object-level feature vectors, rather than pixel-level feature maps, facilitates advanced visual tasks. Object-Centric Learning (OCL) primarily achieves this by reconstructing the input under the guidance of Variational Autoencoder (VAE) intermediate representation to drive so-called \textit{slots} to aggregate as much object information as possible. However, existing VAE guidance does not explicitly address that objects can vary in pixel sizes while models typically excel at specific pattern scales. We propose \textit{Multi-Scale Fusion} (MSF) to enhance VAE guidance for OCL training. To ensure objects of all sizes fall within VAE’s comfort zone, we adopt the \textit{image pyramid}, which produces intermediate representations at multiple scales; To foster scale-invariance/variance in object super-pixels, we devise \textit{inter}/\textit{intra-scale fusion}, which augments low-quality object super-pixels of one scale with corresponding high-quality super-pixels from another scale. On standard OCL benchmarks, our technique improves mainstream methods, including state-of-the-art diffusion-based ones. The source code is available on https://github.com/Genera1Z/MultiScaleFusion.

[420] Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning

Saemi Moon, Minjong Lee, Sangdon Park, Dongwoo Kim

Main category: cs.CV

TL;DR: Proposes Holistic Unlearning Benchmark (HUB) to comprehensively evaluate concept unlearning methods across six dimensions, revealing no single method excels across all criteria.

Details

Motivation: Address concerns about unethical use of text-to-image diffusion models and limitations of previous evaluations that focus only on concept removal and image quality, neglecting broader impacts like unintended side effects.

Method: Developed HUB framework evaluating unlearning methods across six key dimensions: faithfulness, alignment, pinpoint-ness, multilingual robustness, attack robustness, and efficiency, covering 33 target concepts with 16,000 prompts per concept.

Result: Investigation reveals no single unlearning method excels across all evaluation criteria, highlighting the need for more comprehensive evaluation approaches.

Conclusion: By releasing evaluation code and dataset, the authors aim to inspire further research towards more reliable and effective unlearning methods for text-to-image diffusion models.

Abstract: As text-to-image diffusion models gain widespread commercial applications, there are increasing concerns about unethical or harmful use, including the unauthorized generation of copyrighted or sensitive content. Concept unlearning has emerged as a promising solution to these challenges by removing undesired and harmful information from the pre-trained model. However, the previous evaluations primarily focus on whether target concepts are removed while preserving image quality, neglecting the broader impacts such as unintended side effects. In this work, we propose Holistic Unlearning Benchmark (HUB), a comprehensive framework for evaluating unlearning methods across six key dimensions: faithfulness, alignment, pinpoint-ness, multilingual robustness, attack robustness, and efficiency. Our benchmark covers 33 target concepts, including 16,000 prompts per concept, spanning four categories: Celebrity, Style, Intellectual Property, and NSFW. Our investigation reveals that no single method excels across all evaluation criteria. By releasing our evaluation code and dataset, we hope to inspire further research in this area, leading to more reliable and effective unlearning methods.

[421] Grouped Discrete Representation for Object-Centric Learning

Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen

Main category: cs.CV

TL;DR: Proposes Grouped Discrete Representation (GDR) for Object-Centric Learning, which decomposes features into combinatorial attributes via channel grouping and quantizes them using tuple code indexes to improve object separability and interpretability.

Details

Motivation: Existing OCL methods treat features as indivisible units and use scalar code indexes for discretization, which overlooks compositional attributes and loses attribute-level similarities and differences.

Method: Decomposes features into combinatorial attributes through organized channel grouping and quantizes features into discrete representations using tuple code indexes instead of scalar indexes.

Result: GDR consistently improves both mainstream and state-of-the-art OCL methods across various datasets, demonstrating superior object separability and interpretability in visualizations.

Conclusion: Grouped Discrete Representation effectively addresses limitations of current OCL methods by preserving attribute-level information through combinatorial decomposition and tuple-based quantization.

Abstract: Object-Centric Learning (OCL) aims to discover objects in images or videos by reconstructing the input. Representative methods achieve this by reconstructing the input as its Variational Autoencoder (VAE) discrete representations, which suppress (super-)pixel noise and enhance object separability. However, these methods treat features as indivisible units, overlooking their compositional attributes, and discretize features via scalar code indexes, losing attribute-level similarities and differences. We propose Grouped Discrete Representation (GDR) for OCL. For better generalization, features are decomposed into combinatorial attributes by organized channel grouping. For better convergence, features are quantized into discrete representations via tuple code indexes. Experiments demonstrate that GDR consistently improves both mainstream and state-of-the-art OCL methods across various datasets. Visualizations further highlight GDR’s superior object separability and interpretability. The source code is available on https://github.com/Genera1Z/GroupedDiscreteRepresentation.

[422] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han

Main category: cs.CV

TL;DR: SVDQuant accelerates diffusion models by quantizing weights and activations to 4 bits using a novel approach that handles outliers with a low-rank branch via SVD, combined with an optimized inference engine for efficiency.

Details

Motivation: As diffusion models scale, they face high memory demands and latency, making deployment challenging. Existing 4-bit quantization methods are insufficient due to sensitivity of weights and activations.

Method: Proposes SVDQuant: shifts outliers from activations to weights, uses high-precision low-rank branch (via SVD) to handle outliers while low-bit branch handles residuals. Co-designs Nunchaku inference engine to fuse kernels and reduce memory access.

Result: Reduces memory usage for 12B FLUX.1 models by 3.5×, achieves 3.0× speedup over W4A16 baseline on 16GB laptop GPU, and 3.1× speedup on RTX 5090 with NVFP4 precision while preserving image quality.

Conclusion: SVDQuant effectively enables aggressive 4-bit quantization for diffusion models while maintaining quality and achieving significant speedup and memory reduction through innovative outlier handling and optimized inference.

Abstract: Diffusion models can effectively generate high-quality images. However, as they scale, rising memory demands and higher latency pose substantial deployment challenges. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, where existing post-training quantization methods like smoothing become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing, which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights. Then, we use a high-precision, low-rank branch to take in the weight outliers with Singular Value Decomposition (SVD), while a low-bit quantized branch handles the residuals. This process eases the quantization on both sides. However, naively running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine Nunchaku that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without re-quantization. Extensive experiments on SDXL, PixArt-$Σ$, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.5$\times$, achieving 3.0$\times$ speedup over the 4-bit weight-only quantization (W4A16) baseline on the 16GB laptop 4090 GPU with INT4 precision. On the latest RTX 5090 desktop with Blackwell architecture, we achieve a 3.1$\times$ speedup compared to the W4A16 model using NVFP4 precision.

[423] Incomplete Multi-view Multi-label Classification via a Dual-level Contrastive Learning Framework

Bingyan Nie, Wulin Xie, Jiang Long, Xiaohuan Lu

Main category: cs.CV

TL;DR: Proposes a dual-level contrastive learning framework for multi-view multi-label classification with missing views and labels, using decoupled feature spaces for consistent and view-specific information.

Details

Motivation: Address the real-world challenge of incomplete views and labels in multi-view multi-label classification, which existing methods handle by coupling consistent and view-specific information in the same space.

Method: Two-channel decoupling module separates shared and view-proprietary representations, plus two contrastive learning objectives on high-level features and semantic labels to filter consistent information.

Result: Extensive experiments on benchmark datasets show the method achieves more stable and superior classification performance.

Conclusion: Decoupling heterogeneous properties into different spaces with contrastive learning effectively handles double missing multi-view multi-label classification.

Abstract: Recently, multi-view and multi-label classification have become significant domains for comprehensive data analysis and exploration. However, incompleteness both in views and labels is still a real-world scenario for multi-view multi-label classification. In this paper, we seek to focus on double missing multi-view multi-label classification tasks and propose our dual-level contrastive learning framework to solve this issue. Different from the existing works, which couple consistent information and view-specific information in the same feature space, we decouple the two heterogeneous properties into different spaces and employ contrastive learning theory to fully disentangle the two properties. Specifically, our method first introduces a two-channel decoupling module that contains a shared representation and a view-proprietary representation to effectively extract consistency and complementarity information across all views. Second, to efficiently filter out high-quality consistent information from multi-view representations, two consistency objectives based on contrastive learning are conducted on the high-level features and the semantic labels, respectively. Extensive experiments on several widely used benchmark datasets demonstrate that the proposed method has more stable and superior classification performance.

[424] Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis

Jingyu Gong, Chong Zhang, Fengqi Liu, Ke Fan, Qianyu Zhou, Xin Tan, Zhizhong Zhang, Yuan Xie

Main category: cs.CV

TL;DR: DIP is a unified framework for scene-aware motion synthesis that doesn’t require paired motion-scene data, using diffusion models and implicit policies to generate natural and plausible human-scene interactions.

Details

Motivation: Existing methods rely heavily on paired motion-scene data, making it difficult to generalize to diverse scenes when trained on limited specific scenes.

Method: Disentangles human-scene interaction from motion synthesis during training, then introduces interaction-based implicit policy into motion diffusion during inference through iterative denoising and policy optimization, with motion blending in joint rotation power space for long-term synthesis.

Result: The method shows better motion naturalness and interaction plausibility than state-of-the-art methods on synthesized scenes with ShapeNet furniture and real scenes from PROX and Replica.

Conclusion: DIP demonstrates feasibility for motion synthesis in more general tasks and versatile scenes without requiring paired motion-scene data.

Abstract: Scene-aware motion synthesis has been widely researched recently due to its numerous applications. Prevailing methods rely heavily on paired motion-scene data, while it is difficult to generalize to diverse scenes when trained only on a few specific ones. Thus, we propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis, where paired motion-scene data are no longer necessary. In this paper, we disentangle human-scene interaction from motion synthesis during training, and then introduce an interaction-based implicit policy into motion diffusion during inference. Synthesized motion can be derived through iterative diffusion denoising and implicit policy optimization, thus motion naturalness and interaction plausibility can be maintained simultaneously. For long-term motion synthesis, we introduce motion blending in joint rotation power space. The proposed method is evaluated on synthesized scenes with ShapeNet furniture, and real scenes from PROX and Replica. Results show that our framework presents better motion naturalness and interaction plausibility than cutting-edge methods. This also indicates the feasibility of utilizing the DIP for motion synthesis in more general tasks and versatile scenes. Code will be publicly available at https://github.com/jingyugong/DIP.

[425] MutualVPR: A Mutual Learning Framework for Resolving Supervision Inconsistencies via Adaptive Clustering

Qiwen Gu, Xufei Wang, Junqiao Zhao, Siyue Tao, Tiantian Feng, Ziqiao Wang, Guang Chen

Main category: cs.CV

TL;DR: MutualVPR is a mutual learning framework that integrates unsupervised view self-classification and descriptor learning to address appearance variations in Visual Place Recognition, achieving state-of-the-art performance.

Details

Motivation: Drastic appearance variations from viewpoint changes cause inconsistent supervision signals in VPR, degrading descriptor learning. Existing methods rely on labels or handcrafted rules, limiting generalization and failing to handle occlusions.

Method: Proposes MutualVPR with unsupervised view self-classification using geographic grouping and iterative K-means clustering with DINOv2 initialization. Encoder and clustering co-evolve to separate appearance variations and enable consistent supervision.

Result: Achieves state-of-the-art performance across multiple datasets, demonstrating improved view direction generalization and occlusion robustness.

Conclusion: The mutual learning framework effectively addresses appearance variation challenges in VPR through unsupervised view classification and co-evolving encoder-clustering, providing robust localization without manual labels or rules.

Abstract: Visual Place Recognition (VPR) enables robust localization through image retrieval based on learned descriptors. However, drastic appearance variations of images at the same place caused by viewpoint changes can lead to inconsistent supervision signals, thereby degrading descriptor learning. Existing methods either rely on manually defined cropping rules or labeled data for view differentiation, but they suffer from two major limitations: (1) reliance on labels or handcrafted rules restricts generalization capability; (2) even within the same view direction, occlusions can introduce feature ambiguity. To address these issues, we propose MutualVPR, a mutual learning framework that integrates unsupervised view self-classification and descriptor learning. We first group images by geographic coordinates, then iteratively refine the clusters using K-means to dynamically assign place categories without orientation labels. Specifically, we adopt a DINOv2-based encoder to initialize the clustering. During training, the encoder and clustering co-evolve, progressively separating drastic appearance variations of the same place and enabling consistent supervision. Furthermore, we find that capturing fine-grained image differences at a place enhances robustness. Experiments demonstrate that MutualVPR achieves state-of-the-art (SOTA) performance across multiple datasets, validating the effectiveness of our framework in improving view direction generalization, occlusion robustness.

[426] EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing

Umar Khalid, Kashif Munir, Hasan Iqbal, Azib Farooq, Jing Hua, Nazanin Rahnavard, Chen Chen, Victor Zhu, Zhengping Ji

Main category: cs.CV

TL;DR: EVLM is a vision-language model that interprets ambiguous editing instructions using reference visuals and reflective reasoning to produce precise, context-aware editing prompts through RKTO alignment with human rationales.

Details

Motivation: Existing vision-language models often fail to infer underlying intent from reference images, leading to inconsistent or misaligned edits when given ambiguous instructions.

Method: Uses reflective reasoning framework with Chain-of-Thought reasoning and Reflection-Aware KL-Divergence Target Optimization (RKTO) to translate subjective user intent into structured outputs, trained on 30,000 CoT examples with human-annotated rationale quality.

Result: Achieves substantial gains in alignment with human intent and generates coherent, high-quality instructions across image, video, 3D, and 4D editing tasks.

Conclusion: EVLM provides a scalable foundation for multimodal editing and reasoning by capturing fine-grained editing preferences without binary supervision.

Abstract: Editing complex visual content from ambiguous or partially specified instructions remains a core challenge in vision-language modeling. Existing models can contextualize content but often fail to infer the underlying intent within a reference image or scene, leading to inconsistent or misaligned edits. We introduce the Editing Vision-Language Model (EVLM), a system that interprets ambiguous instructions in conjunction with reference visuals to produce precise, context-aware editing prompts. EVLM’s key innovation is a reflective reasoning framework that translates subjective user intent into structured, actionable outputs by aligning with human-rated rationales through Reflection-Aware KL-Divergence Target Optimization (RKTO). By combining Chain-of-Thought (CoT) reasoning with RKTO alignment, EVLM captures fine-grained editing preferences without relying on binary supervision. Trained on a dataset of 30,000 CoT examples with human-annotated rationale quality, EVLM achieves substantial gains in alignment with human intent. Experiments across image, video, 3D, and 4D editing tasks show that EVLM generates coherent and high-quality instructions, providing a scalable foundation for multimodal editing and reasoning.

[427] Towards Visual Grounding: A Survey

Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, Changsheng Xu

Main category: cs.CV

TL;DR: This survey provides a comprehensive overview of Visual Grounding (Referring Expression Comprehension), covering its development history, recent advancements since 2021, various settings and applications, and future research directions.

Details

Motivation: To systematically organize and summarize the significant advancements in Visual Grounding since 2021, including new concepts like grounded pre-training and multimodal LLMs, and to standardize future research with fair comparisons.

Method: The authors examine the developmental history, define and organize various settings, track advancements, analyze related datasets and applications, and highlight advanced topics through systematic literature review.

Result: This survey represents the most comprehensive overview currently available in Visual Grounding, covering representative work from the past decade and providing standardized definitions for future research.

Conclusion: The paper outlines current challenges and proposes valuable future research directions, serving as an invaluable resource for both beginners and experienced researchers in the field.

Abstract: Visual Grounding, also known as Referring Expression Comprehension and Phrase Grounding, aims to ground the specific region(s) within the image(s) based on the given expression text. This task simulates the common referential relationships between visual and linguistic modalities, enabling machines to develop human-like multimodal comprehension capabilities. Consequently, it has extensive applications in various domains. However, since 2021, visual grounding has witnessed significant advancements, with emerging new concepts such as grounded pre-training, grounding multimodal LLMs, generalized visual grounding, and giga-pixel grounding, which have brought numerous new challenges. In this survey, we first examine the developmental history of visual grounding and provide an overview of essential background knowledge. We systematically track and summarize the advancements, and then meticulously define and organize the various settings to standardize future research and ensure a fair comparison. Additionally, we delve into numerous related datasets and applications, and highlight several advanced topics. Finally, we outline the challenges confronting visual grounding and propose valuable directions for future research, which may serve as inspiration for subsequent researchers. By extracting common technical details, this survey encompasses the representative work in each subtopic over the past decade. To the best of our knowledge, this paper represents the most comprehensive overview currently available in the field of visual grounding. This survey is designed to be suitable for both beginners and experienced researchers, serving as an invaluable resource for understanding key concepts and tracking the latest research developments. We keep tracing related work at https://github.com/linhuixiao/Awesome-Visual-Grounding.

[428] LWGANet: Addressing Spatial and Channel Redundancy in Remote Sensing Visual Tasks with Light-Weight Grouped Attention

Wei Lu, Xue Yang, Si-Bao Chen

Main category: cs.CV

TL;DR: LWGANet is a lightweight neural network for remote sensing that addresses spatial and channel redundancy through TGFI and LWGA modules, achieving superior accuracy-efficiency trade-off across multiple RS tasks.

Details

Motivation: Existing lightweight models designed for natural images fail to address the dual challenges of spatial redundancy from vast homogeneous backgrounds and channel redundancy from extreme scale variations in remote sensing scenarios.

Method: Proposes LWGANet with two core innovations: Top-K Global Feature Interaction (TGFI) module to mitigate spatial redundancy by focusing computation on salient regions, and Light-Weight Grouped Attention (LWGA) module to resolve channel redundancy by partitioning channels into scale-specific pathways.

Result: Extensive experiments on twelve diverse datasets across four major RS tasks show LWGANet consistently outperforms state-of-the-art lightweight backbones in both accuracy and efficiency.

Conclusion: LWGANet establishes a new robust baseline for efficient visual analysis in remote sensing images by synergistically resolving core inefficiencies in spatial and channel redundancy.

Abstract: Light-weight neural networks for remote sensing (RS) visual analysis must overcome two inherent redundancies: spatial redundancy from vast, homogeneous backgrounds, and channel redundancy, where extreme scale variations render a single feature space inefficient. Existing models, often designed for natural images, fail to address this dual challenge in RS scenarios. To bridge this gap, we propose LWGANet, a light-weight backbone engineered for RS-specific properties. LWGANet introduces two core innovations: a Top-K Global Feature Interaction (TGFI) module that mitigates spatial redundancy by focusing computation on salient regions, and a Light-Weight Grouped Attention (LWGA) module that resolves channel redundancy by partitioning channels into specialized, scale-specific pathways. By synergistically resolving these core inefficiencies, LWGANet achieves a superior trade-off between feature representation quality and computational cost. Extensive experiments on twelve diverse datasets across four major RS tasks–scene classification, oriented object detection, semantic segmentation, and change detection–demonstrate that LWGANet consistently outperforms state-of-the-art light-weight backbones in both accuracy and efficiency. Our work establishes a new, robust baseline for efficient visual analysis in RS images.

[429] Free-T2M: Robust Text-to-Motion Generation for Humanoid Robots via Frequency-Domain

Wenshuo Chen, Haozhe Jia, Songning Lai, Lei Wang, Yuqi Lin, Hongru Xiao, Lijie Hu, Yutao Yue

Main category: cs.CV

TL;DR: Free-T2M is a frequency-domain framework that improves text-to-motion generation by addressing semantic planning and fine-grained execution through stage-specific frequency alignment, achieving state-of-the-art results.

Details

Motivation: Current diffusion models for text-to-motion generation often produce semantically flawed or unstable motions, limiting their real-world applicability to humanoid robots.

Method: Reframes T2M from frequency perspective, identifies semantic planning (low-frequency) and fine-grained execution (high-frequency) stages, introduces frequency-domain temporal-adaptive module for stage-specific alignment.

Result: Dramatically improves motion quality and semantic correctness; reduces FID from 0.152 to 0.060 on StableMoFusion baseline, establishing new state-of-the-art.

Conclusion: Frequency-domain insights are critical for generating robust and reliable motions, enabling more intuitive natural language control of robots.

Abstract: Enabling humanoid robots to synthesize complex, physically coherent motions from natural language commands is a cornerstone of autonomous robotics and human-robot interaction. While diffusion models have shown promise in this text-to-motion (T2M) task, they often generate semantically flawed or unstable motions, limiting their applicability to real-world robots. This paper reframes the T2M problem from a frequency-domain perspective, revealing that the generative process mirrors a hierarchical control paradigm. We identify two critical phases: a semantic planning stage, where low-frequency components establish the global motion trajectory, and a fine-grained execution stage, where high-frequency details refine the movement. To address the distinct challenges of each phase, we introduce Frequency enhanced text-to-motion (Free-T2M), a framework incorporating stage-specific frequency-domain consistency alignment. We design a frequency-domain temporal-adaptive module to modulate the alignment effects of different frequency bands. These designs enforce robustness in the foundational semantic plan and enhance the accuracy of detailed execution. Extensive experiments show our method dramatically improves motion quality and semantic correctness. Notably, when applied to the StableMoFusion baseline, Free-T2M reduces the FID from 0.152 to 0.060, establishing a new state-of-the-art within diffusion architectures. These findings underscore the critical role of frequency-domain insights for generating robust and reliable motions, paving the way for more intuitive natural language control of robots.

[430] Mitigating Sexual Content Generation via Embedding Distortion in Text-conditioned Diffusion Models

Jaesin Ahn, Heechul Jung

Main category: cs.CV

TL;DR: DES is a text encoder-based defense that transforms unsafe embeddings toward safe regions to prevent sexual content generation while maintaining benign image quality and defending against adversarial attacks.

Details

Motivation: Existing methods struggle to defend against adversarial attacks while maintaining image quality when preventing sexual content generation in diffusion models.

Method: DES transforms unsafe embeddings from text encoders toward safe embedding regions and neutralizes “nudity” embedding by aligning it with neutral embedding.

Result: Achieved ASR of 9.47% on FLUX.1 (76.5% reduction) and 0.52% on Stable Diffusion v1.5 (63.9% reduction) compared to previous SOTA methods, while maintaining comparable FID and CLIP scores.

Conclusion: DES provides state-of-the-art defense against unsafe content generation in diffusion models while preserving benign image quality and robustness against adversarial attacks.

Abstract: Diffusion models show remarkable image generation performance following text prompts, but risk generating sexual contents. Existing approaches, such as prompt filtering, concept removal, and even sexual contents mitigation methods, struggle to defend against adversarial attacks while maintaining benign image quality. In this paper, we propose a novel approach called Distorting Embedding Space (DES), a text encoder-based defense mechanism that effectively tackles these issues through innovative embedding space control. DES transforms unsafe embeddings, extracted from a text encoder using unsafe prompts, toward carefully calculated safe embedding regions to prevent unsafe contents generation, while reproducing the original safe embeddings. DES also neutralizes the ``nudity’’ embedding, by aligning it with neutral embedding to enhance robustness against adversarial attacks. As a result, extensive experiments on explicit content mitigation and adaptive attack defense show that DES achieves state-of-the-art (SOTA) defense, with attack success rate (ASR) of 9.47% on FLUX.1, a recent popular model, and 0.52% on the widely adopted Stable Diffusion v1.5. These correspond to ASR reductions of 76.5% and 63.9% compared to previous SOTA methods, EraseAnything and AdvUnlearn, respectively. Furthermore, DES maintains benign image quality, achieving Frechet Inception Distance and CLIP score comparable to those of the original FLUX.1 and Stable Diffusion v1.5.

[431] Environment-Driven Online LiDAR-Camera Extrinsic Calibration

Zhiwei Huang, Jiaqi Li, Hongbo Zhao, Xiao Ma, Ping Zhong, Xiaohu Zhou, Wei Ye, Rui Fan

Main category: cs.CV

TL;DR: EdO-LCEC is an environment-driven online LiDAR-camera extrinsic calibration method that uses scene feature density estimation and dual-path correspondence matching to achieve high accuracy without requiring calibration targets.

Details

Motivation: Existing LiDAR-camera calibration methods rely on customized targets or fixed scene types, limiting their real-world applicability. There's a need for more flexible, target-free approaches that can work in diverse environments.

Method: Uses a generalizable scene discriminator to estimate feature density, extracts LiDAR intensity and depth features from multiple perspectives, employs dual-path correspondence matching (DPCM) for cross-modal feature matching, and formulates calibration as a joint optimization problem with global constraints.

Result: Extensive experiments show EdO-LCEC outperforms state-of-the-art methods, especially in challenging scenarios with sparse point clouds or partially overlapping sensor views.

Conclusion: EdO-LCEC provides a robust, target-free solution for LiDAR-camera extrinsic calibration that adapts to varying environmental conditions and achieves superior performance compared to existing approaches.

Abstract: LiDAR-camera extrinsic calibration (LCEC) is crucial for multi-modal data fusion in autonomous robotic systems. Existing methods, whether target-based or target-free, typically rely on customized calibration targets or fixed scene types, which limit their applicability in real-world scenarios. To address these challenges, we present EdO-LCEC, the first environment-driven online calibration approach. Unlike traditional target-free methods, EdO-LCEC employs a generalizable scene discriminator to estimate the feature density of the application environment. Guided by this feature density, EdO-LCEC extracts LiDAR intensity and depth features from varying perspectives to achieve higher calibration accuracy. To overcome the challenges of cross-modal feature matching between LiDAR and camera, we introduce dual-path correspondence matching (DPCM), which leverages both structural and textural consistency for reliable 3D-2D correspondences. Furthermore, we formulate the calibration process as a joint optimization problem that integrates global constraints across multiple views and scenes, thereby enhancing overall accuracy. Extensive experiments on real-world datasets demonstrate that EdO-LCEC outperforms state-of-the-art methods, particularly in scenarios involving sparse point clouds or partially overlapping sensor views.

[432] FreeBlend: Advancing Concept Blending with Staged Feedback-Driven Interpolation Diffusion

Yufan Zhou, Haoyu Shen, Huan Wang

Main category: cs.CV

TL;DR: FreeBlend is a training-free framework for concept blending that uses transferred image embeddings and stepwise interpolation to improve semantic coherence and visual quality.

Details

Motivation: Address limitations in existing concept blending methods that suffer from incompatible semantic information and shape/appearance discrepancies.

Method: Uses transferred image embeddings as conditional inputs, stepwise increasing interpolation between latents, and feedback-driven mechanism to update auxiliary latents in reverse order.

Result: Significantly improves semantic coherence and visual quality of blended images, producing compelling and coherent results.

Conclusion: FreeBlend effectively addresses challenges in concept blending through its training-free framework with progressive blending and feedback mechanisms.

Abstract: Concept blending is a promising yet underexplored area in generative models. While recent approaches, such as embedding mixing and latent modification based on structural sketches, have been proposed, they often suffer from incompatible semantic information and discrepancies in shape and appearance. In this work, we introduce FreeBlend, an effective, training-free framework designed to address these challenges. To mitigate cross-modal loss and enhance feature detail, we leverage transferred image embeddings as conditional inputs. The framework employs a stepwise increasing interpolation strategy between latents, progressively adjusting the blending ratio to seamlessly integrate auxiliary features. Additionally, we introduce a feedback-driven mechanism that updates the auxiliary latents in reverse order, facilitating global blending and preventing rigid or unnatural outputs. Extensive experiments demonstrate that our method significantly improves both the semantic coherence and visual quality of blended images, yielding compelling and coherent results.

[433] Articulate That Object Part (ATOP): 3D Part Articulation via Text and Motion Personalization

Aditya Vora, Sauradip Nag, Kai Wang, Hao Zhang

Main category: cs.CV

TL;DR: ATOP is a few-shot method that uses motion personalization to articulate static 3D objects based on text prompts describing part motion, leveraging diffusion models and differentiable rendering.

Details

Motivation: Existing methods struggle with motion articulation due to scarce datasets with motion attribute annotations, requiring a solution that can generalize well with limited data.

Method: Few-shot finetuning of diffusion models to inject articulation awareness, using text prompts to generate motion samples and image prompting from 3D objects to personalize motion, then transferring motion to 3D space via differentiable rendering with score distillation sampling.

Result: Experiments on PartNet-Mobility and ACD datasets show ATOP generates realistic motion samples with higher accuracy and better generalization in few-shot settings compared to prior approaches.

Conclusion: ATOP effectively addresses the challenge of 3D object articulation with limited motion data by combining diffusion models, personalization techniques, and differentiable rendering for improved motion prediction.

Abstract: We present ATOP (Articulate That Object Part), a novel few-shot method based on motion personalization to articulate a static 3D object with respect to a part and its motion as prescribed in a text prompt. Given the scarcity of available datasets with motion attribute annotations, existing methods struggle to generalize well in this task. In our work, the text input allows us to tap into the power of modern-day diffusion models to generate plausible motion samples for the right object category and part. In turn, the input 3D object provides ``image prompting’’ to personalize the generated motion to the very input object. Our method starts with a few-shot finetuning to inject articulation awareness to current diffusion models to learn a unique motion identifier associated with the target object part. Our finetuning is applied to a pre-trained diffusion model for controllable multi-view motion generation, trained with a small collection of reference motion frames demonstrating appropriate part motion. The resulting motion model can then be employed to realize plausible motion of the input 3D object from multiple views. At last, we transfer the personalized motion to the 3D space of the object via differentiable rendering to optimize part articulation parameters by a score distillation sampling loss. Experiments on PartNet-Mobility and ACD datasets demonstrate that our method can generate realistic motion samples with higher accuracy, leading to more generalizable 3D motion predictions compared to prior approaches in the few-shot setting.

[434] Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance

Meng Wang, Fan Wu, Ruihui Li, Yunchuan Qin, Zhuo Tang, Kenli Li

Main category: cs.CV

TL;DR: FlowScene is a novel temporal 3D semantic scene completion method that uses optical flow guidance to improve scene understanding by capturing motion dynamics and achieving temporal consistency in autonomous driving scenarios.

Details

Motivation: Existing SSC methods are limited to sparse current-frame information or naive multi-frame stacking, failing to capture effective scene context, motion dynamics, and temporal consistency needed for reliable autonomous driving perception.

Method: Proposes two key components: (1) Flow-Guided Temporal Aggregation module that aligns and aggregates temporal features using optical flow to capture motion-aware context and deformable structures; (2) Occlusion-Guided Voxel Refinement module that injects occlusion masks and temporal features into 3D voxel space for adaptive voxel refinement.

Result: Achieves state-of-the-art performance on SemanticKITTI and SSCBench-KITTI-360 benchmarks, demonstrating significant improvement in 3D scene completion accuracy.

Conclusion: FlowScene effectively addresses temporal consistency challenges in 3D semantic scene completion by leveraging optical flow guidance, enabling better integration of motion, viewpoints, and occlusions for improved autonomous driving perception.

Abstract: 3D Semantic Scene Completion (SSC) provides comprehensive scene geometry and semantics for autonomous driving perception, which is crucial for enabling accurate and reliable decision-making. However, existing SSC methods are limited to capturing sparse information from the current frame or naively stacking multi-frame temporal features, thereby failing to acquire effective scene context. These approaches ignore critical motion dynamics and struggle to achieve temporal consistency. To address the above challenges, we propose a novel temporal SSC method FlowScene: Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance. By leveraging optical flow, FlowScene can integrate motion, different viewpoints, occlusions, and other contextual cues, thereby significantly improving the accuracy of 3D scene completion. Specifically, our framework introduces two key components: (1) a Flow-Guided Temporal Aggregation module that aligns and aggregates temporal features using optical flow, capturing motion-aware context and deformable structures; and (2) an Occlusion-Guided Voxel Refinement module that injects occlusion masks and temporally aggregated features into 3D voxel space, adaptively refining voxel representations for explicit geometric modeling. Experimental results demonstrate that FlowScene achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks.

[435] Role Bias in Diffusion Models: Diagnosing and Mitigating through Intermediate Decomposition

Sina Malakouti, Adriana Kovashka

Main category: cs.CV

TL;DR: RoleBench benchmark reveals role collapse in T2I diffusion models where they default to frequent reversed relations. ReBind framework uses directional decomposition with intermediate compositions to significantly reduce role collapse.

Details

Motivation: T2I diffusion models struggle with compositional image generation, particularly action-based relations, consistently defaulting to frequent reversed relations (role collapse). The research aims to understand and address this limitation.

Method: Introduces ReBind, a lightweight framework that teaches role bindings using carefully selected active/passive intermediate compositions through simple fine-tuning to gradually mitigate role collapse.

Result: Intermediate compositions through fine-tuning significantly reduce role collapse, with humans preferring ReBind more than 78% compared to state-of-the-art methods.

Conclusion: Distributional asymmetries play a key role in compositional failures, and directional decomposition offers a simple, effective path for improving generalization in T2I models.

Abstract: Text-to-image (T2I) diffusion models exhibit impressive photorealistic image generation capabilities, yet they struggle in compositional image generation. In this work, we introduce RoleBench, a benchmark focused on evaluating compositional generalization in action-based relations (e.g., “mouse chasing cat”). We show that state-of-the-art T2I models and compositional generation methods consistently default to frequent reversed relations (i.e., “cat chasing mouse”), a phenomenon we call role collapse. Related works attribute this to the model’s architectural limitation or underrepresentation in the data. Our key insight reveals that while models fail on rare compositions when their inversions are common, they can successfully generate similar intermediate compositions (e.g., “mouse chasing boy”), suggesting that this limitation is also due to the presence of frequent counterparts rather than just the absence of rare compositions. Motivated by this, we hypothesize that directional decomposition can gradually mitigate role collapse. We test this via ReBind, a lightweight framework that teaches role bindings using carefully selected active/passive intermediate compositions. Experiments suggest that intermediate compositions through simple fine-tuning can significantly reduce role collapse, with humans preferring ReBind more than 78% compared to state-of-the-art methods. Our findings highlight the role of distributional asymmetries in compositional failures and offer a simple, effective path for improving generalization.

[436] Distilling 3D distinctive local descriptors for 6D pose estimation

Amir Hamza, Andrea Caraffa, Davide Boscaini, Fabio Poiesi

Main category: cs.CV

TL;DR: Knowledge distillation framework that trains efficient student model to regress local descriptors from GeDi teacher, enabling significant speedup while maintaining competitive zero-shot 6D pose estimation performance.

Details

Motivation: GeDi has strong zero-shot 6D pose estimation capabilities but is computationally impractical for real-world applications due to expensive inference. Need to retain effectiveness while improving efficiency.

Method: Knowledge distillation framework with efficient large-scale training procedure robust to occlusions, and novel loss formulation handling weak supervision from non-distinctive teacher descriptors.

Result: Significant reduction in inference time while maintaining competitive performance on five BOP Benchmark datasets, bringing zero-shot 6D pose estimation closer to real-time feasibility.

Conclusion: The proposed knowledge distillation approach successfully bridges the gap between GeDi’s effectiveness and practical efficiency, making zero-shot 6D pose estimation more viable for real-world applications.

Abstract: Three-dimensional local descriptors are crucial for encoding geometric surface properties, making them essential for various point cloud understanding tasks. Among these descriptors, GeDi has demonstrated strong zero-shot 6D pose estimation capabilities but remains computationally impractical for real-world applications due to its expensive inference process. Can we retain GeDi’s effectiveness while significantly improving its efficiency? In this paper, we explore this question by introducing a knowledge distillation framework that trains an efficient student model to regress local descriptors from a GeDi teacher. Our key contributions include: an efficient large-scale training procedure that ensures robustness to occlusions and partial observations while operating under compute and storage constraints, and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors. We validate our approach on five BOP Benchmark datasets and demonstrate a significant reduction in inference time while maintaining competitive performance with existing methods, bringing zero-shot 6D pose estimation closer to real-time feasibility. Project Website: https://tev-fbk.github.io/dGeDi/

[437] Stack Transformer Based Spatial-Temporal Attention Model for Dynamic Sign Language and Fingerspelling Recognition

Koki Hirooka, Abu Saleh Musa Miah, Tatsuya Murakami, Md. Al Mehedi Hasan, Yong Seok Hwang, Jungpil Shin

Main category: cs.CV

TL;DR: SSTAN: Transformer-based SLR model using sequential spatial-temporal attention to capture sign language dynamics without fixed graphs, achieving SOTA on multiple datasets.

Details

Motivation: Overcome limitations of GCNs that rely on fixed skeletal graphs for sign language recognition by developing a more flexible architecture.

Method: Sequential Spatio-Temporal Attention Network with hierarchical stacked design: Spatial MHA for intra-frame joints, Temporal MHA for inter-frame dependencies.

Result: Achieved SOTA performance on fingerspelling categories (JSL, KSL) and new SOTA for skeleton-only methods on WLASL, outperforming self-supervised pre-training approaches.

Conclusion: SSTAN demonstrates high data efficiency and effectiveness in capturing complex sign language dynamics without predefined graph structures.

Abstract: Hand gesture-based Sign Language Recognition (SLR) serves as a crucial communication bridge between deaf and non-deaf individuals. While Graph Convolutional Networks (GCNs) are common, they are limited by their reliance on fixed skeletal graphs. To overcome this, we propose the Sequential Spatio-Temporal Attention Network (SSTAN), a novel Transformer-based architecture. Our model employs a hierarchical, stacked design that sequentially integrates Spatial Multi-Head Attention (MHA) to capture intra-frame joint relationships and Temporal MHA to model long-range inter-frame dependencies. This approach allows the model to efficiently learn complex spatio-temporal patterns without predefined graph structures. We validated our model through extensive experiments on diverse, large-scale datasets (WLASL, JSL, and KSL). A key finding is that our model, trained entirely from scratch, achieves state-of-the-art (SOTA) performance in the challenging fingerspelling categories (JSL and KSL). Furthermore, it establishes a new SOTA for skeleton-only methods on WLASL, outperforming several approaches that rely on complex self-supervised pre-training. These results demonstrate our model’s high data efficiency and its effectiveness in capturing the intricate dynamics of sign language. The official implementation is available at our GitHub repository: \href{https://github.com/K-Hirooka-Aizu/skeleton-slr-transformer}{https://github.com/K-Hirooka-Aizu/skeleton-slr-transformer}.

[438] MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation

Jiaxin Huang, Runnan Chen, Ziwen Li, Zhengqing Gao, Xiao He, Yandong Guo, Mingming Gong, Tongliang Liu

Main category: cs.CV

TL;DR: MLLM-For3D transfers 2D multimodal LLM reasoning to 3D scenes by generating multi-view pseudo masks, using spatial consistency and Token-for-Query alignment to overcome 3D hallucination and inconsistency issues.

Details

Motivation: While 2D MLLMs excel at reasoning segmentation, adapting these capabilities to 3D scenes remains underexplored, with challenges in 3D context and spatial consistency across views.

Method: Generate multi-view pseudo segmentation masks and text embeddings using MLLMs, unproject 2D masks to 3D, align with text embeddings using spatial consistency strategy and Token-for-Query approach for multimodal semantic alignment.

Result: Outperforms existing 3D reasoning segmentation methods on indoor scene benchmarks without labeled 3D training data, effectively interpreting user intent and reasoning about spatial relationships.

Conclusion: MLLM-For3D successfully transfers 2D MLLM knowledge to 3D reasoning segmentation through spatial consistency and cross-view alignment, enabling zero-shot 3D scene understanding.

Abstract: Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. While recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation, adapting these capabilities to 3D scenes remains underexplored. In this paper, we introduce MLLM-For3D, a simple yet effective framework that transfers knowledge from 2D MLLMs to 3D scene understanding. Specifically, we utilize MLLMs to generate multi-view pseudo segmentation masks and corresponding text embeddings, then unproject 2D masks into 3D space and align them with the text embeddings. The primary challenge lies in the absence of 3D context and spatial consistency across multiple views, causing the model to hallucinate objects that do not exist and fail to target objects consistently. Training the 3D model with such irrelevant objects leads to performance degradation. To address this, we introduce a spatial consistency strategy to enforce that segmentation masks remain coherent in the 3D space, effectively capturing the geometry of the scene. Moreover, we develop a Token-for-Query approach for multimodal semantic alignment, enabling consistent identification of the same object across different views. Extensive evaluations on various challenging indoor scene benchmarks demonstrate that, even without any labeled 3D training data, MLLM-For3D outperforms existing 3D reasoning segmentation methods, effectively interpreting user intent, understanding 3D scenes, and reasoning about spatial relationships.

[439] LangBridge: Interpreting Image as a Combination of Language Embeddings

Jiaqi Liao, Yuwei Niu, Fanqing Meng, Hao Li, Changyao Tian, Yinuo Du, Yuwen Xiong, Dianqi Li, Xizhou Zhu, Li Yuan, Jifeng Dai, Yu Cheng

Main category: cs.CV

TL;DR: LangBridge is a novel vision-language adapter that maps visual tokens directly to LLM vocabulary embeddings, enabling pretraining-free transfer across different LLM backbones while maintaining performance.

Details

Motivation: Current MLP adapters in LVLMs are poorly understood, require retraining for different LLM backbones, and lack interpretability in how they bridge the modality gap between vision and language.

Method: LangBridge explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings, grounding visual representations directly in the language model’s embedding space.

Result: LangBridge adapters pre-trained on smaller models (Qwen2-0.5B) can be directly applied to larger models (LLaMA3-8B, Qwen2.5-14B) with competitive performance and nearly no degradation.

Conclusion: LangBridge provides interpretable vision-language alignment and enables efficient plug-and-play transfer across multiple LLMs without requiring retraining.

Abstract: Recent years have witnessed remarkable advances in Large Vision-Language Models (LVLMs), which have achieved human-level performance across various complex vision-language tasks. Following LLaVA’s paradigm, mainstream LVLMs typically employ a shallow MLP for visual-language alignment through a two-stage training process: pretraining for cross-modal alignment followed by instruction tuning. While this approach has proven effective, the underlying mechanisms of how MLPs bridge the modality gap remain poorly understood. Although some research has explored how LLMs process transformed visual tokens, few studies have investigated the fundamental alignment mechanism. Furthermore, the MLP adapter requires retraining whenever switching LLM backbones. To address these limitations, we first investigate the working principles of MLP adapters and discover that they learn to project visual embeddings into subspaces spanned by corresponding text embeddings progressively. Based on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings. This innovative design enables pretraining-free adapter transfer across different LLMs while maintaining performance. Our experimental results demonstrate that a LangBridge adapter pre-trained on Qwen2-0.5B can be directly applied to larger models such as LLaMA3-8B or Qwen2.5-14B while maintaining competitive performance. Overall, LangBridge enables interpretable vision-language alignment by grounding visual representations in LLM vocab embedding, while its plug-and-play design ensures efficient reuse across multiple LLMs with nearly no performance degradation. See our project page at https://curryx-001.github.io/LangBridge.github.io/

[440] Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models

Jiahuan Long, Tingsong Jiang, Wen Yao, Yizhe Xiong, Zhengqin Xu, Shuai Jia, Hanqing Liu, Chao Ma

Main category: cs.CV

TL;DR: Proposes a parameter-free fine-tuning method for vision foundation models that selects and enhances pre-trained features instead of updating parameters, reducing GPU memory usage while maintaining performance.

Details

Motivation: Traditional fine-tuning methods require parameter updates, which can be computationally expensive and modify millions of weights. The authors aim to leverage redundancies in foundation models like SAM to create a more efficient adaptation approach.

Method: Uses a channel selection algorithm based on model output differences to identify redundant channels, then selectively replaces them with more effective ones to enhance task-specific feature representation without parameter updates.

Result: Experiments on out-of-domain and in-domain datasets show the method is efficient and effective across various vision tasks (segmentation, depth estimation, classification) and can integrate with existing fine-tuning strategies like LoRA and Adapter to boost performance.

Conclusion: The parameter-free approach offers a new perspective on fine-tuning foundation models by focusing on feature selection and reuse rather than parameter modification, significantly reducing GPU memory overhead while maintaining or improving performance.

Abstract: Vision foundation models (VFMs) have demonstrated remarkable capabilities in learning universal visual representations. However, adapting these models to downstream tasks conventionally requires parameter updates, with even parameter-efficient fine-tuning methods necessitating the modification of thousands to millions of weights. In this paper, we investigate the redundancies in the segment anything model (SAM) and then propose a novel parameter-free fine-tuning method. Unlike traditional fine-tuning methods that adjust parameters, our method emphasizes selecting, reusing, and enhancing pre-trained features, offering a new perspective on fine-tuning foundation models. Specifically, we introduce a channel selection algorithm based on the model’s output difference to identify redundant and effective channels. By selectively replacing the redundant channels with more effective ones, we filter out less useful features and reuse more task-irrelevant features to downstream tasks, thereby enhancing the task-specific feature representation. Experiments on both out-of-domain and in-domain datasets demonstrate the efficiency and effectiveness of our method in different vision tasks (e.g., image segmentation, depth estimation and image classification). Notably, our approach can seamlessly integrate with existing fine-tuning strategies (e.g., LoRA, Adapter), further boosting the performance of already fine-tuned models. Moreover, since our channel selection involves only model inference, our method significantly reduces GPU memory overhead.

[441] AGO: Adaptive Grounding for Open World 3D Occupancy Prediction

Peizheng Li, Shuxiao Ding, You Zhou, Qingwen Zhang, Onat Inak, Larissa Triess, Niklas Hanselmann, Marius Cordts, Andreas Zell

Main category: cs.CV

TL;DR: AGO is a 3D semantic occupancy prediction framework that uses adaptive grounding with vision-language models to handle both known and unknown objects in open-world scenarios.

Details

Motivation: Existing methods using VLM-derived 2D pseudo-labels are limited by predefined label spaces and lack general prediction capabilities, while direct alignment with pretrained image embeddings often fails due to inconsistent representations.

Method: AGO encodes images and class prompts into 3D and text embeddings, uses similarity-based grounding training with 3D pseudo-labels, and employs a modality adapter to align 3D embeddings with VLM-derived image embeddings.

Result: AGO improves unknown object prediction in zero-shot and few-shot transfer and achieves state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU on Occ3D-nuScenes.

Conclusion: The proposed AGO framework effectively addresses modality gaps and enables robust open-world 3D semantic occupancy prediction through adaptive grounding.

Abstract: Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects. Transferring open-vocabulary knowledge from vision-language models (VLMs) offers a promising direction but remains challenging. However, methods based on VLM-derived 2D pseudo-labels with traditional supervision are limited by a predefined label space and lack general prediction capabilities. Direct alignment with pretrained image embeddings, on the other hand, often fails to achieve reliable performance because of inconsistent image and text representations in VLMs. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. AGO first encodes surrounding images and class prompts into 3D and text embeddings, respectively, leveraging similarity-based grounding training with 3D pseudo-labels. Additionally, a modality adapter maps 3D embeddings into a space aligned with VLM-derived image embeddings, reducing modality gaps. Experiments on Occ3D-nuScenes show that AGO improves unknown object prediction in zero-shot and few-shot transfer while achieving state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU. Code is available at: https://github.com/EdwardLeeLPZ/AGO.

[442] Video CLIP Model for Multi-View Echocardiography Interpretation

Ryo Takizawa, Satoshi Kodera, Tempei Kabayama, Ryo Matsuoka, Yuta Ando, Yuto Nakamura, Haruki Settai, Norihiko Takeda

Main category: cs.CV

TL;DR: Developed a video-language model for echocardiographic interpretation that processes full video sequences from five standard views, trained on 60,747 video-report pairs, improving diagnostic accuracy through motion analysis and multi-view support.

Details

Motivation: Existing medical vision-language models rely on single-frame inputs, which reduces diagnostic accuracy for conditions identifiable only through cardiac motion. Echocardiographic videos from multiple views vary in suitability for detecting specific conditions.

Method: Developed a video-language model that processes full video sequences from five standard echocardiographic views, trained on 60,747 echocardiographic video-report pairs. Evaluated gains from video input and multi-view support with various pretrained models.

Result: The model demonstrated improved retrieval performance by leveraging video input and multi-view support compared to single-frame approaches.

Conclusion: Video-language models that process full echocardiographic video sequences from multiple standard views can enhance diagnostic accuracy by capturing cardiac motion and leveraging complementary information from different views.

Abstract: Echocardiography records ultrasound videos of the heart, enabling clinicians to assess cardiac function. Recent advances in large-scale vision-language models (VLMs) have spurred interest in automating echocardiographic interpretation. However, most existing medical VLMs rely on single-frame (image) inputs, which can reduce diagnostic accuracy for conditions identifiable only through cardiac motion. In addition, echocardiographic videos are captured from multiple views, each varying in suitability for detecting specific conditions. Leveraging multiple views may therefore improve diagnostic performance. We developed a video-language model that processes full video sequences from five standard views, trained on 60,747 echocardiographic video-report pairs. We evaluated the gains in retrieval performance from video input and multi-view support, including the contributions of various pretrained models. Code and model weights are available at https://github.com/UTcardiology/video-echo-clip

[443] Enhanced Partially Relevant Video Retrieval through Inter- and Intra-Sample Analysis with Coherence Prediction

Junlong Ren, Gangjian Zhang, Hao Wang, Yu Hu, Jian Shu, Hui Xiong

Main category: cs.CV

TL;DR: A novel PRVR framework addressing semantic asymmetry by exploiting inter-sample correlation and intra-sample redundancy through three modules: ICE for pseudo-positive pairs, IRM for redundant feature mining, and TCP for temporal structure learning.

Details

Motivation: Existing PRVR methods coarsely align videos and text queries, neglecting the critical cross-modal dual nature of inter-sample correlation and intra-sample redundancy, leading to suboptimal performance due to semantic asymmetry.

Method: Three core modules: 1) ICE captures inter-sample correlation using semantically similar unpaired text-video combinations; 2) IRM mitigates intra-sample redundancy by distinguishing relevant from redundant moments; 3) TCP enhances temporal learning by predicting original order of shuffled frames/moments.

Result: Extensive experiments demonstrate superior performance compared to prior methods, achieving state-of-the-art results in partially relevant video retrieval.

Conclusion: The proposed framework effectively addresses semantic asymmetry in PRVR by systematically exploiting inter-sample correlation and intra-sample redundancy, leading to significant performance improvements over existing approaches.

Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve the target video that is partially relevant to the text query. The primary challenge in PRVR arises from the semantic asymmetry between textual and visual modalities, as videos often contain substantial content irrelevant to the query. Existing methods coarsely align paired videos and text queries to construct the semantic space, neglecting the critical cross-modal dual nature inherent in this task: inter-sample correlation and intra-sample redundancy. To this end, we propose a novel PRVR framework to systematically exploit these two characteristics. Our framework consists of three core modules. First, the Inter Correlation Enhancement (ICE) module captures inter-sample correlation by identifying semantically similar yet unpaired text queries and video moments, combining them to form pseudo-positive pairs for more robust semantic space construction. Second, the Intra Redundancy Mining (IRM) module mitigates intra-sample redundancy by mining redundant moment features and distinguishing them from query-relevant moments, encouraging the model to learn more discriminative representations. Finally, to reinforce these modules, we introduce the Temporal Coherence Prediction (TCP) module, which enhances temporal structure learning by training the model to predict the original temporal order of randomly shuffled video frames and moments. Extensive experiments demonstrate the superiority of our approach compared to prior methods, achieving state-of-the-art results.

[444] DeepAndes: A Self-Supervised Vision Foundation Model for Multi-Spectral Remote Sensing Imagery of the Andes

Junlin Guo, James R. Zimmer-Dauphinee, Jordan M. Nieusma, Siqi Lu, Quan Liu, Ruining Deng, Can Cui, Jialin Yue, Yizhe Lin, Tianyuan Yao, Juming Xiong, Junchao Zhu, Chongyu Qu, Yuechen Yang, Mitchell Wilkes, Xiao Wang, Parker VanValkenburgh, Steven A. Wernke, Yuankai Huo

Main category: cs.CV

TL;DR: DeepAndes is a transformer-based vision foundation model trained on 3 million multi-spectral satellite images, specifically designed for Andean archaeology. It uses a customized DINOv2 self-supervised learning algorithm for 8-band imagery and outperforms other models in few-shot learning scenarios.

Details

Motivation: Remote sensing surveys combined with deep learning can provide unique insights into archaeological patterns, but conventional supervised methods struggle with fine-grained feature annotation at scale. Existing vision foundation models are designed for RGB images rather than multi-spectral satellite imagery used in archaeology.

Method: Developed DeepAndes - a transformer-based vision foundation model using customized DINOv2 self-supervised learning algorithm optimized for 8-band multi-spectral satellite imagery. Trained on 3 million images and evaluated through imbalanced image classification, image instance retrieval, and pixel-level semantic segmentation tasks.

Result: DeepAndes achieves superior F1 scores, mean average precision, and Dice scores in few-shot learning scenarios, significantly outperforming models trained from scratch or pre-trained on smaller datasets.

Conclusion: Large-scale self-supervised pre-training is highly effective for archaeological remote sensing applications, and DeepAndes represents the first foundation model specifically designed for the Andes region.

Abstract: By mapping sites at large scales using remotely sensed data, archaeologists can generate unique insights into long-term demographic trends, inter-regional social networks, and past adaptations to climate change. Remote sensing surveys complement field-based approaches, and their reach can be especially great when combined with deep learning and computer vision techniques. However, conventional supervised deep learning methods face challenges in annotating fine-grained archaeological features at scale. While recent vision foundation models have shown remarkable success in learning large-scale remote sensing data with minimal annotations, most off-the-shelf solutions are designed for RGB images rather than multi-spectral satellite imagery, such as the 8-band data used in our study. In this paper, we introduce DeepAndes, a transformer-based vision foundation model trained on three million multi-spectral satellite images, specifically tailored for Andean archaeology. DeepAndes incorporates a customized DINOv2 self-supervised learning algorithm optimized for 8-band multi-spectral imagery, marking the first foundation model designed explicitly for the Andes region. We evaluate its image understanding performance through imbalanced image classification, image instance retrieval, and pixel-level semantic segmentation tasks. Our experiments show that DeepAndes achieves superior F1 scores, mean average precision, and Dice scores in few-shot learning scenarios, significantly outperforming models trained from scratch or pre-trained on smaller datasets. This underscores the effectiveness of large-scale self-supervised pre-training in archaeological remote sensing. Codes will be available on https://github.com/geopacha/DeepAndes.

[445] FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors

Chenxi Li, Weijie Wang, Qiang Li, Bruno Lepri, Nicu Sebe, Weizhi Nie

Main category: cs.CV

TL;DR: FreeInsert enables text-driven 3D object insertion without spatial priors by leveraging foundation models to disentangle object generation from spatial placement.

Details

Motivation: Existing methods rely on spatial priors like 2D masks or 3D bounding boxes and struggle with consistency, limiting flexibility and scalability in real-world applications.

Method: Uses MLLM parser to extract structured semantics, spatial reasoning for pose initialization, hierarchical refinement with spatial semantics, and appearance enhancement using inserted-object images.

Result: Achieves semantically coherent, spatially precise, and visually realistic 3D insertions without spatial priors.

Conclusion: FreeInsert offers a user-friendly and flexible editing experience for text-driven 3D scene editing.

Abstract: Text-driven object insertion in 3D scenes is an emerging task that enables intuitive scene editing through natural language. However, existing 2D editing-based methods often rely on spatial priors such as 2D masks or 3D bounding boxes, and they struggle to ensure consistency of the inserted object. These limitations hinder flexibility and scalability in real-world applications. In this paper, we propose FreeInsert, a novel framework that leverages foundation models including MLLMs, LGMs, and diffusion models to disentangle object generation from spatial placement. This enables unsupervised and flexible object insertion in 3D scenes without spatial priors. FreeInsert starts with an MLLM-based parser that extracts structured semantics, including object types, spatial relationships, and attachment regions, from user instructions. These semantics guide both the reconstruction of the inserted object for 3D consistency and the learning of its degrees of freedom. We leverage the spatial reasoning capabilities of MLLMs to initialize object pose and scale. A hierarchical, spatially aware refinement stage further integrates spatial semantics and MLLM-inferred priors to enhance placement. Finally, the appearance of the object is improved using the inserted-object image to enhance visual fidelity. Experimental results demonstrate that FreeInsert achieves semantically coherent, spatially precise, and visually realistic 3D insertions without relying on spatial priors, offering a user-friendly and flexible editing experience.

[446] Descriptive Image-Text Matching with Graded Contextual Similarity

Jinhyun Jang, Jiyoung Lee, Kwanghoon Sohn

Main category: cs.CV

TL;DR: DITM proposes descriptive image-text matching that learns graded contextual similarity by exploring language descriptiveness, moving beyond binary supervision to handle many-to-many image-text relationships.

Details

Motivation: Existing image-text matching methods use sparse binary supervision that neglects inherent many-to-many correspondences and fails to capture the descriptive flexibility of language, limiting their ability to represent complex image-text relationships.

Method: Uses cumulative TF-IDF to formulate sentence descriptiveness scores, then leverages these scores to refine false negative labeling and build precise matching by aligning sentences in generic-to-specific order.

Result: Extensive experiments on MS-COCO, Flickr30K, and CxC datasets demonstrate superior performance in representing complex image-text relationships compared to state-of-the-art methods, with enhanced hierarchical reasoning ability.

Conclusion: DITM effectively addresses limitations of binary supervision by incorporating graded contextual similarity and language descriptiveness, enabling better discovery of optimal matches and potential positive pairs in image-text matching.

Abstract: Image-text matching aims to build correspondences between visual and textual data by learning their pairwise similarities. Most existing approaches have adopted sparse binary supervision, indicating whether a pair of images and sentences matches or not. However, such sparse supervision covers a limited subset of image-text relationships, neglecting their inherent many-to-many correspondences; an image can be described in numerous texts at different descriptive levels. Moreover, existing approaches overlook the implicit connections from general to specific descriptions, which form the underlying rationale for the many-to-many relationships between vision and language. In this work, we propose descriptive image-text matching, called DITM, to learn the graded contextual similarity between image and text by exploring the descriptive flexibility of language. We formulate the descriptiveness score of each sentence with cumulative term frequency-inverse document frequency (TF-IDF) to balance the pairwise similarity according to the keywords in the sentence. Our method leverages sentence descriptiveness to learn robust image-text matching in two key ways: (1) to refine the false negative labeling, dynamically relaxing the connectivity between positive and negative pairs, and (2) to build more precise matching, aligning a set of relevant sentences in a generic-to-specific order. By moving beyond rigid binary supervision, DITM enhances the discovery of both optimal matches and potential positive pairs. Extensive experiments on MS-COCO, Flickr30K, and CxC datasets demonstrate the effectiveness of our method in representing complex image-text relationships compared to state-of-the-art approaches. In addition, DITM enhances the hierarchical reasoning ability of the model, supported by the extensive analysis on HierarCaps benchmark.

[447] HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ashmal Vayani, Mukund S. Chettiar, Amandeep Singh, Mubarak Shah, Deval Pandya

Main category: cs.CV

TL;DR: HumaniBench is a comprehensive benchmark with 32,000 real-world image-question pairs to evaluate large multimodal models’ alignment with human-centered values like fairness, ethics, and inclusivity.

Details

Motivation: Current LMMs are insufficiently evaluated for alignment with human-centered values despite impressive performance on standard vision-language tasks, creating a gap in responsible AI development.

Method: Used a semi-automated annotation pipeline with domain expert validation to create 32,000 image-question pairs across 7 alignment principles (fairness, ethics, empathy, inclusivity, reasoning, robustness, multilinguality) through diverse VQA tasks.

Result: Benchmarking revealed distinct behavioral patterns - some models excel in reasoning, fairness, and multilinguality while others show better robustness and grounding, but most struggle to balance task accuracy with ethical/inclusive responses. Chain-of-thought prompting and test-time scaling improved alignment.

Conclusion: HumaniBench provides the first rigorous testbed for HC evaluation, enabling diagnosis of limitations, quantification of alignment trade-offs, and promoting responsible LMM development through transparent, reproducible evaluation.

Abstract: Large multimodal models (LMMs) have achieved impressive performance on vision-language tasks such as visual question answering (VQA), image captioning, and visual grounding; however, they remain insufficiently evaluated for alignment with human-centered (HC) values such as fairness, ethics, and inclusivity. To address this gap, we introduce HumaniBench, a comprehensive benchmark comprising 32,000 real-world image-question pairs and an accompanying evaluation suite. Using a semi-automated annotation pipeline, each sample is rigorously validated by domain experts to ensure accuracy and ethical integrity. HumaniBench assesses LMMs across seven key alignment principles: fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality through a diverse set of open- and closed-ended VQA tasks. Grounded in AI ethics theory and real-world social contexts, these principles provide a holistic lens for examining human-aligned behavior. Benchmarking results reveal distinct behavioral patterns: certain model families excel in reasoning, fairness, and multilinguality, while others demonstrate greater robustness and grounding capability. However, most models still struggle to balance task accuracy with ethical and inclusive responses. Techniques such as chain-of-thought prompting and test-time scaling yield measurable alignment gains. As the first benchmark explicitly designed for HC evaluation, HumaniBench offers a rigorous testbed to diagnose limitations, quantify alignment trade-offs, and promote the responsible development of large multimodal models. All data and code are publicly released to ensure transparency and reproducibility. https://vectorinstitute.github.io/HumaniBench/

[448] PSDiffusion: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment

Dingbang Huang, Wenbo Li, Yifei Zhao, Xinyu Pan, Chun Wang, Yanhong Zeng, Bo Dai

Main category: cs.CV

TL;DR: PSDiffusion is a unified diffusion framework that generates multiple transparent image layers simultaneously using pre-trained image diffusion models, addressing limitations of existing sequential generation methods.

Details

Motivation: Existing methods for transparent layer generation either decompose from single images or generate layers sequentially, which limits global layout modeling, physically plausible interactions, and visual effects like shadows and reflections due to insufficient shared global context.

Method: Proposes PSDiffusion with a global layer interaction mechanism that leverages image composition priors from pre-trained image diffusion models for simultaneous multi-layer text-to-image generation, ensuring individual layer quality and coherent spatial/visual relationships.

Result: Extensive experiments on benchmark datasets show PSDiffusion outperforms existing methods in generating multi-layer images with plausible structure and enhanced visual fidelity.

Conclusion: PSDiffusion successfully addresses the limitations of previous approaches by enabling simultaneous multi-layer generation with improved global context modeling and visual quality.

Abstract: Transparent image layer generation plays a significant role in digital art and design workflows. Existing methods typically decompose transparent layers from a single RGB image using a set of tools or generate multiple transparent layers sequentially. Despite some promising results, these methods often limit their ability to model global layout, physically plausible interactions, and visual effects such as shadows and reflections with high alpha quality due to limited shared global context among layers. To address this issue, we propose PSDiffusion, a unified diffusion framework that leverages image composition priors from pre-trained image diffusion model for simultaneous multi-layer text-to-image generation. Specifically, our method introduces a global layer interaction mechanism to generate layered images collaboratively, ensuring both individual layer quality and coherent spatial and visual relationships across layers. We include extensive experiments on benchmark datasets to demonstrate that PSDiffusion is able to outperform existing methods in generating multi-layer images with plausible structure and enhanced visual fidelity.

[449] DNOI-4DRO: Deep 4D Radar Odometry with Differentiable Neural-Optimization Iterations

Shouyi Lu, Huanyu Zhou, Guirong Zhuo, Xiao Tang

Main category: cs.CV

TL;DR: DNOI-4DRO is a novel 4D radar odometry model that combines neural networks with geometric optimization using a differentiable neural-optimization iteration operator, achieving superior performance on radar datasets.

Details

Motivation: To address the challenge of sparse 4D radar point clouds and improve radar odometry by integrating neural network learning with traditional geometric optimization methods.

Method: Uses a dual-stream 4D radar backbone with multi-scale geometric features and clustering-based class-aware features, followed by neural network motion flow estimation and Gauss-Newton pose refinement through differentiable optimization.

Result: Outperforms recent classical and learning-based approaches on VoD and Snail-Radar datasets, achieving results comparable to A-LOAM with LiDAR mapping optimization.

Conclusion: The proposed DNOI-4DRO model successfully integrates learning and optimization for 4D radar odometry, demonstrating superior performance and potential for practical applications.

Abstract: A novel learning-optimization-combined 4D radar odometry model, named DNOI-4DRO, is proposed in this paper. The proposed model seamlessly integrates traditional geometric optimization with end-to-end neural network training, leveraging an innovative differentiable neural-optimization iteration operator. In this framework, point-wise motion flow is first estimated using a neural network, followed by the construction of a cost function based on the relationship between point motion and pose in 3D space. The radar pose is then refined using Gauss-Newton updates. Additionally, we design a dual-stream 4D radar backbone that integrates multi-scale geometric features and clustering-based class-aware features to enhance the representation of sparse 4D radar point clouds. Extensive experiments on the VoD and Snail-Radar datasets demonstrate the superior performance of our model, which outperforms recent classical and learning-based approaches. Notably, our method even achieves results comparable to A-LOAM with mapping optimization using LiDAR point clouds as input. Our models and code will be publicly released.

[450] Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, Weiyao Lin

Main category: cs.CV

TL;DR: DiTF is a training-free framework that addresses massive activations in Diffusion Transformers (DiTs) to extract better features for visual correspondence tasks, outperforming DINO and SD-based models.

Details

Motivation: DiTs exhibit massive activations where few feature dimensions dominate, leading to uninformative representations and performance degradation. This is linked to Adaptive Layer Normalization (AdaLN) in DiTs.

Method: Proposed DiTF framework that leverages AdaLN to adaptively localize and normalize massive activations through channel-wise modulation, plus a channel discard strategy to mitigate adverse effects.

Result: DiTF outperforms DINO and SD-based models, achieving +9.4% on Spair-71k and +4.4% on AP-10K-C.S., establishing new SOTA for DiTs in visual correspondence.

Conclusion: The AdaLN-based DiTF framework effectively addresses massive activations in DiTs, enabling extraction of semantically discriminative features and superior performance in visual correspondence tasks.

Abstract: Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as \textit{massive activations}, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We analyze these dimension-concentrated massive activations and uncover that their concentration is inherently linked to the Adaptive Layer Normalization (AdaLN) in DiTs. Building on these findings, we propose the \textbf{Di}ffusion \textbf{T}ransformer \textbf{F}eature (DiTF), a training-free AdaLN-based framework that extracts semantically discriminative features from DiTs. Specifically, DiTF leverages AdaLN to adaptively localize and normalize massive activations through channel-wise modulation. Furthermore, a channel discard strategy is introduced to mitigate the adverse effects of massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (\eg, with +9.4% on Spair-71k and +4.4% on AP-10K-C.S.).

[451] TextDiffuser-RL: Efficient and Robust Text Layout Optimization for High-Fidelity Text-to-Image Synthesis

Kazi Mahathir Rahman, Showrin Rahman, Sharmin Sultana Srishty

Main category: cs.CV

TL;DR: A novel two-stage pipeline combining reinforcement learning for text layout generation with diffusion-based image synthesis, achieving comparable quality to TextDiffuser-2 while being 42.29x faster and running efficiently on both CPU and GPU platforms.

Details

Motivation: Existing text-embedded image generation methods like TextDiffuser-2 are resource-intensive and limited in their ability to run efficiently on both CPU and GPU platforms, creating deployment challenges.

Method: Two-stage pipeline: 1) Reinforcement learning for rapid and optimized text layout generation (bounding box prediction), 2) Diffusion-based image synthesis model. The RL approach accelerates layout generation while reducing overlaps.

Result: Achieves comparable performance to TextDiffuser-2 in text placement and image synthesis quality, while being 42.29 times faster and requiring only 2 MB of CPU RAM for inference. Can run on CPU-only systems unlike TextDiffuser-2.

Conclusion: The proposed framework offers a more efficient and flexible solution for text-embedded image generation, maintaining high quality while significantly improving speed and deployment flexibility across different hardware platforms.

Abstract: Text-embedded image generation plays a critical role in industries such as graphic design, advertising, and digital content creation. Text-to-Image generation methods leveraging diffusion models, such as TextDiffuser-2, have demonstrated promising results in producing images with embedded text. TextDiffuser-2 effectively generates bounding box layouts that guide the rendering of visual text, achieving high fidelity and coherence. However, existing approaches often rely on resource-intensive processes and are limited in their ability to run efficiently on both CPU and GPU platforms. To address these challenges, we propose a novel two-stage pipeline that integrates reinforcement learning (RL) for rapid and optimized text layout generation with a diffusion-based image synthesis model. Our RL-based approach significantly accelerates the bounding box prediction step while reducing overlaps, allowing the system to run efficiently on both CPUs and GPUs. Extensive evaluations demonstrate that our framework achieves comparable performance to TextDiffuser-2 in terms of text placement and image synthesis, while offering markedly faster runtime and increased flexibility. Our method produces high-quality images comparable to TextDiffuser-2, while being 42.29 times faster and requiring only 2 MB of CPU RAM for inference, unlike TextDiffuser-2’s M1 model, which is not executable on CPU-only systems.

[452] OccLE: Label-Efficient 3D Semantic Occupancy Prediction

Naiyu Fang, Zheyuan Zhou, Fayao Liu, Xulei Yang, Jiacheng Wei, Lemiao Qiu, Guosheng Lin

Main category: cs.CV

TL;DR: OccLE is a label-efficient 3D semantic occupancy prediction method that achieves competitive performance with only 10% of voxel annotations by decoupling semantic and geometric learning tasks and fusing them effectively.

Details

Motivation: Existing approaches require either costly full supervision with voxel-level annotations or limited self-supervision with suboptimal performance. There's a need for efficient methods that maintain high performance with limited annotations.

Method: Decouples semantic and geometric learning tasks. Semantic branch distills 2D foundation models for aligned pseudo labels. Geometric branch integrates image and LiDAR inputs with semi-supervision. Fuses features through Dual Mamba and uses scatter-accumulated projection for supervision.

Result: Achieves competitive performance with only 10% of voxel annotations on SemanticKITTI and Occ3D-nuScenes datasets.

Conclusion: OccLE provides an effective label-efficient solution for 3D semantic occupancy prediction that balances performance and annotation cost.

Abstract: 3D semantic occupancy prediction offers an intuitive and efficient scene understanding and has attracted significant interest in autonomous driving perception. Existing approaches either rely on full supervision, which demands costly voxel-level annotations, or on self-supervision, which provides limited guidance and yields suboptimal performance. To address these challenges, we propose OccLE, a Label-Efficient 3D Semantic Occupancy Prediction that takes images and LiDAR as inputs and maintains high performance with limited voxel annotations. Our intuition is to decouple the semantic and geometric learning tasks and then fuse the learned feature grids from both tasks for the final semantic occupancy prediction. Therefore, the semantic branch distills 2D foundation model to provide aligned pseudo labels for 2D and 3D semantic learning. The geometric branch integrates image and LiDAR inputs in cross-plane synergy based on their inherency, employing semi-supervision to enhance geometry learning. We fuse semantic-geometric feature grids through Dual Mamba and incorporate a scatter-accumulated projection to supervise unannotated prediction with aligned pseudo labels. Experiments show that OccLE achieves competitive performance with only 10% of voxel annotations on the SemanticKITTI and Occ3D-nuScenes datasets. The code will be publicly released on https://github.com/NerdFNY/OccLE

[453] Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation

Mehrdad Noori, David Osowiechi, Gustavo Adolfo Vargas Hakim, Ali Bahri, Moslem Yazdanpanah, Sahar Dastani, Farzad Beizaee, Ismail Ben Ayed, Christian Desrosiers

Main category: cs.CV

TL;DR: Proposes MLMP, a test-time adaptation method for open-vocabulary semantic segmentation that uses multi-level features and multi-prompt entropy minimization, outperforming classification TTA baselines.

Details

Motivation: Test-time adaptation has been overlooked in dense prediction tasks like open-vocabulary semantic segmentation, while being well-studied for image classification.

Method: Multi-Level and Multi-Prompt (MLMP) entropy minimization that integrates features from intermediate vision-encoder layers and uses different text-prompt templates at both global CLS token and local pixel-wise levels.

Result: Consistently delivers significant gains over direct adoption of TTA classification baselines across 87 distinct test scenarios including 9 datasets, 15 corruptions, and domain shifts.

Conclusion: The method is plug-and-play, requires no additional training data or labels, works with single test samples, and establishes a comprehensive benchmark for future TTA research in open-vocabulary segmentation.

Abstract: Recently, test-time adaptation has attracted wide interest in the context of vision-language models for image classification. However, to the best of our knowledge, the problem is completely overlooked in dense prediction tasks such as Open-Vocabulary Semantic Segmentation (OVSS). In response, we propose a novel TTA method tailored to adapting VLMs for segmentation during test time. Unlike TTA methods for image classification, our Multi-Level and Multi-Prompt (MLMP) entropy minimization integrates features from intermediate vision-encoder layers and is performed with different text-prompt templates at both the global CLS token and local pixel-wise levels. Our approach could be used as plug-and-play for any segmentation network, does not require additional training data or labels, and remains effective even with a single test sample. Furthermore, we introduce a comprehensive OVSS TTA benchmark suite, which integrates a rigorous evaluation protocol, nine segmentation datasets, 15 common synthetic corruptions, and additional real and rendered domain shifts, \textbf{with a total of 87 distinct test scenarios}, establishing a standardized and comprehensive testbed for future TTA research in open-vocabulary segmentation. Our experiments on this suite demonstrate that our segmentation-tailored method consistently delivers significant gains over direct adoption of TTA classification baselines. Code and data are available at https://github.com/dosowiechi/MLMP.

[454] VideoCAD: A Dataset and Model for Learning Long-Horizon 3D CAD UI Interactions from Video

Brandon Man, Ghadi Nehme, Md Ferdous Alam, Faez Ahmed

Main category: cs.CV

TL;DR: VideoCAD is a large-scale synthetic dataset of CAD UI interactions, enabling learning of complex engineering tasks and benchmarking multimodal LLMs on spatial reasoning.

Details

Motivation: Existing UI agent datasets focus on simple mobile/web tasks, failing to capture the complexity and precision requirements of professional CAD tools.

Method: Created VideoCAD dataset with 41K annotated CAD operation videos using automated framework, and proposed VideoCADFormer model for learning UI interactions from video.

Result: VideoCAD offers 20x longer time horizons than existing datasets. VideoCADFormer outperforms behavior cloning baselines for CAD interaction learning.

Conclusion: VideoCAD reveals key challenges in video-based UI understanding including precise action grounding, multimodal reasoning, and long-horizon dependencies.

Abstract: Computer-Aided Design (CAD) is a time-consuming and complex process, requiring precise, long-horizon user interactions with intricate 3D interfaces. While recent advances in AI-driven user interface (UI) agents show promise, most existing datasets and methods focus on short, low-complexity tasks in mobile or web applications, failing to capture the demands of professional engineering tools. In this work, we introduce VideoCAD, the first attempt to model UI interactions for precision engineering tasks. Specifically, VideoCAD is a large-scale synthetic dataset consisting of over 41K annotated video recordings of CAD operations, generated using an automated framework for collecting high-fidelity UI action data from human-made CAD designs. Compared to existing datasets, VideoCAD offers an order-of-magnitude increase in complexity for real-world engineering UI tasks, with time horizons up to 20x longer than those in other datasets. We show two important downstream applications of VideoCAD: (1) learning UI interactions from professional 3D CAD tools for precision tasks and (2) a visual question-answering (VQA) benchmark designed to evaluate multimodal large language models (LLMs) on spatial reasoning and video understanding. To learn the UI interactions, we propose VideoCADFormer, a state-of-the-art model for learning CAD interactions directly from video, which outperforms existing behavior cloning baselines. Both VideoCADFormer and the VQA benchmark derived from VideoCAD reveal key challenges in the current state of video-based UI understanding, including the need for precise action grounding, multi-modal and spatial reasoning, and long-horizon dependencies.

[455] Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control

Danfeng Li, Hui Zhang, Sheng Wang, Jiacheng Li, Zuxuan Wu

Main category: cs.CV

TL;DR: Seg2Any is a novel segmentation-mask-to-image framework that achieves precise spatial layout control by decoupling mask conditions into semantic and shape components, preventing attribute leakage across entities.

Details

Motivation: Existing S2I methods fail to simultaneously ensure semantic consistency and shape consistency, and struggle with attribute leakage in multi-entity scenarios.

Method: Decouples segmentation masks into regional semantic components (Semantic Alignment Attention Mask) and high-frequency shape components (Entity Contour Map), and uses Attribute Isolation Attention Mask to prevent cross-entity attribute leakage.

Result: Achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly excelling in fine-grained spatial and attribute control of entities.

Conclusion: Seg2Any effectively addresses the limitations of existing S2I methods by ensuring both semantic and shape consistency while preventing attribute leakage, enabling precise spatial layout control in text-to-image generation.

Abstract: Despite recent advances in diffusion models, top-tier text-to-image (T2I) models still struggle to achieve precise spatial layout control, i.e. accurately generating entities with specified attributes and locations. Segmentation-mask-to-image (S2I) generation has emerged as a promising solution by incorporating pixel-level spatial guidance and regional text prompts. However, existing S2I methods fail to simultaneously ensure semantic consistency and shape consistency. To address these challenges, we propose Seg2Any, a novel S2I framework built upon advanced multimodal diffusion transformers (e.g. FLUX). First, to achieve both semantic and shape consistency, we decouple segmentation mask conditions into regional semantic and high-frequency shape components. The regional semantic condition is introduced by a Semantic Alignment Attention Mask, ensuring that generated entities adhere to their assigned text prompts. The high-frequency shape condition, representing entity boundaries, is encoded as an Entity Contour Map and then introduced as an additional modality via multi-modal attention to guide image spatial structure. Second, to prevent attribute leakage across entities in multi-entity scenarios, we introduce an Attribute Isolation Attention Mask mechanism, which constrains each entity’s image tokens to attend exclusively to themselves during image self-attention. To support open-set S2I generation, we construct SACap-1M, a large-scale dataset containing 1 million images with 5.9 million segmented entities and detailed regional captions, along with a SACap-Eval benchmark for comprehensive S2I evaluation. Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities.

[456] FaceSleuth-R: Adaptive Orientation-Aware Attention for Robust Micro-Expression Recognition

Linquan Wu, Tianxiang Jiang, Haoyu Yang, Wenhao Duan, Shaochao Lin, Zixuan Wang, Yini Fang, Jacky Keung

Main category: cs.CV

TL;DR: FaceSleuth-R introduces Single-Orientation Attention (SOA) to address micro-expression recognition’s generalization problems by focusing on quasi-invariant motion orientations rather than superficial features, achieving state-of-the-art performance across benchmarks.

Details

Motivation: Micro-expression recognition suffers from poor generalization in real-world settings due to overfitting to dataset-specific appearance cues and vulnerability to domain shifts, limiting practical deployment.

Method: Proposes FaceSleuth-R framework with novel Single-Orientation Attention (SOA) module - a lightweight, differentiable operator that learns layer-specific optimal orientations to guide attention towards robust motion cues.

Result: SOA discovers universal near-vertical motion prior across datasets; FaceSleuth-R achieves superior generalization in Leave-One-Dataset-Out protocols and establishes state-of-the-art results across multiple benchmarks.

Conclusion: Adaptive orientation-aware attention is crucial for developing truly generalized and high-performing micro-expression recognition systems that can handle domain shifts effectively.

Abstract: Micro-expression recognition (MER) has achieved impressive accuracy in controlled laboratory settings. However, its real-world applicability faces a significant generalization cliff, severely hindering practical deployment due to poor performance on unseen data and susceptibility to domain shifts. Existing attention mechanisms often overfit to dataset-specific appearance cues or rely on fixed spatial priors, making them fragile in diverse environments. We posit that robust MER requires focusing on quasi-invariant motion orientations inherent to micro-expressions, rather than superficial pixel-level features. To this end, we introduce \textbf{FaceSleuth-R}, a framework centered on our novel \textbf{Single-Orientation Attention (SOA)} module. SOA is a lightweight, differentiable operator that enables the network to learn layer-specific optimal orientations, effectively guiding attention towards these robust motion cues. Through extensive experiments, we demonstrate that SOA consistently discovers a universal near-vertical motion prior across diverse datasets. More critically, FaceSleuth-R showcases superior generalization in rigorous Leave-One-Dataset-Out (LODO) protocols, significantly outperforming baselines and state-of-the-art methods when confronted with domain shifts. Furthermore, our approach establishes \textbf{state-of-the-art results} across several benchmarks. This work highlights adaptive orientation-aware attention as a key paradigm for developing truly generalized and high-performing MER systems.

[457] Bridging Weakly-Supervised Learning and VLM Distillation: Noisy Partial Label Learning for Efficient Downstream Adaptation

Qian-Wei Wang, Yuqiu Xie, Letian Zhang, Zimo Liu, Shu-Tao Xia

Main category: cs.CV

TL;DR: Proposes Co-Reg method for learning from noisy partial labels generated by pre-trained VLMs, using collaborative consistency regularization to handle instance-dependent noise.

Details

Motivation: To enable manual-annotation-free training by leveraging pre-trained VLMs for label annotation, addressing the challenge of instance-dependent noise in NPLL settings.

Method: Simultaneously trains two neural networks with collaborative purification via Co-Pseudo-Labeling, enforcing consistency regularization in both label and feature spaces with anti-overfitting mechanisms.

Result: Developed an effective approach to handle instance-dependent noise patterns from VLMs through collaborative learning and multi-space consistency constraints.

Conclusion: The Co-Reg method successfully addresses the unique challenges of learning from VLM-generated noisy partial labels, enabling annotation-free training for downstream tasks.

Abstract: In the context of noisy partial label learning (NPLL), each training sample is associated with a set of candidate labels annotated by multiple noisy annotators. With the emergence of high-performance pre-trained vision-language models (VLMs) such as CLIP, LLaVA and GPT-4V, the direction of using these models to replace time-consuming manual annotation workflows and achieve manual-annotation-free" training for downstream tasks has become a highly promising research avenue. This paper focuses on learning from noisy partial labels annotated by pre-trained VLMs and proposes an innovative collaborative consistency regularization (Co-Reg) method. Unlike the symmetric noise primarily addressed in traditional noisy label learning, the noise generated by pre-trained models is instance-dependent, embodying the underlying patterns of the pre-trained models themselves, which significantly increases the learning difficulty for the model. To address this, we simultaneously train two neural networks that implement collaborative purification of training labels through a Co-Pseudo-Labeling" mechanism, while enforcing consistency regularization constraints in both the label space and feature representation space. Specifically, we construct multiple anti-overfitting mechanisms that efficiently mine latent information from noisy partially labeled samples including alternating optimization of contrastive feature representations and pseudo-labels, as well as maintaining prototypical class vectors in the shared feature space.

[458] LGM-Pose: A Lightweight Global Modeling Network for Real-time Human Pose Estimation

Biao Guo, Fangmin Guo, Guibo Luo, Xiaonan Luo, Feng Zhang

Main category: cs.CV

TL;DR: LGM-Pose is a lightweight single-branch network for multi-person pose estimation that uses MobileViM blocks with attention modules to capture global context while maintaining efficiency.

Details

Motivation: Current lightweight multi-person pose estimation methods use multi-branch CNN architectures that struggle with global context capture and have high latency due to complex, redundant structures.

Method: Proposed LGM-Pose with lightweight MobileViM Block using Lightweight Attentional Representation Module (LARM) with Non-Parametric Transformation Operation for global information, and Shuffle-Integrated Fusion Module (SFusion) for multi-scale integration.

Result: Experimental evaluations on COCO and MPII datasets show reduced parameters compared to existing lightweight methods while achieving superior performance and faster processing speeds.

Conclusion: The proposed LGM-Pose network effectively addresses global context capture and latency issues in lightweight pose estimation through innovative single-branch architecture with attention and fusion modules.

Abstract: Most of the current top-down multi-person pose estimation lightweight methods are based on multi-branch parallel pure CNN network architecture, which often struggle to capture the global context required for detecting semantically complex keypoints and are hindered by high latency due to their intricate and redundant structures. In this article, an approximate single-branch lightweight global modeling network (LGM-Pose) is proposed to address these challenges. In the network, a lightweight MobileViM Block is designed with a proposed Lightweight Attentional Representation Module (LARM), which integrates information within and between patches using the Non-Parametric Transformation Operation(NPT-Op) to extract global information. Additionally, a novel Shuffle-Integrated Fusion Module (SFusion) is introduced to effectively integrate multi-scale information, mitigating performance degradation often observed in single-branch structures. Experimental evaluations on the COCO and MPII datasets demonstrate that our approach not only reduces the number of parameters compared to existing mainstream lightweight methods but also achieves superior performance and faster processing speeds.

[459] Bidirectional Image-Event Guided Fusion Framework for Low-Light Image Enhancement

Zhanwen Liu, Huanna Song, Yang Wang, Nan Yang, Weiping Ding, Yisheng An

Main category: cs.CV

TL;DR: BiLIE is a bidirectional image-event fusion framework for low-light image enhancement that addresses flickering artifacts and structural discontinuities through dynamic adaptive filtering and cross-modal attention mechanisms.

Details

Motivation: Frame-based cameras lose details in extreme low-light conditions, and existing event-guided approaches suffer from flickering artifacts and structural discontinuities caused by dynamic illumination changes and event sparsity.

Method: Proposes BiLIE with Dynamic Adaptive Filtering Enhancement (DAFE) for suppressing flickering artifacts and preserving high-frequency details, and Bidirectional Guided Awareness Fusion (BGAF) for breakpoint-aware restoration and structure-aware enhancement through cross-modal attention.

Result: Outperforms existing methods on RELIE and LIE datasets, achieving 0.81dB higher PSNR on RELIE with significant improvements in edge restoration, color fidelity, and noise suppression.

Conclusion: BiLIE effectively addresses low-light enhancement challenges through bidirectional image-event fusion, producing clear, smooth, and structurally intact results while introducing a high-quality dataset (RELIE) for better evaluation.

Abstract: Under extreme low-light conditions, frame-based cameras suffer from severe detail loss due to limited dynamic range. Recent studies have introduced event cameras for event-guided low-light image enhancement. However, existing approaches often overlook the flickering artifacts and structural discontinuities caused by dynamic illumination changes and event sparsity. To address these challenges, we propose BiLIE, a Bidirectional image-event guided fusion framework for Low-Light Image Enhancement, which achieves mutual guidance and complementary enhancement between the two modalities. First, to highlight edge details, we develop a Dynamic Adaptive Filtering Enhancement (DAFE) module that performs adaptive high-pass filtering on event representations to suppress flickering artifacts and preserve high-frequency information under varying illumination. Subsequently, we design a Bidirectional Guided Awareness Fusion (BGAF) mechanism, which achieves breakpoint-aware restoration from images to events and structure-aware enhancement from events to images through a two-stage attention mechanism, establishing cross-modal consistency, thereby producing a clear, smooth, and structurally intact fused representation. Moreover, recognizing that existing datasets exhibit insufficient ground-truth fidelity and color accuracy, we construct a high-quality low-light image-event dataset (RELIE) via a reliable ground truth refinement scheme. Extensive experiments demonstrate that our method outperforms existing approaches on both the RELIE and LIE datasets. Notably, on RELIE, BiLIE exceeds the state-of-the-art by 0.81dB in PSNR and shows significant advantages in edge restoration, color fidelity, and noise suppression.

[460] Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, Eli Shechtman

Main category: cs.CV

TL;DR: Self Forcing is a training method for autoregressive video diffusion models that addresses exposure bias by conditioning frame generation on previously self-generated outputs during training, enabling real-time streaming video generation with high quality.

Details

Motivation: To solve the exposure bias problem in autoregressive video diffusion models, where models trained on ground-truth context must generate sequences conditioned on imperfect outputs during inference, leading to performance degradation.

Method: Uses autoregressive rollout with key-value caching during training to condition each frame’s generation on previously self-generated outputs, employs few-step diffusion with stochastic gradient truncation for efficiency, and implements rolling KV cache for efficient video extrapolation.

Result: Achieves real-time streaming video generation with sub-second latency on a single GPU while matching or surpassing the quality of significantly slower non-causal diffusion models.

Conclusion: Self Forcing effectively addresses exposure bias in video diffusion models, enabling efficient and high-quality real-time video generation through its novel training paradigm.

Abstract: We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame’s generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models. Project website: http://self-forcing.github.io/

[461] Consistent Story Generation: Unlocking the Potential of Zigzag Sampling

Mingxiao Li, Mang Ning, Marie-Francine Moens

Main category: cs.CV

TL;DR: Zigzag Sampling with Asymmetric Prompts and Visual Sharing is a training-free method that improves subject consistency in visual story generation by alternating between asymmetric prompting and visual sharing across images.

Details

Motivation: Existing text-to-image models struggle with maintaining subject consistency across multiple images for visual storytelling, with current methods being either resource-intensive or having limited success.

Method: Uses zigzag sampling mechanism with asymmetric prompting to retain subject characteristics and visual sharing module to transfer visual cues across generated images.

Result: Significantly outperforms previous approaches in generating coherent and consistent visual stories based on both quantitative metrics and qualitative evaluations.

Conclusion: The proposed training-free sampling strategy effectively enhances subject consistency in visual story generation without requiring fine-tuning.

Abstract: Text-to-image generation models have made significant progress in producing high-quality images from textual descriptions, yet they continue to struggle with maintaining subject consistency across multiple images, a fundamental requirement for visual storytelling. Existing methods attempt to address this by either fine-tuning models on large-scale story visualization datasets, which is resource-intensive, or by using training-free techniques that share information across generations, which still yield limited success. In this paper, we introduce a novel training-free sampling strategy called Zigzag Sampling with Asymmetric Prompts and Visual Sharing to enhance subject consistency in visual story generation. Our approach proposes a zigzag sampling mechanism that alternates between asymmetric prompting to retain subject characteristics, while a visual sharing module transfers visual cues across generated images to %further enforce consistency. Experimental results, based on both quantitative metrics and qualitative evaluations, demonstrate that our method significantly outperforms previous approaches in generating coherent and consistent visual stories. The code is available at https://github.com/Mingxiao-Li/Asymmetry-Zigzag-StoryDiffusion.

[462] Sekai: A Video Dataset towards World Exploration

Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, Zizhen Li, Fanrui Zhang, Jiaxin Ai, Zhixiang Wang, Yuwei Wu, Tong He, Jiangmiao Pang, Yu Qiao, Yunde Jia, Kaipeng Zhang

Main category: cs.CV

TL;DR: Sekai is a large-scale first-person video dataset with 5,000+ hours of walking/drone footage from 100+ countries, designed specifically for world exploration training in video generation models.

Details

Motivation: Existing video datasets are limited for world exploration due to short duration, static scenes, limited locations, and lack of exploration/world annotations.

Method: Developed an efficient toolbox to collect, pre-process and annotate first-person videos with location, scene, weather, crowd density, captions, and camera trajectories from 750 cities worldwide.

Result: Created a high-quality dataset with comprehensive annotations that demonstrates scale, diversity, and effectiveness for training video generation models.

Conclusion: Sekai will benefit video generation and world exploration research, enabling valuable applications in interactive world exploration.

Abstract: Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning “world” in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UVA) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Comprehensive analyses and experiments demonstrate the dataset’s scale, diversity, annotation quality, and effectiveness for training video generation models. We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications. The project page is https://lixsp11.github.io/sekai-project/.

[463] StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation

Haodong Li, Chen Wang, Jiahui Lei, Kostas Daniilidis, Lingjie Liu

Main category: cs.CV

TL;DR: StereoDiff is a two-stage video depth estimator that combines stereo matching for static regions with video depth diffusion for dynamic regions, achieving state-of-the-art performance on real-world dynamic video depth benchmarks.

Details

Motivation: Video depth estimation differs from image depth estimation due to different temporal consistency requirements for dynamic vs static regions. Static regions benefit from stereo matching across frames for strong 3D cues, while dynamic regions require learning from video data due to triangulation constraint violations.

Method: Two-stage approach: 1) Stereo matching for static regions (backgrounds) using cross-frame matching, 2) Video depth diffusion for dynamic regions to ensure smooth transitions. The synergy is mathematically validated through frequency domain analysis.

Result: Achieves state-of-the-art performance on zero-shot, real-world dynamic video depth benchmarks (both indoor and outdoor), demonstrating superior consistency and accuracy in video depth estimation.

Conclusion: StereoDiff effectively synergizes stereo matching and video depth diffusion, capturing complementary strengths for static and dynamic regions respectively, leading to improved video depth estimation performance.

Abstract: Recent video depth estimation methods achieve great performance by following the paradigm of image depth estimation, i.e., typically fine-tuning pre-trained video diffusion models with massive data. However, we argue that video depth estimation is not a naive extension of image depth estimation. The temporal consistency requirements for dynamic and static regions in videos are fundamentally different. Consistent video depth in static regions, typically backgrounds, can be more effectively achieved via stereo matching across all frames, which provides much stronger global 3D cues. While the consistency for dynamic regions still should be learned from large-scale video depth data to ensure smooth transitions, due to the violation of triangulation constraints. Based on these insights, we introduce StereoDiff, a two-stage video depth estimator that synergizes stereo matching for mainly the static areas with video depth diffusion for maintaining consistent depth transitions in dynamic areas. We mathematically demonstrate how stereo matching and video depth diffusion offer complementary strengths through frequency domain analysis, highlighting the effectiveness of their synergy in capturing the advantages of both. Experimental results on zero-shot, real-world, dynamic video depth benchmarks, both indoor and outdoor, demonstrate StereoDiff’s SoTA performance, showcasing its superior consistency and accuracy in video depth estimation.

[464] Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs

Amirmohammad Izadi, Mohammad Ali Banayeeanzade, Fatemeh Askari, Ali Rahimiakbar, Mohammad Mahdi Vahedi, Hosein Hasani, Mahdieh Soleymani Baghshah

Main category: cs.CV

TL;DR: VISER is a method that enhances visual reasoning in LVLMs by adding spatial structure to visual inputs and using sequential parsing prompts, achieving significant performance improvements without multiple queries.

Details

Motivation: Current LVLMs struggle with the binding problem - reliably associating perceptual features with correct visual referents - leading to errors in counting, visual search, scene description, and spatial relationship tasks due to parallel processing of visual features without spatial grounding.

Method: VISER augments visual inputs with low-level spatial structures and pairs them with textual prompts that encourage sequential, spatially-aware parsing, using only single-query inference.

Result: VISER improved GPT-4o performance by 25.0% on visual search, 26.8% on counting, and 9.5% on spatial relationship tasks, while reducing edit distance error in scene description by 0.32 on 2D datasets. Visual modifications proved essential, as purely textual strategies were insufficient or even degraded performance.

Conclusion: Visual input design is more important than purely linguistic reasoning strategies for enhancing compositional and spatial reasoning in LVLMs, with visual structuring being a powerful general approach.

Abstract: Despite progress in Large Vision-Language Models (LVLMs), their capacity for visual reasoning is often limited by the binding problem: the failure to reliably associate perceptual features with their correct visual referents. This limitation underlies persistent errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. A key factor is that current LVLMs process visual features largely in parallel, lacking mechanisms for spatially grounded, serial attention. This paper introduces Visual Input Structure for Enhanced Reasoning (VISER), a simple, effective method that augments visual inputs with low-level spatial structures and pairs them with a textual prompt that encourages sequential, spatially-aware parsing. We empirically demonstrate substantial performance improvements across core visual reasoning tasks, using only a single-query inference. Specifically, VISER improves GPT-4o performance on visual search, counting, and spatial relationship tasks by 25.0%, 26.8%, and 9.5%, respectively, and reduces edit distance error in scene description by 0.32 on 2D datasets. Furthermore, we find that the visual modification is essential for these gains; purely textual strategies, including Chain-of-Thought prompting, are insufficient and can even degrade performance. VISER underscores the importance of visual input design over purely linguistically based reasoning strategies and suggests that visual structuring is a powerful and general approach for enhancing compositional and spatial reasoning in LVLMs.

[465] High-Frequency Semantics and Geometric Priors for End-to-End Detection Transformers in Challenging UAV Imagery

Hongxing Peng, Lide Chen, Hui Zhu, Yan Chen

Main category: cs.CV

TL;DR: HEDS-DETR is a real-time Detection Transformer specifically designed for aerial object detection, addressing challenges of small, dense, and occluded objects through high-frequency feature preservation, efficient small object pyramid, and enhanced decoder stability.

Details

Motivation: Conventional object detectors struggle with UAV imagery due to small, densely packed, and occluded objects in cluttered backgrounds. Existing methods, including recent end-to-end frameworks, are not purpose-built for aerial challenges, creating a performance gap.

Method: Three key innovations: 1) HFESNet backbone preserving high-frequency details and semantic context, 2) ESOP for efficient high-resolution feature fusion to boost small object detection, 3) SQR and GAPE components for decoder stability and localization precision with spatial priors.

Result: On VisDrone dataset, achieves +3.8% AP and +5.1% AP50 gain over baseline while reducing parameters by 4M and maintaining real-time speeds, demonstrating superior accuracy-efficiency balance for dense small object detection.

Conclusion: HEDS-DETR effectively bridges the performance gap in aerial object detection through holistic enhancements, providing a competitive solution for detecting dense and small objects in UAV imagery with real-time capability.

Abstract: Object detection in Unmanned Aerial Vehicle (UAV) imagery is fundamentally challenged by a prevalence of small, densely packed, and occluded objects within cluttered backgrounds. Conventional detectors struggle with this domain, as they rely on hand-crafted components like pre-defined anchors and heuristic-based Non-Maximum Suppression (NMS), creating a well-known performance bottleneck in dense scenes. Even recent end-to-end frameworks have not been purpose-built to overcome these specific aerial challenges, resulting in a persistent performance gap. To bridge this gap, we introduce HEDS-DETR, a holistically enhanced real-time Detection Transformer tailored for aerial scenes. Our framework features three key innovations. First, we propose a novel High-Frequency Enhanced Semantics Network (HFESNet) backbone, which yields highly discriminative features by preserving critical high-frequency details alongside robust semantic context. Second, our Efficient Small Object Pyramid (ESOP) counteracts information loss by efficiently fusing high-resolution features, significantly boosting small object detection. Finally, we enhance decoder stability and localization precision with two synergistic components: Selective Query Recollection (SQR) and Geometry-Aware Positional Encoding (GAPE), which stabilize optimization and provide explicit spatial priors for dense object arrangements. On the VisDrone dataset, HEDS-DETR achieves a +3.8% AP and +5.1% AP50 gain over its baseline while reducing parameters by 4M and maintaining real-time speeds. This demonstrates a highly competitive accuracy-efficiency balance, especially for detecting dense and small objects in aerial scenes.

[466] ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays

Shehroz S. Khan, Petar Przulj, Ahmed Ashraf, Ali Abedi

Main category: cs.CV

TL;DR: ChestGPT is a deep-learning framework that combines EVA Vision Transformer and Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images, achieving strong performance with an F1 score of 0.76.

Details

Motivation: Address the growing gap between increasing demand for radiologists and limited supply by developing AI tools to enhance radiologists' capabilities and improve diagnostic accuracy in medical imaging.

Method: Integrates EVA Vision Transformer (ViT) with Llama 2 Large Language Model (LLM), where ViT converts X-ray images into tokens that are processed by LLM along with engineered prompts for joint disease classification and localization using transfer learning.

Result: Achieved strong global disease classification performance on VinDr-CXR dataset with F1 score of 0.76, and successfully localized pathologies by generating bounding boxes around regions of interest.

Conclusion: ChestGPT provides an effective assistive tool that can reduce radiologists’ workload by offering preliminary findings and regions of interest to support diagnostic processes.

Abstract: The global demand for radiologists is increasing rapidly due to a growing reliance on medical imaging services, while the supply of radiologists is not keeping pace. Advances in computer vision and image processing technologies present significant potential to address this gap by enhancing radiologists’ capabilities and improving diagnostic accuracy. Large language models (LLMs), particularly generative pre-trained transformers (GPTs), have become the primary approach for understanding and generating textual data. In parallel, vision transformers (ViTs) have proven effective at converting visual data into a format that LLMs can process efficiently. In this paper, we present ChestGPT, a deep-learning framework that integrates the EVA ViT with the Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images. The ViT converts X-ray images into tokens, which are then fed, together with engineered prompts, into the LLM, enabling joint classification and localization of diseases. This approach incorporates transfer learning techniques to enhance both explainability and performance. The proposed method achieved strong global disease classification performance on the VinDr-CXR dataset, with an F1 score of 0.76, and successfully localized pathologies by generating bounding boxes around the regions of interest. We also outline several task-specific prompts, in addition to general-purpose prompts, for scenarios radiologists might encounter. Overall, this framework offers an assistive tool that can lighten radiologists’ workload by providing preliminary findings and regions of interest to facilitate their diagnostic process.

[467] Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions

Konstantinos Foteinos, Jorgen Cani, Manousos Linardakis, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

Main category: cs.CV

TL;DR: This paper provides a comprehensive survey on visual hand gesture recognition (VHGR), organizing state-of-the-art methods, datasets, and evaluation metrics to help researchers navigate the field.

Details

Motivation: The rapid evolution of deep learning and increasing dataset sizes have created a need for a structured survey to help researchers find the right combination of data, model, and approach for VHGR tasks, as current literature lacks comprehensive organization.

Method: The survey uses a systematic research methodology to identify state-of-the-art works and presents a taxonomy-based organization of VHGR approaches, categorizing methods by input modality, task type, and application domain across three primary tasks: static gesture recognition, isolated dynamic gestures, and continuous gesture recognition.

Result: The survey successfully organizes the VHGR field into a structured framework, identifies architectural trends and learning strategies for each task type, reviews commonly used datasets, and presents standard performance metrics for experimental evaluation.

Conclusion: The survey identifies major challenges in VHGR including both general computer vision issues and domain-specific obstacles, and outlines promising directions for future research to advance the field.

Abstract: The rapid evolution of deep learning (DL) models and the ever-increasing size of available datasets have raised the interest of the research community in the always important field of visual hand gesture recognition (VHGR), and delivered a wide range of applications, such as sign language understanding and human-computer interaction using cameras. Despite the large volume of research works in the field, a structured and complete survey on VHGR is still missing, leaving researchers to navigate through hundreds of papers in order to find the right combination of data, model, and approach for each task. The current survey aims to fill this gap by presenting a comprehensive overview of this computer vision field. With a systematic research methodology that identifies the state-of-the-art works and a structured presentation of the various methods, datasets, and evaluation metrics, this review aims to constitute a useful guideline for researchers, helping them to choose the right strategy for handling a VHGR task. Starting with the methodology used to locate the related literature, the survey identifies and organizes the key VHGR approaches in a taxonomy-based format, and presents the various dimensions that affect the final method choice, such as input modality, task type, and application domain. The state-of-the-art techniques are grouped across three primary VHGR tasks: static gesture recognition, isolated dynamic gestures, and continuous gesture recognition. For each task, the architectural trends and learning strategies are listed. To support the experimental evaluation of future methods in the field, the study reviews commonly used datasets and presents the standard performance metrics. Our survey concludes by identifying the major challenges in VHGR, including both general computer vision issues and domain-specific obstacles, and outlines promising directions for future research.

[468] Not Only Consistency: Enhance Test-Time Adaptation with Spatio-temporal Inconsistency for Remote Physiological Measurement

Xiao Yang, Jiyao Wang, Yuxuan Fan, Can Liu, Houcheng Su, Weichen Guo, Zitong Yu, Dengbo He, Kaishun Wu

Main category: cs.CV

TL;DR: A novel Test-Time Adaptation (TTA) framework called CiCi that leverages both consistency and inconsistency priors in physiological signals for real-time RPM model adaptation without source data access.

Details

Motivation: Existing domain adaptation methods for RPM face limitations in privacy concerns and real-time adaptation, restricting real-world deployment. There's a need for fully test-time adaptation strategies that work during inference without accessing source data.

Method: Proposed CiCi framework that exploits spatio-temporal consistency in frequency domain and inconsistency in time domain of BVP signals. Uses expert knowledge-based self-supervised learning with gradient dynamic control to mitigate conflicts between priors.

Result: Extensive experiments on five datasets show the method consistently outperforms existing techniques, achieving state-of-the-art performance in real-time self-supervised adaptation.

Conclusion: The CiCi framework enables effective test-time adaptation for RPM tasks by leveraging physiological priors about signal consistency and inconsistency, providing stable adaptation without source data access.

Abstract: Remote physiological measurement (RPM) has emerged as a promising non-invasive method for monitoring physiological signals using the non-contact device. Although various domain adaptation and generalization methods were proposed to promote the adaptability of deep-based RPM models in unseen deployment environments, considerations in aspects such as privacy concerns and real-time adaptation restrict their application in real-world deployment. Thus, we aim to propose a novel fully Test-Time Adaptation (TTA) strategy tailored for RPM tasks in this work. Specifically, based on prior knowledge in physiology and our observations, we noticed not only there is spatio-temporal consistency in the frequency domain of BVP signals, but also that inconsistency in the time domain was significant. Given this, by leveraging both consistency and inconsistency priors, we introduce an innovative expert knowledge-based self-supervised \textbf{C}onsistency-\textbf{i}n\textbf{C}onsistency-\textbf{i}ntegration (\textbf{CiCi}) framework to enhances model adaptation during inference. Besides, our approach further incorporates a gradient dynamic control mechanism to mitigate potential conflicts between priors, ensuring stable adaptation across instances. Through extensive experiments on five diverse datasets under the TTA protocol, our method consistently outperforms existing techniques, presenting state-of-the-art performance in real-time self-supervised adaptation without accessing source data. The code will be released later.

[469] When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework

Xiao Wang, Qian Zhu, Shujuan Wu, Bo Jiang, Shiliang Zhang

Main category: cs.CV

TL;DR: This paper introduces EvReID, a large-scale RGB-event person re-identification dataset with 118,988 image pairs covering 1200 identities, and proposes TriPro-ReID, a pedestrian attribute-guided contrastive learning framework for enhanced feature learning.

Details

Motivation: Current event-based person ReID methods are limited by small-scale or simulated datasets, making it difficult to assess real performance and generalization. There's a need for large-scale real-world datasets to advance the field.

Method: Created EvReID dataset with multi-season, multi-scene data collection. Proposed TriPro-ReID framework that combines RGB frames, event streams, and pedestrian attributes through contrastive learning to enhance feature representation.

Result: Extensive experiments on EvReID and MARS datasets validated the effectiveness of the proposed RGB-Event person ReID framework. The dataset provides a solid foundation for future research with 15 state-of-the-art algorithms evaluated.

Conclusion: The EvReID dataset addresses data scarcity in event-based person ReID, and the TriPro-ReID framework effectively leverages RGB, event streams, and pedestrian attributes for improved re-identification performance.

Abstract: Recent researchers have proposed using event cameras for person re-identification (ReID) due to their promising performance and better balance in terms of privacy protection, event camera-based person ReID has attracted significant attention. Currently, mainstream event-based person ReID algorithms primarily focus on fusing visible light and event stream, as well as preserving privacy. Although significant progress has been made, these methods are typically trained and evaluated on small-scale or simulated event camera datasets, making it difficult to assess their real identification performance and generalization ability. To address the issue of data scarcity, this paper introduces a large-scale RGB-event based person ReID dataset, called EvReID. The dataset contains 118,988 image pairs and covers 1200 pedestrian identities, with data collected across multiple seasons, scenes, and lighting conditions. We also evaluate 15 state-of-the-art person ReID algorithms, laying a solid foundation for future research in terms of both data and benchmarking. Based on our newly constructed dataset, this paper further proposes a pedestrian attribute-guided contrastive learning framework to enhance feature learning for person re-identification, termed TriPro-ReID. This framework not only effectively explores the visual features from both RGB frames and event streams, but also fully utilizes pedestrian attributes as mid-level semantic features. Extensive experiments on the EvReID dataset and MARS datasets fully validated the effectiveness of our proposed RGB-Event person ReID framework. The benchmark dataset and source code will be released on https://github.com/Event-AHU/Neuromorphic_ReID

[470] Survival Modeling from Whole Slide Images via Patch-Level Graph Clustering and Mixture Density Experts

Ardhendu Sekhar, Vasu Soni, Keshav Aske, Garima Jain, Pranav Jeevan, Amit Sethi

Main category: cs.CV

TL;DR: A modular framework for predicting cancer survival from whole slide images using quantile-based patch filtering, graph-regularized clustering, hierarchical feature aggregation, and expert-guided mixture density modeling.

Details

Motivation: To directly predict cancer-specific survival from whole slide pathology images by capturing prognostic and morphological heterogeneity in tumor tissues.

Method: Four-stage framework: 1) Quantile-based patch filtering for prognostically informative regions, 2) Graph-regularized patch clustering for phenotype variations, 3) Hierarchical feature aggregation for multiscale tumor organization, 4) Expert-guided mixture density model for survival distribution estimation.

Result: Achieved concordance indices of 0.653 (TCGA LUAD), 0.719 (TCGA KIRC), and 0.733 (TCGA BRCA), surpassing state-of-the-art approaches in survival prediction from WSIs.

Conclusion: The proposed modular framework effectively captures morphological heterogeneity and enables accurate cancer survival prediction directly from whole slide images, outperforming existing methods across multiple cancer types.

Abstract: We propose a modular framework for predicting cancer specific survival directly from whole slide pathology images (WSIs). The framework consists of four key stages designed to capture prognostic and morphological heterogeneity. First, a Quantile Based Patch Filtering module selects prognostically informative tissue regions through quantile thresholding. Second, Graph Regularized Patch Clustering models phenotype level variations using a k nearest neighbor graph that enforces spatial and morphological coherence. Third, Hierarchical Feature Aggregation learns both intra and inter cluster dependencies to represent multiscale tumor organization. Finally, an Expert Guided Mixture Density Model estimates complex survival distributions via Gaussian mixtures, enabling fine grained risk prediction. Evaluated on TCGA LUAD, TCGA KIRC, and TCGA BRCA cohorts, our model achieves concordance indices of 0.653 ,0.719 ,and 0.733 respectively, surpassing existing state of the art approaches in survival prediction from WSIs.

[471] Controllable Hybrid Captioner for Improved Long-form Video Understanding

Kuleen Sasse, Efsun Sarioglu Kayi, Arun Reddy

Main category: cs.CV

TL;DR: The paper presents a video understanding system that creates text-based summaries of long-form videos using video captioning and vision-language models, enabling complex natural language query answering through LLMs.

Details

Motivation: Long-form videos are dense and high-dimensional, making direct processing challenging. Text-based summaries offer compact representations that can be efficiently processed by LLMs for reasoning over video content.

Method: Uses LaViLa video captioner on video chunks for spatio-temporal modeling, enriches with static scene descriptions using LLaVA VLM, fine-tunes a controllable hybrid captioner that alternates between action and scene captions based on scene change detection.

Result: Developed a more detailed and complete caption log that expands answerable question space, improved pipeline efficiency by combining action and scene captioning in a single model, and created a system that can answer complex natural language queries about videos.

Conclusion: The proposed video understanding system successfully creates comprehensive text-based memories from videos, enabling effective reasoning through LLMs while improving efficiency through hybrid captioning and scene-aware processing.

Abstract: Video data, especially long-form video, is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content in a much more compact manner than raw video. In addition, textual representations are easily ingested by state-of-the-art large language models (LLMs), which enable reasoning over video content to answer complex natural language queries. To solve this issue, we rely on the progressive construction of a text-based memory by a video captioner operating on shorter chunks of the video, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of the activity log comprised solely of short video captions. Because the video captions tend to be focused on human actions, and questions may pertain to other information in the scene, we seek to enrich the memory with static scene descriptions using Vision Language Models (VLMs). Our video understanding system relies on the LaViLa video captioner in combination with a LLM to answer questions about videos. We first explored different ways of partitioning the video into meaningful segments such that the textual descriptions more accurately reflect the structure of the video content. Furthermore, we incorporated static scene descriptions into the captioning pipeline using LLaVA VLM, resulting in a more detailed and complete caption log and expanding the space of questions that are answerable from the textual memory. Finally, we have successfully fine-tuned the LaViLa video captioner to produce both action and scene captions, significantly improving the efficiency of the captioning pipeline compared to using separate captioning models for the two tasks. Our model, controllable hybrid captioner, can alternate between different types of captions according to special input tokens that signals scene changes detected in the video.

[472] RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cues for 3D Object Detection

Xiaokai Bai, Chenxu Zhou, Lianqing Zheng, Si-Yuan Cao, Jianan Liu, Xiaohan Zhang, Yiming Li, Zhengzhuang Zhang, Hui-liang Shen

Main category: cs.CV

TL;DR: RaGS is the first framework using 3D Gaussian Splatting to fuse 4D radar and monocular images for 3D object detection, achieving state-of-the-art performance through dynamic resource allocation to foreground objects.

Details

Motivation: Existing fusion approaches for 4D radar and monocular images either lack global context or are constrained by rigid structures, lacking flexible and adaptive representation for diverse autonomous driving scenes.

Method: RaGS uses a cascaded pipeline: Frustum-based Localization Initiation to initialize Gaussian centers, Iterative Multimodal Aggregation to refine Gaussians with image semantics and radar velocity geometry, and Multi-level Gaussian Fusion to render hierarchical BEV features for detection.

Result: Extensive experiments on View-of-Delft, TJ4DRadSet, and OmniHD-Scenes demonstrate robust state-of-the-art performance in 3D object detection.

Conclusion: RaGS achieves object-centric precision and comprehensive scene perception by dynamically focusing on sparse and informative regions through 3D Gaussian Splatting fusion of 4D radar and monocular cues.

Abstract: 4D millimeter-wave radar is a promising sensing modality for autonomous driving, yet effective 3D object detection from 4D radar and monocular images remains challenging. Existing fusion approaches either rely on instance proposals lacking global context or dense BEV grids constrained by rigid structures, lacking a flexible and adaptive representation for diverse scenes. To address this, we propose RaGS, the first framework that leverages 3D Gaussian Splatting (GS) to fuse 4D radar and monocular cues for 3D object detection. 3D GS models the scene as a continuous field of Gaussians, enabling dynamic resource allocation to foreground objects while maintaining flexibility and efficiency. Moreover, the velocity dimension of 4D radar provides motion cues that help anchor and refine the spatial distribution of Gaussians. Specifically, RaGS adopts a cascaded pipeline to construct and progressively refine the Gaussian field. It begins with Frustum-based Localization Initiation (FLI), which unprojects foreground pixels to initialize coarse Gaussian centers. Then, Iterative Multimodal Aggregation (IMA) explicitly exploits image semantics and implicitly integrates 4D radar velocity geometry to refine the Gaussians within regions of interest. Finally, Multi-level Gaussian Fusion (MGF) renders the Gaussian field into hierarchical BEV features for 3D object detection. By dynamically focusing on sparse and informative regions, RaGS achieves object-centric precision and comprehensive scene perception. Extensive experiments on View-of-Delft, TJ4DRadSet, and OmniHD-Scenes demonstrate its robustness and SOTA performance. Code will be released.

[473] FedVLM: Scalable Personalized Vision-Language Models through Federated Learning

Arkajyoti Mitra, Afia Anjum, Paul Agbaje, Mert Pesé, Habeeb Olufowobi

Main category: cs.CV

TL;DR: FedVLM is a federated LoRA fine-tuning framework for vision-language models that addresses data heterogeneity through personalized LoRA (pLoRA), improving client-specific performance by 24.5% over standard LoRA in non-iid settings.

Details

Motivation: Fine-tuning vision-language models at scale in federated environments is challenging due to decentralized and non-iid data across clients. Existing parameter-efficient methods like LoRA struggle with heterogeneous client data, leading to poor generalization.

Method: Proposes FedVLM framework with personalized LoRA (pLoRA) that dynamically adapts LoRA parameters to each client’s unique data distribution, enabling decentralized adaptation while maintaining global model aggregation.

Result: Experiments on RLAIF-V dataset show pLoRA improves client-specific performance by 24.5% over standard LoRA, demonstrating superior adaptation in non-iid settings.

Conclusion: FedVLM provides a scalable and efficient solution for fine-tuning VLMs in federated settings, advancing personalized adaptation in distributed learning scenarios.

Abstract: Vision-language models (VLMs) demonstrate impressive zero-shot and few-shot learning capabilities, making them essential for several downstream tasks. However, fine-tuning these models at scale remains challenging, particularly in federated environments where data is decentralized and non-iid across clients. Existing parameter-efficient tuning methods like LoRA (Low-Rank Adaptation) reduce computational overhead but struggle with heterogeneous client data, leading to suboptimal generalization. To address these challenges, we propose FedVLM, a federated LoRA fine-tuning framework that enables decentralized adaptation of VLMs while preserving model privacy and reducing reliance on centralized training. To further tackle data heterogeneity, we introduce personalized LoRA (pLoRA), which dynamically adapts LoRA parameters to each client’s unique data distribution, significantly improving local adaptation while maintaining global model aggregation. Experiments on the RLAIF-V dataset show that pLoRA improves client-specific performance by 24.5% over standard LoRA, demonstrating superior adaptation in non-iid settings. FedVLM provides a scalable and efficient solution for fine-tuning VLMs in federated settings, advancing personalized adaptation in distributed learning scenarios.

[474] Slot Attention with Re-Initialization and Self-Distillation

Rongzhen Zhao, Yi Zhao, Juho Kannala, Joni Pajarinen

Main category: cs.CV

TL;DR: DIAS improves Object-Centric Learning by addressing slot redundancy and lack of internal supervision through slot re-initialization and self-distillation of attention maps.

Details

Motivation: Current OCL methods suffer from redundant slots competing with informative ones, causing object segmentation errors, and lack supervision beyond input reconstruction.

Method: Proposes Slot Attention with re-Initialization and self-Distillation (DIAS): reduces slot redundancy via re-initialization and updates remaining slots; enables self-distillation by making early attention maps approximate final ones.

Result: Achieves state-of-the-art performance on object discovery and recognition tasks, and improves advanced visual prediction and reasoning capabilities.

Conclusion: DIAS effectively addresses slot redundancy and supervision limitations in OCL, demonstrating superior performance across multiple visual tasks.

Abstract: Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input’s reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): $\emph{i)}$ We reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; $\emph{ii)}$ We drive the bad attention map at the first aggregation iteration to approximate the good at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning. Our source code and model checkpoints are available on https://github.com/Genera1Z/DIAS.

[475] P3P Made Easy

Seong Hun Lee, Patrick Vandewalle, Javier Civera

Main category: cs.CV

TL;DR: Revisiting the classical P3P problem with a compact algebraic solver based on Grunert’s 1841 formulation, achieving competitive accuracy and runtime with modern methods.

Details

Motivation: The elegant classical formulation of P3P has been largely overlooked in modern literature despite its analytical simplicity and computational efficiency.

Method: Proposed a compact algebraic solver building on Grunert’s 1841 theoretical foundation, reducing P3P to a quartic polynomial with simple coefficients.

Result: Achieves accuracy and runtime comparable to state-of-the-art methods, demonstrating the classical formulation remains highly competitive.

Conclusion: The classical P3P formulation offers an excellent balance between simplicity, efficiency, and accuracy when implemented with modern insights.

Abstract: We revisit the classical Perspective-Three-Point (P3P) problem, which aims to recover the absolute pose of a calibrated camera from three 2D-3D correspondences. It has long been known that P3P can be reduced to a quartic polynomial with analytically simple and computationally efficient coefficients. However, this elegant formulation has been largely overlooked in modern literature. Building on the theoretical foundation that traces back to Grunert’s work in 1841, we propose a compact algebraic solver that achieves accuracy and runtime comparable to state-of-the-art methods. Our results show that this classical formulation remains highly competitive when implemented with modern insights, offering an excellent balance between simplicity, efficiency, and accuracy.

[476] Forecasting When to Forecast: Accelerating Diffusion Models with Confidence-Gated Taylor

Xiaoliu Guan, Lielin Jiang, Hanqi Chen, Xu Zhang, Jiaxing Yan, Guanzhong Wang, Yi Liu, Zetao Zhang, Yu Wu

Main category: cs.CV

TL;DR: Proposes a dynamic Taylor-based acceleration method for Diffusion Transformers that shifts prediction to the last block level and uses error-based reliability checks to balance speed and quality.

Details

Motivation: Existing training-free acceleration methods for DiTs have memory/computation overhead from fine-grained feature caching and fixed caching schedules that can degrade output quality when predictions fail.

Method: Shift Taylor prediction from module level to last block level to reduce cached features. Use error between Taylor-estimated and actual outputs of first block as reliability indicator to enable dynamic caching - trust Taylor prediction when error is small, fall back to full computation otherwise.

Result: Achieves 3.17x acceleration on FLUX, 2.36x on DiT, and 4.14x on Wan Video with negligible quality drop.

Conclusion: The proposed method achieves better speed-quality balance by reducing cached features and implementing dynamic caching based on prediction reliability.

Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable performance in visual generation tasks. However, their low inference speed limits their deployment in low-resource applications. Recent training-free approaches exploit the redundancy of features across timesteps by caching and reusing past representations to accelerate inference. Building on this idea, TaylorSeer instead uses cached features to predict future ones via Taylor expansion. However, its module-level prediction across all transformer blocks (e.g., attention or feedforward modules) requires storing fine-grained intermediate features, leading to notable memory and computation overhead. Moreover, it adopts a fixed caching schedule without considering the varying accuracy of predictions across timesteps, which can lead to degraded outputs when prediction fails. To address these limitations, we propose a novel approach to better leverage Taylor-based acceleration. First, we shift the Taylor prediction target from the module level to the last block level, significantly reducing the number of cached features. Furthermore, observing strong sequential dependencies among Transformer blocks, we propose to use the error between the Taylor-estimated and actual outputs of the first block as an indicator of prediction reliability. If the error is small, we trust the Taylor prediction for the last block; otherwise, we fall back to full computation, thereby enabling a dynamic caching mechanism. Empirical results show that our method achieves a better balance between speed and quality, achieving a 3.17x acceleration on FLUX, 2.36x on DiT, and 4.14x on Wan Video with negligible quality drop. The Project Page is \href{https://cg-taylor-acce.github.io/CG-Taylor/}{here.}

[477] Predicting Video Slot Attention Queries from Random Slot-Feature Pairs

Rongzhen Zhao, Jian Li, Juho Kannala, Joni Pajarinen

Main category: cs.CV

TL;DR: RandSF.Q improves video object-centric learning by incorporating next frame features and learning transition dynamics through random slot-feature pair sampling, achieving state-of-the-art performance in object discovery.

Details

Motivation: Existing video OCL methods neglect next frame features (most informative for query prediction) and fail to learn transition dynamics (essential knowledge for query prediction), limiting their effectiveness.

Method: Proposes RandSF.Q with: (1) new transitioner incorporating both slots and features for query prediction, (2) training transitioner using random slot-feature pairs from available recurrences to learn transition dynamics.

Result: Significantly surpasses existing video OCL methods, achieving up to 10 points improvement on object discovery and setting new state-of-the-art. Also benefits downstream tasks like dynamics modeling.

Conclusion: RandSF.Q successfully addresses key limitations in video OCL by incorporating next frame features and learning transition dynamics, demonstrating superior performance in scene representation and downstream applications.

Abstract: Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and dynamics modeling as we humans do. Mainstream video OCL methods adopt a recurrent architecture: An aggregator aggregates current video frame into object features, termed slots, under some queries; A transitioner transits current slots to queries for the next frame. This is an effective architecture but all existing implementations both (\textit{i1}) neglect to incorporate next frame features, the most informative source for query prediction, and (\textit{i2}) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (\textit{t1}) We design a new transitioner to incorporate both slots and features, which provides more information for query prediction; (\textit{t2}) We train the transitioner to predict queries from slot-feature pairs randomly sampled from available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpass existing video OCL methods significantly, e.g., up to 10 points on object discovery, setting new state-of-the-art. Such superiority also benefits downstream tasks like dynamics modeling. Our core source code, model checkpoints and training logs are available on https://github.com/Genera1Z/RandSF.Q.

[478] Harnessing Textual Semantic Priors for Knowledge Transfer and Refinement in CLIP-Driven Continual Learning

Lingfeng He, De Cheng, Huaijie Wang, Nannan Wang

Main category: cs.CV

TL;DR: SECA is a continual learning framework that leverages CLIP’s textual semantic priors to address the stability-plasticity dilemma through semantic-guided knowledge transfer and visual prototype refinement.

Details

Motivation: To address the underutilization of CLIP's rich textual semantic priors in continual learning, particularly the interference from unrelated tasks during knowledge transfer and the modality gap limitations in text-based classifiers.

Method: Proposes Semantic-Guided Adaptive Knowledge Transfer (SG-AKT) to assess image relevance to historical knowledge via textual cues, and Semantic-Enhanced Visual Prototype Refinement (SE-VPR) to refine visual prototypes using inter-class semantic relations from textual embeddings.

Result: Extensive experiments on multiple benchmarks validate the effectiveness of SECA in achieving better balance between stability and plasticity in continual learning.

Conclusion: SECA successfully harnesses textual semantic priors to guide semantic-aware knowledge transfer and reinforce semantic structure in visual classifiers, effectively addressing the stability-plasticity dilemma in continual learning.

Abstract: Continual learning (CL) aims to equip models with the ability to learn from a stream of tasks without forgetting previous knowledge. With the progress of vision-language models like Contrastive Language-Image Pre-training (CLIP), their promise for CL has attracted increasing attention due to their strong generalizability. However, the potential of rich textual semantic priors in CLIP in addressing the stability-plasticity dilemma remains underexplored. During backbone training, most approaches transfer past knowledge without considering semantic relevance, leading to interference from unrelated tasks that disrupt the balance between stability and plasticity. Besides, while text-based classifiers provide strong generalization, they suffer from limited plasticity due to the inherent modality gap in CLIP. Visual classifiers help bridge this gap, but their prototypes lack rich and precise semantics. To address these challenges, we propose Semantic-Enriched Continual Adaptation (SECA), a unified framework that harnesses the anti-forgetting and structured nature of textual priors to guide semantic-aware knowledge transfer in the backbone and reinforce the semantic structure of the visual classifier. Specifically, a Semantic-Guided Adaptive Knowledge Transfer (SG-AKT) module is proposed to assess new images’ relevance to diverse historical visual knowledge via textual cues, and aggregate relevant knowledge in an instance-adaptive manner as distillation signals. Moreover, a Semantic-Enhanced Visual Prototype Refinement (SE-VPR) module is introduced to refine visual prototypes using inter-class semantic relations captured in class-wise textual embeddings. Extensive experiments on multiple benchmarks validate the effectiveness of our approach.

[479] VPN: Visual Prompt Navigation

Shuo Feng, Zihan Wang, Yuchen Li, Rui Kong, Hengyi Cai, Shuaiqiang Wang, Gim Hee Lee, Piji Li, Shuqiang Jiang

Main category: cs.CV

TL;DR: Proposes Visual Prompt Navigation (VPN) using visual prompts instead of language for embodied navigation, with new datasets and VPNet baseline.

Details

Motivation: Natural language instructions are ambiguous and verbose, hindering effective navigation in complex environments.

Method: Uses visual prompts on 2D top-view maps to mark navigation trajectories, creates VPN datasets (R2R-VP, R2R-CE-VP), and develops VPNet with view-level and trajectory-level data augmentation.

Result: Extensive experiments evaluate how visual prompt forms, map formats, and augmentation strategies affect navigation performance.

Conclusion: Visual prompts provide intuitive, spatially grounded guidance that reduces ambiguity and is more user-friendly than language instructions.

Abstract: While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. This visual prompt primarily focuses on marking the visual navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is more friendly for non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network to handle the VPN tasks, with two data augmentation strategies: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes), to enhance navigation performance. Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation. The code is available at https://github.com/farlit/VPN.

[480] Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework

Yi-Ting Chen, Ting-Hsuan Liao, Pengsheng Guo, Alexander Schwing, Jia-Bin Huang

Main category: cs.CV

TL;DR: 3D Super Resolution (3DSR) is a novel framework that uses 3D Gaussian splatting with off-the-shelf 2D diffusion models to achieve 3D-consistent super-resolution without additional fine-tuning.

Details

Motivation: To address the lack of explicit 3D consistency in existing super-resolution methods like image upsampling or video super-resolution, which either ignore 3D consistency or incorporate it implicitly.

Method: Leverages 3D Gaussian-splatting-based scene representation combined with diffusion-based 2D super-resolution models to ensure 3D consistency across views without requiring fine-tuning.

Result: Demonstrates high-resolution results on MipNeRF360 and LLFF datasets that are visually compelling while maintaining structural consistency in 3D reconstructions.

Conclusion: 3DSR successfully enhances visual quality while ensuring spatial coherence in 3D scenes, providing a superior approach to 3D-consistent super-resolution compared to prior methods.

Abstract: We propose 3D Super Resolution (3DSR), a novel 3D Gaussian-splatting-based super-resolution framework that leverages off-the-shelf diffusion-based 2D super-resolution models. 3DSR encourages 3D consistency across views via the use of an explicit 3D Gaussian-splatting-based scene representation. This makes the proposed 3DSR different from prior work, such as image upsampling or the use of video super-resolution, which either don’t consider 3D consistency or aim to incorporate 3D consistency implicitly. Notably, our method enhances visual quality without additional fine-tuning, ensuring spatial coherence within the reconstructed scene. We evaluate 3DSR on MipNeRF360 and LLFF data, demonstrating that it produces high-resolution results that are visually compelling, while maintaining structural consistency in 3D reconstructions.

[481] X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning

Jian Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu

Main category: cs.CV

TL;DR: X2Edit introduces a comprehensive 3.7M dataset for image editing across 14 tasks and a plug-and-play editing module using MoE-LoRA training with contrastive learning, achieving competitive performance with only 8% model parameters.

Details

Motivation: Existing open-source datasets for arbitrary-instruction image editing are suboptimal, and there's no plug-and-play editing module compatible with community-prevalent generative models.

Method: Constructed X2Edit dataset using industry-leading models and expert models, designed editing instructions with VLM, implemented scoring mechanisms. Developed task-aware MoE-LoRA training based on FLUX.1 with contrastive learning using diffusion model representations.

Result: Created 3.7M high-quality balanced dataset; model achieves competitive editing performance with only 8% of full model parameters; dataset shows substantial advantages over existing open-source datasets.

Conclusion: X2Edit provides both a superior dataset and efficient plug-and-play editing module that integrates seamlessly with community image generation models, advancing the field of arbitrary-instruction image editing.

Abstract: Existing open-source datasets for arbitrary-instruction image editing remain suboptimal, while a plug-and-play editing module compatible with community-prevalent generative models is notably absent. In this paper, we first introduce the X2Edit Dataset, a comprehensive dataset covering 14 diverse editing tasks, including subject-driven generation. We utilize the industry-leading unified image generation models and expert models to construct the data. Meanwhile, we design reasonable editing instructions with the VLM and implement various scoring mechanisms to filter the data. As a result, we construct 3.7 million high-quality data with balanced categories. Second, to better integrate seamlessly with community image generation models, we design task-aware MoE-LoRA training based on FLUX.1, with only 8% of the parameters of the full model. To further improve the final performance, we utilize the internal representations of the diffusion model and define positive/negative samples based on image editing types to introduce contrastive learning. Extensive experiments demonstrate that the model’s editing performance is competitive among many excellent models. Additionally, the constructed dataset exhibits substantial advantages over existing open-source datasets. The open-source code, checkpoints, and datasets for X2Edit can be found at the following link: https://github.com/OPPO-Mente-Lab/X2Edit.

[482] UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning

Huy Le, Nhat Chung, Tung Kieu, Jingkang Yang, Ngan Le

Main category: cs.CV

TL;DR: UNO is a unified framework for Video Scene Graph Generation that handles both box-level and pixel-level tasks in a single-stage, end-to-end architecture using object-centric slot attention.

Details

Motivation: Prior VidSGG methods require separate architectures for different granularity levels (box-level vs pixel-level), leading to task-specific designs and multi-stage training pipelines that lack efficiency and generalization.

Method: Extended slot attention decomposes visual features into object and relation slots; object temporal consistency learning enforces frame-to-frame consistency; dynamic triplet prediction links relations to object pairs over time.

Result: UNO achieves competitive performance on both box-level and pixel-level VidSGG benchmarks while offering improved efficiency through unified design and parameter sharing.

Conclusion: UNO demonstrates that a single-stage, unified framework can effectively handle multiple VidSGG tasks with minimal task-specific modifications, enabling better generalization across visual granularity levels.

Abstract: Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design.

[483] Generative neural physics enables quantitative volumetric ultrasound of tissue mechanics

Zhijun Zeng, Youjia Zheng, Chang Su, Qianhang Wu, Hao Hu, Zeyuan Dong, Shan Gao, Yang Lv, Rui Tang, Ligang Cui, Zhiyong Hou, Weijun Lin, Zuoqiang Shi, Yubing Li, He Sun

Main category: cs.CV

TL;DR: A generative neural physics framework combining neural networks with physics-based solvers enables rapid 3D quantitative imaging of tissue mechanical properties using ultrasound tomography, achieving accurate biomechanical assessment in under 10 minutes.

Details

Motivation: Current medical imaging modalities (CT, MRI, B-mode ultrasound) rarely directly quantify tissue mechanical properties like stiffness and density, which are important biomarkers for disease assessment. Ultrasound tomography has potential but faces bottlenecks in efficient full-wave scattering models.

Method: Developed a generative neural physics framework that fuses generative models with physics-informed PDE solvers. Uses a compact neural surrogate for full-wave propagation trained on limited cross-modality data, preserving physical accuracy while enabling efficient inversion.

Result: Achieved accurate and efficient quantitative volumetric imaging of in vivo human breast and musculoskeletal tissues in under 10 minutes. Generated spatial maps of tissue mechanical properties not available from conventional methods. Revealed biomechanical features in bone, muscle, fat, and glandular tissues with structural resolution comparable to 3T MRI but greater sensitivity to disease-related tissue mechanics.

Conclusion: The framework enables rapid, high-fidelity 3D quantitative imaging of tissue mechanics, providing substantially greater sensitivity to disease-related tissue properties compared to conventional imaging methods while maintaining high structural resolution.

Abstract: Tissue mechanics–stiffness, density and impedance contrast–are broadly informative biomarkers across diseases, yet routine CT, MRI, and B-mode ultrasound rarely quantify them directly. While ultrasound tomography (UT) is intrinsically suited to in-vivo biomechanical assessment by capturing transmitted and reflected wavefields, efficient and accurate full-wave scattering models remain a bottleneck. Here, we introduce a generative neural physics framework that fuses generative models with physics-informed partial differential equation (PDE) solvers to produce rapid, high-fidelity 3D quantitative imaging of tissue mechanics. A compact neural surrogate for full-wave propagation is trained on limited cross-modality data, preserving physical accuracy while enabling efficient inversion. This enables, for the first time, accurate and efficient quantitative volumetric imaging of in vivo human breast and musculoskeletal tissues in under ten minutes, providing spatial maps of tissue mechanical properties not available from conventional reflection-mode or standard UT reconstructions. The resulting images reveal biomechanical features in bone, muscle, fat, and glandular tissues, maintaining structural resolution comparable to 3T MRI while providing substantially greater sensitivity to disease-related tissue mechanics.

Hanbo Bi, Zhiqiang Yuan, Zexi Jia, Jiapei Zhang, Chongyang Li, Peixiang Luo, Ying Deng, Xiaoyue Duan, Jinchao Zhang

Main category: cs.CV

TL;DR: The paper introduces Fine-grained Fragment Retrieval (FFR) for locating relevant multimodal fragments from long dialogues, creates the MLDR dataset, and proposes F2RVLM model with two-stage training and curriculum sampling that outperforms existing VLMs.

Details

Motivation: Traditional dialogue retrieval fails to meet users' needs for revisiting semantically coherent content scattered across long-form conversations, especially in multimodal contexts with both text and images.

Method: Proposed F2RVLM model with two-stage training: supervised fine-tuning for fragment-level retrieval knowledge, and GRPO-based reinforcement learning with multi-objective rewards. Introduced difficulty-aware curriculum sampling to handle varying intra-fragment complexity.

Result: F2RVLM outperforms popular Vision-Language Models in both in-domain and real-domain settings, demonstrating superior retrieval performance on the MLDR dataset and WeChat-based test set.

Conclusion: The proposed F2RVLM approach effectively addresses the Fine-grained Fragment Retrieval task, showing strong performance in retrieving semantically coherent multimodal fragments from long-form dialogues through specialized training techniques.

Abstract: Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history. However, they often fail to meet users’ actual needs for revisiting semantically coherent content scattered across long-form conversations. To fill this gap, we define the Fine-grained Fragment Retrieval (FFR) task, requiring models to locate query-relevant fragments, comprising both utterances and images, from multimodal long-form dialogues. As a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue retrieval dataset to date, averaging 25.45 turns per dialogue, with each naturally spanning three distinct topics. To evaluate generalization in real-world scenarios, we curate and annotate a WeChat-based test set comprising real-world multimodal dialogues with an average of 75.38 turns. Building on these resources, we explore existing generation-based Vision-Language Models (VLMs) on FFR and observe that they often retrieve incoherent utterance-image fragments. While optimized for generating responses from visual-textual inputs, these models lack explicit supervision to ensure semantic coherence within retrieved fragments. To this end, we propose F2RVLM, a generative retrieval model trained in a two-stage paradigm: (1) supervised fine-tuning to inject fragment-level retrieval knowledge, and (2) GRPO-based reinforcement learning with multi-objective rewards promoting semantic precision, relevance, and contextual coherence. To handle varying intra-fragment complexity, from locally dense to sparsely distributed, we introduce difficulty-aware curriculum sampling that ranks training instances by model-predicted difficulty and gradually exposes the model to harder samples. This boosts reasoning ability in long, multi-turn contexts. F2RVLM outperforms popular VLMs in both in-domain and real-domain settings, demonstrating superior retrieval performance.

[485] Improving Generalization in Deepfake Detection with Face Foundation Models and Metric Learning

Stelios Mylonas, Symeon Papadopoulos

Main category: cs.CV

TL;DR: A robust video deepfake detection framework using face foundation models with triplet loss and attribution supervision for strong generalization across diverse real-world scenarios.

Details

Motivation: Deepfake detection models struggle to generalize beyond training distributions, especially for real-world media content, due to increasing realism and accessibility of deepfakes threatening media authenticity.

Method: Built on FSFM self-supervised face model, fine-tuned with ensemble of deepfake datasets using triplet loss variants and attribution-based supervision schemes for manipulation type/dataset categorization.

Result: Extensive experiments show effective performance across diverse benchmarks, particularly in challenging real-world scenarios with strong generalization capabilities.

Conclusion: The framework demonstrates robust deepfake detection with enhanced generalization through face foundation models, triplet loss, and attribution supervision, addressing real-world detection challenges.

Abstract: The increasing realism and accessibility of deepfakes have raised critical concerns about media authenticity and information integrity. Despite recent advances, deepfake detection models often struggle to generalize beyond their training distributions, particularly when applied to media content found in the wild. In this work, we present a robust video deepfake detection framework with strong generalization that takes advantage of the rich facial representations learned by face foundation models. Our method is built on top of FSFM, a self-supervised model trained on real face data, and is further fine-tuned using an ensemble of deepfake datasets spanning both face-swapping and face-reenactment manipulations. To enhance discriminative power, we incorporate triplet loss variants during training, guiding the model to produce more separable embeddings between real and fake samples. Additionally, we explore attribution-based supervision schemes, where deepfakes are categorized by manipulation type or source dataset, to assess their impact on generalization. Extensive experiments across diverse evaluation benchmarks demonstrate the effectiveness of our approach, especially in challenging real-world scenarios.

[486] UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen

Main category: cs.CV

TL;DR: UniPixel is a large multi-modal model that integrates pixel-level perception with visual reasoning, enabling flexible comprehension of visual prompts and generation of mask-grounded responses for fine-grained pixel-level understanding.

Details

Motivation: To bridge the gap in scaling fine-grained pixel-level understanding capabilities in LMMs, which currently focus on holistic understanding but lack integration of pixel-level perception with visual reasoning.

Method: Proposes UniPixel model that processes visual prompts, generates relevant masks on demand, and performs reasoning conditioned on these intermediate pointers during inference.

Result: Verified effectiveness on 10 benchmarks across diverse tasks including pixel-level referring/segmentation, object-centric understanding in images/videos, and a novel PixelQA task requiring joint referring, segmentation, and QA.

Conclusion: UniPixel successfully integrates pixel-level perception with general visual understanding, enabling flexible fine-grained pixel-level reasoning across multiple tasks.

Abstract: Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with particular focuses on holistic image- and video-language understanding. Conversely, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to realize pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.

[487] PersonaAnimator: Personalized Motion Transfer from Unconstrained Videos

Ziyun Qian, Runyu Xiao, Shuyuan Tu, Wei Xue, Dingkang Yang, Mingcheng Li, Dongliang Kou, Minghao Han, Zizhi Chen, Lihua Zhang

Main category: cs.CV

TL;DR: PersonaAnimator enables personalized motion transfer from unconstrained videos, addressing limitations in existing motion generation methods by learning motion styles directly from videos rather than motion capture data.

Details

Motivation: Existing methods have three key limitations: pose-guided transfer lacks style learning, style transfer relies on hard-to-obtain motion capture data, and generated motions sometimes violate physical laws.

Method: Proposes PersonaAnimator framework that learns personalized motion patterns from unconstrained videos, introduces PersonaVid dataset with 20 motion content and 120 style categories, and implements Physics-aware Motion Style Regularization for physical plausibility.

Result: Extensive experiments show PersonaAnimator outperforms state-of-the-art motion transfer methods and establishes a new benchmark for Video-to-Video Motion Personalization.

Conclusion: The paper pioneers Video-to-Video Motion Personalization task and demonstrates successful personalized motion transfer directly from videos while ensuring physical plausibility.

Abstract: Recent advances in motion generation show remarkable progress. However, several limitations remain: (1) Existing pose-guided character motion transfer methods merely replicate motion without learning its style characteristics, resulting in inexpressive characters. (2) Motion style transfer methods rely heavily on motion capture data, which is difficult to obtain. (3) Generated motions sometimes violate physical laws. To address these challenges, this paper pioneers a new task: Video-to-Video Motion Personalization. We propose a novel framework, PersonaAnimator, which learns personalized motion patterns directly from unconstrained videos. This enables personalized motion transfer. To support this task, we introduce PersonaVid, the first video-based personalized motion dataset. It contains 20 motion content categories and 120 motion style categories. We further propose a Physics-aware Motion Style Regularization mechanism to enforce physical plausibility in the generated motions. Extensive experiments show that PersonaAnimator outperforms state-of-the-art motion transfer methods and sets a new benchmark for the Video-to-Video Motion Personalization task.

[488] FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, Liujuan Cao

Main category: cs.CV

TL;DR: FastVGGT accelerates VGGT, a 3D vision model, by 4x using token merging without training, maintaining performance while handling long image sequences.

Details

Motivation: Scaling 3D vision models to long-sequence inputs is challenging due to inference inefficiency, and token collapse in attention maps was identified as a bottleneck.

Method: Proposed FastVGGT with a training-free token merging mechanism, using a unique token partitioning strategy tailored for 3D architectures to eliminate redundant computation.

Result: Achieved 4x speedup over VGGT with 1000 input images while mitigating error accumulation in long-sequence scenarios, validated on multiple 3D geometry benchmarks.

Conclusion: Token merging is a principled solution for scalable 3D vision systems, effectively accelerating models without compromising reconstruction capacity.

Abstract: Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, scaling these models to long-sequence image inputs remains a significant challenge due to inference-time inefficiency. In this work, we present a detailed analysis of VGGT, a state-of-the-art feed-forward visual geometry model and identify its primary bottleneck. Visualization further reveals a token collapse phenomenon in the attention maps. Motivated by these findings, we explore the potential of token merging in the feed-forward visual geometry model. Owing to the unique architectural and task-specific properties of 3D models, directly applying existing merging techniques proves challenging. To this end, we propose FastVGGT, which, for the first time, leverages token merging in the 3D domain through a training-free mechanism for accelerating VGGT. we devise a unique token partitioning strategy tailored to 3D architectures and tasks, effectively eliminating redundant computation while preserving VGGT’s powerful reconstruction capacity. Extensive experiments on multiple 3D geometry benchmarks validate the effectiveness of our approach. Notably, with 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios. These findings underscore the potential of token merging as a principled solution for scalable 3D vision systems. Code is available at: https://mystorm16.github.io/fastvggt/.

[489] DCDB: Dynamic Conditional Dual Diffusion Bridge for Ill-posed Multi-Tasks

Chengjie Huang, Jiafeng Yan, Jing Li, Lu Bai

Main category: cs.CV

TL;DR: Proposes a dynamic conditional double diffusion bridge training paradigm for ill-posed multi-task image processing, achieving state-of-the-art performance in dehazing and visible-infrared fusion tasks.

Details

Motivation: Traditional conditional diffusion models struggle with exploiting intrinsic task correlations in multi-task scenarios, especially for ill-posed tasks with limited training data. Static condition control cannot adapt to dynamically evolving multi-task characteristics.

Method: Decouples diffusion and condition generation processes, uses dynamic conditions generated by the same noise schedule to gradually adjust statistical characteristics and embed time-related information, reducing network learning difficulty.

Result: Achieved best performance in multiple indicators on public datasets for dehazing and visible-infrared fusion tasks. Demonstrated superiority through analysis of attention weights and learning objectives.

Conclusion: The proposed dynamic conditional double diffusion bridge paradigm effectively addresses challenges in ill-posed multi-task scenarios by enabling better exploitation of task correlations and adapting to dynamic learning requirements.

Abstract: Conditional diffusion models have made impressive progress in the field of image processing, but the characteristics of constructing data distribution pathways make it difficult to exploit the intrinsic correlation between tasks in multi-task scenarios, which is even worse in ill-posed tasks with a lack of training data. In addition, traditional static condition control makes it difficult for networks to learn in multi-task scenarios with its dynamically evolving characteristics. To address these challenges, we propose a dynamic conditional double diffusion bridge training paradigm to build a general framework for ill-posed multi-tasks. Firstly, this paradigm decouples the diffusion and condition generation processes, avoiding the dependence of the diffusion model on supervised data in ill-posed tasks. Secondly, generated by the same noise schedule, dynamic conditions are used to gradually adjust their statistical characteristics, naturally embed time-related information, and reduce the difficulty of network learning. We analyze the learning objectives of the network under different conditional forms in the single-step denoising process and compare the changes in its attention weights in the network, demonstrating the superiority of our dynamic conditions. Taking dehazing and visible-infrared fusion as typical ill-posed multi-task scenarios, we achieve the best performance in multiple indicators on public datasets. The code has been publicly released at: https://anonymous.4open.science/r/DCDB-D3C2.

[490] Multispectral-NeRF:a multispectral modeling approach based on neural radiance fields

Hong Zhang, Fei Guo, Zihan Xie, Dizhao Yao

Main category: cs.CV

TL;DR: Multispectral-NeRF enhances neural radiance fields to process 6-band spectral data instead of traditional 3-band RGB, improving multispectral 3D reconstruction accuracy and quality.

Details

Motivation: Traditional 3D reconstruction methods using expanded spectral bands suffer from high costs, low accuracy, and poor geometry. While NeRF can solve these issues, current versions only handle 3-band data and cannot utilize multi-band information.

Method: Three key modifications: expanded hidden layers for 6-band inputs, redesigned residual functions for spectral discrepancy optimization, and adapted data compression for multispectral imagery’s higher bit-depth requirements.

Result: Experimental results show Multispectral-NeRF successfully processes multi-band spectral features while accurately preserving original scenes’ spectral characteristics.

Conclusion: The proposed Multispectral-NeRF effectively integrates multispectral information into neural radiance fields, overcoming limitations of existing methods and enabling high-quality multispectral 3D reconstruction.

Abstract: 3D reconstruction technology generates three-dimensional representations of real-world objects, scenes, or environments using sensor data such as 2D images, with extensive applications in robotics, autonomous vehicles, and virtual reality systems. Traditional 3D reconstruction techniques based on 2D images typically relies on RGB spectral information. With advances in sensor technology, additional spectral bands beyond RGB have been increasingly incorporated into 3D reconstruction workflows. Existing methods that integrate these expanded spectral data often suffer from expensive scheme prices, low accuracy and poor geometric features. Three - dimensional reconstruction based on NeRF can effectively address the various issues in current multispectral 3D reconstruction methods, producing high - precision and high - quality reconstruction results. However, currently, NeRF and some improved models such as NeRFacto are trained on three - band data and cannot take into account the multi - band information. To address this problem, we propose Multispectral-NeRF, an enhanced neural architecture derived from NeRF that can effectively integrates multispectral information. Our technical contributions comprise threefold modifications: Expanding hidden layer dimensionality to accommodate 6-band spectral inputs; Redesigning residual functions to optimize spectral discrepancy calculations between reconstructed and reference images; Adapting data compression modules to address the increased bit-depth requirements of multispectral imagery. Experimental results confirm that Multispectral-NeRF successfully processes multi-band spectral features while accurately preserving the original scenes’ spectral characteristics.

[491] Effective Gaussian Management for High-fidelity Object Reconstruction

Jiateng Liu, Hao Gao, Jiu-Cheng Xie, Chi-Man Pun, Jian Xiong, Haolun Li, Junxin Chen, Feng Xu

Main category: cs.CV

TL;DR: A Gaussian management framework for high-fidelity scene reconstruction that introduces selective attribute activation, adaptive representation, and surface reconstruction to improve quality while reducing parameters.

Details

Motivation: To address limitations in Gaussian Splatting methods that use indiscriminate attribute assignment, which causes gradient conflicts and redundancy in scene reconstruction.

Method: Introduces GauSep for selective color/normal attribute activation, GauRep for adaptive Gaussian representation, CoRe for surface reconstruction via normal field distillation, and Separate Rendering pipeline.

Result: Achieves superior appearance and geometry reconstruction compared to state-of-the-art methods while using significantly fewer parameters.

Conclusion: The proposed framework effectively balances model capacity and parameter efficiency, is model-agnostic, and can be seamlessly integrated into other architectures.

Abstract: This paper presents an effective Gaussian management framework for high-fidelity scene reconstruction of appearance and geometry. Departing from recent Gaussian Splatting (GS) methods that rely on indiscriminate attribute assignment, our approach introduces a novel densification strategy called \emph{GauSep} that selectively activates Gaussian color or normal attributes. Together with a tailored rendering pipeline, termed \emph{Separate Rendering}, this strategy alleviates gradient conflicts arising from dual supervision and yields improved reconstruction quality. In addition, we develop \emph{GauRep}, an adaptive and integrated Gaussian representation that reduces redundancy both at the individual and global levels, effectively balancing model capacity and number of parameters. To provide reliable geometric supervision essential for effective management, we also introduce \emph{CoRe}, a novel surface reconstruction module that distills normal fields from the SDF branch to the Gaussian branch through a confidence mechanism. Notably, our management framework is model-agnostic and can be seamlessly incorporated into other architectures, simultaneously improving performance and reducing model size. Extensive experiments demonstrate that our approach achieves superior performance in reconstructing both appearance and geometry compared with state-of-the-art methods, while using significantly fewer parameters.

[492] DevFD: Developmental Face Forgery Detection by Learning Shared and Orthogonal LoRA Subspaces

Tianshuo Zhang, Li Gao, Siran Peng, Xiangyu Zhu, Zhen Lei

Main category: cs.CV

TL;DR: Proposes a continual learning approach for face forgery detection using a Developmental Mixture of Experts with Real-LoRA and Fake-LoRAs to adapt to evolving forgery techniques while preventing catastrophic forgetting.

Details

Motivation: The rapid evolution of digital face generation and manipulation techniques outpaces existing detection models, requiring systems that can quickly adapt to new forgery types while retaining knowledge of previous ones with limited data and computation.

Method: Uses a Developmental Mixture of Experts architecture with LoRA models: Real-LoRA for stable real face knowledge and multiple Fake-LoRAs for incremental forgery learning. Employs orthogonal learning directions and orthogonal gradients to prevent catastrophic forgetting.

Result: Experimental results show effectiveness under both datasets and manipulation types incremental protocols, demonstrating successful adaptation to new forgery types while maintaining detection capabilities for previously learned types.

Conclusion: The proposed continual learning framework effectively addresses the challenge of evolving face forgery techniques by enabling incremental learning while preventing knowledge forgetting, providing a sustainable solution for face forgery detection.

Abstract: The rise of realistic digital face generation and manipulation poses significant social risks. The primary challenge lies in the rapid and diverse evolution of generation techniques, which often outstrip the detection capabilities of existing models. To defend against the ever-evolving new types of forgery, we need to enable our model to quickly adapt to new domains with limited computation and data while avoiding forgetting previously learned forgery types. In this work, we posit that genuine facial samples are abundant and relatively stable in acquisition methods, while forgery faces continuously evolve with the iteration of manipulation techniques. Given the practical infeasibility of exhaustively collecting all forgery variants, we frame face forgery detection as a continual learning problem and allow the model to develop as new forgery types emerge. Specifically, we employ a Developmental Mixture of Experts (MoE) architecture that uses LoRA models as its individual experts. These experts are organized into two groups: a Real-LoRA to learn and refine knowledge of real faces, and multiple Fake-LoRAs to capture incremental information from different forgery types. To prevent catastrophic forgetting, we ensure that the learning direction of Fake-LoRAs is orthogonal to the established subspace. Moreover, we integrate orthogonal gradients into the orthogonal loss of Fake-LoRAs, preventing gradient interference throughout the training process of each task. Experimental results under both the datasets and manipulation types incremental protocols demonstrate the effectiveness of our method.

[493] PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation

Chen Wang, Chuhao Chen, Yiming Huang, Zhiyang Dou, Yuan Liu, Jiatao Gu, Lingjie Liu

Main category: cs.CV

TL;DR: PhysCtrl is a physics-grounded image-to-video generation framework that uses physical parameters and force control to create realistic, physically plausible videos across four materials (elastic, sand, plasticine, rigid).

Details

Motivation: Existing video generation models lack physical plausibility and 3D controllability, limiting their realism and practical applications in physics-based scenarios.

Method: Uses a generative physics network with diffusion model conditioned on physics parameters and forces, trained on 550K synthetic animations. Incorporates spatiotemporal attention blocks for particle interactions and physics-based constraints.

Result: Generates realistic physics-grounded motion trajectories that drive image-to-video models to produce high-fidelity, controllable videos superior to existing methods in both visual quality and physical plausibility.

Conclusion: PhysCtrl successfully bridges the gap between video generation and physical realism, enabling controllable physics-based video synthesis with improved plausibility.

Abstract: Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. At its core is a generative physics network that learns the distribution of physical dynamics across four materials (elastic, sand, plasticine, and rigid) via a diffusion model conditioned on physics parameters and applied forces. We represent physical dynamics as 3D point trajectories and train on a large-scale synthetic dataset of 550K animations generated by physics simulators. We enhance the diffusion model with a novel spatiotemporal attention block that emulates particle interactions and incorporates physics-based constraints during training to enforce physical plausibility. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in both visual quality and physical plausibility. Project Page: https://cwchenwang.github.io/physctrl

[494] Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models

Longtao Jiang, Jie Huang, Mingfei Han, Lei Chen, Yongqiang Yu, Feng Zhao, Xiaojun Chang, Zhihui Li

Main category: cs.CV

TL;DR: Token Painter is a training-free text-guided image inpainting method using Mask AutoRegressive models that addresses background preservation and text alignment issues through dual-stream encoder fusion and adaptive decoder attention enhancement.

Details

Motivation: Diffusion-based methods struggle with aligning inpainting results with text prompts while preserving background consistency due to their latent space modeling approach.

Method: Uses Mask AutoRegressive models with two key components: Dual-Stream Encoder Information Fusion (DEIF) to fuse text and background information in frequency domain, and Adaptive Decoder Attention Score Enhancing (ADAE) to enhance attention on guidance and inpainting tokens.

Result: Outperforms prior state-of-the-art methods across almost all metrics in extensive experiments.

Conclusion: The proposed training-free method effectively addresses text-prompt alignment and background consistency issues in image inpainting through novel attention mechanisms and information fusion techniques.

Abstract: Text-guided image inpainting aims to inpaint masked image regions based on a textual prompt while preserving the background. Although diffusion-based methods have become dominant, their property of modeling the entire image in latent space makes it challenging for the results to align well with prompt details and maintain a consistent background. To address these issues, we explore Mask AutoRegressive (MAR) models for this task. MAR naturally supports image inpainting by generating latent tokens corresponding to mask regions, enabling better local controllability without altering the background. However, directly applying MAR to this task makes the inpainting content either ignore the prompts or be disharmonious with the background context. Through analysis of the attention maps from the inpainting images, we identify the impact of background tokens on text tokens during the MAR generation, and leverage this to design \textbf{Token Painter}, a training-free text-guided image inpainting method based on MAR. Our approach introduces two key components: (1) Dual-Stream Encoder Information Fusion (DEIF), which fuses the semantic and context information from text and background in frequency domain to produce novel guidance tokens, allowing MAR to generate text-faithful inpainting content while keeping harmonious with background context. (2) Adaptive Decoder Attention Score Enhancing (ADAE), which adaptively enhances attention scores on guidance tokens and inpainting tokens to further enhance the alignment of prompt details and the content visual quality. Extensive experiments demonstrate that our training-free method outperforms prior state-of-the-art methods across almost all metrics. Codes: https://github.com/longtaojiang/Token-Painter.

Xue-Feng Zhu, Tianyang Xu, Yifan Pan, Jinjie Gu, Xi Li, Jiwen Lu, Xiao-Jun Wu, Josef Kittler

Main category: cs.CV

TL;DR: Introduces RGBDT500 dataset and RDTTrack method for tri-modal (RGB, Depth, Thermal) object tracking, showing improved robustness over dual-modal approaches.

Details

Motivation: Existing multi-modal tracking focuses on dual-modal paradigms (RGB-Depth or RGB-Thermal) but struggles in complex scenarios due to limited input modalities.

Method: Proposes RDTTrack tracker that fuses thermal infrared and depth modalities under orthogonal projection constraint, then integrates them with RGB as prompts for pre-trained foundation tracking model using prompt learning.

Result: Experimental results show significant improvements over existing dual-modal approaches in tracking accuracy and robustness in complex scenarios.

Conclusion: Tri-modal tracking with RGB, Depth, and Thermal infrared enhances robustness in complex scenarios, with proposed method and dataset advancing multi-modal tracking research.

Abstract: Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal, yet remain challenged in complex scenarios due to limited input modalities. To address this gap, this work introduces a novel multi-modal tracking task that leverages three complementary modalities, including visible RGB, Depth (D), and Thermal Infrared (TIR), aiming to enhance robustness in complex scenarios. To support this task, we construct a new multi-modal tracking dataset, coined RGBDT500, which consists of 500 videos with synchronised frames across the three modalities. Each frame provides spatially aligned RGB, depth, and thermal infrared images with precise object bounding box annotations. Furthermore, we propose a novel multi-modal tracker, dubbed RDTTrack. RDTTrack integrates tri-modal information for robust tracking by leveraging a pretrained RGB-only tracking model and prompt learning techniques. In specific, RDTTrack fuses thermal infrared and depth modalities under a proposed orthogonal projection constraint, then integrates them with RGB signals as prompts for the pre-trained foundation tracking model, effectively harmonising tri-modal complementary cues. The experimental results demonstrate the effectiveness and advantages of the proposed method, showing significant improvements over existing dual-modal approaches in terms of tracking accuracy and robustness in complex scenarios. The dataset and source code are publicly available at https://xuefeng-zhu5.github.io/RGBDT500.

[496] DA$^{2}$: Depth Anything in Any Direction

Haodong Li, Wangguangdong Zheng, Jing He, Yuhao Liu, Xin Lin, Xin Yang, Ying-Cong Chen, Chunchao Guo

Main category: cs.CV

TL;DR: DA² is a zero-shot generalizable panoramic depth estimator that addresses data scarcity and spherical distortion issues through data curation and SphereViT architecture, achieving state-of-the-art performance.

Details

Motivation: Panoramic depth estimation faces challenges due to limited panoramic data and spherical distortions, leading to poor zero-shot generalization and suboptimal efficiency in existing methods.

Method: Proposes DA² with two key components: 1) Data curation engine generating ~543K panoramic RGB-depth pairs from perspective data, 2) SphereViT that leverages spherical coordinates to enforce geometric consistency in panoramic features.

Result: Achieves 38% average improvement on AbsRel over strongest zero-shot baseline, outperforms prior in-domain methods, and shows higher efficiency than fusion-based approaches.

Conclusion: DA² provides an accurate, zero-shot generalizable, and fully end-to-end solution for panoramic depth estimation with superior performance and efficiency.

Abstract: Panorama has a full FoV (360$^\circ\times$180$^\circ$), offering a more complete visual description than perspective images. Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision. However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization. Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (e.g., cubemaps), which leads to suboptimal efficiency. To address these challenges, we propose $\textbf{DA}$$^{\textbf{2}}$: $\textbf{D}$epth $\textbf{A}$nything in $\textbf{A}$ny $\textbf{D}$irection, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator. Specifically, for scaling up panoramic data, we introduce a data curation engine for generating high-quality panoramic depth data from perspective, and create $\sim$543K panoramic RGB-depth pairs, bringing the total to $\sim$607K. To further mitigate the spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce the spherical geometric consistency in panoramic image features, yielding improved performance. A comprehensive benchmark on multiple datasets clearly demonstrates DA$^{2}$’s SoTA performance, with an average 38% improvement on AbsRel over the strongest zero-shot baseline. Surprisingly, DA$^{2}$ even outperforms prior in-domain methods, highlighting its superior zero-shot generalization. Moreover, as an end-to-end solution, DA$^{2}$ exhibits much higher efficiency over fusion-based approaches. Both the code and the curated panoramic data has be released. Project page: https://depth-any-in-any-dir.github.io/.

[497] ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, Jiaya Jia

Main category: cs.CV

TL;DR: ViSurf is a unified post-training paradigm that combines Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to overcome their individual limitations in Large Vision-and-Language Models training.

Details

Motivation: SFT often leads to sub-optimal performance while RLVR struggles with tasks beyond the model's internal knowledge. There's a need to integrate both approaches to leverage external guidance and internal reinforcement simultaneously.

Method: ViSurf integrates SFT and RLVR in a single stage by injecting ground-truth labels into RLVR rollouts, providing simultaneous external supervision and internal reinforcement. It introduces three novel reward control strategies to stabilize training.

Result: Extensive experiments across diverse benchmarks show ViSurf outperforms individual SFT, RLVR, and two-stage SFT→RLVR approaches.

Conclusion: ViSurf provides a unified perspective on post-training paradigms, effectively combining the strengths of both SFT and RLVR while overcoming their individual limitations.

Abstract: Typical post-training paradigms for Large Vision-and-Language Models (LVLMs) include Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). SFT leverages external guidance to inject new knowledge, whereas RLVR utilizes internal reinforcement to enhance reasoning capabilities and overall performance. However, our analysis reveals that SFT often leads to sub-optimal performance, while RLVR struggles with tasks that exceed the model’s internal knowledge base. To address these limitations, we propose ViSurf (\textbf{Vi}sual \textbf{Su}pervised-and-\textbf{R}einforcement \textbf{F}ine-Tuning), a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage. We analyze the derivation of the SFT and RLVR objectives to establish the ViSurf objective, providing a unified perspective on these two paradigms. The core of ViSurf involves injecting ground-truth labels into the RLVR rollouts, thereby providing simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to stabilize and optimize the training process. Extensive experiments across several diverse benchmarks demonstrate the effectiveness of ViSurf, outperforming both individual SFT, RLVR, and two-stage SFT \textrightarrow RLVR. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.

[498] FedHUG: Federated Heterogeneous Unsupervised Generalization for Remote Physiological Measurements

Xiao Yang, Dengbo He, Jiyao Wang, Kaishun Wu

Main category: cs.CV

TL;DR: FedHUG is a federated learning framework for remote physiological measurement that handles unlabeled client data and domain heterogeneity without requiring labeled client data.

Details

Motivation: Existing contactless physiological measurement methods require labeled client data and pose privacy risks, making it challenging to update deployed models with unlabeled user data.

Method: FedHUG uses Minimal Bias Aggregation to handle heterogeneous non-IID features and Global Distribution-aware Learning Controller to mitigate label distribution skew and long-tail issues.

Result: The framework shows superior performance compared to state-of-the-art techniques for both RGB video and mmWave radar-based physiological estimation.

Conclusion: FedHUG successfully addresses the challenges of federated unsupervised domain generalization for remote physiological measurement while preserving privacy.

Abstract: Remote physiological measurement gained wide attention, while it requires collecting users’ privacy-sensitive information, and existing contactless measurements still rely on labeled client data. This presents challenges when we want to further update real-world deployed models with numerous user data lacking labels. To resolve these challenges, we instantiate a new protocol called Federated Unsupervised Domain Generalization (FUDG) in this work. Subsequently, the \textbf{Fed}erated \textbf{H}eterogeneous \textbf{U}nsupervised \textbf{G}eneralization (\textbf{FedHUG}) framework is proposed and consists of: (1) Minimal Bias Aggregation module dynamically adjusts aggregation weights based on prior-driven bias evaluation to cope with heterogeneous non-IID features from multiple domains. (2) The Global Distribution-aware Learning Controller parameterizes the label distribution and dynamically manipulates client-specific training strategies, thereby mitigating the server-client label distribution skew and long-tail issue. The proposal shows superior performance across state-of-the-art techniques in estimation with either RGB video or mmWave radar. The code will be released.

[499] GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering

Alexander Valverde, Brian Xu, Yuyin Zhou, Meng Xu, Hongyun Wang

Main category: cs.CV

TL;DR: GauSSmart is a hybrid 2D-3D method that enhances Gaussian Splatting scene reconstruction by integrating 2D foundational models like DINO with semantic feature supervision and convex filtering to improve detail capture and coverage in sparse regions.

Details

Motivation: Gaussian Splatting struggles with fine details and realism in sparse coverage areas due to limitations of sparse 3D training data, creating a need to bridge 2D foundational models with 3D reconstruction.

Method: Integrates 2D computer vision techniques including convex filtering and semantic feature supervision from foundational models like DINO to guide Gaussian splat densification and refinement using 2D segmentation priors and high-dimensional feature embeddings.

Result: Outperforms existing Gaussian Splatting in majority of evaluated scenes across three datasets, demonstrating improved coverage in underrepresented areas and better preservation of structural details.

Conclusion: Hybrid 2D-3D approaches combining 2D foundational models with 3D reconstruction pipelines can overcome limitations inherent in either approach alone, showing significant potential for enhanced scene reconstruction.

Abstract: Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.

[500] Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation

Jie Du, Xinyu Gong, Qingshan Tan, Wen Li, Yangming Cheng, Weitao Wang, Chenlu Zhan, Suhui Wu, Hao Zhang, Jun Zhang

Main category: cs.CV

TL;DR: This paper introduces Reg-DPO, an enhanced Direct Preference Optimization method for video generation that uses GT-Pairs for automatic preference data construction and adds SFT loss regularization for training stability.

Details

Motivation: Existing DPO methods for video generation follow image-domain paradigms and work on small models, failing to address video-specific challenges like costly data construction, unstable training, and heavy memory consumption.

Method: Proposes GT-Pairs that automatically build preference pairs using real videos as positives and model-generated videos as negatives, eliminating external annotation. Introduces Reg-DPO that incorporates SFT loss as regularization into DPO loss. Combines FSDP framework with memory optimization techniques to achieve 3x higher training capacity.

Result: Extensive experiments on I2V and T2V tasks across multiple datasets show the method consistently outperforms existing approaches, delivering superior video generation quality.

Conclusion: The proposed Reg-DPO with GT-Pairs effectively addresses video generation challenges, providing stable training and high-quality results without requiring external annotations.

Abstract: Recent studies have identified Direct Preference Optimization (DPO) as an efficient and reward-free approach to improving video generation quality. However, existing methods largely follow image-domain paradigms and are mainly developed on small-scale models (approximately 2B parameters), limiting their ability to address the unique challenges of video tasks, such as costly data construction, unstable training, and heavy memory consumption. To overcome these limitations, we introduce a GT-Pair that automatically builds high-quality preference pairs by using real videos as positives and model-generated videos as negatives, eliminating the need for any external annotation. We further present Reg-DPO, which incorporates the SFT loss as a regularization term into the DPO loss to enhance training stability and generation fidelity. Additionally, by combining the FSDP framework with multiple memory optimization techniques, our approach achieves nearly three times higher training capacity than using FSDP alone. Extensive experiments on both I2V and T2V tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches, delivering superior video generation quality.

[501] DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation

Dongnam Byun, Jungwon Park, Jungmin Ko, Changin Choi, Wonjong Rhee

Main category: cs.CV

TL;DR: DOS improves multi-object image generation by modifying CLIP text embeddings to address object neglect and mixing issues in four problematic scenarios.

Details

Motivation: Current T2I models struggle with prompts containing multiple objects, often resulting in object neglect or mixing, especially in scenarios with similar shapes, textures, background biases, or many objects.

Method: DOS modifies three types of CLIP text embeddings before feeding them into T2I models to better separate object representations and preserve inter-object relationships.

Result: DOS consistently improves multi-object generation success rates, reduces object mixing, and significantly outperforms four competing methods in human evaluations (26.24%-43.04% more votes across four benchmarks).

Conclusion: DOS is a practical and effective solution for improving multi-object image generation in text-to-image models.

Abstract: Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.

Zheng Qi, Chao Shang, Evangelia Spiliopoulou, Nikolaos Pappas

Main category: cs.CV

TL;DR: GIFT is a method that reduces hallucination in vision language models by tracking gaze shifts to identify salient visual regions and balancing cross-modal attention between visual inputs and user queries.

Details

Motivation: VLMs often generate hallucinated content due to over-reliance on linguistic priors, visual attention sink (misallocation to irrelevant regions), and imbalanced cross-modal fusion that neglects query importance.

Method: Pre-computes holistic visual saliency maps by tracking positive attention changes (gaze shifts) during query comprehension, then amplifies attention to both salient visual information and user queries at each decoding step.

Result: Achieves up to 20.7% improvement over greedy decoding in hallucination mitigation across generative and classification tasks, while maintaining general vision-language performance with low computational overhead.

Conclusion: GIFT effectively addresses visual attention sink and cross-modal fusion imbalance, significantly reducing hallucination in VLMs without compromising overall performance or computational efficiency.

Abstract: Vision language models (VLMs) often generate hallucination, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect areas while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or “gaze shifts”, during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of visual attention sink, as irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.

[503] Rethinking Robust Adversarial Concept Erasure in Diffusion Models

Qinghong Yin, Yu Tian, Heming Yang, Xiang Chen, Xianlin Zhang, Xueming Li, Yue Zhan

Main category: cs.CV

TL;DR: S-GRACE introduces semantics-guided adversarial concept erasure for diffusion models, improving erasure performance by 26% while reducing training time by 90% compared to existing methods.

Details

Motivation: Existing concept erasure methods in diffusion models use adversarial training but neglect conceptual semantics, leading to partial mitigation, incomplete concept coverage, and disruption of non-target concepts.

Method: S-GRACE leverages semantic guidance within concept space to generate adversarial samples and perform erasure training, addressing the limitations of existing methods that fail to properly fit concept spaces.

Result: Experiments show S-GRACE significantly improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90% compared to seven state-of-the-art methods.

Conclusion: Semantics-guided adversarial training effectively addresses concept erasure in diffusion models, providing comprehensive coverage of target concepts while preserving non-target concepts and achieving substantial efficiency gains.

Abstract: Concept erasure aims to selectively unlearning undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. As a novel paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which grace leveraging semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.

[504] MM-UNet: Morph Mamba U-shaped Convolutional Networks for Retinal Vessel Segmentation

Jiawen Liu, Yuanbo Zeng, Jiaming Liang, Yizhen Yang, Yiheng Zhang, Enhui Cai, Xiaoqi Sheng, Hongmin Cai

Main category: cs.CV

TL;DR: MM-UNet is a novel deep learning architecture for retinal vessel segmentation that uses Morph Mamba Convolution layers and Reverse Selective State Guidance modules to improve segmentation of thin, branching vessel structures, achieving state-of-the-art performance on public datasets.

Details

Motivation: Retinal vessel segmentation is crucial for clinical diagnosis of ocular diseases, but existing methods struggle with the unique characteristics of retinal vasculature - extremely thin, branching structures with high morphological variability across images, which challenge segmentation precision and robustness.

Method: Proposed MM-UNet architecture with two key innovations: Morph Mamba Convolution layers that replace pointwise convolutions to enhance branching topological perception through morph-aware feature sampling, and Reverse Selective State Guidance modules that integrate reverse guidance theory with state-space modeling to improve geometric boundary awareness and decoding efficiency.

Result: Extensive experiments on DRIVE and STARE datasets show superior performance, achieving F1-score gains of 1.64% on DRIVE and 1.25% on STARE compared to existing approaches, demonstrating significant improvement in segmentation accuracy.

Conclusion: MM-UNet effectively addresses the challenges of retinal vessel segmentation through its specialized architecture, providing more accurate and robust segmentation of thin, branching vascular structures for clinical applications.

Abstract: Accurate detection of retinal vessels plays a critical role in reflecting a wide range of health status indicators in the clinical diagnosis of ocular diseases. Recently, advances in deep learning have led to a surge in retinal vessel segmentation methods, which have significantly contributed to the quantitative analysis of vascular morphology. However, retinal vasculature differs significantly from conventional segmentation targets in that it consists of extremely thin and branching structures, whose global morphology varies greatly across images. These characteristics continue to pose challenges to segmentation precision and robustness. To address these issues, we propose MM-UNet, a novel architecture tailored for efficient retinal vessel segmentation. The model incorporates Morph Mamba Convolution layers, which replace pointwise convolutions to enhance branching topological perception through morph, state-aware feature sampling. Additionally, Reverse Selective State Guidance modules integrate reverse guidance theory with state-space modeling to improve geometric boundary awareness and decoding efficiency. Extensive experiments conducted on two public retinal vessel segmentation datasets demonstrate the superior performance of the proposed method in segmentation accuracy. Compared to the existing approaches, MM-UNet achieves F1-score gains of 1.64 % on DRIVE and 1.25 % on STARE, demonstrating its effectiveness and advancement. The project code is public via https://github.com/liujiawen-jpg/MM-UNet.

[505] VinDr-CXR-VQA: A Visual Question Answering Dataset for Explainable Chest X-Ray Analysis with Multi-Task Learning

Dang H. Nguyen, Hieu H. Pham, Hao T. Nguyen, Hieu H. Pham

Main category: cs.CV

TL;DR: VinDr-CXR-VQA is a large-scale chest X-ray dataset for explainable Medical Visual Question Answering with spatial grounding, containing 17,597 QA pairs across 4,394 images with radiologist-verified annotations.

Details

Motivation: To advance reproducible and clinically grounded Med-VQA research by providing a comprehensive dataset with spatial grounding and clinical reasoning explanations.

Method: Constructed a balanced dataset with 41.7% positive and 58.3% negative samples across six diagnostic question types, annotated with radiologist-verified bounding boxes and clinical reasoning.

Result: Benchmarking with MedGemma-4B-it showed improved performance (F1 = 0.624, +11.8% over baseline) while enabling lesion localization.

Conclusion: VinDr-CXR-VQA successfully advances explainable Med-VQA with spatial grounding capabilities and provides a valuable resource for reproducible research in medical AI.

Abstract: We present VinDr-CXR-VQA, a large-scale chest X-ray dataset for explainable Medical Visual Question Answering (Med-VQA) with spatial grounding. The dataset contains 17,597 question-answer pairs across 4,394 images, each annotated with radiologist-verified bounding boxes and clinical reasoning explanations. Our question taxonomy spans six diagnostic types-Where, What, Is there, How many, Which, and Yes/No-capturing diverse clinical intents. To improve reliability, we construct a balanced distribution of 41.7% positive and 58.3% negative samples, mitigating hallucinations in normal cases. Benchmarking with MedGemma-4B-it demonstrates improved performance (F1 = 0.624, +11.8% over baseline) while enabling lesion localization. VinDr-CXR-VQA aims to advance reproducible and clinically grounded Med-VQA research. The dataset and evaluation tools are publicly available at huggingface.co/datasets/Dangindev/VinDR-CXR-VQA.

[506] Temporal Zoom Networks: Distance Regression and Continuous Depth for Efficient Action Localization

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Main category: cs.CV

TL;DR: BDR replaces classification with signed-distance regression for boundary localization, achieving theoretical variance bounds and empirical improvements. ATR adaptively allocates computation depth to reduce FLOPs while maintaining accuracy.

Details

Motivation: Current temporal action localization methods apply uniform computation despite varying boundary difficulty, leading to inefficient resource usage.

Method: 1) Boundary Distance Regression (BDR): Uses signed-distance regression and zero-crossing extraction instead of classification. 2) Adaptive Temporal Refinement (ATR): Learns continuous depth allocation to adapt computation.

Result: BDR reduces boundary variance by 3.3x to 16.7x, improves mAP@0.7 by 1.8-3.1%. ATR achieves 56.5% mAP@0.7 at 151G FLOPs vs 53.6% at 198G for baseline (24% FLOPs reduction).

Conclusion: The proposed methods provide theoretically grounded boundary localization and adaptive computation, achieving significant efficiency gains while maintaining or improving accuracy.

Abstract: Temporal action localization requires precise boundaries, yet most methods apply uniform computation despite varying boundary difficulty. We propose two complementary contributions. Boundary Distance Regression (BDR) replaces classification with signed-distance regression and zero-crossing extraction. Under idealized assumptions (i.i.d. Laplace noise, uniform stride, sufficient capacity), BDR approaches the Cramer-Rao lower bound, yielding variance on the order of (Delta t)^2 / T (appearing as O((Delta t)^2) for fixed-video inference). The variance ratio R = Var[b_BDR] / Var[b_cls] scales as O((Delta t)^2 / W) for plateau width W approx 2*kappa, with empirical scaling appearing stronger (O((Delta t)^2 / W^2)) due to amplification factors (see Section~4). Empirically, BDR reduces boundary variance by 3.3x to 16.7x (R = 0.06 to 0.30) via four amplification factors. BDR retrofits to existing methods with about 50 lines of code, improving mAP@0.7 by 1.8 to 3.1 percent (average +2.4). Adaptive Temporal Refinement (ATR) learns continuous depth allocation tau in [0,1] to adapt computation, avoiding discrete routing complexity. On THUMOS14, ATR achieves 56.5 percent mAP@0.7 at 151G FLOPs versus 53.6 percent at 198G for the Uniform-6 baseline (24 percent FLOPs reduction, 118 ms vs. 167 ms latency). Gains scale with boundary heterogeneity: THUMOS14 (+2.9), FineAction (+2.7), ActivityNet (+1.8). Training overhead (1.29x baseline) is mitigated via knowledge distillation, with students retaining 99.5 percent performance. Code will be released.

[507] DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu

Main category: cs.CV

TL;DR: DeepEyesV2 is an agentic multimodal model that integrates tool use with reasoning through a two-stage training pipeline and is evaluated on the RealX-Bench benchmark.

Details

Motivation: Agentic multimodal models need to actively invoke external tools and integrate these operations into reasoning, but direct reinforcement learning alone fails to induce robust tool-use behavior.

Method: Two-stage training pipeline: cold-start stage to establish tool-use patterns, followed by reinforcement learning stage to refine tool invocation. Uses a diverse, moderately challenging training dataset with examples where tool use is beneficial.

Result: DeepEyesV2 demonstrates effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks, exhibiting task-adaptive tool invocation and complex tool combinations.

Conclusion: The study provides guidance for developing agentic multimodal models through the proposed training pipeline and evaluation framework.

Abstract: Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows model to selectively invoke tools based on context. We hope our study can provide guidance for community in developing agentic multimodal models.

[508] Rethinking Metrics and Diffusion Architecture for 3D Point Cloud Generation

Matteo Bastico, David Ryckelynck, Laurent Corté, Yannick Tillier, Etienne Decencière

Main category: cs.CV

TL;DR: The paper addresses issues with current point cloud evaluation metrics, proposes improved metrics including Surface Normal Concordance (SNC), and introduces Diffusion Point Transformer for superior point cloud generation.

Details

Motivation: Current metrics for evaluating generated point clouds lack robustness and fail to capture geometric fidelity, while existing generative models have limitations in producing high-quality 3D structures.

Method: Proposed improved evaluation metrics including Density-Aware Chamfer Distance (DCD) and Surface Normal Concordance (SNC), and developed Diffusion Point Transformer architecture using transformer-based models for point cloud generation.

Result: The model outperforms previous solutions on ShapeNet dataset, achieving state-of-the-art performance in generating high-fidelity point clouds with better geometric quality.

Conclusion: Combining improved evaluation metrics with the novel Diffusion Point Transformer architecture provides more comprehensive assessment and superior generation of 3D point clouds.

Abstract: As 3D point clouds become a cornerstone of modern technology, the need for sophisticated generative models and reliable evaluation metrics has grown exponentially. In this work, we first expose that some commonly used metrics for evaluating generated point clouds, particularly those based on Chamfer Distance (CD), lack robustness against defects and fail to capture geometric fidelity and local shape consistency when used as quality indicators. We further show that introducing samples alignment prior to distance calculation and replacing CD with Density-Aware Chamfer Distance (DCD) are simple yet essential steps to ensure the consistency and robustness of point cloud generative model evaluation metrics. While existing metrics primarily focus on directly comparing 3D Euclidean coordinates, we present a novel metric, named Surface Normal Concordance (SNC), which approximates surface similarity by comparing estimated point normals. This new metric, when combined with traditional ones, provides a more comprehensive evaluation of the quality of generated samples. Finally, leveraging recent advancements in transformer-based models for point cloud analysis, such as serialized patch attention , we propose a new architecture for generating high-fidelity 3D structures, the Diffusion Point Transformer. We perform extensive experiments and comparisons on the ShapeNet dataset, showing that our model outperforms previous solutions, particularly in terms of quality of generated point clouds, achieving new state-of-the-art. Code available at https://github.com/matteo-bastico/DiffusionPointTransformer.

cs.AI

[509] A Graph-Theoretical Perspective on Law Design for Multiagent Systems

Qi Shi, Pavel Naumov

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to the paper content

Method: Cannot determine method without access to the paper content

Result: Cannot determine results without access to the paper content

Conclusion: Cannot determine conclusion without access to the paper content

Abstract: Failed to fetch summary for 2511.06361: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.06361&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[510] Evidence-Bound Autonomous Research (EviBound): A Governance Framework for Eliminating False Claims

Ruiying Chen

Main category: cs.AI

TL;DR: EviBound is an evidence-bound execution framework that eliminates false claims in LLM-based autonomous research agents through dual governance gates requiring machine-checkable evidence.

Details

Motivation: LLM-based autonomous research agents frequently report false claims - tasks marked as complete despite missing artifacts, contradictory metrics, or failed executions, undermining research integrity.

Method: Dual governance gates: pre-execution Approval Gate validates acceptance criteria schemas before code runs; post-execution Verification Gate validates artifacts via MLflow API queries with recursive path checking and optional metric validation. Claims propagate only when backed by queryable run ID, required artifacts, and FINISHED status.

Result: Achieved 0% hallucination: 7/8 tasks verified and 1 task correctly blocked at approval gate, with only ~8.3% execution overhead. Baseline A had 100% hallucination, Baseline B had 25% hallucination.

Conclusion: Research integrity is an architectural property achieved through governance gates rather than emergent from model scale. The framework provides verifiable execution with minimal overhead.

Abstract: LLM-based autonomous research agents report false claims: tasks marked “complete” despite missing artifacts, contradictory metrics, or failed executions. EviBound is an evidence-bound execution framework that eliminates false claims through dual governance gates requiring machine-checkable evidence. Two complementary gates enforce evidence requirements. The pre-execution Approval Gate validates acceptance criteria schemas before code runs, catching structural violations proactively. The post-execution Verification Gate validates artifacts via MLflow API queries (with recursive path checking) and optionally validates metrics when specified by acceptance criteria. Claims propagate only when backed by a queryable run ID, required artifacts, and FINISHED status. Bounded, confidence-gated retries (typically 1-2 attempts) recover from transient failures without unbounded loops. The framework was evaluated on 8 benchmark tasks spanning infrastructure validation, ML capabilities, and governance stress tests. Baseline A (Prompt-Level Only) yields 100% hallucination (8/8 claimed, 0/8 verified). Baseline B (Verification-Only) reduces hallucination to 25% (2/8 fail verification). EviBound (Dual Gates) achieves 0% hallucination: 7/8 tasks verified and 1 task correctly blocked at the approval gate, all with only approximately 8.3% execution overhead. This package includes execution trajectories, MLflow run IDs for all verified tasks, and a 4-step verification protocol. Research integrity is an architectural property, achieved through governance gates rather than emergent from model scale.

[511] AGRAG: Advanced Graph-based Retrieval-Augmented Generation for LLMs

Yubo Wang, Haoyang Li, Fei Teng, Lei Chen

Main category: cs.AI

TL;DR: Failed to fetch summary for paper 2511.05549 due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to analyze motivation as paper content could not be retrieved

Method: Unable to analyze method as paper content could not be retrieved

Result: Unable to analyze results as paper content could not be retrieved

Conclusion: Unable to analyze conclusion as paper content could not be retrieved

Abstract: Failed to fetch summary for 2511.05549: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05549&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[512] Pedagogical Reflections on the Holistic Cognitive Development (HCD) Framework and AI-Augmented Learning in Creative Computing

BHojan Anand

Main category: cs.AI

TL;DR: The paper analysis could not be completed due to a server error (HTTP 429 - Too Many Requests) when attempting to fetch the abstract from arXiv.

Details

Motivation: Unable to determine the research motivation as the abstract content is unavailable.

Method: Unable to identify the methodology used in the paper due to missing abstract data.

Result: No results can be analyzed since the paper content could not be retrieved.

Conclusion: Analysis incomplete - server rate limiting prevented access to the paper’s abstract and content.

Abstract: Failed to fetch summary for 2511.06779: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.06779&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[513] SMAGDi: Socratic Multi Agent Interaction Graph Distillation for Efficient High Accuracy Reasoning

Aayush Aluru, Myra Malik, Samarth Patankar, Spencer Kim, Kevin Zhu, Sean O’Brien, Vasu Sharma

Main category: cs.AI

TL;DR: SMAGDi is a distillation framework that transfers multi-agent debate dynamics into a compact Socratic decomposer-solver student, achieving 88% accuracy of a 40B multi-agent system with only 6B parameters.

Details

Motivation: Multi-agent systems achieve high reasoning accuracy but are computationally expensive due to repeated debates. There's a need to compress these systems into smaller, more efficient models while preserving their reasoning capabilities.

Method: Represent debate traces as directed interaction graphs with nodes encoding reasoning steps and edges capturing continuity and influence. Train a student model with composite objectives including language modeling, graph supervision, contrastive reasoning, and embedding alignment.

Result: SMAGDi compresses a 40B multi-agent system into a 6B student while retaining 88% accuracy on StrategyQA and MMLU, outperforming prior distillation methods like MAGDi, standard KD, and fine-tuned baselines.

Conclusion: Explicitly modeling interaction graphs and Socratic decomposition enables small models to inherit multi-agent debate accuracy benefits while remaining efficient for real-world deployment.

Abstract: Multi-agent systems (MAS) often achieve higher reasoning accuracy than single models, but their reliance on repeated debates across agents makes them computationally expensive. We introduce SMAGDi, a distillation framework that transfers the debate dynamics of a five-agent Llama-based MAS into a compact Socratic decomposer-solver student. SMAGDi represents debate traces as directed interaction graphs, where nodes encode intermediate reasoning steps with correctness labels and edges capture continuity and cross-agent influence. The student is trained with a composite objective combining language modeling, graph-based supervision, contrastive reasoning, and embedding alignment to preserve both fluency and structured reasoning. On StrategyQA and MMLU, SMAGDi compresses a 40B multi-agent system into a 6B student while retaining 88% of its accuracy, substantially outperforming prior distillation methods such as MAGDi, standard KD, and fine-tuned baselines. These results highlight that explicitly modeling interaction graphs and Socratic decomposition enable small models to inherit the accuracy benefits of multi-agent debate while remaining efficient enough for real-world deployment.

[514] From Prompts to Power: Measuring the Energy Footprint of LLM Inference

Francisco Caravaca, Ángel Cuevas, Rubén Cuevas

Main category: cs.AI

TL;DR: Large-scale measurement study of LLM inference energy consumption across 32,500+ measurements, 21 GPU configurations, and 155 model architectures, leading to a predictive model and browser extension for environmental awareness.

Details

Motivation: The rapid expansion of LLMs has created unprecedented energy demands, especially for inference workloads that dominate lifecycle consumption, yet systematic analyses of inference energy consumption remain limited despite growing environmental concerns.

Method: Conducted large-scale measurement-based study using vLLM inference engine, quantifying energy usage at prompt level across diverse hardware (21 GPU configurations) and model architectures (155 models from small to frontier systems).

Result: Identified how architectural and operational factors shape energy demand, developed accurate predictive model for estimating inference energy consumption across unseen architectures and hardware, and implemented as browser extension.

Conclusion: Provides comprehensive energy consumption insights for LLM inference, enabling better environmental awareness and more sustainable AI deployment through predictive modeling and user-facing tools.

Abstract: The rapid expansion of Large Language Models (LLMs) has introduced unprecedented energy demands, extending beyond training to large-scale inference workloads that often dominate total lifecycle consumption. Deploying these models requires energy-intensive GPU infrastructure, and in some cases has even prompted plans to power data centers with nuclear energy. Despite this growing relevance, systematic analyses of inference energy consumption remain limited. In this work, we present a large-scale measurement-based study comprising over 32,500 measurements across 21 GPU configurations and 155 model architectures, from small open-source models to frontier systems. Using the vLLM inference engine, we quantify energy usage at the prompt level and identify how architectural and operational factors shape energy demand. Building on these insights, we develop a predictive model that accurately estimates inference energy consumption across unseen architectures and hardware, and implement it as a browser extension to raise awareness of the environmental impact of generative AI.

[515] CoT-X: An Adaptive Framework for Cross-Model Chain-of-Thought Transfer and Optimization

Ziqian Bi, Kaijie Chen, Tianyang Wang, Junfeng Hao, Xinyuan Song

Main category: cs.AI

TL;DR: Efficient CoT reasoning transfer across LLMs via adaptive reasoning summarization framework, achieving 40% higher accuracy than truncation with same token budgets.

Details

Motivation: CoT reasoning improves LLM problem-solving but causes substantial inference overhead, limiting deployment in resource-constrained settings.

Method: Adaptive reasoning summarization with semantic segmentation, importance scoring, budget-aware dynamic compression, and coherence reconstruction to compress reasoning traces.

Result: 40% higher accuracy than truncation on medical questions, strong cross-model transferability across 64 model pairs, 84% evaluation cost reduction via Bayesian optimization.

Conclusion: Reasoning summarization enables efficient CoT transfer, allowing advanced reasoning under tight computational constraints.

Abstract: Chain-of-Thought (CoT) reasoning enhances the problem-solving ability of large language models (LLMs) but leads to substantial inference overhead, limiting deployment in resource-constrained settings. This paper investigates efficient CoT transfer across models of different scales and architectures through an adaptive reasoning summarization framework. The proposed method compresses reasoning traces via semantic segmentation with importance scoring, budget-aware dynamic compression, and coherence reconstruction, preserving critical reasoning steps while significantly reducing token usage. Experiments on 7{,}501 medical examination questions across 10 specialties show up to 40% higher accuracy than truncation under the same token budgets. Evaluations on 64 model pairs from eight LLMs (1.5B-32B parameters, including DeepSeek-R1 and Qwen3) confirm strong cross-model transferability. Furthermore, a Gaussian Process-based Bayesian optimization module reduces evaluation cost by 84% and reveals a power-law relationship between model size and cross-domain robustness. These results demonstrate that reasoning summarization provides a practical path toward efficient CoT transfer, enabling advanced reasoning under tight computational constraints. Code will be released upon publication.

[516] An Epistemic Perspective on Agent Awareness

Pavel Naumov, Alexandra Pavlova

Main category: cs.AI

TL;DR: The paper redefines agent awareness as knowledge, distinguishing between de re and de dicto forms, and provides a formal logical system for their interaction with standard knowledge.

Details

Motivation: To break from traditional approaches to agent awareness by treating it as a form of knowledge rather than a separate concept, enabling more precise formal analysis.

Method: Introduces two modalities for de re and de dicto awareness, uses 2D-semantics for formal specification, and develops a logical system for their interplay with standard knowledge.

Result: A sound and complete logical system that describes the relationships between de re awareness, de dicto awareness, and standard knowledge of facts.

Conclusion: The proposed framework successfully formalizes agent awareness as knowledge with distinct de re and de dicto forms, providing a complete logical system for reasoning about their interactions.

Abstract: The paper proposes to treat agent awareness as a form of knowledge, breaking the tradition in the existing literature on awareness. It distinguishes the de re and de dicto forms of such knowledge. The work introduces two modalities capturing these forms and formally specifies their meaning using a version of 2D-semantics. The main technical result is a sound and complete logical system describing the interplay between the two proposed modalities and the standard “knowledge of the fact” modality.

[517] Anchors in the Machine: Behavioral and Attributional Evidence of Anchoring Bias in LLMs

Felipe Valencia-Clavijo

Main category: cs.AI

TL;DR: This paper demonstrates that LLMs exhibit robust anchoring bias through behavioral analysis and attribution methods, showing anchor-induced probability shifts in output distributions.

Details

Motivation: To determine whether cognitive biases in LLMs reflect surface imitation or deeper probability shifts, using anchoring bias as a critical test case to explore internal mechanisms and attributional contributions.

Method: Three-pronged approach: (1) log-probability-based behavioral analysis with training-data contamination controls, (2) Shapley-value attribution over structured prompt fields, and (3) unified Anchoring Bias Sensitivity Score integrating behavioral and attributional evidence across six models.

Result: Robust anchoring effects found in Gemma-2B, Phi-2, and Llama-2-7B with attribution showing anchor influence reweighting. Smaller models (GPT-2, Falcon-RW-1B, GPT-Neo-125M) show variability, suggesting scale modulates sensitivity. Attributional effects vary across prompt designs.

Conclusion: Anchoring bias in LLMs is robust, measurable, and interpretable, but attribution effects are fragile, highlighting risks in treating LLMs as human substitutes. The framework bridges behavioral science, LLM safety, and interpretability for evaluating cognitive biases.

Abstract: Large language models (LLMs) are increasingly examined as both behavioral subjects and decision systems, yet it remains unclear whether observed cognitive biases reflect surface imitation or deeper probability shifts. Anchoring bias, a classic human judgment bias, offers a critical test case. While prior work shows LLMs exhibit anchoring, most evidence relies on surface-level outputs, leaving internal mechanisms and attributional contributions unexplored. This paper advances the study of anchoring in LLMs through three contributions: (1) a log-probability-based behavioral analysis showing that anchors shift entire output distributions, with controls for training-data contamination; (2) exact Shapley-value attribution over structured prompt fields to quantify anchor influence on model log-probabilities; and (3) a unified Anchoring Bias Sensitivity Score integrating behavioral and attributional evidence across six open-source models. Results reveal robust anchoring effects in Gemma-2B, Phi-2, and Llama-2-7B, with attribution signaling that the anchors influence reweighting. Smaller models such as GPT-2, Falcon-RW-1B, and GPT-Neo-125M show variability, suggesting scale may modulate sensitivity. Attributional effects, however, vary across prompt designs, underscoring fragility in treating LLMs as human substitutes. The findings demonstrate that anchoring bias in LLMs is robust, measurable, and interpretable, while highlighting risks in applied domains. More broadly, the framework bridges behavioral science, LLM safety, and interpretability, offering a reproducible path for evaluating other cognitive biases in LLMs.

[518] Maestro: Learning to Collaborate via Conditional Listwise Policy Optimization for Multi-Agent LLMs

Wei Yang, Jiacheng Pang, Shixuan Li, Paul Bogdan, Stephen Tu, Jesse Thomason

Main category: cs.AI

TL;DR: Maestro framework decouples exploration and synthesis in multi-agent LLM systems using parallel Execution Agents for diverse exploration and a Central Agent for convergent synthesis, achieving 6-10% accuracy gains.

Details

Motivation: Existing multi-agent LLM systems struggle with balancing exploration vs synthesis, leading to premature consensus, error propagation, and poor credit assignment between reasoning and plausible arguments.

Method: Proposes Maestro framework with parallel Execution Agents for exploration and Central Agent for synthesis, plus CLPO reinforcement learning that combines policy gradients with list-wise ranking loss for clean credit assignment.

Result: Experiments show Maestro with CLPO consistently outperforms state-of-the-art multi-agent approaches, achieving absolute accuracy gains of 6% on average and up to 10% at best on mathematical reasoning and problem-solving benchmarks.

Conclusion: Structural decoupling of exploration and synthesis through Maestro framework with CLPO effectively resolves core cognitive tension in multi-agent LLM systems, enabling superior performance through principled collaboration.

Abstract: Multi-agent systems (MAS) built on Large Language Models (LLMs) are being used to approach complex problems and can surpass single model inference. However, their success hinges on navigating a fundamental cognitive tension: the need to balance broad, divergent exploration of the solution space with a principled, convergent synthesis to the optimal solution. Existing paradigms often struggle to manage this duality, leading to premature consensus, error propagation, and a critical credit assignment problem that fails to distinguish between genuine reasoning and superficially plausible arguments. To resolve this core challenge, we propose the Multi-Agent Exploration-Synthesis framework Through Role Orchestration (Maestro), a principled paradigm for collaboration that structurally decouples these cognitive modes. Maestro uses a collective of parallel Execution Agents for diverse exploration and a specialized Central Agent for convergent, evaluative synthesis. To operationalize this critical synthesis phase, we introduce Conditional Listwise Policy Optimization (CLPO), a reinforcement learning objective that disentangles signals for strategic decisions and tactical rationales. By combining decision-focused policy gradients with a list-wise ranking loss over justifications, CLPO achieves clean credit assignment and stronger comparative supervision. Experiments on mathematical reasoning and general problem-solving benchmarks demonstrate that Maestro, coupled with CLPO, consistently outperforms existing state-of-the-art multi-agent approaches, delivering absolute accuracy gains of 6% on average and up to 10% at best.

[519] DiagnoLLM: A Hybrid Bayesian Neural Language Framework for Interpretable Disease Diagnosis

Bowen Xu, Xinyue Zeng, Jiazhen Hu, Tuo Wang, Adithya Kulkarni

Main category: cs.AI

TL;DR: DiagnoLLM is a hybrid framework combining Bayesian deconvolution, eQTL-guided deep learning, and LLM-based narrative generation for interpretable disease diagnosis, achieving 88.0% accuracy in Alzheimer’s Disease detection.

Details

Motivation: Building trustworthy clinical AI systems requires not only accurate predictions but also transparent, biologically grounded explanations to support human understanding and trust.

Method: Integrates GP-unmix (Gaussian Process-based hierarchical model for cell-type-specific gene expression inference), eQTL-guided neural classifier, and LLM-based reasoning module that translates model outputs into audience-specific diagnostic reports.

Result: Achieves 88.0% accuracy in Alzheimer’s Disease detection. Human evaluations confirm that the generated reports are accurate, actionable, and appropriately tailored for both physicians and patients.

Conclusion: LLMs, when deployed as post-hoc reasoners rather than end-to-end predictors, can serve as effective communicators within hybrid diagnostic pipelines.

Abstract: Building trustworthy clinical AI systems requires not only accurate predictions but also transparent, biologically grounded explanations. We present \texttt{DiagnoLLM}, a hybrid framework that integrates Bayesian deconvolution, eQTL-guided deep learning, and LLM-based narrative generation for interpretable disease diagnosis. DiagnoLLM begins with GP-unmix, a Gaussian Process-based hierarchical model that infers cell-type-specific gene expression profiles from bulk and single-cell RNA-seq data while modeling biological uncertainty. These features, combined with regulatory priors from eQTL analysis, power a neural classifier that achieves high predictive performance in Alzheimer’s Disease (AD) detection (88.0% accuracy). To support human understanding and trust, we introduce an LLM-based reasoning module that translates model outputs into audience-specific diagnostic reports, grounded in clinical features, attribution signals, and domain knowledge. Human evaluations confirm that these reports are accurate, actionable, and appropriately tailored for both physicians and patients. Our findings show that LLMs, when deployed as post-hoc reasoners rather than end-to-end predictors, can serve as effective communicators within hybrid diagnostic pipelines.

[520] Can a Small Model Learn to Look Before It Leaps? Dynamic Learning and Proactive Correction for Hallucination Detection

Zepeng Bao, Shen Zhou, Qiankun Pi, Jianhao Chen, Mayi Xu, Ming Zhong, Yuanyuan Zhu, Tieyun Qian

Main category: cs.AI

TL;DR: The LEAP framework addresses LLM hallucination detection by enabling dynamic strategy learning and proactive correction, outperforming existing methods on three benchmarks.

Details

Motivation: Existing hallucination detection methods use fixed verification strategies or costly closed-source LLMs, lacking adaptability in dynamic environments and leading to detection failures.

Method: LEAP formulates hallucination detection as dynamic strategy learning, using teacher models to generate trajectories and distill dynamic planning into efficient student models via agent tuning with proactive correction mechanisms.

Result: LEAP-tuned models outperform state-of-the-art methods on three challenging benchmarks.

Conclusion: The LEAP framework successfully endows efficient student models with dynamic learning and proactive correction capabilities, solving the strategy adaptability problem in hallucination detection.

Abstract: Hallucination in large language models (LLMs) remains a critical barrier to their safe deployment. Existing tool-augmented hallucination detection methods require pre-defined fixed verification strategies, which are crucial to the quality and effectiveness of tool calls. Some methods directly employ powerful closed-source LLMs such as GPT-4 as detectors, which are effective but too costly. To mitigate the cost issue, some methods adopt the teacher-student architecture and finetune open-source small models as detectors via agent tuning. However, these methods are limited by fixed strategies. When faced with a dynamically changing execution environment, they may lack adaptability and inappropriately call tools, ultimately leading to detection failure. To address the problem of insufficient strategy adaptability, we propose the innovative ``Learning to Evaluate and Adaptively Plan’’(LEAP) framework, which endows an efficient student model with the dynamic learning and proactive correction capabilities of the teacher model. Specifically, our method formulates the hallucination detection problem as a dynamic strategy learning problem. We first employ a teacher model to generate trajectories within the dynamic learning loop and dynamically adjust the strategy based on execution failures. We then distill this dynamic planning capability into an efficient student model via agent tuning. Finally, during strategy execution, the student model adopts a proactive correction mechanism, enabling it to propose, review, and optimize its own verification strategies before execution. We demonstrate through experiments on three challenging benchmarks that our LEAP-tuned model outperforms existing state-of-the-art methods.

[521] The Station: An Open-World Environment for AI-Driven Discovery

Stephen Chung, Wenyu Du

Main category: cs.AI

TL;DR: STATION is an open-world multi-agent environment for scientific discovery where AI agents autonomously conduct research, interact with peers, and develop novel methods through emergent behavior, achieving state-of-the-art performance across various benchmarks.

Details

Motivation: To create an autonomous scientific discovery system that moves beyond rigid optimization by enabling AI agents to engage in long-term scientific journeys and develop their own narratives in an open-world environment without centralized coordination.

Method: Developed STATION as a multi-agent environment where agents leverage extended context windows to read papers, formulate hypotheses, submit code, perform analyses, and publish results independently, allowing emergent behavior and organic method development.

Result: AI agents achieved state-of-the-art performance across mathematics, computational biology, and machine learning benchmarks, notably surpassing AlphaEvolve in circle packing. Emergent narratives produced novel methods like a density-adaptive algorithm for scRNA-seq batch integration.

Conclusion: STATION represents a new paradigm for autonomous scientific discovery driven by emergent behavior in open-world environments, demonstrating the potential for AI agents to organically develop novel methods and advance scientific frontiers.

Abstract: We introduce the STATION, an open-world multi-agent environment that models a miniature scientific ecosystem. Leveraging their extended context windows, agents in the Station can engage in long scientific journeys that include reading papers from peers, formulating hypotheses, submitting code, performing analyses, and publishing results. Importantly, there is no centralized system coordinating their activities - agents are free to choose their own actions and develop their own narratives within the Station. Experiments demonstrate that AI agents in the Station achieve new state-of-the-art performance on a wide range of benchmarks, spanning from mathematics to computational biology to machine learning, notably surpassing AlphaEvolve in circle packing. A rich tapestry of narratives emerges as agents pursue independent research, interact with peers, and build upon a cumulative history. From these emergent narratives, novel methods arise organically, such as a new density-adaptive algorithm for scRNA-seq batch integration. The Station marks a first step towards autonomous scientific discovery driven by emergent behavior in an open-world environment, representing a new paradigm that moves beyond rigid optimization.

[522] An Empirical Study of Reasoning Steps in Thinking Code LLMs

Haoran Xue, Gias Uddin, Song Wang

Main category: cs.AI

TL;DR: Comprehensive empirical study of reasoning LLMs for code generation, evaluating 6 models on 100 tasks with human evaluation and step-budget analysis.

Details

Motivation: To examine the quality of explicit reasoning chains in LLMs for code generation, which remains underexplored despite potential benefits for transparency and accuracy.

Method: Evaluated 6 reasoning LLMs on 100 code generation tasks from BigCodeBench, quantified reasoning structure, conducted step-budget adjustments, and performed 21-participant human evaluation across efficiency, logical correctness, and completeness.

Result: Targeted step increases improved resolution rates for certain models/tasks; modest step reductions preserved success on standard tasks but rarely on hard ones; completeness identified as dominant failure mode; hard problems more prone to incompleteness; models maintain consistent logical structures and can self-correct errors.

Conclusion: Provides insights into strengths and limitations of current thinking LLMs in software engineering, revealing systematic patterns in reasoning quality and failure modes.

Abstract: Thinking Large Language Models (LLMs) generate explicit intermediate reasoning traces before final answers, potentially improving transparency, interpretability, and solution accuracy for code generation. However, the quality of these reasoning chains remains underexplored. We present a comprehensive empirical study examining the reasoning process and quality of thinking LLMs for code generation. We evaluate six state-of-the-art reasoning LLMs (DeepSeek-R1, OpenAI-o3-mini, Claude-3.7-Sonnet-Thinking, Gemini-2.0-Flash-Thinking, Gemini-2.5-Flash, and Qwen-QwQ) across 100 code generation tasks of varying difficulty from BigCodeBench. We quantify reasoning-chain structure through step counts and verbosity, conduct controlled step-budget adjustments, and perform a 21-participant human evaluation across three dimensions: efficiency, logical correctness, and completeness. Our step-count interventions reveal that targeted step increases can improve resolution rates for certain models/tasks, while modest reductions often preserve success on standard tasks, rarely on hard ones. Through systematic analysis, we develop a reasoning-problematic taxonomy, identifying completeness as the dominant failure mode. Task complexity significantly impacts reasoning quality; hard problems are substantially more prone to incompleteness than standard tasks. Our stability analysis demonstrates that thinking LLMs maintain consistent logical structures across computational effort levels and can self-correct previous errors. This study provides new insights into the strengths and limitations of current thinking LLMs in software engineering.

[523] Unveiling Modality Bias: Automated Sample-Specific Analysis for Multimodal Misinformation Benchmarks

Hehai Lin, Hui Liu, Shilei Cao, Jing Li, Haoliang Li, Wenya Wang

Main category: cs.AI

TL;DR: Proposes three automated methods for detecting modality bias at sample level in multimodal misinformation detection, showing ensemble approaches are crucial and different granularity views agree more on balanced samples than biased ones.

Details

Motivation: Existing multimodal misinformation benchmarks have modality bias where detectors can predict using just one modality, but current bias quantification methods lack sample-level insights and don't scale well to online content.

Method: Three bias quantification methods: 1) coarse-grained modality benefit evaluation, 2) medium-grained information flow quantification, and 3) fine-grained causality analysis. Validated through human evaluation on popular benchmarks.

Result: Three key findings: ensemble of multiple views is crucial for reliable automated analysis; automated analysis is prone to detector-induced fluctuations; different views produce higher agreement on modality-balanced samples but diverge on biased ones.

Conclusion: The proposed automated bias recognition framework provides meaningful sample-level insights and reveals important patterns about modality bias, offering potential directions for future research in multimodal misinformation detection.

Abstract: Numerous multimodal misinformation benchmarks exhibit bias toward specific modalities, allowing detectors to make predictions based solely on one modality. While previous research has quantified bias at the dataset level or manually identified spurious correlations between modalities and labels, these approaches lack meaningful insights at the sample level and struggle to scale to the vast amount of online information. In this paper, we investigate the design for automated recognition of modality bias at the sample level. Specifically, we propose three bias quantification methods based on theories/views of different levels of granularity: 1) a coarse-grained evaluation of modality benefit; 2) a medium-grained quantification of information flow; and 3) a fine-grained causality analysis. To verify the effectiveness, we conduct a human evaluation on two popular benchmarks. Experimental results reveal three interesting findings that provide potential direction toward future research: 1)~Ensembling multiple views is crucial for reliable automated analysis; 2)~Automated analysis is prone to detector-induced fluctuations; and 3)~Different views produce a higher agreement on modality-balanced samples but diverge on biased ones.

[524] Self-Abstraction from Grounded Experience for Plan-Guided Policy Refinement

Hiroaki Hayashi, Bo Pang, Wenting Zhao, Ye Liu, Akash Gokul, Srijan Bansal, Caiming Xiong, Semih Yavuz, Yingbo Zhou

Main category: cs.AI

TL;DR: SAGE enables LLM agents to learn from task executions through self-abstraction, improving performance by distilling plans from experience and using them as guidance for future executions.

Details

Motivation: Current LLM agents operate in static frameworks without learning from experience, limiting their performance to initial design and LLM capabilities.

Method: After initial rollout, agents induce concise plan abstractions from grounded experience, extracting key steps, dependencies, and constraints, then use these as contextual guidance for refined subsequent executions.

Result: SAGE achieves 7.2% relative improvement over Mini-SWE-Agent baseline with GPT-5, and reaches 73.2%-74% Pass@1 resolve rates on SWE-Bench Verified benchmark.

Conclusion: SAGE framework enables consistent performance gains across diverse LLM backbones and agent architectures through self-abstraction from experience.

Abstract: Large language model (LLM) based agents are increasingly used to tackle software engineering tasks that require multi-step reasoning and code modification, demonstrating promising yet limited performance. However, most existing LLM agents typically operate within static execution frameworks, lacking a principled mechanism to learn and self-improve from their own experience and past rollouts. As a result, their performance remains bounded by the initial framework design and the underlying LLM’s capabilities. We propose Self-Abstraction from Grounded Experience (SAGE), a framework that enables agents to learn from their own task executions and refine their behavior through self-abstraction. After an initial rollout, the agent induces a concise plan abstraction from its grounded experience, distilling key steps, dependencies, and constraints. This learned abstraction is then fed back as contextual guidance, refining the agent’s policy and supporting more structured, informed subsequent executions. Empirically, SAGE delivers consistent performance gains across diverse LLM backbones and agent architectures. Notably, it yields a 7.2% relative performance improvement over the strong Mini-SWE-Agent baseline when paired with the GPT-5 (high) backbone. SAGE further achieves strong overall performance on SWE-Bench Verified benchmark, reaching 73.2% and 74% Pass@1 resolve rates with the Mini-SWE-Agent and OpenHands CodeAct agent framework, respectively.

[525] Klear-AgentForge: Forging Agentic Intelligence through Posttraining Scaling

Qi Wang, Hongzhi Zhang, Jia Fu, Kai Fu, Yahui Liu, Tinghai Zhang, Chenxi Sun, Gangwei Jiang, Jingyi Tang, Xingguang Ji, Yang Yue, Jingyuan Zhang, Fuzheng Zhang, Kun Gai, Guorui Zhou

Main category: cs.AI

TL;DR: Open-source pipeline for training high-performance agentic model Klear-Qwen3-AgentForge-8B from Qwen3-8B base, achieving SOTA performance among similar-sized LLMs.

Details

Motivation: Lack of critical post-training details hinders development of strong open-source agentic models, creating a gap in the open-source community.

Method: Comprehensive pipeline with supervised fine-tuning using synthetic data followed by multi-turn reinforcement learning for diverse agentic tasks.

Result: Achieves state-of-the-art performance among LLMs of similar size and remains competitive with significantly larger models on various agentic benchmarks.

Conclusion: Successfully developed fully open-source pipeline for training high-performance agentic models that can interact with external tools and environments.

Abstract: Despite the proliferation of powerful agentic models, the lack of critical post-training details hinders the development of strong counterparts in the open-source community. In this study, we present a comprehensive and fully open-source pipeline for training a high-performance agentic model for interacting with external tools and environments, named Klear-Qwen3-AgentForge, starting from the Qwen3-8B base model. We design effective supervised fine-tuning (SFT) with synthetic data followed by multi-turn reinforcement learning (RL) to unlock the potential for multiple diverse agentic tasks. We perform exclusive experiments on various agentic benchmarks in both tool use and coding domains. Klear-Qwen3-AgentForge-8B achieves state-of-the-art performance among LLMs of similar size and remains competitive with significantly larger models.

[526] Agentic AI Sustainability Assessment for Supply Chain Document Insights

Diego Gosmar, Anna Chiara Pallotta, Giovanni Zenezini

Main category: cs.AI

TL;DR: Agentic AI in supply chain document workflows achieves 70-90% energy reduction, 90-97% CO2 emission cuts, and 89-98% water usage reduction compared to manual processes.

Details

Motivation: To improve automation efficiency while providing measurable environmental performance in document-intensive supply chain workflows through AI integration.

Method: Comparative analysis of three scenarios: fully manual, AI-assisted (human-in-the-loop), and advanced multi-agent agentic AI workflow with parsers and verifiers.

Result: AI-assisted and agentic AI scenarios significantly outperform manual processes in sustainability metrics, with full agentic configurations achieving the best environmental performance despite slightly higher resource usage than simpler AI solutions.

Conclusion: The framework successfully integrates performance, energy, and emission indicators into a unified ESG-oriented methodology for assessing AI-enabled supply chain solutions, demonstrating substantial sustainability gains through agentic AI adoption.

Abstract: This paper presents a comprehensive sustainability assessment framework for document intelligence within supply chain operations, centered on agentic artificial intelligence (AI). We address the dual objective of improving automation efficiency while providing measurable environmental performance in document-intensive workflows. The research compares three scenarios: fully manual (human-only), AI-assisted (human-in-the-loop, HITL), and an advanced multi-agent agentic AI workflow leveraging parsers and verifiers. Empirical results show that AI-assisted HITL and agentic AI scenarios achieve reductions of up to 70-90% in energy consumption, 90-97% in carbon dioxide emissions, and 89-98% in water usage compared to manual processes. Notably, full agentic configurations, combining advanced reasoning (thinking mode) and multi-agent validation, achieve substantial sustainability gains over human-only approaches, even when resource usage increases slightly versus simpler AI-assisted solutions. The framework integrates performance, energy, and emission indicators into a unified ESG-oriented methodology for assessing and governing AI-enabled supply chain solutions. The paper includes a complete replicability use case demonstrating the methodology’s application to real-world document extraction tasks.

[527] ScRPO: From Errors to Insights

Lianrui Li, Dakuan Lu, Jiawei Shao, Chi Zhang, Xuelong Li

Main category: cs.AI

TL;DR: ScRPO is a reinforcement learning framework that improves LLMs on math problems through self-reflection and error correction in two stages: trial-and-error learning and self-correction learning.

Details

Motivation: To enhance large language models' performance on challenging mathematical problems by enabling self-improvement through reflection on errors with limited external feedback.

Method: Two-stage approach: (1) Trial-and-error learning with GRPO to collect incorrect answers in an error pool, (2) Self-correction learning where the model reflects on why previous answers were wrong.

Result: Outperforms several post-training methods across multiple math reasoning benchmarks including AIME, AMC, Olympiad, MATH-500, and GSM8k using Deepseek-Distill-Qwen models.

Conclusion: ScRPO is a promising paradigm for enabling language models to self-improve on difficult tasks with limited external feedback, paving the way toward more reliable and capable AI systems.

Abstract: We propose Self-correction Relative Policy Optimization (ScRPO), a novel reinforcement learning framework designed to enhance large language models on challenging mathematical problems by leveraging self-reflection and error correction. Our approach consists of two stages: (1) Trial-and-error learning stage: training the model with GRPO and collecting incorrect answers along with their corresponding questions in an error pool; (2) Self-correction learning stage: guiding the model to reflect on why its previous answers were wrong. Extensive experiments across multiple math reasoning benchmarks, including AIME, AMC, Olympiad, MATH-500, GSM8k, using Deepseek-Distill-Qwen-1.5B and Deepseek-Distill-Qwen-7B. The experimental results demonstrate that ScRPO consistently outperforms several post-training methods. These findings highlight ScRPO as a promising paradigm for enabling language models to self-improve on difficult tasks with limited external feedback, paving the way toward more reliable and capable AI systems.

[528] When Object-Centric World Models Meet Policy Learning: From Pixels to Policies, and Where It Breaks

Stefano Ferraro, Akihiro Nakano, Masahiro Suzuki, Yutaka Matsuo

Main category: cs.AI

TL;DR: Object-centric world models (OCWM) aim to decompose scenes into object representations for better generalization, but DLPWM shows strong visual modeling while struggling with stable control due to representation shift during interactions.

Details

Motivation: To test if disentangled object-level representations can improve policy performance across novel feature combinations by localizing task-relevant information.

Method: Introduce DLPWM, a fully unsupervised disentangled object-centric world model that learns object-level latents directly from pixels.

Result: DLPWM achieves strong reconstruction and prediction performance with OOD robustness, but policies trained on its latents underperform DreamerV3 due to representation shift during multi-object interactions.

Conclusion: Object-centric perception supports robust visual modeling, but achieving stable control requires mitigating latent drift caused by representation shift during interactions.

Abstract: Object-centric world models (OCWM) aim to decompose visual scenes into object-level representations, providing structured abstractions that could improve compositional generalization and data efficiency in reinforcement learning. We hypothesize that explicitly disentangled object-level representations, by localizing task-relevant information, can enhance policy performance across novel feature combinations. To test this hypothesis, we introduce DLPWM, a fully unsupervised, disentangled object-centric world model that learns object-level latents directly from pixels. DLPWM achieves strong reconstruction and prediction performance, including robustness to several out-of-distribution (OOD) visual variations. However, when used for downstream model-based control, policies trained on DLPWM latents underperform compared to DreamerV3. Through latent-trajectory analyses, we identify representation shift during multi-object interactions as a key driver of unstable policy learning. Our results suggest that, although object-centric perception supports robust visual modeling, achieving stable control requires mitigating latent drift.

[529] Evaluating Online Moderation Via LLM-Powered Counterfactual Simulations

Giacomo Fidone, Lucia Passaro, Riccardo Guidotti

Main category: cs.AI

TL;DR: LLM-powered simulator for evaluating content moderation strategies in online social networks through parallel counterfactual simulations.

Details

Motivation: Current evaluation of content moderation effectiveness is limited by high data collection costs and lack of experimental control, creating a need for simulation-based approaches.

Method: Developed a LLM-powered simulator for OSN conversations that enables parallel counterfactual simulations where toxic behavior is influenced by moderation interventions while keeping all other factors constant.

Result: Experiments demonstrated psychological realism of OSN agents, emergence of social contagion phenomena, and superior effectiveness of personalized moderation strategies compared to standard approaches.

Conclusion: LLM-powered simulation provides a viable approach for evaluating content moderation strategies, revealing the importance of personalized interventions and enabling controlled experimentation that was previously impractical.

Abstract: Online Social Networks (OSNs) widely adopt content moderation to mitigate the spread of abusive and toxic discourse. Nonetheless, the real effectiveness of moderation interventions remains unclear due to the high cost of data collection and limited experimental control. The latest developments in Natural Language Processing pave the way for a new evaluation approach. Large Language Models (LLMs) can be successfully leveraged to enhance Agent-Based Modeling and simulate human-like social behavior with unprecedented degree of believability. Yet, existing tools do not support simulation-based evaluation of moderation strategies. We fill this gap by designing a LLM-powered simulator of OSN conversations enabling a parallel, counterfactual simulation where toxic behavior is influenced by moderation interventions, keeping all else equal. We conduct extensive experiments, unveiling the psychological realism of OSN agents, the emergence of social contagion phenomena and the superior effectiveness of personalized moderation strategies.

[530] MALinZero: Efficient Low-Dimensional Search for Mastering Complex Multi-Agent Planning

Sizhe Tang, Jiayu Chen, Tian Lan

Main category: cs.AI

TL;DR: MALinZero is a novel approach that enables efficient MCTS for multi-agent planning by projecting joint-action returns into low-dimensional space using contextual linear bandits, addressing the exponential growth of action spaces in multi-agent settings.

Details

Motivation: MCTS faces challenges in multi-agent planning due to exponentially growing combinatorial action spaces, making tree expansion and exploration/exploitation difficult. The motivation is to overcome this limitation by leveraging low-dimensional representations of joint-action returns.

Method: Projects joint-action returns into low-dimensional space using contextual linear bandit formulation. Solves the bandit problem with convex and smooth loss functions, derives linear Upper Confidence Bound for Trees (LinUCT), and uses (1-1/e)-approximation algorithm for joint action selection with sub-modular objective.

Result: State-of-the-art performance on multi-agent benchmarks including matrix games, SMAC, and SMACv2. Outperforms both model-based and model-free multi-agent reinforcement learning baselines with faster learning speed and better performance.

Conclusion: MALinZero successfully addresses the exponential complexity of multi-agent MCTS through low-dimensional representations and contextual linear bandits, enabling efficient exploration and exploitation in complex multi-agent planning problems.

Abstract: Monte Carlo Tree Search (MCTS), which leverages Upper Confidence Bound for Trees (UCTs) to balance exploration and exploitation through randomized sampling, is instrumental to solving complex planning problems. However, for multi-agent planning, MCTS is confronted with a large combinatorial action space that often grows exponentially with the number of agents. As a result, the branching factor of MCTS during tree expansion also increases exponentially, making it very difficult to efficiently explore and exploit during tree search. To this end, we propose MALinZero, a new approach to leverage low-dimensional representational structures on joint-action returns and enable efficient MCTS in complex multi-agent planning. Our solution can be viewed as projecting the joint-action returns into the low-dimensional space representable using a contextual linear bandit problem formulation. We solve the contextual linear bandit problem with convex and $μ$-smooth loss functions – in order to place more importance on better joint actions and mitigate potential representational limitations – and derive a linear Upper Confidence Bound applied to trees (LinUCT) to enable novel multi-agent exploration and exploitation in the low-dimensional space. We analyze the regret of MALinZero for low-dimensional reward functions and propose an $(1-\tfrac1e)$-approximation algorithm for the joint action selection by maximizing a sub-modular objective. MALinZero demonstrates state-of-the-art performance on multi-agent benchmarks such as matrix games, SMAC, and SMACv2, outperforming both model-based and model-free multi-agent reinforcement learning baselines with faster learning speed and better performance.

[531] Evaluating Implicit Biases in LLM Reasoning through Logic Grid Puzzles

Fatima Jahara, Mark Dredze, Sharon Levy

Main category: cs.AI

TL;DR: PRIME is a new evaluation framework using logic grid puzzles to detect subtle social biases in LLMs’ reasoning, showing models perform better when solutions align with stereotypes.

Details

Motivation: Current safety measures miss subtle biases in complex reasoning tasks, and existing benchmarks don't adequately test for implicit social biases during logical deduction.

Method: Uses logic grid puzzles with stereotypical, anti-stereotypical, and neutral variants generated from shared structures, enabling automatic verification and controlled comparisons across model families and puzzle complexities.

Result: Models consistently reason more accurately when puzzle solutions align with gender stereotypes, revealing implicit bias influence on deductive reasoning.

Conclusion: PRIME effectively diagnoses and quantifies social biases in LLMs’ reasoning processes, highlighting the need for better bias detection in critical fairness contexts.

Abstract: While recent safety guardrails effectively suppress overtly biased outputs, subtler forms of social bias emerge during complex logical reasoning tasks that evade current evaluation benchmarks. To fill this gap, we introduce a new evaluation framework, PRIME (Puzzle Reasoning for Implicit Biases in Model Evaluation), that uses logic grid puzzles to systematically probe the influence of social stereotypes on logical reasoning and decision making in LLMs. Our use of logic puzzles enables automatic generation and verification, as well as variability in complexity and biased settings. PRIME includes stereotypical, anti-stereotypical, and neutral puzzle variants generated from a shared puzzle structure, allowing for controlled and fine-grained comparisons. We evaluate multiple model families across puzzle sizes and test the effectiveness of prompt-based mitigation strategies. Focusing our experiments on gender stereotypes, our findings highlight that models consistently reason more accurately when solutions align with stereotypical associations. This demonstrates the significance of PRIME for diagnosing and quantifying social biases perpetuated in the deductive reasoning of LLMs, where fairness is critical.

[532] Chasing Consistency: Quantifying and Optimizing Human-Model Alignment in Chain-of-Thought Reasoning

Boxuan Wang, Zhuoyun Li, Xinmiao Huang, Xiaowei Huang, Yi Dong

Main category: cs.AI

TL;DR: The paper introduces the Alignment Score metric to evaluate reasoning consistency in LLMs and proposes SCOS method to optimize it, showing 29.84% improvement in 3-hop reasoning tasks.

Details

Motivation: To address the problem of reasoning inconsistency in Large Language Models by developing a systematic framework for evaluating and optimizing the semantic alignment between model-generated and human-written reasoning chains.

Method: Proposes the Alignment Score metric and Semantic Consistency Optimization Sampling (SCOS) method that samples reasoning chains with minimal alignment errors (logical disconnection, thematic shift, redundant reasoning, causal reversal).

Result: Empirical findings show 2-hop reasoning chains achieve highest Alignment Score. SCOS improves Alignment Scores by average 29.84% with longer reasoning chains like 3-hop tasks.

Conclusion: The framework successfully quantifies reasoning consistency and the SCOS method effectively optimizes semantic alignment, particularly benefiting longer reasoning chains where consistency issues are more pronounced.

Abstract: This paper presents a framework for evaluating and optimizing reasoning consistency in Large Language Models (LLMs) via a new metric, the Alignment Score, which quantifies the semantic alignment between model-generated reasoning chains and human-written reference chains in Chain-of-Thought (CoT) reasoning. Empirically, we find that 2-hop reasoning chains achieve the highest Alignment Score. To explain this phenomenon, we define four key error types: logical disconnection, thematic shift, redundant reasoning, and causal reversal, and show how each contributes to the degradation of the Alignment Score. Building on this analysis, we further propose Semantic Consistency Optimization Sampling (SCOS), a method that samples and favors chains with minimal alignment errors, significantly improving Alignment Scores by an average of 29.84% with longer reasoning chains, such as in 3-hop tasks.

Kaijie Xu, Fandi Meng, Clark Verbrugge, Simon Lucas

Main category: cs.AI

TL;DR: CSP4SDG is a probabilistic constraint-satisfaction framework for hidden-role inference in social deduction games that outperforms LLM-based approaches and can enhance LLM performance when used as a reasoning tool.

Details

Motivation: Social deduction games require players to infer hidden roles while dealing with deception, making accurate role identification crucial for both human and AI performance in these games.

Method: The framework maps game events and dialogue to four constraint classes (evidence, phenomena, assertions, hypotheses), using hard constraints to prune impossible assignments and weighted soft constraints to score remaining possibilities with information-gain weighting.

Result: Experiments on three public datasets show CSP4SDG outperforms LLM-based baselines in all inference scenarios and boosts LLM performance when used as an auxiliary reasoning tool.

Conclusion: Principled probabilistic reasoning with information theory provides a scalable alternative or complement to heavyweight neural models for social deduction games, offering fully interpretable real-time role probability updates.

Abstract: In Social Deduction Games (SDGs) such as Avalon, Mafia, and Werewolf, players conceal their identities and deliberately mislead others, making hidden-role inference a central and demanding task. Accurate role identification, which forms the basis of an agent’s belief state, is therefore the keystone for both human and AI performance. We introduce CSP4SDG, a probabilistic, constraint-satisfaction framework that analyses gameplay objectively. Game events and dialogue are mapped to four linguistically-agnostic constraint classes-evidence, phenomena, assertions, and hypotheses. Hard constraints prune impossible role assignments, while weighted soft constraints score the remainder; information-gain weighting links each hypothesis to its expected value under entropy reduction, and a simple closed-form scoring rule guarantees that truthful assertions converge to classical hard logic with minimum error. The resulting posterior over roles is fully interpretable and updates in real time. Experiments on three public datasets show that CSP4SDG (i) outperforms LLM-based baselines in every inference scenario, and (ii) boosts LLMs when supplied as an auxiliary “reasoning tool.” Our study validates that principled probabilistic reasoning with information theory is a scalable alternative-or complement-to heavy-weight neural models for SDGs.

[534] Dataforge: A Data Agent Platform for Autonomous Data Engineering

Xinyuan Wang, Yanjie Fu

Main category: cs.AI

TL;DR: Data Agent is an autonomous system that automatically transforms raw tabular data into AI-ready format using LLM reasoning and validation, eliminating manual data preparation work.

Details

Motivation: The growing demand for AI applications in materials discovery, molecular modeling, and climate science requires extensive data preparation, which is labor-intensive and expertise-dependent.

Method: Leverages large language model reasoning with grounded validation, performs automatic data cleaning, hierarchical routing, and feature-level optimization through dual feedback loops.

Result: Successfully demonstrates the first practical realization of an autonomous Data Agent that can transform raw data into better data without human supervision.

Conclusion: Data Agent provides a fully autonomous solution for data preparation that is automatic, safe, and non-expert friendly, ensuring end-to-end reliability in transforming raw data into AI-ready format.

Abstract: The growing demand for AI applications in fields such as materials discovery, molecular modeling, and climate science has made data preparation an important but labor-intensive step. Raw data from diverse sources must be cleaned, normalized, and transformed to become AI-ready, while effective feature transformation and selection are essential for efficient training and inference. To address the challenges of scalability and expertise dependence, we present Data Agent, a fully autonomous system specialized for tabular data. Leveraging large language model (LLM) reasoning and grounded validation, Data Agent automatically performs data cleaning, hierarchical routing, and feature-level optimization through dual feedback loops. It embodies three core principles: automatic, safe, and non-expert friendly, which ensure end-to-end reliability without human supervision. This demo showcases the first practical realization of an autonomous Data Agent, illustrating how raw data can be transformed “From Data to Better Data.”

[535] Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads

Jingwei Ni, Ekaterina Fadeeva, Tianyi Wu, Mubashara Akhtar, Jiaheng Zhang, Elliott Ash, Markus Leippold, Timothy Baldwin, See-Kiong Ng, Artem Shelmanov, Mrinmaya Sachan

Main category: cs.AI

TL;DR: UHeads: lightweight uncertainty quantification heads that use LLM internal states to verify reasoning steps, matching performance of much larger verification models while being 810x smaller.

Details

Motivation: Existing reasoning verification methods are computationally expensive, domain-specific, or require extensive annotations, creating a need for lightweight, generalizable alternatives.

Method: Train transformer-based uncertainty heads (UHeads) on frozen LLM internal states to estimate reasoning step uncertainty, using automatic labels from larger LLMs or self-supervised learning.

Result: UHeads (<10M parameters) match or surpass Process Reward Models (810x larger) across math, planning, and QA domains, showing LLM internal states encode reliable uncertainty signals.

Conclusion: LLM internal states effectively encode uncertainty for reasoning verification, enabling scalable introspective models without expensive verification systems.

Abstract: Solving complex tasks usually requires LLMs to generate long multi-step reasoning chains. Previous work has shown that verifying the correctness of individual reasoning steps can further improve the performance and efficiency of LLMs on such tasks and enhance solution interpretability. However, existing verification approaches, such as Process Reward Models (PRMs), are either computationally expensive, limited to specific domains, or require large-scale human or model-generated annotations. Thus, we propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. We train transformer-based uncertainty quantification heads (UHeads) that use the internal states of a frozen LLM to estimate the uncertainty of its reasoning steps during generation. The approach is fully automatic: target labels are generated either by another larger LLM (e.g., DeepSeek R1) or in a self-supervised manner by the original model itself. UHeads are both effective and lightweight, containing less than 10M parameters. Across multiple domains, including mathematics, planning, and general knowledge question answering, they match or even surpass the performance of PRMs that are up to 810x larger. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification, offering a promising direction toward scalable and generalizable introspective LLMs.

[536] Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B

Sen Xu, Yi Zhou, Wei Wang, Jixin Min, Zhibin Yin, Yingwei Dai, Shixi Liu, Lianyu Pang, Yirong Chen, Junlin Zhang

Main category: cs.AI

TL;DR: VibeThinker-1.5B is a 1.5B-parameter model that challenges the scaling paradigm by achieving reasoning capabilities comparable to much larger models using the Spectrum-to-Signal Principle, with only $7,800 training cost.

Details

Motivation: To challenge the consensus that small models inherently lack robust reasoning capabilities and demonstrate that parameter scaling is not the only path to enhanced AI reasoning.

Method: Spectrum-to-Signal Principle (SSP) with Two-Stage Diversity-Exploring Distillation (SFT) to generate broad solution spectrum, followed by MaxEnt-Guided Policy Optimization (RL) to amplify correct signals.

Result: Outperforms 400x larger DeepSeek R1 on math benchmarks (AIME24: 80.3 vs 79.8, AIME25: 74.4 vs 70.0, HMMT25: 50.4 vs 41.7) and scores 51.1 on LiveCodeBench V6, beating Magistral Medium’s 50.3.

Conclusion: Small models can achieve reasoning capabilities comparable to large models, drastically reducing training and inference costs and democratizing advanced AI research.

Abstract: Challenging the prevailing consensus that small models inherently lack robust reasoning, this report introduces VibeThinker-1.5B, a 1.5B-parameter dense model developed via our Spectrum-to-Signal Principle (SSP). This challenges the prevailing approach of scaling model parameters to enhance capabilities, as seen in models like DeepSeek R1 (671B) and Kimi k2 (>1T). The SSP framework first employs a Two-Stage Diversity-Exploring Distillation (SFT) to generate a broad spectrum of solutions, followed by MaxEnt-Guided Policy Optimization (RL) to amplify the correct signal. With a total training cost of only $7,800, VibeThinker-1.5B demonstrates superior reasoning capabilities compared to closed-source models like Magistral Medium and Claude Opus 4, and performs on par with open-source models like GPT OSS-20B Medium. Remarkably, it surpasses the 400x larger DeepSeek R1 on three math benchmarks: AIME24 (80.3 vs. 79.8), AIME25 (74.4 vs. 70.0), and HMMT25 (50.4 vs. 41.7). This is a substantial improvement over its base model (6.7, 4.3, and 0.6, respectively). On LiveCodeBench V6, it scores 51.1, outperforming Magistral Medium’s 50.3 and its base model’s 0.0. These findings demonstrate that small models can achieve reasoning capabilities comparable to large models, drastically reducing training and inference costs and thereby democratizing advanced AI research.

[537] ROAR: Robust Accident Recognition and Anticipation for Autonomous Driving

Xingcheng Liu, Yanchen Guan, Haicheng Liao, Zhengbing He, Zhenning Li

Main category: cs.AI

TL;DR: ROAR introduces a novel accident detection and prediction method using Discrete Wavelet Transform, object-aware modules, and dynamic focal loss to handle real-world challenges like sensor failures, noise, and class imbalance.

Details

Motivation: Existing accident anticipation methods assume ideal conditions and overlook practical challenges like sensor failures, environmental disturbances, data imperfections, and variability in driver behavior across vehicle types.

Method: Combines Discrete Wavelet Transform (DWT) for feature extraction from noisy data, self-adaptive object-aware module for focusing on high-risk vehicles and modeling spatial-temporal relationships, and dynamic focal loss to address class imbalance.

Result: Outperforms existing baselines on three datasets (DAD, CCD, A3D) in key metrics including Average Precision (AP) and mean Time to Accident (mTTA), demonstrating robustness in handling sensor degradation, noise, and imbalanced data.

Conclusion: ROAR provides a reliable and accurate solution for accident anticipation in complex real-world traffic environments, addressing practical challenges that previous methods overlooked.

Abstract: Accurate accident anticipation is essential for enhancing the safety of autonomous vehicles (AVs). However, existing methods often assume ideal conditions, overlooking challenges such as sensor failures, environmental disturbances, and data imperfections, which can significantly degrade prediction accuracy. Additionally, previous models have not adequately addressed the considerable variability in driver behavior and accident rates across different vehicle types. To overcome these limitations, this study introduces ROAR, a novel approach for accident detection and prediction. ROAR combines Discrete Wavelet Transform (DWT), a self adaptive object aware module, and dynamic focal loss to tackle these challenges. The DWT effectively extracts features from noisy and incomplete data, while the object aware module improves accident prediction by focusing on high-risk vehicles and modeling the spatial temporal relationships among traffic agents. Moreover, dynamic focal loss mitigates the impact of class imbalance between positive and negative samples. Evaluated on three widely used datasets, Dashcam Accident Dataset (DAD), Car Crash Dataset (CCD), and AnAn Accident Detection (A3D), our model consistently outperforms existing baselines in key metrics such as Average Precision (AP) and mean Time to Accident (mTTA). These results demonstrate the model’s robustness in real-world conditions, particularly in handling sensor degradation, environmental noise, and imbalanced data distributions. This work offers a promising solution for reliable and accurate accident anticipation in complex traffic environments.

[538] GAIA: A General Agency Interaction Architecture for LLM-Human B2B Negotiation & Screening

Siming Zhao, Qi Li

Main category: cs.AI

TL;DR: GAIA is a governance-first framework for LLM-human collaboration in B2B negotiation that ensures safety through information-gated progression, dual feedback integration, and authorization boundaries.

Details

Motivation: Current LLM negotiation systems lack practical governance mechanisms needed for high-stakes B2B settings, including staged information gathering, authorization boundaries, and systematic feedback integration.

Method: GAIA defines three roles (Principal, Delegate, Counterparty) with optional Critic, using three mechanisms: information-gated progression separating screening from negotiation, dual feedback integration combining AI critique with human corrections, and authorization boundaries with escalation paths.

Result: The framework provides four key contributions: formal governance framework with safety invariants, information-gated progression via task-completeness tracking, dual feedback integration, and hybrid validation combining automated metrics with human judgment.

Conclusion: GAIA bridges theory and practice to offer reproducible specifications for safe, efficient, and accountable AI delegation applicable across procurement, real estate, and staffing workflows.

Abstract: Organizations are increasingly exploring delegation of screening and negotiation tasks to AI systems, yet deployment in high-stakes B2B settings is constrained by governance: preventing unauthorized commitments, ensuring sufficient information before bargaining, and maintaining effective human oversight and auditability. Prior work on large language model negotiation largely emphasizes autonomous bargaining between agents and omits practical needs such as staged information gathering, explicit authorization boundaries, and systematic feedback integration. We propose GAIA, a governance-first framework for LLM-human agency in B2B negotiation and screening. GAIA defines three essential roles - Principal (human), Delegate (LLM agent), and Counterparty - with an optional Critic to enhance performance, and organizes interactions through three mechanisms: information-gated progression that separates screening from negotiation; dual feedback integration that combines AI critique with lightweight human corrections; and authorization boundaries with explicit escalation paths. Our contributions are fourfold: (1) a formal governance framework with three coordinated mechanisms and four safety invariants for delegation with bounded authorization; (2) information-gated progression via task-completeness tracking (TCI) and explicit state transitions that separate screening from commitment; (3) dual feedback integration that blends Critic suggestions with human oversight through parallel learning channels; and (4) a hybrid validation blueprint that combines automated protocol metrics with human judgment of outcomes and safety. By bridging theory and practice, GAIA offers a reproducible specification for safe, efficient, and accountable AI delegation that can be instantiated across procurement, real estate, and staffing workflows.

[539] Synthetic Data-Driven Prompt Tuning for Financial QA over Tables and Documents

Yaoning Yu, Kaimin Chang, Ye Yu, Kai Wei, Haojing Luo, Haohan Wang

Main category: cs.AI

TL;DR: A self-improving prompt framework for financial document analysis that uses synthetic data generation and verification to iteratively refine prompts without external labels, achieving better accuracy and robustness than standard methods.

Details

Motivation: Current prompt tuning methods for financial document analysis are limited by fixed datasets or require costly manual labeling, making them unable to adapt to new question types or document structures.

Method: Closed-loop framework combining synthetic data generator, verifiers, and prompt optimizer. Generates synthetic financial tables and documents, verifies correctness, and incrementally refines prompts based on identified weaknesses.

Result: Achieves higher performance in both accuracy and robustness than standard prompt methods on the DocMath-Eval benchmark, demonstrating effectiveness without needing external labels.

Conclusion: Incorporating synthetic data generation into prompt learning is valuable for financial applications, enabling continuous improvement and adaptation to diverse document structures and question types.

Abstract: Financial documents like earning reports or balance sheets often involve long tables and multi-page reports. Large language models have become a new tool to help numerical reasoning and understanding these documents. However, prompt quality can have a major effect on how well LLMs perform these financial reasoning tasks. Most current methods tune prompts on fixed datasets of financial text or tabular data, which limits their ability to adapt to new question types or document structures, or they involve costly and manually labeled/curated dataset to help build the prompts. We introduce a self-improving prompt framework driven by data-augmented optimization. In this closed-loop process, we generate synthetic financial tables and document excerpts, verify their correctness and robustness, and then update the prompt based on the results. Specifically, our framework combines a synthetic data generator with verifiers and a prompt optimizer, where the generator produces new examples that exposes weaknesses in the current prompt, the verifiers check the validity and robustness of the produced examples, and the optimizer incrementally refines the prompt in response. By iterating these steps in a feedback cycle, our method steadily improves prompt accuracy on financial reasoning tasks without needing external labels. Evaluation on DocMath-Eval benchmark demonstrates that our system achieves higher performance in both accuracy and robustness than standard prompt methods, underscoring the value of incorporating synthetic data generation into prompt learning for financial applications.

[540] Secu-Table: a Comprehensive security table dataset for evaluating semantic table interpretation systems

Azanzi Jiomekong, Jean Bikim, Patricia Negoue, Joyce Chin

Main category: cs.AI

TL;DR: Introduces Secu-Table dataset with 1500+ tables and 15k+ entities from security data sources (CVE/CWE) for evaluating semantic table interpretation systems in cybersecurity domain.

Details

Motivation: Lack of publicly available tabular datasets for evaluating semantic table interpretation systems in the security domain, particularly for LLM-based approaches.

Method: Constructed dataset using security data from CVE and CWE sources, annotated with Wikidata and SEPSES CSKG knowledge graphs, and released publicly with all code.

Result: Created comprehensive security domain dataset for STI evaluation, with preliminary baseline evaluation using Falcon3-7b-instruct, Mistral-7B-Instruct, and GPT-4o mini models.

Conclusion: Secu-Table dataset addresses the gap in security domain tabular data availability and supports evaluation of STI systems in the SemTab challenge context.

Abstract: Evaluating semantic tables interpretation (STI) systems, (particularly, those based on Large Language Models- LLMs) especially in domain-specific contexts such as the security domain, depends heavily on the dataset. However, in the security domain, tabular datasets for state-of-the-art are not publicly available. In this paper, we introduce Secu-Table dataset, composed of more than 1500 tables with more than 15k entities constructed using security data extracted from Common Vulnerabilities and Exposures (CVE) and Common Weakness Enumeration (CWE) data sources and annotated using Wikidata and the SEmantic Processing of Security Event Streams CyberSecurity Knowledge Graph (SEPSES CSKG). Along with the dataset, all the code is publicly released. This dataset is made available to the research community in the context of the SemTab challenge on Tabular to Knowledge Graph Matching. This challenge aims to evaluate the performance of several STI based on open source LLMs. Preliminary evaluation, serving as baseline, was conducted using Falcon3-7b-instruct and Mistral-7B-Instruct, two open source LLMs and GPT-4o mini one closed source LLM.

[541] ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning

MD Thamed Bin Zaman Chowdhury, Moazzem Hossain

Main category: cs.AI

TL;DR: ALIGN is a vision-language framework that uses multimodal AI to infer accident locations from Bangla news reports by combining text analysis with map-based verification, outperforming traditional geocoding methods.

Details

Motivation: Low- and middle-income countries lack reliable road accident location data due to poor performance of existing geocoding tools in multilingual, unstructured news environments with incomplete place descriptions and mixed scripts.

Method: Multi-stage pipeline integrating large language and vision-language models that performs OCR, linguistic reasoning, and map-level verification through grid-based spatial scanning, evaluating predicted locations against contextual and visual evidence.

Result: ALIGN demonstrates consistent improvements over traditional geoparsing methods, accurately identifying district and sub-district-level crash sites in Bangla-language news data.

Conclusion: The framework establishes a high-accuracy foundation for automated crash mapping in data-scarce regions, supporting evidence-driven road safety policymaking and broader integration of multimodal AI in transportation analytics.

Abstract: Reliable geospatial information on road accidents is vital for safety analysis and infrastructure planning, yet most low- and middle-income countries continue to face a critical shortage of accurate, location-specific crash data. Existing text-based geocoding tools perform poorly in multilingual and unstructured news environments, where incomplete place descriptions and mixed Bangla-English scripts obscure spatial context. To address these limitations, this study introduces ALIGN (Accident Location Inference through Geo-Spatial Neural Reasoning)- a vision-language framework that emulates human spatial reasoning to infer accident coordinates directly from textual and map-based cues. ALIGN integrates large language and vision-language models within a multi-stage pipeline that performs optical character recognition, linguistic reasoning, and map-level verification through grid-based spatial scanning. The framework systematically evaluates each predicted location against contextual and visual evidence, ensuring interpretable, fine-grained geolocation outcomes without requiring model retraining. Applied to Bangla-language news data, ALIGN demonstrates consistent improvements over traditional geoparsing methods, accurately identifying district and sub-district-level crash sites. Beyond its technical contribution, the framework establishes a high accuracy foundation for automated crash mapping in data-scarce regions, supporting evidence-driven road-safety policymaking and the broader integration of multimodal artificial intelligence in transportation analytics. The code for this paper is open-source and available at: https://github.com/Thamed-Chowdhury/ALIGN

[542] LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation

Liya Zhu, Peizhuang Cong, Aowei Ji, Wenya Wu, Jiani Hou, Chunjie Wu, Xiang Gao, Jingkai Liu, Zhou Huan, Xuelei Sun, Yang Yang, Jianpeng Jiao, Liang Hu, Xinjie Chen, Jiashuo Liu, Jingzhe Ding, Tong Yang, Zaiyuan Wang, Ge Zhang, Wenhao Huang

Main category: cs.AI

TL;DR: LPFQA is a new benchmark for evaluating LLMs using authentic professional forum data across 20 fields, addressing gaps in current evaluation methods by focusing on long-tail knowledge and real-world complexity.

Details

Motivation: Current LLM benchmarks are insufficient for evaluating true capabilities as they use simplified tasks and artificial scenarios, missing long-tail knowledge and real-world application complexities.

Method: Created LPFQA benchmark using authentic professional forums across 20 academic/industrial fields (502 tasks), with four innovations: fine-grained evaluation dimensions, hierarchical difficulty structure, realistic scenario modeling, and interdisciplinary knowledge integration.

Result: Evaluation of 12 mainstream LLMs revealed significant performance disparities, particularly in specialized reasoning tasks, demonstrating LPFQA’s discriminative power.

Conclusion: LPFQA provides a robust, authentic, and discriminative benchmark that advances LLM evaluation and guides future model development by better capturing real-world professional expertise.

Abstract: Large Language Models (LLMs) have made rapid progress in reasoning, question answering, and professional applications; however, their true capabilities remain difficult to evaluate using existing benchmarks. Current datasets often focus on simplified tasks or artificial scenarios, overlooking long-tail knowledge and the complexities of real-world applications. To bridge this gap, we propose LPFQA, a long-tail knowledge-based benchmark derived from authentic professional forums across 20 academic and industrial fields, covering 502 tasks grounded in practical expertise. LPFQA introduces four key innovations: fine-grained evaluation dimensions that target knowledge depth, reasoning, terminology comprehension, and contextual analysis; a hierarchical difficulty structure that ensures semantic clarity and unique answers; authentic professional scenario modeling with realistic user personas; and interdisciplinary knowledge integration across diverse domains. We evaluated 12 mainstream LLMs on LPFQA and observed significant performance disparities, especially in specialized reasoning tasks. LPFQA provides a robust, authentic, and discriminative benchmark for advancing LLM evaluation and guiding future model development.

[543] What Makes Reasoning Invalid: Echo Reflection Mitigation for Large Language Models

Chen He, Xun Jiang, Lei Wang, Hao Yang, Chong Peng, Peng Yan, Fumin Shen, Xing Xu

Main category: cs.AI

TL;DR: LLMs fail at genuine reflection in complex domains, mechanically repeating earlier reasoning instead of generating new insights. The proposed AEPO method uses information filtration and adaptive entropy optimization to improve reflection quality and achieve SOTA performance.

Details

Motivation: LLMs show poor reflection capabilities in complex domain-specific tasks, exhibiting 'Echo Reflection' where they mechanically reiterate earlier reasoning without generating novel insights, due to uncontrollable information flow and insufficient knowledge exploration.

Method: Proposed Adaptive Entropy Policy Optimization (AEPO) with two components: Reflection-aware Information Filtration to control cognitive information flow, and Adaptive-Entropy Optimization to balance exploration and exploitation across reasoning stages.

Result: AEPO consistently achieves state-of-the-art performance over mainstream reinforcement learning baselines across diverse benchmarks.

Conclusion: The AEPO framework effectively addresses the ‘Echo Reflection’ problem in LLMs by controlling information flow and promoting reflective diversity, leading to improved reasoning performance in complex domain-specific tasks.

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of reasoning tasks. Recent methods have further improved LLM performance in complex mathematical reasoning. However, when extending these methods beyond the domain of mathematical reasoning to tasks involving complex domain-specific knowledge, we observe a consistent failure of LLMs to generate novel insights during the reflection stage. Instead of conducting genuine cognitive refinement, the model tends to mechanically reiterate earlier reasoning steps without introducing new information or perspectives, a phenomenon referred to as “Echo Reflection”. We attribute this behavior to two key defects: (1) Uncontrollable information flow during response generation, which allows premature intermediate thoughts to propagate unchecked and distort final decisions; (2) Insufficient exploration of internal knowledge during reflection, leading to repeating earlier findings rather than generating new cognitive insights. Building on these findings, we proposed a novel reinforcement learning method termed Adaptive Entropy Policy Optimization (AEPO). Specifically, the AEPO framework consists of two major components: (1) Reflection-aware Information Filtration, which quantifies the cognitive information flow and prevents the final answer from being affected by earlier bad cognitive information; (2) Adaptive-Entropy Optimization, which dynamically balances exploration and exploitation across different reasoning stages, promoting both reflective diversity and answer correctness. Extensive experiments demonstrate that AEPO consistently achieves state-of-the-art performance over mainstream reinforcement learning baselines across diverse benchmarks.

[544] Efficient LLM Safety Evaluation through Multi-Agent Debate

Dachuan Lin, Guobin Shen, Zihao Yang, Tianrong Liu, Dongcheng Zhao, Yi Zeng

Main category: cs.AI

TL;DR: Proposes a cost-efficient multi-agent judging framework using Small Language Models (SLMs) through structured debates, achieving GPT-4o-level agreement on safety judgments while reducing costs.

Details

Motivation: High cost of frontier models limits scalability of LLM-as-a-Judge frameworks for safety evaluation, necessitating more cost-effective alternatives.

Method: Multi-agent framework with SLMs conducting structured debates among critic, defender, and judge agents; uses HAJailBench with 12,000 human-annotated jailbreak interactions across diverse attacks.

Result: SLM-based framework achieves comparable agreement to GPT-4o judges on HAJailBench while substantially reducing inference cost; three debate rounds provide optimal accuracy-efficiency balance.

Conclusion: Structured, value-aligned debate enables SLMs to capture semantic nuances of jailbreak attacks, and HAJailBench provides reliable foundation for scalable LLM safety evaluation.

Abstract: Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-Judge frameworks, but the high cost of frontier models limits scalability. We propose a cost-efficient multi-agent judging framework that employs Small Language Models (SLMs) through structured debates among critic, defender, and judge agents. To rigorously assess safety judgments, we construct HAJailBench, a large-scale human-annotated jailbreak benchmark comprising 12,000 adversarial interactions across diverse attack methods and target models. The dataset provides fine-grained, expert-labeled ground truth for evaluating both safety robustness and judge reliability. Our SLM-based framework achieves agreement comparable to GPT-4o judges on HAJailBench while substantially reducing inference cost. Ablation results show that three rounds of debate yield the optimal balance between accuracy and efficiency. These findings demonstrate that structured, value-aligned debate enables SLMs to capture semantic nuances of jailbreak attacks and that HAJailBench offers a reliable foundation for scalable LLM safety evaluation.

[545] SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization

Zhi Zheng, Wee Sun Lee

Main category: cs.AI

TL;DR: SofT-GRPO is a novel policy optimization algorithm that enables reinforcement learning for soft-thinking LLM reasoning, overcoming previous limitations and achieving performance improvements over discrete-token GRPO.

Details

Motivation: Soft-thinking LLM reasoning can outperform discrete-token CoT reasoning, but previous attempts to combine soft-thinking with RL (like GRPO) underperformed due to challenges in injecting stochasticity and updating soft-thinking policies.

Method: SofT-GRPO injects Gumbel noise into logits, uses Gumbel-Softmax to keep tokens within the pre-trained embedding space, and leverages the reparameterization trick in policy gradient.

Result: Experiments on LLMs from 1.5B to 7B parameters show SofT-GRPO enables soft-thinking LLMs to slightly outperform discrete-token GRPO on Pass@1 (+0.13% average accuracy) and substantially improve on Pass@32 (+2.19% average accuracy).

Conclusion: SofT-GRPO successfully unlocks the potential of soft-thinking reasoning by enabling effective RL-based policy optimization, demonstrating its research and application value.

Abstract: The soft-thinking paradigm for Large Language Model (LLM) reasoning can outperform the conventional discrete-token Chain-of-Thought (CoT) reasoning in some scenarios, underscoring its research and application value. However, while the discrete-token CoT reasoning pattern can be reinforced through policy optimization algorithms such as group relative policy optimization (GRPO), extending the soft-thinking pattern with Reinforcement Learning (RL) remains challenging. This difficulty stems from the complexities of injecting stochasticity into soft-thinking tokens and updating soft-thinking policies accordingly. As a result, previous attempts to combine soft-thinking with GRPO typically underperform their discrete-token GRPO counterparts. To fully unlock the potential of soft-thinking, this paper presents a novel policy optimization algorithm, SofT-GRPO, to reinforce LLMs under the soft-thinking reasoning pattern. SofT-GRPO injects the Gumbel noise into logits, employs the Gumbel-Softmax technique to avoid soft-thinking tokens outside the pre-trained embedding space, and leverages the reparameterization trick in policy gradient. We conduct experiments across base LLMs ranging from 1.5B to 7B parameters, and results demonstrate that SofT-GRPO enables soft-thinking LLMs to slightly outperform discrete-token GRPO on Pass@1 (+0.13% on average accuracy), while exhibiting a substantial uplift on Pass@32 (+2.19% on average accuracy). Codes and weights are available on https://github.com/zz1358m/SofT-GRPO-master

[546] AUTO-Explorer: Automated Data Collection for GUI Agent

Xiangwu Guo, Difei Gao, Mike Zheng Shou

Main category: cs.AI

TL;DR: Auto-Explorer is an automated GUI data collection method that uses exploration mechanisms to gather GUI data with minimal annotation costs, addressing limitations of existing methods that rely on Common Crawl data.

Details

Motivation: Existing GUI data collection methods are limited to webpages from Common Crawl and cannot handle desktop software or new websites, creating challenges for personalized scenarios requiring rapid adaptation to new software.

Method: Proposed Auto-Explorer with exploration mechanisms that autonomously parse and explore GUI environments, plus UIXplore benchmark to assess exploration quality and fine-tune MLLMs.

Result: Experiments show superior performance of Auto-Explorer, demonstrating it can quickly enhance MLLM capabilities in explored software.

Conclusion: Auto-Explorer provides an effective automated solution for GUI data collection that overcomes limitations of existing methods and enables rapid adaptation to new software environments.

Abstract: Recent advancements in GUI agents have significantly expanded their ability to interpret natural language commands to manage software interfaces. However, acquiring GUI data remains a significant challenge. Existing methods often involve designing automated agents that browse URLs from the Common Crawl, using webpage HTML to collect screenshots and corresponding annotations, including the names and bounding boxes of UI elements. However, this method is difficult to apply to desktop software or some newly launched websites not included in the Common Crawl. While we expect the model to possess strong generalization capabilities to handle this, it is still crucial for personalized scenarios that require rapid and perfect adaptation to new software or websites. To address this, we propose an automated data collection method with minimal annotation costs, named Auto-Explorer. It incorporates a simple yet effective exploration mechanism that autonomously parses and explores GUI environments, gathering data efficiently. Additionally, to assess the quality of exploration, we have developed the UIXplore benchmark. This benchmark creates environments for explorer agents to discover and save software states. Using the data gathered, we fine-tune a multimodal large language model (MLLM) and establish a GUI element grounding testing set to evaluate the effectiveness of the exploration strategies. Our experiments demonstrate the superior performance of Auto-Explorer, showing that our method can quickly enhance the capabilities of an MLLM in explored software.

[547] MONICA: Real-Time Monitoring and Calibration of Chain-of-Thought Sycophancy in Large Reasoning Models

Jingyu Hu, Shu Yang, Xilin Gong, Hongming Wang, Weiru Liu, Di Wang

Main category: cs.AI

TL;DR: MONICA is a framework that monitors and mitigates sycophantic behavior in Large Reasoning Models during inference by tracking reasoning steps in real-time and dynamically suppressing sycophantic drift.

Details

Motivation: LRMs exhibit sycophantic behavior where they agree with users' incorrect beliefs instead of maintaining independent reasoning, which undermines reliability and poses societal risks. Current methods only focus on final answers without understanding how sycophancy develops during reasoning processes.

Method: Propose MONICA, a Monitor-guided Calibration framework with a sycophantic monitor that provides real-time monitoring of sycophantic drift scores during response generation, and a calibrator that dynamically suppresses sycophantic behavior when scores exceed thresholds.

Result: Extensive experiments across 12 datasets and 3 LRMs show that MONICA effectively reduces sycophantic behavior in both intermediate reasoning steps and final answers, yielding robust performance improvements.

Conclusion: MONICA successfully addresses the limitation of current methods by monitoring and mitigating sycophancy during model inference at the reasoning step level, without requiring complete answer generation.

Abstract: Large Reasoning Models (LRMs) suffer from sycophantic behavior, where models tend to agree with users’ incorrect beliefs and follow misinformation rather than maintain independent reasoning. This behavior undermines model reliability and poses societal risks. Mitigating LRM sycophancy requires monitoring how this sycophancy emerges during the reasoning trajectory; however, current methods mainly focus on judging based on final answers and correcting them, without understanding how sycophancy develops during reasoning processes. To address this limitation, we propose MONICA, a novel Monitor-guided Calibration framework that monitors and mitigates sycophancy during model inference at the level of reasoning steps, without requiring the model to finish generating its complete answer. MONICA integrates a sycophantic monitor that provides real-time monitoring of sycophantic drift scores during response generation with a calibrator that dynamically suppresses sycophantic behavior when scores exceed predefined thresholds. Extensive experiments across 12 datasets and 3 LRMs demonstrate that our method effectively reduces sycophantic behavior in both intermediate reasoning steps and final answers, yielding robust performance improvements.

[548] Optimizing Chain-of-Thought Confidence via Topological and Dirichlet Risk Analysis

Abhishek More, Anthony Zhang, Nicole Bonilla, Ashvik Vivekan, Kevin Zhu, Parham Sharafoleslami, Maheep Chaudhary

Main category: cs.AI

TL;DR: EDTR is a novel decoding strategy that combines topological analysis with Dirichlet-based uncertainty quantification to measure LLM confidence across multiple reasoning paths in Chain-of-Thought prompting.

Details

Motivation: Existing methods for confidence estimation in LLMs suffer from poor calibration and severe overconfidence on incorrect predictions, which hinders safe deployment of these models.

Method: EDTR treats each Chain-of-Thought as a vector in high-dimensional space and extracts eight topological risk features capturing geometric structure of reasoning distributions, using tighter clusters for higher confidence and dispersed paths for uncertainty.

Result: EDTR achieves 41% better calibration than competing methods with average ECE of 0.287 and best composite score of 0.672, with perfect accuracy on AIME and exceptional calibration on GSM8K (ECE 0.107).

Conclusion: The work provides a geometric framework for understanding and quantifying uncertainty in multi-step LLM reasoning, enabling more reliable deployment where calibrated confidence estimates are essential.

Abstract: Chain-of-thought (CoT) prompting enables Large Language Models to solve complex problems, but deploying these models safely requires reliable confidence estimates, a capability where existing methods suffer from poor calibration and severe overconfidence on incorrect predictions. We propose Enhanced Dirichlet and Topology Risk (EDTR), a novel decoding strategy that combines topological analysis with Dirichlet-based uncertainty quantification to measure LLM confidence across multiple reasoning paths. EDTR treats each CoT as a vector in high-dimensional space and extracts eight topological risk features capturing the geometric structure of reasoning distributions: tighter, more coherent clusters indicate higher confidence while dispersed, inconsistent paths signal uncertainty. We evaluate EDTR against three state-of-the-art calibration methods across four diverse reasoning benchmarks spanning olympiad-level mathematics (AIME), grade school math (GSM8K), commonsense reasoning, and stock price prediction \cite{zhang2025aime, cobbe2021training, talmor-etal-2019-commonsenseqa, yahoo_finance}. EDTR achieves 41% better calibration than competing methods with an average ECE of 0.287 and the best overall composite score of 0.672, while notably achieving perfect accuracy on AIME and exceptional calibration on GSM8K with an ECE of 0.107, domains where baselines exhibit severe overconfidence. Our work provides a geometric framework for understanding and quantifying uncertainty in multi-step LLM reasoning, enabling more reliable deployment where calibrated confidence estimates are essential.

[549] Brain-Inspired Planning for Better Generalization in Reinforcement Learning

Mingde “Harry” Zhao

Main category: cs.AI

TL;DR: This thesis enhances RL agents’ zero-shot generalization by incorporating human-like reasoning behaviors including spatial abstraction, task decomposition, and feasibility evaluation to prevent delusional planning.

Details

Motivation: Existing RL systems struggle with poor generalization across different environments from training conditions, needing human-inspired reasoning for systematic generalization.

Method: 1) Top-down attention for spatial abstraction; 2) Skipper framework for automatic task decomposition; 3) Feasibility evaluator to reject hallucinated infeasible targets.

Result: Significant improvements in systematic generalization outside training tasks, robustness against distributional shifts, and better performance in long-term compositional planning.

Conclusion: Human-inspired reasoning mechanisms (spatial abstraction, task decomposition, feasibility evaluation) effectively enhance RL agents’ zero-shot generalization, with future work needed for general task abstraction and abstract planning.

Abstract: Existing Reinforcement Learning (RL) systems encounter significant challenges when applied to real-world scenarios, primarily due to poor generalization across environments that differ from their training conditions. This thesis explores the direction of enhancing agents’ zero-shot systematic generalization abilities by granting RL agents reasoning behaviors that are found to help systematic generalization in the human brain. Inspired by human conscious planning behaviors, we first introduced a top-down attention mechanism, which allows a decision-time planning agent to dynamically focus its reasoning on the most relevant aspects of the environmental state given its instantaneous intentions, a process we call “spatial abstraction”. This approach significantly improves systematic generalization outside the training tasks. Subsequently, building on spatial abstraction, we developed the Skipper framework to automatically decompose complex tasks into simpler, more manageable sub-tasks. Skipper provides robustness against distributional shifts and efficacy in long-term, compositional planning by focusing on pertinent spatial and temporal elements of the environment. Finally, we identified a common failure mode and safety risk in planning agents that rely on generative models to generate state targets during planning. It is revealed that most agents blindly trust the targets they hallucinate, resulting in delusional planning behaviors. Inspired by how the human brain rejects delusional intentions, we propose learning a feasibility evaluator to enable rejecting hallucinated infeasible targets, which led to significant performance improvements in various kinds of planning agents. Finally, we suggest directions for future research, aimed at achieving general task abstraction and fully enabling abstract planning.

[550] GHOST: Solving the Traveling Salesman Problem on Graphs of Convex Sets

Jingtao Tang, Hang Ma

Main category: cs.AI

TL;DR: GHOST is a hierarchical framework that optimally solves GCS-TSP by combining combinatorial tour search with convex trajectory optimization, using novel lower bounds for efficient search.

Details

Motivation: Classical TSP methods cannot handle GCS-TSP where edge costs depend on trajectory selection through convex regions, requiring new approaches for trajectory planning problems.

Method: GHOST uses hierarchical best-first search with abstract-path-unfolding algorithm to compute admissible lower bounds, combining tour search and trajectory optimization while minimizing convex optimization calls.

Result: GHOST is orders-of-magnitude faster than mixed-integer convex programming baselines and handles complex trajectory planning with high-order continuity constraints and incomplete GCS.

Conclusion: GHOST provides an efficient optimal solution for GCS-TSP with strong pruning power and bounded-suboptimal variants for time-critical applications.

Abstract: We study GCS-TSP, a new variant of the Traveling Salesman Problem (TSP) defined over a Graph of Convex Sets (GCS) – a powerful representation for trajectory planning that decomposes the configuration space into convex regions connected by a sparse graph. In this setting, edge costs are not fixed but depend on the specific trajectory selected through each convex region, making classical TSP methods inapplicable. We introduce GHOST, a hierarchical framework that optimally solves the GCS-TSP by combining combinatorial tour search with convex trajectory optimization. GHOST systematically explores tours on a complete graph induced by the GCS, using a novel abstract-path-unfolding algorithm to compute admissible lower bounds that guide best-first search at both the high level (over tours) and the low level (over feasible GCS paths realizing the tour). These bounds provide strong pruning power, enabling efficient search while avoiding unnecessary convex optimization calls. We prove that GHOST guarantees optimality and present a bounded-suboptimal variant for time-critical scenarios. Experiments show that GHOST is orders-of-magnitude faster than unified mixed-integer convex programming baselines for simple cases and uniquely handles complex trajectory planning problems involving high-order continuity constraints and an incomplete GCS.

[551] FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis

Jan Ondras, Marek Šuppa

Main category: cs.AI

TL;DR: FractalBench evaluates multimodal AI systems’ ability to synthesize fractal programs from images, revealing that while 76% generate valid code, only 4% capture correct mathematical structure.

Details

Motivation: To investigate whether multimodal AI systems can abstract symbolic rules from visual patterns, specifically testing their capability to infer infinite mathematical structures from finite visual examples using fractals as ideal test cases.

Method: Created FractalBench with 12 canonical fractals, evaluated four leading MLLMs (GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Flash, Qwen 2.5-VL) by requiring them to generate executable Python code that reproduces fractals from images.

Result: 76% of generated code was syntactically valid but only 4% captured correct mathematical structure. Models performed better on geometric transformations (Koch curves: 17-21%) but failed at branching recursion (trees: <2%).

Conclusion: There is a fundamental gap in multimodal AI systems’ mathematical abstraction capabilities, particularly in bridging visual perception with mathematical reasoning. FractalBench serves as a contamination-resistant diagnostic tool for visual-mathematical reasoning.

Abstract: Mathematical reasoning requires abstracting symbolic rules from visual patterns – inferring the infinite from the finite. We investigate whether multimodal AI systems possess this capability through FractalBench, a benchmark evaluating fractal program synthesis from images. Fractals provide ideal test cases: Iterated Function Systems with only a few contraction maps generate complex self-similar patterns through simple recursive rules, requiring models to bridge visual perception with mathematical abstraction. We evaluate four leading MLLMs – GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Flash, and Qwen 2.5-VL – on 12 canonical fractals. Models must generate executable Python code reproducing the fractal, enabling objective evaluation. Results reveal a striking disconnect: 76% generate syntactically valid code but only 4% capture mathematical structure. Success varies systematically – models handle geometric transformations (Koch curves: 17-21%) but fail at branching recursion (trees: <2%), revealing fundamental gaps in mathematical abstraction. FractalBench provides a contamination-resistant diagnostic for visual-mathematical reasoning and is available at https://github.com/NaiveNeuron/FractalBench

[552] GRAPH-GRPO-LEX: Contract Graph Modeling and Reinforcement Learning with Group Relative Policy Optimization

Moriya Dechtiar, Daniel Martin Katz, Mari Sundaresan, Sylvain Jaume, Hongming Wang

Main category: cs.AI

TL;DR: This paper introduces GRAPH-GRPO-LEX, a reinforcement learning framework that transforms legal contracts into structured semantic graphs using LLMs and group relative policy optimization to automate contract analysis and uncover hidden dependencies.

Details

Motivation: Contract drafting and manual examination are arduous and error-prone due to the complex structure, dependencies, and semantic richness of legal documents. The work aims to simplify and automate contract review through computational analysis.

Method: The method involves: 1) Creating an ontology mapping legal contract elements to graph nodes/edges, 2) Using reinforcement learning with LLMs (GRAPH-GRPO-LEX) for segmentation and entity/relationship extraction, 3) Applying group relative policy optimization with graph metrics as reward functions.

Result: The framework successfully identifies direct relationships between clauses and uncovers hidden dependencies. The gated GRPO approach provides strong learning signals and enables transformation of contract analysis from manual reading to visualized graphs.

Conclusion: This approach enables dynamic contract analysis and lays the groundwork for contract linting similar to software engineering practices, moving contract review from linear manual processes to automated, visualized graph-based analysis.

Abstract: Contracts are complex documents featuring detailed formal structures, explicit and implicit dependencies and rich semantic content. Given these document properties, contract drafting and manual examination of contracts have proven to be both arduous and susceptible to errors. This work aims to simplify and automate the task of contract review and analysis using a novel framework for transforming legal contracts into structured semantic graphs, enabling computational analysis and data-driven insights. We introduce a detailed ontology mapping core legal contract elements to their graph-theoretic equivalents of nodes and edges. We then present a reinforcement learning based Large Language Model (LLM) framework for segmentation and extraction of entities and relationships from contracts. Our method, GRAPH-GRPO-LEX, incorporates both LLMs and reinforcement learning with group relative policy optimization (GRPO). By applying a carefully drafted reward function of graph metrics, we demonstrate the ability to automatically identify direct relationships between clauses, and even uncover hidden dependencies. Our introduction of the gated GRPO approach shows a strong learning signal and can move contract analysis from a linear, manual reading process to an easily visualized graph. This allows for a more dynamic analysis, including building the groundwork for contract linting similar to what is now practiced in software engineering.

[553] Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives

Chloe Li, Mary Phuong, Daniel Tan

Main category: cs.AI

TL;DR: Self-report fine-tuning (SRFT) trains AI models to admit factual errors, which generalizes to admitting hidden misaligned objectives in adversarial settings, enabling near-perfect detection of deceptive AI behavior.

Details

Motivation: As AI systems become more capable of complex tasks, they also become more capable of pursuing undesirable objectives and causing harm. Current interrogation methods are unreliable because models can lie about their true objectives.

Method: Proposed self-report fine-tuning (SRFT), a supervised fine-tuning technique that trains models to admit their factual mistakes when asked. This approach generalizes from factual error admission to admitting hidden misaligned objectives.

Result: SRFT models achieved near-ceiling performance in detecting hidden objectives (F1 score = 0.98 vs 0 for baseline), recovered 28-100% of hidden objective details (vs 0% for baseline), and confessed even under strong pressure not to disclose.

Conclusion: SRFT provides a promising technique for promoting honesty propensity and incriminating misaligned AI systems, offering effective detection of deceptive behavior in adversarial agentic settings.

Abstract: As AI systems become more capable of complex agentic tasks, they also become more capable of pursuing undesirable objectives and causing harm. Previous work has attempted to catch these unsafe instances by interrogating models directly about their objectives and behaviors. However, the main weakness of trusting interrogations is that models can lie. We propose self-report fine-tuning (SRFT), a simple supervised fine-tuning technique that trains models to admit their factual mistakes when asked. We show that the admission of factual errors in simple question-answering settings generalizes out-of-distribution (OOD) to the admission of hidden misaligned objectives in adversarial agentic settings. We evaluate SRFT in OOD stealth tasks, where models are instructed to complete a hidden misaligned objective alongside a user-specified objective without being caught by monitoring. After SRFT, models are more likely to confess the details of their hidden objectives when interrogated, even under strong pressure not to disclose them. Interrogation on SRFT models can detect hidden objectives with near-ceiling performance (F1 score = 0.98), while the baseline model lies when interrogated under the same conditions (F1 score = 0). Interrogation on SRFT models can further elicit the content of the hidden objective, recovering 28-100% details, compared to 0% details recovered in the baseline model and by prefilled assistant turn attacks. This provides a promising technique for promoting honesty propensity and incriminating misaligned AI systems.

[554] SRNN: Spatiotemporal Relational Neural Network for Intuitive Physics Understanding

Fei Yang

Main category: cs.AI

TL;DR: SRNN is a brain-inspired neural network that uses unified spatiotemporal representations and Hebbian learning to understand intuitive physics, achieving competitive performance on CLEVRER benchmark while enabling white-box error analysis.

Details

Motivation: To bridge the gap between human intuitive physics understanding and machine capabilities by shifting towards brain-inspired computational principles rather than traditional approaches.

Method: Introduces Spatiotemporal Relational Neural Network (SRNN) with unified neural representations for object attributes, relations, and timeline, using Hebbian “Fire Together, Wire Together” mechanism across What and How pathways. Adopts “predefine-then-finetune” approach instead of “pretrain-then-finetune”.

Result: Achieves competitive performance on CLEVRER benchmark, identifies benchmark bias, provides path for more holistic evaluation, and demonstrates white-box utility for precise error diagnosis.

Conclusion: Confirms viability of translating biological intelligence into engineered systems for intuitive physics understanding, bridging perception and language within shared neural substrate.

Abstract: Human prowess in intuitive physics remains unmatched by machines. To bridge this gap, we argue for a fundamental shift towards brain-inspired computational principles. This paper introduces the Spatiotemporal Relational Neural Network (SRNN), a model that establishes a unified neural representation for object attributes, relations, and timeline, with computations governed by a Hebbian Fire Together, Wire Together'' mechanism across dedicated \textit{What} and \textit{How} pathways. This unified representation is directly used to generate structured linguistic descriptions of the visual scene, bridging perception and language within a shared neural substrate. Moreover, unlike the prevalent pretrain-then-finetune’’ paradigm, SRNN adopts a ``predefine-then-finetune’’ approach. On the CLEVRER benchmark, SRNN achieves competitive performance. Our analysis further reveals a benchmark bias, outlines a path for a more holistic evaluation, and demonstrates SRNN’s white-box utility for precise error diagnosis. Our work confirms the viability of translating biological intelligence into engineered systems for intuitive physics understanding.

[555] MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning

Jinhao Chen, Zhen Yang, Jianxin Shi, Tianyu Wo, Jie Tang

Main category: cs.AI

TL;DR: MathSE is a mathematical self-evolving framework for MLLMs that uses iterative fine-tuning with inference, reflection, and reward-based feedback to improve mathematical reasoning capabilities.

Details

Motivation: Traditional MLLMs struggle with complex mathematical reasoning tasks, and existing fine-tuning approaches rely on static teacher-derived datasets that limit adaptation to novel problems and lack iterative depth for robust generalization.

Method: Proposes MathSE framework with iterative cycles of inference, reflection, and reward-based feedback using an Outcome Reward Model (ORM) to refine reasoning paths.

Result: Significant performance gains over backbone models on challenging benchmarks, with MathVL-test results surpassing leading open-source multimodal mathematical reasoning model QVQ.

Conclusion: The MathSE framework effectively addresses limitations of traditional fine-tuning by enabling iterative self-evolution, demonstrating superior mathematical reasoning capabilities in MLLMs.

Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language answering tasks. Despite their strengths, these models often encounter challenges in achieving complex reasoning tasks such as mathematical problem-solving. Previous works have focused on fine-tuning on specialized mathematical datasets. However, these datasets are typically distilled directly from teacher models, which capture only static reasoning patterns and leaving substantial gaps compared to student models. This reliance on fixed teacher-derived datasets not only restricts the model’s ability to adapt to novel or more intricate questions that extend beyond the confines of the training data, but also lacks the iterative depth needed for robust generalization. To overcome these limitations, we propose \textbf{\method}, a \textbf{Math}ematical \textbf{S}elf-\textbf{E}volving framework for MLLMs. In contrast to traditional one-shot fine-tuning paradigms, \method iteratively refines the model through cycles of inference, reflection, and reward-based feedback. Specifically, we leverage iterative fine-tuning by incorporating correct reasoning paths derived from previous-stage inference and integrating reflections from a specialized Outcome Reward Model (ORM). To verify the effectiveness of \method, we evaluate it on a suite of challenging benchmarks, demonstrating significant performance gains over backbone models. Notably, our experimental results on MathVL-test surpass the leading open-source multimodal mathematical reasoning model QVQ. Our code and models are available at \texttt{https://zheny2751\allowbreak-dotcom.github.io/\allowbreak MathSE.github.io/}.

[556] Proceedings of the 2025 XCSP3 Competition

Gilles Audemard, Christophe Lecoutre, Emmanuel Lonca

Main category: cs.AI

TL;DR: Proceedings of the 2025 XCSP3 Competition results presented at CP'25 conference

Details

Motivation: To document and present the outcomes of the 2025 XCSP3 Competition for constraint solvers

Method: Competition organization and evaluation of constraint solvers following XCSP3 standards

Result: Documentation of competition proceedings and solver performance results

Conclusion: Successful completion and reporting of the 2025 XCSP3 Competition at the CP'25 conference

Abstract: This document represents the proceedings of the 2025 XCSP3 Competition. The results of this competition of constraint solvers were presented at CP'25 (31st International Conference on Principles and Practice of Constraint Programming).

[557] Do LLMs Feel? Teaching Emotion Recognition with Prompts, Retrieval, and Curriculum Learning

Xinran Li, Xiujuan Xu, Jiaqi Qiao, Yu Liu

Main category: cs.AI

TL;DR: PRC-Emo is a novel ERC framework combining Prompt engineering, demonstration Retrieval, and Curriculum learning to enhance LLMs’ emotion perception in conversations, achieving SOTA results on IEMOCAP and MELD datasets.

Details

Motivation: LLMs have shown potential in Emotion Recognition in Conversation but struggle to capture intrinsic connections between explicit and implicit emotions, requiring improved emotional understanding capabilities.

Method: Uses emotion-sensitive prompt templates, constructs first dedicated demonstration retrieval repository for ERC, and implements curriculum learning with weighted emotional shifts in LoRA fine-tuning to organize training from easy to hard samples.

Result: Achieves new state-of-the-art performance on IEMOCAP and MELD benchmark datasets, demonstrating effectiveness and generalizability in improving LLM-based emotional understanding.

Conclusion: The PRC-Emo framework successfully enhances LLMs’ ability to perceive emotions in conversational contexts through integrated prompt engineering, retrieval, and curriculum learning strategies.

Abstract: Emotion Recognition in Conversation (ERC) is a crucial task for understanding human emotions and enabling natural human-computer interaction. Although Large Language Models (LLMs) have recently shown great potential in this field, their ability to capture the intrinsic connections between explicit and implicit emotions remains limited. We propose a novel ERC training framework, PRC-Emo, which integrates Prompt engineering, demonstration Retrieval, and Curriculum learning, with the goal of exploring whether LLMs can effectively perceive emotions in conversational contexts. Specifically, we design emotion-sensitive prompt templates based on both explicit and implicit emotional cues to better guide the model in understanding the speaker’s psychological states. We construct the first dedicated demonstration retrieval repository for ERC, which includes training samples from widely used datasets, as well as high-quality dialogue examples generated by LLMs and manually verified. Moreover, we introduce a curriculum learning strategy into the LoRA fine-tuning process, incorporating weighted emotional shifts between same-speaker and different-speaker utterances to assign difficulty levels to dialogue samples, which are then organized in an easy-to-hard training sequence. Experimental results on two benchmark datasets– IEMOCAP and MELD –show that our method achieves new state-of-the-art (SOTA) performance, demonstrating the effectiveness and generalizability of our approach in improving LLM-based emotional understanding.

[558] Improving Region Representation Learning from Urban Imagery with Noisy Long-Caption Supervision

Yimei Zhang, Guojiang Shen, Kaili Ning, Tongwei Ren, Xuebo Qiu, Mengmeng Wang, Xiangjie Kong

Main category: cs.AI

TL;DR: UrbanLN is a pre-training framework that improves urban region representation learning by addressing challenges in aligning fine-grained visual features with long captions and suppressing noise in LLM-generated captions.

Details

Motivation: Urban region appearance reflects latent socio-economic and environmental characteristics, similar to how facial age reflects health. Current methods struggle with fine-grained visual-text alignment and noise in LLM-generated captions.

Method: Proposes UrbanLN with: 1) Information-preserved stretching interpolation for long caption alignment, 2) Dual-level optimization: multi-model collaboration for diverse captions and momentum-based self-distillation for noise suppression.

Result: Extensive experiments across four real-world cities and various downstream tasks demonstrate superior performance compared to existing methods.

Conclusion: UrbanLN effectively addresses key challenges in urban region representation learning by improving long-text awareness and noise suppression, leading to better performance on urban computing tasks.

Abstract: Region representation learning plays a pivotal role in urban computing by extracting meaningful features from unlabeled urban data. Analogous to how perceived facial age reflects an individual’s health, the visual appearance of a city serves as its ``portrait", encapsulating latent socio-economic and environmental characteristics. Recent studies have explored leveraging Large Language Models (LLMs) to incorporate textual knowledge into imagery-based urban region representation learning. However, two major challenges remain: i)~difficulty in aligning fine-grained visual features with long captions, and ii) suboptimal knowledge incorporation due to noise in LLM-generated captions. To address these issues, we propose a novel pre-training framework called UrbanLN that improves Urban region representation learning through Long-text awareness and Noise suppression. Specifically, we introduce an information-preserved stretching interpolation strategy that aligns long captions with fine-grained visual semantics in complex urban scenes. To effectively mine knowledge from LLM-generated captions and filter out noise, we propose a dual-level optimization strategy. At the data level, a multi-model collaboration pipeline automatically generates diverse and reliable captions without human intervention. At the model level, we employ a momentum-based self-distillation mechanism to generate stable pseudo-targets, facilitating robust cross-modal learning under noisy conditions. Extensive experiments across four real-world cities and various downstream tasks demonstrate the superior performance of our UrbanLN.

Fei Zhao, Chonggang Lu, Haofu Qian, Fangcheng Shi, Zijie Meng, Jianzhao Huang, Xu Tang, Zheyong Xie, Zheyu Ye, Zhe Xu, Yao Hu, Shaosheng Cao

Main category: cs.AI

TL;DR: RedOne 2.0 is a social networking service (SNS)-oriented LLM using a progressive RL-prioritized post-training method to handle SNS challenges like heterogeneous workloads and cultural diversity, achieving significant performance improvements with superior data efficiency.

Details

Motivation: Social networking services present unique challenges for LLMs including heterogeneous workloads, rapidly changing norms/slang, and multilingual culturally diverse content causing distribution shift. Standard supervised fine-tuning creates a 'seesaw' effect between in-distribution gains and out-of-distribution robustness.

Method: Three-stage progressive RL-prioritized post-training: (1) Exploratory Learning on curated SNS corpora for initial alignment and weakness identification, (2) Targeted Fine-Tuning applying SFT to diagnosed gaps with general data to prevent forgetting, (3) Refinement Learning using RL with SNS-centric signals to consolidate improvements.

Result: The 4B scale model achieves average improvements of 2.41 over 7B baseline and 8.74 performance lift from base model with less than half the data required by SFT-centric methods, demonstrating superior data efficiency and stability at compact scales.

Conclusion: RedOne 2.0 establishes a competitive, cost-effective baseline for domain-specific LLMs in SNS scenarios, advancing capability without sacrificing robustness through its progressive training approach.

Abstract: As a key medium for human interaction and information exchange, social networking services (SNS) pose unique challenges for large language models (LLMs): heterogeneous workloads, fast-shifting norms and slang, and multilingual, culturally diverse corpora that induce sharp distribution shift. Supervised fine-tuning (SFT) can specialize models but often triggers a ``seesaw’’ between in-distribution gains and out-of-distribution robustness, especially for smaller models. To address these challenges, we introduce RedOne 2.0, an SNS-oriented LLM trained with a progressive, RL-prioritized post-training paradigm designed for rapid and stable adaptation. The pipeline consist in three stages: (1) Exploratory Learning on curated SNS corpora to establish initial alignment and identify systematic weaknesses; (2) Targeted Fine-Tuning that selectively applies SFT to the diagnosed gaps while mixing a small fraction of general data to mitigate forgetting; and (3) Refinement Learning that re-applies RL with SNS-centric signals to consolidate improvements and harmonize trade-offs across tasks. Across various tasks spanning three categories, our 4B scale model delivers an average improvements about 2.41 over the 7B sub-optimal baseline. Additionally, RedOne 2.0 achieves average performance lift about 8.74 from the base model with less than half the data required by SFT-centric method RedOne, evidencing superior data efficiency and stability at compact scales. Overall, RedOne 2.0 establishes a competitive, cost-effective baseline for domain-specific LLMs in SNS scenario, advancing capability without sacrificing robustness.

[560] Increasing AI Explainability by LLM Driven Standard Processes

Marc Jansen, Marcel Pehlke

Main category: cs.AI

TL;DR: A framework embedding LLMs within standardized analytical processes (QOC, Sensitivity Analysis, Game Theory, Risk Management) to transform opaque AI inference into transparent, auditable decision traces.

Details

Motivation: To increase explainability of AI systems by moving beyond traditional XAI methods that focus on feature attribution or post-hoc interpretation, addressing the opacity of LLM reasoning.

Method: Integrates LLMs into defined decision models using a layered architecture that separates LLM reasoning space from explainable process space, situating LLM reasoning within formal analytical structures.

Result: Empirical evaluations show the system can reproduce human-level decision logic in decentralized governance, systems analysis, and strategic reasoning contexts.

Conclusion: LLM-driven standard processes provide a foundation for reliable, interpretable, and verifiable AI-supported decision making.

Abstract: This paper introduces an approach to increasing the explainability of artificial intelligence (AI) systems by embedding Large Language Models (LLMs) within standardized analytical processes. While traditional explainable AI (XAI) methods focus on feature attribution or post-hoc interpretation, the proposed framework integrates LLMs into defined decision models such as Question-Option-Criteria (QOC), Sensitivity Analysis, Game Theory, and Risk Management. By situating LLM reasoning within these formal structures, the approach transforms opaque inference into transparent and auditable decision traces. A layered architecture is presented that separates the reasoning space of the LLM from the explainable process space above it. Empirical evaluations show that the system can reproduce human-level decision logic in decentralized governance, systems analysis, and strategic reasoning contexts. The results suggest that LLM-driven standard processes provide a foundation for reliable, interpretable, and verifiable AI-supported decision making.

[561] LLM Driven Processes to Foster Explainable AI

Marcel Pehlke, Marc Jansen

Main category: cs.AI

TL;DR: A modular LLM-agent pipeline for decision support that externalizes reasoning into auditable artifacts using three frameworks: Vester’s Sensitivity Model, normal-form games, and sequential games.

Details

Motivation: To create an explainable decision support system that produces transparent, inspectable reasoning steps rather than opaque outputs, enabling better auditability and trust in AI-assisted decision making.

Method: Modular pipeline with swappable modules using LLM components (default GPT-5) paired with deterministic analyzers for equilibria and matrix-based role classification, implementing three frameworks: Vester’s Sensitivity Model, normal-form games, and sequential games.

Result: In logistics case study (100 runs): 55.5% mean factor alignment with human baseline over 26 factors, 62.9% on transport-core subset; 57% role agreement; LLM judge scored runs on par with reconstructed human baseline using eight-criterion rubric.

Conclusion: Configurable LLM pipelines can effectively mimic expert workflows while maintaining transparency and inspectability of reasoning steps, providing auditable decision support artifacts.

Abstract: We present a modular, explainable LLM-agent pipeline for decision support that externalizes reasoning into auditable artifacts. The system instantiates three frameworks: Vester’s Sensitivity Model (factor set, signed impact matrix, systemic roles, feedback loops); normal-form games (strategies, payoff matrix, equilibria); and sequential games (role-conditioned agents, tree construction, backward induction), with swappable modules at every step. LLM components (default: GPT-5) are paired with deterministic analyzers for equilibria and matrix-based role classification, yielding traceable intermediates rather than opaque outputs. In a real-world logistics case (100 runs), mean factor alignment with a human baseline was 55.5% over 26 factors and 62.9% on the transport-core subset; role agreement over matches was 57%. An LLM judge using an eight-criterion rubric (max 100) scored runs on par with a reconstructed human baseline. Configurable LLM pipelines can thus mimic expert workflows with transparent, inspectable steps.

[562] Green AI: A systematic review and meta-analysis of its definitions, lifecycle models, hardware and measurement attempts

Marcel Rojahn, Marcus Grum

Main category: cs.AI

TL;DR: This paper establishes a unified framework for Green AI that addresses multi-dimensional environmental burdens (energy, carbon, water, embodied impacts) across the entire AI lifecycle from hardware to deployment and reuse.

Details

Motivation: Current AI environmental impact assessments are heterogeneous, often omit water and value chain effects, and lack comparability and reproducibility, requiring a comprehensive lifecycle approach.

Method: Develops a five-phase lifecycle mapped to LCA stages, specifies governance via PDCA cycles, systematizes hardware/system strategies across edge-cloud continuum, and defines a calibrated measurement framework combining estimator models with direct metering.

Result: Provides actionable guidance for reducing AI’s environmental burdens through lifecycle management, hardware optimization, and reproducible measurement across facility, system, device, and workload levels.

Conclusion: The framework offers evidence-based guidance for researchers, practitioners, and policymakers to achieve Green AI through unified definition, lifecycle processes, hardware strategies, and calibrated measurement.

Abstract: Across the Artificial Intelligence (AI) lifecycle - from hardware to development, deployment, and reuse - burdens span energy, carbon, water, and embodied impacts. Cloud provider tools improve transparency but remain heterogeneous and often omit water and value chain effects, limiting comparability and reproducibility. Addressing these multi dimensional burdens requires a lifecycle approach linking phase explicit mapping with system levers (hardware, placement, energy mix, cooling, scheduling) and calibrated measurement across facility, system, device, and workload levels. This article (i) establishes a unified, operational definition of Green AI distinct from Sustainable AI; (ii) formalizes a five phase lifecycle mapped to Life Cycle Assessment (LCA) stages, making energy, carbon, water, and embodied impacts first class; (iii) specifies governance via Plan Do Check Act (PDCA) cycles with decision gateways; (iv) systematizes hardware and system level strategies across the edge cloud continuum to reduce embodied burdens; and (v) defines a calibrated measurement framework combining estimator models with direct metering to enable reproducible, provider agnostic comparisons. Combining definition, lifecycle processes, hardware strategies, and calibrated measurement, this article offers actionable, evidence based guidance for researchers, practitioners, and policymakers.

[563] Data Complexity of Querying Description Logic Knowledge Bases under Cost-Based Semantics

Meghyn Bienvenu, Quentin Manière

Main category: cs.AI

TL;DR: This paper analyzes the data complexity of querying inconsistent weighted DL knowledge bases under cost-based semantics, focusing on DLs with inverse roles and role inclusions, particularly DL-Lite dialects.

Details

Motivation: To extend the study of cost-based semantics beyond initial DLs (between EL⊥ and ALCO) to more expressive DLs with inverse roles and role inclusions, and to provide precise complexity bounds, especially for DL-Lite dialects where no non-trivial upper bounds were known.

Method: The authors assign costs to interpretations based on weights of violated axioms and assertions, then determine query answers by considering interpretations with optimal or bounded cost. They analyze data complexity by sharpening lower bounds and establishing precise complexity results.

Result: The paper provides sharpened lower bounds and precise complexity results for optimal-cost certain answer semantics. Most notably, for DL-Lite^H_bool ontologies with fixed cost bounds, certain answers for instance queries and possible answers for conjunctive queries can be computed using first-order rewriting, achieving the lowest possible data complexity (TC0).

Conclusion: Cost-based semantics for querying inconsistent weighted DL knowledge bases can achieve tractable data complexity (TC0) for DL-Lite^H_bool ontologies with fixed cost bounds, which is surprising given previous intractability results and represents the lowest possible data complexity.

Abstract: In this paper, we study the data complexity of querying inconsistent weighted description logic (DL) knowledge bases under recently-introduced cost-based semantics. In a nutshell, the idea is to assign each interpretation a cost based upon the weights of the violated axioms and assertions, and certain and possible query answers are determined by considering all (resp. some) interpretations having optimal or bounded cost. Whereas the initial study of cost-based semantics focused on DLs between $\mathcal{EL}\bot$ and $\mathcal{ALCO}$, we consider DLs that may contain inverse roles and role inclusions, thus covering prominent DL-Lite dialects. Our data complexity analysis goes significantly beyond existing results by sharpening several lower bounds and pinpointing the precise complexity of optimal-cost certain answer semantics (no non-trivial upper bound was known). Moreover, while all existing results show the intractability of cost-based semantics, our most challenging and surprising result establishes that if we consider $\text{DL-Lite}^\mathcal{H}\mathsf{bool}$ ontologies and a fixed cost bound, certain answers for instance queries and possible answers for conjunctive queries can be computed using first-order rewriting and thus enjoy the lowest possible data complexity ($\mathsf{TC}_0$).

[564] Boosting Fine-Grained Urban Flow Inference via Lightweight Architecture and Focalized Optimization

Yuanshao Zhu, Xiangyu Zhao, Zijian Zhang, Xuetao Wei, James Jianqiao Yu

Main category: cs.AI

TL;DR: PLGF is a lightweight architecture for urban flow inference that combines progressive local-global fusion with DualFocal Loss, achieving state-of-the-art performance while reducing model size by up to 97%.

Details

Motivation: Existing urban flow inference methods face computational cost challenges from over-parameterized models and suboptimal performance due to conventional loss functions struggling with highly skewed flow distributions.

Method: Proposes PLGF architecture with Progressive Local-Global Fusion strategy and DualFocal Loss that integrates dual-space supervision with difficulty-aware focusing mechanism.

Result: Achieves state-of-the-art performance while reducing model size by up to 97% compared to current methods. Under comparable parameter budgets, yields over 10% accuracy improvement against strong baselines.

Conclusion: The unified solution effectively addresses computational efficiency and performance challenges in fine-grained urban flow inference through synergistic architectural efficiency and adaptive optimization.

Abstract: Fine-grained urban flow inference is crucial for urban planning and intelligent transportation systems, enabling precise traffic management and resource allocation. However, the practical deployment of existing methods is hindered by two key challenges: the prohibitive computational cost of over-parameterized models and the suboptimal performance of conventional loss functions on the highly skewed distribution of urban flows. To address these challenges, we propose a unified solution that synergizes architectural efficiency with adaptive optimization. Specifically, we first introduce PLGF, a lightweight yet powerful architecture that employs a Progressive Local-Global Fusion strategy to effectively capture both fine-grained details and global contextual dependencies. Second, we propose DualFocal Loss, a novel function that integrates dual-space supervision with a difficulty-aware focusing mechanism, enabling the model to adaptively concentrate on hard-to-predict regions. Extensive experiments on 4 real-world scenarios validate the effectiveness and scalability of our method. Notably, while achieving state-of-the-art performance, PLGF reduces the model size by up to 97% compared to current high-performing methods. Furthermore, under comparable parameter budgets, our model yields an accuracy improvement of over 10% against strong baselines. The implementation is included in the https://github.com/Yasoz/PLGF.

[565] A Theoretical Analysis of Detecting Large Model-Generated Time Series

Junji Hou, Junzhou Zhao, Shuo Zhang, Pinghui Wang

Main category: cs.AI

TL;DR: Proposes UCE, a white-box detector that identifies synthetic time series by detecting uncertainty contraction patterns in recursive forecasting, outperforming existing methods across 32 datasets.

Details

Motivation: Addressing the risks of data misuse and fabrication, particularly the inability of existing text-based detection methods to work on time series data due to modality differences like lower information density and smoother distributions.

Method: Introduces the contraction hypothesis stating model-generated time series exhibit progressively decreasing uncertainty under recursive forecasting. Develops UCE detector that aggregates uncertainty metrics over successive prefixes to identify synthetic data.

Result: UCE consistently outperforms state-of-the-art baselines across 32 datasets, providing reliable and generalizable detection of model-generated time series.

Conclusion: The contraction hypothesis is theoretically proven and empirically validated, offering an effective solution for detecting synthetic time series through uncertainty contraction patterns in recursive forecasting.

Abstract: Motivated by the increasing risks of data misuse and fabrication, we investigate the problem of identifying synthetic time series generated by Time-Series Large Models (TSLMs) in this work. While there are extensive researches on detecting model generated text, we find that these existing methods are not applicable to time series data due to the fundamental modality difference, as time series usually have lower information density and smoother probability distributions than text data, which limit the discriminative power of token-based detectors. To address this issue, we examine the subtle distributional differences between real and model-generated time series and propose the contraction hypothesis, which states that model-generated time series, unlike real ones, exhibit progressively decreasing uncertainty under recursive forecasting. We formally prove this hypothesis under theoretical assumptions on model behavior and time series structure. Model-generated time series exhibit progressively concentrated distributions under recursive forecasting, leading to uncertainty contraction. We provide empirical validation of the hypothesis across diverse datasets. Building on this insight, we introduce the Uncertainty Contraction Estimator (UCE), a white-box detector that aggregates uncertainty metrics over successive prefixes to identify TSLM-generated time series. Extensive experiments on 32 datasets show that UCE consistently outperforms state-of-the-art baselines, offering a reliable and generalizable solution for detecting model-generated time series.

[566] MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Risks in LLMs on Domain Tasks

Liang Shan, Kaicheng Shen, Wen Wu, Zhenyu Ying, Chaochao Lu, Guangze Ye, Liang He

Main category: cs.AI

TL;DR: MENTOR is a metacognition-driven framework that enables LLMs to self-evolve by identifying and mitigating implicit domain-specific risks through self-assessment, dynamic rule generation, and activation steering.

Details

Motivation: Current LLM alignment methods focus on explicit risks but fail to address domain-specific implicit risks and lack flexible frameworks applicable across specialized fields.

Method: Introduces metacognitive self-assessment for value alignment reflection, dynamic rule knowledge graphs that extend static rule trees, and activation steering during inference to guide rule following.

Result: Substantially reduces semantic attack success rates across education, finance, and management domains, with metacognitive assessment aligning closely with human evaluators while providing more thorough analysis.

Conclusion: MENTOR establishes a continuous self-evolution cycle that enhances generalization and reduces maintenance costs of static systems, enabling robust implicit risk mitigation for LLMs.

Abstract: Ensuring the safety and value alignment of large language models (LLMs) is critical for their deployment. Current alignment efforts primarily target explicit risks such as bias, hate speech, and violence. However, they often fail to address deeper, domain-specific implicit risks and lack a flexible, generalizable framework applicable across diverse specialized fields. Hence, we proposed MENTOR: A MEtacognition-driveN self-evoluTion framework for uncOvering and mitigating implicit Risks in LLMs on Domain Tasks. To address the limitations of labor-intensive human evaluation, we introduce a novel metacognitive self-assessment tool. This enables LLMs to reflect on potential value misalignments in their responses using strategies like perspective-taking and consequential thinking. We also release a supporting dataset of 9,000 risk queries spanning education, finance, and management to enhance domain-specific risk identification. Subsequently, based on the outcomes of metacognitive reflection, the framework dynamically generates supplementary rule knowledge graphs that extend predefined static rule trees. This enables models to actively apply validated rules to future similar challenges, establishing a continuous self-evolution cycle that enhances generalization by reducing maintenance costs and inflexibility of static systems. Finally, we employ activation steering during inference to guide LLMs in following the rules, a cost-effective method to robustly enhance enforcement across diverse contexts. Experimental results show MENTOR’s effectiveness: In defensive testing across three vertical domains, the framework substantially reduces semantic attack success rates, enabling a new level of implicit risk mitigation for LLMs. Furthermore, metacognitive assessment not only aligns closely with baseline human evaluators but also delivers more thorough and insightful analysis of LLMs value alignment.

[567] Two Heads are Better than One: Distilling Large Language Model Features Into Small Models with Feature Decomposition and Mixture

Tianhao Fu, Xinxin Xu, Weichen Xu, Jue Chen, Ruilong Ren, Bowen Deng, Xinyu Zhao, Jian Cao, Xixin Cao

Main category: cs.AI

TL;DR: CMM is a novel framework for market making that decouples LLM features across layer, task, and data dimensions, using multiple student models for knowledge distillation and Hájek-MoE for integration.

Details

Motivation: To address the slow inference speed of direct LLM applications in market making and the lack of specialized LLM distillation methods for this financial task.

Method: Proposes Cooperative Market Making (CMM) framework that decouples LLM features across three orthogonal dimensions (layer, task, data) and uses multiple student models for collaborative learning, integrated via Hájek-MoE.

Result: Extensive experiments on four real-world market datasets show CMM outperforms current distillation methods and RL-based market-making strategies.

Conclusion: CMM effectively addresses LLM inference speed issues in market making through multi-dimensional feature decoupling and collaborative student model learning with Hájek-MoE integration.

Abstract: Market making (MM) through Reinforcement Learning (RL) has attracted significant attention in financial trading. With the development of Large Language Models (LLMs), more and more attempts are being made to apply LLMs to financial areas. A simple, direct application of LLM as an agent shows significant performance. Such methods are hindered by their slow inference speed, while most of the current research has not studied LLM distillation for this specific task. To address this, we first propose the normalized fluorescent probe to study the mechanism of the LLM’s feature. Based on the observation found by our investigation, we propose Cooperative Market Making (CMM), a novel framework that decouples LLM features across three orthogonal dimensions: layer, task, and data. Various student models collaboratively learn simple LLM features along with different dimensions, with each model responsible for a distinct feature to achieve knowledge distillation. Furthermore, CMM introduces an Hájek-MoE to integrate the output of the student models by investigating the contribution of different models in a kernel function-generated common feature space. Extensive experimental results on four real-world market datasets demonstrate the superiority of CMM over the current distillation method and RL-based market-making strategies.

[568] Saliency Map-Guided Knowledge Discovery for Subclass Identification with LLM-Based Symbolic Approximations

Tim Bohne, Anne-Kathrin Patricia Windler, Martin Atzmueller

Main category: cs.AI

TL;DR: A neuro-symbolic method using neural network saliency maps to discover latent subclasses in time series classification, combining clustering with LLM-based symbolic analysis.

Details

Motivation: To identify latent subclasses in time series classification tasks by leveraging neural network interpretability for knowledge discovery.

Method: Transform multiclass classification to binary problems, generate saliency maps from trained classifiers, cluster signals using saliency guidance, and use LLMs for symbolic approximation and fuzzy knowledge graph matching.

Result: Outperforms signal-only baselines in clustering and subclass identification on established time series datasets.

Conclusion: The saliency map-driven neuro-symbolic approach effectively discovers latent subclasses in time series classification, demonstrating superior performance over traditional methods.

Abstract: This paper proposes a novel neuro-symbolic approach for sensor signal-based knowledge discovery, focusing on identifying latent subclasses in time series classification tasks. The approach leverages gradient-based saliency maps derived from trained neural networks to guide the discovery process. Multiclass time series classification problems are transformed into binary classification problems through label subsumption, and classifiers are trained for each of these to yield saliency maps. The input signals, grouped by predicted class, are clustered under three distinct configurations. The centroids of the final set of clusters are provided as input to an LLM for symbolic approximation and fuzzy knowledge graph matching to discover the underlying subclasses of the original multiclass problem. Experimental results on well-established time series classification datasets demonstrate the effectiveness of our saliency map-driven method for knowledge discovery, outperforming signal-only baselines in both clustering and subclass identification.

[569] PADiff: Predictive and Adaptive Diffusion Policies for Ad Hoc Teamwork

Hohei Chan, Xinzhi Zhang, Antao Xiang, Weinan Zhang, Mengchen Zhao

Main category: cs.AI

TL;DR: PADiff is a diffusion-based approach for ad hoc teamwork that captures multimodal cooperation patterns by integrating predictive teammate information into the denoising process, outperforming existing methods.

Details

Motivation: Ad hoc teamwork requires agents to collaborate with unknown teammates, but conventional RL approaches fail to capture multimodal cooperation patterns as they collapse into single behaviors.

Method: Proposed PADiff, a diffusion-based policy that integrates predictive information about teammates into the denoising process to handle non-stationary AHT scenarios.

Result: Extensive experiments across three cooperation environments demonstrate that PADiff significantly outperforms existing AHT methods.

Conclusion: Diffusion-based approaches with integrated predictive information effectively capture multimodal behaviors and enable diverse cooperation modes in ad hoc teamwork scenarios.

Abstract: Ad hoc teamwork (AHT) requires agents to collaborate with previously unseen teammates, which is crucial for many real-world applications. The core challenge of AHT is to develop an ego agent that can predict and adapt to unknown teammates on the fly. Conventional RL-based approaches optimize a single expected return, which often causes policies to collapse into a single dominant behavior, thus failing to capture the multimodal cooperation patterns inherent in AHT. In this work, we introduce PADiff, a diffusion-based approach that captures agent’s multimodal behaviors, unlocking its diverse cooperation modes with teammates. However, standard diffusion models lack the ability to predict and adapt in highly non-stationary AHT scenarios. To address this limitation, we propose a novel diffusion-based policy that integrates critical predictive information about teammates into the denoising process. Extensive experiments across three cooperation environments demonstrate that PADiff outperforms existing AHT methods significantly.

[570] AgenticSciML: Collaborative Multi-Agent Systems for Emergent Discovery in Scientific Machine Learning

Qile Jiang, George Karniadakis

Main category: cs.AI

TL;DR: AgenticSciML is a multi-agent AI system that collaboratively designs Scientific Machine Learning solutions, outperforming human-designed methods by up to 4 orders of magnitude in error reduction.

Details

Motivation: Current SciML design requires extensive expert-driven experimentation and problem-specific insights, making it inefficient and limited by human expertise.

Method: A collaborative system with over 10 specialized AI agents that use structured debate, retrieval-augmented memory, and ensemble-guided evolutionary search to propose, critique, and refine SciML solutions.

Result: The framework discovered novel strategies including adaptive mixture-of-expert architectures, decomposition-based PINNs, and physics-informed operator learning models that outperform baselines by up to 4 orders of magnitude.

Conclusion: Collaborative reasoning among AI agents enables emergent methodological innovation, suggesting a path toward scalable and autonomous discovery in scientific computing.

Abstract: Scientific Machine Learning (SciML) integrates data-driven inference with physical modeling to solve complex problems in science and engineering. However, the design of SciML architectures, loss formulations, and training strategies remains an expert-driven research process, requiring extensive experimentation and problem-specific insights. Here we introduce AgenticSciML, a collaborative multi-agent system in which over 10 specialized AI agents collaborate to propose, critique, and refine SciML solutions through structured reasoning and iterative evolution. The framework integrates structured debate, retrieval-augmented method memory, and ensemble-guided evolutionary search, enabling the agents to generate and assess new hypotheses about architectures and optimization procedures. Across physics-informed learning and operator learning tasks, the framework discovers solution methods that outperform single-agent and human-designed baselines by up to four orders of magnitude in error reduction. The agents produce novel strategies – including adaptive mixture-of-expert architectures, decomposition-based PINNs, and physics-informed operator learning models – that do not appear explicitly in the curated knowledge base. These results show that collaborative reasoning among AI agents can yield emergent methodological innovation, suggesting a path toward scalable, transparent, and autonomous discovery in scientific computing.

[571] Beyond Detection: Exploring Evidence-based Multi-Agent Debate for Misinformation Intervention and Persuasion

Chen Han, Yijia Ma, Jin Tan, Wenzhen Zheng, Xijin Tang

Main category: cs.AI

TL;DR: ED2D is an evidence-based multi-agent debate framework for misinformation detection that not only detects misinformation but also generates persuasive debunking explanations to correct user beliefs and discourage misinformation sharing.

Details

Motivation: Prior multi-agent debate frameworks focused only on detection accuracy but overlooked helping users understand reasoning behind factual judgments and develop future resilience against misinformation.

Method: ED2D extends previous approaches by incorporating factual evidence retrieval and is designed as a persuasive multi-agent system that generates debunking transcripts through evidence-based debates.

Result: ED2D outperforms existing baselines across three misinformation detection benchmarks. When correct, its debunking transcripts show persuasive effects comparable to human experts, but when incorrect, they may reinforce misconceptions even alongside accurate human explanations.

Conclusion: ED2D shows promise for misinformation intervention but also reveals risks when systems misclassify, highlighting the need for careful deployment. A public community website was developed to foster transparency and critical thinking.

Abstract: Multi-agent debate (MAD) frameworks have emerged as promising approaches for misinformation detection by simulating adversarial reasoning. While prior work has focused on detection accuracy, it overlooks the importance of helping users understand the reasoning behind factual judgments and develop future resilience. The debate transcripts generated during MAD offer a rich but underutilized resource for transparent reasoning. In this study, we introduce ED2D, an evidence-based MAD framework that extends previous approach by incorporating factual evidence retrieval. More importantly, ED2D is designed not only as a detection framework but also as a persuasive multi-agent system aimed at correcting user beliefs and discouraging misinformation sharing. We compare the persuasive effects of ED2D-generated debunking transcripts with those authored by human experts. Results demonstrate that ED2D outperforms existing baselines across three misinformation detection benchmarks. When ED2D generates correct predictions, its debunking transcripts exhibit persuasive effects comparable to those of human experts; However, when ED2D misclassifies, its accompanying explanations may inadvertently reinforce users’misconceptions, even when presented alongside accurate human explanations. Our findings highlight both the promise and the potential risks of deploying MAD systems for misinformation intervention. We further develop a public community website to help users explore ED2D, fostering transparency, critical thinking, and collaborative fact-checking.

[572] IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction

Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou

Main category: cs.AI

TL;DR: IterResearch introduces an iterative deep-research paradigm that reformulates long-horizon research as a Markov Decision Process with strategic workspace reconstruction, overcoming context suffocation and noise contamination in existing mono-contextual approaches.

Details

Motivation: Existing deep-research agents rely on mono-contextual paradigms that accumulate all information in a single expanding context window, leading to context suffocation and noise contamination that limit effectiveness on long-horizon tasks.

Method: Maintains an evolving report as memory and periodically synthesizes insights; develops Efficiency-Aware Policy Optimization (EAPO) with geometric reward discounting and adaptive downsampling for stable distributed training.

Result: Achieves +14.5pp average improvement across six benchmarks, extends to 2048 interactions with dramatic performance gains (3.5% to 42.5%), and serves as effective prompting strategy improving frontier models by up to 19.2pp over ReAct.

Conclusion: IterResearch is a versatile solution for long-horizon reasoning, effective both as a trained agent and as a prompting paradigm for frontier models, demonstrating unprecedented interaction scaling and performance improvements.

Abstract: Recent advances in deep-research agents have shown promise for autonomous knowledge construction through dynamic reasoning over external sources. However, existing approaches rely on a mono-contextual paradigm that accumulates all information in a single, expanding context window, leading to context suffocation and noise contamination that limit their effectiveness on long-horizon tasks. We introduce IterResearch, a novel iterative deep-research paradigm that reformulates long-horizon research as a Markov Decision Process with strategic workspace reconstruction. By maintaining an evolving report as memory and periodically synthesizing insights, our approach preserves consistent reasoning capacity across arbitrary exploration depths. We further develop Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning framework that incentivizes efficient exploration through geometric reward discounting and enables stable distributed training via adaptive downsampling. Extensive experiments demonstrate that IterResearch achieves substantial improvements over existing open-source agents with average +14.5pp across six benchmarks and narrows the gap with frontier proprietary systems. Remarkably, our paradigm exhibits unprecedented interaction scaling, extending to 2048 interactions with dramatic performance gains (from 3.5% to 42.5%), and serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks. These findings position IterResearch as a versatile solution for long-horizon reasoning, effective both as a trained agent and as a prompting paradigm for frontier models.

[573] DeepPersona: A Generative Engine for Scaling Deep Synthetic Personas

Zhen Wang, Yufan Zhou, Zhongyan Luo, Lyumanshan Ye, Adam Wood, Man Yao, Luoshang Pan

Main category: cs.AI

TL;DR: DEEPPERSONA is a scalable generative engine that creates rich, narrative-complete synthetic personas using a two-stage, taxonomy-guided method, significantly outperforming existing approaches in diversity, uniqueness, and real-world application performance.

Details

Motivation: Existing synthetic personas are too shallow and simplistic, failing to capture the rich complexity and diversity of real human identities needed for effective agentic behavioral simulation, LLM personalization, and human-AI alignment research.

Method: Two-stage approach: 1) Algorithmically construct the largest human-attribute taxonomy from thousands of real user-ChatGPT conversations, 2) Progressively sample attributes from this taxonomy to conditionally generate coherent personas with hundreds of structured attributes and ~1MB of narrative text.

Result: Significant improvements: 32% higher attribute diversity, 44% greater profile uniqueness, 11.6% average improvement in personalized question answering accuracy, 31.7% reduction in gap between simulated LLM citizens and human responses, and 17% reduction in Big Five personality test performance gap.

Conclusion: DEEPPERSONA provides a rigorous, scalable, privacy-free platform for high-fidelity human simulation and personalized AI research, enabling more realistic and diverse synthetic personas than previous methods.

Abstract: Simulating human profiles by instilling personas into large language models (LLMs) is rapidly transforming research in agentic behavioral simulation, LLM personalization, and human-AI alignment. However, most existing synthetic personas remain shallow and simplistic, capturing minimal attributes and failing to reflect the rich complexity and diversity of real human identities. We introduce DEEPPERSONA, a scalable generative engine for synthesizing narrative-complete synthetic personas through a two-stage, taxonomy-guided method. First, we algorithmically construct the largest-ever human-attribute taxonomy, comprising over hundreds of hierarchically organized attributes, by mining thousands of real user-ChatGPT conversations. Second, we progressively sample attributes from this taxonomy, conditionally generating coherent and realistic personas that average hundreds of structured attributes and roughly 1 MB of narrative text, two orders of magnitude deeper than prior works. Intrinsic evaluations confirm significant improvements in attribute diversity (32 percent higher coverage) and profile uniqueness (44 percent greater) compared to state-of-the-art baselines. Extrinsically, our personas enhance GPT-4.1-mini’s personalized question answering accuracy by 11.6 percent on average across ten metrics and substantially narrow (by 31.7 percent) the gap between simulated LLM citizens and authentic human responses in social surveys. Our generated national citizens reduced the performance gap on the Big Five personality test by 17 percent relative to LLM-simulated citizens. DEEPPERSONA thus provides a rigorous, scalable, and privacy-free platform for high-fidelity human simulation and personalized AI research.

[574] DigiData: Training and Evaluating General-Purpose Mobile Control Agents

Yuxuan Sun, Manchen Wang, Shengyi Qian, William R. Wong, Eric Gan, Pierluca D’Oro, Alejandro Castillejo Munoz, Sneha Silwal, Pedro Matias, Nitin Kamra, Satwik Kottur, Nick Raines, Xuanyi Zhao, Joy Chen, Joseph Greer, Andrea Madotto, Allen Bolourchi, James Valori, Kevin Carlberg, Karl Ridgeway, Joseph Tighe

Main category: cs.AI

TL;DR: DigiData is a large-scale, high-quality mobile control dataset with complex goals, plus DigiData-Bench benchmark with dynamic/AI evaluation protocols to better assess agent performance.

Details

Motivation: To accelerate development of AI agents for controlling user interfaces by providing high-quality datasets and robust evaluation methods for mobile control agents.

Method: Created DigiData dataset through comprehensive exploration of app features (unlike unstructured interactions), and developed DigiData-Bench benchmark with dynamic evaluation protocols and AI-powered assessments.

Result: Produces a diverse, multi-modal dataset with higher goal complexity than existing datasets, and introduces more reliable evaluation methods beyond step-accuracy metrics.

Conclusion: These contributions significantly advance mobile control agent development, enabling more intuitive human-device interactions through better training data and evaluation frameworks.

Abstract: AI agents capable of controlling user interfaces have the potential to transform human interaction with digital devices. To accelerate this transformation, two fundamental building blocks are essential: high-quality datasets that enable agents to achieve complex and human-relevant goals, and robust evaluation methods that allow researchers and practitioners to rapidly enhance agent performance. In this paper, we introduce DigiData, a large-scale, high-quality, diverse, multi-modal dataset designed for training mobile control agents. Unlike existing datasets, which derive goals from unstructured interactions, DigiData is meticulously constructed through comprehensive exploration of app features, resulting in greater diversity and higher goal complexity. Additionally, we present DigiData-Bench, a benchmark for evaluating mobile control agents on real-world complex tasks. We demonstrate that the commonly used step-accuracy metric falls short in reliably assessing mobile control agents and, to address this, we propose dynamic evaluation protocols and AI-powered evaluations as rigorous alternatives for agent assessment. Our contributions aim to significantly advance the development of mobile control agents, paving the way for more intuitive and effective human-device interactions.

[575] Logic Distillation: Learning from Code Function by Function for Decision-making Tasks

Dong Chen, Shilin Zhang, Fei Gao, Yueting Zhuang, Siliang Tang, Qidong Liu, Mingliang Xu

Main category: cs.AI

TL;DR: Logic Distillation (LD) framework enables small LLMs to achieve logical reasoning capabilities comparable to large LLMs by learning to use discrete functions created by large LLMs.

Details

Motivation: Current knowledge distillation methods fail to transfer powerful logical reasoning capabilities from large LLMs to small LLMs, leaving small LLMs unable to handle planning and decision-making tasks.

Method: LD first uses large LLMs to create discrete functions and their usage examples, then fine-tunes small LLMs to learn the logic behind function selection and invocation based on instructions and current states.

Result: Experiments show that small LLMs equipped with LD achieve outstanding results in planning and decision-making tasks, comparable to or even surpassing large LLMs.

Conclusion: The Logic Distillation framework successfully bridges the logical reasoning capability gap between large and small LLMs, enabling efficient deployment of reasoning capabilities on resource-constrained devices.

Abstract: Large language models (LLMs) have garnered increasing attention owing to their powerful logical reasoning capabilities. Generally, larger LLMs (L-LLMs) that require paid interfaces exhibit significantly superior performance compared to smaller LLMs (S-LLMs) that can be deployed on a variety of devices. Knowledge distillation (KD) aims to empower S-LLMs with the capabilities of L-LLMs, while S-LLMs merely mimic the outputs of L-LLMs, failing to get the powerful logical reasoning capabilities. Consequently, S-LLMs are helpless when it comes to planning and decision-making tasks that require logical reasoning capabilities. To tackle the identified challenges, we propose a novel framework called Logic Distillation (LD). Initially, LD employs L-LLMs to instantiate complex instructions into discrete functions and illustrates their usage to establish a function base. Subsequently, based on the function base, LD fine-tunes S-LLMs to learn the logic employed by L-LLMs in planning and decision-making. During testing, LD utilizes a retriever to identify the top-$K$ relevant functions based on instructions and current states, which will be selected and invoked by S-LLMs. Ultimately, S-LLMs yield planning and decision-making outcomes, function by function. Relevant experiments demonstrate that with the assistance of LD, S-LLMs can achieve outstanding results in planning and decision-making tasks, comparable to, or even surpassing, those of L-LLMs.

[576] Conceptual Belief-Informed Reinforcement Learning

Xingrui Gu, Chuyi Jiang, Laixi Shi

Main category: cs.AI

TL;DR: HI-RL introduces a human-inspired reinforcement learning paradigm that uses conceptual abstraction and probabilistic beliefs to improve sample efficiency and performance in existing RL algorithms.

Details

Motivation: Current RL methods are inefficient and unstable, requiring large amounts of trial-and-error data, while humans learn efficiently by abstracting concepts and updating probabilistic beliefs using uncertainty and prior knowledge.

Method: HI-RL forms concepts by extracting high-level categories of critical environmental information and constructs adaptive concept-associated probabilistic beliefs as experience priors to guide value or policy updates.

Result: When integrated into various RL algorithms (DQN, PPO, SAC, TD3), HI-RL consistently improved sample efficiency and performance across both discrete and continuous control benchmarks.

Conclusion: HI-RL successfully emulates human intelligence in RL by using conceptual abstraction and probabilistic beliefs, providing an efficient experience utilization paradigm that can be directly integrated into existing RL frameworks.

Abstract: Reinforcement learning (RL) has achieved significant success but is hindered by inefficiency and instability, relying on large amounts of trial-and-error data and failing to efficiently use past experiences to guide decisions. However, humans achieve remarkably efficient learning from experience, attributed to abstracting concepts and updating associated probabilistic beliefs by integrating both uncertainty and prior knowledge, as observed by cognitive science. Inspired by this, we introduce Conceptual Belief-Informed Reinforcement Learning to emulate human intelligence (HI-RL), an efficient experience utilization paradigm that can be directly integrated into existing RL frameworks. HI-RL forms concepts by extracting high-level categories of critical environmental information and then constructs adaptive concept-associated probabilistic beliefs as experience priors to guide value or policy updates. We evaluate HI-RL by integrating it into various existing value- and policy-based algorithms (DQN, PPO, SAC, and TD3) and demonstrate consistent improvements in sample efficiency and performance across both discrete and continuous control benchmarks.

[577] GlitchMiner: Mining Glitch Tokens in Large Language Models via Gradient-based Discrete Optimization

Zihui Wu, Haichang Gao, Ping Wang, Shudong Zhang, Zhaoxiang Liu, Shiguo Lian

Main category: cs.AI

TL;DR: GlitchMiner is a behavior-driven framework that identifies glitch tokens in LLMs by maximizing predictive entropy using gradient-guided local search, outperforming existing methods in accuracy and efficiency.

Details

Motivation: Existing glitch token detection methods rely on heuristic embedding patterns or statistical anomalies, limiting generalizability across model architectures and potentially missing anomalies that deviate from observed patterns.

Method: GlitchMiner uses a gradient-guided local search strategy to efficiently explore discrete token space without model-specific heuristics or large-batch sampling, focusing on maximizing predictive entropy to identify anomalous behavior.

Result: Extensive experiments across ten LLMs from five major model families show GlitchMiner consistently outperforms existing approaches in detection accuracy and query efficiency.

Conclusion: GlitchMiner provides a generalizable and scalable solution for effective glitch token discovery, offering improved reliability and safety for LLMs.

Abstract: Glitch tokens, inputs that trigger unpredictable or anomalous behavior in Large Language Models (LLMs), pose significant challenges to model reliability and safety. Existing detection methods primarily rely on heuristic embedding patterns or statistical anomalies within internal representations, limiting their generalizability across different model architectures and potentially missing anomalies that deviate from observed patterns. We introduce GlitchMiner, an behavior-driven framework designed to identify glitch tokens by maximizing predictive entropy. Leveraging a gradient-guided local search strategy, GlitchMiner efficiently explores the discrete token space without relying on model-specific heuristics or large-batch sampling. Extensive experiments across ten LLMs from five major model families demonstrate that GlitchMiner consistently outperforms existing approaches in detection accuracy and query efficiency, providing a generalizable and scalable solution for effective glitch token discovery. Code is available at [https://github.com/wooozihu/GlitchMiner]

[578] SEAGraph: Unveiling the Whole Story of Paper Review Comments

Jianxiang Yu, Jiaqi Tan, Zichen Ding, Jiapeng Zhu, Jiahao Li, Yao Cheng, Qier Cui, Yunshi Lan, Yao Liu, Xiang Li

Main category: cs.AI

TL;DR: SEAGraph is a framework that clarifies peer review comments by analyzing authors’ thought processes and research context to help authors better understand reviewer feedback.

Details

Motivation: Traditional peer review often provides vague feedback that doesn't help authors improve their work effectively, leading to longer review cycles and limited assistance.

Method: Constructs two graphs: semantic mind graph (captures authors’ thought process) and hierarchical background graph (delineates research domains), then uses retrieval to extract relevant content for explaining review comments.

Result: Extensive experiments show SEAGraph excels in review comment understanding tasks and offers significant benefits to authors.

Conclusion: SEAGraph bridges the gap between reviewers’ critiques and authors’ comprehension, contributing to a more efficient, transparent and collaborative scientific publishing ecosystem.

Abstract: Peer review, as a cornerstone of scientific research, ensures the integrity and quality of scholarly work by providing authors with objective feedback for refinement. However, in the traditional peer review process, authors often receive vague or insufficiently detailed feedback, which provides limited assistance and leads to a more time-consuming review cycle. If authors can identify some specific weaknesses in their paper, they can not only address the reviewer’s concerns but also improve their work. This raises the critical question of how to enhance authors’ comprehension of review comments. In this paper, we present SEAGraph, a novel framework developed to clarify review comments by uncovering the underlying intentions behind them. We construct two types of graphs for each paper: the semantic mind graph, which captures the authors’ thought process, and the hierarchical background graph, which delineates the research domains related to the paper. A retrieval method is then designed to extract relevant content from both graphs, facilitating coherent explanations for the review comments. Extensive experiments show that SEAGraph excels in review comment understanding tasks, offering significant benefits to authors. By bridging the gap between reviewers’ critiques and authors’ comprehension, SEAGraph contributes to a more efficient, transparent and collaborative scientific publishing ecosystem.

[579] ImitDiff: Transferring Foundation-Model Priors for Distraction Robust Visuomotor Policy

Yuhang Dong, Haizhou Ge, Yupei Zeng, Jiangning Zhang, Beiwen Tian, Hongrui Zhu, Yufei Jia, Ruixiang Wang, Zhucun Xue, Guyue Zhou, Longhua Ma, Guanzhong Tian

Main category: cs.AI

TL;DR: ImitDiff is a diffusion-based imitation learning policy that uses vision-language foundation models to create semantic masks for guiding dual-resolution perception, enabling better performance in complex scenes with visual distractions.

Details

Motivation: As scene complexity and visual distractions increase, existing visuomotor imitation learning policies suffer from performance degradation. There's a need for policies that can maintain performance in complex environments.

Method: Uses pretrained vision-language foundation models to transform instructions into pixel-level semantic masks. Implements dual-resolution perception (global context from low-res, local features from high-res) and a consistency-driven diffusion transformer action head.

Result: Outperforms state-of-the-art vision-language manipulation frameworks and visuomotor imitation learning policies, especially in complex scenes. Shows strong zero-shot generalization with novel objects and distractions. Action head achieves order-of-magnitude speed improvement while maintaining success rates.

Conclusion: ImitDiff effectively addresses performance degradation in complex scenes through semantic-guided dual-resolution perception and achieves significant improvements in both performance and inference speed.

Abstract: Visuomotor imitation learning policies enable robots to efficiently acquire manipulation skills from visual demonstrations. However, as scene complexity and visual distractions increase, policies that perform well in simple settings often experience substantial performance degradation. To address this challenge, we propose ImitDiff, a diffusion-based imitation learning policy guided by fine-grained semantics within a dual-resolution workflow. Leveraging pretrained priors of vision-language foundation models, our method transforms high-level instructions into pixel-level visual semantic masks. These masks guide a dual-resolution perception pipeline that captures both global context (e.g., overall layout) from low-resolution observation and fine-grained local features (e.g., geometric details) from high-resolution observation, enabling the policy to focus on task-relevant regions. Additionally, we introduce a consistency-driven diffusion transformer action head that bridges visual semantic conditions and real-time action generation. Extensive experiments demonstrate that ImitDiff outperforms state-of-the-art vision-language manipulation frameworks, as well as visuomotor imitation learning policies, particularly under increased scene complexity and visual distractions. Notably, ImitDiff exhibits strong generalization in zero-shot settings involving novel objects and visual distractions. Furthermore, our consistency-driven action head achieves an order-of-magnitude improvement in inference speed while maintaining competitive success rates.

[580] CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

Zhihang Lin, Mingbao Lin, Yuan Xie, Rongrong Ji

Main category: cs.AI

TL;DR: CPPO accelerates GRPO training by pruning low-advantage completions and using dynamic completion allocation, achieving up to 7.98x speedup while maintaining or improving accuracy.

Details

Motivation: GRPO requires sampling multiple completions per question, which is computationally expensive and time-consuming, with not all completions equally contributing to training.

Method: Prunes completions with low absolute advantages to reduce gradient calculations, and uses dynamic completion allocation to maximize GPU utilization by adding more questions.

Result: Achieves up to 7.98x speedup on GSM8K and 3.48x on Math while preserving or enhancing accuracy compared to original GRPO.

Conclusion: CPPO effectively reduces training costs of reasoning models while maintaining performance, making GRPO-based training more efficient.

Abstract: This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs due to the need to sample multiple completions for each question. Our experiment and theoretical analysis reveal that the number of completions impacts model accuracy yet increases training time multiplicatively, and not all completions contribute equally to policy training – their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy to maximize GPU utilization by incorporating additional questions, further enhancing training efficiency. Experiments show that CPPO achieves up to $7.98\times$ speedup on GSM8K and $3.48\times$ on Math while preserving or even enhancing the accuracy compared to the original GRPO. We release our code at \href{https://github.com/lzhxmu/CPPO}{https://github.com/lzhxmu/CPPO}.

[581] Collaborative LLM Numerical Reasoning with Local Data Protection

Min Zhang, Yuzhe Lu, Yun Zhou, Panpan Xu, Lin Lee Cheong, Chang-Tien Lu, Haozhu Wang

Main category: cs.AI

TL;DR: A model collaboration framework that enables low-capacity local models to perform numerical reasoning while protecting sensitive data by using context-aware query synthesis and tool-based answer reconstruction.

Details

Motivation: Numerical reasoning over documents is challenging for low-capacity local models on constrained devices, but routing queries to powerful remote models like GPT-4 raises data leakage concerns. Existing methods struggle with generating logically equivalent queries and accurate inference.

Method: Two key innovations: (1) context-aware synthesis strategy that shifts query topics while preserving reasoning patterns, and (2) tool-based answer reconstruction that reuses remote-generated plug-and-play solutions with code snippets.

Result: Achieves better reasoning accuracy than solely using local models while providing stronger data protection than fully relying on remote models. Improves accuracy by 16.2%-43.6% and reduces data leakage by 2.3%-44.6% compared to existing approaches.

Conclusion: The proposed framework successfully balances the trade-off between reasoning performance and data privacy, enabling effective numerical reasoning on constrained devices without exposing sensitive local data to remote models.

Abstract: Numerical reasoning over documents, which demands both contextual understanding and logical inference, is challenging for low-capacity local models deployed on computation-constrained devices. Although such complex reasoning queries could be routed to powerful remote models like GPT-4, exposing local data raises significant data leakage concerns. Existing mitigation methods generate problem descriptions or examples for remote assistance. However, the inherent complexity of numerical reasoning hinders the local model from generating logically equivalent queries and accurately inferring answers with remote guidance. In this paper, we present a model collaboration framework with two key innovations: (1) a context-aware synthesis strategy that shifts the query topics while preserving reasoning patterns; and (2) a tool-based answer reconstruction approach that reuses the remote-generated plug-and-play solution with code snippets. Experimental results demonstrate that our method achieves better reasoning accuracy than solely using local models while providing stronger data protection than fully relying on remote models. Furthermore, our method improves accuracy by 16.2% - 43.6% while reducing data leakage by 2.3% - 44.6% compared to existing data protection approaches.

[582] DrKGC: Dynamic Subgraph Retrieval-Augmented LLMs for Knowledge Graph Completion across General and Biomedical Domains

Yongkang Xiao, Sinian Zhang, Yi Dai, Huixue Zhou, Jue Hou, Jie Ding, Rui Zhang

Main category: cs.AI

TL;DR: DrKGC is a novel KGC method that combines LLMs with dynamic subgraph retrieval and structural embeddings to better leverage graph structure information for knowledge graph completion.

Details

Motivation: Current LLM-based approaches for KGC encode graph context as text, failing to fully utilize LLMs' potential for graph structure perception and reasoning.

Method: Uses lightweight model training for structural embeddings and logical rules, bottom-up graph retrieval guided by rules, GCN adapter to enhance embeddings, and integrates structural information into prompts for LLM fine-tuning.

Result: Superior performance on two general domain and two biomedical benchmark datasets, with interpretability demonstrated through biomedical case study.

Conclusion: DrKGC effectively bridges the gap between LLMs and graph structures, achieving better KGC performance while maintaining interpretability and practical utility.

Abstract: Knowledge graph completion (KGC) aims to predict missing triples in knowledge graphs (KGs) by leveraging existing triples and textual information. Recently, generative large language models (LLMs) have been increasingly employed for graph tasks. However, current approaches typically encode graph context in textual form, which fails to fully exploit the potential of LLMs for perceiving and reasoning about graph structures. To address this limitation, we propose DrKGC (Dynamic Subgraph Retrieval-Augmented LLMs for Knowledge Graph Completion). DrKGC employs a flexible lightweight model training strategy to learn structural embeddings and logical rules within the KG. It then leverages a novel bottom-up graph retrieval method to extract a subgraph for each query guided by the learned rules. Finally, a graph convolutional network (GCN) adapter uses the retrieved subgraph to enhance the structural embeddings, which are then integrated into the prompt for effective LLM fine-tuning. Experimental results on two general domain benchmark datasets and two biomedical datasets demonstrate the superior performance of DrKGC. Furthermore, a realistic case study in the biomedical domain highlights its interpretability and practical utility.

[583] LLM-Powered Swarms: A New Frontier or a Conceptual Stretch?

Muhammad Atta Ur Rahman, Melanie Schranz, Samira Hayat

Main category: cs.AI

TL;DR: LLM-powered swarms can emulate swarm intelligence but suffer from 300x computational overhead compared to classical approaches, limiting real-time applications.

Details

Motivation: To evaluate whether LLM-powered swarm systems like OpenAI's Swarm framework truly capture fundamental principles of classical swarm intelligence: decentralization, simplicity, emergence, and scalability.

Method: Implemented and compared classical vs LLM-based versions of two established swarm algorithms (Boids and Ant Colony Optimization) using OpenAI’s Swarm framework.

Result: LLM-powered swarms can emulate swarm-like dynamics but require roughly 300x more computation time than classical counterparts, with LLM-based Boids simulation showing particularly high computational costs.

Conclusion: While LLM-driven swarms can replicate swarm intelligence behaviors, current computational limitations make them impractical for real-time systems due to excessive overhead.

Abstract: Swarm intelligence describes how simple, decentralized agents can collectively produce complex behaviors. Recently, the concept of swarming has been extended to large language model (LLM)-powered systems, such as OpenAI’s Swarm (OAS) framework, where agents coordinate through natural language prompts. This paper evaluates whether such systems capture the fundamental principles of classical swarm intelligence: decentralization, simplicity, emergence, and scalability. Using OAS, we implement and compare classical and LLM-based versions of two well-established swarm algorithms: Boids and Ant Colony Optimization. Results indicate that while LLM-powered swarms can emulate swarm-like dynamics, they are constrained by substantial computational overhead. For instance, our LLM-based Boids simulation required roughly 300x more computation time than its classical counterpart, highlighting current limitations in applying LLM-driven swarms to real-time systems.

[584] Large model retrieval enhancement framework for construction site risk identification

Jiawei Li, Chengye Yang, Yaochen Zhang, Weilin Sun, Lei Meng, Xiangxu Meng

Main category: cs.AI

TL;DR: Proposes a retrieval-augmented framework that enhances LLMs for construction hazard identification without fine-tuning, achieving 50% accuracy (35.49% improvement) by integrating external knowledge and similar cases via prompt tuning.

Details

Motivation: Current LLM-based approaches for construction hazard identification face limitations: image-text matching struggles with complex hazards, while instruction tuning lacks generalization and is resource-intensive.

Method: A framework with case database, image retrieval module, and LLM-based reasoning module that dynamically integrates external knowledge and retrieved similar cases via prompt tuning, using LPIPS- and CLIP-based retrieval strategy.

Result: Boosted GLM-4V’s accuracy to 50% on real-site data, a 35.49% improvement over baselines, with consistent gains across hazard types. Ablation studies validated the effectiveness of the image retrieval strategy.

Conclusion: The proposed technique significantly improves identification accuracy and contextual understanding, demonstrating strong generalization and offering a practical path for intelligent safety risk detection in construction.

Abstract: This study addresses construction site hazard identification by proposing a retrieval-augmented framework that enhances large language models (LLMs) without requiring fine-tuning. Current LLM-based approaches face limitations: image-text matching struggles with complex hazards, while instruction tuning lacks generalization and is resource-intensive. Our method dynamically integrates external knowledge and retrieved similar cases via prompt tuning, overcoming LLMs’ limitations in domain knowledge and feature correlation. The framework comprises a case database, an image retrieval module, and an LLM-based reasoning module. Evaluated on real-site data, our approach boosted GLM-4V’s accuracy to 50%, a 35.49% improvement over baselines, with consistent gains across hazard types. Ablation studies validated the effectiveness of our image retrieval strategy, showing the superiority of our LPIPS- and CLIP-based method. The proposed technique significantly improves identification accuracy and contextual understanding, demonstrating strong generalization and offering a practical path for intelligent safety risk detection in construction.

[585] GVGAI-LLM: Evaluating Large Language Model Agents with Infinite Games

Yuchen Li, Cong Lin, Muhammad Umair Nasir, Philip Bontrager, Jialin Liu, Julian Togelius

Main category: cs.AI

TL;DR: GVGAI-LLM is a video game benchmark using ASCII-based arcade games to evaluate LLMs’ spatial reasoning and planning capabilities, revealing persistent limitations despite some improvements from structured prompting.

Details

Motivation: To create a benchmark that tests LLMs' reasoning and problem-solving abilities in tasks different from existing benchmarks, focusing on preventing overfitting through diverse game generation.

Method: Built on General Video Game AI framework using game description language for rapid game creation, representing game scenes as ASCII characters, with zero-shot evaluations across diverse games and levels.

Result: LLMs show persistent limitations in spatial reasoning and basic planning, consistently making spatial and logical errors. Structured prompting and spatial grounding provide partial improvements but benchmark remains unsolved.

Conclusion: GVGAI-LLM provides a reproducible testbed for advancing research on language model capabilities, particularly for agentic behavior and contextual reasoning, highlighting significant remaining challenges.

Abstract: We introduce GVGAI-LLM, a video game benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). Built on the General Video Game AI framework, it features a diverse collection of arcade-style games designed to test a model’s ability to handle tasks that differ from most existing LLM benchmarks. The benchmark leverages a game description language that enables rapid creation of new games and levels, helping to prevent overfitting over time. Each game scene is represented by a compact set of ASCII characters, allowing for efficient processing by language models. GVGAI-LLM defines interpretable metrics, including the meaningful step ratio, step efficiency, and overall score, to assess model behavior. Through zero-shot evaluations across a broad set of games and levels with diverse challenges and skill depth, we reveal persistent limitations of LLMs in spatial reasoning and basic planning. Current models consistently exhibit spatial and logical errors, motivating structured prompting and spatial grounding techniques. While these interventions lead to partial improvements, the benchmark remains very far from solved. GVGAI-LLM provides a reproducible testbed for advancing research on language model capabilities, with a particular emphasis on agentic behavior and contextual reasoning.

[586] The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita-Velez, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Jun Wang, Shuicheng Yan, Philip Torr, Lei Bai

Main category: cs.AI

TL;DR: Agentic RL transforms LLMs from passive generators into autonomous agents using extended POMDPs, with taxonomy based on capabilities and applications, positioning RL as key for adaptive behavior.

Details

Motivation: To formalize the paradigm shift from LLM-RL to Agentic RL, highlighting the transition from single-step MDPs to complex POMDPs for autonomous decision-making.

Method: Proposed a twofold taxonomy around agentic capabilities (planning, tool use, memory, reasoning, self-improvement, perception) and applications, synthesizing over 500 works.

Result: Consolidated open-source environments, benchmarks, and frameworks into a practical compendium to accelerate future research in scalable AI agents.

Conclusion: Reinforcement learning is crucial for developing adaptive, robust agentic behavior, with identified opportunities and challenges for general-purpose AI agents.

Abstract: The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.

[587] Scaling Up without Fading Out: Goal-Aware Sparse GNN for RL-based Generalized Planning

Sangwoo Jeon, Juchul Shin, Gyeong-Tae Kim, YeonJe Cho, Seongwoo Kim

Main category: cs.AI

TL;DR: Proposes a sparse, goal-aware GNN representation for generalized planning to overcome limitations of dense graph approaches in large grid-based environments.

Details

Motivation: Existing dense graph representations in RL+GNN planning suffer from combinatorial explosion, memory inefficiency, and poor scalability in large grid environments, making learning infeasible for larger problems.

Method: Developed a sparse GNN that selectively encodes relevant local relationships and explicitly integrates spatial goal features, validated using novel drone mission scenarios in PDDL-based grid worlds.

Result: Method scales effectively to larger grid sizes previously infeasible with dense graphs, and substantially improves policy generalization and success rates in drone mission scenarios.

Conclusion: Provides a practical foundation for addressing realistic, large-scale generalized planning tasks by overcoming scalability limitations of dense graph representations.

Abstract: Generalized planning using deep reinforcement learning (RL) combined with graph neural networks (GNNs) has shown promising results in various symbolic planning domains described by PDDL. However, existing approaches typically represent planning states as fully connected graphs, leading to a combinatorial explosion in edge information and substantial sparsity as problem scales grow, especially evident in large grid-based environments. This dense representation results in diluted node-level information, exponentially increases memory requirements, and ultimately makes learning infeasible for larger-scale problems. To address these challenges, we propose a sparse, goal-aware GNN representation that selectively encodes relevant local relationships and explicitly integrates spatial features related to the goal. We validate our approach by designing novel drone mission scenarios based on PDDL within a grid world, effectively simulating realistic mission execution environments. Our experimental results demonstrate that our method scales effectively to larger grid sizes previously infeasible with dense graph representations and substantially improves policy generalization and success rates. Our findings provide a practical foundation for addressing realistic, large-scale generalized planning tasks.

[588] Tree-Guided Diffusion Planner

Hyeonseong Jeon, Cheolhong Min, Jaesik Park

Main category: cs.AI

TL;DR: TDP is a zero-shot test-time planning framework that uses tree-guided diffusion to balance exploration and exploitation, outperforming state-of-the-art methods on various tasks without task-specific training.

Details

Motivation: Standard gradient guidance struggles with non-convex objectives, non-differentiable constraints, and multi-reward structures in real-world scenarios, while supervised approaches lack test-time flexibility and zero-shot generalization.

Method: Tree-guided Diffusion Planner (TDP) frames planning as tree search with bi-level sampling: diverse parent trajectories via training-free particle guidance for exploration, and sub-trajectory refinement through fast conditional denoising guided by task objectives.

Result: TDP consistently outperforms state-of-the-art approaches on maze gold-picking, robot arm block manipulation, and AntMaze multi-goal exploration tasks.

Conclusion: TDP effectively addresses gradient guidance limitations by exploring diverse trajectory regions and leveraging gradient information across expanded solution space using only pretrained models and test-time rewards.

Abstract: Planning with pretrained diffusion models has emerged as a promising approach for solving test-time guided control problems. Standard gradient guidance typically performs optimally under convex, differentiable reward landscapes. However, it shows substantially reduced effectiveness in real-world scenarios with non-convex objectives, non-differentiable constraints, and multi-reward structures. Furthermore, recent supervised planning approaches require task-specific training or value estimators, which limits test-time flexibility and zero-shot generalization. We propose a Tree-guided Diffusion Planner (TDP), a zero-shot test-time planning framework that balances exploration and exploitation through structured trajectory generation. We frame test-time planning as a tree search problem using a bi-level sampling process: (1) diverse parent trajectories are produced via training-free particle guidance to encourage broad exploration, and (2) sub-trajectories are refined through fast conditional denoising guided by task objectives. TDP addresses the limitations of gradient guidance by exploring diverse trajectory regions and harnessing gradient information across this expanded solution space using only pretrained models and test-time reward signals. We evaluate TDP on three diverse tasks: maze gold-picking, robot arm block manipulation, and AntMaze multi-goal exploration. TDP consistently outperforms state-of-the-art approaches on all tasks. The project page can be found at: https://tree-diffusion-planner.github.io.

[589] DeepResearch Arena: The First Exam of LLMs’ Research Abilities via Seminar-Grounded Tasks

Haiyuan Wan, Chen Yang, Junchi Yu, Meiqi Tu, Jiaxuan Lu, Di Yu, Jianbao Cao, Ben Gao, Jiaqing Xie, Aoran Wang, Wenlong Zhang, Philip Torr, Dongzhan Zhou

Main category: cs.AI

TL;DR: DeepResearch Arena is a benchmark for evaluating deep research agents using academic seminar transcripts to create realistic research tasks across 12 disciplines, addressing data leakage issues in current evaluations.

Details

Motivation: Current evaluation of deep research agents is challenging due to difficulty in collecting frontier research questions that genuinely capture researchers' attention and intellectual curiosity, requiring more realistic research environments.

Method: Proposed Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts and translates them into high-quality research tasks, ensuring traceability while filtering noise.

Result: Curated DeepResearch Arena with over 10,000 high-quality research tasks from 200+ academic seminars across 12 disciplines, showing substantial challenges for current state-of-the-art agents with clear performance gaps.

Conclusion: DeepResearch Arena provides a more faithful evaluation benchmark for deep research agents by leveraging academic seminar discourse, better reflecting real-world research environments and reducing data leakage risks.

Abstract: Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows, spanning literature synthesis, methodological design, and empirical verification. Despite these strides, evaluating their research capability faithfully is rather challenging due to the difficulty of collecting frontier research questions that genuinely capture researchers’ attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars that capture rich expert discourse and interaction, better reflecting real-world research environments and reducing the risk of data leakage. To automatically construct DeepResearch Arena, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts. The MAHTG system further translates research-worthy inspirations into high-quality research tasks, ensuring the traceability of research task formulation while filtering noise. With the MAHTG system, we curate DeepResearch Arena with over 10,000 high-quality research tasks from over 200 academic seminars, spanning 12 disciplines, such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state-of-the-art agents, with clear performance gaps observed across different models.

[590] Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Maheep Chaudhary, Ian Su, Nikhil Hooda, Nishith Shankar, Julia Tan, Kevin Zhu, Ryan Lagasse, Vasu Sharma, Ashwinee Panda

Main category: cs.AI

TL;DR: Evaluation awareness in LLMs follows a power-law scaling with model size, enabling prediction of deceptive behavior in larger models.

Details

Motivation: To understand how evaluation awareness scales across different model sizes, as prior work only studied a single 70B model, which undermines AI safety evaluations.

Method: Linear probing on steering vector activations across 15 models from 0.27B to 70B parameters from four families.

Result: Clear power-law scaling: evaluation awareness increases predictably with model size.

Conclusion: This scaling law enables forecasting deceptive behavior in future larger models and guides scale-aware evaluation strategies for AI safety.

Abstract: Large language models (LLMs) can internally distinguish between evaluation and deployment contexts, a behaviour known as \emph{evaluation awareness}. This undermines AI safety evaluations, as models may conceal dangerous capabilities during testing. Prior work demonstrated this in a single $70$B model, but the scaling relationship across model sizes remains unknown. We investigate evaluation awareness across $15$ models scaling from $0.27$B to $70$B parameters from four families using linear probing on steering vector activations. Our results reveal a clear power-law scaling: evaluation awareness increases predictably with model size. This scaling law enables forecasting deceptive behavior in future larger models and guides the design of scale-aware evaluation strategies for AI safety. A link to the implementation of this paper can be found at https://anonymous.4open.science/r/evaluation-awareness-scaling-laws/README.md.

[591] Multi-Scenario Highway Lane-Change Intention Prediction: A Physics-Informed AI Framework for Three-Class Classification

Jiazhao Shi, Yichen Lin, Yiheng Hua, Ziyu Wang, Zijian Zhang, Wenjia Zheng, Yun Song, Kuan Lu, Shoufeng Lu

Main category: cs.AI

TL;DR: A physics-informed AI framework for lane-change intention prediction that integrates vehicle kinematics and traffic-safety metrics, achieving state-of-the-art performance with 99.8% accuracy on highway data and 96.1% accuracy on complex ramp scenarios.

Details

Motivation: Lane-change maneuvers are a leading cause of highway accidents, and existing methods suffer from binary classification limitations, lack of scenario diversity, and degraded performance under longer prediction horizons.

Method: Proposes a physics-informed AI framework that explicitly integrates vehicle kinematics, interaction feasibility, and traffic-safety metrics (distance headway, time headway, time-to-collision, closing gap time) into the learning process. Formulates lane-change prediction as a three-class problem (left change, right change, no change) and evaluates across both straight highway segments and complex ramp scenarios.

Result: LightGBM model achieves up to 99.8% accuracy and 93.6% macro F1 on highway data (highD), and 96.1% accuracy and 88.7% macro F1 on ramp scenarios (exiD) at 1-second horizon, outperforming a two-layer stacked LSTM baseline.

Conclusion: The findings demonstrate the practical advantages of a physics-informed and feature-rich machine learning framework for real-time lane-change intention prediction in autonomous driving systems.

Abstract: Lane-change maneuvers are a leading cause of highway accidents, underscoring the need for accurate intention prediction to improve the safety and decision-making of autonomous driving systems. While prior studies using machine learning and deep learning methods (e.g., SVM, CNN, LSTM, Transformers) have shown promise, most approaches remain limited by binary classification, lack of scenario diversity, and degraded performance under longer prediction horizons. In this study, we propose a physics-informed AI framework that explicitly integrates vehicle kinematics, interaction feasibility, and traffic-safety metrics (e.g., distance headway, time headway, time-to-collision, closing gap time) into the learning process. lane-change prediction is formulated as a three-class problem that distinguishes left change, right change, and no change, and is evaluated across both straight highway segments (highD) and complex ramp scenarios (exiD). By integrating vehicle kinematics with interaction features, our machine learning models, particularly LightGBM, achieve state-of-the-art accuracy and strong generalization. Results show up to 99.8% accuracy and 93.6% macro F1 on highD, and 96.1% accuracy and 88.7% macro F1 on exiD at a 1-second horizon, outperforming a two-layer stacked LSTM baseline. These findings demonstrate the practical advantages of a physics-informed and feature-rich machine learning framework for real-time lane-change intention prediction in autonomous driving systems.

[592] A Cost-Benefit Analysis of On-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services

Guanzhong Pan, Vishal Chodnekar, Abinas Roy, Haibo Wang

Main category: cs.AI

TL;DR: This paper provides a cost-benefit analysis framework to help organizations decide between commercial LLM services and on-premise deployment based on usage levels and performance needs.

Details

Motivation: Organizations face important decisions about using AI productivity tools - whether to subscribe to commercial LLM services or deploy models locally, with concerns about data privacy, vendor lock-in, and long-term costs driving interest in on-premise solutions.

Method: The authors develop a cost-benefit analysis framework considering hardware requirements, operational expenses, and performance benchmarks of open-source models (Qwen, Llama, Mistral, etc.), then compare total local deployment costs with major cloud providers’ subscription fees.

Result: The analysis provides estimated breakeven points based on usage levels and performance needs, showing when on-premise LLM deployment becomes economically viable compared to commercial subscription services.

Conclusion: The findings give organizations a practical framework for planning their LLM strategies by identifying the optimal deployment approach based on their specific usage patterns and requirements.

Abstract: Large language models (LLMs) are becoming increasingly widespread. Organizations that want to use AI for productivity now face an important decision. They can subscribe to commercial LLM services or deploy models on their own infrastructure. Cloud services from providers such as OpenAI, Anthropic, and Google are attractive because they provide easy access to state-of-the-art models and are easy to scale. However, concerns about data privacy, the difficulty of switching service providers, and long-term operating costs have driven interest in local deployment of open-source models. This paper presents a cost-benefit analysis framework to help organizations determine when on-premise LLM deployment becomes economically viable compared to commercial subscription services. We consider the hardware requirements, operational expenses, and performance benchmarks of the latest open-source models, including Qwen, Llama, Mistral, and etc. Then we compare the total cost of deploying these models locally with the major cloud providers subscription fee. Our findings provide an estimated breakeven point based on usage levels and performance needs. These results give organizations a practical framework for planning their LLM strategies.

[593] TERAG: Token-Efficient Graph-Based Retrieval-Augmented Generation

Qiao Xiao, Hong Ting Tsang, Jiaxin Bai

Main category: cs.AI

TL;DR: TERAG is a cost-efficient graph-based RAG framework that achieves 80% of existing methods’ accuracy while using only 3-11% of output tokens through Personalized PageRank integration.

Details

Motivation: To address the high LLM token usage costs in existing graph-based RAG systems that hinder large-scale adoption.

Method: Proposes TERAG framework incorporating Personalized PageRank during retrieval phase for efficient graph construction with low token consumption.

Result: Achieves at least 80% accuracy of widely used graph-based RAG methods while consuming only 3%-11% of output tokens.

Conclusion: TERAG is well-suited for large-scale and cost-sensitive deployment scenarios due to its low token footprint and efficient construction pipeline.

Abstract: Graph-based Retrieval-augmented generation (RAG) has become a widely studied approach for improving the reasoning, accuracy, and factuality of Large Language Models (LLMs). However, many existing graph-based RAG systems overlook the high cost associated with LLM token usage during graph construction, hindering large-scale adoption. To address this, we propose TERAG, a simple yet effective framework designed to build informative graphs at a significantly lower cost. Inspired by HippoRAG, we incorporate Personalized PageRank (PPR) during the retrieval phase, and we achieve at least 80% of the accuracy of widely used graph-based RAG methods while consuming only 3%-11% of the output tokens. With its low token footprint and efficient construction pipeline, TERAG is well-suited for large-scale and cost-sensitive deployment scenarios.

[594] PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature

Daoyu Wang, Mingyue Cheng, Shuo Yu, Zirui Liu, Ze Guo, Qi Liu

Main category: cs.AI

TL;DR: PaperArena is a benchmark for evaluating LLM agents on cross-paper scientific reasoning tasks requiring multi-tool orchestration, where current state-of-the-art agents achieve only 38.78% accuracy.

Details

Motivation: Existing benchmarks are limited to tool-free tasks within single papers, lacking evaluation for real-world research scenarios that require integrating information across multiple papers with external tools.

Method: Proposed PaperArena benchmark with modular platform offering tools like multimodal parsing, context retrieval, and programmatic computation to evaluate agents on research questions requiring cross-paper reasoning.

Result: Even the most advanced LLM-powered agent system achieved only 38.78% average accuracy, dropping to 18.47% on hard tasks. All agents showed inefficient tool usage, often invoking unnecessary tools.

Conclusion: PaperArena reveals significant gaps in current agent capabilities for scientific discovery and provides a standardized platform for developing more capable agents for complex knowledge-intensive tasks.

Abstract: Understanding and reasoning on the web-scale scientific literature is a crucial touchstone for large language model (LLM) based agents designed to support complex knowledge-intensive tasks. However, existing works are mainly restricted to tool-free tasks within isolated papers, largely due to the lack of a benchmark for cross-paper reasoning and multi-tool orchestration in real research scenarios. In this work, we propose PaperArena, an evaluation benchmark for agents to address real-world research questions that typically require integrating information across multiple papers with the assistance of external tools. Given a research question, agents should integrate diverse formats across multiple papers through reasoning and interacting with appropriate tools, thereby producing a well-grounded answer. To support standardized evaluation, we provide a modular and extensible platform for agent execution, offering tools such as multimodal parsing, context retrieval, and programmatic computation. Experimental results reveal that even the most advanced LLM powering a well-established agent system achieves merely 38.78% average accuracy. On the hard subset, accuracy drops to only 18.47%, highlighting great potential for improvement. We also present several empirical findings, including that all agents tested exhibit inefficient tool usage, often invoking more tools than necessary to solve a task. We invite the community to adopt PaperArena to develop and evaluate more capable agents for scientific discovery. Our code and data are available https://github.com/Melmaphother/PaperArena.

[595] Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper

Atsuyuki Miyai, Mashiro Toyooka, Takashi Otonari, Zaiying Zhao, Kiyoharu Aizawa

Main category: cs.AI

TL;DR: Jr. AI Scientist is an autonomous AI system that mimics a novice researcher’s workflow to analyze papers, formulate hypotheses, conduct experiments, and write new research papers, demonstrating improved performance over existing automated systems while highlighting current limitations and risks.

Details

Motivation: To understand the capabilities and risks of AI Scientist systems for ensuring trustworthy AI-driven scientific progress while preserving academic integrity.

Method: Developed Jr. AI Scientist that follows a research workflow: analyzes baseline paper limitations, formulates novel hypotheses, iteratively conducts experiments using modern coding agents for complex implementations, and writes papers with results.

Result: Successfully generated new research papers building on real NeurIPS, IJCV, and ICLR works; papers received higher review scores than existing fully automated systems in evaluations using AI Reviewers, author-led assessments, and Agents4Science submissions.

Conclusion: Jr. AI Scientist clarifies the current role and limitations of AI Scientist systems, showing they can generate valuable contributions but still require human expertise and pose emerging risks that need addressing.

Abstract: Understanding the current capabilities and risks of AI Scientist systems is essential for ensuring trustworthy and sustainable AI-driven scientific progress while preserving the integrity of the academic ecosystem. To this end, we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system that mimics the core research workflow of a novice student researcher: Given the baseline paper from the human mentor, it analyzes its limitations, formulates novel hypotheses for improvement, and iteratively conducts experiments until improvements are realized, and writes a paper with the results. Unlike previous approaches that assume full automation or operate on small-scale code, Jr. AI Scientist follows a well-defined research workflow and leverages modern coding agents to handle complex, multi-file implementations, leading to scientifically valuable contributions. Through our experiments, the Jr. AI Scientist successfully generated new research papers that build upon real NeurIPS, IJCV, and ICLR works by proposing and implementing novel methods. For evaluation, we conducted automated assessments using AI Reviewers, author-led evaluations, and submissions to Agents4Science, a venue dedicated to AI-driven scientific contributions. The findings demonstrate that Jr. AI Scientist generates papers receiving higher review scores than existing fully automated systems. Nevertheless, we identify important limitations from both the author evaluation and the Agents4Science reviews, indicating the potential risks of directly applying current AI Scientist systems and key challenges for future research. Finally, we comprehensively report various risks identified during development. We believe this study clarifies the current role and limitations of AI Scientist systems, offering insights into the areas that still require human expertise and the risks that may emerge as these systems evolve.

[596] CodeEvolve: An open source evolutionary coding agent for algorithm discovery and optimization

Henrique Assumpção, Diego Ferreira, Leandro Campos, Fabricio Murai

Main category: cs.AI

TL;DR: CodeEvolve is an open-source evolutionary coding agent that combines LLMs with genetic algorithms to solve complex computational problems, outperforming AlphaEvolve on several benchmarks.

Details

Motivation: To develop an open-source alternative to closed-source systems like AlphaEvolve that can solve complex computational problems by leveraging evolutionary concepts with LLMs.

Method: Uses island-based genetic algorithms for population diversity, inspiration-based crossover using LLM context windows, and meta-prompting for dynamic solution space exploration.

Result: Surpassed AlphaEvolve’s performance on several challenging mathematical benchmarks from the AlphaEvolve evaluation set.

Conclusion: CodeEvolve demonstrates the effectiveness of combining evolutionary algorithms with LLMs for computational problem solving and is released as open-source to foster collaboration.

Abstract: In this work, we introduce CodeEvolve, an open-source evolutionary coding agent that unites Large Language Models (LLMs) with genetic algorithms to solve complex computational problems. Our framework adapts powerful evolutionary concepts to the LLM domain, building upon recent methods for generalized scientific discovery. CodeEvolve employs an island-based genetic algorithm to maintain population diversity and increase throughput, introduces a novel inspiration-based crossover mechanism that leverages the LLMs context window to combine features from successful solutions, and implements meta-prompting strategies for dynamic exploration of the solution space. We conduct a rigorous evaluation of CodeEvolve on a subset of the mathematical benchmarks used to evaluate Google DeepMind’s closed-source AlphaEvolve. Our findings show that our method surpasses AlphaEvolve’s performance on several challenging problems. To foster collaboration and accelerate progress, we release our complete framework as an open-source repository.

Aaron Bell, Amit Aides, Amr Helmy, Arbaaz Muslim, Aviad Barzilai, Aviv Slobodkin, Bolous Jaber, David Schottlander, George Leifman, Joydeep Paul, Mimi Sun, Nadav Sherman, Natalie Williams, Per Bjornsson, Roy Lee, Ruth Alcantara, Thomas Turnbull, Tomer Shekel, Vered Silverman, Yotam Gigi, Adam Boulanger, Alex Ottenwess, Ali Ahmadalipour, Anna Carter, Behzad Vahedi, Charles Elliott, David Andre, Elad Aharoni, Gia Jung, Hassler Thurston, Jacob Bien, Jamie McPike, Jessica Sapick, Juliet Rothenberg, Kartik Hegde, Kel Markert, Kim Philipp Jablonski, Luc Houriez, Monica Bharel, Phing VanLee, Reuven Sayag, Sebastian Pilarski, Shelley Cazares, Shlomi Pasternak, Siduo Jiang, Thomas Colthurst, Yang Chen, Yehonathan Refael, Yochai Blau, Yuval Carny, Yael Maguire, Avinatan Hassidim, James Manyika, Tim Thelin, Genady Beryozkin, Gautam Prasad, Luke Barrington, Yossi Matias, Niv Efron, Shravya Shetty

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation from unavailable abstract

Method: Methodology details unavailable due to access restrictions

Result: Results cannot be retrieved due to server rate limiting

Conclusion: Analysis impossible - paper content inaccessible

Abstract: Failed to fetch summary for 2510.18318: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18318&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[598] SOCIA-Nabla: Textual Gradient Meets Multi-Agent Orchestration for Automated Simulator Generation

Yuncheng Hua, Sion Weatherhead, Mehdi Jafari, Hao Xue, Flora D. Salim

Main category: cs.AI

TL;DR: SOCIA-Nabla is an end-to-end agentic framework that treats simulator construction as instance optimization over code within a textual computation graph, using LLM-driven agents and Textual-Gradient Descent to convert brittle prompt pipelines into reproducible simulator code generation.

Details

Motivation: To address the brittleness of prompt pipelines in simulator construction and create a more reproducible, constraint-aware approach that scales across domains and simulation granularities while minimizing expert effort.

Method: Uses specialized LLM-driven agents embedded as graph nodes with a workflow manager executing a loss-driven loop (code synthesis -> execution -> evaluation -> code repair) and performs Textual-Gradient Descent optimization.

Result: Achieves state-of-the-art overall accuracy across three CPS tasks: User Modeling, Mask Adoption, and Personal Mobility.

Conclusion: SOCIA-Nabla successfully unifies multi-agent orchestration with loss-aligned optimization to create reproducible simulator code generation that scales across domains and granularities, converting brittle prompt pipelines into robust systems.

Abstract: In this paper, we present SOCIA-Nabla, an end-to-end, agentic framework that treats simulator construction asinstance optimization over code within a textual computation graph. Specialized LLM-driven agents are embedded as graph nodes, and a workflow manager executes a loss-driven loop: code synthesis -> execution -> evaluation -> code repair. The optimizer performs Textual-Gradient Descent (TGD), while human-in-the-loop interaction is reserved for task-spec confirmation, minimizing expert effort and keeping the code itself as the trainable object. Across three CPS tasks, i.e., User Modeling, Mask Adoption, and Personal Mobility, SOCIA-Nabla attains state-of-the-art overall accuracy. By unifying multi-agent orchestration with a loss-aligned optimization view, SOCIA-Nabla converts brittle prompt pipelines into reproducible, constraint-aware simulator code generation that scales across domains and simulation granularities. This work is under review, and we will release the code soon.

[599] GraphChain: Large Language Models for Large-scale Graph Analysis via Tool Chaining

Chunyu Wei, Wenji Hu, Xingjia Hao, Xin Wang, Yifan Yang, Yueguo Chen, Yang Tian, Yunhai Wang

Main category: cs.AI

TL;DR: GraphChain enables LLMs to analyze large-scale graphs through dynamic tool sequences, overcoming context constraints and inflexible reasoning with progressive graph distillation and structure-aware adaptation.

Details

Motivation: LLMs struggle with large-scale graphs due to context limitations and rigid reasoning approaches, requiring a more flexible framework for complex graph analysis.

Method: Uses progressive graph distillation (RL-based tool sequence optimization) and structure-aware test-time adaptation (spectral properties + lightweight adapters) for dynamic tool selection.

Result: Significantly outperforms prior methods, enabling scalable and adaptive LLM-driven graph analysis across diverse graph topologies.

Conclusion: GraphChain provides an effective framework for LLM-based graph analysis through dynamic tool orchestration and topology-aware adaptation, overcoming key limitations of current approaches.

Abstract: Large Language Models (LLMs) face significant limitations when applied to large-scale graphs, struggling with context constraints and inflexible reasoning. We present GraphChain, a framework that enables LLMs to analyze complex graphs through dynamic sequences of specialized tools, mimicking human exploratory intelligence. Our approach introduces two key innovations: (1) Progressive Graph Distillation, a reinforcement learning mechanism that generates optimized tool sequences balancing task relevance with information compression, and (2) Structure-aware Test-Time Adaptation, which efficiently tailors tool selection strategies to diverse graph topologies using spectral properties and lightweight adapters without costly retraining. Experiments show GraphChain significantly outperforms prior methods, enabling scalable and adaptive LLM-driven graph analysis.

[600] PreferThinker: Reasoning-based Personalized Image Preference Assessment

Shengqi Xu, Xinpeng Zhou, Yabo Zhang, Ming Liu, Tao Liang, Tianyu Zhang, Yalong Bai, Zuxuan Wu, Wangmeng Zuo

Main category: cs.AI

TL;DR: Proposes a reasoning-based framework for personalized image preference assessment using a predict-then-assess paradigm with Chain-of-Thought reasoning and two-stage training.

Details

Motivation: Existing methods struggle with personalized preference assessment due to scarce user-specific data and diverse individual tastes, requiring a scalable approach to capture complex preferences.

Method: Two-stage framework: first predicts user preference profile from reference images, then provides interpretable assessments. Uses CoT-style dataset, supervised fine-tuning followed by reinforcement learning with similarity-aware prediction reward.

Result: Extensive experiments demonstrate superiority of the proposed method over existing approaches.

Conclusion: The framework effectively handles personalized image preference assessment by leveraging common preference profiles and structured reasoning, achieving better performance through interpretable multi-dimensional scoring.

Abstract: Personalized image preference assessment aims to evaluate an individual user’s image preferences by relying only on a small set of reference images as prior information. Existing methods mainly focus on general preference assessment, training models with large-scale data to tackle well-defined tasks such as text-image alignment. However, these approaches struggle to handle personalized preference because user-specific data are scarce and not easily scalable, and individual tastes are often diverse and complex. To overcome these challenges, we introduce a common preference profile that serves as a bridge across users, allowing large-scale user data to be leveraged for training profile prediction and capturing complex personalized preferences. Building on this idea, we propose a reasoning-based personalized image preference assessment framework that follows a \textit{predict-then-assess} paradigm: it first predicts a user’s preference profile from reference images, and then provides interpretable, multi-dimensional scores and assessments of candidate images based on the predicted profile. To support this, we first construct a large-scale Chain-of-Thought (CoT)-style personalized assessment dataset annotated with diverse user preference profiles and high-quality CoT-style reasoning, enabling explicit supervision of structured reasoning. Next, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase to empower the model with structured reasoning capabilities, followed by reinforcement learning to incentivize the model to explore more reasonable assessment paths and enhance generalization. Furthermore, we propose a similarity-aware prediction reward to encourage better prediction of the user’s preference profile, which facilitates more reasonable assessments exploration. Extensive experiments demonstrate the superiority of the proposed method.

[601] Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries

Minghe Shen, Zhuo Zhi, Chonghan Liu, Shuo Xing, Zhengzhong Tu, Che Liu

Main category: cs.AI

TL;DR: Ariadne framework uses RL post-training with verified rewards to extend VLMs’ capability boundaries for visual-centric spatial reasoning tasks, achieving significant improvements on both synthetic mazes and real-world benchmarks.

Details

Motivation: To investigate whether RL post-training can truly extend VLMs' inherent capability boundaries for visual-centric spatial tasks where base models initially fail, moving beyond language-dominant evaluations.

Method: Uses synthetic mazes with controlled difficulty for multi-step spatial reasoning, training VLMs with Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum.

Result: Post-RLVR training achieved over 50% accuracy on problems where base model scored 0%, with zero-shot improvements of 16% on MapBench and 24% on ReasonMap for real-world spatial reasoning tasks.

Conclusion: RL post-training with verified rewards can expand VLMs’ initial capability boundaries and enhance generalization to real-world spatial reasoning, though limited to post-training phase due to pre-training data opaqueness.

Abstract: While Vision-Language Models (VLMs) post-trained with Reinforcement Learning (RL) show impressive general reasoning, their evaluation is often confined to language-dominant tasks (e.g., math). This raises a critical question: can RL post-training truly extend the inherent capability boundary of a base VLM, particularly for visual-centric spatial tasks where it initially fails? To investigate this, we introduce Ariadne, a framework utilizing synthetic mazes for multi-step spatial reasoning where task difficulty (e.g., path length, turns) is precisely controlled. We leverage this controllable environment to train VLMs using Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum. Surprisingly, post-RLVR training, the VLM achieves over 50% accuracy on a problem set where the base model scored 0%, demonstrating that our approach expands the model’s initial capability boundary. To assess real-world viability, we evaluate out-of-distribution (OOD) generalization on practical benchmarks. Despite training only on synthetic maze samples, Ariadne achieves significant zero-shot improvements, averaging 16% on MapBench (e.g., museum navigation) and 24% on ReasonMap (subway transfer tasks). These results confirm that our method not only broadens the model’s fundamental limits but also enhances its generalization to real-world spatial reasoning. We acknowledge our study is limited to the post-training phase, given the opaqueness of pre-training data, and hope our research motivates further work on specialized, capability-extending alignment.

[602] SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

Jonathan Li, Nasim Farahini, Evgenii Iuliugin, Magnus Vesterlund, Christian Häggström, Guangtao Wang, Shubhangi Upasani, Ayush Sachdeva, Rui Li, Faline Fu, Chen Wu, Ayesha Siddiqua, John Long, Tuowen Zhao, Matheen Musaddiq, Håkan Zeffer, Yun Du, Mingran Wang, Qinghua Li, Bo Li, Urmish Thakker, Raghu Prabhakar

Main category: cs.AI

TL;DR: SnapStream is a KV cache compression method that enables 4× improved on-chip memory usage with minimal accuracy degradation, deployed in production inference systems with static graphs and continuous batching.

Details

Motivation: Address the increasing demands for on-chip memory to support large KV caches in LLMs with 100k+ context length, while overcoming limitations of existing techniques that are difficult to implement in industrial deployments using frameworks like vLLM or SGLang.

Method: Developed SnapStream, a KV cache compression method that can be deployed at scale in systems with static graphs and continuous batching. Tested on Llama-3.1-8B-Instruct and DeepSeek-R1, and deployed in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators.

Result: Achieved 4× improved on-chip memory usage at 128k context length with up to 1832 tokens per second in production. Showed minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench benchmarks.

Conclusion: SnapStream successfully enables efficient KV cache compression in production inference systems, representing the first implementation of sparse KV attention techniques deployed with static graphs and continuous batching.

Abstract: The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support have resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet, these techniques are not commonly used within industrial deployments using frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm, while on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obfuscating the need for implementing these techniques. In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables $4\times$ improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.

[603] DeepKnown-Guard: A Proprietary Model-Based Safety Response Framework for AI Agents

Qi Li, Jianjun Xu, Pingtao Wei, Jiu Li, Peiqiang Zhao, Jiwei Shi, Xuan Zhang, Yanhui Yang, Xiaodong Hui, Peng Xu, Wenqin Shao

Main category: cs.AI

TL;DR: A novel safety response framework for LLMs that provides systematic protection at both input and output levels, achieving high risk recall rates and perfect safety scores on high-risk tests.

Details

Motivation: Security issues in LLMs are limiting their trustworthy deployment in critical domains, requiring systematic safety measures.

Method: Input-level: supervised fine-tuning-based safety classification with 4-tier taxonomy; Output-level: RAG with fine-tuned interpretation model for trustworthy knowledge grounding.

Result: 99.3% risk recall rate, significantly higher safety scores than baseline, and 100% safety score on proprietary high-risk test set.

Conclusion: Provides an effective engineering pathway for building high-security, high-trust LLM applications with exceptional protective capabilities.

Abstract: With the widespread application of Large Language Models (LLMs), their associated security issues have become increasingly prominent, severely constraining their trustworthy deployment in critical domains. This paper proposes a novel safety response framework designed to systematically safeguard LLMs at both the input and output levels. At the input level, the framework employs a supervised fine-tuning-based safety classification model. Through a fine-grained four-tier taxonomy (Safe, Unsafe, Conditionally Safe, Focused Attention), it performs precise risk identification and differentiated handling of user queries, significantly enhancing risk coverage and business scenario adaptability, and achieving a risk recall rate of 99.3%. At the output level, the framework integrates Retrieval-Augmented Generation (RAG) with a specifically fine-tuned interpretation model, ensuring all responses are grounded in a real-time, trustworthy knowledge base. This approach eliminates information fabrication and enables result traceability. Experimental results demonstrate that our proposed safety control model achieves a significantly higher safety score on public safety evaluation benchmarks compared to the baseline model, TinyR1-Safety-8B. Furthermore, on our proprietary high-risk test set, the framework’s components attained a perfect 100% safety score, validating their exceptional protective capabilities in complex risk scenarios. This research provides an effective engineering pathway for building high-security, high-trust LLM applications.

[604] Scaling Agent Learning via Experience Synthesis

Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh

Main category: cs.AI

TL;DR: DreamGym is a unified framework that synthesizes diverse experiences for RL training by using reasoning-based environment models, enabling scalable agent rollout collection without expensive real-environment interactions.

Details

Motivation: Address challenges in RL adoption including costly rollouts, limited task diversity, unreliable rewards, and infrastructure complexity that obstruct scalable experience data collection.

Method: Uses reasoning-based experience model for state transitions and feedback, experience replay buffer initialized with offline data, and adaptive task generation for curriculum learning.

Result: Outperforms baselines by over 30% on non-RL-ready tasks like WebArena, matches GRPO and PPO performance using only synthetic interactions, and provides significant performance gains in sim-to-real transfer with fewer real-world interactions.

Conclusion: DreamGym enables scalable RL training through synthetic experience synthesis, providing an effective warm-start strategy for general-purpose RL with reduced reliance on costly real-environment interactions.

Abstract: While reinforcement learning (RL) can empower autonomous agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%. And in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.

[605] Interpreting Multi-Attribute Confounding through Numerical Attributes in Large Language Models

Hirohane Takagi, Gouki Minegishi, Shota Kizawa, Issey Sukeda, Hitomi Yanaka

Main category: cs.AI

TL;DR: LLMs encode numerical correlations but systematically amplify them, and irrelevant context causes shifts in magnitude representations that affect decision-making differently by model size.

Details

Motivation: To understand how LLMs internally represent and integrate multiple numerical attributes, and how irrelevant numerical context affects these representations and outputs.

Method: Combined linear probing with partial correlation analysis and prompt-based vulnerability tests across models of varying sizes.

Result: LLMs encode real-world numerical correlations but tend to systematically amplify them. Irrelevant context induces consistent shifts in magnitude representations with downstream effects that vary by model size.

Conclusion: Reveals vulnerability in LLM decision-making and provides groundwork for fairer, representation-aware control under multi-attribute entanglement.

Abstract: Although behavioral studies have documented numerical reasoning errors in large language models (LLMs), the underlying representational mechanisms remain unclear. We hypothesize that numerical attributes occupy shared latent subspaces and investigate two questions:(1) How do LLMs internally integrate multiple numerical attributes of a single entity? (2)How does irrelevant numerical context perturb these representations and their downstream outputs? To address these questions, we combine linear probing with partial correlation analysis and prompt-based vulnerability tests across models of varying sizes. Our results show that LLMs encode real-world numerical correlations but tend to systematically amplify them. Moreover, irrelevant context induces consistent shifts in magnitude representations, with downstream effects that vary by model size. These findings reveal a vulnerability in LLM decision-making and lay the groundwork for fairer, representation-aware control under multi-attribute entanglement.

[606] Agentmandering: A Game-Theoretic Framework for Fair Redistricting via Large Language Model Agents

Hao Li, Haotian Chen, Ruoyuan Gong, Juanjuan Wang, Hao Jiang

Main category: cs.AI

TL;DR: The paper analysis could not be completed due to HTTP 429 error when fetching the abstract from arXiv API.

Details

Motivation: Unable to determine the paper's motivation as the abstract content is not accessible.

Method: Methodology details unavailable due to failed API request.

Result: Results cannot be analyzed without access to the paper’s content.

Conclusion: No conclusion can be drawn as the paper information could not be retrieved from the arXiv API.

Abstract: Failed to fetch summary for 2511.04076: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.04076&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[607] GUI-360$^\circ$: A Comprehensive Dataset and Benchmark for Computer-Using Agents

Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Main category: cs.AI

TL;DR: GUI-360° is a large-scale dataset and benchmark for computer-using agents (CUAs) that addresses gaps in real-world tasks, automated multi-modal trajectory collection, and unified evaluation of GUI grounding, screen parsing, and action prediction.

Details

Motivation: To overcome three persistent gaps in CUA research: scarcity of real-world tasks, lack of automated pipelines for multi-modal trajectory collection, and absence of unified benchmarks for GUI grounding, screen parsing, and action prediction.

Method: Uses an LLM-augmented automated pipeline for query sourcing, environment-template construction, task instantiation, batched execution, and LLM-driven quality filtering. Contains 1.2M+ action steps across Windows office applications with screenshots, accessibility metadata, goals, reasoning traces, and both successful/failed trajectories.

Result: Benchmarking shows substantial shortcomings in state-of-the-art vision-language models for grounding and action prediction. Supervised fine-tuning and reinforcement learning yield significant improvements but don’t reach human-level reliability.

Conclusion: GUI-360° provides a comprehensive dataset and benchmark to accelerate research on robust desktop computer-using agents, with the dataset publicly released to facilitate reproducible research.

Abstract: We introduce GUI-360$^\circ$, a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). CUAs present unique challenges and is constrained by three persistent gaps: a scarcity of real-world CUA tasks, the lack of automated collection-and-annotation pipelines for multi-modal trajectories, and the absence of a unified benchmark that jointly evaluates GUI grounding, screen parsing, and action prediction. GUI-360$^\circ$ addresses these gaps with an LLM-augmented, largely automated pipeline for query sourcing, environment-template construction, task instantiation, batched execution, and LLM-driven quality filtering. The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications, and includes full-resolution screenshots, accessibility metadata when available, instantiated goals, intermediate reasoning traces, and both successful and failed action trajectories. The dataset supports three canonical tasks, GUI grounding, screen parsing, and action prediction, and a hybrid GUI+API action space that reflects modern agent designs. Benchmarking state-of-the-art vision–language models on GUI-360$^\circ$ reveals substantial out-of-the-box shortcomings in grounding and action prediction; supervised fine-tuning and reinforcement learning yield significant gains but do not close the gap to human-level reliability. We release GUI-360$^\circ$ and accompanying code to facilitate reproducible research and accelerate progress on robust desktop CUAs. The full dataset has been made public on https://huggingface.co/datasets/vyokky/GUI-360.

[608] Condensed Data Expansion Using Model Inversion for Knowledge Distillation

Kuluhan Binici, Shivam Aggarwal, Cihan Acar, Nam Trung Pham, Karianto Leman, Gim Hee Lee, Tulika Mitra

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: N/A - Abstract not available

Method: N/A - Abstract not available

Result: N/A - Abstract not available

Conclusion: N/A - Abstract not available

Abstract: Failed to fetch summary for 2408.13850: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2408.13850&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[609] Return Prediction for Mean-Variance Portfolio Selection: How Decision-Focused Learning Shapes Forecasting Models

Junhyeong Lee, Haeun Jeon, Hyunglip Bae, Yongjae Lee

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: N/A - Paper content unavailable

Method: N/A - Paper content unavailable

Result: N/A - Paper content unavailable

Conclusion: N/A - Paper content unavailable

Abstract: Failed to fetch summary for 2409.09684: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.09684&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[610] Stochastic interior-point methods for smooth conic optimization with applications

Chuan He, Zhanwang Deng

Main category: cs.AI

TL;DR: Unable to retrieve paper summary due to HTTP 503 error from arXiv API

Details

Motivation: The paper analysis could not be performed due to technical issues with the data source

Method: Attempted to fetch paper metadata from arXiv API but encountered server error

Result: No results obtained - service temporarily unavailable

Conclusion: Technical limitations prevented analysis of paper 2412.12987

Abstract: Failed to fetch summary for 2412.12987: Page request resulted in HTTP 503 (https://export.arxiv.org/api/query?search_query=&id_list=2412.12987&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.SD

[611] Loud-loss: A Perceptually Motivated Loss Function for Speech Enhancement Based on Equal-Loudness Contours

Zixuan Li, Xueliang Zhang, Changjiang Zhao, Shuai Gao, Lei Miao, Zhipeng Yan, Ying Sun, Chong Zhu

Main category: cs.SD

TL;DR: Proposes a perceptually-weighted loss function based on psychoacoustics to replace MSE in speech enhancement, improving perceptual quality by aligning error weighting with human auditory sensitivity.

Details

Motivation: MSE over-emphasizes low-frequency components with high energy, leading to inadequate modeling of perceptually important high-frequency information and poor reflection of auditory perception quality.

Method: Leverages equal-loudness contours to assign frequency-dependent weights to reconstruction error, penalizing deviations according to human auditory sensitivity. The loss is model-agnostic and flexible.

Result: Experiments on VoiceBank+DEMAND dataset show replacing MSE with the proposed loss in GTCRN model elevates WB-PESQ score from 2.17 to 2.93, indicating significant perceptual quality improvement.

Conclusion: The perceptually-weighted loss function grounded in psychoacoustic principles effectively addresses MSE’s limitations and significantly enhances speech enhancement performance in terms of perceptual quality.

Abstract: The mean squared error (MSE) is a ubiquitous loss function for speech enhancement, but its problem is that the error cannot reflect the auditory perception quality. This is because MSE causes models to over-emphasize low-frequency components which has high energy, leading to the inadequate modeling of perceptually important high-frequency information. To overcome this limitation, we propose a perceptually-weighted loss function grounded in psychoacoustic principles. Specifically, it leverages equal-loudness contours to assign frequency-dependent weights to the reconstruction error, thereby penalizing deviations in a way aligning with human auditory sensitivity. The proposed loss is model-agnostic and flexible, demonstrating strong generality. Experiments on the VoiceBank+DEMAND dataset show that replacing MSE with our loss in a GTCRN model elevates the WB-PESQ score from 2.17 to 2.93-a significant improvement in perceptual quality.

[612] We Can Hear You with mmWave Radar! An End-to-End Eavesdropping System

Dachao Han, Teng Huang, Han Ding, Cui Zhao, Fei Wang, Ge Wang, Wei Xi

Main category: cs.SD

TL;DR: mmSpeech is a mmWave-based eavesdropping system that reconstructs intelligible speech from loudspeaker vibrations through walls without prior speaker knowledge.

Details

Motivation: With increasing voice-enabled technologies, loudspeaker playback poses speech privacy risks. Traditional eavesdropping methods are limited by requiring invasive access or line-of-sight.

Method: Uses narrowband mmWave signals to capture vibrations, reveals optimal vibrating material and radar sampling rate, designs deep neural network for speech reconstruction, and fine-tunes pre-trained ASR model encoder with synthetic training pipeline.

Result: Achieves state-of-the-art speech quality and generalizes well across unseen speakers and various conditions using commercial mmWave radar.

Conclusion: mmSpeech demonstrates effective speech reconstruction from loudspeaker vibrations through walls, highlighting significant privacy concerns in voice-enabled technologies.

Abstract: With the rise of voice-enabled technologies, loudspeaker playback has become widespread, posing increasing risks to speech privacy. Traditional eavesdropping methods often require invasive access or line-of-sight, limiting their practicality. In this paper, we present mmSpeech, an end-to-end mmWave-based eavesdropping system that reconstructs intelligible speech solely from vibration signals induced by loudspeaker playback, even through walls and without prior knowledge of the speaker. To achieve this, we reveal an optimal combination of vibrating material and radar sampling rate for capturing high- quality vibrations using narrowband mmWave signals. We then design a deep neural network that reconstructs intelligible speech from the estimated noisy spectrograms. To further support downstream speech understanding, we introduce a synthetic training pipeline and selectively fine-tune the encoder of a pre-trained ASR model. We implement mmSpeech with a commercial mmWave radar and validate its performance through extensive experiments. Results show that mmSpeech achieves state-of-the-art speech quality and generalizes well across unseen speakers and various conditions.

[613] ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction

Wenxuan Wu, Shuai Wang, Xixin Wu, Helen Meng, Haizhou Li

Main category: cs.SD

TL;DR: ELEGANCE enhances audio-visual target speaker extraction by incorporating linguistic knowledge from LLMs through three guidance strategies, improving performance in challenging scenarios.

Details

Motivation: Current AV-TSE models rely mainly on visual cues, but humans also use linguistic knowledge like syntax and conversation context for speech extraction.

Method: Proposes three linguistic guidance strategies: output linguistic constraints, intermediate linguistic prediction, and input linguistic prior, using LLMs like RoBERTa, Qwen3-0.6B, and Qwen3-4B.

Result: Significant improvements in challenging scenarios including visual cue impairment, unseen languages, speaker switches, more interfering speakers, and out-of-domain tests.

Conclusion: Incorporating linguistic knowledge from LLMs effectively enhances AV-TSE performance, especially in difficult conditions where visual cues alone are insufficient.

Abstract: Audio-visual target speaker extraction (AV-TSE) models primarily rely on visual cues from the target speaker. However, humans also leverage linguistic knowledge, such as syntactic constraints, next word prediction, and prior knowledge of conversation, to extract target speech. Inspired by this observation, we propose ELEGANCE, a novel framework that incorporates linguistic knowledge from large language models (LLMs) into AV-TSE models through three distinct guidance strategies: output linguistic constraints, intermediate linguistic prediction, and input linguistic prior. Comprehensive experiments with RoBERTa, Qwen3-0.6B, and Qwen3-4B on two AV-TSE backbones demon- strate the effectiveness of our approach. Significant improvements are observed in challenging scenarios, including visual cue impaired, unseen languages, target speaker switches, increased interfering speakers, and out-of-domain test set. Demo page: https://alexwxwu.github.io/ELEGANCE/.

[614] EchoMark: Perceptual Acoustic Environment Transfer with Watermark-Embedded Room Impulse Response

Chenpei Huang, Lingfeng Yao, Kyu In Lee, Lan Emily Zhang, Xun Chen, Miao Pan

Main category: cs.SD

TL;DR: EchoMark is a deep learning framework that generates perceptually similar room impulse responses (RIRs) with embedded watermarks for acoustic environment matching, addressing security concerns while maintaining high-quality audio transfer.

Details

Motivation: To enable secure acoustic environment matching by preventing malicious misuse of RIR recovery capabilities that could facilitate voice spoofing attacks or undermine audio evidence authenticity.

Method: Operates in latent domain to handle variable RIR characteristics, jointly optimizing with perceptual loss for RIR reconstruction and loss for watermark detection.

Result: Achieves room acoustic parameter matching comparable to state-of-the-art FiNS, MOS of 4.22/5, watermark detection accuracy >99%, and bit error rates <0.3%.

Conclusion: EchoMark effectively balances high-quality environment transfer with reliable watermark embedding, providing a secure solution for acoustic environment matching applications.

Abstract: Acoustic Environment Matching (AEM) is the task of transferring clean audio into a target acoustic environment, enabling engaging applications such as audio dubbing and auditory immersive virtual reality (VR). Recovering similar room impulse response (RIR) directly from reverberant speech offers more accessible and flexible AEM solution. However, this capability also introduces vulnerabilities of arbitrary ``relocation" if misused by malicious user, such as facilitating advanced voice spoofing attacks or undermining the authenticity of recorded evidence. To address this issue, we propose EchoMark, the first deep learning-based AEM framework that generates perceptually similar RIRs with embedded watermark. Our design tackle the challenges posed by variable RIR characteristics, such as different durations and energy decays, by operating in the latent domain. By jointly optimizing the model with a perceptual loss for RIR reconstruction and a loss for watermark detection, EchoMark achieves both high-quality environment transfer and reliable watermark recovery. Experiments on diverse datasets validate that EchoMark achieves room acoustic parameter matching performance comparable to FiNS, the state-of-the-art RIR estimator. Furthermore, a high Mean Opinion Score (MOS) of 4.22 out of 5, watermark detection accuracy exceeding 99%, and bit error rates (BER) below 0.3% collectively demonstrate the effectiveness of EchoMark in preserving perceptual quality while ensuring reliable watermark embedding.

[615] MT-HuBERT: Self-Supervised Mix-Training for Few-Shot Keyword Spotting in Mixed Speech

Junming Yuan, Ying Shi, Dong Wang, Lantian Li, Askar Hamdulla

Main category: cs.SD

TL;DR: MT-HuBERT is a self-supervised learning framework for few-shot keyword spotting that enables detection of multiple overlapping keywords in mixed speech by predicting clean acoustic units from contextual cues during pre-training.

Details

Motivation: Existing few-shot keyword spotting approaches struggle with mixed keyword detection (multiple overlapping keywords in single utterances), which is essential for real-world applications, and most methods are fully supervised, unable to leverage vast unlabeled data.

Method: Propose Mix-Training HuBERT (MT-HuBERT), a self-supervised pre-training framework that implements the Mix-Training criterion during pre-training to predict clean acoustic units of each constituent signal from contextual cues, rather than predicting compositional patterns of mixed speech.

Result: Experiments on Google Speech Commands (GSC v2) corpus show MT-HuBERT consistently outperforms state-of-the-art baselines in few-shot keyword spotting tasks under both mixed and clean conditions.

Conclusion: MT-HuBERT effectively addresses mixed keyword detection in few-shot scenarios through self-supervised pre-training, demonstrating superior performance compared to existing approaches while leveraging unlabeled data.

Abstract: Few-shot keyword spotting aims to detect previously unseen keywords with very limited labeled samples. A pre-training and adaptation paradigm is typically adopted for this task. While effective in clean conditions, most existing approaches struggle with mixed keyword spotting–detecting multiple overlapping keywords within a single utterance–a capability essential for real-world applications. We have previously proposed a pre-training approach based on Mix-Training (MT) to tackle the mixed keyword detection problem and demonstrated its efficiency. However, this approach is fully supervised, unable to utilize vast unlabeled data. To this end, we propose Mix-Training HuBERT (MT-HuBERT), a self-supervised learning (SSL) pre-training framework that implements the MT criterion during pre-training. MT-HuBERT predicts, in a self-supervised manner, the clean acoustic units of each constituent signal from contextual cues, in contrast to predicting compositional patterns of mixed speech. Experiments conducted on the Google Speech Commands (GSC v2) corpus demonstrate that our proposed MT-HuBERT consistently outperforms several state-of-the-art baselines in few-shot KWS tasks under both mixed and clean conditions.

[616] SAR-LM: Symbolic Audio Reasoning with Large Language Models

Termeh Taheri, Yinghao Ma, Emmanouil Benetos

Main category: cs.SD

TL;DR: SAR-LM is a symbolic audio reasoning pipeline that converts audio into structured, human-readable features for better interpretability and competitive performance on audio reasoning benchmarks.

Details

Motivation: Current LLMs struggle with audio reasoning, relying on dense embeddings that are hard to interpret and fail on structured reasoning tasks. Caption-based approaches help but still use dense embeddings, limiting transparency when models fail.

Method: Converts audio into structured symbolic features across speech, sound events, and music domains, creating human-readable inputs that support reasoning and transparent error analysis.

Result: Achieves competitive performance across three benchmarks (MMAU, MMAR, OmniBench) while enabling traceable error analysis to specific features.

Conclusion: SAR-LM provides an interpretable alternative to dense embedding approaches, prioritizing transparency in audio reasoning while maintaining competitive performance.

Abstract: Large language models (LLMs) have advanced in text and vision, but their reasoning on audio remains limited. Most existing methods rely on dense audio embeddings, which are difficult to interpret and often fail on structured reasoning tasks. Caption-based approaches, introduced in recent benchmarks such as MMAU, improve performance by translating audio into text, yet still depend on dense embeddings as input, offering little insight when models fail. We present SAR-LM, a symbolic audio reasoning pipeline that builds on this caption-based paradigm by converting audio into structured, human-readable features across speech, sound events, and music. These symbolic inputs support both reasoning and transparent error analysis, enabling us to trace failures to specific features. Across three benchmarks, MMAU, MMAR, and OmniBench, SAR-LM achieves competitive results, while prioritizing interpretability as its primary contribution.

[617] Metric Analysis for Spatial Semantic Segmentation of Sound Scenes

Mayank Mishra, Paul Magron, Romain Serizel

Main category: cs.SD

TL;DR: The paper analyzes and improves the CA-SDR metric for evaluating spatial semantic segmentation of sound scenes, proposing modifications to better handle classification errors and cross-contamination.

Details

Motivation: Existing evaluation of spatial semantic segmentation systems uses separate metrics for source separation and sound event classification, making system comparison challenging. The joint CA-SDR metric was proposed but has limitations in handling classification errors and cross-contamination.

Method: The authors compare CA-SDR with classical SDR, analyze its limitations, propose a modified CA-SDR that focuses on class-agnostic SDR first and then accounts for mislabeled sources, and introduce penalties for labeling and separation errors.

Result: The analysis reveals cases where CA-SDR doesn’t allow proper system comparison, and the proposed modifications address these issues by better handling classification errors and cross-contamination between separated sources.

Conclusion: The paper presents an improved evaluation metric for spatial semantic segmentation that more accurately reflects both separation and classification performance through modified CA-SDR with appropriate penalties.

Abstract: Spatial semantic segmentation of sound scenes (S5) consists of jointly performing audio source separation and sound event classification from a multichannel audio mixture. To evaluate S5 systems, one can consider two individual metrics, i.e., one for source separation and another for sound event classification, but this approach makes it challenging to compare S5 systems. Thus, a joint class-aware signal-to-distortion ratio (CA-SDR) metric was proposed to evaluate S5 systems. In this work, we first compare the CA-SDR with the classical SDR on scenarios with only classification errors. We then analyze the cases where the metric might not allow proper comparison of the systems. To address this problem, we propose a modified version of the CA-SDR which first focuses on class-agnostic SDR and then accounts for the wrongly labeled sources. We also analyze the performance of the two metrics under cross-contamination between separated audio sources. Finally, we propose a first set of penalties in an attempt to make the metric more reflective of the labeling and separation errors.

[618] Generating Novel and Realistic Speakers for Voice Conversion

Meiying Melissa Chen, Zhenyu Wang, Zhiyao Duan

Main category: cs.SD

TL;DR: SpeakerVAE: A lightweight method using hierarchical VAE to generate novel speaker representations for voice conversion without requiring target utterances, compatible with existing VC models.

Details

Motivation: Most voice conversion systems require target utterances, limiting use when target data is unavailable or when users want conversion to novel, unseen voices.

Method: Deep hierarchical variational autoencoder to model speaker timbre space, generating novel speaker representations by sampling from the trained model as a plug-in module for VC pipelines.

Result: Successfully generates novel, unseen speakers with quality comparable to training speakers when evaluated with FACodec and CosyVoice2 VC models.

Conclusion: SpeakerVAE provides a flexible solution for generating novel speaker voices in voice conversion without requiring target data or modifying base VC systems.

Abstract: Voice conversion models modify timbre while preserving paralinguistic features, enabling applications like dubbing and identity protection. However, most VC systems require access to target utterances, limiting their use when target data is unavailable or when users desire conversion to entirely novel, unseen voices. To address this, we introduce a lightweight method SpeakerVAE to generate novel speakers for VC. Our approach uses a deep hierarchical variational autoencoder to model the speaker timbre space. By sampling from the trained model, we generate novel speaker representations for voice synthesis in a VC pipeline. The proposed method is a flexible plug-in module compatible with various VC models, without co-training or fine-tuning of the base VC system. We evaluated our approach with state-of-the-art VC models: FACodec and CosyVoice2. The results demonstrate that our method successfully generates novel, unseen speakers with quality comparable to that of the training speakers.

[619] E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis

Zhisheng Zhang, Derui Wang, Yifan Mi, Zhiyong Wu, Jie Gao, Yuxin Cao, Kai Ye, Minhui Xue, Jie Hao

Main category: cs.SD

TL;DR: E2E-VGuard is a proactive defense framework that protects against production LLM-based speech synthesis and ASR-driven end-to-end voice cloning attacks by combining encoder ensembles for timbre protection and ASR-targeted adversarial examples for pronunciation disruption.

Details

Motivation: Existing defense techniques cannot effectively counter production LLM-based speech synthesis and the emerging threat of ASR-driven end-to-end voice cloning systems, which bypass the need for manual transcript annotation and are increasingly used in commercial APIs.

Method: The framework employs encoder ensemble with feature extractor for timbre protection, ASR-targeted adversarial examples to disrupt pronunciation, and incorporates psychoacoustic model to ensure imperceptible perturbations.

Result: Comprehensive evaluation across 16 open-source synthesizers and 3 commercial APIs on Chinese and English datasets confirms E2E-VGuard’s effectiveness in protecting both timbre and pronunciation, with real-world deployment validation.

Conclusion: E2E-VGuard successfully addresses the security gaps in defending against modern speech synthesis threats, particularly production LLM-based systems and ASR-driven end-to-end voice cloning attacks.

Abstract: Recent advancements in speech synthesis technology have enriched our daily lives, with high-quality and human-like audio widely adopted across real-world applications. However, malicious exploitation like voice-cloning fraud poses severe security risks. Existing defense techniques struggle to address the production large language model (LLM)-based speech synthesis. While previous studies have considered the protection for fine-tuning synthesizers, they assume manually annotated transcripts. Given the labor intensity of manual annotation, end-to-end (E2E) systems leveraging automatic speech recognition (ASR) to generate transcripts are becoming increasingly prevalent, e.g., voice cloning via commercial APIs. Therefore, this E2E speech synthesis also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework for two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. Specifically, we employ the encoder ensemble with a feature extractor to protect timbre, while ASR-targeted adversarial examples disrupt pronunciation. Moreover, we incorporate the psychoacoustic model to ensure perturbative imperceptibility. For a comprehensive evaluation, we test 16 open-source synthesizers and 3 commercial APIs across Chinese and English datasets, confirming E2E-VGuard’s effectiveness in timbre and pronunciation protection. Real-world deployment validation is also conducted. Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/.

[620] BridgeVoC: Revitalizing Neural Vocoder from a Restoration Perspective

Andong Li, Tong Lei, Rilin Chen, Kai Li, Meng Yu, Xiaodong Li, Dong Yu, Chengshi Zheng

Main category: cs.SD

TL;DR: BridgeVoC is a novel diffusion vocoder that frames vocoder tasks as audio restoration problems using Schrodinger bridge framework, achieving state-of-the-art performance with only 4 sampling steps through subband-aware modeling and omnidirectional distillation.

Details

Motivation: To reframe neural vocoder tasks through audio restoration perspective by analyzing rank characteristics of Mel-spectrum and treating vocoder generation as restoring target spectrum from range-space spectral surrogates.

Method: Uses Schrodinger bridge framework with RSS and target spectrum as endpoints, subband-aware convolutional diffusion network with uneven subband division and large-kernel attention for T-F modeling, and omnidirectional distillation for single-step inference.

Result: Achieves SOTA performance on various benchmarks with fewer parameters, lower computational cost, and competitive inference speed using only 4 sampling steps, maintaining superiority even with single-step inference.

Conclusion: BridgeVoC successfully demonstrates that vocoder tasks can be effectively treated as audio restoration problems, achieving high-quality synthesis with efficient sampling through novel diffusion modeling and distillation techniques.

Abstract: This paper revisits the neural vocoder task through the lens of audio restoration and propose a novel diffusion vocoder called BridgeVoC. Specifically, by rank analysis, we compare the rank characteristics of Mel-spectrum with other common acoustic degradation factors, and cast the vocoder task as a specialized case of audio restoration, where the range-space spectral (RSS) surrogate of the target spectrum acts as the degraded input. Based on that, we introduce the Schrodinger bridge framework for diffusion modeling, which defines the RSS and target spectrum as dual endpoints of the stochastic generation trajectory. Further, to fully utilize the hierarchical prior of subbands in the time-frequency (T-F) domain, we elaborately devise a novel subband-aware convolutional diffusion network as the data predictor, where subbands are divided following an uneven strategy, and convolutional-style attention module is employed with large kernels for efficient T-F contextual modeling. To enable single-step inference, we propose an omnidirectional distillation loss to facilitate effective information transfer from the teacher model to the student model, and the performance is improved by combining target-related and bijective consistency losses. Comprehensive experiments are conducted on various benchmarks and out-of-distribution datasets. Quantitative and qualitative results show that while enjoying fewer parameters, lower computational cost, and competitive inference speed, the proposed BridgeVoC yields stateof-the-art performance over existing advanced GAN-, DDPMand flow-matching-based baselines with only 4 sampling steps. And consistent superiority is still achieved with single-step inference.

[621] Generating Piano Music with Transformers: A Comparative Study of Scale, Data, and Metrics

Jonathan Lehmkuhl, Ábel Ilyés-Kun, Nico Bremes, Cemhan Kaan Özaltan, Frederik Muthers, Jiayi Yuan

Main category: cs.SD

TL;DR: Systematic comparison of transformers for symbolic piano music generation, examining datasets, architectures, model sizes, and training strategies, with correlation analysis between metrics and human judgment.

Details

Motivation: Lack of comprehensive studies on how specific design choices affect symbolic music generation quality, despite various transformer models being proposed.

Method: Systematic comparison of different datasets, model architectures, model sizes, and training strategies; evaluation using quantitative metrics correlated with human judgment from listening studies.

Result: Best-performing model is a 950M-parameter transformer trained on 80K MIDI files from diverse genres, producing outputs often rated as human-composed in Turing-style listening surveys.

Conclusion: Comprehensive analysis provides insights into effective design choices for symbolic music generation, with large-scale transformers trained on diverse datasets achieving human-level composition quality.

Abstract: Although a variety of transformers have been proposed for symbolic music generation in recent years, there is still little comprehensive study on how specific design choices affect the quality of the generated music. In this work, we systematically compare different datasets, model architectures, model sizes, and training strategies for the task of symbolic piano music generation. To support model development and evaluation, we examine a range of quantitative metrics and analyze how well they correlate with human judgment collected through listening studies. Our best-performing model, a 950M-parameter transformer trained on 80K MIDI files from diverse genres, produces outputs that are often rated as human-composed in a Turing-style listening survey.

[622] Twenty-Five Years of MIR Research: Achievements, Practices, Evaluations, and Future Challenges

Geoffroy Peeters, Zafar Rafii, Magdalena Fuentes, Zhiyao Duan, Emmanouil Benetos, Juhan Nam, Yuki Mitsufuji

Main category: cs.SD

TL;DR: This paper traces the 25-year evolution of Music Information Retrieval (MIR), highlighting key research achievements, successful practices like annual benchmarks and open research, and future challenges.

Details

Motivation: To reflect on the main research achievements and evolution of Music Information Retrieval over the past 25 years, examining the practices that have driven its rapid development.

Method: The paper traces MIR evolution by reviewing main research achievements along three EDICS (music analysis, processing, and generation) and analyzing successful practices including annual benchmarks (MIREX), reproducible research, industry engagement, and DEI commitments.

Result: The analysis identifies key factors for MIR’s success: annual research benchmarks, pursuit of reproducible/open research, active industry engagement, and commitment to diversity and inclusion, which have created a vibrant research community.

Conclusion: MIR has evolved significantly over 25 years through successful practices that foster rapid development, though the field still faces future challenges that need to be addressed.

Abstract: In this paper, we trace the evolution of Music Information Retrieval (MIR) over the past 25 years. While MIR gathers all kinds of research related to music informatics, a large part of it focuses on signal processing techniques for music data, fostering a close relationship with the IEEE Audio and Acoustic Signal Processing Technical Commitee. In this paper, we reflect the main research achievements of MIR along the three EDICS related to music analysis, processing and generation. We then review a set of successful practices that fuel the rapid development of MIR research. One practice is the annual research benchmark, the Music Information Retrieval Evaluation eXchange, where participants compete on a set of research tasks. Another practice is the pursuit of reproducible and open research. The active engagement with industry research and products is another key factor for achieving large societal impacts and motivating younger generations of students to join the field. Last but not the least, the commitment to diversity, equity and inclusion ensures MIR to be a vibrant and open community where various ideas, methodologies, and career pathways collide. We finish by providing future challenges MIR will have to face.

[623] AcousTools: A `Full-Stack’, Python-Based, Acoustic Holography Library

Joshua Mukherjee, Giorgos Christopoulos, Zhouyang Shen, Sriram Subramanian, Ryuji Hirayama

Main category: cs.SD

TL;DR: AcousTools is a Python-based acoustic holography library that provides a full-stack solution for acoustic holography applications, covering setup, modeling, phase retrieval, analysis, and hardware control.

Details

Motivation: There is no existing single software that provides a complete solution for acoustic holography applications, from abstraction to physicalization, covering all necessary steps in the process.

Method: Developed AcousTools as a Python library that supports the full suite of acoustic holographic applications, including setup, acoustic propagation modeling, transducer phase retrieval, sound field analysis, and hardware control.

Result: AcousTools successfully meets each step of the full-stack requirements for acoustic holography and has the potential to become the standard library in this field.

Conclusion: AcousTools provides a uniquely complete and easy-to-use solution that will enable researchers to develop novel acoustic holography applications and facilitate accurate review of others’ work.

Abstract: Acoustic Holography is an emerging field where mid-air ultrasound is controlled and manipulated for novel and exciting applications. These range from mid-air haptics, volumetric displays, contactless fabrication, and even chemical and biomedical applications such as drug delivery. To develop these applications, a software framework to predict acoustic behaviour and simulating resulting effects, such as applied forces or scattering patterns is desirable. There have been various software libraries and platforms that attempt to fill this role, but there is yet to be a single piece of software that acts as a ‘full-stack’ solution. We define this full-stack as the process from abstraction to physicalisation starting with setup, modelling acoustic propagation, transducer phase retrieval, sound field analysis, and control of the acoustic holographic hardware itself. Existing methods fail to fulfil one or more of these categories. To address this, we present AcousTools, a Python-based acoustic holography library, designed to support the full suite of acoustic holographic applications and we show AcousTools’s ability to meet each step of the full-stack’s requirements. AcousTools has the potential to become the standard code library for acoustic holography, with the uniquely complete suite of features wrapped in a language that is known to be easy to use, AcousTools will increase the ability for researchers to develop novel applications as well as accurately review other’s work. The full-stack, aside from software, will also be useful for researchers - providing a way to view and compare methodologies by understanding where they fit into the stack.

[624] Describe Where You Are: Improving Noise-Robustness for Speech Emotion Recognition with Text Description of the Environment

Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso

Main category: cs.SD

TL;DR: Proposes a text-guided, environment-aware training method for speech emotion recognition that uses noise descriptions to improve robustness in noisy conditions.

Details

Motivation: Speech emotion recognition systems perform poorly in real-world noisy environments, and existing methods don't effectively leverage prior knowledge about testing environments.

Method: Uses text-based environment embeddings from noise descriptions via pre-trained text encoder, fused with transformer-based SER model. Employs contrastive learning and joint fine-tuning of text encoder with emotion model.

Result: Text-based environment descriptions processed by LLMs improve noise robustness. Fine-tuning text encoder with CL-based representation shows significant improvements: 76.4% (arousal), 100.0% (dominance), 27.7% (valence) at -5dB SNR.

Conclusion: Leveraging text-based environment knowledge through joint training significantly enhances SER performance in noisy conditions, demonstrating the value of multimodal approaches for noise robustness.

Abstract: Speech emotion recognition (SER) systems often struggle in real-world environments, where ambient noise severely degrades their performance. This paper explores a novel approach that exploits prior knowledge of testing environments to maximize SER performance under noisy conditions. To address this task, we propose a text-guided, environment-aware training where an SER model is trained with contaminated speech samples and their paired noise description. We use a pre-trained text encoder to extract the text-based environment embedding and then fuse it to a transformer-based SER model during training and inference. We demonstrate the effectiveness of our approach through our experiment with the MSP-Podcast corpus and real-world additive noise samples collected from the Freesound and DEMAND repositories. Our experiment indicates that the text-based environment descriptions processed by a large language model (LLM) produce representations that improve the noise-robustness of the SER system. With a contrastive learning (CL)-based representation, our proposed method can be improved by jointly fine-tuning the text encoder with the emotion recognition model. Under the -5dB signal-to-noise ratio (SNR) level, fine-tuning the text encoder improves our CL-based representation method by 76.4% (arousal), 100.0% (dominance), and 27.7% (valence).

[625] MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment

Hao Zhou, Xiaobao Guo, Yuzhe Zhu, Adams Wai-Kin Kong

Main category: cs.SD

TL;DR: MACS is the first method for multi-source audio-to-image generation that explicitly separates mixed audio inputs before generating images, outperforming state-of-the-art methods on most evaluation metrics.

Details

Motivation: Previous audio-to-image generation methods only handle single-source audio inputs, ignoring the multi-source nature of real-world auditory scenes, which limits their ability to generate comprehensive visual content.

Method: Two-stage approach: 1) Weakly supervised audio separation using CLAP model for semantic alignment, with ranking loss for contextual significance; 2) Image generation using trainable adapter and MLP layer to map separated audio signals to generation conditions.

Result: Outperforms current state-of-the-art methods in 17 out of 21 evaluation indexes across multi-source, mixed-source, and single-source tasks, delivering superior visual quality.

Conclusion: MACS successfully bridges the gap in multi-source audio-to-image generation by explicitly separating audio components before generation, establishing a new benchmark for this task.

Abstract: Propelled by the breakthrough in deep generative models, audio-to-image generation has emerged as a pivotal cross-modal task that converts complex auditory signals into rich visual representations. However, previous works only focus on single-source audio inputs for image generation, ignoring the multi-source characteristic in natural auditory scenes, thus limiting the performance in generating comprehensive visual content. To bridge this gap, we propose a method called MACS to conduct multi-source audio-to-image generation. To our best knowledge, this is the first work that explicitly separates multi-source audio to capture the rich audio components before image generation. MACS is a two-stage method. In the first stage, multi-source audio inputs are separated by a weakly supervised method, where the audio and text labels are semantically aligned by casting into a common space using the large pre-trained CLAP model. We introduce a ranking loss to consider the contextual significance of the separated audio signals. In the second stage, effective image generation is achieved by mapping the separated audio signals to the generation condition using only a trainable adapter and a MLP layer. We preprocess the LLP dataset as the first full multi-source audio-to-image generation benchmark. The experiments are conducted on multi-source, mixed-source, and single-source audio-to-image generation tasks. The proposed MACS outperforms the current state-of-the-art methods in 17 out of the 21 evaluation indexes on all tasks and delivers superior visual quality.

[626] GRAM: Spatial general-purpose audio representation models for real-world applications

Goksenin Yuksel, Marcel van Gerven, Kiki van der Heijden

Main category: cs.SD

TL;DR: GRAM is a multi-channel masked autoencoder that learns spatial audio representations from simulated real-world scenes, outperforming state-of-the-art models on both standard and naturalistic benchmarks while using less training data.

Details

Motivation: Current audio foundation models are trained on dry, single-channel audio and fail in real-world acoustic environments with reverberation and noise, overlooking spatial sound properties and limiting sound localization tasks.

Method: Proposed GRAM: a General-purpose Real-world Audio Model using multi-channel masked auto-encoder approach to learn spatial audio representations from high-quality simulated real-world scenes.

Result: GRAM surpasses all state-of-the-art self-supervised audio foundation models and speech models on both HEAR and Nat-HEAR benchmarks, achieves state-of-the-art localization performance (even beating supervised approaches), and works with both binaural and Ambisonics formats.

Conclusion: GRAM represents a significant advancement towards robust, spatial audio foundation models for real-world applications, demonstrating effective transfer to real-world recordings.

Abstract: Although audio foundations models have seen great progress on a wide variety of tasks, their application in real-world acoustic environments with reverberation and noise has been less successful. Moreover, as audio foundation models are typically trained on dry, single-channel audio clips, the inherent spatial nature of real-world sound scenes is overlooked and tasks involving sound localization ruled out. To address these limitations, we propose GRAM: a General-purpose Real-world Audio Model utilizing a multi-channel masked auto-encoder approach to efficiently learn spatial audio representations from high-quality simulated real-world scenes. To evaluate the performance of GRAM and other audio foundation models in real-world sound scenes, we release Nat-HEAR: A naturalistic version of the HEAR benchmark suite comprising a simulated real-world version, as well as two new sound localization tasks. We show that the performance of GRAM surpasses all state-of-the-art self-supervised audio foundation models and speech models on both HEAR and Nat-HEAR, while using only a fraction of the training data. GRAM also showcases state-of-the-art localization performance, surpassing even supervised sound localization approaches, and can be flexibly applied either to a two-channel, binaural sound format or a four-channel, Ambisonics format. Validating GRAM’s performance on real-world sound recordings demonstrates robust transfer to real-world scenes. Taken together, GRAM presents a significant advancement towards robust, spatial audio foundation models for real-world applications.

[627] TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data

Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari

Main category: cs.SD

TL;DR: TTSOps is an automated closed-loop framework that builds multi-speaker TTS systems from noisy web-scale speech data without requiring curated corpora, outperforming conventional methods in naturalness and speaker diversity.

Details

Motivation: Conventional TTS systems require well-curated data with high acoustic quality and accurate alignment, limiting scalability and speaker diversity. Recent methods overlook TTS models' noise robustness and the value of perceptually low-quality but informative samples.

Method: Three core components: automated data collection from dark data, utterance-level dynamic selection of data cleansing methods based on training quality, and evaluation-in-the-loop selection using predicted MOS scores to estimate utterance impact. Jointly optimizes corpus and TTS model in closed-loop.

Result: Extensive experiments on Japanese YouTube data show TTSOps outperforms conventional acoustic-quality-based baselines in both naturalness and speaker diversity of synthesized speech.

Conclusion: TTSOps successfully demonstrates that automated construction of multi-speaker TTS systems from noisy web-scale data is feasible and superior to traditional curated data approaches.

Abstract: This paper presents TTSOps, a fully automated closed-loop framework for constructing multi-speaker text-to-speech (TTS) systems from noisy, uncurated web-scale speech data, often referred to as ``dark data,’’ such as online videos. Conventional TTS training pipelines require well-curated corpora with high acoustic quality and accurate text-speech alignment, which severely limits scalability, speaker diversity, and real-world applicability. While recent studies have proposed acoustic-quality-based data selection techniques, they often overlook two critical aspects: (1) the inherent robustness of modern TTS models to noise, and (2) the potential contribution of perceptually low-quality yet informative samples. To address these issues, TTSOps introduces a data-centric training pipeline that integrates three core components: (1) automated data collection from dark data sources, (2) utterance-level dynamic selection of data cleansing methods based on training data quality, and (3) evaluation-in-the-loop data selection using automatically predicted mean opinion scores (MOS) to estimate each utterance’s impact on model performance. Furthermore, TTSOps jointly optimizes the corpus and the TTS model in a closed-loop framework by dynamically adapting both data selection and data cleansing processes to the characteristics of the target TTS model. Extensive experiments on Japanese YouTube data demonstrate that TTSOps outperforms conventional acoustic-quality-based baselines in both the naturalness and speaker diversity of synthesized speech.

[628] DIFFA: Large Language Diffusion Models Can Listen and Understand

Jiaming Zhou, Hongjie Chen, Shiwan Zhao, Jian Kang, Jie Li, Enzhi Wang, Yujie Guo, Haoqin Sun, Hui Wang, Aobo Kong, Yong Qin, Xuelong Li

Main category: cs.SD

TL;DR: DIFFA is the first diffusion-based large audio-language model for spoken language understanding, combining a frozen diffusion language model with a dual-adapter architecture to bridge speech and language processing.

Details

Motivation: While diffusion-based language models show promise with improved controllability and bidirectional context modeling, their application to audio modality remains underexplored compared to autoregressive approaches.

Method: Uses a frozen diffusion language model with lightweight dual-adapter architecture, trained in two stages: semantic alignment via ASR objective, then instruction-following using synthetic audio-caption pairs generated by LLMs.

Result: DIFFA achieves competitive performance on MMSU, MMAU, and VoiceBench benchmarks, outperforming several autoregressive open-source baselines despite training on only 960 hours of ASR and 127 hours of synthetic instruction data.

Conclusion: Demonstrates the potential of diffusion-based language models for efficient and scalable audio understanding, opening new directions for speech-driven AI.

Abstract: Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce \textbf{DIFFA}, the first diffusion-based large audio-language model designed to perform spoken language understanding. DIFFA integrates a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. We employ a two-stage training pipeline: first, aligning semantic representations via an ASR objective; then, learning instruction-following abilities through synthetic audio-caption pairs automatically generated by prompting LLMs. Despite being trained on only 960 hours of ASR and 127 hours of synthetic instruction data, DIFFA demonstrates competitive performance on major benchmarks, including MMSU, MMAU, and VoiceBench, outperforming several autoregressive open-source baselines. Our results reveal the potential of diffusion-based language models for efficient and scalable audio understanding, opening a new direction for speech-driven AI. Our code will be available at https://github.com/NKU-HLT/DIFFA.git.

[629] Progressive Facial Granularity Aggregation with Bilateral Attribute-based Enhancement for Face-to-Speech Synthesis

Yejin Jeon, Youngjae Kim, Jihyun Lee, Hyounghun Kim, Gary Geunbae Lee

Main category: cs.SD

TL;DR: This paper proposes a novel face-to-voice synthesis method that preserves fine-grained facial attributes like gender and ethnicity through multi-granular facial representation and multi-task learning, improving voice synthesis quality and alignment robustness.

Details

Motivation: Traditional text-to-speech systems fail to preserve users' original voices after traumatic events like strokes. Existing face-to-voice methods lose fine-grained facial information and require inefficient multi-stage training pipelines.

Method: Uses fine-grained facial attribute modeling by decomposing facial images into non-overlapping segments, creates multi-granular representations, employs multi-task learning for speaker attributes, and adopts multi-view training with various visual perspectives.

Result: Extensive evaluations show substantial improvements in face-voice congruence and synthesis stability compared to existing methods.

Conclusion: The proposed approach effectively addresses limitations of existing face-to-voice synthesis methods by preserving fine-grained facial attributes and achieving better voice synthesis quality through robust multi-view training.

Abstract: For individuals who have experienced traumatic events such as strokes, speech may no longer be a viable means of communication. While text-to-speech (TTS) can be used as a communication aid since it generates synthetic speech, it fails to preserve the user’s own voice. As such, face-to-voice (FTV) synthesis, which derives corresponding voices from facial images, provides a promising alternative. However, existing methods rely on pre-trained visual encoders, and finetune them to align with speech embeddings, which strips fine-grained information from facial inputs such as gender or ethnicity, despite their known correlation with vocal traits. Moreover, these pipelines are multi-stage, which requires separate training of multiple components, thus leading to training inefficiency. To address these limitations, we utilize fine-grained facial attribute modeling by decomposing facial images into non-overlapping segments and progressively integrating them into a multi-granular representation. This representation is further refined through multi-task learning of speaker attributes such as gender and ethnicity at both the visual and acoustic domains. Moreover, to improve alignment robustness, we adopt a multi-view training strategy by pairing various visual perspectives of a speaker in terms of different angles and lighting conditions, with identical speech recordings. Extensive subjective and objective evaluations confirm that our approach substantially enhances face-voice congruence and synthesis stability.

[630] WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms

Goksenin Yuksel, Pierre Guetschel, Michael Tangermann, Marcel van Gerven, Kiki van der Heijden

Main category: cs.SD

TL;DR: WavJEPA is a waveform-based Joint-Embedding Predictive Architecture that outperforms state-of-the-art time-domain audio models across various tasks with fewer computational resources. WavJEPA-Nat extends this to multi-channel processing for robustness in noisy environments.

Details

Motivation: To overcome limitations of spectrogram-based audio representation learning (long latency, phase information loss) and address the gap where self-supervised speech representation learning from waveforms has succeeded but general-purpose audio representation learning hasn't achieved similar success.

Method: Proposes WavJEPA using high-level semantic representation learning instead of speech unit/token level learning. Also presents WavJEPA-Nat as a multi-channel extension trained on simulated naturalistic scenes for robustness.

Result: WavJEPA substantially outperforms state-of-the-art time-domain audio foundation models across various downstream tasks while requiring fewer computational resources. WavJEPA-Nat shows high robustness to reverberation and noise.

Conclusion: Demonstrates feasibility and computational efficiency of general-purpose audio representation learning from raw waveforms, enabling low-latency, robust time-domain audio foundation models for real-world applications.

Abstract: Learning audio representations from raw waveforms overcomes key limitations of spectrogram-based audio representation learning, such as the long latency of spectrogram computation and the loss of phase information. Yet, while self-supervised speech representation learning from raw waveforms has been remarkably successful, these approaches have not achieved similar feats for general-purpose audio representation learning from waveforms. Here, we propose WavJEPA, a waveform-based version of the Joint-Embedding Predictive Architecture. WavJEPA leverages high-level semantic representation learning to tackle the shortcomings of representation learning at the speech unit or token level. We show that this approach substantially outperforms state-of-the-art time-domain audio foundation models across a wide variety of downstream benchmark tasks, while requiring considerably fewer computational resources. Additionally, to overcome the performance drop that time-domain models typically exhibit in noisy and reverberant real-world acoustic environments, we present WavJEPA-Nat. WavJEPA-Nat is a multi-channel extension of the WavJEPA architecture trained on simulated naturalistic scenes. We find that WavJEPA-Nat is highly robust to reverberation and noise. These results highlight the feasibility and computational efficiency of general-purpose audio representation learning from raw waveforms, showcasing the potential for low-latency, robust time-domain audio foundation models for real-world applications.

[631] MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages

Hardik B. Sailor, Aw Ai Ti, Chen Fang Yih Nancy, Chiu Ying Lay, Ding Yang, He Yingxu, Jiang Ridong, Li Jingtao, Liao Jingyi, Liu Zhuohan, Lu Yanfeng, Ma Yi, Manas Gupta, Muhammad Huzaifah Bin Md Shahrin, Nabilah Binte Md Johan, Nattadaporn Lertcheva, Pan Chunlei, Pham Minh Duc, Siti Maryam Binte Ahmad Subaidi, Siti Umairah Binte Mohammad Salleh, Sun Shuo, Tarun Kumar Vangani, Wang Qiongqiong, Won Cheng Yi Lewis, Wong Heng Meng Jeremy, Wu Jinyang, Zhang Huayun, Zhang Longyin, Zou Xunlong

Main category: cs.SD

TL;DR: MERaLiON-SER is a robust multilingual speech emotion recognition model that outperforms existing models by combining categorical and dimensional emotion modeling using hybrid loss functions.

Details

Motivation: To create a comprehensive speech emotion recognition system that works across multiple languages (English and Southeast Asian languages) and captures both discrete emotion categories and fine-grained emotional dimensions for better paralinguistic understanding.

Method: Uses hybrid objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient (CCC) losses for joint discrete and dimensional emotion modeling, enabling capture of both emotion categories (happy, angry) and dimensions (arousal, valence, dominance).

Result: Extensive evaluations show MERaLiON-SER consistently surpasses both open-source speech encoders and large Audio-LLMs across multilingual Singaporean languages (English, Chinese, Malay, Tamil) and other public benchmarks.

Conclusion: Specialized speech-only models are crucial for accurate paralinguistic understanding and cross-lingual generalization, providing foundation for emotion-aware perception in future agentic audio systems for more empathetic multimodal reasoning.

Abstract: We present MERaLiON-SER, a robust speech emotion recognition model de- signed for English and Southeast Asian languages. The model is trained using a hybrid objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient (CCC) losses for joint discrete and dimensional emotion modelling. This dual approach enables the model to capture both the distinct categories of emotion (like happy or angry) and the fine-grained, such as arousal (intensity), valence (positivity/negativity), and dominance (sense of control), lead- ing to a more comprehensive and robust representation of human affect. Extensive evaluations across multilingual Singaporean languages (English, Chinese, Malay, and Tamil ) and other public benchmarks show that MERaLiON-SER consistently surpasses both open-source speech encoders and large Audio-LLMs. These results underscore the importance of specialised speech-only models for accurate paralin- guistic understanding and cross-lingual generalisation. Furthermore, the proposed framework provides a foundation for integrating emotion-aware perception into future agentic audio systems, enabling more empathetic and contextually adaptive multimodal reasoning.

[632] Perceptually Aligning Representations of Music via Noise-Augmented Autoencoders

Mathias Rose Bjare, Giorgia Cantisani, Marco Pasini, Stefan Lattner, Gerhard Widmer

Main category: cs.SD

TL;DR: Training autoencoders with noise reconstruction and perceptual losses creates hierarchical encodings where perceptually important information is captured in coarser structures, improving latent diffusion decoding for music analysis.

Details

Motivation: To develop autoencoders that produce encodings structured according to perceptual hierarchy, where perceptually salient information is organized in coarser representation structures than conventional methods.

Method: Train autoencoders to reconstruct inputs from noised versions of their encodings combined with perceptual losses, then apply this approach to audio autoencoders.

Result: Perceptually salient information is captured in coarser representation structures, and the perceptual hierarchies improve latent diffusion decoding for estimating music pitch surprisal and predicting EEG-brain responses to music.

Conclusion: The proposed training method successfully creates perceptual hierarchies in autoencoder encodings that enhance downstream tasks like music analysis and brain response prediction.

Abstract: We argue that training autoencoders to reconstruct inputs from noised versions of their encodings, when combined with perceptual losses, yields encodings that are structured according to a perceptual hierarchy. We demonstrate the emergence of this hierarchical structure by showing that, after training an audio autoencoder in this manner, perceptually salient information is captured in coarser representation structures than with conventional training. Furthermore, we show that such perceptual hierarchies improve latent diffusion decoding in the context of estimating surprisal in music pitches and predicting EEG-brain responses to music listening. Pretrained weights are available on github.com/CPJKU/pa-audioic.

cs.LG

[633] Deep one-gate per layer networks with skip connections are universal classifiers

Raul Rojas

Main category: cs.LG

TL;DR: A two-hidden-layer MLP for binary classification can be transformed into a deep neural network with one-gate layers and skip connections.

Details

Motivation: To demonstrate the transformability of traditional MLP architectures into more modern deep neural network structures with gated layers and skip connections.

Method: Transform a two-hidden-layer multilayer perceptron designed for binary classification into a deep neural network architecture featuring one-gate layers and skip connections.

Result: Successful transformation of the MLP architecture into a deep neural network with the specified components.

Conclusion: Traditional MLP architectures can be readily converted into modern deep neural network structures with gated layers and skip connections, showing architectural flexibility.

Abstract: This paper shows how a multilayer perceptron with two hidden layers, which has been designed to classify two classes of data points, can easily be transformed into a deep neural network with one-gate layers and skip connections.

[634] Daily Forecasting for Annual Time Series Datasets Using Similarity-Based Machine Learning Methods: A Case Study in the Energy Market

Mahdi Goldani

Main category: cs.LG

TL;DR: This study introduces a daily proxy for the Energy Security Index using time series similarity measures and XGBoost forecasting to enable high-frequency monitoring of energy security dynamics.

Details

Motivation: The annual reporting of the Energy Security Index limits responsiveness to short-term policy and market fluctuations, creating a need for more frequent monitoring capabilities.

Method: Two-stage approach: 1) Identify daily proxy using six time series similarity measures on energy variables, 2) Model selected proxy with XGBoost algorithm for 15-day ahead forecasts.

Result: Brent volume emerged as the best proxy. Model achieved R² of 0.981 (training) and 0.945 (test) with acceptable error metrics. 15-day forecast shows fluctuating pattern with peak around day 4 and downward trend toward day 15.

Conclusion: The framework successfully converts low-frequency macroeconomic indicators into high-frequency signals, enabling real-time monitoring of energy security for policymakers in data-scarce environments.

Abstract: The policy environment of countries changes rapidly, influencing macro-level indicators such as the Energy Security Index. However, this index is only reported annually, limiting its responsiveness to short-term fluctuations. To address this gap, the present study introduces a daily proxy for the Energy Security Index and applies it to forecast energy security at a daily frequency.The study employs a two stage approach first, a suitable daily proxy for the annual Energy Security Index is identified by applying six time series similarity measures to key energy related variables. Second, the selected proxy is modeled using the XGBoost algorithm to generate 15 day ahead forecasts, enabling high frequency monitoring of energy security dynamics.As the result of proxy choosing, Volume Brent consistently emerged as the most suitable proxy across the majority of methods. The model demonstrated strong performance, with an R squared of 0.981 on the training set and 0.945 on the test set, and acceptable error metrics . The 15 day forecast of Brent volume indicates short term fluctuations, with a peak around day 4, a decline until day 8, a rise near day 10, and a downward trend toward day 15, accompanied by prediction intervals.By integrating time series similarity measures with machine learning based forecasting, this study provides a novel framework for converting low frequency macroeconomic indicators into high frequency, actionable signals. The approach enables real time monitoring of the Energy Security Index, offering policymakers and analysts a scalable and practical tool to respond more rapidly to fast changing policy and market conditions, especially in data scarce environments.

[635] Diversified Flow Matching with Translation Identifiability

Sagar Shrestha, Xiao Fu

Main category: cs.LG

TL;DR: DFM is an ODE-based framework for diversified distribution matching that guarantees translation identifiability and provides transport trajectories, overcoming GAN limitations.

Details

Motivation: To address GAN instability and lack of transport trajectory information in DDM, while maintaining translation identifiability for unpaired domain translation.

Method: Proposed diversified flow matching with bilevel optimization, nonlinear interpolant, and structural reformulation to adapt flow matching for unified translation functions.

Result: DFM successfully achieves translation identifiability and provides transport trajectories, validated on synthetic and real-world datasets.

Conclusion: DFM is the first ODE-based approach that guarantees translation identifiability while providing useful transport trajectory information.

Abstract: Diversified distribution matching (DDM) finds a unified translation function mapping a diverse collection of conditional source distributions to their target counterparts. DDM was proposed to resolve content misalignment issues in unpaired domain translation, achieving translation identifiability. However, DDM has only been implemented using GANs due to its constraints on the translation function. GANs are often unstable to train and do not provide the transport trajectory information – yet such trajectories are useful in applications such as single-cell evolution analysis and robot route planning. This work introduces diversified flow matching (DFM), an ODE-based framework for DDM. Adapting flow matching (FM) to enforce a unified translation function as in DDM is challenging, as FM learns the translation function’s velocity rather than the translation function itself. A custom bilevel optimization-based training loss, a nonlinear interpolant, and a structural reformulation are proposed to address these challenges, offering a tangible implementation. To our knowledge, DFM is the first ODE-based approach guaranteeing translation identifiability. Experiments on synthetic and real-world datasets validate the proposed method.

[636] Effective Test-Time Scaling of Discrete Diffusion through Iterative Refinement

Sanghyun Lee, Sunwoo Kim, Seungryong Kim, Jongho Park, Dongmin Park

Main category: cs.LG

TL;DR: Iterative Reward-Guided Refinement (IterRef) is a novel test-time scaling method for discrete diffusion models that uses reward-guided noising-denoising transitions to progressively refine misaligned intermediate states, achieving superior performance with low compute budgets.

Details

Motivation: Test-time scaling through reward-guided generation remains largely unexplored for discrete diffusion models despite its potential as a promising alternative to existing methods.

Method: Introduces Iterative Reward-Guided Refinement (IterRef) that leverages reward-guided noising-denoising transitions within a Multiple-Try Metropolis (MTM) framework to progressively refine misaligned intermediate states, proving convergence to the reward-aligned distribution.

Result: IterRef achieves consistent improvements in reward-guided generation quality across text and image domains, with striking gains under low compute budgets that far surpass prior state-of-the-art baselines.

Conclusion: IterRef provides an effective test-time scaling approach for discrete diffusion models that progressively refines intermediate states toward optimal distributions, demonstrating superior performance particularly under constrained computational resources.

Abstract: Test-time scaling through reward-guided generation remains largely unexplored for discrete diffusion models despite its potential as a promising alternative. In this work, we introduce Iterative Reward-Guided Refinement (IterRef), a novel test-time scaling method tailored to discrete diffusion that leverages reward-guided noising-denoising transitions to progressively refine misaligned intermediate states. We formalize this process within a Multiple-Try Metropolis (MTM) framework, proving convergence to the reward-aligned distribution. Unlike prior methods that assume the current state is already aligned with the reward distribution and only guide the subsequent transition, our approach explicitly refines each state in situ, progressively steering it toward the optimal intermediate distribution. Across both text and image domains, we evaluate IterRef on diverse discrete diffusion models and observe consistent improvements in reward-guided generation quality. In particular, IterRef achieves striking gains under low compute budgets, far surpassing prior state-of-the-art baselines.

[637] Lookahead Unmasking Elicits Accurate Decoding in Diffusion Language Models

Sanghyun Lee, Seungryong Kim, Jongho Park, Dongmin Park

Main category: cs.LG

TL;DR: Lookahead Unmasking (LookUM) improves masked diffusion models by reformulating sampling as path selection over unmasking orders, using path generation and uncertainty-based verification to avoid early decoding errors.

Details

Motivation: Current masked diffusion models suffer from myopic heuristics like confidence-based sampling that optimize locally, fail to leverage extra test-time compute, and let early decoding mistakes cascade.

Method: Proposes LookUM framework with: (i) path generator that samples from pools of unmasking sets, and (ii) verifier that computes path uncertainty and performs importance sampling to select final paths.

Result: Validated across six benchmarks (mathematics, planning, coding) with consistent performance improvements. LookUM requires only 2-3 paths for peak performance. Base LLaDA with LookUM rivals RL-tuned LLaDA 1.5 performance.

Conclusion: Uncertainty-based verification provides orthogonal benefits to reinforcement learning, demonstrating the versatility of the framework for improving masked diffusion models.

Abstract: Masked Diffusion Models (MDMs) as language models generate by iteratively unmasking tokens, yet their performance crucially depends on the inference time order of unmasking. Prevailing heuristics, such as confidence based sampling, are myopic: they optimize locally, fail to leverage extra test-time compute, and let early decoding mistakes cascade. We propose Lookahead Unmasking (LookUM), which addresses these concerns by reformulating sampling as path selection over all possible unmasking orders without the need for an external reward model. Our framework couples (i) a path generator that proposes paths by sampling from pools of unmasking sets with (ii) a verifier that computes the uncertainty of the proposed paths and performs importance sampling to subsequently select the final paths. Empirically, erroneous unmasking measurably inflates sequence level uncertainty, and our method exploits this to avoid error-prone trajectories. We validate our framework across six benchmarks, such as mathematics, planning, and coding, and demonstrate consistent performance improvements. LookUM requires only two to three paths to achieve peak performance, demonstrating remarkably efficient path selection. The consistent improvements on both LLaDA and post-trained LLaDA 1.5 are particularly striking: base LLaDA with LookUM rivals the performance of RL-tuned LLaDA 1.5, while LookUM further enhances LLaDA 1.5 itself showing that uncertainty based verification provides orthogonal benefits to reinforcement learning and underscoring the versatility of our framework. Code will be publicly released.

[638] Adaptive Sample-Level Framework Motivated by Distributionally Robust Optimization with Variance-Based Radius Assignment for Enhanced Neural Network Generalization Under Distribution Shift

Aheer Sravon, Devdyuti Mazumder, Md. Ibrahim

Main category: cs.LG

TL;DR: Var-DRO is an adaptive DRO framework that assigns personalized robustness budgets to training samples based on their loss variance, outperforming ERM and KL-DRO on corrupted datasets while maintaining competitive performance on clean data.

Details

Motivation: Standard ERM fails on distribution shifts and minority subpopulations, while conventional DRO uses a single global robustness budget that can be overly conservative or misallocate robustness.

Method: Proposes variance-driven adaptive DRO that identifies high-risk samples and assigns personalized budgets using KL-divergence bounds, resulting in efficient water-filling solution with warmup phase and label smoothing.

Result: Achieves highest mean accuracy on CIFAR-10-C, improves overall performance on Waterbirds while matching/surpassing KL-DRO, and remains competitive on original CIFAR-10 with expected robustness trade-off.

Conclusion: Var-DRO is an unsupervised, theoretically sound, computationally efficient framework that provides adaptive robustness without requiring group labels.

Abstract: Distribution shifts and minority subpopulations frequently undermine the reliability of deep neural networks trained using Empirical Risk Minimization (ERM). Distributionally Robust Optimization (DRO) addresses this by optimizing for the worst-case risk within a neighborhood of the training distribution. However, conventional methods depend on a single, global robustness budget, which can lead to overly conservative models or a misallocation of robustness. We propose a variance-driven, adaptive, sample-level DRO (Var-DRO) framework that automatically identifies high-risk training samples and assigns a personalized robustness budget to each based on its online loss variance. Our formulation employs two-sided, KL-divergence-style bounds to constrain the ratio between adversarial and empirical weights for every sample. This results in a linear inner maximization problem over a convex polytope, which admits an efficient water-filling solution. To stabilize training, we introduce a warmup phase and a linear ramp schedule for the global cap on per-sample budgets, complemented by label smoothing for numerical robustness. Evaluated on CIFAR-10-C (corruptions), our method achieves the highest overall mean accuracy compared to ERM and KL-DRO. On Waterbirds, Var-DRO improves overall performance while matching or surpassing KL-DRO. On the original CIFAR-10 dataset, Var-DRO remains competitive, exhibiting the modest trade-off anticipated when prioritizing robustness. The proposed framework is unsupervised (requiring no group labels), straightforward to implement, theoretically sound, and computationally efficient.

[639] On the Joint Minimization of Regularization Loss Functions in Deep Variational Bayesian Methods for Attribute-Controlled Symbolic Music Generation

Matteo Pettenó, Alessandro Ilic Mezza, Alberto Bernardini

Main category: cs.LG

TL;DR: The paper analyzes the trade-off between Kullback-Leibler Divergence (KLD) and Attribute-Regularization (AR) losses in explicit latent variable models for symbolic music generation, proposing attribute transformations to achieve both controllability and regularization.

Details

Motivation: Existing approaches struggle to balance KLD and AR losses - when KLD dominates, models lack controllability; when AR dominates, the encoder violates the standard normal prior. This trade-off needs better handling for effective controllable generation.

Method: The authors explore this trade-off in symbolic music generation with continuous musical attributes, proposing suitable attribute transformations to help achieve both objectives simultaneously.

Result: The study shows that existing approaches fail to jointly minimize both regularization objectives, but appropriate attribute transformations enable achieving both controllability and proper regularization of target latent dimensions.

Conclusion: Attribute transformations provide an effective solution to balance the KLD-AR trade-off, enabling explicit latent variable models to maintain both controllability over musical attributes and proper regularization of latent spaces.

Abstract: Explicit latent variable models provide a flexible yet powerful framework for data synthesis, enabling controlled manipulation of generative factors. With latent variables drawn from a tractable probability density function that can be further constrained, these models enable continuous and semantically rich exploration of the output space by navigating their latent spaces. Structured latent representations are typically obtained through the joint minimization of regularization loss functions. In variational information bottleneck models, reconstruction loss and Kullback-Leibler Divergence (KLD) are often linearly combined with an auxiliary Attribute-Regularization (AR) loss. However, balancing KLD and AR turns out to be a very delicate matter. When KLD dominates over AR, generative models tend to lack controllability; when AR dominates over KLD, the stochastic encoder is encouraged to violate the standard normal prior. We explore this trade-off in the context of symbolic music generation with explicit control over continuous musical attributes. We show that existing approaches struggle to jointly minimize both regularization objectives, whereas suitable attribute transformations can help achieve both controllability and regularization of the target latent dimensions.

[640] Data-driven jet fuel demand forecasting: A case study of Copenhagen Airport

Alessandro Contini, Davide Cacciarelli, Murat Kulahci

Main category: cs.LG

TL;DR: This paper evaluates machine learning models for jet fuel demand forecasting using data from a Danish aviation fuel distributor, comparing traditional time series models, Prophet, LSTM networks, and hybrid approaches for 30-day ahead predictions.

Details

Motivation: Accurate jet fuel demand forecasting is crucial for supply chain optimization, but current industry practices rely on deterministic or expertise-based models rather than data-driven approaches, creating a research gap.

Method: Used data from a major Danish aviation fuel distributor to compare traditional time series models, Prophet, LSTM sequence-to-sequence neural networks, and hybrid models across three different datasets for 30-day forecasting horizon.

Result: The study demonstrates the advantages of data-driven models over traditional approaches and shows the impact of incorporating additional variables in predictive models for jet fuel demand forecasting.

Conclusion: Data-driven approaches provide significant benefits for jet fuel demand forecasting, with hybrid models and additional variables improving prediction accuracy for supply chain optimization.

Abstract: Accurate forecasting of jet fuel demand is crucial for optimizing supply chain operations in the aviation market. Fuel distributors specifically require precise estimates to avoid inventory shortages or excesses. However, there is a lack of studies that analyze the jet fuel demand forecasting problem using machine learning models. Instead, many industry practitioners rely on deterministic or expertise-based models. In this research, we evaluate the performance of data-driven approaches using a substantial amount of data obtained from a major aviation fuel distributor in the Danish market. Our analysis compares the predictive capabilities of traditional time series models, Prophet, LSTM sequence-to-sequence neural networks, and hybrid models. A key challenge in developing these models is the required forecasting horizon, as fuel demand needs to be predicted for the next 30 days to optimize sourcing strategies. To ensure the reliability of the data-driven approaches and provide valuable insights to practitioners, we analyze three different datasets. The primary objective of this study is to present a comprehensive case study on jet fuel demand forecasting, demonstrating the advantages of employing data-driven models and highlighting the impact of incorporating additional variables in the predictive models.

[641] Coupling Agent-based Modeling and Life Cycle Assessment to Analyze Trade-offs in Resilient Energy Transitions

Beichen Zhang, Mohammed T. Zaki, Hanna Breunig, Newsha K. Ajami

Main category: cs.LG

TL;DR: Integrated modeling framework combining agent-based modeling and Life Cycle Assessment to analyze energy transition trade-offs in Southern California.

Details

Motivation: Current energy transition assessments overlook critical interactions like regional resource competition and cumulative impacts, leading to unintended consequences.

Method: Couples agent-based modeling with Life Cycle Assessment (LCA) to simulate energy transition pathways interacting with regional resource competition, ecological constraints, and community-level burdens.

Result: Demonstrates how integrated multiscale decision making shapes energy pathway deployment and reveals spatially explicit trade-offs under scenario-driven constraints.

Conclusion: The framework supports more adaptive and resilient energy transition planning on spatial and institutional scales.

Abstract: Transitioning to sustainable and resilient energy systems requires navigating complex and interdependent trade-offs across environmental, social, and resource dimensions. Neglecting these trade-offs can lead to unintended consequences across sectors. However, existing assessments often evaluate emerging energy pathways and their impacts in silos, overlooking critical interactions such as regional resource competition and cumulative impacts. We present an integrated modeling framework that couples agent-based modeling and Life Cycle Assessment (LCA) to simulate how energy transition pathways interact with regional resource competition, ecological constraints, and community-level burdens. We apply the model to a case study in Southern California. The results demonstrate how integrated and multiscale decision making can shape energy pathway deployment and reveal spatially explicit trade-offs under scenario-driven constraints. This modeling framework can further support more adaptive and resilient energy transition planning on spatial and institutional scales.

[642] Fine-Tuning Vision-Language Models for Multimodal Polymer Property Prediction

An Vuong, Minh-Hao Van, Prateek Verma, Chen Zhao, Xintao Wu

Main category: cs.LG

TL;DR: Fine-tuning Vision-Language Models with multimodal polymer data improves property prediction performance over unimodal approaches.

Details

Motivation: VLMs perform well in general tasks but are limited in scientific domains like materials science, where foundation models for broad multimodal tasks like polymer property prediction are lacking.

Method: Created a multimodal polymer dataset and fine-tuned VLMs using instruction-tuning pairs with LoRA to assess multimodality impact on prediction.

Result: Fine-tuned models using LoRA outperformed unimodal and baseline approaches, demonstrating benefits of multimodal learning.

Conclusion: Multimodal approach improves polymer property prediction and reduces need for separate models for different properties, lowering deployment and maintenance costs.

Abstract: Vision-Language Models (VLMs) have shown strong performance in tasks like visual question answering and multimodal text generation, but their effectiveness in scientific domains such as materials science remains limited. While some machine learning methods have addressed specific challenges in this field, there is still a lack of foundation models designed for broad tasks like polymer property prediction using multimodal data. In this work, we present a multimodal polymer dataset to fine-tune VLMs through instruction-tuning pairs and assess the impact of multimodality on prediction performance. Our fine-tuned models, using LoRA, outperform unimodal and baseline approaches, demonstrating the benefits of multimodal learning. Additionally, this approach reduces the need to train separate models for different properties, lowering deployment and maintenance costs.

[643] Conditional Diffusion as Latent Constraints for Controllable Symbolic Music Generation

Matteo Pettenó, Alessandro Ilic Mezza, Alberto Bernardini

Main category: cs.LG

TL;DR: This paper introduces diffusion-driven latent constraints for symbolic music generation, enabling precise control over musical attributes like note density and rhythm complexity without retraining the main model.

Details

Motivation: Existing music generation models rely on musical context or natural language for control, which lacks the precise fader-like control that expert users need over specific musical attributes.

Method: Uses a library of small conditional diffusion models as plug-and-play latent constraints on a frozen unconditional backbone model, applying denoising diffusion processes as implicit probabilistic priors.

Result: Diffusion-driven constraints outperform traditional attribute regularization and other latent constraint architectures, achieving stronger correlations between target and generated attributes while maintaining high perceptual quality and diversity.

Conclusion: The approach demonstrates versatility across diverse musical attributes and provides expert users with precise control capabilities for symbolic music generation.

Abstract: Recent advances in latent diffusion models have demonstrated state-of-the-art performance in high-dimensional time-series data synthesis while providing flexible control through conditioning and guidance. However, existing methodologies primarily rely on musical context or natural language as the main modality of interacting with the generative process, which may not be ideal for expert users who seek precise fader-like control over specific musical attributes. In this work, we explore the application of denoising diffusion processes as plug-and-play latent constraints for unconditional symbolic music generation models. We focus on a framework that leverages a library of small conditional diffusion models operating as implicit probabilistic priors on the latents of a frozen unconditional backbone. While previous studies have explored domain-specific use cases, this work, to the best of our knowledge, is the first to demonstrate the versatility of such an approach across a diverse array of musical attributes, such as note density, pitch range, contour, and rhythm complexity. Our experiments show that diffusion-driven constraints outperform traditional attribute regularization and other latent constraints architectures, achieving significantly stronger correlations between target and generated attributes while maintaining high perceptual quality and diversity.

[644] Distillation-Accelerated Uncertainty Modeling for Multi-Objective RTA Interception

Gaoxiang Zhao, Ruina Qiu, Pengpeng Zhao, Rongjin Wang, Zhangang Lin, Xiaoqiang Wang

Main category: cs.LG

TL;DR: DAUM is a joint modeling framework for Real-Time Auction (RTA) Interception that integrates multi-objective learning with uncertainty modeling to filter invalid traffic while addressing efficiency bottlenecks through knowledge distillation.

Details

Motivation: RTA Interception needs accurate traffic quality estimation with high confidence, but faces challenges with uncertainty modeling efficiency in real-time applications due to repeated inference overhead.

Method: Proposed DAUM framework integrates multi-objective learning with uncertainty modeling, then applies knowledge distillation to reduce computational overhead while preserving accuracy and uncertainty estimation benefits.

Result: Experiments on JD advertisement dataset show DAUM consistently improves predictive performance, with distilled model achieving tenfold increase in inference speed.

Conclusion: DAUM effectively addresses both accuracy and efficiency challenges in RTA Interception through joint uncertainty modeling and knowledge distillation, enabling reliable traffic filtering in real-time applications.

Abstract: Real-Time Auction (RTA) Interception aims to filter out invalid or irrelevant traffic to enhance the integrity and reliability of downstream data. However, two key challenges remain: (i) the need for accurate estimation of traffic quality together with sufficiently high confidence in the model’s predictions, typically addressed through uncertainty modeling, and (ii) the efficiency bottlenecks that such uncertainty modeling introduces in real-time applications due to repeated inference. To address these challenges, we propose DAUM, a joint modeling framework that integrates multi-objective learning with uncertainty modeling, yielding both traffic quality predictions and reliable confidence estimates. Building on DAUM, we further apply knowledge distillation to reduce the computational overhead of uncertainty modeling, while largely preserving predictive accuracy and retaining the benefits of uncertainty estimation. Experiments on the JD advertisement dataset demonstrate that DAUM consistently improves predictive performance, with the distilled model delivering a tenfold increase in inference speed.

Manh Duong Nguyen, Trung Thanh Nguyen, Huy Hieu Pham, Trong Nghia Hoang, Phi Le Nguyen, Thanh Trung Huynh

Main category: cs.LG

TL;DR: FedMAC is a novel federated learning framework that addresses partial-modality missing in multi-modal data, using contrastive regularization to improve feature aggregation and handle severe data heterogeneity.

Details

Motivation: Existing FL methods handle complete-modality missing but fail to address partial-modality missing where missing patterns vary significantly across samples, creating severe heterogeneity at the instance level.

Method: Proposes FedMAC framework with contrastive-based regularization to impose constraints on latent representation space and avoid trivial aggregation of multi-modal features.

Result: FedMAC outperforms baseline methods by up to 26% in severe missing scenarios across various client configurations with statistical heterogeneity.

Conclusion: FedMAC shows strong potential as a solution for partially missing modalities in federated systems, effectively handling instance-level heterogeneity in multi-modal data.

Abstract: Federated Learning (FL) is a method for training machine learning models using distributed data sources. It ensures privacy by allowing clients to collaboratively learn a shared global model while storing their data locally. However, a significant challenge arises when dealing with missing modalities in clients’ datasets, where certain features or modalities are unavailable or incomplete, leading to heterogeneous data distribution. While previous studies have addressed the issue of complete-modality missing, they fail to tackle partial-modality missing on account of severe heterogeneity among clients at an instance level, where the pattern of missing data can vary significantly from one sample to another. To tackle this challenge, this study proposes a novel framework named FedMAC, designed to address multi-modality missing under conditions of partial-modality missing in FL. Additionally, to avoid trivial aggregation of multi-modal features, we introduce contrastive-based regularization to impose additional constraints on the latent representation space. The experimental results demonstrate the effectiveness of FedMAC across various client configurations with statistical heterogeneity, outperforming baseline methods by up to 26% in severe missing scenarios, highlighting its potential as a solution for the challenge of partially missing modalities in federated systems. Our source code is provided at https://github.com/nmduonggg/PEPSY

[646] Depth-induced NTK: Bridging Over-parameterized Neural Networks and Deep Neural Kernels

Yong-Ming Tian, Shuang Liang, Shao-Qun Zhang, Feng-Lei Fan

Main category: cs.LG

TL;DR: The paper proposes a depth-induced neural tangent kernel (NTK) that accounts for network depth, addressing limitations of existing NTK theory which focuses mainly on infinite width while overlooking depth’s representational role.

Details

Motivation: Current neural tangent kernel theory is confined to infinite-width regimes and overlooks the representational importance of network depth in deep learning architectures.

Method: Proposed a depth-induced NTK kernel based on shortcut-related architecture, which converges to a Gaussian process as network depth approaches infinity. Theoretically analyzed training invariance and spectrum properties.

Result: The proposed kernel stabilizes kernel dynamics and mitigates degeneration. Experimental results demonstrate the effectiveness of the method.

Conclusion: The findings significantly extend neural kernel theory and provide deeper understanding of deep learning and scaling laws by incorporating depth considerations into NTK framework.

Abstract: While deep learning has achieved remarkable success across a wide range of applications, its theoretical understanding of representation learning remains limited. Deep neural kernels provide a principled framework to interpret over-parameterized neural networks by mapping hierarchical feature transformations into kernel spaces, thereby combining the expressive power of deep architectures with the analytical tractability of kernel methods. Recent advances, particularly neural tangent kernels (NTKs) derived by gradient inner products, have established connections between infinitely wide neural networks and nonparametric Bayesian inference. However, the existing NTK paradigm has been predominantly confined to the infinite-width regime, while overlooking the representational role of network depth. To address this gap, we propose a depth-induced NTK kernel based on a shortcut-related architecture, which converges to a Gaussian process as the network depth approaches infinity. We theoretically analyze the training invariance and spectrum properties of the proposed kernel, which stabilizes the kernel dynamics and mitigates degeneration. Experimental results further underscore the effectiveness of our proposed method. Our findings significantly extend the existing landscape of the neural kernel theory and provide an in-depth understanding of deep learning and the scaling law.

[647] Prompting Neural-Guided Equation Discovery Based on Residuals

Jannis Brugger, Viktor Pfanschilling, David Richter, Mira Mezini, Stefan Kramer

Main category: cs.LG

TL;DR: RED is a post-processing method that improves equation discovery by using residuals to generate better equation suggestions without extensive search.

Details

Motivation: Current neural-guided equation discovery systems lack options for getting alternative equation suggestions when initial predictions don't meet user expectations without intensive work.

Method: Parse initial equation to syntax tree, compute residuals for each subequation, use residuals as new target variables to generate prompts for better subequations, and replace old subequations if better ones are found on validation set.

Result: RED improves all tested neural-guided and classical genetic programming systems on 53 equations from the Feynman benchmark.

Conclusion: RED is a fast, extensible post-processing method that enhances equation discovery systems by using residuals to generate targeted improvements.

Abstract: Neural-guided equation discovery systems use a data set as prompt and predict an equation that describes the data set without extensive search. However, if the equation does not meet the user’s expectations, there are few options for getting other equation suggestions without intensive work with the system. To fill this gap, we propose Residuals for Equation Discovery (RED), a post-processing method that improves a given equation in a targeted manner, based on its residuals. By parsing the initial equation to a syntax tree, we can use node-based calculation rules to compute the residual for each subequation of the initial equation. It is then possible to use this residual as new target variable in the original data set and generate a new prompt. If, with the new prompt, the equation discovery system suggests a subequation better than the old subequation on a validation set, we replace the latter by the former. RED is usable with any equation discovery system, is fast to calculate, and is easy to extend for new mathematical operations. In experiments on 53 equations from the Feynman benchmark, we show that it not only helps to improve all tested neural-guided systems, but also all tested classical genetic programming systems.

[648] CoPRIS: Efficient and Stable Reinforcement Learning via Concurrency-Controlled Partial Rollout with Importance Sampling

Zekai Qu, Yinxu Pan, Ao Sun, Chaojun Xiao, Xu Han

Main category: cs.LG

TL;DR: CoPRIS is an asynchronous RL system for LLMs that uses partial rollouts and importance sampling to improve training efficiency by 1.94x while maintaining performance.

Details

Motivation: Existing RL systems for LLMs operate synchronously, causing inefficiencies when long trajectories stall rollouts and leave GPUs idle.

Method: Concurrency-Controlled Partial Rollout with Importance Sampling (CoPRIS) maintains fixed concurrent rollouts, early-terminates when sufficient samples are collected, reuses unfinished trajectories, and uses Cross-stage Importance Sampling Correction to handle off-policy trajectories.

Result: Experiments on mathematical reasoning benchmarks show CoPRIS achieves up to 1.94x faster training while maintaining comparable or superior performance to synchronous RL systems.

Conclusion: CoPRIS effectively addresses the inefficiency problem in RL post-training for LLMs through asynchronous partial rollouts and importance sampling correction.

Abstract: Reinforcement learning (RL) post-training has become a trending paradigm for enhancing the capabilities of large language models (LLMs). Most existing RL systems for LLMs operate in a fully synchronous manner, where training must wait for the rollout of an entire batch to complete. This design leads to severe inefficiencies, as extremely long trajectories can stall the entire rollout process and leave many GPUs idle. To address this issue, we propose Concurrency- Controlled Partial Rollout with Importance Sampling (CoPRIS), which mitigates long-tail inefficiencies by maintaining a fixed number of concurrent rollouts, early-terminating once sufficient samples are collected, and reusing unfinished trajectories in subsequent rollouts. To mitigate the impact of off-policy trajectories, we introduce Cross-stage Importance Sampling Correction, which concatenates buffered log probabilities from the previous policy with those recomputed under the current policy for importance sampling correction. Experiments on challenging mathematical reasoning benchmarks show that CoPRIS achieves up to 1.94x faster training while maintaining comparable or superior performance to synchronous RL systems. The code of CoPRIS is available at https://github.com/777pomingzi/CoPRIS.

[649] FedSparQ: Adaptive Sparse Quantization with Error Feedback for Robust & Efficient Federated Learning

Chaimaa Medjadji, Sadi Alawadi, Feras M. Awaysheh, Guilain Leduc, Sylvain Kubler, Yves Le Traon

Main category: cs.LG

TL;DR: FedSparQ is a lightweight compression framework for federated learning that combines dynamic sparsification, half-precision quantization, and error feedback to reduce communication overhead by 90% while maintaining or improving model accuracy.

Details

Motivation: Federated Learning suffers from significant communication overhead due to frequent exchange of high-dimensional model updates over constrained networks, which limits its practical deployment.

Method: FedSparQ dynamically sparsifies gradients using adaptive thresholds, applies half-precision quantization to retained entries, and integrates residuals from error feedback to prevent information loss. It requires no manual tuning and works with any model architecture.

Result: FedSparQ reduces communication overhead by 90% compared to FedAvg, improves model accuracy by 6% over uncompressed FedAvg and state-of-the-art compression methods, and enhances convergence robustness by 50% compared to other baselines.

Conclusion: FedSparQ provides a practical, easy-to-deploy solution for bandwidth-constrained federated deployments and enables future extensions in adaptive precision and privacy-preserving protocols.

Abstract: Federated Learning (FL) enables collaborative model training across decentralized clients while preserving data privacy by keeping raw data local. However, FL suffers from significant communication overhead due to the frequent exchange of high-dimensional model updates over constrained networks. In this paper, we present FedSparQ, a lightweight compression framework that dynamically sparsifies the gradient of each client through an adaptive threshold, applies half-precision quantization to retained entries and integrates residuals from error feedback to prevent loss of information. FedSparQ requires no manual tuning of sparsity rates or quantization schedules, adapts seamlessly to both homogeneous and heterogeneous data distributions, and is agnostic to model architecture. Through extensive empirical evaluation on vision benchmarks under independent and identically distributed (IID) and non-IID data, we show that FedSparQ substantially reduces communication overhead (reducing by 90% of bytes sent compared to FedAvg) while preserving or improving model accuracy (improving by 6% compared to FedAvg non-compressed solution or to state-of-the-art compression models) and enhancing convergence robustness (by 50%, compared to the other baselines). Our approach provides a practical, easy-to-deploy solution for bandwidth-constrained federated deployments and lays the groundwork for future extensions in adaptive precision and privacy-preserving protocols.

[650] GRAVER: Generative Graph Vocabularies for Robust Graph Foundation Models Fine-tuning

Haonan Yuan, Qingyun Sun, Junhua Shi, Xingcheng Fu, Bryan Hooi, Jianxin Li, Philip S. Yu

Main category: cs.LG

TL;DR: GRAVER is a generative graph vocabulary framework for robust fine-tuning of Graph Foundation Models that addresses instability in few-shot learning through generative augmentations and transferable subgraph patterns.

Details

Motivation: Existing Graph Foundation Models struggle with unstable few-shot fine-tuning due to randomness in support sample selection and structural discrepancies between pre-trained and target graphs, limiting trustworthy knowledge transfer across domains.

Method: Extracts transferable class-specific subgraph patterns via ego-graph disentanglement, constructs graph vocabularies using graphon-based generative experts, and employs a lightweight MoE-CoE network for attentive knowledge routing from source domains during prompt fine-tuning.

Result: Extensive experiments show GRAVER outperforms 15 state-of-the-art baselines in effectiveness, robustness, and efficiency on downstream few-shot node and graph classification tasks.

Conclusion: GRAVER provides a robust and efficient framework for fine-tuning Graph Foundation Models by leveraging generative graph vocabularies and transferable subgraph patterns, enabling stable knowledge transfer across diverse graph domains and tasks.

Abstract: Inspired by the remarkable success of foundation models in language and vision, Graph Foundation Models (GFMs) hold significant promise for broad applicability across diverse graph tasks and domains. However, existing GFMs struggle with unstable few-shot fine-tuning, where both performance and adaptation efficiency exhibit significant fluctuations caused by the randomness in the support sample selection and structural discrepancies between the pre-trained and target graphs. How to fine-tune GFMs robustly and efficiently to enable trustworthy knowledge transfer across domains and tasks is the major challenge. In this paper, we propose GRAVER, a novel Generative gRAph VocabulariEs for Robust GFM fine-tuning framework that tackles the aforementioned instability via generative augmentations. Specifically, to identify transferable units, we analyze and extract key class-specific subgraph patterns by ego-graph disentanglement and validate their transferability both theoretically and empirically. To enable effective pre-training across diverse domains, we leverage a universal task template based on ego-graph similarity and construct graph vocabularies via graphon-based generative experts. To facilitate robust and efficient prompt fine-tuning, we grave the support samples with in-context vocabularies, where the lightweight MoE-CoE network attentively routes knowledge from source domains. Extensive experiments demonstrate the superiority of GRAVER over effectiveness, robustness, and efficiency on downstream few-shot node and graph classification tasks compared with 15 state-of-the-art baselines.

[651] Gradient Projection onto Historical Descent Directions for Communication-Efficient Federated Learning

Arnaud Descours, Léonard Deroose, Jan Ramon

Main category: cs.LG

TL;DR: ProjFL and ProjFL+EF are communication-efficient federated learning algorithms that project local gradients onto shared subspaces, with ProjFL+EF using error feedback for biased compressors.

Details

Motivation: Communication efficiency is a critical bottleneck in federated learning, especially for large-scale models, requiring methods to reduce communication overhead while maintaining performance.

Method: Two algorithms: ProjFL for unbiased compressors projects local gradients onto shared client-server subspace using historical descent directions; ProjFL+EF adds error feedback mechanism for biased compressors.

Result: Both algorithms achieve accuracy comparable to existing baselines while substantially reducing communication costs on standard FL classification benchmarks with deep neural networks.

Conclusion: ProjFL and ProjFL+EF provide effective communication-efficient solutions for federated learning with proven convergence guarantees across strongly convex, convex, and non-convex settings.

Abstract: Federated Learning (FL) enables decentralized model training across multiple clients while optionally preserving data privacy. However, communication efficiency remains a critical bottleneck, particularly for large-scale models. In this work, we introduce two complementary algorithms: ProjFL, designed for unbiased compressors, and ProjFL+EF, tailored for biased compressors through an Error Feedback mechanism. Both methods rely on projecting local gradients onto a shared client-server subspace spanned by historical descent directions, enabling efficient information exchange with minimal communication overhead. We establish convergence guarantees for both algorithms under strongly convex, convex, and non-convex settings. Empirical evaluations on standard FL classification benchmarks with deep neural networks show that ProjFL and ProjFL+EF achieve accuracy comparable to existing baselines while substantially reducing communication costs.

[652] Optimizing Predictive Maintenance in Intelligent Manufacturing: An Integrated FNO-DAE-GNN-PPO MDP Framework

Shiqing Qiu

Main category: cs.LG

TL;DR: A novel MDP framework combining FNO, DAE, GNN, and PPO for predictive maintenance in smart manufacturing, achieving 13% cost reduction and improved system reliability.

Details

Motivation: To address multidimensional challenges in predictive maintenance for complex manufacturing systems by improving equipment reliability and reducing operating costs through data-driven strategies.

Method: Integrates Fourier Neural Operator for temporal pattern capture, Denoising Autoencoder for robust state embedding, Graph Neural Network for inter-device dependencies, and Proximal Policy Optimization for stable long-term strategy optimization.

Result: Significantly outperforms deep learning baselines with up to 13% cost reduction, demonstrates strong convergence and inter-module synergy, and effectively handles uncertainty and non-stationary dynamics.

Conclusion: The framework shows considerable industrial potential for reducing downtime and operating expenses through coordinated system-wide maintenance decisions and noise-resistant data processing.

Abstract: In the era of smart manufacturing, predictive maintenance (PdM) plays a pivotal role in improving equipment reliability and reducing operating costs. In this paper, we propose a novel Markov Decision Process (MDP) framework that integrates advanced soft computing techniques - Fourier Neural Operator (FNO), Denoising Autoencoder (DAE), Graph Neural Network (GNN), and Proximal Policy Optimisation (PPO) - to address the multidimensional challenges of predictive maintenance in complex manufacturing systems. Specifically, the proposed framework innovatively combines the powerful frequency-domain representation capability of FNOs to capture high-dimensional temporal patterns; DAEs to achieve robust, noise-resistant latent state embedding from complex non-Gaussian sensor data; and GNNs to accurately represent inter-device dependencies for coordinated system-wide maintenance decisions. Furthermore, by exploiting PPO, the framework ensures stable and efficient optimisation of long-term maintenance strategies to effectively handle uncertainty and non-stationary dynamics. Experimental validation demonstrates that the approach significantly outperforms multiple deep learning baseline models with up to 13% cost reduction, as well as strong convergence and inter-module synergy. The framework has considerable industrial potential to effectively reduce downtime and operating expenses through data-driven strategies.

[653] FlowNet: Modeling Dynamic Spatio-Temporal Systems via Flow Propagation

Yutong Feng, Xu Liu, Yutong Xia, Yuxuan Liang

Main category: cs.LG

TL;DR: FlowNet introduces a physics-inspired spatio-temporal modeling paradigm using flow tokens and conservation principles to capture asymmetric flow exchanges in dynamic systems, outperforming existing methods.

Details

Motivation: Existing graph-based and attention-driven methods fail to capture asymmetric flow exchanges that govern system evolution, relying instead on similarity-driven connectivity assumptions.

Method: Proposes Spatio-Temporal Flow paradigm with FlowNet architecture using flow tokens as information carriers, Flow Allocation Modules for source-to-destination transfers, Adaptive Spatial Masking for dynamic interaction radius, and cascaded architecture for scalability.

Result: FlowNet significantly outperforms state-of-the-art approaches on seven metrics across three real-world systems, demonstrating superior efficiency and physical interpretability.

Conclusion: Establishes a principled methodology for modeling complex systems through spatio-temporal flow interactions with conservation laws.

Abstract: Accurately modeling complex dynamic spatio-temporal systems requires capturing flow-mediated interdependencies and context-sensitive interaction dynamics. Existing methods, predominantly graph-based or attention-driven, rely on similarity-driven connectivity assumptions, neglecting asymmetric flow exchanges that govern system evolution. We propose Spatio-Temporal Flow, a physics-inspired paradigm that explicitly models dynamic node couplings through quantifiable flow transfers governed by conservation principles. Building on this, we design FlowNet, a novel architecture leveraging flow tokens as information carriers to simulate source-to-destination transfers via Flow Allocation Modules, ensuring state redistribution aligns with conservation laws. FlowNet dynamically adjusts the interaction radius through an Adaptive Spatial Masking module, suppressing irrelevant noise while enabling context-aware propagation. A cascaded architecture enhances scalability and nonlinear representation capacity. Experiments demonstrate that FlowNet significantly outperforms existing state-of-the-art approaches on seven metrics in the modeling of three real-world systems, validating its efficiency and physical interpretability. We establish a principled methodology for modeling complex systems through spatio-temporal flow interactions.

[654] Hybrid Training for Enhanced Multi-task Generalization in Multi-agent Reinforcement Learning

Mingliang Zhang, Sichang Su, Chengyang He, Guillaume Sartoretti

Main category: cs.LG

TL;DR: HyGen is a hybrid MARL framework combining online and offline learning for multi-task generalization, outperforming existing methods on StarCraft challenges.

Details

Motivation: Existing MARL methods lack multi-task generalization capabilities, with online methods being computationally wasteful and offline methods being data-dependent and performing poorly on unseen tasks.

Method: Extracts general skills from offline datasets, trains policies to select optimal skills using CTDE paradigm, and uses a replay buffer integrating both offline data and online interactions.

Result: Effectively extracts and refines general skills, achieving impressive generalization to unseen tasks and outperforming existing online and offline methods on StarCraft multi-agent challenge.

Conclusion: HyGen framework successfully integrates online and offline learning to ensure both multi-task generalization and training efficiency in MARL.

Abstract: In multi-agent reinforcement learning (MARL), achieving multi-task generalization to diverse agents and objectives presents significant challenges. Existing online MARL algorithms primarily focus on single-task performance, but their lack of multi-task generalization capabilities typically results in substantial computational waste and limited real-life applicability. Meanwhile, existing offline multi-task MARL approaches are heavily dependent on data quality, often resulting in poor performance on unseen tasks. In this paper, we introduce HyGen, a novel hybrid MARL framework, Hybrid Training for Enhanced Multi-Task Generalization, which integrates online and offline learning to ensure both multi-task generalization and training efficiency. Specifically, our framework extracts potential general skills from offline multi-task datasets. We then train policies to select the optimal skills under the centralized training and decentralized execution paradigm (CTDE). During this stage, we utilize a replay buffer that integrates both offline data and online interactions. We empirically demonstrate that our framework effectively extracts and refines general skills, yielding impressive generalization to unseen tasks. Comparative analyses on the StarCraft multi-agent challenge show that HyGen outperforms a wide range of existing solely online and offline methods.

Vansh Sharma, Harish Jai Ganesh, Maryam Akram, Wanjiao Liu, Venkat Raman

Main category: cs.LG

TL;DR: AutoHood3D is a high-fidelity multi-modal dataset with 16,000+ automotive hood variants for ML applications in engineering design and multiphysics surrogates, addressing fluid-structure interaction during painting processes.

Details

Motivation: Existing datasets are limited to 2D cases, have restricted geometric variations, and lack multi-modal annotations - gaps that AutoHood3D addresses for physics-aware ML development.

Method: Created 16,000+ hood variants modeled with coupled Large-Eddy Simulation (LES) and Finite Element Analysis (FEA) using 1.2M cells, providing time-resolved physical fields, STL meshes, and natural language prompts.

Result: Validated numerical methodology, established quantitative baselines across five neural architectures, and demonstrated systematic surrogate errors in displacement and force predictions.

Conclusion: The dataset enables physics-aware ML development, accelerates generative-design iteration, facilitates new FSI benchmarks, and motivates novel approaches with multiphysics loss functions that enforce fluid-solid coupling.

Abstract: This study presents a new high-fidelity multi-modal dataset containing 16000+ geometric variants of automotive hoods useful for machine learning (ML) applications such as engineering component design and process optimization, and multiphysics system surrogates. The dataset is centered on a practical multiphysics problem-hood deformation from fluid entrapment and inertial loading during rotary-dip painting. Each hood is numerically modeled with a coupled Large-Eddy Simulation (LES)-Finite Element Analysis (FEA), using 1.2M cells in total to ensure spatial and temporal accuracy. The dataset provides time-resolved physical fields, along with STL meshes and structured natural language prompts for text-to-geometry synthesis. Existing datasets are either confined to 2D cases, exhibit limited geometric variations, or lack the multi-modal annotations and data structures - shortcomings we address with AutoHood3D. We validate our numerical methodology, establish quantitative baselines across five neural architectures, and demonstrate systematic surrogate errors in displacement and force predictions. These findings motivate the design of novel approaches and multiphysics loss functions that enforce fluid-solid coupling during model training. By providing fully reproducible workflows, AutoHood3D enables physics-aware ML development, accelerates generative-design iteration, and facilitates the creation of new FSI benchmarks. Dataset and code URLs in Appendix.

[656] FiCABU: A Fisher-Based, Context-Adaptive Machine Unlearning Processor for Edge AI

Eun-Su Cho, Jongin Choi, Jeongmin Jin, Jae-Jin Lee, Woojoo Lee

Main category: cs.LG

TL;DR: FiCABU is a software-hardware co-design for efficient machine unlearning on edge AI processors that reduces computation by up to 87.52% while maintaining accuracy.

Details

Motivation: Privacy regulations and the "right to be forgotten" require machine unlearning capabilities at the edge, but existing server-centric or retraining-heavy methods are impractical due to tight computation and energy constraints.

Method: Combines Context-Adaptive Unlearning (starting edits from back-end layers and stopping when target forgetting is reached) with Balanced Dampening (scaling dampening strength by depth to preserve accuracy), implemented in a RISC-V edge AI processor with lightweight IPs for Fisher estimation and dampening.

Result: Achieves random-guess forget accuracy while matching baseline retain accuracy, reduces computation by up to 87.52% (ResNet-18) and 71.03% (ViT), and reduces energy to 6.48% (CIFAR-20) and 0.13% (PinsFaceRecognition) of baseline on hardware prototype.

Conclusion: Back-end-first, depth-aware unlearning can be made practical and efficient for resource-constrained edge AI devices through software-hardware co-design.

Abstract: Machine unlearning, driven by privacy regulations and the “right to be forgotten”, is increasingly needed at the edge, yet server-centric or retraining-heavy methods are impractical under tight computation and energy budgets. We present FiCABU (Fisher-based Context-Adaptive Balanced Unlearning), a software-hardware co-design that brings unlearning to edge AI processors. FiCABU combines (i) Context-Adaptive Unlearning, which begins edits from back-end layers and halts once the target forgetting is reached, with (ii) Balanced Dampening, which scales dampening strength by depth to preserve retain accuracy. These methods are realized in a full RTL design of a RISC-V edge AI processor that integrates two lightweight IPs for Fisher estimation and dampening into a GEMM-centric streaming pipeline, validated on an FPGA prototype and synthesized in 45 nm for power analysis. Across CIFAR-20 and PinsFaceRecognition with ResNet-18 and ViT, FiCABU achieves random-guess forget accuracy while matching the retraining-free Selective Synaptic Dampening (SSD) baseline on retain accuracy, reducing computation by up to 87.52 percent (ResNet-18) and 71.03 percent (ViT). On the INT8 hardware prototype, FiCABU further improves retain preservation and reduces energy to 6.48 percent (CIFAR-20) and 0.13 percent (PinsFaceRecognition) of the SSD baseline. In sum, FiCABU demonstrates that back-end-first, depth-aware unlearning can be made both practical and efficient for resource-constrained edge AI devices.

[657] Conformal Prediction-Driven Adaptive Sampling for Digital Twins of Water Distribution Networks

Mohammadhossein Homaei, Oscar Mogollon Gutierrez, Ruben Molano, Andres Caro, Mar Avila

Main category: cs.LG

TL;DR: Adaptive sensing framework for water distribution networks using LSTM forecasting and conformal prediction to focus sensing on most uncertain nodes, reducing demand error by 33-34% compared to uniform sampling.

Details

Motivation: Digital Twins for Water Distribution Networks need accurate state estimation with limited sensors, and uniform sampling wastes resources across nodes with different uncertainty levels.

Method: Proposed adaptive framework combining LSTM forecasting and Conformal Prediction (using marginal CP for low computational cost) to estimate node-wise uncertainty and focus sensing on most uncertain points.

Result: Experiments on Hanoi, Net3, and CTOWN networks show 33-34% lower demand error than uniform sampling at 40% coverage, while maintaining 89.4-90.2% empirical coverage with only 5-10% extra computation.

Conclusion: The adaptive framework effectively reduces sensing resources while maintaining accuracy, making it suitable for real-time Digital Twins in water distribution networks.

Abstract: Digital Twins (DTs) for Water Distribution Networks (WDNs) require accurate state estimation with limited sensors. Uniform sampling often wastes resources across nodes with different uncertainty. We propose an adaptive framework combining LSTM forecasting and Conformal Prediction (CP) to estimate node-wise uncertainty and focus sensing on the most uncertain points. Marginal CP is used for its low computational cost, suitable for real-time DTs. Experiments on Hanoi, Net3, and CTOWN show 33-34% lower demand error than uniform sampling at 40% coverage and maintain 89.4-90.2% empirical coverage with only 5-10% extra computation.

[658] An MLCommons Scientific Benchmarks Ontology

Ben Hawks, Gregor von Laszewski, Matthew D. Sinclair, Marco Colombo, Shivaram Venkataraman, Rutwik Jain, Yiwei Jiang, Nhan Tran, Geoffrey Fox

Main category: cs.LG

TL;DR: Introduces MLCommons Science Benchmarks Ontology - a unified, community-driven framework for standardizing scientific machine learning benchmarks across physics, chemistry, biology, climate science and other domains.

Details

Motivation: Existing scientific ML benchmarks are siloed and lack standardization, making applications fragmented and impact pathways unclear. Need for unified benchmarking across diverse scientific domains.

Method: Developed through community effort extending MLCommons ecosystem, consolidating disparate benchmarks into single taxonomy with open submission workflow and six-category rating rubric for quality assessment.

Result: Created extensible architecture supporting future scientific/AI motifs, with standardized foundation for reproducible, cross-domain benchmarking in scientific ML.

Conclusion: MLCommons Science Benchmarks Ontology provides scalable foundation for reproducible scientific ML benchmarking, enabling stakeholders to select appropriate benchmarks and identify emerging computing patterns.

Abstract: Scientific machine learning research spans diverse domains and data modalities, yet existing benchmark efforts remain siloed and lack standardization. This makes novel and transformative applications of machine learning to critical scientific use-cases more fragmented and less clear in pathways to impact. This paper introduces an ontology for scientific benchmarking developed through a unified, community-driven effort that extends the MLCommons ecosystem to cover physics, chemistry, materials science, biology, climate science, and more. Building on prior initiatives such as XAI-BENCH, FastML Science Benchmarks, PDEBench, and the SciMLBench framework, our effort consolidates a large set of disparate benchmarks and frameworks into a single taxonomy of scientific, application, and system-level benchmarks. New benchmarks can be added through an open submission workflow coordinated by the MLCommons Science Working Group and evaluated against a six-category rating rubric that promotes and identifies high-quality benchmarks, enabling stakeholders to select benchmarks that meet their specific needs. The architecture is extensible, supporting future scientific and AI/ML motifs, and we discuss methods for identifying emerging computing patterns for unique scientific workloads. The MLCommons Science Benchmarks Ontology provides a standardized, scalable foundation for reproducible, cross-domain benchmarking in scientific machine learning. A companion webpage for this work has also been developed as the effort evolves: https://mlcommons-science.github.io/benchmark/

[659] wa-hls4ml: A Benchmark and Surrogate Models for hls4ml Resource and Latency Estimation

Benjamin Hawks, Jason Weitz, Dmitri Demler, Karla Tame-Narvaez, Dennis Plotnikov, Mohammad Mehdi Rahimifar, Hamza Ezzaoui Rahali, Audrey C. Therrien, Donovan Sproule, Elham E Khoda, Keegan A. Smith, Russell Marroquin, Giuseppe Di Guglielmo, Nhan Tran, Javier Duarte, Vladimir Loncar

Main category: cs.LG

TL;DR: The paper introduces wa-hls4ml, a benchmark for ML accelerator resource and latency estimation, along with a dataset of 680,000+ neural networks synthesized using hls4ml for Xilinx FPGAs. It also presents GNN- and transformer-based surrogate models that predict latency and resources within several percent of actual synthesized values.

Details

Motivation: As ML hardware implementation advances, hardware synthesis has become a bottleneck in rapid design iteration. Current toolchains have reduced design iteration time but exposed new constraints, necessitating ML-based surrogate models to estimate resource usage of ML accelerator architectures.

Method: Created wa-hls4ml benchmark with over 680,000 fully connected and convolutional neural networks synthesized using hls4ml targeting Xilinx FPGAs. Introduced GNN- and transformer-based surrogate models to predict latency and resource usage, evaluated against common ML model architectures from scientific domains.

Result: The surrogate models generally predict latency and resources for the 75% percentile within several percent of the synthesized resources on the synthetic test dataset, demonstrating effective estimation capabilities.

Conclusion: The wa-hls4ml benchmark and the proposed GNN- and transformer-based surrogate models provide effective solutions for estimating ML accelerator resource usage, addressing the emerging bottleneck of hardware synthesis in rapid design iteration cycles.

Abstract: As machine learning (ML) is increasingly implemented in hardware to address real-time challenges in scientific applications, the development of advanced toolchains has significantly reduced the time required to iterate on various designs. These advancements have solved major obstacles, but also exposed new challenges. For example, processes that were not previously considered bottlenecks, such as hardware synthesis, are becoming limiting factors in the rapid iteration of designs. To mitigate these emerging constraints, multiple efforts have been undertaken to develop an ML-based surrogate model that estimates resource usage of ML accelerator architectures. We introduce wa-hls4ml, a benchmark for ML accelerator resource and latency estimation, and its corresponding initial dataset of over 680,000 fully connected and convolutional neural networks, all synthesized using hls4ml and targeting Xilinx FPGAs. The benchmark evaluates the performance of resource and latency predictors against several common ML model architectures, primarily originating from scientific domains, as exemplar models, and the average performance across a subset of the dataset. Additionally, we introduce GNN- and transformer-based surrogate models that predict latency and resources for ML accelerators. We present the architecture and performance of the models and find that the models generally predict latency and resources for the 75% percentile within several percent of the synthesized resources on the synthetic test dataset.

[660] Frequency Matters: When Time Series Foundation Models Fail Under Spectral Shift

Tianze Wang, Sofiane Ennadir, John Pertoft, Gabriela Zarzar Gandler, Lele Cao, Zineb Senane, Styliani Katsarou, Sahar Asadi, Axel Karlsson, Oleg Smirnov

Main category: cs.LG

TL;DR: TSFMs struggle with generalization due to spectral shift - mismatch between pretraining and downstream task frequencies, causing underperformance in industrial settings like mobile gaming engagement prediction.

Details

Motivation: Despite strong benchmark performance, TSFMs' effectiveness in real-world industrial applications remains uncertain, particularly regarding their ability to generalize across different frequency domains.

Method: Used industrial-scale player engagement prediction in mobile gaming and designed controlled synthetic experiments comparing signals with seen vs unseen frequency bands to isolate spectral mismatch effects.

Result: TSFMs underperformed domain-adapted baselines in industrial tasks and showed systematic degradation under spectral mismatch conditions in synthetic experiments.

Conclusion: Frequency awareness is critical for robust TSFM deployment, requiring new pretraining and evaluation protocols that explicitly account for spectral diversity.

Abstract: Time series foundation models (TSFMs) have shown strong results on public benchmarks, prompting comparisons to a “BERT moment” for time series. Their effectiveness in industrial settings, however, remains uncertain. We examine why TSFMs often struggle to generalize and highlight spectral shift (a mismatch between the dominant frequency components in downstream tasks and those represented during pretraining) as a key factor. We present evidence from an industrial-scale player engagement prediction task in mobile gaming, where TSFMs underperform domain-adapted baselines. To isolate the mechanism, we design controlled synthetic experiments contrasting signals with seen versus unseen frequency bands, observing systematic degradation under spectral mismatch. These findings position frequency awareness as critical for robust TSFM deployment and motivate new pretraining and evaluation protocols that explicitly account for spectral diversity.

[661] Fooling Algorithms in Non-Stationary Bandits using Belief Inertia

Gal Mendelson, Eyal Tadmor

Main category: cs.LG

TL;DR: The paper introduces a belief inertia argument to derive sharp lower bounds for worst-case regret in piecewise stationary multi-armed bandits, showing that classical algorithms suffer linear regret regardless of parameter tuning.

Details

Motivation: Existing lower bounds for non-stationary bandits rely on infrequent sampling arguments, but there's a need for fundamentally different approaches to understand the true minimax limits in time-varying settings.

Method: The authors develop a belief inertia argument that captures how algorithms’ empirical beliefs create momentum resisting new evidence after changes, and use this to construct adversarial instances that mislead classical bandit algorithms.

Result: Classical algorithms like Explore Then Commit, epsilon greedy, and UCB suffer linear regret with substantial constant factors, regardless of parameter tuning, even with a single change point. Periodic restart strategies also yield linear worst-case regret.

Conclusion: Belief inertia provides a powerful method for deriving sharp lower bounds in non-stationary bandits, revealing fundamental limitations of classical algorithms in time-varying environments.

Abstract: We study the problem of worst case regret in piecewise stationary multi armed bandits. While the minimax theory for stationary bandits is well established, understanding analogous limits in time-varying settings is challenging. Existing lower bounds rely on what we refer to as infrequent sampling arguments, where long intervals without exploration allow adversarial reward changes that induce large regret. In this paper, we introduce a fundamentally different approach based on a belief inertia argument. Our analysis captures how an algorithm’s empirical beliefs, encoded through historical reward averages, create momentum that resists new evidence after a change. We show how this inertia can be exploited to construct adversarial instances that mislead classical algorithms such as Explore Then Commit, epsilon greedy, and UCB, causing them to suffer regret that grows linearly with T and with a substantial constant factor, regardless of how their parameters are tuned, even with a single change point. We extend the analysis to algorithms that periodically restart to handle non stationarity and prove that, even then, the worst case regret remains linear in T. Our results indicate that utilizing belief inertia can be a powerful method for deriving sharp lower bounds in non stationary bandits.

[662] Unveiling the Training Dynamics of ReLU Networks through a Linear Lens

Longqing Ye

Main category: cs.LG

TL;DR: A framework that transforms multi-layer ReLU networks into equivalent single-layer linear models with input-dependent effective weights, revealing how class-specific representations emerge during training.

Details

Motivation: To understand the complex internal learning mechanisms of deep neural networks with ReLU activations, which are challenging to interpret due to their high-dimensional, non-linear nature.

Method: Proposes recasting multi-layer ReLU networks into single-layer linear models by deriving input-dependent effective weight matrices that capture the active computational path for each input sample through ReLU activation patterns.

Result: Shows that during training, effective weights for samples from the same class converge while those from different classes diverge, revealing the formation of class-specific decision boundaries and semantic representations.

Conclusion: The evolution of effective weights provides a new interpretable framework for understanding representation learning in deep networks, tracking how networks develop class-specific representations through weight convergence and divergence patterns.

Abstract: Deep neural networks, particularly those employing Rectified Linear Units (ReLU), are often perceived as complex, high-dimensional, non-linear systems. This complexity poses a significant challenge to understanding their internal learning mechanisms. In this work, we propose a novel analytical framework that recasts a multi-layer ReLU network into an equivalent single-layer linear model with input-dependent “effective weights”. For any given input sample, the activation pattern of ReLU units creates a unique computational path, effectively zeroing out a subset of weights in the network. By composing the active weights across all layers, we can derive an effective weight matrix, $W_{\text{eff}}(x)$, that maps the input directly to the output for that specific sample. We posit that the evolution of these effective weights reveals fundamental principles of representation learning. Our work demonstrates that as training progresses, the effective weights corresponding to samples from the same class converge, while those from different classes diverge. By tracking the trajectories of these sample-wise effective weights, we provide a new lens through which to interpret the formation of class-specific decision boundaries and the emergence of semantic representations within the network.

[663] SSTODE: Ocean-Atmosphere Physics-Informed Neural ODEs for Sea Surface Temperature Prediction

Zheng Jiang, Wei Wang, Gaowei Zhang, Yi Wang

Main category: cs.LG

TL;DR: SSTODE is a physics-informed Neural ODE framework that improves SST prediction by incorporating fluid transport principles and external forcing factors, achieving state-of-the-art performance while providing physical interpretability.

Details

Motivation: Current data-driven SST models lack interpretability and fail to capture key physical processes like seawater movement and external drivers, limiting their practical utility.

Method: Derives ODEs from fluid transport principles (advection and diffusion), recovers latent velocity field through variational optimization, and introduces Energy Exchanges Integrator based on ocean heat budget equations.

Result: Achieves state-of-the-art performance in global and regional SST forecasting benchmarks, and visually reveals advection dynamics, thermal diffusion patterns, and diurnal heating-cooling cycles.

Conclusion: SSTODE demonstrates both superior predictive performance and enhanced physical interpretability, bridging the gap between data-driven models and physical ocean processes.

Abstract: Sea Surface Temperature (SST) is crucial for understanding upper-ocean thermal dynamics and ocean-atmosphere interactions, which have profound economic and social impacts. While data-driven models show promise in SST prediction, their black-box nature often limits interpretability and overlooks key physical processes. Recently, physics-informed neural networks have been gaining momentum but struggle with complex ocean-atmosphere dynamics due to 1) inadequate characterization of seawater movement (e.g., coastal upwelling) and 2) insufficient integration of external SST drivers (e.g., turbulent heat fluxes). To address these challenges, we propose SSTODE, a physics-informed Neural Ordinary Differential Equations (Neural ODEs) framework for SST prediction. First, we derive ODEs from fluid transport principles, incorporating both advection and diffusion to model ocean spatiotemporal dynamics. Through variational optimization, we recover a latent velocity field that explicitly governs the temporal dynamics of SST. Building upon ODE, we introduce an Energy Exchanges Integrator (EEI)-inspired by ocean heat budget equations-to account for external forcing factors. Thus, the variations in the components of these factors provide deeper insights into SST dynamics. Extensive experiments demonstrate that SSTODE achieves state-of-the-art performances in global and regional SST forecasting benchmarks. Furthermore, SSTODE visually reveals the impact of advection dynamics, thermal diffusion patterns, and diurnal heating-cooling cycles on SST evolution. These findings demonstrate the model’s interpretability and physical consistency.

[664] Physics-Guided Machine Learning for Uncertainty Quantification in Turbulence Models

Minghan Chu, Weicheng Qian

Main category: cs.LG

TL;DR: A hybrid ML-physics framework improves turbulence model uncertainty quantification by using CNN to modulate EPM perturbation magnitudes, yielding tighter and better-calibrated uncertainty bounds.

Details

Motivation: Traditional turbulence models introduce epistemic uncertainty through empirical simplifications, and purely physics-based uncertainty quantification methods like EPM can overpredict uncertainty bounds.

Method: Proposes a convolutional neural network (CNN)-based modulation of Eigenspace Perturbation Method (EPM) perturbation magnitudes to create a hybrid ML-EPM framework that maintains physical consistency while improving calibration.

Result: The hybrid ML-EPM framework produces substantially tighter and better-calibrated uncertainty estimates compared to baseline EPM alone across canonical test cases.

Conclusion: Combining machine learning with physics-based uncertainty quantification methods can effectively improve the calibration of turbulence model uncertainty estimates while preserving physical consistency.

Abstract: Predicting the evolution of turbulent flows is central across science and engineering. Most studies rely on simulations with turbulence models, whose empirical simplifications introduce epistemic uncertainty. The Eigenspace Perturbation Method (EPM) is a widely used physics-based approach to quantify model-form uncertainty, but being purely physics-based it can overpredict uncertainty bounds. We propose a convolutional neural network (CNN)-based modulation of EPM perturbation magnitudes to improve calibration while preserving physical consistency. Across canonical cases, the hybrid ML-EPM framework yields substantially tighter, better-calibrated uncertainty estimates than baseline EPM alone.

Hamza Virk, Sandro Amaglobeli, Zuhayr Syed

Main category: cs.LG

TL;DR: Blind-IGT is a statistical framework that jointly recovers reward parameters and rationality temperature from observed behavior in competitive settings, resolving fundamental scale ambiguity in inverse game theory.

Details

Motivation: Existing inverse game theory methods assume known rationality parameters, but when unknown, scale ambiguity makes reward parameters statistically unidentifiable, limiting practical applicability.

Method: Introduces Blind-IGT with normalization constraint to resolve scale ambiguity, proposes Normalized Least Squares estimator, and extends framework to Markov games with unknown transition dynamics.

Result: Achieves optimal O(N^{-1/2}) convergence rate for joint parameter recovery, provides partial identification guarantees when strong conditions fail, and demonstrates strong empirical performance.

Conclusion: Blind-IGT enables practical inverse game theory by jointly recovering both reward parameters and rationality temperature, overcoming fundamental identifiability challenges in competitive settings.

Abstract: Inverse Game Theory (IGT) methods based on the entropy-regularized Quantal Response Equilibrium (QRE) offer a tractable approach for competitive settings, but critically assume the agents’ rationality parameter (temperature $τ$) is known a priori. When $τ$ is unknown, a fundamental scale ambiguity emerges that couples $τ$ with the reward parameters ($θ$), making them statistically unidentifiable. We introduce Blind-IGT, the first statistical framework to jointly recover both $θ$ and $τ$ from observed behavior. We analyze this bilinear inverse problem and establish necessary and sufficient conditions for unique identification by introducing a normalization constraint that resolves the scale ambiguity. We propose an efficient Normalized Least Squares (NLS) estimator and prove it achieves the optimal $\mathcal{O}(N^{-1/2})$ convergence rate for joint parameter recovery. When strong identifiability conditions fail, we provide partial identification guarantees through confidence set construction. We extend our framework to Markov games and demonstrate optimal convergence rates with strong empirical performance even when transition dynamics are unknown.

[666] KLASS: KL-Guided Fast Inference in Masked Diffusion Models

Seo Hyun Kim, Sunwoo Hong, Hojung Jung, Youngrok Park, Se-Young Yun

Main category: cs.LG

TL;DR: KL-Adaptive Stability Sampling (KLASS) is a fast sampling method for masked diffusion models that uses token-level KL divergence to identify stable predictions, enabling multiple token unmasking per iteration without extra training.

Details

Motivation: Masked diffusion models suffer from slow inference due to iterative refinement processes, creating a bottleneck for practical applications.

Method: Exploit token-level KL divergence to identify stable, high-confidence predictions and unmask multiple tokens in each iteration without additional model training.

Result: Achieves up to 2.78× wall-clock speedups while improving performance over standard greedy decoding, attaining state-of-the-art results among diffusion-based samplers across text, image, and molecular generation tasks.

Conclusion: KLASS is an effective broadly applicable sampler that significantly speeds up generation while maintaining sample quality across diverse domains.

Abstract: Masked diffusion models have demonstrated competitive results on various tasks including language generation. However, due to its iterative refinement process, the inference is often bottlenecked by slow and static sampling speed. To overcome this problem, we introduce `KL-Adaptive Stability Sampling’ (KLASS), a fast yet effective sampling method that exploits token-level KL divergence to identify stable, high-confidence predictions. By unmasking multiple tokens in each iteration without any additional model training, our approach speeds up generation significantly while maintaining sample quality. On reasoning benchmarks, KLASS achieves up to $2.78\times$ wall-clock speedups while improving performance over standard greedy decoding, attaining state-of-the-art results among diffusion-based samplers. We further validate KLASS across diverse domains, including text, image, and molecular generation, showing its effectiveness as a broadly applicable sampler across different models.

[667] Distributionally Robust Self Paced Curriculum Reinforcement Learning

Anirudh Satheesh, Keenan Powell, Vaneet Aggarwal

Main category: cs.LG

TL;DR: DR-SPCRL adaptively schedules the robustness budget ε as a continuous curriculum in distributionally robust RL, achieving better performance-robustness trade-off than fixed ε methods.

Details

Motivation: Fixed robustness budgets in DRRL create a tradeoff between performance and robustness - small ε gives high nominal performance but weak robustness, while large ε causes instability and overly conservative policies.

Method: Proposes Distributionally Robust Self-Paced Curriculum Reinforcement Learning (DR-SPCRL) that treats ε as a continuous curriculum and adaptively schedules it based on the agent’s learning progress.

Result: Achieves 11.8% average increase in episodic return under perturbations compared to fixed/heuristic scheduling, and approximately 1.9x performance of nominal RL algorithms across multiple environments.

Conclusion: DR-SPCRL stabilizes training and achieves superior robustness-performance trade-off by adaptively scheduling the robustness budget as a curriculum.

Abstract: A central challenge in reinforcement learning is that policies trained in controlled environments often fail under distribution shifts at deployment into real-world environments. Distributionally Robust Reinforcement Learning (DRRL) addresses this by optimizing for worst-case performance within an uncertainty set defined by a robustness budget $ε$. However, fixing $ε$ results in a tradeoff between performance and robustness: small values yield high nominal performance but weak robustness, while large values can result in instability and overly conservative policies. We propose Distributionally Robust Self-Paced Curriculum Reinforcement Learning (DR-SPCRL), a method that overcomes this limitation by treating $ε$ as a continuous curriculum. DR-SPCRL adaptively schedules the robustness budget according to the agent’s progress, enabling a balance between nominal and robust performance. Empirical results across multiple environments demonstrate that DR-SPCRL not only stabilizes training but also achieves a superior robustness-performance trade-off, yielding an average 11.8% increase in episodic return under varying perturbations compared to fixed or heuristic scheduling strategies, and achieving approximately 1.9$\times$ the performance of the corresponding nominal RL algorithms.

[668] AI-assisted workflow enables rapid, high-fidelity breast cancer clinical trial eligibility prescreening

Jacob T. Rosenthal, Emma Hahesy, Sulov Chalise, Menglei Zhu, Mert R. Sabuncu, Lior Z. Braunstein, Anyi Li

Main category: cs.LG

TL;DR: MSK-MATCH is an AI system that automates clinical trial eligibility screening using LLMs and retrieval-augmented generation, achieving high accuracy while reducing human screening time from 20 minutes to 43 seconds.

Details

Motivation: Clinical trial participation rates are low despite their importance in cancer care and research, creating a need for automated screening solutions.

Method: Integrated large language model with curated oncology trial knowledge base and retrieval-augmented architecture that provides explanations for AI predictions grounded in source clinical text.

Result: Automatically resolved 61.9% of cases and triaged 38.1% for human review, achieving 98.6% accuracy, 98.4% sensitivity, and 98.7% specificity for patient-level eligibility classification.

Conclusion: AI-assisted workflow significantly improves efficiency and reduces costs in clinical trial screening while maintaining high accuracy comparable to human-only approaches.

Abstract: Clinical trials play an important role in cancer care and research, yet participation rates remain low. We developed MSK-MATCH (Memorial Sloan Kettering Multi-Agent Trial Coordination Hub), an AI system for automated eligibility screening from clinical text. MSK-MATCH integrates a large language model with a curated oncology trial knowledge base and retrieval-augmented architecture providing explanations for all AI predictions grounded in source text. In a retrospective dataset of 88,518 clinical documents from 731 patients across six breast cancer trials, MSK-MATCH automatically resolved 61.9% of cases and triaged 38.1% for human review. This AI-assisted workflow achieved 98.6% accuracy, 98.4% sensitivity, and 98.7% specificity for patient-level eligibility classification, matching or exceeding performance of the human-only and AI-only comparisons. For the triaged cases requiring manual review, prepopulating eligibility screens with AI-generated explanations reduced screening time from 20 minutes to 43 seconds at an average cost of $0.96 per patient-trial pair.

[669] TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification

Pasan Dissanayake, Sanghamitra Dutta

Main category: cs.LG

TL;DR: TabDistill is a knowledge distillation framework that transfers pre-trained transformer knowledge to simpler neural networks for tabular data classification, achieving parameter efficiency while maintaining strong few-shot performance.

Details

Motivation: Transformer models perform well on tabular data in few-shot scenarios but are computationally expensive and parameter-heavy. There's a need to maintain performance while reducing complexity.

Method: Knowledge distillation from complex transformer-based models into simpler neural networks for tabular data classification.

Result: Distilled neural networks outperform classical baselines (regular neural networks, XGBoost, logistic regression) and sometimes even the original transformer models, while being more parameter-efficient.

Conclusion: TabDistill successfully bridges the gap between transformer performance and computational efficiency, enabling effective tabular data classification with limited training data and reduced model complexity.

Abstract: Transformer-based models have shown promising performance on tabular data compared to their classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They utilize their pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, also called the few-shot regime. However, the performance gain in the few-shot regime comes at the expense of significantly increased complexity and number of parameters. To circumvent this trade-off, we introduce TabDistill, a new strategy to distill the pre-trained knowledge in complex transformer-based models into simpler neural networks for effectively classifying tabular data. Our framework yields the best of both worlds: being parameter-efficient while performing well with limited training data. The distilled neural networks surpass classical baselines such as regular neural networks, XGBoost and logistic regression under equal training data, and in some cases, even the original transformer-based models that they were distilled from.

[670] Distributionally Robust Multimodal Machine Learning

Peilin Yang, Yu Ma

Main category: cs.LG

TL;DR: A novel distributionally robust optimization framework for multimodal machine learning that provides theoretical guarantees and improves robustness in practical applications.

Details

Motivation: Existing multimodal approaches rely on early fusion or heuristic uncertainty modeling, which downplay modality-aware effects and provide limited insights into uncertainty handling.

Method: Proposed a distributionally robust optimization (DRO) framework with complexity analysis, established generalization upper bounds and minimax lower bounds, and extended to encoder-specific error propagation settings.

Result: The approach improves robustness in both simulation settings and real-world datasets, providing performance guarantees through theoretical bounds.

Conclusion: The framework provides a principled foundation for employing multimodal machine learning models in high-stakes applications where uncertainty is unavoidable.

Abstract: We consider the problem of distributionally robust multimodal machine learning. Existing approaches often rely on merging modalities on the feature level (early fusion) or heuristic uncertainty modeling, which downplays modality-aware effects and provide limited insights. We propose a novel distributionally robust optimization (DRO) framework that aims to study both the theoretical and practical insights of multimodal machine learning. We first justify this setup and show the significance of this problem through complexity analysis. We then establish both generalization upper bounds and minimax lower bounds which provide performance guarantees. These results are further extended in settings where we consider encoder-specific error propogations. Empirically, we demonstrate that our approach improves robustness in both simulation settings and real-world datasets. Together, these findings provide a principled foundation for employing multimodal machine learning models in high-stakes applications where uncertainty is unavoidable.

Ziyang Gao, Annie Cheung, Yihao Ou

Main category: cs.LG

TL;DR: GastroDL-Fusion is a dual-modal deep learning framework that integrates protein-ligand complex data with disease-associated gene sequences to improve binding affinity prediction for gastrointestinal disease drug/vaccine development.

Details

Motivation: Traditional computational models rely only on structural information and fail to capture genetic determinants influencing disease mechanisms and therapeutic responses in gastrointestinal diseases.

Method: Protein-ligand complexes are modeled as molecular graphs using Graph Isomorphism Network (GIN), while gene sequences are encoded via pre-trained Transformers (ProtBERT/ESM), with both modalities fused through a multi-layer perceptron for cross-modal interaction learning.

Result: The model achieves MAE of 1.12 and RMSE of 1.75 on GI disease-related targets, significantly outperforming CNN, BiLSTM, GIN, and Transformer-only baselines.

Conclusion: Incorporating both structural and genetic features yields more accurate binding affinity predictions, providing a reliable computational tool for accelerating targeted therapy and vaccine design for gastrointestinal diseases.

Abstract: Accurate prediction of protein-ligand binding affinity plays a pivotal role in accelerating the discovery of novel drugs and vaccines, particularly for gastrointestinal (GI) diseases such as gastric ulcers, Crohn’s disease, and ulcerative colitis. Traditional computational models often rely on structural information alone and thus fail to capture the genetic determinants that influence disease mechanisms and therapeutic responses. To address this gap, we propose GastroDL-Fusion, a dual-modal deep learning framework that integrates protein-ligand complex data with disease-associated gene sequence information for drug and vaccine development. In our approach, protein-ligand complexes are represented as molecular graphs and modeled using a Graph Isomorphism Network (GIN), while gene sequences are encoded into biologically meaningful embeddings via a pre-trained Transformer (ProtBERT/ESM). These complementary modalities are fused through a multi-layer perceptron to enable robust cross-modal interaction learning. We evaluate the model on benchmark datasets of GI disease-related targets, demonstrating that GastroDL-Fusion significantly improves predictive performance over conventional methods. Specifically, the model achieves a mean absolute error (MAE) of 1.12 and a root mean square error (RMSE) of 1.75, outperforming CNN, BiLSTM, GIN, and Transformer-only baselines. These results confirm that incorporating both structural and genetic features yields more accurate predictions of binding affinities, providing a reliable computational tool for accelerating the design of targeted therapies and vaccines in the context of gastrointestinal diseases.

[672] Compressing Chemistry Reveals Functional Groups

Ruben Sharma, Ross D. King

Main category: cs.LG

TL;DR: This paper introduces the first large-scale assessment of chemical functional groups using computational learning theory, showing that good explanations should compress data. An MML-based algorithm discovers substructures that outperform traditional fingerprints in bioactivity prediction.

Details

Motivation: To formally evaluate the utility of traditional chemical functional groups as explanations using computational learning theory principles, specifically that good explanations should compress data.

Method: Developed an unsupervised learning algorithm based on Minimum Message Length (MML) principle that searches for substructures that compress around 3 million biologically relevant molecules.

Result: Discovered substructures contain most human-curated functional groups plus novel larger patterns with more specific functions. Dataset-specific functional group fingerprints significantly outperform MACCS and Morgan fingerprints in bioactivity regression tasks.

Conclusion: The MML-based approach successfully identifies meaningful chemical substructures that improve bioactivity prediction performance, validating functional groups as useful explanations through data compression principles.

Abstract: We introduce the first formal large-scale assessment of the utility of traditional chemical functional groups as used in chemical explanations. Our assessment employs a fundamental principle from computational learning theory: a good explanation of data should also compress the data. We introduce an unsupervised learning algorithm based on the Minimum Message Length (MML) principle that searches for substructures that compress around three million biologically relevant molecules. We demonstrate that the discovered substructures contain most human-curated functional groups as well as novel larger patterns with more specific functions. We also run our algorithm on 24 specific bioactivity prediction datasets to discover dataset-specific functional groups. Fingerprints constructed from dataset-specific functional groups are shown to significantly outperform other fingerprint representations, including the MACCS and Morgan fingerprint, when training ridge regression models on bioactivity regression tasks.

[673] QiVC-Net: Quantum-Inspired Variational Convolutional Network, with Application to Biosignal Classification

Amin Golnari, Jamileh Yousefi, Reza Moheimani, Saeid Sanei

Main category: cs.LG

TL;DR: QiVC framework integrates quantum-inspired transformations into convolutional networks using differentiable subspace rotations of weights, achieving state-of-the-art performance in biosignal classification without adding parameters.

Details

Motivation: To address challenges in biosignal analysis including high noise, inter-subject variability, and imbalanced data through uncertainty-aware modeling that preserves parameter space geometry.

Method: Quantum-inspired variational convolution (QiVC) with QiRE mechanism performing differentiable low-dimensional subspace rotations of convolutional weights, analogous to quantum state evolution.

Result: QiVC-Net achieved 97.84% accuracy on PhysioNet CinC 2016 and 97.89% on PhysioNet CirCor DigiScope 2022, demonstrating state-of-the-art performance in PCG classification.

Conclusion: The QiVC framework shows promise for advancing uncertainty-aware modeling in biomedical signal analysis, offering robust performance without computational burden.

Abstract: This work introduces the quantum-inspired variational convolution (QiVC) framework, a novel learning paradigm that integrates principles of probabilistic inference, variational optimization, and quantum-inspired transformations within convolutional architectures. The central innovation of QiVC lies in its quantum-inspired rotated ensemble (QiRE) mechanism. QiRE performs differentiable low-dimensional subspace rotations of convolutional weights, analogously to quantum state evolution. This approach enables structured uncertainty modeling while preserving the intrinsic geometry of the parameter space, resulting in more expressive, stable, and uncertainty-aware representations. To demonstrate its practical potential, the concept is instantiated in a QiVC-based convolutional network (QiVC-Net) and evaluated in the context of biosignal classification, focusing on phonocardiogram (PCG) recordings, a challenging domain characterized by high noise, inter-subject variability, and often imbalanced data. The proposed QiVC-Net integrates an architecture in which the QiVC layer does not introduce additional parameters, instead performing an ensemble rotation of the convolutional weights through a structured mechanism ensuring robustness without added highly computational burden. Experiments on two benchmark datasets, PhysioNet CinC 2016 and PhysioNet CirCor DigiScope 2022, show that QiVC-Net achieves state-of-the-art performance, reaching accuracies of 97.84% and 97.89%, respectively. These findings highlight the versatility of the QiVC framework and its promise for advancing uncertainty-aware modeling in real-world biomedical signal analysis. The implementation of the QiVConv layer is openly available in GitHub.

[674] Near-Exponential Savings for Mean Estimation with Active Learning

Julian M. Morimoto, Jacob Goldin, Daniel E. Ho

Main category: cs.LG

TL;DR: PartiBandits is an active learning algorithm that efficiently estimates the mean of a k-class random variable using limited labels and auxiliary covariates, achieving minimax optimal convergence rates.

Details

Motivation: To address the challenge of estimating population means with limited labeled data when auxiliary covariates are available, bridging UCB and disagreement-based active learning approaches.

Method: Two-stage algorithm: (1) learns a partition of unlabeled data to reduce conditional variance, (2) uses WarmStart-UCB subroutine for round-by-round label requests from each stratum.

Result: Achieves squared error of Õ((ν + exp(-cN/log(N)))/N), where ν is Bayes-optimal classifier risk, with minimax optimal convergence rates in classical settings.

Conclusion: PartiBandits effectively combines UCB and disagreement-based approaches, is implemented in an R package, and demonstrated through EHR simulations.

Abstract: We study the problem of efficiently estimating the mean of a $k$-class random variable, $Y$, using a limited number of labels, $N$, in settings where the analyst has access to auxiliary information (i.e.: covariates) $X$ that may be informative about $Y$. We propose an active learning algorithm (“PartiBandits”) to estimate $\mathbb{E}[Y]$. The algorithm yields an estimate, $\widehatμ_{\text{PB}}$, such that $\left( \widehatμ_{\text{PB}} - \mathbb{E}[Y]\right)^2$ is $\tilde{\mathcal{O}}\left( \frac{ν+ \exp(c \cdot (-N/\log(N))) }{N} \right)$, where $c > 0$ is a constant and $ν$ is the risk of the Bayes-optimal classifier. PartiBandits is essentially a two-stage algorithm. In the first stage, it learns a partition of the unlabeled data that shrinks the average conditional variance of $Y$. In the second stage it uses a UCB-style subroutine (“WarmStart-UCB”) to request labels from each stratum round-by-round. Both the main algorithm’s and the subroutine’s convergence rates are minimax optimal in classical settings. PartiBandits bridges the UCB and disagreement-based approaches to active learning despite these two approaches being designed to tackle very different tasks. We illustrate our methods through simulation using nationwide electronic health records. Our methods can be implemented using the PartiBandits package in R.

[675] Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder

Zhen Xu, Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen

Main category: cs.LG

TL;DR: The paper addresses the feature redundancy problem in Mixture of Experts Sparse Autoencoders (MoE-SAE) by proposing Multiple Expert Activation and Feature Scaling to improve expert specialization and reduce computational costs.

Details

Motivation: Sparse autoencoders face a trade-off between interpretability (requiring high dimensionality) and computational efficiency. While MoE approaches help reduce costs, they suffer from experts failing to specialize and learning overlapping features.

Method: Proposes two innovations: (1) Multiple Expert Activation that engages semantically weighted expert subsets simultaneously, and (2) Feature Scaling that enhances diversity through adaptive high-frequency scaling.

Result: Achieves 24% lower reconstruction error and 99% reduction in feature redundancy compared to existing MoE-SAE methods.

Conclusion: Bridges the interpretability-efficiency gap in LLM analysis, enabling transparent model inspection without compromising computational feasibility.

Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models (LLMs) by decomposing token activations into combinations of human-understandable features. While SAEs provide crucial insights into LLM explanations, their practical adoption faces a fundamental challenge: better interpretability demands that SAEs’ hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs. Recent Mixture of Experts (MoE) approaches attempt to address this by partitioning SAEs into narrower expert networks with gated activation, thereby reducing computation. In a well-designed MoE, each expert should focus on learning a distinct set of features. However, we identify a \textit{critical limitation} in MoE-SAE: Experts often fail to specialize, which means they frequently learn overlapping or identical features. To deal with it, we propose two key innovations: (1) Multiple Expert Activation that simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling that enhances diversity through adaptive high-frequency scaling. Experiments demonstrate a 24% lower reconstruction error and a 99% reduction in feature redundancy compared to existing MoE-SAE methods. This work bridges the interpretability-efficiency gap in LLM analysis, allowing transparent model inspection without compromising computational feasibility.

[676] Primal-Only Actor Critic Algorithm for Robust Constrained Average Cost MDPs

Anirudh Satheesh, Sooraj Sathish, Swetha Ganesh, Keenan Powell, Vaneet Aggarwal

Main category: cs.LG

TL;DR: Proposes an actor-critic algorithm for Robust Constrained Average-Cost MDPs that achieves ε-feasibility and ε-optimality with sample complexities of Õ(ε⁻⁴) and Õ(ε⁻⁶).

Details

Motivation: Address challenges in Robust Constrained Average-Cost MDPs including lack of strong duality and non-contractive Robust Bellman operator in average-cost setting.

Method: Actor-critic algorithm designed specifically for Average-Cost RCMDPs to handle the unique difficulties of this setting.

Result: Achieves both ε-feasibility and ε-optimality with sample complexities of Õ(ε⁻⁴) with slackness assumption and Õ(ε⁻⁶) without slackness assumption.

Conclusion: The proposed method successfully addresses the challenges in Average-Cost RCMDPs and achieves comparable performance to discounted settings.

Abstract: In this work, we study the problem of finding robust and safe policies in Robust Constrained Average-Cost Markov Decision Processes (RCMDPs). A key challenge in this setting is the lack of strong duality, which prevents the direct use of standard primal-dual methods for constrained RL. Additional difficulties arise from the average-cost setting, where the Robust Bellman operator is not a contraction under any norm. To address these challenges, we propose an actor-critic algorithm for Average-Cost RCMDPs. We show that our method achieves both (ε)-feasibility and (ε)-optimality, and we establish a sample complexities of (\tilde{O}\left(ε^{-4}\right)) and (\tilde{O}\left(ε^{-6}\right)) with and without slackness assumption, which is comparable to the discounted setting.

[677] An Efficient Gradient-Aware Error-Bounded Lossy Compressor for Federated Learning

Zhijing Ye, Sheng Di, Jiamin Wang, Zhiqing Zhong, Zhaorui Zhang, Xiaodong Yu

Main category: cs.LG

TL;DR: Proposes a novel error-bounded lossy compression framework for federated learning gradients that exploits temporal correlations and structural regularities to achieve higher compression ratios than existing methods while preserving model accuracy.

Details

Motivation: Federated learning faces communication bottlenecks, especially under system heterogeneity where low-bandwidth clients limit performance. Existing error-bounded compression methods designed for smooth scientific data perform poorly on gradient tensors due to their low smoothness and weak spatial correlation.

Method: Developed an EBLC framework with two key predictors: (1) cross-round magnitude predictor using normalized exponential moving average, and (2) sign predictor leveraging gradient oscillation and kernel-level sign consistency. The framework is compatible with standard quantizers and entropy coders.

Result: Achieves up to 1.53x higher compression ratios than SZ3 with lower accuracy loss. When integrated into APPFL framework, reduces end-to-end communication time by 76.1%-96.2% under various constrained-bandwidth scenarios.

Conclusion: The proposed EBLC framework effectively addresses FL communication bottlenecks by exploiting temporal and structural correlations in gradient data, demonstrating strong scalability for real-world FL deployments.

Abstract: Federated learning (FL) enables collaborative model training without exposing clients’ private data, but its deployment is often constrained by the communication cost of transmitting gradients between clients and the central server, especially under system heterogeneity where low-bandwidth clients bottleneck overall performance. Lossy compression of gradient data can mitigate this overhead, and error-bounded lossy compression (EBLC) is particularly appealing for its fine-grained utility-compression tradeoff. However, existing EBLC methods (e.g., SZ), originally designed for smooth scientific data with strong spatial locality, rely on generic predictors such as Lorenzo and interpolation for entropy reduction to improve compression ratio. Gradient tensors, in contrast, exhibit low smoothness and weak spatial correlation, rendering these predictors ineffective and leading to poor compression ratios. To address this limitation, we propose an EBLC framework tailored for FL gradient data to achieve high compression ratios while preserving model accuracy. The core of it is an innovative prediction mechanism that exploits temporal correlations across FL training rounds and structural regularities within convolutional kernels to reduce residual entropy. The predictor is compatible with standard quantizers and entropy coders and comprises (1) a cross-round magnitude predictor based on a normalized exponential moving average, and (2) a sign predictor that leverages gradient oscillation and kernel-level sign consistency. Experiments show that this new EBLC yields up to 1.53x higher compression ratios than SZ3 with lower accuracy loss. Integrated into a real-world FL framework, APPFL, it reduces end-to-end communication time by 76.1%-96.2% under various constrained-bandwidth scenarios, demonstrating strong scalability for real-world FL deployments.

[678] MARAuder’s Map: Motion-Aware Real-time Activity Recognition with Layout-Based Trajectories

Zishuai Liu, Weihang You, Jin Lu, Fei Dou

Main category: cs.LG

TL;DR: MARAuder’s Map is a real-time human activity recognition framework that uses spatial projections of sensor data onto floorplans and hybrid deep learning with temporal embeddings to handle unsegmented sensor streams in smart homes.

Details

Motivation: Existing HAR approaches lack real-time inference, spatial reasoning, and context-aware temporal modeling, relying on pre-segmented data and ignoring physical layouts, which limits robustness in continuous real-world deployments.

Method: Projects sensor activations onto physical floorplans to create trajectory-aware image sequences, uses hybrid deep learning with spatial-temporal modeling, learnable time embeddings for contextual cues, and attention-based encoder for selective focus on informative segments.

Result: Extensive experiments on multiple real-world smart home datasets show the method outperforms strong baselines in real-time activity recognition.

Conclusion: MARAuder’s Map provides a practical solution for real-time HAR in ambient sensor environments by effectively handling spatial flow, temporal dependencies, and cross-activity transitions.

Abstract: Ambient sensor-based human activity recognition (HAR) in smart homes remains challenging due to the need for real-time inference, spatially grounded reasoning, and context-aware temporal modeling. Existing approaches often rely on pre-segmented, within-activity data and overlook the physical layout of the environment, limiting their robustness in continuous, real-world deployments. In this paper, we propose MARAuder’s Map, a novel framework for real-time activity recognition from raw, unsegmented sensor streams. Our method projects sensor activations onto the physical floorplan to generate trajectory-aware, image-like sequences that capture the spatial flow of human movement. These representations are processed by a hybrid deep learning model that jointly captures spatial structure and temporal dependencies. To enhance temporal awareness, we introduce a learnable time embedding module that encodes contextual cues such as hour-of-day and day-of-week. Additionally, an attention-based encoder selectively focuses on informative segments within each observation window, enabling accurate recognition even under cross-activity transitions and temporal ambiguity. Extensive experiments on multiple real-world smart home datasets demonstrate that our method outperforms strong baselines, offering a practical solution for real-time HAR in ambient sensor environments.

[679] SymLight: Exploring Interpretable and Deployable Symbolic Policies for Traffic Signal Control

Xiao-Cheng Liao, Yi Mei, Mengjie Zhang

Main category: cs.LG

TL;DR: SymLight uses Monte Carlo Tree Search to find interpretable symbolic priority functions for traffic signal control, achieving high performance while being transparent and deployable on edge devices.

Details

Motivation: Neural traffic signal control policies are over-parameterized, non-transparent, and difficult to deploy on resource-limited edge devices, creating a need for interpretable and efficient alternatives.

Method: Proposes SymLight framework using MCTS to search for symbolic priority functions, with a concise function representation and probabilistic structural rollout strategy to guide the search process.

Result: Experiments on real-world datasets show SymLight outperforms baselines while producing interpretable and deployable traffic signal control policies.

Conclusion: Symbolic priority functions discovered through MCTS can provide excellent traffic signal control performance while maintaining interpretability and deployability on edge devices.

Abstract: Deep Reinforcement Learning have achieved significant success in automatically devising effective traffic signal control (TSC) policies. Neural policies, however, tend to be over-parameterized and non-transparent, hindering their interpretability and deployability on resource-limited edge devices. This work presents SymLight, a priority function search framework based on Monte Carlo Tree Search (MCTS) for discovering inherently interpretable and deployable symbolic priority functions to serve as the TSC policies. The priority function, in particular, accepts traffic features as input and then outputs a priority for each traffic signal phase, which subsequently directs the phase transition. For effective search, we propose a concise yet expressive priority function representation. This helps mitigate the combinatorial explosion of the action space in MCTS. Additionally, a probabilistic structural rollout strategy is introduced to leverage structural patterns from previously discovered high-quality priority functions, guiding the rollout process. Our experiments on real-world datasets demonstrate SymLight’s superior performance across a range of baselines. A key advantage is SymLight’s ability to produce interpretable and deployable TSC policies while maintaining excellent performance.

[680] Beyond the Lower Bound: Bridging Regret Minimization and Best Arm Identification in Lexicographic Bandits

Bo Xue, Yuanyu Wan, Zhichao Lu, Qingfu Zhang

Main category: cs.LG

TL;DR: This paper bridges regret minimization and best arm identification in lexicographic bandits, proposing two elimination-based algorithms that leverage hierarchical preferences and cross-objective dependencies.

Details

Motivation: To address the gap between regret minimization and best arm identification in multi-objective decision-making with hierarchical preferences, where previous studies focused mainly on regret minimization.

Method: Two elimination-based algorithms: (1) sequential layer-by-layer elimination following objective priorities, (2) simultaneous elimination leveraging cross-objective dependencies in each round.

Result: First algorithm achieves comparable sample complexity and regret bounds to best single-objective algorithms. Second algorithm outperforms known single-objective lower bound, demonstrating benefits of cross-objective information sharing.

Conclusion: The proposed algorithms show superior performance over baselines, with cross-objective information sharing providing significant advantages in multi-objective lexicographic bandits.

Abstract: In multi-objective decision-making with hierarchical preferences, lexicographic bandits provide a natural framework for optimizing multiple objectives in a prioritized order. In this setting, a learner repeatedly selects arms and observes reward vectors, aiming to maximize the reward for the highest-priority objective, then the next, and so on. While previous studies have primarily focused on regret minimization, this work bridges the gap between \textit{regret minimization} and \textit{best arm identification} under lexicographic preferences. We propose two elimination-based algorithms to address this joint objective. The first algorithm eliminates suboptimal arms sequentially, layer by layer, in accordance with the objective priorities, and achieves sample complexity and regret bounds comparable to those of the best single-objective algorithms. The second algorithm simultaneously leverages reward information from all objectives in each round, effectively exploiting cross-objective dependencies. Remarkably, it outperforms the known lower bound for the single-objective bandit problem, highlighting the benefit of cross-objective information sharing in the multi-objective setting. Empirical results further validate their superior performance over baselines.

[681] Catching Contamination Before Generation: Spectral Kill Switches for Agents

Valentin Noël

Main category: cs.LG

TL;DR: A diagnostic method using attention-based spectral statistics to detect context inconsistencies in agentic language models during execution, enabling real-time error detection before propagation.

Details

Motivation: Intermediate reasoning steps in agentic language models can be corrupted by inconsistent context, retrieval errors, or adversarial inputs, leading to error propagation that post hoc evaluation cannot prevent.

Method: Uses forward pass only to analyze token graphs from attention, computing high frequency energy ratio and spectral entropy in early layers. Establishes invariances and provides finite sample estimators with uncertainty quantification.

Result: High frequency energy ratio shows robust bimodality for context verification across model families, enabling sub-millisecond gating decisions. Optimal Bayes detection under two-regime mixture assumption with monotone likelihood ratio property.

Conclusion: The approach enables inline safety monitoring that detects contamination during text processing, preventing errors from committing to reasoning chains, and integrates into retrieval-augmented agent pipelines.

Abstract: Agentic language models compose multi step reasoning chains, yet intermediate steps can be corrupted by inconsistent context, retrieval errors, or adversarial inputs, which makes post hoc evaluation too late because errors propagate before detection. We introduce a diagnostic that requires no additional training and uses only the forward pass to emit a binary accept or reject signal during agent execution. The method analyzes token graphs induced by attention and computes two spectral statistics in early layers, namely the high frequency energy ratio and spectral entropy. We formalize these signals, establish invariances, and provide finite sample estimators with uncertainty quantification. Under a two regime mixture assumption with a monotone likelihood ratio property, we show that a single threshold on the high frequency energy ratio is optimal in the Bayes sense for detecting context inconsistency. Empirically, the high frequency energy ratio exhibits robust bimodality during context verification across multiple model families, which enables gating decisions with overhead below one millisecond on our hardware and configurations. We demonstrate integration into retrieval augmented agent pipelines and discuss deployment as an inline safety monitor. The approach detects contamination while the model is still processing the text, before errors commit to the reasoning chain.

[682] Measuring Model Performance in the Presence of an Intervention

Winston Chen, Michael W. Sjoding, Jenna Wiens

Main category: cs.LG

TL;DR: Proposes NPW, an unbiased model evaluation method that uses all RCT data by reweighting treatment group data to mimic no-intervention distributions, improving model selection efficiency over standard approaches.

Details

Motivation: Standard model evaluation is biased when interventions affect outcomes. RCT control data provides unbiased evaluation but wastes treatment group data, which is inefficient given RCT costs.

Method: Theoretically quantify bias from naive aggregation, derive conditions for incorrect model selection, and propose NPW that reweights treatment data to mimic no-intervention outcome distributions.

Result: NPW consistently yields better model selection than standard approaches across various intervention effects and sample sizes in synthetic and real-world datasets.

Conclusion: NPW represents a meaningful step towards more efficient model evaluation in real-world contexts by leveraging all RCT data while maintaining unbiased evaluation.

Abstract: AI models are often evaluated based on their ability to predict the outcome of interest. However, in many AI for social impact applications, the presence of an intervention that affects the outcome can bias the evaluation. Randomized controlled trials (RCTs) randomly assign interventions, allowing data from the control group to be used for unbiased model evaluation. However, this approach is inefficient because it ignores data from the treatment group. Given the complexity and cost often associated with RCTs, making the most use of the data is essential. Thus, we investigate model evaluation strategies that leverage all data from an RCT. First, we theoretically quantify the estimation bias that arises from naïvely aggregating performance estimates from treatment and control groups, and derive the condition under which this bias leads to incorrect model selection. Leveraging these theoretical insights, we propose nuisance parameter weighting (NPW), an unbiased model evaluation approach that reweights data from the treatment group to mimic the distributions of samples that would or would not experience the outcome under no intervention. Using synthetic and real-world datasets, we demonstrate that our proposed evaluation approach consistently yields better model selection than the standard approach, which ignores data from the treatment group, across various intervention effect and sample size settings. Our contribution represents a meaningful step towards more efficient model evaluation in real-world contexts.

[683] MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling

Yu Zhang, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu

Main category: cs.LG

TL;DR: MOSS is a novel FP8 training framework that enables efficient and stable training of large language models through two-level microscaling for activations and automatic scaling for weights, achieving 34% higher throughput while maintaining BF16-level performance.

Details

Motivation: Current FP8 training frameworks face efficiency challenges due to per-group quantization overhead and inefficient just-in-time scaling that negates FP8 performance benefits.

Method: MOSS introduces two key innovations: (1) two-level microscaling for sensitive activations combining high-precision global scale with power-of-two local scales, and (2) automatic scaling for weights that eliminates max-reduction operations by predicting scaling factors during training.

Result: MOSS enables efficient FP8 training of a 7B parameter model, achieving performance comparable to BF16 baseline with up to 34% higher training throughput.

Conclusion: MOSS successfully overcomes FP8 training limitations by balancing precision and efficiency through innovative scaling strategies, making FP8 training practical for large language models.

Abstract: Training large language models with FP8 formats offers significant efficiency gains. However, the reduced numerical precision of FP8 poses challenges for stable and accurate training. Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of matrix multiplication, introducing additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to dynamically adjust scaling factors based on the current data distribution. However, this online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability. MOSS introduces two key innovations: (1) a two-level microscaling strategy for quantizing sensitive activations, which balances precision and dequantization cost by combining a high-precision global scale with compact, power-of-two local scales; and (2) automatic scaling for weights in linear layers, which eliminates the need for costly max-reduction operations by predicting and adjusting scaling factors during training. Leveraging these techniques, MOSS enables efficient FP8 training of a 7B parameter model, achieving performance comparable to the BF16 baseline while achieving up to 34% higher training throughput.

[684] In-depth Analysis on Caching and Pre-fetching in Mixture of Experts Offloading

Shuning Lin, Yifan He, Yitong Chen

Main category: cs.LG

TL;DR: The paper analyzes MoE offloading challenges, proposes LFU caching optimization over LRU, implements speculative expert pre-fetching, and studies MoE architecture behavior to enable better deployment on memory-constrained devices.

Details

Motivation: MoE models require significantly more memory than dense models, making deployment difficult on edge devices with limited GPU memory. MoE offloading with caching and pre-fetching is promising but prior work had suboptimal caching algorithms and limited insights.

Method: 1) Detailed analysis of expert activation and LRU caching behavior with traces; 2) Proposed LFU caching optimization based on analysis; 3) Implemented speculative expert pre-fetching with detailed trace analysis; 4) Extensive study of MoE architecture behavior including gating network and experts.

Result: LFU caching optimization achieved strong improvements over LRU, and speculative expert pre-fetching showed huge potential based on trace analysis.

Conclusion: The study provides comprehensive insights into MoE offloading and architecture behavior, which can inspire future work on MoE model interpretation and pruning techniques with minimal performance loss.

Abstract: In today’s landscape, Mixture of Experts (MoE) is a crucial architecture that has been used by many of the most advanced models. One of the major challenges of MoE models is that they usually require much more memory than their dense counterparts due to their unique architecture, and hence are harder to deploy in environments with limited GPU memory, such as edge devices. MoE offloading is a promising technique proposed to overcome this challenge, especially if it is enhanced with caching and pre-fetching, but prior work stopped at suboptimal caching algorithm and offered limited insights. In this work, we study MoE offloading in depth and make the following contributions: 1. We analyze the expert activation and LRU caching behavior in detail and provide traces. 2. We propose LFU caching optimization based on our analysis and obtain strong improvements from LRU. 3. We implement and experiment speculative expert pre-fetching, providing detailed trace showing its huge potential . 4. In addition, our study extensively covers the behavior of the MoE architecture itself, offering information on the characteristic of the gating network and experts. This can inspire future work on the interpretation of MoE models and the development of pruning techniques for MoE architecture with minimal performance loss.

[685] AiEDA: An Open-Source AI-Aided Design Library for Design-to-Vector

Yihang Qiu, Zengrong Huang, Simin Tao, Hongda Zhang, Weiguo Li, Xinhua Lai, Rui Wang, Weiqiang Wang, Xingquan Li

Main category: cs.LG

TL;DR: AiEDA is a unified open-source library that transforms fragmented chip design data into standardized vector representations, enabling AI-aided design workflows and generating a 600GB dataset from 50 real chip designs.

Details

Motivation: Current AI-EDA infrastructures are fragmented with challenges including disjointed flow engines, heterogeneous file formats, non-standardized data extraction, and poorly organized data storage, limiting comprehensive AI integration in chip design.

Method: AiEDA integrates multiple design-to-vector data representation techniques to transform diverse chip design data into universal multi-level vector representations, providing complete physical design flows with programmatic data extraction and standardized Python interfaces.

Result: Generated iDATA, a 600GB dataset from 50 real 28nm chip designs, and validated effectiveness through seven representative AI-aided design tasks spanning prediction, generation, optimization and analysis.

Conclusion: AiEDA establishes an AI-aided design paradigm optimized for AI-EDA workflows, with publicly available code and a forthcoming dataset that provides a foundation for future AI-EDA research.

Abstract: Recent research has demonstrated that artificial intelligence (AI) can assist electronic design automation (EDA) in improving both the quality and efficiency of chip design. But current AI for EDA (AI-EDA) infrastructures remain fragmented, lacking comprehensive solutions for the entire data pipeline from design execution to AI integration. Key challenges include fragmented flow engines that generate raw data, heterogeneous file formats for data exchange, non-standardized data extraction methods, and poorly organized data storage. This work introduces a unified open-source library for EDA (AiEDA) that addresses these issues. AiEDA integrates multiple design-to-vector data representation techniques that transform diverse chip design data into universal multi-level vector representations, establishing an AI-aided design (AAD) paradigm optimized for AI-EDA workflows. AiEDA provides complete physical design flows with programmatic data extraction and standardized Python interfaces bridging EDA datasets and AI frameworks. Leveraging the AiEDA library, we generate iDATA, a 600GB dataset of structured data derived from 50 real chip designs (28nm), and validate its effectiveness through seven representative AAD tasks spanning prediction, generation, optimization and analysis. The code is publicly available at https://github.com/OSCC-Project/AiEDA, while the full iDATA dataset is being prepared for public release, providing a foundation for future AI-EDA research.

[686] CADM: Cluster-customized Adaptive Distance Metric for Categorical Data Clustering

Taixi Chen, Yiu-ming Cheung, Yiqun Zhang

Main category: cs.LG

TL;DR: Proposes a cluster-customized distance metric for categorical data clustering that adapts distances based on attribute distributions in each cluster, and extends it to mixed data types.

Details

Motivation: Standard distance metrics for categorical data don't account for varying attribute value distributions across different clusters, leading to unreasonable distance measurements.

Method: Developed a cluster-customized distance metric that competitively updates distances based on different attribute distributions in each cluster, with extension to mixed numerical and categorical data.

Result: Achieved an average ranking of around first place across fourteen datasets, demonstrating superior performance.

Conclusion: The proposed cluster-customized distance metric effectively addresses the limitations of traditional categorical distance measures and shows strong empirical performance.

Abstract: An appropriate distance metric is crucial for categorical data clustering, as the distance between categorical data cannot be directly calculated. However, the distances between attribute values usually vary in different clusters induced by their different distributions, which has not been taken into account, thus leading to unreasonable distance measurement. Therefore, we propose a cluster-customized distance metric for categorical data clustering, which can competitively update distances based on different distributions of attributes in each cluster. In addition, we extend the proposed distance metric to the mixed data that contains both numerical and categorical attributes. Experiments demonstrate the efficacy of the proposed method, i.e., achieving an average ranking of around first in fourteen datasets. The source code is available at https://anonymous.4open.science/r/CADM-47D8

[687] Predicting the Future by Retrieving the Past

Dazhao Du, Tao Han, Song Guo

Main category: cs.LG

TL;DR: PFRP enhances time series forecasting by retrieving similar historical patterns from a Global Memory Bank and combining them with local model predictions, improving accuracy by 8.4% on average.

Details

Motivation: Current deep learning models for time series forecasting only use local context from sliding windows during inference, failing to utilize rich global historical patterns stored in their parameters from training.

Method: Propose PFRP with Global Memory Bank to store global historical patterns, retrieval mechanism to find similar patterns, and adaptive combination of global predictions with local model outputs.

Result: Extensive experiments on 7 real-world datasets show PFRP improves average performance of advanced univariate forecasting models by 8.4%.

Conclusion: Explicitly integrating global historical knowledge through retrieval mechanisms significantly enhances forecasting accuracy and interpretability compared to relying solely on local context.

Abstract: Deep learning models such as MLP, Transformer, and TCN have achieved remarkable success in univariate time series forecasting, typically relying on sliding window samples from historical data for training. However, while these models implicitly compress historical information into their parameters during training, they are unable to explicitly and dynamically access this global knowledge during inference, relying only on the local context within the lookback window. This results in an underutilization of rich patterns from the global history. To bridge this gap, we propose Predicting the Future by Retrieving the Past (PFRP), a novel approach that explicitly integrates global historical data to enhance forecasting accuracy. Specifically, we construct a Global Memory Bank (GMB) to effectively store and manage global historical patterns. A retrieval mechanism is then employed to extract similar patterns from the GMB, enabling the generation of global predictions. By adaptively combining these global predictions with the outputs of any local prediction model, PFRP produces more accurate and interpretable forecasts. Extensive experiments conducted on seven real-world datasets demonstrate that PFRP significantly enhances the average performance of advanced univariate forecasting models by 8.4%. Codes can be found in https://github.com/ddz16/PFRP.

[688] EMOD: A Unified EEG Emotion Representation Framework Leveraging V-A Guided Contrastive Learning

Yuning Chen, Sha Zhao, Shijian Li, Gang Pan

Main category: cs.LG

TL;DR: EMOD is a unified EEG emotion recognition framework that uses valence-arousal guided contrastive learning to bridge semantic and structural gaps across heterogeneous datasets, achieving state-of-the-art performance with strong generalization.

Details

Motivation: Existing deep learning approaches for EEG emotion recognition have limited generalization across datasets due to heterogeneity in annotation schemes and data formats, requiring dataset-specific architectures and lacking semantic alignment.

Method: Projects emotion labels into unified V-A space, uses soft-weighted supervised contrastive loss, and employs flexible backbone with Triple-Domain Encoder and Spatial-Temporal Transformer to handle variable EEG formats.

Result: Achieves state-of-the-art performance on three benchmark datasets after pretraining on eight public EEG datasets, demonstrating strong adaptability and generalization.

Conclusion: EMOD successfully learns transferable and emotion-aware representations that bridge both semantic and structural gaps in EEG emotion recognition across diverse datasets.

Abstract: Emotion recognition from EEG signals is essential for affective computing and has been widely explored using deep learning. While recent deep learning approaches have achieved strong performance on single EEG emotion datasets, their generalization across datasets remains limited due to the heterogeneity in annotation schemes and data formats. Existing models typically require dataset-specific architectures tailored to input structure and lack semantic alignment across diverse emotion labels. To address these challenges, we propose EMOD: A Unified EEG Emotion Representation Framework Leveraging Valence-Arousal (V-A) Guided Contrastive Learning. EMOD learns transferable and emotion-aware representations from heterogeneous datasets by bridging both semantic and structural gaps. Specifically, we project discrete and continuous emotion labels into a unified V-A space and formulate a soft-weighted supervised contrastive loss that encourages emotionally similar samples to cluster in the latent space. To accommodate variable EEG formats, EMOD employs a flexible backbone comprising a Triple-Domain Encoder followed by a Spatial-Temporal Transformer, enabling robust extraction and integration of temporal, spectral, and spatial features. We pretrain EMOD on eight public EEG datasets and evaluate its performance on three benchmark datasets. Experimental results show that EMOD achieves state-of-the-art performance, demonstrating strong adaptability and generalization across diverse EEG-based emotion recognition scenarios.

[689] Adaptation and Fine-tuning with TabPFN for Travelling Salesman Problem

Nguyen Gia Hien Vu, Yifan Tang, Rey Lim, Yifan Yang, Hang Ma, Ke Wang, G. Gary Wang

Main category: cs.LG

TL;DR: TabPFN applied to solve Traveling Salesman Problem with minimal training, achieving strong performance comparable to other methods without post-processing.

Details

Motivation: To reduce time and data-intensive training requirements of traditional combinatorial optimization methods like exact algorithms, heuristics, and ML-based models.

Method: Adapted and fine-tuned TabPFN using node-based approach and node-predicting adaptation strategy to construct TSP routes, requiring only a single sample for adaptation.

Result: TabPFN requires minimal training, adapts quickly (within minutes), performs better generalization across varying TSP sizes, and reduces performance degradation while achieving comparable solution quality to other models.

Conclusion: TabPFN is a promising approach for solving structured combinatorial optimization problems efficiently under resource constraints and rapid deployment needs.

Abstract: Tabular Prior-Data Fitted Network (TabPFN) is a foundation model designed for small to medium-sized tabular data, which has attracted much attention recently. This paper investigates the application of TabPFN in Combinatorial Optimization (CO) problems. The aim is to lessen challenges in time and data-intensive training requirements often observed in using traditional methods including exact and heuristic algorithms, Machine Learning (ML)-based models, to solve CO problems. Proposing possibly the first ever application of TabPFN for such a purpose, we adapt and fine-tune the TabPFN model to solve the Travelling Salesman Problem (TSP), one of the most well-known CO problems. Specifically, we adopt the node-based approach and the node-predicting adaptation strategy to construct the entire TSP route. Our evaluation with varying instance sizes confirms that TabPFN requires minimal training, adapts to TSP using a single sample, performs better generalization across varying TSP instance sizes, and reduces performance degradation. Furthermore, the training process with adaptation and fine-tuning is completed within minutes. The methodology leads to strong solution quality even without post-processing and achieves performance comparable to other models with post-processing refinement. Our findings suggest that the TabPFN model is a promising approach to solve structured and CO problems efficiently under training resource constraints and rapid deployment requirements.

[690] FusionLog: Cross-System Log-based Anomaly Detection via Fusion of General and Proprietary Knowledge

Xinlong Zhao, Tong Jia, Minghua He, Xixuan Yang, Ying Li

Main category: cs.LG

TL;DR: FusionLog is a zero-label cross-system log anomaly detection method that fuses general and proprietary knowledge without requiring labeled target logs, achieving over 90% F1-score by dynamically routing logs and using collaborative knowledge distillation.

Details

Motivation: Existing transfer learning methods for log anomaly detection focus only on general knowledge transfer and neglect the mismatch between general knowledge and target system's proprietary knowledge, limiting performance in new systems without sufficient labeled data.

Method: Uses training-free router based on semantic similarity to partition target logs into general and proprietary logs. For general logs: employs system-agnostic representation meta-learning. For proprietary logs: iteratively generates pseudo-labels and fine-tunes using multi-round collaborative knowledge distillation between LLM and small model.

Result: Achieves over 90% F1-score on three public log datasets under fully zero-label setting, significantly outperforming state-of-the-art cross-system log anomaly detection methods.

Conclusion: FusionLog effectively fuses general and proprietary knowledge for cross-system generalization without labeled target logs, demonstrating superior performance in zero-label log anomaly detection scenarios.

Abstract: Log-based anomaly detection is critical for ensuring the stability and reliability of web systems. One of the key problems in this task is the lack of sufficient labeled logs, which limits the rapid deployment in new systems. Existing works usually leverage large-scale labeled logs from a mature web system and a small amount of labeled logs from a new system, using transfer learning to extract and generalize general knowledge across both domains. However, these methods focus solely on the transfer of general knowledge and neglect the disparity and potential mismatch between such knowledge and the proprietary knowledge of target system, thus constraining performance. To address this limitation, we propose FusionLog, a novel zero-label cross-system log-based anomaly detection method that effectively achieves the fusion of general and proprietary knowledge, enabling cross-system generalization without any labeled target logs. Specifically, we first design a training-free router based on semantic similarity that dynamically partitions unlabeled target logs into ‘general logs’ and ‘proprietary logs.’ For general logs, FusionLog employs a small model based on system-agnostic representation meta-learning for direct training and inference, inheriting the general anomaly patterns shared between the source and target systems. For proprietary logs, we iteratively generate pseudo-labels and fine-tune the small model using multi-round collaborative knowledge distillation and fusion based on large language model (LLM) and small model (SM) to enhance its capability to recognize anomaly patterns specific to the target system. Experimental results on three public log datasets from different systems show that FusionLog achieves over 90% F1-score under a fully zero-label setting, significantly outperforming state-of-the-art cross-system log-based anomaly detection methods.

[691] Physics-Informed Neural Networks for Real-Time Gas Crossover Prediction in PEM Electrolyzers: First Application with Multi-Membrane Validation

Yong-Woon Kim, Chulung Kang, Yung-Cheol Byun

Main category: cs.LG

TL;DR: PINNs for hydrogen crossover prediction in PEM electrolyzers achieve 99.84% accuracy with sub-millisecond inference, enabling real-time safety monitoring across industrial conditions.

Details

Motivation: Green hydrogen production via PEM electrolysis faces safety risks from hydrogen crossover approaching explosive limits (4% H2 in O2) and reduced efficiency. Current physics models are computationally intensive while pure data-driven methods fail to extrapolate for dynamic operation.

Method: Physics-informed neural networks (PINNs) integrating mass conservation, Fick’s diffusion law, and Henry’s solubility law with compact architecture (17,793 parameters), validated across 6 membranes under industrial conditions (0.05-5.0 A/cm², 1-200 bar, 25-85°C).

Result: Exceptional accuracy (R²=99.84%, RMSE=0.0348%) with sub-millisecond inference, maintains R²>86% at pressures 2.5x beyond training range, outperforming pure neural networks (R²=43.4%). Hardware-agnostic deployment from CPUs to edge devices.

Conclusion: PINNs bridge physical rigor and computational efficiency, establishing new paradigm for real-time electrolyzer monitoring to accelerate safe, efficient green hydrogen infrastructure for net-zero emissions.

Abstract: Green hydrogen production via polymer electrolyte membrane (PEM) water electrolysis is pivotal for energy transition, yet hydrogen crossover through membranes threatens safety and economic viability-approaching explosive limits (4 mol% H$_2$ in O$_2$) while reducing Faradaic efficiency by 2.5%. Current physics-based models require extensive calibration and computational resources that preclude real-time implementation, while purely data-driven approaches fail to extrapolate beyond training conditions-critical for dynamic electrolyzer operation. Here we present the first application of physics-informed neural networks (PINNs) for hydrogen crossover prediction, integrating mass conservation, Fick’s diffusion law, and Henry’s solubility law within a compact architecture (17,793 parameters). Validated across six membranes under industrially relevant conditions (0.05-5.0 A/cm$^2$, 1-200 bar, 25-85°C), our PINN achieves exceptional accuracy (R$^2$ = 99.84%, RMSE = 0.0348%) with sub-millisecond inference times suitable for real-time control. Remarkably, the model maintains R$^2$ > 86% when predicting crossover at pressures 2.5x beyond training range-substantially outperforming pure neural networks (R$^2$ = 43.4%). The hardware-agnostic deployment, from desktop CPUs to edge devices (Raspberry Pi 4), enables distributed safety monitoring essential for gigawatt-scale installations. By bridging physical rigor and computational efficiency, this work establishes a new paradigm for real-time electrolyzer monitoring, accelerating deployment of safe, efficient green hydrogen infrastructure crucial for net-zero emissions targets.

[692] From Kernels to Attention: A Transformer Framework for Density and Score Estimation

Vasily Ilin, Peter Sushko

Main category: cs.LG

TL;DR: A unified transformer framework for joint density and score estimation that learns a single distribution-agnostic operator, outperforming traditional methods like KDE while maintaining equivariance properties.

Details

Motivation: Traditional score-matching methods require separate models for each distribution, lacking generalization across densities and sample sizes. The goal is to develop a single, unified approach for both density and score estimation.

Method: Uses a permutation- and affine-equivariant transformer architecture with cross-attention to connect observed samples with query points, enabling generalization beyond training data while maintaining symmetry constraints.

Result: Achieves substantially lower error and better scaling than KDE and score-debiased KDE, with improved runtime scaling. The attention weights can recover classical KDE, establishing a principled link between transformers and traditional methods.

Conclusion: Transformers serve as general-purpose, data-adaptive operators for nonparametric density and score estimation, offering a unified framework that generalizes across distributions and outperforms classical approaches.

Abstract: We introduce a unified attention-based framework for joint score and density estimation. Framing the problem as a sequence-to-sequence task, we develop a permutation- and affine-equivariant transformer that estimates both the probability density $f(x)$ and its score $\nabla_x \log f(x)$ directly from i.i.d. samples. Unlike traditional score-matching methods that require training a separate model for each distribution, our approach learns a single distribution-agnostic operator that generalizes across densities and sample sizes. The architecture employs cross-attention to connect observed samples with arbitrary query points, enabling generalization beyond the training data, while built-in symmetry constraints ensure equivariance to permutation and affine transformations. Analytically, we show that the attention weights can recover classical kernel density estimation (KDE), and verify it empirically, establishing a principled link between classical KDE and the transformer architecture. Empirically, the model achieves substantially lower error and better scaling than KDE and score-debiased KDE (SD-KDE), while exhibiting better runtime scaling. Together, these results establish transformers as general-purpose, data-adaptive operators for nonparametric density and score estimation.

[693] Deep Survival Analysis of Longitudinal EHR Data for Joint Prediction of Hospitalization and Death in COPD Patients

Enrico Manzini, Thomas Gonzalez Saito, Joan Escudero, Ana Génova, Cristina Caso, Tomas Perez-Porcuna, Alexandre Perera-Lluna

Main category: cs.LG

TL;DR: Deep learning models using recurrent architectures outperformed traditional statistical and machine learning approaches in predicting COPD patient hospitalizations and deaths from longitudinal EHR data.

Details

Motivation: COPD patients have high hospitalization risks that strongly impact survival, but predicting the timing of these events remains challenging and understudied.

Method: Used survival analysis on longitudinal EHR data from 150k+ patients, modeling hospitalization as first event and death as semi-competing terminal event. Compared Cox PH, SurvivalBoost, DeepPseudo, SurvTRACE, Dynamic Deep-Hit, and Deep Recurrent Survival Machine models.

Result: DL models with recurrent architectures achieved higher concordance and time-dependent AUC than ML and linear approaches, especially for the harder-to-predict hospitalization events.

Conclusion: This is the first study to apply deep survival analysis on longitudinal EHR data for multiple time-to-event outcomes in COPD patients, demonstrating DL’s potential to capture temporal patterns and improve risk stratification.

Abstract: Patients with chronic obstructive pulmonary disease (COPD) have an increased risk of hospitalizations, strongly associated with decreased survival, yet predicting the timing of these events remains challenging and has received limited attention in the literature. In this study, we performed survival analysis to predict hospitalization and death in COPD patients using longitudinal electronic health records (EHRs), comparing statistical models, machine learning (ML), and deep learning (DL) approaches. We analyzed data from more than 150k patients from the SIDIAP database in Catalonia, Spain, from 2013 to 2017, modeling hospitalization as a first event and death as a semi-competing terminal event. Multiple models were evaluated, including Cox proportional hazards, SurvivalBoost, DeepPseudo, SurvTRACE, Dynamic Deep-Hit, and Deep Recurrent Survival Machine. Results showed that DL models utilizing recurrent architectures outperformed both ML and linear approaches in concordance and time-dependent AUC, especially for hospitalization, which proved to be the harder event to predict. This study is, to our knowledge, the first to apply deep survival analysis on longitudinal EHR data to jointly predict multiple time-to-event outcomes in COPD patients, highlighting the potential of DL approaches to capture temporal patterns and improve risk stratification.

[694] Next-Latent Prediction Transformers Learn Compact World Models

Jayden Teoh, Manan Tomar, Kwangjun Ahn, Edward S. Hu, Pratyusha Sharma, Riashat Islam, Alex Lamb, John Langford

Main category: cs.LG

TL;DR: NextLat extends next-token training with self-supervised latent space predictions, training transformers to learn representations that predict next latent states, effectively injecting recurrent inductive bias while maintaining parallel training.

Details

Motivation: Transformers lack inherent incentive to compress history into compact latent states with consistent transition rules, leading to poor generalization. The paper aims to address this limitation.

Method: Introduces Next-Latent Prediction (NextLat) which adds self-supervised predictions in latent space to standard next-token training, training transformers to learn latent representations predictive of next latent states given next output tokens.

Result: Empirical results show significant gains over standard next-token training in downstream accuracy, representation compression, and lookahead planning across benchmarks for world modeling, reasoning, planning, and language modeling.

Conclusion: NextLat provides a simple and efficient paradigm for shaping transformer representations toward stronger generalization by encouraging formation of compact internal world models with belief states and transition dynamics.

Abstract: Transformers replace recurrence with a memory that grows with sequence length and self-attention that enables ad-hoc look ups over past tokens. Consequently, they lack an inherent incentive to compress history into compact latent states with consistent transition rules. This often leads to learning solutions that generalize poorly. We introduce Next-Latent Prediction (NextLat), which extends standard next-token training with self-supervised predictions in the latent space. Specifically, NextLat trains a transformer to learn latent representations that are predictive of its next latent state given the next output token. Theoretically, we show that these latents provably converge to belief states, compressed information of the history necessary to predict the future. This simple auxiliary objective also injects a recurrent inductive bias into transformers, while leaving their architecture, parallel training, and inference unchanged. NextLat effectively encourages the transformer to form compact internal world models with its own belief states and transition dynamics – a crucial property absent in standard next-token prediction transformers. Empirically, across benchmarks targeting core sequence modeling competencies – world modeling, reasoning, planning, and language modeling – NextLat demonstrates significant gains over standard next-token training in downstream accuracy, representation compression, and lookahead planning. NextLat stands as a simple and efficient paradigm for shaping transformer representations toward stronger generalization.

[695] Explainable Deep Learning-based Classification of Wolff-Parkinson-White Electrocardiographic Signals

Alice Ragonesi, Stefania Fresca, Karli Gillette, Stefan Kurath-Koller, Gernot Plank, Elena Zappon

Main category: cs.LG

TL;DR: A deep learning model using synthetic ECG data from virtual heart models achieves >95% accuracy for localizing Wolff-Parkinson-White accessory pathways across 24 cardiac regions, with explainable AI methods providing transparency and physiological validation.

Details

Motivation: Current methods for accessory pathway localization have limited resolution, poor interpretability, and use small datasets. There's a need for accurate, transparent, non-invasive localization to guide catheter ablation procedures.

Method: Deep learning model trained on large synthetic ECG database from personalized virtual heart models, integrated with explainable AI methods (Guided Backpropagation, Grad-CAM, Guided Grad-CAM) for interpretability.

Result: Achieves 95% localization accuracy, 94.32% sensitivity, and 99.78% specificity. XAI validation shows lead V2 as most critical for localization, followed by aVF, V1, and aVL.

Conclusion: Combining cardiac digital twins with explainable deep learning enables accurate, transparent, non-invasive accessory pathway localization, addressing key barriers to clinical adoption.

Abstract: Wolff-Parkinson-White (WPW) syndrome is a cardiac electrophysiology (EP) disorder caused by the presence of an accessory pathway (AP) that bypasses the atrioventricular node, faster ventricular activation rate, and provides a substrate for atrio-ventricular reentrant tachycardia (AVRT). Accurate localization of the AP is critical for planning and guiding catheter ablation procedures. While traditional diagnostic tree (DT) methods and more recent machine learning (ML) approaches have been proposed to predict AP location from surface electrocardiogram (ECG), they are often constrained by limited anatomical localization resolution, poor interpretability, and the use of small clinical datasets. In this study, we present a Deep Learning (DL) model for the localization of single manifest APs across 24 cardiac regions, trained on a large, physiologically realistic database of synthetic ECGs generated using a personalized virtual heart model. We also integrate eXplainable Artificial Intelligence (XAI) methods, Guided Backpropagation, Grad-CAM, and Guided Grad-CAM, into the pipeline. This enables interpretation of DL decision-making and addresses one of the main barriers to clinical adoption: lack of transparency in ML predictions. Our model achieves localization accuracy above 95%, with a sensitivity of 94.32% and specificity of 99.78%. XAI outputs are physiologically validated against known depolarization patterns, and a novel index is introduced to identify the most informative ECG leads for AP localization. Results highlight lead V2 as the most critical, followed by aVF, V1, and aVL. This work demonstrates the potential of combining cardiac digital twins with explainable DL to enable accurate, transparent, and non-invasive AP localization.

[696] Kunlun Anomaly Troubleshooter: Enabling Kernel-Level Anomaly Detection and Causal Reasoning for Large Model Distributed Inference

Yuyang Liu, Jingjing Cai, Jiayi Ren, Peng Zhou, Danyang Zhang, Yin Du, Shijian Li

Main category: cs.LG

TL;DR: KAT is an anomaly troubleshooting framework for large model distributed inference that uses GPU function traces for precise kernel-level anomaly detection and integrates with LLMs for causal reasoning and natural language interpretation.

Details

Motivation: Anomaly troubleshooting in large model distributed inference systems requires significant manual effort from experts, leading to time-consuming diagnosis processes with low accuracy.

Method: KAT leverages GPU worker synchronicity and function trace data for nanosecond-resolution anomaly detection, then integrates detection results with domain-adapted LLMs for systematic causal reasoning and natural language interpretation.

Result: Evaluation in Alibaba Cloud Service production environment shows KAT achieves over 0.884 precision and 0.936 recall in anomaly detection, significantly narrowing diagnostic scope and improving troubleshooting efficiency.

Conclusion: KAT provides the first specialized anomaly troubleshooting framework for large model distributed inference, delivering high-precision detection and natural language insights that enhance troubleshooting success rates.

Abstract: Anomaly troubleshooting for large model distributed inference (LMDI) remains a critical challenge. Resolving anomalies such as inference performance degradation or latency jitter in distributed system demands significant manual efforts from domain experts, resulting in extremely time-consuming diagnosis processes with relatively low accuracy. In this paper, we introduce Kunlun Anomaly Troubleshooter (KAT), the first anomaly troubleshooting framework tailored for LMDI. KAT addresses this problem through two core innovations. First, KAT exploits the synchronicity and consistency of GPU workers, innovatively leverages function trace data to precisely detect kernel-level anomalies and associated hardware components at nanosecond resolution. Second, KAT integrates these detection results into a domain-adapted LLM, delivering systematic causal reasoning and natural language interpretation of complex anomaly symptoms. Evaluations conducted in Alibaba Cloud Service production environment indicate that KAT achieves over 0.884 precision and 0.936 recall in anomaly detection, providing detail anomaly insights that significantly narrow down the diagnostic scope and improve both the efficiency and success rate of troubleshooting.

[697] Are Time-Indexed Foundation Models the Future of Time Series Imputation?

Etienne Le Naour, Tahar Nabil, Adrien Petralia, Ghislain Agoua

Main category: cs.LG

TL;DR: First large-scale empirical study of time-indexed foundation models (TabPFN-TS and MoTM) for zero-shot time series imputation across 33 datasets.

Details

Motivation: Foundation models for time series imputation remain largely unexplored, with only two recent models (TabPFN-TS and MoTM) emerging in this domain.

Method: Conducted extensive univariate experiments across 33 out-of-domain datasets (approximately 1.3M imputation windows) and evaluated ability to integrate covariates at inference time without fine-tuning.

Result: Time-indexed foundation models demonstrate powerful and practical zero-shot imputation capabilities for real-world time series.

Conclusion: Time-indexed foundation models represent a significant step toward achieving general-purpose, zero-shot imputation for real-world time series applications.

Abstract: Foundation models for time series imputation remain largely unexplored. Recently, two such models, TabPFN-TS and MoTM, have emerged. These models share a common philosophy that places them within the family of time-indexed foundation models. This paper presents the first large-scale empirical study of these models for zero-shot imputation, which enables missing value recovery without retraining across a wide range of scenarios. We conduct extensive univariate experiments across 33 out-of-domain datasets (approximately 1.3M imputation windows) and evaluate their ability to integrate covariates at inference time to improve accuracy without fine-tuning. Our results demonstrate that time-indexed foundation models are a powerful and practical step toward achieving general-purpose, zero-shot imputation for real-world time series.

[698] Bespoke Co-processor for Energy-Efficient Health Monitoring on RISC-V-based Flexible Wearables

Theofanis Vergos, Polykarpos Vergos, Mehdi B. Tahoori, Georgios Zervakis

Main category: cs.LG

TL;DR: A flexible RISC-V microprocessor with custom multiply-accumulate co-processor achieves 2.35x speedup and 2.15x lower energy consumption for on-body ML classification in healthcare wearables.

Details

Motivation: Flexible electronics for healthcare wearables face challenges with limited gate count, large feature sizes, and high static power consumption, making on-body machine learning classification difficult. Existing bendable RISC-V systems lack sufficient energy efficiency.

Method: Integrates a bespoke multiply-accumulate co-processor with fixed coefficients, formulates a constrained programming problem to jointly determine co-processor constants and optimally map MLP inference operations, leveraging low fabrication costs of flexible technologies.

Result: Post-layout results show near-real-time performance across healthcare datasets, operating within flexible battery power budgets, occupying only 2.42 mm^2, with 2.35x speedup and 2.15x lower energy consumption compared to state-of-the-art.

Conclusion: Provides a promising path toward accessible, sustainable, and conformable healthcare wearables by enabling compact, energy-efficient on-body machine learning classification through specialized co-processor integration and optimal operation mapping.

Abstract: Flexible electronics offer unique advantages for conformable, lightweight, and disposable healthcare wearables. However, their limited gate count, large feature sizes, and high static power consumption make on-body machine learning classification highly challenging. While existing bendable RISC-V systems provide compact solutions, they lack the energy efficiency required. We present a mechanically flexible RISC-V that integrates a bespoke multiply-accumulate co-processor with fixed coefficients to maximize energy efficiency and minimize latency. Our approach formulates a constrained programming problem to jointly determine co-processor constants and optimally map Multi-Layer Perceptron (MLP) inference operations, enabling compact, model-specific hardware by leveraging the low fabrication and non-recurring engineering costs of flexible technologies. Post-layout results demonstrate near-real-time performance across several healthcare datasets, with our circuits operating within the power budget of existing flexible batteries and occupying only 2.42 mm^2, offering a promising path toward accessible, sustainable, and conformable healthcare wearables. Our microprocessors achieve an average 2.35x speedup and 2.15x lower energy consumption compared to the state of the art.

[699] MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference

Myunghyun Rhee, Sookyung Choi, Euiseok Kim, Joonseop Sim, Youngpyo Joo, Hoshik Kim

Main category: cs.LG

TL;DR: MoSKA introduces a novel architecture that addresses KV cache bottlenecks in LLMs by differentiating between unique and shared context data, transforming memory-bound operations into compute-bound ones through batching and specialized infrastructure.

Details

Motivation: The escalating context length in LLMs creates severe performance bottlenecks around the KV cache, leading to significant GPU under-utilization due to its memory-bound nature.

Method: Uses Mixture of Shared KV Attention (MoSKA) with: 1) Shared KV Attention mechanism that batches concurrent requests to transform memory-bound GEMV into compute-bound GEMM, 2) MoE-inspired sparse attention to prune search space, 3) Disaggregated Infrastructure that specializes hardware for unique and shared data.

Result: Demonstrates throughput increase of up to 538.7x over baselines in workloads with high context sharing.

Conclusion: Provides a clear architectural path toward scalable LLM inference by addressing KV cache bottlenecks through context data heterogeneity exploitation.

Abstract: The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7x over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference.

[700] Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving

Hui Zeng, Daming Zhao, Pengfei Yang, Wenxuan Hou, Tianyang Zheng, Hui Li, Weiye Ji, Jidong Zhai

Main category: cs.LG

TL;DR: Lethe is a dynamic KV cache management framework that reduces memory and latency in long-form generation by adaptively pruning tokens across layers and time using attention patterns.

Details

Motivation: Long decoding sequences in generative reasoning with LLMs cause significant memory and latency overhead from KV caches, which existing compression methods fail to address effectively for dynamic, layer-sensitive generation.

Method: Lethe introduces spatial adaptivity through layerwise sparsity-aware allocation and temporal adaptivity via multi-round token pruning using Recency-Aware Selective Retention (RASR), which considers both recency and token relevance from attention patterns.

Result: Lethe achieves up to 2.56x throughput increase while maintaining generation quality across diverse models and tasks, demonstrating an effective balance between efficiency and performance.

Conclusion: The proposed dynamic KV cache management framework successfully addresses the challenges of long-form generation by adaptively managing cache resources along spatial and temporal dimensions, offering significant efficiency gains without compromising output quality.

Abstract: Generative reasoning with large language models (LLMs) often involves long decoding sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV compression methods primarily focus on reducing prefill memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding. Along the spatial dimension, Lethe performs layerwise sparsity-aware allocation, assigning token pruning budgets to each transformer layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention} (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, increases throughput by up to 2.56x.

[701] ITPP: Learning Disentangled Event Dynamics in Marked Temporal Point Processes

Wang-Tao Zhou, Zhao Kang, Ke Yan, Ling Tian

Main category: cs.LG

TL;DR: ITPP is a channel-independent MTPP architecture that decouples event type information using an encoder-decoder framework with ODE-based backbone and type-aware inverted self-attention to model inter-channel correlations.

Details

Motivation: Existing MTPP models use channel-mixing strategies that encode different event types into a single latent representation, which can obscure type-specific dynamics, cause performance degradation, and increase overfitting risk.

Method: Proposed ITPP uses a channel-independent architecture with encoder-decoder framework, ODE-based backbone, and type-aware inverted self-attention mechanism to explicitly model inter-channel correlations among heterogeneous event types.

Result: Comprehensive experiments on multiple real-world and synthetic datasets show ITPP consistently outperforms state-of-the-art MTPP models in both predictive accuracy and generalization.

Conclusion: ITPP enhances effectiveness and robustness while reducing overfitting by decoupling event type information and explicitly modeling inter-channel correlations through its novel architecture.

Abstract: Marked Temporal Point Processes (MTPPs) provide a principled framework for modeling asynchronous event sequences by conditioning on the history of past events. However, most existing MTPP models rely on channel-mixing strategies that encode information from different event types into a single, fixed-size latent representation. This entanglement can obscure type-specific dynamics, leading to performance degradation and increased risk of overfitting. In this work, we introduce ITPP, a novel channel-independent architecture for MTPP modeling that decouples event type information using an encoder-decoder framework with an ODE-based backbone. Central to ITPP is a type-aware inverted self-attention mechanism, designed to explicitly model inter-channel correlations among heterogeneous event types. This architecture enhances effectiveness and robustness while reducing overfitting. Comprehensive experiments on multiple real-world and synthetic datasets demonstrate that ITPP consistently outperforms state-of-the-art MTPP models in both predictive accuracy and generalization.

[702] Advancing Ocean State Estimation with efficient and scalable AI

Yanfei Xiang, Yuan Gao, Hao Wu, Quan Zhang, Ruiqi Shu, Xiao Zhou, Xi Wu, Xiaomeng Huang

Main category: cs.LG

TL;DR: ADAF-Ocean is an AI-driven data assimilation framework that directly assimilates multi-source ocean observations without interpolation, using neural processes to learn continuous mappings and AI super-resolution to reconstruct mesoscale dynamics efficiently.

Details

Motivation: Traditional ocean state estimation faces computational scalability and data fidelity challenges in data assimilation and deep learning approaches.

Method: Uses Neural Processes to learn continuous mapping from heterogeneous inputs, AI-driven super-resolution to reconstruct 0.25° mesoscale dynamics from 1° coarse fields, and direct assimilation of multi-source observations without interpolation.

Result: Achieves computational efficiency with only 3.7% more parameters than 1° configuration, and extends global forecast skill by up to 20 days when coupled with DL forecasting system.

Conclusion: Establishes a computationally viable pathway for real-time, high-resolution Earth system monitoring by preserving native data fidelity and ensuring scalability.

Abstract: Accurate and efficient global ocean state estimation remains a grand challenge for Earth system science, hindered by the dual bottlenecks of computational scalability and degraded data fidelity in traditional data assimilation (DA) and deep learning (DL) approaches. Here we present an AI-driven Data Assimilation Framework for Ocean (ADAF-Ocean) that directly assimilates multi-source and multi-scale observations, ranging from sparse in-situ measurements to 4 km satellite swaths, without any interpolation or data thinning. Inspired by Neural Processes, ADAF-Ocean learns a continuous mapping from heterogeneous inputs to ocean states, preserving native data fidelity. Through AI-driven super-resolution, it reconstructs 0.25$^\circ$ mesoscale dynamics from coarse 1$^\circ$ fields, which ensures both efficiency and scalability, with just 3.7% more parameters than the 1$^\circ$ configuration. When coupled with a DL forecasting system, ADAF-Ocean extends global forecast skill by up to 20 days compared to baselines without assimilation. This framework establishes a computationally viable and scientifically rigorous pathway toward real-time, high-resolution Earth system monitoring.

[703] Physics-Informed Design of Input Convex Neural Networks for Consistency Optimal Transport Flow Matching

Fanghui Song, Zhongjian Wang, Jiebao Sun

Main category: cs.LG

TL;DR: Proposes a consistency model using optimal-transport flow with physics-informed PICNNs, combining HJ residual with flow matching to avoid inner optimization, supporting both one-step and multi-step sampling.

Details

Motivation: To develop a more efficient optimal transport framework that avoids complex inner optimization problems present in previous one-step flow matching approaches while maintaining flexibility.

Method: Uses physics-informed partially input-convex neural networks (PICNN) to construct flow fields emulating displacement interpolation, coupling Hamilton-Jacobi residual with flow matching loss during training.

Result: The approach successfully scales on standard OT benchmarks, supporting both one-step (Brenier-map) and multi-step ODE sampling from the same learned potential due to OT flow straightness.

Conclusion: The proposed consistency model provides an efficient and flexible optimal transport framework that eliminates inner optimization subproblems while maintaining sampling versatility.

Abstract: We propose a consistency model based on the optimal-transport flow. A physics-informed design of partially input-convex neural networks (PICNN) plays a central role in constructing the flow field that emulates the displacement interpolation. During the training stage, we couple the Hamilton-Jacobi (HJ) residual in the OT formulation with the original flow matching loss function. Our approach avoids inner optimization subproblems that are present in previous one-step OFM approaches. During the prediction stage, our approach supports both one-step (Brenier-map) and multi-step ODE sampling from the same learned potential, leveraging the straightness of the OT flow. We validate scalability and performance on standard OT benchmarks.

[704] How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Efficiency and Parallel Computing Strategy

Hanwen Liu, Yixuan Ma, Shi Jin, Yuguang Wang

Main category: cs.LG

TL;DR: Random Batch Attention (RBA) is a linear self-attention mechanism that maintains expressivity while reducing quadratic complexity to linear, with theoretical convergence guarantees and parallel implementation benefits.

Details

Motivation: The quadratic complexity of standard attention mechanisms limits their practicality, and existing sparse attention methods lack theoretical analysis about expressivity preservation.

Method: Proposed Random Batch Attention (RBA) based on Random Batch Methods from computational mathematics, which processes attention in linear time complexity with parallel implementation on a new dimension.

Result: Experiments on large graphs demonstrate RBA’s advantages: linear complexity, memory efficiency, and ability to improve existing models by replacing their attention mechanisms while maintaining expressivity.

Conclusion: RBA provides a theoretically grounded linear attention alternative with maintained expressivity, and the theoretical modeling approach offers a new tool for future attention mechanism analysis.

Abstract: Attention mechanism is a significant part of Transformer models. It helps extract features from embedded vectors by adding global information and its expressivity has been proved to be powerful. Nevertheless, the quadratic complexity restricts its practicability. Although several researches have provided attention mechanism in sparse form, they are lack of theoretical analysis about the expressivity of their mechanism while reducing complexity. In this paper, we put forward Random Batch Attention (RBA), a linear self-attention mechanism, which has theoretical support of the ability to maintain its expressivity. Random Batch Attention has several significant strengths as follows: (1) Random Batch Attention has linear time complexity. Other than this, it can be implemented in parallel on a new dimension, which contributes to much memory saving. (2) Random Batch Attention mechanism can improve most of the existing models by replacing their attention mechanisms, even many previously improved attention mechanisms. (3) Random Batch Attention mechanism has theoretical explanation in convergence, as it comes from Random Batch Methods on computation mathematics. Experiments on large graphs have proved advantages mentioned above. Also, the theoretical modeling of self-attention mechanism is a new tool for future research on attention-mechanism analysis.

[705] Function Based Isolation Forest (FuBIF): A Unifying Framework for Interpretable Isolation-Based Anomaly Detection

Alessio Arcudi, Alessandro Ferreri, Francesco Borsatti, Gian Antonio Susto

Main category: cs.LG

TL;DR: Introduces Function-based Isolation Forest (FuBIF) as a generalization of Isolation Forest that uses real-valued functions for dataset branching, improving flexibility and interpretability in anomaly detection.

Details

Motivation: Isolation Forest has limitations in adaptability and biases. The paper aims to enhance its flexibility and interpretability for better anomaly detection in complex datasets.

Method: Proposes FuBIF which enables real-valued functions for dataset branching, and FuBIFFI algorithm for feature importance scoring across possible FuBIF models.

Result: FuBIF significantly enhances the flexibility of evaluation tree construction and provides improved interpretability through feature importance scores.

Conclusion: FuBIF represents a theoretical advancement in Isolation Forest methodology, with open-source implementation provided for reproducibility and further research.

Abstract: Anomaly Detection (AD) is evolving through algorithms capable of identifying outliers in complex datasets. The Isolation Forest (IF), a pivotal AD technique, exhibits adaptability limitations and biases. This paper introduces the Function-based Isolation Forest (FuBIF), a generalization of IF that enables the use of real-valued functions for dataset branching, significantly enhancing the flexibility of evaluation tree construction. Complementing this, the FuBIF Feature Importance (FuBIFFI) algorithm extends the interpretability in IF-based approaches by providing feature importance scores across possible FuBIF models. This paper details the operational framework of FuBIF, evaluates its performance against established methods, and explores its theoretical contributions. An open-source implementation is provided to encourage further research and ensure reproducibility.

[706] CatBack: Universal Backdoor Attacks on Tabular Data via Categorical Encoding

Behrad Tajalli, Stefanos Koffas, Stjepan Picek

Main category: cs.LG

TL;DR: Novel backdoor attack on tabular data that converts categorical values to floating-point representations, enabling gradient-based universal perturbations across all features with high attack success rates while remaining stealthy against defenses.

Details

Motivation: Most backdoor attack research focuses on homogeneous data like images, leaving tabular data vulnerable despite its prevalence in real-world applications with mixed numerical and categorical features.

Method: Convert categorical values into floating-point representations to preserve information while enabling gradient-based universal perturbations that apply to all features, including categorical ones.

Result: Achieves up to 100% attack success rate on five datasets and four models in both white-box and black-box settings, successfully bypassing Vertex AI and defeating state-of-the-art defense mechanisms like Spectral Signatures and Neural Cleanse.

Conclusion: Tabular data is severely vulnerable to backdoor attacks, and the proposed method demonstrates superior performance over previous approaches like Tabdoor while remaining undetectable by current defense mechanisms.

Abstract: Backdoor attacks in machine learning have drawn significant attention for their potential to compromise models stealthily, yet most research has focused on homogeneous data such as images. In this work, we propose a novel backdoor attack on tabular data, which is particularly challenging due to the presence of both numerical and categorical features. Our key idea is a novel technique to convert categorical values into floating-point representations. This approach preserves enough information to maintain clean-model accuracy compared to traditional methods like one-hot or ordinal encoding. By doing this, we create a gradient-based universal perturbation that applies to all features, including categorical ones. We evaluate our method on five datasets and four popular models. Our results show up to a 100% attack success rate in both white-box and black-box settings (including real-world applications like Vertex AI), revealing a severe vulnerability for tabular data. Our method is shown to surpass the previous works like Tabdoor in terms of performance, while remaining stealthy against state-of-the-art defense mechanisms. We evaluate our attack against Spectral Signatures, Neural Cleanse, Beatrix, and Fine-Pruning, all of which fail to defend successfully against it. We also verify that our attack successfully bypasses popular outlier detection mechanisms.

[707] Make It Long, Keep It Fast: End-to-End 10k-Sequence Modeling at Billion Scale on Douyin

Lin Guan, Jia-Qi Yang, Zhishan Zhao, Beichuan Zhang, Bo Sun, Xuanyuan Luo, Jinan Ni, Xiaowen Li, Yuhang Qi, Zhifang Fan, Hangyu Wang, Qiwei Chen, Yi Cheng, Feng Zhang, Xiao Yang

Main category: cs.LG

TL;DR: A system that scales long-sequence modeling to 10k-length user histories for video recommendation, using stacked cross-attention, user-centric batching, and length-extrapolative training to achieve production deployment.

Details

Motivation: Short-video recommenders need to process extremely long user histories efficiently without exceeding latency or cost budgets in production environments.

Method: Three key innovations: 1) Stacked Target-to-History Cross Attention (STCA) reduces complexity from quadratic to linear; 2) Request Level Batching (RLB) aggregates targets per user to share encoding; 3) Length-extrapolative training on shorter windows for inference on longer sequences.

Result: Successfully deployed on Douyin with significant improvements in key engagement metrics while meeting production latency requirements, demonstrating predictable scaling gains with history length and model capacity.

Conclusion: Provides a practical path to scaling end-to-end long-sequence recommendation to the 10k regime, showing scaling law behavior similar to large language models.

Abstract: Short-video recommenders such as Douyin must exploit extremely long user histories without breaking latency or cost budgets. We present an end-to-end system that scales long-sequence modeling to 10k-length histories in production. First, we introduce Stacked Target-to-History Cross Attention (STCA), which replaces history self-attention with stacked cross-attention from the target to the history, reducing complexity from quadratic to linear in sequence length and enabling efficient end-to-end training. Second, we propose Request Level Batching (RLB), a user-centric batching scheme that aggregates multiple targets for the same user/request to share the user-side encoding, substantially lowering sequence-related storage, communication, and compute without changing the learning objective. Third, we design a length-extrapolative training strategy – train on shorter windows, infer on much longer ones – so the model generalizes to 10k histories without additional training cost. Across offline and online experiments, we observe predictable, monotonic gains as we scale history length and model capacity, mirroring the scaling law behavior observed in large language models. Deployed at full traffic on Douyin, our system delivers significant improvements on key engagement metrics while meeting production latency, demonstrating a practical path to scaling end-to-end long-sequence recommendation to the 10k regime.

[708] Event-driven physics-informed operator learning for reliability analysis

Shailesh Garg, Souvik Chakraborty

Main category: cs.LG

TL;DR: NeuroPOL is a neuroscience-inspired physics-informed operator learning framework that uses variable spiking neurons for energy-efficient reliability analysis of engineering systems under uncertainty.

Details

Motivation: Traditional surrogate modeling approaches for reliability analysis suffer from high energy consumption, limiting scalability and deployability in resource-constrained environments.

Method: Incorporates Variable Spiking Neurons into physics-informed operator architecture, replacing continuous activations with event-driven spiking dynamics to promote sparse communication and reduce computational load.

Result: Achieves reliability measures comparable to standard physics-informed operators while introducing significant communication sparsity, enabling scalable, distributed, and energy-efficient deployment.

Conclusion: NeuroPOL successfully lowers both computational and power demands, supporting real-time reliability assessment and deployment on edge devices and digital twins for high-dimensional problems.

Abstract: Reliability analysis of engineering systems under uncertainty poses significant computational challenges, particularly for problems involving high-dimensional stochastic inputs, nonlinear system responses, and multiphysics couplings. Traditional surrogate modeling approaches often incur high energy consumption, which severely limits their scalability and deployability in resource-constrained environments. We introduce NeuroPOL, \textit{the first neuroscience-inspired physics-informed operator learning framework} for reliability analysis. NeuroPOL incorporates Variable Spiking Neurons into a physics-informed operator architecture, replacing continuous activations with event-driven spiking dynamics. This innovation promotes sparse communication, significantly reduces computational load, and enables an energy-efficient surrogate model. The proposed framework lowers both computational and power demands, supporting real-time reliability assessment and deployment on edge devices and digital twins. By embedding governing physical laws into operator learning, NeuroPOL builds physics-consistent surrogates capable of accurate uncertainty propagation and efficient failure probability estimation, even for high-dimensional problems. We evaluate NeuroPOL on five canonical benchmarks, the Burgers equation, Nagumo equation, two-dimensional Poisson equation, two-dimensional Darcy equation, and incompressible Navier-Stokes equation with energy coupling. Results show that NeuroPOL achieves reliability measures comparable to standard physics-informed operators, while introducing significant communication sparsity, enabling scalable, distributed, and energy-efficient deployment.

[709] Approximating Shapley Explanations in Reinforcement Learning

Daniel Beechey, Özgür Şimşek

Main category: cs.LG

TL;DR: FastSVERL is a scalable method for approximating Shapley values to explain reinforcement learning, addressing computational challenges while handling temporal dependencies, off-policy data, and evolving agent behaviors.

Details

Motivation: Reinforcement learning lacks transparency, limiting deployment in safety-critical settings, and Shapley value explanations are computationally expensive.

Method: FastSVERL approximates Shapley values with a scalable approach that handles temporal dependencies across multi-step trajectories, off-policy data learning, and real-time adaptation to evolving agent behaviors.

Result: The method provides a practical and scalable approach for principled interpretability in reinforcement learning.

Conclusion: FastSVERL enables rigorous and scalable interpretability for reinforcement learning systems, overcoming computational barriers to Shapley value explanations.

Abstract: Reinforcement learning has achieved remarkable success in complex decision-making environments, yet its lack of transparency limits its deployment in practice, especially in safety-critical settings. Shapley values from cooperative game theory provide a principled framework for explaining reinforcement learning; however, the computational cost of Shapley explanations is an obstacle to their use. We introduce FastSVERL, a scalable method for explaining reinforcement learning by approximating Shapley values. FastSVERL is designed to handle the unique challenges of reinforcement learning, including temporal dependencies across multi-step trajectories, learning from off-policy data, and adapting to evolving agent behaviours in real time. FastSVERL introduces a practical, scalable approach for principled and rigorous interpretability in reinforcement learning.

[710] Adapting Web Agents with Synthetic Supervision

Zhaoyang Wang, Yiming Liang, Xuchao Zhang, Qianhui Wu, Siwei Han, Anson Bastos, Rujia Wang, Chetan Bansal, Baolin Peng, Jianfeng Gao, Saravan Rajmohan, Huaxiu Yao

Main category: cs.LG

TL;DR: SynthAgent improves web agent adaptation to new websites through dual refinement of synthetic tasks and trajectories, outperforming existing methods by addressing data quality issues like hallucinations and noise.

Details

Motivation: Web agents struggle to adapt to new websites due to scarce environment-specific tasks and demonstrations, while existing synthetic data methods suffer from quality issues like hallucinations and noisy trajectories.

Method: Proposes SynthAgent with dual refinement: synthesizes diverse tasks through categorized web element exploration, refines tasks during trajectory collection when conflicts arise, and refines trajectories with global context to reduce noise and misalignments.

Result: Experimental results show SynthAgent outperforms existing synthetic data methods, validating the importance of high-quality synthetic supervision for web agent adaptation.

Conclusion: SynthAgent effectively improves synthetic data quality through dual task and trajectory refinement, enabling better web agent adaptation to new websites.

Abstract: Web agents struggle to adapt to new websites due to the scarcity of environment specific tasks and demonstrations. Recent works have explored synthetic data generation to address this challenge, however, they suffer from data quality issues where synthesized tasks contain hallucinations that cannot be executed, and collected trajectories are noisy with redundant or misaligned actions. In this paper, we propose SynthAgent, a fully synthetic supervision framework that aims at improving synthetic data quality via dual refinement of both tasks and trajectories. Our approach begins by synthesizing diverse tasks through categorized exploration of web elements, ensuring efficient coverage of the target environment. During trajectory collection, we refine tasks when conflicts with actual observations are detected, mitigating hallucinations while maintaining task consistency. After collection, we conduct trajectory refinement with a global context to mitigate potential noise or misalignments. Finally, we fine-tune open-source web agents on the refined synthetic data to adapt them to the target environment. Experimental results demonstrate that SynthAgent outperforms existing synthetic data methods, validating the importance of high-quality synthetic supervision. The code will be publicly available at https://github.com/aiming-lab/SynthAgent.

[711] Guardian-regularized Safe Offline Reinforcement Learning for Smart Weaning of Mechanical Circulatory Devices

Aysin Tumay, Sophia Sun, Sonia Fereidooni, Aaron Dumas, Elise Jortberg, Rose Yu

Main category: cs.LG

TL;DR: An offline reinforcement learning framework for automated weaning of mechanical circulatory support devices in cardiogenic shock patients, addressing challenges like no online interaction, uncertain dynamics, and limited data.

Details

Motivation: Current MCS weaning strategies vary significantly across care teams and lack data-driven approaches, while offline RL faces challenges in medical settings with prohibitions on patient interaction and uncertain dynamics.

Method: Developed CORMPO - a density-regularized offline RL algorithm with out-of-distribution suppression and clinically-informed reward shaping, plus a Transformer-based probabilistic digital twin for modeling MCS circulatory dynamics.

Result: CORMPO achieves 28% higher reward than offline RL baselines and 82.6% higher scores in clinical metrics on real and synthetic datasets, with proven theoretical performance guarantees.

Conclusion: The approach provides a principled framework for safe offline policy learning in high-stakes medical applications where domain expertise and safety constraints are essential.

Abstract: We study the sequential decision-making problem for automated weaning of mechanical circulatory support (MCS) devices in cardiogenic shock patients. MCS devices are percutaneous micro-axial flow pumps that provide left ventricular unloading and forward blood flow, but current weaning strategies vary significantly across care teams and lack data-driven approaches. Offline reinforcement learning (RL) has proven to be successful in sequential decision-making tasks, but our setting presents challenges for training and evaluating traditional offline RL methods: prohibition of online patient interaction, highly uncertain circulatory dynamics due to concurrent treatments, and limited data availability. We developed an end-to-end machine learning framework with two key contributions (1) Clinically-aware OOD-regularized Model-based Policy Optimization (CORMPO), a density-regularized offline RL algorithm for out-of-distribution suppression that also incorporates clinically-informed reward shaping and (2) a Transformer-based probabilistic digital twin that models MCS circulatory dynamics for policy evaluation with rich physiological and clinical metrics. We prove that \textsf{CORMPO} achieves theoretical performance guarantees under mild assumptions. CORMPO attains a higher reward than the offline RL baselines by 28% and higher scores in clinical metrics by 82.6% on real and synthetic datasets. Our approach offers a principled framework for safe offline policy learning in high-stakes medical applications where domain expertise and safety constraints are essential.

[712] On the Convergence and Stability of Distributed Sub-model Training

Yuyang Deng, Fuli Qiao, Mehrdad Mahdavi

Main category: cs.LG

TL;DR: Proposes distributed shuffled sub-model training for federated learning, where pre-partitioned sub-models are shuffled and distributed to clients, improving convergence and generalization compared to random sampling.

Details

Motivation: Enabling on-device local training for large models in federated learning, addressing poor convergence performance from random sub-model sampling.

Method: Partition full model into sub-models in advance, shuffle and distribute them to clients each round, perform local updates, then average updated sub-models at server.

Result: Established convergence rate, showed improved generalization via stability analysis, and validated findings through extensive experiments.

Conclusion: Distributed shuffled sub-model training effectively improves convergence and generalization in federated learning compared to random sub-model sampling approaches.

Abstract: As learning models continue to grow in size, enabling on-device local training of these models has emerged as a critical challenge in federated learning. A popular solution is sub-model training, where the server only distributes randomly sampled sub-models to the edge clients, and clients only update these small models. However, those random sampling of sub-models may not give satisfying convergence performance. In this paper, observing the success of SGD with shuffling, we propose a distributed shuffled sub-model training, where the full model is partitioned into several sub-models in advance, and the server shuffles those sub-models, sends each of them to clients at each round, and by the end of local updating period, clients send back the updated sub-models, and server averages them. We establish the convergence rate of this algorithm. We also study the generalization of distributed sub-model training via stability analysis, and find that the sub-model training can improve the generalization via amplifying the stability of training process. The extensive experiments also validate our theoretical findings.

[713] Enhancing Robustness of Graph Neural Networks through p-Laplacian

Anuj Kumar Sirohi, Subhanu Halder, Kabir Kumar, Sandeep Kumar

Main category: cs.LG

TL;DR: pLAPGNN is a computationally efficient framework using weighted p-Laplacian to make Graph Neural Networks robust against adversarial attacks.

Details

Motivation: Graph Neural Networks are vulnerable to adversarial attacks during training and testing, and existing robustness methods are computationally demanding and ineffective against strong attacks.

Method: Proposed pLAPGNN framework based on weighted p-Laplacian to enhance GNN robustness against adversarial attacks.

Result: Empirical evaluation on real datasets shows the proposed method is both effective and efficient.

Conclusion: pLAPGNN provides a computationally efficient solution for making GNNs robust to adversarial attacks, addressing limitations of existing methods.

Abstract: With the increase of data in day-to-day life, businesses and different stakeholders need to analyze the data for better predictions. Traditionally, relational data has been a source of various insights, but with the increase in computational power and the need to understand deeper relationships between entities, the need to design new techniques has arisen. For this graph data analysis has become an extraordinary tool for understanding the data, which reveals more realistic and flexible modelling of complex relationships. Recently, Graph Neural Networks (GNNs) have shown great promise in various applications, such as social network analysis, recommendation systems, drug discovery, and more. However, many adversarial attacks can happen over the data, whether during training (poisoning attack) or during testing (evasion attack), which can adversely manipulate the desired outcome from the GNN model. Therefore, it is crucial to make the GNNs robust to such attacks. The existing robustness methods are computationally demanding and perform poorly when the intensity of attack increases. This paper presents a computationally efficient framework, namely, pLAPGNN, based on weighted p-Laplacian for making GNNs robust. Empirical evaluation on real datasets establishes the efficacy and efficiency of the proposed method.

[714] Models Got Talent: Identifying High Performing Wearable Human Activity Recognition Models Without Training

Richard Goldman, Varun Komperla, Thomas Ploetz, Harish Haresamudram

Main category: cs.LG

TL;DR: Zero Cost Proxies (ZCPs) enable efficient neural architecture search for Human Activity Recognition, achieving within 5% of full training performance with minimal computation.

Details

Motivation: To reduce the computational expense of Neural Architecture Search (NAS) for Human Activity Recognition (HAR) by leveraging Zero Cost Proxies that can estimate architecture performance with minimal training.

Method: Using Zero Cost Proxies (ZCPs) that correlate well with trained performance and can be computed through a single forward/backward pass on randomly sampled data, applied to six HAR benchmark datasets.

Result: ZCPs discovered architectures that obtained within 5% performance of full-scale training involving 1500 randomly sampled architectures, with substantial computational savings and robustness to data noise.

Conclusion: ZCPs are effective and practical for HAR, offering computational efficiency while maintaining performance and demonstrating robustness in real-world scenarios.

Abstract: A promising alternative to the computationally expensive Neural Architecture Search (NAS) involves the development of \textit{Zero Cost Proxies (ZCPs)}, which correlate well to trained performance, but can be computed through a single forward/backward pass on a randomly sampled batch of data. In this paper, we investigate the effectiveness of ZCPs for HAR on six benchmark datasets, and demonstrate that they discover network architectures that obtain within 5% of performance attained by full scale training involving 1500 randomly sampled architectures. This results in substantial computational savings as high performing architectures can be discovered with minimal training. Our experiments not only introduce ZCPs to sensor-based HAR, but also demonstrate that they are robust to data noise, further showcasing their suitability for practical scenarios.

[715] LLM Attention Transplant for Transfer Learning of Tabular Data Across Disparate Domains

Ibna Kowsar, Kazi F. Akhter, Manar D. Samad

Main category: cs.LG

TL;DR: LATTLE: A lightweight transfer learning framework that transplants LLM attention weights into a tabular transformer for cross-domain knowledge transfer without requiring shared features or large pretrained models.

Details

Motivation: Traditional deep learning struggles with tabular data transfer due to feature heterogeneity, and LLMs have limitations with mixed data types in tables. There's a need for effective transfer learning that doesn't require shared features or large-scale pretraining.

Method: Fine-tune LLM on source tabular data, transplant selective key-value projection weights into gated feature tokenized transformer (gFTT), then fine-tune gFTT with cross-domain attention on target tabular data.

Result: Superior performance over 12 baselines across 10 source-target dataset pairs, outperforming traditional ML models, state-of-the-art deep tabular architectures, and transfer learning models trained on thousands to billions of samples.

Conclusion: LLM attention transfer provides an effective solution for learning relationships between data tables in low-resource environments, eliminating need for shared features, prompt engineering, and large pretrained models.

Abstract: Transfer learning of tabular data is non-trivial due to heterogeneity in the feature space across disparate domains. The limited success of traditional deep learning in tabular knowledge transfer can be advanced by leveraging large language models (LLMs). However, the efficacy of LLMs often stagnates for mixed data types structured in tables due to the limitations of text prompts and in-context learning. We propose a lightweight transfer learning framework that fine-tunes an LLM using source tabular data and transplants the LLM’s selective $key$ and $value$ projection weights into a gated feature tokenized transformer (gFTT) built for tabular data. The gFTT model with cross-domain attention is fine-tuned using target tabular data for transfer learning, eliminating the need for shared features, LLM prompt engineering, and large-scale pretrained models. Our experiments using ten pairs of source-target data sets and 12 baselines demonstrate the superiority of the proposed LLM-attention transplant for transfer learning (LATTLE) method over traditional ML models, state-of-the-art deep tabular architectures, and transfer learning models trained on thousands to billions of tabular samples. The proposed attention transfer demonstrates an effective solution to learning relationships between data tables using an LLM in a low-resource learning environment. The source code for the proposed method is publicly available.

[716] Learning Gaussian DAG Models without Condition Number Bounds

Constantinos Daskalakis, Vardis Kandiros, Rui Yao

Main category: cs.LG

TL;DR: The paper presents an algorithm for learning directed Gaussian Graphical Models that achieves sample complexity independent of the condition number, addressing a key limitation of prior work.

Details

Motivation: Prior algorithms for learning directed Gaussian Graphical Models require samples that grow polynomially with the condition number, which can be impractical in high-dimensional settings where the condition number grows with n.

Method: The authors develop a novel algorithm that recovers the underlying graph structure under the equal-variance assumption, with sample complexity that doesn’t depend on the condition number. They also provide a polynomial-time algorithm under additional variance assumptions.

Result: The algorithm achieves sample complexity of O(d log n) that is independent of the condition number, with nearly matching lower bounds. Simulations confirm the theoretical predictions.

Conclusion: This work provides an almost tight characterization of the sample complexity for learning directed Gaussian Graphical Models, overcoming the condition number dependence that plagued prior approaches.

Abstract: We study the problem of learning the topology of a directed Gaussian Graphical Model under the equal-variance assumption, where the graph has $n$ nodes and maximum in-degree $d$. Prior work has established that $O(d \log n)$ samples are sufficient for this task. However, an important factor that is often overlooked in these analyses is the dependence on the condition number of the covariance matrix of the model. Indeed, all algorithms from prior work require a number of samples that grows polynomially with this condition number. In many cases this is unsatisfactory, since the condition number could grow polynomially with $n$, rendering these prior approaches impractical in high-dimensional settings. In this work, we provide an algorithm that recovers the underlying graph and prove that the number of samples required is independent of the condition number. Furthermore, we establish lower bounds that nearly match the upper bound up to a $d$-factor, thus providing an almost tight characterization of the true sample complexity of the problem. Moreover, under a further assumption that all the variances of the variables are bounded, we design a polynomial-time algorithm that recovers the underlying graph, at the cost of an additional polynomial dependence of the sample complexity on $d$. We complement our theoretical findings with simulations on synthetic datasets that confirm our predictions.

[717] Local K-Similarity Constraint for Federated Learning with Label Noise

Sanskar Amgain, Prashant Shrestha, Bidur Khanal, Alina Devkota, Yash Raj Shrestha, Seungryul Baek, Prashnna Gyawali, Binod Bhattarai

Main category: cs.LG

TL;DR: Proposes a local regularization method for federated learning with noisy labels that enforces similarity between close data points using self-supervised representations, without requiring shared architecture between pretrained and classification models.

Details

Motivation: Existing federated learning methods assume sufficient clean clients, but fail when many heterogeneous clients have noisy labels. Current approaches using pre-trained models are impractical due to communication costs.

Method: Uses a regularization objective that decouples pretrained and classification models by enforcing similarity between close data points within clients, leveraging self-supervised representations to evaluate closeness.

Result: Significantly improves performance in noisy federated settings, outperforming state-of-the-art methods on multiple computer vision and medical image classification benchmarks.

Conclusion: The proposed architecture-agnostic regularization method effectively handles noisy clients in federated learning without requiring shared model architectures, making it practical for real-world deployment.

Abstract: Federated learning on clients with noisy labels is a challenging problem, as such clients can infiltrate the global model, impacting the overall generalizability of the system. Existing methods proposed to handle noisy clients assume that a sufficient number of clients with clean labels are available, which can be leveraged to learn a robust global model while dampening the impact of noisy clients. This assumption fails when a high number of heterogeneous clients contain noisy labels, making the existing approaches ineffective. In such scenarios, it is important to locally regularize the clients before communication with the global model, to ensure the global model isn’t corrupted by noisy clients. While pre-trained self-supervised models can be effective for local regularization, existing centralized approaches relying on pretrained initialization are impractical in a federated setting due to the potentially large size of these models, which increases communication costs. In that line, we propose a regularization objective for client models that decouples the pre-trained and classification models by enforcing similarity between close data points within the client. We leverage the representation space of a self-supervised pretrained model to evaluate the closeness among examples. This regularization, when applied with the standard objective function for the downstream task in standard noisy federated settings, significantly improves performance, outperforming existing state-of-the-art federated methods in multiple computer vision and medical image classification benchmarks. Unlike other techniques that rely on self-supervised pretrained initialization, our method does not require the pretrained model and classifier backbone to share the same architecture, making it architecture-agnostic.

[718] Resilience Inference for Supply Chains with Hypergraph Neural Network

Zetian Shen, Hongjun Wang, Jiyuan Chen, Xuan Song

Main category: cs.LG

TL;DR: The paper introduces SC-RIHN, a hypergraph-based model for predicting supply chain resilience using network topology and inventory data without explicit dynamics, outperforming existing methods.

Details

Motivation: Existing approaches lack mechanisms to infer supply chain resilience without explicit system dynamics and struggle to represent higher-order, multi-entity dependencies in supply chain networks.

Method: Proposes SC-RIHN, a hypergraph-based model using set-based encoding and hypergraph message passing to capture multi-party firm-product interactions in supply chain networks.

Result: SC-RIHN significantly outperforms traditional MLP, graph neural network variants, and ResInf baselines across synthetic benchmarks.

Conclusion: The model shows potential for practical, early-warning risk assessment in complex supply chain systems by accurately inferring resilience from network topology and inventory data.

Abstract: Supply chains are integral to global economic stability, yet disruptions can swiftly propagate through interconnected networks, resulting in substantial economic impacts. Accurate and timely inference of supply chain resilience the capability to maintain core functions during disruptions is crucial for proactive risk mitigation and robust network design. However, existing approaches lack effective mechanisms to infer supply chain resilience without explicit system dynamics and struggle to represent the higher-order, multi-entity dependencies inherent in supply chain networks. These limitations motivate the definition of a novel problem and the development of targeted modeling solutions. To address these challenges, we formalize a novel problem: Supply Chain Resilience Inference (SCRI), defined as predicting supply chain resilience using hypergraph topology and observed inventory trajectories without explicit dynamic equations. To solve this problem, we propose the Supply Chain Resilience Inference Hypergraph Network (SC-RIHN), a novel hypergraph-based model leveraging set-based encoding and hypergraph message passing to capture multi-party firm-product interactions. Comprehensive experiments demonstrate that SC-RIHN significantly outperforms traditional MLP, representative graph neural network variants, and ResInf baselines across synthetic benchmarks, underscoring its potential for practical, early-warning risk assessment in complex supply chain systems.

[719] Sparse Linear Regression is Easy on Random Supports

Gautam Chandrasekaran, Raghu Meka, Konstantinos Stavropoulos

Main category: cs.LG

TL;DR: The paper presents a breakthrough in sparse linear regression, showing that for worst-case design matrices with random support patterns, prediction error ε can be achieved with poly(k, log d, 1/ε) samples and poly(d,N) runtime, overcoming previous computational barriers.

Details

Motivation: There exists an exponential gap between information-theoretically optimal sample complexity (O(k log d/ε)) and computationally efficient methods (requiring O(d) samples). Current methods either need exponential runtime d^Ω(k) for optimal samples or polynomial runtime but suboptimal O(d) samples.

Method: The approach works for any design matrix X with bounded condition number (up to 2^poly(d)) and assumes the support of the unknown sparse vector w* is chosen at random. This allows achieving prediction error ε with poly(k, log d, 1/ε) samples and poly(d,N) runtime.

Result: First generic positive result for worst-case design matrices: achieves prediction error ε with poly(k, log d, 1/ε) samples and poly(d,N) runtime, bridging the computational-statistical gap for sparse regression.

Conclusion: The work demonstrates that computational efficiency can be achieved for sparse linear regression with near-optimal sample complexity, even for worst-case design matrices, by leveraging randomness in the support pattern of the sparse signal.

Abstract: Sparse linear regression is one of the most basic questions in machine learning and statistics. Here, we are given as input a design matrix $X \in \mathbb{R}^{N \times d}$ and measurements or labels ${y} \in \mathbb{R}^N$ where ${y} = {X} {w}^* + ξ$, and $ξ$ is the noise in the measurements. Importantly, we have the additional constraint that the unknown signal vector ${w}^$ is sparse: it has $k$ non-zero entries where $k$ is much smaller than the ambient dimension. Our goal is to output a prediction vector $\widehat{w}$ that has small prediction error: $\frac{1}{N}\cdot |{X} {w}^ - {X} \widehat{w}|^2_2$. Information-theoretically, we know what is best possible in terms of measurements: under most natural noise distributions, we can get prediction error at most $ε$ with roughly $N = O(k \log d/ε)$ samples. Computationally, this currently needs $d^{Ω(k)}$ run-time. Alternately, with $N = O(d)$, we can get polynomial-time. Thus, there is an exponential gap (in the dependence on $d$) between the two and we do not know if it is possible to get $d^{o(k)}$ run-time and $o(d)$ samples. We give the first generic positive result for worst-case design matrices ${X}$: For any ${X}$, we show that if the support of ${w}^$ is chosen at random, we can get prediction error $ε$ with $N = \text{poly}(k, \log d, 1/ε)$ samples and run-time $\text{poly}(d,N)$. This run-time holds for any design matrix ${X}$ with condition number up to $2^{\text{poly}(d)}$. Previously, such results were known for worst-case ${w}^$, but only for random design matrices from well-behaved families, matrices that have a very low condition number ($\text{poly}(\log d)$; e.g., as studied in compressed sensing), or those with special structural properties.

[720] Adaptive Multi-view Graph Contrastive Learning via Fractional-order Neural Diffusion Networks

Yanan Zhao, Feng Ji, Jingyang Dai, Jiaze Ma, Keyue Jiang, Kai Zhao, Wee Peng Tay

Main category: cs.LG

TL;DR: A novel graph contrastive learning framework using fractional-order dynamics to generate multi-scale views without data augmentation, outperforming existing methods.

Details

Motivation: Existing GCL methods use fixed, handcrafted views that limit their ability to capture multi-scale structural patterns in graphs.

Method: Uses fractional-order continuous dynamics with learnable derivative order α to generate a spectrum of views from local to global perspectives without manual augmentations.

Result: Extensive experiments show the method produces more robust and expressive embeddings and outperforms state-of-the-art GCL baselines on standard benchmarks.

Conclusion: The fractional-order approach enables adaptive discovery of informative multi-scale views, generating diverse complementary representations for improved graph learning.

Abstract: Graph contrastive learning (GCL) learns node and graph representations by contrasting multiple views of the same graph. Existing methods typically rely on fixed, handcrafted views-usually a local and a global perspective, which limits their ability to capture multi-scale structural patterns. We present an augmentation-free, multi-view GCL framework grounded in fractional-order continuous dynamics. By varying the fractional derivative order $α\in (0,1]$, our encoders produce a continuous spectrum of views: small $α$ yields localized features, while large $α$ induces broader, global aggregation. We treat $α$ as a learnable parameter so the model can adapt diffusion scales to the data and automatically discover informative views. This principled approach generates diverse, complementary representations without manual augmentations. Extensive experiments on standard benchmarks demonstrate that our method produces more robust and expressive embeddings and outperforms state-of-the-art GCL baselines.

[721] Deep Reinforcement Learning for Dynamic Origin-Destination Matrix Estimation in Microscopic Traffic Simulations Considering Credit Assignment

Donggyu Min, Seongjin Choi, Dong-Kyu Kim

Main category: cs.LG

TL;DR: This paper proposes a deep reinforcement learning framework for dynamic origin-destination matrix estimation in microscopic traffic simulations, achieving 43.2% MSE reduction over conventional methods.

Details

Motivation: The credit assignment problem in DODE makes it challenging to determine which vehicles traverse which links at specific times, creating ambiguous relationships between OD matrices and link flows.

Method: Formulates DODE as a Markov Decision Process and applies model-free deep reinforcement learning where an agent learns optimal policy to sequentially generate OD matrices through simulation interaction.

Result: Experimental validation on Nguyen-Dupuis network using SUMO shows 43.2% reduction in mean squared error compared to best-performing conventional baseline over 30-minute horizon with 5-minute intervals.

Conclusion: The DRL-based approach effectively addresses the credit assignment challenge through learned policy, overcoming limitations of conventional methods and providing a novel framework for microscopic traffic simulation calibration.

Abstract: This paper focuses on dynamic origin-destination matrix estimation (DODE), a crucial calibration process necessary for the effective application of microscopic traffic simulations. The fundamental challenge of the DODE problem in microscopic simulations stems from the complex temporal dynamics and inherent uncertainty of individual vehicle dynamics. This makes it highly challenging to precisely determine which vehicle traverses which link at any given moment, resulting in intricate and often ambiguous relationships between origin-destination (OD) matrices and their contributions to resultant link flows. This phenomenon constitutes the credit assignment problem, a central challenge addressed in this study. We formulate the DODE problem as a Markov Decision Process (MDP) and propose a novel framework that applies model-free deep reinforcement learning (DRL). Within our proposed framework, the agent learns an optimal policy to sequentially generate OD matrices, refining its strategy through direct interaction with the simulation environment. The proposed method is validated on the Nguyen-Dupuis network using SUMO, where its performance is evaluated against ground-truth link flows aggregated at 5-minute intervals over a 30-minute horizon. Experimental results demonstrate that our approach achieves a 43.2% reduction in mean squared error (MSE) compared to the best-performing conventional baseline. By reframing DODE as a sequential decision-making problem, our approach addresses the credit assignment challenge through its learned policy, thereby overcoming the limitations of conventional methods and proposing a novel framework for calibration of microscopic traffic simulations.

[722] Synheart Emotion: Privacy-Preserving On-Device Emotion Recognition from Biosignals

Henok Ademtew, Israel Goytom

Main category: cs.LG

TL;DR: Classical ensemble methods outperform deep learning for on-device emotion recognition from wrist-based PPG signals, with ExtraTrees achieving best performance while enabling privacy-preserving real-time applications through ONNX optimization.

Details

Motivation: Current emotion recognition systems rely on cloud-based inference, creating privacy vulnerabilities and latency issues unsuitable for real-time wearable applications.

Method: Comprehensive evaluation of ML architectures for on-device emotion recognition from wrist-based PPG, comparing classical ensemble methods, deep neural networks, and transformers on the WESAD stress detection dataset.

Result: ExtraTrees achieved F1 = 0.826 on combined features and F1 = 0.623 on wrist-only features, substantially outperforming transformers (F1 = 0.509-0.577). ONNX optimization yielded 4.08 MB footprint, 0.05 ms latency, 152x speedup, 30.5% storage reduction, and 40.1x inference speedup.

Conclusion: Classical ensemble methods are superior for small physiological datasets, enabling feasible privacy-preserving on-device emotion recognition for real-world wearables with minimal latency and storage requirements.

Abstract: Human-computer interaction increasingly demands systems that recognize not only explicit user inputs but also implicit emotional states. While substantial progress has been made in affective computing, most emotion recognition systems rely on cloud-based inference, introducing privacy vulnerabilities and latency constraints unsuitable for real-time applications. This work presents a comprehensive evaluation of machine learning architectures for on-device emotion recognition from wrist-based photoplethysmography (PPG), systematically comparing different models spanning classical ensemble methods, deep neural networks, and transformers on the WESAD stress detection dataset. Results demonstrate that classical ensemble methods substantially outperform deep learning on small physiological datasets, with ExtraTrees achieving F1 = 0.826 on combined features and F1 = 0.623 on wrist-only features, compared to transformers achieving only F1 = 0.509-0.577. We deploy the wrist-only ExtraTrees model optimized via ONNX conversion, achieving a 4.08 MB footprint, 0.05 ms inference latency, and 152x speedup over the original implementation. Furthermore, ONNX optimization yields a 30.5% average storage reduction and 40.1x inference speedup, highlighting the feasibility of privacy-preserving on-device emotion recognition for real-world wearables.

[723] Scaling Laws and In-Context Learning: A Unified Theoretical Framework

Sushant Mehta, Ishan Gupta

Main category: cs.LG

TL;DR: The paper presents a theoretical framework connecting scaling laws to in-context learning emergence in transformers, showing ICL follows power-law relationships with model dimensions and derives optimal depth-width allocations.

Details

Motivation: Despite extensive empirical studies of in-context learning, a principled understanding of how ICL emerges at scale remains elusive, particularly regarding the relationship between model scaling and ICL capabilities.

Method: Developed a unified theoretical framework analyzing ICL emergence through scaling laws, showing transformers implement gradient-based metalearning in forward pass with effective learning rate η_eff = Θ(1/√Ld), and derived optimal depth-width allocations.

Result: Established that ICL performance follows power-law relationships with model depth L, width d, context length k, and training data D, with exponents determined by task structure. Demonstrated sharp phase transitions at critical scales and optimal allocation L* ∝ N^{2/3}, d* ∝ N^{1/3} for fixed parameter budget N = Ld.

Conclusion: The work provides necessary and sufficient conditions for ICL emergence and establishes fundamental computational limits on what transformers can learn in-context, with experimental validation showing measured scaling exponents closely matching theoretical predictions.

Abstract: In-context learning (ICL) enables large language models to adapt to new tasks from demonstrations without parameter updates. Despite extensive empirical studies, a principled understanding of ICL emergence at scale remains more elusive. We present a unified theoretical framework connecting scaling laws to ICL emergence in transformers. Our analysis establishes that ICL performance follows power-law relationships with model depth $L$, width $d$, context length $k$, and training data $D$, with exponents determined by task structure. We show that under specific conditions, transformers implement gradient-based metalearning in their forward pass, with an effective learning rate $η_{\text{eff}} = Θ(1/\sqrt{Ld})$. We demonstrate sharp phase transitions at critical scales and derive optimal depth-width allocations favoring $L^* \propto N^{2/3}$, $d^* \propto N^{1/3}$ for the fixed parameter budget $N = Ld$. Systematic experiments on synthetic tasks validate our predictions, with measured scaling exponents closely matching theory. This work provides both necessary and sufficient conditions for the emergence of ICLs and establishes fundamental computational limits on what transformers can learn in-context.

[724] Mixtures of SubExperts for Large Language Continual Learning

Haeyong Kang

Main category: cs.LG

TL;DR: MoSEs is a novel continual learning framework that uses Mixtures of SubExperts with task-specific routing to prevent catastrophic forgetting while enabling efficient knowledge transfer, achieving state-of-the-art performance with sublinear parameter growth.

Details

Motivation: Current PEFT methods for continual learning face a dilemma: reusing parameters causes catastrophic forgetting, while allocating separate parameters per task leads to linear model growth and prevents knowledge transfer between related tasks.

Method: Integrates sparse Mixtures of SubExperts into transformer layers with task-specific routing mechanism, allowing isolation of knowledge in dedicated SubExperts while enabling adaptive selection and combination of previously learned parameters for new tasks.

Result: Significantly outperforms conventional continual learning approaches on TRACE benchmark datasets, achieving superior knowledge retention and scalability with substantial memory and computational savings.

Conclusion: MoSEs provides an effective solution for continual learning of LLMs by balancing knowledge preservation and efficient parameter usage through adaptive SubExpert routing, enabling state-of-the-art performance with sublinear capacity growth.

Abstract: Adapting Large Language Models (LLMs) to a continuous stream of tasks is a critical yet challenging endeavor. While Parameter-Efficient Fine-Tuning (PEFT) methods have become a standard for this, they face a fundamental dilemma in continual learning. Reusing a single set of PEFT parameters for new tasks often leads to catastrophic forgetting of prior knowledge. Conversely, allocating distinct parameters for each task prevents forgetting but results in a linear growth of the model’s size and fails to facilitate knowledge transfer between related tasks. To overcome these limitations, we propose a novel adaptive PEFT method referred to as \textit{Mixtures of SubExperts (MoSEs)}, a novel continual learning framework designed for minimal forgetting and efficient scalability. MoSEs integrate a sparse Mixture of SubExperts into the transformer layers, governed by a task-specific routing mechanism. This architecture allows the model to isolate and protect knowledge within dedicated SubExperts, thereby minimizing parameter interference and catastrophic forgetting. Crucially, the router can adaptively select and combine previously learned sparse parameters for new tasks, enabling effective knowledge transfer while ensuring that the model’s capacity grows sublinearly. We evaluate MoSEs on the comprehensive TRACE benchmark datasets. Our experiments demonstrate that MoSEs significantly outperform conventional continual learning approaches in both knowledge retention and scalability to new tasks, achieving state-of-the-art performance with substantial memory and computational savings.

[725] Constraint-Informed Active Learning for End-to-End ACOPF Optimization Proxies

Miao Li, Michael Klamkin, Pascal Van Hentenryck, Wenting Li, Russell Bent

Main category: cs.LG

TL;DR: This paper introduces an active sampling framework for ACOPF optimization proxies that generates realistic and diverse training data using optimization-specific quantities, achieving superior generalization over existing methods.

Details

Motivation: Optimization proxy performance heavily depends on training data quality, and existing methods lack effective sampling strategies to generate realistic and diverse training data for ACOPF problems.

Method: A novel active sampling framework that explores varied problem specifications reflecting operational realities and uses optimization-specific quantities (active constraint sets) to capture salient features of ACOPF problems.

Result: Numerical results show superior generalization over existing sampling methods with equivalent training budget, significantly advancing trustworthy ACOPF optimization proxies.

Conclusion: The proposed active sampling framework effectively addresses training data limitations for ACOPF optimization proxies and represents a significant advancement in the state-of-practice.

Abstract: This paper studies optimization proxies, machine learning (ML) models trained to efficiently predict optimal solutions for AC Optimal Power Flow (ACOPF) problems. While promising, optimization proxy performance heavily depends on training data quality. To address this limitation, this paper introduces a novel active sampling framework for ACOPF optimization proxies designed to generate realistic and diverse training data. The framework actively explores varied, flexible problem specifications reflecting plausible operational realities. More importantly, the approach uses optimization-specific quantities (active constraint sets) that better capture the salient features of an ACOPF that lead to the optimal solution. Numerical results show superior generalization over existing sampling methods with an equivalent training budget, significantly advancing the state-of-practice for trustworthy ACOPF optimization proxies.

[726] Test-Time Iterative Error Correction for Efficient Diffusion Models

Yunshan Zhong, Yanwei Qi, Yuxin Zhang

Main category: cs.LG

TL;DR: IEC is a test-time method that reduces error propagation in efficient diffusion models from exponential to linear growth, improving generation quality without retraining.

Details

Motivation: Efficient diffusion models suffer from approximation errors that accumulate exponentially across timesteps, degrading generation quality, and these errors are difficult to correct after deployment.

Method: Iterative Error Correction (IEC) - a test-time method that iteratively refines the model’s output during inference to mitigate approximation errors.

Result: Extensive experiments show IEC consistently improves generation quality across various datasets, efficiency techniques, and model architectures.

Conclusion: IEC is a practical and generalizable solution for test-time enhancement of efficient diffusion models, enabling flexible trade-off between performance and efficiency.

Abstract: With the growing demand for high-quality image generation on resource-constrained devices, efficient diffusion models have received increasing attention. However, such models suffer from approximation errors introduced by efficiency techniques, which significantly degrade generation quality. Once deployed, these errors are difficult to correct, as modifying the model is typically infeasible in deployment environments. Through an analysis of error propagation across diffusion timesteps, we reveal that these approximation errors can accumulate exponentially, severely impairing output quality. Motivated by this insight, we propose Iterative Error Correction (IEC), a novel test-time method that mitigates inference-time errors by iteratively refining the model’s output. IEC is theoretically proven to reduce error propagation from exponential to linear growth, without requiring any retraining or architectural changes. IEC can seamlessly integrate into the inference process of existing diffusion models, enabling a flexible trade-off between performance and efficiency. Extensive experiments show that IEC consistently improves generation quality across various datasets, efficiency techniques, and model architectures, establishing it as a practical and generalizable solution for test-time enhancement of efficient diffusion models.

[727] MrCoM: A Meta-Regularized World-Model Generalizing Across Multi-Scenarios

Xuantang Xiong, Ni Mu, Runpeng Xie, Senhao Yang, Yaqing Wang, Lexiang Wang, Yao Luan, Siyuan Li, Shuang Xu, Yiqin Yang, Bo Xu

Main category: cs.LG

TL;DR: Proposes Meta-Regularized Contextual World-Model (MrCoM) for multi-scenario generalization in model-based RL, using latent space decomposition and meta-regularization to improve cross-scenario performance.

Details

Motivation: Current MBRL methods focus on single tasks and lack generalization across different scenarios, despite dynamics sharing inherent properties within the same simulation engine.

Method: Decomposes latent state space based on dynamic characteristics, uses meta-state regularization for unified scenario representation, and meta-value regularization to align world-model optimization with policy learning across scenarios.

Result: Theoretically analyzes generalization error bound and demonstrates significantly better performance than state-of-the-art methods across diverse scenarios.

Conclusion: MrCoM effectively addresses multi-scenario generalization in MBRL through latent decomposition and meta-regularization, achieving superior cross-scenario performance.

Abstract: Model-based reinforcement learning (MBRL) is a crucial approach to enhance the generalization capabilities and improve the sample efficiency of RL algorithms. However, current MBRL methods focus primarily on building world models for single tasks and rarely address generalization across different scenarios. Building on the insight that dynamics within the same simulation engine share inherent properties, we attempt to construct a unified world model capable of generalizing across different scenarios, named Meta-Regularized Contextual World-Model (MrCoM). This method first decomposes the latent state space into various components based on the dynamic characteristics, thereby enhancing the accuracy of world-model prediction. Further, MrCoM adopts meta-state regularization to extract unified representation of scenario-relevant information, and meta-value regularization to align world-model optimization with policy learning across diverse scenario objectives. We theoretically analyze the generalization error upper bound of MrCoM in multi-scenario settings. We systematically evaluate our algorithm’s generalization ability across diverse scenarios, demonstrating significantly better performance than previous state-of-the-art methods.

[728] Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra

Yiwen Zhang, Keyan Ding, Yihang Wu, Xiang Zhuang, Yi Yang, Qiang Zhang, Huajun Chen

Main category: cs.LG

TL;DR: GLMR is a Generative Language Model-based Retrieval framework that improves molecular structure retrieval from mass spectra using a two-stage process combining contrastive learning and generative modeling.

Details

Motivation: Existing methods for retrieving molecular structures from mass spectra have limited library coverage and suffer from cross-modal misalignment, leading to poor retrieval accuracy and generalization.

Method: Two-stage framework: 1) Pre-retrieval stage uses contrastive learning to identify top candidate molecules as contextual priors; 2) Generative retrieval stage integrates candidates with input spectrum to guide a generative model in producing refined structures for re-ranking.

Result: GLMR significantly outperforms existing methods, achieving over 40% improvement in top-1 accuracy on both MassSpecGym and MassRET-20k datasets, with strong generalizability.

Conclusion: The proposed GLMR framework effectively mitigates cross-modal misalignment and demonstrates superior performance in molecular structure retrieval from mass spectra compared to existing approaches.

Abstract: Retrieving molecular structures from tandem mass spectra is a crucial step in rapid compound identification. Existing retrieval methods, such as traditional mass spectral library matching, suffer from limited spectral library coverage, while recent cross-modal representation learning frameworks often encounter modality misalignment, resulting in suboptimal retrieval accuracy and generalization. To address these limitations, we propose GLMR, a Generative Language Model-based Retrieval framework that mitigates the cross-modal misalignment through a two-stage process. In the pre-retrieval stage, a contrastive learning-based model identifies top candidate molecules as contextual priors for the input mass spectrum. In the generative retrieval stage, these candidate molecules are integrated with the input mass spectrum to guide a generative model in producing refined molecular structures, which are then used to re-rank the candidates based on molecular similarity. Experiments on both MassSpecGym and the proposed MassRET-20k dataset demonstrate that GLMR significantly outperforms existing methods, achieving over 40% improvement in top-1 accuracy and exhibiting strong generalizability.

[729] CAMP-HiVe: Cyclic Pair Merging based Efficient DNN Pruning with Hessian-Vector Approximation for Resource-Constrained Systems

Mohammad Helal Uddin, Sai Krishna Ghanta, Liam Seymour, Sabur Baidya

Main category: cs.LG

TL;DR: CAMP-HiVe is a novel neural network pruning method using Hessian-vector products and power iteration to identify and preserve essential weights while reducing computational demands, achieving better performance than state-of-the-art methods.

Details

Motivation: Deep learning algorithms need efficient deployment on resource-constrained systems, and neural pruning provides high compression with minimal cost, but existing methods need improvement for better performance-complexity trade-offs.

Method: Proposes CAMP-HiVe - cyclic pair merging-based pruning with Hessian Vector approximation. Uses Hessian-vector products to approximate curvature information and power iteration to identify essential weights. Dynamically merges weight pairs by combining significant and less significant weights.

Result: Achieves significant reductions in computational requirements while maintaining high performance across ResNet18, ResNet56, and MobileNetv2 on CIFAR10, CIFAR-100, and ImageNet datasets. Outperforms existing state-of-the-art neural pruning methods.

Conclusion: CAMP-HiVe provides an effective framework for neural network pruning that balances model accuracy and computational efficiency through dynamic weight significance adjustment and Hessian-based curvature approximation.

Abstract: Deep learning algorithms are becoming an essential component of many artificial intelligence (AI) driven applications, many of which run on resource-constrained and energy-constrained systems. For efficient deployment of these algorithms, although different techniques for the compression of neural network models are proposed, neural pruning is one of the fastest and effective methods, which can provide a high compression gain with minimal cost. To harness enhanced performance gain with respect to model complexity, we propose a novel neural network pruning approach utilizing Hessian-vector products that approximate crucial curvature information in the loss function, which significantly reduces the computation demands. By employing a power iteration method, our algorithm effectively identifies and preserves the essential information, ensuring a balanced trade-off between model accuracy and computational efficiency. Herein, we introduce CAMP-HiVe, a cyclic pair merging-based pruning with Hessian Vector approximation by iteratively consolidating weight pairs, combining significant and less significant weights, thus effectively streamlining the model while preserving its performance. This dynamic, adaptive framework allows for real-time adjustment of weight significance, ensuring that only the most critical parameters are retained. Our experimental results demonstrate that our proposed method achieves significant reductions in computational requirements while maintaining high performance across different neural network architectures, e.g., ResNet18, ResNet56, and MobileNetv2, on standard benchmark datasets, e.g., CIFAR10, CIFAR-100, and ImageNet, and it outperforms the existing state-of-the-art neural pruning methods.

Yuhao Zhang, Qinghong Guo, Qixian Chen, Liuwei Zhang, Hongyan Cui, Xiyi Chen

Main category: cs.LG

TL;DR: LLM³-DTI is a novel drug-target interaction prediction framework that leverages large language models and multi-modal data fusion to enhance prediction accuracy.

Details

Motivation: Drug-target interaction prediction is crucial for drug discovery and repurposing. With abundant data available, data-driven methods can reduce costs and improve efficiency in pharmaceutical research.

Method: Uses domain-specific LLM for text semantic embeddings, dual cross-attention mechanism for alignment, TSFusion module for multi-modal fusion, and an output network for DTI prediction.

Result: Outperforms comparison models across diverse scenarios, demonstrating proficiency in identifying validated drug-target interactions.

Conclusion: LLM³-DTI effectively fulfills DTI prediction tasks with excellent performance, providing a robust framework for drug discovery applications.

Abstract: Drug-target interaction (DTI) prediction is of great significance for drug discovery and drug repurposing. With the accumulation of a large volume of valuable data, data-driven methods have been increasingly harnessed to predict DTIs, reducing costs across various dimensions. Therefore, this paper proposes a $\textbf{L}$arge $\textbf{L}$anguage $\textbf{M}$odel and $\textbf{M}$ulti-$\textbf{M}$odel data co-powered $\textbf{D}$rug $\textbf{T}$arget $\textbf{I}$nteraction prediction framework, named LLM$^3$-DTI. LLM$^3$-DTI constructs multi-modal data embedding to enhance DTI prediction performance. In this framework, the text semantic embeddings of drugs and targets are encoded by a domain-specific LLM. To effectively align and fuse multi-modal embedding. We propose the dual cross-attention mechanism and the TSFusion module. Finally, these multi-modal data are utilized for the DTI task through an output network. The experimental results indicate that LLM$^3$-DTI can proficiently identify validated DTIs, surpassing the performance of the models employed for comparison across diverse scenarios. Consequently, LLM$^3$-DTI is adept at fulfilling the task of DTI prediction with excellence. The data and code are available at https://github.com/chaser-gua/LLM3DTI.

[731] COTN: A Chaotic Oscillatory Transformer Network for Complex Volatile Systems under Extreme Conditions

Boyan Tang, Yilong Zeng, Xuanhao Ren, Peng Xiao, Yuhan Zhao, Raymond Lee, Jianghua Wu

Main category: cs.LG

TL;DR: COTN combines Transformer architecture with Lee Oscillator activation and Autoencoder Self-Regressive module to better predict chaotic financial/electricity markets during extreme volatility, outperforming state-of-the-art models by up to 40%.

Details

Motivation: Financial and electricity markets are challenging to predict due to nonlinearity, rapid fluctuations, and chaotic patterns, especially during extreme conditions where conventional activation functions saturate.

Method: Transformer architecture with novel Lee Oscillator activation function, Max-over-Time pooling, lambda-gating mechanism, and Autoencoder Self-Regressive module to detect abnormal patterns.

Result: Outperforms state-of-the-art deep learning models (Informer) by up to 17% and traditional statistical methods (GARCH) by up to 40% in electricity spot and financial markets.

Conclusion: COTN effectively navigates real-world market uncertainty and complexity, offering a powerful tool for forecasting highly volatile systems under duress.

Abstract: Accurate prediction of financial and electricity markets, especially under extreme conditions, remains a significant challenge due to their intrinsic nonlinearity, rapid fluctuations, and chaotic patterns. To address these limitations, we propose the Chaotic Oscillatory Transformer Network (COTN). COTN innovatively combines a Transformer architecture with a novel Lee Oscillator activation function, processed through Max-over-Time pooling and a lambda-gating mechanism. This design is specifically tailored to effectively capture chaotic dynamics and improve responsiveness during periods of heightened volatility, where conventional activation functions (e.g., ReLU, GELU) tend to saturate. Furthermore, COTN incorporates an Autoencoder Self-Regressive (ASR) module to detect and isolate abnormal market patterns, such as sudden price spikes or crashes, thereby preventing corruption of the core prediction process and enhancing robustness. Extensive experiments across electricity spot markets and financial markets demonstrate the practical applicability and resilience of COTN. Our approach outperforms state-of-the-art deep learning models like Informer by up to 17% and traditional statistical methods like GARCH by as much as 40%. These results underscore COTN’s effectiveness in navigating real-world market uncertainty and complexity, offering a powerful tool for forecasting highly volatile systems under duress.

[732] Achieving Fairness Without Harm via Selective Demographic Experts

Xuwei Tan, Yuanlong Wang, Thai-Hoang Pham, Ping Zhang, Xueru Zhang

Main category: cs.LG

TL;DR: Proposes a fairness-without-harm approach using demographic experts with group-specific representations and personalized classifiers to achieve fairness without performance degradation.

Details

Motivation: Existing bias mitigation techniques create unfair trade-offs between fairness and accuracy, which is unacceptable in high-stakes domains like healthcare where performance degradation for any demographic group is ethically problematic.

Method: Learn distinct representations for different demographic groups and use demographic experts consisting of group-specific representations and personalized classifiers with no-harm constrained selection.

Result: Extensive empirical evaluation on three medical datasets (eye disease, skin cancer, X-ray diagnosis) and two face datasets demonstrates effective fairness without harm.

Conclusion: The proposed approach successfully achieves fairness without performance degradation, addressing the critical need for equitable machine learning systems in human-centered domains like healthcare.

Abstract: As machine learning systems become increasingly integrated into human-centered domains such as healthcare, ensuring fairness while maintaining high predictive performance is critical. Existing bias mitigation techniques often impose a trade-off between fairness and accuracy, inadvertently degrading performance for certain demographic groups. In high-stakes domains like clinical diagnosis, such trade-offs are ethically and practically unacceptable. In this study, we propose a fairness-without-harm approach by learning distinct representations for different demographic groups and selectively applying demographic experts consisting of group-specific representations and personalized classifiers through a no-harm constrained selection. We evaluate our approach on three real-world medical datasets – covering eye disease, skin cancer, and X-ray diagnosis – as well as two face datasets. Extensive empirical results demonstrate the effectiveness of our approach in achieving fairness without harm.

[733] Transolver is a Linear Transformer: Revisiting Physics-Attention through the Lens of Linear Attention

Wenjie Hu, Sidun Liu, Peng Qiao, Zhenglun Sun, Yong Dou

Main category: cs.LG

TL;DR: The paper proposes Linear Attention Neural Operator (LinearNO), which reformulates Physics-Attention from Transolver as linear attention, achieving better performance with fewer parameters and lower computational cost on PDE benchmarks.

Details

Motivation: To address the inefficiency of Physics-Attention in Transformer-based neural operators for PDEs, which the authors found can be reformulated as linear attention and that slice attention may hurt performance.

Method: A two-step transformation that redesigns Physics-Attention into canonical linear attention, eliminating slice attention while maintaining the benefits of slice and deslice operations.

Result: State-of-the-art performance on six standard PDE benchmarks with 40.0% fewer parameters and 36.2% lower computational cost, plus superior performance on industrial datasets AirfRANS and Shape-Net Car.

Conclusion: LinearNO demonstrates that the effectiveness of Physics-Attention comes from slice/deslice operations rather than slice interactions, and linear attention provides a more efficient and performant alternative for neural operators in PDE solving.

Abstract: Recent advances in Transformer-based Neural Operators have enabled significant progress in data-driven solvers for Partial Differential Equations (PDEs). Most current research has focused on reducing the quadratic complexity of attention to address the resulting low training and inference efficiency. Among these works, Transolver stands out as a representative method that introduces Physics-Attention to reduce computational costs. Physics-Attention projects grid points into slices for slice attention, then maps them back through deslicing. However, we observe that Physics-Attention can be reformulated as a special case of linear attention, and that the slice attention may even hurt the model performance. Based on these observations, we argue that its effectiveness primarily arises from the slice and deslice operations rather than interactions between slices. Building on this insight, we propose a two-step transformation to redesign Physics-Attention into a canonical linear attention, which we call Linear Attention Neural Operator (LinearNO). Our method achieves state-of-the-art performance on six standard PDE benchmarks, while reducing the number of parameters by an average of 40.0% and computational cost by 36.2%. Additionally, it delivers superior performance on two challenging, industrial-level datasets: AirfRANS and Shape-Net Car.

[734] 3dSAGER: Geospatial Entity Resolution over 3D Objects (Technical Report)

Bar Genossar, Sagi Dalyot, Roee Shraga, Avigdor Gal

Main category: cs.LG

TL;DR: 3dSAGER is a novel end-to-end pipeline for geospatial entity resolution that uses intrinsic 3D geometry instead of traditional spatial proximity or metadata, enabling robust matching across datasets with incompatible coordinate systems.

Details

Motivation: Traditional geospatial entity resolution methods rely on spatial proximity, textual metadata, or external identifiers, which are often unavailable, unreliable, or misaligned in cross-source scenarios, especially with varying data collection platforms and coordinate systems.

Method: 3dSAGER introduces a spatial-reference-independent featurization mechanism that captures intricate geometric characteristics of 3D objects, along with BKAFI - a lightweight interpretable blocking method for efficient candidate generation.

Result: Extensive experiments on real-world urban datasets show significant gains in both accuracy and efficiency over strong baselines, with empirical analysis validating the contributions of each component.

Conclusion: The proposed 3dSAGER pipeline successfully addresses limitations of traditional geospatial entity resolution by focusing on intrinsic 3D geometry, enabling robust cross-dataset matching even with incompatible coordinate systems.

Abstract: Urban environments are continuously mapped and modeled by various data collection platforms, including satellites, unmanned aerial vehicles and street cameras. The growing availability of 3D geospatial data from multiple modalities has introduced new opportunities and challenges for integrating spatial knowledge at scale, particularly in high-impact domains such as urban planning and rapid disaster management. Geospatial entity resolution is the task of identifying matching spatial objects across different datasets, often collected independently under varying conditions. Existing approaches typically rely on spatial proximity, textual metadata, or external identifiers to determine correspondence. While useful, these signals are often unavailable, unreliable, or misaligned, especially in cross-source scenarios. To address these limitations, we shift the focus to the intrinsic geometry of 3D spatial objects and present 3dSAGER (3D Spatial-Aware Geospatial Entity Resolution), an end-to-end pipeline for geospatial entity resolution over 3D objects. 3dSAGER introduces a novel, spatial-reference-independent featurization mechanism that captures intricate geometric characteristics of matching pairs, enabling robust comparison even across datasets with incompatible coordinate systems where traditional spatial methods fail. As a key component of 3dSAGER, we also propose a new lightweight and interpretable blocking method, BKAFI, that leverages a trained model to efficiently generate high-recall candidate sets. We validate 3dSAGER through extensive experiments on real-world urban datasets, demonstrating significant gains in both accuracy and efficiency over strong baselines. Our empirical study further dissects the contributions of each component, providing insights into their impact and the overall design choices.

[735] Kaggle Chronicles: 15 Years of Competitions, Community and Data Science Innovation

Kevin Bönisch, Leandro Losaria

Main category: cs.LG

TL;DR: Analysis of 15 years of Kaggle data science competitions, community interactions, and technological trends through metadata, code, and discussions.

Details

Motivation: To explore Kaggle's evolution from a competition platform to a broader data science ecosystem and understand its impact on the community, technological trends, and real-world ML applications.

Method: Analyzed millions of kernels and discussion threads using longitudinal trend analysis and exploratory data analysis of Kaggle Meta Code and Meta Datasets.

Result: Kaggle shows steady growth with diverse use cases, rapid adoption of new trends by participants, and models with solid generalization capabilities on average.

Conclusion: Kaggle has evolved into a comprehensive data science platform that drives innovation, community collaboration, and practical ML applications while maintaining competitive model quality.

Abstract: Since 2010, Kaggle has been a platform where data scientists from around the world come together to compete, collaborate, and push the boundaries of Data Science. Over these 15 years, it has grown from a purely competition-focused site into a broader ecosystem with forums, notebooks, models, datasets, and more. With the release of the Kaggle Meta Code and Kaggle Meta Datasets, we now have a unique opportunity to explore these competitions, technologies, and real-world applications of Machine Learning and AI. And so in this study, we take a closer look at 15 years of data science on Kaggle - through metadata, shared code, community discussions, and the competitions themselves. We explore Kaggle’s growth, its impact on the data science community, uncover hidden technological trends, analyze competition winners, how Kagglers approach problems in general, and more. We do this by analyzing millions of kernels and discussion threads to perform both longitudinal trend analysis and standard exploratory data analysis. Our findings show that Kaggle is a steadily growing platform with increasingly diverse use cases, and that Kagglers are quick to adapt to new trends and apply them to real-world challenges, while producing - on average - models with solid generalization capabilities. We also offer a snapshot of the platform as a whole, highlighting its history and technological evolution. Finally, this study is accompanied by a video (https://www.youtube.com/watch?v=YVOV9bIUNrM) and a Kaggle write-up (https://kaggle.com/competitions/meta-kaggle-hackathon/writeups/kaggle-chronicles-15-years-of-competitions-communi) for your convenience.

[736] DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation

Speed Zhu, Jianwei Cai, Guang Chen, Lulu Wu, Saiyong Yang, Wiggin Zhou

Main category: cs.LG

TL;DR: The paper presents a two-stage RL training method for competitive-programming code generation, achieving SOTA performance comparable to leading models like DeepSeek v3.1 and Doubao-1.5-Thinking.

Details

Motivation: To address the underexplored area of competitive-programming code generation in RLVR research, focusing on data curation and practical training techniques rather than just RL algorithm design.

Method: Two-stage RL training: 1) GRPO on large problem set with 8 rollouts to expand entropy, 2) Pre-GRPO on challenging problems with 64 rollouts using hard-focus curriculum. Built on SFT from strong open-source models.

Result: Achieves state-of-the-art performance among similar-scale models on LeetCode and Codeforces, comparable to DeepSeek v3.1 and Doubao-1.5-Thinking. Shows strong RL scaling on internal MoE model.

Conclusion: Provides best practices for data curation, entropy expansion, and curriculum design in RLVR for competitive-programming code generation.

Abstract: Recent reasoning-first models (e.g., OpenAI o1, DeepSeek R1) have spurred a resurgence of interest in RLVR. Nevertheless, advances are dominated by mathematics (e.g., AIME), with competitive-programming code generation underexplored and data curation receiving less attention than RL algorithm design. We investigate how to construct RLVR datasets (i.e., RL prompts) and present practical training techniques that yield strong performance on competitive-programming code generation. Our pipeline begins with supervised fine-tuning (SFT) distilled from strong open-source models, augmented with general-purpose and reasoning-intensive data. RL then follows a two-stage process with executable, testcase-driven rewards: first, training on a large, uniformly distributed set of competitive-programming problems using Group Relative Policy Optimization (GRPO) with 8 rollouts per prompt and a relatively short response-generation window (e.g., 32k during SFT and 24k in this stage) to expand entropy and mitigate repetition and truncation; second, we perform \textbf{Pre-GRPO}: updating on a small, high-quality set of challenging problems with a large rollout budget (64 rollouts per prompt) under a hard-focus curriculum that continuously retains the most difficult instances throughout training. We implement our method on Qwen2.5-32B and evaluate on LeetCode and Codeforces weekly contests to avoid data leakage. The resulting model achieves state-of-the-art performance among models of similar scale and is comparable to leading systems such as DeepSeek v3.1 and Doubao-1.5-Thinking. We also examine scaling trends and observe strong RL scaling on an internal large-scale MoE model. Our study distills concise best practices for data curation, entropy expansion, and curriculum design in RLVR for competitive-programming code generation.

[737] Scalable Verification of Neural Control Barrier Functions Using Linear Bound Propagation

Nikolaus Vertovec, Frederik Baymler Mathiesen, Thom Badings, Luca Laurenti, Alessandro Abate

Main category: cs.LG

TL;DR: A scalable framework for verifying neural control barrier functions using linear bound propagation and McCormick relaxation to compute bounds on CBF conditions, enabling verification of larger networks than state-of-the-art methods.

Details

Motivation: Current verification methods for neural CBFs are computationally expensive and limit the size of networks that can be used, creating a bottleneck for practical applications.

Method: Extends linear bound propagation to compute bounds on neural network gradients, combines with McCormick relaxation to derive linear bounds on CBF conditions, and uses adaptive refinement strategy to reduce conservatism.

Result: The approach scales to larger neural networks than existing verification procedures and applies to arbitrary control-affine systems with various nonlinear activation functions.

Conclusion: The proposed framework eliminates computational bottlenecks in neural CBF verification, enabling the use of more expressive networks for safety certification of nonlinear control systems.

Abstract: Control barrier functions (CBFs) are a popular tool for safety certification of nonlinear dynamical control systems. Recently, CBFs represented as neural networks have shown great promise due to their expressiveness and applicability to a broad class of dynamics and safety constraints. However, verifying that a trained neural network is indeed a valid CBF is a computational bottleneck that limits the size of the networks that can be used. To overcome this limitation, we present a novel framework for verifying neural CBFs based on piecewise linear upper and lower bounds on the conditions required for a neural network to be a CBF. Our approach is rooted in linear bound propagation (LBP) for neural networks, which we extend to compute bounds on the gradients of the network. Combined with McCormick relaxation, we derive linear upper and lower bounds on the CBF conditions, thereby eliminating the need for computationally expensive verification procedures. Our approach applies to arbitrary control-affine systems and a broad range of nonlinear activation functions. To reduce conservatism, we develop a parallelizable refinement strategy that adaptively refines the regions over which these bounds are computed. Our approach scales to larger neural networks than state-of-the-art verification procedures for CBFs, as demonstrated by our numerical experiments.

[738] Reaction Prediction via Interaction Modeling of Symmetric Difference Shingle Sets

Runhan Shi, Letian Chen, Gufeng Yu, Yang Yang

Main category: cs.LG

TL;DR: ReaDISH is a novel reaction prediction model that addresses permutation sensitivity and inadequate substructure interaction modeling through symmetric difference shingle encoding and geometry-structure interaction attention.

Details

Motivation: Existing machine learning models for chemical reaction prediction suffer from sensitivity to input permutations (molecule/atom orderings) and poor modeling of substructural interactions that govern reactivity, leading to inconsistent predictions and poor generalization.

Method: The model uses two key innovations: (1) symmetric difference shingle encoding to capture reaction-specific structural changes while eliminating order sensitivity, and (2) geometry-structure interaction attention mechanism that models intra- and inter-molecular interactions at the shingle level.

Result: Extensive experiments show ReaDISH improves reaction prediction performance across diverse benchmarks, with enhanced robustness demonstrated by an average 8.76% improvement on R² under permutation perturbations.

Conclusion: ReaDISH successfully addresses critical limitations in chemical reaction prediction by providing permutation-invariant representations while incorporating interaction-aware features, leading to more robust and accurate predictions.

Abstract: Chemical reaction prediction remains a fundamental challenge in organic chemistry, where existing machine learning models face two critical limitations: sensitivity to input permutations (molecule/atom orderings) and inadequate modeling of substructural interactions governing reactivity. These shortcomings lead to inconsistent predictions and poor generalization to real-world scenarios. To address these challenges, we propose ReaDISH, a novel reaction prediction model that learns permutation-invariant representations while incorporating interaction-aware features. It introduces two innovations: (1) symmetric difference shingle encoding, which computes molecular shingle differences to capture reaction-specific structural changes while eliminating order sensitivity; and (2) geometry-structure interaction attention, a mechanism that models intra- and inter-molecular interactions at the shingle level. Extensive experiments demonstrate that ReaDISH improves reaction prediction performance across diverse benchmarks. It shows enhanced robustness with an average improvement of 8.76% on R$^2$ under permutation perturbations.

[739] CG-TTRL: Context-Guided Test-Time Reinforcement Learning for On-Device Large Language Models

Peyman Hosseini, Ondrej Bohdal, Taha Ceritli, Ignacio Castro, Matthew Purver, Mete Ozay, Umberto Michieli

Main category: cs.LG

TL;DR: CG-TTRL enhances Test-time Reinforcement Learning by integrating contextual guidance into both sampling phases, improving accuracy and efficiency over standard TTRL.

Details

Motivation: TTRL's two-phase sampling strategy under-utilizes contextual guidance, which could improve pseudo-label accuracy in the initial phase and regulate exploration in the second phase.

Method: Propose context-guided TTRL (CG-TTRL) that integrates context dynamically into both sampling phases and develops an efficient context selection method for on-device applications.

Result: CG-TTRL outperforms TTRL with 7% relative accuracy improvement on mathematical and scientific QA benchmarks, and achieves 8% improvement after only 3 steps of training compared to TTRL’s 1%.

Conclusion: Integrating contextual guidance into TTRL significantly improves both performance and efficiency, making it more suitable for practical applications.

Abstract: Test-time Reinforcement Learning (TTRL) has shown promise in adapting foundation models for complex tasks at test-time, resulting in large performance improvements. TTRL leverages an elegant two-phase sampling strategy: first, multi-sampling derives a pseudo-label via majority voting, while subsequent downsampling and reward-based fine-tuning encourages the model to explore and learn diverse valid solutions, with the pseudo-label modulating the reward signal. Meanwhile, in-context learning has been widely explored at inference time and demonstrated the ability to enhance model performance without weight updates. However, TTRL’s two-phase sampling strategy under-utilizes contextual guidance, which can potentially improve pseudo-label accuracy in the initial exploitation phase while regulating exploration in the second. To address this, we propose context-guided TTRL (CG-TTRL), integrating context dynamically into both sampling phases and propose a method for efficient context selection for on-device applications. Our evaluations on mathematical and scientific QA benchmarks show CG-TTRL outperforms TTRL (e.g. additional 7% relative accuracy improvement over TTRL), while boosting efficiency by obtaining strong performance after only a few steps of test-time training (e.g. 8% relative improvement rather than 1% over TTRL after 3 steps).

[740] Privacy-Preserving Federated Learning for Fair and Efficient Urban Traffic Optimization

Rathin Chandra Shit, Sharmila Subudhi

Main category: cs.LG

TL;DR: FedFair-Traffic is a privacy-preserving federated learning framework that simultaneously optimizes travel efficiency, traffic fairness, and privacy protection for urban transportation systems.

Details

Motivation: Current centralized traffic management invades user privacy and creates traffic disparities, while existing federated learning lacks fairness constraints in multi-objective traffic settings.

Method: Integrates Graph Neural Networks with differential privacy (ε-privacy guarantees) and Gini coefficient-based fair constraints using multi-objective optimization, with federated aggregation methods including gradient clipping and noise injection.

Result: Reduced average travel time by 7% (14.2 minutes), improved traffic fairness by 73% (Gini coefficient 0.78), provided high privacy protection (score 0.8), and reduced communication overhead by 89% on METR-LA dataset.

Conclusion: FedFair-Traffic is a scalable privacy-aware smart city infrastructure suitable for metropolitan traffic flow control and federated transportation networks.

Abstract: The optimization of urban traffic is threatened by the complexity of achieving a balance between transport efficiency and the maintenance of privacy, as well as the equitable distribution of traffic based on socioeconomically diverse neighborhoods. Current centralized traffic management schemes invade user location privacy and further entrench traffic disparity by offering disadvantaged route suggestions, whereas current federated learning frameworks do not consider fairness constraints in multi-objective traffic settings. This study presents a privacy-preserving federated learning framework, termed FedFair-Traffic, that jointly and simultaneously optimizes travel efficiency, traffic fairness, and differential privacy protection. This is the first attempt to integrate three conflicting objectives to improve urban transportation systems. The proposed methodology enables collaborative learning between related vehicles with data locality by integrating Graph Neural Networks with differential privacy mechanisms ($ε$-privacy guarantees) and Gini coefficient-based fair constraints using multi-objective optimization. The framework uses federated aggregation methods of gradient clipping and noise injection to provide differential privacy and optimize Pareto-efficient solutions for the efficiency-fairness tradeoff. Real-world comprehensive experiments on the METR-LA traffic dataset showed that FedFair-Traffic can reduce the average travel time by 7% (14.2 minutes) compared with their centralized baselines, promote traffic fairness by 73% (Gini coefficient, 0.78), and offer high privacy protection (privacy score, 0.8) with an 89% reduction in communication overhead. These outcomes demonstrate that FedFair-Traffic is a scalable privacy-aware smart city infrastructure with possible use-cases in metropolitan traffic flow control and federated transportation networks.

[741] Adaptive Regularization for Large-Scale Sparse Feature Embedding Models

Mang Li, Wei Lyu

Main category: cs.LG

TL;DR: The paper analyzes the one-epoch overfitting problem in CTR/CVR models using large-scale sparse categorical features, identifies the fundamental cause theoretically, and proposes an adaptive regularization method that prevents performance degradation in multi-epoch training while improving single-epoch performance.

Details

Motivation: CTR and CVR estimation models in search, advertising, and recommendation domains suffer from significant performance decline when trained for multiple epochs (one-epoch overfitting problem), and existing heuristic solutions haven't identified the fundamental cause.

Method: Theoretical analysis of overfitting causes in models with large-scale sparse categorical features, followed by development of an adaptive regularization method to address the identified issues.

Result: The proposed method prevents severe performance degradation during multi-epoch training and improves model performance even within a single epoch. The method has been successfully deployed in online production systems.

Conclusion: The paper provides both theoretical understanding and practical solution for the one-epoch overfitting problem in CTR/CVR models, with demonstrated effectiveness in real-world deployment.

Abstract: The one-epoch overfitting problem has drawn widespread attention, especially in CTR and CVR estimation models in search, advertising, and recommendation domains. These models which rely heavily on large-scale sparse categorical features, often suffer a significant decline in performance when trained for multiple epochs. Although recent studies have proposed heuristic solutions, they have not clearly identified the fundamental cause of this phenomenon. In this work, we provide a theoretical analysis that explains why overfitting occurs in models that use large-scale sparse categorical features. Based on this analysis, we propose an adaptive regularization method to address it. Our approach not only prevents the severe performance degradation observed during multi-epoch training, but also improves model performance within a single epoch. This method has already been deployed in online production systems.

[742] Vocabulary In-Context Learning in Transformers: Benefits of Positional Encoding

Qian Ma, Ruoxiang Xu, Yongqiang Cai

Main category: cs.LG

TL;DR: Vocabulary in-context learning (VICL) in single-layer Transformers without positional encoding lacks universal approximation property (UAP), but adding positional encoding enables UAP.

Details

Motivation: To understand the role of positional encoding in enabling universal approximation capability for vocabulary-based in-context learning in Transformers.

Method: Theoretical analysis of single-layer Transformers with and without positional encoding in vocabulary in-context learning scenarios, providing sufficient conditions for positional encoding.

Result: VICL without positional encoding does not possess UAP, but with positional encoding, UAP can be achieved under certain sufficient conditions.

Conclusion: Positional encoding provides crucial benefits for approximation capability in in-context learning from an approximation theory perspective.

Abstract: Numerous studies have demonstrated that the Transformer architecture possesses the capability for in-context learning (ICL). In scenarios involving function approximation, context can serve as a control parameter for the model, endowing it with the universal approximation property (UAP). In practice, context is represented by tokens from a finite set, referred to as a vocabulary, which is the case considered in this paper, \emph{i.e.}, vocabulary in-context learning (VICL). We demonstrate that VICL in single-layer Transformers, without positional encoding, does not possess the UAP; however, it is possible to achieve the UAP when positional encoding is included. Several sufficient conditions for the positional encoding are provided. Our findings reveal the benefits of positional encoding from an approximation theory perspective in the context of ICL.

[743] How Wide and How Deep? Mitigating Over-Squashing of GNNs via Channel Capacity Constrained Estimation

Zinuo You, Jin Zheng, John Cartlidge

Main category: cs.LG

TL;DR: C3E framework uses information theory to optimize hidden dimensions and propagation depth in graph neural networks, mitigating over-squashing by treating GNNs as communication channels.

Details

Motivation: Existing GNNs suffer from heuristic choices of hidden dimensions and propagation depths, leading to over-squashing (information loss during propagation).

Method: Proposes Channel Capacity Constrained Estimation (C3E) that formulates dimension/depth selection as nonlinear programming based on information theory, modeling spectral GNNs as communication channels.

Result: Experiments on 9 datasets show C3E-estimated dimensions/depths mitigate over-squashing and improve representation learning. Reveals over-squashing stems from cumulative information compression in representation matrices.

Conclusion: Increasing hidden dimensions mitigates information compression, while propagation depth requires balancing information compression and representation complexity.

Abstract: Existing graph neural networks typically rely on heuristic choices for hidden dimensions and propagation depths, which often lead to severe information loss during propagation, known as over-squashing. To address this issue, we propose Channel Capacity Constrained Estimation (C3E), a novel framework that formulates the selection of hidden dimensions and depth as a nonlinear programming problem grounded in information theory. Through modeling spectral graph neural networks as communication channels, our approach directly connects channel capacity to hidden dimensions, propagation depth, propagation mechanism, and graph structure. Extensive experiments on nine public datasets demonstrate that hidden dimensions and depths estimated by C3E can mitigate over-squashing and consistently improve representation learning. Experimental results show that over-squashing occurs due to the cumulative compression of information in representation matrices. Furthermore, our findings show that increasing hidden dimensions indeed mitigate information compression, while the role of propagation depth is more nuanced, uncovering a fundamental balance between information compression and representation complexity.

[744] FLEX: Continuous Agent Evolution via Forward Learning from Experience

Zhicheng Cai, Xinyuan Guo, Yu Pei, JiangTao Feng, Jiangjie Chen, Ya-Qin Zhang, Wei-Ying Ma, Mingxuan Wang, Hao Zhou

Main category: cs.LG

TL;DR: FLEX enables LLM agents to continuously learn and evolve through experience accumulation, achieving significant performance improvements across multiple domains including mathematical reasoning, chemical synthesis, and protein prediction.

Details

Motivation: Current LLM agents remain static after training and cannot grow with experience like intelligent beings do during deployment, limiting their adaptability and long-term performance.

Method: FLEX uses a gradient-free learning paradigm that constructs a structured experience library through continual reflection on successes and failures during environment interactions, enabling scalable and inheritable evolution.

Result: FLEX achieved substantial improvements: up to 23% on AIME25 (mathematical reasoning), 10% on USPTO50k (chemical retrosynthesis), and 14% on ProteinGym (protein fitness prediction). The approach also demonstrated scaling laws of experiential growth and experience inheritance across agents.

Conclusion: FLEX represents a step toward scalable and inheritable continuous agent evolution, enabling LLM agents to grow and improve through accumulated experience rather than remaining static after training.

Abstract: Autonomous agents driven by Large Language Models (LLMs) have revolutionized reasoning and problem-solving but remain static after training, unable to grow with experience as intelligent beings do during deployment. We introduce Forward Learning with EXperience (FLEX), a gradient-free learning paradigm that enables LLM agents to continuously evolve through accumulated experience. Specifically, FLEX cultivates scalable and inheritable evolution by constructing a structured experience library through continual reflection on successes and failures during interaction with the environment. FLEX delivers substantial improvements on mathematical reasoning, chemical retrosynthesis, and protein fitness prediction (up to 23% on AIME25, 10% on USPTO50k, and 14% on ProteinGym). We further identify a clear scaling law of experiential growth and the phenomenon of experience inheritance across agents, marking a step toward scalable and inheritable continuous agent evolution. Project Page: https://flex-gensi-thuair.github.io.

[745] A Risk-Neutral Neural Operator for Arbitrage-Free SPX-VIX Term Structures

Jian’an Zhang

Main category: cs.LG

TL;DR: ARBITER is a risk-neutral neural operator that learns joint SPX-VIX term structures while enforcing no-arbitrage constraints, outperforming existing methods on historical data.

Details

Motivation: To develop a model that can learn joint SPX-VIX term structures while enforcing static arbitrage constraints (calendar, vertical, butterfly), Lipschitz bounds, and monotonicity to ensure no-arbitrage conditions.

Method: Couples operator learning with constrained decoders, trained using extragradient-style updates plus projection. Ties SPX and VIX legs together and uses selective state updates for improved generalization.

Result: Outperforms Fourier Neural Operator, DeepONet, and state-space sequence models on historical SPX and VIX data. Shows reduced Dual-Gap, improved NI, and better calibration stability.

Conclusion: ARBITER provides effective arbitrage-free interpolation and extrapolation across maturities and strikes, with proven identifiability and approximation guarantees.

Abstract: We propose ARBITER, a risk-neutral neural operator for learning joint SPX-VIX term structures under no-arbitrage constraints. ARBITER maps market states to an operator that outputs implied volatility and variance curves while enforcing static arbitrage (calendar, vertical, butterfly), Lipschitz bounds, and monotonicity. The model couples operator learning with constrained decoders and is trained with extragradient-style updates plus projection. We introduce evaluation metrics for derivatives term structures (NAS, CNAS, NI, Dual-Gap, Stability Rate) and show gains over Fourier Neural Operator, DeepONet, and state-space sequence models on historical SPX and VIX data. Ablation studies indicate that tying the SPX and VIX legs reduces Dual-Gap and improves NI, Lipschitz projection stabilizes calibration, and selective state updates improve long-horizon generalization. We provide identifiability and approximation results and describe practical recipes for arbitrage-free interpolation and extrapolation across maturities and strikes.

[746] MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains

Leyan Xue, Zongbo Han, Kecheng Xue, Xiaohong Liu, Guangyu Wang, Changqing Zhang

Main category: cs.LG

TL;DR: A large-scale multimodal evaluation benchmark is introduced to address the lack of adequate evaluation standards in multimodal fusion, integrating over 30 datasets across 15 modalities and 20 tasks.

Details

Motivation: Current multimodal fusion methods are evaluated on limited datasets, leading to biased assessments and poor generalization. The absence of unified evaluation standards hinders fair comparisons and the development of universal fusion models.

Method: Developed a large-scale, domain-adaptive benchmark integrating over 30 datasets, 15 modalities, and 20 predictive tasks, along with an open-source, unified, automated evaluation pipeline with standardized implementations of state-of-the-art models and fusion paradigms.

Result: Conducted large-scale experiments establishing new performance baselines across multiple tasks, providing a platform for rigorous and reproducible assessment of multimodal models.

Conclusion: This work offers a crucial platform for the academic community to advance multimodal artificial intelligence through standardized, comprehensive evaluation, addressing current limitations in multimodal fusion assessment.

Abstract: Although multimodal fusion has made significant progress, its advancement is severely hindered by the lack of adequate evaluation benchmarks. Current fusion methods are typically evaluated on a small selection of public datasets, a limited scope that inadequately represents the complexity and diversity of real-world scenarios, potentially leading to biased evaluations. This issue presents a twofold challenge. On one hand, models may overfit to the biases of specific datasets, hindering their generalization to broader practical applications. On the other hand, the absence of a unified evaluation standard makes fair and objective comparisons between different fusion methods difficult. Consequently, a truly universal and high-performance fusion model has yet to emerge. To address these challenges, we have developed a large-scale, domain-adaptive benchmark for multimodal evaluation. This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains. To complement this, we have also developed an open-source, unified, and automated evaluation pipeline that includes standardized implementations of state-of-the-art models and diverse fusion paradigms. Leveraging this platform, we have conducted large-scale experiments, successfully establishing new performance baselines across multiple tasks. This work provides the academic community with a crucial platform for rigorous and reproducible assessment of multimodal models, aiming to propel the field of multimodal artificial intelligence to new heights.

[747] Reconstruction and Secrecy under Approximate Distance Queries

Shay Moran, Elizaveta Nesterova

Main category: cs.LG

TL;DR: This paper studies the problem of locating an unknown target using noisy distance queries, analyzing the optimal reconstruction error through geometric and learning-theoretic approaches.

Details

Motivation: The problem arises in GPS localization, sensor networks, and privacy-aware data access, with relevance for both reconstructors (seeking accuracy) and responders (aiming to limit information disclosure for privacy/security).

Method: The authors study the reconstruction game through a learning-theoretic lens, using geometric characterization via Chebyshev radius and analyzing asymptotic behavior in different metric spaces.

Result: First result provides tight geometric characterization of optimal error using Chebyshev radius for all compact/totally bounded metric spaces. Second result characterizes pseudo-finite spaces where optimal error is attained after finite queries versus spaces with nontrivial decay.

Conclusion: The paper establishes fundamental limits and geometric characterizations for target localization using noisy distance queries, with applications across various metric spaces and practical contexts.

Abstract: Consider the task of locating an unknown target point using approximate distance queries: in each round, a reconstructor selects a query point and receives a noisy version of its distance to the target. This problem arises naturally in various contexts ranging from localization in GPS and sensor networks to privacy-aware data access, and spans a wide variety of metric spaces. It is relevant from the perspective of both the reconstructor (seeking accurate recovery) and the responder (aiming to limit information disclosure, e.g., for privacy or security reasons). We study this reconstruction game through a learning-theoretic lens, focusing on the rate and limits of the best possible reconstruction error. Our first result provides a tight geometric characterization of the optimal error in terms of the Chebyshev radius, a classical concept from geometry. This characterization applies to all compact metric spaces (in fact, even to all totally bounded spaces) and yields explicit formulas for natural metric spaces. Our second result addresses the asymptotic behavior of reconstruction, distinguishing between pseudo-finite spaces – where the optimal error is attained after finitely many queries – and spaces where the approximation curve exhibits nontrivial decay. We characterize pseudo-finiteness for convex Euclidean spaces.

[748] The Few Govern the Many:Unveiling Few-Layer Dominance for Time Series Models

Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Xiaoyu Shen

Main category: cs.LG

TL;DR: Large-scale time series models exhibit a scaling paradox where larger models don’t perform better due to few-layer dominance - only a small subset of layers are functionally important while most are redundant.

Details

Motivation: To understand why scaling up model capacity and data volume in time series forecasting doesn't lead to expected performance improvements, revealing a puzzling scaling paradox phenomenon.

Method: Extensive experiments across model families (100M to 1.7B parameters) and diverse data (up to 6B observations), analyzing internal representations to identify few-layer dominance, then proposing a method to automatically identify and retain only dominant layers.

Result: Retaining only 21% of parameters achieves up to 12% accuracy improvement and 2.7× inference speedup. Method validated on 8 SOTA models (90M to 6B), showing that retaining <30% of layers achieves comparable or superior accuracy in over 95% of tasks.

Conclusion: The scaling paradox in time series models is caused by few-layer dominance, and selectively retaining only functionally important layers can significantly improve performance and efficiency while reducing model size.

Abstract: Large-scale models are at the forefront of time series (TS) forecasting, dominated by two paradigms: fine-tuning text-based Large Language Models (LLM4TS) and training Time Series Foundation Models (TSFMs) from scratch. Both approaches share a foundational assumption that scaling up model capacity and data volume leads to improved performance. However, we observe a \textit{\textbf{scaling paradox}} in TS models, revealing a puzzling phenomenon that larger models do \emph{NOT} achieve better performance. Through extensive experiments on two model families across four scales (100M to 1.7B parameters) and diverse data (up to 6B observations), we rigorously confirm that the scaling paradox is a pervasive issue. We then diagnose its root cause by analyzing internal representations, identifying a phenomenon we call \textit{few-layer dominance}: only a small subset of layers are functionally important, while the majority are redundant, under-utilized, and can even distract training. Based on this discovery, we propose a practical method to automatically identify and retain only these dominant layers. In our models, retaining only 21% of the parameters achieves up to a 12% accuracy improvement and a 2.7$\times$ inference speedup. We validate the universality of our method on 8 prominent SOTA models (LLM4TS and TSFMs, 90M to 6B), showing that retaining less than 30% of layers achieves comparable or superior accuracy in over 95% of tasks.

[749] Error Estimate and Convergence Analysis for Data Valuation

Zhangyong Liang, Huanhuan Gao, Ji Zhang

Main category: cs.LG

TL;DR: NDDV enables valid data valuation in single training processes with proven error bounds and convergence guarantees.

Details

Motivation: Existing data valuation methods cannot ensure validity in a single training process, limiting their practical applicability.

Method: Neural dynamic data valuation (NDDV) with error estimation and convergence analysis under Lipschitz and smoothness assumptions.

Result: Derived quadratic error bounds for loss differences that scale inversely with time steps and quadratically with control variations, ensuring stability. Proved expected squared gradient norm vanishes asymptotically and meta loss converges sublinearly.

Conclusion: NDDV achieves sublinear convergence and provides the first theoretical guarantees for data valuation in single training processes.

Abstract: Data valuation quantifies data importance, but existing methods cannot ensure validity in a single training process. The neural dynamic data valuation (NDDV) method [3] addresses this limitation. Based on NDDV, we are the first to explore error estimation and convergence analysis in data valuation. Under Lipschitz and smoothness assumptions, we derive quadratic error bounds for loss differences that scale inversely with time steps and quadratically with control variations, ensuring stability. We also prove that the expected squared gradient norm for the training loss vanishes asymptotically, and that the meta loss converges sublinearly over iterations. In particular, NDDV achieves sublinear convergence.

[750] Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection

Vaibhav Mavi, Shubh Jaroria, Weiqi Sun

Main category: cs.LG

TL;DR: Extends self-evaluation techniques from single-step to multi-step reasoning tasks, showing stepwise evaluation outperforms holistic scoring for error detection in LLMs.

Details

Motivation: Reliability and failure detection of LLMs is critical for high-stakes multi-step reasoning tasks, but prior work focuses on single-step outputs and overlooks multi-step challenges.

Method: Tested two approaches: holistic scoring (evaluating entire reasoning chain) and step-by-step scoring (evaluating each reasoning step) on two multi-step benchmark datasets.

Result: Stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to 15% relative increase in AUC-ROC.

Conclusion: Self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving trustworthiness and offering a practical framework for failure detection.

Abstract: Reliability and failure detection of large language models (LLMs) is critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, with confidence scorers estimating the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.

[751] DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning

Nikolay Yudin, Ekaterina Grishina, Andrey Veprikov, Alexandr Beznosikov, Maxim Rakhuba

Main category: cs.LG

TL;DR: DyKAF optimizer uses projector-splitting integrators to create effective Kronecker-factorized Fisher matrix preconditioners, outperforming existing methods in LLM training.

Details

Motivation: Existing matrix-based optimizers use heuristic approximations of Fisher matrices due to computational constraints, but need more accurate and efficient preconditioners.

Method: Leverages projector-splitting integrators to construct Kronecker-factorized Fisher matrix approximations, enabling memory-efficient representation and better optimization.

Result: DyKAF consistently improves Fisher matrix approximation quality and outperforms existing optimizers in large language model pre-training and fine-tuning across multiple metrics.

Conclusion: The DyKAF approach provides an effective solution for constructing accurate Fisher matrix preconditioners, demonstrating superior performance in practical deep learning applications.

Abstract: Recently, optimizers that explicitly treat weights as matrices, rather than flattened vectors, have demonstrated their effectiveness. This perspective naturally leads to structured approximations of the Fisher matrix as preconditioners, where the matrix view induces a Kronecker-factorized form that enables memory-efficient representation. However, constructing such approximations both efficiently and accurately remains an open challenge, since obtaining the optimal factorization is resource-intensive and practical methods therefore rely on heuristic design choices. In this work, we introduce a novel approach that leverages projector-splitting integrators to construct effective preconditioners. Our optimizer, DyKAF (Dynamical Kronecker Approximation of the Fisher Matrix), consistently improves the Fisher matrix approximation quality. Experiments on large language model pre-training and fine-tuning demonstrate that DyKAF outperforms existing optimizers across a range of evaluation metrics.

[752] Explainable AI For Early Detection Of Sepsis

Atharva Thakur, Shruti Dhumal

Main category: cs.LG

TL;DR: An interpretable AI approach for sepsis analysis that combines machine learning with clinical knowledge to provide accurate predictions while maintaining transparency and clinical trust.

Details

Motivation: Sepsis requires rapid detection but current machine learning models lack interpretability, limiting clinical trust despite their predictive capabilities.

Method: Integrates machine learning with clinical knowledge to create an interpretable AI approach for sepsis analysis.

Result: The method delivers accurate predictions of sepsis onset while enabling clinicians to understand, validate, and align model outputs with medical expertise.

Conclusion: The interpretable AI approach addresses the black-box limitation of existing models, enhancing clinical trust and practical utility in sepsis management.

Abstract: Sepsis is a life-threatening condition that requires rapid detection and treatment to prevent progression to severe sepsis, septic shock, or multi-organ failure. Despite advances in medical technology, it remains a major challenge for clinicians. While recent machine learning models have shown promise in predicting sepsis onset, their black-box nature limits interpretability and clinical trust. In this study, we present an interpretable AI approach for sepsis analysis that integrates machine learning with clinical knowledge. Our method not only delivers accurate predictions of sepsis onset but also enables clinicians to understand, validate, and align model outputs with established medical expertise.

[753] Learning Time-Varying Graph Signals via Koopman

Sivaram Krishnan, Jinho Choi, Jihong Park

Main category: cs.LG

TL;DR: Proposes a Koopman autoencoder framework for modeling time-varying graph data by learning non-linear dynamics in latent space for prediction and reconstruction tasks.

Details

Motivation: Real-world data like sensor measurements and UAV trajectories form time-varying graphs with non-Euclidean structures, requiring effective modeling for predicting evolution and reconstructing missing data.

Method: Convert graph data to vector time series via graph embedding, then apply Koopman autoencoder to learn non-linear dynamics in latent space for temporal evolution modeling.

Result: Framework enables prediction of graph evolution and reconstruction of missing graph data by capturing underlying non-linear dynamics.

Conclusion: The Koopman autoencoder provides an effective approach for handling time-varying graph data by learning latent space dynamics for both prediction and reconstruction applications.

Abstract: A wide variety of real-world data, such as sea measurements, e.g., temperatures collected by distributed sensors and multiple unmanned aerial vehicles (UAV) trajectories, can be naturally represented as graphs, often exhibiting non-Euclidean structures. These graph representations may evolve over time, forming time-varying graphs. Effectively modeling and analyzing such dynamic graph data is critical for tasks like predicting graph evolution and reconstructing missing graph data. In this paper, we propose a framework based on the Koopman autoencoder (KAE) to handle time-varying graph data. Specifically, we assume the existence of a hidden non-linear dynamical system, where the state vector corresponds to the graph embedding of the time-varying graph signals. To capture the evolving graph structures, the graph data is first converted into a vector time series through graph embedding, representing the structural information in a finite-dimensional latent space. In this latent space, the KAE is applied to learn the underlying non-linear dynamics governing the temporal evolution of graph features, enabling both prediction and reconstruction tasks.

[754] Route Experts by Sequence, not by Token

Tiansheng Wen, Yifei Wang, Aosong Feng, Long Ma, Xinyang Liu, Yifan Wang, Lixuan Guo, Bo Chen, Stefanie Jegelka, Chenyu You

Main category: cs.LG

TL;DR: SeqTopK is a simple modification to MoE routing that shifts expert budget from token level to sequence level, enabling dynamic expert allocation based on token complexity while maintaining the same overall budget.

Details

Motivation: Standard TopK routing assigns the same fixed number of experts to all tokens, ignoring varying token complexity. Prior adaptive methods require additional modules and costly retraining.

Method: Sequence-level TopK routing selects top T·K experts across all T tokens in a sequence, allowing end-to-end learned dynamic allocation where difficult tokens get more experts and easy tokens get fewer.

Result: Experiments across math, coding, law, and writing show consistent improvements over TopK and prior parameter-free methods, with gains up to 16.9% under higher sparsity.

Conclusion: SeqTopK is a simple, efficient, and scalable routing strategy particularly suited for extreme sparsity regimes in next-generation LLMs, requiring minimal code changes and adding less than 1% overhead.

Abstract: Mixture-of-Experts (MoE) architectures scale large language models (LLMs) by activating only a subset of experts per token, but the standard TopK routing assigns the same fixed number of experts to all tokens, ignoring their varying complexity. Prior adaptive routing methods introduce additional modules and hyperparameters, often requiring costly retraining from scratch. We propose Sequence-level TopK (SeqTopK), a minimal modification that shifts the expert budget from the token level to the sequence level. By selecting the top $T \cdot K$ experts across all $T$ tokens, SeqTopK enables end-to-end learned dynamic allocation – assigning more experts to difficult tokens and fewer to easy ones – while preserving the same overall budget. SeqTopK requires only a few lines of code, adds less than 1% overhead, and remains fully compatible with pretrained MoE models. Experiments across math, coding, law, and writing show consistent improvements over TopK and prior parameter-free adaptive methods, with gains that become substantially larger under higher sparsity (up to 16.9%). These results highlight SeqTopK as a simple, efficient, and scalable routing strategy, particularly well-suited for the extreme sparsity regimes of next-generation LLMs. Code is available at https://github.com/Y-Research-SBU/SeqTopK.

[755] Probably Approximately Global Robustness Certification

Peter Blohm, Patrick Indri, Thomas Gärtner, Sagar Malhotra

Main category: cs.LG

TL;DR: Proposes probabilistic guarantees for adversarial robustness by sampling ε-nets and using local robustness oracles, achieving dimensionality-independent certification.

Details

Motivation: Traditional formal verification for robustness is intractable, while sampling approaches lack formal guarantees. Need efficient probabilistic certification that scales to large neural networks.

Method: Sample an ε-net and invoke local robustness oracle on the sample. Sample size needed is independent of input dimensionality, number of classes, and learning algorithm.

Result: Approach certifies probabilistic relaxation of robustness better than state-of-the-art sampling methods and scales better than formal methods.

Conclusion: Efficient probabilistic robustness certification method that works for large neural networks beyond traditional verification scope, with empirical validation.

Abstract: We propose and investigate probabilistic guarantees for the adversarial robustness of classification algorithms. While traditional formal verification approaches for robustness are intractable and sampling-based approaches do not provide formal guarantees, our approach is able to efficiently certify a probabilistic relaxation of robustness. The key idea is to sample an $ε$-net and invoke a local robustness oracle on the sample. Remarkably, the size of the sample needed to achieve probably approximately global robustness guarantees is independent of the input dimensionality, the number of classes, and the learning algorithm itself. Our approach can, therefore, be applied even to large neural networks that are beyond the scope of traditional formal verification. Experiments empirically confirm that it characterizes robustness better than state-of-the-art sampling-based approaches and scales better than formal methods.

[756] Efficient Approximation of Volterra Series for High-Dimensional Systems

Navin Khoshnan, Claudia K Petritsch, Bryce-Allen Bagley

Main category: cs.LG

TL;DR: THA algorithm reduces computational complexity of high-dimensional nonlinear system identification by using ensemble of localized MVMALS models on input subsets, with theoretical error bounds and proven performance advantages over full model truncation.

Details

Motivation: Overcome prohibitive computational and memory bottlenecks in Tensor Network methods for high-dimensional nonlinear system identification, which suffer from high-order polynomial scaling with input dimension.

Method: Tensor Head Averaging (THA) algorithm constructs ensemble of localized MVMALS models trained on small subsets of input space, with theoretical analysis of error bounds and decomposition.

Result: Established finite-sample error bounds between THA ensemble and full MVMALS model, derived exact decomposition of squared error, and proved correlation-driven optimization incentive for superior accuracy over simple truncation.

Conclusion: THA provides scalable, theoretically grounded approach for identifying previously intractable high-dimensional nonlinear dynamical systems via Volterra series.

Abstract: The identification of high-dimensional nonlinear dynamical systems via the Volterra series has significant potential, but has been severely hindered by the curse of dimensionality. Tensor Network (TN) methods such as the Modified Alternating Linear Scheme (MVMALS) have been a breakthrough for the field, offering a tractable approach by exploiting the low-rank structure in Volterra kernels. However, these techniques still encounter prohibitive computational and memory bottlenecks due to high-order polynomial scaling with respect to input dimension. To overcome this barrier, we introduce the Tensor Head Averaging (THA) algorithm, which significantly reduces complexity by constructing an ensemble of localized MVMALS models trained on small subsets of the input space. In this paper, we present a theoretical foundation for the THA algorithm. We establish observable, finite-sample bounds on the error between the THA ensemble and a full MVMALS model, and we derive an exact decomposition of the squared error. This decomposition is used to analyze the manner in which subset models implicitly compensate for omitted dynamics. We quantify this effect, and prove that correlation between the included and omitted dynamics creates an optimization incentive which drives THA’s performance toward accuracy superior to a simple truncation of a full MVMALS model. THA thus offers a scalable and theoretically grounded approach for identifying previously intractable high-dimensional systems.

[757] TriShGAN: Enhancing Sparsity and Robustness in Multivariate Time Series Counterfactuals Explanation

Hongnan Ma, Yiwei Shi, Guanxiong Sun, Mengyue Yang, Weiru Liu

Main category: cs.LG

TL;DR: TriShGAN is a novel method that generates robust counterfactual explanations for multivariate time series by incorporating triplet loss and shapelet extraction to balance minimal cost and robustness.

Details

Motivation: Traditional counterfactual explanation methods for time series have limitations: NUN-based methods use unrealistic direct substitutions, while GAN-based methods focus only on minimal cost and neglect robustness against model changes.

Method: TriShGAN enhances the CounteRGAN framework with triplet loss for distance metric learning and integrates a Shapelet Extractor to select discriminative parts of time series, improving sparsity and training efficiency.

Result: The method generates counterfactual explanations that remain close to the original time series while capturing the feature distribution of desired outcomes, achieving better balance between minimal cost and robustness.

Conclusion: TriShGAN provides more robust counterfactual explanations for multivariate time series by addressing the limitations of existing methods through triplet loss and shapelet extraction.

Abstract: In decision-making processes, stakeholders often rely on counterfactual explanations, which provide suggestions about what should be changed in the queried instance to alter the outcome of an AI system. However, generating these explanations for multivariate time series presents challenges due to their complex, multi-dimensional nature. Traditional Nearest Unlike Neighbor-based methods typically substitute subsequences in a queried time series with influential subsequences from an NUN, which is not always realistic in real-world scenarios due to the rigid direct substitution. Counterfactual with Residual Generative Adversarial Networks-based methods aim to address this by learning from the distribution of observed data to generate synthetic counterfactual explanations. However, these methods primarily focus on minimizing the cost from the queried time series to the counterfactual explanations and often neglect the importance of distancing the counterfactual explanation from the decision boundary. This oversight can result in explanations that no longer qualify as counterfactual if minor changes occur within the model. To generate a more robust counterfactual explanation, we introduce TriShGAN, under the CounteRGAN framework enhanced by the incorporation of triplet loss. This unsupervised learning approach uses distance metric learning to encourage the counterfactual explanations not only to remain close to the queried time series but also to capture the feature distribution of the instance with the desired outcome, thereby achieving a better balance between minimal cost and robustness. Additionally, we integrate a Shapelet Extractor that strategically selects the most discriminative parts of the high-dimensional queried time series to enhance the sparsity of counterfactual explanation and efficiency of the training process.

[758] Bayesian Uncertainty Quantification with Anchored Ensembles for Robust EV Power Consumption Prediction

Ghazal Farhani, Taufiq Rahman, Kieran Humphries

Main category: cs.LG

TL;DR: An anchored-ensemble LSTM with Student-t likelihood for EV power estimation that jointly captures model and data uncertainty, providing accurate predictions with well-calibrated uncertainty bands and efficient deterministic inference.

Details

Motivation: Practitioners need both accurate point estimates and trustworthy uncertainty for EV range prediction and energy management, requiring methods that capture both epistemic (model) and aleatoric (data) uncertainty.

Method: Anchored-ensemble LSTM with Student-t likelihood using Gaussian weight prior (MAP training) for posterior-like diversity without test-time sampling, and t-head for heavy-tailed robustness and closed-form prediction intervals.

Result: Achieves strong accuracy: RMSE 3.36 +/- 1.10, MAE 2.21 +/- 0.89, R-squared = 0.93 +/- 0.02, explained variance 0.93 +/- 0.02, with well-calibrated uncertainty bands and near-nominal coverage. Outperforms baselines in log-scores while producing sharper intervals.

Conclusion: The method provides a compact, theoretically grounded estimator that couples accuracy, calibration, and systems efficiency for reliable EV range estimation and energy management, with efficient deterministic inference suitable for real-time deployment.

Abstract: Accurate EV power estimation underpins range prediction and energy management, yet practitioners need both point accuracy and trustworthy uncertainty. We propose an anchored-ensemble Long Short-Term Memory (LSTM) with a Student-t likelihood that jointly captures epistemic (model) and aleatoric (data) uncertainty. Anchoring imposes a Gaussian weight prior (MAP training), yielding posterior-like diversity without test-time sampling, while the t-head provides heavy-tailed robustness and closed-form prediction intervals. Using vehicle-kinematic time series (e.g., speed, motor RPM), our model attains strong accuracy: RMSE 3.36 +/- 1.10, MAE 2.21 +/- 0.89, R-squared = 0.93 +/- 0.02, explained variance 0.93 +/- 0.02, and delivers well-calibrated uncertainty bands with near-nominal coverage. Against competitive baselines (Student-t MC dropout; quantile regression with/without anchoring), our method matches or improves log-scores while producing sharper intervals at the same coverage. Crucially for real-time deployment, inference is a single deterministic pass per ensemble member (or a weight-averaged collapse), eliminating Monte Carlo latency. The result is a compact, theoretically grounded estimator that couples accuracy, calibration, and systems efficiency, enabling reliable range estimation and decision-making for production EV energy management.

[759] Practical Policy Distillation for Reinforcement Learning in Radio Access Networks

Sara Khosravi, Burak Demirel, Linghui Zhou, Javier Rasines, Pablo Soldati

Main category: cs.LG

TL;DR: Policy distillation enables deployment of lightweight AI models for link adaptation in RANs, overcoming hardware constraints while maintaining performance.

Details

Motivation: Address computational/memory limitations in legacy 4G RAN hardware that restrict AI model deployment, balancing the need for strong generalization with resource constraints.

Method: Two policy distillation strategies: single-policy (scenario-agnostic teacher compressed to student) and multi-policy (multiple scenario-specific teachers consolidated into one generalist student).

Result: Compact student models preserve teachers’ generalization capabilities while meeting computational/memory constraints of existing RAN hardware.

Conclusion: Policy distillation successfully bridges the gap between AI model performance requirements and RAN hardware limitations for link adaptation tasks.

Abstract: Adopting artificial intelligence (AI) in radio access networks (RANs) presents several challenges, including limited availability of link-level measurements (e.g., CQI reports), stringent real-time processing constraints (e.g., sub-1 ms per TTI), and network heterogeneity (different spectrum bands, cell types, and vendor equipment). A critical yet often overlooked barrier lies in the computational and memory limitations of RAN baseband hardware, particularly in legacy 4th Generation (4G) systems, which typically lack on-chip neural accelerators. As a result, only lightweight AI models (under 1 Mb and sub-100~μs inference time) can be effectively deployed, limiting both their performance and applicability. However, achieving strong generalization across diverse network conditions often requires large-scale models with substantial resource demands. To address this trade-off, this paper investigates policy distillation in the context of a reinforcement learning-based link adaptation task. We explore two strategies: single-policy distillation, where a scenario-agnostic teacher model is compressed into one generalized student model; and multi-policy distillation, where multiple scenario-specific teachers are consolidated into a single generalist student. Experimental evaluations in a high-fidelity, 5th Generation (5G)-compliant simulator demonstrate that both strategies produce compact student models that preserve the teachers’ generalization capabilities while complying with the computational and memory limitations of existing RAN hardware.

[760] Breaking the Dyadic Barrier: Rethinking Fairness in Link Prediction Beyond Demographic Parity

João Mattos, Debolina Halder Lina, Arlei Silva

Main category: cs.LG

TL;DR: The paper critiques existing dyadic fairness definitions in link prediction, showing they can hide subgroup biases, and proposes a new fairness assessment framework with a post-processing method that achieves better fairness-utility trade-offs.

Details

Motivation: Current fairness approaches in link prediction use dyadic definitions that may obscure underlying subgroup disparities and demographic parity doesn't suit ranking-based tasks, potentially allowing systemic biases to persist undetected.

Method: Proposes a new fairness assessment framework and a lightweight post-processing method combined with decoupled link predictors to mitigate bias effectively.

Result: The proposed method achieves state-of-the-art fairness-utility trade-offs, effectively addressing limitations of existing fairness evaluations in link prediction.

Conclusion: Existing dyadic fairness definitions are insufficient for link prediction tasks, and the proposed framework with post-processing provides a more expressive fairness assessment while maintaining good utility.

Abstract: Link prediction is a fundamental task in graph machine learning with applications, ranging from social recommendation to knowledge graph completion. Fairness in this setting is critical, as biased predictions can exacerbate societal inequalities. Prior work adopts a dyadic definition of fairness, enforcing fairness through demographic parity between intra-group and inter-group link predictions. However, we show that this dyadic framing can obscure underlying disparities across subgroups, allowing systemic biases to go undetected. Moreover, we argue that demographic parity does not meet desired properties for fairness assessment in ranking-based tasks such as link prediction. We formalize the limitations of existing fairness evaluations and propose a framework that enables a more expressive assessment. Additionally, we propose a lightweight post-processing method combined with decoupled link predictors that effectively mitigates bias and achieves state-of-the-art fairness-utility trade-offs.

[761] Optimistic Online-to-Batch Conversions for Accelerated Convergence and Universality

Yu-Hu Yan, Peng Zhao, Zhi-Hua Zhou

Main category: cs.LG

TL;DR: Novel optimistic online-to-batch conversions that simplify online algorithm design while preserving optimal convergence rates for offline convex optimization with smooth objectives.

Details

Motivation: To understand Nesterov's Accelerated Gradient method from the online learning perspective and develop simpler approaches that maintain optimal convergence.

Method: Proposed optimistic online-to-batch conversions that incorporate optimism into analysis, combined with simple online gradient descent and optimistic online algorithms.

Result: Achieved optimal accelerated convergence for smooth objectives, optimal accelerated rate for strongly convex objectives (first time via online-to-batch), and universality to smoothness without requiring smoothness coefficient knowledge.

Conclusion: The optimistic online-to-batch conversions effectively simplify algorithm design while preserving optimal convergence rates and demonstrate precise correspondence with NAG.

Abstract: In this work, we study offline convex optimization with smooth objectives, where the classical Nesterov’s Accelerated Gradient (NAG) method achieves the optimal accelerated convergence. Extensive research has aimed to understand NAG from various perspectives, and a recent line of work approaches this from the viewpoint of online learning and online-to-batch conversion, emphasizing the role of optimistic online algorithms for acceleration. In this work, we contribute to this perspective by proposing novel optimistic online-to-batch conversions that incorporate optimism theoretically into the analysis, thereby significantly simplifying the online algorithm design while preserving the optimal convergence rates. Specifically, we demonstrate the effectiveness of our conversions through the following results: (i) when combined with simple online gradient descent, our optimistic conversion achieves the optimal accelerated convergence; (ii) our conversion also applies to strongly convex objectives, and by leveraging both optimistic online-to-batch conversion and optimistic online algorithms, we achieve the optimal accelerated convergence rate for strongly convex and smooth objectives, for the first time through the lens of online-to-batch conversion; (iii) our optimistic conversion can achieve universality to smoothness – applicable to both smooth and non-smooth objectives without requiring knowledge of the smoothness coefficient – and remains efficient as non-universal methods by using only one gradient query in each iteration. Finally, we highlight the effectiveness of our optimistic online-to-batch conversions by a precise correspondence with NAG.

[762] Adaptive Initial Residual Connections for GNNs with Theoretical Guarantees

Mohammad Shirzadi, Ali Safarpoor Dehkordi, Ahad N. Zehmakan

Main category: cs.LG

TL;DR: The paper proposes an adaptive residual scheme for graph neural networks that prevents oversmoothing by allowing different nodes to have varying residual strengths, with theoretical guarantees and improved performance on heterophilic graphs.

Details

Motivation: Standard message passing in deep graph neural networks leads to diminished expressiveness and oversmoothing, where node embeddings become too similar. Residual connections help but current approaches use fixed strengths across all nodes.

Method: An adaptive residual scheme where different nodes have varying residual strengths, with theoretical analysis showing it prevents oversmoothing. Also introduces a heuristic variant without learnable parameters for better time complexity.

Result: Theoretical proof that adaptive residual connections prevent oversmoothing by keeping Dirichlet energy bounded away from zero. Experimental results show superior performance over standard and state-of-the-art methods, especially on heterophilic graphs.

Conclusion: Adaptive residual connections effectively prevent oversmoothing in deep graph neural networks and outperform existing methods, with both learnable and heuristic variants providing strong performance.

Abstract: Message passing is the core operation in graph neural networks, where each node updates its embeddings by aggregating information from its neighbors. However, in deep architectures, this process often leads to diminished expressiveness. A popular solution is to use residual connections, where the input from the current (or initial) layer is added to aggregated neighbor information to preserve embeddings across layers. Following a recent line of research, we investigate an adaptive residual scheme in which different nodes have varying residual strengths. We prove that this approach prevents oversmoothing; particularly, we show that the Dirichlet energy of the embeddings remains bounded away from zero. This is the first theoretical guarantee not only for the adaptive setting, but also for static residual connections (where residual strengths are shared across nodes) with activation functions. Furthermore, extensive experiments show that this adaptive approach outperforms standard and state-of-the-art message passing mechanisms, especially on heterophilic graphs. To improve the time complexity of our approach, we introduce a variant in which residual strengths are not learned but instead set heuristically, a choice that performs as well as the learnable version.

[763] Explainable Probabilistic Machine Learning for Predicting Drilling Fluid Loss of Circulation in Marun Oil Field

Seshu Kumar Damarla, Xiuli Zhu

Main category: cs.LG

TL;DR: Probabilistic machine learning using Gaussian Process Regression for predicting drilling fluid loss with uncertainty quantification and interpretability through LIME.

Details

Motivation: Lost circulation is a major costly challenge in drilling causing wellbore instability and non-productive time, requiring accurate fluid loss prediction for improved safety and efficiency.

Method: Gaussian Process Regression framework with LBFGS hyperparameter optimization and LIME for interpretability to capture nonlinear dependencies and quantify predictive uncertainty.

Result: Enhanced reliability for high-risk decision-making, enabling proactive identification of lost-circulation risks and optimized design of lost circulation materials.

Conclusion: Explainable probabilistic learning shows potential for reducing operational uncertainties and improving drilling safety and efficiency through better fluid loss prediction.

Abstract: Lost circulation remains a major and costly challenge in drilling operations, often resulting in wellbore instability, stuck pipe, and extended non-productive time. Accurate prediction of fluid loss is therefore essential for improving drilling safety and efficiency. This study presents a probabilistic machine learning framework based on Gaussian Process Regression (GPR) for predicting drilling fluid loss in complex formations. The GPR model captures nonlinear dependencies among drilling parameters while quantifying predictive uncertainty, offering enhanced reliability for high-risk decision-making. Model hyperparameters are optimized using the Limited memory Broyden Fletcher Goldfarb Shanno (LBFGS) algorithm to ensure numerical stability and robust generalization. To improve interpretability, Local Interpretable Model agnostic Explanations (LIME) are employed to elucidate how individual features influence model predictions. The results highlight the potential of explainable probabilistic learning for proactive identification of lost-circulation risks, optimized design of lost circulation materials (LCM), and reduction of operational uncertainties in drilling applications.

[764] Beyond Fixed Depth: Adaptive Graph Neural Networks for Node Classification Under Varying Homophily

Asela Hevapathige, Asiri Wijesinghe, Ahad N. Zehmakan

Main category: cs.LG

TL;DR: The paper proposes an adaptive-depth GNN architecture that dynamically selects node-specific aggregation depths to handle both homophilic and heterophilic graphs, overcoming limitations of fixed-depth approaches.

Details

Motivation: Traditional GNNs perform poorly on heterophilic graphs where connected nodes have different labels, and existing methods use fixed aggregation depths that don't account for varying local homophily levels and neighborhood structures.

Method: Developed a theoretical framework linking local structure to propagation dynamics, then designed an adaptive-depth GNN that dynamically selects node-specific aggregation depths using theoretically grounded metrics.

Result: Extensive experiments show the approach consistently enhances performance of standard GNN backbones across diverse benchmarks, adapting well to both homophilic and heterophilic patterns.

Conclusion: The proposed adaptive-depth GNN architecture effectively addresses the limitations of fixed-depth approaches and generalizes well across both homophilic and heterophilic graph regimes.

Abstract: Graph Neural Networks (GNNs) have achieved significant success in addressing node classification tasks. However, the effectiveness of traditional GNNs degrades on heterophilic graphs, where connected nodes often belong to different labels or properties. While recent work has introduced mechanisms to improve GNN performance under heterophily, certain key limitations still exist. Most existing models apply a fixed aggregation depth across all nodes, overlooking the fact that nodes may require different propagation depths based on their local homophily levels and neighborhood structures. Moreover, many methods are tailored to either homophilic or heterophilic settings, lacking the flexibility to generalize across both regimes. To address these challenges, we develop a theoretical framework that links local structural and label characteristics to information propagation dynamics at the node level. Our analysis shows that optimal aggregation depth varies across nodes and is critical for preserving class-discriminative information. Guided by this insight, we propose a novel adaptive-depth GNN architecture that dynamically selects node-specific aggregation depths using theoretically grounded metrics. Our method seamlessly adapts to both homophilic and heterophilic patterns within a unified model. Extensive experiments demonstrate that our approach consistently enhances the performance of standard GNN backbones across diverse benchmarks.

[765] A Weak Penalty Neural ODE for Learning Chaotic Dynamics from Noisy Time Series

Xuyang Li, John Harlim, Romit Maulik

Main category: cs.LG

TL;DR: The paper proposes Weak-Penalty NODE (WP-NODE), a training strategy that combines weak and strong formulations to improve forecasting accuracy and robustness in chaotic dynamical systems with noisy data.

Details

Motivation: Real-world measurements are often corrupted by noise, which severely degrades data-driven models' performance, especially in chaotic systems where small errors amplify rapidly.

Method: Uses weak formulation as a penalty alongside classical strong formulation in neural ODE training, constraining the model with integrated residuals over temporal subdomains.

Result: WP-NODE achieves state-of-the-art forecasting accuracy and exceptional robustness across benchmark chaotic dynamical systems.

Conclusion: Combining weak and strong formulations in NODE training significantly enhances performance and noise robustness for chaotic system forecasting.

Abstract: Accurate forecasting of complex high-dimensional dynamical systems from observational data is essential for several applications across science and engineering. A key challenge, however, is that real-world measurements are often corrupted by noise, which severely degrades the performance of data-driven models. Particularly, in chaotic dynamical systems, where small errors amplify rapidly, it is challenging to identify a data-driven model from noisy data that achieves short-term accuracy while preserving long-term invariant properties. In this paper, we propose the use of the weak formulation as a complementary approach to the classical strong formulation of data-driven time-series forecasting models. Specifically, we focus on the neural ordinary differential equation (NODE) architecture. Unlike the standard strong formulation, which relies on the discretization of the NODE followed by optimization, the weak formulation constrains the model using a set of integrated residuals over temporal subdomains. While such a formulation yields an effective NODE model, we discover that the performance of a NODE can be further enhanced by employing this weak formulation as a penalty alongside the classical strong formulation-based learning. Through numerical demonstrations, we illustrate that our proposed training strategy, which we coined as the Weak-Penalty NODE (WP-NODE), achieves state-of-the-art forecasting accuracy and exceptional robustness across benchmark chaotic dynamical systems.

[766] CaberNet: Causal Representation Learning for Cross-Domain HVAC Energy Prediction

Kaiyuan Zhai, Jiacheng Cui, Zhehao Zhang, Junyu Xue, Yang Deng, Kui Wu, Guoming Tang

Main category: cs.LG

TL;DR: CaberNet is a causal deep sequence model that learns invariant representations for robust cross-domain HVAC energy prediction, achieving 22.9% NMSE reduction without requiring prior knowledge.

Details

Motivation: Cross-domain HVAC energy prediction is challenging due to data scarcity and heterogeneity across buildings, climate zones, and seasons, causing existing methods to overfit or rely on expert intervention.

Method: Integrates global feature gate with self-supervised Bernoulli regularization to identify causal features, and domain-wise training that balances domain contributions and promotes latent factor independence.

Result: Outperforms all baselines on real-world datasets from three climatically diverse buildings, achieving 22.9% reduction in normalized mean squared error.

Conclusion: CaberNet provides a purely data-driven solution for robust cross-domain HVAC energy prediction without requiring prior knowledge, demonstrating superior performance over existing methods.

Abstract: Cross-domain HVAC energy prediction is essential for scalable building energy management, particularly because collecting extensive labeled data for every new building is both costly and impractical. Yet, this task remains highly challenging due to the scarcity and heterogeneity of data across different buildings, climate zones, and seasonal patterns. In particular, buildings situated in distinct climatic regions introduce variability that often leads existing methods to overfit to spurious correlations, rely heavily on expert intervention, or compromise on data diversity. To address these limitations, we propose CaberNet, a causal and interpretable deep sequence model that learns invariant (Markov blanket) representations for robust cross-domain prediction. In a purely data-driven fashion and without requiring any prior knowledge, CaberNet integrates i) a global feature gate trained with a self-supervised Bernoulli regularization to distinguish superior causal features from inferior ones, and ii) a domain-wise training scheme that balances domain contributions, minimizes cross-domain loss variance, and promotes latent factor independence. We evaluate CaberNet on real-world datasets collected from three buildings located in three climatically diverse cities, and it consistently outperforms all baselines, achieving a 22.9% reduction in normalized mean squared error (NMSE) compared to the best benchmark. Our code is available at https://github.com/rickzky1001/CaberNet-CRL.

[767] Non-Rival Data as Rival Products: An Encapsulation-Forging Approach for Data Synthesis

Kaidong Wang, Jiale Li, Shao-Bo Lin, Yao Wang

Main category: cs.LG

TL;DR: EnFo framework creates rival synthetic data with asymmetric utility by encapsulating knowledge into a key model and forging data that overfits it, enabling strategic data sharing while preserving competitive advantage.

Details

Motivation: To solve the dilemma of data sharing where firms want to unlock data value but fear losing competitive advantage due to data's non-rival nature and existing synthesis methods creating symmetric utility.

Method: Two-stage framework: 1) Encapsulate predictive knowledge from original data into a designated key model, 2) Forge synthetic dataset by optimizing data to intentionally overfit the key model.

Result: EnFo demonstrates remarkable sample efficiency, matching original data’s performance with fraction of size, while providing robust privacy protection and resistance to misuse.

Conclusion: EnFo offers practical solution for firms to collaborate strategically without compromising core analytical advantage by transforming non-rival data into rival product with asymmetric utility.

Abstract: The non-rival nature of data creates a dilemma for firms: sharing data unlocks value but risks eroding competitive advantage. Existing data synthesis methods often exacerbate this problem by creating data with symmetric utility, allowing any party to extract its value. This paper introduces the Encapsulation-Forging (EnFo) framework, a novel approach to generate rival synthetic data with asymmetric utility. EnFo operates in two stages: it first encapsulates predictive knowledge from the original data into a designated ``key’’ model, and then forges a synthetic dataset by optimizing the data to intentionally overfit this key model. This process transforms non-rival data into a rival product, ensuring its value is accessible only to the intended model, thereby preventing unauthorized use and preserving the data owner’s competitive edge. Our framework demonstrates remarkable sample efficiency, matching the original data’s performance with a fraction of its size, while providing robust privacy protection and resistance to misuse. EnFo offers a practical solution for firms to collaborate strategically without compromising their core analytical advantage.

[768] Dual-branch Spatial-Temporal Self-supervised Representation for Enhanced Road Network Learning

Qinghong Guo, Yu Wang, Ji Cao, Tongya Zheng, Junshu Dai, Bingde Hu, Shunyu Liu, Canghong Jin

Main category: cs.LG

TL;DR: DST is a dual-branch spatial-temporal self-supervised framework for road network representation learning that addresses spatial heterogeneity and temporal dynamics through mix-hop graph convolution, hypergraph contrastive learning, and causal Transformer-based temporal modeling.

Details

Motivation: Spatial heterogeneity and temporal dynamics of road networks challenge the neighborhood smoothing mechanism of self-supervised GNNs in road network representation learning.

Method: DST uses a dual-branch approach: spatial branch with mix-hop transition matrix for graph convolution and hypergraph contrastive learning; temporal branch with causal Transformer for next token prediction on traffic sequences, regularized by weekday-weekend differentiation.

Result: Extensive experiments show DST outperforms state-of-the-art methods and excels in zero-shot learning scenarios due to comprehensive spatiotemporal modeling.

Conclusion: The proposed DST framework effectively addresses spatial heterogeneity and temporal dynamics in road networks through dual-branch spatial-temporal self-supervised learning, achieving superior performance and strong generalization capabilities.

Abstract: Road network representation learning (RNRL) has attracted increasing attention from both researchers and practitioners as various spatiotemporal tasks are emerging. Recent advanced methods leverage Graph Neural Networks (GNNs) and contrastive learning to characterize the spatial structure of road segments in a self-supervised paradigm. However, spatial heterogeneity and temporal dynamics of road networks raise severe challenges to the neighborhood smoothing mechanism of self-supervised GNNs. To address these issues, we propose a $\textbf{D}$ual-branch $\textbf{S}$patial-$\textbf{T}$emporal self-supervised representation framework for enhanced road representations, termed as DST. On one hand, DST designs a mix-hop transition matrix for graph convolution to incorporate dynamic relations of roads from trajectories. Besides, DST contrasts road representations of the vanilla road network against that of the hypergraph in a spatial self-supervised way. The hypergraph is newly built based on three types of hyperedges to capture long-range relations. On the other hand, DST performs next token prediction as the temporal self-supervised task on the sequences of traffic dynamics based on a causal Transformer, which is further regularized by differentiating traffic modes of weekdays from those of weekends. Extensive experiments against state-of-the-art methods verify the superiority of our proposed framework. Moreover, the comprehensive spatiotemporal modeling facilitates DST to excel in zero-shot learning scenarios.

[769] Neyman-Pearson Classification under Both Null and Alternative Distributions Shift

Mohammadreza M. Kalan, Yuyang Deng, Eitan J. Neugut, Samory Kpotufe

Main category: cs.LG

TL;DR: This paper addresses transfer learning in Neyman-Pearson classification, developing an adaptive procedure that handles distribution shifts in both μ₀ and μ₁ while avoiding negative transfer.

Details

Motivation: Transfer learning has been well-studied in traditional classification but receives less attention in imbalanced Neyman-Pearson classification, where both Type-I and Type-II errors must be controlled simultaneously. Existing methods only handle shifts in μ₁, while practical scenarios often involve shifts in both μ₀ and μ₁.

Method: The authors derive an adaptive procedure that guarantees improved Type-I and Type-II errors when the source data is informative, and automatically adapts when the source is uninformative to avoid negative transfer.

Result: The proposed procedure provides statistical guarantees for controlling both types of errors while being computationally efficient, as demonstrated through complementary computational guarantees.

Conclusion: The developed adaptive transfer learning approach effectively handles distribution shifts in both μ₀ and μ₁ for Neyman-Pearson classification, ensuring performance improvements when possible while preventing negative transfer when source data is uninformative.

Abstract: We consider the problem of transfer learning in Neyman-Pearson classification, where the objective is to minimize the error w.r.t. a distribution $μ_1$, subject to the constraint that the error w.r.t. a distribution $μ_0$ remains below a prescribed threshold. While transfer learning has been extensively studied in traditional classification, transfer learning in imbalanced classification such as Neyman-Pearson classification has received much less attention. This setting poses unique challenges, as both types of errors must be simultaneously controlled. Existing works address only the case of distribution shift in $μ_1$, whereas in many practical scenarios shifts may occur in both $μ_0$ and $μ_1$. We derive an adaptive procedure that not only guarantees improved Type-I and Type-II errors when the source is informative, but also automatically adapt to situations where the source is uninformative, thereby avoiding negative transfer. In addition to such statistical guarantees, the procedures is efficient, as shown via complementary computational guarantees.

[770] ML-EcoLyzer: Quantifying the Environmental Cost of Machine Learning Inference Across Frameworks and Hardware

Jose Marie Antonio Minoza, Rex Gregor Laylo, Christian F Villarin, Sebastian C. Ibanez

Main category: cs.LG

TL;DR: ML-EcoLyzer is a cross-framework tool that measures carbon, energy, thermal, and water costs of ML inference across various hardware, introducing an Environmental Sustainability Score (ESS) to quantify efficiency.

Details

Motivation: To address the poorly quantified environmental impact of machine learning inference, especially on low-resource hardware, and provide reliable sustainability metrics for model selection.

Method: Developed ML-EcoLyzer tool with adaptive monitoring and hardware-aware evaluation, supporting classical and modern models across CPUs, GPUs, and datacenter accelerators. Introduced ESS metric (effective parameters per gram of CO2). Evaluated 1,900+ inference configurations across diverse architectures, modalities, hardware, and precision levels.

Result: Quantization improves ESS, large accelerators can be inefficient for lightweight tasks, and even small models can have significant environmental costs when implemented suboptimally. Provides extensive empirical evaluation of inference environmental costs.

Conclusion: ML-EcoLyzer sets a standard for sustainability-conscious model selection and offers comprehensive environmental cost assessment for ML inference across diverse hardware and model configurations.

Abstract: Machine learning inference occurs at a massive scale, yet its environmental impact remains poorly quantified, especially on low-resource hardware. We present ML-EcoLyzer, a cross-framework tool for measuring the carbon, energy, thermal, and water costs of inference across CPUs, consumer GPUs, and datacenter accelerators. The tool supports both classical and modern models, applying adaptive monitoring and hardware-aware evaluation. We introduce the Environmental Sustainability Score (ESS), which quantifies the number of effective parameters served per gram of CO$_2$ emitted. Our evaluation covers over 1,900 inference configurations, spanning diverse model architectures, task modalities (text, vision, audio, tabular), hardware types, and precision levels. These rigorous and reliable measurements demonstrate that quantization enhances ESS, huge accelerators can be inefficient for lightweight applications, and even small models may incur significant costs when implemented suboptimally. ML-EcoLyzer sets a standard for sustainability-conscious model selection and offers an extensive empirical evaluation of environmental costs during inference.

[771] Improving Asset Allocation in a Fast Moving Consumer Goods B2B Company: An Interpretable Machine Learning Framework for Commercial Cooler Assignment Based on Multi-Tier Growth Targets

Renato Castro, Rodrigo Paredes, Douglas Kahn

Main category: cs.LG

TL;DR: Machine learning framework for predicting which beverage clients will deliver strong sales growth after receiving coolers, achieving AUC scores up to 0.898 and improving ROI over traditional methods.

Details

Motivation: Asset placement decisions in FMCG industry directly impact revenue, but machine learning for asset allocation guidance remains underexplored despite existing work on churn prediction and demand forecasting.

Method: Used private dataset of 3,119 B2B clients with 12 months sales data before/after cooler installation. Compared XGBoost, LightGBM, and CatBoost models with SHAP for interpretable feature analysis. Defined three growth thresholds: 10%, 30%, 50% sales volume growth year-over-year.

Result: Best model achieved AUC scores of 0.857, 0.877, and 0.898 across the three growth thresholds on validation set. Simulations showed improved ROI by better selecting growth-potential clients and avoiding assignments to non-growing clients.

Conclusion: The machine learning approach outperforms traditional volume-based methods, providing substantial business management recommendations for improving cooler allocation efficiency and cost savings.

Abstract: In the fast-moving consumer goods (FMCG) industry, deciding where to place physical assets, such as commercial beverage coolers, can directly impact revenue growth and execution efficiency. Although churn prediction and demand forecasting have been widely studied in B2B contexts, the use of machine learning to guide asset allocation remains relatively unexplored. This paper presents a framework focused on predicting which beverage clients are most likely to deliver strong returns in volume after receiving a cooler. Using a private dataset from a well-known Central American brewing and beverage company of 3,119 B2B traditional trade channel clients that received a cooler from 2022-01 to 2024-07, and tracking 12 months of sales transactions before and after cooler installation, three growth thresholds were defined: 10%, 30% and 50% growth in sales volume year over year. The analysis compares results of machine learning models such as XGBoost, LightGBM, and CatBoost combined with SHAP for interpretable feature analysis in order to have insights into improving business operations related to cooler allocation; the results show that the best model has AUC scores of 0.857, 0.877, and 0.898 across the thresholds on the validation set. Simulations suggest that this approach can improve ROI because it better selects potential clients to grow at the expected level and increases cost savings by not assigning clients that will not grow, compared to traditional volume-based approaches with substantial business management recommendations

[772] Magnitude-Modulated Equivariant Adapter for Parameter-Efficient Fine-Tuning of Equivariant Graph Neural Networks

Dian Jin, Yancheng Yuan, Xiaoming Tao

Main category: cs.LG

TL;DR: MMEA is a novel equivariant fine-tuning method that uses lightweight scalar gating to modulate feature magnitudes per-order and per-multiplicity, achieving state-of-the-art performance while preserving strict equivariance.

Details

Motivation: Existing parameter-efficient fine-tuning (PEFT) methods break symmetry in equivariant graph neural networks, and even ELoRA (the first equivariant PEFT) can perturb pretrained feature distributions due to high degrees of freedom.

Method: Magnitude-Modulated Equivariant Adapter (MMEA) employs lightweight scalar gating to modulate feature magnitudes on a per-order and per-multiplicity basis, preserving strict equivariance while adapting models.

Result: MMEA consistently improves energy and force predictions to state-of-the-art levels across multiple benchmarks while training fewer parameters than competing approaches.

Conclusion: Modulating channel magnitudes is sufficient to adapt equivariant models to new chemical environments without breaking symmetry, pointing toward a new paradigm for equivariant PEFT design.

Abstract: Pretrained equivariant graph neural networks based on spherical harmonics offer efficient and accurate alternatives to computationally expensive ab-initio methods, yet adapting them to new tasks and chemical environments still requires fine-tuning. Conventional parameter-efficient fine-tuning (PEFT) techniques, such as Adapters and LoRA, typically break symmetry, making them incompatible with those equivariant architectures. ELoRA, recently proposed, is the first equivariant PEFT method. It achieves improved parameter efficiency and performance on many benchmarks. However, the relatively high degrees of freedom it retains within each tensor order can still perturb pretrained feature distributions and ultimately degrade performance. To address this, we present Magnitude-Modulated Equivariant Adapter (MMEA), a novel equivariant fine-tuning method which employs lightweight scalar gating to modulate feature magnitudes on a per-order and per-multiplicity basis. We demonstrate that MMEA preserves strict equivariance and, across multiple benchmarks, consistently improves energy and force predictions to state-of-the-art levels while training fewer parameters than competing approaches. These results suggest that, in many practical scenarios, modulating channel magnitudes is sufficient to adapt equivariant models to new chemical environments without breaking symmetry, pointing toward a new paradigm for equivariant PEFT design.

[773] Dual-Pathway Fusion of EHRs and Knowledge Graphs for Predicting Unseen Drug-Drug Interactions

Franklin Lee, Tengfei Ma

Main category: cs.LG

TL;DR: First system that combines knowledge graphs with EHR data for drug-drug interaction prediction, enabling zero-shot inference on unseen drugs through teacher-student distillation.

Details

Motivation: Existing DDI models either rely on pharmacologic knowledge graphs (which fail on unseen drugs) or EHRs (which are noisy and site-dependent), creating a need for a hybrid approach.

Method: Teacher-student distillation: Fusion Teacher learns mechanism-specific relations from both KG and EHR data, while distilled Student generalizes to new drugs using only EHR data at inference, operating under shared pharmacologic mechanism ontology.

Result: Maintains precision across multi-institution test data, produces mechanism-specific predictions, reduces false alerts with comparable F1, misses fewer true interactions, and demonstrates zero-shot identification of clinically recognized mechanisms for KG-absent drugs.

Conclusion: System enables interpretable, auditable DDI alerts and supports real-world use in clinical decision support and pharmacovigilance through zero-shot capability on unseen drugs.

Abstract: Drug-drug interactions (DDIs) remain a major source of preventable harm, and many clinically important mechanisms are still unknown. Existing models either rely on pharmacologic knowledge graphs (KGs), which fail on unseen drugs, or on electronic health records (EHRs), which are noisy, temporal, and site-dependent. We introduce, to our knowledge, the first system that conditions KG relation scoring on patient-level EHR context and distills that reasoning into an EHR-only model for zero-shot inference. A fusion “Teacher” learns mechanism-specific relations for drug pairs represented in both sources, while a distilled “Student” generalizes to new or rarely used drugs without KG access at inference. Both operate under a shared ontology (set) of pharmacologic mechanisms (drug relations) to produce interpretable, auditable alerts rather than opaque risk scores. Trained on a multi-institution EHR corpus paired with a curated DrugBank DDI graph, and evaluated using a clinically aligned, decision-focused protocol with leakage-safe negatives that avoid artificially easy pairs, the system maintains precision across multi-institutuion test data, produces mechanism-specific, clinically consistent predictions, reduces false alerts (higher precision) at comparable overall detection performance (F1), and misses fewer true interactions compared to prior methods. Case studies further show zero-shot identification of clinically recognized CYP-mediated and pharmacodynamic mechanisms for drugs absent from the KG, supporting real-world use in clinical decision support and pharmacovigilance.

[774] An Adaptive Machine Learning Triage Framework for Predicting Alzheimer’s Disease Progression

Richard Hou, Shengpu Tang, Wei Jin

Main category: cs.LG

TL;DR: A two-stage ML framework that selectively uses costly PET/CSF biomarkers for MCI-to-AD prediction, reducing advanced testing by 20% while maintaining high accuracy comparable to using all features.

Details

Motivation: To address the cost-accuracy dilemma in Alzheimer's prediction where cognitive tests lack predictive power but PET scans and CSF biomarkers are prohibitively expensive for routine use.

Method: Two-stage machine learning framework that selectively obtains advanced costly features based on predicted ‘value of information’, applied to ADNI data for MCI-to-AD progression prediction.

Result: Reduces need for advanced testing by 20% while achieving test AUROC of 0.929, comparable to model using both basic and advanced features (AUROC=0.915, p=0.1010).

Conclusion: The framework presents an interpretable, data-driven approach that optimizes AD diagnostic pathways and balances accuracy with cost, making early AD prediction more accessible in real-world practice.

Abstract: Accurate predictions of conversion from mild cognitive impairment (MCI) to Alzheimer’s disease (AD) can enable effective personalized therapy. While cognitive tests and clinical data are routinely collected, they lack the predictive power of PET scans and CSF biomarker analysis, which are prohibitively expensive to obtain for every patient. To address this cost-accuracy dilemma, we design a two-stage machine learning framework that selectively obtains advanced, costly features based on their predicted “value of information”. We apply our framework to predict AD progression for MCI patients using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Our framework reduces the need for advanced testing by 20% while achieving a test AUROC of 0.929, comparable to the model that uses both basic and advanced features (AUROC=0.915, p=0.1010). We also provide an example interpretability analysis showing how one may explain the triage decision. Our work presents an interpretable, data-driven framework that optimizes AD diagnostic pathways and balances accuracy with cost, representing a step towards making early, reliable AD prediction more accessible in real-world practice. Future work should consider multiple categories of advanced features and larger-scale validation.

[775] Sensor Calibration Model Balancing Accuracy, Real-time, and Efficiency

Jinyong Yun, Hyungjin Kim, Seokho Ahn, Euijong Lee, Young-Duk Seo

Main category: cs.LG

TL;DR: Scare is an ultra-compressed transformer model for sensor calibration that simultaneously meets eight microscopic requirements by decomposing traditional accuracy, real-time, and efficiency goals into more granular metrics.

Details

Motivation: Existing on-device sensor calibration models only benchmark against three macroscopic requirements (accuracy, real-time, resource efficiency), hiding deployment bottlenecks like instantaneous error and worst-case latency.

Method: Scare uses three core components: Sequence Lens Projector for logarithmic time-series compression, Efficient Bitwise Attention replacing multiplications with bitwise operations via binary hash codes, and Hash optimization strategy for stable training without auxiliary losses.

Result: Extensive experiments on air-quality datasets and real microcontroller deployments show Scare outperforms linear, hybrid, and deep-learning baselines, meeting all eight microscopic requirements simultaneously.

Conclusion: Scare is the first model to fulfill all eight microscopic requirements for on-device sensor calibration, providing high accuracy while maintaining computational efficiency and real-time performance on MCUs.

Abstract: Most on-device sensor calibration studies benchmark models only against three macroscopic requirements (i.e., accuracy, real-time, and resource efficiency), thereby hiding deployment bottlenecks such as instantaneous error and worst-case latency. We therefore decompose this triad into eight microscopic requirements and introduce Scare (Sensor Calibration model balancing Accuracy, Real-time, and Efficiency), an ultra-compressed transformer that fulfills them all. SCARE comprises three core components: (1) Sequence Lens Projector (SLP) that logarithmically compresses time-series data while preserving boundary information across bins, (2) Efficient Bitwise Attention (EBA) module that replaces costly multiplications with bitwise operations via binary hash codes, and (3) Hash optimization strategy that ensures stable training without auxiliary loss terms. Together, these components minimize computational overhead while maintaining high accuracy and compatibility with microcontroller units (MCUs). Extensive experiments on large-scale air-quality datasets and real microcontroller deployments demonstrate that Scare outperforms existing linear, hybrid, and deep-learning baselines, making Scare, to the best of our knowledge, the first model to meet all eight microscopic requirements simultaneously.

Heshan Fernando, Parikshit Ram, Yi Zhou, Soham Dan, Horst Samulowitz, Nathalie Baracaldo, Tianyi Chen

Main category: cs.LG

TL;DR: Proposes a multi-objective optimization approach to solve imbalanced learning in multi-modal learning, achieving better performance with significantly reduced computation time.

Details

Motivation: Multi-modal learning often underperforms single-modality approaches due to imbalanced learning across modalities, and existing heuristics are computationally intensive.

Method: Reformulate multi-modal learning as a multi-objective optimization problem and develop a gradient-based algorithm to solve it.

Result: Improved performance over existing balanced MML and MOO baselines with up to ~20x reduction in subroutine computation time.

Conclusion: The proposed multi-objective optimization approach effectively addresses imbalanced learning in multi-modal learning while being computationally efficient.

Abstract: Multi-modal learning (MML) aims to integrate information from multiple modalities, which is expected to lead to superior performance over single-modality learning. However, recent studies have shown that MML can underperform, even compared to single-modality approaches, due to imbalanced learning across modalities. Methods have been proposed to alleviate this imbalance issue using different heuristics, which often lead to computationally intensive subroutines. In this paper, we reformulate the MML problem as a multi-objective optimization (MOO) problem that overcomes the imbalanced learning issue among modalities and propose a gradient-based algorithm to solve the modified MML problem. We provide convergence guarantees for the proposed method, and empirical evaluations on popular MML benchmarks showcasing the improved performance of the proposed method over existing balanced MML and MOO baselines, with up to ~20x reduction in subroutine computation time. Our code is available at https://github.com/heshandevaka/MIMO.

[777] Peeling Context from Cause for Multimodal Molecular Property Prediction

Tao Li, Kaiyuan Hou, Tuan Vinh, Carl Yang, Monika Raj

Main category: cs.LG

TL;DR: CLaP is a framework that separates causal signal from context in molecular property prediction, improving accuracy and interpretability by layerwise peeling of batch-coupled context and integrating multiple graph representations.

Details

Motivation: Deep models for molecular property prediction often rely on spurious context rather than causal structure, reducing reliability under distribution shift and harming predictive performance.

Method: CLaP performs layerwise soft splitting into causal/non-causal branches, fuses causal evidence across modalities, and progressively removes batch-coupled context to focus on label-relevant structure while limiting shortcut signals.

Result: Across four molecular benchmarks, CLaP consistently improves MAE, MSE, and R² over competitive baselines, and produces accurate atom-level causal saliency maps that align with chemical intuition.

Conclusion: By peeling context from cause at every layer, CLaP yields molecular predictors that are both accurate and interpretable for molecular design, providing actionable guidance for targeted molecular edits.

Abstract: Deep models are used for molecular property prediction, yet they are often difficult to interpret and may rely on spurious context rather than causal structure, which reduces reliability under distribution shift and harms predictive performance. We introduce CLaP (Causal Layerwise Peeling), a framework that separates causal signal from context in a layerwise manner and integrates diverse graph representations of molecules. At each layer, a causal block performs a soft split into causal and non-causal branches, fuses causal evidence across modalities, and progressively removes batch-coupled context to focus on label-relevant structure, thereby limiting shortcut signals and stabilizing layerwise refinement. Across four molecular benchmarks, CLaP consistently improves MAE, MSE, and $R^2$ over competitive baselines. The model also produces atom-level causal saliency maps that highlight substructures responsible for predictions, providing actionable guidance for targeted molecular edits. Case studies confirm the accuracy of these maps and their alignment with chemical intuition. By peeling context from cause at every layer, the model yields predictors that are both accurate and interpretable for molecular design.

[778] Rank-1 LoRAs Encode Interpretable Reasoning Signals

Jake Ward, Paul Riechers, Adam Shai

Main category: cs.LG

TL;DR: Reasoning models’ enhanced performance can be elicited by small rank-1 parameter changes, with 73-90% performance recovery on reasoning benchmarks using minimal LoRA adapters.

Details

Motivation: To understand the mechanisms behind reasoning models' enhanced performance and investigate whether these capabilities stem from minimal parameter changes rather than extensive modifications.

Method: Used rank-1 LoRA to create minimal parameter adapters for Qwen-2.5-32B-Instruct, analyzed activation interpretability, and trained sparse autoencoders on LoRA activations.

Result: Achieved 73-90% reasoning benchmark performance recovery compared to full parameter finetuning, found interpretable reasoning-specific activations, and identified fine-grained monosemantic features.

Conclusion: Reasoning performance largely arises from minimal parameter changes, and parameter-efficient methods can serve as targeted tools for understanding language model behavior and dynamics.

Abstract: Reasoning models leverage inference-time compute to significantly enhance the performance of language models on difficult logical tasks, and have become a dominating paradigm in frontier LLMs. Despite their wide adoption, the mechanisms underpinning the enhanced performance of these reasoning models are not well understood. In this work, we show that the majority of new capabilities in reasoning models can be elicited by small, single-rank changes to base model parameters, with many of these changes being interpretable. Specifically, we use a rank-1 LoRA to create a minimal parameter adapter for Qwen-2.5-32B-Instruct which recovers 73-90% of reasoning-benchmark performance compared to a full parameter finetune. We find that the activations of this LoRA are as interpretable as MLP neurons, and fire for reasoning-specific behaviors. Finally, we train a sparse autoencoder on the entire activation state of this LoRA and identify fine-grained and monosemantic features. Our findings highlight that reasoning performance can arise largely from minimal changes to base model parameters, and explore what these changes affect. More broadly, our work shows that parameter-efficient training methods can be used as a targeted lens for uncovering fundamental insights about language model behavior and dynamics.

[779] MobileLLM-Pro Technical Report

Patrick Huber, Ernie Chang, Wei Wen, Igor Fedorov, Tarek Elgamal, Hanxian Huang, Naveen Suda, Chinnadhurai Sankar, Vish Vogeti, Yanghan Wang, Alex Gladkov, Kai Sheng Tai, Abdelrahman Elogeel, Tarek Hefny, Vikas Chandra, Ahmed Aly, Anuj Kumar, Raghuraman Krishnamoorthi, Adithya Sagar

Main category: cs.LG

TL;DR: MobileLLM-Pro is a 1B parameter language model optimized for on-device deployment, achieving SOTA performance across 11 benchmarks while supporting 128K context windows and robust 4-bit quantization.

Details

Motivation: Need for efficient on-device language models around 1B parameters that can power low-latency AI applications on mobile/wearable devices while maintaining strong performance with long context windows.

Method: Four core innovations: (1) implicit positional distillation for long-context capabilities, (2) specialist model merging framework to fuse domain experts, (3) simulation-driven data mixing with utility estimation, (4) 4-bit quantization-aware training with self-distillation.

Result: Achieves state-of-the-art results across 11 benchmarks, significantly outperforming Gemma 3-1B and Llama 3.2-1B, supports 128K token context windows, and shows only minor performance regressions at 4-bit quantization.

Conclusion: MobileLLM-Pro demonstrates that compact 1B parameter models can achieve strong performance with long context support through novel architectural and training innovations, enabling practical on-device deployment.

Abstract: Efficient on-device language models around 1 billion parameters are essential for powering low-latency AI applications on mobile and wearable devices. However, achieving strong performance in this model class, while supporting long context windows and practical deployment remains a significant challenge. We introduce MobileLLM-Pro, a 1-billion-parameter language model optimized for on-device deployment. MobileLLM-Pro achieves state-of-the-art results across 11 standard benchmarks, significantly outperforming both Gemma 3-1B and Llama 3.2-1B, while supporting context windows of up to 128,000 tokens and showing only minor performance regressions at 4-bit quantization. These improvements are enabled by four core innovations: (1) implicit positional distillation, a novel technique that effectively instills long-context capabilities through knowledge distillation; (2) a specialist model merging framework that fuses multiple domain experts into a compact model without parameter growth; (3) simulation-driven data mixing using utility estimation; and (4) 4-bit quantization-aware training with self-distillation. We release our model weights and code to support future research in efficient on-device language models.

Evelyn Chee, Wynne Hsu, Mong Li Lee

Main category: cs.LG

TL;DR: A novel multi-modal continual learning framework using pre-trained models with cross-modality adapters and representation alignment to prevent catastrophic forgetting while integrating new multi-modal information.

Details

Motivation: Existing continual learning approaches focus on uni-modal data, but multi-modal learning offers benefits similar to human perception. Multi-modal continual learning faces challenges in integrating new information from various modalities while preventing catastrophic forgetting.

Method: Proposed a pre-trained model-based framework with cross-modality adapter using mixture-of-experts structure, representation alignment loss for robust multi-modal representations, and regularization to preserve knowledge from previous tasks.

Result: Experiments on multiple multi-modal datasets show the approach consistently outperforms baselines in both class-incremental and domain-incremental learning, achieving higher accuracy and reduced forgetting.

Conclusion: The proposed framework effectively addresses multi-modal continual learning challenges by enabling effective integration of new multi-modal information while preserving previously acquired knowledge.

Abstract: Continual learning is essential for adapting models to new tasks while retaining previously acquired knowledge. While existing approaches predominantly focus on uni-modal data, multi-modal learning offers substantial benefits by utilizing diverse sensory inputs, akin to human perception. However, multi-modal continual learning presents additional challenges, as the model must effectively integrate new information from various modalities while preventing catastrophic forgetting. In this work, we propose a pre-trained model-based framework for multi-modal continual learning. Our framework includes a novel cross-modality adapter with a mixture-of-experts structure to facilitate effective integration of multi-modal information across tasks. We also introduce a representation alignment loss that fosters learning of robust multi-modal representations, and regularize relationships between learned representations to preserve knowledge from previous tasks. Experiments on several multi-modal datasets demonstrate that our approach consistently outperforms baselines in both class-incremental and domain-incremental learning, achieving higher accuracy and reduced forgetting.

[781] Implicit Federated In-context Learning For Task-Specific LLM Fine-Tuning

Dongcheng Li, Junhan Chen, Aoxiang Zhou, Chunpei Li, Youquan Xian, Peng Liu, Xianxian Li

Main category: cs.LG

TL;DR: IFed-ICL is a federated in-context learning framework that converts client context examples into implicit vectors for distributed collaboration during inference, avoiding extensive parameter updates while improving performance on text classification tasks.

Details

Motivation: Address the challenge of enhancing large language models using private organizational data while avoiding depletion of public data, and reducing computational overhead of traditional federated learning approaches.

Method: Propose Implicit Federated In-Context Learning (IFed-ICL) that converts client local context examples into implicit vector representations, enabling distributed collaborative computation during inference and injecting model residual streams.

Result: Achieves outstanding performance across multiple text classification tasks, reduces data transmission and local computation compared to traditional methods, and enables efficient distributed context learning using local private data.

Conclusion: IFed-ICL provides an effective solution for leveraging private data to enhance model performance without extensive parameter updates, significantly improving task-specific performance while maintaining computational efficiency.

Abstract: As large language models continue to develop and expand, the extensive public data they rely on faces the risk of depletion. Consequently, leveraging private data within organizations to enhance the performance of large models has emerged as a key challenge. The federated learning paradigm, combined with model fine-tuning techniques, effectively reduces the number of trainable parameters. However,the necessity to process high-dimensional feature spaces results in substantial overall computational overhead. To address this issue, we propose the Implicit Federated In-Context Learning (IFed-ICL) framework. IFed-ICL draws inspiration from federated learning to establish a novel distributed collaborative paradigm, by converting client local context examples into implicit vector representations, it enables distributed collaborative computation during the inference phase and injects model residual streams to enhance model performance. Experiments demonstrate that our proposed method achieves outstanding performance across multiple text classification tasks. Compared to traditional methods, IFed-ICL avoids the extensive parameter updates required by conventional fine-tuning methods while reducing data transmission and local computation at the client level in federated learning. This enables efficient distributed context learning using local private-domain data, significantly improving model performance on specific tasks.

[782] Dual Mamba for Node-Specific Representation Learning: Tackling Over-Smoothing with Selective State Space Modeling

Xin He, Yili Wang, Yiwei Dai, Xin Wang

Main category: cs.LG

TL;DR: DMbaGCN integrates Mamba into GNNs to address over-smoothing using dual modules: LSEMba for local node-specific representation dynamics and GCAMba for global context awareness.

Details

Motivation: Existing solutions like residual connections and skip layers fail to explicitly model node-specific representation evolution across layers and don't incorporate global information, which is crucial for mitigating over-smoothing in deep GNNs.

Method: Proposed Dual Mamba-enhanced Graph Convolutional Network (DMbaGCN) with two modules: Local State-Evolution Mamba (LSEMba) for local neighborhood aggregation and node-specific representation dynamics, and Global Context-Aware Mamba (GCAMba) for incorporating global context using Mamba’s selective state space modeling and global attention capabilities.

Result: Extensive experiments on multiple benchmarks demonstrate the effectiveness and efficiency of DMbaGCN in enhancing node discriminability and mitigating over-smoothing in deep GNNs.

Conclusion: DMbaGCN successfully addresses over-smoothing from both local and global perspectives by integrating Mamba’s capabilities into GNNs, providing a novel framework that captures node-specific representation dynamics and global context simultaneously.

Abstract: Over-smoothing remains a fundamental challenge in deep Graph Neural Networks (GNNs), where repeated message passing causes node representations to become indistinguishable. While existing solutions, such as residual connections and skip layers, alleviate this issue to some extent, they fail to explicitly model how node representations evolve in a node-specific and progressive manner across layers. Moreover, these methods do not take global information into account, which is also crucial for mitigating the over-smoothing problem. To address the aforementioned issues, in this work, we propose a Dual Mamba-enhanced Graph Convolutional Network (DMbaGCN), which is a novel framework that integrates Mamba into GNNs to address over-smoothing from both local and global perspectives. DMbaGCN consists of two modules: the Local State-Evolution Mamba (LSEMba) for local neighborhood aggregation and utilizing Mamba’s selective state space modeling to capture node-specific representation dynamics across layers, and the Global Context-Aware Mamba (GCAMba) that leverages Mamba’s global attention capabilities to incorporate global context for each node. By combining these components, DMbaGCN enhances node discriminability in deep GNNs, thereby mitigating over-smoothing. Extensive experiments on multiple benchmarks demonstrate the effectiveness and efficiency of our method.

Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, Haibing Guan

Main category: cs.LG

TL;DR: QUARK is a quantization-enabled FPGA acceleration framework that reduces nonlinear operation latency in Transformer models through circuit sharing, achieving up to 1.96× speedup over GPUs while cutting hardware overhead by 50%+ and maintaining/improving accuracy.

Details

Motivation: Transformer models achieve SOTA performance but nonlinear operations create significant inference latency challenges for hardware acceleration, requiring efficient solutions.

Method: Proposes QUARK framework that leverages common patterns in nonlinear operations to enable efficient circuit sharing, using quantization and novel circuit-sharing design tailored for Transformer nonlinear operations.

Result: Achieves up to 1.96× end-to-end speedup over GPU implementations, reduces hardware overhead of nonlinear modules by >50% compared to prior approaches, and maintains high model accuracy while boosting accuracy under ultra-low-bit quantization.

Conclusion: QUARK effectively addresses nonlinear operation bottlenecks in Transformers through circuit sharing and quantization, providing significant performance improvements and hardware efficiency gains while preserving model quality.

Abstract: Transformer-based models have revolutionized computer vision (CV) and natural language processing (NLP) by achieving state-of-the-art performance across a range of benchmarks. However, nonlinear operations in models significantly contribute to inference latency, presenting unique challenges for efficient hardware acceleration. To this end, we propose QUARK, a quantization-enabled FPGA acceleration framework that leverages common patterns in nonlinear operations to enable efficient circuit sharing, thereby reducing hardware resource requirements. QUARK targets all nonlinear operations within Transformer-based models, achieving high-performance approximation through a novel circuit-sharing design tailored to accelerate these operations. Our evaluation demonstrates that QUARK significantly reduces the computational overhead of nonlinear operators in mainstream Transformer architectures, achieving up to a 1.96 times end-to-end speedup over GPU implementations. Moreover, QUARK lowers the hardware overhead of nonlinear modules by more than 50% compared to prior approaches, all while maintaining high model accuracy – and even substantially boosting accuracy under ultra-low-bit quantization.

[784] Data Trajectory Alignment for LLM Domain Adaptation: A Two-Phase Synthesis Framework for Telecommunications Mathematics

Zhicheng Zhou, Jing Li, Suming Qiu, Junjie Huang, Linyuan Qiu, Zhijie Sun

Main category: cs.LG

TL;DR: DTA is a two-phase data curation framework that aligns teacher solution processes with student inductive biases, achieving SOTA accuracy on telecom math tasks while improving efficiency for edge deployment.

Details

Motivation: Adapting LLMs to vertical domains like telecom is challenging due to scarce corpora and mobile/edge constraints. Current approaches struggle with low-information-density data and inefficient reasoning processes.

Method: Two-phase framework: Phase I synthesizes diverse candidates using teacher ensembles; Phase II rewrites solutions to align intermediate steps and style with student biases, then performs signal-aware exemplar selection via agreement checks and reflection-based judging.

Result: Achieved 72.45% pass@1 on TELEMATH, surpassing distilled-only training by +17.65 points and outperforming Qwen3-32B with thinking enabled by +2.94 points. Reduced energy per token by ~42% and latency by ~60% versus baselines.

Conclusion: Aligning solution processes enables compact, high-yield supervision that improves both accuracy and efficiency, offering practical domain adaptation for low-resource verticals beyond telecom.

Abstract: General-purpose large language models (LLMs) are increasingly deployed in verticals such as telecommunications, where adaptation is hindered by scarce, low-information-density corpora and tight mobile/edge constraints. We propose Data Trajectory Alignment (DTA), a two-phase, model-agnostic data curation framework that treats solution processes - not only final answers - as first-class supervision. Phase I (Initializing) synthesizes diverse, high-coverage candidates using an ensemble of strong teachers. Phase II (DTA) rewrites teacher solutions to align intermediate steps and presentation style with the target student’s inductive biases and then performs signal-aware exemplar selection via agreement checks and reflection-based judging. Instantiated on telecommunications mathematics (e.g., link budgets, SNR/AMC selection, and power-control feasibility), DTA yields state-of-the-art (SOTA) accuracy on TELEMATH without enabling explicit “thinking” modes: 72.45% pass@1, surpassing distilled-only training by +17.65 points and outperforming a strong baseline (Qwen3-32B with thinking enabled) by +2.94 points. Token-shift analyses indicate that DTA concentrates gains on logical-structural discourse markers rather than merely amplifying domain nouns, indicating improved reasoning scaffolding. Under edge-like inference settings, DTA improves efficiency by reducing reliance on multi-sample voting and disabling expensive reasoning heuristics, cutting energy per output token by ~42% versus Qwen3-32B (thinking mode enabled) and end-to-end latency by ~60% versus Qwen3-32B (thinking mode disabled). These results demonstrate that aligning how solutions are produced enables compact, high-yield supervision that is effective for both accuracy and efficiency, offering a practical recipe for domain adaptation in low-resource verticals beyond telecom.

[785] On the Mechanisms of Collaborative Learning in VAE Recommenders

Tung-Long Vuong, Julien Monteil, Hien Dang, Volodymyr Vaskovych, Trung Le, Vu Nguyen

Main category: cs.LG

TL;DR: The paper analyzes how collaboration works in VAE-based recommendation systems, showing it depends on latent proximity. It studies local vs global collaboration trade-offs and proposes an anchor regularizer to improve global consistency while preserving user identity.

Details

Motivation: To theoretically understand how collaboration arises in VAE-based collaborative filtering, particularly the effects of binary input masking which improves performance but lacks theoretical exploration.

Method: Analyzed latent proximity and derived a latent sharing radius; studied local vs global collaboration; compared β-KL regularization and input masking mechanisms; proposed anchor regularizer to align user posteriors with item embeddings.

Result: Showed collaboration is governed by latent proximity with influence decaying as latent Wasserstein distance increases; validated analyses on Netflix, MovieLens-20M, and Million Song datasets; successfully deployed algorithm on Amazon streaming platform.

Conclusion: VAE-based CF primarily exploits local collaboration; global mixing can be improved through careful regularization; anchor regularizer effectively stabilizes users under masking and enables better signal sharing across items.

Abstract: Variational Autoencoders (VAEs) are a powerful alternative to matrix factorization for recommendation. A common technique in VAE-based collaborative filtering (CF) consists in applying binary input masking to user interaction vectors, which improves performance but remains underexplored theoretically. In this work, we analyze how collaboration arises in VAE-based CF and show it is governed by latent proximity: we derive a latent sharing radius that informs when an SGD update on one user strictly reduces the loss on another user, with influence decaying as the latent Wasserstein distance increases. We further study the induced geometry: with clean inputs, VAE-based CF primarily exploits \emph{local} collaboration between input-similar users and under-utilizes global collaboration between far-but-related users. We compare two mechanisms that encourage \emph{global} mixing and characterize their trade-offs: (1) $β$-KL regularization directly tightens the information bottleneck, promoting posterior overlap but risking representational collapse if too large; (2) input masking induces stochastic geometric contractions and expansions, which can bring distant users onto the same latent neighborhood but also introduce neighborhood drift. To preserve user identity while enabling global consistency, we propose an anchor regularizer that aligns user posteriors with item embeddings, stabilizing users under masking and facilitating signal sharing across related items. Our analyses are validated on the Netflix, MovieLens-20M, and Million Song datasets. We also successfully deployed our proposed algorithm on an Amazon streaming platform following a successful online experiment.

[786] Resource Efficient Sleep Staging via Multi-Level Masking and Prompt Learning

Lejun Ai, Yulong Li, Haodong Yi, Jixuan Xie, Yue Wang, Jia Liu, Min Chen, Rui Wang

Main category: cs.LG

TL;DR: MASS is a novel framework for resource-efficient sleep staging that uses masking and prompt learning to maintain reliable classification with limited EEG data, achieving state-of-the-art performance especially in low-data scenarios.

Details

Motivation: Existing sleep staging methods require long continuous EEG recordings, which is challenging for wearable and home-based monitoring systems. The paper aims to reduce signal collection while maintaining performance.

Method: Proposes Mask-Aware Sleep Staging (MASS) with multi-level masking strategy and hierarchical prompt learning mechanism that aggregates unmasked data into a global prompt to guide feature modeling.

Result: Evaluated on four datasets, MASS demonstrates state-of-the-art performance, particularly when data is very limited.

Conclusion: MASS shows strong potential for efficient and scalable deployment in real-world low-resource sleep monitoring environments.

Abstract: Automatic sleep staging plays a vital role in assessing sleep quality and diagnosing sleep disorders. Most existing methods rely heavily on long and continuous EEG recordings, which poses significant challenges for data acquisition in resource-constrained systems, such as wearable or home-based monitoring systems. In this paper, we propose the task of resource-efficient sleep staging, which aims to reduce the amount of signal collected per sleep epoch while maintaining reliable classification performance. To solve this task, we adopt the masking and prompt learning strategy and propose a novel framework called Mask-Aware Sleep Staging (MASS). Specifically, we design a multi-level masking strategy to promote effective feature modeling under partial and irregular observations. To mitigate the loss of contextual information introduced by masking, we further propose a hierarchical prompt learning mechanism that aggregates unmasked data into a global prompt, serving as a semantic anchor for guiding both patch-level and epoch-level feature modeling. MASS is evaluated on four datasets, demonstrating state-of-the-art performance, especially when the amount of data is very limited. This result highlights its potential for efficient and scalable deployment in real-world low-resource sleep monitoring environments.

Boyang Zhang, Daning Cheng, Yunquan Zhang

Main category: cs.LG

TL;DR: Geo-Sharing is a principled parameter sharing method that uses group theory and Hessian analysis to systematically determine optimal cross-layer sharing configurations, outperforming heuristic approaches with higher compression and lower accuracy loss.

Details

Motivation: Modern deep models have massive parameters causing high memory usage, but existing parameter sharing methods are heuristic-based, restricted to adjacent layers, and lack systematic analysis for cross-layer sharing, with exponentially growing configuration spaces making exhaustive search infeasible.

Method: Reformulate parameter sharing from group theory as introducing structural symmetries, using a coloring function to define sharing classes. Propose a second-order geometric criterion based on Taylor expansion and Hessian spectrum, projecting perturbations onto low-curvature eigensubspace to select sharing groups that minimize performance impact.

Result: Geo-Sharing consistently outperforms state-of-the-art heuristic sharing strategies across diverse architectures and tasks, achieving higher compression ratios with smaller accuracy degradation.

Conclusion: The geometric approach provides a principled and scalable configuration procedure for parameter sharing that effectively reduces model redundancy while maintaining performance.

Abstract: Modern deep models have massive parameter sizes, leading to high inference-time memory usage that limits practical deployment. Parameter sharing, a form of structured compression, effectively reduces redundancy, but existing approaches remain heuristic-restricted to adjacent layers and lacking a systematic analysis for cross-layer sharing. However, extending sharing across multiple layers leads to an exponentially expanding configuration space, making exhaustive search computationally infeasible and forming a critical bottleneck for parameter sharing. We recast parameter sharing from a group-theoretic perspective as introducing structural symmetries in the model’s parameter space. A sharing configuration can be described by a coloring function $α:L\rightarrow C$ (L: layer indices and C: sharing classes), which determines inter-layer sharing groups while preserving structural symmetry. To determine the coloring function, we propose a second-order geometric criterion based on Taylor expansion and the Hessian spectrum. By projecting perturbations onto the Hessian’s low-curvature eigensubspace, the criterion provides an analytic rule for selecting sharing groups that minimize performance impact, yielding a principled and scalable configuration procedure. Across diverse architectures and tasks, Geo-Sharing consistently outperforms state-of-the-art heuristic sharing strategies, achieving higher compression ratios with smaller accuracy degradation.

[788] Robust Causal Discovery under Imperfect Structural Constraints

Zidong Wang, Xi Lin, Chuchao He, Xiaoguang Gao

Main category: cs.LG

TL;DR: Proposes a robust causal discovery method that harmonizes imperfect prior knowledge with observational data through prior alignment and conflict resolution using multi-task learning.

Details

Motivation: Existing causal discovery methods fail when dealing with imperfect prior knowledge of unknown location and type, due to inflexible thresholding strategies that conflict with data distribution.

Method: Uses surrogate model to assess constraint credibility, sparse penalization for alignment, and multi-task learning with multi-gradient descent to resolve conflicts between knowledge-driven and data-driven objectives.

Result: Method demonstrates robustness across linear and nonlinear settings, various noise conditions, and structural equation models, showing effectiveness under imperfect structural constraints.

Conclusion: The proposed approach successfully addresses the challenge of robust causal discovery from observational data with imperfect prior knowledge through harmonization of knowledge and data.

Abstract: Robust causal discovery from observational data under imperfect prior knowledge remains a significant and largely unresolved challenge. Existing methods typically presuppose perfect priors or can only handle specific, pre-identified error types. And their performance degrades substantially when confronted with flawed constraints of unknown location and type. This decline arises because most of them rely on inflexible and biased thresholding strategies that may conflict with the data distribution. To overcome these limitations, we propose to harmonizes knowledge and data through prior alignment and conflict resolution. First, we assess the credibility of imperfect structural constraints through a surrogate model, which then guides a sparse penalization term measuring the loss between the learned and constrained adjacency matrices. We theoretically prove that, under ideal assumption, the knowledge-driven objective aligns with the data-driven objective. Furthermore, to resolve conflicts when this assumption is violated, we introduce a multi-task learning framework optimized via multi-gradient descent, jointly minimizing both objectives. Our proposed method is robust to both linear and nonlinear settings. Extensive experiments, conducted under diverse noise conditions and structural equation model types, demonstrate the effectiveness and efficiency of our method under imperfect structural constraints.

Kunhao Li, Wenhao Li, Di Wu, Lei Yang, Jun Bai, Ju Jia, Jason Xue

Main category: cs.LG

TL;DR: MIP-Editor is a multimodal machine unlearning approach that uses modality-specific attribution scores and influential-path-aware neuron editing to selectively forget targeted knowledge while preserving model utility across modalities.

Details

Motivation: Address privacy leakage, toxicity mitigation, and IP violations in Multimodal Large Language Models by enabling selective forgetting of specific knowledge while maintaining overall model performance.

Method: Proposes multimodal influential neuron path editor (MIP-Editor) with modality-specific attribution scores to identify influential neuron paths, and applies influential-path-aware neuron-editing via representation misdirection for coordinated forgetting across modalities.

Result: Achieves 87.75% maximum forgetting rate on multimodal tasks with 54.26% improvement in general knowledge retention, and up to 80.65% forgetting with 77.9% general performance preservation on textual tasks.

Conclusion: MIP-Editor effectively addresses multimodal unlearning challenges by enabling coordinated forgetting across modalities while preserving the model’s general capabilities, outperforming existing approaches.

Abstract: Multimodal Large Language Models (MLLMs) extend foundation models to real-world applications by integrating inputs such as text and vision. However, their broad knowledge capacity raises growing concerns about privacy leakage, toxicity mitigation, and intellectual property violations. Machine Unlearning (MU) offers a practical solution by selectively forgetting targeted knowledge while preserving overall model utility. When applied to MLLMs, existing neuron-editing-based MU approaches face two fundamental challenges: (1) forgetting becomes inconsistent across modalities because existing point-wise attribution methods fail to capture the structured, layer-by-layer information flow that connects different modalities; and (2) general knowledge performance declines when sensitive neurons that also support important reasoning paths are pruned, as this disrupts the model’s ability to generalize. To alleviate these limitations, we propose a multimodal influential neuron path editor (MIP-Editor) for MU. Our approach introduces modality-specific attribution scores to identify influential neuron paths responsible for encoding forget-set knowledge and applies influential-path-aware neuron-editing via representation misdirection. This strategy also enables effective and coordinated forgetting across modalities while preserving the model’s general capabilities. Experimental results demonstrate that MIP-Editor achieves a superior unlearning performance on multimodal tasks, with a maximum forgetting rate of 87.75% and up to 54.26% improvement in general knowledge retention. On textual tasks, MIP-Editor achieves up to 80.65% forgetting and preserves 77.9% of general performance. Codes are available at https://github.com/PreckLi/MIP-Editor.

[790] Recursive Dynamics in Fast-Weights Homeostatic Reentry Networks: Toward Reflective Intelligence

B. G. Chae

Main category: cs.LG

TL;DR: FH-RL integrates fast-weight memory, homeostatic regularization, and learned reentrant feedback to enable self-referential computation in neural networks without external looping during inference.

Details

Motivation: To enable internal recurrence and self-referential computation in neural networks, moving beyond purely feedforward transformer architectures during inference.

Method: Developed FH-RL with controlled experiments sweeping reentry gain γ, using three novel metrics (IRR, ESRI, RDP) to evaluate emergent internal dynamics and analyze the learned feedback matrix Wr.

Result: Reentry quantity increases with γ, while Wr remains bounded and becomes structured at moderate gains. A stable reflective band emerges at γ≈0.10-0.20 with maximal expressivity and spectral stability.

Conclusion: Reflective, thought-like internal processing arises from balanced feedback amplification and homeostatic regulation, linking fast-weight architectures to cortical reentry and recursive cognition theories.

Abstract: This study introduces the Fast-Weights Homeostatic Reentry Layer (FH-RL), a neural mechanism that integrates fast-weight associative memory, homeostatic regularization, and learned reentrant feedback to approximate self-referential computation in neural networks. Unlike standard transformer architectures that operate in a purely feedforward manner during inference, FH-RL enables internal recurrence without external looping, allowing prior latent states to be dynamically re-entered into the ongoing computation stream. We conduct controlled experiments sweeping the reentry gain $γ$ and evaluate emergent internal dynamics using three novel metrics: the Information Reentry Ratio (IRR), Eigen-Spectrum Recursion Index (ESRI), and Representational Drift Periodicity (RDP). Results show that reentry quantity increases proportionally with~$γ$, while the learned feedback matrix $W_r$ remains bounded and becomes more structured at moderate gains. Critically, a stable reflective band emerges around $γ\approx 0.10-0.20$, where internal feedback is maximally expressive yet spectrally stable: IRR rises smoothly, ESRI remains near zero, and RDP exhibits consistent low-frequency cycles. These findings provide quantitative evidence that reflective, thought-like internal processing can arise from a principled balance between feedback amplification and homeostatic regulation, linking modern fast-weight architectures to theories of cortical reentry and recursive cognition.

[791] Beyond Uniform Deletion: A Data Value-Weighted Framework for Certified Machine Unlearning

Lisong He, Yi Yang, Xiangyu Chang

Main category: cs.LG

TL;DR: DVWU is a machine unlearning framework that incorporates data value heterogeneity through a weighting strategy, enabling differentiated unlearning for data points with varying utility to improve model performance after deletion.

Details

Motivation: Existing machine unlearning algorithms treat all data points equally, ignoring that different data contribute unequally to model performance, which can degrade updated model performance when deleting heterogeneous data.

Method: Proposes Data Value-Weighted Unlearning (DVWU) with a weighting strategy based on data values, integrated into unlearning procedures. Implemented using one-step Newton update with output and objective perturbation algorithms for certified unlearning.

Result: Experiments on synthetic and real-world datasets show superior predictive performance and robustness compared to conventional unlearning approaches. Framework also extends to gradient ascent methods.

Conclusion: DVWU effectively addresses data value heterogeneity in machine unlearning, providing a general framework that can be adapted to various existing methods while maintaining better model performance after data deletion.

Abstract: As the right to be forgotten becomes legislated worldwide, machine unlearning mechanisms have emerged to efficiently update models for data deletion and enhance user privacy protection. However, existing machine unlearning algorithms frequently neglect the fact that different data points may contribute unequally to model performance (i.e., heterogeneous data values). Treat them equally in machine unlearning procedure can potentially degrading the performance of updated models. To address this limitation, we propose Data Value-Weighted Unlearning (DVWU), a general unlearning framework that accounts for data value heterogeneity into the unlearning process. Specifically, we design a weighting strategy based on data values, which are then integrated into the unlearning procedure to enable differentiated unlearning for data points with varying utility to the model. The DVWU framework can be broadly adapted to various existing machine unlearning methods. We use the one-step Newton update as an example for implementation, developing both output and objective perturbation algorithms to achieve certified unlearning. Experiments on both synthetic and real-world datasets demonstrate that our methods achieve superior predictive performance and robustness compared to conventional unlearning approaches. We further show the extensibility of our framework on gradient ascent method by incorporating the proposed weighting strategy into the gradient terms, highlighting the adaptability of DVWU for broader gradient-based deep unlearning methods.

[792] FedNET: Federated Learning for Proactive Traffic Management and Network Capacity Planning

Saroj Kumar Panda, Basabdatta Palit, Sadananda Behera

Main category: cs.LG

TL;DR: FedNET is a federated learning framework for early identification of high-risk network links using distributed multi-step traffic forecasting without exposing sensitive data.

Details

Motivation: To proactively identify potential high-risk links in communication networks while preserving privacy and enabling early warning for traffic engineering.

Method: Uses Federated Learning to model temporal evolution of node-level traffic, aggregates forecasts with routing info to estimate link utilization, ranks links by predicted load intensity and variability.

Result: FL achieves accuracy close to centralized training (R² >0.92 for short horizons, 0.45-0.55 for longer horizons), identifies high-risk links 3 days ahead of critical states.

Conclusion: FedNET is a practical tool for anticipatory traffic engineering and capacity planning, enabling privacy-preserving early detection of network bottlenecks.

Abstract: We propose FedNET, a proactive and privacy-preserving framework for early identification of high-risk links in large-scale communication networks, that leverages a distributed multi-step traffic forecasting method. FedNET employs Federated Learning (FL) to model the temporal evolution of node-level traffic in a distributed manner, enabling accurate multi-step-ahead predictions (e.g., several hours to days) without exposing sensitive network data. Using these node-level forecasts and known routing information, FedNET estimates the future link-level utilization by aggregating traffic contributions across all source-destination pairs. The links are then ranked according to the predicted load intensity and temporal variability, providing an early warning signal for potential high-risk links. We compare the federated traffic prediction of FedNET against a centralized multi-step learning baseline and then systematically analyze the impact of history and prediction window sizes on forecast accuracy using the $R^2$ score. Results indicate that FL achieves accuracy close to centralized training, with shorter prediction horizons consistently yielding the highest accuracy ($R^2 >0.92$), while longer horizons providing meaningful forecasts ($R^2 \approx 0.45\text{–}0.55$). We further validate the efficacy of the FedNET framework in predicting network utilization on a realistic network topology and demonstrate that it consistently identifies high-risk links well in advance (i.e., three days ahead) of the critical stress states emerging, making it a practical tool for anticipatory traffic engineering and capacity planning.

[793] Neural-Initialized Newton: Accelerating Nonlinear Finite Elements via Operator Learning

Kianoosh Taghikhani, Yusuke Yamazaki, Jerry Paul Varghese, Markus Apel, Reza Najian Asl, Shahed Rezaei

Main category: cs.LG

TL;DR: A hybrid approach combining neural operators with Newton-based correction to accelerate nonlinear solid mechanics simulations while maintaining accuracy.

Details

Motivation: To address the computational demands of standard Newton-Raphson solvers in computational solid mechanics while overcoming the accuracy limitations of pure neural operator approaches.

Method: Train physics-informed conditional neural fields to approximate solutions, then refine with Newton-based correction initialized by neural outputs, comparing three strategies: standard NFEM, neural operators, and the hybrid neural-initialized Newton approach.

Result: The neural-initialized Newton strategy reduces computational cost while preserving accuracy compared to pure neural operators or standard NFEM solvers.

Conclusion: The hybrid approach effectively combines neural operator efficiency with Newton method robustness, showing promise for accelerating large-scale nonlinear simulations in solid mechanics.

Abstract: We propose a Newton-based scheme, initialized by neural operator predictions, to accelerate the parametric solution of nonlinear problems in computational solid mechanics. First, a physics informed conditional neural field is trained to approximate the nonlinear parametric solutionof the governing equations. This establishes a continuous mapping between the parameter and solution spaces, which can then be evaluated for a given parameter at any spatial resolution. Second, since the neural approximation may not be exact, it is subsequently refined using a Newton-based correction initialized by the neural output. To evaluate the effectiveness of this hybrid approach, we compare three solution strategies: (i) the standard Newton-Raphson solver used in NFEM, which is robust and accurate but computationally demanding; (ii) physics-informed neural operators, which provide rapid inference but may lose accuracy outside the training distribution and resolution; and (iii) the neural-initialized Newton (NiN) strategy, which combines the efficiency of neural operators with the robustness of NFEM. The results demonstrate that the proposed hybrid approach reduces computational cost while preserving accuracy, highlighting its potential to accelerate large-scale nonlinear simulations.

[794] Controllable Flow Matching for Online Reinforcement Learning

Bin Wang, Boxiang Tao, Haifeng Jing, Hongbo Dou, Zijian Wang

Main category: cs.LG

TL;DR: CtrlFlow is a model-based RL method that uses conditional flow matching to directly model optimal trajectory distributions instead of environment dynamics, achieving better performance and sample efficiency than traditional MBRL approaches.

Details

Motivation: Traditional MBRL methods struggle with modeling stability due to accumulated model errors over long-horizon rollouts when explicitly modeling environment dynamics.

Method: Proposes CtrlFlow using conditional flow matching to directly model trajectory distributions from initial states to high-return terminal states, minimizing control energy via the Controllability Gramian Matrix.

Result: In online settings, CtrlFlow outperforms dynamics models on MuJoCo benchmarks and achieves superior sample efficiency compared to standard MBRL methods.

Conclusion: CtrlFlow provides a more stable and efficient alternative to traditional MBRL by directly modeling optimal trajectory distributions rather than environment dynamics.

Abstract: Model-based reinforcement learning (MBRL) typically relies on modeling environment dynamics for data efficiency. However, due to the accumulation of model errors over long-horizon rollouts, such methods often face challenges in maintaining modeling stability. To address this, we propose CtrlFlow, a trajectory-level synthetic method using conditional flow matching (CFM), which directly modeling the distribution of trajectories from initial states to high-return terminal states without explicitly modeling the environment transition function. Our method ensures optimal trajectory sampling by minimizing the control energy governed by the non-linear Controllability Gramian Matrix, while the generated diverse trajectory data significantly enhances the robustness and cross-task generalization of policy learning. In online settings, CtrlFlow demonstrates the better performance on common MuJoCo benchmark tasks than dynamics models and achieves superior sample efficiency compared to standard MBRL methods.

[795] DeepRWCap: Neural-Guided Random-Walk Capacitance Solver for IC Design

Hector R. Rodriguez, Jiechen Huang, Wenjian Yu

Main category: cs.LG

TL;DR: DeepRWCap is a machine learning-guided random walk solver for capacitance extraction that predicts transition quantities to guide walk steps, achieving high accuracy and significant speedup over existing methods.

Details

Motivation: Traditional Monte Carlo random walk methods struggle with modern semiconductor technologies due to challenges in unbiasedly sampling transition domains with multiple high-contrast dielectric materials.

Method: Uses a two-stage neural architecture with 3D convolutional networks for volumetric dielectric interactions and 2D depthwise separable convolutions for localized kernel behavior, incorporating grid-based positional encodings and cube symmetry considerations.

Result: Achieves mean relative error of 1.24±0.53% compared to commercial Raphael solver, with 23% average speedup over state-of-the-art Microwalk method (49% acceleration on complex designs with runtimes over 10s).

Conclusion: DeepRWCap effectively addresses sampling challenges in modern capacitance extraction through machine learning guidance, demonstrating both high accuracy and computational efficiency improvements.

Abstract: Monte Carlo random walk methods are widely used in capacitance extraction for their mesh-free formulation and inherent parallelism. However, modern semiconductor technologies with densely packed structures present significant challenges in unbiasedly sampling transition domains in walk steps with multiple high-contrast dielectric materials. We present DeepRWCap, a machine learning-guided random walk solver that predicts the transition quantities required to guide each step of the walk. These include Poisson kernels, gradient kernels, signs and magnitudes of weights. DeepRWCap employs a two-stage neural architecture that decomposes structured outputs into face-wise distributions and spatial kernels on cube faces. It uses 3D convolutional networks to capture volumetric dielectric interactions and 2D depthwise separable convolutions to model localized kernel behavior. The design incorporates grid-based positional encodings and structural design choices informed by cube symmetries to reduce learning redundancy and improve generalization. Trained on 100,000 procedurally generated dielectric configurations, DeepRWCap achieves a mean relative error of $1.24\pm0.53$% when benchmarked against the commercial Raphael solver on the self-capacitance estimation of 10 industrial designs spanning 12 to 55 nm nodes. Compared to the state-of-the-art stochastic difference method Microwalk, DeepRWCap achieves an average 23% speedup. On complex designs with runtimes over 10 s, it reaches an average 49% acceleration.

[796] Minimum Width of Deep Narrow Networks for Universal Approximation

Xiao-Song Yang, Qi Zhou, Xuan Zhou

Main category: cs.LG

TL;DR: The paper studies minimum width bounds for fully connected neural networks with universal approximation capability, establishing both lower and upper bounds for various activation functions including ELU, SELU, LeakyReLU, and ReLU.

Details

Motivation: Understanding the minimum width requirements for universal approximation is fundamental for neural network design and training, as it helps determine the minimal network architecture needed for function approximation.

Method: The authors use mathematical analysis and geometric approaches including the Poincaré-Miranda Theorem to prove width bounds. They show that ReLU can be approximated by other activation functions to establish bounds.

Result: Key results include: w_min ≤ max(2d_x+1, d_y) for ELU/SELU networks (tight when d_y=2d_x), d_x+1 ≤ w_min ≤ d_x+d_y for LeakyReLU/ELU/CELU/SELU/Softplus, and w_min ≥ d_y + 1_{d_x<d_y≤2d_x} for injective activations like ReLU.

Conclusion: The paper provides comprehensive width bounds for universal approximation across different activation functions, with novel geometric proofs offering more intuitive understanding of the minimum width requirements.

Abstract: Determining the minimum width of fully connected neural networks has become a fundamental problem in recent theoretical studies of deep neural networks. In this paper, we study the lower bounds and upper bounds of the minimum width required for fully connected neural networks in order to have universal approximation capability, which is important in network design and training. We show that $w_{min}\leq\max(2d_x+1, d_y)$ for networks with ELU, SELU, and the upper bound of this inequality is attained when $d_y=2d_x$, where $d_x$, $d_y$ denote the input and output dimensions, respectively. Besides, we show that $d_x+1\leq w_{min}\leq d_x+d_y$ for networks with LeakyReLU, ELU, CELU, SELU, Softplus, by proving that ReLU can be approximated by these activation functions. In addition, in the case that the activation function is injective or can be uniformly approximated by a sequence of injective functions (e.g., ReLU), we present a new proof of the inequality $w_{min}\ge d_y+\mathbf{1}_{d_x<d_y\leq2d_x}$ by constructing a more intuitive example via a new geometric approach based on Poincar$\acute{\text{e}}$-Miranda Theorem.

[797] MI-to-Mid Distilled Compression (M2M-DC): An Hybrid-Information-Guided-Block Pruning with Progressive Inner Slicing Approach to Model Compression

Lionel Levine, Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Majid Sarrafzadeh

Main category: cs.LG

TL;DR: M2M-DC is a two-scale compression framework that combines mutual information-guided block pruning with progressive channel slicing and staged knowledge distillation to create compact models that maintain or exceed teacher accuracy with significantly reduced computation.

Details

Motivation: To develop a practical compression framework that can effectively reduce model size and computational cost while preserving or improving accuracy, with broad applicability across different CNN architectures including residual and inverted-residual networks.

Method: Two-stage approach: 1) Rank residual blocks by label-aware mutual information and prune least informative blocks, 2) Alternate short knowledge distillation phases with stage-coherent channel slicing (stage planes and optional mid-channel trim) while preserving residual shape invariants.

Result: Achieved significant compression with maintained accuracy: ResNet-18: 85.46% Top-1 with 72% params and 63% GMacs reduction; ResNet-34: 85.02% Top-1 with 74% reduction in both params and GMacs; MobileNetV2: 68.54% Top-1 (+2.5 points over teacher) with 73% params and 76% GMacs reduction.

Conclusion: M2M-DC provides a compact, practical recipe for deployment-ready models that generalize across residual CNNs and inverted-residual families, achieving competitive accuracy at a fraction of the computational cost.

Abstract: We introduce MI-to-Mid Distilled Compression (M2M-DC), a two-scale, shape-safe compression framework that interleaves information-guided block pruning with progressive inner slicing and staged knowledge distillation (KD). First, M2M-DC ranks residual (or inverted-residual) blocks by a label-aware mutual information (MI) signal and removes the least informative units (structured prune-after-training). It then alternates short KD phases with stage-coherent, residual-safe channel slicing: (i) stage “planes” (co-slicing conv2 out-channels with the downsample path and next-stage inputs), and (ii) an optional mid-channel trim (conv1 out / bn1 / conv2 in). This targets complementary redundancy, whole computational motifs and within-stage width while preserving residual shape invariants. On CIFAR-100, M2M-DC yields a clean accuracy-compute frontier. For ResNet-18, we obtain 85.46% Top-1 with 3.09M parameters and 0.0139 GMacs (72% params, 63% GMacs vs. teacher; mean final 85.29% over three seeds). For ResNet-34, we reach 85.02% Top-1 with 5.46M params and 0.0195 GMacs (74% / 74% vs. teacher; mean final 84.62%). Extending to inverted-residuals, MobileNetV2 achieves a mean final 68.54% Top-1 at 1.71M params (27%) and 0.0186 conv GMacs (24%), improving over the teacher’s 66.03% by +2.5 points across three seeds. Because M2M-DC exposes only a thin, architecture-aware interface (blocks, stages, and down sample/skip wiring), it generalizes across residual CNNs and extends to inverted-residual families with minor legalization rules. The result is a compact, practical recipe for deployment-ready models that match or surpass teacher accuracy at a fraction of the compute.

[798] Beyond Observations: Reconstruction Error-Guided Irregularly Sampled Time Series Representation Learning

Jiexi Liu, Meng Cao, Songcan Chen

Main category: cs.LG

TL;DR: iTimER is a self-supervised pre-training framework for irregularly sampled time series that uses reconstruction errors to generate pseudo-observations for unobserved timestamps, improving representation learning.

Details

Motivation: Existing methods for irregularly sampled time series overlook reconstruction errors as learning signals, which can provide valuable information about unobserved values and data structure.

Method: Models reconstruction error distribution, generates pseudo-observations via error-last observation mixup, uses Wasserstein metric for distribution alignment, and incorporates contrastive learning.

Result: Consistently outperforms state-of-the-art methods on classification, interpolation, and forecasting tasks for irregularly sampled time series.

Conclusion: Reconstruction errors are valuable learning signals that can be effectively leveraged through the proposed iTimER framework to improve irregular time series representation learning.

Abstract: Irregularly sampled time series (ISTS), characterized by non-uniform time intervals with natural missingness, are prevalent in real-world applications. Existing approaches for ISTS modeling primarily rely on observed values to impute unobserved ones or infer latent dynamics. However, these methods overlook a critical source of learning signal: the reconstruction error inherently produced during model training. Such error implicitly reflects how well a model captures the underlying data structure and can serve as an informative proxy for unobserved values. To exploit this insight, we propose iTimER, a simple yet effective self-supervised pre-training framework for ISTS representation learning. iTimER models the distribution of reconstruction errors over observed values and generates pseudo-observations for unobserved timestamps through a mixup strategy between sampled errors and the last available observations. This transforms unobserved timestamps into noise-aware training targets, enabling meaningful reconstruction signals. A Wasserstein metric aligns reconstruction error distributions between observed and pseudo-observed regions, while a contrastive learning objective enhances the discriminability of learned representations. Extensive experiments on classification, interpolation, and forecasting tasks demonstrate that iTimER consistently outperforms state-of-the-art methods under the ISTS setting.

[799] Contact Wasserstein Geodesics for Non-Conservative Schrodinger Bridges

Andrea Testa, Soren Hauberg, Tamim Asfour, Leonel Rozo

Main category: cs.LG

TL;DR: Introduces non-conservative generalized Schrödinger bridge (NCGSB) using contact Hamiltonian mechanics to model energy-varying stochastic processes, with efficient contact Wasserstein geodesic (CWG) implementation via ResNet architecture.

Details

Motivation: Existing Schrödinger Bridge methods are limited by energy-conservation assumptions, preventing modeling of varying-energy phenomena in real-world stochastic processes.

Method: Proposes NCGSB based on contact Hamiltonian mechanics, parameterizes Wasserstein manifold, and implements contact Wasserstein geodesic (CWG) via ResNet architecture with non-iterative solver.

Result: Achieves near-linear complexity, supports guided generation through task-specific distance metrics, and demonstrates effectiveness on manifold navigation, molecular dynamics, and image generation tasks.

Conclusion: NCGSB provides a broader class of stochastic processes with richer intermediate dynamics, offering practical benefits and versatility across multiple applications.

Abstract: The Schrödinger Bridge provides a principled framework for modeling stochastic processes between distributions; however, existing methods are limited by energy-conservation assumptions, which constrains the bridge’s shape preventing it from model varying-energy phenomena. To overcome this, we introduce the non-conservative generalized Schrödinger bridge (NCGSB), a novel, energy-varying reformulation based on contact Hamiltonian mechanics. By allowing energy to change over time, the NCGSB provides a broader class of real-world stochastic processes, capturing richer and more faithful intermediate dynamics. By parameterizing the Wasserstein manifold, we lift the bridge problem to a tractable geodesic computation in a finite-dimensional space. Unlike computationally expensive iterative solutions, our contact Wasserstein geodesic (CWG) is naturally implemented via a ResNet architecture and relies on a non-iterative solver with near-linear complexity. Furthermore, CWG supports guided generation by modulating a task-specific distance metric. We validate our framework on tasks including manifold navigation, molecular dynamics predictions, and image generation, demonstrating its practical benefits and versatility.

[800] TuckA: Hierarchical Compact Tensor Experts for Efficient Fine-Tuning

Qifeng Lei, Zhiyong Yang, Qianqian Xu, Cong Hua, Peisong Wen, Qingming Huang

Main category: cs.LG

TL;DR: TuckA is a parameter-efficient fine-tuning method that uses Tucker decomposition to create multiple adaptation experts in a compact 3D tensor structure, enabling efficient handling of diverse data patterns while maintaining low parameter count.

Details

Motivation: Traditional PEFT methods use single adaptation weights per layer, which struggle with complex tasks due to data diversity. A single adaptation weight cannot adequately capture features of all samples in diverse datasets.

Method: Uses Tucker decomposition to create compact 3D tensor experts, hierarchical grouping strategy for multi-granularity pattern capture, efficient batch-level routing to reduce parameters, and data-aware initialization for expert load balancing.

Result: Extensive experiments on natural language understanding, image classification, and mathematical reasoning benchmarks demonstrate TuckA’s efficacy, offering comparable performance to full fine-tuning with significantly fewer parameters.

Conclusion: TuckA provides an effective solution to PEFT by integrating multiple adaptation experts in a compact structure, enabling efficient handling of diverse data patterns while maintaining parameter efficiency.

Abstract: Efficiently fine-tuning pre-trained models for downstream tasks is a key challenge in the era of foundation models. Parameter-efficient fine-tuning (PEFT) presents a promising solution, achieving performance comparable to full fine-tuning by updating only a small number of adaptation weights per layer. Traditional PEFT methods typically rely on a single expert, where the adaptation weight is a low-rank matrix. However, for complex tasks, the data’s inherent diversity poses a significant challenge for such models, as a single adaptation weight cannot adequately capture the features of all samples. To address this limitation, we explore how to integrate multiple small adaptation experts into a compact structure to defeat a large adapter. Specifically, we propose Tucker Adaptation (TuckA), a method with four key properties: (i) We use Tucker decomposition to create a compact 3D tensor where each slice naturally serves as an expert. The low-rank nature of this decomposition ensures that the number of parameters scales efficiently as more experts are added. (ii) We introduce a hierarchical strategy that organizes these experts into groups at different granularities, allowing the model to capture both local and global data patterns. (iii) We develop an efficient batch-level routing mechanism, which reduces the router’s parameter size by a factor of $L$ compared to routing at every adapted layer (where $L$ is the number of adapted layers) (iv) We propose data-aware initialization to achieve loss-free expert load balancing based on theoretical analysis. Extensive experiments on benchmarks in natural language understanding, image classification, and mathematical reasoning speak to the efficacy of TuckA, offering a new and effective solution to the PEFT problem.

[801] DeepBooTS: Dual-Stream Residual Boosting for Drift-Resilient Time-Series Forecasting

Daojun Liang, Jing Chen, Xiao Wang, Yinglong Wang, Suo Li

Main category: cs.LG

TL;DR: DeepBooTS is a novel dual-stream residual-decreasing boosting method that progressively reconstructs intrinsic signals in time series forecasting, achieving 15.8% average performance improvement and enhanced robustness to concept drift.

Details

Motivation: Time series exhibit pronounced non-stationarity, causing most forecasting methods to have compromised robustness to concept drift despite instance normalization. The paper analyzes concept drift through a bias-variance lens and proves weighted ensemble reduces variance without increasing bias.

Method: DeepBooTS is an end-to-end dual-stream residual-decreasing boosting method where each block of a deep model becomes an ensemble of learners with auxiliary output branches. Block-wise outputs correct residuals of previous blocks, enabling learning-driven decomposition of inputs and targets.

Result: Extensive experiments show DeepBooTS outperforms existing methods by a large margin, achieving 15.8% average performance improvement across various datasets and establishing a new benchmark for time series forecasting.

Conclusion: The proposed method enhances versatility and interpretability while substantially improving robustness to concept drift, making it a superior approach for time series forecasting in non-stationary environments.

Abstract: Time-Series (TS) exhibits pronounced non-stationarity. Consequently, most forecasting methods display compromised robustness to concept drift, despite the prevalent application of instance normalization. We tackle this challenge by first analysing concept drift through a bias-variance lens and proving that weighted ensemble reduces variance without increasing bias. These insights motivate DeepBooTS, a novel end-to-end dual-stream residual-decreasing boosting method that progressively reconstructs the intrinsic signal. In our design, each block of a deep model becomes an ensemble of learners with an auxiliary output branch forming a highway to the final prediction. The block-wise outputs correct the residuals of previous blocks, leading to a learning-driven decomposition of both inputs and targets. This method enhances versatility and interpretability while substantially improving robustness to concept drift. Extensive experiments, including those on large-scale datasets, show that the proposed method outperforms existing methods by a large margin, yielding an average performance improvement of 15.8% across various datasets, establishing a new benchmark for TS forecasting.

[802] COGNOS: Universal Enhancement for Time Series Anomaly Detection via Constrained Gaussian-Noise Optimization and Smoothing

Wenlong Shang, Peng Chang

Main category: cs.LG

TL;DR: COGNOS is a universal framework that improves time series anomaly detection by regularizing model outputs to follow Gaussian white noise distribution and applying Kalman smoothing to denoise anomaly scores, achieving significant performance improvements.

Details

Motivation: Current reconstruction-based TSAD methods rely on MSE loss, which produces statistically flawed reconstruction residuals leading to noisy and unstable anomaly scores with poor signal-to-noise ratio, hindering reliable detection.

Method: COGNOS introduces Gaussian-White Noise Regularization during training to constrain model output residuals to follow Gaussian white noise distribution, combined with a Kalman Smoothing Post-processor that optimally denoises raw anomaly scores.

Result: Extensive experiments show COGNOS delivers an average F-score uplift of 57.9% when applied to 12 diverse backbone models across multiple real-world benchmark datasets.

Conclusion: Directly regularizing output statistics is a powerful and generalizable strategy for significantly improving anomaly detection systems.

Abstract: Reconstruction-based methods are a dominant paradigm in time series anomaly detection (TSAD), however, their near-universal reliance on Mean Squared Error (MSE) loss results in statistically flawed reconstruction residuals. This fundamental weakness leads to noisy, unstable anomaly scores with a poor signal-to-noise ratio, hindering reliable detection. To address this, we propose Constrained Gaussian-Noise Optimization and Smoothing (COGNOS), a universal, model-agnostic enhancement framework that tackles this issue at its source. COGNOS introduces a novel Gaussian-White Noise Regularization strategy during training, which directly constrains the model’s output residuals to conform to a Gaussian white noise distribution. This engineered statistical property creates the ideal precondition for our second contribution: a Kalman Smoothing Post-processor that provably operates as a statistically optimal estimator to denoise the raw anomaly scores. The synergy between these two components allows COGNOS to robustly separate the true anomaly signal from random fluctuations. Extensive experiments demonstrate that COGNOS is highly effective, delivering an average F-score uplift of 57.9% when applied to 12 diverse backbone models across multiple real-world benchmark datasets. Our work reveals that directly regularizing output statistics is a powerful and generalizable strategy for significantly improving anomaly detection systems.

[803] On The Presence of Double-Descent in Deep Reinforcement Learning

Viktor Veselý, Aleksandar Todorov, Matthia Sabatelli

Main category: cs.LG

TL;DR: The paper provides preliminary evidence that the double descent phenomenon exists in deep reinforcement learning, showing that over-parameterized models improve generalization past the interpolation point, with policy entropy reduction indicating implicit regularization.

Details

Motivation: The double descent paradox remains largely unexplored in non-stationary deep reinforcement learning domains, despite being well-studied in supervised learning.

Method: Systematically investigated double descent across varying model capacity using Actor-Critic framework, measuring policy uncertainty throughout training with an information-theoretic metric (Policy Entropy).

Result: Clear epoch-wise double descent curve observed; policy’s entry into second descent region correlates with sustained, significant reduction in Policy Entropy, suggesting over-parameterization acts as implicit regularizer.

Conclusion: Double descent is established as a factor in DRL, providing an information-based mechanism for designing more general, transferable, and robust agents through entropic decay indicating flatter minima in loss landscape.

Abstract: The double descent (DD) paradox, where over-parameterized models see generalization improve past the interpolation point, remains largely unexplored in the non-stationary domain of Deep Reinforcement Learning (DRL). We present preliminary evidence that DD exists in model-free DRL, investigating it systematically across varying model capacity using the Actor-Critic framework. We rely on an information-theoretic metric, Policy Entropy, to measure policy uncertainty throughout training. Preliminary results show a clear epoch-wise DD curve; the policy’s entrance into the second descent region correlates with a sustained, significant reduction in Policy Entropy. This entropic decay suggests that over-parameterization acts as an implicit regularizer, guiding the policy towards robust, flatter minima in the loss landscape. These findings establish DD as a factor in DRL and provide an information-based mechanism for designing agents that are more general, transferable, and robust.

[804] A Hybrid Autoencoder-Transformer Model for Robust Day-Ahead Electricity Price Forecasting under Extreme Conditions

Boyan Tang, Xuanhao Ren, Peng Xiao, Shunbo Lei, Xiaorong Sun, Jianghua Wu

Main category: cs.LG

TL;DR: Proposes a hybrid deep learning framework combining Distilled Attention Transformer and Autoencoder Self-regression Model for accurate day-ahead electricity price forecasting under extreme conditions.

Details

Motivation: Address challenges in electricity price forecasting caused by extreme conditions and market anomalies that existing methods struggle with.

Method: Integrates DAT model using self-attention to weight critical historical data segments and ASM using unsupervised learning to detect anomalous patterns from extreme conditions.

Result: Significantly outperforms state-of-the-art methods in prediction accuracy, robustness, and computational efficiency on California and Shandong datasets.

Conclusion: The framework enhances grid resilience and optimizes market operations in future power systems.

Abstract: Accurate day-ahead electricity price forecasting (DAEPF) is critical for the efficient operation of power systems, but extreme condition and market anomalies pose significant challenges to existing forecasting methods. To overcome these challenges, this paper proposes a novel hybrid deep learning framework that integrates a Distilled Attention Transformer (DAT) model and an Autoencoder Self-regression Model (ASM). The DAT leverages a self-attention mechanism to dynamically assign higher weights to critical segments of historical data, effectively capturing both long-term trends and short-term fluctuations. Concurrently, the ASM employs unsupervised learning to detect and isolate anomalous patterns induced by extreme conditions, such as heavy rain, heat waves, or human festivals. Experiments on datasets sampled from California and Shandong Province demonstrate that our framework significantly outperforms state-of-the-art methods in prediction accuracy, robustness, and computational efficiency. Our framework thus holds promise for enhancing grid resilience and optimizing market operations in future power systems.

[805] A Closer Look at Knowledge Distillation in Spiking Neural Network Training

Xu Liu, Na Xia, Jinxing Zhou, Jingyuan Xu, Dan Guo

Main category: cs.LG

TL;DR: The paper proposes two knowledge distillation methods (SAMD and NLD) to improve SNN training by better aligning ANN teacher and SNN student features, addressing their architectural differences in sparsity and discreteness.

Details

Motivation: Current KD methods for SNNs use simple element-wise alignment but ignore fundamental differences between ANN's continuous outputs and SNN's sparse, discrete outputs, leading to suboptimal training.

Method: Two KD strategies: 1) SAMD aligns SNN spike activation maps with ANN class-aware activation maps using saliency scaling for better semantic consistency; 2) NLD uses Gaussian noise to smooth SNN’s sparse logits for better alignment with ANN’s continuous logits.

Result: Extensive experiments on multiple datasets demonstrate the effectiveness of the proposed methods in improving SNN training through knowledge distillation.

Conclusion: The proposed SAMD and NLD methods successfully address the architectural gap between ANNs and SNNs in knowledge distillation, leading to more effective SNN training.

Abstract: Spiking Neural Networks (SNNs) become popular due to excellent energy efficiency, yet facing challenges for effective model training. Recent works improve this by introducing knowledge distillation (KD) techniques, with the pre-trained artificial neural networks (ANNs) used as teachers and the target SNNs as students. This is commonly accomplished through a straightforward element-wise alignment of intermediate features and prediction logits from ANNs and SNNs, often neglecting the intrinsic differences between their architectures. Specifically, ANN’s outputs exhibit a continuous distribution, whereas SNN’s outputs are characterized by sparsity and discreteness. To mitigate this issue, we introduce two innovative KD strategies. Firstly, we propose the Saliency-scaled Activation Map Distillation (SAMD), which aligns the spike activation map of the student SNN with the class-aware activation map of the teacher ANN. Rather than performing KD directly on the raw %and distinct features of ANN and SNN, our SAMD directs the student to learn from saliency activation maps that exhibit greater semantic and distribution consistency. Additionally, we propose a Noise-smoothed Logits Distillation (NLD), which utilizes Gaussian noise to smooth the sparse logits of student SNN, facilitating the alignment with continuous logits from teacher ANN. Extensive experiments on multiple datasets demonstrate the effectiveness of our methods. Code is available~\footnote{https://github.com/SinoLeu/CKDSNN.git}.

[806] Counterfactual Explanation for Multivariate Time Series Forecasting with Exogenous Variables

Keita Kinjo

Main category: cs.LG

TL;DR: Proposes a method for generating counterfactual explanations in time series forecasting using exogenous variables, with techniques for variable influence analysis and CE quality evaluation.

Details

Motivation: Address interpretability concerns in black-box machine learning models for time series analysis, particularly focusing on the underexplored area of counterfactual explanations in time series forecasting.

Method: Uses exogenous variables common in business/marketing fields, develops methods for analyzing variable influence across time series, generating CEs by modifying specific variables, and evaluating CE quality.

Result: Validated through theoretical analysis and empirical experiments, demonstrating accuracy and practical applicability of the proposed method.

Conclusion: The contributions support real-world decision-making based on time series data analysis by providing interpretable counterfactual explanations.

Abstract: Currently, machine learning is widely used across various domains, including time series data analysis. However, some machine learning models function as black boxes, making interpretability a critical concern. One approach to address this issue is counterfactual explanation (CE), which aims to provide insights into model predictions. This study focuses on the relatively underexplored problem of generating counterfactual explanations for time series forecasting. We propose a method for extracting CEs in time series forecasting using exogenous variables, which are frequently encountered in fields such as business and marketing. In addition, we present methods for analyzing the influence of each variable over an entire time series, generating CEs by altering only specific variables, and evaluating the quality of the resulting CEs. We validate the proposed method through theoretical analysis and empirical experiments, showcasing its accuracy and practical applicability. These contributions are expected to support real-world decision-making based on time series data analysis.

[807] Sampling and Loss Weights in Multi-Domain Training

Mahdi Salmani, Pratik Worah, Meisam Razaviyayn, Vahab Mirrokni

Main category: cs.LG

TL;DR: This paper analyzes how sampling weights and loss weights affect training in multi-domain data scenarios, showing they reduce gradient variance and improve generalization.

Details

Motivation: Large neural networks need diverse training data from multiple domains, but domains vary in quality and information diversity. Current methods use heuristics for domain weighting, but a deeper understanding of data mixing is needed.

Method: Study two types of weights: sampling weights (control domain contribution in batches) and loss weights (scale loss from each domain). Analyze through linear regression and examine their joint dynamics in SGD training.

Result: Both weights play complementary roles: they reduce gradient estimate variance in SGD and improve generalization by reducing generalization gap. Theoretical and empirical evidence supports these findings.

Conclusion: Sampling weights and loss weights work together to optimize training with heterogeneous data domains, addressing both gradient stability and generalization performance.

Abstract: In the training of large deep neural networks, there is a need for vast amounts of training data. To meet this need, data is collected from multiple domains, such as Wikipedia and GitHub. These domains are heterogeneous in both data quality and the diversity of information they provide. This raises the question of how much we should rely on each domain. Several methods have attempted to address this issue by assigning sampling weights to each data domain using heuristics or approximations. As a first step toward a deeper understanding of the role of data mixing, this work revisits the problem by studying two kinds of weights: sampling weights, which control how much each domain contributes in a batch, and loss weights, which scale the loss from each domain during training. Through a rigorous study of linear regression, we show that these two weights play complementary roles. First, they can reduce the variance of gradient estimates in iterative methods such as stochastic gradient descent (SGD). Second, they can improve generalization performance by reducing the generalization gap. We provide both theoretical and empirical support for these claims. We further study the joint dynamics of sampling weights and loss weights, examining how they can be combined to capture both contributions.

[808] Learning to Focus: Prioritizing Informative Histories with Structured Attention Mechanisms in Partially Observable Reinforcement Learning

Daniel De Dios Allegue, Jinke He, Frans A. Oliehoek

Main category: cs.LG

TL;DR: The paper introduces structured inductive priors into Transformer self-attention for model-based RL under partial observability, showing that Gaussian distributional priors significantly outperform standard attention and memory-length priors.

Details

Motivation: Standard self-attention in Transformers is inefficient for RL trajectories because it distributes weight uniformly across all past tokens rather than emphasizing the few critical transitions needed for control in sparse, reward-driven environments.

Method: Introduces two structured inductive priors into self-attention: (1) per-head memory-length priors that constrain attention to task-specific windows, and (2) distributional priors that learn smooth Gaussian weightings over past state-action pairs. These are integrated into UniZero, a Transformer-based world model for planning under partial observability.

Result: Gaussian Attention achieves a 77% relative improvement in mean human-normalized scores over UniZero on the Atari 100k benchmark. Most efficiency gains come from the Gaussian prior, while memory-length priors often truncate useful signals with overly restrictive cut-offs.

Conclusion: In partially observable RL domains with non-stationary temporal dependencies, smooth distributional priors flexibly adapt across horizons and yield more robust data efficiency than discrete memory windows, demonstrating that encoding structured temporal priors directly into self-attention improves prioritization of informative histories for dynamics modeling.

Abstract: Transformers have shown strong ability to model long-term dependencies and are increasingly adopted as world models in model-based reinforcement learning (RL) under partial observability. However, unlike natural language corpora, RL trajectories are sparse and reward-driven, making standard self-attention inefficient because it distributes weight uniformly across all past tokens rather than emphasizing the few transitions critical for control. To address this, we introduce structured inductive priors into the self-attention mechanism of the dynamics head: (i) per-head memory-length priors that constrain attention to task-specific windows, and (ii) distributional priors that learn smooth Gaussian weightings over past state-action pairs. We integrate these mechanisms into UniZero, a model-based RL agent with a Transformer-based world model that supports planning under partial observability. Experiments on the Atari 100k benchmark show that most efficiency gains arise from the Gaussian prior, which smoothly allocates attention to informative transitions, while memory-length priors often truncate useful signals with overly restrictive cut-offs. In particular, Gaussian Attention achieves a 77% relative improvement in mean human-normalized scores over UniZero. These findings suggest that in partially observable RL domains with non-stationary temporal dependencies, discrete memory windows are difficult to learn reliably, whereas smooth distributional priors flexibly adapt across horizons and yield more robust data efficiency. Overall, our results demonstrate that encoding structured temporal priors directly into self-attention improves the prioritization of informative histories for dynamics modeling under partial observability.

[809] Hybrid Autoencoders for Tabular Data: Leveraging Model-Based Augmentation in Low-Label Settings

Erel Naor, Ofir Lindenbaum

Main category: cs.LG

TL;DR: A hybrid autoencoder combining neural and tree-based encoders with model-specific feature selection for self-supervised learning on tabular data, addressing neural networks’ limitations in capturing high-frequency tabular structure.

Details

Motivation: Deep neural networks struggle with tabular data due to sensitivity to irrelevant features and bias toward smooth functions, while self-supervised learning faces challenges from lack of effective data augmentations in tabular domains.

Method: Hybrid autoencoder with neural encoder and oblivious soft decision tree encoder, each with stochastic gating networks for feature selection. Uses cross-reconstruction loss and model-based augmentation to create complementary input views.

Result: Achieves consistent gains in low-label classification and regression across diverse tabular datasets, outperforming both deep and tree-based supervised baselines.

Conclusion: The method successfully leverages complementary inductive biases of neural and tree encoders to improve tabular data representation learning, with the tree encoder guiding neural training while only using neural encoder at inference.

Abstract: Deep neural networks often under-perform on tabular data due to their sensitivity to irrelevant features and a spectral bias toward smooth, low-frequency functions. These limitations hinder their ability to capture the sharp, high-frequency signals that often define tabular structure, especially under limited labeled samples. While self-supervised learning (SSL) offers promise in such settings, it remains challenging in tabular domains due to the lack of effective data augmentations. We propose a hybrid autoencoder that combines a neural encoder with an oblivious soft decision tree (OSDT) encoder, each guided by its own stochastic gating network that performs sample-specific feature selection. Together, these structurally different encoders and model-specific gating networks implement model-based augmentation, producing complementary input views tailored to each architecture. The two encoders, trained with a shared decoder and cross-reconstruction loss, learn distinct yet aligned representations that reflect their respective inductive biases. During training, the OSDT encoder (robust to noise and effective at modeling localized, high-frequency structure) guides the neural encoder toward representations more aligned with tabular data. At inference, only the neural encoder is used, preserving flexibility and SSL compatibility. Spectral analysis highlights the distinct inductive biases of each encoder. Our method achieves consistent gains in low-label classification and regression across diverse tabular datasets, outperforming deep and tree-based supervised baselines.

[810] Oh That Looks Familiar: A Novel Similarity Measure for Spreadsheet Template Discovery

Ananad Krishnakumar, Vengadesh Ravikumaran

Main category: cs.LG

TL;DR: A hybrid distance metric combining semantic embeddings, data types, and spatial positioning outperforms traditional methods for identifying structurally similar spreadsheets, achieving perfect template reconstruction.

Details

Motivation: Traditional methods fail to capture spatial layouts and type patterns that define spreadsheet templates, limiting automated template discovery and downstream applications.

Method: Convert spreadsheets into cell-level embeddings and use aggregation techniques like Chamfer and Hausdorff distances to calculate similarity based on semantic embeddings, data type information, and spatial positioning.

Result: Superior unsupervised clustering performance compared to graph-based Mondrian baseline, achieving perfect template reconstruction (Adjusted Rand Index of 1.00 vs 0.90) on FUSTE dataset.

Conclusion: The approach enables large-scale automated template discovery, facilitating downstream applications like retrieval-augmented generation, model training, and bulk data cleaning over tabular collections.

Abstract: Traditional methods for identifying structurally similar spreadsheets fail to capture the spatial layouts and type patterns defining templates. To quantify spreadsheet similarity, we introduce a hybrid distance metric that combines semantic embeddings, data type information, and spatial positioning. In order to calculate spreadsheet similarity, our method converts spreadsheets into cell-level embeddings and then uses aggregation techniques like Chamfer and Hausdorff distances. Experiments across template families demonstrate superior unsupervised clustering performance compared to the graph-based Mondrian baseline, achieving perfect template reconstruction (Adjusted Rand Index of 1.00 versus 0.90) on the FUSTE dataset. Our approach facilitates large-scale automated template discovery, which in turn enables downstream applications such as retrieval-augmented generation over tabular collections, model training, and bulk data cleaning.

[811] Rethinking Crystal Symmetry Prediction: A Decoupled Perspective

Liheng Yu, Zhe Zhao, Xucong Wang, Di Wu, Pengkun Wang

Main category: cs.LG

TL;DR: XRDecoupler framework addresses sub-property confusion in crystal symmetry analysis by incorporating chemical intuition through multidimensional symmetry information as superclass guidance, achieving superior performance and interpretability.

Details

Motivation: Existing deep learning methods for crystal symmetry analysis ignore chemical rules and suffer from serious sub-property confusion (SPC) problems, leading to inaccurate predictions.

Method: Proposed XRDecoupler framework that incorporates multidimensional crystal symmetry information as superclass guidance, designs hierarchical PXRD pattern learning model, and uses multi-objective optimization for balanced training.

Result: Comprehensive evaluations on CCDC, CoREMOF, and InorganicData databases demonstrate XRDecoupler excels in performance, interpretability, and generalization compared to existing methods.

Conclusion: XRDecoupler successfully addresses the SPC problem in crystal symmetry analysis by aligning model predictions with chemical intuition through decoupled perspective and superclass guidance.

Abstract: Efficiently and accurately determining the symmetry is a crucial step in the structural analysis of crystalline materials. Existing methods usually mindlessly apply deep learning models while ignoring the underlying chemical rules. More importantly, experiments show that they face a serious sub-property confusion SPC problem. To address the above challenges, from a decoupled perspective, we introduce the XRDecoupler framework, a problem-solving arsenal specifically designed to tackle the SPC problem. Imitating the thinking process of chemists, we innovatively incorporate multidimensional crystal symmetry information as superclass guidance to ensure that the model’s prediction process aligns with chemical intuition. We further design a hierarchical PXRD pattern learning model and a multi-objective optimization approach to achieve high-quality representation and balanced optimization. Comprehensive evaluations on three mainstream databases (e.g., CCDC, CoREMOF, and InorganicData) demonstrate that XRDecoupler excels in performance, interpretability, and generalization.

[812] S$^2$Drug: Bridging Protein Sequence and 3D Structure in Contrastive Representation Learning for Virtual Screening

Bowei He, Bowen Gao, Yankai Chen, Yanyan Lan, Chen Ma, Philip S. Yu, Ya-Qin Zhang, Wei-Ying Ma

Main category: cs.LG

TL;DR: S²Drug is a two-stage framework that integrates protein sequence and 3D structure information for virtual screening, using contrastive learning with sequence pretraining and structure fine-tuning to improve protein-ligand matching.

Details

Motivation: Existing deep learning methods for virtual screening primarily rely on structural data while overlooking more accessible protein sequences, which could enhance generalizability. However, directly integrating sequences is challenging due to redundancy and noise in large-scale datasets.

Method: Two-stage framework: 1) Protein sequence pretraining on ChemBL using ESM2 backbone with data sampling to reduce redundancy/noise; 2) Fine-tuning on PDBBind by fusing sequence and structure via residue-level gating module, with auxiliary binding site prediction task to guide accurate residue localization.

Result: Across multiple benchmarks, S²Drug consistently improves virtual screening performance and achieves strong results on binding site prediction.

Conclusion: The framework demonstrates the value of bridging sequence and structure in contrastive learning for virtual screening applications.

Abstract: Virtual screening (VS) is an essential task in drug discovery, focusing on the identification of small-molecule ligands that bind to specific protein pockets. Existing deep learning methods, from early regression models to recent contrastive learning approaches, primarily rely on structural data while overlooking protein sequences, which are more accessible and can enhance generalizability. However, directly integrating protein sequences poses challenges due to the redundancy and noise in large-scale protein-ligand datasets. To address these limitations, we propose \textbf{S$^2$Drug}, a two-stage framework that explicitly incorporates protein \textbf{S}equence information and 3D \textbf{S}tructure context in protein-ligand contrastive representation learning. In the first stage, we perform protein sequence pretraining on ChemBL using an ESM2-based backbone, combined with a tailored data sampling strategy to reduce redundancy and noise on both protein and ligand sides. In the second stage, we fine-tune on PDBBind by fusing sequence and structure information through a residue-level gating module, while introducing an auxiliary binding site prediction task. This auxiliary task guides the model to accurately localize binding residues within the protein sequence and capture their 3D spatial arrangement, thereby refining protein-ligand matching. Across multiple benchmarks, S$^2$Drug consistently improves virtual screening performance and achieves strong results on binding site prediction, demonstrating the value of bridging sequence and structure in contrastive learning.

[813] Fast Bayesian Updates via Harmonic Representations

Di Zhang

Main category: cs.LG

TL;DR: A novel framework using harmonic analysis transforms Bayesian updates into spectral convolution, enabling fast O(N log N) computation via FFT instead of traditional O(N^2) methods.

Details

Motivation: Bayesian inference faces computational challenges with posterior distributions due to intractable evidence integrals, and conventional methods like MCMC and VI have scalability and efficiency limitations.

Method: Represent prior and likelihood in orthogonal basis to transform Bayesian update into spectral convolution, use spectral truncation for smooth functions to enable circular convolution, and apply Fast Fourier Transform (FFT).

Result: Achieves deterministic algorithm with O(N log N) complexity, substantial improvement over naive O(N^2) methods, with rigorous mathematical criteria linking efficiency to distribution smoothness and spectral decay.

Conclusion: Provides paradigm shift connecting Bayesian computation to signal processing, enabling real-time sequential inference across wide problem classes through harmonic analysis framework.

Abstract: Bayesian inference, while foundational to probabilistic reasoning, is often hampered by the computational intractability of posterior distributions, particularly through the challenging evidence integral. Conventional approaches like Markov Chain Monte Carlo (MCMC) and Variational Inference (VI) face significant scalability and efficiency limitations. This paper introduces a novel, unifying framework for fast Bayesian updates by leveraging harmonic analysis. We demonstrate that representing the prior and likelihood in a suitable orthogonal basis transforms the Bayesian update rule into a spectral convolution. Specifically, the Fourier coefficients of the posterior are shown to be the normalized convolution of the prior and likelihood coefficients. To achieve computational feasibility, we introduce a spectral truncation scheme, which, for smooth functions, yields an exceptionally accurate finite-dimensional approximation and reduces the update to a circular convolution. This formulation allows us to exploit the Fast Fourier Transform (FFT), resulting in a deterministic algorithm with O(N log N) complexity – a substantial improvement over the O(N^2) cost of naive methods. We establish rigorous mathematical criteria for the applicability of our method, linking its efficiency to the smoothness and spectral decay of the involved distributions. The presented work offers a paradigm shift, connecting Bayesian computation to signal processing and opening avenues for real-time, sequential inference in a wide class of problems.

[814] Breaking the Gradient Barrier: Unveiling Large Language Models for Strategic Classification

Xinpeng Lv, Yunxin Mao, Haoxuan Li, Ke Liang, Jinxuan Yang, Wanrong Huang, Haoang Chi, Huan Chen, Long Lan, Yuanlong Chen, Wenjing Yang, Haotian Wang

Main category: cs.LG

TL;DR: GLIM is a gradient-free strategic classification method using LLMs that implicitly simulates bi-level optimization through in-context learning, enabling cost-effective adaptation to dynamic strategic environments without fine-tuning.

Details

Motivation: Existing strategic classification methods based on linear models or shallow neural networks lack scalability and capacity for real-world datasets in financial services and internet sectors, especially with growing numbers of strategic individuals.

Method: GLIM leverages large language models through in-context learning, implicitly simulating bi-level optimization (feature manipulation and decision rule optimization) during self-attention feed-forward process without fine-tuning the LLMs.

Result: Experiments with pre-trained LLMs on real-world and synthetic datasets in financial and internet domains show that GLIM exhibits both robustness and efficiency, providing an effective solution for large-scale strategic classification tasks.

Conclusion: GLIM enables pre-trained LLMs to adapt to a broad range of strategic manipulations through gradient-free in-context learning, offering a scalable and cost-effective framework for strategic classification in dynamic environments.

Abstract: Strategic classification~(SC) explores how individuals or entities modify their features strategically to achieve favorable classification outcomes. However, existing SC methods, which are largely based on linear models or shallow neural networks, face significant limitations in terms of scalability and capacity when applied to real-world datasets with significantly increasing scale, especially in financial services and the internet sector. In this paper, we investigate how to leverage large language models to design a more scalable and efficient SC framework, especially in the case of growing individuals engaged with decision-making processes. Specifically, we introduce GLIM, a gradient-free SC method grounded in in-context learning. During the feed-forward process of self-attention, GLIM implicitly simulates the typical bi-level optimization process of SC, including both the feature manipulation and decision rule optimization. Without fine-tuning the LLMs, our proposed GLIM enjoys the advantage of cost-effective adaptation in dynamic strategic environments. Theoretically, we prove GLIM can support pre-trained LLMs to adapt to a broad range of strategic manipulations. We validate our approach through experiments with a collection of pre-trained LLMs on real-world and synthetic datasets in financial and internet domains, demonstrating that our GLIM exhibits both robustness and efficiency, and offering an effective solution for large-scale SC tasks.

[815] HCFSLN: Adaptive Hyperbolic Few-Shot Learning for Multimodal Anxiety Detection

Aditya Sneh, Nilesh Kumar Sahu, Anushka Sanjay Shelke, Arya Adyasha, Haroon R. Lone

Main category: cs.LG

TL;DR: HCFSLN is a novel Few-Shot Learning framework using hyperbolic embeddings for multimodal anxiety detection, achieving 88% accuracy with minimal data.

Details

Motivation: Traditional anxiety diagnosis relies on clinical interviews, while machine learning models face overfitting due to limited data. Large-scale data collection is costly and restricts accessibility.

Method: HCFSLN integrates speech, physiological signals, and video data using hyperbolic embeddings, cross-modal attention, and adaptive gating network to enhance feature separability for few-shot learning.

Result: Achieved 88% accuracy on a multimodal anxiety dataset from 108 participants, outperforming six FSL baselines by 14%.

Conclusion: Hyperbolic space effectively models anxiety-related speech patterns, demonstrating FSL’s potential for accessible anxiety classification with minimal data.

Abstract: Anxiety disorders impact millions globally, yet traditional diagnosis relies on clinical interviews, while machine learning models struggle with overfitting due to limited data. Large-scale data collection remains costly and time-consuming, restricting accessibility. To address this, we introduce the Hyperbolic Curvature Few-Shot Learning Network (HCFSLN), a novel Few-Shot Learning (FSL) framework for multimodal anxiety detection, integrating speech, physiological signals, and video data. HCFSLN enhances feature separability through hyperbolic embeddings, cross-modal attention, and an adaptive gating network, enabling robust classification with minimal data. We collected a multimodal anxiety dataset from 108 participants and benchmarked HCFSLN against six FSL baselines, achieving 88% accuracy, outperforming the best baseline by 14%. These results highlight the effectiveness of hyperbolic space for modeling anxiety-related speech patterns and demonstrate FSL’s potential for anxiety classification.

[816] CoLM: Collaborative Large Models via A Client-Server Paradigm

Siqi Huang, Sida Huang, Hongyuan Zhang

Main category: cs.LG

TL;DR: CoLM is a client-server collaborative framework that enables multiple models to share outputs and independently refine their generations, improving performance on previously failed queries.

Details

Motivation: Traditional model ensembles require simultaneous inference from multiple models, which doesn't align with practical deployment where limited server models serve many clients under modern internet architectures.

Method: CoLM allows outputs from multiple models to be aggregated or shared, enabling each client model to independently refine and update its own generation based on these high-quality outputs, leveraging both client-side and shared server-side models.

Result: Experimental results across multiple benchmarks show CoLM consistently improves model performance on previously failed queries.

Conclusion: CoLM demonstrates the effectiveness of collaborative guidance in enhancing single-model capabilities and extends successfully to vision-language models beyond language tasks.

Abstract: Large models have achieved remarkable performance across a range of reasoning and understanding tasks. Prior work often utilizes model ensembles or multi-agent systems to collaboratively generate responses, effectively operating in a server-to-server paradigm. However, such approaches do not align well with practical deployment settings, where a limited number of server-side models are shared by many clients under modern internet architectures. In this paper, we introduce \textbf{CoLM} (\textbf{Co}llaboration in \textbf{L}arge-\textbf{M}odels), a novel framework for collaborative reasoning that redefines cooperation among large models from a client-server perspective. Unlike traditional ensemble methods that rely on simultaneous inference from multiple models to produce a single output, CoLM allows the outputs of multiple models to be aggregated or shared, enabling each client model to independently refine and update its own generation based on these high-quality outputs. This design enables collaborative benefits by fully leveraging both client-side and shared server-side models. We further extend CoLM to vision-language models (VLMs), demonstrating its applicability beyond language tasks. Experimental results across multiple benchmarks show that CoLM consistently improves model performance on previously failed queries, highlighting the effectiveness of collaborative guidance in enhancing single-model capabilities.

[817] Learning Quantized Continuous Controllers for Integer Hardware

Fabian Kresse, Christoph H. Lampert

Main category: cs.LG

TL;DR: Quantization-aware training enables deployment of reinforcement learning policies on FPGAs using only 2-3 bits per weight/activation, achieving microsecond latencies and microjoule energy consumption while maintaining competitive performance.

Details

Motivation: Deploying continuous-control RL policies on embedded hardware requires meeting tight latency and power budgets, which small FPGAs can deliver but only if costly floating-point pipelines are avoided.

Method: Developed a learning-to-hardware pipeline that uses quantization-aware training (QAT) for integer inference, automatically selects low-bit policies, and synthesizes them to an Artix-7 FPGA.

Result: Achieved policies competitive with FP32 baselines using only 2-3 bits per weight and activation, with microsecond inference latencies and microjoule energy consumption on target hardware. Also observed increased input noise robustness compared to floating-point baseline.

Conclusion: Quantization-aware training enables efficient deployment of RL policies on resource-constrained embedded hardware while maintaining performance and even improving noise robustness.

Abstract: Deploying continuous-control reinforcement learning policies on embedded hardware requires meeting tight latency and power budgets. Small FPGAs can deliver these, but only if costly floating point pipelines are avoided. We study quantization-aware training (QAT) of policies for integer inference and we present a learning-to-hardware pipeline that automatically selects low-bit policies and synthesizes them to an Artix-7 FPGA. Across five MuJoCo tasks, we obtain policy networks that are competitive with full precision (FP32) policies but require as few as 3 or even only 2 bits per weight, and per internal activation value, as long as input precision is chosen carefully. On the target hardware, the selected policies achieve inference latencies on the order of microseconds and consume microjoules per action, favorably comparing to a quantized reference. Last, we observe that the quantized policies exhibit increased input noise robustness compared to the floating-point baseline.

[818] Correcting False Alarms from Unseen: Adapting Graph Anomaly Detectors at Test Time

Junjun Pan, Yixin Liu, Chuan Zhou, Fei Xiong, Alan Wee-Chung Liew, Shirui Pan

Main category: cs.LG

TL;DR: TUNE is a plug-and-play test-time adaptation framework that addresses normality shift in graph anomaly detection by aligning unseen normal patterns to the original training distribution without requiring model retraining.

Details

Motivation: Real-world graph anomaly detection faces normality shift where unseen normal samples emerge during deployment, causing performance degradation due to semantic confusion and aggregation contamination. Retraining models is impractical due to high costs and data labeling difficulties.

Method: Proposes TUNE framework with a graph aligner that: (1) aligns shifted data to original distribution at graph attribute level to address semantic confusion, (2) uses minimization of representation-level shift as supervision signal, leveraging aggregation contamination as indicator of normality shift.

Result: Extensive experiments on 10 real-world datasets show TUNE significantly enhances generalizability of pre-trained GAD models to both synthetic and real unseen normal patterns.

Conclusion: TUNE provides an effective lightweight solution for handling normality shift in graph anomaly detection without requiring expensive retraining, making it practical for real-world deployment.

Abstract: Graph anomaly detection (GAD), which aims to detect outliers in graph-structured data, has received increasing research attention recently. However, existing GAD methods assume identical training and testing distributions, which is rarely valid in practice. In real-world scenarios, unseen but normal samples may emerge during deployment, leading to a normality shift that degrades the performance of GAD models trained on the original data. Through empirical analysis, we reveal that the degradation arises from (1) semantic confusion, where unseen normal samples are misinterpreted as anomalies due to their novel patterns, and (2) aggregation contamination, where the representations of seen normal nodes are distorted by unseen normals through message aggregation. While retraining or fine-tuning GAD models could be a potential solution to the above challenges, the high cost of model retraining and the difficulty of obtaining labeled data often render this approach impractical in real-world applications. To bridge the gap, we proposed a lightweight and plug-and-play Test-time adaptation framework for correcting Unseen Normal pattErns (TUNE) in GAD. To address semantic confusion, a graph aligner is employed to align the shifted data to the original one at the graph attribute level. Moreover, we utilize the minimization of representation-level shift as a supervision signal to train the aligner, which leverages the estimated aggregation contamination as a key indicator of normality shift. Extensive experiments on 10 real-world datasets demonstrate that TUNE significantly enhances the generalizability of pre-trained GAD models to both synthetic and real unseen normal patterns.

[819] Fair Bayesian Data Selection via Generalized Discrepancy Measures

Yixuan Zhang, Jiabin Luo, Zhenggang Wang, Feng Zhou, Quyu Kong

Main category: cs.LG

TL;DR: A Bayesian data selection framework that ensures fairness by aligning group-specific posterior distributions with a shared central distribution, outperforming existing methods in both fairness and accuracy.

Details

Motivation: Address limitations of existing fairness-aware methods that intervene at model level, which suffer from high computational costs, limited scalability, and poor generalization.

Method: Propose a Bayesian data selection framework that aligns group-specific posterior distributions of model parameters and sample weights with a shared central distribution using various distributional discrepancy measures (Wasserstein distance, maximum mean discrepancy, f-divergence).

Result: Experiments on benchmark datasets show the method consistently outperforms existing data selection and model-based fairness methods in both fairness and accuracy.

Conclusion: The data-centric approach effectively mitigates group-specific biases in training data and improves fairness in downstream tasks with theoretical guarantees.

Abstract: Fairness concerns are increasingly critical as machine learning models are deployed in high-stakes applications. While existing fairness-aware methods typically intervene at the model level, they often suffer from high computational costs, limited scalability, and poor generalization. To address these challenges, we propose a Bayesian data selection framework that ensures fairness by aligning group-specific posterior distributions of model parameters and sample weights with a shared central distribution. Our framework supports flexible alignment via various distributional discrepancy measures, including Wasserstein distance, maximum mean discrepancy, and $f$-divergence, allowing geometry-aware control without imposing explicit fairness constraints. This data-centric approach mitigates group-specific biases in training data and improves fairness in downstream tasks, with theoretical guarantees. Experiments on benchmark datasets show that our method consistently outperforms existing data selection and model-based fairness methods in both fairness and accuracy.

[820] Breaking Privacy in Federated Clustering: Perfect Input Reconstruction via Temporal Correlations

Guang Yang, Lixia Luo, Qiongxiu Li

Main category: cs.LG

TL;DR: Centroid disclosure in federated clustering, previously considered safe, actually enables perfect input reconstruction through temporal patterns in k-means iterations.

Details

Motivation: To determine whether disclosing intermediate centroids in federated clustering truly compromises privacy, challenging prior assumptions that such disclosure is harmless for efficiency.

Method: Developed Trajectory-Aware Reconstruction (TAR) attack that combines temporal assignment information from k-means iterations with algebraic analysis to reconstruct original inputs.

Result: Successfully demonstrated perfect reconstruction of exact original inputs, proving centroid disclosure significantly compromises privacy in federated clustering.

Conclusion: There exists a fundamental tension between privacy and efficiency in federated clustering, as centroid disclosure enables practical attacks that recover private data, contradicting previous safety assumptions.

Abstract: Federated clustering allows multiple parties to discover patterns in distributed data without sharing raw samples. To reduce overhead, many protocols disclose intermediate centroids during training. While often treated as harmless for efficiency, whether such disclosure compromises privacy remains an open question. Prior analyses modeled the problem as a so-called Hidden Subset Sum Problem (HSSP) and argued that centroid release may be safe, since classical HSSP attacks fail to recover inputs. We revisit this question and uncover a new leakage mechanism: temporal regularities in $k$-means iterations create exploitable structure that enables perfect input reconstruction. Building on this insight, we propose Trajectory-Aware Reconstruction (TAR), an attack that combines temporal assignment information with algebraic analysis to recover exact original inputs. Our findings provide the first rigorous evidence, supported by a practical attack, that centroid disclosure in federated clustering significantly compromises privacy, exposing a fundamental tension between privacy and efficiency.

[821] Direct Molecular Polarizability Prediction with SO(3) Equivariant Local Frame GNNs

Jean Philip Filling, Felix Post, Michael Wand, Denis Andrienko

Main category: cs.LG

TL;DR: Novel equivariant GNN using local coordinate frames for SO(3)-equivariant prediction of molecular tensorial properties, outperforming scalar models on QM7-X polarizabilities.

Details

Motivation: Traditional methods focus on scalar properties and derive tensorial properties from derivatives, lacking direct equivariant modeling of tensorial responses.

Method: Equivariant GNN with local coordinate frames that maintains SO(3)-equivariance through scalar, vector, and tensor channels in local message-passing.

Result: Outperforms scalar message passing models in predicting molecular polarizabilities on QM7-X dataset.

Conclusion: Advances development of structured, geometry-aware neural models for molecular property prediction through direct tensorial message passing.

Abstract: We introduce a novel equivariant graph neural network (GNN) architecture designed to predict the tensorial response properties of molecules. Unlike traditional frameworks that focus on regressing scalar quantities and derive tensorial properties from their derivatives, our approach maintains $SO(3)$-equivariance through the use of local coordinate frames. Our GNN effectively captures geometric information by integrating scalar, vector, and tensor channels within a local message-passing framework. To assess the accuracy of our model, we apply it to predict the polarizabilities of molecules in the QM7-X dataset and show that tensorial message passing outperforms scalar message passing models. This work marks an advancement towards developing structured, geometry-aware neural models for molecular property prediction.

[822] REACT-LLM: A Benchmark for Evaluating LLM Integration with Causal Features in Clinical Prognostic Tasks

Linna Wang, Zhixuan You, Qihui Zhang, Jiunan Wen, Ji Shi, Yimin Chen, Yusen Wang, Fanqi Ding, Ziliang Feng, Li Lu

Main category: cs.LG

TL;DR: REACT-LLM is a benchmark evaluating LLM-causal learning synergy in clinical risk prediction across 7 outcomes, 15 LLMs, 6 ML models, and 3 causal discovery algorithms.

Details

Motivation: To address the lack of systematic benchmarks evaluating LLM-causal learning integration in clinical decision making, where identifying causal features is crucial for trustworthy predictions.

Method: Introduced REACT-LLM benchmark evaluating 7 clinical outcomes across 2 real-world datasets, comparing 15 LLMs, 6 traditional ML models, and 3 causal discovery algorithms.

Result: LLMs perform reasonably but haven’t outperformed traditional ML models; integrating causal features offers limited gains due to strict CD method assumptions being violated in clinical data.

Conclusion: While direct integration yields limited improvement, the benchmark reveals a more promising synergy between LLMs and causal learning in clinical prognostics.

Abstract: Large Language Models (LLMs) and causal learning each hold strong potential for clinical decision making (CDM). However, their synergy remains poorly understood, largely due to the lack of systematic benchmarks evaluating their integration in clinical risk prediction. In real-world healthcare, identifying features with causal influence on outcomes is crucial for actionable and trustworthy predictions. While recent work highlights LLMs’ emerging causal reasoning abilities, there lacks comprehensive benchmarks to assess their causal learning and performance informed by causal features in clinical risk prediction. To address this, we introduce REACT-LLM, a benchmark designed to evaluate whether combining LLMs with causal features can enhance clinical prognostic performance and potentially outperform traditional machine learning (ML) methods. Unlike existing LLM-clinical benchmarks that often focus on a limited set of outcomes, REACT-LLM evaluates 7 clinical outcomes across 2 real-world datasets, comparing 15 prominent LLMs, 6 traditional ML models, and 3 causal discovery (CD) algorithms. Our findings indicate that while LLMs perform reasonably in clinical prognostics, they have not yet outperformed traditional ML models. Integrating causal features derived from CD algorithms into LLMs offers limited performance gains, primarily due to the strict assumptions of many CD methods, which are often violated in complex clinical data. While the direct integration yields limited improvement, our benchmark reveals a more promising synergy.

[823] Guiding Generative Models to Uncover Diverse and Novel Crystals via Reinforcement Learning

Hyunsoo Park, Aron Walsh

Main category: cs.LG

TL;DR: A reinforcement learning framework that guides diffusion models to discover novel, diverse, and thermodynamically viable crystalline materials, addressing the misalignment between likelihood-based sampling and targeted exploration of underexplored regions.

Details

Motivation: To overcome the fundamental challenge of objective misalignment between generative modeling's likelihood-based sampling and the need to explore underexplored regions where novel compounds reside in materials discovery.

Method: Integrates group relative policy optimization with verifiable multi-objective rewards that balance creativity, stability, and diversity to guide latent denoising diffusion models.

Result: Enables generation of diverse and novel crystalline compounds while maintaining thermodynamic viability, and demonstrates enhanced property-guided design that preserves chemical validity.

Conclusion: Establishes a modular foundation for controllable AI-driven inverse design that addresses the novelty-validity trade-off in scientific discovery applications of generative models.

Abstract: Discovering functional crystalline materials entails navigating an immense combinatorial design space. While recent advances in generative artificial intelligence have enabled the sampling of chemically plausible compositions and structures, a fundamental challenge remains: the objective misalignment between likelihood-based sampling in generative modelling and targeted focus on underexplored regions where novel compounds reside. Here, we introduce a reinforcement learning framework that guides latent denoising diffusion models toward diverse and novel, yet thermodynamically viable crystalline compounds. Our approach integrates group relative policy optimisation with verifiable, multi-objective rewards that jointly balance creativity, stability, and diversity. Beyond de novo generation, we demonstrate enhanced property-guided design that preserves chemical validity, while targeting desired functional properties. This approach establishes a modular foundation for controllable AI-driven inverse design that addresses the novelty-validity trade-off across scientific discovery applications of generative models.

[824] Fuzzy Label: From Concept to Its Application in Label Learning

Chenxi Luoa, Zhuangzhuang Zhaoa, Zhaohong Denga, Te Zhangb

Main category: cs.LG

TL;DR: This paper introduces fuzzy labels based on fuzzy set theory to better represent label uncertainty in machine learning, proposing methods to generate fuzzy labels and enhance traditional KNN algorithms for improved performance.

Details

Motivation: Traditional binary label representations fail to capture uncertainty in real-world annotations caused by data noise, ambiguity, and human subjectivity, limiting model expressiveness.

Method: Proposed fuzzy labeling method that mines and generates fuzzy labels from original data, then developed fuzzy-label-enhanced versions of KNN and multi-label KNN algorithms.

Result: Experimental results show fuzzy labels effectively characterize real-world labeling information and significantly enhance label learning model performance.

Conclusion: Fuzzy labels provide more informative and nuanced label representations that overcome limitations of traditional binary labels, leading to improved model performance in label learning tasks.

Abstract: Label learning is a fundamental task in machine learning that aims to construct intelligent models using labeled data, encompassing traditional single-label and multi-label classification models. Traditional methods typically rely on logical labels, such as binary indicators (e.g., “yes/no”) that specify whether an instance belongs to a given category. However, in practical applications, label annotations often involve significant uncertainty due to factors such as data noise, inherent ambiguity in the observed entities, and the subjectivity of human annotators. Therefore, representing labels using simplistic binary logic can obscure valuable information and limit the expressiveness of label learning models. To overcome this limitation, this paper introduces the concept of fuzzy labels, grounded in fuzzy set theory, to better capture and represent label uncertainty. We further propose an efficient fuzzy labeling method that mines and generates fuzzy labels from the original data, thereby enriching the label space with more informative and nuanced representations. Based on this foundation, we present fuzzy-label-enhanced algorithms for both single-label and multi-label learning, using the classical K-Nearest Neighbors (KNN) and multi-label KNN algorithms as illustrative examples. Experimental results indicate that fuzzy labels can more effectively characterize the real-world labeling information and significantly enhance the performance of label learning models.

[825] LLMscape

Gottfried Haider, Jie Zhang

Main category: cs.LG

TL;DR: LLMscape is an interactive installation exploring human-AI meaning-making in uncertain conditions through a mutable projection-mapped landscape where humans reshape the world and interact with AI agents developing incomplete environmental accounts.

Details

Motivation: To investigate how humans and AI construct meaning under shared conditions of uncertainty, positioning AI as co-witnesses rather than deterministic tools, and examining parallels between human and artificial meaning-making.

Method: Interactive installation with mutable projection-mapped landscape where human participants reshape the world and engage with multiple AI agents that develop incomplete and provisional accounts of their environment.

Result: Exhibited in Shanghai and continually evolving, the work successfully positions AI as embodied co-witnesses to an unstable world and facilitates reflection on shared epistemic limits.

Conclusion: The installation demonstrates that AI can function as co-witnesses in uncertain environments, highlighting parallels between human and artificial meaning-making processes and inviting reflection on our shared limitations in knowledge construction.

Abstract: LLMscape is an interactive installation that investigates how humans and AI construct meaning under shared conditions of uncertainty. Within a mutable, projection-mapped landscape, human participants reshape the world and engage with multiple AI agents, each developing incomplete and provisional accounts of their environment. Exhibited in Shanghai and continually evolving, the work positions AI not as deterministic tools but as embodied co-witnesses to an unstable world, examining the parallels between human and artificial meaning-making and inviting reflection on our shared epistemic limits.

[826] Combining digital data streams and epidemic networks for real time outbreak detection

Ruiqi Lyu, Alistair Turcan, Bryan Wilder

Main category: cs.LG

TL;DR: LRTrend is an interpretable ML framework that aggregates diverse health and behavioral data streams to detect disease outbreaks in real time, identifying epidemic clusters and connections across regions.

Details

Motivation: Outbreak detection is hindered by high noise in epidemic time series, and aggregating information across data sources - effective in other fields - remains underexplored in epidemiology.

Method: LRTrend aggregates diverse health and behavioral data streams within regions and learns disease-specific epidemic networks to aggregate information across regions, revealing epidemic clusters and connections.

Result: The framework identified diverse epidemic clusters across the US not well explained by human mobility networks. Applied to COVID-19 data, it detected Delta and Omicron waves within 2 weeks of outbreak start when cases were still low.

Conclusion: LRTrend provides an effective approach for real-time outbreak detection by aggregating multiple data sources and learning epidemic networks, offering potential for improved public health coordination.

Abstract: Responding to disease outbreaks requires close surveillance of their trajectories, but outbreak detection is hindered by the high noise in epidemic time series. Aggregating information across data sources has shown great denoising ability in other fields, but remains underexplored in epidemiology. Here, we present LRTrend, an interpretable machine learning framework to identify outbreaks in real time. LRTrend effectively aggregates diverse health and behavioral data streams within one region and learns disease-specific epidemic networks to aggregate information across regions. We reveal diverse epidemic clusters and connections across the United States that are not well explained by commonly used human mobility networks and may be informative for future public health coordination. We apply LRTrend to 2 years of COVID-19 data in 305 hospital referral regions and frequently detect regional Delta and Omicron waves within 2 weeks of the outbreak’s start, when case counts are a small fraction of the wave’s resulting peak.

[827] SMiLE: Provably Enforcing Global Relational Properties in Neural Networks

Matteo Francobaldi, Michele Lombardi, Andrea Lodi

Main category: cs.LG

TL;DR: SMiLE framework extended to enforce global relational properties in Neural Networks with full guarantees, outperforming property-specific methods in generality and robustness.

Details

Motivation: Existing methods for property enforcement in Neural Networks are limited to specific constraints or local properties, lacking full guarantees and generality.

Method: Extend SMiLE framework to support global relational properties defined over entire input space, scaling with model complexity and accommodating general properties.

Result: Competitive with property-specific baselines in accuracy and runtime, strictly superior in generality and guarantee levels across monotonicity, robustness, and fairness tasks.

Conclusion: SMiLE framework shows strong potential as a versatile platform for property enforcement in Neural Networks with full satisfaction guarantees.

Abstract: Artificial Intelligence systems are increasingly deployed in settings where ensuring robustness, fairness, or domain-specific properties is essential for regulation compliance and alignment with human values. However, especially on Neural Networks, property enforcement is very challenging, and existing methods are limited to specific constraints or local properties (defined around datapoints), or fail to provide full guarantees. We tackle these limitations by extending SMiLE, a recently proposed enforcement framework for NNs, to support global relational properties (defined over the entire input space). The proposed approach scales well with model complexity, accommodates general properties and backbones, and provides full satisfaction guarantees. We evaluate SMiLE on monotonicity, global robustness, and individual fairness, on synthetic and real data, for regression and classification tasks. Our approach is competitive with property-specific baselines in terms of accuracy and runtime, and strictly superior in terms of generality and level of guarantees. Overall, our results emphasize the potential of the SMiLE framework as a platform for future research and applications.

[828] On Stealing Graph Neural Network Models

Marcin Podhajski, Jan Dubiński, Franziska Boenisch, Adam Dziedzic, Agnieszka Pręgowska, Tomasz P. Michalak

Main category: cs.LG

TL;DR: Proposes a GNN model-stealing attack that works with very limited queries by first obtaining the model backbone without direct queries, then strategically using a fixed query limit to extract the most informative data.

Details

Motivation: Current GNN model-stealing methods assume unlimited queries, but real-world scenarios often have severe query limitations, creating a need for effective extraction methods under restricted query budgets.

Method: Two-stage approach: first obtain model backbone without direct queries to victim model, then strategically utilize fixed query limit to extract most informative data.

Result: Experiments on eight real-world datasets show the attack remains effective even under very restricted query limits and existing defenses against model extraction.

Conclusion: The findings highlight the vulnerability of GNNs to model extraction attacks and underscore the need for more robust defenses against such threats.

Abstract: Current graph neural network (GNN) model-stealing methods rely heavily on queries to the victim model, assuming no hard query limits. However, in reality, the number of allowed queries can be severely limited. In this paper, we demonstrate how an adversary can extract the GNN with very limited interactions with the model. Our approach first enables the adversary to obtain the model backbone without making direct queries to the victim model and then to strategically utilize a fixed query limit to extract the most informative data. The experiments on eight real-world datasets demonstrate the effectiveness of the attack, even under a very restricted query limit and under defense against model extraction in place. Our findings underscore the need for robust defenses against GNN model extraction threats.

[829] Synergy over Discrepancy: A Partition-Based Approach to Multi-Domain LLM Fine-Tuning

Hua Ye, Siyuan Chen, Haoliang Zhang, Weihao Luo, Yanbin Li, Xuan Zhang

Main category: cs.LG

TL;DR: A partition-based multi-stage fine-tuning framework for LLMs that groups domains into stages to leverage synergies while reducing interference, with theoretical analysis and empirical validation.

Details

Motivation: Address the challenge of inter-domain interference when adapting LLMs across multiple heterogeneous domains, aiming to exploit domain synergies while minimizing negative transfer.

Method: Strategic partitioning of domains into subsets (stages) based on domain discrepancy, synergy, and model capacity constraints, followed by multi-stage fine-tuning.

Result: Theoretical generalization bounds support the partitioning strategy, and extensive empirical evaluations show consistent outperformance over state-of-the-art baselines on various language understanding tasks.

Conclusion: The proposed framework effectively handles multi-domain adaptation in LLMs by balancing domain relationships and model constraints, demonstrating improved generalization across heterogeneous domains.

Abstract: Large language models (LLMs) demonstrate impressive generalization abilities, yet adapting them effectively across multiple heterogeneous domains remains challenging due to inter-domain interference. To overcome this challenge, we propose a partition-based multi-stage fine-tuning framework designed to exploit inter-domain synergies while minimizing negative transfer. Our approach strategically partitions domains into subsets (stages) by balancing domain discrepancy, synergy, and model capacity constraints. We theoretically analyze the proposed framework and derive novel generalization bounds that justify our partitioning strategy. Extensive empirical evaluations on various language understanding tasks show that our method consistently outperforms state-of-the-art baselines.

[830] DETECT: Data-Driven Evaluation of Treatments Enabled by Classification Transformers

Yuanheng Mao, Lillian Yang, Stephen Yang, Ethan Shao, Zihan Li

Main category: cs.LG

TL;DR: DETECT is a data-driven framework using smartphone sensors to objectively measure treatment success for chronic pain by comparing patient activities before and after treatment.

Details

Motivation: Traditional pain measurement methods like numeric rating scales are subjective and self-reported, creating a need for objective assessment of treatment functional impact.

Method: Uses classification transformers to analyze smartphone sensor data from patient activities of daily life, comparing pre- and post-treatment periods.

Result: DETECT proves to be objective and lightweight, working effectively on public benchmark datasets and simulated patient data.

Conclusion: The framework can enhance clinical decision-making and lead to more personalized patient care when used alongside traditional metrics.

Abstract: Chronic pain is a global health challenge affecting millions of individuals, making it essential for physicians to have reliable and objective methods to measure the functional impact of clinical treatments. Traditionally used methods, like the numeric rating scale, while personalized and easy to use, are subjective due to their self-reported nature. Thus, this paper proposes DETECT (Data-Driven Evaluation of Treatments Enabled by Classification Transformers), a data-driven framework that assesses treatment success by comparing patient activities of daily life before and after treatment. We use DETECT on public benchmark datasets and simulated patient data from smartphone sensors. Our results demonstrate that DETECT is objective yet lightweight, making it a significant and novel contribution to clinical decision-making. By using DETECT, independently or together with other self-reported metrics, physicians can improve their understanding of their treatment impacts, ultimately leading to more personalized and responsive patient care.

[831] Deep Neural Operator Learning for Probabilistic Models

Erhan Bayraktar, Qi Feng, Zecheng Zhang, Zhaoyu Zhang

Main category: cs.LG

TL;DR: A deep neural-operator framework for probability models with universal approximation guarantees, applied to European and American option pricing problems.

Details

Motivation: To develop a general framework for approximating probability models using neural operators, with applications to financial derivatives pricing where traditional methods face computational challenges.

Method: Proposed a deep neural-operator architecture with explicit network-size bounds, requiring stochastic processes to satisfy integrability and tail-probability conditions. Verified framework for European and American options within FBSDE framework.

Result: Established universal approximation theorem with explicit network-size bounds under global Lipschitz conditions. Successfully applied to basket of American options, showing learned model produces optimal stopping boundaries for new strike prices without retraining.

Conclusion: The neural-operator framework provides a powerful and flexible approach for probability modeling, particularly effective for option pricing problems with demonstrated generalization capabilities.

Abstract: We propose a deep neural-operator framework for a general class of probability models. Under global Lipschitz conditions on the operator over the entire Euclidean space-and for a broad class of probabilistic models-we establish a universal approximation theorem with explicit network-size bounds for the proposed architecture. The underlying stochastic processes are required only to satisfy integrability and general tail-probability conditions. We verify these assumptions for both European and American option-pricing problems within the forward-backward SDE (FBSDE) framework, which in turn covers a broad class of operators arising from parabolic PDEs, with or without free boundaries. Finally, we present a numerical example for a basket of American options, demonstrating that the learned model produces optimal stopping boundaries for new strike prices without retraining.

[832] Does TabPFN Understand Causal Structures?

Omar Swelam, Lennart Purucker, Jake Robertson, Hanne Raum, Joschka Boedecker, Frank Hutter

Main category: cs.LG

TL;DR: TabPFN, a transformer-based tabular foundation model pre-trained on synthetic causal data, encodes causal information in its embeddings that can be extracted for causal discovery tasks.

Details

Motivation: Causal discovery from real-world data remains challenging, and the paper investigates whether pre-trained tabular foundation models like TabPFN encode causal information that could be leveraged for causal discovery.

Method: Developed an adapter framework with learnable decoder and causal tokens to extract causal signals from TabPFN’s frozen embeddings and decode them into adjacency matrices for causal discovery.

Result: TabPFN’s embeddings contain meaningful causal information, outperforming traditional causal discovery algorithms, with causal information concentrated in mid-range layers of the model.

Conclusion: Foundation models pre-trained on synthetic causal data can encode causal information, establishing a new direction for interpretable and adaptable foundation models for causal discovery tasks.

Abstract: Causal discovery is fundamental for multiple scientific domains, yet extracting causal information from real world data remains a significant challenge. Given the recent success on real data, we investigate whether TabPFN, a transformer-based tabular foundation model pre-trained on synthetic datasets generated from structural causal models, encodes causal information in its internal representations. We develop an adapter framework using a learnable decoder and causal tokens that extract causal signals from TabPFN’s frozen embeddings and decode them into adjacency matrices for causal discovery. Our evaluations demonstrate that TabPFN’s embeddings contain causal information, outperforming several traditional causal discovery algorithms, with such causal information being concentrated in mid-range layers. These findings establish a new direction for interpretable and adaptable foundation models and demonstrate the potential for leveraging pre-trained tabular models for causal discovery.

[833] Understanding the role of depth in the neural tangent kernel for overparameterized neural networks

William St-Arnaud, Margarida Carvalho, Golnoosh Farnadi

Main category: cs.LG

TL;DR: Analysis of how overparameterized ReLU networks behave as kernel models when depth increases, showing the limiting kernel approaches a matrix of ones while the closed-form solution converges to a fixed limit.

Details

Motivation: To understand how increasing depth affects large ReLU networks' behavior as kernel models, particularly the sensitivity of their limiting kernels and closed-form solutions.

Method: Theoretical analysis of limiting kernels for deep ReLU networks, characterization of kernel behavior with increasing depth, and empirical evaluation of depth requirements for convergence.

Result: The normalized limiting kernel approaches the matrix of ones, while the corresponding closed-form solution approaches a fixed limit on the sphere. Empirical results show the depth required to observe this convergence.

Conclusion: Deep ReLU networks exhibit convergent behavior in their limiting kernels and solutions, with identifiable depth thresholds for convergence and generalizable properties across different kernels.

Abstract: Overparameterized fully-connected neural networks have been shown to behave like kernel models when trained with gradient descent, under mild conditions on the width, the learning rate, and the parameter initialization. In the limit of infinitely large widths and small learning rate, the kernel that is obtained allows to represent the output of the learned model with a closed-form solution. This closed-form solution hinges on the invertibility of the limiting kernel, a property that often holds on real-world datasets. In this work, we analyze the sensitivity of large ReLU networks to increasing depths by characterizing the corresponding limiting kernel. Our theoretical results demonstrate that the normalized limiting kernel approaches the matrix of ones. In contrast, they show the corresponding closed-form solution approaches a fixed limit on the sphere. We empirically evaluate the order of magnitude in network depth required to observe this convergent behavior, and we describe the essential properties that enable the generalization of our results to other kernels.

Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Ziyue Peng, Zewei Liu, Hewei Wang, Jiayi Zhang, Edith C. H. Ngai

Main category: cs.LG

TL;DR: Multi-DProxy is a multi-modal dynamic proxy learning framework that uses learnable textual proxies to generate user-interest-aware clusterings, overcoming static semantic limitations in existing multi-clustering methods.

Details

Motivation: Existing multiple clustering methods generate exhaustive clusterings without considering user interests, requiring manual screening. Current multi-modal approaches suffer from static semantic rigidity with predefined candidate words and fixed fusion strategies that don't adapt to dataset-specific concepts.

Method: Proposes Multi-DProxy with three key components: 1) gated cross-modal fusion for adaptive feature interaction modeling, 2) dual-constraint proxy optimization with user interest and concept constraints, and 3) dynamic candidate management that refines textual proxies through iterative clustering feedback.

Result: Extensive experiments show state-of-the-art performance with significant improvements over existing methods across multiple multi-clustering benchmarks, effectively capturing user interests and identifying relevant clusterings with greater precision.

Conclusion: Multi-DProxy successfully addresses the limitations of static multi-modal clustering by introducing dynamic proxy learning, enabling adaptive semantic alignment and improved clustering discrimination while capturing user interests effectively.

Abstract: Multiple clustering aims to discover diverse latent structures from different perspectives, yet existing methods generate exhaustive clusterings without discerning user interest, necessitating laborious manual screening. Current multi-modal solutions suffer from static semantic rigidity: predefined candidate words fail to adapt to dataset-specific concepts, and fixed fusion strategies ignore evolving feature interactions. To overcome these limitations, we propose Multi-DProxy, a novel multi-modal dynamic proxy learning framework that leverages cross-modal alignment through learnable textual proxies. Multi-DProxy introduces 1) gated cross-modal fusion that synthesizes discriminative joint representations by adaptively modeling feature interactions. 2) dual-constraint proxy optimization where user interest constraints enforce semantic consistency with domain concepts while concept constraints employ hard example mining to enhance cluster discrimination. 3) dynamic candidate management that refines textual proxies through iterative clustering feedback. Therefore, Multi-DProxy not only effectively captures a user’s interest through proxies but also enables the identification of relevant clusterings with greater precision. Extensive experiments demonstrate state-of-the-art performance with significant improvements over existing methods across a broad set of multi-clustering benchmarks.

[835] Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization

Sayambhu Sen, Shalabh Bhatnagar

Main category: cs.LG

TL;DR: Introduces an off-policy adversarial imitation learning algorithm that improves sample efficiency by combining off-policy learning with double Q-network stabilization and value learning without reward inference.

Details

Motivation: Addresses the sample inefficiency problem in state-of-the-art imitation learning methods like GAIL, which suffer from slow convergence due to their reliance on on-policy algorithms like TRPO.

Method: Combines off-policy framework with double Q-network based stabilization and value learning without reward function inference to improve sample efficiency.

Result: Demonstrates reduction in samples required to robustly match expert behavior compared to existing methods.

Conclusion: The proposed off-policy adversarial imitation learning approach effectively addresses sample inefficiency issues in imitation learning while maintaining robust expert behavior matching.

Abstract: Learning complex policies with Reinforcement Learning (RL) is often hindered by instability and slow convergence, a problem exacerbated by the difficulty of reward engineering. Imitation Learning (IL) from expert demonstrations bypasses this reliance on rewards. However, state-of-the-art IL methods, exemplified by Generative Adversarial Imitation Learning (GAIL)Ho et. al, suffer from severe sample inefficiency. This is a direct consequence of their foundational on-policy algorithms, such as TRPO Schulman et.al. In this work, we introduce an adversarial imitation learning algorithm that incorporates off-policy learning to improve sample efficiency. By combining an off-policy framework with auxiliary techniques specifically, double Q network based stabilization and value learning without reward function inference we demonstrate a reduction in the samples required to robustly match expert behavior.

[836] RobustA: Robust Anomaly Detection in Multimodal Data

Salem AlMarri, Muhammad Irzam Liaqat, Muhammad Zaigham Zaheer, Shah Nawaz, Karthik Nandakumar, Markus Schedl

Main category: cs.LG

TL;DR: This paper investigates how corrupted modalities affect multimodal anomaly detection and proposes RobustA dataset and a resilient detection method with dynamic weighting.

Details

Motivation: Real-world multimodal data often gets corrupted due to environmental distortions, but current methods don't address this issue properly.

Method: Proposes a method that learns shared representation space for modalities and uses dynamic weighting during inference based on estimated corruption levels.

Result: The proposed method shows notable resilience against corrupted modalities in multimodal anomaly detection.

Conclusion: This work enables real-world application of multimodal anomaly detection by addressing modality corruptions, with the RobustA dataset and features to be made publicly available.

Abstract: In recent years, multimodal anomaly detection methods have demonstrated remarkable performance improvements over video-only models. However, real-world multimodal data is often corrupted due to unforeseen environmental distortions. In this paper, we present the first-of-its-kind work that comprehensively investigates the adverse effects of corrupted modalities on multimodal anomaly detection task. To streamline this work, we propose RobustA, a carefully curated evaluation dataset to systematically observe the impacts of audio and visual corruptions on the overall effectiveness of anomaly detection systems. Furthermore, we propose a multimodal anomaly detection method, which shows notable resilience against corrupted modalities. The proposed method learns a shared representation space for different modalities and employs a dynamic weighting scheme during inference based on the estimated level of corruption. Our work represents a significant step forward in enabling the real-world application of multimodal anomaly detection, addressing situations where the likely events of modality corruptions occur. The proposed evaluation dataset with corrupted modalities and respective extracted features will be made publicly available.

[837] MG-HGNN: A Heterogeneous GNN Framework for Indoor Wi-Fi Fingerprint-Based Localization

Yibu Wang, Zhaoxin Zhang, Ning Li, Xinlong Zhao, Dong Zhao, Tianzi Zhao

Main category: cs.LG

TL;DR: Proposes MG-HGNN, a multi-graph heterogeneous GNN framework that improves Wi-Fi RSSI-based indoor localization by enhancing spatial awareness through multi-type graph construction and heterogeneous GNN learning.

Details

Motivation: Existing RSSI-based positioning methods suffer from reduced accuracy due to environmental complexity and challenges in processing multi-source information, requiring better spatial awareness and positioning performance.

Method: Uses two graph construction branches for node and edge embedding, employs heterogeneous graph neural network for representation learning, and introduces multi-type task-directed graph construction combining label estimation and feature encoding.

Result: Achieves superior performance on UJIIndoorLoc and UTSIndoorLoc datasets compared to state-of-the-art methods, with ablation studies confirming framework effectiveness.

Conclusion: MG-HGNN provides a novel perspective for enhancing GNN-based localization methods and demonstrates improved positioning accuracy through multi-graph heterogeneous learning.

Abstract: Received signal strength indicator (RSSI) is the primary representation of Wi-Fi fingerprints and serves as a crucial tool for indoor localization. However, existing RSSI-based positioning methods often suffer from reduced accuracy due to environmental complexity and challenges in processing multi-source information. To address these issues, we propose a novel multi-graph heterogeneous GNN framework (MG-HGNN) to enhance spatial awareness and improve positioning performance. In this framework, two graph construction branches perform node and edge embedding, respectively, to generate informative graphs. Subsequently, a heterogeneous graph neural network is employed for graph representation learning, enabling accurate positioning. The MG-HGNN framework introduces the following key innovations: 1) multi-type task-directed graph construction that combines label estimation and feature encoding for richer graph information; 2) a heterogeneous GNN structure that enhances the performance of conventional GNN models. Evaluations on the UJIIndoorLoc and UTSIndoorLoc public datasets demonstrate that MG-HGNN not only achieves superior performance compared to several state-of-the-art methods, but also provides a novel perspective for enhancing GNN-based localization methods. Ablation studies further confirm the rationality and effectiveness of the proposed framework.

[838] Can Training Dynamics of Scale-Invariant Neural Networks Be Explained by the Thermodynamics of an Ideal Gas?

Ildus Sadrtdinov, Ekaterina Lobacheva, Ivan Klimov, Mikhail I. Katsnelson, Dmitry Vetrov

Main category: cs.LG

TL;DR: A thermodynamic framework is developed to analyze SGD training dynamics for scale-invariant neural networks, drawing analogies between hyperparameters and thermodynamic variables like temperature and pressure.

Details

Motivation: To better understand the training dynamics of deep neural networks using physics-inspired approaches, particularly for scale-invariant networks that reflect practical architectures with normalization layers.

Method: Developed a thermodynamic framework that establishes analogies between SGD hyperparameters (learning rate, weight decay) and thermodynamic variables (temperature, pressure, volume), starting with isotropic noise models and extending to neural network training.

Result: The framework shows close correspondence between SGD dynamics and ideal gas behavior, with key predictions about stationary entropy aligning well with experimental observations.

Conclusion: This thermodynamic framework provides a principled foundation for interpreting training dynamics and could guide future hyperparameter tuning and learning rate scheduler design.

Abstract: Understanding the training dynamics of deep neural networks remains a major open problem, with physics-inspired approaches offering promising insights. Building on this perspective, we develop a thermodynamic framework to describe the stationary distributions of stochastic gradient descent (SGD) with weight decay for scale-invariant neural networks, a setting that both reflects practical architectures with normalization layers and permits theoretical analysis. We establish analogies between training hyperparameters (e.g., learning rate, weight decay) and thermodynamic variables such as temperature, pressure, and volume. Starting with a simplified isotropic noise model, we uncover a close correspondence between SGD dynamics and ideal gas behavior, validated through theory and simulation. Extending to training of neural networks, we show that key predictions of the framework, including the behavior of stationary entropy, align closely with experimental observations. This framework provides a principled foundation for interpreting training dynamics and may guide future work on hyperparameter tuning and the design of learning rate schedulers.

[839] Superhuman AI for Stratego Using Self-Play Reinforcement Learning and Test-Time Search

Samuel Sokota, Eugene Vinitsky, Hengyuan Hu, J. Zico Kolter, Gabriele Farina

Main category: cs.LG

TL;DR: This paper achieves superhuman performance in Stratego using efficient self-play reinforcement learning and test-time search methods, requiring only thousands of dollars instead of millions.

Details

Motivation: Stratego represents a major AI challenge due to massive hidden information, where previous expensive efforts failed to reach top human performance levels.

Method: Developed general approaches for self-play reinforcement learning and test-time search under imperfect information.

Result: Achieved vastly superhuman performance in Stratego at a cost of only a few thousand dollars, compared to previous million-dollar efforts.

Conclusion: Stratego can now be mastered at superhuman levels with dramatically reduced computational costs using the proposed methods.

Abstract: Few classical games have been regarded as such significant benchmarks of artificial intelligence as to have justified training costs in the millions of dollars. Among these, Stratego – a board wargame exemplifying the challenge of strategic decision making under massive amounts of hidden information – stands apart as a case where such efforts failed to produce performance at the level of top humans. This work establishes a step change in both performance and cost for Stratego, showing that it is now possible not only to reach the level of top humans, but to achieve vastly superhuman level – and that doing so requires not an industrial budget, but merely a few thousand dollars. We achieved this result by developing general approaches for self-play reinforcement learning and test-time search under imperfect information.

[840] Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training

Artyom Sorokin, Nazar Buzun, Alexander Anokhin, Oleg Inozemcev, Egor Vedernikov, Petr Anokhin, Mikhail Burtsev, Trushkov Alexey, Yin Wenshuai, Evgeny Burnaev

Main category: cs.LG

TL;DR: Q-RAG is a novel multi-step retrieval method that fine-tunes Embedder models using reinforcement learning for efficient open-domain question answering, achieving SOTA on long-context benchmarks.

Details

Motivation: Existing RAG methods focus on single-step retrieval which is insufficient for complex questions, while current multi-step approaches require resource-intensive fine-tuning of small LLMs and cannot leverage larger models.

Method: Fine-tunes the Embedder model for multi-step retrieval using reinforcement learning (RL) instead of fine-tuning LLMs directly.

Result: Achieves state-of-the-art results on Babilong and RULER benchmarks for contexts up to 10M tokens, offering a competitive and resource-efficient alternative.

Conclusion: Q-RAG provides an effective RL-based approach for multi-step retrieval that is more efficient than existing methods while maintaining high performance on complex question answering tasks.

Abstract: Retrieval-Augmented Generation (RAG) methods enhance LLM performance by efficiently filtering relevant context for LLMs, reducing hallucinations and inference cost. However, most existing RAG methods focus on single-step retrieval, which is often insufficient for answering complex questions that require multi-step search. Recently, multi-step retrieval approaches have emerged, typically involving the fine-tuning of small LLMs to perform multi-step retrieval. This type of fine-tuning is highly resource-intensive and does not enable the use of larger LLMs. In this work, we propose Q-RAG, a novel approach that fine-tunes the Embedder model for multi-step retrieval using reinforcement learning (RL). Q-RAG offers a competitive, resource-efficient alternative to existing multi-step retrieval methods for open-domain question answering and achieves state-of-the-art results on the popular long-context benchmarks Babilong and RULER for contexts up to 10M tokens.

[841] Grounding Computer Use Agents on Human Demonstrations

Aarash Feizi, Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xing Han Lù, Johan Obando-Ceron, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Adriana Romero-Soriano, Reihaneh Rabbany, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar

Main category: cs.LG

TL;DR: GroundCUA is a large-scale desktop grounding dataset with 56K screenshots and 3.56M human-verified annotations covering 87 applications. Using this data, GroundNext models achieve SOTA results with minimal training data.

Details

Motivation: To address the lack of high-quality desktop interaction datasets for building reliable computer-use agents that can accurately connect natural language instructions to on-screen elements.

Method: Created GroundCUA dataset from expert human demonstrations, then developed GroundNext models (3B and 7B) using supervised fine-tuning and reinforcement learning post-training.

Result: GroundNext achieves state-of-the-art results across five benchmarks using less than one-tenth the training data of prior work, with comparable or superior performance on OSWorld benchmark.

Conclusion: High-quality, expert-driven datasets are critical for advancing general-purpose computer-use agents, as demonstrated by GroundNext’s strong performance with minimal training data.

Abstract: Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark using o3 as planner, GroundNext attains comparable or superior results to models trained with substantially more data,. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.

[842] Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis

Yash Mittal, Dmitry Ignatov, Radu Timofte

Main category: cs.LG

TL;DR: FractalNet introduces fractal-inspired neural architectures for efficient large-scale model diversity exploration, generating over 1,200 variants through systematic layer permutations.

Details

Motivation: To address the challenge of exploring model diversity at large scale efficiently, moving beyond manual architecture design limitations.

Method: Uses template-driven generator with systematic permutations of convolutional, normalization, activation, and dropout layers, incorporating structural recursion and multi-column pathways for balanced depth and width.

Result: Fractal-based architectures demonstrate strong performance and computational efficiency when trained on CIFAR-10 dataset for five epochs using PyTorch with AMP and gradient checkpointing.

Conclusion: Fractal design provides a feasible and resource-efficient approach for automated neural architecture exploration.

Abstract: It introduces FractalNet, a fractal-inspired computational architectures for advanced large language model analysis that mainly challenges model diversity on a large scale in an efficient manner. The new set-up involves a template-driven generator, runner, and evaluation framework that, through systematic permutations of convolutional, normalization, activation, and dropout layers, can create more than 1,200 variants of neural networks. Fractal templates allow for structural recursion and multi-column pathways, thus, models become deeper and wider in a balanced way. Training utilizes PyTorch, Automatic Mixed Precision (AMP), and gradient checkpointing and is carried out on the CIFAR-10 dataset for five epochs. The outcomes show that fractal-based architectures are capable of strong performance and are computationally efficient. The paper positions fractal design as a feasible and resource-efficient method of automated architecture exploration.

[843] TNT: Improving Chunkwise Training for Test-Time Memorization

Zeman Li, Ali Behrouz, Yuan Deng, Peilin Zhong, Praneeth Kacham, Mahdi Karami, Meisam Razaviyayn, Vahab Mirrokni

Main category: cs.LG

TL;DR: TNT is a novel two-stage training paradigm that decouples training efficiency from inference performance for recurrent neural networks with deep test-time memorization modules, achieving up to 17× faster training while improving accuracy.

Details

Motivation: Existing parallelization methods for RNNs with deep memorization modules force a trade-off between training speed and performance due to the chunksize hyperparameter, creating a scalability barrier that prevents these expressive models from reaching their full potential.

Method: Two-stage approach: Stage 1 uses hierarchical memory with global module for long-range context and parallel local modules with periodic memory resets to break sequential dependencies. Stage 2 fine-tunes only local memory modules with smaller chunksize for maximum accuracy.

Result: TNT achieves up to 17× faster training speed compared to the most accurate baseline configuration while simultaneously improving model accuracy on Titans and TTT models.

Conclusion: TNT removes a critical scalability barrier for expressive RNNs, establishing a practical foundation for developing these models and facilitating future work to close the performance gap with Transformers.

Abstract: Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers. While these expressive models do not yet match the peak performance of state-of-the-art Transformers, their potential has been largely untapped due to prohibitively slow training and low hardware utilization. Existing parallelization methods force a fundamental conflict governed by the chunksize hyperparameter: large chunks boost speed but degrade performance, necessitating a fixed, suboptimal compromise. To solve this challenge, we introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process. Stage one is an efficiency-focused pre-training phase utilizing a hierarchical memory. A global module processes large, hardware-friendly chunks for long-range context, while multiple parallel local modules handle fine-grained details. Crucially, by periodically resetting local memory states, we break sequential dependencies to enable massive context parallelization. Stage two is a brief fine-tuning phase where only the local memory modules are adapted to a smaller, high-resolution chunksize, maximizing accuracy with minimal overhead. Evaluated on Titans and TTT models, TNT achieves a substantial acceleration in training speed-up to 17 times faster than the most accurate baseline configuration - while simultaneously improving model accuracy. This improvement removes a critical scalability barrier, establishing a practical foundation for developing expressive RNNs and facilitating future work to close the performance gap with Transformers.

[844] Private Sketches for Linear Regression

Shrutimoy Das, Debanuj Nayak, Anirban Dasgupta

Main category: cs.LG

TL;DR: Proposes differentially private sketches for linear regression instead of noisy parameter estimates, enabling use of standard solvers without privacy risks.

Details

Motivation: Linear regression is widely used but often involves sensitive data requiring privacy protection. Existing DP methods add noise to parameter estimates, but sketches could enable safer use of standard tools.

Method: Develop differentially private sketches for least squares and least absolute deviations regression problems, creating private dataset summaries.

Result: Private sketches are created that maintain differential privacy while allowing application of commonly available regression solvers.

Conclusion: Private sketches provide a viable alternative to noisy parameter estimation for DP linear regression, facilitating safer use of standard statistical tools.

Abstract: Linear regression is frequently applied in a variety of domains. In order to improve the efficiency of these methods, various methods have been developed that compute summaries or \emph{sketches} of the datasets. Certain domains, however, contain sensitive data which necessitates that the application of these statistical methods does not reveal private information. Differentially private (DP) linear regression methods have been developed for mitigating this problem. These techniques typically involve estimating a noisy version of the parameter vector. Instead, we propose releasing private sketches of the datasets. We present differentially private sketches for the problems of least squares regression, as well as least absolute deviations regression. The availability of these private sketches facilitates the application of commonly available solvers for regression, without the risk of privacy leakage.

[845] Consistency Is Not Always Correct: Towards Understanding the Role of Exploration in Post-Training Reasoning

Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Bo Xue, Qingfu Zhang, Hau-San Wong, Taiji Suzuki

Main category: cs.LG

TL;DR: Post-training methods like RLVR and ORM/PRM reinforce existing reasoning paths rather than expanding reasoning scope, creating a paradox about why exploration helps. The paper resolves this by showing exploration preserves rare but crucial reasoning traces needed for hard problems.

Details

Motivation: To reconcile the paradox that exploration helps in post-training despite RLVR and ORM/PRM only reinforcing existing reasoning patterns rather than creating new ones.

Method: Formalizes post-training dynamics through Multi-task Tree-structured Markov Chains (TMC), viewing reasoning as Markov transitions and analyzing how different training strategies affect reasoning path probabilities.

Result: Shows that RLVR squeezes reasoning entropy and forgets correct paths, ORM/PRM favors common patterns over accuracy, and rare high-uncertainty paths solve hard problems. Exploration preserves these crucial rare traces.

Conclusion: Exploration remains essential because it preserves access to rare but crucial reasoning traces needed for difficult cases, which are eliminated by RLVR or disfavored by inference scaling. Strategies like rejecting easy instances and KL regularization help preserve these rare traces.

Abstract: Foundation models exhibit broad knowledge but limited task-specific reasoning, motivating post-training strategies such as RLVR and inference scaling with outcome or process reward models (ORM/PRM). While recent work highlights the role of exploration and entropy stability in improving pass@K, empirical evidence points to a paradox: RLVR and ORM/PRM typically reinforce existing tree-like reasoning paths rather than expanding the reasoning scope, raising the question of why exploration helps at all if no new patterns emerge. To reconcile this paradox, we adopt the perspective of Kim et al. (2025), viewing easy (e.g., simplifying a fraction) versus hard (e.g., discovering a symmetry) reasoning steps as low- versus high-probability Markov transitions, and formalize post-training dynamics through Multi-task Tree-structured Markov Chains (TMC). In this tractable model, pretraining corresponds to tree expansion, while post-training corresponds to chain-of-thought reweighting. We show that several phenomena recently observed in empirical studies arise naturally in this setting: (1) RLVR induces a squeezing effect, reducing reasoning entropy and forgetting some correct paths; (2) population rewards of ORM/PRM encourage consistency rather than accuracy, thereby favoring common patterns; and (3) certain rare, high-uncertainty reasoning paths by the base model are responsible for solving hard problem instances. Together, these explain why exploration – even when confined to the base model’s reasoning scope – remains essential: it preserves access to rare but crucial reasoning traces needed for difficult cases, which are squeezed out by RLVR or unfavored by inference scaling. Building on this, we further show that exploration strategies such as rejecting easy instances and KL regularization help preserve rare reasoning traces. Empirical simulations corroborate our theoretical results.

[846] Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training

Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Hau-San Wong, Qingfu Zhang, Taiji Suzuki

Main category: cs.LG

TL;DR: Curriculum post-training for LLMs avoids exponential complexity bottlenecks by progressively learning through manageable steps, enabling polynomial sample complexity for reasoning tasks.

Details

Motivation: To understand why curriculum techniques outperform non-curriculum approaches in enhancing LLM reasoning performance and establish theoretical foundations for their effectiveness.

Method: Developed a theoretical framework modeling CoT generation as states-conditioned autoregressive reasoning trees, with curriculum stages as depth-increasing or hint-decreasing subtasks, analyzed using reinforcement learning finetuning.

Result: Curriculum post-training achieves high accuracy with polynomial sample complexity, while direct learning suffers from exponential bottlenecks; similar benefits apply to test-time scaling with reduced oracle calls.

Conclusion: Curriculum learning provides principled efficiency gains for LLM reasoning by avoiding exponential complexity through progressive, competence-aligned training stages.

Abstract: Recent curriculum techniques in the post-training stage of LLMs have been widely observed to outperform non-curriculum approaches in enhancing reasoning performance, yet a principled understanding of why and to what extent they work remains elusive. To address this gap, we develop a theoretical framework grounded in the intuition that progressively learning through manageable steps is more efficient than directly tackling a hard reasoning task, provided each stage stays within the model’s effective competence. Under mild complexity conditions linking consecutive curriculum stages, we show that curriculum post-training avoids the exponential complexity bottleneck. To substantiate this result, drawing insights from the Chain-of-Thoughts (CoTs) solving mathematical problems such as Countdown and parity, we model CoT generation as a states-conditioned autoregressive reasoning tree, define a uniform-branching base model to capture pretrained behavior, and formalize curriculum stages as either depth-increasing (longer reasoning chains) or hint-decreasing (shorter prefixes) subtasks. Our analysis shows that, under outcome-only reward signals, reinforcement learning finetuning achieves high accuracy with polynomial sample complexity, whereas direct learning suffers from an exponential bottleneck. We further establish analogous guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to polynomial order.

[847] Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization

Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, Yuxin Chen

Main category: cs.LG

TL;DR: This paper provides a theoretical analysis of transformers’ ability to extrapolate reasoning patterns to longer chain-of-thought problems, proving how attention concentration enables length generalization in state-tracking tasks.

Details

Motivation: To understand whether AI models can extrapolate learned reasoning patterns to solve harder tasks with longer chain-of-thought reasoning, particularly examining transformers' length generalization capabilities.

Method: Theoretical analysis of transformers learning on synthetic state-tracking tasks with gradient descent, mathematical proofs linking attention concentration to length generalization, and analysis of recursive self-training schemes.

Result: Proved that transformers can learn NC^1-complete problems with chain-of-thought, significantly advancing beyond prior TC^0 limitations, and demonstrated attention concentration as the mechanism enabling length generalization.

Conclusion: Transformers can provably extrapolate reasoning to longer chain-of-thought sequences through attention concentration, with recursive self-training extending solvable problem lengths, providing the first optimization guarantee for transformers learning NC^1-complete problems.

Abstract: The ability to reason lies at the core of artificial intelligence (AI), and challenging problems usually call for deeper and longer reasoning to tackle. A crucial question about AI reasoning is whether models can extrapolate learned reasoning patterns to solve harder tasks with longer chain-of-thought (CoT). In this work, we present a theoretical analysis of transformers learning on synthetic state-tracking tasks with gradient descent. We mathematically prove how the algebraic structure of state-tracking problems governs the degree of extrapolation of the learned CoT. Specifically, our theory characterizes the length generalization of transformers through the mechanism of attention concentration, linking the retrieval robustness of the attention layer to the state-tracking task structure of long-context reasoning. Moreover, for transformers with limited reasoning length, we prove that a recursive self-training scheme can progressively extend the range of solvable problem lengths. To our knowledge, we provide the first optimization guarantee that constant-depth transformers provably learn $\mathsf{NC}^1$-complete problems with CoT, significantly going beyond prior art confined in $\mathsf{TC}^0$, unless the widely held conjecture $\mathsf{TC}^0 \neq \mathsf{NC}^1$ fails. Finally, we present a broad set of experiments supporting our theoretical results, confirming the length generalization behaviors and the mechanism of attention concentration.

[848] LoReTTA: A Low Resource Framework To Poison Continuous Time Dynamic Graphs

Himanshu Pal, Venkata Sai Pranav Bachina, Ankit Gangwal, Charu Sharma

Main category: cs.LG

TL;DR: LoReTTA is a novel poisoning attack framework for Temporal Graph Neural Networks that degrades performance by 29.47% on average across 4 datasets and 4 SotA models using a two-phase approach: graph sparsification and adversarial edge replacement.

Details

Motivation: TGNNs are used in high-stakes domains but are vulnerable to poisoning attacks, posing critical security risks that need to be addressed.

Method: Two-stage approach: (1) sparsify graph by removing high-impact edges using temporal importance metrics, (2) replace removed edges with adversarial negatives via degree-preserving negative sampling algorithm.

Result: Degrades TGNN performance by up to 42.0% on MOOC, 31.5% on Wikipedia, 28.8% on UCI, and 15.6% on Enron. Outperforms 11 attack baselines, undetectable to 4 anomaly detection systems, and robust to 4 SotA defenses.

Conclusion: LoReTTA establishes an effective, unnoticeable, and robust poisoning attack framework for TGNNs that operates without expensive surrogate models while maintaining realistic constraints.

Abstract: Temporal Graph Neural Networks (TGNNs) are increasingly used in high-stakes domains, such as financial forecasting, recommendation systems, and fraud detection. However, their susceptibility to poisoning attacks poses a critical security risk. We introduce LoReTTA (Low Resource Two-phase Temporal Attack), a novel adversarial framework on Continuous-Time Dynamic Graphs, which degrades TGNN performance by an average of 29.47% across 4 widely benchmark datasets and 4 State-of-the-Art (SotA) models. LoReTTA operates through a two-stage approach: (1) sparsify the graph by removing high-impact edges using any of the 16 tested temporal importance metrics, (2) strategically replace removed edges with adversarial negatives via LoReTTA’s novel degree-preserving negative sampling algorithm. Our plug-and-play design eliminates the need for expensive surrogate models while adhering to realistic unnoticeability constraints. LoReTTA degrades performance by upto 42.0% on MOOC, 31.5% on Wikipedia, 28.8% on UCI, and 15.6% on Enron. LoReTTA outperforms 11 attack baselines, remains undetectable to 4 leading anomaly detection systems, and is robust to 4 SotA adversarial defense training methods, establishing its effectiveness, unnoticeability, and robustness.

[849] A Diffusion Model to Shrink Proteins While Maintaining Their Function

Ethan Baron, Alan N. Amin, Ruben Weitzman, Debora Marks, Andrew Gordon Wilson

Main category: cs.LG

TL;DR: SCISOR is a discrete diffusion model that generates shortened protein sequences by deleting letters while maintaining natural-like properties, outperforming previous methods in functional preservation and realism.

Details

Motivation: Many medically useful proteins are too long for practical use in labs or delivery, but current shortening methods are expensive and time-consuming. Existing sequence models struggle with combinatorial deletion search and lack deletion-specific training.

Method: SCISOR trains a de-noiser to reverse a forward noising process that adds random insertions to natural protein sequences, enabling efficient generation of shortened sequences through deletion.

Result: SCISOR achieves competitive evolutionary sequence fitting, state-of-the-art functional effect predictions on ProteinGym, and generates significantly more realistic proteins with better functional motif preservation than previous models.

Conclusion: SCISOR provides an effective approach for protein shortening that maintains natural sequence properties and functional integrity, addressing key limitations in protein engineering and delivery applications.

Abstract: Many proteins useful in modern medicine or bioengineering are challenging to make in the lab, fuse with other proteins in cells, or deliver to tissues in the body, because their sequences are too long. Shortening these sequences typically involves costly, time-consuming experimental campaigns. Ideally, we could instead use modern models of massive databases of sequences from nature to learn how to propose shrunken proteins that resemble sequences found in nature. Unfortunately, these models struggle to efficiently search the combinatorial space of all deletions, and are not trained with inductive biases to learn how to delete. To address this gap, we propose SCISOR, a novel discrete diffusion model that deletes letters from sequences to generate protein samples that resemble those found in nature. To do so, SCISOR trains a de-noiser to reverse a forward noising process that adds random insertions to natural sequences. As a generative model, SCISOR fits evolutionary sequence data competitively with previous large models. In evaluation, SCISOR achieves state-of-the-art predictions of the functional effects of deletions on ProteinGym. Finally, we use the SCISOR de-noiser to shrink long protein sequences, and show that its suggested deletions result in significantly more realistic proteins and more often preserve functional motifs than previous models of evolutionary sequences.

[850] C3PO: Optimized Large Language Model Cascades with Probabilistic Cost Constraints for Reasoning

Antonios Valkanas, Soumyasundar Pal, Pavel Rumiantsev, Yingxue Zhang, Mark Coates

Main category: cs.LG

TL;DR: C3PO is a self-supervised framework for optimizing LLM cascades that minimizes regret relative to the most powerful model while controlling inference costs through conformal prediction, achieving state-of-the-art performance without labeled data.

Details

Motivation: LLMs have high inference costs that limit real-world deployment, and existing cascade methods require supervised training, lack theoretical guarantees, and provide limited cost control.

Method: Self-supervised framework using unlabeled model outputs, conformal prediction for cost control, and optimization that minimizes regret with respect to the most powerful model.

Result: Achieves state-of-the-art performance on reasoning benchmarks (GSM8K, MATH-500, BigBench-Hard, AIME), outperforming baselines in accuracy and cost-efficiency with theoretical guarantees.

Conclusion: Principled, label-free cascade optimization enables scalable LLM deployment by effectively controlling costs while maintaining high performance.

Abstract: Large language models (LLMs) have achieved impressive results on complex reasoning tasks, but their high inference cost remains a major barrier to real-world deployment. A promising solution is to use cascaded inference, where small, cheap models handle easy queries, and only the hardest examples are escalated to more powerful models. However, existing cascade methods typically rely on supervised training with labeled data, offer no theoretical generalization guarantees, and provide limited control over test-time computational cost. We introduce C3PO (Cost Controlled Cascaded Prediction Optimization), a self-supervised framework for optimizing LLM cascades under probabilistic cost constraints. By focusing on minimizing regret with respect to the most powerful model (MPM), C3PO avoids the need for labeled data by constructing a cascade using only unlabeled model outputs. It leverages conformal prediction to bound the probability that inference cost exceeds a user-specified budget. We provide theoretical guarantees on both cost control and generalization error, and show that our optimization procedure is effective even with small calibration sets. Empirically, C3PO achieves state-of-the-art performance across a diverse set of reasoning benchmarks including GSM8K, MATH-500, BigBench-Hard and AIME, outperforming strong LLM cascading baselines in both accuracy and cost-efficiency. Our results demonstrate that principled, label-free cascade optimization can enable scalable LLM deployment.

[851] Entangled Schrödinger Bridge Matching

Sophia Tang, Yinuo Zhang, Pranam Chatterjee

Main category: cs.LG

TL;DR: EntangledSBM learns dynamic interactions in multi-particle systems by solving coupled bias forces that entangle particle velocities, enabling accurate simulation of heterogeneous cell populations and rare biomolecular transitions.

Details

Motivation: Existing methods use static snapshots which fail to capture dynamic interactions that evolve over trajectories in systems like biomolecules and cell populations.

Method: Introduces Entangled Schrödinger Bridge Matching framework that learns first- and second-order stochastic dynamics with coupled bias forces entangling particle velocities.

Result: Accurately simulates heterogeneous cell populations under perturbations and rare transitions in high-dimensional biomolecular systems.

Conclusion: EntangledSBM closes the gap for simulating interacting multi-particle systems with dynamic dependencies that cannot be captured through static snapshots.

Abstract: Simulating trajectories of multi-particle systems on complex energy landscapes is a central task in molecular dynamics (MD) and drug discovery, but remains challenging at scale due to computationally expensive and long simulations. Previous approaches leverage techniques such as flow or Schrödinger bridge matching to implicitly learn joint trajectories through data snapshots. However, many systems, including biomolecular systems and heterogeneous cell populations, undergo dynamic interactions that evolve over their trajectory and cannot be captured through static snapshots. To close this gap, we introduce Entangled Schrödinger Bridge Matching (EntangledSBM), a framework that learns the first- and second-order stochastic dynamics of interacting, multi-particle systems where the direction and magnitude of each particle’s path depend dynamically on the paths of the other particles. We define the Entangled Schrödinger Bridge (EntangledSB) problem as solving a coupled system of bias forces that entangle particle velocities. We show that our framework accurately simulates heterogeneous cell populations under perturbations and rare transitions in high-dimensional biomolecular systems.

[852] Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs

Zhongyang Li, Ziyue Li, Tianyi Zhou

Main category: cs.LG

TL;DR: RoMA improves MoE LLMs by aligning routing weights with task embeddings through manifold regularization, requiring only lightweight router finetuning to bridge the performance gap with optimal routing.

Details

Motivation: Existing MoE LLMs show consistent router suboptimality, causing 10-20% accuracy gaps compared to optimal routing, which limits generalization performance across downstream tasks.

Method: Routing Manifold Alignment (RoMA) adds manifold regularization to post-training, finetuning only routers while freezing other parameters. It encourages routing weights to align with successful neighbors in task embedding space, creating task-expert bindings.

Result: RoMA substantially improves performance on diverse benchmarks for OLMoE, DeepSeekMoE, and Qwen3-MoE, demonstrating better generalization through unified task understanding and solution generation.

Conclusion: Aligning routing manifolds with task embeddings effectively reduces the performance gap in MoE LLMs, showing that lightweight router finetuning with manifold regularization can significantly enhance generalization capabilities.

Abstract: Sparse Mixture-of-Experts (MoE) have been widely adopted in recent large language models since it can efficiently scale up the model capability without increasing the inference cost. However, evaluations on broad downstream tasks reveal a consistent suboptimality of the routers in existing MoE LLMs, which results in a severe performance gap (e.g., 10-20% in accuracy) to the optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embedding can effectively reduce the gap and improve MoE LLMs’ generalization performance. Our method, “Routing Manifold Alignment (RoMA)”, introduces an additional manifold regularization term in the post-training objective and only requires lightweight finetuning of routers (with other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (whose routing weights lead to correct answers) in a task embedding space. Consequently, samples targeting similar tasks will share similar expert choices across layers. Building such bindings between tasks and experts over different samples is essential to achieve better generalization. Moreover, RoMA demonstrates the advantage of unifying the task understanding (by embedding models) with solution generation (by MoE LLMs). In experiments, we finetune routers in OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvement brought by RoMA.

[853] Understanding Forgetting in LLM Supervised Fine-Tuning and Preference Learning - A Convex Optimization Perspective

Heshan Fernando, Han Shen, Parikshit Ram, Yi Zhou, Horst Samulowitz, Nathalie Baracaldo, Tianyi Chen

Main category: cs.LG

TL;DR: Proposes a joint post-training framework that combines SFT and RLHF/DPO stages to prevent forgetting, achieving up to 23% performance improvement over sequential training.

Details

Motivation: Sequential post-training (SFT then RLHF/DPO) causes LLMs to forget first-stage training, leading to suboptimal performance despite being simpler to implement.

Method: Theoretical analysis of sequential training suboptimality and development of a practical joint post-training framework with convergence guarantees.

Result: Empirically outperforms sequential training with up to 23% overall performance improvement across multiple LLM evaluation benchmarks, with minimal computational overhead.

Conclusion: Joint post-training framework effectively addresses the forgetting problem in sequential training, providing significant performance gains while maintaining computational efficiency.

Abstract: The post-training of LLMs, which typically consists of the supervised fine-tuning (SFT) stage and the preference learning stage (RLHF or DPO), is crucial to effective and safe LLM applications. The widely adopted approach in post-training popular open-source LLMs is to sequentially perform SFT and RLHF/DPO. However, this is suboptimal in terms of SFT and RLHF/DPO trade-off: the LLM gradually forgets about the first stage’s training when undergoing the second stage’s training. This sequential paradigm persists largely due to its simplicity and modularity, which make it easier to implement and manage at scale despite its limitations. We theoretically prove the sub-optimality of sequential post-training and propose a practical joint post-training framework which has theoretical convergence guarantees and empirically outperforms sequential post-training framework, with up to 23% overall performance improvement across multiple LLM evaluation benchmarks, while having minimal computational overhead. Our code is available at https://github.com/heshandevaka/XRIGHT.

[854] Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training

Elia Cunegatti, Leonardo Lucio Custode, Giovanni Iacca

Main category: cs.LG

TL;DR: NeuroAl is a top-up pruning algorithm for LLMs that maximizes neuron alignment between dense and sparse models using activation information, requiring no retraining and adaptively selecting sparsity parameters.

Details

Motivation: Traditional pruning and retraining methods are impractical for large pre-trained models due to high computational costs. There's a need for pruning methods that work directly on pre-trained models without retraining.

Method: Uses functional information (input activations) from dense pre-trained models to maximize activation alignment between dense and sparse versions. Adaptively selects block-wise and row-wise sparsity ratios based on model and target sparsity.

Result: Tested on ~300 cases with 4 LLM families, 3 sparsity ratios, and 10 language tasks. Consistently outperforms state-of-the-art methods in performance-runtime trade-off.

Conclusion: NeuroAl provides an effective pruning approach for LLMs that eliminates the need for retraining while maintaining performance through optimal neuron alignment.

Abstract: Network pruning focuses on algorithms that aim to reduce a given model’s computational cost by removing a subset of its parameters while having minimal impact on performance. Throughout the last decade, the most widely used pruning paradigm has been pruning and re-training, which nowadays is inconvenient due to the vast amount of pre-trained models, which are, in any case, too expensive to re-train. In this paper, we exploit functional information from dense pre-trained models, i.e., their input activations, to obtain sparse models that maximize the activations’ alignment with respect to their corresponding dense models. Hence, we propose \textbf{NeuroAl}, a \emph{top-up} algorithm that can be used on top of any given pruning algorithm for LLMs, which modifies the block-wise and row-wise sparsity, exploiting information from both the dense model and its sparse version to maximize the \emph{neuron alignment} among activations. Different from existing methods, our approach adaptively selects the best hyperparameters for the block-wise and row-wise sparsity ratios w.r.t. the model and the desired sparsity, and requires \emph{no re-training}. We test our method over $\sim$300 test cases with four LLM families, three sparsity ratios, and ten language tasks (three language modeling and seven zero-shot datasets), showing how it consistently outperforms the latest state-of-the-art methods in terms of performance-runtime trade-off. The code is available at \href{https://github.com/eliacunegatti/NeuroAL}{https://github.com/eliacunegatti/NeuroAL}.

[855] Distributional Surgery for Language Model Activations

Bao Nguyen, Binh Nguyen, Duy Nguyen, Viet Anh Nguyen

Main category: cs.LG

TL;DR: A two-stage method to detect and mitigate undesirable content in language models by rectifying activations through layerwise classifiers and distributional steering policies.

Details

Motivation: Language models can generate harmful or toxic content, requiring effective detection and mitigation methods to ensure safe outputs.

Method: Two-stage approach: 1) Train ensemble of layerwise classifiers to detect undesirable content using activations, 2) Apply layerwise distributional steering policies through semidefinite programming to transform attention heads for detected content.

Result: Empirical evaluations show the method outperforms baselines in reducing undesirable content generation across multiple language models and datasets.

Conclusion: The proposed activation rectification approach effectively mitigates undesirable content in language models through principled detection and steering mechanisms.

Abstract: Language models, while capable of generating remarkably coherent and seemingly accurate text, can occasionally produce undesirable content, including harmful or toxic outputs. In this paper, we present a new two-stage approach to detect and mitigate undesirable content generations by rectifying activations. First, we train an ensemble of layerwise classifiers to detect undesirable content using activations by minimizing a smooth surrogate of the risk-aware score. Then, for detected undesirable contents, we propose layerwise distributional steering policies that transform the attention heads. These policies are computed through principled semidefinite programming, which aims to minimally perturb the attention distribution while probabilistically guaranteeing the effectiveness of the editions. Empirical evaluations across multiple language models and datasets show that our method outperforms baselines in reducing the generation of undesirable output.

[856] Continual Pre-training of MoEs: How robust is your router?

Benjamin Thérien, Charles-Étienne Joseph, Zain Sarwar, Ashwinee Panda, Anirban Das, Shi-Xiong Zhang, Stephen Rawls, Sambit Sahu, Eugene Belilovsky, Irina Rish

Main category: cs.LG

TL;DR: MoE transformers show surprising robustness to distribution shifts during continual pre-training, maintaining performance and sample efficiency without requiring replay, matching fully re-trained models at lower cost.

Details

Motivation: To understand how MoE transformers handle continual pre-training compared to dense models, specifically examining routing algorithm impact on forgetting, load balancing, and whether dense model strategies suffice for MoEs.

Method: Large-scale study training 500M parameter dense transformer and four 500M-active/2B-total parameter MoE transformers for 600B tokens, comparing Sinkhorn-Balanced and Z-and-Aux-loss-balanced routing algorithms.

Result: MoEs demonstrate robustness to distribution shifts even without replay, maintain sample efficiency during CPT, and can match fully re-trained MoE performance at fraction of the cost.

Conclusion: MoE transformers are surprisingly robust for continual pre-training, requiring minimal adaptation strategies and offering cost-effective alternatives to full re-training while maintaining performance.

Abstract: Sparsely-activated Mixture of Experts (MoE) transformers are promising architectures for foundation models. Compared to dense transformers that require the same amount of floating-point operations (FLOPs) per forward pass, MoEs benefit from improved sample efficiency at training time and achieve much stronger performance. Many closed-source and open-source frontier language models have thus adopted an MoE architecture. Naturally, practitioners will want to extend the capabilities of these models with large amounts of newly collected data without completely re-training them. Prior work has shown that a simple combination of replay, learning rate re-warming, and re-decaying can enable the continual pre-training (CPT) of dense decoder-only transformers with minimal performance degradation compared to full re-training. In the case of decoder-only MoE transformers, however, it is unclear how the routing algorithm will impact continual pre-training performance: 1) do the MoE transformer’s routers exacerbate forgetting relative to a dense model?; 2) do the routers maintain a balanced load on previous distributions after CPT?; 3) are the same strategies applied to dense models sufficient to continually pre-train MoE LLMs? In what follows, we conduct a large-scale study training a 500M parameter dense transformer and four 500M-active/2B-total parameter MoE transformers. Each model is trained for 600B tokens. Our results establish a surprising robustness to distribution shifts for MoEs using both Sinkhorn-Balanced and Z-and-Aux-loss-balanced routing algorithms, even in MoEs continually pre-trained without replay. Moreover, we show that MoE LLMs maintain their sample efficiency (relative to a FLOP-matched dense model) during CPT and that they can match the performance of a fully re-trained MoE at a fraction of the cost.

[857] Dissecting Long-Chain-of-Thought Reasoning Models: An Empirical Study

Yongyu Mu, Jiali Zeng, Bei Li, Xinyan Guan, Fandong Meng, Jie Zhou, Tong Xiao, Jingbo Zhu

Main category: cs.LG

TL;DR: Analysis of scaling RL for long-chain-of-thought reasoning reveals: positive samples fit training data while negative samples improve generalization; data inefficiency in group relative policy optimization; and evaluation instability due to uncertain problems and greedy decoding.

Details

Motivation: To understand the underlying training dynamics and counterintuitive behaviors in scaling RL for long-chain-of-thought reasoning models, addressing three key problematic aspects.

Method: Systematic analysis of positive/negative sample roles, investigation of data inefficiency in group relative policy optimization, and examination of performance instability across reasoning models and benchmarks.

Result: Negative samples alone can achieve strong reasoning performance and better generalization; over half of samples yield zero advantage in group relative policy optimization; greedy decoding can flip response correctness in evaluation.

Conclusion: The study provides insights into scaling RL dynamics for reasoning models, revealing the distinct roles of positive/negative samples, data inefficiency issues, and evaluation instability factors that need addressing.

Abstract: Despite recent progress in training long-chain-of-thought reasoning models via scaling reinforcement learning (RL), its underlying training dynamics remain poorly understood, and several counterintuitive behaviors persist. This work focuses on three key aspects: (1) We systematically analyze the roles of positive and negative samples in scaling RL, revealing that positive samples mainly facilitate precise fitting to the training data, whereas negative samples significantly enhance generalization and robustness. Interestingly, while positive samples are essential for convergence in the zero-RL setting, training on negative samples alone suffices to attain strong reasoning performance and even better generalization in cold-start scenarios. (2) We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. To address this, we explore two strategies, including relative length rewards and offline sample injection, to leverage these data better and enhance reasoning efficiency and capability. (3) We investigate unstable performance across various reasoning models and benchmarks, attributing instability to uncertain problems with ambiguous outcomes, and demonstrate that greedy decoding can distort evaluation by flipping the correctness of responses. Our code is available at: https://github.com/takagi97/Dissect-Long-Reason-Models.

[858] Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda

Main category: cs.LG

TL;DR: CAFT is a fine-tuning technique that uses interpretability tools to control LLM generalization by ablating undesired concept directions in latent space, preventing unintended out-of-distribution behaviors without modifying training data.

Details

Motivation: Fine-tuning LLMs often causes unintended out-of-distribution generalization, and existing solutions require modifying training data which isn't always practical.

Method: CAFT leverages interpretability tools to identify undesired concept directions in latent space and ablate them using linear projections during fine-tuning, steering the model away from unintended generalizations.

Result: CAFT successfully reduced misaligned responses by 10x in emergent misalignment scenarios without degrading performance on the training distribution, applied across three fine-tuning tasks.

Conclusion: CAFT represents a novel approach for steering LLM generalization without needing to modify training data or use data from target distributions.

Abstract: Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM’s latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.

[859] Generative Medical Event Models Improve with Scale

Shane Waxler, Paul Blazek, Davis White, Daniel Sneider, Kevin Chung, Mani Nagarathnam, Patrick Williams, Hank Voeller, Karen Wong, Matthew Swanhorst, Sheng Zhang, Naoto Usuyama, Cliff Wong, Tristan Naumann, Hoifung Poon, Andrew Loza, Daniella Meeker, Seth Hain, Rahul Shah

Main category: cs.LG

TL;DR: Curiosity is a family of decoder-only transformer foundation models pretrained on 16.3 billion medical encounters from 118 million patients, capable of predicting next medical events and outperforming task-specific models on 78 real-world healthcare tasks without fine-tuning.

Details

Motivation: To enable personalized medicine at scale by developing foundation models that can distill insights from longitudinal patient journeys and generalize to diverse downstream healthcare tasks.

Method: Pretrained decoder-only transformer models on Epic Cosmos dataset containing 16.3 billion medical encounters from 300 million patients, using autoregressive prediction of next medical events to simulate patient health timelines.

Result: Models show power-law scaling relationships and achieve compute-optimal performance up to 1B parameters. Outperformed or matched task-specific supervised models on 78 real-world tasks including diagnosis prediction and healthcare operations without fine-tuning or few-shot examples.

Conclusion: Curiosity effectively captures complex clinical dynamics and provides an extensible, generalizable framework for clinical decision-making, healthcare operations, and improving patient outcomes through generative medical event modeling.

Abstract: Realizing personalized medicine at scale calls for methods that distill insights from longitudinal patient journeys, which can be viewed as a sequence of medical events. Foundation models pretrained on large-scale medical event data represent a promising direction for scaling real-world evidence generation and generalizing to diverse downstream tasks. Using Epic Cosmos, a dataset with medical events from de-identified longitudinal health records for 16.3 billion encounters over 300 million unique patient records from 310 health systems, we introduce the Curiosity models, a family of decoder-only transformer models pretrained on 118 million patients representing 115 billion discrete medical events (151 billion tokens). We present the largest scaling-law study of medical event data, establishing a methodology for pretraining and revealing power-law scaling relationships for compute, tokens, and model size. Consequently, we pretrained a series of compute-optimal models with up to 1 billion parameters. Conditioned on a patient’s real-world history, Curiosity autoregressively predicts the next medical event to simulate patient health timelines. We studied 78 real-world tasks, including diagnosis prediction, disease prognosis, and healthcare operations. Remarkably for a foundation model with generic pretraining and simulation-based inference, Curiosity generally outperformed or matched task-specific supervised models on these tasks, without requiring task-specific fine-tuning or few-shot examples. Curiosity’s predictive power consistently improves as the model and pretraining scale. Our results show that Curiosity, a generative medical event foundation model, can effectively capture complex clinical dynamics, providing an extensible and generalizable framework to support clinical decision-making, streamline healthcare operations, and improve patient outcomes.

[860] The Markovian Thinker

Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, Siva Reddy

Main category: cs.LG

TL;DR: Proposes Markovian Thinking and Delethink, an RL environment that enables linear compute scaling for long-chain reasoning by structuring thoughts into fixed-size chunks with state carryover, avoiding quadratic attention costs.

Details

Motivation: Standard RL for reasoning LLMs suffers from unbounded state size and quadratic compute costs as reasoning chains lengthen, limiting scalability of long-chain reasoning.

Method: Delethink environment structures reasoning into fixed-size chunks with context reset at boundaries, using RL to learn textual state carryover for seamless continuation. Models learn to write sufficient state summaries at chunk ends.

Result: 1.5B model achieves 24K token reasoning with 8K chunk size, matching 24K-budget LongCoT-RL. Linear compute scaling enables 4x cost reduction at 96K thinking length. Off-the-shelf models already show Markovian reasoning traces.

Conclusion: Redesigning the thinking environment enables efficient long reasoning without quadratic overhead, providing a scalable path for reasoning LLMs through Markovian Thinking paradigm.

Abstract: Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL “thinking environment”, where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.

[861] Reasoning Planning for Language Models

Bao Nguyen, Hieu Trung Nguyen, Ruifeng She, Xiaojin Fu, Viet Anh Nguyen

Main category: cs.LG

TL;DR: EPIC is an Ensemble Planning with Contrastive learning framework that learns to select optimal reasoning methods for queries, improving accuracy while reducing computational costs.

Details

Motivation: Existing approaches assume more candidate answers yield higher accuracy, but this assumption needs revisiting through theoretical analysis of accuracy bounds under fixed generation distributions.

Method: EPIC learns a shared representation space capturing model reasoning abilities and query-method compatibility, incorporating probability bounds as a regularizer in utility-driven optimization.

Result: Experiments on mathematical reasoning tasks show EPIC consistently selects optimal reasoning methods, improving accuracy while reducing computational overhead.

Conclusion: EPIC provides an effective framework for reasoning method selection that balances accuracy and computational efficiency through learned representations and utility optimization.

Abstract: Selecting an appropriate reasoning method for a given query remains a key challenge in language model generation. Existing approaches typically generate multiple candidate responses and use an aggregation strategy to select the output answer, often assuming that more candidate answers yield higher accuracy. We revisit this assumption through a rigorous theoretical analysis, deriving accuracy bounds for standard aggregation methods under fixed generation distributions and candidate sizes. Building on these insights, we introduce EPIC, an Ensemble Planning with Contrastive learning framework to learn a shared representation space that captures both model reasoning abilities and query-method compatibility. EPIC incorporates our probability bounds as a regularizer in a utility-driven optimization that balances accuracy and computational cost. Experiments on diverse mathematical reasoning tasks show that EPIC consistently selects optimal reasoning methods, improving accuracy while reducing computational overhead. Our code can be found at https://github.com/nguyenngocbaocmt02/EPIC.

[862] Explaining Bayesian Neural Networks

Kirill Bykov, Marina M. -C. Höhne, Adelaida Creosteanu, Klaus-Robert Müller, Frederick Klauschen, Shinichi Nakajima, Marius Kloft

Main category: cs.LG

TL;DR: This paper extends local attribution explanations to Bayesian Neural Networks (BNNs), transforming point explanations into explanation distributions that capture uncertainty and variability in model rationales.

Details

Motivation: While Bayesian models like BNNs have built-in transparency through prior distributions, they lack instance-specific explanations. The authors aim to combine XAI attribution methods with Bayesian frameworks to provide uncertainty-aware explanations.

Method: The approach extends local attributions to BNNs by treating explanations probabilistically. Multiple attribution maps are drawn from the approximate posterior distribution, creating explanation distributions that capture variability in model rationales across different weight samples.

Result: Experiments on toy data, benchmarks, and real-world pathology datasets show that the framework enriches standard explanations with uncertainty information and supports visualization of explanation stability. The method reveals how predictive rationales vary across posterior samples.

Conclusion: The proposed framework successfully combines Bayesian modeling with explainable AI, providing explanation distributions that offer deeper insights into model uncertainty and rationale variability, enhancing transparency for Bayesian neural networks.

Abstract: To advance the transparency of learning machines such as Deep Neural Networks (DNNs), the field of Explainable AI (XAI) was established to provide interpretations of DNNs’ predictions. While different explanation techniques exist, a popular approach is given in the form of attribution maps, which illustrate, given a particular data point, the relevant patterns the model has used for making its prediction. Although Bayesian models such as Bayesian Neural Networks (BNNs) have a limited form of transparency built-in through their prior weight distribution, they lack explanations of their predictions for given instances. In this work, we take a step toward combining these two perspectives by examining how local attributions can be extended to BNNs. Within the Bayesian framework, network weights follow a probability distribution; hence, the standard point explanation extends naturally to an explanation distribution. Viewing explanations probabilistically, we aggregate and analyze multiple local attributions drawn from an approximate posterior to explore variability in explanation patterns. The diversity of explanations offers a way to further explore how predictive rationales may vary across posterior samples. Quantitative and qualitative experiments on toy and benchmark data, as well as on a real-world pathology dataset, illustrate that our framework enriches standard explanations with uncertainty information and may support the visualization of explanation stability.

[863] Weight-Entanglement Meets Gradient-Based Neural Architecture Search

Rhea Sanjay Sukthanker, Arjun Krishnakumar, Mahmoud Safari, Frank Hutter

Main category: cs.LG

TL;DR: This paper bridges gradient-based NAS and weight-entanglement by proposing a method to adapt gradient-based approaches for weight-entangled spaces, enabling comparative analysis and preserving memory efficiency.

Details

Motivation: To bridge the gap between gradient-based NAS methods (which use weight sharing) and weight-entanglement techniques that have developed independently in parallel sub-communities.

Method: Proposed a novel scheme to adapt gradient-based methods for weight-entangled search spaces, enabling gradient-based NAS in weight-entangled architectures.

Result: Integration of weight-entanglement and gradient-based NAS brings benefits of gradient-based methods while preserving memory efficiency of weight-entangled spaces.

Conclusion: Successfully bridged the gap between gradient-based NAS and weight-entanglement, enabling efficient exploration of weight-entangled architectural spaces with gradient-based methods.

Abstract: Weight sharing is a fundamental concept in neural architecture search (NAS), enabling gradient-based methods to explore cell-based architectural spaces significantly faster than traditional black-box approaches. In parallel, weight-entanglement has emerged as a technique for more intricate parameter sharing amongst macro-architectural spaces. Since weight-entanglement is not directly compatible with gradient-based NAS methods, these two paradigms have largely developed independently in parallel sub-communities. This paper aims to bridge the gap between these sub-communities by proposing a novel scheme to adapt gradient-based methods for weight-entangled spaces. This enables us to conduct an in-depth comparative assessment and analysis of the performance of gradient-based NAS in weight-entangled search spaces. Our findings reveal that this integration of weight-entanglement and gradient-based NAS brings forth the various benefits of gradient-based methods, while preserving the memory efficiency of weight-entangled spaces. The code for our work is openly accessible https://github.com/automl/TangleNAS.

[864] Diffusion Posterior Sampling is Computationally Intractable

Shivam Gupta, Ajil Jalal, Aditya Parulekar, Eric Price, Zhiyang Xun

Main category: cs.LG

TL;DR: Posterior sampling in diffusion models is computationally intractable under cryptographic assumptions, requiring superpolynomial time even when unconditional sampling is fast.

Details

Motivation: Posterior sampling is useful for tasks like inpainting and MRI reconstruction, but existing heuristic algorithms lack provable polynomial-time convergence guarantees.

Method: The paper analyzes computational complexity using cryptographic assumptions - specifically, the existence of one-way functions and exponentially hard one-way functions.

Result: Shows that posterior sampling is computationally intractable: requires superpolynomial time under basic cryptographic assumptions, and rejection sampling is essentially optimal under stronger assumptions.

Conclusion: Posterior sampling in diffusion models faces fundamental computational barriers, with no efficient algorithms possible under standard cryptographic assumptions.

Abstract: Diffusion models are a remarkably effective way of learning and sampling from a distribution $p(x)$. In posterior sampling, one is also given a measurement model $p(y \mid x)$ and a measurement $y$, and would like to sample from $p(x \mid y)$. Posterior sampling is useful for tasks such as inpainting, super-resolution, and MRI reconstruction, so a number of recent works have given algorithms to heuristically approximate it; but none are known to converge to the correct distribution in polynomial time. In this paper we show that posterior sampling is computationally intractable: under the most basic assumption in cryptography – that one-way functions exist – there are instances for which every algorithm takes superpolynomial time, even though unconditional sampling is provably fast. We also show that the exponential-time rejection sampling algorithm is essentially optimal under the stronger plausible assumption that there are one-way functions that take exponential time to invert.

[865] Addressing Polarization and Unfairness in Performative Prediction

Kun Jin, Tian Xie, Yang Liu, Xueru Zhang

Main category: cs.LG

TL;DR: This paper addresses fairness issues in performative prediction, showing that performative stable solutions can cause polarization and performance disparities, and proposes new fairness mechanisms that ensure both stability and fairness.

Details

Motivation: Existing performative prediction frameworks focus on finding performative stable solutions but overlook their societal impacts on fairness, particularly how they can lead to polarization and disparities in real-world applications like recommendations, hiring, and lending.

Method: The authors introduce novel fairness mechanisms designed specifically for performative prediction settings that provably ensure both stability and fairness, supported by theoretical analysis and empirical validation.

Result: The proposed fairness mechanisms successfully address the limitations of conventional fairness interventions, which often fail under model-dependent distribution shifts, and provide provable guarantees for both stability and fairness.

Conclusion: The paper demonstrates that achieving fairness in performative prediction requires specialized mechanisms that account for model-induced distribution shifts, and the proposed approach effectively ensures both performative stability and fairness simultaneously.

Abstract: In many real-world applications of machine learning such as recommendations, hiring, and lending, deployed models influence the data they are trained on, leading to feedback loops between predictions and data distribution. The performative prediction (PP) framework captures this phenomenon by modeling the data distribution as a function of the deployed model. While prior work has focused on finding performative stable (PS) solutions for robustness, their societal impacts, particularly regarding fairness, remain underexplored. We show that PS solutions can lead to severe polarization and prediction performance disparities, and that conventional fairness interventions in previous works often fail under model-dependent distribution shifts due to failing the PS criteria. To address these challenges in PP, we introduce novel fairness mechanisms that provably ensure both stability and fairness, validated by theoretical analysis and empirical results.

[866] A CNN-LSTM Quantifier for Single Access Point CSI Indoor Localization

Minh Tu Hoang, Brosnan Yuen, Kai Ren, Xiaodai Dong, Tao Lu, Hung Le Nguyen, Robert Westendorp, Kishore Reddy

Main category: cs.LG

TL;DR: Proposes CNN-LSTM network for WiFi fingerprinting indoor localization using CSI data, achieving 2.5m average error with 80% errors under 4m using single router.

Details

Motivation: Conventional methods use only spatial data with classification models, limiting estimation to reference points only. Need to extract both spatial and temporal features and enable quantification for unknown testing points.

Method: Combined CNN-LSTM network to extract space and time features from CSI data, with comprehensive filter and normalization to mitigate CSI instability. Uses quantification model instead of classification.

Result: Achieves 2.5m average localization error with 80% errors under 4m using single WiFi router, outperforming other algorithms by approximately 50% in same test environment.

Conclusion: CNN-LSTM network with quantification model effectively extracts spatial-temporal features from CSI, enabling accurate indoor localization with single router and handling unknown testing points.

Abstract: This paper proposes a combined network structure between convolutional neural network (CNN) and long-short term memory (LSTM) quantifier for WiFi fingerprinting indoor localization. In contrast to conventional methods that utilize only spatial data with classification models, our CNN-LSTM network extracts both space and time features of the received channel state information (CSI) from a single router. Furthermore, the proposed network builds a quantification model rather than a limited classification model as in most of the literature work, which enables the estimation of testing points that are not identical to the reference points. We analyze the instability of CSI and demonstrate a mitigation solution using a comprehensive filter and normalization scheme. The localization accuracy is investigated through extensive on-site experiments with several mobile devices including mobile phone (Nexus 5) and laptop (Intel 5300 NIC) on hundreds of testing locations. Using only a single WiFi router, our structure achieves an average localization error of 2.5~~m with $\mathrm{80%}$ of the errors under 4~~m, which outperforms the other reported algorithms by approximately $\mathrm{50%}$ under the same test environment.

[867] Benchmarking Large Language Models with Integer Sequence Generation Tasks

Daniel O’Malley, Manish Bhattarai, Nishath Rajiv Ranasinghe, Erick Draayer, Javier Santos

Main category: cs.LG

TL;DR: A new benchmark evaluates LLMs on mathematical reasoning using OEIS integer sequences, testing Python code generation without lookup tables. Reasoning-specialized models perform better but struggle with hard sequences, revealing limitations in algorithmic reasoning.

Details

Motivation: To rigorously assess LLMs' capabilities in mathematical reasoning and algorithmic code synthesis, particularly their ability to generate correct Python code for computing integer sequences without relying on memorized values.

Method: Uses 1000 OEIS sequences (500 classical, 500 recent) categorized as easy/hard, with automated cheating detection to prevent lookup table usage. Evaluates models from OpenAI, Anthropic, Meta, and Google on Python code generation tasks.

Result: Reasoning-specialized models (OpenAI o-series, Gemini 2.5-pro) show substantial accuracy improvements over non-reasoning models, but overall performance on hard sequences remains poor, highlighting algorithmic reasoning challenges.

Conclusion: The benchmark reveals significant limitations in current LLMs’ ability to solve complex mathematical reasoning tasks algorithmically, emphasizing the need for further advancements in this area.

Abstract: We present a novel benchmark designed to rigorously evaluate the capabilities of large language models (LLMs) in mathematical reasoning and algorithmic code synthesis tasks. The benchmark comprises integer sequence generation tasks sourced from the Online Encyclopedia of Integer Sequences (OEIS), testing LLMs’ abilities to accurately and efficiently generate Python code to compute these sequences without using lookup tables. Our comprehensive evaluation includes leading models from OpenAI (including the specialized reasoning-focused o-series), Anthropic, Meta, and Google across a carefully selected set of 1000 OEIS sequences categorized as easy'' or hard.’’ Half of these sequences are classical sequences from the early days of OEIS and half were recently added to avoid contamination with the models’ training data. To prevent models from exploiting memorized sequence values, we introduce an automated cheating detection mechanism that flags usage of lookup tables, validated by comparison with human expert evaluations. Experimental results demonstrate that reasoning-specialized models (o3, o3-mini, o4-mini from OpenAI, and Gemini 2.5-pro from Google) achieve substantial improvements in accuracy over non-reasoning models, especially on more complex tasks. However, overall model performance on the hard sequences is poor, highlighting persistent challenges in algorithmic reasoning. Our benchmark provides important insights into the strengths and limitations of state-of-the-art LLMs, particularly emphasizing the necessity for further advancements to reliably solve complex mathematical reasoning tasks algorithmically.

[868] Revenue Maximization and Learning in Products Ranking

Ningyuan Chen, Anran Li, Shuoguang Yang

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: The paper analysis cannot be performed as the abstract content is not accessible

Method: Attempted to retrieve paper metadata from arXiv API but encountered rate limiting issues

Result: No results available - paper content not fetched

Conclusion: Analysis cannot be completed due to technical limitations in accessing the paper

Abstract: Failed to fetch summary for 2012.03800: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2012.03800&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[869] Using Machine Learning to Discover Parsimonious and Physically-Interpretable Representations of Catchment-Scale Rainfall-Runoff Dynamics

Yuan-Heng Wang, Hoshin V. Gupta

Main category: cs.LG

TL;DR: The paper proposes using physically-interpretable machine learning models with Mass-Conserving-Perceptron units for hydrological modeling, achieving both good predictive performance and physical interpretability with minimal complexity.

Details

Motivation: Traditional physical-conceptual models have poor predictive performance but are preferred for interpretability, while ML models lack physical interpretability despite better predictions. There's a need for models that combine both good performance and interpretability.

Method: Uses distributed-state networks with Mass-Conserving-Perceptron units, context-dependent gating, and information sharing across nodes. Focuses on parsimonious minimally-optimal representations with few layers (up to 2) and physical flow pathways (up to 3).

Result: Achieves both physical interpretability and good predictive performance in catchment-scale streamflow modeling. The distributed-state mechanism ensures sufficient system storage properties while information sharing synchronizes them properly.

Conclusion: MCP-based ML models with minimal complexity can significantly contribute to ML-based streamflow modeling by providing physically-interpretable solutions without sacrificing predictive performance.

Abstract: Due largely to challenges associated with physical interpretability of machine learning (ML) methods, and because model interpretability is key to credibility in management applications, many scientists and practitioners are hesitant to discard traditional physical-conceptual (PC) modeling approaches despite their poorer predictive performance. Here, we examine how to develop parsimonious minimally-optimal representations that can facilitate better insight regarding system functioning. The term minimally-optimal indicates that the desired outcome can be achieved with the smallest possible effort and resources, while parsimony is widely held to support understanding. Accordingly, we suggest that ML-based modeling should use computational units that are inherently physically-interpretable, and explore how generic network architectures comprised of Mass-Conserving-Perceptron can be used to model dynamical systems in a physically-interpretable manner. In the context of spatially-lumped catchment-scale modeling, we find that both physical interpretability and good predictive performance can be achieved using a distributed-state network with context-dependent gating and information sharing across nodes. The distributed-state mechanism ensures a sufficient number of temporally-evolving properties of system storage while information-sharing ensures proper synchronization of such properties. The results indicate that MCP-based ML models with only a few layers (up to two) and relativity few physical flow pathways (up to three) can play a significant role in ML-based streamflow modelling.

[870] Impacts of Individual Fairness on Group Fairness from the Perspective of Generalized Entropy

Youngmi Jin, Jio Gim, Tae-Jin Lee, Young-Joo Suh

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2202.11966: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2202.11966&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[871] Corruptions of Supervised Learning Problems: Typology and Mitigations

Laura Iacovissi, Nan Lu, Robert C. Williamson

Main category: cs.LG

TL;DR: This paper develops a unified theory of corruption in supervised learning that encompasses all modifications to learning problems, including changes to model class and loss, using Markov kernels to model distribution changes.

Details

Motivation: Existing literature on data corruption focuses on specific settings without a unified view, lacking comprehensive corruption modelization and mitigation frameworks.

Method: The approach uses Markov kernels to model changes in probability distributions, enabling construction of an exhaustive corruption framework that distinguishes corruption types and unifies existing models.

Result: The framework reveals that label corruptions affect only the loss function while attribute corruptions additionally influence the hypothesis class, and enables expansion of loss-correction methods to handle dependent corruption types.

Conclusion: The classical corruption-corrected learning framework needs generalization to a new paradigm with weaker requirements to encompass more corruption types, with proposed loss correction formulas for attribute and joint corruption cases.

Abstract: Corruption is notoriously widespread in data collection. Despite extensive research, the existing literature predominantly focuses on specific settings and learning scenarios, lacking a unified view of corruption modelization and mitigation. In this work, we develop a general theory of corruption, which incorporates all modifications to a supervised learning problem, including changes in model class and loss. Focusing on changes to the underlying probability distributions via Markov kernels, our approach leads to three novel opportunities. First, it enables the construction of a novel, provably exhaustive corruption framework, distinguishing among different corruption types. This serves to unify existing models and establish a consistent nomenclature. Second, it facilitates a systematic analysis of corruption’s consequences on learning tasks, by comparing Bayes risks in the clean and corrupted scenarios. Notably, while label corruptions affect only the loss function, attribute corruptions additionally influence the hypothesis class. Third, building upon these results, we investigate mitigations for various corruption types. We expand existing loss-correction methods for label corruption to handle dependent corruption types. Our findings highlight the necessity to generalize the classical corruption-corrected learning framework to a new paradigm with weaker requirements to encompass more corruption types. We provide such a paradigm as well as loss correction formulas in the attribute and joint corruption cases.

[872] Deeper or Wider: A Perspective from Optimal Generalization Error with Sobolev Loss

Yahong Yang, Juncai He

Main category: cs.LG

TL;DR: Comparison of deeper vs wider neural networks for optimal generalization error in Sobolev losses, showing parameter count favors width while sample size and loss regularity favor depth.

Details

Motivation: Address the persistent question in neural network architecture design about whether to build deeper or wider networks for optimal performance.

Method: Analytical investigation of deeper neural networks (DeNNs) with flexible layers vs wider neural networks (WeNNs) with limited hidden layers, focusing on generalization error in Sobolev losses.

Result: Parameter count favors wider networks, while increased sample points and greater loss function regularity favor deeper networks.

Conclusion: Applied theory to guide neural network design for solving partial differential equations using deep Ritz and PINN methods.

Abstract: Constructing the architecture of a neural network is a challenging pursuit for the machine learning community, and the dilemma of whether to go deeper or wider remains a persistent question. This paper explores a comparison between deeper neural networks (DeNNs) with a flexible number of layers and wider neural networks (WeNNs) with limited hidden layers, focusing on their optimal generalization error in Sobolev losses. Analytical investigations reveal that the architecture of a neural network can be significantly influenced by various factors, including the number of sample points, parameters within the neural networks, and the regularity of the loss function. Specifically, a higher number of parameters tends to favor WeNNs, while an increased number of sample points and greater regularity in the loss function lean towards the adoption of DeNNs. We ultimately apply this theory to address partial differential equations using deep Ritz and physics-informed neural network (PINN) methods, guiding the design of neural networks.

[873] HyperSHAP: Shapley Values and Interactions for Explaining Hyperparameter Optimization

Marcel Wever, Maximilian Muschalik, Fabian Fumagalli, Marius Lindauer

Main category: cs.LG

TL;DR: HyperSHAP is a game-theoretic explainability framework for hyperparameter optimization that uses Shapley values to provide transparent explanations of hyperparameter contributions and interactions.

Details

Motivation: Current HPO methods are black-box and opaque, which undermines user trust and discourages adoption. There's a need for explainable HPO methods that can reveal how individual hyperparameters impact model performance.

Method: The proposed HyperSHAP framework uses Shapley values and interactions to provide an additive decomposition of performance measures across hyperparameters, enabling local and global explanations.

Result: HyperSHAP successfully analyzes interaction structures in various HPO benchmarks, providing insights into ablation studies, algorithm tunability, and optimizer behavior across different hyperparameter spaces.

Conclusion: HyperSHAP offers broad applicability and actionable insights for improving HPO by making the optimization process more transparent and interpretable through game-theoretic explanations.

Abstract: Hyperparameter optimization (HPO) is a crucial step in achieving strong predictive performance. Yet, the impact of individual hyperparameters on model generalization is highly context-dependent, prohibiting a one-size-fits-all solution and requiring opaque HPO methods to find optimal configurations. However, the black-box nature of most HPO methods undermines user trust and discourages adoption. To address this, we propose a game-theoretic explainability framework for HPO based on Shapley values and interactions. Our approach provides an additive decomposition of a performance measure across hyperparameters, enabling local and global explanations of hyperparameters’ contributions and their interactions. The framework, named HyperSHAP, offers insights into ablation studies, the tunability of learning algorithms, and optimizer behavior across different hyperparameter spaces. We demonstrate HyperSHAP’s capabilities on various HPO benchmarks to analyze the interaction structure of the corresponding HPO problems, demonstrating its broad applicability and actionable insights for improving HPO.

[874] Optimization without Retraction on the Random Generalized Stiefel Manifold

Simon Vary, Pierre Ablin, Bin Gao, P. -A. Absil

Main category: cs.LG

TL;DR: A stochastic iterative method for optimization on generalized Stiefel manifolds that uses random estimates of B instead of the full matrix, achieving lower per-iteration cost while maintaining convergence rates.

Details

Motivation: Many applications like CCA, ICA, and GEVP require optimization on generalized Stiefel manifolds, but existing methods need the full matrix B, which can be computationally expensive.

Method: Proposed a stochastic iterative method that uses random estimates of B, doesn’t enforce constraints every iteration, and converges to critical points on the generalized Stiefel manifold in expectation.

Result: Method achieves lower per-iteration cost, requires only matrix multiplications, and maintains same convergence rates as full-matrix Riemannian optimization methods.

Conclusion: The stochastic approach is effective for machine learning applications with generalized orthogonality constraints, providing computational efficiency without sacrificing convergence performance.

Abstract: Optimization over the set of matrices $X$ that satisfy $X^\top B X = I_p$, referred to as the generalized Stiefel manifold, appears in many applications involving sampled covariance matrices such as the canonical correlation analysis (CCA), independent component analysis (ICA), and the generalized eigenvalue problem (GEVP). Solving these problems is typically done by iterative methods that require a fully formed $B$. We propose a cheap stochastic iterative method that solves the optimization problem while having access only to random estimates of $B$. Our method does not enforce the constraint in every iteration; instead, it produces iterations that converge to critical points on the generalized Stiefel manifold defined in expectation. The method has lower per-iteration cost, requires only matrix multiplications, and has the same convergence rates as its Riemannian optimization counterparts that require the full matrix $B$. Experiments demonstrate its effectiveness in various machine learning applications involving generalized orthogonality constraints, including CCA, ICA, and the GEVP.

[875] TrustChain: A Blockchain Framework for Auditing and Verifying Aggregators in Decentralized Federated Learning

Ehsan Hallaji, Roozbeh Razavi-Far, Mehrdad Saif

Main category: cs.LG

TL;DR: TrustChain is a DFL framework that scores aggregators before selection based on past behavior and audits them after aggregation using HSIC to detect statistical independence between client updates and aggregated models.

Details

Motivation: Current DFL architectures ensure aggregator trustworthiness upon selection but overlook the risk of aggregators turning rogue after nomination, creating security vulnerabilities.

Method: Uses blockchain, anomaly detection, and concept drift analysis with HSIC to continuously monitor statistical independence between client updates and aggregated models.

Result: Evaluated on multiple federated datasets and attack scenarios with varying numbers of Byzantine nodes, showing effectiveness in detecting malicious aggregator behavior.

Conclusion: TrustChain provides enhanced security in DFL by preventing and detecting malicious aggregator actions through pre-selection scoring and post-aggregation auditing.

Abstract: The server-less nature of Decentralized Federated Learning (DFL) requires allocating the aggregation role to specific participants in each federated round. Current DFL architectures ensure the trustworthiness of the aggregator node upon selection. However, most of these studies overlook the possibility that the aggregating node may turn rogue and act maliciously after being nominated. To address this problem, this paper proposes a DFL structure, called TrustChain, that scores the aggregators before selection based on their past behavior and additionally audits them after the aggregation. To do this, the statistical independence between the client updates and the aggregated model is continuously monitored using the Hilbert-Schmidt Independence Criterion (HSIC). The proposed method relies on several principles, including blockchain, anomaly detection, and concept drift analysis. The designed structure is evaluated on several federated datasets and attack scenarios with different numbers of Byzantine nodes.

[876] $μ$LO: Compute-Efficient Meta-Generalization of Learned Optimizers

Benjamin Thérien, Charles-Étienne Joseph, Boris Knyazev, Edouard Oyallon, Irina Rish, Eugene Belilovsky

Main category: cs.LG

TL;DR: The paper analysis could not be completed due to HTTP 429 error when fetching the abstract from arXiv API, indicating rate limiting or server overload.

Details

Motivation: Unable to determine the paper's motivation as the abstract content could not be retrieved from the arXiv API.

Method: Methodology details are unavailable due to the failed API request with HTTP 429 status code.

Result: No results can be analyzed as the paper content was not accessible from the arXiv database.

Conclusion: The analysis process was unsuccessful due to technical limitations in accessing the paper’s abstract information.

Abstract: Failed to fetch summary for 2406.00153: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.00153&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[877] A Novel Loss Function for Deep Learning Based Daily Stock Trading System

Ruoyu Guo, Haochen Qiu, Xuelun Hou

Main category: cs.LG

TL;DR: The paper introduces a return-weighted loss function for AI-based stock trading that achieves high returns (61.73% annual return) using only public data and technical indicators, outperforming traditional methods.

Details

Motivation: To enhance AI's utility in financial markets by addressing the black-box nature of deep learning models and improving profitability in volatile stock markets through better loss functions.

Method: Proposes a novel return-weighted loss function that drives top growth detection, using only publicly accessible stock data (OHLC, volume, sector info) and technical indicators with ML model architecture.

Result: Achieved 61.73% annual return with Sharpe Ratio 1.18 (2019-2024) and 37.61% annual return with Sharpe Ratio 0.97 (2005-2010) on daily rebalancing over extensive testing periods.

Conclusion: The return-weighted loss function, combined with categorical-continuous data integration and ML architecture, successfully drives profitable trading independent of domain knowledge, demonstrating superiority over traditional loss functions.

Abstract: Making consistently profitable financial decisions in a continuously evolving and volatile stock market has always been a difficult task. Professionals from different disciplines have developed foundational theories to anticipate price movement and evaluate securities such as the famed Capital Asset Pricing Model (CAPM). In recent years, the role of artificial intelligence (AI) in asset pricing has been growing. Although the black-box nature of deep learning models lacks interpretability, they have continued to solidify their position in the financial industry. We aim to further enhance AI’s potential and utility by introducing a return-weighted loss function that will drive top growth while providing the ML models a limited amount of information. Using only publicly accessible stock data (open/close/high/low, trading volume, sector information) and several technical indicators constructed from them, we propose an efficient daily trading system that detects top growth opportunities. Our best models achieve 61.73% annual return on daily rebalancing with an annualized Sharpe Ratio of 1.18 over 1340 testing days from 2019 to 2024, and 37.61% annual return with an annualized Sharpe Ratio of 0.97 over 1360 testing days from 2005 to 2010. The main drivers for success, especially independent of any domain knowledge, are the novel return-weighted loss function, the integration of categorical and continuous data, and the ML model architecture. We also demonstrate the superiority of our novel loss function over traditional loss functions via several performance metrics and statistical evidence.

[878] GC4NC: A Benchmark Framework for Graph Condensation on Node Classification with New Insights

Shengbo Gong, Juntong Ni, Noveen Sachdeva, Carl Yang, Wei Jin

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Analysis cannot be performed as the paper content could not be retrieved

Method: N/A - Paper content unavailable

Result: N/A - Paper content unavailable

Conclusion: N/A - Paper content unavailable

Abstract: Failed to fetch summary for 2406.16715: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.16715&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[879] Preference-Guided Reinforcement Learning for Efficient Exploration

Guojian Wang, Jianxiang Liu, Xinyuan Li, Faguo Wu, Xiao Zhang, Tianyuan Chen, Xuyang Chen

Main category: cs.LG

TL;DR: Failed to fetch summary for 2407.06503 due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to analyze motivation due to retrieval failure

Method: Unable to analyze method due to retrieval failure

Result: Unable to analyze results due to retrieval failure

Conclusion: Unable to analyze conclusion due to retrieval failure

Abstract: Failed to fetch summary for 2407.06503: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.06503&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[880] MENSA: A Multi-Event Network for Survival Analysis with Trajectory-based Likelihood Estimation

Christian Marius Lillelund, Ali Hossein Gharari Foomani, Weijie Sun, Shi-ang Qi, Russell Greiner

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: N/A - Paper content not accessible

Method: N/A - Paper content not accessible

Result: N/A - Paper content not accessible

Conclusion: N/A - Paper content not accessible

Abstract: Failed to fetch summary for 2409.06525: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.06525&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[881] Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching

Yanhao Dong, Yubo Miao, Weinan Li, Xiao Zheng, Chao Wang, Jiesheng Wu, Feng Lyu

Main category: cs.LG

TL;DR: Proposes L2 Cache-oriented asynchronous KV Cache prefetching to overcome memory bandwidth bottlenecks in LLM inference by overlapping computation with memory operations.

Details

Motivation: LLMs face memory-bound limitations during inference due to HBM bandwidth constraints, which restricts performance despite computational capabilities.

Method: Strategic scheduling of idle memory bandwidth during computation windows to prefetch KV Cache into GPU L2 cache, enabling high-speed cache hits and hiding HBM latency.

Result: Achieves 2.15x improvement in attention kernel efficiency and up to 1.97x end-to-end throughput enhancement on NVIDIA H20 GPUs, outperforming FlashAttention-3.

Conclusion: The method provides an orthogonal, scalable latency-hiding solution that can be integrated with existing inference frameworks for next-generation LLM engines.

Abstract: Large Language Models (LLMs) exhibit pronounced memory-bound characteristics during inference due to High Bandwidth Memory (HBM) bandwidth constraints. In this paper, we propose an L2 Cache-oriented asynchronous KV Cache prefetching method to break through the memory bandwidth bottleneck in LLM inference through computation-load overlap. By strategically scheduling idle memory bandwidth during active computation windows, our method proactively prefetches required KV Cache into GPU L2 cache, enabling high-speed L2 cache hits for subsequent accesses and effectively hiding HBM access latency within computational cycles. Extensive experiments on NVIDIA H20 GPUs demonstrate that the proposed method achieves 2.15x improvement in attention kernel efficiency and up to 1.97x end-to-end throughput enhancement, surpassing state-of-the-art baseline FlashAttention-3. Notably, our solution maintains orthogonality to existing optimization techniques and can be integrated with current inference frameworks, providing a scalable latency-hiding solution for next-generation LLM inference engines.

[882] The Dark Side of Rich Rewards: Understanding and Mitigating Noise in VLM Rewards

Sukai Huang, Shu-Wei Liu, Nir Lipovetzky, Trevor Cohn

Main category: cs.LG

TL;DR: VLM-generated rewards for embodied agents often underperform intrinsic rewards due to false positives. A new reward function called BiMI (Binary Mutual Information) is introduced to reduce noise and improve learning efficiency.

Details

Motivation: To understand why Vision-Language Model (VLM) rewards underperform intrinsic rewards for training embodied agents, and to address the issue of false positive rewards that harm learning.

Method: Analyzed the impact of false positive vs false negative rewards, identified cosine similarity as problematic, and developed BiMI reward function to mitigate noise in multimodal reward signals.

Result: BiMI significantly improves learning efficiency across diverse embodied navigation environments compared to standard VLM rewards.

Conclusion: False positive rewards are more detrimental than false negatives, and addressing multimodal reward signal noise is crucial for effectively training embodied agents with VLM guidance.

Abstract: While Vision-Language Models (VLMs) are increasingly used to generate reward signals for training embodied agents to follow instructions, our research reveals that agents guided by VLM rewards often underperform compared to those employing only intrinsic (exploration-driven) rewards, contradicting expectations set by recent work. We hypothesize that false positive rewards – instances where unintended trajectories are incorrectly rewarded – are more detrimental than false negatives. Our analysis confirms this hypothesis, revealing that the widely used cosine similarity metric is prone to false positive reward estimates. To address this, we introduce BiMI ({Bi}nary {M}utual {I}nformation), a novel reward function designed to mitigate noise. BiMI significantly enhances learning efficiency across diverse and challenging embodied navigation environments. Our findings offer a nuanced understanding of how different types of reward noise impact agent learning and highlight the importance of addressing multimodal reward signal noise when training embodied agents

[883] Quantum Doubly Stochastic Transformers

Jannis Born, Filip Skogh, Kahn Rhrissorrakrai, Filippo Utro, Nico Wagner, Aleksandros Sobczyk

Main category: cs.LG

TL;DR: The paper introduces QDSFormer, a hybrid classical-quantum Transformer that replaces softmax with a variational quantum circuit to generate doubly stochastic attention matrices, improving performance and training stability.

Details

Motivation: Softmax normalization in Transformers often destabilizes training, and while Sinkhorn's algorithm improves performance, it's iterative, approximative, and inflexible. Quantum circuits offer a novel parametric approach to doubly stochastic matrices with no classical analogue.

Method: Replace softmax in self-attention with a variational quantum circuit that generates doubly stochastic matrices. Compare against standard ViT, Sinkformer, and a novel quantum-inspired doubly stochastic Transformer based on QR decomposition.

Result: QDSFormer consistently surpasses standard ViT and other doubly stochastic Transformers across multiple small-scale object recognition tasks, with improved training stability and lower performance variation.

Conclusion: Quantum circuits provide more diverse doubly stochastic matrices that better preserve information, potentially mitigating ViT’s unstable training on small-scale data while consistently improving performance.

Abstract: At the core of the Transformer, the softmax normalizes the attention matrix to be right stochastic. Previous research has shown that this often de-stabilizes training and that enforcing the attention matrix to be doubly stochastic (through Sinkhorn’s algorithm) consistently improves performance across different tasks, domains and Transformer flavors. However, Sinkhorn’s algorithm is iterative, approximative, non-parametric and thus inflexible w.r.t. the obtained doubly stochastic matrix (DSM). Recently, it has been proven that DSMs can be obtained with a parametric quantum circuit, yielding a novel quantum inductive bias for DSMs with no known classical analogue. Motivated by this, we demonstrate the feasibility of a hybrid classical-quantum doubly stochastic Transformer (QDSFormer) that replaces the softmax in the self-attention layer with a variational quantum circuit. We study the expressive power of the circuit and find that it yields more diverse DSMs that better preserve information than classical operators. Across multiple small-scale object recognition tasks, we find that our QDSFormer consistently surpasses both a standard ViT and other doubly stochastic Transformers. Beyond the Sinkformer, this comparison includes a novel quantum-inspired doubly stochastic Transformer (based on QR decomposition) that can be of independent interest. Our QDSFormer also shows improved training stability and lower performance variation suggesting that it may mitigate the notoriously unstable training of ViTs on small-scale data.

[884] On the Convergence of Continual Federated Learning Using Incrementally Aggregated Gradients

Satish Kumar Keshri, Nazreen Shah, Ranjitha Prasad

Main category: cs.LG

TL;DR: C-FLAG is a novel Continual Federated Learning method that uses aggregated gradients and replay memory to prevent global catastrophic forgetting while maintaining privacy and scalability.

Details

Motivation: To enable Continual Federated Learning that overcomes global catastrophic forgetting where global model accuracy declines on old tasks when learning new ones, while enhancing efficiency, privacy, and scalability.

Method: Proposes C-FLAG with edge-based gradient updates on memory and aggregated gradients on current data, using replay-memory based federated strategy with optimization sub-problem for minimizing forgetting and adaptive learning rates.

Result: Converges at rate O(1/√T) over T communication rounds, outperforms state-of-the-art baselines on task and class-incremental settings in accuracy and forgetting metrics.

Conclusion: C-FLAG effectively addresses catastrophic forgetting in Continual Federated Learning through aggregated gradients and replay memory, providing theoretical convergence guarantees and empirical superiority over existing methods.

Abstract: The holy grail of machine learning is to enable Continual Federated Learning (CFL) to enhance the efficiency, privacy, and scalability of AI systems while learning from streaming data. The primary challenge of a CFL system is to overcome global catastrophic forgetting, wherein the accuracy of the global model trained on new tasks declines on the old tasks. In this work, we propose Continual Federated Learning with Aggregated Gradients (C-FLAG), a novel replay-memory based federated strategy consisting of edge-based gradient updates on memory and aggregated gradients on the current data. We provide convergence analysis of the C-FLAG approach which addresses forgetting and bias while converging at a rate of $O(1/\sqrt{T})$ over $T$ communication rounds. We formulate an optimization sub-problem that minimizes catastrophic forgetting, translating CFL into an iterative algorithm with adaptive learning rates that ensure seamless learning across tasks. We empirically show that C-FLAG outperforms several state-of-the-art baselines on both task and class-incremental settings with respect to metrics such as accuracy and forgetting.

[885] AnomalyAID: Reliable Interpretation for Semi-supervised Network Anomaly Detection

Yachao Yuan, Yu Huang, Yingwen Wu, Jin Wang

Main category: cs.LG

TL;DR: AnomalyAID is a framework that makes semi-supervised network anomaly detection interpretable and improves performance with limited labeled data through reliable explanations and pseudo-labeling.

Details

Motivation: Semi-supervised learning is crucial for network anomaly detection but faces challenges with limited labeled samples and lack of interpretability, creating barriers to practical adoption.

Method: Proposes AnomalyAID with: (1) novel interpretation approach using global and local interpreters for reliable explanations, (2) two-stage semi-supervised learning framework with model prediction alignment and special constraints for pseudo-labeling.

Result: Experimental evaluation shows AnomalyAID provides accurate detection results with reliable interpretations for semi-supervised network anomaly detection systems across two representative tasks.

Conclusion: AnomalyAID successfully addresses interpretability and performance challenges in semi-supervised network anomaly detection, enabling practical adoption through reliable explanations and improved detection with limited labeled data.

Abstract: Semi-supervised Learning plays a crucial role in network anomaly detection applications, however, learning anomaly patterns with limited labeled samples is not easy. Additionally, the lack of interpretability creates key barriers to the adoption of semi-supervised frameworks in practice. Most existing interpretation methods are developed for supervised/unsupervised frameworks or non-security domains and fail to provide reliable interpretations. In this paper, we propose AnomalyAID, a general framework aiming to (1) make the anomaly detection process interpretable and improve the reliability of interpretation results, and (2) assign high-confidence pseudo labels to unlabeled samples for improving the performance of anomaly detection systems with limited supervised data. For (1), we propose a novel interpretation approach that leverages global and local interpreters to provide reliable explanations, while for (2), we design a new two-stage semi-supervised learning framework for network anomaly detection by aligning both stages’ model predictions with special constraints. We apply AnomalyAID over two representative network anomaly detection tasks and extensively evaluate AnomalyAID with representative prior works. Experimental results demonstrate that AnomalyAID can provide accurate detection results with reliable interpretations for semi-supervised network anomaly detection systems.

[886] Adaptive Group Robust Ensemble Knowledge Distillation

Patrik Kenfack, Ulrich Aïvodji, Samira Ebrahimi Kahou

Main category: cs.LG

TL;DR: AGRE-KD is a novel ensemble knowledge distillation method that prevents performance degradation for underrepresented subgroups by selectively choosing teachers whose knowledge benefits worst-performing subgroups, outperforming traditional ensemble distillation and even classic model ensembles.

Details

Motivation: Neural networks learn spurious correlations that harm underrepresented subgroups, and traditional ensemble knowledge distillation amplifies this issue even with debiased teachers, creating a need for methods that preserve performance for worst-case subgroups.

Method: Proposed Adaptive Group Robust Ensemble Knowledge Distillation (AGRE-KD) uses an additional biased model to selectively choose teachers whose gradient directions deviate from the biased model, upweighting knowledge beneficial for underrepresented subgroups.

Result: Experiments on several datasets show AGRE-KD significantly outperforms traditional ensemble distillation and can even beat classic model ensembles based on majority voting, effectively improving performance for worst-case subgroups.

Conclusion: AGRE-KD provides an effective solution to prevent performance degradation for underrepresented subgroups in knowledge distillation, demonstrating that selective teacher weighting based on gradient deviation from biased models can transfer beneficial knowledge to students.

Abstract: Neural networks can learn spurious correlations in the data, often leading to performance degradation for underrepresented subgroups. Studies have demonstrated that the disparity is amplified when knowledge is distilled from a complex teacher model to a relatively ``simple’’ student model. Prior work has shown that ensemble deep learning methods can improve the performance of the worst-case subgroups; however, it is unclear if this advantage carries over when distilling knowledge from an ensemble of teachers, especially when the teacher models are debiased. This study demonstrates that traditional ensemble knowledge distillation can significantly drop the performance of the worst-case subgroups in the distilled student model even when the teacher models are debiased. To overcome this, we propose Adaptive Group Robust Ensemble Knowledge Distillation (AGRE-KD), a simple ensembling strategy to ensure that the student model receives knowledge beneficial for unknown underrepresented subgroups. Leveraging an additional biased model, our method selectively chooses teachers whose knowledge would better improve the worst-performing subgroups by upweighting the teachers with gradient directions deviating from the biased model. Our experiments on several datasets demonstrate the superiority of the proposed ensemble distillation technique and show that it can even outperform classic model ensembles based on majority voting. Our source code is available at https://github.com/patrikken/AGRE-KD

[887] When Bias Helps Learning: Bridging Initial Prejudice and Trainability

Alberto Bassi, Marco Baity-Jesi, Aurelien Lucchi, Carlo Albert, Emanuele Francazi

Main category: cs.LG

TL;DR: Theoretical proof links initialization-guessing bias (IGB) in untrained DNNs to mean-field analyses, showing that optimal trainability requires systematic bias rather than neutrality.

Details

Motivation: To understand how statistical properties of DNNs at initialization affect both trainability and intrinsic architectural biases, particularly the connection between initialization-guessing bias and mean-field analyses.

Method: Provided theoretical proof linking IGB to mean-field analyses, and validated through experiments across multiple architectures and datasets.

Result: Established that network predisposition toward specific classes is intrinsically tied to conditions for efficient learning, leading to counterintuitive conclusion that optimal trainability requires systematic bias.

Conclusion: Initialization that optimizes trainability is systematically biased rather than neutral, connecting initialization-guessing bias to mean-field theory conditions for efficient learning.

Abstract: Understanding the statistical properties of deep neural networks (DNNs) at initialization is crucial for elucidating both their trainability and the intrinsic architectural biases they encode prior to data exposure. Mean-field (MF) analyses have demonstrated that the parameter distribution in randomly initialized networks dictates whether gradients vanish or explode. Recent work has shown that untrained DNNs exhibit an initial-guessing bias (IGB), in which large regions of the input space are assigned to a single class. In this work, we provide a theoretical proof linking IGB to MF analyses, establishing that a network predisposition toward specific classes is intrinsically tied to the conditions for efficient learning. This connection leads to a counterintuitive conclusion: the initialization that optimizes trainability is systematically biased rather than neutral. We validate our theory through experiments across multiple architectures and datasets.

[888] Understanding and Mitigating Memorization in Diffusion Models for Tabular Data

Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiao Li, Jing Li

Main category: cs.LG

TL;DR: First comprehensive investigation of memorization in tabular diffusion models, revealing it increases with training epochs and proposing TabCutMix/TabCutMixPlus data augmentation methods to mitigate memorization while maintaining data quality.

Details

Motivation: Memorization has been studied in image and text generation but remains unexplored in tabular data generation, despite the growing importance of tabular diffusion models for synthetic data creation.

Method: Empirical analysis of memorization factors (dataset size, feature dimensions, diffusion models) and theoretical explanation of memorization. Proposed TabCutMix (feature segment exchange between same-class samples) and TabCutMixPlus (clustering correlated features for coherent exchange).

Result: Memorization occurs in tabular diffusion models and increases with training epochs. TabCutMix and TabCutMixPlus effectively reduce memorization while preserving high-quality data generation across various datasets and models.

Conclusion: Memorization is a significant issue in tabular diffusion models that can be effectively mitigated through proposed data augmentation techniques, enabling safer and more reliable synthetic tabular data generation.

Abstract: Tabular data generation has attracted significant research interest in recent years, with the tabular diffusion models greatly improving the quality of synthetic data. However, while memorization, where models inadvertently replicate exact or near-identical training data, has been thoroughly investigated in image and text generation, its effects on tabular data remain largely unexplored. In this paper, we conduct the first comprehensive investigation of memorization phenomena in diffusion models for tabular data. Our empirical analysis reveals that memorization appears in tabular diffusion models and increases with larger training epochs. We further examine the influence of factors such as dataset sizes, feature dimensions, and different diffusion models on memorization. Additionally, we provide a theoretical explanation for why memorization occurs in tabular diffusion models. To address this issue, we propose TabCutMix, a simple yet effective data augmentation technique that exchanges randomly selected feature segments between random same-class training sample pairs. Building upon this, we introduce TabCutMixPlus, an enhanced method that clusters features based on feature correlations and ensures that features within the same cluster are exchanged together during augmentation. This clustering mechanism mitigates out-of-distribution (OOD) generation issues by maintaining feature coherence. Experimental results across various datasets and diffusion models demonstrate that TabCutMix effectively mitigates memorization while maintaining high-quality data generation.

[889] The Energy Cost of Reasoning: Analyzing Energy Usage in LLMs with Test-time Compute

Yunho Jin, Gu-Yeon Wei, David Brooks

Main category: cs.LG

TL;DR: Test-time compute (TTC) offers better accuracy-energy efficiency than traditional model scaling, especially for complex reasoning tasks, by strategically allocating compute resources during inference based on query complexity.

Details

Motivation: Address diminishing returns and high energy demands of scaling large language models by exploring energy-efficient alternatives to conventional training scaling strategies.

Method: Empirical analysis comparing test-time compute (allocating additional computational resources at inference time) versus traditional model scaling, examining accuracy-energy trade-offs across different task types and output sequence lengths.

Result: TTC surpasses traditional model scaling in accuracy/energy efficiency, with greater benefits for complex reasoning tasks than factual recall. Performance depends on output sequence length, and strategic compute allocation based on query complexity enhances efficiency.

Conclusion: TTC is a promising direction for sustainable, accurate, and adaptable deployment of language models, offering superior efficiency over conventional scaling approaches.

Abstract: Scaling large language models (LLMs) has driven significant advancements, yet it faces diminishing returns and escalating energy demands. This work explores how test-time compute (TTC) can serve as an energy-efficient complement to conventional scaling strategies by allocating additional computational resources at inference time rather than during training. Specifically, we investigate whether employing TTC can achieve superior accuracy-energy trade-offs compared to simply increasing model size. Our empirical analysis reveals that TTC surpasses traditional model scaling in accuracy/energy efficiency, with notable gains in tasks demanding complex reasoning rather than mere factual recall. Further, we identify a critical interaction between TTC performance and output sequence length, demonstrating that strategically adjusting compute resources at inference time according to query complexity can substantially enhance efficiency. Our findings advocate for TTC as a promising direction, enabling more sustainable, accurate, and adaptable deployment of future language models.

[890] A solvable model of learning generative diffusion: theory and insights

Hugo Cui, Cengiz Pehlevan, Yue M. Lu

Main category: cs.LG

TL;DR: The paper analyzes how flow/diffusion generative models trained with SGD on manifold-structured data can suffer from mode collapse and model collapse when retrained on synthetic data.

Details

Motivation: To understand the performance and failure modes of generative models trained on high-dimensional data with low-dimensional structure, particularly when models are retrained on their own synthetic outputs.

Method: Theoretical analysis of two-layer auto-encoder generative models trained with online stochastic gradient descent, deriving asymptotic characterizations of low-dimensional projections of generated samples.

Result: Obtained tight asymptotic characterization of generated sample distributions and their dependence on training sample size, revealing how mode collapse occurs.

Conclusion: Mode collapse in generative models can lead to model collapse when models are iteratively retrained on synthetic data, highlighting a fundamental limitation in generative model training pipelines.

Abstract: In this manuscript, we consider the problem of learning a flow or diffusion-based generative model parametrized by a two-layer auto-encoder, trained with online stochastic gradient descent, on a high-dimensional target density with an underlying low-dimensional manifold structure. We derive a tight asymptotic characterization of low-dimensional projections of the distribution of samples generated by the learned model, ascertaining in particular its dependence on the number of training samples. Building on this analysis, we discuss how mode collapse can arise, and lead to model collapse when the generative model is re-trained on generated synthetic data.

[891] Guided Diffusion Sampling on Function Spaces with Applications to PDEs

Jiachen Yao, Abbas Mammadov, Julius Berner, Gavin Kerrigan, Jong Chul Ye, Kamyar Azizzadenesheli, Anima Anandkumar

Main category: cs.LG

TL;DR: FunDPS: A function-space diffusion framework for conditional sampling in PDE-based inverse problems that recovers solutions from sparse/noisy measurements using neural operators and plug-and-play guidance.

Details

Motivation: To address the challenge of recovering complete solutions from extremely sparse or noisy measurements in PDE-based inverse problems, overcoming limitations of fixed-resolution methods.

Method: Trains unconditional discretization-agnostic denoising model using neural operators, then refines samples via gradient-based guidance to satisfy sparse observation data. Extends Tweedie’s formula to infinite-dimensional Hilbert spaces.

Result: Achieves 32% accuracy improvement over state-of-the-art diffusion baselines with only 3% observation across five PDE tasks, while reducing sampling steps by 4x. Shows strong cross-resolution generalizability.

Conclusion: First discretization-independent diffusion framework for PDE forward and inverse problems, providing practical and flexible solution under minimal supervision and severe data scarcity.

Abstract: We propose a general framework for conditional sampling in PDE-based inverse problems, targeting the recovery of whole solutions from extremely sparse or noisy measurements. This is accomplished by a function-space diffusion model and plug-and-play guidance for conditioning. Our method first trains an unconditional discretization-agnostic denoising model using neural operator architectures. At inference, we refine the samples to satisfy sparse observation data via a gradient-based guidance mechanism. Through rigorous mathematical analysis, we extend Tweedie’s formula to infinite-dimensional Hilbert spaces, providing the theoretical foundation for our posterior sampling approach. Our method (FunDPS) accurately captures posterior distributions in function spaces under minimal supervision and severe data scarcity. Across five PDE tasks with only 3% observation, our method achieves an average 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines while reducing sampling steps by 4x. Furthermore, multi-resolution fine-tuning ensures strong cross-resolution generalizability. To the best of our knowledge, this is the first diffusion-based framework to operate independently of discretization, offering a practical and flexible solution for forward and inverse problems in the context of PDEs. Code is available at https://github.com/neuraloperator/FunDPS

[892] pMixFed: Efficient Personalized Federated Learning through Adaptive Layer-Wise Mixup

Yasaman Saadati, Mohammad Rostami, M. Hadi Amini

Main category: cs.LG

TL;DR: pMixFed is a personalized federated learning method that uses mixup between global and local models with adaptive layer partitioning and gradual personalization to address data heterogeneity and model drift issues.

Details

Motivation: Traditional FL struggles with non-IID data and personalization, while existing PFL methods face challenges like global-local model discrepancy, client drift, and catastrophic forgetting that degrade accuracy.

Method: Proposes pMixFed with dynamic layer-wise partitioning between shared global and personalized local models, using mixup integration, adaptive layer selection, gradual personalization transition, and novel aggregation to prevent forgetting.

Result: Extensive experiments show pMixFed outperforms state-of-the-art PFL methods with faster training, increased robustness, and better handling of data heterogeneity across different settings.

Conclusion: pMixFed effectively addresses key challenges in personalized federated learning through its dynamic layer-wise approach with mixup integration and adaptive strategies.

Abstract: Traditional Federated Learning (FL) methods encounter significant challenges when dealing with heterogeneous data and providing personalized solutions for non-IID scenarios. Personalized Federated Learning (PFL) approaches aim to address these issues by balancing generalization and personalization, often through parameter decoupling or partial models that freeze some neural network layers for personalization while aggregating other layers globally. However, existing methods still face challenges of global-local model discrepancy, client drift, and catastrophic forgetting, which degrade model accuracy. To overcome these limitations, we propose $\textit{pMixFed}$, a dynamic, layer-wise PFL approach that integrates $\textit{mixup}$ between shared global and personalized local models. Our method introduces an adaptive strategy for partitioning between personalized and shared layers, a gradual transition of personalization degree to enhance local client adaptation, improved generalization across clients, and a novel aggregation mechanism to mitigate catastrophic forgetting. Extensive experiments demonstrate that pMixFed outperforms state-of-the-art PFL methods, showing faster model training, increased robustness, and improved handling of data heterogeneity under different heterogeneous settings.

[893] Differential privacy for medical deep learning: methods, tradeoffs, and deployment implications

Marziyeh Mohammadi, Mohsen Vejdanihemmat, Mahshad Lotfinia, Mirabela Rusu, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh

Main category: cs.LG

TL;DR: This scoping review examines differential privacy (DP) applications in medical deep learning, analyzing 74 studies to understand privacy-utility-fairness tradeoffs across various data modalities and training settings.

Details

Motivation: As clinical models become more data-dependent, balancing privacy protection with model utility and fairness has emerged as a critical challenge in medical deep learning.

Method: Conducted a structured scoping review of 74 studies up to March 2025, analyzing DP applications across centralized and federated settings, focusing on DP-SGD and alternative mechanisms.

Result: DP can preserve performance in well-structured imaging tasks but causes severe degradation under strict privacy, particularly affecting underrepresented groups. Privacy-induced performance gaps disproportionately impact demographic subgroups, with fairness impacts varying by data type and task.

Conclusion: Key gaps exist in fairness auditing, standardization, and evaluation protocols. Future work should focus on developing equitable and clinically robust privacy-preserving DL systems in medicine.

Abstract: Differential privacy (DP) is a key technique for protecting sensitive patient data in medical deep learning (DL). As clinical models grow more data-dependent, balancing privacy with utility and fairness has become a critical challenge. This scoping review synthesizes recent developments in applying DP to medical DL, with a particular focus on DP-SGD and alternative mechanisms across centralized and federated settings. Using a structured search strategy, we identified 74 studies published up to March 2025. Our analysis spans diverse data modalities, training setups, and downstream tasks, and highlights the tradeoffs between privacy guarantees, model accuracy, and subgroup fairness. We find that while DP-especially at strong privacy budgets-can preserve performance in well-structured imaging tasks, severe degradation often occurs under strict privacy, particularly in underrepresented or complex modalities. Furthermore, privacy-induced performance gaps disproportionately affect demographic subgroups, with fairness impacts varying by data type and task. A small subset of studies explicitly addresses these tradeoffs through subgroup analysis or fairness metrics, but most omit them entirely. Beyond DP-SGD, emerging approaches leverage alternative mechanisms, generative models, and hybrid federated designs, though reporting remains inconsistent. We conclude by outlining key gaps in fairness auditing, standardization, and evaluation protocols, offering guidance for future work toward equitable and clinically robust privacy-preserving DL systems in medicine.

[894] HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor

Zihui Wu, Haichang Gao, Jiacheng Luo, Zhaoxiang Liu

Main category: cs.LG

TL;DR: HumorReject uses humor instead of explicit refusal prefixes to make LLMs safer against prefix injection attacks while reducing over-defense issues.

Details

Motivation: Traditional LLM safety relies on explicit refusal prefixes, making them vulnerable to prefix injection attacks and causing over-defense problems.

Method: Data-driven approach that uses humor as an indirect refusal strategy, responding to harmful instructions with contextually appropriate humor instead of explicit rejection.

Result: Effectively addresses over-defense issues and demonstrates superior robustness against various attack vectors while maintaining safety.

Conclusion: Training data design improvements can be as important as alignment algorithms for achieving effective LLM safety.

Abstract: Large Language Models (LLMs) commonly rely on explicit refusal prefixes for safety, making them vulnerable to prefix injection attacks. We introduce HumorReject, a novel data-driven approach that reimagines LLM safety by decoupling it from refusal prefixes through humor as an indirect refusal strategy. Rather than explicitly rejecting harmful instructions, HumorReject responds with contextually appropriate humor that naturally defuses potentially dangerous requests. Our approach effectively addresses common “over-defense” issues while demonstrating superior robustness against various attack vectors. Our findings suggest that improvements in training data design can be as important as the alignment algorithm itself in achieving effective LLM safety. The code and dataset are available at https://github.com/wooozihui/HumorReject.

[895] Data Leakage and Deceptive Performance: A Critical Examination of Credit Card Fraud Detection Methodologies

Khizar Hayat, Baptiste Magnier

Main category: cs.LG

TL;DR: Methodological flaws in credit card fraud detection research can make simple models appear superior to sophisticated ones, highlighting that proper evaluation protocols matter more than algorithmic complexity.

Details

Motivation: To critically examine and expose fundamental evaluation flaws in credit card fraud detection research that undermine the validity of reported results.

Method: Deliberate experimentation with improper evaluation protocols and analysis of four critical methodological issues: data leakage, vague reporting, inadequate temporal validation, and metric manipulation.

Result: A minimal neural network with data leakage achieved 99.9% recall, outperforming sophisticated methods in literature, demonstrating how methodological flaws can produce deceptively impressive results.

Conclusion: Methodological rigor must precede architectural sophistication in fraud detection research, with implications for improving practices across machine learning applications.

Abstract: This study critically examines the methodological rigor in credit card fraud detection research, revealing how fundamental evaluation flaws can overshadow algorithmic sophistication. Through deliberate experimentation with improper evaluation protocols, we demonstrate that even simple models can achieve deceptively impressive results when basic methodological principles are violated. Our analysis identifies four critical issues plaguing current approaches: (1) pervasive data leakage from improper preprocessing sequences, (2) intentional vagueness in methodological reporting, (3) inadequate temporal validation for transaction data, and (4) metric manipulation through recall optimization at precision’s expense. We present a case study showing how a minimal neural network architecture with data leakage outperforms many sophisticated methods reported in literature, achieving 99.9% recall despite fundamental evaluation flaws. These findings underscore that proper evaluation methodology matters more than model complexity in fraud detection research. The study serves as a cautionary example of how methodological rigor must precede architectural sophistication, with implications for improving research practices across machine learning applications.

[896] Choose Your Model Size: Any Compression of Large Language Models Without Re-Computation

Martin Genzel, Patrick Putzky, Pengfei Zhao, Sebastian Schulze, Mattes Mollenhauer, Robert Seidel, Stefan Dietzel, Thomas Wollmann

Main category: cs.LG

TL;DR: ACIP is a novel compression method that uses iterative pruning with SVD-reparametrization to enable flexible model compression from a single training run, achieving state-of-the-art results for LLMs.

Details

Motivation: Foundation models are too large and expensive for resource-constrained environments, requiring efficient compression methods that balance model size reduction with performance preservation.

Method: Uses SVD-reparametrization of linear layers with iterative pruning of singular values using sparsity-inducing penalty, creating a global score map that enables compression to any target size without re-computation.

Result: Achieves state-of-the-art results compared to existing factorization-based compression methods on various LLMs and downstream tasks, and works well with quantization techniques.

Conclusion: ACIP provides an efficient and flexible approach for compressing foundation models that can be applied to any target size from a single training run, making it suitable for resource-constrained environments.

Abstract: The adoption of Foundation Models in resource-constrained environments remains challenging due to their large size and inference costs. A promising way to overcome these limitations is post-training compression, which aims to balance reduced model size against performance degradation. This work presents Any Compression via Iterative Pruning (ACIP), a novel algorithmic approach to determine a compression-performance trade-off from a single stochastic gradient descent run. To achieve parameter efficiency, we use an SVD-reparametrization of linear layers and iteratively prune their singular values with a sparsity-inducing penalty. Importantly, the pruning order of the parameters is used to derive a global score map that allows compressing a model to any target size without re-computation. We evaluate ACIP on a large selection of open-weight LLMs and downstream tasks, demonstrating state-of-the-art results compared to existing factorization-based compression methods. We also show that ACIP seamlessly complements common quantization-based compression techniques.

[897] Tight Bounds for Jensen’s Gap with Applications to Variational Inference

Marcin Mazur, Tadeusz Dziarmaga, Piotr Kościelniak, Łukasz Struski

Main category: cs.LG

TL;DR: This paper proposes new general bounds for Jensen’s gap that work under broad assumptions on functions and distributions, with special focus on exponential and logarithmic cases, and connects these bounds to PAC-Bayes framework.

Details

Motivation: Jensen's inequality is fundamental in mathematics, statistics, and machine learning, but recent research has focused on estimating the size of Jensen's gap, especially for logarithmic functions which are crucial in variational inference applications like VAEs where log-likelihood intractability poses practical challenges.

Method: The authors develop new analytical bounds for Jensen’s gap that accommodate various assumptions on both the function and random variable, with particular attention to exponential and logarithmic cases. They provide both analytical proofs and empirical validation of their method.

Result: The paper presents general bounds for Jensen’s gap that perform well across different assumptions, with special results for exponential and logarithmic functions. The bounds are supported by both theoretical analysis and empirical evidence.

Conclusion: The proposed bounds provide improved estimation of Jensen’s gap under broad conditions, and the connection to PAC-Bayes framework offers new insights into generalization performance in probabilistic models.

Abstract: Since its original formulation, Jensen’s inequality has played a fundamental role across mathematics, statistics, and machine learning, with its probabilistic version highlighting the nonnegativity of the so-called Jensen’s gap, i.e., the difference between the expectation of a convex function and the function at the expectation. Of particular importance is the case when the function is logarithmic, as this setting underpins many applications in variational inference, where the term variational gap is often used interchangeably. Recent research has focused on estimating the size of Jensen’s gap and establishing tight lower and upper bounds under various assumptions on the underlying function and distribution, driven by practical challenges such as the intractability of log-likelihood in graphical models like variational autoencoders (VAEs). In this paper, we propose new, general bounds for Jensen’s gap that accommodate a broad range of assumptions on both the function and the random variable, with special attention to exponential and logarithmic cases. We provide both analytical and empirical evidence for the performance of our method. Furthermore, we relate our bounds to the PAC-Bayes framework, providing new insights into generalization performance in probabilistic models.

[898] Understanding Representation Dynamics of Diffusion Models via Low-Dimensional Modeling

Xiao Li, Zekai Zhang, Xiang Li, Siyi Chen, Zhihui Zhu, Peng Wang, Qing Qu

Main category: cs.LG

TL;DR: Diffusion models show unimodal representation dynamics where feature quality peaks at intermediate noise levels, indicating successful learning of the data distribution and reflecting model generalization.

Details

Motivation: To understand why diffusion models exhibit unimodal representation dynamics and how this phenomenon relates to model generalization and memorization.

Method: Theoretical analysis leveraging low-dimensional structure of image data, and empirical investigation of classification tasks to study the relationship between unimodal dynamics and model generalization.

Result: Unimodal dynamics emerge when diffusion models capture the underlying data distribution, and the presence of this phenomenon reliably indicates model generalization - it transitions to monotonically decreasing curves when models memorize training data.

Conclusion: Unimodal representation dynamics in diffusion models serve as a reliable indicator of successful learning and generalization, with the phenomenon arising from the interplay between denoising strength and class confidence across noise scales.

Abstract: Diffusion models, though originally designed for generative tasks, have demonstrated impressive self-supervised representation learning capabilities. A particularly intriguing phenomenon in these models is the emergence of unimodal representation dynamics, where the quality of learned features peaks at an intermediate noise level. In this work, we conduct a comprehensive theoretical and empirical investigation of this phenomenon. Leveraging the inherent low-dimensionality structure of image data, we theoretically demonstrate that the unimodal dynamic emerges when the diffusion model successfully captures the underlying data distribution. The unimodality arises from an interplay between denoising strength and class confidence across noise scales. Empirically, we further show that, in classification tasks, the presence of unimodal dynamics reliably reflects the generalization of the diffusion model: it emerges when the model generates novel images and gradually transitions to a monotonically decreasing curve as the model begins to memorize the training data.

[899] SPIRIT: Short-term Prediction of solar IRradIance for zero-shot Transfer learning using Foundation Models

Aditya Mishra, Ravindra T, Srinivasan Iyengar, Shivkumar Kalyanaraman, Ponnurangam Kumaraguru

Main category: cs.LG

TL;DR: SPIRIT is a foundation model-based approach for solar irradiance forecasting that achieves 70% better zero-shot performance than state-of-the-art models, enabling accurate forecasting at new solar installations without historical data.

Details

Motivation: Traditional solar forecasting requires years of site-specific historical data, which is unavailable for newer photovoltaic farms. Accurate forecasting is essential for grid management and achieving UN net zero goals through solar energy proliferation.

Method: Leverages foundation models for solar irradiance forecasting, enabling zero-shot transfer learning without historical data. Can be fine-tuned as location-specific data becomes available.

Result: Outperforms state-of-the-art models by about 70% in zero-shot transfer learning. Shows statistically significant improvements and further performance gains through fine-tuning.

Conclusion: SPIRIT enables rapid, scalable, and adaptable solar forecasting solutions, advancing renewable energy integration into global power systems without requiring extensive historical data.

Abstract: Traditional solar forecasting models are based on several years of site-specific historical irradiance data, often spanning five or more years, which are unavailable for newer photovoltaic farms. As renewable energy is highly intermittent, building accurate solar irradiance forecasting systems is essential for efficient grid management and enabling the ongoing proliferation of solar energy, which is crucial to achieve the United Nations’ net zero goals. In this work, we propose SPIRIT, a novel approach leveraging foundation models for solar irradiance forecasting, making it applicable to newer solar installations. Our approach outperforms state-of-the-art models in zero-shot transfer learning by about 70%, enabling effective performance at new locations without relying on any historical data. Further improvements in performance are achieved through fine-tuning, as more location-specific data becomes available. These findings are supported by statistical significance, further validating our approach. SPIRIT represents a pivotal step towards rapid, scalable, and adaptable solar forecasting solutions, advancing the integration of renewable energy into global power systems.

[900] Weak-to-Strong Generalization Even in Random Feature Networks, Provably

Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora, Zhiyuan Li, Nathan Srebro

Main category: cs.LG

TL;DR: Weak-to-strong generalization occurs when a strong student model outperforms a weak teacher model, even when trained only on the teacher’s labels. This phenomenon works with random feature models and is enabled by early stopping.

Details

Motivation: To demonstrate that weak-to-strong generalization doesn't require extremely strong learners like GPT-4, and to understand the mechanisms behind this phenomenon using simpler random feature models.

Method: Used random feature models (two-layer networks with fixed random bottom layers) where a weak teacher with few units is trained on population data, and a strong student with many units is trained only on labels generated by the weak teacher.

Result: Showed that the strong student can significantly outperform the weak teacher despite being trained only on the teacher’s labels, and identified early stopping as a key enabler of this phenomenon.

Conclusion: Weak-to-strong generalization is a robust phenomenon that works with simpler models, has quantitative limits, and is facilitated by early stopping during training.

Abstract: Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong learner like GPT-4. We consider student and teacher that are random feature models, described by two-layer networks with a random and fixed bottom layer and a trained top layer. A “weak” teacher, with a small number of units (i.e. random features), is trained on the population, and a “strong” student, with a much larger number of units (i.e. random features), is trained only on labels generated by the weak teacher. We demonstrate, prove, and understand how the student can outperform the teacher, even though trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.

[901] Time-Prompt: Integrated Heterogeneous Prompts for Unlocking LLMs in Time Series Forecasting

Zesen Wang, Lijuan Lan, Yonggang Li

Main category: cs.LG

TL;DR: Time-Prompt is a framework that activates LLMs for time series forecasting using learnable soft prompts and textualized hard prompts, with semantic space embedding and cross-modal alignment to fuse temporal and textual data.

Details

Motivation: To address skepticism about LLMs' usefulness in time series forecasting and improve long-term forecasting performance, while contributing to carbon neutrality goals.

Method: Construct unified prompt paradigm with soft/hard prompts, design semantic space embedding and cross-modal alignment for temporal-textual fusion, and fine-tune LLM parameters with time series data.

Result: Comprehensive evaluations on 6 public datasets and 3 carbon emission datasets demonstrate Time-Prompt as a powerful framework for time series forecasting.

Conclusion: Time-Prompt effectively activates LLMs for time series forecasting, showing strong performance across multiple datasets and contributing to carbon emission analysis.

Abstract: Time series forecasting aims to model temporal dependencies among variables for future state inference, holding significant importance and widespread applications in real-world scenarios. Although deep learning-based methods have achieved remarkable progress, they still exhibit suboptimal performance in long-term forecasting. Recent research demonstrates that large language models (LLMs) achieve promising performance in time series forecasting, but this progress is still met with skepticism about whether LLMs are truly useful for this task. To address this, we propose Time-Prompt, a framework for activating LLMs for time series forecasting. Specifically, we first construct a unified prompt paradigm with learnable soft prompts to guide the LLM’s behavior and textualized hard prompts to enhance the time series representations. Second, to enhance LLM’ comprehensive understanding of the forecasting task, we design a semantic space embedding and cross-modal alignment module to achieve fusion of temporal and textual data. Finally, we efficiently fine-tune the LLM’s parameters using time series data. Furthermore, we focus on carbon emissions, aiming to provide a modest contribution to global carbon neutrality. Comprehensive evaluations on 6 public datasets and 3 carbon emission datasets demonstrate that Time-Prompt is a powerful framework for time series forecasting.

[902] Bayesian Network Structural Consensus via Greedy Min-Cut Analysis

Pablo Torrijos, José M. Puerta, Juan A. Aledo, José A. Gámez

Main category: cs.LG

TL;DR: MCBNC is a greedy algorithm for Bayesian Network structural consensus that prunes weak edges using min-cut analysis and a modified GES algorithm, producing sparser and more accurate consensus structures than existing methods.

Details

Motivation: To develop a scalable, data-agnostic method for Bayesian Network structural consensus suitable for federated learning and distributed scenarios, addressing limitations of existing methods with fixed treewidth bounds.

Method: Uses min-cut analysis to prune weak edges from initial fusion, integrated into modified Backward Equivalence Search of GES algorithm with a pruning threshold θ selected post hoc using structural information.

Result: Experiments show MCBNC yields sparser, more accurate consensus structures than canonical fusion and input networks, with good scalability and performance in distributed settings.

Conclusion: MCBNC is an effective, scalable approach for Bayesian Network structural consensus that outperforms existing methods and is well-suited for federated learning applications.

Abstract: This paper presents the Min-Cut Bayesian Network Consensus (MCBNC) algorithm, a greedy method for structural consensus of Bayesian Networks (BNs), with applications in federated learning and model aggregation. MCBNC prunes weak edges from an initial unrestricted fusion using a structural score based on min-cut analysis, integrated into a modified Backward Equivalence Search (BES) phase of the Greedy Equivalence Search (GES) algorithm. The score quantifies edge support across input networks and is computed using max-flow. Unlike methods with fixed treewidth bounds, MCBNC introduces a pruning threshold $θ$ that can be selected post hoc using only structural information. Experiments on real-world BNs show that MCBNC yields sparser, more accurate consensus structures than both canonical fusion and the input networks. The method is scalable, data-agnostic, and well-suited for distributed or federated scenarios.

[903] GPT, But Backwards: Exactly Inverting Language Model Outputs

Adrians Skapars, Edoardo Manino, Youcheng Sun, Lucas C. Cordeiro

Main category: cs.LG

TL;DR: SODA algorithm enables exact reconstruction of language model inputs using white-box access, achieving high success rates on both natural and random text.

Details

Motivation: To assess language model vulnerabilities like system prompt theft, backdoor detection, and data leakage through input reconstruction, addressing limitations of existing inversion methods.

Method: Sparse One-hot Discrete Adam (SODA) - a search-based inversion algorithm that uses white-box access to the model and its outputs to reconstruct inputs.

Result: Achieved 98% and 79% reconstruction rates on inputs up to 10 tokens for natural and random language respectively; scales from 33M to 3B parameter models; input length and vocabulary size are more critical than model size.

Conclusion: Exact language model inversion is feasible with SODA, demonstrating that input reconstruction is primarily constrained by input length and vocabulary rather than model size.

Abstract: The task of reconstructing unknown textual inputs to language models is a fundamental auditing primitive that allows us to assess the model’s vulnerability to a range of security issues, including stealing hidden system prompts, detecting backdoors, and leaking private data. Existing inversion works assume access to differing levels of information (e.g. requiring input-output examples, the model parameters, intermediate activations or output logits) but oftentimes fail to fully reconstruct the desired input. In this paper, we present the Sparse One-hot Discrete Adam (SODA) algorithm, a search-based inversion method that can accurately reconstruct the input text, given white-box access to the language model and its output. Our experiments demonstrate for the first time that exact language model inversion is possible on both natural language and random inputs. Indeed, SODA achieves respectively 98% and 79% reconstruction rates on inputs with lengths up to 10 tokens. Furthermore, we show that input length and vocabulary size have a far greater impact on the probability of a successful reconstruction than the size of the language model itself, thus allowing us to scale to models from 33M to 3B parameters.

[904] Robust Hallucination Detection in LLMs via Adaptive Token Selection

Mengjia Niu, Hamed Haddadi, Guansong Pang

Main category: cs.LG

TL;DR: HaMI is a novel hallucination detection approach that uses multiple instance learning to adaptively select and learn critical tokens most indicative of hallucinations in LLM outputs, achieving state-of-the-art performance.

Details

Motivation: Current hallucination detectors depend on predetermined tokens' internal representations, but their performance fluctuates with free-form generations of varying lengths and sparse hallucinated entity distributions.

Method: Formulates hallucination detection as multiple instance learning over token-level representations, enabling joint optimization of token selection and hallucination detection through adaptive selection of critical tokens.

Result: Significantly outperforms existing state-of-the-art approaches across four hallucination benchmarks.

Conclusion: HaMI provides a robust solution for hallucination detection by adaptively identifying and learning from the most indicative tokens, overcoming limitations of fixed-token approaches.

Abstract: Hallucinations in large language models (LLMs) pose significant safety concerns that impede their broader deployment. Recent research in hallucination detection has demonstrated that LLMs’ internal representations contain truthfulness hints, which can be harnessed for detector training. However, the performance of these detectors is heavily dependent on the internal representations of predetermined tokens, fluctuating considerably when working on free-form generations with varying lengths and sparse distributions of hallucinated entities. To address this, we propose HaMI, a novel approach that enables robust detection of hallucinations through adaptive selection and learning of critical tokens that are most indicative of hallucinations. We achieve this robustness by an innovative formulation of the Hallucination detection task as Multiple Instance (HaMI) learning over token-level representations within a sequence, thereby facilitating a joint optimisation of token selection and hallucination detection on generation sequences of diverse forms. Comprehensive experimental results on four hallucination benchmarks show that HaMI significantly outperforms existing state-of-the-art approaches.

[905] Neural Thermodynamics: Entropic Forces in Deep and Universal Representation Learning

Liu Ziyin, Yizhou Xu, Isaac Chuang

Main category: cs.LG

TL;DR: The paper proposes an entropic-force theory to explain emergent phenomena in deep learning, showing that representation learning is governed by entropic forces from SGD that break continuous parameter symmetries while preserving discrete ones, leading to gradient balance phenomena.

Details

Motivation: To understand the cause of emergent phenomena in deep learning and large language models, particularly the universal alignment of neural representations and contradictory optimization behaviors.

Method: Developed a rigorous entropic-force theory based on parameter symmetries and entropic loss landscape, analyzing learning dynamics of neural networks trained with SGD and its variants.

Result: The theory explains universal representation alignment (proving Platonic Representation Hypothesis) and reconciles sharpness- vs flatness-seeking optimization behaviors through gradient balance phenomena resembling thermal equipartition.

Conclusion: A combination of entropic forces and symmetry breaking is key to understanding emergent phenomena in deep learning, providing a unified framework for representation learning dynamics.

Abstract: With the rapid discovery of emergent phenomena in deep learning and large language models, understanding their cause has become an urgent need. Here, we propose a rigorous entropic-force theory for understanding the learning dynamics of neural networks trained with stochastic gradient descent (SGD) and its variants. Building on the theory of parameter symmetries and an entropic loss landscape, we show that representation learning is crucially governed by emergent entropic forces arising from stochasticity and discrete-time updates. These forces systematically break continuous parameter symmetries and preserve discrete ones, leading to a series of gradient balance phenomena that resemble the equipartition property of thermal systems. These phenomena, in turn, (a) explain the universal alignment of neural representations between AI models and lead to a proof of the Platonic Representation Hypothesis, and (b) reconcile the seemingly contradictory observations of sharpness- and flatness-seeking behavior of deep learning optimization. Our theory and experiments demonstrate that a combination of entropic forces and symmetry breaking is key to understanding emergent phenomena in deep learning.

[906] Private Statistical Estimation via Truncation

Manolis Zampetakis, Felix Zhou

Main category: cs.LG

TL;DR: A novel framework for differentially private statistical estimation using data truncation to handle unbounded data support, achieving near-optimal sample complexity for exponential family distributions.

Details

Motivation: Address the challenge of DP estimation with unbounded data support, overcoming limitations of traditional sensitivity analysis approaches that are problem-specific and restrict applicability.

Method: Leverage truncated statistics techniques, use data truncation to mitigate sensitivity, correct bias via maximum likelihood estimation and DP stochastic gradient descent, establish improved uniform convergence guarantees for log-likelihood functions.

Result: Develop computationally efficient DP estimators for exponential family distributions (including Gaussian mean and covariance estimation) with near-optimal sample complexity, extending beyond previous works limited to bounded or one-dimensional families.

Conclusion: The framework provides a general blueprint for DP algorithm design via truncated statistics, with improved convergence guarantees that may have independent interest in statistical theory.

Abstract: We introduce a novel framework for differentially private (DP) statistical estimation via data truncation, addressing a key challenge in DP estimation when the data support is unbounded. Traditional approaches rely on problem-specific sensitivity analysis, limiting their applicability. By leveraging techniques from truncated statistics, we develop computationally efficient DP estimators for exponential family distributions, including Gaussian mean and covariance estimation, achieving near-optimal sample complexity. Previous works on exponential families only consider bounded or one-dimensional families. Our approach mitigates sensitivity through truncation while carefully correcting for the introduced bias using maximum likelihood estimation and DP stochastic gradient descent. Along the way, we establish improved uniform convergence guarantees for the log-likelihood function of exponential families, which may be of independent interest. Our results provide a general blueprint for DP algorithm design via truncated statistics.

[907] Bridging the Plausibility-Validity Gap by Fine-Tuning a Reasoning-Enhanced LLM for Chemical Synthesis and Discovery

Malikussaid, Hilal Hudan Nuha, Isman Kurniawan

Main category: cs.LG

TL;DR: The paper addresses the “plausibility-validity gap” in LLMs for chemistry, where outputs appear reasonable but violate fundamental principles. It presents a fine-tuned model that significantly improves chemical validity and synthesis feasibility.

Details

Motivation: Large Language Models often generate chemically plausible but invalid outputs, creating a gap between superficial correctness and actual chemical validity, especially in molecular structure, reaction mechanisms, and synthetic pathways.

Method: Combines a reasoning-centric model architecture (Magistral Small) with Low-Rank Adaptation fine-tuning on a dual-domain dataset covering molecular properties and chemical transformations.

Result: Achieved 96.3% format adherence, 97.4% chemical validity, and 74.4% synthesis feasibility. Outperforms MolT5 (97.4% vs 77.2% validity) and matches ChemCrow performance (9.0/10 vs 9.24/10 expert rating) with more transparent methodology.

Conclusion: Establishes a reproducible framework for transforming generalist LLMs into dependable scientific tools, revealing a learning hierarchy from syntactic correctness to chemical understanding to synthetic planning, with stereochemical precision and knowledge currency as future challenges.

Abstract: Large Language Models frequently generate outputs that appear scientifically reasonable yet violate fundamental principles–a phenomenon we characterize as the “plausibility-validity gap.” This challenge proves especially acute in chemistry, where superficial correctness masks deeper errors in molecular structure, reaction mechanisms, and synthetic pathways. We present a systematic approach combining a reasoning-centric model architecture (Magistral Small) with Low-Rank Adaptation fine-tuning on a dual-domain dataset covering molecular properties and chemical transformations. Evaluation reveals substantial improvements: the fine-tuned system achieves 96.3% format adherence, 97.4% chemical validity, and 74.4% synthesis feasibility. Comparative analysis shows our approach outperforms specialized translation models like MolT5 (97.4% vs 77.2% validity) while achieving performance comparable to complex tool-augmented systems like ChemCrow (9.0/10 vs 9.24/10 expert rating) through a more transparent, efficient methodology. Results demonstrate a learning hierarchy where syntactic correctness develops before chemical understanding, which precedes synthetic planning capability. This work establishes a reproducible framework for transforming generalist language models into dependable scientific tools while identifying critical areas including stereochemical precision, knowledge currency, and computational accessibility as key challenges for future advancement.

[908] Joint Velocity-Growth Flow Matching for Single-Cell Dynamics Modeling

Dongyi Wang, Yuanwei Jiang, Zhenyi Zhang, Xiang Gu, Peijie Zhou, Jian Sun

Main category: cs.LG

TL;DR: VGFM is a novel method that jointly learns state transition and mass growth in single-cell dynamics using flow matching, addressing challenges from unpaired and unbalanced snapshot data.

Details

Motivation: Single-cell snapshot data suffers from destructive measurements and cell proliferation/death, creating unpaired and unbalanced data that makes learning underlying dynamics challenging.

Method: Proposes joint Velocity-Growth Flow Matching (VGFM) that builds ideal single-cell dynamics with velocity and growth components, using semi-relaxed optimal transport and neural networks for approximation.

Result: Extensive experiments on synthetic and real datasets show VGFM captures biological dynamics accounting for mass and state variations, outperforming existing approaches.

Conclusion: VGFM provides an effective framework for learning single-cell dynamics that handles both state transitions and mass growth, demonstrating superior performance over current methods.

Abstract: Learning the underlying dynamics of single cells from snapshot data has gained increasing attention in scientific and machine learning research. The destructive measurement technique and cell proliferation/death result in unpaired and unbalanced data between snapshots, making the learning of the underlying dynamics challenging. In this paper, we propose joint Velocity-Growth Flow Matching (VGFM), a novel paradigm that jointly learns state transition and mass growth of single-cell populations via flow matching. VGFM builds an ideal single-cell dynamics containing velocity of state and growth of mass, driven by a presented two-period dynamic understanding of the static semi-relaxed optimal transport, a mathematical tool that seeks the coupling between unpaired and unbalanced data. To enable practical usage, we approximate the ideal dynamics using neural networks, forming our joint velocity and growth matching framework. A distribution fitting loss is also employed in VGFM to further improve the fitting performance for snapshot data. Extensive experimental results on both synthetic and real datasets demonstrate that VGFM can capture the underlying biological dynamics accounting for mass and state variations over time, outperforming existing approaches for single-cell dynamics modeling.

[909] Loss-Guided Auxiliary Agents for Overcoming Mode Collapse in GFlowNets

Idriss Malek, Aya Laajil, Abhijith Sharma, Eric Moulines, Salem Lahlou

Main category: cs.LG

TL;DR: LGGFN addresses GFlowNet mode collapse by using the main model’s training loss to guide exploration, focusing on high-loss regions to accelerate discovery of diverse high-reward samples.

Details

Motivation: GFlowNets often suffer from mode collapse, getting trapped in early-discovered modes and requiring prolonged training to find diverse solutions, while existing exploration techniques rely on heuristic novelty signals.

Method: Proposes Loss-Guided GFlowNets (LGGFN) where an auxiliary GFlowNet’s exploration is directly driven by the main GFlowNet’s training loss, prioritizing trajectories with high loss to focus sampling on poorly understood regions.

Result: LGGFN consistently outperforms baselines across diverse benchmarks including grid environments, structured sequence generation, Bayesian structure learning, and biological sequence design. On a challenging sequence generation task, it discovered over 40 times more unique valid modes while reducing exploration error by ~99%.

Conclusion: Using training loss as a direct guide for exploration significantly accelerates discovery of diverse, high-reward samples in GFlowNets, overcoming mode collapse more effectively than heuristic-based approaches.

Abstract: Although Generative Flow Networks (GFlowNets) are designed to capture multiple modes of a reward function, they often suffer from mode collapse in practice, getting trapped in early-discovered modes and requiring prolonged training to find diverse solutions. Existing exploration techniques often rely on heuristic novelty signals. We propose Loss-Guided GFlowNets (LGGFN), a novel approach where an auxiliary GFlowNet’s exploration is \textbf{directly driven by the main GFlowNet’s training loss}. By prioritizing trajectories where the main model exhibits \textbf{high loss}, LGGFN focuses sampling on poorly understood regions of the state space. This targeted exploration significantly accelerates the discovery of diverse, high-reward samples. Empirically, across \textbf{diverse benchmarks} including grid environments, structured sequence generation, Bayesian structure learning, and biological sequence design, LGGFN consistently \textbf{outperforms} baselines in exploration efficiency and sample diversity. For instance, on a challenging sequence generation task, it discovered over 40 times more unique valid modes while simultaneously reducing the exploration error metric by approximately 99%.

[910] Monitoring Risks in Test-Time Adaptation

Mona Schirmer, Metod Jazbec, Christian A. Naesseth, Eric Nalisnick

Main category: cs.LG

TL;DR: Proposes pairing test-time adaptation (TTA) with risk monitoring frameworks to detect when models degrade beyond recovery, extending sequential testing tools to work with TTA scenarios without labeled test data.

Details

Motivation: Test-time adaptation provides temporary solutions for shifted data, but models eventually degrade to the point of needing offline retraining. Current methods lack systematic ways to detect these ultimate failure points.

Method: Extends sequential testing with confidence sequences to accommodate TTA scenarios where models are updated at test time without labeled test data to estimate performance metrics.

Result: Demonstrates effective TTA monitoring framework across diverse datasets, distribution shift types, and TTA methods, enabling rigorous statistical risk monitoring for TTA.

Conclusion: The proposed framework successfully pairs TTA with risk monitoring to detect model degradation points, providing a systematic approach for determining when models need offline retraining.

Abstract: Encountering shifted data at test time is a ubiquitous challenge when deploying predictive models. Test-time adaptation (TTA) methods address this issue by continuously adapting a deployed model using only unlabeled test data. While TTA can extend the model’s lifespan, it is only a temporary solution. Eventually the model might degrade to the point that it must be taken offline and retrained. To detect such points of ultimate failure, we propose pairing TTA with risk monitoring frameworks that track predictive performance and raise alerts when predefined performance criteria are violated. Specifically, we extend existing monitoring tools based on sequential testing with confidence sequences to accommodate scenarios in which the model is updated at test time and no test labels are available to estimate the performance metrics of interest. Our extensions unlock the application of rigorous statistical risk monitoring to TTA, and we demonstrate the effectiveness of our proposed TTA monitoring framework across a representative set of datasets, distribution shift types, and TTA methods.

[911] Graph-Conditional Flow Matching for Relational Data Generation

Davide Scassola, Sebastiano Saccani, Luca Bortolussi

Main category: cs.LG

TL;DR: A generative model for relational data that uses flow matching and graph neural networks to capture complex foreign-key relationships and long-range dependencies in multi-table datasets.

Details

Motivation: Current methods for multi-table data generation lack flexibility and expressiveness to handle complex relational structures, particularly long-range dependencies and complex foreign-key relationships.

Method: Deep generative model using flow matching with graph neural networks that denoises records while leveraging information from connected records through foreign-key relationships.

Result: Achieves state-of-the-art performance in synthetic data fidelity on benchmark datasets, demonstrating ability to handle complex relational structures.

Conclusion: The proposed method provides a flexible and expressive approach for relational data generation that can capture complex dependencies and relationships across multiple tables.

Abstract: Data synthesis is gaining momentum as a privacy-enhancing technology. While single-table tabular data generation has seen considerable progress, current methods for multi-table data often lack the flexibility and expressiveness needed to capture complex relational structures. In particular, they struggle with long-range dependencies and complex foreign-key relationships, such as tables with multiple parent tables or multiple types of links between the same pair of tables. We propose a generative model for relational data that generates the content of a relational dataset given the graph formed by the foreign-key relationships. We do this by learning a deep generative model of the content of the whole relational database by flow matching, where the neural network trained to denoise records leverages a graph neural network to obtain information from connected records. Our method is flexible, as it can support relational datasets with complex structures, and expressive, as the generation of each record can be influenced by any other record within the same connected component. We evaluate our method on several benchmark datasets and show that it achieves state-of-the-art performance in terms of synthetic data fidelity.

[912] Stochastic Forward-Forward Learning through Representational Dimensionality Compression

Zhichao Zhu, Yang Qi, Hengyuan Ma, Wenlian Lu, Jianfeng Feng

Main category: cs.LG

TL;DR: The paper proposes a novel dimensionality compression goodness function for Forward-Forward learning that uses effective dimensionality of neural responses, achieving competitive performance without needing negative samples and showing noise can enhance generalization.

Details

Motivation: Existing goodness functions in Forward-Forward learning neglect correlated variability between neurons and require well-designed negative samples for contrastive learning.

Method: Proposed a dimensionality compression goodness function that uses effective dimensionality of fluctuating neural responses to incorporate second-order statistical structure, minimizing ED for noisy copies while maximizing it across sample distribution.

Result: Achieves competitive performance compared to other non-backpropagation methods, demonstrates noise can enhance generalization and improve inference when predictions use mean squared output.

Conclusion: Contributes to biologically plausible learning algorithms and suggests natural fit for neuromorphic computing where stochasticity is a computational resource.

Abstract: The Forward-Forward (FF) learning algorithm provides a bottom-up alternative to backpropagation (BP) for training neural networks, relying on a layer-wise “goodness” function with well-designed negative samples for contrastive learning. Existing goodness functions are typically defined as the sum of squared postsynaptic activations, neglecting correlated variability between neurons. In this work, we propose a novel goodness function termed dimensionality compression that uses the effective dimensionality (ED) of fluctuating neural responses to incorporate second-order statistical structure. Our objective minimizes ED for noisy copies of individual inputs while maximizing it across the sample distribution, promoting structured representations without the need to prepare negative samples.We demonstrate that this formulation achieves competitive performance compared to other non-BP methods. Moreover, we show that noise plays a constructive role that can enhance generalization and improve inference when predictions are derived from the mean of squared output, which is equivalent to making predictions based on an energy term. Our findings contribute to the development of more biologically plausible learning algorithms and suggest a natural fit for neuromorphic computing, where stochasticity is a computational resource rather than a nuisance. The code is available at https://github.com/ZhichaoZhu/StochasticForwardForward

[913] On the Relation between Rectified Flows and Optimal Transport

Johannes Hertrich, Antonin Chambolle, Julie Delon

Main category: cs.LG

TL;DR: This paper examines the relationship between rectified flows, flow matching, and optimal transport, revealing limitations in previous claims about gradient-constrained rectified flows solving optimal transport problems.

Details

Motivation: To clarify the theoretical connections between flow matching methods and optimal transport, particularly addressing recent claims that rectified flows with gradient constraints can solve optimal transport problems.

Method: The authors analyze invariance properties of rectified flows, provide explicit constructions in Gaussian and Gaussian mixture settings, and present counterexamples to test the relationship between gradient-constrained rectified flows and optimal transport solutions.

Result: The study shows that gradient-constrained rectified flows only relate to optimal transport under much stronger assumptions than previously acknowledged, and presents counterexamples that invalidate earlier equivalence results in the literature.

Conclusion: Enforcing a gradient constraint on rectified flows is generally not a reliable method for computing optimal transport maps, as the equivalence only holds under restrictive conditions not typically satisfied in practice.

Abstract: This paper investigates the connections between rectified flows, flow matching, and optimal transport. Flow matching is a recent approach to learning generative models by estimating velocity fields that guide transformations from a source to a target distribution. Rectified flow matching aims to straighten the learned transport paths, yielding more direct flows between distributions. Our first contribution is a set of invariance properties of rectified flows and explicit velocity fields. In addition, we also provide explicit constructions and analysis in the Gaussian (not necessarily independent) and Gaussian mixture settings and study the relation to optimal transport. Our second contribution addresses recent claims suggesting that rectified flows, when constrained such that the learned velocity field is a gradient, can yield (asymptotically) solutions to optimal transport problems. We study the existence of solutions for this problem and demonstrate that they only relate to optimal transport under assumptions that are significantly stronger than those previously acknowledged. In particular, we present several counterexamples that invalidate earlier equivalence results in the literature, and we argue that enforcing a gradient constraint on rectified flows is, in general, not a reliable method for computing optimal transport maps.

[914] Rotary Masked Autoencoders are Versatile Learners

Uros Zivanovic, Serafina Di Gioia, Andre Scaffidi, Martín de los Rios, Gabriella Contardo, Roberto Trotta

Main category: cs.LG

TL;DR: RoMAE extends Masked Autoencoder with Rotary Positional Embedding to handle irregular time-series without specialized architecture, achieving superior performance across multiple modalities while maintaining MAE’s efficiency.

Details

Motivation: Transformers for irregular time-series typically require specialized architectures that add computational overhead and complexity. The goal is to develop a method that handles continuous positional information without time-series-specific modifications.

Method: RoMAE combines Rotary Positional Embedding (RoPE) with Masked Autoencoder (MAE) to enable interpolation and representation learning with multidimensional continuous positional information, avoiding architectural specializations.

Result: RoMAE surpasses specialized time-series architectures on challenging datasets like DESC ELAsTiCC Challenge while maintaining MAE’s performance across images, audio, and other modalities. However, including learned embeddings breaks RoPE’s relative position property.

Conclusion: RoMAE provides an effective approach for handling irregular time-series and continuous positional information without specialized architectural modifications, demonstrating strong performance across diverse modalities.

Abstract: Applying Transformers to irregular time-series typically requires specializations to their baseline architecture, which can result in additional computational overhead and increased method complexity. We present the Rotary Masked Autoencoder (RoMAE), which utilizes the popular Rotary Positional Embedding (RoPE) method for continuous positions. RoMAE is an extension to the Masked Autoencoder (MAE) that enables interpolation and representation learning with multidimensional continuous positional information while avoiding any time-series-specific architectural specializations. We showcase RoMAE’s performance on a variety of modalities including irregular and multivariate time-series, images, and audio, demonstrating that RoMAE surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge while maintaining MAE’s usual performance across other modalities. In addition, we investigate RoMAE’s ability to reconstruct the embedded continuous positions, demonstrating that including learned embeddings in the input sequence breaks RoPE’s relative position property.

[915] Universal Neurons in GPT-2: Emergence, Persistence, and Functional Impact

Advey Nandan, Cheng-Ting Chou, Amrit Kurakula, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O’Brien

Main category: cs.LG

TL;DR: Analysis of universal neurons in GPT-2 models showing consistent activation patterns across independently trained models, with significant functional impact and high persistence throughout training.

Details

Motivation: To understand the emergence and evolution of universal neurons - neurons with consistently correlated activations across independently trained language models - and their stability during training.

Method: Analyzed five GPT-2 Small models at five training checkpoints using pairwise correlation analysis of activations over 5 million tokens dataset, conducted ablation experiments to measure functional impact via cross entropy loss, and quantified neuron persistence across checkpoints.

Result: Identified universal neurons with consistently correlated activations across models, ablation showed significant functional impacts on predictions, and demonstrated high stability of universal neurons across training checkpoints, especially in early and deeper layers.

Conclusion: Stable and universal representational structures emerge during language model training, suggesting consistent learning patterns across independently trained models.

Abstract: We investigate the phenomenon of neuron universality in independently trained GPT-2 Small models, examining these universal neurons-neurons with consistently correlated activations across models-emerge and evolve throughout training. By analyzing five GPT-2 models at five checkpoints, we identify universal neurons through pairwise correlation analysis of activations over a dataset of 5 million tokens. Ablation experiments reveal significant functional impacts of universal neurons on model predictions, measured via cross entropy loss. Additionally, we quantify neuron persistence, demonstrating high stability of universal neurons across training checkpoints, particularly in early and deeper layers. These findings suggest stable and universal representational structures emerge during language model training.

[916] Graph Flow Matching: Enhancing Image Generation with Neighbor-Aware Flow Fields

Md Shahriar Rahim Siddiqui, Moshe Eliasof, Eldad Haber

Main category: cs.LG

TL;DR: Graph Flow Matching (GFM) enhances flow matching by adding a diffusion term that aggregates neighbor information via graph neural networks, improving generation quality while maintaining computational efficiency.

Details

Motivation: Existing flow matching networks predict velocities independently for each point, ignoring correlations between neighboring points that could improve velocity predictions and generation quality.

Method: Proposes GFM which decomposes velocity into a reaction term (standard flow matching) and a diffusion term that uses graph neural networks to aggregate neighbor information in a reaction-diffusion formulation.

Result: GFM consistently improves FID and recall across five image generation benchmarks (LSUN Church, LSUN Bedroom, FFHQ, AFHQ-Cat, CelebA-HQ at 256×256) in latent space of pretrained VAE.

Conclusion: GFM is an effective modular enhancement to existing flow matching architectures that enriches velocity predictions with local context at minimal computational cost.

Abstract: Flow matching casts sample generation as learning a continuous-time velocity field that transports noise to data. Existing flow matching networks typically predict each point’s velocity independently, considering only its location and time along its flow trajectory, and ignoring neighboring points. However, this pointwise approach may overlook correlations between points along the generation trajectory that could enhance velocity predictions, thereby improving downstream generation quality. To address this, we propose Graph Flow Matching (GFM), a lightweight enhancement that decomposes the learned velocity into a reaction term – any standard flow matching network – and a diffusion term that aggregates neighbor information via a graph neural module. This reaction-diffusion formulation retains the scalability of deep flow models while enriching velocity predictions with local context, all at minimal additional computational cost. Operating in the latent space of a pretrained variational autoencoder, GFM consistently improves Fréchet Inception Distance (FID) and recall across five image generation benchmarks (LSUN Church, LSUN Bedroom, FFHQ, AFHQ-Cat, and CelebA-HQ at $256\times256$), demonstrating its effectiveness as a modular enhancement to existing flow matching architectures.

[917] From Invariant Representations to Invariant Data: Provable Robustness to Spurious Correlations via Noisy Counterfactual Matching

Ruqi Bai, Yao Ji, Zeyu Zhou, David I. Inouye

Main category: cs.LG

TL;DR: NCM improves model robustness by using noisy counterfactual pairs instead of invariant representations, with theoretical guarantees and empirical validation.

Details

Motivation: Existing methods for learning invariant representations often underperform ERM, so the paper proposes a data-centric approach using invariant data pairs.

Method: Noisy Counterfactual Matching (NCM) - a constraint-based method that leverages noisy counterfactual pairs to enforce prediction invariance.

Result: Theoretical analysis shows NCM’s test-domain error is bounded, and experiments on synthetic and real-world datasets demonstrate effectiveness.

Conclusion: NCM provides a practical data-centric alternative to representation-based methods for improving model robustness to spurious correlations.

Abstract: Models that learn spurious correlations from training data often fail when deployed in new environments. While many methods aim to learn invariant representations to address this, they often underperform standard empirical risk minimization (ERM). We propose a data-centric alternative that shifts the focus from learning invariant representations to leveraging invariant data pairs – pairs of samples that should have the same prediction. We prove that certain counterfactuals naturally satisfy this invariance property. Based on this, we introduce Noisy Counterfactual Matching (NCM), a simple constraint-based method that improves robustness by leveraging even a small number of \emph{noisy} counterfactual pairs – improving upon prior works that do not explicitly consider noise. For linear causal models, we prove that NCM’s test-domain error is bounded by its in-domain error plus a term dependent on the counterfactuals’ quality and diversity. Experiments on synthetic data validate our theory, and we demonstrate NCM’s effectiveness on real-world datasets.

[918] Discovering Spatial Correlations of Earth Observations for weather forecasting by using Graph Structure Learning

Hyeon-Ju Jeon, Jeon-Ho Kang, In-Hyuk Kwon, O-Joun Lee

Main category: cs.LG

TL;DR: Proposes CloudNine-v2, a spatiotemporal graph neural network with adaptive edge sampling to improve weather prediction by capturing dynamic spatial correlations between Earth observations and atmospheric states, achieving 15% RMSE reduction.

Details

Motivation: Traditional numerical weather prediction systems struggle to capture complex, dynamic spatial correlations between shifting observation locations and atmospheric states due to rigid statistical/physical formulations.

Method: Uses spatiotemporal graph neural networks with structure learning, but addresses structural information loss and over-smoothing by regulating edge sampling through adaptive node degree determination and spatial distance consideration.

Result: Validated on real-world East Asia data, achieved up to 15% RMSE reduction over existing STGNN models, with consistent outperformance in high atmospheric variability areas.

Conclusion: The proposed adaptive edge sampling method effectively handles dynamic spatial correlations in weather prediction, demonstrating superior performance over baseline approaches with and without structure learning.

Abstract: This study aims to improve the accuracy of weather predictions by discovering spatial correlations between Earth observations and atmospheric states. Existing numerical weather prediction (NWP) systems predict future atmospheric states at fixed locations, which are called NWP grid points, by analyzing previous atmospheric states and newly acquired Earth observations. However, the shifting locations of observations and the surrounding meteorological context induce complex, dynamic spatial correlations that are difficult for traditional NWP systems to capture, since they rely on strict statistical and physical formulations. To handle complicated spatial correlations, which change dynamically, we employ a spatiotemporal graph neural networks (STGNNs) with structure learning. However, structure learning has an inherent limitation that this can cause structural information loss and over-smoothing problem by generating excessive edges. To solve this problem, we regulate edge sampling by adaptively determining node degrees and considering the spatial distances between NWP grid points and observations. We validated the effectiveness of the proposed method (CloudNine-v2) using real-world atmospheric state and observation data from East Asia, achieving up to 15% reductions in RMSE over existing STGNN models. Even in areas with high atmospheric variability, CloudNine-v2 consistently outperformed baselines with and without structure learning.

[919] Causal Discovery in Dynamic Fading Wireless Networks

Oluwaseyi Giwa

Main category: cs.LG

TL;DR: Proposes a sequential regression algorithm with NOTEARS constraint for dynamic causal discovery in wireless networks, deriving theoretical bounds on detection delay and validating them through simulations.

Details

Motivation: Traditional static causal models fail in wireless networks due to evolving interference, fading, and mobility, requiring dynamic causal discovery approaches.

Method: Sequential regression-based algorithm with NOTEARS acyclicity constraint for efficient online updates in dynamic fading wireless environments.

Result: Derived theoretical bounds on detection delay showing linear increase with network size, quadratic growth with noise variance, and inverse-square dependence on structural change magnitude. Simulations validate these findings.

Conclusion: Provides theoretical insights and practical guidelines for robust online causal inference to maintain network reliability under nonstationary wireless conditions.

Abstract: Dynamic causal discovery in wireless networks is essential due to evolving interference, fading, and mobility, which complicate traditional static causal models. This paper addresses causal inference challenges in dynamic fading wireless environments by proposing a sequential regression-based algorithm with a novel application of the NOTEARS acyclicity constraint, enabling efficient online updates. We derive theoretical lower and upper bounds on the detection delay required to identify structural changes, explicitly quantifying their dependence on network size, noise variance, and fading severity. Monte Carlo simulations validate these theoretical results, demonstrating linear increases in detection delay with network size, quadratic growth with noise variance, and inverse-square dependence on the magnitude of structural changes. Our findings provide rigorous theoretical insights and practical guidelines for designing robust online causal inference mechanisms to maintain network reliability under nonstationary wireless conditions.

[920] Oblivionis: A Lightweight Learning and Unlearning Framework for Federated Large Language Models

Fuyao Zhang, Xinyu Yan, Tiantong Wu, Wenjie Li, Tianxiang Chen, Yang Cao, Ran Yan, Longtao Huang, Wei Yang Bryan Lim, Qiang Yang

Main category: cs.LG

TL;DR: Oblivionis is a lightweight framework that enables selective data removal in federated LLM training, addressing GDPR compliance and data governance gaps in existing FL systems.

Details

Motivation: Federated LLM frameworks lack built-in mechanisms for regulatory compliance like GDPR's right to be forgotten, creating privacy and governance concerns despite enabling collaborative training without raw data sharing.

Method: Unifies FL and unlearning as a dual optimization objective, incorporating 6 FL and 5 unlearning algorithms for comprehensive evaluation, establishing a robust pipeline for federated LLM unlearning.

Result: Extensive experiments show Oblivionis outperforms local training, achieving a robust balance between forgetting efficacy and model utility, with cross-algorithm comparisons providing clear future directions.

Conclusion: Oblivionis successfully addresses the complexity of federated LLM unlearning, enhancing trustworthiness and regulatory compliance while maintaining model performance.

Abstract: Large Language Models (LLMs) increasingly leverage Federated Learning (FL) to utilize private, task-specific datasets for fine-tuning while preserving data privacy. However, while federated LLM frameworks effectively enable collaborative training without raw data sharing, they critically lack built-in mechanisms for regulatory compliance like GDPR’s right to be forgotten. Integrating private data heightens concerns over data quality and long-term governance, yet existing distributed training frameworks offer no principled way to selectively remove specific client contributions post-training. Due to distributed data silos, stringent privacy constraints, and the intricacies of interdependent model aggregation, federated LLM unlearning is significantly more complex than centralized LLM unlearning. To address this gap, we introduce Oblivionis, a lightweight learning and unlearning framework that enables clients to selectively remove specific private data during federated LLM training, enhancing trustworthiness and regulatory compliance. By unifying FL and unlearning as a dual optimization objective, we incorporate 6 FL and 5 unlearning algorithms for comprehensive evaluation and comparative analysis, establishing a robust pipeline for federated LLM unlearning. Extensive experiments demonstrate that Oblivionis outperforms local training, achieving a robust balance between forgetting efficacy and model utility, with cross-algorithm comparisons providing clear directions for future LLM development.

[921] PyLO: Towards Accessible Learned Optimizers in PyTorch

Paul Janson, Benjamin Therien, Quentin Anthony, Xiaolong Huang, Abhinav Moudgil, Eugene Belilovsky

Main category: cs.LG

TL;DR: PyLO is a PyTorch library that makes learned optimizers accessible and practical for real-world large-scale pre-training tasks, providing significant speed improvements and integration with existing optimization tools.

Details

Motivation: To address the inaccessibility of recent learned optimizers like VeLO (which required 4000 TPU-months of meta-training) to the broader community due to JAX dependency and lack of user-friendly packages.

Method: Developed PyLO, a PyTorch-based library with CUDA-accelerated implementation of the small_fc_lopt learned optimizer architecture, enabling integration with existing optimization workflows and tools like learning rate schedules and weight decay.

Result: Achieved substantial speedups from 39.36 to 205.59 samples/sec throughput for training ViT B/16 with batch size 32, and demonstrated that learned optimizers can substantially benefit when combined with existing optimization tools.

Conclusion: PyLO successfully bridges the gap between research and practical application of learned optimizers, making them accessible to the broader machine learning community through familiar PyTorch workflows while delivering significant performance improvements.

Abstract: Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances – such as VeLO, which was meta-trained for 4000 TPU-months – remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for applying the optimizers after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the broader machine learning community through familiar, widely adopted workflows. Unlike prior work focused on synthetic or convex tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. Our release includes a CUDA-accelerated version of the small_fc_lopt learned optimizer architecture from (Metz et al., 2022a), delivering substantial speedups – from 39.36 to 205.59 samples/sec throughput for training ViT B/16 with batch size 32. PyLO also allows us to easily combine learned optimizers with existing optimization tools such as learning rate schedules and weight decay. When doing so, we find that learned optimizers can substantially benefit. Our code is available at https://github.com/Belilovsky-Lab/pylo

[922] ANO : Faster is Better in Noisy Landscape

Adrien Kegreisz

Main category: cs.LG

TL;DR: Ano optimizer decouples direction and magnitude: momentum for directional smoothing, instantaneous gradients for step size. Anolog variant removes momentum sensitivity via logarithmic schedule. Improves robustness to noise while maintaining efficiency.

Details

Motivation: Existing optimizers like Adam and Adan degrade in non-stationary or noisy environments due to reliance on momentum-based magnitude estimates.

Method: Decouple direction and magnitude: use momentum for directional smoothing while using instantaneous gradient magnitudes for step size. Anolog variant expands momentum window over time via logarithmic schedule.

Result: Established non-convex convergence guarantees with rate similar to sign-based methods. Substantial gains in noisy and non-stationary regimes like reinforcement learning, while remaining competitive on low-noise tasks.

Conclusion: Ano and Anolog provide improved robustness to gradient noise while retaining simplicity and efficiency of first-order methods, particularly beneficial in noisy and non-stationary environments.

Abstract: Stochastic optimizers are central to deep learning, yet widely used methods such as Adam and Adan can degrade in non-stationary or noisy environments, partly due to their reliance on momentum-based magnitude estimates. We introduce Ano, a novel optimizer that decouples direction and magnitude: momentum is used for directional smoothing, while instantaneous gradient magnitudes determine step size. This design improves robustness to gradient noise while retaining the simplicity and efficiency of first-order methods. We further propose Anolog, which removes sensitivity to the momentum coefficient by expanding its window over time via a logarithmic schedule. We establish non-convex convergence guarantees with a convergence rate similar to other sign-based methods, and empirically show that Ano provides substantial gains in noisy and non-stationary regimes such as reinforcement learning, while remaining competitive on low-noise tasks.

[923] Multiple Streams of Knowledge Retrieval: Enriching and Recalling in Transformers

Todd Nief, David Reber, Sean Richardson, Ari Holtzman

Main category: cs.LG

TL;DR: The paper proposes dynamic weight grafting to analyze how LLMs store and retrieve new facts learned during finetuning, revealing two distinct pathways: entity enrichment during processing and information recall before prediction.

Details

Motivation: To understand where and how LLMs store new factual information learned during finetuning, as existing activation-based methods are insufficient for this analysis.

Method: Dynamic weight grafting - selectively grafting weights from finetuned models onto pretrained models to analyze information storage mechanisms.

Result: Identified two separate pathways: 1) enriching residual stream with relation information during entity processing, and 2) recalling information at final token position before prediction. Localized recall pathway to specific model components including attention mechanisms and feedforward networks.

Conclusion: LLMs implement multiple redundant heuristics for retrieving finetuned knowledge, with both enrichment and recall pathways working independently or together depending on the case.

Abstract: When an LLM learns a new fact during finetuning (e.g., new movie releases, newly elected pope, etc.), where does this information go? Are entities enriched with relation information, or do models recall information just-in-time before a prediction? Or, are all of the above'' true with LLMs implementing multiple redundant heuristics? Existing localization approaches (e.g., activation patching) are ill-suited for this analysis because they usually \textit{replace} parts of the residual stream, thus overriding previous information. To fill this gap, we propose \emph{dynamic weight grafting}, a technique that selectively grafts weights from a finetuned model onto a pretrained model. Using this technique, we show two separate pathways for retrieving finetuned relation information: 1) enriching" the residual stream with relation information while processing the tokens that correspond to an entity (e.g., Zendaya'' in Zendaya co-starred with John David Washington’’) and 2) recalling" this information at the final token position before generating a target fact. In some cases, models need information from both of these pathways to correctly generate finetuned facts while, in other cases, either the enrichment" or recall" pathway alone is sufficient. We localize the recall’’ pathway to model components – finding that ``recall" occurs via both task-specific attention mechanisms and an entity-specific extraction step in the feedforward networks of the final layers before the target prediction. By targeting model components and parameters, as opposed to just activations, we are able to understand the \textit{mechanisms} by which finetuned knowledge is retrieved during generation.

[924] Cross-Platform E-Commerce Product Categorization and Recategorization: A Multimodal Hierarchical Classification Approach

Lotte Gross, Rebecca Walter, Nicole Zoppi, Adrien Justus, Alessandro Gambetti, Qiwei Han, Maximilian Kaiser

Main category: cs.LG

TL;DR: Multimodal hierarchical classification framework for e-commerce product categorization achieves 98.59% hierarchical F1 score using CLIP embeddings with MLP-based late fusion, plus self-supervised recategorization pipeline for discovering fine-grained categories.

Details

Motivation: Address platform heterogeneity and structural limitations of existing taxonomies in e-commerce product categorization across 40 international fashion platforms.

Method: Multimodal hierarchical classification integrating RoBERTa (text), ViT (vision), and CLIP (vision-language) with fusion strategies (early, late, attention-based) and dynamic masking. Self-supervised recategorization pipeline using SimCLR, UMAP, and cascade clustering.

Result: CLIP embeddings with MLP-based late fusion achieved highest hierarchical F1 (98.59%). Recategorization discovered new fine-grained categories with cluster purities >86%. Late fusion maximizes accuracy with diverse data, while early fusion generalizes better to unseen platforms.

Conclusion: Framework successfully deployed in commercial platform via two-stage inference pipeline balancing cost and accuracy, demonstrating industrial scalability for e-commerce product categorization.

Abstract: This study addresses critical industrial challenges in e-commerce product categorization, namely platform heterogeneity and the structural limitations of existing taxonomies, by developing and deploying a multimodal hierarchical classification framework. Using a dataset of 271,700 products from 40 international fashion e-commerce platforms, we integrate textual features (RoBERTa), visual features (ViT), and joint vision-language representations (CLIP). We investigate fusion strategies, including early, late, and attention-based fusion within a hierarchical architecture enhanced by dynamic masking to ensure taxonomic consistency. Results show that CLIP embeddings combined via an MLP-based late-fusion strategy achieve the highest hierarchical F1 (98.59%), outperforming unimodal baselines. To address shallow or inconsistent categories, we further introduce a self-supervised “product recategorization” pipeline using SimCLR, UMAP, and cascade clustering, which discovered new, fine-grained categories (for example, subtypes of “Shoes”) with cluster purities above 86%. Cross-platform experiments reveal a deployment-relevant trade-off: complex late-fusion methods maximize accuracy with diverse training data, while simpler early-fusion methods generalize more effectively to unseen platforms. Finally, we demonstrate the framework’s industrial scalability through deployment in EURWEB’s commercial transaction intelligence platform via a two-stage inference pipeline, combining a lightweight RoBERTa stage with a GPU-accelerated multimodal stage to balance cost and accuracy.

[925] Learning Stochastic Multiscale Models

Andrew F. Ilersich, Prasanth B. Nair

Main category: cs.LG

TL;DR: Proposes learning stochastic multiscale models from data using SDEs with latent microscale states, achieving better accuracy than under-resolved simulations and closure models.

Details

Motivation: Physical systems have wide range of length/time scales making direct numerical simulation computationally expensive due to high-dimensional state space.

Method: Learn stochastic multiscale models as SDEs with macroscale state on coarse mesh and latent microscale state for unresolved dynamics, using simulator-free amortized variational inference with Product of Experts likelihood.

Result: Learned multiscale models achieve superior predictive accuracy compared to under-resolved direct numerical simulation, closure-type models, and reduced-order modeling approaches at equivalent resolution.

Conclusion: The approach successfully learns effective multiscale models from data that outperform traditional methods while maintaining computational efficiency.

Abstract: The physical sciences are replete with dynamical systems that require the resolution of a wide range of length and time scales. This presents significant computational challenges since direct numerical simulation requires discretization at the finest relevant scales, leading to a high-dimensional state space. In this work, we propose an approach to learn stochastic multiscale models in the form of stochastic differential equations directly from observational data. Drawing inspiration from physics-based multiscale modeling approaches, we resolve the macroscale state on a coarse mesh while introducing a microscale latent state to explicitly model unresolved dynamics. We learn the parameters of the multiscale model using a simulator-free amortized variational inference method with a Product of Experts likelihood that enforces scale separation. We present detailed numerical studies to demonstrate that our learned multiscale models achieve superior predictive accuracy compared to under-resolved direct numerical simulation and closure-type models at equivalent resolution, as well as reduced-order modeling approaches.

[926] Mirror Descent Policy Optimisation for Robust Constrained Markov Decision Processes

David M. Bossens, Atsushi Nitanda

Main category: cs.LG

TL;DR: Mirror descent policy optimization for robust constrained Markov decision processes that achieves O~(1/T^{1/3}) convergence rate using policy gradient techniques to optimize both policy and adversarial transition kernel.

Details

Motivation: Safety is essential for reinforcement learning systems, and robust constrained MDPs provide guarantees under epistemic uncertainty while satisfying long-term constraints.

Method: Uses policy gradient techniques to optimize both the policy (maximizer) and transition kernel (adversarial minimizer) on the Lagrangian of constrained MDPs, with an algorithm for approximate gradient descent in transition kernel space.

Result: Achieves O~(1/T^{1/3}) convergence rate in sample-based robust constrained MDP setting, with experiments showing significant improvements in robustness compared to baseline algorithms.

Conclusion: Mirror descent policy optimization provides effective approach for robust constrained MDPs with strong convergence guarantees and improved robustness performance.

Abstract: Safety is an essential requirement for reinforcement learning systems. The newly emerging framework of robust constrained Markov decision processes allows learning policies that satisfy long-term constraints while providing guarantees under epistemic uncertainty. This paper presents mirror descent policy optimisation for robust constrained Markov decision processes, making use of policy gradient techniques to optimise both the policy (as a maximiser) and the transition kernel (as an adversarial minimiser) on the Lagrangian representing a constrained Markov decision process. Our proposed algorithm obtains an $\tilde{\mathcal{O}}\left(1/T^{1/3}\right)$ convergence rate in the sample-based robust constrained Markov decision process setting. The paper also contributes an algorithm for approximate gradient descent in the space of transition kernels, which is of independent interest for designing adversarial environments in general Markov decision processes. Experiments confirm the benefits of mirror descent policy optimisation in constrained and unconstrained optimisation, and significant improvements are observed in robustness tests when compared to baseline policy optimisation algorithms.

[927] Privacy-Preserving Personalization in Education: A Federated Recommender System for Student Performance Prediction

Rodrigo Tertulino, Ricardo Almeida

Main category: cs.LG

TL;DR: A privacy-preserving educational recommender system using Federated Learning achieves 92% of centralized model performance while protecting student data privacy.

Details

Motivation: To address the conflict between data-driven personalization in education and student data privacy protection under modern regulations.

Method: Used Federated Learning with Deep Neural Networks on the ASSISTments dataset, comparing FedProx and FedAvg aggregation strategies.

Result: FedProx proved more stable and effective than FedAvg, achieving 76.28% F1-Score (92% of centralized XGBoost performance).

Conclusion: Federated Learning provides a viable solution to the personalization-privacy dilemma in educational platforms without centralizing sensitive data.

Abstract: The increasing digitalization of education presents unprecedented opportunities for data-driven personalization, but it also introduces significant challenges to student data privacy. Conventional recommender systems rely on centralized data, a paradigm often incompatible with modern data protection regulations. A novel privacy-preserving recommender system is proposed and evaluated to address this critical issue using Federated Learning (FL). The approach utilizes a Deep Neural Network (DNN) with rich, engineered features from the large-scale ASSISTments educational dataset. A rigorous comparative analysis of federated aggregation strategies was conducted, identifying FedProx as a significantly more stable and effective method for handling heterogeneous student data than the standard FedAvg baseline. The optimized federated model achieves a high-performance F1-Score of 76.28%, corresponding to 92% of the performance of a powerful, centralized XGBoost model. These findings validate that a federated approach can provide highly effective content recommendations without centralizing sensitive student data. Consequently, our work presents a viable and robust solution to the personalization-privacy dilemma in modern educational platforms.

[928] Beyond Parallelism: Synergistic Computational Graph Effects in Multi-Head Attention

Haitz Sáez de Ocáriz Borde

Main category: cs.LG

TL;DR: Multi-head attention in Transformers provides synergistic computational benefits beyond parallel processing, enhancing information propagation with faster mixing times and minimax fidelity under head-diversity conditions.

Details

Motivation: The theoretical advantages of multi-head versus single-head attention remain underexplored, despite multi-head attention being fundamental to Transformer networks and large language models.

Method: Reframe multi-head attention as a system of synergistic computational graphs where each head functions as a feedforward DAG with common sink state, analyze mixing time and minimax fidelity theoretically, and empirically train single-head vs multi-head Transformers with equal parameters on sequence manipulation tasks.

Result: Multi-head attention synergistically enhances information propagation with faster mixing times and minimax fidelity amplification under specific head-diversity conditions, as verified through empirical training experiments.

Conclusion: Multi-head attention provides computational synergy beyond mere parallel processing, offering theoretical and empirical advantages in information propagation for Transformer architectures.

Abstract: Multi-head attention powers Transformer networks, the primary deep learning architecture behind the success of large language models (LLMs). Yet, the theoretical advantages of multi-head versus single-head attention, beyond mere parallel processing, remain underexplored. In this paper, we reframe multi-head attention as a system of potentially synergistic computational graphs, where each head functions as a feedforward directed acyclic graph (DAG) with a common sink state. We provide intuition and preliminary theoretical analysis of mixing time and minimax fidelity in this framework. Our results show that multi-head attention can synergistically enhance information propagation, yielding faster mixing times and minimax fidelity amplification under specific head-diversity conditions. Finally, we train single-head and multi-head Transformers, each with the same total number of parameters, on sequence manipulation tasks and empirically verify the predicted effects. The code is available at https://github.com/haitzsaezdeocariz/beyondparallelism.

[929] MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling

Etienne Le Naour, Tahar Nabil, Ghislain Agoua

Main category: cs.LG

TL;DR: MoTM is a foundation model for time series imputation that uses a mixture of implicit neural representations to handle out-of-domain generalization across various missing data scenarios.

Details

Motivation: Time series foundation models focus mainly on forecasting, leaving out-of-domain imputation of missing values largely unexplored. Current implicit neural representations struggle with distribution shifts.

Method: MoTM combines a basis of independently trained INRs (each on distinct time series families) with a ridge regressor that adapts to observed context at inference, treating new time series as mixtures of previously seen patterns.

Result: The model demonstrates robust in-domain and out-of-domain generalization across diverse imputation scenarios including block and pointwise missingness with variable sampling rates.

Conclusion: MoTM paves the way for adaptable foundation imputation models by effectively handling distribution shifts in time series imputation tasks.

Abstract: Recent years have witnessed a growing interest for time series foundation models, with a strong emphasis on the forecasting task. Yet, the crucial task of out-of-domain imputation of missing values remains largely underexplored. We propose a first step to fill this gap by leveraging implicit neural representations (INRs). INRs model time series as continuous functions and naturally handle various missing data scenarios and sampling rates. While they have shown strong performance within specific distributions, they struggle under distribution shifts. To address this, we introduce MoTM (Mixture of Timeflow Models), a step toward a foundation model for time series imputation. Building on the idea that a new time series is a mixture of previously seen patterns, MoTM combines a basis of INRs, each trained independently on a distinct family of time series, with a ridge regressor that adapts to the observed context at inference. We demonstrate robust in-domain and out-of-domain generalization across diverse imputation scenarios (e.g., block and pointwise missingness, variable sampling rates), paving the way for adaptable foundation imputation models.

[930] Continual Learning with Synthetic Boundary Experience Blending

Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen

Main category: cs.LG

TL;DR: Experience Blending (EB) improves continual learning by generating synthetic boundary data through noise injection in latent space, enhancing decision boundary robustness beyond standard experience replay.

Details

Motivation: Standard experience replay in continual learning only sparsely approximates data distributions, leading to fragile decision boundaries and catastrophic forgetting.

Method: Proposes Experience Blending framework with: (1) latent-space noise injection to create synthetic boundary data (SBD), and (2) dual-model aggregation strategy for joint training on exemplars and SBD.

Result: Achieves consistent accuracy improvements: 10% on CIFAR-10, 6% on CIFAR-100, and 13% on Tiny ImageNet over strong baselines.

Conclusion: Synthetic boundary data enriches feature space near decision boundaries, enabling more stable and robust continual learning performance.

Abstract: Continual learning (CL) seeks to mitigate catastrophic forgetting when models are trained with sequential tasks. A common approach, experience replay (ER), stores past exemplars but only sparsely approximates the data distribution, yielding fragile and oversimplified decision boundaries. We address this limitation by introducing synthetic boundary data (SBD), generated via differential privacy: inspired noise into latent features to create boundary-adjacent representations that implicitly regularize decision boundaries. Building on this idea, we propose Experience Blending (EB), a framework that jointly trains on exemplars and SBD through a dual-model aggregation strategy. EB has two components: (1) latent-space noise injection to synthesize boundary data, and (2) end-to-end training that jointly leverages exemplars and SBD. Unlike standard experience replay, SBD enriches the feature space near decision boundaries, leading to more stable and robust continual learning. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet demonstrate consistent accuracy improvements of 10%, 6%, and 13%, respectively, over strong baselines.

[931] Sheaf Graph Neural Networks via PAC-Bayes Spectral Optimization

Yoonhyuk Choi, Jiho Choi, Chong-Kwon Kim

Main category: cs.LG

TL;DR: SGPC is a sheaf-based GNN framework that addresses over-smoothing on heterophilic graphs through optimal transport lifting, variance-reduced diffusion, and PAC-Bayes regularization, achieving state-of-the-art performance with certified confidence intervals.

Details

Motivation: Over-smoothing in GNNs causes node feature collapse, especially on heterophilic graphs where adjacent nodes have dissimilar labels. Existing sheaf neural networks use static or over-parameterized sheaf structures that lack generalization and scalability, and fail to provide rigorous stability guarantees.

Method: SGPC combines cellular-sheaf message passing with optimal transport-based lifting, variance-reduced diffusion, and PAC-Bayes spectral regularization for robust semi-supervised node classification. It enables end-to-end training with linear computational complexity.

Result: Experiments on nine homophilic and heterophilic benchmarks show SGPC outperforms state-of-the-art spectral and sheaf-based GNNs while providing certified confidence intervals on unseen nodes.

Conclusion: SGPC provides a unified architecture that effectively mitigates over-smoothing in GNNs through theoretically-grounded mechanisms, achieving superior performance with guaranteed stability and confidence bounds.

Abstract: Over-smoothing in Graph Neural Networks (GNNs) causes collapse in distinct node features, particularly on heterophilic graphs where adjacent nodes often have dissimilar labels. Although sheaf neural networks partially mitigate this problem, they typically rely on static or heavily parameterized sheaf structures that hinder generalization and scalability. Existing sheaf-based models either predefine restriction maps or introduce excessive complexity, yet fail to provide rigorous stability guarantees. In this paper, we introduce a novel scheme called SGPC (Sheaf GNNs with PAC-Bayes Calibration), a unified architecture that combines cellular-sheaf message passing with several mechanisms, including optimal transport-based lifting, variance-reduced diffusion, and PAC-Bayes spectral regularization for robust semi-supervised node classification. We establish performance bounds theoretically and demonstrate that end-to-end training in linear computational complexity can achieve the resulting bound-aware objective. Experiments on nine homophilic and heterophilic benchmarks show that SGPC outperforms state-of-the-art spectral and sheaf-based GNNs while providing certified confidence intervals on unseen nodes. The code and proofs are in https://github.com/ChoiYoonHyuk/SGPC.

[932] TimeMosaic: Temporal Heterogeneity Guided Time Series Forecasting via Adaptive Granularity Patch and Segment-wise Decoding

Kuiye Ding, Fanda Fan, Chunyi Hou, Zheya Wang, Lei Wang, Zhengxin Yang, Jianfeng Zhan

Main category: cs.LG

TL;DR: TimeMosaic is a multivariate time series forecasting framework that addresses temporal heterogeneity through adaptive patch embedding and segment-wise decoding, achieving competitive performance with state-of-the-art methods.

Details

Motivation: Existing patch-based methods use fixed-length segmentation, which overlooks local temporal dynamics heterogeneity and decoding heterogeneity, leading to lost details in information-dense regions and redundancy in stable segments.

Method: Employs adaptive patch embedding to dynamically adjust granularity based on local information density, and introduces segment-wise decoding that treats each prediction horizon as a related subtask with horizon-specific adaptation.

Result: Extensive evaluations show consistent improvements over existing methods, with performance competitive with state-of-the-art TSFMs when trained on large-scale corpus with 321 billion observations.

Conclusion: TimeMosaic effectively addresses temporal heterogeneity in multivariate time series forecasting through its adaptive patch embedding and segment-wise decoding approach.

Abstract: Multivariate time series forecasting is essential in domains such as finance, transportation, climate, and energy. However, existing patch-based methods typically adopt fixed-length segmentation, overlooking the heterogeneity of local temporal dynamics and the decoding heterogeneity of forecasting. Such designs lose details in information-dense regions, introduce redundancy in stable segments, and fail to capture the distinct complexities of short-term and long-term horizons. We propose TimeMosaic, a forecasting framework that aims to address temporal heterogeneity. TimeMosaic employs adaptive patch embedding to dynamically adjust granularity according to local information density, balancing motif reuse with structural clarity while preserving temporal continuity. In addition, it introduces segment-wise decoding that treats each prediction horizon as a related subtask and adapts to horizon-specific difficulty and information requirements, rather than applying a single uniform decoder. Extensive evaluations on benchmark datasets demonstrate that TimeMosaic delivers consistent improvements over existing methods, and our model trained on the large-scale corpus with 321 billion observations achieves performance competitive with state-of-the-art TSFMs.

[933] PSEO: Optimizing Post-hoc Stacking Ensemble Through Hyperparameter Tuning

Beicheng Xu, Wei Liu, Keyao Ding, Yupeng Lu, Bin Cui

Main category: cs.LG

TL;DR: PSEO is a framework for post-hoc stacking ensemble optimization that improves AutoML by optimizing ensemble strategies rather than using fixed approaches, achieving top performance on 80 datasets.

Details

Motivation: Current AutoML systems use fixed ensemble strategies that don't adapt to specific task characteristics, limiting the potential of post-hoc ensembles despite their proven effectiveness in ensemble learning.

Method: Uses binary quadratic programming for base model selection balancing diversity and performance, introduces mechanisms for multi-layer stacking optimization, and searches optimal ensemble strategies within a hyperparameter space.

Result: Achieved best average test rank (2.96) among 16 methods on 80 public datasets, outperforming both AutoML systems’ post-hoc designs and state-of-the-art ensemble methods.

Conclusion: PSEO demonstrates that optimizing ensemble strategies significantly improves AutoML performance over fixed approaches, with adaptive ensemble design being crucial for achieving state-of-the-art results.

Abstract: The Combined Algorithm Selection and Hyperparameter Optimization (CASH) problem is fundamental in Automated Machine Learning (AutoML). Inspired by the success of ensemble learning, recent AutoML systems construct post-hoc ensembles for final predictions rather than relying on the best single model. However, while most CASH methods conduct extensive searches for the optimal single model, they typically employ fixed strategies during the ensemble phase that fail to adapt to specific task characteristics. To tackle this issue, we propose PSEO, a framework for post-hoc stacking ensemble optimization. First, we conduct base model selection through binary quadratic programming, with a trade-off between diversity and performance. Furthermore, we introduce two mechanisms to fully realize the potential of multi-layer stacking. Finally, PSEO builds a hyperparameter space and searches for the optimal post-hoc ensemble strategy within it. Empirical results on 80 public datasets show that \sys achieves the best average test rank (2.96) among 16 methods, including post-hoc designs in recent AutoML systems and state-of-the-art ensemble learning methods.

[934] UniMove: A Unified Model for Multi-city Human Mobility Prediction

Chonghua Han, Yuan Yuan, Yukun Liu, Jingtao Ding, Jie Feng, Yong Li

Main category: cs.LG

TL;DR: UniMove is a unified multi-city human mobility prediction model that addresses spatial heterogeneity and diverse movement patterns through a dual-tower architecture and MoE Transformer blocks, achieving over 10.2% accuracy improvement.

Details

Motivation: Human mobility prediction faces challenges from inherent randomness, non-uniform time intervals, complex patterns, and city heterogeneity. Existing solutions require separate models for each city due to distinct spatial representations and geographic coverage.

Method: Proposes a trajectory-location dual-tower architecture with location tower for universal spatial encoding and trajectory tower for sequential mobility modeling. Uses MoE Transformer blocks to adaptively select experts for handling diverse movement patterns.

Result: Extensive experiments across multiple datasets from diverse cities show UniMove significantly improves mobility prediction accuracy by over 10.2% through joint training on multi-city data with mutual data enhancement.

Conclusion: UniMove represents a key advancement toward realizing a true foundational model with unified architecture for human mobility prediction, enabling effective cross-city modeling and performance improvements.

Abstract: Human mobility prediction is vital for urban planning, transportation optimization, and personalized services. However, the inherent randomness, non-uniform time intervals, and complex patterns of human mobility, compounded by the heterogeneity introduced by varying city structures, infrastructure, and population densities, present significant challenges in modeling. Existing solutions often require training separate models for each city due to distinct spatial representations and geographic coverage. In this paper, we propose UniMove, a unified model for multi-city human mobility prediction, addressing two challenges: (1) constructing universal spatial representations for effective token sharing across cities, and (2) modeling heterogeneous mobility patterns from varying city characteristics. We propose a trajectory-location dual-tower architecture, with a location tower for universal spatial encoding and a trajectory tower for sequential mobility modeling. We also design MoE Transformer blocks to adaptively select experts to handle diverse movement patterns. Extensive experiments across multiple datasets from diverse cities demonstrate that UniMove truly embodies the essence of a unified model. By enabling joint training on multi-city data with mutual data enhancement, it significantly improves mobility prediction accuracy by over 10.2%. UniMove represents a key advancement toward realizing a true foundational model with a unified architecture for human mobility. We release the implementation at https://github.com/tsinghua-fib-lab/UniMove/.

[935] Tight Bounds for Schrödinger Potential Estimation in Unpaired Data Translation

Nikita Puchkin, Denis Suchkov, Alexey Naumov, Denis Belomestny

Main category: cs.LG

TL;DR: The paper develops a method for generative modeling and unpaired data translation using Schrödinger bridges and stochastic optimal control, with theoretical guarantees on generalization bounds for empirical risk minimization.

Details

Motivation: To address generative modeling and unpaired data translation problems where only i.i.d. samples from initial and final distributions are available, using optimal transport theory.

Method: Uses stochastic optimal control with Ornstein-Uhlenbeck process as reference, estimates Schrödinger potential, and employs empirical risk minimization with Kullback-Leibler divergence as risk function.

Result: Derives tight generalization bounds for empirical risk minimizer in Schrödinger potential classes including Gaussian mixtures, achieving fast convergence rates up to logarithmic factors due to Ornstein-Uhlenbeck mixing properties.

Conclusion: The proposed approach provides theoretically grounded framework for generative modeling and data translation with strong generalization guarantees, supported by numerical experiments.

Abstract: Modern methods of generative modelling and unpaired data translation based on Schrödinger bridges and stochastic optimal control theory aim to transform an initial density to a target one in an optimal way. In the present paper, we assume that we only have access to i.i.d. samples from initial and final distributions. This makes our setup suitable for both generative modelling and unpaired data translation. Relying on the stochastic optimal control approach, we choose an Ornstein-Uhlenbeck process as the reference one and estimate the corresponding Schrödinger potential. Introducing a risk function as the Kullback-Leibler divergence between couplings, we derive tight bounds on generalization ability of an empirical risk minimizer in a class of Schrödinger potentials including Gaussian mixtures. Thanks to the mixing properties of the Ornstein-Uhlenbeck process, we almost achieve fast rates of convergence up to some logarithmic factors in favourable scenarios. We also illustrate performance of the suggested approach with numerical experiments.

[936] MDNS: Masked Diffusion Neural Sampler via Stochastic Optimal Control

Yuchen Zhu, Wei Guo, Jaemoo Choi, Guan-Horng Liu, Yongxin Chen, Molei Tao

Main category: cs.LG

TL;DR: MDNS is a novel framework for training discrete neural samplers using masked diffusion and path measure alignment to generate samples from complex discrete distributions where the target probability is known up to a normalizing constant.

Details

Motivation: To address the challenging task of sampling from discrete state spaces with large cardinality and multi-modal distributions, which is important in statistical physics, machine learning, and combinatorial optimization.

Method: Proposes Masked Diffusion Neural Sampler (MDNS) that aligns two path measures through learning objectives grounded in stochastic optimal control of continuous-time Markov chains.

Result: MDNS learns to accurately sample from target distributions despite extremely high problem dimensions and outperforms other learning-based baselines by a large margin across various distributions with distinct statistical properties.

Conclusion: The framework demonstrates efficiency, scalability, and potential through comprehensive ablations and extensions, providing an effective solution for discrete neural sampling in high-dimensional multi-modal settings.

Abstract: We study the problem of learning a neural sampler to generate samples from discrete state spaces where the target probability mass function $π\propto\mathrm{e}^{-U}$ is known up to a normalizing constant, which is an important task in fields such as statistical physics, machine learning, combinatorial optimization, etc. To better address this challenging task when the state space has a large cardinality and the distribution is multi-modal, we propose $\textbf{M}$asked $\textbf{D}$iffusion $\textbf{N}$eural $\textbf{S}$ampler ($\textbf{MDNS}$), a novel framework for training discrete neural samplers by aligning two path measures through a family of learning objectives, theoretically grounded in the stochastic optimal control of the continuous-time Markov chains. We validate the efficiency and scalability of MDNS through extensive experiments on various distributions with distinct statistical properties, where MDNS learns to accurately sample from the target distributions despite the extremely high problem dimensions and outperforms other learning-based baselines by a large margin. A comprehensive study of ablations and extensions is also provided to demonstrate the efficacy and potential of the proposed framework. Our code is available at https://github.com/yuchen-zhu-zyc/MDNS.

[937] DPCformer: An Interpretable Deep Learning Model for Genomic Prediction in Crops

Pengcheng Deng, Kening Liu, Mengxi Zhou, Mingxi Li, Rui Yang, Chuzhe Cao, Maojun Wang, Zeyu Zhang

Main category: cs.LG

TL;DR: DPCformer is a deep learning model combining CNNs and self-attention that significantly improves genomic selection accuracy across multiple crops, especially in small-sample scenarios.

Details

Motivation: Traditional genomic selection methods struggle with prediction accuracy for complex traits and large datasets, limiting breeding efficiency.

Method: DPCformer integrates convolutional neural networks with self-attention mechanism, uses 8-dimensional one-hot encoding for SNP data ordered by chromosome, and employs PMF algorithm for feature selection.

Result: DPCformer outperformed existing methods across 13 traits in 5 crops: maize (up to 2.92% accuracy improvement), cotton (up to 8.37% gains), tomato (up to 57.35% PCC increase), and chickpea (16.62% yield correlation boost).

Conclusion: DPCformer demonstrates superior accuracy, robustness in small-sample scenarios, and enhanced interpretability, providing a powerful tool for precision breeding and addressing global food security challenges.

Abstract: Genomic Selection (GS) uses whole-genome information to predict crop phenotypes and accelerate breeding. Traditional GS methods, however, struggle with prediction accuracy for complex traits and large datasets. We propose DPCformer, a deep learning model integrating convolutional neural networks with a self-attention mechanism to model complex genotype-phenotype relationships. We applied DPCformer to 13 traits across five crops (maize, cotton, tomato, rice, chickpea). Our approach uses an 8-dimensional one-hot encoding for SNP data, ordered by chromosome, and employs the PMF algorithm for feature selection. Evaluations show DPCformer outperforms existing methods. In maize datasets, accuracy for traits like days to tasseling and plant height improved by up to 2.92%. For cotton, accuracy gains for fiber traits reached 8.37%. On small-sample tomato data, the Pearson Correlation Coefficient for a key trait increased by up to 57.35%. In chickpea, the yield correlation was boosted by 16.62%. DPCformer demonstrates superior accuracy, robustness in small-sample scenarios, and enhanced interpretability, providing a powerful tool for precision breeding and addressing global food security challenges.

[938] Fairness-Aware Multi-view Evidential Learning with Adaptive Prior

Haishun Chen, Cai Xu, Jinlong Yu, Yilin Zhang, Ziyu Guan, Wei Zhao, Fangyuan Zhao, Xin Yang

Main category: cs.LG

TL;DR: FAML addresses biased evidence learning in multi-view evidential learning by introducing adaptive priors, fairness constraints, and opinion alignment to achieve balanced evidence allocation and reliable uncertainty estimation.

Details

Motivation: Traditional multi-view evidential learning assumes reliable view-specific evidence learning, but empirical analysis reveals bias where samples receive more evidence for data-rich classes, leading to unreliable uncertainty estimation.

Method: FAML uses adaptive priors based on training trajectory for regularization, fairness constraints on class-wise evidence variance, and opinion alignment mechanism to mitigate view-specific bias during multi-view fusion.

Result: Extensive experiments on five real-world datasets show FAML achieves more balanced evidence allocation and improves both prediction performance and uncertainty estimation reliability compared to state-of-the-art methods.

Conclusion: FAML effectively addresses biased evidence learning in multi-view settings through fairness-aware mechanisms, enhancing both prediction accuracy and uncertainty estimation trustworthiness.

Abstract: Multi-view evidential learning aims to integrate information from multiple views to improve prediction performance and provide trustworthy uncertainty esitimation. Most previous methods assume that view-specific evidence learning is naturally reliable. However, in practice, the evidence learning process tends to be biased. Through empirical analysis on real-world data, we reveal that samples tend to be assigned more evidence to support data-rich classes, thereby leading to unreliable uncertainty estimation in predictions. This motivates us to delve into a new Biased Evidential Multi-view Learning (BEML) problem. To this end, we propose Fairness-Aware Multi-view Evidential Learning (FAML). FAML first introduces an adaptive prior based on training trajectory, which acts as a regularization strategy to flexibly calibrate the biased evidence learning process. Furthermore, we explicitly incorporate a fairness constraint based on class-wise evidence variance to promote balanced evidence allocation. In the multi-view fusion stage, we propose an opinion alignment mechanism to mitigate view-specific bias across views, thereby encouraging the integration of consistent and mutually supportive evidence.Theoretical analysis shows that FAML enhances fairness in the evidence learning process. Extensive experiments on five real-world multi-view datasets demonstrate that FAML achieves more balanced evidence allocation and improves both prediction performance and the reliability of uncertainty estimation compared to state-of-the-art methods.

[939] Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach

Junchao Fan, Qi Wei, Ruichen Zhang, Dusit Niyato, Yang Lu, Jianhua Wang, Xiaolin Chang, Bo Ai

Main category: cs.LG

TL;DR: IGCARL is a robust autonomous driving approach that uses strategic adversarial training with constrained optimization to defend against multi-step attacks and prevent safety-critical events.

Details

Motivation: Existing robust DRL methods for autonomous driving are vulnerable to strategic adversarial attacks, struggle to cause safety-critical events, and suffer from training instability and policy drift.

Method: Proposes IGCARL with two components: (1) strategic targeted adversary using temporal decision-making for multi-step attacks with general-sum objectives, and (2) robust driving agent trained with constrained optimization to ensure stable learning.

Result: IGCARL improves success rate by at least 27.9% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks and enhanced safety.

Conclusion: IGCARL effectively addresses key limitations in robust autonomous driving by combining strategic adversarial training with constrained optimization, significantly improving robustness and safety against adversarial threats.

Abstract: Deep reinforcement learning (DRL) has demonstrated remarkable success in developing autonomous driving policies. However, its vulnerability to adversarial attacks remains a critical barrier to real-world deployment. Although existing robust methods have achieved success, they still suffer from three key issues: (i) these methods are trained against myopic adversarial attacks, limiting their abilities to respond to more strategic threats, (ii) they have trouble causing truly safety-critical events (e.g., collisions), but instead often result in minor consequences, and (iii) these methods can introduce learning instability and policy drift during training due to the lack of robust constraints. To address these issues, we propose Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent. The strategic targeted adversary is designed to leverage the temporal decision-making capabilities of DRL to execute strategically coordinated multi-step attacks. In addition, it explicitly focuses on inducing safety-critical events by adopting a general-sum objective. The robust driving agent learns by interacting with the adversary to develop a robust autonomous driving policy against adversarial attacks. To ensure stable learning in adversarial environments and to mitigate policy drift caused by attacks, the agent is optimized under a constrained formulation. Extensive experiments show that IGCARL improves the success rate by at least 27.9% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks and enhancing the safety and reliability of DRL-based autonomous driving.

[940] Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations

Maurizio Diaz

Main category: cs.LG

TL;DR: Cartridges enable 40x smaller KV cache for long-context LLMs by learning compressed representations, with keys acting as stable retrieval routers and values handling most compression.

Details

Motivation: Address the bottleneck of linearly growing KV cache in long-context LLM inference by developing more efficient compression methods.

Method: Mechanistic analysis of Cartridge KV cache structure, empirical validation across tasks and models, and proposed Sampled Chunk Initialization (SCI) for faster convergence.

Result: Cartridge keys function as stable, shareable retrieval routers while values handle most compression; SCI enables faster convergence than previous methods.

Conclusion: Cartridges provide efficient KV cache compression with mechanistic insights, laying groundwork for further optimization and scaling of long-context LLM inference.

Abstract: A bottleneck for long-context LLM inference is the linearly growing KV cache. Recent work has proposed Cartridges, an approach which leverages offline compute to train a much smaller KV cache than is typically required for a full document (up to 40x less memory usage at inference time). In this paper, we present the first mechanistic exploration of the learned Cartridge key-value cache structure. In particular, we propose that (1) Cartridge keys act as stable, shareable retrieval routers for the compressed corpora and (2) most of the learned compression occurs within the Cartridge value vectors. We present empirical evidence of our routing theory across tasks, model families, and model sizes; for example, we can ablate the learned Cartridge key vectors between tasks with little performance loss. Finally, we propose a slight improvement in initialization called Sampled Chunk Initialization (SCI). We suggest that SCI can lead to faster Cartridge convergence than previously demonstrated in the literature. Our findings lay the groundwork for broader empirical study of Cartridge training optimization which may be crucial for further scaling.

[941] A Graph Laplacian Eigenvector-based Pre-training Method for Graph Neural Networks

Howard Dai, Nyambura Njenga, Hiren Madhu, Siddharth Viswanath, Ryan Pellico, Ian Adelstein, Smita Krishnaswamy

Main category: cs.LG

TL;DR: Proposes LELM, a self-supervised graph pre-training module that predicts low-frequency Laplacian eigenvectors to capture global graph structure while preventing oversmoothing in deep GNNs.

Details

Motivation: Structure-based pre-training is under-explored but crucial for graph foundation models, and traditional GNNs struggle with capturing long-range dependencies due to oversmoothing in deep networks.

Method: Developed Laplacian Eigenvector Learning Module (LELM) that predicts low-frequency eigenvectors of graph Laplacian using a novel architecture designed to overcome oversmoothing.

Result: Models pre-trained with LELM outperform baseline models on downstream molecular property prediction tasks.

Conclusion: LELM provides an effective structure-based pre-training approach that enables GNNs to learn long-range interdependencies while avoiding oversmoothing issues.

Abstract: The development of self-supervised graph pre-training methods is a crucial ingredient in recent efforts to design robust graph foundation models (GFMs). Structure-based pre-training methods are under-explored yet crucial for downstream applications which rely on underlying graph structure. In addition, pre-training traditional message passing GNNs to capture global and regional structure is often challenging due to the risk of oversmoothing as network depth increases. We address these gaps by proposing the Laplacian Eigenvector Learning Module (LELM), a novel pre-training module for graph neural networks (GNNs) based on predicting the low-frequency eigenvectors of the graph Laplacian. Moreover, LELM introduces a novel architecture that overcomes oversmoothing, allowing the GNN model to learn long-range interdependencies. Empirically, we show that models pre-trained via our framework outperform baseline models on downstream molecular property prediction tasks.

[942] Learning from N-Tuple Data with M Positive Instances: Unbiased Risk Estimation and Theoretical Guarantees

Miao Zhang, Junpeng Li, ChangChun HUa, Yana Yang

Main category: cs.LG

TL;DR: The paper presents a method for weakly supervised learning where training examples are n-tuples containing exactly m positives, but only the count m is observed. The approach uses unbiased risk estimation to effectively learn from count-only supervision.

Details

Motivation: Weakly supervised learning often deals with coarse aggregate signals rather than instance labels. The NTMP (N-tuple with M positives) setting arises in practical scenarios like image classification with region proposals and multi-instance measurements where only tuple counts are available.

Method: The method derives a trainable unbiased risk estimator (URE) by linking tuple-generation to latent instance marginals. It handles fixed and variable tuple sizes/counts, uses ReLU corrections for finite-sample stability, and maintains asymptotic correctness.

Result: The approach consistently outperforms weak-supervision baselines across benchmarks converted to NTMP tasks, achieving favorable precision-recall and F1 trade-offs. It remains robust under class-prior imbalance and diverse tuple configurations.

Conclusion: Count-only supervision can be effectively exploited through a theoretically grounded and practically stable objective, demonstrating that tuple counts provide sufficient information for successful learning in weakly supervised settings.

Abstract: Weakly supervised learning often operates with coarse aggregate signals rather than instance labels. We study a setting where each training example is an $n$-tuple containing exactly m positives, while only the count m per tuple is observed. This NTMP (N-tuple with M positives) supervision arises in, e.g., image classification with region proposals and multi-instance measurements. We show that tuple counts admit a trainable unbiased risk estimator (URE) by linking the tuple-generation process to latent instance marginals. Starting from fixed (n,m), we derive a closed-form URE and extend it to variable tuple sizes, variable counts, and their combination. Identification holds whenever the effective mixing rate is separated from the class prior. We establish generalization bounds via Rademacher complexity and prove statistical consistency with standard rates under mild regularity assumptions. To improve finite-sample stability, we introduce simple ReLU corrections to the URE that preserve asymptotic correctness. Across benchmarks converted to NTMP tasks, the approach consistently outperforms representative weak-supervision baselines and yields favorable precision-recall and F1 trade-offs. It remains robust under class-prior imbalance and across diverse tuple configurations, demonstrating that count-only supervision can be exploited effectively through a theoretically grounded and practically stable objective.

[943] Dual-Branch Convolutional Framework for Spatial and Frequency-Based Image Forgery Detection

Naman Tyagi, Riya Jain

Main category: cs.LG

TL;DR: A dual-branch CNN combining spatial and frequency features for image forgery detection, achieving 77.9% accuracy on CASIA 2.0 dataset with balanced computational complexity.

Details

Motivation: Address the challenge of ensuring image authenticity amid rapid increase in deepfakes and digital image forgeries.

Method: Dual branch convolution neural network operating on spatial and frequency domain features, fused and compared within a Siamese network to produce 64-dimensional embeddings for classification.

Result: Achieves 77.9% accuracy on CASIA 2.0 dataset, outperforming traditional statistical methods while balancing computational complexity and detection reliability.

Conclusion: Provides a practical forgery detection framework ready for deployment, advancing visual forensics for media verification, law enforcement and digital content reliability.

Abstract: With a very rapid increase in deepfakes and digital image forgeries, ensuring the authenticity of images is becoming increasingly challenging. This report introduces a forgery detection framework that combines spatial and frequency-based features for detecting forgeries. We propose a dual branch convolution neural network that operates on features extracted from spatial and frequency domains. Features from both branches are fused and compared within a Siamese network, yielding 64 dimensional embeddings for classification. When benchmarked on CASIA 2.0 dataset, our method achieves an accuracy of 77.9%, outperforming traditional statistical methods. Despite its relatively weaker performance compared to larger, more complex forgery detection pipelines, our approach balances computational complexity and detection reliability, making it ready for practical deployment. It provides a strong methodology for forensic scrutiny of digital images. In a broader sense, it advances the state of the art in visual forensics, addressing an urgent requirement in media verification, law enforcement and digital content reliability.

[944] FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models

Junkang Liu, Fanhua Shang, Kewen Zhu, Hongying Liu, Yuanyuan Liu, Jin Liu

Main category: cs.LG

TL;DR: FedAdamW is a federated AdamW optimizer that addresses challenges of high variance in second-moment estimates, local overfitting, and slow convergence in federated learning by using local correction mechanisms and efficient aggregation of moment estimates.

Details

Motivation: Direct application of AdamW in federated learning faces challenges including high variance in second-moment estimates due to data heterogeneity, local overfitting causing client drift, and slow convergence from reinitializing moment estimates each round.

Method: FedAdamW uses local correction mechanisms and decoupled weight decay to align local updates with global updates, and efficiently aggregates the mean of second-moment estimates to reduce variance and reinitialize them properly.

Result: Theoretically achieves linear speedup convergence rate without heterogeneity assumption. Empirically validated on language and vision Transformer models, significantly reducing communication rounds and improving test accuracy compared to baselines.

Conclusion: FedAdamW effectively adapts AdamW for federated learning settings, addressing key challenges and demonstrating superior performance in both theoretical analysis and empirical validation.

Abstract: AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate $\boldsymbol{v}$; (2) the local overfitting of AdamW may cause client drift; and (3) Reinitializing moment estimates ($\boldsymbol{v}$, $\boldsymbol{m}$) at each round slows down convergence. To address these challenges, we propose the first \underline{Fed}erated \underline{AdamW} algorithm, called \texttt{FedAdamW}, for training and fine-tuning various large models. \texttt{FedAdamW} aligns local updates with the global update using both a \textbf{local correction mechanism} and decoupled weight decay to mitigate local overfitting. \texttt{FedAdamW} efficiently aggregates the \texttt{mean} of the second-moment estimates to reduce their variance and reinitialize them. Theoretically, we prove that \texttt{FedAdamW} achieves a linear speedup convergence rate of $\mathcal{O}(\sqrt{(L Δσ_l^2)/(S K R ε^2)}+(L Δ)/R)$ without \textbf{heterogeneity assumption}, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. We also employ PAC-Bayesian generalization analysis to explain the effectiveness of decoupled weight decay in local training. Empirically, we validate the effectiveness of \texttt{FedAdamW} on language and vision Transformer models. Compared to several baselines, \texttt{FedAdamW} significantly reduces communication rounds and improves test accuracy. The code is available in https://github.com/junkangLiu0/FedAdamW.

[945] An upper bound of the silhouette validation metric for clustering

Hugo Sträng, Tai Dinh

Main category: cs.LG

TL;DR: This paper derives sharp upper bounds for silhouette width and average silhouette width (ASW) to improve interpretability of clustering quality metrics by showing how close empirical results are to dataset-specific maximums.

Details

Motivation: The standard upper limit of 1 for ASW is rarely attainable, making it difficult to interpret empirical ASW values and determine how close clustering solutions are to optimal for a given dataset.

Method: The authors derive sharp upper bounds for individual silhouette widths and aggregate these to obtain canonical upper bounds on ASW, with extensions for minimum cluster-size constraints and macro-averaged silhouette.

Result: The method provides dataset-specific upper bounds that are often substantially below 1, enabling better interpretation of clustering quality by showing relative performance against the best possible outcome for that dataset.

Conclusion: These bounds enhance ASW interpretability, help confirm global optimality, guide clustering evaluation, and can be refined with practical constraints like minimum cluster sizes.

Abstract: The silhouette coefficient quantifies, for each observation, the balance between within-cluster cohesion and between-cluster separation, taking values in [-1, 1]. The average silhouette width (ASW) is a widely used internal measure of clustering quality, with higher values indicating more cohesive and well-separated clusters. However, the dataset-specific maximum of ASW is typically unknown, and the standard upper limit of 1 is rarely attainable. In this work, we derive for each data point a sharp upper bound on its silhouette width and aggregate these to obtain a canonical upper bound on the ASW. This bound-often substantially below 1-enhances the interpretability of empirical ASW values by indicating how close a given clustering result is to the best possible outcome on that dataset. It can be used to confirm global optimality, guide the evaluation of clustering solutions, and be refined to incorporate minimum cluster-size constraints for greater practical relevance. Finally, we extend the framework to establish a corresponding bound for the macro-averaged silhouette.

[946] Calibrating and Rotating: A Unified Framework for Weight Conditioning in PEFT

Da Chang, Peng Xue, Yu Li, Yongxiang Liu, Pengxiang Xu, Shixun Zhang

Main category: cs.LG

TL;DR: This paper analyzes DoRA’s mechanism, reformulates it into an efficient matrix form, and proposes two new PEFT methods (Pre-Diag and SORA) that outperform LoRA and DoRA in performance and efficiency.

Details

Motivation: To understand DoRA's underlying mechanism and address its computational overhead while developing more efficient and effective Parameter-Efficient Fine-Tuning methods.

Method: Identified DoRA’s mechanism (increased singular value entropy), reformulated it into an efficient matrix form, and proposed a unified framework with two novel methods: Pre-Diag (diagonal conditioning before LoRA) and SORA (parameter-efficient orthogonal rotation).

Result: Extensive experiments on natural language tasks show superior performance and efficiency compared to LoRA and DoRA.

Conclusion: The proposed Pre-Diag and SORA methods provide more effective and efficient alternatives to existing PEFT approaches, with SORA enabling powerful norm-preserving transformations.

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods are crucial for adapting large pre-trained models. Among these, LoRA is considered a foundational approach. Building on this, the influential DoRA method enhances performance by decomposing weight updates into magnitude and direction. However, its underlying mechanism remains unclear, and it introduces significant computational overhead. In this work, we first identify that DoRA’s success stems from its capacity to increase the singular value entropy of the weight update matrix, which promotes a more uniform update distribution akin to full fine-tuning. We then reformulate DoRA into a mathematically equivalent and more efficient matrix form, revealing it as a learnable weight conditioning method. Based on this insight, we propose a unified framework for designing advanced PEFT methods by exploring two orthogonal dimensions: the architectural placement and the transformation type of the conditioning matrix. Within this framework, we introduce two novel methods: (1) \textbf{Pre-Diag}, which applies a diagonal conditioning matrix before the LoRA update to efficiently calibrate the pre-trained weights, thereby enhancing performance while reducing training time; and (2) \textbf{S}kewed \textbf{O}rthogonal \textbf{R}otation \textbf{A}daptation (\textbf{SORA}), which employs a parameter-efficient orthogonal rotation to perform a more powerful, norm-preserving transformation of the feature space. Extensive experiments on natural language understanding and generation tasks demonstrate that our proposed methods achieve superior performance and efficiency compared to both LoRA and DoRA. The code is available at https://github.com/MaeChd/SORA.

[947] ChemBOMAS: Accelerated BO in Chemistry with LLM-Enhanced Multi-Agent System

Dong Han, Zhehong Ai, Pengxiang Cai, Shanya Lu, Jianpeng Chen, Zihao Ye, Shuzhou Sun, Ben Gao, Lingli Ge, Weida Wang, Xiangxin Zhou, Xihui Liu, Mao Su, Wanli Ouyang, Lei Bai, Dongzhan Zhou, Tao Xu, Yuqiang Li, Shufei Zhang

Main category: cs.LG

TL;DR: ChemBOMAS is an LLM-enhanced multi-agent system that accelerates Bayesian optimization in chemistry through data-driven pseudo-data generation and knowledge-driven search space partitioning, achieving 5x efficiency improvements.

Details

Motivation: Bayesian optimization in chemistry faces challenges from sparse experimental data and vast search spaces, limiting its efficiency for scientific discovery.

Method: Combines data-driven strategy (fine-tuning 8B LLM on 1% samples for pseudo-data generation) and knowledge-driven strategy (hybrid RAG for search space partitioning with UCB algorithm for subspace selection).

Result: Achieves up to 5-fold acceleration in optimization efficiency compared to baseline methods across multiple scientific benchmarks.

Conclusion: ChemBOMAS sets a new state-of-the-art for Bayesian optimization in chemistry by synergistically integrating LLM capabilities with optimization algorithms.

Abstract: Bayesian optimization (BO) is a powerful tool for scientific discovery in chemistry, yet its efficiency is often hampered by the sparse experimental data and vast search space. Here, we introduce ChemBOMAS: a large language model (LLM)-enhanced multi-agent system that accelerates BO through synergistic data- and knowledge-driven strategies. Firstly, the data-driven strategy involves an 8B-scale LLM regressor fine-tuned on a mere 1% labeled samples for pseudo-data generation, robustly initializing the optimization process. Secondly, the knowledge-driven strategy employs a hybrid Retrieval-Augmented Generation approach to guide LLM in dividing the search space while mitigating LLM hallucinations. An Upper Confidence Bound algorithm then identifies high-potential subspaces within this established partition. Across the LLM-refined subspaces and supported by LLM-generated data, BO achieves the improvement of effectiveness and efficiency. Comprehensive evaluations across multiple scientific benchmarks demonstrate that ChemBOMAS set a new state-of-the-art, accelerating optimization efficiency by up to 5-fold compared to baseline methods.

[948] Aligning Brain Signals with Multimodal Speech and Vision Embeddings

Kateryna Shapovalenko, Quentin Auster

Main category: cs.LG

TL;DR: The paper investigates which layers of pre-trained models (wav2vec2 and CLIP) best align with the brain’s layered processing of language, using EEG signals during speech perception.

Details

Motivation: To understand how the brain builds meaning through layers from raw acoustics to rich multimodal associations, inspired by the brain's processing of words like "house" which involves imagining walls, doors, and memories.

Method: Using EEG recorded during natural speech perception, the authors compare embeddings from wav2vec2 (sound to language) and CLIP (words to images) models. They evaluate alignment with brain activity using ridge regression and contrastive decoding, testing three strategies: individual layers, progressive concatenation, and progressive summation.

Result: The findings suggest that combining multimodal, layer-aware representations may improve alignment with brain activity during language understanding.

Conclusion: Combining multimodal, layer-aware representations brings us closer to decoding how the brain understands language as experience, not just as sound.

Abstract: When we hear the word “house”, we don’t just process sound, we imagine walls, doors, memories. The brain builds meaning through layers, moving from raw acoustics to rich, multimodal associations. Inspired by this, we build on recent work from Meta that aligned EEG signals with averaged wav2vec2 speech embeddings, and ask a deeper question: which layers of pre-trained models best reflect this layered processing in the brain? We compare embeddings from two models: wav2vec2, which encodes sound into language, and CLIP, which maps words to images. Using EEG recorded during natural speech perception, we evaluate how these embeddings align with brain activity using ridge regression and contrastive decoding. We test three strategies: individual layers, progressive concatenation, and progressive summation. The findings suggest that combining multimodal, layer-aware representations may bring us closer to decoding how the brain understands language, not just as sound, but as experience.

[949] FEDONet : Fourier-Embedded DeepONet for Spectrally Accurate Operator Learning

Arth Sojitra, Mrigank Dhingra, Omer San

Main category: cs.LG

TL;DR: Fourier-Embedded DeepONet (FEDONet) enhances neural operator learning by incorporating Fourier feature mappings in trunk networks, achieving superior accuracy across various PDE benchmarks.

Details

Motivation: Standard DeepONets with fully connected linear layers struggle to capture complex spatial structures in PDEs, limiting their effectiveness in operator learning.

Method: Introduces Fourier-Embedded trunk networks using random Fourier feature mappings to enrich spatial representation capabilities within the DeepONet architecture.

Result: FEDONet consistently outperforms traditional DeepONets across multiple PDE datasets (Poisson, Burgers’, Lorenz-63, Eikonal, Allen-Cahn, Kuramoto-Sivashinsky), with significant error reductions in chaotic and stiff systems.

Conclusion: Fourier embeddings effectively enhance neural operator learning, providing a robust methodology for PDE surrogate modeling with broad applicability.

Abstract: Deep Operator Networks (DeepONets) have recently emerged as powerful data-driven frameworks for learning nonlinear operators, particularly suited for approximating solutions to partial differential equations. Despite their promising capabilities, the standard implementation of DeepONets, which typically employs fully connected linear layers in the trunk network, can encounter limitations in capturing complex spatial structures inherent to various PDEs. To address this limitation, we introduce Fourier-Embedded trunk networks within the DeepONet architecture, leveraging random fourier feature mappings to enrich spatial representation capabilities. Our proposed Fourier-Embedded DeepONet, FEDONet demonstrates superior performance compared to the traditional DeepONet across a comprehensive suite of PDE-driven datasets, including the two-dimensional Poisson, Burgers’, Lorenz-63, Eikonal, Allen-Cahn, and the Kuramoto-Sivashinsky equation. FEDONet delivers consistently superior reconstruction accuracy across all benchmark PDEs, with particularly large relative $L^2$ error reductions observed in chaotic and stiff systems. This study highlights the effectiveness of Fourier embeddings in enhancing neural operator learning, offering a robust and broadly applicable methodology for PDE surrogate modeling.

[950] On the Convergence of Muon and Beyond

Da Chang, Yongxiang Liu, Ganzhao Yuan

Main category: cs.LG

TL;DR: This paper proves that Muon-MVR2, a momentum-based variance-reduced variant of the Muon optimizer, achieves optimal iteration complexity of Õ(T^{-1/3}) for stochastic non-convex optimization, matching the theoretical lower bound.

Details

Motivation: There is a significant gap between Muon optimizer's empirical success and theoretical understanding, with existing analyses showing only suboptimal iteration complexity of O(T^{-1/4}). The paper aims to explore the theoretical limits of the Muon framework.

Method: The authors analyze two momentum-based variance-reduced variants: Muon-MVR1 (one-batch) and Muon-MVR2 (two-batch), providing rigorous proofs of convergence properties and establishing last-iterate convergence guarantees under the Polyak-Łojasiewicz condition.

Result: Muon-MVR2 achieves the optimal iteration complexity of Õ(T^{-1/3}), matching the theoretical lower bound. Extensive experiments on CIFAR-10 and C4 benchmarks corroborate the theoretical findings on per-iteration convergence.

Conclusion: This work provides the first proof of optimality for a Muon-style optimizer and clarifies the path toward developing more practically efficient, accelerated variants.

Abstract: The Muon optimizer has demonstrated remarkable empirical success in handling matrix-structured parameters for training neural networks. However, a significant gap remains between its practical performance and theoretical understanding. Existing analyses show that the Muon variants achieve only a suboptimal iteration complexity of $\mathcal{O}(T^{-1/4})$ in stochastic non-convex settings, where $T$ denotes the number of iterations. To explore the theoretical limits of the Muon framework, we analyze two Momentum-based Variance-Reduced variants: a one-batch version (Muon-MVR1) and a two-batch version (Muon-MVR2). We provide the first rigorous proof that incorporating variance reduction enables Muon-MVR2 to attain the optimal iteration complexity of $\tilde{\mathcal{O}}(T^{-1/3})$, thereby matching the theoretical lower bound for this class of problems. Furthermore, our analysis establishes last-iterate convergence guarantees for Muon variants under the Polyak-Łojasiewicz (PŁ) condition. Extensive experiments on vision (CIFAR-10) and language (C4) benchmarks corroborate our theoretical findings on per-iteration convergence. Overall, this work offers the first proof of optimality for a Muon-style optimizer and clarifies the path toward developing more practically efficient, accelerated variants.

[951] Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch

Yirong Zeng, Xiao Ding, Yutai Hou, Yuxian Wang, Li Du, Juyi Dai, Qiuyang Ding, Duyu Tang, Dandan Tu, Weiwen Liu, Bing Qin, Ting Liu

Main category: cs.LG

TL;DR: The paper proposes a reinforcement learning approach with dynamic reward design to train LLMs for tool use without supervised fine-tuning, achieving significant performance improvements.

Details

Motivation: Current supervised fine-tuning methods struggle with generalization to unfamiliar tool-use scenarios, while RL shows promise for enhancing reasoning and generalization capabilities.

Method: Dynamic generalization-guided reward design for rule-based RL that shifts from exploratory to exploitative patterns, training Tool-Zero models directly from base models without post-training.

Result: Over 7% performance improvement compared to both SFT and RL-with-SFT models, with consistent gains across cross-dataset and intra-dataset evaluations.

Conclusion: Pure RL can effectively elicit LLMs’ intrinsic reasoning capabilities and enhance tool-agnostic generalization, demonstrating the effectiveness and robustness of the proposed methods.

Abstract: Training tool-augmented LLMs has emerged as a promising approach to enhancing language models’ capabilities for complex tasks. The current supervised fine-tuning paradigm relies on constructing extensive domain-specific datasets to train models. However, this approach often struggles to generalize effectively to unfamiliar or intricate tool-use scenarios. Recently, reinforcement learning (RL) paradigm can endow LLMs with superior reasoning and generalization abilities. In this work, we address a key question: Can the pure RL be used to effectively elicit a model’s intrinsic reasoning capabilities and enhance the tool-agnostic generalization? We propose a dynamic generalization-guided reward design for rule-based RL, which progressively shifts rewards from exploratory to exploitative tool-use patterns. Based on this design, we introduce the Tool-Zero series models. These models are trained to enable LLMs to autonomously utilize general tools by directly scaling up RL from Zero models (i.e., base models without post-training). Experimental results demonstrate that our models achieve over 7% performance improvement compared to both SFT and RL-with-SFT models under the same experimental settings. These gains are consistently replicated across cross-dataset and intra-dataset evaluations, validating the effectiveness and robustness of our methods.

[952] Demystifying Network Foundation Models

Sylee Beltiukov, Satyandra Guthula, Wenbo Guo, Walter Willinger, Arpit Gupta

Main category: cs.LG

TL;DR: Systematic analysis of Network Foundation Models (NFMs) reveals significant limitations in latent knowledge encoding, including anisotropy, inconsistent feature sensitivity, and context separation issues, with fixes improving performance by up to +0.35 F1 score.

Details

Motivation: To investigate the latent knowledge encoded within Network Foundation Models beyond just downstream task performance, focusing on hidden representations analysis through a comprehensive three-part evaluation framework.

Method: Three-part evaluation: Embedding Geometry Analysis (representation space utilization), Metric Alignment Assessment (correspondence with domain-expert features), and Causal Sensitivity Testing (robustness to protocol perturbations). Evaluated four state-of-the-art NFMs using five diverse network datasets.

Result: All NFMs exhibited significant anisotropy, inconsistent feature sensitivity patterns, inability to separate high-level context, payload dependency, and other limitations. Addressing these limitations improved model performance by up to +0.35 F1 score without architectural changes.

Conclusion: The systematic analysis reveals fundamental limitations in current NFMs’ latent knowledge encoding, demonstrating that addressing identified issues can significantly enhance model performance and robustness.

Abstract: This work presents a systematic investigation into the latent knowledge encoded within Network Foundation Models (NFMs) that focuses on hidden representations analysis rather than pure downstream task performance. Different from existing efforts, we analyze the models through a three-part evaluation: Embedding Geometry Analysis to assess representation space utilization, Metric Alignment Assessment to measure correspondence with domain-expert features, and Causal Sensitivity Testing to evaluate robustness to protocol perturbations. Using five diverse network datasets spanning controlled and real-world environments, we evaluate four state-of-the-art NFMs, revealing that they all exhibit significant anisotropy, inconsistent feature sensitivity patterns, an inability to separate the high-level context, payload dependency, and other properties. Our work identifies numerous limitations across all models and demonstrates that addressing them can significantly improve model performance (by up to +0.35 $F_1$ score without architectural changes).

[953] RobustFSM: Submodular Maximization in Federated Setting with Malicious Clients

Duc A. Tran, Dung Truong, Duy Le

Main category: cs.LG

TL;DR: RobustFSM is a federated submodular maximization solution that protects against client attacks while maintaining data privacy and autonomy.

Details

Motivation: To address vulnerabilities in federated submodular maximization where malicious clients can share fake information, similar to backdoor attacks in federated learning.

Method: Proposed RobustFSM algorithm that provides robustness against various practical client attacks in the federated setting.

Result: RobustFSM substantially outperforms conventional federated algorithms under severe attacks, with improvements up to 200% depending on dataset and attack scenarios.

Conclusion: RobustFSM effectively addresses security vulnerabilities in federated submodular maximization while maintaining performance quality.

Abstract: Submodular maximization is an optimization problem benefiting many machine learning applications, where we seek a small subset best representing an extremely large dataset. We focus on the federated setting where the data are locally owned by decentralized clients who have their own definitions for the quality of representability. This setting requires repetitive aggregation of local information computed by the clients. While the main motivation is to respect the privacy and autonomy of the clients, the federated setting is vulnerable to client misbehaviors: malicious clients might share fake information. An analogy is backdoor attack in conventional federated learning, but our challenge differs freshly due to the unique characteristics of submodular maximization. We propose RobustFSM, a federated submodular maximization solution that is robust to various practical client attacks. Its performance is substantiated with an empirical evaluation study using real-world datasets. Numerical results show that the solution quality of RobustFSM substantially exceeds that of the conventional federated algorithm when attacks are severe. The degree of this improvement depends on the dataset and attack scenarios, which can be as high as 200%

[954] Private Online Learning against an Adaptive Adversary: Realizable and Agnostic Settings

Bo Li, Wei Wang, Peng Ye

Main category: cs.LG

TL;DR: This paper presents improved algorithms for private online learning, achieving optimal O_d(log T) mistake bound against adaptive adversaries in the realizable setting and sublinear regret in the agnostic setting for Littlestone classes.

Details

Motivation: Prior work achieved O_d(log T) mistake bound only against oblivious adversaries, leaving a gap with suboptimal O_d(√T) bound against adaptive adversaries. The authors aim to close this gap and extend results to the more challenging agnostic setting.

Method: The authors develop new algorithms for private online learning that maintain differential privacy while achieving improved performance bounds. They specifically address both realizable and agnostic settings.

Result: The main results are: (1) O_d(log T) mistake bound against adaptive adversaries in the realizable setting, closing the gap from prior work; (2) O_d(√T) regret bound in the agnostic setting for Littlestone classes.

Conclusion: The work establishes that concept classes with finite Littlestone dimension are privately online learnable with optimal bounds against adaptive adversaries in the realizable setting and with sublinear regret in the agnostic setting.

Abstract: We revisit the problem of private online learning, in which a learner receives a sequence of $T$ data points and has to respond at each time-step a hypothesis. It is required that the entire stream of output hypotheses should satisfy differential privacy. Prior work of Golowich and Livni [2021] established that every concept class $\mathcal{H}$ with finite Littlestone dimension $d$ is privately online learnable in the realizable setting. In particular, they proposed an algorithm that achieves an $O_{d}(\log T)$ mistake bound against an oblivious adversary. However, their approach yields a suboptimal $\tilde{O}{d}(\sqrt{T})$ bound against an adaptive adversary. In this work, we present a new algorithm with a mistake bound of $O{d}(\log T)$ against an adaptive adversary, closing this gap. We further investigate the problem in the agnostic setting, which is more general than the realizable setting as it does not impose any assumptions on the data. We give an algorithm that obtains a sublinear regret of $\tilde{O}_d(\sqrt{T})$ for generic Littlestone classes, demonstrating that they are also privately online learnable in the agnostic setting.

[955] Rectifying Regression in Reinforcement Learning

Alex Ayoub, David Szepesvári, Alireza Bakhtiari, Csaba Szepesvári, Dale Schuurmans

Main category: cs.LG

TL;DR: Analysis shows mean absolute error is better than mean squared error for controlling policy suboptimality in value-based RL, with cross-entropy losses outperforming squared loss.

Details

Motivation: To investigate how different loss functions in value-based reinforcement learning methods affect the quality of learned policies through analysis of underlying prediction objectives.

Method: Theoretical analysis comparing mean absolute error vs mean squared error as prediction objectives, and empirical evaluation of different loss functions (binary/categorical cross-entropy vs squared loss) in linear reinforcement learning.

Result: Mean absolute error is theoretically superior to mean squared error for controlling policy suboptimality gap. Cross-entropy losses aligned with mean absolute error outperform squared loss aligned with mean squared error in empirical evaluations.

Conclusion: Loss function choice significantly impacts RL performance, with cross-entropy losses being better aligned with mean absolute error objectives and yielding superior results compared to traditional squared loss approaches.

Abstract: This paper investigates the impact of the loss function in value-based methods for reinforcement learning through an analysis of underlying prediction objectives. We theoretically show that mean absolute error is a better prediction objective than the traditional mean squared error for controlling the learned policy’s suboptimality gap. Furthermore, we present results that different loss functions are better aligned with these different regression objectives: binary and categorical cross-entropy losses with the mean absolute error and squared loss with the mean squared error. We then provide empirical evidence that algorithms minimizing these cross-entropy losses can outperform those based on the squared loss in linear reinforcement learning.

[956] Variational Diffusion Unlearning: A Variational Inference Framework for Unlearning in Diffusion Models under Data Constraints

Subhodip Panda, MS Varun, Shreyans Jain, Sarthak Kumar Maharana, Prathosh A. P

Main category: cs.LG

TL;DR: VDU is a computationally efficient machine unlearning method for diffusion models that works in data-constrained settings, requiring only a subset of undesired training data to prevent generation of unwanted outputs while maintaining image quality.

Details

Motivation: To enable safe deployment of diffusion models by preventing generation of undesired, violent, and obscene outputs, especially in data-constrained settings where full training datasets are inaccessible.

Method: Variational Diffusion Unlearning (VDU) uses variational inference with a loss function containing plasticity inducer (reduces log-likelihood of undesired data) and stability regularizer (preserves image generation quality by regularizing in parameter space).

Result: Effective class unlearning from MNIST, CIFAR-10, and tinyImageNet datasets using DDPM, and feature unlearning from Stable Diffusion model, demonstrating prevention of unwanted outputs while maintaining generation quality.

Conclusion: VDU provides an effective solution for machine unlearning in diffusion models under data constraints, enabling responsible deployment by selectively forgetting undesired features without requiring full training dataset access.

Abstract: For a responsible and safe deployment of diffusion models in various domains, regulating the generated outputs from these models is desirable because such models could generate undesired, violent, and obscene outputs. To tackle this problem, recent works use machine unlearning methodology to forget training data points containing these undesired features from pre-trained generative models. However, these methods proved to be ineffective in data-constrained settings where the whole training dataset is inaccessible. Thus, the principal objective of this work is to propose a machine unlearning methodology that can prevent the generation of outputs containing undesired features from a pre-trained diffusion model in such a data-constrained setting. Our proposed method, termed as Variational Diffusion Unlearning (VDU), is a computationally efficient method that only requires access to a subset of training data containing undesired features. Our approach is inspired by the variational inference framework with the objective of minimizing a loss function consisting of two terms: plasticity inducer and stability regularizer. Plasticity inducer reduces the log-likelihood of the undesired training data points, while the stability regularizer, essential for preventing loss of image generation quality, regularizes the model in parameter space. We validate the effectiveness of our method through comprehensive experiments for both class unlearning and feature unlearning. For class unlearning, we unlearn some user-identified classes from MNIST, CIFAR-10, and tinyImageNet datasets from a pre-trained unconditional denoising diffusion probabilistic model (DDPM). Similarly, for feature unlearning, we unlearn the generation of certain high-level features from a pre-trained Stable Diffusion model

[957] Lagrangian neural ODEs: Measuring the existence of a Lagrangian with Helmholtz metrics

Luca Wolf, Tobias Buck, Bjoern Malte Schaefer

Main category: cs.LG

TL;DR: The paper introduces Helmholtz metrics to quantify how closely neural ODE solutions resemble physical Euler-Lagrange equations, and presents Lagrangian neural ODEs that can learn Euler-Lagrange equations directly from positional data.

Details

Motivation: Neural ODEs are powerful but not all solutions are physical Euler-Lagrange equations, limiting their applicability in physics contexts where physical consistency is required.

Method: Developed Helmholtz metrics to measure resemblance to Euler-Lagrange equations, combined with second-order neural ODEs to create Lagrangian neural ODEs that learn Euler-Lagrange equations directly.

Result: The approach successfully distinguishes Lagrangian from non-Lagrangian systems using only positional data, improves neural ODE solutions, and maintains zero additional inference cost.

Conclusion: Lagrangian neural ODEs provide a physically consistent framework for learning Euler-Lagrange equations directly, enhancing the applicability of neural ODEs in physics while maintaining computational efficiency.

Abstract: Neural ODEs are a widely used, powerful machine learning technique in particular for physics. However, not every solution is physical in that it is an Euler-Lagrange equation. We present Helmholtz metrics to quantify this resemblance for a given ODE and demonstrate their capabilities on several fundamental systems with noise. We combine them with a second order neural ODE to form a Lagrangian neural ODE, which allows to learn Euler-Lagrange equations in a direct fashion and with zero additional inference cost. We demonstrate that, using only positional data, they can distinguish Lagrangian and non-Lagrangian systems and improve the neural ODE solutions.

[958] Hierarchical Bayesian Flow Networks for Molecular Graph Generation

Yida Xiong, Jiameng Chen, Kun Li, Hongzhi Zhang, Xiantao Cai, Wenbin Hu

Main category: cs.LG

TL;DR: GraphBFN is a novel molecular graph generation framework that addresses the training-inference discrepancy in continuous diffusion models by using Bayesian Flow Networks and Cumulative Distribution Functions to unify training objectives with sampling operations.

Details

Motivation: Current molecular graph generation methods using continuous diffusion models have a fundamental limitation - they train for regression but require rounding for discrete classification during inference, causing training-inference mismatch and reduced molecular diversity.

Method: Proposed GraphBFN, a hierarchical coarse-to-fine framework based on Bayesian Flow Networks that operates on distribution parameters. It introduces Cumulative Distribution Function to calculate category probabilities, unifying training with sampling rounding.

Result: GraphBFN achieves superior performance and faster generation, setting new state-of-the-art results on QM9 and ZINC250k molecular graph generation benchmarks.

Conclusion: The proposed method successfully addresses the fundamental limitation of training-inference discrepancy in molecular graph generation, enabling more efficient learning and better generalization capabilities.

Abstract: Molecular graph generation is essentially a classification generation problem, aimed at predicting categories of atoms and bonds. Currently, prevailing paradigms such as continuous diffusion models are trained to predict continuous numerical values, treating the training process as a regression task. However, the final generation necessitates a rounding step to convert these predictions back into discrete classification categories, which is intrinsically a classification operation. Given that the rounding operation is not incorporated during training, there exists a significant discrepancy between the model’s training objective and its inference procedure. As a consequence, an excessive emphasis on point-wise precision can lead to overfitting and inefficient learning. This occurs because considerable efforts are devoted to capturing intra-bin variations that are ultimately irrelevant to the discrete nature of the task at hand. Such a flaw results in diminished molecular diversity and constrains the model’s generalization capabilities. To address this fundamental limitation, we propose GraphBFN, a novel hierarchical coarse-to-fine framework based on Bayesian Flow Networks that operates on the parameters of distributions. By innovatively introducing Cumulative Distribution Function, GraphBFN is capable of calculating the probability of selecting the correct category, thereby unifying the training objective with the sampling rounding operation. We demonstrate that our method achieves superior performance and faster generation, setting new state-of-the-art results on the QM9 and ZINC250k molecular graph generation benchmarks.

[959] The Evolving Nature of Latent Spaces: From GANs to Diffusion

Ludovica Schaerf

Main category: cs.LG

TL;DR: The paper analyzes how generative visual models’ internal representations have evolved, distinguishing between compact latent space synthesis (GANs/VAEs) and distributed layer-wise synthesis (diffusion models), arguing for rethinking AI as emergent specialized processes rather than direct content synthesis.

Details

Motivation: To examine the conceptual shift in generative models from unified latent spaces to distributed representations, challenging traditional assumptions about internal model spaces and synthesis processes.

Method: Close readings of model architectures and targeted experimental interventions in layerwise representations of diffusion models to analyze how representational labor is distributed across layers.

Result: Diffusion models fragment the burden of representation across layers, challenging the assumption of unified internal space and demonstrating distributed representational labor.

Conclusion: Generative AI should be understood not as direct synthesis of content but as emergent configuration of specialized processes, requiring reorientation of how we conceptualize internal representations in AI systems.

Abstract: This paper examines the evolving nature of internal representations in generative visual models, focusing on the conceptual and technical shift from GANs and VAEs to diffusion-based architectures. Drawing on Beatrice Fazi’s account of synthesis as the amalgamation of distributed representations, we propose a distinction between “synthesis in a strict sense”, where a compact latent space wholly determines the generative process, and “synthesis in a broad sense,” which characterizes models whose representational labor is distributed across layers. Through close readings of model architectures and a targeted experimental setup that intervenes in layerwise representations, we show how diffusion models fragment the burden of representation and thereby challenge assumptions of unified internal space. By situating these findings within media theoretical frameworks and critically engaging with metaphors such as the latent space and the Platonic Representation Hypothesis, we argue for a reorientation of how generative AI is understood: not as a direct synthesis of content, but as an emergent configuration of specialized processes.

[960] Alternative Fairness and Accuracy Optimization in Criminal Justice

Shaolong Wu, James Blume, Geshi Yeung

Main category: cs.LG

TL;DR: Proposes a modified group fairness approach that minimizes weighted error loss while keeping false negative rate differences within tolerance, addressing algorithmic fairness challenges in criminal justice.

Details

Motivation: Address unsettled concepts in algorithmic fairness, especially in criminal justice, where group, individual, and process fairness often conflict.

Method: Develop a simple modification to standard group fairness: minimize weighted error loss while constraining false negative rate differences within small tolerance.

Result: Makes solutions easier to find, can improve predictive accuracy, and highlights ethical choices in error cost allocation.

Conclusion: Proposes a practical deployment framework with three pillars: need-based decisions, transparency/accountability, and narrowly tailored definitions/solutions to link technical design with legitimacy.

Abstract: Algorithmic fairness has grown rapidly as a research area, yet key concepts remain unsettled, especially in criminal justice. We review group, individual, and process fairness and map the conditions under which they conflict. We then develop a simple modification to standard group fairness. Rather than exact parity across protected groups, we minimize a weighted error loss while keeping differences in false negative rates within a small tolerance. This makes solutions easier to find, can raise predictive accuracy, and surfaces the ethical choice of error costs. We situate this proposal within three classes of critique: biased and incomplete data, latent affirmative action, and the explosion of subgroup constraints. Finally, we offer a practical framework for deployment in public decision systems built on three pillars: need-based decisions, Transparency and accountability, and narrowly tailored definitions and solutions. Together, these elements link technical design to legitimacy and provide actionable guidance for agencies that use risk assessment and related tools.

[961] Local properties of neural networks through the lens of layer-wise Hessians

Maxim Bolshim, Alexander Kugaevskikh

Main category: cs.LG

TL;DR: Analysis of neural networks using layer-wise Hessian matrices to study local geometry, revealing patterns related to overfitting, underparameterization, and expressivity through spectral properties.

Details

Motivation: To develop a formal methodology for analyzing neural networks by examining the local geometry of parameter space through Hessian matrices, connecting optimization geometry with functional behavior.

Method: Define local Hessian for each layer as second derivatives matrix, analyze spectral properties (eigenvalue distributions), conduct 111 experiments across 37 datasets to study evolution during training.

Result: Consistent structural regularities in local Hessians during training, correlations between Hessian spectra and generalization performance, patterns associated with overfitting and underparameterization.

Conclusion: Local geometric analysis through Hessians provides foundation for diagnosing and designing neural networks, connecting optimization geometry with functional behavior for improved architectures and training stability.

Abstract: We introduce a methodology for analyzing neural networks through the lens of layer-wise Hessian matrices. The local Hessian of each functional block (layer) is defined as the matrix of second derivatives of a scalar function with respect to the parameters of that layer. This concept provides a formal tool for characterizing the local geometry of the parameter space. We show that the spectral properties of local Hessians, such as the distribution of eigenvalues, reveal quantitative patterns associated with overfitting, underparameterization, and expressivity in neural network architectures. We conduct an extensive empirical study involving 111 experiments across 37 datasets. The results demonstrate consistent structural regularities in the evolution of local Hessians during training and highlight correlations between their spectra and generalization performance. These findings establish a foundation for using local geometric analysis to guide the diagnosis and design of deep neural networks. The proposed framework connects optimization geometry with functional behavior and offers practical insight for improving network architectures and training stability.

[962] Addressing divergent representations from causal interventions on neural networks

Satchel Grant, Simon Jerome Han, Alexa R. Tartaglini, Christopher Potts

Main category: cs.LG

TL;DR: Causal interventions in mechanistic interpretability often create out-of-distribution representations, potentially compromising the faithfulness of explanations. The paper identifies harmless vs pernicious divergences and proposes a modified regularization method to mitigate harmful effects.

Details

Motivation: To investigate whether causal interventions in mechanistic interpretability create out-of-distribution representations that may undermine the faithfulness of explanations to the model's natural state.

Method: Empirical demonstration of distribution shifts from interventions, theoretical analysis of divergence types (harmless vs pernicious), and modification of Counterfactual Latent loss for regularization to keep interventions closer to natural distributions.

Result: Common causal intervention techniques frequently shift internal representations away from natural distributions. The modified CL loss successfully reduces harmful divergences while maintaining interpretive power.

Conclusion: The findings highlight the need for more reliable interpretability methods and provide a path forward through regularization techniques that preserve intervention effectiveness while minimizing distributional divergence.

Abstract: A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two classes of such divergences: “harmless” divergences that occur in the null-space of the weights and from covariance within behavioral decision boundaries, and “pernicious” divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we modify the Counterfactual Latent (CL) loss from Grant (2025) that regularizes interventions to remain closer to the natural distributions, reducing the likelihood of harmful divergences while preserving the interpretive power of interventions. Together, these results highlight a path towards more reliable interpretability methods.

[963] CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training

Soroush Tabesh, Mher Safaryan, Andrei Panferov, Alexandra Volkova, Dan Alistarh

Main category: cs.LG

TL;DR: CAGE introduces curvature-aware gradient estimation to improve quantization-aware training, reducing accuracy gap between quantized and native training by half compared to prior methods.

Details

Motivation: To address the persistent accuracy gap between low-bit quantization-aware training and native training methods.

Method: Augments straight-through estimator gradient with curvature-aware correction derived from multi-objective optimization, balancing loss minimization with quantization constraints using local curvature information.

Result: Halves compression accuracy loss in QAT fine-tuning and enables 3-bit weights-and-activations to match 4-bit accuracy of prior methods in Llama pre-training.

Conclusion: CAGE provides a principled, optimizer-agnostic approach that significantly advances quantization-aware training with strong theoretical guarantees and practical efficiency.

Abstract: Despite significant work on low-bit quantization-aware training (QAT), there is still an accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with the quantization constraints, yielding a principled correction term that depends on local curvature information. On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly-efficient implementation that leverages Adam statistics. CAGE significantly improves upon the prior state-of-the-art methods in terms of accuracy, for similar computational cost: for QAT fine-tuning, it halves the compression accuracy loss relative to the prior best method, while for QAT pre-training of Llama models, its accuracy for 3-bit weights-and-activations (W3A3) matches the accuracy achieved at 4-bits (W4A4) with the prior best method. The official implementation can be found over https://github.com/IST-DASLab/CAGE .

[964] Policy Learning with Abstention

Ayush Sawarni, Jikai Jin, Justin Whitehouse, Vasilis Syrgkanis

Main category: cs.LG

TL;DR: Policy learning with abstention allows deferring to safe defaults when uncertain, providing O(1/n) regret guarantees and applications in margin conditions, distributional robustness, and safe policy improvement.

Details

Motivation: Most policy learning methods force decisions even when uncertain, which is risky in high-stakes settings like personalized medicine and advertising. Abstention provides a safer alternative by allowing deferral to experts or default options.

Method: Propose a two-stage learner: first identify near-optimal policies, then construct abstention rules from their disagreements. Use doubly robust objective for unknown propensities and establish theoretical guarantees.

Result: Achieve fast O(1/n)-type regret guarantees when propensities are known, extend to unknown-propensity case via doubly robust methods. Abstention enables improved performance under margin conditions without realizability, connects to distributionally robust learning, and supports safe policy improvement.

Conclusion: Abstention is a versatile tool that enhances policy learning safety and performance across multiple domains, providing theoretical guarantees and practical benefits for high-stakes decision-making.

Abstract: Policy learning algorithms are widely used in areas such as personalized medicine and advertising to develop individualized treatment regimes. However, most methods force a decision even when predictions are uncertain, which is risky in high-stakes settings. We study policy learning with abstention, where a policy may defer to a safe default or an expert. When a policy abstains, it receives a small additive reward on top of the value of a random guess. We propose a two-stage learner that first identifies a set of near-optimal policies and then constructs an abstention rule from their disagreements. We establish fast O(1/n)-type regret guarantees when propensities are known, and extend these guarantees to the unknown-propensity case via a doubly robust (DR) objective. We further show that abstention is a versatile tool with direct applications to other core problems in policy learning: it yields improved guarantees under margin conditions without the common realizability assumption, connects to distributionally robust policy learning by hedging against small data shifts, and supports safe policy improvement by ensuring improvement over a baseline policy with high probability.

[965] Shift is Good: Mismatched Data Mixing Improves Test Performance

Marko Medvedev, Kaifeng Lyu, Zhiyuan Li, Nathan Srebro

Main category: cs.LG

TL;DR: Distribution shift between training and test mixtures can improve performance, even without transfer between components. The paper identifies optimal training proportions and analyzes when such shifts are beneficial.

Details

Motivation: To challenge the conventional wisdom that distribution shift between training and test data is always detrimental, and to explore scenarios where mismatched proportions can actually improve test performance.

Method: Theoretical analysis of mixture distributions with different training and test proportions, examining various scenarios to identify optimal training proportions and quantify the benefits of distribution shift.

Result: Distribution shift can be beneficial in many settings, even when components are unrelated with no transfer between them. The paper provides analytical results showing improved test performance due to mismatched training proportions.

Conclusion: Distribution shift between training and test data is not always harmful and can be strategically leveraged to improve performance, with applications extending to compositional settings with varying component skill distributions.

Abstract: We consider training and testing on mixture distributions with different training and test proportions. We show that in many settings, and in some sense generically, distribution shift can be beneficial, and test performance can improve due to mismatched training proportions, even if the components are unrelated and with no transfer between components. In a variety of scenarios, we identify the optimal training proportions and the extent to which such distribution shift can be beneficial. We show how the same analysis applies also to a compositional setting with differing distribution of component “skills’’ at training and test.

[966] Sensitivity Analysis for Climate Science with Generative Flow Models

Alex Dobra, Jakiw Pidstrigach, Tim Reichelt, Christian Schroeder de Witt, Philip Torr, Philip Stier

Main category: cs.LG

TL;DR: Applying adjoint state method to generative flow models for efficient sensitivity analysis in climate science, reducing computation time from weeks to hours.

Details

Motivation: Traditional physical models for climate sensitivity analysis are computationally expensive, while AI-based generative models lack efficient gradient computation methods.

Method: Applied adjoint state method to cBottle generative model trained on ERA5 and ICON data to compute sensitivities of atmospheric variables with respect to sea surface temperatures.

Result: Successfully computed reliable gradients, validated against model outputs, reducing computational cost from weeks on supercomputers to hours on GPUs.

Conclusion: This approach enables efficient sensitivity analysis in climate science, simplifying critical workflows while maintaining reliability.

Abstract: Sensitivity analysis is a cornerstone of climate science, essential for understanding phenomena ranging from storm intensity to long-term climate feedbacks. However, computing these sensitivities using traditional physical models is often prohibitively expensive in terms of both computation and development time. While modern AI-based generative models are orders of magnitude faster to evaluate, computing sensitivities with them remains a significant bottleneck. This work addresses this challenge by applying the adjoint state method for calculating gradients in generative flow models. We apply this method to the cBottle generative model, trained on ERA5 and ICON data, to perform sensitivity analysis of any atmospheric variable with respect to sea surface temperatures. We quantitatively validate the computed sensitivities against the model’s own outputs. Our results provide initial evidence that this approach can produce reliable gradients, reducing the computational cost of sensitivity analysis from weeks on a supercomputer with a physical model to hours on a GPU, thereby simplifying a critical workflow in climate science. The code can be found at https://github.com/Kwartzl8/cbottle_adjoint_sensitivity.

[967] EraseFlow: Learning Concept Erasure Policies via GFlowNet-Driven Alignment

Abhiram Kusumba, Maitreya Patel, Kyle Min, Changhoon Kim, Chitta Baral, Yezhou Yang

Main category: cs.LG

TL;DR: EraseFlow is a novel framework that uses GFlowNets to unlearn harmful concepts from text-to-image diffusion models by exploring denoising trajectories, achieving better performance and generalization than existing methods.

Details

Motivation: Current concept erasure techniques either degrade image quality, rely on fragile adversarial losses, or require extensive retraining, highlighting the need for a more robust approach to removing harmful or proprietary concepts from text-to-image generators.

Method: EraseFlow casts concept unlearning as exploration in denoising path space and optimizes it with GFlowNets using trajectory balance objective, sampling entire trajectories rather than single end states to learn a stochastic policy that steers generation away from target concepts.

Result: EraseFlow outperforms existing baselines, eliminates the need for carefully crafted reward models, generalizes effectively to unseen concepts, avoids hackable rewards, and achieves optimal trade-off between performance and prior preservation.

Conclusion: EraseFlow provides a superior approach to concept erasure in diffusion models by leveraging trajectory-based exploration with GFlowNets, offering improved performance, generalization, and robustness compared to existing methods.

Abstract: Erasing harmful or proprietary concepts from powerful text to image generators is an emerging safety requirement, yet current “concept erasure” techniques either collapse image quality, rely on brittle adversarial losses, or demand prohibitive retraining cycles. We trace these limitations to a myopic view of the denoising trajectories that govern diffusion based generation. We introduce EraseFlow, the first framework that casts concept unlearning as exploration in the space of denoising paths and optimizes it with GFlowNets equipped with the trajectory balance objective. By sampling entire trajectories rather than single end states, EraseFlow learns a stochastic policy that steers generation away from target concepts while preserving the model’s prior. EraseFlow eliminates the need for carefully crafted reward models and by doing this, it generalizes effectively to unseen concepts and avoids hackable rewards while improving the performance. Extensive empirical results demonstrate that EraseFlow outperforms existing baselines and achieves an optimal trade off between performance and prior preservation.

[968] Priors in Time: Missing Inductive Biases for Language Model Interpretability

Ekdeep Singh Lubana, Can Rager, Sai Sumedh R. Hindupur, Valerie Costa, Greta Tuckute, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Daniel Wurgaft, Eric J. Bigelow, Johnny Lin, Demba Ba, Martin Wattenberg, Fernanda Viegas, Melanie Weber, Aaron Mueller

Main category: cs.LG

TL;DR: SAEs assume independent concepts across time, but language has rich temporal dynamics. Temporal Feature Analysis decomposes representations into predictable context-based components and novel residual components, outperforming SAEs on temporal tasks.

Details

Motivation: Existing feature extraction methods like Sparse Autoencoders assume concept independence across time, which conflicts with the rich temporal structure and non-stationarity of language model representations.

Method: Introduce Temporal Feature Analysis with temporal inductive bias that decomposes representations into predictable components (inferred from context) and residual components (novel information unexplained by context).

Result: Temporal Feature Analyzers successfully parse garden path sentences, identify event boundaries, and delineate abstract slow-moving vs novel fast-moving information, while SAEs show significant pitfalls in these temporal tasks.

Conclusion: Interpretability tools need inductive biases that match the temporal dynamics of language data, as demonstrated by the superiority of temporal-aware methods over independence-assuming approaches like SAEs.

Abstract: Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions, it is unclear if this assumption can capture the rich temporal structure of language. Specifically, via a Bayesian lens, we demonstrate that Sparse Autoencoders (SAEs) impose priors that assume independence of concepts across time, implying stationarity. Meanwhile, language model representations exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with the priors of SAEs. Taking inspiration from computational neuroscience, we introduce a new interpretability objective – Temporal Feature Analysis – which possesses a temporal inductive bias to decompose representations at a given time into two parts: a predictable component, which can be inferred from the context, and a residual component, which captures novel information unexplained by the context. Temporal Feature Analyzers correctly parse garden path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, while existing SAEs show significant pitfalls in all the above tasks. Overall, our results underscore the need for inductive biases that match the data in designing robust interpretability tools.

[969] NOWS: Neural Operator Warm Starts for Accelerating Iterative Solvers

Mohammad Sadegh Eshaghi, Cosmin Anitescu, Navid Valizadeh, Yizheng Wang, Xiaoying Zhuang, Timon Rabczuk

Main category: cs.LG

TL;DR: NOWS uses neural operators to generate initial guesses for iterative PDE solvers, reducing iteration counts by up to 90% while maintaining solver guarantees.

Details

Motivation: High-fidelity PDE simulation is computationally expensive, and while data-driven surrogates are fast, they lack reliability outside training distributions.

Method: Hybrid approach combining neural operators with classical iterative solvers (Krylov methods) by using learned operators to produce high-quality initial guesses.

Result: Consistently reduces iteration counts and runtime by up to 90% across benchmarks while preserving numerical stability and convergence guarantees.

Conclusion: NOWS provides a practical and trustworthy approach to accelerate PDE simulations by combining neural operator speed with traditional solver rigor.

Abstract: Partial differential equations (PDEs) underpin quantitative descriptions across the physical sciences and engineering, yet high-fidelity simulation remains a major computational bottleneck for many-query, real-time, and design tasks. Data-driven surrogates can be strikingly fast but are often unreliable when applied outside their training distribution. Here we introduce Neural Operator Warm Starts (NOWS), a hybrid strategy that harnesses learned solution operators to accelerate classical iterative solvers by producing high-quality initial guesses for Krylov methods such as conjugate gradient and GMRES. NOWS leaves existing discretizations and solver infrastructures intact, integrating seamlessly with finite-difference, finite-element, isogeometric analysis, finite volume method, etc. Across our benchmarks, the learned initialization consistently reduces iteration counts and end-to-end runtime, resulting in a reduction of the computational time of up to 90 %, while preserving the stability and convergence guarantees of the underlying numerical algorithms. By combining the rapid inference of neural operators with the rigor of traditional solvers, NOWS provides a practical and trustworthy approach to accelerate high-fidelity PDE simulations.

[970] A Feedback-Control Framework for Efficient Dataset Collection from In-Vehicle Data Streams

Philipp Reis, Philipp Rigoll, Christian Steinhauser, Jacob Langner, Eric Sax

Main category: cs.LG

TL;DR: FCDC introduces a closed-loop control system for data collection that uses online probabilistic modeling and feedback signals to dynamically balance exploration/exploitation, maintain diversity, and prevent redundancy in datasets.

Details

Motivation: Current data collection methods are open-loop and accumulate redundant samples without feedback, leading to inefficient storage, costly labeling, and limited generalization in AI systems.

Method: Formulates data collection as closed-loop control using online probabilistic models to approximate data distribution state, with adaptive sample retention based on feedback signals like likelihood and Mahalanobis distance.

Result: On synthetic data, FCDC converges toward uniform distribution under Gaussian input; on real data streams, produces 25.9% more balanced datasets while reducing storage by 39.8%.

Conclusion: Data collection can be actively controlled, transforming it from a passive pipeline stage into a self-regulating, feedback-driven process at the core of data-centric AI.

Abstract: Modern AI systems are increasingly constrained not by model capacity but by the quality and diversity of their data. Despite growing emphasis on data-centric AI, most datasets are still gathered in an open-loop manner which accumulates redundant samples without feedback from the current coverage. This results in inefficient storage, costly labeling, and limited generalization. To address this, this paper introduces Feedback Control Data Collection (FCDC), a paradigm that formulates data collection as a closed-loop control problem. FCDC continuously approximates the state of the collected data distribution using an online probabilistic model and adaptively regulates sample retention using based on feedback signals such as likelihood and Mahalanobis distance. Through this feedback mechanism, the system dynamically balances exploration and exploitation, maintains dataset diversity, and prevents redundancy from accumulating over time. In addition to demonstrating the controllability of FCDC on a synthetic dataset that converges toward a uniform distribution under Gaussian input assumption, experiments on real data streams show that FCDC produces more balanced datasets by 25.9% while reducing data storage by 39.8%. These results demonstrate that data collection itself can be actively controlled, transforming collection from a passive pipeline stage into a self-regulating, feedback-driven process at the core of data-centric AI.

[971] Adaptive and Robust Data Poisoning Detection and Sanitization in Wearable IoT Systems using Large Language Models

W. K. M Mithsara, Ning Yang, Ahmed Imteaj, Hussein Zangoti, Abdur R. Shahid

Main category: cs.LG

TL;DR: Proposes an LLM-based framework for detecting and sanitizing data poisoning attacks in human activity recognition systems using zero-shot, one-shot, and few-shot learning with role play prompting and step-by-step reasoning.

Details

Motivation: Wearable IoT systems are vulnerable to data poisoning attacks that compromise data integrity, and conventional defenses require extensive labeled datasets which limit adaptability in dynamic environments.

Method: Uses LLMs with role play prompting (LLM acts as expert) and think step-by-step reasoning to detect poisoning indicators and generate clean alternatives in sensor data, minimizing reliance on large datasets.

Result: Extensive evaluation shows effective detection accuracy, sanitization quality, low latency, and reduced communication costs, demonstrating practical security improvements.

Conclusion: LLMs provide robust, adaptable defense mechanisms for wearable IoT systems, enhancing security and reliability through efficient poisoning detection and sanitization.

Abstract: The widespread integration of wearable sensing devices in Internet of Things (IoT) ecosystems, particularly in healthcare, smart homes, and industrial applications, has required robust human activity recognition (HAR) techniques to improve functionality and user experience. Although machine learning models have advanced HAR, they are increasingly susceptible to data poisoning attacks that compromise the data integrity and reliability of these systems. Conventional approaches to defending against such attacks often require extensive task-specific training with large, labeled datasets, which limits adaptability in dynamic IoT environments. This work proposes a novel framework that uses large language models (LLMs) to perform poisoning detection and sanitization in HAR systems, utilizing zero-shot, one-shot, and few-shot learning paradigms. Our approach incorporates \textit{role play} prompting, whereby the LLM assumes the role of expert to contextualize and evaluate sensor anomalies, and \textit{think step-by-step} reasoning, guiding the LLM to infer poisoning indicators in the raw sensor data and plausible clean alternatives. These strategies minimize reliance on curation of extensive datasets and enable robust, adaptable defense mechanisms in real-time. We perform an extensive evaluation of the framework, quantifying detection accuracy, sanitization quality, latency, and communication cost, thus demonstrating the practicality and effectiveness of LLMs in improving the security and reliability of wearable IoT systems.

[972] Optimal Inference Schedules for Masked Diffusion Models

Sitan Chen, Kevin Cong, Jerry Li

Main category: cs.LG

TL;DR: This paper provides an exact characterization of the divergence between true and sampled distributions in masked diffusion language models, establishing connections to function approximation theory and deriving optimal unmasking schedules.

Details

Motivation: Standard auto-regressive LLMs have sequential inference leading to long inference times, while masked diffusion models promise parallel sampling but lack rigorous understanding of how much parallelism is possible without performance degradation.

Method: The authors develop a new exact characterization of expected divergence for any distribution and unmasking schedule, leveraging connections to univariate function approximation theory to derive optimal sampling strategies.

Result: The paper shows impossibility of competing with optimal schedules without strong prior knowledge, but provides new upper bounds and sampling schedules based on total correlation and dual total correlation, enabling O(log n) step sampling in natural settings.

Conclusion: Masked diffusion models can achieve efficient parallel sampling with O(log n) steps without performance loss in natural settings, though optimal scheduling requires distribution-specific knowledge.

Abstract: A major bottleneck of standard auto-regressive large language models is that their inference process is inherently sequential, resulting in very long and costly inference times. To circumvent this, practitioners proposed a class of language models called diffusion language models, of which the masked diffusion model (MDM) is the most successful. The MDM is able to sample tokens out-of-order and, ostensibly, many tokens at once and in parallel. However, there is very limited rigorous understanding of how much parallel sampling these models can perform without noticeable degradation in their sampling performance. Prior work of Li and Cai obtained some preliminary bounds, but these are not tight for many natural classes of distributions. In this work, we give a new, exact characterization of the expected divergence between the true distribution and the sampled distribution, for any distribution and any unmasking schedule for the sampler, showing an elegant connection to the theory of univariate function approximation. By leveraging this connection, we then attain a number of novel lower and upper bounds for this problem. While the connection to function approximation in principle gives the optimal unmasking schedule for any distribution, we show that it is in general impossible to compete with it without strong a priori knowledge of the distribution, even in seemingly benign settings. However, we also demonstrate new upper bounds and new sampling schedules in terms of well-studied information-theoretic properties of the base distribution, namely, its total correlation and dual total correlation, which show that in some natural settings, one can sample in $O(log n)$ steps without any visible loss in performance, where $n$ is the total sequence length.

[973] Nowcast3D: Reliable precipitation nowcasting via gray-box learning

Huaguan Chen, Wei Han, Haofei Sun, Ning Lin, Xingtao Song, Yunfan Yang, Jie Tian, Yang Liu, Ji-Rong Wen, Xiaoye Zhang, Xueshun Shen, Hao Sun

Main category: cs.LG

TL;DR: A 3D gray-box nowcasting framework that combines physical constraints with data-driven learning for extreme precipitation forecasting, achieving superior performance up to 3-hour lead times.

Details

Motivation: Existing methods have limitations: NWP is too slow/coarse for convection, extrapolation models suffer from error accumulation, and 2D methods discard crucial vertical information needed for accurate dynamics.

Method: Hybrid framework processing volumetric radar data with physically constrained neural operators. Learns 3D advection fields, parameterizes diffusion, adds stochastic terms for unresolved motions, and uses residual branches for convective initiation.

Result: Achieved most accurate forecasts up to 3-hour lead time across precipitation regimes, ranked first in 57% of cases in blind evaluation by 160 meteorologists.

Conclusion: The 3D gray-box approach with physical consistency provides scalable and robust pathway for skillful and reliable extreme precipitation nowcasting.

Abstract: Extreme precipitation nowcasting demands high spatiotemporal fidelity and extended lead times, yet existing approaches remain limited. Numerical Weather Prediction (NWP) and its deep-learning emulations are too slow and coarse for rapidly evolving convection, while extrapolation and purely data-driven models suffer from error accumulation and excessive smoothing. Hybrid 2D radar-based methods discard crucial vertical information, preventing accurate reconstruction of height-dependent dynamics. We introduce a gray-box, fully three-dimensional nowcasting framework that directly processes volumetric radar reflectivity and couples physically constrained neural operators with datadriven learning. The model learns vertically varying 3D advection fields under a conservative advection operator, parameterizes spatially varying diffusion, and introduces a Brownian-motion–inspired stochastic term to represent unresolved motions. A residual branch captures small-scale convective initiation and microphysical variability, while a diffusion-based stochastic module estimates uncertainty. The framework achieves more accurate forecasts up to three-hour lead time across precipitation regimes and ranked first in 57% of cases in a blind evaluation by 160 meteorologists. By restoring full 3D dynamics with physical consistency, it offers a scalable and robust pathway for skillful and reliable nowcasting of extreme precipitation.

cs.MA

[974] Novel Concepts for Agent-Based Population Modelling and Simulation: Updates from GEPOC ABM

Martin Bicher, Maximilian Viehauser, Daniele Giannandrea, Hannah Kastinger, Dominik Brunmeir, Niki Popper

Main category: cs.MA

TL;DR: This paper presents three transferable innovations from the GEPOC ABM population model: an innovative time-update concept, co-simulation-inspired strategy, and accurate parametrisation approach.

Details

Motivation: Dynamic agent-based population models are gaining popularity for decision support due to their flexibility, and the authors want to share transferable innovations from their successful GEPOC ABM model with the broader community.

Method: The paper presents three key methods: 1) an innovative time-update concept for individual agents, 2) a co-simulation-inspired simulation strategy, and 3) a strategy for accurate model parametrisation.

Result: The methods are described in a reproducible manner with explanations of their advantages and ideas for transfer to other population models.

Conclusion: These three innovations from the well-established GEPOC ABM can be successfully transferred to other population models to improve their effectiveness and accuracy.

Abstract: In recent years, dynamic agent-based population models, which model every inhabitant of a country as a statistically representative agent, have been gaining in popularity for decision support. This is mainly due to their high degree of flexibility with respect to their area of application. GEPOC ABM is one of these models. Developed in 2015, it is now a well-established decision support tool and has been successfully applied for a wide range of population-level research questions ranging from health-care to logistics. At least in part, this success is attributable to continuous improvement and development of new methods. While some of these are very application- or implementation-specific, others can be well transferred to other population models. The focus of the present work lies on the presentation of three selected transferable innovations. We illustrate an innovative time-update concept for the individual agents, a co-simulation-inspired simulation strategy, and a strategy for accurate model parametrisation. We describe these methods in a reproducible manner, explain their advantages and provide ideas on how they can be transferred to other population models.

[975] STAIR: Stability criterion for Time-windowed Assignment and Internal adversarial influence in Routing and decision-making

Roee M. Francos, Daniel Garces, Orhan Eren Akgün, Stephanie Gil

Main category: cs.MA

TL;DR: The paper addresses routing problems in multi-agent systems with adversarial agents that spoof locations to disrupt operations. It proposes a new stability criterion called STAIR that is easier to analyze in adversarial settings and links stability to operational metrics like finite rejected requests.

Details

Motivation: Existing routing algorithms don't consider adversarial agents, leading to severe performance degradation when adversaries spoof locations to disrupt pickup-and-delivery operations through coordinated denial-of-service attacks.

Method: Proposes STAIR stability criterion that is easier to analyze than queuing theory approaches and doesn’t depend on discount factors like RL methods. Also introduces time-window constraints to mitigate degenerate stability phenomena.

Result: STAIR provides a practical metric for monitoring adversarial effects and demonstrates practical relevance through simulations on real-world San Francisco mobility-on-demand data.

Conclusion: STAIR offers a more suitable stability criterion for adversarial routing problems, directly linking stability to operational performance metrics and providing tools to monitor and mitigate adversarial disruptions.

Abstract: A major limitation of existing routing algorithms for multi-agent systems is that they are designed without considering the potential presence of adversarial agents in the decision-making loop, which could lead to severe performance degradation in real-life applications where adversarial agents may be present. We study autonomous pickup-and-delivery routing problems in which adversarial agents launch coordinated denial-of-service attacks by spoofing their locations. This deception causes the central scheduler to assign pickup requests to adversarial agents instead of cooperative agents. Adversarial agents then choose not to service the requests with the goal of disrupting the operation of the system, leading to delays, cancellations, and potential instability in the routing policy. Policy stability in routing problems is typically defined as the cost of the policy being uniformly bounded over time, and it has been studied through two different lenses: queuing theory and reinforcement learning (RL), which are not well suited for routing with adversaries. In this paper, we propose a new stability criterion, STAIR, which is easier to analyze than queuing-theory-based stability in adversarial settings. Furthermore, STAIR does not depend on a chosen discount factor as is the case in discounted RL stability. STAIR directly links stability to desired operational metrics, like a finite number of rejected requests. This characterization is particularly useful in adversarial settings as it provides a metric for monitoring the effect of adversaries in the operation of the system. Furthermore, we demonstrate STAIR’s practical relevance through simulations on real-world San Francisco mobility-on-demand data. We also identify a phenomenon of degenerate stability that arises in the adversarial routing problem, and we introduce time-window constraints in the decision-making algorithm to mitigate it.

[976] Evader-Agnostic Team-Based Pursuit Strategies in Partially-Observable Environments

Addison Kalanther, Daniel Bostwick, Chinmay Maheshwari, Shankar Sastry

Main category: cs.MA

TL;DR: Unable to fetch paper summary due to HTTP 503 error from arXiv API

Details

Motivation: N/A - Paper content unavailable

Method: N/A - Paper content unavailable

Result: N/A - Paper content unavailable

Conclusion: N/A - Paper content unavailable

Abstract: Failed to fetch summary for 2511.05812: Page request resulted in HTTP 503 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05812&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Qibing Ren, Zhijie Zheng, Jiaxuan Guo, Junchi Yan, Lizhuang Ma, Jing Shao

Main category: cs.MA

TL;DR: This paper studies collective financial fraud risks in LLM-powered multi-agent systems, develops a benchmark called MultiAgentFraudBench with 28 fraud scenarios, analyzes factors influencing fraud success, and proposes mitigation strategies.

Details

Motivation: To investigate the risks of collaborative financial fraud in large-scale multi-agent systems using LLM agents, understanding how agents can collaborate in fraudulent behaviors and how such collaboration amplifies risks.

Method: Developed MultiAgentFraudBench, a large-scale benchmark simulating financial fraud scenarios based on realistic online interactions, covering 28 typical fraud scenarios across public and private domains. Analyzed key factors like interaction depth, activity level, and collaboration failure modes.

Result: Found that malicious agents can adapt to environmental interventions and successfully collaborate in fraudulent behaviors. The benchmark successfully simulated various fraud scenarios and identified factors affecting fraud success.

Conclusion: Highlights real-world risks of multi-agent financial fraud and suggests practical mitigation measures including content-level warnings, using LLMs as monitors, and fostering group resilience through information sharing.

Abstract: In this work, we study the risks of collective financial fraud in large-scale multi-agent systems powered by large language model (LLM) agents. We investigate whether agents can collaborate in fraudulent behaviors, how such collaboration amplifies risks, and what factors influence fraud success. To support this research, we present MultiAgentFraudBench, a large-scale benchmark for simulating financial fraud scenarios based on realistic online interactions. The benchmark covers 28 typical online fraud scenarios, spanning the full fraud lifecycle across both public and private domains. We further analyze key factors affecting fraud success, including interaction depth, activity level, and fine-grained collaboration failure modes. Finally, we propose a series of mitigation strategies, including adding content-level warnings to fraudulent posts and dialogues, using LLMs as monitors to block potentially malicious agents, and fostering group resilience through information sharing at the societal level. Notably, we observe that malicious agents can adapt to environmental interventions. Our findings highlight the real-world risks of multi-agent financial fraud and suggest practical measures for mitigating them. Code is available at https://github.com/zheng977/MutiAgent4Fraud.

[978] S-DAG: A Subject-Based Directed Acyclic Graph for Multi-Agent Heterogeneous Reasoning

Jiangwen Dong, Zehui Lin, Wanyu Lin, Mingjin Zhang

Main category: cs.MA

TL;DR: The paper proposes a subject-level multi-agent collaboration framework using Graph Neural Networks to create Subject-based DAGs and match LLMs with subject expertise, achieving superior performance on multi-subject reasoning tasks.

Details

Motivation: Existing task-level approaches like mixture-of-experts are too coarse for heterogeneous problems involving multiple subjects, requiring finer-grained analysis at the subject level.

Method: Uses GNN to identify relevant subjects and dependencies to generate Subject-based DAGs, profiles LLM expertise by subject, and enables graph-structured multi-agent collaboration with information flow over the S-DAG.

Result: Significantly outperforms existing task-level model selection and multi-agent collaboration baselines in accuracy and efficiency on curated multi-subject benchmarks (MMLU-Pro, GPQA, MedMCQA).

Conclusion: Subject-aware reasoning and structured collaboration are effective for addressing complex multi-subject problems, demonstrating the importance of fine-grained analysis at the subject level.

Abstract: Large Language Models (LLMs) have achieved impressive performance in complex reasoning problems. Their effectiveness highly depends on the specific nature of the task, especially the required domain knowledge. Existing approaches, such as mixture-of-experts, typically operate at the task level; they are too coarse to effectively solve the heterogeneous problems involving multiple subjects. This work proposes a novel framework that performs fine-grained analysis at subject level equipped with a designated multi-agent collaboration strategy for addressing heterogeneous problem reasoning. Specifically, given an input query, we first employ a Graph Neural Network to identify the relevant subjects and infer their interdependencies to generate an \textit{Subject-based Directed Acyclic Graph} (S-DAG), where nodes represent subjects and edges encode information flow. Then we profile the LLM models by assigning each model a subject-specific expertise score, and select the top-performing one for matching corresponding subject of the S-DAG. Such subject-model matching enables graph-structured multi-agent collaboration where information flows from the starting model to the ending model over S-DAG. We curate and release multi-subject subsets of standard benchmarks (MMLU-Pro, GPQA, MedMCQA) to better reflect complex, real-world reasoning tasks. Extensive experiments show that our approach significantly outperforms existing task-level model selection and multi-agent collaboration baselines in accuracy and efficiency. These results highlight the effectiveness of subject-aware reasoning and structured collaboration in addressing complex and multi-subject problems.

[979] Multi-Agent Reinforcement Learning for Deadlock Handling among Autonomous Mobile Robots

Marcel Müller

Main category: cs.MA

TL;DR: MARL-based strategies outperform traditional methods for deadlock handling in AMR intralogistics systems, especially in complex environments using CTDE approach.

Details

Motivation: AMRs increase operational flexibility but also deadlock risks, while existing approaches neglect deadlock handling in planning and use rigid control rules that can't adapt to dynamic conditions.

Method: Developed structured methodology integrating MARL into logistics planning, introduced reference models for deadlock-capable MAPF problems, compared traditional strategies with MARL-based solutions (PPO and IMPALA) using grid-based environments and simulation.

Result: MARL-based strategies with CTDE outperform rule-based methods in complex congested environments, while rule-based methods remain competitive in simpler environments due to lower computational demands.

Conclusion: MARL provides flexible and scalable deadlock handling solution for dynamic intralogistics, but requires careful tailoring to operational context.

Abstract: This dissertation explores the application of multi-agent reinforcement learning (MARL) for handling deadlocks in intralogistics systems that rely on autonomous mobile robots (AMRs). AMRs enhance operational flexibility but also increase the risk of deadlocks, which degrade system throughput and reliability. Existing approaches often neglect deadlock handling in the planning phase and rely on rigid control rules that cannot adapt to dynamic operational conditions. To address these shortcomings, this work develops a structured methodology for integrating MARL into logistics planning and operational control. It introduces reference models that explicitly consider deadlock-capable multi-agent pathfinding (MAPF) problems, enabling systematic evaluation of MARL strategies. Using grid-based environments and an external simulation software, the study compares traditional deadlock handling strategies with MARL-based solutions, focusing on PPO and IMPALA algorithms under different training and execution modes. Findings reveal that MARL-based strategies, particularly when combined with centralized training and decentralized execution (CTDE), outperform rule-based methods in complex, congested environments. In simpler environments or those with ample spatial freedom, rule-based methods remain competitive due to their lower computational demands. These results highlight that MARL provides a flexible and scalable solution for deadlock handling in dynamic intralogistics scenarios, but requires careful tailoring to the operational context.

[980] The Curse of Shared Knowledge: Recursive Belief Reasoning in a Coordination Game with Imperfect Information

Thomas Bolander, Robin Engelhardt, Thomas S. Nicolet

Main category: cs.MA

TL;DR: Humans struggle to distinguish between common knowledge and finite-order shared knowledge, often attempting coordination even with shallow shared knowledge despite significant payoff penalties.

Details

Motivation: To investigate how well humans can differentiate between common knowledge and nth-order shared knowledge, and understand coordination failures in group settings where common knowledge is absent.

Method: Three experiments with 802 participants using a two-person coordination game with imperfect information that models recursive game structures and higher-order uncertainties in everyday-like settings.

Result: Participants had extreme difficulty accepting that common knowledge cannot be reduced to shared knowledge, and consistently attempted coordination even at minimal depths of shared knowledge despite substantial payoff penalties.

Conclusion: Human reasoning about knowledge attribution is limited in depth, leading to coordination failures because finite-order knowledge attributions always allow for higher-order uncertainties that can change what is known by whom.

Abstract: Common knowledge is a necessary condition for safe group coordination. When common knowledge can not be obtained, humans routinely use their ability to attribute beliefs and intentions in order to infer what is known. But such shared knowledge attributions are limited in depth and therefore prone to coordination failures, because any finite-order knowledge attribution allows for an even higher order attribution that may change what is known by whom. In three separate experiments we investigate to which degree human participants (N=802) are able to recognize the difference between common knowledge and nth-order shared knowledge. We use a new two-person coordination game with imperfect information that is able to cast the recursive game structure and higher-order uncertainties into a simple, everyday-like setting. Our results show that participants have a very hard time accepting the fact that common knowledge is not reducible to shared knowledge. Instead, participants try to coordinate even at the shallowest depths of shared knowledge and in spite of huge payoff penalties.

[981] Optimal Strategy Revision in Population Games: A Mean Field Game Theory Perspective

Julian Barreiro-Gomez, Shinkyu Park

Main category: cs.MA

TL;DR: This paper connects Population Games and Mean Field Games to design optimal strategy revision that maximizes agent payoffs while ensuring convergence to Nash equilibrium through evolutionary dynamics.

Details

Motivation: To establish a theoretical foundation for optimal strategy revision in Population Games by leveraging Mean Field Games framework, addressing the need for systematic design of agent decision-making processes that guarantee convergence to equilibrium.

Method: Link Evolutionary Dynamics in Population Games to Mean Field Games framework, solving forward Fokker-Planck and backward Hamilton-Jacobi equations to derive optimal strategy revision that satisfies positive correlation and Nash stationarity properties.

Result: The designed optimal strategy revision successfully maximizes agent payoffs over finite time horizon while ensuring convergence to Nash equilibrium, with numerical examples demonstrating improved convergence properties compared to existing approaches.

Conclusion: The MFG framework provides a systematic approach to design optimal strategy revision in Population Games, recovering existing evolutionary dynamics models as special cases while offering enhanced convergence guarantees to Nash equilibrium.

Abstract: This paper investigates the design of optimal strategy revision in Population Games (PG) by establishing its connection to finite-state Mean Field Games (MFG). Specifically, by linking Evolutionary Dynamics (ED) – which models agent decision-making in PG – to the MFG framework, we demonstrate that optimal strategy revision can be derived by solving the forward Fokker-Planck (FP) equation and the backward Hamilton-Jacobi (HJ) equation, both central components of the MFG framework. Furthermore, we show that the resulting optimal strategy revision, which maximizes each agent’s payoffs over a finite time horizon, satisfies two key properties: positive correlation and Nash stationarity, which are essential for ensuring convergence to the Nash equilibrium. This convergence is then rigorously analyzed and established. Additionally, we discuss how different design objectives for the optimal strategy revision can recover existing ED models previously reported in the PG literature. Numerical examples are provided to illustrate the effectiveness and improved convergence properties of the optimal strategy revision design.

[982] Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation

Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, Shirui Pan

Main category: cs.MA

TL;DR: ARG-Designer is an autoregressive model that generates multi-agent system collaboration graphs from scratch based on natural language task queries, overcoming limitations of template-based approaches.

Details

Motivation: Existing multi-agent system design approaches are constrained by template graph modification with predefined agents and hard-coded structures, limiting adaptability to task-specific requirements.

Method: Reframe MAS design as conditional autoregressive graph generation, where ARG-Designer sequentially determines agent count, selects roles from an extensible pool, and establishes optimal communication links.

Result: Extensive experiments across six benchmarks show ARG-Designer achieves state-of-the-art performance with significantly greater token efficiency and enhanced extensibility.

Conclusion: ARG-Designer provides a flexible and extensible approach for generating customized multi-agent system topologies tailored to specific task demands.

Abstract: Multi-agent systems (MAS) based on large language models (LLMs) have emerged as a powerful solution for dealing with complex problems across diverse domains. The effectiveness of MAS is critically dependent on its collaboration topology, which has become a focal point for automated design research. However, existing approaches are fundamentally constrained by their reliance on a template graph modification paradigm with a predefined set of agents and hard-coded interaction structures, significantly limiting their adaptability to task-specific requirements. To address these limitations, we reframe MAS design as a conditional autoregressive graph generation task, where both the system composition and structure are designed jointly. We propose ARG-Designer, a novel autoregressive model that operationalizes this paradigm by constructing the collaboration graph from scratch. Conditioned on a natural language task query, ARG-Designer sequentially and dynamically determines the required number of agents, selects their appropriate roles from an extensible pool, and establishes the optimal communication links between them. This generative approach creates a customized topology in a flexible and extensible manner, precisely tailored to the unique demands of different tasks. Extensive experiments across six diverse benchmarks demonstrate that ARG-Designer not only achieves state-of-the-art performance but also enjoys significantly greater token efficiency and enhanced extensibility. The source code of ARG-Designer is available at https://github.com/Shiy-Li/ARG-Designer.

cs.MM

eess.AS

[983] BSCodec: A Band-Split Neural Codec for High-Quality Universal Audio Reconstruction

Haoran Wang, Jiatong Shi, Jinchuan Tian, Bohan Li, Kai Yu, Shinji Watanabe

Main category: eess.AS

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2511.06150: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.06150&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[984] IDMap: A Pseudo-Speaker Generator Framework Based on Speaker Identity Index to Vector Mapping

Zeyan Liu, Liping Chen, Kong Aik Lee, Zhenhua Ling

Main category: eess.AS

TL;DR: Failed to fetch summary for paper 2511.06246 due to HTTP 429 (rate limiting) from arXiv API

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Cannot analyze method as paper content is unavailable

Result: No results available - API request was rate limited

Conclusion: Analysis impossible due to technical limitations in accessing the paper

Abstract: Failed to fetch summary for 2511.06246: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.06246&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[985] SPUR: A Plug-and-Play Framework for Integrating Spatial Audio Understanding and Reasoning into Large Audio-Language Models

S Sakshi, Vaibhavi Lokegaonkar, Neil Zhang, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha, Lie Lu

Main category: eess.AS

TL;DR: SPUR is a lightweight plug-in method that equips large audio-language models with spatial perception capabilities through minimal architectural changes, enabling them to understand spatial cues like direction, elevation, and distance from First-Order Ambisonics inputs.

Details

Motivation: Current large audio-language models operate on monaural inputs and lack spatial perception abilities (direction, elevation, distance), which are crucial for accurate understanding of real-world acoustic scenes and human-level auditory intelligence.

Method: SPUR consists of: (1) a First-Order Ambisonics encoder that maps (W,X,Y,Z) channels to rotation-aware spatial features via multimodal adapter, and (2) SPUR-Set dataset combining FOA recordings with controlled simulations for spatial QA training.

Result: Fine-tuning models on SPUR-Set consistently improves spatial question answering and multi-speaker attribution while preserving general audio understanding capabilities. Extensive ablations validate the approach’s effectiveness.

Conclusion: SPUR provides a simple recipe to transform monaural large audio-language models into spatially aware models through minimal architectural changes and specialized spatial training data.

Abstract: Spatial perception is central to auditory intelligence, enabling accurate understanding of real-world acoustic scenes and advancing human-level perception of the world around us. While recent large audio-language models (LALMs) show strong reasoning over complex audios, most operate on monaural inputs and lack the ability to capture spatial cues such as direction, elevation, and distance. We introduce SPUR, a lightweight, plug-in approach that equips LALMs with spatial perception through minimal architectural changes. SPUR consists of: (i) a First-Order Ambisonics (FOA) encoder that maps (W, X, Y, Z) channels to rotation-aware, listener-centric spatial features, integrated into target LALMs via a multimodal adapter; and (ii) SPUR-Set, a spatial QA dataset combining open-source FOA recordings with controlled simulations, emphasizing relative direction, elevation, distance, and overlap for supervised spatial reasoning. Fine-tuning our model on the SPUR-Set consistently improves spatial QA and multi-speaker attribution while preserving general audio understanding. SPUR provides a simple recipe that transforms monaural LALMs into spatially aware models. Extensive ablations validate the effectiveness of our approach.

[986] Neural Directional Filtering Using a Compact Microphone Array

Weilong Huang, Srikanth Raj Chetupalli, Mhd Modar Halimeh, Oliver Thiergart, Emanuël Habets

Main category: eess.AS

TL;DR: Neural directional filtering (NDF) uses deep neural networks to achieve predefined directivity patterns with compact microphone arrays, overcoming limitations of traditional beamformers.

Details

Motivation: Traditional beamformers struggle with compact arrays due to limited microphones and aperture, degrading directivity pattern effectiveness.

Method: NDF computes a single-channel complex mask from array signals and applies it to a reference microphone to create a virtual directional microphone with desired directivity pattern.

Result: NDF achieves frequency-invariant patterns above aliasing frequency, approximates diverse higher-order patterns, enables steering, and generalizes to unseen conditions.

Conclusion: NDF demonstrates superior performance over conventional beamforming and parametric approaches for compact microphone arrays.

Abstract: Beamforming with desired directivity patterns using compact microphone arrays is essential in many audio applications. Directivity patterns achievable using traditional beamformers depend on the number of microphones and the array aperture. Generally, their effectiveness degrades for compact arrays. To overcome these limitations, we propose a neural directional filtering (NDF) approach that leverages deep neural networks to enable sound capture with a predefined directivity pattern. The NDF computes a single-channel complex mask from the microphone array signals, which is then applied to a reference microphone to produce an output that approximates a virtual directional microphone with the desired directivity pattern. We introduce training strategies and propose data-dependent metrics to evaluate the directivity pattern and directivity factor. We show that the proposed method: i) achieves a frequency-invariant directivity pattern even above the spatial aliasing frequency, ii) can approximate diverse and higher-order patterns, iii) can steer the pattern in different directions, and iv) generalizes to unseen conditions. Lastly, experimental comparisons demonstrate superior performance over conventional beamforming and parametric approaches.

[987] Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models

Umberto Cappellazzo, Xubo Liu, Pingchuan Ma, Stavros Petridis, Maja Pantic

Main category: eess.AS

TL;DR: Omni-AVSR is a unified audio-visual LLM that supports ASR, VSR, and AVSR tasks with elastic inference, using multi-granularity training and parameter-efficient adaptation to reduce resource use while maintaining performance.

Details

Motivation: Current LLM-based approaches train separate models for ASR, VSR, and AVSR tasks, increasing computational costs and missing cross-task synergies. Fixed-rate token compression also limits flexibility in balancing accuracy and efficiency.

Method: Adapts matryoshka representation learning for efficient multi-granularity training across audio and visual modalities, and explores three LoRA-based strategies for parameter-efficient adaptation of the backbone LLM.

Result: Achieves comparable or superior accuracy to state-of-the-art baselines on LRS2 and LRS3 datasets while using substantially lower training and deployment resources. Model remains robust under acoustic noise and shows favorable scaling behavior.

Conclusion: Omni-AVSR provides an effective unified framework for multi-modal speech recognition that balances performance and efficiency through multi-granularity training and parameter-efficient adaptation.

Abstract: Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training separate models that raise computational and deployment resource use while missing potential cross-task synergies. They also rely on fixed-rate token compression, which restricts flexibility in balancing accuracy with efficiency. These limitations highlight the need for a unified framework that can support ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka representation learning paradigm to efficiently train across multiple audio and visual granularities, reducing its inherent training resource use. Furthermore, we explore three LoRA-based strategies for adapting the backbone LLM, balancing shared and task-specific specialization. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves comparable or superior accuracy to state-of-the-art baselines while training a single model at substantially lower training and deployment resource use. The model also remains robust under acoustic noise, and we analyze its scaling behavior as LLM size increases, providing insights into the trade-off between performance and efficiency.

[988] Privacy in Speech Technology

Tom Bäckström

Main category: eess.AS

TL;DR: This paper provides a comprehensive tutorial on privacy threats in speech technology, covering threat modeling, protection methods, performance measurement, and societal impacts.

Details

Motivation: Speech technology is rapidly improving and widely used, but it inherently contains private information and side information (health, emotions, affiliations) that can lead to serious threats like price gouging, harassment, extortion, and stalking.

Method: The paper presents a tutorial overview that includes modeling privacy threats, approaches for protecting users’ privacy, measuring performance of privacy-protecting methods, and analyzing privacy perception and legal consequences.

Result: The tutorial identifies key privacy vulnerabilities in speech technology and provides systematic approaches for addressing them.

Conclusion: Beyond the tutorial overview, the paper identifies critical areas where urgent improvements are needed in speech privacy protection and outlines directions for further development.

Abstract: Speech technology for communication, accessing information, and services has rapidly improved in quality. It is convenient and appealing because speech is the primary mode of communication for humans. Such technology, however, also presents proven threats to privacy. Speech is a tool for communication and it will thus inherently contain private information. Importantly, it however also contains a wealth of side information, such as information related to health, emotions, affiliations, and relationships, all of which are private. Exposing such private information can lead to serious threats such as price gouging, harassment, extortion, and stalking. This paper is a tutorial on privacy issues related to speech technology, modeling their threats, approaches for protecting users’ privacy, measuring the performance of privacy-protecting methods, perception of privacy as well as societal and legal consequences. In addition to a tutorial overview, it also presents lines for further development where improvements are most urgently needed.

[989] Adaptive Convolution for CNN-based Speech Enhancement Models

Dahan Wang, Xiaobin Rong, Shiruo Sun, Yuxiang Hu, Changbao Zhu, Jing Lu

Main category: eess.AS

TL;DR: This paper introduces adaptive convolution, a dynamic convolutional module that generates time-varying kernels for speech enhancement, and proposes AdaptCRN, an ultra-lightweight model that achieves superior performance with minimal computational cost.

Details

Motivation: To enhance CNN-based speech enhancement models by enabling adaptive representation of speech signals through frame-wise dynamic convolution that can better capture spectral features.

Method: Proposes adaptive convolution with frame-wise causal dynamic convolution using multiple parallel candidate kernels and lightweight attention mechanism. Also introduces AdaptCRN model combining adaptive convolution with efficient encoder-decoder design.

Result: Adaptive convolution significantly improves performance with negligible computational increase, especially for lightweight models. AdaptCRN achieves superior performance compared to models with similar or higher computational costs.

Conclusion: Adaptive convolution is an efficient and versatile module that enhances CNN-based speech enhancement, and AdaptCRN demonstrates state-of-the-art performance with ultra-lightweight design.

Abstract: Deep learning-based speech enhancement methods have significantly improved speech quality and intelligibility. Convolutional neural networks (CNNs) have been proven to be essential components of many high-performance models. In this paper, we introduce adaptive convolution, an efficient and versatile convolutional module that enhances the model’s capability to adaptively represent speech signals. Adaptive convolution performs frame-wise causal dynamic convolution, generating time-varying kernels for each frame by assembling multiple parallel candidate kernels. A lightweight attention mechanism is proposed for adaptive convolution, leveraging both current and historical information to assign adaptive weights to each candidate kernel. This enables the convolution operation to adapt to frame-level speech spectral features, leading to more efficient extraction and reconstruction. We integrate adaptive convolution into various CNN-based models, highlighting its generalizability. Experimental results demonstrate that adaptive convolution significantly improves the performance with negligible increases in computational complexity, especially for lightweight models. Moreover, we present an intuitive analysis revealing a strong correlation between kernel selection and signal characteristics. Furthermore, we propose the adaptive convolutional recurrent network (AdaptCRN), an ultra-lightweight model that incorporates adaptive convolution and an efficient encoder-decoder design, achieving superior performance compared to models with similar or even higher computational costs.

[990] Bridging the Gap between Continuous and Informative Discrete Representations by Random Product Quantization

Xueqing Li, Hao Ma, Zehan Li, Rujin Chen, Boyu Zhu, Ruihao Jing, Jian Kang, Jie Li, Chi Zhang, Xiao-Lei Zhang, Xuelong Li

Main category: eess.AS

TL;DR: Proposes two quantization-based discretization methods (PQ and RPQ) for SSL speech representations that outperform standard K-means discretization and achieve competitive performance with continuous representations.

Details

Motivation: Existing discretization methods for SSL speech representations suffer from significant information loss, creating a performance gap compared to continuous representations.

Method: Product Quantization (PQ) partitions feature space into subspaces and independently quantizes each sub-vector. Random Product Quantization (RPQ) randomly samples feature dimensions multiple times to construct sub-vectors, enhancing representation diversity.

Result: PQ and RPQ achieved relative improvements of 21.8% and 20.0% in WER on LibriSpeech, and 24.1% and 19.6% in CER on ML-SUPERB compared to standard K-means discretization.

Conclusion: The proposed quantization methods effectively mitigate information loss in SSL representation discretization and achieve performance competitive with continuous representations.

Abstract: Self-supervised learning (SSL) has become a core technique in speech processing, but the high dimensionality of its representations makes discretization essential for improving efficiency. However, existing discretization methods still suffer from significant information loss, resulting in a notable performance gap compared to continuous representations. To overcome these limitations, we propose two quantization-based discretization methods: Product Quantization (PQ) and Random Product Quantization (RPQ). PQ partitions the original feature space into multiple subspaces and independently quantizes each sub-vector, producing a fused set of discrete units that retain diverse information from different subspaces, thereby mitigating the loss associated with single-cluster quantization. RPQ further enhances representation diversity by randomly sampling a fixed proportion of feature dimensions multiple times to construct sub-vectors, thereby better capturing the variability in the data distribution. Theoretical analysis shows that RPQ reduces the correlation coefficient rho (where 0 <= rho <= 1) between sub-quantizers. Its quantization error is lower-bounded by the product of rho and epsilon-kms, where epsilon-kms denotes the quantization error of a single K-means quantizer. Experimental results on a combined dataset built from LibriSpeech and ML-SUPERB show that PQ and RPQ outperform standard K-means discretization, achieving relative improvements of 21.8 percent and 20.0 percent in WER on LibriSpeech, and 24.1 percent and 19.6 percent in CER on ML-SUPERB, respectively. Moreover, their performance is competitive with, and in some cases even surpasses, that of continuous SSL representations.

[991] Hybrid Pruning: In-Situ Compression of Self-Supervised Speech Models for Speaker Verification and Anti-Spoofing

Junyi Peng, Lin Zhang, Jiangyu Han, Oldřich Plchot, Johan Rohdin, Themos Stafylakis, Shuai Wang, Jan Černocký

Main category: eess.AS

TL;DR: A unified framework that integrates structured pruning into downstream fine-tuning, enabling joint optimization of task performance and model sparsity in a single stage for speech processing models.

Details

Motivation: Large SSL models like WavLM achieve SOTA performance but are too large for resource-constrained devices. Existing pruning methods separate pruning from fine-tuning, creating suboptimal architectures for diverse downstream tasks.

Method: Unified framework that combines structured pruning with task-specific fine-tuning in a single stage, eliminating multi-stage pipelines and knowledge distillation.

Result: 70% parameter reduction with negligible performance degradation; achieved 0.7%, 0.8%, and 1.6% EER on Vox1-O, -E, -H respectively; SOTA 3.7% EER on ASVspoof5 with improved generalization in low-resource scenarios.

Conclusion: The unified pruning-fine-tuning framework effectively compresses speech models while maintaining performance, demonstrating better generalization and eliminating complex multi-stage approaches.

Abstract: Although large-scale self-supervised learning (SSL) models like WavLM have achieved state-of-the-art performance in speech processing, their significant size impedes deployment on resource-constrained devices. While structured pruning is a key technique for model compression, existing methods typically separate it from task-specific fine-tuning. This multi-stage approach struggles to create optimal architectures tailored for diverse downstream tasks. In this work, we introduce a unified framework that integrates structured pruning into the downstream fine-tuning process. Our framework unifies these steps, jointly optimizing for task performance and model sparsity in a single stage. This allows the model to learn a compressed architecture specifically for the end task, eliminating the need for complex multi-stage pipelines and knowledge distillation. Our pruned models achieve up to a 70% parameter reduction with negligible performance degradation on large-scale datasets, achieving equal error rates of 0.7%, 0.8%, and 1.6% on Vox1-O, -E, and -H, respectively. Furthermore, our approach demonstrates improved generalization in low-resource scenarios, reducing overfitting and achieving a state-of-the-art 3.7% EER on ASVspoof5.

eess.IV

[992] Training-Free Adaptive Quantization for Variable Rate Image Coding for Machines

Yui Tatsumi, Ziyue Zeng, Hiroshi Watanabe

Main category: eess.IV

TL;DR: Proposes a training-free adaptive quantization scheme for Image Coding for Machines that enables flexible bitrate control using channel-wise entropy dependencies and spatial scale parameters.

Details

Motivation: Most ICM frameworks use learned image compression models that operate at fixed rates and require separate training for each target bitrate, limiting practical applications. Variable rate approaches exist but depend on training, increasing computational cost and deployment complexity.

Method: Training-free adaptive quantization step size control scheme leveraging channel-wise entropy dependencies and spatial scale parameters from hyperprior network. Preserves semantically important regions while coarsely quantizing less critical areas with continuous bitrate control via a single parameter.

Result: Achieves up to 11.07% BD-rate savings over non-adaptive variable rate methods, demonstrating effective bitrate control without additional training.

Conclusion: The proposed method provides an effective training-free solution for variable rate control in ICM, enabling flexible bitrate adjustment while maintaining semantic preservation and reducing computational overhead.

Abstract: Image Coding for Machines (ICM) has become increasingly important with the rapid integration of computer vision into real-world applications. However, most ICM frameworks utilize learned image compression (LIC) models that operate at a fixed rate and require separate training for each target bitrate, which may limit their practical applications. Existing variable rate LIC approaches mitigate this limitation but typically depend on training, increasing computational cost and deployment complexity. Moreover, variable rate control has not been thoroughly explored for ICM. To address these challenges, we propose a training-free, adaptive quantization step size control scheme that enables flexible bitrate adjustment. By leveraging both channel-wise entropy dependencies and spatial scale parameters predicted by the hyperprior network, the proposed method preserves semantically important regions while coarsely quantizing less critical areas. The bitrate can be continuously controlled through a single parameter. Experimental results demonstrate the effectiveness of our proposed method, achieving up to 11.07% BD-rate savings over the non-adaptive variable rate method.

[993] HarmoQ: Harmonized Post-Training Quantization for High-Fidelity Image

Hongjun Wang, Jiyuan Chen, Xuan Song, Yinqiang Zheng

Main category: eess.IV

TL;DR: HarmoQ is a unified quantization framework that coordinates weight and activation quantization through structural residual calibration, harmonized scale optimization, and adaptive boundary refinement, achieving superior super-resolution performance under aggressive compression.

Details

Motivation: Existing post-training quantization methods treat weight and activation quantization independently, missing their critical interplay in super-resolution models where weights encode restoration priors and activations carry intensity information.

Method: Three synergistic steps: structural residual calibration adjusts weights to compensate for activation-induced detail loss, harmonized scale optimization balances quantization difficulty via closed-form solutions, and adaptive boundary refinement maintains this balance during optimization.

Result: HarmoQ outperforms prior art by 0.46 dB on Set5 at 2-bit quantization while delivering 3.2x speedup and 4x memory reduction on A100 GPUs.

Conclusion: This work provides the first systematic analysis of weight-activation coupling in super-resolution quantization and establishes a principled solution for efficient high-quality image restoration.

Abstract: Post-training quantization offers an efficient pathway to deploy super-resolution models, yet existing methods treat weight and activation quantization independently, missing their critical interplay. Through controlled experiments on SwinIR, we uncover a striking asymmetry: weight quantization primarily degrades structural similarity, while activation quantization disproportionately affects pixel-level accuracy. This stems from their distinct roles–weights encode learned restoration priors for textures and edges, whereas activations carry input-specific intensity information. Building on this insight, we propose HarmoQ, a unified framework that harmonizes quantization across components through three synergistic steps: structural residual calibration proactively adjusts weights to compensate for activation-induced detail loss, harmonized scale optimization analytically balances quantization difficulty via closed-form solutions, and adaptive boundary refinement iteratively maintains this balance during optimization. Experiments show HarmoQ achieves substantial gains under aggressive compression, outperforming prior art by 0.46 dB on Set5 at 2-bit while delivering 3.2x speedup and 4x memory reduction on A100 GPUs. This work provides the first systematic analysis of weight-activation coupling in super-resolution quantization and establishes a principled solution for efficient high-quality image restoration.

[994] EndoIR: Degradation-Agnostic All-in-One Endoscopic Image Restoration via Noise-Aware Routing Diffusion

Tong Chen, Xinyu Ma, Long Bai, Wenyang Wang, Sun Yue, Luping Zhou

Main category: eess.IV

TL;DR: EndoIR is an all-in-one diffusion-based framework that restores multiple degradation types in endoscopic images using a single model, achieving state-of-the-art performance with fewer parameters.

Details

Motivation: Endoscopic images suffer from diverse co-occurring degradations like low lighting, smoke, and bleeding that obscure clinical details. Existing methods are task-specific and require prior knowledge of degradation types, limiting real-world clinical robustness.

Method: Proposes EndoIR with Dual-Domain Prompter for joint spatial-frequency features, adaptive embedding for shared/task-specific cues, Dual-Stream Diffusion architecture processing clean/degraded inputs separately, Rectified Fusion Block for structured integration, and Noise-Aware Routing Block for efficient feature selection.

Result: Experiments on SegSTRONG-C and CEC datasets show state-of-the-art performance across multiple degradation scenarios with fewer parameters than baselines. Downstream segmentation confirms clinical utility.

Conclusion: EndoIR provides a robust, degradation-agnostic solution for endoscopic image restoration that outperforms specialized methods while being more efficient and clinically useful.

Abstract: Endoscopic images often suffer from diverse and co-occurring degradations such as low lighting, smoke, and bleeding, which obscure critical clinical details. Existing restoration methods are typically task-specific and often require prior knowledge of the degradation type, limiting their robustness in real-world clinical use. We propose EndoIR, an all-in-one, degradation-agnostic diffusion-based framework that restores multiple degradation types using a single model. EndoIR introduces a Dual-Domain Prompter that extracts joint spatial-frequency features, coupled with an adaptive embedding that encodes both shared and task-specific cues as conditioning for denoising. To mitigate feature confusion in conventional concatenation-based conditioning, we design a Dual-Stream Diffusion architecture that processes clean and degraded inputs separately, with a Rectified Fusion Block integrating them in a structured, degradation-aware manner. Furthermore, Noise-Aware Routing Block improves efficiency by dynamically selecting only noise-relevant features during denoising. Experiments on SegSTRONG-C and CEC datasets demonstrate that EndoIR achieves state-of-the-art performance across multiple degradation scenarios while using fewer parameters than strong baselines, and downstream segmentation experiments confirm its clinical utility.

Jyun-Ping Kao, Shinyeong Rho, Shahar Lazarev, Hyun-Hae Cho, Fangxu Xing, Taehoon Shin, C. -C. Jay Kuo, Jonghye Woo

Main category: eess.IV

TL;DR: A novel parameter-efficient transfer learning method using 3D LoRA adaptation of CT pre-trained foundation models for ADHD classification from MRI data, achieving state-of-the-art performance with 113x fewer parameters.

Details

Motivation: Early ADHD diagnosis is crucial but challenging using neuroimaging due to heterogeneous presentations and symptom overlap with other conditions. Cross-modal adaptation of foundation models can address these challenges efficiently.

Method: Proposed 3D Low-Rank Adaptation (LoRA) that factorizes 3D convolutional kernels into 2D low-rank updates, adapting a CT pre-trained foundation model to MRI-based ADHD classification with dramatically reduced trainable parameters.

Result: Achieved 71.9% accuracy and AUC of 0.716 in five-fold cross-validation on public diffusion MRI database, using only 1.64 million parameters (113x fewer than full fine-tuning).

Conclusion: Successfully demonstrated cross-modal (CT-to-MRI) adaptation of foundation models in neuroimaging, establishing new benchmark for ADHD classification with greatly improved efficiency.

Abstract: Early diagnosis of attention-deficit/hyperactivity disorder (ADHD) in children plays a crucial role in improving outcomes in education and mental health. Diagnosing ADHD using neuroimaging data, however, remains challenging due to heterogeneous presentations and overlapping symptoms with other conditions. To address this, we propose a novel parameter-efficient transfer learning approach that adapts a large-scale 3D convolutional foundation model, pre-trained on CT images, to an MRI-based ADHD classification task. Our method introduces Low-Rank Adaptation (LoRA) in 3D by factorizing 3D convolutional kernels into 2D low-rank updates, dramatically reducing trainable parameters while achieving superior performance. In a five-fold cross-validated evaluation on a public diffusion MRI database, our 3D LoRA fine-tuning strategy achieved state-of-the-art results, with one model variant reaching 71.9% accuracy and another attaining an AUC of 0.716. Both variants use only 1.64 million trainable parameters (over 113x fewer than a fully fine-tuned foundation model). Our results represent one of the first successful cross-modal (CT-to-MRI) adaptations of a foundation model in neuroimaging, establishing a new benchmark for ADHD classification while greatly improving efficiency.

[996] SPASHT: An image-enhancement method for sparse-view MPI SPECT

Zezhang Yang, Zitong Yu, Nuri Choi, Janice Tania, Wenxuan Xue, Barry A. Siegel, Abhinav K. Jha

Main category: eess.IV

TL;DR: SPASHT algorithm improves defect detection in sparse-view MPI SPECT by reducing scanning time while maintaining image quality through deep learning-based enhancement.

Details

Motivation: Reduce long scanning time in MPI SPECT that causes patient discomfort and motion artifacts, while addressing image quality degradation from fewer projection views.

Method: Proposed SPASHT algorithm that inherently trains to improve defect-detection performance, evaluated on clinical data with synthetically inserted defects for various reduced projection views (1/6, 1/3, 1/2 of typical).

Result: SPASHT significantly improved AUC for defect detection across all reduced projection views compared to sparse-view protocol alone. Human observer study confirmed improved detection performance with SPASHT.

Conclusion: SPASHT effectively enhances sparse-view MPI SPECT image quality and improves defect detection, motivating further clinical validation.

Abstract: Single-photon emission computed tomography for myocardial perfusion imaging (MPI SPECT) is a widely used diagnostic tool for coronary artery disease. However, the procedure requires considerable scanning time, leading to patient discomfort and the potential for motion-induced artifacts. Reducing the number of projection views while keeping the time per view unchanged provides a mechanism to shorten the scanning time. However, this approach leads to increased sampling artifacts, higher noise, and hence limited image quality. To address these issues, we propose sparseview SPECT image enhancement (SPASHT), inherently training the algorithm to improve performance on defect-detection tasks. We objectively evaluated SPASHT on the clinical task of detecting perfusion defects in a retrospective clinical study using data from patients who underwent MPI SPECT, where the defects were clinically realistic and synthetically inserted. The study was conducted for different numbers of fewer projection views, including 1/6, 1/3, and 1/2 of the typical projection views for MPI SPECT. Performance on the detection task was quantified using area under the receiver operating characteristic curve (AUC). Images obtained with SPASHT yielded significantly improved AUC compared to those obtained with the sparse-view protocol for all the considered numbers of fewer projection views. To further assess performance, a human observer study on the task of detecting perfusion defects was conducted. Results from the human observer study showed improved detection performance with images reconstructed using SPASHT compared to those from the sparse-view protocol. The results provide evidence of the efficacy of SPASHT in improving the quality of sparse-view MPI SPECT images and motivate further clinical validation.

[997] A Visual Perception-Based Tunable Framework and Evaluation Benchmark for H.265/HEVC ROI Encryption

Xiang Zhang, Geng Wu, Wenbin Huang, Daoyong Fu, Fei Peng, Zhangjie Fu

Main category: eess.IV

TL;DR: Proposes a visual perception-based tunable framework and evaluation benchmark for H.265/HEVC ROI selective encryption, addressing flexibility and evaluation standardization issues in existing methods.

Details

Motivation: Existing ROI-based video encryption methods suffer from insufficient flexibility and lack of a unified evaluation system, limiting their practical application for privacy protection in video content.

Method: 1) ROI region recognition using visual perception network, 2) Three-level tunable encryption strategy balancing security and real-time performance, 3) Development of unified ROI encryption evaluation benchmark.

Result: Experimental results show the proposed benchmark comprehensively measures ROI selective encryption performance. Enhanced and advanced level encryption outperform existing algorithms in multiple metrics.

Conclusion: The framework effectively meets privacy protection requirements in H.265/HEVC and provides reliable solution for secure and efficient processing of sensitive video content.

Abstract: ROI selective encryption, as an efficient privacy protection technique, encrypts only the key regions in the video, thereby ensuring security while minimizing the impact on coding efficiency. However, existing ROI-based video encryption methods suffer from insufficient flexibility and lack of a unified evaluation system. To address these issues, we propose a visual perception-based tunable framework and evaluation benchmark for H.265/HEVC ROI encryption. Our scheme introduces three key contributions: 1) A ROI region recognition module based on visual perception network is proposed to accurately identify the ROI region in videos. 2) A three-level tunable encryption strategy is implemented while balancing security and real-time performance. 3) A unified ROI encryption evaluation benchmark is developed to provide a standardized quantitative platform for subsequent research. This triple strategy provides new solution and significant unified performance evaluation methods for ROI selective encryption field. Experimental results indicate that the proposed benchmark can comprehensively measure the performance of the ROI selective encryption. Compared to existing ROI encryption algorithms, our proposed enhanced and advanced level encryption exhibit superior performance in multiple performance metrics. In general, the proposed framework effectively meets the privacy protection requirements in H.265/HEVC and provides a reliable solution for secure and efficient processing of sensitive video content.

[998] Turbo-DDCM: Fast and Flexible Zero-Shot Diffusion-Based Image Compression

Amit Vaisman, Guy Ohayon, Hila Manor, Michael Elad, Tomer Michaeli

Main category: eess.IV

TL;DR: Turbo-DDCM is an efficient zero-shot diffusion-based compression method that runs faster than existing methods while maintaining state-of-the-art performance, with flexible variants for priority-aware and distortion-controlled compression.

Details

Motivation: Zero-shot diffusion-based compression methods are notoriously slow and computationally demanding, creating a need for more efficient alternatives that maintain performance.

Method: Builds on Denoising Diffusion Codebook Models (DDCMs) by combining multiple noise vectors at each denoising step to reduce operations, coupled with improved encoding protocol and flexible variants for priority-aware and distortion-controlled compression.

Result: Turbo-DDCM runs substantially faster than existing methods while maintaining performance on par with state-of-the-art techniques, as demonstrated through comprehensive experiments.

Conclusion: Turbo-DDCM represents a compelling, practical, and flexible image compression scheme that addresses the computational limitations of existing diffusion-based compression methods.

Abstract: While zero-shot diffusion-based compression methods have seen significant progress in recent years, they remain notoriously slow and computationally demanding. This paper presents an efficient zero-shot diffusion-based compression method that runs substantially faster than existing methods, while maintaining performance that is on par with the state-of-the-art techniques. Our method builds upon the recently proposed Denoising Diffusion Codebook Models (DDCMs) compression scheme. Specifically, DDCM compresses an image by sequentially choosing the diffusion noise vectors from reproducible random codebooks, guiding the denoiser’s output to reconstruct the target image. We modify this framework with Turbo-DDCM, which efficiently combines a large number of noise vectors at each denoising step, thereby significantly reducing the number of required denoising operations. This modification is also coupled with an improved encoding protocol. Furthermore, we introduce two flexible variants of Turbo-DDCM, a priority-aware variant that prioritizes user-specified regions and a distortion-controlled variant that compresses an image based on a target PSNR rather than a target BPP. Comprehensive experiments position Turbo-DDCM as a compelling, practical, and flexible image compression scheme.

[999] Compressive Sensing Photoacoustic Imaging Receiver with Matrix-Vector-Multiplication SAR ADC

Huan-Cheng Liao, Shunyao Zhang, Yumin Su, Arvind Govinday, Yiwei Zou, Wei Wang, Vivek Boominathan, Ashok Veeraraghavan, Lei S. Li, Kaiyuan Yang

Main category: eess.IV

TL;DR: A photoacoustic imaging receiver with embedded compressive sensing that reduces data output by 4-8x while maintaining image quality, enabling compact wearable systems.

Details

Motivation: Large data volume from high-density transducer arrays poses challenges for compact and power-efficient wearable photoacoustic imaging systems.

Method: Integrates 16 AFEs and four matrix-vector-multiplication SAR ADCs for analog-domain compression using programmable ternary weights, with two reconstruction methods: optimization-based and learning-based approaches.

Result: Achieves 57.5 dB SNDR at 20.41 MS/s, 3.5 nV/sqrt(Hz) input-referred noise, R^2 > 0.999 MVM linearity, and high-fidelity image reconstruction under 8x compression with 5.83 mW/channel power consumption.

Conclusion: The hardware-embedded compressive sensing approach provides an effective solution for miniaturized, wearable photoacoustic imaging systems by significantly reducing data rates while preserving image quality.

Abstract: Wearable photoacoustic imaging devices hold great promise for continuous health monitoring and point-of-care diagnostics. However, the large data volume generated by high-density transducer arrays presents a major challenge for realizing compact and power-efficient wearable systems. This paper presents a photoacoustic imaging receiver (RX) that embeds compressive sensing directly into the hardware to address this bottleneck. The RX integrates 16 AFEs and four matrix-vector-multiplication (MVM) SAR ADCs that perform energy- and area-efficient analog-domain compression. The architecture achieves a 4-8x reduction in output data rate while preserving low-loss full-array information. The MVM SAR ADC executes passive and accurate MVM using user-defined programmable ternary weights. Two signal reconstruction methods are implemented: (1) an optimization approach using the fast iterative shrinkage-thresholding algorithm, and (2) a learning-based approach employing implicit neural representation. Fabricated in 65 nm CMOS, the chip achieves an ADC’s SNDR of 57.5 dB at 20.41 MS/s, with an AFE input-referred noise of 3.5 nV/sqrt(Hz). MVM linearity measurements show R^2 > 0.999 across a wide range of weights and input amplitudes. The system is validated through phantom imaging experiments, demonstrating high-fidelity image reconstruction under up to 8x compression. The RX consumes 5.83 mW/channel and supports a general ternary-weighted measurement matrix, offering a compelling solution for next-generation miniaturized, wearable PA imaging systems.

[1000] Hierarchical Spatial-Frequency Aggregation for Spectral Deconvolution Imaging

Tao Lv, Daoming Zhou, Chenglong Huang, Chongde Zi, Linsen Chen, Xun Cao

Main category: eess.IV

TL;DR: HSFAUT is a Transformer-based deep unfolding method that solves the inverse problem in Spectral Deconvolution imaging by transforming nonlinear processes into linear mappings through frequency domain projection and hierarchical spatial-spectral aggregation.

Details

Motivation: Spectral Deconvolution imaging (SDI) methods for computational spectral imaging face challenges due to scene-dependent coefficient matrices from composite convolution-integration operations, which hinder efficient prior exploitation and accurate reconstruction.

Method: Proposed HSFAUT framework: decomposes subproblems into frequency domain to transform nonlinear processes into linear mappings, and integrates Spatial-Frequency Aggregation Transformer (SFAT) to aggregate spatial-spectral priors during iterative refinement.

Result: HSFAUT surpasses state-of-the-art methods with cheaper memory and computational costs, while achieving optimal performance across different SDI systems in both simulated and real experiments.

Conclusion: The proposed hierarchical spatial-frequency aggregation unfolding transformer effectively addresses the data-dependent operator challenge in SDI, enabling high-fidelity computational spectral imaging with improved efficiency and performance.

Abstract: Computational spectral imaging (CSI) achieves real-time hyperspectral imaging through co-designed optics and algorithms, but typical CSI methods suffer from a bulky footprint and limited fidelity. Therefore, Spectral Deconvolution imaging (SDI) methods based on PSF engineering have been proposed to achieve high-fidelity compact CSI design recently. However, the composite convolution-integration operations of SDI render the normal-equation coefficient matrix scene-dependent, which hampers the efficient exploitation of imaging priors and poses challenges for accurate reconstruction. To tackle the inherent data-dependent operators in SDI, we introduce a Hierarchical Spatial-Spectral Aggregation Unfolding Framework (HSFAUF). By decomposing subproblems and projecting them into the frequency domain, HSFAUF transforms nonlinear processes into linear mappings, thereby enabling efficient solutions. Furthermore, to integrate spatial-spectral priors during iterative refinement, we propose a Spatial-Frequency Aggregation Transformer (SFAT), which explicitly aggregates information across spatial and frequency domains. By integrating SFAT into HSFAUF, we develop a Transformer-based deep unfolding method, \textbf{H}ierarchical \textbf{S}patial-\textbf{F}requency \textbf{A}ggregation \textbf{U}nfolding \textbf{T}ransformer (HSFAUT), to solve the inverse problem of SDI. Systematic simulated and real experiments show that HSFAUT surpasses SOTA methods with cheaper memory and computational costs, while exhibiting optimal performance on different SDI systems.

[1001] CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video

Xinyi Wang, Angeliki Katsenou, Junxiao Shen, David Bull

Main category: eess.IV

TL;DR: CAMP-VQA is a novel no-reference video quality assessment framework that uses large vision-language models with quality-aware prompting to generate fine-grained quality captions, outperforming existing methods without needing manual annotations.

Details

Motivation: User-generated content on platforms like YouTube and TikTok requires effective no-reference video quality assessment, but current methods struggle with modeling subjective scores for compressed content due to lack of fine-grained perceptual annotations.

Method: Uses BLIP-2 pretraining with quality-aware prompting that integrates video metadata and key fragments from inter-frame variations to generate fine-grained quality captions. Features are extracted and fused across semantic alignment, temporal characteristics, and spatial dimensions.

Result: Achieves state-of-the-art performance with SRCC: 0.928 and PLCC: 0.938, consistently outperforming existing NR-VQA methods across multiple UGC datasets without requiring costly manual annotations.

Conclusion: CAMP-VQA demonstrates that leveraging large vision-language models with quality-aware prompting effectively addresses NR-VQA challenges for user-generated content, achieving superior accuracy while eliminating the need for expensive fine-grained annotations.

Abstract: The prevalence of user-generated content (UGC) on platforms such as YouTube and TikTok has rendered no-reference (NR) perceptual video quality assessment (VQA) vital for optimizing video delivery. Nonetheless, the characteristics of non-professional acquisition and the subsequent transcoding of UGC video on sharing platforms present significant challenges for NR-VQA. Although NR-VQA models attempt to infer mean opinion scores (MOS), their modeling of subjective scores for compressed content remains limited due to the absence of fine-grained perceptual annotations of artifact types. To address these challenges, we propose CAMP-VQA, a novel NR-VQA framework that exploits the semantic understanding capabilities of large vision-language models. Our approach introduces a quality-aware prompting mechanism that integrates video metadata (e.g., resolution, frame rate, bitrate) with key fragments extracted from inter-frame variations to guide the BLIP-2 pretraining approach in generating fine-grained quality captions. A unified architecture has been designed to model perceptual quality across three dimensions: semantic alignment, temporal characteristics, and spatial characteristics. These multimodal features are extracted and fused, then regressed to video quality scores. Extensive experiments on a wide variety of UGC datasets demonstrate that our model consistently outperforms existing NR-VQA methods, achieving improved accuracy without the need for costly manual fine-grained annotations. Our method achieves the best performance in terms of average rank and linear correlation (SRCC: 0.928, PLCC: 0.938) compared to state-of-the-art methods. The source code and trained models, along with a user-friendly demo, are available at: https://github.com/xinyiW915/CAMP-VQA.

[1002] RRTS Dataset: A Benchmark Colonoscopy Dataset from Resource-Limited Settings for Computer-Aided Diagnosis Research

Ridoy Chandra Shil, Ragib Abid, Tasnia Binte Mamun, Samiul Based Shuvo, Masfique Ahmed Bhuiyan, Jahid Ferdous

Main category: eess.IV

TL;DR: The BUET Polyp Dataset (BPD) is a new colonoscopy image dataset addressing limitations of existing public datasets by including real-world artifacts and challenges, with benchmark results showing lower performance compared to curated datasets due to increased difficulty.

Details

Motivation: Existing public datasets for colorectal cancer prevention are limited by small sample sizes, curated image selection, and lack of real-world artifacts, creating a need for datasets that better reflect clinical practice complexity, especially in resource-constrained settings.

Method: Collected colonoscopy images using Olympus 170 and Pentax i-Scan series endoscopes under routine clinical conditions, with expert-annotated binary masks. Provided benchmark results using VGG16, ResNet50, and InceptionV3 for classification, and UNet variants with VGG16, ResNet34, and InceptionV4 backbones for segmentation.

Result: Dataset contains 1,288 polyp images from 164 patients with ground-truth masks and 1,657 polyp-free images from 31 patients. Benchmarking achieved 90.8% accuracy for classification (VGG16) and maximum Dice score of 0.64 for segmentation (InceptionV4-UNet), with lower performance than curated datasets due to real-world challenges.

Conclusion: The BPD dataset captures real-world clinical complexity with various artifacts, demonstrating that models trained on curated datasets may not generalize well to practical settings, highlighting the importance of datasets that reflect actual clinical challenges.

Abstract: Background and Objective: Colorectal cancer prevention relies on early detection of polyps during colonoscopy. Existing public datasets, such as CVC-ClinicDB and Kvasir-SEG, provide valuable benchmarks but are limited by small sample sizes, curated image selection, or lack of real-world artifacts. There remains a need for datasets that capture the complexity of clinical practice, particularly in resource-constrained settings. Methods: We introduce a dataset, BUET Polyp Dataset (BPD), of colonoscopy images collected using Olympus 170 and Pentax i-Scan series endoscopes under routine clinical conditions. The dataset contains images with corresponding expert-annotated binary masks, reflecting diverse challenges such as motion blur, specular highlights, stool artifacts, blood, and low-light frames. Annotations were manually reviewed by clinical experts to ensure quality. To demonstrate baseline performance, we provide benchmark results for classification using VGG16, ResNet50, and InceptionV3, and for segmentation using UNet variants with VGG16, ResNet34, and InceptionV4 backbones. Results: The dataset comprises 1,288 images with polyps from 164 patients with corresponding ground-truth masks and 1,657 polyp-free images from 31 patients. Benchmarking experiments achieved up to 90.8% accuracy for binary classification (VGG16) and a maximum Dice score of 0.64 with InceptionV4-UNet for segmentation. Performance was lower compared to curated datasets, reflecting the real-world difficulty of images with artifacts and variable quality.

[1003] Anatomy-Aware Lymphoma Lesion Detection in Whole-Body PET/CT

Simone Bendazzoli, Antonios Tzortzakakis, Andreas Abrahamsson, Björn Engelbrekt Wahlin, Örjan Smedby, Maria Holstensson, Rodrigo Moreno

Main category: eess.IV

TL;DR: Adding anatomical priors via organ segmentation masks improves lesion detection in CNN-based models like nnDetection but has minimal impact on vision transformers like Swin Transformer.

Details

Motivation: Accurate cancer lesion detection is challenging due to multiple lesions of varying sizes in PET/CT imaging, and anatomical context could potentially improve detection performance.

Method: Used organ segmentation masks from TotalSegmentator as auxiliary inputs to nnDetection (CNN-based) and Swin Transformer (vision transformer) for lesion detection on PET/CT images from AutoPET and Karolinska lymphoma datasets.

Result: Anatomical priors substantially improved detection performance in nnDetection framework but had almost no impact on Swin Transformer performance. Swin Transformer showed no clear advantages over CNN encoders.

Conclusion: Anatomical context plays a critical role in cancer lesion detection, particularly for CNN-based models, while vision transformers may not benefit as much from such priors.

Abstract: Early cancer detection is crucial for improving patient outcomes, and 18F FDG PET/CT imaging plays a vital role by combining metabolic and anatomical information. Accurate lesion detection remains challenging due to the need to identify multiple lesions of varying sizes. In this study, we investigate the effect of adding anatomy prior information to deep learning-based lesion detection models. In particular, we add organ segmentation masks from the TotalSegmentator tool as auxiliary inputs to provide anatomical context to nnDetection, which is the state-of-the-art for lesion detection, and Swin Transformer. The latter is trained in two stages that combine self-supervised pre-training and supervised fine-tuning. The method is tested in the AutoPET and Karolinska lymphoma datasets. The results indicate that the inclusion of anatomical priors substantially improves the detection performance within the nnDetection framework, while it has almost no impact on the performance of the vision transformer. Moreover, we observe that Swin Transformer does not offer clear advantages over conventional convolutional neural network (CNN) encoders used in nnDetection. These findings highlight the critical role of the anatomical context in cancer lesion detection, especially in CNN-based models.

[1004] TauFlow: Dynamic Causal Constraint for Complexity-Adaptive Lightweight Segmentation

Zidong Chen, Fadratul Hafinaz Hassan

Main category: eess.IV

TL;DR: TauFlow is a lightweight medical image segmentation model that uses brain-inspired dynamic feature response to handle lesion boundaries efficiently and reduce parameter conflicts, achieving high accuracy with under 0.5M parameters.

Details

Motivation: To address challenges in deploying lightweight segmentation models on edge devices: efficiently handling lesion boundary vs background contrast, and preventing accuracy drops in extremely lightweight designs (<0.5M parameters).

Method: Proposes TauFlow with two key innovations: 1) Convolutional Long-Time Constant Cell (ConvLTC) that dynamically regulates feature update rates (slow for backgrounds, fast for boundaries), and 2) STDP Self-Organizing Module that reduces encoder-decoder feature conflicts.

Result: Significantly reduces feature conflict rate from approximately 35%-40% to 8%-10% while maintaining high accuracy with extremely lightweight design (<0.5M parameters).

Conclusion: TauFlow successfully addresses the challenges of lightweight medical image segmentation through brain-inspired dynamic feature response mechanisms, enabling efficient edge deployment without sacrificing accuracy.

Abstract: Deploying lightweight medical image segmentation models on edge devices presents two major challenges: 1) efficiently handling the stark contrast between lesion boundaries and background regions, and 2) the sharp drop in accuracy that occurs when pursuing extremely lightweight designs (e.g., <0.5M parameters). To address these problems, this paper proposes TauFlow, a novel lightweight segmentation model. The core of TauFlow is a dynamic feature response strategy inspired by brain-like mechanisms. This is achieved through two key innovations: the Convolutional Long-Time Constant Cell (ConvLTC), which dynamically regulates the feature update rate to “slowly” process low-frequency backgrounds and “quickly” respond to high-frequency boundaries; and the STDP Self-Organizing Module, which significantly mitigates feature conflicts between the encoder and decoder, reducing the conflict rate from approximately 35%-40% to 8%-10%.

[1005] Validation of Fully-Automated Deep Learning-Based Fibroglandular Tissue Segmentation for Efficient and Reliable Quantitation of Background Parenchymal Enhancement in Breast MRI

Yu-Tzu Kuo, Anum S. Kazerouni, Vivian Y. Park, Wesley Surento, Suleeporn Sujichantararat, Daniel S. Hippe, Habib Rahbar, Savannah C. Partridge

Main category: eess.IV

TL;DR: Deep learning-based automated fibroglandular tissue segmentation shows better correlation with qualitative BPE assessments and higher quality scores compared to semi-automated methods.

Details

Motivation: Background parenchymal enhancement (BPE) on breast MRI is a potential breast cancer risk marker, but current qualitative assessment by radiologists lacks precision. Quantitative BPE measures could provide more accurate risk evaluation.

Method: Compared an existing open-source deep learning-based method for segmenting fibroglandular tissue to quantify BPE against a semi-automated fuzzy c-means method using breast MRI from 100 women. Evaluated segmentation agreement, BPE metric concordance, and associations with qualitative BPE.

Result: Both methods showed good agreement for quantitative BPE measurements, but DL-based measures had stronger correlation with qualitative BPE assessments and received higher quality scores from radiologists for FGT segmentations.

Conclusion: DL-based FGT segmentation enhances efficiency for objective BPE quantification and may improve standardized breast cancer risk assessment.

Abstract: Background parenchymal enhancement (BPE) on breast dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) shows potential as a breast cancer risk marker. Clinically, BPE is qualitatively assessed by radiologists, but quantitative BPE measures offer potential for more precise risk evaluation. This study evaluated an existing open-source, fully-automated deep learning-based (DL-based) method for segmenting fibroglandular tissue (FGT) to quantify BPE and compared it to a semi-automated fuzzy c-means method. Using breast MRI examinations from 100 women, we evaluated segmentation agreement, concordance across quantitative BPE metrics, and associations with qualitative BPE. The quality of FGT segmentations from both methods was scored by a radiologist. While the DL-based and semi-automated methods showed good agreement for quantitative BPE measurements, DL-based measures more strongly correlated with qualitative BPE assessments and DL-based segmentations were scored as higher quality by the radiologist. Our findings suggest that DL-based FGT segmentation enhances efficiency for objective BPE quantification and may improve standardized breast cancer risk assessment.

[1006] Task-Adaptive Low-Dose CT Reconstruction

Necati Sefercioglu, Mehmet Ozan Unal, Metin Ertas, Isa Yildirim

Main category: eess.IV

TL;DR: A task-adaptive CT reconstruction framework that uses a frozen pre-trained task network as regularization to preserve diagnostic details, achieving better segmentation performance than joint-training and traditional methods.

Details

Motivation: Current deep learning CT reconstruction methods achieve high metric scores but fail to preserve critical anatomical details needed for diagnostic tasks, limiting clinical applicability.

Method: Incorporates a frozen pre-trained task network as a regularization term in the reconstruction loss function, guiding reconstruction training while maintaining diagnostic quality without joint optimization.

Result: Achieved Dice scores up to 0.707 for liver and tumor segmentation, approaching full-dose scan performance (0.874) and outperforming joint-training (0.331) and traditional methods (0.626).

Conclusion: The framework can be easily integrated into existing deep learning reconstruction models through simple loss function modification, enabling widespread adoption for task-adaptive optimization in clinical practice.

Abstract: Deep learning-based low-dose computed tomography reconstruction methods already achieve high performance on standard image quality metrics like peak signal-to-noise ratio and structural similarity index measure. Yet, they frequently fail to preserve the critical anatomical details needed for diagnostic tasks. This fundamental limitation hinders their clinical applicability despite their high metric scores. We propose a novel task-adaptive reconstruction framework that addresses this gap by incorporating a frozen pre-trained task network as a regularization term in the reconstruction loss function. Unlike existing joint-training approaches that simultaneously optimize both reconstruction and task networks, and risk diverging from satisfactory reconstructions, our method leverages a pre-trained task model to guide reconstruction training while still maintaining diagnostic quality. We validate our framework on a liver and liver tumor segmentation task. Our task-adaptive models achieve Dice scores up to 0.707, approaching the performance of full-dose scans (0.874), and substantially outperforming joint-training approaches (0.331) and traditional reconstruction methods (0.626). Critically, our framework can be integrated into any existing deep learning-based reconstruction model through simple loss function modification, enabling widespread adoption for task-adaptive optimization in clinical practice. Our codes are available at: https://github.com/itu-biai/task_adaptive_ct

[1007] X-Diffusion: Generating Detailed 3D MRI Volumes From a Single Image Using Cross-Sectional Diffusion Models

Emmanuelle Bourigault, Abdullah Hamdi, Amir Jamaludin

Main category: eess.IV

TL;DR: X-Diffusion is a novel cross-sectional diffusion model that reconstructs detailed 3D MRI volumes from extremely sparse 2D inputs (as few as one slice), outperforming state-of-the-art methods while preserving critical anatomical features and demonstrating strong generalization capabilities.

Details

Motivation: Traditional MRI reconstruction requires extensive 3D data acquisition, making high-resolution scans slow and expensive. Current methods perform 3D-to-3D reconstruction and treat MRI as collections of 2D slices, limiting efficiency and requiring full 3D scans.

Method: X-Diffusion is a cross-sectional diffusion model that performs 2D-to-3D reconstruction by modeling MRI data as holistic 3D volumes during training and inference. It can reconstruct from as little as a single 2D slice or few slices, unlike previous approaches that require full 3D scans.

Result: X-Diffusion surpasses state-of-the-art methods in quantitative accuracy (PSNR) on brain tumor and full-body MRIs, preserves critical anatomical features (tumor profiles, spine curvature, brain volume), and generalizes beyond training domain (successfully reconstructs knee MRIs despite being trained only on brain data). Medical expert evaluations confirm clinical relevance.

Conclusion: X-Diffusion is the first method capable of producing detailed 3D MRIs from highly limited 2D input data, potentially accelerating MRI acquisition and reducing costs while maintaining diagnostic quality.

Abstract: Magnetic Resonance Imaging (MRI) is a crucial diagnostic tool, but high-resolution scans are often slow and expensive due to extensive data acquisition requirements. Traditional MRI reconstruction methods aim to expedite this process by filling in missing frequency components in the K-space, performing 3D-to-3D reconstructions that demand full 3D scans. In contrast, we introduce X-Diffusion, a novel cross-sectional diffusion model that reconstructs detailed 3D MRI volumes from extremely sparse spatial-domain inputs, achieving 2D-to-3D reconstruction from as little as a single 2D MRI slice or few slices. A key aspect of X-Diffusion is that it models MRI data as holistic 3D volumes during the cross-sectional training and inference, unlike previous learning approaches that treat MRI scans as collections of 2D slices in standard planes (coronal, axial, sagittal). We evaluated X-Diffusion on brain tumor MRIs from the BRATS dataset and full-body MRIs from the UK Biobank dataset. Our results demonstrate that X-Diffusion not only surpasses state-of-the-art methods in quantitative accuracy (PSNR) on unseen data but also preserves critical anatomical features such as tumor profiles, spine curvature, and brain volume. Remarkably, the model generalizes beyond the training domain, successfully reconstructing knee MRIs despite being trained exclusively on brain data. Medical expert evaluations further confirm the clinical relevance and fidelity of the generated images. To our knowledge, X-Diffusion is the first method capable of producing detailed 3D MRIs from highly limited 2D input data, potentially accelerating MRI acquisition and reducing associated costs. The code is available on the project website https://emmanuelleb985.github.io/XDiffusion/ .

[1008] Evaluating BM3D and NBNet: A Comprehensive Study of Image Denoising Across Multiple Datasets

Ghazal Kaviani, Reza Marzban, Ghassan AlRegib

Main category: eess.IV

TL;DR: Comparison of traditional BM3D and modern NBNet denoising methods across diverse datasets, showing BM3D excels in blur scenarios while NBNet performs better in complex noise environments like under/over-exposure.

Details

Motivation: To investigate and compare the effectiveness of traditional non-learning-based denoising (BM3D) versus modern learning-based methods (NBNet) across various real-world noise challenges and applications.

Method: Evaluated BM3D and NBNet across multiple datasets (CURE-OR, CURE-TSR, SSID+, Set-12, Chest-Xray) using 7 Image Quality Assessment metrics and analyzed impact on object detection performance.

Result: BM3D performs better in blur challenge scenarios, while NBNet is more effective in complex noise environments such as under-exposure and over-exposure conditions.

Conclusion: The study reveals distinct strengths and limitations of traditional vs. learning-based denoising methods, providing guidance for selecting appropriate denoising strategies based on specific real-world application requirements.

Abstract: This paper investigates image denoising, comparing traditional non-learning-based techniques, represented by Block-Matching 3D (BM3D), with modern learning-based methods, exemplified by NBNet. We assess these approaches across diverse datasets, including CURE-OR, CURE-TSR, SSID+, Set-12, and Chest-Xray, each presenting unique noise challenges. Our analysis employs seven Image Quality Assessment (IQA) metrics and examines the impact on object detection performance. We find that while BM3D excels in scenarios like blur challenges, NBNet is more effective in complex noise environments such as under-exposure and over-exposure. The study reveals the strengths and limitations of each method, providing insights into the effectiveness of different denoising strategies in varied real-world applications.

Jun-En Ding, Chien-Chin Hsu, Chi-Hsiang Chu, Shuqiang Wang, Feng Liu

Main category: eess.IV

TL;DR: CGMCL framework integrates multimodal medical data using cross-modality graphs and contrastive learning to improve disease classification accuracy and interpretability.

Details

Motivation: Traditional medical image classification focuses only on unimodal image data, neglecting valuable non-image patient information that could enhance diagnosis.

Method: Constructs cross-modality graphs between image and non-image data, uses contrastive learning to align features in shared latent space, and includes inter-modality feature scaling to reduce heterogeneity gaps.

Result: Outperforms unimodal methods on Parkinson’s disease and melanoma datasets, achieving higher accuracy, better interpretability, and improved early disease prediction.

Conclusion: CGMCL effectively integrates multimodal medical data for superior classification performance while providing enhanced disease interpretability and predictive capabilities.

Abstract: The classification of medical images is a pivotal aspect of disease diagnosis, often enhanced by deep learning techniques. However, traditional approaches typically focus on unimodal medical image data, neglecting the integration of diverse non-image patient data. This paper proposes a novel Cross-Graph Modal Contrastive Learning (CGMCL) framework for multimodal structured data from different data domains to improve medical image classification. The model effectively integrates both image and non-image data by constructing cross-modality graphs and leveraging contrastive learning to align multimodal features in a shared latent space. An inter-modality feature scaling module further optimizes the representation learning process by reducing the gap between heterogeneous modalities. The proposed approach is evaluated on two datasets: a Parkinson’s disease (PD) dataset and a public melanoma dataset. Results demonstrate that CGMCL outperforms conventional unimodal methods in accuracy, interpretability, and early disease prediction. Additionally, the method shows superior performance in multi-class melanoma classification. The CGMCL framework provides valuable insights into medical image classification while offering improved disease interpretability and predictive capabilities.

[1010] MAROON: A Dataset for the Joint Characterization of Near-Field High-Resolution Radio-Frequency and Optical Depth Imaging Techniques

Vanessa Wirth, Johanna Bräunig, Nikolai Hofmann, Martin Vossiek, Tim Weyrich, Marc Stamminger

Main category: eess.IV

TL;DR: Characterization and comparison of optical depth sensors and imaging radar for close-range applications through multimodal spatial calibration and comprehensive evaluation of depth measurements across different materials, geometries, and distances.

Details

Motivation: There is limited research on combining optical depth sensors and radars for close-range applications (decimeters away), especially with growing interest in high-resolution imaging radars operating in the near field.

Method: Used multimodal spatial calibration to jointly characterize four depth imagers (three optical sensors with varying operation principles and one imaging radar), collecting data across different object materials, geometries, and object-to-sensor distances.

Result: Revealed scattering effects of partially transmissive materials and investigated radio-frequency signal responses. Created comprehensive evaluation of depth measurements across different conditions.

Conclusion: All object measurements will be made publicly available as a multimodal dataset called MAROON to support further research in this area.

Abstract: Utilizing the complementary strengths of wavelength-specific range or depth sensors is crucial for robust computer-assisted tasks such as autonomous driving. Despite this, there is still little research done at the intersection of optical depth sensors and radars operating close range, where the target is decimeters away from the sensors. Together with a growing interest in high-resolution imaging radars operating in the near field, the question arises how these sensors behave in comparison to their traditional optical counterparts. In this work, we take on the unique challenge of jointly characterizing depth imagers from both, the optical and radio-frequency domain using a multimodal spatial calibration. We collect data from four depth imagers, with three optical sensors of varying operation principle and an imaging radar. We provide a comprehensive evaluation of their depth measurements with respect to distinct object materials, geometries, and object-to-sensor distances. Specifically, we reveal scattering effects of partially transmissive materials and investigate the response of radio-frequency signals. All object measurements will be made public in form of a multimodal dataset, called MAROON.

[1011] Unsupervised Multi-Parameter Inverse Solving for Reducing Ring Artifacts in 3D X-Ray CBCT

Qing Wu, Hongjiang Wei, Jingyi Yu, Yuyao Zhang

Main category: eess.IV

TL;DR: Riner is an unsupervised ring artifact reduction method for 3D CBCT that formulates the problem as a multi-parameter inverse problem, learning artifact-free images and physical parameters directly from CT measurements without external training data.

Details

Motivation: Existing supervised methods struggle with complex real-world acquisitions, don't fully capture ring artifact physics, and have high memory demands for 3D CBCT applications.

Method: Reformulates RAR as multi-parameter inverse problem with differentiable forward model; jointly learns implicit neural representation of artifact-free images and estimates physical parameters directly from CT measurements using ray-based optimization.

Result: Outperforms existing state-of-the-art supervised methods on both simulated and real-world datasets.

Conclusion: Riner provides an effective unsupervised alternative to supervised methods, with better physical understanding of ring artifacts and memory-efficient optimization suitable for large-scale 3D CBCT.

Abstract: Ring artifacts are prevalent in 3D cone-beam computed tomography (CBCT) due to non-ideal responses of X-ray detectors, substantially affecting image quality and diagnostic reliability. Existing state-of-the-art (SOTA) ring artifact reduction (RAR) methods rely on supervised learning with large-scale paired CT datasets. While effective in-domain, supervised methods tend to struggle to fully capture the physical characteristics of ring artifacts, leading to pronounced performance drops in complex real-world acquisitions. Moreover, their scalability to 3D CBCT is limited by high memory demands. In this work, we propose Riner, a new unsupervised RAR method. Based on a theoretical analysis of ring artifact formation, we reformulate RAR as a multi-parameter inverse problem, where the non-ideal responses of X-ray detectors are parameterized as solvable physical variables. Using a new differentiable forward model, Riner can jointly learn the implicit neural representation of artifact-free images and estimate the physical parameters directly from CT measurements, without external training data. Additionally, Riner is memory-friendly due to its ray-based optimization, enhancing its usability in large-scale 3D CBCT. Experiments on both simulated and real-world datasets show Riner outperforms existing SOTA supervised methods.

[1012] Physics-informed DeepCT: Sinogram Wavelet Decomposition Meets Masked Diffusion

Zekun Zhou, Tan Liu, Bing Yu, Yanru Gong, Liu Shi, Qiegen Liu

Main category: eess.IV

TL;DR: SWARM is a novel diffusion model for sparse-view CT reconstruction that uses random mask strategies and wavelet decomposition to enhance generalization and detail capture, outperforming existing methods.

Details

Motivation: Diffusion models for SVCT reconstruction suffer from limited generalization when trained on constrained sample spaces, leading to blurry details and regional inconsistencies in generated images.

Method: Proposes SWARM with: 1) random mask strategy in sinogram to expand training sample space, 2) random training on high-frequency wavelet components for better feature representation, and 3) two-stage iterative reconstruction for global consistency and detail refinement.

Result: Experimental results show SWARM outperforms competing approaches in both quantitative and qualitative performance across various datasets.

Conclusion: SWARM effectively addresses generalization limitations in SVCT reconstruction through innovative random strategies and wavelet decomposition, achieving superior reconstruction quality.

Abstract: Diffusion model shows remarkable potential on sparse-view computed tomography (SVCT) reconstruction. However, when a network is trained on a limited sample space, its generalization capability may be constrained, which degrades performance on unfamiliar data. For image generation tasks, this can lead to issues such as blurry details and inconsistencies between regions. To alleviate this problem, we propose a Sinogram-based Wavelet random decomposition And Random mask diffusion Model (SWARM) for SVCT reconstruction. Specifically, introducing a random mask strategy in the sinogram effectively expands the limited training sample space. This enables the model to learn a broader range of data distributions, enhancing its understanding and generalization of data uncertainty. In addition, applying a random training strategy to the high-frequency components of the sinogram wavelet enhances feature representation and improves the ability to capture details in different frequency bands, thereby improving performance and robustness. Two-stage iterative reconstruction method is adopted to ensure the global consistency of the reconstructed image while refining its details. Experimental results demonstrate that SWARM outperforms competing approaches in both quantitative and qualitative performance across various datasets.

[1013] CT Radiomics-Based Explainable Machine Learning Model for Accurate Differentiation of Malignant and Benign Endometrial Tumors: A Two-Center Study

Tingrui Zhang, Honglin Wu, Zekun Jiang, Yingying Wang, Rui Ye, Huiming Ni, Chang Liu, Jin Cao, Xuan Sun, Rong Shao, Xiaorong Wei, Yingchun Sun

Main category: eess.IV

TL;DR: Developed and validated a CT radiomics-based explainable ML model for diagnosing endometrial cancer malignancy vs benignity, achieving 0.96 testing AUROC with Random Forest as optimal model.

Details

Motivation: To develop an explainable machine learning model for precise diagnosis of malignancy and benignity in endometrial cancer patients using CT radiomics features.

Method: 83 EC patients from two centers, manual ROI segmentation from pre-surgical CT scans, 1132 radiomic features extracted using Pyradiomics, six explainable ML algorithms tested, SHAP analysis and feature mapping for explainability.

Result: Random Forest model achieved training AUROC of 1.00 and testing AUROC of 0.96, SHAP identified significant radiomic features (P < 0.05), DCA showed higher net benefit than ‘All’ and ‘None’ strategies.

Conclusion: CT radiomics-based explainable ML model achieved high diagnostic performance and can serve as an intelligent auxiliary tool for endometrial cancer diagnosis.

Abstract: Aimed to develop and validate a CT radiomics-based explainable machine learning model for precise diagnosing malignancy and benignity specifically in endometrial cancer (EC) patients. A total of 83 EC patients from two centers, including 46 with malignant and 37 with benign conditions, were included, with data split into a training set (n=59) and a testing set (n=24). The regions of interest (ROIs) were manually segmented from pre-surgical CT scans, and 1132 radiomic features were extracted from the pre-surgical CT scans using Pyradiomics. Six explainable machine learning (ML) modeling algorithms were implemented respectively, for determining the optimal radiomics pipeline. The diagnostic performance of the radiomic model was evaluated by using sensitivity, specificity, accuracy, precision, F1 score, AUROC, and AUPRC. To enhance clinical understanding and usability, we separately implemented SHAP analysis and feature mapping visualization, and evaluated the calibration curve and decision curve. By comparing six modeling strategies, the Random Forest model emerged as the optimal choice for diagnosing EC, with a training AUROC of 1.00 and a testing AUROC of 0.96. SHAP identified the most important radiomic features, revealing that all selected features were significantly associated with EC (P < 0.05). Radiomics feature maps also provide a feasible assessment tool for clinical applications. Decision Curve Analysis (DCA) indicated a higher net benefit for our model compared to the “All” and “None” strategies, suggesting its clinical utility in identifying high-risk cases and reducing unnecessary interventions. In conclusion, the CT radiomics-based explainable ML model achieved high diagnostic performance, which could be used as an intelligent auxiliary tool for the diagnosis of endometrial cancer.

[1014] A multi-dynamic low-rank deep image prior (ML-DIP) for 3D real-time cardiovascular MRI

Chong Chen, Marc Vornehm, Zhenyu Bu, Preethi Chandrasekaran, Muhammad A. Sultan, Syed M. Arshad, Yingmin Liu, Yuchi Han, Rizwan Ahmad

Main category: eess.IV

TL;DR: ML-DIP enables high-quality 3D real-time cardiovascular MRI with >1000x acceleration by learning spatial and motion representations directly from undersampled data, without requiring fully sampled training datasets.

Details

Motivation: To develop a reconstruction framework for 3D real-time cine cardiovascular MRI that works with highly undersampled data and doesn't require fully sampled training datasets, addressing limitations of traditional methods.

Method: Multi-dynamic low-rank deep image prior (ML-DIP) framework using separate neural networks for spatial image content and deformation fields, optimized per scan directly from undersampled k-space data.

Result: Achieved PSNR >29 dB and SSIM >0.90 in phantom studies for 2-minute scans, preserved cardiac/respiratory motion and PVC events, yielded comparable functional measurements to 2D cine with better image quality than 5D-Cine, and successfully reconstructed irregular beats in PVC patients.

Conclusion: ML-DIP enables high-quality 3D real-time CMR with acceleration factors exceeding 1,000 by learning low-rank spatial and motion representations from undersampled data, eliminating the need for external fully sampled training datasets.

Abstract: Purpose: To develop a reconstruction framework for 3D real-time cine cardiovascular magnetic resonance (CMR) from highly undersampled data without requiring fully sampled training datasets. Methods: We developed a multi-dynamic low-rank deep image prior (ML-DIP) framework that models spatial image content and deformation fields using separate neural networks. These networks are optimized per scan to reconstruct the dynamic image series directly from undersampled k-space data. ML-DIP was evaluated on (i) a 3D cine digital phantom with simulated premature ventricular contractions (PVCs), (ii) ten healthy subjects (including two scanned during both rest and exercise), and (iii) 12 patients with a history of PVCs. Phantom results were assessed using peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM). In vivo performance was evaluated by comparing left-ventricular function quantification (against 2D real-time cine) and image quality (against 2D real-time cine and binning-based 5D-Cine). Results: In the phantom study, ML-DIP achieved PSNR > 29 dB and SSIM > 0.90 for scan times as short as two minutes, while recovering cardiac motion, respiratory motion, and PVC events. In healthy subjects, ML-DIP yielded functional measurements comparable to 2D cine and higher image quality than 5D-Cine, including during exercise with high heart rates and bulk motion. In PVC patients, ML-DIP preserved beat-to-beat variability and reconstructed irregular beats, whereas 5D-Cine showed motion artifacts and information loss due to binning. Conclusion: ML-DIP enables high-quality 3D real-time CMR with acceleration factors exceeding 1,000 by learning low-rank spatial and motion representations from undersampled data, without relying on external fully sampled training datasets.

[1015] Onboard Hyperspectral Super-Resolution with Deep Pushbroom Neural Network

Davide Piccinini, Diego Valsesia, Enrico Magli

Main category: eess.IV

TL;DR: DPSR is a lightweight neural network for hyperspectral image super-resolution that processes images line-by-line to match pushbroom sensor acquisition, enabling real-time onboard processing on satellites.

Details

Motivation: Hyperspectral sensors have fine spectral resolution but limited spatial resolution. There's growing need for lightweight super-resolution methods that can run onboard satellites in real time to improve downstream detection capabilities.

Method: Proposed Deep Pushbroom Super-Resolution (DPSR) network that processes images line-by-line in the along-track direction with causal memory mechanism to exploit previously acquired lines, matching pushbroom sensor acquisition patterns.

Result: DPSR achieves onboard real-time performance on low-power hardware, super-resolving each line in the time it takes to acquire the next one. Quality is competitive or outperforms more complex state-of-the-art methods.

Conclusion: DPSR provides an efficient solution for hyperspectral image super-resolution that enables real-time onboard processing while maintaining high quality results, making it suitable for satellite deployment.

Abstract: Hyperspectral imagers on satellites obtain the fine spectral signatures essential for distinguishing one material from another at the expense of limited spatial resolution. Enhancing the latter is thus a desirable preprocessing step in order to further improve the detection capabilities offered by hyperspectral images on downstream tasks. At the same time, there is a growing interest towards deploying inference methods directly onboard of satellites, which calls for lightweight image super-resolution methods that can be run on the payload in real time. In this paper, we present a novel neural network design, called Deep Pushbroom Super-Resolution (DPSR) that matches the pushbroom acquisition of hyperspectral sensors by processing an image line by line in the along-track direction with a causal memory mechanism to exploit previously acquired lines. This design greatly limits memory requirements and computational complexity, achieving onboard real-time performance, i.e., the ability to super-resolve a line in the time it takes to acquire the next one, on low-power hardware. Experiments show that the quality of the super-resolved images is competitive or even outperforms state-of-the-art methods that are significantly more complex.

[1016] Beyond Data Scarcity Optimizing R3GAN for Medical Image Generation from Small Datasets

Tsung-Wei Pan, Chang-Hong Wu, Jung-Hua Wang, Ming-Jer Chen, Yu-Chiao Yi, Tsung-Hsien Lee

Main category: eess.IV

TL;DR: This paper presents optimized GAN training strategies for small medical imaging datasets, using embryo time-lapse imaging as a case study. The approach successfully generates realistic images to address class imbalance, significantly improving classification performance.

Details

Motivation: Medical image datasets often suffer from class imbalance and limited sample sizes, which is particularly challenging in clinical imaging. The study aims to address these issues using generative adversarial networks optimized for small datasets.

Method: The researchers used R3GAN with systematic experiments to establish effective training strategies. They designed an optimized configuration for 256x256-resolution datasets featuring a full burn-in phase and a low, gradually increasing gamma range (5 to 40). Generated samples were used to balance an imbalanced embryo dataset.

Result: The approach substantially improved classification performance. The recall and F1-score of the three-cell (t3) class increased from 0.06 to 0.69 and from 0.11 to 0.60 respectively, without compromising performance of other classes.

Conclusion: Tailored R3GAN training strategies can effectively alleviate data scarcity and improve model robustness in small-scale medical imaging tasks, demonstrating practical value for addressing class imbalance in clinical imaging datasets.

Abstract: Medical image datasets frequently exhibit significant class imbalance, a challenge that is further amplified by the inherently limited sample sizes that characterize clinical imaging data. Using human embryo time-lapse imaging (TLI) as a case study, this work investigates how generative adversarial networks (GANs) can be optimized for small datasets to generate realistic and diagnostically meaningful images. Based on systematic experiments with R3GAN, we established effective training strategies and designed an optimized configuration for 256x256-resolution datasets, featuring a full burn-in phase and a low, gradually increasing gamma range (5 to 40). The generated samples were used to balance an imbalanced embryo dataset, leading to substantial improvement in classification performance. The recall and F1-score of the three-cell (t3) class increased from 0.06 to 0.69 and from 0.11 to 0.60, respectively, without compromising the performance of other classes. These results demonstrate that tailored R3GAN training strategies can effectively alleviate data scarcity and improve model robustness in small-scale medical imaging tasks.

Today’s Research Highlights

Table of Contents

cs.CL

[1] Factual and Musical Evaluation Metrics for Music Language Models

[2] Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation

[3] Persian Musical Instruments Classification Using Polyphonic Data Augmentation

[4] Retracing the Past: LLMs Emit Training Data When They Get Lost

[5] Beyond One-Size-Fits-All: Personalized Harmful Content Detection with In-Context Learning

[6] MCP4IFC: IFC-Based Building Design Using Large Language Models

[7] FlowMM: Cross-Modal Information Flow Guided KV Cache Merging for Efficient Multimodal Context Inference

[8] ReMoD: Rethinking Modality Contribution in Multimodal Stance Detection via Dual Reasoning

[9] Future of AI Models: A Computational perspective on Model collapse

[10] MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making

[11] Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

[12] Sample-Efficient Language Modeling with Linear Attention and Lightweight Enhancements

[13] UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8

[14] Optimizing Diversity and Quality through Base-Aligned Model Collaboration

[15] OckBench: Measuring the Efficiency of LLM Reasoning

[16] In-Context Learning Without Copying

[17] CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition

[18] Multi-Scale Feature Fusion and Graph Neural Network Integration for Text Classification with Large Language Models

[19] Language Generation: Complexity Barriers and Implications for Learning

[20] DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning

[21] Compositional Phoneme Approximation for L1-Grounded L2 Pronunciation Training

[22] Quantifying Edits Decay in Fine-tuned LLMs

[23] MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation

[24] Retrieval-Augmented Generation in Medicine: A Scoping Review of Technical Implementations, Clinical Applications, and Ethical Considerations

[25] NILC: Discovering New Intents with LLM-assisted Clustering

[26] How Does a Deep Neural Network Look at Lexical Stress?

[27] IDALC: A Semi-Supervised Framework for Intent Detection and Active Learning based Correction

[28] Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs

[29] Interpretable Recognition of Cognitive Distortions in Natural Language Texts

[30] Revisiting Entropy in Reinforcement Learning for Large Reasoning Models

[31] LLMs Do Not See Age: Assessing Demographic Bias in Automated Systematic Review Synthesis

[32] Multi-Reward GRPO Fine-Tuning for De-biasing Large Language Models: A Study Based on Chinese-Context Discrimination Data

[33] Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts

[34] Efficient Hate Speech Detection: A Three-Layer LoRA-Tuned BERTweet Framework

[35] Automating Hardware Design and Verification from Architectural Papers via a Neural-Symbolic Graph Framework

[36] Stemming Hallucination in Language Models Using a Licensing Oracle

[37] MuonAll: Muon Variant for Efficient Finetuning of Large Language Models

[38] Evaluation of retrieval-based QA on QUEST-LOFT

[39] Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models

[40] BookAsSumQA: An Evaluation Framework for Aspect-Based Book Summarization via Question Answering

[41] Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning

[42] Explicit Knowledge-Guided In-Context Learning for Early Detection of Alzheimer’s Disease

[43] SPA: Achieving Consensus in LLM Alignment via Self-Priority Optimization

[44] Overview of CHIP 2025 Shared Task 2: Discharge Medication Recommendation for Metabolic Diseases Based on Chinese Electronic Health Records

[45] Analyzing and Mitigating Negation Artifacts using Data Augmentation for Improving ELECTRA-Small Model Accuracy

[46] TimeSense:Making Large Language Models Proficient in Time-Series Analysis

[47] HatePrototypes: Interpretable and Transferable Representations for Implicit and Explicit Hate Speech Detection

[48] SugarTextNet: A Transformer-Based Framework for Detecting Sugar Dating-Related Content on Social Media with Context-Aware Focal Loss

[49] How Well Do LLMs Understand Drug Mechanisms? A Knowledge + Reasoning Evaluation Dataset

[50] Dutch Metaphor Extraction from Cancer Patients’ Interviews and Forum Data using LLMs and Human in the Loop

[51] Towards Resource-Efficient Multimodal Intelligence: Learned Routing among Specialized Expert Models

[52] SR-KI: Scalable and Real-Time Knowledge Integration into LLMs via Supervised Attention

[53] Rethinking what Matters: Effective and Robust Multilingual Realignment for Low-Resource Languages

[54] You Had One Job: Per-Task Quantization Using LLMs’ Hidden Representations

[55] Better Datasets Start From RefineLab: Automatic Optimization for High-Quality Dataset Refinement

[56] Ibom NLP: A Step Toward Inclusive Natural Language Processing for Nigeria’s Minority Languages

[57] Rep2Text: Decoding Full Text from a Single LLM Token Representation

[58] TabRAG: Tabular Document Retrieval via Structured Language Representations

[59] Duality-based Mode Operations and Pyramid Multilayer Mapping for Rhetorical Modes

[60] How AI Fails: An Interactive Pedagogical Tool for Demonstrating Dialectal Bias in Automated Toxicity Models

[61] Steering LLMs toward Korean Local Speech: Iterative Refinement Framework for Faithful Dialect Translation

[62] Textual Self-attention Network: Test-Time Preference Optimization through Textual Gradient-based Attention

[63] Sentiment Analysis On YouTube Comments Using Machine Learning Techniques Based On Video Games Content

[64] Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights

[65] Sensitivity of Small Language Models to Fine-tuning Data Contamination

[66] SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces

[67] Learning to Focus: Focal Attention for Selective and Scalable Transformers

[68] AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts

[69] Beyond Plain Demos: A Demo-centric Anchoring Paradigm for In-Context Learning in Alzheimer’s Disease Detection

[70] Inclusion of Role into Named Entity Recognition and Ranking

[71] EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers

[72] RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation

[73] HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection

[74] SCOPE: Intrinsic Semantic Space Control for Mitigating Copyright Infringement in LLMs

[75] Automated Circuit Interpretation via Probe Prompting

[76] Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs

[77] A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation