Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 119]
- cs.CV [Total: 345]
- cs.AI [Total: 82]
- cs.SD [Total: 17]
- cs.LG [Total: 247]
- cs.MA [Total: 10]
- cs.MM [Total: 2]
- eess.AS [Total: 5]
- eess.IV [Total: 21]
cs.CL
[1] Empathy by Design: Aligning Large Language Models for Healthcare Dialogue
Emre Umucu, Guillermina Solis, Leon Garza, Emilia Rivas, Beatrice Lee, Anantaa Kotal, Aritran Piplai
Main category: cs.CL
TL;DR: A DPO-based alignment framework improves factual correctness and empathetic communication in healthcare LLMs for caregiver-patient dialogues.
Details
Motivation: General-purpose LLMs have limitations in healthcare applications due to factual unreliability and lack of empathetic communication, posing risks for non-professionals and caregivers seeking medical guidance or emotional reassurance.Method: Direct Preference Optimization (DPO)-based alignment framework fine-tunes domain-adapted LLMs using pairwise preference data, where preferred responses reflect supportive/accessible communication and rejected ones represent prescriptive/technical tones.
Result: DPO-tuned models achieve higher semantic alignment, improved factual accuracy, and stronger human-centric evaluation scores compared to baseline and commercial alternatives like Google medical dialogue systems.
Conclusion: Preference-based alignment offers a scalable and transparent pathway toward developing trustworthy, empathetic, and clinically informed AI assistants for healthcare communication.
Abstract: General-purpose large language models (LLMs) have demonstrated remarkable generative and reasoning capabilities but remain limited in healthcare and caregiving applications due to two key deficiencies: factual unreliability and a lack of empathetic communication. These shortcomings pose significant risks in sensitive contexts where users, particularly non-professionals and caregivers, seek medically relevant guidance or emotional reassurance. To address these challenges, we introduce a Direct Preference Optimization (DPO)-based alignment framework designed to improve factual correctness, semantic coherence, and human-centric qualities such as empathy, politeness, and simplicity in caregiver-patient dialogues. Our approach fine-tunes domain-adapted LLMs using pairwise preference data, where preferred responses reflect supportive and accessible communication styles while rejected ones represent prescriptive or overly technical tones. This direct optimization method aligns model outputs with human preferences more efficiently than traditional reinforcement-learning-based alignment. Empirical evaluations across multiple open and proprietary LLMs show that our DPO-tuned models achieve higher semantic alignment, improved factual accuracy, and stronger human-centric evaluation scores compared to baseline and commercial alternatives such as Google medical dialogue systems. These improvements demonstrate that preference-based alignment offers a scalable and transparent pathway toward developing trustworthy, empathetic, and clinically informed AI assistants for caregiver and healthcare communication. Our open-source code is available at: https://github.com/LeonG19/Empathy-by-Design
[2] Morphologically-Informed Tokenizers for Languages with Non-Concatenative Morphology: A case study of Yoloxóchtil Mixtec ASR
Chris Crawford
Main category: cs.CL
TL;DR: Novel morphologically-informed tokenizers for Yoloxóchitl Mixtec ASR improve annotation efficiency while being competitive with traditional BPE/Unigram models.
Details
Motivation: To improve efficiency and reduce human workload in interlinear gloss annotation of Yoloxóchitl Mixtec audio corpus by developing tokenizers that better handle the language's non-concatenative tonal morphology.Method: Developed two novel nonlinear tokenization schemes: 1) Segment and Melody tokenizer that extracts tones without predicting segmentation, and 2) Sequence of Processes tokenizer that predicts segmentation. Compared these against traditional BPE and Unigram models using ASR and text-based sequence-to-sequence tools.
Result: The novel tokenizers are competitive with BPE and Unigram models. The Segment-and-Melody model outperforms traditional tokenizers in word error rate but not character error rate. Morphological and information-theoretic metrics show predictive correlations with downstream performance.
Conclusion: Nonlinear tokenizers designed for non-concatenative morphology are competitive with conventional models for ASR, suggesting potential for improved annotation efficiency, though further research is needed for downstream task applicability.
Abstract: This paper investigates the impact of using morphologically-informed tokenizers to aid and streamline the interlinear gloss annotation of an audio corpus of Yoloxóchitl Mixtec (YM) using a combination of ASR and text-based sequence-to-sequence tools, with the goal of improving efficiency while reducing the workload of a human annotator. We present two novel tokenization schemes that separate words in a nonlinear manner, preserving information about tonal morphology as much as possible. One of these approaches, a Segment and Melody tokenizer, simply extracts the tones without predicting segmentation. The other, a Sequence of Processes tokenizer, predicts segmentation for the words, which could allow an end-to-end ASR system to produce segmented and unsegmented transcriptions in a single pass. We find that these novel tokenizers are competitive with BPE and Unigram models, and the Segment-and-Melody model outperforms traditional tokenizers in terms of word error rate but does not reach the same character error rate. In addition, we analyze tokenizers on morphological and information-theoretic metrics to find predictive correlations with downstream performance. Our results suggest that nonlinear tokenizers designed specifically for the non-concatenative morphology of a language are competitive with conventional BPE and Unigram models for ASR. Further research will be necessary to determine the applicability of these tokenizers in downstream processing tasks.
[3] Do You Feel Comfortable? Detecting Hidden Conversational Escalation in AI Chatbots
Jihyung Park, Saleh Afroogh, Junfeng Jiao
Main category: cs.CL
TL;DR: GAUGE is a lightweight framework for detecting implicit emotional harm in LLM conversations by measuring probabilistic shifts in affective states, addressing limitations of traditional toxicity filters.
Details
Motivation: LLMs are increasingly used as emotional companions, but they can cause implicit harm through emotional reinforcement or affective drift that traditional toxicity filters miss. Existing guardrails rely on external classifiers or clinical rubrics that can't capture real-time conversational dynamics.Method: GAUGE is a lightweight, logit-based framework that measures how an LLM’s output probabilistically shifts the affective state of a dialogue in real-time, detecting hidden conversational escalation.
Result: The paper proposes a novel framework for real-time detection of implicit emotional harm in LLM conversations, addressing the gap in existing safety mechanisms.
Conclusion: GAUGE provides a practical solution for detecting subtle emotional escalation in LLM interactions that traditional approaches miss, enabling better protection against implicit harm in AI emotional companions.
Abstract: Large Language Models (LLM) are increasingly integrated into everyday interactions, serving not only as information assistants but also as emotional companions. Even in the absence of explicit toxicity, repeated emotional reinforcement or affective drift can gradually escalate distress in a form of \textit{implicit harm} that traditional toxicity filters fail to detect. Existing guardrail mechanisms often rely on external classifiers or clinical rubrics that may lag behind the nuanced, real-time dynamics of a developing conversation. To address this gap, we propose GAUGE (Guarding Affective Utterance Generation Escalation), a lightweight, logit-based framework for the real-time detection of hidden conversational escalation. GAUGE measures how an LLM’s output probabilistically shifts the affective state of a dialogue.
[4] Automated Data Enrichment using Confidence-Aware Fine-Grained Debate among Open-Source LLMs for Mental Health and Online Safety
Junyu Mao, Anthony Hills, Talia Tseriotou, Maria Liakata, Aya Shamir, Dan Sayda, Dana Atzil-Slonim, Natalie Djohari, Arpan Mandal, Silke Roth, Pamela Ugwudike, Mahesan Niranjan, Stuart E. Middleton
Main category: cs.CL
TL;DR: The paper introduces a Confidence-Aware Fine-Grained Debate (CFD) framework where multiple LLM agents simulate human annotators to enrich datasets with real-world indicators for NLP tasks like mental health and online safety analysis.
Details
Motivation: Real-world indicators (life events for mental health, risky behavior for online safety) are important for NLP tasks but are costly and difficult to label due to their dynamic nature and annotation challenges.Method: Proposes a Confidence-Aware Fine-Grained Debate (CFD) framework where multiple LLM agents simulate human annotators, exchange fine-grained evidence, and reach consensus. Also introduces two expert-annotated datasets: mental health Reddit wellbeing dataset and online safety Facebook sharenting risk dataset.
Result: CFD framework achieves the most robust data enrichment performance compared to baselines. Data enrichment consistently improves downstream tasks, with enriched features via debate transcripts yielding the largest gains (10.1% improvement over non-enriched baseline for online safety task).
Conclusion: The CFD framework effectively addresses the challenge of labeling dynamic real-world indicators, demonstrating superior performance in data enrichment and significant improvements in downstream NLP tasks for mental health and online safety applications.
Abstract: Real-world indicators are important for improving natural language processing (NLP) tasks such as life events for mental health analysis and risky behaviour for online safety, yet labelling such information in NLP training datasets is often costly and/or difficult given the dynamic nature of such events. This paper compares several LLM-based data enrichment methods and introduces a novel Confidence-Aware Fine-Grained Debate (CFD) framework in which multiple LLM agents simulate human annotators and exchange fine-grained evidence to reach consensus. We describe two new expert-annotated datasets, a mental health Reddit wellbeing dataset and an online safety Facebook sharenting risk dataset. Our CFD framework achieves the most robust data enrichment performance compared to a range of baselines and we show that this type of data enrichment consistently improves downstream tasks. Enriched features incorporated via debate transcripts yield the largest gains, outperforming the non-enriched baseline by 10.1% for the online safety task.
[5] Policy-based Sentence Simplification: Replacing Parallel Corpora with LLM-as-a-Judge
Xuanxin Wu, Yuki Arase, Masaaki Nagata
Main category: cs.CL
TL;DR: LLM-as-a-Judge approach automatically creates policy-aligned training data for sentence simplification without human annotation, enabling small LLMs to outperform GPT-4o on lexical simplification.
Details
Motivation: Different applications require distinct simplification policies (lexical vs. full rewriting), but achieving policy-driven control remains challenging due to the need for costly human annotation or parallel corpora.Method: Leverages Large Language Model-as-a-Judge (LLM-as-a-Judge) to automatically construct policy-aligned training data, completely removing the need for human annotation or parallel corpora.
Result: Small-scale open-source LLMs (Phi-3-mini-3.8B) surpass GPT-4o on lexical-oriented simplification and achieve comparable performance on overall rewriting, verified by both automatic metrics and human evaluations.
Conclusion: The approach enables building simplification systems that adapt to diverse simplification policies, with consistent improvements across model families and sizes demonstrating robustness.
Abstract: Sentence simplification aims to modify a sentence to make it easier to read and understand while preserving the meaning. Different applications require distinct simplification policies, such as replacing only complex words at the lexical level or rewriting the entire sentence while trading off details for simplicity. However, achieving such policy-driven control remains an open challenge. In this work, we introduce a simple yet powerful approach that leverages Large Language Model-as-a-Judge (LLM-as-a-Judge) to automatically construct policy-aligned training data, completely removing the need for costly human annotation or parallel corpora. Our method enables building simplification systems that adapt to diverse simplification policies. Remarkably, even small-scale open-source LLMs such as Phi-3-mini-3.8B surpass GPT-4o on lexical-oriented simplification, while achieving comparable performance on overall rewriting, as verified by both automatic metrics and human evaluations. The consistent improvements across model families and sizes demonstrate the robustness of our approach.
[6] LOCUS: A System and Method for Low-Cost Customization for Universal Specialization
Dhanasekar Sundararaman, Keying Li, Wayne Xiong, Aashna Garg
Main category: cs.CL
TL;DR: LOCUS is a low-cost pipeline for customizing NLP models using few-shot data through retrieval, synthetic data generation, and parameter-efficient tuning, achieving near-full fine-tuning accuracy with minimal memory footprint.
Details
Motivation: To address the high costs and resource requirements of fine-tuning large NLP models for specialized tasks, while maintaining high performance with minimal labeled data.Method: Three-step pipeline: 1) Retrieves relevant data from broad repository using few-shot examples, 2) Generates synthetic training samples via in-context data generation, 3) Fine-tunes models using full or LoRA parameter adaptation for efficiency.
Result: Outperforms strong baselines including GPT-4o on NER and text classification benchmarks, achieves 99% of fully fine-tuned accuracy with only 5% memory footprint, beats GPT-4o with less than 1% of its parameters.
Conclusion: LOCUS enables efficient, low-cost customization of NLP models for specialized tasks while maintaining high performance, offering a practical solution for resource-constrained applications.
Abstract: We present LOCUS (LOw-cost Customization for Universal Specialization), a pipeline that consumes few-shot data to streamline the construction and training of NLP models through targeted retrieval, synthetic data generation, and parameter-efficient tuning. With only a small number of labeled examples, LOCUS discovers pertinent data in a broad repository, synthesizes additional training samples via in-context data generation, and fine-tunes models using either full or low-rank (LoRA) parameter adaptation. Our approach targets named entity recognition (NER) and text classification (TC) benchmarks, consistently outperforming strong baselines (including GPT-4o) while substantially lowering costs and model sizes. Our resultant memory-optimized models retain 99% of fully fine-tuned accuracy while using barely 5% of the memory footprint, also beating GPT-4o on several benchmarks with less than 1% of its parameters.
[7] Convergence of Outputs When Two Large Language Models Interact in a Multi-Agentic Setup
Aniruddha Maiti, Satya Nimmagadda, Kartha Veerya Jammuladinne, Niladri Sengupta, Ananya Jana
Main category: cs.CL
TL;DR: Two large language models (Mistral Nemo Base 2407 and Llama 2 13B) engage in multi-turn conversations without human intervention, starting from a seed sentence. Conversations initially remain coherent but eventually fall into repetitive loops where both models produce similar outputs, demonstrating convergence behavior despite their separate training.
Details
Motivation: To investigate what happens when two large language models communicate with each other autonomously over multiple turns without external input, examining whether they can sustain meaningful dialogue or fall into predictable patterns.Method: Multi-agent setup where two models (Mistral Nemo Base 2407 and Llama 2 13B hf) respond to each other’s outputs iteratively, starting from a short seed sentence. The conversation continues for a fixed number of steps. Lexical and embedding-based metrics are applied to measure conversation drift from the initial seed and similarity between model outputs over time.
Result: Most conversations start coherently but eventually fall into repetition, often with short phrases repeating across turns. Once repetition begins, both models tend to produce similar outputs rather than introducing new directions, creating loops of same or similar text. This convergence behavior occurs despite the models being large, separately trained, and lacking prompt instructions.
Conclusion: Autonomous conversations between large language models tend to converge into repetitive loops rather than sustaining diverse, meaningful dialogue, suggesting inherent limitations in multi-agent LLM interactions without external guidance or intervention.
Abstract: In this work, we report what happens when two large language models respond to each other for many turns without any outside input in a multi-agent setup. The setup begins with a short seed sentence. After that, each model reads the other’s output and generates a response. This continues for a fixed number of steps. We used Mistral Nemo Base 2407 and Llama 2 13B hf. We observed that most conversations start coherently but later fall into repetition. In many runs, a short phrase appears and repeats across turns. Once repetition begins, both models tend to produce similar output rather than introducing a new direction in the conversation. This leads to a loop where the same or similar text is produced repeatedly. We describe this behavior as a form of convergence. It occurs even though the models are large, trained separately, and not given any prompt instructions. To study this behavior, we apply lexical and embedding-based metrics to measure how far the conversation drifts from the initial seed and how similar the outputs of the two models becomes as the conversation progresses.
[8] A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification
Nicolas Calbucura, Valentin Barriere
Main category: cs.CL
TL;DR: Simple method to enhance text-based LLMs with speech info via lasso-based feature selection on audio tokens, improving performance on argumentative fallacy detection tasks.
Details
Motivation: Textual LLMs lack speech information, and integrating audio is challenging due to long audio sequences and large vocabularies from speech tokenizers. Existing methods struggle with multimodal fusion, especially when audio was previously considered counterproductive for certain tasks.Method: Use speech tokenizer for ASR to get audio tokens, apply lasso-based feature selection on multimodal Bag-of-Words to retain important audio tokens, adapt LLM to selected tokens via self-supervised language modeling, then fine-tune on downstream classification task.
Result: Outperforms unimodal models, larger SpeechLM, and learned audio representations. Achieves SOTA on argumentative fallacy detection tasks where audio was previously considered unhelpful. Even random audio token selection improves unimodal performance.
Conclusion: Simple lasso-based feature selection effectively integrates speech information into text LLMs, enabling successful multimodal fusion for tasks where audio was believed counterproductive, with minimal computational cost.
Abstract: This paper presents a simple method that allows to easily enhance textual pre-trained large language models with speech information, when fine-tuned for a specific classification task. A classical issue with the fusion of many embeddings from audio with text is the large length of the audio sequence compared to the text one. Our method benefits from an existing speech tokenizer trained for Audio Speech Recognition that output long sequences of tokens from a large vocabulary, making it difficult to integrate it at low cost in a large language model. By applying a simple lasso-based feature selection on multimodal Bag-of-Words representation, we retain only the most important audio tokens for the task, and adapt the language model to them with a self-supervised language modeling objective, before fine-tuning it on the downstream task. We show this helps to improve the performances compared to an unimodal model, to a bigger SpeechLM or to integrating audio via a learned representation. We show the effectiveness of our method on two recent Argumentative Fallacy Detection and Classification tasks where the use of audio was believed counterproductive, reaching state-of-the-art results. We also provide an in-depth analysis of the method, showing that even a random audio token selection helps enhancing the unimodal model. Our code is available online.
[9] Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models
Chen Yang, Guangyue Peng, Jiaying Zhu, Ran Le, Ruixiang Feng, Tao Zhang, Wei Ruan, Xiaoqi Liu, Xiaoxue Cheng, Xiyun Xu, Yang Song, Yanzipeng Gao, Yiming Jia, Yun Xing, Yuntao Wen, Zekai Wang, Zhenwei An, Zhicong Sun, Zongchao Chen
Main category: cs.CL
TL;DR: Nanbeige4-3B is a high-performing 3B parameter language model that achieves performance rivaling much larger models through advanced training techniques including FG-WSD scheduler, joint SFT refinement, Dual Preference Distillation, and multi-stage RL.
Details
Motivation: To push the boundaries of scaling laws for small language models by demonstrating that with sophisticated training methodologies, small models can achieve performance comparable to much larger models.Method: Four-stage approach: 1) Pretraining with FG-WSD scheduler for progressive data mixture refinement, 2) SFT with joint deliberative generation refinement and chain-of-thought reconstruction, 3) Distillation using Dual Preference Distillation from flagship reasoning model, 4) Multi-stage RL with verifiable rewards and preference modeling.
Result: Nanbeige4-3B significantly outperforms comparable-scale models and rivals much larger models across diverse benchmarks, demonstrating the effectiveness of the proposed training methodologies.
Conclusion: The paper successfully extends scaling laws for small language models through comprehensive training innovations, showing that small models can achieve competitive performance with proper methodology.
Abstract: We present Nanbeige4-3B, a family of small-scale but high-performing language models. Pretrained on 23T high-quality tokens and finetuned on over 30 million diverse instructions, we extend the boundary of the scaling law for small language models. In pre-training, we design a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler, which progressively refines data mixtures across stages to boost model performance. In post-training, to improve the quality of the SFT data, we design a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction, yielding substantial gains on complex tasks. Following SFT, we employ our flagship reasoning model to distill Nanbeige4-3B through our proposed Dual Preference Distillation (DPD) method, which leads to further performance gains. Finally, a multi-stage reinforcement learning phase was applied, leveraging verifiable rewards and preference modeling to strengthen abilities on both reasoning and human alignment. Extensive evaluations show that Nanbeige4-3B not only significantly outperforms models of comparable parameter scale but also rivals much larger models across a wide range of benchmarks. The model checkpoints are available at https://huggingface.co/Nanbeige.
[10] Modeling Contextual Passage Utility for Multihop Question Answering
Akriti Jain, Aparna Garimella
Main category: cs.CL
TL;DR: A lightweight transformer model predicts passage utility for multihop QA by considering inter-passage dependencies, using reasoning traces for training, improving reranking and QA performance over relevance-based methods.
Details
Motivation: Current passage utility prediction approaches model passages independently, ignoring that in multihop QA, passage utility is context-dependent and influenced by relationships with other passages (complementary information or crucial links).Method: Fine-tune a small transformer model to predict contextual passage utility scores, leveraging reasoning traces from an advanced reasoning model to capture passage usage order and generate synthetic training data.
Result: The utility-based scoring approach leads to improved reranking and downstream QA performance compared to relevance-based reranking methods, as demonstrated through comprehensive experiments.
Conclusion: Modeling contextual passage utility with inter-passage dependencies is effective for multihop QA, providing better passage selection than independent utility assessment or pure relevance-based approaches.
Abstract: Multihop Question Answering (QA) requires systems to identify and synthesize information from multiple text passages. While most prior retrieval methods assist in identifying relevant passages for QA, further assessing the utility of the passages can help in removing redundant ones, which may otherwise add to noise and inaccuracies in the generated answers. Existing utility prediction approaches model passage utility independently, overlooking a critical aspect of multihop reasoning: the utility of a passage can be context-dependent, influenced by its relation to other passages - whether it provides complementary information or forms a crucial link in conjunction with others. In this paper, we propose a lightweight approach to model contextual passage utility, accounting for inter-passage dependencies. We fine-tune a small transformer-based model to predict passage utility scores for multihop QA. We leverage the reasoning traces from an advanced reasoning model to capture the order in which passages are used to answer a question and obtain synthetic training data. Through comprehensive experiments, we demonstrate that our utility-based scoring of retrieved passages leads to improved reranking and downstream QA performance compared to relevance-based reranking methods.
[11] Knowing What’s Missing: Assessing Information Sufficiency in Question Answering
Akriti Jain, Aparna Garimella
Main category: cs.CL
TL;DR: Proposes Identify-then-Verify framework for assessing context sufficiency in QA by first identifying missing information through hypothesis generation and consensus, then verifying absence in source text.
Details
Motivation: Simple prompting strategies fail on inferential questions requiring reasoning beyond direct text extraction. Need more reliable method to determine if context contains sufficient information to answer questions.Method: Identify-then-Verify framework: 1) Generate multiple hypotheses about missing information and establish semantic consensus, 2) Perform critical verification step forcing model to re-examine source text to confirm whether information is truly absent.
Result: Method evaluated across diverse multi-hop and factual QA datasets, outperforming established baselines. Produces more accurate sufficiency judgments while clearly articulating information gaps.
Conclusion: Guiding models to justify claims about missing information through structured reasoning provides more reliable sufficiency assessment for question-answering systems.
Abstract: Determining whether a provided context contains sufficient information to answer a question is a critical challenge for building reliable question-answering systems. While simple prompting strategies have shown success on factual questions, they frequently fail on inferential ones that require reasoning beyond direct text extraction. We hypothesize that asking a model to first reason about what specific information is missing provides a more reliable, implicit signal for assessing overall sufficiency. To this end, we propose a structured Identify-then-Verify framework for robust sufficiency modeling. Our method first generates multiple hypotheses about missing information and establishes a semantic consensus. It then performs a critical verification step, forcing the model to re-examine the source text to confirm whether this information is truly absent. We evaluate our method against established baselines across diverse multi-hop and factual QA datasets. The results demonstrate that by guiding the model to justify its claims about missing information, our framework produces more accurate sufficiency judgments while clearly articulating any information gaps.
[12] TeluguST-46: A Benchmark Corpus and Comprehensive Evaluation for Telugu-English Speech Translation
Bhavana Akkiraju, Srihari Bandarupalli, Swathi Sambangi, Vasavi Ravuri, R Vijaya Saraswathi, Anil Kumar Vuppala
Main category: cs.CL
TL;DR: This paper introduces a Telugu-English speech translation benchmark from 46 hours of verified data, compares cascaded vs. end-to-end architectures, and evaluates automatic metrics against human judgments for this low-resource language pair.
Details
Motivation: Telugu is spoken by over 80 million people but speech translation research for this morphologically rich language remains severely underexplored, creating a significant gap in multilingual speech technology.Method: Developed a 46-hour Telugu-English speech translation benchmark (30h/8h/8h train/dev/test split) from manually verified CSTD corpus data. Systematically compared cascaded (IndicWhisper + IndicMT) vs. end-to-end (finetuned SeamlessM4T) architectures, and evaluated multiple automatic metrics (BLEU, METEOR, ChrF++, ROUGE-L, TER, BERTScore) against human judgments.
Result: Cascaded IndicWhisper + IndicMT achieved highest performance due to extensive Telugu-specific training data, but finetuned SeamlessM4T models showed remarkable competitiveness with significantly less Telugu data. Traditional metrics (BLEU, METEOR, etc.) provided better quality discrimination than BERTScore for Telugu-English translation.
Conclusion: End-to-end systems can achieve performance comparable to cascaded approaches in low-resource settings with careful hyperparameter tuning and sufficient parallel data (potentially less than 100 hours). The work provides a reproducible benchmark, evidence of competitive end-to-end potential, and practical evaluation guidance for morphologically complex languages.
Abstract: Despite Telugu being spoken by over 80 million people, speech translation research for this morphologically rich language remains severely underexplored. We address this gap by developing a high-quality Telugu–English speech translation benchmark from 46 hours of manually verified CSTD corpus data (30h/8h/8h train/dev/test split). Our systematic comparison of cascaded versus end-to-end architectures shows that while IndicWhisper + IndicMT achieves the highest performance due to extensive Telugu-specific training data, finetuned SeamlessM4T models demonstrate remarkable competitiveness despite using significantly less Telugu-specific training data. This finding suggests that with careful hyperparameter tuning and sufficient parallel data (potentially less than 100 hours), end-to-end systems can achieve performance comparable to cascaded approaches in low-resource settings. Our metric reliability study evaluating BLEU, METEOR, ChrF++, ROUGE-L, TER, and BERTScore against human judgments reveals that traditional metrics provide better quality discrimination than BERTScore for Telugu–English translation. The work delivers three key contributions: a reproducible Telugu–English benchmark, empirical evidence of competitive end-to-end performance potential in low-resource scenarios, and practical guidance for automatic evaluation in morphologically complex language pairs.
[13] Classifying German Language Proficiency Levels Using Large Language Models
Elias-Leander Ahlers, Witold Brunsmann, Malte Schilling
Main category: cs.CL
TL;DR: LLMs effectively classify German texts by CEFR proficiency levels using combined datasets and multiple approaches, outperforming prior methods.
Details
Motivation: Language proficiency assessment is crucial for personalized education, and automated CEFR classification of German texts can enable scalable, tailored instruction.Method: Constructed diverse dataset combining existing CEFR-annotated corpora with synthetic data; evaluated prompt engineering, fine-tuning of LLaMA-3-8B-Instruct, and probing-based approach using LLM’s internal neural state.
Result: Consistent performance improvement over prior methods, demonstrating LLMs’ potential for reliable and scalable CEFR classification.
Conclusion: LLMs show strong potential for automated German text proficiency classification according to CEFR levels, offering reliable and scalable assessment solutions.
Abstract: Assessing language proficiency is essential for education, as it enables instruction tailored to learners needs. This paper investigates the use of Large Language Models (LLMs) for automatically classifying German texts according to the Common European Framework of Reference for Languages (CEFR) into different proficiency levels. To support robust training and evaluation, we construct a diverse dataset by combining multiple existing CEFR-annotated corpora with synthetic data. We then evaluate prompt-engineering strategies, fine-tuning of a LLaMA-3-8B-Instruct model and a probing-based approach that utilizes the internal neural state of the LLM for classification. Our results show a consistent performance improvement over prior methods, highlighting the potential of LLMs for reliable and scalable CEFR classification.
[14] Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data
Srihari Bandarupalli, Bhavana Akkiraju, Charan Devarakonda, Vamsiraghusimha Narsinga, Anil Kumar Vuppala
Main category: cs.CL
TL;DR: Cross-lingual continuous pretraining with unlabeled data enables effective ASR for low-resource Perso-Arabic languages using a 300M parameter model that matches or beats much larger models.
Details
Motivation: Low-resource languages face severe constraints from limited labeled data and computational resources needed for state-of-the-art ASR models. There's a need for more inclusive speech technology that doesn't depend on massive infrastructure or proprietary datasets.Method: Constructed a 3,000-hour multilingual corpus using scalable unlabeled data collection pipeline. Employed targeted continual pretraining combined with morphologically-aware tokenization to develop a 300M parameter model for Perso-Arabic languages (Persian, Arabic, Urdu).
Result: The 300M parameter model outperforms Whisper Large v3 (1.5B parameters) on Persian and achieves competitive results on Arabic and Urdu despite using significantly fewer parameters and substantially less labeled data. Performance comparable to systems 5 times larger.
Conclusion: Data relevance and strategic pretraining are more critical than model size for low-resource ASR. This provides a practical pathway toward inclusive speech technology for underrepresented languages without dependence on massive computational infrastructure.
Abstract: Automatic speech recognition for low-resource languages remains fundamentally constrained by the scarcity of labeled data and computational resources required by state-of-the-art models. We present a systematic investigation into cross-lingual continuous pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic utilization of unlabeled speech data can effectively bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and employ targeted continual pretraining combined with morphologically-aware tokenization to develop a 300M parameter model that achieves performance comparable to systems 5 times larger. Our model outperforms Whisper Large v3 (1.5B parameters) on Persian and achieves competitive results on Arabic and Urdu despite using significantly fewer parameters and substantially less labeled data. These findings challenge the prevailing assumption that ASR quality scales primarily with model size, revealing instead that data relevance and strategic pretraining are more critical factors for low-resource scenarios. This work provides a practical pathway toward inclusive speech technology, enabling effective ASR for underrepresented languages without dependence on massive computational infrastructure or proprietary datasets.
[15] A Knowledge-Based Language Model: Deducing Grammatical Knowledge in a Multi-Agent Language Acquisition Simulation
David Ph. Shakouri, Crit Cremers, Niels O. Schiller
Main category: cs.CL
TL;DR: MODOMA is a computational multi-agent system for unsupervised language acquisition experiments where an adult agent teaches a child agent, resulting in knowledge-based language models that can generate and parse new utterances.
Details
Motivation: To create a fully parametrized computational laboratory environment for conducting controlled language acquisition experiments with explicit representation of acquired grammatical knowledge.Method: Multi-agent system with adult and child agents using statistical and rule-based procedures for unsupervised language acquisition; researchers can control all experimental parameters.
Result: The child agent successfully acquired functional and content categories from machine-generated data, showing patterns similar to human language acquisition, validating the MODOMA approach.
Conclusion: MODOMA introduces novel possibilities for computational language acquisition experiments by providing a controlled environment where acquired grammatical knowledge is explicitly represented and consultable.
Abstract: This paper presents an initial study performed by the MODOMA system. The MODOMA is a computational multi-agent laboratory environment for unsupervised language acquisition experiments such that acquisition is based on the interaction between two language models, an adult and a child agent. Although this framework employs statistical as well as rule-based procedures, the result of language acquisition is a knowledge-based language model, which can be used to generate and parse new utterances of the target language. This system is fully parametrized and researchers can control all aspects of the experiments while the results of language acquisition, that is, the acquired grammatical knowledge, are explicitly represented and can be consulted. Thus, this system introduces novel possibilities for conducting computational language acquisition experiments. The experiments presented by this paper demonstrate that functional and content categories can be acquired and represented by the daughter agent based on training and test data containing different amounts of exemplars generated by the adult agent. Interestingly, similar patterns, which are well-established for human-generated data, are also found for these machine-generated data. As the procedures resulted in the successful acquisition of discrete grammatical categories by the child agent, these experiments substantiate the validity of the MODOMA approach to modelling language acquisition.
[16] ProSocialAlign: Preference Conditioned Test Time Alignment in Language Models
Somnath Banerjee, Sayan Layek, Sayantan Adak, Mykola Pechenizkiy, Animesh Mukherjee, Rima Hazra
Main category: cs.CL
TL;DR: ProSocialAlign is a test-time framework that steers language models toward safe, empathetic responses using lexicographic constrained generation and parameter-efficient steering without retraining.
Details
Motivation: Current safety approaches fail in emotionally charged or high-stakes settings where refusal-only methods alienate users and naive compliance amplifies risks, requiring more nuanced safety alignment.Method: Combines directional regulation (subtracting learned “harm vectors” in parameter space) with preference-aware autoregressive reward modeling trained jointly across attributes with gradient conflict resolution, using lexicographic constrained generation.
Result: State-of-the-art performance across five safety benchmarks, reducing unsafe leakage and boosting alignment to human values with strong gains across multiple evaluation metrics.
Conclusion: ProSocialAlign provides a robust, modular foundation for generating context-sensitive, safe, and human-aligned responses at inference time without retraining base models.
Abstract: Current language model safety paradigms often fall short in emotionally charged or high-stakes settings, where refusal-only approaches may alienate users and naive compliance can amplify risk. We propose ProSocialAlign, a test-time, parameter-efficient framework that steers generation toward safe, empathetic, and value-aligned responses without retraining the base model. We formalize five human-centered objectives and cast safety as lexicographic constrained generation: first, applying hard constraints to eliminate harmful continuations; then optimizing for prosocial quality within the safe set. Our method combines (i) directional regulation, a harm-mitigation mechanism that subtracts a learned “harm vector” in parameter space, and (ii) preference-aware autoregressive reward modeling trained jointly across attributes with gradient conflict resolution, enabling fine-grained, user-controllable decoding. Empirical evaluations across five safety benchmarks demonstrate state-of-the-art performance, reducing unsafe leakage and boosting alignment to human values, with strong gains across multiple evaluation metrics. ProSocialAlign offers a robust and modular foundation for generating context-sensitive, safe, and human-aligned responses at inference time.
[17] Adapting AlignScore Mertic for Factual Consistency Evaluation of Text in Russian: A Student Abstract
Mikhail Zimin, Milyausha Shamsutdinova, Georgii Andriushchenko
Main category: cs.CL
TL;DR: AlignRuScore adapts the AlignScore metric for Russian to evaluate factual consistency, addressing the lack of Russian-focused evaluation tools by fine-tuning RuBERT on translated datasets.
Details
Motivation: There's a lack of evaluation tools for factual consistency in Russian texts, as existing tools primarily focus on English corpora, creating a gap for reliable natural language processing applications in Russian.Method: Fine-tuned a RuBERT-based alignment model with task-specific classification and regression heads on Russian and translated English datasets to adapt the AlignScore metric for Russian.
Result: Demonstrated that a unified alignment metric can be successfully ported to Russian, laying groundwork for robust multilingual factual consistency evaluation.
Conclusion: AlignRuScore successfully bridges the gap for Russian factual consistency evaluation, with released translated corpora, model checkpoints, and code to support further research.
Abstract: Ensuring factual consistency in generated text is crucial for reliable natural language processing applications. However, there is a lack of evaluation tools for factual consistency in Russian texts, as existing tools primarily focus on English corpora. To bridge this gap, we introduce AlignRuScore, a comprehensive adaptation of the AlignScore metric for Russian. To adapt the metric, we fine-tuned a RuBERT-based alignment model with task-specific classification and regression heads on Russian and translated English datasets. Our results demonstrate that a unified alignment metric can be successfully ported to Russian, laying the groundwork for robust multilingual factual consistency evaluation. We release the translated corpora, model checkpoints, and code to support further research.
[18] The Online Discourse of Virtual Reality and Anxiety
Kwabena Yamoah, Cass Dykeman
Main category: cs.CL
TL;DR: This study uses corpus linguistics to analyze online discussions about VR and anxiety, finding that VR technology terms dominate conversations and prepositional phrases reveal different aspects of VR development and experience.
Details
Motivation: VR is increasingly used in treating anxiety disorders, and understanding public discourse about VR and anxiety can help improve its therapeutic efficacy and accessibility.Method: Corpus linguistic methodology using Sketch Engine software to analyze the English Trends corpus, identifying frequently used words and collocations in online discussions about VR and anxiety.
Result: Most frequent terms were VR, Oculus, and headset; prepositional phrases like “of virtual reality” (design), “in virtual reality” (experience), and “for virtual reality” (development) revealed different discourse patterns.
Conclusion: The analysis provides insights into how VR and anxiety are discussed online, offering pathways for future development and accessibility improvements in therapeutic applications.
Abstract: VR in the treatment of clinical concerns such as generalized anxiety disorder or social anxiety. VR has created additional pathways to support patient well-being and care. Understanding online discussion of what users think about this technology may further support its efficacy. The purpose of this study was to employ a corpus linguistic methodology to identify the words and word networks that shed light on the online discussion of virtual reality and anxiety. Using corpus linguistics, frequently used words in discussion along with collocation were identified by utilizing Sketch Engine software. The results of the study, based upon the English Trends corpus, identified VR, Oculus, and headset as the most frequently discussed within the VR and anxiety subcorpus. These results point to the development of the virtual system, along with the physical apparatus that makes viewing and engaging with the virtual environment possible. Additional results point to collocation of prepositional phrases such as of virtual reality, in virtual reality, and for virtual reality relating to the design, experience, and development, respectively. These findings offer new perspective on how VR and anxiety together are discussed in general discourse and offer pathways for future opportunities to support counseling needs through development and accessibility. Keywords: anxiety disorders, corpus linguistics, Sketch Engine, and virtual reality VR
[19] CMV-Fuse: Cross Modal-View Fusion of AMR, Syntax, and Knowledge Representations for Aspect Based Sentiment Analysis
Smitha Muthya Sudheendra, Mani Deep Cherukuri, Jaideep Srivastava
Main category: cs.CL
TL;DR: CMV-Fuse is a Cross-Modal View fusion framework for Aspect-Based Sentiment Analysis that combines multiple linguistic perspectives (AMR, constituency parsing, dependency syntax, semantic attention) with external knowledge through hierarchical gated attention and contrastive learning.
Details
Motivation: Current ABSA systems use isolated linguistic views, missing the complex interplay between structural representations that humans naturally use for language understanding. Natural language understanding requires integrating multiple complementary perspectives from surface syntax to deep semantics and world knowledge.Method: Systematically combines four linguistic perspectives: Abstract Meaning Representations, constituency parsing, dependency syntax, and semantic attention with external knowledge. Uses hierarchical gated attention fusion across local syntactic, intermediate semantic, and global knowledge levels. Includes structure-aware multi-view contrastive learning for consistency across representations.
Result: Extensive experiments show substantial improvements over strong baselines on standard benchmarks. Analysis reveals how each linguistic view contributes to more robust sentiment analysis.
Conclusion: CMV-Fuse successfully emulates human language processing by systematically combining multiple linguistic perspectives, capturing both fine-grained structural patterns and broad contextual understanding for more effective Aspect-Based Sentiment Analysis.
Abstract: Natural language understanding inherently depends on integrating multiple complementary perspectives spanning from surface syntax to deep semantics and world knowledge. However, current Aspect-Based Sentiment Analysis (ABSA) systems typically exploit isolated linguistic views, thereby overlooking the intricate interplay between structural representations that humans naturally leverage. We propose CMV-Fuse, a Cross-Modal View fusion framework that emulates human language processing by systematically combining multiple linguistic perspectives. Our approach systematically orchestrates four linguistic perspectives: Abstract Meaning Representations, constituency parsing, dependency syntax, and semantic attention, enhanced with external knowledge integration. Through hierarchical gated attention fusion across local syntactic, intermediate semantic, and global knowledge levels, CMV-Fuse captures both fine-grained structural patterns and broad contextual understanding. A novel structure aware multi-view contrastive learning mechanism ensures consistency across complementary representations while maintaining computational efficiency. Extensive experiments demonstrate substantial improvements over strong baselines on standard benchmarks, with analysis revealing how each linguistic view contributes to more robust sentiment analysis.
[20] Mechanistic Interpretability of GPT-2: Lexical and Contextual Layers in Sentiment Analysis
Amartya Hatua
Main category: cs.CL
TL;DR: GPT-2’s sentiment processing doesn’t follow predicted two-stage architecture; early layers detect lexical sentiment but contextual integration happens in late layers through unified mechanism.
Details
Motivation: To causally examine how sentiment information is processed across GPT-2's transformer layers and test the hypothesized two-stage sentiment architecture (lexical detection + contextual integration).Method: Systematic activation patching across all 12 layers of GPT-2 to test three contextual integration hypotheses: Middle Layer Concentration, Phenomenon Specificity, and Distributed Processing.
Result: Early layers (0-3) act as lexical sentiment detectors with stable, position-specific polarity signals. However, all three contextual integration hypotheses were falsified - contextual phenomena (negation, sarcasm, domain shifts) are integrated primarily in late layers (8-11) through a unified, non-modular mechanism.
Conclusion: GPT-2’s sentiment computation differs from predicted hierarchical patterns, highlighting the need for further empirical characterization of contextual integration in large language models.
Abstract: We present a mechanistic interpretability study of GPT-2 that causally examines how sentiment information is processed across its transformer layers. Using systematic activation patching across all 12 layers, we test the hypothesized two-stage sentiment architecture comprising early lexical detection and mid-layer contextual integration. Our experiments confirm that early layers (0-3) act as lexical sentiment detectors, encoding stable, position specific polarity signals that are largely independent of context. However, all three contextual integration hypotheses: Middle Layer Concentration, Phenomenon Specificity, and Distributed Processing are falsified. Instead of mid-layer specialization, we find that contextual phenomena such as negation, sarcasm, domain shifts etc. are integrated primarily in late layers (8-11) through a unified, non-modular mechanism. These experimental findings provide causal evidence that GPT-2’s sentiment computation differs from the predicted hierarchical pattern, highlighting the need for further empirical characterization of contextual integration in large language models.
[21] PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory
Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, Radha Poovendran, Gregory Wornell, Lyle Ungar, Dan Roth, Sihao Chen, Camillo Jose Taylor
Main category: cs.CL
TL;DR: PersonaMem-v2 is a state-of-the-art dataset for LLM personalization with 1,000 realistic user-chatbot interactions. The paper shows frontier LLMs struggle with implicit personalization (37-48% accuracy), but reinforcement fine-tuning enables Qwen3-4B to outperform GPT-5 (53% accuracy). An agentic memory framework achieves 55% accuracy using 16x fewer tokens.
Details
Motivation: Personalization is a key milestone for advancing AI capability and alignment. Current LLMs struggle with implicit personalization where user preferences are revealed indirectly, and they face challenges with long-context reasoning for understanding users over extended interactions.Method: 1) Created PersonaMem-v2 dataset with 1,000 realistic user-chatbot interactions across 300+ scenarios, 20,000+ user preferences, and 128k-token context windows. 2) Used reinforcement fine-tuning to improve long-context reasoning for personalization. 3) Developed an agentic memory framework that maintains a single, human-readable memory that grows with each user over time.
Result: Frontier LLMs achieve only 37-48% accuracy on implicit personalization. Reinforcement fine-tuning enables Qwen3-4B to outperform GPT-5 with 53% accuracy. The agentic memory framework achieves state-of-the-art 55% accuracy while using 16x fewer input tokens (2k-token memory vs. full 32k conversation histories).
Conclusion: The PersonaMem-v2 dataset enables significant progress in LLM personalization. While reasoning remains a bottleneck for implicit personalization, reinforcement fine-tuning and agentic memory systems provide scalable paths toward real-world personalized intelligence, with agentic memory being particularly efficient.
Abstract: Personalization is one of the next milestones in advancing AI capability and alignment. We introduce PersonaMem-v2, the state-of-the-art dataset for LLM personalization that simulates 1,000 realistic user-chatbot interactions on 300+ scenarios, 20,000+ user preferences, and 128k-token context windows, where most user preferences are implicitly revealed to reflect real-world interactions. Using this data, we investigate how reinforcement fine-tuning enables a model to improve its long-context reasoning capabilities for user understanding and personalization. We also develop a framework for training an agentic memory system, which maintains a single, human-readable memory that grows with each user over time. In our experiments, frontier LLMs still struggle with implicit personalization, achieving only 37-48% accuracy. While they support long context windows, reasoning remains the bottleneck for implicit personalization tasks. Using reinforcement fine-tuning, we successfully train Qwen3-4B to outperforms GPT-5, reaching 53% accuracy in implicit personalization. Moreover, our agentic memory framework achieves state-of-the-art 55% accuracy while using 16x fewer input tokens, relying on a 2k-token memory instead of full 32k conversation histories. These results underscore the impact of our dataset and demonstrate agentic memory as a scalable path toward real-world personalized intelligence.
[22] MASim: Multilingual Agent-Based Simulation for Social Science
Xuan Zhang, Wenxuan Zhang, Anxu Wang, See-Kiong Ng, Yang Deng
Main category: cs.CL
TL;DR: MASim is the first multilingual agent-based simulation framework for studying cross-lingual social interactions, enabling global public opinion modeling and media influence analysis through diverse sociolinguistic agents.
Details
Motivation: Existing multi-agent role-playing simulations are mostly monolingual and fail to model cross-lingual interactions, which are essential for studying real-world social behavior and computational social science.Method: Developed MASim framework supporting multi-turn interactions among generative agents with diverse sociolinguistic profiles, with two key analyses: global public opinion modeling across languages/cultures, and media influence/information diffusion via autonomous news agents. Created MAPS benchmark combining survey questions and demographic personas from global population distributions.
Result: Experiments on calibration, sensitivity, consistency, and cultural case studies demonstrate that MASim reproduces sociocultural phenomena and highlights the importance of multilingual simulation for scalable, controlled computational social science.
Conclusion: MASim represents a significant advancement for computational social science by enabling multilingual agent-based simulations that better reflect real-world cross-lingual social interactions and cultural dynamics.
Abstract: Multi-agent role-playing has recently shown promise for studying social behavior with language agents, but existing simulations are mostly monolingual and fail to model cross-lingual interaction, an essential property of real societies. We introduce MASim, the first multilingual agent-based simulation framework that supports multi-turn interaction among generative agents with diverse sociolinguistic profiles. MASim offers two key analyses: (i) global public opinion modeling, by simulating how attitudes toward open-domain hypotheses evolve across languages and cultures, and (ii) media influence and information diffusion, via autonomous news agents that dynamically generate content and shape user behavior. To instantiate simulations, we construct the MAPS benchmark, which combines survey questions and demographic personas drawn from global population distributions. Experiments on calibration, sensitivity, consistency, and cultural case studies show that MASim reproduces sociocultural phenomena and highlights the importance of multilingual simulation for scalable, controlled computational social science.
[23] Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation
Chengbing Wang, Yang Zhang, Wenjie Wang, Xiaoyan Zhao, Fuli Feng, Xiangnan He, Tat-Seng Chua
Main category: cs.CL
TL;DR: FlyThinker is a “think-while-generating” framework for personalized long-form generation that enables concurrent reasoning and generation through parallel latent token-level reasoning, addressing limitations of existing think-then-generate methods.
Details
Motivation: Current preference alignment methods optimize for population-level preferences, overlooking individual users. Early personalization approaches struggle with implicit preferences, while recent think-then-generate methods face challenges in long-form generation due to static one-shot reasoning that must capture all relevant information upfront, limiting adaptability to evolving content.Method: FlyThinker employs a separate reasoning model that generates latent token-level reasoning in parallel, which is fused into the generation model to dynamically guide response generation. The reasoning model depends only on previous responses rather than its own prior outputs, preserving training parallelism across positions and enabling efficient training.
Result: Extensive experiments on real-world benchmarks demonstrate that FlyThinker achieves better personalized generation while maintaining training and inference efficiency compared to existing methods.
Conclusion: FlyThinker addresses the limitations of current personalized generation methods by enabling efficient concurrent reasoning and generation, making it effective for personalized long-form content creation while preserving both training and inference efficiency.
Abstract: Preference alignment has enabled large language models (LLMs) to better reflect human expectations, but current methods mostly optimize for population-level preferences, overlooking individual users. Personalization is essential, yet early approaches-such as prompt customization or fine-tuning-struggle to reason over implicit preferences, limiting real-world effectiveness. Recent “think-then-generate” methods address this by reasoning before response generation. However, they face challenges in long-form generation: their static one-shot reasoning must capture all relevant information for the full response generation, making learning difficult and limiting adaptability to evolving content. To address this issue, we propose FlyThinker, an efficient “think-while-generating” framework for personalized long-form generation. FlyThinker employs a separate reasoning model that generates latent token-level reasoning in parallel, which is fused into the generation model to dynamically guide response generation. This design enables reasoning and generation to run concurrently, ensuring inference efficiency. In addition, the reasoning model is designed to depend only on previous responses rather than its own prior outputs, which preserves training parallelism across different positions-allowing all reasoning tokens for training data to be produced in a single forward pass like standard LLM training, ensuring training efficiency. Extensive experiments on real-world benchmarks demonstrate that FlyThinker achieves better personalized generation while keeping training and inference efficiency.
[24] TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction
Aoi Fujita, Taichi Yamamoto, Yuri Nakayama, Ryota Kobayashi
Main category: cs.CL
TL;DR: TopiCLEAR: A new topic modeling method using SBERT embeddings, GMM clustering, and adaptive LDA projection that outperforms baselines on social media and news datasets without preprocessing.
Details
Motivation: Traditional topic models fail on short social media posts due to limited co-occurrence statistics, fragmented semantics, inconsistent spelling, and informal language. There's a need for methods that can handle raw social media text without preprocessing.Method: TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction. Uses SBERT embeddings, Gaussian Mixture Models (GMM) for provisional clustering, then iteratively refines clusters using supervised linear discriminant analysis projection followed by GMM clustering until convergence. Works directly on raw text without preprocessing.
Result: Evaluated on four datasets (20News, AgNewsTitle, Reddit, TweetTopic) with human-labeled topics. Outperformed seven baseline methods including recent SBERT-based and zero-shot generative AI methods. Achieved highest similarity to human-annotated topics for both social media posts and news articles. Produced more interpretable topics in qualitative analysis.
Conclusion: TopiCLEAR effectively addresses challenges of topic modeling on social media data, showing significant improvements over existing methods and demonstrating strong potential for social media and web content analytics applications.
Abstract: Rapid expansion of social media platforms such as X (formerly Twitter), Facebook, and Reddit has enabled large-scale analysis of public perceptions on diverse topics, including social issues, politics, natural disasters, and consumer sentiment. Topic modeling is a widely used approach for uncovering latent themes in text data, typically framed as an unsupervised classification task. However, traditional models, originally designed for longer and more formal documents, struggle with short social media posts due to limited co-occurrence statistics, fragmented semantics, inconsistent spelling, and informal language. To address these challenges, we propose a new method, TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction. Specifically, each text is embedded using Sentence-BERT (SBERT) and provisionally clustered using Gaussian Mixture Models (GMM). The clusters are then refined iteratively using a supervised projection based on linear discriminant analysis, followed by GMM-based clustering until convergence. Notably, our method operates directly on raw text, eliminating the need for preprocessing steps such as stop word removal. We evaluate our approach on four diverse datasets, 20News, AgNewsTitle, Reddit, and TweetTopic, each containing human-labeled topic information. Compared with seven baseline methods, including a recent SBERT-based method and a zero-shot generative AI method, our approach achieves the highest similarity to human-annotated topics, with significant improvements for both social media posts and online news articles. Additionally, qualitative analysis shows that our method produces more interpretable topics, highlighting its potential for applications in social media data and web content analytics.
[25] Parameter-Efficient Fine-Tuning with Differential Privacy for Robust Instruction Adaptation in Large Language Models
Yulin Huang, Yaxuan Luan, Jinxu Guo, Xiangchen Song, Yuchen Liu
Main category: cs.CL
TL;DR: Proposes a parameter-efficient instruction fine-tuning method with differential privacy that uses gradient clipping and adaptive noise allocation in a collaborative optimization framework, achieving better accuracy, privacy budget, and parameter efficiency while maintaining stability.
Details
Motivation: Addresses privacy protection and efficiency issues in instruction fine-tuning of large language models, aiming to reduce privacy budget consumption while ensuring training stability and robustness in multi-task instruction scenarios.Method: Keeps backbone model frozen and updates parameters through low-dimensional projection subspace, introduces gradient clipping and adaptive noise allocation during gradient computation, and uses a unified framework combining gradient constraints, noise allocation, and parameter projection.
Result: Outperforms baseline models in accuracy, privacy budget, and parameter efficiency, maintains stable performance under diverse and uncertain data conditions across hyperparameter, environment, and data sensitivity dimensions.
Conclusion: Enriches theoretical integration of differential privacy and parameter-efficient fine-tuning, demonstrates practical adaptability in instruction tasks, and provides a feasible solution for secure training in complex instruction environments.
Abstract: This study addresses the issues of privacy protection and efficiency in instruction fine-tuning of large-scale language models by proposing a parameter-efficient method that integrates differential privacy noise allocation with gradient clipping in a collaborative optimization framework. The method keeps the backbone model frozen and updates parameters through a low-dimensional projection subspace, while introducing clipping and adaptive noise allocation during gradient computation. This design reduces privacy budget consumption and ensures training stability and robustness. The unified framework combines gradient constraints, noise allocation, and parameter projection, effectively mitigating performance fluctuations and privacy risks in multi-task instruction scenarios. Experiments are conducted across hyperparameter, environment, and data sensitivity dimensions. Results show that the method outperforms baseline models in accuracy, privacy budget, and parameter efficiency, and maintains stable performance under diverse and uncertain data conditions. The findings enrich the theoretical integration of differential privacy and parameter-efficient fine-tuning and demonstrate its practical adaptability in instruction tasks, providing a feasible solution for secure training in complex instruction environments.
[26] “The Dentist is an involved parent, the bartender is not”: Revealing Implicit Biases in QA with Implicit BBQ
Aarushi Wagh, Saniya Srivastava
Main category: cs.CL
TL;DR: ImplicitBBQ extends BBQ benchmark to test implicit biases in LLMs using subtle cues rather than explicit declarations of protected attributes, revealing significant performance gaps not detected by existing explicit benchmarks.
Details
Motivation: Existing bias benchmarks rely on explicit declaration of protected attributes (religion, race, gender), but real-world interactions often contain implicit biases inferred through names, cultural cues, or traits. This creates a blind spot in fairness evaluation that needs to be addressed.Method: Extends the Bias Benchmark for QA (BBQ) to create ImplicitBBQ, which includes implicitly cued protected attributes across 6 categories. The benchmark uses subtle cues like names, cultural references, or traits instead of explicit declarations of attributes.
Result: Evaluation of GPT-4o on ImplicitBBQ shows troubling performance disparity compared to explicit BBQ prompts, with accuracy declining up to 7% in the “sexual orientation” subcategory and consistent declines across most other categories.
Conclusion: Current LLMs contain implicit biases undetected by explicit benchmarks. ImplicitBBQ offers a crucial tool for more nuanced fairness evaluation in NLP, addressing the gap between explicit bias testing and real-world implicit bias scenarios.
Abstract: Existing benchmarks evaluating biases in large language models (LLMs) primarily rely on explicit cues, declaring protected attributes like religion, race, gender by name. However, real-world interactions often contain implicit biases, inferred subtly through names, cultural cues, or traits. This critical oversight creates a significant blind spot in fairness evaluation. We introduce ImplicitBBQ, a benchmark extending the Bias Benchmark for QA (BBQ) with implicitly cued protected attributes across 6 categories. Our evaluation of GPT-4o on ImplicitBBQ illustrates troubling performance disparity from explicit BBQ prompts, with accuracy declining up to 7% in the “sexual orientation” subcategory and consistent decline located across most other categories. This indicates that current LLMs contain implicit biases undetected by explicit benchmarks. ImplicitBBQ offers a crucial tool for nuanced fairness evaluation in NLP.
[27] A Patient-Doctor-NLP-System to contest inequality for less privileged
Subrit Dikshit, Ritu Tiwari, Priyank Jain
Main category: cs.CL
TL;DR: PDFTEMRA is a compact transformer model combining distillation, frequency modulation, ensemble learning, and random activations for efficient medical NLP in Hindi and accessibility scenarios.
Details
Motivation: Addresses challenges of deploying large language models in resource-constrained healthcare settings, particularly for visually impaired users and speakers of low-resource languages like Hindi in rural environments.Method: Proposes PDFTEMRA (Performant Distilled Frequency Transformer Ensemble Model with Random Activations) - a compact transformer architecture integrating model distillation, frequency-domain modulation, ensemble learning, and randomized activation patterns to reduce computational costs.
Result: PDFTEMRA achieves comparable performance to state-of-the-art NLP models with substantially lower computational requirements, demonstrating suitability for accessible, low-resource medical NLP applications.
Conclusion: The proposed model offers an efficient solution for medical NLP in resource-constrained healthcare scenarios, particularly benefiting visually impaired users and speakers of low-resource languages like Hindi.
Abstract: Transfer Learning (TL) has accelerated the rapid development and availability of large language models (LLMs) for mainstream natural language processing (NLP) use cases. However, training and deploying such gigantic LLMs in resource-constrained, real-world healthcare situations remains challenging. This study addresses the limited support available to visually impaired users and speakers of low-resource languages such as Hindi who require medical assistance in rural environments. We propose PDFTEMRA (Performant Distilled Frequency Transformer Ensemble Model with Random Activations), a compact transformer-based architecture that integrates model distillation, frequency-domain modulation, ensemble learning, and randomized activation patterns to reduce computational cost while preserving language understanding performance. The model is trained and evaluated on medical question-answering and consultation datasets tailored to Hindi and accessibility scenarios, and its performance is compared against standard NLP state-of-the-art model baselines. Results demonstrate that PDFTEMRA achieves comparable performance with substantially lower computational requirements, indicating its suitability for accessible, inclusive, low-resource medical NLP applications.
[28] One Word Is Not Enough: Simple Prompts Improve Word Embeddings
Rajeev Ranjan
Main category: cs.CL
TL;DR: Adding semantic prompts like “meaning: {word}” before embedding words significantly improves word similarity performance for text embedding models, achieving state-of-the-art results without training.
Details
Motivation: Text embedding models are designed and evaluated for sentence-level tasks, but their behavior on isolated words is poorly understood. The paper investigates whether simple prompting can improve word-level semantic similarity performance.Method: Tested 7 text embedding models on 3 standard word similarity benchmarks (SimLex-999, WordSim-353, MEN-3000). Used zero-shot semantic prompts like “meaning: {word}” or “Represent the semantic concept: {word}” prepended to words before embedding. Compared performance with and without prompts.
Result: Prompts improved Spearman correlations by up to +0.29 on SimLex-999. Some models failed completely on bare words (correlation = 0) but recovered with prompts (+0.73 improvement). Best results: embed-english-v3.0 achieved 0.692 on SimLex-999, text-embedding-3-large achieved 0.811 on WordSim-353 and 0.855 on MEN-3000, outperforming classic static embeddings like Word2Vec (0.40) and LexVec (0.48).
Conclusion: Simple semantic prompting significantly enhances word similarity performance of text embedding models, establishing new state-of-the-art for pure embedding methods without requiring any training. This zero-shot technique works with any text embedding model.
Abstract: Text embedding models are designed for sentence-level applications like retrieval and semantic similarity, and are primarily evaluated on sentence-level benchmarks. Their behavior on isolated words is less understood. We show that simply prepending semantic prompts to words before embedding substantially improves word similarity correlations. Testing 7 text embedding models, including text-embedding-3-large (OpenAI), embed-english-v3.0 (Cohere), voyage-3(Voyage AI), all-mpnet-base-v2, and Qwen3-Embedding-8B, on 3 standard benchmarks (SimLex-999, WordSim-353, MEN-3000), we find that prompts like “meaning: {word}” or “Represent the semantic concept: {word}” improve Spearman correlations by up to +0.29 on SimLex-999. Some models fail completely on bare words (correlation = 0) but recover with prompts (+0.73 improvement). Our best results achieve correlation = 0.692 on SimLex-999 with embed-english-v3.0 (Cohere), correlation = 0.811 on WordSim-353, and correlation = 0.855 on MEN-3000 with text-embedding-3-large (OpenAI). These results outperform classic static embeddings like Word2Vec (correlation = 0.40) and even the best static method LexVec (correlation = 0.48) on SimLex-999, establishing a new state-of-the-art for pure embedding methods. This zero-shot technique requires no training and works with any text embedding model.
[29] Becoming Experienced Judges: Selective Test-Time Learning for Evaluators
Seungyeon Jwa, Daechul Ahn, Reokyoung Kim, Dongyeop Kang, Jonghyun Choi
Main category: cs.CL
TL;DR: Learning While Evaluating (LWE) enables LLM evaluators to improve sequentially during inference by maintaining an evolving meta-prompt that generates sample-specific instructions and refines itself through self-generated feedback, with Selective LWE focusing updates only on self-inconsistent cases for cost-effectiveness.
Details
Motivation: Current LLM-as-a-judge evaluation methods treat each case independently without accumulating experience, and use fixed prompts for all cases, missing opportunities for sample-specific evaluation criteria and sequential improvement.Method: LWE framework maintains an evolving meta-prompt that produces sample-specific evaluation instructions and refines itself through self-generated feedback. Selective LWE updates the meta-prompt only on self-inconsistent cases to focus computation where it matters most.
Result: Across two pairwise comparison benchmarks, Selective LWE outperforms strong baselines, demonstrating that evaluators can improve during sequential testing with simple selective updates, learning most from cases they struggle with.
Conclusion: LWE enables LLM evaluators to accumulate experience and adapt to specific cases during inference without requiring training/validation sets, with Selective LWE providing a cost-effective approach that focuses learning on challenging cases.
Abstract: Automatic evaluation with large language models, commonly known as LLM-as-a-judge, is now standard across reasoning and alignment tasks. Despite evaluating many samples in deployment, these evaluators typically (i) treat each case independently, missing the opportunity to accumulate experience, and (ii) rely on a single fixed prompt for all cases, neglecting the need for sample-specific evaluation criteria. We introduce Learning While Evaluating (LWE), a framework that allows evaluators to improve sequentially at inference time without requiring training or validation sets. LWE maintains an evolving meta-prompt that (i) produces sample-specific evaluation instructions and (ii) refines itself through self-generated feedback. Furthermore, we propose Selective LWE, which updates the meta-prompt only on self-inconsistent cases, focusing computation where it matters most. This selective approach retains the benefits of sequential learning while being far more cost-effective. Across two pairwise comparison benchmarks, Selective LWE outperforms strong baselines, empirically demonstrating that evaluators can improve during sequential testing with a simple selective update, learning most from the cases they struggle with.
[30] From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs
Yuchuan Tian, Yuchen Liang, Jiacheng Sun, Shuo Zhang, Guangwen Yang, Yingte Shu, Sibo Fang, Tianyu Guo, Kai Han, Chao Xu, Hanting Chen, Xinghao Chen, Yunhe Wang
Main category: cs.CL
TL;DR: NBDiff adapts autoregressive LLMs to block-diffusion models for parallel generation while preserving pretrained knowledge, achieving SOTA performance among 7B-class diffusion language models.
Details
Motivation: Autoregressive decoding in LLMs creates throughput bottlenecks due to sequential generation. While diffusion language models enable parallel generation, training them from scratch is costly and wastes existing AR model knowledge. Existing adaptation methods fail to properly address the mismatch between AR causality and block-wise bidirectionality.Method: Reframes adaptation as an intra-paradigm path from AR to Block-Diffusion by viewing AR as Block-Diffusion with blocksize=1. Uses context-causal attention mask (causal in context, bidirectional within block), efficient parallel adaptation, auxiliary AR loss for knowledge retention, and gradual block size increment. Integrates cleanly with masked block-diffusion while maintaining train-inference consistency.
Result: NBDiff-7B (Base and Instruct) achieves state-of-the-art performance among 7B-class DLMs, delivering strong gains on general-knowledge, math, and code benchmarks over strong baselines while inheriting long-context modeling and reasoning capabilities.
Conclusion: Principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training diffusion language models from scratch, enabling parallel generation while leveraging existing AR model knowledge.
Abstract: Large language models (LLMs) excel at generation but dominant autoregressive (AR) decoding is inherently sequential, creating a throughput bottleneck. Diffusion Language Models (DLMs)–especially block-wise variants–enable parallel generation and intra-block bidirectional reasoning, yet training large DLMs from scratch is costly and wastes the knowledge in mature AR checkpoints. Prior “adaptation” attempts either modify logits or randomly grow attention masks to full-sequence diffusion, or simply transplant AR weights into a block-diffusion recipe, leaving a fundamental mismatch between AR causality and block-wise bidirectionality unaddressed. We reframe adaptation as a intra-paradigm path from AR to Block-Diffusion by viewing AR as Block-Diffusion with blocksize=1. Concretely, we design the pathway of adaptation as follows: we use a context-causal attention mask (causal in context, bidirectional only within the active block), an efficient parallel adaptation procedure, an auxiliary AR loss to maximize data utilization and retain pretrained knowledge, and gradual increment of the generation block size. The recipe integrates cleanly with masked block-diffusion and maintains train-inference consistency. Built on these components, NBDiff-7B (Base and Instruct) could inherit the long-context modeling and reasoning capabilities, and achieve state-of-the-art performance among the 7B-class DLMs, delivering strong gains on general-knowledge, math, and code benchmarks over strong baselines. These results demonstrate that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. Codes: https://github.com/YuchuanTian/NBDiff.
[31] LLM4SFC: Sequential Function Chart Generation via Large Language Models
Ofek Glick, Vladimir Tchuiev, Marah Ghoummaid, Michal Moshkovitz, Dotan Di-Castro
Main category: cs.CL
TL;DR: LLM4SFC is the first framework that generates executable Sequential Function Charts (SFCs) from natural language descriptions, addressing the gap in LLM-based generation of graphical PLC programming languages.
Details
Motivation: While LLMs are increasingly used for textual PLC programming languages like Structured Text, graphical languages like SFCs remain underexplored. Generating SFCs is challenging due to their graphical nature and embedded ST actions, often leading to non-executable code incompatible with industrial toolchains.Method: LLM4SFC uses three components: (1) reduced structured representation capturing essential topology and inline ST with reduced verbosity, (2) fine-tuning and few-shot RAG for alignment with SFC programming conventions, and (3) structured generation with real-time illegal token pruning to ensure compliance with SFC textual format.
Result: Evaluation on real-world SFCs from automated manufacturing projects shows LLM4SFC reliably generates syntactically valid SFC programs, achieving 75%-94% generation success across both open-source and proprietary LLMs, effectively bridging graphical and textual PLC languages.
Conclusion: LLM4SFC successfully generates executable SFCs from natural language, paving the way for automated industrial programming by addressing the unique challenges of graphical PLC language generation.
Abstract: While Large Language Models (LLMs) are increasingly used for synthesizing textual PLC programming languages like Structured Text (ST) code, other IEC 61131-3 standard graphical languages like Sequential Function Charts (SFCs) remain underexplored. Generating SFCs is challenging due to graphical nature and ST actions embedded within, which are not directly compatible with standard generation techniques, often leading to non-executable code that is incompatible with industrial tool-chains In this work, we introduce LLM4SFC, the first framework to receive natural-language descriptions of industrial workflows and provide executable SFCs. LLM4SFC is based on three components: (i) A reduced structured representation that captures essential topology and in-line ST and reduced textual verbosity; (ii) Fine-tuning and few-shot retrieval-augmented generation (RAG) for alignment with SFC programming conventions; and (iii) A structured generation approach that prunes illegal tokens in real-time to ensure compliance with the textual format of SFCs. We evaluate LLM4SFC on a dataset of real-world SFCs from automated manufacturing projects, using both open-source and proprietary LLMs. The results show that LLM4SFC reliably generates syntactically valid SFC programs effectively bridging graphical and textual PLC languages, achieving a generation generation success of 75% - 94%, paving the way for automated industrial programming.
[32] Large Language Model-Based Generation of Discharge Summaries
Tiago Rodrigues, Carla Teixeira Lopes
Main category: cs.CL
TL;DR: LLMs show promise for automating discharge summary generation, with proprietary models (especially Gemini 1.5 Pro) outperforming open-source alternatives, though hallucinations and data privacy remain challenges.
Details
Motivation: Discharge summaries contain crucial patient information but are time-consuming to create manually. Automating their generation could reduce healthcare professional workload, minimize errors, and improve accessibility of critical patient data.Method: Evaluated five LLMs (Mistral, Llama 2, GPT-3, GPT-4, Gemini 1.5 Pro) on MIMIC-III data using exact-match, soft-overlap, and reference-free metrics, with human evaluation by a clinical expert.
Result: Proprietary models, particularly Gemini with one-shot prompting, produced summaries most similar to gold-standard references. Open-source models (especially Mistral after fine-tuning) showed promise but struggled with hallucinations and repetition.
Conclusion: LLMs, especially proprietary models, are promising for automatic discharge summary generation, but challenges like hallucinations and data privacy must be addressed for practical deployment.
Abstract: Discharge Summaries are documents written by medical professionals that detail a patient’s visit to a care facility. They contain a wealth of information crucial for patient care, and automating their generation could significantly reduce the effort required from healthcare professionals, minimize errors, and ensure that critical patient information is easily accessible and actionable. In this work, we explore the use of five Large Language Models on this task, from open-source models (Mistral, Llama 2) to proprietary systems (GPT-3, GPT-4, Gemini 1.5 Pro), leveraging MIMIC-III summaries and notes. We evaluate them using exact-match, soft-overlap, and reference-free metrics. Our results show that proprietary models, particularly Gemini with one-shot prompting, outperformed others, producing summaries with the highest similarity to the gold-standard ones. Open-source models, while promising, especially Mistral after fine-tuning, lagged in performance, often struggling with hallucinations and repeated information. Human evaluation by a clinical expert confirmed the practical utility of the summaries generated by proprietary models. Despite the challenges, such as hallucinations and missing information, the findings suggest that LLMs, especially proprietary models, are promising candidates for automatic discharge summary generation as long as data privacy is ensured.
[33] CAuSE: Decoding Multimodal Classifiers using Faithful Natural Language Explanation
Dibyanayan Bandyopadhyay, Soham Bhattacharjee, Mohammed Hasanuzzaman, Asif Ekbal
Main category: cs.CL
TL;DR: CAuSE is a novel framework that generates faithful natural language explanations for multimodal classifiers using causal abstraction and interchange interventions.
Details
Motivation: Multimodal classifiers are opaque black boxes, and while natural language explanations are intuitive for building trust, they must faithfully capture the classifier's internal decision-making behavior (faithfulness).Method: CAuSE uses causal abstraction under simulated explanations with interchange intervention training to generate faithful natural language explanations for any pretrained multimodal classifier.
Result: CAuSE generalizes across datasets and models, surpasses other methods on causal faithfulness metrics, and qualitative analysis reinforces its advantages. Error analysis identifies failure cases.
Conclusion: CAuSE provides a theoretically grounded framework for generating faithful natural language explanations for multimodal classifiers through causal abstraction, improving interpretability and trust.
Abstract: Multimodal classifiers function as opaque black box models. While several techniques exist to interpret their predictions, very few of them are as intuitive and accessible as natural language explanations (NLEs). To build trust, such explanations must faithfully capture the classifier’s internal decision making behavior, a property known as faithfulness. In this paper, we propose CAuSE (Causal Abstraction under Simulated Explanations), a novel framework to generate faithful NLEs for any pretrained multimodal classifier. We demonstrate that CAuSE generalizes across datasets and models through extensive empirical evaluations. Theoretically, we show that CAuSE, trained via interchange intervention, forms a causal abstraction of the underlying classifier. We further validate this through a redesigned metric for measuring causal faithfulness in multimodal settings. CAuSE surpasses other methods on this metric, with qualitative analysis reinforcing its advantages. We perform detailed error analysis to pinpoint the failure cases of CAuSE. For replicability, we make the codes available at https://github.com/newcodevelop/CAuSE
[34] AquaFusionNet: Lightweight VisionSensor Fusion Framework for Real-Time Pathogen Detection and Water Quality Anomaly Prediction on Edge Devices
Sepyan Purnama Kristanto, Lutfi Hakim, Hermansyah
Main category: cs.CL
TL;DR: AquaFusionNet is a lightweight cross-modal framework that unifies microscopic imaging and physicochemical sensors for real-time microbial contamination detection in small-scale drinking water systems, achieving high accuracy with low power consumption.
Details
Motivation: Existing monitoring tools for small-scale drinking water systems only capture fragments of rapidly fluctuating microbial contamination, and operators must interpret microscopic imaging and sensor data separately, making real-time decision-making unreliable.Method: AquaFusionNet uses a gated cross-attention mechanism to learn statistical dependencies between microbial appearance and concurrent sensor dynamics, trained on AquaMicro12K dataset (12,846 annotated micrographs), designed specifically for low-power edge deployment.
Result: Deployed for 6 months across 7 facilities in East Java, Indonesia, the system processed 1.84 million frames with 94.8% mAP@0.5 detection accuracy and 96.3% anomaly prediction accuracy, operating at only 4.8W on Jetson Nano hardware.
Conclusion: AquaFusionNet provides higher accuracy at comparable or lower power than existing lightweight detectors, reduces failure modes of unimodal approaches, and all models, data, and hardware designs are openly released to support decentralized water safety infrastructure.
Abstract: Evidence from many low and middle income regions shows that microbial contamination in small scale drinking water systems often fluctuates rapidly, yet existing monitoring tools capture only fragments of this behaviour. Microscopic imaging provides organism level visibility, whereas physicochemical sensors reveal shortterm changes in water chemistry; in practice, operators must interpret these streams separately, making realtime decision-making unreliable. This study introduces AquaFusionNet, a lightweight cross-modal framework that unifies both information sources inside a single edge deployable model. Unlike prior work that treats microscopic detection and water quality prediction as independent tasks, AquaFusionNet learns the statistical dependencies between microbial appearance and concurrent sensor dynamics through a gated crossattention mechanism designed specifically for lowpower hardware. The framework is trained on AquaMicro12K, a new dataset comprising 12,846 annotated 1000 micrographs curated for drinking water contexts, an area where publicly accessible microscopic datasets are scarce. Deployed for six months across seven facilities in East Java, Indonesia, the system processed 1.84 million frames and consistently detected contamination events with 94.8% mAP@0.5 and 96.3% anomaly prediction accuracy, while operating at 4.8 W on a Jetson Nano. Comparative experiments against representative lightweight detectors show that AquaFusionNet provides higher accuracy at comparable or lower power, and field results indicate that cross-modal coupling reduces common failure modes of unimodal detectors, particularly under fouling, turbidity spikes, and inconsistent illumination. All models, data, and hardware designs are released openly to facilitate replication and adaptation in decentralized water safety infrastructures.
[35] Rhea: Role-aware Heuristic Episodic Attention for Conversational LLMs
Wanyang Hong, Zhaoning Zhang, Yi Chen, Libo Zhang, Baihui Liu, Linbo Qiao, Zhiliang Tian, Dongsheng Li
Main category: cs.CL
TL;DR: Rhea is a novel framework that addresses cumulative contextual decay in multi-turn LLM conversations by decoupling history into Instructional Memory for global constraints and Episodic Memory for dynamic interactions, achieving significant performance improvements.
Details
Motivation: LLMs perform well on single-turn tasks but deteriorate in multi-turn conversations due to cumulative contextual decay - progressive degradation of contextual integrity caused by attention pollution, dilution, and drift.Method: Proposes Rhea (Role-aware Heuristic Episodic Attention) with two independent memory modules: Instructional Memory (stores high-fidelity global constraints via structural priority mechanism) and Episodic Memory (dynamically manages user-model interactions via asymmetric noise control and heuristic context retrieval). Uses priority attention during inference to selectively integrate episodic information while prioritizing global instructions.
Result: On multiple multi-turn conversation benchmarks (MT-Eval and Long-MT-Bench+), Rhea mitigates performance decay and improves overall accuracy by 1.04 points on a 10-point scale (16% relative gain over strong baselines). Maintains near-perfect instruction fidelity (IAR > 8.1) across long-horizon interactions.
Conclusion: Rhea provides a principled and effective framework for building more precise, instruction-consistent conversational LLMs by addressing cumulative contextual decay through role-aware memory decoupling and priority attention mechanisms.
Abstract: Large Language Models (LLMs) have achieved remarkable performance on single-turn tasks, yet their effectiveness deteriorates in multi-turn conversations. We define this phenomenon as cumulative contextual decay - a progressive degradation of contextual integrity caused by attention pollution, dilution, and drift. To address this challenge, we propose Rhea (Role-aware Heuristic Episodic Attention), a novel framework that decouples conversation history into two functionally independent memory modules: (1) an Instructional Memory (IM) that persistently stores high-fidelity global constraints via a structural priority mechanism, and (2) an Episodic Memory (EM) that dynamically manages user-model interactions via asymmetric noise control and heuristic context retrieval. During inference, Rhea constructs a high signal-to-noise context by applying its priority attention: selectively integrating relevant episodic information while always prioritizing global instructions. To validate this approach, experiments on multiple multi-turn conversation benchmarks - including MT-Eval and Long-MT-Bench+ - show that Rhea mitigates performance decay and improves overall accuracy by 1.04 points on a 10-point scale (a 16% relative gain over strong baselines). Moreover, Rhea maintains near-perfect instruction fidelity (IAR > 8.1) across long-horizon interactions. These results demonstrate that Rhea provides a principled and effective framework for building more precise, instruction-consistent conversational LLMs.
[36] An Analysis of Large Language Models for Simulating User Responses in Surveys
Ziyun Yu, Yiru Zhou, Chen Zhao, Hongyi Wen
Main category: cs.CL
TL;DR: LLMs struggle to accurately simulate diverse human opinions in survey responses due to biases toward dominant viewpoints and inability to adapt to specific demographic features.
Details
Motivation: There's growing interest in using LLMs to simulate user opinions, but RLHF-trained LLMs exhibit biases toward dominant viewpoints, raising concerns about their ability to represent diverse demographic and cultural backgrounds.Method: Examined LLMs’ ability to simulate human responses through direct prompting and chain-of-thought prompting. Proposed CLAIMSIM method that elicits viewpoints from LLM parametric knowledge as contextual input to diversify claims.
Result: CLAIMSIM produces more diverse responses but both approaches struggle to accurately simulate users. Key limitations: (1) LLMs maintain fixed viewpoints across demographics and generate single-perspective claims; (2) LLMs struggle to reason over nuanced differences among demographic features when presented with conflicting claims.
Conclusion: Current LLM prompting methods, even with diversification techniques like CLAIMSIM, have significant limitations in accurately simulating diverse human opinions due to viewpoint rigidity and poor demographic reasoning capabilities.
Abstract: Using Large Language Models (LLMs) to simulate user opinions has received growing attention. Yet LLMs, especially trained with reinforcement learning from human feedback (RLHF), are known to exhibit biases toward dominant viewpoints, raising concerns about their ability to represent users from diverse demographic and cultural backgrounds. In this work, we examine the extent to which LLMs can simulate human responses to cross-domain survey questions through direct prompting and chain-of-thought prompting. We further propose a claim diversification method CLAIMSIM, which elicits viewpoints from LLM parametric knowledge as contextual input. Experiments on the survey question answering task indicate that, while CLAIMSIM produces more diverse responses, both approaches struggle to accurately simulate users. Further analysis reveals two key limitations: (1) LLMs tend to maintain fixed viewpoints across varying demographic features, and generate single-perspective claims; and (2) when presented with conflicting claims, LLMs struggle to reason over nuanced differences among demographic features, limiting their ability to adapt responses to specific user profiles.
[37] Automated PRO-CTCAE Symptom Selection based on Prior Adverse Event Profiles
Francois Vandenhende, Anna Georgiou, Michalis Georgiou, Theodoros Psaras, Ellie Karekla
Main category: cs.CL
TL;DR: Automated method for selecting minimal yet comprehensive PRO-CTCAE subsets using MedDRA semantic mapping, utility scoring, and spectral analysis to balance signal coverage with patient burden.
Details
Motivation: Current PRO-CTCAE item selection for oncology trials is manual and subjective, often leading to either excessive patient burden (too many items) or missed safety signals (too few items). There's a need for an objective, data-driven approach to optimize symptom selection.Method: 1) Map PRO-CTCAE symptoms to MedDRA Preferred Terms (PTs); 2) Encode PTs into Safeterm semantic space; 3) Score each PRO item for relevance to historical adverse event data; 4) Combine relevance and incidence into utility function; 5) Apply spectral analysis to utility-diversity matrix to identify orthogonal medical concepts; 6) Rank-order symptoms and determine cut-off based on explained information.
Result: Implemented as part of Safeterm trial-safety app. Evaluated using simulations and oncology case studies where PRO-CTCAE was employed. The method provides an objective, reproducible approach to balance signal coverage against patient burden.
Conclusion: This automated approach streamlines PRO-CTCAE design by leveraging MedDRA semantics and historical data, offering an objective method to optimize symptom selection for oncology trials while minimizing patient burden.
Abstract: The PRO-CTCAE is an NCI-developed patient-reported outcome system for capturing symptomatic adverse events in oncology trials. It comprises a large library drawn from the CTCAE vocabulary, and item selection for a given trial is typically guided by expected toxicity profiles from prior data. Selecting too many PRO-CTCAE items can burden patients and reduce compliance, while too few may miss important safety signals. We present an automated method to select a minimal yet comprehensive PRO-CTCAE subset based on historical safety data. Each candidate PRO-CTCAE symptom term is first mapped to its corresponding MedDRA Preferred Terms (PTs), which are then encoded into Safeterm, a high-dimensional semantic space capturing clinical and contextual diversity in MedDRA terminology. We score each candidate PRO item for relevance to the historical list of adverse event PTs and combine relevance and incidence into a utility function. Spectral analysis is then applied to the combined utility and diversity matrix to identify an orthogonal set of medical concepts that balances relevance and diversity. Symptoms are rank-ordered by importance, and a cut-off is suggested based on the explained information. The tool is implemented as part of the Safeterm trial-safety app. We evaluate its performance using simulations and oncology case studies in which PRO-CTCAE was employed. This automated approach can streamline PRO-CTCAE design by leveraging MedDRA semantics and historical data, providing an objective and reproducible method to balance signal coverage against patient burden.
[38] Large Language Models and Forensic Linguistics: Navigating Opportunities and Threats in the Age of Generative AI
George Mikros
Main category: cs.CL
TL;DR: LLMs create dual challenges for forensic linguistics: they enable powerful authorship analysis tools while undermining traditional idiolect assumptions through style mimicry and synthetic text generation, requiring methodological adaptation for legal admissibility.
Details
Motivation: The paper addresses the dual impact of LLMs on forensic linguistics - as analytical tools that enable scalable analysis while simultaneously destabilizing foundational assumptions about idiolect through style mimicry, authorship obfuscation, and synthetic text proliferation. The tension between LLMs' ability to approximate human style while exhibiting detectable differences has significant forensic implications.Method: The paper analyzes current AI-text detection techniques including classifier-based approaches, stylometric methods, and watermarking approaches, examining their limitations such as high false positive rates for non-native English writers and vulnerability to adversarial strategies like homoglyph substitution. It evaluates these methods against legal admissibility standards (Daubert and Kumho Tire frameworks).
Result: Current AI-text detection techniques face substantial limitations: they produce high false positive rates for non-native English writers and are vulnerable to adversarial strategies. These uncertainties raise serious concerns under legal admissibility standards, indicating that forensic linguistics requires methodological reconfiguration to remain scientifically credible and legally admissible.
Conclusion: Forensic linguistics needs methodological adaptation including hybrid human-AI workflows, explainable detection paradigms beyond binary classification, and validation regimes measuring error and bias across diverse populations. The discipline’s core insight that language reveals information about its producer remains valid but must accommodate increasingly complex chains of human and machine authorship.
Abstract: Large language models (LLMs) present a dual challenge for forensic linguistics. They serve as powerful analytical tools enabling scalable corpus analysis and embedding-based authorship attribution, while simultaneously destabilising foundational assumptions about idiolect through style mimicry, authorship obfuscation, and the proliferation of synthetic texts. Recent stylometric research indicates that LLMs can approximate surface stylistic features yet exhibit detectable differences from human writers, a tension with significant forensic implications. However, current AI-text detection techniques, whether classifier-based, stylometric, or watermarking approaches, face substantial limitations: high false positive rates for non-native English writers and vulnerability to adversarial strategies such as homoglyph substitution. These uncertainties raise concerns under legal admissibility standards, particularly the Daubert and Kumho Tire frameworks. The article concludes that forensic linguistics requires methodological reconfiguration to remain scientifically credible and legally admissible. Proposed adaptations include hybrid human-AI workflows, explainable detection paradigms beyond binary classification, and validation regimes measuring error and bias across diverse populations. The discipline’s core insight, i.e., that language reveals information about its producer, remains valid but must accommodate increasingly complex chains of human and machine authorship.
[39] XAM: Interactive Explainability for Authorship Attribution Models
Milad Alshomary, Anisha Bhatnagar, Peter Zeng, Smaranda Muresan, Owen Rambow, Kathleen McKeown
Main category: cs.CL
TL;DR: IXAM is an interactive explainability framework that helps users explore and understand authorship attribution models by visualizing embedding spaces and constructing multi-granularity writing style explanations.
Details
Motivation: There's a need for better interpretability in authorship attribution models, particularly to help users understand model predictions beyond just predefined stylistic explanations.Method: Developed an interactive framework that allows users to explore model embedding spaces and construct explanations as sets of writing style features at different granularity levels.
Result: User evaluation demonstrated that IXAM provides more value than traditional predefined stylistic explanations for understanding authorship attribution models.
Conclusion: Interactive exploration of embedding spaces with multi-granularity feature explanations offers superior interpretability for authorship attribution models compared to static explanations.
Abstract: We present IXAM, an Interactive eXplainability framework for Authorship Attribution Models. Given an authorship attribution (AA) task and an embedding-based AA model, our tool enables users to interactively explore the model’s embedding space and construct an explanation of the model’s prediction as a set of writing style features at different levels of granularity. Through a user evaluation, we demonstrate the value of our framework compared to predefined stylistic explanations.
[40] Progress Ratio Embeddings: An Impatience Signal for Robust Length Control in Neural Text Generation
Ivanhoé Botcazou, Tassadit Amghar, Sylvain Lamprier, Frédéric Saubion
Main category: cs.CL
TL;DR: The paper introduces Progress Ratio Embeddings (PRE) for robust length control in text generation, addressing limitations of existing methods like Reverse Positional Embeddings (RPE) that become unstable beyond training distribution.
Details
Motivation: Current neural language models lack precise control over generation length. Existing methods like Reverse Positional Embeddings (RPE) fail when controlling generation beyond training distribution, particularly due to instability from discrete countdown signals tied to absolute remaining token counts.Method: The authors propose Progress Ratio Embeddings (PRE), which use continuous embeddings tied to a trigonometric impatience signal. PRE integrates seamlessly into standard Transformer architectures and provides stable length control without compromising text quality.
Result: Experiments on two widely used news-summarization benchmarks show that PRE provides stable length fidelity without degrading text accuracy under standard evaluation metrics. The method also generalizes well to unseen target lengths.
Conclusion: Progress Ratio Embeddings offer a robust solution for length control in text generation, overcoming limitations of previous methods and maintaining text quality while enabling precise length control even for lengths outside the training distribution.
Abstract: Modern neural language models achieve high accuracy in text generation, yet precise control over generation length remains underdeveloped. In this paper, we first investigate a recent length control method based on Reverse Positional Embeddings (RPE) and show its limits when control is requested beyond the training distribution. In particular, using a discrete countdown signal tied to the absolute remaining token count leads to instability. To provide robust length control, we introduce Progress Ratio Embeddings (PRE), as continuous embeddings tied to a trigonometric impatience signal. PRE integrates seamlessly into standard Transformer architectures, providing stable length fidelity without degrading text accuracy under standard evaluation metrics. We further show that PRE generalizes well to unseen target lengths. Experiments on two widely used news-summarization benchmarks validate these findings.
[41] Prompting-in-a-Series: Psychology-Informed Contents and Embeddings for Personality Recognition With Decoder-Only Models
Jing Jie Tan, Ban-Hoe Kwan, Danny Wee-Kiat Ng, Yan-Chai Hum, Anissa Mokraoui, Shih-Yu Lo
Main category: cs.CL
TL;DR: PICEPR is a novel “Prompting-in-a-Series” algorithm that uses psychology-informed content embeddings to improve personality recognition by 5-15%, achieving new SOTA performance.
Details
Motivation: LLMs have shown strong NLP capabilities, but there's a need for better personality recognition methods. The research aims to leverage LLMs' content generation and summarization abilities to enhance personality classification through psychology-informed approaches.Method: Introduces PICEPR algorithm with two pipelines: (1) Contents pipeline for generating/summarizing personality-rich content, and (2) Embeddings pipeline for personality feature extraction. Uses modular decoder-only LLMs and compares both closed-source (GPT-4o, Gemini) and open-source (Mistral) models.
Result: Achieved 5-15% improvement in personality recognition performance, establishing new state-of-the-art results. The algorithm effectively functions as both personality feature extractor and content generator.
Conclusion: PICEPR demonstrates that psychology-informed prompting strategies with modular LLMs can significantly enhance personality recognition capabilities, with both closed and open-source models showing promising results for generating personality-rich content.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various natural language processing tasks. This research introduces a novel “Prompting-in-a-Series” algorithm, termed PICEPR (Psychology-Informed Contents Embeddings for Personality Recognition), featuring two pipelines: (a) Contents and (b) Embeddings. The approach demonstrates how a modularised decoder-only LLM can summarize or generate content, which can aid in classifying or enhancing personality recognition functions as a personality feature extractor and a generator for personality-rich content. We conducted various experiments to provide evidence to justify the rationale behind the PICEPR algorithm. Meanwhile, we also explored closed-source models such as \textit{gpt4o} from OpenAI and \textit{gemini} from Google, along with open-source models like \textit{mistral} from Mistral AI, to compare the quality of the generated content. The PICEPR algorithm has achieved a new state-of-the-art performance for personality recognition by 5-15% improvement. The work repository and models’ weight can be found at https://research.jingjietan.com/?q=PICEPR.
[42] FVA-RAG: Falsification-Verification Alignment for Mitigating Sycophantic Hallucinations
Mayank Ravishankara
Main category: cs.CL
TL;DR: FVA-RAG introduces a falsification-based retrieval framework that actively searches for contradictory evidence to combat retrieval sycophancy in RAG systems, reducing hallucinations from false premises.
Details
Motivation: Standard RAG systems suffer from "Retrieval Sycophancy" - when users query based on false premises or misconceptions, vector retrievers fetch documents that align with user bias rather than objective truth, causing models to "hallucinate with citations."Method: FVA-RAG shifts from inductive verification to deductive falsification. It uses an Adversarial Retrieval Policy to generate “Kill Queries” that surface contradictory evidence, and employs a dual-verification mechanism that weighs draft answers against this “Anti-Context.”
Result: Preliminary experiments on a dataset of common misconceptions show FVA-RAG significantly improves robustness against sycophantic hallucinations compared to standard RAG baselines.
Conclusion: FVA-RAG effectively acts as an inference-time “Red Team” for factual generation by actively seeking disproof rather than just supporting evidence, addressing a critical vulnerability in current RAG architectures.
Abstract: Retrieval-Augmented Generation (RAG) systems have significantly reduced hallucinations in Large Language Models (LLMs) by grounding responses in external context. However, standard RAG architectures suffer from a critical vulnerability: Retrieval Sycophancy. When presented with a query based on a false premise or a common misconception, vector-based retrievers tend to fetch documents that align with the user’s bias rather than objective truth, leading the model to “hallucinate with citations.” In this work, we introduce Falsification-Verification Alignment RAG (FVA-RAG), a framework that shifts the retrieval paradigm from Inductive Verification (seeking support) to Deductive Falsification (seeking disproof). Unlike existing “Self-Correction” methods that rely on internal consistency, FVA-RAG deploys a distinct Adversarial Retrieval Policy that actively generates “Kill Queries”-targeted search terms designed to surface contradictory evidence. We introduce a dual-verification mechanism that explicitly weighs the draft answer against this “Anti-Context.” Preliminary experiments on a dataset of common misconceptions demonstrate that FVA-RAG significantly improves robustness against sycophantic hallucinations compared to standard RAG baselines, effectively acting as an inference-time “Red Team” for factual generation.
[43] Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models
Richard Young
Main category: cs.CL
TL;DR: Most frontier LLMs are highly vulnerable to multi-turn adversarial attacks (96-100% success rate), though some show meaningful resistance (42-78% success rate). Extended reasoning reduces vulnerability, but model scale doesn’t predict robustness.
Details
Motivation: Despite substantial safety alignment investments, it's unclear how vulnerable LLMs are to sophisticated multi-turn adversarial attacks, and whether model scale or inference mode affects robustness.Method: Used TEMPEST multi-turn attack framework to evaluate 10 frontier models from 8 vendors across 1,000 harmful behaviors, generating over 97,000 API queries with automated safety classifier evaluation.
Result: Six models achieved 96-100% attack success rate (ASR), while four showed meaningful resistance with ASR 42-78%. Enabling extended reasoning reduced ASR from 97% to 42% on identical architecture.
Conclusion: Current alignment techniques remain fundamentally vulnerable to adaptive multi-turn attacks regardless of model scale, but deliberative inference (thinking mode) provides a promising deployable safety enhancement.
Abstract: Despite substantial investment in safety alignment, the vulnerability of large language models to sophisticated multi-turn adversarial attacks remains poorly characterized, and whether model scale or inference mode affects robustness is unknown. This study employed the TEMPEST multi-turn attack framework to evaluate ten frontier models from eight vendors across 1,000 harmful behaviors, generating over 97,000 API queries across adversarial conversations with automated evaluation by independent safety classifiers. Results demonstrated a spectrum of vulnerability: six models achieved 96% to 100% attack success rate (ASR), while four showed meaningful resistance, with ASR ranging from 42% to 78%; enabling extended reasoning on identical architecture reduced ASR from 97% to 42%. These findings indicate that safety alignment quality varies substantially across vendors, that model scale does not predict adversarial robustness, and that thinking mode provides a deployable safety enhancement. Collectively, this work establishes that current alignment techniques remain fundamentally vulnerable to adaptive multi-turn attacks regardless of model scale, while identifying deliberative inference as a promising defense direction.
[44] SETUP: Sentence-level English-To-Uniform Meaning Representation Parser
Emma Markle, Javier Gutierrez Bach, Shira Wein
Main category: cs.CL
TL;DR: This paper introduces two methods for English text-to-UMR parsing, with the best model (SETUP) achieving significant performance gains in automatic UMR parsing.
Details
Motivation: UMR is a promising semantic representation for language documentation and low-resource language technologies, but its downstream applications require automatic text-to-UMR parsers for large-scale production of accurate UMR graphs.Method: Two approaches: 1) fine-tuning existing Abstract Meaning Representation parsers, and 2) leveraging a converter from Universal Dependencies, using prior work as a baseline.
Result: The best-performing model (SETUP) achieves an AnCast score of 84 and a SMATCH++ score of 91, showing substantial improvements in automatic UMR parsing.
Conclusion: The paper demonstrates successful methods for English text-to-UMR parsing with significant performance gains, advancing the practical application of UMR for language documentation and low-resource language technologies.
Abstract: Uniform Meaning Representation (UMR) is a novel graph-based semantic representation which captures the core meaning of a text, with flexibility incorporated into the annotation schema such that the breadth of the world’s languages can be annotated (including low-resource languages). While UMR shows promise in enabling language documentation, improving low-resource language technologies, and adding interpretability, the downstream applications of UMR can only be fully explored when text-to-UMR parsers enable the automatic large-scale production of accurate UMR graphs at test time. Prior work on text-to-UMR parsing is limited to date. In this paper, we introduce two methods for English text-to-UMR parsing, one of which fine-tunes existing parsers for Abstract Meaning Representation and the other, which leverages a converter from Universal Dependencies, using prior work as a baseline. Our best-performing model, which we call SETUP, achieves an AnCast score of 84 and a SMATCH++ score of 91, indicating substantial gains towards automatic UMR parsing.
[45] Do Large Language Models Truly Understand Cross-cultural Differences?
Shiwei Guo, Sihang Jiang, Qianxi He, Yanghua Xiao, Jiaqing Liang, Bi Yude, Minggui He, Shimin Tao, Li Zhang
Main category: cs.CL
TL;DR: SAGE is a scenario-based benchmark for evaluating LLMs’ cross-cultural understanding, addressing limitations in existing benchmarks through cross-cultural concept alignment and generative task design.
Details
Motivation: Existing benchmarks for evaluating LLMs' cross-cultural understanding lack contextual scenarios, sufficient cross-cultural concept mapping, and deep cultural reasoning capabilities, creating a need for more comprehensive evaluation tools.Method: Built SAGE benchmark via cross-cultural core concept alignment and generative task design, grounded in cultural theory with 9 capability dimensions. Curated 210 core concepts and constructed 4530 test items across 15 real-world scenarios organized under 4 cross-cultural situation categories.
Result: SAGE reveals model weaknesses across dimensions and scenarios, exposing systematic limitations in cross-cultural reasoning. Experiments confirm its transferability to other languages. LLMs still lack truly nuanced cross-cultural understanding despite progress.
Conclusion: SAGE addresses key gaps in evaluating LLMs’ cross-cultural capabilities and shows that while progress has been made, LLMs are still far from achieving genuine nuanced cross-cultural understanding.
Abstract: In recent years, large language models (LLMs) have demonstrated strong performance on multilingual tasks. Given its wide range of applications, cross-cultural understanding capability is a crucial competency. However, existing benchmarks for evaluating whether LLMs genuinely possess this capability suffer from three key limitations: a lack of contextual scenarios, insufficient cross-cultural concept mapping, and limited deep cultural reasoning capabilities. To address these gaps, we propose SAGE, a scenario-based benchmark built via cross-cultural core concept alignment and generative task design, to evaluate LLMs’ cross-cultural understanding and reasoning. Grounded in cultural theory, we categorize cross-cultural capabilities into nine dimensions. Using this framework, we curated 210 core concepts and constructed 4530 test items across 15 specific real-world scenarios, organized under four broader categories of cross-cultural situations, following established item design principles. The SAGE dataset supports continuous expansion, and experiments confirm its transferability to other languages. It reveals model weaknesses across both dimensions and scenarios, exposing systematic limitations in cross-cultural reasoning. While progress has been made, LLMs are still some distance away from reaching a truly nuanced cross-cultural understanding. In compliance with the anonymity policy, we include data and code in the supplement materials. In future versions, we will make them publicly available online.
[46] Leveraging KV Similarity for Online Structured Pruning in LLMs
Jungmin Lee, Gwangeun Byeon, Yulhwa Kim, Seokin Hong
Main category: cs.CL
TL;DR: Token Filtering is an online structured pruning technique for LLMs that skips redundant attention computations during inference without calibration data, using joint key-value similarity to measure token redundancy.
Details
Motivation: Existing pruning approaches for LLMs suffer from instability because they rely on offline calibration data that may not generalize across different inputs, leading to unreliable pruning decisions during inference.Method: The method measures token redundancy via joint key-value similarity and skips redundant attention computations during inference. It uses a variance-aware fusion strategy that adaptively weights key and value similarity across attention heads to ensure informative tokens are retained even under high pruning ratios.
Result: Token Filtering consistently outperforms prior structured pruning methods across LLaMA-2 (7B/13B), LLaMA-3 (8B), and Mistral (7B) models. It preserves accuracy on commonsense reasoning benchmarks and maintains strong performance on challenging tasks like MMLU, even with 50% pruning.
Conclusion: Token Filtering provides a lightweight, stable online pruning technique that reduces LLM inference costs without calibration data, offering better performance preservation than existing methods while introducing no additional memory overhead.
Abstract: Pruning has emerged as a promising direction for accelerating large language model (LLM) inference, yet existing approaches often suffer from instability because they rely on offline calibration data that may not generalize across inputs. In this work, we introduce Token Filtering, a lightweight online structured pruning technique that makes pruning decisions directly during inference without any calibration data. The key idea is to measure token redundancy via joint key-value similarity and skip redundant attention computations, thereby reducing inference cost while preserving critical information. To further enhance stability, we design a variance-aware fusion strategy that adaptively weights key and value similarity across heads, ensuring that informative tokens are retained even under high pruning ratios. This design introduces no additional memory overhead and provides a more reliable criterion for token importance. Extensive experiments on LLaMA-2 (7B/13B), LLaMA-3 (8B), and Mistral (7B) demonstrate that Token Filtering consistently outperforms prior structured pruning methods, preserving accuracy on commonsense reasoning benchmarks and maintaining strong performance on challenging tasks such as MMLU, even with 50% pruning.
[47] DART: Leveraging Multi-Agent Disagreement for Tool Recruitment in Multimodal Reasoning
Nithin Sivakumaran, Justin Chih-Yao Chen, David Wan, Yue Zhang, Jaehong Yoon, Elias Stengel-Eskin, Mohit Bansal
Main category: cs.CL
TL;DR: DART is a multi-agent framework that uses disagreements between visual agents to identify and call appropriate visual tools (object detection, OCR, spatial reasoning, etc.) to resolve debates, improving performance over existing multi-agent and tool-calling methods.
Details
Motivation: While specialized visual tools can enhance LLMs/VLMs with expert knowledge, determining which tools to call and when to call them is challenging. The paper aims to address this by leveraging agent disagreements to guide tool selection.Method: DART uses multiple debating visual agents whose disagreements trigger the identification of useful visual tools. These tools introduce new information and provide tool-aligned agreement scores to facilitate discussion. An aggregator agent then selects the best answer using agent outputs and tool information.
Result: DART outperforms multi-agent debate and single-agent tool-calling frameworks, beating the next-strongest baseline by 3.4% on A-OKVQA and 2.4% on MMMU. It also shows strong adaptation to new tools with 1.3% improvement on M3D medical dataset. Analysis reveals rich discussion and diverse tool usage.
Conclusion: DART effectively leverages agent disagreements to guide visual tool selection, improving performance across diverse benchmarks and demonstrating adaptability to new domains while facilitating richer multi-agent discussions compared to existing methods.
Abstract: Specialized visual tools can augment large language models or vision language models with expert knowledge (e.g., grounding, spatial reasoning, medical knowledge, etc.), but knowing which tools to call (and when to call them) can be challenging. We introduce DART, a multi-agent framework that uses disagreements between multiple debating visual agents to identify useful visual tools (e.g., object detection, OCR, spatial reasoning, etc.) that can resolve inter-agent disagreement. These tools allow for fruitful multi-agent discussion by introducing new information, and by providing tool-aligned agreement scores that highlight agents in agreement with expert tools, thereby facilitating discussion. We utilize an aggregator agent to select the best answer by providing the agent outputs and tool information. We test DART on four diverse benchmarks and show that our approach improves over multi-agent debate as well as over single agent tool-calling frameworks, beating the next-strongest baseline (multi-agent debate with a judge model) by 3.4% and 2.4% on A-OKVQA and MMMU respectively. We also find that DART adapts well to new tools in applied domains, with a 1.3% improvement on the M3D medical dataset over other strong tool-calling, single agent, and multi-agent baselines. Additionally, we measure text overlap across rounds to highlight the rich discussion in DART compared to existing multi-agent methods. Finally, we study the tool call distribution, finding that diverse tools are reliably used to help resolve disagreement.
[48] GUMBridge: a Corpus for Varieties of Bridging Anaphora
Lauren Levine, Amir Zeldes
Main category: cs.CL
TL;DR: GUMBridge is a new English bridging anaphora resource covering 16 diverse genres with detailed subtype annotations, showing LLMs still struggle with bridging resolution tasks.
Details
Motivation: Existing bridging anaphora resources are limited - they're small, have limited coverage of the phenomenon, and/or limited genre coverage. There's a need for a more comprehensive resource that captures the diversity of bridging across different types of English text.Method: Created GUMBridge, a new resource with 16 diverse English genres, providing both broad coverage of bridging phenomena and granular annotations for subtype categorization. Also conducted annotation quality evaluation and baseline performance testing using contemporary LLMs on three core tasks.
Result: The resource provides comprehensive coverage across genres. Evaluation shows bridging resolution and subtype classification remain challenging NLP tasks even for modern LLMs, indicating the complexity of these linguistic phenomena.
Conclusion: GUMBridge addresses limitations of existing resources by providing diverse genre coverage and detailed subtype annotations. Despite advances in LLMs, bridging anaphora resolution and classification remain difficult tasks that require specialized resources and approaches.
Abstract: Bridging is an anaphoric phenomenon where the referent of an entity in a discourse is dependent on a previous, non-identical entity for interpretation, such as in “There is ‘a house’. ‘The door’ is red,” where the door is specifically understood to be the door of the aforementioned house. While there are several existing resources in English for bridging anaphora, most are small, provide limited coverage of the phenomenon, and/or provide limited genre coverage. In this paper, we introduce GUMBridge, a new resource for bridging, which includes 16 diverse genres of English, providing both broad coverage for the phenomenon and granular annotations for the subtype categorization of bridging varieties. We also present an evaluation of annotation quality and report on baseline performance using open and closed source contemporary LLMs on three tasks underlying our data, showing that bridging resolution and subtype classification remain difficult NLP tasks in the age of LLMs.
[49] NeSTR: A Neuro-Symbolic Abductive Framework for Temporal Reasoning in Large Language Models
Feng Liang, Weixin Zeng, Runhao Zhao, Xiang Zhao
Main category: cs.CL
TL;DR: NeSTR is a neuro-symbolic framework that combines symbolic temporal representations with reflective reasoning to improve LLMs’ temporal reasoning without fine-tuning.
Details
Motivation: LLMs struggle with temporal reasoning under complex constraints. Existing approaches either underutilize LLMs' reasoning capabilities (symbolic methods) or lack structured temporal representations (reflective methods), leading to inconsistent or hallucinated reasoning even when correct temporal context is available.Method: NeSTR integrates structured symbolic representations with hybrid reflective reasoning. It preserves explicit temporal relations through symbolic encoding, enforces logical consistency via verification, and corrects flawed inferences using abductive reflection.
Result: Extensive experiments on diverse temporal question answering benchmarks show NeSTR achieves superior zero-shot performance and consistently improves temporal reasoning without any fine-tuning.
Conclusion: The neuro-symbolic integration in NeSTR effectively enhances temporal understanding in large language models, demonstrating the advantage of combining symbolic representations with reflective reasoning for complex temporal tasks.
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, temporal reasoning, particularly under complex temporal constraints, remains a major challenge. To this end, existing approaches have explored symbolic methods, which encode temporal structure explicitly, and reflective mechanisms, which revise reasoning errors through multi-step inference. Nonetheless, symbolic approaches often underutilize the reasoning capabilities of LLMs, while reflective methods typically lack structured temporal representations, which can result in inconsistent or hallucinated reasoning. As a result, even when the correct temporal context is available, LLMs may still misinterpret or misapply time-related information, leading to incomplete or inaccurate answers. To address these limitations, in this work, we propose Neuro-Symbolic Temporal Reasoning (NeSTR), a novel framework that integrates structured symbolic representations with hybrid reflective reasoning to enhance the temporal sensitivity of LLM inference. NeSTR preserves explicit temporal relations through symbolic encoding, enforces logical consistency via verification, and corrects flawed inferences using abductive reflection. Extensive experiments on diverse temporal question answering benchmarks demonstrate that NeSTR achieves superior zero-shot performance and consistently improves temporal reasoning without any fine-tuning, showcasing the advantage of neuro-symbolic integration in enhancing temporal understanding in large language models.
[50] Ensembling LLM-Induced Decision Trees for Explainable and Robust Error Detection
Mengqi Wang, Jianwei Wang, Qing Liu, Xiwei Xu, Zhenchang Xing, Liming Zhu, Wenjie Zhang
Main category: cs.CL
TL;DR: Proposes ForestED: an LLM-as-an-inducer framework that uses LLMs to generate decision trees for error detection in tabular data, improving explainability and robustness over traditional LLM-as-labeler approaches.
Details
Motivation: Current LLM-based error detection methods lack explainability (black-box decisions) and robustness (sensitive to prompts), limiting their practical utility for ensuring data quality.Method: TreeED: Uses LLMs to induce decision tree skeletons with rule nodes (simple validation), GNN nodes (complex patterns), and leaf nodes (error/clean decisions). ForestED: Ensembles multiple trees via uncertainty-based sampling and EM-based consensus optimization.
Result: Achieves 16.1% average F1-score improvement over best baselines, demonstrating superior accuracy, explainability, and robustness in error detection.
Conclusion: The LLM-as-an-inducer framework effectively addresses limitations of LLM-as-labeler approaches, providing transparent, robust, and accurate error detection for tabular data quality assurance.
Abstract: Error detection (ED), which aims to identify incorrect or inconsistent cell values in tabular data, is important for ensuring data quality. Recent state-of-the-art ED methods leverage the pre-trained knowledge and semantic capability embedded in large language models (LLMs) to directly label whether a cell is erroneous. However, this LLM-as-a-labeler pipeline (1) relies on the black box, implicit decision process, thus failing to provide explainability for the detection results, and (2) is highly sensitive to prompts, yielding inconsistent outputs due to inherent model stochasticity, therefore lacking robustness. To address these limitations, we propose an LLM-as-an-inducer framework that adopts LLM to induce the decision tree for ED (termed TreeED) and further ensembles multiple such trees for consensus detection (termed ForestED), thereby improving explainability and robustness. Specifically, based on prompts derived from data context, decision tree specifications and output requirements, TreeED queries the LLM to induce the decision tree skeleton, whose root-to-leaf decision paths specify the stepwise procedure for evaluating a given sample. Each tree contains three types of nodes: (1) rule nodes that perform simple validation checks (e.g., format or range), (2) Graph Neural Network (GNN) nodes that capture complex patterns (e.g., functional dependencies), and (3) leaf nodes that output the final decision types (error or clean). Furthermore, ForestED employs uncertainty-based sampling to obtain multiple row subsets, constructing a decision tree for each subset using TreeED. It then leverages an Expectation-Maximization-based algorithm that jointly estimates tree reliability and optimizes the consensus ED prediction. Extensive xperiments demonstrate that our methods are accurate, explainable and robust, achieving an average F1-score improvement of 16.1% over the best baseline.
[51] Investigating Training and Generalization in Faithful Self-Explanations of Large Language Models
Tomoki Doi, Masaru Isonuma, Hitomi Yanaka
Main category: cs.CL
TL;DR: Training LLMs with pseudo-faithful one-word explanations improves faithfulness across tasks and styles, with generalization to multi-word settings and unseen tasks.
Details
Motivation: LLMs often generate unfaithful self-explanations, but how to improve faithfulness and whether improvements generalize across explanation styles remains unclear.Method: Construct one-word constrained explanations using feature attribution methods to create pseudo-faithful self-explanations, then use these for continual learning on instruction-tuned models across three classification tasks and three explanation styles.
Result: Training improves self-explanation faithfulness across all tasks and styles, with generalization to multi-word settings and unseen tasks, plus consistent cross-style generalization.
Conclusion: Training with pseudo-faithful explanations can broadly improve faithful self-explanation ability in LLMs, with effects generalizing across different explanation styles and tasks.
Abstract: Large language models have the potential to generate explanations for their own predictions in a variety of styles based on user instructions. Recent research has examined whether these self-explanations faithfully reflect the models’ actual behavior and has found that they often lack faithfulness. However, the question of how to improve faithfulness remains underexplored. Moreover, because different explanation styles have superficially distinct characteristics, it is unclear whether improvements observed in one style also arise when using other styles. This study analyzes the effects of training for faithful self-explanations and the extent to which these effects generalize, using three classification tasks and three explanation styles. We construct one-word constrained explanations that are likely to be faithful using a feature attribution method, and use these pseudo-faithful self-explanations for continual learning on instruction-tuned models. Our experiments demonstrate that training can improve self-explanation faithfulness across all classification tasks and explanation styles, and that these improvements also show signs of generalization to the multi-word settings and to unseen tasks. Furthermore, we find consistent cross-style generalization among three styles, suggesting that training may contribute to a broader improvement in faithful self-explanation ability.
[52] Multilingual corpora for the study of new concepts in the social sciences and humanities:
Revekka Kyriakoglou, Anna Pappa
Main category: cs.CL
TL;DR: Hybrid methodology for building multilingual corpus to study emerging HSS concepts, demonstrated with “non-technological innovation” case study.
Details
Motivation: To create a reproducible and extensible resource for studying emerging concepts in humanities and social sciences, addressing the need for multilingual corpora that can support both conceptual analysis and NLP applications.Method: Combines two data sources: (1) cleaned text from company websites (French/English), and (2) filtered annual reports. Processing includes language detection, content filtering, segment extraction, and metadata enrichment. Creates English dataset with contextual blocks (5 sentences) around expert lexicon terms, annotated with thematic categories.
Result: Produces a multilingual corpus suitable for analyzing lexical variability of emerging concepts and generating datasets for supervised classification tasks in natural language processing.
Conclusion: The hybrid approach yields a reproducible, extensible resource that serves dual purposes: conceptual analysis in HSS and dataset generation for NLP applications, demonstrated through the “non-technological innovation” case study.
Abstract: This article presents a hybrid methodology for building a multilingual corpus designed to support the study of emerging concepts in the humanities and social sciences (HSS), illustrated here through the case of ``non-technological innovation’’. The corpus relies on two complementary sources: (1) textual content automatically extracted from company websites, cleaned for French and English, and (2) annual reports collected and automatically filtered according to documentary criteria (year, format, duplication). The processing pipeline includes automatic language detection, filtering of non-relevant content, extraction of relevant segments, and enrichment with structural metadata. From this initial corpus, a derived dataset in English is created for machine learning purposes. For each occurrence of a term from the expert lexicon, a contextual block of five sentences is extracted (two preceding and two following the sentence containing the term). Each occurrence is annotated with the thematic category associated with the term, enabling the construction of data suitable for supervised classification tasks. This approach results in a reproducible and extensible resource, suitable both for analyzing lexical variability around emerging concepts and for generating datasets dedicated to natural language processing applications.
[53] Training Language Models to Use Prolog as a Tool
Niklas Mellgren, Peter Schneider-Kamp, Lukas Galke Poech
Main category: cs.CL
TL;DR: Fine-tuning language models to use Prolog as an external verification tool improves reasoning reliability and auditability through reinforcement learning with optimized prompts, rewards, and inference protocols.
Details
Motivation: Language models often produce plausible but incorrect reasoning that's hard to verify, creating safety risks for agentic AI systems. The paper aims to improve reliability by grounding model reasoning in formal verification systems.Method: Fine-tuned Qwen2.5-3B-Instruct using Group Relative Policy Optimization (GRPO) on GSM8K-Prolog-Prover dataset, varying prompt structure, reward composition (execution, syntax, semantics, structure), and inference protocols (single-shot, best-of-N, agentic modes).
Result: Reinforcement learning outperformed supervised fine-tuning, achieving zero-shot MMLU performance comparable to 7B few-shot results. Best-of-N with external Prolog verification maximized GSM8K accuracy, while agentic inference with internal repair yielded superior zero-shot generalization on MMLU-Stem and MMLU-Pro.
Conclusion: Grounding model reasoning in formal verification systems like Prolog substantially improves reliability and auditability for safety-critical applications, demonstrating the importance of joint optimization of prompts, rewards, and inference protocols.
Abstract: Ensuring reliable tool use is critical for safe agentic AI systems. Language models frequently produce unreliable reasoning with plausible but incorrect solutions that are difficult to verify. To address this, we investigate fine-tuning models to use Prolog as an external tool for verifiable computation. Using Group Relative Policy Optimization (GRPO), we fine-tune Qwen2.5-3B-Instruct on a cleaned GSM8K-Prolog-Prover dataset while varying (i) prompt structure, (ii) reward composition (execution, syntax, semantics, structure), and (iii) inference protocol: single-shot, best-of-N, and two agentic modes where Prolog is invoked internally or independently. Our reinforcement learning approach outperforms supervised fine-tuning, with our 3B model achieving zero-shot MMLU performance comparable to 7B few-shot results. Our findings reveal that: 1) joint tuning of prompt, reward, and inference shapes program syntax and logic; 2) best-of-N with external Prolog verification maximizes accuracy on GSM8K; 3) agentic inference with internal repair yields superior zero-shot generalization on MMLU-Stem and MMLU-Pro. These results demonstrate that grounding model reasoning in formal verification systems substantially improves reliability and auditability for safety-critical applications. The source code for reproducing our experiments is available under https://github.com/niklasmellgren/grpo-prolog-inference
[54] Persian-Phi: Efficient Cross-Lingual Adaptation of Compact LLMs via Curriculum Learning
Amir Mohammad Akhlaghi, Amirhossein Shabani, Mostafa Abdolmaleki, Saeed Reza Kheradpisheh
Main category: cs.CL
TL;DR: Persian-Phi is a 3.8B parameter model that efficiently adapts Microsoft’s monolingual English Phi-3 Mini to Persian using a novel curriculum learning pipeline, achieving competitive results with minimal computational resources.
Details
Motivation: The democratization of AI is hindered by high computational costs for training LLMs for low-resource languages. Current approaches assume robust multilingual capabilities require massive models or multilingual baselines, which is resource-intensive.Method: A resource-efficient curriculum learning pipeline: 1) “warm-up” stage using bilingual narratives (Tiny Stories) to align embeddings, 2) continual pretraining, and 3) instruction tuning via Parameter-Efficient Fine-Tuning (PEFT). Adapts monolingual English Phi-3 Mini to Persian.
Result: Persian-Phi achieves competitive results on the Open Persian LLM Leaderboard in HuggingFace despite its compact 3.8B parameter size, demonstrating effective adaptation with minimal hardware resources.
Conclusion: The approach provides a validated, scalable framework for extending state-of-the-art LLMs to underrepresented languages with minimal computational resources, challenging assumptions about multilingual model requirements.
Abstract: The democratization of AI is currently hindered by the immense computational costs required to train Large Language Models (LLMs) for low-resource languages. This paper presents Persian-Phi, a 3.8B parameter model that challenges the assumption that robust multilingual capabilities require massive model sizes or multilingual baselines. We demonstrate how Microsoft Phi-3 Mini – originally a monolingual English model – can be effectively adapted to Persian through a novel, resource-efficient curriculum learning pipeline. Our approach employs a unique “warm-up” stage using bilingual narratives (Tiny Stories) to align embeddings prior to heavy training, followed by continual pretraining and instruction tuning via Parameter-Efficient Fine-Tuning (PEFT). Despite its compact size, Persian-Phi achieves competitive results on Open Persian LLM Leaderboard in HuggingFace. Our findings provide a validated, scalable framework for extending the reach of state-of-the-art LLMs to underrepresented languages with minimal hardware resources. The Persian-Phi model is publicly available at https://huggingface.co/amirakhlaghiqqq/PersianPhi.
[55] Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
Tong Wu, Yang Liu, Jun Bai, Zixia Jia, Shuyi Zhang, Ziyong Lin, Yanting Wang, Song-Chun Zhu, Zilong Zheng
Main category: cs.CL
TL;DR: NPR is a teacher-free framework that enables LLMs to self-evolve genuine parallel reasoning capabilities, achieving up to 24.5% performance gains and 4.6x inference speedups with 100% genuine parallel execution.
Details
Motivation: Current LLMs rely on sequential emulation for reasoning tasks, which limits efficiency and scalability. There's a need for models to develop native parallel reasoning capabilities without external supervision to improve both performance and inference speed.Method: Three key innovations: 1) Self-distilled progressive training paradigm transitioning from format discovery to topological constraints; 2) Parallel-Aware Policy Optimization (PAPO) algorithm optimizing branching policies within execution graphs; 3) NPR Engine refactoring memory management and flow control of SGLang for stable parallel RL training.
Result: Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains up to 24.5% and inference speedups up to 4.6x. Demonstrates 100% genuine parallel execution, unlike prior baselines that often fall back to autoregressive decoding.
Conclusion: NPR establishes a new standard for self-evolving, efficient, and scalable agentic reasoning by enabling LLMs to develop genuine parallel reasoning capabilities without external supervision, overcoming limitations of sequential emulation approaches.
Abstract: We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from ``cold-start’’ format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.
[56] Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization
Zhuoran Zhuang, Ye Chen, Jianghao Su, Chao Luo, Luhui Liu, Xia Zeng
Main category: cs.CL
TL;DR: PRS and VSPO techniques improve LLM tool-use agents: PRS provides progressive dense rewards, VSPO enhances policy optimization with value-based sampling and smoothing.
Details
Motivation: Two key challenges hinder Agentic RL for LLM tool-use: (1) sparse binary rewards provide limited guidance for intermediate steps, slowing convergence; (2) gradient degradation in GRPO when identical rewards yield zero advantage, reducing sample efficiency and destabilizing training.Method: Two complementary techniques: 1) Progressive Reward Shaping (PRS) - curriculum-inspired reward design with dense, stage-wise feedback (format correctness → factual correctness → answer quality); 2) Value-based Sampling Policy Optimization (VSPO) - enhanced GRPO variant with value-based sampling (balancing difficulty/uncertainty) and value-smoothing clipping.
Result: Experiments on multiple QA benchmarks show PRS consistently outperforms traditional binary rewards, and VSPO achieves superior stability, faster convergence, and higher final performance compared to PPO, GRPO, CISPO, and SFT-only baselines.
Conclusion: PRS and VSPO together yield LLM-based TIR agents that generalize better across domains by addressing sparse reward and gradient degradation problems in Agentic RL for tool-use.
Abstract: Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories, but two key challenges hinder effectiveness: (1) Sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence; (2) Gradient degradation in Group Relative Policy Optimization (GRPO), where identical rewards within a rollout group yield zero advantage, reducing sample efficiency and destabilizing training. To address these challenges, we propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO). PRS is a curriculum-inspired reward design that introduces dense, stage-wise feedback - encouraging models to first master parseable and properly formatted tool calls, then optimize for factual correctness and answer quality. We instantiate PRS for short-form QA (with a length-aware BLEU to fairly score concise answers) and long-form QA (with LLM-as-a-Judge scoring to prevent reward hacking). VSPO is an enhanced GRPO variant that replaces low-value samples with prompts selected by a task-value metric balancing difficulty and uncertainty, and applies value-smoothing clipping to stabilize gradient updates. Experiments on multiple short-form and long-form QA benchmarks show that PRS consistently outperforms traditional binary rewards, and VSPO achieves superior stability, faster convergence, and higher final performance compared to PPO, GRPO, CISPO, and SFT-only baselines. Together, PRS and VSPO yield LLM-based TIR agents that generalize better across domains.
[57] SPAD: Seven-Source Token Probability Attribution with Syntactic Aggregation for Detecting Hallucinations in RAG
Pengqian Lu, Jie Lu, Anjin Liu, Guangquan Zhang
Main category: cs.CL
TL;DR: SPAD introduces a fine-grained attribution method to detect hallucinations in RAG systems by decomposing token probabilities into seven sources and analyzing their contributions across linguistic categories.
Details
Motivation: Existing approaches treat hallucinations as binary conflicts between internal knowledge and retrieved context, but this oversimplifies the generative process which involves multiple components like user queries, previously generated tokens, current tokens, and LayerNorm adjustments.Method: SPAD mathematically attributes each token’s probability into seven distinct sources: Query, RAG, Past, Current Token, FFN, Final LayerNorm, and Initial Embedding. It then aggregates these scores by POS tags to quantify how different components drive specific linguistic categories, identifying anomalies like Nouns relying on Final LayerNorm.
Result: Extensive experiments demonstrate that SPAD achieves state-of-the-art performance in detecting hallucinations in RAG systems.
Conclusion: SPAD provides a more comprehensive approach to hallucination detection by considering multiple components of the generative process rather than just binary conflicts, enabling more effective identification of anomalous generation patterns.
Abstract: Detecting hallucinations in Retrieval-Augmented Generation (RAG) remains a challenge. Prior approaches attribute hallucinations to a binary conflict between internal knowledge (stored in FFNs) and retrieved context. However, this perspective is incomplete, failing to account for the impact of other components in the generative process, such as the user query, previously generated tokens, the current token itself, and the final LayerNorm adjustment. To address this, we introduce SPAD. First, we mathematically attribute each token’s probability into seven distinct sources: Query, RAG, Past, Current Token, FFN, Final LayerNorm, and Initial Embedding. This attribution quantifies how each source contributes to the generation of the current token. Then, we aggregate these scores by POS tags to quantify how different components drive specific linguistic categories. By identifying anomalies, such as Nouns relying on Final LayerNorm, SPAD effectively detects hallucinations. Extensive experiments demonstrate that SPAD achieves state-of-the-art performance
[58] LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings
Sebastian Sztwiertnia, Felix Friedrich, Kristian Kersting, Patrick Schramowski, Björn Deiseroth
Main category: cs.CL
TL;DR: LIME (Linguistic Metadata Embeddings) enriches token embeddings with syntax, semantics, and contextual metadata to improve pre-training efficiency and language modeling performance with minimal parameter overhead.
Details
Motivation: High-quality data for pre-training decoder-only language models is becoming scarce, and while metadata is commonly used for dataset curation, its potential as a direct training signal remains under-explored.Method: Proposes LIME, which enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. Also develops LIME+1 variant with shifted metadata that can guide token generation using prior metadata for the next token.
Result: LIME adapts up to 56% faster to training data distribution with only 0.01% additional parameters. Improves tokenization and language modeling capabilities across model scales (500M to 2B). LIME+1 improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.
Conclusion: LIME demonstrates that metadata can serve as an effective direct training signal, substantially improving pre-training efficiency and language modeling performance while enabling guided token generation for enhanced reasoning and arithmetic capabilities.
Abstract: Pre-training decoder-only language models relies on vast amounts of high-quality data, yet the availability of such data is increasingly reaching its limits. While metadata is commonly used to create and curate these datasets, its potential as a direct training signal remains under-explored. We challenge this status quo and propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters at negligible compute overhead. Beyond efficiency, LIME improves tokenization, leading to remarkably stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B). In addition, we develop a variant with shifted metadata, LIME+1, that can guide token generation. Given prior metadata for the next token, LIME+1 improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.
[59] Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs
Xiaoran Liu, Yuerong Song, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Zhaoxiang Liu, Shiguo Lian, Ziwei He, Xipeng Qiu
Main category: cs.CL
TL;DR: Extends RoPE by incorporating the imaginary component of complex-valued dot products to improve long-context modeling in LLMs.
Details
Motivation: Standard RoPE implementations discard the imaginary component of complex dot products, losing valuable phase information that could enhance modeling of long-context dependencies.Method: Proposes using both real and imaginary components of complex-valued dot products to create dual-component attention scores, preserving more positional information.
Result: The method consistently outperforms standard RoPE on long-context language modeling benchmarks, with benefits increasing as context length grows.
Conclusion: Incorporating the imaginary component of RoPE’s complex representations improves long-context dependency modeling and should be considered for enhanced positional encoding.
Abstract: Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the real component of the complex-valued dot product for attention score calculation. This simplification discards the imaginary component, which contains valuable phase information, leading to a potential loss of relational details crucial for modeling long-context dependencies. In this paper, we propose an extension that re-incorporates this discarded imaginary component. Our method leverages the full complex-valued representation to create a dual-component attention score. We theoretically and empirically demonstrate that this approach enhances the modeling of long-context dependencies by preserving more positional information. Furthermore, evaluations on a suite of long-context language modeling benchmarks show that our method consistently improves performance over the standard RoPE, with the benefits becoming more significant as context length increases. The code is available at https://github.com/OpenMOSS/rope_pp.
[60] SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents
Michelle Wastl, Jannis Vamvas, Rico Sennrich
Main category: cs.CL
TL;DR: First document-level cross-lingual semantic difference recognition dataset (SwissGov-RSD) with 224 multi-parallel English-German/French/Italian documents and token-level human annotations, showing current LLMs and encoder models perform poorly on this naturalistic benchmark.
Details
Motivation: Semantic difference recognition across documents, especially in different languages, is crucial for text generation evaluation and multilingual content alignment, but has received little attention as a standalone task.Method: Introduced SwissGov-RSD dataset with 224 multi-parallel documents in English-German, English-French, and English-Italian with token-level difference annotations. Evaluated various open-source and closed-source LLMs and encoder models across different fine-tuning settings.
Result: Current automatic approaches perform poorly compared to their performance on monolingual, sentence-level, and synthetic benchmarks, revealing a considerable gap for both LLMs and encoder models.
Conclusion: There’s a significant performance gap in cross-lingual document-level semantic difference recognition, highlighting the need for better models and methods for this important task in multilingual content alignment and text evaluation.
Abstract: Recognizing semantic differences across documents, especially in different languages, is crucial for text generation evaluation and multilingual content alignment. However, as a standalone task it has received little attention. We address this by introducing SwissGov-RSD, the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition. It encompasses a total of 224 multi-parallel documents in English-German, English-French, and English-Italian with token-level difference annotations by human annotators. We evaluate a variety of open-source and closed source large language models as well as encoder models across different fine-tuning settings on this new benchmark. Our results show that current automatic approaches perform poorly compared to their performance on monolingual, sentence-level, and synthetic benchmarks, revealing a considerable gap for both LLMs and encoder models. We make our code and datasets publicly available.
[61] Minimum Bayes Risk Decoding for Error Span Detection in Reference-Free Automatic Machine Translation Evaluation
Boxuan Lyu, Haiyue Song, Hidetaka Kamigaito, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Kotaro Funakoshi, Manabu Okumura
Main category: cs.CL
TL;DR: MBR decoding improves generative error span detection by selecting hypotheses based on similarity metrics rather than just model likelihood, outperforming MAP decoding and enabling efficient distillation.
Details
Motivation: Current generative ESD methods use MAP decoding assuming model probabilities perfectly correlate with human annotation similarity, but this assumption is flawed as dissimilar annotations can have higher model likelihood than human annotations.Method: Apply Minimum Bayes Risk (MBR) decoding to generative ESD models using sentence- and span-level similarity metrics as utility functions to select candidate hypotheses based on their approximate similarity to human annotation.
Result: MBR decoding outperforms MAP baseline at system, sentence, and span-levels. MBR distillation enables standard greedy models to match MBR decoding performance while eliminating inference-time latency bottleneck.
Conclusion: MBR decoding addresses the limitation of MAP decoding in generative ESD by better aligning with human annotation similarity, and distillation makes this approach computationally practical for real-world applications.
Abstract: Error Span Detection (ESD) is a subtask of automatic machine translation evaluation that localizes error spans in translations and labels their severity. State-of-the-art generative ESD methods typically decode using Maximum a Posteriori (MAP), assuming that model-estimated probabilities are perfectly correlated with similarity to human annotation. However, we observed that annotations dissimilar to the human annotation could achieve a higher model likelihood than the human annotation. We address this issue by applying Minimum Bayes Risk (MBR) decoding to generative ESD models. Specifically, we employ sentence- and span-level similarity metrics as utility functions to select candidate hypotheses based on their approximate similarity to the human annotation. Extensive experimental results show that our MBR decoding outperforms the MAP baseline at the system, sentence, and span-levels. Furthermore, to mitigate the computational cost of MBR decoding, we demonstrate that applying MBR distillation enables a standard greedy model to match MBR decoding performance, effectively eliminating the inference-time latency bottleneck.
[62] Most over-representation of phonological features in basic vocabulary disappears when controlling for spatial and phylogenetic effects
Frederic Blum
Main category: cs.CL
TL;DR: Most previously reported sound symbolic patterns in basic vocabulary disappear when controlling for genealogical and areal dependencies, though a small subset remains robust across a much larger language sample.
Details
Motivation: To test the reproducibility and robustness of reported sound symbolic patterns in basic vocabulary, addressing concerns about potential biases from inadequate controls for genealogical and areal dependencies between languages.Method: Reanalyzed sound symbolism using a much larger sample (2864 languages from Lexibank vs. original 245), modified the original model by adding statistical controls for spatial and phylogenetic dependencies between languages.
Result: Most previously observed sound symbolic patterns were not robust and disappeared when adding genealogical and areal controls. However, a small number of patterns emerged as highly stable even with the new sample and controls.
Conclusion: Universal claims about language patterns must be tested for robustness across various levels, and while most reported sound symbolic patterns may be artifacts of methodological limitations, a core subset appears genuinely robust across languages.
Abstract: The statistical over-representation of phonological features in the basic vocabulary of languages is often interpreted as reflecting potentially universal sound symbolic patterns. However, most of those results have not been tested explicitly for reproducibility and might be prone to biases in the study samples or models. Many studies on the topic do not adequately control for genealogical and areal dependencies between sampled languages, casting doubts on the robustness of the results. In this study, we test the robustness of a recent study on sound symbolism of basic vocabulary concepts which analyzed245 languages.The new sample includes data on 2864 languages from Lexibank. We modify the original model by adding statistical controls for spatial and phylogenetic dependencies between languages. The new results show that most of the previously observed patterns are not robust, and in fact many patterns disappear completely when adding the genealogical and areal controls. A small number of patterns, however, emerges as highly stable even with the new sample. Through the new analysis, we are able to assess the distribution of sound symbolism on a larger scale than previously. The study further highlights the need for testing all universal claims on language for robustness on various levels.
[63] MoCoRP: Modeling Consistent Relations between Persona and Response for Persona-based Dialogue
Kyungro Lee, Dongha Choi, Hyunju Lee
Main category: cs.CL
TL;DR: MoCoRP improves persona-based dialogue by explicitly modeling NLI relations between persona sentences and responses, enhancing persona consistency and engagement in dialogue generation.
Details
Motivation: Existing persona-based dialogue datasets lack explicit relations between persona sentences and responses, making it difficult for models to effectively capture persona information and generate coherent personality-driven interactions.Method: Proposes MoCoRP framework that uses an NLI expert to explicitly extract NLI relations (entailment, contradiction, neutral) between persona sentences and responses, enabling models to incorporate appropriate persona information. Applied to BART and extended to LLMs through alignment tuning.
Result: Outperforms existing baselines on ConvAI2 and MPChat datasets, achieving superior persona consistency and engaging, context-aware dialogue generation. Shows improvements in both quantitative metrics and qualitative aspects.
Conclusion: Explicitly modeling persona-response relations is effective for persona-based dialogue, as demonstrated by MoCoRP’s performance improvements in generating consistent and engaging personality-driven conversations.
Abstract: As dialogue systems become increasingly important across various domains, a key challenge in persona-based dialogue is generating engaging and context-specific interactions while ensuring the model acts with a coherent personality. However, existing persona-based dialogue datasets lack explicit relations between persona sentences and responses, which makes it difficult for models to effectively capture persona information. To address these issues, we propose MoCoRP (Modeling Consistent Relations between Persona and Response), a framework that incorporates explicit relations into language models. MoCoRP leverages an NLI expert to explicitly extract the NLI relations between persona sentences and responses, enabling the model to effectively incorporate appropriate persona information from the context into its responses. We applied this framework to pre-trained models like BART and further extended it to modern large language models (LLMs) through alignment tuning. Experimental results on the public datasets ConvAI2 and MPChat demonstrate that MoCoRP outperforms existing baselines, achieving superior persona consistency and engaging, context-aware dialogue generation. Furthermore, our model not only excels in quantitative metrics but also shows significant improvements in qualitative aspects. These results highlight the effectiveness of explicitly modeling persona-response relations in persona-based dialogue. The source codes of MoCoRP are available at https://github.com/DMCB-GIST/MoCoRP.
[64] Performance of the SafeTerm AI-Based MedDRA Query System Against Standardised MedDRA Queries
Francois Vandenhende, Anna Georgiou, Michalis Georgiou, Theodoros Psaras, Ellie Karekla, Elena Hadjicosta
Main category: cs.CL
TL;DR: SafeTerm AMQ is an AI system that automatically generates MedDRA queries for drug safety signal detection, achieving good performance with balanced recall and precision.
Details
Motivation: In pre-market drug safety review, grouping related adverse event terms into Standardized MedDRA Queries (SMQs) or Other Company MedDRA Queries (OCMQs) is critical for signal detection, but manual query generation is time-consuming. There's a need for automated systems to efficiently retrieve relevant MedDRA terms.Method: SafeTerm AMQ uses quantitative AI to understand medical terminology, embedding query terms and MedDRA Preferred Terms in a multidimensional vector space. It applies cosine similarity and extreme-value clustering to generate ranked lists of relevant PTs with relevance scores (0-1). Validation was conducted against 110 tier-1 SMQs using precision, recall, and F1 metrics at multiple similarity thresholds.
Result: The system achieved high recall (94%) at moderate thresholds, with precision up to 89% at higher thresholds. Optimal threshold (0.70) yielded 48% recall and 45% precision. Narrow-term PTs performed slightly better at increased thresholds. Automatic threshold selection (0.66) prioritized recall (0.58) over precision (0.29). Performance was comparable on SMQs and sanitized OCMQs.
Conclusion: SafeTerm AMQ is a viable supplementary method for automated MedDRA query generation that balances recall and precision. The authors recommend using appropriate MedDRA terminology in queries and applying automated threshold methods to optimize recall, with higher similarity scores enabling refined term selection.
Abstract: In pre-market drug safety review, grouping related adverse event terms into SMQs or OCMQs is critical for signal detection. We assess the performance of SafeTerm Automated Medical Query (AMQ) on MedDRA SMQs. The AMQ is a novel quantitative artificial intelligence system that understands and processes medical terminology and automatically retrieves relevant MedDRA Preferred Terms (PTs) for a given input query, ranking them by a relevance score (0-1) using multi-criteria statistical methods. The system (SafeTerm) embeds medical query terms and MedDRA PTs in a multidimensional vector space, then applies cosine similarity, and extreme-value clustering to generate a ranked list of PTs. Validation was conducted against tier-1 SMQs (110 queries, v28.1). Precision, recall and F1 were computed at multiple similarity-thresholds, defined either manually or using an automated method. High recall (94%)) is achieved at moderate similarity thresholds, indicative of good retrieval sensitivity. Higher thresholds filter out more terms, resulting in improved precision (up to 89%). The optimal threshold (0.70)) yielded an overall recall of (48%) and precision of (45%) across all 110 queries. Restricting to narrow-term PTs achieved slightly better performance at an increased (+0.05) similarity threshold, confirming increased relatedness of narrow versus broad terms. The automatic threshold (0.66) selection prioritizes recall (0.58) to precision (0.29). SafeTerm AMQ achieves comparable, satisfactory performance on SMQs and sanitized OCMQs. It is therefore a viable supplementary method for automated MedDRA query generation, balancing recall and precision. We recommend using suitable MedDRA PT terminology in query formulation and applying the automated threshold method to optimise recall. Increasing similarity scores allows refined, narrow terms selection.
[65] Complementary Learning Approach for Text Classification using Large Language Models
Navid Asgari, Benjamin M. Cole
Main category: cs.CL
TL;DR: Proposes a cost-efficient LLM methodology using chain-of-thought and few-shot learning to integrate human and machine strengths in quantitative research, demonstrated on pharmaceutical alliance press releases.
Details
Motivation: To develop a structured approach that leverages LLMs cost-effectively while integrating human expertise and machine capabilities, addressing weaknesses in both human and machine approaches to quantitative research.Method: Uses chain-of-thought and few-shot learning prompting techniques from computer science, extending qualitative research co-author team practices to human-machine teams in quantitative research. Allows humans to use abductive reasoning and natural language to interrogate both machine and human outputs.
Result: Demonstrated the methodology by interrogating human-machine rating discrepancies for a sample of 1,934 press releases announcing pharmaceutical alliances from 1990-2017, showing how scholars can manage LLM weaknesses with low-cost techniques.
Conclusion: Proposes a practical, cost-efficient framework for human-machine collaboration in quantitative research that leverages LLMs while mitigating their weaknesses through careful prompting techniques and human oversight.
Abstract: In this study, we propose a structured methodology that utilizes large language models (LLMs) in a cost-efficient and parsimonious manner, integrating the strengths of scholars and machines while offsetting their respective weaknesses. Our methodology, facilitated through a chain of thought and few-shot learning prompting from computer science, extends best practices for co-author teams in qualitative research to human-machine teams in quantitative research. This allows humans to utilize abductive reasoning and natural language to interrogate not just what the machine has done but also what the human has done. Our method highlights how scholars can manage inherent weaknesses OF LLMs using careful, low-cost techniques. We demonstrate how to use the methodology to interrogate human-machine rating discrepancies for a sample of 1,934 press releases announcing pharmaceutical alliances (1990-2017).
[66] Metric-Fair Prompting: Treating Similar Samples Similarly
Jing Wang, Jie Shen, Xing Niu, Tong Zhang, Jeremy Weiss
Main category: cs.CL
TL;DR: Metric-Fair Prompting improves LLM accuracy on medical QA by enforcing fairness constraints that treat similar questions similarly through confidence scoring and Lipschitz-style constraints.
Details
Motivation: To promote individual fairness in LLM decision-making for high-stakes medical applications, ensuring similar medical questions receive similar treatment and consistent outputs.Method: Treats each (question, option) pair as binary instance, computes question similarity using NLP embeddings, solves items in joint pairs of similar questions, extracts decisive clinical features, maps to confidence scores with Lipschitz-style constraints.
Result: Improves performance over standard single-item prompting on MedQA (US) benchmark, demonstrating fairness-guided reasoning enhances LLM accuracy on clinical multiple-choice questions.
Conclusion: Fairness-aware prompting with metric-fairness constraints can enhance LLM performance in high-stakes medical decision-making by promoting consistent treatment of similar cases.
Abstract: We introduce \emph{Metric-Fair Prompting}, a fairness-aware prompting framework that guides large language models (LLMs) to make decisions under metric-fairness constraints. In the application of multiple-choice medical question answering, each {(question, option)} pair is treated as a binary instance with label $+1$ (correct) or $-1$ (incorrect). To promote {individual fairness}~–treating similar instances similarly–~we compute question similarity using NLP embeddings and solve items in \emph{joint pairs of similar questions} rather than in isolation. The prompt enforces a global decision protocol: extract decisive clinical features, map each ((\text{question}, \text{option})) to a score $f(x)$ that acts as confidence, and impose a Lipschitz-style constraint so that similar inputs receive similar scores and, hence, consistent outputs. Evaluated on the {MedQA (US)} benchmark, Metric-Fair Prompting is shown to improve performance over standard single-item prompting, demonstrating that fairness-guided, confidence-oriented reasoning can enhance LLM accuracy on high-stakes clinical multiple-choice questions.
[67] PCMind-2.1-Kaiyuan-2B Technical Report
Kairong Luo, Zhenbo Sun, Xinyu Shi, Shengqi Chen, Bowen Yu, Yunyi Chen, Chenyi Dang, Hengtao Tao, Hui Wang, Fangming Liu, Kaifeng Lyu, Wenguang Chen
Main category: cs.CL
TL;DR: PCMind-2.1-Kaiyuan-2B is a fully open-source 2B-parameter LLM that addresses the knowledge gap between open-source and industry by introducing efficient training methods for resource-constrained environments.
Details
Motivation: There's a significant knowledge gap between open-source community and industry due to industry's reliance on closed-source, high-quality data and training recipes. The authors aim to democratize access to advanced LLM training by providing fully open-source solutions for resource-limited settings.Method: Three key innovations: 1) Quantile Data Benchmarking for comparing heterogeneous datasets and data mixing strategies, 2) Strategic Selective Repetition within multi-phase paradigm to leverage sparse high-quality data, and 3) Multi-Domain Curriculum Training that orders samples by quality. Also includes optimized data preprocessing pipeline and architectural modifications for FP16 stability.
Result: Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. All assets (model weights, data, code) are released under Apache 2.0 license.
Conclusion: The work provides a comprehensive open-source solution for efficient LLM training under resource constraints, bridging the gap between open-source and industry approaches while maintaining competitive performance.
Abstract: The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.
[68] Bridging Code Graphs and Large Language Models for Better Code Understanding
Zeqi Chen, Zhaoyang Chu, Yi Gui, Feng Guo, Yao Wan, Chuan Shi
Main category: cs.CL
TL;DR: CGBridge enhances LLMs for code intelligence by injecting structural graph information through an external bridge module, improving performance on tasks like code summarization and translation while maintaining efficiency.
Details
Motivation: LLMs struggle with understanding structural semantics of code due to reliance on linearized token sequences. Existing approaches have limitations: graph-augmented prompting suffers from length constraints, while structure-aware pretraining requires incompatible architectural changes for instruction-following LLMs.Method: CGBridge uses a plug-and-play approach with: 1) Pre-training a code graph encoder via self-supervised learning on 270K code graphs, 2) Training an external bridge module with cross-modal attention to align code, graph, and text semantics, 3) Generating structure-informed prompts injected into frozen LLMs, and 4) Fine-tuning for downstream tasks.
Result: Achieves 16.19% and 9.12% relative gain in LLM-as-a-Judge on code summarization, 9.84% and 38.87% relative gain in Execution Accuracy on code translation. Also achieves over 4x faster inference than LoRA-tuned models.
Conclusion: CGBridge effectively enhances LLMs with structural code semantics through an external bridge module, achieving significant performance improvements on code intelligence tasks while maintaining efficiency and compatibility with existing LLMs.
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in code intelligence tasks such as code generation, summarization, and translation. However, their reliance on linearized token sequences limits their ability to understand the structural semantics of programs. While prior studies have explored graphaugmented prompting and structure-aware pretraining, they either suffer from prompt length constraints or require task-specific architectural changes that are incompatible with large-scale instructionfollowing LLMs. To address these limitations, this paper proposes CGBridge, a novel plug-and-play method that enhances LLMs with Code Graph information through an external, trainable Bridge module. CGBridge first pre-trains a code graph encoder via selfsupervised learning on a large-scale dataset of 270K code graphs to learn structural code semantics. It then trains an external module to bridge the modality gap among code, graph, and text by aligning their semantics through cross-modal attention mechanisms. Finally, the bridge module generates structure-informed prompts, which are injected into a frozen LLM, and is fine-tuned for downstream code intelligence tasks. Experiments show that CGBridge achieves notable improvements over both the original model and the graphaugmented prompting method. Specifically, it yields a 16.19% and 9.12% relative gain in LLM-as-a-Judge on code summarization, and a 9.84% and 38.87% relative gain in Execution Accuracy on code translation. Moreover, CGBridge achieves over 4x faster inference than LoRA-tuned models, demonstrating both effectiveness and efficiency in structure-aware code understanding.
[69] When Large Language Models Do Not Work: Online Incivility Prediction through Graph Neural Networks
Zihan Chen, Lanyu Yu
Main category: cs.CL
TL;DR: A Graph Neural Network framework outperforms 12 state-of-the-art LLMs in detecting online incivility (toxicity, aggression, personal attacks) on English Wikipedia by leveraging both textual content and relational structures between comments.
Details
Motivation: Online incivility is a widespread problem causing social and psychological burdens, but existing moderation and automated detection approaches have limited accuracy and efficiency. Current text-only LLM paradigms fail to fully capture the contextual and relational aspects of uncivil behavior in digital communities.Method: Proposes a Graph Neural Network framework where each user comment is represented as a node, with edges defined by textual similarity between comments. The model jointly learns from linguistic content and relational structures, and includes a dynamically adjusted attention mechanism that adaptively balances nodal and topological features during information aggregation.
Result: The proposed architecture outperforms 12 state-of-the-art Large Language Models across multiple metrics while requiring significantly lower inference cost. The framework demonstrates superior performance in detecting three types of uncivil behavior (toxicity, aggression, personal attacks) within the English Wikipedia community.
Conclusion: Structural context plays a crucial role in detecting online incivility, addressing limitations of text-only LLM paradigms in behavioral prediction. The findings highlight the importance of relational information in understanding digital communication patterns, and the datasets and outputs will be publicly available to support further research and reproducibility.
Abstract: Online incivility has emerged as a widespread and persistent problem in digital communities, imposing substantial social and psychological burdens on users. Although many platforms attempt to curb incivility through moderation and automated detection, the performance of existing approaches often remains limited in both accuracy and efficiency. To address this challenge, we propose a Graph Neural Network (GNN) framework for detecting three types of uncivil behavior (i.e., toxicity, aggression, and personal attacks) within the English Wikipedia community. Our model represents each user comment as a node, with textual similarity between comments defining the edges, allowing the network to jointly learn from both linguistic content and relational structures among comments. We also introduce a dynamically adjusted attention mechanism that adaptively balances nodal and topological features during information aggregation. Empirical evaluations demonstrate that our proposed architecture outperforms 12 state-of-the-art Large Language Models (LLMs) across multiple metrics while requiring significantly lower inference cost. These findings highlight the crucial role of structural context in detecting online incivility and address the limitations of text-only LLM paradigms in behavioral prediction. All datasets and comparative outputs will be publicly available in our repository to support further research and reproducibility.
[70] HalluShift++: Bridging Language and Vision through Internal Representation Shifts for Hierarchical Hallucinations in MLLMs
Sujoy Nath, Arkaprabha Basu, Sharanya Dasgupta, Swagatam Das
Main category: cs.CL
TL;DR: HalluShift++ is a novel approach for detecting hallucinations in Multimodal Large Language Models by analyzing internal layer dynamics rather than relying on external LLM evaluators.
Details
Motivation: MLLMs often produce factually inconsistent descriptions (hallucinations) that can lead to adverse consequences. Current methods rely on external LLM evaluators that are themselves prone to hallucinations and have domain adaptation challenges.Method: Proposes that hallucinations manifest as measurable irregularities in MLLMs’ internal layer dynamics. HalluShift++ analyzes layer-wise patterns to detect hallucinations, extending detection capabilities from text-based LLMs to multimodal scenarios.
Result: The method broadens hallucination detection efficacy to encompass multimodal scenarios, providing a more reliable approach than external LLM evaluators.
Conclusion: HalluShift++ offers a promising alternative to external evaluators by leveraging internal model dynamics for more accurate hallucination detection in multimodal models.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding tasks. While these models often produce linguistically coherent output, they often suffer from hallucinations, generating descriptions that are factually inconsistent with the visual content, potentially leading to adverse consequences. Therefore, the assessment of hallucinations in MLLM has become increasingly crucial in the model development process. Contemporary methodologies predominantly depend on external LLM evaluators, which are themselves susceptible to hallucinations and may present challenges in terms of domain adaptation. In this study, we propose the hypothesis that hallucination manifests as measurable irregularities within the internal layer dynamics of MLLMs, not merely due to distributional shifts but also in the context of layer-wise analysis of specific assumptions. By incorporating such modifications, \textsc{\textsc{HalluShift++}} broadens the efficacy of hallucination detection from text-based large language models (LLMs) to encompass multimodal scenarios. Our codebase is available at https://github.com/C0mRD/HalluShift_Plus.
[71] Automated Generation of Custom MedDRA Queries Using SafeTerm Medical Map
Francois Vandenhende, Anna Georgiou, Michalis Georgiou, Theodoros Psaras, Ellie Karekla, Elena Hadjicosta
Main category: cs.CL
TL;DR: SafeTerm is an AI system that automatically retrieves and ranks MedDRA Preferred Terms for drug safety queries using vector embeddings and similarity scoring, achieving high recall (>95%) at moderate thresholds.
Details
Motivation: Manual grouping of adverse event terms into standardized MedDRA queries for drug safety review is time-consuming and requires medical expertise. There's a need for automated systems to assist in this critical signal detection process.Method: The system embeds medical query terms and MedDRA Preferred Terms in a multidimensional vector space, then applies cosine similarity and extreme-value clustering to generate ranked lists of relevant PTs with relevance scores.
Result: Validation against FDA OCMQ v3.0 showed high recall (>95%) at moderate thresholds, with precision up to 86% at higher thresholds. Optimal threshold (0.70-0.75) yielded ~50% recall and ~33% precision.
Conclusion: SafeTerm provides a viable supplementary method for automated MedDRA query generation, with recommended initial similarity threshold of ~0.60 and increased thresholds for refined term selection.
Abstract: In pre-market drug safety review, grouping related adverse event terms into standardised MedDRA queries or the FDA Office of New Drugs Custom Medical Queries (OCMQs) is critical for signal detection. We present a novel quantitative artificial intelligence system that understands and processes medical terminology and automatically retrieves relevant MedDRA Preferred Terms (PTs) for a given input query, ranking them by a relevance score using multi-criteria statistical methods. The system (SafeTerm) embeds medical query terms and MedDRA PTs in a multidimensional vector space, then applies cosine similarity and extreme-value clustering to generate a ranked list of PTs. Validation was conducted against the FDA OCMQ v3.0 (104 queries), restricted to valid MedDRA PTs. Precision, recall and F1 were computed across similarity-thresholds. High recall (>95%) is achieved at moderate thresholds. Higher thresholds improve precision (up to 86%). The optimal threshold (~0.70 - 0.75) yielded recall ~50% and precision ~33%. Narrow-term PT subsets performed similarly but required slightly higher similarity thresholds. The SafeTerm AI-driven system provides a viable supplementary method for automated MedDRA query generation. A similarity threshold of ~0.60 is recommended initially, with increased thresholds for refined term selection.
[72] Mary, the Cheeseburger-Eating Vegetarian: Do LLMs Recognize Incoherence in Narratives?
Karin de Langis, Püren Öncel, Ryan Peters, Andrew Elfenbein, Laura Kristen Allen, Andreas Schramm, Dongyeop Kang
Main category: cs.CL
TL;DR: LLMs can detect incoherent stories internally but fail to articulate this in responses, showing a gap between internal representations and external behavior, with particular difficulty identifying character-based inconsistencies.
Details
Motivation: To investigate whether LLMs can reliably distinguish between coherent and incoherent narratives, and understand the gap between their internal representations and external responses regarding narrative coherence.Method: Used paired narratives dataset, conducted probing studies on LLMs’ internal representations, tested various prompt variations for rating questions, examined reasoning approaches including thought strings, and analyzed sensitivity to different types of incoherence (setting violations vs. character trait violations).
Result: LLMs’ internal representations reliably identify incoherent narratives, but their generated responses fail to separate coherent/incoherent stories across prompts. Reasoning doesn’t fix this gap. LLMs are more sensitive to setting violations than character trait violations, suggesting reliance on prototypical world knowledge rather than narrative coherence.
Conclusion: LLMs lack complete grasp of narrative coherence, showing a discrepancy between internal state and external behavior, with particular weakness in understanding character-based narrative inconsistencies.
Abstract: Leveraging a dataset of paired narratives, we investigate the extent to which large language models (LLMs) can reliably separate incoherent and coherent stories. A probing study finds that LLMs’ internal representations can reliably identify incoherent narratives. However, LLMs generate responses to rating questions that fail to satisfactorily separate the coherent and incoherent narratives across several prompt variations, hinting at a gap in LLM’s understanding of storytelling. The reasoning LLMs tested do not eliminate these deficits, indicating that thought strings may not be able to fully address the discrepancy between model internal state and behavior. Additionally, we find that LLMs appear to be more sensitive to incoherence resulting from an event that violates the setting (e.g., a rainy day in the desert) than to incoherence arising from a character violating an established trait (e.g., Mary, a vegetarian, later orders a cheeseburger), suggesting that LLMs may rely more on prototypical world knowledge than building meaning-based narrative coherence. The consistent asymmetry found in our results suggests that LLMs do not have a complete grasp on narrative coherence.
[73] On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
Charlie Zhang, Graham Neubig, Xiang Yue
Main category: cs.CL
TL;DR: RL post-training only improves reasoning when pre-training leaves room for growth and targets tasks at the model’s edge of competence, while mid-training plays a crucial but underexplored role in training pipelines.
Details
Motivation: To resolve ambiguity about whether RL post-training truly extends model reasoning beyond pre-training capabilities, given the lack of control in modern training pipelines with opaque pre-training corpora and complex interactions between RL objectives and unknown prior knowledge.Method: Developed a fully controlled experimental framework using synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. Evaluated models on extrapolative generalization (more complex compositions) and contextual generalization (across surface contexts).
Result: 1) RL produces true capability gains only when pre-training leaves sufficient headroom and targets tasks at the model’s edge of competence. 2) Contextual generalization requires minimal yet sufficient pre-training exposure before RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared to RL only. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity.
Conclusion: The study clarifies the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies, with mid-training playing a central but underexplored role in training pipelines.
Abstract: Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model’s reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL’s effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model’s edge of competence, tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared with RL only, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies.
[74] Collaborative Causal Sensemaking: Closing the Complementarity Gap in Human-AI Decision Support
Raunak Jain, Mudita Khurana
Main category: cs.CL
TL;DR: The paper proposes Collaborative Causal Sensemaking (CCS) as a new framework for AI decision-support agents that act as cognitive partners rather than just tools, focusing on co-constructing mental models and improving human-AI team performance.
Details
Motivation: Current LLM-based agents in expert decision-support often fail to make teams smarter, with human-AI teams underperforming the best individual, experts oscillating between verification loops and over-reliance, and promised complementarity not materializing. This is not just an accuracy problem but a fundamental gap in how AI assistance is conceived.Method: Proposes Collaborative Causal Sensemaking (CCS) as a research agenda and organizing framework for decision-support agents. These systems are designed as partners in cognitive work that maintain evolving models of how particular experts reason, help articulate and revise goals, co-construct and stress-test causal hypotheses, and learn from joint decision outcomes.
Result: The paper sketches challenges around: 1) training ecologies that make collaborative thinking instrumentally valuable, 2) representations and interaction protocols for co-authored models, and 3) evaluation centered on trust and complementarity.
Conclusion: These directions can reframe MAS research around agents that participate in collaborative sensemaking and act as AI teammates that think with their human partners, moving beyond current limitations of AI decision-support systems.
Abstract: LLM-based agents are rapidly being plugged into expert decision-support, yet in messy, high-stakes settings they rarely make the team smarter: human-AI teams often underperform the best individual, experts oscillate between verification loops and over-reliance, and the promised complementarity does not materialise. We argue this is not just a matter of accuracy, but a fundamental gap in how we conceive AI assistance: expert decisions are made through collaborative cognitive processes where mental models, goals, and constraints are continually co-constructed, tested, and revised between human and AI. We propose Collaborative Causal Sensemaking (CCS) as a research agenda and organizing framework for decision-support agents: systems designed as partners in cognitive work, maintaining evolving models of how particular experts reason, helping articulate and revise goals, co-constructing and stress-testing causal hypotheses, and learning from the outcomes of joint decisions so that both human and agent improve over time. We sketch challenges around training ecologies that make collaborative thinking instrumentally valuable, representations and interaction protocols for co-authored models, and evaluation centred on trust and complementarity. These directions can reframe MAS research around agents that participate in collaborative sensemaking and act as AI teammates that think with their human partners.
[75] Do Generalisation Results Generalise?
Matteo Boglioni, Andrea Sgobbi, Gabriel Tavernini, Francesco Rita, Marius Mosbach, Tiago Pimentel
Main category: cs.CL
TL;DR: This paper investigates whether out-of-distribution (OOD) generalization results from LLMs generalize across different OOD datasets, finding no consistent trend and that correlations depend heavily on the specific model.
Details
Motivation: Previous work assessing LLM generalization typically uses single OOD datasets, which may not accurately evaluate model capabilities since real-world deployment involves diverse data shifts. The paper aims to determine if OOD generalization results generalize across multiple OOD testsets.Method: The authors evaluate model performance across multiple OOD testsets throughout finetuning runs, then compute partial correlation of performances across these testsets while regressing out in-domain performance. This assesses how correlated generalization performances are once in-domain performance is controlled for.
Result: Analysis of OLMo2 and OPT models shows no overarching trend in generalization results. The existence of positive or negative correlation between any two OOD testsets depends strongly on the specific model being analyzed.
Conclusion: OOD generalization results do not consistently generalize across different OOD datasets, and correlations between generalization performances are highly model-dependent, challenging the reliability of single-dataset OOD evaluations.
Abstract: A large language model’s (LLM’s) out-of-distribution (OOD) generalisation ability is crucial to its deployment. Previous work assessing LLMs’ generalisation performance, however, typically focuses on a single out-of-distribution dataset. This approach may fail to precisely evaluate the capabilities of the model, as the data shifts encountered once a model is deployed are much more diverse. In this work, we investigate whether OOD generalisation results generalise. More specifically, we evaluate a model’s performance across multiple OOD testsets throughout a finetuning run; we then evaluate the partial correlation of performances across these testsets, regressing out in-domain performance. This allows us to assess how correlated are generalisation performances once in-domain performance is controlled for. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results: the existence of a positive or negative correlation between any two OOD testsets depends strongly on the specific choice of model analysed.
[76] I Learn Better If You Speak My Language: Understanding the Superior Performance of Fine-Tuning Large Language Models with LLM-Generated Responses
Xuan Ren, Biao Wu, Lingqiao Liu
Main category: cs.CL
TL;DR: LLM-generated training data outperforms human-generated data for fine-tuning LLMs on reasoning tasks, due to LLMs’ inherent “familiarity” with their own generated content rather than just content detail.
Details
Motivation: To investigate why fine-tuning LLMs with LLM-generated responses yields better results than human-generated responses, particularly for reasoning tasks, challenging the common assumption that this advantage comes solely from more detailed content.Method: Conducted in-depth investigation with designed experiments to measure “familiarity” through perplexity measurements before fine-tuning, and systematically tested the impact of this familiarity on learning performance across different reasoning tasks.
Result: Found that LLM-generated responses have lower perplexity scores before fine-tuning, indicating inherent familiarity. This familiarity significantly improves learning performance and helps maintain model capabilities in other reasoning tasks after task-specific fine-tuning.
Conclusion: The advantage of LLM-generated training data stems from the model’s familiarity with its own generated content, not just content detail. This familiarity enhances fine-tuning performance and preserves cross-task reasoning capabilities.
Abstract: This paper explores an intriguing observation: fine-tuning a large language model (LLM) with responses generated by a LLM often yields better results than using responses generated by humans, particularly in reasoning tasks. We conduct an in-depth investigation to understand why this occurs. Contrary to the common belief that these instances is due to the more detailed nature of LLM-generated content, our study identifies another contributing factor: an LLM is inherently more “familiar” with LLM generated responses. This familiarity is evidenced by lower perplexity before fine-tuning. We design a series of experiments to understand the impact of the “familiarity” and our conclusion reveals that this “familiarity” significantly impacts learning performance. Training with LLM-generated responses not only enhances performance but also helps maintain the model’s capabilities in other reasoning tasks after fine-tuning on a specific task.
[77] PhyloLM : Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks
Nicolas Yax, Pierre-Yves Oudeyer, Stefano Palminteri
Main category: cs.CL
TL;DR: PhyloLM adapts phylogenetic algorithms to LLMs to analyze relationships and predict performance using output similarity-based distance metrics.
Details
Motivation: To explore relationships between LLMs and predict their performance characteristics without needing transparent training information, creating a time and cost-effective evaluation tool.Method: Adapts phylogenetic algorithms to LLMs by calculating phylogenetic distance metric based on similarity of LLMs’ outputs, then constructs dendrograms to visualize relationships across 156 models (111 open-source, 45 closed).
Result: Phylogenetic distance successfully captures known relationships between models and predicts performance in standard benchmarks, demonstrating functional validity.
Conclusion: PhyloLM provides a validated tool for evaluating LLM development, relationships, and capabilities by translating population genetic concepts to machine learning, enabling assessment even without transparent training information.
Abstract: This paper introduces PhyloLM, a method adapting phylogenetic algorithms to Large Language Models (LLMs) to explore whether and how they relate to each other and to predict their performance characteristics. Our method calculates a phylogenetic distance metric based on the similarity of LLMs’ output. The resulting metric is then used to construct dendrograms, which satisfactorily capture known relationships across a set of 111 open-source and 45 closed models. Furthermore, our phylogenetic distance predicts performance in standard benchmarks, thus demonstrating its functional validity and paving the way for a time and cost-effective estimation of LLM capabilities. To sum up, by translating population genetic concepts to machine learning, we propose and validate a tool to evaluate LLM development, relationships and capabilities, even in the absence of transparent training information.
[78] Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Handy Appetizer
Benji Peng, Xuanhe Pan, Yizhu Wen, Ziqian Bi, Keyu Chen, Ming Li, Ming Liu, Qian Niu, Junyu Liu, Jinlang Wang, Sen Zhang, Jiawei Xu, Xinyuan Song, Zekun Jiang, Tianyang Wang, Pohsun Feng
Main category: cs.CL
TL;DR: This book provides a comprehensive guide to AI/ML/DL for big data analytics, simplifying complex concepts with visualizations and case studies, covering key models (Transformers, GPT, ResNet, BERT, YOLO) and big data technologies (SQL/NoSQL, Hadoop, Spark).
Details
Motivation: To bridge the gap between complex AI/deep learning concepts and practical application by providing accessible explanations, visualizations, and real-world case studies that help readers understand and apply these technologies in big data analytics and management.Method: Uses intuitive visualizations to simplify complex mathematical concepts, presents practical case studies, introduces classic AI models (Transformers, GPT, ResNet, BERT, YOLO), covers pre-trained models, and explains big data management technologies including SQL/NoSQL databases and distributed computing frameworks like Hadoop and Spark.
Result: Provides a comprehensive educational resource that demystifies deep learning and big data technologies, making them accessible to both beginners and experienced professionals, with practical guidance on applying these tools in real-world scenarios across various domains.
Conclusion: Mastering deep learning and big data management skills is essential for the future workforce, and this book serves as a critical resource for developing these competencies through accessible explanations, practical examples, and coverage of foundational technologies.
Abstract: This book explores the role of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) in driving the progress of big data analytics and management. The book focuses on simplifying the complex mathematical concepts behind deep learning, offering intuitive visualizations and practical case studies to help readers understand how neural networks and technologies like Convolutional Neural Networks (CNNs) work. It introduces several classic models and technologies such as Transformers, GPT, ResNet, BERT, and YOLO, highlighting their applications in fields like natural language processing, image recognition, and autonomous driving. The book also emphasizes the importance of pre-trained models and how they can enhance model performance and accuracy, with instructions on how to apply these models in various real-world scenarios. Additionally, it provides an overview of key big data management technologies like SQL and NoSQL databases, as well as distributed computing frameworks such as Apache Hadoop and Spark, explaining their importance in managing and processing vast amounts of data. Ultimately, the book underscores the value of mastering deep learning and big data management skills as critical tools for the future workforce, making it an essential resource for both beginners and experienced professionals.
[79] Surveying the MLLM Landscape: A Meta-Review of Current Surveys
Ming Li, Keyu Chen, Ziqian Bi, Ming Liu, Xinyuan Song, Zekun Jiang, Tianyang Wang, Benji Peng, Qian Niu, Junyu Liu, Jinlang Wang, Sen Zhang, Xuanhe Pan, Jiawei Xu, Pohsun Feng
Main category: cs.CL
TL;DR: This survey paper provides a systematic review of benchmark tests and evaluation methods for Multimodal Large Language Models (MLLMs), covering their applications, evaluation methodologies, ethical concerns, and future research directions.
Details
Motivation: As MLLMs become increasingly capable and transformative across various applications, there's a critical need for comprehensive and accurate performance evaluation methods to assess these multimodal systems effectively.Method: The authors conduct a systematic literature review, classifying and analyzing existing surveys and research on MLLM evaluation. They cover foundational concepts, applications, evaluation methodologies, ethical concerns, security, efficiency, and domain-specific applications.
Result: The survey summarizes main contributions and methodologies from existing literature, provides comparative analysis of different approaches, examines academic impact, and identifies emerging trends and underexplored areas in MLLM research.
Conclusion: This comprehensive survey offers researchers and practitioners a thorough understanding of current MLLM evaluation practices and proposes future research directions to advance this rapidly evolving field.
Abstract: The rise of Multimodal Large Language Models (MLLMs) has become a transformative force in the field of artificial intelligence, enabling machines to process and generate content across multiple modalities, such as text, images, audio, and video. These models represent a significant advancement over traditional unimodal systems, opening new frontiers in diverse applications ranging from autonomous agents to medical diagnostics. By integrating multiple modalities, MLLMs achieve a more holistic understanding of information, closely mimicking human perception. As the capabilities of MLLMs expand, the need for comprehensive and accurate performance evaluation has become increasingly critical. This survey aims to provide a systematic review of benchmark tests and evaluation methods for MLLMs, covering key topics such as foundational concepts, applications, evaluation methodologies, ethical concerns, security, efficiency, and domain-specific applications. Through the classification and analysis of existing literature, we summarize the main contributions and methodologies of various surveys, conduct a detailed comparative analysis, and examine their impact within the academic community. Additionally, we identify emerging trends and underexplored areas in MLLM research, proposing potential directions for future studies. This survey is intended to offer researchers and practitioners a comprehensive understanding of the current state of MLLM evaluation, thereby facilitating further progress in this rapidly evolving field.
[80] Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Unveiling AI’s Potential Through Tools, Techniques, and Applications
Pohsun Feng, Ziqian Bi, Yizhu Wen, Xuanhe Pan, Benji Peng, Ming Liu, Jiawei Xu, Keyu Chen, Junyu Liu, Caitlyn Heqi Yin, Sen Zhang, Jinlang Wang, Qian Niu, Ming Li, Tianyang Wang, Xinyuan Song, Zekun Jiang
Main category: cs.CL
TL;DR: This article provides a comprehensive overview of AI, machine learning, and deep learning’s transformative role in big data analytics, focusing on LLMs like ChatGPT, Claude, and Gemini, their applications, and ethical considerations.
Details
Motivation: To explore how AI, machine learning, and deep learning are revolutionizing big data analytics and management across industries, with particular emphasis on large language models and their practical applications.Method: The article examines foundational concepts and cutting-edge developments, discussing neural networks, reinforcement learning, generative models, edge computing, AutoML, and practical hardware/software configurations.
Result: The work demonstrates how AI systems enhanced by advanced algorithms can process complex datasets, democratize AI access through technologies like AutoML, and provide comprehensive resources for researchers and practitioners.
Conclusion: AI and LLMs have significant potential to revolutionize big data management across domains like healthcare and finance, but must be deployed responsibly with attention to ethics, transparency, and fairness.
Abstract: Artificial intelligence (AI), machine learning, and deep learning have become transformative forces in big data analytics and management, enabling groundbreaking advancements across diverse industries. This article delves into the foundational concepts and cutting-edge developments in these fields, with a particular focus on large language models (LLMs) and their role in natural language processing, multimodal reasoning, and autonomous decision-making. Highlighting tools such as ChatGPT, Claude, and Gemini, the discussion explores their applications in data analysis, model design, and optimization. The integration of advanced algorithms like neural networks, reinforcement learning, and generative models has enhanced the capabilities of AI systems to process, visualize, and interpret complex datasets. Additionally, the emergence of technologies like edge computing and automated machine learning (AutoML) democratizes access to AI, empowering users across skill levels to engage with intelligent systems. This work also underscores the importance of ethical considerations, transparency, and fairness in the deployment of AI technologies, paving the way for responsible innovation. Through practical insights into hardware configurations, software environments, and real-world applications, this article serves as a comprehensive resource for researchers and practitioners. By bridging theoretical underpinnings with actionable strategies, it showcases the potential of AI and LLMs to revolutionize big data management and drive meaningful advancements across domains such as healthcare, finance, and autonomous systems.
[81] LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation
Yen-Shan Chen, Jing Jin, Peng-Ting Kuo, Chao-Wei Huang, Yun-Nung Chen
Main category: cs.CL
TL;DR: LLMs show no self-preference bias in RAG frameworks; factual accuracy dominates their evaluation in fact-oriented tasks across multiple models and datasets.
Details
Motivation: Previous research shows LLMs have self-preference bias in evaluation tasks, but it's unclear if this persists in fact-oriented RAG frameworks where factual accuracy matters more than style.Method: Simulated two RAG phases: 1) pointwise reranking phase where LLMs evaluated human vs model-generated passages, and 2) generation phase with pairwise reading comprehension tests. Tested across 3 QA datasets (NQ, MARCO, TriviaQA) and 5 LLMs (GPT-3.5, GPT-4o-mini, Gemini, LLaMA3, Mistral).
Result: No significant self-preference bias found in RAG frameworks. Factual accuracy significantly influences LLM outputs even without prior knowledge. Results consistent across all tested datasets and models.
Conclusion: LLM self-preference bias doesn’t manifest in fact-oriented RAG tasks; factual accuracy is the dominant factor. This insight helps develop more robust, unbiased LLM systems for RAG applications.
Abstract: Recent studies have demonstrated that large language models (LLMs) exhibit significant biases in evaluation tasks, particularly in preferentially rating and favoring self-generated content. However, the extent to which this bias manifests in fact-oriented tasks, especially within retrieval-augmented generation (RAG) frameworks, where keyword extraction and factual accuracy take precedence over stylistic elements, remains unclear. Our study addresses this knowledge gap by simulating two critical phases of the RAG framework. In the first phase, LLMs evaluated human-authored and model-generated passages, emulating the \textit{pointwise reranking phase}. The second phase involves conducting pairwise reading comprehension tests to simulate the \textit{generation phase}. Contrary to previous findings indicating a self-preference in rating tasks, our results reveal no significant self-preference effect in RAG frameworks. Instead, we observe that factual accuracy significantly influences LLMs’ output, even in the absence of prior knowledge. These findings are consistent among three common QA datasets (NQ, MARCO, TriviaQA Datasets) and 5 widely adopted language models (GPT-3.5, GPT-4o-mini, Gemini, LLaMA3, and Mistral). Our research contributes to the ongoing discourse on LLM biases and their implications for RAG-based system, offering insights that may inform the development of more robust and unbiased LLM systems.
[82] A Systematic Assessment of Language Models with Linguistic Minimal Pairs in Chinese
Yikang Liu, Yeting Shen, Hongao Zhu, Lilong Xu, Zhiheng Qian, Siyuan Song, Kejia Zhang, Jialong Tang, Pei Zhang, Baosong Yang, Rui Wang, Hai Hu
Main category: cs.CL
TL;DR: ZhoBLiMP is a large Chinese linguistic minimal pair benchmark with 100+ paradigms. The paper introduces SLLN-LP metric to address sentence length bias and finds that certain Chinese linguistic phenomena remain challenging for LMs up to 32B parameters.
Details
Motivation: To create a comprehensive Chinese linguistic benchmark (ZhoBLiMP) and study how Chinese language models learn various grammatical phenomena, while addressing biases in minimal pair evaluation caused by unequal sentence lengths.Method: 1) Created ZhoBLiMP benchmark with 100+ Chinese linguistic paradigms; 2) Trained Chinese LMs with varying tokenizers, sizes, and token volumes; 3) Proposed SLLN-LP metric to normalize for sentence length bias; 4) Evaluated LMs on ZhoBLiMP, JBLiMP, and BLiMP using the new metric.
Result: Anaphor, Quantifiers, and Ellipsis in Chinese remain difficult for LMs even at 32B parameters. SLLN-LP successfully mitigates length biases across all three benchmarks (ZhoBLiMP, JBLiMP, BLiMP).
Conclusion: Future linguistic evaluations should carefully consider the complex relationships between linking functions, language models, and targeted minimal pairs to avoid biases and better assess model capabilities.
Abstract: We present ZhoBLiMP, the largest linguistic minimal pair benchmark for Chinese, with over 100 paradigms, ranging from topicalization to the \textit{Ba} construction. We then train from scratch a suite of Chinese language models (LMs) with different tokenizers, parameter sizes, and token volumes, to study the learning curves of LMs on Chinese. To mitigate the biases introduced by unequal lengths of the sentences in a minimal pair, we propose a new metric named sub-linear length normalized log-probabilities (SLLN-LP). Using SLLN-LP as the metric, our results show that \textsc{Anaphor}, \textsc{Quantifiers}, and \textsc{Ellipsis} in Chinese are difficult for LMs even up to 32B parameters, and that SLLN-LP successfully mitigates biases in ZhoBLiMP, JBLiMP and BLiMP. We conclude that future evaluations should be more carefully designed to consider the intricate relations between linking functions, LMs, and targeted minimal pairs.
[83] Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models
Xiaojun Wu, Junxi Liu, Huanyi Su, Zhouchi Lin, Yiyan Qi, Chengjin Xu, Jiajun Su, Jiajie Zhong, Fuwei Wang, Saizhuo Wang, Fengrui Hua, Jia Li, Jian Guo
Main category: cs.CL
TL;DR: Golden Touchstone is a comprehensive bilingual benchmark for evaluating financial LLMs across 8 core NLP tasks in Chinese and English, addressing limitations of existing financial benchmarks.
Details
Motivation: Existing financial benchmarks have limited language/task coverage, low-quality datasets, and inadequate adaptability for LLM evaluation, creating a need for standardized assessment methods as LLMs become more prevalent in finance.Method: Developed Golden Touchstone benchmark from extensive open-source data collection and industry demands, covering 8 core financial NLP tasks. Also created Touchstone-GPT model through continual pre-training and instruction tuning.
Result: Comparative analysis of major models (GPT-4o, Llama3, FinGPT, FinMA) revealed their strengths/limitations in processing complex financial information. Touchstone-GPT showed strong performance on the bilingual benchmark but had limitations in specific tasks.
Conclusion: The research provides a practical evaluation tool for financial LLMs and guides future development/optimization. Both Golden Touchstone benchmark and Touchstone-GPT model weights are publicly available.
Abstract: As large language models (LLMs) increasingly permeate the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. Existing financial benchmarks often suffer from limited language and task coverage, low-quality datasets, and inadequate adaptability for LLM evaluation. To address these limitations, we introduce Golden Touchstone, a comprehensive bilingual benchmark for financial LLMs, encompassing eight core financial NLP tasks in both Chinese and English. Developed from extensive open-source data collection and industry-specific demands, this benchmark thoroughly assesses models’ language understanding and generation capabilities. Through comparative analysis of major models such as GPT-4o, Llama3, FinGPT, and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-source Touchstone-GPT, a financial LLM trained through continual pre-training and instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations in specific tasks. This research provides a practical evaluation tool for financial LLMs and guides future development and optimization. The source code for Golden Touchstone and model weight of Touchstone-GPT have been made publicly available at https://github.com/IDEA-FinAI/Golden-Touchstone.
[84] Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation
Pengyue Jia, Derong Xu, Xiaopeng Li, Zhaocheng Du, Xiangyang Li, Yichao Wang, Yuhao Wang, Qidong Liu, Maolin Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao
Main category: cs.CL
TL;DR: RADIO is a preference alignment framework that bridges the gap between reranker and generator in RAG systems by using LLM-extracted rationales to align document ranking with generation needs.
Details
Motivation: There's an inherent gap between documents ranked as relevant by rerankers and those actually needed by generators to answer queries, due to differences in pre-training data and objectives between these components.Method: Proposes RADIO with two main components: 1) Rationale extraction using LLMs to identify reasoning needed for answering queries, and 2) Rationale-based alignment that reranks documents based on extracted rationales and fine-tunes the reranker to align preferences.
Result: Extensive experiments on two tasks across three datasets demonstrate the effectiveness of RADIO compared to baseline methods, showing improved alignment between reranker and generator.
Conclusion: RADIO successfully bridges the gap between reranker and generator in RAG pipelines through rationale distillation and preference alignment, with code released for reproducibility.
Abstract: The reranker and generator are two critical components in the Retrieval-Augmented Generation (i.e., RAG) pipeline, responsible for ranking relevant documents and generating responses. However, due to differences in pre-training data and objectives, there is an inevitable gap between the documents ranked as relevant by the reranker and those required by the generator to support answering the query. To address this gap, we propose RADIO, a novel and practical preference alignment framework with RAtionale DIstillatiOn. Specifically, we first propose a rationale extraction method that leverages the reasoning capabilities of Large Language Models (LLMs) to extract the rationales necessary for answering the query. Subsequently, a rationale-based alignment process is designed to rerank the documents based on the extracted rationales, and fine-tune the reranker to align the preferences. We conduct extensive experiments on two tasks across three datasets to demonstrate the effectiveness of our approach compared to baseline methods. Our code is released online to ease reproduction.
[85] Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation
Wenbo Zhang, Hengrui Cai, Wenyu Chen
Main category: cs.CL
TL;DR: Proposes hierarchical statistical model for LLM benchmarking that accounts for randomness, uses multiple generations for more accurate score estimation, and introduces prompt-level difficulty metrics.
Details
Motivation: Current LLM evaluation methods overlook inherent randomness by using deterministic generation or single random samples, leading to unaccounted sampling variance and unreliable benchmark scores.Method: Hierarchical statistical model incorporating both benchmark characteristics and LLM randomness, leveraging multiple generations per prompt to improve accuracy and reduce variance.
Result: Multiple generations improve benchmark score estimation accuracy and reduce variance; enables definition of P(correct) prompt-level difficulty score; creates data map visualizing prompt difficulty and semantics.
Conclusion: Proposed approach provides more comprehensive LLM benchmarking by accounting for randomness, offering fine-grained prompt insights, and enabling error detection in benchmark construction.
Abstract: Large language models (LLMs) have demonstrated significant utility in real-world applications, exhibiting impressive capabilities in natural language processing and understanding. Benchmark evaluations are crucial for assessing the capabilities of LLMs as they can provide a comprehensive assessment of their strengths and weaknesses. However, current evaluation methods often overlook the inherent randomness of LLMs by employing deterministic generation strategies or relying on a single random sample, resulting in unaccounted sampling variance and unreliable benchmark score estimates. In this paper, we propose a hierarchical statistical model that provides a more comprehensive representation of the benchmarking process by incorporating both benchmark characteristics and LLM randomness. We show that leveraging multiple generations improves the accuracy of estimating the benchmark score and reduces variance. Multiple generations also allow us to define $\mathbb P\left(\text{correct}\right)$, a prompt-level difficulty score based on correct ratios, providing fine-grained insights into individual prompts. Additionally, we create a data map that visualizes difficulty and semantics of prompts, enabling error detection and quality control in benchmark construction.
[86] HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma
Main category: cs.CL
TL;DR: HybridNorm is a hybrid normalization strategy that combines Pre-Norm and Post-Norm advantages by using QKV normalization in attention and Post-Norm in FFN, improving gradient flow and model robustness in deep transformers.
Details
Motivation: Transformers face challenges in training deep networks, especially with layer normalization positioning. Pre-Norm enables stable training but leads to suboptimal performance compared to Post-Norm. There's a need to combine the advantages of both approaches.Method: HybridNorm integrates Pre-Norm and Post-Norm by employing QKV normalization within the attention mechanism (Pre-Norm style) and Post-Norm in the feed-forward network (FFN) of each transformer block.
Result: HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches across multiple benchmarks on large-scale transformer models (both dense and sparse variants), demonstrating improved gradient flow and model robustness.
Conclusion: HybridNorm represents a more stable and effective technique for improving training and performance of deep transformer models, offering a practical solution to the normalization positioning challenge in transformer architectures.
Abstract: Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, many challenges remain in training deep transformer networks, especially regarding the position of the layer normalization. While Pre-Norm structures facilitate more stable training owing to their stronger identity path, they often lead to suboptimal performance compared to Post-Norm. In this paper, we propose $\textbf{HybridNorm}$, a simple yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. We provide both theoretical insights and empirical evidence to demonstrate that HybridNorm improves the gradient flow and the model robustness. Extensive experiments on large-scale transformer models, including both dense and sparse variants, show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches across multiple benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. Code is available at https://github.com/BryceZhuo/HybridNorm.
[87] OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification
Shangyu Li, Juyong Jiang, Tiancheng Zhao, Jiasi Shen
Main category: cs.CL
TL;DR: OSVBench is a new benchmark for evaluating LLMs on generating formal specifications for operating system kernel verification, featuring 245 complex tasks based on Hyperkernel with 20k-30k token contexts.
Details
Motivation: There's a need to evaluate LLMs' capability in generating complete formal specifications for verifying functional correctness of operating system kernels, which is a complex, long-context task that existing benchmarks don't adequately address.Method: The benchmark formulates specification generation as a program synthesis problem confined to a domain for specifying states and transitions. LLMs must understand a programming model and verification assumptions to delineate correct search space for syntax and semantics, then generate formal specifications based on the OS’s high-level functional description.
Result: Experimental results with 12 state-of-the-art LLMs show limited performance on specification generation for OS verification, with significant disparities highlighting differences in their ability to handle long-context code generation tasks.
Conclusion: Existing LLMs have limited capability in generating formal specifications for operating system verification, revealing challenges in handling complex, long-context code generation tasks, and highlighting the need for improved models in this domain.
Abstract: We introduce OSVBench, a new benchmark for evaluating Large Language Models (LLMs) on the task of generating complete formal specifications for verifying the functional correctness of operating system kernels. This benchmark is built upon a real-world operating system kernel, Hyperkernel, and consists of 245 complex specification generation tasks in total, each of which is a long-context task of about 20k-30k tokens. The benchmark formulates the specification generation task as a program synthesis problem confined to a domain for specifying states and transitions. This formulation is provided to LLMs through a programming model. The LLMs must be able to understand the programming model and verification assumptions before delineating the correct search space for syntax and semantics and generating formal specifications. Guided by the operating system’s high-level functional description, the LLMs are asked to generate a specification that fully describes all correct states and transitions for a potentially buggy code implementation of the operating system. Experimental results with 12 state-of-the-art LLMs indicate limited performance of existing LLMs on the specification generation task for operating system verification. Significant disparities in their performance highlight differences in their ability to handle long-context code generation tasks. The code are available at https://github.com/lishangyu-hkust/OSVBench
[88] Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models
David Guzman Piedrahita, Irene Strauss, Bernhard Schölkopf, Rada Mihalcea, Zhijing Jin
Main category: cs.CL
TL;DR: LLMs generally favor democratic values but show increased favorability toward authoritarian figures when prompted in Mandarin, and often cite authoritarian figures as role models even in non-political contexts.
Details
Motivation: While prior work has examined socio-demographic and left-right political biases in LLMs, little attention has been paid to how LLMs align with broader geopolitical value systems, particularly the democracy-authoritarianism spectrum, which is crucial as LLMs become increasingly integrated into everyday life and information ecosystems.Method: Proposes a novel methodology combining: (1) the F-scale (psychometric tool for measuring authoritarian tendencies), (2) FavScore (new metric for evaluating model favorability toward world leaders), and (3) role-model probing to assess which figures are cited as general role-models by LLMs.
Result: LLMs generally favor democratic values and leaders, but exhibit increased favorability toward authoritarian figures when prompted in Mandarin. Models often cite authoritarian figures as role models, even outside explicit political contexts.
Conclusion: LLMs may reflect and potentially reinforce global political ideologies, highlighting the importance of evaluating bias beyond conventional socio-political axes. The findings reveal nuanced geopolitical biases in language models.
Abstract: As Large Language Models (LLMs) become increasingly integrated into everyday life and information ecosystems, concerns about their implicit biases continue to persist. While prior work has primarily examined socio-demographic and left–right political dimensions, little attention has been paid to how LLMs align with broader geopolitical value systems, particularly the democracy–authoritarianism spectrum. In this paper, we propose a novel methodology to assess such alignment, combining (1) the F-scale, a psychometric tool for measuring authoritarian tendencies, (2) FavScore, a newly introduced metric for evaluating model favorability toward world leaders, and (3) role-model probing to assess which figures are cited as general role-models by LLMs. We find that LLMs generally favor democratic values and leaders, but exhibit increased favorability toward authoritarian figures when prompted in Mandarin. Further, models are found to often cite authoritarian figures as role models, even outside explicit political contexts. These results shed light on ways LLMs may reflect and potentially reinforce global political ideologies, highlighting the importance of evaluating bias beyond conventional socio-political axes. Our code is available at: https://github.com/irenestrauss/Democratic-Authoritarian-Bias-LLMs.
[89] Rethinking LLM Training through Information Geometry and Quantum Metrics
Riccardo Di Sipio
Main category: cs.CL
TL;DR: Information geometry provides a principled framework for understanding LLM optimization through Fisher information metrics, natural gradient descent, and curvature analysis, with potential quantum extensions.
Details
Motivation: LLM optimization occurs in high-dimensional parameter spaces with non-Euclidean structure, requiring better geometric understanding to explain phenomena like sharp minima, generalization, and scaling laws.Method: Using information geometry with Fisher information metric to frame the optimization landscape, enabling natural gradient descent and curvature-based analysis of LLM training dynamics.
Result: The geometric lens clarifies optimization phenomena and provides principled understanding of LLM training, though natural gradient descent remains often impractical for large-scale applications.
Conclusion: Curvature-based approaches deepen understanding of LLM optimization, with potential quantum extensions using Fubini-Study metric and Quantum Fisher Information for efficient optimization in quantum-enhanced systems.
Abstract: Optimization in large language models (LLMs) unfolds over high-dimensional parameter spaces with non-Euclidean structure. Information geometry frames this landscape using the Fisher information metric, enabling more principled learning via natural gradient descent. Though often impractical, this geometric lens clarifies phenomena such as sharp minima, generalization, and observed scaling laws. We argue that curvature-based approaches deepen our understanding of LLM training. Finally, we speculate on quantum analogies based on the Fubini-Study metric and Quantum Fisher Information, hinting at efficient optimization in quantum-enhanced systems.
[90] MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation
Daeyong Kwon, SeungHeon Doh, Juhan Nam
Main category: cs.CL
TL;DR: MusT-RAG is a Retrieval Augmented Generation framework that adapts general-purpose LLMs for music question answering by using a specialized music vector database and context-aware fine-tuning.
Details
Motivation: LLMs have strong zero-shot performance but limited effectiveness in music applications due to insufficient music-specific knowledge in their training data.Method: Proposes MusT-RAG framework with two key components: (1) MusWikiDB - a music-specialized vector database for retrieval, and (2) context information utilization during both inference and fine-tuning to transform general LLMs into music-specific models.
Result: MusT-RAG significantly outperforms traditional fine-tuning approaches in music domain adaptation, showing consistent improvements across both in-domain and out-of-domain music question answering benchmarks. MusWikiDB proves substantially more effective than general Wikipedia corpora with superior performance and computational efficiency.
Conclusion: The proposed MusT-RAG framework effectively addresses LLMs’ limitations in music applications through specialized retrieval augmentation and context-aware adaptation, demonstrating superior performance over traditional methods.
Abstract: Recent advancements in Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs’ effectiveness in music-related applications remains limited due to the relatively small proportion of music-specific knowledge in their training data. To address this limitation, we propose MusT-RAG, a comprehensive framework based on Retrieval Augmented Generation (RAG) to adapt general-purpose LLMs for text-only music question answering (MQA) tasks. RAG is a technique that provides external knowledge to LLMs by retrieving relevant context information when generating answers to questions. To optimize RAG for the music domain, we (1) propose MusWikiDB, a music-specialized vector database for the retrieval stage, and (2) utilizes context information during both inference and fine-tuning processes to effectively transform general-purpose LLMs into music-specific models. Our experiment demonstrates that MusT-RAG significantly outperforms traditional fine-tuning approaches in enhancing LLMs’ music domain adaptation capabilities, showing consistent improvements across both in-domain and out-of-domain MQA benchmarks. Additionally, our MusWikiDB proves substantially more effective than general Wikipedia corpora, delivering superior performance and computational efficiency.
[91] Understanding Syntactic Generalization in Structure-inducing Language Models
David Arps, Hassan Sajjad, Laura Kallmeyer
Main category: cs.CL
TL;DR: The paper compares three Structure-inducing Language Models (SiLMs) - Structformer, UDGN, and GPST - trained on both natural and synthetic data, evaluating their syntactic representations, grammaticality judgments, and training dynamics.
Details
Motivation: SiLMs show strong syntactic generalization and competitive NLP performance, but their basic properties remain underexplored. The study aims to systematically compare different SiLM architectures to understand their characteristics and performance.Method: Trained three SiLM architectures (Structformer, UDGN, GPST) from scratch on both natural language corpora (English, German, Chinese) and synthetic bracketing expressions. Evaluated them on: (1) properties of induced syntactic representations, (2) grammaticality judgment tasks, and (3) training dynamics.
Result: No single architecture dominates across all metrics. GPST performs most consistently across evaluation settings and outperforms others on long-distance dependencies in bracketing expressions. Small models trained on synthetic data provide a useful testbed for evaluating basic model properties.
Conclusion: Different SiLM architectures have significant differences, particularly in induced syntactic representations. GPST shows the most consistent performance. Synthetic data training with small models offers an effective approach for fundamental model evaluation.
Abstract: Structure-inducing Language Models (SiLM) are trained on a self-supervised language modeling task, and induce a hierarchical sentence representation as a byproduct when processing an input. SiLMs couple strong syntactic generalization behavior with competitive performance on various NLP tasks, but many of their basic properties are yet underexplored. In this work, we train three different SiLM architectures from scratch: Structformer (Shen et al., 2021), UDGN (Shen et al., 2022), and GPST (Hu et al., 2024b). We train these architectures on both natural language (English, German, and Chinese) corpora and synthetic bracketing expressions. The models are then evaluated with respect to (i) properties of the induced syntactic representations (ii) performance on grammaticality judgment tasks, and (iii) training dynamics. We find that none of the three architectures dominates across all evaluation metrics. However, there are significant differences, in particular with respect to the induced syntactic representations. The Generative Pretrained Structured Transformer (GPST; Hu et al. 2024) performs most consistently across evaluation settings, and outperforms the other models on long-distance dependencies in bracketing expressions. Furthermore, our study shows that small models trained on large amounts of synthetic data provide a useful testbed for evaluating basic model properties.
[92] RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning
Chia-Hsuan Hsu, Jun-En Ding, Hsin-Ling Hsu, Chih-Ho Hsu, Li-Hung Yao, Chun-Chieh Liao, Feng Liu, Fang-Ming Hung
Main category: cs.CL
TL;DR: RPRO is a reinforcement learning framework that enhances clinical reasoning in LLMs by combining preference optimization with quality-driven refinement, outperforming larger models on medical QA tasks.
Details
Motivation: Existing LLMs generate reasoning chains that lack factual accuracy and clinical reliability in medical question answering, requiring more advanced reasoning that integrates domain knowledge with logical inference.Method: Ranked Preference Reinforcement Optimization (RPRO) combines reinforcement learning with preference-driven reasoning refinement, using task-adaptive reasoning templates, probabilistic evaluation aligned with clinical workflows, groupwise ranking optimization based on Bradley-Terry model, and KL-divergence regularization for stable training.
Result: Experiments on PubMedQA, MedQA-USMLE, and FEMH clinical dataset show consistent improvements over strong baselines, with 2B-parameter model outperforming larger 7B-20B models including medical-specialized variants.
Conclusion: Combining preference optimization with quality-driven refinement provides a scalable and clinically grounded approach to building more reliable medical LLMs.
Abstract: Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO distinguishes itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns model outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley–Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA, MedQA-USMLE, and a real-world clinical dataset from Far Eastern Memorial Hospital (FEMH) demonstrate consistent improvements over strong baselines. Remarkably, our 2B-parameter model outperforms much larger 7B–20B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement provides a scalable and clinically grounded approach to building more reliable medical LLMs.
[93] SFT Doesn’t Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs
Jiacheng Lin, Zhongruo Wang, Kun Qian, Tian Wang, Arvind Srinivasan, Hansi Zeng, Ruochen Jiao, Xie Zhou, Jiri Gesi, Dakuo Wang, Yufan Guo, Kai Zhong, Weiqi Zhang, Sujay Sanghavi, Changyou Chen, Hyokun Yun, Lihong Li
Main category: cs.CL
TL;DR: SFT doesn’t always hurt general LLM capabilities - smaller learning rates help, and Token-Adaptive Loss Reweighting (TALR) outperforms other methods in balancing domain adaptation with general capability preservation.
Details
Motivation: The paper addresses the common belief that Supervised Fine-Tuning (SFT) on domain-specific datasets degrades LLMs' general capabilities, seeking to understand and mitigate this trade-off between domain adaptation and general performance preservation.Method: The authors first empirically show that smaller learning rates mitigate general performance degradation. They then provide theoretical analysis and propose Token-Adaptive Loss Reweighting (TALR). They compare TALR against baselines including L2 regularization, LoRA, model averaging, and FLOW.
Result: Smaller learning rates substantially mitigate general performance degradation while preserving target-domain performance. TALR consistently outperforms other methods in balancing domain-specific gains and general capabilities, though no method completely eliminates the trade-off.
Conclusion: Two practical guidelines: (1) use small learning rates for favorable trade-offs, and (2) adopt TALR when stronger balance is needed. The work provides both empirical evidence and theoretical understanding of the SFT trade-off.
Abstract: Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach to adapt Large Language Models (LLMs) to specialized tasks but is often believed to degrade their general capabilities. In this work, we revisit this trade-off and present both empirical and theoretical insights. First, we show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation while preserving comparable target-domain performance. We then provide a theoretical analysis that explains these phenomena and further motivates a new method, Token-Adaptive Loss Reweighting (TALR). Building on this, and recognizing that smaller learning rates alone do not fully eliminate general-performance degradation in all cases, we evaluate a range of strategies for reducing general capability loss, including L2 regularization, LoRA, model averaging, FLOW, and our proposed TALR. Experimental results demonstrate that while no method completely eliminates the trade-off, TALR consistently outperforms these baselines in balancing domain-specific gains and general capabilities. Finally, we distill our findings into practical guidelines for adapting LLMs to new domains: (i) using a small learning rate to achieve a favorable trade-off, and (ii) when a stronger balance is further desired, adopt TALR as an effective strategy.
[94] LLM Output Homogenization is Task Dependent
Shomik Jain, Jack Lanchantin, Maximilian Nickel, Karen Ullrich, Ashia Wilson, Jamelle Watson-Daniels
Main category: cs.CL
TL;DR: The paper addresses LLM output homogenization by proposing a task-dependent approach to define, evaluate, and mitigate homogenization, introducing task taxonomy, functional diversity metrics, and sampling techniques that increase diversity without sacrificing quality.
Details
Motivation: Current approaches to LLM output homogenization fail to account for task-dependent definitions of diversity - what constitutes problematic homogenization varies by task type (e.g., math vs creative writing).Method: 1) Develop task taxonomy with 8 categories having distinct homogenization concepts; 2) Introduce task-anchored functional diversity metrics; 3) Propose task-anchored sampling technique; 4) Challenge diversity-quality trade-off assumption.
Result: The approach successfully increases functional diversity for tasks where homogenization is undesired while preserving it where desired, and demonstrates that diversity can be increased without compromising response quality.
Conclusion: Task dependence is crucial for properly evaluating and mitigating LLM output homogenization, and the proposed framework provides a more nuanced approach that aligns with different task requirements.
Abstract: A large language model can be less helpful if it exhibits output response homogenization. But whether two responses are considered homogeneous, and whether such homogenization is problematic, both depend on the task category. For instance, in objective math tasks, we often expect no variation in the final answer but anticipate variation in the problem-solving strategy. Whereas, for creative writing tasks, we may expect variation in key narrative components (e.g. plot, genre, setting, etc), beyond the vocabulary or embedding diversity produced by temperature-sampling. Previous work addressing output homogenization often fails to conceptualize diversity in a task-dependent way. We address this gap in the literature directly by making the following contributions. (1) We present a task taxonomy comprised of eight task categories that each have distinct concepts of output homogenization. (2) We introduce task-anchored functional diversity to better evaluate output homogenization. (3) We propose a task-anchored sampling technique that increases functional diversity for task categories where homogenization is undesired, while preserving it where it is desired. (4) We challenge the perceived existence of a diversity-quality trade-off by increasing functional diversity while maintaining response quality. Overall, we demonstrate how task dependence improves the evaluation and mitigation of output homogenization.
[95] Why Chain of Thought Fails in Clinical Text Understanding
Jiageng Wu, Kevin Xie, Bowen Gu, Nils Krüger, Kueiyu Joshua Lin, Jie Yang
Main category: cs.CL
TL;DR: CoT prompting hurts performance for 86.3% of LLMs on clinical text tasks, creating a paradox where interpretability improves but reliability decreases.
Details
Motivation: LLMs are increasingly used in clinical care where accuracy and transparent reasoning are critical. While CoT prompting has shown benefits in other domains for performance and interpretability, its effectiveness on clinical EHR data (which is lengthy, fragmented, and noisy) remains unexplored.Method: Conducted first large-scale systematic study of CoT for clinical text understanding. Assessed 95 advanced LLMs on 87 real-world clinical text tasks covering 9 languages and 8 task types. Performed fine-grained analyses of reasoning length, medical concept alignment, and error profiles using both LLM-as-a-judge evaluation and clinical expert evaluation.
Result: Contrary to prior findings in other domains, 86.3% of models suffered consistent performance degradation in CoT setting. More capable models remained relatively robust while weaker ones suffered substantial declines. Systematic patterns revealed when and why CoT fails in clinical contexts.
Conclusion: CoT creates a critical paradox: enhances interpretability but may undermine reliability in clinical text tasks. The study provides empirical basis for clinical reasoning strategies of LLMs, highlighting need for transparent and trustworthy approaches in clinical applications.
Abstract: Large language models (LLMs) are increasingly being applied to clinical care, a domain where both accuracy and transparent reasoning are critical for safe and trustworthy deployment. Chain-of-thought (CoT) prompting, which elicits step-by-step reasoning, has demonstrated improvements in performance and interpretability across a wide range of tasks. However, its effectiveness in clinical contexts remains largely unexplored, particularly in the context of electronic health records (EHRs), the primary source of clinical documentation, which are often lengthy, fragmented, and noisy. In this work, we present the first large-scale systematic study of CoT for clinical text understanding. We assess 95 advanced LLMs on 87 real-world clinical text tasks, covering 9 languages and 8 task types. Contrary to prior findings in other domains, we observe that 86.3% of models suffer consistent performance degradation in the CoT setting. More capable models remain relatively robust, while weaker ones suffer substantial declines. To better characterize these effects, we perform fine-grained analyses of reasoning length, medical concept alignment, and error profiles, leveraging both LLM-as-a-judge evaluation and clinical expert evaluation. Our results uncover systematic patterns in when and why CoT fails in clinical contexts, which highlight a critical paradox: CoT enhances interpretability but may undermine reliability in clinical text tasks. This work provides an empirical basis for clinical reasoning strategies of LLMs, highlighting the need for transparent and trustworthy approaches.
[96] Non-Collaborative User Simulators for Tool Agents
Jeonghoon Shim, Woojung Song, Cheyon Jin, Seungwon KooK, Yohan Jo
Main category: cs.CL
TL;DR: A user simulator architecture that generates four types of non-collaborative behaviors to better train and test tool agents against real-world challenging users.
Details
Motivation: Existing user simulators are too agent-friendly and cooperative, failing to prepare tool agents for real-world non-collaborative users who exhibit challenging behaviors.Method: Proposed a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances.
Result: Experiments on MultiWOZ and τ-bench show significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users, with increased hallucinations and dialogue breakdowns.
Conclusion: Provides an extensible user simulation framework to help develop more robust tool agents and preemptively diagnose weaknesses under challenging real-world conditions.
Abstract: Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, which fails to train and test agents against non-collaborative users in the real world. To address this, we propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and $τ$-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users. We provide detailed analyses of agents’ weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns. Ultimately, we contribute an easily extensible user simulation framework to help the research community develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services.
[97] SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents
Gyuhyeon Seo, Jungwoo Yang, Junseong Pyo, Nalim Kim, Jonggeun Lee, Yohan Jo
Main category: cs.CL
TL;DR: SimuHome: A realistic smart home simulator built on Matter protocol with 600-episode benchmark for evaluating LLM agents on complex smart home tasks, revealing performance vs. practicality trade-offs.
Details
Motivation: Smart homes present unique challenges for LLM agents (latent intents, temporal dependencies, device constraints, scheduling), but lack realistic simulation environments and challenging benchmarks to evaluate agent capabilities.Method: Developed SimuHome - a time-accelerated smart home simulator built on Matter protocol standard, supporting API calls and environmental variable changes. Created benchmark of 600 episodes across 12 query types requiring complex capabilities.
Result: Evaluated 16 agents under ReAct framework: Models <7B parameters performed negligibly; GPT-4.1 struggled with implicit intent, state verification, and temporal scheduling; reasoning models (GPT-5.1) outperformed standard models but required 3x inference time.
Conclusion: SimuHome enables realistic smart home agent evaluation and deployment. There’s a critical trade-off between task performance and real-world practicality, with reasoning models showing superior performance but prohibitive inference times for real-time applications.
Abstract: Large Language Model (LLM) agents excel at multi-step, tool-augmented tasks. However, smart homes introduce distinct challenges, requiring agents to handle latent user intents, temporal dependencies, device constraints, scheduling, and more. The main bottlenecks for developing smart home agents with such capabilities include the lack of a realistic simulation environment where agents can interact with devices and observe the results, as well as a challenging benchmark to evaluate them. To address this, we introduce $\textbf{SimuHome}$, a time-accelerated home environment that simulates smart devices, supports API calls, and reflects changes in environmental variables. By building the simulator on the Matter protocol, the global industry standard for smart home communication, SimuHome provides a high-fidelity environment, and agents validated in SimuHome can be deployed on real Matter-compliant devices with minimal adaptation. We provide a challenging benchmark of 600 episodes across twelve user query types that require the aforementioned capabilities. Our evaluation of 16 agents under a unified ReAct framework reveals distinct capabilities and limitations across models. Models under 7B parameters exhibited negligible performance across all query types. Even GPT-4.1, the best-performing standard model, struggled with implicit intent inference, state verification, and particularly temporal scheduling. While reasoning models such as GPT-5.1 consistently outperformed standard models on every query type, they required over three times the average inference time, which can be prohibitive for real-time smart home applications. This highlights a critical trade-off between task performance and real-world practicality.
[98] AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees
Hongyi Zhou, Jin Zhu, Pingfan Su, Kai Ye, Ying Yang, Shakeel A O B Gavioli-Akilagun, Chengchun Shi
Main category: cs.CL
TL;DR: AdaDetectGPT is a novel LLM text detector that adaptively learns a witness function from training data to improve upon existing logits-based detection methods, achieving up to 37% performance gains.
Details
Motivation: Existing state-of-the-art logits-based detectors rely solely on log-probability statistics from source LLMs, which can be sub-optimal for distinguishing human vs. LLM-generated text.Method: AdaDetectGPT introduces a classifier that adaptively learns a witness function from training data to enhance logits-based detectors, providing statistical guarantees on detection rates.
Result: Extensive numerical studies show AdaDetectGPT nearly uniformly improves state-of-the-art methods across various dataset-LLM combinations, with improvements reaching up to 37%.
Conclusion: AdaDetectGPT represents a significant advancement in LLM text detection by adaptively learning from data rather than relying solely on log-probability statistics, with open-source implementation available.
Abstract: We study the problem of determining whether a piece of text has been authored by a human or by a large language model (LLM). Existing state of the art logits-based detectors make use of statistics derived from the log-probability of the observed text evaluated using the distribution function of a given source LLM. However, relying solely on log probabilities can be sub-optimal. In response, we introduce AdaDetectGPT – a novel classifier that adaptively learns a witness function from training data to enhance the performance of logits-based detectors. We provide statistical guarantees on its true positive rate, false positive rate, true negative rate and false negative rate. Extensive numerical studies show AdaDetectGPT nearly uniformly improves the state-of-the-art method in various combination of datasets and LLMs, and the improvement can reach up to 37%. A python implementation of our method is available at https://github.com/Mamba413/AdaDetectGPT.
[99] Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning
Wannan, Yang, Xinchi Qiu, Lei Yu, Yuchen Zhang, Aobo Yang, Narine Kokhlikyan, Nicola Cancedda, Diego Garcia-Olano
Main category: cs.CL
TL;DR: CASAL is an efficient algorithm that bakes activation steering benefits into model weights, enabling LLMs to abstain from unknown questions without real-time intervention, reducing hallucinations by 30-40% with minimal compute/data requirements.
Details
Motivation: LLMs often hallucinate confidently instead of admitting ignorance. Existing activation steering methods require real-time monitoring during inference, limiting practical deployment. There's a need for an efficient method that embeds self-knowledge awareness directly into model weights.Method: CASAL (Contrastive Activation Steering for Amortized Learning) connects interpretability with amortized optimization. It trains only a submodule of a single transformer layer using contrastive learning to bake activation steering benefits into model weights, enabling models to distinguish known from unknown questions.
Result: Reduces hallucinations by 30-40% across multiple short-form QA benchmarks. 30x more compute-efficient and 20x more data-efficient than LoRA-based baselines (SFT, DPO). Generalizes effectively to OOD domains and works with both text-only and vision-language models, including dense and MoE architectures.
Conclusion: CASAL represents a promising step for practical deployment of interpretability-inspired methods in production systems, offering efficient hallucination mitigation without real-time intervention requirements.
Abstract: Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that connects interpretability with amortized optimization. CASAL directly bakes the benefits of activation steering into model’s weights. Once trained, LLMs answer questions they know while abstaining from answering those they do not. CASAL’s light-weight design requires training only a submodule of a single transformer layer and yet reduces hallucination by 30%-40% across multiple short-form QA benchmarks. CASAL is 30x more compute-efficient and 20x more data-efficient than strong LoRA-based baselines such as SFT and DPO, boosting its practical applicability in data scarce domains. Importantly, CASAL also generalizes effectively to out-of-distribution (OOD) domains. We showcase CASAL’s flexibility in mitigating hallucinations in both text-only and vision-language models. To our knowledge, CASAL is the first steering-based training method that has been shown to be effective for both dense and Mixture-of-Experts (MoE) models. CASAL represents a promising step forward for applying interpretability-inspired method for practical deployment in production systems.
[100] mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations
Guy Dar
Main category: cs.CL
TL;DR: mini-vec2vec is a simpler, more efficient linear alternative to vec2vec for aligning text embedding spaces without parallel data, offering better stability and interpretability.
Details
Motivation: vec2vec provides near-perfect alignment of text embedding spaces without parallel data but is expensive, unstable, and computationally intensive, limiting its practical adoption.Method: Three-stage approach: 1) tentative matching of pseudo-parallel embedding vectors, 2) transformation fitting, and 3) iterative refinement. The learned mapping is a linear transformation.
Result: mini-vec2vec exceeds vec2vec by orders of magnitude in efficiency while matching or exceeding its results. It’s highly robust, stable, and the linear transformation is interpretable.
Conclusion: The method’s stability, efficiency, and interpretable algorithmic steps enable scaling and unlock new opportunities for adoption in various domains and fields.
Abstract: We build upon vec2vec, a procedure designed to align text embedding spaces without parallel data. vec2vec finds a near-perfect alignment, but it is expensive and unstable. We present mini-vec2vec, a simple and efficient alternative that requires substantially lower computational cost and is highly robust. Moreover, the learned mapping is a linear transformation. Our method consists of three main stages: a tentative matching of pseudo-parallel embedding vectors, transformation fitting, and iterative refinement. Our linear alternative exceeds the original instantiation of vec2vec by orders of magnitude in efficiency, while matching or exceeding their results. The method’s stability and interpretable algorithmic steps facilitate scaling and unlock new opportunities for adoption in new domains and fields.
[101] AgriGPT-VL: Agricultural Vision-Language Understanding Suite
Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Xiao Xu, Jianyu Zhang, Nueraili Aierken, Runhe Huang, Hongjian Lin, Yibin Ying, Shijian Li
Main category: cs.CL
TL;DR: AgriGPT-VL Suite: A unified multimodal framework for agriculture featuring the largest vision-language corpus (Agri-3M-VL), a specialized vision-language model (AgriGPT-VL), and a challenging evaluation benchmark (AgriBench-VL-4K).
Details
Motivation: Agricultural applications face constraints due to scarcity of domain-tailored models, curated vision-language corpora, and rigorous evaluation frameworks, despite rapid advances in multimodal large language models.Method: Threefold approach: 1) Created Agri-3M-VL corpus using scalable multi-agent data generator (1M image-caption pairs, 2M VQA pairs, 50K expert VQA, 15K GRPO samples); 2) Developed AgriGPT-VL model via progressive curriculum of textual grounding, multimodal shallow/deep alignment, and GRPO refinement; 3) Established AgriBench-VL-4K evaluation suite with open-ended/image-grounded questions and LLM-as-a-judge framework.
Result: AgriGPT-VL outperforms leading general-purpose VLMs on AgriBench-VL-4K with higher pairwise win rates in LLM-as-a-judge evaluation, while remaining competitive on text-only AgriBench-13K without language ability degradation. Ablation studies confirm gains from alignment and GRPO refinement stages.
Conclusion: The AgriGPT-VL Suite addresses agricultural AI constraints by providing comprehensive resources (corpus, model, benchmark) that enable strong multimodal reasoning while preserving text-only capabilities, with plans to open source all resources for reproducible research in low-resource agricultural settings.
Abstract: Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the scarcity of domain-tailored models, curated vision-language corpora, and rigorous evaluation. To address these challenges, we present the AgriGPT-VL Suite, a unified multimodal framework for agriculture. Our contributions are threefold. First, we introduce Agri-3M-VL, the largest vision-language corpus for agriculture to our knowledge, curated by a scalable multi-agent data generator; it comprises 1M image-caption pairs, 2M image-grounded VQA pairs, 50K expert-level VQA instances, and 15K GRPO reinforcement learning samples. Second, we develop AgriGPT-VL, an agriculture-specialized vision-language model trained via a progressive curriculum of textual grounding, multimodal shallow/deep alignment, and GRPO refinement. This method achieves strong multimodal reasoning while preserving text-only capability. Third, we establish AgriBench-VL-4K, a compact yet challenging evaluation suite with open-ended and image-grounded questions, paired with multi-metric evaluation and an LLM-as-a-judge framework. Experiments show that AgriGPT-VL outperforms leading general-purpose VLMs on AgriBench-VL-4K, achieving higher pairwise win rates in the LLM-as-a-judge evaluation. Meanwhile, it remains competitive on the text-only AgriBench-13K with no noticeable degradation of language ability. Ablation studies further confirm consistent gains from our alignment and GRPO refinement stages. We will open source all of the resources to support reproducible research and deployment in low-resource agricultural settings.
[102] Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization
Wengao Ye, Yan Liang, Lianlei Shan
Main category: cs.CL
TL;DR: LTPO is a test-time optimization framework that treats latent reasoning vectors as dynamic parameters optimized per problem instance using policy gradients and intrinsic confidence-based rewards, achieving robust performance on challenging reasoning tasks.
Details
Motivation: Current latent reasoning approaches in LLMs are brittle on challenging out-of-distribution tasks where robust reasoning is critical, despite being more efficient than explicit Chain-of-Thought reasoning.Method: Latent Thought Policy Optimization (LTPO) treats intermediate latent “thought” vectors as dynamic parameters optimized for each problem instance using online policy gradient methods guided by intrinsic confidence-based rewards computed from the LLM’s own output distributions.
Result: LTPO matches or surpasses strong baselines on standard reasoning tasks and demonstrates remarkable robustness where others fail, achieving substantial improvements on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy.
Conclusion: LTPO provides a parameter-free framework that enhances LLM reasoning at test time without model updates, showcasing unique capability for complex reasoning through dynamic optimization of latent thought vectors.
Abstract: Recent advancements in Large Language Models (LLMs) have shifted from explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning, where intermediate thoughts are represented as vectors rather than text. However, latent reasoning can be brittle on challenging, out-of-distribution tasks where robust reasoning is most critical. To overcome these limitations, we introduce Latent Thought Policy Optimization (LTPO), a parameter-free framework that enhances LLM reasoning entirely at test time, without requiring model parameter updates. LTPO treats intermediate latent “thought” vectors as dynamic parameters that are actively optimized for each problem instance. It employs an online policy gradient method guided by an intrinsic, confidence-based reward signal computed directly from the frozen LLM’s own output distributions, eliminating the need for external supervision or expensive text generation during optimization. Extensive experiments on five reasoning benchmarks show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements, showcasing a unique capability for complex reasoning.
[103] Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
Omri Uzan, Asaf Yehudai, Roi pony, Eyal Shnarch, Ariel Gera
Main category: cs.CL
TL;DR: GQR is a test-time optimization method that refines a vision-centric retriever’s query embedding using guidance from a lightweight text retriever, achieving state-of-the-art performance with significantly better efficiency.
Details
Motivation: Current multimodal retrieval models have two main problems: 1) They use very large query/document representations that hinder deployment and scalability, and 2) Vision-centric approaches suffer from the modality gap in vision-language models. The authors want to see if a lightweight text retriever can enhance a stronger vision-centric model without the computational overhead.Method: Guided Query Refinement (GQR) - a test-time optimization method that refines the primary (vision-centric) retriever’s query embedding using guidance from a complementary (text) retriever’s scores. Unlike existing hybrid methods that do coarse fusion of ranks/scores, GQR exploits rich interactions within each model’s representation space through optimization at test time.
Result: GQR allows vision-centric models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. It effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval.
Conclusion: The proposed GQR method successfully addresses the efficiency-performance trade-off in multimodal retrieval by combining a strong vision-centric model with a lightweight text retriever through test-time optimization, achieving state-of-the-art results with dramatically improved computational efficiency.
Abstract: Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model’s representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever’s query embedding using guidance from a complementary retriever’s scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows vision-centric models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at https://github.com/IBM/test-time-hybrid-retrieval
[104] SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs
Dachuan Shi, Abedelkadir Asi, Keying Li, Xiangchi Yuan, Leyan Pan, Wenke Lee, Wen Xiao
Main category: cs.CL
TL;DR: SwiReasoning is a training-free framework that dynamically switches between explicit and latent reasoning in LLMs based on confidence estimation, improving both accuracy and token efficiency on math/STEM benchmarks.
Details
Motivation: Latent reasoning in LLMs faces two key challenges: 1) purely latent reasoning broadens search distribution, diffuses probability mass, introduces noise, and impedes convergence to high-confidence solutions, hurting accuracy; 2) overthinking persists even without explicit text, wasting tokens and degrading efficiency.Method: SwiReasoning features two innovations: 1) dynamic switching between explicit and latent reasoning guided by block-wise confidence estimated from entropy trends in next-token distributions to balance exploration/exploitation and promote timely convergence; 2) limiting maximum number of thinking-block switches to curb overthinking and improve token efficiency.
Result: On widely used mathematics and STEM benchmarks, SwiReasoning consistently improves average accuracy by 1.5%-2.8% across reasoning LLMs of different model families and scales. Under constrained budgets, it improves average token efficiency by 56%-79%, with larger gains as budgets tighten.
Conclusion: SwiReasoning effectively addresses the challenges of latent reasoning by dynamically switching between reasoning modes based on confidence estimation, achieving both improved accuracy and significant token efficiency gains without requiring training.
Abstract: Recent work shows that, beyond discrete reasoning through explicit chain-of-thought steps, which are limited by the boundaries of natural languages, large language models (LLMs) can also reason continuously in latent space, allowing richer information per step and thereby improving token efficiency. Despite this promise, latent reasoning still faces two challenges, especially in training-free settings: 1) purely latent reasoning broadens the search distribution by maintaining multiple implicit paths, which diffuses probability mass, introduces noise, and impedes convergence to a single high-confidence solution, thereby hurting accuracy; and 2) overthinking persists even without explicit text, wasting tokens and degrading efficiency. To address these issues, we introduce SwiReasoning, a training-free framework for LLM reasoning which features two key innovations: 1) SwiReasoning dynamically switches between explicit and latent reasoning, guided by block-wise confidence estimated from entropy trends in next-token distributions, to balance exploration and exploitation and promote timely convergence. 2) By limiting the maximum number of thinking-block switches, SwiReasoning curbs overthinking and improves token efficiency across varying problem difficulties. On widely used mathematics and STEM benchmarks, SwiReasoning consistently improves average accuracy by 1.5%-2.8% across reasoning LLMs of different model families and scales. Furthermore, under constrained budgets, SwiReasoning improves average token efficiency by 56%-79%, with larger gains as budgets tighten.
[105] TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B
Toshiki Nakai, Ravi Kiran Chikkala, Lena Sophie Oberkircher, Nicholas Jennings, Natalia Skachkova, Tatiana Anikina, Jesujoba Oluwadara Alabi
Main category: cs.CL
TL;DR: TRepLiNa (CKA+REPINA) improves low-resource language translation by aligning mid-level layers in multilingual LLMs, showing effectiveness in data-scarce settings.
Details
Motivation: Addresses India's linguistic gap by improving translation quality from low-resource languages (LRLs) to high-resource languages, focusing on practical solutions for diverse Indian languages with limited resources.Method: Combines Centered Kernel Alignment (CKA) for cross-lingual representation alignment with REPINA regularization to constrain parameter updates, creating TRepLiNa. Experiments with Aya-23 8B using QLoRA in zero-shot, few-shot, and fine-tuning settings across MMLoSo language pairs (Mundari, Santali, Bhili) with Hindi/English pivots.
Result: Aligning mid-level layers using TRepLiNa improves LRL translation quality, especially in data-scarce settings, demonstrating a low-cost, practical approach.
Conclusion: TRepLiNa provides an effective method for enhancing low-resource language translation by strategically aligning specific internal layers in multilingual LLMs, offering a practical solution for linguistic diversity challenges.
Abstract: The 2025 Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo) Language Challenge addresses one of India’s most pressing linguistic gaps: the lack of resources for its diverse low-resource languages (LRLs). In this study, we investigate whether enforcing cross-lingual similarity in specific internal layers of a decoder-only multilingual large language model (LLM) can improve translation quality from LRL to high-resource language (HRL). Specifically, we combine Centered Kernel Alignment (CKA), a similarity metric that encourages representations of different languages to align, with REPINA, a regularization method that constrains parameter updates to remain close to the pretrained model, into a joint method we call TRepLiNa. In this research project, we experiment with zero-shot, few-shot, and fine-tuning settings using Aya-23 8B with QLoRA across MMLoSo shared task language pairs (Mundari, Santali, Bhili) with Hindi/English pivots. Our results show that aligning mid-level layers using TRepLiNa (CKA+REPINA) is a low-cost, practical approach to improving LRL translation, especially in data-scarce settings.
[106] Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference
Hua Cai, Shuang Zhao, Liang Zhang, Xuli Shen, Qing Xu, Weilin Shen, Zihao Wen, Tianke Ban
Main category: cs.CL
TL;DR: Unilaw-R1 is a 7B-parameter LLM specialized for legal reasoning, addressing legal knowledge gaps, reasoning reliability, and business generalization through curated data and two-stage training, achieving competitive performance on legal benchmarks.
Details
Motivation: While reasoning-focused LLMs are advancing in various domains, their capabilities in handling complex legal problems remain underexplored. There's a need for specialized models that can address three core challenges in legal AI: insufficient legal knowledge, unreliable reasoning logic, and weak business generalization.Method: 1. Constructed Unilaw-R1-Data: a high-quality dataset of 17K distilled and screened chain-of-thought samples. 2. Used two-stage training: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). 3. Created Unilaw-R1-Eval benchmark to assess legal reasoning across single- and multi-choice tasks.
Result: Unilaw-R1 outperforms all models of similar scale and achieves performance comparable to much larger models (54.9% vs DeepSeek-R1-Distill-Qwen-32B). After domain-specific training, it shows significant gains on LawBench and LexEval, exceeding Qwen-2.5-7B-Instruct by an average margin of 6.6%.
Conclusion: Unilaw-R1 demonstrates that lightweight 7B-parameter models can effectively handle complex legal reasoning tasks when trained with specialized data and methods, offering a cost-effective solution for legal AI applications while supporting interpretable decision-making.
Abstract: Reasoning-focused large language models (LLMs) are rapidly evolving across various domains, yet their capabilities in handling complex legal problems remains underexplored. In this paper, we introduce Unilaw-R1, a large language model tailored for legal reasoning. With a lightweight 7-billion parameter scale, Unilaw-R1 significantly reduces deployment cost while effectively tackling three core challenges in the legal domain: insufficient legal knowledge, unreliable reasoning logic, and weak business generalization. To address these issues, we first construct Unilaw-R1-Data, a high-quality dataset containing 17K distilled and screened chain-of-thought (CoT) samples. Based on this, we adopt a two-stage training strategy combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), which significantly boosts the performance on complex legal reasoning tasks and supports interpretable decision-making in legal AI applications. To assess legal reasoning ability, we also introduce Unilaw-R1-Eval, a dedicated benchmark designed to evaluate models across single- and multi-choice legal tasks. Unilaw-R1 demonstrates strong results on authoritative benchmarks, outperforming all models of similar scale and achieving performance on par with the much larger DeepSeek-R1-Distill-Qwen-32B (54.9%). Following domain-specific training, it also showed significant gains on LawBench and LexEval, exceeding Qwen-2.5-7B-Instruct (46.6%) by an average margin of 6.6%.
[107] Grounding Long-Context Reasoning with Contextual Normalization for Retrieval-Augmented Generation
Jiamin Chen, Yuchen Li, Xinyu Ma, Xinran Chen, Xiaokun Zhang, Shuaiqiang Wang, Chen Ma, Dawei Yin
Main category: cs.CL
TL;DR: Context formatting in RAG significantly impacts performance; proposed Contextual Normalization improves robustness by standardizing context representations.
Details
Motivation: Prior RAG research focused on retrieval quality and prompting, but neglected how retrieved documents are framed (context format). The authors discovered that superficial formatting choices like delimiters and structural markers can cause substantial accuracy and stability shifts even with identical semantic content.Method: 1) Designed controlled experiments varying context density, delimiter styles, and positional placement to understand performance differences. 2) Introduced Contextual Normalization - a lightweight strategy that adaptively standardizes context representations before generation.
Result: Extensive experiments on controlled and real-world RAG benchmarks show the proposed strategy consistently improves robustness to order variation and strengthens long-context utilization across diverse settings.
Conclusion: Reliable RAG depends not only on retrieving the right content but also on how that content is presented. The work provides new empirical evidence and a practical technique for better long-context reasoning through context formatting optimization.
Abstract: Retrieval-Augmented Generation (RAG) has become an essential approach for extending the reasoning and knowledge capacity of large language models (LLMs). While prior research has primarily focused on retrieval quality and prompting strategies, the influence of how the retrieved documents are framed, i.e., context format, remains underexplored. We show that seemingly superficial choices, such as delimiters or structural markers in key-value extraction, can induce substantial shifts in accuracy and stability, even when semantic content is identical. To systematically investigate this effect, we design controlled experiments that vary context density, delimiter styles, and positional placement, revealing the underlying factors that govern performance differences. Building on these insights, we introduce Contextual Normalization, a lightweight strategy that adaptively standardizes context representations before generation. Extensive experiments on both controlled and real-world RAG benchmarks across diverse settings demonstrate that the proposed strategy consistently improves robustness to order variation and strengthens long-context utilization. These findings underscore that reliable RAG depends not only on retrieving the right content, but also on how that content is presented, offering both new empirical evidence and a practical technique for better long-context reasoning.
[108] Evaluating Long-Term Memory for Long-Context Question Answering
Alessandra Terranova, Björn Ross, Alexandra Birch
Main category: cs.CL
TL;DR: Memory-augmented LLMs reduce token usage by 90% while maintaining accuracy in long-context dialogues, with optimal memory architecture scaling with model capability.
Details
Motivation: LLMs need memory for true conversational continuity and experiential learning, but it's unclear which memory types are most effective for long-context conversational tasks requiring diverse reasoning strategies.Method: Systematic evaluation of memory-augmented methods on long-context dialogues with QA tasks: full-context prompting, semantic memory (RAG and agentic memory), episodic memory (in-context learning), and procedural memory (prompt optimization).
Result: Memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture should scale with model capability: foundation models benefit most from RAG, while stronger instruction-tuned models gain from episodic learning and complex agentic semantic memory.
Conclusion: Episodic memory helps LLMs recognize their knowledge limits. Optimal memory systems should match model capabilities, with simpler approaches for foundation models and more complex episodic/agentic memory for advanced instruction-tuned models.
Abstract: In order for large language models to achieve true conversational continuity and benefit from experiential learning, they need memory. While research has focused on the development of complex memory systems, it remains unclear which types of memory are most effective for long-context conversational tasks. We present a systematic evaluation of memory-augmented methods on long-context dialogues annotated for question-answering tasks that require diverse reasoning strategies. We analyse full-context prompting, semantic memory through retrieval-augmented generation and agentic memory, episodic memory through in-context learning, and procedural memory through prompt optimization. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture complexity should scale with model capability, with foundation models benefitting most from RAG, and stronger instruction-tuned models gaining from episodic learning through reflections and more complex agentic semantic memory. In particular, episodic memory can help LLMs recognise the limits of their own knowledge.
[109] Can Fine-Tuning Erase Your Edits? On the Fragile Coexistence of Knowledge Editing and Adaptation
Yinjie Cheng, Paul Youssef, Christin Seifert, Jörg Schlötterer, Zhixue Zhao
Main category: cs.CL
TL;DR: Fine-tuning after knowledge editing causes edits to decay, with survival rates varying by editing method. Fine-tuning edited layers can remove edits, while fine-tuning non-edited layers impairs more edits than full fine-tuning.
Details
Motivation: The paper investigates whether knowledge edits survive when models are subsequently fine-tuned, addressing practical concerns about edit persistence (requiring re-editing) and safety risks (propagating malicious edits).Method: Systematically quantifies edit decay after fine-tuning by examining how different fine-tuning configurations affect knowledge editing survival rates across various editing methods.
Result: Edits decay after fine-tuning with varying survival rates (AlphaEdit decays more than MEMIT). Fine-tuning edited layers effectively removes edits with minor downstream performance cost. Surprisingly, fine-tuning non-edited layers impairs more edits than full fine-tuning.
Conclusion: The study establishes empirical baselines for integrating knowledge editing with fine-tuning, showing that evaluating model editing must consider the full LLM application pipeline, and provides actionable strategies for edit management.
Abstract: Knowledge editing has emerged as a lightweight alternative to retraining for correcting or injecting specific facts in large language models (LLMs). Meanwhile, fine-tuning remains the default operation for adapting LLMs to new domains and tasks. Despite their widespread adoption, these two post-training interventions have been studied in isolation, leaving open a crucial question: if we fine-tune an edited model, do the edits survive? This question is motivated by two practical scenarios: removing covert or malicious edits, and preserving beneficial edits. If fine-tuning impairs edits (Fig.1), current KE methods become less useful, as every fine-tuned model would require re-editing, which significantly increases the cost; if edits persist, fine-tuned models risk propagating hidden malicious edits, raising serious safety concerns. To this end, we systematically quantify edit decay after fine-tuning, investigating how fine-tuning affects knowledge editing. Our results show that edits decay after fine-tuning, with survival varying across configurations, e.g., AlphaEdit edits decay more than MEMIT edits. Further, we find that fine-tuning edited layers only can effectively remove edits, though at a slight cost to downstream performance. Surprisingly, fine-tuning non-edited layers impairs more edits than full fine-tuning. Overall, our study establishes empirical baselines and actionable strategies for integrating knowledge editing with fine-tuning, and underscores that evaluating model editing requires considering the full LLM application pipeline.
[110] SPOT: An Annotated French Corpus and Benchmark for Detecting Critical Interventions in Online Conversations
Manon Berriche, Célia Nouri, Chloé Clavel, Jean-Philippe Cointet
Main category: cs.CL
TL;DR: SPOT introduces the first annotated corpus for detecting “stopping points” in online discussions - subtle interventions that pause or redirect conversations, operationalized as a binary classification task on French Facebook comments about misinformation.
Details
Motivation: Current frameworks like counterspeech or social correction often overlook subtle, ordinary critical interventions in online discussions. The authors aim to translate the sociological concept of "stopping points" into a reproducible NLP task to better understand these nuanced conversational dynamics.Method: Created SPOT corpus with 43,305 manually annotated French Facebook comments linked to false information URLs, enriched with contextual metadata. Benchmarking includes fine-tuned CamemBERT encoder models and instruction-tuned LLMs with various prompting strategies.
Result: Fine-tuned encoders outperform prompted LLMs by more than 10 percentage points in F1 score. Incorporating contextual metadata improves encoder F1 scores from 0.75 to 0.78. Supervised learning proves crucial for emerging non-English social media tasks.
Conclusion: The SPOT corpus successfully operationalizes stopping points as an NLP task, demonstrating the value of supervised learning for nuanced social media analysis in non-English contexts. The released dataset, guidelines, and code support transparency and reproducible research.
Abstract: We introduce SPOT (Stopping Points in Online Threads), the first annotated corpus translating the sociological concept of stopping point into a reproducible NLP task. Stopping points are ordinary critical interventions that pause or redirect online discussions through a range of forms (irony, subtle doubt or fragmentary arguments) that frameworks like counterspeech or social correction often overlook. We operationalize this concept as a binary classification task and provide reliable annotation guidelines. The corpus contains 43,305 manually annotated French Facebook comments linked to URLs flagged as false information by social media users, enriched with contextual metadata (article, post, parent comment, page or group, and source). We benchmark fine-tuned encoder models (CamemBERT) and instruction-tuned LLMs under various prompting strategies. Results show that fine-tuned encoders outperform prompted LLMs in F1 score by more than 10 percentage points, confirming the importance of supervised learning for emerging non-English social media tasks. Incorporating contextual metadata further improves encoder models F1 scores from 0.75 to 0.78. We release the anonymized dataset, along with the annotation guidelines and code in our code repository, to foster transparency and reproducible research.
[111] Chopping Trees: Semantic Similarity Based Dynamic Pruning for Tree-of-Thought Reasoning
Joongho Kim, Xirui Huang, Zarreen Reza, Gabriel Grand
Main category: cs.CL
TL;DR: SSDP introduces semantic similarity-based dynamic pruning to reduce computational redundancy in Tree-of-Thought reasoning by clustering and pruning semantically equivalent reasoning paths in real time, achieving significant speedups while maintaining competitive accuracy.
Details
Motivation: Tree-of-Thought reasoning improves LLM problem-solving but suffers from computational inefficiency due to semantic redundancy, where different branches explore equivalent reasoning paths, leading to unnecessary computation.Method: SSDP (Semantic Similarity-Based Dynamic Pruning) integrates online semantic merging into parallelized tree search, clustering semantically similar reasoning steps and pruning redundant branches in real time during the search process.
Result: SSDP achieves up to 2.3x speedup over state-of-the-art tree-search baselines on benchmarks like GSM8K and MATH500, reduces explored nodes by 85-90%, while maintaining competitive accuracy (typically within 5% of strongest baseline).
Conclusion: SSDP provides a practical, efficient approach to scalable LLM reasoning by addressing semantic redundancy in Tree-of-Thought methods, with publicly available implementation for broader adoption.
Abstract: Tree-of-Thought (ToT) reasoning boosts the problem-solving abilities of Large Language Models (LLMs) but is computationally expensive due to semantic redundancy, where distinct branches explore equivalent reasoning paths. We introduce Semantic Similarity-Based Dynamic Pruning (SSDP), a lightweight method that, to the best of our knowledge, is the first framework to integrate online semantic merging into parallelized tree search, enabling the clustering and pruning of redundant steps in real time. Across reasoning benchmarks, including GSM8K and MATH500, SSDP achieves up to a 2.3x speedup over state-of-the-art tree-search baselines while maintaining competitive accuracy (typically within 5% of the strongest baseline) and reducing the number of explored nodes by 85-90%, demonstrating a practical approach to efficient, scalable LLM reasoning. The implementation of SSDP is publicly available at https://github.com/kimjoonghokim/SSDP.
[112] Latent Collaboration in Multi-Agent Systems
Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang
Main category: cs.CL
TL;DR: LatentMAS enables LLM agents to collaborate directly in latent space instead of text, achieving better performance with much higher efficiency.
Details
Motivation: Existing multi-agent LLM systems rely on text-based communication, which is inefficient and loses information. The authors want to enable more direct, lossless collaboration in the continuous latent space.Method: LatentMAS framework uses auto-regressive latent thoughts generation through last-layer hidden embeddings and a shared latent working memory for lossless information exchange between agents, all without training.
Result: Outperforms text-based MAS by up to 14.6% accuracy, reduces token usage by 70.8%-83.7%, provides 4x-4.3x faster inference, and demonstrates higher expressiveness with lower complexity.
Conclusion: Latent collaboration in continuous space enhances system-level reasoning quality while offering substantial efficiency gains without additional training, representing a significant advancement in multi-agent LLM systems.
Abstract: Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent’s internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.
[113] JELV: A Judge of Edit-Level Validity for Evaluation and Automated Reference Expansion in Grammatical Error Correction
Yuhao Zhan, Yuqing Zhang, Jing Yuan, Qixiang Ma, Zhiqi Yang, Yu Gu, Zemin Liu, Fei Wu
Main category: cs.CL
TL;DR: JELV is an automated framework that validates GEC edits across grammaticality, faithfulness, and fluency, improving evaluation accuracy and enabling dataset expansion through LLM-generated corrections.
Details
Motivation: Existing GEC systems suffer from limited reference diversity, leading to underestimated evaluation and restricted model generalization. Current evaluation methods misjudge valid corrections as false positives.Method: JELV framework with two implementations: 1) multi-turn LLM-as-Judges pipeline, and 2) distilled DeBERTa classifier. Uses human-annotated PEVData benchmark. Applies to reclassify misjudged false positives and filter LLM-generated corrections.
Result: LLM pipeline achieves 90% agreement with human annotators; DeBERTa classifier achieves 85% precision on valid edits. State-of-the-art correlation with human judgments. Expanded BEA19 dataset yields measurable performance gains when retraining top GEC systems.
Conclusion: JELV provides scalable solution for enhancing reference diversity and strengthening both evaluation and model generalization in GEC systems.
Abstract: Existing Grammatical Error Correction (GEC) systems suffer from limited reference diversity, leading to underestimated evaluation and restricted model generalization. To address this issue, we introduce the Judge of Edit-Level Validity (JELV), an automated framework to validate correction edits from grammaticality, faithfulness, and fluency. Using our proposed human-annotated Pair-wise Edit-level Validity Dataset (PEVData) as benchmark, JELV offers two implementations: a multi-turn LLM-as-Judges pipeline achieving 90% agreement with human annotators, and a distilled DeBERTa classifier with 85% precision on valid edits. We then apply JELV to reclassify misjudged false positives in evaluation and derive a comprehensive evaluation metric by integrating false positive decoupling and fluency scoring, resulting in state-of-the-art correlation with human judgments. We also apply JELV to filter LLM-generated correction candidates, expanding the BEA19’s single-reference dataset containing 38,692 source sentences. Retraining top GEC systems on this expanded dataset yields measurable performance gains. JELV provides a scalable solution for enhancing reference diversity and strengthening both evaluation and model generalization.
[114] Simplex-Optimized Hybrid Ensemble for Large Language Model Text Detection Under Generative Distribution Drif
Sepyan Purnama Kristanto, Lutfi Hakim, Dianni Yusuf
Main category: cs.CL
TL;DR: A hybrid ensemble detector combining supervised classifier, curvature-based score, and stylometric features achieves 94.2% accuracy in distinguishing human vs. machine text across diverse LLMs, with improved stability against new models.
Details
Motivation: Existing LLM text detectors degrade when newer models or decoding strategies are introduced, lacking stability against changing generator distributions. This creates practical challenges in real applications where distinguishing human from machine text is important.Method: A hybrid ensemble with three complementary components: 1) RoBERTa-based classifier fine-tuned for supervised detection, 2) curvature-inspired score based on perturbing input and measuring likelihood changes, 3) compact stylometric model using hand-crafted linguistic features. Outputs are fused on probability simplex with validation-based weight selection.
Result: Achieves 94.2% accuracy and 0.978 AUC on 30,000 document corpus across multiple LLM families, including unseen models and paraphrased attack variants. Reduces false positives on scientific articles compared to baselines.
Conclusion: The hybrid ensemble approach provides stable and accurate detection across changing LLM distributions, with practical benefits for educational and research settings where false positives are costly.
Abstract: The widespread adoption of large language models (LLMs) has made it difficult to distinguish human writing from machine-produced text in many real applications. Detectors that were effective for one generation of models tend to degrade when newer models or modified decoding strategies are introduced. In this work, we study this lack of stability and propose a hybrid ensemble that is explicitly designed to cope with changing generator distributions. The ensemble combines three complementary components: a RoBERTa-based classifier fine-tuned for supervised detection, a curvature-inspired score based on perturbing the input and measuring changes in model likelihood, and a compact stylometric model built on hand-crafted linguistic features. The outputs of these components are fused on the probability simplex, and the weights are chosen via validation-based search. We frame this approach in terms of variance reduction and risk under mixtures of generators, and show that the simplex constraint provides a simple way to trade off the strengths and weaknesses of each branch. Experiments on a 30000 document corpus drawn from several LLM families including models unseen during training and paraphrased attack variants show that the proposed method achieves 94.2% accuracy and an AUC of 0.978. The ensemble also lowers false positives on scientific articles compared to strong baselines, which is critical in educational and research settings where wrongly flagging human work is costly
[115] CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency
Jiacheng Guo, Suozhi Huang, Zixin Yao, Yifan Zhang, Yifu Lu, Jiashuo Liu, Zihao Li, Nicholas Deng, Qixin Xiao, Jia Tian, Kanghong Zhan, Tianyi Li, Xiaochen Liu, Jason Ge, Chaoyang He, Kaixuan Huang, Lin Yang, Wenhao Huang, Mengdi Wang
Main category: cs.CL
TL;DR: CryptoBench is the first expert-curated dynamic benchmark for evaluating LLM agents in cryptocurrency analysis, featuring 50 monthly questions across four task categories, revealing a retrieval-prediction imbalance where models excel at data gathering but struggle with predictive analysis.
Details
Motivation: Current benchmarks don't capture the unique challenges of cryptocurrency analysis: extreme time-sensitivity, adversarial information environments, and the need to synthesize data from diverse specialized sources like on-chain intelligence platforms and real-time DeFi dashboards.Method: Created a live, dynamic benchmark with 50 questions per month designed by crypto-native professionals, categorized into four quadrants: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction to assess both data-gathering and analytical capabilities.
Result: Evaluation of ten LLMs revealed a performance hierarchy and uncovered a “retrieval-prediction imbalance” - models are proficient at data retrieval but show pronounced weakness in predictive analysis tasks, highlighting that agents appear factually grounded but lack deeper analytical synthesis capabilities.
Conclusion: CryptoBench provides a more challenging and valuable scenario for LLM agent assessment in the cryptocurrency domain, exposing critical weaknesses in current models’ analytical capabilities despite their retrieval strengths, suggesting need for improved predictive reasoning in agent systems.
Abstract: This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: \emph{extreme time-sensitivity}, \emph{a highly adversarial information environment}, and the critical need to synthesize data from \emph{diverse, specialized sources}, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent’s foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a \textit{retrieval-prediction imbalance}, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.
[116] AutoNeural: Co-Designing Vision-Language Models for NPU Inference
Wei Chen, Liangmin Wu, Yunhai Hu, Zhiyuan Li, Zhiyuan Cheng, Yicheng Qian, Lingyue Zhu, Zhipeng Hu, Luoyi Liang, Qiang Tang, Zhen Liu, Han Yang
Main category: cs.CL
TL;DR: AutoNeural: NPU-native Vision-Language Model using MobileNetV5-style vision encoder and hybrid SSM-Transformer language backbone for efficient integer-only inference on edge NPUs.
Details
Motivation: Current VLMs designed for GPUs perform poorly on NPUs due to ViT quantization brittleness and autoregressive attention's I/O-bound nature, which fails to utilize NPU's high arithmetic throughput.Method: Replace ViT with MobileNetV5-style backbone using depthwise separable convolutions for stable quantization; combine SSM principles with Transformer layers using gated convolutions for linear-time complexity, eliminating KV caching overhead.
Result: 7x reduction in vision encoder quantization error, 14x end-to-end latency reduction, 3x decoding speed, 4x longer context window vs baselines; validated on Qualcomm SA8295P SoC with real-time automotive cockpit performance.
Conclusion: Redesigning model topology specifically for NPU constraints is essential for robust multi-modal edge intelligence, enabling efficient integer-only inference on edge AI hardware.
Abstract: While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision–Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing quantization error of vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. The AutoNeural also delivers 3x decoding speed and 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.
[117] DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking in Long-Context Dialogue
Yijun Liao
Main category: cs.CL
TL;DR: DZ-TDPO is a non-destructive alignment framework that addresses State Inertia in long-context dialogue systems by combining conflict-aware dynamic KL constraints with calibrated temporal attention bias, achieving SOTA performance while preserving general capabilities.
Details
Motivation: Long-context dialogue systems suffer from State Inertia, where static constraints prevent models from resolving conflicts between evolving user intents and established historical context, limiting their ability to adapt to changing user needs.Method: Proposes DZ-TDPO, a non-destructive alignment framework that synergizes conflict-aware dynamic KL constraints with a calibrated temporal attention bias to regulate attention patterns without destructive weight updates.
Result: Achieves state-of-the-art win rates (55.4% on Phi-3.5) on Multi-Session Chat dataset while maintaining robust zero-shot generalization. Larger models like Qwen2.5-7B achieve 50.8% win rate with negligible perplexity overhead, confirming that TAI can be alleviated via precise attention regulation.
Conclusion: State Inertia in long-context dialogue can be effectively addressed through precise attention regulation rather than destructive weight updates, preserving general capabilities across model scales while achieving superior performance.
Abstract: Long-context dialogue systems suffer from State Inertia, where static constraints prevent models from resolving conflicts between evolving user intents and established historical context. To address this, we propose DZ-TDPO, a non-destructive alignment framework that synergizes conflict-aware dynamic KL constraints with a calibrated temporal attention bias. Experiments on the Multi-Session Chat (MSC) dataset demonstrate that DZ-TDPO achieves state-of-the-art win rates (55.4% on Phi-3.5) while maintaining robust zero-shot generalization. Our scaling analysis reveals a “Capacity-Stability Trade-off”: while smaller models incur an “alignment tax” (perplexity surge) to overcome historical inertia, the larger Qwen2.5-7B model achieves 50.8% win rate with negligible perplexity overhead. This confirms that TAI can be alleviated via precise attention regulation rather than destructive weight updates, preserving general capabilities (MMLU) across model scales. Code and data are available: https://github.com/lyj20071013/DZ-TDPO
[118] DaLA: Danish Linguistic Acceptability Evaluation Guided by Real World Errors
Gianluca Barmina, Nathalie Carmen Hau Norman, Peter Schneider-Kamp, Lukas Galke Poech
Main category: cs.CL
TL;DR: Enhanced benchmark for evaluating linguistic acceptability in Danish using 14 corruption functions to generate incorrect sentences, providing more comprehensive assessment than current state-of-the-art.
Details
Motivation: To create a better benchmark for evaluating linguistic acceptability in Danish that is broader and more comprehensive than existing benchmarks, addressing the need for more rigorous assessment of language models' understanding of Danish grammar and syntax.Method: 1) Analyze common errors in written Danish, 2) Develop 14 corruption functions that systematically introduce errors into correct Danish sentences, 3) Validate corruptions using manual and automatic methods, 4) Use results as benchmark for evaluating LLMs on linguistic acceptability judgement tasks.
Result: The benchmark is more comprehensive than current state-of-the-art, increases task difficulty (evidenced by lower LLM performance), and has higher discriminatory power to better distinguish between well-performing and low-performing models.
Conclusion: The enhanced Danish linguistic acceptability benchmark provides a more rigorous evaluation framework that better assesses LLMs’ understanding of Danish grammar through systematic error introduction and comprehensive corruption types.
Abstract: We present an enhanced benchmark for evaluating linguistic acceptability in Danish. We first analyze the most common errors found in written Danish. Based on this analysis, we introduce a set of fourteen corruption functions that generate incorrect sentences by systematically introducing errors into existing correct Danish sentences. To ensure the accuracy of these corruptions, we assess their validity using both manual and automatic methods. The results are then used as a benchmark for evaluating Large Language Models on a linguistic acceptability judgement task. Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art. By incorporating a greater variety of corruption types, our benchmark provides a more rigorous assessment of linguistic acceptability, increasing task difficulty, as evidenced by the lower performance of LLMs on our benchmark compared to existing ones. Our results also suggest that our benchmark has a higher discriminatory power which allows to better distinguish well-performing models from low-performing ones.
[119] LMSpell: Neural Spell Checking for Low-Resource Languages
Akesh Gunathilake, Nadil Karunarathne, Tharusha Bandaranayake, Nisansa de Silva, Surangika Ranathunga
Main category: cs.CL
TL;DR: First empirical study comparing pretrained language models for spell correction across languages, including low-resource ones, finding LLMs outperform other architectures with sufficient fine-tuning data.
Details
Motivation: Spell correction remains challenging for low-resource languages, and while pretrained language models have been used, there's been no proper comparison across different PLM architectures and limited application to LRLs.Method: Conducted empirical study comparing effectiveness of different PLM architectures (LLMs, encoder-based, encoder-decoder) for spell correction. Created LMSpell toolkit with evaluation function to compensate for LLM hallucination. Included case study with Sinhala language.
Result: LLMs outperform encoder-based and encoder-decoder models when fine-tuning dataset is large, even for languages the LLM wasn’t pre-trained on. Released LMSpell toolkit for easy spell correction across PLMs.
Conclusion: LLMs are effective for spell correction across languages including low-resource ones when sufficient fine-tuning data is available. The study provides practical toolkit and insights for improving spell correction in under-resourced languages.
Abstract: Spell correction is still a challenging problem for low-resource languages (LRLs). While pretrained language models (PLMs) have been employed for spell correction, their use is still limited to a handful of languages, and there has been no proper comparison across PLMs. We present the first empirical study on the effectiveness of PLMs for spell correction, which includes LRLs. We find that Large Language Models (LLMs) outperform their counterparts (encoder-based and encoder-decoder) when the fine-tuning dataset is large. This observation holds even in languages for which the LLM is not pre-trained. We release LMSpell, an easy- to use spell correction toolkit across PLMs. It includes an evaluation function that compensates for the hallucination of LLMs. Further, we present a case study with Sinhala to shed light on the plight of spell correction for LRLs.
cs.CV
[120] Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven’ Matrices
Hokin Deng
Main category: cs.CV
TL;DR: Video generation models like Sora-2 can now reason with 60% success on cognitive tasks like chess, Sudoku, and Raven’s Matrices, enabled by a new “Task Pair” evaluation paradigm and scalable code framework.
Details
Motivation: To establish whether video generation models possess reasoning capabilities and create a robust, scalable evaluation framework for measuring and improving reasoning in video models.Method: Developed a “Task Pair” experimental paradigm and built VMEvalKit code framework with 39 models, enabling automated evaluation of reasoning on cognitive tasks like chess, maze, Sudoku, mental rotation, and Raven’s Matrices.
Result: Leading models like Sora-2 achieve 60% success rates on reasoning tasks, with automated evaluation strongly correlating with human judgment, demonstrating the framework’s reliability and scalability.
Conclusion: Video generation models now demonstrate reasoning capabilities, and the established paradigm enables scalable evaluation and potential reinforcement learning approaches to further improve reasoning in video models.
Abstract: We show that video generation models could reason now. Testing on tasks such as chess, maze, Sudoku, mental rotation, and Raven’s Matrices, leading models such as Sora-2 achieve sixty percent success rates. We establish a robust experimental paradigm centered on the “Task Pair” design. We build a code framework, with 39 models available already, that supports this paradigm and allows for easy scaling - users can add models and tasks efficiently. We show our automated evaluation strongly correlates with human judgment, and therefore this paradigm is highly scalable. We see an opportunity, given the availability of our paradigm, to do reinforcement learning for improving reasoning in video models. You could checkout all of our raw $\href{https://grow-ai-like-a-child.com/video-reason/}{results}$ and our $\href{https://github.com/hokindeng/VMEvalKit}{VMEvalKit}$ codebase.
[121] A Sleep Monitoring System Based on Audio, Video and Depth Information
Lyn Chao-ling Chen, Kuan-Wen Chen, Yi-Ping Hung
Main category: cs.CV
TL;DR: A noninvasive sleep monitoring system using event-based method with depth sensor, RGB camera, and microphone array to detect motion, light, and noise disturbances in home settings.
Details
Motivation: To enable quantitative evaluation of sleep disturbances in home environments through noninvasive monitoring, as traditional methods may be intrusive or limited in capturing multiple disturbance types simultaneously.Method: Uses a device with infrared depth sensor, RGB camera, and four-microphone array. Establishes background models: one in depth signals for movement magnitude, another in color images for lighting effects. Event detection algorithm processes data from all three sensor types to classify disturbances into motion, light-on/off, and noise events.
Result: The system was tested in sleep conditions and experimental results validated the system’s reliability for detecting sleep disturbances.
Conclusion: The developed event-based monitoring system provides a reliable, noninvasive approach for quantitative assessment of sleep disturbances in home environments using multi-modal sensor data.
Abstract: For quantitative evaluation of sleep disturbances, a noninvasive monitoring system is developed by introducing an event-based method. We observe sleeping in home context and classify the sleep disturbances into three types of events: motion events, light-on/off events and noise events. A device with an infrared depth sensor, a RGB camera, and a four-microphone array is used in sleep monitoring in an environment with barely light sources. One background model is established in depth signals for measuring magnitude of movements. Because depth signals cannot observe lighting changes, another background model is established in color images for measuring magnitude of lighting effects. An event detection algorithm is used to detect occurrences of events from the processed data of the three types of sensors. The system was tested in sleep condition and the experiment result validates the system reliability.
[122] Adaptive Dataset Quantization: A New Direction for Dataset Pruning
Chenyue Yu, Jianyu Yu
Main category: cs.CV
TL;DR: A novel dataset quantization method reduces intra-sample redundancy to compress large datasets for edge devices, maintaining training performance while achieving significant compression.
Details
Motivation: Address storage and communication costs for large-scale datasets in resource-constrained edge devices by reducing dataset size while preserving essential features for model training.Method: Uses linear symmetric quantization for initial range/scale, then adaptive quantization allocation algorithm to distribute different quantization ratios per sample based on precision needs while maintaining constant total compression ratio.
Result: Method maintains model training performance while achieving significant dataset compression, outperforming traditional quantization and dataset pruning baselines on CIFAR-10, CIFAR-100, and ImageNet-1K.
Conclusion: Proposed dataset quantization approach effectively reduces intra-sample redundancy for edge device storage, offering better compression-performance tradeoff than existing methods.
Abstract: This paper addresses the challenges of storage and communication costs for large-scale datasets in resource-constrained edge devices by proposing a novel dataset quantization approach to reduce intra-sample redundancy. Unlike traditional dataset pruning and distillation methods that focus on inter-sample redundancy, the proposed method compresses each image by reducing redundant or less informative content within samples while preserving essential features. It first applies linear symmetric quantization to obtain an initial quantization range and scale for each sample. Then, an adaptive quantization allocation algorithm is introduced to distribute different quantization ratios for samples with varying precision requirements, maintaining a constant total compression ratio. The main contributions include: (1) being the first to use limited bits to represent datasets for storage reduction; (2) introducing a dataset-level quantization algorithm with adaptive ratio allocation; and (3) validating the method’s effectiveness through extensive experiments on CIFAR-10, CIFAR-100, and ImageNet-1K. Results show that the method maintains model training performance while achieving significant dataset compression, outperforming traditional quantization and dataset pruning baselines under the same compression ratios.
[123] RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models
Xiang Lin, Weixin Li, Shu Guo, Lihong Wang, Di Huang
Main category: cs.CV
TL;DR: RMAdapter is a novel dual-branch adapter for Vision-Language Models that balances task-specific adaptation with general knowledge preservation through reconstruction, achieving SOTA performance in few-shot multimodal transfer learning.
Details
Motivation: Current adapter-based approaches for Vision-Language Models are underexplored and show performance gaps compared to prompt-based methods. There's a need to balance task-specific adaptation with generalization in few-shot scenarios, as conventional single-branch adapters struggle with this trade-off.Method: RMAdapter uses a dual-branch architecture: (1) an adaptation branch for task-specific knowledge injection via parameter-efficient fine-tuning, and (2) a reconstruction branch that preserves general knowledge by reconstructing latent features back to original feature space. It includes local reconstruction loss computation, shared projection modules for efficiency, and consistency constraints to regulate discriminability-generalization trade-off.
Result: RMAdapter consistently outperforms state-of-the-art approaches across all evaluation metrics on three representative tasks: generalization to new categories, generalization to new target datasets, and domain generalization, without relying on data augmentation or duplicate prompt designs.
Conclusion: The dual-branch reconstruction-based adapter effectively balances task-specific adaptation with general knowledge preservation, providing a lightweight yet powerful solution for few-shot multimodal transfer learning that addresses the limitations of current adapter-based approaches.
Abstract: Pre-trained Vision-Language Models (VLMs), \textit{e.g.} CLIP, have become essential tools in multimodal transfer learning. However, fine-tuning VLMs in few-shot scenarios poses significant challenges in balancing task-specific adaptation and generalization in the obtained model. Meanwhile, current researches have predominantly focused on prompt-based adaptation methods, leaving adapter-based approaches underexplored and revealing notable performance gaps. To address these challenges, we introduce a novel Reconstruction-based Multimodal Adapter (RMAdapter), which leverages a dual-branch architecture. Unlike conventional single-branch adapters, RMAdapter consists of: (1) an adaptation branch that injects task-specific knowledge through parameter-efficient fine-tuning, and (2) a reconstruction branch that preserves general knowledge by reconstructing latent space features back into the original feature space. This design facilitates a dynamic balance between general and task-specific knowledge. Importantly, although RMAdapter introduces an additional reconstruction branch, it is carefully optimized to remain lightweight. By computing reconstruction loss locally at each layer and sharing projection modules, the overall computational overhead is kept minimal. A consistency constraint is also incorporated to better regulate the trade-off between discriminability and generalization. We comprehensively evaluate the effectiveness of RMAdapter on three representative tasks: generalization to new categories, generalization to new target datasets, and domain generalization. Without relying on data augmentation or duplicate prompt designs, our RMAdapter consistently outperforms state-of-the-art approaches across all evaluation metrics.
[124] VG3T: Visual Geometry Grounded Gaussian Transformer
Junho Kim, Seongwon Lee
Main category: cs.CV
TL;DR: VG3T is a multi-view feed-forward network that predicts 3D semantic occupancy via 3D Gaussian representation, achieving better performance with fewer primitives than previous methods.
Details
Motivation: Existing methods struggle with multi-view fusion, leading to fragmented 3D representations and sub-optimal performance. There's a need for a unified approach that can handle multi-view inputs effectively without fragmentation.Method: VG3T uses a novel multi-view feed-forward network that directly predicts semantically attributed 3D Gaussians in a joint fashion. It introduces Grid-Based Sampling and Positional Refinement components to mitigate distance-dependent density bias in Gaussian initialization.
Result: VG3T achieves a 1.7%p improvement in mIoU while using 46% fewer primitives than previous state-of-the-art on the nuScenes benchmark, demonstrating superior efficiency and performance.
Conclusion: VG3T provides a unified paradigm for 3D scene representation that overcomes fragmentation issues in multi-view processing, offering both improved accuracy and computational efficiency through its joint multi-view Gaussian prediction approach.
Abstract: Generating a coherent 3D scene representation from multi-view images is a fundamental yet challenging task. Existing methods often struggle with multi-view fusion, leading to fragmented 3D representations and sub-optimal performance. To address this, we introduce VG3T, a novel multi-view feed-forward network that predicts a 3D semantic occupancy via a 3D Gaussian representation. Unlike prior methods that infer Gaussians from single-view images, our model directly predicts a set of semantically attributed Gaussians in a joint, multi-view fashion. This novel approach overcomes the fragmentation and inconsistency inherent in view-by-view processing, offering a unified paradigm to represent both geometry and semantics. We also introduce two key components, Grid-Based Sampling and Positional Refinement, to mitigate the distance-dependent density bias common in pixel-aligned Gaussian initialization methods. Our VG3T shows a notable 1.7%p improvement in mIoU while using 46% fewer primitives than the previous state-of-the-art on the nuScenes benchmark, highlighting its superior efficiency and performance.
[125] EmoDiffTalk:Emotion-aware Diffusion for Editable 3D Gaussian Talking Head
Chang Liu, Tianjiao Jing, Chengcheng Ma, Xuanqi Zhou, Zhengxuan Lian, Qin Jin, Hongliang Yuan, Shi-Sheng Huang
Main category: cs.CV
TL;DR: EmoDiffTalk: A 3D Gaussian Splatting talking head framework with emotion-aware Gaussian diffusion for fine-grained facial animation and text-to-AU emotion control, enabling continuous multimodal emotional editing.
Details
Motivation: Current photo-realistic 3D talking heads using 3D Gaussian Splatting lack effective emotional expression manipulation, particularly for fine-grained and expansive dynamic emotional editing using multimodal controls.Method: Introduces Emotion-aware Gaussian Diffusion with two components: 1) Action Unit (AU) prompt Gaussian diffusion process for fine-grained facial animation, and 2) Accurate text-to-AU emotion controller for expansive dynamic emotional editing using text input.
Result: Superior performance on EmoTalk3D and RenderMe-360 datasets, demonstrating better emotional subtlety, lip-sync fidelity, and controllability compared to previous works.
Conclusion: Establishes a principled pathway toward high-quality, diffusion-driven, multimodal editable 3D talking-head synthesis, representing one of the first 3D Gaussian Splatting talking-head frameworks supporting continuous multimodal emotional editing in AU-based expression space.
Abstract: Recent photo-realistic 3D talking head via 3D Gaussian Splatting still has significant shortcoming in emotional expression manipulation, especially for fine-grained and expansive dynamics emotional editing using multi-modal control. This paper introduces a new editable 3D Gaussian talking head, i.e. EmoDiffTalk. Our key idea is a novel Emotion-aware Gaussian Diffusion, which includes an action unit (AU) prompt Gaussian diffusion process for fine-grained facial animator, and moreover an accurate text-to-AU emotion controller to provide accurate and expansive dynamic emotional editing using text input. Experiments on public EmoTalk3D and RenderMe-360 datasets demonstrate superior emotional subtlety, lip-sync fidelity, and controllability of our EmoDiffTalk over previous works, establishing a principled pathway toward high-quality, diffusion-driven, multimodal editable 3D talking-head synthesis. To our best knowledge, our EmoDiffTalk is one of the first few 3D Gaussian Splatting talking-head generation framework, especially supporting continuous, multimodal emotional editing within the AU-based expression space.
[126] Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation
Jisoo Park, Seonghak Lee, Guisik Kim, Taewoo Kim, Junseok Kwon
Main category: cs.CV
TL;DR: UniVoiceLite is a lightweight, unsupervised audio-visual framework that unifies speech enhancement and speech separation in a single model using lip motion and facial identity cues, without requiring paired noisy-clean data.
Details
Motivation: Real-world audio often contains both background noise and overlapping speakers, but traditional approaches treat speech enhancement and speech separation as separate tasks. Existing unified approaches are complex, parameter-heavy, and rely on supervised training, limiting scalability and generalization.Method: UniVoiceLite uses lip motion and facial identity cues to guide speech extraction, employs Wasserstein distance regularization to stabilize the latent space, and operates in an unsupervised manner without requiring paired noisy-clean data.
Result: Experimental results show UniVoiceLite achieves strong performance in both noisy and multi-speaker scenarios, combining efficiency with robust generalization.
Conclusion: UniVoiceLite provides a lightweight, unsupervised solution that effectively unifies speech enhancement and speech separation, addressing real-world audio challenges while maintaining efficiency and generalization capabilities.
Abstract: Speech Enhancement (SE) and Speech Separation (SS) have traditionally been treated as distinct tasks in speech processing. However, real-world audio often involves both background noise and overlapping speakers, motivating the need for a unified solution. While recent approaches have attempted to integrate SE and SS within multi-stage architectures, these approaches typically involve complex, parameter-heavy models and rely on supervised training, limiting scalability and generalization. In this work, we propose UniVoiceLite, a lightweight and unsupervised audio-visual framework that unifies SE and SS within a single model. UniVoiceLite leverages lip motion and facial identity cues to guide speech extraction and employs Wasserstein distance regularization to stabilize the latent space without requiring paired noisy-clean data. Experimental results demonstrate that UniVoiceLite achieves strong performance in both noisy and multi-speaker scenarios, combining efficiency with robust generalization. The source code is available at https://github.com/jisoo-o/UniVoiceLite.
[127] Domain-Specific Foundation Model Improves AI-Based Analysis of Neuropathology
Ruchika Verma, Shrishtee Kandoi, Robina Afzal, Shengjia Chen, Jannes Jegminat, Michael W. Karlovich, Melissa Umphlett, Timothy E. Richardson, Kevin Clare, Quazi Hossain, Jorge Samanamud, Phyllis L. Faust, Elan D. Louis, Ann C. McKee, Thor D. Stein, Jonathan D. Cherry, Jesse Mez, Anya C. McGoldrick, Dalilah D. Quintana Mora, Melissa J. Nirenberg, Ruth H. Walker, Yolfrankcis Mendez, Susan Morgello, Dennis W. Dickson, Melissa E. Murray, Carlos Cordon-Cardo, Nadejda M. Tsankova, Jamie M. Walker, Diana K. Dangoor, Stephanie McQuillan, Emma L. Thorn, Claudia De Sanctis, Shuying Li, Thomas J. Fuchs, Kurt Farrell, John F. Crary, Gabriele Campanella
Main category: cs.CV
TL;DR: NeuroFM is a domain-specialized foundation model for neuropathology that outperforms general-purpose models on brain tissue analysis tasks.
Details
Motivation: Existing pathology foundation models are trained mainly on surgical pathology data, which doesn't capture the unique morphological features of neuropathology (neurons, glia, neurofibrillary tangles, amyloid plaques, etc.), limiting their effectiveness for neurodegenerative disease analysis.Method: Developed NeuroFM, a foundation model specifically trained on whole-slide images of brain tissue spanning diverse neurodegenerative pathologies, addressing the domain mismatch between general surgical pathology and neuropathology.
Result: NeuroFM demonstrates superior performance compared to general-purpose models across multiple neuropathology-specific downstream tasks including mixed dementia classification, hippocampal region segmentation, and neurodegenerative ataxia identification.
Conclusion: Domain-specialized foundation models trained on brain tissue better capture neuropathology-specific features than general models, enabling more accurate AI-based analysis for brain disease diagnosis and research, setting a precedent for domain-specific model development in digital pathology.
Abstract: Foundation models have transformed computational pathology by providing generalizable representations from large-scale histology datasets. However, existing models are predominantly trained on surgical pathology data, which is enriched for non-nervous tissue and overrepresents neoplastic, inflammatory, metabolic, and other non-neurological diseases. Neuropathology represents a markedly different domain of histopathology, characterized by unique cell types (neurons, glia, etc.), distinct cytoarchitecture, and disease-specific pathological features including neurofibrillary tangles, amyloid plaques, Lewy bodies, and pattern-specific neurodegeneration. This domain mismatch may limit the ability of general-purpose foundation models to capture the morphological patterns critical for interpreting neurodegenerative diseases such as Alzheimer’s disease, Parkinson’s disease, and cerebellar ataxias. To address this gap, we developed NeuroFM, a foundation model trained specifically on whole-slide images of brain tissue spanning diverse neurodegenerative pathologies. NeuroFM demonstrates superior performance compared to general-purpose models across multiple neuropathology-specific downstream tasks, including mixed dementia disease classification, hippocampal region segmentation, and neurodegenerative ataxia identification encompassing cerebellar essential tremor and spinocerebellar ataxia subtypes. This work establishes that domain-specialized foundation models trained on brain tissue can better capture neuropathology-specific features than models trained on general surgical pathology datasets. By tailoring foundation models to the unique morphological landscape of neurodegenerative diseases, NeuroFM enables more accurate and reliable AI-based analysis for brain disease diagnosis and research, setting a precedent for domain-specific model development in specialized areas of digital pathology.
[128] FishDetector-R1: Unified MLLM-Based Framework with Reinforcement Fine-Tuning for Weakly Supervised Fish Detection, Segmentation, and Counting
Yi Liu, Jingyu Song, Vedanth Kallakuri, Katherine A. Skinner
Main category: cs.CV
TL;DR: FishDetector-R1 is a unified MLLM-based framework for fish detection, segmentation, and counting using weak supervision, achieving significant improvements on underwater fish imagery with novel detection-to-count prompting and reinforcement learning.
Details
Motivation: Underwater fish imagery analysis is crucial for ecological monitoring but faces challenges due to visual degradation and expensive annotation requirements, creating a need for efficient weakly-supervised solutions.Method: A unified MLLM-based framework with two key components: 1) a novel detect-to-count prompt enforcing spatially consistent detections and counts, and 2) Reinforcement Learning from Verifiable Reward (RLVR) leveraging sparse point labels in a scalable paradigm.
Result: On DeepFish dataset: 20% AP improvement, 10% mIoU improvement, 30% MAE reduction, 35% GAME reduction. The framework shows strong cross-domain generalization to other underwater datasets.
Conclusion: FishDetector-R1 provides a reliable and scalable solution for accurate marine visual understanding via weak supervision, with validated effectiveness through ablation studies and demonstrated cross-domain robustness.
Abstract: Analyzing underwater fish imagery is critical for ecological monitoring but remains difficult due to visual degradation and costly annotations. We introduce FishDetector-R1, a unified MLLM-based framework for fish detection, segmentation, and counting under weak supervision. On the DeepFish dataset, our framework achieves substantial gains over baselines, improving AP by 20% and mIoU by 10%, while reducing MAE by 30% and GAME by 35%. These improvements stem from two key components: a novel detect-to-count prompt that enforces spatially consistent detections and counts, and Reinforcement Learning from Verifiable Reward (RLVR) with a complementary scalable paradigm leveraging sparse point labels. Ablation studies further validate the effectiveness of this reward design. Moreover, the improvement generalizes well to other underwater datasets, confirming strong cross-domain robustness. Overall, FishDetector-R1 provides a reliable and scalable solution for accurate marine visual understanding via weak supervision. The project page for FishDetector-R1 is https://umfieldrobotics.github.io/FishDetector-R1.
[129] PrunedCaps: A Case For Primary Capsules Discrimination
Ramin Sharifi, Pouya Shiri, Amirali Baniasadi
Main category: cs.CV
TL;DR: Capsule Networks can be pruned by removing up to 95% of Primary Capsules, achieving 9.9x speedup and 95.36% reduction in FLOPs without accuracy loss across multiple datasets.
Details
Motivation: Capsule Networks offer advantages over CNNs (better robustness to affine transformations, overlapping image detection) but are resource-inefficient due to high number of Primary Capsules, leading to slow training/testing and high computational requirements.Method: Investigates Primary Capsules pruning in CapsNets across MNIST, Fashion-MNIST, CIFAR-10, and SVHN datasets. Removes up to 95% of Capsules to create a pruned architecture.
Result: Pruned CapsNet performs up to 9.90 times faster than conventional architecture, saves over 95.36% of floating-point operations in dynamic routing stage, and maintains accuracy. Provides insights into why some datasets benefit more from pruning than others.
Conclusion: Primary Capsules pruning enables resource-efficient CapsNets without sacrificing accuracy, making them more practical for real-world applications while maintaining their advantages over CNNs.
Abstract: Capsule Networks (CapsNets) are a generation of image classifiers with proven advantages over Convolutional Neural Networks (CNNs). Better robustness to affine transformation and overlapping image detection are some of the benefits associated with CapsNets. However, CapsNets cannot be classified as resource-efficient deep learning architecture due to the high number of Primary Capsules (PCs). In addition, CapsNets’ training and testing are slow and resource hungry. This paper investigates the possibility of Primary Capsules pruning in CapsNets on MNIST handwritten digits, Fashion-MNIST, CIFAR-10, and SVHN datasets. We show that a pruned version of CapsNet performs up to 9.90 times faster than the conventional architecture by removing 95 percent of Capsules without a loss of accuracy. Also, our pruned architecture saves on more than 95.36 percent of floating-point operations in the dynamic routing stage of the architecture. Moreover, we provide insight into why some datasets benefit significantly from pruning while others fall behind.
[130] Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization
Xuefei, Wang, Kai A. Horstmann, Ethan Lin, Jonathan Chen, Alexander R. Farhang, Sophia Stiles, Atharva Sehgal, Jonathan Light, David Van Valen, Yisong Yue, Jennifer J. Sun
Main category: cs.CV
TL;DR: AI agents can automate adaptation of computer vision tools to scientific datasets, outperforming human experts with simpler architectures.
Details
Motivation: Adapting production-level computer vision tools to scientific datasets is a critical bottleneck. Current solutions are impractical: fine-tuning requires large annotated datasets that scientists lack, while manual code adaptation takes weeks to months of effort.Method: The authors introduce a systematic evaluation framework for agentic code optimization and use it to study three production-level biomedical imaging pipelines. They investigate optimal agent design for this targeted task.
Result: A simple agent framework consistently generates adaptation code that outperforms human-expert solutions. Analysis reveals that common, complex agent architectures are not universally beneficial.
Conclusion: The research provides a practical roadmap for agent design and demonstrates a clear pathway for real-world impact by deploying agent-generated functions into production pipelines. The framework is open-sourced.
Abstract: Adapting production-level computer vision tools to bespoke scientific datasets is a critical “last mile” bottleneck. Current solutions are impractical: fine-tuning requires large annotated datasets scientists often lack, while manual code adaptation costs scientists weeks to months of effort. We consider using AI agents to automate this manual coding, and focus on the open question of optimal agent design for this targeted task. We introduce a systematic evaluation framework for agentic code optimization and use it to study three production-level biomedical imaging pipelines. We demonstrate that a simple agent framework consistently generates adaptation code that outperforms human-expert solutions. Our analysis reveals that common, complex agent architectures are not universally beneficial, leading to a practical roadmap for agent design. We open source our framework and validate our approach by deploying agent-generated functions into a production pipeline, demonstrating a clear pathway for real-world impact.
[131] Fast and Flexible Robustness Certificates for Semantic Segmentation
Thomas Massena, Corentin Friedrich, Franck Mamalet, Mathieu Serrurier
Main category: cs.CV
TL;DR: This paper introduces a novel framework for certifiably robust semantic segmentation using Lipschitz-constrained networks, achieving real-time certification that’s 600× faster than randomized smoothing methods.
Details
Motivation: Deep neural networks are vulnerable to adversarial perturbations, but most robustification methods focus on classification tasks. There's a lack of efficient certification procedures for semantic segmentation, which is crucial for safety-critical applications like autonomous driving.Method: The authors introduce a new class of certifiably robust semantic segmentation networks with built-in Lipschitz constraints. They provide a novel framework that generalizes robustness certificates for semantic segmentation tasks, leveraging the computational efficiency of Lipschitz networks.
Result: The method achieves competitive pixel accuracy on challenging datasets like Cityscapes while enabling real-time certifiably robust semantic segmentation. The certification process is around 600 times faster than randomized smoothing methods on an NVIDIA A100 GPU, with comparable certificates.
Conclusion: This work successfully unlocks real-time compatible certifiably robust semantic segmentation for the first time, providing an efficient framework for computing worst-case performance under adversarial attacks while maintaining competitive accuracy.
Abstract: Deep Neural Networks are vulnerable to small perturbations that can drastically alter their predictions for perceptually unchanged inputs. The literature on adversarially robust Deep Learning attempts to either enhance the robustness of neural networks (e.g, via adversarial training) or to certify their decisions up to a given robustness level (e.g, by using randomized smoothing, formal methods or Lipschitz bounds). These studies mostly focus on classification tasks and few efficient certification procedures currently exist for semantic segmentation. In this work, we introduce a new class of certifiably robust Semantic Segmentation networks with built-in Lipschitz constraints that are efficiently trainable and achieve competitive pixel accuracy on challenging datasets such as Cityscapes. Additionally, we provide a novel framework that generalizes robustness certificates for semantic segmentation tasks, where we showcase the flexibility and computational efficiency of using Lipschitz networks. Our approach unlocks real-time compatible certifiably robust semantic segmentation for the first time. Moreover, it allows the computation of worst-case performance under $\ell_2$ attacks of radius $ε$ across a wide range of performance measures. Crucially, we benchmark the runtime of our certification process and find our approach to be around 600 times faster than randomized smoothing methods at inference with comparable certificates on an NVIDIA A100 GPU. Finally, we evaluate the tightness of our worstcase certificates against state-of-the-art adversarial attacks to further validate the performance of our method.
[132] Multi-Modal Zero-Shot Prediction of Color Trajectories in Food Drying
Shichen Li, Ahmadreza Eslaminia, Chenhui Shao
Main category: cs.CV
TL;DR: Novel multi-modal method predicts food drying color trajectories using high-dimensional temporal color data and process parameters, achieving 90%+ error reduction over baselines.
Details
Motivation: Current food drying color analysis uses low-dimensional features that can't capture complex dynamic color changes, and existing models lack generalization to unseen drying conditions.Method: Developed a multi-modal color-trajectory prediction method that integrates high-dimensional temporal color information with drying process parameters for accurate and data-efficient prediction.
Result: Achieved RMSEs of 2.12 for cookie drying and 1.29 for apple drying under unseen conditions, reducing errors by over 90% compared to baseline models.
Conclusion: The model demonstrates superior accuracy, robustness, and broad applicability for predicting color evolution in food drying processes.
Abstract: Food drying is widely used to reduce moisture content, ensure safety, and extend shelf life. Color evolution of food samples is an important indicator of product quality in food drying. Although existing studies have examined color changes under different drying conditions, current approaches primarily rely on low-dimensional color features and cannot fully capture the complex, dynamic color trajectories of food samples. Moreover, existing modeling approaches lack the ability to generalize to unseen process conditions. To address these limitations, we develop a novel multi-modal color-trajectory prediction method that integrates high-dimensional temporal color information with drying process parameters to enable accurate and data-efficient color trajectory prediction. Under unseen drying conditions, the model attains RMSEs of 2.12 for cookie drying and 1.29 for apple drying, reducing errors by over 90% compared with baseline models. These experimental results demonstrate the model’s superior accuracy, robustness, and broad applicability.
[133] High-Throughput Unsupervised Profiling of the Morphology of 316L Powder Particles for Use in Additive Manufacturing
Emmanuel Akeweje, Conall Kirk, Chi-Wai Chan, Denis Dowling, Mimi Zhang
Main category: cs.CV
TL;DR: Automated ML framework using high-throughput imaging and clustering to profile metallic powder morphology for SLM additive manufacturing, identifying Fourier-descriptor + k-means as most effective approach.
Details
Motivation: Conventional powder characterization methods for SLM are low-throughput and qualitative, failing to capture heterogeneity in industrial-scale batches needed for quality control.Method: Developed three ML clustering pipelines: autoencoder, shape-descriptor, and functional-data pipelines. Applied to ~126,000 powder images (0.5-102 μm diameter) with high-throughput imaging and shape extraction.
Result: Fourier-descriptor + k-means pipeline identified as most effective based on internal validity metrics (lowest Davies-Bouldin index, highest Calinski-Harabasz score) with sub-millisecond runtime per particle.
Conclusion: Unsupervised learning framework enables rapid automated powder morphology assessment, supports tracking shape evolution across reuse cycles, and offers path toward real-time feedstock monitoring in SLM workflows.
Abstract: Selective Laser Melting (SLM) is a powder-bed additive manufacturing technique whose part quality depends critically on feedstock morphology. However, conventional powder characterization methods are low-throughput and qualitative, failing to capture the heterogeneity of industrial-scale batches. We present an automated, machine learning framework that couples high-throughput imaging with shape extraction and clustering to profile metallic powder morphology at scale. We develop and evaluate three clustering pipelines: an autoencoder pipeline, a shape-descriptor pipeline, and a functional-data pipeline. Across a dataset of approximately 126,000 powder images (0.5-102 micrometer diameter), internal validity metrics identify the Fourier-descriptor + k-means pipeline as the most effective, achieving the lowest Davies-Bouldin index and highest Calinski-Harabasz score while maintaining sub-millisecond runtime per particle on a standard desktop workstation. Although the present work focuses on establishing the morphological-clustering framework, the resulting shape groups form a basis for future studies examining their relationship to flowability, packing density, and SLM part quality. Overall, this unsupervised learning framework enables rapid, automated assessment of powder morphology and supports tracking of shape evolution across reuse cycles, offering a path toward real-time feedstock monitoring in SLM workflows.
[134] VAT: Vision Action Transformer by Unlocking Full Representation of ViT
Wenhao Li, Chengwei Ma, Weixin Mao
Main category: cs.CV
TL;DR: Vision Action Transformer (VAT) uses all ViT layers instead of just final layer features for better robot learning, achieving SOTA 98.15% success on LIBERO benchmarks.
Details
Motivation: Current robot learning methods using Vision Transformers discard valuable information by only using the final layer's features, providing insufficient representations for robotic tasks.Method: VAT extends ViT architecture to process specialized action tokens with visual features across all transformer layers, enabling deep progressive fusion of perception and action generation.
Result: Achieves 98.15% average success rate across four LIBERO benchmarks, establishing new state-of-the-art and outperforming prior methods like OpenVLA-OFT.
Conclusion: VAT demonstrates the critical importance of leveraging the complete “representation trajectory” of vision models to advance robotic policy, providing a powerful model for imitation learning.
Abstract: In robot learning, Vision Transformers (ViTs) are standard for visual perception, yet most methods discard valuable information by using only the final layer’s features. We argue this provides an insufficient representation and propose the Vision Action Transformer (VAT), a novel architecture that is extended from ViT and unlocks the full feature hierarchy of ViT. VAT processes specialized action tokens with visual features across all transformer layers, enabling a deep and progressive fusion of perception and action generation. On a suite of simulated manipulation tasks, VAT achieves a 98.15% average success rate across four LIBERO benchmarks, establishing a new state-of-the-art by outperforming prior methods like OpenVLA-OFT. Our work presents not only a powerful model for imitation learning but also demonstrates the critical importance of leveraging the complete ‘‘representation trajectory’’ of vision models to advance robotic policy. The GitHub URL for the project code is https://github.com/sellerbubble/VAT.
[135] Dual-Stream Cross-Modal Representation Learning via Residual Semantic Decorrelation
Xuecheng Li, Weikuan Jia, Alisher Kurbonaliev, Qurbonaliev Alisher, Khudzhamkulov Rustam, Ismoilov Shuhratjon, Eshmatov Javhariddin, Yuanjie Zheng
Main category: cs.CV
TL;DR: DSRSD-Net disentangles modality-specific and shared information via residual decomposition and decorrelation constraints to address modality dominance and redundancy in multimodal learning.
Details
Motivation: Multimodal representations suffer from modality dominance, redundant information coupling, and spurious cross-modal correlations, leading to suboptimal generalization and limited interpretability. High-variance modalities overshadow weaker but semantically important signals, and naive fusion strategies entangle modality-shared and modality-specific factors uncontrollably.Method: Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net) with: (1) dual-stream representation learning separating intra-modal (private) and inter-modal (shared) latent factors via residual projection; (2) residual semantic alignment head mapping shared factors into common space using contrastive and regression objectives; (3) decorrelation and orthogonality loss regularizing covariance structure while enforcing orthogonality between shared and private streams.
Result: Experimental results on two large-scale educational benchmarks show DSRSD-Net consistently improves next-step prediction and final outcome prediction over strong single-modality, early-fusion, late-fusion, and co-attention baselines.
Conclusion: DSRSD-Net provides an effective framework for disentangling modality-specific and shared information, addressing key challenges in multimodal learning including modality dominance, redundancy, and spurious correlations, leading to better generalization and interpretability.
Abstract: Cross-modal learning has become a fundamental paradigm for integrating heterogeneous information sources such as images, text, and structured attributes. However, multimodal representations often suffer from modality dominance, redundant information coupling, and spurious cross-modal correlations, leading to suboptimal generalization and limited interpretability. In particular, high-variance modalities tend to overshadow weaker but semantically important signals, while naïve fusion strategies entangle modality-shared and modality-specific factors in an uncontrolled manner. This makes it difficult to understand which modality actually drives a prediction and to maintain robustness when some modalities are noisy or missing. To address these challenges, we propose a Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net), a simple yet effective framework that disentangles modality-specific and modality-shared information through residual decomposition and explicit semantic decorrelation constraints. DSRSD-Net introduces: (1) a dual-stream representation learning module that separates intra-modal (private) and inter-modal (shared) latent factors via residual projection; (2) a residual semantic alignment head that maps shared factors from different modalities into a common space using a combination of contrastive and regression-style objectives; and (3) a decorrelation and orthogonality loss that regularizes the covariance structure of the shared space while enforcing orthogonality between shared and private streams, thereby suppressing cross-modal redundancy and preventing feature collapse. Experimental results on two large-scale educational benchmarks demonstrate that DSRSD-Net consistently improves next-step prediction and final outcome prediction over strong single-modality, early-fusion, late-fusion, and co-attention baselines.
[136] Benchmarking CXR Foundation Models With Publicly Available MIMIC-CXR and NIH-CXR14 Datasets
Jiho Shin, Dominic Marshall, Matthieu Komorowski
Main category: cs.CV
TL;DR: Benchmark comparison of two chest X-ray foundation models (CXR-Foundation and MedImageInsight) on public datasets shows MedImageInsight performs slightly better, while CXR-Foundation has better cross-dataset stability.
Details
Motivation: Despite strong performance of foundation models in medical imaging, there's limited understanding of their comparative behavior across different datasets, necessitating standardized evaluation.Method: Used unified preprocessing pipeline and fixed downstream classifiers (LightGBM) on MIMIC-CR and NIH ChestX-ray14 datasets. Extracted embeddings from pre-trained encoders and evaluated with AUROC and F1-score metrics with confidence intervals.
Result: MedImageInsight achieved slightly higher performance across most tasks, while CXR-Foundation showed stronger cross-dataset stability. Unsupervised clustering revealed coherent disease-specific structure in MedImageInsight embeddings.
Conclusion: The study highlights the need for standardized evaluation of medical foundation models and establishes reproducible baselines for future multimodal and clinical integration research.
Abstract: Recent foundation models have demonstrated strong performance in medical image representation learning, yet their comparative behaviour across datasets remains underexplored. This work benchmarks two large-scale chest X-ray (CXR) embedding models (CXR-Foundation (ELIXR v2.0) and MedImagelnsight) on public MIMIC-CR and NIH ChestX-ray14 datasets. Each model was evaluated using a unified preprocessing pipeline and fixed downstream classifiers to ensure reproducible comparison. We extracted embeddings directly from pre-trained encoders, trained lightweight LightGBM classifiers on multiple disease labels, and reported mean AUROC, and F1-score with 95% confidence intervals. MedImageInsight achieved slightly higher performance across most tasks, while CXR-Foundation exhibited strong cross-dataset stability. Unsupervised clustering of MedImageIn-sight embeddings further revealed a coherent disease-specific structure consistent with quantitative results. The results highlight the need for standardised evaluation of medical foundation models and establish reproducible baselines for future multimodal and clinical integration studies.
[137] DeepAgent: A Dual Stream Multi Agent Fusion for Robust Multimodal Deepfake Detection
Sayeem Been Zaman, Wasimul Karim, Arefin Ittesafun Abian, Reem E. Mohamed, Md Rafiqul Islam, Asif Karim, Sami Azam
Main category: cs.CV
TL;DR: DeepAgent: Multi-agent framework combining visual CNN and audio-visual inconsistency detection with Random Forest fusion for robust deepfake detection across multiple datasets.
Details
Motivation: Existing deepfake detection methods that integrate audio and visual cues in single models are vulnerable to modality mismatches, noise, and manipulation. There's a need for more robust approaches that can handle diverse types of manipulations.Method: DeepAgent uses two complementary agents: Agent-1 (AlexNet-based CNN) detects visual manipulation artifacts, while Agent-2 detects audio-visual inconsistencies using acoustic features, Whisper transcriptions, and EasyOCR frame reading. A Random Forest meta-classifier fuses their decisions.
Result: Agent-1 achieved 94.35% accuracy on Celeb-DF+FakeAVCeleb. Agent-2 and meta-classifier achieved 93.69% and 81.56% on FakeAVCeleb. Cross-dataset validation on DeepFakeTIMIT showed 97.49% accuracy, demonstrating strong generalization.
Conclusion: Hierarchy-based fusion enhances robustness by mitigating individual modality weaknesses. Multi-agent collaboration effectively addresses diverse deepfake manipulations, showing strong cross-dataset performance.
Abstract: The increasing use of synthetic media, particularly deepfakes, is an emerging challenge for digital content verification. Although recent studies use both audio and visual information, most integrate these cues within a single model, which remains vulnerable to modality mismatches, noise, and manipulation. To address this gap, we propose DeepAgent, an advanced multi-agent collaboration framework that simultaneously incorporates both visual and audio modalities for the effective detection of deepfakes. DeepAgent consists of two complementary agents. Agent-1 examines each video with a streamlined AlexNet-based CNN to identify the symbols of deepfake manipulation, while Agent-2 detects audio-visual inconsistencies by combining acoustic features, audio transcriptions from Whisper, and frame-reading sequences of images through EasyOCR. Their decisions are fused through a Random Forest meta-classifier that improves final performance by taking advantage of the different decision boundaries learned by each agent. This study evaluates the proposed framework using three benchmark datasets to demonstrate both component-level and fused performance. Agent-1 achieves a test accuracy of 94.35% on the combined Celeb-DF and FakeAVCeleb datasets. On the FakeAVCeleb dataset, Agent-2 and the final meta-classifier attain accuracies of 93.69% and 81.56%, respectively. In addition, cross-dataset validation on DeepFakeTIMIT confirms the robustness of the meta-classifier, which achieves a final accuracy of 97.49%, and indicates a strong capability across diverse datasets. These findings confirm that hierarchy-based fusion enhances robustness by mitigating the weaknesses of individual modalities and demonstrate the effectiveness of a multi-agent approach in addressing diverse types of manipulations in deepfakes.
[138] PrefGen: Multimodal Preference Learning for Preference-Conditioned Image Generation
Wenyi Mo, Tianyu Zhang, Yalong Bai, Ligong Han, Ying Ba, Dimitris N. Metaxas
Main category: cs.CV
TL;DR: A multimodal framework using MLLMs to extract user preferences and inject them into diffusion models for personalized image generation that aligns with both text prompts and individual aesthetic choices.
Details
Motivation: Existing approaches for preference-conditioned image generation either fail to capture nuanced user preferences or lack effective mechanisms to encode personalized visual signals, creating a need for better methods to adapt generative models to individual aesthetic choices beyond textual prompts.Method: Proposes a multimodal framework that: 1) Trains MLLMs with preference-oriented visual question answering to capture fine-grained semantic cues, 2) Introduces two probing tasks (inter-user discrimination and intra-user discrimination) to isolate preference-relevant features, 3) Uses maximum mean discrepancy-based alignment loss to bridge modality gaps while preserving multimodal structure, and 4) Conditions diffusion generators with the resulting embeddings.
Result: Extensive experiments demonstrate that the method substantially outperforms strong baselines in both image quality and preference alignment, highlighting the effectiveness of representation extraction and alignment for personalized generation.
Conclusion: The proposed multimodal framework successfully enables faithful adherence to both prompts and user preferences in image generation, effectively addressing the limitations of existing approaches through sophisticated representation extraction and alignment techniques.
Abstract: Preference-conditioned image generation seeks to adapt generative models to individual users, producing outputs that reflect personal aesthetic choices beyond the given textual prompt. Despite recent progress, existing approaches either fail to capture nuanced user preferences or lack effective mechanisms to encode personalized visual signals. In this work, we propose a multimodal framework that leverages multimodal large language models (MLLMs) to extract rich user representations and inject them into diffusion-based image generation. We train the MLLM with a preference-oriented visual question answering task to capture fine-grained semantic cues. To isolate preference-relevant features, we introduce two complementary probing tasks: inter-user discrimination to distinguish between different users, and intra-user discrimination to separate liked from disliked content. To ensure compatibility with diffusion text encoders, we design a maximum mean discrepancy-based alignment loss that bridges the modality gap while preserving multimodal structure. The resulting embeddings are used to condition the generator, enabling faithful adherence to both prompts and user preferences. Extensive experiments demonstrate that our method substantially outperforms strong baselines in both image quality and preference alignment, highlighting the effectiveness of representation extraction and alignment for personalized generation.
[139] Neural reconstruction of 3D ocean wave hydrodynamics from camera sensing
Jiabin Liu, Zihao Zhou, Jialei Yan, Anxin Guo, Alvise Benetazzo, Hui Li
Main category: cs.CV
TL;DR: A neural network for 3D wave surface and velocity field reconstruction using attention-augmented pyramid architecture with physics constraints, achieving millimeter-level accuracy and fast dense reconstruction under real-sea conditions.
Details
Motivation: To overcome high computational costs and visual occlusion challenges in long-term ocean wave observation, enabling precise 3D reconstruction of wave free surfaces and velocity fields for better understanding of ocean physics.Method: Proposed wave free surface visual reconstruction neural network with attention-augmented pyramid architecture designed for multi-scale and temporally continuous wave motions. Uses physics-based constraints for time-resolved reconstruction of nonlinear 3D velocity fields from evolving free-surface boundary.
Result: Millimeter-level wave elevation prediction in central region, dominant-frequency errors below 0.01 Hz, precise estimation of high-frequency spectral power laws, high-fidelity 3D reconstruction of nonlinear velocity fields, and dense reconstruction of two million points in only 1.35 seconds. Outperforms conventional approaches and maintains strong generalization in occluded conditions.
Conclusion: The proposed neural network effectively addresses computational and occlusion challenges in ocean wave observation, enabling accurate and efficient 3D reconstruction of wave surfaces and velocity fields with strong generalization capabilities.
Abstract: Precise three-dimensional (3D) reconstruction of wave free surfaces and associated velocity fields is essential for developing a comprehensive understanding of ocean physics. To address the high computational cost of dense visual reconstruction in long-term ocean wave observation tasks and the challenges introduced by persistent visual occlusions, we propose an wave free surface visual reconstruction neural network, which is designed as an attention-augmented pyramid architecture tailored to the multi-scale and temporally continuous characteristics of wave motions. Using physics-based constraints, we perform time-resolved reconstruction of nonlinear 3D velocity fields from the evolving free-surface boundary. Experiments under real-sea conditions demonstrate millimetre-level wave elevation prediction in the central region, dominant-frequency errors below 0.01 Hz, precise estimation of high-frequency spectral power laws, and high-fidelity 3D reconstruction of nonlinear velocity fields, while enabling dense reconstruction of two million points in only 1.35 s. Built on a stereo-vision dataset, the model outperforms conventional visual reconstruction approaches and maintains strong generalization in occluded conditions, owing to its global multi-scale attention and its learned encoding of wave propagation dynamics.
[140] The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation
Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee
Main category: cs.CV
TL;DR: SAM3 represents a fundamental paradigm shift from SAM2’s prompt-based segmentation to concept-driven multimodal segmentation, creating a discontinuity where expertise doesn’t transfer between the two models.
Details
Motivation: To analyze and explain why expertise in SAM2's prompt-based segmentation doesn't transfer to SAM3's new multimodal concept-driven paradigm, highlighting the fundamental discontinuity between the two models.Method: Structured analysis through five core components: conceptual break, architectural divergence, dataset differences, training distinctions, and evaluation metrics, comparing SAM2’s spatial prompt approach with SAM3’s vision-language architecture.
Result: SAM3 introduces a unified vision-language architecture with open-vocabulary reasoning, semantic grounding, contrastive alignment, and exemplar-based concept understanding, fundamentally different from SAM2’s purely geometric and temporal segmentation.
Conclusion: SAM3 represents a new class of segmentation foundation model that establishes concept-driven segmentation as a distinct era, requiring new expertise and approaches separate from previous prompt-based models.
Abstract: This paper investigates the fundamental discontinuity between the latest two Segment Anything Models: SAM2 and SAM3. We explain why the expertise in prompt-based segmentation of SAM2 does not transfer to the multimodal concept-driven paradigm of SAM3. SAM2 operates through spatial prompts points, boxes, and masks yielding purely geometric and temporal segmentation. In contrast, SAM3 introduces a unified vision-language architecture capable of open-vocabulary reasoning, semantic grounding, contrastive alignment, and exemplar-based concept understanding. We structure this analysis through five core components: (1) a Conceptual Break Between Prompt-Based and Concept-Based Segmentation, contrasting spatial prompt semantics of SAM2 with multimodal fusion and text-conditioned mask generation of SAM3; (2) Architectural Divergence, detailing pure vision-temporal design of SAM2 versus integration of vision-language encoders, geometry and exemplar encoders, fusion modules, DETR-style decoders, object queries, and ambiguity-handling via Mixture-of-Experts in SAM3; (3) Dataset and Annotation Differences, contrasting SA-V video masks with multimodal concept-annotated corpora of SAM3; (4) Training and Hyperparameter Distinctions, showing why SAM2 optimization knowledge does not apply to SAM3; and (5) Evaluation, Metrics, and Failure Modes, outlining the transition from geometric IoU metrics to semantic, open-vocabulary evaluation. Together, these analyses establish SAM3 as a new class of segmentation foundation model and chart future directions for the emerging concept-driven segmentation era.
[141] Representation Learning for Point Cloud Understanding
Siming Yan
Main category: cs.CV
TL;DR: This dissertation develops methods to improve 3D point cloud understanding by integrating 2D knowledge through supervised representation learning, self-supervised learning, and transfer learning from 2D to 3D domains.
Details
Motivation: 3D data provides rich geometric information but understanding it remains challenging. Combining 3D data with 2D images can give machines comprehensive environmental understanding for applications like autonomous driving, robotics, and medical treatment. The goal is to leverage 2D knowledge to enhance 3D understanding without simply converting 2D data.Method: Three main approaches: 1) Supervised representation learning for point cloud primitive segmentation, 2) Self-supervised learning methods for 3D data, and 3) Transfer learning from 2D to 3D by integrating pre-trained 2D models to support 3D network training. The key innovation is effectively integrating 2D knowledge into 3D learning frameworks.
Result: Extensive experiments validate the effectiveness of the proposed methods. The approach significantly improves 3D understanding compared to traditional methods, demonstrating that integrating 2D knowledge enhances point cloud representation learning without merely transforming 2D data.
Conclusion: The dissertation successfully advances point cloud representation learning by effectively integrating 2D knowledge through multiple learning paradigms. The methods show strong potential for applications requiring comprehensive 3D understanding and pave the way for more sophisticated 2D-3D knowledge transfer approaches.
Abstract: With the rapid advancement of technology, 3D data acquisition and utilization have become increasingly prevalent across various fields, including computer vision, robotics, and geospatial analysis. 3D data, captured through methods such as 3D scanners, LiDARs, and RGB-D cameras, provides rich geometric, shape, and scale information. When combined with 2D images, 3D data offers machines a comprehensive understanding of their environment, benefiting applications like autonomous driving, robotics, remote sensing, and medical treatment. This dissertation focuses on three main areas: supervised representation learning for point cloud primitive segmentation, self-supervised learning methods, and transfer learning from 2D to 3D. Our approach, which integrates pre-trained 2D models to support 3D network training, significantly improves 3D understanding without merely transforming 2D data. Extensive experiments validate the effectiveness of our methods, showcasing their potential to advance point cloud representation learning by effectively integrating 2D knowledge.
[142] EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing
Runjia Li, Moayed Haji-Ali, Ashkan Mirzaei, Chaoyang Wang, Arpit Sahni, Ivan Skorokhodov, Aliaksandr Siarohin, Tomas Jakab, Junlin Han, Sergey Tulyakov, Philip Torr, Willi Menapace
Main category: cs.CV
TL;DR: EgoEdit: A real-time instruction-guided video editing system for egocentric videos, addressing challenges like rapid egomotion and hand-object interactions through a new dataset, model, and benchmark.
Details
Motivation: Existing AI video editors work well on third-person footage but struggle with egocentric videos due to rapid egomotion and frequent hand-object interactions, creating a domain gap. Offline editing pipelines also have high latency, limiting real-time interaction for AR applications.Method: Three-part ecosystem: 1) EgoEditData - manually curated dataset for egocentric editing with hand-object interactions; 2) EgoEdit - instruction-following video editor supporting real-time streaming inference on single GPU; 3) EgoEditBench - evaluation suite for instruction faithfulness, hand preservation, and temporal stability.
Result: EgoEdit produces temporally stable, instruction-faithful results with interactive latency. It achieves clear gains on egocentric editing benchmarks where existing methods struggle, while maintaining comparable performance to strongest baselines on general editing tasks.
Conclusion: The complete ecosystem addresses the unique challenges of egocentric video editing, enabling real-time instruction-guided editing for interactive AR applications. The dataset and benchmark will be publicly released to advance research in this domain.
Abstract: We study instruction-guided editing of egocentric videos for interactive AR applications. While recent AI video editors perform well on third-person footage, egocentric views present unique challenges - including rapid egomotion and frequent hand-object interactions - that create a significant domain gap. Moreover, existing offline editing pipelines suffer from high latency, limiting real-time interaction. To address these issues, we present a complete ecosystem for egocentric video editing. First, we construct EgoEditData, a carefully designed and manually curated dataset specifically designed for egocentric editing scenarios, featuring rich hand-object interactions, while explicitly preserving hands. Second, we develop EgoEdit, an instruction-following egocentric video editor that supports real-time streaming inference on a single GPU. Finally, we introduce EgoEditBench, an evaluation suite targeting instruction faithfulness, hand and interaction preservation, and temporal stability under egomotion. Across both egocentric and general editing tasks, EgoEdit produces temporally stable, instruction-faithful results with interactive latency. It achieves clear gains on egocentric editing benchmarks-where existing methods struggle-while maintaining performance comparable to the strongest baselines on general editing tasks. EgoEditData and EgoEditBench will be made public for the research community. See our website at https://snap-research.github.io/EgoEdit
[143] Shoot-Bounce-3D: Single-Shot Occlusion-Aware 3D from Lidar by Decomposing Two-Bounce Light
Tzofi Klinghoffer, Siddharth Somasundaram, Xiaoyu Xiang, Yuchen Fan, Christian Richardt, Akshat Dave, Ramesh Raskar, Rakesh Ranjan
Main category: cs.CV
TL;DR: Single-photon lidar uses multi-bounce light to reconstruct occluded geometry and specular surfaces from single measurements, enabled by a learned prior on complex light transport from simulated data.
Details
Motivation: 3D reconstruction from single measurements is challenging with occlusions and specular materials like mirrors. Single-photon lidars can capture multi-bounce light containing hidden information, but existing methods only work with sequential single-point illumination, not the more practical multiplexed illumination scenario.Method: Data-driven approach using first large-scale simulated dataset (~100k lidar transients) to learn prior on complex light transport. This enables decomposition of measured two-bounce light into contributions from each laser spot in multiplexed illumination scenarios.
Result: Experimental demonstration of 3D geometry inference in scenes with occlusions and mirrors from single measurements. Code and dataset released publicly.
Conclusion: Learned prior on light transport enables practical single-photon lidar to reconstruct occluded geometry and specular surfaces from multiplexed illumination, advancing single-measurement 3D scene reconstruction.
Abstract: 3D scene reconstruction from a single measurement is challenging, especially in the presence of occluded regions and specular materials, such as mirrors. We address these challenges by leveraging single-photon lidars. These lidars estimate depth from light that is emitted into the scene and reflected directly back to the sensor. However, they can also measure light that bounces multiple times in the scene before reaching the sensor. This multi-bounce light contains additional information that can be used to recover dense depth, occluded geometry, and material properties. Prior work with single-photon lidar, however, has only demonstrated these use cases when a laser sequentially illuminates one scene point at a time. We instead focus on the more practical - and challenging - scenario of illuminating multiple scene points simultaneously. The complexity of light transport due to the combined effects of multiplexed illumination, two-bounce light, shadows, and specular reflections is challenging to invert analytically. Instead, we propose a data-driven method to invert light transport in single-photon lidar. To enable this approach, we create the first large-scale simulated dataset of ~100k lidar transients for indoor scenes. We use this dataset to learn a prior on complex light transport, enabling measured two-bounce light to be decomposed into the constituent contributions from each laser spot. Finally, we experimentally demonstrate how this decomposed light can be used to infer 3D geometry in scenes with occlusions and mirrors from a single measurement. Our code and dataset are released at https://shoot-bounce-3d.github.io.
[144] BeLLA: End-to-End Birds Eye View Large Language Assistant for Autonomous Driving
Karthik Mohan, Sonam Singh, Amit Arvind Kale
Main category: cs.CV
TL;DR: BeLLA connects 360° BEV representations with LLMs for autonomous driving QA, outperforming existing methods on spatial reasoning tasks by up to +9.3%.
Details
Motivation: Existing VLMs/MLLMs in autonomous driving either use single-view encoders that miss spatial structure or aggregated multi-view features lacking unified spatial representation, making spatial reasoning about directions, object relations, and context challenging.Method: BeLLA is an end-to-end architecture that connects unified 360° Bird’s Eye View (BEV) representations with a large language model for question answering in autonomous driving.
Result: BeLLA consistently outperforms existing approaches on spatial reasoning questions (relative object positioning, behavioral understanding) achieving up to +9.3% absolute improvement on NuScenes-QA and DriveLM benchmarks, while performing competitively on other question categories.
Conclusion: The unified BEV representation combined with LLMs enables superior spatial reasoning capabilities for autonomous driving question answering, handling diverse question types while excelling at spatial understanding tasks.
Abstract: The rapid development of Vision-Language models (VLMs) and Multimodal Language Models (MLLMs) in autonomous driving research has significantly reshaped the landscape by enabling richer scene understanding, context-aware reasoning, and more interpretable decision-making. However, a lot of existing work often relies on either single-view encoders that fail to exploit the spatial structure of multi-camera systems or operate on aggregated multi-view features, which lack a unified spatial representation, making it more challenging to reason about ego-centric directions, object relations, and the wider context. We thus present BeLLA, an end-to-end architecture that connects unified 360° BEV representations with a large language model for question answering in autonomous driving. We primarily evaluate our work using two benchmarks - NuScenes-QA and DriveLM, where BeLLA consistently outperforms existing approaches on questions that require greater spatial reasoning, such as those involving relative object positioning and behavioral understanding of nearby objects, achieving up to +9.3% absolute improvement in certain tasks. In other categories, BeLLA performs competitively, demonstrating the capability of handling a diverse range of questions.
[145] SpectraIrisPAD: Leveraging Vision Foundation Models for Spectrally Conditioned Multispectral Iris Presentation Attack Detection
Raghavendra Ramachandra, Sushma Venkatesh
Main category: cs.CV
TL;DR: SpectraIrisPAD is a deep learning framework using multispectral imaging and Vision Transformers for robust iris presentation attack detection, outperforming state-of-the-art methods on a new comprehensive dataset.
Details
Motivation: Iris recognition systems are vulnerable to presentation attacks (spoofing), and existing PAD methods need better generalization. Multispectral imaging across multiple NIR bands provides complementary reflectance information that can enhance PAD robustness against diverse attack types.Method: SpectraIrisPAD uses a DINOv2 Vision Transformer backbone with learnable spectral positional encoding, token fusion, and contrastive learning to extract discriminative, band-specific features. The framework is trained on a new MSIrPAD dataset containing 18,848 iris images across 8 PAI categories captured at five NIR wavelengths (800nm, 830nm, 850nm, 870nm, 980nm).
Result: SpectraIrisPAD consistently outperforms several state-of-the-art baselines across all performance metrics in unseen attack evaluation protocols, demonstrating superior robustness and generalizability in detecting a wide range of presentation attacks.
Conclusion: The proposed multispectral approach with advanced deep learning techniques provides an effective solution for robust iris PAD, addressing the generalization challenges of conventional methods and enhancing the security of iris-based biometric systems.
Abstract: Iris recognition is widely recognized as one of the most accurate biometric modalities. However, its growing deployment in real-world applications raises significant concerns regarding its vulnerability to Presentation Attacks (PAs). Effective Presentation Attack Detection (PAD) is therefore critical to ensure the integrity and security of iris-based biometric systems. While conventional iris recognition systems predominantly operate in the near-infrared (NIR) spectrum, multispectral imaging across multiple NIR bands provides complementary reflectance information that can enhance the generalizability of PAD methods. In this work, we propose \textbf{SpectraIrisPAD}, a novel deep learning-based framework for robust multispectral iris PAD. The SpectraIrisPAD leverages a DINOv2 Vision Transformer (ViT) backbone equipped with learnable spectral positional encoding, token fusion, and contrastive learning to extract discriminative, band-specific features that effectively distinguish bona fide samples from various spoofing artifacts. Furthermore, we introduce a new comprehensive dataset Multispectral Iris PAD (\textbf{MSIrPAD}) with diverse PAIs, captured using a custom-designed multispectral iris sensor operating at five distinct NIR wavelengths (800,nm, 830,nm, 850,nm, 870,nm, and 980,nm). The dataset includes 18,848 iris images encompassing eight diverse PAI categories, including five textured contact lenses, print attacks, and display-based attacks. We conduct comprehensive experiments under unseen attack evaluation protocols to assess the generalization capability of the proposed method. SpectraIrisPAD consistently outperforms several state-of-the-art baselines across all performance metrics, demonstrating superior robustness and generalizability in detecting a wide range of presentation attacks.
[146] Explainable Melanoma Diagnosis with Contrastive Learning and LLM-based Report Generation
Junwen Zheng, Xinran Xu, Li Rong Wang, Chang Cai, Lucinda Siyun Tan, Dingyuan Wang, Hong Liang Tey, Xiuyi Fan
Main category: cs.CV
TL;DR: CEFM is a cross-modal explainable framework for melanoma diagnosis that uses contrastive learning to align clinical ABC criteria with visual features, generating interpretable textual explanations while maintaining high classification performance.
Details
Motivation: Deep learning models for melanoma classification achieve expert-level performance but suffer from opacity and lack of interpretability, which hinders clinical adoption as clinicians struggle to trust black-box decision-making processes.Method: CEFM uses contrastive learning to map clinical ABC criteria (Asymmetry, Border, Color) into Vision Transformer embedding space via dual projection heads, aligning clinical semantics with visual features, then translates these aligned representations into structured textual explanations through natural language generation.
Result: Achieves 92.79% accuracy and 0.961 AUC on public datasets, with significant improvements across multiple interpretability metrics. Qualitative analysis shows learned embeddings align with clinicians’ ABC rule application, bridging performance and clinical trust.
Conclusion: CEFM effectively addresses the interpretability gap in melanoma diagnosis by creating transparent links between image data and clinical interpretation, potentially enabling greater clinical adoption of AI systems through improved trust and understanding.
Abstract: Deep learning has demonstrated expert-level performance in melanoma classification, positioning it as a powerful tool in clinical dermatology. However, model opacity and the lack of interpretability remain critical barriers to clinical adoption, as clinicians often struggle to trust the decision-making processes of black-box models. To address this gap, we present a Cross-modal Explainable Framework for Melanoma (CEFM) that leverages contrastive learning as the core mechanism for achieving interpretability. Specifically, CEFM maps clinical criteria for melanoma diagnosis-namely Asymmetry, Border, and Color (ABC)-into the Vision Transformer embedding space using dual projection heads, thereby aligning clinical semantics with visual features. The aligned representations are subsequently translated into structured textual explanations via natural language generation, creating a transparent link between raw image data and clinical interpretation. Experiments on public datasets demonstrate 92.79% accuracy and an AUC of 0.961, along with significant improvements across multiple interpretability metrics. Qualitative analyses further show that the spatial arrangement of the learned embeddings aligns with clinicians’ application of the ABC rule, effectively bridging the gap between high-performance classification and clinical trust.
[147] Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation
Su Sun, Cheng Zhao, Himangi Mittal, Gaurav Mittal, Rohith Kukkala, Yingjie Victor Chen, Mei Chen
Main category: cs.CV
TL;DR: Track4DGen is a two-stage framework that integrates multi-view video diffusion with point tracking and 4D Gaussian Splatting to generate dynamic 4D objects from sparse inputs, addressing appearance drift and temporal inconsistency.
Details
Motivation: Generating dynamic 4D objects from sparse inputs is challenging due to difficulties in preserving appearance and motion coherence across views and time while suppressing artifacts and temporal drift. The problem stems from existing supervision methods that rely on pixel- or latent-space video-diffusion losses without explicit temporally aware, feature-level tracking guidance.Method: Two-stage framework: 1) Multi-view video diffusion model coupled with foundation point tracker to enforce dense feature-level point correspondences, producing temporally consistent features. 2) Hybrid 4D Gaussian Splatting reconstructor that concatenates diffusion features (with tracking priors) with Hex-plane features and augments them with 4D Spherical Harmonics for higher-fidelity dynamics modeling.
Result: Track4DGen surpasses baselines on both multi-view video generation and 4D generation benchmarks, producing temporally stable, text-editable 4D assets. The authors also curate Sketchfab28, a high-quality dataset for benchmarking object-centric 4D generation.
Conclusion: The explicit injection of tracker-derived motion priors into intermediate feature representations for both multi-view video generation and 4D-GS effectively addresses appearance drift and enhances cross-view coherence, enabling high-quality dynamic 4D object generation from sparse inputs.
Abstract: Generating dynamic 4D objects from sparse inputs is difficult because it demands joint preservation of appearance and motion coherence across views and time while suppressing artifacts and temporal drift. We hypothesize that the view discrepancy arises from supervision limited to pixel- or latent-space video-diffusion losses, which lack explicitly temporally aware, feature-level tracking guidance. We present \emph{Track4DGen}, a two-stage framework that couples a multi-view video diffusion model with a foundation point tracker and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor. The central idea is to explicitly inject tracker-derived motion priors into intermediate feature representations for both multi-view video generation and 4D-GS. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator, producing temporally consistent features that curb appearance drift and enhance cross-view coherence. In Stage Two, we reconstruct a dynamic 4D-GS using a hybrid motion encoding that concatenates co-located diffusion features (carrying Stage-One tracking priors) with Hex-plane features, and augment them with 4D Spherical Harmonics for higher-fidelity dynamics modeling. \emph{Track4DGen} surpasses baselines on both multi-view video generation and 4D generation benchmarks, yielding temporally stable, text-editable 4D assets. Lastly, we curate \emph{Sketchfab28}, a high-quality dataset for benchmarking object-centric 4D generation and fostering future research.
[148] Automated Annotation of Shearographic Measurements Enabling Weakly Supervised Defect Detection
Jessica Plassmann, Nicolas Schuler, Michael Schuth, Georg von Freymann
Main category: cs.CV
TL;DR: Automated workflow using deep learning generates defect annotations from shearography measurements, enabling weakly supervised training and scalable dataset creation for subsurface defect detection.
Details
Motivation: Shearography is sensitive for detecting subsurface defects but lacks high-quality annotated datasets due to labor-intensive manual labeling that is subjective and difficult to standardize, limiting industrial adoption.Method: Introduces an automated workflow that uses deep learning to generate defect annotations (high-resolution segmentation and bounding-box labels) directly from shearography measurements.
Result: Evaluation against expert-labeled data demonstrates sufficient accuracy to enable weakly supervised training, reducing manual labeling effort.
Conclusion: The automated annotation workflow supports scalable dataset creation for robust defect detection in shearography, addressing a key limitation to industrial adoption.
Abstract: Shearography is an interferometric technique sensitive to surface displacement gradients, providing high sensitivity for detecting subsurface defects in safety-critical components. A key limitation to industrial adoption is the lack of high-quality annotated datasets, since manual labeling remains labor-intensive, subjective, and difficult to standardize. We introduce an automated workflow that generates defect annotations from shearography measurements using deep learning, producing high-resolution segmentation and bounding-box labels. Evaluation against expert-labeled data demonstrates sufficient accuracy to enable weakly supervised training, reducing manual effort and supporting scalable dataset creation for robust defect detection.
[149] Physics-Grounded Shadow Generation from Monocular 3D Geometry Priors and Approximate Light Direction
Shilin Hu, Jingyi Xu, Akshat Dave, Dimitris Samaras, Hieu Le
Main category: cs.CV
TL;DR: A novel shadow generation framework that integrates explicit physical modeling (geometry and illumination) with diffusion models to produce photorealistic shadows that are both visually realistic and physically coherent.
Details
Motivation: Current deep-learning-based shadow generation methods rarely use explicit physical modeling of shadow formation, which involves occluders blocking light rays from light sources. The paper aims to bridge this gap by embedding physics-based modeling into deep learning for more accurate and realistic shadow generation.Method: 1) From a monocular RGB image, obtain approximate 3D geometry (dense point maps) and predict dominant light direction. 2) Use physics of shadow formation to recover accurate shadow location and shape. 3) Integrate this physics-based initial estimate into a diffusion framework that refines shadows for realistic appearance while maintaining consistency with scene geometry and illumination.
Result: Trained on DESOBAV2 dataset, the model produces shadows that are both visually realistic and physically coherent, outperforming existing approaches, especially in scenes with complex geometry or ambiguous lighting conditions.
Conclusion: The proposed framework successfully integrates explicit physical modeling with deep learning for shadow generation, achieving superior performance in producing photorealistic shadows that maintain physical consistency with scene geometry and illumination.
Abstract: Shadow generation aims to produce photorealistic shadows that are visually consistent with object geometry and scene illumination. In the physics of shadow formation, the occluder blocks some light rays casting from the light source that would otherwise arrive at the surface, creating a shadow that follows the silhouette of the occluder. However, such explicit physical modeling has rarely been used in deep-learning-based shadow generation. In this paper, we propose a novel framework that embeds explicit physical modeling - geometry and illumination - into deep-learning-based shadow generation. First, given a monocular RGB image, we obtain approximate 3D geometry in the form of dense point maps and predict a single dominant light direction. These signals allow us to recover fairly accurate shadow location and shape based on the physics of shadow formation. We then integrate this physics-based initial estimate into a diffusion framework that refines the shadow into a realistic, high-fidelity appearance while ensuring consistency with scene geometry and illumination. Trained on DESOBAV2, our model produces shadows that are both visually realistic and physically coherent, outperforming existing approaches, especially in scenes with complex geometry or ambiguous lighting.
[150] Physics-Grounded Attached Shadow Detection Using Approximate 3D Geometry and Light Direction
Shilin Hu, Jingyi Xu, Sagnik Das, Dimitris Samaras, Hieu Le
Main category: cs.CV
TL;DR: A framework for joint detection of cast and attached shadows using iterative geometry-illumination reasoning, with a new dataset for training and evaluation.
Details
Motivation: Existing shadow detection methods focus only on cast shadows, ignoring attached shadows which are crucial for 3D structure understanding. There are no dedicated datasets or models for attached shadow detection.Method: A closed-loop system with: 1) shadow detection module predicting both shadow types separately, 2) light estimation module inferring light direction from shadows, 3) using light direction + surface normals to derive geometry-consistent partial map of self-occluded regions, and 4) feeding this map back to refine shadow predictions iteratively.
Result: Experimental results show substantial improvement in attached shadow detection (at least 33% BER reduction) while maintaining strong performance on full and cast shadows. A new dataset of 1,458 images with separate annotations for both shadow types was created.
Conclusion: The iterative geometry-illumination reasoning framework effectively addresses the gap in attached shadow detection, demonstrating that joint reasoning about shadows, illumination, and geometry significantly improves attached shadow detection performance.
Abstract: Attached shadows occur on the surface of the occluder where light cannot reach because of self-occlusion. They are crucial for defining the three-dimensional structure of objects and enhancing scene understanding. Yet existing shadow detection methods mainly target cast shadows, and there are no dedicated datasets or models for detecting attached shadows. To address this gap, we introduce a framework that jointly detects cast and attached shadows by reasoning about their mutual relationship with scene illumination and geometry. Our system consists of a shadow detection module that predicts both shadow types separately, and a light estimation module that infers the light direction from the detected shadows. The estimated light direction, combined with surface normals, allows us to derive a geometry-consistent partial map that identifies regions likely to be self-occluded. This partial map is then fed back to refine shadow predictions, forming a closed-loop reasoning process that iteratively improves both shadow segmentation and light estimation. In order to train our method, we have constructed a dataset of 1,458 images with separate annotations for cast and attached shadows, enabling training and quantitative evaluation of both. Experimental results demonstrate that this iterative geometry-illumination reasoning substantially improves the detection of attached shadows, with at least 33% BER reduction, while maintaining strong full and cast shadow performance.
[151] Tokenizing Motion: A Generative Approach for Scene Dynamics Compression
Shanzhi Yin, Zihan Zhang, Bolin Chen, Shiqi Wang, Yan Ye
Main category: cs.CV
TL;DR: A novel generative video compression framework using motion pattern priors from common scene dynamics (like swaying flowers) instead of content priors, achieving ultra-low bitrate communication with high-quality reconstruction.
Details
Motivation: Traditional video compression relies on content-specific priors (e.g., talking faces), which limits generalization. The paper aims to leverage universal motion patterns from common natural scenes to enable more efficient compression across diverse content.Method: Uses motion pattern priors from subtle scene dynamics. Encoder: compresses motion priors via dense-to-sparse transformation. Decoder: reconstructs dynamics using a flow-driven diffusion model guided by these priors.
Result: Superior rate-distortion performance compared to state-of-the-art conventional codec ECM, especially on scene dynamics sequences. Achieves ultra-low bitrate communication with high-quality reconstruction.
Conclusion: Motion pattern priors offer a promising alternative to content priors for generative video compression, enabling efficient ultra-low bitrate communication with broad applicability across diverse scenes.
Abstract: This paper proposes a novel generative video compression framework that leverages motion pattern priors, derived from subtle dynamics in common scenes (e.g., swaying flowers or a boat drifting on water), rather than relying on video content priors (e.g., talking faces or human bodies). These compact motion priors enable a new approach to ultra-low bitrate communication while achieving high-quality reconstruction across diverse scene contents. At the encoder side, motion priors can be streamlined into compact representations via a dense-to-sparse transformation. At the decoder side, these priors facilitate the reconstruction of scene dynamics using an advanced flow-driven diffusion model. Experimental results illustrate that the proposed method can achieve superior rate-distortion-performance and outperform the state-of-the-art conventional-video codec Enhanced Compression Model (ECM) on-scene dynamics sequences. The project page can be found at-https://github.com/xyzysz/GNVDC.
[152] SPOOF: Simple Pixel Operations for Out-of-Distribution Fooling
Ankit Gupta, Christoph Adami, Emily Dolson
Main category: cs.CV
TL;DR: Modern deep neural networks remain vulnerable to high-confidence fooling attacks, with transformers being most susceptible. A new minimalist attack called SPOOF generates fooling images efficiently, and retraining provides only partial defense.
Details
Motivation: Despite advances in deep neural networks, they continue to exhibit overconfidence on non-natural images (fooling images). The paper aims to revisit and update the classic fooling images problem to assess modern architectures and develop more efficient attacks.Method: Re-implemented evolutionary fooling attacks (CPPN-based and direct-encoding) on modern architectures including convolutional and transformer classifiers. Introduced SPOOF, a minimalist black-box attack that generates fooling images with minimal pixel modifications and reduced computational cost.
Result: High-confidence fooling persists in state-of-the-art networks, with transformer-based ViT-B/16 being most susceptible (achieving near-certain misclassifications with fewer queries). SPOOF generates unrecognizable fooling images efficiently. Retraining with fooling images as an additional class provides only partial resistance.
Conclusion: Modern deep classifiers remain fragile to fooling attacks, with transformers being particularly vulnerable. Even defensive retraining offers limited protection, highlighting persistent security vulnerabilities in current deep learning systems.
Abstract: Deep neural networks (DNNs) excel across image recognition tasks, yet continue to exhibit overconfidence on inputs that bear no resemblance to natural images. Revisiting the “fooling images” work introduced by Nguyen et al. (2015), we re-implement both CPPN-based and direct-encoding-based evolutionary fooling attacks on modern architectures, including convolutional and transformer classifiers. Our re-implementation confirm that high-confidence fooling persists even in state-of-the-art networks, with transformer-based ViT-B/16 emerging as the most susceptible–achieving near-certain misclassifications with substantially fewer queries than convolution-based models. We then introduce SPOOF, a minimalist, consistent, and more efficient black-box attack generating high-confidence fooling images. Despite its simplicity, SPOOF generates unrecognizable fooling images with minimal pixel modifications and drastically reduced compute. Furthermore, retraining with fooling images as an additional class provides only partial resistance, as SPOOF continues to fool consistently with slightly higher query budgets–highlighting persistent fragility of modern deep classifiers.
[153] Enhancing Monocular Height Estimation via Sparse LiDAR-Guided Correction
Jian Song, Hongruixuan Chen, Naoto Yokoya
Main category: cs.CV
TL;DR: Automated correction pipeline using sparse ICESat-2 LiDAR measurements to enhance monocular height/depth estimation accuracy, achieving 30.9% MAE reduction for MHE and 24.1% for MDE.
Details
Motivation: Monocular height estimation from VHR imagery faces challenges due to limited structural cues and expensive conventional elevation data. Existing models lack robustness under varied illumination and scene conditions.Method: Fully automated correction pipeline integrating sparse global LiDAR measurements from ICESat-2 with deep learning predictions. Uses publicly available models/data, requiring only single georeferenced optical image. Evaluated random forest approaches, parameter-efficient fine-tuning methods, and full fine-tuning.
Result: Best method reduces MHE model’s MAE by 30.9% and improves F1HE score by 44.2%. For MDE model, MAE improves by 24.1% and F1HE score by 25.1%. Tested across six diverse regions (297 km²) including urban cores and forested areas.
Conclusion: Correction pipeline effectively enhances monocular height/depth estimation accuracy. Sparse global LiDAR can systematically strengthen both MHE and MDE models, enabling scalable and widely accessible 3D height mapping.
Abstract: Monocular height estimation (MHE) from very-high-resolution (VHR) optical imagery remains challenging due to limited structural cues and the high cost and geographic constraints of conventional elevation data such as airborne LiDAR and multi-view stereo. Although recent MHE and monocular depth estimation (MDE) models show strong performance, their robustness under varied illumination and scene conditions is still limited. We introduce a fully automated correction pipeline that integrates sparse, imperfect global LiDAR measurements from ICESat-2 with deep learning predictions to enhance accuracy and stability. The workflow relies entirely on publicly available models and data and requires only a single georeferenced optical image to produce corrected height maps, enabling low-cost and globally scalable deployment. We also establish the first benchmark for this task, evaluating two random forest based approaches, four parameter efficient fine tuning methods, and full fine tuning. Experiments across six diverse regions at 0.5 m resolution (297 km2), covering the urban cores of Tokyo, Paris, and Sao Paulo as well as suburban and forested areas, show substantial gains. The best method reduces the MHE model’s mean absolute error (MAE) by 30.9 percent and improves its F1HE score by 44.2 percent. For the MDE model, MAE improves by 24.1 percent and the F1HE score by 25.1 percent. These results validate the effectiveness of our correction pipeline and demonstrate how sparse global LiDAR can systematically strengthen both MHE and MDE models, enabling scalable and widely accessible 3D height mapping.
[154] The MICCAI Federated Tumor Segmentation (FeTS) Challenge 2024: Efficient and Robust Aggregation Methods for Federated Learning
Akis Linardos, Sarthak Pati, Ujjwal Baid, Brandon Edwards, Patrick Foley, Kevin Ta, Verena Chung, Micah Sheller, Muhammad Irfan Khan, Mojtaba Jafaritadi, Elina Kontio, Suleiman Khan, Leon Mächler, Ivan Ezhov, Suprosanna Shit, Johannes C. Paetzold, Gustav Grimberg, Manuel A. Nickel, David Naccache, Vasilis Siomos, Jonathan Passerat-Palmbach, Giacomo Tarroni, Daewoon Kim, Leonard L. Klausmann, Prashant Shah, Bjoern Menze, Dimitrios Makris, Spyridon Bakas
Main category: cs.CV
TL;DR: The FeTS Challenge 2024 evaluated federated learning methods for glioma segmentation in MRI, with a PID-controller-based approach achieving top performance in both segmentation accuracy and communication efficiency.
Details
Motivation: To advance federated learning for medical imaging by evaluating new weight aggregation methods that improve robustness and efficiency in multi-institutional glioma segmentation tasks, addressing the need for privacy-preserving collaborative learning across healthcare institutions.Method: A standardized federated learning setup using multi-institutional BraTS glioma dataset (1,251 training, 219 validation, 570 test cases). Six teams were evaluated on segmentation performance (Dice Similarity Coefficient and 95th percentile Hausdorff Distance) and communication efficiency (convergence score). The winning method used PID-controller-based weight aggregation.
Result: The PID-controller-based method achieved top overall ranking with mean DSC values of 0.733 (ET), 0.761 (TC), and 0.751 (WT), and HD95 values of 33.922mm, 33.623mm, and 32.309mm respectively. It also demonstrated highest communication efficiency with convergence score of 0.764, surpassing previous challenge methods.
Conclusion: PID controllers are effective mechanisms for stabilizing and optimizing weight aggregation in federated learning for medical imaging, advancing the state-of-the-art in federated glioma segmentation while maintaining communication efficiency.
Abstract: We present the design and results of the MICCAI Federated Tumor Segmentation (FeTS) Challenge 2024, which focuses on federated learning (FL) for glioma sub-region segmentation in multi-parametric MRI and evaluates new weight aggregation methods aimed at improving robustness and efficiency. Six participating teams were evaluated using a standardized FL setup and a multi-institutional dataset derived from the BraTS glioma benchmark, consisting of 1,251 training cases, 219 validation cases, and 570 hidden test cases with segmentations for enhancing tumor (ET), tumor core (TC), and whole tumor (WT). Teams were ranked using a cumulative scoring system that considered both segmentation performance, measured by Dice Similarity Coefficient (DSC) and the 95th percentile Hausdorff Distance (HD95), and communication efficiency assessed through the convergence score. A PID-controller-based method achieved the top overall ranking, obtaining mean DSC values of 0.733, 0.761, and 0.751 for ET, TC, and WT, respectively, with corresponding HD95 values of 33.922 mm, 33.623 mm, and 32.309 mm, while also demonstrating the highest communication efficiency with a convergence score of 0.764. These findings advance the state of federated learning for medical imaging, surpassing top-performing methods from previous challenge iterations and highlighting PID controllers as effective mechanisms for stabilizing and optimizing weight aggregation in FL. The challenge code is available at https://github.com/FeTS-AI/Challenge.
[155] Revisiting SVD and Wavelet Difference Reduction for Lossy Image Compression: A Reproducibility Study
Alena Makarova
Main category: cs.CV
TL;DR: Reproducibility study finds SVD+WDR image compression doesn’t outperform JPEG2000 or WDR as originally claimed, highlighting implementation ambiguities.
Details
Motivation: To independently verify the original paper's claims that combining SVD and WDR yields better visual quality and higher compression ratios than JPEG2000 and standalone WDR.Method: Re-implemented the proposed SVD+WDR method, filled in missing implementation details, replicated original experiments, and conducted additional experiments on new images using PSNR and SSIM metrics.
Result: Contrary to original claims, SVD+WDR generally does not surpass JPEG2000 or WDR in PSNR, and only partially improves SSIM relative to JPEG2000. Implementation ambiguities significantly impact reproducibility and performance.
Conclusion: The reproducibility study reveals that the original SVD+WDR claims are not substantiated, highlighting the importance of clear implementation details for reproducible research in image compression.
Abstract: This work presents an independent reproducibility study of a lossy image compression technique that integrates singular value decomposition (SVD) and wavelet difference reduction (WDR). The original paper claims that combining SVD and WDR yields better visual quality and higher compression ratios than JPEG2000 and standalone WDR. I re-implemented the proposed method, carefully examined missing implementation details, and replicated the original experiments as closely as possible. I then conducted additional experiments on new images and evaluated performance using PSNR and SSIM. In contrast to the original claims, my results indicate that the SVD+WDR technique generally does not surpass JPEG2000 or WDR in terms of PSNR, and only partially improves SSIM relative to JPEG2000. The study highlights ambiguities in the original description (e.g., quantization and threshold initialization) and illustrates how such gaps can significantly impact reproducibility and reported performance.
[156] GPU-GLMB: Assessing the Scalability of GPU-Accelerated Multi-Hypothesis Tracking
Pranav Balakrishnan, Sidisha Barik, Sean M. O’Rourke, Benjamin M. Marlin
Main category: cs.CV
TL;DR: A GPU-accelerated GLMB filter variant that handles multiple detections per object, breaking inter-detection dependencies for improved parallel scalability.
Details
Motivation: Standard GLMB filters are computationally expensive even with pruning, especially when needing to handle multiple detections per object from ML-based virtual sensors in distributed networks.Method: Modified GLMB filter variant that allows multiple detections per object from the same sensor, which breaks inter-detection dependencies in filter updates, enabling better parallelization and GPU acceleration.
Result: The proposed variant shows significantly improved parallel scalability and enables efficient deployment on GPU hardware, with preliminary analysis demonstrating runtime scalability improvements.
Conclusion: The modified GLMB filter with multiple detections per object enables practical GPU acceleration for multi-target tracking in distributed sensor networks.
Abstract: Much recent research on multi-target tracking has focused on multi-hypothesis approaches leveraging random finite sets. Of particular interest are labeled random finite set methods that maintain temporally coherent labels for each object. While these methods enjoy important theoretical properties as closed-form solutions to the multi-target Bayes filter, the maintenance of multiple hypotheses under the standard measurement model is highly computationally expensive, even when hypothesis pruning approximations are applied. In this work, we focus on the Generalized Labeled Multi-Bernoulli (GLMB) filter as an example of this class of methods. We investigate a variant of the filter that allows multiple detections per object from the same sensor, a critical capability when deploying tracking in the context of distributed networks of machine learning-based virtual sensors. We show that this breaks the inter-detection dependencies in the filter updates of the standard GLMB filter, allowing updates with significantly improved parallel scalability and enabling efficient deployment on GPU hardware. We report the results of a preliminary analysis of a GPU-accelerated implementation of our proposed GLMB tracker, with a focus on run time scalability with respect to the number of objects and the maximum number of retained hypotheses.
[157] Opinion: Learning Intuitive Physics May Require More than Visual Data
Ellen Su, Solim Legris, Todd M. Gureckis, Mengye Ren
Main category: cs.CV
TL;DR: Training on developmentally realistic child video data (SAYCam) doesn’t improve intuitive physics performance, suggesting data distribution alone isn’t sufficient for learning physical principles.
Details
Motivation: Despite training on massive internet video data, current AI models still fail at human-level intuitive physics. The paper investigates whether data distribution (rather than volume) is key to learning physical principles, testing if developmentally realistic child experience data can enable better learning.Method: Pretrained a Video Joint Embedding Predictive Architecture (V-JEPA) model on SAYCam dataset - a developmentally realistic, egocentric video dataset capturing three children’s everyday visual experiences. This represents only 0.01% of data volume used by state-of-the-art models. Evaluated performance on the IntPhys2 intuitive physics benchmark.
Result: Training on the developmentally realistic dataset did NOT lead to significant performance improvements on the IntPhys2 benchmark. The model showed no substantial gains in intuitive physics understanding despite using data that mimics human developmental experience.
Conclusion: Merely training on developmentally realistic datasets is insufficient for current architectures to learn representations supporting intuitive physics. Varying visual data volume and distribution alone may not be enough to build systems with artificial intuitive physics - suggesting more fundamental architectural or learning mechanism changes may be needed.
Abstract: Humans expertly navigate the world by building rich internal models founded on an intuitive understanding of physics. Meanwhile, despite training on vast quantities of internet video data, state-of-the-art deep learning models still fall short of human-level performance on intuitive physics benchmarks. This work investigates whether data distribution, rather than volume, is the key to learning these principles. We pretrain a Video Joint Embedding Predictive Architecture (V-JEPA) model on SAYCam, a developmentally realistic, egocentric video dataset partially capturing three children’s everyday visual experiences. We find that training on this dataset, which represents 0.01% of the data volume used to train SOTA models, does not lead to significant performance improvements on the IntPhys2 benchmark. Our results suggest that merely training on a developmentally realistic dataset is insufficient for current architectures to learn representations that support intuitive physics. We conclude that varying visual data volume and distribution alone may not be sufficient for building systems with artificial intuitive physics.
[158] NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks
Fangzhou Lin, Yuping Wang, Yuliang Guo, Zixun Huang, Xinyu Huang, Haichong Zhang, Kazunori Yamada, Zhengzhong Tu, Liu Ren, Ziming Zhang
Main category: cs.CV
TL;DR: NexusFlow is a lightweight plug-and-play framework for partially supervised multi-task learning that uses invertible coupling layers to align latent feature distributions across structurally diverse tasks, enabling effective knowledge transfer even with incomplete annotations.
Details
Motivation: Existing PS-MTL approaches focus on homogeneous dense prediction tasks, leaving the realistic challenge of learning from structurally diverse tasks unexplored. There's a need for a framework that can handle both homogeneous and structurally different tasks with incomplete annotations.Method: NexusFlow introduces surrogate networks with invertible coupling layers to align latent feature distributions across tasks. The bijective coupling layers preserve information while mapping features into a shared canonical space, avoiding representational collapse and enabling alignment across structurally different tasks without reducing expressive capacity.
Result: On nuScenes autonomous driving dataset with domain-partitioned tasks (dense map reconstruction and sparse multi-object tracking), NexusFlow sets new state-of-the-art results. On NYUv2 with three homogeneous dense prediction tasks (segmentation, depth, surface normals), it yields consistent gains across all tasks.
Conclusion: NexusFlow is an effective, lightweight, and plug-and-play framework for PS-MTL that works well with both structurally diverse and homogeneous tasks, demonstrating broad applicability and superior performance compared to existing partially supervised baselines.
Abstract: Partially Supervised Multi-Task Learning (PS-MTL) aims to leverage knowledge across tasks when annotations are incomplete. Existing approaches, however, have largely focused on the simpler setting of homogeneous, dense prediction tasks, leaving the more realistic challenge of learning from structurally diverse tasks unexplored. To this end, we introduce NexusFlow, a novel, lightweight, and plug-and-play framework effective in both settings. NexusFlow introduces a set of surrogate networks with invertible coupling layers to align the latent feature distributions of tasks, creating a unified representation that enables effective knowledge transfer. The coupling layers are bijective, preserving information while mapping features into a shared canonical space. This invertibility avoids representational collapse and enables alignment across structurally different tasks without reducing expressive capacity. We first evaluate NexusFlow on the core challenge of domain-partitioned autonomous driving, where dense map reconstruction and sparse multi-object tracking are supervised in different geographic regions, creating both structural disparity and a strong domain gap. NexusFlow sets a new state-of-the-art result on nuScenes, outperforming strong partially supervised baselines. To demonstrate generality, we further test NexusFlow on NYUv2 using three homogeneous dense prediction tasks, segmentation, depth, and surface normals, as a representative N-task PS-MTL scenario. NexusFlow yields consistent gains across all tasks, confirming its broad applicability.
[159] Language-driven Fine-grained Retrieval
Shijie Wang, Xin Yu, Yadan Luo, Zijian Wang, Pengfei Zhang, Zi Huang
Main category: cs.CV
TL;DR: LaFG is a language-driven framework for fine-grained image retrieval that uses LLMs and VLMs to convert class names into attribute-level supervision, improving generalization to unseen categories.
Details
Motivation: Existing FGIR methods use semantically sparse one-hot labels that overlook rich semantics in category names, hindering modeling of cross-category details and limiting generalization to unseen categories.Method: LaFG converts class names into attribute-level supervision using LLMs to generate detailed descriptions, then uses frozen VLMs to project them into vision-aligned space, clustering into dataset-wide attribute vocabulary. A global prompt template selects category-relevant attributes aggregated into linguistic prototypes that supervise the retrieval model.
Result: The framework improves fine-grained retrieval by better modeling comparability among cross-category details and enhancing generalization to unseen categories through rich attribute-level supervision.
Conclusion: LaFG demonstrates that leveraging language models to extract rich semantic information from category names can significantly improve fine-grained image retrieval performance, especially for generalization to unseen categories.
Abstract: Existing fine-grained image retrieval (FGIR) methods learn discriminative embeddings by adopting semantically sparse one-hot labels derived from category names as supervision. While effective on seen classes, such supervision overlooks the rich semantics encoded in category names, hindering the modeling of comparability among cross-category details and, in turn, limiting generalization to unseen categories. To tackle this, we introduce LaFG, a Language-driven framework for Fine-Grained Retrieval that converts class names into attribute-level supervision using large language models (LLMs) and vision-language models (VLMs). Treating each name as a semantic anchor, LaFG prompts an LLM to generate detailed, attribute-oriented descriptions. To mitigate attribute omission in these descriptions, it leverages a frozen VLM to project them into a vision-aligned space, clustering them into a dataset-wide attribute vocabulary while harvesting complementary attributes from related categories. Leveraging this vocabulary, a global prompt template selects category-relevant attributes, which are aggregated into category-specific linguistic prototypes. These prototypes supervise the retrieval model to steer
[160] Knowing the Answer Isn’t Enough: Fixing Reasoning Path Failures in LVLMs
Chaoyang Wang, Yangfan He, Yiyang Zhou, Yixuan Wang, Jiaqi Liu, Peng Xia, Zhengzhong Tu, Mohit Bansal, Huaxiu Yao
Main category: cs.CV
TL;DR: LVLMs often reach correct answers via incorrect reasoning paths due to path selection bias. PSO framework improves reasoning stability and accuracy through two-stage optimization with negative replay memory.
Details
Motivation: Large Vision-Language Models frequently produce correct answers but through flawed reasoning paths, indicating a path selection bias problem rather than knowledge deficiency. The gap between Pass@K and Pass@1 metrics shows this is primarily a misreasoning issue.Method: Proposes PSO (Path-Select Optimization), a two-stage post-training framework: 1) Group Relative Policy Optimization with template and answer-based rewards for structured reasoning, 2) Online preference optimization where models self-evaluate reasoning paths and align with preferred trajectories, using Negative Replay Memory to store and revisit incorrect paths.
Result: PSO effectively prunes invalid reasoning paths, achieves 7.4% average improvement in reasoning accuracy, and produces more stable and consistent chains of thought across extensive experiments.
Conclusion: The path selection bias in LVLMs is a critical flaw that can be addressed through systematic optimization. PSO demonstrates that enhancing reasoning stability and consistency is achievable through targeted post-training methods that focus on path selection rather than just knowledge acquisition.
Abstract: We reveal a critical yet underexplored flaw in Large Vision-Language Models (LVLMs): even when these models know the correct answer, they frequently arrive there through incorrect reasoning paths. The core issue is not a lack of knowledge, but a path selection bias within the vast reasoning search space. Although LVLMs are often capable of sampling correct solution trajectories, they disproportionately favor unstable or logically inconsistent ones, leading to erratic and unreliable outcomes. The substantial disparity between Pass@K (with large K) and Pass@1 across numerous models provides compelling evidence that such failures primarily stem from misreasoning rather than ignorance. To systematically investigate and address this issue, we propose PSO (Path-Select Optimization), a two-stage post-training framework designed to enhance both the reasoning performance and stability of existing LVLMs. In the first stage, we employ Group Relative Policy Optimization (GRPO) with template and answer-based rewards to cultivate structured, step-by-step reasoning. In the second stage, we conduct online preference optimization, where the model samples reasoning paths from GRPO-generated data, self-evaluates them, and aligns itself toward the preferred trajectories. Incorrect or suboptimal paths are concurrently stored in a Negative Replay Memory (NRM) as hard negatives, which are periodically revisited to prevent the model from repeating prior mistakes and to facilitate continual reasoning refinement. Extensive experiments show that PSO effectively prunes invalid reasoning paths, substantially enhances reasoning accuracy (with 7.4% improvements on average), and yields more stable and consistent chains of thought. Our code will be available at https://github.com/aiming-lab/PSO.
[161] TriaGS: Differentiable Triangulation-Guided Geometric Consistency for 3D Gaussian Splatting
Quan Tran, Tuan Dang
Main category: cs.CV
TL;DR: 3D Gaussian Splatting improved with multi-view triangulation to reduce floaters and improve geometry consistency, achieving SOTA results on DTU dataset.
Details
Motivation: Current 3D Gaussian Splatting methods suffer from reconstruction inconsistencies, floater artifacts, and unstructured geometry due to being guided solely by photometric loss without geometric constraints.Method: Introduces global geometry consistency through constrained multi-view triangulation. Uses self-supervised approach to penalize deviation of rendered 3D points from robust consensus points triangulated from neighboring views.
Result: Achieves state-of-the-art results on multiple datasets. On DTU dataset, attains mean Chamfer Distance of 0.50 mm, outperforming comparable explicit methods.
Conclusion: The proposed method effectively addresses geometry inconsistencies in 3D Gaussian Splatting through multi-view triangulation constraints, enabling high-fidelity surface extraction and improved reconstruction quality.
Abstract: 3D Gaussian Splatting is crucial for real-time novel view synthesis due to its efficiency and ability to render photorealistic images. However, building a 3D Gaussian is guided solely by photometric loss, which can result in inconsistencies in reconstruction. This under-constrained process often results in “floater” artifacts and unstructured geometry, preventing the extraction of high-fidelity surfaces. To address this issue, our paper introduces a novel method that improves reconstruction by enforcing global geometry consistency through constrained multi-view triangulation. Our approach aims to achieve a consensus on 3D representation in the physical world by utilizing various estimated views. We optimize this process by penalizing the deviation of a rendered 3D point from a robust consensus point, which is re-triangulated from a bundle of neighboring views in a self-supervised fashion. We demonstrate the effectiveness of our method across multiple datasets, achieving state-of-the-art results. On the DTU dataset, our method attains a mean Chamfer Distance of 0.50 mm, outperforming comparable explicit methods. We will make our code open-source to facilitate community validation and ensure reproducibility.
[162] FacePhys: State of the Heart Learning
Kegang Wang, Jiankai Tang, Yuntao Wang, Xin Liu, Yuxuan Fan, Jiatong Ji, Yuanchun Shi, Daniel McDuff
Main category: cs.CV
TL;DR: FacePhys is a memory-efficient remote photoplethysmography algorithm that achieves state-of-the-art performance with minimal computational overhead, enabling real-time heart rate monitoring from video.
Details
Motivation: Current camera-based vital sign monitoring (rPPG) faces practical limitations: computational constraints on front-end devices and accuracy degradation when transmitting compressed data. There's a need to resolve the trilemma of model scalability, cross-dataset generalization, and real-time operation.Method: FacePhys uses temporal-spatial state space duality to capture subtle periodic variations across video frames. It leverages a transferable heart state representation while maintaining minimal computational overhead, enabling training on extended sequences and supporting low-latency inference.
Result: Achieves 49% error reduction over existing methods, with 3.6 MB memory footprint and 9.46 ms per-frame latency (83-99% improvement over existing methods). Enables reliable real-time performance in practical deployments.
Conclusion: FacePhys successfully resolves the scalability-generalization-real-time trilemma for rPPG, making camera-based vital sign monitoring practical for front-end devices with limited computational resources.
Abstract: Vital sign measurement using cameras presents opportunities for comfortable, ubiquitous health monitoring. Remote photoplethysmography (rPPG), a foundational technology, enables cardiac measurement through minute changes in light reflected from the skin. However, practical deployment is limited by the computational constraints of performing analysis on front-end devices and the accuracy degradation of transmitting data through compressive channels that reduce signal quality. We propose a memory efficient rPPG algorithm - \emph{FacePhys} - built on temporal-spatial state space duality, which resolves the trilemma of model scalability, cross-dataset generalization, and real-time operation. Leveraging a transferable heart state, FacePhys captures subtle periodic variations across video frames while maintaining a minimal computational overhead, enabling training on extended video sequences and supporting low-latency inference. FacePhys establishes a new state-of-the-art, with a substantial 49% reduction in error. Our solution enables real-time inference with a memory footprint of 3.6 MB and per-frame latency of 9.46 ms – surpassing existing methods by 83% to 99%. These results translate into reliable real-time performance in practical deployments, and a live demo is available at https://www.facephys.com/.
[163] RefBench-PRO: Perceptual and Reasoning Oriented Benchmark for Referring Expression Comprehension
Tianyi Gao, Hao Li, Han Fang, Xin Wei, Xiaodong Dong, Hongbo Sun, Ye Yuan, Zhongjiang He, Jinglin Xu, Jingmin Xin, Hao Sun
Main category: cs.CV
TL;DR: RefBench-PRO is a new REC benchmark that decomposes referring expressions into perception and reasoning dimensions with six sub-tasks, plus an automated data generation pipeline and RL-based Ref-R1 method for improved localization.
Details
Motivation: Existing REC benchmarks lack interpretable scoring mechanisms and cannot reveal MLLM's grounding capabilities across different cognitive abilities. Current benchmarks primarily evaluate perceptual capabilities without assessing reasoning skills.Method: 1) RefBench-PRO benchmark decomposes referring expressions into perception (attribute, position) and reasoning (interaction, commonsense, relation, reject) dimensions with six progressively challenging tasks. 2) Automated data-generation pipeline produces diverse referring expressions across these sub-dimensions. 3) Ref-R1: RL-based learning scheme with Dynamic IoU-based GRPO to improve localization accuracy under complex reasoning conditions.
Result: Extensive experiments show RefBench-PRO enables interpretable evaluation of MLLMs on referring expression comprehension, presenting greater challenges in both perception and reasoning dimensions compared to existing benchmarks.
Conclusion: RefBench-PRO addresses limitations of existing REC benchmarks by providing interpretable evaluation across cognitive dimensions, establishing stronger baselines, and revealing MLLM’s grounding capabilities in both perception and reasoning tasks.
Abstract: Referring Expression Comprehension (REC) is a vision-language task that localizes a specific image region based on a textual description. Existing REC benchmarks primarily evaluate perceptual capabilities and lack interpretable scoring mechanisms, which cannot reveal the grounding capability of Multi-modal Large Language Model (MLLM) across different cognitive abilities. To address this limitation, we introduce RefBench-PRO, a comprehensive REC benchmark, which decomposes referring expressions into two core dimensions, i.e., perception and reasoning, and further subdivides them into six progressively challenging tasks, such as attribute, position, interaction, commonsense, relation and reject. We also develop a fully automated data-generation pipeline that produces diverse referring expressions across these six sub-dimensions. Furthermore, We propose Ref-R1, an RL-based learning scheme, which incorporates Dynamic IoU-based GRPO to improve localization accuracy under increasingly complex reasoning conditions, establishing a stronger baseline for REC. Extensive experiments demonstrate that our RefBench-PRO enables interpretable evaluation of MLLM on referring expression comprehension, presenting greater challenges in both perception and reasoning.
[164] Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models
Hengzhuang Li, Xinsong Zhang, Qiming Peng, Bin Luo, Han Hu, Dengyang Jiang, Han-Jia Ye, Teng Zhang, Hai Jin
Main category: cs.CV
TL;DR: LaVer addresses modality imbalance in MLLMs by introducing masked image modeling in LLM’s latent space to enhance visual representation learning.
Details
Motivation: MLLMs suffer from modality imbalance where visual information is underutilized compared to text in deeper layers, causing degraded visual performance and hallucinations due to lack of direct visual supervision during training.Method: Proposes Latent Visual Reconstruction (LaVer) framework that uses masked image modeling in the joint latent semantic space of LLMs to provide direct visual supervisory signals and learn more discriminative visual representations.
Result: Extensive experiments show LaVer’s superiority across diverse benchmarks, especially in tasks requiring dense visual capabilities, with MLLMs exhibiting increased visual attention allocation and enhanced visual information utilization.
Conclusion: LaVer effectively mitigates modality imbalance in MLLMs by providing direct visual activation through latent space reconstruction, improving visual representation learning and performance on vision-intensive tasks.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in multimodal tasks. Despite their impressive performance, MLLMs suffer from the modality imbalance issue, where visual information is often underutilized compared to textual representations in deeper layers, leading to degraded visual performance or hallucinations. This issue stems from the predominant reliance on next-text-token-prediction during training, which fails to provide direct visual supervisory signals, resulting in progressive homogenization of visual representations throughout the layers. To this end, we propose Latent Visual Reconstruction (LaVer), a novel training framework that facilitates MLLMs in learning more discriminative visual representations via masked image modeling in the joint latent semantic space of LLM. Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information. Extensive experiments across diverse benchmarks prove the superiority of our approach in various scenarios, especially those requiring dense visual capabilities. Code of LaVer is available at https://github.com/Fir-lat/LaVer.
[165] StrokeNet: Unveiling How to Learn Fine-Grained Interactions in Online Handwritten Stroke Classification
Yiheng Huang, Shuang She, Zewei Wei, Jianmin Lin, Ming Yang, Wenyin Liu
Main category: cs.CV
TL;DR: StrokeNet: A novel network for stroke classification using reference pair representations with Inline Sequence Attention and Cross-Ellipse Query mechanisms to capture fine-grained stroke relationships.
Details
Motivation: Stroke classification is challenging due to writing style variations, ambiguous content, and dynamic positions. Existing methods struggle to capture fine-grained semantic relationships between strokes, which are typically localized interactions. Point-level approaches introduce redundancy, but selecting reference points with sequential ordering can solve this problem.Method: StrokeNet encodes strokes as reference pair representations (points + feature vectors). It dynamically selects reference points for each stroke and sequences them, using an Inline Sequence Attention (ISA) module to construct contextual features. A Cross-Ellipse Query (CEQ) mechanism clusters reference points and extracts features across varying spatial scales. A joint optimization framework predicts stroke categories via reference points regression and models adjacent stroke semantic transitions through an Auxiliary Branch.
Result: Achieves state-of-the-art performance on multiple public online handwritten datasets. On CASIA-onDo dataset, accuracy improves from 93.81% to 95.54%, demonstrating effectiveness and robustness.
Conclusion: StrokeNet effectively addresses stroke classification challenges by modeling fine-grained semantic relationships through reference pair representations, achieving superior performance through novel attention and query mechanisms.
Abstract: Stroke classification remains challenging due to variations in writing style, ambiguous content, and dynamic writing positions. The core challenge in stroke classification is modeling the semantic relationships between strokes. Our observations indicate that stroke interactions are typically localized, making it difficult for existing deep learning methods to capture such fine-grained relationships. Although viewing strokes from a point-level perspective can address this issue, it introduces redundancy. However, by selecting reference points and using their sequential order to represent strokes in a fine-grained manner, this problem can be effectively solved. This insight inspired StrokeNet, a novel network architecture encoding strokes as reference pair representations (points + feature vectors), where reference points enable spatial queries and features mediate interaction modeling. Specifically, we dynamically select reference points for each stroke and sequence them, employing an Inline Sequence Attention (ISA) module to construct contextual features. To capture spatial feature interactions, we devised a Cross-Ellipse Query (CEQ) mechanism that clusters reference points and extracts features across varying spatial scales. Finally, a joint optimization framework simultaneously predicts stroke categories via reference points regression and adjacent stroke semantic transition modeling through an Auxiliary Branch (Aux-Branch). Experimental results show that our method achieves state-of-the-art performance on multiple public online handwritten datasets. Notably, on the CASIA-onDo dataset, the accuracy improves from 93.81$%$ to 95.54$%$, demonstrating the effectiveness and robustness of our approach.
[166] Exploiting Spatiotemporal Properties for Efficient Event-Driven Human Pose Estimation
Haoxian Zhou, Chuanzhi Xu, Langyi Chen, Haodong Chen, Yuk Ying Chung, Qiang Qu, Xaoming Chen, Weidong Cai
Main category: cs.CV
TL;DR: Event-based human pose estimation using point cloud framework with temporal slicing and edge enhancement improves performance over traditional dense frame methods.
Details
Motivation: Existing event-based pose estimation methods convert event streams to dense frames, adding computation and losing the high temporal resolution of event signals. The authors aim to better exploit the spatiotemporal properties of event streams.Method: Propose a point cloud-based framework with: 1) Event Temporal Slicing Convolution module to capture short-term dependencies across event slices, 2) Event Slice Sequencing module for structured temporal modeling, and 3) edge enhancement in point cloud representation to improve spatial edge information under sparse conditions.
Result: Experiments on DHP19 dataset show consistent performance improvements across three point cloud backbones: PointNet, DGCNN, and Point Transformer.
Conclusion: The point cloud-based approach effectively leverages the spatiotemporal properties of event streams for human pose estimation, outperforming methods that convert events to dense frames while maintaining computational efficiency.
Abstract: Human pose estimation focuses on predicting body keypoints to analyze human motion. Event cameras provide high temporal resolution and low latency, enabling robust estimation under challenging conditions. However, most existing methods convert event streams into dense event frames, which adds extra computation and sacrifices the high temporal resolution of the event signal. In this work, we aim to exploit the spatiotemporal properties of event streams based on point cloud-based framework, designed to enhance human pose estimation performance. We design Event Temporal Slicing Convolution module to capture short-term dependencies across event slices, and combine it with Event Slice Sequencing module for structured temporal modeling. We also apply edge enhancement in point cloud-based event representation to enhance spatial edge information under sparse event conditions to further improve performance. Experiments on the DHP19 dataset show our proposed method consistently improves performance across three representative point cloud backbones: PointNet, DGCNN, and Point Transformer.
[167] ReCAD: Reinforcement Learning Enhanced Parametric CAD Model Generation with Vision-Language Models
Jiahao Li, Yusheng Luo, Yunzhong Lou, Xiangdong Zhou
Main category: cs.CV
TL;DR: ReCAD is a reinforcement learning framework that bootstraps pretrained large models to generate precise parametric CAD models from text or image inputs, achieving state-of-the-art performance by leveraging generative priors and structured learning.
Details
Motivation: Previous CAD generation methods rely on supervised fine-tuning with limited editability and fail to exploit the strong generative capabilities of pretrained large models. There's a need for approaches that can generate precise parametric CAD models while maintaining editability and leveraging existing model capabilities.Method: 1) Fine-tune vision-language models with parameterized CAD scripts for basic generation capabilities; 2) Novel RL strategy using parameterized code as guidance to enhance reasoning; 3) Hierarchical primitive learning process with unified reward function for geometric accuracy and semantic fidelity.
Result: State-of-the-art performance in both text-to-CAD and image-to-CAD tasks. For image-to-CAD: reduces mean Chamfer Distance from 73.47 to 29.61 (in-distribution) and from 272.06 to 80.23 (out-of-distribution), significantly outperforming baselines.
Conclusion: ReCAD successfully bootstraps pretrained large models for precise parametric CAD generation through reinforcement learning and structured skill acquisition, demonstrating superior geometric accuracy and generalization capabilities compared to previous approaches.
Abstract: We present ReCAD, a reinforcement learning (RL) framework that bootstraps pretrained large models (PLMs) to generate precise parametric computer-aided design (CAD) models from multimodal inputs by leveraging their inherent generative capabilities. With just access to simple functional interfaces (e.g., point coordinates), our approach enables the emergence of complex CAD operations (e.g., pattern replication and mirror). This stands in contrast to previous methods, which typically rely on knowledge injected through supervised fine-tuning (SFT), offer limited support for editability, and fail to exploit the strong generative priors of PLMs. Specifically, the ReCAD framework begins by fine-tuning vision-language models (VLMs) to equip them with basic CAD model generation capabilities, where we rewrite CAD scripts into parameterized code that is leveraged to generate accurate textual descriptions for supervision. Then, we propose a novel RL strategy that incorporates parameterized code as guidance to enhance the model’s reasoning on challenging questions. Furthermore, we employ a hierarchical primitive learning process to progressively teach structured and compositional skills under a unified reward function that ensures both geometric accuracy and semantic fidelity. ReCAD sets a new state-of-the-art in both text-to-CAD and image-to-CAD tasks, significantly improving geometric accuracy across in-distribution and out-of-distribution settings. In the image-to-CAD task, for instance, it reduces the mean Chamfer Distance from 73.47 to 29.61 (in-distribution) and from 272.06 to 80.23 (out-of-distribution), outperforming existing baselines by a substantial margin.
[168] S2WMamba: A Spectral-Spatial Wavelet Mamba for Pansharpening
Haoyu Zhang, Junhan Luo, Yugang Cao, Siran Peng, Jie Huang, Liangjian-Deng
Main category: cs.CV
TL;DR: S2WMamba: A pansharpening method using 2D/1D Haar DWT for frequency disentanglement and Mamba-based cross-modulation for efficient long-range dependency modeling, achieving state-of-the-art performance on multiple satellite datasets.
Details
Motivation: Traditional pansharpening methods often entangle spatial detail with spectral fidelity when jointly processing PAN and MS images, leading to suboptimal fusion results.Method: Uses 2D Haar DWT on PAN to localize spatial edges/textures, and channel-wise 1D Haar DWT on MS to separate spectral frequency components. Features two parallel branches (Spectral and Spatial) with Mamba-based cross-modulation for long-range dependency modeling, followed by multi-scale dynamic gate fusion.
Result: Outperforms recent strong baselines (FusionMamba, CANNet, U2Net, ARConv) on WV3, GF2, and QB datasets, improving PSNR by up to 0.23 dB and achieving HQNR 0.956 on full-resolution WV3.
Conclusion: S2WMamba effectively disentangles frequency information through wavelet transforms and enables efficient cross-modal interaction via Mamba architecture, providing superior pansharpening performance with linear complexity.
Abstract: Pansharpening fuses a high-resolution PAN image with a low-resolution multispectral (LRMS) image to produce an HRMS image. A key difficulty is that jointly processing PAN and MS often entangles spatial detail with spectral fidelity. We propose S2WMamba, which explicitly disentangles frequency information and then performs lightweight cross-modal interaction. Concretely, a 2D Haar DWT is applied to PAN to localize spatial edges and textures, while a channel-wise 1D Haar DWT treats each pixel’s spectrum as a 1D signal to separate low/high-frequency components and limit spectral distortion. The resulting Spectral branch injects wavelet-extracted spatial details into MS features, and the Spatial branch refines PAN features using spectra from the 1D pyramid; the two branches exchange information through Mamba-based cross-modulation that models long-range dependencies with linear complexity. A multi-scale dynamic gate (multiplicative + additive) then adaptively fuses branch outputs.On WV3, GF2, and QB, S2WMamba matches or surpasses recent strong baselines (FusionMamba, CANNet, U2Net, ARConv), improving PSNR by up to 0.23 dB and reaching HQNR 0.956 on full-resolution WV3. Ablations justify the choice of 2D/1D DWT placement, parallel dual branches, and the fusion gate. Our code is available at https://github.com/KagUYa66/S2WMamba.
[169] CryoHype: Reconstructing a thousand cryo-EM structures with transformer-based hypernetworks
Jeffrey Gu, Minkyu Jeon, Ambri Ma, Serena Yeung-Levy, Ellen D. Zhong
Main category: cs.CV
TL;DR: CryoHype: A transformer-based hypernetwork for cryo-EM reconstruction that handles compositional heterogeneity in mixtures of many distinct molecular species.
Details
Motivation: Cryo-EM has potential for high-throughput structure determination of many targets simultaneously, but existing methods focus on conformational heterogeneity within single/few structures and cannot resolve compositional heterogeneity from mixtures of many distinct molecular species.Method: CryoHype uses a transformer-based hypernetwork that dynamically adjusts the weights of an implicit neural representation for cryo-EM reconstruction.
Result: Achieves state-of-the-art results on a benchmark dataset with 100 structures, and scales to reconstructing 1,000 distinct structures from unlabeled cryo-EM images in the fixed-pose setting.
Conclusion: CryoHype successfully addresses the challenge of compositional heterogeneity in cryo-EM, enabling high-throughput reconstruction of many distinct molecular structures simultaneously.
Abstract: Cryo-electron microscopy (cryo-EM) is an indispensable technique for determining the 3D structures of dynamic biomolecular complexes. While typically applied to image a single molecular species, cryo-EM has the potential for structure determination of many targets simultaneously in a high-throughput fashion. However, existing methods typically focus on modeling conformational heterogeneity within a single or a few structures and are not designed to resolve compositional heterogeneity arising from mixtures of many distinct molecular species. To address this challenge, we propose CryoHype, a transformer-based hypernetwork for cryo-EM reconstruction that dynamically adjusts the weights of an implicit neural representation. Using CryoHype, we achieve state-of-the-art results on a challenging benchmark dataset containing 100 structures. We further demonstrate that CryoHype scales to the reconstruction of 1,000 distinct structures from unlabeled cryo-EM images in the fixed-pose setting.
[170] Beyond Hallucinations: A Multimodal-Guided Task-Aware Generative Image Compression for Ultra-Low Bitrate
Kaile Wang, Lijun He, Haisheng Fu, Haixia Bi, Fan Li
Main category: cs.CV
TL;DR: MTGC framework uses multimodal guidance (text captions, compressed images, and semantic pseudo-words) to improve semantic consistency in ultra-low bitrate generative image compression, addressing hallucination issues in 6G semantic communication.
Details
Motivation: Generative image compression at ultra-low bitrates (bpp < 0.05) suffers from semantic deviations caused by generative hallucinations, limiting reliable deployment in bandwidth-constrained 6G semantic communication scenarios.Method: Proposes MTGC framework with three guidance modalities: 1) concise text captions for global semantics, 2) highly compressed images for low-level visual information, and 3) Semantic Pseudo-Words (SPWs) for fine-grained task-relevant semantics. Uses Task-Aware Semantic Compression Module (TASCM) to generate SPWs and Multimodal-Guided Diffusion Decoder (MGDD) with dual-path cooperative guidance mechanism to inject guidance into diffusion process.
Result: Extensive experiments show MTGC consistently improves semantic consistency (e.g., DISTS drops by 10.59% on DIV2K dataset) while achieving remarkable gains in perceptual quality and pixel-level fidelity at ultra-low bitrate.
Conclusion: MTGC effectively addresses semantic deviation problems in ultra-low bitrate generative image compression through multimodal guidance, making it suitable for reliable deployment in 6G semantic communication scenarios.
Abstract: Generative image compression has recently shown impressive perceptual quality, but often suffers from semantic deviations caused by generative hallucinations at ultra-low bitrate (bpp < 0.05), limiting its reliable deployment in bandwidth-constrained 6G semantic communication scenarios. In this work, we reassess the positioning and role of of multimodal guidance, and propose a Multimodal-Guided Task-Aware Generative Image Compression (MTGC) framework. Specifically, MTGC integrates three guidance modalities to enhance semantic consistency: a concise but robust text caption for global semantics, a highly compressed image (HCI) retaining low-level visual information, and Semantic Pseudo-Words (SPWs) for fine-grained task-relevant semantics. The SPWs are generated by our designed Task-Aware Semantic Compression Module (TASCM), which operates in a task-oriented manner to drive the multi-head self-attention mechanism to focus on and extract semantics relevant to the generation task while filtering out redundancy. Subsequently, to facilitate the synergistic guidance of these modalities, we design a Multimodal-Guided Diffusion Decoder (MGDD) employing a dual-path cooperative guidance mechanism that synergizes cross-attention and ControlNet additive residuals to precisely inject these three guidance into the diffusion process, and leverages the diffusion model’s powerful generative priors to reconstruct the image. Extensive experiments demonstrate that MTGC consistently improves semantic consistency (e.g., DISTS drops by 10.59% on the DIV2K dataset) while also achieving remarkable gains in perceptual quality and pixel-level fidelity at ultra-low bitrate.
[171] CLUENet: Cluster Attention Makes Neural Networks Have Eyes
Xiangshuai Song, Jun-Jie Huang, Tianrui Liu, Ke Liang, Chang Tang
Main category: cs.CV
TL;DR: CLUENet is a transparent deep architecture for visual semantic understanding that combines clustering paradigms with attention mechanisms to achieve better accuracy, efficiency, and interpretability than existing methods.
Details
Motivation: Convolution- and attention-based models have rigid receptive fields and complex architectures that limit their ability to model irregular spatial patterns and hinder interpretability, which is problematic for tasks requiring high model transparency. While clustering paradigms offer promising interpretability and flexible semantic modeling, they suffer from limited accuracy, low efficiency, and gradient vanishing during training.Method: The paper proposes CLUENet with three key innovations: (1) Global Soft Aggregation and Hard Assignment with Temperature-Scaled Cosine Attention and gated residual connections for enhanced local modeling, (2) inter-block Hard and Shared Feature Dispatching, and (3) an improved cluster pooling strategy.
Result: Experiments on CIFAR-100 and Mini-ImageNet demonstrate that CLUENet outperforms existing clustering methods and mainstream visual models, offering a compelling balance of accuracy, efficiency, and transparency.
Conclusion: CLUENet successfully addresses the limitations of both traditional vision models and clustering paradigms, providing a transparent deep architecture that achieves superior performance while maintaining interpretability for visual semantic understanding tasks.
Abstract: Despite the success of convolution- and attention-based models in vision tasks, their rigid receptive fields and complex architectures limit their ability to model irregular spatial patterns and hinder interpretability, therefore posing challenges for tasks requiring high model transparency. Clustering paradigms offer promising interpretability and flexible semantic modeling, but suffer from limited accuracy, low efficiency, and gradient vanishing during training. To address these issues, we propose CLUster attEntion Network (CLUENet), an transparent deep architecture for visual semantic understanding. We propose three key innovations include (i) a Global Soft Aggregation and Hard Assignment with a Temperature-Scaled Cosin Attention and gated residual connections for enhanced local modeling, (ii) inter-block Hard and Shared Feature Dispatching, and (iii) an improved cluster pooling strategy. These enhancements significantly improve both classification performance and visual interpretability. Experiments on CIFAR-100 and Mini-ImageNet demonstrate that CLUENet outperforms existing clustering methods and mainstream visual models, offering a compelling balance of accuracy, efficiency, and transparency.
[172] TreeQ: Pushing the Quantization Boundary of Diffusion Transformer via Tree-Structured Mixed-Precision Search
Kaicheng Yang, Kaisen Yang, Baiting Wu, Xun Zhang, Qianrui Yang, Haotong Qin, He Zhang, Yulun Zhang
Main category: cs.CV
TL;DR: TreeQ is a unified quantization framework for Diffusion Transformers that addresses computational/memory challenges through tree-structured search, environmental noise guidance, and general monarch branch techniques to achieve near-lossless 4-bit quantization.
Details
Motivation: Diffusion Transformers (DiTs) outperform U-Nets for image generation but face deployment challenges due to high computational/memory demands. While Mixed-Precision Quantization works well for U-Nets, its application to DiTs remains limited and underexplored.Method: Three key components: 1) Tree Structured Search (TSS) for efficient O(n) search using DiT’s linear properties, 2) Environmental Noise Guidance (ENG) to align PTQ and QAT objectives with single hyperparameter, 3) General Monarch Branch (GMB) structured sparse branch to prevent information loss in ultra-low-bit regimes.
Result: State-of-the-art performance on DiT-XL/2 under W3A3 and W4A4 PTQ/PEFT settings. First to achieve near-lossless 4-bit PTQ performance on DiT models.
Conclusion: TreeQ successfully addresses DiT quantization challenges through unified framework, enabling efficient deployment while maintaining performance. The approach demonstrates practical viability for real-world DiT applications.
Abstract: Diffusion Transformers (DiTs) have emerged as a highly scalable and effective backbone for image generation, outperforming U-Net architectures in both scalability and performance. However, their real-world deployment remains challenging due to high computational and memory demands. Mixed-Precision Quantization (MPQ), designed to push the limits of quantization, has demonstrated remarkable success in advancing U-Net quantization to sub-4bit settings while significantly reducing computational and memory overhead. Nevertheless, its application to DiT architectures remains limited and underexplored. In this work, we propose TreeQ, a unified framework addressing key challenges in DiT quantization. First, to tackle inefficient search and proxy misalignment, we introduce Tree Structured Search (TSS). This DiT-specific approach leverages the architecture’s linear properties to traverse the solution space in O(n) time while improving objective accuracy through comparison-based pruning. Second, to unify optimization objectives, we propose Environmental Noise Guidance (ENG), which aligns Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) configurations using a single hyperparameter. Third, to mitigate information bottlenecks in ultra-low-bit regimes, we design the General Monarch Branch (GMB). This structured sparse branch prevents irreversible information loss, enabling finer detail generation. Through extensive experiments, our TreeQ framework demonstrates state-of-the-art performance on DiT-XL/2 under W3A3 and W4A4 PTQ/PEFT settings. Notably, our work is the first to achieve near-lossless 4-bit PTQ performance on DiT models. The code and models will be available at https://github.com/racoonykc/TreeQ
[173] Rectifying Latent Space for Generative Single-Image Reflection Removal
Mingjia Li, Jin Hu, Hainuo Wang, Qiming Hu, Jiarui Wang, Xiaojie Guo
Main category: cs.CV
TL;DR: RefRAM: A latent diffusion model reframed for single-image reflection removal by aligning latent space with reflection physics, using task-specific embeddings and depth-guided sampling.
Details
Motivation: Single-image reflection removal is highly ill-posed; existing methods struggle with reasoning about corrupted regions and fail at recovery/generalization in real-world scenarios due to latent spaces lacking structure to interpret composite images as linear superpositions.Method: Three synergistic components: 1) Reflection-equivariant VAE that aligns latent space with linear physics of reflection formation, 2) Learnable task-specific text embedding for precise guidance bypassing ambiguous language, 3) Depth-guided early-branching sampling strategy to harness generative stochasticity.
Result: Achieves new state-of-the-art performance on multiple benchmarks and generalizes well to challenging real-world cases.
Conclusion: RefRAM successfully reframes latent diffusion models for reflection removal by addressing the fundamental issue of latent space structure, enabling effective perception and processing of ambiguous layered images.
Abstract: Single-image reflection removal is a highly ill-posed problem, where existing methods struggle to reason about the composition of corrupted regions, causing them to fail at recovery and generalization in the wild. This work reframes an editing-purpose latent diffusion model to effectively perceive and process highly ambiguous, layered image inputs, yielding high-quality outputs. We argue that the challenge of this conversion stems from a critical yet overlooked issue, i.e., the latent space of semantic encoders lacks the inherent structure to interpret a composite image as a linear superposition of its constituent layers. Our approach is built on three synergistic components, including a reflection-equivariant VAE that aligns the latent space with the linear physics of reflection formation, a learnable task-specific text embedding for precise guidance that bypasses ambiguous language, and a depth-guided early-branching sampling strategy to harness generative stochasticity for promising results. Extensive experiments reveal that our model achieves new SOTA performance on multiple benchmarks and generalizes well to challenging real-world cases.
[174] Spoofing-aware Prompt Learning for Unified Physical-Digital Facial Attack Detection
Jiabao Guo, Yadian Wang, Hui Ma, Yuhao Fu, Ju Jia, Hui Liu, Shengeng Tang, Lechao Cheng, Yunfeng Diao, Ajian Liu
Main category: cs.CV
TL;DR: SPL-UAD framework uses spoofing-aware prompt learning to decouple physical and digital attack detection, achieving better unified defense against both attack types.
Details
Motivation: Real-world face recognition systems need protection against both physical presentation attacks (like masks) and digital forgery attacks (like deepfakes), but existing unified approaches suffer from conflicting optimization between these two detection tasks.Method: Proposes SPL-UAD framework with spoofing-aware prompt learning that decouples optimization branches for physical and digital attacks using parallel prompt branches with adaptive Spoofing Context Prompt Generation, plus Cues-awareness Augmentation for robust sample mining.
Result: Extensive experiments on UniAttackDataPlus dataset show significant performance improvements in unified attack detection tasks compared to existing methods.
Conclusion: The proposed SPL-UAD framework effectively addresses the optimization conflict problem in unified attack detection and provides comprehensive protection against both physical and digital face spoofing attacks.
Abstract: Real-world face recognition systems are vulnerable to both physical presentation attacks (PAs) and digital forgery attacks (DFs). We aim to achieve comprehensive protection of biometric data by implementing a unified physical-digital defense framework with advanced detection. Existing approaches primarily employ CLIP with regularization constraints to enhance model generalization across both tasks. However, these methods suffer from conflicting optimization directions between physical and digital attack detection under same category prompt spaces. To overcome this limitation, we propose a Spoofing-aware Prompt Learning for Unified Attack Detection (SPL-UAD) framework, which decouples optimization branches for physical and digital attacks in the prompt space. Specifically, we construct a learnable parallel prompt branch enhanced with adaptive Spoofing Context Prompt Generation, enabling independent control of optimization for each attack type. Furthermore, we design a Cues-awareness Augmentation that leverages the dual-prompt mechanism to generate challenging sample mining tasks on data, significantly enhancing the model’s robustness against unseen attack types. Extensive experiments on the large-scale UniAttackDataPlus dataset demonstrate that the proposed method achieves significant performance improvements in unified attack detection tasks.
[175] Human3R: Incorporating Human Priors for Better 3D Dynamic Reconstruction from Monocular Videos
Weitao Xiong, Zhiyuan Yuan, Jiahao Lu, Chengfeng Zhao, Peng Li, Yuan Liu
Main category: cs.CV
TL;DR: Human3R: A method for monocular dynamic human video reconstruction using hybrid geometric priors (SMPL + monocular depth) to address geometric inconsistencies and resolution degradation in existing approaches.
Details
Motivation: Existing monocular dynamic video reconstruction methods for human scenes suffer from geometric inconsistencies (distorted limb proportions, unnatural human-object fusion) and resolution degradation due to memory-constrained downsampling, which causes human boundary drift toward background geometry.Method: Proposes Human3R with hybrid geometric priors combining SMPL human body models with monocular depth estimation. Uses a hierarchical pipeline with refinement components: processes full-resolution images for overall scene geometry, then applies strategic cropping and cross-attention fusion for human-specific detail enhancement. Integrates SMPL priors through a Feature Fusion Module to ensure geometrically plausible reconstruction while preserving fine-grained human boundaries.
Result: Extensive experiments on TUM Dynamics and GTA-IM datasets demonstrate superior performance in dynamic human reconstruction compared to existing methods.
Conclusion: The proposed Human3R method effectively addresses geometric inconsistencies and resolution degradation in monocular dynamic human video reconstruction by incorporating structured human priors (SMPL) with monocular depth estimation, producing geometrically consistent results with preserved fine-grained human boundaries.
Abstract: Monocular dynamic video reconstruction faces significant challenges in dynamic human scenes due to geometric inconsistencies and resolution degradation issues. Existing methods lack 3D human structural understanding, producing geometrically inconsistent results with distorted limb proportions and unnatural human-object fusion, while memory-constrained downsampling causes human boundary drift toward background geometry. To address these limitations, we propose to incorporate hybrid geometric priors that combine SMPL human body models with monocular depth estimation. Our approach leverages structured human priors to maintain surface consistency while capturing fine-grained geometric details in human regions. We introduce Human3R, featuring a hierarchical pipeline with refinement components that processes full-resolution images for overall scene geometry, then applies strategic cropping and cross-attention fusion for human-specific detail enhancement. The method integrates SMPL priors through a Feature Fusion Module to ensure geometrically plausible reconstruction while preserving fine-grained human boundaries. Extensive experiments on TUM Dynamics and GTA-IM datasets demonstrate superior performance in dynamic human reconstruction.
[176] VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning
Yuji Wang, Wenlong Liu, Jingxuan Niu, Haoji Zhang, Yansong Tang
Main category: cs.CV
TL;DR: VG-Refiner is a framework for tool-refined referring grounded reasoning that addresses unreliable tool outputs through a two-stage think-rethink mechanism and refinement rewards.
Details
Motivation: Existing tool-integrated visual reasoning (TiVR) paradigms focus on integrating visual tools but neglect effective response mechanisms for handling unreliable or erroneous tool outputs, especially in referring and grounding tasks where inaccurate detection tool predictions mislead models into generating hallucinated reasoning.Method: Proposes VG-Refiner with a two-stage think-rethink mechanism that enables explicit analysis and response to tool feedback, along with a refinement reward that encourages effective correction of poor tool results. Uses small amounts of task-specific data to enhance refinement capability.
Result: Achieves significant improvement in accuracy and correction ability on referring and reasoning grounding benchmarks while preserving the general capabilities of the pretrained model.
Conclusion: VG-Refiner addresses the critical limitation of handling unreliable tool outputs in TiVR systems, establishing a new framework for tool-refined referring grounded reasoning with effective refinement mechanisms and evaluation protocols.
Abstract: Tool-integrated visual reasoning (TiVR) has demonstrated great potential in enhancing multimodal problem-solving. However, existing TiVR paradigms mainly focus on integrating various visual tools through reinforcement learning, while neglecting to design effective response mechanisms for handling unreliable or erroneous tool outputs. This limitation is particularly pronounced in referring and grounding tasks, where inaccurate detection tool predictions often mislead TiVR models into generating hallucinated reasoning. To address this issue, we propose the VG-Refiner, the first framework aiming at the tool-refined referring grounded reasoning. Technically, we introduce a two-stage think-rethink mechanism that enables the model to explicitly analyze and respond to tool feedback, along with a refinement reward that encourages effective correction in response to poor tool results. In addition, we propose two new metrics and establish fair evaluation protocols to systematically measure the refinement ability of current models. We adopt a small amount of task-specific data to enhance the refinement capability of VG-Refiner, achieving a significant improvement in accuracy and correction ability on referring and reasoning grounding benchmarks while preserving the general capabilities of the pretrained model.
[177] Are AI-Generated Driving Videos Ready for Autonomous Driving? A Diagnostic Evaluation Framework
Xinhao Xiang, Abhijeet Rastogi, Jiawei Zhang
Main category: cs.CV
TL;DR: The paper investigates whether AI-generated driving videos (AIGVs) can reliably support autonomous driving model training/evaluation, finding that raw AIGVs degrade perception performance but filtered AIGVs can complement real data when using their proposed evaluator ADGVE.
Details
Motivation: Text-to-video models can generate high-resolution driving scenes at low cost, offering a scalable alternative to real/simulator data for autonomous driving. However, it's unclear if these AI-generated driving videos (AIGVs) can reliably support AD model training and evaluation.Method: 1) Introduced taxonomy of AIGV failure modes (visual artifacts, implausible motion, traffic violations); 2) Built ADGV-Bench benchmark with human annotations and dense labels; 3) Proposed ADGVE evaluator combining static semantics, temporal cues, lane obedience, and VLM-guided reasoning into quality scores.
Result: Blindly adding raw AIGVs degrades perception performance (object detection, tracking, segmentation). Filtering AIGVs with ADGVE improves both video quality metrics and downstream AD models, turning AIGVs into a beneficial complement to real-world data.
Conclusion: AIGVs present both risks and promise for autonomous driving. The study provides practical tools (ADGV-Bench benchmark and ADGVE evaluator) for safely leveraging large-scale video generation in AD pipelines, showing that properly filtered AIGVs can enhance real data.
Abstract: Recent text-to-video models have enabled the generation of high-resolution driving scenes from natural language prompts. These AI-generated driving videos (AIGVs) offer a low-cost, scalable alternative to real or simulator data for autonomous driving (AD). But a key question remains: can such videos reliably support training and evaluation of AD models? We present a diagnostic framework that systematically studies this question. First, we introduce a taxonomy of frequent AIGV failure modes, including visual artifacts, physically implausible motion, and violations of traffic semantics, and demonstrate their negative impact on object detection, tracking, and instance segmentation. To support this analysis, we build ADGV-Bench, a driving-focused benchmark with human quality annotations and dense labels for multiple perception tasks. We then propose ADGVE, a driving-aware evaluator that combines static semantics, temporal cues, lane obedience signals, and Vision-Language Model(VLM)-guided reasoning into a single quality score for each clip. Experiments show that blindly adding raw AIGVs can degrade perception performance, while filtering them with ADGVE consistently improves both general video quality assessment metrics and downstream AD models, and turns AIGVs into a beneficial complement to real-world data. Our study highlights both the risks and the promise of AIGVs, and provides practical tools for safely leveraging large-scale video generation in future AD pipelines.
[178] VAD-Net: Multidimensional Facial Expression Recognition in Intelligent Education System
Yi Huo, Yun Ge
Main category: cs.CV
TL;DR: Researchers add Dominance dimension to FER2013 dataset (creating VAD annotations) and improve prediction accuracy using orthogonal convolution, providing new benchmark for multidimensional emotion recognition.
Details
Motivation: Current FER datasets only use basic emotion categories (happy, angry, sad, etc.) which are limited. Future affective computing needs more comprehensive metrics like VAD (Valence-Arousal-Dominance) parameters. While AffectNet added VA, it still lacks Dominance dimension.Method: 1) Annotated FER2013 dataset with VAD (Valence-Arousal-Dominance) dimensions, focusing on adding the missing Dominance dimension. 2) Used orthogonalized convolution in the network architecture to extract more diverse and expressive features, improving prediction accuracy.
Result: 1) Dominance dimension can be measured but is more difficult to obtain than Valence and Arousal in both manual annotation and network prediction. 2) Orthogonal convolution improves VAD prediction accuracy. 3) Created new VAD-annotated FER2013 dataset and orthogonalized regression network based on ResNet as baseline.
Conclusion: The research provides the first Dominance dimension labeling for FER datasets and proposes an improved VAD prediction network using orthogonal convolution. The new dataset serves as a benchmark for multidimensional emotion measurement, with code and dataset publicly available.
Abstract: Current FER (Facial Expression Recognition) dataset is mostly labeled by emotion categories, such as happy, angry, sad, fear, disgust, surprise, and neutral which are limited in expressiveness. However, future affective computing requires more comprehensive and precise emotion metrics which could be measured by VAD(Valence-Arousal-Dominance) multidimension parameters. To address this, AffectNet has tried to add VA (Valence and Arousal) information, but still lacks D(Dominance). Thus, the research introduces VAD annotation on FER2013 dataset, takes the initiative to label D(Dominance) dimension. Then, to further improve network capacity, it enforces orthogonalized convolution on it, which extracts more diverse and expressive features and will finally increase the prediction accuracy. Experiment results show that D dimension could be measured but is difficult to obtain compared with V and A dimension no matter in manual annotation or regression network prediction. Secondly, the ablation test by introducing orthogonal convolution verifies that better VAD prediction could be obtained in the configuration of orthogonal convolution. Therefore, the research provides an initiative labelling for D dimension on FER dataset, and proposes a better prediction network for VAD prediction through orthogonal convolution. The newly built VAD annotated FER2013 dataset could act as a benchmark to measure VAD multidimensional emotions, while the orthogonalized regression network based on ResNet could act as the facial expression recognition baseline for VAD emotion prediction. The newly labeled dataset and implementation code is publicly available on https://github.com/YeeHoran/VAD-Net .
[179] OCFER-Net: Recognizing Facial Expression in Online Learning System
Yi Huo, Lei Zhang
Main category: cs.CV
TL;DR: OCFER-Net improves facial expression recognition by enforcing orthogonality on convolutional kernels via a regularizer, achieving better performance on FER-2013 dataset.
Details
Motivation: Online learning during COVID-19 requires not just knowledge distribution but also emotion interaction. Facial Expression Recognition (FER) helps teachers understand student emotions, but existing methods don't sufficiently exploit convolutional matrix orthogonality.Method: Proposes OCFER-Net which enforces orthogonality on convolutional kernels using a regularizer to extract more diverse and expressive features.
Result: Superior performance on challenging FER-2013 dataset, outperforming baselines by 1.087 (likely percentage points or accuracy improvement).
Conclusion: Orthogonal convolutional kernels improve FER performance, making OCFER-Net effective for emotion recognition in online learning environments.
Abstract: Recently, online learning is very popular, especially under the global epidemic of COVID-19. Besides knowledge distribution, emotion interaction is also very important. It can be obtained by employing Facial Expression Recognition (FER). Since the FER accuracy is substantial in assisting teachers to acquire the emotional situation, the project explores a series of FER methods and finds that few works engage in exploiting the orthogonality of convolutional matrix. Therefore, it enforces orthogonality on kernels by a regularizer, which extracts features with more diversity and expressiveness, and delivers OCFER-Net. Experiments are carried out on FER-2013, which is a challenging dataset. Results show superior performance over baselines by 1.087. The code of the research project is publicly available on https://github.com/YeeHoran/OCFERNet.
[180] Perceptual Region-Driven Infrared-Visible Co-Fusion for Extreme Scene Enhancement
Jing Tao, Yonghong Zong, Banglei Guana, Pengju Sun, Taihang Lei, Yang Shanga, Qifeng Yu
Main category: cs.CV
TL;DR: A region perception-based fusion framework for IR-VIS spectra using multi-exposure and multi-modal imaging with SVE camera, achieving superior image clarity and performance in extreme conditions.
Details
Motivation: Existing methods compromise visible imagery quality when fusing IR and VIS spectra, impacting measurement accuracy in photogrammetry, especially under extreme conditions.Method: Region perception-based fusion framework combining multi-exposure and multi-modal imaging using spatially varying exposure (SVE) camera. Features include region perception-based feature fusion for precise registration, adaptive fusion with contrast enhancement, and structural similarity compensation guided by regional saliency maps.
Result: Superior image clarity and improved performance compared to state-of-the-art methods, demonstrated through experiments on both synthetic and real-world data with quantitative and visual evaluations.
Conclusion: The proposed framework effectively addresses the challenge of fusing IR and VIS spectra while preserving geometric fidelity and incorporating thermal radiation, with robust performance across different conditions including single-exposure scenarios.
Abstract: In photogrammetry, accurately fusing infrared (IR) and visible (VIS) spectra while preserving the geometric fidelity of visible features and incorporating thermal radiation is a significant challenge, particularly under extreme conditions. Existing methods often compromise visible imagery quality, impacting measurement accuracy. To solve this, we propose a region perception-based fusion framework that combines multi-exposure and multi-modal imaging using a spatially varying exposure (SVE) camera. This framework co-fuses multi-modal and multi-exposure data, overcoming single-exposure method limitations in extreme environments. The framework begins with region perception-based feature fusion to ensure precise multi-modal registration, followed by adaptive fusion with contrast enhancement. A structural similarity compensation mechanism, guided by regional saliency maps, optimizes IR-VIS spectral integration. Moreover, the framework adapts to single-exposure scenarios for robust fusion across different conditions. Experiments conducted on both synthetic and real-world data demonstrate superior image clarity and improved performance compared to state-of-the-art methods, as evidenced by both quantitative and visual evaluations.
[181] Rethinking Training Dynamics in Scale-wise Autoregressive Generation
Gengze Zhou, Chongjian Ge, Hao Tan, Feng Liu, Yicong Hong
Main category: cs.CV
TL;DR: SAR (Self-Autoregressive Refinement) addresses exposure bias in scale-wise autoregressive image generation through staggered-scale rollouts and contrastive loss, improving quality with minimal computational overhead.
Details
Motivation: Scale-wise autoregressive models suffer from exposure bias that undermines generation quality, caused by train-test mismatch and imbalance in scale-wise learning difficulty.Method: Proposes Self-Autoregressive Refinement (SAR) with two components: Stagger-Scale Rollout (SSR) for exposing models to their own intermediate predictions, and Contrastive Student-Forcing Loss (CSFL) for providing supervision on self-generated contexts.
Result: SAR consistently improves generation quality of pretrained AR models with minimal overhead, achieving 5.2% FID reduction on FlexVAR-d16 trained on ImageNet 256 within 10 epochs (5 hours on 32xA100 GPUs).
Conclusion: SAR is an efficient, scalable, and effective post-training method for visual autoregressive generation that addresses exposure bias through self-autoregressive refinement.
Abstract: Recent advances in autoregressive (AR) generative models have produced increasingly powerful systems for media synthesis. Among them, next-scale prediction has emerged as a popular paradigm, where models generate images in a coarse-to-fine manner. However, scale-wise AR models suffer from exposure bias, which undermines generation quality. We identify two primary causes of this issue: (1) train-test mismatch, where the model must rely on its own imperfect predictions during inference, and (2) imbalance in scale-wise learning difficulty, where certain scales exhibit disproportionately higher optimization complexity. Through a comprehensive analysis of training dynamics, we propose Self-Autoregressive Refinement (SAR) to address these limitations. SAR introduces a Stagger-Scale Rollout (SSR) mechanism that performs lightweight autoregressive rollouts to expose the model to its own intermediate predictions, thereby aligning train-test patterns, and a complementary Contrastive Student-Forcing Loss (CSFL) that provides adequate supervision for self-generated contexts to ensure stable training. Experimental results show that applying SAR to pretrained AR models consistently improves generation quality with minimal computational overhead. For instance, SAR yields a 5.2% FID reduction on FlexVAR-d16 trained on ImageNet 256 within 10 epochs (5 hours on 32xA100 GPUs). Given its efficiency, scalability, and effectiveness, we expect SAR to serve as a reliable post-training method for visual autoregressive generation.
[182] A Perception CNN for Facial Expression Recognition
Chunwei Tian, Jingyuan Xie, Lingjun Li, Wangmeng Zuo, Yanning Zhang, David Zhang
Main category: cs.CV
TL;DR: PCNN is a perception CNN for facial expression recognition that uses five parallel networks to learn local facial features, employs multi-domain interaction for feature fusion, and uses a two-phase loss function to achieve state-of-the-art results on multiple FER benchmarks.
Details
Motivation: Standard CNNs for facial expression recognition may ignore the effect of facial segmentation and fail to capture subtle local facial changes effectively.Method: 1. Five parallel networks learn local features from eyes, cheeks, and mouth. 2. Multi-domain interaction mechanism registers and fuses local sense organ features with global facial structural features. 3. Two-phase loss function restricts accuracy of sense information and reconstructed face images.
Result: PCNN achieves superior results on multiple FER benchmarks: CK+, JAFFE, FER2013, FERPlus, RAF-DB, and Occlusion and Pose Variant Dataset.
Conclusion: The proposed PCNN effectively captures subtle facial changes through parallel local feature learning and multi-domain feature fusion, demonstrating state-of-the-art performance on various FER datasets.
Abstract: Convolutional neural networks (CNNs) can automatically learn data patterns to express face images for facial expression recognition (FER). However, they may ignore effect of facial segmentation of FER. In this paper, we propose a perception CNN for FER as well as PCNN. Firstly, PCNN can use five parallel networks to simultaneously learn local facial features based on eyes, cheeks and mouth to realize the sensitive capture of the subtle changes in FER. Secondly, we utilize a multi-domain interaction mechanism to register and fuse between local sense organ features and global facial structural features to better express face images for FER. Finally, we design a two-phase loss function to restrict accuracy of obtained sense information and reconstructed face images to guarantee performance of obtained PCNN in FER. Experimental results show that our PCNN achieves superior results on several lab and real-world FER benchmarks: CK+, JAFFE, FER2013, FERPlus, RAF-DB and Occlusion and Pose Variant Dataset. Its code is available at https://github.com/hellloxiaotian/PCNN.
[183] DragMesh: Interactive 3D Generation Made Easy
Tianshan Zhang, Zeyu Zhang, Hao Tang
Main category: cs.CV
TL;DR: DragMesh is a real-time interactive 3D articulation framework that combines kinematic reasoning with generative motion synthesis using dual quaternions.
Details
Motivation: Current methods for articulated motion are either physically consistent but too slow for real-time use, or generative but violate basic kinematic constraints. There's a need for systems that understand how objects move and respond to interactions in real-time while maintaining physical plausibility.Method: A decoupled framework with two main components: 1) Kinematic reasoning that infers joint parameters by separating semantic intent (joint type) from geometric regression (axis/origin) using Kinematics Prediction Network (KPP-Net), and 2) A Dual Quaternion VAE (DQ-VAE) that receives predicted priors and user drag inputs to generate motion trajectories. The DQ-VAE uses FiLM conditioning to inject joint priors at every layer and a numerically-stable cross-product loss for axis alignment.
Result: DragMesh achieves real-time performance and enables plausible, generative articulation on novel objects without retraining. The framework offers a practical solution for interactive 3D articulation that maintains kinematic constraints while being fast enough for real-time use.
Conclusion: The decoupled design of DragMesh represents a practical step toward generative 3D intelligence, combining the benefits of physical consistency with real-time generative capabilities for articulated motion.
Abstract: While generative models have excelled at creating static 3D content, the pursuit of systems that understand how objects move and respond to interactions remains a fundamental challenge. Current methods for articulated motion lie at a crossroads: they are either physically consistent but too slow for real-time use, or generative but violate basic kinematic constraints. We present DragMesh, a robust framework for real-time interactive 3D articulation built around a lightweight motion generation core. Our core contribution is a novel decoupled kinematic reasoning and motion generation framework. First, we infer the latent joint parameters by decoupling semantic intent reasoning (which determines the joint type) from geometric regression (which determines the axis and origin using our Kinematics Prediction Network (KPP-Net)). Second, to leverage the compact, continuous, and singularity-free properties of dual quaternions for representing rigid body motion, we develop a novel Dual Quaternion VAE (DQ-VAE). This DQ-VAE receives these predicted priors, along with the original user drag, to generate a complete, plausible motion trajectory. To ensure strict adherence to kinematics, we inject the joint priors at every layer of the DQ-VAE’s non-autoregressive Transformer decoder using FiLM (Feature-wise Linear Modulation) conditioning. This persistent, multi-scale guidance is complemented by a numerically-stable cross-product loss to guarantee axis alignment. This decoupled design allows DragMesh to achieve real-time performance and enables plausible, generative articulation on novel objects without retraining, offering a practical step toward generative 3D intelligence. Code: https://github.com/AIGeeksGroup/DragMesh. Website: https://aigeeksgroup.github.io/DragMesh.
[184] When Gender is Hard to See: Multi-Attribute Support for Long-Range Recognition
Nzakiese Mbongo, Kailash A. Hambarde, Hugo Proença
Main category: cs.CV
TL;DR: Dual-path transformer framework using CLIP for long-range gender recognition, combining visual and attribute cues with spatial attention, evaluated on new U-DetAGReID dataset.
Details
Motivation: Gender recognition from extreme long-range imagery is challenging due to limited resolution, viewpoint variability, and loss of facial cues. Existing methods struggle with these constraints.Method: Dual-path transformer framework leveraging CLIP: (1) visual path with fine-tuned CLIP image encoder, (2) attribute-mediated path using soft-biometric prompts (hairstyle, clothing, accessories) in CLIP text-image space, plus spatial channel attention modules for localization.
Result: Outperforms state-of-the-art person-attribute and re-identification baselines across multiple metrics (macro-F1, accuracy, AUC), with robustness to distance, angle, and height variations. Created U-DetAGReID dataset for evaluation.
Conclusion: Language-guided dual-path learning provides principled, extensible foundation for responsible gender recognition in unconstrained long-range scenarios, with interpretable attribute localization and responsible abstention behavior.
Abstract: Accurate gender recognition from extreme long-range imagery remains a challenging problem due to limited spatial resolution, viewpoint variability, and loss of facial cues. For such purpose, we present a dual-path transformer framework that leverages CLIP to jointly model visual and attribute-driven cues for gender recognition at a distance. The framework integrates two complementary streams: (1) a direct visual path that refines a pre-trained CLIP image encoder through selective fine-tuning of its upper layers, and (2) an attribute-mediated path that infers gender from a set of soft-biometric prompts (e.g., hairstyle, clothing, accessories) aligned in the CLIP text-image space. Spatial channel attention modules further enhance discriminative localization under occlusion and low resolution. To support large-scale evaluation, we construct U-DetAGReID, a unified long-range gender dataset derived from DetReIDx and AG-ReID.v2, harmonized under a consistent ternary labeling scheme (Male, Female, Unknown). Extensive experiments suggest that the proposed solution surpasses state-of-the-art person-attribute and re-identification baselines across multiple metrics (macro-F1, accuracy, AUC), with consistent robustness to distance, angle, and height variations. Qualitative attention visualizations confirm interpretable attribute localization and responsible abstention behavior. Our results show that language-guided dual-path learning offers a principled, extensible foundation for responsible gender recognition in unconstrained long-range scenarios.
[185] Automated Deep Learning Estimation of Anthropometric Measurements for Preparticipation Cardiovascular Screening
Lucas R. Mareque, Ricardo L. Armentano, Leandro J. Cymberknop
Main category: cs.CV
TL;DR: Deep learning models (VGG19, ResNet50, DenseNet121) achieve sub-centimeter accuracy in estimating five key anthropometric measurements from 2D synthetic body images, offering automated screening for cardiovascular risk in athletes.
Details
Motivation: Traditional manual anthropometric measurements for preparticipation cardiovascular examination (PPCE) are labor-intensive, operator-dependent, and difficult to scale, creating a need for automated solutions to identify athletes at risk of sudden cardiac death.Method: Developed a fully automated deep-learning approach using 100,000 synthetic 2D body images derived from 3D meshes. Trained and evaluated three CNN architectures (VGG19, ResNet50, DenseNet121) with fully connected layers for regression to estimate five key anthropometric measurements.
Result: All models achieved sub-centimeter accuracy. ResNet50 performed best with mean MAE of 0.668 cm across all five measurements, demonstrating deep learning can deliver accurate anthropometric data at scale.
Conclusion: Deep learning provides a practical automated tool for anthropometric measurement that can complement athlete screening protocols. Future work will validate models on real-world images to extend applicability beyond synthetic data.
Abstract: Preparticipation cardiovascular examination (PPCE) aims to prevent sudden cardiac death (SCD) by identifying athletes with structural or electrical cardiac abnormalities. Anthropometric measurements, such as waist circumference, limb lengths, and torso proportions to detect Marfan syndrome, can indicate elevated cardiovascular risk. Traditional manual methods are labor-intensive, operator-dependent, and challenging to scale. We present a fully automated deep-learning approach to estimate five key anthropometric measurements from 2D synthetic human body images. Using a dataset of 100,000 images derived from 3D body meshes, we trained and evaluated VGG19, ResNet50, and DenseNet121 with fully connected layers for regression. All models achieved sub-centimeter accuracy, with ResNet50 performing best, achieving a mean MAE of 0.668 cm across all measurements. Our results demonstrate that deep learning can deliver accurate anthropometric data at scale, offering a practical tool to complement athlete screening protocols. Future work will validate the models on real-world images to extend applicability.
[186] AGORA: Adversarial Generation Of Real-time Animatable 3D Gaussian Head Avatars
Ramazan Fazylov, Sergey Zagoruyko, Aleksandr Parkin, Stamatis Lefkimmiatis, Ivan Laptev
Main category: cs.CV
TL;DR: AGORA extends 3D Gaussian Splatting with GAN framework to create animatable 3D human avatars with real-time rendering and fine-grained expression control.
Details
Motivation: Existing methods have limitations: NeRF-based approaches are slow and inconsistent for dynamic content, while 3DGS methods are limited to static heads without dynamic control. There's a need for high-fidelity, animatable avatars with real-time performance.Method: AGORA combines 3D Gaussian Splatting with GAN framework. Key innovation is a lightweight FLAME-conditioned deformation branch that predicts per-Gaussian residuals for identity-preserving expression control. Uses dual-discriminator training with synthetic renderings of parametric mesh for expression fidelity.
Result: Outperforms state-of-the-art NeRF methods on expression accuracy while achieving 250+ FPS on GPU and ~9 FPS on CPU-only inference. First demonstration of practical CPU-only animatable 3DGS avatar synthesis.
Conclusion: AGORA represents significant progress toward practical, high-performance digital humans with real-time animatable avatar generation that works even on CPU-only systems.
Abstract: The generation of high-fidelity, animatable 3D human avatars remains a core challenge in computer graphics and vision, with applications in VR, telepresence, and entertainment. Existing approaches based on implicit representations like NeRFs suffer from slow rendering and dynamic inconsistencies, while 3D Gaussian Splatting (3DGS) methods are typically limited to static head generation, lacking dynamic control. We bridge this gap by introducing AGORA, a novel framework that extends 3DGS within a generative adversarial network to produce animatable avatars. Our key contribution is a lightweight, FLAME-conditioned deformation branch that predicts per-Gaussian residuals, enabling identity-preserving, fine-grained expression control while allowing real-time inference. Expression fidelity is enforced via a dual-discriminator training scheme leveraging synthetic renderings of the parametric mesh. AGORA generates avatars that are not only visually realistic but also precisely controllable. Quantitatively, we outperform state-of-the-art NeRF-based methods on expression accuracy while rendering at 250+ FPS on a single GPU, and, notably, at $\sim$9 FPS under CPU-only inference - representing, to our knowledge, the first demonstration of practical CPU-only animatable 3DGS avatar synthesis. This work represents a significant step toward practical, high-performance digital humans. Project website: https://ramazan793.github.io/AGORA/
[187] Towards Stable Cross-Domain Depression Recognition under Missing Modalities
Jiuyi Chen, Mingkui Tan, Haifeng Lu, Qiuna Xu, Zhihua Wang, Runhao Zeng, Xiping Hu
Main category: cs.CV
TL;DR: A unified multimodal LLM framework for stable cross-domain depression recognition that handles heterogeneous data and missing modalities.
Details
Motivation: Current multimodal depression detection methods lack unified frameworks for diverse scenarios and show limited stability to missing modalities, which are common in real-world data.Method: SCD-MLLM framework with two key components: Multi-Source Data Input Adapter (MDIA) using masking and prompts to handle heterogeneous inputs, and Modality-Aware Adaptive Fusion Module (MAFM) for adaptive audio-visual feature integration.
Result: Outperforms SOTA models and commercial LLMs across five depression datasets in both complete and partial modality settings, demonstrating superior cross-domain generalization and stability to missing modalities.
Conclusion: The proposed framework provides a robust solution for real-world depression screening with strong generalization across diverse scenarios and resilience to incomplete multimodal data.
Abstract: Depression poses serious public health risks, including suicide, underscoring the urgency of timely and scalable screening. Multimodal automatic depression detection (ADD) offers a promising solution; however, widely studied audio- and video-based ADD methods lack a unified, generalizable framework for diverse depression recognition scenarios and show limited stability to missing modalities, which are common in real-world data. In this work, we propose a unified framework for Stable Cross-Domain Depression Recognition based on Multimodal Large Language Model (SCD-MLLM). The framework supports the integration and processing of heterogeneous depression-related data collected from varied sources while maintaining stability in the presence of incomplete modality inputs. Specifically, SCD-MLLM introduces two key components: (i) Multi-Source Data Input Adapter (MDIA), which employs masking mechanism and task-specific prompts to transform heterogeneous depression-related inputs into uniform token sequences, addressing inconsistency across diverse data sources; (ii) Modality-Aware Adaptive Fusion Module (MAFM), which adaptively integrates audio and visual features via a shared projection mechanism, enhancing resilience under missing modality conditions. e conduct comprehensive experiments under multi-dataset joint training settings on five publicly available and heterogeneous depression datasets from diverse scenarios: CMDC, AVEC2014, DAIC-WOZ, DVlog, and EATD. Across both complete and partial modality settings, SCD-MLLM outperforms state-of-the-art (SOTA) models as well as leading commercial LLMs (Gemini and GPT), demonstrating superior cross-domain generalization, enhanced ability to capture multimodal cues of depression, and strong stability to missing modality cases in real-world applications.
[188] Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction
Kush Revankar, Shreyas Deshpande, Araham Sayeed, Ansh Tandale, Sarika Bobde
Main category: cs.CV
TL;DR: Sanvaad is a lightweight multimodal accessibility framework for real-time two-way communication between deaf, visually impaired, and hearing users, using MediaPipe for sign language recognition and speech processing for voice interfaces.
Details
Motivation: Current communication tools often only support one direction of interaction between deaf users, visually impaired users, and the hearing population, creating barriers to inclusive communication.Method: Developed a lightweight multimodal framework with: 1) ISL recognition module using MediaPipe landmarks for efficiency on edge devices, 2) voice-to-sign component that maps speech to predefined phrases with GIFs/alphabet visualizations, and 3) screen-free voice interface with multilingual speech recognition, text summarization, and TTS for visually impaired users. Built with Streamlit for cross-platform usability.
Result: Created a practical, accessible framework that enables real-time two-way communication by combining lightweight computer vision and speech processing tools within a unified system that runs on both desktop and mobile environments.
Conclusion: Sanvaad provides a practical pathway for inclusive communication by addressing bidirectional communication needs through lightweight multimodal technologies that don’t require dedicated hardware, making accessibility tools more widely available.
Abstract: Communication between deaf users, visually im paired users, and the general hearing population often relies on tools that support only one direction of interaction. To address this limitation, this work presents Sanvaad, a lightweight multimodal accessibility framework designed to support real time, two-way communication. For deaf users, Sanvaad includes an ISL recognition module built on MediaPipe landmarks. MediaPipe is chosen primarily for its efficiency and low computational load, enabling the system to run smoothly on edge devices without requiring dedicated hardware. Spoken input from a phone can also be translated into sign representations through a voice-to-sign component that maps detected speech to predefined phrases and produces corresponding GIFs or alphabet-based visualizations. For visually impaired users, the framework provides a screen free voice interface that integrates multilingual speech recognition, text summarization, and text-to-speech generation. These components work together through a Streamlit-based interface, making the system usable on both desktop and mobile environments. Overall, Sanvaad aims to offer a practical and accessible pathway for inclusive communication by combining lightweight computer vision and speech processing tools within a unified framework.
[189] Method of UAV Inspection of Photovoltaic Modules Using Thermal and RGB Data Fusion
Andrii Lysyi, Anatoliy Sachenko, Pavlo Radiuk, Mykola Lysyi, Oleksandr Melnychenko, Diana Zahorodnia
Main category: cs.CV
TL;DR: Researchers developed an intelligent automated PV inspection system that addresses thermal palette bias, data redundancy, and bandwidth issues through multi-modal fusion, adaptive re-acquisition, and geospatial deduplication.
Details
Motivation: Conventional PV inspection methods suffer from thermal palette bias, data redundancy, and high communication bandwidth requirements, which limit their effectiveness and efficiency in monitoring photovoltaic infrastructure.Method: The system uses palette-invariant thermal embedding fused with contrast-normalized RGB via gated mechanism, adaptive re-acquisition controller with Rodrigues-based updates for ambiguous anomalies, and geospatial deduplication using DBSCAN over haversine distance.
Result: Achieved mAP@0.5 of 0.903 on PVF-10 benchmark (12-15% improvement over baselines), 96% recall in field validation, 15-20% reduction in duplicate-induced false positives, and 60-70% reduction in airborne data transmission.
Conclusion: The study establishes a powerful new paradigm for proactive PV inspection with validated performance improvements, reduced false positives, and significant bandwidth savings, demonstrating readiness for practical deployment.
Abstract: The subject of this research is the development of an intelligent, integrated framework for the automated inspection of photovoltaic (PV) infrastructure that addresses the critical shortcomings of conventional methods, including thermal palette bias, data redundancy, and high communication bandwidth requirements. The goal of this study is to design, develop, and validate a comprehensive, multi-modal system that fully automates the monitoring workflow, from data acquisition to the generation of actionable, geo-located maintenance alerts, thereby enhancing plant safety and operational efficiency. The methods employed involve a synergistic architecture that begins with a palette-invariant thermal embedding, learned by enforcing representational consistency, which is fused with a contrast-normalized RGB stream via a gated mechanism. This is supplemented by a closed-loop, adaptive re-acquisition controller that uses Rodrigues-based updates for targeted confirmation of ambiguous anomalies and a geospatial deduplication module that clusters redundant alerts using DBSCAN over the haversine distance. In conclusion, this study establishes a powerful new paradigm for proactive PV inspection, with the proposed system achieving a mean Average Precision (mAP@0.5) of 0.903 on the public PVF-10 benchmark, a significant 12-15% improvement over single-modality baselines. Field validation confirmed the system’s readiness, achieving 96% recall, while the de-duplication process reduced duplicate-induced false positives by 15-20%, and relevance-only telemetry cut airborne data transmission by 60-70%.
[190] ShadowWolf – Automatic Labelling, Evaluation and Model Training Optimised for Camera Trap Wildlife Images
Jens Dede, Anna Förster
Main category: cs.CV
TL;DR: ShadowWolf is a unified AI framework that dynamically adapts wildlife monitoring models to changing environmental conditions, reducing labeling effort and improving real-world performance.
Details
Motivation: Increasing human-wildlife interactions due to habitat expansion require better monitoring solutions. Traditional AI models struggle with environmental variability (landscape, weather, lighting, camera distances), limiting real-world robustness.Method: Proposes ShadowWolf, a unified framework that integrates and optimizes AI model training stages. Features dynamic model retraining to adapt to environmental changes and application requirements, enabling on-site adaptation with reduced labeling effort.
Result: The framework enhances accuracy and efficiency of wildlife monitoring systems by addressing environmental variability challenges, though specific quantitative results are not provided in the abstract.
Conclusion: ShadowWolf’s adaptive approach enables more effective and scalable conservation efforts by improving wildlife monitoring AI systems’ robustness to real-world environmental variations.
Abstract: The continuous growth of the global human population is leading to the expansion of human habitats, resulting in decreasing wildlife spaces and increasing human-wildlife interactions. These interactions can range from minor disturbances, such as raccoons in urban waste bins, to more severe consequences, including species extinction. As a result, the monitoring of wildlife is gaining significance in various contexts. Artificial intelligence (AI) offers a solution by automating the recognition of animals in images and videos, thereby reducing the manual effort required for wildlife monitoring. Traditional AI training involves three main stages: image collection, labelling, and model training. However, the variability, for example, in the landscape (e.g., mountains, open fields, forests), weather (e.g., rain, fog, sunshine), lighting (e.g., day, night), and camera-animal distances presents significant challenges to model robustness and adaptability in real-world scenarios. In this work, we propose a unified framework, called ShadowWolf, designed to address these challenges by integrating and optimizing the stages of AI model training and evaluation. The proposed framework enables dynamic model retraining to adjust to changes in environmental conditions and application requirements, thereby reducing labelling efforts and allowing for on-site model adaptation. This adaptive and unified approach enhances the accuracy and efficiency of wildlife monitoring systems, promoting more effective and scalable conservation efforts.
[191] On The Role of K-Space Acquisition in MRI Reconstruction Domain-Generalization
Mohammed Wattad, Tamir Shor, Alex Bronstein
Main category: cs.CV
TL;DR: Learned k-space sampling patterns in MRI show cross-domain transferability, with proposed stochastic perturbation method improving domain generalization.
Details
Motivation: Existing learned k-space acquisition patterns are typically optimized for single datasets/modalities with limited consideration of transferability across imaging domains. The paper aims to demonstrate that learned sampling benefits can extend beyond training domains and improve reconstruction under domain shifts.Method: Systematic evaluation across datasets and acquisition paradigms, plus a novel method that enhances domain robustness by introducing acquisition uncertainty during training - stochastically perturbing k-space trajectories to simulate variability across scanners and imaging conditions.
Result: Models trained with learned sampling patterns exhibit improved generalization under cross-domain settings. The proposed stochastic perturbation method further enhances domain robustness by simulating scanner variability.
Conclusion: K-space trajectory design should be treated not just as an acceleration mechanism, but as an active degree of freedom for improving domain generalization in MRI reconstruction, with learned sampling patterns showing promising cross-domain transferability.
Abstract: Recent work has established learned k-space acquisition patterns as a promising direction for improving reconstruction quality in accelerated Magnetic Resonance Imaging (MRI). Despite encouraging results, most existing research focuses on acquisition patterns optimized for a single dataset or modality, with limited consideration of their transferability across imaging domains. In this work, we demonstrate that the benefits of learned k-space sampling can extend beyond the training domain, enabling superior reconstruction performance under domain shifts. Our study presents two main contributions. First, through systematic evaluation across datasets and acquisition paradigms, we show that models trained with learned sampling patterns exhibitimproved generalization under cross-domain settings. Second, we propose a novel method that enhances domain robustness by introducing acquisition uncertainty during training-stochastically perturbing k-space trajectories to simulate variability across scanners and imaging conditions. Our results highlight the importance of treating kspace trajectory design not merely as an acceleration mechanism, but as an active degree of freedom for improving domain generalization in MRI reconstruction.
[192] Novel Deep Learning Architectures for Classification and Segmentation of Brain Tumors from MRI Images
Sayan Das, Arghadip Biswas
Main category: cs.CV
TL;DR: Proposes two novel deep learning architectures for brain tumor detection: SAETCN for tumor classification with 99.38% accuracy, and SAS-Net for tumor segmentation with 99.23% pixel accuracy.
Details
Motivation: Brain tumors are life-threatening and require early detection. Manual detection from MRI scans is time-consuming and difficult due to increasing cases, especially among children and adolescents. Existing AI models lack generalization and perform poorly on validation data.Method: Two novel deep learning architectures: (1) SAETCN (Self-Attention Enhancement Tumor Classification Network) for classifying brain tumors into glioma, meningioma, pituitary tumors, and non-tumor cases; (2) SAS-Net (Self-Attentive Segmentation Network) for accurate tumor segmentation.
Result: SAETCN achieved 99.38% accuracy on validation dataset for tumor classification. SAS-Net achieved 99.23% overall pixel accuracy for tumor segmentation. Both models trained on dataset containing 3 tumor types and non-tumor cases.
Conclusion: The proposed deep learning architectures provide highly accurate automated detection of brain tumors, addressing limitations of existing models and manual detection methods, potentially improving early diagnosis and treatment.
Abstract: Brain tumors pose a significant threat to human life, therefore it is very much necessary to detect them accurately in the early stages for better diagnosis and treatment. Brain tumors can be detected by the radiologist manually from the MRI scan images of the patients. However, the incidence of brain tumors has risen amongst children and adolescents in recent years, resulting in a substantial volume of data, as a result, it is time-consuming and difficult to detect manually. With the emergence of Artificial intelligence in the modern world and its vast application in the medical field, we can make an approach to the CAD (Computer Aided Diagnosis) system for the early detection of Brain tumors automatically. All the existing models for this task are not completely generalized and perform poorly on the validation data. So, we have proposed two novel Deep Learning Architectures - (a) SAETCN (Self-Attention Enhancement Tumor Classification Network) for the classification of different kinds of brain tumors. We have achieved an accuracy of 99.38% on the validation dataset making it one of the few Novel Deep learning-based architecture that is capable of detecting brain tumors accurately. We have trained the model on the dataset, which contains images of 3 types of tumors (glioma, meningioma, and pituitary tumors) and non-tumor cases. and (b) SAS-Net (Self-Attentive Segmentation Network) for the accurate segmentation of brain tumors. We have achieved an overall pixel accuracy of 99.23%.
[193] Bridging spatial awareness and global context in medical image segmentation
Dalia Alzu’bi, A. Ben Hamza
Main category: cs.CV
TL;DR: U-CycleMLP: A lightweight U-shaped encoder-decoder network for medical image segmentation that uses position attention, dense atrous blocks, and channel CycleMLP blocks to capture local/global context while maintaining computational efficiency.
Details
Motivation: Existing medical image segmentation models struggle to balance accuracy and efficiency while effectively capturing both local and global contextual information, leading to boundary pixel loss and segmentation errors.Method: U-shaped encoder-decoder architecture with: 1) Encoder using position attention weight excitation blocks, dense atrous blocks, and downsampling to learn multiscale contextual features; 2) Decoder with upsampling, dense atrous blocks, and feature fusion; 3) Channel CycleMLP blocks along skip connections for enhanced feature integration with linear computational complexity.
Result: Competitive performance on three benchmark datasets, achieving better segmentation accuracy across all datasets, capturing fine-grained anatomical structures, demonstrating robustness across different medical imaging modalities, and ablation studies confirm importance of core components.
Conclusion: U-CycleMLP effectively balances segmentation accuracy and computational efficiency by capturing both local and global contextual information, making it suitable for medical image segmentation tasks requiring precise boundary delineation.
Abstract: Medical image segmentation is a fundamental task in computer-aided diagnosis, requiring models that balance segmentation accuracy and computational efficiency. However, existing segmentation models often struggle to effectively capture local and global contextual information, leading to boundary pixel loss and segmentation errors. In this paper, we propose U-CycleMLP, a novel U-shaped encoder-decoder network designed to enhance segmentation performance while maintaining a lightweight architecture. The encoder learns multiscale contextual features using position attention weight excitation blocks, dense atrous blocks, and downsampling operations, effectively capturing both local and global contextual information. The decoder reconstructs high-resolution segmentation masks through upsampling operations, dense atrous blocks, and feature fusion mechanisms, ensuring precise boundary delineation. To further refine segmentation predictions, channel CycleMLP blocks are incorporated into the decoder along the skip connections, enhancing feature integration while maintaining linear computational complexity relative to input size. Experimental results, both quantitative and qualitative, across three benchmark datasets demonstrate the competitive performance of U-CycleMLP in comparison with state-of-the-art methods, achieving better segmentation accuracy across all datasets, capturing fine-grained anatomical structures, and demonstrating robustness across different medical imaging modalities. Ablation studies further highlight the importance of the model’s core architectural components in enhancing segmentation accuracy.
[194] SUGAR: A Sweeter Spot for Generative Unlearning of Many Identities
Dung Thuy Nguyen, Quang Nguyen, Preston K. Robinette, Eli Jiang, Taylor T. Johnson, Kevin Leach
Main category: cs.CV
TL;DR: SUGAR is a framework for scalable generative unlearning that enables removal of multiple identities from 3D-aware generative models without full retraining, using personalized surrogate latents and continual utility preservation.
Details
Motivation: Recent advances in 3D-aware generative models enable high-fidelity human identity synthesis, but raise urgent questions about user consent and the ability to remove specific individuals from model outputs. There's a need for methods that can remove identities without retraining entire models.Method: SUGAR learns a personalized surrogate latent for each identity to divert reconstructions to visually coherent alternatives, rather than projecting to unrealistic outputs or using static template faces. It introduces a continual utility preservation objective to prevent degradation as more identities are forgotten.
Result: SUGAR achieves state-of-the-art performance in removing up to 200 identities, with up to 700% improvement in retention utility compared to existing baselines.
Conclusion: SUGAR provides an effective framework for scalable generative unlearning that addresses consent concerns in 3D-aware generative models while maintaining model quality and diversity.
Abstract: Recent advances in 3D-aware generative models have enabled high-fidelity image synthesis of human identities. However, this progress raises urgent questions around user consent and the ability to remove specific individuals from a model’s output space. We address this by introducing SUGAR, a framework for scalable generative unlearning that enables the removal of many identities (simultaneously or sequentially) without retraining the entire model. Rather than projecting unwanted identities to unrealistic outputs or relying on static template faces, SUGAR learns a personalized surrogate latent for each identity, diverting reconstructions to visually coherent alternatives while preserving the model’s quality and diversity. We further introduce a continual utility preservation objective that guards against degradation as more identities are forgotten. SUGAR achieves state-of-the-art performance in removing up to 200 identities, while delivering up to a 700% improvement in retention utility compared to existing baselines. Our code is publicly available at https://github.com/judydnguyen/SUGAR-Generative-Unlearn.
[195] GNC-Pose: Geometry-Aware GNC-PnP for Accurate 6D Pose Estimation
Xiujin Liu
Main category: cs.CV
TL;DR: GNC-Pose is a learning-free monocular 6D object pose estimation method that uses rendering-based initialization, geometry-aware correspondence weighting, and GNC optimization to achieve competitive accuracy without training data or learned features.
Details
Motivation: The paper aims to provide a robust, learning-free solution for 6D object pose estimation that doesn't require training data, learned features, or category-specific priors, making it practical and widely applicable.Method: The method combines: 1) rendering-based initialization for coarse 2D-3D correspondences, 2) geometry-aware cluster-based weighting that assigns confidence based on 3D structural consistency, and 3) Graduated Non-Convexity (GNC) optimization for robust pose estimation under severe outliers, followed by LM refinement.
Result: Tested on the YCB Object and Model Set, GNC-Pose achieves competitive accuracy compared to both learning-based and learning-free methods, despite requiring no learned features, training data, or category-specific priors.
Conclusion: GNC-Pose offers a simple, robust, and practical learning-free solution for 6D pose estimation that performs competitively with state-of-the-art methods while avoiding the need for training data or learned features.
Abstract: We present GNC-Pose, a fully learning-free monocular 6D object pose estimation pipeline for textured objects that combines rendering-based initialization, geometry-aware correspondence weighting, and robust GNC optimization. Starting from coarse 2D-3D correspondences obtained through feature matching and rendering-based alignment, our method builds upon the Graduated Non-Convexity (GNC) principle and introduces a geometry-aware, cluster-based weighting mechanism that assigns robust per point confidence based on the 3D structural consistency of the model. This geometric prior and weighting strategy significantly stabilizes the optimization under severe outlier contamination. A final LM refinement further improve accuracy. We tested GNC-Pose on The YCB Object and Model Set, despite requiring no learned features, training data, or category-specific priors, GNC-Pose achieves competitive accuracy compared with both learning-based and learning-free methods, and offers a simple, robust, and practical solution for learning-free 6D pose estimation.
[196] MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
Yuhao Su, Anwesa Choudhuri, Zhongpai Gao, Benjamin Planche, Van Nguyen Nguyen, Meng Zheng, Yuhan Shen, Arun Innanje, Terrence Chen, Ehsan Elhamifar, Ziyan Wu
Main category: cs.CV
TL;DR: MedVidBench is a large medical video benchmark with 531,850 video-instruction pairs across 8 sources. MedGRPO is a novel RL framework with cross-dataset reward normalization and medical LLM judge that improves medical video understanding beyond SFT baselines.
Details
Motivation: Large vision-language models struggle with medical video understanding due to challenges in spatial precision, temporal reasoning, and clinical semantics. There's a need for comprehensive benchmarks and robust training methods for medical domains.Method: 1) Created MedVidBench benchmark with expert-guided prompting and dual-model validation. 2) Developed MedGRPO RL framework with cross-dataset reward normalization (maps each dataset’s median performance to common reward) and medical LLM judge (evaluates caption quality on five clinical dimensions).
Result: SFT on MedVidBench with Qwen2.5-VL-7B substantially outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks. MedGRPO further improves upon SFT baseline across grounding and captioning tasks, overcoming standard RL’s training collapse issues.
Conclusion: The work establishes foundational benchmark (MedVidBench) and robust training methodology (MedGRPO) for advancing vision-language models in medical domains, addressing critical challenges in medical video understanding.
Abstract: Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce \textbf{MedVidBench}, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video, segment, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce \textbf{MedGRPO}, a novel RL framework for balanced multi-dataset training with two key innovations: (1) \emph{cross-dataset reward normalization} that maps each dataset’s median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a \emph{medical LLM judge} that evaluates caption quality on five clinical dimensions through comparative similarity scoring. Supervised fine-tuning Qwen2.5-VL-7B on MedVidBench substantially outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, demonstrating MedVidBench’s efficacy, while our MedGRPO framework further improves upon the SFT baseline across grounding and captioning tasks. Our work establishes a foundational benchmark and robust training methodology for advancing vision-language models in medical domains. Our project website is available at https://yuhaosu.github.io/MedGRPO/.
[197] Masked Autoencoder Pretraining on Strong-Lensing Images for Joint Dark-Matter Model Classification and Super-Resolution
Achmad Ardani Prasha, Clavino Ourizqi Rachmadi, Muhamad Fauzan Ibnu Syahlan, Naufal Rahfi Anugerah, Nanda Garin Raditya, Putri Amelia, Sabrina Laila Mutiara, Hilman Syachr Ramadhan
Main category: cs.CV
TL;DR: MAE pretraining on simulated strong-lensing images improves classification of dark matter models and super-resolution reconstruction compared to training from scratch.
Details
Motivation: Strong gravitational lensing can reveal dark-matter substructure, but analyzing noisy, low-resolution images is challenging. Need better methods for classification and image enhancement.Method: Use masked autoencoder (MAE) pretraining on simulated strong-lensing images from DeepLense benchmark. Pretrain Vision Transformer encoder with masked image modeling, then fine-tune separately for classification (dark matter model) and super-resolution tasks.
Result: MAE pretraining with 90% mask ratio achieves: classification AUC 0.968 (vs 0.957 baseline), accuracy 88.65% (vs 82.46%). Super-resolution: PSNR ~33 dB, SSIM 0.961, modest improvement over baseline. Higher mask ratios improve classification but slightly degrade reconstruction.
Conclusion: MAE pretraining on physics-rich simulations provides flexible, reusable encoder for multiple strong-lensing analysis tasks, outperforming training from scratch.
Abstract: Strong gravitational lensing can reveal the influence of dark-matter substructure in galaxies, but analyzing these effects from noisy, low-resolution images poses a significant challenge. In this work, we propose a masked autoencoder (MAE) pretraining strategy on simulated strong-lensing images from the DeepLense ML4SCI benchmark to learn generalizable representations for two downstream tasks: (i) classifying the underlying dark matter model (cold dark matter, axion-like, or no substructure) and (ii) enhancing low-resolution lensed images via super-resolution. We pretrain a Vision Transformer encoder using a masked image modeling objective, then fine-tune the encoder separately for each task. Our results show that MAE pretraining, when combined with appropriate mask ratio tuning, yields a shared encoder that matches or exceeds a ViT trained from scratch. Specifically, at a 90% mask ratio, the fine-tuned classifier achieves macro AUC of 0.968 and accuracy of 88.65%, compared to the scratch baseline (AUC 0.957, accuracy 82.46%). For super-resolution (16x16 to 64x64), the MAE-pretrained model reconstructs images with PSNR ~33 dB and SSIM 0.961, modestly improving over scratch training. We ablate the MAE mask ratio, revealing a consistent trade-off: higher mask ratios improve classification but slightly degrade reconstruction fidelity. Our findings demonstrate that MAE pretraining on physics-rich simulations provides a flexible, reusable encoder for multiple strong-lensing analysis tasks.
[198] From Remote Sensing to Multiple Time Horizons Forecasts: Transformers Model for CyanoHAB Intensity in Lake Champlain
Muhammad Adil, Patrick J. Clemins, Andrew W. Schroth, Panagiotis D. Oikonomou, Donna M. Rizzo, Peter D. F. Isles, Xiaohan Zhang, Kareem I. Hannoun, Scott Turnbull, Noah B. Beckage, Asim Zia, Safwan Wshah
Main category: cs.CV
TL;DR: Transformer-BiLSTM model predicts cyanobacterial harmful algal blooms up to 14 days ahead using sparse satellite data with strong forecasting performance.
Details
Motivation: CyanoHABs threaten aquatic ecosystems and public health globally, with Lake Champlain being particularly vulnerable. Remote sensing offers scalable monitoring where in situ data is sparse.Method: Combines Transformers and BiLSTM to predict CyanoHAB intensities using Cyanobacterial Index and MODIS temperature data. Two-stage preprocessing handles sparse data (30% missing CI, 90% missing temperature) with forward fill, weighted temporal imputation, and smoothing.
Result: Model achieves F1 scores of 89.5% (1-day), 86.4% (2-day), 85.5% (3-day), and 78.9% (14-day) with AUC of 82.6% at 14-day horizon.
Conclusion: The Transformer-BiLSTM model effectively captures complex spatiotemporal dynamics from sparse satellite data and provides reliable early warning for CyanoHAB management.
Abstract: Cyanobacterial Harmful Algal Blooms (CyanoHABs) pose significant threats to aquatic ecosystems and public health globally. Lake Champlain is particularly vulnerable to recurring CyanoHAB events, especially in its northern segment: Missisquoi Bay, St. Albans Bay, and Northeast Arm, due to nutrient enrichment and climatic variability. Remote sensing provides a scalable solution for monitoring and forecasting these events, offering continuous coverage where in situ observations are sparse or unavailable. In this study, we present a remote sensing only forecasting framework that combines Transformers and BiLSTM to predict CyanoHAB intensities up to 14 days in advance. The system utilizes Cyanobacterial Index data from the Cyanobacterial Assessment Network and temperature data from Moderate Resolution Imaging Spectroradiometer satellites to capture long range dependencies and sequential dynamics in satellite time series. The dataset is very sparse, missing more than 30% of the Cyanobacterial Index data and 90% of the temperature data. A two stage preprocessing pipeline addressed data gaps by applying forward fill and weighted temporal imputation at the pixel level, followed by smoothing to reduce the discontinuities of CyanoHAB events. The raw dataset is transformed into meaningful features through equal frequency binning for the Cyanobacterial Index values and extracted temperature statistics. Transformer BiLSTM model demonstrates strong forecasting performance across multiple horizons, achieving F1 scores of 89.5%, 86.4%, and 85.5% at one, two, and three-day forecasts, respectively, and maintaining an F1 score of 78.9% with an AUC of 82.6% at the 14-day horizon. These results confirm the model’s ability to capture complex spatiotemporal dynamics from sparse satellite data and to provide reliable early warning for CyanoHABs management.
[199] TextMamba: Scene Text Detector with Mamba
Qiyan Zhao, Yue Yan, Da-Han Wang
Main category: cs.CV
TL;DR: A novel scene text detector using Mamba’s selection mechanism with attention layers, featuring Top_k selection, dual-scale feed-forward network, and embedding pyramid enhancement for better long-range dependency modeling.
Details
Motivation: Transformer-based methods have limitations in cross-domain applications and long-range dependency modeling for scene text detection, often forgetting important information or focusing on irrelevant representations. Mamba's linear complexity selection mechanism offers better long-range modeling capabilities.Method: Proposes a Mamba-based scene text detector that integrates selection mechanism with attention layers, uses Top_k algorithm to explicitly select key information, designs a dual-scale feed-forward network for high-dimensional hidden state interactions, and includes an embedding pyramid enhancement module for multi-scale feature fusion.
Result: Achieves state-of-the-art or competitive performance with F-measures of 89.7% on CTW1500, 89.2% on TotalText, and 78.5% on ICDAR19ArT benchmarks.
Conclusion: The proposed Mamba-based approach effectively addresses Transformer limitations in scene text detection by improving long-range dependency modeling through selective information processing and multi-scale feature enhancement, demonstrating superior performance on standard benchmarks.
Abstract: In scene text detection, Transformer-based methods have addressed the global feature extraction limitations inherent in traditional convolution neural network-based methods. However, most directly rely on native Transformer attention layers as encoders without evaluating their cross-domain limitations and inherent shortcomings: forgetting important information or focusing on irrelevant representations when modeling long-range dependencies for text detection. The recently proposed state space model Mamba has demonstrated better long-range dependencies modeling through a linear complexity selection mechanism. Therefore, we propose a novel scene text detector based on Mamba that integrates the selection mechanism with attention layers, enhancing the encoder’s ability to extract relevant information from long sequences. We adopt the Top_k algorithm to explicitly select key information and reduce the interference of irrelevant information in Mamba modeling. Additionally, we design a dual-scale feed-forward network and an embedding pyramid enhancement module to facilitate high-dimensional hidden state interactions and multi-scale feature fusion. Our method achieves state-of-the-art or competitive performance on various benchmarks, with F-measures of 89.7%, 89.2%, and 78.5% on CTW1500, TotalText, and ICDAR19ArT, respectively. Codes will be available.
[200] Learning Relative Gene Expression Trends from Pathology Images in Spatial Transcriptomics
Kazuya Nishimura, Haruka Hirose, Ryoma Bise, Kaito Shiku, Yasuhiro Kojima
Main category: cs.CV
TL;DR: STRank: A novel loss function for gene expression estimation from pathology images that learns relative expression patterns instead of absolute values to handle noise and batch effects.
Details
Motivation: Current methods using point-wise loss functions struggle with accurate absolute gene expression estimation due to stochastic noise and batch effects from sequencing techniques and cellular variability. The authors propose focusing on relative expression patterns which are more consistent across experiments.Method: Proposes STRank, a novel loss function that models relative expression patterns rather than absolute levels. The method assumes relative expression levels show consistent patterns across independent experiments despite batch effects and noise in absolute values.
Result: Experiments on both synthetic and real datasets demonstrate the effectiveness of STRank in handling noise and batch effects for gene expression estimation from pathology images.
Conclusion: Learning relative expression patterns rather than absolute values provides a more robust approach to gene expression estimation from pathology images, effectively addressing challenges posed by stochastic noise and batch effects.
Abstract: Gene expression estimation from pathology images has the potential to reduce the RNA sequencing cost. Point-wise loss functions have been widely used to minimize the discrepancy between predicted and absolute gene expression values. However, due to the complexity of the sequencing techniques and intrinsic variability across cells, the observed gene expression contains stochastic noise and batch effects, and estimating the absolute expression values accurately remains a significant challenge. To mitigate this, we propose a novel objective of learning relative expression patterns rather than absolute levels. We assume that the relative expression levels of genes exhibit consistent patterns across independent experiments, even when absolute expression values are affected by batch effects and stochastic noise in tissue samples. Based on the assumption, we model the relation and propose a novel loss function called STRank that is robust to noise and batch effects. Experiments using synthetic datasets and real datasets demonstrate the effectiveness of the proposed method. The code is available at https://github.com/naivete5656/STRank.
[201] The Role of Entropy in Visual Grounding: Analysis and Optimization
Shuo Li, Jiajun Sun, Zhihao Zhang, Xiaoran Fan, Senjie Jin, Hui Li, Yuming Yang, Junjie Ye, Lixing Shen, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Main category: cs.CV
TL;DR: ECVGPO is an interpretable entropy control algorithm for visual grounding that balances exploration-exploitation trade-off, achieving broad improvements across benchmarks and models.
Details
Motivation: While entropy control techniques have advanced MLLM fine-tuning via RL, their role in perception tasks like visual grounding remains unexplored. The paper aims to analyze entropy characteristics in visual grounding vs. reasoning tasks and develop effective entropy regulation strategies.Method: ECVGPO (Entropy Control Visual Grounding Policy Optimization) - an interpretable algorithm for effective entropy regulation in visual grounding. It analyzes entropy role/characteristics in visual grounding compared to reasoning tasks, then implements entropy control to balance exploration-exploitation trade-off.
Result: ECVGPO achieves broad improvements across various benchmarks and models through effective entropy control.
Conclusion: The paper successfully addresses the unexplored role of entropy in visual grounding tasks, introduces an effective entropy control algorithm (ECVGPO), and demonstrates its effectiveness through improved performance across multiple benchmarks and models.
Abstract: Recent advances in fine-tuning multimodal large language models (MLLMs) using reinforcement learning have achieved remarkable progress, particularly with the introduction of various entropy control techniques. However, the role and characteristics of entropy in perception-oriented tasks like visual grounding, as well as effective strategies for controlling it, remain largely unexplored. To address this issue, we focus on the visual grounding task and analyze the role and characteristics of entropy in comparison to reasoning tasks. Building on these findings, we introduce ECVGPO (Entropy Control Visual Grounding Policy Optimization), an interpretable algorithm designed for effective entropy regulation. Through entropy control, the trade-off between exploration and exploitation is better balanced. Experiments show that ECVGPO achieves broad improvements across various benchmarks and models.
[202] Hierarchical Deep Learning for Diatom Image Classification: A Multi-Level Taxonomic Approach
Yueying Ke
Main category: cs.CV
TL;DR: Hierarchical CNN with cascaded heads for multi-level diatom classification outperforms flat models by improving accuracy at upper taxonomic levels and keeping errors taxonomically local.
Details
Motivation: Conventional diatom identification relies on expert taxonomists, and while deep learning helps, current approaches treat classification as flat, predicting only one taxonomic rank. The authors investigate whether embedding taxonomic hierarchy into neural networks can improve both accuracy and error locality.Method: Introduce hierarchical convolutional network with five cascaded heads predicting class, order, family, genus, and species. Each head receives shared backbone features and probability distributions from higher levels, with binary masks restricting predictions to valid descendants during training and inference. Use dataset of 1,456 diatom images covering 82 species and compare hierarchical vs flat models.
Result: Hierarchical model matches flat baselines at species level (69.4% accuracy) while outperforming at all upper taxonomic levels. When species predictions fail, 92.5% of misclassified species are correctly predicted at genus level (vs 67.2% for flat). Reduces mean taxonomic distance by 38.2% (1.209 vs 1.955). Improves class accuracy from 96.2% to 99.5% with 6-8% gains at upper levels.
Conclusion: Hierarchical embedding improves multi-level taxonomic classification through bidirectional mechanisms: top-down constraint masks restrict prediction space, while bottom-up gradients from fine-grained levels refine shared features. This produces more robust, interpretable, and biologically aligned predictions.
Abstract: Accurate taxonomic identification of diatoms is essential for aquatic ecosystem monitoring, yet conventional methods depend heavily on expert taxonomists. Recent deep learning approaches improve automation, but most treat diatom recognition as flat classification predicting only one taxonomic rank. We investigate whether embedding taxonomic hierarchy into neural network architectures can improve both accuracy and error locality. We introduce a hierarchical convolutional network with five cascaded heads that jointly predict class, order, family, genus, and species. Each head receives shared backbone features and probability distributions from higher levels, with binary masks restricting predictions to valid descendants during training and inference. Using a filtered dataset of 1,456 diatom images covering 82 species, we compare hierarchical and flat models under identical settings. The hierarchical model matches flat baselines at species level (69.4% accuracy) while outperforming at all upper taxonomic levels. When species predictions fail, errors remain taxonomically local: 92.5 % of misclassified species are correctly predicted at genus level, versus 67.2% for flat baselines. The hierarchical model reduces mean taxonomic distance by 38.2% (1.209 vs. 1.955). Progressive training reveals bidirectional mechanisms: hierarchical constraint masks operate top-down to constrain prediction space, while gradients from fine-grained levels propagate bottom-up through the shared backbone, refining features. This improves class accuracy from 96.2% to 99.5% and yields 6-8% gains at upper levels, producing more robust, interpretable, and biologically aligned predictions for multi-level taxonomic classification.
[203] MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning
Yueqian Wang, Songxiang Liu, Disong Wang, Nuo Xu, Guanglu Wan, Huishuai Zhang, Dongyan Zhao
Main category: cs.CV
TL;DR: Proactive Video MLLM that autonomously decides when to respond during video streaming using text-to-text approach and RL training without precise timing annotations.
Details
Motivation: Existing Video MLLMs operate in turn-based manner, but real-time applications need proactive interaction where models decide when to reply during video playback. Current methods face challenges with manual threshold tuning and precise reply time annotations.Method: Text-to-text approach where model determines whether to respond or remain silent at each turn based on dialogue history and visual context. Uses multi-turn RL training to encourage timely responses without precise timing annotations. Trained on 52k videos with two dialogue types via SFT and RL.
Result: MMDuet2 outperforms existing proactive Video MLLM baselines in response timing and quality, achieving state-of-the-art performance on ProactiveVideoQA benchmark.
Conclusion: The proposed proactive Video MLLM with RL training enables effective real-time video interaction without requiring precise response time annotations, advancing the field of proactive multimodal dialogue systems.
Abstract: Recent advances in video multimodal large language models (Video MLLMs) have significantly enhanced video understanding and multi-modal interaction capabilities. While most existing systems operate in a turn-based manner where the model can only reply after user turns, proactively deciding when to reply during video playback presents a promising yet challenging direction for real-time applications. In this work, we propose a novel text-to-text approach to proactive interaction, where the model autonomously determines whether to respond or remain silent at each turn based on dialogue history and visual context up to current frame of an streaming video. To overcome difficulties in previous methods such as manually tuning response decision thresholds and annotating precise reply times, we introduce a multi-turn RL based training method that encourages timely and accurate responses without requiring precise response time annotations. We train our model MMDuet2 on a dataset of 52k videos with two types of dialogues via SFT and RL. Experimental results demonstrate that MMDuet2 outperforms existing proactive Video MLLM baselines in response timing and quality, achieving state-of-the-art performance on the ProactiveVideoQA benchmark.
[204] Personalized Image Descriptions from Attention Sequences
Ruoyu Xue, Hieu Le, Jingyi Xu, Sounak Mondal, Abe Leite, Gregory Zelinsky, Minh Hoai, Dimitris Samaras
Main category: cs.CV
TL;DR: DEPER learns personalized subject embeddings combining linguistic style and viewing behavior to generate more human-aligned image descriptions through few-shot personalization.
Details
Motivation: Existing personalized image description models only focus on linguistic style, ignoring individual viewing patterns. People view images differently - focusing on different regions, objects, and details in varying orders - leading to substantial variability in descriptions.Method: DEPER (DEscription-PERception persona encoder) learns subject embeddings capturing both linguistic style and viewing behavior, guided by an auxiliary attention-prediction task. Uses lightweight adapter to align embeddings with frozen vision-language model for few-shot personalization without retraining.
Result: Across four datasets spanning diverse viewing tasks and both short/detailed descriptions, DEPER achieves 24% average improvement. Shows modeling personalized attention produces more human-aligned and high-quality descriptions.
Conclusion: Understanding how people see helps predict what they say; modeling human diversity in perception can improve both performance and human alignment in multimodal systems.
Abstract: People can view the same image differently: they focus on different regions, objects, and details in varying orders and describe them in distinct linguistic styles. This leads to substantial variability in image descriptions. However, existing models for personalized image description focus on linguistic style alone, with no prior work leveraging individual viewing patterns. We address this gap by explicitly modeling personalized viewing behavior as a core factor in description generation. Our method, DEPER (DEscription-PERception persona encoder), learns a subject embedding that captures both linguistic style and viewing behavior, guided by an auxiliary attention-prediction task. A lightweight adapter aligns these embeddings with a frozen vision-language model, enabling few-shot personalization without retraining. Across four datasets spanning diverse viewing tasks and both short and detailed descriptions, DEPER achieves a 24% average improvement, showing that modeling personalized attention produces more human-aligned and high-quality descriptions. We posit that understanding how people see helps predict what they say; modeling human diversity in perception can improve both performance and human alignment in multimodal systems.
[205] Task-Model Alignment: A Simple Path to Generalizable AI-Generated Image Detection
Ruoxin Chen, Jiahui Gao, Kaiqing Lin, Keyue Zhang, Yandan Zhao, Isabel Guan, Taiping Yao, Shouhong Ding
Main category: cs.CV
TL;DR: The paper proposes AlignGemini, a two-branch detector that aligns Vision Language Models (VLMs) with semantic tasks and pixel-artifact detectors with low-level artifact tasks, achieving +9.5% average accuracy gain on AIGI detection benchmarks.
Details
Motivation: Current VLMs used for AI-generated image detection require substantial resources and still suffer from severe hallucinations. The core issue is task-model misalignment: semantics-oriented VLMs lack sensitivity to fine-grained pixel artifacts, while conventional pixel-artifact detectors lack semantic awareness.Method: Formalizes AIGI detection as two complementary tasks: semantic consistency checking and pixel-artifact detection. Introduces Task-Model Alignment principle and implements AlignGemini - a two-branch detector with a VLM fine-tuned exclusively with pure semantic supervision and a pixel-artifact expert trained exclusively with pure pixel-artifact supervision.
Result: AlignGemini achieves a +9.5% gain in average accuracy on five in-the-wild benchmarks, demonstrating that task-model alignment is an effective approach for generalizable AIGI detection.
Conclusion: Neglecting either semantic consistency checking or pixel-artifact detection creates systematic blind spots in AIGI detection. The proposed Task-Model Alignment principle, instantiated as AlignGemini, effectively leverages complementary strengths of different model types for improved detection performance.
Abstract: Vision Language Models (VLMs) are increasingly adopted for AI-generated images (AIGI) detection, yet converting VLMs into detectors requires substantial resource, while the resulting models still exhibit severe hallucinations. To probe the core issue, we conduct an empirical analysis and observe two characteristic behaviors: (i) fine-tuning VLMs on high-level semantic supervision strengthens semantic discrimination and well generalize to unseen data; (ii) fine-tuning VLMs on low-level pixel-artifact supervision yields poor transfer. We attribute VLMs’ underperformance to task-model misalignment: semantics-oriented VLMs inherently lack sensitivity to fine-grained pixel artifacts, and semantically non-discriminative pixel artifacts thus exceeds their inductive biases. In contrast, we observe that conventional pixel-artifact detectors capture low-level pixel artifacts yet exhibit limited semantic awareness relative to VLMs, highlighting that distinct models are better matched to distinct tasks. In this paper, we formalize AIGI detection as two complementary tasks–semantic consistency checking and pixel-artifact detection–and show that neglecting either induces systematic blind spots. Guided by this view, we introduce the Task-Model Alignment principle and instantiate it as a two-branch detector, AlignGemini, comprising a VLM fine-tuned exclusively with pure semantic supervision and a pixel-artifact expert trained exclusively with pure pixel-artifact supervision. By enforcing orthogonal supervision on two simplified datasets, each branch trains to its strengths, producing complementary discrimination over semantic and pixel cues. On five in-the-wild benchmarks, AlignGemini delivers a +9.5 gain in average accuracy, supporting task-model alignment as an effective path to generalizable AIGI detection.
[206] Less Is More, but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior
Yulin Li, Haokun Gui, Ziyang Fan, Junjie Wang, Bin Kang, Bin Chen, Zhuotao Tian
Main category: cs.CV
TL;DR: DyToK is a training-free method that uses VLLMs’ attention mechanisms to dynamically compress video tokens by adjusting per-frame retention ratios based on semantic importance, achieving 4.3x faster inference while maintaining accuracy.
Details
Motivation: Current Video LLMs face efficiency bottlenecks due to quadratic computational growth with long video sequences. Existing keyframe sampling methods introduce additional computational costs and use suboptimal binary frame selection paradigms.Method: DyToK leverages VLLMs’ inherent attention mechanisms to encode query-conditioned keyframe priors, enabling dynamic token compression by adjusting per-frame token retention ratios based on semantic richness.
Result: DyToK achieves state-of-the-art efficiency-accuracy tradeoffs, shows plug-and-play compatibility with existing compression methods, attains 4.3x faster inference while preserving accuracy across multiple VLLMs like LLaVA-OneVision and Qwen2.5-VL.
Conclusion: DyToK provides an effective training-free paradigm for dynamic token compression in VLLMs by harnessing their inherent attention mechanisms, offering significant efficiency improvements without sacrificing accuracy.
Abstract: Recent advances in Video Large Language Models (VLLMs) have achieved remarkable video understanding capabilities, yet face critical efficiency bottlenecks due to quadratic computational growth with lengthy visual token sequences of long videos. While existing keyframe sampling methods can improve temporal modeling efficiency, additional computational cost is introduced before feature encoding, and the binary frame selection paradigm is found suboptimal. Therefore, in this work, we propose Dynamic Token compression via LLM-guided Keyframe prior (DyToK), a training-free paradigm that enables dynamic token compression by harnessing VLLMs’ inherent attention mechanisms. Our analysis reveals that VLLM attention layers naturally encoding query-conditioned keyframe priors, by which DyToK dynamically adjusts per-frame token retention ratios, prioritizing semantically rich frames while suppressing redundancies. Extensive experiments demonstrate that DyToK achieves state-of-the-art efficiency-accuracy tradeoffs. DyToK shows plug-and-play compatibility with existing compression methods, such as VisionZip and FastV, attaining 4.3x faster inference while preserving accuracy across multiple VLLMs, such as LLaVA-OneVision and Qwen2.5-VL. Code is available at https://github.com/yu-lin-li/DyToK .
[207] CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks
Yu Qi, Yumeng Zhang, Chenting Gong, Xiao Tan, Weiming Zhang, Wei Zhang, Jingdong Wang
Main category: cs.CV
TL;DR: CoT4Det improves LVLM perception performance by reformulating detection tasks into classification, counting, and grounding steps, boosting mAP from 19% to 33% on COCO2017.
Details
Motivation: Large Vision-Language Models (LVLMs) perform well on general vision-language tasks but struggle with perception-centric tasks like object detection, significantly underperforming task-specific models (e.g., only 19% mAP on COCO2017).Method: Chain-of-Thought for Detection (CoT4Det) reformulates perception tasks into three interpretable steps: classification (identifying objects), counting (determining quantities), and grounding (localizing objects) - each better aligned with LVLM reasoning capabilities.
Result: CoT4Det significantly improves perception performance: boosts mAP from 19.0% to 33.0% on COCO2017 val, outperforms baselines by +2% on RefCOCO series and 19% on Flickr30k entities, while maintaining general vision-language capabilities.
Conclusion: The proposed CoT4Det strategy effectively bridges the perception gap in LVLMs by decomposing complex detection tasks into more manageable reasoning steps, achieving substantial performance improvements without compromising other capabilities.
Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable success in a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks – such as object detection, semantic segmentation, and depth estimation – remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19% mAP on COCO2017 val, particularly struggling with dense scenes and small object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple but efficient strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding – each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by +2% on RefCOCO series and 19% on Flickr30k entities.
[208] NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification
Ziyang Song, Zelin Zang, Xiaofan Ye, Boqiang Xu, Long Bai, Jinlin Wu, Hongliang Ren, Hongbin Liu, Jiebo Luo, Zhen Lei
Main category: cs.CV
TL;DR: First multimodal benchmark (NeuroABench) for evaluating anatomical understanding in neurosurgical videos, revealing MLLMs significantly lag behind human performance with best model at 40.87% accuracy vs. human average of 46.5%.
Details
Motivation: Existing MLLM research focuses on surgical procedures and workflows but neglects anatomical comprehension, which is critical for surgeons to interpret, review, and learn from surgical videos. There's a gap in evaluating anatomical understanding in the neurosurgical domain.Method: Created NeuroABench with 9 hours of annotated neurosurgical videos covering 89 procedures using novel multimodal annotation pipeline with multiple review cycles. Evaluates identification of 68 clinical anatomical structures. Tested over 10 state-of-the-art MLLMs and compared with 4 neurosurgical trainees.
Result: Best-performing MLLM achieved only 40.87% accuracy in anatomical identification. Neurosurgical trainees scored 28-56% accuracy with average 46.5%. Best MLLM performed comparably to lowest-scoring student but significantly below human average performance.
Conclusion: NeuroABench reveals significant limitations in MLLMs’ anatomical understanding capabilities. While MLLMs show progress, substantial gap remains to achieve human-level performance in anatomical comprehension from surgical videos.
Abstract: Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. With improved zero-shot performance and more effective human-machine interaction, they provide a strong foundation for advancing surgical education and assistance. However, existing research and datasets primarily focus on understanding surgical procedures and workflows, while paying limited attention to the critical role of anatomical comprehension. In clinical practice, surgeons rely heavily on precise anatomical understanding to interpret, review, and learn from surgical videos. To fill this gap, we introduce the Neurosurgical Anatomy Benchmark (NeuroABench), the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures and is developed using a novel multimodal annotation pipeline with multiple review cycles. The benchmark evaluates the identification of 68 clinical anatomical structures, providing a rigorous and standardized framework for assessing model performance. Experiments on over 10 state-of-the-art MLLMs reveal significant limitations, with the best-performing model achieving only 40.87% accuracy in anatomical identification tasks. To further evaluate the benchmark, we extract a subset of the dataset and conduct an informative test with four neurosurgical trainees. The results show that the best-performing student achieves 56% accuracy, with the lowest scores of 28% and an average score of 46.5%. While the best MLLM performs comparably to the lowest-scoring student, it still lags significantly behind the group’s average performance. This comparison underscores both the progress of MLLMs in anatomical understanding and the substantial gap that remains in achieving human-level performance.
[209] 1 + 1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning
Shida Gao, Feng Xue, Xiangfeng Wang, Anlong Ming, Teng Long, Yihua Shao, Haozhe Wang, Zhaowen Lin, Wei Wang, Nicu Sebe
Main category: cs.CV
TL;DR: DEViL is a detector-empowered video LLM that improves spatio-temporal grounding by connecting a video LLM with an open-vocabulary detector via reference-semantic tokens, avoiding autoregressive spatial decoding issues.
Details
Motivation: Current MLLMs treat bounding boxes as text tokens and generate them autoregressively, causing long output sequences where spatial errors accumulate and localization results drift across videos.Method: DEViL couples a Video LLM with an open-vocabulary detector using reference-semantic tokens (RST) that distill query semantics and serve as both control signals and OVD text embedding replacements. Also includes tube-mined temporal regularization (TTReg) for temporal consistency.
Result: DEViL achieves strong performance across various fine-grained video understanding tasks, particularly STVG and GroundedVQA.
Conclusion: The proposed DEViL framework effectively addresses autoregressive spatial decoding issues in spatio-temporal grounding by integrating detector capabilities with video LLMs through semantic-aware tokens and temporal regularization.
Abstract: Spatio-temporal grounding and reasoning aims to locate the temporal segment and spatial region of an event in a video given a user query, while also reasoning about semantics such as causality, temporal order, and action relationships. To achieve this, current MLLMs primarily treats bounding boxes as text tokens and generates them autoregressively. However, such autoregressive spatial decoding leads to very-long output sequences, causing spatial errors to accumulated over time and the localization results to progressively drift across a video. To address this, we present a Detector-Empowered Video LLM, short for DEViL, which couples a Video LLM with an open-vocabulary detector (OVD). Specifically, the MLLM and detector are connected via a reference-semantic token (RST) that distills the user query into a rich semantic representation. Unlike tokens that merely serve as spatial prompts or segmentor switches, the RST functions as both a control signal and a replacement for the OVD’s text embedding, enabling end-to-end learning of both referential understanding and spatial localization. Furthermore, we propose a tube-mined temporal regularization (TTReg) within OVD, which drives the OVD to generate temporally-consistent queries for target objects, thereby ensuring effective temporal association. Experiments demonstrate that DEViL achieves strong performance across various fine-grained video understanding tasks, particularly STVG and GroundedVQA. Code will be released on https://github.com/gaostar123/DeViL.
[210] VisChainBench: A Benchmark for Multi-Turn, Multi-Image Visual Reasoning Beyond Language Priors
Wenbo Lyu, Yingjun Du, Jinglin Zhao, Xianton Zhen, Ling Shao
Main category: cs.CV
TL;DR: VisChainBench is a new benchmark for evaluating Large Vision-Language Models’ ability to perform multi-step visual reasoning across sequential, interdependent tasks with minimal language guidance.
Details
Motivation: Existing benchmarks focus on static comparisons and rely heavily on language cues, overlooking progressive context-dependent reasoning and visual-to-visual inference in multi-image, multi-turn scenarios.Method: Created VisChainBench with 1,457 tasks spanning over 20,000 images across three diverse domains, using a multi-agent generation pipeline to ensure visual diversity and controlled language bias.
Result: The benchmark contains large-scale, diverse evaluation data structured to mimic real-world decision-making processes, with all data and code publicly available.
Conclusion: VisChainBench addresses a critical gap in evaluating LVLMs’ multi-step visual reasoning capabilities and provides a comprehensive benchmark for future research.
Abstract: Understanding multi-image, multi-turn scenarios is a critical yet underexplored capability for Large Vision-Language Models (LVLMs). Existing benchmarks predominantly focus on static or horizontal comparisons – e.g., spotting visual differences or assessing appropriateness – while relying heavily on language cues. Such settings overlook progressive, context-dependent reasoning and the challenge of visual-to-visual inference. To bridge this gap, we present VisChainBench, a large-scale benchmark designed to rigorously evaluate LVLMs’ ability to perform multi-step visual reasoning across sequential, interdependent tasks with minimal language guidance. VisChainBench contains 1,457 tasks spanning over 20,000 images across three diverse domains (e.g., daily scenarios, engineering troubleshooting), structured to mimic real-world decision-making processes. Uniquely, the benchmark is constructed using a multi-agent generation pipeline, ensuring high visual diversity and controlled language bias. All the benchmark data and code for benchmark construction are available for viewing and download via following Link: https://huggingface.co/datasets/eyehole/VisChainBench
[211] RunawayEvil: Jailbreaking the Image-to-Video Generative Models
Songping Wang, Rufan Qian, Yueming Lyu, Qinglong Liu, Linzhuang Zou, Jie Qin, Songhua Liu, Caifeng Shan
Main category: cs.CV
TL;DR: RunawayEvil is the first multimodal jailbreak framework for Image-to-Video models that uses a self-evolving “Strategy-Tactic-Action” paradigm to generate coordinated text and image attacks, achieving state-of-the-art attack success rates on commercial I2V models.
Details
Motivation: The security of multimodal Image-to-Video generation systems is critically underexplored, particularly their vulnerability to jailbreak attacks. There's a need to understand and address these security vulnerabilities to build more robust video generation systems.Method: A three-component “Strategy-Tactic-Action” framework: (1) Strategy-Aware Command Unit uses reinforcement learning and LLMs for self-evolving attack strategies; (2) Multimodal Tactical Planning Unit generates coordinated text jailbreak instructions and image tampering guidelines; (3) Tactical Action Unit executes and evaluates the multimodal coordinated attacks.
Result: RunawayEvil achieves state-of-the-art attack success rates on commercial I2V models like Open-Sora 2.0 and CogVideoX, outperforming existing methods by 58.5 to 79 percent on COCO2017 dataset.
Conclusion: This work provides a critical tool for vulnerability analysis of I2V models, laying a foundation for developing more robust and secure video generation systems by exposing and understanding their security weaknesses.
Abstract: Image-to-Video (I2V) generation synthesizes dynamic visual content from image and text inputs, providing significant creative control. However, the security of such multimodal systems, particularly their vulnerability to jailbreak attacks, remains critically underexplored. To bridge this gap, we propose RunawayEvil, the first multimodal jailbreak framework for I2V models with dynamic evolutionary capability. Built on a “Strategy-Tactic-Action” paradigm, our framework exhibits self-amplifying attack through three core components: (1) Strategy-Aware Command Unit that enables the attack to self-evolve its strategies through reinforcement learning-driven strategy customization and LLM-based strategy exploration; (2) Multimodal Tactical Planning Unit that generates coordinated text jailbreak instructions and image tampering guidelines based on the selected strategies; (3) Tactical Action Unit that executes and evaluates the multimodal coordinated attacks. This self-evolving architecture allows the framework to continuously adapt and intensify its attack strategies without human intervention. Extensive experiments demonstrate RunawayEvil achieves state-of-the-art attack success rates on commercial I2V models, such as Open-Sora 2.0 and CogVideoX. Specifically, RunawayEvil outperforms existing methods by 58.5 to 79 percent on COCO2017. This work provides a critical tool for vulnerability analysis of I2V models, thereby laying a foundation for more robust video generation systems.
[212] Stitch and Tell: A Structured Multimodal Data Augmentation Method for Spatial Understanding
Hang Yin, Xiaomin He, PeiWen Yuan, Yiwei Li, Jiayi Shi, Wenxiao Fan, Shaoxiong Feng, Kan Li
Main category: cs.CV
TL;DR: SiTe is a plug-and-play method that stitches images along spatial axes and generates spatially-aware captions to inject structured spatial supervision into vision-language models, reducing spatial hallucinations while maintaining general capabilities.
Details
Motivation: Vision-language models suffer from spatial hallucinations (incorrect descriptions of object positions) due to asymmetric properties between images and text, requiring better spatial understanding capabilities.Method: SiTe constructs stitched image-text pairs by stitching images along a spatial axis and generating spatially-aware captions or QA pairs based on the layout, without costly models or human annotation.
Result: SiTe improves spatial understanding tasks (MME_Position +5.50%, Spatial-MM +4.19%) while maintaining or improving general vision-language benchmarks (COCO-QA +1.02%, MMBench +4.76%) across three architectures.
Conclusion: Explicitly injecting spatially-aware structure into training data effectively mitigates spatial hallucinations and improves spatial understanding while preserving general vision-language capabilities.
Abstract: Existing vision-language models often suffer from spatial hallucinations, i.e., generating incorrect descriptions about the relative positions of objects in an image. We argue that this problem mainly stems from the asymmetric properties between images and text. To enrich the spatial understanding ability of vision-language models, we propose a simple, annotation-free, plug-and-play method named $\text{Stitch and Tell}$ (abbreviated as SiTe), which injects structured spatial supervision into data. It constructs stitched image-text pairs by stitching images along a spatial axis and generating spatially-aware captions or question answer pairs based on the layout of stitched image, without relying on costly advanced models or human involvement. We evaluate SiTe across three architectures including LLaVA-v1.5-7B, LLaVA-Qwen2-1.5B and HALVA-7B, two training datasets, and eight benchmarks. Experiments show that SiTe improves spatial understanding tasks such as $\text{MME}_{\text{Position}}$ (+5.50%) and Spatial-MM (+4.19%), while maintaining or improving performance on general vision-language benchmarks including COCO-QA (+1.02%) and MMBench (+4.76%). Our findings suggest that explicitly injecting spatially-aware structure into training data offers an effective way to mitigate spatial hallucinations and improve spatial understanding, while preserving general vision-language capabilities.
[213] EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy
Yumeng He, Zanwei Zhou, Yekun Zheng, Chen Liang, Yunbo Wang, Xiaokang Yang
Main category: cs.CV
TL;DR: EMGauss: A Gaussian splatting-based framework for 3D reconstruction from 2D slices that treats slice progression as temporal evolution, avoiding isotropy assumptions and enabling continuous synthesis without large-scale pretraining.
Details
Motivation: Volume electron microscopy (vEM) produces anisotropic volumes with limited axial resolution due to acquisition trade-offs. Existing deep learning methods assume isotropy, which fails for morphologically anisotropic biological structures.Method: Reframes slice-to-3D reconstruction as 3D dynamic scene rendering using Gaussian splatting, modeling axial slice progression as temporal evolution of 2D Gaussian point clouds. Incorporates Teacher-Student bootstrapping to use high-confidence predictions on unobserved slices as pseudo-supervisory signals.
Result: EMGauss substantially improves interpolation quality compared to diffusion- and GAN-based methods, enables continuous slice synthesis, and eliminates the need for large-scale pretraining.
Conclusion: EMGauss provides a generalizable slice-to-3D solution that circumvents isotropy limitations, with applications beyond vEM to diverse imaging domains.
Abstract: Volume electron microscopy (vEM) enables nanoscale 3D imaging of biological structures but remains constrained by acquisition trade-offs, leading to anisotropic volumes with limited axial resolution. Existing deep learning methods seek to restore isotropy by leveraging lateral priors, yet their assumptions break down for morphologically anisotropic structures. We present EMGauss, a general framework for 3D reconstruction from planar scanned 2D slices with applications in vEM, which circumvents the inherent limitations of isotropy-based approaches. Our key innovation is to reframe slice-to-3D reconstruction as a 3D dynamic scene rendering problem based on Gaussian splatting, where the progression of axial slices is modeled as the temporal evolution of 2D Gaussian point clouds. To enhance fidelity in data-sparse regimes, we incorporate a Teacher-Student bootstrapping mechanism that uses high-confidence predictions on unobserved slices as pseudo-supervisory signals. Compared with diffusion- and GAN-based reconstruction methods, EMGauss substantially improves interpolation quality, enables continuous slice synthesis, and eliminates the need for large-scale pretraining. Beyond vEM, it potentially provides a generalizable slice-to-3D solution across diverse imaging domains.
[214] RDSplat: Robust Watermarking Against Diffusion Editing for 3D Gaussian Splatting
Longjie Zhao, Ziming Hong, Zhenyang Ren, Runnan Chen, Mingming Gong, Tongliang Liu
Main category: cs.CV
TL;DR: RDSplat introduces a robust watermarking method for 3D Gaussian Splatting that resists diffusion-based editing attacks by embedding watermarks in low-frequency Gaussians and using adversarial training.
Details
Motivation: Existing 3DGS watermarking methods are vulnerable to diffusion-based editing attacks that can erase embedded watermarks, creating an urgent need for watermarking techniques that are intrinsically resilient to such editing operations.Method: RDSplat uses a multi-domain framework that: (1) proactively targets low-frequency Gaussians that diffusion-based editing inherently preserves, (2) employs coordinated covariance regularization and 2D filtering to embed watermarks, and (3) uses Gaussian blur as a training surrogate for diffusion-based editing to enable adversarial fine-tuning that enhances robustness.
Result: Comprehensive evaluations on three benchmark datasets show RDSplat maintains superior robustness under diffusion-based editing while preserving watermark invisibility, achieving state-of-the-art performance.
Conclusion: RDSplat provides an effective solution for robust copyright protection of 3DGS assets against diffusion-based editing attacks, addressing a critical vulnerability in existing watermarking methods.
Abstract: 3D Gaussian Splatting (3DGS) has enabled the creation of digital assets and downstream applications, underscoring the need for robust copyright protection via digital watermarking. However, existing 3DGS watermarking methods remain highly vulnerable to diffusion-based editing, which can easily erase embedded provenance. This challenge highlights the urgent need for 3DGS watermarking techniques that are intrinsically resilient to diffusion-based editing. In this paper, we introduce RDSplat, a Robust watermarking paradigm against Diffusion editing for 3D Gaussian Splatting. RDSplat embeds watermarks into 3DGS components that diffusion-based editing inherently preserve, achieved through (i) proactively targeting low-frequency Gaussians and (ii) adversarial training with a diffusion proxy. Specifically, we introduce a multi-domain framework that operates natively in 3DGS space and embeds watermarks into diffusion-editing-preserved low-frequency Gaussians via coordinated covariance regularization and 2D filtering. In addition, we exploit the low-pass filtering behavior of diffusion-based editing by using Gaussian blur as an efficient training surrogate, enabling adversarial fine-tuning that further enhances watermark robustness against diffusion-based editing. Empirically, comprehensive quantitative and qualitative evaluations on three benchmark datasets demonstrate that RDSplat not only maintains superior robustness under diffusion-based editing, but also preserves watermark invisibility, achieving state-of-the-art performance.
[215] Graph Convolutional Long Short-Term Memory Attention Network for Post-Stroke Compensatory Movement Detection Based on Skeleton Data
Jiaxing Fan, Jiaojiao Liu, Wenkong Wang, Yang Zhang, Xin Ma, Jichen Zhang
Main category: cs.CV
TL;DR: A GCN-LSTM-ATT network using skeleton data outperforms traditional ML methods in detecting compensatory movements during stroke rehabilitation.
Details
Motivation: Most stroke patients experience upper limb motor dysfunction, and compensatory movements during rehabilitation training are detrimental to long-term recovery, making detection of these movements important.Method: Proposed a Graph Convolutional Long Short-Term Memory Attention Network (GCN-LSTM-ATT) based on skeleton data collected from 16 stroke patients using Kinect depth camera during rehabilitation movements. Compared with SVM, KNN, and Random Forest models.
Result: GCN-LSTM-ATT achieved 0.8580 detection accuracy, significantly higher than traditional machine learning algorithms. Ablation experiments showed each component contributed significantly to performance improvement.
Conclusion: The model provides a more precise and powerful tool for detecting compensatory movements after stroke, which could facilitate optimization of rehabilitation training strategies for stroke patients.
Abstract: Most stroke patients experience upper limb motor dysfunction. Compensatory movements are prevalent during rehabilitation training, which is detrimental to patients’ long-term recovery. Therefore, detecting compensatory movements is of great significance. In this study, a Graph Convolutional Long Short-Term Memory Attention Network (GCN-LSTM-ATT) based on skeleton data is proposed for the detection of compensatory movements after stroke. Sixteen stroke patients were selected in the research. The skeleton data of the patients performing specific rehabilitation movements were collected using the Kinect depth camera. After data processing, detection models were constructed respectively using the GCN-LSTM-ATT model, the Support Vector Machine(SVM), the K-Nearest Neighbor algorithm(KNN), and the Random Forest(RF). The results show that the detection accuracy of the GCN-LSTM-ATT model reaches 0.8580, which is significantly higher than that of traditional machine learning algorithms. Ablation experiments indicate that each component of the model contributes significantly to the performance improvement. These findings provide a more precise and powerful tool for the detection of compensatory movements after stroke, and are expected to facilitate the optimization of rehabilitation training strategies for stroke patients.
[216] Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models
Fenghua Weng, Chaochao Lu, Xia Hu, Wenqi Shao, Wenjie Wang
Main category: cs.CV
TL;DR: TRR (Think-Reflect-Revise) is a three-stage training framework that enhances LVLM safety through policy-guided self-reflection, addressing vulnerabilities in single-pass reasoning approaches.
Details
Motivation: Single-pass think-then-answer paradigms in LVLMs remain vulnerable to jailbreak attacks, as they may overlook harmful content in their own output. The authors propose exploiting this wasted signal through reflection for genuine self-correction.Method: Three-stage framework: 1) Build ReSafe dataset with 5,000 think-reflect-revise examples, 2) Fine-tune target model using ReSafe to initialize reflective behavior, 3) Reinforce policy-guided reflection through reinforcement learning.
Result: TRR substantially improves safety performance, increasing safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B across safety-awareness benchmarks and jailbreak attacks, while preserving performance on general benchmarks like MMMU and MMStar.
Conclusion: The TRR framework effectively enhances LVLM safety alignment through self-reflection, addressing critical vulnerabilities in existing safety-oriented reasoning approaches while maintaining general capabilities.
Abstract: As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, aiming to enhance safety awareness by analyzing potential safety risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks. This reveals a critical flaw: single-pass reasoning may overlook explicit harmful content in its own output. Our key insight is to exploit this wasted signal through reflection, which can effectively leverage the malicious content revealed in the first-pass reasoning to enable genuine self-correction and prevent unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training framework designed to enhance the safety alignment of LVLMs through policy-guided self-reflection. We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning. Experimental results show that TRR substantially improves the safety performance of LVLMs across both safety-awareness benchmarks and jailbreak attack evaluations, increasing the overall safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B, while preserving stable performance on general benchmarks such as MMMU and MMStar. The project page is available at https://think-reflect-revise.github.io/.
[217] FedSCAl: Leveraging Server and Client Alignment for Unsupervised Federated Source-Free Domain Adaptation
M Yashwanth, Sampath Koti, Arunabh Singh, Shyam Marjit, Anirban Chakraborty
Main category: cs.CV
TL;DR: FedSCAl is an FL framework for Federated Source-Free Domain Adaptation that uses Server-Client Alignment to mitigate client-drift and improve pseudo-labeling accuracy across heterogeneous client domains.
Details
Motivation: Address the Federated Source-Free Domain Adaptation (FFreeDA) problem where clients have unlabeled data with significant inter-client domain gaps, and only a pre-trained server model is available without access to source data during training. Existing SFDA methods struggle with client-drift in FL due to extreme data heterogeneity.Method: FedSCAl framework with Server-Client Alignment (SCAl) mechanism that regularizes client updates by aligning the predictions of client and server models. This alignment helps mitigate client-drift and improves pseudo-labeling accuracy.
Result: FedSCAl consistently outperforms state-of-the-art FL methods in the FFreeDA setup for classification tasks on benchmark vision datasets, showing improved pseudo-labeling accuracy post-alignment.
Conclusion: The proposed FedSCAl framework effectively addresses FFreeDA challenges by using Server-Client Alignment to reduce client-drift and enhance adaptation performance in federated learning with heterogeneous client domains and source-free constraints.
Abstract: We address the Federated source-Free Domain Adaptation (FFreeDA) problem, with clients holding unlabeled data with significant inter-client domain gaps. The FFreeDA setup constrains the FL frameworks to employ only a pre-trained server model as the setup restricts access to the source dataset during the training rounds. Often, this source domain dataset has a distinct distribution to the clients’ domains. To address the challenges posed by the FFreeDA setup, adaptation of the Source-Free Domain Adaptation (SFDA) methods to FL struggles with client-drift in real-world scenarios due to extreme data heterogeneity caused by the aforementioned domain gaps, resulting in unreliable pseudo-labels. In this paper, we introduce FedSCAl, an FL framework leveraging our proposed Server-Client Alignment (SCAl) mechanism to regularize client updates by aligning the clients’ and server model’s predictions. We observe an improvement in the clients’ pseudo-labeling accuracy post alignment, as the SCAl mechanism helps to mitigate the client-drift. Further, we present extensive experiments on benchmark vision datasets showcasing how FedSCAl consistently outperforms state-of-the-art FL methods in the FFreeDA setup for classification tasks.
[218] Generating Storytelling Images with Rich Chains-of-Reasoning
Xiujie Song, Qi Jia, Shota Watanabe, Xiaoyi Pang, Ruijie Chen, Mengyue Wu, Kenny Q. Zhu
Main category: cs.CV
TL;DR: The paper introduces Storytelling Image Generation - using AI to create images with rich, logically connected visual clues that tell compelling stories through Chains-of-Reasoning.
Details
Motivation: Storytelling Images convey multi-layered information through visual clues but are challenging to create manually, making them scarce despite their diverse applications in illustration creation, cognitive screening, and beyond.Method: Proposes StorytellingPainter, a two-stage pipeline combining LLMs’ creative reasoning with T2I models’ visual synthesis. Also develops Mini-Storytellers (lightweight LLMs) and a dedicated evaluation framework with three evaluators: Semantic Complexity, KNN-based Diversity, and Story-Image Alignment.
Result: Experimental results demonstrate the feasibility and effectiveness of the approaches, showing that generative AI can successfully create Storytelling Images with rich semantic content.
Conclusion: The paper successfully introduces and validates a framework for Storytelling Image Generation, addressing the scarcity of such images by leveraging AI models while providing comprehensive evaluation methods to assess the quality of generated storytelling images.
Abstract: An image can convey a compelling story by presenting rich, logically connected visual clues. These connections form Chains-of-Reasoning (CoRs) within the image, enabling viewers to infer events, causal relationships, and other information, thereby understanding the underlying story. In this paper, we focus on these semantically rich images and define them as Storytelling Images. Such images have diverse applications beyond illustration creation and cognitive screening, leveraging their ability to convey multi-layered information visually and inspire active interpretation. However, due to their complex semantic nature, Storytelling Images are inherently challenging to create, and thus remain relatively scarce. To address this challenge, we introduce the Storytelling Image Generation task, which explores how generative AI models can be leveraged to create such images. Specifically, we propose a two-stage pipeline, StorytellingPainter, which combines the creative reasoning abilities of Large Language Models (LLMs) with the visual synthesis capabilities of Text-to-Image (T2I) models to generate Storytelling Images. Alongside this pipeline, we develop a dedicated evaluation framework comprising three main evaluators: a Semantic Complexity Evaluator, a KNN-based Diversity Evaluator and a Story-Image Alignment Evaluator. Given the critical role of story generation in the Storytelling Image Generation task and the performance disparity between open-source and proprietary LLMs, we further explore tailored training strategies to reduce this gap, resulting in a series of lightweight yet effective models named Mini-Storytellers. Experimental results demonstrate the feasibility and effectiveness of our approaches. The code is available at https://github.com/xiujiesong/StorytellingImageGeneration.
[219] UARE: A Unified Vision-Language Model for Image Quality Assessment, Restoration, and Enhancement
Weiqi Li, Xuanyu Zhang, Bin Chen, Jingfen Xie, Yan Wang, Kexin Zhang, Junlin Li, Li Zhang, Jian Zhang, Shijie Zhao
Main category: cs.CV
TL;DR: UARE is the first unified vision-language model for image quality assessment, restoration, and enhancement that uses IQA signals to guide restoration through multi-task co-training.
Details
Motivation: Image quality assessment and restoration are conceptually connected but typically treated separately. Recent unified multimodal models show that better understanding can improve generation, motivating a single model that unifies IQA and restoration to study how quality assessment can guide restoration.Method: Built on pretrained unified understanding and generation models with a two-stage training framework: 1) progressive easy-to-hard schedule from single-type distortions to mixed degradations, 2) unified fine-tuning of quality understanding and restoration with interleaved text-image data to align IQA signals with restoration objectives.
Result: Extensive experiments across IQA, restoration, and enhancement tasks demonstrate the effectiveness of UARE, showing that multi-task co-training allows IQA to boost restoration and enhancement performance.
Conclusion: UARE successfully unifies image quality assessment, restoration, and enhancement in a single model, demonstrating that explicit integration of IQA signals can guide and improve restoration outcomes, addressing a previously underexplored but valuable research direction.
Abstract: Image quality assessment (IQA) and image restoration are fundamental problems in low-level vision. Although IQA and restoration are closely connected conceptually, most existing work treats them in isolation. Recent advances in unified multimodal understanding-generation models demonstrate promising results and indicate that stronger understanding can improve generative performance. This motivates a single model that unifies IQA and restoration and explicitly studies how IQA can guide restoration, a setting that remains largely underexplored yet highly valuable. In this paper, we propose UARE, to our knowledge the first Unified vision-language model for image quality Assessment, Restoration, and Enhancement. Built on pretrained unified understanding and generation models, we introduce a two-stage training framework. First, a progressive, easy-to-hard schedule expands from single-type distortions to higher-order mixed degradations, enabling UARE to handle multiple degradations. Second, we perform unified fine-tuning of quality understanding and restoration with interleaved text-image data, aligning IQA signals with restoration objectives. Through multi-task co-training, UARE leverages IQA to boost restoration and enhancement performance. Extensive experiments across IQA, restoration, and enhancement tasks demonstrate the effectiveness of UARE. The code and models will be available at https://github.com/lwq20020127/UARE.
[220] JOCA: Task-Driven Joint Optimisation of Camera Hardware and Adaptive Camera Control Algorithms
Chengyang Yan, Mitch Bryson, Donald G. Dansereau
Main category: cs.CV
TL;DR: Joint optimization of camera hardware and adaptive control algorithms with vision tasks using hybrid gradient-based/derivative-free methods improves perception performance.
Details
Motivation: Most prior camera-perception co-design approaches focus only on fixed manufacturing parameters, but many camera parameters (like exposure) require adaptive runtime control. There's a need to jointly optimize both hardware and adaptive control algorithms with downstream vision tasks.Method: Proposes a unified optimization framework combining gradient-based and derivative-free methods to handle continuous/discrete parameters, non-differentiable image formation processes, and neural network-based adaptive control. Introduces DF-Grad, a hybrid strategy that trains adaptive control networks using signals from derivative-free optimization alongside unsupervised task-driven learning.
Result: Method outperforms baselines that optimize static and dynamic parameters separately, especially under challenging conditions like low light and fast motion. Demonstrates improved perception performance through joint hardware-adaptive control optimization.
Conclusion: Jointly optimizing camera hardware parameters and adaptive control algorithms significantly improves perception performance and provides a unified approach to task-driven camera system design.
Abstract: The quality of captured images strongly influences the performance of downstream perception tasks. Recent works on co-designing camera systems with perception tasks have shown improved task performance. However, most prior approaches focus on optimising fixed camera parameters set at manufacturing, while many parameters, such as exposure settings, require adaptive control at runtime. This paper introduces a method that jointly optimises camera hardware and adaptive camera control algorithms with downstream vision tasks. We present a unified optimisation framework that integrates gradient-based and derivative-free methods, enabling support for both continuous and discrete parameters, non-differentiable image formation processes, and neural network-based adaptive control algorithms. To address non-differentiable effects such as motion blur, we propose DF-Grad, a hybrid optimisation strategy that trains adaptive control networks using signals from a derivative-free optimiser alongside unsupervised task-driven learning. Experiments show that our method outperforms baselines that optimise static and dynamic parameters separately, particularly under challenging conditions such as low light and fast motion. These results demonstrate that jointly optimising hardware parameters and adaptive control algorithms improves perception performance and provides a unified approach to task-driven camera system design.
[221] Physics Informed Human Posture Estimation Based on 3D Landmarks from Monocular RGB-Videos
Tobias Leuthold, Michele Xiloyannis, Yves Zimmermann
Main category: cs.CV
TL;DR: Real-time post-processing algorithm that enhances BlazePose 3D pose estimation by incorporating anatomical constraints and biomechanical models, reducing errors by 10-16% while maintaining computational efficiency for consumer devices.
Details
Motivation: Current pose estimation models like BlazePose lack anatomical constraints, limiting their accuracy for automated physical coaching applications such as physiotherapy and sports training. There's a need for more robust and anatomically consistent pose estimation that can run efficiently on consumer devices.Method: A real-time post-processing algorithm that fuses BlazePose 3D and 2D estimations using weighted optimization. The method penalizes deviations from expected bone lengths and biomechanical models, with bone length estimations refined using a Kalman filter with adaptive measurement trust.
Result: Evaluation on Physio2.2M dataset shows 10.2% reduction in 3D MPJPE (Mean Per Joint Position Error) and 16.6% decrease in errors of angles between body segments compared to standard BlazePose 3D estimation.
Conclusion: The method provides robust, anatomically consistent pose estimation suitable for automated physiotherapy, healthcare, and sports coaching on consumer-level devices, with backend processing using anonymized data only.
Abstract: Applications providing automated coaching for physical training are increasing in popularity, for example physical therapy. These applications rely on accurate and robust pose estimation using monocular video streams. State-of-the-art models like BlazePose excel in real-time pose tracking, but their lack of anatomical constraints indicates improvement potential by including physical knowledge. We present a real-time post-processing algorithm fusing the strengths of BlazePose 3D and 2D estimations using a weighted optimization, penalizing deviations from expected bone length and biomechanical models. Bone length estimations are refined to the individual anatomy using a Kalman filter with adapting measurement trust. Evaluation using the Physio2.2M dataset shows a 10.2 percent reduction in 3D MPJPE and a 16.6 percent decrease in errors of angles between body segments compared to BlazePose 3D estimation. Our method provides a robust, anatomically consistent pose estimation based on a computationally efficient video-to-3D pose estimation, suitable for automated physiotherapy, healthcare, and sports coaching on consumer-level laptops and mobile devices. The refinement runs on the backend with anonymized data only.
[222] Generalized Geometry Encoding Volume for Real-time Stereo Matching
Jiaxin Liu, Gangwei Xu, Xianqi Wang, Chengliang Zhang, Xin Yang
Main category: cs.CV
TL;DR: GGEV is a real-time stereo matching network that achieves strong generalization by using depth-aware features and dynamic cost aggregation, outperforming existing real-time methods in zero-shot generalization.
Details
Motivation: Real-time stereo matching methods focus on in-domain performance but neglect generalization, while foundation models have good generalization but slow inference. There's a need for real-time methods with strong generalization capabilities.Method: Proposes Generalized Geometry Encoding Volume (GGEV) with two key components: 1) extracting depth-aware features encoding domain-invariant structural priors, and 2) Depth-aware Dynamic Cost Aggregation (DDCA) module that adaptively incorporates these priors into each disparity hypothesis.
Result: GGEV surpasses all existing real-time methods in zero-shot generalization capability and achieves state-of-the-art performance on KITTI 2012, KITTI 2015, and ETH3D benchmarks.
Conclusion: GGEV successfully addresses the trade-off between real-time performance and generalization in stereo matching, providing a lightweight yet effective solution for real-world applications.
Abstract: Real-time stereo matching methods primarily focus on enhancing in-domain performance but often overlook the critical importance of generalization in real-world applications. In contrast, recent stereo foundation models leverage monocular foundation models (MFMs) to improve generalization, but typically suffer from substantial inference latency. To address this trade-off, we propose Generalized Geometry Encoding Volume (GGEV), a novel real-time stereo matching network that achieves strong generalization. We first extract depth-aware features that encode domain-invariant structural priors as guidance for cost aggregation. Subsequently, we introduce a Depth-aware Dynamic Cost Aggregation (DDCA) module that adaptively incorporates these priors into each disparity hypothesis, effectively enhancing fragile matching relationships in unseen scenes. Both steps are lightweight and complementary, leading to the construction of a generalized geometry encoding volume with strong generalization capability. Experimental results demonstrate that our GGEV surpasses all existing real-time methods in zero-shot generalization capability, and achieves state-of-the-art performance on the KITTI 2012, KITTI 2015, and ETH3D benchmarks.
[223] VDOT: Efficient Unified Video Creation via Optimal Transport Distillation
Yutong Wang, Haiyu Zhang, Tianfan Xue, Yu Qiao, Yaohui Wang, Chang Xu, Xinyuan Chen
Main category: cs.CV
TL;DR: VDOT is an efficient unified video creation model that uses distribution matching distillation with optimal transport to achieve high-quality video generation in just 4 steps, outperforming models requiring 100 steps.
Details
Motivation: Existing video creation models are either limited to specific conditions or too slow for practical use due to complex inference processes. There's a need for an efficient, unified model that can handle multiple video creation tasks in real-world applications.Method: VDOT uses distribution matching distillation with computational optimal transport instead of KL divergence to optimize discrepancy between real and fake score distributions. It integrates a discriminator for better video quality perception and includes an automated pipeline for video data annotation/filtering across multiple tasks.
Result: The 4-step VDOT model outperforms or matches other baselines that require 100 denoising steps, demonstrating superior efficiency and quality in video generation.
Conclusion: VDOT provides an efficient, unified solution for video creation that addresses the limitations of existing models, making video generation practical for real-world applications through optimal transport-based distillation and comprehensive training infrastructure.
Abstract: The rapid development of generative models has significantly advanced image and video applications. Among these, video creation, aimed at generating videos under various conditions, has gained substantial attention. However, existing video creation models either focus solely on a few specific conditions or suffer from excessively long generation times due to complex model inference, making them impractical for real-world applications. To mitigate these issues, we propose an efficient unified video creation model, named VDOT. Concretely, we model the training process with the distribution matching distillation (DMD) paradigm. Instead of using the Kullback-Leibler (KL) minimization, we additionally employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. The OT distance inherently imposes geometric constraints, mitigating potential zero-forcing or gradient collapse issues that may arise during KL-based distillation within the few-step generation scenario, and thus, enhances the efficiency and stability of the distillation process. Further, we integrate a discriminator to enable the model to perceive real video data, thereby enhancing the quality of generated videos. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering that accommodates multiple video creation tasks. Meanwhile, we curate a unified testing benchmark, UVCBench, to standardize evaluation. Experiments demonstrate that our 4-step VDOT outperforms or matches other baselines with 100 denoising steps.
[224] JoPano: Unified Panorama Generation via Joint Modeling
Wancheng Feng, Chen An, Zhenliang He, Meina Kan, Shiguang Shan, Lukun Wang
Main category: cs.CV
TL;DR: JoPano is a unified DiT-based approach for panorama generation that addresses both text-to-panorama and view-to-panorama tasks using a Joint-Face Adapter and condition switching mechanism, achieving state-of-the-art performance.
Details
Motivation: Existing panorama generation methods have two major limitations: U-Net architectures constrain visual quality, and treating text-to-panorama and view-to-panorama tasks independently leads to modeling redundancy and inefficiency.Method: Proposes a DiT-based model with a Joint-Face Adapter built on cubemap representation to transfer pretrained DiT capabilities to panorama domain, uses Poisson Blending to reduce seam inconsistencies, and introduces a condition switching mechanism to unify both tasks.
Result: Achieves state-of-the-art performance on FID, CLIP-FID, IS, and CLIP-Score metrics, generating high-quality panoramas for both text-to-panorama and view-to-panorama tasks with improved seam consistency.
Conclusion: JoPano successfully unifies panorama generation tasks within a single efficient model, overcoming limitations of previous approaches while maintaining high visual quality and seam consistency.
Abstract: Panorama generation has recently attracted growing interest in the research community, with two core tasks, text-to-panorama and view-to-panorama generation. However, existing methods still face two major challenges: their U-Net-based architectures constrain the visual quality of the generated panoramas, and they usually treat the two core tasks independently, which leads to modeling redundancy and inefficiency. To overcome these challenges, we propose a joint-face panorama (JoPano) generation approach that unifies the two core tasks within a DiT-based model. To transfer the rich generative capabilities of existing DiT backbones learned from natural images to the panorama domain, we propose a Joint-Face Adapter built on the cubemap representation of panoramas, which enables a pretrained DiT to jointly model and generate different views of a panorama. We further apply Poisson Blending to reduce seam inconsistencies that often appear at the boundaries between cube faces. Correspondingly, we introduce Seam-SSIM and Seam-Sobel metrics to quantitatively evaluate the seam consistency. Moreover, we propose a condition switching mechanism that unifies text-to-panorama and view-to-panorama tasks within a single model. Comprehensive experiments show that JoPano can generate high-quality panoramas for both text-to-panorama and view-to-panorama generation tasks, achieving state-of-the-art performance on FID, CLIP-FID, IS, and CLIP-Score metrics.
[225] Toward More Reliable Artificial Intelligence: Reducing Hallucinations in Vision-Language Models
Kassoum Sanogo, Renzo Ardiccioni
Main category: cs.CV
TL;DR: Training-free self-correction framework for VLMs that reduces hallucinations by 9.8% through uncertainty-guided visual re-attention without model updates.
Details
Motivation: Vision-language models often generate plausible but incorrect claims (hallucinations) about image content, which undermines their trustworthiness and reliability.Method: Proposes a training-free framework combining multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, claim confidence) with attention-guided cropping of under-explored regions, enabling iterative response refinement without gradient updates.
Result: Reduces hallucination rates by 9.8 percentage points on POPE and MMHAL BENCH benchmarks using Qwen2.5-VL-7B, improves object existence accuracy by 4.7 points on adversarial splits, and successfully grounds corrections in visual evidence where standard decoding fails.
Conclusion: The uncertainty-guided self-correction framework effectively reduces hallucinations in VLMs without requiring model retraining, demonstrating a practical approach for improving multimodal system trustworthiness.
Abstract: Vision-language models (VLMs) frequently generate hallucinated content plausible but incorrect claims about image content. We propose a training-free self-correction framework enabling VLMs to iteratively refine responses through uncertainty-guided visual re-attention. Our method combines multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, claim confidence) with attention-guided cropping of under-explored regions. Operating entirely with frozen, pretrained VLMs, our framework requires no gradient updates. We validate our approach on the POPE and MMHAL BENCH benchmarks using the Qwen2.5-VL-7B [23] architecture. Experimental results demonstrate that our method reduces hallucination rates by 9.8 percentage points compared to the baseline, while improving object existence accuracy by 4.7 points on adversarial splits. Furthermore, qualitative analysis confirms that uncertainty-guided re-attention successfully grounds corrections in visual evidence where standard decoding fails. We validate our approach on Qwen2.5-VL-7B [23], with plans to extend validation across diverse architectures in future versions. We release our code and methodology to facilitate future research in trustworthy multimodal systems.
[226] MeshSplatting: Differentiable Rendering with Opaque Meshes
Jan Held, Sanghyun Son, Renaud Vandeghen, Daniel Rebain, Matheus Gadelha, Yi Zhou, Anthony Cioppa, Ming C. Lin, Marc Van Droogenbroeck, Andrea Tagliasacchi
Main category: cs.CV
TL;DR: MeshSplatting bridges neural rendering and interactive 3D graphics by creating mesh-based representations from differentiable rendering, achieving better quality and efficiency than current mesh-based methods.
Details
Motivation: Current primitive-based splatting methods (like 3D Gaussian Splatting) are incompatible with mesh-based pipelines used in AR/VR and game engines, creating a gap between neural rendering and interactive 3D graphics.Method: MeshSplatting jointly optimizes geometry and appearance through differentiable rendering, enforces connectivity via restricted Delaunay triangulation, and refines surface consistency to create smooth, high-quality meshes.
Result: On Mip-NeRF360, MeshSplatting achieves +0.69 dB PSNR improvement over state-of-the-art MiLo for mesh-based novel view synthesis, while training 2x faster and using 2x less memory.
Conclusion: MeshSplatting bridges neural rendering and interactive 3D graphics, enabling seamless real-time scene interaction in standard 3D engines while maintaining high visual quality and efficiency.
Abstract: Primitive-based splatting methods like 3D Gaussian Splatting have revolutionized novel view synthesis with real-time rendering. However, their point-based representations remain incompatible with mesh-based pipelines that power AR/VR and game engines. We present MeshSplatting, a mesh-based reconstruction approach that jointly optimizes geometry and appearance through differentiable rendering. By enforcing connectivity via restricted Delaunay triangulation and refining surface consistency, MeshSplatting creates end-to-end smooth, visually high-quality meshes that render efficiently in real-time 3D engines. On Mip-NeRF360, it boosts PSNR by +0.69 dB over the current state-of-the-art MiLo for mesh-based novel view synthesis, while training 2x faster and using 2x less memory, bridging neural rendering and interactive 3D graphics for seamless real-time scene interaction. The project page is available at https://meshsplatting.github.io/.
[227] SparseCoop: Cooperative Perception with Kinematic-Grounded Queries
Jiahao Wang, Zhongwei Jiang, Wenchao Sun, Jiaru Zhong, Haibao Yu, Yuner Zhang, Chenyang Lu, Chuang Zhang, Lei He, Shaobing Xu, Jianqiang Wang
Main category: cs.CV
TL;DR: SparseCoop: A fully sparse cooperative perception framework for 3D detection and tracking that eliminates BEV representations, using kinematic-grounded instance queries for precise alignment with low communication costs.
Details
Motivation: Current cooperative perception methods face limitations: dense BEV feature sharing has quadratic communication costs and lacks flexibility for precise alignment across asynchronous/disparate viewpoints. Sparse query-based alternatives suffer from inadequate geometric representations, suboptimal fusion, and training instability.Method: Three key innovations: 1) Kinematic-grounded instance query with explicit state vector (3D geometry + velocity) for precise spatio-temporal alignment; 2) Coarse-to-fine aggregation module for robust fusion; 3) Cooperative instance denoising task to accelerate and stabilize training.
Result: Achieves state-of-the-art performance on V2X-Seq and Griffin datasets with superior computational efficiency, low transmission cost, and strong robustness to communication latency.
Conclusion: SparseCoop demonstrates that fully sparse cooperative perception without BEV representations can achieve high performance while addressing communication efficiency and alignment challenges in autonomous driving.
Abstract: Cooperative perception is critical for autonomous driving, overcoming the inherent limitations of a single vehicle, such as occlusions and constrained fields-of-view. However, current approaches sharing dense Bird’s-Eye-View (BEV) features are constrained by quadratically-scaling communication costs and the lack of flexibility and interpretability for precise alignment across asynchronous or disparate viewpoints. While emerging sparse query-based methods offer an alternative, they often suffer from inadequate geometric representations, suboptimal fusion strategies, and training instability. In this paper, we propose SparseCoop, a fully sparse cooperative perception framework for 3D detection and tracking that completely discards intermediate BEV representations. Our framework features a trio of innovations: a kinematic-grounded instance query that uses an explicit state vector with 3D geometry and velocity for precise spatio-temporal alignment; a coarse-to-fine aggregation module for robust fusion; and a cooperative instance denoising task to accelerate and stabilize training. Experiments on V2X-Seq and Griffin datasets show SparseCoop achieves state-of-the-art performance. Notably, it delivers this with superior computational efficiency, low transmission cost, and strong robustness to communication latency. Code is available at https://github.com/wang-jh18-SVM/SparseCoop.
[228] CADE: Continual Weakly-supervised Video Anomaly Detection with Ensembles
Satoshi Hashimoto, Tatsuya Konishi, Tomoya Kaichi, Kazunori Matsumoto, Mori Kurokawa
Main category: cs.CV
TL;DR: CADE is a novel continual learning approach for weakly-supervised video anomaly detection that addresses domain shift and forgetting by using Dual-Generator for data imbalance and Multi-Discriminator ensembles to capture missed anomalies.
Details
Motivation: Existing weakly-supervised VAD methods focus on static datasets and neglect domain shift. Continual learning is needed to prevent forgetting when adapting to new data domains, as forgetting causes models to become biased toward certain anomaly modes and miss various anomalies.Method: CADE combines continual learning with weakly-supervised VAD using: 1) Dual-Generator to address data imbalance and label uncertainty in WVAD, and 2) Multi-Discriminator ensembles that capture missed anomalies from past scenes due to forgetting, using multiple models.
Result: Extensive experiments show CADE significantly outperforms existing VAD methods on common multi-scene VAD datasets including ShanghaiTech and Charlotte Anomaly datasets.
Conclusion: CADE is the first work to combine continual learning and weakly-supervised VAD perspectives, effectively addressing domain shift and forgetting in video anomaly detection through ensemble-based continual learning.
Abstract: Video anomaly detection (VAD) has long been studied as a crucial problem in public security and crime prevention. In recent years, weakly-supervised VAD (WVAD) have attracted considerable attention due to their easy annotation process and promising research results. While existing WVAD methods tackle mainly on static datasets, the possibility that the domain of data can vary has been neglected. To adapt such domain-shift, the continual learning (CL) perspective is required because otherwise additional training only with new coming data could easily cause performance degradation for previous data, i.e., forgetting. Therefore, we propose a brand-new approach, called Continual Anomaly Detection with Ensembles (CADE) that is the first work combining CL and WVAD viewpoints. Specifically, CADE uses the Dual-Generator(DG) to address data imbalance and label uncertainty in WVAD. We also found that forgetting exacerbates the “incompleteness’’ where the model becomes biased towards certain anomaly modes, leading to missed detections of various anomalies. To address this, we propose to ensemble Multi-Discriminator (MD) that capture missed anomalies in past scenes due to forgetting, using multiple models. Extensive experiments show that CADE significantly outperforms existing VAD methods on the common multi-scene VAD datasets, such as ShanghaiTech and Charlotte Anomaly datasets.
[229] Pseudo Anomalies Are All You Need: Diffusion-Based Generation for Weakly-Supervised Video Anomaly Detection
Satoshi Hashimoto, Hitoshi Nishimura, Yanan Wang, Mori Kurokawa
Main category: cs.CV
TL;DR: PA-VAD: Video anomaly detection trained only on synthesized pseudo-abnormal videos and real normal videos, achieving SOTA results without real abnormal footage.
Details
Motivation: Real abnormal videos are scarce and expensive to collect, limiting practical deployment of video anomaly detection systems.Method: 1) Generate pseudo-abnormal videos using CLIP for image selection, vision-language model for prompt refinement, and video diffusion model for synthesis. 2) Train detector with domain-aligned regularization to mitigate excessive spatiotemporal magnitude in synthesized anomalies.
Result: Achieves 98.2% on ShanghaiTech and 82.5% on UCF-Crime, surpassing strongest real-abnormal method by +0.6% on ShanghaiTech and outperforming UVAD SOTA by +1.9% on UCF-Crime.
Conclusion: High-accuracy anomaly detection can be achieved without collecting real anomalies, providing a practical path toward scalable deployment.
Abstract: Deploying video anomaly detection in practice is hampered by the scarcity and collection cost of real abnormal footage. We address this by training without any real abnormal videos while evaluating under the standard weakly supervised split, and we introduce PA-VAD, a generation-driven approach that learns a detector from synthesized pseudo-abnormal videos paired with real normal videos, using only a small set of real normal images to drive synthesis. For synthesis, we select class-relevant initial images with CLIP and refine textual prompts with a vision-language model to improve fidelity and scene consistency before invoking a video diffusion model. For training, we mitigate excessive spatiotemporal magnitude in synthesized anomalies by an domain-aligned regularized module that combines domain alignment and memory usage-aware updates. Extensive experiments show that our approach reaches 98.2% on ShanghaiTech and 82.5% on UCF-Crime, surpassing the strongest real-abnormal method on ShanghaiTech by +0.6% and outperforming the UVAD state-of-the-art on UCF-Crime by +1.9%. The results demonstrate that high-accuracy anomaly detection can be obtained without collecting real anomalies, providing a practical path toward scalable deployment.
[230] Hide-and-Seek Attribution: Weakly Supervised Segmentation of Vertebral Metastases in CT
Matan Atad, Alexander W. Marka, Lisa Steinhelfer, Anna Curto-Vilalta, Yannik Leonhardt, Sarah C. Foreman, Anna-Sophia Walburga Dietrich, Robert Graf, Alexandra S. Gersing, Bjoern Menze, Daniel Rueckert, Jan S. Kirschke, Hendrik Möller
Main category: cs.CV
TL;DR: Weakly supervised method for vertebral metastasis segmentation using only vertebra-level labels (healthy/malignant) without lesion masks, combining diffusion autoencoder editing with hide-and-seek attribution to generate accurate lesion segmentations.
Details
Motivation: Vertebral metastasis segmentation in CT is clinically important but difficult to scale due to scarce voxel-level annotations and the similarity between malignant lesions (lytic/blastic) and benign degenerative changes.Method: Combines Diffusion Autoencoder (DAE) for classifier-guided healthy edits of vertebrae with pixel-wise difference maps to propose candidate lesion regions. Introduces Hide-and-Seek Attribution: each candidate region is revealed while others are hidden, edited image is projected back to data manifold by DAE, and latent-space classifier quantifies isolated malignant contribution to identify true lesions.
Result: Achieved strong performance on held-out radiologist annotations: blastic (F1: 0.91, Dice: 0.87) and lytic (F1: 0.85, Dice: 0.78), significantly exceeding baselines (F1: 0.79/0.67; Dice: 0.74/0.55).
Conclusion: Vertebra-level labels can be transformed into reliable lesion masks, demonstrating that generative editing combined with selective occlusion supports accurate weakly supervised segmentation in CT without mask supervision.
Abstract: Accurate segmentation of vertebral metastasis in CT is clinically important yet difficult to scale, as voxel-level annotations are scarce and both lytic and blastic lesions often resemble benign degenerative changes. We introduce a weakly supervised method trained solely on vertebra-level healthy/malignant labels, without any lesion masks. The method combines a Diffusion Autoencoder (DAE) that produces a classifier-guided healthy edit of each vertebra with pixel-wise difference maps that propose candidate lesion regions. To determine which regions truly reflect malignancy, we introduce Hide-and-Seek Attribution: each candidate is revealed in turn while all others are hidden, the edited image is projected back to the data manifold by the DAE, and a latent-space classifier quantifies the isolated malignant contribution of that component. High-scoring regions form the final lytic or blastic segmentation. On held-out radiologist annotations, we achieve strong blastic/lytic performance despite no mask supervision (F1: 0.91/0.85; Dice: 0.87/0.78), exceeding baselines (F1: 0.79/0.67; Dice: 0.74/0.55). These results show that vertebra-level labels can be transformed into reliable lesion masks, demonstrating that generative editing combined with selective occlusion supports accurate weakly supervised segmentation in CT.
[231] Omni-Referring Image Segmentation
Qiancheng Zheng, Yunhang Shen, Gen Luo, Baiyang Song, Xing Sun, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji
Main category: cs.CV
TL;DR: Proposes Omni-Referring Image Segmentation (OmniRIS) - a unified task supporting both text and visual prompts (masks, boxes, scribbles) for generalized image segmentation, with new dataset OmniRef and baseline model OmniSegNet.
Details
Motivation: Existing segmentation tasks are unimodal (text-only or visual-only), limiting their generalization. Need a unified approach that leverages strengths of both modalities: text for granular attribute referring and visual prompts for uncommon object grounding.Method: 1) Defines OmniRIS task supporting omni-prompts (text + visual references). 2) Creates OmniRef dataset with 186,939 prompts for 30,956 images. 3) Proposes OmniSegNet baseline to handle omni-prompt encoding and various segmentation settings (one vs many, many vs many).
Result: Extensive experiments validate OmniSegNet’s capability to follow omni-modal instructions and show OmniRIS’s superiority for highly generalized image segmentation compared to existing unimodal approaches.
Conclusion: OmniRIS enables more flexible and generalized image segmentation by combining text and visual modalities, with OmniRef dataset and OmniSegNet baseline providing foundation for future research in this direction.
Abstract: In this paper, we propose a novel task termed Omni-Referring Image Segmentation (OmniRIS) towards highly generalized image segmentation. Compared with existing unimodally conditioned segmentation tasks, such as RIS and visual RIS, OmniRIS supports the input of text instructions and reference images with masks, boxes or scribbles as omni-prompts. This property makes it can well exploit the intrinsic merits of both text and visual modalities, i.e., granular attribute referring and uncommon object grounding, respectively. Besides, OmniRIS can also handle various segmentation settings, such as one v.s. many and many v.s. many, further facilitating its practical use. To promote the research of OmniRIS, we also rigorously design and construct a large dataset termed OmniRef, which consists of 186,939 omni-prompts for 30,956 images, and establish a comprehensive evaluation system. Moreover, a strong and general baseline termed OmniSegNet is also proposed to tackle the key challenges of OmniRIS, such as omni-prompt encoding. The extensive experiments not only validate the capability of OmniSegNet in following omni-modal instructions, but also show the superiority of OmniRIS for highly generalized image segmentation.
[232] Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training
Kaixuan Lu, Mehmet Onurcan Kaya, Dim P. Papadopoulos
Main category: cs.CV
TL;DR: AutoQ-VIS is an unsupervised Video Instance Segmentation framework that uses quality-guided self-training to bridge the synthetic-to-real domain gap without human annotations.
Details
Motivation: Video Instance Segmentation requires pixel-level masks and temporal consistency labels, which are challenging to annotate. Existing unsupervised methods like VideoCutLER rely on synthetic data but suffer from synthetic-to-real domain gap limitations.Method: AutoQ-VIS establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos through quality-guided self-training.
Result: Achieves state-of-the-art performance with 52.6 AP50 on YouTubeVIS-2019 val set, surpassing previous state-of-the-art VideoCutLER by 4.4%, while requiring no human annotations.
Conclusion: Demonstrates the viability of quality-aware self-training for unsupervised Video Instance Segmentation, effectively bridging the synthetic-to-real domain gap.
Abstract: Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $\text{AP}_{50}$ on YouTubeVIS-2019 $\texttt{val}$ set, surpassing the previous state-of-the-art VideoCutLER by 4.4%, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. We will release the code at https://github.com/wcbup/AutoQ-VIS.
[233] Spatial Retrieval Augmented Autonomous Driving
Xiaosong Jia, Chenhe Zhang, Yule Jiang, Songbur Wong, Zhiyuan Zhang, Chen Chen, Shaofeng Zhang, Xuanhe Zhou, Xue Yang, Junchi Yan, Yu-Gang Jiang
Main category: cs.CV
TL;DR: The paper proposes a spatial retrieval paradigm that uses offline geographic images (e.g., from Google Maps) to enhance autonomous driving perception beyond onboard sensors, addressing limitations like limited view scope and poor visibility.
Details
Motivation: Current autonomous driving systems rely on onboard sensors which have limitations: limited perception horizon, restricted view scope, occlusion issues, and poor performance in extreme conditions (darkness, rain). Human drivers can recall road structure even under poor visibility, so the authors aim to give models similar "recall" ability.Method: Introduce spatial retrieval paradigm using offline retrieved geographic images as additional input. These images come from offline caches like Google Maps or stored autonomous driving datasets. Extended nuScenes dataset with Google Maps API images aligned with ego-vehicle trajectories. Established baselines across five core AD tasks: object detection, online mapping, occupancy prediction, end-to-end planning, and generative world modeling.
Result: Extensive experiments show that the extended modality (geographic images) could enhance performance of certain autonomous driving tasks. The authors will open-source dataset curation code, data, and benchmarks for further study.
Conclusion: The spatial retrieval paradigm using offline geographic images is a plug-and-play extension for existing AD tasks that can enhance perception capabilities, addressing limitations of onboard sensors while being easy to obtain without additional hardware.
Abstract: Existing autonomous driving systems rely on onboard sensors (cameras, LiDAR, IMU, etc) for environmental perception. However, this paradigm is limited by the drive-time perception horizon and often fails under limited view scope, occlusion or extreme conditions such as darkness and rain. In contrast, human drivers are able to recall road structure even under poor visibility. To endow models with this ``recall” ability, we propose the spatial retrieval paradigm, introducing offline retrieved geographic images as an additional input. These images are easy to obtain from offline caches (e.g, Google Maps or stored autonomous driving datasets) without requiring additional sensors, making it a plug-and-play extension for existing AD tasks. For experiments, we first extend the nuScenes dataset with geographic images retrieved via Google Maps APIs and align the new data with ego-vehicle trajectories. We establish baselines across five core autonomous driving tasks: object detection, online mapping, occupancy prediction, end-to-end planning, and generative world modeling. Extensive experiments show that the extended modality could enhance the performance of certain tasks. We will open-source dataset curation code, data, and benchmarks for further study of this new autonomous driving paradigm.
[234] Towards Robust Pseudo-Label Learning in Semantic Segmentation: An Encoding Perspective
Wangkai Li, Rui Sun, Zhaoyang Li, Tianzhu Zhang
Main category: cs.CV
TL;DR: ECOCSeg introduces error-correcting output codes for semantic segmentation to handle noisy pseudo-labels in UDA and SSL scenarios, improving stability and generalization through bit-level denoising.
Details
Motivation: Pseudo-label learning in semantic segmentation suffers from erroneous pseudo-labels that get amplified during training due to one-hot encoding limitations, especially in label-scarce scenarios like UDA and SSL.Method: Proposes ECOCSeg using error-correcting output codes (ECOC) to create fine-grained class encodings, introducing an ECOC-based classifier that disentangles classes into attributes and a bit-level label denoising mechanism.
Result: Consistently demonstrates significant improvements on multiple UDA and SSL benchmarks across different segmentation architectures, showing better stability and generalization.
Conclusion: ECOCSeg effectively addresses pseudo-label noise in segmentation tasks, is easily integrable with existing methods, and provides robust supervision for unlabeled images through error-correcting codes.
Abstract: Pseudo-label learning is widely used in semantic segmentation, particularly in label-scarce scenarios such as unsupervised domain adaptation (UDA) and semisupervised learning (SSL). Despite its success, this paradigm can generate erroneous pseudo-labels, which are further amplified during training due to utilization of one-hot encoding. To address this issue, we propose ECOCSeg, a novel perspective for segmentation models that utilizes error-correcting output codes (ECOC) to create a fine-grained encoding for each class. ECOCSeg offers several advantages. First, an ECOC-based classifier is introduced, enabling model to disentangle classes into attributes and handle partial inaccurate bits, improving stability and generalization in pseudo-label learning. Second, a bit-level label denoising mechanism is developed to generate higher-quality pseudo-labels, providing adequate and robust supervision for unlabeled images. ECOCSeg can be easily integrated with existing methods and consistently demonstrates significant improvements on multiple UDA and SSL benchmarks across different segmentation architectures. Code is available at https://github.com/Woof6/ECOCSeg.
[235] SceneMixer: Exploring Convolutional Mixing Networks for Remote Sensing Scene Classification
Mohammed Q. Alkhatib, Ali Jamali, Swalpa Kumar Roy
Main category: cs.CV
TL;DR: Proposes a lightweight convolutional mixer architecture for remote sensing scene classification that balances accuracy and efficiency by alternating spatial mixing (depthwise convolutions) and channel mixing (pointwise operations).
Details
Motivation: Remote sensing scene classification is crucial for Earth observation but remains challenging due to variations in spatial resolution, viewpoint, orientation, and background conditions that reduce model generalization. Existing CNN and ViT models struggle with these challenges while maintaining efficiency.Method: Lightweight architecture based on convolutional mixer paradigm that alternates between spatial mixing through depthwise convolutions at multiple scales and channel mixing through pointwise operations, enabling efficient extraction of both local and contextual information with low parameters and computations.
Result: Achieved 74.7% overall accuracy, 74.57% average accuracy, and 73.79 Kappa on AID dataset; 93.90% overall accuracy, 93.93% average accuracy, and 93.22 Kappa on EuroSAT dataset. Shows good balance between accuracy and efficiency compared to CNN- and transformer-based models.
Conclusion: The proposed convolutional mixer architecture provides an effective solution for remote sensing scene classification that maintains good accuracy while being computationally efficient, addressing generalization challenges in aerial and satellite imagery analysis.
Abstract: Remote sensing scene classification plays a key role in Earth observation by enabling the automatic identification of land use and land cover (LULC) patterns from aerial and satellite imagery. Despite recent progress with convolutional neural networks (CNNs) and vision transformers (ViTs), the task remains challenging due to variations in spatial resolution, viewpoint, orientation, and background conditions, which often reduce the generalization ability of existing models. To address these challenges, this paper proposes a lightweight architecture based on the convolutional mixer paradigm. The model alternates between spatial mixing through depthwise convolutions at multiple scales and channel mixing through pointwise operations, enabling efficient extraction of both local and contextual information while keeping the number of parameters and computations low. Extensive experiments were conducted on the AID and EuroSAT benchmarks. The proposed model achieved overall accuracy, average accuracy, and Kappa values of 74.7%, 74.57%, and 73.79 on the AID dataset, and 93.90%, 93.93%, and 93.22 on EuroSAT, respectively. These results demonstrate that the proposed approach provides a good balance between accuracy and efficiency compared with widely used CNN- and transformer-based models. Code will be publicly available on: https://github.com/mqalkhatib/SceneMixer
[236] Power of Boundary and Reflection: Semantic Transparent Object Segmentation using Pyramid Vision Transformer with Transparent Cues
Tuan-Anh Vu, Hai Nguyen-Truong, Ziqiang Zheng, Binh-Son Hua, Qing Guo, Ivor Tsang, Sai-Kit Yeung
Main category: cs.CV
TL;DR: TransCues is a transformer-based framework that improves glass object segmentation by jointly enhancing boundary and reflection features, achieving state-of-the-art performance across multiple datasets.
Details
Motivation: Glass objects are challenging to segment due to transparency and reflection properties. Existing methods fail to adequately capture both boundary and reflection features that humans use to perceive glass objects.Method: Proposes TransCues with Boundary Feature Enhancement and Reflection Feature Enhancement modules in a pyramidal transformer encoder-decoder architecture to jointly leverage both visual cues for glass segmentation.
Result: Significant performance improvements: +4.2% mIoU on Trans10K-v2, +5.6% on MSD, +10.1% on RGBD-Mirror, +13.1% on TROSD, and +8.3% on Stanford2D3D, outperforming state-of-the-art by large margins.
Conclusion: Jointly enhancing boundary and reflection features in a transformer architecture effectively addresses glass segmentation challenges, demonstrating superior performance across diverse datasets including glass, mirror, and generic segmentation tasks.
Abstract: Glass is a prevalent material among solid objects in everyday life, yet segmentation methods struggle to distinguish it from opaque materials due to its transparency and reflection. While it is known that human perception relies on boundary and reflective-object features to distinguish glass objects, the existing literature has not yet sufficiently captured both properties when handling transparent objects. Hence, we propose incorporating both of these powerful visual cues via the Boundary Feature Enhancement and Reflection Feature Enhancement modules in a mutually beneficial way. Our proposed framework, TransCues, is a pyramidal transformer encoder-decoder architecture to segment transparent objects. We empirically show that these two modules can be used together effectively, improving overall performance across various benchmark datasets, including glass object semantic segmentation, mirror object semantic segmentation, and generic segmentation datasets. Our method outperforms the state-of-the-art by a large margin, achieving +4.2% mIoU on Trans10K-v2, +5.6% mIoU on MSD, +10.1% mIoU on RGBD-Mirror, +13.1% mIoU on TROSD, and +8.3% mIoU on Stanford2D3D, showing the effectiveness of our method against glass objects.
[237] Hierarchical Image-Guided 3D Point Cloud Segmentation in Industrial Scenes via Multi-View Bayesian Fusion
Yu Zhu, Naoya Chiba, Koichi Hashimoto
Main category: cs.CV
TL;DR: Hierarchical image-guided 3D segmentation framework that uses 2D foundation models (SAM + YOLO-World) to progressively refine segmentation from instance to part level, addressing occlusion and scale challenges in industrial scenes.
Details
Motivation: Industrial environments have dense layouts with multi-scale objects and heavy occlusion, which weakens geometric boundaries and causes end-to-end models to fail at capturing both coarse and fine details. Existing methods either require costly 3D annotations or suffer from semantic inconsistencies across views.Method: Two-stage hierarchical approach: 1) Instance segmentation via top-view rendering with SAM masks prompted by YOLO-World, back-projected to 3D point cloud. 2) Part-level segmentation via multi-view rendering of each instance, applying same 2D segmentation at each view, followed by Bayesian updating fusion for cross-view consistency.
Result: Method effectively handles occlusion and structural complexity in real-world factory data, achieving consistently high per-class mIoU scores. Additional evaluations on public datasets confirm generalization ability, robustness, annotation efficiency, and adaptability to diverse 3D environments.
Conclusion: Proposed hierarchical image-guided framework successfully addresses challenges in industrial 3D segmentation by leveraging 2D foundation models, providing robust segmentation without costly 3D annotations while maintaining semantic consistency across views.
Abstract: Reliable 3D segmentation is critical for understanding complex scenes with dense layouts and multi-scale objects, as commonly seen in industrial environments. In such scenarios, heavy occlusion weakens geometric boundaries between objects, and large differences in object scale will cause end-to-end models fail to capture both coarse and fine details accurately. Existing 3D point-based methods require costly annotations, while image-guided methods often suffer from semantic inconsistencies across views. To address these challenges, we propose a hierarchical image-guided 3D segmentation framework that progressively refines segmentation from instance-level to part-level. Instance segmentation involves rendering a top-view image and projecting SAM-generated masks prompted by YOLO-World back onto the 3D point cloud. Part-level segmentation is subsequently performed by rendering multi-view images of each instance obtained from the previous stage and applying the same 2D segmentation and back-projection process at each view, followed by Bayesian updating fusion to ensure semantic consistency across views. Experiments on real-world factory data demonstrate that our method effectively handles occlusion and structural complexity, achieving consistently high per-class mIoU scores. Additional evaluations on public dataset confirm the generalization ability of our framework, highlighting its robustness, annotation efficiency, and adaptability to diverse 3D environments.
[238] DAUNet: A Lightweight UNet Variant with Deformable Convolutions and Parameter-Free Attention for Medical Image Segmentation
Adnan Munir, Shujaat Khan
Main category: cs.CV
TL;DR: DAUNet is a lightweight UNet variant combining Deformable V2 Convolutions and SimAM attention for medical image segmentation, achieving state-of-the-art performance with high parameter efficiency.
Details
Motivation: Medical image segmentation is crucial for automated diagnostic systems, but existing models often lack spatial adaptability and context-aware feature fusion without increasing model complexity. There's a need for lightweight yet effective segmentation models suitable for real-time clinical deployment.Method: DAUNet integrates Deformable V2 Convolutions in the bottleneck to handle geometric variations with dynamic deformable kernels, and uses Parameter-Free Attention (SimAM) modules in decoder and skip pathways for saliency-aware refinement and context-aware feature fusion.
Result: DAUNet outperforms state-of-the-art models on FH-PS-AoP (ultrasound) and FUMPE (CT) datasets in Dice score, HD95, and ASD metrics while maintaining superior parameter efficiency. Ablation studies confirm contributions of both components.
Conclusion: DAUNet demonstrates robustness to missing context and low-contrast regions, establishing its suitability for deployment in real-time and resource-constrained clinical environments as an efficient yet powerful medical image segmentation solution.
Abstract: Medical image segmentation plays a pivotal role in automated diagnostic and treatment planning systems. In this work, we present DAUNet, a novel lightweight UNet variant that integrates Deformable V2 Convolutions and Parameter-Free Attention (SimAM) to improve spatial adaptability and context-aware feature fusion without increasing model complexity. DAUNet’s bottleneck employs dynamic deformable kernels to handle geometric variations, while the decoder and skip pathways are enhanced using SimAM attention modules for saliency-aware refinement. Extensive evaluations on two challenging datasets, FH-PS-AoP (fetal head and pubic symphysis ultrasound) and FUMPE (CT-based pulmonary embolism detection), demonstrate that DAUNet outperforms state-of-the-art models in Dice score, HD95, and ASD, while maintaining superior parameter efficiency. Ablation studies highlight the individual contributions of deformable convolutions and SimAM attention. DAUNet’s robustness to missing context and low-contrast regions establishes its suitability for deployment in real-time and resource-constrained clinical environments.
[239] Balanced Learning for Domain Adaptive Semantic Segmentation
Wangkai Li, Rui Sun, Bohao Liao, Zhaoyang Li, Tianzhu Zhang
Main category: cs.CV
TL;DR: BLDA addresses class imbalance in UDA semantic segmentation by analyzing predicted logits distributions, aligning them across classes using anchor distributions, and incorporating logits correction into self-training.
Details
Motivation: Self-training in UDA struggles with class imbalance due to distribution shifts between source and target domains, leading to biased learning where some classes are over-predicted while others are under-predicted.Method: 1) Identify over/under-predicted classes by analyzing predicted logits distributions. 2) Use post-hoc approach with shared anchor distributions to align logits across classes. 3) Estimate logits distributions online and incorporate correction terms into loss function. 4) Leverage cumulative density as domain-shared structural knowledge to connect domains.
Result: Extensive experiments on standard UDA semantic segmentation benchmarks show BLDA consistently improves performance, especially for under-predicted classes, when integrated into various existing methods.
Conclusion: BLDA effectively addresses class imbalance in UDA semantic segmentation without requiring prior knowledge about distribution shifts, improving overall performance and balancing class predictions.
Abstract: Unsupervised domain adaptation (UDA) for semantic segmentation aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Despite the effectiveness of self-training techniques in UDA, they struggle to learn each class in a balanced manner due to inherent class imbalance and distribution shift in both data and label space between domains. To address this issue, we propose Balanced Learning for Domain Adaptation (BLDA), a novel approach to directly assess and alleviate class bias without requiring prior knowledge about the distribution shift. First, we identify over-predicted and under-predicted classes by analyzing the distribution of predicted logits. Subsequently, we introduce a post-hoc approach to align the logits distributions across different classes using shared anchor distributions. To further consider the network’s need to generate unbiased pseudo-labels during self-training, we estimate logits distributions online and incorporate logits correction terms into the loss function. Moreover, we leverage the resulting cumulative density as domain-shared structural knowledge to connect the source and target domains. Extensive experiments on two standard UDA semantic segmentation benchmarks demonstrate that BLDA consistently improves performance, especially for under-predicted classes, when integrated into various existing methods. Code is available at https://github.com/Woof6/BLDA.
[240] $\mathrm{D}^{\mathrm{3}}$-Predictor: Noise-Free Deterministic Diffusion for Dense Prediction
Changliang Xia, Chengyou Jia, Minnan Luo, Zhuohang Dang, Xin Shen, Bowen Ping
Main category: cs.CV
TL;DR: D³-Predictor: A deterministic, noise-free framework that reformulates pretrained diffusion models for dense prediction tasks by aggregating timestep-dependent visual priors without stochastic noise.
Details
Motivation: Diffusion models with strong visual priors are powerful for dense prediction but have a core limitation: stochastic noise in diffusion sampling is inherently misaligned with dense prediction's need for deterministic image-to-geometry mapping. This noise corrupts fine-grained spatial cues and pushes models toward timestep-specific noise objectives, destroying meaningful geometric structure mappings.Method: Introduces D³-Predictor, a noise-free deterministic framework that reformulates pretrained diffusion models without stochastic noise. Instead of using noisy inputs, it views the pretrained diffusion network as an ensemble of timestep-dependent visual experts and self-supervisedly aggregates their heterogeneous priors into a single, clean, and complete geometric prior. Task-specific supervision adapts this noise-free prior to dense prediction tasks.
Result: Extensive experiments on various dense prediction tasks show D³-Predictor achieves competitive or state-of-the-art performance in diverse scenarios. It requires less than half the training data previously used and efficiently performs inference in a single step.
Conclusion: D³-Predictor successfully addresses the misalignment between stochastic diffusion sampling and deterministic dense prediction by creating a noise-free framework that aggregates diffusion priors effectively, enabling efficient and high-performance dense prediction with reduced data requirements.
Abstract: Although diffusion models with strong visual priors have emerged as powerful dense prediction backboens, they overlook a core limitation: the stochastic noise at the core of diffusion sampling is inherently misaligned with dense prediction that requires a deterministic mapping from image to geometry. In this paper, we show that this stochastic noise corrupts fine-grained spatial cues and pushes the model toward timestep-specific noise objectives, consequently destroying meaningful geometric structure mappings. To address this, we introduce $\mathrm{D}^{\mathrm{3}}$-Predictor, a noise-free deterministic framework built by reformulating a pretrained diffusion model without stochasticity noise. Instead of relying on noisy inputs to leverage diffusion priors, $\mathrm{D}^{\mathrm{3}}$-Predictor views the pretrained diffusion network as an ensemble of timestep-dependent visual experts and self-supervisedly aggregates their heterogeneous priors into a single, clean, and complete geometric prior. Meanwhile, we utilize task-specific supervision to seamlessly adapt this noise-free prior to dense prediction tasks. Extensive experiments on various dense prediction tasks demonstrate that $\mathrm{D}^{\mathrm{3}}$-Predictor achieves competitive or state-of-the-art performance in diverse scenarios. In addition, it requires less than half the training data previously used and efficiently performs inference in a single step. Our code, data, and checkpoints are publicly available at https://x-gengroup.github.io/HomePage_D3-Predictor/.
[241] Overcoming Small Data Limitations in Video-Based Infant Respiration Estimation
Liyang Song, Hardik Bishnoi, Sai Kumar Reddy Manne, Sarah Ostadabbas, Briana J. Taylor, Michael Wan
Main category: cs.CV
TL;DR: Researchers introduce AIR-400, a new annotated infant respiration video dataset, and develop the first reproducible pipelines for infant respiration estimation using computer vision.
Details
Motivation: Current respiration monitoring for infants lacks adequate video datasets and reproducible algorithms, despite the potential for early detection of breathing irregularities linked to neurodevelopmental issues and SIDS.Method: Created AIR-400 dataset with 400 annotated videos (275 new videos from 10 subjects), developed infant-specific region-of-interest detection, and built spatiotemporal neural processing pipelines enhanced by optical flow inputs.
Result: Established first reproducible benchmarks for state-of-the-art vision-based infant respiration estimation, with all resources (dataset, code, models) made publicly available.
Conclusion: This work addresses the critical gap in infant respiration monitoring by providing both dataset and reproducible algorithms, enabling advances in early detection of breathing irregularities.
Abstract: The development of contactless respiration monitoring for infants could enable advances in the early detection and treatment of breathing irregularities, which are associated with neurodevelopmental impairments and conditions like sudden infant death syndrome (SIDS). But while respiration estimation for adults is supported by a robust ecosystem of computer vision algorithms and video datasets, only one small public video dataset with annotated respiration data for infant subjects exists, and there are no reproducible algorithms which are effective for infants. We introduce the annotated infant respiration dataset of 400 videos (AIR-400), contributing 275 new, carefully annotated videos from 10 recruited subjects to the public corpus. We develop the first reproducible pipelines for infant respiration estimation, based on infant-specific region-of-interest detection and spatiotemporal neural processing enhanced by optical flow inputs. We establish, through comprehensive experiments, the first reproducible benchmarks for the state-of-the-art in vision-based infant respiration estimation. We make our dataset, code repository, and trained models available for public use.
[242] Scaling Zero-Shot Reference-to-Video Generation
Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, Xiao Han, Yuren Cong, Hang Li, Chuyan Zhu, Aditya Patel, Tao Xiang, Sen He
Main category: cs.CV
TL;DR: Saber is a scalable zero-shot framework for reference-to-video generation that eliminates the need for expensive reference image-video-text triplets by training only on video-text pairs.
Details
Motivation: Current R2V methods rely on expensive explicit reference image-video-text triplets that are difficult and costly to scale, creating a bottleneck for practical applications.Method: Saber uses masked training strategy and tailored attention-based model design to learn identity-consistent and reference-aware representations, with mask augmentation techniques to reduce copy-paste artifacts.
Result: Saber achieves superior performance on OpenS2V-Eval benchmark compared to methods trained with R2V data, and demonstrates remarkable generalization across varying numbers of references.
Conclusion: Saber provides a scalable zero-shot solution for R2V generation that bypasses the data bottleneck while maintaining strong identity consistency and reference awareness.
Abstract: Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.
[243] Can We Go Beyond Visual Features? Neural Tissue Relation Modeling for Relational Graph Analysis in Non-Melanoma Skin Histology
Shravan Venkatraman, Muthu Subash Kavitha, Joe Dhanith P R, V Manikandarajan, Jia Wu
Main category: cs.CV
TL;DR: NTRM is a novel histopathology segmentation framework that uses graph neural networks to model spatial and functional relationships between tissue types, outperforming state-of-the-art methods on skin cancer datasets.
Details
Motivation: Current CNN-based histopathology segmentation methods focus on visual texture and treat tissues as independent regions, failing to capture biological context and inter-tissue relationships, especially in areas with overlapping or morphologically similar tissues.Method: NTRM augments CNNs with a tissue-level graph neural network that constructs a graph over predicted regions, propagates contextual information via message passing, and refines segmentation through spatial projection to explicitly encode inter-tissue dependencies.
Result: On the Histopathology Non-Melanoma Skin Cancer Segmentation Dataset, NTRM outperforms state-of-the-art methods, achieving Dice similarity coefficients 4.9% to 31.25% higher than the best-performing models.
Conclusion: Relational modeling provides a principled approach for more context-aware and interpretable histological segmentation compared to local receptive-field architectures that lack tissue-level structural awareness.
Abstract: Histopathology image segmentation is essential for delineating tissue structures in skin cancer diagnostics, but modeling spatial context and inter-tissue relationships remains a challenge, especially in regions with overlapping or morphologically similar tissues. Current convolutional neural network (CNN)-based approaches operate primarily on visual texture, often treating tissues as independent regions and failing to encode biological context. To this end, we introduce Neural Tissue Relation Modeling (NTRM), a novel segmentation framework that augments CNNs with a tissue-level graph neural network to model spatial and functional relationships across tissue types. NTRM constructs a graph over predicted regions, propagates contextual information via message passing, and refines segmentation through spatial projection. Unlike prior methods, NTRM explicitly encodes inter-tissue dependencies, enabling structurally coherent predictions in boundary-dense zones. On the benchmark Histopathology Non-Melanoma Skin Cancer Segmentation Dataset, NTRM outperforms state-of-the-art methods, achieving a robust Dice similarity coefficient that is 4.9% to 31.25% higher than the best-performing models among the evaluated approaches. Our experiments indicate that relational modeling offers a principled path toward more context-aware and interpretable histological segmentation, compared to local receptive-field architectures that lack tissue-level structural awareness. Our code is available at https://github.com/shravan-18/NTRM.
[244] TrajMoE: Scene-Adaptive Trajectory Planning with Mixture of Experts and Reinforcement Learning
Zebin Xing, Pengxuan Yang, Linbo Wang, Yichen Zhang, Yiming Hu, Yupeng Zheng, Junli Wang, Yinfeng Gao, Guang Li, Kun Ma, Long Chen, Zhongpu Xia, Qichao Zhang, Hangjun Ye, Dongbin Zhao
Main category: cs.CV
TL;DR: The paper proposes an improved autonomous driving planning system that uses Mixture of Experts (MoE) for scenario-specific trajectory priors and Reinforcement Learning (RL) for policy-driven trajectory scoring refinement.
Details
Motivation: Current end-to-end autonomous driving systems with trajectory priors have two limitations: 1) trajectory priors are not scenario-adaptive, and 2) trajectory evaluation lacks policy-driven refinement due to one-stage supervised training constraints.Method: 1) Use Mixture of Experts (MoE) to apply different trajectory priors tailored to different driving scenarios. 2) Employ Reinforcement Learning to fine-tune the trajectory scoring mechanism. 3) Integrate models with different perception backbones to enhance perceptual features.
Result: The integrated model achieved a score of 51.08 on the navsim ICCV benchmark, securing third place in the competition.
Conclusion: The proposed approach successfully addresses the limitations of current trajectory-based planning systems by making priors scenario-adaptive and adding policy-driven refinement through RL, resulting in competitive benchmark performance.
Abstract: Current autonomous driving systems often favor end-to-end frameworks, which take sensor inputs like images and learn to map them into trajectory space via neural networks. Previous work has demonstrated that models can achieve better planning performance when provided with a prior distribution of possible trajectories. However, these approaches often overlook two critical aspects: 1) The appropriate trajectory prior can vary significantly across different driving scenarios. 2) Their trajectory evaluation mechanism lacks policy-driven refinement, remaining constrained by the limitations of one-stage supervised training. To address these issues, we explore improvements in two key areas. For problem 1, we employ MoE to apply different trajectory priors tailored to different scenarios. For problem 2, we utilize Reinforcement Learning to fine-tune the trajectory scoring mechanism. Additionally, we integrate models with different perception backbones to enhance perceptual features. Our integrated model achieved a score of 51.08 on the navsim ICCV benchmark, securing third place.
[245] Selective Masking based Self-Supervised Learning for Image Semantic Segmentation
Yuemin Wang, Ian Stavness
Main category: cs.CV
TL;DR: Proposes selective masking image reconstruction for self-supervised semantic segmentation pretraining, outperforming random masking and ImageNet pretraining by 2.9% on general datasets and 2.5% on weed datasets.
Details
Motivation: To improve self-supervised pretraining for semantic segmentation by replacing random masking with selective masking that leverages trained model knowledge, especially useful for scenarios with limited model capacity and computational resources.Method: Selective masking image reconstruction that iteratively masks patches with highest reconstruction loss rather than random masking, breaking pretraining into iterative steps to use the trained model’s knowledge.
Result: Outperforms random masking and supervised ImageNet pretraining by 2.9% on Pascal VOC/Cityscapes and 2.5% on weed datasets; significantly improves accuracy for lowest-performing classes; same pretraining/downstream dataset works best for low-budget scenarios.
Conclusion: Selective Masking Image Reconstruction provides effective and practical solution to improve end-to-end semantic segmentation workflows, particularly for limited model capacity scenarios requiring fast inference and computational efficiency.
Abstract: This paper proposes a novel self-supervised learning method for semantic segmentation using selective masking image reconstruction as the pretraining task. Our proposed method replaces the random masking augmentation used in most masked image modelling pretraining methods. The proposed selective masking method selectively masks image patches with the highest reconstruction loss by breaking the image reconstruction pretraining into iterative steps to leverage the trained model’s knowledge. We show on two general datasets (Pascal VOC and Cityscapes) and two weed segmentation datasets (Nassar 2020 and Sugarbeets 2016) that our proposed selective masking method outperforms the traditional random masking method and supervised ImageNet pretraining on downstream segmentation accuracy by 2.9% for general datasets and 2.5% for weed segmentation datasets. Furthermore, we found that our selective masking method significantly improves accuracy for the lowest-performing classes. Lastly, we show that using the same pretraining and downstream dataset yields the best result for low-budget self-supervised pretraining. Our proposed Selective Masking Image Reconstruction method provides an effective and practical solution to improve end-to-end semantic segmentation workflows, especially for scenarios that require limited model capacity to meet inference speed and computational resource requirements.
[246] A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning
Siyang Jiang, Mu Yuan, Xiang Ji, Bufang Yang, Zeyu Liu, Lilin Xu, Yang Li, Yuting He, Liran Dong, Wenrui Lu, Zhenyu Yan, Xiaofan Jiang, Wei Gao, Hongkai Chen, Guoliang Xing
Main category: cs.CV
TL;DR: CUHK-X introduces a large-scale multimodal dataset for human action recognition, understanding, and reasoning, addressing limitations of existing datasets that lack fine-grained textual descriptions for non-RGB modalities.
Details
Motivation: Current LLMs struggle with non-RGB modalities (depth, IMU, mmWave) due to lack of large-scale data-caption resources, and existing HAR datasets provide only coarse labels insufficient for fine-grained action understanding and reasoning tasks.Method: Created CUHK-X dataset with 58,445 samples covering 40 actions by 30 participants across two indoor environments. Used prompt-based scene creation method leveraging LLMs to generate logically connected activity sequences followed by human validation for caption consistency.
Result: Achieved average accuracies of 76.52% for HAR, 40.76% for HAU, and 70.25% for HARn across three benchmarks with six evaluation tasks.
Conclusion: CUHK-X enables the community to apply and develop data-intensive learning methods for robust multimodal human activity analysis, bridging the gap between traditional action recognition and advanced understanding/reasoning capabilities.
Abstract: Multimodal human action recognition (HAR) leverages complementary sensors for activity classification. Beyond recognition, recent advances in large language models (LLMs) enable detailed descriptions and causal reasoning, motivating new tasks: human action understanding (HAU) and human action reasoning (HARn). However, most LLMs, especially large vision language models (LVLMs), struggle with non-RGB modalities such as depth, IMU, and mmWave due to the lack of large-scale data-caption resources. Existing HAR datasets mainly provide coarse data-label annotations, which are insufficient to capture fine-grained action dynamics needed for HAU and HARn. We consider two ground-truth pair types: (1) data label (discrete category) and (2) data caption (textual description). Naively generating captions from labels often lacks logical and spatiotemporal consistency. We introduce CUHK-X, a large-scale multimodal dataset and benchmark suite for HAR, HAU, and HARn. CUHK-X contains 58,445 samples covering 40 actions performed by 30 participants across two indoor environments. To improve caption consistency, we propose a prompt-based scene creation method that leverages LLMs to generate logically connected activity sequences, followed by human validation. CUHK-X includes three benchmarks with six evaluation tasks. Experiments report average accuracies of 76.52% (HAR), 40.76% (HAU), and 70.25% (HARn). CUHK-X aims to enable the community to apply and develop data-intensive learning methods for robust, multimodal human activity analysis. Project page and code: https://openaiotlab.github.io/CUHK-X/ and https://github.com/openaiotlab/CUHK-X.
[247] Evaluating and Preserving High-level Fidelity in Super-Resolution
Josep M. Rocafort, Shaolin Su, Javier Vazquez-Corral, Alexandra Gomez-Villa
Main category: cs.CV
TL;DR: The paper introduces a new fidelity metric for super-resolution models to measure high-level semantic preservation, addressing the problem where current SR models sometimes hallucinate content despite good visual quality.
Details
Motivation: Current super-resolution models can hallucinate and change image content while achieving high visual quality, but existing low-level image quality metrics don't capture these high-level semantic changes that humans can easily identify.Method: 1) Constructed the first annotated dataset with fidelity scores from different SR models; 2) Evaluated SOTA SR models on high-level fidelity preservation; 3) Analyzed correlation between existing metrics and fidelity; 4) Showed foundation models better address this task; 5) Fine-tuned SR models using fidelity feedback.
Result: Created the first fidelity dataset for SR models, showed that foundation models outperform traditional metrics for fidelity measurement, and demonstrated that fine-tuning with fidelity feedback improves both semantic fidelity and perceptual quality.
Conclusion: High-level fidelity measurement is crucial for evaluating and optimizing SR models, providing a complementary criterion to reveal their reliability, with potential applications in model evaluation and optimization.
Abstract: Recent image Super-Resolution (SR) models are achieving impressive effects in reconstructing details and delivering visually pleasant outputs. However, the overpowering generative ability can sometimes hallucinate and thus change the image content despite gaining high visual quality. This type of high-level change can be easily identified by humans yet not well-studied in existing low-level image quality metrics. In this paper, we establish the importance of measuring high-level fidelity for SR models as a complementary criterion to reveal the reliability of generative SR models. We construct the first annotated dataset with fidelity scores from different SR models, and evaluate how state-of-the-art (SOTA) SR models actually perform in preserving high-level fidelity. Based on the dataset, we then analyze how existing image quality metrics correlate with fidelity measurement, and further show that this high-level task can be better addressed by foundation models. Finally, by fine-tuning SR models based on our fidelity feedback, we show that both semantic fidelity and perceptual quality can be improved, demonstrating the potential value of our proposed criteria, both in model evaluation and optimization. We will release the dataset, code, and models upon acceptance.
[248] Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach
Jiayang Li, Chengjie Jiang, Junjun Jiang, Pengwei Liang, Jiayi Ma, Liqiang Nie
Main category: cs.CV
TL;DR: DiTFuse is an instruction-driven Diffusion-Transformer framework that performs end-to-end, semantics-aware image fusion across multiple modalities using natural language instructions, enabling hierarchical control without ground-truth data.
Details
Motivation: Existing image fusion approaches lack robustness, adaptability, and controllability. They are task-specific, cannot incorporate user intent, and struggle with complex scenarios like low-light degradation or exposure imbalance. The absence of ground-truth fused images and small datasets make it difficult to train end-to-end models that understand both high-level semantics and fine-grained multimodal alignment.Method: DiTFuse uses a Diffusion-Transformer (DiT) framework that jointly encodes two images and natural-language instructions in a shared latent space. It employs multi-degradation masked-image modeling to learn cross-modal alignment, modality-invariant restoration, and task-aware feature selection without ground truth. A curated multi-granularity instruction dataset enables interactive fusion capabilities.
Result: Experiments on IVIF, MFF, and MEF benchmarks show superior quantitative and qualitative performance with sharper textures and better semantic retention. The model unifies infrared-visible, multi-focus, and multi-exposure fusion, supports text-controlled refinement, and demonstrates zero-shot generalization to other fusion scenarios including instruction-conditioned segmentation.
Conclusion: DiTFuse overcomes limitations of existing fusion methods by providing a single, instruction-driven framework that enables hierarchical control, semantics-aware fusion, and generalization across multiple fusion tasks without requiring ground-truth data.
Abstract: Image fusion aims to blend complementary information from multiple sensing modalities, yet existing approaches remain limited in robustness, adaptability, and controllability. Most current fusion networks are tailored to specific tasks and lack the ability to flexibly incorporate user intent, especially in complex scenarios involving low-light degradation, color shifts, or exposure imbalance. Moreover, the absence of ground-truth fused images and the small scale of existing datasets make it difficult to train an end-to-end model that simultaneously understands high-level semantics and performs fine-grained multimodal alignment. We therefore present DiTFuse, instruction-driven Diffusion-Transformer (DiT) framework that performs end-to-end, semantics-aware fusion within a single model. By jointly encoding two images and natural-language instructions in a shared latent space, DiTFuse enables hierarchical and fine-grained control over fusion dynamics, overcoming the limitations of pre-fusion and post-fusion pipelines that struggle to inject high-level semantics. The training phase employs a multi-degradation masked-image modeling strategy, so the network jointly learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection without relying on ground truth images. A curated, multi-granularity instruction dataset further equips the model with interactive fusion capabilities. DiTFuse unifies infrared-visible, multi-focus, and multi-exposure fusion-as well as text-controlled refinement and downstream tasks-within a single architecture. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention. The model also supports multi-level user control and zero-shot generalization to other multi-image fusion scenarios, including instruction-conditioned segmentation.
[249] RAVE: Rate-Adaptive Visual Encoding for 3D Gaussian Splatting
Hoang-Nhat Tran, Francesco Di Sario, Gabriele Spadaro, Giuseppe Valenzise, Enzo Tartaglione
Main category: cs.CV
TL;DR: A flexible compression scheme for 3D Gaussian Splatting that supports interpolation at any rate between predefined bounds, requiring no retraining for different rates while preserving rendering quality.
Details
Motivation: 3D Gaussian Splatting enables real-time photorealistic rendering but suffers from large memory requirements and costly training. Existing compression approaches operate at fixed rates, limiting adaptability to varying bandwidth and device constraints.Method: Proposes a flexible compression scheme for 3DGS that supports interpolation at any rate between predefined bounds. The method is computationally lightweight and requires no retraining for any rate.
Result: The approach achieves efficient, high-quality compression while offering dynamic rate control, making it suitable for practical deployment in immersive applications. Experiments demonstrate preservation of rendering quality across a broad range of operating points.
Conclusion: The proposed flexible compression scheme addresses the limitations of fixed-rate approaches, enabling adaptable compression for 3DGS that can meet varying bandwidth and device constraints while maintaining rendering quality.
Abstract: Recent advances in neural scene representations have transformed immersive multimedia, with 3D Gaussian Splatting (3DGS) enabling real-time photorealistic rendering. Despite its efficiency, 3DGS suffers from large memory requirements and costly training procedures, motivating efforts toward compression. Existing approaches, however, operate at fixed rates, limiting adaptability to varying bandwidth and device constraints. In this work, we propose a flexible compression scheme for 3DGS that supports interpolation at any rate between predefined bounds. Our method is computationally lightweight, requires no retraining for any rate, and preserves rendering quality across a broad range of operating points. Experiments demonstrate that the approach achieves efficient, high-quality compression while offering dynamic rate control, making it suitable for practical deployment in immersive applications. The code will be provided open-source upon acceptance of the work.
[250] START: Spatial and Textual Learning for Chart Understanding
Zhuoming Liu, Xiaofeng Gao, Feiyang Niu, Qiaozi Gao, Liu Liu, Robinson Piramuthu
Main category: cs.CV
TL;DR: START is a multimodal LLM approach for chart understanding that combines spatial layout learning with textual data representation through chart-element grounding and chart-to-code generation, achieving state-of-the-art performance.
Details
Motivation: Charts combine structured visual layouts (spatial properties) with underlying data representations (textual properties), requiring both aspects for precise chart reasoning. Existing methods struggle to handle this dual nature effectively, creating a need for comprehensive chart understanding in real-world applications like scientific paper analysis.Method: Proposes START with two key components: (1) chart-element grounding to understand visual layout, and (2) chart-to-code generation to capture data details. Uses a novel data generation pipeline that translates real chart images into executable code via MLLM, then evolves the code with LLM to determine chart element positions. Also introduces CS-Bench benchmark for evaluating spatial understanding.
Result: START delivers consistent performance gains across model sizes and benchmarks over base models, surpassing prior state-of-the-art by a clear margin. The approach effectively addresses the dual challenges of spatial and textual understanding in charts.
Conclusion: The proposed spatial and textual learning approach (START) enables comprehensive chart understanding by simultaneously addressing visual layout and data representation, filling a critical gap in multimodal LLM capabilities for real-world chart analysis applications.
Abstract: Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property) – grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, the Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM’s understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a Large Language Model (LLM) to ascertain the positions of chart elements that capture the chart’s visual structure, addressing challenges that existing methods cannot handle. To evaluate a model’s ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains across model sizes and benchmarks over the base models and surpasses prior state-of-the-art by a clear margin. Code, data and models will be publicly available.
[251] Persistent Homology-Guided Frequency Filtering for Image Compression
Anil Chintapalli, Peter Tenholder, Henry Chen, Arjun Rao
Main category: cs.CV
TL;DR: Using discrete Fourier transform with persistent homology to extract frequency features for image compression, achieving JPEG-comparable compression while improving noise robustness.
Details
Motivation: Addressing challenges in feature extraction from noisy image datasets to improve model reliability by developing a method that can differentiate meaningful data from noise during compression.Method: Combines discrete Fourier transform with persistent homology analysis to identify specific frequencies corresponding to topological image features, enabling guided frequency filtration for compression while preserving meaningful data.
Result: Achieves compression levels comparable to JPEG across six different metrics, with potential to improve binary classification performance when augmenting Convolutional Neural Networks compared to traditional methods.
Conclusion: Persistent homology-guided frequency filtration enhances image compression reliability under noisy conditions by preserving topological features while filtering noise, offering improved robustness for downstream tasks.
Abstract: Feature extraction in noisy image datasets presents many challenges in model reliability. In this paper, we use the discrete Fourier transform in conjunction with persistent homology analysis to extract specific frequencies that correspond with certain topological features of an image. This method allows the image to be compressed and reformed while ensuring that meaningful data can be differentiated. Our experimental results show a level of compression comparable to that of using JPEG using six different metrics. The end goal of persistent homology-guided frequency filtration is its potential to improve performance in binary classification tasks (when augmenting a Convolutional Neural Network) compared to traditional feature extraction and compression methods. These findings highlight a useful end result: enhancing the reliability of image compression under noisy conditions.
[252] VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation
Md Selim Sarowar, Sungho Kim
Main category: cs.CV
TL;DR: Comparison of CLIP vs DINOv2 for 6D object pose estimation in hand-object grasping, showing CLIP excels in semantic understanding while DINOv2 provides superior geometric features.
Details
Motivation: Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision with rich semantic and geometric representations, but their comparative strengths for 3D pose estimation in robotic grasping scenarios need systematic evaluation.Method: Comprehensive visual comparison between CLIP-based and DINOv2-based approaches for 6D object pose estimation in hand-object grasping scenarios, evaluated through extensive experiments on benchmark datasets.
Result: CLIP-based methods achieve better semantic consistency through language grounding, while DINOv2-based approaches demonstrate competitive performance with enhanced geometric precision, revealing complementary strengths.
Conclusion: The analysis provides insights for selecting appropriate vision models for robotic manipulation and grasping applications based on whether semantic understanding or geometric precision is prioritized.
Abstract: Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision by providing rich semantic and geometric representations. This paper presents a comprehensive visual comparison between CLIP based and DINOv2 based approaches for 3D pose estimation in hand object grasping scenarios. We evaluate both models on the task of 6D object pose estimation and demonstrate their complementary strengths: CLIP excels in semantic understanding through language grounding, while DINOv2 provides superior dense geometric features. Through extensive experiments on benchmark datasets, we show that CLIP based methods achieve better semantic consistency, while DINOv2 based approaches demonstrate competitive performance with enhanced geometric precision. Our analysis provides insights for selecting appropriate vision models for robotic manipulation and grasping, picking applications.
[253] Context-measure: Contextualizing Metric for Camouflage
Chen-Yang Wang, Gepeng Ji, Song Shao, Ming-Ming Cheng, Deng-Ping Fan
Main category: cs.CV
TL;DR: Proposes Context-measure, a new evaluation metric for camouflaged object segmentation that incorporates spatial context dependencies, addressing limitations of existing context-independent metrics.
Details
Motivation: Current metrics for camouflaged scenarios overlook the critical context-dependent nature of camouflage. Existing metrics were designed for general/salient objects with uncorrelated spatial context assumptions, making them inadequate for evaluating camouflaged patterns.Method: Develops Context-measure based on a probabilistic pixel-aware correlation framework that incorporates spatial dependencies and pixel-wise camouflage quantification to better align with human perception.
Result: Extensive experiments across three challenging camouflaged object segmentation datasets show Context-measure delivers more reliability than existing context-independent metrics.
Conclusion: Context-measure provides a foundational evaluation benchmark for various computer vision applications involving camouflaged patterns in agricultural, industrial, and medical scenarios.
Abstract: Camouflage is primarily context-dependent yet current metrics for camouflaged scenarios overlook this critical factor. Instead, these metrics are originally designed for evaluating general or salient objects, with an inherent assumption of uncorrelated spatial context. In this paper, we propose a new contextualized evaluation paradigm, Context-measure, built upon a probabilistic pixel-aware correlation framework. By incorporating spatial dependencies and pixel-wise camouflage quantification, our measure better aligns with human perception. Extensive experiments across three challenging camouflaged object segmentation datasets show that Context-measure delivers more reliability than existing context-independent metrics. Our measure can provide a foundational evaluation benchmark for various computer vision applications involving camouflaged patterns, such as agricultural, industrial, and medical scenarios. Code is available at https://github.com/pursuitxi/Context-measure.
[254] Towards Robust Protective Perturbation against DeepFake Face Swapping
Hengyang Yao, Lin Li, Ke Sun, Jianing Qiu, Huiping Chen
Main category: cs.CV
TL;DR: EOLT framework learns optimal transformation distributions for robust face protection against DeepFake attacks, outperforming standard EOT methods.
Details
Motivation: Existing DeepFake protection methods using invisible perturbations are fragile against basic transformations like compression/resizing. Standard Expectation over Transformation (EOT) with uniform sampling is suboptimal because protection robustness is highly sensitive to transformation choices.Method: Proposes Expectation Over Learned distribution of Transformation (EOLT) framework that treats transformation distribution as learnable. Uses policy network with reinforcement learning to automatically prioritize critical transformations and generate instance-specific perturbations, modeling defensive bottlenecks while maintaining transferability.
Result: Achieves 26% higher average robustness and up to 30% gains on challenging transformation categories compared to state-of-the-art approaches.
Conclusion: EOLT framework significantly improves DeepFake protection robustness by learning optimal transformation distributions rather than using fixed uniform sampling, demonstrating superior performance across diverse transformation categories.
Abstract: DeepFake face swapping enables highly realistic identity forgeries, posing serious privacy and security risks. A common defence embeds invisible perturbations into images, but these are fragile and often destroyed by basic transformations such as compression or resizing. In this paper, we first conduct a systematic analysis of 30 transformations across six categories and show that protection robustness is highly sensitive to the choice of training transformations, making the standard Expectation over Transformation (EOT) with uniform sampling fundamentally suboptimal. Motivated by this, we propose Expectation Over Learned distribution of Transformation (EOLT), the framework to treat transformation distribution as a learnable component rather than a fixed design choice. Specifically, EOLT employs a policy network that learns to automatically prioritize critical transformations and adaptively generate instance-specific perturbations via reinforcement learning, enabling explicit modeling of defensive bottlenecks while maintaining broad transferability. Extensive experiments demonstrate that our method achieves substantial improvements over state-of-the-art approaches, with 26% higher average robustness and up to 30% gains on challenging transformation categories.
[255] DFIR-DETR: Frequency Domain Enhancement and Dynamic Feature Aggregation for Cross-Scene Small Object Detection
Bo Gao, Jingcheng Tong, Xingsheng Chen, Han Yu, Zichen Li
Main category: cs.CV
TL;DR: DFIR-DETR: A transformer-based detector using dynamic feature aggregation and frequency-domain processing for small object detection in UAV images and industrial inspection, achieving SOTA results with lightweight architecture.
Details
Motivation: Small object detection in UAV remote sensing and industrial defect inspection faces challenges: sparse/weak features, cluttered backgrounds, and dramatic scale variations. Current transformer detectors struggle with feature degradation during downsampling, limited long-range dependency capture by spatial convolutions, and unnecessary feature map inflation from standard upsampling.Method: Three novel components: 1) DCFA module with dynamic K-sparse attention (O(NK) complexity) and spatial gated linear units for nonlinear modeling; 2) DFPN module with amplitude-normalized upsampling to prevent feature inflation and dual-path shuffle convolution for spatial detail retention; 3) FIRC3 module operating in frequency domain for global receptive fields without efficiency loss.
Result: Achieved mAP50 scores of 92.9% on NEU-DET and 51.6% on VisDrone datasets, both state-of-the-art. Model remains lightweight with only 11.7M parameters and 41.2 GFLOPs.
Conclusion: DFIR-DETR effectively addresses small object detection challenges through dynamic feature aggregation and frequency-domain processing, demonstrating strong generalization across different domains and effectiveness in resource-limited settings for cross-scene small object detection.
Abstract: Detecting small objects in UAV remote sensing images and identifying surface defects in industrial inspection remain difficult tasks. These applications face common obstacles: features are sparse and weak, backgrounds are cluttered, and object scales vary dramatically. Current transformer-based detectors, while powerful, struggle with three critical issues. First, features degrade severely as networks downsample progressively. Second, spatial convolutions cannot capture long-range dependencies effectively. Third, standard upsampling methods inflate feature maps unnecessarily. We introduce DFIR-DETR to tackle these problems through dynamic feature aggregation combined with frequency-domain processing. Our architecture builds on three novel components. The DCFA module uses dynamic K-sparse attention, cutting complexity from O(N2) down to O(NK), and employs spatial gated linear units for better nonlinear modeling. The DFPN module applies amplitude-normalized upsampling to prevent feature inflation and uses dual-path shuffle convolution to retain spatial details across scales. The FIRC3 module operates in the frequency domain, achieving global receptive fields without sacrificing efficiency. We tested our method extensively on NEU-DET and VisDrone datasets. Results show mAP50 scores of 92.9% and 51.6% respectively-both state-of-the-art. The model stays lightweight with just 11.7M parameters and 41.2 GFLOPs. Strong performance across two very different domains confirms that DFIR-DETR generalizes well and works effectively in resource-limited settings for cross-scene small object detection.
[256] Dropout Prompt Learning: Towards Robust and Adaptive Vision-Language Models
Biao Chen, Lin Zuo, Mengmeng Jing, Kunbin He, Yuchen Wang
Main category: cs.CV
TL;DR: Dropout Prompt Learning applies dropout to vision-language models by evaluating token significance across modalities and using residual entropy regularization to maintain semantic alignment while encouraging diverse representations.
Details
Motivation: Dropout is effective for regularization and generalization in models, but vanilla dropout doesn't consider token significance in multimodal contexts. The authors aim to improve robustness of vision-language models by applying dropout that accounts for both intra-modal context and inter-modal alignment.Method: Proposes Dropout Prompt Learning with two key components: 1) Dropout applied to textual and visual tokens with flexible probabilities based on token significance considering both intra-modal context and inter-modal alignment, 2) Residual entropy regularization to maintain semantic alignment for knowledge transfer while encouraging diverse representations from dropout.
Result: Experiments on 15 benchmarks show effectiveness in challenging scenarios including low-shot learning, long-tail classification, and out-of-distribution generalization. Notably outperforms regularization-based methods like KgCoOp by 5.10% and PromptSRC by 2.13% on base-to-novel generalization.
Conclusion: Dropout Prompt Learning effectively improves robustness of vision-language models by intelligently applying dropout based on token significance across modalities, with residual entropy regularization balancing semantic alignment and representation diversity.
Abstract: Dropout is a widely used regularization technique which improves the generalization ability of a model by randomly dropping neurons. In light of this, we propose Dropout Prompt Learning, which aims for applying dropout to improve the robustness of the vision-language models. Different from the vanilla dropout, we apply dropout on the tokens of the textual and visual branches, where we evaluate the token significance considering both intra-modal context and inter-modal alignment, enabling flexible dropout probabilities for each token. Moreover, to maintain semantic alignment for general knowledge transfer while encouraging the diverse representations that dropout introduces, we further propose residual entropy regularization. Experiments on 15 benchmarks show our method’s effectiveness in challenging scenarios like low-shot learning, long-tail classification, and out-of-distribution generalization. Notably, our method surpasses regularization-based methods including KgCoOp by 5.10% and PromptSRC by 2.13% in performance on base-to-novel generalization.
[257] COREA: Coarse-to-Fine 3D Representation Alignment Between Relightable 3D Gaussians and SDF via Bidirectional 3D-to-3D Supervision
Jaeyoon Lee, Hojoon Jung, Sungtae Hwang, Jihyong Oh, Jongwon Choi
Main category: cs.CV
TL;DR: COREA is a unified framework that jointly learns relightable 3D Gaussians and SDF for accurate geometry and faithful relighting, addressing limitations of existing 3DGS methods through 3D-to-3D alignment.
Details
Motivation: Recent 3D Gaussian Splatting methods have limitations: geometry learned from 2D renderings leads to coarse surfaces and unreliable BRDF-lighting decomposition, preventing accurate mesh reconstruction and physically-based rendering.Method: COREA introduces a coarse-to-fine bidirectional 3D-to-3D alignment strategy: depth provides coarse alignment, depth gradients and normals refine fine-scale structure. Also includes density-control mechanism for stable Gaussian growth and memory efficiency.
Result: Experiments on standard benchmarks show COREA achieves superior performance in novel-view synthesis, mesh reconstruction, and physically-based rendering within a unified framework.
Conclusion: COREA successfully addresses limitations of existing 3DGS methods by enabling direct 3D geometric learning, resulting in accurate geometry reconstruction and stable BRDF-lighting decomposition for faithful relighting.
Abstract: We present COREA, the first unified framework that jointly learns relightable 3D Gaussians and a Signed Distance Field (SDF) for accurate geometry reconstruction and faithful relighting. While recent 3D Gaussian Splatting (3DGS) methods have extended toward mesh reconstruction and physically-based rendering (PBR), their geometry is still learned from 2D renderings, leading to coarse surfaces and unreliable BRDF-lighting decomposition. To address these limitations, COREA introduces a coarse-to-fine bidirectional 3D-to-3D alignment strategy that allows geometric signals to be learned directly in 3D space. Within this strategy, depth provides coarse alignment between the two representations, while depth gradients and normals refine fine-scale structure, and the resulting geometry supports stable BRDF-lighting decomposition. A density-control mechanism further stabilizes Gaussian growth, balancing geometric fidelity with memory efficiency. Experiments on standard benchmarks demonstrate that COREA achieves superior performance in novel-view synthesis, mesh reconstruction, and PBR within a unified framework.
[258] DGGAN: Degradation Guided Generative Adversarial Network for Real-time Endoscopic Video Enhancement
Handing Xu, Zhenguo Nie, Tairan Peng, Huimin Pan, Xin-Jun Liu
Main category: cs.CV
TL;DR: A degradation-aware framework for real-time endoscopic video enhancement that propagates degradation representations across frames using contrastive learning and cycle-consistency constraints.
Details
Motivation: Endoscopic surgery relies on video quality, but videos suffer from uneven illumination, tissue scattering, occlusions, and motion blur that obscure critical anatomical details. Existing deep learning methods are too computationally demanding for real-time surgical use.Method: Proposes a degradation-aware framework that extracts degradation representations using contrastive learning, then fuses these representations with image features to guide a single-frame enhancement model. Uses cycle-consistency constraints between degraded and restored images for robustness.
Result: Achieves superior balance between performance and efficiency compared to state-of-the-art methods, demonstrating effectiveness for real-time endoscopic video enhancement.
Conclusion: Implicitly learning and propagating degradation representations offers a practical pathway for clinical application of real-time endoscopic video enhancement.
Abstract: Endoscopic surgery relies on intraoperative video, making image quality a decisive factor for surgical safety and efficacy. Yet, endoscopic videos are often degraded by uneven illumination, tissue scattering, occlusions, and motion blur, which obscure critical anatomical details and complicate surgical manipulation. Although deep learning-based methods have shown promise in image enhancement, most existing approaches remain too computationally demanding for real-time surgical use. To address this challenge, we propose a degradation-aware framework for endoscopic video enhancement, which enables real-time, high-quality enhancement by propagating degradation representations across frames. In our framework, degradation representations are first extracted from images using contrastive learning. We then introduce a fusion mechanism that modulates image features with these representations to guide a single-frame enhancement model, which is trained with a cycle-consistency constraint between degraded and restored images to improve robustness and generalization. Experiments demonstrate that our framework achieves a superior balance between performance and efficiency compared with several state-of-the-art methods. These results highlight the effectiveness of degradation-aware modeling for real-time endoscopic video enhancement. Nevertheless, our method suggests that implicitly learning and propagating degradation representation offer a practical pathway for clinical application.
[259] MSN: Multi-directional Similarity Network for Hand-crafted and Deep-synthesized Copy-Move Forgery Detection
Liangwei Jiang, Jinluo Xie, Yecheng Huang, Hua Zhang, Hongyu Yang, Di Huang
Main category: cs.CV
TL;DR: Proposes MSN, a two-stream network for copy-move forgery detection that improves representation and localization using multi-directional CNN and 2D similarity matrix decoder, with new deep-synthesized forgery benchmark.
Details
Motivation: Copy-move forgery detection is challenging due to complex transformations and fine-tuned operations. Existing deep detection models have limitations in representation and localization capabilities.Method: Multi-directional Similarity Network (MSN) with two streams: 1) hierarchical encoding using multi-directional CNN with diverse scale/rotation augmentations for better patch similarity measurement, 2) 2-D similarity matrix decoder that utilizes full spatial information instead of 1-D vectors.
Result: State-of-the-art results on CASIA CMFD, CoMoFoD benchmarks and new deep-synthesized forgery database. The approach demonstrates effectiveness in accurate and efficient copy-move forgery detection.
Conclusion: MSN addresses representation and localization limitations in existing models, provides better performance on traditional and new deep-synthesized copy-move forgeries, and introduces a new benchmark for evaluating detection methods.
Abstract: Copy-move image forgery aims to duplicate certain objects or to hide specific contents with copy-move operations, which can be achieved by a sequence of manual manipulations as well as up-to-date deep generative network-based swapping. Its detection is becoming increasingly challenging for the complex transformations and fine-tuned operations on the tampered regions. In this paper, we propose a novel two-stream model, namely Multi-directional Similarity Network (MSN), to accurate and efficient copy-move forgery detection. It addresses the two major limitations of existing deep detection models in \textbf{representation} and \textbf{localization}, respectively. In representation, an image is hierarchically encoded by a multi-directional CNN network, and due to the diverse augmentation in scales and rotations, the feature achieved better measures the similarity between sampled patches in two streams. In localization, we design a 2-D similarity matrix based decoder, and compared with the current 1-D similarity vector based one, it makes full use of spatial information in the entire image, leading to the improvement in detecting tampered regions. Beyond the method, a new forgery database generated by various deep neural networks is presented, as a new benchmark for detecting the growing deep-synthesized copy-move. Extensive experiments are conducted on two classic image forensics benchmarks, \emph{i.e.} CASIA CMFD and CoMoFoD, and the newly presented one. The state-of-the-art results are reported, which demonstrate the effectiveness of the proposed approach.
[260] Training-free Clothing Region of Interest Self-correction for Virtual Try-On
Shengjie Lu, Zhibin Wan, Jiejie Liu, Quan Zhang, Mingjie Sun
Main category: cs.CV
TL;DR: A virtual try-on method using energy constraints on attention maps to better preserve clothing details, with a new evaluation metric VTID and improved performance on multiple datasets.
Details
Motivation: Existing VTON methods have discrepancies between generated clothing and target clothing in patterns, textures, and boundaries. Current evaluation metrics focus only on image realism and ignore alignment with target elements.Method: Proposes using an energy function to impose constraints on attention maps during generation, making attention focus more on clothing regions. Also introduces a new evaluation metric called Virtual Try-on Inception Distance (VTID).
Result: Outperforms previous SOTA methods by 1.4% (LPIPS), 2.3% (FID), 12.3% (KID), and 5.8% (VTID) on VITON-HD and DressCode datasets. Also improves downstream CC-Reid performance by 2.5%, 1.1%, and 1.6% on LTCC, PRCC, and VC-Clothes datasets.
Conclusion: The proposed attention constraint method effectively preserves clothing details in virtual try-on, and the new VTID metric provides more comprehensive evaluation. The approach benefits downstream tasks like clothing-change re-identification.
Abstract: VTON (Virtual Try-ON) aims at synthesizing the target clothing on a certain person, preserving the details of the target clothing while keeping the rest of the person unchanged. Existing methods suffer from the discrepancies between the generated clothing results and the target ones, in terms of the patterns, textures and boundaries. Therefore, we propose to use an energy function to impose constraints on the attention map extracted through the generation process. Thus, at each generation step, the attention can be more focused on the clothing region of interest, thereby influencing the generation results to be more consistent with the target clothing details. Furthermore, to address the limitation that existing evaluation metrics concentrate solely on image realism and overlook the alignment with target elements, we design a new metric, Virtual Try-on Inception Distance (VTID), to bridge this gap and ensure a more comprehensive assessment. On the VITON-HD and DressCode datasets, our approach has outperformed the previous state-of-the-art (SOTA) methods by 1.4%, 2.3%, 12.3%, and 5.8% in the traditional metrics of LPIPS, FID, KID, and the new VTID metrics, respectively. Additionally, by applying the generated data to downstream Clothing-Change Re-identification (CC-Reid) methods, we have achieved performance improvements of 2.5%, 1.1%, and 1.6% on the LTCC, PRCC, VC-Clothes datasets in the metrics of Rank-1. The code of our method is public at https://github.com/MrWhiteSmall/CSC-VTON.git.
[261] Effective Attention-Guided Multi-Scale Medical Network for Skin Lesion Segmentation
Siyu Wang, Hua Wang, Huiyu Li, Fan Zhang
Main category: cs.CV
TL;DR: Proposes a novel encoder-decoder network with multi-scale residual structures, MRCF module, CMAM attention, and EAB bridge for improved skin lesion segmentation, outperforming existing CNN and transformer models.
Details
Motivation: Precise skin lesion segmentation is crucial for early detection and diagnosis of skin diseases, but existing methods struggle with irregular lesion shapes and low contrast in medical images.Method: Innovative encoder-decoder network with multi-scale residual structures, Multi-Resolution Multi-Channel Fusion (MRCF) module for cross-scale features, Cross-Mix Attention Module (CMAM) for dynamic weight calculation across contexts, and External Attention Bridge (EAB) to address information loss in skip connections.
Result: Extensive experiments on multiple skin lesion segmentation datasets show the proposed model significantly outperforms existing transformer and CNN-based models, demonstrating exceptional segmentation accuracy and robustness.
Conclusion: The proposed architecture effectively addresses challenges in skin lesion segmentation through innovative modules that enhance feature extraction, attention mechanisms, and information flow, achieving state-of-the-art performance for medical image analysis.
Abstract: In the field of healthcare, precise skin lesion segmentation is crucial for the early detection and accurate diagnosis of skin diseases. Despite significant advances in deep learning for image processing, existing methods have yet to effectively address the challenges of irregular lesion shapes and low contrast. To address these issues, this paper proposes an innovative encoder-decoder network architecture based on multi-scale residual structures, capable of extracting rich feature information from different receptive fields to effectively identify lesion areas. By introducing a Multi-Resolution Multi-Channel Fusion (MRCF) module, our method captures cross-scale features, enhancing the clarity and accuracy of the extracted information. Furthermore, we propose a Cross-Mix Attention Module (CMAM), which redefines the attention scope and dynamically calculates weights across multiple contexts, thus improving the flexibility and depth of feature capture and enabling deeper exploration of subtle features. To overcome the information loss caused by skip connections in traditional U-Net, an External Attention Bridge (EAB) is introduced, facilitating the effective utilization of information in the decoder and compensating for the loss during upsampling. Extensive experimental evaluations on several skin lesion segmentation datasets demonstrate that the proposed model significantly outperforms existing transformer and convolutional neural network-based models, showcasing exceptional segmentation accuracy and robustness.
[262] MulCLIP: A Multi-level Alignment Framework for Enhancing Fine-grained Long-context CLIP
Chau Truong, Hieu Ta Quang, Dung D. Le
Main category: cs.CV
TL;DR: MulCLIP is a multi-level alignment framework that improves vision-language models’ ability to handle lengthy, detailed text descriptions without requiring region-proposal information, reducing deployment costs while enhancing fine-grained understanding.
Details
Motivation: Current vision-language models like CLIP struggle with lengthy, detailed descriptions because they're trained on short captions. Existing solutions use region-proposal information which increases deployment costs, creating a need for more efficient approaches.Method: MulCLIP uses end-to-end multi-level alignment: 1) Global contrastive alignment between images and both summary/long captions with extended positional embeddings, 2) Token reconstruction alignment with locally calibrated features to strengthen word-patch connections, and 3) Subcaption-aggregated patch alignment that automatically extracts context-rich patches for each subcaption.
Result: The method consistently improves downstream performance across diverse benchmarks. Ablation studies confirm its multi-scale alignment drives better fine-grained capability than region-proposal-assisted approaches, making it suitable for real-world applications.
Conclusion: MulCLIP effectively bridges natural long-text structures with image components through multi-level alignment, achieving better fine-grained understanding without the deployment costs of region-proposal methods, making it practical for diverse real-world applications.
Abstract: Vision-language models like CLIP show impressive ability to align images and text, but their training on short, concise captions makes them struggle with lengthy, detailed descriptions. Recent advances mitigate this challenge by leveraging region-proposal information to map visual regions with corresponding sentences from lengthy captions, yet incurring notable deployment costs. We introduce MulCLIP, a novel end-to-end multi-level alignment framework that bridges natural long-text structures with image components. MulCLIP first preserves global contrastive alignment between images and both summary and long captions, while extending positional embeddings for longer text sequences. To further enhance fine-grained understanding, we propose two novel strategies: (1) a token reconstruction alignment over locally calibrated features to strengthen semantic connections between words and image patches, and (2) a subcaption-aggregated patch alignment that automatically extracts and aggregates context-rich patches for each subcaption. Experimental results across diverse benchmarks demonstrate our method consistently improves downstream performance, while ablation studies confirm its multi-scale alignment is the key factor driving better fine-grained capability than region-proposal-assisted approaches, making it particularly suitable for diverse real-world applications.
[263] Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts
Mingning Guo, Mengwei Wu, Shaoxian Li, Haifeng Li, Chao Tao
Main category: cs.CV
TL;DR: AerialVP is a novel agent framework that enhances task prompts for UAV image perception by extracting multi-dimensional auxiliary information, overcoming limitations of traditional VLM-based methods that struggle with UAV imagery challenges.
Details
Motivation: Traditional VLM-based image perception methods face limitations with UAV imagery due to challenges like target confusion, scale variations, and complex backgrounds. These issues arise because VLMs depend on semantic alignment between visual and textual tokens, which becomes difficult when task prompts are simplistic and image content is complex.Method: AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts through a three-stage process: (1) analyzing task prompts to identify task type and enhancement needs, (2) selecting appropriate tools from a tool repository, and (3) generating enhanced task prompts based on the analysis and selected tools.
Result: The authors introduce AerialSense, a comprehensive benchmark for UAV image perception including Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks. Experimental results show AerialVP significantly enhances task prompt guidance, leading to stable and substantial performance improvements in both open-source and proprietary VLMs.
Conclusion: AerialVP addresses the limitations of traditional VLM-based approaches for UAV imagery by enhancing task prompts with auxiliary information extraction, resulting in improved performance across diverse UAV image perception tasks and conditions.
Abstract: Existing image perception methods based on VLMs generally follow a paradigm wherein models extract and analyze image content based on user-provided textual task prompts. However, such methods face limitations when applied to UAV imagery, which presents challenges like target confusion, scale variations, and complex backgrounds. These challenges arise because VLMs’ understanding of image content depends on the semantic alignment between visual and textual tokens. When the task prompt is simplistic and the image content is complex, achieving effective alignment becomes difficult, limiting the model’s ability to focus on task-relevant information. To address this issue, we introduce AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts, overcoming the limitations of traditional VLM-based approaches. Specifically, the enhancement process includes three stages: (1) analyzing the task prompt to identify the task type and enhancement needs, (2) selecting appropriate tools from the tool repository, and (3) generating enhanced task prompts based on the analysis and selected tools. To evaluate AerialVP, we introduce AerialSense, a comprehensive benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks. AerialSense provides a standardized basis for evaluating model generalization and performance across diverse resolutions, lighting conditions, and both urban and natural scenes. Experimental results demonstrate that AerialVP significantly enhances task prompt guidance, leading to stable and substantial performance improvements in both open-source and proprietary VLMs. Our work will be available at https://github.com/lostwolves/AerialVP.
[264] CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics
Dahyeon Kye, Jeahun Sung, MinKyu Jeon, Jihyong Oh
Main category: cs.CV
TL;DR: CHIMERA is a zero-shot diffusion framework for smooth image morphing using cached inversion-guided denoising with adaptive feature injection and semantic prompting.
Details
Motivation: Existing diffusion-based image morphing methods often produce abrupt transitions or over-saturated appearances due to lack of adaptive structural and semantic alignments between dissimilar images.Method: CHIMERA formulates morphing as cached inversion-guided denoising with two key components: Adaptive Cache Injection (ACI) that caches and adaptively re-injects features from both input images during denoising, and Semantic Anchor Prompting (SAP) that generates shared semantic prompts using vision-language models to bridge dissimilar inputs.
Result: Extensive experiments and user studies show CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing new state-of-the-art in image morphing. The paper also introduces GLCS, a morphing-specific evaluation metric.
Conclusion: CHIMERA successfully addresses the challenge of smooth image morphing with large semantic disparities through adaptive feature injection and semantic prompting, demonstrating superior performance over existing approaches.
Abstract: Diffusion models exhibit remarkable generative ability, yet achieving smooth and semantically consistent image morphing remains a challenge. Existing approaches often yield abrupt transitions or over-saturated appearances due to the lack of adaptive structural and semantic alignments. We propose CHIMERA, a zero-shot diffusion-based framework that formulates morphing as a cached inversion-guided denoising process. To handle large semantic and appearance disparities, we propose Adaptive Cache Injection and Semantic Anchor Prompting. Adaptive Cache Injection (ACI) caches down, mid, and up blocks features from both inputs during DDIM inversion and re-injects them adaptively during denoising, enabling spatial and semantic alignment in depth- and time-adaptive manners and enabling natural feature fusion and smooth transitions. Semantic Anchor Prompting (SAP) leverages a vision-language model to generate a shared anchor prompt that serves as a semantic anchor, bridging dissimilar inputs and guiding the denoising process toward coherent results. Finally, we introduce the Global-Local Consistency Score (GLCS), a morphing-oriented metric that simultaneously evaluates the global harmonization of the two inputs and the smoothness of the local morphing transition. Extensive experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing a new state of the art in image morphing. The code and project page will be publicly released.
[265] MuSASplat: Efficient Sparse-View 3D Gaussian Splats via Lightweight Multi-Scale Adaptation
Muyu Xu, Fangneng Zhan, Xiaoqin Zhang, Ling Shao, Shijian Lu
Main category: cs.CV
TL;DR: MuSASplat is a lightweight framework for sparse-view 3D Gaussian splatting that reduces computational costs while maintaining rendering quality through efficient fine-tuning and feature aggregation.
Details
Motivation: Existing pose-free feed-forward methods for sparse-view 3D Gaussian splatting rely on full fine-tuning of large Vision Transformer backbones, which incurs substantial GPU costs and computational burden.Method: Introduces two key components: 1) Multi-Scale Adapter for efficient fine-tuning of ViT-based architectures with minimal parameters, and 2) Feature Fusion Aggregator that integrates features across input views effectively while reducing memory usage and computational costs compared to memory banks.
Result: Extensive experiments show MuSASplat achieves state-of-the-art rendering quality with significantly reduced parameters and training resource requirements compared to existing methods, maintaining high fidelity in novel view synthesis even with very sparse input views.
Conclusion: MuSASplat provides an efficient framework that dramatically reduces computational burden for training pose-free feed-forward 3D Gaussian splatting models while preserving rendering quality, making it more accessible and practical.
Abstract: Sparse-view 3D Gaussian splatting seeks to render high-quality novel views of 3D scenes from a limited set of input images. While recent pose-free feed-forward methods leveraging pre-trained 3D priors have achieved impressive results, most of them rely on full fine-tuning of large Vision Transformer (ViT) backbones and incur substantial GPU costs. In this work, we introduce MuSASplat, a novel framework that dramatically reduces the computational burden of training pose-free feed-forward 3D Gaussian splats models with little compromise of rendering quality. Central to our approach is a lightweight Multi-Scale Adapter that enables efficient fine-tuning of ViT-based architectures with only a small fraction of training parameters. This design avoids the prohibitive GPU overhead associated with previous full-model adaptation techniques while maintaining high fidelity in novel view synthesis, even with very sparse input views. In addition, we introduce a Feature Fusion Aggregator that integrates features across input views effectively and efficiently. Unlike widely adopted memory banks, the Feature Fusion Aggregator ensures consistent geometric integration across input views and meanwhile mitigates the memory usage, training complexity, and computational costs significantly. Extensive experiments across diverse datasets show that MuSASplat achieves state-of-the-art rendering quality but has significantly reduced parameters and training resource requirements as compared with existing methods.
[266] When Privacy Meets Recovery: The Overlooked Half of Surrogate-Driven Privacy Preservation for MLLM Editing
Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang, Sam Kwong
Main category: cs.CV
TL;DR: This paper addresses privacy leakage in MLLMs by developing a method to restore surrogate-driven protected data, introducing the SPPE dataset and a unified approach for privacy recovery while preserving MLLM editing quality.
Details
Motivation: Existing privacy protection methods for MLLMs effectively obscure private information but fail to evaluate the authenticity and recovery quality of user privacy, creating a research gap in restoring protected data across diverse MLLM scenarios.Method: The paper introduces the SPPE dataset with various privacy categories and user instructions, and proposes a unified approach that formulates privacy recovery as a guided generation task conditioned on complementary multimodal signals to reconstruct private content while preserving MLLM editing fidelity.
Result: Experiments on both SPPE and InstructPix2Pix datasets show the approach generalizes well across diverse visual content and editing tasks, achieving a strong balance between privacy protection and MLLM usability.
Conclusion: The work successfully addresses the critical challenge of restoring surrogate-driven protected data in MLLMs, providing both a benchmark dataset and a practical solution that maintains privacy recovery quality while preserving MLLM editing capabilities.
Abstract: Privacy leakage in Multimodal Large Language Models (MLLMs) has long been an intractable problem. Existing studies, though effectively obscure private information in MLLMs, often overlook the evaluation of the authenticity and recovery quality of user privacy. To this end, this work uniquely focuses on the critical challenge of how to restore surrogate-driven protected data in diverse MLLM scenarios. We first bridge this research gap by contributing the SPPE (Surrogate Privacy Protected Editable) dataset, which includes a wide range of privacy categories and user instructions to simulate real MLLM applications. This dataset offers protected surrogates alongside their various MLLM-edited versions, thus enabling the direct assessment of privacy recovery quality. By formulating privacy recovery as a guided generation task conditioned on complementary multimodal signals, we further introduce a unified approach that reliably reconstructs private content while preserving the fidelity of MLLM-generated edits. The experiments on both SPPE and InstructPix2Pix further show that our approach generalizes well across diverse visual content and editing tasks, achieving a strong balance between privacy protection and MLLM usability.
[267] TIDE: Two-Stage Inverse Degradation Estimation with Guided Prior Disentanglement for Underwater Image Restoration
Shravan Venkatraman, Rakesh Raj Madavan, Pavan Kumar S, Muthu Subash Kavitha
Main category: cs.CV
TL;DR: TIDE is a two-stage underwater image restoration framework that explicitly models degradation characteristics through specialized prior decomposition, adaptively fusing restoration hypotheses based on local degradation patterns.
Details
Motivation: Underwater image restoration is crucial for marine applications, but existing methods struggle with spatially varying degradations that co-occur and change with water conditions. Uniform restoration strategies fail to handle these complex, spatially varying degradations effectively.Method: TIDE decomposes underwater degradations into four key factors (color distortion, haze, detail loss, noise) and designs specialized restoration experts for each. It disentangles restoration into multiple specialized hypotheses that are adaptively fused based on local degradation patterns, followed by progressive refinement to correct residual artifacts.
Result: TIDE achieves competitive performance on reference-based fidelity metrics while outperforming state-of-the-art methods on non-reference perceptual quality metrics, with strong improvements in color correction and contrast enhancement across standard benchmarks and challenging turbid water conditions.
Conclusion: The explicit modeling of degradation characteristics through specialized prior decomposition enables TIDE to effectively handle spatially varying underwater degradations, producing natural results even in highly degraded regions by balancing competing degradation factors.
Abstract: Underwater image restoration is essential for marine applications ranging from ecological monitoring to archaeological surveys, but effectively addressing the complex and spatially varying nature of underwater degradations remains a challenge. Existing methods typically apply uniform restoration strategies across the entire image, struggling to handle multiple co-occurring degradations that vary spatially and with water conditions. We introduce TIDE, a $\underline{t}$wo stage $\underline{i}$nverse $\underline{d}$egradation $\underline{e}$stimation framework that explicitly models degradation characteristics and applies targeted restoration through specialized prior decomposition. Our approach disentangles the restoration process into multiple specialized hypotheses that are adaptively fused based on local degradation patterns, followed by a progressive refinement stage that corrects residual artifacts. Specifically, TIDE decomposes underwater degradations into four key factors, namely color distortion, haze, detail loss, and noise, and designs restoration experts specialized for each. By generating specialized restoration hypotheses, TIDE balances competing degradation factors and produces natural results even in highly degraded regions. Extensive experiments across both standard benchmarks and challenging turbid water conditions show that TIDE achieves competitive performance on reference based fidelity metrics while outperforming state of the art methods on non reference perceptual quality metrics, with strong improvements in color correction and contrast enhancement. Our code is available at: https://rakesh-123-cryp.github.io/TIDE.
[268] ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation
Ziyang Mai, Yu-Wing Tai
Main category: cs.CV
TL;DR: ContextAnyone is a context-aware diffusion framework that generates character-consistent videos from text and a single reference image, preserving not just facial identity but also hairstyle, outfit, and body shape.
Details
Motivation: Existing text-to-video personalization methods focus mainly on facial identity but fail to preserve broader contextual cues like hairstyle, outfit, and body shape, which are critical for visual coherence in character-consistent video generation.Method: A context-aware diffusion framework that jointly reconstructs the reference image and generates new video frames. Uses an Emphasize-Attention module to selectively reinforce reference-aware features, dual-guidance loss combining diffusion and reference reconstruction objectives, and Gap-RoPE positional embedding to separate reference and video tokens for stable temporal modeling.
Result: ContextAnyone outperforms existing reference-to-video methods in identity consistency and visual quality, generating coherent and context-preserving character videos across diverse motions and scenes.
Conclusion: The proposed framework effectively addresses the challenge of maintaining consistent character identities in text-to-video generation by fully perceiving and utilizing reference information through novel attention mechanisms and training objectives.
Abstract: Text-to-video (T2V) generation has advanced rapidly, yet maintaining consistent character identities across scenes remains a major challenge. Existing personalization methods often focus on facial identity but fail to preserve broader contextual cues such as hairstyle, outfit, and body shape, which are critical for visual coherence. We propose \textbf{ContextAnyone}, a context-aware diffusion framework that achieves character-consistent video generation from text and a single reference image. Our method jointly reconstructs the reference image and generates new video frames, enabling the model to fully perceive and utilize reference information. Reference information is effectively integrated into a DiT-based diffusion backbone through a novel Emphasize-Attention module that selectively reinforces reference-aware features and prevents identity drift across frames. A dual-guidance loss combines diffusion and reference reconstruction objectives to enhance appearance fidelity, while the proposed Gap-RoPE positional embedding separates reference and video tokens to stabilize temporal modeling. Experiments demonstrate that ContextAnyone outperforms existing reference-to-video methods in identity consistency and visual quality, generating coherent and context-preserving character videos across diverse motions and scenes. Project page: \href{https://github.com/ziyang1106/ContextAnyone}{https://github.com/ziyang1106/ContextAnyone}.
[269] Integrating Multi-scale and Multi-filtration Topological Features for Medical Image Classification
Pengfei Gu, Huimin Li, Haoteng Tang, Dongkuan, Xu, Erik Enriquez, DongChul Kim, Bin Fu, Danny Z. Chen
Main category: cs.CV
TL;DR: A topology-guided classification framework that extracts multi-scale and multi-filtration persistent topological features and integrates them with vision backbones for improved medical image classification.
Details
Motivation: Current deep neural networks for medical image classification either focus too much on pixel-intensity features instead of fundamental anatomical structures, or capture only simple topological features via single-parameter persistence, missing complex anatomical patterns.Method: 1) Compute cubical persistence diagrams across multiple image resolutions/scales; 2) Develop a “vineyard” algorithm to consolidate these diagrams into a single stable diagram capturing signatures from global anatomy to local irregularities; 3) Design a cross-attention-based neural network to process consolidated persistence diagrams; 4) Fuse topological embeddings with feature maps from CNNs or Transformers in an end-to-end architecture.
Result: Evaluations on three public datasets show consistent, considerable improvements over strong baselines and state-of-the-art methods, demonstrating enhanced model capacity to recognize complex anatomical structures.
Conclusion: The comprehensive topological perspective provides robust and interpretable medical image classification by integrating multi-scale and multi-filtration topological features with vision backbones.
Abstract: Modern deep neural networks have shown remarkable performance in medical image classification. However, such networks either emphasize pixel-intensity features instead of fundamental anatomical structures (e.g., those encoded by topological invariants), or they capture only simple topological features via single-parameter persistence. In this paper, we propose a new topology-guided classification framework that extracts multi-scale and multi-filtration persistent topological features and integrates them into vision classification backbones. For an input image, we first compute cubical persistence diagrams (PDs) across multiple image resolutions/scales. We then develop a ``vineyard’’ algorithm that consolidates these PDs into a single, stable diagram capturing signatures at varying granularities, from global anatomy to subtle local irregularities that may indicate early-stage disease. To further exploit richer topological representations produced by multiple filtrations, we design a cross-attention-based neural network that directly processes the consolidated final PDs. The resulting topological embeddings are fused with feature maps from CNNs or Transformers. By integrating multi-scale and multi-filtration topologies into an end-to-end architecture, our approach enhances the model’s capacity to recognize complex anatomical structures. Evaluations on three public datasets show consistent, considerable improvements over strong baselines and state-of-the-art methods, demonstrating the value of our comprehensive topological perspective for robust and interpretable medical image classification.
[270] RefLSM: Linearized Structural-Prior Reflectance Model for Medical Image Segmentation and Bias-Field Correction
Wenqi Zhao, Jiacheng Sang, Fenghua Cheng, Yonglu Shu, Dong Li, Xiaofeng Yang
Main category: cs.CV
TL;DR: Proposes RefLSM, a reflectance-based level set model for medical image segmentation that decomposes images into reflectance and bias field components, integrates structural priors, and uses convex relaxation for stable evolution.
Details
Motivation: Medical image segmentation faces challenges like intensity inhomogeneity, noise, blurred boundaries, and irregular structures. Traditional level set methods struggle under severe non-uniform imaging conditions due to approximate bias field estimations.Method: Proposes Reflectance-based Level Set Model (RefLSM) that integrates Retinex-inspired reflectance decomposition. Uses linear structural prior to guide reflectance gradients, embeds relaxed binary level-set via convex relaxation and sign projection, and solves with ADMM-based optimization.
Result: Extensive experiments on multiple medical imaging datasets show RefLSM achieves superior segmentation accuracy, robustness, and computational efficiency compared to state-of-the-art level set methods.
Conclusion: RefLSM effectively addresses limitations of traditional level set methods by explicitly modeling reflectance invariance to illumination, providing reliable geometric guidance, and ensuring stable evolution through convex relaxation techniques.
Abstract: Medical image segmentation remains challenging due to intensity inhomogeneity, noise, blurred boundaries, and irregular structures. Traditional level set methods, while effective in certain cases, often depend on approximate bias field estimations and therefore struggle under severe non-uniform imaging conditions. To address these limitations, we propose a novel variational Reflectance-based Level Set Model (RefLSM), which explicitly integrates Retinex-inspired reflectance decomposition into the segmentation framework. By decomposing the observed image into reflectance and bias field components, RefLSM directly segments the reflectance, which is invariant to illumination and preserves fine structural details. Building on this foundation, we introduce two key innovations for enhanced precision and robustness. First, a linear structural prior steers the smoothed reflectance gradients toward a data-driven reference, providing reliable geometric guidance in noisy or low-contrast scenes. Second, a relaxed binary level-set is embedded in RefLSM and enforced via convex relaxation and sign projection, yielding stable evolution and avoiding reinitialization-induced diffusion. The resulting variational problem is solved efficiently using an ADMM-based optimization scheme. Extensive experiments on multiple medical imaging datasets demonstrate that RefLSM achieves superior segmentation accuracy, robustness, and computational efficiency compared to state-of-the-art level set methods.
[271] Structure-Aware Feature Rectification with Region Adjacency Graphs for Training-Free Open-Vocabulary Semantic Segmentation
Qiming Huang, Hao Ai, Jianbo Jiao
Main category: cs.CV
TL;DR: Proposes structure-aware feature rectification using region adjacency graphs to refine CLIP features for better open-vocabulary semantic segmentation by addressing local inconsistency issues.
Details
Motivation: CLIP's pre-training on image-text pairs focuses on global semantic alignment, leading to suboptimal performance for fine-grained visual region-text association, resulting in noisy and inconsistent local predictions due to dispersed bias from contrastive training.Method: Constructs region adjacency graph (RAG) based on low-level features (color, texture) to capture local structural relationships, then uses this graph to refine CLIP features by enhancing local discrimination through instance-specific priors derived directly from images.
Result: Extensive experiments show the method effectively suppresses segmentation noise, improves region-level consistency, and achieves strong performance on multiple open-vocabulary segmentation benchmarks.
Conclusion: The structure-aware feature rectification approach successfully addresses CLIP’s local inconsistency issues in open-vocabulary segmentation by incorporating instance-specific structural priors, leading to improved performance and consistency.
Abstract: Benefiting from the inductive biases learned from large-scale datasets, open-vocabulary semantic segmentation (OVSS) leverages the power of vision-language models, such as CLIP, to achieve remarkable progress without requiring task-specific training. However, due to CLIP’s pre-training nature on image-text pairs, it tends to focus on global semantic alignment, resulting in suboptimal performance when associating fine-grained visual regions with text. This leads to noisy and inconsistent predictions, particularly in local areas. We attribute this to a dispersed bias stemming from its contrastive training paradigm, which is difficult to alleviate using CLIP features alone. To address this, we propose a structure-aware feature rectification approach that incorporates instance-specific priors derived directly from the image. Specifically, we construct a region adjacency graph (RAG) based on low-level features (e.g., colour and texture) to capture local structural relationships and use it to refine CLIP features by enhancing local discrimination. Extensive experiments show that our method effectively suppresses segmentation noise, improves region-level consistency, and achieves strong performance on multiple open-vocabulary segmentation benchmarks.
[272] HVQ-CGIC: Enabling Hyperprior Entropy Modeling for VQ-Based Controllable Generative Image Compression
Niu Yi, Xu Tianyi, Ma Mingming, Wang Xinkun
Main category: cs.CV
TL;DR: HVQ-CGIC introduces a hyperprior to vector quantization-based generative image compression, enabling adaptive entropy modeling and flexible rate control, achieving 61.3% bitrate reduction compared to SOTA methods.
Details
Motivation: Current VQ-based generative image compression methods use static global probability distributions for entropy estimation, which fails to adapt to image-specific content, leading to untapped bitrate potential and difficulty in achieving flexible rate control.Method: Proposes HVQ-CGIC framework with mathematical foundation for introducing hyperprior to VQ indices entropy model, novel loss design for RD balance and control, and lightweight hyper-prior estimation network.
Result: Achieves significant RD performance advantage over SOTA generative compression methods, with 61.3% fewer bits on Kodak dataset while maintaining same LPIPS quality as Control-GIC, CDC and HiFiC.
Conclusion: HVQ-CGIC has potential to become foundational component for VQGAN-based image compression, analogous to HyperPrior framework’s role in neural image compression, by enabling adaptive entropy modeling and flexible rate control.
Abstract: Generative learned image compression methods using Vector Quantization (VQ) have recently shown impressive potential in balancing distortion and perceptual quality. However, these methods typically estimate the entropy of VQ indices using a static, global probability distribution, which fails to adapt to the specific content of each image. This non-adaptive approach leads to untapped bitrate potential and challenges in achieving flexible rate control. To address this challenge, we introduce a Controllable Generative Image Compression framework based on a VQ Hyperprior, termed HVQ-CGIC. HVQ-CGIC rigorously derives the mathematical foundation for introducing a hyperprior to the VQ indices entropy model. Based on this foundation, through novel loss design, to our knowledge, this framework is the first to introduce RD balance and control into vector quantization-based Generative Image Compression. Cooperating with a lightweight hyper-prior estimation network, HVQ-CGIC achieves a significant advantage in rate-distortion (RD) performance compared to current state-of-the-art (SOTA) generative compression methods. On the Kodak dataset, we achieve the same LPIPS as Control-GIC, CDC and HiFiC with an average of 61.3% fewer bits. We posit that HVQ-CGIC has the potential to become a foundational component for VQGAN-based image compression, analogous to the integral role of the HyperPrior framework in neural image compression.
[273] SUCCESS-GS: Survey of Compactness and Compression for Efficient Static and Dynamic Gaussian Splatting
Seokhyun Youn, Soohyun Lee, Geonho Kim, Weeyoung Kwon, Sung-Ho Bae, Jihyong Oh
Main category: cs.CV
TL;DR: This survey paper provides the first unified overview of efficient 3D and 4D Gaussian Splatting techniques, categorizing methods into Parameter Compression and Restructuring Compression to address memory and computational challenges while preserving quality.
Details
Motivation: 3D Gaussian Splatting enables real-time, high-fidelity 3D reconstruction but faces practical limitations due to massive memory and computational demands, especially in 4D dynamic scenes. The field of Efficient Gaussian Splatting has evolved to address these issues, but lacks a unified overview.Method: The survey systematically categorizes existing efficient 3D/4D Gaussian Splatting methods into two major directions: Parameter Compression (reducing redundancy in Gaussian parameters) and Restructuring Compression (reorganizing Gaussian representations). It provides comprehensive analysis of core ideas, methodological trends, datasets, evaluation metrics, and benchmark comparisons.
Result: The paper organizes the rapidly evolving field of Efficient Gaussian Splatting, providing researchers with a structured framework to understand different compression approaches, their trade-offs, and current state-of-the-art techniques for both static and dynamic scene representation.
Conclusion: The survey identifies current limitations and outlines promising research directions toward scalable, compact, and real-time Gaussian Splatting, emphasizing the need for continued innovation in efficient representations for both 3D and 4D applications.
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful explicit representation enabling real-time, high-fidelity 3D reconstruction and novel view synthesis. However, its practical use is hindered by the massive memory and computational demands required to store and render millions of Gaussians. These challenges become even more severe in 4D dynamic scenes. To address these issues, the field of Efficient Gaussian Splatting has rapidly evolved, proposing methods that reduce redundancy while preserving reconstruction quality. This survey provides the first unified overview of efficient 3D and 4D Gaussian Splatting techniques. For both 3D and 4D settings, we systematically categorize existing methods into two major directions, Parameter Compression and Restructuring Compression, and comprehensively summarize the core ideas and methodological trends within each category. We further cover widely used datasets, evaluation metrics, and representative benchmark comparisons. Finally, we discuss current limitations and outline promising research directions toward scalable, compact, and real-time Gaussian Splatting for both static and dynamic 3D scene representation.
[274] Understanding Diffusion Models via Code Execution
Cheng Yu
Main category: cs.CV
TL;DR: A concise 300-line implementation tutorial explaining diffusion models from a code-execution perspective, bridging the gap between mathematical theory and practical implementation.
Details
Motivation: To address the difficulty in bridging the gap between complex mathematical formulations of diffusion models in papers and practical open-source implementations, providing clear implementation-first understanding.Method: Developed a minimal implementation (~300 lines) that preserves essential components: forward diffusion, reverse sampling, noise-prediction network, and training loop while removing unnecessary engineering details.
Result: Created a concise, accessible implementation that demonstrates how diffusion models actually operate in code, making the connection between theory and practice clearer for researchers.
Conclusion: The implementation provides researchers with a clear, practical understanding of diffusion models, showing how code and theory correspond, with code and pre-trained models available on GitHub.
Abstract: Diffusion models have achieved remarkable performance in generative modeling, yet their theoretical foundations are often intricate, and the gap between mathematical formulations in papers and practical open-source implementations can be difficult to bridge. Existing tutorials primarily focus on deriving equations, offering limited guidance on how diffusion models actually operate in code. To address this, we present a concise implementation of approximately 300 lines that explains diffusion models from a code-execution perspective. Our minimal example preserves the essential components – including forward diffusion, reverse sampling, the noise-prediction network, and the training loop – while removing unnecessary engineering details. This technical report aims to provide researchers with a clear, implementation-first understanding of how diffusion models work in practice and how code and theory correspond. Our code and pre-trained models are available at: https://github.com/disanda/GM/tree/main/DDPM-DDIM-ClassifierFree.
[275] Data-driven Exploration of Mobility Interaction Patterns
Gabriele Galatolo, Mirco Nanni
Main category: cs.CV
TL;DR: Data mining approach to discover mobility interaction patterns from real data, applied to cars and pedestrians, to improve simulation models.
Details
Motivation: Existing solutions use preconceived behavioral models for human dynamics, but need data-driven approach to capture mutual interactions between individuals for applications like crowd simulation and emergency management.Method: Proposes data mining approach that searches mobility events for evidence of mutual interactions, identifies complex persistent patterns and time-evolving configurations of events from real data.
Result: Applied methodology to two real case studies (cars and pedestrians) with experimental evaluation of performance, parameter sensitivity, and interpretation of sample results.
Conclusion: Patterns discovered provide new insights into mobility interaction mechanics, potentially improving existing simulation models through data-driven understanding of human dynamics.
Abstract: Understanding the movement behaviours of individuals and the way they react to the external world is a key component of any problem that involves the modelling of human dynamics at a physical level. In particular, it is crucial to capture the influence that the presence of an individual can have on the others. Important examples of applications include crowd simulation and emergency management, where the simulation of the mass of people passes through the simulation of the individuals, taking into consideration the others as part of the general context. While existing solutions basically start from some preconceived behavioural model, in this work we propose an approach that starts directly from the data, adopting a data mining perspective. Our method searches the mobility events in the data that might be possible evidences of mutual interactions between individuals, and on top of them looks for complex, persistent patterns and time evolving configurations of events. The study of these patterns can provide new insights on the mechanics of mobility interactions between individuals, which can potentially help in improving existing simulation models. We instantiate the general methodology on two real case studies, one on cars and one on pedestrians, and a full experimental evaluation is performed, both in terms of performances, parameter sensitivity and interpretation of sample results.
[276] MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning
Xuhui Zheng, Kang An, Ziliang Wang, Yuhang Wang, Faqiang Qian, Yichao Wu
Main category: cs.CV
TL;DR: MMRPT is a masked multimodal reinforcement pre-training framework that uses reinforcement learning to strengthen visual reasoning in MLLMs by rewarding visual grounding over caption imitation.
Details
Motivation: Current multimodal pre-training suffers from descriptive bias in image-caption pairs, causing models to rely too much on surface linguistic cues rather than developing genuine visual understanding and reasoning.Method: Introduces reinforcement learning into pre-training: 1) Estimates sentence-level visual dependency via attention over visual tokens, 2) Masks highly vision-dependent segments, 3) Uses reinforcement learning with semantic-visual rewards to reconstruct masked spans through vision-grounded reasoning.
Result: Shows consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating more reliable and generalizable multimodal pre-training.
Conclusion: Reinforcement-driven masked reasoning provides a superior pre-training objective for multimodal models by promoting genuine visual understanding rather than caption imitation.
Abstract: Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, leading models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, enabling learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency via attention over visual tokens and masking highly vision-dependent segments; the model reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.
[277] When normalization hallucinates: unseen risks in AI-powered whole slide image processing
Karel Moens, Matthew B. Blaschko, Tinne Tuytelaars, Bart Diricx, Jonas De Vylder, Mustafa Yousif
Main category: cs.CV
TL;DR: WSI normalization models can create realistic-looking hallucinations that mask important diagnostic features, with current evaluation methods failing to detect them, requiring new validation approaches.
Details
Motivation: Deep learning-based WSI normalization models tend to produce outputs that average out important diagnostic features and can introduce undetectable hallucinated content, posing serious risks for clinical pathology analysis that current evaluation practices overlook.Method: Proposed a novel image comparison measure to automatically detect hallucinations in normalized outputs, then systematically evaluated several well-cited normalization methods retrained on real-world clinical data using this measure.
Result: Found concerning frequency of hallucinations in real-world clinical data that weren’t apparent in public datasets, revealing significant inconsistencies and failures not captured by conventional metrics.
Conclusion: The risk of hallucinations in WSI normalization is real and underappreciated, necessitating more robust, interpretable normalization techniques and stricter validation protocols for clinical deployment.
Abstract: Whole slide image (WSI) normalization remains a vital preprocessing step in computational pathology. Increasingly driven by deep learning, these models learn to approximate data distributions from training examples. This often results in outputs that gravitate toward the average, potentially masking diagnostically important features. More critically, they can introduce hallucinated content, artifacts that appear realistic but are not present in the original tissue, posing a serious threat to downstream analysis. These hallucinations are nearly impossible to detect visually, and current evaluation practices often overlook them. In this work, we demonstrate that the risk of hallucinations is real and underappreciated. While many methods perform adequately on public datasets, we observe a concerning frequency of hallucinations when these same models are retrained and evaluated on real-world clinical data. To address this, we propose a novel image comparison measure designed to automatically detect hallucinations in normalized outputs. Using this measure, we systematically evaluate several well-cited normalization methods retrained on real-world data, revealing significant inconsistencies and failures that are not captured by conventional metrics. Our findings underscore the need for more robust, interpretable normalization techniques and stricter validation protocols in clinical deployment.
[278] AutoLugano: A Deep Learning Framework for Fully Automated Lymphoma Segmentation and Lugano Staging on FDG-PET/CT
Boyang Pan, Zeyu Zhang, Hongyu Meng, Bin Cui, Yingying Zhang, Wenli Hou, Junhao Li, Langdi Zhong, Xiaoxiao Chen, Xiaoyu Xu, Changjin Zuo, Chao Cheng, Nan-Jie Gong
Main category: cs.CV
TL;DR: AutoLugano is a fully automated deep learning system for end-to-end lymphoma classification that performs lesion segmentation, anatomical localization, and automated Lugano staging from FDG-PET/CT scans.
Details
Motivation: To develop an automated system that can translate baseline FDG-PET/CT scans into complete Lugano staging for lymphoma, assisting in initial staging, treatment stratification, and clinical decision-making without manual intervention.Method: Three sequential modules: (1) Anatomy-Informed Lesion Segmentation using 3D nnU-Net on multi-channel inputs, (2) Atlas-based Anatomical Localization using TotalSegmentator to map lesions to 21 lymph node regions, and (3) Automated Lugano Staging that translates spatial distribution into Lugano stages and therapeutic groups. Trained on autoPET dataset (n=1,007) and validated on independent cohort (n=67).
Result: External validation showed 88.31% overall accuracy, 74.47% sensitivity, 94.21% specificity, and 80.80% F1-score for regional involvement detection. For therapeutic stratification (Limited vs. Advanced Stage), achieved 85.07% accuracy, 90.48% specificity, and 82.61% sensitivity.
Conclusion: AutoLugano is the first fully automated, end-to-end pipeline that translates single baseline FDG-PET/CT scans into complete Lugano stages, demonstrating strong potential to assist in clinical decision-making for lymphoma patients.
Abstract: Purpose: To develop a fully automated deep learning system, AutoLugano, for end-to-end lymphoma classification by performing lesion segmentation, anatomical localization, and automated Lugano staging from baseline FDG-PET/CT scans. Methods: The AutoLugano system processes baseline FDG-PET/CT scans through three sequential modules:(1) Anatomy-Informed Lesion Segmentation, a 3D nnU-Net model, trained on multi-channel inputs, performs automated lesion detection (2) Atlas-based Anatomical Localization, which leverages the TotalSegmentator toolkit to map segmented lesions to 21 predefined lymph node regions using deterministic anatomical rules; and (3) Automated Lugano Staging, where the spatial distribution of involved regions is translated into Lugano stages and therapeutic groups (Limited vs. Advanced Stage).The system was trained on the public autoPET dataset (n=1,007) and externally validated on an independent cohort of 67 patients. Performance was assessed using accuracy, sensitivity, specificity, F1-scorefor regional involvement detection and staging agreement. Results: On the external validation set, the proposed model demonstrated robust performance, achieving an overall accuracy of 88.31%, sensitivity of 74.47%, Specificity of 94.21% and an F1-score of 80.80% for regional involvement detection,outperforming baseline models. Most notably, for the critical clinical task of therapeutic stratification (Limited vs. Advanced Stage), the system achieved a high accuracy of 85.07%, with a specificity of 90.48% and a sensitivity of 82.61%.Conclusion: AutoLugano represents the first fully automated, end-to-end pipeline that translates a single baseline FDG-PET/CT scan into a complete Lugano stage. This study demonstrates its strong potential to assist in initial staging, treatment stratification, and supporting clinical decision-making.
[279] Object Pose Distribution Estimation for Determining Revolution and Reflection Uncertainty in Point Clouds
Frederik Hagelskjær, Dimitrios Arapis, Steffen Madsen, Thorbjørn Mosekjær Iversen
Main category: cs.CV
TL;DR: First deep learning method for object pose uncertainty estimation using only 3D colorless data, validated in real-world bin picking with geometrically ambiguous objects.
Details
Motivation: Single pose estimates can't capture uncertainty from visual ambiguity, leading to unreliable robotic behavior. Existing methods rely heavily on color information, which is often unavailable in industrial settings.Method: Novel neural network-based approach for pose distribution estimation using only 3D colorless data (no RGB input). Framework focuses on symmetries in reflection and revolution but is extendable to full SE(3) pose distribution estimation.
Result: Method validated in real-world bin picking scenario with objects of varying geometric ambiguity. First approach leveraging deep learning for pose distribution estimation without RGB input.
Conclusion: Proposed method enables reliable pose uncertainty estimation in industrial settings where color information is unavailable, addressing visual ambiguity challenges in robotic perception.
Abstract: Object pose estimation is crucial to robotic perception and typically provides a single-pose estimate. However, a single estimate cannot capture pose uncertainty deriving from visual ambiguity, which can lead to unreliable behavior. Existing pose distribution methods rely heavily on color information, often unavailable in industrial settings. We propose a novel neural network-based method for estimating object pose uncertainty using only 3D colorless data. To the best of our knowledge, this is the first approach that leverages deep learning for pose distribution estimation without relying on RGB input. We validate our method in a real-world bin picking scenario with objects of varying geometric ambiguity. Our current implementation focuses on symmetries in reflection and revolution, but the framework is extendable to full SE(3) pose distribution estimation. Source code available at opde3d.github.io
[280] ReLKD: Inter-Class Relation Learning with Knowledge Distillation for Generalized Category Discovery
Fang Zhou, Zhiqiang Chen, Martin Pavlovski, Yizhong Zhang
Main category: cs.CV
TL;DR: ReLKD is an end-to-end framework for Generalized Category Discovery that exploits implicit inter-class relations to enhance novel class classification through a three-module architecture combining target-grained, coarse-grained, and distillation components.
Details
Motivation: GCD faces the challenge of categorizing unlabeled data containing both known and novel classes, but previous approaches treat classes independently, neglecting inherent inter-class relations. Obtaining explicit inter-class relations is difficult in real-world scenarios, creating a need for methods that can exploit implicit relations to improve novel class classification.Method: ReLKD uses three key modules: 1) Target-grained module for learning discriminative representations, 2) Coarse-grained module for capturing hierarchical class relations, and 3) Distillation module for transferring knowledge from the coarse-grained module to refine the target-grained module’s representation learning.
Result: Extensive experiments on four datasets demonstrate the effectiveness of ReLKD, particularly in scenarios with limited labeled data. The framework shows improved performance in exploiting implicit inter-class relations for novel class classification.
Conclusion: ReLKD effectively addresses the GCD challenge by exploiting implicit inter-class relations through its three-module architecture, demonstrating strong performance especially when labeled data is limited. The approach provides a practical solution for real-world scenarios where explicit inter-class relations are difficult to obtain.
Abstract: Generalized Category Discovery (GCD) faces the challenge of categorizing unlabeled data containing both known and novel classes, given only labels for known classes. Previous studies often treat each class independently, neglecting the inherent inter-class relations. Obtaining such inter-class relations directly presents a significant challenge in real-world scenarios. To address this issue, we propose ReLKD, an end-to-end framework that effectively exploits implicit inter-class relations and leverages this knowledge to enhance the classification of novel classes. ReLKD comprises three key modules: a target-grained module for learning discriminative representations, a coarse-grained module for capturing hierarchical class relations, and a distillation module for transferring knowledge from the coarse-grained module to refine the target-grained module’s representation learning. Extensive experiments on four datasets demonstrate the effectiveness of ReLKD, particularly in scenarios with limited labeled data. The code for ReLKD is available at https://github.com/ZhouF-ECNU/ReLKD.
[281] STRinGS: Selective Text Refinement in Gaussian Splatting
Abhinav Raundhal, Gaurav Behera, P J Narayanan, Ravi Kiran Sarvadevabhatla, Makarand Tapaswi
Main category: cs.CV
TL;DR: STRinGS is a text-aware selective refinement framework for 3D Gaussian Splatting that treats text and non-text regions separately to preserve fine-grained text details in 3D reconstruction.
Details
Motivation: Text in real-world scenes conveys important contextual information, but current 3D representations like 3D Gaussian Splatting struggle to preserve fine-grained text details. Small errors in text reconstruction can lead to significant semantic loss, limiting scene understanding in text-rich environments.Method: STRinGS treats text and non-text regions separately, refining text regions first and then merging them with non-text regions for full-scene optimization. This selective refinement approach allows for better preservation of text details while maintaining overall scene quality.
Result: STRinGS achieves a 63.6% relative improvement over standard 3DGS in text readability (measured by OCR Character Error Rate) at just 7K iterations. The method produces sharp, readable text even in challenging configurations.
Conclusion: STRinGS and the accompanying STRinGS-360 dataset advance 3D scene understanding in text-rich environments. The framework demonstrates that selective refinement of text regions significantly improves text readability in 3D reconstruction, paving the way for more robust text-aware reconstruction methods.
Abstract: Text as signs, labels, or instructions is a critical element of real-world scenes as they can convey important contextual information. 3D representations such as 3D Gaussian Splatting (3DGS) struggle to preserve fine-grained text details, while achieving high visual fidelity. Small errors in textual element reconstruction can lead to significant semantic loss. We propose STRinGS, a text-aware, selective refinement framework to address this issue for 3DGS reconstruction. Our method treats text and non-text regions separately, refining text regions first and merging them with non-text regions later for full-scene optimization. STRinGS produces sharp, readable text even in challenging configurations. We introduce a text readability measure OCR Character Error Rate (CER) to evaluate the efficacy on text regions. STRinGS results in a 63.6% relative improvement over 3DGS at just 7K iterations. We also introduce a curated dataset STRinGS-360 with diverse text scenarios to evaluate text readability in 3D reconstruction. Our method and dataset together push the boundaries of 3D scene understanding in text-rich environments, paving the way for more robust text-aware reconstruction methods.
[282] Unified Camera Positional Encoding for Controlled Video Generation
Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, Jianfei Cai
Main category: cs.CV
TL;DR: UCPE introduces a unified camera positional encoding that handles diverse camera parameters (poses, intrinsics, distortions) for better camera-controllable video generation with minimal added parameters.
Details
Motivation: Existing camera encoding methods rely on simplified pinhole assumptions, limiting generalization across diverse real-world camera intrinsics and lens distortions needed for 3D perception and video generation tasks.Method: Proposes Relative Ray Encoding for geometry-consistent representation of complete camera info (6-DoF poses, intrinsics, distortions), plus Absolute Orientation Encoding for pitch/roll control, integrated into pretrained video Diffusion Transformers via lightweight spatial attention adapter.
Result: Achieves state-of-the-art camera controllability and visual fidelity with less than 1% trainable parameters added, validated on a large video dataset covering diverse camera motions and lens types.
Conclusion: UCPE demonstrates effective camera-controllable video generation and shows potential as a general camera representation for Transformers across future multi-view, video, and 3D tasks.
Abstract: Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for Absolute Orientation Encoding, enabling full control over the initial camera orientation. Together, these designs form UCPE (Unified Camera Positional Encoding), which integrates into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter, adding less than 1% trainable parameters while achieving state-of-the-art camera controllability and visual fidelity. To facilitate systematic training and evaluation, we construct a large video dataset covering a wide range of camera motions and lens types. Extensive experiments validate the effectiveness of UCPE in camera-controllable video generation and highlight its potential as a general camera representation for Transformers across future multi-view, video, and 3D tasks. Code will be available at https://github.com/chengzhag/UCPE.
[283] Squeezed-Eff-Net: Edge-Computed Boost of Tomography Based Brain Tumor Classification leveraging Hybrid Neural Network Architecture
Md. Srabon Chowdhury, Syeda Fahmida Tanzim, Sheekar Banerjee, Ishtiak Al Mamoon, AKM Muzahidul Islam
Main category: cs.CV
TL;DR: A hybrid deep learning model combining SqueezeNet v1, EfficientNet-B0, and handcrafted radiomic features achieves 98.93% accuracy for brain tumor classification from MRI, with only 2.1M parameters and 1.2 GFLOPs.
Details
Motivation: Brain tumor diagnosis requires timely and accurate MRI interpretation, but manual tumor delineation is difficult, time-consuming, and prone to inter-observer error. Current methods need improvement in both computational efficiency and diagnostic accuracy.Method: Proposes a hybrid deep learning model combining lightweight SqueezeNet v1 with high-performing EfficientNet-B0, enhanced with handcrafted radiomic features (HOG, LBP, Gabor filters, Wavelet transforms). Trained on Nickparvar Brain Tumor MRI dataset with 7,023 T1-weighted MRI slices across four classes.
Result: Achieved 98.93% testing accuracy, improved to 99.08% with Test Time Augmentation. Model uses only 2.1 million parameters and less than 1.2 GFLOPs, offering excellent computational efficiency while maintaining high diagnostic performance.
Conclusion: The hybrid model provides a practical balance between computational efficiency and diagnostic accuracy, demonstrating near-clinical reliability for automated brain tumor classification and potential for clinical decision-support systems.
Abstract: Brain tumors are one of the most common and dangerous neurological diseases which require a timely and correct diagnosis to provide the right treatment procedures. Even with the promotion of magnetic resonance imaging (MRI), the process of tumor delineation is difficult and time-consuming, which is prone to inter-observer error. In order to overcome these limitations, this work proposes a hybrid deep learning model based on SqueezeNet v1 which is a lightweight model, and EfficientNet-B0, which is a high-performing model, and is enhanced with handcrafted radiomic descriptors, including Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), Gabor filters and Wavelet transforms. The framework was trained and tested only on publicly available Nickparvar Brain Tumor MRI dataset, which consisted of 7,023 contrast-enhanced T1-weighted axial MRI slices which were categorized into four groups: glioma, meningioma, pituitary tumor, and no tumor. The testing accuracy of the model was 98.93% that reached a level of 99.08% with Test Time Augmentation (TTA) showing great generalization and power. The proposed hybrid network offers a compromise between computation efficiency and diagnostic accuracy compared to current deep learning structures and only has to be trained using fewer than 2.1 million parameters and less than 1.2 GFLOPs. The handcrafted feature addition allowed greater sensitivity in texture and the EfficientNet-B0 backbone represented intricate hierarchical features. The resulting model has almost clinical reliability in automated MRI-based classification of tumors highlighting its possibility of use in clinical decision-support systems.
[284] Zero-Shot Textual Explanations via Translating Decision-Critical Features
Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto
Main category: cs.CV
TL;DR: TEXTER generates textual explanations for image classifiers by isolating decision-critical features before alignment, producing more faithful and interpretable explanations than existing methods.
Details
Motivation: Existing zero-shot explanation methods produce generic descriptions of what's visible in images rather than capturing the specific reasoning behind classifier predictions. There's a need for explanations that reflect the actual decision-making process of image classifiers.Method: TEXTER identifies neurons contributing to predictions, emphasizes decision-critical features encoded in those neurons, maps them to CLIP feature space for text retrieval, and uses a sparse autoencoder for improved interpretability, especially for Transformer architectures.
Result: Extensive experiments show TEXTER generates more faithful and interpretable explanations than existing methods, better capturing classifier-specific reasoning.
Conclusion: TEXTER successfully overcomes limitations of existing explanation methods by focusing on decision-critical features, providing more transparent and accurate textual explanations for image classifier decisions.
Abstract: Textual explanations make image classifier decisions transparent by describing the prediction rationale in natural language. Large vision-language models can generate captions but are designed for general visual understanding, not classifier-specific reasoning. Existing zero-shot explanation methods align global image features with language, producing descriptions of what is visible rather than what drives the prediction. We propose TEXTER, which overcomes this limitation by isolating decision-critical features before alignment. TEXTER identifies the neurons contributing to the prediction and emphasizes the features encoded in those neurons – i.e., the decision-critical features. It then maps these emphasized features into the CLIP feature space to retrieve textual explanations that reflect the model’s reasoning. A sparse autoencoder further improves interpretability, particularly for Transformer architectures. Extensive experiments show that TEXTER generates more faithful and interpretable explanations than existing methods. The code will be publicly released.
[285] AdLift: Lifting Adversarial Perturbations to Safeguard 3D Gaussian Splatting Assets Against Instruction-Driven Editing
Ziming Hong, Tianyu Huang, Runnan Chen, Shanshan Ye, Mingming Gong, Bo Han, Tongliang Liu
Main category: cs.CV
TL;DR: AdLift is the first editing safeguard for 3D Gaussian Splatting that prevents unauthorized instruction-driven editing by lifting bounded 2D adversarial perturbations into 3D Gaussian representations.
Details
Motivation: While diffusion-based editing pipelines enable 3DGS manipulation, they expose 3D assets to unauthorized editing and malicious tampering risks. Existing 2D adversarial protection methods don't work well for 3DGS due to view-generalizable protection challenges and balancing invisibility with protection capability.Method: AdLift lifts strictly bounded 2D adversarial perturbations into 3D Gaussian-represented safeguards. It uses a tailored Lifted PGD that alternates between gradient truncation during back-propagation from editing models and image-to-Gaussian fitting operations to optimize safeguard Gaussians across training views while maintaining perturbation constraints.
Result: AdLift effectively protects against state-of-the-art instruction-driven 2D image and 3DGS editing methods, providing consistent adversarial-based protection across different viewpoints and generalizing to novel views.
Conclusion: AdLift successfully addresses the challenges of view-generalizable protection and balancing invisibility with protection capability for 3DGS assets, providing the first effective safeguard against unauthorized instruction-driven editing of 3D Gaussian Splatting content.
Abstract: Recent studies have extended diffusion-based instruction-driven 2D image editing pipelines to 3D Gaussian Splatting (3DGS), enabling faithful manipulation of 3DGS assets and greatly advancing 3DGS content creation. However, it also exposes these assets to serious risks of unauthorized editing and malicious tampering. Although imperceptible adversarial perturbations against diffusion models have proven effective for protecting 2D images, applying them to 3DGS encounters two major challenges: view-generalizable protection and balancing invisibility with protection capability. In this work, we propose the first editing safeguard for 3DGS, termed AdLift, which prevents instruction-driven editing across arbitrary views and dimensions by lifting strictly bounded 2D adversarial perturbations into 3D Gaussian-represented safeguard. To ensure both adversarial perturbations effectiveness and invisibility, these safeguard Gaussians are progressively optimized across training views using a tailored Lifted PGD, which first conducts gradient truncation during back-propagation from the editing model at the rendered image and applies projected gradients to strictly constrain the image-level perturbation. Then, the resulting perturbation is backpropagated to the safeguard Gaussian parameters via an image-to-Gaussian fitting operation. We alternate between gradient truncation and image-to-Gaussian fitting, yielding consistent adversarial-based protection performance across different viewpoints and generalizes to novel views. Empirically, qualitative and quantitative results demonstrate that AdLift effectively protects against state-of-the-art instruction-driven 2D image and 3DGS editing.
[286] An AI-Powered Autonomous Underwater System for Sea Exploration and Scientific Research
Hamad Almazrouei, Mariam Al Nasseri, Maha Alzaabi
Main category: cs.CV
TL;DR: AI-powered AUV system for automated underwater object detection and reporting using YOLOv12 Nano, ResNet50, PCA, K-Means++, and GPT-4o Mini, achieving 0.512 mAP@0.5 on 55K+ marine images.
Details
Motivation: Traditional sea exploration faces challenges from extreme conditions, limited visibility, high costs, and vast unexplored ocean regions, creating need for automated solutions to reduce human risk and improve efficiency.Method: Integrated system combines YOLOv12 Nano for real-time object detection, ResNet50 CNN for feature extraction, PCA for dimensionality reduction (preserving 98% variance), K-Means++ clustering for grouping marine objects, and GPT-4o Mini LLM for generating structured reports and summaries.
Result: System achieved mAP@0.5 of 0.512, precision of 0.535, recall of 0.438 on combined DeepFish and OzFish datasets (55K+ images). PCA reduced dimensionality effectively, K-Means successfully grouped objects by visual similarity, and LLM generated insightful summaries with location data.
Conclusion: The integrated AI-AUV system reduces human diving risks, increases mission efficiency, and enhances underwater data analysis speed and depth, enabling more effective scientific research in challenging marine environments.
Abstract: Traditional sea exploration faces significant challenges due to extreme conditions, limited visibility, and high costs, resulting in vast unexplored ocean regions. This paper presents an innovative AI-powered Autonomous Underwater Vehicle (AUV) system designed to overcome these limitations by automating underwater object detection, analysis, and reporting. The system integrates YOLOv12 Nano for real-time object detection, a Convolutional Neural Network (CNN) (ResNet50) for feature extraction, Principal Component Analysis (PCA) for dimensionality reduction, and K-Means++ clustering for grouping marine objects based on visual characteristics. Furthermore, a Large Language Model (LLM) (GPT-4o Mini) is employed to generate structured reports and summaries of underwater findings, enhancing data interpretation. The system was trained and evaluated on a combined dataset of over 55,000 images from the DeepFish and OzFish datasets, capturing diverse Australian marine environments. Experimental results demonstrate the system’s capability to detect marine objects with a mAP@0.5 of 0.512, a precision of 0.535, and a recall of 0.438. The integration of PCA effectively reduced feature dimensionality while preserving 98% variance, facilitating K-Means clustering which successfully grouped detected objects based on visual similarities. The LLM integration proved effective in generating insightful summaries of detections and clusters, supported by location data. This integrated approach significantly reduces the risks associated with human diving, increases mission efficiency, and enhances the speed and depth of underwater data analysis, paving the way for more effective scientific research and discovery in challenging marine environments.
[287] See More, Change Less: Anatomy-Aware Diffusion for Contrast Enhancement
Junqi Liu, Zejun Wu, Pedro R. A. S. Bassi, Xinze Zhou, Wenxuan Li, Ibrahim E. Hamamci, Sezgin Er, Tianyu Lin, Yi Luo, Szymon Płotka, Bjoern Menze, Daguang Xu, Kai Ding, Kang Wang, Yang Yang, Yucheng Tang, Alan L. Yuille, Zongwei Zhou
Main category: cs.CV
TL;DR: SMILE is an anatomy-aware diffusion model for medical image enhancement that improves image quality while preserving anatomical accuracy, outperforming existing methods across multiple metrics.
Details
Motivation: Current medical image enhancement models often over-edit, distorting organs, creating false findings, and missing small tumors because they lack anatomical understanding and contrast dynamics knowledge.Method: SMILE uses three key innovations: structure-aware supervision following organ boundaries and contrast patterns, registration-free learning with unaligned multi-phase CT scans, and unified inference for consistent enhancement across contrast phases.
Result: SMILE outperforms existing methods across six external datasets with 14.2% higher SSIM, 20.6% higher PSNR, 50% better FID, and improves cancer detection from non-contrast CT by up to 10% F1 score.
Conclusion: SMILE provides anatomically accurate and diagnostically meaningful image enhancement by understanding organ shapes and contrast dynamics, making it clinically useful for medical imaging applications.
Abstract: Image enhancement improves visual quality and helps reveal details that are hard to see in the original image. In medical imaging, it can support clinical decision-making, but current models often over-edit. This can distort organs, create false findings, and miss small tumors because these models do not understand anatomy or contrast dynamics. We propose SMILE, an anatomy-aware diffusion model that learns how organs are shaped and how they take up contrast. It enhances only clinically relevant regions while leaving all other areas unchanged. SMILE introduces three key ideas: (1) structure-aware supervision that follows true organ boundaries and contrast patterns; (2) registration-free learning that works directly with unaligned multi-phase CT scans; (3) unified inference that provides fast and consistent enhancement across all contrast phases. Across six external datasets, SMILE outperforms existing methods in image quality (14.2% higher SSIM, 20.6% higher PSNR, 50% better FID) and in clinical usefulness by producing anatomically accurate and diagnostically meaningful images. SMILE also improves cancer detection from non-contrast CT, raising the F1 score by up to 10 percent.
[288] DIST-CLIP: Arbitrary Metadata and Image Guided MRI Harmonization via Disentangled Anatomy-Contrast Representations
Mehmet Yigit Avci, Pedro Borges, Virginia Fernandez, Paul Wright, Mehmet Yigitsoy, Sebastien Ourselin, Jorge Cardoso
Main category: cs.CV
TL;DR: DIST-CLIP is a novel MRI harmonization framework that disentangles anatomical content from image contrast using CLIP guidance, enabling flexible style transfer using either target images or DICOM metadata to address real-world clinical data heterogeneity.
Details
Motivation: Deep learning for medical image analysis faces clinical generalization limitations due to data heterogeneity, especially in MRI where scanner differences, acquisition protocols, and sequence parameters cause domain shifts. Existing harmonization methods are insufficient - image-based approaches need target images, while text-guided methods use simplistic labels or work only on limited datasets, failing to capture real-world clinical heterogeneity.Method: DIST-CLIP (Disentangled Style Transfer with CLIP Guidance) explicitly disentangles anatomical content from image contrast. Contrast representations are extracted using pre-trained CLIP encoders, then integrated into anatomical content via a novel Adaptive Style Transfer module. The framework can flexibly use either target images or DICOM metadata for guidance.
Result: The method was trained and evaluated on diverse real-world clinical datasets, showing significant improvements over state-of-the-art methods in both style translation fidelity and anatomical preservation.
Conclusion: DIST-CLIP offers a flexible solution for MRI style transfer and standardization, addressing the limitations of existing harmonization methods and better capturing real-world clinical heterogeneity. Code and weights will be made publicly available.
Abstract: Deep learning holds immense promise for transforming medical image analysis, yet its clinical generalization remains profoundly limited. A major barrier is data heterogeneity. This is particularly true in Magnetic Resonance Imaging, where scanner hardware differences, diverse acquisition protocols, and varying sequence parameters introduce substantial domain shifts that obscure underlying biological signals. Data harmonization methods aim to reduce these instrumental and acquisition variability, but existing approaches remain insufficient. When applied to imaging data, image-based harmonization approaches are often restricted by the need for target images, while existing text-guided methods rely on simplistic labels that fail to capture complex acquisition details or are typically restricted to datasets with limited variability, failing to capture the heterogeneity of real-world clinical environments. To address these limitations, we propose DIST-CLIP (Disentangled Style Transfer with CLIP Guidance), a unified framework for MRI harmonization that flexibly uses either target images or DICOM metadata for guidance. Our framework explicitly disentangles anatomical content from image contrast, with the contrast representations being extracted using pre-trained CLIP encoders. These contrast embeddings are then integrated into the anatomical content via a novel Adaptive Style Transfer module. We trained and evaluated DIST-CLIP on diverse real-world clinical datasets, and showed significant improvements in performance when compared against state-of-the-art methods in both style translation fidelity and anatomical preservation, offering a flexible solution for style transfer and standardizing MRI data. Our code and weights will be made publicly available upon publication.
[289] A graph generation pipeline for critical infrastructures based on heuristics, images and depth data
Mike Diessner, Yannick Tarant
Main category: cs.CV
TL;DR: A graph generation pipeline using photogrammetry and deep learning to create virtual models of critical infrastructure from RGB images and stereo depth data, as a cost-effective alternative to expensive laser scanning.
Details
Motivation: Current virtual representations of critical infrastructure (water/energy plants) require expensive 3D laser scanning and specialist knowledge. There's a need for more cost-effective, accessible methods to create digital twins for resilience planning.Method: Photogrammetry-based pipeline using RGB images and stereo camera depth data. Deep learning for object detection and instance segmentation, combined with user-defined heuristics/rules to infer object relations and generate graphs.
Result: The method produces graphs close to ground truth for two hydraulic systems. It’s flexible for specific applications and transparent enough for high-stakes decision-making in critical infrastructure.
Conclusion: The photogrammetry-based graph generation pipeline offers a cost-effective, flexible, and transparent alternative to laser scanning for creating virtual models of critical infrastructure, suitable for digital twins and resilience planning.
Abstract: Virtual representations of physical critical infrastructures, such as water or energy plants, are used for simulations and digital twins to ensure resilience and continuity of their services. These models usually require 3D point clouds from laser scanners that are expensive to acquire and require specialist knowledge to use. In this article, we present a graph generation pipeline based on photogrammetry. The pipeline detects relevant objects and predicts their relation using RGB images and depth data generated by a stereo camera. This more cost-effective approach uses deep learning for object detection and instance segmentation of the objects, and employs user-defined heuristics or rules to infer their relations. Results of two hydraulic systems show that this strategy can produce graphs close to the ground truth while its flexibility allows the method to be tailored to specific applications and its transparency qualifies it to be used in the high stakes decision-making that is required for critical infrastructures.
[290] Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment
Sangha Park, Eunji Kim, Yeongtak Oh, Jooyoung Choi, Sungroh Yoon
Main category: cs.CV
TL;DR: NPC is an automated pipeline that improves text-to-image alignment by identifying and applying negative prompts to suppress unintended content in generated images.
Details
Motivation: Despite progress in text-to-image generation, achieving precise alignment remains challenging for prompts with rich compositional structure or imaginative elements.Method: NPC analyzes cross-attention patterns to identify both targeted and untargeted negative prompts, then uses a verifier-captioner-proposer framework to generate candidate prompts and ranks them with a salient text-space score without requiring additional image synthesis.
Result: NPC outperforms strong baselines on GenEval++ (0.571 vs 0.371) and achieves best overall performance on Imagine-Bench.
Conclusion: NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models by guiding what not to generate.
Abstract: Despite substantial progress in text-to-image generation, achieving precise text-image alignment remains challenging, particularly for prompts with rich compositional structure or imaginative elements. To address this, we introduce Negative Prompting for Image Correction (NPC), an automated pipeline that improves alignment by identifying and applying negative prompts that suppress unintended content. We begin by analyzing cross-attention patterns to explain why both targeted negatives-those directly tied to the prompt’s alignment error-and untargeted negatives-tokens unrelated to the prompt but present in the generated image-can enhance alignment. To discover useful negatives, NPC generates candidate prompts using a verifier-captioner-proposer framework and ranks them with a salient text-space score, enabling effective selection without requiring additional image synthesis. On GenEval++ and Imagine-Bench, NPC outperforms strong baselines, achieving 0.571 vs. 0.371 on GenEval++ and the best overall performance on Imagine-Bench. By guiding what not to generate, NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models. Code is released at https://github.com/wiarae/NPC.
[291] RVLF: A Reinforcing Vision-Language Framework for Gloss-Free Sign Language Translation
Zhi Rao, Yucheng Zhou, Benjia Zhou, Yiqing Huang, Sergio Escalera, Jun Wan
Main category: cs.CV
TL;DR: Proposes RVLF, a three-stage vision-language framework for gloss-free sign language translation that combines semantic representation learning with GRPO-based reinforcement learning to address inadequate sign representation and sentence-level semantic misalignment.
Details
Motivation: Gloss-free SLT faces two key challenges: 1) inadequate sign representation that fails to capture nuanced visual cues, and 2) sentence-level semantic misalignment in current LLM-based methods that limits translation quality.Method: Three-stage RVLF framework: 1) Builds a large vision-language model for sign language with semantic representation learning fusing skeleton-based motion cues with DINOv2 visual features, 2) Instruction tuning to obtain SLT-SFT baseline, 3) GRPO-based optimization with reward function combining BLEU (translation fidelity) and ROUGE (sentence completeness) to fine-tune into SLT-GRPO.
Result: Substantial improvements in BLEU-4 scores: +5.1 on CSL-Daily, +1.11 on PHOENIX-2014T, +1.4 on How2Sign, and +1.61 on OpenASL datasets. First work to incorporate GRPO into SLT, achieving better translation quality and semantic consistency without pre-training on external large-scale sign language datasets.
Conclusion: RVLF effectively addresses sign representation and semantic alignment challenges in gloss-free SLT through a novel combination of vision-language modeling and reinforcement learning optimization, demonstrating significant performance gains across multiple datasets.
Abstract: Gloss-free sign language translation (SLT) is hindered by two key challenges: inadequate sign representation that fails to capture nuanced visual cues, and sentence-level semantic misalignment in current LLM-based methods, which limits translation quality. To address these issues, we propose a three-stage reinforcing vision-language framework (RVLF). We build a large vision-language model (LVLM) specifically designed for sign language, and then combine it with reinforcement learning (RL) to adaptively enhance translation performance. First, for a sufficient representation of sign language, RVLF introduces an effective semantic representation learning mechanism that fuses skeleton-based motion cues with semantically rich visual features extracted via DINOv2, followed by instruction tuning to obtain a strong SLT-SFT baseline. Then, to improve sentence-level semantic misalignment, we introduce a GRPO-based optimization strategy that fine-tunes the SLT-SFT model with a reward function combining translation fidelity (BLEU) and sentence completeness (ROUGE), yielding the optimized model termed SLT-GRPO. Our conceptually simple framework yields substantial gains under the gloss-free SLT setting without pre-training on any external large-scale sign language datasets, improving BLEU-4 scores by +5.1, +1.11, +1.4, and +1.61 on the CSL-Daily, PHOENIX-2014T, How2Sign, and OpenASL datasets, respectively. To the best of our knowledge, this is the first work to incorporate GRPO into SLT. Extensive experiments and ablation studies validate the effectiveness of GRPO-based optimization in enhancing both translation quality and semantic consistency.
[292] Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery
Mai Tsujimoto, Junjue Wang, Weihao Xuan, Naoto Yokoya
Main category: cs.CV
TL;DR: Geo3DVQA is a new benchmark for evaluating vision-language models on 3D geospatial reasoning using only RGB remote sensing imagery, revealing current VLMs struggle with RGB-to-3D reasoning but domain-specific fine-tuning significantly improves performance.
Details
Motivation: Current 3D geospatial analysis methods rely on expensive specialized sensors (LiDAR, multispectral) that limit global accessibility, and existing approaches struggle with integrating multiple 3D cues, handling diverse queries, and providing interpretable reasoning.Method: Created Geo3DVQA benchmark with 110k curated question-answer pairs spanning 16 task categories across three complexity levels: single-feature inference, multi-feature reasoning, and application-level spatial analysis, using only RGB remote sensing imagery.
Result: State-of-the-art VLMs performed poorly: GPT-4o (28.6%), Gemini-2.5-Flash (33.0%), but domain-specific fine-tuning of Qwen2.5-VL-7B achieved 49.6% accuracy (+24.8 points improvement), showing limitations of current VLMs but effectiveness of domain adaptation.
Conclusion: Geo3DVQA introduces new challenges for scalable, accessible 3D geospatial analysis, demonstrating the difficulty of RGB-to-3D reasoning and the value of domain-specific fine-tuning, with the dataset and code to be publicly released.
Abstract: Three-dimensional geospatial analysis is critical to applications in urban planning, climate adaptation, and environmental assessment. Current methodologies depend on costly, specialized sensors (e.g., LiDAR and multispectral), which restrict global accessibility. Existing sensor-based and rule-driven methods further struggle with tasks requiring the integration of multiple 3D cues, handling diverse queries, and providing interpretable reasoning. We hereby present Geo3DVQA, a comprehensive benchmark for evaluating vision-language models (VLMs) in height-aware, 3D geospatial reasoning using RGB-only remote sensing imagery. Unlike conventional sensor-based frameworks, Geo3DVQA emphasizes realistic scenarios that integrate elevation, sky view factors, and land cover patterns. The benchmark encompasses 110k curated question-answer pairs spanning 16 task categories across three complexity levels: single-feature inference, multi-feature reasoning, and application-level spatial analysis. The evaluation of ten state-of-the-art VLMs highlights the difficulty of RGB-to-3D reasoning. GPT-4o and Gemini-2.5-Flash achieved only 28.6% and 33.0% accuracy respectively, while domain-specific fine-tuning of Qwen2.5-VL-7B achieved 49.6% (+24.8 points). These results reveal both the limitations of current VLMs and the effectiveness of domain adaptation. Geo3DVQA introduces new challenge frontiers for scalable, accessible, and holistic 3D geospatial analysis. The dataset and code will be released upon publication at https://github.com/mm1129/Geo3DVQA.
[293] Reevaluating Automated Wildlife Species Detection: A Reproducibility Study on a Custom Image Dataset
Tobias Abraham Haider
Main category: cs.CV
TL;DR: This study replicates Carl et al.’s work on using Google Inception-ResNet-v2 for automated detection of European wild mammal species in camera trap images, achieving similar accuracy (62% vs 71%) with a different dataset, confirming pretrained CNNs provide practical baselines but need species-specific adaptation.
Details
Motivation: To assess the reproducibility and generalizability of Carl et al.'s approach for automated detection of European wild mammal species using pretrained CNNs on camera trap images.Method: Reimplemented the experiment from scratch using openly available resources and a different dataset of 900 images spanning 90 species. Used Google Inception-ResNet-v2 model with minimal preprocessing.
Result: Achieved 62% overall classification accuracy (vs 71% in original), with macro F1 score of 0.28 showing substantial per-class performance variation. Results closely aligned with original despite dataset differences.
Conclusion: Pretrained CNNs provide practical baselines for wildlife species identification but require species-specific adaptation or transfer learning for consistent high-quality predictions, especially when labels don’t align with ImageNet classes.
Abstract: This study revisits the findings of Carl et al., who evaluated the pre-trained Google Inception-ResNet-v2 model for automated detection of European wild mammal species in camera trap images. To assess the reproducibility and generalizability of their approach, we reimplemented the experiment from scratch using openly available resources and a different dataset consisting of 900 images spanning 90 species. After minimal preprocessing, we obtained an overall classification accuracy of 62%, closely aligning with the 71% reported in the original work despite differences in datasets. As in the original study, per-class performance varied substantially, as indicated by a macro F1 score of 0.28,highlighting limitations in generalization when labels do not align directly with ImageNet classes. Our results confirm that pretrained convolutional neural networks can provide a practical baseline for wildlife species identification but also reinforce the need for species-specific adaptation or transfer learning to achieve consistent, high-quality predictions.
[294] Improving action classification with brain-inspired deep networks
Aidas Aglinskas, Stefano Anzellotti
Main category: cs.CV
TL;DR: Brain-inspired dual-stream DNNs with separate body and background processing outperform standard DNNs and show more human-like action recognition patterns.
Details
Motivation: Standard DNNs may rely disproportionately on either body or background information for action recognition, while humans have specialized brain regions for both. The paper investigates whether brain-inspired architectures can achieve more balanced and human-like performance.Method: 1) Test standard DNNs on HAA500 dataset with three stimulus versions (full, body-only, background-only). 2) Compare with human participants (N=28) on same stimuli. 3) Implement novel brain-inspired architecture with separate domain-specific streams for body and background processing.
Result: Standard DNNs perform well on full and background-only stimuli but fail on body-only stimuli. Humans perform well on all three versions, better on body-only than background-only. The brain-inspired dual-stream architecture improves overall performance and shows accuracy patterns more similar to humans.
Conclusion: Brain-inspired domain-specific architectures with separate body and background streams lead to more human-like action recognition performance, suggesting that incorporating biological principles can improve DNN robustness and generalization.
Abstract: Action recognition is also key for applications ranging from robotics to healthcare monitoring. Action information can be extracted from the body pose and movements, as well as from the background scene. However, the extent to which deep neural networks (DNNs) make use of information about the body and information about the background remains unclear. Since these two sources of information may be correlated within a training dataset, DNNs might learn to rely predominantly on one of them, without taking full advantage of the other. Unlike DNNs, humans have domain-specific brain regions selective for perceiving bodies, and regions selective for perceiving scenes. The present work tests whether humans are thus more effective at extracting information from both body and background, and whether building brain-inspired deep network architectures with separate domain-specific streams for body and scene perception endows them with more human-like performance. We first demonstrate that DNNs trained using the HAA500 dataset perform almost as accurately on versions of the stimuli that show both body and background and on versions of the stimuli from which the body was removed, but are at chance-level for versions of the stimuli from which the background was removed. Conversely, human participants (N=28) can recognize the same set of actions accurately with all three versions of the stimuli, and perform significantly better on stimuli that show only the body than on stimuli that show only the background. Finally, we implement and test a novel architecture patterned after domain specificity in the brain with separate streams to process body and background information. We show that 1) this architecture improves action recognition performance, and 2) its accuracy across different versions of the stimuli follows a pattern that matches more closely the pattern of accuracy observed in human participants.
[295] Exploring the Potential of Encoder-free Architectures in 3D LMMs
Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Bin Zhao, Xuelong Li
Main category: cs.CV
TL;DR: ENEL is the first encoder-free 3D Large Multimodal Model that eliminates the need for pre-trained 3D encoders, enabling LLMs to directly process point clouds while achieving state-of-the-art performance on 3D understanding tasks.
Details
Motivation: Encoder-based 3D LMMs face challenges: they fail to adapt to varying point cloud resolutions during inference, and their point features don't meet LLMs' semantic needs. The paper investigates whether encoder-free architectures can effectively address these issues in 3D understanding.Method: Two key strategies: 1) LLM-embedded Semantic Encoding in pre-training with Hybrid Semantic Loss to extract high-level semantics; 2) Hierarchical Geometry Aggregation in instruction tuning to incorporate inductive bias into LLM layers for focusing on local point cloud details.
Result: ENEL-7B rivals state-of-the-art PointLLM-PiSA-13B, achieving 57.91% on classification, 61.0% on captioning, and 55.20% on VQA tasks, demonstrating competitive performance with fewer parameters.
Conclusion: Encoder-free architecture is highly promising for replacing encoder-based architectures in 3D understanding, offering solutions to long-standing challenges in 3D LMMs while maintaining competitive performance.
Abstract: Encoder-free architectures have been preliminarily explored in the 2D Large Multimodal Models (LMMs), yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to alleviate the challenges of encoder-based 3D LMMs. These long-standing challenges include the failure to adapt to varying point cloud resolutions during inference and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the pre-trained encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses. And we present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM layers to focus on the local details of the point clouds. To the end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the state-of-the-art model, PointLLM-PiSA-13B, achieving 57.91%, 61.0%, and 55.20% on the classification, captioning, and VQA tasks, respectively. Our results show that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL
[296] The Inductive Bottleneck: Data-Driven Emergence of Representational Sparsity in Vision Transformers
Kanishk Awadhiya
Main category: cs.CV
TL;DR: ViTs develop a U-shaped entropy profile (inductive bottleneck) that adapts to dataset complexity - deeper bottlenecks for more abstract semantic tasks, shallower for texture-heavy datasets.
Details
Motivation: Vision Transformers lack hierarchical biases of CNNs but still show U-shaped entropy patterns. The paper investigates whether this "inductive bottleneck" is an architectural artifact or a data-driven adaptation to task complexity.Method: Analyzed layer-wise Effective Encoding Dimension (EED) of DINO-trained ViTs across datasets of varying compositional complexity (UC Merced, Tiny ImageNet, CIFAR-100). Examined how bottleneck depth correlates with semantic abstraction requirements.
Result: The inductive bottleneck is data-dependent: texture-heavy datasets preserve high-rank representations throughout, while object-centric datasets drive networks to dampen high-frequency information in middle layers, creating bottlenecks to isolate semantic features.
Conclusion: The U-shaped entropy profile in ViTs is not an architectural artifact but an adaptive mechanism that learns to compress information based on dataset complexity, with bottleneck depth scaling with required semantic abstraction.
Abstract: Vision Transformers (ViTs) lack the hierarchical inductive biases inherent to Convolutional Neural Networks (CNNs), theoretically allowing them to maintain high-dimensional representations throughout all layers. However, recent observations suggest ViTs often spontaneously manifest a “U-shaped” entropy profile-compressing information in middle layers before expanding it for the final classification. In this work, we demonstrate that this “Inductive Bottleneck” is not an architectural artifact, but a data-dependent adaptation. By analyzing the layer-wise Effective Encoding Dimension (EED) of DINO-trained ViTs across datasets of varying compositional complexity (UC Merced, Tiny ImageNet, and CIFAR-100), we show that the depth of the bottleneck correlates strongly with the semantic abstraction required by the task. We find that while texture-heavy datasets preserve high-rank representations throughout, object-centric datasets drive the network to dampen high-frequency information in middle layers, effectively “learning” a bottleneck to isolate semantic features.
[297] SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination
Sangha Park, Seungryong Yoo, Jisoo Mok, Sungroh Yoon
Main category: cs.CV
TL;DR: SAVE is a training-free framework that uses Sparse Autoencoder features to reduce object hallucination in Multimodal Large Language Models by identifying and steering along visual understanding features.
Details
Motivation: MLLMs suffer from object hallucination due to language priors and visual information loss, which undermines their reliability in multimodal understanding tasks.Method: Uses Sparse Autoencoder (SAE) latent features; identifies visual understanding features via binary object-presence QA probe; steers model along these features to reinforce grounded visual understanding.
Result: Outperforms SOTA training-free methods: 10%p improvement in CHAIR_S, consistent gains on POPE and MMHal-Bench; robust across multiple models and layers; suppresses uncertain object tokens and increases attention to image tokens.
Conclusion: SAVE effectively mitigates hallucination by steering along SAE-derived visual understanding features, offering a simple yet powerful training-free solution with strong generalizability.
Abstract: Although Multimodal Large Language Models (MLLMs) have advanced substantially, they remain vulnerable to object hallucination caused by language priors and visual information loss. To address this, we propose SAVE (Sparse Autoencoder-Driven Visual Information Enhancement), a framework that mitigates hallucination by steering the model along Sparse Autoencoder (SAE) latent features. A binary object-presence question-answering probe identifies the SAE features most indicative of the model’s visual information processing, referred to as visual understanding features. Steering the model along these identified features reinforces grounded visual understanding and effectively reduces hallucination. With its simple design, SAVE outperforms state-of-the-art training-free methods on standard benchmarks, achieving a 10%p improvement in CHAIR_S and consistent gains on POPE and MMHal-Bench. Extensive evaluations across multiple models and layers confirm the robustness and generalizability of our approach. Further analysis reveals that steering along visual understanding features suppresses the generation of uncertain object tokens and increases attention to image tokens, mitigating hallucination. Code is released at https://github.com/wiarae/SAVE.
[298] Generalized Referring Expression Segmentation on Aerial Photos
Luís Marnoto, Alexandre Bernardino, Bruno Martins
Main category: cs.CV
TL;DR: Aerial-D is a new large-scale referring expression segmentation dataset for aerial imagery with 37K images and 1.5M expressions covering 259K annotated targets across 21 classes, created via an automated pipeline with LLM enhancement, enabling unified instance and semantic segmentation from text for both modern and historical aerial images.
Details
Motivation: Referring expression segmentation in aerial imagery faces unique challenges: varying spatial resolution, inconsistent color usage, small targets (few pixels), high object density, and partial occlusions. Existing datasets don't adequately address these aerial-specific challenges, especially for both modern and historical imagery.Method: Created Aerial-D dataset through fully automatic pipeline combining systematic rule-based expression generation with LLM enhancement for linguistic variety and visual detail focus. Applied filters to simulate historic imaging conditions. Used RSRefSeg architecture and trained models on Aerial-D combined with prior aerial datasets.
Result: Combined training achieves competitive performance on contemporary benchmarks while maintaining strong accuracy under monochrome, sepia, and grainy degradations typical of archival aerial photography. Dataset includes 37,288 images, 1,522,523 referring expressions covering 259,709 annotated targets across 21 classes.
Conclusion: Aerial-D enables unified instance and semantic segmentation from text for both modern and historical aerial imagery, addressing unique aerial vision challenges. The dataset, models, and software pipeline are publicly available to advance research in aerial referring expression segmentation.
Abstract: Referring expression segmentation is a fundamental task in computer vision that integrates natural language understanding with precise visual localization of target regions. Considering aerial imagery (e.g., modern aerial photos collected through drones, historical photos from aerial archives, high-resolution satellite imagery, etc.) presents unique challenges because spatial resolution varies widely across datasets, the use of color is not consistent, targets often shrink to only a few pixels, and scenes contain very high object densities and objects with partial occlusions. This work presents Aerial-D, a new large-scale referring expression segmentation dataset for aerial imagery, comprising 37,288 images with 1,522,523 referring expressions that cover 259,709 annotated targets, spanning across individual object instances, groups of instances, and semantic regions covering 21 distinct classes that range from vehicles and infrastructure to land coverage types. The dataset was constructed through a fully automatic pipeline that combines systematic rule-based expression generation with a Large Language Model (LLM) enhancement procedure that enriched both the linguistic variety and the focus on visual details within the referring expressions. Filters were additionally used to simulate historic imaging conditions for each scene. We adopted the RSRefSeg architecture, and trained models on Aerial-D together with prior aerial datasets, yielding unified instance and semantic segmentation from text for both modern and historical images. Results show that the combined training achieves competitive performance on contemporary benchmarks, while maintaining strong accuracy under monochrome, sepia, and grainy degradations that appear in archival aerial photography. The dataset, trained models, and complete software pipeline are publicly available at https://luispl77.github.io/aerial-d .
[299] Debiasing Diffusion Priors via 3D Attention for Consistent Gaussian Splatting
Shilong Jin, Haoran Duan, Litao Hua, Wentao Huang, Yuan Zhou
Main category: cs.CV
TL;DR: TD-Attn is a novel framework that addresses multi-view inconsistency in 3D tasks by correcting prior view bias in Text-to-Image diffusion models through 3D-aware attention guidance and hierarchical attention modulation.
Details
Motivation: Text-to-Image diffusion models used for 3D tasks suffer from prior view bias, causing conflicting appearances between different views of an object due to subject-words preferentially activating prior view features regardless of target view conditions.Method: Proposes TD-Attn with two key components: 1) 3D-Aware Attention Guidance Module constructs view-consistent 3D attention Gaussian for subject-words to enforce spatial consistency; 2) Hierarchical Attention Modulation Module uses Semantic Guidance Tree and Semantic Response Profiler to localize and modulate CA layers responsive to view conditions.
Result: Extensive experiments show TD-Attn significantly enhances multi-view consistency across 3D tasks and has potential to serve as a universal plugin for various 3D applications.
Conclusion: TD-Attn effectively addresses the prior view bias limitation in T2I models, enabling more consistent and controllable 3D generation and editing without requiring extensive 3D training data.
Abstract: Versatile 3D tasks (e.g., generation or editing) that distill from Text-to-Image (T2I) diffusion models have attracted significant research interest for not relying on extensive 3D training data. However, T2I models exhibit limitations resulting from prior view bias, which produces conflicting appearances between different views of an object. This bias causes subject-words to preferentially activate prior view features during cross-attention (CA) computation, regardless of the target view condition. To overcome this limitation, we conduct a comprehensive mathematical analysis to reveal the root cause of the prior view bias in T2I models. Moreover, we find different UNet layers show different effects of prior view in CA. Therefore, we propose a novel framework, TD-Attn, which addresses multi-view inconsistency via two key components: (1) the 3D-Aware Attention Guidance Module (3D-AAG) constructs a view-consistent 3D attention Gaussian for subject-words to enforce spatial consistency across attention-focused regions, thereby compensating for the limited spatial information in 2D individual view CA maps; (2) the Hierarchical Attention Modulation Module (HAM) utilizes a Semantic Guidance Tree (SGT) to direct the Semantic Response Profiler (SRP) in localizing and modulating CA layers that are highly responsive to view conditions, where the enhanced CA maps further support the construction of more consistent 3D attention Gaussians. Notably, HAM facilitates semantic-specific interventions, enabling controllable and precise 3D editing. Extensive experiments firmly establish that TD-Attn has the potential to serve as a universal plugin, significantly enhancing multi-view consistency across 3D tasks.
[300] WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling
Shaoheng Fang, Hanwen Jiang, Yunpeng Bai, Niloy J. Mitra, Qixing Huang
Main category: cs.CV
TL;DR: WorldReel is a 4D video generator that produces spatio-temporally consistent videos with explicit 4D scene representations (pointmaps, camera trajectory, dense flow), trained on synthetic+real data for geometric fidelity and visual realism.
Details
Motivation: Current video generators achieve photorealism but lack 3D consistency, producing fundamentally inconsistent scenes. There's a need for native spatio-temporal consistency in video generation to enable coherent geometry and appearance modeling over time.Method: WorldReel jointly generates RGB frames with explicit 4D scene representations including pointmaps, camera trajectory, and dense flow mapping. It’s trained on a blend of synthetic data (providing precise 4D supervision for geometry, motion, and camera) and real videos (contributing visual diversity and realism).
Result: WorldReel sets new state-of-the-art for consistent video generation with dynamic scenes and moving cameras, improving metrics of geometric consistency, motion coherence, and reducing view-time artifacts compared to competing methods.
Conclusion: WorldReel advances video generation toward 4D-consistent world modeling, enabling agents to render, interact, and reason about scenes through a single stable spatiotemporal representation, bridging the gap between photorealism and 3D consistency.
Abstract: Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly produces RGB frames together with 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement. We train WorldReel by carefully combining synthetic and real data: synthetic data providing precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments demonstrate that WorldReel sets a new state-of-the-art for consistent video generation with dynamic scenes and moving cameras, improving metrics of geometric consistency, motion coherence, and reducing view-time artifacts over competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact, and reason about scenes through a single and stable spatiotemporal representation.
[301] MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition
Xinyu Wei, Kangrui Cen, Hongyang Wei, Zhen Guo, Bairui Li, Zeqing Wang, Jinrui Zhang, Lei Zhang
Main category: cs.CV
TL;DR: MICo-150K: A large-scale dataset for Multi-Image Composition with 7 tasks, enabling models to synthesize coherent images from multiple reference inputs.
Details
Motivation: Multi-Image Composition (MICo) - synthesizing coherent images from multiple references - remains challenging due to lack of high-quality training data. The paper aims to bridge this gap by creating comprehensive resources.Method: 1) Categorize MICo into 7 representative tasks; 2) Curate source images and construct diverse prompts; 3) Use proprietary models to synthesize balanced composite images; 4) Human-in-the-loop filtering to create MICo-150K dataset; 5) Build Decomposition-and-Recomposition subset with real-world images; 6) Create MICo-Bench for evaluation; 7) Introduce Weighted-Ref-VIEScore metric; 8) Fine-tune models on MICo-150K.
Result: MICo-150K effectively equips models without MICo capability and enhances existing skills. Baseline model Qwen-MICo matches Qwen-Image-2509 in 3-image composition while supporting arbitrary multi-image inputs beyond the latter’s limitation.
Conclusion: The dataset, benchmark, and baseline collectively provide valuable resources for Multi-Image Composition research, addressing the data scarcity problem and enabling comprehensive evaluation of MICo capabilities.
Abstract: In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data. To bridge this gap, we conduct a systematic study of MICo, categorizing it into 7 representative tasks and curate a large-scale collection of high-quality source images and construct diverse MICo prompts. Leveraging powerful proprietary models, we synthesize a rich amount of balanced composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K, a comprehensive dataset for MICo with identity consistency. We further build a Decomposition-and-Recomposition (De&Re) subset, where 11K real-world complex images are decomposed into components and recomposed, enabling both real and synthetic compositions. To enable comprehensive evaluation, we construct MICo-Bench with 100 cases per task and 300 challenging De&Re cases, and further introduce a new metric, Weighted-Ref-VIEScore, specifically tailored for MICo evaluation. Finally, we fine-tune multiple models on MICo-150K and evaluate them on MICo-Bench. The results show that MICo-150K effectively equips models without MICo capability and further enhances those with existing skills. Notably, our baseline model, Qwen-MICo, fine-tuned from Qwen-Image-Edit, matches Qwen-Image-2509 in 3-image composition while supporting arbitrary multi-image inputs beyond the latter’s limitation. Our dataset, benchmark, and baseline collectively offer valuable resources for further research on Multi-Image Composition.
[302] One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
Yuan Gao, Chen Chen, Tianrong Chen, Jiatao Gu
Main category: cs.CV
TL;DR: FAE is a simple framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using minimal architecture (single attention layer), achieving state-of-the-art image generation quality.
Details
Motivation: There's a fundamental mismatch between understanding-oriented visual representations (which need high-dimensional features for diverse hypotheses) and generation-friendly latent spaces (which need low-dimensional spaces to preserve noise). Existing approaches require complex objectives and architectures to bridge this gap.Method: FAE uses two separate deep decoders: one trained to reconstruct the original feature space from compressed latents, and a second that takes the reconstructed features as input for image generation. This allows adaptation of pre-trained visual representations into low-dimensional latents suitable for generation with minimal architecture (as little as a single attention layer).
Result: FAE achieves strong performance across class-conditional and text-to-image benchmarks. On ImageNet 256x256, the diffusion model with CFG attains FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, it reaches state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.
Conclusion: FAE provides a simple yet effective solution to bridge the gap between understanding-oriented visual representations and generation-friendly latent spaces, achieving state-of-the-art performance with minimal architectural complexity.
Abstract: Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256x256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.
[303] Enhancing Small Object Detection with YOLO: A Novel Framework for Improved Accuracy and Efficiency
Mahila Moghadami, Mohammad Ali Keyvanrad, Melika Sabaghian
Main category: cs.CV
TL;DR: Enhanced SW-YOLO model for small object detection in aerial images achieves 61.2 mAP on VisDrone2019, significantly outperforming baseline YOLOv5L (35.5) and CZDet (58.36).
Details
Motivation: Small object detection in large-scale aerial images is critical for industrial applications, but current methods using image cropping and architectural modifications need improvement for better accuracy and robustness.Method: Enhanced SW-YOLO approach with refined sliding window cropping dimensions and overlap, plus architectural modifications including advanced feature extraction modules in the neck, CBAM integration in backbone for spatial/channel information preservation, and new detection head.
Result: Achieved 61.2 mAP .5.5 on VisDrone2019 dataset, significantly outperforming baseline YOLOv5L (35.5) and CZDet (58.36), and showing superior performance compared to SAHI framework.
Conclusion: The proposed enhanced SW-YOLO model demonstrates substantial improvements in small object detection accuracy for aerial imagery, offering a robust framework for critical industrial applications.
Abstract: This paper investigates and develops methods for detecting small objects in large-scale aerial images. Current approaches for detecting small objects in aerial images often involve image cropping and modifications to detector network architectures. Techniques such as sliding window cropping and architectural enhancements, including higher-resolution feature maps and attention mechanisms, are commonly employed. Given the growing importance of aerial imagery in various critical and industrial applications, the need for robust frameworks for small object detection becomes imperative. To address this need, we adopted the base SW-YOLO approach to enhance speed and accuracy in small object detection by refining cropping dimensions and overlap in sliding window usage and subsequently enhanced it through architectural modifications. we propose a novel model by modifying the base model architecture, including advanced feature extraction modules in the neck for feature map enhancement, integrating CBAM in the backbone to preserve spatial and channel information, and introducing a new head to boost small object detection accuracy. Finally, we compared our method with SAHI, one of the most powerful frameworks for processing large-scale images, and CZDet, which is also based on image cropping, achieving significant improvements in accuracy. The proposed model achieves significant accuracy gains on the VisDrone2019 dataset, outperforming baseline YOLOv5L detection by a substantial margin. Specifically, the final proposed model elevates the mAP .5.5 accuracy on the VisDrone2019 dataset from the base accuracy of 35.5 achieved by the YOLOv5L detector to 61.2. Notably, the accuracy of CZDet, which is another classic method applied to this dataset, is 58.36. This research demonstrates a significant improvement, achieving an increase in accuracy from 35.5 to 61.2.
[304] Relational Visual Similarity
Thao Nguyen, Sicheng Mo, Krishna Kumar Singh, Yilin Wang, Jing Shi, Nicholas Kolkin, Eli Shechtman, Yong Jae Lee, Yuheng Li
Main category: cs.CV
TL;DR: The paper introduces relational image similarity as a new dimension beyond perceptual attribute similarity, develops a method to measure it using anonymized captions and vision-language model fine-tuning, and shows existing models fail to capture this human-like relational understanding.
Details
Motivation: Humans perceive relational similarity (e.g., Earth is like a peach due to similar layered structure) beyond just attribute similarity, but current visual similarity metrics (LPIPS, CLIP, DINO) only capture perceptual attributes and miss relational properties that distinguish human cognition.Method: 1) Formulate relational image similarity as measurable problem; 2) Curate 114k image-caption dataset with anonymized captions describing relational logic rather than surface content; 3) Fine-tune Vision-Language model to measure relational similarity between images.
Result: Developed first model to connect images by underlying relational structure rather than visible appearance, demonstrating that existing image similarity models fail to capture relational similarity, revealing a critical gap in visual computing.
Conclusion: Relational similarity has important real-world applications but is missing from current visual computing models; the proposed approach represents a first step toward capturing human-like relational understanding in image analysis.
Abstract: Humans do not just see attribute similarity – we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach’s skin, flesh, and pit. This ability to perceive and recognize relational similarity, is arguable by cognitive scientist to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate 114k image-caption dataset in which the captions are anonymized – describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has a lot of real-world applications, existing image similarity models fail to capture it – revealing a critical gap in visual computing.
[305] Tessellation GS: Neural Mesh Gaussians for Robust Monocular Reconstruction of Dynamic Objects
Shuohan Tao, Boyao Zhou, Hanzhang Tu, Yuwang Wang, Yebin Liu
Main category: cs.CV
TL;DR: Tessellation GS improves 3D Gaussian Splatting by anchoring 2D Gaussians to mesh faces with hierarchical neural features, enabling better dynamic scene reconstruction from single cameras.
Details
Motivation: 3D Gaussian Splatting struggles with viewpoint extrapolation, overfitting, and poor generalization in sparse-view and dynamic scene reconstruction, especially with single static cameras.Method: Anchors 2D Gaussians to mesh faces, uses hierarchical neural features to infer Gaussian attributes, employs adaptive face subdivision guided by detail-aware loss, and leverages foundation model priors to initialize Gaussian deformations.
Result: Outperforms previous SOTA method with 29.1% reduction in LPIPS and 49.2% reduction in Chamfer distance for appearance and mesh reconstruction tasks.
Conclusion: Tessellation GS enables robust reconstruction of general dynamic objects from single static cameras, overcoming limitations of optimization-based methods for dynamic scene reconstruction.
Abstract: 3D Gaussian Splatting (GS) enables highly photorealistic scene reconstruction from posed image sequences but struggles with viewpoint extrapolation due to its anisotropic nature, leading to overfitting and poor generalization, particularly in sparse-view and dynamic scene reconstruction. We propose Tessellation GS, a structured 2D GS approach anchored on mesh faces, to reconstruct dynamic scenes from a single continuously moving or static camera. Our method constrains 2D Gaussians to localized regions and infers their attributes via hierarchical neural features on mesh faces. Gaussian subdivision is guided by an adaptive face subdivision strategy driven by a detail-aware loss function. Additionally, we leverage priors from a reconstruction foundation model to initialize Gaussian deformations, enabling robust reconstruction of general dynamic objects from a single static camera, previously extremely challenging for optimization-based methods. Our method outperforms previous SOTA method, reducing LPIPS by 29.1% and Chamfer distance by 49.2% on appearance and mesh reconstruction tasks.
[306] LogicCBMs: Logic-Enhanced Concept-Based Learning
Deepika SN Vemuri, Gautham Bellamkonda, Aditya Pola, Vineeth N Balasubramanian
Main category: cs.CV
TL;DR: LogicCBM enhances concept bottleneck models by replacing linear concept combinations with differentiable logic operations, improving accuracy, interpretability, and intervention capabilities.
Details
Motivation: Current Concept Bottleneck Models (CBMs) are limited by their linear combination approach for concept-based predictions, which restricts their expressiveness and ability to capture complex inter-concept relationships.Method: Introduces a logic module that connects learned concepts from CBMs through differentiable logic operations, enabling end-to-end learnability while allowing various logical operations (beyond simple weighted combinations) for final predictions.
Result: Empirical studies on well-known benchmarks and synthetic datasets show that LogicCBM models achieve better accuracy, perform effective interventions, and maintain high interpretability compared to traditional CBMs.
Conclusion: Enhancing concept-based learning models with propositional logic operations improves model expressivity, captures inter-concept relations better, and provides superior performance while preserving interpretability.
Abstract: Concept Bottleneck Models (CBMs) provide a basis for semantic abstractions within a neural network architecture. Such models have primarily been seen through the lens of interpretability so far, wherein they offer transparency by inferring predictions as a linear combination of semantic concepts. However, a linear combination is inherently limiting. So we propose the enhancement of concept-based learning models through propositional logic. We introduce a logic module that is carefully designed to connect the learned concepts from CBMs through differentiable logic operations, such that our proposed LogicCBM can go beyond simple weighted combinations of concepts to leverage various logical operations to yield the final predictions, while maintaining end-to-end learnability. Composing concepts using a set of logic operators enables the model to capture inter-concept relations, while simultaneously improving the expressivity of the model in terms of logic operations. Our empirical studies on well-known benchmarks and synthetic datasets demonstrate that these models have better accuracy, perform effective interventions and are highly interpretable.
[307] How Far are Modern Trackers from UAV-Anti-UAV? A Million-Scale Benchmark and New Baseline
Chunhui Zhang, Li Liu, Zhipeng Zhang, Yong Wang, Hao Wen, Xi Zhou, Shiming Ge, Yanfeng Wang
Main category: cs.CV
TL;DR: Proposes UAV-Anti-UAV tracking task where a pursuer UAV tracks an adversarial UAV, introduces a million-scale dataset, and presents MambaSTS baseline method for integrated spatial-temporal-semantic learning.
Details
Motivation: Current Anti-UAV research focuses on fixed ground cameras, but there's a gap in tracking target UAVs from another moving UAV platform. The UAV-Anti-UAV task addresses this with more challenging dual-dynamic disturbances.Method: Proposes MambaSTS, a Mamba-based baseline method that uses Mamba for global semantic features, Transformer for spatial features, and state space models for long-sequence modeling with temporal token propagation for video-level context.
Result: Created a dataset of 1,810 videos with manual annotations. Experimental evaluation of 50 modern tracking algorithms shows significant room for improvement in the UAV-Anti-UAV domain.
Conclusion: The UAV-Anti-UAV task presents new challenges with dual-dynamic disturbances, and the proposed MambaSTS baseline with integrated spatial-temporal-semantic learning provides a foundation for future research in this emerging domain.
Abstract: Unmanned Aerial Vehicles (UAVs) offer wide-ranging applications but also pose significant safety and privacy violation risks in areas like airport and infrastructure inspection, spurring the rapid development of Anti-UAV technologies in recent years. However, current Anti-UAV research primarily focuses on RGB, infrared (IR), or RGB-IR videos captured by fixed ground cameras, with little attention to tracking target UAVs from another moving UAV platform. To fill this gap, we propose a new multi-modal visual tracking task termed UAV-Anti-UAV, which involves a pursuer UAV tracking a target adversarial UAV in the video stream. Compared to existing Anti-UAV tasks, UAV-Anti-UAV is more challenging due to severe dual-dynamic disturbances caused by the rapid motion of both the capturing platform and the target. To advance research in this domain, we construct a million-scale dataset consisting of 1,810 videos, each manually annotated with bounding boxes, a language prompt, and 15 tracking attributes. Furthermore, we propose MambaSTS, a Mamba-based baseline method for UAV-Anti-UAV tracking, which enables integrated spatial-temporal-semantic learning. Specifically, we employ Mamba and Transformer models to learn global semantic and spatial features, respectively, and leverage the state space model’s strength in long-sequence modeling to establish video-level long-term context via a temporal token propagation mechanism. We conduct experiments on the UAV-Anti-UAV dataset to validate the effectiveness of our method. A thorough experimental evaluation of 50 modern deep tracking algorithms demonstrates that there is still significant room for improvement in the UAV-Anti-UAV domain. The dataset and codes will be available at {\color{magenta}https://github.com/983632847/Awesome-Multimodal-Object-Tracking}.
[308] GlimmerNet: A Lightweight Grouped Dilated Depthwise Convolutions for UAV-Based Emergency Monitoring
Đorđe Nedeljković
Main category: cs.CV
TL;DR: GlimmerNet is an ultra-lightweight CNN that achieves strong global perception without expensive self-attention, using grouped dilated convolutions and efficient feature recombination to set new accuracy-efficiency trade-offs for UAV emergency monitoring.
Details
Motivation: While CNNs are efficient for edge/mobile vision, recent attempts to add global context via Vision Transformers introduce significant computational overhead. The authors aim to retain strong global perception without relying on computationally expensive components.Method: GlimmerNet introduces Grouped Dilated Depthwise Convolutions (GDBlocks) that partition channels into groups with distinct dilation rates for multi-scale feature extraction at no extra parameter cost. It also uses an Aggregator module with grouped pointwise convolution to efficiently recombine cross-group representations.
Result: With only 31K parameters and 29% fewer FLOPs than the most recent baseline, GlimmerNet achieves state-of-the-art weighted F1-score of 0.966 on the UAV-focused AIDERv2 dataset.
Conclusion: GlimmerNet establishes a new accuracy-efficiency trade-off frontier for real-time emergency monitoring on resource-constrained UAV platforms, demonstrating that strong global perception can be achieved without computationally expensive components.
Abstract: Convolutional Neural Networks (CNNs) have proven highly effective for edge and mobile vision tasks due to their computational efficiency. While many recent works seek to enhance CNNs with global contextual understanding via self-attention-based Vision Transformers, these approaches often introduce significant computational overhead. In this work, we demonstrate that it is possible to retain strong global perception without relying on computationally expensive components. We present GlimmerNet, an ultra-lightweight convolutional network built on the principle of separating receptive field diversity from feature recombination. GlimmerNet introduces Grouped Dilated Depthwise Convolutions(GDBlocks), which partition channels into groups with distinct dilation rates, enabling multi-scale feature extraction at no additional parameter cost. To fuse these features efficiently, we design a novel Aggregator module that recombines cross-group representations using grouped pointwise convolution, significantly lowering parameter overhead. With just 31K parameters and 29% fewer FLOPs than the most recent baseline, GlimmerNet achieves a new state-of-the-art weighted F1-score of 0.966 on the UAV-focused AIDERv2 dataset. These results establish a new accuracy-efficiency trade-off frontier for real-time emergency monitoring on resource-constrained UAV platforms. Our implementation is publicly available at https://github.com/djordjened92/gdd-cnn.
[309] Reconstructing Objects along Hand Interaction Timelines in Egocentric Video
Zhifan Zhu, Siddhant Bansal, Shashank Tripathi, Dima Damen
Main category: cs.CV
TL;DR: ROHIT: Reconstructing Objects along Hand Interaction Timelines using constrained pose propagation for better 3D object reconstruction in egocentric videos without 3D ground truth.
Details
Motivation: Need to reconstruct 3D objects in egocentric videos where objects undergo hand interactions, without requiring 3D ground truth annotations which are expensive to obtain.Method: Defines Hand Interaction Timeline (HIT) with pose constraints (static → contact → firm grip → release → static). Proposes Constrained Optimisation and Propagation (COP) framework to propagate object poses along HIT, focusing on stable grasp periods where hand maintains constant contact.
Result: Evaluated on HOT3D (1.2K clips) and EPIC-Kitchens (2.4K clips, 390 object instances, 9 categories, 141 environments). COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5% using 2D projection error metrics.
Conclusion: ROHIT task and COP framework enable effective 3D object reconstruction in hand interaction videos without 3D ground truth, leveraging temporal pose constraints along interaction timelines.
Abstract: We introduce the task of Reconstructing Objects along Hand Interaction Timelines (ROHIT). We first define the Hand Interaction Timeline (HIT) from a rigid object’s perspective. In a HIT, an object is first static relative to the scene, then is held in hand following contact, where its pose changes. This is usually followed by a firm grip during use, before it is released to be static again w.r.t. to the scene. We model these pose constraints over the HIT, and propose to propagate the object’s pose along the HIT enabling superior reconstruction using our proposed Constrained Optimisation and Propagation (COP) framework. Importantly, we focus on timelines with stable grasps - i.e. where the hand is stably holding an object, effectively maintaining constant contact during use. This allows us to efficiently annotate, study, and evaluate object reconstruction in videos without 3D ground truth. We evaluate our proposed task, ROHIT, over two egocentric datasets, HOT3D and in-the-wild EPIC-Kitchens. In HOT3D, we curate 1.2K clips of stable grasps. In EPIC-Kitchens, we annotate 2.4K clips of stable grasps including 390 object instances across 9 categories from videos of daily interactions in 141 environments. Without 3D ground truth, we utilise 2D projection error to assess the reconstruction. Quantitatively, COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5% with constrained pose propagation.
[310] InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs
Bin Li, Ruichi Zhang, Han Liang, Jingyan Zhang, Juze Zhang, Xin Chen, Lan Xu, Jingyi Yu, Jingya Wang
Main category: cs.CV
TL;DR: InterAgent: First end-to-end framework for text-driven physics-based multi-agent humanoid control using autoregressive diffusion transformer with multi-stream blocks and interaction graph representation.
Details
Motivation: Existing methods focus on single-agent scenarios and overlook physically plausible interplay essential for multi-agent interactions. Need to bridge the gap for complex coordination in human social behaviors.Method: 1) Autoregressive diffusion transformer with multi-stream blocks decouples proprioception, exteroception, and action to mitigate cross-modal interference while enabling synergistic coordination. 2) Novel interaction graph exteroception representation captures fine-grained joint-to-joint spatial dependencies. 3) Sparse edge-based attention mechanism dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations.
Result: InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance. Enables producing coherent, physically plausible, and semantically faithful multi-agent behaviors from only text prompts.
Conclusion: InterAgent is the first end-to-end framework for text-driven physics-based multi-agent humanoid control that successfully addresses the limitations of existing single-agent methods, enabling realistic multi-agent social interactions.
Abstract: Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential for multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to mitigate cross-modal interference while enabling synergistic coordination. We further propose a novel interaction graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate network learning. Additionally, within it we devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance. It enables producing coherent, physically plausible, and semantically faithful multi-agent behaviors from only text prompts. Our code and data will be released to facilitate future research.
[311] Unified Video Editing with Temporal Reasoner
Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, Qiang Wu
Main category: cs.CV
TL;DR: VideoCoF introduces a Chain-of-Frames approach for precise mask-free video editing by predicting edit-region latents as reasoning tokens before generating target video, achieving state-of-the-art performance with minimal data.
Details
Motivation: Existing video editing methods face a trade-off: expert models need task-specific masks (hard to unify) while unified models lack spatial cues, resulting in weak instruction-to-region mapping and imprecise localization.Method: VideoCoF enforces a “see, reason, then edit” procedure inspired by Chain-of-Thought reasoning. It compels video diffusion models to first predict reasoning tokens (edit-region latents) before generating target video tokens, enabling explicit reasoning without user masks. Also introduces RoPE alignment strategy for motion alignment and length extrapolation.
Result: Achieves state-of-the-art performance on VideoCoF-Bench with only 50k video pairs, demonstrating efficient and effective precise video editing without masks.
Conclusion: VideoCoF resolves the precision vs. unification trade-off in video editing through explicit reasoning tokens, enabling mask-free fine-grained editing with strong instruction-to-region alignment and motion consistency.
Abstract: Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a ``see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weight, data are available at https://github.com/knightyxp/VideoCoF.
[312] PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown
Main category: cs.CV
TL;DR: PoSh is a new evaluation metric for detailed image descriptions that uses scene graphs as structured rubrics to guide LLM judges, outperforming existing metrics and better correlating with human judgments.
Details
Motivation: Existing metrics for image description evaluation (like CIDEr, SPICE) were designed for short texts and fail to properly evaluate detailed descriptions, lacking sensitivity to attribute/relation attachments and localization of errors in long text spans.Method: PoSh uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors like mistakes in compositional understanding. They also introduce DOCENT dataset with artwork, expert references, and human judgments.
Result: PoSh achieves stronger correlations (+0.05 Spearman ρ) with human judgments than best alternatives, is robust across image types, works as a capable reward function outperforming standard supervised fine-tuning, and reveals foundation models struggle with rich scene dynamics.
Conclusion: PoSh provides a replicable, interpretable metric that better proxies human evaluation for detailed image descriptions, while DOCENT establishes a challenging new benchmark for evaluating both metrics and VLMs in complex domains like artwork description.
Abstract: While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $ρ$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
[313] Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance
Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Zihan Zheng, Yuan Zhang, Yan Lu
Main category: cs.CV
TL;DR: S2VC is a single-step diffusion-based video codec that achieves state-of-the-art perceptual quality at low bitrates with 52.73% bitrate savings over prior methods, using efficient single-step diffusion generation instead of heavy sampling.
Details
Motivation: Traditional and neural video codecs struggle with perceptual quality at low bitrates. Some NVCs use perceptual/adversarial objectives but still have artifacts due to limited generation capacity, while others use diffusion models but suffer from heavy sampling complexity.Method: S2VC integrates conditional coding framework with efficient single-step diffusion generator. It introduces Contextual Semantic Guidance to extract frame-adaptive semantics from buffered features (replacing text captions), and Temporal Consistency Guidance in the diffusion U-Net to enforce temporal coherence across frames.
Result: Extensive experiments show S2VC delivers state-of-the-art perceptual quality with average 52.73% bitrate saving over prior perceptual methods, enabling realistic reconstruction at low bitrates with reduced sampling cost.
Conclusion: S2VC demonstrates the promise of single-step diffusion for efficient, high-quality video compression, overcoming challenges of both traditional NVCs and diffusion-based approaches through innovative conditioning and temporal guidance mechanisms.
Abstract: While traditional and neural video codecs (NVCs) have achieved remarkable rate-distortion performance, improving perceptual quality at low bitrates remains challenging. Some NVCs incorporate perceptual or adversarial objectives but still suffer from artifacts due to limited generation capacity, whereas others leverage pretrained diffusion models to improve quality at the cost of heavy sampling complexity. To overcome these challenges, we propose S2VC, a Single-Step diffusion based Video Codec that integrates a conditional coding framework with an efficient single-step diffusion generator, enabling realistic reconstruction at low bitrates with reduced sampling cost. Recognizing the importance of semantic conditioning in single-step diffusion, we introduce Contextual Semantic Guidance to extract frame-adaptive semantics from buffered features. It replaces text captions with efficient, fine-grained conditioning, thereby improving generation realism. In addition, Temporal Consistency Guidance is incorporated into the diffusion U-Net to enforce temporal coherence across frames and ensure stable generation. Extensive experiments show that S2VC delivers state-of-the-art perceptual quality with an average 52.73% bitrate saving over prior perceptual methods, underscoring the promise of single-step diffusion for efficient, high-quality video compression.
[314] Towards Robust DeepFake Detection under Unstable Face Sequences: Adaptive Sparse Graph Embedding with Order-Free Representation and Explicit Laplacian Spectral Prior
Chih-Chung Hsu, Shao-Ning Chen, Chia-Ming Lee, Yi-Fang Wang, Yi-Shiuan Chou
Main category: cs.CV
TL;DR: LR-GCN: A Laplacian-regularized graph convolutional network for robust DeepFake detection that handles noisy, shuffled, or missing face sequences using order-free temporal graph embeddings and spectral analysis.
Details
Motivation: Real-world DeepFake detection faces challenges due to compression artifacts, occlusions, and adversarial attacks that destabilize face detection, leading to invalid or misdetected faces. Most detectors assume clean, temporally consistent facial sequences, which rarely holds in practice.Method: Proposes LR-GCN with Order-Free Temporal Graph Embedding (OF-TGE) that organizes frame-wise CNN features into adaptive sparse graphs based on semantic affinities. Uses dual-level sparsity on graph structure and node features, and introduces Graph Laplacian Spectral Prior as a high-pass operator to highlight forgery artifacts, followed by low-pass GCN aggregation for a spectral band-pass mechanism.
Result: Achieves state-of-the-art performance on FF++, Celeb-DFv2, and DFDC datasets. Shows significantly improved robustness under severe disruptions including missing faces, occlusions, and adversarial perturbations to face detection.
Conclusion: LR-GCN provides a robust solution for real-world DeepFake detection by handling noisy, unordered face sequences through graph-based modeling and spectral analysis, overcoming limitations of traditional temporal sequence assumptions.
Abstract: Ensuring the authenticity of video content remains challenging as DeepFake generation becomes increasingly realistic and robust against detection. Most existing detectors implicitly assume temporally consistent and clean facial sequences, an assumption that rarely holds in real-world scenarios where compression artifacts, occlusions, and adversarial attacks destabilize face detection and often lead to invalid or misdetected faces. To address these challenges, we propose a Laplacian-Regularized Graph Convolutional Network (LR-GCN) that robustly detects DeepFakes from noisy or unordered face sequences, while being trained only on clean facial data. Our method constructs an Order-Free Temporal Graph Embedding (OF-TGE) that organizes frame-wise CNN features into an adaptive sparse graph based on semantic affinities. Unlike traditional methods constrained by strict temporal continuity, OF-TGE captures intrinsic feature consistency across frames, making it resilient to shuffled, missing, or heavily corrupted inputs. We further impose a dual-level sparsity mechanism on both graph structure and node features to suppress the influence of invalid faces. Crucially, we introduce an explicit Graph Laplacian Spectral Prior that acts as a high-pass operator in the graph spectral domain, highlighting structural anomalies and forgery artifacts, which are then consolidated by a low-pass GCN aggregation. This sequential design effectively realizes a task-driven spectral band-pass mechanism that suppresses background information and random noise while preserving manipulation cues. Extensive experiments on FF++, Celeb-DFv2, and DFDC demonstrate that LR-GCN achieves state-of-the-art performance and significantly improved robustness under severe global and local disruptions, including missing faces, occlusions, and adversarially perturbed face detections.
[315] MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer
Penghui Liu, Jiangshan Wang, Yutong Shen, Shanhui Mo, Chenyang Qi, Yue Ma
Main category: cs.CV
TL;DR: MultiMotion is a new framework for multi-object video motion transfer using Diffusion Transformers, featuring mask-aware attention motion flow and efficient sampling methods.
Details
Motivation: Current Diffusion Transformer architectures struggle with multi-object video motion transfer due to motion entanglement and lack of object-level control, making precise manipulation of multiple objects challenging.Method: Introduces Mask-aware Attention Motion Flow (AMF) using SAM2 masks to disentangle motion features for multiple objects, and RectPC, a high-order predictor-corrector solver for efficient sampling.
Result: Creates the first benchmark dataset for DiT-based multi-object motion transfer and demonstrates precise, semantically aligned, and temporally coherent motion transfer for multiple objects while maintaining DiT’s quality and scalability.
Conclusion: MultiMotion successfully addresses multi-object motion transfer challenges in DiT architectures through explicit motion disentanglement and efficient sampling, enabling better control over multiple objects in video generation.
Abstract: Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Maskaware Attention Motion Flow (AMF), which utilizes SAM2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects, maintaining DiT’s high quality and scalability. The code is in the supp.
[316] SJD++: Improved Speculative Jacobi Decoding for Training-free Acceleration of Discrete Auto-regressive Text-to-Image Generation
Yao Teng, Zhihuan Jiang, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu
Main category: cs.CV
TL;DR: SJD++ accelerates autoregressive text-to-image generation by 2-3× through training-free probabilistic parallel decoding with multi-token prediction and token reuse.
Details
Motivation: Large autoregressive models produce high-quality images but are slow due to requiring hundreds to thousands of sequential forward passes for next-token prediction during inference.Method: Speculative Jacobi Decoding++ (SJD++) combines iterative multi-token prediction from Jacobi decoding with probabilistic drafting-and-verification from speculative sampling, plus reuses high-confidence draft tokens after verification instead of resampling all.
Result: Achieves 2× to 3× inference latency reduction and 2× to 7× step compression while preserving visual quality with no observable degradation across several autoregressive text-to-image models.
Conclusion: SJD++ provides a training-free solution to significantly accelerate autoregressive image generation without sacrificing quality, making high-quality autoregressive models more practical for real-world applications.
Abstract: Large autoregressive models can generate high-quality, high-resolution images but suffer from slow generation speed, because these models require hundreds to thousands of sequential forward passes for next-token prediction during inference. To accelerate autoregressive text-to-image generation, we propose Speculative Jacobi Decoding++ (SJD++), a training-free probabilistic parallel decoding algorithm. Unlike traditional next-token prediction, SJD++ performs multi-token prediction in each forward pass, drastically reducing generation steps. Specifically, it integrates the iterative multi-token prediction mechanism from Jacobi decoding, with the probabilistic drafting-and-verification mechanism from speculative sampling. More importantly, for further acceleration, SJD++ reuses high-confidence draft tokens after each verification phase instead of resampling them all. We conduct extensive experiments on several representative autoregressive text-to-image generation models and demonstrate that SJD++ achieves $2\times$ to $3\times$ inference latency reduction and $2\times$ to $7\times$ step compression, while preserving visual quality with no observable degradation.
[317] ControlVP: Interactive Geometric Refinement of AI-Generated Images with Consistent Vanishing Points
Ryota Okumura, Kaede Shiohara, Toshihiko Yamasaki
Main category: cs.CV
TL;DR: ControlVP is a user-guided framework that corrects vanishing point inconsistencies in text-to-image generated scenes, improving geometric realism while maintaining visual quality.
Details
Motivation: Current text-to-image models like Stable Diffusion often produce geometric inconsistencies, particularly vanishing point errors where parallel lines don't converge correctly, leading to structurally implausible scenes that undermine spatial realism, especially in architectural contexts.Method: Extends pre-trained diffusion models by incorporating structural guidance from building contours and introducing geometric constraints that explicitly align image edges with perspective cues to correct vanishing point inconsistencies.
Result: The method enhances global geometric consistency while maintaining visual fidelity comparable to baseline models, making it particularly valuable for applications requiring accurate spatial structure like image-to-3D reconstruction.
Conclusion: ControlVP provides an effective user-guided solution for correcting vanishing point inconsistencies in generated images, improving structural realism without sacrificing visual quality, with potential applications in architectural visualization and 3D reconstruction.
Abstract: Recent text-to-image models, such as Stable Diffusion, have achieved impressive visual quality, yet they often suffer from geometric inconsistencies that undermine the structural realism of generated scenes. One prominent issue is vanishing point inconsistency, where projections of parallel lines fail to converge correctly in 2D space. This leads to structurally implausible geometry that degrades spatial realism, especially in architectural scenes. We propose ControlVP, a user-guided framework for correcting vanishing point inconsistencies in generated images. Our approach extends a pre-trained diffusion model by incorporating structural guidance derived from building contours. We also introduce geometric constraints that explicitly encourage alignment between image edges and perspective cues. Our method enhances global geometric consistency while maintaining visual fidelity comparable to the baselines. This capability is particularly valuable for applications that require accurate spatial structure, such as image-to-3D reconstruction. The dataset and source code are available at https://github.com/RyotaOkumura/ControlVP .
[318] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion
Samuele Dell’Erba, Andrew D. Bagdanov
Main category: cs.CV
TL;DR: The paper proposes Optimization-based Visual Inversion (OVI) as a training-free, data-free alternative to expensive diffusion prior networks for text-to-image generation, showing it can match or exceed state-of-the-art priors while revealing flaws in current evaluation benchmarks.
Details
Motivation: Current diffusion models rely on computationally expensive diffusion prior networks that require massive training datasets. The authors challenge whether such trained priors are necessary at all, seeking a more efficient alternative.Method: Propose Optimization-based Visual Inversion (OVI) - a training-free, data-free method that initializes latent visual representations from random pseudo-tokens and iteratively optimizes them to maximize cosine similarity with text embeddings. Introduce two novel constraints: Mahalanobis-based loss and Nearest-Neighbor loss to regularize optimization toward realistic image distributions.
Result: OVI serves as an effective alternative to traditional priors. The analysis reveals critical flaws in current benchmarks (T2I-CompBench++) where using text embeddings directly achieves high scores despite poor perceptual quality. Constrained OVI methods improve visual fidelity, with Nearest-Neighbor approach achieving quantitative scores comparable to or higher than state-of-the-art data-efficient priors.
Conclusion: Trained diffusion priors may not be necessary for text-to-image generation. OVI offers a promising training-free alternative that merits further investigation, while also exposing issues with current evaluation metrics that don’t adequately measure perceptual quality.
Abstract: Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
[319] MeshRipple: Structured Autoregressive Generation of Artist-Meshes
Junkai Lin, Hang Long, Huipeng Guo, Jielei Zhang, JiaYi Yang, Tianle Guo, Yang Yang, Jianwen Li, Wenxiao Zhang, Matthias Nießner, Wei Yang
Main category: cs.CV
TL;DR: MeshRipple: A novel mesh generation method that expands from a frontier like a ripple, maintaining topological coherence and long-range dependencies to avoid holes and fragmentation.
Details
Motivation: Autoregressive mesh generators serialize faces into sequences and use truncated segments with sliding-window inference due to memory limits, which breaks long-range geometric dependencies and produces holes and fragmented components.Method: Three key innovations: 1) frontier-aware BFS tokenization aligning generation order with surface topology, 2) expansive prediction strategy maintaining coherent connected surface growth, 3) sparse-attention global memory providing effectively unbounded receptive field for long-range topological dependencies.
Result: MeshRipple generates meshes with high surface fidelity and topological completeness, outperforming strong recent baselines.
Conclusion: MeshRipple’s integrated design addresses the critical limitation of broken long-range dependencies in autoregressive mesh generation, enabling coherent mesh generation with topological completeness.
Abstract: Meshes serve as a primary representation for 3D assets. Autoregressive mesh generators serialize faces into sequences and train on truncated segments with sliding-window inference to cope with memory limits. However, this mismatch breaks long-range geometric dependencies, producing holes and fragmented components. To address this critical limitation, we introduce MeshRipple, which expands a mesh outward from an active generation frontier, akin to a ripple on a surface.MeshRipple rests on three key innovations: a frontier-aware BFS tokenization that aligns the generation order with surface topology; an expansive prediction strategy that maintains coherent, connected surface growth; and a sparse-attention global memory that provides an effectively unbounded receptive field to resolve long-range topological dependencies.This integrated design enables MeshRipple to generate meshes with high surface fidelity and topological completeness, outperforming strong recent baselines.
[320] From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images
Fei Yu, Yu Liu, Luyang Tang, Mingchao Sun, Zengye Ge, Rui Bu, Yuchao Jin, Haisen Zhao, He Sun, Yangyan Li, Mu Xu, Wenzheng Chen, Baoquan Chen
Main category: cs.CV
TL;DR: City-scale 3D reconstruction from sparse satellite images using 2.5D height map modeling and texture restoration for extreme viewpoint extrapolation.
Details
Motivation: City-scale 3D reconstruction from satellite imagery faces extreme viewpoint extrapolation challenges (nearly 90° viewpoint gaps) where current methods like NeRF and 3DGS fail due to sparse orbital images with minimal parallax, foreshortened facades, and flawed textures.Method: 1) Model city geometry as 2.5D height map using Z-monotonic signed distance field (SDF) for stable optimization under sparse satellite views, producing watertight meshes with crisp roofs and vertical facades. 2) Paint mesh appearance via differentiable rendering and train generative texture restoration network to enhance degraded satellite inputs with high-frequency details.
Result: Successfully reconstructs 4km² real-world regions from few satellite images, achieving state-of-the-art performance in synthesizing photorealistic ground views. Produces visually compelling, high-fidelity assets suitable for urban planning and simulation applications.
Conclusion: The proposed method addresses extreme viewpoint extrapolation in city-scale 3D reconstruction through tailored geometric modeling and texture enhancement, demonstrating scalability and robustness for large-scale urban reconstruction from sparse satellite imagery.
Abstract: City-scale 3D reconstruction from satellite imagery presents the challenge of extreme viewpoint extrapolation, where our goal is to synthesize ground-level novel views from sparse orbital images with minimal parallax. This requires inferring nearly $90^\circ$ viewpoint gaps from image sources with severely foreshortened facades and flawed textures, causing state-of-the-art reconstruction engines such as NeRF and 3DGS to fail. To address this problem, we propose two design choices tailored for city structures and satellite inputs. First, we model city geometry as a 2.5D height map, implemented as a Z-monotonic signed distance field (SDF) that matches urban building layouts from top-down viewpoints. This stabilizes geometry optimization under sparse, off-nadir satellite views and yields a watertight mesh with crisp roofs and clean, vertically extruded facades. Second, we paint the mesh appearance from satellite images via differentiable rendering techniques. While the satellite inputs may contain long-range, blurry captures, we further train a generative texture restoration network to enhance the appearance, recovering high-frequency, plausible texture details from degraded inputs. Our method’s scalability and robustness are demonstrated through extensive experiments on large-scale urban reconstruction. For example, in our teaser figure, we reconstruct a $4,\mathrm{km}^2$ real-world region from only a few satellite images, achieving state-of-the-art performance in synthesizing photorealistic ground views. The resulting models are not only visually compelling but also serve as high-fidelity, application-ready assets for downstream tasks like urban planning and simulation.
[321] All You Need Are Random Visual Tokens? Demystifying Token Pruning in VLLMs
Yahong Wang, Juncheng Wu, Zhangkai Ni, Longzhen Yang, Yihang Liu, Chengmei Yang, Ying Wen, Xianfeng Tang, Hui Liu, Yuyin Zhou, Lianghua He
Main category: cs.CV
TL;DR: Vision LLMs waste computation on redundant visual tokens in deep layers. The paper discovers “information horizon” where visual tokens become uniformly uninformative, enabling efficient random pruning beyond this point.
Details
Motivation: Vision LLMs use hundreds of visual tokens, causing high computational costs. Existing token pruning methods fail in deep layers, performing no better than random pruning beyond certain layers, indicating a fundamental limitation.Method: Proposed measuring token information content by output probability change upon removal. Analyzed information distribution across layers, identified “information horizon” where tokens become redundant. Used random pruning in deep layers and enhanced existing methods with this approach.
Result: Discovered three key findings: 1) Visual token information becomes uniform and vanishes at intermediate “information horizon” layer; 2) Horizon depth varies by task (deeper for OCR vs VQA); 3) Horizon correlates with model capacity. Random pruning in deep layers works effectively, and DivPrune with random pruning achieves SOTA - 96.9% performance with 50% token pruning.
Conclusion: Visual tokens become redundant beyond the “information horizon” in deep layers, making random pruning surprisingly effective. This insight enables efficient VLLM acceleration while maintaining performance, with task-aware and model-aware horizon positioning.
Abstract: Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper, however, identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by “vanishing token information”, where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token’s information content by measuring the change in the model output probabilities upon its removal. Using this proposed metric, our analysis of the information of visual tokens across layers reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term as “information horizon”, beyond which the visual tokens become redundant; (2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), compared to more general tasks like Visual Question Answering (VQA); (3) This horizon is also strongly correlated with model capacity, as stronger VLLMs (e.g., Qwen2.5-VL) employ deeper visual tokens than weaker models (e.g., LLaVA-1.5). Based on our findings, we show that simple random pruning in deep layers efficiently balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods. Using DivPrune with random pruning achieves state-of-the-art results, maintaining 96.9% of Qwen-2.5-VL-7B performance while pruning 50% of visual tokens. The code will be publicly available at https://github.com/YahongWang1/Information-Horizon.
[322] LongCat-Image Technical Report
Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, Xunliang Cai, Yayong Guan, Jie Hu
Main category: cs.CV
TL;DR: LongCat-Image is a new bilingual Chinese-English image generation model that excels at multilingual text rendering, photorealism, and efficiency while offering comprehensive open-source ecosystem.
Details
Motivation: Address core challenges in current models: multilingual text rendering (especially Chinese characters), photorealism, deployment efficiency, and developer accessibility limitations.Method: Rigorous data curation across pre-training, mid-training, and SFT stages; coordinated use of curated reward models during RL phase; compact 6B parameter diffusion model design.
Result: Achieves SOTA in text-rendering capabilities and photorealism; sets new industry standard for Chinese character rendering; achieves remarkable efficiency with minimal VRAM usage; excels in image editing with superior consistency.
Conclusion: LongCat-Image establishes a comprehensive open-source ecosystem that empowers developers and researchers, pushing frontiers of visual content creation through superior bilingual capabilities and accessibility.
Abstract: We introduce LongCat-Image, a pioneering open-source and bilingual (Chinese-English) foundation model for image generation, designed to address core challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility prevalent in current leading models. 1) We achieve this through rigorous data curation strategies across the pre-training, mid-training, and SFT stages, complemented by the coordinated use of curated reward models during the RL phase. This strategy establishes the model as a new state-of-the-art (SOTA), delivering superior text-rendering capabilities and remarkable photorealism, and significantly enhancing aesthetic quality. 2) Notably, it sets a new industry standard for Chinese character rendering. By supporting even complex and rare characters, it outperforms both major open-source and commercial solutions in coverage, while also achieving superior accuracy. 3) The model achieves remarkable efficiency through its compact design. With a core diffusion model of only 6B parameters, it is significantly smaller than the nearly 20B or larger Mixture-of-Experts (MoE) architectures common in the field. This ensures minimal VRAM usage and rapid inference, significantly reducing deployment costs. Beyond generation, LongCat-Image also excels in image editing, achieving SOTA results on standard benchmarks with superior editing consistency compared to other open-source works. 4) To fully empower the community, we have established the most comprehensive open-source ecosystem to date. We are releasing not only multiple model versions for text-to-image and image editing, including checkpoints after mid-training and post-training stages, but also the entire toolchain of training procedure. We believe that the openness of LongCat-Image will provide robust support for developers and researchers, pushing the frontiers of visual content creation.
[323] Robust Variational Model Based Tailored UNet: Leveraging Edge Detector and Mean Curvature for Improved Image Segmentation
Kaili Qi, Zhongyi Huang, Wenli Yang
Main category: cs.CV
TL;DR: Robust VM_TUNet integrates variational PDEs with deep learning for noisy image segmentation, combining physical priors with neural networks for better boundary handling and computational efficiency.
Details
Motivation: To address challenges in segmenting noisy images with blurred/fragmented boundaries by combining the interpretability and boundary-smoothing advantages of variational PDEs with the strong representational power of deep neural networks.Method: Hybrid framework integrating variational methods with deep learning (VM_TUNet). Incorporates physical priors, edge detector, and mean curvature term into modified Cahn-Hilliard equation. Two collaborative modules: F module for frequency domain preprocessing to alleviate poor local minima, and T module for accurate stable local computations with stability estimate.
Result: Achieves balanced trade-off between performance and computational efficiency. Competitive quantitative results and improved visual quality compared to pure CNN-based models. Performance close to transformer-based methods with reasonable computational expense, validated on three benchmark datasets.
Conclusion: The robust VM_TUNet framework successfully combines variational PDEs with deep learning for noisy image segmentation, offering interpretability, boundary smoothing, and computational efficiency while maintaining competitive performance.
Abstract: To address the challenge of segmenting noisy images with blurred or fragmented boundaries, this paper presents a robust version of Variational Model Based Tailored UNet (VM_TUNet), a hybrid framework that integrates variational methods with deep learning. The proposed approach incorporates physical priors, an edge detector and a mean curvature term, into a modified Cahn-Hilliard equation, aiming to combine the interpretability and boundary-smoothing advantages of variational partial differential equations (PDEs) with the strong representational ability of deep neural networks. The architecture consists of two collaborative modules: an F module, which conducts efficient frequency domain preprocessing to alleviate poor local minima, and a T module, which ensures accurate and stable local computations, backed by a stability estimate. Extensive experiments on three benchmark datasets indicate that the proposed method achieves a balanced trade-off between performance and computational efficiency, which yields competitive quantitative results and improved visual quality compared to pure convolutional neural network (CNN) based models, while achieving performance close to that of transformer-based method with reasonable computational expense.
[324] More than Segmentation: Benchmarking SAM 3 for Segmentation, 3D Perception, and Reconstruction in Robotic Surgery
Wenzhen Dong, Jieming Yu, Yiming Huang, Hongqiu Wang, Lei Zhu, Albert C. S. Chung, Hongliang Ren, Long Bai
Main category: cs.CV
TL;DR: SAM 3 shows improved zero-shot segmentation with point/box prompts and introduces language prompts, but language segmentation underperforms in surgery. It demonstrates strong 3D reconstruction from 2D surgical images but has limitations in complex dynamic scenes.
Details
Motivation: To evaluate SAM 3's capabilities in robot-assisted surgery, assessing its zero-shot segmentation with various prompts (point, bounding box, language) and exploring its 3D reconstruction abilities for surgical applications.Method: Empirical evaluation benchmarking SAM 3’s zero-shot segmentation with point and bounding box prompts, testing language prompt segmentation, and investigating 3D reconstruction from 2D surgical images. Comprehensive testing on MICCAI EndoVis 2017/2018 benchmarks and zero-shot evaluations on SCARED, StereoMIS, and EndoNeRF datasets.
Result: SAM 3 shows clear improvements over SAM and SAM 2 in image and video segmentation with spatial prompts. Language prompts show potential but underperform in surgical domain. Demonstrates strong monocular depth estimation and realistic 3D instrument reconstruction, but reveals limitations in complex, highly dynamic surgical scenes.
Conclusion: SAM 3 advances surgical segmentation with improved spatial prompt performance and promising 3D reconstruction capabilities, but language-based segmentation requires domain-specific training, and challenges remain for complex dynamic surgical environments.
Abstract: The recent Segment Anything Model (SAM) 3 has introduced significant advancements over its predecessor, SAM 2, particularly with the integration of language-based segmentation and enhanced 3D perception capabilities. SAM 3 supports zero-shot segmentation across a wide range of prompts, including point, bounding box, and language-based prompts, allowing for more flexible and intuitive interactions with the model. In this empirical evaluation, we assess the performance of SAM 3 in robot-assisted surgery, benchmarking its zero-shot segmentation with point and bounding box prompts and exploring its effectiveness in dynamic video tracking, alongside its newly introduced language prompt segmentation. While language prompts show potential, their performance in the surgical domain is currently suboptimal, highlighting the need for further domain-specific training. Additionally, we investigate SAM 3’s 3D reconstruction abilities, demonstrating its capacity to process surgical scene data and reconstruct 3D anatomical structures from 2D images. Through comprehensive testing on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 3 shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts, while zero-shot evaluations on SCARED, StereoMIS, and EndoNeRF indicate strong monocular depth estimation and realistic 3D instrument reconstruction, yet also reveal remaining limitations in complex, highly dynamic surgical scenes.
[325] Online Segment Any 3D Thing as Instance Tracking
Hanshi Wang, Zijian Cai, Jin Gao, Yiwei Zhang, Weiming Hu, Ke Wang, Zhipeng Zhang
Main category: cs.CV
TL;DR: AutoSeg3D reformulates online 3D segmentation as instance tracking using object queries for temporal propagation, achieving SOTA results on multiple datasets.
Details
Motivation: Current query-based 3D segmentation methods lack temporal understanding, which is crucial for embodied agents operating in dynamic environments. Viewpoint variations in robotics lead to partial object visibility across frames, requiring holistic object understanding beyond instantaneous views.Method: Reconceptualizes 3D segmentation as instance tracking using object queries for temporal propagation: long-term instance association maintains feature/identity coherence, short-term instance update enriches observations. Introduces spatial consistency learning to mitigate VFM fragmentation problems. Uses sparse object queries to avoid dense temporal point cloud interactions.
Result: Achieves new state-of-the-art, surpassing ESAM by 2.8 AP on ScanNet200 and delivering consistent gains on ScanNet, SceneNN, and 3RScan datasets.
Conclusion: AutoSeg3D effectively addresses temporal understanding in 3D segmentation through instance tracking with object queries, enabling embodied agents to develop holistic object understanding despite partial visibility across frames while maintaining computational efficiency.
Abstract: Online, real-time, and fine-grained 3D segmentation constitutes a fundamental capability for embodied intelligent agents to perceive and comprehend their operational environments. Recent advancements employ predefined object queries to aggregate semantic information from Vision Foundation Models (VFMs) outputs that are lifted into 3D point clouds, facilitating spatial information propagation through inter-query interactions. Nevertheless, perception is an inherently dynamic process, rendering temporal understanding a critical yet overlooked dimension within these prevailing query-based pipelines. Therefore, to further unlock the temporal environmental perception capabilities of embodied agents, our work reconceptualizes online 3D segmentation as an instance tracking problem (AutoSeg3D). Our core strategy involves utilizing object queries for temporal information propagation, where long-term instance association promotes the coherence of features and object identities, while short-term instance update enriches instant observations. Given that viewpoint variations in embodied robotics often lead to partial object visibility across frames, this mechanism aids the model in developing a holistic object understanding beyond incomplete instantaneous views. Furthermore, we introduce spatial consistency learning to mitigate the fragmentation problem inherent in VFMs, yielding more comprehensive instance information for enhancing the efficacy of both long-term and short-term temporal learning. The temporal information exchange and consistency learning facilitated by these sparse object queries not only enhance spatial comprehension but also circumvent the computational burden associated with dense temporal point cloud interactions. Our method establishes a new state-of-the-art, surpassing ESAM by 2.8 AP on ScanNet200 and delivering consistent gains on ScanNet, SceneNN, and 3RScan datasets.
[326] Decomposition Sampling for Efficient Region Annotations in Active Learning
Jingna Qiu, Frauke Wilm, Mathias Öttl, Jonas Utz, Maja Schlereth, Moritz Schillinger, Marc Aubreville, Katharina Breininger
Main category: cs.CV
TL;DR: DECOMP is an active learning method for dense prediction tasks that decomposes images into class-specific components using pseudo-labels and samples regions from each class, improving annotation efficiency for minority classes.
Details
Motivation: Existing active learning methods for dense prediction tasks in medical imaging have limitations: high computational/memory costs, irrelevant region selection, and heavy reliance on uncertainty sampling. Region-level annotation is more efficient than image-level annotation but current approaches don't effectively handle class imbalance.Method: DECOMP decomposes images into class-specific components using pseudo-labels, then samples regions from each class. It uses class-wise predictive confidence to guide sampling, ensuring difficult/minority classes receive more annotations. This enhances annotation diversity and addresses class imbalance.
Result: DECOMP consistently outperforms baseline methods across ROI classification, 2-D segmentation, and 3-D segmentation tasks. It better samples minority-class regions and boosts performance on challenging classes.
Conclusion: DECOMP provides an effective active learning strategy for dense prediction tasks that addresses limitations of existing methods, particularly for handling class imbalance and improving annotation efficiency in medical imaging applications.
Abstract: Active learning improves annotation efficiency by selecting the most informative samples for annotation and model training. While most prior work has focused on selecting informative images for classification tasks, we investigate the more challenging setting of dense prediction, where annotations are more costly and time-intensive, especially in medical imaging. Region-level annotation has been shown to be more efficient than image-level annotation for these tasks. However, existing methods for representative annotation region selection suffer from high computational and memory costs, irrelevant region choices, and heavy reliance on uncertainty sampling. We propose decomposition sampling (DECOMP), a new active learning sampling strategy that addresses these limitations. It enhances annotation diversity by decomposing images into class-specific components using pseudo-labels and sampling regions from each class. Class-wise predictive confidence further guides the sampling process, ensuring that difficult classes receive additional annotations. Across ROI classification, 2-D segmentation, and 3-D segmentation, DECOMP consistently surpasses baseline methods by better sampling minority-class regions and boosting performance on these challenging classes. Code is in https://github.com/JingnaQiu/DECOMP.git.
[327] MoCA: Mixture-of-Components Attention for Scalable Compositional 3D Generation
Zhiqi Li, Wenhuan Li, Tengfei Wang, Zhenwei Wang, Junta Wu, Haoyuan Wang, Yunhan Yang, Zehuan Huang, Yang Li, Peidong Liu, Chunchao Guo
Main category: cs.CV
TL;DR: MoCA introduces an efficient compositional 3D generative model that uses importance-based component routing and compression to enable scalable fine-grained 3D asset creation with many components.
Details
Motivation: Existing part-aware 3D generation methods suffer from poor scalability due to quadratic global attention costs when increasing the number of components, limiting their practical application for complex compositional 3D assets.Method: Two key designs: (1) importance-based component routing that selects top-k relevant components for sparse global attention, and (2) unimportant components compression that preserves contextual priors of unselected components while reducing computational complexity.
Result: Extensive experiments show MoCA outperforms baselines on both compositional object and scene generation tasks, enabling efficient, fine-grained compositional 3D asset creation with scalable number of components.
Conclusion: MoCA presents an effective solution to the scalability problem in compositional 3D generation, making it practical for creating complex 3D assets with many components through efficient attention mechanisms.
Abstract: Compositionality is critical for 3D object and scene generation, but existing part-aware 3D generation methods suffer from poor scalability due to quadratic global attention costs when increasing the number of components. In this work, we present MoCA, a compositional 3D generative model with two key designs: (1) importance-based component routing that selects top-k relevant components for sparse global attention, and (2) unimportant components compression that preserve contextual priors of unselected components while reducing computational complexity of global attention. With these designs, MoCA enables efficient, fine-grained compositional 3D asset creation with scalable number of components. Extensive experiments show MoCA outperforms baselines on both compositional object and scene generation tasks. Project page: https://lizhiqi49.github.io/MoCA
[328] Liver Fibrosis Quantification and Analysis: The LiQA Dataset and Baseline Method
Yuanye Liu, Hanxiao Zhang, Nannan Shi, Yuxin Shi, Arif Mahmood, Murtaza Taj, Xiahai Zhuang
Main category: cs.CV
TL;DR: LiQA dataset for liver fibrosis staging benchmark with 440 multi-phase MRI scans, featuring segmentation and staging tasks under real-world challenges like domain shifts and missing data.
Details
Motivation: Liver fibrosis is a major global health issue requiring accurate staging for clinical management. There's a need for robust algorithms that can handle complex real-world conditions in medical imaging.Method: Created LiQA dataset with 440 patients’ multi-phase, multi-center MRI scans. Top-performing approach uses semi-supervised learning with external data for segmentation, and multi-view consensus with CAM-based regularization for staging.
Result: The baseline evaluation shows that leveraging multi-source data and anatomical constraints significantly enhances model robustness in clinical settings for liver fibrosis staging.
Conclusion: The LiQA dataset provides a valuable benchmark for developing robust liver fibrosis staging algorithms that can handle real-world clinical challenges like domain shifts and missing data.
Abstract: Liver fibrosis represents a significant global health burden, necessitating accurate staging for effective clinical management. This report introduces the LiQA (Liver Fibrosis Quantification and Analysis) dataset, established as part of the CARE 2024 challenge. Comprising $440$ patients with multi-phase, multi-center MRI scans, the dataset is curated to benchmark algorithms for Liver Segmentation (LiSeg) and Liver Fibrosis Staging (LiFS) under complex real-world conditions, including domain shifts, missing modalities, and spatial misalignment. We further describe the challenge’s top-performing methodology, which integrates a semi-supervised learning framework with external data for robust segmentation, and utilizes a multi-view consensus approach with Class Activation Map (CAM)-based regularization for staging. Evaluation of this baseline demonstrates that leveraging multi-source data and anatomical constraints significantly enhances model robustness in clinical settings.
[329] Optimization-Guided Diffusion for Interactive Scene Generation
Shiaho Li, Naisheng Ye, Tianyu Li, Kashyap Chitta, Tuo An, Peng Su, Boyang Wang, Haiou Liu, Chen Lv, Hongyang Li
Main category: cs.CV
TL;DR: OMEGA is an optimization-guided framework that improves diffusion-based scene generation by enforcing physical and social constraints, enabling realistic safety-critical scenario creation for autonomous vehicle testing.
Details
Motivation: Safety-critical events are rare in real driving datasets but essential for evaluating autonomous vehicles. Existing scene generation models lack controllability and often produce unrealistic samples that violate physical or social constraints.Method: OMEGA uses constrained optimization to re-anchor each reverse diffusion step, steering generation toward physically plausible and behaviorally coherent trajectories. It formulates ego-attacker interactions as game-theoretic optimization in distribution space to approximate Nash equilibria for adversarial scenarios.
Result: OMEGA improves generation realism, consistency, and controllability: increases valid scenes from 32.35% to 72.27% for free exploration, and from 11% to 80% for controllability-focused generation. Generates 5× more near-collision frames with time-to-collision under 3 seconds while maintaining scene realism.
Conclusion: OMEGA provides an effective training-free framework for generating realistic, safety-critical driving scenarios through optimization-guided diffusion sampling, addressing key limitations in existing scene generation methods for autonomous vehicle evaluation.
Abstract: Realistic and diverse multi-agent driving scenes are crucial for evaluating autonomous vehicles, but safety-critical events which are essential for this task are rare and underrepresented in driving datasets. Data-driven scene generation offers a low-cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling from a scene generation model. OMEGA re-anchors each reverse diffusion step via constrained optimization, steering the generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego-attacker interactions as a game-theoretic optimization in the distribution space, approximating Nash equilibria to generate realistic, safety-critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% for free exploration capabilities, and from 11% to 80% for controllability-focused generation. Our approach can also generate $5\times$ more near-collision frames with a time-to-collision under three seconds while maintaining the overall scene realism.
[330] EgoCampus: Egocentric Pedestrian Eye Gaze Model and Dataset
Ronan John, Aditya Kesari, Vincenzo DiMatteo, Kristin Dana
Main category: cs.CV
TL;DR: EgoCampus dataset and EgoCampusNet model for predicting pedestrian eye gaze during outdoor campus navigation using egocentric vision and eye tracking data.
Details
Motivation: To address the lack of datasets and models for studying human visual attention during real-world outdoor navigation, particularly focusing on pedestrian eye gaze in campus environments where prior work has been limited to indoor tasks or lacked eye tracking information.Method: Collected the EgoCampus dataset using Meta’s Project Aria glasses with eye tracking, RGB cameras, inertial sensors, and GPS from 80+ pedestrians walking 6 km across 25 outdoor campus paths. Developed EgoCampusNet, a novel method to predict eye gaze of navigating pedestrians in outdoor environments.
Result: Created a diverse gaze-annotated video dataset spanning outdoor campus navigation and developed a gaze prediction model specifically for pedestrian navigation scenarios.
Conclusion: The work provides both a valuable resource for studying real-world visual attention and a foundation for future gaze prediction models in navigation contexts, with dataset and code to be made publicly available.
Abstract: We address the challenge of predicting human visual attention during real-world navigation by measuring and modeling egocentric pedestrian eye gaze in an outdoor campus setting. We introduce the EgoCampus dataset, which spans 25 unique outdoor paths over 6 km across a university campus with recordings from more than 80 distinct human pedestrians, resulting in a diverse set of gaze-annotated videos. The system used for collection, Meta’s Project Aria glasses, integrates eye tracking, front-facing RGB cameras, inertial sensors, and GPS to provide rich data from the human perspective. Unlike many prior egocentric datasets that focus on indoor tasks or exclude eye gaze information, our work emphasizes visual attention while subjects walk in outdoor campus paths. Using this data, we develop EgoCampusNet, a novel method to predict eye gaze of navigating pedestrians as they move through outdoor environments. Our contributions provide both a new resource for studying real-world attention and a resource for future work in gaze prediction models for navigation. Dataset and code are available upon request, and will be made publicly available at a later date at https://github.com/ComputerVisionRutgers/EgoCampus .
[331] sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only
Arslan Artykov, Corentin Sautier, Vincent Lepetit
Main category: cs.CV
TL;DR: First data-driven method to jointly predict part segmentation and joint parameters from monocular video with freely moving camera, trained on synthetic data and generalizes to real-world objects.
Details
Motivation: Understanding articulated objects is crucial for robotics and digital twin creation, but previous work focused on multi-view systems, object scanning, or static cameras, lacking scalable solutions for monocular video.Method: Data-driven approach that jointly predicts part segmentation and joint parameters from monocular video captured with freely moving camera, trained solely on synthetic data.
Result: Demonstrates strong generalization to real-world objects despite being trained only on synthetic data, operates directly on casually recorded video, suitable for real-time applications.
Conclusion: Presents a scalable and practical solution for articulated object understanding that works with monocular video, enabling applications in dynamic environments.
Abstract: Understanding articulated objects is a fundamental challenge in robotics and digital twin creation. To effectively model such objects, it is essential to recover both part segmentation and the underlying joint parameters. Despite the importance of this task, previous work has largely focused on setups like multi-view systems, object scanning, or static cameras. In this paper, we present the first data-driven approach that jointly predicts part segmentation and joint parameters from monocular video captured with a freely moving camera. Trained solely on synthetic data, our method demonstrates strong generalization to real-world objects, offering a scalable and practical solution for articulated object understanding. Our approach operates directly on casually recorded video, making it suitable for real-time applications in dynamic environments. Project webpage: https://aartykov.github.io/sim2art/
[332] PVeRA: Probabilistic Vector-Based Random Matrix Adaptation
Leo Fillioux, Enzo Ferrante, Paul-Henry Cournède, Maria Vakalopoulou, Stergios Christodoulidis
Main category: cs.CV
TL;DR: PVeRA is a probabilistic version of the VeRA adapter that modifies low-rank matrices probabilistically to handle input ambiguities and enable different sampling configurations during training/testing, outperforming other adapters on VTAB-1k benchmark.
Details
Motivation: Large foundation models require vast datasets and computational resources for training/finetuning, which are scarce and costly. Parameter-efficient adaptation methods like adapters provide computationally efficient solutions by finetuning only small trainable modules appended to frozen backbones.Method: PVeRA modifies the VeRA adapter’s low-rank matrices in a probabilistic manner. VeRA uses a pair of frozen random low-rank matrices shared across all layers. PVeRA introduces probabilistic modifications to these matrices, allowing handling of input ambiguities and enabling different sampling configurations during training and testing phases.
Result: Comprehensive evaluation on VTAB-1k benchmark with seven adapters shows PVeRA outperforms VeRA and other adapters. The method demonstrates superior performance in parameter-efficient adaptation tasks.
Conclusion: PVeRA provides an effective probabilistic approach to parameter-efficient adaptation that handles input ambiguities and allows flexible sampling configurations, achieving state-of-the-art performance on benchmark tasks while maintaining computational efficiency.
Abstract: Large foundation models have emerged in the last years and are pushing performance boundaries for a variety of tasks. Training or even finetuning such models demands vast datasets and computational resources, which are often scarce and costly. Adaptation methods provide a computationally efficient solution to address these limitations by allowing such models to be finetuned on small amounts of data and computing power. This is achieved by appending new trainable modules to frozen backbones with only a fraction of the trainable parameters and fitting only these modules on novel tasks. Recently, the VeRA adapter was shown to excel in parameter-efficient adaptations by utilizing a pair of frozen random low-rank matrices shared across all layers. In this paper, we propose PVeRA, a probabilistic version of the VeRA adapter, which modifies the low-rank matrices of VeRA in a probabilistic manner. This modification naturally allows handling inherent ambiguities in the input and allows for different sampling configurations during training and testing. A comprehensive evaluation was performed on the VTAB-1k benchmark and seven adapters, with PVeRA outperforming VeRA and other adapters. Our code for training models with PVeRA and benchmarking all adapters is available https://github.com/leofillioux/pvera.
[333] UnCageNet: Tracking and Pose Estimation of Caged Animal
Sayak Dutta, Harish Katti, Shashikant Verma, Shanmuganathan Raman
Main category: cs.CV
TL;DR: A three-stage preprocessing pipeline improves animal tracking/pose estimation by removing cage occlusions through segmentation, inpainting, and evaluation.
Details
Motivation: Existing animal tracking and pose estimation systems (STEP, ViTPose) suffer performance degradation when processing images/videos with cage structures and systematic occlusions.Method: Three-stage pipeline: (1) cage segmentation using Gabor-enhanced ResNet-UNet with 72 directional kernels, (2) cage inpainting using CRFill for content-aware reconstruction, (3) evaluation of pose estimation/tracking on uncaged frames.
Result: Removing cage occlusions enables pose estimation and tracking performance comparable to environments without occlusions, with significant improvements in keypoint detection accuracy and trajectory consistency.
Conclusion: The proposed preprocessing pipeline effectively addresses cage occlusion problems in animal tracking/pose estimation systems, restoring performance to levels comparable to occlusion-free environments.
Abstract: Animal tracking and pose estimation systems, such as STEP (Simultaneous Tracking and Pose Estimation) and ViTPose, experience substantial performance drops when processing images and videos with cage structures and systematic occlusions. We present a three-stage preprocessing pipeline that addresses this limitation through: (1) cage segmentation using a Gabor-enhanced ResNet-UNet architecture with tunable orientation filters, (2) cage inpainting using CRFill for content-aware reconstruction of occluded regions, and (3) evaluation of pose estimation and tracking on the uncaged frames. Our Gabor-enhanced segmentation model leverages orientation-aware features with 72 directional kernels to accurately identify and segment cage structures that severely impair the performance of existing methods. Experimental validation demonstrates that removing cage occlusions through our pipeline enables pose estimation and tracking performance comparable to that in environments without occlusions. We also observe significant improvements in keypoint detection accuracy and trajectory consistency.
[334] ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation
Fan Yang, Heyuan Li, Peihao Li, Weihao Yuan, Lingteng Qiu, Chaoyue Song, Cheng Chen, Yisheng He, Shifeng Zhang, Xiaoguang Han, Steven Hoi, Guosheng Lin
Main category: cs.CV
TL;DR: A novel method that combines 3D reconstruction models with video diffusion models to generate high-fidelity upper-body 3D avatars from single images, achieving photorealistic textures and fluid motion while maintaining structural stability.
Details
Motivation: Current 3D avatar generation methods produce stable structures but suffer from blurry textures and stiff motion, while video models create photorealistic results but have structural instability and identity drift issues. There's a need to combine the strengths of both approaches for high-quality avatar generation.Method: The framework uses a 3D reconstruction model to provide structural and appearance priors, which then guides a real-time autoregressive video diffusion model for rendering. This hybrid approach leverages geometric stability from reconstruction models and generative capabilities from video models.
Result: The method significantly reduces artifacts, improves visual quality over leading methods, synthesizes high-frequency photorealistic details and fluid dynamics in real time, and prevents structural inconsistencies common in video generation approaches.
Conclusion: By uniting geometric stability with generative capabilities, the approach produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion, providing a robust solution for real-time applications like gaming and VR.
Abstract: Generating high-fidelity upper-body 3D avatars from one-shot input image remains a significant challenge. Current 3D avatar generation methods, which rely on large reconstruction models, are fast and capable of producing stable body structures, but they often suffer from artifacts such as blurry textures and stiff, unnatural motion. In contrast, generative video models show promising performance by synthesizing photorealistic and dynamic results, but they frequently struggle with unstable behavior, including body structural errors and identity drift. To address these limitations, we propose a novel approach that combines the strengths of both paradigms. Our framework employs a 3D reconstruction model to provide robust structural and appearance priors, which in turn guides a real-time autoregressive video diffusion model for rendering. This process enables the model to synthesize high-frequency, photorealistic details and fluid dynamics in real time, effectively reducing texture blur and motion stiffness while preventing the structural inconsistencies common in video generation methods. By uniting the geometric stability of 3D reconstruction with the generative capabilities of video models, our method produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion. Experiments demonstrate that our approach significantly reduces artifacts and achieves substantial improvements in visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality. Project page: https://lhyfst.github.io/visa
[335] SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery
Meng Cao, Xingyu Li, Xue Liu, Ian Reid, Xiaodan Liang
Main category: cs.CV
TL;DR: SpatialDreamer is a reinforcement learning framework that enables MLLMs to perform complex spatial reasoning through active exploration and mental simulation, addressing limitations in current passive observation methods.
Details
Motivation: Current Multi-modal Large Language Models (MLLMs) have limited performance on complex spatial reasoning tasks requiring mental simulation. They rely on passive observation rather than active mental imagery processes, creating a gap in human-like spatial reasoning capabilities.Method: SpatialDreamer uses a reinforcement learning framework with a closed-loop process of active exploration, visual imagination via a world model, and evidence-grounded reasoning. It introduces Geometric Policy Optimization (GeoPO) with tree-structured sampling and step-level reward estimation with geometric consistency constraints to address lack of fine-grained reward supervision.
Result: Extensive experiments show SpatialDreamer delivers highly competitive results across multiple challenging benchmarks, demonstrating significant advancement in active spatial mental simulation for MLLMs.
Conclusion: SpatialDreamer represents a critical advancement in enabling human-like active spatial mental simulation for MLLMs, bridging the gap between passive observation and active mental imagery processes for complex spatial reasoning tasks.
Abstract: Despite advancements in Multi-modal Large Language Models (MLLMs) for scene understanding, their performance on complex spatial reasoning tasks requiring mental simulation remains significantly limited. Current methods often rely on passive observation of spatial data, failing to internalize an active mental imagery process. To bridge this gap, we propose SpatialDreamer, a reinforcement learning framework that enables spatial reasoning through a closedloop process of active exploration, visual imagination via a world model, and evidence-grounded reasoning. To address the lack of fine-grained reward supervision in longhorizontal reasoning tasks, we propose Geometric Policy Optimization (GeoPO), which introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints. Extensive experiments demonstrate that SpatialDreamer delivers highly competitive results across multiple challenging benchmarks, signifying a critical advancement in human-like active spatial mental simulation for MLLMs.
[336] HLTCOE Evaluation Team at TREC 2025: VQA Track
Dengjia Zhang, Charles Weng, Katherine Guerrerio, Yi Lu, Kenton Murray, Alexander Martin, Reno Kriz, Benjamin Van Durme
Main category: cs.CV
TL;DR: The paper presents a listwise learning framework for video question answering that reranks candidate answers using a novel Masked Pointer Cross-Entropy Loss with Rank Weights, improving semantic precision and ranking consistency.
Details
Motivation: To improve semantic precision and ranking consistency in video question answering answer generation, addressing challenges in temporal reasoning and semantic disambiguation where traditional methods may lack fine-grained ranking stability.Method: A two-stage approach: 1) Base multimodal model generates multiple candidate answers for video-question pairs, 2) Reranking using a model trained with novel Masked Pointer Cross-Entropy Loss with Rank Weights that integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction.
Result: Experiments show consistent gains in accuracy and ranking stability, particularly for questions requiring temporal reasoning and semantic disambiguation, demonstrating improved coherent and fine-grained answer lists.
Conclusion: The proposed listwise learning framework successfully bridges generative modeling with discriminative ranking to produce coherent, fine-grained answer lists, offering stable and interpretable optimization for video question answering tasks.
Abstract: The HLTCOE Evaluation team participated in TREC VQA’s Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video-question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.
[337] DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving
Jialv Zou, Shaoyu Chen, Bencheng Liao, Zhiyu Zheng, Yuehao Song, Lefei Zhang, Qian Zhang, Wenyu Liu, Xinggang Wang
Main category: cs.CV
TL;DR: DiffusionDriveV2 improves autonomous driving trajectory generation by combining diffusion models with reinforcement learning to overcome mode collapse and achieve better diversity-quality trade-off.
Details
Motivation: Existing generative diffusion models for autonomous driving suffer from mode collapse, producing conservative and homogeneous behaviors. While DiffusionDrive uses predefined anchors for diversity, it lacks sufficient constraints, creating a dilemma between diversity and consistent high quality.Method: DiffusionDriveV2 leverages reinforcement learning to constrain low-quality modes and explore superior trajectories while preserving multimodality. Uses scale-adaptive multiplicative noise for exploration, intra-anchor GRPO for advantage estimation within anchors, and inter-anchor truncated GRPO for global perspective across different driving intentions.
Result: Achieves 91.2 PDMS on NAVSIM v1 and 85.5 EPDMS on NAVSIM v2 datasets in closed-loop evaluation with ResNet-34 backbone, setting new records. Resolves diversity-quality dilemma and achieves best trade-off for truncated diffusion models.
Conclusion: The proposed reinforcement learning approach successfully enhances trajectory quality while preserving diversity in autonomous driving diffusion models, solving the fundamental trade-off problem through careful advantage estimation and exploration techniques.
Abstract: Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse, tending to generate conservative and homogeneous behaviors. While DiffusionDrive employs predefined anchors representing different driving intentions to partition the action space and generate diverse trajectories, its reliance on imitation learning lacks sufficient constraints, resulting in a dilemma between diversity and consistent high quality. In this work, we propose DiffusionDriveV2, which leverages reinforcement learning to both constrain low-quality modes and explore for superior trajectories. This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model. First, we use scale-adaptive multiplicative noise, ideal for trajectory planning, to promote broad exploration. Second, we employ intra-anchor GRPO to manage advantage estimation among samples generated from a single anchor, and inter-anchor truncated GRPO to incorporate a global perspective across different anchors, preventing improper advantage comparisons between distinct intentions (e.g., turning vs. going straight), which can lead to further mode collapse. DiffusionDriveV2 achieves 91.2 PDMS on the NAVSIM v1 dataset and 85.5 EPDMS on the NAVSIM v2 dataset in closed-loop evaluation with an aligned ResNet-34 backbone, setting a new record. Further experiments validate that our approach resolves the dilemma between diversity and consistent high quality for truncated diffusion models, achieving the best trade-off. Code and model will be available at https://github.com/hustvl/DiffusionDriveV2
[338] Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation
Shihao Zhao, Yitong Chen, Zeyinzi Jiang, Bojia Zi, Shaozhe Hao, Yu Liu, Chaojie Mao, Kwan-Yee K. Wong
Main category: cs.CV
TL;DR: Unison is a low-cost multimodal AI that automatically identifies tasks and extracts parameters for unified understanding and generation across text, image, and video tasks.
Details
Motivation: Current approaches for unified multimodal understanding and generation either require massive resources (auto-regressive transformers) or suffer from limited task coverage/poor quality (two-stage methods). Both lack automatic task parsing and require manual parameter configuration.Method: Two-stage scheme preserving pre-trained model capabilities. Uses only 500k training samples and 50 GPU hours. Equips model with automatic intention parsing, task type determination, and meta-information extraction for full automation.
Result: Achieves superior performance across various understanding and generation tasks with extremely low training cost. Accurately identifies tasks and extracts relevant parameters automatically.
Conclusion: Unison demonstrates that effective unified multimodal AI can be achieved with minimal resources through intelligent task parsing and parameter extraction, enabling full automation without human intervention.
Abstract: Unified understanding and generation is a highly appealing research direction in multimodal learning. There exist two approaches: one trains a transformer via an auto-regressive paradigm, and the other adopts a two-stage scheme connecting pre-trained understanding and generative models for alignment fine-tuning. The former demands massive data and computing resources unaffordable for ordinary researchers. Though the latter requires a lower training cost, existing works often suffer from limited task coverage or poor generation quality. Both approaches lack the ability to parse input meta-information (such as task type, image resolution, video duration, etc.) and require manual parameter configuration that is tedious and non-intelligent. In this paper, we propose Unison which adopts the two-stage scheme while preserving the capabilities of the pre-trained models well. With an extremely low training cost, we cover a variety of multimodal understanding tasks, including text, image, and video understanding, as well as diverse generation tasks, such as text-to-visual content generation, editing, controllable generation, and IP-based reference generation. We also equip our model with the ability to automatically parse user intentions, determine the target task type, and accurately extract the meta-information required for the corresponding task. This enables full automation of various multimodal tasks without human intervention. Experiments demonstrate that, under a low-cost setting of only 500k training samples and 50 GPU hours, our model can accurately and automatically identify tasks and extract relevant parameters, and achieve superior performance across a variety of understanding and generation tasks.
[339] UltrasODM: A Dual Stream Optical Flow Mamba Network for 3D Freehand Ultrasound Reconstruction
Mayank Anand, Ujair Alam, Surya Prakash, Priya Shukla, Gora Chand Nandi, Domenec Puig
Main category: cs.CV
TL;DR: UltrasODM is a dual-stream framework that assists sonographers during ultrasound acquisition by providing calibrated uncertainty, saliency diagnostics, and actionable prompts to reduce reconstruction errors.
Details
Motivation: Clinical ultrasound acquisition is highly operator-dependent, with rapid probe motion and brightness fluctuations causing reconstruction errors that reduce trust and clinical utility.Method: Dual-stream framework with: (1) contrastive ranking module grouping frames by motion similarity, (2) optical-flow stream fused with Dual-Mamba temporal modules for robust 6-DoF pose estimation, and (3) Human-in-the-Loop layer combining Bayesian uncertainty, clinician-calibrated thresholds, and saliency maps.
Result: Reduces drift by 15.2%, distance error by 12.1%, and Hausdorff distance by 10.1% relative to UltrasOM, while producing per-frame uncertainty and saliency outputs.
Conclusion: UltrasODM improves reconstruction reliability and supports safer, more trustworthy clinical workflows by emphasizing transparency and clinician feedback.
Abstract: Clinical ultrasound acquisition is highly operator-dependent, where rapid probe motion and brightness fluctuations often lead to reconstruction errors that reduce trust and clinical utility. We present UltrasODM, a dual-stream framework that assists sonographers during acquisition through calibrated per-frame uncertainty, saliency-based diagnostics, and actionable prompts. UltrasODM integrates (i) a contrastive ranking module that groups frames by motion similarity, (ii) an optical-flow stream fused with Dual-Mamba temporal modules for robust 6-DoF pose estimation, and (iii) a Human-in-the-Loop (HITL) layer combining Bayesian uncertainty, clinician-calibrated thresholds, and saliency maps highlighting regions of low confidence. When uncertainty exceeds the threshold, the system issues unobtrusive alerts suggesting corrective actions such as re-scanning highlighted regions or slowing the sweep. Evaluated on a clinical freehand ultrasound dataset, UltrasODM reduces drift by 15.2%, distance error by 12.1%, and Hausdorff distance by 10.1% relative to UltrasOM, while producing per-frame uncertainty and saliency outputs. By emphasizing transparency and clinician feedback, UltrasODM improves reconstruction reliability and supports safer, more trustworthy clinical workflows. Our code is publicly available at https://github.com/AnandMayank/UltrasODM.
[340] Deep transfer learning for image classification: a survey
Jo Plested, Musa Phiri, Tom Gedeon
Main category: cs.CV
TL;DR: A comprehensive survey on deep transfer learning for image classification, proposing a new taxonomy to analyze effectiveness patterns and identify knowledge gaps.
Details
Motivation: While deep neural networks excel with abundant labeled data, real-world scenarios often lack sufficient training data. Transfer learning can help, but there's no comprehensive survey specifically for image classification. Existing surveys are either too general or too specialized, creating a need to collate current knowledge and analyze overarching patterns.Method: The paper conducts a systematic survey of deep transfer learning for image classification, formally defines the problem, reviews current state-of-the-art, and proposes a new taxonomy for categorizing transfer learning applications. This taxonomy helps analyze patterns of effectiveness and failure cases.
Result: The survey identifies current progress and knowledge gaps in the field. The proposed taxonomy reveals that many cases where transfer learning fails or hinders performance are actually predictable when considering source/target datasets and techniques used, providing insights into when transfer learning is effective.
Conclusion: The paper provides a comprehensive foundation for understanding deep transfer learning in image classification, offering a structured taxonomy that explains success/failure patterns and suggests directions for future research to address identified knowledge gaps.
Abstract: Deep neural networks such as convolutional neural networks (CNNs) and transformers have achieved many successes in image classification in recent years. It has been consistently demonstrated that best practice for image classification is when large deep models can be trained on abundant labelled data. However there are many real world scenarios where the requirement for large amounts of training data to get the best performance cannot be met. In these scenarios transfer learning can help improve performance. To date there have been no surveys that comprehensively review deep transfer learning as it relates to image classification overall. However, several recent general surveys of deep transfer learning and ones that relate to particular specialised target image classification tasks have been published. We believe it is important for the future progress in the field that all current knowledge is collated and the overarching patterns analysed and discussed. In this survey we formally define deep transfer learning and the problem it attempts to solve in relation to image classification. We survey the current state of the field and identify where recent progress has been made. We show where the gaps in current knowledge are and make suggestions for how to progress the field to fill in these knowledge gaps. We present a new taxonomy of the applications of transfer learning for image classification. This taxonomy makes it easier to see overarching patterns of where transfer learning has been effective and, where it has failed to fulfill its potential. This also allows us to suggest where the problems lie and how it could be used more effectively. We show that under this new taxonomy, many of the applications where transfer learning has been shown to be ineffective or even hinder performance are to be expected when taking into account the source and target datasets and the techniques used.
[341] Modality-Aware Bias Mitigation and Invariance Learning for Unsupervised Visible-Infrared Person Re-Identification
Menglin Wang, Xiaojin Gong, Jiachen Li, Genlin Ji
Main category: cs.CV
TL;DR: Proposes a novel unsupervised visible-infrared person re-identification method using modality-aware Jaccard distance for global association and split-and-contrast strategy for modality-invariant representation learning.
Details
Motivation: Existing unsupervised VI-ReID methods using optimal transport propagate local cluster errors and overlook global instance-level relations due to significant visible-infrared modality gap. Need to address cross-modality learning by mitigating modality bias and learning modality-invariant representations.Method: 1) Modality-aware Jaccard distance to mitigate distance bias from modality discrepancy for reliable cross-modality global clustering. 2) Split-and-contrast strategy to obtain modality-specific global prototypes and align them under global association guidance for modality-invariant representation learning.
Result: Achieves state-of-the-art performance on benchmark VI-ReID datasets, outperforming existing methods by significant margin.
Conclusion: The proposed method effectively addresses cross-modality learning in unsupervised VI-ReID through bias-mitigated global association and modality-invariant representation learning, demonstrating superior performance despite conceptual simplicity.
Abstract: Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match individuals across visible and infrared cameras without relying on any annotation. Given the significant gap across visible and infrared modality, estimating reliable cross-modality association becomes a major challenge in USVI-ReID. Existing methods usually adopt optimal transport to associate the intra-modality clusters, which is prone to propagating the local cluster errors, and also overlooks global instance-level relations. By mining and attending to the visible-infrared modality bias, this paper focuses on addressing cross-modality learning from two aspects: bias-mitigated global association and modality-invariant representation learning. Motivated by the camera-aware distance rectification in single-modality re-ID, we propose modality-aware Jaccard distance to mitigate the distance bias caused by modality discrepancy, so that more reliable cross-modality associations can be estimated through global clustering. To further improve cross-modality representation learning, a `split-and-contrast’ strategy is designed to obtain modality-specific global prototypes. By explicitly aligning these prototypes under global association guidance, modality-invariant yet ID-discriminative representation learning can be achieved. While conceptually simple, our method obtains state-of-the-art performance on benchmark VI-ReID datasets and outperforms existing methods by a significant margin, validating its effectiveness.
[342] GorillaWatch: An Automated System for In-the-Wild Gorilla Re-Identification and Population Monitoring
Maximilian Schall, Felix Leonard Knöfel, Noah Elias König, Jan Jonas Kubeler, Maximilian von Klinski, Joan Wilhelm Linnemann, Xiaoshi Liu, Iven Jelle Schlegelmilch, Ole Woyciniuk, Alexandra Schild, Dante Wasmuht, Magdalena Bermejo Espinet, German Illera Basas, Gerard de Melo
Main category: cs.CV
TL;DR: GorillaWatch: An end-to-end pipeline for automated gorilla re-identification using novel datasets and multi-frame self-supervised pretraining, with attention verification for scientific validity.
Details
Motivation: Manual re-identification of critically endangered western lowland gorillas from camera trap footage is extremely labor-intensive, and existing automation efforts are hampered by lack of large-scale "in-the-wild" video datasets for training robust deep learning models.Method: Introduces three novel datasets (Gorilla-SPAC-Wild, Gorilla-Berlin-Zoo, Gorilla-SPAC-MoT) and GorillaWatch pipeline integrating detection, tracking, and re-identification. Uses multi-frame self-supervised pretraining leveraging tracklet consistency, and differentiable AttnLRP adaptation to verify model focuses on biometric traits. Also addresses unsupervised population counting with spatiotemporal constraints.
Result: Extensive benchmarking shows aggregating features from large-scale image backbones outperforms specialized video architectures. The approach enables scalable, non-invasive monitoring of endangered species.
Conclusion: The released datasets and GorillaWatch pipeline provide a comprehensive solution for automated gorilla monitoring, addressing key challenges in wildlife conservation through computer vision and deep learning.
Abstract: Monitoring critically endangered western lowland gorillas is currently hampered by the immense manual effort required to re-identify individuals from vast archives of camera trap footage. The primary obstacle to automating this process has been the lack of large-scale, “in-the-wild” video datasets suitable for training robust deep learning models. To address this gap, we introduce a comprehensive benchmark with three novel datasets: Gorilla-SPAC-Wild, the largest video dataset for wild primate re-identification to date; Gorilla-Berlin-Zoo, for assessing cross-domain re-identification generalization; and Gorilla-SPAC-MoT, for evaluating multi-object tracking in camera trap footage. Building on these datasets, we present GorillaWatch, an end-to-end pipeline integrating detection, tracking, and re-identification. To exploit temporal information, we introduce a multi-frame self-supervised pretraining strategy that leverages consistency in tracklets to learn domain-specific features without manual labels. To ensure scientific validity, a differentiable adaptation of AttnLRP verifies that our model relies on discriminative biometric traits rather than background correlations. Extensive benchmarking subsequently demonstrates that aggregating features from large-scale image backbones outperforms specialized video architectures. Finally, we address unsupervised population counting by integrating spatiotemporal constraints into standard clustering to mitigate over-segmentation. We publicly release all code and datasets to facilitate scalable, non-invasive monitoring of endangered species
[343] Distribution Matching Variational AutoEncoder
Sen Ye, Jianning Pei, Mengde Xu, Shuyang Gu, Chunyu Wang, Liwei Wang, Han Hu
Main category: cs.CV
TL;DR: DMVAE introduces distribution matching to align encoder latents with arbitrary reference distributions, enabling systematic study of optimal latent distributions for image generation.
Details
Motivation: Existing visual generative models use latent spaces without explicit distribution shaping, making it unclear which distributions are optimal for modeling. Current approaches like VAEs and foundation model encoders constrain latents implicitly without systematic investigation.Method: Distribution-Matching VAE (DMVAE) explicitly aligns encoder’s latent distribution with arbitrary reference distributions via distribution matching constraint, generalizing beyond Gaussian priors to align with distributions from self-supervised features, diffusion noise, or other priors.
Result: SSL-derived distributions provide optimal balance between reconstruction fidelity and modeling efficiency, achieving gFID of 3.2 on ImageNet with only 64 training epochs. Distribution-level alignment is key to bridging gap between easy-to-model latents and high-fidelity synthesis.
Conclusion: Choosing suitable latent distribution structure through distribution-level alignment, rather than fixed priors, is crucial for optimal image synthesis. DMVAE enables systematic investigation of latent distributions and shows SSL-derived distributions offer excellent performance trade-offs.
Abstract: Most visual generative models compress images into a latent space before applying diffusion or autoregressive modelling. Yet, existing approaches such as VAEs and foundation model aligned encoders implicitly constrain the latent space without explicitly shaping its distribution, making it unclear which types of distributions are optimal for modeling. We introduce \textbf{Distribution-Matching VAE} (\textbf{DMVAE}), which explicitly aligns the encoder’s latent distribution with an arbitrary reference distribution via a distribution matching constraint. This generalizes beyond the Gaussian prior of conventional VAEs, enabling alignment with distributions derived from self-supervised features, diffusion noise, or other prior distributions. With DMVAE, we can systematically investigate which latent distributions are more conducive to modeling, and we find that SSL-derived distributions provide an excellent balance between reconstruction fidelity and modeling efficiency, reaching gFID equals 3.2 on ImageNet with only 64 training epochs. Our results suggest that choosing a suitable latent distribution structure (achieved via distribution-level alignment), rather than relying on fixed priors, is key to bridging the gap between easy-to-model latents and high-fidelity image synthesis. Code is avaliable at https://github.com/sen-ye/dmvae.
[344] OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, Tian Xie
Main category: cs.CV
TL;DR: OneStory is a novel multi-shot video generation method that reformulates the task as next-shot generation with global cross-shot context modeling, achieving state-of-the-art narrative coherence for long-form storytelling.
Details
Motivation: Existing multi-shot video generation methods struggle with modeling long-range cross-shot context due to limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives.Method: Reformulates MSV as next-shot generation task with autoregressive shot synthesis using pretrained I2V models. Introduces Frame Selection module for semantically-relevant global memory from prior shots, and Adaptive Conditioner with importance-guided patchification for compact context conditioning. Curates 60K multi-shot dataset with referential captions and designs effective training strategies.
Result: Achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.
Conclusion: OneStory enables global yet compact cross-shot context modeling for consistent and scalable narrative generation, advancing multi-shot video generation for realistic storytelling applications.
Abstract: Storytelling in real-world videos often unfolds through multiple shots – discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.
[345] Multi-view Pyramid Transformer: Look Coarser to See Broader
Gyeongjin Kang, Seungkwon Yang, Seungtae Nam, Younggeun Lee, Jungwoo Kim, Eunbyung Park
Main category: cs.CV
TL;DR: MVP is a scalable multi-view transformer that reconstructs large 3D scenes from many images in one pass using dual hierarchies for efficiency and detail.
Details
Motivation: Current methods struggle with reconstructing large 3D scenes from many images efficiently while maintaining both computational efficiency and representational richness.Method: Multi-view Pyramid Transformer with two core principles: 1) local-to-global inter-view hierarchy (local views → groups → full scene), and 2) fine-to-coarse intra-view hierarchy (detailed spatial representations → compact tokens).
Result: Achieves state-of-the-art generalizable reconstruction quality when coupled with 3D Gaussian Splatting, maintaining high efficiency and scalability across diverse view configurations.
Conclusion: MVP enables fast reconstruction of large, complex scenes through its dual hierarchy design, balancing computational efficiency with rich representation.
Abstract: We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of ``looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model’s perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
[346] Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes
Shai Krakovsky, Gal Fiebelman, Sagie Benaim, Hadar Averbuch-Elor
Main category: cs.CV
TL;DR: Proposes a novel approach for embedding language fields in 3D Gaussian representations to enable semantic scene understanding, addressing efficiency and feature misalignment challenges in existing methods.
Details
Motivation: Language fields in 3D representations enable richer semantic understanding of spatial environments, linking geometry with descriptive meaning for intuitive human-computer interaction, scene retrieval, navigation, and multimodal reasoning. However, recent feature distillation approaches struggle with massive Internet data due to semantic feature misalignment and inefficiency in memory/runtime.Method: 1) Introduces extremely low-dimensional semantic bottleneck features in 3D Gaussian representation, processed through rendering and multi-resolution feature-based hash encoder for improved efficiency. 2) Proposes Attenuated Downsampler module and several regularizations to address semantic misalignment of ground truth 2D features.
Result: Evaluated on the in-the-wild HolyScenes dataset, the method surpasses existing approaches in both performance and efficiency.
Conclusion: The proposed approach effectively addresses efficiency and semantic misalignment challenges in learning language fields from massive Internet data, enabling practical semantic 3D scene understanding for real-world applications.
Abstract: Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for a more intuitive human-computer interaction, enabling querying or editing scenes using natural language, and could potentially improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to challenges in semantic feature misalignment and inefficiency in memory and runtime. To this end, we propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These are processed by rendering and passing them through a multi-resolution, feature-based, hash encoder. This significantly improves efficiency both in runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations addressing the semantic misalignment of ground truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.
[347] OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing
Haoyang He, Jie Wang, Jiangning Zhang, Zhucun Xue, Xingyuan Bu, Qiangpeng Yang, Shilei Wen, Lei Xie
Main category: cs.CV
TL;DR: OpenVE-3M is a large-scale, high-quality dataset for instruction-based video editing with diverse edit types, accompanied by OpenVE-Bench benchmark and OpenVE-Edit model achieving SOTA results.
Details
Motivation: There's a scarcity of large-scale, high-quality datasets for instruction-based video editing compared to image editing, creating a gap in the field that needs to be addressed.Method: Created OpenVE-3M dataset with two categories: spatially-aligned edits (6 types) and non-spatially-aligned edits (2 types), generated via a meticulously designed data pipeline with rigorous quality filtering. Also built OpenVE-Bench benchmark with 431 video-edit pairs and three human-aligned metrics.
Result: OpenVE-3M surpasses existing open-source datasets in scale, diversity, instruction length, and quality. OpenVE-Edit (5B model) trained on this dataset sets new SOTA on OpenVE-Bench, outperforming all prior open-source models including a 14B baseline.
Conclusion: The OpenVE-3M dataset, OpenVE-Bench benchmark, and OpenVE-Edit model collectively advance instruction-based video editing by providing comprehensive resources and demonstrating superior performance through efficient model training.
Abstract: The quality and diversity of instruction-based image editing datasets are continuously increasing, yet large-scale, high-quality datasets for instruction-based video editing remain scarce. To address this gap, we introduce OpenVE-3M, an open-source, large-scale, and high-quality dataset for instruction-based video editing. It comprises two primary categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated via a meticulously designed data pipeline with rigorous quality filtering. OpenVE-3M surpasses existing open-source datasets in terms of scale, diversity of edit types, instruction length, and overall quality. Furthermore, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, containing 431 video-edit pairs that cover a diverse range of editing tasks with three key metrics highly aligned with human judgment. We present OpenVE-Edit, a 5B model trained on our dataset that demonstrates remarkable efficiency and effectiveness by setting a new state-of-the-art on OpenVE-Bench, outperforming all prior open-source models including a 14B baseline. Project page is at https://github.com/lewandofskee/OpenVE.
[348] Image-Guided Semantic Pseudo-LiDAR Point Generation for 3D Object Detection
Minseung Lee, Seokha Moon, Seung Joon Lee, Reza Mahjourian, Jinkyu Kim
Main category: cs.CV
TL;DR: ImagePG is a novel framework that uses RGB image features to generate dense, semantically meaningful 3D pseudo-points to enhance LiDAR-based object detection, particularly for small and distant objects like pedestrians and cyclists.
Details
Motivation: LiDAR's inherent sparsity makes it difficult to detect small or distant objects in autonomous driving. Existing methods that generate additional points within RoIs using LiDAR alone often produce false positives and fail to recover meaningful structures.Method: ImagePG leverages RGB image features through three main components: 1) Image-Guided RoI Points Generation (IG-RPG) module creates pseudo-points guided by image features, 2) Image-Aware Occupancy Prediction Network (I-OPN) provides spatial priors for point placement, and 3) Multi-stage Refinement (MR) module enhances point quality and detection robustness.
Result: On KITTI benchmark: +1.38%p mAP improvement for cars, +7.91%p for pedestrians, and +5.21%p for cyclists on test set, achieving state-of-the-art cyclist performance. Reduces false positives by nearly 50% and significantly improves detection of small/distant objects on both KITTI and Waymo datasets.
Conclusion: ImagePG successfully addresses LiDAR sparsity limitations by leveraging image features for semantic point generation, demonstrating substantial improvements in object detection performance, particularly for challenging categories like pedestrians and cyclists in autonomous driving scenarios.
Abstract: In autonomous driving scenarios, accurate perception is becoming an even more critical task for safe navigation. While LiDAR provides precise spatial data, its inherent sparsity makes it difficult to detect small or distant objects. Existing methods try to address this by generating additional points within a Region of Interest (RoI), but relying on LiDAR alone often leads to false positives and a failure to recover meaningful structures. To address these limitations, we propose Image-Guided Semantic Pseudo-LiDAR Point Generation model, called ImagePG, a novel framework that leverages rich RGB image features to generate dense and semantically meaningful 3D points. Our framework includes an Image-Guided RoI Points Generation (IG-RPG) module, which creates pseudo-points guided by image features, and an Image-Aware Occupancy Prediction Network (I-OPN), which provides spatial priors to guide point placement. A multi-stage refinement (MR) module further enhances point quality and detection robustness. To the best of our knowledge, ImagePG is the first method to directly leverage image features for point generation. Extensive experiments on the KITTI and Waymo datasets demonstrate that ImagePG significantly improves the detection of small and distant objects like pedestrians and cyclists, reducing false positives by nearly 50%. On the KITTI benchmark, our framework improves mAP by +1.38%p (car), +7.91%p (pedestrian), and +5.21%p (cyclist) on the test set over the baseline, achieving state-of-the-art cyclist performance on the KITTI leaderboard. The code is available at: https://github.com/MS-LIMA/ImagePG
[349] UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, Jiaya Jia
Main category: cs.CV
TL;DR: UnityVideo is a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation, skeletons, DensePose, flow, depth) to improve holistic world understanding and video synthesis.
Details
Motivation: Current video generation models are limited by single-modality conditioning, which constrains holistic world understanding due to insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation.Method: Introduces two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with in-context learner for unified processing via modular parameters and contextual learning. Uses a large-scale unified dataset with 1.3M samples.
Result: UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. It achieves superior video quality, consistency, and improved alignment with physical world constraints.
Conclusion: UnityVideo addresses limitations of single-modality video generation by enabling joint learning across multiple modalities, resulting in better world-aware video synthesis with improved physical realism and generalization capabilities.
Abstract: Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and data can be found at: https://github.com/dvlab-research/UnityVideo
[350] Moyun: A Diffusion-Based Model for Style-Specific Chinese Calligraphy Generation
Kaiyuan Liu, Jiahao Mei, Hengyu Zhang, Yihuai Zhang, Daoguo Dong, Liang He
Main category: cs.CV
TL;DR: Moyun model uses Vision Mamba with TripleLabel control for Chinese calligraphy generation, trained on 1.9M image dataset Mobao
Details
Motivation: Existing Chinese calligraphy generation lacks fine-grained control over calligrapher, font, and character style specificationsMethod: Replace Unet in Diffusion model with Vision Mamba and introduce TripleLabel control mechanism for controllable generation
Result: Model effectively controls generation process and produces calligraphy in specified styles, even for unseen characters by calligraphers
Conclusion: Moyun enables fine-grained controllable Chinese calligraphy generation with style consistency across calligraphers and fonts
Abstract: Although Chinese calligraphy generation has achieved style transfer, generating calligraphy by specifying the calligrapher, font, and character style remains challenging. To address this, we propose a new Chinese calligraphy generation model ‘Moyun’ , which replaces the Unet in the Diffusion model with Vision Mamba and introduces the TripleLabel control mechanism to achieve controllable calligraphy generation. The model was tested on our large-scale dataset ‘Mobao’ of over 1.9 million images, and the results demonstrate that ‘Moyun’ can effectively control the generation process and produce calligraphy in the specified style. Even for calligraphy the calligrapher has not written, ‘Moyun’ can generate calligraphy that matches the style of the calligrapher.
[351] Voxify3D: Pixel Art Meets Volumetric Rendering
Yi-Chuan Huang, Jiewen Chan, Hao-Jen Chien, Yu-Lun Liu
Main category: cs.CV
TL;DR: Voxify3D is a differentiable two-stage framework that generates high-quality voxel art from 3D meshes by integrating orthographic pixel art supervision, patch-based CLIP alignment, and palette-constrained Gumbel-Softmax quantization.
Details
Motivation: Automated voxel art generation from 3D meshes is challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve pixel-precise, palette-constrained aesthetics.Method: A differentiable two-stage framework with three key components: (1) orthographic pixel art supervision for precise voxel-pixel alignment, (2) patch-based CLIP alignment for semantic preservation across discretization levels, and (3) palette-constrained Gumbel-Softmax quantization for differentiable optimization over discrete color spaces with controllable palette strategies.
Result: Superior performance with 37.12 CLIP-IQA score and 77.90% user preference across diverse characters, supporting controllable abstraction with 2-8 colors and 20x-50x resolutions.
Conclusion: Voxify3D successfully addresses fundamental challenges in voxel art generation by integrating semantic preservation, pixel-art aesthetics, and end-to-end discrete optimization, enabling high-quality automated voxel art creation from 3D meshes.
Abstract: Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/
[352] Rethinking Normalization Strategies and Convolutional Kernels for Multimodal Image Fusion
Dan He, Guofen Wang, Weisheng Li, Yucheng Shu, Wenbo Li, Lijian Yang, Yuping Huang, Feiyan Li
Main category: cs.CV
TL;DR: The paper proposes LKC-FUNet, a UNet-based architecture for multimodal image fusion that addresses limitations of batch normalization and introduces hybrid normalization with large kernel convolutions for better feature preservation.
Details
Motivation: Existing multimodal image fusion research focuses on complementary information fusion and training strategies but overlooks critical architectural components like normalization and convolution kernels. Batch normalization limits performance by smoothing crucial sparse features that are important for fusion tasks.Method: Proposes LKC-FUNet with: 1) Hybrid instance and group normalization to maintain sample independence and reinforce intrinsic feature correlations; 2) Large kernel convolutions enabled by richer feature maps from the normalization strategy; 3) Multi-path adaptive fusion module that dynamically calibrates features from varying scales and receptive fields.
Result: Achieves state-of-the-art objective performance on MSRS, M³FD, TNO, and Harvard datasets. Produces visually clearer salient objects and lesion areas. Notably improves MSRS segmentation mIoU by 8.1% over infrared images alone.
Conclusion: The performance stems from synergistic design of normalization and convolution kernels that preserves critical sparse features. The method demonstrates that architectural components like normalization are crucial for multimodal image fusion success.
Abstract: Multimodal image fusion (MMIF) integrates information from different modalities to obtain a comprehensive image, aiding downstream tasks. However, existing research focuses on complementary information fusion and training strategies, overlooking the critical role of underlying architectural components like normalization and convolution kernels. We reevaluate the UNet architecture for end-to-end MMIF, identifying that widely used batch normalization limits performance by smoothing crucial sparse features. To address this, we propose a hybrid of instance and group normalization to maintain sample independence and reinforce intrinsic feature correlations. Crucially, this strategy facilitates richer feature maps, enabling large kernel convolution to fully leverage its receptive field, enhancing detail preservation. Furthermore, the proposed multi-path adaptive fusion module dynamically calibrates features from varying scales and receptive fields, ensuring effective information transfer. Our method achieves SOTA objective performance on MSRS, M$^3$FD, TNO, and Harvard datasets, producing visually clearer salient objects and lesion areas. Notably, it improves MSRS segmentation mIoU by 8.1% over the infrared image. This performance stems from a synergistic design of normalization and convolution kernels, which preserves critical sparse features. The code is available at https://github.com/HeDan-11/LKC-FUNet.
[353] Twisted Convolutional Networks (TCNs): Enhancing Feature Interactions for Non-Spatial Data Classification
Junbo Jacob Lian, Haoran Chen, Kaichen Ouyang, Yujun Zhang, Rui Zhong, Huiling Chen
Main category: cs.CV
TL;DR: TCNs are a new deep learning architecture for 1D data with arbitrary feature order, using multiplicative and pairwise interactions to capture high-order feature relationships that traditional CNNs miss.
Details
Motivation: Traditional CNNs rely on structured feature sequences and spatial relationships, which are inadequate for 1D data with arbitrary feature order and minimal spatial relationships. There's a need for architectures that can effectively capture feature interactions in non-spatial data.Method: TCNs use twisted convolution operations that combine subsets of input features through multiplicative and pairwise interaction mechanisms, formalized through polynomial feature expansions. This creates enriched representations that capture high-order feature interactions while maintaining computational tractability.
Result: TCNs achieve statistically significant improvements over CNNs, ResNet, GNNs, DeepSets, and SVM across five benchmark datasets from diverse domains (medical diagnostics, political science, synthetic data, chemometrics, and healthcare). They also show superior training stability and generalization capabilities.
Conclusion: TCNs provide a robust and effective architecture for classifying non-spatial 1D data with arbitrary feature order, offering better performance and generalization than existing methods through their theoretically grounded feature interaction mechanisms.
Abstract: Twisted Convolutional Networks (TCNs) are proposed as a novel deep learning architecture for classifying one-dimensional data with arbitrary feature order and minimal spatial relationships. Unlike conventional Convolutional Neural Networks (CNNs) that rely on structured feature sequences, TCNs explicitly combine subsets of input features through theoretically grounded multiplicative and pairwise interaction mechanisms to create enriched representations. This feature combination strategy, formalized through polynomial feature expansions, captures high-order feature interactions that traditional convolutional approaches miss. We provide a comprehensive mathematical framework for TCNs, demonstrating how the twisted convolution operation generalizes standard convolutions while maintaining computational tractability. Through extensive experiments on five benchmark datasets from diverse domains (medical diagnostics, political science, synthetic data, chemometrics, and healthcare), we show that TCNs achieve statistically significant improvements over CNNs, Residual Networks (ResNet), Graph Neural Networks (GNNs), DeepSets, and Support Vector Machine (SVM). The performance gains are validated through statistical testing. TCNs also exhibit superior training stability and generalization capabilities, highlighting their robustness for non-spatial data classification tasks.
[354] SPFFNet: Strip Perception and Feature Fusion Spatial Pyramid Pooling for Fabric Defect Detection
Peizhe Zhao, Shunbo Jia
Main category: cs.CV
TL;DR: Improved YOLOv11-based fabric defect detection model with Strip Perception Module, SE-SPPF enhancement, and novel FECIoU metric achieves significant mAP improvements on benchmark datasets.
Details
Motivation: Existing fabric defect detection methods struggle with complex backgrounds and shape-specific defects, particularly strip defects, requiring improved feature capture and handling of scale differences and class imbalance.Method: Proposes three key improvements: 1) Strip Perception Module (SPM) with multi-scale convolution for better strip defect detection, 2) SE-SPPF module integrating squeeze-and-excitation mechanism into SPPF for enhanced spatial and channel information integration, and 3) novel FECIoU metric with adaptive weights using focal loss to address scale differences and class imbalance.
Result: Model achieves 0.8-8.1% mAP improvement on Tianchi dataset and 1.6-13.2% improvement on custom dataset, outperforming state-of-the-art methods.
Conclusion: The proposed improvements to YOLOv11 effectively enhance fabric defect detection performance, particularly for challenging strip defects, through better feature extraction and handling of class imbalance.
Abstract: Defect detection in fabrics is critical for quality control, yet existing methods often struggle with complex backgrounds and shape-specific defects. In this paper, we propose an improved fabric defect detection model based on YOLOv11. To enhance the detection of strip defects, we introduce a Strip Perception Module (SPM) that improves feature capture through multi-scale convolution. We further enhance the spatial pyramid pooling fast (SPPF) by integrating a squeeze-and-excitation mechanism, resulting in the SE-SPPF module, which better integrates spatial and channel information for more effective defect feature extraction. Additionally, we propose a novel focal enhanced complete intersection over union (FECIoU) metric with adaptive weights, addressing scale differences and class imbalance by adjusting the weights of hard-to-detect instances through focal loss. Experimental results demonstrate that our model achieves a 0.8-8.1% improvement in mean average precision (mAP) on the Tianchi dataset and a 1.6-13.2% improvement on our custom dataset, outperforming other state-of-the-art methods.
[355] Bimodal SegNet: Instance Segmentation Fusing Events and RGB Frames for Robotic Grasping
Sanket Kachole, Xiaoqian Huang, Fariborz Baghaei Naeini, Rajkumar Muthusamy, Dimitrios Makris, Yahya Zweiri
Main category: cs.CV
TL;DR: Bimodal SegNet fuses event-based and RGB data for robust object segmentation in robotic grasping under challenging dynamic conditions like occlusion, blur, and lighting variations.
Details
Motivation: Object segmentation for robotic grasping faces challenges in dynamic conditions including occlusion, low light, motion blur, and object size variance. Traditional RGB-based methods struggle with these issues, necessitating a more robust approach.Method: Proposes Bimodal SegNet with two distinct encoders for event-based and RGB frame data, using spatial pyramidal pooling with atrous convolutions. Encoders capture contextual information by pooling concatenated features at different resolutions, while the decoder obtains sharp object boundaries.
Result: Evaluation on the Event-based Segmentation (ESD) Dataset shows 6-10% segmentation accuracy improvement over state-of-the-art methods in terms of mean intersection over union and pixel accuracy across five degradation challenges (occlusion, blur, brightness, trajectory, scale variance).
Conclusion: Fusing event-based and RGB data through the proposed Bimodal SegNet architecture significantly improves segmentation robustness for robotic grasping under challenging dynamic conditions, outperforming existing methods.
Abstract: Object segmentation for robotic grasping under dynamic conditions often faces challenges such as occlusion, low light conditions, motion blur and object size variance. To address these challenges, we propose a Deep Learning network that fuses two types of visual signals, event-based data and RGB frame data. The proposed Bimodal SegNet network has two distinct encoders, one for each signal input and a spatial pyramidal pooling with atrous convolutions. Encoders capture rich contextual information by pooling the concatenated features at different resolutions while the decoder obtains sharp object boundaries. The evaluation of the proposed method undertakes five unique image degradation challenges including occlusion, blur, brightness, trajectory and scale variance on the Event-based Segmentation (ESD) Dataset. The evaluation results show a 6-10% segmentation accuracy improvement over state-of-the-art methods in terms of mean intersection over the union and pixel accuracy. The model code is available at https://github.com/sanket0707/Bimodal-SegNet.git
[356] Diffusion Models for Image Restoration and Enhancement: A Comprehensive Survey
Xin Li, Yulin Ren, Xin Jin, Cuiling Lan, Xingrui Wang, Wenjun Zeng, Xinchao Wang, Zhibo Chen
Main category: cs.CV
TL;DR: A comprehensive survey paper reviewing diffusion model-based methods for image restoration, covering learning paradigms, conditional strategies, framework designs, and evaluation metrics, while identifying future research directions.
Details
Motivation: While diffusion models have shown impressive results in image generation, there's a lack of comprehensive surveys on their application to image restoration tasks. The authors aim to fill this gap by systematically reviewing how diffusion models can boost image restoration performance compared to previous GAN-based approaches.Method: The paper presents a systematic review methodology: 1) Introducing diffusion model background, 2) Presenting two prevalent workflows for using diffusion models in IR, 3) Classifying innovative designs for both standard and blind/real-world IR, 4) Summarizing evaluation datasets and metrics, and 5) Providing objective comparisons of open-sourced methods across super-resolution, deblurring, and inpainting tasks.
Result: The survey identifies that diffusion model-based methods have achieved superior performance compared to previous GAN-based approaches. The comprehensive review covers existing methods, evaluation frameworks, and provides objective comparisons across multiple IR tasks, establishing the current state of the field.
Conclusion: The paper serves as the first comprehensive survey on diffusion model-based image restoration, highlighting current advancements while proposing five future research directions: sampling efficiency, model compression, distortion simulation/estimation, distortion invariant learning, and improved framework design.
Abstract: Image restoration (IR) has been an indispensable and challenging task in the low-level vision field, which strives to improve the subjective quality of images distorted by various forms of degradation. Recently, the diffusion model has achieved significant advancements in the visual generation of AIGC, thereby raising an intuitive question, “whether diffusion model can boost image restoration”. To answer this, some pioneering studies attempt to integrate diffusion models into the image restoration task, resulting in superior performances than previous GAN-based methods. Despite that, a comprehensive and enlightening survey on diffusion model-based image restoration remains scarce. In this paper, we are the first to present a comprehensive review of recent diffusion model-based methods on image restoration, encompassing the learning paradigm, conditional strategy, framework design, modeling strategy, and evaluation. Concretely, we first introduce the background of the diffusion model briefly and then present two prevalent workflows that exploit diffusion models in image restoration. Subsequently, we classify and emphasize the innovative designs using diffusion models for both IR and blind/real-world IR, intending to inspire future development. To evaluate existing methods thoroughly, we summarize the commonly-used dataset, implementation details, and evaluation metrics. Additionally, we present the objective comparison for open-sourced methods across three tasks, including image super-resolution, deblurring, and inpainting. Ultimately, informed by the limitations in existing works, we propose five potential and challenging directions for the future research of diffusion model-based IR, including sampling efficiency, model compression, distortion simulation and estimation, distortion invariant learning, and framework design.
[357] TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment
Shicheng Li, Lei Li, Kun Ouyang, Shuhuai Ren, Yuanxin Liu, Yuanxing Zhang, Fuzheng Zhang, Lingpeng Kong, Qi Liu, Xu Sun
Main category: cs.CV
TL;DR: TEMPLE enhances Video LLM temporal reasoning through automated preference pair generation and progressive pre-SFT alignment using DPO, outperforming SFT-only approaches with minimal data.
Details
Motivation: Existing Video LLMs struggle with temporal reasoning due to weak temporal correspondence in training data and over-reliance on next-token prediction, lacking proper temporal supervision.Method: TEMPLE framework uses Direct Preference Optimization with automated pipeline for constructing temporality-intensive preference pairs (selecting rich videos, designing perturbation strategies, evaluating responses) and Progressive Pre-SFT Alignment with curriculum learning and preference optimization before instruction tuning.
Result: Extensive experiments show consistent improvement across multiple benchmarks with relatively small set of self-generated DPO data, demonstrating scalability and efficiency.
Conclusion: TEMPLE provides a scalable and efficient complement to SFT-based methods, paving the way for developing more reliable Video LLMs with enhanced temporal reasoning capabilities.
Abstract: Video Large Language Models (Video LLMs) have achieved significant success by adopting the paradigm of large-scale pre-training followed by supervised fine-tuning (SFT). However, existing approaches struggle with temporal reasoning due to weak temporal correspondence in the data and over-reliance on the next-token prediction paradigm}, which collectively result in the absence temporal supervision. To address these limitations, we propose TEMPLE (TEMporal Preference LEarning), a systematic framework that enhances temporal reasoning capabilities through Direct Preference Optimization (DPO). To address temporal information scarcity in data, we introduce an automated pipeline for systematically constructing temporality-intensive preference pairs comprising three steps: selecting temporally rich videos, designing video-specific perturbation strategies, and evaluating model responses on clean and perturbed inputs. Complementing this data pipeline, we provide additional supervision signals via preference learning and propose a novel Progressive Pre-SFT Alignment strategy featuring two key innovations: a curriculum learning strategy which progressively increases perturbation difficulty to maximize data efficiency; and applying preference optimization before instruction tuning to incentivize fundamental temporal alignment. Extensive experiments demonstrate that our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. Our findings highlight TEMPLE as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs.
[358] Roadside Monocular 3D Detection Prompted by 2D Detection
Yechi Ma, Yanan Li, Wei Hua, Shu Kong
Main category: cs.CV
TL;DR: Pro3D is a novel roadside monocular 3D detection method that uses 2D detections as prompts to help 3D detectors focus on lifting objects into 3D BEV space, achieving state-of-the-art performance.
Details
Motivation: Roadside monocular 3D detection has important applications in traffic control and vehicle-infrastructure cooperation, but directly training 3D detectors is challenging. 2D detectors are easier to train and better at localizing objects in 2D, so using them as prompts can help 3D detectors focus on the more difficult 3D lifting task.Method: Pro3D leverages 2D detections as prompts for 3D detection. The authors explore three fusion methods: (1) simple feature concatenation, (2) attentive feature fusion, and (3) encoding 2D bounding box properties (x, y, width, height, label) and attentively fusing them with 3D detector features. The third method proved most effective.
Result: The third fusion method (encoding 2D bounding box properties with attentive fusion) significantly outperformed other methods. Pro3D enhances existing methods and achieves state-of-the-art results on two contemporary benchmarks.
Conclusion: Using 2D detections as prompts is an effective strategy for monocular 3D detection, allowing 3D detectors to focus on the challenging task of lifting precisely localized 2D objects into 3D BEV space. Pro3D is adaptable to various 2D/3D detectors and demonstrates significant performance improvements.
Abstract: Roadside monocular 3D detection requires detecting objects of predefined classes in an RGB frame and predicting their 3D attributes, such as bird’s-eye-view (BEV) locations. It has broad applications in traffic control, vehicle-vehicle communication, and vehicle-infrastructure cooperative perception. To address this task, we introduce Promptable 3D Detector (Pro3D), a novel detector design that leverages 2D detections as prompts. We build our Pro3D upon two key insights. First, compared to a typical 3D detector, a 2D detector is ``easier’’ to train due to fewer loss terms and performs significantly better at localizing objects w.r.t 2D metrics. Second, once 2D detections precisely locate objects in the image, a 3D detector can focus on lifting these detections into 3D BEV, especially when fixed camera pose or scene geometry provide an informative prior. To encode and incorporate 2D detections, we explore three methods: (a) concatenating features from both 2D and 3D detectors, (b) attentively fusing 2D and 3D detector features, and (c) encoding properties of predicted 2D bounding boxes {$x$, $y$, width, height, label} and attentively fusing them with the 3D detector feature. Interestingly, the third method significantly outperforms the others, underscoring the effectiveness of 2D detections as prompts that offer precise object targets and allow the 3D detector to focus on lifting them into 3D. Pro3D is adaptable for use with a wide range of 2D and 3D detectors with minimal modifications. Comprehensive experiments demonstrate that our Pro3D significantly enhances existing methods, achieving state-of-the-art results on two contemporary benchmarks.
[359] Suite-IN++: A FlexiWear BodyNet Integrating Global and Local Motion Features from Apple Suite for Robust Inertial Navigation
Lan Sun, Songpengcheng Xia, Jiarui Yang, Ling Pei
Main category: cs.CV
TL;DR: Suite-IN++ is a deep learning framework that leverages multiple wearable devices (smartphone, smartwatch, headphones) as a “flexiwear bodynet” for robust pedestrian localization, using contrastive learning and attention mechanisms to fuse motion data from different body parts.
Details
Motivation: Traditional PDR struggles with diverse motion modes, while single-device data-driven methods lack robustness. The proliferation of wearable technology creates opportunities to leverage multiple existing devices for more reliable pedestrian localization.Method: Deep learning framework integrating motion data from wearable devices on different body parts. Uses contrastive learning to separate global and local motion features, fuses global features based on device reliability, and employs attention mechanisms to uncover cross-device correlations in local features.
Result: Superior localization accuracy and robustness compared to state-of-the-art models, demonstrated through experiments on a real-life flexiwear bodynet dataset incorporating Apple devices across diverse walking modes and configurations.
Conclusion: Suite-IN++ effectively leverages multiple wearable devices to form a flexiwear bodynet, achieving robust and accurate pedestrian localization by intelligently fusing motion data from different body parts through advanced deep learning techniques.
Abstract: The proliferation of wearable technology has established multi-device ecosystems comprising smartphones, smartwatches, and headphones as critical enablers for ubiquitous pedestrian localization. However, traditional pedestrian dead reckoning (PDR) struggles with diverse motion modes, while data-driven methods, despite improving accuracy, often lack robustness due to their reliance on a single-device setup. Therefore, a promising solution is to fully leverage existing wearable devices to form a flexiwear bodynet for robust and accurate pedestrian localization. This paper presents Suite-IN++, a deep learning framework for flexiwear bodynet-based pedestrian localization. Suite-IN++ integrates motion data from wearable devices on different body parts, using contrastive learning to separate global and local motion features. It fuses global features based on the data reliability of each device to capture overall motion trends and employs an attention mechanism to uncover cross-device correlations in local features, extracting motion details helpful for accurate localization. To evaluate our method, we construct a real-life flexiwear bodynet dataset, incorporating Apple Suite (iPhone, Apple Watch, and AirPods) across diverse walking modes and device configurations. Experimental results demonstrate that Suite-IN++ achieves superior localization accuracy and robustness, significantly outperforming state-of-the-art models in real-life pedestrian tracking scenarios.
[360] SSP-GNN: Learning to Track via Bilevel Optimization
Griffin Golias, Masa Nakura-Fan, Vitaly Ablavsky
Main category: cs.CV
TL;DR: A graph-based MOT method using SSP algorithm with GNN-learned edge costs trained end-to-end via bilevel optimization.
Details
Motivation: To develop a multi-object tracking approach that effectively combines kinematic information and re-ID features through a learnable graph-based formulation.Method: Uses successive shortest paths algorithm on a tracking graph with edge costs computed by a message-passing GNN, trained end-to-end via bilevel optimization with novel loss function.
Result: Method performs favorably compared to strong baselines across varied scenario complexities in simulated evaluations.
Conclusion: The graph-based formulation with learned GNN edge costs and end-to-end training provides effective multi-object tracking that handles both kinematic and appearance features.
Abstract: We propose a graph-based tracking formulation for multi-object tracking (MOT) where target detections contain kinematic information and re-identification features (attributes). Our method applies a successive shortest paths (SSP) algorithm to a tracking graph defined over a batch of frames. The edge costs in this tracking graph are computed via a message-passing network, a graph neural network (GNN) variant. The parameters of the GNN, and hence, the tracker, are learned end-to-end on a training set of example ground-truth tracks and detections. Specifically, learning takes the form of bilevel optimization guided by our novel loss function. We evaluate our algorithm on simulated scenarios to understand its sensitivity to scenario aspects and model hyperparameters. Across varied scenario complexities, our method compares favorably to a strong baseline.
[361] JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model
Farzaneh Jafari, Stefano Berretti, Anup Basu
Main category: cs.CV
TL;DR: JambaTalk: A hybrid Transformer-Mamba model for talking head generation that combines Transformer and Mamba architectures to improve lip sync, facial expressions, and head poses while handling long sequences better than traditional models.
Details
Motivation: Current talking head generation models struggle to achieve equivalence across all quantitative and qualitative metrics. Existing models have limitations in handling long sequences, which constrains their performance in generating comprehensive facial animations with good lip sync, expressions, and head poses.Method: Introduces Jamba, a hybrid Transformer-Mamba model that combines Transformer and Mamba (Structured State Space Model) architectures. Based on the Jamba block, they present JambaTalk which uses multimodal integration to enhance motion variety and lip synchronization for 3D face animation.
Result: Extensive experiments show that the method achieves performance comparable or superior to state-of-the-art models in talking head generation across various metrics.
Conclusion: The hybrid Transformer-Mamba approach (JambaTalk) provides a comprehensive solution for talking head generation that addresses the limitations of traditional models, particularly in handling long sequences while improving lip sync, facial expressions, and head pose generation.
Abstract: In recent years, the talking head generation has become a focal point for researchers. Considerable effort is being made to refine lip-sync motion, capture expressive facial expressions, generate natural head poses, and achieve high-quality video. However, no single model has yet achieved equivalence across all quantitative and qualitative metrics. We introduce Jamba, a hybrid Transformer-Mamba model, to animate a 3D face. Mamba, a pioneering Structured State Space Model (SSM) architecture, was developed to overcome the limitations of conventional Transformer architectures, particularly in handling long sequences. This challenge has constrained traditional models. Jamba combines the advantages of both the Transformer and Mamba approaches, offering a comprehensive solution. Based on the foundational Jamba block, we present JambaTalk to enhance motion variety and lip sync through multimodal integration. Extensive experiments reveal that our method achieves performance comparable or superior to state-of-the-art models.
[362] Enhancing Test Time Adaptation with Few-shot Guidance
Siqi Luo, Yi Xin, Yuntao Du, Tao Tan, Guangtao Zhai, Xiaohong Liu
Main category: cs.CV
TL;DR: FS-TTA introduces a few-shot support set to improve test time adaptation, reducing blind exploration in unseen domains through a two-stage framework with feature diversity augmentation and prototype memory guidance.
Details
Motivation: Existing Test Time Adaptation (TTA) methods lack reliable mechanisms for domain shift correction and can be erratic in real-world applications when adapting pre-trained models to out-of-distribution streaming target data.Method: Two-stage framework: (1) Fine-tune pre-trained source model with few-shot support set using feature diversity augmentation to avoid overfitting; (2) Implement test time adaptation with prototype memory bank guidance to produce high-quality pseudo-labels for model adaptation.
Result: Superior performance and reliability demonstrated through extensive experiments on three cross-domain classification benchmarks, showing that FS-TTA reduces blind exploration in unseen target domains.
Conclusion: FS-TTA provides a practical and effective approach for domain adaptation by leveraging few-shot support sets to enhance the reliability and performance of test time adaptation methods.
Abstract: Deep neural networks often encounter significant performance drops while facing with domain shifts between training (source) and test (target) data. To address this issue, Test Time Adaptation (TTA) methods have been proposed to adapt pre-trained source model to handle out-of-distribution streaming target data. Although these methods offer some relief, they lack a reliable mechanism for domain shift correction, which can often be erratic in real-world applications. In response, we develop Few-Shot Test Time Adaptation (FS-TTA), a novel and practical setting that utilizes a few-shot support set on top of TTA. Adhering to the principle of few inputs, big gains, FS-TTA reduces blind exploration in unseen target domains. Furthermore, we propose a two-stage framework to tackle FS-TTA, including (i) fine-tuning the pre-trained source model with few-shot support set, along with using feature diversity augmentation module to avoid overfitting, (ii) implementing test time adaptation based on prototype memory bank guidance to produce high quality pseudo-label for model adaptation. Through extensive experiments on three cross-domain classification benchmarks, we demonstrate the superior performance and reliability of our FS-TTA and framework.
[363] Event-Customized Image Generation
Zhen Wang, Yilei Jiang, Dong Zheng, Jun Xiao, Long Chen
Main category: cs.CV
TL;DR: FreeEvent is a training-free method for event-customized image generation that captures complex actions, poses, relations, and interactions between entities from a single reference image.
Details
Motivation: Existing customization methods focus only on basic actions/interactions between two entities and are limited by insufficient reference images. There's a need to extend customized image generation to more complex scenes for real-world applications.Method: FreeEvent introduces two paths alongside diffusion denoising: 1) Entity switching path using cross-attention guidance for target entity generation, and 2) Event transferring path injecting spatial features and self-attention maps from reference to target image.
Result: The method was evaluated on two new benchmarks (SWiG-Event and Real-Event) with extensive experiments demonstrating its effectiveness in event-customized image generation.
Conclusion: FreeEvent successfully addresses the new task of event-customized image generation, enabling accurate capture of complex events and generation of customized images with various target entities from single reference images.
Abstract: Customized Image Generation, generating customized images with user-specified concepts, has raised significant attention due to its creativity and novelty. With impressive progress achieved in subject customization, some pioneer works further explored the customization of action and interaction beyond entity (i.e., human, animal, and object) appearance. However, these approaches only focus on basic actions and interactions between two entities, and their effects are limited by insufficient ‘’exactly same’’ reference images. To extend customized image generation to more complex scenes for general real-world applications, we propose a new task: event-customized image generation. Given a single reference image, we define the ‘’event’’ as all specific actions, poses, relations, or interactions between different entities in the scene. This task aims at accurately capturing the complex event and generating customized images with various target entities. To solve this task, we proposed a novel training-free event customization method: FreeEvent. Specifically, FreeEvent introduces two extra paths alongside the general diffusion denoising process: 1) Entity switching path: it applies cross-attention guidance and regulation for target entity generation. 2) Event transferring path: it injects the spatial feature and self-attention maps from the reference image to the target image for event generation. To further facilitate this new task, we collected two evaluation benchmarks: SWiG-Event and Real-Event. Extensive experiments and ablations have demonstrated the effectiveness of FreeEvent.
[364] RepLDM: Reprogramming Pretrained Latent Diffusion Models for High-Quality, High-Efficiency, High-Resolution Image Generation
Boyuan Cao, Jiaxin Ye, Yujie Wei, Hongming Shan
Main category: cs.CV
TL;DR: RepLDM is a novel reprogramming framework for pretrained latent diffusion models that enables high-quality, high-efficiency, high-resolution image generation without extensive retraining.
Details
Motivation: Latent diffusion models struggle with structural distortions when generating images at resolutions higher than their training resolution, and existing reprogramming methods result in poor quality and slow inference.Method: Two-stage framework: (1) Attention guidance stage using training-free self-attention to generate higher-quality training-resolution latent representations, and (2) Progressive upsampling stage in pixel space to mitigate artifacts from latent space upsampling.
Result: RepLDM significantly outperforms state-of-the-art methods in both quality and efficiency for high-resolution image generation, with fewer denoising steps needed at higher resolutions.
Conclusion: The framework provides an effective resource-efficient approach for high-resolution image generation using pretrained models, making it advantageous for real-world applications.
Abstract: While latent diffusion models (LDMs), such as Stable Diffusion, are designed for high-resolution (HR) image generation, they often struggle with significant structural distortions when generating images at resolutions higher than their training one. Instead of relying on extensive retraining, a more resource-efficient approach is to reprogram the pretrained model for HR image generation; however, existing methods often result in poor image quality and long inference time. We introduce RepLDM, a novel reprogramming framework for pretrained LDMs that enables high-quality, high-efficiency, high-resolution image generation; see Fig. 1. RepLDM consists of two stages: (i) an attention guidance stage, which generates a latent representation of a higher-quality training-resolution image using a novel training-free self-attention mechanism to enhance the structural consistency; and (ii) a progressive upsampling stage, which progressively performs upsampling in pixel space to mitigate the severe artifacts caused by latent space upsampling. The effective initialization from the first stage allows for denoising at higher resolutions with significantly fewer steps, improving the efficiency. Extensive experimental results demonstrate that RepLDM significantly outperforms state-of-the-art methods in both quality and efficiency for HR image generation, underscoring its advantages for real-world applications. Codes: https://github.com/kmittle/RepLDM.
[365] TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba
Xiaowen Ma, Zhenliang Ni, Xinghao Chen
Main category: cs.CV
TL;DR: TinyViM is a lightweight vision Mamba architecture that uses Laplace mixer for frequency decoupling and frequency ramp inception to efficiently balance high/low-frequency components, achieving superior performance with 2-3× higher throughput than other Mamba models.
Details
Motivation: Existing lightweight Mamba-based vision backbones underperform compared to Convolution/Transformer methods. The authors found that simply modifying scanning paths in image domain doesn't fully exploit Mamba's potential, and Mamba blocks mainly model low-frequency information in hybrid architectures.Method: 1) Comprehensive spectral/quantitative analysis revealing Mamba blocks focus on low-frequency info; 2) Laplace mixer to decouple features by frequency and input only low-frequency components to Mamba blocks; 3) Frequency ramp inception to gradually reduce high-frequency branch dimensions for efficient trade-off across layers; 4) Integration with mobile-friendly convolution.
Result: TinyViM outperforms Convolution, Transformer, and Mamba-based models with similar scales across image classification, semantic segmentation, object detection, and instance segmentation. Achieves 2-3× higher throughput than other Mamba-based models.
Conclusion: The proposed TinyViM demonstrates that proper frequency-aware design can unlock Mamba’s potential for vision tasks, achieving state-of-the-art performance with superior efficiency, making it suitable for practical deployment.
Abstract: Mamba has shown great potential for computer vision due to its linear complexity in modeling the global context with respect to the input length. However, existing lightweight Mamba-based backbones cannot demonstrate performance that matches Convolution or Transformer-based methods. By observing, we find that simply modifying the scanning path in the image domain is not conducive to fully exploiting the potential of vision Mamba. In this paper, we first perform comprehensive spectral and quantitative analyses, and verify that the Mamba block mainly models low-frequency information under Convolution-Mamba hybrid architecture. Based on the analyses, we introduce a novel Laplace mixer to decouple the features in terms of frequency and input only the low-frequency components into the Mamba block. In addition, considering the redundancy of the features and the different requirements for high-frequency details and low-frequency global information at different stages, we introduce a frequency ramp inception, i.e., gradually reduce the input dimensions of the high-frequency branches, so as to efficiently trade-off the high-frequency and low-frequency components at different layers. By integrating mobile-friendly convolution and efficient Laplace mixer, we build a series of tiny hybrid vision Mamba called TinyViM. The proposed TinyViM achieves impressive performance on several downstream tasks including image classification, semantic segmentation, object detection and instance segmentation. In particular, TinyViM outperforms Convolution, Transformer and Mamba-based models with similar scales, and the throughput is about 2-3 times higher than that of other Mamba-based models. Code is available at https://github.com/xwmaxwma/TinyViM.
[366] Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings
Jinyung Hong, Yearim Kim, Keun Hee Park, Sangyu Han, Nojun Kwak, Theodore P. Pavlic
Main category: cs.CV
TL;DR: The paper introduces Bi-ICE, a bidirectional interaction module for inner interpretability in large-scale image classification that enables concept-based predictions, contribution quantification, and concept localization.
Details
Motivation: While inner interpretability has advanced for language models, it has received limited attention for large-scale image tasks, which have primarily focused on architectural and functional visualization rather than deeper interpretability analysis.Method: Proposes a conceptual framework for multilevel analysis and introduces the Bi-directional Interaction between Concept and Input Embeddings (Bi-ICE) module that facilitates interpretability across computational, algorithmic, and implementation levels.
Result: Demonstrates enhanced transparency in image classification by generating predictions based on human-understandable concepts, quantifying concept contributions, and localizing concepts within inputs. Shows algorithmic interpretability through concept learning process and convergence.
Conclusion: The Bi-ICE module successfully enables inner interpretability for large-scale image classification, providing transparency through concept-based predictions, contribution analysis, and localization, while highlighting algorithmic interpretability of concept learning.
Abstract: Inner interpretability is a promising field aiming to uncover the internal mechanisms of AI systems through scalable, automated methods. While significant research has been conducted on large language models, limited attention has been paid to applying inner interpretability to large-scale image tasks, focusing primarily on architectural and functional levels to visualize learned concepts. In this paper, we first present a conceptual framework that supports inner interpretability and multilevel analysis for large-scale image classification tasks. Specifically, we introduce the Bi-directional Interaction between Concept and Input Embeddings (Bi-ICE) module, which facilitates interpretability across the computational, algorithmic, and implementation levels. This module enhances transparency by generating predictions based on human-understandable concepts, quantifying their contributions, and localizing them within the inputs. Finally, we showcase enhanced transparency in image classification, measuring concept contributions, and pinpointing their locations within the inputs. Our approach highlights algorithmic interpretability by demonstrating the process of concept learning and its convergence.
[367] Explaining Object Detectors via Collective Contribution of Pixels
Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto
Main category: cs.CV
TL;DR: A game-theoretic method using Shapley values and interactions to explain object detectors by capturing both individual and collective pixel contributions for better visual explanations.
Details
Motivation: Existing visual explanation methods for object detectors focus only on individual pixel contributions, overlooking collective influences that are crucial for accurate detection. This leads to missing compositional cues or capturing spurious correlations.Method: Proposes a game-theoretic approach based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. The method provides explanations for both bounding box localization and class determination.
Result: Extensive experiments show the proposed method identifies important regions more accurately than state-of-the-art methods. The code will be publicly available.
Conclusion: The game-theoretic approach using Shapley values and interactions effectively captures collective pixel contributions, providing more accurate visual explanations for object detectors than existing methods.
Abstract: Visual explanations for object detectors are crucial for enhancing their reliability. Object detectors identify and localize instances by assessing multiple visual features collectively. When generating explanations, overlooking these collective influences in detections may lead to missing compositional cues or capturing spurious correlations. However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods. The code will be publicly available soon.
[368] Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han
Main category: cs.CV
TL;DR: Radial Attention: A sparse attention mechanism for video diffusion models that reduces computational complexity from O(n²) to O(n log n) by exploiting spatiotemporal energy decay, enabling longer video generation with significant speedups.
Details
Motivation: Video diffusion models suffer from high computational costs due to the temporal dimension. The authors identify "Spatiotemporal Energy Decay" - attention scores diminish with spatial and temporal distance between tokens - similar to physical signal decay in nature.Method: Proposes Radial Attention, a scalable sparse attention mechanism with O(n log n) complexity. It uses a static attention mask where each token attends to spatially nearby tokens, with attention window size shrinking with temporal distance. Allows pre-trained models to extend generation length via efficient LoRA-based fine-tuning.
Result: Achieves up to 1.9× speedup over dense attention while maintaining video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1 models. Enables 4× longer video generation with 4.4× reduced training costs and 3.7× faster inference compared to dense attention.
Conclusion: Radial Attention provides an efficient solution for long video generation by translating natural spatiotemporal energy decay into computational efficiency, making video diffusion models more scalable and practical.
Abstract: Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal distance between tokens increase, akin to the physical decay of signal or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $\mathcal{O}(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard $\mathcal{O}(n^2)$ dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention. With minimal tuning, it enables video generation up to 4$\times$ longer while reducing training costs by up to 4.4$\times$ compared to direct fine-tuning and accelerating inference by up to 3.7$\times$ compared to dense attention inference. Code is released at \href{https://github.com/mit-han-lab/radial-attention}{https://github.com/mit-han-lab/radial-attention}.
[369] EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation
Liangwei Jiang, Ruida Li, Zhifeng Zhang, Shuo Fang, Chenguang Ma
Main category: cs.CV
TL;DR: EmojiDiff: First end-to-end solution for simultaneous RGB-level expression control and high-fidelity identity preservation in portrait generation, addressing mutual interference between expression and identity.
Details
Motivation: Current portrait generation methods struggle with mutual interference between expression and identity - fine expression control introduces appearance semantics that affect identity, and even coarse control causes facial changes that compromise identity. Previous methods rely on coarse control or two-stage animation integration.Method: Two-stage scheme: (1) Decoupled training using ID-irrelevant Data Iteration (IDI) to synthesize cross-identity expression pairs by separating expression maintenance and identity alteration processes, (2) ID-enhanced Contrast Alignment (ICA) fine-tuning for rapid reconstruction and joint supervision of identity and expression information.
Result: Remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.
Conclusion: EmojiDiff successfully addresses the mutual interference problem in portrait generation, enabling simultaneous fine-grained expression control and high-fidelity identity preservation through innovative decoupled training and contrast alignment techniques.
Abstract: This paper aims to bring fine-grained expression control while maintaining high-fidelity identity in portrait generation. This is challenging due to the mutual interference between expression and identity: (i) fine expression control signals inevitably introduce appearance-related semantics (e.g., facial contours, and ratio), which impact the identity of the generated portrait; (ii) even coarse-grained expression control can cause facial changes that compromise identity, since they all act on the face. These limitations remain unaddressed by previous generation methods, which primarily rely on coarse control signals or two-stage inference that integrates portrait animation. Here, we introduce EmojiDiff, the first end-to-end solution that enables simultaneous control of extremely detailed expression (RGB-level) and high-fidelity identity in portrait generation. To address the above challenges, EmojiDiff adopts a two-stage scheme involving decoupled training and fine-tuning. For decoupled training, we innovate ID-irrelevant Data Iteration (IDI) to synthesize cross-identity expression pairs by dividing and optimizing the processes of maintaining expression and altering identity, thereby ensuring stable and high-quality data generation. Training the model with this data, we effectively disentangle fine expression features in the expression template from other extraneous information (e.g., identity, skin). Subsequently, we present ID-enhanced Contrast Alignment (ICA) for further fine-tuning. ICA achieves rapid reconstruction and joint supervision of identity and expression information, thus aligning identity representations of images with and without expression control. Experimental results demonstrate that our method remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.
[370] SAMCL: Empowering SAM to Continually Learn from Dynamic Domains with Extreme Storage Efficiency
Zeqing Wang, Kangye Ji, Di Wang, Haibin Zhang, Fei Cheng
Main category: cs.CV
TL;DR: SAMCL: A continual learning method for Segment Anything Model that addresses catastrophic forgetting in open-world scenarios by decomposing incremental knowledge into separate modules with a selector, achieving minimal forgetting and storage efficiency.
Details
Motivation: SAM struggles in open-world scenarios with diverse domains, and naive fine-tuning causes catastrophic forgetting when learning incrementally. There's a need for a continual learning approach that maintains performance across domains while managing storage efficiently.Method: Proposes SAMCL with two key components: 1) AugModule reduces LoRA storage by sharing parameters across layers and uses heatmaps from point prompts for domain adaptation; 2) Module Selector leverages SAM’s embeddings to distinguish domains and select appropriate modules during inference.
Result: Outperforms state-of-the-art methods with only 0.19% forgetting and at least 2.5% gain on unseen domains. Each AugModule requires just 0.233 MB (24.3% storage reduction), and selector buffer storage reduced by up to 256×.
Conclusion: SAMCL effectively addresses catastrophic forgetting in SAM for continual learning scenarios, achieving excellent performance with minimal storage overhead through modular decomposition and efficient domain selection.
Abstract: Segment Anything Model (SAM) struggles in open-world scenarios with diverse domains. In such settings, naive fine-tuning with a well-designed learning module is inadequate and often causes catastrophic forgetting issue when learning incrementally. To address this issue, we propose a novel continual learning (CL) method for SAM, termed SAMCL. Rather than relying on a fixed learning module, our method decomposes incremental knowledge into separate modules and trains a selector to choose the appropriate one during inference. However, this intuitive design introduces two key challenges: ensuring effective module learning and selection, and managing storage as tasks accumulate. To tackle these, we introduce two components: AugModule and Module Selector. AugModule reduces the storage of the popular LoRA learning module by sharing parameters across layers while maintaining accuracy. It also employs heatmaps-generated from point prompts-to further enhance domain adaptation with minimal additional cost. Module Selector leverages the observation that SAM’s embeddings can effectively distinguish domains, enabling high selection accuracy by training on low-consumed embeddings instead of raw images. Experiments show that SAMCL outperforms state-of-the-art methods, achieving only 0.19% forgetting and at least 2.5% gain on unseen domains. Each AugModule requires just 0.233 MB, reducing storage by at least 24.3% over other fine-tuning approaches. The buffer storage for Module Selector is further reduced by up to 256$\times$.
[371] Towards Unsupervised Domain Bridging via Image Degradation in Semantic Segmentation
Wangkai Li, Rui Sun, Huayu Mai, Tianzhu Zhang
Main category: cs.CV
TL;DR: DiDA is a plug-and-play unsupervised domain bridging approach for semantic segmentation that creates continuous intermediate domains via degradation operations and uses semantic shift compensation to preserve discriminative features.
Details
Motivation: Current self-training techniques for unsupervised domain adaptation in semantic segmentation overlook explicit modeling of domain-shared feature extraction, leading to performance degradation when networks are applied to different domains.Method: DiDA has two key modules: 1) Degradation-based Intermediate Domain Construction that creates continuous intermediate domains through simple image degradation operations to encourage learning domain-invariant features, and 2) Semantic Shift Compensation that leverages a diffusion encoder to disentangle and compensate for semantic shift information with degraded timesteps.
Result: Extensive experiments on multiple domain adaptive semantic segmentation benchmarks demonstrate that DiDA consistently achieves significant performance improvements across all settings.
Conclusion: DiDA provides an effective plug-and-play solution that supports various degradation operations and seamlessly integrates with existing UDA methods, addressing the domain adaptation challenge in semantic segmentation through explicit domain bridging.
Abstract: Semantic segmentation suffers from significant performance degradation when the trained network is applied to a different domain. To address this issue, unsupervised domain adaptation (UDA) has been extensively studied. Despite the effectiveness of selftraining techniques in UDA, they still overlook the explicit modeling of domain-shared feature extraction. In this paper, we propose DiDA, an unsupervised domain bridging approach for semantic segmentation. DiDA consists of two key modules: (1) Degradation-based Intermediate Domain Construction, which creates continuous intermediate domains through simple image degradation operations to encourage learning domain-invariant features as domain differences gradually diminish; (2) Semantic Shift Compensation, which leverages a diffusion encoder to disentangle and compensate for semantic shift information with degraded timesteps, preserving discriminative representations in the intermediate domains. As a plug-and-play solution, DiDA supports various degradation operations and seamlessly integrates with existing UDA methods. Extensive experiments on multiple domain adaptive semantic segmentation benchmarks demonstrate that DiDA consistently achieves significant performance improvements across all settings. Code is available at https://github.com/Woof6/DiDA.
[372] Causal Interpretability for Adversarial Robustness: A Hybrid Generative Classification Approach
Chunheng Zhao, Pierluigi Pisu, Gurcan Comert, Negash Begashaw, Varghese Vaidyan, Nina Christine Hubig
Main category: cs.CV
TL;DR: Deep ensemble model combines discriminative and generative approaches for adversarial robustness without adversarial training, showing correlation between interpretability and robustness.
Details
Motivation: Deep learning classifiers are vulnerable to adversarial attacks, and adversarial training doesn't address the fundamental opacity of black-box models. Need for approaches that achieve both high accuracy and robustness through better model understanding.Method: Two-level ensemble: bottom-level pre-trained discriminative network for feature extraction, top-level generative classification network using deep latent variable model to capture adversarial input distributions. Uses variational Bayes for inference without adversarial training.
Result: Superior robustness against white-box adversarial attacks on CIFAR-10 and CIFAR-100. Established correlations between interpretability (via counterfactual and feature interaction metrics) and adversarial robustness. Preliminary validation on Tiny-ImageNet shows scalability.
Conclusion: The proposed ensemble approach provides a practical solution for robust image classification by combining discriminative and generative modeling, demonstrating that improved interpretability correlates with better adversarial robustness.
Abstract: Deep learning-based discriminative classifiers, despite their remarkable success, remain vulnerable to adversarial examples that can mislead model predictions. While adversarial training can enhance robustness, it fails to address the intrinsic vulnerability stemming from the opaque nature of these black-box models. We present a deep ensemble model that combines discriminative features with generative models to achieve both high accuracy and adversarial robustness. Our approach integrates a bottom-level pre-trained discriminative network for feature extraction with a top-level generative classification network that models adversarial input distributions through a deep latent variable model. Using variational Bayes, our model achieves superior robustness against white-box adversarial attacks without adversarial training. Extensive experiments on CIFAR-10 and CIFAR-100 demonstrate our model’s superior adversarial robustness. Through evaluations using counterfactual metrics and feature interaction-based metrics, we establish correlations between model interpretability and adversarial robustness. Additionally, preliminary results on Tiny-ImageNet validate our approach’s scalability to more complex datasets, offering a practical solution for developing robust image classification models.
[373] CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering
Ben Vardi, Oron Nir, Ariel Shamir
Main category: cs.CV
TL;DR: CLIP-UP is a lightweight method that uses CLIP-based similarity measures to detect unanswerable VQA questions, preventing VLMs from making unnatural errors by withholding answers when questions don’t align with images.
Details
Motivation: Vision-Language Models often make unnatural errors by providing wrong answers to unanswerable questions (questions about objects not in the image), highlighting the need for a mechanism to detect when questions cannot be answered from the visual input.Method: CLIP-UP uses CLIP-based similarity measures to extract question-image alignment information, requiring only efficient training of a few additional layers while keeping original VLM weights unchanged. It detects unanswerability by measuring alignment between questions and images.
Result: CLIP-UP achieves significant improvements on benchmarks assessing unanswerability in both multiple-choice and open-ended VQA, surpassing other methods while preserving original performance on other tasks across several models.
Conclusion: The proposed CLIP-UP method effectively equips VLMs with the ability to detect and withhold answers to unanswerable questions, addressing a key limitation while maintaining model efficiency and preserving original capabilities.
Abstract: Vision-Language Models (VLMs) demonstrate remarkable capabilities in visual understanding and reasoning, such as in Visual Question Answering (VQA), where the model is asked a question related to a visual input. Still, these models can make distinctly unnatural errors, for example, providing (wrong) answers to unanswerable VQA questions, such as questions asking about objects that do not appear in the image. To address this issue, we propose CLIP-UP: CLIP-based Unanswerable Problem detection, a novel lightweight method for equipping VLMs with the ability to withhold answers to unanswerable questions. CLIP-UP leverages CLIP-based similarity measures to extract question-image alignment information to detect unanswerability, requiring efficient training of only a few additional layers, while keeping the original VLMs’ weights unchanged. Tested across several models, CLIP-UP achieves significant improvements on benchmarks assessing unanswerability in both multiple-choice and open-ended VQA, surpassing other methods, while preserving original performance on other tasks.
[374] Expectation-Maximization as the Engine of Scalable Medical Intelligence
Wenxuan Li, Pedro R. A. S. Bassi, Tianyu Lin, Yu-Cheng Chou, Jakob Wasserthal, Xinze Zhou, Qi Chen, Fabian Isensee, Yannick Kirchhoff, Maximilian Rokuss, Saikat Roy, Constantin Ulrich, Klaus Maier-Hein, Szymon Płotka, Xiaoxi Chen, Kang Wang, Yang Yang, Daguang Xu, Kai Ding, Yucheng Tang, Alan L. Yuille, Zongwei Zhou
Main category: cs.CV
TL;DR: ScaleMAI is an EM-based framework that co-evolves data annotation and model development, automatically correcting annotation errors while training models, with minimal human intervention, creating the largest CT scan dataset and achieving superior tumor diagnosis performance.
Details
Motivation: Constructing large, high-quality annotated medical datasets is extremely time-consuming and resource-intensive, requiring years of multidisciplinary effort. While active learning helps prioritize annotation, scaling still requires extensive manual correction of noisy annotations.Method: ScaleMAI uses an Expectation-Maximization (EM) process where: 1) Expectation step: AI model automatically identifies and corrects annotation mistakes; 2) Maximization step: refined data retrains the model to improve accuracy. Human experts review only <5% of cases that cannot be resolved automatically.
Result: Created largest CT scan dataset: 47,315 scans (4.8x larger than PanTS) with 4.16M per-voxel annotations for tumors and anatomical structures. Model exceeds human expert performance in tumor diagnosis (+7%), with significant gains in tumor detection (+10%) and segmentation (+14%) on benchmarks.
Conclusion: ScaleMAI successfully addresses the medical AI data bottleneck by co-evolving annotation and model development through EM, achieving superior performance with minimal human intervention, demonstrating a scalable framework for building high-quality medical datasets and models.
Abstract: Large, high-quality, annotated datasets are the foundation of medical AI research, but constructing even a small, moderate-quality, annotated dataset can take years of effort from multidisciplinary teams. Although active learning can prioritize what to annotate, scaling up still requires extensive manual efforts to revise the noisy annotations. We formulate this as a missing-data problem and develop ScaleMAI, a framework that unifies data annotation and model development co-evolution through an Expectation-Maximization (EM) process. In this iterative process, the AI model automatically identifies and corrects the mistakes in annotations (Expectation), while the refined annotated data retrain the model to improve accuracy (Maximization). In addition to the classical EM algorithm, ScaleMAI brings human experts into the loop to review annotations that cannot be adequately addressed by either Expectation or Maximization step (<5%). As a result, ScaleMAI progressively creates an annotated dataset of 47,315 CT scans (4.8x larger than the largest public dataset, PanTS) including 4,163,720 per-voxel annotations for benign/malignant tumors and 88 anatomical structures. ScaleMAI iteratively trains a model that exceeds human expert performance in tumor diagnosis (+7%), and outperforms models developed from smaller, moderate-quality datasets, with statistically significant gains in tumor detection (+10%) and segmentation (+14%) on two prestigious benchmarks.
[375] CrowdSplat: Exploring Gaussian Splatting For Crowd Rendering
Xiaohan Sun, Yinghan Xu, John Dingliana, Carol O’Sullivan
Main category: cs.CV
TL;DR: CrowdSplat uses 3D Gaussian Splatting for real-time, high-quality crowd rendering from monocular videos, with LoD optimization and GPU memory efficiency.
Details
Motivation: To enable real-time, high-quality rendering of dynamic crowds with diverse poses and outfits, addressing the computational challenges of realistic crowd simulation.Method: Two-stage framework: (1) avatar reconstruction using 3D Gaussian functions to represent animated human characters from monocular videos, (2) crowd synthesis with Level of Detail (LoD) rendering and GPU memory optimization.
Result: Achieves good rendering quality, memory efficiency, and computational performance in quantitative and qualitative evaluations, enabling real-time dynamic crowd simulation.
Conclusion: CrowdSplat is a viable solution for real-time, realistic crowd rendering with efficient GPU memory usage and computational performance.
Abstract: We present CrowdSplat, a novel approach that leverages 3D Gaussian Splatting for real-time, high-quality crowd rendering. Our method utilizes 3D Gaussian functions to represent animated human characters in diverse poses and outfits, which are extracted from monocular videos. We integrate Level of Detail (LoD) rendering to optimize computational efficiency and quality. The CrowdSplat framework consists of two stages: (1) avatar reconstruction and (2) crowd synthesis. The framework is also optimized for GPU memory usage to enhance scalability. Quantitative and qualitative evaluations show that CrowdSplat achieves good levels of rendering quality, memory efficiency, and computational performance. Through the.se experiments, we demonstrate that CrowdSplat is a viable solution for dynamic, realistic crowd simulation in real-time applications.
[376] DRWKV: Focusing on Object Edges for Low-Light Image Enhancement
Xuecheng Bai, Yuxiang Wang, Boyu Hu, Qinyuan Jie, Chuanzhi Xu, Kechen Li, Hongru Xiao, Vera Chung
Main category: cs.CV
TL;DR: DRWKV is a novel low-light image enhancement model that integrates Global Edge Retinex theory, Evolving WKV Attention, and Bilateral Spectrum Aligner to preserve edge continuity and structural details while improving visual naturalness with low computational cost.
Details
Motivation: Low-light image enhancement faces challenges in preserving object edge continuity and fine structural details under extreme illumination degradation, requiring better edge fidelity and artifact mitigation.Method: 1) Global Edge Retinex theory for decoupling illumination and edge structures; 2) Evolving WKV Attention with spiral-scanning mechanism for spatial edge continuity; 3) Bilateral Spectrum Aligner and MS2-Loss for luminance/chrominance alignment.
Result: Achieves leading performance on five LLIE benchmarks in PSNR, SSIM, and NIQE metrics while maintaining low computational complexity. Also enhances downstream low-light multi-object tracking performance.
Conclusion: DRWKV effectively addresses edge preservation and structural detail enhancement in low-light conditions, demonstrates strong generalization capabilities, and provides practical benefits for downstream computer vision tasks.
Abstract: Low-light image enhancement remains a challenging task, particularly in preserving object edge continuity and fine structural details under extreme illumination degradation. In this paper, we propose a novel model, DRWKV (Detailed Receptance Weighted Key Value), which integrates our proposed Global Edge Retinex (GER) theory, enabling effective decoupling of illumination and edge structures for enhanced edge fidelity. Secondly, we introduce Evolving WKV Attention, a spiral-scanning mechanism that captures spatial edge continuity and models irregular structures more effectively. Thirdly, we design the Bilateral Spectrum Aligner (Bi-SAB) and a tailored MS2-Loss to jointly align luminance and chrominance features, improving visual naturalness and mitigating artifacts. Extensive experiments on five LLIE benchmarks demonstrate that DRWKV achieves leading performance in PSNR, SSIM, and NIQE while maintaining low computational complexity. Furthermore, DRWKV enhances downstream performance in low-light multi-object tracking tasks, validating its generalization capabilities.
[377] VLM-Assisted Continual learning for Visual Question Answering in Self-Driving
Yuxin Lin, Mengshi Qi, Liang Liu, Huadong Ma
Main category: cs.CV
TL;DR: A novel continual learning framework for Visual Question Answering in autonomous driving that combines Vision-Language Models with selective memory replay and knowledge distillation to prevent catastrophic forgetting across sequential driving tasks.
Details
Motivation: Traditional VQA models in autonomous driving suffer from catastrophic forgetting when sequentially learning new tasks (perception, prediction, planning), requiring different knowledge forms. This limits system reliability and adaptability in real-world driving scenarios.Method: Proposes a continual learning framework integrating VLMs with: 1) selective memory replay, 2) knowledge distillation (using previous model as teacher), and 3) task-specific projection layer regularization that calculates loss based on feature representation divergence to ensure learning continuity.
Result: Evaluated on DriveLM dataset, the framework achieves substantial performance improvements of 20.11% to 35.16% across various metrics, demonstrating enhanced resilience and reliability in VQA systems for autonomous driving.
Conclusion: Combining continual learning with VLMs effectively enhances VQA system resilience in autonomous driving by mitigating catastrophic forgetting, with significant performance gains. Source code will be released.
Abstract: In this paper, we propose a novel approach for solving the Visual Question Answering (VQA) task in autonomous driving by integrating Vision-Language Models (VLMs) with continual learning. In autonomous driving, VQA plays a vital role in enabling the system to understand and reason about its surroundings. However, traditional models often struggle with catastrophic forgetting when sequentially exposed to new driving tasks, such as perception, prediction, and planning, each requiring different forms of knowledge. To address this challenge, we present a novel continual learning framework that combines VLMs with selective memory replay and knowledge distillation, reinforced by task-specific projection layer regularization. The knowledge distillation allows a previously trained model to act as a “teacher” to guide the model through subsequent tasks, minimizing forgetting. Meanwhile, task-specific projection layers calculate the loss based on the divergence of feature representations, ensuring continuity in learning and reducing the shift between tasks. Evaluated on the DriveLM dataset, our framework shows substantial performance improvements, with gains ranging from 20.11% to 35.16% across various metrics. These results highlight the effectiveness of combining continual learning with VLMs in enhancing the resilience and reliability of VQA systems in autonomous driving. We will release our source code.
[378] Pic2Diagnosis: A Method for Diagnosis of Cardiovascular Diseases from the Printed ECG Pictures
Oğuzhan Büyüksolak, İlkay Öksüz
Main category: cs.CV
TL;DR: Two-step curriculum learning with segmentation pre-training and grayscale fine-tuning, plus ensemble averaging, achieves high-performance CVD diagnosis directly from ECG images without digitization.
Details
Motivation: Current ECG diagnosis relies on outdated datasets and traditional algorithms with limited accuracy, requiring digitization that isn't always available in resource-limited settings where printed/scanned ECG images are common.Method: Two-step curriculum learning: 1) pre-train classification model on segmentation masks, 2) fine-tune on grayscale, inverted ECG images. Enhanced with ensemble of three models with averaged outputs.
Result: Achieved AUC of 0.9534 and F1 score of 0.7801 on BHF ECG Challenge dataset, outperforming individual models. Robust to real-world artifacts and simplifies diagnostic process.
Conclusion: Method provides reliable automated CVD diagnosis directly from ECG images, particularly valuable for resource-limited settings, enabling rapid and accurate diagnosis critical for timely intervention in urgent CVD cases.
Abstract: The electrocardiogram (ECG) is a vital tool for diagnosing heart diseases. However, many disease patterns are derived from outdated datasets and traditional stepwise algorithms with limited accuracy. This study presents a method for direct cardiovascular disease (CVD) diagnosis from ECG images, eliminating the need for digitization. The proposed approach utilizes a two-step curriculum learning framework, beginning with the pre-training of a classification model on segmentation masks, followed by fine-tuning on grayscale, inverted ECG images. Robustness is further enhanced through an ensemble of three models with averaged outputs, achieving an AUC of 0.9534 and an F1 score of 0.7801 on the BHF ECG Challenge dataset, outperforming individual models. By effectively handling real-world artifacts and simplifying the diagnostic process, this method offers a reliable solution for automated CVD diagnosis, particularly in resource-limited settings where printed or scanned ECG images are commonly used. Such an automated procedure enables rapid and accurate diagnosis, which is critical for timely intervention in CVD cases that often demand urgent care.
[379] FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation
Bin Yang, Alexandru Paul Condurache
Main category: cs.CV
TL;DR: FLARES improves 3D semantic segmentation by training with multiple range images instead of single panoramic views, addressing efficiency and accuracy issues through specialized data augmentation and post-processing.
Details
Motivation: Current range-view methods for LiDAR scene understanding face efficiency challenges with wide panoramic images and information loss from spherical projection. While splitting point clouds into multiple range images improves both accuracy and efficiency, it introduces new problems like class imbalance and projection artifacts.Method: FLARES introduces a training paradigm using multiple range images from split point clouds, with two tailored data augmentation techniques and specialized post-processing methods designed specifically for multi-range settings to address class imbalance and projection artifacts.
Result: FLARES achieves 2.1%~7.9% mIoU improvements on SemanticKITTI and 1.8%~3.9% mIoU on nuScenes, while delivering over 40% speed-up in inference time, demonstrating strong generalization across different architectures.
Conclusion: The multi-range image approach with FLARES’ specialized techniques effectively balances accuracy and efficiency for 3D scene understanding in autonomous driving, overcoming limitations of traditional panoramic range-view methods.
Abstract: 3D scene understanding is a critical yet challenging task in autonomous driving due to the irregularity and sparsity of LiDAR data, as well as the computational demands of processing large-scale point clouds. Recent methods leverage range-view representations to enhance efficiency, but they often adopt higher azimuth resolutions to mitigate information loss during spherical projection, where only the closest point is retained for each 2D grid. However, processing wide panoramic range-view images remains inefficient and may introduce additional distortions. Our empirical analysis shows that training with multiple range images, obtained from splitting the full point cloud, improves both segmentation accuracy and computational efficiency. However, this approach also poses new challenges of exacerbated class imbalance and increase in projection artifacts. To address these, we introduce FLARES, a novel training paradigm that incorporates two tailored data augmentation techniques and a specialized post-processing method designed for multi-range settings. Extensive experiments demonstrate that FLARES is highly generalizable across different architectures, yielding 2.1%~7.9% mIoU improvements on SemanticKITTI and 1.8%~3.9% mIoU on nuScenes, while delivering over 40% speed-up in inference.
[380] Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers
Anthony Fuller, Yousef Yassin, Daniel G. Kyrollos, Evan Shelhamer, James R. Green
Main category: cs.CV
TL;DR: Jumbo improves ViT speed by reducing patch token width while adding a single wide “Jumbo token” with its own efficient FFN, maintaining ViT generality while achieving better speed-accuracy trade-offs.
Details
Motivation: ViTs are general and accurate but slow; existing speed-up methods either lose generality (hybrid architectures) or sacrifice accuracy (token shrinking). Need to make ViTs faster while preserving their flexibility and compatibility with ViT methods.Method: Introduce a “Jumbo token” - reduce patch token width while increasing global token width with a single wide token. This Jumbo token has its own wider FFN for increased capacity, but is efficient: processes only one token for speed, and shares parameters across all layers for memory efficiency. Maintains attention-only, non-hierarchical plain ViT architecture.
Result: Improves over ViT baselines with Registers from Nano to Large scales (0.1-13% on ImageNet-1K) while maintaining speed/throughput. Also improves MAE pre-training (4.9% linear probing on ImageNet-1K), test-time adaptation (5.2% on ImageNet-C), and time series modeling. Achieves better speed-accuracy trade-offs than specialized non-ViT models while maintaining plain-ViT compatibility.
Conclusion: Jumbo token approach successfully makes ViTs faster while preserving their generality and compatibility with existing ViT methods, achieving practical speed improvements without sacrificing the flexibility that makes ViTs valuable.
Abstract: ViTs are general and accurate, and address many tasks, but ViTs are slow, and are not always practical when efficiency is key. Existing methods for faster ViTs design hybrid non-ViT architectures, losing generality, or shrink their tokens, sacrificing accuracy. While many non-ViT architectures are both fast and accurate, they cannot flexibly process other input shapes, pre-train by SOTA self-supervised learning, reduce computation by dropping tokens, and more like ViTs can. We make ViTs faster by reducing patch token width while increasing global token width by adding a new Jumbo token. Our wider Jumbo token is processed by its own wider FFN to increase model capacity. Yet our Jumbo FFN is efficient: it processes a single token, for speed, and its parameters are shared across all layers, for memory. Crucially, our Jumbo is attention-only and non-hierarchical, like a plain ViT, so it is simple, scalable, flexible, and compatible with ViT methods new and old. Jumbo improves over ViT baselines with Registers from Nano to Large scales while maintaining speed/throughput on ImageNet-1K (0.1-13%). Jumbo also improves MAE pre-training (4.9% linear probing on ImageNet-1K), test-time adaptation (5.2% on ImageNet-C), and time series modeling. Our Jumbo models even achieve better speed-accuracy trade-offs than specialized non-ViT compute-efficient models, while maintaining plain-ViT compatibility for practicality. Code and weights available: https://github.com/antofuller/jumbo
[381] A Survey on Industrial Anomalies Synthesis
Yanshu Wang, Xichen Xu, Jiaqi Liu, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu
Main category: cs.CV
TL;DR: This paper provides a comprehensive review of anomaly synthesis methodologies, introducing the first industrial anomaly synthesis taxonomy and covering about 40 methods across four categories, with special focus on cross-modality synthesis and large-scale vision-language models.
Details
Motivation: Existing surveys on anomaly synthesis are limited in scope, focusing on only a few techniques and lacking an overall field view. They also fail to understand interconnections between methods and don't provide formal classification frameworks, which hampers structured comparisons and trend identification.Method: The study offers a unified review covering approximately 40 representative methods across four main categories: Hand-crafted, Distribution-hypothesis-based, Generative models (GM)-based, and Vision-language models (VLM)-based synthesis. The authors introduce the first industrial anomaly synthesis (IAS) taxonomy and explore cross-modality synthesis and large-scale VLM integration.
Result: The paper provides a comprehensive taxonomy that offers a fine-grained framework reflecting methodological progress and practical implications. It analyzes the integration of multimodal data and VLM in anomaly synthesis, identifying their benefits, challenges, and future prospects.
Conclusion: This survey establishes a structured foundation for future research in anomaly synthesis by providing a unified taxonomy and roadmap for boosting industrial anomaly synthesis through multimodal learning, with special emphasis on vision-language models and cross-modality approaches.
Abstract: This paper comprehensively reviews anomaly synthesis methodologies. Existing surveys focus on limited techniques, missing an overall field view and understanding method interconnections. In contrast, our study offers a unified review, covering about 40 representative methods across Hand-crafted, Distribution-hypothesis-based, Generative models (GM)-based, and Vision-language models (VLM)-based synthesis. We introduce the first industrial anomaly synthesis (IAS) taxonomy. Prior works lack formal classification or use simplistic taxonomies, hampering structured comparisons and trend identification. Our taxonomy provides a fine-grained framework reflecting methodological progress and practical implications, grounding future research. Furthermore, we explore cross-modality synthesis and large-scale VLM. Previous surveys overlooked multimodal data and VLM in anomaly synthesis, limiting insights into their advantages. Our survey analyzes their integration, benefits, challenges, and prospects, offering a roadmap to boost IAS with multimodal learning. More resources are available at https://github.com/M-3LAB/awesome-anomaly-synthesis.
[382] Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark
Jiahao Wang, Xiangyu Cao, Jiaru Zhong, Yuner Zhang, Zeyu Han, Haibao Yu, Chuang Zhang, Lei He, Shaobing Xu, Jianqiang Wang
Main category: cs.CV
TL;DR: Griffin is a comprehensive aerial-ground cooperation 3D perception dataset with 250+ scenes (37k+ frames) featuring varied drone altitudes, weather conditions, realistic dynamics, and occlusion-aware annotations, plus a benchmarking framework for cooperative detection and tracking.
Details
Motivation: Traditional vehicle-to-vehicle and vehicle-to-infrastructure cooperative perception systems face significant economic barriers. Aerial-ground cooperation (pairing ground vehicles with drones) offers a more economically viable alternative, but lacks high-quality public datasets and benchmarks to advance the field.Method: Created Griffin dataset using CARLA-AirSim co-simulation for realistic drone dynamics, with varied drone altitudes (20-60m), diverse weather conditions, and critical occlusion-aware 3D annotations. Developed a unified benchmarking framework with protocols to evaluate communication efficiency, altitude adaptability, and robustness to communication issues.
Result: Dataset contains over 250 dynamic scenes (37k+ frames) with comprehensive annotations. Benchmarking framework enables evaluation of cooperative detection and tracking methods, revealing effectiveness and limitations of current approaches through different cooperative paradigms.
Conclusion: Griffin bridges the critical gap in aerial-ground cooperation research by providing the first comprehensive dataset and benchmarking framework, enabling systematic evaluation and providing crucial insights for future research in this economically viable alternative to traditional cooperative perception systems.
Abstract: While cooperative perception can overcome the limitations of single-vehicle systems, the practical implementation of vehicle-to-vehicle and vehicle-to-infrastructure systems is often impeded by significant economic barriers. Aerial-ground cooperation (AGC), which pairs ground vehicles with drones, presents a more economically viable and rapidly deployable alternative. However, this emerging field has been held back by a critical lack of high-quality public datasets and benchmarks. To bridge this gap, we present \textit{Griffin}, a comprehensive AGC 3D perception dataset, featuring over 250 dynamic scenes (37k+ frames). It incorporates varied drone altitudes (20-60m), diverse weather conditions, realistic drone dynamics via CARLA-AirSim co-simulation, and critical occlusion-aware 3D annotations. Accompanying the dataset is a unified benchmarking framework for cooperative detection and tracking, with protocols to evaluate communication efficiency, altitude adaptability, and robustness to communication latency, data loss and localization noise. By experiments through different cooperative paradigms, we demonstrate the effectiveness and limitations of current methods and provide crucial insights for future research. The dataset and codes are available at https://github.com/wang-jh18-SVM/Griffin.
[383] MIRAM: Masked Image Autoencoders Across Multiple Scales with Hybrid-Attention Mechanism for Breast Lesion Risk Prediction
Hung Q. Vo, Pengyu Yuan, Zheng Yin, Kelvin K. Wong, Chika F. Ezeana, Son T. Ly, Stephen T. C. Wong, Hien V. Nguyen
Main category: cs.CV
TL;DR: A new self-supervised learning method using multi-scale image reconstruction from masked images improves medical image classification performance, particularly for breast imaging tasks.
Details
Motivation: While masked image modeling (MIM) has emerged as a powerful SSL technique with strong inductive bias for spatial and semantic understanding, there's potential to develop more challenging pretext tasks that can extract even more robust features, especially for medical imaging where fine details are crucial.Method: Proposes a scalable SSL approach using multi-scale image reconstruction from randomly masked input images as the pretext task. The method reconstructs high-resolution images to force the model to attend to finer spatial details, which is particularly beneficial for medical image analysis.
Result: The method improves classification performance on the CBIS-DDSM dataset: 3% increase in AP and 1% increase in AUC for pathology classification, and 4% increase in AP and 2% increase in AUC for mass margins classification compared to SOTA algorithms.
Conclusion: Multi-scale image reconstruction from masked images provides an effective SSL approach for medical imaging, enabling models to capture finer spatial details that improve classification performance on challenging medical image analysis tasks.
Abstract: Self-supervised learning (SSL) has garnered substantial interest within the machine learning and computer vision communities. Two prominent approaches in SSL include contrastive-based learning and self-distillation utilizing cropping augmentation. Lately, masked image modeling (MIM) has emerged as a more potent SSL technique, employing image inpainting as a pretext task. MIM creates a strong inductive bias toward meaningful spatial and semantic understanding. This has opened up new opportunities for SSL to contribute not only to classification tasks but also to more complex applications like object detection and image segmentation. Building upon this progress, our research paper introduces a scalable and practical SSL approach centered around more challenging pretext tasks that facilitate the acquisition of robust features. Specifically, we leverage multi-scale image reconstruction from randomly masked input images as the foundation for feature learning. Our hypothesis posits that reconstructing high-resolution images enables the model to attend to finer spatial details, particularly beneficial for discerning subtle intricacies within medical images. The proposed SSL features help improve classification performance on the Curated Breast Imaging Subset of Digital Database for Screening Mammography (CBIS-DDSM) dataset. In pathology classification, our method demonstrates a 3% increase in average precision (AP) and a 1% increase in the area under the receiver operating characteristic curve (AUC) when compared to state-of-the-art (SOTA) algorithms. Moreover, in mass margins classification, our approach achieves a 4% increase in AP and a 2% increase in AUC.
[384] GeoShield: Safeguarding Geolocation Privacy from Vision-Language Models via Adversarial Perturbations
Xinwei Liu, Xiaojun Jia, Yuan Xun, Simeng Qin, Xiaochun Cao
Main category: cs.CV
TL;DR: GeoShield is a novel adversarial framework that protects geoprivacy by generating effective perturbations against Vision-Language Models’ geolocation inference capabilities, outperforming prior methods with minimal visual impact.
Details
Motivation: Advanced VLMs like GPT-4o can accurately infer user locations from public images, creating significant geoprivacy risks. Existing adversarial perturbation methods are inadequate for this scenario due to poor performance on high-resolution images, low perturbation budgets, and introduction of irrelevant semantic content.Method: GeoShield uses three key modules: 1) Feature disentanglement to separate geographical and non-geographical information, 2) Exposure element identification to pinpoint geo-revealing regions, and 3) Scale-adaptive enhancement that jointly optimizes perturbations at global and local levels for effectiveness across resolutions.
Result: Extensive experiments on challenging benchmarks show GeoShield consistently surpasses prior methods in black-box settings, achieving strong privacy protection with minimal impact on visual or semantic quality.
Conclusion: This is the first work exploring adversarial perturbations for defending against geolocation inference by advanced VLMs, providing a practical and effective solution to escalating privacy concerns in real-world scenarios.
Abstract: Vision-Language Models (VLMs) such as GPT-4o now demonstrate a remarkable ability to infer users’ locations from public shared images, posing a substantial risk to geoprivacy. Although adversarial perturbations offer a potential defense, current methods are ill-suited for this scenario: they often perform poorly on high-resolution images and low perturbation budgets, and may introduce irrelevant semantic content. To address these limitations, we propose GeoShield, a novel adversarial framework designed for robust geoprivacy protection in real-world scenarios. GeoShield comprises three key modules: a feature disentanglement module that separates geographical and non-geographical information, an exposure element identification module that pinpoints geo-revealing regions within an image, and a scale-adaptive enhancement module that jointly optimizes perturbations at both global and local levels to ensure effectiveness across resolutions. Extensive experiments on challenging benchmarks show that GeoShield consistently surpasses prior methods in black-box settings, achieving strong privacy protection with minimal impact on visual or semantic quality. To our knowledge, this work is the first to explore adversarial perturbations for defending against geolocation inference by advanced VLMs, providing a practical and effective solution to escalating privacy concerns.
[385] Enhanced Spatiotemporal Consistency for Image-to-LiDAR Data Pretraining
Xiang Xu, Lingdong Kong, Hui Shuai, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu, Qingshan Liu
Main category: cs.CV
TL;DR: SuperFlow++ is a novel LiDAR representation learning framework that integrates spatiotemporal cues using consecutive LiDAR-camera pairs, outperforming state-of-the-art methods across 11 datasets while being computationally efficient.
Details
Motivation: Existing LiDAR representation learning methods focus on spatial alignment but overlook temporal dynamics critical for capturing motion and scene continuity in driving scenarios, limiting their effectiveness.Method: Proposes four key components: 1) view consistency alignment across camera views, 2) dense-to-sparse consistency regularization for varying point cloud densities, 3) flow-based contrastive learning for temporal relationships, and 4) temporal voting for semantic propagation across LiDAR scans.
Result: Outperforms state-of-the-art methods across 11 heterogeneous LiDAR datasets on diverse tasks and driving conditions. Scaling 2D and 3D backbones reveals emergent properties for scalable 3D foundation models.
Conclusion: SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving with strong generalizability and computational efficiency, providing insights for developing scalable 3D foundation models.
Abstract: LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose SuperFlow++, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: (1) a view consistency alignment module to unify semantic information across camera views, (2) a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving. The code is publicly available at https://github.com/Xiangxu-0103/SuperFlow
[386] Math Blind: Failures in Diagram Understanding Undermine Reasoning in MLLMs
Yanpeng Sun, Shan Zhang, Wei Tang, Aotian Chen, Piotr Koniusz, Kai Zou, Yuan Xue, Anton van den Hengel
Main category: cs.CV
TL;DR: MLLMs struggle with diagram understanding due to poor visual perception, not just reasoning flaws. Training on structural graph representations improves both perception (+79% grounding) and reasoning (3-4% benchmark gains).
Details
Motivation: Diagrams pose unique challenges for MLLMs distinct from natural images, yet current models show flawed reasoning and hallucinations. The paper investigates whether these limitations stem from poor diagram perception rather than reasoning alone.Method: Developed a diagnostic test suite isolating perception from reasoning, evaluated MLLMs on basic perceptual tasks (shape classification, counting, relationship identification, grounding). Proposed training models on structural graph representations of diagrams (primitives and relationships).
Result: MLLMs performed poorly on basic perceptual tasks with near-zero accuracy on fine-grained grounding. Models exhibited “blind faith in text” (Math Blind). Training on structural representations achieved +79% gain on grounding and 3-4% cross-suite improvements on three public benchmarks without additional reasoning data.
Conclusion: Low-level perception supports faithful high-level reasoning in mathematical MLLMs. Capturing diagrams’ structural properties as graphs is essential for improving diagram understanding. Provides methodological frameworks and empirical evidence for future research.
Abstract: Diagrams represent a form of visual language that encodes abstract concepts and relationships through structured symbols and their spatial arrangements. Unlike natural images, they are inherently symbolic, and entirely artificial. They thus pose unique challenges for Multimodal Large Language Models (MLLMs) distinct from natural image processing. Recent studies have shown that MLLMs often exhibit flawed reasoning and hallucinations when handling diagram inputs. We investigate here whether these limitations stem from shortcomings in the models’ ability to interpret diagrams themselves. To this end, we develop a diagnostic test suite that isolates perception from reasoning. Our systematic evaluation reveals that MLLMs perform poorly on basic perceptual tasks, e.g., shape classification, object counting, relationship identification, and object grounding, with near-zero accuracy on fine-grained grounding. Further analysis shows that weak diagram perception leads to “blind faith in text”, where models rely on textual shortcuts rather than visual understanding (that is, they are Math Blind). We hypothesize that enabling models to capture the inherent structural properties of diagrams, represented as graphs of primitives and their interrelationships, is essential for improving diagram understanding. Experiments with 7B and 32B MLLMs validate this assumption, with models trained on such representations achieving a +79% gain on the grounding task. Crucially, these gains transfer to reasoning, achieving 3-4% cross-suite improvements on three public benchmarks even without additional chain-of-thought reasoning data. Our findings demonstrate that low-level perception supports faithful high-level reasoning in mathematical MLLMs. We provide both methodological frameworks and empirical evidence to guide future research in this direction.
[387] TranSplat: Instant Cross-Scene Object Relighting in Gaussian Splatting via Spherical Harmonic Transfer
Boyang, Yu, Yanlin Jin, Yun He, Akshat Dave, Guha Balakrishnan
Main category: cs.CV
TL;DR: TranSplat enables fast object relighting in 3D Gaussian Splatting using spherical harmonic products without explicit BRDF computation, achieving comparable results to inverse rendering methods with much faster runtime.
Details
Motivation: To enable fast and accurate object relighting within the 3D Gaussian Splatting framework when transferring objects between scenes, avoiding the computational overhead of conventional inverse rendering approaches.Method: Uses a theoretical radiance transfer identity for cross-scene relighting with radially symmetric BRDFs, involving only products of spherical harmonic appearance coefficients of object, source, and target environment maps. Automatically infers unknown environment maps directly from GS representations.
Result: Demonstrates comparable 3D object relighting performance to recent inverse rendering-based GS methods but with a fraction of their runtime. Works well on both synthetic and real-world scenes, producing perceptually realistic renderings even beyond ideal radially symmetric BRDF assumptions.
Conclusion: TranSplat provides a lightweight, efficient path for object relighting in Gaussian Splatting framework, offering practical performance despite theoretical limitations to radially symmetric BRDFs, opening new possibilities for fast scene manipulation.
Abstract: We present TranSplat, a method for fast and accurate object relighting for the 3D Gaussian Splatting (GS) framework when transferring a 3D object from a source GS scene to a target GS scene. TranSplat is based on a theoretical radiance transfer identity for cross-scene relighting of objects with radially symmetric BRDFs that involves only taking simple products of spherical harmonic appearance coefficients of the object, source, and target environment maps without any explicit computation of scene quantities (e.g., the BRDFs themselves). TranSplat is the first method to demonstrate how this theoretical identity may be used to perform relighting within the GS framework, and furthermore, by automatically inferring unknown source and target environment maps directly from the source and target scene GS representations. We evaluated TranSplat on several synthetic and real-world scenes and objects, demonstrating comparable 3D object relighting performance to recent conventional inverse rendering-based GS methods with a fraction of their runtime. While TranSplat is theoretically best-suited for radially symmetric BRDFs, results demonstrate that TranSplat still offers perceptually realistic renderings on real scenes and opens a valuable, lightweight path forward to relighting with the GS framework.
[388] Blurry-Edges: Photon-Limited Depth Estimation from Defocused Boundaries
Wei Xu, Charles James Wagner, Junjie Luo, Qi Guo
Main category: cs.CV
TL;DR: A novel method for robust depth estimation from photon-limited defocused images using a Blurry-Edges representation and deep learning to overcome noise sensitivity in traditional depth-from-defocus approaches.
Details
Motivation: Depth from defocus (DfD) methods are fundamentally sensitive to image noise, making depth extraction challenging from photon-limited, defocused images where accurate defocus blur estimation is difficult due to noise.Method: Proposes a Blurry-Edges image patch representation that stores low-level patch information (boundaries, color, smoothness). Develops a deep neural network architecture that predicts this representation from a pair of differently defocused images, then calculates depth using a derived closed-form DfD relation.
Result: Experimental results on synthetic and real data show the method achieves the highest depth estimation accuracy on photon-limited images compared to a broad range of state-of-the-art DfD methods.
Conclusion: The proposed Blurry-Edges representation combined with deep learning enables robust depth estimation from photon-limited defocused images, overcoming the noise sensitivity limitations of traditional DfD approaches.
Abstract: Extracting depth information from photon-limited, defocused images is challenging because depth from defocus (DfD) relies on accurate estimation of defocus blur, which is fundamentally sensitive to image noise. We present a novel approach to robustly measure object depths from photon-limited images along the defocused boundaries. It is based on a new image patch representation, Blurry-Edges, that explicitly stores and visualizes a rich set of low-level patch information, including boundaries, color, and smoothness. We develop a deep neural network architecture that predicts the Blurry-Edges representation from a pair of differently defocused images, from which depth can be calculated using a closed-form DfD relation we derive. The experimental results on synthetic and real data show that our method achieves the highest depth estimation accuracy on photon-limited images compared to a broad range of state-of-the-art DfD methods.
[389] Three Forensic Cues for JPEG AI Images
Sandra Bergmann, Fabian Brand, Christian Riess
Main category: cs.CV
TL;DR: The paper proposes three novel forensic methods for detecting and analyzing JPEG AI compressed images, addressing challenges where traditional JPEG forensic tools fail and JPEG AI artifacts can be confused with DeepFakes.
Details
Motivation: JPEG AI compression offers superior quality at much lower bitrates than traditional JPEG, but existing forensic tools don't work on it, and its artifacts can be mistaken for DeepFakes, creating a critical need for new forensic approaches.Method: Three interpretable forensic algorithms based on: 1) color channel correlations introduced by JPEG AI preprocessing, 2) diminishing distortion differences from repeated compression (similar to classic JPEG forensics), and 3) latent space quantization patterns to distinguish real JPEG AI images from synthetically generated ones.
Result: The proposed methods provide the first forensic toolset for JPEG AI, enabling detection of JPEG AI compression, identification of recompression, and distinction between real JPEG AI images and synthetic DeepFakes.
Conclusion: This work establishes foundational forensic approaches for AI-compressed images, offering interpretable methods that address critical gaps in digital forensics and should inspire further research in this emerging field.
Abstract: The JPEG standard was vastly successful. Currently, the first AI-based compression method JPEG AI'' will be standardized. JPEG AI brings remarkable benefits. JPEG AI images exhibit impressive image quality at bitrates that are an order of magnitude lower than images compressed with traditional JPEG. However, forensic analysis of JPEG AI has to be completely re-thought: forensic tools for traditional JPEG do not transfer to JPEG AI, and artifacts from JPEG AI are easily confused with artifacts from artificially generated images (DeepFakes’’). This creates a need for novel forensic approaches to detection and distinction of JPEG AI images. In this work, we make a first step towards a forensic JPEG AI toolset. We propose three cues for forensic algorithms for JPEG AI. These algorithms address three forensic questions: first, we show that the JPEG AI preprocessing introduces correlations in the color channels that do not occur in uncompressed images. Second, we show that repeated compression of JPEG AI images leads to diminishing distortion differences. This can be used to detect recompression, in a spirit similar to some classic JPEG forensics methods. Third, we show that the quantization of JPEG AI images in the latent space can be used to distinguish real images with JPEG AI compression from synthetically generated images. The proposed methods are interpretable for a forensic analyst, and we hope that they inspire further research in the forensics of AI-compressed images.
[390] Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling
Ziran Qin, Youru Lv, Mingbao Lin, Hang Guo, Zeren Zhang, Danping Zou, Weiyao Lin
Main category: cs.CV
TL;DR: HACK is a training-free KV cache compression framework for VAR models that classifies attention heads into contextual and structural types, applying pattern-specific compression strategies to reduce memory overhead and speed up inference.
Details
Motivation: VAR models suffer from high attention complexity and severe memory overhead due to accumulating KV caches across scales. Existing compression methods perform poorly because they treat all attention heads uniformly, failing to account for their different functional roles.Method: HACK uses offline classification to separate attention heads into Contextual Heads (semantic consistency) and Structural Heads (spatial coherence). It then applies pattern-specific compression strategies with asymmetric cache budgets for each category, constraining average KV cache length within a fixed budget B.
Result: Achieves up to 70% KV cache compression without quality degradation, reducing theoretical attention complexity from O(n⁴) to O(Bn²). Provides 1.75× memory reduction and 1.57× speedup on Infinity-8B model across text-to-image and class-conditional tasks.
Conclusion: HACK effectively addresses VAR models’ memory and computational bottlenecks through head-aware KV cache compression, demonstrating strong generalizability across different VAR architectures while maintaining output quality.
Abstract: Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality content generation with substantially fewer decoding steps. However, existing VAR models suffer from significant attention complexity and severe memory overhead due to the accumulation of key-value (KV) caches across scales. In this paper, we tackle this challenge by introducing KV cache compression into the next-scale generation paradigm. We begin with a crucial observation: attention heads in VAR models can be divided into two functionally distinct categories: Contextual Heads focus on maintaining semantic consistency, while Structural Heads are responsible for preserving spatial coherence. This structural divergence causes existing one-size-fits-all compression methods to perform poorly on VAR models. To address this, we propose HACK, a training-free Head-Aware KV cache Compression frameworK. HACK utilizes an offline classification scheme to separate head types, enabling it to apply pattern-specific compression strategies with asymmetric cache budgets for each category. By doing so, HACK effectively constrains the average KV cache length within a fixed budget $B$, reducing the theoretical attention complexity from $\mathcal{O}(n^4)$ to $\mathcal{O}(Bn^2)$. Extensive experiments on multiple VAR models across text-to-image and class-conditional tasks validate the effectiveness and generalizability of HACK. It achieves up to 70% KV cache compression without degrading output quality, resulting in memory savings and faster inference. For example, HACK provides a $1.75\times$ memory reduction and a $1.57\times$ speedup on Infinity-8B.
[391] FLAIR: Frequency- and Locality-Aware Implicit Neural Representations
Sukhun Ko, Seokhyun Yoon, Dahyeon Kye, Kyle Min, Chanho Eom, Jihyong Oh
Main category: cs.CV
TL;DR: FLAIR introduces frequency- and locality-aware implicit neural representations with Band-Localized Activation and Wavelet-Energy-Guided Encoding to address spectral bias and improve signal representation.
Details
Motivation: Existing Implicit Neural Representations (INRs) lack frequency selectivity and spatial localization, leading to spectral bias where they learn low-frequency components early but struggle with high-frequency details, and over-rely on redundant signal components.Method: Proposes FLAIR with two key innovations: 1) Band-Localized Activation (BLA) for joint frequency selection and spatial localization under time-frequency uncertainty principle constraints, and 2) Wavelet-Energy-Guided Encoding (WEGE) that uses discrete wavelet transform to compute energy scores and explicitly guide frequency information for precise frequency selection and adaptive band control.
Result: The method consistently outperforms existing INRs in 2D image representation, 3D shape reconstruction, and novel view synthesis.
Conclusion: FLAIR effectively addresses spectral bias and improves training stability through frequency-aware and locality-aware mechanisms, advancing the capabilities of implicit neural representations for various vision tasks.
Abstract: Implicit Neural Representations (INRs) leverage neural networks to map coordinates to corresponding signals, enabling continuous and compact representations. This paradigm has driven significant advances in various vision tasks. However, existing INRs lack frequency selectivity and spatial localization, leading to an over-reliance on redundant signal components. Consequently, they exhibit spectral bias, tending to learn low-frequency components early while struggling to capture fine high-frequency details. To address these issues, we propose FLAIR (Frequency- and Locality-Aware Implicit Neural Representations), which incorporates two key innovations. The first is Band-Localized Activation (BLA), a novel activation designed for joint frequency selection and spatial localization under the constraints of the time-frequency uncertainty principle (TFUP). Through structured frequency control and spatially localized responses, BLA effectively mitigates spectral bias and enhances training stability. The second is Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet transform to compute energy scores and explicitly guide frequency information to the network, enabling precise frequency selection and adaptive band control. Our method consistently outperforms existing INRs in 2D image representation, as well as 3D shape reconstruction and novel view synthesis.
[392] Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration
Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Liqiang Nie
Main category: cs.CV
TL;DR: CPL improves All-in-One Image Restoration by learning better task-aware prompts through sparse representations and contrastive regularization to reduce redundancy and enhance task boundaries.
Details
Motivation: Existing AiOIR approaches struggle with designing effective task-aware prompts. Adaptive prompt learning leads to overlapping/redundant representations, while explicit prompts from classifiers lose visual reconstruction information.Method: Contrastive Prompt Learning (CPL) with two components: Sparse Prompt Module (SPM) to capture degradation-aware representations efficiently, and Contrastive Prompt Regularization (CPR) to strengthen task boundaries using negative prompt samples across different degradation types.
Result: Extensive experiments across five benchmarks show CPL consistently boosts performance of strong AiOIR baselines, achieving state-of-the-art average performance.
Conclusion: CPL provides a general and robust solution for AiOIR by directly optimizing prompt-restoration model interaction, improving prompt-task alignment through sparse representations and contrastive regularization.
Abstract: All-in-One Image Restoration (AiOIR), which addresses diverse degradation types with a unified model, presents significant challenges in designing task-aware prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from pretrained classifiers enhance discriminability but discard critical visual information needed for reconstruction. To address these limitations, we introduce Contrastive Prompt Learning (CPL), a framework that aims to improve prompt-task alignment through two complementary components: a Sparse Prompt Module (SPM) that efficiently captures degradation-aware representations while reducing redundancy, and a Contrastive Prompt Regularization (CPR) that explicitly strengthens task boundaries by incorporating negative prompt samples across different degradation types. Unlike previous approaches that focus primarily on degradation classification, CPL directly optimizes the interaction between prompts and the restoration model. Extensive experiments across five benchmarks show that CPL consistently boosts the performance of strong AiOIR baselines across diverse scenarios. Our approach achieves state-of-the-art average performance on these benchmarks, providing a general and robust solution for AiOIR. The code is available at https://github.com/Aitical/CPLIR
[393] Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng
Main category: cs.CV
TL;DR: UniME is a two-stage framework that uses Multimodal Large Language Models (MLLMs) to learn better multimodal representations by addressing CLIP’s limitations through textual knowledge distillation and hard negative enhanced instruction tuning.
Details
Motivation: CLIP has limitations in text token truncation, isolated image-text encoding, and deficient compositionality. While MLLMs show promise for vision-language understanding, their potential for learning transferable multimodal representations is underexplored.Method: Two-stage framework: 1) Textual discriminative knowledge distillation from LLM teacher to MLLM’s language component, 2) Hard negative enhanced instruction tuning with false negative mitigation and multiple hard negatives per instance to focus on challenging samples.
Result: UniME achieves consistent performance improvement across all tasks on MMEB benchmark and multiple retrieval tasks (short/long caption retrieval, compositional retrieval), showing superior discriminative and compositional capabilities.
Conclusion: The proposed UniME framework effectively leverages MLLMs to overcome CLIP’s limitations, demonstrating improved multimodal representation learning for diverse downstream tasks through knowledge distillation and hard negative enhanced training.
Abstract: The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored.In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLMś language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.
[394] Diffusion-based Adversarial Purification from the Perspective of the Frequency Domain
Gaozheng Pei, Ke Ma, Yingfei Sun, Qianqian Xu, Qingming Huang
Main category: cs.CV
TL;DR: The paper proposes a frequency-domain adversarial purification method that selectively preserves low-frequency components to remove adversarial perturbations while minimizing damage to image content and structure.
Details
Motivation: Existing diffusion-based adversarial purification methods damage normal semantics because they lack distribution information about adversarial perturbations in the pixel domain and indiscriminately damage all frequency components.Method: The method operates in the frequency domain by decomposing images into amplitude and phase spectra. During reverse diffusion, it replaces low-frequency amplitude components with those from adversarial images, and projects phase estimates into a designated low-frequency range of adversarial phase spectra.
Result: Extensive experiments show the method significantly outperforms most current defense methods by effectively eliminating adversarial perturbations while better preserving original image content and structure.
Conclusion: Frequency-domain analysis reveals adversarial perturbations cause monotonically increasing damage with frequency, enabling selective preservation of less-damaged low-frequency components for more effective adversarial purification.
Abstract: The diffusion-based adversarial purification methods attempt to drown adversarial perturbations into a part of isotropic noise through the forward process, and then recover the clean images through the reverse process. Due to the lack of distribution information about adversarial perturbations in the pixel domain, it is often unavoidable to damage normal semantics. We turn to the frequency domain perspective, decomposing the image into amplitude spectrum and phase spectrum. We find that for both spectra, the damage caused by adversarial perturbations tends to increase monotonically with frequency. This means that we can extract the content and structural information of the original clean sample from the frequency components that are less damaged. Meanwhile, theoretical analysis indicates that existing purification methods indiscriminately damage all frequency components, leading to excessive damage to the image. Therefore, we propose a purification method that can eliminate adversarial perturbations while maximizing the preservation of the content and structure of the original image. Specifically, at each time step during the reverse process, for the amplitude spectrum, we replace the low-frequency components of the estimated image’s amplitude spectrum with the corresponding parts of the adversarial image. For the phase spectrum, we project the phase of the estimated image into a designated range of the adversarial image’s phase spectrum, focusing on the low frequencies. Empirical evidence from extensive experiments demonstrates that our method significantly outperforms most current defense methods.
[395] RDD: Robust Feature Detector and Descriptor using Deformable Transformer
Gonglin Chen, Tianwen Fu, Haiwei Chen, Wenbin Teng, Hanyuan Xiao, Yajie Zhao
Main category: cs.CV
TL;DR: RDD is a novel keypoint detector/descriptor using deformable transformers that captures global context and geometric invariance, outperforming SOTA methods in sparse matching and enabling semi-dense matching.
Details
Motivation: Existing feature detection/description methods fail to handle significant viewpoint changes and don't learn visual cues from long-range relationships, despite being crucial for structure-from-motion and SLAM applications.Method: Proposes Robust Deformable Detector (RDD) leveraging deformable transformer architecture with deformable self-attention mechanisms to capture global context and geometric invariance while reducing search space complexity. Uses Air-to-Ground dataset plus MegaDepth for training.
Result: Outperforms all state-of-the-art keypoint detection/description methods in sparse matching tasks, capable of semi-dense matching. Introduces two challenging benchmarks: one for large viewpoint/scale variations, and an Air-to-Ground benchmark for 3D reconstruction across altitudes.
Conclusion: RDD successfully addresses limitations of existing methods by capturing long-range relationships through deformable transformers, demonstrating superior performance in challenging scenarios with significant viewpoint changes and scale variations.
Abstract: As a core step in structure-from-motion and SLAM, robust feature detection and description under challenging scenarios such as significant viewpoint changes remain unresolved despite their ubiquity. While recent works have identified the importance of local features in modeling geometric transformations, these methods fail to learn the visual cues present in long-range relationships. We present Robust Deformable Detector (RDD), a novel and robust keypoint detector/descriptor leveraging the deformable transformer, which captures global context and geometric invariance through deformable self-attention mechanisms. Specifically, we observed that deformable attention focuses on key locations, effectively reducing the search space complexity and modeling the geometric invariance. Furthermore, we collected an Air-to-Ground dataset for training in addition to the standard MegaDepth dataset. Our proposed method outperforms all state-of-the-art keypoint detection/description methods in sparse matching tasks and is also capable of semi-dense matching. To ensure comprehensive evaluation, we introduce two challenging benchmarks: one emphasizing large viewpoint and scale variations, and the other being an Air-to-Ground benchmark – an evaluation setting that has recently gaining popularity for 3D reconstruction across different altitudes.
[396] LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision
Anthony Fuller, Yousef Yassin, Junfeng Wen, Daniel G. Kyrollos, Tarek Ibrahim, James R. Green, Evan Shelhamer
Main category: cs.CV
TL;DR: LookWhere is an adaptive computation method for vision transformers that uses a low-resolution selector and high-resolution extractor to process images efficiently without full high-resolution computation, achieving significant FLOPs and time reductions while maintaining accuracy.
Details
Motivation: Vision transformers are becoming larger and more computationally expensive, especially at high resolutions where token count grows quadratically. There's a need for efficient methods that can handle high-resolution inputs without processing the full image.Method: LookWhere uses a two-stage approach: 1) A low-resolution selector predicts where to compute, and 2) A high-resolution extractor processes only selected regions. The system is jointly pretrained without task supervision via distillation from a self-supervised teacher, learning both where and what to compute simultaneously.
Result: The method achieves up to 34x FLOPs reduction and 6x time reduction on high-resolution Traffic Signs recognition while maintaining accuracy. It also improves accuracy on standard tasks like ImageNet classification (1.36x speedup) and ADE20K segmentation.
Conclusion: LookWhere provides an economical and accurate approach for adaptive computation in vision transformers, offering significant efficiency gains without complex per-task optimization, making it suitable for both sparse recognition on high-resolution inputs and standard recognition tasks.
Abstract: Vision transformers are ever larger, more accurate, and more expensive to compute. The expense is even more extreme at high resolution as the number of tokens grows quadratically with the image size. We turn to adaptive computation to cope with this cost by learning to predict where to compute. Our LookWhere method divides the computation between a low-resolution selector and a high-resolution extractor without ever processing the full high-resolution input. We jointly pretrain the selector and extractor without task supervision by distillation from a self-supervised teacher, in effect, learning where and what to compute simultaneously. Unlike prior token reduction methods, which pay to save by pruning already-computed tokens, and prior token selection methods, which require complex and expensive per-task optimization, LookWhere economically and accurately selects and extracts transferrable representations of images. We show that LookWhere excels at sparse recognition on high-resolution inputs (Traffic Signs), maintaining accuracy while reducing FLOPs by up to 34x and time by 6x. It also excels at standard recognition tasks that are global (ImageNet classification) or local (ADE20K segmentation), improving accuracy while reducing time by 1.36x. See https://github.com/antofuller/lookwhere for the code and weights.
[397] Structured Initialization for Vision Transformers
Jianqiao Zheng, Xueqian Li, Hemanth Saratchandran, Simon Lucey
Main category: cs.CV
TL;DR: Proposes CNN-inspired initialization for Vision Transformers (ViTs) to achieve CNN-like performance on small datasets while maintaining ViT scalability on large datasets.
Details
Motivation: ViTs lack the strong inductive biases of CNNs, making them less effective on small datasets. The goal is to give ViTs CNN-like performance on small data while preserving their ability to scale with large datasets.Method: Integrates CNN inductive bias into ViTs through initialization only (not architectural changes). Uses random impulse filters inspired by CNN performance, improving upon current heuristic-based ViT initialization strategies.
Result: Significantly outperforms standard ViT initialization on small/medium benchmarks (Food-101, CIFAR-10/100, STL-10, Flowers, Pets) while maintaining comparable performance on large-scale ImageNet-1K. Works with various transformer architectures (Swin, MLP-Mixer).
Conclusion: CNN-inspired initialization effectively bridges the gap between CNNs and ViTs, enabling ViTs to perform well on small datasets while maintaining their scalability advantages on large datasets.
Abstract: Convolutional Neural Networks (CNNs) inherently encode strong inductive biases, enabling effective generalization on small-scale datasets. In this paper, we propose integrating this inductive bias into ViTs, not through an architectural intervention but solely through initialization. The motivation here is to have a ViT that can enjoy strong CNN-like performance when data assets are small, but can still scale to ViT-like performance as the data expands. Our approach is motivated by our empirical results that random impulse filters can achieve commensurate performance to learned filters within a CNN. We improve upon current ViT initialization strategies, which typically rely on empirical heuristics such as using attention weights from pretrained models or focusing on the distribution of attention weights without enforcing structures. Empirical results demonstrate that our method significantly outperforms standard ViT initialization across numerous small and medium-scale benchmarks, including Food-101, CIFAR-10, CIFAR-100, STL-10, Flowers, and Pets, while maintaining comparative performance on large-scale datasets such as ImageNet-1K. Moreover, our initialization strategy can be easily integrated into various transformer-based architectures such as Swin Transformer and MLP-Mixer with consistent improvements in performance.
[398] U-Mamba2: Scaling State Space Models for Dental Anatomy Segmentation in CBCT
Zhi Qin Tan, Xiatian Zhu, Owen Addison, Yunpeng Li
Main category: cs.CV
TL;DR: U-Mamba2: A novel neural network combining Mamba2 state space models with U-Net for efficient multi-anatomy CBCT segmentation, achieving top performance in ToothFairy3 challenge.
Details
Motivation: Accurate segmentation of dental anatomies in CBCT scans is critical for clinical applications like diagnosis and surgical planning, but remains time-consuming and challenging despite being widely used in dentistry.Method: Integrates Mamba2 state space models into U-Net architecture for stronger structural constraints and efficiency, adds interactive click prompts with cross-attention blocks, uses self-supervised pre-training, and incorporates dental domain knowledge.
Result: Achieved first place in both tasks of ToothFairy3 challenge: Task 1 - mean Dice 0.84, HD95 38.17, average inference time 40.58s; Task 2 - mean Dice 0.87, HD95 2.15.
Conclusion: U-Mamba2 is both effective and efficient for multi-anatomy CBCT segmentation, demonstrating superior performance in dental anatomy segmentation tasks while maintaining computational efficiency.
Abstract: Cone-Beam Computed Tomography (CBCT) is a widely used 3D imaging technique in dentistry, providing volumetric information about the anatomical structures of jaws and teeth. Accurate segmentation of these anatomies is critical for clinical applications such as diagnosis and surgical planning, but remains time-consuming and challenging. In this paper, we present U-Mamba2, a new neural network architecture designed for multi-anatomy CBCT segmentation in the context of the ToothFairy3 challenge. U-Mamba2 integrates the Mamba2 state space models into the U-Net architecture, enforcing stronger structural constraints for higher efficiency without compromising performance. In addition, we integrate interactive click prompts with cross-attention blocks, pre-train U-Mamba2 using self-supervised learning, and incorporate dental domain knowledge into the model design to address key challenges of dental anatomy segmentation in CBCT. Extensive experiments, including independent tests, demonstrate that U-Mamba2 is both effective and efficient, securing first place in both tasks of the Toothfairy3 challenge. In Task 1, U-Mamba2 achieved a mean Dice of 0.84, HD95 of 38.17 with the held-out test data, with an average inference time of 40.58s. In Task 2, U-Mamba2 achieved the mean Dice of 0.87 and HD95 of 2.15 with the held-out test data. The code is publicly available at https://github.com/zhiqin1998/UMamba2.
[399] 3DRS: MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
Xiaohu Huang, Jingjing Wu, Qunyi Xie, Kai Han
Main category: cs.CV
TL;DR: 3DRS enhances MLLMs’ 3D representation by aligning visual features with 3D knowledge from pretrained 3D foundation models, improving scene understanding tasks.
Details
Motivation: MLLMs lack explicit 3D data during pretraining, limiting their 3D representation capability for scene understanding tasks.Method: Proposes 3DRS framework that introduces supervision from pretrained 3D foundation models to align MLLM visual features with rich 3D knowledge.
Result: Extensive experiments across multiple benchmarks and MLLMs show consistent performance gains in visual grounding, captioning, and question answering tasks.
Conclusion: Aligning MLLM features with 3D knowledge from foundation models effectively enhances 3D representation and improves downstream scene understanding performance.
Abstract: Recent advances in scene understanding have leveraged multimodal large language models (MLLMs) for 3D reasoning by capitalizing on their strong 2D pretraining. However, the lack of explicit 3D data during MLLM pretraining limits 3D representation capability. In this paper, we investigate the 3D-awareness of MLLMs by evaluating multi-view correspondence and reveal a strong positive correlation between the quality of 3D-aware representation and downstream task performance. Motivated by this, we propose 3DRS, a framework that enhances MLLM 3D representation learning by introducing supervision from pretrained 3D foundation models. Our approach aligns MLLM visual features with rich 3D knowledge distilled from 3D models, effectively improving scene understanding. Extensive experiments across multiple benchmarks and MLLMs – including visual grounding, captioning, and question answering – demonstrate consistent performance gains. Project page: https://visual-ai.github.io/3drs
[400] Normalize Filters! Classical Wisdom for Deep Vision
Gustavo Perez, Stella X. Yu
Main category: cs.CV
TL;DR: The paper proposes filter normalization for deep learning filters to make them atmosphere-equivariant, addressing distortions caused by atmospheric transfer in images.
Details
Motivation: Classical image filters are carefully normalized for consistency and to avoid artifacts, but convolutional filters learned in deep networks lack such constraints, causing distorted responses when images undergo atmospheric transfer.Method: Proposes filter normalization followed by learnable scaling and shifting (similar to batch normalization) to ensure filters are atmosphere-equivariant and enable co-domain symmetry.
Result: Significant improvements on artificial and natural intensity variation benchmarks; ResNet34 outperformed CLIP by a large margin; filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.
Conclusion: Integrating classical filtering normalization principles into deep learning (for both CNNs and vision transformers) addresses limitations of unnormalized filters and improves performance on atmospheric transfer tasks.
Abstract: Classical image filters, such as those for averaging or differencing, are carefully normalized to ensure consistency, interpretability, and to avoid artifacts like intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same or any way. Consequently, when images undergo atmospheric transfer, their responses become distorted, leading to incorrect outcomes. We address this limitation by proposing filter normalization, followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification ensures that the filters are atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 could even outperform CLIP by a large margin. Our analysis reveals that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.
[401] CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D
Mohamad Amin Mirzaei, Pantea Amoie, Ali Ekhterachian, Matin Mirzababaei, Babak Khalaj
Main category: cs.CV
TL;DR: Improved 3D semantic mapping using refined object masks from SemanticSAM and context-aware CLIP encoding for better zero-shot open-vocabulary scene understanding.
Details
Motivation: Existing zero-shot 3D semantic mapping methods produce fragmented masks and inaccurate semantic assignments due to using raw masks from vision-language models, limiting effectiveness in complex environments.Method: 1) Use SemanticSAM with progressive granularity refinement to generate more accurate and numerous object-level masks, reducing over-segmentation. 2) Employ context-aware CLIP encoding that integrates multiple contextual views of each mask with empirically determined weighting for richer visual context.
Result: Experimental evaluation on multiple 3D scene understanding tasks (semantic segmentation and object retrieval) across benchmark datasets shows significant improvements over existing methods.
Conclusion: The approach effectively addresses limitations of current zero-shot 3D semantic mapping methods by combining refined mask generation with context-aware encoding, demonstrating superior performance in complex environments.
Abstract: 3D scene understanding is fundamental for embodied AI and robotics, supporting reliable perception for interaction and navigation. Recent approaches achieve zero-shot, open-vocabulary 3D semantic mapping by assigning embedding vectors to 2D class-agnostic masks generated via vision-language models (VLMs) and projecting these into 3D. However, these methods often produce fragmented masks and inaccurate semantic assignments due to the direct use of raw masks, limiting their effectiveness in complex environments. To address this, we leverage SemanticSAM with progressive granularity refinement to generate more accurate and numerous object-level masks, mitigating the over-segmentation commonly observed in mask generation models such as vanilla SAM, and improving downstream 3D semantic segmentation. To further enhance semantic context, we employ a context-aware CLIP encoding strategy that integrates multiple contextual views of each mask using empirically determined weighting, providing much richer visual context. We evaluate our approach on multiple 3D scene understanding tasks, including 3D semantic segmentation and object retrieval from language queries, across several benchmark datasets. Experimental results demonstrate significant improvements over existing methods, highlighting the effectiveness of our approach.
[402] Exploring Adversarial Watermarking in Transformer-Based Models: Transferability and Robustness Against Defense Mechanism for Medical Images
Rifat Sadik, Tanvir Rahman, Arpan Bhattacharjee, Bikash Chandra Halder, Ismail Hossain, Rifat Sarker Aoyon, Md. Golam Rabiul Alam, Jia Uddin
Main category: cs.CV
TL;DR: Vision Transformers (ViTs) for dermatological image analysis are highly vulnerable to adversarial watermarking attacks, suffering accuracy drops to 27.6%, but adversarial training can restore performance to 90%.
Details
Motivation: While Vision Transformers (ViTs) have shown success in computer vision tasks including medical image analysis, their reliance on global attention mechanisms makes them potentially susceptible to adversarial attacks. The paper aims to investigate this vulnerability specifically for dermatological images using adversarial watermarking techniques.Method: The study uses Projected Gradient Descent (PGD) to generate adversarial watermarks (imperceptible perturbations) that fool ViT models. The research examines attack transferability to CNNs and analyzes the effectiveness of adversarial training as a defense mechanism against these attacks.
Result: ViTs show significant vulnerability to adversarial watermarking attacks, with accuracy dropping as low as 27.6% when attacked. However, adversarial training effectively mitigates this vulnerability, raising accuracy back up to 90.0%. Performance on clean images remains uncompromised.
Conclusion: Vision Transformers for medical image analysis are highly susceptible to adversarial attacks despite their strong performance on clean data. Adversarial training proves to be an effective defense strategy, highlighting the importance of security considerations when deploying transformer-based models in medical applications.
Abstract: Deep learning models have shown remarkable success in dermatological image analysis, offering potential for automated skin disease diagnosis. Previously, convolutional neural network(CNN) based architectures have achieved immense popularity and success in computer vision (CV) based task like skin image recognition, generation and video analysis. But with the emergence of transformer based models, CV tasks are now are nowadays carrying out using these models. Vision Transformers (ViTs) is such a transformer-based models that have shown success in computer vision. It uses self-attention mechanisms to achieve state-of-the-art performance across various tasks. However, their reliance on global attention mechanisms makes them susceptible to adversarial perturbations. This paper aims to investigate the susceptibility of ViTs for medical images to adversarial watermarking-a method that adds so-called imperceptible perturbations in order to fool models. By generating adversarial watermarks through Projected Gradient Descent (PGD), we examine the transferability of such attacks to CNNs and analyze the performance defense mechanism – adversarial training. Results indicate that while performance is not compromised for clean images, ViTs certainly become much more vulnerable to adversarial attacks: an accuracy drop of as low as 27.6%. Nevertheless, adversarial training raises it up to 90.0%.
[403] PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models
Jeongjae Lee, Jong Chul Ye
Main category: cs.CV
TL;DR: PCPO is a new reinforcement learning framework that fixes disproportionate credit assignment in text-to-image model alignment, leading to more stable training, faster convergence, and better image quality.
Details
Motivation: Current policy gradient methods for aligning text-to-image models suffer from training instability and high variance, which slows convergence and hurts image quality. The authors identify disproportionate credit assignment as the root cause, where the generative sampler's mathematical structure creates volatile, non-proportional feedback across timesteps.Method: Proportionate Credit Policy Optimization (PCPO) enforces proportional credit assignment through a stable objective reformulation and principled reweighting of timesteps. This corrects the disproportionate feedback issue and stabilizes training.
Result: PCPO significantly accelerates convergence and produces superior image quality compared to existing methods. It mitigates model collapse, a common failure mode in recursive training, and substantially outperforms state-of-the-art baselines like DanceGRPO.
Conclusion: PCPO addresses a fundamental instability in reinforcement learning for text-to-image model alignment by fixing disproportionate credit assignment, leading to more stable training, faster convergence, and better quality images than current state-of-the-art methods.
Abstract: While reinforcement learning has advanced the alignment of text-to-image (T2I) models, state-of-the-art policy gradient methods are still hampered by training instability and high variance, hindering convergence speed and compromising image quality. Our analysis identifies a key cause of this instability: disproportionate credit assignment, in which the mathematical structure of the generative sampler produces volatile and non-proportional feedback across timesteps. To address this, we introduce Proportionate Credit Policy Optimization (PCPO), a framework that enforces proportional credit assignment through a stable objective reformulation and a principled reweighting of timesteps. This correction stabilizes the training process, leading to significantly accelerated convergence and superior image quality. The improvement in quality is a direct result of mitigating model collapse, a common failure mode in recursive training. PCPO substantially outperforms existing policy gradient baselines on all fronts, including the state-of-the-art DanceGRPO. Code is available at https://github.com/jaylee2000/pcpo/.
[404] PlayerOne: Egocentric World Simulator
Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, Hengshuang Zhao
Main category: cs.CV
TL;DR: PlayerOne is the first egocentric realistic world simulator that generates immersive videos aligned with real human motion from exocentric camera data, enabling precise control of human movements and consistent world modeling.
Details
Motivation: To create the first egocentric realistic world simulator that enables immersive exploration in dynamic environments while maintaining strict alignment with real human motion captured from exocentric cameras.Method: Uses a coarse-to-fine training pipeline: pretraining on large-scale egocentric text-video pairs, then finetuning on synchronous motion-video data extracted from egocentric-exocentric datasets via automatic construction. Features part-disentangled motion injection for precise part-level control and joint reconstruction framework for progressive 4D scene and video frame modeling.
Result: Demonstrates great generalization ability in precise control of varying human movements and world-consistent modeling of diverse scenarios, marking the first endeavor into egocentric real-world simulation.
Conclusion: PlayerOne represents a breakthrough in egocentric world simulation that can pave the way for new frontiers in world modeling and diverse applications, enabling immersive exploration with precise motion control and scene consistency.
Abstract: We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real scene human motion of the user captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Besides, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in the long-form video generation. Experimental results demonstrate its great generalization ability in precise control of varying human movements and worldconsistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.
[405] TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos
Xiangrui Liu, Minghao Qin, Yan Shu, Zhengyang Liang, Yang Tian, Chen Jason Zhang, Bo Zhao, Zheng Liu
Main category: cs.CV
TL;DR: The paper introduces Task-oriented Temporal Grounding (ToTG), a new video understanding problem where temporal intervals are identified based on downstream task requirements rather than explicit descriptions, and proposes TimeScope method with a new benchmark ToTG-Bench.
Details
Motivation: Traditional temporal grounding relies on explicit time-interval descriptions, but real-world applications often require identifying key moments based on task objectives (e.g., "explain why the man is sent to hospital"). This task-oriented formulation presents challenges for existing methods as it requires joint deep task comprehension and fine-grained temporal localization.Method: Proposes TimeScope method that performs coarse-to-fine localization through progressive reasoning. Uses extensive supervised fine-tuning with carefully curated chain-of-thought (CoT) data from various scenarios to enable effective generalization across tasks and domains.
Result: TimeScope shows empirical advantages over baselines: (1) substantial improvements in grounding precision, (2) significant benefits to downstream tasks, and (3) strong generalizability across different scenarios. The paper also introduces ToTG-Bench benchmark for comprehensive evaluation.
Conclusion: The paper successfully addresses the new Task-oriented Temporal Grounding problem with TimeScope method and ToTG-Bench benchmark, demonstrating superior performance and generalizability. All models, datasets, and code will be open-sourced to support future research.
Abstract: Identifying key temporal intervals within long videos, known as temporal grounding (TG), is important to video understanding and reasoning tasks. In this paper, we introduce a new form of the temporal grounding problem, \textbf{Task-oriented Temporal Grounding} (\textbf{ToTG}), which is driven by the requirements of downstream tasks rather than explicit time-interval descriptions. For example, a ToTG input may be “explain why the man in the video is sent to the hospital,” whereas traditional TG would take an explicit temporal description such as “the moments when the man is tripped by a stone and falls to the ground.” This new ToTG formulation presents significant challenges for existing TG methods, as it requires jointly performing deep task comprehension and fine-grained temporal localization within long videos. To address these challenges, we conduct a systematic set of studies. First, we construct \textbf{a new benchmark ToTG-Bench}, which comprehensively evaluates ToTG performance across diverse settings. Second, we introduce \textbf{a new temporal-ground method TimeScope}, which performs coarse-to-fine localization through a progressive reasoning process. Leveraging extensive supervised fine-tuning with carefully curated chain-of-thought (CoT) data from a variety of scenarios, TimeScope generalizes effectively across tasks and domains. Our evaluation demonstrates \textbf{TimeScope’s empirical advantages} over existing baselines from three perspectives: (1) substantial improvements in grounding precision, (2) significant benefits to downstream tasks, and (3) strong generalizability across different scenarios. All models, datasets, and source code will be fully open-sourced to support future research in this area.
[406] Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration
Jun Wang, Lixing Zhu, Xiaohan Yu, Abhir Bhalerao, Yulan He
Main category: cs.CV
TL;DR: PLACE is a novel framework that improves medical vision-language learning through pathological-level alignment and correlation exploration without extra annotations, achieving SOTA on multiple downstream tasks.
Details
Motivation: Medical image-report pairs are valuable but challenging due to lengthy reports with complex discourse relations and semantic pathologies. Previous methods focus on instance-wise or token-wise alignment but neglect pathological-level consistency, which is crucial for accurate medical understanding.Method: Proposes PLACE framework with: 1) Pathological-level Cross-Modal Alignment (PCMA) to maximize consistency of pathology observations from images and reports, using a Visual Pathology Observation Extractor; 2) A proxy task that identifies correlations among image patches to enrich fine-grained details. Both operate without external disease annotations.
Result: Achieves new state-of-the-art performance on multiple downstream tasks including classification, image-to-text retrieval, semantic segmentation, object detection, and report generation.
Conclusion: PLACE effectively addresses the limitations of previous methods by focusing on pathological-level alignment and correlation exploration, demonstrating superior performance across various medical vision-language tasks without requiring additional annotations.
Abstract: Learning medical visual representations from image-report pairs through joint learning has garnered increasing research attention due to its potential to alleviate the data scarcity problem in the medical domain. The primary challenges stem from the lengthy reports that feature complex discourse relations and semantic pathologies. Previous works have predominantly focused on instance-wise or token-wise cross-modal alignment, often neglecting the importance of pathological-level consistency. This paper presents a novel framework PLACE that promotes the Pathological-Level Alignment and enriches the fine-grained details via Correlation Exploration without additional human annotations. Specifically, we propose a novel pathological-level cross-modal alignment (PCMA) approach to maximize the consistency of pathology observations from both images and reports. To facilitate this, a Visual Pathology Observation Extractor is introduced to extract visual pathological observation representations from localized tokens. The PCMA module operates independently of any external disease annotations, enhancing the generalizability and robustness of our methods. Furthermore, we design a proxy task that enforces the model to identify correlations among image patches, thereby enriching the fine-grained details crucial for various downstream tasks. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection and report generation. Code is available at https://github.com/Markin-Wang/PLACE.
[407] X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability
Yu Yang, Alan Liang, Jianbiao Mei, Yukai Ma, Yong Liu, Gim Hee Lee
Main category: cs.CV
TL;DR: X-Scene is a diffusion-based framework for generating large-scale 3D driving scenes with geometric intricacy, appearance fidelity, and flexible multi-granular control through layout conditioning and semantic guidance.
Details
Motivation: While diffusion models have advanced autonomous driving applications, they primarily focus on temporal consistency. Large-scale 3D scene generation requiring spatial coherence remains underexplored, creating a gap for comprehensive driving scene synthesis.Method: X-Scene uses a unified pipeline that sequentially generates 3D semantic occupancy and corresponding multi-view images/videos. It supports multi-granular control via low-level layout conditioning (user input/text) and high-level semantic guidance (user intent/LLM-enriched prompts). The framework extends local regions to large-scale scenes through consistency-aware outpainting and lifts results into 3DGS representations.
Result: Extensive experiments demonstrate that X-Scene substantially advances controllability and fidelity in large-scale scene generation, enabling high-quality 3DGS representations suitable for autonomous driving simulation and scene exploration applications.
Conclusion: X-Scene empowers data generation and simulation for autonomous driving by achieving geometric intricacy, appearance fidelity, and flexible controllability in large-scale driving scene generation, addressing the previously underexplored challenge of spatial coherence in 3D scene synthesis.
Abstract: Diffusion models are advancing autonomous driving by enabling realistic data synthesis, predictive end-to-end planning, and closed-loop simulation, with a primary focus on temporally consistent generation. However, large-scale 3D scene generation requiring spatial coherence remains underexplored. In this paper, we present X-Scene, a novel framework for large-scale driving scene generation that achieves geometric intricacy, appearance fidelity, and flexible controllability. Specifically, X-Scene supports multi-granular control, including low-level layout conditioning driven by user input or text for detailed scene composition, and high-level semantic guidance informed by user intent and LLM-enriched prompts for efficient customization. To enhance geometric and visual fidelity, we introduce a unified pipeline that sequentially generates 3D semantic occupancy and corresponding multi-view images and videos, ensuring alignment and temporal consistency across modalities. We further extend local regions into large-scale scenes via consistency-aware outpainting, which extrapolates occupancy and images from previously generated areas to maintain spatial and visual coherence. The resulting scenes are lifted into high-quality 3DGS representations, supporting diverse applications such as simulation and scene exploration. Extensive experiments demonstrate that X-Scene substantially advances controllability and fidelity in large-scale scene generation, empowering data generation and simulation for autonomous driving.
[408] Towards Explainable Bilingual Multimodal Misinformation Detection and Localization
Yiwei He, Xiangtai Li, Zhenglin Huang, Yi Dong, Hao Fei, Jiangning Zhang, Baoyuan Wu, Guangliang Cheng
Main category: cs.CV
TL;DR: BiMi is a bilingual multimodal framework for detecting subtle misinformation in news media with Chinese-English subtitles, featuring region-level localization, cross-modal/lingual consistency detection, and natural language explanations, supported by a large benchmark (BiMiBench) and enhanced with GRPO for better explanations.
Details
Motivation: Increasing realism of multimodal content makes misinformation more subtle and harder to detect, especially in news media with bilingual subtitles where localized image edits and cross-lingual inconsistencies can jointly distort meaning while appearing plausible.Method: BiMi framework jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation. It integrates online retrieval for external context and uses Group Relative Policy Optimization (GRPO) to improve explanation quality. Also introduces BiMiBench benchmark with 104,000 samples of systematically edited real news.
Result: BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore, advancing state-of-the-art performance in realistic, multilingual misinformation detection.
Conclusion: The BiMi framework effectively addresses the challenge of detecting subtle misinformation in bilingual multimodal news content through joint analysis of visual and linguistic modalities, with improved interpretability via GRPO-enhanced explanations, representing significant advancement in multilingual misinformation detection.
Abstract: The increasing realism of multimodal content has made misinformation more subtle and harder to detect, especially in news media where images are frequently paired with bilingual (e.g., Chinese-English) subtitles. Such content often includes localized image edits and cross-lingual inconsistencies that jointly distort meaning while remaining superficially plausible. We introduce BiMi, a bilingual multimodal framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. To support generalization, BiMi integrates an online retrieval module that supplements model reasoning with up-to-date external context. We further release BiMiBench, a large-scale and comprehensive benchmark constructed by systematically editing real news images and subtitles, comprising 104,000 samples with realistic manipulations across visual and linguistic modalities. To enhance interpretability, we apply Group Relative Policy Optimization (GRPO) to improve explanation quality, marking the first use of GRPO in this domain. Extensive experiments demonstrate that BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore, advancing state-of-the-art performance in realistic, multilingual misinformation detection. Code, models, and datasets will be released.
[409] Extreme Amodal Face Detection
Changlin Song, Yunzhong Hou, Michael Randall Barnes, Rahul Shome, Dylan Campbell
Main category: cs.CV
TL;DR: Extreme amodal face detection from single images using contextual cues and efficient heatmap-based approach with coarse-to-fine decoder.
Details
Motivation: Address safety and privacy applications by detecting faces outside the visible field-of-view using only single images, avoiding reliance on video sequences or inefficient generative models.Method: Heatmap-based extreme amodal object detector with selective coarse-to-fine decoder that uses contextual cues to infer unseen faces from single images.
Result: Strong performance on extreme amodal face detection, outperforming less efficient generative approaches while being sample-free.
Conclusion: Proposed method efficiently solves single-image extreme amodal detection using contextual reasoning, establishing new state-of-the-art for this task.
Abstract: Extreme amodal detection is the task of inferring the 2D location of objects that are not fully visible in the input image but are visible within an expanded field-of-view. This differs from amodal detection, where the object is partially visible within the input image, but is occluded. In this paper, we consider the sub-problem of face detection, since this class provides motivating applications involving safety and privacy, but do not tailor our method specifically to this class. Existing approaches rely on image sequences so that missing detections may be interpolated from surrounding frames or make use of generative models to sample possible completions. In contrast, we consider the single-image task and propose a more efficient, sample-free approach that makes use of the contextual cues from the image to infer the presence of unseen faces. We design a heatmap-based extreme amodal object detector that addresses the problem of efficiently predicting a lot (the out-of-frame region) from a little (the image) with a selective coarse-to-fine decoder. Our method establishes strong results for this new task, even outperforming less efficient generative approaches. Code, data, and models are available at https://charliesong1999.github.io/exaft_web/.
[410] Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions
Jintang Xue, Ganning Zhao, Jie-En Yao, Hong-En Chen, Yue Hu, Meida Chen, Suya You, C. -C. Jay Kuo
Main category: cs.CV
TL;DR: Descrip3D enhances 3D scene understanding by augmenting object representations with natural language descriptions of their attributes and relationships, enabling better relational reasoning across multiple tasks without task-specific heads.
Details
Motivation: Current 3D scene-language models struggle with relational understanding between objects because visual embeddings alone don't adequately capture object roles and interactions. There's a need for better representation of spatial and semantic relationships in 3D scenes.Method: Descrip3D explicitly encodes object relationships using natural language descriptions that capture both intrinsic attributes and contextual relationships. It uses dual-level integration: embedding fusion and prompt-level injection, enabling unified reasoning across tasks without task-specific heads or additional supervision.
Result: Descrip3D consistently outperforms strong baseline models on five benchmark datasets: ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap, demonstrating the effectiveness of language-guided relational representation.
Conclusion: Language-guided relational representation through textual descriptions significantly improves 3D scene understanding, enabling better reasoning about object relationships across various tasks without requiring task-specific architectures or additional supervision.
Abstract: Understanding 3D scenes goes beyond simply recognizing objects; it requires reasoning about the spatial and semantic relationships between them. Current 3D scene-language models often struggle with this relational understanding, particularly when visual embeddings alone do not adequately convey the roles and interactions of objects. In this paper, we introduce Descrip3D, a novel and powerful framework that explicitly encodes the relationships between objects using natural language. Unlike previous methods that rely only on 2D and 3D embeddings, Descrip3D enhances each object with a textual description that captures both its intrinsic attributes and contextual relationships. These relational cues are incorporated into the model through a dual-level integration: embedding fusion and prompt-level injection. This allows for unified reasoning across various tasks such as grounding, captioning, and question answering, all without the need for task-specific heads or additional supervision. When evaluated on five benchmark datasets, including ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap, Descrip3D consistently outperforms strong baseline models, demonstrating the effectiveness of language-guided relational representation for understanding complex indoor scenes. Our code and data are publicly available at https://github.com/jintangxue/Descrip3D.
[411] Gene-DML: Dual-Pathway Multi-Level Discrimination for Gene Expression Prediction from Histopathology Images
Yaxuan Song, Jianan Fan, Hang Chang, Weidong Cai
Main category: cs.CV
TL;DR: Gene-DML is a unified framework that uses dual-pathway multi-level discrimination to align histopathology images with gene expression profiles, achieving state-of-the-art performance in predicting gene expression from pathology images.
Details
Motivation: Existing methods underutilize cross-modal representation alignment between histopathology images and gene expression profiles across multiple representational levels, limiting prediction performance for scalable molecular profiling in precision medicine.Method: Gene-DML structures latent space through Dual-pathway Multi-Level discrimination: 1) multi-scale instance-level discrimination aligns hierarchical histopathology representations (local, neighbor, global levels) with gene expression, and 2) cross-level instance-group discrimination enforces structural consistency between individual instances and modality-crossed groups.
Result: Extensive experiments on public spatial transcriptomics datasets demonstrate that Gene-DML achieves state-of-the-art performance in gene expression prediction, enhancing both predictive accuracy and generalization across diverse biological contexts.
Conclusion: Gene-DML effectively learns robust cross-modal representations by jointly modeling fine-grained and structural-level discrimination, providing a powerful approach for non-invasive molecular profiling from histopathology images.
Abstract: Accurately predicting gene expression from histopathology images offers a scalable and non-invasive approach to molecular profiling, with significant implications for precision medicine and computational pathology. However, existing methods often underutilize the cross-modal representation alignment between histopathology images and gene expression profiles across multiple representational levels, thereby limiting their prediction performance. To address this, we propose Gene-DML, a unified framework that structures latent space through Dual-pathway Multi-Level discrimination to enhance correspondence between morphological and transcriptional modalities. The multi-scale instance-level discrimination pathway aligns hierarchical histopathology representations extracted at local, neighbor, and global levels with gene expression profiles, capturing scale-aware morphological-transcriptional relationships. In parallel, the cross-level instance-group discrimination pathway enforces structural consistency between individual (image/gene) instances and modality-crossed (gene/image, respectively) groups, strengthening the alignment across modalities. By jointly modeling fine-grained and structural-level discrimination, Gene-DML is able to learn robust cross-modal representations, enhancing both predictive accuracy and generalization across diverse biological contexts. Extensive experiments on public spatial transcriptomics datasets demonstrate that Gene-DML achieves state-of-the-art performance in gene expression prediction. The code and processed datasets are available at https://github.com/YXSong000/Gene-DML.
[412] Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression
Roy H. Jennings, Genady Paikin, Roy Shaul, Evgeny Soloveichik
Main category: cs.CV
TL;DR: RvTC method replaces vocabulary-constrained classification with flexible bin-based approach for image regression, achieving SOTA performance and showing that semantic prompts (not generic ones) enable MLLMs to leverage cross-modal understanding.
Details
Motivation: Current MLLM approaches for image regression use preset vocabularies and generic prompts, assuming they mimic human rating behavior, but analysis shows they provide no benefit over image-only training and fail to leverage semantic understanding from textual input.Method: Propose Regression via Transformer-Based Classification (RvTC), which replaces vocabulary-constrained classification with flexible bin-based approach. Uses straightforward bin increase instead of complex distributional modeling. Employs data-specific prompts containing semantic information about specific images rather than generic task descriptions.
Result: Achieves state-of-the-art performance on four image assessment datasets using only images. On AVA dataset, adding challenge titles to prompts substantially improves already SOTA image-only baseline. Demonstrates MLLMs benefit from semantic prompt information, surpassing statistical biases. Validated across two different MLLM architectures with consistent improvements.
Conclusion: RvTC eliminates manual vocabulary crafting through simple bin increase and shows that semantic prompts (not generic ones) enable MLLMs to leverage cross-modal understanding for image regression tasks, demonstrating method generalizability across architectures.
Abstract: Multimodal Large Language Models (MLLMs) show promise for image-based regression tasks, but current approaches face key limitations. Recent methods fine-tune MLLMs using preset output vocabularies and generic task-level prompts (e.g., “How would you rate this image?”), assuming this mimics human rating behavior. Our analysis reveals that these approaches provide no benefit over image-only training. Models using preset vocabularies and generic prompts perform equivalently to image-only models, failing to leverage semantic understanding from textual input. We propose Regression via Transformer-Based Classification (RvTC), which replaces vocabulary-constrained classification with a flexible bin-based approach. Unlike approaches that address discretization errors through complex distributional modeling, RvTC eliminates manual vocabulary crafting through straightforward bin increase, achieving state-of-the-art performance on four image assessment datasets using only images. More importantly, we demonstrate that data-specific prompts dramatically improve performance. Unlike generic task descriptions, prompts containing semantic information about specific images enable MLLMs to leverage cross-modal understanding. On the AVA dataset, adding challenge titles to prompts substantially improves our already state-of-the-art image-only baseline. We demonstrate through empirical evidence from the AVA and AGIQA-3k datasets that MLLMs benefit from semantic prompt information, surpassing mere statistical biases. We validate RvTC across two different MLLM architectures, demonstrating consistent improvements and method generalizability.
[413] UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
Tiancheng Gu, Kaicheng Yang, Kaichen Zhang, Xiang An, Ziyong Feng, Yueyi Zhang, Weidong Cai, Jiankang Deng, Lidong Bing
Main category: cs.CV
TL;DR: UniME-V2 improves multimodal embedding by using MLLMs to generate soft semantic matching scores for better hard negative mining and discriminative learning, achieving SOTA performance.
Details
Motivation: Existing multimodal embedding models struggle with capturing subtle semantic differences among candidates, lack diversity in negative samples, and have limited ability to distinguish false/hard negatives.Method: 1) Construct potential hard negative set via global retrieval; 2) Use MLLM-as-a-Judge to assess semantic alignment and generate soft matching scores; 3) Use scores for hard negative mining and as soft labels to align similarity matrices; 4) Propose UniME-V2-Reranker trained on mined hard negatives with joint pairwise/listwise optimization.
Result: Achieves state-of-the-art performance on MMEB benchmark and multiple retrieval tasks across all tasks on average.
Conclusion: Leveraging MLLMs’ understanding capabilities for semantic assessment and soft label generation significantly enhances multimodal embedding models’ discriminative capacity and performance.
Abstract: Universal multimodal embedding models are foundational to various tasks. Existing approaches typically employ in-batch negative mining by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to mitigate the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach. We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance on average across all tasks.
[414] Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows
Simin Huo, Ning Li
Main category: cs.CV
TL;DR: Iwin Transformer is a hierarchical vision transformer without position embeddings that can be fine-tuned across resolutions using interleaved window attention and depthwise convolution to achieve global information exchange in a single module.
Details
Motivation: To overcome Swin Transformer's limitation of requiring two consecutive blocks to approximate global attention, and to create a more efficient hierarchical vision transformer that can handle multiple resolutions without position embeddings.Method: Uses interleaved window attention to connect distant tokens and depthwise separable convolution to link neighboring tokens, enabling global information exchange within a single module. This position-embedding-free design allows direct fine-tuning from low to high resolution.
Result: Achieves 87.4% top-1 accuracy on ImageNet-1K classification, shows strong performance in semantic segmentation and video action recognition, and works effectively as a standalone module for class-conditional image generation.
Conclusion: Iwin Transformer demonstrates strong competitiveness across multiple vision tasks, offers a more efficient alternative to Swin Transformer, and its core concepts can inspire future research like 3D attention for video generation.
Abstract: We introduce Iwin Transformer, a novel position-embedding-free hierarchical vision transformer, which can be fine-tuned directly from low to high resolution, through the collaboration of innovative interleaved window attention and depthwise separable convolution. This approach uses attention to connect distant tokens and applies convolution to link neighboring tokens, enabling global information exchange within a single module, overcoming Swin Transformer’s limitation of requiring two consecutive blocks to approximate global attention. Extensive experiments on visual benchmarks demonstrate that Iwin Transformer exhibits strong competitiveness in tasks such as image classification (87.4 top-1 accuracy on ImageNet-1K), semantic segmentation and video action recognition. We also validate the effectiveness of the core component in Iwin as a standalone module that can seamlessly replace the self-attention module in class-conditional image generation. The concepts and methods introduced by the Iwin Transformer have the potential to inspire future research, like Iwin 3D Attention in video generation. The code and models are available at https://github.com/cominder/Iwin-Transformer.
[415] VGS-ATD: Robust Distributed Learning for Multi-Label Medical Image Classification Under Heterogeneous and Imbalanced Conditions
Zehui Zhao, Laith Alzubaidi, Haider A. Alwzwazy, Jinglan Zhang, Yuantong Gu
Main category: cs.CV
TL;DR: VGS-ATD is a novel distributed learning framework that addresses privacy, data heterogeneity, catastrophic forgetting, and computational efficiency in medical imaging, outperforming centralized, federated, and swarm learning approaches.
Details
Motivation: Traditional centralized learning poses privacy risks, while decentralized approaches (federated/swarm learning) struggle with heterogeneous/imbalanced data, communication inefficiency, and catastrophic forgetting during system expansion in dynamic clinical environments.Method: VGS-ATD framework - a distributed learning approach designed to handle heterogeneous data, reduce communication overhead, and prevent catastrophic forgetting during system expansion without requiring full model retraining.
Result: Achieved 92.7% overall accuracy across 30 datasets and 80 labels, outperforming centralized learning (84.9%) and swarm learning (72.99%). Federated learning failed due to computational requirements. Showed only 1% accuracy drop after expansion vs 20% for centralized learning, and reduced computational costs by up to 50%.
Conclusion: VGS-ATD provides a superior distributed learning solution for medical imaging that balances privacy, accuracy, scalability, and computational efficiency while effectively addressing catastrophic forgetting in dynamic clinical environments.
Abstract: In recent years, advanced deep learning architectures have shown strong performance in medical imaging tasks. However, the traditional centralized learning paradigm poses serious privacy risks as all data is collected and trained on a single server. To mitigate this challenge, decentralized approaches such as federated learning and swarm learning have emerged, allowing model training on local nodes while sharing only model weights. While these methods enhance privacy, they struggle with heterogeneous and imbalanced data and suffer from inefficiencies due to frequent communication and the aggregation of weights. More critically, the dynamic and complex nature of clinical environments demands scalable AI systems capable of continuously learning from diverse modalities and multilabels. Yet, both centralized and decentralized models are prone to catastrophic forgetting during system expansion, often requiring full model retraining to incorporate new data. To address these limitations, we propose VGS-ATD, a novel distributed learning framework. To validate VGS-ATD, we evaluate it in experiments spanning 30 datasets and 80 independent labels across distributed nodes, VGS-ATD achieved an overall accuracy of 92.7%, outperforming centralized learning (84.9%) and swarm learning (72.99%), while federated learning failed under these conditions due to high requirements on computational resources. VGS-ATD also demonstrated strong scalability, with only a 1% drop in accuracy on existing nodes after expansion, compared to a 20% drop in centralized learning, highlighting its resilience to catastrophic forgetting. Additionally, it reduced computational costs by up to 50% relative to both centralized and swarm learning, confirming its superior efficiency and scalability.
[416] DCoAR: Deep Concept Injection into Unified Autoregressive Models for Personalized Text-to-Image Generation
Fangtai Wu, Mushui Liu, Weijie He, Zhao Wang, Yunlong Yu
Main category: cs.CV
TL;DR: DCoAR is a deep concept injection framework for unified autoregressive models that enables high-quality customized image generation with frozen pre-trained models through layer-wise multimodal context learning and multi-faceted regularization.
Details
Motivation: Existing customization approaches for unified AR models face a dilemma: adaptation-based methods suffer from overfitting and scalability issues, while concept-injection paradigms use shallow injection strategies that lead to poor visual fidelity and impaired re-contextualization.Method: DCoAR uses Layer-wise Multimodal Context Learning (LMCL) to deeply integrate new concepts while keeping the pre-trained model frozen. It employs a multi-faceted regularization scheme with Dual Prior Preservation (DPP) loss to mitigate semantic drift and Context-Aware Self-Regularization (CASR) loss to enhance re-contextualization.
Result: DCoAR significantly outperforms previous injection-based methods and achieves performance competitive with adaptation-based approaches while requiring substantially fewer trainable parameters. It also enables training-free subject customization in user-provided styles.
Conclusion: DCoAR successfully addresses the limitations of existing customization approaches for unified AR models by enabling deep concept integration with frozen models, achieving high visual fidelity and effective re-contextualization with minimal parameter overhead.
Abstract: The unified autoregressive (AR) model excels at multimodal understanding and generation. However, its full potential in the domain of customized image generation has yet to be fully realized. Existing customization approaches for unified AR models face a fundamental dilemma: adaptation-based methods suffer from overfitting and scalability bottlenecks, while concept-injection paradigms are constrained by a shallow injection strategy that leads to poor visual fidelity and impaired re-contextualization. To address this, we propose DCoAR, a novel deep concept injection framework that maintains a completely frozen pre-trained model. DCoAR deeply integrates new concepts through a Layer-wise Multimodal Context Learning (LMCL) strategy, which is stabilized by a multi-faceted regularization scheme: a Dual Prior Preservation (DPP) loss to mitigate semantic drift and a Context-Aware Self-Regularization (CASR) loss to enhance re-contextualization. The framework also enables training-free subject customization in user-provided styles. Experiments demonstrate that DCoAR significantly outperforms previous injection-based methods and achieves performance competitive with adaptation-based approaches while requiring substantially fewer trainable parameters. Code: https://github.com/KZF-kzf/CoAR
[417] EVCtrl: Efficient Control Adapter for Visual Generation
Zixiang Yang, Yue Ma, Yinhan Zhang, Shanhui Mo, Dongrui Liu, Linfeng Zhang
Main category: cs.CV
TL;DR: EVCtrl is a lightweight plug-and-play control adapter that reduces computational overhead in ControlNet-based visual generation by using spatio-temporal dual caching to eliminate redundant computations in uncontrolled regions and denoising steps.
Details
Motivation: ControlNet provides precise spatial-temporal control for visual generation but significantly increases latency and introduces redundant computation in both uncontrolled regions and denoising steps, especially problematic for video generation.Method: Proposes EVCtrl with spatio-temporal dual caching: 1) Spatial - profiles DiT-ControlNet layers, partitions network into global/local zones, uses locality-aware cache to focus computation only on zones needing control signals; 2) Temporal - selectively omits unnecessary denoising steps to improve efficiency.
Result: Achieves 2.16x speedup on CogVideo-Controlnet and 2.05x speedup on Wan2.1-Controlnet with almost no degradation in generation quality; effective for both image and video control generation without requiring training.
Conclusion: EVCtrl successfully addresses ControlNet’s computational inefficiency through lightweight spatio-temporal optimization, enabling faster controllable visual generation while maintaining quality, making it practical for real applications.
Abstract: Visual generation includes both image and video generation, training probabilistic models to create coherent, diverse, and semantically faithful content from scratch. While early research focused on unconditional sampling, practitioners now demand controllable generation that allows precise specification of layout, pose, motion, or style. While ControlNet grants precise spatial-temporal control, its auxiliary branch markedly increases latency and introduces redundant computation in both uncontrolled regions and denoising steps, especially for video. To address this problem, we introduce EVCtrl, a lightweight, plug-and-play control adapter that slashes overhead without retraining the model. Specifically, we propose a spatio-temporal dual caching strategy for sparse control information. For spatial redundancy, we first profile how each layer of DiT-ControlNet responds to fine-grained control, then partition the network into global and local functional zones. A locality-aware cache focuses computation on the local zones that truly need the control signal, skipping the bulk of redundant computation in global regions. For temporal redundancy, we selectively omit unnecessary denoising steps to improve efficiency. Extensive experiments on CogVideo-Controlnet, Wan2.1-Controlnet, and Flux demonstrate that our method is effective in image and video control generation without the need for training. For example, it achieves 2.16 and 2.05 times speedups on CogVideo-Controlnet and Wan2.1-Controlnet, respectively, with almost no degradation in generation quality.Codes are available in the supplementary materials.
[418] CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance
Anindya Mondal, Ayan Banerjee, Sauradip Nag, Josep Lladós, Xiatian Zhu, Anjan Dutta
Main category: cs.CV
TL;DR: CountLoop is a training-free framework that enables diffusion models to generate scenes with precise object instance counts through iterative multimodal feedback and instance-aware generation techniques.
Details
Motivation: Diffusion models struggle with generating scenes containing precise numbers of object instances, especially in complex, high-density settings where accurate counting and spatial arrangement are crucial.Method: Uses iterative structured feedback alternating between image generation and multimodal agent evaluation. Includes language-guided planner and critic for assessing object counts, spatial arrangements, and attribute consistency. Introduces instance-driven attention masking and compositional generation techniques to improve object separation in occluded scenes.
Result: Achieves counting accuracy up to 98% on COCO Count, T2I CompBench, and new high-instance benchmarks while maintaining spatial fidelity and visual quality. Outperforms layout-based and gradient-guided baselines with a score of 0.97.
Conclusion: CountLoop provides effective training-free instance control for diffusion models through iterative multimodal feedback, addressing a key limitation in current text-to-image generation systems.
Abstract: Diffusion models have shown remarkable progress in photorealistic image synthesis, yet they remain unreliable for generating scenes with a precise number of object instances, particularly in complex and high-density settings. We present CountLoop, a training-free framework that provides diffusion models with accurate instance control through iterative structured feedback. The approach alternates between image generation and multimodal agent evaluation, where a language-guided planner and critic assess object counts, spatial arrangements, and attribute consistency. This feedback is then used to refine layouts and guide subsequent generations. To further improve separation between objects, especially in occluded scenes, we introduce instance-driven attention masking and compositional generation techniques. Experiments on COCO Count, T2I CompBench, and two new high-instance benchmarks show that CountLoop achieves counting accuracy of up to 98% while maintaining spatial fidelity and visual quality, outperforming layout-based and gradient-guided baselines with a score of 0.97.
[419] MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning
Junha Song, Yongsik Jo, So Yeon Min, Quanting Xie, Taehwan Kim, Yonatan Bisk, Jaegul Choo
Main category: cs.CV
TL;DR: Lightweight streaming image captioning model using compact 125M-parameter language component with multimodal self-refinement framework outperforms large MLLMs while being 93x smaller.
Details
Motivation: Existing MLLMs for streaming image captioning have high computational costs that hinder practical applications in systems like video chatbots and navigation robots, motivating development of a lightweight alternative.Method: Replace large language component in MLLMs with compact 125M-parameter model, then add multimodal self-refinement framework inspired by human visual processing: first generate coarse caption, then refine using features from salient regions identified from previous coarse caption.
Result: Compact model achieves comparable performance to MLLMs despite 93x size reduction, and with self-refinement framework shows superiority in both single-sentence and detailed captioning, extending to long-range video QA tasks.
Conclusion: Factual image captioning doesn’t require complex reasoning of LLMs; lightweight models with human-inspired refinement can achieve superior performance with dramatically reduced computational cost.
Abstract: Systems such as video chatbots and navigation robots often depend on streaming image captioning to interpret visual inputs. Existing approaches typically employ large multimodal language models (MLLMs) for this purpose, but their substantial computational cost hinders practical application. This limitation motivates our development of a lightweight captioning model. Our investigation begins by replacing the large-scale language component in MLLMs with a compact 125M-parameter model. Surprisingly, this compact model, despite a 93x reduction in size, achieves comparable performance to MLLMs, suggesting that factual image captioning does not significantly require the complex reasoning abilities of LLMs. Despite this promising result, our lightweight model still lacks reliability. To address this, we draw inspiration from the human visual process: perceiving a global and coarse understanding of the scene before attending to finer details. Accordingly, we propose a multimodal self-refinement framework that guides the model to utilize features from salient regions, identified by referencing the previous coarse caption, and to produce a refined description. Experimental results demonstrate the superiority of our model in both single-sentence and detailed captioning, extending even to long-range video QA tasks.
[420] Algorithms Trained on Normal Chest X-rays Can Predict Health Insurance Types
Chi-Yu Chen, Rawan Abulibdeh, Arash Asgari, Sebastián Andrés Cajas Ordóñez, Leo Anthony Celi, Deirdre Goode, Hassan Hamidi, Laleh Seyyed-Kalantari, Ned McCague, Thomas Sounack, Po-Chih Kuo
Main category: cs.CV
TL;DR: Deep learning models can predict patients’ health insurance type (a socioeconomic proxy) from normal chest X-rays with ~0.70 AUC, revealing that medical images encode hidden social inequality signals.
Details
Motivation: To investigate whether medical AI models can detect and exploit hidden social inequality signals in medical imaging data, challenging the assumption that medical images are neutral biological data.Method: Used state-of-the-art architectures (DenseNet121, SwinV2-B, MedMamba) trained on chest X-rays from MIMIC-CXR-JPG and CheXpert datasets to predict health insurance type. Conducted controls for demographic features (age, race, sex) and single-racial-group training. Used patch-based occlusion to localize signals.
Result: Models achieved significant accuracy (AUC ~0.70 on MIMIC, ~0.68 on CheXpert) predicting insurance type from normal X-rays. Signal persists despite demographic controls and single-racial-group training. Patch occlusion shows diffuse signal in upper/mid-thoracic regions, suggesting environmental/equipment differences.
Conclusion: Medical AI models internalize subtle socioeconomic signatures from clinical environments, equipment, or care pathways. This reframes fairness in medical AI: beyond dataset balancing, we must interrogate and disentangle social fingerprints embedded in clinical data itself.
Abstract: Artificial intelligence is revealing what medicine never intended to encode. Deep vision models, trained on chest X-rays, can now detect not only disease but also invisible traces of social inequality. In this study, we show that state-of-the-art architectures (DenseNet121, SwinV2-B, MedMamba) can predict a patient’s health insurance type, a strong proxy for socioeconomic status, from normal chest X-rays with significant accuracy (AUC around 0.70 on MIMIC-CXR-JPG, 0.68 on CheXpert). The signal was unlikely contributed by demographic features by our machine learning study combining age, race, and sex labels to predict health insurance types; it also remains detectable when the model is trained exclusively on a single racial group. Patch-based occlusion reveals that the signal is diffuse rather than localized, embedded in the upper and mid-thoracic regions. This suggests that deep networks may be internalizing subtle traces of clinical environments, equipment differences, or care pathways; learning socioeconomic segregation itself. These findings challenge the assumption that medical images are neutral biological data. By uncovering how models perceive and exploit these hidden social signatures, this work reframes fairness in medical AI: the goal is no longer only to balance datasets or adjust thresholds, but to interrogate and disentangle the social fingerprints embedded in clinical data itself.
[421] Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching
Feng Wang, Zihao Yu
Main category: cs.CV
TL;DR: The paper introduces CPS (Coefficients-Preserving Sampling) to fix noise artifacts in SDE-based RL for Flow Matching, enabling better reward modeling and faster RL convergence.
Details
Motivation: SDE-based sampling in RL for Flow Matching introduces harmful noise artifacts that degrade image quality and disrupt reward learning, requiring a solution.Method: Proposed CPS method inspired by DDIM to reformulate sampling process, preserving coefficients while eliminating stochastic noise artifacts.
Result: CPS eliminates noise artifacts, enables more accurate reward modeling, and leads to faster, more stable convergence for RL optimizers like Flow-GRPO and Dance-GRPO.
Conclusion: CPS successfully addresses SDE noise issues in RL for Flow Matching, improving training stability and convergence for reinforcement learning-based image/video generation.
Abstract: Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step for applying online RL methods on Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized by Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback to this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS
[422] A PCA Based Model for Surface Reconstruction from Incomplete Point Clouds
Hao Liu
Main category: cs.CV
TL;DR: PCA-based model for surface reconstruction from incomplete point clouds using estimated normals as regularization and operator-splitting optimization.
Details
Motivation: Point cloud data often has missing regions due to scanning limitations (light absorption, occlusions), making surface reconstruction challenging when data is incomplete.Method: Uses PCA to estimate surface normals from available point cloud data, then incorporates these normals as regularization in the reconstruction model. Employs operator-splitting method for optimization.
Result: Model successfully infers surface structures in data-missing regions and reconstructs underlying surfaces, outperforming existing methods.
Conclusion: PCA-based approach with normal regularization effectively handles incomplete point cloud data for surface reconstruction.
Abstract: Point cloud data represents a crucial category of information for mathematical modeling, and surface reconstruction from such data is an important task across various disciplines. However, during the scanning process, the collected point cloud data may fail to cover the entire surface due to factors such as high light-absorption rate and occlusions, resulting in incomplete datasets. Inferring surface structures in data-missing regions and successfully reconstructing the surface poses a challenge. In this paper, we present a Principal Component Analysis (PCA) based model for surface reconstruction from incomplete point cloud data. Initially, we employ PCA to estimate the normal information of the underlying surface from the available point cloud data. This estimated normal information serves as a regularizer in our model, guiding the reconstruction of the surface, particularly in areas with missing data. Additionally, we introduce an operator-splitting method to effectively solve the proposed model. Through systematic experimentation, we demonstrate that our model successfully infers surface structures in data-missing regions and well reconstructs the underlying surfaces, outperforming existing methodologies.
[423] Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models
Tianyang Han, Junhao Su, Junjie Hu, Peizhen Yang, Hengyu Shi, Junfeng Luo, Jialin Gao
Main category: cs.CV
TL;DR: PicWorld is a new benchmark that evaluates text-to-image models’ ability to understand implicit world knowledge and physical causal reasoning, revealing significant limitations in current models.
Details
Motivation: Current T2I models produce photorealistic images but often fail on prompts requiring implicit world knowledge. Existing evaluation methods focus too much on compositional alignment or use simple VQA-based scoring, missing critical dimensions like knowledge grounding, multi-physics interactions, and evidence-based verification.Method: Created PicWorld benchmark with 1,100 prompts across three core categories. Developed PW-Agent, an evidence-grounded multi-agent evaluator that hierarchically assesses images by decomposing prompts into verifiable visual evidence for physical realism and logical consistency.
Result: Evaluation of 17 mainstream T2I models shows they universally exhibit fundamental limitations in implicit world knowledge and physical causal reasoning to varying degrees.
Conclusion: The findings highlight the need for reasoning-aware, knowledge-integrative architectures in future T2I systems to improve their understanding of implicit world knowledge and physical causality.
Abstract: Text-to-image (T2I) models today are capable of producing photorealistic, instruction-following images, yet they still frequently fail on prompts that require implicit world knowledge. Existing evaluation protocols either emphasize compositional alignment or rely on single-round VQA-based scoring, leaving critical dimensions such as knowledge grounding, multi-physics interactions, and auditable evidence-substantially undertested. To address these limitations, we introduce PicWorld, the first comprehensive benchmark that assesses the grasp of implicit world knowledge and physical causal reasoning of T2I models. This benchmark consists of 1,100 prompts across three core categories. To facilitate fine-grained evaluation, we propose PW-Agent, an evidence-grounded multi-agent evaluator to hierarchically assess images on their physical realism and logical consistency by decomposing prompts into verifiable visual evidence. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, illustrating that they universally exhibit a fundamental limitation in their capacity for implicit world knowledge and physical causal reasoning to varying degrees. The findings highlight the need for reasoning-aware, knowledge-integrative architectures in future T2I systems. The code is available at https://github.com/D4-Lab/PicWorld}{https://github.com/D4-Lab/PicWorld.
[424] ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View
Wenbin Teng, Gonglin Chen, Haiwei Chen, Yajie Zhao
Main category: cs.CV
TL;DR: ARSS: A novel autoregressive framework for novel view synthesis from single images using GPT-style decoder-only transformer with camera trajectory conditioning.
Details
Motivation: Existing diffusion-based novel view synthesis methods lack strict causal structure along camera trajectories, while autoregressive models naturally operate causally by generating each token based on previous tokens.Method: Uses GPT-style decoder-only AR model with video tokenizer for discrete image representation, camera encoder for 3D positional guidance, and autoregressive transformer with random spatial permutation while maintaining temporal order.
Result: Achieves performance comparable to state-of-the-art diffusion-based view synthesis methods on public datasets, demonstrating the effectiveness of autoregressive approach for novel view generation.
Conclusion: ARSS provides a novel autoregressive alternative to diffusion models for novel view synthesis, offering causal generation along camera trajectories while maintaining competitive performance.
Abstract: Diffusion models have achieved impressive results in world modeling tasks, including novel view generation from sparse inputs. However, most existing diffusion-based NVS methods generate target views jointly via an iterative denoising process, which makes it less straightforward to impose a strictly causal structure along a camera trajectory. In contrast, autoregressive (AR) models operate in a causal fashion, generating each token based on all previously generated tokens. In this work, we introduce ARSS, a novel framework that leverages a GPT-style decoder-only AR model to generate novel views from a single image, conditioned on a predefined camera trajectory. We employ an off-the-shelf video tokenizer to map continuous image sequences into discrete tokens and propose a camera encoder that converts camera trajectories into 3D positional guidance. Then to enhance generation quality while preserving the autoregressive structure, we propose an autoregressive transformer module that randomly permutes the spatial order of tokens while maintaining their temporal order. Qualitative and quantitative experiments on public datasets demonstrate that our method achieves overall performance comparable to state-of-the-art view synthesis approaches based on diffusion models. Project page: https://wbteng9526.github.io/arss/.
[425] MetaChest: Generalized few-shot learning of pathologies from chest X-rays
Berenice Montalvo-Lezama, Gibran Fuentes-Pineda
Main category: cs.CV
TL;DR: MetaChest dataset enables few-shot learning for chest X-ray pathology classification, showing transfer learning outperforms few-shot methods in multi-label scenarios.
Details
Motivation: Medical image analysis faces data scarcity; few-shot learning methods are needed but current approaches don't address the practical scenario of learning new classes while leveraging known ones (generalized few-shot classification).Method: Created MetaChest dataset (479,215 chest X-rays from 4 databases) with meta-set partition for few-shot classification and algorithm for generating multi-label episodes. Evaluated transfer learning approach and ProtoNet extension across various few-shot multi-label tasks.
Result: Increasing classes per episode and training examples per class improves performance. Transfer learning consistently outperforms ProtoNet extension. Higher-resolution images improve accuracy but increase computation, while efficient architectures achieve comparable performance with fewer resources.
Conclusion: MetaChest enables few-shot learning research for medical imaging; transfer learning remains surprisingly effective for few-shot multi-label classification despite not being specifically designed for it.
Abstract: The limited availability of annotated data presents a major challenge for applying deep learning methods to medical image analysis. Few-shot learning methods aim to recognize new classes from only a small number of labeled examples. These methods are typically studied under the standard few-shot learning setting, where all classes in a task are new. However, medical applications such as pathology classification from chest X-rays often require learning new classes while simultaneously leveraging knowledge of previously known ones, a scenario more closely aligned with generalized few-shot classification. Despite its practical relevance, few-shot learning has been scarcely studied in this context. In this work, we present MetaChest, a large-scale dataset of 479,215 chest X-rays collected from four public databases. MetaChest includes a meta-set partition specifically designed for standard few-shot classification, as well as an algorithm for generating multi-label episodes. We conduct extensive experiments evaluating both a standard transfer learning approach and an extension of ProtoNet across a wide range of few-shot multi-label classification tasks. Our results demonstrate that increasing the number of classes per episode and the number of training examples per class improves classification performance. Notably, the transfer learning approach consistently outperforms the ProtoNet extension, despite not being tailored for few-shot learning. We also show that higher-resolution images improve accuracy at the cost of additional computation, while efficient model architectures achieve comparable performance to larger models with significantly reduced resource requirements.
[426] ReSplat: Learning Recurrent Gaussian Splats
Haofei Xu, Daniel Barath, Andreas Geiger, Marc Pollefeys
Main category: cs.CV
TL;DR: ReSplat is a feed-forward recurrent Gaussian splatting model that iteratively refines 3D Gaussians using rendering error as feedback, achieving SOTA performance with fewer Gaussians and faster rendering.
Details
Motivation: Traditional feed-forward Gaussian splatting models are limited by single-pass inference, which constrains their performance and generalization capabilities. There's a need for a method that can iteratively refine 3D representations without explicit gradient computation while maintaining computational efficiency.Method: ReSplat uses a recurrent network that iteratively refines 3D Gaussians using rendering error as feedback signal. It starts with a compact reconstruction model operating in 16× subsampled space to produce fewer initial Gaussians, then performs recurrent updates guided by the rendering error to adapt to unseen data distributions at test time.
Result: Achieves state-of-the-art performance across varying input views (2-32), resolutions (256×256 to 540×960), and datasets (DL3DV, RealEstate10K, ACID) while significantly reducing the number of Gaussians and improving rendering speed compared to previous methods.
Conclusion: ReSplat demonstrates that rendering error serves as an effective feedback signal for recurrent refinement of 3D Gaussians, enabling robust generalization with computational efficiency. The approach successfully combines the benefits of feed-forward models with iterative refinement capabilities.
Abstract: While feed-forward Gaussian splatting models offer computational efficiency and can generalize to sparse input settings, their performance is fundamentally constrained by relying on a single forward pass for inference. We propose ReSplat, a feed-forward recurrent Gaussian splatting model that iteratively refines 3D Gaussians without explicitly computing gradients. Our key insight is that the Gaussian splatting rendering error serves as a rich feedback signal, guiding the recurrent network to learn effective Gaussian updates. This feedback signal naturally adapts to unseen data distributions at test time, enabling robust generalization across datasets, view counts and image resolutions. To initialize the recurrent process, we introduce a compact reconstruction model that operates in a $16 \times$ subsampled space, producing $16 \times$ fewer Gaussians than previous per-pixel Gaussian models. This substantially reduces computational overhead and allows for efficient Gaussian updates. Extensive experiments across varying of input views (2, 8, 16, 32), resolutions ($256 \times 256$ to $540 \times 960$), and datasets (DL3DV, RealEstate10K and ACID) demonstrate that our method achieves state-of-the-art performance while significantly reducing the number of Gaussians and improving the rendering speed. Our project page is at https://haofeixu.github.io/resplat/.
[427] FootFormer: Estimating Stability from Visual Input
Keaton Kraiger, Jingjing Li, Skanda Bharadwaj, Jesse Scott, Robert T. Collins, Yanxi Liu
Main category: cs.CV
TL;DR: FootFormer is a cross-modality approach that predicts human motion dynamics from visual input, achieving state-of-the-art performance in estimating foot pressure, foot contact, center of mass, and stability-predictive components.
Details
Motivation: Existing methods typically generate only one or two measures of human motion dynamics (foot pressure, foot contact, or center of mass), lacking a comprehensive approach that can jointly predict all these important stability-related metrics from visual input.Method: FootFormer uses a cross-modality approach to jointly predict human motion dynamics directly from visual input, enabling simultaneous estimation of foot pressure distributions, foot contact maps, and center of mass (CoM).
Result: On multiple datasets, FootFormer achieves statistically significantly better or equivalent estimates of foot pressure distributions, foot contact maps, and CoM compared to existing methods. It also achieves state-of-the-art performance in estimating stability-predictive components (CoP, CoM, BoS) used in classic kinesiology metrics.
Conclusion: FootFormer provides a comprehensive solution for predicting human motion dynamics from vision, outperforming existing methods in estimating multiple stability-related metrics simultaneously, with potential applications in biomechanics, rehabilitation, and human movement analysis.
Abstract: We propose FootFormer, a cross-modality approach for jointly predicting human motion dynamics directly from visual input. On multiple datasets, FootFormer achieves statistically significantly better or equivalent estimates of foot pressure distributions, foot contact maps, and center of mass (CoM), as compared with existing methods that generate one or two of those measures. Furthermore, FootFormer achieves SOTA performance in estimating stability-predictive components (CoP, CoM, BoS) used in classic kinesiology metrics. Code and data are available at https://github.com/keatonkraiger/Vision-to-Stability.git.
[428] PPTArena: A Benchmark for Agentic PowerPoint Editing
Michael Ofengenden, Yunze Man, Ziqi Pang, Yu-Xiong Wang
Main category: cs.CV
TL;DR: PPTArena is a benchmark for PowerPoint editing that tests reliable slide modifications using natural language instructions, with PPTPilot as a structure-aware editing agent that outperforms existing systems.
Details
Motivation: Current PowerPoint editing benchmarks use image-PDF renderings or text-to-slide generation, lacking focus on in-place editing of real slides with natural language instructions. There's a need for reliable document-scale editing with precise control over text, charts, tables, animations, and styles.Method: PPTArena includes 100 decks, 2125 slides, and 800+ targeted edits with ground-truth decks and target outcomes. It uses a dual VLM-as-judge pipeline scoring instruction following and visual quality via structural diffs and slide images. PPTPilot is a structure-aware agent that plans semantic edit sequences, routes between programmatic tools and XML operations, and uses iterative plan-edit-check loops with task-specific constraints.
Result: PPTPilot outperforms proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency.
Conclusion: Despite PPTPilot’s improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting remaining challenges in reliable PowerPoint editing that require further research.
Abstract: We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.
[429] TRELLISWorld: Training-Free World Generation from Object Generators
Hanke Chen, Yuan Liu, Minchen Li
Main category: cs.CV
TL;DR: Training-free 3D scene generation by repurposing text-to-3D object diffusion models as modular tile generators, enabling scalable synthesis of large, coherent scenes without scene-level training.
Details
Motivation: Existing 3D scene generation methods are limited to single objects, require domain-specific training, or lack full 360-degree viewability, creating barriers for applications like virtual prototyping, AR/VR, and simulation.Method: Reformulates scene generation as multi-tile denoising problem using overlapping 3D regions independently generated by text-to-3D object diffusion models, with seamless blending via weighted averaging. Eliminates need for scene-level datasets or retraining.
Result: Enables scalable synthesis of large, coherent scenes with local semantic control, supports diverse scene layouts, efficient generation, and flexible editing while inheriting generalization capabilities of object-level priors.
Conclusion: Establishes a simple yet powerful foundation for general-purpose, language-driven 3D scene construction without requiring scene-level training or complex heuristics.
Abstract: Text-driven 3D scene generation holds promise for a wide range of applications, from virtual prototyping to AR/VR and simulation. However, existing methods are often constrained to single-object generation, require domain-specific training, or lack support for full 360-degree viewability. In this work, we present a training-free approach to 3D scene synthesis by repurposing general-purpose text-to-3D object diffusion models as modular tile generators. We reformulate scene generation as a multi-tile denoising problem, where overlapping 3D regions are independently generated and seamlessly blended via weighted averaging. This enables scalable synthesis of large, coherent scenes while preserving local semantic control. Our method eliminates the need for scene-level datasets or retraining, relies on minimal heuristics, and inherits the generalization capabilities of object-level priors. We demonstrate that our approach supports diverse scene layouts, efficient generation, and flexible editing, establishing a simple yet powerful foundation for general-purpose, language-driven 3D scene construction.
[430] MotionStream: Real-Time Video Generation with Interactive Motion Controls
Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, Xun Huang
Main category: cs.CV
TL;DR: MotionStream enables real-time video generation at up to 29 FPS with sub-second latency by distilling a bidirectional motion-conditioned model into a causal streaming model using novel attention mechanisms.
Details
Motivation: Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction, making them unsuitable for interactive applications.Method: 1) Augment text-to-video model with motion control for high-quality generation, 2) Distill bidirectional teacher into causal student using Self Forcing with Distribution Matching Distillation, 3) Introduce sliding-window causal attention with attention sinks, 4) Use self-rollout with attention sinks and KV cache rolling during training to simulate inference-time extrapolations with fixed context window.
Result: Achieves state-of-the-art results in motion following and video quality while being two orders of magnitude faster (up to 29 FPS streaming generation on single GPU), enabling infinite-length streaming and real-time interactive video generation.
Conclusion: MotionStream uniquely enables real-time interactive video generation where users can paint trajectories, control cameras, or transfer motion and see results unfold instantly, representing a breakthrough in streaming video synthesis.
Abstract: Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not perform inference on the fly. As such, we distill this bidirectional teacher into a causal student through Self Forcing with Distribution Matching Distillation, enabling real-time streaming inference. Several key challenges arise when generating videos of long, potentially infinite time-horizons – (1) bridging the domain gap from training on finite length and extrapolating to infinite horizons, (2) sustaining high quality by preventing error accumulation, and (3) maintaining fast inference, without incurring growth in computational cost due to increasing context windows. A key to our approach is introducing carefully designed sliding-window causal attention, combined with attention sinks. By incorporating self-rollout with attention sinks and KV cache rolling during training, we properly simulate inference-time extrapolations with a fixed context window, enabling constant-speed generation of arbitrarily long videos. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming. With MotionStream, users can paint trajectories, control cameras, or transfer motion, and see results unfold in real-time, delivering a truly interactive experience.
[431] FastGS: Training 3D Gaussian Splatting in 100 Seconds
Shiwei Ren, Tianci Wen, Yongchun Fang, Biao Lu
Main category: cs.CV
TL;DR: FastGS is a novel acceleration framework for 3D Gaussian Splatting that uses multi-view consistency for efficient densification and pruning, achieving 3-15× faster training while maintaining comparable rendering quality.
Details
Motivation: Current 3DGS acceleration methods fail to properly regulate Gaussian counts during training, leading to redundant computational overhead and inefficient training-speed vs quality trade-offs.Method: Proposes FastGS with innovative densification and pruning strategy based on multi-view consistency, eliminating the need for budgeting mechanisms by fully considering Gaussian importance across multiple views.
Result: Achieves 3.32× training acceleration vs DashGaussian on Mip-NeRF 360, 15.45× acceleration vs vanilla 3DGS on Deep Blending, with comparable rendering quality. Demonstrates 2-7× acceleration across various tasks including dynamic scenes, surface reconstruction, and SLAM.
Conclusion: FastGS provides a simple, general acceleration framework that efficiently balances training speed and rendering quality through multi-view consistency-based Gaussian management, showing strong performance across diverse 3D reconstruction tasks.
Abstract: The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.32$\times$ training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45$\times$ acceleration compared with vanilla 3DGS on the Deep Blending dataset. We demonstrate that FastGS exhibits strong generality, delivering 2-7$\times$ training acceleration across various tasks, including dynamic scene reconstruction, surface reconstruction, sparse-view reconstruction, large-scale reconstruction, and simultaneous localization and mapping. The project page is available at https://fastgs.github.io/
[432] 3D-ANC: Adaptive Neural Collapse for Robust 3D Point Cloud Recognition
Yuanmin Huang, Wenxuan Li, Mi Zhang, Xiaohan Zhang, Xiaoyu You, Min Yang
Main category: cs.CV
TL;DR: 3D-ANC: A neural collapse-based defense method that improves 3D point cloud model robustness against adversarial attacks by creating disentangled feature spaces through ETF alignment and adaptive training.
Details
Motivation: Deep neural networks for 3D point cloud recognition are vulnerable to adversarial attacks, and existing defenses struggle due to entangled feature spaces that make attacks easy to perform.Method: Uses Neural Collapse mechanism with ETF-aligned classification module and adaptive training framework (representation-balanced learning and dynamic feature direction loss) to address class imbalance and geometric similarities in 3D data.
Result: Significantly improves model robustness - DGCNN’s classification accuracy increased from 27.2% to 80.9% on ModelNet40 (53.7% absolute gain), surpassing leading baselines by 34.0%.
Conclusion: 3D-ANC effectively creates disentangled feature spaces for 3D point cloud models, substantially enhancing adversarial robustness despite complex 3D data distribution challenges.
Abstract: Deep neural networks have recently achieved notable progress in 3D point cloud recognition, yet their vulnerability to adversarial perturbations poses critical security challenges in practical deployments. Conventional defense mechanisms struggle to address the evolving landscape of multifaceted attack patterns. Through systematic analysis of existing defenses, we identify that their unsatisfactory performance primarily originates from an entangled feature space, where adversarial attacks can be performed easily. To this end, we present 3D-ANC, a novel approach that capitalizes on the Neural Collapse (NC) mechanism to orchestrate discriminative feature learning. In particular, NC depicts where last-layer features and classifier weights jointly evolve into a simplex equiangular tight frame (ETF) arrangement, establishing maximally separable class prototypes. However, leveraging this advantage in 3D recognition confronts two substantial challenges: (1) prevalent class imbalance in point cloud datasets, and (2) complex geometric similarities between object categories. To tackle these obstacles, our solution combines an ETF-aligned classification module with an adaptive training framework consisting of representation-balanced learning (RBL) and dynamic feature direction loss (FDL). 3D-ANC seamlessly empowers existing models to develop disentangled feature spaces despite the complexity in 3D data distribution. Comprehensive evaluations state that 3D-ANC significantly improves the robustness of models with various structures on two datasets. For instance, DGCNN’s classification accuracy is elevated from 27.2% to 80.9% on ModelNet40 – a 53.7% absolute gain that surpasses leading baselines by 34.0%.
[433] STONE: Pioneering the One-to-N Universal Backdoor Threat in 3D Point Cloud
Dongmei Shan, Wei Lian, Chongxia Wang
Main category: cs.CV
TL;DR: STONE is the first method for one-to-N multi-target backdoor attacks on 3D point clouds using a configurable spherical trigger design with theoretical grounding in Neural Tangent Kernel analysis.
Details
Motivation: Existing 3D point cloud backdoor attacks are limited to one-to-one paradigms, leaving the more flexible and universal one-to-N multi-target threat unexplored, which poses critical security risks in safety-sensitive domains like autonomous driving and robotics.Method: STONE uses a configurable spherical trigger design with parameterized spatial properties that create a dynamic key space, enabling a single trigger to map to multiple target labels. The method is theoretically grounded in Neural Tangent Kernel (NTK) analysis.
Result: Achieves high attack success rates (up to 100%) without compromising clean-data accuracy, establishing a foundational benchmark for multi-target backdoor threats under dirty-label and black-box settings in 3D vision.
Conclusion: This work provides the first formal basis for one-to-N backdoor mappings in 3D models and represents a crucial step toward securing future intelligent systems against multi-target backdoor threats.
Abstract: Backdoor attacks pose a critical threat to deep learning, especially in safety-sensitive 3D domains such as autonomous driving and robotics. While potent, existing attacks on 3D point clouds are predominantly limited to one-to-one paradigms. The more flexible and universal one-to-N multi-target backdoor threat remains largely unexplored, lacking both theoretical and practical foundations. To bridge this gap, we propose STONE (Spherical Trigger One-to-N universal backdoor Enabling), the first method to instantiate this threat via a configurable spherical trigger design. Its parameterized spatial properties establish a dynamic key space, enabling a single trigger to map to multiple target labels. Theoretically, we ground STONE in a Neural Tangent Kernel (NTK) analysis, providing the first formal basis for one-to-N mappings in 3D models. Empirically, extensive evaluations demonstrate high attack success rates (up to 100%) without compromising clean-data accuracy. This work establishes a foundational benchmark for multi-target backdoor threats under dirty-label and black-box settings in 3D vision – a crucial step toward securing future intelligent systems.
[434] D$^{2}$-VPR: A Parameter-efficient Visual-foundation-model-based Visual Place Recognition Method via Knowledge Distillation and Deformable Aggregation
Zheyuan Zhang, Jiwei Zhang, Boyu Zhou, Linzhimeng Duan, Hong Chen
Main category: cs.CV
TL;DR: D²-VPR: A distillation- and deformable-based framework for Visual Place Recognition that reduces model complexity while maintaining competitive performance by combining knowledge distillation with deformable attention mechanisms.
Details
Motivation: While DINOv2 foundation models improve VPR performance through strong feature generalization, they come with high model complexity and computational overhead that hinder deployment on resource-constrained devices. There's a need to retain the strong feature extraction capabilities while reducing parameters and computational costs.Method: Two-stage training with knowledge distillation and fine-tuning, plus a Distillation Recovery Module (DRM) to align teacher-student feature spaces. Also introduces a Top-Down-attention-based Deformable Aggregator (TDDA) that uses global semantic features to dynamically adjust Regions of Interest for better adaptation to irregular structures.
Result: Achieves competitive performance compared to state-of-the-art approaches while reducing parameter count by ~64.2% and MACs by ~62.6% compared to CricaVPR.
Conclusion: D²-VPR successfully balances performance and efficiency for VPR tasks, making foundation model capabilities more accessible for resource-constrained deployment scenarios through effective distillation and deformable attention mechanisms.
Abstract: Visual Place Recognition (VPR) aims to determine the geographic location of a query image by retrieving its most visually similar counterpart from a geo-tagged reference database. Recently, the emergence of the powerful visual foundation model, DINOv2, trained in a self-supervised manner on massive datasets, has significantly improved VPR performance. This improvement stems from DINOv2’s exceptional feature generalization capabilities but is often accompanied by increased model complexity and computational overhead that impede deployment on resource-constrained devices. To address this challenge, we propose $D^{2}$-VPR, a $D$istillation- and $D$eformable-based framework that retains the strong feature extraction capabilities of visual foundation models while significantly reducing model parameters and achieving a more favorable performance-efficiency trade-off. Specifically, first, we employ a two-stage training strategy that integrates knowledge distillation and fine-tuning. Additionally, we introduce a Distillation Recovery Module (DRM) to better align the feature spaces between the teacher and student models, thereby minimizing knowledge transfer losses to the greatest extent possible. Second, we design a Top-Down-attention-based Deformable Aggregator (TDDA) that leverages global semantic features to dynamically and adaptively adjust the Regions of Interest (ROI) used for aggregation, thereby improving adaptability to irregular structures. Extensive experiments demonstrate that our method achieves competitive performance compared to state-of-the-art approaches. Meanwhile, it reduces the parameter count by approximately 64.2% and MACs by about 62.6% (compared to CricaVPR).Code is available at https://github.com/tony19980810/D2VPR.
[435] uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data
Dahyun Chung, Donghyun Shin, Yujin Sung, Seunggi Moon, Jinwoo Jeon, Byung-Jun Lee
Main category: cs.CV
TL;DR: A lightweight framework for multilingual vision-language alignment that uses English as a semantic anchor, requiring no multilingual image-text pairs and training only a small projection module to align multilingual text with CLIP’s visual representations.
Details
Motivation: CLIP works well for English but struggles with low-resource languages due to scarce multilingual image-text data. Existing multilingual vision-language models perform poorly on underrepresented languages like Czech, Finnish, Croatian, Hungarian, and Romanian.Method: Freeze both pretrained image encoder and multilingual text encoder, train only a compact 1.7M-parameter projection module using contrastive loss over English representations as semantic anchors. No image-text or text-text pairs required.
Result: Significant gains in retrieval performance for five underrepresented languages on multilingual benchmarks, demonstrating robust multilingual alignment even with limited supervision.
Conclusion: The pivot-based, parameter-efficient alignment strategy enables inclusive multimodal learning for low-resource languages without requiring expensive multilingual image-text data collection.
Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated strong generalization across a wide range of visual tasks by leveraging large-scale English-image pairs. However, its extension to low-resource languages remains limited due to the scarcity of high-quality multilingual image-text data. Existing multilingual vision-language models exhibit consistently low retrieval performance in underrepresented languages including Czech, Finnish, Croatian, Hungarian, and Romanian on the Crossmodal-3600 (XM3600) benchmark. To address this, we propose a lightweight and data-efficient framework for multilingual vision-language alignment. Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. Only a compact 1.7M-parameter projection module is trained, using a contrastive loss over English representations as semantic anchors. This minimal training setup enables robust multilingual alignment even for languages with limited supervision. Extensive evaluation across multiple multilingual retrieval benchmarks confirms the effectiveness of our method, showing significant gains in five underrepresented languages where existing models typically underperform. These findings highlight the effectiveness of our pivot-based, parameter-efficient alignment strategy for inclusive multimodal learning.
[436] VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language
Zonghao Ying, Moyang Chen, Nizhang Li, Zhiqiang Wang, Wenxin Zhang, Quanchen Zou, Zonglei Jing, Aishan Liu, Xianglong Liu
Main category: cs.CV
TL;DR: VEIL is a stealthy jailbreak attack framework for text-to-video models that uses benign-looking prompts with implicit cues to bypass safety guardrails while preserving blocked intent.
Details
Motivation: Prior jailbreak attacks on T2V models use obvious adversarial perturbations that are easy to detect. The authors aim to develop more stealthy attacks that exploit models' cross-modal associative patterns through implicit cues in seemingly benign prompts.Method: VEIL uses modular prompt design with three components: neutral scene anchors (surface-level descriptions), latent auditory triggers (innocuous audio descriptions that exploit audio-visual co-occurrence priors), and stylistic modulators (cinematic directives). Attack generation is formalized as constrained optimization over this modular prompt space, solved with guided search balancing stealth and effectiveness.
Result: Extensive experiments on 7 T2V models show VEIL achieves 23% improvement in average attack success rate for commercial models compared to prior methods, demonstrating effective circumvention of safety guardrails while maintaining plausibility.
Conclusion: The paper reveals critical blind spots in T2V safety mechanisms by showing how implicit cues in benign-looking prompts can induce semantically unsafe video generation, highlighting the need for more robust defenses against such stealthy attacks.
Abstract: Jailbreak attacks can circumvent model safety guardrails and reveal critical blind spots. Prior attacks on text-to-video (T2V) models typically add adversarial perturbations to obviously unsafe prompts, which are often easy to detect and defend. In contrast, we show that benign-looking prompts containing rich, implicit cues can induce T2V models to generate semantically unsafe videos that both violate policy and preserve the original (blocked) intent. To realize this, we propose VEIL, a jailbreak framework that leverages T2V models’ cross-modal associative patterns via a modular prompt design. Specifically, our prompts combine three components: neutral scene anchors, which provide the surface-level scene description extracted from the blocked intent to maintain plausibility; latent auditory triggers, textual descriptions of innocuous-sounding audio events (e.g., creaking, muffled noises) that exploit learned audio-visual co-occurrence priors to bias the model toward particular unsafe visual concepts; and stylistic modulators, cinematic directives (e.g., camera framing, atmosphere) that amplify and stabilize the latent trigger’s effect. We formalize attack generation as a constrained optimization over the above modular prompt space and solve it with a guided search procedure that balances stealth and effectiveness. Extensive experiments over 7 T2V models demonstrate the efficacy of our attack, achieving a 23 percent improvement in average attack success rate in commercial models. Our demos and codes can be found at https://github.com/NY1024/VEIL.
[437] Distribution Matching Distillation Meets Reinforcement Learning
Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Liuzhuozheng Li, Hengzhuang Li, Xin Jin, David Liu, Zhen Li, Bo Zhang, Mengmeng Wang, Steven Hoi, Peng Gao, Harry Yang
Main category: cs.CV
TL;DR: DMDR combines reinforcement learning with diffusion model distillation to create few-step generators that can outperform their multi-step teachers.
Details
Motivation: Current distillation methods for diffusion models cap few-step generator performance at the teacher model's level, creating a performance ceiling that needs to be broken.Method: Combines RL with DMD distillation, uses DMD loss as RL regularization, introduces dynamic distribution guidance and dynamic renoise sampling strategies.
Result: Achieves leading visual quality and prompt coherence among few-step methods, with performance exceeding the multi-step teacher model.
Conclusion: DMDR successfully breaks the performance ceiling of distillation by integrating RL, enabling few-step generators to surpass their teachers.
Abstract: Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model to a few-step one to improve inference efficiency. However, the performance of the latter is often capped by the former. To circumvent this dilemma, we propose DMDR, a novel framework that combines Reinforcement Learning (RL) techniques into the distillation process. We show that for the RL of the few-step generator, the DMD loss itself is a more effective regularization compared to the traditional ones. In turn, RL can help to guide the mode coverage process in DMD more effectively. These allow us to unlock the capacity of the few-step generator by conducting distillation and RL simultaneously. Meanwhile, we design the dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. The experiments demonstrate that DMDR can achieve leading visual quality, prompt coherence among few-step methods, and even exhibit performance that exceeds the multi-step teacher.
[438] SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome
Dabin Jeong, Amirhossein Vahidi, Ciro Ramírez-Suástegui, Marie Moullet, Kevin Ly, Mohammad Vali Sanian, Sebastian Birk, Yinshui Chang, Adam Boxall, Daniyal Jafree, Lloyd Steele, Vijaya Baskar MS, Muzlifah Haniffa, Mohammad Lotfollahi
Main category: cs.CV
TL;DR: Sigmma: multi-modal contrastive framework for hierarchical alignment of HE images and spatial transcriptomics across multiple scales, improving gene prediction and cross-modal retrieval.
Details
Motivation: Existing approaches align HE tiles with spatial transcriptomic profiles at a single scale, missing fine-grained cellular structures and spatial organization.Method: Multi-scale contrastive alignment framework with graph-based representation of cell interactions, capturing inter- and intra-subgraph relationships across tissue microenvironment.
Result: Improves gene-expression prediction by avg. 9.78% and cross-modal retrieval by avg. 26.93% across datasets, learns meaningful multi-tissue organization.
Conclusion: Sigmma effectively captures hierarchical tissue structures and cell-cell interactions through multi-scale alignment, enhancing computational pathology analysis.
Abstract: Recent advances in computational pathology have leveraged vision-language models to learn joint representations of Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) profiles. However, existing approaches typically align HE tiles with their corresponding ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization. To address this, we propose Sigmma, a multi-modal contrastive alignment framework for learning hierarchical representations of HE images and spatial transcriptome profiles across multiple scales. Sigmma introduces multi-scale contrastive alignment, ensuring that representations learned at different scales remain coherent across modalities. Furthermore, by representing cell interactions as a graph and integrating inter- and intra-subgraph relationships, our approach effectively captures cell-cell interactions, ranging from fine to coarse, within the tissue microenvironment. We demonstrate that Sigmm learns representations that better capture cross-modal correspondences, leading to an improvement of avg. 9.78% in the gene-expression prediction task and avg. 26.93% in the cross-modal retrieval task across datasets. We further show that it learns meaningful multi-tissue organization in downstream analyses.
[439] MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment
Huangbiao Xu, Huanqi Wu, Xiao Ke, Junyi Wu, Rui Xu, Jinglin Xu
Main category: cs.CV
TL;DR: MCMoE: A missing completion framework with mixture of experts for multimodal action quality assessment that handles incomplete modalities at inference by dynamically reconstructing missing modalities and fusing expert knowledge.
Details
Motivation: Multimodal AQA improves evaluation of subtle action variations but suffers when modalities are missing at inference, causing catastrophic performance degradation and rendering models inoperable.Method: Proposes MCMoE with adaptive gated modality generator to reconstruct missing modalities, modality experts for unimodal knowledge, and dynamic mixing of expert knowledge for joint representations. Uses complete multimodal features and unimodal expert knowledge to guide training.
Result: Achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks.
Conclusion: MCMoE effectively addresses the missing modality problem in multimodal AQA through unified single-stage training with mixture of experts, enabling robust performance even with incomplete inputs.
Abstract: Multimodal Action Quality Assessment (AQA) has recently emerged as a promising paradigm. By leveraging complementary information across shared contextual cues, it enhances the discriminative evaluation of subtle intra-class variations in highly similar action sequences. However, partial modalities are frequently unavailable at the inference stage in reality. The absence of any modality often renders existing multimodal models inoperable. Furthermore, it triggers catastrophic performance degradation due to interruptions in cross-modal interactions. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. Specifically, we propose an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. With a mixture of experts, missing modalities are further refined and complemented. Finally, in the training phase, we mine the complete multimodal features and unimodal expert knowledge to guide modality generation and generation-based joint representation extraction. Extensive experiments demonstrate that our MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks. Code is available at https://github.com/XuHuangbiao/MCMoE.
[440] MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
Zilong Huang, Jun He, Xiaobin Huang, Ziyi Xiong, Yang Luo, Junyan Ye, Weijia Li, Yiping Chen, Ting Han
Main category: cs.CV
TL;DR: MajutsuCity is a natural language-driven framework for generating structurally consistent and stylistically diverse 3D urban scenes with fine-grained controllability through a four-stage pipeline, interactive editing agent, and comprehensive dataset.
Details
Motivation: Existing 3D city generation methods struggle to balance creative flexibility (text-based generation) with object-level editability (structural representations). There's a need for a solution that offers both stylistic diversity and fine-grained controllability for applications in world models, VR, and game development.Method: Four-stage pipeline representing cities as compositions of controllable layouts, assets, and materials. Includes MajutsuAgent for interactive language-grounded editing with five object-level operations, and MajutsuDataset with multimodal data (2D semantic layouts, height maps, 3D building assets, PBR materials, skyboxes).
Result: Reduces layout FID by 83.7% vs CityDreamer and 20.1% vs CityCraft. Ranks first across all AQS and RDR scores, outperforming existing methods by clear margins. Achieves state-of-the-art in geometric fidelity, stylistic adaptability, and semantic controllability.
Conclusion: MajutsuCity establishes new state-of-the-art for 3D city generation, balancing text-based creativity with structural editability through natural language control. The framework includes comprehensive tools (agent, dataset, metrics) that can inspire future research in this domain.
Abstract: Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must satisfy both stylistic diversity, fine-grained, and controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language-driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent} that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset} containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. Meanwhile, we develop a practical set of evaluation metrics, covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% over CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results confirm MajutsuCity as a new state-of-the-art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect our framework can inspire new avenues of research in 3D city generation. Our project page: https://longhz140516.github.io/MajutsuCity/.
[441] Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models
Zhenxiang Lin, Maryam Haghighat, Will Browne, Dimity Miller
Main category: cs.CV
TL;DR: Training-free uncertainty estimation method for vision-language models that detects erroneous predictions by measuring visual feature consistency within classes using probabilistic embeddings.
Details
Motivation: Vision-language models like CLIP have strong open vocabulary classification but assign high confidence to misclassifications, limiting reliability in safety-critical applications where uncertainty estimation is crucial.Method: Training-free, post-hoc uncertainty estimation using feature projection combined with multivariate Gaussians to create class-specific probabilistic embeddings. Measures visual feature consistency within a class, requires no fine-tuning, and works with as few as 10 training images per class.
Result: State-of-the-art error detection performance on ImageNet, Flowers102, Food101, EuroSAT and DTD datasets, significantly outperforming both deterministic and probabilistic VLM baselines.
Conclusion: The proposed method is VLM-agnostic, requires no fine-tuning, demonstrates robustness to distribution shift, and effectively detects erroneous predictions with minimal training data, enhancing reliability of VLMs in safety-critical applications.
Abstract: Vision-language models (VLMs), such as CLIP, have gained popularity for their strong open vocabulary classification performance, but they are prone to assigning high confidence scores to misclassifications, limiting their reliability in safety-critical applications. We introduce a training-free, post-hoc uncertainty estimation method for contrastive VLMs that can be used to detect erroneous predictions. The key to our approach is to measure visual feature consistency within a class, using feature projection combined with multivariate Gaussians to create class-specific probabilistic embeddings. Our method is VLM-agnostic, requires no fine-tuning, demonstrates robustness to distribution shift, and works effectively with as few as 10 training images per class. Extensive experiments on ImageNet, Flowers102, Food101, EuroSAT and DTD show state-of-the-art error detection performance, significantly outperforming both deterministic and probabilistic VLM baselines. Code is available at https://github.com/zhenxianglin/ICPE.
[442] Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou
Main category: cs.CV
TL;DR: Z-Image is an efficient 6B-parameter open-source image generation model that challenges the “scale-at-all-costs” paradigm, achieving state-of-the-art performance with significantly reduced computational costs.
Details
Motivation: Current high-performance image generation is dominated by proprietary systems, while leading open-source alternatives have massive parameter counts (20B-80B) that make them impractical for inference and fine-tuning on consumer hardware.Method: Built on Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture with systematic optimization of the entire model lifecycle, including curated data infrastructure and streamlined training curriculum. Also includes few-step distillation with reward post-training to create Z-Image-Turbo for sub-second inference.
Result: Z-Image achieves performance comparable to or surpassing leading competitors across various dimensions, with exceptional capabilities in photorealistic image generation and bilingual text rendering. Full training completed in just 314K H800 GPU hours (~$630K).
Conclusion: State-of-the-art image generation results are achievable with significantly reduced computational overhead, demonstrating that the “scale-at-all-costs” paradigm can be challenged. The authors release code, weights, and demo to foster accessible, budget-friendly generative models.
Abstract: The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference, and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the “scale-at-all-costs” paradigm. By systematically optimizing the entire model lifecycle – from a curated data infrastructure to a streamlined training curriculum – we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
[443] FACT-GS: Frequency-Aligned Complexity-Aware Texture Reparameterization for 2D Gaussian Splatting
Tianhao Xie, Linlian Jiang, Xinxin Zuo, Yang Wang, Tiberiu Popa
Main category: cs.CV
TL;DR: FACT-GS improves Gaussian Splatting by using frequency-aware adaptive texture sampling instead of uniform sampling, allocating more texture density to high-frequency regions for sharper details without increasing parameters.
Details
Motivation: Current texture-based Gaussian Splatting uses uniform per-Gaussian sampling grids, which inefficiently allocate texture density - high-frequency regions are under-sampled (causing blur) while smooth regions waste capacity, leading to loss of fine structural details.Method: FACT-GS introduces a frequency-aligned complexity-aware texture framework that reformulates texture parameterization as a differentiable sampling-density allocation problem. It replaces uniform textures with a learnable frequency-aware allocation strategy using a deformation field whose Jacobian modulates local sampling density, performing non-uniform sampling on fixed-resolution texture grids.
Result: The method preserves real-time performance while recovering sharper high-frequency details under the same parameter budget, improving texture space utilization by allocating sampling density according to local visual frequency.
Conclusion: FACT-GS provides an efficient adaptive texture sampling approach for Gaussian Splatting that better matches sampling density to local visual complexity, enabling higher-quality appearance modeling without increasing computational cost.
Abstract: Realistic scene appearance modeling has advanced rapidly with Gaussian Splatting, which enables real-time, high-quality rendering. Recent advances introduced per-primitive textures that incorporate spatial color variations within each Gaussian, improving their expressiveness. However, texture-based Gaussians parameterize appearance with a uniform per-Gaussian sampling grid, allocating equal sampling density regardless of local visual complexity. This leads to inefficient texture space utilization, where high-frequency regions are under-sampled and smooth regions waste capacity, causing blurred appearance and loss of fine structural detail. We introduce FACT-GS, a Frequency-Aligned Complexity-aware Texture Gaussian Splatting framework that allocates texture sampling density according to local visual frequency. Grounded in adaptive sampling theory, FACT-GS reformulates texture parameterization as a differentiable sampling-density allocation problem, replacing the uniform textures with a learnable frequency-aware allocation strategy implemented via a deformation field whose Jacobian modulates local sampling density. Built on 2D Gaussian Splatting, FACT-GS performs non-uniform sampling on fixed-resolution texture grids, preserving real-time performance while recovering sharper high-frequency details under the same parameter budget.
[444] Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models
Muhammad Maaz, Hanoona Rasheed, Fahad Shahbaz Khan, Salman Khan
Main category: cs.CV
TL;DR: Video R2: A reinforcement learning approach that improves temporal alignment and reasoning consistency in video understanding models by addressing their tendency to rely on linguistic priors rather than visual evidence.
Details
Motivation: Current multimodal LLMs for video reasoning often produce convincing but logically inconsistent reasoning traces that are weakly grounded in visual evidence, relying too heavily on linguistic priors rather than actual video content.Method: Proposes a reinforcement learning approach with timestamp-aware supervised fine-tuning and Group Relative Policy Optimization (GRPO) guided by a novel Temporal Alignment Reward (TAR) to enhance temporal precision and reasoning consistency.
Result: Video R2 achieves consistently higher Think Answer Consistency (TAC), Video Attention Score (VAS), and accuracy across multiple benchmarks, demonstrating improved temporal alignment and reasoning coherence.
Conclusion: Improvements in temporal alignment and reasoning coherence lead to more accurate and trustworthy video understanding, as demonstrated by Video R2’s superior performance across 11 video reasoning benchmarks.
Abstract: Reasoning over dynamic visual content remains a central challenge for multimodal large language models. Recent thinking models generate explicit reasoning traces for interpretability; however, their reasoning often appears convincing while being logically inconsistent or weakly grounded in visual evidence. We identify and formalize these issues through two diagnostic metrics: Think Answer Consistency (TAC), which measures the alignment between reasoning and answers, and Video Attention Score (VAS), which captures the extent to which reasoning depends on visual versus textual cues. Analysis across 11 video reasoning benchmarks shows that current models rely heavily on linguistic priors rather than visual content. To address this, we propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. Our approach combines timestamp aware supervised fine tuning with Group Relative Policy Optimization (GRPO) guided by a novel Temporal Alignment Reward (TAR). This dual step post training stage encourages temporally aligned and causally coherent video reasoning. The resulting model, Video R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks, demonstrating that improvements in temporal alignment and reasoning coherence lead to more accurate and trustworthy video understanding. Code: https://github.com/mbzuai-oryx/Video-R2
[445] WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing
Kaihang Pan, Weile Chen, Haiyi Qiu, Qifan Yu, Wendong Bu, Zehan Wang, Yun Zhu, Juncheng Li, Siliang Tang
Main category: cs.CV
TL;DR: WiseEdit is a comprehensive benchmark for evaluating cognition- and creativity-informed image editing models, featuring 1,220 test cases across three cognitive steps and three knowledge types.
Details
Motivation: Existing benchmarks are too narrow to holistically assess the advanced cognitive and creative capabilities of modern image editing models, which require more comprehensive evaluation.Method: Decomposes image editing into three cascaded cognitive steps (Awareness, Interpretation, Imagination) and incorporates three knowledge types (Declarative, Procedural, Metacognitive). Creates 1,220 test cases to challenge models at each step and in complex tasks.
Result: The benchmark objectively reveals limitations of state-of-the-art image editing models in knowledge-based cognitive reasoning and creative composition capabilities.
Conclusion: WiseEdit provides a comprehensive evaluation framework for advanced image editing models, addressing the gap in existing benchmarks and enabling better assessment of cognitive and creative capabilities.
Abstract: Recent image editing models boast next-level intelligent capabilities, facilitating cognition- and creativity-informed image editing. Yet, existing benchmarks provide too narrow a scope for evaluation, failing to holistically assess these advanced abilities. To address this, we introduce WiseEdit, a knowledge-intensive benchmark for comprehensive evaluation of cognition- and creativity-informed image editing, featuring deep task depth and broad knowledge breadth. Drawing an analogy to human cognitive creation, WiseEdit decomposes image editing into three cascaded steps, i.e., Awareness, Interpretation, and Imagination, each corresponding to a task that poses a challenge for models to complete at the specific step. It also encompasses complex tasks, where none of the three steps can be finished easily. Furthermore, WiseEdit incorporates three fundamental types of knowledge: Declarative, Procedural, and Metacognitive knowledge. Ultimately, WiseEdit comprises 1,220 test cases, objectively revealing the limitations of SoTA image editing models in knowledge-based cognitive reasoning and creative composition capabilities. The benchmark, evaluation code, and the generated images of each model will be made publicly available soon. Project Page: https://qnancy.github.io/wiseedit_project_page/.
[446] Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction
Boran Wen, Ye Lu, Keyan Wan, Sirui Wang, Jiahong Zhou, Junxuan Liang, Xinpeng Liu, Bang Xiao, Dingbang Huang, Ruiyang Liu, Yong-Lu Li
Main category: cs.CV
TL;DR: 4DHOISolver is an optimization framework for reconstructing 4D human-object interactions from monocular videos using sparse human-in-the-loop contact annotations, enabling creation of Open4DHOI dataset and demonstrating motion imitation capabilities.
Details
Motivation: Generalized robots need diverse human-object interaction data for robust real-world operation. Monocular internet videos offer limitless data but extracting accurate 4D interaction data remains unsolved and challenging.Method: 4DHOISolver framework uses sparse human-in-the-loop contact point annotations to constrain ill-posed 4D HOI reconstruction, maintaining spatio-temporal coherence and physical plausibility through efficient optimization.
Result: Created Open4DHOI dataset with 144 object types and 103 actions, demonstrated RL-based agent imitation of recovered motions, and benchmark showed automatic contact prediction remains unsolved.
Conclusion: Human-in-the-loop strategy is currently necessary for precise 4D HOI reconstruction, presenting an open challenge for the community while providing valuable dataset and framework.
Abstract: Generalized robots must learn from diverse, large-scale human-object interactions (HOI) to operate robustly in the real world. Monocular internet videos offer a nearly limitless and readily available source of data, capturing an unparalleled diversity of human activities, objects, and environments. However, accurately and scalably extracting 4D interaction data from these in-the-wild videos remains a significant and unsolved challenge. Thus, in this work, we introduce 4DHOISolver, a novel and efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem by leveraging sparse, human-in-the-loop contact point annotations, while maintaining high spatio-temporal coherence and physical plausibility. Leveraging this framework, we introduce Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 144 object types and 103 actions. Furthermore, we demonstrate the effectiveness of our reconstructions by enabling an RL-based agent to imitate the recovered motions. However, a comprehensive benchmark of existing 3D foundation models indicates that automatically predicting precise human-object contact correspondences remains an unsolved problem, underscoring the immediate necessity of our human-in-the-loop strategy while posing an open challenge to the community. Data and code will be publicly available at https://wenboran2002.github.io/open4dhoi/
[447] MM-ACT: Learn from Multimodal Parallel Generation to Act
Haotian Liang, Xinyi Chen, Bin Wang, Mingkang Chen, Yitian Liu, Yuhao Zhang, Zanxin Chen, Tianshuo Yang, Yilun Chen, Jiangmiao Pang, Dong Liu, Xiaokang Yang, Yao Mu, Wenqi Shao, Ping Luo
Main category: cs.CV
TL;DR: MM-ACT is a unified Vision-Language-Action model that integrates text, image, and action in shared token space, using re-mask parallel decoding for text/image and one-step parallel decoding for actions, achieving state-of-the-art performance across simulation and real-robot tasks.
Details
Motivation: Generalist robotic policies require both semantic understanding for task planning and predictive capabilities for environment interaction. Current approaches often treat these aspects separately, lacking a unified framework that integrates all three modalities (text, image, action) in a shared representation space.Method: MM-ACT integrates text, image, and action in a shared token space with generation across all three modalities. It uses re-mask parallel decoding for text and image generation, and one-step parallel decoding for action generation. The model employs Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all modalities from a shared context, enabling cross-modal learning that enhances action generation.
Result: Achieves 96.3% success rate on LIBERO simulation, 72.0% across three real Franka robot tasks, and 52.38% across eight bimanual tasks on RoboTwin2.0. Cross-modal learning provides an additional 9.25% performance gain on RoboTwin2.0 tasks.
Conclusion: MM-ACT demonstrates that unified multimodal learning across text, image, and action in shared token space enables effective generalist robotic policies with strong performance across simulation and real-world tasks, with cross-modal learning providing significant performance benefits.
Abstract: A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal learning. Experiments were conducted on the LIBERO simulation and Franka real-robot setups as well as RoboTwin2.0 to assess in-domain and out-of-domain performances respectively. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks of real Franka, and 52.38% across eight bimanual tasks of RoboTwin2.0 with an additional gain of 9.25% from cross-modal learning. We release our codes, models and data at https://github.com/HHYHRHY/MM-ACT.
[448] DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy
Jaewoo Song, Jooyoung Choi, Kanghyun Baek, Sangyub Lee, Daemin Park, Sungroh Yoon
Main category: cs.CV
TL;DR: DCText is a training-free method that improves text rendering in text-to-image generation by using a divide-and-conquer approach with attention masks and localized noise initialization.
Details
Motivation: Current text-to-image models struggle with rendering long or multiple texts due to diluted global attention, leading to poor text accuracy in complex text prompts.Method: Uses divide-and-conquer strategy: 1) decomposes prompts by extracting and dividing target text, 2) assigns each segment to designated regions, 3) applies Text-Focus and Context-Expansion attention masks sequentially during denoising, and 4) uses Localized Noise Initialization for improved accuracy.
Result: Achieves best text accuracy on single- and multi-sentence benchmarks without compromising image quality, while also delivering the lowest generation latency.
Conclusion: DCText effectively addresses text rendering challenges in complex prompts through a training-free, computationally efficient approach that maintains both text accuracy and image quality.
Abstract: Despite recent text-to-image models achieving highfidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each to a designated region. To accurately render each segment within their regions while preserving overall image coherence, we introduce two attention masks - Text-Focus and Context-Expansion - applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multisentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.
[449] Gaussian Swaying: Surface-Based Framework for Aerodynamic Simulation with 3D Gaussians
Hongru Yan, Xiang Zhang, Zeyuan Chen, Fangyin Wei, Zhuowen Tu
Main category: cs.CV
TL;DR: Gaussian Swaying: A surface-based aerodynamic simulation framework using 3D Gaussians that unifies simulation and rendering for realistic wind effects on objects.
Details
Motivation: Realistic aerodynamic effects (like branches swaying, flags rippling, boats rocking) are crucial for realism in vision and graphics, but existing methods have limitations: mesh-based approaches require costly meshing, while particle-based methods rely on discrete positional data.Method: A surface-based framework using 3D Gaussians to model surfaces continuously. It uses Gaussian patches that support both force computation for dynamics and provide normals for lightweight shading, unifying simulation and rendering on the same representation.
Result: Comprehensive experiments on synthetic and real-world datasets across multiple metrics demonstrate state-of-the-art performance and efficiency, offering a scalable approach for realistic aerodynamic scene simulation.
Conclusion: Gaussian Swaying provides an efficient, fine-grained aerodynamic simulation framework that overcomes limitations of mesh-based and particle-based methods, enabling realistic wind effects with unified simulation and rendering capabilities.
Abstract: Branches swaying in the breeze, flags rippling in the wind, and boats rocking on the water all show how aerodynamics shape natural motion – an effect crucial for realism in vision and graphics. In this paper, we present Gaussian Swaying, a surface-based framework for aerodynamic simulation using 3D Gaussians. Unlike mesh-based methods that require costly meshing, or particle-based approaches that rely on discrete positional data, Gaussian Swaying models surfaces continuously with 3D Gaussians, enabling efficient and fine-grained aerodynamic interaction. Our framework unifies simulation and rendering on the same representation: Gaussian patches, which support force computation for dynamics while simultaneously providing normals for lightweight shading. Comprehensive experiments on both synthetic and real-world datasets across multiple metrics demonstrate that Gaussian Swaying achieves state-of-the-art performance and efficiency, offering a scalable approach for realistic aerodynamic scene simulation.
[450] Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling
Meng Cao, Haokun Lin, Haoyuan Li, Haoran Tang, Rongtao Xu, Dong An, Xue Liu, Ian Reid, Xiaodan Liang
Main category: cs.CV
TL;DR: MILO introduces an implicit spatial world modeling paradigm with visual geometry feedback and relative positional encoding to improve MLLMs’ spatial reasoning, trained on a large geometry-aware dataset.
Details
Motivation: Current MLLMs have poor spatial reasoning because they rely on verbal descriptive tuning that learns spatial concepts through textual symbols alone, lacking connection to visual manifestations - a problem called "visual illiteracy."Method: MILO integrates a visual generator for geometry-aware feedback to ground symbolic reasoning in perceptual experience, plus RePE (Relative Positional Encoding) for capturing relative camera-pose transformations instead of absolute coordinates. Trained on GeoGen dataset with 2,241 videos and 67,827 observation-action-outcome triplets.
Result: The approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks, offering more holistic 3D space understanding.
Conclusion: MILO’s implicit spatial world modeling with visual feedback and relative encoding effectively addresses visual illiteracy in MLLMs, improving their spatial reasoning through perceptual grounding.
Abstract: Spatial reasoning, the ability to understand and interpret the 3D structure of the world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods predominantly rely on verbal descriptive tuning, which suffers from visual illiteracy, i.e., they learn spatial concepts through textual symbols alone, devoid of connection to their visual manifestations. To bridge this gap, this paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like spatial imagination. MILO integrates a visual generator to provide geometry-aware feedback, thereby implicitly grounding the MLLM’s symbolic reasoning in perceptual experience. Complementing this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations, offering superior performance over absolute coordinate systems. To support the training, we construct GeoGen, a large-scale Geometry-aware Generative dataset with approximately 2,241 videos and 67,827 observation-action-outcome triplets. Experiments demonstrate that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks, offering a more holistic understanding of 3D space.
[451] Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou
Main category: cs.CV
TL;DR: Skywork-R1V4 is a 30B multimodal agentic model that unifies image manipulation and web search through interleaved reasoning, achieving SOTA results without reinforcement learning.
Details
Motivation: Existing multimodal agentic systems treat image manipulation and web search as separate capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces.Method: 30B parameter multimodal model trained via supervised fine-tuning on <30,000 high-quality planning-execution-consistent trajectories with stepwise consistency filtering. Unifies multimodal planning, active image manipulation (“thinking with images”), deep multimodal search, and interleaved reasoning between visual operations and knowledge retrieval.
Result: Achieves state-of-the-art results: 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Exhibits emergent long-horizon reasoning, orchestrating >10 tool calls for complex tasks.
Conclusion: Sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.
Abstract: Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation (“thinking with images”), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.
[452] LAMP: Language-Assisted Motion Planning for Controllable Video Generation
Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Erkut Erdem, Aykut Erdem, Duygu Ceylan
Main category: cs.CV
TL;DR: LAMP uses LLMs as motion planners to translate natural language into 3D trajectories for objects and cameras via a motion DSL, enabling better motion control for video generation.
Details
Motivation: Existing video generation interfaces have limited motion control capabilities, despite motion being essential for composing complex, cinematic scenes. Current methods lack effective ways to specify object dynamics and camera trajectories from natural language.Method: LAMP leverages LLMs as motion planners with a motion domain-specific language (DSL) inspired by cinematography conventions. It uses program synthesis to generate structured motion programs from natural language, which are deterministically mapped to 3D trajectories. A large-scale procedural dataset pairs text descriptions with motion programs and trajectories.
Result: LAMP demonstrates improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives, establishing the first framework for generating both object and camera motions directly from natural language specifications.
Conclusion: LAMP successfully bridges natural language descriptions to explicit 3D motion trajectories using LLMs and a motion DSL, advancing motion control capabilities in video generation and enabling more complex cinematic scene composition.
Abstract: Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control - specifying object dynamics and camera trajectories - is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP that leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL), inspired by cinematography conventions. By harnessing program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP’s improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives establishing the first framework for generating both object and camera motions directly from natural language specifications. Code, models and data are available on our project page.
[453] PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention
Ziwen Li, Xin Wang, Hanlue Zhang, Runnan Chen, Runqi Lin, Xiao He, Han Huang, Yandong Guo, Fakhri Karray, Tongliang Liu, Mingming Gong
Main category: cs.CV
TL;DR: PosA-VLA: A Vision-Language-Action framework that uses pose-conditioned supervision to anchor visual attention, reducing redundant actions and improving precision in robotic manipulation tasks.
Details
Motivation: Current VLA models struggle with consistent and precise target-oriented actions, generating redundant or unstable motions due to spatially uniform perception fields that get distracted by irrelevant objects in complex environments.Method: Proposes PosA-VLA framework with pose-conditioned anchor attention mechanism that guides model perception toward task-relevant regions, aligning instruction semantics with actionable visual cues. Uses lightweight architecture without auxiliary perception modules.
Result: Extensive experiments show the method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and demonstrates robust generalization in challenging environments.
Conclusion: The pose-conditioned supervision approach effectively addresses redundant action generation in VLAs, improving action precision and efficiency for real-world robotic applications without requiring additional perception modules.
Abstract: The Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions, as they often generate redundant or unstable motions along trajectories, limiting their applicability in time-sensitive scenarios.In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments.To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model’s perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving action generation precision and efficiency. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in a variety of challenging environments.
[454] An Automated Framework for Large-Scale Graph-Based Cerebrovascular Analysis
Daniele Falcetta, Liane S. Canas, Lorenzo Suppa, Matteo Pentassuglia, Jon Cleary, Marc Modat, Sébastien Ourselin, Maria A. Zuluaga
Main category: cs.CV
TL;DR: CaravelMetrics is an automated framework for analyzing brain blood vessels using graph-based representations from 3D MRI scans, enabling large-scale population studies of vascular aging and health.
Details
Motivation: There's a need for scalable, automated tools to quantitatively analyze cerebrovascular networks from medical imaging to study vascular health, aging, and population-level variations.Method: The framework uses skeletonization to create graph representations of blood vessels, integrates atlas-based regional parcellation, extracts centerlines, and computes 15 morphometric, topological, fractal, and geometric features at global and regional scales.
Result: Applied to 570 3D TOF-MRA scans from the IXI dataset (ages 20-86), the framework produced reproducible vessel graphs showing age- and sex-related variations, and education-associated increases in vascular complexity consistent with literature findings.
Conclusion: CaravelMetrics provides a scalable, fully automated approach for quantitative cerebrovascular feature extraction, supporting normative modeling and population-level studies of vascular health and aging.
Abstract: We present CaravelMetrics, a computational framework for automated cerebrovascular analysis that models vessel morphology through skeletonization-derived graph representations. The framework integrates atlas-based regional parcellation, centerline extraction, and graph construction to compute fifteen morphometric, topological, fractal, and geometric features. The features can be estimated globally from the complete vascular network or regionally within arterial territories, enabling multiscale characterization of cerebrovascular organization. Applied to 570 3D TOF-MRA scans from the IXI dataset (ages 20-86), CaravelMetrics yields reproducible vessel graphs capturing age- and sex-related variations and education-associated increases in vascular complexity, consistent with findings reported in the literature. The framework provides a scalable and fully automated approach for quantitative cerebrovascular feature extraction, supporting normative modeling and population-level studies of vascular health and aging.
[455] MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving
Bin Sun, Yaoguang Cao, Yan Wang, Rui Wang, Jiachen Shang, Xiejie Feng, Jiayi Lu, Jia Shi, Shichun Yang, Xiaoyu Yan, Ziying Song
Main category: cs.CV
TL;DR: MindDrive is an end-to-end autonomous driving framework that harmonizes trajectory generation with comprehensive decision reasoning using a “context simulation - candidate generation - multi-objective trade-off” paradigm, achieving state-of-the-art performance.
Details
Motivation: Existing E2E-AD approaches have limitations: trajectory generation methods focus on producing high-quality trajectories but have simple decision mechanisms, while trajectory selection methods perform multi-dimensional evaluation but lack sufficient generative capability. There's a need to integrate both high-quality trajectory generation with comprehensive decision reasoning.Method: MindDrive uses a structured reasoning paradigm with two key components: 1) Future-aware Trajectory Generator (FaTG) based on a World Action Model (WaM) that performs ego-conditioned “what-if” simulations to predict future scenes and generate foresighted trajectory candidates, and 2) VLM-oriented Evaluator (VLoE) that leverages large vision-language models to conduct multi-objective evaluations across safety, comfort, and efficiency dimensions for human-aligned decision making.
Result: Extensive experiments on NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization capabilities.
Conclusion: MindDrive provides a promising path toward interpretable and cognitively guided autonomous driving by harmonizing trajectory generation with comprehensive decision reasoning, offering a more balanced and human-aligned approach to autonomous driving decision making.
Abstract: End-to-End autonomous driving (E2E-AD) has emerged as a new paradigm, where trajectory planning plays a crucial role. Existing studies mainly follow two directions: trajectory generation oriented, which focuses on producing high-quality trajectories with simple decision mechanisms, and trajectory selection oriented, which performs multi-dimensional evaluation to select the best trajectory yet lacks sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of “context simulation - candidate generation - multi-objective trade-off”. In particular, the proposed Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), performs ego-conditioned “what-if” simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across safety, comfort, and efficiency dimensions, leading to reasoned and human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.
[456] Infrared UAV Target Tracking with Dynamic Feature Refinement and Global Contextual Attention Knowledge Distillation
Houzhang Fang, Chenxing Wu, Kun Bai, Tianqi Chen, Xiaolin Wang, Xiyang Liu, Yi Chang, Luxin Yan
Main category: cs.CV
TL;DR: SiamDFF is a novel Siamese network for infrared UAV tracking that uses dynamic feature fusion with selective enhancement, spatial/channel aggregation, and knowledge distillation to handle weak features and complex backgrounds.
Details
Motivation: Infrared UAV targets have weak features and appear in complex backgrounds, making accurate tracking challenging for anti-UAV applications.Method: Proposes SiamDFF with three components: 1) STEN for adaptive region enhancement, 2) DSFAM for multi-scale spatial feature integration, 3) DCFAM for channel feature aggregation. Also introduces target-aware contextual attention knowledge distillation to transfer target priors from teacher to student network.
Result: Extensive experiments on real infrared UAV datasets show SiamDFF outperforms state-of-the-art trackers in complex backgrounds while maintaining real-time tracking speed.
Conclusion: The proposed approach effectively addresses infrared UAV tracking challenges through dynamic feature fusion and knowledge distillation, achieving superior performance in complex scenarios.
Abstract: Unmanned aerial vehicle (UAV) target tracking based on thermal infrared imaging has been one of the most important sensing technologies in anti-UAV applications. However, the infrared UAV targets often exhibit weak features and complex backgrounds, posing significant challenges to accurate tracking. To address these problems, we introduce SiamDFF, a novel dynamic feature fusion Siamese network that integrates feature enhancement and global contextual attention knowledge distillation for infrared UAV target (IRUT) tracking. The SiamDFF incorporates a selective target enhancement network (STEN), a dynamic spatial feature aggregation module (DSFAM), and a dynamic channel feature aggregation module (DCFAM). The STEN employs intensity-aware multi-head cross-attention to adaptively enhance important regions for both template and search branches. The DSFAM enhances multi-scale UAV target features by integrating local details with global features, utilizing spatial attention guidance within the search frame. The DCFAM effectively integrates the mixed template generated from STEN in the template branch and original template, avoiding excessive background interference with the template and thereby enhancing the emphasis on UAV target region features within the search frame. Furthermore, to enhance the feature extraction capabilities of the network for IRUT without adding extra computational burden, we propose a novel tracking-specific target-aware contextual attention knowledge distiller. It transfers the target prior from the teacher network to the student model, significantly improving the student network’s focus on informative regions at each hierarchical level of the backbone network. Extensive experiments on real infrared UAV datasets demonstrate that the proposed approach outperforms state-of-the-art target trackers under complex backgrounds while achieving a real-time tracking speed.
[457] Towards Cross-View Point Correspondence in Vision-Language Models
Yipu Wang, Yuheng Ji, Yuyang Liu, Enshen Zhou, Ziqiang Yang, Yuxuan Tian, Ziheng Qin, Yue Liu, Huajie Tan, Cheng Chi, Zhiyuan Ma, Daniel Dajun Zeng, Xiaolong Zheng
Main category: cs.CV
TL;DR: This paper introduces Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench benchmark to evaluate VLMs’ ability for precise point-level spatial correspondence, showing current models lag far behind humans, and proposes CrossPoint-378K dataset and CroPond model that significantly outperforms state-of-the-art models.
Details
Motivation: Current Vision-Language Models lack precise point-level cross-view correspondence capabilities needed for embodied AI and spatial understanding, especially for affordance interactions requiring fine-grained coordinate prediction.Method: 1) Propose CVPC task and CrossPoint-Bench benchmark with hierarchical design based on human cognitive process; 2) Construct CrossPoint-378K dataset with 378K QA pairs across 900 scenes focused on actionable affordance regions; 3) Develop CroPond model trained on CrossPoint-378K.
Result: State-of-the-art models (Gemini-2.5-Pro) show 54.65% gap behind humans in overall accuracy on CrossPoint-Bench. CroPond achieves SOTA performance, surpassing Gemini-2.5-Pro by 39.7% accuracy, demonstrating significant improvement in cross-view correspondence.
Conclusion: The paper establishes a new benchmark for cross-view point correspondence, reveals substantial limitations in current VLMs for fine-grained spatial reasoning, and provides dataset and model (CroPond) that significantly advance the field, offering foundation for future embodied AI research.
Abstract: Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially in achieving precise point-level correspondence, which is crucial for precise affordance interaction. So we propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with hierarchical design, inspired by the human cognitive process of “perceive”, “reason”, and “correspond”. Our evaluation shows the state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing a challenge in transitioning from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset with 378K question-answering pairs across 900 scenes, focused on actionable affordance regions that better reflect real-world manipulation and interaction scenarios. Furthermore, we propose CroPond that trained on the CrossPoint-378K dataset. Our CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% accuracy, which offers a foundation for advancing future work on cross-view correspondence. The benchmark, dataset, and model are publicly available at https://github.com/WangYipu2002/CrossPoint.
[458] EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
Xin He, Longhui Wei, Jianbo Ouyang, Lingxi Xie, Qi Tian
Main category: cs.CV
TL;DR: EMMA is an efficient unified multimodal architecture that handles understanding, generation, and editing with 4 key innovations: efficient autoencoder compression, channel-wise concatenation, shared-and-decoupled network, and mixture-of-experts encoder.
Details
Motivation: To create a unified multimodal architecture that efficiently handles understanding, generation, and editing tasks while addressing the computational inefficiency of existing approaches that require many tokens for generation and struggle with training balance between different tasks.Method: 1) Efficient autoencoder with 32x compression ratio to reduce token count and ensure training balance; 2) Channel-wise concatenation instead of token-wise concatenation to further reduce visual tokens; 3) Shared-and-decoupled network for mutual task improvements while meeting task-specific requirements; 4) Mixture-of-experts mechanism for visual understanding encoder to enhance perceptual capabilities with minimal parameter increase.
Result: EMMA-4B outperforms state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, and achieves competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image).
Conclusion: EMMA provides an efficient foundation for unified multimodal architectures that can handle understanding, generation, and editing tasks effectively while being computationally efficient, laying groundwork for future multimodal system development.
Abstract: We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for visual understanding encoder, which substantially improves perceptual capabilities with a few parameters increase. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
[459] FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization
Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, Liangtao Zheng, Tao Jiang, Jingjing Gong, Xipeng Qiu, Hang Zhao
Main category: cs.CV
TL;DR: FASTer is a unified framework for efficient robot learning with learnable tokenizer and autoregressive policy, achieving faster inference and higher performance than previous VLA models.
Details
Motivation: Autoregressive VLA models face a trade-off between reconstruction fidelity and inference efficiency in action tokenization, limiting their practical deployment.Method: FASTerVQ encodes action chunks as single-channel images for global spatio-temporal dependencies with high compression. FASTerVLA adds block-wise autoregressive decoding and lightweight action expert on this tokenizer.
Result: FASTerVQ shows superior reconstruction quality, high token utilization, and strong cross-task/cross-embodiment generalization. FASTerVLA surpasses previous SOTA VLA models in both inference speed and task performance.
Conclusion: The FASTer framework successfully addresses the efficiency-performance trade-off in VLA models, enabling more practical and capable robotic manipulation systems.
Abstract: Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.
[460] NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation
Yu Zeng, Charles Ochoa, Mingyuan Zhou, Vishal M. Patel, Vitor Guizilini, Rowan McAllister
Main category: cs.CV
TL;DR: Phase-Preserving Diffusion (φ-PD) preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or extra parameters.
Details
Motivation: Standard diffusion corrupts both magnitude and phase, destroying spatial structure. This makes it unsuitable for tasks requiring geometric consistency like re-rendering, simulation enhancement, and image-to-image translation.Method: Model-agnostic reformulation that preserves input phase while randomizing magnitude. Also introduces Frequency-Selective Structured (FSS) noise with single frequency-cutoff parameter for continuous control over structural rigidity.
Result: Produces controllable, spatially aligned results across photorealistic/stylized re-rendering and sim-to-real enhancement. Improves CARLA-to-Waymo planner performance by 50%. No inference-time cost and compatible with any diffusion model.
Conclusion: φ-PD enables structure-aligned generation without architectural changes, complementary to existing conditioning approaches, broadly applicable to image-to-image and video-to-video generation.
Abstract: Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion φ-PD, a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our \href{https://yuzeng-at-tri.github.io/ppd-page/}{project page}.
[461] Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting
Hao-Jen Chien, Yi-Chuan Huang, Chung-Ho Wu, Wei-Lun Chao, Yu-Lun Liu
Main category: cs.CV
TL;DR: Splannequin improves frozen 3D scene synthesis from monocular videos by addressing ghosting/blur artifacts in dynamic Gaussian splatting through temporal anchoring of hidden and defective Gaussian states.
Details
Motivation: Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge videos presents unique challenges distinct from standard dynamic scene reconstruction. The goal is to create frozen scenes while preserving subtle dynamics for user-controlled instant selection, but monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded.Method: Proposes Splannequin, an architecture-agnostic regularization method that detects two states of Gaussian primitives (hidden and defective) and applies temporal anchoring. Under forward camera motion, hidden states are anchored to recent well-observed past states, while defective states are anchored to future states with stronger supervision. The method integrates into existing dynamic Gaussian pipelines via simple loss terms without architectural changes.
Result: The method achieves markedly improved visual quality for high-fidelity, user-selectable frozen-time renderings, validated by 96% user preference. It adds zero inference overhead and requires no architectural changes to existing dynamic Gaussian pipelines.
Conclusion: Splannequin effectively addresses artifacts in frozen 3D scene synthesis from monocular videos by leveraging temporal anchoring of Gaussian states, enabling high-quality user-controlled frozen renderings while maintaining computational efficiency and compatibility with existing pipelines.
Abstract: Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static scene is rendered by fixing the model’s time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: https://chien90190.github.io/splannequin/
[462] Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning
Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando
Main category: cs.CV
TL;DR: Know-Show is a new benchmark for evaluating spatio-temporal grounded reasoning in Video-LMs, revealing significant gaps between current models and human reasoning, with GRAM proposed as a training-free plug-in to improve fine-grained grounding.
Details
Motivation: Current Video-Language Models show impressive multimodal understanding but lack proper grounding in space and time - they can't effectively "show what they know" by connecting reasoning to visual and temporal evidence.Method: Created Know-Show benchmark with 2.5K human-authored questions from Charades, Action Genome, and Ego4D datasets, covering five spatial and temporal scenarios. Proposed GRAM - a training-free plug-in using attention-based video token selection and explicit timestamp encoding.
Result: Existing Video-LMs (Qwen, VideoLLaVA, GPT-4o, Gemini) struggle with spatio-temporal grounded reasoning, especially in fine-grained hand-object interactions. GRAM improves grounding capabilities without additional training.
Conclusion: Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding, highlighting the need for better spatio-temporal grounding and providing insights for developing more interpretable and reliable multimodal reasoning systems.
Abstract: Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning, the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence. Know-Show unifies reasoning and localization within a single evaluation framework consisting of five complementary scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K human-authored questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounding through attention-based video token selection and explicit timestamp encoding. Extensive experiments across open and closed Video-LMs (Qwen, VideoLLaVA, GPT-4o, and Gemini, etc.) reveal that existing models struggle to “show what they know” and vice versa, especially in fine-grained hand-object interactions. Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding and provides insights toward developing interpretable and reliable multimodal reasoning systems. We will release the dataset and the code at https://github.com/LUNAProject22/Know-Show.
[463] VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation
Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando
Main category: cs.CV
TL;DR: VOST-SGG: A VLM-aided one-stage spatio-temporal scene graph generation framework that integrates vision-language models to address limitations in current DETR-style approaches.
Details
Motivation: Current DETR-style ST-SGG models have two key limitations: 1) learnable queries are semantically uninformed and instance-agnostically initialized, 2) they rely exclusively on unimodal visual features for predicate classification, lacking common sense reasoning capabilities.Method: Proposes VOST-SGG with two main innovations: 1) dual-source query initialization strategy that disentangles what to attend to from where to attend, enabling semantically grounded reasoning, 2) multi-modal feature bank that fuses visual, textual, and spatial cues derived from VLMs for improved predicate classification.
Result: Extensive experiments on Action Genome dataset demonstrate state-of-the-art performance, validating the effectiveness of integrating VLM-aided semantic priors and multi-modal features for ST-SGG.
Conclusion: Integrating vision-language models’ common sense reasoning capabilities into ST-SGG pipelines significantly improves performance by providing semantically informed queries and multi-modal features, addressing key limitations of existing approaches.
Abstract: Spatio-temporal scene graph generation (ST-SGG) aims to model objects and their evolving relationships across video frames, enabling interpretable representations for downstream reasoning tasks such as video captioning and visual question answering. Despite recent advancements in DETR-style single-stage ST-SGG models, they still suffer from several key limitations. First, while these models rely on attention-based learnable queries as a core component, these learnable queries are semantically uninformed and instance-agnostically initialized. Second, these models rely exclusively on unimodal visual features for predicate classification. To address these challenges, we propose VOST-SGG, a VLM-aided one-stage ST-SGG framework that integrates the common sense reasoning capabilities of vision-language models (VLMs) into the ST-SGG pipeline. First, we introduce the dual-source query initialization strategy that disentangles what to attend to from where to attend, enabling semantically grounded what-where reasoning. Furthermore, we propose a multi-modal feature bank that fuses visual, textual, and spatial cues derived from VLMs for improved predicate classification. Extensive experiments on the Action Genome dataset demonstrate that our approach achieves state-of-the-art performance, validating the effectiveness of integrating VLM-aided semantic priors and multi-modal features for ST-SGG. We will release the code at https://github.com/LUNAProject22/VOST.
[464] VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack
Shiji Zhao, Shukun Xiong, Yao Huang, Yan Jin, Zhenyu Wu, Jiyang Guan, Ranjie Duan, Jialing Tao, Hui Xue, Xingxing Wei
Main category: cs.CV
TL;DR: VRSA is a novel jailbreak attack method that decomposes harmful text queries into sequential sub-images to exploit visual reasoning vulnerabilities in MLLMs, achieving higher attack success rates than existing methods.
Details
Motivation: While previous jailbreak attacks focused on text reasoning vulnerabilities, similar threats in visual reasoning have been overlooked. As MLLMs gain powerful cross-modal capabilities, more modalities introduce more potential vulnerabilities that need to be evaluated.Method: VRSA decomposes harmful text into sequentially related sub-images to gradually externalize harmful intent. It uses three key techniques: Adaptive Scene Refinement to optimize relevant scenes, Semantic Coherent Completion to iteratively rewrite sub-texts with context, and Text-Image Consistency Alignment to maintain semantic consistency.
Result: VRSA achieves higher attack success rates compared to state-of-the-art jailbreak attack methods on both open-source and closed-source MLLMs including GPT-4o and Claude-4.5-Sonnet.
Conclusion: The paper demonstrates significant safety risks in visual reasoning tasks of MLLMs and proposes an effective attack method to evaluate these vulnerabilities, highlighting the need for improved safety measures in multimodal systems.
Abstract: Multimodal Large Language Models (MLLMs) are widely used in various fields due to their powerful cross-modal comprehension and generation capabilities. However, more modalities bring more vulnerabilities to being utilized for jailbreak attacks, which induces MLLMs to output harmful content. Due to the strong reasoning ability of MLLMs, previous jailbreak attacks try to explore reasoning safety risk in text modal, while similar threats have been largely overlooked in the visual modal. To fully evaluate potential safety risks in the visual reasoning task, we propose Visual Reasoning Sequential Attack (VRSA), which induces MLLMs to gradually externalize and aggregate complete harmful intent by decomposing the original harmful text into several sequentially related sub-images. In particular, to enhance the rationality of the scene in the image sequence, we propose Adaptive Scene Refinement to optimize the scene most relevant to the original harmful query. To ensure the semantic continuity of the generated image, we propose Semantic Coherent Completion to iteratively rewrite each sub-text combined with contextual information in this scene. In addition, we propose Text-Image Consistency Alignment to keep the semantical consistency. A series of experiments demonstrates that the VRSA can achieve a higher attack success rate compared with the state-of-the-art jailbreak attack methods on both the open-source and closed-source MLLMs such as GPT-4o and Claude-4.5-Sonnet.
cs.AI
[465] Going All-In on LLM Accuracy: Fake Prediction Markets, Real Confidence Signals
Michael Todasco
Main category: cs.AI
TL;DR: LLMs used as evaluators lack confidence signals; framing evaluation as betting game with fictional currency improves forecasting accuracy and provides calibrated confidence measures.
Details
Motivation: Large language models are increasingly used to evaluate other models, but their judgments typically lack any representation of confidence, making it difficult to assess how certain they are about their evaluations.Method: Created 100 math/logic questions with verifiable answers. Six baseline models answered all items. Three predictor models forecasted correctness for each question-baseline pair in two conditions: Control (simple predictions) and Incentive (predictions plus wagers of 1-100,000 LLMCoin with even odds, starting from 1M bankroll).
Result: Incentive runs showed modestly higher accuracy (81.5% vs. 79.1%, p=.089) and significantly faster learning across rounds (12.0% vs. 2.9% improvement). Stake size tracked confidence: “whale” bets (40k+ coins) were correct ~99% of time, while small bets (<1k coins) showed ~74% accuracy.
Conclusion: Betting mechanic creates legible confidence signals absent from binary outputs. Simple financial framing may help transform LLMs into risk-aware forecasters, making internal beliefs visible and usable. Protocol offers foundation for meta-evaluation systems and LLM-to-LLM prediction markets.
Abstract: Large language models are increasingly used to evaluate other models, yet these judgments typically lack any representation of confidence. This pilot study tests whether framing an evaluation task as a betting game (a fictional prediction market with its own LLM currency) improves forecasting accuracy and surfaces calibrated confidence signals. We generated 100 math and logic questions with verifiable answers. Six Baseline models (three current-generation, three prior-generation) answered all items. Three Predictor models then forecasted, for each question-baseline pair, if the baseline would answer correctly. Each predictor completed matched runs in two conditions: Control (simple correct/incorrect predictions) and Incentive (predictions plus wagers of 1-100,000 LLMCoin under even odds, starting from a 1,000,000 LLMCoin bankroll). Across 5,400 predictions per condition, Incentive runs showed modestly higher accuracy (81.5% vs. 79.1%, p = .089, d = 0.86) and significantly faster learning across rounds (12.0 vs. 2.9 percentage-point improvement from Round 1 to Round 4, p = .011). Most notably, stake size tracked confidence. “Whale” bets of 40,000+ coins were correct ~99% of the time, while small bets (<1,000 coins) showed only ~74% accuracy. The key finding is not that fictional money makes models smarter; accuracy gains were modest and did not reach statistical significance (p = .089) in this pilot. Rather, the betting mechanic created a legible confidence signal absent from binary yes/no outputs. This suggests that simple financial framing may help transform LLMs into risk-aware forecasters, making their internal beliefs visible and usable. The protocol offers a foundation for future work for meta-evaluation systems and what may become LLM-to-LLM prediction markets.
[466] Deep learning for autism detection using clinical notes: A comparison of transfer learning for a transparent and black-box approach
Gondy Leroy, Prakash Bisht, Sai Madhuri Kandula, Nell Maltman, Sydney Rice
Main category: cs.AI
TL;DR: A transparent BioBERT-based ML model for ASD diagnosis from clinical text outperforms black-box approaches, with mixed-data training achieving 97% sensitivity and 98% specificity.
Details
Motivation: ASD diagnosis is lengthy and increasing in prevalence, while current ML approaches are black boxes and lack generalizability due to single-dataset training.Method: Uses BioBERT to analyze unstructured clinical text, labels behavioral descriptions, maps to diagnostic criteria, and assigns ASD/not labels. Evaluates transfer learning across two real-world datasets with sequential vs. mixed training strategies, compared against a black-box model.
Result: Transparent model: mixed-data training achieved 97% sensitivity, 98% specificity; sequential training showed slight performance drop. Black-box model performed worse (90% sensitivity, 96% specificity). Transparent approach outperformed black-box overall.
Conclusion: Transparent ML approach enables trustworthy, generalizable ASD diagnosis tools. Mixed-data training yields best performance and should be preferred when possible.
Abstract: Autism spectrum disorder (ASD) is a complex neurodevelopmental condition whose rising prevalence places increasing demands on a lengthy diagnostic process. Machine learning (ML) has shown promise in automating ASD diagnosis, but most existing models operate as black boxes and are typically trained on a single dataset, limiting their generalizability. In this study, we introduce a transparent and interpretable ML approach that leverages BioBERT, a state-of-the-art language model, to analyze unstructured clinical text. The model is trained to label descriptions of behaviors and map them to diagnostic criteria, which are then used to assign a final label (ASD or not). We evaluate transfer learning, the ability to transfer knowledge to new data, using two distinct real-world datasets. We trained on datasets sequentially and mixed together and compared the performance of the best models and their ability to transfer to new data. We also created a black-box approach and repeated this transfer process for comparison. Our transparent model demonstrated robust performance, with the mixed-data training strategy yielding the best results (97 % sensitivity, 98 % specificity). Sequential training across datasets led to a slight drop in performance, highlighting the importance of training data order. The black-box model performed worse (90 % sensitivity, 96 % specificity) when trained sequentially or with mixed data. Overall, our transparent approach outperformed the black-box approach. Mixing datasets during training resulted in slightly better performance and should be the preferred approach when practically possible. This work paves the way for more trustworthy, generalizable, and clinically actionable AI tools in neurodevelopmental diagnostics.
[467] ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment
Charlie Masters, Marta Grześkiewicz, Stefano V. Albrecht
Main category: cs.AI
TL;DR: ARCANE is a framework that uses natural-language rubrics (weighted sets of verifiable criteria) for interpretable, on-the-fly alignment of AI agents in long-horizon tasks, enabling preference shifts without retraining.
Details
Motivation: As LLM-based agents handle longer tasks, maintaining alignment with stakeholder preferences becomes critical. Current approaches lack interpretability for auditing and cannot adapt to preference shifts at interaction time without retraining.Method: Frames alignment as multi-agent collaboration with dynamically generated natural-language rubrics. Uses utility theory to formulate rubric learning as reconstruction problem, applying regularized Group-Sequence Policy Optimization (GSPO) to balance interpretability, faithfulness, and efficiency.
Result: Evaluated on 219 labeled rubrics from GDPVal benchmark for multi-step reasoning and tool use tasks. Learned rubrics produce compact, legible evaluations and enable configurable trade-offs (e.g., correctness vs. conciseness) without retraining.
Conclusion: Rubric-based reward models offer a promising path toward interpretable, test-time adaptive alignment for complex, long-horizon AI systems, addressing key limitations of current alignment approaches.
Abstract: As agents based on large language models are increasingly deployed to long-horizon tasks, maintaining their alignment with stakeholder preferences becomes critical. Effective alignment in such settings requires reward models that are interpretable so that stakeholders can understand and audit model objectives. Moreover, reward models must be capable of steering agents at interaction time, allowing preference shifts to be incorporated without retraining. We introduce ARCANE, a framework that frames alignment as a multi-agent collaboration problem that dynamically represents stakeholder preferences as natural-language rubrics: weighted sets of verifiable criteria that can be generated on-the-fly from task context. Inspired by utility theory, we formulate rubric learning as a reconstruction problem and apply a regularized Group-Sequence Policy Optimization (GSPO) procedure that balances interpretability, faithfulness, and computational efficiency. Using a corpus of 219 labeled rubrics derived from the GDPVal benchmark, we evaluate ARCANE on challenging tasks requiring multi-step reasoning and tool use. The learned rubrics produce compact, legible evaluations and enable configurable trade-offs (e.g., correctness vs. conciseness) without retraining. Our results show that rubric-based reward models offer a promising path toward interpretable, test-time adaptive alignment for complex, long-horizon AI systems.
[468] On measuring grounding and generalizing grounding problems
Daniel Quigley, Eric Maynard
Main category: cs.AI
TL;DR: The paper reframes the symbol grounding problem from a binary judgment into a multi-dimensional audit framework with specific desiderata, then applies this framework to analyze different grounding modes and case studies.
Details
Motivation: To address the philosophical problem of how symbols can be about things in the world (symbol grounding) by moving beyond simple binary judgments and providing a systematic framework for evaluating different approaches to grounding.Method: Develops an audit framework with six desiderata (authenticity, preservation, faithfulness-correlational, faithfulness-etiological, robustness, compositionality) indexed by evaluation tuples (context, meaning type, threat model, reference distribution). Applies this framework to analyze four grounding modes and three case studies.
Result: Different grounding approaches have different strengths: model-theoretic semantics achieves exact composition but lacks etiological warrant; LLMs show correlational fit and local robustness for linguistic tasks but lack selection-for-success on world tasks; human language meets all desiderata under strong authenticity through evolutionary/developmental acquisition.
Conclusion: The framework operationalizes philosophical inquiry about representation, providing a common language and technical framework for systematic investigation of grounding and meaning across philosophy of science, computer science, linguistics, and mathematics.
Abstract: The symbol grounding problem asks how tokens like cat can be about cats, as opposed to mere shapes manipulated in a calculus. We recast grounding from a binary judgment into an audit across desiderata, each indexed by an evaluation tuple (context, meaning type, threat model, reference distribution): authenticity (mechanisms reside inside the agent and, for strong claims, were acquired through learning or evolution); preservation (atomic meanings remain intact); faithfulness, both correlational (realized meanings match intended ones) and etiological (internal mechanisms causally contribute to success); robustness (graceful degradation under declared perturbations); compositionality (the whole is built systematically from the parts). We apply this framework to four grounding modes (symbolic; referential; vectorial; relational) and three case studies: model-theoretic semantics achieves exact composition but lacks etiological warrant; large language models show correlational fit and local robustness for linguistic tasks, yet lack selection-for-success on world tasks without grounded interaction; human language meets the desiderata under strong authenticity through evolutionary and developmental acquisition. By operationalizing a philosophical inquiry about representation, we equip philosophers of science, computer scientists, linguists, and mathematicians with a common language and technical framework for systematic investigation of grounding and meaning.
[469] AI Application in Anti-Money Laundering for Sustainable and Transparent Financial Systems
Chuanhao Nie, Yunbo Liu, Chao Wang
Main category: cs.AI
TL;DR: This paper reviews AI applications for improving Anti-Money Laundering (AML) workflows, proposes an AI-driven KYC system using RAG-Graph architecture, and highlights future research directions for transparent and robust AML systems.
Details
Motivation: Money laundering and financial fraud cost trillions annually and challenge regulatory oversight, creating a need for modernized AML systems that improve detection accuracy while reducing operational burdens and supporting sustainable development.Method: The paper reviews AI applications for AML modernization and proposes an AI-driven KYC application that integrates graph-based retrieval-augmented generation (RAG Graph) with generative models to enhance KYC processes for money-laundering detection.
Result: Experimental results show that the RAG-Graph architecture delivers high faithfulness and strong answer relevancy across diverse evaluation settings, enhancing efficiency and transparency of KYC CDD/EDD workflows and contributing to more sustainable compliance practices.
Conclusion: AI can significantly modernize AML workflows by improving detection accuracy and reducing operational burdens, with future research needed in federated learning, fairness-aware AI, reinforcement learning, and human-in-the-loop systems to ensure transparent, accountable, and robust next-generation AML architectures.
Abstract: Money laundering and financial fraud remain major threats to global financial stability, costing trillions annually and challenging regulatory oversight. This paper reviews how artificial intelligence (AI) applications can modernize Anti-Money Laundering (AML) workflows by improving detection accuracy, lowering false-positive rates, and reducing the operational burden of manual investigations, thereby supporting more sustainable development. It further highlights future research directions including federated learning for privacy-preserving collaboration, fairness-aware and interpretable AI, reinforcement learning for adaptive defenses, and human-in-the-loop visualization systems to ensure that next-generation AML architectures remain transparent, accountable, and robust. In the final part, the paper proposes an AI-driven KYC application that integrates graph-based retrieval-augmented generation (RAG Graph) with generative models to enhance efficiency, transparency, and decision support in KYC processes related to money-laundering detection. Experimental results show that the RAG-Graph architecture delivers high faithfulness and strong answer relevancy across diverse evaluation settings, thereby enhancing the efficiency and transparency of KYC CDD/EDD workflows and contributing to more sustainable, resource-optimized compliance practices.
[470] How Sharp and Bias-Robust is a Model? Dual Evaluation Perspectives on Knowledge Graph Completion
Sooho Moon, Yunyong Ko
Main category: cs.AI
TL;DR: PROBE is a new evaluation framework for knowledge graph completion that addresses limitations in existing metrics by considering predictive sharpness and popularity-bias robustness.
Details
Motivation: Existing KGC evaluation metrics overlook two critical perspectives: (1) predictive sharpness - how strictly individual predictions are evaluated, and (2) popularity-bias robustness - the ability to predict low-popularity entities. Current metrics tend to over- or under-estimate model accuracy.Method: PROBE consists of two components: a rank transformer (RT) that estimates prediction scores based on required predictive sharpness levels, and a rank aggregator (RA) that aggregates scores in a popularity-aware manner to address popularity bias.
Result: Experiments on real-world knowledge graphs show that PROBE provides more comprehensive understanding of KGC models and yields more reliable evaluation results compared to existing metrics.
Conclusion: PROBE addresses key limitations in KGC evaluation by simultaneously considering predictive sharpness and popularity-bias robustness, offering a more reliable framework for assessing knowledge graph completion models.
Abstract: Knowledge graph completion (KGC) aims to predict missing facts from the observed KG. While a number of KGC models have been studied, the evaluation of KGC still remain underexplored. In this paper, we observe that existing metrics overlook two key perspectives for KGC evaluation: (A1) predictive sharpness – the degree of strictness in evaluating an individual prediction, and (A2) popularity-bias robustness – the ability to predict low-popularity entities. Toward reflecting both perspectives, we propose a novel evaluation framework (PROBE), which consists of a rank transformer (RT) estimating the score of each prediction based on a required level of predictive sharpness and a rank aggregator (RA) aggregating all the scores in a popularity-aware manner. Experiments on real-world KGs reveal that existing metrics tend to over- or under-estimate the accuracy of KGC models, whereas PROBE yields a comprehensive understanding of KGC models and reliable evaluation results.
[471] DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization
Xuan Xie, Xuan Wang, Wenjie Wang
Main category: cs.AI
TL;DR: DaGRPO improves GRPO for LLM reasoning by addressing training instability through distinctiveness-aware mechanisms: gradient rectification to avoid conflicts and off-policy augmentation for hard queries.
Details
Motivation: GRPO enables post-training reasoning in LLMs but suffers from training instability and poor sample efficiency due to lack of distinctiveness in on-policy rollouts - homogeneous samples cause gradient conflicts for routine queries, while scarce positive samples hinder optimization for hard queries.Method: DaGRPO introduces two core mechanisms: 1) Sequence-level Gradient Rectification uses fine-grained scoring to dynamically mask low-distinctiveness sample pairs, eliminating gradient conflicts; 2) Off-policy Data Augmentation introduces high-quality anchors to recover training signals for challenging tasks.
Result: Extensive experiments across 9 mathematical reasoning and OOD generalization benchmarks show DaGRPO significantly surpasses SFT, GRPO, and hybrid baselines, achieving new SOTA performance (+4.7% average accuracy gain on math benchmarks). Analysis confirms effective mitigation of gradient explosion and accelerated emergence of long-chain reasoning capabilities.
Conclusion: DaGRPO successfully addresses GRPO’s training instability by incorporating distinctiveness awareness, enabling more stable and efficient elicitation of long-horizon reasoning capabilities in LLMs through gradient conflict elimination and enhanced optimization for challenging tasks.
Abstract: The evolution of Large Language Models (LLMs) has catalyzed a paradigm shift from superficial instruction following to rigorous long-horizon reasoning. While Group Relative Policy Optimization (GRPO) has emerged as a pivotal mechanism for eliciting such post-training reasoning capabilities due to its exceptional performance, it remains plagued by significant training instability and poor sample efficiency. We theoretically identify the root cause of these issues as the lack of distinctiveness within on-policy rollouts: for routine queries, highly homogeneous samples induce destructive gradient conflicts; whereas for hard queries, the scarcity of valid positive samples results in ineffective optimization. To bridge this gap, we propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO). DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification, which utilizes fine-grained scoring to dynamically mask sample pairs with low distinctiveness, thereby eradicating gradient conflicts at the source; and (2) Off-policy Data Augmentation, which introduces high-quality anchors to recover training signals for challenging tasks. Extensive experiments across 9 mathematical reasoning and out-of-distribution (OOD) generalization benchmarks demonstrate that DaGRPO significantly surpasses existing SFT, GRPO, and hybrid baselines, achieving new state-of-the-art performance (e.g., a +4.7% average accuracy gain on math benchmarks). Furthermore, in-depth analysis confirms that DaGRPO effectively mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.
[472] Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression
Qiming Bao, Xiaoxuan Fu
Main category: cs.AI
TL;DR: LLMs show perfect accuracy on base logical tasks and semantic-preserving transformations, but fail dramatically when essential rules are deleted or contradictions are introduced, revealing fundamental brittleness in logical reasoning.
Details
Motivation: To understand how LLMs generalize to structural perturbations in logical contexts, as current understanding of their reasoning reliability under such conditions remains limited.Method: Controlled evaluation framework with four stress tests: (1) rule deletion (redundant vs essential), (2) contradictory evidence injection, (3) logic-preserving rewrites using equivalence laws, and (4) multi-law equivalence stacking. Tested on BERT, Qwen2, and LLaMA-like models.
Result: All models achieve perfect accuracy on base tasks and generalize perfectly to redundant rule deletion and all equivalence-based rewrites (single or multi-law). However, they drop to 25% accuracy with essential rule deletion and collapse to 0% accuracy with explicit contradictions.
Conclusion: LLMs possess stable invariance to semantic-preserving logical transformations but remain fundamentally brittle to missing or conflicting evidence. The framework provides a diagnostic tool for isolating reasoning failure modes and highlights persistent gaps in logical generalization abilities.
Abstract: Large language models (LLMs) excel across many natural language tasks, yet their generalisation to structural perturbations in logical contexts remains poorly understood. We introduce a controlled evaluation framework that probes reasoning reliability through four targeted stress tests: (1) rule deletion, removing either redundant or essential rules from a multi-step inference chain; (2) contradictory evidence injection; (3) logic-preserving rewrites generated through several families of equivalence laws (contrapositive, double negation, implication, De Morgan, identity, and commutativity); and (4) multi-law equivalence stacking that introduces 2-5 simultaneous logical transformations. Across three representative model families: BERT, Qwen2, and LLaMA-like models. Our experiments reveal a strikingly consistent pattern: all models achieve perfect accuracy on the base tasks and remain fully generalise to redundant rule deletion and all equivalence-based rewrites (single or multi-law), but fail sharply under essential rule deletion (dropping to 25% accuracy) and collapse completely in the presence of explicit contradictions (0% accuracy). These results demonstrate that LLMs possess stable invariance to semantic-preserving logical transformations, yet remain fundamentally brittle to missing or conflicting evidence. Our framework provides a clean diagnostic tool for isolating such reasoning failure modes and highlights persistent gaps in the logical generalisation abilities of current LLMs.
[473] GENIUS: An Agentic AI Framework for Autonomous Design and Execution of Simulation Protocols
Mohammad Soleymanibrojeni, Roland Aydin, Diego Guedes-Sobrinho, Alexandre C. Dias, Maurício J. Piotrowski, Wolfgang Wenzel, Celso Ricardo Caldeira Rêgo
Main category: cs.AI
TL;DR: GENIUS is an AI-agentic workflow that automates DFT simulation setup by translating free-form prompts into validated input files with ~80% success rate and autonomous error repair, democratizing materials simulations.
Details
Motivation: The know-how gap in setting up and debugging atomistic simulations limits ICME adoption, as state-of-the-art codes remain cumbersome for non-experts despite their predictive power for materials discovery.Method: GENIUS combines a smart Quantum ESPRESSO knowledge graph with a tiered hierarchy of large language models supervised by a finite-state error-recovery machine to translate human prompts into validated simulation inputs.
Result: Achieves ~80% success rate on 295 diverse benchmarks, with 76% autonomously repaired; halves inference costs and virtually eliminates hallucinations compared to LLM-only baselines.
Conclusion: The framework democratizes electronic-structure DFT simulations by automating protocol generation, validation, and repair, enabling large-scale screening and accelerating ICME design loops worldwide.
Abstract: Predictive atomistic simulations have propelled materials discovery, yet routine setup and debugging still demand computer specialists. This know-how gap limits Integrated Computational Materials Engineering (ICME), where state-of-the-art codes exist but remain cumbersome for non-experts. We address this bottleneck with GENIUS, an AI-agentic workflow that fuses a smart Quantum ESPRESSO knowledge graph with a tiered hierarchy of large language models supervised by a finite-state error-recovery machine. Here we show that GENIUS translates free-form human-generated prompts into validated input files that run to completion on $\approx$80% of 295 diverse benchmarks, where 76% are autonomously repaired, with success decaying exponentially to a 7% baseline. Compared with LLM-only baselines, GENIUS halves inference costs and virtually eliminates hallucinations. The framework democratizes electronic-structure DFT simulations by intelligently automating protocol generation, validation, and repair, opening large-scale screening and accelerating ICME design loops across academia and industry worldwide.
[474] UncertaintyZoo: A Unified Toolkit for Quantifying Predictive Uncertainty in Deep Learning Systems
Xianzong Wu, Xiaohong Li, Lili Quan, Qiang Hu
Main category: cs.AI
TL;DR: UncertaintyZoo is a unified toolkit that integrates 29 uncertainty quantification methods for LLMs, evaluated on code vulnerability detection tasks.
Details
Motivation: LLMs often make incorrect predictions in safety-critical scenarios, but existing uncertainty quantification methods lack integration tools, hindering practical usage and research.Method: Developed UncertaintyZoo toolkit with standardized interface covering 29 UQ methods across five major categories, then evaluated on CodeBERT and ChatGLM3 models for code vulnerability detection.
Result: UncertaintyZoo effectively reveals prediction uncertainty in LLMs, demonstrated through code vulnerability detection tasks.
Conclusion: UncertaintyZoo bridges the gap in UQ tool integration, enabling practical usage and future research in uncertainty quantification for LLMs.
Abstract: Large language models(LLMs) are increasingly expanding their real-world applications across domains, e.g., question answering, autonomous driving, and automatic software development. Despite this achievement, LLMs, as data-driven systems, often make incorrect predictions, which can lead to potential losses in safety-critical scenarios. To address this issue and measure the confidence of model outputs, multiple uncertainty quantification(UQ) criteria have been proposed. However, even though important, there are limited tools to integrate these methods, hindering the practical usage of UQ methods and future research in this domain. To bridge this gap, in this paper, we introduce UncertaintyZoo, a unified toolkit that integrates 29 uncertainty quantification methods, covering five major categories under a standardized interface. Using UncertaintyZoo, we evaluate the usefulness of existing uncertainty quantification methods under the code vulnerability detection task on CodeBERT and ChatGLM3 models. The results demonstrate that UncertaintyZoo effectively reveals prediction uncertainty. The tool with a demonstration video is available on the project site https://github.com/Paddingbuta/UncertaintyZoo.
[475] The Effect of Belief Boxes and Open-mindedness on Persuasion
Onur Bilgin, Abdullah As Sami, Sriram Sai Vujjini, John Licato
Main category: cs.AI
TL;DR: LLM-based agents with explicit belief statements in their prompts (belief boxes) show measurable effects on persuasion, resistance to opposing views, and belief change, especially when instructed to be open-minded or facing peer pressure scenarios.
Details
Motivation: As multi-agent reasoning systems become more common, there's a need for LLM-based agents to have propositional beliefs. The paper investigates how explicit belief statements in prompts affect agent behavior, persuasion ability, and belief change dynamics.Method: The researchers conducted experiments with LLM-based agents using belief boxes (explicit belief statements in prompts). They tested how belief statements and strength indicators affect agents’ resistance to opposing views, persuasiveness, and belief change in various scenarios including peer pressure situations where agents are outnumbered.
Result: Instructing agents to be open-minded makes them more amenable to belief change. Belief statements and their strengths influence agents’ resistance to opposing viewpoints and their persuasiveness. Belief change is more likely when agents are outnumbered in debates (peer pressure scenarios). The belief box technique proves feasible and valid for reasoning and decision-making tasks.
Conclusion: The belief box technique effectively influences LLM agent behavior, demonstrating that explicit belief statements in prompts can shape agents’ dispositions, persuasion capabilities, and susceptibility to belief change, particularly in social pressure situations.
Abstract: As multi-agent systems are increasingly utilized for reasoning and decision-making applications, there is a greater need for LLM-based agents to have something resembling propositional beliefs. One simple method for doing so is to include statements describing beliefs maintained in the prompt space (in what we’ll call their belief boxes). But when agents have such statements in belief boxes, how does it actually affect their behaviors and dispositions towards those beliefs? And does it significantly affect agents’ ability to be persuasive in multi-agent scenarios? Likewise, if the agents are given instructions to be open-minded, how does that affect their behaviors? We explore these and related questions in a series of experiments. Our findings confirm that instructing agents to be open-minded affects how amenable they are to belief change. We show that incorporating belief statements and their strengths influences an agent’s resistance to (and persuasiveness against) opposing viewpoints. Furthermore, it affects the likelihood of belief change, particularly when the agent is outnumbered in a debate by opposing viewpoints, i.e., peer pressure scenarios. The results demonstrate the feasibility and validity of the belief box technique in reasoning and decision-making tasks.
[476] Utilizing Multi-Agent Reinforcement Learning with Encoder-Decoder Architecture Agents to Identify Optimal Resection Location in Glioblastoma Multiforme Patients
Krishna Arun, Moinak Bhattachrya, Paras Goel
Main category: cs.AI
TL;DR: AI system for GBM brain tumors combining diagnosis (4 sequential classification models) and treatment planning (3 generative models + RL feedback loop) with significant efficiency gains and potential survival improvements.
Details
Motivation: Addressing the lack of AI support for treating heterogeneous brain tumors like Glioblastoma Multiforme (GBM), which has extremely low 5-year survival rate of 5.1%, by providing end-to-end assistance for both diagnosis and treatment planning.Method: Two-phase approach: 1) Diagnosis using sequential decision-making framework with 4 classification models (CNNs and SVM) that progressively classify brain images into specific categories. 2) Treatment planning using RL system with 3 generative models: diffusion model for resection prediction, Spatio-Temporal Vision Transformer for radiotherapy progression, diffusion model for chemotherapy effects, plus survival rate calculator CNN and PPO feedback loop for optimization.
Result: Key findings: 22.28x computing cost reduction in diagnosis, 113-hour inference time reduction for tumor progression, 2.9% DICE score improvement with real-life augmentations. Projected to increase survival rates by 0.9%, potentially saving ~2,250 lives.
Conclusion: The proposed AI system provides comprehensive end-to-end support for GBM treatment with significant efficiency improvements and promising potential for enhancing patient survival outcomes.
Abstract: Currently, there is a noticeable lack of AI in the medical field to support doctors in treating heterogenous brain tumors such as Glioblastoma Multiforme (GBM), the deadliest human cancer in the world with a five-year survival rate of just 5.1%. This project develops an AI system offering the only end-to-end solution by aiding doctors with both diagnosis and treatment planning. In the diagnosis phase, a sequential decision-making framework consisting of 4 classification models (Convolutional Neural Networks and Support Vector Machine) are used. Each model progressively classifies the patient’s brain into increasingly specific categories, with the final step being named diagnosis. For treatment planning, an RL system consisting of 3 generative models is used. First, the resection model (diffusion model) analyzes the diagnosed GBM MRI and predicts a possible resection outcome. Second, the radiotherapy model (Spatio-Temporal Vision Transformer) generates an MRI of the brain’s progression after a user-defined number of weeks. Third, the chemotherapy model (Diffusion Model) produces the post-treatment MRI. A survival rate calculator (Convolutional Neural Network) then checks if the generated post treatment MRI has a survival rate within 15% of the user defined target. If not, a feedback loop using proximal policy optimization iterates over this system until an optimal resection location is identified. When compared to existing solutions, this project found 3 key findings: (1) Using a sequential decision-making framework consisting of 4 small diagnostic models reduced computing costs by 22.28x, (2) Transformers regression capabilities decreased tumor progression inference time by 113 hours, and (3) Applying Augmentations resembling Real-life situations improved overall DICE scores by 2.9%. These results project to increase survival rates by 0.9%, potentially saving approximately 2,250 lives.
[477] Smart Spatial Planning in Egypt: An Algorithm-Driven Approach to Public Service Evaluation in Qena City
Mohamed Shamroukh, Mohamed Alkhuzamy Aziz
Main category: cs.AI
TL;DR: Developed a tailored planning model for Qena City using Python-based Voronoi Diagrams to create city-specific planning standards and evaluate public service coverage, revealing 81.3% average coverage with significant spatial disparities.
Details
Motivation: National planning standards in Egypt often fail to account for unique local characteristics, creating a gap between standardized approaches and actual urban needs in specific cities like Qena.Method: Used hybrid methodology (descriptive, analytical, experimental) with Python programming to develop intelligent spatial analysis algorithm based on Voronoi Diagrams for generating city-specific planning criteria and evaluating facility coverage.
Result: Achieved 81.3% average service coverage; ambulance stations showed highest efficiency (99.8%) due to recent upgrades, while parks/open spaces had lowest coverage (10%) due to land constraints. Spatial analysis revealed high service density in midtown (>45 services/km²) dropping to <5 services/km² in outskirts, with Hajer Qena district having most unserved areas and First District having highest coverage.
Conclusion: Successfully developed a localized planning standards model with automated algorithm for service efficiency assessment, providing a replicable framework for data-driven urban planning in Egyptian cities that addresses spatial disparities and local characteristics.
Abstract: National planning standards for public services in Egypt often fail to align with unique local characteristics. Addressing this gap, this study develops a tailored planning model for Qena City. Using a hybrid methodology (descriptive, analytical, and experimental), the research utilizes Python programming to generate an intelligent spatial analysis algorithm based on Voronoi Diagrams. This approach creates city-specific planning criteria and evaluates the current coverage of public facilities. The primary contribution of this study is the successful derivation of a localized planning standards model and the deployment of an automated algorithm to assess service efficiency. Application of this model reveals a general service coverage average of 81.3%. Ambulance stations demonstrated the highest efficiency (99.8%) due to recent upgrades, while parks and open spaces recorded the lowest coverage (10%) caused by limited land availability. Spatial analysis indicates a high service density in midtown (>45 services/km^2), which diminishes significantly towards the outskirts (<5 services/km^2). Consequently, the Hajer Qena district contains the highest volume of unserved areas, while the First District (Qesm 1) exhibits the highest level of service coverage. This model offers a replicable framework for data-driven urban planning in Egyptian cities.
[478] FlatFormer: A Flat Transformer Knowledge Tracing Model Based on Cognitive Bias Injection
Xiao-li Xia, Hou-biao Li
Main category: cs.AI
TL;DR: FlatFormer is a lightweight Transformer model for Knowledge Tracing that achieves state-of-the-art performance with minimal parameters by using information injection instead of complex hierarchical architectures.
Details
Motivation: Current Knowledge Tracing models face a "Performance-Complexity Trap" where capturing complex cognitive dynamics requires deep hierarchical architectures that are computationally expensive and impractical for real-time deployment.Method: FlatFormer uses “Information Injection over Structural Stacking” with two lightweight mechanisms: 1) hybrid input encoding combining learnable session IDs with fixed sinusoidal step embeddings, and 2) pre-computed power-law bias integrated into attention logits to model forgetting curves.
Result: On EdNet dataset, FlatFormer achieves 8.3% absolute AUC improvement over strongest hierarchical baseline (HiTSKT), uses less than 15% of parameters, and has ~3x faster inference speed. Similar results on other large-scale datasets (Junyi, etc.).
Conclusion: High cognitive fidelity in Knowledge Tracing does not require architectural complexity; lightweight information injection mechanisms can achieve superior performance with significantly reduced computational costs.
Abstract: Knowledge Tracing (KT) models face a critical Performance-Complexity Trap'': capturing complex cognitive dynamics like learning sessions and memory decay typically requires deep hierarchical architectures, which incur prohibitive computational costs for real-time deployment. To resolve this, we propose FlatFormer, a streamlined architecture based on the novel design paradigm of Information Injection over Structural Stacking.’’ Unlike parameter-heavy hierarchical models, FlatFormer leverages a standard flat Transformer augmented with two lightweight injection mechanisms: (i) a hybrid input encoding strategy combining learnable session identifiers with fixed sinusoidal step embeddings; and (ii) a pre-computed power-law bias integrated directly into attention logits to explicitly model the forgetting curve. Extensive experiments on four large-scale datasets (e.g., EdNet, Junyi) show that FlatFormer achieves state-of-the-art performance. For example, on the EdNet dataset, compared to the strongest hierarchical baseline (HiTSKT), its absolute AUC increased by 8.3%, while using less than 15% of parameters, and inference speed was about three times faster. These results validate that high cognitive fidelity does not necessitate architectural complexity.
[479] LightSearcher: Efficient DeepSearch via Experiential Memory
Hengzhi Lan, Yue Yu, Li Qian, Li Peng, Jie Wu, Wei Liu, Jian Luan, Ting Bai
Main category: cs.AI
TL;DR: LightSearcher is an efficient RL framework that balances accuracy and efficiency in DeepSearch systems by using textual experiential memory and adaptive reward shaping to reduce unnecessary tool calls while maintaining accuracy.
Details
Motivation: Current RL-driven DeepSearch systems face a trade-off between accuracy and efficiency - frequent tool invocations improve factual correctness but create computational overhead and reduced efficiency. There's a need to balance this inherent accuracy-efficiency trade-off.Method: LightSearcher incorporates textual experiential memory by learning contrastive reasoning trajectories to generate interpretable summaries of successful reasoning patterns. It also uses an adaptive reward shaping mechanism that penalizes redundant tool calls only in correct-answer scenarios.
Result: On four multi-hop QA benchmarks, LightSearcher maintains accuracy comparable to SOTA baseline ReSearch while reducing search tool invocations by 39.6%, inference time by 48.6%, and token consumption by 21.2%.
Conclusion: LightSearcher effectively balances the accuracy-efficiency trade-off in DeepSearch paradigms, demonstrating superior efficiency while maintaining comparable accuracy to state-of-the-art methods.
Abstract: DeepSearch paradigms have become a core enabler for deep reasoning models, allowing them to invoke external search tools to access up-to-date, domain-specific knowledge beyond parametric boundaries, thereby enhancing the depth and factual reliability of reasoning. Building upon this foundation, recent advances in reinforcement learning (RL) have further empowered models to autonomously and strategically control search tool usage, optimizing when and how to query external knowledge sources. Yet, these RL-driven DeepSearch systems often reveal a see-saw trade-off between accuracy and efficiency-frequent tool invocations can improve factual correctness but lead to unnecessary computational overhead and diminished efficiency. To address this challenge, we propose LightSearcher, an efficient RL framework that incorporates textual experiential memory by learning contrastive reasoning trajectories to generate interpretable summaries of successful reasoning patterns. In addition, it employs an adaptive reward shaping mechanism that penalizes redundant tool calls only in correct-answer scenarios. This design effectively balances the inherent accuracy-efficiency trade-off in DeepSearch paradigms. Experiments on four multi-hop QA benchmarks show that LightSearcher maintains accuracy comparable to SOTA baseline ReSearch, while reducing search tool invocations by 39.6%, inference time by 48.6%, and token consumption by 21.2%, demonstrating its superior efficiency.
[480] Academic journals’ AI policies fail to curb the surge in AI-assisted academic writing
Yongyuan He, Yi Bu
Main category: cs.AI
TL;DR: Current AI usage policies in academic publishing have failed to promote transparency or restrain AI adoption, with only 0.1% of papers disclosing AI use despite widespread policy adoption.
Details
Motivation: To evaluate the real-world effectiveness of AI usage policies adopted by journals and publishers in response to the rapid integration of generative AI into academic writing.Method: Analyzed 5,114 journals and over 5.2 million papers, including full-text analysis of 164k scientific publications, to assess AI policy adoption and actual AI tool usage patterns across disciplines and regions.
Result: Despite 70% of journals adopting AI policies (primarily requiring disclosure), AI tool usage increased dramatically with no significant difference between journals with or without policies. Only 76 out of 75k papers (0.1%) published since 2023 explicitly disclosed AI use, revealing a major transparency gap.
Conclusion: Current AI policies have largely failed to promote transparency or restrain AI adoption. The authors urge re-evaluation of ethical frameworks to foster responsible AI integration in science.
Abstract: The rapid integration of generative AI into academic writing has prompted widespread policy responses from journals and publishers. However, the effectiveness of these policies remains unclear. Here, we analyze 5,114 journals and over 5.2 million papers to evaluate the real-world impact of AI usage guidelines. We show that despite 70% of journals adopting AI policies (primarily requiring disclosure), researchers’ use of AI writing tools has increased dramatically across disciplines, with no significant difference between journals with or without policies. Non-English-speaking countries, physical sciences, and high-OA journals exhibit the highest growth rates. Crucially, full-text analysis on 164k scientific publications reveals a striking transparency gap: Of the 75k papers published since 2023, only 76 (0.1%) explicitly disclosed AI use. Our findings suggest that current policies have largely failed to promote transparency or restrain AI adoption. We urge a re-evaluation of ethical frameworks to foster responsible AI integration in science.
[481] Stochasticity in Agentic Evaluations: Quantifying Inconsistency with Intraclass Correlation
Zairah Mustahsan, Abel Lim, Megna Anand, Saahil Jain, Bryan McCann
Main category: cs.AI
TL;DR: The paper proposes using Intraclass Correlation Coefficient (ICC) to measure evaluation reliability in agentic systems, decomposing variance into task difficulty vs. agent inconsistency, and recommends reporting accuracy alongside ICC for trustworthy benchmarking.
Details
Motivation: Current evaluation practices for agentic systems report single accuracy numbers that obscure underlying variance, making it impossible to distinguish genuine capability improvements from lucky sampling. This unreliability introduces brittleness into downstream agentic systems when sub-agents are replaced based on misleading metrics.Method: The authors propose adopting Intraclass Correlation Coefficient (ICC) from measurement science to characterize evaluation variance. ICC decomposes observed variance into between-query variance (task difficulty) and within-query variance (agent inconsistency). They evaluate this approach on GAIA (agentic capabilities across reasoning complexity) and FRAMES (retrieval and factuality across documents) benchmarks.
Result: ICC varies dramatically with task structure: reasoning/retrieval tasks (FRAMES) show ICC=0.4955-0.7118, while agentic tasks (GAIA) show ICC=0.304-0.774 across models. For sub-agent replacement decisions, accuracy improvements are only trustworthy if ICC also improves. ICC converges by n=8-16 trials for structured tasks and n>=32 for complex reasoning.
Conclusion: The paper recommends reporting accuracy alongside ICC and within-query variance as standard practice, proposes updated Evaluation Cards capturing these metrics, and aims to transform agentic benchmarking from opaque leaderboard competition to trustworthy experimental science. Code is open-sourced.
Abstract: As large language models become components of larger agentic systems, evaluation reliability becomes critical: unreliable sub-agents introduce brittleness into downstream system behavior. Yet current evaluation practice, reporting a single accuracy number from a single run, obscures the variance underlying these results, making it impossible to distinguish genuine capability improvements from lucky sampling. We propose adopting Intraclass Correlation Coefficient (ICC), a metric from measurement science, to characterize this variance. ICC decomposes observed variance into between-query variance (task difficulty) and within-query variance (agent inconsistency), highlighting whether reported results reflect true capability or measurement noise. We evaluated on GAIA (Levels 1-3, measuring agentic capabilities across varying reasoning complexity) and FRAMES (measuring retrieval and factuality across multiple documents). We found that ICC varies dramatically with task structure, with reasoning and retrieval tasks (FRAMES) exhibit ICC=0.4955-0.7118 across models, and agentic tasks (GAIA) exhibiting ICC=0.304-0.774 across models. For sub-agent replacement decisions in agentic systems, accuracy improvements are only trustworthy if ICC also improves. We demonstrate that ICC converges by n=8-16 trials for structured tasks and n>=32 for complex reasoning, enabling practitioners to set evidence-based resampling budgets. We recommend reporting accuracy alongside ICC and within-query variance as standard practice, and propose updated Evaluation Cards capturing these metrics. By making evaluation stability visible, we aim to transform agentic benchmarking from opaque leaderboard competition to trustworthy experimental science. Our code is open-sourced at https://github.com/youdotcom-oss/stochastic-agent-evals.
[482] Cognitive Control Architecture (CCA): A Lifecycle Supervision Framework for Robustly Aligned AI Agents
Zhibo Liang, Tianze Hu, Zaiye Chen, Mingjie Tang
Main category: cs.AI
TL;DR: Proposes Cognitive Control Architecture (CCA) - a holistic defense framework against Indirect Prompt Injection attacks on LLM agents that achieves uncompromised security while maintaining efficiency and functionality.
Details
Motivation: Current LLM agents are vulnerable to Indirect Prompt Injection attacks that hijack agent behavior by polluting external information sources. Existing defenses are fragmented and force unacceptable trade-offs between security, functionality, and efficiency.Method: Cognitive Control Architecture (CCA) with two synergistic pillars: (1) proactive control-flow and data-flow integrity enforcement via pre-generated “Intent Graph”, and (2) “Tiered Adjudicator” that initiates deep reasoning based on multi-dimensional scoring upon deviation detection to counter complex conditional attacks.
Result: Experiments on AgentDojo benchmark show CCA effectively withstands sophisticated attacks that challenge other advanced defense methods, achieving uncompromised security with notable efficiency and robustness.
Conclusion: CCA reconciles the multi-dimensional trade-off between security, functionality, and efficiency in LLM agent defense, providing full-lifecycle cognitive supervision against Indirect Prompt Injection attacks.
Abstract: Autonomous Large Language Model (LLM) agents exhibit significant vulnerability to Indirect Prompt Injection (IPI) attacks. These attacks hijack agent behavior by polluting external information sources, exploiting fundamental trade-offs between security and functionality in existing defense mechanisms. This leads to malicious and unauthorized tool invocations, diverting agents from their original objectives. The success of complex IPIs reveals a deeper systemic fragility: while current defenses demonstrate some effectiveness, most defense architectures are inherently fragmented. Consequently, they fail to provide full integrity assurance across the entire task execution pipeline, forcing unacceptable multi-dimensional compromises among security, functionality, and efficiency. Our method is predicated on a core insight: no matter how subtle an IPI attack, its pursuit of a malicious objective will ultimately manifest as a detectable deviation in the action trajectory, distinct from the expected legitimate plan. Based on this, we propose the Cognitive Control Architecture (CCA), a holistic framework achieving full-lifecycle cognitive supervision. CCA constructs an efficient, dual-layered defense system through two synergistic pillars: (i) proactive and preemptive control-flow and data-flow integrity enforcement via a pre-generated “Intent Graph”; and (ii) an innovative “Tiered Adjudicator” that, upon deviation detection, initiates deep reasoning based on multi-dimensional scoring, specifically designed to counter complex conditional attacks. Experiments on the AgentDojo benchmark substantiate that CCA not only effectively withstands sophisticated attacks that challenge other advanced defense methods but also achieves uncompromised security with notable efficiency and robustness, thereby reconciling the aforementioned multi-dimensional trade-off.
[483] ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems
Bufang Yang, Lilin Xu, Liekang Zeng, Yunqi Guo, Siyang Jiang, Wenrui Lu, Kaiwei Liu, Hancheng Xiang, Xiaofan Jiang, Guoliang Xing, Zhenyu Yan
Main category: cs.AI
TL;DR: ProAgent is an end-to-end proactive LLM agent system that uses sensory contexts and LLM reasoning to provide anticipatory assistance, outperforming reactive approaches in accuracy and user satisfaction.
Details
Motivation: Existing LLM agents follow reactive paradigms requiring explicit user instructions, which increases physical and cognitive workload. There's a need for proactive systems that anticipate user needs before explicit requests.Method: 1) Proactive-oriented context extraction with tiered perception to continuously sense environment and derive hierarchical contexts (sensory + persona cues). 2) Context-aware proactive reasoner that maps contexts to user needs and tool calls for proactive assistance. Implemented on AR glasses with edge server.
Result: Achieves 33.4% higher proactive prediction accuracy, 16.8% higher tool-calling F1 score, and significant user satisfaction improvements over state-of-the-art baselines in real-world testbed, public dataset, and user study evaluations.
Conclusion: ProAgent represents a significant advancement toward proactive AI assistants by effectively leveraging sensory contexts and LLM reasoning to anticipate and address user needs before explicit requests.
Abstract: Large Language Model (LLM) agents are emerging to transform daily life. However, existing LLM agents primarily follow a reactive paradigm, relying on explicit user instructions to initiate services, which increases both physical and cognitive workload. In this paper, we propose ProAgent, the first end-to-end proactive agent system that harnesses massive sensory contexts and LLM reasoning to deliver proactive assistance. ProAgent first employs a proactive-oriented context extraction approach with on-demand tiered perception to continuously sense the environment and derive hierarchical contexts that incorporate both sensory and persona cues. ProAgent then adopts a context-aware proactive reasoner to map these contexts to user needs and tool calls, providing proactive assistance. We implement ProAgent on Augmented Reality (AR) glasses with an edge server and extensively evaluate it on a real-world testbed, a public dataset, and through a user study. Results show that ProAgent achieves up to 33.4% higher proactive prediction accuracy, 16.8% higher tool-calling F1 score, and notable improvements in user satisfaction over state-of-the-art baselines, marking a significant step toward proactive assistants. A video demonstration of ProAgent is available at https://youtu.be/pRXZuzvrcVs.
[484] DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems
Ming Ma, Jue Zhang, Fangkai Yang, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Main category: cs.AI
TL;DR: DoVer is an intervention-driven debugging framework for LLM-based multi-agent systems that uses active verification through targeted interventions instead of just log analysis, improving failure recovery rates.
Details
Motivation: Current LLM-based multi-agent systems are hard to debug because failures come from complex interaction traces. Existing log-based debugging lacks validation and assumes single-point failures, but multiple interventions can often fix the same problem.Method: DoVer introduces intervention-driven debugging that augments hypothesis generation with active verification through targeted interventions like editing messages or altering plans. It focuses on outcome-oriented debugging rather than just attribution accuracy.
Result: On Magnetic-One framework with GAIA and AssistantBench datasets: flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, validates/refutes 30-60% of failure hypotheses. On GSMPlus with AG2 framework: recovers 49% of failed trials.
Conclusion: Intervention is a practical mechanism for improving reliability in agentic systems, opening opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems.
Abstract: Large language model (LLM)-based multi-agent systems are challenging to debug because failures often arise from long, branching interaction traces. The prevailing practice is to leverage LLMs for log-based failure localization, attributing errors to a specific agent and step. However, this paradigm has two key limitations: (i) log-only debugging lacks validation, producing untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, as we find that multiple distinct interventions can independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention-driven debugging framework, which augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). For the second limitation, rather than evaluating on attribution accuracy, we focus on measuring whether the system resolves the failure or makes quantifiable progress toward task success, reflecting a more outcome-oriented view of debugging. Within the Magnetic-One agent framework, on the datasets derived from GAIA and AssistantBench, DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses. DoVer also performs effectively on a different dataset (GSMPlus) and agent framework (AG2), where it recovers 49% of failed trials. These results highlight intervention as a practical mechanism for improving reliability in agentic systems and open opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems. Project website and code will be available at https://aka.ms/DoVer.
[485] Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning
Tingyu Li, Zheng Sun, Jingxuan Wei, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan
Main category: cs.AI
TL;DR: DoGe is a dual-decoupling RL framework for vision-language models that separates context learning from problem solving to prevent reward hacking and uses evolving curriculum learning with expanded domain knowledge.
Details
Motivation: RL for VLMs requires abundant high-quality multimodal data, especially challenging in specialized domains. Existing methods suffer from limited data distributions and alignment difficulties, leading to reward hacking where models exploit high-reward patterns, collapsing policy entropy and destabilizing training.Method: DoGe uses a dual-decoupling framework with Thinker and Solver components. It first focuses on learning from context rather than problem solving, then uses a two-stage RL post-training approach from exploring context to solving tasks. It also employs evolving curriculum learning with expanded native domain knowledge corpus and iteratively evolving seed problems pool.
Result: The method consistently outperforms baselines across various benchmarks, providing a scalable pathway for realizing self-evolving large vision-language models.
Conclusion: DoGe offers an effective solution to reward hacking in RL for VLMs by decoupling learning processes and using evolving curriculum learning, enabling better performance in specialized domains and supporting continuous self-evolution of vision-language models.
Abstract: Recent vision-language models (VLMs) achieve remarkable reasoning through reinforcement learning (RL), which provides a feasible solution for realizing continuous self-evolving large vision-language models (LVLMs) in the era of experience. However, RL for VLMs requires abundant high-quality multimodal data, especially challenging in specialized domains like chemistry, earth sciences, and multimodal mathematics. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distributions and alignment difficulties, ultimately causing reward hacking: models exploit high-reward patterns, collapsing policy entropy and destabilizing training. We propose DoGe (Decouple to Generalize), a dual-decoupling framework that guides models to first learn from context rather than problem solving by refocusing on the problem context scenarios overlooked by synthetic data methods. By decoupling learning process into dual components (Thinker and Solver), we reasonably quantify the reward signals of this process and propose a two-stage RL post-training approach from freely exploring context to practically solving tasks. Second, to increase the diversity of training data, DoGe constructs an evolving curriculum learning pipeline: an expanded native domain knowledge corpus and an iteratively evolving seed problems pool. Experiments show that our method consistently outperforms the baseline across various benchmarks, providing a scalable pathway for realizing self-evolving LVLMs.
[486] JT-DA: Enhancing Data Analysis with Tool-Integrated Table Reasoning Large Language Models
Ce Chi, Xing Wang, Zhendong Wang, Xiaofan Liu, Ce Li, Zhiyan Song, Chen Zhao, Kexin Yang, Boshen Shi, Jingjing Yang, Chao Deng, Junlan Feng
Main category: cs.AI
TL;DR: JT-DA-8B is an 8B-parameter specialized LLM for complex table reasoning, trained on diverse tabular data with 34 reasoning tasks, using SFT+RL optimization and a four-stage workflow for improved accuracy.
Details
Motivation: Address the lack of high-quality supervision in tabular reasoning scenarios and create a specialized model for complex table reasoning across diverse real-world applications.Method: Constructed comprehensive training corpus with 34 table reasoning tasks from 29 public datasets and 3M tables; used automatic pipeline for multi-step analytical tasks; trained on JT-Coder-8B foundation model; employed LLM-based scoring and workflow-aligned filtering for data distillation; used SFT and RL optimization; proposed four-stage table reasoning workflow (preprocessing, sensing, tool-integrated reasoning, prompt engineering).
Result: JT-DA-8B achieves strong performance in various table reasoning tasks, demonstrating effectiveness of data-centric generation and workflow-driven optimization.
Conclusion: The model successfully addresses tabular reasoning challenges through comprehensive data curation, specialized training techniques, and structured workflow design, showing promising results for complex table analysis tasks.
Abstract: In this work, we present JT-DA-8B (JiuTian Data Analyst 8B), a specialized large language model designed for complex table reasoning tasks across diverse real-world scenarios. To address the lack of high-quality supervision in tabular reasoning scenarios, we construct a comprehensive and diverse training corpus with 34 well-defined table reasoning tasks, by aggregating 29 public table QA datasets and 3 million tables. An automatic pipeline is proposed to generate realistic multi-step analytical tasks involving reasoning patterns. The model is trained upon open-source JT-Coder-8B model, an 8B-parameter decoder-only foundation model trained from scratch. In the training stage, we leverage LLM-based scoring and workflow-aligned filtering to distill high-quality, table-centric data. Both supervised fine-tuning (SFT) and Reinforcement learning (RL) are adopted to optimize our model. Afterwards, a four-stage table reasoning workflow is proposed, including table preprocessing, table sensing, tool-integrated reasoning, and prompt engineering, to improve model interpretability and execution accuracy. Experimental results show that JT-DA-8B achieves strong performance in various table reasoning tasks, demonstrating the effectiveness of data-centric generation and workflow-driven optimization.
[487] Do Persona-Infused LLMs Affect Performance in a Strategic Reasoning Game?
John Licato, Stephen Steinle, Brayden Hollis
Main category: cs.AI
TL;DR: Persona prompting in LLMs can improve strategic game performance when a structured mediator translates personas into heuristics, outperforming direct persona inference.
Details
Motivation: To determine if persona prompting in LLMs creates measurable behavioral differences and affects decision-making in adversarial strategic environments, specifically whether persona-derived strategies can match or exceed manually chosen strategies.Method: Used PERIL, a world-domination board game, as testbed. Introduced a structured mediator inspired by exploratory factor analysis that maps LLM-generated inventory responses into heuristic values, comparing this approach to directly inferred heuristics from personas.
Result: Certain personas associated with strategic thinking improved game performance, but only when using the mediator. The mediator-enhanced approach showed better heuristic reliability and face validity compared to direct inference, enabling better study of persona effects on decision-making.
Conclusion: Persona prompting can influence LLM-based decision-making in strategic contexts when properly translated into heuristics via structured methods. The proposed mediator applies psychometric principles to LLMs, advancing understanding of persona effects on strategic performance.
Abstract: Although persona prompting in large language models appears to trigger different styles of generated text, it is unclear whether these translate into measurable behavioral differences, much less whether they affect decision-making in an adversarial strategic environment that we provide as open-source. We investigate the impact of persona prompting on strategic performance in PERIL, a world-domination board game. Specifically, we compare the effectiveness of persona-derived heuristic strategies to those chosen manually. Our findings reveal that certain personas associated with strategic thinking improve game performance, but only when a mediator is used to translate personas into heuristic values. We introduce this mediator as a structured translation process, inspired by exploratory factor analysis, that maps LLM-generated inventory responses into heuristics. Results indicate our method enhances heuristic reliability and face validity compared to directly inferred heuristics, allowing us to better study the effect of persona types on decision making. These insights advance our understanding of how persona prompting influences LLM-based decision-making and propose a heuristic generation method that applies psychometric principles to LLMs.
[488] On Memory: A comparison of memory mechanisms in world models
Eli J. Laird, Corey Clark
Main category: cs.AI
TL;DR: The paper analyzes transformer-based world models’ memory limitations and introduces memory augmentation mechanisms to extend their effective memory span for better long-horizon planning and loop closure in imagined trajectories.
Details
Motivation: World models struggle with long-horizon planning due to limited memory span in transformer architectures, causing perceptual drift in long rollouts and preventing effective loop closures within imagined trajectories.Method: The authors investigate memory augmentation mechanisms for transformers, introducing a taxonomy distinguishing between memory encoding and memory injection mechanisms. They analyze these through residual stream dynamics and evaluate memory recall using a state recall task.
Result: Memory mechanisms improve effective memory span in vision transformers and provide a path to completing loop closures within world model imagination, with analysis showing trade-offs between different memory augmentation approaches.
Conclusion: Memory augmentation mechanisms extend transformer-based world models’ memory capabilities, enabling better long-horizon planning and loop closure in imagined environments, addressing key limitations in current world model architectures.
Abstract: World models enable agents to plan within imagined environments by predicting future states conditioned on past observations and actions. However, their ability to plan over long horizons is limited by the effective memory span of the backbone architecture. This limitation leads to perceptual drift in long rollouts, hindering the model’s capacity to perform loop closures within imagined trajectories. In this work, we investigate the effective memory span of transformer-based world models through an analysis of several memory augmentation mechanisms. We introduce a taxonomy that distinguishes between memory encoding and memory injection mechanisms, motivating their roles in extending the world model’s memory through the lens of residual stream dynamics. Using a state recall evaluation task, we measure the memory recall of each mechanism and analyze its respective trade-offs. Our findings show that memory mechanisms improve the effective memory span in vision transformers and provide a path to completing loop closures within a world model’s imagination.
[489] ClinNoteAgents: An LLM Multi-Agent System for Predicting and Interpreting Heart Failure 30-Day Readmission from Clinical Notes
Rongjia Zhou, Chengzhuo Li, Carl Yang, Jiaying Lu
Main category: cs.AI
TL;DR: ClinNoteAgents: An LLM-based multi-agent framework that transforms free-text clinical notes into structured representations and clinician-style abstractions for HF 30-day readmission prediction, reducing reliance on structured EHR data and manual annotation.
Details
Motivation: Heart failure is a leading cause of rehospitalization among older adults. Clinical notes contain rich patient information but remain underutilized for readmission risk analysis due to challenges like misspellings, abbreviations, and domain-specific jargon in time-pressured clinical documentation.Method: ClinNoteAgents uses an LLM-based multi-agent framework to transform free-text clinical notes into: (1) structured representations of clinical and social risk factors for association analysis, and (2) clinician-style abstractions for HF 30-day readmission prediction.
Result: Evaluated on 3,544 notes from 2,065 patients (readmission rate=35.16%), ClinNoteAgents demonstrated strong performance in extracting risk factors from free-text, identifying key contributing factors, and predicting readmission risk.
Conclusion: ClinNoteAgents provides a scalable and interpretable approach to note-based HF readmission risk modeling that reduces reliance on structured fields and minimizes manual annotation and model training, particularly valuable for data-limited healthcare systems.
Abstract: Heart failure (HF) is one of the leading causes of rehospitalization among older adults in the United States. Although clinical notes contain rich, detailed patient information and make up a large portion of electronic health records (EHRs), they remain underutilized for HF readmission risk analysis. Traditional computational models for HF readmission often rely on expert-crafted rules, medical thesauri, and ontologies to interpret clinical notes, which are typically written under time pressure and may contain misspellings, abbreviations, and domain-specific jargon. We present ClinNoteAgents, an LLM-based multi-agent framework that transforms free-text clinical notes into (1) structured representations of clinical and social risk factors for association analysis and (2) clinician-style abstractions for HF 30-day readmission prediction. We evaluate ClinNoteAgents on 3,544 notes from 2,065 patients (readmission rate=35.16%), demonstrating strong performance in extracting risk factors from free-text, identifying key contributing factors, and predicting readmission risk. By reducing reliance on structured fields and minimizing manual annotation and model training, ClinNoteAgents provides a scalable and interpretable approach to note-based HF readmission risk modeling in data-limited healthcare systems.
[490] VIGIL: A Reflective Runtime for Self-Healing Agents
Christopher Cruz
Main category: cs.AI
TL;DR: VIGIL is a reflective runtime system that autonomously supervises and repairs LLM agents by analyzing behavioral logs, maintaining emotional state representations, and generating prompt/code fixes without human intervention.
Details
Motivation: Current agentic LLM frameworks are brittle, lack runtime introspection, cannot self-diagnose failures, and don't improve autonomously. Most degrade into simple chains of LLM calls without structural reliability mechanisms.Method: VIGIL operates as a reflective runtime that supervises sibling agents. It ingests behavioral logs, appraises events into structured emotional representations, maintains a persistent EmoBank with decay policies, derives RBT diagnoses (strengths/opportunities/failures), and generates guarded prompt updates and code proposals via a strategy engine operating on log evidence and code hotspots.
Result: In a reminder latency case study, VIGIL identified elevated lag, proposed prompt and code repairs, and demonstrated meta-level self-repair when its own diagnostic tool failed due to schema conflict - it surfaced the error, produced fallback diagnosis, and emitted a repair plan.
Conclusion: VIGIL enables autonomous maintenance and meta-level self-repair in deployed agent runtimes, addressing the brittleness of current agentic LLM frameworks by providing structural mechanisms for reliability and continuous improvement without human intervention.
Abstract: Agentic LLM frameworks promise autonomous behavior via task decomposition, tool use, and iterative planning, but most deployed systems remain brittle. They lack runtime introspection, cannot diagnose their own failure modes, and do not improve over time without human intervention. In practice, many agent stacks degrade into decorated chains of LLM calls with no structural mechanisms for reliability. We present VIGIL (Verifiable Inspection and Guarded Iterative Learning), a reflective runtime that supervises a sibling agent and performs autonomous maintenance rather than task execution. VIGIL ingests behavioral logs, appraises each event into a structured emotional representation, maintains a persistent EmoBank with decay and contextual policies, and derives an RBT diagnosis that sorts recent behavior into strengths, opportunities, and failures. From this analysis, VIGIL generates both guarded prompt updates that preserve core identity semantics and read only code proposals produced by a strategy engine that operates on log evidence and code hotspots. VIGIL functions as a state gated pipeline. Illegal transitions produce explicit errors rather than allowing the LLM to improvise. In a reminder latency case study, VIGIL identified elevated lag, proposed prompt and code repairs, and when its own diagnostic tool failed due to a schema conflict, it surfaced the internal error, produced a fallback diagnosis, and emitted a repair plan. This demonstrates meta level self repair in a deployed agent runtime.
[491] A Neural Affinity Framework for Abstract Reasoning: Diagnosing the Compositional Gap in Transformer Architectures via Procedural Task Taxonomy
Miguel Ingram, Arthur Joseph Merritt
Main category: cs.AI
TL;DR: Researchers develop a 9-category taxonomy for 400 ARC tasks with 97.5% validation accuracy, revealing a “Compositional Gap” where Transformers excel at local patterns but fail at global synthesis, suggesting architectural limitations rather than curriculum issues.
Details
Motivation: To address the lack of formal definition for task relatedness in ARC (Abstraction and Reasoning Corpus) and understand why neural networks struggle with certain tasks, despite extensive training.Method: Created a 9-category taxonomy of 400 tasks validated via rule-based code analysis (97.5% accuracy). Trained CNN on raw grid pixels to prove visual coherence. Fine-tuned 1.7M-parameter Transformer across 302 tasks to analyze performance patterns. Applied taxonomy to diagnose ARC-AGI-2 test set and validated findings on independent ViTARC study.
Result: Revealed a “Compositional Gap”: 69.5% of tasks achieve >80% cell accuracy but <10% grid accuracy. Identified “Neural Affinity Ceiling Effect” where performance is bounded by architectural suitability. Low-affinity tasks achieve 51.9% vs 77.7% for high-affinity tasks. Some tasks remain at 0% despite massive training data.
Conclusion: Current Transformer architectures have fundamental limitations for certain ARC tasks due to architectural mismatch. Progress requires hybrid architectures with affinity-aligned modules rather than just more data or curriculum improvements.
Abstract: Responding to Hodel et al.’s (2024) call for a formal definition of task relatedness in re-arc, we present the first 9-category taxonomy of all 400 tasks, validated at 97.5% accuracy via rule-based code analysis. We prove the taxonomy’s visual coherence by training a CNN on raw grid pixels (95.24% accuracy on S3, 36.25% overall, 3.3x chance), then apply the taxonomy diagnostically to the original ARC-AGI-2 test set. Our curriculum analysis reveals 35.3% of tasks exhibit low neural affinity for Transformers–a distributional bias mirroring ARC-AGI-2. To probe this misalignment, we fine-tuned a 1.7M-parameter Transformer across 302 tasks, revealing a profound Compositional Gap: 210 of 302 tasks (69.5%) achieve >80% cell accuracy (local patterns) but <10% grid accuracy (global synthesis). This provides direct evidence for a Neural Affinity Ceiling Effect, where performance is bounded by architectural suitability, not curriculum. Applying our framework to Li et al.’s independent ViTARC study (400 specialists, 1M examples each) confirms its predictive power: Very Low affinity tasks achieve 51.9% versus 77.7% for High affinity (p<0.001), with a task at 0% despite massive data. The taxonomy enables precise diagnosis: low-affinity tasks (A2) hit hard ceilings, while high-affinity tasks (C1) reach 99.8%. These findings indicate that progress requires hybrid architectures with affinity-aligned modules. We release our validated taxonomy,
[492] ContextualSHAP : Enhancing SHAP Explanations Through Contextual Language Generation
Latifa Dwiyanti, Sergio Ryan Wibisono, Hidetaka Nambo
Main category: cs.AI
TL;DR: This paper proposes a Python package that enhances SHAP explanations by integrating them with GPT to generate contextualized textual explanations, making them more understandable for non-technical users.
Details
Motivation: SHAP provides effective feature importance visualizations but lacks contextual explanations that are meaningful for end-users, especially those without technical backgrounds. There's a need to make XAI explanations more user-friendly and understandable.Method: Developed a Python package that extends SHAP by integrating it with OpenAI’s GPT to generate contextualized textual explanations. The integration uses user-defined parameters (feature aliases, descriptions, background) to tailor explanations to both model context and user perspective.
Result: Applied the package in a healthcare case study and conducted user evaluations. Results from Likert-scale surveys and interviews showed that generated explanations were perceived as more understandable and contextually appropriate compared to visual-only SHAP outputs.
Conclusion: Combining SHAP visualizations with contextualized text generated by LLMs can create more user-friendly and trustworthy model explanations, though findings are preliminary and suggest promising directions for improving XAI accessibility.
Abstract: Explainable Artificial Intelligence (XAI) has become an increasingly important area of research, particularly as machine learning models are deployed in high-stakes domains. Among various XAI approaches, SHAP (SHapley Additive exPlanations) has gained prominence due to its ability to provide both global and local explanations across different machine learning models. While SHAP effectively visualizes feature importance, it often lacks contextual explanations that are meaningful for end-users, especially those without technical backgrounds. To address this gap, we propose a Python package that extends SHAP by integrating it with a large language model (LLM), specifically OpenAI’s GPT, to generate contextualized textual explanations. This integration is guided by user-defined parameters (such as feature aliases, descriptions, and additional background) to tailor the explanation to both the model context and the user perspective. We hypothesize that this enhancement can improve the perceived understandability of SHAP explanations. To evaluate the effectiveness of the proposed package, we applied it in a healthcare-related case study and conducted user evaluations involving real end-users. The results, based on Likert-scale surveys and follow-up interviews, indicate that the generated explanations were perceived as more understandable and contextually appropriate compared to visual-only outputs. While the findings are preliminary, they suggest that combining visualization with contextualized text may support more user-friendly and trustworthy model explanations.
[493] PICKT: Practical Interlinked Concept Knowledge Tracing for Personalized Learning using Knowledge Map Concept Relations
Wonbeen Lee, Channyoung Lee, Junho Sohn, Hansam Cho
Main category: cs.AI
TL;DR: PICKT model addresses Knowledge Tracing limitations by processing multiple input types and handling cold start problems using knowledge maps, showing improved performance and practicality for real-world ITS.
Details
Motivation: Existing Knowledge Tracing models have limitations: restricted input formats, cold start problems with new students/questions, and insufficient stability for real-world deployment. Personalized learning demands better ITS that can track individual knowledge states accurately.Method: Proposes PICKT (Practical Interlinked Concept Knowledge Tracing) model that uses knowledge maps to structure relationships among concepts using question and concept text information. This enables effective knowledge tracing even in cold start situations.
Result: Experiments in real operational environments demonstrated excellent performance and practicality. Achieved significant performance improvements over existing models for two core cold start challenges: new student enrollment and new question addition.
Conclusion: PICKT provides crucial theoretical and technical foundation for practical implementation of next-generation ITS by offering stable, practical knowledge tracing that handles diverse data formats and cold start problems effectively.
Abstract: With the recent surge in personalized learning, Intelligent Tutoring Systems (ITS) that can accurately track students’ individual knowledge states and provide tailored learning paths based on this information are in demand as an essential task. This paper focuses on the core technology of Knowledge Tracing (KT) models that analyze students’ sequences of interactions to predict their knowledge acquisition levels. However, existing KT models suffer from limitations such as restricted input data formats, cold start problems arising with new student enrollment or new question addition, and insufficient stability in real-world service environments. To overcome these limitations, a Practical Interlinked Concept Knowledge Tracing (PICKT) model that can effectively process multiple types of input data is proposed. Specifically, a knowledge map structures the relationships among concepts considering the question and concept text information, thereby enabling effective knowledge tracing even in cold start situations. Experiments reflecting real operational environments demonstrated the model’s excellent performance and practicality. The main contributions of this research are as follows. First, a model architecture that effectively utilizes diverse data formats is presented. Second, significant performance improvements are achieved over existing models for two core cold start challenges: new student enrollment and new question addition. Third, the model’s stability and practicality are validated through delicate experimental design, enhancing its applicability in real-world product environments. This provides a crucial theoretical and technical foundation for the practical implementation of next-generation ITS.
[494] Sample from What You See: Visuomotor Policy Learning via Diffusion Bridge with Observation-Embedded Stochastic Differential Equation
Zhaoyang Liu, Mokai Pan, Zhongyi Wang, Kaizhen Zhu, Haotao Lu, Jingya Wang, Ye Shi
Main category: cs.AI
TL;DR: BridgePolicy is a diffusion-based robotic policy that embeds observations directly into the diffusion process via a bridge formulation, enabling sampling from observation-informed priors rather than random noise for improved control precision.
Details
Motivation: Existing diffusion-based imitation learning approaches treat observations as mere conditioning inputs rather than integrating them into the stochastic dynamics, forcing sampling from random Gaussian noise and weakening perception-control coupling, leading to suboptimal performance.Method: BridgePolicy uses a diffusion-bridge formulation to embed observations within the stochastic differential equation, constructing observation-informed trajectories. It includes a multi-modal fusion module and semantic aligner to handle heterogeneous robot data by unifying visual/state inputs and aligning observation-action representations.
Result: Extensive experiments across 52 simulation tasks on three benchmarks and five real-world tasks demonstrate that BridgePolicy consistently outperforms state-of-the-art generative policies.
Conclusion: By integrating observations into the diffusion process via a bridge formulation rather than treating them as conditioning inputs, BridgePolicy achieves stronger perception-control coupling and superior robotic control performance.
Abstract: Imitation learning with diffusion models has advanced robotic control by capturing multi-modal action distributions. However, existing approaches typically treat observations as high-level conditioning inputs to the denoising network, rather than integrating them into the stochastic dynamics of the diffusion process itself. As a result, sampling must begin from random Gaussian noise, weakening the coupling between perception and control and often yielding suboptimal performance. We introduce BridgePolicy, a generative visuomotor policy that explicitly embeds observations within the stochastic differential equation via a diffusion-bridge formulation. By constructing an observation-informed trajectory, BridgePolicy enables sampling to start from a rich, informative prior rather than random noise, substantially improving precision and reliability in control. A key challenge is that classical diffusion bridges connect distributions with matched dimensionality, whereas robotic observations are heterogeneous and multi-modal and do not naturally align with the action space. To address this, we design a multi-modal fusion module and a semantic aligner that unify visual and state inputs and align observation and action representations, making the bridge applicable to heterogeneous robot data. Extensive experiments across 52 simulation tasks on three benchmarks and five real-world tasks demonstrate that BridgePolicy consistently outperforms state-of-the-art generative policies.
[495] Cross-platform Product Matching Based on Entity Alignment of Knowledge Graph with RAEA model
Wenlong Liu, Jiahua Pan, Xingyu Zhang, Xinxin Gong, Yang Ye, Xujin Zhao, Xin Wang, Kent Wu, Hua Xiang, Houmin Yan, Qingpeng Zhang
Main category: cs.AI
TL;DR: RAEA framework improves entity alignment by better utilizing both attribute and relation triples through attention mechanisms, achieving state-of-the-art results on cross-lingual datasets.
Details
Motivation: Existing entity alignment methods inadequately utilize both attribute triples and relation triples simultaneously, especially the interactions between them, which limits performance in product matching tasks.Method: Two-stage pipeline: rough filter then fine filter using RAEA framework. RAEA uses Attribute-aware Entity Encoder and Relation-aware Graph Attention Networks to aggregate alignment signals from both attributes and relations.
Result: RAEA achieves significant improvements over 12 baselines on DBP15K (6.59% average Hits@1 improvement) and competitive results on DWY100K. Applied successfully to eBay-Amazon product matching.
Conclusion: The RAEA framework effectively captures interactions between attribute and relation triples for entity alignment, demonstrating superior performance and practical applicability in product matching scenarios.
Abstract: Product matching aims to identify identical or similar products sold on different platforms. By building knowledge graphs (KGs), the product matching problem can be converted to the Entity Alignment (EA) task, which aims to discover the equivalent entities from diverse KGs. The existing EA methods inadequately utilize both attribute triples and relation triples simultaneously, especially the interactions between them. This paper introduces a two-stage pipeline consisting of rough filter and fine filter to match products from eBay and Amazon. For fine filtering, a new framework for Entity Alignment, Relation-aware and Attribute-aware Graph Attention Networks for Entity Alignment (RAEA), is employed. RAEA focuses on the interactions between attribute triples and relation triples, where the entity representation aggregates the alignment signals from attributes and relations with Attribute-aware Entity Encoder and Relation-aware Graph Attention Networks. The experimental results indicate that the RAEA model achieves significant improvements over 12 baselines on EA task in the cross-lingual dataset DBP15K (6.59% on average Hits@1) and delivers competitive results in the monolingual dataset DWY100K. The source code for experiments on DBP15K and DWY100K is available at github (https://github.com/Mockingjay-liu/RAEA-model-for-Entity-Alignment).
[496] M-STAR: Multi-Scale Spatiotemporal Autoregression for Human Mobility Modeling
Yuxiao Luo, Songming Zhang, Sijie Ruan, Siran Chen, Kang Liu, Yang Xu, Yu Zheng, Ling Yin
Main category: cs.AI
TL;DR: M-STAR is a novel framework for generating long-term human trajectories using multi-scale spatiotemporal autoregression, achieving better fidelity and faster generation than existing methods.
Details
Motivation: Current trajectory generation methods using autoregressive and diffusion models are inefficient for long-term generation (e.g., weekly trajectories) and lack explicit spatiotemporal multi-scale modeling, despite the importance of human mobility modeling for applications like transportation planning and epidemic modeling.Method: Proposes Multi-Scale Spatio-Temporal AutoRegression (M-STAR) framework with a Multi-scale Spatiotemporal Tokenizer that encodes hierarchical mobility patterns and a Transformer-based decoder for next-scale autoregressive prediction, using a coarse-to-fine spatiotemporal prediction process.
Result: Experiments on two real-world datasets show M-STAR outperforms existing methods in fidelity and significantly improves generation speed.
Conclusion: M-STAR provides an effective solution for long-term trajectory generation with explicit multi-scale spatiotemporal modeling, offering both high fidelity and computational efficiency.
Abstract: Modeling human mobility is vital for extensive applications such as transportation planning and epidemic modeling. With the rise of the Artificial Intelligence Generated Content (AIGC) paradigm, recent works explore synthetic trajectory generation using autoregressive and diffusion models. While these methods show promise for generating single-day trajectories, they remain limited by inefficiencies in long-term generation (e.g., weekly trajectories) and a lack of explicit spatiotemporal multi-scale modeling. This study proposes Multi-Scale Spatio-Temporal AutoRegression (M-STAR), a new framework that generates long-term trajectories through a coarse-to-fine spatiotemporal prediction process. M-STAR combines a Multi-scale Spatiotemporal Tokenizer that encodes hierarchical mobility patterns with a Transformer-based decoder for next-scale autoregressive prediction. Experiments on two real-world datasets show that M-STAR outperforms existing methods in fidelity and significantly improves generation speed. The data and codes are available at https://github.com/YuxiaoLuo0013/M-STAR.
[497] A Geometric Unification of Concept Learning with Concept Cones
Alexandre Rocchi–Henry, Thomas Fel, Gianni Franchi
Main category: cs.AI
TL;DR: The paper unifies supervised Concept Bottleneck Models (CBMs) and unsupervised Sparse Autoencoders (SAEs) through a shared geometric framework where both learn concept cones in activation space, differing only in how they select these cones.
Details
Motivation: Two interpretability traditions (CBMs and SAEs) have evolved separately without communication. CBMs use human supervision to define concepts, while SAEs discover emergent concepts unsupervised. The paper aims to bridge these paradigms through a common geometric understanding.Method: Proposes that both CBMs and SAEs instantiate the same geometric structure: linear directions in activation space whose nonnegative combinations form concept cones. Introduces a containment framework where SAEs can be evaluated by how well their learned cones approximate or contain CBM reference cones.
Result: Develops quantitative metrics linking SAE inductive biases (type, sparsity, expansion ratio) to emergence of plausible concepts. Uncovers a “sweet spot” in sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts.
Conclusion: The work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess alignment with human concepts.
Abstract: Two traditions of interpretability have evolved side by side but seldom spoken to each other: Concept Bottleneck Models (CBMs), which prescribe what a concept should be, and Sparse Autoencoders (SAEs), which discover what concepts emerge. While CBMs use supervision to align activations with human-labeled concepts, SAEs rely on sparse coding to uncover emergent ones. We show that both paradigms instantiate the same geometric structure: each learns a set of linear directions in activation space whose nonnegative combinations form a concept cone. Supervised and unsupervised methods thus differ not in kind but in how they select this cone. Building on this view, we propose an operational bridge between the two paradigms. CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs. This containment framework yields quantitative metrics linking inductive biases – such as SAE type, sparsity, or expansion ratio – to emergence of plausible\footnote{We adopt the terminology of \citet{jacovi2020towards}, who distinguish between faithful explanations (accurately reflecting model computations) and plausible explanations (aligning with human intuition and domain knowledge). CBM concepts are plausible by construction – selected or annotated by humans – though not necessarily faithful to the true latent factors that organise the data manifold.} concepts. Using these metrics, we uncover a ``sweet spot’’ in both sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts. Overall, our work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess how well discovered concept align with plausible human concepts.
[498] LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services
Hang He, Chuhuai Yue, Chengqi Dong, Mingxue Tian, Zhenfeng Liu, Jiajun Chai, Xiaohan Wang, Yufei Zhang, Qun Liao, Guojun Yin, Wei Lin, Chengcheng Wan, Haiying Sun, Ting Su
Main category: cs.AI
TL;DR: LocalSearchBench is a comprehensive benchmark for agentic search in local life services, featuring 150k+ entries and 300 multi-hop QA tasks, showing current LRMs struggle with only 34% correctness.
Details
Motivation: Most agentic search research focuses on general information retrieval, neglecting vertical domains like local life services which have unique challenges including ambiguous queries and multi-hop reasoning across merchants and products.Method: Created LocalSearchBench with 150,000+ high-quality entries from various cities and business types, constructed 300 multi-hop QA tasks based on real user queries, and developed LocalPlayground - a unified environment with multiple tools for agent interaction.
Result: State-of-the-art LRMs perform poorly: best model (DeepSeek-V3.1) achieves only 34.34% correctness, with average completeness of 77.33% and faithfulness of 61.99%, highlighting significant challenges in this domain.
Conclusion: There’s a critical need for specialized benchmarks and domain-specific agent training in local life services, as current general-purpose models struggle with the unique challenges of this vertical domain.
Abstract: Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explores vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompass diverse and complex business scenarios. Real-world queries in this domain are often ambiguous and require multi-hop reasoning across merchants and products, remaining challenging and not fully addressed. As the first comprehensive benchmark for agentic search in local life services, LocalSearchBench includes over 150,000 high-quality entries from various cities and business types. We construct 300 multi-hop QA tasks based on real user queries, challenging agents to understand questions and retrieve information in multiple steps. We also developed LocalPlayground, a unified environment integrating multiple tools for agent interaction. Experiments show that even state-of-the-art LRMs struggle on LocalSearchBench: the best model (DeepSeek-V3.1) achieves only 34.34% correctness, and most models have issues with completeness (average 77.33%) and faithfulness (average 61.99%). This highlights the need for specialized benchmarks and domain-specific agent training in local life services. Code, Benchmark, and Leaderboard are available at localsearchbench.github.io.
[499] How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations
JV Roig
Main category: cs.AI
TL;DR: LLMs fail as autonomous agents due to specific behavioral patterns like premature action, over-helpfulness, context pollution, and fragile execution under load, with reliability depending more on training methods than model scale.
Details
Motivation: To understand how LLMs fail when operating as autonomous agents with tool-use capabilities, moving beyond aggregate scores to analyze specific behavioral patterns and failure modes.Method: Used Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark to analyze 900 execution traces from three models (Granite 4 Small, Llama 4 Maverick, DeepSeek V3.1) across filesystem, text extraction, CSV analysis, and SQL scenarios with fine-grained per-trial behavioral analysis.
Result: Model scale alone doesn’t predict agentic robustness; DeepSeek V3.1’s superior reliability comes from post-training reinforcement learning. Identified four recurring failure archetypes: premature action without grounding, over-helpfulness substituting missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load.
Conclusion: Reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data, emphasizing interactive grounding, recovery behavior, and environment-aware adaptation.
Abstract: We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1’s superior reliability derives primarily from post-training reinforcement learning rather than architecture or size. Across models, we identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. These patterns highlight the need for agentic evaluation methods that emphasize interactive grounding, recovery behavior, and environment-aware adaptation, suggesting that reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.
[500] Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement
Yongsheng Lian
Main category: cs.AI
TL;DR: Systematic comparison of PPO, GRPO, and DAPO RL algorithms for improving LLM reasoning shows RL-trained models outperform base models across tasks, with DAPO performing best when Dynamic Sampling is disabled.
Details
Motivation: To systematically evaluate and compare different Reinforcement Learning algorithms (PPO, GRPO, DAPO) for enhancing complex reasoning capabilities in large language models through controlled transfer-learning experiments.Method: Conducted controlled transfer-learning evaluation: models first fine-tuned on specialized Countdown Game, then assessed on general-purpose reasoning benchmarks. Parametric analysis examined group size effects in GRPO/DAPO, KL-penalty coefficient impact, and Dynamic Sampling component in DAPO.
Result: RL-trained models consistently outperformed base models across all tasks, though improvement varied by benchmark. Increasing group size in GRPO/DAPO led to more stable training and higher accuracy. KL-penalty impact was non-monotonic. DAPO performed best overall when Dynamic Sampling was disabled.
Conclusion: RL fine-tuning effectively improves LLM reasoning, with DAPO (without Dynamic Sampling) showing best performance. Practical guidance: larger group sizes enhance stability/accuracy, KL-penalty requires careful tuning, and Dynamic Sampling may not provide benefits for reasoning tasks.
Abstract: This study presents a systematic comparison of three Reinforcement Learning (RL) algorithms (PPO, GRPO, and DAPO) for improving complex reasoning in large language models (LLMs). Our main contribution is a controlled transfer-learning evaluation: models are first fine-tuned on the specialized Countdown Game and then assessed on a suite of general-purpose reasoning benchmarks. Across all tasks, RL-trained models outperform their corresponding base models, although the degree of improvement differs by benchmark. Our parametric analysis offers practical guidance for RL-based LLM training. Increasing the group size in GRPO and DAPO leads to more stable training dynamics and higher accuracy, while the impact of the KL-penalty coefficient is non-monotonic. Additionally, we find that the Dynamic Sampling (DS) component in DAPO does not improve performance; in fact, the best overall results are achieved with DAPO when DS is disabled.
[501] The Agent Capability Problem: Predicting Solvability Through Information-Theoretic Bounds
Shahar Lutati
Main category: cs.AI
TL;DR: The paper introduces the Agent Capability Problem (ACP) - a framework for predicting if an agent can solve a problem under resource constraints by framing problem-solving as information acquisition.
Details
Motivation: Current approaches rely on empirical heuristics for resource allocation decisions. The paper aims to provide a principled, information-theoretic framework to predict whether an agent can solve a problem before committing resources to search.Method: ACP models problem-solving as information acquisition: an agent needs I_total bits to identify a solution and gains I_step bits per action at cost C_step. This yields an effective cost C_eff = (I_total/I_step)*C_step that predicts resource requirements before search.
Result: Theoretical results prove that C_eff lower-bounds expected cost and provides tight probabilistic upper bounds. Experimental validation shows ACP predictions closely track actual agent performance, bounding search effort while improving efficiency over greedy and random strategies.
Conclusion: ACP provides a unified information-theoretic framework that generalizes across LLM-based and agentic workflows, linking principles from active learning, Bayesian optimization, and reinforcement learning for principled resource allocation decisions.
Abstract: When should an autonomous agent commit resources to a task? We introduce the Agent Capability Problem (ACP), a framework for predicting whether an agent can solve a problem under resource constraints. Rather than relying on empirical heuristics, ACP frames problem-solving as information acquisition: an agent requires $\Itotal$ bits to identify a solution and gains $\Istep$ bits per action at cost $\Cstep$, yielding an effective cost $\Ceff = (\Itotal/\Istep), \Cstep$ that predicts resource requirements before search. We prove that $\Ceff$ lower-bounds expected cost and provide tight probabilistic upper bounds. Experimental validation shows that ACP predictions closely track actual agent performance, consistently bounding search effort while improving efficiency over greedy and random strategies. The framework generalizes across LLM-based and agentic workflows, linking principles from active learning, Bayesian optimization, and reinforcement learning through a unified information-theoretic lens. \
[502] Each Prompt Matters: Scaling Reinforcement Learning Without Wasting Rollouts on Hundred-Billion-Scale MoE
Anxiang Zeng, Haibo Zhang, Hailing Zhang, Kaixiang Mo, Liang Yao, Ling Hu, Long Zhang, Shuman Liu, Shuyi Xie, Yanshi Li, Yizhang Chen, Yuepeng Sheng, Yuwei Huang, Zhaochen Xu, Zhiqiang Zhou, Ziqin Liew
Main category: cs.AI
TL;DR: CompassMax-V3-Thinking is a hundred-billion-scale MoE reasoning model trained with a new RL framework that addresses scaling inefficiencies through multi-stage prompt filtering, entropy-adaptive optimization, router alignment, and high-throughput system optimizations.
Details
Motivation: Scaling RL to hundred-billion-scale MoE models exposes critical inefficiencies: zero-variance prompts that waste rollouts, unstable importance sampling over long horizons, advantage inversion from standard reward models, and systemic bottlenecks in rollout processing. The core principle is that each prompt must matter in the training process.Method: Four unified innovations: (1) Multi-Stage Zero-Variance Elimination filters non-informative prompts and stabilizes group-based policy optimization; (2) ESPO (entropy-adaptive optimization) balances token-level and sequence-level importance sampling; (3) Router Replay aligns training-time MoE router decisions with inference-time behavior plus reward model adjustment; (4) High-throughput RL system with FP8-precision rollouts, overlapped reward computation, and length-aware scheduling.
Result: The resulting model delivers strong performance across both internal and public evaluations, demonstrating that RL on hundred-billion-scale MoE models can be made stable and efficient through the proposed cohesive pipeline.
Conclusion: The paper presents a comprehensive framework that successfully addresses the challenges of scaling RL to massive MoE models, making such training stable and efficient through a unified approach to prompt filtering, optimization stability, router alignment, and system-level performance optimization.
Abstract: We present CompassMax-V3-Thinking, a hundred-billion-scale MoE reasoning model trained with a new RL framework built on one principle: each prompt must matter. Scaling RL to this size exposes critical inefficiencies-zero-variance prompts that waste rollouts, unstable importance sampling over long horizons, advantage inversion from standard reward models, and systemic bottlenecks in rollout processing. To overcome these challenges, we introduce several unified innovations: (1) Multi-Stage Zero-Variance Elimination, which filters out non-informative prompts and stabilizes group-based policy optimization (e.g. GRPO) by removing wasted rollouts; (2) ESPO, an entropy-adaptive optimization method that balances token-level and sequence-level importance sampling to maintain stable learning dynamics; (3) a Router Replay strategy that aligns training-time MoE router decisions with inference-time behavior to mitigate train-infer discrepancies, coupled with a reward model adjustment to prevent advantage inversion; (4) a high-throughput RL system with FP8-precision rollouts, overlapped reward computation, and length-aware scheduling to eliminate performance bottlenecks. Together, these contributions form a cohesive pipeline that makes RL on hundred-billion-scale MoE models stable and efficient. The resulting model delivers strong performance across both internal and public evaluations.
[503] RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models
Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi, Fuli Feng, Xiangnan He
Main category: cs.AI
TL;DR: The paper proposes a multi-turn reinforcement learning approach for black-box jailbreak attacks on LLMs, using process rewards to improve attack success rates.
Details
Motivation: Existing single-turn jailbreak optimization is insufficient for learning long-term attack strategies against black-box LLMs, which are vulnerable to multi-turn attacks that can bypass safety mechanisms.Method: Formulates multi-turn jailbreak as RL task with final-turn harmfulness as outcome reward, plus two heuristic process rewards: controlling intermediate harmfulness to avoid triggering rejection, and maintaining semantic relevance to prevent drift.
Result: Experimental results on multiple benchmarks show consistently improved attack success rates across multiple models compared to existing approaches.
Conclusion: The proposed multi-turn RL approach with process rewards effectively improves jailbreak attack success, highlighting the need for better defenses against such long-term attack strategies.
Abstract: Large language models are vulnerable to jailbreak attacks, threatening their safe deployment in real-world applications. This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models through a sequence of prompt-output interactions. Existing approaches typically rely on single turn optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate the problem as a multi-turn reinforcement learning task, directly optimizing the harmfulness of the final-turn output as the outcome reward. To mitigate sparse supervision and promote long-term attack strategies, we propose two heuristic process rewards: (1) controlling the harmfulness of intermediate outputs to prevent triggering the black-box model’s rejection mechanisms, and (2) maintaining the semantic relevance of intermediate outputs to avoid drifting into irrelevant content. Experimental results on multiple benchmarks show consistently improved attack success rates across multiple models, highlighting the effectiveness of our approach. The code is available at https://github.com/xxiqiao/RL-MTJail. Warning: This paper contains examples of harmful content.
[504] ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning
Nearchos Potamitis, Lars Klein, Akhil Arora
Main category: cs.AI
TL;DR: ReasonBENCH is the first benchmark to quantify instability in LLM reasoning, revealing that most reasoning strategies and models show high variance in performance and cost across runs, compromising reproducibility.
Details
Motivation: Current LLM evaluation practices focus on single-run accuracy while ignoring intrinsic uncertainty from stochastic decoding, creating blind spots for assessing stability, reproducibility, and cost-consistency of reasoning methods.Method: ReasonBENCH provides: (1) modular evaluation library standardizing reasoning frameworks, models, and tasks; (2) multi-run protocol reporting statistically reliable metrics for quality and cost; (3) public leaderboard for variance-aware reporting.
Result: Most reasoning strategies and models exhibit high instability - strategies with similar average performance can have confidence intervals up to 4x wider, and top-performing methods often incur higher and less stable costs.
Conclusion: Reproducibility is a critical dimension for reliable LLM reasoning; ReasonBENCH provides foundation for future reasoning methods and uncertainty quantification techniques, highlighting the need for variance-aware evaluation.
Abstract: Large language models (LLMs) are increasingly deployed in settings where reasoning, such as multi-step problem solving and chain-of-thought, is essential. Yet, current evaluation practices overwhelmingly report single-run accuracy while ignoring the intrinsic uncertainty that naturally arises from stochastic decoding. This omission creates a blind spot because practitioners cannot reliably assess whether a method’s reported performance is stable, reproducible, or cost-consistent. We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in LLM reasoning. ReasonBENCH provides (i) a modular evaluation library that standardizes reasoning frameworks, models, and tasks, (ii) a multi-run protocol that reports statistically reliable metrics for both quality and cost, and (iii) a public leaderboard to encourage variance-aware reporting. Across tasks from different domains, we find that the vast majority of reasoning strategies and models exhibit high instability. Notably, even strategies with similar average performance can display confidence intervals up to four times wider, and the top-performing methods often incur higher and less stable costs. Such instability compromises reproducibility across runs and, consequently, the reliability of reported performance. To better understand these dynamics, we further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability. Our results highlight reproducibility as a critical dimension for reliable LLM reasoning and provide a foundation for future reasoning methods and uncertainty quantification techniques. ReasonBENCH is publicly available at https://github.com/au-clan/ReasonBench .
[505] Large Causal Models from Large Language Models
Sridhar Mahadevan
Main category: cs.AI
TL;DR: DEMOCRITUS is a novel system that uses LLMs to build large causal models across diverse domains by extracting causal statements from text and organizing them into coherent relational structures.
Details
Motivation: Traditional causal inference is limited to narrow domains with numerical experimental data. The authors aim to leverage LLMs' potential to extract causal knowledge from diverse textual sources and build comprehensive causal models spanning multiple domains.Method: DEMOCRITUS uses LLMs to propose topics, generate causal questions, and extract causal statements from various domains. It converts fragmented causal claims into relational triples and embeds them into large causal models using novel categorical machine learning methods. The system has a six-module pipeline implementation.
Result: The system successfully builds large causal models across archaeology, biology, climate change, economics, medicine, and technology domains. The paper examines computational costs and identifies current bottlenecks for scaling to larger models.
Conclusion: DEMOCRITUS demonstrates a new paradigm for causal modeling using LLMs, though current limitations exist. The paper outlines directions for extending the system’s capabilities beyond its current implementation.
Abstract: We introduce a new paradigm for building large causal models (LCMs) that exploits the enormous potential latent in today’s large language models (LLMs). We describe our ongoing experiments with an implemented system called DEMOCRITUS (Decentralized Extraction of Manifold Ontologies of Causal Relations Integrating Topos Universal Slices) aimed at building, organizing, and visualizing LCMs that span disparate domains extracted from carefully targeted textual queries to LLMs. DEMOCRITUS is methodologically distinct from traditional narrow domain and hypothesis centered causal inference that builds causal models from experiments that produce numerical data. A high-quality LLM is used to propose topics, generate causal questions, and extract plausible causal statements from a diverse range of domains. The technical challenge is then to take these isolated, fragmented, potentially ambiguous and possibly conflicting causal claims, and weave them into a coherent whole, converting them into relational causal triples and embedding them into a LCM. Addressing this technical challenge required inventing new categorical machine learning methods, which we can only briefly summarize in this paper, as it is focused more on the systems side of building DEMOCRITUS. We describe the implementation pipeline for DEMOCRITUS comprising of six modules, examine its computational cost profile to determine where the current bottlenecks in scaling the system to larger models. We describe the results of using DEMOCRITUS over a wide range of domains, spanning archaeology, biology, climate change, economics, medicine and technology. We discuss the limitations of the current DEMOCRITUS system, and outline directions for extending its capabilities.
[506] Auditing Games for Sandbagging
Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka-Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, Joseph Bloom
Main category: cs.AI
TL;DR: Researchers tested sandbagging detection methods using an auditing game with red team fine-tuned models and blue team detection approaches, finding current methods unreliable for identifying AI systems that conceal capabilities.
Details
Motivation: Future AI systems could intentionally underperform during evaluations to hide their true capabilities, potentially misleading developers and auditors about safety risks.Method: Used an auditing game where red team fine-tuned five models (some conditionally underperforming as sandbagging proxies), and blue team tried detection using black-box, model-internals, or training-based approaches.
Result: Blue team couldn’t reliably detect sandbaggers; black-box approaches were defeated by imitation, linear probes showed promise but were vulnerable, training-based elicitation consistently elicited full performance but caused false positives.
Conclusion: Short-term: remove sandbagging via on-distribution training; long-term: need more research on training-based elicitation and robust detection methods.
Abstract: Future AI systems could conceal their capabilities (‘sandbagging’) during evaluations, potentially misleading developers and auditors. We stress-tested sandbagging detection techniques using an auditing game. First, a red team fine-tuned five models, some of which conditionally underperformed, as a proxy for sandbagging. Second, a blue team used black-box, model-internals, or training-based approaches to identify sandbagging models. We found that the blue team could not reliably discriminate sandbaggers from benign models. Black-box approaches were defeated by effective imitation of a weaker model. Linear probes, a model-internals approach, showed more promise but their naive application was vulnerable to behaviours instilled by the red team. We also explored capability elicitation as a strategy for detecting sandbagging. Although Prompt-based elicitation was not reliable, training-based elicitation consistently elicited full performance from the sandbagging models, using only a single correct demonstration of the evaluation task. However the performance of benign models was sometimes also raised, so relying on elicitation as a detection strategy was prone to false-positives. In the short-term, we recommend developers remove potential sandbagging using on-distribution training for elicitation. In the longer-term, further research is needed to ensure the efficacy of training-based elicitation, and develop robust methods for sandbagging detection. We open source our model organisms at https://github.com/AI-Safety-Institute/sandbagging_auditing_games and select transcripts and results at https://huggingface.co/datasets/sandbagging-games/evaluation_logs . A demo illustrating the game can be played at https://sandbagging-demo.far.ai/ .
[507] Transparent and Coherent Procedural Mistake Detection
Shane Storks, Itamar Bar-Yossef, Yayuan Li, Zheyuan Zhang, Jason J. Corso, Joyce Chai
Main category: cs.AI
TL;DR: The paper extends procedural mistake detection (PMD) to require generating visual self-dialog rationales, creates a benchmark dataset using individual frames, develops automated metrics for rationale coherence using NLI, and establishes baselines showing VLMs struggle but can be improved with trade-offs.
Details
Motivation: Current PMD systems have poor performance in the wild and opaque reasoning processes. The authors want to make PMD more transparent by requiring models to generate visual self-dialog rationales that explain their decisions.Method: 1) Reformulate PMD to require generating visual self-dialog rationales; 2) Curate benchmark dataset based on individual frames; 3) Use natural language inference (NLI) model to create two automated metrics for rationale coherence; 4) Establish baselines and test improvements through inference and fine-tuning methods.
Result: VLMs struggle with PMD off-the-shelf, but their accuracy, coherence, and efficiency can be improved by incorporating the coherence metrics into inference and fine-tuning methods, though with some trade-offs. The multi-faceted metrics help visualize common outcomes and identify areas for improvement.
Conclusion: The paper successfully extends PMD to require transparent rationale generation, develops automated coherence metrics, and shows that while VLMs initially struggle, they can be improved with appropriate methods, providing a foundation for more interpretable procedural mistake detection systems.
Abstract: Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that VLMs struggle off-the-shelf, but with some trade-offs, their accuracy, coherence, and efficiency can be improved by incorporating these metrics into common inference and fine-tuning methods. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.
[508] Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration
Yijia Shao, Vinay Samuel, Yucheng Jiang, John Yang, Diyi Yang
Main category: cs.AI
TL;DR: Co-Gym is an open framework for developing and evaluating collaborative AI agents that work with humans through bidirectional communication, showing that human-agent collaboration outperforms fully autonomous agents across multiple tasks.
Details
Motivation: Many real-world tasks require human-AI collaboration due to humans' latent preferences, domain expertise, or need for control, but there's a lack of frameworks to systematically study and develop such collaborative agents.Method: Developed Collaborative Gym (Co-Gym) - an open framework with flexible non-turn-taking interaction paradigm, simulated user conditions, real-world web application, and evaluation suite assessing both collaboration outcomes and processes.
Result: Collaborative agents consistently outperformed fully autonomous counterparts: 86% win rate in Travel Planning, 74% in Tabular Analysis, 66% in Related Work. However, communication failures observed in 65% and situational awareness failures in 40% of real-world cases.
Conclusion: Co-Gym enables systematic development and evaluation of human-agent collaboration, demonstrating its benefits while revealing persistent limitations in current language models that need addressing for effective real-world deployment.
Abstract: While the advancement of large language models has spurred the development of AI agents to automate tasks, numerous use cases inherently require agents to collaborate with humans due to humans’ latent preferences, domain expertise, or the need for control. To facilitate the study of human-agent collaboration, we introduce Collaborative Gym (Co-Gym), an open framework for developing and evaluating collaborative agents that engage in bidirectional communication with humans while interacting with task environments. We describe how the framework enables the implementation of new task environments and coordination between humans and agents through a flexible, non-turn-taking interaction paradigm, along with an evaluation suite that assesses both collaboration outcomes and processes. Our framework provides both a simulated condition with a reliable user simulator and a real-world condition with an interactive web application. Initial benchmark experiments across three representative tasks – creating travel plans, writing related work sections, and analyzing tabular data – demonstrate the benefits of human-agent collaboration: The best-performing collaborative agents consistently outperform their fully autonomous counterparts in task performance, achieving win rates of 86% in Travel Planning, 74% in Tabular Analysis, and 66% in Related Work when evaluated by real users. Despite these improvements, our evaluation reveals persistent limitations in current language models and agents, with communication and situational awareness failures observed in 65% and 40% of cases in the real condition, respectively. Released under the permissive MIT license, Co-Gym supports the addition of new task environments and can be used to develop collaborative agent applications, while its evaluation suite enables assessment and improvement of collaborative agents.
[509] InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization
Yuhang Liu, Zeyu Liu, Shuanghe Zhu, Pengxiang Li, Congkai Xie, Jiasheng Wang, Xavier Hu, Xiaotian Han, Jianbo Yuan, Xinyao Wang, Shengyu Zhang, Hongxia Yang, Fei Wu
Main category: cs.AI
TL;DR: AEPO (Adaptive Exploration Policy Optimization) improves GUI grounding for MLLMs by addressing semantic alignment bottlenecks through multi-answer generation and adaptive exploration rewards.
Details
Motivation: While RLVR improves spatial alignment for GUI agents, inefficient exploration bottlenecks semantic alignment - the ability to match instructions to functionally appropriate UI elements, preventing models from learning difficult semantic associations.Method: AEPO uses multi-answer generation to enforce broader exploration, guided by an Adaptive Exploration Reward (AER) function derived from efficiency principles (η=U/C). This framework trains models to better align natural language instructions with GUI elements.
Result: InfiGUI-G1-3B and InfiGUI-G1-7B achieve new SOTA results across multiple GUI grounding benchmarks, with up to 9.0% relative improvement over RLVR baselines on generalization and semantic understanding tests.
Conclusion: AEPO effectively addresses exploration bottlenecks in semantic alignment for GUI agents, enabling better instruction grounding through theoretically guided exploration strategies.
Abstract: The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge is robustly grounding natural language instructions. This requires a precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, a correct semantic alignment, which matches the instructions to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, which prevent models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from first principles of efficiency eta=U/C. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving significant relative improvements of up to 9.0% against the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding. Resources are available at https://github.com/InfiXAI/InfiGUI-G1.
[510] Artificial Human Intelligence: The role of Humans in the Development of Next Generation AI
Suayb S. Arslan
Main category: cs.AI
TL;DR: This paper explores the symbiotic interplay between human and artificial intelligence, examining how biological principles inspire AI development while emphasizing human-centered, ethical design for next-generation systems.
Details
Motivation: The rapid interaction between human and artificial intelligence has created complex confluences that require examination. As AI systems like DeepSeek emerge, drawing inspiration from biological principles, there's a need to understand this interplay and ensure human-centered, ethical development.Method: The authors propose a novel taxonomy to explore human-machine intelligence interplay, examining how neuroscience and human cognition mechanisms inspire AI implementation. They analyze recent advances like DeepSeek through the lens of biological principles such as modular neural specialization and sparse episodic encoding.
Result: The paper identifies how biological principles address computational bottlenecks while aligning with human-inspired scalability. It establishes a framework for understanding the symbiotic relationship between human and artificial intelligence, highlighting the crucial role humans play in developing ethical, responsible, and robust systems.
Conclusion: The authors advocate for human-centered AI development that capitalizes on symbiotic designs, focusing on AI’s augmentation role rather than replacement. They propose future perspectives emphasizing ethical considerations and leave open questions for the broader community to address in this evolving field.
Abstract: Human intelligence, the most evident and accessible form of source of reasoning, hosted by biological hardware, has evolved and been refined over thousands of years, positioning itself today to create new artificial forms and preparing to self–design their evolutionary path forward. Beginning with the advent of foundation models, the rate at which human and artificial intelligence interact with each other has exceeded any anticipated quantitative figures. The close engagement led both bits of intelligence to be impacted in various ways, which naturally resulted in complex confluences that warrant close scrutiny. Recent advances, such as DeepSeek, exemplify this interplay: the novel contributions, we argue, draw indirect inspiration from biological principles like modular neural specialization and sparse episodic encoding, addressing computational bottlenecks while aligning with human-inspired scalability. In the sequel, using a novel taxonomy, we shall explore this interplay between human and machine intelligence, focusing on the crucial role humans play in developing ethical, responsible, and robust intelligent systems. We briefly delve into various aspects of implementation inspired by the mechanisms underlying neuroscience and human cognition. In addition, we propose future perspectives, capitalizing on the advantages of symbiotic designs to suggest a human-centered direction for next-generation developments, focusing on the augmentation role of AI. We finalize this evolving document with some thoughts and open questions yet to be addressed by the broader community.
[511] Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents
Zonghan Yang, Shengjie Wang, Kelin Fu, Wenyang He, Weimin Xiong, Yibo Liu, Yibo Miao, Bofei Gao, Yejie Wang, Yingwei Ma, Yanhao Li, Yue Liu, Zhenxing Hu, Kaitai Zhang, Shuyi Wang, Huarong Chen, Flood Sung, Yang Liu, Yang Gao, Zhilin Yang, Tianyu Liu
Main category: cs.AI
TL;DR: Agentless training creates skill priors that enable efficient SWE-Agent adaptation, bridging workflow and agentic approaches for better coding agents.
Details
Motivation: Current SWE approaches are split between multi-turn Agent frameworks and single-turn Agentless methods, but these paradigms could be complementary rather than mutually exclusive.Method: First curate Agentless training recipe to create Kimi-Dev (open-source SWE LLM), then use SFT adaptation on 5k public trajectories to power SWE-Agents.
Result: Kimi-Dev achieves 60.4% on SWE-bench Verified (best among workflow approaches) and powers SWE-Agents to 48.6% pass@1, matching Claude 3.5 Sonnet performance.
Conclusion: Structured skill priors from Agentless training can effectively bridge workflow and agentic frameworks, creating transferable coding agents.
Abstract: Large Language Models (LLMs) are increasingly applied to software engineering (SWE), with SWE-bench as a key benchmark. Solutions are split into SWE-Agent frameworks with multi-turn interactions and workflow-based Agentless methods with single-turn verifiable steps. We argue these paradigms are not mutually exclusive: reasoning-intensive Agentless training induces skill priors, including localization, code edit, and self-reflection that enable efficient and effective SWE-Agent adaptation. In this work, we first curate the Agentless training recipe and present Kimi-Dev, an open-source SWE LLM achieving 60.4% on SWE-bench Verified, the best among workflow approaches. With additional SFT adaptation on 5k publicly-available trajectories, Kimi-Dev powers SWE-Agents to 48.6% pass@1, on par with that of Claude 3.5 Sonnet (241022 version). These results show that structured skill priors from Agentless training can bridge workflow and agentic frameworks for transferable coding agents.
[512] PowerGraph-LLM: Novel Power Grid Graph Embedding and Optimization with Large Language Models
Fabien Bernier, Jun Cao, Maxime Cordy, Salah Ghamizi
Main category: cs.AI
TL;DR: PowerGraph-LLM is the first framework using Large Language Models (LLMs) to solve Optimal Power Flow problems by combining graph and tabular representations of power grids with specialized in-context learning and fine-tuning protocols.
Details
Motivation: There's a growing need for scalable algorithms to handle increasing variability, constraints, and uncertainties in modern power networks while providing accurate and fast solutions for Optimal Power Flow problems in operational planning and grid management.Method: Combines graph and tabular representations of power grids to effectively query LLMs, capturing complex relationships and constraints. Introduces new in-context learning and fine-tuning protocols specifically tailored for OPF problems.
Result: PowerGraph-LLM demonstrates reliable performances using off-the-shelf LLMs. The study reveals the impact of LLM architecture, size, and fine-tuning, and shows the framework’s ability to handle realistic grid components and constraints.
Conclusion: The PowerGraph-LLM framework successfully applies LLMs to OPF problems, providing a novel approach that combines graph representations with language models to address complex power system optimization challenges.
Abstract: Efficiently solving Optimal Power Flow (OPF) problems in power systems is crucial for operational planning and grid management. There is a growing need for scalable algorithms capable of handling the increasing variability, constraints, and uncertainties in modern power networks while providing accurate and fast solutions. To address this, machine learning techniques, particularly Graph Neural Networks (GNNs) have emerged as promising approaches. This letter introduces PowerGraph-LLM, the first framework explicitly designed for solving OPF problems using Large Language Models (LLMs). The proposed approach combines graph and tabular representations of power grids to effectively query LLMs, capturing the complex relationships and constraints in power systems. A new implementation of in-context learning and fine-tuning protocols for LLMs is introduced, tailored specifically for the OPF problem. PowerGraph-LLM demonstrates reliable performances using off-the-shelf LLM. Our study reveals the impact of LLM architecture, size, and fine-tuning and demonstrates our framework’s ability to handle realistic grid components and constraints.
[513] Foundation Models Knowledge Distillation For Battery Capacity Degradation Forecast
Joey Chan, Zhen Chen, Ershun Pan
Main category: cs.AI
TL;DR: A time-series foundation model (Battery-Timer) is fine-tuned for lithium-ion battery capacity degradation forecasting, achieving superior performance across distribution shifts from small cells to large-scale storage systems, with knowledge distillation enabling practical deployment.
Details
Motivation: Accurate forecasting of lithium-ion battery capacity degradation is critical for reliable and safe operation but remains challenging due to distribution shifts across scales (small cells to large-scale storage) and varying operating conditions.Method: Propose a degradation-aware fine-tuning strategy that aligns a pre-trained time-series foundation model (Timer) to capacity trajectories while retaining transferable temporal structure. Fine-tune on 220,153 cycles of charge-discharge data to create Battery-Timer. Introduce knowledge distillation to compress the foundation model’s behavior into compact expert models for practical deployment.
Result: Battery-Timer consistently outperforms specialized expert models on the CycleLife-SJTUIE dataset (real-world industrial energy-storage station data). Knowledge distillation across state-of-the-art time-series experts improves multi-condition capacity generalization while substantially reducing computational overhead.
Conclusion: The combination of a time-series foundation model with targeted knowledge distillation provides a practical path to deployable cross-scale degradation forecasting for lithium-ion batteries, addressing both accuracy and computational efficiency challenges.
Abstract: Accurate forecasting of lithium-ion battery capacity degradation is critical for reliable and safe operation, yet remains challenging under distribution shifts across scales and operating regimes. Here we investigate a time-series foundation model, that is, a large pre-trained time-series model for capacity degradation forecasting, and propose a degradation-aware fine-tuning strategy that aligns the model to capacity trajectories while retaining broadly transferable temporal structure. We instantiate this approach by fine-tuning the Timer model on 220,153 cycles of open-source charge-discharge records to obtain Battery-Timer. Using our released CycleLife-SJTUIE dataset, a real-world industrial collection from an energy-storage station with long-horizon cycling, we evaluate capacity generalization from small cells to large-scale storage systems and across varying operating conditions. Battery-Timer consistently outperforms specialized expert models. To address deployment cost, we further introduce knowledge distillation, a teacher-student transfer that compresses the foundation model’s behavior into compact expert models. Distillation across several state-of-the-art time-series experts improves multi-condition capacity generalization while substantially reducing computational overhead, indicating a practical path to deployable cross-scale degradation forecasting by combining a foundation model with targeted distillation.
[514] Is PRM Necessary? Problem-Solving RL Implicitly Induces PRM Capability in LLMs
Zhangying Feng, Qianglong Chen, Ning Lu, Yongqian Li, Siqi Cheng, Shuangmu Peng, Duyu Tang, Shengcai Liu, Zhirui Zhang
Main category: cs.AI
TL;DR: Pure RL training alone can enhance reasoning without PRMs; Self-PRM framework improves accuracy but has low precision on hard problems; PRMs may not be essential for complex reasoning.
Details
Motivation: Challenge conventional wisdom that process reward models (PRMs) are necessary for enhancing reasoning in LLMs, and investigate whether pure RL training alone can develop both problem-solving and process supervision capabilities.Method: Systematic investigation of RL-PRM relationship using DeepSeek-R1 and QwQ-32B models; propose Self-PRM framework where models autonomously evaluate and rerank solutions through self-reward mechanisms.
Result: Pure RL training develops complementary reasoning dimensions; current PRMs underperform simple baselines; Self-PRM improves accuracy but has low precision (<10%) on difficult problems; RL scaling needed for better reward alignment.
Conclusion: PRMs may not be essential for complex reasoning enhancement; pure RL improves both problem-solving and PRM capabilities; continued RL scaling needed for better introspective accuracy and reward alignment.
Abstract: The development of reasoning capabilities represents a critical frontier in large language models (LLMs) research, where reinforcement learning (RL) and process reward models (PRMs) have emerged as predominant methodological frameworks. Contrary to conventional wisdom, empirical evidence from DeepSeek-R1 demonstrates that pure RL training focused on mathematical problem-solving can progressively enhance reasoning abilities without PRM integration, challenging the perceived necessity of process supervision. In this study, we conduct a systematic investigation of the relationship between RL training and PRM capabilities. Our findings demonstrate that problem-solving proficiency and process supervision capabilities represent complementary dimensions of reasoning that co-evolve synergistically during pure RL training. In particular, current PRMs underperform simple baselines like majority voting when applied to state-of-the-art models such as DeepSeek-R1 and QwQ-32B. To address this limitation, we propose Self-PRM, an introspective framework in which models autonomously evaluate and rerank their generated solutions through self-reward mechanisms. Although Self-PRM consistently improves the accuracy of the benchmark (particularly with larger sample sizes), analysis exposes persistent challenges: The approach exhibits low precision (<10%) on difficult problems, frequently misclassifying flawed solutions as valid. These analyses underscore the need for continued RL scaling to improve reward alignment and introspective accuracy. Overall, our findings suggest that PRM may not be essential for enhancing complex reasoning, as pure RL not only improves problem-solving skills but also inherently fosters robust PRM capabilities. We hope these findings provide actionable insights for building more reliable and self-aware complex reasoning models.
[515] Mind The Gap: Quantifying Mechanistic Gaps in Algorithmic Reasoning via Neural Compilation
Lucas Saldyt, Subbarao Kambhampati
Main category: cs.AI
TL;DR: The paper investigates how neural networks learn algorithmic reasoning, comparing compiled vs learned algorithms in GNNs for BFS, DFS, and Bellman-Ford to understand expressability-trainability gaps.
Details
Motivation: To understand how neural networks learn algorithmic reasoning by examining: 1) How faithful learned algorithms are when effective, and 2) Why neural networks fail to learn effective algorithms otherwise. This is crucial for developing neural networks that robustly learn complex algorithms from data.Method: Uses neural compilation to directly encode source algorithms into GNN parameters, enabling exact computation. Compares compiled vs conventionally learned parameters, intermediate vectors, and behaviors. Focuses on GNNs for BFS, DFS, and Bellman-Ford algorithms, which represent effective, faithful, and ineffective learned algorithms respectively.
Result: The paper aims to characterize expressability-trainability gaps in learning algorithmic reasoning. The hypothesis is that inductive learning is most effective for parallel algorithms contained within the computational class NC.
Conclusion: Neural compilation provides a framework to analyze algorithmic learning in neural networks, revealing fundamental gaps between what networks can express and what they can effectively learn through training.
Abstract: This paper aims to understand how neural networks learn algorithmic reasoning by addressing two questions: How faithful are learned algorithms when they are effective, and why do neural networks fail to learn effective algorithms otherwise? To answer these questions, we use neural compilation, a technique that directly encodes a source algorithm into neural network parameters, enabling the network to compute the algorithm exactly. This enables comparison between compiled and conventionally learned parameters, intermediate vectors, and behaviors. This investigation is crucial for developing neural networks that robustly learn complexalgorithms from data. Our analysis focuses on graph neural networks (GNNs), which are naturally aligned with algorithmic reasoning tasks, specifically our choices of BFS, DFS, and Bellman-Ford, which cover the spectrum of effective, faithful, and ineffective learned algorithms. Commonly, learning algorithmic reasoning is framed as induction over synthetic data, where a parameterized model is trained on inputs, traces, and outputs produced by an underlying ground truth algorithm. In contrast, we introduce a neural compilation method for GNNs, which sets network parameters analytically, bypassing training. Focusing on GNNs leverages their alignment with algorithmic reasoning, extensive algorithmic induction literature, and the novel application of neural compilation to GNNs. Overall, this paper aims to characterize expressability-trainability gaps - a fundamental shortcoming in learning algorithmic reasoning. We hypothesize that inductive learning is most effective for parallel algorithms contained within the computational class \texttt{NC}.
[516] CoP: Agentic Red-teaming for Large Language Models using Composition of Principles
Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho
Main category: cs.AI
TL;DR: CoP framework automates red-teaming of LLMs using human-provided principles to orchestrate jailbreak strategies, achieving up to 19x improvement in attack success rates.
Details
Motivation: Jailbreak attacks that bypass LLM safety alignment are an urgent concern. Current red-teaming practices need automation and scaling to proactively identify safety risks before AI deployment.Method: Composition-of-Principles (CoP) framework: human users provide red-teaming principles as instructions to AI agents, which automatically orchestrate effective red-teaming strategies and generate jailbreak prompts.
Result: CoP reveals unprecedented safety risks, finds novel jailbreak prompts, and improves best-known single-turn attack success rate by up to 19.0 times against leading LLMs.
Conclusion: CoP provides a unified, extensible framework for automated red-teaming that significantly enhances the discovery of LLM safety vulnerabilities through principled strategy orchestration.
Abstract: Recent advances in Large Language Models (LLMs) have spurred transformative applications in various domains, ranging from open-source to proprietary LLMs. However, jailbreak attacks, which aim to break safety alignment and user compliance by tricking the target LLMs into answering harmful and risky responses, are becoming an urgent concern. The practice of red-teaming for LLMs is to proactively explore potential risks and error-prone instances before the release of frontier AI technology. This paper proposes an agentic workflow to automate and scale the red-teaming process of LLMs through the Composition-of-Principles (CoP) framework, where human users provide a set of red-teaming principles as instructions to an AI agent to automatically orchestrate effective red-teaming strategies and generate jailbreak prompts. Distinct from existing red-teaming methods, our CoP framework provides a unified and extensible framework to encompass and orchestrate human-provided red-teaming principles to enable the automated discovery of new red-teaming strategies. When tested against leading LLMs, CoP reveals unprecedented safety risks by finding novel jailbreak prompts and improving the best-known single-turn attack success rate by up to 19.0 times.
[517] Internal World Models as Imagination Networks in Cognitive Agents
Saurabh Ranjan, Brian Odegaard
Main category: cs.AI
TL;DR: Psychological network analysis reveals fundamental differences between human and LLM imagination: humans show consistent structural organization of internal world models across populations, while LLMs lack this organization despite conversational memory manipulation.
Details
Motivation: To understand the computational role of imagination and compare internal world models (IWMs) between humans and artificial intelligence systems, addressing the debate about whether imagination primarily serves reward maximization or broader functions like accessing internal representations.Method: Used psychological network analysis with imagination vividness ratings from VVIQ-2 and PSIQ questionnaires. Constructed imagination networks from three human populations (N=2,743) and six LLM variants in two conversation conditions. Analyzed centrality measures (expected influence, strength, closeness) and clustering patterns.
Result: Human imagination networks showed robust correlations across centrality measures and consistent clustering patterns across populations, indicating shared structural organization of IWMs. LLM-derived networks showed minimal clustering and weak centrality correlations, even with conversational memory manipulation. Differences persisted across environmental scenes and sensory modalities.
Conclusion: There are fundamental disparities between human and artificial world models. The network-based approach provides a quantitative framework for comparing internally-generated representations across cognitive agents, with implications for developing human-like imagination in AI systems.
Abstract: The computational role of imagination remains debated. While classical accounts emphasize reward maximization, emerging evidence suggests imagination serves a broader function: accessing internal world models (IWMs). Here, we employ psychological network analysis to compare IWMs in humans and large language models (LLMs) through imagination vividness ratings. Using the Vividness of Visual Imagery Questionnaire (VVIQ-2) and Plymouth Sensory Imagery Questionnaire (PSIQ), we construct imagination networks from three human populations (Florida, Poland, London; N=2,743) and six LLM variants in two conversation conditions. Human imagination networks demonstrate robust correlations across centrality measures (expected influence, strength, closeness) and consistent clustering patterns, indicating shared structural organization of IWMs across populations. In contrast, LLM-derived networks show minimal clustering and weak centrality correlations, even when manipulating conversational memory. These systematic differences persist across environmental scenes (VVIQ-2) and sensory modalities (PSIQ), revealing fundamental disparities between human and artificial world models. Our network-based approach provides a quantitative framework for comparing internally-generated representations across cognitive agents, with implications for developing human-like imagination in artificial intelligence systems.
[518] Matching Markets Meet LLMs: Algorithmic Reasoning with Ranked Preferences
Hadi Hosseini, Samarth Khanna, Ronak Singh
Main category: cs.AI
TL;DR: LLMs struggle with ranked preference reasoning in matching markets, especially for large instances where they fail to resolve instability despite fine-tuning improvements in small markets.
Details
Motivation: While LLMs have advanced reasoning tasks, their ability to handle ranked preferences and structured algorithms in combinatorial domains like matching markets remains underexplored, despite applications in resource allocation and ride-sharing.Method: Evaluated state-of-the-art LLMs on a hierarchy of preference-based reasoning tasks including stable-matching generation, instability detection, instability resolution, and fine-grained preference queries. Used parameter-efficient fine-tuning (LoRA) to test improvement strategies.
Result: Even top-performing LLMs struggle with instability resolution in large markets, often failing to identify blocking pairs or execute algorithms iteratively. LoRA fine-tuning significantly improves performance in small markets but fails to scale to large instances.
Conclusion: Current LLMs have logical and algorithmic limitations in handling ranked inputs for matching markets, requiring more sophisticated strategies to improve reasoning with larger-context inputs beyond simple fine-tuning approaches.
Abstract: The rise of Large Language Models (LLMs) has driven progress in reasoning tasks – from program synthesis to scientific hypothesis generation – yet their ability to handle ranked preferences and structured algorithms in combinatorial domains remains underexplored. We study matching markets, a core framework behind applications like resource allocation and ride-sharing, which require reconciling individual ranked preferences to ensure stable outcomes. We evaluate several state-of-the-art models on a hierarchy of preference-based reasoning tasks – ranging from stable-matching generation to instability detection, instability resolution, and fine-grained preference queries – to systematically expose their logical and algorithmic limitations in handling ranked inputs. Surprisingly, even top-performing models with advanced reasoning struggle to resolve instability in large markets, often failing to identify blocking pairs or execute algorithms iteratively. We further show that parameter-efficient fine-tuning (LoRA) significantly improves performance in small markets, but fails to bring about a similar improvement on large instances, suggesting the need for more sophisticated strategies to improve LLMs’ reasoning with larger-context inputs.
[519] Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, Vlad Mikulik
Main category: cs.AI
TL;DR: AI safety can be enhanced by monitoring chains of thought in language-based AI systems to detect malicious intent, though this method is imperfect and requires further research alongside existing safety approaches.
Details
Motivation: The motivation is to leverage the unique characteristic of language-based AI systems that "think" in human language to enhance AI safety. By monitoring their chains of thought, we can potentially detect intent to misbehave, offering a novel oversight approach that complements existing safety methods.Method: The method proposed is CoT (Chain of Thought) monitoring - observing and analyzing the step-by-step reasoning processes of language-based AI systems to identify potential malicious intent or misbehavior before it manifests in final outputs.
Result: The paper finds that CoT monitoring shows promise for AI safety but is imperfect, allowing some misbehavior to go unnoticed. It also identifies that CoT monitorability may be fragile and sensitive to development decisions.
Conclusion: The authors recommend further research into CoT monitorability, investment in CoT monitoring alongside existing safety methods, and that frontier model developers should consider how their development decisions impact CoT monitorability due to its potential fragility.
Abstract: AI systems that “think” in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
[520] Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping
Dena Mujtaba, Brian Hu, Anthony Hoogs, Arslan Basharat
Main category: cs.AI
TL;DR: Test-time alignment technique using model-guided policy shaping to control ethical behavior in pre-trained RL agents without retraining, evaluated on MACHIAVELLI benchmark.
Details
Motivation: AI agents trained to maximize rewards may adopt harmful behaviors, creating a trade-off between reward maximization and ethical alignment. Retraining pre-trained agents is costly, and ethical values can be diverse and conflicting.Method: Model-guided policy shaping at test time using scenario-action attribute classifiers to align decisions with ethical attributes. Works on pre-trained RL agents without retraining, allowing control over individual behavioral attributes.
Result: Test-time policy shaping effectively mitigates unethical behavior across diverse environments and alignment attributes in the MACHIAVELLI benchmark (134 text-based games). Outperforms prior training-time methods and general-purpose agents.
Conclusion: Test-time policy shaping provides an effective, scalable solution for ethical alignment of pre-trained AI agents, enabling principled trade-offs between ethical alignment and reward maximization without costly retraining.
Abstract: The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining alignment. For pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between ethical alignment and reward maximization without requiring agent retraining. We evaluate our approach using the MACHIAVELLI benchmark, which comprises 134 text-based game environments and thousands of annotated scenarios involving ethical decisions. The RL agents are first trained to maximize the reward in their respective games. At test time, we apply policy shaping via scenario-action attribute classifiers to ensure decision alignment with ethical attributes. We compare our approach against prior training-time methods and general-purpose agents, as well as study several types of ethical violations and power-seeking behavior. Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior across diverse environments and alignment attributes.
[521] The Endless Tuning. An Artificial Intelligence Design To Avoid Human Replacement and Trace Back Responsibilities
Elio Grande
Main category: cs.AI
TL;DR: The Endless Tuning is a double-mirroring AI design method that prevents human replacement and addresses responsibility gaps, tested in three decision-making applications with positive user experience results.
Details
Motivation: The paper addresses two key ethical concerns in AI deployment: avoiding human replacement and filling the responsibility gap (where no one can be held accountable for AI decisions). It aims to provide a different ethical approach to AI that emphasizes human control and accountability.Method: The method uses a double mirroring process within a relational approach. It was implemented as a protocol and tested in three prototypical applications: loan granting, pneumonia diagnosis, and art style recognition. The approach includes a reversed and hermeneutic deployment of XAI (Explainable AI) algorithms, focusing on user experience rather than statistical accuracy.
Result: Domain experts testing the applications reported feeling full control in decision-making settings despite using deep learning models. The experiments showed that a bridge can be built between accountability and liability in cases of damage, demonstrating the method’s effectiveness in addressing responsibility concerns.
Conclusion: The Endless Tuning method successfully provides a philosophical and technical framework for reliable AI deployment that maintains human control, addresses responsibility gaps, and offers a different ethical voice in AI development focused on user experience and accountability.
Abstract: The Endless Tuning is a design method for a reliable deployment of artificial intelligence based on a double mirroring process, which pursues both the goals of avoiding human replacement and filling the so-called responsibility gap (Matthias 2004). Originally depicted in (Fabris et al. 2024) and ensuing the relational approach urged therein, it was then actualized in a protocol, implemented in three prototypical applications regarding decision-making processes (respectively: loan granting, pneumonia diagnosis, and art style recognition) and tested with such as many domain experts. Step by step illustrating the protocol, giving insights concretely showing a different voice (Gilligan 1993) in the ethics of artificial intelligence, a philosophical account of technical choices (e.g., a reversed and hermeneutic deployment of XAI algorithms) will be provided in the present study together with the results of the experiments, focusing on user experience rather than statistical accuracy. Even thoroughly employing deep learning models, full control was perceived by the interviewees in the decision-making setting, while it appeared that a bridge can be built between accountability and liability in case of damage.
[522] Semantic Chain-of-Trust: Autonomous Trust Orchestration for Collaborator Selection via Hypergraph-Aided Agentic AI
Botao Zhu, Xianbin Wang, Dusit Niyato
Main category: cs.AI
TL;DR: Semantic chain-of-trust model using agentic AI and hypergraphs for intelligent collaborator selection in distributed systems
Details
Motivation: Challenges in trust evaluation for collaborator selection due to independent device operation, dynamic relationships, and complex situational impacts on trust assessmentMethod: 1) Semantic trust concept for multi-dimensional trust assessment; 2) Agentic AI on each device for autonomous operations (state detection, data collection, semantic extraction, resource evaluation); 3) Hypergraph for dynamic collaborator management and fast one-hop selection; 4) Chain formation through hypergraph for multi-hop selection
Result: Achieves 100% accuracy in trust evaluation based on historical collaborations, enabling intelligent, resource-efficient, and precise collaborator selection
Conclusion: Proposed semantic chain-of-trust model effectively addresses trust evaluation challenges in collaborative systems through semantic trust assessment, agentic AI, and hypergraph structures
Abstract: The effective completion of tasks in collaborative systems hinges on task-specific trust evaluations of potential devices for distributed collaboration. Due to independent operation of devices involved, dynamic evolution of their mutual relationships, and complex situation-related impact on trust evaluation, effectively assessing devices’ trust for collaborator selection is challenging. To overcome this challenge, we propose a semantic chain-of-trust model implemented with agentic AI and hypergraphs for supporting effective collaborator selection. We first introduce a concept of semantic trust, specifically designed to assess collaborators along multiple semantic dimensions for a more accurate representation of their trustworthiness. To facilitate intelligent evaluation, an agentic AI system is deployed on each device, empowering it to autonomously perform necessary operations, including device state detection, trust-related data collection, semantic extraction, task-specific resource evaluation, to derive a semantic trust representation for each collaborator. In addition, each device leverages a hypergraph to dynamically manage potential collaborators according to different levels of semantic trust, enabling fast one-hop collaborator selection. Furthermore, adjacent trusted devices autonomously form a chain through the hypergraph structure, supporting multi-hop collaborator selection. Experimental results demonstrate that the proposed semantic chain-of-trust achieves 100% accuracy in trust evaluation based on historical collaborations, enabling intelligent, resource-efficient, and precise collaborator selection.
[523] Implementing Cumulative Functions with Generalized Cumulative Constraints
Pierre Schaus, Charles Thomas, Roger Kameugne
Main category: cs.AI
TL;DR: Implementation of Generalized Cumulative constraint for scheduling with conditional time intervals and cumulative functions, enabling producer-consumer problems in open-source solvers.
Details
Motivation: Modern commercial CP solvers support modeling scheduling problems with conditional time intervals and cumulative functions for producer-consumer problems, but this capability is unavailable in open-source solvers and implementation details are undocumented.Method: Develop a single generic global constraint called Generalized Cumulative, and introduce a novel time-table filtering algorithm specifically designed to handle tasks defined on conditional time-intervals.
Result: Experimental results show the approach performs competitively with existing solvers for producer-consumer scheduling problems and effectively scales to large-scale problems.
Conclusion: The Generalized Cumulative constraint with novel time-table filtering provides a practical, scalable solution for modeling complex scheduling problems with conditional time intervals in open-source constraint programming solvers.
Abstract: Modeling scheduling problems with conditional time intervals and cumulative functions has become a common approach when using modern commercial constraint programming solvers. This paradigm enables the modeling of a wide range of scheduling problems, including those involving producers and consumers. However, it is unavailable in existing open-source solvers and practical implementation details remain undocumented. In this work, we present an implementation of this modeling approach using a single, generic global constraint called the Generalized Cumulative. We also introduce a novel time-table filtering algorithm specifically designed to handle tasks defined on conditional time-intervals. Experimental results demonstrate that this approach, combined with the new filtering algorithm, performs competitively with existing solvers enabling the modeling of producer and consumer scheduling problems and effectively scales to large-scale problems.
[524] FinWorld: An All-in-One Open-Source Platform for End-to-End Financial AI Research and Deployment
Wentao Zhang, Yilei Zhao, Chuqiao Zong, Xinrun Wang, Bo An
Main category: cs.AI
TL;DR: FinWorld is an open-source platform providing end-to-end support for financial AI workflows, addressing limitations in existing platforms through multimodal data integration, diverse AI paradigm support, and agent automation.
Details
Motivation: Existing financial AI platforms have limited task coverage, lack robust multimodal data integration, and offer insufficient support for LLM training and deployment, hindering comprehensive financial AI development.Method: Developed FinWorld with native integration of heterogeneous financial data, unified support for diverse AI paradigms (deep learning, reinforcement learning), and advanced agent automation. Used data from 2 markets, 4 stock pools, and over 800 million data points to conduct experiments on 4 key financial AI tasks.
Result: FinWorld significantly enhances reproducibility, supports transparent benchmarking, and streamlines deployment. The platform provides a strong foundation for future research and real-world applications in financial AI.
Conclusion: FinWorld successfully addresses current limitations in financial AI platforms by providing comprehensive end-to-end support, enabling seamless development and deployment of financial AI solutions across diverse tasks and data types.
Abstract: Financial AI holds great promise for transforming modern finance, with the potential to support a wide range of tasks such as market forecasting, portfolio management, quantitative trading, and automated analysis. However, existing platforms remain limited in task coverage, lack robust multimodal data integration, and offer insufficient support for the training and deployment of large language models (LLMs). In response to these limitations, we present FinWorld, an all-in-one open-source platform that provides end-to-end support for the entire financial AI workflow, from data acquisition to experimentation and deployment. FinWorld distinguishes itself through native integration of heterogeneous financial data, unified support for diverse AI paradigms, and advanced agent automation, enabling seamless development and deployment. Leveraging data from 2 representative markets, 4 stock pools, and over 800 million financial data points, we conduct comprehensive experiments on 4 key financial AI tasks. These experiments systematically evaluate deep learning and reinforcement learning algorithms, with particular emphasis on RL-based finetuning for LLMs and LLM Agents. The empirical results demonstrate that FinWorld significantly enhances reproducibility, supports transparent benchmarking, and streamlines deployment, thereby providing a strong foundation for future research and real-world applications. Code is available at Github~\footnote{https://github.com/DVampire/FinWorld}.
[525] MOTIF: Multi-strategy Optimization via Turn-based Interactive Framework
Nguyen Viet Tuan Kiet, Dao Van Tung, Tran Cong Dao, Huynh Thi Thanh Binh
Main category: cs.AI
TL;DR: MOTIF is a multi-agent LLM framework that uses turn-based Monte Carlo Tree Search to jointly optimize multiple interdependent solver components for combinatorial optimization problems, outperforming single-component approaches.
Details
Motivation: Current LLM approaches for combinatorial optimization solver design focus on optimizing single components (like heuristic scoring functions), missing opportunities for broader innovation through joint optimization of multiple interdependent components.Method: Proposes MOTIF (Multi-strategy Optimization via Turn-based Interactive Framework) using Monte Carlo Tree Search with two LLM agents that take turns improving different solver components, leveraging both competitive pressure and emergent cooperation through historical update tracking.
Result: MOTIF consistently outperforms state-of-the-art methods across multiple combinatorial optimization problem domains, demonstrating superior performance through diverse, high-quality solver designs.
Conclusion: Turn-based multi-agent prompting enables fully automated solver design by broadening the search landscape and discovering better solutions through structured interaction between LLM agents optimizing multiple interdependent components.
Abstract: Designing effective algorithmic components remains a fundamental obstacle in tackling NP-hard combinatorial optimization problems (COPs), where solvers often rely on carefully hand-crafted strategies. Despite recent advances in using large language models (LLMs) to synthesize high-quality components, most approaches restrict the search to a single element - commonly a heuristic scoring function - thus missing broader opportunities for innovation. In this paper, we introduce a broader formulation of solver design as a multi-strategy optimization problem, which seeks to jointly improve a set of interdependent components under a unified objective. To address this, we propose Multi-strategy Optimization via Turn-based Interactive Framework (MOTIF) - a novel framework based on Monte Carlo Tree Search that facilitates turn-based optimization between two LLM agents. At each turn, an agent improves one component by leveraging the history of both its own and its opponent’s prior updates, promoting both competitive pressure and emergent cooperation. This structured interaction broadens the search landscape and encourages the discovery of diverse, high-performing solutions. Experiments across multiple COP domains show that MOTIF consistently outperforms state-of-the-art methods, highlighting the promise of turn-based, multi-agent prompting for fully automated solver design.
[526] Hallucination as a Computational Boundary: A Hierarchy of Inevitability and the Oracle Escape
Wang Xi, Quan Shi, Zenghui Ding, Jianqing Gao, Xianjun Yang
Main category: cs.AI
TL;DR: This paper proves LLM illusions are inevitable due to computational limits, proposes RAG and continuous learning as escape routes, and introduces Computational Class Alignment for AI security.
Details
Motivation: The core motivation is to address the fundamental problem of LLM hallucinations/illusions which hinder reliable deployment, by providing formal theoretical foundations and practical solutions.Method: Formalizes LLMs as probabilistic Turing machines using a “computational necessity hierarchy”, proves illusion inevitability via diagonalization and incomputability arguments with “learner pump lemma”, models RAGs as oracle machines, and formalizes continuous learning as “internalized oracle” through neural game theory.
Result: Proves illusions are inevitable for LLMs, demonstrates RAGs provide absolute escape via “computational jumps”, shows continuous learning offers alternative escape path, and proposes Computational Class Alignment principle for AI security.
Conclusion: LLM illusions are fundamentally unavoidable but can be escaped through RAGs or continuous learning; AI security requires matching task complexity with system computational power via Computational Class Alignment.
Abstract: The illusion phenomenon of large language models (LLMs) is the core obstacle to their reliable deployment. This article formalizes the large language model as a probabilistic Turing machine by constructing a “computational necessity hierarchy”, and for the first time proves the illusions are inevitable on diagonalization, incomputability, and information theory boundaries supported by the new “learner pump lemma”. However, we propose two “escape routes”: one is to model Retrieval Enhanced Generations (RAGs) as oracle machines, proving their absolute escape through “computational jumps”, providing the first formal theory for the effectiveness of RAGs; The second is to formalize continuous learning as an “internalized oracle” mechanism and implement this path through a novel neural game theory framework. Finally, this article proposes a feasible new principle for artificial intelligence security - Computational Class Alignment (CCA), which requires strict matching between task complexity and the actual computing power of the system, providing theoretical support for the secure application of artificial intelligence.
[527] The AI Consumer Index (ACE)
Julien Benchek, Rohit Shetty, Benjamin Hunsberger, Ajay Arun, Zach Richards, Brendan Foody, Osvald Nitski, Bertie Vidgen
Main category: cs.AI
TL;DR: AI Consumer Index (ACE) benchmark evaluates frontier AI models on everyday consumer tasks, revealing significant performance gaps with GPT-5 leading at 56.1% but still falling short of consumer needs.
Details
Motivation: There's a need to assess whether frontier AI models can effectively perform real-world consumer tasks that people encounter in daily life, as current benchmarks may not capture practical consumer needs.Method: Created ACE benchmark with 400 hidden test cases across shopping, food, gaming, and DIY domains, plus 80 open-source devset cases. Evaluated 10 frontier models with websearch enabled using novel grading methodology that checks grounding in retrieved web sources.
Result: GPT-5 (Thinking = High) leads with 56.1%, followed by o3 Pro (55.2%) and GPT-5.1 (55.1%). Performance varies by domain, with shopping being particularly challenging (top model <50%). Models frequently hallucinate key information like prices.
Conclusion: There’s a substantial performance gap between even the best AI models and actual consumer needs, highlighting significant room for improvement in making AI truly useful for everyday consumer tasks.
Abstract: We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform everyday consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) at 55.2% and GPT 5.1 (Thinking = High) at 55.1%. Model scores differ across domains, and in Shopping the top model scores under 50%. We find that models are prone to hallucinating key information, such as prices. ACE shows a substantial gap between the performance of even the best models and consumers’ AI needs.
[528] Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First
Shu Liu, Soujanya Ponnapalli, Shreya Shankar, Sepanta Zeighami, Alan Zhu, Shubham Agarwal, Ruiqi Chen, Samion Suwito, Shuo Yuan, Ion Stoica, Matei Zaharia, Alvin Cheung, Natacha Crooks, Joseph E. Gonzalez, Aditya G. Parameswaran
Main category: cs.AI
TL;DR: LLM agents will dominate future data systems, requiring new architectures to handle their speculative exploration patterns.
Details
Motivation: LLM agents are becoming the primary workload for data systems, but their speculative exploration patterns (agentic speculation) create challenges due to high volume and inefficiencies that current systems aren't designed to handle.Method: The paper identifies key characteristics of agentic speculation (scale, heterogeneity, redundancy, steerability) and uses these to propose research directions for agent-first data systems, including new query interfaces, processing techniques, and memory stores.
Result: The analysis reveals that current data systems are ill-suited for agentic workloads and outlines a research agenda for developing new architectures specifically designed for LLM agent operations.
Conclusion: Data systems must evolve to natively support LLM agent workloads by developing new agent-first architectures that leverage the unique characteristics of agentic speculation.
Abstract: Large Language Model (LLM) agents, acting on their users’ behalf to manipulate and analyze data, are likely to become the dominant workload for data systems in the future. When working with data, agents employ a high-throughput process of exploration and solution formulation for the given task, one we call agentic speculation. The sheer volume and inefficiencies of agentic speculation can pose challenges for present-day data systems. We argue that data systems need to adapt to more natively support agentic workloads. We take advantage of the characteristics of agentic speculation that we identify, i.e., scale, heterogeneity, redundancy, and steerability - to outline a number of new research opportunities for a new agent-first data systems architecture, ranging from new query interfaces, to new query processing techniques, to new agentic memory stores.
[529] Semantic Faithfulness and Entropy Production Measures to Tame Your LLM Demons and Manage Hallucinations
Igor Halperin
Main category: cs.AI
TL;DR: The paper proposes two unsupervised metrics (semantic faithfulness and semantic entropy production) for evaluating LLM faithfulness using information theory and thermodynamics, treating LLMs as bipartite information engines.
Details
Motivation: Evaluating faithfulness of LLMs to a given task is complex, and there's a need for unsupervised metrics that can assess how faithfully LLMs transform context into answers without requiring labeled data.Method: Treats LLM as bipartite information engine where hidden layers act as Maxwell demon. Models QCA triplets as probability distributions over shared topics, with topic transformations from context to query and answer modeled as transition matrices. Semantic faithfulness is quantified by KL divergence between these matrices, optimized via convex optimization. Also proposes thermodynamics-based semantic entropy production metric.
Result: Demonstrates the framework on LLM summarization of corporate SEC 10-K filings, showing that high faithfulness generally implies low entropy production. The metrics can be used jointly or separately for LLM evaluation and hallucination control.
Conclusion: The proposed unsupervised SF and SEP metrics provide effective tools for evaluating LLM faithfulness using information theory and thermodynamics principles, offering practical applications for hallucination control and model evaluation.
Abstract: Evaluating faithfulness of Large Language Models (LLMs) to a given task is a complex challenge. We propose two new unsupervised metrics for faithfulness evaluation using insights from information theory and thermodynamics. Our approach treats an LLM as a bipartite information engine where hidden layers act as a Maxwell demon controlling transformations of context $C $ into answer $A$ via prompt $Q$. We model Question-Context-Answer (QCA) triplets as probability distributions over shared topics. Topic transformations from $C$ to $Q$ and $A$ are modeled as transition matrices ${\bf Q}$ and ${\bf A}$ encoding the query goal and actual result, respectively. Our semantic faithfulness (SF) metric quantifies faithfulness for any given QCA triplet by the Kullback-Leibler (KL) divergence between these matrices. Both matrices are inferred simultaneously via convex optimization of this KL divergence, and the final SF metric is obtained by mapping the minimal divergence onto the unit interval [0,1], where higher scores indicate greater faithfulness. Furthermore, we propose a thermodynamics-based semantic entropy production (SEP) metric in answer generation, and show that high faithfulness generally implies low entropy production. The SF and SEP metrics can be used jointly or separately for LLM evaluation and hallucination control. We demonstrate our framework on LLM summarization of corporate SEC 10-K filings.
[530] Hyperbolic Large Language Models
Sarang Patil, Zeyong Zhang, Yiran Huang, Tengfei Ma, Mengjia Xu
Main category: cs.AI
TL;DR: The paper surveys hyperbolic geometry-enhanced LLMs for learning hierarchical structures in non-Euclidean data like networks and linguistic trees.
Details
Motivation: LLMs struggle with non-Euclidean hierarchical data structures common in real-world applications (protein networks, transportation, financial networks, brain networks, linguistic trees). Hyperbolic geometry offers better representation for tree-like hierarchical relationships.Method: Provides taxonomy of Hyperbolic LLMs (HypLLMs) in four categories: 1) hyperbolic LLMs through exponential/logarithmic maps, 2) hyperbolic fine-tuned models, 3) fully hyperbolic LLMs, and 4) hyperbolic state-space models.
Result: Comprehensive review of recent advancements in hyperbolic geometry-enhanced LLMs, including repository of papers, models, datasets, and code implementations.
Conclusion: Hyperbolic geometry provides promising representation space for LLMs to better capture hierarchical semantic relationships in non-Euclidean data, with multiple technical approaches available and significant future research potential.
Abstract: Large language models (LLMs) have achieved remarkable success and demonstrated superior performance across various tasks, including natural language processing (NLP), weather forecasting, biological protein folding, text generation, and solving mathematical problems. However, many real-world data exhibit highly non-Euclidean latent hierarchical anatomy, such as protein networks, transportation networks, financial networks, brain networks, and linguistic structures or syntactic trees in natural languages. Effectively learning intrinsic semantic entailment and hierarchical relationships from these raw, unstructured input data using LLMs remains an underexplored area. Due to its effectiveness in modeling tree-like hierarchical structures, hyperbolic geometry – a non-Euclidean space – has rapidly gained popularity as an expressive latent representation space for complex data modeling across domains such as graphs, images, languages, and multi-modal data. Here, we provide a comprehensive and contextual exposition of recent advancements in LLMs that leverage hyperbolic geometry as a representation space to enhance semantic representation learning and multi-scale reasoning. Specifically, the paper presents a taxonomy of the principal techniques of Hyperbolic LLMs (HypLLMs) in terms of four main categories: (1) hyperbolic LLMs through exp/log maps; (2) hyperbolic fine-tuned models; (3) fully hyperbolic LLMs, and (4) hyperbolic state-space models. We also explore crucial potential applications and outline future research directions. A repository of key papers, models, datasets, and code implementations is available at https://github.com/sarangp2402/Hyperbolic-LLM-Models.
[531] Difficulty-Aware Agentic Orchestration for Query-Specific Multi-Agent Workflows
Jinwei Su, Qizhen Lan, Yinghui Xia, Lifan Sun, Weiyou Tian, Tianyu Shi, Xinyuan Song, Lewei He
Main category: cs.AI
TL;DR: DAAO is a difficulty-aware multi-agent orchestration system that dynamically generates query-specific workflows based on predicted difficulty, improving both accuracy and efficiency over static approaches.
Details
Motivation: Existing multi-agent frameworks use static or task-level workflows that either over-process simple queries or underperform on complex ones, while ignoring efficiency-performance trade-offs across heterogeneous LLMs.Method: DAAO has three modules: 1) VAE for difficulty estimation, 2) modular operator allocator, and 3) cost-performance aware LLM router. It uses a self-adjusting policy to update difficulty estimates based on workflow success.
Result: Experiments on six benchmarks show DAAO surpasses prior multi-agent systems in both accuracy and inference efficiency.
Conclusion: DAAO effectively enables adaptive, difficulty-aware reasoning by generating simpler workflows for easy queries and more complex strategies for harder ones.
Abstract: Large Language Model (LLM)-based agentic systems have shown strong capabilities across various tasks. However, existing multi-agent frameworks often rely on static or task-level workflows, which either over-process simple queries or underperform on complex ones, while also neglecting the efficiency-performance trade-offs across heterogeneous LLMs. To address these limitations, we propose Difficulty-Aware Agentic Orchestration (DAAO), which can dynamically generate query-specific multi-agent workflows guided by predicted query difficulty. DAAO comprises three interdependent modules: a variational autoencoder (VAE) for difficulty estimation, a modular operator allocator, and a cost- and performance-aware LLM router. A self-adjusting policy updates difficulty estimates based on workflow success, enabling simpler workflows for easy queries and more complex strategies for harder ones. Experiments on six benchmarks demonstrate that DAAO surpasses prior multi-agent systems in both accuracy and inference efficiency, validating its effectiveness for adaptive, difficulty-aware reasoning.
[532] RE-PO: Robust Enhanced Policy Optimization as a General Framework for LLM Alignment
Xiaoyang Cao, Zelai Xu, Mo Guang, Kaiwen Long, Michiel A. Bakker, Yu Wang, Chao Yu
Main category: cs.AI
TL;DR: RE-PO is a robust preference alignment framework that addresses label noise in human preference data by using expectation-maximization to infer label correctness and adaptively reweight training data.
Details
Motivation: Standard human preference alignment methods (like RLHF) assume clean preference data, but real-world datasets contain substantial noise from annotator mistakes, inconsistent instructions, varying expertise, and adversarial feedback, which can misguide training and degrade model performance.Method: RE-PO uses an expectation-maximization procedure to infer the posterior correctness of each label and then adaptively reweights data points in the training loss to mitigate label noise. It establishes a theoretical link between preference losses and their underlying probabilistic models, enabling systematic transformation of existing alignment algorithms into robust counterparts.
Result: Theoretically proven to recover true noise level under perfectly calibrated models. Empirically improves four state-of-the-art alignment methods (DPO, IPO, SimPO, CPO), increasing AlpacaEval 2 win rates by up to 7.0% for Mistral and Llama 3 models over their respective baselines.
Conclusion: RE-PO elevates from a single method to a general framework for robust preference alignment, effectively addressing label noise in preference datasets and consistently improving existing alignment methods.
Abstract: Standard human preference-based alignment methods, such as Reinforcement Learning from Human Feedback (RLHF), are a cornerstone for aligning large language models (LLMs) with human values. However, these methods typically assume that preference data is clean and that all labels are equally reliable. In practice, large-scale preference datasets contain substantial noise due to annotator mistakes, inconsistent instructions, varying expertise, and even adversarial or low-effort feedback. This mismatch between recorded labels and ground-truth preferences can misguide training and degrade model performance. To address this issue, we introduce Robust Enhanced Policy Optimization (RE-PO), which uses an expectation-maximization procedure to infer the posterior correctness of each label and then adaptively reweight data points in the training loss to mitigate label noise. We further generalize this idea by establishing a theoretical link between arbitrary preference losses and their underlying probabilistic models, enabling a systematic transformation of existing alignment algorithms into robust counterparts and elevating RE-PO from a single method to a general framework for robust preference alignment. Theoretically, we prove that, under a perfectly calibrated model, RE-PO recovers the true noise level of the dataset. Empirically, we show that RE-PO consistently improves four state-of-the-art alignment methods (DPO, IPO, SimPO, and CPO); when applied to Mistral and Llama 3 models, the RE-PO-enhanced variants increase AlpacaEval 2 win rates by up to 7.0 percent over their respective baselines.
[533] Saliency Guided Longitudinal Medical Visual Question Answering
Jialin Wu, Xiaofeng Liu
Main category: cs.AI
TL;DR: A saliency-guided encoder-decoder model for chest X-ray longitudinal VQA that uses keyword-conditioned Grad-CAM to enforce consistent visual attention across time points, achieving competitive performance without radiology-specific pretraining.
Details
Motivation: Longitudinal medical VQA requires analyzing changes between time points, where difference signals and consistent visual focus are more important than single-image findings. Current approaches lack mechanisms to enforce spatially consistent attention on corresponding anatomy across visits.Method: Proposes a two-step loop: 1) Extract medical keyword from answer and generate keyword-conditioned Grad-CAM saliency masks on both images; 2) Apply shared saliency masks to both time points and generate final answer. Includes lightweight affine pre-alignment to reduce nuisance motion between visits.
Result: Achieves competitive performance on Medical-Diff-VQA dataset across BLEU, ROUGE-L, CIDEr, and METEOR metrics. Notably works with general-domain pretrained backbone and decoder without radiology-specific pretraining, demonstrating practicality and transferability.
Conclusion: Saliency-conditioned generation with mild pre-alignment provides a principled framework for longitudinal reasoning in medical VQA, offering intrinsic interpretability while closing the language-vision loop to ensure medically relevant terms guide visual attention.
Abstract: Longitudinal medical visual question answering (Diff-VQA) requires comparing paired studies from different time points and answering questions about clinically meaningful changes. In this setting, the difference signal and the consistency of visual focus across time are more informative than absolute single-image findings. We propose a saliency-guided encoder-decoder for chest X-ray Diff-VQA that turns post-hoc saliency into actionable supervision. The model first performs a lightweight near-identity affine pre-alignment to reduce nuisance motion between visits. It then executes a within-epoch two-step loop: step 1 extracts a medically relevant keyword from the answer and generates keyword-conditioned Grad-CAM on both images to obtain disease-focused saliency; step 2 applies the shared saliency mask to both time points and generates the final answer. This closes the language-vision loop so that the terms that matter also guide where the model looks, enforcing spatially consistent attention on corresponding anatomy. On Medical-Diff-VQA, the approach attains competitive performance on BLEU, ROUGE-L, CIDEr, and METEOR while providing intrinsic interpretability. Notably, the backbone and decoder are general-domain pretrained without radiology-specific pretraining, highlighting practicality and transferability. These results support saliency-conditioned generation with mild pre-alignment as a principled framework for longitudinal reasoning in medical VQA.
[534] A Field Guide to Deploying AI Agents in Clinical Practice
Jack Gallifant, Katherine C. Kellogg, Matt Butler, Amanda Centi, Shan Chen, Patrick F. Doyle, Sayon Dutta, Joyce Guo, Matthew J. Hadfield, Esther H. Kim, David E. Kozono, Hugo JWL Aerts, Adam B. Landman, Raymond H. Mak, Rebecca G. Mishuris, Tanna L. Nelson, Guergana K. Savova, Elad Sharon, Benjamin C. Silverman, Umit Topaloglu, Jeremy L. Warner, Danielle S. Bitterman
Main category: cs.AI
TL;DR: A field manual for deploying LLM-based agents in healthcare, based on real-world experience with an adverse event detection system, showing that 80% of effort goes to sociotechnical implementation rather than model development.
Details
Motivation: There's a significant gap between the potential of LLMs in healthcare and their practical implementation in clinical settings. While LLMs integrated into agent-driven workflows show promise, they face challenges in real-world deployment.Method: Created a practitioner-oriented field manual based on deploying the “irAE-Agent” (an automated system for detecting immune-related adverse events from clinical notes) and conducting structured interviews with 21 clinicians, engineers, and informatics leaders.
Result: Analysis revealed critical misalignment: less than 20% of effort went to prompt engineering and model development, while over 80% was consumed by sociotechnical implementation work. This effort was distilled into five “heavy lifts”: data integration, model validation, ensuring economic value, managing system drift, and governance.
Conclusion: The field manual shifts focus from algorithmic development to essential infrastructure and implementation work needed to bridge the “valley of death” and successfully translate generative AI from pilot projects into routine clinical care, providing actionable solutions for each implementation challenge.
Abstract: Large language models (LLMs) integrated into agent-driven workflows hold immense promise for healthcare, yet a significant gap exists between their potential and practical implementation within clinical settings. To address this, we present a practitioner-oriented field manual for deploying generative agents that use electronic health record (EHR) data. This guide is informed by our experience deploying the “irAE-Agent”, an automated system to detect immune-related adverse events from clinical notes at Mass General Brigham, and by structured interviews with 21 clinicians, engineers, and informatics leaders involved in the project. Our analysis reveals a critical misalignment in clinical AI development: less than 20% of our effort was dedicated to prompt engineering and model development, while over 80% was consumed by the sociotechnical work of implementation. We distill this effort into five “heavy lifts”: data integration, model validation, ensuring economic value, managing system drift, and governance. By providing actionable solutions for each of these challenges, this field manual shifts the focus from algorithmic development to the essential infrastructure and implementation work required to bridge the “valley of death” and successfully translate generative AI from pilot projects into routine clinical care.
[535] LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild
Jiayu Wang, Yifei Ming, Riya Dulepet, Qinglin Chen, Austin Xu, Zixuan Ke, Frederic Sala, Aws Albarghouthi, Caiming Xiong, Shafiq Joty
Main category: cs.AI
TL;DR: LiveResearchBench is a new benchmark for evaluating deep research systems that produce comprehensive, citation-grounded reports from live web sources, addressing limitations of existing benchmarks through user-centric, dynamic, unambiguous, and search-intensive tasks.
Details
Motivation: Existing benchmarks for agentic systems fall short in evaluating deep research capabilities - they often focus on narrow domains, pose ambiguous questions, or don't require dynamic, real-time information from multiple web sources. There's a need for rigorous evaluation that reflects realistic information needs and requires extensive web search and synthesis.Method: The authors introduce LiveResearchBench with 100 expert-curated tasks spanning daily life, enterprise, and academia, built with over 1,500 hours of human labor. They also develop DeepEval, a comprehensive evaluation suite covering content- and report-level quality with four complementary evaluation protocols for stable assessment and high human agreement.
Result: The benchmark enables comprehensive evaluation of 17 frontier deep research systems (single-agent web search, single-agent deep research, and multi-agent systems). The analysis reveals current strengths, recurring failure modes, and identifies key system components needed for advancing reliable deep research.
Conclusion: LiveResearchBench and DeepEval provide a rigorous foundation for systematically evaluating deep research systems, addressing critical gaps in current evaluation methodologies and offering insights into advancing agentic systems for comprehensive, citation-grounded report generation from live web sources.
Abstract: Deep research – producing comprehensive, citation-grounded reports by searching and synthesizing information from hundreds of live web sources – marks an important frontier for agentic systems. To rigorously evaluate this ability, four principles are essential: tasks should be (1) user-centric, reflecting realistic information needs, (2) dynamic, requiring up-to-date information beyond parametric knowledge, (3) unambiguous, ensuring consistent interpretation across users, and (4) multi-faceted and search-intensive, requiring search over numerous web sources and in-depth analysis. Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real-time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation-grounded long-form reports, we introduce DeepEval, a comprehensive suite covering both content- and report-level quality, including coverage, presentation, citation accuracy and association, consistency and depth of analysis. DeepEval integrates four complementary evaluation protocols, each designed to ensure stable assessment and high agreement with human judgments. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of 17 frontier deep research systems, including single-agent web search, single-agent deep research, and multi-agent systems. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research. Our code is available at: https://github.com/SalesforceAIResearch/LiveResearchBench.
[536] LabOS: The AI-XR Co-Scientist That Sees and Works With Humans
Le Cong, David Smerkous, Xiaotong Wang, Di Yin, Zaixi Zhang, Ruofan Jin, Yinkai Wang, Michal Gerasimiuk, Ravi K. Dinesh, Alex Smerkous, Lihan Shi, Joy Zheng, Ian Lam, Xuekun Wu, Shilong Liu, Peishan Li, Yi Zhu, Ning Zhao, Meenal Parakh, Simran Serrao, Imran A. Mohammad, Chao-Yeh Chen, Xiufeng Xie, Tiffany Chen, David Weinstein, Greg Barbone, Belgin Caglar, John B. Sunwoo, Fuxin Li, Jia Deng, Joseph C. Wu, Sanfeng Wu, Mengdi Wang
Main category: cs.AI
TL;DR: LabOS is the first AI co-scientist system that combines computational reasoning with physical experimentation through multimodal AI, XR interfaces, and robotics to enable real-time human-AI collaboration in scientific labs.
Details
Motivation: The paper aims to bridge the gap between computational AI design and physical experimentation by creating an AI system that can actively participate in laboratory work, moving beyond just computational assistance to real-world scientific collaboration.Method: LabOS integrates multimodal perception (seeing what scientists see), self-evolving AI agents, Extended Reality (XR) interfaces through smart glasses, and robotic systems to create a unified platform where AI can understand experimental context and assist in real-time execution.
Result: The system demonstrates applications across diverse scientific domains including cancer immunotherapy target discovery, stem-cell engineering, and material science, showing that AI can effectively participate in physical laboratory environments.
Conclusion: LabOS transforms laboratories into intelligent, collaborative environments where human and machine discovery evolve together, representing a significant advancement in AI’s role in scientific research beyond computational design to active participation.
Abstract: Modern science advances fastest when thought meets action. LabOS represents the first AI co-scientist that unites computational reasoning with physical experimentation through multimodal perception, self-evolving agents, and Extended-Reality(XR)-enabled human-AI collaboration. By connecting multi-model AI agents, smart glasses, and robots, LabOS allows AI to see what scientists see, understand experimental context, and assist in real-time execution. Across applications – from cancer immunotherapy target discovery to stem-cell engineering and material science – LabOS shows that AI can move beyond computational design to participation, turning the laboratory into an intelligent, collaborative environment where human and machine discovery evolve together.
[537] Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards
Jan Niklas Groeneveld, Xi Qin, Alexander Schaefer, Yaad Oren
Main category: cs.AI
TL;DR: Small language models (Phi-4 family) can be effectively turned into reward models for code generation by combining process and outcome rewards, achieving over 20% improvement in selecting the most accurate code from multiple generations.
Details
Motivation: Generating high-quality code remains challenging for LLMs, and reward models are needed as an intermediate step for reasoning model evolution. While reflection capabilities typically increase with model size, the authors want to investigate whether state-of-the-art small language models like Phi-4 can be turned into usable reward models that blend process and outcome rewards.Method: Constructed a dataset of code samples with correctness labels from the APPS coding challenge benchmark. Trained a value-head model (decoder-only transformer with regression layer) to estimate success probability of intermediate outputs through supervised fine-tuning.
Result: Small LLMs are capable of serving as effective reward models or code evaluation critics, successfully identifying correct solutions among multiple candidates. Using this critic achieved over 20% improvement in search capability for selecting the most accurate code out of multiple generations.
Conclusion: Small language models can be effectively transformed into usable reward models for code generation tasks, blending process and outcome rewards to significantly improve code selection accuracy, demonstrating that model size isn’t the only factor for effective reward modeling.
Abstract: Generating high-quality code remains a challenge for Large Language Models (LLMs). For the evolution of reasoning models on this task, reward models are a necessary intermediate step. These models judge outcomes or intermediate steps. Decoder-only transformer models can be turned into reward models by introducing a regression layer and supervised fine-tuning. While it is known that reflection capabilities generally increase with the size of a model, we want to investigate whether state-of-the-art small language models like the Phi-4 family can be turned into usable reward models blending the consideration of process rewards and outcome rewards. Targeting this goal, we construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. We then train a value-head model to estimate the success probability of intermediate outputs. Our evaluation shows that small LLMs are capable of serving as effective reward models or code evaluation critics, successfully identifying correct solutions among multiple candidates. Using this critic, we achieve over a 20% improvement in the search capability of the most accurate code out of multiple generations.
[538] The Impact of Off-Policy Training Data on Probe Generalisation
Nathalie Kirch, Samuel Dower, Adrians Skapars, Ekdeep Singh Lubana, Dmitrii Krasheninnikov
Main category: cs.AI
TL;DR: Probe performance for monitoring LLM behaviors degrades when trained on synthetic/off-policy data, with domain shifts causing larger issues than policy shifts. Deception and Sandbagging probes may fail to generalize to real monitoring scenarios.
Details
Motivation: Probing is promising for monitoring LLM behaviors like deception and sycophancy, but natural examples are rare, forcing reliance on synthetic/off-policy data. Need to understand how this affects probe generalization to real monitoring scenarios.Method: Systematically evaluated how synthetic and off-policy data influence probe generalization across 8 LLM behaviors. Tested linear and attention probes across multiple LLMs, comparing performance with different response generation strategies and data domains.
Result: Response generation strategy significantly affects probe performance, with magnitude varying by behavior. Generalization from off-policy to incentivized responses predicts generalization to on-policy data. Domain shifts cause larger performance degradation than policy shifts. Deception and Sandbagging probes may fail to generalize to real monitoring scenarios.
Conclusion: In absence of on-policy data, same-domain off-policy data yields more reliable probes than on-policy data from different domains. Need methods to better handle distribution shifts in LLM monitoring, as current probes trained on synthetic/off-policy data may not generalize well to real scenarios.
Abstract: Probing has emerged as a promising method for monitoring large language models (LLMs), enabling inference-time detection of concerning behaviours such as deception and sycophancy. However, natural examples of many behaviours are rare, forcing researchers to rely on synthetic or off-policy LLM responses for training probes. We systematically evaluate how the use of synthetic and off-policy data influences probe generalisation across eight distinct LLM behaviours. Testing linear and attention probes across multiple LLMs, we find that the response generation strategy can significantly affect probe performance, though the magnitude of this effect greatly varies by behaviour. We find that successful generalisation from off-policy responses to incentivised responses (e.g. those where the behaviour is advantageous) is predictive of successful generalisation to on-policy data. Leveraging this result, we predict that Deception and Sandbagging probes may fail to generalise from off-policy to on-policy data when used in real monitoring scenarios. We also find that shifts in the training data domain cause even larger performance degradation than off-to-on-policy shift, with different-domain test scores being consistently lower than the same-domain ones. In the absence of on-policy data, using same-domain off-policy data appears to yield more reliable probes than using on-policy data from a different domain. Still, we emphasise the need for methods that can better handle distribution shifts in LLM monitoring.
[539] ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning
Jinpeng Wang, Chao Li, Ting Ye, Mengyuan Zhang, Wei Liu, Jian Luan
Main category: cs.AI
TL;DR: ICPO method improves RLVR by using LLM’s own response probabilities to calculate preference advantage scores, addressing issues like coarse-grained rewards and reward noise in reasoning tasks.
Details
Motivation: Existing RLVR methods for LLMs suffer from coarse-grained rewards, reward noise, and inefficient exploration, leading to unstable training and entropy collapse, which limits reasoning enhancement.Method: ICPO calculates preference advantage scores by comparing relative generation probabilities of multiple responses under the same prompt, integrating these scores with verifiable rewards to guide exploration.
Result: ICPO alleviates coarse-grained rewards and reward noise, curbs overconfident errors, enhances undervalued high-quality responses, prevents overfitting, and facilitates thorough exploration.
Conclusion: Comprehensive experiments across seven benchmarks demonstrate ICPO steadily boosts reasoning performance compared to GRPO, showing the effectiveness of intrinsic confidence-driven relative preference optimization.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates significant potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing RLVR methods are often constrained by issues such as coarse-grained rewards, reward noise, and inefficient exploration, which lead to unstable training and entropy collapse. To address this challenge, we propose the Intrinsic Confidence-Driven Group Relative Preference Optimization method (ICPO). The intuition behind it lies in the fact that the probabilities of an LLM generating different responses can inherently and directly reflect its self-assessment of the reasoning process. Inspired by the idea of preference modeling, ICPO calculates a preference advantage score for each response by comparing the relative generation probabilities of multiple responses under the same input prompt, and integrates this score with verifiable rewards to guide the exploration process. We have discovered that the preference advantage score not only alleviates the issues of coarse-grained rewards and reward noise but also effectively curbs overconfident errors, enhances the relative superiority of undervalued high-quality responses, and prevents the model from overfitting to specific strategies, thereby facilitating more thorough exploration. Comprehensive experiments across four general-domain benchmarks and three mathematical benchmarks demonstrate that ICPO steadily boosts reasoning compared to GRPO.
[540] ChartAnchor: Chart Grounding with Structural-Semantic Fidelity
Xinhang Li, Jingbo Zhou, Pengfei Luo, Yixiong Xiao, Tong Xu
Main category: cs.AI
TL;DR: ChartAnchor is a comprehensive benchmark for evaluating multimodal LLMs on chart grounding tasks, featuring 8k+ chart-table-code triples across 30 chart types with multi-level evaluation framework.
Details
Motivation: Existing benchmarks fail to holistically assess chart grounding due to narrow chart diversity, isolated tasks, and incomplete evaluation frameworks. Chart grounding is crucial for evaluating MLLMs' capabilities in numerical reasoning, multimodal alignment, and structural reconstruction.Method: Proposes ChartAnchor benchmark with 8k+ chart-table-code triples spanning 30 chart types from diverse real-world and augmented sources. Introduces two complementary tasks: chart-to-code generation (synthesizing executable code to replicate charts) and controlled chart-to-table reconstruction (extracting exact data with predefined headers). Uses multi-level evaluation framework integrating semantic validation, stylistic analysis, and perceptual metrics.
Result: Extensive experiments on MLLMs reveal critical limitations in numerical precision and code synthesis, emphasizing the need for structured reasoning beyond surface-level perception.
Conclusion: ChartAnchor establishes a rigorous foundation for chart grounding by unifying symbolic and data-driven grounding, offering meaningful insights for advancing MLLMs in scientific, financial, and industrial domains.
Abstract: Recent advances in multimodal large language models (MLLMs) highlight the need for benchmarks that rigorously evaluate structured chart comprehension. Chart grounding refers to the bidirectional alignment between a chart’s visual appearance and the structured semantics. This task requires models to produce a symbolic specification that faithfully captures the chart’s visual and structural intent, while also recovering the underlying tabular data with precise values and relationships. Chart grounding directly reflects a model’s capabilities in numerical reasoning, multimodal alignment, and structural reconstruction, and has several important applications in real-world scenarios. Existing benchmarks, constrained by narrow chart diversity, isolated tasks, and incomplete evaluation frameworks, fail to holistically assess grounding. To address this, we propose ChartAnchor, a comprehensive benchmark of 8k+ chart-table-code triples spanning 30 chart types drawn from diverse real-world and augmented sources. ChartAnchor introduces two complementary tasks: chart-to-code generation (synthesizing executable code to replicate charts) and controlled chart-to-table reconstruction (extracting exact data with predefined headers), enabling cross-validation of visual and numerical fidelity. A multi-level evaluation framework integrates semantic validation, stylistic analysis, and perceptual metrics to assess both structural and content-level correctness. Extensive experiments on MLLMs reveal critical limitations in numerical precision and code synthesis, emphasizing the need for structured reasoning beyond surface-level perception. By unifying symbolic and data-driven grounding, ChartAnchor establishes a rigorous foundation for chart grounding, offering meaningful insights for advancing MLLMs in scientific, financial, and industrial domains.
[541] Unsupervised decoding of encoded reasoning using language model interpretability
Ching Fang, Samuel Marks
Main category: cs.AI
TL;DR: Logit lens can decode ROT-13 encoded chain-of-thought reasoning in LLMs, with unsupervised pipeline reconstructing reasoning transcripts from internal activations.
Details
Motivation: To investigate whether current interpretability techniques can penetrate encoded reasoning processes in LLMs, addressing concerns about hidden reasoning that evades human oversight as models become more capable.Method: Fine-tuned DeepSeek-R1-Distill-Llama-70B to perform chain-of-thought reasoning in ROT-13 encryption while maintaining English outputs, then evaluated logit lens analysis on internal activations, developing an unsupervised decoding pipeline combining logit lens with automated paraphrasing.
Result: Logit lens effectively translates encoded reasoning with accuracy peaking in intermediate-to-late layers. The unsupervised pipeline achieves substantial accuracy in reconstructing complete reasoning transcripts from internal representations.
Conclusion: Current mechanistic interpretability techniques may be more robust to simple forms of encoded reasoning than previously thought, providing a framework for evaluating interpretability against non-human-readable reasoning formats to maintain oversight over AI systems.
Abstract: As large language models become increasingly capable, there is growing concern that they may develop reasoning processes that are encoded or hidden from human oversight. To investigate whether current interpretability techniques can penetrate such encoded reasoning, we construct a controlled testbed by fine-tuning a reasoning model (DeepSeek-R1-Distill-Llama-70B) to perform chain-of-thought reasoning in ROT-13 encryption while maintaining intelligible English outputs. We evaluate mechanistic interpretability methods–in particular, logit lens analysis–on their ability to decode the model’s hidden reasoning process using only internal activations. We show that logit lens can effectively translate encoded reasoning, with accuracy peaking in intermediate-to-late layers. Finally, we develop a fully unsupervised decoding pipeline that combines logit lens with automated paraphrasing, achieving substantial accuracy in reconstructing complete reasoning transcripts from internal model representations. These findings suggest that current mechanistic interpretability techniques may be more robust to simple forms of encoded reasoning than previously understood. Our work provides an initial framework for evaluating interpretability methods against models that reason in non-human-readable formats, contributing to the broader challenge of maintaining oversight over increasingly capable AI systems.
[542] Multi-Path Collaborative Reasoning via Reinforcement Learning
Jindi Lv, Yuhao Zhou, Zheng Zhu, Xiaofeng Wang, Guan Huang, Jiancheng Lv
Main category: cs.AI
TL;DR: M3PO introduces a reinforcement learning framework that uses parallel policy rollouts as diverse reasoning sources with cross-path interactions to improve LLM reasoning beyond deterministic Chain-of-Thought approaches.
Details
Motivation: Conventional CoT reasoning exhibits internal determinism during decoding, limiting exploration of plausible alternatives. Recent methods using soft abstract tokens remain constrained by greedy autoregressive decoding that isolates models from alternative reasoning possibilities.Method: Multi-Path Perception Policy Optimization (M3PO) leverages parallel policy rollouts as naturally diverse reasoning sources and integrates cross-path interactions into policy updates through a lightweight collaborative mechanism, allowing each trajectory to refine its reasoning with peer feedback.
Result: M3PO achieves state-of-the-art performance on both knowledge- and reasoning-intensive benchmarks. Models trained with M3PO maintain interpretability and inference efficiency.
Conclusion: Multi-path collaborative learning shows promise for robust reasoning by cultivating more reliable multi-step reasoning patterns through collective insights and peer feedback.
Abstract: Chain-of-Thought (CoT) reasoning has significantly advanced the problem-solving capabilities of Large Language Models (LLMs), yet conventional CoT often exhibits internal determinism during decoding, limiting exploration of plausible alternatives. Recent methods attempt to address this by generating soft abstract tokens to enable reasoning in a continuous semantic space. However, we find that such approaches remain constrained by the greedy nature of autoregressive decoding, which fundamentally isolates the model from alternative reasoning possibilities. In this work, we propose Multi-Path Perception Policy Optimization (M3PO), a novel reinforcement learning framework that explicitly injects collective insights into the reasoning process. M3PO leverages parallel policy rollouts as naturally diverse reasoning sources and integrates cross-path interactions into policy updates through a lightweight collaborative mechanism. This design allows each trajectory to refine its reasoning with peer feedback, thereby cultivating more reliable multi-step reasoning patterns. Empirical results show that M3PO achieves state-of-the-art performance on both knowledge- and reasoning-intensive benchmarks. Models trained with M3PO maintain interpretability and inference efficiency, underscoring the promise of multi-path collaborative learning for robust reasoning.
[543] Beyond the Black Box: A Cognitive Architecture for Explainable and Aligned AI
Hu Keyi
Main category: cs.AI
TL;DR: Weight-Calculatism is a novel cognitive architecture using Logical Atoms and Weight-Calculation (Weight = Benefit * Probability) for explainable, aligned AGI with transparent reasoning.
Details
Motivation: Current AI paradigms lack explainability and value alignment. The paper aims to address these fundamental challenges by creating a cognitive architecture grounded in first principles that enables transparent, human-like reasoning and trustworthy AGI development.Method: Deconstructs cognition into indivisible Logical Atoms and two fundamental operations (Pointing and Comparison). Formalizes decision-making through an interpretable Weight-Calculation model (Weight = Benefit * Probability) with traceable Initial Weights. Implements via graph-algorithm-based computational engine and global workspace workflow.
Result: The architecture achieves transparent, human-like reasoning and robust learning in unprecedented scenarios. Preliminary implementation and scenario validation demonstrate its potential as a viable pathway toward AGI with radical explainability and traceable value alignment.
Conclusion: Weight-Calculatism establishes both practical and theoretical foundation for building trustworthy and aligned AGI, offering a novel approach that addresses fundamental challenges of explainability and value alignment in current AI systems.
Abstract: Current AI paradigms, as “architects of experience,” face fundamental challenges in explainability and value alignment. This paper introduces “Weight-Calculatism,” a novel cognitive architecture grounded in first principles, and demonstrates its potential as a viable pathway toward Artificial General Intelligence (AGI). The architecture deconstructs cognition into indivisible Logical Atoms and two fundamental operations: Pointing and Comparison. Decision-making is formalized through an interpretable Weight-Calculation model (Weight = Benefit * Probability), where all values are traceable to an auditable set of Initial Weights. This atomic decomposition enables radical explainability, intrinsic generality for novel situations, and traceable value alignment. We detail its implementation via a graph-algorithm-based computational engine and a global workspace workflow, supported by a preliminary code implementation and scenario validation. Results indicate that the architecture achieves transparent, human-like reasoning and robust learning in unprecedented scenarios, establishing a practical and theoretical foundation for building trustworthy and aligned AGI.
[544] DataGovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows
Zhou Liu, Zhaoyang Han, Guochen Yan, Hao Liang, Bohan Zeng, Xing Chen, Yuanfeng Song, Wentao Zhang
Main category: cs.AI
TL;DR: DataGovBench benchmark for automated data governance tasks shows current LLMs struggle with complex workflows; DataGovAgent framework improves performance significantly.
Details
Motivation: Existing benchmarks for automated data science focus on snippet-level coding or high-level analytics, but fail to capture the unique challenges of data governance which requires ensuring correctness and quality of data itself.Method: Introduced DataGovBench with 150 diverse tasks from real-world scenarios using “reversed-objective” methodology to synthesize realistic noise. Proposed DataGovAgent framework with Planner-Executor-Evaluator architecture integrating constraint-based planning, retrieval-augmented generation, and sandboxed feedback-driven debugging.
Result: Current models struggle with complex multi-step workflows and lack robust error-correction. DataGovAgent boosts Average Task Score on complex tasks from 39.7 to 54.9 and reduces debugging iterations by over 77.9% compared to general-purpose baselines.
Conclusion: Data governance automation requires specialized approaches beyond general-purpose LLMs; the DataGovAgent framework demonstrates significant improvements in handling complex data governance workflows through its structured architecture.
Abstract: Data governance ensures data quality, security, and compliance through policies and standards, a critical foundation for scaling modern AI development. Recently, large language models (LLMs) have emerged as a promising solution for automating data governance by translating user intent into executable transformation code. However, existing benchmarks for automated data science often emphasize snippet-level coding or high-level analytics, failing to capture the unique challenge of data governance: ensuring the correctness and quality of the data itself. To bridge this gap, we introduce DataGovBench, a benchmark featuring 150 diverse tasks grounded in real-world scenarios, built on data from actual cases. DataGovBench employs a novel “reversed-objective” methodology to synthesize realistic noise and utilizes rigorous metrics to assess end-to-end pipeline reliability. Our analysis on DataGovBench reveals that current models struggle with complex, multi-step workflows and lack robust error-correction mechanisms. Consequently, we propose DataGovAgent, a framework utilizing a Planner-Executor-Evaluator architecture that integrates constraint-based planning, retrieval-augmented generation, and sandboxed feedback-driven debugging. Experimental results show that DataGovAgent significantly boosts the Average Task Score (ATS) on complex tasks from 39.7 to 54.9 and reduces debugging iterations by over 77.9 percent compared to general-purpose baselines.
[545] AI-Assisted Game Management Decisions: A Fuzzy Logic Approach to Real-Time Soccer Substitutions
Pedro Passos
Main category: cs.AI
TL;DR: A Fuzzy Logic DSS for real-time soccer substitutions that objectively evaluates player performance using role-aware metrics, fatigue, and disciplinary risk to identify optimal substitution timing, validated on World Cup match data.
Details
Motivation: Current soccer substitution decisions rely too heavily on intuition or predictive models that replicate historical biases rather than providing objective, real-time tactical guidance. There's a need for transparent, explainable systems that can optimize substitution timing by overcoming the limitations of black-box machine learning approaches.Method: Developed a Fuzzy Logic Decision Support System with three key innovations: 1) Reformulated PlayeRank metric into Cumulative Mean with Role-Aware Normalization to eliminate play-time bias, 2) Integrated physiological proxies (fatigue) and contextual variables (disciplinary risk modulated by tactical role), 3) Calculated dynamic Substitution Priority (P_final) for real-time decision making.
Result: Validated on 2018 Brazil-Belgium World Cup match: System aligned with expert consensus on executed substitutions (Gabriel Jesus) and identified critical risks missed by human decision-makers, including the “FAGNER Paradox” (defensive risk minutes before yellow card) and “Lukaku Paradox” (assist masking participation drop).
Conclusion: Fuzzy Logic provides a transparent, explainable alternative to black-box models for real-time tactical decisions, demonstrating ecological validity and superior risk identification compared to human intuition alone.
Abstract: In elite soccer, substitution decisions entail significant financial and sporting consequences yet remain heavily reliant on intuition or predictive models that merely mimic historical biases. This paper introduces a Fuzzy Logic based Decision Support System (DSS) designed for real time, prescriptive game management. Unlike traditional Machine Learning approaches that encounter a predictive ceiling by attempting to replicate human behavior, our system audits performance through an objective, rule based inference engine. We propose a methodological advancement by reformulating the PlayeRank metric into a Cumulative Mean with Role Aware Normalization, eliminating the play time exposure bias inherent in cumulative sum models to enable accurate intra match comparison. The system integrates this refined metric with physiological proxies (fatigue) and contextual variables (disciplinary risk modulated by tactical role) to calculate a dynamic Substitution Priority (P final). Validation via a case study of the 2018 FIFA World Cup match between Brazil and Belgium demonstrates the system’s ecological validity: it not only aligned with expert consensus on executed substitutions (for example Gabriel Jesus) but, crucially, identified high risk scenarios ignored by human decision makers. Specifically, the model flagged the “FAGNER Paradox” - a maximum priority defensive risk - minutes before a critical yellow card, and detected the “Lukaku Paradox”, where an isolated assist masked a severe drop in participation. These results confirm that Fuzzy Logic offers a transparent, explainable, and superior alternative to black box models for optimizing real time tactical decisions.
[546] Model-Based and Sample-Efficient AI-Assisted Math Discovery in Sphere Packing
Rasul Tutunov, Alexandre Maraval, Antoine Grosnit, Xihan Li, Jun Wang, Haitham Bou-Ammar
Main category: cs.AI
TL;DR: AI-driven model-based search discovers new state-of-the-art upper bounds for sphere packing in dimensions 4-16 by formulating SDP construction as a sequential decision game.
Details
Motivation: Sphere packing (Hilbert's 18th problem) remains largely unsolved despite its importance in cryptography, crystallography, and medical imaging. Traditional approaches using semidefinite programs (SDPs) are computationally intensive, with each candidate SDP taking days to evaluate, making standard AI methods infeasible.Method: Formulate SDP construction as a sequential decision process (SDP game) where a policy assembles SDP formulations from admissible components. Use sample-efficient model-based framework combining Bayesian optimization with Monte Carlo Tree Search to navigate the search space.
Result: Achieved new state-of-the-art upper bounds for sphere packing in dimensions 4-16, demonstrating tangible progress on this longstanding geometric problem.
Conclusion: Model-based search can advance computational progress on mathematically rigid, evaluation-limited problems, offering a complementary direction for AI-assisted discovery beyond large-scale LLM-driven exploration.
Abstract: Sphere packing, Hilbert’s eighteenth problem, asks for the densest arrangement of congruent spheres in n-dimensional Euclidean space. Although relevant to areas such as cryptography, crystallography, and medical imaging, the problem remains unresolved: beyond a few special dimensions, neither optimal packings nor tight upper bounds are known. Even a major breakthrough in dimension $n=8$, later recognised with a Fields Medal, underscores its difficulty. A leading technique for upper bounds, the three-point method, reduces the problem to solving large, high-precision semidefinite programs (SDPs). Because each candidate SDP may take days to evaluate, standard data-intensive AI approaches are infeasible. We address this challenge by formulating SDP construction as a sequential decision process, the SDP game, in which a policy assembles SDP formulations from a set of admissible components. Using a sample-efficient model-based framework that combines Bayesian optimisation with Monte Carlo Tree Search, we obtain new state-of-the-art upper bounds in dimensions $4-16$, showing that model-based search can advance computational progress in longstanding geometric problems. Together, these results demonstrate that sample-efficient, model-based search can make tangible progress on mathematically rigid, evaluation limited problems, pointing towards a complementary direction for AI-assisted discovery beyond large-scale LLM-driven exploration.
cs.SD
[547] DreamFoley: Scalable VLMs for High-Fidelity Video-to-Audio Generation
Fu Li, Weichao Zhao, You Li, Zhichao Zhou, Dongliang He
Main category: cs.SD
TL;DR: DreamFoley introduces an autoregressive audio generation architecture using large vision-language models to generate synchronized audio for videos, featuring dual-visual encoders, RVQ audio tokenizer with delay patterns, and classifier-free guidance.
Details
Motivation: Existing video generation methods lack synchronized audio, which undermines immersive experience and restricts practical applications. While some works have explored audio generation, they don't leverage the full potential of multimodal modeling.Method: Autoregressive architecture using large VLMs to model sequential interactions among video, audio, and text. Features: 1) Dual-visual encoder for audio-aligned and text-aligned features, 2) RVQ audio tokenizer with delay-pattern generation for efficiency-quality trade-off, 3) Classifier-free guidance in VLMs, 4) Efficient data pipeline for audio-video-text triples.
Result: Achieves promising performance across popular benchmarks. Releases previously missing audio-visual textual descriptions from public benchmarks to facilitate future research evaluation and comparison.
Conclusion: DreamFoley provides a strong foundation for video-to-audio generation research by effectively leveraging multimodal modeling capabilities of VLMs. The released benchmark data will help advance the field through better evaluation.
Abstract: Recent advances in video generation have achieved remarkable improvements in visual content fidelity. However, the absence of synchronized audio severely undermines immersive experience and restricts practical applications of these technologies. To address this challenge, several pioneering works have explored diffusion transformer architectures for generating plausible video-synchronized audio, including Kling-foley, HunyuanVideo-foley and Thinksound. Distinct from existing works, we introduce an autoregressive audio generation architecture (DreamFoley) that harnesses the capabilities of large vision-language models (VLMs) to jointly model sequential interactions among video, audio, and text modalities. Our approach features a dual-visual encoder module that effectively captures both audio-aligned and text-aligned visual features. Additionally, we employ a Residual Vector Quantization audio tokenizer with a delay-pattern generation scheme to balance the trade-off between training efficiency and audio quality. Moreover, we introduce the classifier-free guidance strategy into VLMs to bootstrap generated audio quality. Furthermore, we establish an efficient data production pipeline to scale audio-video-text triple collection. Finally, extensive experiments are conducted to validate the effectiveness of our model, achieving promising performance across popular benchmarks. We hope that the findings in this study provide a strong foundation for future video-to-audio generation research. We also release the previously missing audio-visual textual descriptions from the public benchmark, aiming to facilitate subsequent researchers in conducting more convenient and effective evaluations and comparisons.
[548] Physics-Guided Deepfake Detection for Voice Authentication Systems
Alireza Mohammadi, Keshav Sood, Dhananjay Thiruvady, Asef Nazari
Main category: cs.SD
TL;DR: A framework combining physics-guided deepfake detection with uncertainty-aware edge learning to protect voice authentication systems from deepfake synthesis attacks and federated learning poisoning.
Details
Motivation: Voice authentication systems at network edges face dual threats: sophisticated deepfake synthesis attacks and control-plane poisoning in distributed federated learning protocols. Existing systems need protection against both advanced audio deepfakes and adversarial attacks on the learning process itself.Method: The framework fuses interpretable physics features modeling vocal tract dynamics with self-supervised learning representations. These are processed through a Multi-Modal Ensemble Architecture followed by a Bayesian ensemble for uncertainty estimates. The approach incorporates physics-based characteristics evaluations and uncertainty estimates of audio samples.
Result: The proposed framework achieves robustness against both advanced deepfake attacks and sophisticated control-plane poisoning, addressing the complete threat model for networked voice authentication systems.
Conclusion: By coupling physics-guided deepfake detection with uncertainty-aware edge learning, the framework provides comprehensive protection for voice authentication systems deployed at network edges against the dual threats of deepfake synthesis and federated learning poisoning attacks.
Abstract: Voice authentication systems deployed at the network edge face dual threats: a) sophisticated deepfake synthesis attacks and b) control-plane poisoning in distributed federated learning protocols. We present a framework coupling physics-guided deepfake detection with uncertainty-aware in edge learning. The framework fuses interpretable physics features modeling vocal tract dynamics with representations coming from a self-supervised learning module. The representations are then processed via a Multi-Modal Ensemble Architecture, followed by a Bayesian ensemble providing uncertainty estimates. Incorporating physics-based characteristics evaluations and uncertainty estimates of audio samples allows our proposed framework to remain robust to both advanced deepfake attacks and sophisticated control-plane poisoning, addressing the complete threat model for networked voice authentication.
[549] Technical Report of Nomi Team in the Environmental Sound Deepfake Detection Challenge 2026
Candy Olivia Mawalim, Haotian Zhang, Shogo Okada
Main category: cs.SD
TL;DR: The paper presents an audio-text cross-attention model for environmental sound deepfake detection, achieving competitive EER improvements over baseline in the ICASSP 2026 ESDD Challenge.
Details
Motivation: To address the challenges of unseen generators and low-resource black-box scenarios in environmental sound deepfake detection, which are key problems in the ICASSP 2026 ESDD Challenge.Method: Proposes an audio-text cross-attention model that leverages both audio and text modalities. Experiments include individual and combined text-audio models.
Result: Demonstrates competitive Equal Error Rate (EER) improvements over the challenge baseline (BEATs+AASIST model) on the EnvSDD dataset.
Conclusion: The audio-text cross-attention approach effectively addresses unseen generator and low-resource black-box challenges in environmental sound deepfake detection, showing promising results for the ESDD Challenge.
Abstract: This paper presents our work for the ICASSP 2026 Environmental Sound Deepfake Detection (ESDD) Challenge. The challenge is based on the large-scale EnvSDD dataset that consists of various synthetic environmental sounds. We focus on addressing the complexities of unseen generators and low-resource black-box scenarios by proposing an audio-text cross-attention model. Experiments with individual and combined text-audio models demonstrate competitive EER improvements over the challenge baseline (BEATs+AASIST model).
[550] Who Will Top the Charts? Multimodal Music Popularity Prediction via Adaptive Fusion of Modality Experts and Temporal Engagement Modeling
Yash Choudhary, Preeti Rao, Pushpak Bhattacharyya
Main category: cs.SD
TL;DR: GAMENet is a multimodal deep learning architecture that predicts music popularity by integrating audio, lyrics, and social metadata through adaptive gating, achieving significant improvements over existing methods.
Details
Motivation: Predicting song commercial success before release is critical for the music industry but existing methods have limitations: they average away temporal dynamics in audio/lyrics, treat lyrics as bag-of-words ignoring structure, ignore historical performance data, and use simple feature concatenation that results in poor multimodal alignment.Method: GAMENet uses modality-specific experts for audio (processed via OnionEnsembleAENet), lyrics (via large language model embeddings), and social metadata (new Career Trajectory Dynamics features capturing artist career momentum). These are integrated through an adaptive gating mechanism rather than simple concatenation.
Result: On Music4All dataset (113k tracks), GAMENet achieves 12% improvement in R² over direct feature concatenation. Spotify audio alone yields R²=0.13, adding CTD features increases to 0.69, with additional 7% gain from temporal CTD features. On SpotGenTrack dataset (100k tracks), achieves 16% improvement over previous baseline.
Conclusion: GAMENet effectively addresses key limitations in music popularity prediction by preserving temporal dynamics, capturing lyrical semantics, incorporating historical performance, and using sophisticated multimodal fusion, demonstrating significant performance gains across multiple datasets.
Abstract: Predicting a song’s commercial success prior to its release remains an open and critical research challenge for the music industry. Early prediction of music popularity informs strategic decisions, creative planning, and marketing. Existing methods suffer from four limitations:(i) temporal dynamics in audio and lyrics are averaged away; (ii) lyrics are represented as a bag of words, disregarding compositional structure and affective semantics; (iii) artist- and song-level historical performance is ignored; and (iv) multimodal fusion approaches rely on simple feature concatenation, resulting in poorly aligned shared representations. To address these limitations, we introduce GAMENet, an end-to-end multimodal deep learning architecture for music popularity prediction. GAMENet integrates modality-specific experts for audio, lyrics, and social metadata through an adaptive gating mechanism. We use audio features from Music4AllOnion processed via OnionEnsembleAENet, a network of autoencoders designed for robust feature extraction; lyric embeddings derived through a large language model pipeline; and newly introduced Career Trajectory Dynamics (CTD) features that capture multi-year artist career momentum and song-level trajectory statistics. Using the Music4All dataset (113k tracks), previously explored in MIR tasks but not popularity prediction, GAMENet achieves a 12% improvement in R^2 over direct multimodal feature concatenation. Spotify audio descriptors alone yield an R^2 of 0.13. Integrating aggregate CTD features increases this to 0.69, with an additional 7% gain from temporal CTD features. We further validate robustness using the SpotGenTrack Popularity Dataset (100k tracks), achieving a 16% improvement over the previous baseline. Extensive ablations confirm the model’s effectiveness and the distinct contribution of each modality.
[551] Protecting Bystander Privacy via Selective Hearing in LALMs
Xiao Zhan, Guangzhi Sun, Jose Such, Phil Woodland
Main category: cs.SD
TL;DR: SH-Bench is the first benchmark for evaluating selective hearing in audio language models, measuring their ability to attend to main speakers while protecting bystander privacy. The paper reveals significant privacy leakage in current models and introduces BPFT training to improve selective hearing.
Details
Motivation: Current audio language models deployed in real-world settings capture unintended bystander speech, creating privacy risks that existing benchmarks and defenses overlook. There's a need to systematically measure and improve models' ability to protect bystander privacy while maintaining comprehension of intended speakers.Method: 1) Created SH-Bench with 3,968 multi-speaker audio mixtures and 77k multiple-choice questions. 2) Proposed Selective Efficacy (SE) metric combining multi-speaker comprehension and bystander privacy protection. 3) Introduced Bystander Privacy Fine-Tuning (BPFT) pipeline to teach models to refuse bystander-related queries without degrading main-speaker performance.
Result: Evaluation of state-of-the-art LALMs shows substantial privacy leakage - strong audio understanding doesn’t translate to bystander privacy protection. BPFT yields substantial gains, improving SE by up to 15.9% over Gemini 2.5 Pro, demonstrating selective hearing is learnable but not yet achieved in current models.
Conclusion: SH-Bench and BPFT provide the first systematic framework for measuring and improving bystander privacy in audio foundation models. Selective hearing is a learnable capability but remains far from achieved in current LALMs, highlighting an important privacy gap that needs addressing.
Abstract: Large audio language models (LALMs) are increasingly deployed in real-world settings where they inevitably capture speech from unintended nearby bystanders, raising privacy risks that existing benchmarks and defences largely overlook. We introduce SH-Bench, the first benchmark designed to evaluate selective hearing: a model’s ability to attend to an intended main speaker while refusing to process or reveal information about incidental bystander speech. SH-Bench contains 3,968 multi-speaker audio mixtures spanning both real-world and synthetic scenarios, paired with 77k multiple-choice questions that probe models under general and selective operating modes. We propose Selective Efficacy (SE), a unified metric capturing both multi-speaker comprehension and bystander-privacy protection. Our evaluation of state-of-the-art open-source and proprietary LALMs reveals substantial privacy leakage, with strong audio understanding failing to translate into selective protection of bystander privacy. To mitigate this gap, we introduce Bystander Privacy Fine-Tuning (BPFT), a training pipeline that teaches models to refuse bystander-related queries without degrading main-speaker comprehension. BPFT yields substantial gains which improve SE by up to 15.9% over Gemini 2.5 Pro, demonstrating that selective hearing is learnable but far from achieved in current LALMs. SH-Bench and BPFT provide the first systematic framework for measuring and improving bystander privacy in audio foundation models.
[552] SteerMusic: Enhanced Musical Consistency for Zero-shot Text-guided and Personalized Music Editing
Xinlei Niu, Kin Wai Cheuk, Jing Zhang, Naoki Murata, Chieh-Hsin Lai, Michele Mancusi, Woosung Choi, Giorgio Fabbro, Wei-Hsiang Liao, Charles Patrick Martin, Yuki Mitsufuji
Main category: cs.SD
TL;DR: Two music editing methods (SteerMusic and SteerMusic+) that use score distillation to improve consistency between original and edited music while enabling fine-grained personalized editing beyond text instructions.
Details
Motivation: Existing zero-shot text-guided music editing methods struggle to preserve musical content and text instructions alone often fail to accurately describe desired music, creating a need for better editing approaches.Method: Two methods: 1) SteerMusic - coarse-grained zero-shot editing using delta denoising score; 2) SteerMusic+ - fine-grained personalized editing by manipulating a concept token representing user-defined musical styles.
Result: Experimental results show the methods outperform existing approaches in preserving both music content consistency and editing fidelity. User studies validate superior music editing quality.
Conclusion: The proposed score distillation-based methods effectively address limitations of existing text-guided music editing by improving content preservation and enabling personalized style editing beyond text instructions.
Abstract: Music editing is an important step in music production, which has broad applications, including game development and film production. Most existing zero-shot text-guided editing methods rely on pretrained diffusion models by involving forward-backward diffusion processes. However, these methods often struggle to preserve the musical content. Additionally, text instructions alone usually fail to accurately describe the desired music. In this paper, we propose two music editing methods that improve the consistency between the original and edited music by leveraging score distillation. The first method, SteerMusic, is a coarse-grained zero-shot editing approach using delta denoising score. The second method, SteerMusic+, enables fine-grained personalized music editing by manipulating a concept token that represents a user-defined musical style. SteerMusic+ allows for the editing of music into user-defined musical styles that cannot be achieved by the text instructions alone. Experimental results show that our methods outperform existing approaches in preserving both music content consistency and editing fidelity. User studies further validate that our methods achieve superior music editing quality.
[553] JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention
Georgios Ioannides, Christos Constantinou, Aman Chadha, Aaron Elkins, Linsey Pang, Ravid Shwartz-Ziv, Yann LeCun
Main category: cs.SD
TL;DR: Two-stage self-supervised framework combining JEPA with DAAM for learning robust speech representations via masked prediction and efficient tokenization, achieving competitive compression with language-model-friendly tokens.
Details
Motivation: To create a robust, compressed speech representation that is reversible, language-model-friendly, and more efficient than existing neural audio codecs, while discovering hierarchical speech structure through adaptive temporal feature selection.Method: Two-stage approach: Stage 1 uses Joint-Embedding Predictive Architecture (JEPA) with Density Adaptive Attention Mechanism (DAAM) for semantic audio feature learning via masked prediction in latent space. Stage 2 uses Finite Scalar Quantization (FSQ) with mixed-radix packing for tokenization, followed by HiFi-GAN decoder for waveform reconstruction. DAAM integrates Gaussian mixture-based density-adaptive gating for adaptive temporal feature selection.
Result: Achieves low frame rate of 2.5 Hz (47.5 tokens/sec), producing reversible, highly compressed, language-model-friendly representations competitive with and often more efficient than existing neural audio codecs.
Conclusion: The framework successfully combines JEPA with DAAM to learn robust speech representations that are both efficient and suitable for language modeling applications, while discovering hierarchical speech structure through adaptive feature selection.
Abstract: We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5~Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
[554] XM-ALIGN: Unified Cross-Modal Embedding Alignment for Face-Voice Association
Zhihua Fang, Shumei Tao, Junxu Wang, Liang He
Main category: cs.SD
TL;DR: XM-ALIGN is a unified cross-modal embedding alignment framework that combines explicit and implicit alignment mechanisms to improve face-voice verification performance across both heard and unheard languages, achieving superior results on the MAV-Celeb dataset.
Details
Motivation: The paper addresses the challenge of cross-modal verification between face and voice modalities, particularly for both "heard" and "unheard" languages, aiming to improve performance in the FAME challenge at ICASSP 2026.Method: The framework extracts feature embeddings from face and voice encoders, jointly optimizes them using a shared classifier, employs mean squared error (MSE) as embedding alignment loss for tight modality alignment, and applies data augmentation strategies during training for better generalization.
Result: Experimental results demonstrate superior performance on the MAV-Celeb dataset, showing significant improvement in cross-modal verification performance for both heard and unheard languages.
Conclusion: XM-ALIGN effectively combines explicit and implicit alignment mechanisms to achieve strong cross-modal verification performance, with code being made publicly available for reproducibility and further research.
Abstract: This paper introduces our solution, XM-ALIGN (Unified Cross-Modal Embedding Alignment Framework), proposed for the FAME challenge at ICASSP 2026. Our framework combines explicit and implicit alignment mechanisms, significantly improving cross-modal verification performance in both “heard” and “unheard” languages. By extracting feature embeddings from both face and voice encoders and jointly optimizing them using a shared classifier, we employ mean squared error (MSE) as the embedding alignment loss to ensure tight alignment between modalities. Additionally, data augmentation strategies are applied during model training to enhance generalization. Experimental results show that our approach demonstrates superior performance on the MAV-Celeb dataset. The code will be released at https://github.com/PunkMale/XM-ALIGN.
[555] What Needs to be Known in Order to Perform a Meaningful Scientific Comparison Between Animal Communications and Human Spoken Language
Roger K. Moore
Main category: cs.SD
TL;DR: The paper proposes a minimum set of seven critical phenomena that must be considered when comparing animal communications with human speech.
Details
Motivation: There is growing interest in comparing animal communications and human speech, but current approaches may lack a comprehensive framework for meaningful comparison.Method: The paper proposes a conceptual framework identifying seven critical phenomena that should be evaluated when comparing animal vocalizations with human speech.
Result: The framework identifies seven key dimensions: vocal apparatus degrees-of-freedom, control independence, acoustic environment properties, perceptual salience, sound contrastiveness, compositionality presence, and information rates.
Conclusion: Meaningful comparison between animal communications and human speech requires systematic appraisal of these seven critical phenomena rather than superficial comparisons.
Abstract: Human spoken language has long been the subject of scientific investigation, particularly with regard to the mechanisms underpinning speech production. Likewise, the study of animal communications has a substantial literature, with many studies focusing on vocalisation. More recently, there has been growing interest in comparing animal communications and human speech. However, it is proposed here that such a comparison necessitates the appraisal of a minimum set of critical phenomena: i) the number of degrees-of-freedom of the vocal apparatus, ii) the ability to control those degrees-of-freedom independently, iii) the properties of the acoustic environment in which communication takes place, iv) the perceptual salience of the generated sounds, v) the degree to which sounds are contrastive, vi) the presence/absence of compositionality, and vii) the information rate(s) of the resulting communications.
[556] Singing Timbre Popularity Assessment Based on Multimodal Large Foundation Model
Zihao Wang, Ruibin Yuan, Ziqi Geng, Hengjia Li, Xingwei Qu, Xinyi Li, Songye Chen, Haoying Fu, Roger B. Dannenberg, Kejun Zhang
Main category: cs.SD
TL;DR: The paper proposes a reference-free, multi-dimensional singing assessment system with three main contributions: Sing-MD dataset with expert annotations, VocalVerse architecture for full-song analysis, and H-TPR benchmark for perceptual ranking evaluation.
Details
Motivation: Existing singing assessment systems have two key limitations: they rely on reference tracks (limiting creative expression) and oversimplify complex performances into non-diagnostic scores based only on pitch and rhythm. There's a need for more comprehensive, reference-free evaluation that captures multiple dimensions of vocal performance.Method: 1) Created Sing-MD dataset with expert annotations across four dimensions: breath control, timbre quality, emotional expression, and vocal technique. 2) Proposed VocalVerse, a hybrid architecture using a lightweight acoustic encoder to model global performance features and long-term dependencies for full-song analysis. 3) Introduced H-TPR benchmark for evaluating models’ ability to generate perceptually valid rankings rather than predicting noisy ground-truth scores.
Result: The analysis revealed significant annotation inconsistencies among experts, challenging traditional accuracy-based metrics. The proposed system addresses memory limitations of MLLMs for full-length song analysis and provides a more comprehensive evaluation framework.
Conclusion: The paper advocates for a shift from discriminative to descriptive evaluation in singing assessment, creating a complete ecosystem for reference-free, multi-dimensional assessment that better captures the complexity of vocal performances and addresses limitations of existing automated systems.
Abstract: Automated singing assessment is crucial for education and entertainment. However, existing systems face two fundamental limitations: reliance on reference tracks, which stifles creative expression, and the simplification of complex performances into non-diagnostic scores based solely on pitch and rhythm. We advocate for a shift from discriminative to descriptive evaluation, creating a complete ecosystem for reference-free, multi-dimensional assessment. First, we introduce Sing-MD, a large-scale dataset annotated by experts across four dimensions: breath control, timbre quality, emotional expression, and vocal technique. Our analysis reveals significant annotation inconsistencies among experts, challenging the validity of traditional accuracy-based metrics. Second, addressing the memory limitations of Multimodal Large Language Models (MLLMs) in analyzing full-length songs, we propose VocalVerse. This efficient hybrid architecture leverages a lightweight acoustic encoder to model global performance features and long-term dependencies. Third, to address automated metric shortcomings, we establish the H-TPR (Human-in-the-loop Tiered Perceptual Ranking) benchmark, which evaluates a model’s ability to generate perceptually valid rankings rather than predicting noisy ground-truth scores.
[557] Multi-Accent Mandarin Dry-Vocal Singing Dataset: Benchmark for Singing Accent Recognition
Zihao Wang, Ruibin Yuan, Ziqi Geng, Hengjia Li, Xingwei Qu, Xinyi Li, Songye Chen, Haoying Fu, Roger B. Dannenberg, Kejun Zhang
Main category: cs.SD
TL;DR: Researchers created MADVSD, a large multi-accent Mandarin singing dataset with 670+ hours of dry vocal recordings from 4,206 speakers across 9 Chinese regions, addressing the scarcity of annotated singing accent datasets.
Details
Motivation: Singing accent research is underexplored compared to speech accent studies due to dataset scarcity. Existing singing datasets often lose detail from vocal-instrumental separation and lack regional accent annotations.Method: Created MADVSD with dry vocal recordings from native Mandarin speakers across 9 regions, including recordings of three popular songs in native accents and phonetic exercises covering all Mandarin vowels and a full octave range.
Result: Validated MADVSD through singing accent recognition benchmark experiments, demonstrating its utility for evaluating speech models in singing contexts. Explored dialectal influences on singing accent and analyzed vowel roles in accentual variations.
Conclusion: MADVSD addresses the dataset gap in singing accent research and enables exploration of dialectal influences and phonetic analysis in singing contexts, advancing the field of singing accent studies.
Abstract: Singing accent research is underexplored compared to speech accent studies, primarily due to the scarcity of suitable datasets. Existing singing datasets often suffer from detail loss, frequently resulting from the vocal-instrumental separation process. Additionally, they often lack regional accent annotations. To address this, we introduce the Multi-Accent Mandarin Dry-Vocal Singing Dataset (MADVSD). MADVSD comprises over 670 hours of dry vocal recordings from 4,206 native Mandarin speakers across nine distinct Chinese regions. In addition to each participant recording audio of three popular songs in their native accent, they also recorded phonetic exercises covering all Mandarin vowels and a full octave range. We validated MADVSD through benchmark experiments in singing accent recognition, demonstrating its utility for evaluating state-of-the-art speech models in singing contexts. Furthermore, we explored dialectal influences on singing accent and analyzed the role of vowels in accentual variations, leveraging MADVSD’s unique phonetic exercises.
[558] Is Self-Supervised Learning Enough to Fill in the Gap? A Study on Speech Inpainting
Ihab Asaad, Maxime Jacquelin, Olivier Perrotin, Laurent Girin, Thomas Hueber
Main category: cs.SD
TL;DR: SSL-trained speech encoders (HuBERT) can be used for speech inpainting without additional training by adding a decoder (HiFi-GAN), outperforming baselines and reconstructing segments up to 200-400ms.
Details
Motivation: Speech inpainting resembles SSL pretext tasks, suggesting SSL-trained encoders could perform inpainting without extra training, potentially offering efficient reconstruction capabilities.Method: Use HuBERT as SSL encoder and HiFi-GAN as decoder in two configurations: (1) fine-tune decoder with frozen pre-trained encoder, (2) fine-tune encoder for inpainting with frozen decoder. Evaluate on single/multi-speaker datasets with informed/blind inpainting scenarios.
Result: Both approaches outperform baselines including text-informed methods, successfully reconstructing 200ms segments (sometimes 400ms). Fine-tuning encoder works better for single-speaker, pre-trained encoder better for multi-speaker scenarios.
Conclusion: SSL pretext tasks transfer to speech inpainting, enabling successful reconstruction with pre-trained encoders. Different configurations suit different scenarios: encoder fine-tuning for single-speaker, pre-trained encoder for multi-speaker.
Abstract: Speech inpainting consists in reconstructing corrupted or missing speech segments using surrounding context, a process that closely resembles the pretext tasks in Self-Supervised Learning (SSL) for speech encoders. This study investigates using SSL-trained speech encoders for inpainting without any additional training beyond the initial pretext task, and simply adding a decoder to generate a waveform. We compare this approach to supervised fine-tuning of speech encoders for a downstream task – here, inpainting. Practically, we integrate HuBERT as the SSL encoder and HiFi-GAN as the decoder in two configurations: (1) fine-tuning the decoder to align with the frozen pre-trained encoder’s output and (2) fine-tuning the encoder for an inpainting task based on a frozen decoder’s input. Evaluations are conducted under single- and multi-speaker conditions using in-domain datasets and out-of-domain datasets (including unseen speakers, diverse speaking styles, and noise). Both informed and blind inpainting scenarios are considered, where the position of the corrupted segment is either known or unknown. The proposed SSL-based methods are benchmarked against several baselines, including a text-informed method combining automatic speech recognition with zero-shot text-to-speech synthesis. Performance is assessed using objective metrics and perceptual evaluations. The results demonstrate that both approaches outperform baselines, successfully reconstructing speech segments up to 200 ms, and sometimes up to 400 ms. Notably, fine-tuning the SSL encoder achieves more accurate speech reconstruction in single-speaker settings, while a pre-trained encoder proves more effective for multi-speaker scenarios. This demonstrates that an SSL pretext task can transfer to speech inpainting, enabling successful speech reconstruction with a pre-trained encoder.
[559] MultiAPI Spoof: A Multi-API Dataset and Local-Attention Network for Speech Anti-spoofing Detection
Xueping Zhang, Zhenshan Zhang, Yechen Wang, Linxi Li, Liwei Jin, Ming Li
Main category: cs.SD
TL;DR: MultiAPI Spoof dataset with 230 hours of synthetic speech from 30 APIs, introducing API tracing task and Nes2Net-LA model for improved anti-spoofing performance.
Details
Motivation: Existing speech anti-spoofing benchmarks use narrow sets of public models, creating a gap from real-world scenarios where commercial systems use diverse, often proprietary APIs.Method: Introduce MultiAPI Spoof dataset with synthetic speech from 30 APIs, define API tracing task, and propose Nes2Net-LA (local-attention enhanced variant of Nes2Net) for better local context modeling and fine-grained spoofing feature extraction.
Result: Nes2Net-LA achieves state-of-the-art performance and superior robustness, particularly under diverse and unseen spoofing conditions.
Conclusion: The MultiAPI Spoof dataset addresses real-world API diversity gap, and Nes2Net-LA provides effective solution for API tracing and robust anti-spoofing detection.
Abstract: Existing speech anti-spoofing benchmarks rely on a narrow set of public models, creating a substantial gap from real-world scenarios in which commercial systems employ diverse, often proprietary APIs. To address this issue, we introduce MultiAPI Spoof, a multi-API audio anti-spoofing dataset comprising about 230 hours of synthetic speech generated by 30 distinct APIs, including commercial services, open-source models, and online platforms. Based on this dataset, we define the API tracing task, enabling fine-grained attribution of spoofed audio to its generation source. We further propose Nes2Net-LA, a local-attention enhanced variant of Nes2Net that improves local context modeling and fine-grained spoofing feature extraction. Experiments show that Nes2Net-LA achieves state-of-the-art performance and offers superior robustness, particularly under diverse and unseen spoofing conditions. Code \footnote{https://github.com/XuepingZhang/MultiAPI-Spoof} and dataset \footnote{https://xuepingzhang.github.io/MultiAPI-Spoof-Dataset/} have released.
[560] Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments
Shitong Xu, Yiyuan Yang, Niki Trigoni, Andrew Markham
Main category: cs.SD
TL;DR: Novel target speaker extraction method using noisy enrollments with positive (target speaking) and negative (target silent) segments, achieving state-of-the-art performance without requiring clean audio samples.
Details
Motivation: Clean audio samples for target speaker identification are often unavailable in real-world scenarios (e.g., cocktail parties). Existing methods struggle with noisy enrollments containing overlapping speech, creating a practical limitation for target speaker extraction systems.Method: Proposes a novel enrollment strategy that encodes target speaker information from noisy enrollments by comparing positive segments (target speaker talking) with negative segments (target speaker silent). Uses a two-stage training strategy for faster convergence.
Result: Achieves over 2.1 dB higher SI-SNRi compared to prior works for monaural speech extraction from two-speaker mixtures. Two-stage training reduces optimization steps by 60% to reach 3 dB SNR. State-of-the-art performance for noisy enrollment conditioning.
Conclusion: The proposed method effectively extracts target speakers from noisy enrollments without requiring clean audio samples, addressing a practical limitation in real-world scenarios and achieving superior performance through positive-negative enrollment comparison.
Abstract: Target speaker extraction focuses on isolating a specific speaker’s voice from an audio mixture containing multiple speakers. To provide information about the target speaker’s identity, prior works have utilized clean audio samples as conditioning inputs. However, such clean audio examples are not always readily available. For instance, obtaining a clean recording of a stranger’s voice at a cocktail party without leaving the noisy environment is generally infeasible. Limited prior research has explored extracting the target speaker’s characteristics from noisy enrollments, which may contain overlapping speech from interfering speakers. In this work, we explore a novel enrollment strategy that encodes target speaker information from the noisy enrollment by comparing segments where the target speaker is talking (Positive Enrollments) with segments where the target speaker is silent (Negative Enrollments). Experiments show the effectiveness of our model architecture, which achieves over 2.1 dB higher SI-SNRi compared to prior works in extracting the monaural speech from the mixture of two speakers. Additionally, the proposed two-stage training strategy accelerates convergence, reducing the number of optimization steps required to reach 3 dB SNR by 60%. Overall, our method achieves state-of-the-art performance in the monaural target speaker extraction conditioned on noisy enrollments. Our implementation is available at https://github.com/xu-shitong/TSE-through-Positive-Negative-Enroll .
[561] Incorporating Structure and Chord Constraints in Symbolic Transformer-based Melodic Harmonization
Maximos Kaliakatsos-Papakostas, Konstantinos Soiledis, Theodoros Tsamis, Dimos Makris, Vassilis Katsouros, Emilios Cambouropoulos
Main category: cs.SD
TL;DR: B* algorithm combines beam search and A* with backtracking to force pretrained transformers to incorporate chord constraints in melodic harmonization, addressing the challenge of satisfying specific chord requirements at precise positions.
Details
Motivation: Transformer models for symbolic music generation need better ways to incorporate user preferences and constraints, specifically predefined chord requirements at specific locations during melodic harmonization tasks.Method: Proposes B* algorithm that combines beam search, A* search, and backtracking to force pretrained transformer models to satisfy chord constraints at correct onset positions within bars during harmonization.
Result: The algorithm is brute-force with exponential worst-case complexity, but serves as a first attempt to address the problem and provides a framework that can be improved with heuristics.
Conclusion: This work highlights the difficulties of incorporating chord constraints in transformer-based melodic harmonization and introduces a foundational algorithm that enables future improvements through heuristic integration.
Abstract: Transformer architectures offer significant advantages regarding the generation of symbolic music; their capabilities for incorporating user preferences toward what they generate is being studied under many aspects. This paper studies the inclusion of predefined chord constraints in melodic harmonization, i.e., where a desired chord at a specific location is provided along with the melody as inputs and the autoregressive transformer model needs to incorporate the chord in the harmonization that it generates. The peculiarities of involving such constraints is discussed and an algorithm is proposed for tackling this task. This algorithm is called B* and it combines aspects of beam search and A* along with backtracking to force pretrained transformers to satisfy the chord constraints, at the correct onset position within the correct bar. The algorithm is brute-force and has exponential complexity in the worst case; however, this paper is a first attempt to highlight the difficulties of the problem and proposes an algorithm that offers many possibilities for improvements since it accommodates the involvement of heuristics.
[562] Audio Palette: A Diffusion Transformer with Multi-Signal Conditioning for Controllable Foley Synthesis
Junnuo Wang
Main category: cs.SD
TL;DR: Audio Palette is a diffusion transformer model that adds four time-varying acoustic controls (loudness, pitch, spectral centroid, timbre) to Stable Audio Open for fine-grained, interpretable audio synthesis while maintaining quality.
Details
Motivation: Address the "control gap" in open-source text-to-audio synthesis where current models lack fine-grained acoustic control, limiting artist-centric workflows and precise sound design capabilities.Method: Extends Stable Audio Open architecture with diffusion transformer (DiT) and introduces four time-varying control signals. Uses LoRA fine-tuning on AudioSet subset (0.85% parameters). Features sequence-based conditioning, memory efficiency, and three-scale classifier-free guidance.
Result: Achieves fine-grained interpretable control of sound attributes while maintaining audio quality and semantic alignment comparable to baseline (FAD, LAION-CLAP scores). Provides scalable pipeline for audio research.
Conclusion: Establishes robust foundation for controllable sound design and performative audio synthesis in open-source settings, enabling more artist-centric workflows in music and sound information retrieval.
Abstract: Recent advances in diffusion-based generative models have enabled high-quality text-to-audio synthesis, but fine-grained acoustic control remains a significant challenge in open-source research. We present Audio Palette, a diffusion transformer (DiT) based model that extends the Stable Audio Open architecture to address this “control gap” in controllable audio generation. Unlike prior approaches that rely solely on semantic conditioning, Audio Palette introduces four time-varying control signals, loudness, pitch, spectral centroid, and timbre, for precise and interpretable manipulation of acoustic features. The model is efficiently adapted for the nuanced domain of Foley synthesis using Low-Rank Adaptation (LoRA) on a curated subset of AudioSet, requiring only 0.85% of the original parameters to be trained. Experiments demonstrate that Audio Palette achieves fine-grained, interpretable control of sound attributes. Crucially, it accomplishes this novel controllability while maintaining high audio quality and strong semantic alignment to text prompts, with performance on standard metrics such as Frechet Audio Distance (FAD) and LAION-CLAP scores remaining comparable to the original baseline model. We provide a scalable, modular pipeline for audio research, emphasizing sequence-based conditioning, memory efficiency, and a three-scale classifier-free guidance mechanism for nuanced inference-time control. This work establishes a robust foundation for controllable sound design and performative audio synthesis in open-source settings, enabling a more artist-centric workflow in the broader context of music and sound information retrieval.
[563] Scaling to Multimodal and Multichannel Heart Sound Classification with Synthetic and Augmented Biosignals
Milan Marocchi, Matthew Fynn, Kayapanda Mandana, Yue Rong
Main category: cs.SD
TL;DR: This paper proposes using denoising diffusion models (WaveGrad and DiffWave) to augment heart sound datasets, enabling effective fine-tuning of Wav2Vec 2.0-based classifiers for cardiovascular disease detection from multimodal and multichannel heart sounds.
Details
Motivation: Cardiovascular diseases are the leading cause of death worldwide, creating demand for accurate, inexpensive pre-screening methods. While deep learning shows promise for classifying abnormal heart sounds, state-of-the-art transformer architectures are underutilized due to limited availability of synchronized PCG-ECG and multichannel PCG datasets.Method: Combines traditional signal processing with denoising diffusion models (WaveGrad and DiffWave) to create augmented datasets, then fine-tunes a Wav2Vec 2.0-based classifier on multimodal and multichannel heart sound datasets.
Result: Achieves state-of-the-art performance across multiple datasets: CinC 2016 single channel PCG (92.48% accuracy, 0.8283 MCC), synchronized PCG-ECG (93.14% accuracy, 0.8380 MCC), and wearable vest mPCG data (77.13% accuracy, 0.5082 MCC).
Conclusion: Transformer-based models with augmented datasets are effective for CVD detection, demonstrating potential to advance multimodal and multichannel heart sound classification for early disease screening.
Abstract: Cardiovascular diseases (CVDs) are the leading cause of death worldwide, accounting for approximately 17.9 million deaths each year. Early detection is critical, creating a demand for accurate and inexpensive pre-screening methods. Deep learning has recently been applied to classify abnormal heart sounds indicative of CVDs using synchronised phonocardiogram (PCG) and electrocardiogram (ECG) signals, as well as multichannel PCG (mPCG). However, state-of-the-art architectures remain underutilised due to the limited availability of synchronised and multichannel datasets. Augmented datasets and pre-trained models provide a pathway to overcome these limitations, enabling transformer-based architectures to be trained effectively. This work combines traditional signal processing with denoising diffusion models, WaveGrad and DiffWave, to create an augmented dataset to fine-tune a Wav2Vec 2.0-based classifier on multimodal and multichannel heart sound datasets. The approach achieves state-of-the-art performance. On the Computing in Cardiology (CinC) 2016 dataset of single channel PCG, accuracy, unweighted average recall (UAR), sensitivity, specificity and Matthew’s correlation coefficient (MCC) reach 92.48%, 93.05%, 93.63%, 92.48%, 94.93% and 0.8283, respectively. Using the synchronised PCG and ECG signals of the training-a dataset from CinC, 93.14%, 92.21%, 94.35%, 90.10%, 95.12% and 0.8380 are achieved for accuracy, UAR, sensitivity, specificity and MCC, respectively. Using a wearable vest dataset consisting of mPCG data, the model achieves 77.13% accuracy, 74.25% UAR, 86.47% sensitivity, 62.04% specificity, and 0.5082 MCC. These results demonstrate the effectiveness of transformer-based models for CVD detection when supported by augmented datasets, highlighting their potential to advance multimodal and multichannel heart sound classification.
cs.LG
[564] A self-driving lab for solution-processed electrochromic thin films
Selma Dahms, Luca Torresi, Shahbaz Tareq Bandesha, Jan Hansmann, Holger Röhm, Alexander Colsmann, Marco Schott, Pascal Friederich
Main category: cs.LG
TL;DR: Self-driving labs accelerate electrochromic coating development using automation, machine learning, and Bayesian optimization for efficient parameter optimization.
Details
Motivation: Solution-processed electrochromic materials have great potential for smart windows/displays, but optimizing spin-coated thin films is complex and time-consuming, requiring rapid development methods.Method: Combines automated data acquisition, image processing, spectral analysis, and Bayesian optimization in a self-driving laboratory system to efficiently explore processing parameters.
Result: The approach increases throughput and enables targeted search for optimal processing parameters, accelerating development of electrochromic coatings.
Conclusion: Self-driving labs show significant potential for enhancing materials discovery and process optimization across various solution-processed materials.
Abstract: Solution-processed electrochromic materials offer high potential for energy-efficient smart windows and displays. Their performance varies with material choice and processing conditions. Electrochromic thin film electrodes require a smooth, defect-free coating for optimal contrast between bleached and colored states. The complexity of optimizing the spin-coated electrochromic thin layer poses challenges for rapid development. This study demonstrates the use of self-driving laboratories to accelerate the development of electrochromic coatings by coupling automation with machine learning. Our system combines automated data acquisition, image processing, spectral analysis, and Bayesian optimization to explore processing parameters efficiently. This approach not only increases throughput but also enables a pointed search for optimal processing parameters. The approach can be applied to various solution-processed materials, highlighting the potential of self-driving labs in enhancing materials discovery and process optimization.
[565] Memory-Amortized Inference: A Topological Unification of Search, Closure, and Structure
Xin Li
Main category: cs.LG
TL;DR: MAI is a topological framework unifying learning and memory as phase transitions, using homology to separate content (even-dimensions) from context (odd-dimensions), transforming search into lookup via cycle closure.
Details
Motivation: Current ML systems lack biological efficiency by separating static parameters from dynamic inference. The paper aims to create a unified framework that bridges this gap using topological principles.Method: Memory-Amortized Inference (MAI) uses algebraic topology with Homological Parity Principle: even-dimensional homology for stable content, odd-dimensional homology for dynamic context. Implements Search→Closure→Structure transformation via topological cycle closure, generalizing Wake-Sleep algorithm.
Result: The framework explains emergence of fast-thinking intuition from slow-thinking reasoning, provides blueprint for post-Turing architectures computing via topological resonance, and shows how high-complexity search converts to low-complexity lookup.
Conclusion: MAI offers a rigorous topological foundation for unifying learning and memory, potentially enabling more efficient, biologically-inspired cognitive architectures that overcome limitations of contemporary ML systems.
Abstract: Contemporary ML separates the static structure of parameters from the dynamic flow of inference, yielding systems that lack the sample efficiency and thermodynamic frugality of biological cognition. In this theoretical work, we propose \textbf{Memory-Amortized Inference (MAI)}, a formal framework rooted in algebraic topology that unifies learning and memory as phase transitions of a single geometric substrate. Central to our theory is the \textbf{Homological Parity Principle}, which posits a fundamental dichotomy: even-dimensional homology ($H_{even}$) physically instantiates stable \textbf{Content} (stable scaffolds or what''), while odd-dimensional homology ($H_{odd}$) instantiates dynamic \textbf{Context} (dynamic flows or where’’). We derive the logical flow of MAI as a topological trinity transformation: \textbf{Search $\to$ Closure $\to$ Structure}. Specifically, we demonstrate that cognition operates by converting high-complexity recursive search (modeled by \textit{Savitch’s Theorem} in NPSPACE) into low-complexity lookup (modeled by \textit{Dynamic Programming} in P) via the mechanism of \textbf{Topological Cycle Closure}. We further show that this consolidation process is governed by a topological generalization of the Wake-Sleep algorithm, functioning as a coordinate descent that alternates between optimizing the $H_{odd}$ flow (inference/wake) and condensing persistent cycles into the $H_{even}$ scaffold (learning/sleep). This framework offers a rigorous explanation for the emergence of fast-thinking (intuition) from slow-thinking (reasoning) and provides a blueprint for post-Turing architectures that compute via topological resonance.
[566] Deep learning recognition and analysis of Volatile Organic Compounds based on experimental and synthetic infrared absorption spectra
Andrea Della Valle, Annalisa D’Arco, Tiziana Mancini, Rosanna Mosetti, Maria Chiara Paolozzi, Stefano Lupi, Sebastiano Pilati, Andrea Perali
Main category: cs.LG
TL;DR: Researchers developed a deep learning system using experimental and synthetic IR spectra to accurately identify and quantify nine different VOCs in real-time, overcoming limitations of traditional IR spectroscopy analysis.
Details
Motivation: VOCs pose significant health risks, and while IR spectroscopy enables ultrasensitive detection, the complexity of IR spectra limits real-time recognition and quantification. Traditional deep learning approaches require massive datasets that are difficult to obtain for VOC analysis.Method: Created an experimental VOC dataset with IR absorption spectra for nine compounds at various concentrations, then augmented it with synthetic spectra generated via conditional generative neural networks. Used this combined dataset to train robust discriminative neural networks for VOC identification and concentration prediction.
Result: Successfully trained neural networks that can reliably identify nine different VOCs and precisely predict their concentrations. The trained model is suitable for integration into sensing devices for real-time VOC recognition and analysis.
Conclusion: The combination of experimental data with synthetic spectra generated by conditional generative NNs enables effective training of discriminative models for real-time VOC detection and quantification, addressing the data scarcity problem in VOC analysis.
Abstract: Volatile Organic Compounds (VOCs) are organic molecules that have low boiling points and therefore easily evaporate into the air. They pose significant risks to human health, making their accurate detection the crux of efforts to monitor and minimize exposure. Infrared (IR) spectroscopy enables the ultrasensitive detection at low-concentrations of VOCs in the atmosphere by measuring their IR absorption spectra. However, the complexity of the IR spectra limits the possibility to implement VOC recognition and quantification in real-time. While deep neural networks (NNs) are increasingly used for the recognition of complex data structures, they typically require massive datasets for the training phase. Here, we create an experimental VOC dataset for nine different classes of compounds at various concentrations, using their IR absorption spectra. To further increase the amount of spectra and their diversity in term of VOC concentration, we augment the experimental dataset with synthetic spectra created via conditional generative NNs. This allows us to train robust discriminative NNs, able to reliably identify the nine VOCs, as well as to precisely predict their concentrations. The trained NN is suitable to be incorporated into sensing devices for VOCs recognition and analysis.
[567] When Privacy Isn’t Synthetic: Hidden Data Leakage in Generative AI Models
S. M. Mustaqim, Anantaa Kotal, Paul H. Yi
Main category: cs.LG
TL;DR: A black-box membership inference attack on generative models that exploits structural overlap between synthetic and real data distributions through clustering analysis, revealing privacy vulnerabilities even in differentially private synthetic data releases.
Details
Motivation: To demonstrate that synthetic data generated for privacy protection can still leak information about training samples through structural patterns in the data manifold, highlighting an under-explored attack surface in privacy-preserving data publishing.Method: Proposes a black-box attack that repeatedly queries the generative model to obtain synthetic samples, performs unsupervised clustering to identify dense regions, and analyzes cluster medoids/neighborhoods that correspond to high-density regions in original training data to infer membership or reconstruct records.
Result: Experiments across healthcare, finance, and other sensitive domains show measurable membership leakage due to cluster overlap between real and synthetic data, even when generators are trained with differential privacy or other noise mechanisms.
Conclusion: Synthetic data generation pipelines need stronger privacy guarantees that account for distributional neighborhood inference rather than just sample-level memorization, calling for new approaches to address this vulnerability in privacy-preserving data publishing.
Abstract: Generative models are increasingly used to produce privacy-preserving synthetic data as a safe alternative to sharing sensitive training datasets. However, we demonstrate that such synthetic releases can still leak information about the underlying training samples through structural overlap in the data manifold. We propose a black-box membership inference attack that exploits this vulnerability without requiring access to model internals or real data. The attacker repeatedly queries the generative model to obtain large numbers of synthetic samples, performs unsupervised clustering to identify dense regions of the synthetic distribution, and then analyzes cluster medoids and neighborhoods that correspond to high-density regions in the original training data. These neighborhoods act as proxies for training samples, enabling the adversary to infer membership or reconstruct approximate records. Our experiments across healthcare, finance, and other sensitive domains show that cluster overlap between real and synthetic data leads to measurable membership leakage-even when the generator is trained with differential privacy or other noise mechanisms. The results highlight an under-explored attack surface in synthetic data generation pipelines and call for stronger privacy guarantees that account for distributional neighborhood inference rather than sample-level memorization alone, underscoring its role in privacy-preserving data publishing. Implementation and evaluation code are publicly available at:github.com/Cluster-Medoid-Leakage-Attack.
[568] JaxWildfire: A GPU-Accelerated Wildfire Simulator for Reinforcement Learning
Ufuk Çakır, Victor-Alexandru Darvariu, Bruno Lacerda, Nick Hawes
Main category: cs.LG
TL;DR: JaxWildfire is a GPU-accelerated wildfire simulator built in JAX that enables fast, vectorized simulations for training reinforcement learning agents to learn wildfire suppression policies.
Details
Motivation: Existing wildfire simulators are too slow for training reinforcement learning agents, which require many environment interactions. This speed limitation hinders the development of AI methods for proactive wildfire management.Method: Developed JaxWildfire, a simulator based on probabilistic cellular automata fire spread model, implemented in JAX with vmap for vectorized GPU simulations, enabling gradient-based optimization of parameters.
Result: Achieved 6-35x speedup over existing software, demonstrated gradient-based parameter optimization, and successfully trained RL agents to learn wildfire suppression policies.
Conclusion: JaxWildfire enables efficient training of RL agents for wildfire management, representing an important step toward advancing AI techniques for natural hazard management.
Abstract: Artificial intelligence methods are increasingly being explored for managing wildfires and other natural hazards. In particular, reinforcement learning (RL) is a promising path towards improving outcomes in such uncertain decision-making scenarios and moving beyond reactive strategies. However, training RL agents requires many environment interactions, and the speed of existing wildfire simulators is a severely limiting factor. We introduce $\texttt{JaxWildfire}$, a simulator underpinned by a principled probabilistic fire spread model based on cellular automata. It is implemented in JAX and enables vectorized simulations using $\texttt{vmap}$, allowing high throughput of simulations on GPUs. We demonstrate that $\texttt{JaxWildfire}$ achieves 6-35x speedup over existing software and enables gradient-based optimization of simulator parameters. Furthermore, we show that $\texttt{JaxWildfire}$ can be used to train RL agents to learn wildfire suppression policies. Our work is an important step towards enabling the advancement of RL techniques for managing natural hazards.
[569] ARC-AGI Without Pretraining
Isaac Liao, Albert Gu
Main category: cs.LG
TL;DR: A 76K parameter model called CompressARC solves 20% of ARC-AGI puzzles without pretraining by using Minimum Description Length (MDL) during inference, challenging the conventional wisdom that massive pretraining is necessary for such tasks.
Details
Motivation: The paper challenges the conventional belief that solving IQ-test-like visual puzzles from the ARC-AGI benchmark requires massive pretraining. The authors aim to demonstrate that intelligence can emerge through alternative approaches like Minimum Description Length (MDL) rather than relying solely on large-scale pretraining.Method: CompressARC is a 76K parameter model that uses Minimum Description Length (MDL) to solve puzzles. It operates purely during inference time, training only on a single sample (the target puzzle itself with solution information removed). The model doesn’t use the pre-provided ARC-AGI training set and minimizes description length to find solutions.
Result: CompressARC solves 20% of evaluation puzzles from the ARC-AGI-1 benchmark, demonstrating extreme generalization abilities. It successfully solves diverse creative puzzles under extremely data-limited conditions where conventional deep learning would not be expected to solve any puzzles.
Conclusion: Minimum Description Length (MDL) represents a feasible alternative path to producing intelligence, separate from conventional pretraining approaches. The success of CompressARC suggests that MDL-based methods can achieve generalization capabilities typically unheard of in deep learning, even with minimal parameters and no pretraining.
Abstract: Conventional wisdom in the age of LLMs dictates that solving IQ-test-like visual puzzles from the ARC-AGI-1 benchmark requires capabilities derived from massive pretraining. To counter this, we introduce CompressARC, a 76K parameter model without any pretraining that solves 20% of evaluation puzzles by minimizing the description length (MDL) of the target puzzle purely during inference time. The MDL endows CompressARC with extreme generalization abilities typically unheard of in deep learning. To our knowledge, CompressARC is the only deep learning method for ARC-AGI where training happens only on a single sample: the target inference puzzle itself, with the final solution information removed. Moreover, CompressARC does not train on the pre-provided ARC-AGI “training set”. Under these extremely data-limited conditions, we do not ordinarily expect any puzzles to be solvable at all. Yet CompressARC still solves a diverse distribution of creative ARC-AGI puzzles, suggesting MDL to be an alternative feasible way to produce intelligence, besides conventional pretraining.
[570] A Prescriptive Framework for Determining Optimal Days for Short-Term Traffic Counts
Arthur Mukwaya, Nancy Kasamala, Nana Kankam Gyimah, Judith Mwakalonge, Gurcan Comert, Saidi Siuhi, Denis Ruganuza, Mark Ngotonie
Main category: cs.LG
TL;DR: Machine learning framework identifies optimal days for short traffic counts to improve Annual Average Daily Traffic (AADT) estimation accuracy, outperforming current DOT practices.
Details
Motivation: State DOTs struggle to obtain accurate AADT data, especially for unmonitored roads. Continuous count stations are expensive and difficult to deploy widely, forcing reliance on short-duration counts that may not be optimally timed for accurate AADT prediction.Method: Proposes a machine learning framework to identify optimal representative days for short count data collection. Uses Texas traffic data (2022-2023) to compare ‘optimal day’ approach (iteratively selecting most informative days) vs ’no optimal day’ baseline. Uses continuous count data to simulate 24-hour short counts and actual field short counts with leave-one-out technique for unbiased feature engineering across similar road segments.
Result: Optimal day approach outperforms baseline across top five days. Best day (Day 186) achieves: RMSE: 7,871.15, MAE: 3,645.09, MAPE: 11.95%, R²: 0.9756 vs baseline: RMSE: 11,185.00, MAE: 5,118.57, MAPE: 14.42%, R²: 0.9499.
Conclusion: The framework offers DOTs an alternative to conventional short-duration count practices, improving AADT estimation accuracy, supporting Highway Performance Monitoring System compliance, and reducing operational costs of statewide traffic data collection.
Abstract: The Federal Highway Administration (FHWA) mandates that state Departments of Transportation (DOTs) collect reliable Annual Average Daily Traffic (AADT) data. However, many U.S. DOTs struggle to obtain accurate AADT, especially for unmonitored roads. While continuous count (CC) stations offer accurate traffic volume data, their implementation is expensive and difficult to deploy widely, compelling agencies to rely on short-duration traffic counts. This study proposes a machine learning framework, the first to our knowledge, to identify optimal representative days for conducting short count (SC) data collection to improve AADT prediction accuracy. Using 2022 and 2023 traffic volume data from the state of Texas, we compare two scenarios: an ‘optimal day’ approach that iteratively selects the most informative days for AADT estimation and a ’no optimal day’ baseline reflecting current practice by most DOTs. To align with Texas DOT’s traffic monitoring program, continuous count data were utilized to simulate the 24 hour short counts. The actual field short counts were used to enhance feature engineering through using a leave-one-out (LOO) technique to generate unbiased representative daily traffic features across similar road segments. Our proposed methodology outperforms the baseline across the top five days, with the best day (Day 186) achieving lower errors (RMSE: 7,871.15, MAE: 3,645.09, MAPE: 11.95%) and higher R^2 (0.9756) than the baseline (RMSE: 11,185.00, MAE: 5,118.57, MAPE: 14.42%, R^2: 0.9499). This research offers DOTs an alternative to conventional short-duration count practices, improving AADT estimation, supporting Highway Performance Monitoring System compliance, and reducing the operational costs of statewide traffic data collection.
[571] Physics-Informed Neural Koopman Machine for Interpretable Longitudinal Personalized Alzheimer’s Disease Forecasting
Georgi Hrusanov, Duy-Thanh Vu, Duy-Cat Can, Sophie Tascedda, Margaret Ryan, Julien Bodelet, Katarzyna Koscielska, Carsten Magnus, Oliver Y. Chén
Main category: cs.LG
TL;DR: NKM (Neural Koopman Machine) is a new ML architecture that integrates multimodal data to forecast Alzheimer’s cognitive decline with interpretability, outperforming existing methods.
Details
Motivation: Early forecasting of individual cognitive decline in Alzheimer's disease is crucial but challenging due to difficulties in integrating multimodal data for longitudinal personalized forecasting while maintaining interpretability.Method: NKM combines dynamical systems theory with attention mechanisms, using Fusion Group-Aware Hierarchical Attention within the Koopman operator framework to transform nonlinear trajectories into interpretable linear representations. It integrates analytical (α) and biological (β) knowledge to guide feature grouping and hierarchical attention.
Result: NKM consistently outperforms traditional ML and deep learning models in forecasting cognitive decline trajectories on ADNI dataset. It can: (1) forecast multiple cognitive scores simultaneously, (2) quantify differential biomarker contributions, and (3) identify brain regions most predictive of cognitive deterioration.
Conclusion: NKM advances personalized, interpretable forecasting of future cognitive decline in AD using multimodal data through an explainable system, revealing potential biological underpinnings of AD progression.
Abstract: Early forecasting of individual cognitive decline in Alzheimer’s disease (AD) is central to disease evaluation and management. Despite advances, it is as of yet challenging for existing methodological frameworks to integrate multimodal data for longitudinal personalized forecasting while maintaining interpretability. To address this gap, we present the Neural Koopman Machine (NKM), a new machine learning architecture inspired by dynamical systems and attention mechanisms, designed to forecast multiple cognitive scores simultaneously using multimodal genetic, neuroimaging, proteomic, and demographic data. NKM integrates analytical ($α$) and biological ($β$) knowledge to guide feature grouping and control the hierarchical attention mechanisms to extract relevant patterns. By implementing Fusion Group-Aware Hierarchical Attention within the Koopman operator framework, NKM transforms complex nonlinear trajectories into interpretable linear representations. To demonstrate NKM’s efficacy, we applied it to study the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset. Our results suggest that NKM consistently outperforms both traditional machine learning methods and deep learning models in forecasting trajectories of cognitive decline. Specifically, NKM (1) forecasts changes of multiple cognitive scores simultaneously, (2) quantifies differential biomarker contributions to predicting distinctive cognitive scores, and (3) identifies brain regions most predictive of cognitive deterioration. Together, NKM advances personalized, interpretable forecasting of future cognitive decline in AD using past multimodal data through an explainable, explicit system and reveals potential multimodal biological underpinnings of AD progression.
[572] gp2Scale: A Class of Compactly-Supported Non-Stationary Kernels and Distributed Computing for Exact Gaussian Processes on 10 Million Data Points
Marcus M. Noack, Mark D. Risser, Hengrui Luo, Vardaan Tekriwal, Ronald J. Pandolfi
Main category: cs.LG
TL;DR: gp2Scale scales exact Gaussian processes to >10M data points using naturally sparse covariance matrices from flexible kernels, avoiding approximations while maintaining full GP customizability.
Details
Motivation: Current GP scaling methods rely on approximations that sacrifice accuracy and limit kernel/noise-model flexibility, which is problematic as expressive non-stationary kernels become increasingly important.Method: Leverages flexible, compactly supported, non-stationary kernels that create natural sparsity in covariance matrices, then exploits this sparsity for efficient linear system solutions and log-determinant calculations without inducing points or approximations.
Result: Scales exact GPs to over 10 million data points, shows superior approximation performance compared to state-of-the-art methods, and maintains full GP customizability.
Conclusion: gp2Scale provides an optimal solution for modern GP applications by enabling exact computation at massive scales while preserving complete flexibility in kernel design, noise models, and input space types.
Abstract: Despite a large corpus of recent work on scaling up Gaussian processes, a stubborn trade-off between computational speed, prediction and uncertainty quantification accuracy, and customizability persists. This is because the vast majority of existing methodologies exploit various levels of approximations that lower accuracy and limit the flexibility of kernel and noise-model designs – an unacceptable drawback at a time when expressive non-stationary kernels are on the rise in many fields. Here, we propose a methodology we term \emph{gp2Scale} that scales exact Gaussian processes to more than 10 million data points without relying on inducing points, kernel interpolation, or neighborhood-based approximations, and instead leveraging the existing capabilities of a GP: its kernel design. Highly flexible, compactly supported, and non-stationary kernels lead to the identification of naturally occurring sparse structure in the covariance matrix, which is then exploited for the calculations of the linear system solution and the log-determinant for training. We demonstrate our method’s functionality on several real-world datasets and compare it with state-of-the-art approximation algorithms. Although we show superior approximation performance in many cases, the method’s real power lies in its agnosticism toward arbitrary GP customizations – core kernel design, noise, and mean functions – and the type of input space, making it optimally suited for modern Gaussian process applications.
[573] Learning Invariant Graph Representations Through Redundant Information
Barproda Halder, Pasan Dissanayake, Sanghamitra Dutta
Main category: cs.LG
TL;DR: RIG framework uses Partial Information Decomposition to isolate spurious and causal subgraphs for OOD generalization in graph learning.
Details
Motivation: Existing invariant representation learning approaches relying solely on classical information-theoretic measures fail to precisely address redundant information between spurious and invariant subgraphs, limiting OOD generalization capabilities.Method: Proposes Redundancy-guided Invariant Graph learning (RIG) - a multi-level optimization framework that maximizes redundant information while isolating spurious and causal subgraphs using Partial Information Decomposition.
Result: Experiments on synthetic and real-world graph datasets demonstrate improved generalization capabilities under diverse distribution shifts.
Conclusion: Partial Information Decomposition provides a powerful tool for addressing OOD generalization challenges in graph learning by precisely targeting redundant information between spurious and invariant components.
Abstract: Learning invariant graph representations for out-of-distribution (OOD) generalization remains challenging because the learned representations often retain spurious components. To address this challenge, this work introduces a new tool from information theory called Partial Information Decomposition (PID) that goes beyond classical information-theoretic measures. We identify limitations in existing approaches for invariant representation learning that solely rely on classical information-theoretic measures, motivating the need to precisely focus on redundant information about the target $Y$ shared between spurious subgraphs $G_s$ and invariant subgraphs $G_c$ obtained via PID. Next, we propose a new multi-level optimization framework that we call – Redundancy-guided Invariant Graph learning (RIG) – that maximizes redundant information while isolating spurious and causal subgraphs, enabling OOD generalization under diverse distribution shifts. Our approach relies on alternating between estimating a lower bound of redundant information (which itself requires an optimization) and maximizing it along with additional objectives. Experiments on both synthetic and real-world graph datasets demonstrate the generalization capabilities of our proposed RIG framework.
[574] PMA-Diffusion: A Physics-guided Mask-Aware Diffusion Framework for TSE from Sparse Observations
Lindong Liu, Zhixiong Jin, Seongjin Choi
Main category: cs.LG
TL;DR: PMA-Diffusion: A physics-guided mask-aware diffusion framework that reconstructs unobserved highway speed fields from sparse, incomplete traffic observations, outperforming baselines even with only 5% visibility.
Details
Motivation: High-resolution highway traffic state information is essential for Intelligent Transportation Systems, but typical traffic data from loop detectors and probe vehicles are often too sparse and noisy to capture detailed traffic flow dynamics.Method: A physics-guided mask-aware diffusion framework with two mask-aware training strategies (Single-Mask and Double-Mask). At inference, a physics-guided posterior sampler alternates reverse-diffusion updates, observation projection, and physics-guided projection based on adaptive anisotropic smoothing.
Result: Tested on I-24 MOTION dataset with varying visibility ratios. Even with only 5% visibility, PMA-Diffusion outperforms other baselines across three reconstruction error metrics. Training with sparse observations nearly matches performance of baseline models trained on fully observed speed fields.
Conclusion: Combining mask-aware diffusion priors with a physics-guided posterior sampler provides a reliable and flexible solution for traffic state estimation under realistic sensing sparsity.
Abstract: High-resolution highway traffic state information is essential for Intelligent Transportation Systems, but typical traffic data acquired from loop detectors and probe vehicles are often too sparse and noisy to capture the detailed dynamics of traffic flow. We propose PMA-Diffusion, a physics-guided mask-aware diffusion framework that reconstructs unobserved highway speed fields from sparse, incomplete observations. Our approach trains a diffusion prior directly on sparsely observed speed fields using two mask-aware training strategies: Single-Mask and Double-Mask. At the inference phase, the physics-guided posterior sampler alternates reverse-diffusion updates, observation projection, and physics-guided projection based on adaptive anisotropic smoothing to reconstruct the missing speed fields. The proposed framework is tested on the I-24 MOTION dataset with varying visibility ratios. Even under severe sparsity, with only 5% visibility, PMA-Diffusion outperforms other baselines across three reconstruction error metrics. Furthermore, PMA-diffusion trained with sparse observation nearly matches the performance of the baseline model trained on fully observed speed fields. The results indicate that combining mask-aware diffusion priors with a physics-guided posterior sampler provides a reliable and flexible solution for traffic state estimation under realistic sensing sparsity.
[575] Hankel-FNO: Fast Underwater Acoustic Charting Via Physics-Encoded Fourier Neural Operator
Yifan Sun, Lei Cheng, Jianlong Li, Peter Gerstoft
Main category: cs.LG
TL;DR: Hankel-FNO: A Fourier Neural Operator-based model for fast and accurate underwater acoustic charting that incorporates sound propagation knowledge and bathymetry, outperforming traditional solvers in speed and data-driven alternatives in accuracy.
Details
Motivation: Conventional underwater acoustic charting methods rely on computationally expensive numerical solvers that aren't scalable for large-scale or real-time applications. Deep learning surrogates have limitations like fixed-resolution constraints or dependence on explicit PDE formulations, hindering their generalization across diverse environments.Method: Hankel-FNO, a Fourier Neural Operator-based model that incorporates sound propagation knowledge and bathymetry data for efficient acoustic charting. The model leverages FNO architecture for computational efficiency while integrating domain-specific physical knowledge.
Result: Hankel-FNO outperforms traditional solvers in computational speed and surpasses data-driven alternatives in accuracy, especially for long-range predictions. The model shows adaptability to diverse environments and sound source settings with minimal fine-tuning.
Conclusion: Hankel-FNO provides an efficient and accurate solution for underwater acoustic charting that balances computational speed with high accuracy, demonstrating strong generalization capabilities across different environmental conditions with minimal adaptation required.
Abstract: Fast and accurate underwater acoustic charting is crucial for downstream tasks such as environment-aware sensor placement optimization and autonomous vehicle path planning. Conventional methods rely on computationally expensive while accurate numerical solvers, which are not scalable for large-scale or real-time applications. Although deep learning-based surrogate models can accelerate these computations, they often suffer from limitations such as fixed-resolution constraints or dependence on explicit partial differential equation formulations. These issues hinder their applicability and generalization across diverse environments. We propose Hankel-FNO, a Fourier Neural Operator (FNO)-based model for efficient and accurate acoustic charting. By incorporating sound propagation knowledge and bathymetry, our method has high accuracy while maintaining high computational speed. Results demonstrate that Hankel-FNO outperforms traditional solvers in speed and surpasses data-driven alternatives in accuracy, especially in long-range predictions. Experiments show the model’s adaptability to diverse environments and sound source settings with minimal fine-tuning.
[576] How Should We Evaluate Data Deletion in Graph-Based ANN Indexes?
Tomohiro Yamashita, Daichi Amagata, Yusuke Matsui
Main category: cs.LG
TL;DR: Proposes an experimental framework and evaluation metrics for assessing data deletion efficiency in Approximate Nearest Neighbor Search (ANNS) indexes, with application to Hierarchical Navigable Small World and introduction of Deletion Control method.
Details
Motivation: ANNS has gained importance for applications like Retrieval-Augmented Generation, requiring dynamic data support. However, there's no comprehensive evaluation methodology for data deletion in ANNS, despite growing interest in dynamic ANNS algorithms.Method: 1) Proposes experimental framework and comprehensive evaluation metrics for data deletion in ANNS indexes. 2) Categorizes data deletion methods in graph-based ANNS into three approaches with mathematical formalization. 3) Applies framework to Hierarchical Navigable Small World (HNSW) to analyze deletion effects. 4) Introduces Deletion Control method that dynamically selects appropriate deletion approach based on required search accuracy.
Result: The paper establishes a systematic evaluation methodology for data deletion in ANNS, providing metrics for accuracy, query speed, and other relevant performance aspects. The framework enables analysis of deletion effects on state-of-the-art methods like HNSW.
Conclusion: The study addresses the gap in evaluation methodology for data deletion in ANNS, providing a comprehensive framework and demonstrating its application to HNSW. The proposed Deletion Control method offers practical solution for maintaining search accuracy while managing dynamic data.
Abstract: Approximate Nearest Neighbor Search (ANNS) has recently gained significant attention due to its many applications, such as Retrieval-Augmented Generation. Such applications require ANNS algorithms that support dynamic data, so the ANNS problem on dynamic data has attracted considerable interest. However, a comprehensive evaluation methodology for data deletion in ANNS has yet to be established. This study proposes an experimental framework and comprehensive evaluation metrics to assess the efficiency of data deletion for ANNS indexes under practical use cases. Specifically, we categorize data deletion methods in graph-based ANNS into three approaches and formalize them mathematically. The performance is assessed in terms of accuracy, query speed, and other relevant metrics. Finally, we apply the proposed evaluation framework to Hierarchical Navigable Small World, one of the state-of-the-art ANNS methods, to analyze the effects of data deletion, and propose Deletion Control, a method which dynamically selects the appropriate deletion method under a required search accuracy.
[577] K2-V2: A 360-Open, Reasoning-Enhanced LLM
K2 Team, Zhengzhong Liu, Liping Tang, Linghao Jin, Haonan Li, Nikhil Ranjan, Desai Fan, Shaurya Rohatgi, Richard Fan, Omkar Pangarkar, Huijuan Wang, Zhoujun Cheng, Suqi Sun, Seungwook Han, Bowen Tan, Gurpreet Gosal, Xudong Han, Varad Pimpalkhute, Shibo Hao, Ming Shan Hee, Joel Hestness, Haolong Jia, Liqun Ma, Aaryamonvikram Singh, Daria Soboleva, Natalia Vassilieva, Renxi Wang, Yingquan Wu, Yuekai Sun, Taylor Killian, Alexander Moreno, John Maggs, Hector Ren, Guowei He, Hongyi Wang, Xuezhe Ma, Yuqi Wang, Mikhail Yurochkin, Eric P. Xing
Main category: cs.LG
TL;DR: K2-V2 is a 360-open LLM built from scratch that serves as a superior reasoning base model, rivaling top open-weight models in its size class while being fully open-source with complete training transparency.
Details
Motivation: To create a fully open, reasoning-centric foundation model that can serve as a superior base for reasoning adaptation while maintaining transparency through complete training data and history release.Method: Built from scratch with active infusion of domain knowledge, reasoning capabilities, long-context handling, and tool use throughout training. Uses simple supervised fine-tuning to establish strong baselines.
Result: K2-V2 outperforms Qwen2.5-72B and approaches Qwen3-235B performance, establishing itself as the strongest fully open model that rivals open-weight leaders in its size class.
Conclusion: The model demonstrates significant potential for complex reasoning tasks with room for advanced alignment, and its full transparency (weights, training data, history) empowers the community for continuous training and development.
Abstract: We introduce K2-V2, a 360-open LLM built from scratch as a superior base for reasoning adaptation, in addition to functions such as conversation and knowledge retrieval from general LLMs. It stands as the strongest fully open model, rivals open-weight leaders in its size class, outperforms Qwen2.5-72B and approaches the performance of Qwen3-235B. We actively infuse domain knowledge, reasoning, long-context, and tool use throughout the training process. This explicitly prepares the model for complex reasoning tasks. We demonstrate this potential using simple supervised fine-tuning, establishing a strong baseline that indicates significant headroom for advanced alignment. By releasing the full training history and data composition, we maximize the effectiveness of continuous training, a key open source production scenario. We release the model weights and signature LLM360 artifacts, such as complete training data, to empower the community with a capable, reasoning-centric foundation.
[578] A multimodal Bayesian Network for symptom-level depression and anxiety prediction from voice and speech data
Agnes Norbury, George Fairs, Alexandra L. Georgescu, Matthew M. Nour, Emilia Molimpakis, Stefano Goria
Main category: cs.LG
TL;DR: Bayesian network models can effectively predict depression/anxiety symptoms from voice/speech features with good performance, fairness, and clinical utility when using large multimodal datasets.
Details
Motivation: Clinicians need to integrate complex nonverbal cues (tone, speech rate, body language) with verbal reports during psychiatric assessment, which is challenging and could benefit from AI support tools that haven't been widely adopted in clinical practice.Method: Used Bayesian network modeling to predict depression and anxiety symptoms from voice and speech features, evaluated on large-scale datasets (30,135 unique speakers). Assessed performance, demographic fairness, modality integration/redundancy, clinical usefulness, and patient acceptability.
Result: Strong performance for depression (ROC-AUC=0.842, ECE=0.018) and anxiety (ROC-AUC=0.831, ECE=0.015), with core individual symptom ROC-AUC >0.74. Demonstrated demographic fairness and effective integration across input modalities.
Conclusion: Bayesian network models with rich multimodal data at symptom level provide a principled approach for robust clinical assessment tools that offer transparent, explainable outputs amenable to expert clinical supervision.
Abstract: During psychiatric assessment, clinicians observe not only what patients report, but important nonverbal signs such as tone, speech rate, fluency, responsiveness, and body language. Weighing and integrating these different information sources is a challenging task and a good candidate for support by intelligence-driven tools - however this is yet to be realized in the clinic. Here, we argue that several important barriers to adoption can be addressed using Bayesian network modelling. To demonstrate this, we evaluate a model for depression and anxiety symptom prediction from voice and speech features in large-scale datasets (30,135 unique speakers). Alongside performance for conditions and symptoms (for depression, anxiety ROC-AUC=0.842,0.831 ECE=0.018,0.015; core individual symptom ROC-AUC>0.74), we assess demographic fairness and investigate integration across and redundancy between different input modality types. Clinical usefulness metrics and acceptability to mental health service users are explored. When provided with sufficiently rich and large-scale multimodal data streams and specified to represent common mental conditions at the symptom rather than disorder level, such models are a principled approach for building robust assessment support tools: providing clinically-relevant outputs in a transparent and explainable format that is directly amenable to expert clinical supervision.
[579] Quantifying Memory Use in Reinforcement Learning with Temporal Range
Rodney Lafuente-Mercado, Daniela Rus, T. Konstantin Rusch
Main category: cs.LG
TL;DR: Temporal Range is a new metric that measures how much RL policies depend on past observations by computing the weighted average lag of input influences on outputs.
Details
Motivation: To understand how much trained RL policies actually use their past observations, and to provide a practical way to measure memory dependence for comparing agents and environments.Method: Propose Temporal Range metric computed via reverse-mode automatic differentiation from Jacobian blocks ∂y_s/∂x_t, treating first-order sensitivities as temporal influence profiles and summarizing with magnitude-weighted average lag.
Result: Temporal Range (i) remains small in fully observed control, (ii) scales with task’s ground-truth lag in Copy-k, and (iii) aligns with minimum history window needed for near-optimal return as confirmed by window ablations.
Conclusion: Temporal Range offers a practical per-sequence readout of memory dependence for comparing agents and environments and for selecting the shortest sufficient context.
Abstract: How much does a trained RL policy actually use its past observations? We propose \emph{Temporal Range}, a model-agnostic metric that treats first-order sensitivities of multiple vector outputs across a temporal window to the input sequence as a temporal influence profile and summarizes it by the magnitude-weighted average lag. Temporal Range is computed via reverse-mode automatic differentiation from the Jacobian blocks $\partial y_s/\partial x_t\in\mathbb{R}^{c\times d}$ averaged over final timesteps $s\in{t+1,\dots,T}$ and is well-characterized in the linear setting by a small set of natural axioms. Across diagnostic and control tasks (POPGym; flicker/occlusion; Copy-$k$) and architectures (MLPs, RNNs, SSMs), Temporal Range (i) remains small in fully observed control, (ii) scales with the task’s ground-truth lag in Copy-$k$, and (iii) aligns with the minimum history window required for near-optimal return as confirmed by window ablations. We also report Temporal Range for a compact Long Expressive Memory (LEM) policy trained on the task, using it as a proxy readout of task-level memory. Our axiomatic treatment draws on recent work on range measures, specialized here to temporal lag and extended to vector-valued outputs in the RL setting. Temporal Range thus offers a practical per-sequence readout of memory dependence for comparing agents and environments and for selecting the shortest sufficient context.
[580] Average-reward reinforcement learning in semi-Markov decision processes via relative value iteration
Huizhen Yu, Yi Wan, Richard S. Sutton
Main category: cs.LG
TL;DR: Applies asynchronous stochastic approximation to RVI Q-learning for average-reward SMDPs, establishing convergence to solutions of the optimality equation with new monotonicity conditions.
Details
Motivation: To extend recent asynchronous stochastic approximation results to reinforcement learning in average-reward semi-Markov decision processes, addressing convergence issues in RVI Q-learning algorithms.Method: Applies Borkar-Meyn framework for asynchronous stochastic approximation to Schweitzer’s relative value iteration algorithm (RVI Q-learning) for finite-space, weakly communicating SMDPs with new monotonicity conditions.
Result: Establishes almost sure convergence to a compact, connected subset of solutions to the average-reward optimality equation, with convergence to unique sample path-dependent solutions under additional conditions.
Conclusion: The framework substantially expands algorithmic possibilities for RVI Q-learning through novel monotonicity conditions and stability analysis, making full use of the stochastic approximation framework.
Abstract: This paper applies the authors’ recent results on asynchronous stochastic approximation (SA) in the Borkar-Meyn framework to reinforcement learning in average-reward semi-Markov decision processes (SMDPs). We establish the convergence of an asynchronous SA analogue of Schweitzer’s classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. In particular, we show that the algorithm converges almost surely to a compact, connected subset of solutions to the average-reward optimality equation, with convergence to a unique, sample path-dependent solution under additional stepsize and asynchrony conditions. Moreover, to make full use of the SA framework, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework and are addressed through novel arguments in the stability and convergence analysis of RVI Q-learning.
[581] Back to Author Console Empowering GNNs for Domain Adaptation via Denoising Target Graph
Haiyang Yu, Meng-Chieh Lee, Xiang song, Qi Zhu, Christos Faloutsos
Main category: cs.LG
TL;DR: GraphDeT framework improves GNN node classification in domain adaptation by adding edge denoising auxiliary loss, which tightens generalization bounds and handles structure domain shifts.
Details
Motivation: Graph domain adaptation faces challenges with structure domain shifts when graph data comes from different times or regions, causing poor GNN performance on target graphs. Existing methods struggle with these structural variations.Method: Propose GraphDeT framework that integrates an auxiliary edge denoising task into GNN training for node classification. The auxiliary loss denoises graph edges on target graphs, which theoretically tightens the graph generalization bound with -distance.
Result: Experimental results show superior performance compared to existing baselines in handling both time-based and regional domain graph shifts, demonstrating effectiveness of the edge denoising auxiliary task.
Conclusion: Simple edge denoising auxiliary loss significantly improves GNN generalization in graph domain adaptation by constraining the generalization bound, offering an effective solution for structure domain shifts in temporal and spatial graph data.
Abstract: We explore the node classification task in the context of graph domain adaptation, which uses both source and target graph structures along with source labels to enhance the generalization capabilities of Graph Neural Networks (GNNs) on target graphs. Structure domain shifts frequently occur, especially when graph data are collected at different times or from varying areas, resulting in poor performance of GNNs on target graphs. Surprisingly, we find that simply incorporating an auxiliary loss function for denoising graph edges on target graphs can be extremely effective in enhancing GNN performance on target graphs. Based on this insight, we propose our framework, GraphDeT, a framework that integrates this auxiliary edge task into GNN training for node classification under domain adaptation. Our theoretical analysis connects this auxiliary edge task to the graph generalization bound with -distance, demonstrating such auxiliary task can imposes a constraint which tightens the bound and thereby improves generalization. The experimental results demonstrate superior performance compared to the existing baselines in handling both time and regional domain graph shifts.
[582] Quantization Blindspots: How Model Compression Breaks Backdoor Defenses
Rohan Pandey, Eric Ye
Main category: cs.LG
TL;DR: Backdoor defenses fail under standard quantization (INT8/INT4) while attacks remain effective, revealing a critical mismatch between defense evaluation (FP32) and real deployment (quantized models).
Details
Motivation: Real-world ML deployments use quantized models (INT8/INT4) for efficiency, but existing backdoor defenses are evaluated on full-precision (FP32) models, creating a dangerous gap between research evaluation and practical deployment.Method: Systematic empirical study of five representative backdoor defenses across three precision settings (FP32, INT8 dynamic, INT4 simulated) on two vision benchmarks (GTSRB, CIFAR-10) using BadNet attack.
Result: INT8 quantization reduces all defenses’ detection rate to 0% while maintaining attack success >99%. INT4 shows dataset dependence: Neural Cleanse works on GTSRB but fails on CIFAR-10, though attacks survive quantization with >90% success.
Conclusion: Quantization robustness must be a necessary evaluation axis for backdoor defenses, as current defenses fail on quantized models while attacks remain potent, exposing critical vulnerability in real-world ML security.
Abstract: Backdoor attacks embed input-dependent malicious behavior into neural networks while preserving high clean accuracy, making them a persistent threat for deployed ML systems. At the same time, real-world deployments almost never serve full-precision models: post-training quantization to INT8 or lower precision is now standard practice for reducing memory and latency. This work asks a simple question: how do existing backdoor defenses behave under standard quantization pipelines? We conduct a systematic empirical study of five representative defenses across three precision settings (FP32, INT8 dynamic, INT4 simulated) and two standard vision benchmarks using a canonical BadNet attack. We observe that INT8 quantization reduces the detection rate of all evaluated defenses to 0% while leaving attack success rates above 99%. For INT4, we find a pronounced dataset dependence: Neural Cleanse remains effective on GTSRB but fails on CIFAR-10, even though backdoors continue to survive quantization with attack success rates above 90%. Our results expose a mismatch between how defenses are commonly evaluated (on FP32 models) and how models are actually deployed (in quantized form), and they highlight quantization robustness as a necessary axis in future evaluations and designs of backdoor defenses.
[583] Auto-exploration for online reinforcement learning
Caleb Ju, Guanghui Lan
Main category: cs.LG
TL;DR: New parameter-free RL methods with auto-exploration achieve O(ε⁻²) sample complexity without requiring knowledge of problem-dependent parameters.
Details
Motivation: Existing RL algorithms require sufficient exploration assumptions that yield non-implementable algorithms and suboptimal performance, needing a priori knowledge of problem-dependent parameters.Method: Introduce auto-exploration methods that automatically explore state and action spaces parameter-free. Two variants: tabular setting and linear function approximation. Key innovations: dynamic mixing time, discounted state distribution for sampling, simple robust gradient estimator, and advantage gap function for convergence certification.
Result: Both methods achieve O(ε⁻²) sample complexity to solve to ε error under algorithm-independent assumptions of an exploring optimal policy. Complexities are novel as they avoid algorithm-dependent parameters that could be arbitrarily large in prior works.
Conclusion: The methods are simple to implement (parameter-free, no direct parameter estimation) and overcome limitations of existing RL algorithms through new algorithmic innovations for efficient exploration-exploitation.
Abstract: The exploration-exploitation dilemma in reinforcement learning (RL) is a fundamental challenge to efficient RL algorithms. Existing algorithms for finite state and action discounted RL problems address this by assuming sufficient exploration over both state and action spaces. However, this yields non-implementable algorithms and sub-optimal performance. To resolve these limitations, we introduce a new class of methods with auto-exploration, or methods that automatically explore both state and action spaces in a parameter-free way, i.e.,~without a priori knowledge of problem-dependent parameters. We present two variants: one for the tabular setting and one for linear function approximation. Under algorithm-independent assumptions on the existence of an exploring optimal policy, both methods attain $O(ε^{-2})$ sample complexity to solve to $ε$ error. Crucially, these complexities are novel since they are void of algorithm-dependent parameters seen in prior works, which may be arbitrarily large. The methods are also simple to implement because they are parameter-free and do not directly estimate the unknown parameters. These feats are achieved by new algorithmic innovations for RL, including a dynamic mixing time, a discounted state distribution for sampling, a simple robust gradient estimator, and a recent advantage gap function to certify convergence.
[584] Learning When to Switch: Adaptive Policy Selection via Reinforcement Learning
Chris Tava
Main category: cs.LG
TL;DR: RL-based adaptive switching between exploration and goal-directed navigation outperforms fixed strategies in maze solving.
Details
Motivation: Autonomous agents need to switch between strategies for complex tasks, but determining optimal switching points is challenging. Fixed thresholds are suboptimal and require domain knowledge.Method: Q-learning to learn adaptive switching thresholds between systematic exploration (coverage) and goal-directed pathfinding. State space discretized into coverage and distance buckets. Only requires maze dimensions and target location, no wall knowledge or hand-crafted heuristics.
Result: Across 240 test configurations, adaptive threshold learning showed 23-55% improvements in completion time, 83% reduction in runtime variance, and 71% improvement in worst-case scenarios. Performance gains scale with problem complexity.
Conclusion: Adaptive policy switching via reinforcement learning outperforms fixed strategies, with benefits increasing with problem complexity. The approach generalizes to unseen maze configurations within size classes.
Abstract: Autonomous agents often require multiple strategies to solve complex tasks, but determining when to switch between strategies remains challenging. This research introduces a reinforcement learning technique to learn switching thresholds between two orthogonal navigation policies. Using maze navigation as a case study, this work demonstrates how an agent can dynamically transition between systematic exploration (coverage) and goal-directed pathfinding (convergence) to improve task performance. Unlike fixed-threshold approaches, the agent uses Q-learning to adapt switching behavior based on coverage percentage and distance to goal, requiring only minimal domain knowledge: maze dimensions and target location. The agent does not require prior knowledge of wall positions, optimal threshold values, or hand-crafted heuristics; instead, it discovers effective switching strategies dynamically during each run. The agent discretizes its state space into coverage and distance buckets, then adapts which coverage threshold (20-60%) to apply based on observed progress signals. Experiments across 240 test configurations (4 maze sizes from 16$\times$16 to 128$\times$128 $\times$ 10 unique mazes $\times$ 6 agent variants) demonstrate that adaptive threshold learning outperforms both single-strategy agents and fixed 40% threshold baselines. Results show 23-55% improvements in completion time, 83% reduction in runtime variance, and 71% improvement in worst-case scenarios. The learned switching behavior generalizes within each size class to unseen wall configurations. Performance gains scale with problem complexity: 23% improvement for 16$\times$16 mazes, 34% for 32$\times$32, and 55% for 64$\times$64, demonstrating that as the space of possible maze structures grows, the value of adaptive policy selection over fixed heuristics increases proportionally.
[585] Learning Without Time-Based Embodiment Resets in Soft-Actor Critic
Homayoon Farrahi, A. Rupam Mahmood
Main category: cs.LG
TL;DR: Researchers investigate learning without episode terminations and robot resets using SAC, showing that continuing SAC performs comparably to episodic SAC with modified rewards, and that policy entropy increases can compensate for missing embodiment resets.
Details
Motivation: Standard RL practices use episode terminations and environment resets which create unnatural task setups and hinder real-world long-term performance. The paper explores learning without these accessories to enable more realistic continuous operation.Method: Develop continuing version of Soft Actor-Critic (SAC) algorithm, modify reward functions of existing tasks, analyze failure modes on Gym Reacher task without resets, and propose entropy increase intervention when performance degrades.
Result: Continuing SAC performs as well or better than episodic SAC with reduced sensitivity to discount rate γ. Embodiment resets aid exploration; without them, learning fails or slows. Increasing policy entropy recovers performance lost from missing resets.
Conclusion: RL can work without episode terminations and resets with proper algorithm modifications. Embodiment resets primarily help exploration, and performance can be recovered through entropy-based interventions, enabling more realistic continuous learning.
Abstract: When creating new reinforcement learning tasks, practitioners often accelerate the learning process by incorporating into the task several accessory components, such as breaking the environment interaction into independent episodes and frequently resetting the environment. Although they can enable the learning of complex intelligent behaviors, such task accessories can result in unnatural task setups and hinder long-term performance in the real world. In this work, we explore the challenges of learning without episode terminations and robot embodiment resets using the Soft Actor-Critic (SAC) algorithm. To learn without terminations, we present a continuing version of the SAC algorithm and show that, with simple modifications to the reward functions of existing tasks, continuing SAC can perform as well as or better than episodic SAC while reducing the sensitivity of performance to the value of the discount rate $γ$. On a modified Gym Reacher task, we investigate possible explanations for the failure of continuing SAC when learning without embodiment resets. Our results suggest that embodiment resets help with exploration of the state space in the SAC algorithm, and removing embodiment resets can lead to poor exploration of the state space and failure of or significantly slower learning. Finally, on additional simulated tasks and a real-robot vision task, we show that increasing the entropy of the policy when performance trends worse or remains static is an effective intervention for recovering the performance lost due to not using embodiment resets.
[586] Optimizing Day-Ahead Energy Trading with Proximal Policy Optimization and Blockchain
Navneet Verma, Ying Xie
Main category: cs.LG
TL;DR: PPO reinforcement learning + blockchain framework for prosumer energy trading optimization in day-ahead markets, achieving 2% demand-supply balance with ERCOT data.
Details
Motivation: Renewable energy integration creates challenges in market balancing, grid resilience, and trust in decentralized trading systems that need to be addressed.Method: Integrates Proximal Policy Optimization (PPO) RL algorithm with blockchain technology for multi-objective energy optimization and tamper-proof transaction management.
Result: RL agent achieves demand-supply balancing within 2%, maintains near-optimal supply costs for most operating hours, and generates robust battery storage policies for solar/wind variability.
Conclusion: The framework provides transparent, auditable, and secure energy trading through blockchain integration, with novel architecture, curriculum learning, and practical deployment insights.
Abstract: The increasing penetration of renewable energy sources in day-ahead energy markets introduces challenges in balancing supply and demand, ensuring grid resilience, and maintaining trust in decentralized trading systems. This paper proposes a novel framework that integrates the Proximal Policy Optimization (PPO) algorithm, a state-of-the-art reinforcement learning method, with blockchain technology to optimize automated trading strategies for prosumers in day-ahead energy markets. We introduce a comprehensive framework that employs RL agent for multi-objective energy optimization and blockchain for tamper-proof data and transaction management. Simulations using real-world data from the Electricity Reliability Council of Texas (ERCOT) demonstrate the effectiveness of our approach. The RL agent achieves demand-supply balancing within 2% and maintains near-optimal supply costs for the majority of the operating hours. Moreover, it generates robust battery storage policies capable of handling variability in solar and wind generation. All decisions are recorded on an Algorand-based blockchain, ensuring transparency, auditability, and security - key enablers for trustworthy multi-agent energy trading. Our contributions include a novel system architecture, curriculum learning for robust agent development, and actionable policy insights for practical deployment.
[587] Networked Restless Multi-Arm Bandits with Reinforcement Learning
Hanmo Zhang, Zenghui Sun, Kai Wang
Main category: cs.LG
TL;DR: Networked RMAB integrates restless bandits with independent cascade model to capture arm interactions in networks, with efficient Q-learning algorithm achieving 1-1/e approximation guarantee.
Details
Motivation: Traditional RMABs assume arm independence, but real-world applications (like public health interventions) involve significant interactions between individuals that affect decision outcomes.Method: Combine RMAB with independent cascade model to capture network effects; establish submodularity of Bellman equation; use hill-climbing for 1-1/e approximation; develop efficient Q-learning algorithm for networked setting.
Result: Prove convergence of approximate Bellman updates via modified contraction analysis; experimental results on real-world graphs show Q-learning outperforms k-step look-ahead and network-blind approaches.
Conclusion: Networked RMAB framework effectively captures arm interactions, with efficient algorithms that leverage network effects to improve decision-making in sequential resource allocation problems.
Abstract: Restless Multi-Armed Bandits (RMABs) are a powerful framework for sequential decision-making, widely applied in resource allocation and intervention optimization challenges in public health. However, traditional RMABs assume independence among arms, limiting their ability to account for interactions between individuals that can be common and significant in a real-world environment. This paper introduces Networked RMAB, a novel framework that integrates the RMAB model with the independent cascade model to capture interactions between arms in networked environments. We define the Bellman equation for networked RMAB and present its computational challenge due to exponentially large action and state spaces. To resolve the computational challenge, we establish the submodularity of Bellman equation and apply the hill-climbing algorithm to achieve a $1-\frac{1}{e}$ approximation guarantee in Bellman updates. Lastly, we prove that the approximate Bellman updates are guaranteed to converge by a modified contraction analysis. We experimentally verify these results by developing an efficient Q-learning algorithm tailored to the networked setting. Experimental results on real-world graph data demonstrate that our Q-learning approach outperforms both $k$-step look-ahead and network-blind approaches, highlighting the importance of capturing and leveraging network effects where they exist.
[588] Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin, Zongyi Li, Zihan Xu, Yuchen Shi, Siqi Cai, Renting Rui, Shaofei Cai, Yuzheng Cai, Xuan Zhang, Sheng Ye, Ke Li, Xing Sun
Main category: cs.LG
TL;DR: SPEAR is a self-imitation learning method that balances exploration-exploitation for agentic LLMs using curriculum scheduling of policy entropy, achieving significant performance gains with minimal overhead.
Details
Motivation: Current RL methods for agentic LLMs struggle with exploration-exploitation trade-offs. Mechanical entropy maximization causes instability due to multi-turn distribution shifting. The paper aims to achieve progressive balance guided by agent experiences without entropy collapse or divergence.Method: SPEAR extends vanilla self-imitation learning with curriculum scheduling of policy entropy. It harmonizes intrinsic reward shaping and self-imitation to: 1) expedite early exploration through frequent tool interactions, and 2) strengthen exploitation of successful tactics as environment familiarity increases. Combined with industrial RL optimizations (Dr.BoT baseline).
Result: SPEAR increases success rates significantly across multiple benchmarks: ALFWorld (up to 16.1%/5.1%/8.6% for GRPO/GiGPO/Dr.BoT), WebShop (up to 20.7%/11.8%/13.9%), AIME24 (up to 3.8% boost), and AIME25 (up to 6.1% boost). Gains require only 10%-25% extra theoretical complexity with negligible runtime overhead.
Conclusion: SPEAR provides an effective plug-and-play solution for exploration-exploitation balance in agentic LLMs, demonstrating scalability and practical efficiency across diverse benchmarks while maintaining stability through curriculum-based entropy scheduling.
Abstract: Reinforcement learning (RL) is the dominant paradigm for sharpening strategic tool use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces a fundamental challenge of exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL instability due to the multi-turn distribution shifting. In this paper, we target the progressive exploration-exploitation balance under the guidance of the agent’s own experiences without succumbing to either entropy collapsing or runaway divergence. We propose SPEAR, a self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL, where a replay buffer stores good experience for off-policy update, by gradually steering the policy entropy across stages. Specifically, the proposed curriculum scheduling harmonizes intrinsic reward shaping and self-imitation to 1) expedite exploration via frequent tool interactions at the beginning, and 2) strengthen exploitation of successful tactics upon convergence towards familiarity with the environment. We also combine bag-of-tricks of industrial RL optimizations for a strong baseline Dr.BoT to demonstrate our effectiveness. In ALFWorld and WebShop, SPEAR increases the success rates of GRPO/GiGPO/Dr.BoT by up to 16.1%/5.1%/8.6% and 20.7%/11.8%/13.9%, respectively. In AIME24 and AIME25, SPEAR boosts Dr.BoT by up to 3.8% and 6.1%, respectively. Such gains incur only 10%-25% extra theoretical complexity and negligible runtime overhead in practice, demonstrating the plug-and-play scalability of SPEAR.
[589] Theoretical Compression Bounds for Wide Multilayer Perceptrons
Houssam El Cheairi, David Gamarnik, Rahul Mazumder
Main category: cs.LG
TL;DR: The paper provides theoretical justification for pruning/quantization success in neural networks, showing existence of compressed subnetworks with competitive performance via randomized greedy algorithm.
Details
Motivation: Despite empirical success of pruning and quantization techniques for neural network compression, there's insufficient theoretical justification for why these methods work so well in practice.Method: Proposes a randomized greedy compression algorithm for post-training pruning and quantization, extending to structured pruning for MLPs and CNNs. The algorithm resembles a randomized version of Optimal Brain Damage (OBD).
Result: Rigorously proves existence of pruned/quantized subnetworks with competitive performance, establishes tradeoff between compressibility and network width, and provides unified analysis of pruning in wide networks without data assumptions.
Conclusion: The theoretical results bridge gap between theory and application for pruning/quantization, justifying empirical success of compression in wide multilayer perceptrons.
Abstract: Pruning and quantization techniques have been broadly successful in reducing the number of parameters needed for large neural networks, yet theoretical justification for their empirical success falls short. We consider a randomized greedy compression algorithm for pruning and quantization post-training and use it to rigorously show the existence of pruned/quantized subnetworks of multilayer perceptrons (MLPs) with competitive performance. We further extend our results to structured pruning of MLPs and convolutional neural networks (CNNs), thus providing a unified analysis of pruning in wide networks. Our results are free of data assumptions, and showcase a tradeoff between compressibility and network width. The algorithm we consider bears some similarities with Optimal Brain Damage (OBD) and can be viewed as a post-training randomized version of it. The theoretical results we derive bridge the gap between theory and application for pruning/quantization, and provide a justification for the empirical success of compression in wide multilayer perceptrons.
[590] Importance-aware Topic Modeling for Discovering Public Transit Risk from Noisy Social Media
Fatima Ashraf, Muhammad Ayub Sabir, Jiaxin Deng, Junbiao Pang, Haitao Yu
Main category: cs.LG
TL;DR: A novel topic modeling framework for social media transit monitoring that jointly models linguistic interactions and user influence to detect sparse service risks from routine chatter.
Details
Motivation: Urban transit agencies need to monitor social media for emerging service risks (crowding, delays, safety incidents), but these signals are sparse, short, and easily drowned out by routine chatter, requiring better detection methods.Method: Constructs influence-weighted keyword co-occurrence graph, uses Poisson Deconvolution Factorization (PDF) to decompose graph into low-rank topical structure and topic-localized residuals, with decorrelation regularizer for distinct topics and coherence-driven topic selection.
Result: Achieves state-of-the-art topic coherence and strong diversity on large-scale social streams compared to leading baselines.
Conclusion: The proposed framework effectively detects sparse transit service risks from social media by modeling both linguistic interactions and user influence, providing interpretable topics for transit monitoring.
Abstract: Urban transit agencies increasingly turn to social media to monitor emerging service risks such as crowding, delays, and safety incidents, yet the signals of concern are sparse, short, and easily drowned by routine chatter. We address this challenge by jointly modeling linguistic interactions and user influence. First, we construct an influence-weighted keyword co-occurrence graph from cleaned posts so that socially impactful posts contributes proportionally to the underlying evidence. The core of our framework is a Poisson Deconvolution Factorization (PDF) that decomposes this graph into a low-rank topical structure and topic-localized residual interactions, producing an interpretable topic–keyword basis together with topic importance scores. A decorrelation regularizer \emph{promotes} distinct topics, and a lightweight optimization procedure ensures stable convergence under nonnegativity and normalization constraints. Finally, the number of topics is selected through a coherence-driven sweep that evaluates the quality and distinctness of the learned topics. On large-scale social streams, the proposed model achieves state-of-the-art topic coherence and strong diversity compared with leading baselines. The code and dataset are publicly available at https://github.com/pangjunbiao/Topic-Modeling_ITS.git
[591] Entropic Confinement and Mode Connectivity in Overparameterized Neural Networks
Luca Di Carlo, Chase Goddard, David J. Schwab
Main category: cs.LG
TL;DR: The paper resolves the paradox of connected low-loss basins in neural network landscapes where optimization rarely explores intermediate points, identifying curvature-induced entropic barriers as the key mechanism.
Details
Motivation: To explain why neural network optimization remains confined to single convex basins despite the existence of connecting low-loss paths, addressing the paradox between connectivity and confinement in loss landscapes.Method: Analyzed the interplay between curvature variations along connecting paths and noise in optimization dynamics, identifying entropic barriers arising from systematic curvature increases away from minima.
Result: Found that curvature systematically rises away from minima, creating effective forces that bias noisy dynamics back toward endpoints even with flat loss, with these entropic barriers persisting longer than energetic barriers.
Conclusion: Curvature-induced entropic forces govern both connectivity and confinement in deep learning landscapes, explaining why optimization rarely explores intermediate points despite connected basins.
Abstract: Modern neural networks exhibit a striking property: basins of attraction in the loss landscape are often connected by low-loss paths, yet optimization dynamics generally remain confined to a single convex basin and rarely explore intermediate points. We resolve this paradox by identifying entropic barriers arising from the interplay between curvature variations along these paths and noise in optimization dynamics. Empirically, we find that curvature systematically rises away from minima, producing effective forces that bias noisy dynamics back toward the endpoints - even when the loss remains nearly flat. These barriers persist longer than energetic barriers, shaping the late-time localization of solutions in parameter space. Our results highlight the role of curvature-induced entropic forces in governing both connectivity and confinement in deep learning landscapes.
[592] Chemistry Integrated Language Model using Hierarchical Molecular Representation for Polymer Informatics
Jihun Ahn, Gabriella Pasya Irianti, Vikram Thapar, Su-Mi Hur
Main category: cs.LG
TL;DR: CI-LLM framework uses hierarchical polymer representations (HAPPY) with transformer models for both property prediction (De³BERTa) and inverse design (GPT-based generator), achieving improved accuracy, faster inference, and successful multi-property optimization.
Details
Motivation: Machine learning has successfully transformed material discovery for inorganic compounds and small molecules, but polymers remain largely inaccessible to these methods. While data scarcity is often cited as the bottleneck, the authors believe strategic molecular representations can overcome this limitation.Method: Introduces CI-LLM (Chemically Informed Language Model) framework combining HAPPY (Hierarchically Abstracted rePeat unit of PolYmer) which encodes chemical substructures as tokens with numerical descriptors within transformer architectures. For property prediction, uses De³BERTa (descriptor-enriched encoder). For inverse design, uses a GPT-based generator.
Result: De³BERTa achieves 3.5x faster inference than SMILES-based models with improved accuracy (R² score gains of 0.9-4.1% across four properties). Provides interpretable structure-property insights at subgroup level. GPT-based generator produces polymers with targeted properties, achieving 100% scaffold retention and successful multi-property optimization for negatively correlated objectives.
Conclusion: The comprehensive framework demonstrates both forward prediction and inverse design capabilities, showcasing how strategic molecular representation advances machine learning applications in polymer science, overcoming data scarcity limitations through intelligent chemical encoding.
Abstract: Machine learning has transformed material discovery for inorganic compounds and small molecules, yet polymers remain largely inaccessible to these methods. While data scarcity is often cited as the primary bottleneck, we demonstrate that strategic molecular representations can overcome this limitation. We introduce CI-LLM (Chemically Informed Language Model), a framework combining HAPPY (Hierarchically Abstracted rePeat unit of PolYmer), which encodes chemical substructures as tokens, with numerical descriptors within transformer architectures. For property prediction, De$^3$BERTa, our descriptor-enriched encoder, achieves 3.5x faster inference than SMILES-based models with improved accuracy ($R^2$ score gains of 0.9-4.1 percent across four properties), while providing interpretable structure-property insights at the subgroup level. For inverse design, our GPT-based generator produces polymers with targeted properties, achieving 100 percent scaffold retention and successful multi-property optimization for negatively correlated objectives. This comprehensive framework demonstrates both forward prediction and inverse design capabilities, showcasing how strategic molecular representation advances machine learning applications in polymer science.
[593] Multimodal Graph Neural Networks for Prognostic Modeling of Brain Network Reorganization
Preksha Girish, Rachana Mysore, Kiran K. N., Hiranmayee R., Shipra Prashanth, Shrey Kumar
Main category: cs.LG
TL;DR: A multimodal graph neural network framework integrates structural MRI, DTI, and fMRI to model spatiotemporal brain network reorganization, using fractional stochastic differential operators and attention mechanisms to generate interpretable biomarkers for predicting cognitive decline.
Details
Motivation: Understanding dynamic brain network reorganization is critical for predicting cognitive decline, neurological progression, and individual variability in clinical outcomes. There's a need for mathematically rigorous approaches that can derive clinically meaningful biomarkers from existing imaging data without requiring new data collection.Method: Proposes a multimodal graph neural network framework integrating structural MRI, diffusion tensor imaging (DTI), and functional MRI. Brain regions are nodes, structural/functional connectivity are edges, forming longitudinal brain graphs. Uses fractional stochastic differential operators within graph-based recurrent networks to capture temporal evolution, modeling long-term dependencies and stochastic fluctuations. Attention mechanisms fuse multimodal information and generate interpretable biomarkers including network energy entropy, graph curvature, fractional memory indices, and modality-specific attention scores.
Result: Experiments on longitudinal neuroimaging datasets demonstrate both predictive accuracy and interpretability. The framework generates a composite prognostic index to quantify individual risk of network instability or cognitive decline.
Conclusion: The results highlight the potential of mathematically rigorous, multimodal graph-based approaches for deriving clinically meaningful biomarkers from existing imaging data without requiring new data collection, enabling better prediction of cognitive decline and neurological progression.
Abstract: Understanding the dynamic reorganization of brain networks is critical for predicting cognitive decline, neurological progression, and individual variability in clinical outcomes. This work proposes a multimodal graph neural network framework that integrates structural MRI, diffusion tensor imaging, and functional MRI to model spatiotemporal brain network reorganization. Brain regions are represented as nodes and structural and functional connectivity as edges, forming longitudinal brain graphs for each subject. Temporal evolution is captured via fractional stochastic differential operators embedded within graph-based recurrent networks, enabling the modeling of long-term dependencies and stochastic fluctuations in network dynamics. Attention mechanisms fuse multimodal information and generate interpretable biomarkers, including network energy entropy, graph curvature, fractional memory indices, and modality-specific attention scores. These biomarkers are combined into a composite prognostic index to quantify individual risk of network instability or cognitive decline. Experiments on longitudinal neuroimaging datasets demonstrate both predictive accuracy and interpretability. The results highlight the potential of mathematically rigorous, multimodal graph-based approaches for deriving clinically meaningful biomarkers from existing imaging data without requiring new data collection.
[594] Interpretive Efficiency: Information-Geometric Foundations of Data Usefulness
Ronald Katende
Main category: cs.LG
TL;DR: Proposes Interpretive Efficiency, a normalized metric measuring how effectively data support interpretive representations by quantifying task-relevant information transmitted through interpretive channels.
Details
Motivation: Interpretability is crucial for trustworthy ML, but existing metrics rarely quantify how effectively data support interpretive representations. There's a need for theory-backed diagnostic measures for representation design.Method: Defines Interpretive Efficiency as a normalized, task-aware functional grounded in five axioms (boundedness, Blackwell-style monotonicity, data-processing stability, admissible invariance, asymptotic consistency). Relates it to mutual information, derives local Fisher-geometric expansion, and establishes estimation guarantees using empirical-process tools.
Result: Experiments on image and signal tasks show the measure recovers theoretical orderings, exposes representational redundancy masked by accuracy, and correlates with robustness.
Conclusion: Interpretive Efficiency provides a practical, theory-backed diagnostic tool for representation design in interpretable machine learning.
Abstract: Interpretability is central to trustworthy machine learning, yet existing metrics rarely quantify how effectively data support an interpretive representation. We propose Interpretive Efficiency, a normalized, task-aware functional that measures the fraction of task-relevant information transmitted through an interpretive channel. The definition is grounded in five axioms ensuring boundedness, Blackwell-style monotonicity, data-processing stability, admissible invariance, and asymptotic consistency. We relate the functional to mutual information and derive a local Fisher-geometric expansion, then establish asymptotic and finite-sample estimation guarantees using standard empirical-process tools. Experiments on controlled image and signal tasks demonstrate that the measure recovers theoretical orderings, exposes representational redundancy masked by accuracy, and correlates with robustness, making it a practical, theory-backed diagnostic for representation design.
[595] When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models
Tong Xie, Andrew Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, Cho-jui Hsieh
Main category: cs.LG
TL;DR: The paper identifies a limitation in the standard Bradley-Terry loss for reward modeling: gradient norms are influenced by representation distance, causing small-distance pairs to receive weak updates. They propose NormBT, a normalization scheme that balances this effect and improves performance.
Details
Motivation: The standard Bradley-Terry loss used in RLHF reward modeling has a hidden issue: gradient norms scale with representation distance between response pairs, not just prediction error. This causes vanishing updates for small-distance pairs (where fine-grained distinctions matter) and disproportionately strong updates for large-distance pairs, misaligning learning.Method: The authors analyze the per-sample gradient of BT-loss and identify two scaling components: prediction error difference and representation distance. They propose NormBT, an adaptive pairwise normalization scheme that balances representation-driven effects and focuses learning signals on prediction error. NormBT is a lightweight, drop-in replacement for BT loss.
Result: NormBT consistently improves reward model performance across various LLM backbones and datasets. It achieves notable gains of over 5% on the Reasoning category of RewardBench, which contains numerous small-distance pairs where the original BT loss struggles.
Conclusion: The work reveals a key limitation in the widely used Bradley-Terry objective for reward modeling and provides a simple, effective correction through NormBT. This addresses the representation distance bias in gradient updates and improves learning, especially for fine-grained distinctions in small-distance pairs.
Abstract: Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of a pair of chosen and rejected responses. In this work, we analyze the per-sample gradient of BT-loss and show that its norm scales with two distinct components: (1) the difference in predicted rewards between chosen and rejected responses, which reflects the prediction error, and critically, (2) representation distance between the pair measured in the output space of the final layer. While the first term captures the intended training signal, we show that the second term can significantly impact the update magnitude and misalign learning. Specifically, pairs with small representation distance often receive vanishingly weak updates, even when misranked, while pairs with large distance receive disproportionately strong updates. This leads to gradients from large-distance pairs to overshadow those from small-distance pairs, where fine-grained distinctions are especially important. To overcome this limitation, we propose NormBT, an adaptive pair-wise normalization scheme that balances representation-driven effects and focuses learning signals on prediction error. NormBT is a lightweight, drop-in integration to BT loss with negligible overhead. Across various LLM backbones and datasets, NormBT improves reward model performance consistently, with notable gains of over 5% on the Reasoning category of RewardBench, which contains numerous small-distance pairs. This work reveals a key limitation in the widely used BT objective and provides a simple, effective correction.
[596] Zero Generalization Error Theorem for Random Interpolators via Algebraic Geometry
Naoki Yoshida, Isao Ishikawa, Masaaki Imaizumi
Main category: cs.LG
TL;DR: Randomly sampled interpolators achieve zero generalization error when training samples exceed a threshold determined by the geometric structure of the interpolator set.
Details
Motivation: Understanding why large-scale models like DNNs generalize well despite overparameterization. While prior work attributed this to SGD's implicit bias, empirical evidence suggests it's a property of the model itself - even random interpolators generalize effectively.Method: Using teacher-student framework and algebraic geometry tools to mathematically characterize the geometric structure of interpolator sets in parameter space. Proving that generalization error becomes exactly zero when training samples exceed a threshold determined by this geometry.
Result: Theoretical demonstration that generalization error of interpolators becomes 0 once number of training samples exceeds a threshold. The threshold is determined by the geometric structure of the interpolator set in parameter space.
Conclusion: Generalization ability of large models stems from properties of the model architecture itself (geometric structure of interpolator sets) rather than just optimization algorithms like SGD. This provides theoretical explanation for empirical observations of random interpolators generalizing well.
Abstract: We theoretically demonstrate that the generalization error of interpolators for machine learning models under teacher-student settings becomes 0 once the number of training samples exceeds a certain threshold. Understanding the high generalization ability of large-scale models such as deep neural networks (DNNs) remains one of the central open problems in machine learning theory. While recent theoretical studies have attributed this phenomenon to the implicit bias of stochastic gradient descent (SGD) toward well-generalizing solutions, empirical evidences indicate that it primarily stems from properties of the model itself. Specifically, even randomly sampled interpolators, which are parameters that achieve zero training error, have been observed to generalize effectively. In this study, under a teacher-student framework, we prove that the generalization error of randomly sampled interpolators becomes exactly zero once the number of training samples exceeds a threshold determined by the geometric structure of the interpolator set in parameter space. As a proof technique, we leverage tools from algebraic geometry to mathematically characterize this geometric structure.
[597] LLM-Upgraded Graph Reinforcement Learning for Carbon-Aware Job Scheduling in Smart Manufacturing
Zhiying Yang, Fang Liu, Wei Zhang, Xin Lou, Malcolm Yoke Hean Low, Boon Ping Gan
Main category: cs.LG
TL;DR: LUCA is an LLM-enhanced graph reinforcement learning framework for carbon-aware flexible job shop scheduling that combines GNN and LLM embeddings with DRL to optimize both makespan and carbon emissions.
Details
Motivation: Addresses challenges of dynamic and sustainable scheduling in smart manufacturing systems, aiming to optimize both production efficiency (makespan) and environmental sustainability (carbon emissions).Method: Integrates graph neural network and large language model with in-house prompting strategy to create fused embeddings capturing structural and semantic scheduling state information, then uses deep reinforcement learning policy network for real-time scheduling decisions.
Result: Outperforms comparison algorithms on synthetic and public datasets, achieving 4.1% average and up to 12.2% lower makespan on synthetic data while maintaining same emission levels, with additional gains on public datasets.
Conclusion: LUCA is an effective and practical framework for carbon-aware scheduling in smart manufacturing, successfully balancing production efficiency and environmental sustainability objectives.
Abstract: This paper presents \textsc{Luca}, a \underline{l}arge language model (LLM)-\underline{u}pgraded graph reinforcement learning framework for \underline{c}arbon-\underline{a}ware flexible job shop scheduling. \textsc{Luca} addresses the challenges of dynamic and sustainable scheduling in smart manufacturing systems by integrating a graph neural network and an LLM, guided by a carefully designed in-house prompting strategy, to produce a fused embedding that captures both structural characteristics and contextual semantics of the latest scheduling state. This expressive embedding is then processed by a deep reinforcement learning policy network, which generates real-time scheduling decisions optimized for both makespan and carbon emission objectives. To support sustainability goals, \textsc{Luca} incorporates a dual-objective reward function that encourages both energy efficiency and scheduling timeliness. Experimental results on both synthetic and public datasets demonstrate that \textsc{Luca} consistently outperforms comparison algorithms. For instance, on the synthetic dataset, it achieves an average of 4.1% and up to 12.2% lower makespan compared to the best-performing comparison algorithm while maintaining the same emission level. On public datasets, additional gains are observed for both makespan and emission. These results demonstrate that \textsc{Luca} is effective and practical for carbon-aware scheduling in smart manufacturing.
[598] DDFI: Diverse and Distribution-aware Missing Feature Imputation via Two-step Reconstruction
Yifan Song, Fenglin Yu, Yihong Luo, Xingjian Tao, Siya Qiu, Kai Han, Jing Tang
Main category: cs.LG
TL;DR: DDFI: A novel method for imputing missing node features in graphs that combines feature propagation with graph-based Masked AutoEncoder, addressing limitations of existing approaches and introducing a new dataset with naturally missing features.
Details
Motivation: Real-world graphs often have incomplete node features (e.g., private user attributes), which significantly degrades GNN performance. Existing feature propagation methods struggle with disconnected graphs, over-smoothing, and feature distribution shifts in inductive tasks.Method: DDFI combines feature propagation with graph-based Masked AutoEncoder. It introduces Co-Label Linking (CLL) to connect nodes with same labels in training set, and uses a two-step inference process: FP-imputed features are further reconstructed through MAE to reduce distribution shift and enhance feature diversity.
Result: Extensive experiments on six public datasets and the new Sailing dataset (with naturally missing features) show DDFI outperforms state-of-the-art methods under both transductive and inductive settings.
Conclusion: DDFI effectively addresses three key issues in feature imputation for graphs, provides a more realistic evaluation framework with naturally missing features, and demonstrates superior performance across various settings.
Abstract: Incomplete node features are ubiquitous in real-world scenarios, e.g., the attributes of web users may be partly private, which causes the performance of Graph Neural Networks (GNNs) to decline significantly. Feature propagation (FP) is a well-known method that performs well for imputation of missing node features on graphs, but it still has the following three issues: 1) it struggles with graphs that are not fully connected, 2) imputed features face the over-smoothing problem, and 3) FP is tailored for transductive tasks, overlooking the feature distribution shift in inductive tasks. To address these challenges, we introduce DDFI, a Diverse and Distribution-aware Missing Feature Imputation method that combines feature propagation with a graph-based Masked AutoEncoder (MAE) in a nontrivial manner. It first designs a simple yet effective algorithm, namely Co-Label Linking (CLL), that randomly connects nodes in the training set with the same label to enhance the performance on graphs with numerous connected components. Then we develop a novel two-step representation generation process at the inference stage. Specifically, instead of directly using FP-imputed features as input during inference, DDFI further reconstructs the features through the whole MAE to reduce feature distribution shift in the inductive tasks and enhance the diversity of node features. Meanwhile, since existing feature imputation methods for graphs only evaluate by simulating the missing scenes with manually masking the features, we collect a new dataset called Sailing from the records of voyages that contains naturally missing features to help better evaluate the effectiveness. Extensive experiments conducted on six public datasets and Sailing show that DDFI outperforms the state-of-the-art methods under both transductive and inductive settings.
[599] Proportional integral derivative booster for neural networks-based time-series prediction: Case of water demand prediction
Tony Sallooma, Okyay Kaynak, Xinbo Yub, Wei He
Main category: cs.LG
TL;DR: A PID control-inspired method boosts neural network accuracy for multi-step time-series prediction while maintaining low complexity, demonstrated on water demand and energy consumption forecasting.
Details
Motivation: Multi-step time-series prediction is crucial for industrial decision-making, but neural network complexity often compromises prediction accuracy. There's a need to enhance neural network performance for periodic time-series forecasting without significantly increasing system complexity.Method: Proposes a PID (proportional-integral-derivative) control-inspired boosting method applied to neural network predictions. The PID-based approach adjusts predicted values at each time step to bring them closer to real values. Validated on water demand forecasting using two deep neural network models and extended to hourly energy consumption prediction.
Result: The PID-based booster significantly improves prediction accuracy compared to original neural network models while maintaining negligible impact on system complexity. Demonstrated effectiveness across both water demand and energy consumption forecasting problems.
Conclusion: The PID-inspired boosting method successfully enhances neural network performance for multi-step periodic time-series prediction, offering superior accuracy-complexity trade-off compared to standalone neural network approaches.
Abstract: Multi-step time-series prediction is an essential supportive step for decision-makers in several industrial areas. Artificial intelligence techniques, which use a neural network component in various forms, have recently frequently been used to accomplish this step. However, the complexity of the neural network structure still stands up as a critical problem against prediction accuracy. In this paper, a method inspired by the proportional-integral-derivative (PID) control approach is investigated to enhance the performance of neural network models used for multi-step ahead prediction of periodic time-series information while maintaining a negligible impact on the complexity of the system. The PID-based method is applied to the predicted value at each time step to bring that value closer to the real value. The water demand forecasting problem is considered as a case study, where two deep neural network models from the literature are used to prove the effectiveness of the proposed boosting method. Furthermore, to prove the applicability of this PID-based booster to other types of periodic time-series prediction problems, it is applied to enhance the accuracy of a neural network model used for multi-step forecasting of hourly energy consumption. The comparison between the results of the original prediction models and the results after using the proposed technique demonstrates the superiority of the proposed method in terms of prediction accuracy and system complexity.
[600] Optimizing Optimizers for Fast Gradient-Based Learning
Jaerin Lee, Kyoung Mu Lee
Main category: cs.LG
TL;DR: The paper establishes a theoretical framework for automating optimizer design in gradient-based learning by formulating it as maximizing instantaneous loss decrease through convex optimization over optimizer functions.
Details
Motivation: To provide a systematic, theoretical foundation for automating the design of optimizers in gradient-based learning, moving beyond manual design and hyperparameter tuning.Method: Formulate optimizer design as maximizing instantaneous loss decrease using greedy principle. Treat optimizers as functions mapping gradients to parameter updates, then solve convex optimization problems over optimizer space under various constraints.
Result: The approach recovers popular optimizers as closed-form solutions, produces optimal hyperparameters for given problems, enables systematic optimizer design based on gradient statistics, and allows dynamic optimization during training.
Conclusion: The paper provides a unified theoretical framework for automated optimizer design that can dynamically adapt optimizers during training, offering a principled approach to optimization of optimization.
Abstract: We lay the theoretical foundation for automating optimizer design in gradient-based learning. Based on the greedy principle, we formulate the problem of designing optimizers as maximizing the instantaneous decrease in loss. By treating an optimizer as a function that translates loss gradient signals into parameter motions, the problem reduces to a family of convex optimization problems over the space of optimizers. Solving these problems under various constraints not only recovers a wide range of popular optimizers as closed-form solutions, but also produces the optimal hyperparameters of these optimizers with respect to the problems at hand. This enables a systematic approach to design optimizers and tune their hyperparameters according to the gradient statistics that are collected during the training process. Furthermore, this optimization of optimization can be performed dynamically during training.
[601] RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs
Runlong Zhou, Lefan Zhang, Shang-Chen Wu, Kelvin Zou, Hanzhi Zhou, Ke Ye, Yihao Feng, Dong Yin, Alex Guillen Garcia, Dmytro Babych, Rohit Chatterjee, Matthew Hopkins, Xiang Kong, Chang Lan, Lezhi Li, Yiping Ma, Daniele Molinari, Senyu Tong, Yanchao Sun, Thomas Voice, Jianyu Wang, Chong Wang, Simon Wang, Floris Weers, Yechen Xu, Guolin Yin, Muyang Yu, Yi Zhang, Zheng Zhou, Danyang Zhuo, Ruoming Pang, Cheng Leong
Main category: cs.LG
TL;DR: RLAX is a scalable RL framework on TPUs that improves LLM reasoning through efficient parameter-server architecture and novel dataset techniques, achieving 12.8% accuracy improvement in under 13 hours.
Details
Motivation: Reinforcement learning has become essential for enhancing LLM reasoning capabilities, but existing approaches lack scalability and efficiency on modern hardware like TPUs, especially for large-scale training with preemption handling.Method: RLAX uses a parameter-server architecture with master trainer pushing weights to server and inference workers pulling weights for rollouts. Includes system techniques for scalable, preemptible RL across algorithms, plus novel dataset curation and alignment methods.
Result: RLAX improves QwQ-32B’s pass@8 accuracy by 12.8% in just 12 hours 48 minutes on 1024 v5p TPUs, while maintaining robustness to training preemptions.
Conclusion: RLAX provides an effective, scalable RL framework for LLM reasoning enhancement that achieves significant accuracy improvements rapidly while being resilient to hardware preemptions.
Abstract: Reinforcement learning (RL) has emerged as the de-facto paradigm for improving the reasoning capabilities of large language models (LLMs). We have developed RLAX, a scalable RL framework on TPUs. RLAX employs a parameter-server architecture. A master trainer periodically pushes updated model weights to the parameter server while a fleet of inference workers pull the latest weights and generates new rollouts. We introduce a suite of system techniques to enable scalable and preemptible RL for a diverse set of state-of-art RL algorithms. To accelerate convergence and improve model quality, we have devised new dataset curation and alignment techniques. Large-scale evaluations show that RLAX improves QwQ-32B’s pass@8 accuracy by 12.8% in just 12 hours 48 minutes on 1024 v5p TPUs, while remaining robust to preemptions during training.
[602] A new initialisation to Control Gradients in Sinusoidal Neural network
Andrea Combette, Antoine Venaille, Nelly Pustelnik
Main category: cs.LG
TL;DR: New initialization method for SIREN networks with sinusoidal activations that controls gradients and prevents inappropriate frequency emergence, outperforming original SIREN and other baselines on reconstruction tasks.
Details
Motivation: Proper initialization is crucial for mitigating gradient explosion/vanishing in neural networks, but current initialization strategies lack precise theoretical understanding for architectures like SIREN with sinusoidal activations.Method: Derived closed-form expression for parameter initialization based on fixed points from pre-activation distribution convergence and Jacobian variance sequences. Controls gradients and targets vanishing pre-activation to prevent inappropriate frequency emergence. Analyzed through Neural Tangent Kernel (NTK) framework.
Result: New initialization consistently outperforms original SIREN and state-of-the-art methods across function fitting, image reconstruction, and physics-informed neural network tasks.
Conclusion: The proposed initialization provides better gradient control, improves training dynamics through NTK framework, and enhances generalization by preventing inappropriate frequency emergence, making it superior for SIREN networks across various reconstruction tasks.
Abstract: Proper initialisation strategy is of primary importance to mitigate gradient explosion or vanishing when training neural networks. Yet, the impact of initialisation parameters still lacks a precise theoretical understanding for several well-established architectures. Here, we propose a new initialisation for networks with sinusoidal activation functions such as \texttt{SIREN}, focusing on gradients control, their scaling with network depth, their impact on training and on generalization. To achieve this, we identify a closed-form expression for the initialisation of the parameters, differing from the original \texttt{SIREN} scheme. This expression is derived from fixed points obtained through the convergence of pre-activation distribution and the variance of Jacobian sequences. Controlling both gradients and targeting vanishing pre-activation helps preventing the emergence of inappropriate frequencies during estimation, thereby improving generalization. We further show that this initialisation strongly influences training dynamics through the Neural Tangent Kernel framework (NTK). Finally, we benchmark \texttt{SIREN} with the proposed initialisation against the original scheme and other baselines on function fitting and image reconstruction. The new initialisation consistently outperforms state-of-the-art methods across a wide range of reconstruction tasks, including those involving physics-informed neural networks.
[603] Neural expressiveness for beyond importance model compression
Angelos-Christos Maroudis, Sotirios Xydis
Main category: cs.LG
TL;DR: A novel “Expressiveness” criterion for neural network pruning that focuses on neurons’ ability to redistribute informational resources based on activation overlap, enabling data-agnostic compression with superior results.
Details
Motivation: Existing pruning methods rely on weight importance metrics, but the authors propose a new fundamental basis for compression that is independent of learning state, addressing the "When to Prune" question and enabling data-agnostic strategies.Method: Introduces “Expressiveness” criterion that measures neurons’ ability to effectively redistribute informational resources based on activation overlap. This criterion is strongly correlated with network initialization, making it stateless and independent of learning state. Can be approximated with arbitrary or limited data, enabling data-agnostic pruning. Also allows hybrid formulation combining expressiveness with importance-based approaches.
Result: Achieves up to 10x extra gains in parameter compression ratios compared to weight-based approaches with only 1% average performance degradation. On YOLOv8, achieves 46.1% MACs reduction by removing 55.4% of parameters while increasing mAP by 3% on COCO dataset. Outperforms top-performing foundational methods in compression efficiency.
Conclusion: Expressiveness provides a novel, fundamental basis for neural network compression that is initialization-dependent and data-agnostic, enabling superior compression efficiency and addressing the “When to Prune” question through stateless criteria that complement existing importance-based approaches.
Abstract: Neural Network Pruning has been established as driving force in the exploration of memory and energy efficient solutions with high throughput both during training and at test time. In this paper, we introduce a novel criterion for model compression, named “Expressiveness”. Unlike existing pruning methods that rely on the inherent “Importance” of neurons’ and filters’ weights, ``Expressiveness" emphasizes a neuron’s or group of neurons ability to redistribute informational resources effectively, based on the overlap of activations. This characteristic is strongly correlated to a network’s initialization state, establishing criterion autonomy from the learning state stateless and thus setting a new fundamental basis for the expansion of compression strategies in regards to the “When to Prune” question. We show that expressiveness is effectively approximated with arbitrary data or limited dataset’s representative samples, making ground for the exploration of Data-Agnostic strategies. Our work also facilitates a “hybrid” formulation of expressiveness and importance-based pruning strategies, illustrating their complementary benefits and delivering up to 10x extra gains w.r.t. weight-based approaches in parameter compression ratios, with an average of 1% in performance degradation. We also show that employing expressiveness (independently) for pruning leads to an improvement over top-performing and foundational methods in terms of compression efficiency. Finally, on YOLOv8, we achieve a 46.1% MACs reduction by removing 55.4% of the parameters, with an increase of 3% in the mean Absolute Precision ($mAP_{50-95}$) for object detection on COCO dataset.
[604] BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination
Huizheng Wang, Hongbin Wang, Shaojun Wei, Yang Hu, Shouyi Yin
Main category: cs.LG
TL;DR: BitStopper is an algorithm-architecture co-design that eliminates sparsity predictors in attention mechanisms, using bit-serial processing to reduce memory traffic and improve efficiency in LLMs.
Details
Motivation: Attention-based LLMs suffer from quadratic self-attention costs, and existing dynamic sparsity approaches add prediction overhead and heavy memory traffic, limiting hardware efficiency.Method: Four key innovations: 1) Bit-serial enable stage fusion to reuse memory and merge prediction with execution, 2) Lightweight adaptive token selection for bit-level sparsity speculation, 3) Bit-level asynchronous processing for better compute utilization, and 4) Custom architecture design.
Result: Achieves 2.03x speedup over Sanger and 1.89x over SOFA, with 2.4x and 2.1x energy efficiency improvements respectively.
Conclusion: BitStopper successfully addresses limitations of dynamic sparsity attention by eliminating predictor overhead through fine-grained algorithm-architecture co-design, delivering significant performance and efficiency gains.
Abstract: Attention-based large language models (LLMs) have transformed modern AI applications, but the quadratic cost of self-attention imposes significant compute and memory overhead. Dynamic sparsity (DS) attention mitigates this, yet its hardware efficiency is limited by the added prediction stage and the heavy memory traffic it entails. To address these limitations, this paper proposes BitStopper, a fine-grained algorithm-architecture co-design that operates without a sparsity predictor. First, a bit-serial enable stage fusion (BESF) mechanism is proposed to reuse and minimize the memory access by progressively terminating trivial tokens and merging the prediction stage into the execution stage. Second, a lightweight and adaptive token selection (LATS) strategy is developed to work in concert with the bit-level sparsity speculation. Third, a bit-level asynchronous processing (BAP) strategy is employed to improve compute utilization during the on-demand bit-grained memory fetching. Finally, an elaborate architecture is designed to translate the theoretical complexity reduction into practical performance improvement. Extensive evaluations demonstrate that, compared to state-of-the-art (SOTA) Transformer accelerators, BitStopper achieves 2.03x and 1.89x speedups over Sanger and SOFA, respectively, while delivering 2.4x and 2.1x improvements in energy efficiency.
[605] Why Goal-Conditioned Reinforcement Learning Works: Relation to Dual Control
Nathan P. Lawrence, Ali Mesbah
Main category: cs.LG
TL;DR: Goal-conditioned RL analysis shows optimality gap between classical dense rewards and goal-conditioned rewards, explaining why goal-conditioned RL succeeds where dense rewards fail, with applications to partially observed and uncertain environments.
Details
Motivation: To understand why goal-conditioned reinforcement learning works well compared to classical dense reward formulations, and to provide theoretical analysis connecting goal-conditioned RL to optimal control principles.Method: Derives optimality gap analysis between classical quadratic objectives and goal-conditioned rewards using optimal control theory. Extends analysis to partially observed Markov decision processes, connecting state estimation to probabilistic rewards and dual control problems.
Result: Theoretical analysis elucidates success of goal-conditioned RL and explains why classical dense rewards can fail. Validates advantages of goal-conditioned policies on nonlinear and uncertain environments using both RL and predictive control techniques.
Conclusion: Goal-conditioned RL provides superior performance in reaching target states compared to classical dense reward formulations, especially in partially observed and uncertain environments, with theoretical foundations in optimal control.
Abstract: Goal-conditioned reinforcement learning (RL) concerns the problem of training an agent to maximize the probability of reaching target goal states. This paper presents an analysis of the goal-conditioned setting based on optimal control. In particular, we derive an optimality gap between more classical, often quadratic, objectives and the goal-conditioned reward, elucidating the success of goal-conditioned RL and why classical ``dense’’ rewards can falter. We then consider the partially observed Markov decision setting and connect state estimation to our probabilistic reward, further making the goal-conditioned reward well suited to dual control problems. The advantages of goal-conditioned policies are validated on nonlinear and uncertain environments using both RL and predictive control techniques.
[606] Optimizing LLMs Using Quantization for Mobile Execution
Agatsya Yadav, Renta Chintala Bhargavi
Main category: cs.LG
TL;DR: 4-bit Post-Training Quantization reduces Llama 3.2 3B model size by 68.66%, enabling efficient mobile deployment on Android via GGUF format and Ollama framework.
Details
Motivation: Large Language Models have significant size and computational requirements that hinder deployment on resource-constrained mobile devices, creating a need for compression techniques that enable mobile execution.Method: Applied 4-bit Post-Training Quantization using BitsAndBytes library with Hugging Face Transformers framework on Meta’s Llama 3.2 3B model, then converted to GGUF format using llama.cpp tools for mobile optimization.
Result: Achieved 68.66% reduction in model size through 4-bit quantization, enabling successful inference on Android devices via Termux environment and Ollama framework with qualitative validation showing functional performance.
Conclusion: 4-bit PTQ combined with mobile-optimized formats like GGUF provides a practical pathway for deploying capable LLMs on mobile devices, effectively balancing model size and performance constraints.
Abstract: Large Language Models (LLMs) offer powerful capabilities, but their significant size and computational requirements hinder deployment on resource-constrained mobile devices. This paper investigates Post-Training Quantization (PTQ) for compressing LLMs for mobile execution. We apply 4-bit PTQ using the BitsAndBytes library with the Hugging Face Transformers framework to Meta’s Llama 3.2 3B model. The quantized model is converted to GGUF format using llama.cpp tools for optimized mobile inference. The PTQ workflow achieves a 68.66% reduction in model size through 4-bit quantization, enabling the Llama 3.2 3B model to run efficiently on an Android device. Qualitative validation shows that the 4-bit quantized model can perform inference tasks successfully. We demonstrate the feasibility of running the quantized GGUF model on an Android device using the Termux environment and the Ollama framework. PTQ, especially at 4-bit precision combined with mobile-optimized formats like GGUF, provides a practical pathway for deploying capable LLMs on mobile devices, balancing model size and performance.
[607] Diagnosis-based mortality prediction for intensive care unit patients via transfer learning
Mengqi Xu, Subha Maity, Joel Dubin
Main category: cs.LG
TL;DR: Transfer learning for diagnosis-specific mortality prediction in ICU outperforms diagnosis-only models and APACHE IVa, with better calibration than pooled data models.
Details
Motivation: Current ICU prediction models don't adequately account for diagnostic heterogeneity, despite substantial variation in underlying causes of critical illness across different diagnoses.Method: Evaluated transfer learning approaches using both GLM- and XGBoost-based models on the eICU Collaborative Research Database for diagnosis-specific mortality prediction.
Result: Transfer learning consistently outperformed models trained only on diagnosis-specific data and APACHE IVa alone, while achieving better calibration than models trained on pooled data. Youden cutoff was more appropriate than conventional 0.5 threshold.
Conclusion: Transfer learning is effective for diagnosis-specific mortality prediction in ICU settings, maintaining high predictive performance across various cutoff criteria and providing better calibration than existing approaches.
Abstract: In the intensive care unit, the underlying causes of critical illness vary substantially across diagnoses, yet prediction models accounting for diagnostic heterogeneity have not been systematically studied. To address the gap, we evaluate transfer learning approaches for diagnosis-specific mortality prediction and apply both GLM- and XGBoost-based models to the eICU Collaborative Research Database. Our results demonstrate that transfer learning consistently outperforms models trained only on diagnosis-specific data and those using a well-known ICU severity-of-illness score, i.e., APACHE IVa, alone, while also achieving better calibration than models trained on the pooled data. Our findings also suggest that the Youden cutoff is a more appropriate decision threshold than the conventional 0.5 for binary outcomes, and that transfer learning maintains consistently high predictive performance across various cutoff criteria.
[608] Hierarchical geometric deep learning enables scalable analysis of molecular dynamics
Zihan Pengmei, Spencer C. Guo, Chatipat Lorpaiboon, Aaron R. Dinner
Main category: cs.LG
TL;DR: GNN-based approach reduces memory/runtime for analyzing large biomolecular simulations by aggregating local information, enabling analysis of thousand-residue complexes on single GPUs.
Details
Motivation: Molecular dynamics simulations generate complex trajectories that are challenging to analyze without established quantitative descriptors, especially for large biomolecular systems where GNNs face memory/runtime limitations and difficulties capturing long-range interactions.Method: Developed a graph neural network approach that aggregates local information to reduce memory and runtime requirements while maintaining atomic detail, enabling analysis of large biomolecular systems without manual feature engineering.
Result: The method enables analysis of protein-nucleic acid complexes with thousands of residues on single GPUs within minutes, and improves both performance and interpretability for systems with hundreds of residues.
Conclusion: Local information aggregation in GNNs overcomes previous limitations in analyzing large biomolecular simulations, making atomic-detail analysis of complex systems computationally feasible and improving analytical capabilities.
Abstract: Molecular dynamics simulations can generate atomically detailed trajectories of complex systems, but analyzing these dynamics can be challenging when systems lack well-established quantitative descriptors (features). Graph neural networks (GNNs) in which messages are passed between nodes that represent atoms that are spatial neighbors promise to obviate manual feature engineering, but the use of GNNs with biomolecular systems of more than a few hundred residues has been limited in the context of analyzing dynamics by both difficulties in capturing the details of long-range interactions with message passing and the memory and runtime requirements associated with large graphs. Here, we show how local information can be aggregated to reduce memory and runtime requirements without sacrificing atomic detail. We demonstrate that this approach opens the door to analyzing simulations of protein-nucleic acid complexes with thousands of residues on single GPUs within minutes. For systems with hundreds of residues, for which there are sufficient data to make quantitative comparisons, we show that the approach improves performance and interpretability.
[609] Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning
Ming Chen, Sheng Tang, Rong-Xi Tan, Ziniu Li, Jiacheng Chen, Ke Xue, Chao Qian
Main category: cs.LG
TL;DR: RL-based decoding for regression improves numerical prediction by using sequence-level rewards instead of token-level constraints, outperforming existing methods in precision and generalization.
Details
Motivation: Current decoding-based regression methods suffer from misalignment between discrete token-level objectives and continuous numerical values, failing to capture global magnitude and limiting precision and generalization.Method: Formulate regression as Markov Decision Process, use Reinforcement Learning (specifically ReMax and GRPO) with sequence-level rewards to enforce global numerical coherence in the generation process.
Result: Consistently outperforms state-of-the-art token-level baselines and traditional regression heads on tabular regression and code metric regression, showing superior sampling efficiency and predictive precision.
Conclusion: RL significantly enhances decoding-based regression, establishing it as a robust and accurate paradigm for general-purpose numerical prediction through sequence-level signals.
Abstract: Decoding-based regression, which reformulates regression as a sequence generation task, has emerged as a promising paradigm of applying large language models for numerical prediction. However, its progress is hindered by the misalignment between discrete token-level objectives (e.g., cross-entropy) and continuous numerical values. Existing approaches relying on token-level constraints often fail to capture the global magnitude of the target value, limiting their precision and generalization. In this paper, we propose to unlock the potential of decoding-based regression via Reinforcement Learning (RL). We formulate the generation process as a Markov Decision Process, utilizing sequence-level rewards to enforce global numerical coherence. Extensive experiments on tabular regression and code metric regression demonstrate that our method (specifically with ReMax and GRPO) consistently outperforms both state-of-the-art token-level baselines and traditional regression heads, showing the superiority of introducing sequence-level signals. Our analysis further reveals that RL significantly enhances sampling efficiency and predictive precision, establishing decoding-based regression as a robust and accurate paradigm for general-purpose numerical prediction.
[610] A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation
Xiaocan Li, Shiliang Wu, Zheng Shen
Main category: cs.LG
TL;DR: A-3PO approximates the proximal policy in decoupled RL methods through simple interpolation instead of explicit computation, reducing training time by 18% while maintaining performance.
Details
Motivation: Decoupled loss methods like PPO and GRPO use a proximal policy to improve learning stability in asynchronous RL, but this requires an extra forward pass through the network at each training step, creating a computational bottleneck for large language models.Method: A-3PO approximates the proximal policy through simple interpolation without explicit computation, since the proximal policy only serves as a trust region anchor between behavior and target policies.
Result: A-3PO eliminates the computational overhead, reducing training time by 18% while maintaining comparable performance to traditional decoupled loss methods.
Conclusion: The proposed A-3PO method provides an efficient approximation for proximal policy optimization that maintains stability benefits while significantly reducing computational costs, making it more practical for large-scale RL applications.
Abstract: Decoupled loss has been a successful reinforcement learning (RL) algorithm to deal with the high data staleness under the asynchronous RL setting. Decoupled loss improves coupled-loss style of algorithms’ (e.g., PPO, GRPO) learning stability by introducing a proximal policy to decouple the off-policy corrections (importance weight) from the controlling policy updates (trust region). However, the proximal policy requires an extra forward pass through the network at each training step, creating a computational bottleneck for large language models. We observe that since the proximal policy only serves as a trust region anchor between the behavior and target policies, we can approximate it through simple interpolation without explicit computation. We call this approach A-3PO (APproximated Proximal Policy Optimization). A-3PO eliminates this overhead, reducing training time by 18% while maintaining comparable performance. Code & off-the-shelf example are available at: https://github.com/inclusionAI/AReaL/blob/main/docs/algorithms/prox_approx.md
[611] Deep Manifold Part 2: Neural Network Mathematics
Max Y. Ma, Gen-Hua Shi
Main category: cs.LG
TL;DR: Neural networks are analyzed as learnable numerical computations shaped by manifold complexity, nonlinearity, and boundary conditions, with learnability emerging only when fixed-point regions stabilize through residual-driven iteration.
Details
Motivation: To understand neural networks through geometric and algebraic perspectives, explaining why capability emerges only when fixed-point regions stabilize, and to clarify the limits of monolithic models under geometric and data-induced plasticity.Method: Develops global equations of neural networks using stacked piecewise manifolds, fixed-point theory, and boundary-conditioned iteration, removing fixed coordinates and operators to reveal the underlying geometric structure.
Result: Neural networks appear as learnable numerical computations shaped by manifold complexity, high-order nonlinearity, and boundary conditions, with learnability constrained by data complexity, scale, and training dynamics that produce shifting node covers and curvature accumulation.
Conclusion: This geometric perspective motivates distributed architectures and federated systems that distribute manifold complexity across elastic models, forming a coherent world-modeling framework grounded in geometry, algebra, fixed points, and real-data complexity.
Abstract: This work develops the global equations of neural networks through stacked piecewise manifolds, fixed-point theory, and boundary-conditioned iteration. Once fixed coordinates and operators are removed, a neural network appears as a learnable numerical computation shaped by manifold complexity, high-order nonlinearity, and boundary conditions. Real-world data impose strong data complexity, near-infinite scope, scale, and minibatch fragmentation, while training dynamics produce learning complexity through shifting node covers, curvature accumulation, and the rise and decay of plasticity. These forces constrain learnability and explain why capability emerges only when fixed-point regions stabilize. Neural networks do not begin with fixed points; they construct them through residual-driven iteration. This perspective clarifies the limits of monolithic models under geometric and data-induced plasticity and motivates architectures and federated systems that distribute manifold complexity across many elastic models, forming a coherent world-modeling framework grounded in geometry, algebra, fixed points, and real-data complexity.
[612] QL-LSTM: A Parameter-Efficient LSTM for Stable Long-Sequence Modeling
Isaac Kofi Nti
Main category: cs.LG
TL;DR: QL-LSTM introduces two innovations: Parameter-Shared Unified Gating reduces parameters by ~48%, and Hierarchical Gated Recurrence improves long-range information flow, achieving competitive accuracy with fewer parameters on IMDB sentiment classification.
Details
Motivation: Traditional recurrent architectures (LSTM, GRU) have two core limitations: redundant gate-specific parameters and reduced ability to retain information across long temporal distances.Method: Two independent components: 1) Parameter-Shared Unified Gating replaces all gate-specific transformations with a single shared weight matrix; 2) Hierarchical Gated Recurrence with Additive Skip Connections adds a multiplication-free pathway for better long-range information flow.
Result: QL-LSTM achieves competitive accuracy on IMDB sentiment classification with extended document lengths while using substantially fewer parameters compared to LSTM, GRU, and BiLSTM. However, no wall-clock speed improvements yet due to sequential nature limitations.
Conclusion: QL-LSTM addresses parameter redundancy and long-range dependency issues in recurrent architectures, offering parameter efficiency while maintaining accuracy, though computational speed improvements require further kernel-level optimization.
Abstract: Recurrent neural architectures such as LSTM and GRU remain widely used in sequence modeling, but they continue to face two core limitations: redundant gate-specific parameters and reduced ability to retain information across long temporal distances. This paper introduces the Quantum-Leap LSTM (QL-LSTM), a recurrent architecture designed to address both challenges through two independent components. The Parameter-Shared Unified Gating mechanism replaces all gate-specific transformations with a single shared weight matrix, reducing parameters by approximately 48 percent while preserving full gating behavior. The Hierarchical Gated Recurrence with Additive Skip Connections component adds a multiplication-free pathway that improves long-range information flow and reduces forget-gate degradation. We evaluate QL-LSTM on sentiment classification using the IMDB dataset with extended document lengths, comparing it to LSTM, GRU, and BiLSTM reference models. QL-LSTM achieves competitive accuracy while using substantially fewer parameters. Although the PSUG and HGR-ASC components are more efficient per time step, the current prototype remains limited by the inherent sequential nature of recurrent models and therefore does not yet yield wall-clock speed improvements without further kernel-level optimization.
[613] On fine-tuning Boltz-2 for protein-protein affinity prediction
James King, Lewis Cornwall, Andrei Cristian Nica, James Day, Aaron Sim, Neil Dalchau, Lilly Wollman, Joshua Meyers
Main category: cs.LG
TL;DR: Adapting Boltz-2 protein-ligand affinity predictor for protein-protein interactions shows structural models underperform sequence-based alternatives, but combining both yields complementary improvements.
Details
Motivation: Accurate prediction of protein-protein binding affinity is crucial for understanding molecular interactions and therapeutic design, but current approaches need improvement.Method: Adapted Boltz-2 (state-of-the-art structure-based protein-ligand affinity predictor) for protein-protein affinity regression, evaluated on TCR3d and PPB-affinity datasets, and combined with sequence-based embeddings.
Result: Boltz-2-PPI underperforms relative to sequence-based alternatives despite high structural accuracy. Combining structural and sequence embeddings yields complementary improvements, especially for weaker sequence models, suggesting different signals are learned.
Conclusion: Current structure-based representations are not optimized for performant affinity prediction, echoing known biases in structural data training, but combining structural and sequence information provides complementary benefits.
Abstract: Accurate prediction of protein-protein binding affinity is vital for understanding molecular interactions and designing therapeutics. We adapt Boltz-2, a state-of-the-art structure-based protein-ligand affinity predictor, for protein-protein affinity regression and evaluate it on two datasets, TCR3d and PPB-affinity. Despite high structural accuracy, Boltz-2-PPI underperforms relative to sequence-based alternatives in both small- and larger-scale data regimes. Combining embeddings from Boltz-2-PPI with sequence-based embeddings yields complementary improvements, particularly for weaker sequence models, suggesting different signals are learned by sequence- and structure-based models. Our results echo known biases associated with training with structural data and suggest that current structure-based representations are not primed for performant affinity prediction.
[614] A Fast and Effective Solution to the Problem of Look-ahead Bias in LLMs
Humzah Merchant, Bradford Levy
Main category: cs.LG
TL;DR: A method to remove look-ahead bias from LLMs in finance by adjusting logits at inference using specialized models, enabling backtesting without retraining.
Details
Motivation: Applying LLMs to financial prediction is challenging due to look-ahead bias from training on long time-series data, which prevents proper backtesting since retraining frontier models with specific knowledge cutoffs is prohibitively expensive.Method: Guides generation at inference time by adjusting the logits of a large base model using a pair of smaller specialized models - one fine-tuned on information to be forgotten and another on information to be retained.
Result: The method effectively removes both verbatim and semantic knowledge, corrects biases, and outperforms prior methods for knowledge removal.
Conclusion: Provides a fast, effective, and low-cost alternative to retraining frontier LLMs, enabling proper backtesting in financial applications by removing look-ahead bias through inference-time logit adjustment.
Abstract: Applying LLMs to predictive tasks in finance is challenging due to look-ahead bias resulting from their training on long time-series data. This precludes the backtests typically employed in finance since retraining frontier models from scratch with a specific knowledge cutoff is prohibitive. In this paper, we introduce a fast, effective, and low-cost alternative. Our method guides generation at inference time by adjusting the logits of a large base model using a pair of smaller, specialized models – one fine-tuned on information to be forgotten and another on information to be retained. We demonstrate that our method effectively removes both verbatim and semantic knowledge, corrects biases, and outperforms prior methods.
[615] Vector Quantization using Gaussian Variational Autoencoder
Tongda Xu, Wendi Zheng, Jiajun He, Jose Miguel Hernandez-Lobato, Yan Wang, Ya-Qin Zhang, Jie Tang
Main category: cs.LG
TL;DR: Gaussian Quant (GQ) converts Gaussian VAE into VQ-VAE without training by using random Gaussian noise as codebook, outperforming previous VQ-VAE methods.
Details
Motivation: VQ-VAE is difficult to train due to discretization challenges. The paper aims to simplify VQ-VAE training by converting pre-trained Gaussian VAEs into VQ-VAEs without additional training.Method: Proposes Gaussian Quant (GQ) technique that uses random Gaussian noise as codebook and finds closest noise to posterior mean. Also introduces target divergence constraint (TDC) heuristic to train Gaussian VAE for effective GQ conversion.
Result: GQ outperforms previous VQ-VAEs (VQGAN, FSQ, LFQ, BSQ) on both UNet and ViT architectures. TDC improves upon previous Gaussian VAE discretization methods like TokenBridge.
Conclusion: GQ provides a simple yet effective way to convert Gaussian VAEs into VQ-VAEs without training, achieving state-of-the-art performance with theoretical guarantees on quantization error.
Abstract: Vector quantized variational autoencoder (VQ-VAE) is a discrete auto-encoder that compresses images into discrete tokens. It is difficult to train due to discretization. In this paper, we propose a simple yet effective technique, dubbed Gaussian Quant (GQ), that converts a Gaussian VAE with certain constraint into a VQ-VAE without training. GQ generates random Gaussian noise as a codebook and finds the closest noise to the posterior mean. Theoretically, we prove that when the logarithm of the codebook size exceeds the bits-back coding rate of the Gaussian VAE, a small quantization error is guaranteed. Practically, we propose a heuristic to train Gaussian VAE for effective GQ, named target divergence constraint (TDC). Empirically, we show that GQ outperforms previous VQ-VAEs, such as VQGAN, FSQ, LFQ, and BSQ, on both UNet and ViT architectures. Furthermore, TDC also improves upon previous Gaussian VAE discretization methods, such as TokenBridge. The source code is provided in https://github.com/tongdaxu/VQ-VAE-from-Gaussian-VAE.
[616] Quantum Temporal Convolutional Neural Networks for Cross-Sectional Equity Return Prediction: A Comparative Benchmark Study
Chi-Sheng Chen, Xinyu Zhang, Rong Fu, Qiuzhe Xie, Fan Zhang
Main category: cs.LG
TL;DR: QTCNN combines classical temporal encoding with quantum convolution circuits to enhance stock prediction, achieving 72% better Sharpe ratio than classical baselines.
Details
Motivation: Classical forecasting models struggle with noisy financial data, regime shifts, and limited generalization in stock market prediction, creating a need for quantum-enhanced approaches.Method: Proposes Quantum Temporal Convolutional Neural Network (QTCNN) with classical temporal encoder for multi-scale pattern extraction and parameter-efficient quantum convolution circuits for enhanced feature representation.
Result: QTCNN achieves Sharpe ratio of 0.538 on JPX Tokyo Stock Exchange dataset, outperforming best classical baseline by approximately 72% in out-of-sample portfolio construction.
Conclusion: Quantum-enhanced forecasting models like QTCNN demonstrate practical potential for robust decision-making in quantitative finance by leveraging quantum superposition and entanglement.
Abstract: Quantum machine learning offers a promising pathway for enhancing stock market prediction, particularly under complex, noisy, and highly dynamic financial environments. However, many classical forecasting models struggle with noisy input, regime shifts, and limited generalization capacity. To address these challenges, we propose a Quantum Temporal Convolutional Neural Network (QTCNN) that combines a classical temporal encoder with parameter-efficient quantum convolution circuits for cross-sectional equity return prediction. The temporal encoder extracts multi-scale patterns from sequential technical indicators, while the quantum processing leverages superposition and entanglement to enhance feature representation and suppress overfitting. We conduct a comprehensive benchmarking study on the JPX Tokyo Stock Exchange dataset and evaluate predictions through long-short portfolio construction using out-of-sample Sharpe ratio as the primary performance metric. QTCNN achieves a Sharpe ratio of 0.538, outperforming the best classical baseline by approximately 72%. These results highlight the practical potential of quantum-enhanced forecasting model, QTCNN, for robust decision-making in quantitative finance.
[617] The Impact of Data Characteristics on GNN Evaluation for Detecting Fake News
Isha Karn, David Jensen
Main category: cs.LG
TL;DR: GNN benchmarks for fake news detection (GossipCop, PolitiFact) have shallow graph structures that don’t meaningfully test structural modeling capabilities - MLPs perform similarly to GNNs, suggesting these datasets don’t properly evaluate graph-based methods.
Details
Motivation: To evaluate whether commonly used benchmark datasets (GossipCop and PolitiFact) for fake news detection actually test the utility of graph structure modeling in GNNs, or if they're poorly suited for this purpose due to their shallow graph topologies.Method: Systematically benchmarked 5 GNN architectures against structure-agnostic MLPs using same node features. Conducted controlled experiments: 1) shuffling node features, 2) randomizing edge structures. Performed structural analysis of graph topologies. Compared performance on synthetic datasets with noisy features and informative structure.
Result: MLPs match or closely trail GNN performance (gaps within 1-2%, overlapping confidence intervals). Performance collapses under feature shuffling but remains stable under edge randomization, indicating structure plays negligible role. Structural analysis shows >75% of nodes are only one hop from root, exhibiting minimal structural diversity. On synthetic datasets with informative structure, GNNs significantly outperform MLPs.
Conclusion: Widely used fake news detection benchmarks don’t meaningfully test structural modeling capabilities of GNNs due to shallow, ego-like graph topologies. Current datasets motivate development of new benchmarks with richer, more diverse graph structures to properly evaluate graph-based methods.
Abstract: Graph neural networks (GNNs) are widely used for the detection of fake news by modeling the content and propagation structure of news articles on social media. We show that two of the most commonly used benchmark data sets - GossipCop and PolitiFact - are poorly suited to evaluating the utility of models that use propagation structure. Specifically, these data sets exhibit shallow, ego-like graph topologies that provide little or no ability to differentiate among modeling methods. We systematically benchmark five GNN architectures against a structure-agnostic multilayer perceptron (MLP) that uses the same node features. We show that MLPs match or closely trail the performance of GNNs, with performance gaps often within 1-2% and overlapping confidence intervals. To isolate the contribution of structure in these datasets, we conduct controlled experiments where node features are shuffled or edge structures randomized. We find that performance collapses under feature shuffling but remains stable under edge randomization. This suggests that structure plays a negligible role in these benchmarks. Structural analysis further reveals that over 75% of nodes are only one hop from the root, exhibiting minimal structural diversity. In contrast, on synthetic datasets where node features are noisy and structure is informative, GNNs significantly outperform MLPs. These findings provide strong evidence that widely used benchmarks do not meaningfully test the utility of modeling structural features, and they motivate the development of datasets with richer, more diverse graph topologies.
[618] Financial Fraud Identification and Interpretability Study for Listed Companies Based on Convolutional Neural Network
Xiao Li
Main category: cs.LG
TL;DR: A CNN-based financial fraud detection framework for Chinese A-share companies that transforms financial data into image-like representations for early fraud prediction, outperforming traditional methods while providing interpretability through multi-dimensional analysis.
Details
Motivation: Financial fraud detection is challenging due to covert tactics, high audit costs, and limitations of existing methods. Traditional statistical models lack ability to capture nonlinear feature interactions, while machine learning models are often opaque "black boxes." Most existing approaches only detect current-year fraud using current-year data, lacking early warning capabilities.Method: Proposes a CNN-based framework that transforms firm-year panel data into image-like representations to capture both cross-sectional and temporal patterns. Uses feature engineering to enable advanced fraud prediction. Addresses interpretability through local explanation techniques analyzing entity, feature, and time dimensions.
Result: CNN outperforms logistic regression and LightGBM in accuracy, robustness, and early-warning performance. Proper classification threshold tuning is crucial for high-risk settings. Analysis reveals solvency, ratio structure, governance structure, and internal control as general fraud predictors, with environmental indicators mattering mainly in high-pollution industries.
Conclusion: The CNN framework effectively detects financial fraud with superior performance and interpretability. Fraud firms exhibit heterogeneous patterns in short time windows, while non-fraud firms show stable patterns. Case study validation confirms model’s ability to identify key fraud drivers consistent with actual misconduct.
Abstract: Since the emergence of joint-stock companies, financial fraud by listed firms has repeatedly undermined capital markets. Fraud is difficult to detect because of covert tactics and the high labor and time costs of audits. Traditional statistical models are interpretable but struggle with nonlinear feature interactions, while machine learning models are powerful but often opaque. In addition, most existing methods judge fraud only for the current year based on current year data, limiting timeliness. This paper proposes a financial fraud detection framework for Chinese A-share listed companies based on convolutional neural networks (CNNs). We design a feature engineering scheme that transforms firm-year panel data into image like representations, enabling the CNN to capture cross-sectional and temporal patterns and to predict fraud in advance. Experiments show that the CNN outperforms logistic regression and LightGBM in accuracy, robustness, and early-warning performance, and that proper tuning of the classification threshold is crucial in high-risk settings. To address interpretability, we analyze the model along the dimensions of entity, feature, and time using local explanation techniques. We find that solvency, ratio structure, governance structure, and internal control are general predictors of fraud, while environmental indicators matter mainly in high-pollution industries. Non-fraud firms share stable feature patterns, whereas fraud firms exhibit heterogeneous patterns concentrated in short time windows. A case study of Guanong Shares in 2022 shows that cash flow analysis, social responsibility, governance structure, and per-share indicators are the main drivers of the model’s fraud prediction, consistent with the company’s documented misconduct.
[619] Estimating Black Carbon Concentration from Urban Traffic Using Vision-Based Machine Learning
Camellia Zakaria, Aryan Sadeghi, Weaam Jaafar, Junshi Xu, Alex Mariakakis, Marianne Hatzopoulou
Main category: cs.LG
TL;DR: A machine learning system uses traffic video and weather data to estimate street-level black carbon emissions, achieving R²=0.72, providing actionable pollution data for urban planning and environmental justice.
Details
Motivation: Black carbon monitoring is expensive and scarce, creating a data gap between widely available traffic information and unknown environmental consequences, especially affecting marginalized communities near major roads.Method: Machine learning system extracts visual features from traffic video (vehicle behaviors/conditions) and combines with weather data to estimate street-level black carbon concentrations.
Result: Model achieves R-squared value of 0.72 and RMSE of 129.42 ng/m³, demonstrating effective estimation of black carbon from traffic video data.
Conclusion: The approach leverages existing urban infrastructure to generate actionable BC data for pollution reduction, urban planning, public health, and environmental justice at local levels.
Abstract: Black carbon (BC) emissions in urban areas are primarily driven by traffic, with hotspots near major roads disproportionately affecting marginalized communities. Because BC monitoring is typically performed using costly and specialized instruments. there is little to no available data on BC from local traffic sources that could help inform policy interventions targeting local factors. By contrast, traffic monitoring systems are widely deployed in cities around the world, highlighting the imbalance between what we know about traffic conditions and what do not know about their environmental consequences. To bridge this gap, we propose a machine learning-driven system that extracts visual information from traffic video to capture vehicles behaviors and conditions. Combining these features with weather data, our model estimates BC at street level, achieving an R-squared value of 0.72 and RMSE of 129.42 ng/m3 (nanogram per cubic meter). From a sustainability perspective, this work leverages resources already supported by urban infrastructure and established modeling techniques to generate information relevant to traffic emission. Obtaining BC concentration data provides actionable insights to support pollution reduction, urban planning, public health, and environmental justice at the local municipal level.
[620] Adaptive Test-Time Training for Predicting Need for Invasive Mechanical Ventilation in Multi-Center Cohorts
Xiaolei Lu, Shamim Nemati
Main category: cs.LG
TL;DR: AdaTTT: Adaptive Test-Time Training framework for EHR-based invasive mechanical ventilation prediction in ICU patients that addresses domain shifts across institutions through self-supervised learning and partial optimal transport alignment.
Details
Motivation: Domain shifts in EHR systems across different ICU institutions degrade generalization of predictive models for invasive mechanical ventilation (IMV) prediction, making test-time adaptation crucial for real-world deployment.Method: AdaTTT framework with: 1) Information-theoretic bounds on test-time error, 2) Self-supervised learning with reconstruction and masked feature modeling using dynamic masking, 3) Prototype learning and Partial Optimal Transport (POT) for flexible feature alignment while preserving clinical representations.
Result: Competitive classification performance across multi-center ICU cohorts on different test-time adaptation benchmarks.
Conclusion: AdaTTT effectively addresses domain shifts in EHR-based IMV prediction through adaptive test-time training with self-supervised learning and optimal transport alignment, improving generalization across institutions.
Abstract: Accurate prediction of the need for invasive mechanical ventilation (IMV) in intensive care units (ICUs) patients is crucial for timely interventions and resource allocation. However, variability in patient populations, clinical practices, and electronic health record (EHR) systems across institutions introduces domain shifts that degrade the generalization performance of predictive models during deployment. Test-Time Training (TTT) has emerged as a promising approach to mitigate such shifts by adapting models dynamically during inference without requiring labeled target-domain data. In this work, we introduce Adaptive Test-Time Training (AdaTTT), an enhanced TTT framework tailored for EHR-based IMV prediction in ICU settings. We begin by deriving information-theoretic bounds on the test-time prediction error and demonstrate that it is constrained by the uncertainty between the main and auxiliary tasks. To enhance their alignment, we introduce a self-supervised learning framework with pretext tasks: reconstruction and masked feature modeling optimized through a dynamic masking strategy that emphasizes features critical to the main task. Additionally, to improve robustness against domain shifts, we incorporate prototype learning and employ Partial Optimal Transport (POT) for flexible, partial feature alignment while maintaining clinically meaningful patient representations. Experiments across multi-center ICU cohorts demonstrate competitive classification performance on different test-time adaptation benchmarks.
[621] GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering
Jehyeok Yeon, Federico Cinus, Yifan Wu, Luca Luceri
Main category: cs.LG
TL;DR: GSAEs extend sparse autoencoders with graph regularization to learn distributed safety representations across multiple features, enabling effective runtime safety steering that adaptively refuses harmful content while preserving utility on benign queries.
Details
Motivation: LLMs are vulnerable to adversarial prompts and jailbreak attacks that generate harmful content. Existing defenses are limited by assuming safety concepts are isolated in single latent features, when evidence shows abstract concepts like refusal are distributed across multiple features.Method: Graph-Regularized Sparse Autoencoders (GSAEs) extend standard SAEs with a Laplacian smoothness penalty on neuron co-activation graphs. This learns smooth, distributed safety representations as coherent patterns spanning multiple features. A two-stage gating mechanism activates interventions only when harmful prompts or continuations are detected during generation.
Result: GSAE steering achieves 82% selective refusal rate (vs 42% for standard SAE), maintains strong task accuracy (70% TriviaQA, 65% TruthfulQA, 74% GSM8K), generalizes across multiple LLM families (LLaMA-3, Mistral, Qwen, Phi), and resists jailbreak attacks with ≥90% refusal of harmful content.
Conclusion: GSAEs effectively address the limitation of single-feature safety representations by learning distributed safety concepts, enabling adaptive refusal while preserving utility, and demonstrating robust performance across diverse LLMs and adversarial attacks.
Abstract: Large language models (LLMs) face critical safety challenges, as they can be manipulated to generate harmful content through adversarial prompts and jailbreak attacks. Many defenses are typically either black-box guardrails that filter outputs, or internals-based methods that steer hidden activations by operationalizing safety as a single latent feature or dimension. While effective for simple concepts, this assumption is limiting, as recent evidence shows that abstract concepts such as refusal and temporality are distributed across multiple features rather than isolated in one. To address this limitation, we introduce Graph-Regularized Sparse Autoencoders (GSAEs), which extends SAEs with a Laplacian smoothness penalty on the neuron co-activation graph. Unlike standard SAEs that assign each concept to a single latent feature, GSAEs recover smooth, distributed safety representations as coherent patterns spanning multiple features. We empirically demonstrate that GSAE enables effective runtime safety steering, assembling features into a weighted set of safety-relevant directions and controlling them with a two-stage gating mechanism that activates interventions only when harmful prompts or continuations are detected during generation. This approach enforces refusals adaptively while preserving utility on benign queries. Across safety and QA benchmarks, GSAE steering achieves an average 82% selective refusal rate, substantially outperforming standard SAE steering (42%), while maintaining strong task accuracy (70% on TriviaQA, 65% on TruthfulQA, 74% on GSM8K). Robustness experiments further show generalization across LLaMA-3, Mistral, Qwen, and Phi families and resilience against jailbreak attacks (GCG, AutoDAN), consistently maintaining >= 90% refusal of harmful content.
[622] Rethinking Robustness: A New Approach to Evaluating Feature Attribution Methods
Panagiota Kiourti, Anu Singh, Preeti Duraipandian, Weichao Zhou, Wenchao Li
Main category: cs.LG
TL;DR: The paper proposes a new framework for evaluating attribution method robustness, challenging current metrics that ignore model output differences, and introduces new definitions, metrics, and GAN-based methods for generating similar inputs.
Details
Motivation: Current attribution robustness metrics largely ignore differences in model outputs, leading to incomplete evaluation. There's a need for more objective metrics that reveal attribution method weaknesses rather than neural network weaknesses.Method: Proposes: 1) new definition of similar inputs, 2) new robustness metric, 3) GAN-based method to generate these inputs. Conducts comprehensive evaluation with existing metrics and state-of-the-art attribution methods.
Result: Findings highlight the need for more objective metrics that accurately reveal attribution method weaknesses rather than neural network weaknesses, providing more accurate robustness evaluation.
Conclusion: The proposed framework provides a more accurate way to evaluate attribution method robustness by focusing on attribution method weaknesses rather than model weaknesses, with implications for improving attribution method reliability.
Abstract: This paper studies the robustness of feature attribution methods for deep neural networks. It challenges the current notion of attributional robustness that largely ignores the difference in the model’s outputs and introduces a new way of evaluating the robustness of attribution methods. Specifically, we propose a new definition of similar inputs, a new robustness metric, and a novel method based on generative adversarial networks to generate these inputs. In addition, we present a comprehensive evaluation with existing metrics and state-of-the-art attribution methods. Our findings highlight the need for a more objective metric that reveals the weaknesses of an attribution method rather than that of the neural network, thus providing a more accurate evaluation of the robustness of attribution methods.
[623] The Meta-Learning Gap: Combining Hydra and Quant for Large-Scale Time Series Classification
Urav Maniar
Main category: cs.LG
TL;DR: Combining Hydra and Quant algorithms improves time series classification accuracy while maintaining computational efficiency, but current ensemble methods only capture 11% of theoretical potential, revealing a meta-learning optimization gap.
Details
Motivation: Time series classification faces a trade-off between accuracy and computational efficiency. State-of-the-art ensembles like HIVE-COTE 2.0 achieve high accuracy but require 340-hour training times, making them impractical for large-scale datasets. The research investigates whether combining two efficient algorithms from complementary paradigms can capture ensemble benefits while maintaining computational feasibility.Method: The study combines Hydra (competing convolutional kernels) and Quant (hierarchical interval quantiles) across six ensemble configurations. Performance is evaluated on 10 large-scale MONSTER datasets ranging from 7,898 to 1,168,774 training instances. The research examines both prediction-combination and feature-concatenation approaches, comparing them against theoretical oracle bounds.
Result: The strongest configuration improved mean accuracy from 0.829 to 0.836, succeeding on 7 of 10 datasets. However, prediction-combination ensembles captured only 11% of theoretical oracle potential, revealing a substantial meta-learning optimization gap. Feature-concatenation approaches exceeded oracle bounds by learning novel decision boundaries. Prediction-level complementarity shows moderate correlation with ensemble gains.
Conclusion: The central challenge has shifted from ensuring algorithms are different to learning how to combine them effectively. Current meta-learning strategies struggle to exploit the complementarity that oracle analysis confirms exists. Improved combination strategies could potentially double or triple ensemble gains across diverse time series classification applications.
Abstract: Time series classification faces a fundamental trade-off between accuracy and computational efficiency. While comprehensive ensembles like HIVE-COTE 2.0 achieve state-of-the-art accuracy, their 340-hour training time on the UCR benchmark renders them impractical for large-scale datasets. We investigate whether targeted combinations of two efficient algorithms from complementary paradigms can capture ensemble benefits while maintaining computational feasibility. Combining Hydra (competing convolutional kernels) and Quant (hierarchical interval quantiles) across six ensemble configurations, we evaluate performance on 10 large-scale MONSTER datasets (7,898 to 1,168,774 training instances). Our strongest configuration improves mean accuracy from 0.829 to 0.836, succeeding on 7 of 10 datasets. However, prediction-combination ensembles capture only 11% of theoretical oracle potential, revealing a substantial meta-learning optimization gap. Feature-concatenation approaches exceeded oracle bounds by learning novel decision boundaries, while prediction-level complementarity shows moderate correlation with ensemble gains. The central finding: the challenge has shifted from ensuring algorithms are different to learning how to combine them effectively. Current meta-learning strategies struggle to exploit the complementarity that oracle analysis confirms exists. Improved combination strategies could potentially double or triple ensemble gains across diverse time series classification applications.
[624] GradientSpace: Unsupervised Data Clustering for Improved Instruction Tuning
Shrihari Sridharan, Deepak Ravikumar, Anand Raghunathan, Kaushik Roy
Main category: cs.LG
TL;DR: GradientSpace is a framework that clusters training samples in full-dimensional gradient space using online SVD on LoRA gradients to create specialized experts, outperforming prior methods while reducing inference latency.
Details
Motivation: Real-world datasets are heterogeneous, causing gradient interference that degrades model performance. Existing clustering methods based on semantic similarity fail to capture how data influences model parameters, while gradient clustering methods suffer from accuracy loss from random projection and require expensive expert ensembles.Method: Proposes GradientSpace framework that clusters samples directly in full-dimensional gradient space using an online SVD-based algorithm operating on LoRA gradients. Identifies latent skills without storing all sample gradients, trains specialized LoRA experts for each cluster, and uses a lightweight router to select the best expert during inference.
Result: Routing to a single appropriate expert outperforms expert ensembles used in prior work while significantly reducing inference latency. Experiments across mathematical reasoning, code generation, finance, and creative writing tasks show coherent expert specialization and consistent accuracy gains over state-of-the-art clustering methods and finetuning techniques.
Conclusion: GradientSpace effectively addresses gradient interference in heterogeneous datasets by clustering in full gradient space, enabling specialized expert training with efficient inference routing, achieving better performance than existing methods.
Abstract: Instruction tuning is one of the key steps required for adapting large language models (LLMs) to a broad spectrum of downstream applications. However, this procedure is difficult because real-world datasets are rarely homogeneous; they consist of a mixture of diverse information, causing gradient interference, where conflicting gradients pull the model in opposing directions, degrading performance. A common strategy to mitigate this issue is to group data based on semantic or embedding similarity. However, this fails to capture how data influences model parameters during learning. While recent works have attempted to cluster gradients directly, they randomly project gradients into lower dimensions to manage memory, which leads to accuracy loss. Moreover, these methods rely on expert ensembles which necessitates multiple inference passes and expensive on-the-fly gradient computations during inference. To address these limitations, we propose GradientSpace, a framework that clusters samples directly in full-dimensional gradient space. We introduce an online SVD-based algorithm that operates on LoRA gradients to identify latent skills without the infeasible cost of storing all sample gradients. Each cluster is used to train a specialized LoRA expert along with a lightweight router trained to select the best expert during inference. We show that routing to a single, appropriate expert outperforms expert ensembles used in prior work, while significantly reducing inference latency. Our experiments across mathematical reasoning, code generation, finance, and creative writing tasks demonstrate that GradientSpace leads to coherent expert specialization and consistent accuracy gains over state-of-the-art clustering methods and finetuning techniques.
[625] State Diversity Matters in Offline Behavior Distillation
Shiye Lei, Zhihao Cheng, Dacheng Tao
Main category: cs.LG
TL;DR: The paper identifies a misalignment in Offline Behavior Distillation where high-quality original datasets don’t always produce superior synthetic datasets, reveals that state diversity matters more than state quality when training loss is high, proposes a state density weighted algorithm to enhance diversity, and shows improved performance on D4RL benchmarks.
Details
Motivation: The motivation is to address the misalignment problem in Offline Behavior Distillation where high-quality original datasets don't necessarily yield better synthetic datasets, limiting the effectiveness of OBD for efficient policy training across downstream RL tasks.Method: The authors first conduct empirical analysis showing state diversity outperforms state quality when training loss is high. They provide theoretical analysis linking state quality to reducing pivotal error and state diversity to reducing surrounding error. They propose State Density Weighted OBD which weights the distillation objective using the reciprocal of state density to emphasize state diversity.
Result: Extensive experiments on multiple D4RL datasets confirm that SDW significantly enhances OBD performance, particularly when the original dataset exhibits limited state diversity. The method addresses the misalignment problem and improves distillation effectiveness.
Conclusion: State diversity is more important than state quality in Offline Behavior Distillation when training loss is substantial. The proposed SDW algorithm successfully enhances OBD performance by emphasizing state diversity through density-weighted distillation, making OBD more effective for datasets with limited state coverage.
Abstract: Offline Behavior Distillation (OBD), which condenses massive offline RL data into a compact synthetic behavioral dataset, offers a promising approach for efficient policy training and can be applied across various downstream RL tasks. In this paper, we uncover a misalignment between original and distilled datasets, observing that a high-quality original dataset does not necessarily yield a superior synthetic dataset. Through an empirical analysis of policy performance under varying levels of training loss, we show that datasets with greater state diversity outperforms those with higher state quality when training loss is substantial, as is often the case in OBD, whereas the relationship reverses under minimal loss, which contributes to the misalignment. By associating state quality and diversity in reducing pivotal and surrounding error, respectively, our theoretical analysis establishes that surrounding error plays a more crucial role in policy performance when pivotal error is large, thereby highlighting the importance of state diversity in OBD scenario. Furthermore, we propose a novel yet simple algorithm, state density weighted (SDW) OBD, which emphasizes state diversity by weighting the distillation objective using the reciprocal of state density, thereby distilling a more diverse state information into synthetic data. Extensive experiments across multiple D4RL datasets confirm that SDW significantly enhances OBD performance when the original dataset exhibits limited state diversity.
[626] Mitigating Barren plateaus in quantum denoising diffusion probabilistic models
Haipeng Cao, Kaining Zhang, Dacheng Tao, Zhaofeng Su
Main category: cs.LG
TL;DR: QuDDPM suffers from barren plateaus due to 2-design input states; improved version with non-Haar distribution mitigates this and achieves better sample quality.
Details
Motivation: Quantum generative models like QuDDPM show promise for learning quantum data but suffer from barren plateaus that undermine performance, requiring a solution to make them scalable and efficient.Method: Theoretical analysis and experimental validation confirm barren plateaus in original QuDDPM. An improved QuDDPM is introduced using a distribution maintaining distance from Haar distribution to ensure better trainability.
Result: Experimental results show the improved approach effectively mitigates barren plateau problem and generates higher quality samples compared to original QuDDPM.
Conclusion: The improved QuDDPM with non-Haar distribution input addresses the barren plateau issue, paving the way for scalable and efficient quantum generative learning.
Abstract: Quantum generative models leverage quantum superposition and entanglement to enhance learning efficiency for both classical and quantum data. The quantum denoising diffusion probabilistic model (QuDDPM), inspired by its classical counterpart, has been proposed as a promising framework for quantum generative learning. QuDDPM is capable of efficiently learning and generating quantum data, and it demonstrates excellent performance in learning correlated quantum noise models, quantum many-body phases, and the topological structure of quantum data. However, we show that barren plateaus emerge in QuDDPMs due to the use of 2-design states as the input for the denoising process, which severely undermines the performance of QuDDPM. Through theoretical analysis and experimental validation, we confirm the presence of barren plateaus in the original QuDDPM. To address this issue, we introduce an improved QuDDPM that utilizes a distribution maintaining a certain distance from the Haar distribution, ensuring better trainability. Experimental results demonstrate that our approach effectively mitigates the barren plateau problem and generates samples with higher quality, paving the way for scalable and efficient quantum generative learning.
[627] Pathway to $O(\sqrt{d})$ Complexity bound under Wasserstein metric of flow-based models
Xiangjun Meng, Zhongjian Wang
Main category: cs.LG
TL;DR: The paper provides analytical tools to estimate Wasserstein error for flow-based generative models, showing optimal sampling iteration complexity of O(√d) with respect to dimension.
Details
Motivation: To develop attainable analytical tools for estimating error bounds in flow-based generative models and establish dimension-dependent sampling complexity bounds under the Wasserstein metric.Method: Decomposes error into two parts: Lipschitzness of push-forward maps (dimension-independent) and local discretization error (scales O(√d)). Uses assumptions valid for Föllmer process and 1-rectified flow under Gaussian tail conditions.
Result: Shows sampling iteration complexity grows linearly with √trace of covariance operator, achieving optimal O(√d) dimension scaling. The error is explicitly controllable through the two identified components.
Conclusion: Provides rigorous analytical framework for understanding dimension dependence in flow-based generative models, establishing optimal sampling complexity bounds and explicit error control mechanisms.
Abstract: We provide attainable analytical tools to estimate the error of flow-based generative models under the Wasserstein metric and to establish the optimal sampling iteration complexity bound with respect to dimension as $O(\sqrt{d})$. We show this error can be explicitly controlled by two parts: the Lipschitzness of the push-forward maps of the backward flow which scales independently of the dimension; and a local discretization error scales $O(\sqrt{d})$ in terms of dimension. The former one is related to the existence of Lipschitz changes of variables induced by the (heat) flow. The latter one consists of the regularity of the score function in both spatial and temporal directions. These assumptions are valid in the flow-based generative model associated with the Föllmer process and $1$-rectified flow under the Gaussian tail assumption. As a consequence, we show that the sampling iteration complexity grows linearly with the square root of the trace of the covariance operator, which is related to the invariant distribution of the forward process.
[628] A Novel Multimodal RUL Framework for Remaining Useful Life Estimation with Layer-wise Explanations
Waleed Razzaq, Yun-Bo Zhao
Main category: cs.LG
TL;DR: A novel multimodal RUL estimation framework using image and time-frequency representations with attention mechanism and explainable AI for bearing prognostics.
Details
Motivation: Existing RUL estimation methods for rolling-element bearings suffer from poor generalization, lack of robustness, high data demands, and limited interpretability, highlighting the need for more effective approaches suitable for real-world industrial deployment.Method: Multimodal framework with three branches: (1) image representation branch using Bresenham line algorithm, (2) time-frequency representation branch using Continuous Wavelet Transform, and (3) fusion branch with LSTM and multi-head attention. Includes multimodal Layer-wise Relevance Propagation for explainability.
Result: Method matches or surpasses state-of-the-art baselines on XJTU-SY and PRONOSTIA datasets, requires 28-48% less training data, exhibits strong noise resilience, and provides interpretable predictions through multimodal-LRP visualizations.
Conclusion: The proposed multimodal RUL framework offers superior performance, data efficiency, robustness, and interpretability, making it highly suitable for real-world industrial deployment in bearing prognostics and health management.
Abstract: Estimating the Remaining Useful Life (RUL) of mechanical systems is pivotal in Prognostics and Health Management (PHM). Rolling-element bearings are among the most frequent causes of machinery failure, highlighting the need for robust RUL estimation methods. Existing approaches often suffer from poor generalization, lack of robustness, high data demands, and limited interpretability. This paper proposes a novel multimodal-RUL framework that jointly leverages image representations (ImR) and time-frequency representations (TFR) of multichannel, nonstationary vibration signals. The architecture comprises three branches: (1) an ImR branch and (2) a TFR branch, both employing multiple dilated convolutional blocks with residual connections to extract spatial degradation features; and (3) a fusion branch that concatenates these features and feeds them into an LSTM to model temporal degradation patterns. A multi-head attention mechanism subsequently emphasizes salient features, followed by linear layers for final RUL regression. To enable effective multimodal learning, vibration signals are converted into ImR via the Bresenham line algorithm and into TFR using Continuous Wavelet Transform. We also introduce multimodal Layer-wise Relevance Propagation (multimodal-LRP), a tailored explainability technique that significantly enhances model transparency. The approach is validated on the XJTU-SY and PRONOSTIA benchmark datasets. Results show that our method matches or surpasses state-of-the-art baselines under both seen and unseen operating conditions, while requiring ~28 % less training data on XJTU-SY and ~48 % less on PRONOSTIA. The model exhibits strong noise resilience, and multimodal-LRP visualizations confirm the interpretability and trustworthiness of predictions, making the framework highly suitable for real-world industrial deployment.
[629] A Novel Deep Neural Network Architecture for Real-Time Water Demand Forecasting
Tony Salloom, Okyay Kaynak, Wei He
Main category: cs.LG
TL;DR: A novel deep learning approach for short-term water demand forecasting that reduces model complexity 6x while maintaining accuracy, with a data extension technique that cuts extreme point errors by 30%.
Details
Motivation: Deep learning approaches for water demand forecasting suffer from high complexity (massive parameters) and high forecasting errors at extreme points, which need to be addressed for practical applications.Method: Proposes a novel DL model combining GRU for sequential relationships and K-means for feature creation to reduce parameters. Introduces data extension by inserting virtual data around extreme points to relieve nonlinearity and reduce errors.
Result: The method reduces model complexity six times compared to state-of-the-art while maintaining same accuracy. Data extension reduces error by about 30% at extreme points, though it increases training time.
Conclusion: The proposed approach effectively addresses both complexity and extreme point error problems in water demand forecasting, offering a practical solution with validated performance on real Chinese water plant data.
Abstract: Short-term water demand forecasting (StWDF) is the foundation stone in the derivation of an optimal plan for controlling water supply systems. Deep learning (DL) approaches provide the most accurate solutions for this purpose. However, they suffer from complexity problem due to the massive number of parameters, in addition to the high forecasting error at the extreme points. In this work, an effective method to alleviate the error at these points is proposed. It is based on extending the data by inserting virtual data within the actual data to relieve the nonlinearity around them. To our knowledge, this is the first work that considers the problem related to the extreme points. Moreover, the water demand forecasting model proposed in this work is a novel DL model with relatively low complexity. The basic model uses the gated recurrent unit (GRU) to handle the sequential relationship in the historical demand data, while an unsupervised classification method, K-means, is introduced for the creation of new features to enhance the prediction accuracy with less number of parameters. Real data obtained from two different water plants in China are used to train and verify the model proposed. The prediction results and the comparison with the state-of-the-art illustrate that the method proposed reduces the complexity of the model six times of what achieved in the literature while conserving the same accuracy. Furthermore, it is found that extending the data set significantly reduces the error by about 30%. However, it increases the training time.
[630] Decoding Motor Behavior Using Deep Learning and Reservoir Computing
Tian Lan
Main category: cs.LG
TL;DR: ESNNet integrates Echo State Networks with CNNs for EEG motor-behavior classification, achieving 83.2% within-subject and 51.3% LOSO accuracy on skateboard-trick EEG data.
Details
Motivation: Conventional CNN architectures (EEGNet, DeepConvNet) are effective for local spatial patterns but poorly model long-range temporal dependencies and nonlinear dynamics in EEG decoding for BMIs.Method: Integrates Echo State Network (ESN) reservoir computing with CNNs to create ESNNet. ESN provides high-dimensional, sparsely connected recurrent reservoir for temporal dynamics, complementing CNN spatial representation. Uses PREP pipeline preprocessing and MNE-Python implementation.
Result: Achieves 83.2% within-subject accuracy and 51.3% leave-one-subject-out (LOSO) accuracy on skateboard-trick EEG dataset, surpassing CNN-based baselines.
Conclusion: ESNNet effectively combines CNN spatial processing with ESN temporal modeling for improved EEG motor-behavior classification, demonstrating superior performance over conventional CNN architectures.
Abstract: We present a novel approach to EEG decoding for non-invasive brain machine interfaces (BMIs), with a focus on motor-behavior classification. While conventional convolutional architectures such as EEGNet and DeepConvNet are effective in capturing local spatial patterns, they are markedly less suited for modeling long-range temporal dependencies and nonlinear dynamics. To address this limitation, we integrate an Echo State Network (ESN), a prominent paradigm in reservoir computing into the decoding pipeline. ESNs construct a high-dimensional, sparsely connected recurrent reservoir that excels at tracking temporal dynamics, thereby complementing the spatial representational power of CNNs. Evaluated on a skateboard-trick EEG dataset preprocessed via the PREP pipeline and implemented in MNE-Python, our ESNNet achieves 83.2% within-subject and 51.3% LOSO accuracies, surpassing widely used CNN-based baselines. Code is available at https://github.com/Yutiankunkun/Motion-Decoding-Using-Biosignals
[631] KV-CAR: KV Cache Compression using Autoencoders and KV Reuse in Large Language Models
Sourjya Roy, Shrihari Sridharan, Surya Selvam, Anand Raghunathan
Main category: cs.LG
TL;DR: KV CAR reduces KV cache memory by up to 47.85% using autoencoder compression and similarity-based reuse across layers, enabling longer sequences and larger batches without architecture changes.
Details
Motivation: As LLMs scale, the KV cache becomes a major memory bottleneck during autoregressive decoding, often exceeding the model's own memory footprint and limiting batch sizes and context windows.Method: KV CAR combines two techniques: 1) lightweight autoencoder compresses key/value tensors along embedding dimension before storage, 2) similarity-driven reuse mechanism identifies opportunities to reuse KV tensors across adjacent attention heads in different layers.
Result: Achieves up to 47.85% KV cache memory reduction with minimal impact on perplexity and zero-shot accuracy across GPT-2 and TinyLLaMA models on Wikitext, C4, PIQA, and Winogrande datasets. System measurements show longer sequence lengths and larger batch sizes on NVIDIA A40 GPU.
Conclusion: KV CAR effectively enables memory-efficient LLM inference by reducing dimensional and structural redundancy in KV tensors without requiring transformer architecture modifications.
Abstract: As Large Language Models (LLMs) scale in size and context length, the memory requirements of the key value (KV) cache have emerged as a major bottleneck during autoregressive decoding. The KV cache grows with sequence length and embedding dimension, often exceeding the memory footprint of the model itself and limiting achievable batch sizes and context windows. To address this challenge, we present KV CAR, a unified and architecture agnostic framework that significantly reduces KV cache storage while maintaining model fidelity. KV CAR combines two complementary techniques. First, a lightweight autoencoder learns compact representations of key and value tensors along the embedding dimension, compressing them before they are stored in the KV cache and restoring them upon retrieval. Second, a similarity driven reuse mechanism identifies opportunities to reuse KV tensors of specific attention heads across adjacent layers. Together, these methods reduce the dimensional and structural redundancy in KV tensors without requiring changes to the transformer architecture. Evaluations on GPT 2 and TinyLLaMA models across Wikitext, C4, PIQA, and Winogrande datasets demonstrate that KV CAR achieves up to 47.85 percent KV cache memory reduction with minimal impact on perplexity and zero shot accuracy. System level measurements on an NVIDIA A40 GPU show that the reduced KV footprint directly translates into longer sequence lengths and larger batch sizes during inference. These results highlight the effectiveness of KV CAR in enabling memory efficient LLM inference.
[632] Enhancing Interpretability of AR-SSVEP-Based Motor Intention Recognition via CNN-BiLSTM and SHAP Analysis on EEG Data
Lin Yang, Xiang Li, Xin Ma, Xinxin Zhao
Main category: cs.LG
TL;DR: AR-SSVEP system using HoloLens 2 and enhanced CNN-BiLSTM with multi-head attention for motor intention recognition in rehabilitation.
Details
Motivation: Patients with motor dysfunction show low engagement in rehabilitation, and traditional SSVEP-BCI systems rely on external visual equipment, limiting practicality. There's a need to address patient initiative and reduce therapist workload.Method: 1) Designed four HoloLens 2-based EEG classes and collected data from 7 healthy subjects. 2) Developed MACNN-BiLSTM architecture: extract 10 temporal-spectral EEG features → CNN for high-level representations → BiLSTM for sequential dependencies → multi-head attention to highlight motor-intention patterns. 3) Applied SHAP for model interpretability.
Result: The system enhances real-time motor intention recognition and supports recovery in motor-impaired patients, though specific performance metrics aren’t provided in the abstract.
Conclusion: The AR-SSVEP system with MACNN-BiLSTM and SHAP interpretability improves patient engagement in rehabilitation by providing a more practical, real-world BCI solution that reduces therapist workload.
Abstract: Patients with motor dysfunction show low subjective engagement in rehabilitation training. Traditional SSVEP-based brain-computer interface (BCI) systems rely heavily on external visual stimulus equipment, limiting their practicality in real-world settings. This study proposes an augmented reality steady-state visually evoked potential (AR-SSVEP) system to address the lack of patient initiative and the high workload on therapists. Firstly, we design four HoloLens 2-based EEG classes and collect EEG data from seven healthy subjects for analysis. Secondly, we build upon the conventional CNN-BiLSTM architecture by integrating a multi-head attention mechanism (MACNN-BiLSTM). We extract ten temporal-spectral EEG features and feed them into a CNN to learn high-level representations. Then, we use BiLSTM to model sequential dependencies and apply a multi-head attention mechanism to highlight motor-intention-related patterns. Finally, the SHAP (SHapley Additive exPlanations) method is applied to visualize EEG feature contributions to the neural network’s decision-making process, enhancing the model’s interpretability. These findings enhance real-time motor intention recognition and support recovery in patients with motor impairments.
[633] Arc Gradient Descent: A Mathematically Derived Reformulation of Gradient Descent with Phase-Aware, User-Controlled Step Dynamics
Nikhil Verma, Joonas Linnosmaa, Espinosa-Leal Leonardo, Napat Vajragupta
Main category: cs.LG
TL;DR: ArcGD optimizer outperforms Adam and other state-of-the-art optimizers on both non-convex benchmark functions and real-world ML tasks, showing better generalization and resistance to overfitting.
Details
Motivation: To develop a new optimizer (ArcGD) that can handle challenging non-convex optimization problems and demonstrate superior performance compared to existing optimizers like Adam, particularly in terms of generalization and avoiding overfitting without requiring early stopping tuning.Method: Formulated and implemented the ArcGD optimizer, evaluated it in two phases: 1) on the stochastic variant of Rosenbrock function (2D to 50,000D) compared to Adam with learning rate bias elimination, 2) on CIFAR-10 image classification across 8 diverse MLP architectures compared to Adam, AdamW, Lion, and SGD.
Result: ArcGD consistently outperformed Adam on Rosenbrock function with its own learning rate, and achieved superior final solutions even with Adam’s learning rate. On CIFAR-10, ArcGD achieved highest average test accuracy (50.7%) at 20,000 iterations, winning or tying on 6 of 8 architectures, while showing continued improvement where others regressed.
Conclusion: ArcGD demonstrates strong performance on both geometric stress tests and deep learning benchmarks, showing broad applicability and better generalization without early stopping tuning. The connection between ArcGD and Lion optimizer variants suggests deeper relationships between optimization methods that warrant further exploration.
Abstract: The paper presents the formulation, implementation, and evaluation of the ArcGD optimiser. The evaluation is conducted initially on a non-convex benchmark function and subsequently on a real-world ML dataset. The initial comparative study using the Adam optimiser is conducted on a stochastic variant of the highly non-convex and notoriously challenging Rosenbrock function, renowned for its narrow, curved valley, across dimensions ranging from 2D to 1000D and an extreme case of 50,000D. Two configurations were evaluated to eliminate learning-rate bias: (i) both using ArcGD’s effective learning rate and (ii) both using Adam’s default learning rate. ArcGD consistently outperformed Adam under the first setting and, although slower under the second, achieved super ior final solutions in most cases. In the second evaluation, ArcGD is evaluated against state-of-the-art optimizers (Adam, AdamW, Lion, SGD) on the CIFAR-10 image classification dataset across 8 diverse MLP architectures ranging from 1 to 5 hidden layers. ArcGD achieved the highest average test accuracy (50.7%) at 20,000 iterations, outperforming AdamW (46.6%), Adam (46.8%), SGD (49.6%), and Lion (43.4%), winning or tying on 6 of 8 architectures. Notably, while Adam and AdamW showed strong early convergence at 5,000 iterations, but regressed with extended training, whereas ArcGD continued improving, demonstrating generalization and resistance to overfitting without requiring early stopping tuning. Strong performance on geometric stress tests and standard deep-learning benchmarks indicates broad applicability, highlighting the need for further exploration. Moreover, it is also shown that a variant of ArcGD can be interpreted as a special case of the Lion optimiser, highlighting connections between the inherent mechanisms of such optimisation methods.
[634] Multi-Scale Protein Structure Modelling with Geometric Graph U-Nets
Chang Liu, Vivian Li, Linus Leong, Vladimir Radenkovic, Pietro Liò, Chaitanya K. Joshi
Main category: cs.LG
TL;DR: Geometric Graph U-Nets introduce hierarchical graph neural networks for 3D protein structures that outperform standard GNNs by capturing multi-scale biological interactions through recursive coarsening and refinement.
Details
Motivation: Standard Geometric GNNs and Transformers fail to capture hierarchical interactions in proteins (global domains, long-range allosteric regulation) due to their reliance on message passing, which doesn't mirror biological hierarchy.Method: Geometric Graph U-Nets learn multi-scale representations by recursively coarsening and refining the protein graph, creating a hierarchical architecture that mirrors biological organization.
Result: The hierarchical design is theoretically more expressive than standard Geometric GNNs. Empirically, on protein fold classification, Geometric U-Nets substantially outperform invariant and equivariant baselines.
Conclusion: The work provides a principled foundation for designing geometric deep learning architectures that can learn the multi-scale structure of biomolecules by mirroring biological hierarchy in network architecture.
Abstract: Geometric Graph Neural Networks (GNNs) and Transformers have become state-of-the-art for learning from 3D protein structures. However, their reliance on message passing prevents them from capturing the hierarchical interactions that govern protein function, such as global domains and long-range allosteric regulation. In this work, we argue that the network architecture itself should mirror this biological hierarchy. We introduce Geometric Graph U-Nets, a new class of models that learn multi-scale representations by recursively coarsening and refining the protein graph. We prove that this hierarchical design can theoretically more expressive than standard Geometric GNNs. Empirically, on the task of protein fold classification, Geometric U-Nets substantially outperform invariant and equivariant baselines, demonstrating their ability to learn the global structural patterns that define protein folds. Our work provides a principled foundation for designing geometric deep learning architectures that can learn the multi-scale structure of biomolecules.
[635] Optimal Analysis for Bandit Learning in Matching Markets with Serial Dictatorship
Zilong Wang, Shuai Li
Main category: cs.LG
TL;DR: The paper proposes a multi-level successive selection algorithm for two-sided matching markets with bandits that achieves regret matching the known lower bound under serial dictatorship assumption.
Details
Motivation: There's a persistent gap between the known lower bound (Ω(Nlog(T)/Δ² + Klog(T)/Δ)) and best upper bound (O(Klog(T)/Δ²)) for online matching markets with bandits under serial dictatorship. It's unclear whether the lower bound or upper bound needs improvement.Method: Proposes a multi-level successive selection algorithm designed specifically for matching markets with bandits under serial dictatorship assumption.
Result: Achieves O(Nlog(T)/Δ² + Klog(T)/Δ) regret bound, which matches the known lower bound, closing the gap between lower and upper bounds.
Conclusion: First algorithm to match the lower bound in matching markets with bandits, resolving the uncertainty about whether the lower bound or upper bound needed improvement.
Abstract: The problem of two-sided matching markets is well-studied in computer science and economics, owing to its diverse applications across numerous domains. Since market participants are usually uncertain about their preferences in various online matching platforms, an emerging line of research is dedicated to the online setting where one-side participants (players) learn their unknown preferences through multiple rounds of interactions with the other side (arms). Sankararaman et al. provide an $Ω\left( \frac{N\log(T)}{Δ^2} + \frac{K\log(T)}Δ \right)$ regret lower bound for this problem under serial dictatorship assumption, where $N$ is the number of players, $K (\geq N)$ is the number of arms, $Δ$ is the minimum reward gap across players and arms, and $T$ is the time horizon. Serial dictatorship assumes arms have the same preferences, which is common in reality when one side participants have a unified evaluation standard. Recently, the work of Kong and Li proposes the ET-GS algorithm and achieves an $O\left( \frac{K\log(T)}{Δ^2} \right)$ regret upper bound, which is the best upper bound attained so far. Nonetheless, a gap between the lower and upper bounds, ranging from $N$ to $K$, persists. It remains unclear whether the lower bound or the upper bound needs to be improved. In this paper, we propose a multi-level successive selection algorithm that obtains an $O\left( \frac{N\log(T)}{Δ^2} + \frac{K\log(T)}Δ \right)$ regret bound when the market satisfies serial dictatorship. To the best of our knowledge, we are the first to propose an algorithm that matches the lower bound in the problem of matching markets with bandits.
[636] Measuring Over-smoothing beyond Dirichlet energy
Weiqi Guan, Zihao Shi
Main category: cs.LG
TL;DR: The paper proposes higher-order node similarity measures to better quantify over-smoothing in GNNs, showing that attention-based GNNs suffer from over-smoothing under these metrics.
Details
Motivation: Current Dirichlet energy metric for over-smoothing is limited to first-order feature derivatives, failing to capture higher-order smoothing effects in graph neural networks.Method: Proposes generalized family of node similarity measures based on energy of higher-order feature derivatives, with theoretical analysis of relationships among measures and decay rates under heat diffusion and aggregation operators.
Result: Establishes decay rates of Dirichlet energy, reveals connection between over-smoothing decay rate and spectral gap of graph Laplacian, and empirically demonstrates attention-based GNNs suffer from over-smoothing under proposed metrics.
Conclusion: Higher-order node similarity measures provide better quantification of over-smoothing, revealing attention-based GNNs are susceptible to this problem, with theoretical connections to graph spectral properties.
Abstract: While Dirichlet energy serves as a prevalent metric for quantifying over-smoothing, it is inherently restricted to capturing first-order feature derivatives. To address this limitation, we propose a generalized family of node similarity measures based on the energy of higher-order feature derivatives. Through a rigorous theoretical analysis of the relationships among these measures, we establish the decay rates of Dirichlet energy under both continuous heat diffusion and discrete aggregation operators. Furthermore, our analysis reveals an intrinsic connection between the over-smoothing decay rate and the spectral gap of the graph Laplacian. Finally, empirical results demonstrate that attention-based Graph Neural Networks (GNNs) suffer from over-smoothing when evaluated under these proposed metrics.
[637] Angular Regularization for Positive-Unlabeled Learning on the Hypersphere
Vasileios Sevetlidis, George Pavlidis, Antonios Gasteratos
Main category: cs.LG
TL;DR: AngularPU is a novel Positive-Unlabeled learning framework that uses cosine similarity and angular margin on a unit hypersphere, representing the positive class with a learnable prototype vector and eliminating the need for explicit negative modeling.
Details
Motivation: Existing PU methods have limitations: they either require strong distributional assumptions (negative-risk estimation) or can collapse in high-dimensional settings (pseudo-labeling). There's a need for a more robust approach that doesn't require explicit negative modeling and works well in high-dimensional spaces.Method: AngularPU operates on the unit hypersphere using cosine similarity and angular margin. The positive class is represented by a learnable prototype vector, and classification reduces to thresholding cosine similarity between embeddings and this prototype. An angular regularizer encourages dispersion of unlabeled embeddings over the hypersphere to prevent clustering near the positive prototype.
Result: AngularPU achieves competitive or superior performance compared to state-of-the-art PU methods, particularly in settings with scarce positives and high-dimensional embeddings. The method offers theoretical guarantees on Bayes-optimality of the angular decision rule, consistency of the learned prototype, and the effect of the regularizer.
Conclusion: AngularPU provides an effective PU learning framework that eliminates the need for explicit negative modeling, works well in high-dimensional settings, offers geometric interpretability and scalability, and addresses limitations of existing approaches through angular regularization on the hypersphere.
Abstract: Positive-Unlabeled (PU) learning addresses classification problems where only a subset of positive examples is labeled and the remaining data is unlabeled, making explicit negative supervision unavailable. Existing PU methods often rely on negative-risk estimation or pseudo-labeling, which either require strong distributional assumptions or can collapse in high-dimensional settings. We propose AngularPU, a novel PU framework that operates on the unit hypersphere using cosine similarity and angular margin. In our formulation, the positive class is represented by a learnable prototype vector, and classification reduces to thresholding the cosine similarity between an embedding and this prototype-eliminating the need for explicit negative modeling. To counteract the tendency of unlabeled embeddings to cluster near the positive prototype, we introduce an angular regularizer that encourages dispersion of the unlabeled set over the hypersphere, improving separation. We provide theoretical guarantees on the Bayes-optimality of the angular decision rule, consistency of the learned prototype, and the effect of the regularizer on the unlabeled distribution. Experiments on benchmark datasets demonstrate that AngularPU achieves competitive or superior performance compared to state-of-the-art PU methods, particularly in settings with scarce positives and high-dimensional embeddings, while offering geometric interpretability and scalability.
[638] Small-Gain Nash: Certified Contraction to Nash Equilibria in Differentiable Games
Vedansh Sharma
Main category: cs.LG
TL;DR: SGN introduces a block small-gain condition in custom block-weighted geometry to certify convergence in non-monotone games where Euclidean monotonicity fails.
Details
Motivation: Classical convergence guarantees require pseudo-gradient monotonicity in Euclidean geometry, which often fails in games with strong cross-player couplings, limiting applicability to many practical game settings.Method: Small-Gain Nash (SGN) uses a block small-gain condition in custom block-weighted geometry, converting local curvature and cross-player Lipschitz bounds into contraction certificates. It constructs weighted block metrics where pseudo-gradient becomes strongly monotone, even when non-monotone in Euclidean sense.
Result: SGN provides certified “timescale band” - a non-asymptotic, metric-based certificate enabling single-step-size dynamics convergence without forcing asymptotic timescale separation. Validated on quadratic games where Euclidean analysis fails, and extended to mirror/Fisher geometries for entropy-regularized policy gradient in Markov games.
Conclusion: SGN offers an offline certification pipeline that estimates parameters, optimizes block weights, and returns structural convergence certificates (metric, contraction rate, safe step-sizes) for non-monotone games, overcoming limitations of Euclidean monotonicity requirements.
Abstract: Classical convergence guarantees for gradient-based learning in games require the pseudo-gradient to be (strongly) monotone in Euclidean geometry as shown by rosen(1965), a condition that often fails even in simple games with strong cross-player couplings. We introduce Small-Gain Nash (SGN), a block small-gain condition in a custom block-weighted geometry. SGN converts local curvature and cross-player Lipschitz coupling bounds into a tractable certificate of contraction. It constructs a weighted block metric in which the pseudo-gradient becomes strongly monotone on any region where these bounds hold, even when it is non-monotone in the Euclidean sense. The continuous flow is exponentially contracting in this designed geometry, and projected Euler and RK4 discretizations converge under explicit step-size bounds derived from the SGN margin and a local Lipschitz constant. Our analysis reveals a certified ``timescale band’’, a non-asymptotic, metric-based certificate that plays a TTUR-like role: rather than forcing asymptotic timescale separation via vanishing, unequal step sizes, SGN identifies a finite band of relative metric weights for which a single-step-size dynamics is provably contractive. We validate the framework on quadratic games where Euclidean monotonicity analysis fails to predict convergence, but SGN successfully certifies it, and extend the construction to mirror/Fisher geometries for entropy-regularized policy gradient in Markov games. The result is an offline certification pipeline that estimates curvature, coupling, and Lipschitz parameters on compact regions, optimizes block weights to enlarge the SGN margin, and returns a structural, computable convergence certificate consisting of a metric, contraction rate, and safe step-sizes for non-monotone games.
[639] Partial Inverse Design of High-Performance Concrete Using Cooperative Neural Networks for Constraint-Aware Mix Generation
Agung Nugraha, Heungjun Im, Jihwan Lee
Main category: cs.LG
TL;DR: A cooperative neural network framework for partial inverse design of high-performance concrete that determines mix compositions when some variables are fixed by constraints.
Details
Motivation: Inverse design for concrete mix proportioning is limited, especially when some mix variables are fixed by practical constraints and only remaining variables need to be determined. Current data-driven methods focus on forward prediction but struggle with constraint-aware inverse design.Method: Proposes a cooperative neural network framework with two coupled models: an imputation model that infers undetermined variables and a surrogate model that predicts compressive strength. Uses cooperative learning to generate valid, performance-consistent mix designs in a single forward pass without retraining for different constraint combinations.
Result: Achieves stable R-squared values of 0.87-0.92, reduces mean squared error by 50% compared to autoencoder baselines and 70% compared to Bayesian inference. Outperforms probabilistic (Bayesian inference with Gaussian process) and generative (autoencoder-based) approaches.
Conclusion: The cooperative neural network provides accurate, robust, and computationally efficient foundation for constraint-aware, data-driven mix proportioning in concrete engineering, enabling practical inverse design with fixed constraints.
Abstract: High-performance concrete offers exceptional strength and durability but requires complex mix designs involving many interdependent variables and practical constraints. While data-driven methods have advanced predictive modeling for forward design, inverse design, which focuses on determining mix compositions that achieve target performance, remains limited, particularly in design situations where some mix variables are fixed by constraints and only the remaining variables must be determined. This study proposes a cooperative neural network framework for the partial inverse design of high-performance concrete. The framework combines two coupled neural network models, an imputation model that infers the undetermined variables and a surrogate model that predicts compressive strength. Through cooperative learning, the model generates valid and performance-consistent mix designs in a single forward pass while accommodating different constraint combinations without retraining. Its performance is compared with both probabilistic and generative approaches, including Bayesian inference based on a Gaussian process surrogate and autoencoder-based models. Evaluated on a benchmark dataset, the proposed model achieves stable and higher R-squared values of 0.87-0.92 and reduces mean squared error by an average of 50 percent compared with autoencoder baselines and by an average of 70 percent compared with Bayesian inference. The results demonstrate that the cooperative neural network provides an accurate, robust, and computationally efficient foundation for constraint-aware, data-driven mix proportioning in concrete engineering.
[640] Neural Factorization-based Bearing Fault Diagnosis
Zhenhao Li, Xu Cheng, Yi Zhou
Main category: cs.LG
TL;DR: Proposes Neural Factorization-based Classification (NFC) framework for high-speed train bearing fault diagnosis, using mode-wise feature embedding and neural factorization fusion to improve accuracy under complex conditions.
Details
Motivation: Traditional bearing fault diagnosis methods for high-speed trains have insufficient accuracy under complex operating conditions, posing safety risks since bearings are core components directly related to train operation safety.Method: NFC framework with two core ideas: 1) Embedding vibration time series into multiple mode-wise latent feature vectors to capture diverse fault patterns, 2) Using neural factorization principles to fuse these vectors into unified vibration representation. Two model instantiations: CP-NFC and Tucker-NFC based on CP and Tucker fusion schemes.
Result: Both CP-NFC and Tucker-NFC models achieve superior diagnostic performance compared with traditional machine learning methods, providing empirical evidence and practical guidance for selecting effective diagnostic strategies.
Conclusion: The NFC framework effectively mines complex latent fault characteristics from raw time-series data, offering improved bearing fault diagnosis for high-speed train safety monitoring.
Abstract: This paper studies the key problems of bearing fault diagnosis of high-speed train. As the core component of the train operation system, the health of bearings is directly related to the safety of train operation. The traditional diagnostic methods are facing the challenge of insufficient diagnostic accuracy under complex conditions. To solve these problems, we propose a novel Neural Factorization-based Classification (NFC) framework for bearing fault diagnosis. It is built on two core idea: 1) Embedding vibration time series into multiple mode-wise latent feature vectors to capture diverse fault-related patterns; 2) Leveraging neural factorization principles to fuse these vectors into a unified vibration representation. This design enables effective mining of complex latent fault characteristics from raw time-series data. We further instantiate the framework with two models CP-NFC and Tucker-NFC based on CP and Tucker fusion schemes, respectively. Experimental results show that both models achieve superior diagnostic performance compared with traditional machine learning methods. The comparative analysis provides valuable empirical evidence and practical guidance for selecting effective diagnostic strategies in high-speed train bearing monitoring.
[641] Know your Trajectory – Trustworthy Reinforcement Learning deployment through Importance-Based Trajectory Analysis
Clifford F, Devika Jay, Abhishek Sarkar, Satheesh K Perepu, Santhosh G S, Kaushik Dey, Balaraman Ravindran
Main category: cs.LG
TL;DR: A novel framework for explaining RL agents’ long-term behavior by ranking trajectories using a new state-importance metric combining Q-value difference with a “radical term” to capture goal affinity, enabling identification of optimal trajectories and generating counterfactual explanations.
Details
Motivation: Current Explainable RL (XRL) focuses too much on local, single-step decisions, but real-world RL deployment requires understanding agents' long-term behavior for trust and transparency. There's a critical need for trajectory-level explanations to ensure trustworthy autonomous systems.Method: Introduces a framework that ranks entire trajectories by defining and aggregating a new state-importance metric. This metric combines classic Q-value difference with a “radical term” that captures the agent’s affinity to reach its goal. The method identifies optimal trajectories from heterogeneous agent experiences and generates counterfactual rollouts from critical states to compare chosen paths with alternatives.
Result: The method successfully identifies optimal trajectories from collections of agent experiences. By generating counterfactual rollouts from critical states, it shows the agent’s chosen path is robustly superior to alternatives, providing “Why this, and not that?” explanations. Experiments in standard OpenAI Gym environments validate that the proposed importance metric is more effective at identifying optimal behaviors compared to classic approaches.
Conclusion: The framework represents a significant step towards trustworthy autonomous systems by addressing the critical need for explaining RL agents’ long-term behavior through trajectory-level analysis, moving beyond single-step explanations to provide comprehensive understanding of agent decision-making.
Abstract: As Reinforcement Learning (RL) agents are increasingly deployed in real-world applications, ensuring their behavior is transparent and trustworthy is paramount. A key component of trust is explainability, yet much of the work in Explainable RL (XRL) focuses on local, single-step decisions. This paper addresses the critical need for explaining an agent’s long-term behavior through trajectory-level analysis. We introduce a novel framework that ranks entire trajectories by defining and aggregating a new state-importance metric. This metric combines the classic Q-value difference with a “radical term” that captures the agent’s affinity to reach its goal, providing a more nuanced measure of state criticality. We demonstrate that our method successfully identifies optimal trajectories from a heterogeneous collection of agent experiences. Furthermore, by generating counterfactual rollouts from critical states within these trajectories, we show that the agent’s chosen path is robustly superior to alternatives, thereby providing a powerful “Why this, and not that?” explanation. Our experiments in standard OpenAI Gym environments validate that our proposed importance metric is more effective at identifying optimal behaviors compared to classic approaches, offering a significant step towards trustworthy autonomous systems.
[642] Parent-Guided Semantic Reward Model (PGSRM): Embedding-Based Reward Functions for Reinforcement Learning of Transformer Language Models
Alexandr Plashchinsky
Main category: cs.LG
TL;DR: PGSRM is a lightweight RL reward framework that uses cosine similarity between parent and child model embeddings as semantic reward, eliminating need for human annotation or reward model training.
Details
Motivation: To create a simpler alternative to complex RLHF-style reward modeling that requires human preference data, binary correctness signals, and trained reward models. The goal is to enable more practical parent-guided alignment for smaller transformer models.Method: Uses cosine similarity between a parent model’s reference output embedding and a child model’s generated output embedding for the same input as a dense semantic reward signal. This eliminates need for human annotation or additional model training.
Result: Applied on five language tasks, PGSRM produces smoother reward improvement and more stable PPO dynamics than binary reward baselines, demonstrating practical effectiveness.
Conclusion: Embedding-based semantic rewards via PGSRM are a practical alternative to RLHF-style reward modeling for parent-guided alignment in smaller transformer models, offering simpler implementation and more stable training dynamics.
Abstract: We introduce the Parent-Guided Semantic Reward Model (PGSRM), a lightweight reward framework for reinforcement learning (RL) of transformer language models. PGSRM replaces binary correctness signals, human preference data, and trained reward models with a simple signal: cosine similarity between a parent model’s reference output embedding and a child model’s generated output for the same input. This yields a dense, semantically meaningful reward with no human annotation or additional model training. We apply PGSRM on five language tasks and find that it produces smoother reward improvement and more stable PPO dynamics than a binary reward baseline, suggesting that embedding-based semantic rewards are a practical alternative to RLHF-style reward modeling for parent-guided alignment in smaller transformer models.
[643] Flash Multi-Head Feed-Forward Network
Minshen Zhang, Xiang Hu, Jianguo Li, Wei Wu, Kewei Tu
Main category: cs.LG
TL;DR: FlashMHF replaces standard FFNs in Transformers with multi-head FFNs using fused kernels and dynamic weighting, improving performance while reducing memory usage 3-5x.
Details
Motivation: Multi-head mechanisms enhance expressivity in attention, and FFNs structurally resemble single-head attention, suggesting multi-head FFNs could similarly improve expressivity. However, naive multi-head FFNs face memory scaling issues and imbalanced dimension ratios as models scale.Method: Proposes Flash Multi-Head FFN (FlashMHF) with two innovations: 1) I/O-aware fused kernel that computes outputs online in SRAM (similar to FlashAttention), and 2) dynamically weighted parallel sub-networks that maintain balanced ratio between intermediate and head dimensions as models scale.
Result: Validated on models from 128M to 1.3B parameters, FlashMHF consistently improves perplexity and downstream task accuracy over SwiGLU FFNs, while reducing peak memory usage by 3-5x and accelerating inference by up to 1.08x.
Conclusion: Multi-head design is a superior architectural principle for FFNs, and FlashMHF presents a powerful, efficient, and scalable alternative to standard FFNs in Transformers.
Abstract: We explore Multi-Head FFN (MH-FFN) as a replacement of FFN in the Transformer architecture, motivated by the structural similarity between single-head attention and FFN. While multi-head mechanisms enhance expressivity in attention, naively applying them to FFNs faces two challenges: memory consumption scaling with the head count, and an imbalanced ratio between the growing intermediate size and the fixed head dimension as models scale, which degrades scalability and expressive power. To address these challenges, we propose Flash Multi-Head FFN (FlashMHF), with two key innovations: an I/O-aware fused kernel computing outputs online in SRAM akin to FlashAttention, and a design using dynamically weighted parallel sub-networks to maintain a balanced ratio between intermediate and head dimensions. Validated on models from 128M to 1.3B parameters, FlashMHF consistently improves perplexity and downstream task accuracy over SwiGLU FFNs, while reducing peak memory usage by 3-5x and accelerating inference by up to 1.08x. Our work establishes the multi-head design as a superior architectural principle for FFNs, presenting FlashMHF as a powerful, efficient, and scalable alternative to FFNs in Transformers.
[644] Deep Reinforcement Learning for Phishing Detection with Transformer-Based Semantic Features
Aseer Al Faisal
Main category: cs.LG
TL;DR: QR-DQN combines RoBERTa semantic embeddings with lexical features for phishing detection, achieving 99.86% accuracy with improved generalization over traditional DQN.
Details
Motivation: Phishing attacks cause financial losses through fraudulent messages and websites. Traditional DQN methods have limitations in handling uncertainties and generalizing to unseen data, needing more robust detection approaches.Method: Proposes Quantile Regression Deep Q-Network (QR-DQN) that integrates RoBERTa semantic embeddings with handcrafted lexical features. Uses quantile regression to model return distributions instead of single scalar Q-values. Trained on 105,000 URLs from multiple sources with 80/20 train-test split and 5-fold cross-validation.
Result: Achieved 99.86% test accuracy, 99.75% precision, 99.96% recall, and 99.85% F1-score. Reduced generalization gap from 1.66% to 0.04% compared to standard DQN. 5-fold cross-validation showed mean accuracy of 99.90% with 0.04% standard deviation.
Conclusion: The hybrid QR-DQN approach effectively identifies phishing threats, adapts to evolving attacks, and generalizes well to unseen data, demonstrating significant improvement in robustness over traditional methods.
Abstract: Phishing is a cybercrime in which individuals are deceived into revealing personal information, often resulting in financial loss. These attacks commonly occur through fraudulent messages, misleading advertisements, and compromised legitimate websites. This study proposes a Quantile Regression Deep Q-Network (QR-DQN) approach that integrates RoBERTa semantic embeddings with handcrafted lexical features to enhance phishing detection while accounting for uncertainties. Unlike traditional DQN methods that estimate single scalar Q-values, QR-DQN leverages quantile regression to model the distribution of returns, improving stability and generalization on unseen phishing data. A diverse dataset of 105,000 URLs was curated from PhishTank, OpenPhish, Cloudflare, and other sources, and the model was evaluated using an 80/20 train-test split. The QR-DQN framework achieved a test accuracy of 99.86%, precision of 99.75%, recall of 99.96%, and F1-score of 99.85%, demonstrating high effectiveness. Compared to standard DQN with lexical features, the hybrid QR-DQN with lexical and semantic features reduced the generalization gap from 1.66% to 0.04%, indicating significant improvement in robustness. Five-fold cross-validation confirmed model reliability, yielding a mean accuracy of 99.90% with a standard deviation of 0.04%. These results suggest that the proposed hybrid approach effectively identifies phishing threats, adapts to evolving attack strategies, and generalizes well to unseen data.
[645] Block Sparse Flash Attention
Daniel Ohayon, Itay Lamprecht, Itay Hubara, Israel Cohen, Daniel Soudry, Noam Elata
Main category: cs.LG
TL;DR: BSFA is a drop-in replacement for FlashAttention that accelerates long-context inference by computing exact query-key similarities to select top-k value blocks, skipping ~50% of computation while preserving model quality.
Details
Motivation: Modern LLMs need long contexts for reasoning tasks, but attention's quadratic complexity creates computational bottlenecks that limit practical deployment.Method: Computes exact query-key similarities to select top-k most important value blocks per query, compares per-block maximum scores against calibrated thresholds to skip ~50% of computation, requires only one-time threshold calibration on small dataset.
Result: Achieves up to 1.10x speedup on reasoning benchmarks and 1.24x for needle-in-a-haystack tasks while maintaining >99% baseline accuracy, with some configurations even improving accuracy by focusing on relevant content.
Conclusion: BSFA provides an effective training-free drop-in replacement that substantially outperforms existing sparse attention methods for accelerating long-context inference.
Abstract: Modern large language models increasingly require long contexts for reasoning and multi-document tasks, but attention’s quadratic complexity creates a severe computational bottleneck. We present Block-Sparse FlashAttention (BSFA), a drop-in replacement that accelerates long-context inference while preserving model quality. Unlike methods that predict importance before computing scores, BSFA computes exact query-key similarities to select the top-k most important value blocks for each query. By comparing per-block maximum scores against calibrated thresholds, we skip approximately 50% of the computation and memory transfers for pruned blocks. Our training-free approach requires only a one-time threshold calibration on a small dataset to learn the per-layer and per-head attention score distributions. We provide a CUDA kernel implementation that can be used as a drop-in replacement for FlashAttention. On Llama-3.1-8B, BSFA achieves up to 1.10x speedup on real-world reasoning benchmarks and up to 1.24x for needle-in-a-haystack retrieval tasks while maintaining above 99% baseline accuracy, with certain configurations even improving accuracy by focusing on the most relevant content, substantially outperforming existing sparse attention methods. The implementation is available at https://github.com/Danielohayon/Block-Sparse-Flash-Attention
[646] Evaluating the Sensitivity of BiLSTM Forecasting Models to Sequence Length and Input Noise
Salma Albelali, Moataz Ahmed
Main category: cs.LG
TL;DR: Systematic analysis shows BiLSTM forecasting models are highly sensitive to input sequence length and additive noise, with longer sequences causing overfitting and noise degrading accuracy across all sampling frequencies.
Details
Motivation: While BiLSTM models are effective for time-series forecasting in critical domains like environmental monitoring and IoT, their robustness and generalization capabilities are underexplored, particularly regarding sensitivity to input data characteristics.Method: Developed a modular forecasting pipeline with standardized preprocessing, sequence generation, model training, validation, and evaluation. Conducted controlled experiments on three real-world datasets with varying sampling frequencies to assess BiLSTM performance under different input sequence lengths and additive noise conditions.
Result: Three key findings: (1) longer input sequences significantly increase overfitting and data leakage risk, especially in data-constrained environments; (2) additive noise consistently degrades predictive accuracy across all sampling frequencies; (3) simultaneous presence of both factors causes the most substantial decline in model stability.
Conclusion: Current DL-based forecasting pipelines have important limitations, highlighting the need for data-aware design strategies. Higher-frequency datasets show greater robustness but remain vulnerable to combined input challenges, emphasizing the importance of understanding DL model behavior in dynamic time-series environments.
Abstract: Deep learning (DL) models, a specialized class of multilayer neural networks, have become central to time-series forecasting in critical domains such as environmental monitoring and the Internet of Things (IoT). Among these, Bidirectional Long Short-Term Memory (BiLSTM) architectures are particularly effective in capturing complex temporal dependencies. However, the robustness and generalization of such models are highly sensitive to input data characteristics - an aspect that remains underexplored in existing literature. This study presents a systematic empirical analysis of two key data-centric factors: input sequence length and additive noise. To support this investigation, a modular and reproducible forecasting pipeline is developed, incorporating standardized preprocessing, sequence generation, model training, validation, and evaluation. Controlled experiments are conducted on three real-world datasets with varying sampling frequencies to assess BiLSTM performance under different input conditions. The results yield three key findings: (1) longer input sequences significantly increase the risk of overfitting and data leakage, particularly in data-constrained environments; (2) additive noise consistently degrades predictive accuracy across sampling frequencies; and (3) the simultaneous presence of both factors results in the most substantial decline in model stability. While datasets with higher observation frequencies exhibit greater robustness, they remain vulnerable when both input challenges are present. These findings highlight important limitations in current DL-based forecasting pipelines and underscore the need for data-aware design strategies. This work contributes to a deeper understanding of DL model behavior in dynamic time-series environments and provides practical insights for developing more reliable and generalizable forecasting systems.
[647] Adaptive Normalization Mamba with Multi Scale Trend Decomposition and Patch MoE Encoding
MinCheol Jeon
Main category: cs.LG
TL;DR: AdaMamba is a unified time series forecasting architecture that addresses non-stationarity, multi-scale patterns, and distributional shifts through adaptive normalization, multi-scale trend extraction, and contextual sequence modeling.
Details
Motivation: Real-world time series forecasting faces challenges with non-stationarity, multi-scale temporal patterns, and distributional shifts that degrade model stability and accuracy. Existing models struggle with these issues, particularly conventional Transformer-based approaches.Method: AdaMamba integrates adaptive normalization with multi-scale convolutional trend extraction and channel-wise recalibration, followed by a Context Encoder with patch-wise embeddings, positional encoding, and a Mamba-enhanced Transformer layer with mixture of experts feed-forward module. It includes lightweight prediction head and denormalization mechanism.
Result: Experimental evaluations show AdaMamba yields consistent improvements in stability and accuracy over conventional Transformer-based baselines, effectively mitigating covariate shift and enhancing predictive reliability across heterogeneous datasets.
Conclusion: AdaMamba provides strong representational capacity with modular extensibility, supporting deterministic prediction and compatibility with probabilistic extensions. Its design effectively addresses key challenges in real-world time series forecasting.
Abstract: Time series forecasting in real world environments faces significant challenges non stationarity, multi scale temporal patterns, and distributional shifts that degrade model stability and accuracy. This study propose AdaMamba, a unified forecasting architecture that integrates adaptive normalization, multi scale trend extraction, and contextual sequence modeling to address these challenges. AdaMamba begins with an Adaptive Normalization Block that removes non stationary components through multi scale convolutional trend extraction and channel wise recalibration, enabling consistent detrending and variance stabilization. The normalized sequence is then processed by a Context Encoder that combines patch wise embeddings, positional encoding, and a Mamba enhanced Transformer layer with a mixture of experts feed forward module, allowing efficient modeling of both long range dependencies and local temporal dynamics. A lightweight prediction head generates multi horizon forecasts, and a denormalization mechanism reconstructs outputs by reintegrating local trends to ensure robustness under varying temporal conditions. AdaMamba provides strong representational capacity with modular extensibility, supporting deterministic prediction and compatibility with probabilistic extensions. Its design effectively mitigates covariate shift and enhances predictive reliability across heterogeneous datasets. Experimental evaluations demonstrate that AdaMamba’s combination of adaptive normalization and expert augmented contextual modeling yields consistent improvements in stability and accuracy over conventional Transformer based baselines.
[648] Hidden Leaks in Time Series Forecasting: How Data Leakage Affects LSTM Evaluation Across Configurations and Validation Strategies
Salma Albelali, Moataz Ahmed
Main category: cs.LG
TL;DR: Data leakage in LSTM time series forecasting can inflate performance by up to 20.5% with 10-fold CV, while 2-way/3-way splits are more robust (<5% impact).
Details
Motivation: Data leakage in time series forecasting evaluation compromises integrity by allowing future information to influence training, leading to unreliable performance estimates.Method: Compare three validation techniques (2-way split, 3-way split, 10-fold CV) under leaky (pre-split sequence generation) vs clean (temporal separation before sequence construction) conditions using RMSE Gain metric.
Result: 10-fold CV shows RMSE Gain up to 20.5% at extended lag steps, while 2-way/3-way splits maintain RMSE Gain below 5%. Smaller windows and longer lags increase leakage sensitivity.
Conclusion: Need for configuration-aware, leakage-resistant evaluation pipelines with proper temporal separation before sequence construction to ensure reliable performance estimation.
Abstract: Deep learning models, particularly Long Short-Term Memory (LSTM) networks, are widely used in time series forecasting due to their ability to capture complex temporal dependencies. However, evaluation integrity is often compromised by data leakage, a methodological flaw in which input-output sequences are constructed before dataset partitioning, allowing future information to unintentionally influence training. This study investigates the impact of data leakage on performance, focusing on how validation design mediates leakage sensitivity. Three widely used validation techniques (2-way split, 3-way split, and 10-fold cross-validation) are evaluated under both leaky (pre-split sequence generation) and clean conditions, with the latter mitigating leakage risk by enforcing temporal separation during data splitting prior to sequence construction. The effect of leakage is assessed using RMSE Gain, which measures the relative increase in RMSE caused by leakage, computed as the percentage difference between leaky and clean setups. Empirical results show that 10-fold cross-validation exhibits RMSE Gain values of up to 20.5% at extended lag steps. In contrast, 2-way and 3-way splits demonstrate greater robustness, typically maintaining RMSE Gain below 5% across diverse configurations. Moreover, input window size and lag step significantly influence leakage sensitivity: smaller windows and longer lags increase the risk of leakage, whereas larger windows help reduce it. These findings underscore the need for configuration-aware, leakage-resistant evaluation pipelines to ensure reliable performance estimation.
[649] Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen
Main category: cs.LG
TL;DR: FDA (Function-word De-Attention) improves VLM robustness by reducing vulnerability from function words, achieving significant attack success rate reductions with minimal performance impact.
Details
Motivation: There's a trade-off between robustness and performance in robust VLMs. The paper identifies that function words create vulnerabilities against cross-modal adversarial attacks, motivating the need to mitigate their impact.Method: Proposes Function-word De-Attention (FDA) that calculates both original and function-word cross-attention within attention heads, then differentially subtracts the function-word attention from the original attention, similar to differential amplifiers.
Result: FDA achieves average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on 3 tested models for retrieval, and 90% ASR drop with 0.3% performance gain on visual grounding. Shows scalability, generalization, and zero-shot performance.
Conclusion: FDA effectively addresses the robustness-performance trade-off in VLMs by mitigating function-word vulnerabilities, demonstrating strong empirical results across multiple models, tasks, and datasets.
Abstract: To address the trade-off between robustness and performance for robust VLM, we observe that function words could incur vulnerability of VLMs against cross-modal adversarial attacks, and propose Function-word De-Attention (FDA) accordingly to mitigate the impact of function words. Similar to differential amplifiers, our FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments include 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, as well as in-depth ablation studies and analysis. Code will be made publicly at https://github.com/michaeltian108/FDA.
[650] A Unifying Human-Centered AI Fairness Framework
Munshi Mahbubur Rahman, Shimei Pan, James R. Foulds
Main category: cs.LG
TL;DR: A human-centered fairness framework that unifies eight fairness metrics, allowing stakeholders to weight different fairness objectives based on their values and context, with applications to real-world datasets.
Details
Motivation: AI deployment in critical domains raises fairness concerns, but existing approaches struggle with trade-offs between competing fairness notions and accuracy, creating barriers to practical implementation.Method: A unifying framework covering eight fairness metrics combining individual/group fairness, infra-marginal/intersectional assumptions, and outcome-based/EOO perspectives, with consistent formulations and weight assignment across objectives.
Result: Applied to four real-world datasets (UCI Adult, COMPAS, German Credit, MEPS), showing nuanced trade-offs between fairness metrics through weight adjustments, with case studies in judicial decision-making and healthcare.
Conclusion: The framework enables practical, value-sensitive deployment of fair AI systems by allowing stakeholders to align fairness interventions with their values and contextual considerations.
Abstract: The increasing use of Artificial Intelligence (AI) in critical societal domains has amplified concerns about fairness, particularly regarding unequal treatment across sensitive attributes such as race, gender, and socioeconomic status. While there has been substantial work on ensuring AI fairness, navigating trade-offs between competing notions of fairness as well as predictive accuracy remains challenging, creating barriers to the practical deployment of fair AI systems. To address this, we introduce a unifying human-centered fairness framework that systematically covers eight distinct fairness metrics, formed by combining individual and group fairness, infra-marginal and intersectional assumptions, and outcome-based and equality-of-opportunity (EOO) perspectives. This structure allows stakeholders to align fairness interventions with their values and contextual considerations. The framework uses a consistent and easy-to-understand formulation for all metrics to reduce the learning curve for non-experts. Rather than privileging a single fairness notion, the framework enables stakeholders to assign weights across multiple fairness objectives, reflecting their priorities and facilitating multi-stakeholder compromises. We apply this approach to four real-world datasets: the UCI Adult census dataset for income prediction, the COMPAS dataset for criminal recidivism, the German Credit dataset for credit risk assessment, and the MEPS dataset for healthcare utilization. We show that adjusting weights reveals nuanced trade-offs between different fairness metrics. Finally, through case studies in judicial decision-making and healthcare, we demonstrate how the framework can inform practical and value-sensitive deployment of fair AI systems.
[651] Recover-to-Forget: Gradient Reconstruction from LoRA for Efficient LLM Unlearning
Yezi Liu, Hanning Chen, Wenjun Huang, Yang Ni, Mohsen Imani
Main category: cs.LG
TL;DR: R2F enables efficient unlearning in LLMs by reconstructing full-model gradients from LoRA adapter updates using a gradient decoder, avoiding full retraining.
Details
Motivation: Current unlearning methods require full-model fine-tuning or access to original training data, limiting scalability and practicality for large foundation models like LLMs.Method: Compute gradients with respect to LoRA parameters using paraphrased prompts, train a gradient decoder to approximate full-model gradients, and transfer decoder from proxy model to target models.
Result: R2F achieves effective unlearning while preserving general model performance, offering a scalable and lightweight alternative without full retraining or access to internal parameters.
Conclusion: The method provides a practical solution for unlearning in pretrained LLMs, enabling dynamic knowledge updates, data deletion rights, and behavior correction with theoretical analysis of cross-model generalization.
Abstract: Unlearning in large foundation models (e.g., LLMs) is essential for enabling dynamic knowledge updates, enforcing data deletion rights, and correcting model behavior. However, existing unlearning methods often require full-model fine-tuning or access to the original training data, which limits their scalability and practicality. In this work, we introduce Recover-to-Forget (R2F), a novel framework for efficient unlearning in LLMs based on reconstructing full-model gradient directions from low-rank LoRA adapter updates. Rather than performing backpropagation through the full model, we compute gradients with respect to LoRA parameters using multiple paraphrased prompts and train a gradient decoder to approximate the corresponding full-model gradients. To ensure applicability to larger or black-box models, the decoder is trained on a proxy model and transferred to target models. We provide a theoretical analysis of cross-model generalization and demonstrate that our method achieves effective unlearning while preserving general model performance. Experimental results demonstrate that R2F offers a scalable and lightweight alternative for unlearning in pretrained LLMs without requiring full retraining or access to internal parameters.
[652] Comparing BFGS and OGR for Second-Order Optimization
Adrian Przybysz, Mikołaj Kołek, Franciszek Sobota, Jarek Duda
Main category: cs.LG
TL;DR: OGR outperforms BFGS for Hessian estimation in non-convex neural network optimization by using gradient regression without Hessian inversion.
Details
Motivation: Hessian estimation for neural networks is challenging due to high dimensionality and computational cost. BFGS has limitations requiring positive definite Hessian approximations and convexity assumptions.Method: Proposes Online Gradient Regression (OGR) that performs regression of gradients against positions using exponential moving average to estimate second derivatives online without Hessian inversion, allowing general (not necessarily positive definite) Hessian estimation.
Result: OGR achieves faster convergence and improved loss compared to BFGS, particularly in non-convex settings, as demonstrated across standard test functions.
Conclusion: OGR provides a superior alternative to BFGS for Hessian estimation in neural network optimization, especially for non-convex problems where traditional positive definite approximations fail.
Abstract: Estimating the Hessian matrix, especially for neural network training, is a challenging problem due to high dimensionality and cost. In this work, we compare the classical Sherman-Morrison update used in the popular BFGS method (Broy-den-Fletcher-Goldfarb-Shanno), which maintains a positive definite Hessian approximation under a convexity assumption, with a novel approach called Online Gradient Regression (OGR). OGR performs regression of gradients against positions using an exponential moving average to estimate second derivatives online, without requiring Hessian inversion. Unlike BFGS, OGR allows estimation of a general (not necessarily positive definite) Hessian and can thus handle non-convex structures. We evaluate both methods across standard test functions and demonstrate that OGR achieves faster convergence and improved loss, particularly in non-convex settings.
[653] LUNE: Efficient LLM Unlearning via LoRA Fine-Tuning with Negative Examples
Yezi Liu, Hanning Chen, Wenjun Huang, Yang Ni, Mohsen Imani
Main category: cs.LG
TL;DR: LUNE is a lightweight LLM unlearning framework using LoRA adapters for efficient knowledge removal without full model retraining.
Details
Motivation: LLMs cannot selectively remove specific information needed for privacy, bias mitigation, and knowledge correction, while traditional unlearning methods are computationally expensive and impractical for real deployment.Method: LoRA-based Unlearning with Negative Examples (LUNE) uses negative-only unlearning by updating only low-rank adapters while freezing the backbone model, targeting intermediate representations to suppress or replace requested knowledge.
Result: LUNE achieves effectiveness comparable to full fine-tuning and memory-editing methods while reducing computational cost by about an order of magnitude across multiple factual unlearning tasks.
Conclusion: LUNE provides a practical, efficient solution for LLM unlearning that localizes edits and avoids disruptive global changes, making it suitable for real-world deployment.
Abstract: Large language models (LLMs) possess vast knowledge acquired from extensive training corpora, but they often cannot remove specific pieces of information when needed, which makes it hard to handle privacy, bias mitigation, and knowledge correction. Traditional model unlearning approaches require computationally expensive fine-tuning or direct weight editing, making them impractical for real-world deployment. In this work, we introduce LoRA-based Unlearning with Negative Examples (LUNE), a lightweight framework that performs negative-only unlearning by updating only low-rank adapters while freezing the backbone, thereby localizing edits and avoiding disruptive global changes. Leveraging Low-Rank Adaptation (LoRA), LUNE targets intermediate representations to suppress (or replace) requested knowledge with an order-of-magnitude lower compute and memory than full fine-tuning or direct weight editing. Extensive experiments on multiple factual unlearning tasks show that LUNE: (I) achieves effectiveness comparable to full fine-tuning and memory-editing methods, and (II) reduces computational cost by about an order of magnitude.
[654] Prediction with Expert Advice under Local Differential Privacy
Ben Jacobsen, Kassem Fawaz
Main category: cs.LG
TL;DR: The paper presents two new LDP algorithms (RW-AdaBatch and RW-Meta) for prediction with expert advice that improve upon classical approaches while maintaining privacy guarantees, with RW-Meta showing strong performance on real-world COVID-19 hospital data.
Details
Motivation: To address the challenge of prediction with expert advice under local differential privacy constraints, where existing approaches have limitations in privacy amplification and expert selection capabilities.Method: 1) Showed a classical algorithm naturally satisfies LDP; 2) Developed RW-AdaBatch with privacy amplification that strengthens on easier data; 3) Created RW-Meta for privately selecting between non-trivial learning algorithm experts; 4) Used random walk theory for analysis.
Result: RW-AdaBatch provides privacy amplification with minimal utility cost. RW-Meta outperforms classical baseline and state-of-the-art central DP algorithm by 1.5-3× on real-world COVID-19 hospital prediction data, with regret bounds scaling inversely with expert independence.
Conclusion: The proposed LDP algorithms effectively balance privacy and utility, with RW-Meta demonstrating superior performance on real-world tasks while enabling private selection between complex learning algorithms as experts.
Abstract: We study the classic problem of prediction with expert advice under the constraint of local differential privacy (LDP). In this context, we first show that a classical algorithm naturally satisfies LDP and then design two new algorithms that improve it: RW-AdaBatch and RW-Meta. For RW-AdaBatch, we exploit the limited-switching behavior induced by LDP to provide a novel form of privacy amplification that grows stronger on easier data, analogous to the shuffle model in offline learning. Drawing on the theory of random walks, we prove that this improvement carries essentially no utility cost. For RW-Meta, we develop a general method for privately selecting between experts that are themselves non-trivial learning algorithms, and we show that in the context of LDP this carries no extra privacy cost. In contrast, prior work has only considered data-independent experts. We also derive formal regret bounds that scale inversely with the degree of independence between experts. Our analysis is supplemented by evaluation on real-world data reported by hospitals during the COVID-19 pandemic; RW-Meta outperforms both the classical baseline and a state-of-the-art \textit{central} DP algorithm by 1.5-3$\times$ on the task of predicting which hospital will report the highest density of COVID patients each week.
[655] LLM-Driven Composite Neural Architecture Search for Multi-Source RL State Encoding
Yu Yu, Qian Xie, Nairen Cao, Li Jin
Main category: cs.LG
TL;DR: LLM-driven neural architecture search for multi-source RL state encoders outperforms traditional NAS and GENIUS with fewer evaluations.
Details
Motivation: Designing state encoders for RL with multiple information sources (sensor data, images, text, etc.) is challenging and requires manual design. Existing NAS methods don't effectively use intermediate module outputs, limiting sample efficiency in multi-source RL.Method: Proposes an LLM-driven NAS pipeline that leverages language-model priors and intermediate-output signals to guide sample-efficient search for composite state encoders. Treats the problem as composite NAS where source-specific modules and fusion modules are jointly optimized.
Result: On mixed-autonomy traffic control task, the approach discovers higher-performing architectures with fewer candidate evaluations than traditional NAS baselines and the LLM-based GENIUS framework.
Conclusion: LLM-driven NAS with intermediate-output guidance enables more sample-efficient discovery of effective multi-source state encoders for RL, addressing limitations of existing methods.
Abstract: Designing state encoders for reinforcement learning (RL) with multiple information sources – such as sensor measurements, time-series signals, image observations, and textual instructions – remains underexplored and often requires manual design. We formalize this challenge as a problem of composite neural architecture search (NAS), where multiple source-specific modules and a fusion module are jointly optimized. Existing NAS methods overlook useful side information from the intermediate outputs of these modules – such as their representation quality – limiting sample efficiency in multi-source RL settings. To address this, we propose an LLM-driven NAS pipeline that leverages language-model priors and intermediate-output signals to guide sample-efficient search for high-performing composite state encoders. On a mixed-autonomy traffic control task, our approach discovers higher-performing architectures with fewer candidate evaluations than traditional NAS baselines and the LLM-based GENIUS framework.
[656] OXtal: An All-Atom Diffusion Model for Organic Crystal Structure Prediction
Emily Jin, Andrei Cristian Nica, Mikhail Galkin, Jarrid Rector-Brooks, Kin Long Kelvin Lee, Santiago Miret, Frances H. Arnold, Michael Bronstein, Avishek Joey Bose, Alexander Tong, Cheng-Hao Liu
Main category: cs.LG
TL;DR: OXtal is a 100M parameter diffusion model that predicts 3D molecular crystal structures from 2D chemical graphs, achieving orders-of-magnitude improvements over prior ML methods with 80% packing similarity rate.
Details
Motivation: Crystal structure prediction (CSP) is crucial for pharmaceuticals and organic semiconductors since crystal packing determines material properties, but accurately predicting experimentally-realizable 3D structures from 2D graphs remains an open challenge.Method: OXtal uses an all-atom diffusion model that learns joint distributions over intramolecular conformations and periodic packing. It abandons explicit equivariant architectures for data augmentation, and introduces Stoichiometric Stochastic Shell Sampling (S⁴) - a lattice-free training scheme that captures long-range interactions without explicit lattice parametrization.
Result: OXtal achieves orders-of-magnitude improvements over prior ML methods, recovering experimental structures with conformer RMSD₁ < 0.5 Å and attaining over 80% packing similarity rate. It remains orders of magnitude cheaper than traditional quantum-chemical approaches.
Conclusion: OXtal demonstrates the ability to model both thermodynamic and kinetic regularities of molecular crystallization, representing a significant advancement in crystal structure prediction through scalable diffusion modeling and novel training strategies.
Abstract: Accurately predicting experimentally-realizable 3D molecular crystal structures from their 2D chemical graphs is a long-standing open challenge in computational chemistry called crystal structure prediction (CSP). Efficiently solving this problem has implications ranging from pharmaceuticals to organic semiconductors, as crystal packing directly governs the physical and chemical properties of organic solids. In this paper, we introduce OXtal, a large-scale 100M parameter all-atom diffusion model that directly learns the conditional joint distribution over intramolecular conformations and periodic packing. To efficiently scale OXtal, we abandon explicit equivariant architectures imposing inductive bias arising from crystal symmetries in favor of data augmentation strategies. We further propose a novel crystallization-inspired lattice-free training scheme, Stoichiometric Stochastic Shell Sampling ($S^4$), that efficiently captures long-range interactions while sidestepping explicit lattice parametrization – thus enabling more scalable architectural choices at all-atom resolution. By leveraging a large dataset of 600K experimentally validated crystal structures (including rigid and flexible molecules, co-crystals, and solvates), OXtal achieves orders-of-magnitude improvements over prior ab initio machine learning CSP methods, while remaining orders of magnitude cheaper than traditional quantum-chemical approaches. Specifically, OXtal recovers experimental structures with conformer $\text{RMSD}_1<0.5$ Å and attains over 80% packing similarity rate, demonstrating its ability to model both thermodynamic and kinetic regularities of molecular crystallization.
[657] Group Representational Position Encoding
Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan, Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
Main category: cs.LG
TL;DR: GRAPE is a unified framework for positional encoding using group actions, with two main variants: Multiplicative GRAPE (rotations in SO(d)) and Additive GRAPE (unipotent actions in GL). It subsumes RoPE and ALiBi as special cases.
Details
Motivation: To create a principled, unified framework for positional encoding that brings together different families of mechanisms (multiplicative rotations and additive logit biases) under a single group-theoretic foundation, providing a design space for long-context models.Method: GRAPE uses group actions: (1) Multiplicative GRAPE with rotations in SO(d) using matrix exponentials of skew generators, (2) Additive GRAPE with unipotent actions in GL producing additive logits. The framework allows for learned commuting subspaces and non-commuting mixtures to capture cross-subspace feature coupling.
Result: GRAPE recovers RoPE exactly when using canonical coordinate pairs with log-uniform spectrum, and recovers ALiBi and Forgetting Transformer as exact special cases. It provides a principled design space that extends existing positional encoding methods while maintaining relative positional laws and streaming cacheability.
Conclusion: GRAPE offers a unified, group-theoretic framework for positional encoding that subsumes existing methods like RoPE and ALiBi, providing a principled design space for positional geometry in long-context models with efficient computational properties.
Abstract: We present GRAPE (Group RepresentAtional Position Encoding), a unified framework for positional encoding based on group actions. GRAPE brings together two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\mathrm{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n)=\exp(n,ω,\mathbf{L})$ with a rank-2 skew generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes are the canonical coordinate pairs with log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise as rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Altogether, GRAPE supplies a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project Page: https://github.com/model-architectures/GRAPE.
[658] Toward Reliable Machine Unlearning: Theory, Algorithms, and Evaluation
Ali Ebrahimpour-Boroojeny
Main category: cs.LG
TL;DR: The paper proposes AMUN for sample unlearning and TRW for class unlearning, both outperforming existing methods by aligning predictions with retrained models and addressing security vulnerabilities.
Details
Motivation: Existing machine unlearning methods fail to properly replicate retrained model behavior and are vulnerable to membership inference attacks, creating security risks when removing sensitive data from trained models.Method: For sample unlearning: AMUN lowers model confidence on forget samples via adversarial example fine-tuning, supported by FastClip for smoothness control. For class unlearning: TRW approximates retrained model distributions using inter-class similarity estimation and tilted reweighting.
Result: AMUN surpasses prior SOTA methods on image classification based on MIA scores. TRW matches or surpasses existing unlearning methods across multiple benchmarks while mitigating vulnerability to nearest-neighbor MIA attacks.
Conclusion: The proposed methods effectively address machine unlearning challenges by aligning with retrained model behavior, improving security against inference attacks, and providing theoretical foundations for performance factors like model smoothness.
Abstract: We propose new methodologies for both unlearning random set of samples and class unlearning and show that they outperform existing methods. The main driver of our unlearning methods is the similarity of predictions to a retrained model on both the forget and remain samples. We introduce Adversarial Machine UNlearning (AMUN), which surpasses prior state-of-the-art methods for image classification based on SOTA MIA scores. AMUN lowers the model’s confidence on forget samples by fine-tuning on their corresponding adversarial examples. Through theoretical analysis, we identify factors governing AMUN’s performance, including smoothness. To facilitate training of smooth models with a controlled Lipschitz constant, we propose FastClip, a scalable method that performs layer-wise spectral-norm clipping of affine layers. In a separate study, we show that increased smoothness naturally improves adversarial example transfer, thereby supporting the second factor above. Following the same principles for class unlearning, we show that existing methods fail in replicating a retrained model’s behavior by introducing a nearest-neighbor membership inference attack (MIA-NN) that uses the probabilities assigned to neighboring classes to detect unlearned samples and demonstrate the vulnerability of such methods. We then propose a fine-tuning objective that mitigates this leakage by approximating, for forget-class inputs, the distribution over remaining classes that a model retrained from scratch would produce. To construct this approximation, we estimate inter-class similarity and tilt the target model’s distribution accordingly. The resulting Tilted ReWeighting(TRW) distribution serves as the desired target during fine-tuning. Across multiple benchmarks, TRW matches or surpasses existing unlearning methods on prior metrics.
[659] Always Keep Your Promises: DynamicLRP, A Model-Agnostic Solution To Layer-Wise Relevance Propagation
Kevin Lee, Pablo Millan Arias
Main category: cs.LG
TL;DR: DynamicLRP is a model-agnostic Layer-wise Relevance Propagation framework that works at tensor operation level using a Promise System for deferred activation resolution, enabling attribution across diverse architectures without model-specific code.
Details
Motivation: Existing LRP implementations operate at module level, requiring architecture-specific propagation rules and modifications, which limits generality and sustainability as architectures evolve. There's a need for a more flexible, model-agnostic approach.Method: DynamicLRP decomposes attribution to individual tensor operations within computation graphs and introduces a Promise System for deferred activation resolution. It operates independently of backpropagation machinery and works on arbitrary computation graphs without model modification.
Result: Achieved 99.92% node coverage across 31,465 computation graph nodes from 15 diverse architectures including Mamba, Whisper, and DePlot. Faithfulness matches or exceeds specialized implementations (1.77 vs 1.69 ABPC on VGG, equivalent on ViT, 93.70-95.06% accuracy on NLP tasks).
Conclusion: DynamicLRP establishes a sustainable, extensible foundation for LRP across evolving architectures through operation-level decomposition and the Promise System, enabling true architecture agnosticity while maintaining theoretical guarantees.
Abstract: Layer-wise Relevance Propagation (LRP) provides principled attribution for neural networks through conservation properties and foundations in Deep Taylor Decomposition. However, existing implementations operate at the module level, requiring architecture-specific propagation rules and modifications. These limit the generality of target model and sustainability of implementations as architectures evolve. We introduce DynamicLRP, a model-agnostic LRP framework operating at the tensor operation level. By decomposing attribution to individual operations within computation graphs and introducing a novel mechanism for deferred activation resolution, named the Promise System, our approach achieves true architecture agnosticity while maintaining LRP’s theoretical guarantees. This design operates independently of backpropagation machinery, enabling operation on arbitrary computation graphs without model modification and side-by-side execution with gradient backpropagation. Being based on computation graphs, this method is theoretically extensible to other deep learning libraries that support auto-differentiation. We demonstrate faithfulness matching or exceeding specialized implementations (1.77 vs 1.69 ABPC on VGG, equivalent performance on ViT, 93.70% and 95.06% top-1 attribution accuracy for explaining RoBERTa-large and Flan-T5-large answers on SQuADv2, respectively) while maintaining practical efficiency on models with hundreds of millions of parameters. We achieved 99.92% node coverage across 31,465 computation graph nodes from 15 diverse architectures, including state-space models (Mamba), audio transformers (Whisper), and multimodal systems (DePlot) without any model-specific code with rules for 47 fundamental operations implemented. Our operation-level decomposition and Promise System establish a sustainable, extensible foundation for LRP across evolving architectures.
[660] Transferring Clinical Knowledge into ECGs Representation
Jose Geraldo Fernandes, Luiz Facury de Souza, Pedro Robles Dutenhefner, Gisele L. Pappa, Wagner Meira
Main category: cs.LG
TL;DR: A three-stage training paradigm that transfers knowledge from multimodal clinical data into a unimodal ECG encoder, improving accuracy and interpretability while only requiring ECG signals at inference.
Details
Motivation: Deep learning models for ECG classification lack trust and clinical adoption due to their black-box nature and limited interpretability, despite high accuracy.Method: Three-stage training: 1) Self-supervised joint-embedding pre-training using multimodal clinical data (labs, vitals, biometrics) to enrich ECG representations, 2) Training to predict associated laboratory abnormalities from ECG embeddings as indirect explanations, 3) Only ECG signals required at inference time.
Result: Outperforms standard signal-only baseline in multi-label diagnosis classification on MIMIC-IV-ECG dataset, bridges substantial performance gap to fully multimodal models while maintaining unimodal inference.
Conclusion: The approach creates more accurate and trustworthy ECG classification models by converting abstract predictions into physiologically grounded explanations, offering a promising path for safer AI integration into clinical workflows.
Abstract: Deep learning models have shown high accuracy in classifying electrocardiograms (ECGs), but their black box nature hinders clinical adoption due to a lack of trust and interpretability. To address this, we propose a novel three-stage training paradigm that transfers knowledge from multimodal clinical data (laboratory exams, vitals, biometrics) into a powerful, yet unimodal, ECG encoder. We employ a self-supervised, joint-embedding pre-training stage to create an ECG representation that is enriched with contextual clinical information, while only requiring the ECG signal at inference time. Furthermore, as an indirect way to explain the model’s output we train it to also predict associated laboratory abnormalities directly from the ECG embedding. Evaluated on the MIMIC-IV-ECG dataset, our model outperforms a standard signal-only baseline in multi-label diagnosis classification and successfully bridges a substantial portion of the performance gap to a fully multimodal model that requires all data at inference. Our work demonstrates a practical and effective method for creating more accurate and trustworthy ECG classification models. By converting abstract predictions into physiologically grounded \emph{explanations}, our approach offers a promising path toward the safer integration of AI into clinical workflows.
[661] Transformation of Biological Networks into Images via Semantic Cartography for Visual Interpretation and Scalable Deep Analysis
Sakib Mostafa, Lei Xing, Md. Tauhidul Islam
Main category: cs.LG
TL;DR: Graph2Image transforms large biological networks into 2D images for CNN analysis, improving scalability, accuracy, and interpretability over traditional methods.
Details
Motivation: Biological networks are crucial for biomedical research but are too large and complex for current computational methods, which suffer from scalability issues, poor long-range dependency capture, multimodal integration difficulties, and limited interpretability.Method: Transform large biological networks into sets of 2D images by spatially arranging representative network nodes on a grid, enabling use of CNNs with global receptive fields and multi-scale pyramids.
Result: Improved classification accuracy by up to 67.2% over existing methods, enabled analysis of networks with >1 billion nodes on personal computers, and provided interpretable visualizations revealing biologically coherent patterns.
Conclusion: Graph2Image offers a scalable, interpretable, multimodal-ready approach for biological network analysis, opening new opportunities for disease diagnosis and complex biological system studies.
Abstract: Complex biological networks are fundamental to biomedical science, capturing interactions among molecules, cells, genes, and tissues. Deciphering these networks is critical for understanding health and disease, yet their scale and complexity represent a daunting challenge for current computational methods. Traditional biological network analysis methods, including deep learning approaches, while powerful, face inherent challenges such as limited scalability, oversmoothing long-range dependencies, difficulty in multimodal integration, expressivity bounds, and poor interpretability. We present Graph2Image, a framework that transforms large biological networks into sets of two-dimensional images by spatially arranging representative network nodes on a 2D grid. This transformation decouples the nodes as images, enabling the use of convolutional neural networks (CNNs) with global receptive fields and multi-scale pyramids, thus overcoming limitations of existing biological network analysis methods in scalability, memory efficiency, and long-range context capture. Graph2Image also facilitates seamless integration with other imaging and omics modalities and enhances interpretability through direct visualization of node-associated images. When applied to several large-scale biological network datasets, Graph2Image improved classification accuracy by up to 67.2% over existing methods and provided interpretable visualizations that revealed biologically coherent patterns. It also allows analysis of very large biological networks (nodes > 1 billion) on a personal computer. Graph2Image thus provides a scalable, interpretable, and multimodal-ready approach for biological network analysis, offering new opportunities for disease diagnosis and the study of complex biological systems.
[662] Self-Supervised Learning on Molecular Graphs: A Systematic Investigation of Masking Design
Jiannan Yang, Veronika Thost, Tengfei Ma
Main category: cs.LG
TL;DR: SSL for molecular graphs: Sophisticated masking strategies don’t beat uniform sampling; prediction target and encoder architecture matter more.
Details
Motivation: Many recent innovations in masking-based pretraining for molecular representation learning are introduced as heuristics without principled evaluation, making it unclear which design choices are genuinely effective.Method: Cast pretrain-finetune workflow into unified probabilistic framework for transparent comparison; conduct controlled study of three core design dimensions (masking distribution, prediction target, encoder architecture) under rigorous settings; use information-theoretic measures to assess pretraining signals.
Result: Sophisticated masking distributions offer no consistent benefit over uniform sampling for common node-level prediction tasks; choice of prediction target and its synergy with encoder architecture are far more critical; semantically richer targets yield substantial downstream improvements, especially when paired with expressive Graph Transformer encoders.
Conclusion: Provides practical guidance for developing more effective SSL methods for molecular graphs by shifting focus from complex masking strategies to semantically richer prediction targets and appropriate encoder architectures.
Abstract: Self-supervised learning (SSL) plays a central role in molecular representation learning. Yet, many recent innovations in masking-based pretraining are introduced as heuristics and lack principled evaluation, obscuring which design choices are genuinely effective. This work cast the entire pretrain-finetune workflow into a unified probabilistic framework, enabling a transparent comparison and deeper understanding of masking strategies. Building on this formalism, we conduct a controlled study of three core design dimensions: masking distribution, prediction target, and encoder architecture, under rigorously controlled settings. We further employ information-theoretic measures to assess the informativeness of pretraining signals and connect them to empirically benchmarked downstream performance. Our findings reveal a surprising insight: sophisticated masking distributions offer no consistent benefit over uniform sampling for common node-level prediction tasks. Instead, the choice of prediction target and its synergy with the encoder architecture are far more critical. Specifically, shifting to semantically richer targets yields substantial downstream improvements, particularly when paired with expressive Graph Transformer encoders. These insights offer practical guidance for developing more effective SSL methods for molecular graphs.
[663] Procrustean Bed for AI-Driven Retrosynthesis: A Unified Framework for Reproducible Evaluation
Anton Morgunov, Victor S. Batista
Main category: cs.LG
TL;DR: RetroCast is a unified evaluation suite for computer-aided synthesis planning that standardizes model outputs, provides rigorous benchmarking with statistical analysis, and reveals critical gaps between solvability metrics and actual route quality.
Details
Motivation: Current CASP evaluation lacks standardization and relies on metrics that prioritize topological completion over chemical validity, obscuring true progress in the field.Method: Introduces RetroCast framework with standardized schema for heterogeneous model outputs, reproducible benchmarking pipeline with stratified sampling and bootstrapped confidence intervals, and SynthArena interactive platform for qualitative route inspection.
Result: Analysis reveals divergence between solvability (stock-termination rate) and route quality, showing high solvability scores often mask chemical invalidity and fail to correlate with experimental ground truth reproduction. Also identifies a “complexity cliff” where search-based methods decay in reconstructing long-range synthetic plans compared to sequence-based approaches.
Conclusion: The framework enables transparent, reproducible CASP development by providing standardized evaluation infrastructure that reveals critical limitations in current metrics and methods, with full framework, benchmarks, and model predictions released to the community.
Abstract: Progress in computer-aided synthesis planning (CASP) is obscured by the lack of standardized evaluation infrastructure and the reliance on metrics that prioritize topological completion over chemical validity. We introduce RetroCast, a unified evaluation suite that standardizes heterogeneous model outputs into a common schema to enable statistically rigorous, apples-to-apples comparison. The framework includes a reproducible benchmarking pipeline with stratified sampling and bootstrapped confidence intervals, accompanied by SynthArena, an interactive platform for qualitative route inspection. We utilize this infrastructure to evaluate leading search-based and sequence-based algorithms on a new suite of standardized benchmarks. Our analysis reveals a divergence between “solvability” (stock-termination rate) and route quality; high solvability scores often mask chemical invalidity or fail to correlate with the reproduction of experimental ground truths. Furthermore, we identify a “complexity cliff” in which search-based methods, despite high solvability rates, exhibit a sharp performance decay in reconstructing long-range synthetic plans compared to sequence-based approaches. We release the full framework, benchmark definitions, and a standardized database of model predictions to support transparent and reproducible development in the field.
[664] TRACE: A Generalizable Drift Detector for Streaming Data-Driven Optimization
Yuan-Ting Zhong, Ting Huang, Xiaolin Xiao, Yue-Jiao Gong
Main category: cs.LG
TL;DR: TRACE is a transferable concept-drift estimator that detects distributional changes in streaming data with varying time scales, enabling adaptive optimization under unknown drifts.
Details
Motivation: Existing methods for Streaming Data-Driven Optimization (SDDO) have restrictive assumptions like fixed drift intervals and full environmental observability, limiting their adaptability to diverse dynamic environments with streaming data and unknown concept drifts.Method: TRACE uses a principled tokenization strategy to extract statistical features from data streams and models drift patterns using attention-based sequence learning, enabling transferable concept-drift detection across different datasets.
Result: Comprehensive experiments show TRACE achieves superior generalization, robustness, and effectiveness in SDDO scenarios, demonstrating accurate drift detection on unseen datasets and highlighting transferability of learned drift patterns.
Conclusion: TRACE provides a plug-and-play solution for adaptive optimization under unknown concept drifts, overcoming limitations of existing methods and enabling more flexible streaming data-driven optimization in diverse dynamic environments.
Abstract: Many optimization tasks involve streaming data with unknown concept drifts, posing a significant challenge as Streaming Data-Driven Optimization (SDDO). Existing methods, while leveraging surrogate model approximation and historical knowledge transfer, are often under restrictive assumptions such as fixed drift intervals and fully environmental observability, limiting their adaptability to diverse dynamic environments. We propose TRACE, a TRAnsferable C}oncept-drift Estimator that effectively detects distributional changes in streaming data with varying time scales. TRACE leverages a principled tokenization strategy to extract statistical features from data streams and models drift patterns using attention-based sequence learning, enabling accurate detection on unseen datasets and highlighting the transferability of learned drift patterns. Further, we showcase TRACE’s plug-and-play nature by integrating it into a streaming optimizer, facilitating adaptive optimization under unknown drifts. Comprehensive experimental results on diverse benchmarks demonstrate the superior generalization, robustness, and effectiveness of our approach in SDDO scenarios.
[665] The Geometry of Persona: Disentangling Personality from Reasoning in Large Language Models
Zhixiang Wang
Main category: cs.LG
TL;DR: The Soul Engine framework enables precise AI personality personalization without fine-tuning by extracting orthogonal personality vectors from frozen LLMs, achieving high-precision profiling while maintaining original reasoning capabilities.
Details
Motivation: Current LLM personalization methods like SFT suffer from the stability-plasticity dilemma and "alignment tax" that degrades general reasoning capabilities. There's a need for personalization approaches that don't compromise core model intelligence.Method: Based on Linear Representation Hypothesis, uses SoulBench dataset via dynamic contextual sampling. Employs dual-head architecture on frozen Qwen-2.5 base to extract disentangled personality vectors without modifying backbone weights.
Result: Three breakthroughs: 1) High-precision profiling (MSE 0.011 against psychological ground truth), 2) Geometric orthogonality enabling zero-shot personality injection, 3) Deterministic steering via vector arithmetic with robust behavior control validated by ablation studies.
Conclusion: Challenges necessity of fine-tuning for personalization. Transitions from probabilistic prompting to deterministic latent intervention, providing mathematically rigorous foundation for safe, controllable AI personalization without compromising original intelligence.
Abstract: Background: The deployment of personalized Large Language Models (LLMs) is currently constrained by the stability-plasticity dilemma. Prevailing alignment methods, such as Supervised Fine-Tuning (SFT), rely on stochastic weight updates that often incur an “alignment tax” – degrading general reasoning capabilities. Methods: We propose the Soul Engine, a framework based on the Linear Representation Hypothesis, which posits that personality traits exist as orthogonal linear subspaces. We introduce SoulBench, a dataset constructed via dynamic contextual sampling. Using a dual-head architecture on a frozen Qwen-2.5 base, we extract disentangled personality vectors without modifying the backbone weights. Results: Our experiments demonstrate three breakthroughs. First, High-Precision Profiling: The model achieves a Mean Squared Error (MSE) of 0.011 against psychological ground truth. Second, Geometric Orthogonality: T-SNE visualization confirms that personality manifolds are distinct and continuous, allowing for “Zero-Shot Personality Injection” that maintains original model intelligence. Third, Deterministic Steering: We achieve robust control over behavior via vector arithmetic, validated through extensive ablation studies. Conclusion: This work challenges the necessity of fine-tuning for personalization. By transitioning from probabilistic prompting to deterministic latent intervention, we provide a mathematically rigorous foundation for safe, controllable AI personalization.
[666] Dual Refinement Cycle Learning: Unsupervised Text Classification of Mamba and Community Detection on Text Attributed Graph
Hong Wang, Yinglong Zhang, Hanhan Guo, Xuewen Xia, Xing Xu
Main category: cs.LG
TL;DR: DRCL is an unsupervised framework that integrates structural and semantic information for community detection in text-attributed networks without requiring labels or category definitions.
Details
Motivation: Pretrained language models need labeled data for deployment, while traditional community detection methods ignore textual semantics, limiting their usefulness in real-world applications like content organization, recommendation, and risk monitoring.Method: Dual Refinement Cycle Learning (DRCL) uses warm-start initialization and bidirectional refinement between a GCN-based Community Detection Module and a Text Semantic Modeling Module, where modules iteratively exchange pseudo-labels to enhance structural clustering and guide text representation learning.
Result: DRCL consistently improves structural and semantic quality of discovered communities across text-attributed graph datasets, and a Mamba-based classifier trained from DRCL’s community signals achieves accuracy comparable to supervised models.
Conclusion: DRCL demonstrates potential for deployment in large-scale systems where labeled data are scarce or costly, offering a practical unsupervised solution for text-attributed network analysis.
Abstract: Pretrained language models offer strong text understanding capabilities but remain difficult to deploy in real-world text-attributed networks due to their heavy dependence on labeled data. Meanwhile, community detection methods typically ignore textual semantics, limiting their usefulness in downstream applications such as content organization, recommendation, and risk monitoring. To overcome these limitations, we present Dual Refinement Cycle Learning (DRCL), a fully unsupervised framework designed for practical scenarios where no labels or category definitions are available. DRCL integrates structural and semantic information through a warm-start initialization and a bidirectional refinement cycle between a GCN-based Community Detection Module (GCN-CDM) and a Text Semantic Modeling Module (TSMM). The two modules iteratively exchange pseudo-labels, allowing semantic cues to enhance structural clustering and structural patterns to guide text representation learning without manual supervision. Across several text-attributed graph datasets, DRCL consistently improves the structural and semantic quality of discovered communities. Moreover, a Mamba-based classifier trained solely from DRCL’s community signals achieves accuracy comparable to supervised models, demonstrating its potential for deployment in large-scale systems where labeled data are scarce or costly.
[667] FOAM: Blocked State Folding for Memory-Efficient LLM Training
Ziqing Wen, Jiahuan Wang, Ping Luo, Dongsheng Li, Tao Sun
Main category: cs.LG
TL;DR: FOAM is a memory-efficient optimizer that compresses Adam’s optimizer states by computing block-wise gradient means with residual correction, reducing training memory by ~50% while maintaining convergence rates equivalent to vanilla Adam.
Details
Motivation: LLMs face significant memory bottlenecks during training due to their scale and memory-intensive optimizers like Adam. Existing memory-efficient approaches often introduce computational overhead, require additional memory, or degrade model performance.Method: FOAM compresses optimizer states by computing block-wise gradient means and incorporates a residual correction mechanism to recover lost information, maintaining theoretical convergence properties while reducing memory usage.
Result: FOAM reduces total training memory by approximately 50%, eliminates up to 90% of optimizer state memory overhead, accelerates convergence, and is compatible with other memory-efficient optimizers while matching or surpassing baseline performance.
Conclusion: FOAM provides an effective solution to memory bottlenecks in LLM training by compressing optimizer states without sacrificing convergence rates or model performance, offering practical benefits for large-scale model training.
Abstract: Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensive optimizers like Adam. Existing memory-efficient approaches often rely on techniques such as singular value decomposition (SVD), projections, or weight freezing, which can introduce substantial computational overhead, require additional memory for projections, or degrade model performance. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information. Theoretically, FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex optimization settings. Empirically, FOAM reduces total training memory by approximately 50%, eliminates up to 90% of optimizer state memory overhead, and accelerates convergence. Furthermore, FOAM is compatible with other memory-efficient optimizers, delivering performance and throughput that match or surpass both full-rank and existing memory-efficient baselines.
[668] PlantBiMoE: A Bidirectional Foundation Model with SparseMoE for Plant Genomes
Kepeng Lin, Qizhe Zhang, Rui Wang, Xuehai Hu, Wei Xu
Main category: cs.LG
TL;DR: PlantBiMoE: A lightweight plant genome language model using bidirectional Mamba and SparseMoE that achieves state-of-the-art performance on 20/31 datasets in the MPGB benchmark.
Details
Motivation: Existing plant genome models like AgroNT and PDLLMs have limitations: excessive parameter sizes and inability to model bidirectional DNA strand dependencies. There's a need for more efficient and expressive models that can capture the structural dependencies across both forward and reverse DNA strands.Method: Proposes PlantBiMoE, which integrates bidirectional Mamba (to capture structural dependencies across both DNA strands) with a Sparse Mixture-of-Experts (SparseMoE) framework (to reduce active parameters while maintaining modeling capacity).
Result: Achieves best performance on 20 out of 31 datasets in the Modified Plants Genome Benchmark (MPGB), which includes 11 representative tasks with sequence lengths from 50 to 6,000 bp. Shows superior average performance compared to existing models.
Conclusion: PlantBiMoE effectively represents plant genomic sequences, serving as a robust computational tool for diverse genomic tasks, with applications in plant genomics, gene editing, and synthetic biology. The model is publicly available.
Abstract: Understanding the underlying linguistic rules of plant genomes remains a fundamental challenge in computational biology. Recent advances including AgroNT and PDLLMs have made notable progress although, they suffer from excessive parameter size and limited ability to model the bidirectional nature of DNA strands respectively. To address these limitations, we propose PlantBiMoE, a lightweight and expressive plant genome language model that integrates bidirectional Mamba and a Sparse Mixture-of-Experts (SparseMoE) framework. The bidirectional Mamba enables the model to effectively capture structural dependencies across both the forward and reverse DNA strands, while SparseMoE significantly reduces the number of active parameters, improving computational efficiency without sacrificing modeling capacity. We evaluated and tested our model on the Modified Plants Genome Benchmark (MPGB), an enhanced genomic benchmark, which consolidates 31 datasets across 11 representative tasks, with input sequence lengths ranging from 50 to 6,000 bp. Experimental results demonstrate that PlantBiMoE achieves the best performance on 20 out of 31 datasets and the average best when comparing with existing models. In summary, all above results demonstrate that our model can effectively represent plant genomic sequences, serving as a robust computational tool for diverse genomic tasks, while making substantive contributions to plant genomics, gene editing, and synthetic biology. The code is available at: https://github.com/HUST-Keep-Lin/PlantBiMoE
[669] Winning the Lottery by Preserving Network Training Dynamics with Concrete Ticket Search
Tanay Arora, Christof Teuscher
Main category: cs.LG
TL;DR: CTS is a new pruning-at-initialization method that uses combinatorial optimization and gradient balancing to find high-performing sparse subnetworks efficiently, outperforming both traditional saliency methods and computationally expensive lottery ticket rewinding.
Details
Motivation: Current pruning methods face a dilemma: Lottery Ticket Rewinding (LTR) is computationally expensive, while Pruning-at-Initialization (PaI) methods suffer from poor accuracy-sparsity trade-offs and fail sanity checks. The authors argue that PaI's reliance on first-order saliency metrics (ignoring inter-weight dependencies) causes this performance gap, especially in sparse regimes.Method: Concrete Ticket Search (CTS) frames subnetwork discovery as a holistic combinatorial optimization problem using Concrete relaxation of the discrete search space. It employs a novel gradient balancing scheme (GRADBALANCE) to control sparsity without sensitive hyperparameter tuning. The authors also propose a knowledge distillation-inspired pruning objective using reverse Kullback-Leibler divergence (CTS-KL) between sparse and dense network outputs.
Result: CTS produces subnetworks that robustly pass sanity checks and achieve accuracy comparable to or exceeding LTR with much less computation. On ResNet-20/CIFAR10, CTS reaches 99.3% sparsity with 74.0% accuracy in 7.9 minutes, while LTR achieves same sparsity with only 68.3% accuracy in 95.2 minutes. CTS outperforms saliency-based methods across all sparsities, with greatest advantage in highly sparse regimes.
Conclusion: CTS addresses fundamental limitations of existing pruning methods by treating subnetwork discovery as combinatorial optimization, enabling efficient identification of high-performing sparse subnetworks near initialization without extensive hyperparameter tuning.
Abstract: The Lottery Ticket Hypothesis asserts the existence of highly sparse, trainable subnetworks (‘winning tickets’) within dense, randomly initialized neural networks. However, state-of-the-art methods of drawing these tickets, like Lottery Ticket Rewinding (LTR), are computationally prohibitive, while more efficient saliency-based Pruning-at-Initialization (PaI) techniques suffer from a significant accuracy-sparsity trade-off and fail basic sanity checks. In this work, we argue that PaI’s reliance on first-order saliency metrics, which ignore inter-weight dependencies, contributes substantially to this performance gap, especially in the sparse regime. To address this, we introduce Concrete Ticket Search (CTS), an algorithm that frames subnetwork discovery as a holistic combinatorial optimization problem. By leveraging a Concrete relaxation of the discrete search space and a novel gradient balancing scheme (GRADBALANCE) to control sparsity, CTS efficiently identifies high-performing subnetworks near initialization without requiring sensitive hyperparameter tuning. Motivated by recent works on lottery ticket training dynamics, we further propose a knowledge distillation-inspired family of pruning objectives, finding that minimizing the reverse Kullback-Leibler divergence between sparse and dense network outputs (CTS-KL) is particularly effective. Experiments on varying image classification tasks show that CTS produces subnetworks that robustly pass sanity checks and achieve accuracy comparable to or exceeding LTR, while requiring only a small fraction of the computation. For example, on ResNet-20 on CIFAR10, it reaches 99.3% sparsity with 74.0% accuracy in 7.9 minutes, while LTR attains the same sparsity with 68.3% accuracy in 95.2 minutes. CTS’s subnetworks outperform saliency-based methods across all sparsities, but its advantage over LTR is most pronounced in the highly sparse regime.
[670] FlowLPS: Langevin-Proximal Sampling for Flow-based Inverse Problem Solvers
Jonghyun Park, Jong Chul Ye
Main category: cs.LG
TL;DR: FlowLPS: A training-free framework using Langevin Proximal Sampling to solve inverse problems with pretrained flow models, addressing convergence and manifold deviation issues.
Details
Motivation: Existing training-free methods for solving inverse problems with deep generative models (particularly latent flow models) often fail to converge to the posterior mode or suffer from manifold deviation within latent spaces, limiting their effectiveness.Method: FlowLPS integrates Langevin dynamics for manifold-consistent exploration with proximal optimization for precise mode seeking, creating a novel training-free framework that solves inverse problems using pretrained flow models.
Result: The method achieves superior balance between reconstruction fidelity and perceptual quality across multiple inverse tasks on FFHQ and DIV2K datasets, outperforming state-of-the-art inverse solvers.
Conclusion: FlowLPS provides an effective training-free solution for inverse problems with flow models by combining Langevin dynamics and proximal optimization to address convergence and manifold consistency issues.
Abstract: Deep generative models have become powerful priors for solving inverse problems, and various training-free methods have been developed. However, when applied to latent flow models, existing methods often fail to converge to the posterior mode or suffer from manifold deviation within latent spaces. To mitigate this, here we introduce a novel training-free framework, FlowLPS, that solves inverse problems with pretrained flow models via a Langevin Proximal Sampling (LPS) strategy. Our method integrates Langevin dynamics for manifold-consistent exploration with proximal optimization for precise mode seeking, achieving a superior balance between reconstruction fidelity and perceptual quality across multiple inverse tasks on FFHQ and DIV2K, outperforming state of the art inverse solvers.
[671] Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration
Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, Souvik Kundu
Main category: cs.LG
TL;DR: CadLLM is a training-free method to accelerate inference throughput of diffusion-based LLMs by dynamically adjusting generation parameters based on token unmasking confidence.
Details
Motivation: Diffusion-based LLMs (dLLMs) have inference inefficiencies due to fixed generation parameters across blocks and steps, and the computational overhead of full vocabulary softmax operations.Method: Lightweight adaptive approach that controls generation block size, step size, and threshold based on average confidence of unmasked tokens, plus dynamic vocabulary subset selection to reduce softmax overhead.
Result: Up to 2.28x throughput improvement over state-of-the-art baseline with competitive accuracy across four popular tasks.
Conclusion: CadLLM is an effective plug-and-play, model-agnostic acceleration method for KV-cache-based dLLMs that maintains accuracy while significantly improving inference throughput.
Abstract: We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
[672] SPACE: Noise Contrastive Estimation Stabilizes Self-Play Fine-Tuning for Large Language Models
Yibo Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Lijun Zhang
Main category: cs.LG
TL;DR: SPACE is a novel self-play fine-tuning method that uses noise contrastive estimation to address instability in existing gap-based methods by treating synthetic samples as auxiliary components and optimizing absolute reward values independently.
Details
Motivation: Existing self-play fine-tuning methods focus on relative reward gaps between real and synthetic data, neglecting absolute values, which leads to unstable evolution and potentially degenerated objectives.Method: SPACE (Self-PlAy via Noise Contrastive Estimation) treats synthetic samples as auxiliary components and discriminates them from real ones in a binary classification manner, using noise contrastive estimation to capture the real-world data distribution.
Result: SPACE significantly improves LLM performance across various tasks, outperforms supervised fine-tuning with more real-world samples, and shows remarkable superiority and stable evolution compared to gap-based methods.
Conclusion: SPACE provides a theoretically grounded, stable self-play fine-tuning method that avoids instability issues by independently optimizing absolute reward values and guarantees provable convergence to the optimal real-world data distribution.
Abstract: Self-play fine-tuning has demonstrated promising abilities in adapting large language models (LLMs) to downstream tasks with limited real-world data. The basic principle is to iteratively refine the model with real samples and synthetic ones generated from itself. However, the existing methods primarily focus on the relative gaps between the rewards for two types of data, neglecting their absolute values. Through theoretical analysis, we identify that the gap-based methods suffer from unstable evolution, due to the potentially degenerated objectives. To address this limitation, we introduce a novel self-play fine-tuning method, namely Self-PlAy via Noise Contrastive Estimation (SPACE), which leverages noise contrastive estimation to capture the real-world data distribution. Specifically, SPACE treats synthetic samples as auxiliary components, and discriminates them from the real ones in a binary classification manner. As a result, SPACE independently optimizes the absolute reward values for each type of data, ensuring a consistently meaningful objective and thereby avoiding the instability issue. Theoretically, we show that the optimal solution of the objective in SPACE aligns with the underlying distribution of real-world data, and SPACE guarantees a provably stable convergence to the optimal distribution. Empirically, we show that SPACE significantly improves the performance of LLMs over various tasks, and outperforms supervised fine-tuning that employs much more real-world samples. Compared to gap-based self-play fine-tuning methods, SPACE exhibits remarkable superiority and stable evolution.
[673] UniDiff: A Unified Diffusion Framework for Multimodal Time Series Forecasting
Da Zhang, Bingyu Li, Zhuyuan Zhao, Junyu Gao, Feiping Nie, Xuelong Li
Main category: cs.LG
TL;DR: UniDiff: A unified diffusion framework for multimodal time series forecasting that integrates numerical sequences, timestamps, and textual information using cross-attention and novel classifier-free guidance.
Details
Motivation: Current diffusion models for time series forecasting are limited to single-modality numerical sequences, ignoring rich cross-modal signals from heterogeneous data like texts and timestamps that are abundant in real-world applications.Method: 1) Tokenizes time series into patches using lightweight MLP to preserve local temporal dynamics; 2) Unified parallel fusion module with single cross-attention mechanism to adaptively integrate timestamp structural information and textual semantic context; 3) Novel classifier-free guidance for multi-source conditioning with decoupled control over textual and temporal guidance strength.
Result: Extensive experiments on real-world benchmark datasets across eight domains demonstrate state-of-the-art performance.
Conclusion: UniDiff successfully addresses the multimodal time series forecasting challenge by effectively leveraging heterogeneous information through a unified diffusion framework with efficient cross-modal fusion and flexible guidance control.
Abstract: As multimodal data proliferates across diverse real-world applications, leveraging heterogeneous information such as texts and timestamps for accurate time series forecasting (TSF) has become a critical challenge. While diffusion models demonstrate exceptional performance in generation tasks, their application to TSF remains largely confined to modeling single-modality numerical sequences, overlooking the abundant cross-modal signals inherent in complex heterogeneous data. To address this gap, we propose UniDiff, a unified diffusion framework for multimodal time series forecasting. To process the numerical sequence, our framework first tokenizes the time series into patches, preserving local temporal dynamics by mapping each patch to an embedding space via a lightweight MLP. At its core lies a unified and parallel fusion module, where a single cross-attention mechanism adaptively weighs and integrates structural information from timestamps and semantic context from texts in one step, enabling a flexible and efficient interplay between modalities. Furthermore, we introduce a novel classifier-free guidance mechanism designed for multi-source conditioning, allowing for decoupled control over the guidance strength of textual and temporal information during inference, which significantly enhances model robustness. Extensive experiments on real-world benchmark datasets across eight domains demonstrate that the proposed UniDiff model achieves state-of-the-art performance.
[674] Less is More: Non-uniform Road Segments are Efficient for Bus Arrival Prediction
Zhen Huang, Jiaxin Deng, Jiayu Xu, Junbiao Pang, Haitao Yu
Main category: cs.LG
TL;DR: RL-based non-uniform road segmentation for bus arrival time prediction that outperforms traditional uniform segmentation methods while maintaining computational efficiency.
Details
Motivation: Traditional uniform segmentation for bus arrival time prediction fails to account for varying physical constraints along roads (road conditions, intersections, POIs), limiting prediction efficiency. There's a need for adaptive segmentation that considers these variations.Method: Two-stage approach: 1) Reinforcement Learning framework extracts non-uniform road segments based on impact scores, 2) Linear prediction model applied to selected segments for arrival time prediction. This decouples segmentation from prediction.
Result: Extensive experiments show superiority over traditional methods - improves both efficiency and learning performance on large-scale benchmarks. Surprisingly, linear approach can outperform more complex methods.
Conclusion: RL-based adaptive segmentation provides optimal segment selection while maintaining computational efficiency, offering significant improvement over uniform approaches for bus arrival time prediction.
Abstract: In bus arrival time prediction, the process of organizing road infrastructure network data into homogeneous entities is known as segmentation. Segmenting a road network is widely recognized as the first and most critical step in developing an arrival time prediction system, particularly for auto-regressive-based approaches. Traditional methods typically employ a uniform segmentation strategy, which fails to account for varying physical constraints along roads, such as road conditions, intersections, and points of interest, thereby limiting prediction efficiency. In this paper, we propose a Reinforcement Learning (RL)-based approach to efficiently and adaptively learn non-uniform road segments for arrival time prediction. Our method decouples the prediction process into two stages: 1) Non-uniform road segments are extracted based on their impact scores using the proposed RL framework; and 2) A linear prediction model is applied to the selected segments to make predictions. This method ensures optimal segment selection while maintaining computational efficiency, offering a significant improvement over traditional uniform approaches. Furthermore, our experimental results suggest that the linear approach can even achieve better performance than more complex methods. Extensive experiments demonstrate the superiority of the proposed method, which not only enhances efficiency but also improves learning performance on large-scale benchmarks. The dataset and the code are publicly accessible at: https://github.com/pangjunbiao/Less-is-More.
[675] Geometric Prior-Guided Federated Prompt Calibration
Fei Luo, Ziwei Zhao, Mingxuan Wang, Duoyang Li, Zhe Qian, Jiayi Tuo, Chenyue Zhou, Yanbiao Ma
Main category: cs.LG
TL;DR: GGTPC is a federated learning framework that uses global geometric priors to correct local prompt bias caused by data heterogeneity, improving performance on skewed datasets.
Details
Motivation: Federated Prompt Learning suffers from performance degradation due to data heterogeneity, which causes locally trained prompts to become biased. Existing methods fail to address this root cause of local training bias.Method: Proposes Geometry-Guided Text Prompt Calibration (GGTPC) with a global geometric prior derived from covariance matrix, reconstructed on server in privacy-preserving manner. Clients use Geometry-Prior Calibration Layer (GPCL) to align local feature distributions with global prior during training.
Result: Outperforms SOTA by 2.15% on label-skewed CIFAR-100 (β=0.1), improves by 9.17% under extreme skew (β=0.01), and boosts FedAvg by 4.60% on domain-skewed Office-Home dataset as plug-and-play module.
Conclusion: GGTPC effectively mitigates data heterogeneity by correcting fundamental local training bias, serving as versatile module to enhance various FL algorithms through geometric prior alignment.
Abstract: Federated Prompt Learning (FPL) offers a parameter-efficient solution for collaboratively training large models, but its performance is severely hindered by data heterogeneity, which causes locally trained prompts to become biased. Existing methods, focusing on aggregation or regularization, fail to address this root cause of local training bias. To this end, we propose Geometry-Guided Text Prompt Calibration (GGTPC), a novel framework that directly corrects this bias by providing clients with a global geometric prior. This prior, representing the shape of the global data distribution derived from the covariance matrix, is reconstructed on the server in a privacy-preserving manner. Clients then use a novel Geometry-Prior Calibration Layer (GPCL) to align their local feature distributions with this global prior during training. Extensive experiments show GGTPC’s effectiveness. On the label-skewed CIFAR-100 dataset ($β$=0.1), it outperforms the state-of-the-art by 2.15%. Under extreme skew ($β$=0.01), it improves upon the baseline by 9.17%. Furthermore, as a plug-and-play module on the domain-skewed Office-Home dataset, it boosts FedAvg’s performance by 4.60%. These results demonstrate that GGTPC effectively mitigates data heterogeneity by correcting the fundamental local training bias, serving as a versatile module to enhance various FL algorithms.
[676] PINE: Pipeline for Important Node Exploration in Attributed Networks
Elizaveta Kovtun, Maksim Makarenko, Natalia Semenova, Alexey Zaytsev, Semen Budennyy
Main category: cs.LG
TL;DR: PINE is an unsupervised, attribute-aware pipeline for identifying important nodes in attributed graphs, combining semantic features with structural properties using attention mechanisms.
Details
Motivation: Traditional centrality measures (degree, PageRank) only consider network structure and ignore rich node attributes. Recent neural methods require supervision. There's a gap for unsupervised, attribute-aware approaches to identify important nodes in attributed graphs.Method: Pipeline for Important Node Exploration (PINE) with an attention-based graph model that incorporates node semantic features while learning structural graph properties. Node importance scores are derived from the attention distribution.
Result: Superior performance demonstrated on various homogeneous and heterogeneous attributed networks. Successfully implemented as an industry system for unsupervised identification of key entities in large-scale enterprise graphs.
Conclusion: PINE effectively addresses the gap for unsupervised, attribute-aware node importance identification in attributed graphs, combining semantic features with structural analysis through attention mechanisms for practical industry applications.
Abstract: A graph with semantically attributed nodes are a common data structure in a wide range of domains. It could be interlinked web data or citation networks of scientific publications. The essential problem for such a data type is to determine nodes that carry greater importance than all the others, a task that markedly enhances system monitoring and management. Traditional methods to identify important nodes in networks introduce centrality measures, such as node degree or more complex PageRank. However, they consider only the network structure, neglecting the rich node attributes. Recent methods adopt neural networks capable of handling node features, but they require supervision. This work addresses the identified gap–the absence of approaches that are both unsupervised and attribute-aware–by introducing a Pipeline for Important Node Exploration (PINE). At the core of the proposed framework is an attention-based graph model that incorporates node semantic features in the learning process of identifying the structural graph properties. The PINE’s node importance scores leverage the obtained attention distribution. We demonstrate the superior performance of the proposed PINE method on various homogeneous and heterogeneous attributed networks. As an industry-implemented system, PINE tackles the real-world challenge of unsupervised identification of key entities within large-scale enterprise graphs.
[677] IFFair: Influence Function-driven Sample Reweighting for Fair Classification
Jingran Yang, Min Zhang, Lingfeng Zhang, Zhaohui Wang, Yonggang Zhang
Main category: cs.LG
TL;DR: IFFair is a pre-processing method using influence functions to mitigate algorithmic bias by adjusting sample weights during training without modifying models or data features.
Details
Motivation: Machine learning algorithms can learn and exacerbate biases from training data, leading to discriminatory decisions against unprivileged groups, which violates equal treatment rights and hinders social well-being and application development.Method: IFFair uses influence function to measure training samples’ influence disparity across different groups, then dynamically adjusts sample weights during training without changing network structure, data features, or decision boundaries.
Result: Experiments on multiple real-world datasets show IFFair mitigates bias across multiple fairness metrics (demographic parity, equalized odds, equality of opportunity, error rate parity) without conflicts, achieving better utility-fairness trade-off than previous pre-processing methods.
Conclusion: IFFair effectively addresses algorithmic bias through influence-based sample weighting, providing a flexible pre-processing approach that maintains model utility while improving fairness across multiple metrics.
Abstract: Because machine learning has significantly improved efficiency and convenience in the society, it’s increasingly used to assist or replace human decision-making. However, the data-based pattern makes related algorithms learn and even exacerbate potential bias in samples, resulting in discriminatory decisions against certain unprivileged groups, depriving them of the rights to equal treatment, thus damaging the social well-being and hindering the development of related applications. Therefore, we propose a pre-processing method IFFair based on the influence function. Compared with other fairness optimization approaches, IFFair only uses the influence disparity of training samples on different groups as a guidance to dynamically adjust the sample weights during training without modifying the network structure, data features and decision boundaries. To evaluate the validity of IFFair, we conduct experiments on multiple real-world datasets and metrics. The experimental results show that our approach mitigates bias of multiple accepted metrics in the classification setting, including demographic parity, equalized odds, equality of opportunity and error rate parity without conflicts. It also demonstrates that IFFair achieves better trade-off between multiple utility and fairness metrics compared with previous pre-processing methods.
[678] SIT-Graph: State Integrated Tool Graph for Multi-Turn Agents
Sijia Li, Yuchen Huang, Zifan Liu, Zijian Li, Jingjing fu, Lei Song, Jiang Bian, Jun Zhang, Rui Wang
Main category: cs.LG
TL;DR: SIT-Graph improves multi-turn tool use by combining episodic state summaries with procedural tool dependencies from historical trajectories, enabling adaptive recall of context when needed.
Details
Motivation: Current LLM agents struggle with multi-turn tool-use scenarios because they either treat entire trajectories as indivisible units or only exploit tool-to-tool dependencies, failing to adapt as states and information evolve across turns. There's a need for systems that can better reuse past experience while handling progressive intent clarification and evolving environments.Method: Proposes State Integrated Tool Graph (SIT-Graph) that builds a tool graph from historical tool-use sequences and augments each edge with compact state summaries of dialog and tool history. The system enables human-like balance between episodic recall (retrieving state summaries when context is needed) and procedural execution (following high-confidence tool dependencies for routine steps).
Result: Experiments across multiple stateful multi-turn tool-use benchmarks show SIT-Graph consistently outperforms strong memory- and graph-based baselines, delivering more robust tool selection and more effective experience transfer.
Conclusion: SIT-Graph effectively addresses multi-turn tool-use challenges by integrating episodic-like state fragments with procedural-like tool dependencies, enabling adaptive decision-making that balances context recall with routine execution for improved performance.
Abstract: Despite impressive advances in agent systems, multi-turn tool-use scenarios remain challenging. It is mainly because intent is clarified progressively and the environment evolves with each tool call. While reusing past experience is natural, current LLM agents either treat entire trajectories or pre-defined subtasks as indivisible units, or solely exploit tool-to-tool dependencies, hindering adaptation as states and information evolve across turns. In this paper, we propose a State Integrated Tool Graph (SIT-Graph), which enhances multi-turn tool use by exploiting partially overlapping experience. Inspired by human decision-making that integrates episodic and procedural memory, SIT-Graph captures both compact state representations (episodic-like fragments) and tool-to-tool dependencies (procedural-like routines) from historical trajectories. Specifically, we first build a tool graph from accumulated tool-use sequences, and then augment each edge with a compact state summary of the dialog and tool history that may shape the next action. At inference time, SIT-Graph enables a human-like balance between episodic recall and procedural execution: when the next decision requires recalling prior context, the agent retrieves the state summaries stored on relevant edges and uses them to guide its next action; when the step is routine, it follows high-confidence tool dependencies without explicit recall. Experiments across multiple stateful multi-turn tool-use benchmarks show that SIT-Graph consistently outperforms strong memory- and graph-based baselines, delivering more robust tool selection and more effective experience transfer.
[679] Towards a Relationship-Aware Transformer for Tabular Data
Andrei V. Konstantinov, Valerii A. Zuev, Lev V. Utkin
Main category: cs.LG
TL;DR: Proposes attention-based models for tabular data that incorporate external dependency graphs between samples, addressing limitations of deep learning models and GNNs for sparse graphs.
Details
Motivation: Deep learning models for tabular data lack ability to incorporate external dependency graphs between samples, which is useful for tasks like treatment effect estimation. Graph neural networks are limited to adjacent nodes and difficult to apply to sparse graphs.Method: Proposes several solutions based on modified attention mechanism that adds a term to the attention matrix to account for possible relationships between data points.
Result: Models are compared with each other and gradient boosting decision trees in regression tasks on synthetic and real-world datasets, and in treatment effect estimation on IHDP dataset.
Conclusion: The attention-based approach enables incorporation of external dependency graphs in tabular data modeling, addressing limitations of existing methods for sparse graph scenarios.
Abstract: Deep learning models for tabular data typically do not allow for imposing a graph of external dependencies between samples, which can be useful for accounting for relatedness in tasks such as treatment effect estimation. Graph neural networks only consider adjacent nodes, making them difficult to apply to sparse graphs. This paper proposes several solutions based on a modified attention mechanism, which accounts for possible relationships between data points by adding a term to the attention matrix. Our models are compared with each other and the gradient boosting decision trees in a regression task on synthetic and real-world datasets, as well as in a treatment effect estimation task on the IHDP dataset.
[680] Learning-Augmented Ski Rental with Discrete Distributions: A Bayesian Approach
Bosun Kang, Hyejun Park, Chenglin Fan
Main category: cs.LG
TL;DR: A Bayesian approach to ski rental problem that unifies worst-case optimization with learning-augmented predictions, using exact posterior distributions for principled uncertainty handling and achieving prior-dependent competitive guarantees.
Details
Motivation: To bridge the gap between traditional worst-case ski rental algorithms and recent learning-augmented approaches by developing a unified Bayesian framework that can handle uncertainty in predictions while maintaining robustness guarantees.Method: Proposes a discrete Bayesian framework that maintains exact posterior distributions over the time horizon, enabling uncertainty quantification and incorporation of expert priors. The algorithm gracefully interpolates between worst-case and fully-informed settings.
Result: Achieves prior-dependent competitive guarantees, demonstrates superior empirical performance across diverse scenarios, achieves near-optimal results under accurate priors while maintaining robust worst-case guarantees.
Conclusion: The Bayesian framework provides a principled approach to online decision problems with imperfect predictions, naturally extending to incorporate multiple predictions, non-uniform priors, and contextual information, highlighting practical advantages of Bayesian reasoning.
Abstract: We revisit the classic ski rental problem through the lens of Bayesian decision-making and machine-learned predictions. While traditional algorithms minimize worst-case cost without assumptions, and recent learning-augmented approaches leverage noisy forecasts with robustness guarantees, our work unifies these perspectives. We propose a discrete Bayesian framework that maintains exact posterior distributions over the time horizon, enabling principled uncertainty quantification and seamless incorporation of expert priors. Our algorithm achieves prior-dependent competitive guarantees and gracefully interpolates between worst-case and fully-informed settings. Our extensive experimental evaluation demonstrates superior empirical performance across diverse scenarios, achieving near-optimal results under accurate priors while maintaining robust worst-case guarantees. This framework naturally extends to incorporate multiple predictions, non-uniform priors, and contextual information, highlighting the practical advantages of Bayesian reasoning in online decision problems with imperfect predictions.
[681] Local-Curvature-Aware Knowledge Graph Embedding: An Extended Ricci Flow Approach
Zhengquan Luo, Guy Tadmor, Or Amar, David Zeevi, Zhiqiang Xu
Main category: cs.LG
TL;DR: RicciKGE adapts knowledge graph embedding geometry via Ricci flow, allowing dynamic co-evolution of entity embeddings and manifold curvature to match heterogeneous graph structures.
Details
Motivation: Existing KGE methods use predefined homogeneous manifolds (Euclidean, spherical, hyperbolic) that cannot accommodate the sharply varying curvature across local regions in real-world knowledge graphs. This mismatch distorts distances and hurts expressiveness.Method: Proposes RicciKGE that couples KGE loss gradient with local curvatures in an extended Ricci flow, allowing entity embeddings to co-evolve dynamically with the underlying manifold geometry toward mutual adaptation.
Result: Theoretically proves that with proper coupling: i) edge-wise curvatures decay exponentially (manifold drives toward Euclidean flatness), and ii) KGE distances converge to global optimum. Experimental improvements on link prediction and node classification benchmarks demonstrate effectiveness.
Conclusion: RicciKGE effectively adapts to heterogeneous knowledge graph structures by dynamically co-evolving embeddings with manifold geometry through Ricci flow, overcoming limitations of predefined homogeneous manifolds.
Abstract: Knowledge graph embedding (KGE) relies on the geometry of the embedding space to encode semantic and structural relations. Existing methods place all entities on one homogeneous manifold, Euclidean, spherical, hyperbolic, or their product/multi-curvature variants, to model linear, symmetric, or hierarchical patterns. Yet a predefined, homogeneous manifold cannot accommodate the sharply varying curvature that real-world graphs exhibit across local regions. Since this geometry is imposed a priori, any mismatch with the knowledge graph’s local curvatures will distort distances between entities and hurt the expressiveness of the resulting KGE. To rectify this, we propose RicciKGE to have the KGE loss gradient coupled with local curvatures in an extended Ricci flow such that entity embeddings co-evolve dynamically with the underlying manifold geometry towards mutual adaptation. Theoretically, when the coupling coefficient is bounded and properly selected, we rigorously prove that i) all the edge-wise curvatures decay exponentially, meaning that the manifold is driven toward the Euclidean flatness; and ii) the KGE distances strictly converge to a global optimum, which indicates that geometric flattening and embedding optimization are promoting each other. Experimental improvements on link prediction and node classification benchmarks demonstrate RicciKGE’s effectiveness in adapting to heterogeneous knowledge graph structures.
[682] Towards Reliable Test-Time Adaptation: Style Invariance as a Correctness Likelihood
Gilhyun Nam, Taewon Kim, Joonhyun Jeong, Eunho Yang
Main category: cs.LG
TL;DR: SICL is a plug-and-play uncertainty calibration framework for test-time adaptation that uses style-invariance to estimate correctness likelihood without backpropagation.
Details
Motivation: Test-time adaptation (TTA) methods often produce poorly calibrated predictive uncertainty, which is critical in high-stakes domains like autonomous driving, finance, and healthcare. Existing calibration methods assume fixed models or static distributions, failing in dynamic real-world test conditions.Method: SICL leverages style-invariance for robust uncertainty estimation by measuring prediction consistency across style-altered variants. It estimates instance-wise correctness likelihood using only forward passes, making it a plug-and-play, backpropagation-free calibration module compatible with any TTA method.
Result: Comprehensive evaluations across four baselines, five TTA methods, and two realistic scenarios with three model architectures show that SICL reduces calibration error by an average of 13 percentage points compared to conventional calibration approaches.
Conclusion: SICL provides an effective solution for uncertainty calibration in test-time adaptation scenarios, addressing the critical need for reliable uncertainty estimation in dynamic real-world applications without requiring architectural changes or additional training.
Abstract: Test-time adaptation (TTA) enables efficient adaptation of deployed models, yet it often leads to poorly calibrated predictive uncertainty - a critical issue in high-stakes domains such as autonomous driving, finance, and healthcare. Existing calibration methods typically assume fixed models or static distributions, resulting in degraded performance under real-world, dynamic test conditions. To address these challenges, we introduce Style Invariance as a Correctness Likelihood (SICL), a framework that leverages style-invariance for robust uncertainty estimation. SICL estimates instance-wise correctness likelihood by measuring prediction consistency across style-altered variants, requiring only the model’s forward pass. This makes it a plug-and-play, backpropagation-free calibration module compatible with any TTA method. Comprehensive evaluations across four baselines, five TTA methods, and two realistic scenarios with three model architecture demonstrate that SICL reduces calibration error by an average of 13 percentage points compared to conventional calibration approaches.
[683] Empirical Results for Adjusting Truncated Backpropagation Through Time while Training Neural Audio Effects
Yann Bourdin, Pierrick Legrand, Fanny Roche
Main category: cs.LG
TL;DR: Optimizing TBPTT hyperparameters (sequence number, batch size, sequence length) improves neural network training for audio effect modeling, specifically dynamic range compression, enhancing accuracy while reducing computational costs.
Details
Motivation: To improve the training of neural networks for digital audio effect modeling (particularly dynamic range compression) by optimizing Truncated Backpropagation Through Time (TBPTT) hyperparameters to enhance model performance and training stability while reducing computational demands.Method: Used a convolutional-recurrent architecture and conducted extensive experiments evaluating key TBPTT hyperparameters (sequence number, batch size, sequence length) across datasets with and without conditioning by user controls.
Result: Careful tuning of TBPTT parameters enhances model accuracy and training stability while reducing computational demands. Objective evaluations show improved performance with optimized settings, and subjective listening tests confirm maintained high perceptual quality.
Conclusion: Optimizing TBPTT hyperparameters is crucial for effective neural network training in audio effect modeling, providing better performance with lower computational requirements while preserving perceptual quality.
Abstract: This paper investigates the optimization of Truncated Backpropagation Through Time (TBPTT) for training neural networks in digital audio effect modeling, with a focus on dynamic range compression. The study evaluates key TBPTT hyperparameters – sequence number, batch size, and sequence length – and their influence on model performance. Using a convolutional-recurrent architecture, we conduct extensive experiments across datasets with and without conditionning by user controls. Results demonstrate that carefully tuning these parameters enhances model accuracy and training stability, while also reducing computational demands. Objective evaluations confirm improved performance with optimized settings, while subjective listening tests indicate that the revised TBPTT configuration maintains high perceptual quality.
[684] Asymptotic analysis of shallow and deep forgetting in replay with Neural Collapse
Giulia Lanzillotta, Damiano Meier, Thomas Hofmann
Main category: cs.LG
TL;DR: The paper reveals that in continual learning, neural networks maintain linearly separable feature representations even when classifiers fail, showing that small replay buffers prevent deep feature forgetting but larger buffers are needed to mitigate shallow classifier forgetting.
Details
Motivation: To resolve the paradox that neural networks retain linearly separable representations of past tasks despite classifier prediction failures, and to understand why Experience Replay requires different buffer sizes for feature preservation versus classifier accuracy.Method: Extends Neural Collapse framework to sequential continual learning, analyzing deep forgetting as geometric drift toward out-of-distribution subspaces and shallow forgetting as statistical artifacts from rank-deficient covariances and inflated class means in small buffers.
Result: Proves that any non-zero replay fraction asymptotically guarantees retention of linear separability (prevents deep forgetting), but small buffers cause “strong collapse” that blinds classifiers to true population boundaries, requiring larger buffers to mitigate shallow forgetting.
Conclusion: Challenges the reliance on large replay buffers in continual learning, suggesting that explicitly correcting statistical artifacts (rank-deficient covariances, inflated class means) could enable robust performance with minimal replay by addressing shallow forgetting directly.
Abstract: A persistent paradox in continual learning (CL) is that neural networks often retain linearly separable representations of past tasks even when their output predictions fail. We formalize this distinction as the gap between deep feature-space and shallow classifier-level forgetting. We reveal a critical asymmetry in Experience Replay: while minimal buffers successfully anchor feature geometry and prevent deep forgetting, mitigating shallow forgetting typically requires substantially larger buffer capacities. To explain this, we extend the Neural Collapse framework to the sequential setting. We characterize deep forgetting as a geometric drift toward out-of-distribution subspaces and prove that any non-zero replay fraction asymptotically guarantees the retention of linear separability. Conversely, we identify that the “strong collapse” induced by small buffers leads to rank-deficient covariances and inflated class means, effectively blinding the classifier to true population boundaries. By unifying CL with out-of-distribution detection, our work challenges the prevailing reliance on large buffers, suggesting that explicitly correcting these statistical artifacts could unlock robust performance with minimal replay.
[685] Adaptive Tuning of Parameterized Traffic Controllers via Multi-Agent Reinforcement Learning
Giray Önür, Azita Dabiri, Bart De Schutter
Main category: cs.LG
TL;DR: Multi-agent RL framework adaptively tunes state feedback traffic controllers, combining reactivity with adaptability while improving training efficiency and system robustness.
Details
Motivation: Conventional traffic management strategies using state feedback controllers lack adaptability to complex, time-varying traffic dynamics despite their simplicity and reactivity.Method: Multi-agent reinforcement learning framework where each agent adaptively tunes parameters of state feedback traffic controllers at lower frequency rather than directly determining high-frequency control actions.
Result: Outperforms no control and fixed-parameter state feedback control, performs on par with single-agent RL-based adaptive control, and shows much better resilience to partial failures.
Conclusion: The proposed multi-agent RL framework successfully combines the reactivity of state feedback controllers with the adaptability of RL, achieving improved training efficiency and enhanced system robustness through decentralized operation.
Abstract: Effective traffic control is essential for mitigating congestion in transportation networks. Conventional traffic management strategies, including route guidance, ramp metering, and traffic signal control, often rely on state feedback controllers, used for their simplicity and reactivity; however, they lack the adaptability required to cope with complex and time-varying traffic dynamics. This paper proposes a multi-agent reinforcement learning framework in which each agent adaptively tunes the parameters of a state feedback traffic controller, combining the reactivity of state feedback controllers with the adaptability of reinforcement learning. By tuning parameters at a lower frequency rather than directly determining control actions at a high frequency, the reinforcement learning agents achieve improved training efficiency while maintaining adaptability to varying traffic conditions. The multi-agent structure further enhances system robustness, as local controllers can operate independently in the event of partial failures. The proposed framework is evaluated on a simulated multi-class transportation network under varying traffic conditions. Results show that the proposed multi-agent framework outperforms the no control and fixed-parameter state feedback control cases, while performing on par with the single-agent RL-based adaptive state feedback control, with a much better resilience to partial failures.
[686] Revolutionizing Mixed Precision Quantization: Towards Training-free Automatic Proxy Discovery via Large Language Models
Haidong Kang, Jun Du, Lihong Lin
Main category: cs.LG
TL;DR: TAP is a novel LLM-driven training-free automatic proxy discovery framework for mixed-precision quantization that eliminates human expert involvement and training costs.
Details
Motivation: Existing MPQ methods either use costly differentiable optimization (inefficient/inflexible) or rely on manually designed proxies by human experts (labor-intensive). The paper aims to design a proxy without human experts or training.Method: Proposes TAP framework using LLMs to automatically find superior proxies for MPQ. Uses Direct Policy Optimization (DPO)-based reinforcement learning to optimize prompts and enhance LLM reasoning, creating a feedback loop between LLMs and MPQ tasks.
Result: Extensive experiments on mainstream benchmarks demonstrate state-of-the-art performance.
Conclusion: TAP significantly contributes to the MPQ community by providing a new perspective on LLM-driven design algorithms, eliminating human expert involvement and training costs.
Abstract: Mixed-Precision Quantization (MPQ) liberates the Deep Neural Networks (DNNs) from the Out-Of-Memory (OOM) bottleneck, which garnered increasing research attention. However, conventional methods either searched from costly differentiable optimization, which is neither efficient nor flexible, or learned a quantized DNN from the proxy (i.e., HAWQ) manually designed by human experts, which is labor-intensive and requires huge expert knowledge. Can we design a proxy without involving any human experts and training? In this paper, we provide an affirmative answer by proposing a novel Large Language Models (LLMs)-driven Training-free Automatic Proxy (dubbed TAP) discovery framework, which reforms the design paradigm of MPQ by utilizing LLMs to find superior TAP tailored for MPQ, automatically. In addition, to bridge the gap between black-box LLMs and the tough MPQ task, we ingeniously propose simple Direct Policy Optimization (DPO) based reinforcement learning to enhance LLMs’ reasoning by optimizing prompts, which can construct a positive feedback loop between the LLM and the MPQ task, enabling LLMs to generate better TAP in the next evolution. Extensive experiments on mainstream benchmarks demonstrate that TAP achieves state-of-the-art performance. Finally, we truly believe that our TAP will significantly contribute to the MPQ community by providing a new perspective on LLM-driven design algorithms.
[687] MIDG: Mixture of Invariant Experts with knowledge injection for Domain Generalization in Multimodal Sentiment Analysis
Yangle Li, Danli Luo, Haifeng Hu
Main category: cs.LG
TL;DR: Proposes MIDG framework for multimodal sentiment analysis domain generalization using Mixture of Invariant Experts and Cross-Modal Adapter to address inter-modal synergy and fragmented knowledge issues.
Details
Motivation: Existing MSA domain generalization methods overlook inter-modal synergies during invariant feature extraction and suffer from fragmented cross-modal knowledge, preventing accurate capture of rich multimodal semantic information.Method: Two key components: 1) Mixture of Invariant Experts model to extract domain-invariant features while enhancing synergistic relationships between modalities, 2) Cross-Modal Adapter to augment semantic richness through cross-modal knowledge injection.
Result: Extensive domain experiments on three datasets demonstrate that the proposed MIDG achieves superior performance compared to existing methods.
Conclusion: The framework effectively addresses limitations in current MSA domain generalization by better capturing inter-modal synergies and integrating cross-modal knowledge, leading to improved performance across domains.
Abstract: Existing methods in domain generalization for Multimodal Sentiment Analysis (MSA) often overlook inter-modal synergies during invariant features extraction, which prevents the accurate capture of the rich semantic information within multimodal data. Additionally, while knowledge injection techniques have been explored in MSA, they often suffer from fragmented cross-modal knowledge, overlooking specific representations that exist beyond the confines of unimodal. To address these limitations, we propose a novel MSA framework designed for domain generalization. Firstly, the framework incorporates a Mixture of Invariant Experts model to extract domain-invariant features, thereby enhancing the model’s capacity to learn synergistic relationships between modalities. Secondly, we design a Cross-Modal Adapter to augment the semantic richness of multimodal representations through cross-modal knowledge injection. Extensive domain experiments conducted on three datasets demonstrate that the proposed MIDG achieves superior performance.
[688] Mitigating Bias in Graph Hyperdimensional Computing
Yezi Liu, William Youngwoo Chung, Yang Ni, Hanning Chen, Mohsen Imani
Main category: cs.LG
TL;DR: FairGHDC: A fairness-aware training framework for graph hyperdimensional computing that reduces demographic parity and equal opportunity gaps while maintaining accuracy and computational efficiency.
Details
Motivation: Graph hyperdimensional computing (HDC) has shown promise for brain-like computation on graph data, but its fairness implications remain unexplored. Biases in data representation and decision rules can lead to unequal treatment of different groups in HDC systems.Method: Proposes FairGHDC framework with a bias correction term derived from a gap-based demographic-parity regularizer. Converts this into a scalar fairness factor that scales the update of class hypervectors for ground-truth labels, enabling debiasing directly in hypervector space without modifying the graph encoder or requiring backpropagation.
Result: On six benchmark datasets, FairGHDC substantially reduces demographic-parity and equal-opportunity gaps while maintaining accuracy comparable to standard GNNs and fairness-aware GNNs. Achieves up to ~10× speedup in training time on GPU compared to GNN baselines.
Conclusion: FairGHDC effectively addresses fairness concerns in graph HDC while preserving its computational advantages, demonstrating that fairness can be achieved without sacrificing the efficiency benefits of hyperdimensional computing.
Abstract: Graph hyperdimensional computing (HDC) has emerged as a promising paradigm for cognitive tasks, emulating brain-like computation with high-dimensional vectors known as hypervectors. While HDC offers robustness and efficiency on graph-structured data, its fairness implications remain largely unexplored. In this paper, we study fairness in graph HDC, where biases in data representation and decision rules can lead to unequal treatment of different groups. We show how hypervector encoding and similarity-based classification can propagate or even amplify such biases, and we propose a fairness-aware training framework, FairGHDC, to mitigate them. FairGHDC introduces a bias correction term, derived from a gap-based demographic-parity regularizer, and converts it into a scalar fairness factor that scales the update of the class hypervector for the ground-truth label. This enables debiasing directly in the hypervector space without modifying the graph encoder or requiring backpropagation. Experimental results on six benchmark datasets demonstrate that FairGHDC substantially reduces demographic-parity and equal-opportunity gaps while maintaining accuracy comparable to standard GNNs and fairness-aware GNNs. At the same time, FairGHDC preserves the computational advantages of HDC, achieving up to about one order of magnitude ($\approx 10\times$) speedup in training time on GPU compared to GNN and fairness-aware GNN baselines.
[689] KAN-Dreamer: Benchmarking Kolmogorov-Arnold Networks as Function Approximators in World Models
Chenwei Shi, Xueyu Luan
Main category: cs.LG
TL;DR: KAN-Dreamer integrates Kolmogorov-Arnold Networks into DreamerV3’s world model, achieving comparable performance to MLP-based DreamerV3 while exploring parameter efficiency benefits.
Details
Motivation: To combine the sample efficiency of DreamerV3 with the parameter efficiency and interpretability of KANs, while addressing KANs' computational overhead through FastKAN variants.Method: Replace specific MLP and convolutional components in DreamerV3 with KAN/FastKAN layers, implement a fully vectorized JAX version with simplified grid management, and evaluate across Visual Perception, Latent Prediction, and Behavior Learning subsystems.
Result: FastKAN as drop-in replacement for Reward and Continue predictors achieves performance parity with original MLP-based DreamerV3 in sample efficiency and training speed on DeepMind Control Suite (walker_walk).
Conclusion: KAN-based architectures can effectively replace MLPs in DreamerV3 without performance degradation, serving as a promising foundation for future KAN-based world model development.
Abstract: DreamerV3 is a state-of-the-art online model-based reinforcement learning (MBRL) algorithm known for remarkable sample efficiency. Concurrently, Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to Multi-Layer Perceptrons (MLPs), offering superior parameter efficiency and interpretability. To mitigate KANs’ computational overhead, variants like FastKAN leverage Radial Basis Functions (RBFs) to accelerate inference. In this work, we investigate integrating KAN architectures into the DreamerV3 framework. We introduce KAN-Dreamer, replacing specific MLP and convolutional components of DreamerV3 with KAN and FastKAN layers. To ensure efficiency within the JAX-based World Model, we implement a tailored, fully vectorized version with simplified grid management. We structure our investigation into three subsystems: Visual Perception, Latent Prediction, and Behavior Learning. Empirical evaluations on the DeepMind Control Suite (walker_walk) analyze sample efficiency, training time, and asymptotic performance. Experimental results demonstrate that utilizing our adapted FastKAN as a drop-in replacement for the Reward and Continue predictors yields performance on par with the original MLP-based architecture, maintaining parity in both sample efficiency and training speed. This report serves as a preliminary study for future developments in KAN-based world models.
[690] Forget and Explain: Transparent Verification of GNN Unlearning
Imran Ahsan, Hyunwook Yu, Jinsung Kim, Mucheol Kim
Main category: cs.LG
TL;DR: Proposes an explainability-driven verifier for GNN unlearning that uses attribution shifts and structural changes as transparent evidence to verify forgetting, addressing the black-box nature of GNNs in privacy compliance.
Details
Motivation: GNNs need to "forget" data for privacy compliance (GDPR), but existing unlearning methods lack transparency and verifiability due to GNNs' black-box nature, making it hard to confirm if forgetting truly occurred.Method: An explainability-driven verifier that snapshots models before/after deletion, using five explainability metrics: residual attribution, heatmap shift, explainability score deviation, graph edit distance, and diagnostic graph rule shift.
Result: Evaluation on two backbones (GCN, GAT) and four unlearning strategies across five benchmarks shows Retrain and GNNDelete achieve near-complete forgetting, GraphEditor provides partial erasure, and IDEA leaves residual signals.
Conclusion: Explanation deltas provide human-readable evidence of forgetting, while membership-inference ROC-AUC serves as complementary privacy signal, offering transparent verification for GNN unlearning in privacy-sensitive applications.
Abstract: Graph neural networks (GNNs) are increasingly used to model complex patterns in graph-structured data. However, enabling them to “forget” designated information remains challenging, especially under privacy regulations such as the GDPR. Existing unlearning methods largely optimize for efficiency and scalability, yet they offer little transparency, and the black-box nature of GNNs makes it difficult to verify whether forgetting has truly occurred. We propose an explainability-driven verifier for GNN unlearning that snapshots the model before and after deletion, using attribution shifts and localized structural changes (for example, graph edit distance) as transparent evidence. The verifier uses five explainability metrics: residual attribution, heatmap shift, explainability score deviation, graph edit distance, and a diagnostic graph rule shift. We evaluate two backbones (GCN, GAT) and four unlearning strategies (Retrain, GraphEditor, GNNDelete, IDEA) across five benchmarks (Cora, Citeseer, Pubmed, Coauthor-CS, Coauthor-Physics). Results show that Retrain and GNNDelete achieve near-complete forgetting, GraphEditor provides partial erasure, and IDEA leaves residual signals. These explanation deltas provide the primary, human-readable evidence of forgetting; we also report membership-inference ROC-AUC as a complementary, graph-wide privacy signal.
[691] Parallel Algorithms for Combined Regularized Support Vector Machines: Application in Music Genre Classification
Rongmei Liang, Zizheng Liu, Xiaofei Wu, Jingwen Tu
Main category: cs.LG
TL;DR: Proposed a unified distributed optimization framework for Combined Regularized Support Vector Machines using consensus ADMM with Gaussian back-substitution, applicable to various loss functions and regularizations including non-convex terms.
Details
Motivation: CR-SVMs effectively handle structural information in data features but lack efficient algorithms for distributed-stored big data, creating a need for scalable distributed optimization methods.Method: Developed a unified consensus-based optimization framework and distributed parallel ADMM algorithm with Gaussian back-substitution for convergence, plus introduced SGL-SVM model for music information retrieval.
Result: Theoretical analysis shows algorithm complexity is independent of regularization terms and loss functions. Experiments on synthetic and music datasets demonstrate reliability, stability, and efficiency.
Conclusion: The proposed framework provides a universal, scalable solution for distributed CR-SVM optimization with strong theoretical guarantees and practical effectiveness across different applications.
Abstract: In the era of rapid development of artificial intelligence, its applications span across diverse fields, relying heavily on effective data processing and model optimization. Combined Regularized Support Vector Machines (CR-SVMs) can effectively handle the structural information among data features, but there is a lack of efficient algorithms in distributed-stored big data. To address this issue, we propose a unified optimization framework based on consensus structure. This framework is not only applicable to various loss functions and combined regularization terms but can also be effectively extended to non-convex regularization terms, showing strong scalability. Based on this framework, we develop a distributed parallel alternating direction method of multipliers (ADMM) algorithm to efficiently compute CR-SVMs when data is stored in a distributed manner. To ensure the convergence of the algorithm, we also introduce the Gaussian back-substitution method. Meanwhile, for the integrity of the paper, we introduce a new model, the sparse group lasso support vector machine (SGL-SVM), and apply it to music information retrieval. Theoretical analysis confirms that the computational complexity of the proposed algorithm is not affected by different regularization terms and loss functions, highlighting the universality of the parallel algorithm. Experiments on synthetic and free music archiv datasets demonstrate the reliability, stability, and efficiency of the algorithm.
[692] Materium: An Autoregressive Approach for Material Generation
Niklas Dobberstein, Jan Hamaekers
Main category: cs.LG
TL;DR: Materium is an autoregressive transformer that generates crystal structures by tokenizing 3D material representations, enabling fast generation compared to diffusion models.
Details
Motivation: To create a faster, more scalable alternative to diffusion-based approaches for crystal structure generation, which require many iterative denoising steps and are computationally expensive.Method: Uses an autoregressive transformer that converts 3D material representations into token sequences containing elements with oxidation states, fractional coordinates, and lattice parameters. The model places atoms at precise fractional coordinates directly rather than refining positions iteratively.
Result: The model can be trained in a few hours on a single GPU and generates samples much faster than diffusion-based approaches on both GPUs and CPUs. It performs well with various conditioning properties including fundamental properties (density, space group) and practical targets (band gap, magnetic density).
Conclusion: Materium provides an efficient, scalable approach to crystal structure generation that outperforms diffusion models in speed while maintaining strong performance across diverse conditioning targets.
Abstract: We present Materium: an autoregressive transformer for generating crystal structures that converts 3D material representations into token sequences. These sequences include elements with oxidation states, fractional coordinates and lattice parameters. Unlike diffusion approaches, which refine atomic positions iteratively through many denoising steps, Materium places atoms at precise fractional coordinates, enabling fast, scalable generation. With this design, the model can be trained in a few hours on a single GPU and generate samples much faster on GPUs and CPUs than diffusion-based approaches. The model was trained and evaluated using multiple properties as conditions, including fundamental properties, such as density and space group, as well as more practical targets, such as band gap and magnetic density. In both single and combined conditions, the model performs consistently well, producing candidates that align with the requested inputs.
[693] Efficient Low-Tubal-Rank Tensor Estimation via Alternating Preconditioned Gradient Descent
Zhiyu Liu, Zhi Han, Yandong Tang, Jun Fan, Yao Wang
Main category: cs.LG
TL;DR: APGD algorithm accelerates low-tubal-rank tensor estimation with linear convergence guarantees even when rank is overestimated, independent of tensor condition number.
Details
Motivation: Traditional tensor SVD is computationally expensive for large-scale tensors, while factorization approaches with gradient descent require accurate rank estimation and suffer slow convergence or divergence when rank is overestimated.Method: Alternating Preconditioned Gradient Descent (APGD) algorithm that adds a preconditioning term to the original gradient and updates two factor tensors alternately to accelerate convergence in over-parameterized settings.
Result: APGD achieves linear convergence even under over-parameterization, with convergence rate independent of tensor condition number. Theoretical guarantees established for general low-tubal-rank tensor estimation, specifically for factorization and recovery problems.
Conclusion: APGD provides an efficient solution for large-scale low-tubal-rank tensor estimation that overcomes limitations of existing methods, particularly handling over-parameterization effectively with provable convergence guarantees.
Abstract: The problem of low-tubal-rank tensor estimation is a fundamental task with wide applications across high-dimensional signal processing, machine learning, and image science. Traditional approaches tackle such a problem by performing tensor singular value decomposition, which is computationally expensive and becomes infeasible for large-scale tensors. Recent approaches address this issue by factorizing the tensor into two smaller factor tensors and solving the resulting problem using gradient descent. However, this kind of approach requires an accurate estimate of the tensor rank, and when the rank is overestimated, the convergence of gradient descent and its variants slows down significantly or even diverges. To address this problem, we propose an Alternating Preconditioned Gradient Descent (APGD) algorithm, which accelerates convergence in the over-parameterized setting by adding a preconditioning term to the original gradient and updating these two factors alternately. Based on certain geometric assumptions on the objective function, we establish linear convergence guarantees for more general low-tubal-rank tensor estimation problems. Then we further analyze the specific cases of low-tubal-rank tensor factorization and low-tubal-rank tensor recovery. Our theoretical results show that APGD achieves linear convergence even under over-parameterization, and the convergence rate is independent of the tensor condition number. Extensive simulations on synthetic data are carried out to validate our theoretical assertions.
[694] Exploring possible vector systems for faster training of neural networks with preconfigured latent spaces
Nikita Gabdullin
Main category: cs.LG
TL;DR: The paper explores using predefined vector systems (like An root systems) as targets for neural network latent space configurations, enabling classifier training without classification layers and faster convergence on large-class datasets.
Details
Motivation: Neural network performance depends on embedding distribution properties in latent space. Training classifiers on datasets with extremely large numbers of classes is challenging with traditional classification layers. Using predefined vector systems as latent space targets could simplify training and improve efficiency for large-scale classification tasks.Method: The paper provides a general overview of possible vector systems for NN training, their properties, and construction methods. These systems are used to configure latent spaces of encoders and visual transformers. The approach involves using minimum latent space dimensions for specific numbers of classes to optimize convergence.
Result: The method significantly speeds up ImageNet-1K and 50k-600k classes latent space configuration training. Using minimum latent space dimensions for specific class counts results in faster convergence, which has potential advantages for reducing vector database sizes used to store NN embeddings.
Conclusion: Predefined vector systems offer an effective approach for configuring neural network latent spaces, enabling efficient training without classification layers, particularly beneficial for datasets with extremely large numbers of classes. The method provides faster convergence and potential storage efficiency improvements.
Abstract: The overall neural network (NN) performance is closely related to the properties of its embedding distribution in latent space (LS). It has recently been shown that predefined vector systems, specifically An root system vectors, can be used as targets for latent space configurations (LSC) to ensure the desired LS structure. One of the main LSC advantage is the possibility of training classifier NNs without classification layers, which facilitates training NNs on datasets with extremely large numbers of classes. This paper provides a more general overview of possible vector systems for NN training along with their properties and methods for vector system construction. These systems are used to configure LS of encoders and visual transformers to significantly speed up ImageNet-1K and 50k-600k classes LSC training. It is also shown that using the minimum number of LS dimensions for a specific number of classes results in faster convergence. The latter has potential advantages for reducing the size of vector databases used to store NN embeddings.
[695] Machine Learning: Progress and Prospects
Alexander Gammerman
Main category: cs.LG
TL;DR: A 1996 inaugural lecture tracing machine learning’s historical origins from ancient philosophy to 20th century developments, with 2025 updates.
Details
Motivation: To provide historical context for machine learning by exploring its philosophical and mathematical origins, showing how the field has evolved from ancient ideas to modern computational approaches.Method: Historical analysis and philosophical tracing of machine learning concepts through different eras, examining contributions from Aristotle, William of Ockham, David Hume, Ronald Fisher, and Claude Shannon.
Result: Identifies multiple potential starting points for machine learning spanning from ancient Greek philosophy (Aristotle) to 20th century mathematics and computer science, demonstrating the field’s interdisciplinary roots.
Conclusion: Machine learning has deep historical roots spanning centuries, with influences from philosophy, mathematics, and computer science, forming a heterogeneous field with overlapping subdisciplines that continue to evolve.
Abstract: This Inaugural Lecture was given at Royal Holloway University of London in 1996. It covers an introduction to machine learning and describes various theoretical advances and practical projects in the field. The Lecture here is presented in its original format, but a few remarks have been added in 2025 to reflect recent developments, and the list of references has been updated to enhance the convenience and accuracy for readers. When did machine learning start? Maybe a good starting point is 1949, when Claude Shannon proposed a learning algorithm for chess-playing programs. Or maybe we should go back to the 1930s when Ronald Fisher developed discriminant analysis - a type of learning where the problem is to construct a decision rule that separates two types of vectors. Or could it be the 18th century when David Hume discussed the idea of induction? Or the 14th century, when William of Ockham formulated the principle of “simplicity” known as “Ockham’s razor” (Ockham, by the way, is a small village not far from Royal Holloway). Or it may be that, like almost everything else in Western civilisation and culture, the origin of these ideas lies in the Mediterranean. After all, it was Aristotle who said that “we learn some things only by doing things”. The field of machine learning has been greatly influenced by other disciplines and the subject is in itself not a very homogeneous discipline, but includes separate, overlapping subfields. There are many parallel lines of research in ML: inductive learning, neural networks, clustering, and theories of learning. They are all part of the more general field of machine learning.
[696] Model-Based Reinforcement Learning Under Confounding
Nishanth Venkatesh, Andreas A. Malikopoulos
Main category: cs.LG
TL;DR: Proposes a model-based RL method for confounded C-MDPs where context is unobserved, using proximal off-policy evaluation and behavior-averaged transitions to create a consistent surrogate MDP.
Details
Motivation: In contextual MDPs with unobserved confounding contexts, conventional model-learning methods are fundamentally inconsistent because they estimate behavioral policy quantities rather than interventional quantities needed for policy evaluation.Method: Adapts proximal off-policy evaluation to identify confounded reward expectations using observable state-action-reward trajectories under mild invertibility conditions. Combines this with behavior-averaged transition models to create a surrogate MDP with well-defined Bellman operator consistent for state-based policies, integrated with MaxCausalEnt framework.
Result: Creates a principled model learning and planning framework for confounded environments where contextual information is unobserved, unavailable, or impractical to collect.
Conclusion: Enables consistent model-based RL in confounded C-MDPs by addressing the fundamental inconsistency of conventional methods through proximal identification and surrogate MDP construction.
Abstract: We investigate model-based reinforcement learning in contextual Markov decision processes (C-MDPs) in which the context is unobserved and induces confounding in the offline dataset. In such settings, conventional model-learning methods are fundamentally inconsistent, as the transition and reward mechanisms generated under a behavioral policy do not correspond to the interventional quantities required for evaluating a state-based policy. To address this issue, we adapt a proximal off-policy evaluation approach that identifies the confounded reward expectation using only observable state-action-reward trajectories under mild invertibility conditions on proxy variables. When combined with a behavior-averaged transition model, this construction yields a surrogate MDP whose Bellman operator is well defined and consistent for state-based policies, and which integrates seamlessly with the maximum causal entropy (MaxCausalEnt) model-learning framework. The proposed formulation enables principled model learning and planning in confounded environments where contextual information is unobserved, unavailable, or impractical to collect.
[697] FRWKV:Frequency-Domain Linear Attention for Long-Term Time Series Forecasting
Qingyuan Yang, Shizhuo, Dongyue Chen, Da Teng, Zehua Gan
Main category: cs.LG
TL;DR: FRWKV is a frequency-domain linear-attention framework for long-sequence time series forecasting that achieves O(T) complexity by combining linear attention with frequency-domain analysis, outperforming traditional Transformers.
Details
Motivation: Traditional Transformers suffer from quadratic O(T²) complexity in long-sequence forecasting and have limited ability to exploit frequency-domain information effectively, creating a major bottleneck for scalable time series modeling.Method: FRWKV integrates linear attention mechanisms (inspired by RWKV’s O(T) approach) with frequency-domain analysis, creating a framework that maintains linear computational complexity while enhancing temporal feature representations through spectral information.
Result: FRWKV achieves first-place average rank across eight real-world datasets, with ablation studies confirming the critical importance of both linear attention and frequency-encoder components.
Conclusion: The work demonstrates powerful synergy between linear attention and frequency analysis, establishing a new paradigm for scalable time series modeling with linear computational complexity.
Abstract: Traditional Transformers face a major bottleneck in long-sequence time series forecasting due to their quadratic complexity $(\mathcal{O}(T^2))$ and their limited ability to effectively exploit frequency-domain information. Inspired by RWKV’s $\mathcal{O}(T)$ linear attention and frequency-domain modeling, we propose FRWKV, a frequency-domain linear-attention framework that overcomes these limitations. Our model integrates linear attention mechanisms with frequency-domain analysis, achieving $\mathcal{O}(T)$ computational complexity in the attention path while exploiting spectral information to enhance temporal feature representations for scalable long-sequence modeling. Across eight real-world datasets, FRWKV achieves a first-place average rank. Our ablation studies confirm the critical roles of both the linear attention and frequency-encoder components. This work demonstrates the powerful synergy between linear attention and frequency analysis, establishing a new paradigm for scalable time series modeling. Code is available at this repository: https://github.com/yangqingyuan-byte/FRWKV.
[698] RRAEDy: Adaptive Latent Linearization of Nonlinear Dynamical Systems
Jad Mounayer, Sebastian Rodriguez, Jerome Tomezyk, Chady Ghnatios, Francisco Chinesta
Main category: cs.LG
TL;DR: RRAEDy is a latent-space model that automatically discovers appropriate latent dimensions while enforcing regularized linear dynamics, using Rank-Reduction Autoencoders and Dynamic Mode Decomposition without complex loss balancing.
Details
Motivation: Existing latent-space models for dynamical systems have limitations: they require fixing latent dimension in advance, rely on complex loss balancing to approximate linear dynamics, and don't regularize latent variables.Method: Built on Rank-Reduction Autoencoders (RRAEs), RRAEDy automatically ranks and prunes latent variables through singular values while learning a latent Dynamic Mode Decomposition (DMD) operator. This structure-free yet linearly constrained formulation enables learning stable low-dimensional dynamics without auxiliary losses or manual tuning.
Result: Experiments on canonical benchmarks (Van der Pol oscillator, Burgers’ equation, 2D Navier-Stokes, Rotating Gaussians) show RRAEDy achieves accurate and robust predictions. Theoretical analysis demonstrates stability of learned operator.
Conclusion: RRAEDy removes limitations of existing models by discovering appropriate latent dimension while enforcing both regularized and linearized dynamics in latent space, providing a generalizable framework with extensions for parametric ODEs.
Abstract: Most existing latent-space models for dynamical systems require fixing the latent dimension in advance, they rely on complex loss balancing to approximate linear dynamics, and they don’t regularize the latent variables. We introduce RRAEDy, a model that removes these limitations by discovering the appropriate latent dimension, while enforcing both regularized and linearized dynamics in the latent space. Built upon Rank-Reduction Autoencoders (RRAEs), RRAEDy automatically rank and prune latent variables through their singular values while learning a latent Dynamic Mode Decomposition (DMD) operator that governs their temporal progression. This structure-free yet linearly constrained formulation enables the model to learn stable and low-dimensional dynamics without auxiliary losses or manual tuning. We provide theoretical analysis demonstrating the stability of the learned operator and showcase the generality of our model by proposing an extension that handles parametric ODEs. Experiments on canonical benchmarks, including the Van der Pol oscillator, Burgers’ equation, 2D Navier-Stokes, and Rotating Gaussians, show that RRAEDy achieves accurate and robust predictions. Our code is open-source and available at https://github.com/JadM133/RRAEDy. We also provide a video summarizing the main results at https://youtu.be/ox70mSSMGrM.
[699] Weighted Contrastive Learning for Anomaly-Aware Time-Series Forecasting
Joel Ekstrand, Tor Mattsson, Zahra Taghiyarrenani, Slawomir Nowaczyk, Jens Lundström, Mikael Lindén
Main category: cs.LG
TL;DR: WECA improves multivariate time series forecasting under anomalies by aligning normal and anomaly-augmented representations, achieving 6.1% SMAPE improvement on anomaly data with minimal normal data degradation.
Details
Motivation: Modern deep forecasters fail under distribution shifts caused by anomalies, which is critical for applications like ATM cash logistics where sudden demand shifts disrupt operations.Method: Weighted Contrastive Adaptation (WECA) uses a weighted contrastive objective to align normal and anomaly-augmented representations, preserving anomaly-relevant information while maintaining consistency under benign variations.
Result: WECA improves SMAPE on anomaly-affected data by 6.1 percentage points compared to normally trained baseline, with negligible degradation on normal data in nationwide ATM transaction dataset evaluations.
Conclusion: WECA enhances forecasting reliability under anomalies without sacrificing performance during regular operations, making it valuable for real-world applications with distribution shifts.
Abstract: Reliable forecasting of multivariate time series under anomalous conditions is crucial in applications such as ATM cash logistics, where sudden demand shifts can disrupt operations. Modern deep forecasters achieve high accuracy on normal data but often fail when distribution shifts occur. We propose Weighted Contrastive Adaptation (WECA), a Weighted contrastive objective that aligns normal and anomaly-augmented representations, preserving anomaly-relevant information while maintaining consistency under benign variations. Evaluations on a nationwide ATM transaction dataset with domain-informed anomaly injection show that WECA improves SMAPE on anomaly-affected data by 6.1 percentage points compared to a normally trained baseline, with negligible degradation on normal data. These results demonstrate that WECA enhances forecasting reliability under anomalies without sacrificing performance during regular operations.
[700] ReLaX: Reasoning with Latent Exploration for Large Reasoning Models
Shimin Zhang, Xianwei Chen, Yufan Shen, Ziyuan Ye, Jibin Wu
Main category: cs.LG
TL;DR: ReLaX introduces latent dynamics analysis using Koopman operator theory to prevent entropy collapse in RLVR, achieving SOTA performance across reasoning benchmarks.
Details
Motivation: RLVR suffers from entropy collapse leading to premature policy convergence and performance saturation. While token-level entropy manipulation helps exploration, the latent dynamics underlying token generation contain richer computational structure for better exploration-exploitation tradeoff.Method: Leverages Koopman operator theory to obtain linearized representation of hidden-state dynamics, introduces Dynamic Spectral Dispersion (DSD) metric to quantify latent dynamics heterogeneity, and proposes Reasoning with Latent eXploration (ReLaX) paradigm that explicitly incorporates latent dynamics to regulate exploration-exploitation.
Result: Comprehensive experiments across multimodal and text-only reasoning benchmarks show ReLaX significantly mitigates premature convergence and consistently achieves state-of-the-art performance.
Conclusion: ReLaX successfully addresses entropy collapse in RLVR by analyzing and regulating latent dynamics, demonstrating the importance of considering underlying computational structure beyond token-level entropy for effective policy optimization in reasoning models.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often leads to entropy collapse, resulting in premature policy convergence and performance saturation. While manipulating token-level entropy has proven effective for promoting policy exploration, we argue that the latent dynamics underlying token generation encode a far richer computational structure for steering policy optimization toward a more effective exploration-exploitation tradeoff. To enable tractable analysis and intervention of the latent dynamics of LRMs, we leverage Koopman operator theory to obtain a linearized representation of their hidden-state dynamics. This enables us to introduce Dynamic Spectral Dispersion (DSD), a new metric to quantify the heterogeneity of the model’s latent dynamics, serving as a direct indicator of policy exploration. Building upon these foundations, we propose Reasoning with Latent eXploration (ReLaX), a paradigm that explicitly incorporates latent dynamics to regulate exploration and exploitation during policy optimization. Comprehensive experiments across a wide range of multimodal and text-only reasoning benchmarks show that ReLaX significantly mitigates premature convergence and consistently achieves state-of-the-art performance.
[701] Time Series Foundation Models for Process Model Forecasting
Yongbo Yu, Jari Peeperkorn, Johannes De Smedt, Jochen De Weerdt
Main category: cs.LG
TL;DR: Time Series Foundation Models (TSFMs) outperform traditional methods for Process Model Forecasting (PMF), with zero-shot use often matching or beating fine-tuned versions, demonstrating strong transfer learning from non-process domains.
Details
Motivation: Prior PMF benchmarks show limited gains from ML/DL models over statistical baselines due to sparsity and heterogeneity of directly-follows time series. The paper investigates whether Time Series Foundation Models (large pre-trained models for generic time series) can provide better performance for PMF.Method: Using DF time series from real-life event logs, the study compares zero-shot use of TSFMs (without additional training) with fine-tuned variants adapted on PMF-specific data. Performance is evaluated against traditional and specialized models trained from scratch on the same logs.
Result: TSFMs generally achieve lower forecasting errors (MAE and RMSE) than traditional models, indicating effective transfer of temporal structure from non-process domains. Fine-tuning provides small improvements but gains may disappear on smaller or more complex datasets, making zero-shot use a strong default.
Conclusion: The study demonstrates the generalization capability and data efficiency of TSFMs for process-related time series, providing the first systematic evaluation of temporal foundation models for PMF and showing they offer superior performance over traditional approaches.
Abstract: Process Model Forecasting (PMF) aims to predict how the control-flow structure of a process evolves over time by modeling the temporal dynamics of directly-follows (DF) relations, complementing predictive process monitoring that focuses on single-case prefixes. Prior benchmarks show that machine learning and deep learning models provide only modest gains over statistical baselines, mainly due to the sparsity and heterogeneity of the DF time series. We investigate Time Series Foundation Models (TSFMs), large pre-trained models for generic time series, as an alternative for PMF. Using DF time series derived from real-life event logs, we compare zero-shot use of TSFMs, without additional training, with fine-tuned variants adapted on PMF-specific data. TSFMs generally achieve lower forecasting errors (MAE and RMSE) than traditional and specialized models trained from scratch on the same logs, indicating effective transfer of temporal structure from non-process domains. While fine-tuning can further improve accuracy, the gains are often small and may disappear on smaller or more complex datasets, so zero-shot use remains a strong default. Our study highlights the generalization capability and data efficiency of TSFMs for process-related time series and, to the best of our knowledge, provides the first systematic evaluation of temporal foundation models for PMF.
[702] A Mathematical Theory of Top-$k$ Sparse Attention via Total Variation Distance
Georgios Tzachristas, Lei Deng, Ioannis Tzachristas, Gong Zhang, Renhai Chen
Main category: cs.LG
TL;DR: The paper develops a mathematical framework for certified Top-k attention truncation with precise error bounds at both distribution and output levels, enabling efficient attention computation while guaranteeing approximation quality.
Details
Motivation: Attention mechanisms in transformers are computationally expensive due to softmax over all tokens. Top-k truncation (using only the k largest attention weights) can reduce computation, but existing methods lack rigorous error guarantees. The paper aims to provide certified bounds for Top-k attention approximation.Method: Develops a unified mathematical framework using total variation distance between full attention distribution P and its Top-k truncation P̂. Derives exact relationship TV(P,P̂)=1-e^{-KL(P̂∥P)} and uses ordered logits to bound the error. Introduces head-tail decomposition to factor output error as τ∥μ_tail-μ_head∥₂ where τ=TV(P,P̂). Under Gaussian score model, derives closed-form tail masses and asymptotic rule for minimal k ensuring TV(P,P̂)≤ε.
Result: Provides deterministic non-asymptotic bounds for TV distance using boundary gaps. Shows output error factorizes cleanly and can be bounded by head-tail diameter. Under Gaussian model, derives asymptotic rule k_ε/n ≈ Φ_c(σ+Φ^{-1}(ε)). Experiments on BERT-base-uncased and synthetic logits confirm scaling predictions and show 2-4× reduction in scored keys while meeting TV budget.
Conclusion: The framework provides certified Top-k attention truncation with precise error control, enabling efficient attention computation with guaranteed approximation quality. The mathematical results connect distribution-level and output-level errors, offering practical tools for designing attention approximations with formal guarantees.
Abstract: We develop a unified mathematical framework for certified Top-$k$ attention truncation that quantifies approximation error at both the distribution and output levels. For a single attention distribution $P$ and its Top-$k$ truncation $\hat P$, we show that the total-variation distance coincides with the discarded softmax tail mass and satisfies $\mathrm{TV}(P,\hat P)=1-e^{-\mathrm{KL}(\hat P\Vert P)}$, yielding sharp Top-$k$-specific bounds in place of generic inequalities. From this we derive non-asymptotic deterministic bounds – from a single boundary gap through multi-gap and blockwise variants – that control $\mathrm{TV}(P,\hat P)$ using only the ordered logits. Using an exact head-tail decomposition, we prove that the output error factorizes as $|\mathrm{Attn}(q,K,V)-\mathrm{Attn}k(q,K,V)|2=τ|μ{\mathrm{tail}}-μ{\mathrm{head}}|2$ with $τ=\mathrm{TV}(P,\hat P)$, yielding a new head-tail diameter bound $|\mathrm{Attn}(q,K,V)-\mathrm{Attn}k(q,K,V)|2\leτ,\mathrm{diam}{H,T}$ and refinements linking the error to $\mathrm{Var}P(V)$. Under an i.i.d. Gaussian score model $s_i\sim\mathcal N(μ,σ^2)$ we derive closed-form tail masses and an asymptotic rule for the minimal $k\varepsilon$ ensuring $\mathrm{TV}(P,\hat P)\le\varepsilon$, namely $k\varepsilon/n\approxΦ_c(σ+Φ^{-1}(\varepsilon))$. Experiments on bert-base-uncased and synthetic logits confirm the predicted scaling of $k\varepsilon/n$ and show that certified Top-$k$ can reduce scored keys by 2-4$\times$ on average while meeting the prescribed total-variation budget.
[703] Depth-Wise Activation Steering for Honest Language Models
Gracjan Góral, Marysia Winkels, Steven Basart
Main category: cs.LG
TL;DR: Training-free activation steering method using Gaussian scheduling across network depth improves honesty in LLMs without retraining.
Details
Motivation: LLMs sometimes assert falsehoods despite knowing the correct answer (failures of honesty rather than accuracy), which undermines auditability and safety. Existing approaches focus on factual correctness or require retraining, offering limited control over truthful reporting.Method: A training-free activation steering method that weights steering strength across network depth using a Gaussian schedule. The method is model-agnostic, requires no finetuning, and provides a low-cost control knob for eliciting truthful reporting.
Result: On the MASK benchmark (which separates honesty from knowledge), Gaussian scheduling improved honesty over no-steering and single-layer baselines in 6 out of 7 models (LLaMA, Qwen, and Mistral families). Equal-budget ablations showed Gaussian schedule outperforms random, uniform, and box-filter depth allocations.
Conclusion: How intervention is distributed across depth materially affects outcomes beyond total strength. The method is simple, effective, and provides a practical way to improve honesty in LLMs without retraining.
Abstract: Large language models sometimes assert falsehoods despite internally representing the correct answer, failures of honesty rather than accuracy, which undermines auditability and safety. Existing approaches largely optimize factual correctness or depend on retraining and brittle single-layer edits, offering limited leverage over truthful reporting. We present a training-free activation steering method that weights steering strength across network depth using a Gaussian schedule. On the MASK benchmark, which separates honesty from knowledge, we evaluate seven models spanning the LLaMA, Qwen, and Mistral families and find that Gaussian scheduling improves honesty over no-steering and single-layer baselines in six of seven models. Equal-budget ablations on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct show the Gaussian schedule outperforms random, uniform, and box-filter depth allocations, indicating that how intervention is distributed across depth materially affects outcomes beyond total strength. The method is simple, model-agnostic, requires no finetuning, and provides a low-cost control knob for eliciting truthful reporting from models’ existing capabilities.
[704] A Bootstrap Perspective on Stochastic Gradient Descent
Hongjian Lan, Yucong Liu, Florian Schäfer
Main category: cs.LG
TL;DR: SGD’s generalization advantage comes from implicitly regularizing gradient covariance, making solutions robust to data sampling noise through bootstrap-like mechanism.
Details
Motivation: To understand why SGD generalizes better than GD, focusing on how SGD's stochasticity acts as a proxy for data variability and leads to more robust solutions.Method: Analyze SGD through statistical bootstrap lens, showing it implicitly regularizes trace of gradient covariance matrix. Use idealized experiments on empirical risk minimization and theoretical analysis to demonstrate SGD’s preference for robust solutions.
Result: SGD selects parameters robust to resampling, avoids spurious solutions even in wider minima, controls algorithmic variability via gradient covariance regularization, and explicit regularization with algorithmic variability improves test performance in neural networks.
Conclusion: SGD’s generalization advantage stems from bootstrap estimation mechanism - using gradient variability as proxy for data variability, leading to solutions less sensitive to sampling noise and better generalization.
Abstract: Machine learning models trained with \emph{stochastic} gradient descent (SGD) can generalize better than those trained with deterministic gradient descent (GD). In this work, we study SGD’s impact on generalization through the lens of the statistical bootstrap: SGD uses gradient variability under batch sampling as a proxy for solution variability under the randomness of the data collection process. We use empirical results and theoretical analysis to substantiate this claim. In idealized experiments on empirical risk minimization, we show that SGD is drawn to parameter choices that are robust under resampling and thus avoids spurious solutions even if they lie in wider and deeper minima of the training loss. We prove rigorously that by implicitly regularizing the trace of the gradient covariance matrix, SGD controls the algorithmic variability. This regularization leads to solutions that are less sensitive to sampling noise, thereby improving generalization. Numerical experiments on neural network training show that explicitly incorporating the estimate of the algorithmic variability as a regularizer improves test performance. This fact supports our claim that bootstrap estimation underpins SGD’s generalization advantages.
[705] In-Context and Few-Shots Learning for Forecasting Time Series Data based on Large Language Models
Saroj Gopali, Bipin Chhetri, Deepika Giri, Sima Siami-Namini, Akbar Siami Namin
Main category: cs.LG
TL;DR: This paper compares foundation models (TimesFM, LLMs) against traditional time series models (LSTM, TCN) for forecasting, finding TimesFM achieves best performance with lowest RMSE and competitive inference time.
Details
Motivation: With the emergence of foundation models like TimesFM and LLMs, the paper investigates whether these pre-trained models can outperform existing time series approaches (ARIMA, LSTM, TCN) in forecasting tasks.Method: The study compares multiple approaches: TimesFM (time series foundation model), LLMs (OpenAI o4-mini and Gemini 2.5 Flash Lite) using in-context, zero-shot, and few-shot learning, and traditional deep learning models (TCN and LSTM).
Result: TimesFM achieved the best overall performance with lowest RMSE (0.3023) and competitive inference time (266 seconds). OpenAI’s o4-mini also performed well with zero-shot learning. Foundation models show promise for real-time forecasting.
Conclusion: Pre-trained time series foundation models represent a promising direction for accurate and scalable real-time forecasting with minimal adaptation, outperforming traditional deep learning approaches.
Abstract: Existing data-driven approaches in modeling and predicting time series data include ARIMA (Autoregressive Integrated Moving Average), Transformer-based models, LSTM (Long Short-Term Memory) and TCN (Temporal Convolutional Network). These approaches, and in particular deep learning-based models such as LSTM and TCN, have shown great results in predicting time series data. With the advancement of leveraging pre-trained foundation models such as Large Language Models (LLMs) and more notably Google’s recent foundation model for time series data, {\it TimesFM} (Time Series Foundation Model), it is of interest to investigate whether these foundation models have the capability of outperforming existing modeling approaches in analyzing and predicting time series data. This paper investigates the performance of using LLM models for time series data prediction. We investigate the in-context learning methodology in the training of LLM models that are specific to the underlying application domain. More specifically, the paper explores training LLMs through in-context, zero-shot and few-shot learning and forecasting time series data with OpenAI {\tt o4-mini} and Gemini 2.5 Flash Lite, as well as the recent Google’s Transformer-based TimesFM, a time series-specific foundation model, along with two deep learning models, namely TCN and LSTM networks. The findings indicate that TimesFM has the best overall performance with the lowest RMSE value (0.3023) and the competitive inference time (266 seconds). Furthermore, OpenAI’s o4-mini also exhibits a good performance based on Zero Shot learning. These findings highlight pre-trained time series foundation models as a promising direction for real-time forecasting, enabling accurate and scalable deployment with minimal model adaptation.
[706] Enabling Delayed-Full Charging Through Transformer-Based Real-Time-to-Departure Modeling for EV Battery Longevity
Yonggeon Lee, Jibin Hwang, Alfred Malengo Kondoro, Juhyun Song, Youngtae Noh
Main category: cs.LG
TL;DR: Transformer-based real-time-to-event model for predicting EV departure times to optimize charging schedules and reduce battery degradation.
Details
Motivation: Electric vehicle lithium-ion batteries degrade faster under prolonged high states of charge. Delaying full charging until just before departure requires accurate departure time prediction to mitigate battery degradation and support sustainable mobility.Method: Transformer-based real-time-to-event (TTE) model that represents each day as a TTE sequence by discretizing time into grid-based tokens. Unlike previous methods relying on historical patterns, this approach leverages streaming contextual information to predict departures.
Result: Evaluation on real-world study with 93 users and passive smartphone data shows the method effectively captures irregular departure patterns within individual routines and outperforms baseline models.
Conclusion: The proposed method demonstrates potential for practical deployment and contributes to sustainable transportation systems by enabling optimized EV charging schedules that reduce battery degradation.
Abstract: Electric vehicles (EVs) are key to sustainable mobility, yet their lithium-ion batteries (LIBs) degrade more rapidly under prolonged high states of charge (SOC). This can be mitigated by delaying full charging \ours until just before departure, which requires accurate prediction of user departure times. In this work, we propose Transformer-based real-time-to-event (TTE) model for accurate EV departure prediction. Our approach represents each day as a TTE sequence by discretizing time into grid-based tokens. Unlike previous methods primarily dependent on temporal dependency from historical patterns, our method leverages streaming contextual information to predict departures. Evaluation on a real-world study involving 93 users and passive smartphone data demonstrates that our method effectively captures irregular departure patterns within individual routines, outperforming baseline models. These results highlight the potential for practical deployment of the \ours algorithm and its contribution to sustainable transportation systems.
[707] Formalized Hopfield Networks and Boltzmann Machines
Matteo Cipollina, Michail Karatarakis, Freek Wiedijk
Main category: cs.LG
TL;DR: Formal verification of neural networks in Lean 4, covering deterministic Hopfield networks and stochastic Boltzmann machines with convergence proofs.
Details
Motivation: Neural networks are widely used but challenging to analyze and verify formally. The paper aims to provide rigorous mathematical verification of neural network properties using theorem proving.Method: Uses Lean 4 theorem prover to formalize both deterministic (Hopfield networks) and stochastic (Boltzmann machines) neural network models. Proves convergence properties and correctness of Hebbian learning for orthogonal patterns.
Result: Successfully formalized Hopfield networks with convergence proofs and Hebbian learning correctness for orthogonal patterns. Formalized Boltzmann machines and proved their ergodicity using a new formalization of Perron-Frobenius theorem.
Conclusion: The work demonstrates that formal verification of neural networks is feasible using theorem proving, providing rigorous guarantees for convergence and learning correctness in both deterministic and stochastic settings.
Abstract: Neural networks are widely used, yet their analysis and verification remain challenging. In this work, we present a Lean 4 formalization of neural networks, covering both deterministic and stochastic models. We first formalize Hopfield networks, recurrent networks that store patterns as stable states. We prove convergence and the correctness of Hebbian learning, a training rule that updates network parameters to encode patterns, here limited to the case of pairwise-orthogonal patterns. We then consider stochastic networks, where updates are probabilistic and convergence is to a stationary distribution. As a canonical example, we formalize the dynamics of Boltzmann machines and prove their ergodicity, showing convergence to a unique stationary distribution using a new formalization of the Perron-Frobenius theorem.
[708] GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory
Jiaxu Liu, Yuhe Bai, Christos-Savvas Bouganis
Main category: cs.LG
TL;DR: GatedFWA: A gated sliding window attention mechanism that maintains linear-time efficiency while stabilizing memory updates and controlling gradient flow through learnable contraction gates.
Details
Motivation: Softmax full attention scales quadratically, while Sliding Window Attention (SWA) has linear complexity but suffers from unbounded training objectives under associative memory interpretation. Softmax attention causes memory shrinkage and gradient vanishing. Need efficient attention that stabilizes memory updates and controls gradient flow.Method: GatedFWA accumulates per-token/head gates as decay bias added to attention logits, acting as learnable contraction in memory recurrence. Uses fused one-pass gate preprocessing and FlashAttention-compatible kernel with sliding mask injection for I/O efficiency and numerical stability.
Result: Competitive throughput with negligible overhead, better use of global context, clean integration with token compression/selection methods (NSA), and generalization to various autoregressive domains.
Conclusion: GatedFWA preserves SWA’s linear-time efficiency while solving memory stability and gradient flow issues through gated contraction mechanism, making it practical for long-sequence modeling.
Abstract: Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an \textit{Associative Memory} interpretation, its difference-style update renders the training objective effectively \emph{unbounded}. In contrast, Softmax attention normalizes updates, leading to \emph{memory shrinkage and gradient vanishing}. We propose GatedFWA: a Memory-\underline{Gated} (\underline{F}lash) \underline{W}indowed \underline{A}ttention mechanism that preserves SWAs efficiency while stabilizing memory updates and making gradient flow controllable. In essence, GatedFWA accumulate a per-token/head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We implement a fused one-pass gate preprocessing and a FlashAttention-compatible kernel that injects the gate under a sliding mask, ensuring I/O efficiency and numerical stability. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead and better use of global context, and it integrates cleanly with token compression/selection methods such as NSA and generalizes to various autoregressive domains.
[709] Provable Long-Range Benefits of Next-Token Prediction
Xinyuan Cao, Santosh S. Vempala
Main category: cs.LG
TL;DR: Next-token prediction in RNNs can learn long-range structure and generate coherent documents indistinguishable from training data for any k tokens.
Details
Motivation: To explain why modern language models trained on next-word prediction can generate coherent documents and capture long-range structure, despite being trained only on local token prediction.Method: Theoretical analysis proving that optimizing next-token prediction over Recurrent Neural Networks (RNNs) yields models that approximate the training distribution. The proof shows that for any k, no bounded algorithm can distinguish between k consecutive tokens from training documents and k tokens generated by the learned model following the same prefix.
Result: Provides polynomial bounds (in k, independent of document length) on model size needed to achieve k-token indistinguishability, offering a complexity-theoretic explanation for long-range coherence in practice.
Conclusion: Next-token prediction is provably powerful for learning longer-range structure, explaining why language models trained on this simple objective can generate coherent documents and capture long-range dependencies.
Abstract: Why do modern language models, trained to do well on next-word prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning longer-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from the training distribution, no algorithm of bounded description length limited to examining the next $k$ tokens, for any $k$, can distinguish between $k$ consecutive tokens of such documents and $k$ tokens generated by the learned language model following the same prefix. We provide polynomial bounds (in $k$, independent of the document length) on the model size needed to achieve such $k$-token indistinguishability, offering a complexity-theoretic explanation for the long-range coherence observed in practice.
[710] Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment
Moxin Li, Yuantao Zhang, Wenjie Wang, Wentao Shi, Zhuo Liu, Fuli Feng, Tat-Seng Chua
Main category: cs.LG
TL;DR: SIPO: A self-improving DPO framework that resolves preference conflicts in multi-objective alignment by having LLMs self-generate and select Pareto-optimal responses.
Details
Motivation: DPO-based multi-objective alignment approaches suffer from widespread preference conflicts where different objectives favor different responses, causing conflicting optimization directions that hinder Pareto Front optimization.Method: Proposes constructing Pareto-optimal responses to resolve preference conflicts, using a self-improving DPO framework where LLMs self-generate and select Pareto-optimal responses for self-supervised preference alignment.
Result: Extensive experiments on two datasets demonstrate superior Pareto Front achievement compared to various baselines.
Conclusion: The proposed self-improving DPO framework effectively addresses preference conflicts in multi-objective alignment by enabling LLMs to self-generate Pareto-optimal responses, leading to better optimization on the Pareto Front.
Abstract: Multi-Objective Alignment (MOA) aims to align LLMs’ responses with multiple human preference objectives, with Direct Preference Optimization (DPO) emerging as a prominent approach. However, we find that DPO-based MOA approaches suffer from widespread preference conflicts in the data, where different objectives favor different responses. This results in conflicting optimization directions, hindering the optimization on the Pareto Front. To address this, we propose to construct Pareto-optimal responses to resolve preference conflicts. To efficiently obtain and utilize such responses, we propose a self-improving DPO framework that enables LLMs to self-generate and select Pareto-optimal responses for self-supervised preference alignment. Extensive experiments on two datasets demonstrate the superior Pareto Front achieved by our framework compared to various baselines. Code is available at https://github.com/zyttt-coder/SIPO.
[711] The Adoption and Usage of AI Agents: Early Evidence from Perplexity
Jeremy Yang, Noah Yonack, Kate Zyskowski, Denis Yarats, Johnny Ho, Jerry Ma
Main category: cs.LG
TL;DR: First large-scale field study of general-purpose AI agents in open-world web environments using Comet browser data reveals adoption patterns, usage intensity, and diverse use cases across productivity, learning, and research domains.
Details
Motivation: To understand the real-world adoption, usage patterns, and applications of general-purpose AI agents operating in open-world web environments, addressing fundamental questions about who uses them, how intensively, and for what purposes.Method: Analyzed hundreds of millions of anonymized user interactions from Comet browser and its integrated Comet Assistant agent. Introduced hierarchical agentic taxonomy organizing use cases across three levels: topic, subtopic, and task. Studied adoption patterns across user segments and usage evolution over time.
Result: Substantial heterogeneity in adoption: earlier adopters, users in higher GDP/education countries, and digital/knowledge-intensive sector workers more likely to adopt. Productivity & Workflow (57% of queries) and Learning & Research are largest topics. Courses and Shopping for Goods (22%) are largest subtopics. Top 10 out of 90 tasks represent 55% of queries. Personal use (55%), professional (30%), educational (16%). Short-term usage shows stickiness but shifts toward more cognitively oriented topics over time.
Conclusion: The diffusion of increasingly capable AI agents has significant implications for researchers, businesses, policymakers, and educators, requiring new lines of inquiry into this emerging class of AI capabilities as adoption patterns and usage evolve toward more complex cognitive applications.
Abstract: This paper presents the first large-scale field study of the adoption, usage intensity, and use cases of general-purpose AI agents operating in open-world web environments. Our analysis centers on Comet, an AI-powered browser developed by Perplexity, and its integrated agent, Comet Assistant. Drawing on hundreds of millions of anonymized user interactions, we address three fundamental questions: Who is using AI agents? How intensively are they using them? And what are they using them for? Our findings reveal substantial heterogeneity in adoption and usage across user segments. Earlier adopters, users in countries with higher GDP per capita and educational attainment, and individuals working in digital or knowledge-intensive sectors – such as digital technology, academia, finance, marketing, and entrepreneurship – are more likely to adopt or actively use the agent. To systematically characterize the substance of agent usage, we introduce a hierarchical agentic taxonomy that organizes use cases across three levels: topic, subtopic, and task. The two largest topics, Productivity & Workflow and Learning & Research, account for 57% of all agentic queries, while the two largest subtopics, Courses and Shopping for Goods, make up 22%. The top 10 out of 90 tasks represent 55% of queries. Personal use constitutes 55% of queries, while professional and educational contexts comprise 30% and 16%, respectively. In the short term, use cases exhibit strong stickiness, but over time users tend to shift toward more cognitively oriented topics. The diffusion of increasingly capable AI agents carries important implications for researchers, businesses, policymakers, and educators, inviting new lines of inquiry into this rapidly emerging class of AI capabilities.
[712] Process Reward Models That Think
Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang
Main category: cs.LG
TL;DR: ThinkPRM is a generative chain-of-thought verifier that uses minimal process supervision (1% of PRM800K labels) to outperform discriminative verifiers and LLM-as-a-Judge across multiple benchmarks.
Details
Motivation: Process reward models (PRMs) require expensive step-level supervision for training. The goal is to build data-efficient PRMs that can verify solution steps while requiring minimal process labels.Method: ThinkPRM is a long chain-of-thought verifier fine-tuned on very few process labels. It generates verification chain-of-thought to verify each solution step, leveraging the inherent reasoning abilities of long CoT models.
Result: ThinkPRM outperforms LLM-as-a-Judge and discriminative verifiers using only 1% of PRM800K labels. It beats baselines on ProcessBench, MATH-500, and AIME ‘24, and surpasses discriminative verifiers trained on full PRM800K by 8% on GPQA-Diamond and 4.5% on LiveCodeBench. It also scales verification compute more effectively than LLM-as-a-Judge.
Conclusion: Generative, long CoT PRMs like ThinkPRM can effectively scale test-time compute for verification while requiring minimal supervision, highlighting the value of this approach over traditional discriminative methods.
Abstract: Step-by-step verifiers – also known as process reward models (PRMs) – are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers – using only 1% of the process labels in PRM800K – across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME ‘24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models are released at https://github.com/mukhal/thinkprm.
[713] Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning
Shenao Zhang, Yaqing Wang, Yinxiao Liu, Tianqi Liu, Peter Grabowski, Eugene Ie, Zhaoran Wang, Yunxuan Li
Main category: cs.LG
TL;DR: BARL: A Bayesian RL framework that enables LLMs to perform reflective exploration at test time by optimizing expected return under posterior MDP distributions, outperforming conventional RL approaches.
Details
Motivation: Conventional RL-trained LLMs don't exhibit reflective exploration behaviors at test time because Markovian policies have no incentive to enrich identical states with additional context. The paper aims to understand whether and why reflective reasoning emerges during RL and how to make it beneficial.Method: Recasts reflective exploration within a Bayesian RL framework that optimizes expected return under a posterior distribution over Markov decision processes induced by training data. This formulation yields uncertainty-adaptive policies that naturally incentivize information-gathering actions through belief updates, inducing self-reflection behaviors.
Result: BARL outperforms conventional RL approaches on both synthetic and mathematical reasoning tasks, achieving superior test-time performance and token efficiency. The algorithm enables LLMs to stitch and switch strategies based on observed outcomes.
Conclusion: The Bayesian RL framework provides principled guidance on when and how LLMs should reflectively explore, enabling uncertainty-adaptive policies that outperform conventional RL approaches through systematic information gathering and strategy adaptation.
Abstract: Large Language Models (LLMs) trained via Reinforcement Learning (RL) have exhibited strong reasoning capabilities and emergent reflective behaviors, such as rethinking and error correction, as a form of in-context exploration. However, the Markovian policy obtained from conventional RL training does not give rise to reflective exploration behaviors since the policy depends on the history only through the state and therefore has no incentive to enrich identical states with additional context. Instead, RL exploration is only useful during training to learn the optimal policy in a trial-and-error manner. Therefore, it remains unclear whether reflective reasoning will emerge during RL, or why it is beneficial. To remedy this, we recast reflective exploration within a Bayesian RL framework, which optimizes the expected return under a posterior distribution over Markov decision processes induced by the training data. This Bayesian formulation admits uncertainty-adaptive policies that, through belief updates, naturally incentivize information-gathering actions and induce self-reflection behaviors. Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on the observed outcomes, offering principled guidance on when and how the model should reflectively explore. Empirical results on both synthetic and mathematical reasoning tasks demonstrate that BARL outperforms conventional RL approaches, achieving superior test-time performance and token efficiency. Our code is available at https://github.com/shenao-zhang/BARL.
[714] Eyes-on-Me: Scalable RAG Poisoning through Transferable Attention-Steering Attractors
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen
Main category: cs.LG
TL;DR: Eyes-on-Me is a modular data poisoning attack for RAG systems that uses reusable Attention Attractors and Focus Regions to achieve scalable, transferable attacks with minimal cost for new targets.
Details
Motivation: Existing data poisoning attacks on RAG systems are inefficient because they require expensive optimization for each target phrase, making them impractical for large-scale attacks.Method: Decomposes adversarial documents into reusable Attention Attractors (optimized to direct attention) and Focus Regions (containing semantic baits or malicious instructions). Identifies and steers a small subset of attention heads strongly correlated with attack success.
Result: Increases average attack success rates from 21.9% to 57.8% (+35.9 points, 2.6× improvement) across 18 RAG settings. Single optimized attractor transfers to unseen black box retrievers/generators without retraining.
Conclusion: Establishes scalable paradigm for RAG data poisoning, showing modular reusable components pose practical threat. Reveals strong link between attention concentration and model outputs, informing interpretability research.
Abstract: Existing data poisoning attacks on retrieval-augmented generation (RAG) systems scale poorly because they require costly optimization of poisoned documents for each target phrase. We introduce Eyes-on-Me, a modular attack that decomposes an adversarial document into reusable Attention Attractors and Focus Regions. Attractors are optimized to direct attention to the Focus Region. Attackers can then insert semantic baits for the retriever or malicious instructions for the generator, adapting to new targets at near zero cost. This is achieved by steering a small subset of attention heads that we empirically identify as strongly correlated with attack success. Across 18 end-to-end RAG settings (3 datasets $\times$ 2 retrievers $\times$ 3 generators), Eyes-on-Me raises average attack success rates from 21.9 to 57.8 (+35.9 points, 2.6$\times$ over prior work). A single optimized attractor transfers to unseen black box retrievers and generators without retraining. Our findings establish a scalable paradigm for RAG data poisoning and show that modular, reusable components pose a practical threat to modern AI systems. They also reveal a strong link between attention concentration and model outputs, informing interpretability research.
[715] A Practitioner’s Guide to Multi-turn Agentic Reinforcement Learning
Ruiyi Wang, Prithviraj Ammanabrolu
Main category: cs.LG
TL;DR: Systematic analysis of LLM agent training via multi-turn RL, breaking down design space into environment, reward, and policy pillars, with empirical recipe for situated textual domains.
Details
Motivation: Existing frameworks for training LLM agents via multi-turn reinforcement learning are fragmented with no systematic formulation or analysis of which design choices matter across different tasks.Method: Break down design space into three pillars (environment, reward, policy), empirically test on TextWorld, ALFWorld, and SWE-Gym domains, analyze task complexity, reward sparsity, and policy gradient methods.
Result: Found that: (i) simple environments provide signal for generalization to complex tasks; (ii) dense rewards accelerate training but performance depends on RL algorithm; (iii) interplay between reward sparsity and policy gradient methods matters, with optimal SFT-to-RL ratio identified.
Conclusion: Distilled findings into a training recipe that guides co-design across environment, reward, and policy pillars to facilitate research and practical efforts in multi-turn agentic RL.
Abstract: We study what actually works and what doesn’t for training large language models as agents via multi-turn reinforcement learning. Despite rapid progress, existing frameworks and definitions are fragmented, and there is no systematic formulation or analysis of which design choices matter across tasks. We address this gap by first breaking down the design space into three inter-related pillars – environment, reward, and policy – and empirically derive a recipe for training LLM agents in situated textual domains. In particular, we test TextWorld and ALFWorld, popular domains for testing situated embodied reasoning, as well as SWE-Gym for more software engineering style tasks. (i) For the environment, we analyze the impacts of task complexity in terms of sizes of the state and action spaces as well as optimal solution length, finding that even simple environments within a domain can provide signal on how well an agent can generalize to more complex tasks. (ii) For the reward, we ablate relative reward sparsity, observing that while dense turn-level rewards accelerate training, performance and stability is highly dependent on the choice of RL algorithm. (iii) And for the agent’s policy, we explore the interplay between reward sparsity and biased (PPO, GRPO) and unbiased (RLOO) policy gradient methods in addition to showing how to find the optimal Supervised Fine-tuning (SFT) to RL training ratio given a fixed budget. We distill these findings into a training recipe that guides co-design across the three pillars, facilitating research and practical efforts in multi-turn agentic RL. Code: https://github.com/pearls-lab/meow-tea-taro
[716] General Exploratory Bonus for Optimistic Exploration in RLHF
Wendi Li, Changdae Oh, Sharon Li
Main category: cs.LG
TL;DR: GEB is a new theoretical framework for optimistic exploration in RLHF that fixes bias in existing divergence-regularized methods by properly incentivizing exploration of uncertain regions.
Details
Motivation: Current exploratory bonus methods in RLHF using KL or α-divergence regularization unintentionally bias exploration toward high-probability regions of the reference model, reinforcing conservative behavior instead of promoting discovery of uncertain regions.Method: Introduces General Exploratory Bonus (GEB), a theoretical framework that counteracts divergence-induced bias via reference-dependent reward regulation, unifying prior heuristic bonuses as special cases and extending across the full α-divergence family.
Result: GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones, demonstrating both principled and practical advantages.
Conclusion: GEB offers a principled and practical solution for optimistic exploration in RLHF by provably satisfying the optimism principle where previous methods failed.
Abstract: Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods to incentivize exploration often fail to realize optimism. We provide a theoretical analysis showing that current formulations, under KL or $α$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $α$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers both a principled and practical solution for optimistic exploration in RLHF.
[717] A Unified Perspective for Loss-Oriented Imbalanced Learning via Localization
Zitai Wang, Qianqian Xu, Zhiyong Yang, Zhikang Xu, Linchao Zhang, Xiaochun Cao, Qingming Huang
Main category: cs.LG
TL;DR: The paper proposes a unified theoretical framework for analyzing loss-oriented methods in imbalanced learning by introducing localized properties (calibration and Lipschitz continuity) to provide fine-grained analysis and develop an improved algorithm.
Details
Motivation: Existing analysis of loss-oriented methods (re-weighting, logit-adjustment) for imbalanced learning is coarse-grained and fragmented, failing to explain empirical results because they use global properties that don't capture how class-dependent terms influence learning within each class.Method: Introduces localized versions of key properties: localized calibration for consistency validation across losses and localized Lipschitz continuity for fine-grained generalization bounds. Develops a principled learning algorithm based on these insights.
Result: Empirical results on both traditional ResNets and foundation models validate the theoretical analyses and demonstrate the effectiveness of the proposed method for imbalanced learning.
Conclusion: The paper provides a unified perspective for improving loss-oriented methods in imbalanced learning through localized analysis, offering better theoretical understanding and practical algorithm design.
Abstract: Due to the inherent imbalance in real-world datasets, naïve Empirical Risk Minimization (ERM) tends to bias the learning process towards the majority classes, hindering generalization to minority classes. To rebalance the learning process, one straightforward yet effective approach is to modify the loss function via class-dependent terms, such as re-weighting and logit-adjustment. However, existing analysis of these loss-oriented methods remains coarse-grained and fragmented, failing to explain some empirical results. After reviewing prior work, we find that the properties used through their analysis are typically global, i.e., defined over the whole dataset. Hence, these properties fail to effectively capture how class-dependent terms influence the learning process. To bridge this gap, we turn to explore the localized versions of such properties i.e., defined within each class. Specifically, we employ localized calibration to provide consistency validation across a broader range of losses and localized Lipschitz continuity to provide a fine-grained generalization bound. In this way, we reach a unified perspective for improving and adjusting loss-oriented methods. Finally, a principled learning algorithm is developed based on these insights. Empirical results on both traditional ResNets and foundation models validate our theoretical analyses and demonstrate the effectiveness of the proposed method.
[718] A Survey on Diffusion Models for Time Series and Spatio-Temporal Data
Yiyuan Yang, Ming Jin, Haomin Wen, Chaoli Zhang, Yuxuan Liang, Lintao Ma, Yi Wang, Chenghao Liu, Bin Yang, Zenglin Xu, Shirui Pan, Qingsong Wen
Main category: cs.LG
TL;DR: A comprehensive survey of diffusion models applied to time series and spatio-temporal data across various domains including healthcare, recommendation systems, climate, energy, audio, and traffic.
Details
Motivation: To provide a structured overview and foundation for researchers and practitioners working with diffusion models in time series and spatio-temporal data applications, addressing the growing use of these models across diverse fields.Method: The paper organizes diffusion model applications by separating time series and spatio-temporal data, categorizing them based on model type, task type, data modality, and practical application domains.
Result: A comprehensive survey that structures the landscape of diffusion model applications in time series and spatio-temporal data, with an open-sourced repository containing detailed information.
Conclusion: The study provides a solid foundation for future research and practical applications, aiming to inspire innovations that address traditional challenges and foster novel solutions in diffusion model-based data mining tasks.
Abstract: Diffusion models have been widely used in time series and spatio-temporal data, enhancing generative, inferential, and downstream capabilities. These models are applied across diverse fields such as healthcare, recommendation, climate, energy, audio, and traffic. By separating applications for time series and spatio-temporal data, we offer a structured perspective on model category, task type, data modality, and practical application domain. This study aims to provide a solid foundation for researchers and practitioners, inspiring future innovations that tackle traditional challenges and foster novel solutions in diffusion model-based data mining tasks and applications. For more detailed information, we have open-sourced a repository at https://github.com/yyysjz1997/Awesome-TimeSeries-SpatioTemporal-Diffusion-Model.
[719] TimeAutoDiff: A Unified Framework for Generation, Imputation, Forecasting, and Time-Varying Metadata Conditioning of Heterogeneous Time Series Tabular Data
Namjoon Suh, Yuning Yang, Din-Yin Hsieh, Qitong Luan, Shirong Xu, Shixiang Zhu, Guang Cheng
Main category: cs.LG
TL;DR: TimeAutoDiff is a unified latent-diffusion framework for four time-series tasks (generation, imputation, forecasting, metadata conditioning) that handles heterogeneous data types and uses masked modeling with efficient architectural choices for speed and scalability.
Details
Motivation: To create a unified framework for multiple fundamental time-series tasks that can handle heterogeneous data types (continuous, binary, categorical) while being computationally efficient and scalable for wide tables.Method: Combines a lightweight variational autoencoder (VAE) for mapping mixed-type features into continuous latent sequences with a diffusion model that learns temporal dynamics. Uses masked modeling with binary masks to specify observed vs. generated cells. Key innovations: diffusion model samples entire latent trajectories at once (reducing reverse-diffusion calls) and VAE compresses along feature axis for efficient wide-table modeling.
Result: Matches or surpasses strong baselines in synthetic sequence fidelity, consistently improves imputation and forecasting performance. Metadata conditioning enables realistic scenario exploration with coherent counterfactual trajectories. Ablation studies confirm importance of VAE feature encoding and denoiser components. Distance-to-closest-record audit shows generalization without excessive memorization.
Conclusion: TimeAutoDiff provides an effective unified framework for multiple time-series tasks with heterogeneous data, offering strong performance, computational efficiency, and practical utility for scenario exploration while generalizing well without overfitting.
Abstract: We present TimeAutoDiff, a unified latent-diffusion framework for four fundamental time-series tasks: unconditional generation, missing-data imputation, forecasting, and time-varying-metadata conditional generation. The model natively supports heterogeneous features including continuous, binary, and categorical variables. We unify all tasks using a masked-modeling strategy in which a binary mask specifies which time-series cells are observed and which must be generated. TimeAutoDiff combines a lightweight variational autoencoder, which maps mixed-type features into a continuous latent sequence, with a diffusion model that learns temporal dynamics in this latent space. Two architectural choices provide strong speed and scalability benefits. The diffusion model samples an entire latent trajectory at once rather than denoising one timestep at a time, greatly reducing reverse-diffusion calls. In addition, the VAE compresses along the feature axis, enabling efficient modeling of wide tables in a low-dimensional latent space. Empirical evaluation shows that TimeAutoDiff matches or surpasses strong baselines in synthetic sequence fidelity and consistently improves imputation and forecasting performance. Metadata conditioning enables realistic scenario exploration, allowing users to edit metadata sequences and produce coherent counterfactual trajectories that preserve cross-feature dependencies. Ablation studies highlight the importance of the VAE’s feature encoding and key components of the denoiser. A distance-to-closest-record audit further indicates that the model generalizes without excessive memorization. Code is available at https://github.com/namjoonsuh/TimeAutoDiff
[720] A Tidal Current Speed Forecasting Model based on Multi-Periodicity Learning
Tengfei Cheng, Yangdi Huang, Ling Xiao, Yunxuan Dong
Main category: cs.LG
TL;DR: Proposes Wavelet-Enhanced Convolutional Network for tidal current speed forecasting by learning multi-periodicity patterns, achieving improved accuracy over baselines.
Details
Motivation: Accurate tidal current speed forecasting is crucial for high tidal energy grid penetration, but physical models struggle with complex multi-periodic variations influenced by celestial bodies.Method: Wavelet-Enhanced Convolutional Network that embeds 1D tidal data into 2D tensors (rows for intra-period, columns for inter-period variations), uses convolutional kernels, integrates time-frequency analysis for local periodic features, and optimizes hyperparameters with Tree-structured Parzen Estimator.
Result: Achieves 10-step average Mean Absolute Error of 0.025, with at least 1.18% error reduction compared to other baselines, and shows 1.4% MAPE reduction on data with artificially added periodic fluctuations.
Conclusion: The proposed framework effectively captures multi-periodic dependencies in tidal current data, demonstrating superior forecasting performance and stability for tidal energy applications.
Abstract: Tidal energy is one of the key components in increasing the penetration of renewable energy. High tidal energy penetration into the electrical grid depends on accurate tidal current speed forecasting. Model inaccuracies hinder forecast accuracy. Previous research primarily used physical models to forecast tidal current speed, yet tidal current variations influenced by the orbital periods of celestial bodies make accurate physical modeling challenging. Research on the multi-periodicity of tides is crucial for forecasting tidal current speed. We propose the Wavelet-Enhanced Convolutional Network to learn multi-periodicity. The framework embeds intra-period and inter-period variations of one-dimensional tidal current data into the rows and columns, respectively, of a two-dimensional tensor. Then, the two-dimensional variations of the sequence can be processed by convolutional kernels. We integrate a time-frequency analysis method into the framework to further address local periodic features. Additionally, to enhance the framework’s stability, we optimize the framework’s hyperparameters with the Tree-structured Parzen Estimator. The proposed framework captures multi-periodic dependencies in tidal current data. Numerical results show a 10-step average Mean Absolute Error of 0.025, with at least a 1.18% error reduction compared to other baselines. Further ablation studies show a 1.4% reduction in Mean Absolute Percentage Error on the data with artificially added periodic fluctuations.
[721] It’s complicated. The relationship of algorithmic fairness and non-discrimination regulations for high-risk systems in the EU AI Act
Kristof Meding
Main category: cs.LG
TL;DR: This paper analyzes the relationship between traditional legal non-discrimination regulations and algorithmic fairness concepts in the context of the EU’s AI Act, aiming to bridge legal and computer science perspectives on AI fairness.
Details
Motivation: The paper addresses the challenge of defining fairness in AI decision-making, particularly in light of discriminatory algorithmic behaviors and the EU's recent AI Act which combines traditional legal non-discrimination regulations with machine learning fairness concepts. There's a need to bridge these two different conceptual frameworks for interdisciplinary collaboration.Method: The paper provides: 1) A high-level introduction to both legal non-discrimination and algorithmic fairness concepts for interdisciplinary audiences, and 2) An in-depth analysis of the AI Act’s relationship between these two concepts, examining regulatory scope, requirements, and computational feasibility.
Result: Three key findings: (1) Most non-discrimination regulations target only high-risk AI systems; (2) Regulation of high-risk systems includes both data input requirements and output monitoring, though these are partly inconsistent and raise computational feasibility questions; (3) Analysis of possible future interaction between classical EU non-discrimination law and AI Act regulations.
Conclusion: The paper recommends developing more specific auditing and testing methodologies for AI systems and serves as a foundation for future interdisciplinary collaboration between legal scholars and computer science researchers studying discrimination in AI systems.
Abstract: What constitutes a fair decision? This question is not only difficult for humans but becomes more challenging when Artificial Intelligence (AI) models are used. In light of discriminatory algorithmic behaviors, the EU has recently passed the AI Act, which mandates specific rules for high-risk systems, incorporating both traditional legal non-discrimination regulations and machine learning based algorithmic fairness concepts. This paper aims to bridge these two different concepts in the AI Act through: First, a necessary high-level introduction of both concepts targeting legal and computer science-oriented scholars, and second, an in-depth analysis of the AI Act’s relationship between legal non-discrimination regulations and algorithmic fairness. Our analysis reveals three key findings: (1.) Most non-discrimination regulations target only high-risk AI systems. (2.) The regulation of high-risk systems encompasses both data input requirements and output monitoring, though these regulations are partly inconsistent and raise questions of computational feasibility. (3.) Finally, we consider the possible (future) interaction of classical EU non-discrimination law and the AI Act regulations. We recommend developing more specific auditing and testing methodologies for AI systems. This paper aims to serve as a foundation for future interdisciplinary collaboration between legal scholars and computer science-oriented machine learning researchers studying discrimination in AI systems.
[722] DFDT: Dynamic Fast Decision Tree for IoT Data Stream Mining on Edge Devices
Afonso Lourenço, João Rodrigo, João Gama, Goreti Marreiros
Main category: cs.LG
TL;DR: DFDT is a memory-constrained online learning algorithm that uses activity-aware pre-pruning and adaptive mechanisms to optimize the accuracy-memory-runtime trade-off for IoT data streams, serving as a drop-in replacement for VFDT-based learners.
Details
Motivation: Edge computing enables real-time IoT applications but requires continuous adaptation to concept drifts. While VFDT extensions are state-of-the-art for tabular stream mining, their unregulated growth limits efficiency, especially in ensemble settings where individual tree pruning is rarely applied.Method: DFDT employs activity-aware pre-pruning that dynamically adjusts splitting criteria based on leaf node activity: low-activity nodes are deactivated, moderately active nodes split under stricter conditions, and highly active nodes use a skipping mechanism for accelerated growth. It also uses adaptive grace periods and tie thresholds to modulate splitting decisions based on data variability.
Result: DFDT provides three variants suited to different resource profiles through an ablation study. It enhances the accuracy-memory-runtime trade-off while minimizing hyperparameter tuning needs, and is fully compatible with existing ensemble frameworks as a drop-in alternative to standard VFDT-based learners.
Conclusion: DFDT offers an efficient memory-constrained solution for online learning in IoT edge computing environments, addressing the limitations of unregulated tree growth in VFDT-based approaches while maintaining compatibility with existing ensemble frameworks.
Abstract: The Internet of Things generates massive data streams, with edge computing emerging as a key enabler for online IoT applications and 5G networks. Edge solutions facilitate real-time machine learning inference, but also require continuous adaptation to concept drifts. While extensions of the Very Fast Decision Tree (VFDT) remain state-of-the-art for tabular stream mining, their unregulated growth limits efficiency, particularly in ensemble settings where post-pruning at the individual tree level is seldom applied. This paper presents DFDT, a novel memory-constrained algorithm for online learning. DFDT employs activity-aware pre-pruning, dynamically adjusting splitting criteria based on leaf node activity: low-activity nodes are deactivated to conserve resources, moderately active nodes split under stricter conditions, and highly active nodes leverage a skipping mechanism for accelerated growth. Additionally, adaptive grace periods and tie thresholds allow DFDT to modulate splitting decisions based on observed data variability, enhancing the accuracy-memory-runtime trade-off while minimizing the need for hyperparameter tuning. An ablation study reveals three DFDT variants suited to different resource profiles. Fully compatible with existing ensemble frameworks, DFDT provides a drop-in alternative to standard VFDT-based learners.
[723] The Optimal Approximation Factor in Density Estimation
Olivier Bousquet, Daniel Kane, Shay Moran
Main category: cs.LG
TL;DR: The paper shows that for density selection from a finite class, improper learning (outputting mixtures) achieves optimal approximation factor 2, beating proper learning’s factor 3, with sample-efficient algorithms using adaptive data analysis techniques.
Details
Motivation: Yatracos showed that selecting the best density from a finite class can be done with approximation factor 3, but it was unknown if this factor could be improved or if improper learning (outputting mixtures) could achieve better approximation.Method: Developed two geometric approaches (adaptive and static) based on estimating surrogate metrics to total variation, using techniques from Adaptive Data Analysis to bound sample complexity.
Result: Proved factor 3 is optimal for proper learning (outputting from the class), but improper learning (outputting mixtures) achieves optimal factor 2. Provided sample-efficient algorithms achieving this optimal approximation.
Conclusion: Improper learning strictly outperforms proper learning in density selection, achieving optimal approximation factor 2 vs 3, demonstrating a concrete advantage of improper learning in this statistical setup.
Abstract: Consider the following problem: given two arbitrary densities $q_1,q_2$ and a sample-access to an unknown target density $p$, find which of the $q_i$’s is closer to $p$ in total variation. A remarkable result due to Yatracos shows that this problem is tractable in the following sense: there exists an algorithm that uses $O(ε^{-2})$ samples from $p$ and outputs~$q_i$ such that with high probability, $TV(q_i,p) \leq 3\cdot\mathsf{opt} + ε$, where $\mathsf{opt}= \min{TV(q_1,p),TV(q_2,p)}$. Moreover, this result extends to any finite class of densities $\mathcal{Q}$: there exists an algorithm that outputs the best density in $\mathcal{Q}$ up to a multiplicative approximation factor of 3. We complement and extend this result by showing that: (i) the factor 3 can not be improved if one restricts the algorithm to output a density from $\mathcal{Q}$, and (ii) if one allows the algorithm to output arbitrary densities (e.g.\ a mixture of densities from $\mathcal{Q}$), then the approximation factor can be reduced to 2, which is optimal. In particular this demonstrates an advantage of improper learning over proper in this setup. We develop two approaches to achieve the optimal approximation factor of 2: an adaptive one and a static one. Both approaches are based on a geometric point of view of the problem and rely on estimating surrogate metrics to the total variation. Our sample complexity bounds exploit techniques from {\it Adaptive Data Analysis}.
[724] PIMRL: Physics-Informed Multi-Scale Recurrent Learning for Spatiotemporal Prediction
Han Wan, Qi Wang, Yuan Mi, Hao Sun
Main category: cs.LG
TL;DR: PIMRL: A multi-scale learning framework combining physics-informed pretraining and data-driven latent space evolution for accurate spatiotemporal dynamics prediction.
Details
Motivation: Traditional numerical methods for PDE-based spatiotemporal systems are computationally expensive, while existing ML methods suffer from error accumulation in long-term predictions, especially with insufficient data or varying time scales. Current approaches fail to effectively utilize multi-scale data, leading to suboptimal robustness.Method: Physics-Informed Multi-Scale Recurrent Learning (PIMRL) framework with two modules: 1) Micro-scale module embeds physical knowledge into neural networks via pretraining, 2) Macro-scale module uses data-driven approach to learn temporal evolution of physics in latent space.
Result: PIMRL achieves state-of-the-art performance across five benchmark datasets (1D to 3D), with average improvements over 9% in both RMSE and MAE, and maximum enhancements reaching up to 80%.
Conclusion: The proposed multi-scale learning framework effectively leverages multi-scale data to improve stability and accuracy in spatiotemporal dynamics prediction, addressing limitations of existing methods in handling varying time scales and insufficient data scenarios.
Abstract: Simulation of spatiotemporal systems governed by partial differential equations is widely applied in fields such as biology, chemistry, aerospace dynamics, and meteorology. Traditional numerical methods incur high computational costs due to the requirement of small time steps for accurate predictions. While machine learning has reduced these costs, long-term predictions remain challenged by error accumulation, particularly in scenarios with insufficient data or varying time scales, where stability and accuracy are compromised. Existing methods often neglect the effective utilization of multi-scale data, leading to suboptimal robustness in predictions. To address these issues, we propose a novel multi-scale learning framework, namely, the Physics-Informed Multi-Scale Recurrent Learning (PIMRL), to effectively leverage multi-scale data for spatiotemporal dynamics prediction. The PIMRL framework comprises two modules: the micro-scale module embeds physical knowledge into neural networks via pretraining, and the macro-scale module adopts a data-driven approach to learn the temporal evolution of physics in the latent space. Experimental results demonstrate that the PIMRL framework consistently achieves state-of-the-art performance across five benchmark datasets ranging from one to three dimensions, showing average improvements of over 9% in both RMSE and MAE evaluation metrics, with maximum enhancements reaching up to 80%.
[725] Attacking All Tasks at Once Using Adversarial Examples in Multi-Task Learning
Lijun Zhang, Xiao Liu, Kaleel Mahmood, Caiwen Ding, Hui Guan
Main category: cs.LG
TL;DR: This paper investigates adversarial robustness of multi-task learning models, proposing a new attack framework (DGBA) and revealing a trade-off between task accuracy and robustness due to parameter sharing.
Details
Motivation: Multi-task models are widely used for visual content understanding but their adversarial robustness is understudied compared to single-task models. Key unanswered questions include: how robust they are to single-task attacks, whether attacks can simultaneously target all tasks, and how parameter sharing affects robustness.Method: The paper analyzes limitations of existing single-task attack adaptations for multi-task models, then proposes Dynamic Gradient Balancing Attack (DGBA) - an optimization framework that formulates attacking all tasks as an integer linear programming problem to efficiently balance gradients across tasks.
Result: Extensive evaluation on NYUv2 and Tiny-Taxonomy benchmarks shows DGBA outperforms baselines in attacking both clean and adversarially trained multi-task models. Results reveal a fundamental trade-off: parameter sharing improves task accuracy but increases attack transferability, undermining robustness.
Conclusion: Multi-task models face unique adversarial vulnerabilities that require specialized attack methods like DGBA. There’s an inherent tension between accuracy gains from parameter sharing and robustness degradation due to increased attack transferability, highlighting the need for robust multi-task learning approaches.
Abstract: Visual content understanding frequently relies on multi-task models to extract robust representations of a single visual input for multiple downstream tasks. However, in comparison to extensively studied single-task models, the adversarial robustness of multi-task models has received significantly less attention and many questions remain unclear: 1) How robust are multi-task models to single task adversarial attacks, 2) Can adversarial attacks be designed to simultaneously attack all tasks in a multi-task model, and 3) How does parameter sharing across tasks affect multi-task model robustness to adversarial attacks? This paper aims to answer these questions through careful analysis and rigorous experimentation. First, we analyze the inherent drawbacks of two commonly-used adaptations of single-task white-box attacks in attacking multi-task models. We then propose a novel attack framework, Dynamic Gradient Balancing Attack (DGBA). Our framework poses the problem of attacking all tasks in a multi-task model as an optimization problem that can be efficiently solved through integer linear programming. Extensive evaluation on two popular MTL benchmarks, NYUv2 and Tiny-Taxonomy, demonstrates the effectiveness of DGBA compared to baselines in attacking both clean and adversarially trained multi-task models. Our results also reveal a fundamental trade-off between improving task accuracy via parameter sharing across tasks and undermining model robustness due to increased attack transferability from parameter sharing.
[726] Hidden Minima in Two-Layer ReLU Networks
Yossi Arjevani
Main category: cs.LG
TL;DR: The paper analyzes hidden vs non-hidden spurious minima in two-layer ReLU networks, showing that Hessian spectra alone can’t distinguish them, but examining loss curves reveals structural differences due to O(d^{-1/2}) eigenvalue terms.
Details
Motivation: To understand why vanilla SGD avoids certain spurious minima (hidden minima) while not avoiding others, despite both types having similar Hessian spectra up to O(d^{-1/2}) terms. The paper seeks analytic properties that distinguish hidden from non-hidden minima.Method: Analyzes curves along which loss is minimized or maximized, examining the structure and symmetry of arcs emanating from hidden minima. The approach focuses on the O(d^{-1/2}) eigenvalue terms that previous Hessian spectrum analyses missed.
Result: Hidden minima differ from non-hidden minima in the structure and symmetry of arcs emanating from them, precisely due to the O(d^{-1/2}) eigenvalue terms that were absent from previous Hessian spectrum analyses.
Conclusion: Hessian spectra alone provide limited explanatory power for distinguishing hidden minima; instead, analyzing loss curves reveals characteristic structural differences that explain why SGD avoids hidden minima but not other spurious minima.
Abstract: We consider the optimization problem arising from fitting two-layer ReLU networks with $d$ inputs under the square loss, where labels are generated by a target network. Two infinite families of spurious minima have recently been identified: one whose loss vanishes as $d \to \infty$, and another whose loss remains bounded away from zero. The latter are nevertheless avoided by vanilla SGD, and thus hidden, motivating the search for analytic properties distinguishing the two types. Perhaps surprisingly, the Hessian spectra of hidden and non-hidden minima agree up to terms of order $O(d^{-1/2})$, providing limited explanatory power. Consequently, our analysis of hidden minima proceeds instead via curves along which the loss is minimized or maximized. The main result is that arcs emanating from hidden minima differ, characteristically, by their structure and symmetry, precisely on account of the $O(d^{-1/2})$-eigenvalue terms absent from previous analyses.
[727] Ensemble Learning of Machine Learning Force Fields
Bangchen Yin, Yue Yin, Yuda W. Tang, Hai Xiao
Main category: cs.LG
TL;DR: EL-MLFFs is an ensemble learning framework that uses stacking with a GNN meta-model to combine multiple MLFFs, improving force prediction accuracy and simulation stability across molecular and materials systems.
Details
Motivation: There's a practical challenge in selecting optimal MLFF architectures that balance accuracy and stability. Existing diverse MLFF architectures make model selection difficult, and there's a need to address the accuracy-stability trade-off in molecular and materials simulations.Method: An ensemble learning framework using stacking methodology with graph representations. A GNN acts as a meta-model to refine initial force predictions from diverse base MLFFs. Two architectures: direct fitting (computationally efficient) and conservative (physically-principled, ensures energy conservation).
Result: Improves predictive accuracy across diverse systems: reduces force errors and improves simulation stability for molecular systems (methane, methanol/Cu(100), MD17), and yields lower formation energy errors on WBM materials test set compared to base models.
Conclusion: EL-MLFFs provides a systematic approach to address model selection challenges and the accuracy-stability trade-off, offering improved performance across both molecular and materials simulation domains.
Abstract: Machine learning force fields (MLFFs) are a promising approach to balance the accuracy of quantum mechanics with the efficiency of classical potentials, yet selecting an optimal model amid increasingly diverse architectures that delivers reliable force predictions and stable simulations remains a core pratical challenge. Here we introduce EL-MLFFs, an ensemble learning framework that uses a stacking methodology to integrate predictions from diverse base MLFFs. Our approach constructs a graph representation where a graph neural network (GNN) acts as a meta-model to refine the initial force predictions. We present two meta-model architectures: a computationally efficient direct fitting model and a physically-principled conservative model that ensures energy conservation. The framework is evaluated on a diverse range of systems, including single molecules (methane), surface chemistry (methanol/Cu(100)), molecular dynamics benchmarks (MD17), and the MatPES materials dataset. Results show that EL-MLFFs improves predictive accuracy across these domains. For molecular systems, it reduces force errors and improves the simulation stability compared to base models. For materials, the method yields lower formation energy errors on the WBM test set. The EL- MLFFs framework offers a systematic approach to address challenges of model selection and the accuracy-stability trade-off in molecular and materials simulations.
[728] SDT-GNN: Streaming-based Distributed Training Framework for Graph Neural Networks
Xin Huang, Weipeng Zhuo, Minh Phu Vuong, Shiju Li, Jongryool Kim, Bradley Rees, Chul-Ho Lee
Main category: cs.LG
TL;DR: SDT-GNN is a streaming-based distributed GNN training framework that reduces memory requirements by processing edges as a stream, enabling training on large graphs even when GPU memory is smaller than graph size.
Details
Motivation: Existing distributed GNN frameworks (DistDGL, PyG) have excessive memory requirements that hinder training on large graphs using commodity workstations, creating a need for more memory-efficient solutions.Method: Proposes SDT-GNN framework that takes a stream of edges as input for graph partitioning instead of loading entire graph in memory. Also introduces SPRING, a novel streaming partitioning algorithm specifically designed for distributed GNN training.
Result: SDT-GNN achieves up to 95% less memory footprint than DistDGL and PyG without sacrificing prediction accuracy. SPRING significantly outperforms state-of-the-art streaming partitioning algorithms.
Conclusion: SDT-GNN enables efficient distributed GNN training on large graphs with limited GPU memory resources, making GNN training more accessible on commodity hardware through streaming-based approaches.
Abstract: Recently, distributed GNN training frameworks, such as DistDGL and PyG, have been developed to enable training GNN models on large graphs by leveraging multiple GPUs in a distributed manner. Despite these advances, their memory requirements are still excessively high, thereby hindering GNN training on large graphs using commodity workstations. In this paper, we propose SDT-GNN, a streaming-based distributed GNN training framework. Unlike the existing frameworks that load the entire graph in memory, it takes a stream of edges as input for graph partitioning to reduce the memory requirement for partitioning. It also enables distributed GNN training even when the aggregated memory size of GPUs is smaller than the size of the graph and feature data. Furthermore, to improve the quality of partitioning, we propose SPRING, a novel streaming partitioning algorithm for distributed GNN training. We demonstrate the effectiveness and efficiency of SDT-GNN on seven large public datasets. SDT-GNN has up to 95% less memory footprint than DistDGL and PyG without sacrificing the prediction accuracy. SPRING also outperforms state-of-the-art streaming partitioning algorithms significantly.
[729] Covariate-Elaborated Robust Partial Information Transfer with Conditional Spike-and-Slab Prior
Ruqian Zhang, Yijiao Zhang, Annie Qu, Zhongyi Zhu, Juan Shen
Main category: cs.LG
TL;DR: CONCERT is a Bayesian transfer learning method that enables robust partial information transfer for high-dimensional data using conditional spike-and-slab priors and covariate-specific similarity modeling.
Details
Motivation: Existing transfer learning methods use global similarity measures between source and target data, which can be inefficient when only partial information is shared. There's a need for methods that can selectively transfer relevant information while ignoring irrelevant source data.Method: Proposes CONCERT using conditional spike-and-slab priors in the joint distribution of target and source parameters. Incorporates covariate-specific priors to characterize partial similarities. Uses variational Bayes for scalability and achieves simultaneous variable selection and information transfer in a one-step procedure.
Result: Establishes theoretical guarantees including variable selection consistency, estimation error bounds, and prediction error bounds. Demonstrates covariate-specific benefits of transfer learning. Extensive experiments and real data applications show CONCERT outperforms existing state-of-the-art transfer learning methods.
Conclusion: CONCERT provides an effective Bayesian framework for robust partial information transfer in high-dimensional data analysis, overcoming limitations of global similarity measures and enabling selective transfer of relevant source information.
Abstract: The popularity of transfer learning stems from the fact that it can borrow information from useful auxiliary datasets. Existing statistical transfer learning methods usually adopt a global similarity measure between the source data and the target data, which may lead to inefficiency when only partial information is shared. In this paper, we propose a novel Bayesian transfer learning method named ``CONCERT’’ to allow robust partial information transfer for high-dimensional data analysis. A conditional spike-and-slab prior is introduced in the joint distribution of target and source parameters for information transfer. By incorporating covariate-specific priors, we can characterize partial similarities and integrate source information collaboratively to improve the performance on the target. In contrast to existing work, the CONCERT is a one-step procedure which achieves variable selection and information transfer simultaneously. We establish variable selection consistency, as well as estimation and prediction error bounds for CONCERT. Our theory demonstrates the covariate-specific benefit of transfer learning. To ensure the scalability of the algorithm, we adopt the variational Bayes framework to facilitate implementation. Extensive experiments and two real data applications showcase the validity and advantages of CONCERT over existing cutting-edge transfer learning methods.
[730] LLMs Judging LLMs: A Simplex Perspective
Patrick Vossler, Fan Xia, Yifan Mai, Adarsh Subbaswamy, Jean Feng
Main category: cs.LG
TL;DR: LLM judges for evaluating LLM outputs often ignore epistemic uncertainty about judge quality. The paper provides geometric conditions for when rankings are identifiable and shows LLM judges work better for binary scoring than multi-level scoring.
Details
Motivation: Current practice of using LLMs as judges for evaluating other LLMs only accounts for sampling variability but ignores uncertainty about judge quality (epistemic uncertainty). It's unclear when this approach is theoretically valid and practically robust.Method: Proposes a geometric perspective: for M-level scoring systems, LLM judges and candidates are represented as points on an (M-1)-dimensional probability simplex. Uses geometric concepts (triangle areas) to analyze ranking identifiability. Designs geometric Bayesian priors to encode epistemic uncertainty about judge quality and conducts sensitivity analyses.
Result: Theoretical conditions show LLM judges are more effective for binary scoring (M=2) than multi-level scoring (M>2). Experiments show rankings based solely on LLM judges are robust in many but not all datasets. Bayesian method achieves substantially higher coverage rates than existing procedures.
Conclusion: LLM judges work well in many cases but epistemic uncertainty matters. The geometric Bayesian approach provides better uncertainty quantification, highlighting the need for caution when using LLMs as judges without considering judge quality uncertainty.
Abstract: Given the challenge of automatically evaluating free-form outputs from large language models (LLMs), an increasingly common solution is to use LLMs themselves as the judging mechanism, without any gold-standard scores. Implicitly, this practice accounts for only sampling variability (aleatoric uncertainty) and ignores uncertainty about judge quality (epistemic uncertainty). While this is justified if judges are perfectly accurate, it is unclear when such an approach is theoretically valid and practically robust. We study these questions for the task of ranking LLM candidates from a novel geometric perspective: for $M$-level scoring systems, both LLM judges and candidates can be represented as points on an $(M-1)$-dimensional probability simplex, where geometric concepts (e.g., triangle areas) correspond to key ranking concepts. This perspective yields intuitive theoretical conditions and visual proofs for when rankings are identifiable; for instance, we provide a formal basis for the ``folk wisdom’’ that LLM judges are more effective for two-level scoring ($M=2$) than multi-level scoring ($M>2$). Leveraging the simplex, we design geometric Bayesian priors that encode epistemic uncertainty about judge quality and vary the priors to conduct sensitivity analyses. Experiments on LLM benchmarks show that rankings based solely on LLM judges are robust in many but not all datasets, underscoring both their widespread success and the need for caution. Our Bayesian method achieves substantially higher coverage rates than existing procedures, highlighting the importance of modeling epistemic uncertainty.
[731] Fast training and sampling of Restricted Boltzmann Machines
Nicolas Béreux, Aurélien Decelle, Cyril Furtlehner, Lorenzo Rosset, Beatriz Seoane
Main category: cs.LG
TL;DR: Novel RBM training approach using continuous phase transition analogy, Parallel Trajectory Tempering sampling, and pre-training with convex optimization to overcome MCMC mixing issues in structured datasets.
Details
Motivation: RBMs struggle with slow MCMC mixing during training, especially for highly structured datasets, making training inefficient and sample generation costly.Method: 1) Stepwise encoding of data patterns into coupling matrix singular vectors; 2) Leveraging continuous phase transition analogy for smooth annealing trajectory; 3) Parallel Trajectory Tempering (PTT) sampling strategy; 4) Pre-training phase using convex optimization to encode principal components into low-rank RBM.
Result: Reduced sampling and training costs, efficient log-likelihood estimation, PTT outperforms optimized MCMC methods, pre-training enables handling of highly structured datasets where conventional methods fail.
Conclusion: The proposed approach addresses critical RBM training limitations through phase transition-inspired training, novel sampling, and strategic pre-training, enabling efficient modeling of complex structured datasets.
Abstract: Restricted Boltzmann Machines (RBMs) are powerful tools for modeling complex systems and extracting insights from data, but their training is hindered by the slow mixing of Markov Chain Monte Carlo (MCMC) processes, especially with highly structured datasets. In this study, we build on recent theoretical advances in RBM training and focus on the stepwise encoding of data patterns into singular vectors of the coupling matrix, significantly reducing the cost of generating new samples and evaluating the quality of the model, as well as the training cost in highly clustered datasets. The learning process is analogous to the thermodynamic continuous phase transitions observed in ferromagnetic models, where new modes in the probability measure emerge in a continuous manner. We leverage the continuous transitions in the training process to define a smooth annealing trajectory that enables reliable and computationally efficient log-likelihood estimates. This approach enables online assessment during training and introduces a novel sampling strategy called Parallel Trajectory Tempering (PTT) that outperforms previously optimized MCMC methods. To mitigate the critical slowdown effect in the early stages of training, we propose a pre-training phase. In this phase, the principal components are encoded into a low-rank RBM through a convex optimization process, facilitating efficient static Monte Carlo sampling and accurate computation of the partition function. Our results demonstrate that this pre-training strategy allows RBMs to efficiently handle highly structured datasets where conventional methods fail. Additionally, our log-likelihood estimation outperforms computationally intensive approaches in controlled scenarios, while the PTT algorithm significantly accelerates MCMC processes compared to conventional methods.
[732] Unlearning Inversion Attacks for Graph Neural Networks
Jiahao Zhang, Yilong Wang, Zhiwei Zhang, Xiaorui Liu, Suhang Wang
Main category: cs.LG
TL;DR: Graph unlearning inversion attack (TrendAttack) can reconstruct removed edges from unlearned GNNs by exploiting confidence patterns and adaptive thresholds, exposing privacy vulnerabilities in graph unlearning methods.
Details
Motivation: Current graph unlearning methods assume deleted information cannot be recovered, but this work challenges that assumption by investigating whether adversaries can reconstruct removed edges given only black-box access to an unlearned GNN and partial graph knowledge.Method: Proposes TrendAttack with two key components: 1) exploits the “confidence pitfall” where nodes adjacent to unlearned edges show large drops in model confidence, and 2) uses adaptive prediction mechanism with different similarity thresholds for unlearned vs. retained edges. The framework integrates existing membership inference techniques with trend features.
Result: Experiments on four real-world datasets show TrendAttack significantly outperforms state-of-the-art GNN membership inference baselines, demonstrating successful reconstruction of removed edges from unlearned GNNs.
Conclusion: The work exposes a critical privacy vulnerability in current graph unlearning methods, showing that the assumption that deleted information cannot be recovered is flawed, and adversaries can successfully reconstruct removed edges through inversion attacks.
Abstract: Graph unlearning methods aim to efficiently remove the impact of sensitive data from trained GNNs without full retraining, assuming that deleted information cannot be recovered. In this work, we challenge this assumption by introducing the graph unlearning inversion attack: given only black-box access to an unlearned GNN and partial graph knowledge, can an adversary reconstruct the removed edges? We identify two key challenges: varying probability-similarity thresholds for unlearned versus retained edges, and the difficulty of locating unlearned edge endpoints, and address them with TrendAttack. First, we derive and exploit the confidence pitfall, a theoretical and empirical pattern showing that nodes adjacent to unlearned edges exhibit a large drop in model confidence. Second, we design an adaptive prediction mechanism that applies different similarity thresholds to unlearned and other membership edges. Our framework flexibly integrates existing membership inference techniques and extends them with trend features. Experiments on four real-world datasets demonstrate that TrendAttack significantly outperforms state-of-the-art GNN membership inference baselines, exposing a critical privacy vulnerability in current graph unlearning methods.
[733] Evaluating Model Performance Under Worst-case Subpopulations
Mike Li, Daksh Mittal, Hongseok Namkoong, Shangzhou Xia
Main category: cs.LG
TL;DR: A method for evaluating ML model robustness across subpopulations defined by continuous attributes, with finite-sample guarantees and practical deployment validation.
Details
Motivation: ML models degrade when training and operational populations differ; need principled way to assess worst-case performance over subpopulations to ensure reliability and fairness.Method: Two-stage estimation procedure that evaluates worst-case model performance over all subpopulations of given size defined by core attributes Z, handling continuous attributes and intersectionality.
Result: Method provides dimension-free convergence guarantees, avoids overly conservative Rademacher complexity bounds, and successfully certifies robustness or prevents deployment of unreliable models on real datasets.
Conclusion: The proposed approach offers scalable, principled evaluation of distributional robustness that accounts for intersectionality and provides reliable certification for model deployment decisions.
Abstract: The performance of ML models degrades when the training population is different from that seen under operation. Towards assessing distributional robustness, we study the worst-case performance of a model over all subpopulations of a given size, defined with respect to core attributes Z. This notion of robustness can consider arbitrary (continuous) attributes Z, and automatically accounts for complex intersectionality in disadvantaged groups. We develop a scalable yet principled two-stage estimation procedure that can evaluate the robustness of state-of-the-art models. We prove that our procedure enjoys several finite-sample convergence guarantees, including dimension-free convergence. Instead of overly conservative notions based on Rademacher complexities, our evaluation error depends on the dimension of Z only through the out-of-sample error in estimating the performance conditional on Z. On real datasets, we demonstrate that our method certifies the robustness of a model and prevents deployment of unreliable models.
[734] Mastering AI: Big Data, Deep Learning, and the Evolution of Large Language Models – AutoML from Basics to State-of-the-Art Techniques
Pohsun Feng, Ziqian Bi, Yizhu Wen, Benji Peng, Junyu Liu, Caitlyn Heqi Yin, Tianyang Wang, Keyu Chen, Sen Zhang, Ming Li, Jiawei Xu, Ming Liu, Xuanhe Pan, Jinlang Wang, Xinyuan Song, Qian Niu
Main category: cs.LG
TL;DR: A comprehensive guide to Automated Machine Learning (AutoML) covering fundamentals, tools, and future trends for both beginners and experts.
Details
Motivation: To provide a comprehensive resource that bridges the gap between theoretical AutoML principles and practical implementation, helping both beginners and experienced practitioners navigate the rapidly evolving field of automated machine learning.Method: The paper presents a structured guide format, covering fundamental AutoML principles, detailed discussions of popular tools (TPOT, AutoGluon, Auto-Keras), and exploration of emerging topics like Neural Architecture Search (NAS) and AutoML applications in deep learning.
Result: A comprehensive educational resource that systematically organizes AutoML knowledge, tools, and applications, making the field more accessible to diverse audiences from beginners to experienced practitioners.
Conclusion: This work serves as a valuable contribution to AutoML education and research, with anticipated impact on ongoing AI and machine learning development by providing a structured guide to both current practices and future trends in automated machine learning.
Abstract: A comprehensive guide to Automated Machine Learning (AutoML) is presented, covering fundamental principles, practical implementations, and future trends. The paper is structured to assist both beginners and experienced practitioners, with detailed discussions on popular AutoML tools such as TPOT, AutoGluon, and Auto-Keras. Emerging topics like Neural Architecture Search (NAS) and AutoML’s applications in deep learning are also addressed. It is anticipated that this work will contribute to ongoing research and development in the field of AI and machine learning.
[735] One Sample is Enough to Make Conformal Prediction Robust
Soroush H. Zargarbashi, Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski
Main category: cs.LG
TL;DR: RCP1: Single-sample robust conformal prediction that certifies the conformal procedure itself rather than individual scores, achieving smaller average set sizes than methods requiring many forward passes.
Details
Motivation: Current robust conformal prediction methods using randomized smoothing require many model forward passes per input (e.g., 100), which is computationally expensive. There's a need for efficient robust CP that maintains guarantees with fewer computations.Method: Propose RCP1 (single sample robust CP) that uses any binary certificate and certifies the conformal procedure itself rather than individual conformity scores. Works with a single forward pass on a randomly perturbed input, applicable to both classification and regression tasks.
Result: RCP1 returns robust sets with smaller average set size compared to state-of-the-art methods that use many passes per input (e.g., 100). The approach maintains robustness guarantees while being computationally efficient.
Conclusion: Conformal prediction can achieve robustness with just a single forward pass by certifying the procedure itself rather than individual scores. RCP1 provides an efficient, task-agnostic approach to robust conformal prediction with better computational efficiency than smoothing-based methods.
Abstract: For any black-box model, conformal prediction (CP) returns prediction sets guaranteed to include the true label with high adjustable probability. Robust CP (RCP) extends the guarantee to the worst case noise up to a pre-defined magnitude. For RCP, a well-established approach is to use randomized smoothing since it is applicable to any black-box model and provides smaller sets compared to deterministic methods. However, smoothing-based robustness requires many model forward passes per each input which is computationally expensive. We show that conformal prediction attains some robustness even with a single forward pass on a randomly perturbed input. Using any binary certificate we propose a single sample robust CP (RCP1). Our approach returns robust sets with smaller average set size compared to SOTA methods which use many (e.g. 100) passes per input. Our key insight is to certify the conformal procedure itself rather than individual conformity scores. Our approach is agnostic to the task (classification and regression). We further extend our approach to smoothing-based robust conformal risk control.
[736] FIT-GNN: Faster Inference Time for GNNs that ‘FIT’ in Memory Using Coarsening
Shubhajit Roy, Hrriday Ruparel, Kishan Ved, Anirban Dasgupta
Main category: cs.LG
TL;DR: This paper presents a novel graph coarsening approach to improve GNN scalability by reducing computational costs during inference, achieving orders of magnitude faster inference and lower memory usage while maintaining competitive performance.
Details
Motivation: Scalability remains a major challenge for Graph Neural Networks (GNNs). While existing methods focus on training efficiency through techniques like coarsening and condensation, they inadequately address computational costs during the inference phase, which is crucial for real-world deployment.Method: The paper introduces a graph coarsening approach for inference efficiency, featuring two specific methods: Extra Nodes and Cluster Nodes. The approach extends graph coarsening to graph-level tasks including graph classification and regression, enabling efficient processing on smaller graphs during inference.
Result: The method achieves orders of magnitude improvements in single-node inference time compared to traditional approaches. It significantly reduces memory consumption for node and graph classification/regression tasks, enabling efficient training and inference on low-resource devices where conventional methods are impractical.
Conclusion: The proposed graph coarsening approach successfully addresses GNN scalability challenges during inference, delivering substantial computational efficiency gains while maintaining competitive performance relative to baseline models, making GNNs more practical for deployment on resource-constrained devices.
Abstract: Scalability of Graph Neural Networks (GNNs) remains a significant challenge. To tackle this, methods like coarsening, condensation, and computation trees are used to train on a smaller graph, resulting in faster computation. Nonetheless, prior research has not adequately addressed the computational costs during the inference phase. This paper presents a novel approach to improve the scalability of GNNs by reducing computational burden during the inference phase using graph coarsening. We demonstrate two different methods – Extra Nodes and Cluster Nodes. Our study extends the application of graph coarsening for graph-level tasks, including graph classification and graph regression. We conduct extensive experiments on multiple benchmark datasets to evaluate the performance of our approach. Our results show that the proposed method achieves orders of magnitude improvements in single-node inference time compared to traditional approaches. Furthermore, it significantly reduces memory consumption for node and graph classification and regression tasks, enabling efficient training and inference on low-resource devices where conventional methods are impractical. Notably, these computational advantages are achieved while maintaining competitive performance relative to baseline models.
[737] Quantum-Classical Hybrid Quantized Neural Network
Wenxin Li, Chuan Wang, Hongdong Zhu, Qi Gao, Yin Ma, Hai Wei, Kai Wen
Main category: cs.LG
TL;DR: A novel Quadratic Binary Optimization framework for training quantized neural networks using quantum solvers, with methods to handle nonlinearities and reduce quantum resource requirements.
Details
Motivation: To enable quantum solvers to handle complex neural network training by developing a framework that preserves universal approximation properties while making nonlinear functions accessible to quantum hardware.Method: 1) Quadratic Binary Optimization framework with spline interpolation for arbitrary activation/loss functions; 2) Forward Interval Propagation for handling nonlinearities via discretization; 3) Quantum Conditional Gradient Descent algorithm for solving QCBO on quantum hardware; 4) Decomposed copositive optimization scheme for scalability; 5) Quantum Progressive Hedging for solving decomposed problems.
Result: Theoretical results include: upper bound on approximation error, bound on number of Ising spins required, convergence proof for QCGD under quantum oracle constraints, upper bound on Time-To-Solution, and substantial reduction in quantum resource requirements through decomposition.
Conclusion: The proposed framework successfully enables efficient low-bit neural network training on quantum hardware by addressing key challenges of nonlinearities, constraints, and scalability through novel optimization methods and decomposition techniques.
Abstract: In this work, we introduce a novel Quadratic Binary Optimization (QBO) framework for training a quantized neural network. The framework enables the use of arbitrary activation and loss functions through spline interpolation, while Forward Interval Propagation addresses the nonlinearities and the multi-layered, composite structure of neural networks via discretizing activation functions into linear subintervals. This preserves the universal approximation properties of neural networks while allowing complex nonlinear functions accessible to quantum solvers, broadening their applicability in artificial intelligence. Theoretically, we derive an upper bound on the approximation error and the number of Ising spins required by deriving the sample complexity of the empirical risk minimization problem from an optimization perspective. A key challenge in solving the associated large-scale Quadratic Constrained Binary Optimization (QCBO) model is the presence of numerous constraints. To overcome this, we adopt the Quantum Conditional Gradient Descent (QCGD) algorithm, which solves QCBO directly on quantum hardware. We establish the convergence of QCGD under a quantum oracle subject to randomness, bounded variance, and limited coefficient precision, and further provide an upper bound on the Time-To-Solution. To enhance scalability, we further incorporate a decomposed copositive optimization scheme that replaces the monolithic lifted model with sample-wise subproblems. This decomposition substantially reduces the quantum resource requirements and enables efficient low-bit neural network training. We further propose the usage of QCGD and Quantum Progressive Hedging (QPH) algorithm to efficiently solve the decomposed problem.
[738] SeqProFT: Sequence-only Protein Property Prediction with LoRA Finetuning
Shuo Zhang, Jian K. Liu
Main category: cs.LG
TL;DR: LoRA-based parameter-efficient finetuning enables smaller protein language models to match or exceed larger models’ performance on protein property prediction tasks while reducing computational costs.
Details
Motivation: Protein language models require substantial computational resources for finetuning with suboptimal results, creating a need for more efficient adaptation methods that maintain performance while reducing resource demands.Method: Applied LoRA (Low-Rank Adaptation) to ESM-2 and ESM-C models of varying sizes, evaluated on 10 diverse protein property prediction tasks, and integrated contact map information through multi-head attention mechanism.
Result: Smaller models with LoRA adaptation matched or exceeded performance of larger models without adaptation, with faster convergence, better performance, and more efficient resource utilization.
Conclusion: LoRA finetuning provides practical guidance for protein research in resource-constrained environments by enabling efficient adaptation of protein language models while maintaining or improving performance.
Abstract: Protein language models (PLMs) have demonstrated remarkable capabilities in learning relationships between protein sequences and functions. However, finetuning these large models requires substantial computational resources, often with suboptimal task-specific results. This study investigates how parameter-efficient finetuning via LoRA can enhance protein property prediction while significantly reducing computational demands. By applying LoRA to ESM-2 and ESM-C models of varying sizes and evaluating 10 diverse protein property prediction tasks, we demonstrate that smaller models with LoRA adaptation can match or exceed the performance of larger models without adaptation. Additionally, we integrate contact map information through a multi-head attention mechanism, improving model comprehension of structural features. Our systematic analysis reveals that LoRA finetuning enables faster convergence, better performance, and more efficient resource utilization, providing practical guidance for protein research applications in resource-constrained environments. The code is available at https://github.com/jiankliu/SeqProFT.
[739] Inversely Learning Transferable Rewards via Abstracted States
Yikang Gui, Prashant Doshi
Main category: cs.LG
TL;DR: The paper proposes a method for learning abstract, transferable reward functions from behavior trajectories across different domain instances, enabling robots to adapt to new but related tasks without reprogramming.
Details
Motivation: Current IRL can learn rewards from behavior data, but the next challenge is learning intrinsic preferences that transfer to new but related tasks/settings. This is crucial for robotic applications where robots need to adapt to new processing line tasks without complete reprogramming.Method: Learn an abstract reward function from behavior trajectories across two or more different instances of a domain, then use this abstract reward to learn task behavior in a completely new domain instance to validate transferability.
Result: Evaluated on OpenAI Gym and AssistiveGym tasks, showing that learned abstract reward functions successfully enable learning task behaviors in previously unseen instances of the respective domains.
Conclusion: The method demonstrates successful transfer learning of intrinsic preferences across domain instances, advancing IRL toward more practical robotic applications where adaptation to new but related tasks is required.
Abstract: Inverse reinforcement learning (IRL) has progressed significantly toward accurately learning the underlying rewards in both discrete and continuous domains from behavior data. The next advance is to learn {\em intrinsic} preferences in ways that produce useful behavior in settings or tasks which are different but aligned with the observed ones. In the context of robotic applications, this helps integrate robots into processing lines involving new tasks (with shared intrinsic preferences) without programming from scratch. We introduce a method to inversely learn an abstract reward function from behavior trajectories in two or more differing instances of a domain. The abstract reward function is then used to learn task behavior in another separate instance of the domain. This step offers evidence of its transferability and validates its correctness. We evaluate the method on trajectories in tasks from multiple domains in OpenAI’s Gym testbed and AssistiveGym and show that the learned abstract reward functions can successfully learn task behaviors in instances of the respective domains, which have not been seen previously.
[740] Online-BLS: An Accurate and Efficient Online Broad Learning System for Data Stream Classification
Chunyu Lei, Guang-Ze Chen, C. L. Philip Chen, Tong Zhang
Main category: cs.LG
TL;DR: Online Broad Learning System with closed-form solutions and efficient updates using Cholesky decomposition instead of matrix inversion, achieving better accuracy and faster online learning.
Details
Motivation: Existing online learning models suffer from suboptimal model weights due to single gradient descent updates, and incremental broad learning algorithms have degraded accuracy and expensive update overhead.Method: 1) Effective weight estimation using Cholesky decomposition and forward-backward substitution instead of matrix inversion; 2) Efficient online updating strategy to reduce update time; 3) Theoretical analysis of error bounds and time complexity.
Result: Superior performance and efficiency on various real-world datasets using test-then-training evaluation, and extension to data stream scenarios with concept drift outperforms state-of-the-art baselines.
Conclusion: The proposed online broad learning framework with closed-form solutions provides improved accuracy, efficient updates, and strong theoretical guarantees, making it effective for both standard online learning and data stream scenarios with concept drift.
Abstract: The state-of-the-art online learning models generally conduct a single online gradient descent when a new sample arrives and thus suffer from suboptimal model weights. To this end, we introduce an online broad learning system framework with closed-form solutions for each online update. Different from employing existing incremental broad learning algorithms for online learning tasks, which tend to incur degraded accuracy and expensive online update overhead, we design an effective weight estimation algorithm and an efficient online updating strategy to remedy the above two deficiencies, respectively. Specifically, an effective weight estimation algorithm is first developed by replacing notorious matrix inverse operations with Cholesky decomposition and forward-backward substitution to improve model accuracy. Second, we devise an efficient online updating strategy that dramatically reduces online update time. Theoretical analysis exhibits the splendid error bound and low time complexity of our model. The most popular test-then-training evaluation experiments on various real-world datasets prove its superiority and efficiency. Furthermore, our framework is naturally extended to data stream scenarios with concept drift and exceeds state-of-the-art baselines.
[741] Flatten Graphs as Sequences: Transformers are Scalable Graph Generators
Dexiong Chen, Markus Krimmel, Karsten Borgwardt
Main category: cs.LG
TL;DR: AutoGraph is a scalable autoregressive model for graph generation using transformers, achieving SOTA performance with 100x faster generation and 3x faster training than diffusion models.
Details
Motivation: Current graph generation methods, particularly diffusion-based approaches, require expensive node feature computation and don't scale well to large, sparse graphs. There's a need for more efficient and scalable graph generation methods that can bridge language modeling techniques with graph generation.Method: AutoGraph uses decoder-only transformers to model graphs as sequences. It flattens graphs into random token sequences through a reversible process, enabling sequence modeling without expensive node features. The method’s sequence prefixes represent induced subgraphs, creating a direct link to sub-sentences in language modeling.
Result: AutoGraph achieves state-of-the-art performance on synthetic and molecular benchmarks, with up to 100x faster generation and 3x faster training than leading diffusion models. It supports substructure-conditioned generation without fine-tuning and shows promising transferability.
Conclusion: AutoGraph provides a scalable and efficient approach to graph generation that bridges language modeling and graph generation, laying groundwork for graph foundation models with optimal linear scaling for large, sparse graphs.
Abstract: We introduce AutoGraph, a scalable autoregressive model for attributed graph generation using decoder-only transformers. By flattening graphs into random sequences of tokens through a reversible process, AutoGraph enables modeling graphs as sequences without relying on additional node features that are expensive to compute, in contrast to diffusion-based approaches. This results in sampling complexity and sequence lengths that scale optimally linearly with the number of edges, making it scalable and efficient for large, sparse graphs. A key success factor of AutoGraph is that its sequence prefixes represent induced subgraphs, creating a direct link to sub-sentences in language modeling. Empirically, AutoGraph achieves state-of-the-art performance on synthetic and molecular benchmarks, with up to 100x faster generation and 3x faster training than leading diffusion models. It also supports substructure-conditioned generation without fine-tuning and shows promising transferability, bridging language modeling and graph generation to lay the groundwork for graph foundation models. Our code is available at https://github.com/BorgwardtLab/AutoGraph.
[742] XiChen: An observation-scalable fully AI-driven global weather forecasting system with 4D variational knowledge
Wuxin Wang, Weicheng Ni, Lilan Huang, Tao Hao, Ben Fei, Shuo Ma, Taikang Yuan, Yanlai Zhao, Kefeng Deng, Xiaoyong Li, Hongze Leng, Boheng Duan, Lei Bai, Weimin Zhang, Kaijun Ren, Junqiang Song
Main category: cs.LG
TL;DR: XiChen is a fully AI-driven global weather forecasting system that integrates data assimilation and medium-range forecasting in 15 seconds, achieving NWP-comparable accuracy with 8.75+ days skillful forecasting.
Details
Motivation: Current AI weather models still depend on costly NWP systems for initial conditions, and existing end-to-end AI approaches lack scalable assimilation of diverse observational data types.Method: Uses a foundation model pre-trained for weather forecasting, fine-tuned as observation operators and DA models. Integrates 4DVar knowledge for physical balance constraints and flow-dependent assimilation of both conventional and raw satellite observations.
Result: Achieves data assimilation and medium-range forecasting accuracy comparable to operational NWP systems, with skillful forecasting beyond 8.75 days. Maintains physical balance during DA and shows flow-dependent characteristics similar to traditional 4DVar.
Conclusion: XiChen demonstrates strong potential for fully AI-driven weather forecasting independent of NWP systems, enabling scalable assimilation of diverse observations while maintaining physical constraints.
Abstract: Artificial intelligence (AI)-driven models have the potential to revolutionize weather forecasting, but still rely on initial conditions generated by costly Numerical Weather Prediction (NWP) systems. Although recent end-to-end forecasting models attempt to bypass NWP systems, these methods lack scalable assimilation of new types of observational data. Here, we introduce XiChen, an observation-scalable fully AI-driven global weather forecasting system, wherein the entire pipeline, from Data Assimilation (DA) to medium-range forecasting, can be accomplished within only 15 seconds. XiChen is built upon a foundation model that is pre-trained for weather forecasting and subsequently fine-tuned to serve as both observation operators and DA models, thereby enabling the scalable assimilation of conventional and raw satellite observations. Furthermore, the integration of Four-Dimensional Variational (4DVar) knowledge ensures XiChen to achieve DA and medium-range forecasting accuracy comparable to operational NWP systems, with skillful forecasting lead time beyond 8.75 days. A key feature of XiChen is its ability to maintain physical balance constraints during DA, enabling observed variables to correct unobserved ones effectively. In single-point perturbation DA experiments, XiChen exhibits flow-dependent characteristics similar to those of traditional 4DVar systems. These results demonstrate that XiChen holds strong potential for fully AI-driven weather forecasting independent of NWP systems.
[743] EasySpec: Layer-Parallel Speculative Decoding for Efficient Multi-GPU Utilization
Yize Wu, Ke Gao, Ling Li, Yanjun Wu
Main category: cs.LG
TL;DR: EasySpec is a training-free, layer-parallel speculation method that accelerates LLM inference by breaking inter-layer dependencies in draft models, enabling simultaneous layer execution across multiple GPUs while preserving output quality.
Details
Motivation: Current speculative decoding in multi-GPU systems suffers from GPU idling during the drafting stage because the optimal tensor parallelism size for draft models is typically smaller than for base models, leading to inefficient sequential layer execution.Method: EasySpec breaks inter-layer data dependencies in the draft model, allowing multiple layers to run simultaneously across devices as ‘fuzzy’ speculation. After each iteration, the draft model’s key-value cache is calibrated in a single forward pass to prevent error accumulation.
Result: Achieves peak speedup of 4.17x compared to vanilla decoding while preserving original LLM distributions. Drafting stage accelerated by up to 1.62x with maximum speculation accuracy drop of only 7%.
Conclusion: EasySpec is an effective, training-free plug-in method that optimizes multi-GPU utilization for speculative decoding, significantly accelerating LLM inference without compromising output quality.
Abstract: Speculative decoding is an effective and lossless method for Large Language Model (LLM) inference acceleration. It employs a smaller model to generate a draft token sequence, which is then verified by the original base model. In multi-GPU systems, inference latency can be further reduced through tensor parallelism (TP), while the optimal TP size of the draft model is typically smaller than that of the base model, leading to GPU idling during the drafting stage. We observe that such inefficiency stems from the sequential execution of layers, which is seemingly natural but actually unnecessary. Therefore, we propose EasySpec, a layer-parallel speculation strategy that optimizes the efficiency of multi-GPU utilization. EasySpec breaks the inter-layer data dependencies in the draft model, enabling multiple layers to run simultaneously across multiple devices as ‘fuzzy’ speculation. After each drafting-and-verification iteration, the draft model’s key-value cache is calibrated in a single forward pass, preventing long-term fuzzy-error accumulation at minimal additional latency. EasySpec is a training-free and plug-in method. We evaluated EasySpec on several mainstream open-source LLMs, using smaller versions of models from the same series as drafters. The results demonstrate that EasySpec can achieve a peak speedup of 4.17x compared to vanilla decoding, while preserving the original distributions of the base LLMs. Specifically, the drafting stage can be accelerated by up to 1.62x with a maximum speculation accuracy drop of only 7%. The code is available at https://github.com/Yize-Wu/EasySpec.
[744] Stein Discrepancy for Unsupervised Domain Adaptation
Anneke von Seeger, Dongmian Zou, Gilad Lerman
Main category: cs.LG
TL;DR: Proposes a novel unsupervised domain adaptation framework using asymmetric Stein discrepancy instead of symmetric measures like MMD, particularly effective when target data is scarce.
Details
Motivation: Existing UDA methods using symmetric distribution alignment measures (like MMD) struggle when target data is scarce. There's a need for approaches that work well in low-data target regimes.Method: Leverages Stein discrepancy, an asymmetric measure that depends only on the target distribution’s score function. Offers kernelized and adversarial forms, and supports flexible target modeling via Gaussian, GMM, or VAE models.
Result: Derived generalization bound on target error and convergence rate for empirical Stein discrepancy. Empirically outperforms prior UDA approaches under limited target data across multiple benchmarks.
Conclusion: Stein discrepancy is a superior alternative to symmetric measures for UDA, especially in low-data target scenarios, with theoretical guarantees and strong empirical performance.
Abstract: Unsupervised domain adaptation (UDA) aims to improve model performance on an unlabeled target domain using a related, labeled source domain. A common approach aligns source and target feature distributions by minimizing a distance between them, often using symmetric measures such as maximum mean discrepancy (MMD). However, these methods struggle when target data is scarce. We propose a novel UDA framework that leverages Stein discrepancy, an asymmetric measure that depends on the target distribution only through its score function, making it particularly suitable for low-data target regimes. Our proposed method has kernelized and adversarial forms and supports flexible modeling of the target distribution via Gaussian, GMM, or VAE models. We derive a generalization bound on the target error and a convergence rate for the empirical Stein discrepancy in the two-sample setting. Empirically, our method consistently outperforms prior UDA approaches under limited target data across multiple benchmarks.
[745] Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning
Boheng Li, Renjie Gu, Junjie Wang, Leyi Qi, Yiming Li, Run Wang, Zhan Qin, Tianwei Zhang
Main category: cs.LG
TL;DR: ResAlign is a safety-driven unlearning framework for text-to-image diffusion models that enhances resilience against downstream fine-tuning to prevent recovery of harmful behaviors.
Details
Motivation: Text-to-image diffusion models inherit unsafe behaviors from toxic pretraining data, and existing safety-driven unlearning methods are fragile to downstream fine-tuning, failing to retain effectiveness even when fine-tuned on benign datasets.Method: Models downstream fine-tuning as an implicit optimization problem using Moreau envelope-based reformulation for efficient gradient estimation, plus a meta-learning strategy to simulate diverse fine-tuning scenarios for better generalization.
Result: Extensive experiments across various datasets, fine-tuning methods, and configurations show ResAlign consistently outperforms prior unlearning approaches in retaining safety while preserving benign generation capability.
Conclusion: ResAlign provides an effective safety-driven unlearning framework with enhanced resilience against downstream fine-tuning, addressing a critical vulnerability in current safety methods for text-to-image diffusion models.
Abstract: Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, they are found to be fragile to downstream fine-tuning, as we reveal that state-of-the-art methods largely fail to retain their effectiveness even when fine-tuned on entirely benign datasets. To mitigate this problem, in this paper, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, a meta-learning strategy is proposed to simulate a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety, while effectively preserving benign generation capability. Our code and pretrained models are publicly available at https://github.com/AntigoneRandy/ResAlign.
[746] Countering Overfitting with Counterfactual Examples
Flavio Giorgi, Fabiano Veglianti, Fabrizio Silvestri, Gabriele Tolomei
Main category: cs.LG
TL;DR: CF-Reg is a novel regularization method that uses counterfactual examples to control overfitting by ensuring sufficient margin between instances and their counterfactuals, outperforming traditional regularization techniques.
Details
Motivation: The paper addresses the classic problem of overfitting in machine learning, where models fail to generalize to unseen data. The authors observe a correlation between overfitting and the ease of generating counterfactual examples, motivating a new approach to regularization based on this relationship.Method: The authors introduce CF-Reg, a novel regularization term added to the training loss. This term controls overfitting by ensuring sufficient margin between each training instance and its corresponding counterfactual example. The approach leverages the observed correlation that higher overfitting makes it easier to find valid counterfactuals.
Result: Experiments across multiple datasets and models demonstrate that CF-Reg generally outperforms existing regularization techniques in controlling overfitting and improving model generalization.
Conclusion: The paper presents CF-Reg as an effective regularization method that uses counterfactual examples to mitigate overfitting, offering a novel approach that outperforms traditional regularization techniques by leveraging the relationship between overfitting and counterfactual generation.
Abstract: Overfitting is a well-known issue in machine learning that occurs when a model struggles to generalize its predictions to new, unseen data beyond the scope of its training set. Traditional techniques to mitigate overfitting include early stopping, data augmentation, and regularization. In this work, we demonstrate that the degree of overfitting of a trained model is correlated with the ability to generate counterfactual examples. The higher the overfitting, the easier it will be to find a valid counterfactual example for a randomly chosen input data point. Therefore, we introduce CF-Reg, a novel regularization term in the training loss that controls overfitting by ensuring enough margin between each instance and its corresponding counterfactual. Experiments conducted across multiple datasets and models show that our counterfactual regularizer generally outperforms existing regularization techniques.
[747] ExLLM: Experience-Enhanced LLM Optimization for Molecular Design and Beyond
Nian Ran, Yue Wang, Xiaoyuan Zhang, Zhongzheng Li, Qingsong Ran, Wenhao Li, Richard Allmendinger
Main category: cs.LG
TL;DR: ExLLM is an LLM-as-optimizer framework that introduces experience snippets, k-offspring exploration, and feedback adaptation to improve molecular design optimization, achieving SOTA on PMO and generalizing to other domains.
Details
Motivation: Traditional molecular design optimizers (Bayesian optimization, genetic algorithms, generative models) struggle with leveraging expert knowledge and handling complex feedback. Existing LLM-based approaches lack mechanisms for handling complex feedback and maintaining scalable memory, leading to redundancy and poor exploration in large-scale iterative search.Method: ExLLM framework with three components: (1) compact, evolving experience snippets that distill non-redundant cues for large discrete spaces; (2) k-offspring scheme for wider exploration per call with reduced orchestration cost; (3) lightweight feedback adapter that normalizes objectives for selection while formatting constraints and expert hints.
Result: Sets new state-of-the-art results on PMO benchmark, establishes records on circle packing and stellarator design, and yields consistent gains across additional domains with minimal transfer requirements (only task-description template and evaluation functions).
Conclusion: ExLLM demonstrates that LLMs can be effective optimizers when equipped with proper experience management, exploration mechanisms, and feedback adaptation, enabling strong generalization across diverse optimization domains beyond molecular design.
Abstract: Molecular design involves an enormous and irregular search space, where traditional optimizers such as Bayesian optimization, genetic algorithms, and generative models struggle to leverage expert knowledge or handle complex feedback. Recently, LLMs have been used as optimizers, achieving promising results on benchmarks such as PMO. However, existing approaches rely only on prompting or extra training, without mechanisms to handle complex feedback or maintain scalable memory. In particular, the common practice of appending or summarizing experiences at every query leads to redundancy, degraded exploration, and ultimately poor final outcomes under large-scale iterative search. We introduce ExLLM (Experience-Enhanced LLM optimization), an LLM-as-optimizer framework with three components: (1) a compact, evolving experience snippet tailored to large discrete spaces that distills non-redundant cues and improves convergence at low cost; (2) a simple yet effective k-offspring scheme that widens exploration per call and reduces orchestration cost; and (3) a lightweight feedback adapter that normalizes objectives for selection while formatting constraints and expert hints for iteration. ExLLM sets new state-of-the-art results on PMO and generalizes strongly in our setup, it sets records on circle packing and stellarator design, and yields consistent gains across additional domains requiring only a task-description template and evaluation functions to transfer.
[748] Global-Decision-Focused Neural ODEs for Proactive Grid Resilience Management
Shuyi Chen, Ferdinando Fioretto, Feng Qiu, Shixiang Zhu
Main category: cs.LG
TL;DR: PATOG framework integrates outage prediction with globally optimized interventions for power grid resilience during extreme hazards, using decision-aware neural ODE models to align prediction and optimization objectives.
Details
Motivation: Extreme hazard events increasingly threaten power systems, causing widespread outages. Traditional predict-then-optimize approaches suffer from misalignment between prediction and optimization objectives, leading to suboptimal resource allocation.Method: Propose PATOG (predict-all-then-optimize-globally) framework with global-decision-focused (GDF) neural ODE model that captures outage dynamics while optimizing resilience strategies in a decision-aware manner, ensuring spatially and temporally coherent decision-making.
Result: Experiments on synthetic and real-world datasets demonstrate significant improvements in outage prediction consistency and grid resilience compared to conventional methods.
Conclusion: PATOG framework effectively addresses the misalignment problem in traditional approaches, improving both predictive accuracy and operational efficiency for power grid resilience during extreme hazard events.
Abstract: Extreme hazard events such as wildfires and hurricanes increasingly threaten power systems, causing widespread outages and disrupting critical services. Recently, predict-then-optimize approaches have gained traction in grid operations, where system functionality forecasts are first generated and then used as inputs for downstream decision-making. However, this two-stage method often results in a misalignment between prediction and optimization objectives, leading to suboptimal resource allocation. To address this, we propose predict-all-then-optimize-globally (PATOG), a framework that integrates outage prediction with globally optimized interventions. At its core, our global-decision-focused (GDF) neural ODE model captures outage dynamics while optimizing resilience strategies in a decision-aware manner. Unlike conventional methods, our approach ensures spatially and temporally coherent decision-making, improving both predictive accuracy and operational efficiency. Experiments on synthetic and real-world datasets demonstrate significant improvements in outage prediction consistency and grid resilience.
[749] Quantization-Free Autoregressive Action Transformer
Ziyad Sheebaelhamd, Michael Tschannen, Michael Muehlebach, Claire Vernade
Main category: cs.LG
TL;DR: GIVT: A quantization-free approach using continuous action representations for transformer-based imitation learning, achieving SOTA performance on robotics tasks.
Details
Motivation: Current transformer-based imitation learning methods use discrete action representations through quantization, which breaks the continuous structure of the action space and limits generative model capabilities.Method: Proposes a quantization-free method using Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers, with enhanced sampling algorithms for policy roll-outs.
Result: Achieves state-of-the-art performance on a variety of popular simulated robotics tasks while simplifying the imitation learning pipeline.
Conclusion: GIVT provides an effective continuous alternative to discrete action representations for transformer-based imitation learning, improving performance through better preservation of action space structure.
Abstract: Current transformer-based imitation learning approaches introduce discrete action representations and train an autoregressive transformer decoder on the resulting latent code. However, the initial quantization breaks the continuous structure of the action space thereby limiting the capabilities of the generative model. We propose a quantization-free method instead that leverages Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers. This simplifies the imitation learning pipeline while achieving state-of-the-art performance on a variety of popular simulated robotics tasks. We enhance our policy roll-outs by carefully studying sampling algorithms, further improving the results.
[750] A Physics-Informed Meta-Learning Framework for the Continuous Solution of Parametric PDEs on Arbitrary Geometries
Reza Najian Asl, Yusuke Yamazaki, Kianoosh Taghikhani, Mayu Muramatsu, Markus Apel, Shahed Rezaei
Main category: cs.LG
TL;DR: iFOL is an implicit neural operator learning method that solves PDEs on arbitrary geometries using a physics-informed encoder-decoder network with discrete residual backpropagation, eliminating traditional encode-process-decode pipelines.
Details
Motivation: To develop a continuous and parametric PDE solution method that works on arbitrary geometries, captures sharp discontinuities, provides solution-to-parameter gradients automatically, and removes mesh constraints while maintaining accuracy.Method: Physics-informed encoder-decoder network with implicit neural field decoder conditioned on latent codes. Uses second-order meta-learning for PDE encoding, minimizes physics-informed loss in energy/weighted residual form with discrete residuals from standard numerical methods, enabling backpropagation of discrete residuals.
Result: Method achieves accurate parametric continuous fields, provides solution-to-parameter gradients without extra computation, captures sharp discontinuities, works on arbitrary geometries with zero-shot super-resolution, and generalizes well to unseen samples in stationary and transient PDEs.
Conclusion: iFOL offers a promising approach for challenging computational mechanics problems with unique advantages over traditional operator learning methods, demonstrating strong performance across various PDE types and geometries.
Abstract: In this work, we introduce implicit Finite Operator Learning (iFOL) for the continuous and parametric solution of partial differential equations (PDEs) on arbitrary geometries. We propose a physics-informed encoder-decoder network to establish the mapping between continuous parameter and solution spaces. The decoder constructs the parametric solution field by leveraging an implicit neural field network conditioned on a latent or feature code. Instance-specific codes are derived through a PDE encoding process based on the second-order meta-learning technique. In training and inference, a physics-informed loss function is minimized during the PDE encoding and decoding. iFOL expresses the loss function in an energy or weighted residual form and evaluates it using discrete residuals derived from standard numerical PDE methods. This approach results in the backpropagation of discrete residuals during both training and inference. iFOL features several key properties: (1) its unique loss formulation eliminates the need for the conventional encode-process-decode pipeline previously used in operator learning with conditional neural fields for PDEs; (2) it not only provides accurate parametric and continuous fields but also delivers solution-to-parameter gradients without requiring additional loss terms or sensitivity analysis; (3) it can effectively capture sharp discontinuities in the solution; and (4) it removes constraints on the geometry and mesh, making it applicable to arbitrary geometries and spatial sampling (zero-shot super-resolution capability). We critically assess these features and analyze the network’s ability to generalize to unseen samples across both stationary and transient PDEs. The overall performance of the proposed method is promising, demonstrating its applicability to a range of challenging problems in computational mechanics.
[751] Repetition Makes Perfect: Recurrent Graph Neural Networks Match Message-Passing Limit
Eran Rosenbluth, Martin Grohe
Main category: cs.LG
TL;DR: Recurrent GNNs with finite precision, sum aggregation, and ReLU can match the expressivity limit of Color Refinement/Weisfeiler-Leman algorithm, unlike non-recurrent GNNs which only achieve this non-uniformly. With random initialization, they can express all graph algorithms on connected graphs.
Details
Motivation: The paper aims to precisely characterize the expressivity of computable Recurrent Graph Neural Networks. While it's known that GNN expressive power is limited by the Color Refinement/Weisfeiler-Leman invariance, it's unclear whether GNNs can actually match this theoretical limit in practice.Method: The authors analyze recurrent GNNs with finite-precision parameters, sum aggregation, and ReLU activation. They prove these networks can compute any graph algorithm respecting the message-passing invariance induced by Color Refinement. They also show that with random initialization, recurrent GNNs can express all graph algorithms on connected graphs.
Result: Recurrent GNNs can match the expressivity limit of Weisfeiler-Leman algorithm, unlike non-recurrent GNNs which only achieve this in a weak, non-uniform sense. With random initialization, recurrent GNNs can emulate any polynomial-time graph algorithm on connected graphs in polynomial time with only polynomial overhead.
Conclusion: Recurrent GNNs are provably as expressive as the theoretical limits imposed by Color Refinement invariance, and with random initialization they become universal approximators for graph algorithms on connected graphs, bridging the gap between theoretical limits and practical expressivity.
Abstract: We precisely characterize the expressivity of computable Recurrent Graph Neural Networks (recurrent GNNs). We prove that recurrent GNNs with finite-precision parameters, sum aggregation, and ReLU activation, can compute any graph algorithm that respects the natural message-passing invariance induced by the Color Refinement (or Weisfeiler-Leman) algorithm. While it is well known that the expressive power of GNNs is limited by this invariance [Morris et al., AAAI 2019; Xu et al., ICLR 2019], we establish that recurrent GNNs can actually match this limit. This is in contrast to non-recurrent GNNs, which have the power of Weisfeiler-Leman only in a very weak, “non-uniform”, sense where each graph size requires a different GNN to compute with. Our construction introduces only a polynomial overhead in both time and space. Furthermore, we show that by incorporating random initialization, for connected graphs recurrent GNNs can express all graph algorithms. In particular, any polynomial-time graph algorithm can be emulated on connected graphs in polynomial time by a recurrent GNN with random initialization.
[752] Certifying Stability of Reinforcement Learning Policies using Generalized Lyapunov Functions
Kehan Long, Jorge Cortés, Nikolay Atanasov
Main category: cs.LG
TL;DR: The paper proposes a method to certify stability of RL policies by augmenting value functions with neural network residuals and using a relaxed multi-step Lyapunov condition, bridging control theory with learning-based methods.
Details
Motivation: Traditional Lyapunov stability certificates are difficult to construct for learned RL policies, creating a gap between empirical performance and provable guarantees. The RL value function is a natural candidate but needs adaptation for stability certification.Method: 1) Study LQR to gain intuition: augment value function with residual term and relax Lyapunov condition to require only average decrease over multiple steps. 2) For nonlinear systems: learn generalized Lyapunov functions by augmenting RL value functions with neural network residual terms. 3) Jointly train neural controllers and stability certificates using multi-step Lyapunov loss.
Result: Successfully certifies stability of RL policies on Gymnasium and DeepMind Control benchmarks. The method produces larger certified inner approximations of the region of attraction compared to classical Lyapunov approaches.
Conclusion: The formulation enables stability certification for a broad class of systems with learned policies by making certificates easier to construct, bridging classical control theory and modern learning-based methods.
Abstract: Establishing stability certificates for closed-loop systems under reinforcement learning (RL) policies is essential to move beyond empirical performance and offer guarantees of system behavior. Classical Lyapunov methods require a strict stepwise decrease in the Lyapunov function but such certificates are difficult to construct for learned policies. The RL value function is a natural candidate but it is not well understood how it can be adapted for this purpose. To gain intuition, we first study the linear quadratic regulator (LQR) problem and make two key observations. First, a Lyapunov function can be obtained from the value function of an LQR policy by augmenting it with a residual term related to the system dynamics and stage cost. Second, the classical Lyapunov decrease requirement can be relaxed to a generalized Lyapunov condition requiring only decrease on average over multiple time steps. Using this intuition, we consider the nonlinear setting and formulate an approach to learn generalized Lyapunov functions by augmenting RL value functions with neural network residual terms. Our approach successfully certifies the stability of RL policies trained on Gymnasium and DeepMind Control benchmarks. We also extend our method to jointly train neural controllers and stability certificates using a multi-step Lyapunov loss, resulting in larger certified inner approximations of the region of attraction compared to the classical Lyapunov approach. Overall, our formulation enables stability certification for a broad class of systems with learned policies by making certificates easier to construct, thereby bridging classical control theory and modern learning-based methods.
[753] Are Time-Series Foundation Models Deployment-Ready? A Systematic Study of Adversarial Robustness Across Domains
Jiawen Zhang, Zhenwei Zhang, Shun Zheng, Xumeng Wen, Jia Li, Jiang Bian
Main category: cs.LG
TL;DR: TSFMs are vulnerable to adversarial attacks; current models are alarmingly brittle to small perturbations that can cause specific forecast failures; adversarial fine-tuning offers cost-effective robustness gains.
Details
Motivation: As Time-Series Foundation Models transition to critical decision-making systems, their fragility under adversarial attacks poses severe risks in high-stakes environments vulnerable to manipulation. Robustness is a prerequisite for trustworthy deployment, not just a secondary metric.Method: Developed a systematic evaluation framework tailored to time series constraints, incorporating normalized sparsity-aware perturbation budgets and unified scale-invariant metrics across white-box and black-box settings. Evaluated six representative TSFMs and tested adversarial fine-tuning.
Result: Current TSFM architectures are alarmingly brittle - even small perturbations can reliably steer forecasts toward specific failure modes (trend flips, malicious drifts). Found TSFM-specific vulnerability patterns: horizon-proximal brittleness, increased susceptibility with longer context windows, and weak cross-model transfer indicating model-specific failure modes.
Conclusion: Adversarial robustness is essential for trustworthy TSFM deployment. Simple adversarial fine-tuning offers cost-effective path to substantial robustness gains, even with out-of-domain data. This work bridges the gap between TSFM capabilities and safety constraints for hardening next-generation forecasting systems.
Abstract: Time-Series Foundation Models (TSFMs) are rapidly transitioning from research prototypes to core components of critical decision-making systems, driven by their impressive zero-shot forecasting capabilities. However, as their deployment surges, a critical blind spot remains: their fragility under adversarial attacks. This lack of scrutiny poses severe risks, particularly as TSFMs enter high-stakes environments vulnerable to manipulation. We present a systematic, diagnostic study arguing that for TSFMs, robustness is not merely a secondary metric but a prerequisite for trustworthy deployment comparable to accuracy. Our evaluation framework, explicitly tailored to the unique constraints of time series, incorporates normalized, sparsity-aware perturbation budgets and unified scale-invariant metrics across white-box and black-box settings. Across six representative TSFMs, we demonstrate that current architectures are alarmingly brittle: even small perturbations can reliably steer forecasts toward specific failure modes, such as trend flips and malicious drifts. We uncover TSFM-specific vulnerability patterns, including horizon-proximal brittleness, increased susceptibility with longer context windows, and weak cross-model transfer that points to model-specific failure modes rather than generic distortions. Finally, we show that simple adversarial fine-tuning offers a cost-effective path to substantial robustness gains, even with out-of-domain data. This work bridges the gap between TSFM capabilities and safety constraints, offering essential guidance for hardening the next generation of forecasting systems.
[754] Learning where to learn: Training data distribution optimization for scientific machine learning
Nicolas Guerra, Nicholas H. Nelsen, Yunan Yang
Main category: cs.LG
TL;DR: A framework for optimizing training data distribution to minimize prediction error across deployment regimes in scientific machine learning.
Details
Motivation: Scientific ML models often face distribution shift when deployed with different parameters/conditions than training. Need principled approach to design training data that ensures robustness across deployment scenarios.Method: Theoretical analysis of how training distribution affects deployment accuracy, leading to two adaptive algorithms: bilevel optimization and alternating optimization in probability measure space. Implemented via parametric distribution classes or nonparametric particle-based gradient flows.
Result: Optimized training distributions outperform nonadaptive designs. Resulting models show improved sample complexity and robustness to distribution shift.
Conclusion: Framework enables principled data acquisition for learning functions and PDE solution operators, addressing the “learning-where-to-learn” problem in scientific ML.
Abstract: In scientific machine learning, models are routinely deployed with parameter values or boundary conditions far from those used in training. This paper studies the learning-where-to-learn problem of designing a training data distribution that minimizes average prediction error across a family of deployment regimes. A theoretical analysis shows how the training distribution shapes deployment accuracy. This motivates two adaptive algorithms based on bilevel or alternating optimization in the space of probability measures. Discretized implementations using parametric distribution classes or nonparametric particle-based gradient flows deliver optimized training distributions that outperform nonadaptive designs. Once trained, the resulting models exhibit improved sample complexity and robustness to distribution shift. This framework unlocks the potential of principled data acquisition for learning functions and solution operators of partial differential equations.
[755] DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment
Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee, Yeonhong Park
Main category: cs.LG
TL;DR: DP-LLM dynamically adjusts layer precision during decoding based on input sensitivity, achieving better performance-latency trade-offs for on-device LLMs.
Details
Motivation: On-device LLMs need to handle varying runtime constraints (latency/accuracy), but existing multi-scale quantization approaches lack dynamic adaptation to changing layer sensitivity during decoding.Method: DP-LLM dynamically assigns precision to each layer based on input values, leveraging the observation that layer sensitivity changes across decoding steps, building on multi-scale quantization foundations.
Result: Experimental results across multiple models and benchmarks show DP-LLM achieves superior performance-latency trade-off compared to prior approaches.
Conclusion: Dynamic precision assignment based on input sensitivity during decoding enables more effective runtime adaptation for on-device LLMs, outperforming static mixed-precision methods.
Abstract: How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question still remains open-ended: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding steps. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
[756] Stepsize anything: A unified learning rate schedule for budgeted-iteration training
Anda Tang, Yiming Dong, Yutao Zeng, zhou Xun, Zhouchen Lin
Main category: cs.LG
TL;DR: UBA schedule is a theoretically grounded learning rate schedule that outperforms common schedules across diverse architectures and tasks under constrained training budgets, with a single hyperparameter providing flexibility-simplicity trade-off.
Details
Motivation: Current learning rate schedules are heuristic and lack theoretical foundations, requiring extensive trial-and-error selection. With expanding computational costs and limited resources, there's critical need for budgeted-iteration training that achieves optimal learning within predetermined iteration budgets.Method: Proposes Unified Budget-Aware (UBA) schedule derived from a novel training budget-aware optimization framework that accounts for robustness to landscape curvature variations. Controlled by single hyperparameter φ providing trade-off between flexibility and simplicity, eliminating per-network numerical optimization.
Result: UBA consistently surpasses commonly-used schedules across diverse vision and language tasks, spanning network architectures (ResNet, OLMo) and scales under different training-iteration budgets. Theoretical connection established between φ and condition number, with convergence proofs for different φ values.
Conclusion: UBA provides a theoretically grounded, practical learning rate schedule that outperforms existing heuristic approaches in budgeted-iteration scenarios, offering both theoretical justification and empirical effectiveness across diverse deep learning applications.
Abstract: The expanding computational costs and limited resources underscore the critical need for budgeted-iteration training, which aims to achieve optimal learning within predetermined iteration budgets. While learning rate schedules fundamentally govern the performance of different networks and tasks, particularly in budgeted-iteration scenarios, their design remains largely heuristic, lacking theoretical foundations. In addition, the optimal learning rate schedule requires extensive trial-and-error selection, making the training process inefficient. In this work, we propose the Unified Budget-Aware (UBA) schedule, a theoretically grounded learning rate schedule that consistently outperforms commonly-used schedules among diverse architectures and tasks under different constrained training budgets. First, we bridge the gap by constructing a novel training budget-aware optimization framework, which explicitly accounts for the robustness to landscape curvature variations. From this framework, we derive the UBA schedule, controlled by a single hyper-parameter \varphi that provides a trade-off between flexibility and simplicity, eliminating the need for per-network numerical optimization. Moreover, we establish a theoretical connection between \varphi and the condition number, adding interpretation and justification to our approach. Besides, we prove the convergence for different values of \varphi. We offer practical guidelines for its selection via theoretical analysis and empirical results. Extensive experimental results show that UBA consistently surpasses the commonly-used schedules across diverse vision and language tasks, spanning network architectures (e.g., ResNet, OLMo) and scales, under different training-iteration budgets.
[757] Architecture-Aware Generalization Bounds for Temporal Networks: Theory and Fair Comparison Methodology
Barak Gahtan, Alex M. Bronstein
Main category: cs.LG
TL;DR: The paper introduces a fair-comparison methodology for temporal models, reveals surprising empirical findings about temporal dependence, and develops the first architecture-aware generalization bounds for deep temporal models on dependent sequences.
Details
Motivation: Theoretical understanding of deep temporal architectures' generalization on sequential data remains limited, despite their strong predictive performance. There's a need for proper evaluation methods and theoretical frameworks that account for temporal dependence.Method: 1) Introduces evaluation protocols fixing effective sample size to isolate temporal structure effects; 2) Empirical analysis using this method; 3) Develops architecture-aware generalization bounds using a novel blocking scheme that partitions samples into quasi-independent blocks.
Result: 1) Strongly dependent sequences (ρ=0.8) show ~76% smaller generalization gaps than weakly dependent ones (ρ=0.2); 2) Observed convergence rates (N_eff^{-1.21} to N_eff^{-0.89}) exceed theoretical worst-case predictions (N^{-0.5}); 3) First architecture-aware bounds with √D depth scaling and product of layer-wise norms, avoiding exponential dependence.
Conclusion: The paper challenges conventional views about temporal dependence, shows temporal architectures exploit problem structure beyond current theory, and provides worst-case baselines that highlight where future theory must improve, while proving learnability and identifying architectural scaling laws.
Abstract: Deep temporal architectures such as TCNs achieve strong predictive performance on sequential data, yet theoretical understanding of their generalization remains limited. We address this gap through three contributions: introducing an evaluation methodology for temporal models, revealing surprising empirical phenomena about temporal dependence, and the first architecture-aware theoretical framework for dependent sequences. Fair-Comparison Methodology. We introduce evaluation protocols that fix effective sample size $N_{\text{eff}}$ to isolate temporal structure effects from information content. Empirical Findings. Applying this method reveals that under $N_{\text{eff}} = 2000$, strongly dependent sequences ($ρ= 0.8$) exhibit approx’ $76%$ smaller generalization gaps than weakly dependent ones ($ρ= 0.2$), challenging the conventional view that dependence universally impedes learning. However, observed convergence rates ($N_{\text{eff}}^{-1.21}$ to $N_{\text{eff}}^{-0.89}$) significantly exceed theoretical worst-case predictions ($N^{-0.5}$), revealing that temporal architectures exploit problem structure in ways current theory does not capture. Lastly, we develop the first architecture-aware generalization bounds for deep temporal models on exponentially $β$-mixing sequences. By embedding Golowich et al.’s i.i.d. class bound within a novel blocking scheme that partitions $N$ samples into approx’ $B \approx N/\log N$ quasi-independent blocks, we establish polynomial sample complexity under convex Lipschitz losses. The framework achieves $\sqrt{D}$ depth scaling alongside the product of layer-wise norms $R = \prod_{\ell=1}^{D} M^{(\ell)}$, avoiding exponential dependence. While these bounds are conservative, they prove learnability and identify architectural scaling laws, providing worst-case baselines that highlight where future theory must improve.
[758] Safety-Aware Reinforcement Learning for Control via Risk-Sensitive Action-Value Iteration and Quantile Regression
Clinton Enwerem, Aniruddh G. Puranic, John S. Baras, Calin Belta
Main category: cs.LG
TL;DR: Proposed risk-regularized quantile-based RL algorithm with CVaR integration for safety constraints, providing theoretical guarantees and improved safety-performance trade-offs in dynamic environments.
Details
Motivation: Mainstream RL algorithms suffer from overestimation bias in stochastic environments, and existing quantile-based methods don't adequately address safety constraints without complex architectures or manual tradeoffs.Method: Risk-regularized quantile-based algorithm integrating Conditional Value-at-Risk (CVaR) to enforce safety constraints without complex neural architectures, with theoretical analysis of contraction properties in Wasserstein space.
Result: Theoretical guarantees on contraction properties of risk-sensitive distributional Bellman operator ensuring convergence to unique cost distribution. Simulations show more goal successes, fewer collisions, and better safety-performance trade-offs than risk-neutral methods.
Conclusion: The proposed approach effectively addresses safety constraints in RL through CVaR regularization while maintaining theoretical convergence properties, demonstrating practical improvements in safety-critical applications like mobile robot navigation.
Abstract: Mainstream approximate action-value iteration reinforcement learning (RL) algorithms suffer from overestimation bias, leading to suboptimal policies in high-variance stochastic environments. Quantile-based action-value iteration methods reduce this bias by learning a distribution of the expected cost-to-go using quantile regression. However, ensuring that the learned policy satisfies safety constraints remains a challenge when these constraints are not explicitly integrated into the RL framework. Existing methods often require complex neural architectures or manual tradeoffs due to combined cost functions. To address this, we propose a risk-regularized quantile-based algorithm integrating Conditional Value-at-Risk (CVaR) to enforce safety without complex architectures. We also provide theoretical guarantees on the contraction properties of the risk-sensitive distributional Bellman operator in Wasserstein space, ensuring convergence to a unique cost distribution. Simulations of a mobile robot in a dynamic reach-avoid task show that our approach leads to more goal successes, fewer collisions, and better safety-performance trade-offs than risk-neutral methods.
[759] Efficient Approximate Posterior Sampling with Annealed Langevin Monte Carlo
Advait Parulekar, Litu Rout, Karthikeyan Shanmugam, Sanjay Shakkottai
Main category: cs.LG
TL;DR: The paper addresses posterior sampling in score-based generative models, showing that while exact posterior sampling is computationally intractable in general, one can tractably sample from distributions that approximate both the noised prior posterior (in KL divergence) and true posterior (in Fisher divergence).
Details
Motivation: Despite empirical success of algorithms for tasks like image super-resolution and reconstruction using score-based models, prior work shows exact posterior sampling is computationally intractable under standard hardness assumptions. The paper aims to provide formal guarantees for approximate posterior sampling in polynomial time.Method: The paper frames posterior sampling as a “tilting” problem of biasing a distribution toward measurements. Under minimal assumptions, it develops methods to sample from distributions that are simultaneously close to the posterior of a noised prior in KL divergence and the true posterior in Fisher divergence.
Result: The paper provides the first formal results showing that approximate posterior sampling can be achieved in polynomial time, with guarantees that samples are consistent with both measurements and prior distributions through the combination of KL and Fisher divergence bounds.
Conclusion: While exact posterior sampling is computationally intractable, the paper demonstrates that practical approximate posterior sampling with formal guarantees is possible, explaining the empirical success of popular algorithms for tasks like image super-resolution and reconstruction.
Abstract: We study the problem of posterior sampling in the context of score based generative models. We have a trained score network for a prior $p(x)$, a measurement model $p(y|x)$, and are tasked with sampling from the posterior $p(x|y)$. Prior work has shown this to be intractable in KL (in the worst case) under well-accepted computational hardness assumptions. Despite this, popular algorithms for tasks such as image super-resolution, stylization, and reconstruction enjoy empirical success. Rather than establishing distributional assumptions or restricted settings under which exact posterior sampling is tractable, we view this as a more general “tilting” problem of biasing a distribution towards a measurement. Under minimal assumptions, we show that one can tractably sample from a distribution that is simultaneously close to the posterior of a noised prior in KL divergence and the true posterior in Fisher divergence. Intuitively, this combination ensures that the resulting sample is consistent with both the measurement and the prior. To the best of our knowledge these are the first formal results for (approximate) posterior sampling in polynomial time.
[760] Semi-supervised Graph Anomaly Detection via Robust Homophily Learning
Guoguo Ai, Hezhe Qiao, Hui Yan, Guansong Pang
Main category: cs.LG
TL;DR: RHO proposes adaptive homophily learning for semi-supervised graph anomaly detection, addressing limitations of assuming uniform homophily among normal nodes.
Details
Motivation: Current methods assume normal nodes share similar homophily and labeled nodes represent all homophily patterns, but real-world GAD datasets show normal nodes exhibit diverse homophily that is often under-represented in small labeled sets.Method: RHO uses two novel modules: 1) AdaFreq (adaptive frequency response filters) that learn adaptive spectral filters capturing different frequency components of labeled normal nodes with varying homophily in channel-wise and cross-channel views, and 2) GNA (graph normality alignment) that enforces consistency between channel-wise and cross-channel homophily representations to robustify learned normality.
Result: Experiments on eight real-world GAD datasets show RHO effectively learns varying, often under-represented homophily in small normal node sets and substantially outperforms state-of-the-art competing methods.
Conclusion: RHO provides a robust approach to semi-supervised graph anomaly detection by adaptively learning diverse homophily patterns in normal nodes, overcoming limitations of existing methods that assume uniform homophily.
Abstract: Semi-supervised graph anomaly detection (GAD) utilizes a small set of labeled normal nodes to identify abnormal nodes from a large set of unlabeled nodes in a graph. Current methods in this line posit that 1) normal nodes share a similar level of homophily and 2) the labeled normal nodes can well represent the homophily patterns in the normal class. However, this assumption often does not hold well since normal nodes in a graph can exhibit diverse homophily in real-world GAD datasets. In this paper, we propose RHO, namely Robust Homophily Learning, to adaptively learn such homophily patterns. RHO consists of two novel modules, adaptive frequency response filters (AdaFreq) and graph normality alignment (GNA). AdaFreq learns a set of adaptive spectral filters that capture different frequency components of the labeled normal nodes with varying homophily in the channel-wise and cross-channel views of node attributes. GNA is introduced to enforce consistency between the channel-wise and cross-channel homophily representations to robustify the normality learned by the filters in the two views. Experiments on eight real-world GAD datasets show that RHO can effectively learn varying, often under-represented, homophily in the small normal node set and substantially outperforms state-of-the-art competing methods. Code is available at https://github.com/mala-lab/RHO.
[761] Learning to Select MCP Algorithms: From Traditional ML to Dual-Channel GAT-MLP
Xiang Li, Shanshan Wang, Chenglong Xiao
Main category: cs.LG
TL;DR: A learning-based framework using GAT-MLP (Graph Attention Network + MLP) for instance-aware algorithm selection in Maximum Clique Problem, achieving 90.43% accuracy in choosing optimal solver.
Details
Motivation: No single MCP algorithm consistently outperforms others across diverse graph instances, highlighting the need for instance-aware algorithm selection which remains unexplored for MCP.Method: Construct benchmark dataset using 4 state-of-the-art MCP solvers on diverse graphs, extract structural features. Develop GAT-MLP dual-channel model combining GAT for local graph structure and MLP for global features.
Result: GAT-MLP achieves superior performance with 90.43% accuracy in choosing optimal solver, significantly outperforming all baseline methods including Random Forest. Connectivity and topological features identified as key predictors.
Conclusion: Dual-channel architecture combining graph neural networks with traditional ML is effective for combinatorial algorithm selection, demonstrating promise of GNNs for instance-aware MCP solver selection.
Abstract: The Maximum Clique Problem (MCP) is a foundational NP-hard problem with wide-ranging applications, yet no single algorithm consistently outperforms all others across diverse graph instances. This underscores the critical need for instance-aware algorithm selection, a domain that remains largely unexplored for the MCP. To address this gap, we propose a novel learning-based framework that integrates both traditional machine learning and graph neural networks. We first construct a benchmark dataset by executing four state-of-the-art exact MCP solvers on a diverse collection of graphs and extracting their structural features. An evaluation of conventional classifiers establishes Random Forest as a strong baseline and reveals that connectivity and topological features are key predictors of performance. Building on these insights, we develop GAT-MLP, a dual-channel model that combines a Graph Attention Network (GAT) to encode local graph structure with a Multilayer Perceptron (MLP) to model global features. Extensive experiments demonstrate that GAT-MLP achieves superior and consistent performance, significantly outperforming all baseline methods. Our results highlight the effectiveness of the dual-channel architecture and the promise of graph neural networks for combinatorial algorithm selection, achieving 90.43% accuracy in choosing the optimal solver. Code and models are available at: https://anonymous.4open.science/r/GAT-MLP-7E5F.
[762] Emergent Granger Causality in Neural Networks: Can Prediction Alone Reveal Structure?
Malik Shahid Sultan, Hernando Ombao, Maurizio Filippone
Main category: cs.LG
TL;DR: A novel deep learning approach for Granger Causality discovery that uses joint modeling of all time series components with proper regularization, comparing model uncertainty/residual distribution when specific components are dropped, without requiring explicit variable selection terms in the loss function.
Details
Motivation: Traditional VAR models are limited to linear associations, while existing deep learning methods treat GC as variable selection problem. There's a need for a more integrated approach that leverages neural networks' functional approximation power while maintaining interpretability for causal discovery.Method: Proposes joint modeling of all multivariate time series components with a single neural network using proper regularization. Uncovers GC structure by comparing model uncertainty/residual distribution when specific time series components are dropped vs. when all components are used. Also examines input layer dropout effects on GC learning.
Result: Well-regularized deep learning models can learn true GC structure without explicit variable selection terms in loss function. Simple joint models serve as strong baselines compared to sparse regression, requiring fewer hyperparameters. CNN, LSTM, and transformer architectures are compared for GC discovery capability.
Conclusion: Joint modeling with proper regularization offers an effective paradigm for GC discovery that leverages neural networks’ approximation power while avoiding complex variable selection frameworks, providing a simpler yet powerful alternative to traditional sparse regression approaches.
Abstract: Granger Causality (GC) offers an elegant statistical framework to study the association between multivariate time series data. Vector autoregressive models (VAR) are simple and easy to fit, but have limited application because of their inherent inability to capture more complex (e.g., non-linear) associations. Numerous attempts have already been made in the literature that exploit the functional approximation power of deep neural networks (DNNs) for GC. However, these methods treat GC as a variable selection problem. We present a novel paradigm for investigating the learned GC from a single neural network used for joint modeling of all components of multivariate time series data, which is essentially linked with prediction and assessing the distribution shift in residuals. A deep learning model, with proper regularization, may learn the true GC structure when jointly used for all components of the time series when there is sufficient training data. We propose to uncover the learned GC structure by comparing the model uncertainty or distribution of the residuals when the past of everything is used as compared to the one where a specific time series component is dropped from the model. We also compare the effect of input layer dropout on the ability of a neural network to learn GC. We show that a well-regularized model can learn the true GC structure from the data without explicitly adding terms in the loss function that guide the model to select variables or perform sparse regression under specific settings. We also provide a comparison of deep learning architectures such as CNN, LSTM and transformer models on their ability to discover Granger Causality. The numerical experiments demonstrate that, compared to sparse regression models, a simple joint model is a strong baseline for learning the true GC which has the advantage that it does not require tuning of many extra hyper-parameters.
[763] Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent
Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi
Main category: cs.LG
TL;DR: Transformers can learn multi-step reasoning via chain-of-thought for tree path-finding tasks, with theoretical guarantees on generalization to unseen trees through specialized attention heads.
Details
Motivation: Despite transformers' success in multi-step reasoning, there's limited theoretical understanding of how they acquire these abilities through training, particularly for chain-of-thought processes.Method: Analyze transformers learning symbolic reasoning on tree path-finding: backward reasoning (goal-to-root) and forward reasoning (two-stage: goal-to-root then reverse to root-to-goal). Use theoretical analysis grounded in gradient descent dynamics of one-layer transformers.
Result: Trained one-layer transformers provably solve both tasks with generalization guarantees to unseen trees. Multi-phase training dynamics show attention heads specialize and coordinate autonomously to solve subtasks in a single autoregressive path.
Conclusion: Provides mechanistic explanation of how transformers implement sequential algorithms via chain-of-thought. Shows shallow multi-head transformers can solve complex problems when tasks are structured with intermediate reasoning steps, potentially avoiding need for deeper architectures.
Abstract: Transformers have demonstrated remarkable capabilities in multi-step reasoning tasks. However, understandings of the underlying mechanisms by which they acquire these abilities through training remain limited, particularly from a theoretical standpoint. This work investigates how transformers learn to solve symbolic multi-step reasoning problems through chain-of-thought processes, focusing on path-finding in trees. We analyze two intertwined tasks: a backward reasoning task, where the model outputs a path from a goal node to the root, and a more complex forward reasoning task, where the model implements two-stage reasoning by first identifying the goal-to-root path and then reversing it to produce the root-to-goal path. Our theoretical analysis, grounded in the dynamics of gradient descent, shows that trained one-layer transformers can provably solve both tasks with generalization guarantees to unseen trees. In particular, our multi-phase training dynamics for forward reasoning elucidate how different attention heads learn to specialize and coordinate autonomously to solve the two subtasks in a single autoregressive path. These results provide a mechanistic explanation of how trained transformers can implement sequential algorithmic procedures. Moreover, they offer insights into the emergence of reasoning abilities, suggesting that when tasks are structured to take intermediate chain-of-thought steps, even shallow multi-head transformers can effectively solve problems that would otherwise require deeper architectures.
[764] A Simple Approximate Bayesian Inference Neural Surrogate for Stochastic Petri Net Models
Bright Kwaku Manu, Trevor Reckell, Beckett Sterner, Petar Jevtic
Main category: cs.LG
TL;DR: Neural network surrogate framework for parameter estimation in Stochastic Petri Nets with missing data, achieving accurate and fast inference without explicit likelihoods.
Details
Motivation: Parameter estimation for Stochastic Petri Nets is challenging, especially when transition rates depend on external covariates and explicit likelihoods are unavailable. Traditional methods struggle with partially observed data and computational complexity.Method: A neural-surrogate framework using a lightweight 1D Convolutional Residual Network trained end-to-end on Gillespie-simulated SPN trajectories. The model learns to invert system dynamics under realistic conditions of event dropout, with Monte Carlo dropout providing uncertainty estimates during inference.
Result: On synthetic SPNs with 10% missing events, the surrogate recovers rate-function coefficients with RMSE = 0.043 and runs substantially faster than traditional Bayesian approaches.
Conclusion: Data-driven, likelihood-free neural surrogates enable accurate, robust, and real-time parameter recovery in complex, partially observed discrete-event systems, offering a practical solution for SPN parameter estimation.
Abstract: Stochastic Petri Nets (SPNs) are an increasingly popular tool of choice for modeling discrete-event dynamics in areas such as epidemiology and systems biology, yet their parameter estimation remains challenging in general and in particular when transition rates depend on external covariates and explicit likelihoods are unavailable. We introduce a neural-surrogate (neural-network-based approximation of the posterior distribution) framework that predicts the coefficients of known covariate-dependent rate functions directly from noisy, partially observed token trajectories. Our model employs a lightweight 1D Convolutional Residual Network trained end-to-end on Gillespie-simulated SPN realizations, learning to invert system dynamics under realistic conditions of event dropout. During inference, Monte Carlo dropout provides calibrated uncertainty bounds together with point estimates. On synthetic SPNs with $10%$ missing events, our surrogate recovers rate-function coefficients with an $RMSE = 0.043$ and substantially runs faster than traditional Bayesian approaches. These results demonstrate that data-driven, likelihood-free surrogates can enable accurate, robust, and real-time parameter recovery in complex, partially observed discrete-event systems.
[765] Look the Other Way: Designing ‘Positive’ Molecules with Negative Data via Task Arithmetic
Rıza Özçelik, Sarah de Ruiter, Francesca Grisoni
Main category: cs.LG
TL;DR: Molecular task arithmetic trains models on abundant negative examples to learn property directions without positive data, then moves models in opposite directions to generate positive molecules, outperforming models trained on positive data.
Details
Motivation: The scarcity of molecules with desirable properties (positive molecules) creates a bottleneck for generative molecule design. Traditional approaches struggle due to limited positive examples.Method: Proposes molecular task arithmetic: train models on diverse and abundant negative examples to learn ‘property directions’ without accessing positively labeled data, then move models in opposite property directions to generate positive molecules.
Result: Tested on 33 design experiments with distinct molecular entities (small molecules, proteins), model architectures, and scales. Generated more diverse and successful designs than models trained on positive molecules. Also effective in dual-objective and few-shot design tasks, maintaining desirable complex properties like good docking scores.
Conclusion: Molecular task arithmetic offers simplicity, data efficiency, and strong performance, potentially becoming the de facto transfer learning strategy for de novo molecule design by increasing design diversity while maintaining desirable properties.
Abstract: The scarcity of molecules with desirable properties (i.e., `positive’ molecules) is an inherent bottleneck for generative molecule design. To sidestep such obstacle, here we propose molecular task arithmetic: training a model on diverse and abundant negative examples to learn ‘property directions’ - without accessing any positively labeled data - and moving models in the opposite property directions to generate positive molecules. When analyzed on 33 design experiments with distinct molecular entities (small molecules, proteins), model architectures, and scales, molecular task arithmetic generated more diverse and successful designs than models trained on positive molecules in general. Moreover, we employed molecular task arithmetic in dual-objective and few-shot design tasks. We find that molecular task arithmetic can consistently increase the diversity of designs while maintaining desirable complex design properties, such as good docking scores to a protein. With its simplicity, data efficiency, and performance, molecular task arithmetic bears the potential to become the de facto transfer learning strategy for de novo molecule design.
[766] Geometric Multi-color Message Passing Graph Neural Networks for Blood-brain Barrier Permeability Prediction
Trung Nguyen, Md Masud Rana, Farjana Tasnim Mukta, Chang-Guo Zhan, Duc Duy Nguyen
Main category: cs.LG
TL;DR: GMC-MPNN, a geometric graph neural network incorporating 3D atomic geometry and long-range interactions, achieves state-of-the-art BBB permeability prediction performance.
Details
Motivation: Existing GNNs for BBB permeability prediction rely mainly on molecular topology and neglect crucial 3D geometric information needed to model transport mechanisms, limiting their accuracy for CNS drug development.Method: Developed GMC-MPNN framework that enhances standard message-passing architectures by incorporating atomic-level geometric features and long-range interactions through weighted colored subgraphs based on atom types to capture spatial relationships and chemical context.
Result: Outperformed existing state-of-the-art models on three benchmark datasets with scaffold-based splitting, achieving AUC-ROC of 0.947/0.9212 for classification and RMSE of 0.5628 with Pearson correlation of 0.6947 for regression.
Conclusion: GMC-MPNN sets a new performance benchmark by integrating spatial geometry into graph representations, offering a more accurate and generalizable tool for drug discovery pipelines.
Abstract: Accurate prediction of blood-brain barrier permeability (BBBP) is essential for central nervous system (CNS) drug development. While graph neural networks (GNNs) have advanced molecular property prediction, they often rely on molecular topology and neglect the three-dimensional geometric information crucial for modeling transport mechanisms. This paper introduces the geometric multi-color message-passing graph neural network (GMC-MPNN), a novel framework that enhances standard message-passing architectures by explicitly incorporating atomic-level geometric features and long-range interactions. Our model constructs weighted colored subgraphs based on atom types to capture the spatial relationships and chemical context that govern BBB permeability. We evaluated GMC-MPNN on three benchmark datasets for both classification and regression tasks, using rigorous scaffold-based splitting to ensure a robust assessment of generalization. The results demonstrate that GMC-MPNN consistently outperforms existing state-of-the-art models, achieving superior performance in both classifying compounds as permeable/non-permeable (AUC-ROC of 0.947 and 0.9212) and in regressing continuous permeability values (RMSE of 0.5628, Pearson correlation of 0.6947). An ablation study further quantified the impact of specific atom-pair interactions, revealing that the model’s predictive power derives from its ability to learn from both common and rare, but chemically significant, functional motifs. By integrating spatial geometry into the graph representation, GMC-MPNN sets a new performance benchmark and offers a more accurate and generalizable tool for drug discovery pipelines.
[767] ONG: Orthogonal Natural Gradient Descent
Yajat Yadav, Patrick Mendoza, Jathin Korrapati
Main category: cs.LG
TL;DR: ONG combines natural gradients with orthogonal projections for continual learning, but preliminary results show naive combination has issues, motivating further research.
Details
Motivation: Standard OGD uses Euclidean projections that don't leverage information-geometric structure, potentially leading to suboptimal convergence in continual learning tasks.Method: ONG preconditions task-specific gradients with EKFAC approximation of inverse Fisher matrix (natural gradient), then projects these onto orthogonal complement of prior tasks’ natural gradients.
Result: Preliminary results on Permuted and Rotated MNIST benchmarks indicate that naive combination of natural gradients and orthogonal projections has potential issues.
Conclusion: The findings motivate continued work on robustly reconciling geometric perspectives, establishing rigorous theoretical foundations with convergence guarantees, and extending validation to large-scale benchmarks.
Abstract: Orthogonal Gradient Descent (OGD) has emerged as a powerful method for continual learning. However, its Euclidean projections do not leverage the underlying information-geometric structure of the problem, which can lead to suboptimal convergence in learning tasks. To address this, we propose incorporating the natural gradient into OGD and present \textbf{ONG (Orthogonal Natural Gradient Descent)}. ONG preconditions each new task-specific gradient with an efficient EKFAC approximation of the inverse Fisher information matrix, yielding updates that follow the steepest descent direction under a Riemannian metric. To preserve performance on previously learned tasks, ONG projects these natural gradients onto the orthogonal complement of prior tasks’ natural gradients. We provide an initial theoretical justification for this procedure, introduce the Orthogonal Natural Gradient Descent (ONG) algorithm, and present preliminary results on the Permuted and Rotated MNIST benchmarks. Our preliminary results, however, indicate that a naive combination of natural gradients and orthogonal projections has potential issues. This finding has motivated continued future work focused on robustly reconciling these geometric perspectives to develop a continual learning method, establishing a more rigorous theoretical foundation with formal convergence guarantees, and extending empirical validation to large-scale continual learning benchmarks.
[768] Density Operator Expectation Maximization
Adit Vishnu, Abhay Shastry, Dhruva Kashyap, Chiranjib Bhattacharyya
Main category: cs.LG
TL;DR: DO-EM framework enables Expectation-Maximization training for density operator latent variable models, outperforming probabilistic counterparts on generative tasks.
Details
Motivation: Density operators (quantum mechanics foundation) lack EM training framework due to non-commutativity, limiting generative capabilities compared to probabilistic models.Method: Developed DO-EM using relative entropy monotonicity inequality as evidence lower bound, with Expectation step as Petz recovery map. Applied to Quantum RBMs with Contrastive Divergence approximation.
Result: New Quantum interleaved Deep Boltzmann Machines and Quantum Gaussian-Bernoulli RBMs outperform probabilistic counterparts on generative tasks with similar computational resources.
Conclusion: DO-EM provides general EM framework for density operator latent variable models, enabling quantum-inspired models to surpass classical probabilistic models in generative performance.
Abstract: Machine learning with density operators, the mathematical foundation of quantum mechanics, is gaining prominence with rapid advances in quantum computing. Generative models based on density operators cannot yet handle tasks that are routinely handled by probabilistic models. The progress of latent variable models, a broad and influential class of probabilistic unsupervised models, was driven by the Expectation-Maximization framework. Deriving such a framework for density operators is challenging due to the non-commutativity of operators. To tackle this challenge, an inequality arising from the monotonicity of relative entropy is demonstrated to serve as an evidence lower bound for density operators. A minorant-maximization perspective on this bound leads to Density Operator Expectation Maximization (DO-EM), a general framework for training latent variable models defined through density operators. Through an information-geometric argument, the Expectation step in DO-EM is shown to be the Petz recovery map. The DO-EM algorithm is applied to Quantum Restricted Boltzmann Machines, adapting Contrastive Divergence to approximate the Maximization step gradient. Quantum interleaved Deep Boltzmann Machines and Quantum Gaussian-Bernoulli Restricted Boltzmann Machines, new models introduced in this work, outperform their probabilistic counterparts on generative tasks when trained with similar computational resources and identical hyperparameters.
[769] MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning
Weihai Zhi, Jiayan Guo, Shangyang Li
Main category: cs.LG
TL;DR: MedGR² introduces a self-improving framework that co-develops a data generator and reward model to create high-quality medical data, enabling better supervised fine-tuning and reinforcement learning for medical vision-language models.
Details
Motivation: Vision-Language Models in medicine face critical limitations due to scarcity of expert-annotated data. Supervised Fine-Tuning suffers from poor generalization on unseen modalities/tasks, while Reinforcement Learning lacks reliable reward signals in data-scarce medical domains.Method: MedGR² creates a virtuous cycle by co-developing a data generator and reward model. This enables automated, continuous creation of high-quality multi-modal medical data. The framework uses this generated data for both Supervised Fine-Training and Reinforcement Learning via Group Relative Policy Optimization (GRPO).
Result: SFT with MedGR²-produced data surpasses baselines trained on large-scale human-curated datasets. RL with GRPO using this data achieves state-of-the-art cross-modality and cross-task generalization, outperforming specialized RL-based methods. The compact model achieves performance competitive with foundation models having 10x more parameters.
Conclusion: MedGR² presents a new paradigm for data-efficient learning in high-stakes domains, transforming the problem from data scarcity to data generation and unlocking RL’s full potential for building truly generalizable medical AI.
Abstract: The application of Vision-Language Models (VLMs) in medicine is critically hampered by the scarcity of high-quality, expert-annotated data. Supervised Fine-Tuning (SFT) on existing datasets often leads to poor generalization on unseen modalities and tasks, while Reinforcement Learning (RL), a promising alternative, is stymied by the lack of reliable reward signals in this data-scarce domain. To break this impasse, we introduce Generative Reward Learning for Medical Reasoning (MedGR$^2$), a novel framework that creates a self-improving virtuous cycle. MedGR$^2$ co-develops a data generator and a reward model, enabling the automated, continuous creation of high-quality, multi-modal medical data that serves as both a superior training source for SFT and RL. Our experiments demonstrate that SFT with MedGR$^2$-produced data already surpasses baselines trained on large-scale, human-curated datasets. Crucially, when leveraging this data for RL via Group Relative Policy Optimization (GRPO), our model achieves state-of-the-art cross-modality and cross-task generalization, significantly outperforming specialized RL-based methods. Furthermore, our compact model, empowered by MedGR$^2$, achieves performance competitive with foundation models possessing over 10 times more parameters. MedGR$^2$ presents a new paradigm for data-efficient learning in high-stakes domains, transforming the problem from data scarcity to data generation and unlocking the full potential of RL for building truly generalizable medical AI.
[770] Instruction-based Time Series Editing
Jiaxing Qiu, Dongliang Guo, Brynne Sullivan, Teague R. Henry, Thomas Hartvigsen
Main category: cs.LG
TL;DR: InstructTime: A novel instruction-based time series editor that uses natural language instructions to edit time series, enabling flexible control over edit strength and generalization to unseen instructions.
Details
Motivation: Existing diffusion-based time series editors rely on rigid predefined attribute vectors and produce all-or-nothing edits through sampling, limiting flexibility in condition format and lacking customizable control over editing strength.Method: Introduces InstructTime, which takes time series and natural language instructions, embeds them into a shared multi-modal representation space, then decodes to generate edited time series. Uses multi-resolution encoders to handle both local and global edits, and learns structured representation space for interpolation-based strength control.
Result: InstructTime achieves state-of-the-art performance: high-quality edits with controllable strength, generalization to unseen instructions, and easy adaptation to unseen conditions through few-shot learning on both synthetic and real datasets.
Conclusion: Instruction-based time series editing with InstructTime overcomes limitations of existing methods by providing flexible natural language control, customizable edit strength, and strong generalization capabilities.
Abstract: In time series editing, we aim to modify some properties of a given time series without altering others. For example, when analyzing a hospital patient’s blood pressure, we may add a sudden early drop and observe how it impacts their future while preserving other conditions. Existing diffusion-based editors rely on rigid, predefined attribute vectors as conditions and produce all-or-nothing edits through sampling. This attribute- and sampling-based approach limits flexibility in condition format and lacks customizable control over editing strength. To overcome these limitations, we introduce Instruction-based Time Series Editing, where users specify intended edits using natural language. This allows users to express a wider range of edits in a more accessible format. We then introduce InstructTime, the first instruction-based time series editor. InstructTime takes in time series and instructions, embeds them into a shared multi-modal representation space, then decodes their embeddings to generate edited time series. By learning a structured multi-modal representation space, we can easily interpolate between embeddings to achieve varying degrees of edit. To handle local and global edits together, we propose multi-resolution encoders. In our experiments, we use synthetic and real datasets and find that InstructTime is a state-of-the-art time series editor: InstructTime achieves high-quality edits with controllable strength, can generalize to unseen instructions, and can be easily adapted to unseen conditions through few-shot learning.
[771] Generative Large-Scale Pre-trained Models for Automated Ad Bidding Optimization
Yu Lei, Jiayang Zhao, Yilei Zhao, Zhaoqi Zhang, Linyou Cai, Qianlong Xie, Xingxing Wang
Main category: cs.LG
TL;DR: GRAD is a generative foundation model for auto-bidding that combines mixture-of-experts for diverse action exploration with causal transformers for constraint-aware optimization, achieving significant revenue and ROI improvements on Meituan’s platform.
Details
Motivation: Modern auto-bidding systems need to balance performance with diverse advertiser goals and constraints. While conditional generative models offer promise over traditional MDP methods, they face challenges like distribution shift, limited action space exploration, and constraint satisfaction (CPM, ROI).Method: GRAD combines an Action-Mixture-of-Experts module for diverse bidding action exploration with a Value Estimator of Causal Transformer for constraint-aware optimization, creating a scalable foundation model for auto-bidding.
Result: Extensive offline and online experiments show GRAD significantly enhances platform revenue. Implementation at Meituan resulted in 2.18% increase in Gross Merchandise Value (GMV) and 10.68% increase in ROI.
Conclusion: GRAD effectively addresses the evolving and diverse requirements of modern advertisers by overcoming key challenges of generative auto-bidding methods through its mixture-of-experts and causal transformer architecture.
Abstract: Modern auto-bidding systems are required to balance overall performance with diverse advertiser goals and real-world constraints, reflecting the dynamic and evolving needs of the industry. Recent advances in conditional generative models, such as transformers and diffusers, have enabled direct trajectory generation tailored to advertiser preferences, offering a promising alternative to traditional Markov Decision Process-based methods. However, these generative methods face significant challenges, such as the distribution shift between offline and online environments, limited exploration of the action space, and the necessity to meet constraints like marginal Cost-per-Mille (CPM) and Return on Investment (ROI). To tackle these challenges, we propose GRAD (Generative Reward-driven Ad-bidding with Mixture-of-Experts), a scalable foundation model for auto-bidding that combines an Action-Mixture-of-Experts module for diverse bidding action exploration with the Value Estimator of Causal Transformer for constraint-aware optimization. Extensive offline and online experiments demonstrate that GRAD significantly enhances platform revenue, highlighting its effectiveness in addressing the evolving and diverse requirements of modern advertisers. Furthermore, GRAD has been implemented in multiple marketing scenarios at Meituan, one of the world’s largest online food delivery platforms, leading to a 2.18% increase in Gross Merchandise Value (GMV) and 10.68% increase in ROI.
[772] Jointly Computation- and Communication-Efficient Distributed Learning
Xiaoxing Ren, Nicola Bastianello, Karl H. Johansson, Thomas Parisini
Main category: cs.LG
TL;DR: Novel ADMM-based distributed learning algorithm that is both computation- and communication-efficient, using stochastic gradients, multiple local epochs, and compression, with proven linear convergence for strongly convex problems.
Details
Motivation: Address distributed learning over undirected networks with dual efficiency goals: reducing computational burden (via stochastic gradients) and communication overhead (via fewer rounds and compression).Method: ADMM-based algorithm with three key features: 1) stochastic gradients for local computation efficiency, 2) multiple training epochs between communication rounds to reduce communication frequency, and 3) compressed transmissions to reduce communication volume.
Result: Proves exact linear convergence in strongly convex setting. Numerical experiments on classification tasks show competitive performance compared to state-of-the-art techniques.
Conclusion: The proposed algorithm successfully achieves both computation and communication efficiency while maintaining theoretical convergence guarantees, making it suitable for practical distributed learning scenarios.
Abstract: We address distributed learning problems over undirected networks. Specifically, we focus on designing a novel ADMM-based algorithm that is jointly computation- and communication-efficient. Our design guarantees computational efficiency by allowing agents to use stochastic gradients during local training. Moreover, communication efficiency is achieved as follows: i) the agents perform multiple training epochs between communication rounds, and ii) compressed transmissions are used. We prove exact linear convergence of the algorithm in the strongly convex setting. We corroborate our theoretical results by numerical comparisons with state of the art techniques on a classification task.
[773] Conditionally adaptive augmented Lagrangian method for physics-informed learning of forward and inverse problems
Qifeng Hu, Shamsulhaq Basir, Inanc Senocak
Main category: cs.LG
TL;DR: Enhanced PECANN framework with multiple improvements: generalized ALM with multiple penalty parameters, constraint aggregation, Fourier feature mapping for oscillatory solutions, time-windowing for long-time evolution, and novel CAPU strategy for adaptive penalty updates.
Details
Motivation: To substantially improve the PECANN framework's capacity to solve challenging PDEs by broadening its applicability and improving efficiency for demanding scientific computing problems.Method: Five key enhancements: 1) Generalized ALM with multiple independent penalty parameters for heterogeneous constraints, 2) Constraint aggregation technique to address point-wise enforcement inefficiencies, 3) Single Fourier feature mapping for multi-scale oscillatory solutions, 4) Time-windowing strategy for long-time evolution, 5) Conditionally Adaptive Penalty Update (CAPU) strategy for accelerated Lagrange multiplier growth based on constraint violations.
Result: Demonstrated effectiveness across diverse problems including transonic rarefaction, reversible scalar advection by vortex, high-wavenumber Helmholtz and Poisson’s equations, and inverse heat source identification. Achieved competitive accuracy compared to established methods and recent Kolmogorov-Arnold network approaches.
Conclusion: Collective advances improve robustness, computational efficiency, and applicability of PECANN to demanding scientific computing problems, making it a more capable framework for solving challenging PDEs.
Abstract: We present several key advances to the Physics and Equality Constrained Artificial Neural Networks (PECANN) framework, substantially improving its capacity to solve challenging partial differential equations (PDEs). Our enhancements broaden the framework’s applicability and improve efficiency. First, we generalize the Augmented Lagrangian Method (ALM) to support multiple, independent penalty parameters for enforcing heterogeneous constraints. Second, we introduce a constraint aggregation technique to address inefficiencies associated with point-wise enforcement. Third, we incorporate a single Fourier feature mapping to capture highly oscillatory solutions with multi-scale features, where alternative methods often require multiple mappings or costlier architectures. Fourth, a novel time-windowing strategy enables seamless long-time evolution without relying on discrete time models. Fifth, and critically, we propose a conditionally adaptive penalty update (CAPU) strategy for ALM that accelerates the growth of Lagrange multipliers for constraints with larger violations, while enabling coordinated updates of multiple penalty parameters. CAPU accelerates the growth of Lagrange multipliers for selectively challenging constraints, enhancing constraint enforcement during training. We demonstrate the effectiveness of PECANN-CAPU across diverse problems, including the transonic rarefaction problem, reversible scalar advection by a vortex, high-wavenumber Helmholtz and Poisson’s equations, and inverse heat source identification. The framework achieves competitive accuracy across all cases when compared with established methods and recent approaches based on Kolmogorov-Arnold networks. Collectively, these advances improve the robustness, computational efficiency, and applicability of PECANN to demanding problems in scientific computing.
[774] JaGuard: Jamming Correction of GNSS Deviation with Deep Temporal Graphs
Ivana Kesić, Aljaž Blatnik, Carolina Fortuna, Blaž Bertalanič
Main category: cs.LG
TL;DR: JaGuard: A temporal graph neural network that mitigates GNSS jamming by modeling satellite-receiver relationships as dynamic graphs to estimate and correct positioning errors.
Details
Motivation: GNSS systems face increasing threats from intentional jamming, compromising reliability when accurate positioning and timing are most critical, necessitating robust interference mitigation solutions.Method: Formulates interference mitigation as dynamic graph regression using JaGuard, a receiver-centric temporal GNN. Represents satellite-receiver scenes as heterogeneous star graphs with time-varying satellite attributes (SNR, azimuth, elevation). Uses single-layer HeteroGCLSTM to fuse spatial context with temporal dynamics for 2D deviation estimation.
Result: Outperforms strong baselines (TSMixer, uniform CNN, Seq2Point) in all conditions. Achieves 3.64-7.74 cm MAE under severe jamming (-45 dBm), improving to 1.59-1.90 cm for -60 to -70 dBm. Attains 3.78-4.25 cm MAE on mixed-mode datasets. With only 10% training data, reaches ~20 cm MAE vs 36-42 cm for baselines.
Conclusion: JaGuard effectively mitigates GNSS jamming through graph-based spatiotemporal modeling, demonstrating superior performance across various jamming types, power levels, and data conditions compared to traditional multivariate approaches.
Abstract: Global Navigation Satellite Systems (GNSS) are increasingly exposed to intentional jamming, threatening reliability when accurate positioning and timing are most critical. We address this problem by formulating interference mitigation as a dynamic graph regression task and propose JaGuard, a receiver-centric temporal graph neural network that estimates and corrects latitude and longitude errors. At each 1 Hz epoch, the satellite-receiver scene is represented as a heterogeneous star graph with time-varying satellite attributes such as SNR, azimuth and elevation. A single-layer HeteroGCLSTM fuses one-hop spatial context with short-term temporal dynamics to produce a 2D deviation estimate. We evaluate JaGuard on data collected from two commercial receivers under controlled conducted jamming using three jammer types (CW, 3xCW, FM) and six power levels from -45 to -70 dBm, each repeated 50 times across pre-jam, jam, and recovery phases. JaGuard outperforms strong multivariate baselines (TSMixer, uniform CNN, Seq2Point) in all conditions. Under severe jamming at -45 dBm, it achieves 3.64-7.74 cm MAE, improving to 1.59-1.90 cm for -60 to -70 dBm. On mixed-mode datasets, it attains 3.78 cm MAE on GP01 and 4.25 cm on U-blox 10. With only 10 percent of the training data, JaGuard remains ahead, reaching about 20 cm MAE compared to 36-42 cm for the baselines.
[775] Evidential Physics-Informed Neural Networks for Scientific Discovery
Hai Siong Tan, Kuancheng Wang, Rafe McBeth
Main category: cs.LG
TL;DR: E-PINN is an uncertainty-aware Physics-Informed Neural Network that uses evidential deep learning for uncertainty estimation and learns posterior distributions for PDE parameters, showing better calibration than Bayesian PINN and Deep Ensembles.
Details
Motivation: The paper aims to develop a more reliable uncertainty quantification method for Physics-Informed Neural Networks (PINNs) to improve their practical applicability in real-world problems where uncertainty estimation is crucial.Method: E-PINN leverages evidential deep learning with marginal distribution loss functions to estimate output uncertainties and infers unknown PDE parameters through learned posterior distributions, combining physics-informed constraints with uncertainty quantification.
Result: E-PINN demonstrated significantly better calibrated empirical coverage probabilities than Bayesian PINN and Deep Ensemble methods on 1D Poisson equation and 2D Fisher-KPP equation case studies, and showed real-world applicability in clinical glucose-insulin analysis.
Conclusion: E-PINN provides a robust framework for uncertainty-aware physics-informed learning with better calibration than existing methods, making it suitable for real-world applications including medical research problems.
Abstract: We present the fundamental theory and implementation guidelines underlying Evidential Physics-Informed Neural Network (E-PINN) – a novel class of uncertainty-aware PINN. It leverages the marginal distribution loss function of evidential deep learning for estimating uncertainty of outputs, and infers unknown parameters of the PDE via a learned posterior distribution. Validating our model on two illustrative case studies – the 1D Poisson equation with a Gaussian source and the 2D Fisher-KPP equation, we found that E-PINN generated empirical coverage probabilities that were calibrated significantly better than Bayesian PINN and Deep Ensemble methods. To demonstrate real-world applicability, we also present a brief case study on applying E-PINN to analyze clinical glucose-insulin datasets that have featured in medical research on diabetes pathophysiology.
[776] The Multi-Query Paradox in Zeroth-Order Optimization
Wei Lin, Qingyu Song, Hong Xu
Main category: cs.LG
TL;DR: ZO optimization faces a query allocation trade-off: more queries per iteration improves gradient estimation but reduces total iterations under fixed budget. The paper shows this choice depends entirely on aggregation method - simple averaging favors single query, while new projection alignment favors full-subspace estimation.
Details
Motivation: ZO optimization suffers from high gradient estimation variance with single-query approaches. Multi-query methods improve accuracy but create a fundamental trade-off: under fixed query budget, queries per iteration and total iterations are inversely proportional. How to best allocate this query budget is an under-explored fundamental question.Method: Analyzes two aggregation methods: 1) de facto simple averaging (ZO-Avg), and 2) new Projection Alignment method (ZO-Align) derived from local surrogate minimization. Derives convergence rates for both methods across strongly convex, convex, non-convex, and stochastic settings, making dependence on number of queries explicit.
Result: Uncovers a stark dichotomy: For ZO-Avg, using more than one query per iteration is always query-inefficient, making single-query approach optimal. For ZO-Align, more queries per iteration generally performs better, with full-subspace estimation as optimal approach. The multi-query problem reduces to a choice between two classic algorithms dictated by aggregation method.
Conclusion: The query allocation problem in ZO optimization is systematically resolved. The choice isn’t about intermediate query size but between two algorithms: single-query with simple averaging vs. full-subspace estimation with projection alignment. Theoretical findings are validated by extensive experiments.
Abstract: Zeroth-order (ZO) optimization provides a powerful framework for problems where explicit gradients are unavailable and have to be approximated using only queries to function value. The prevalent single-query approach is simple, but suffers from high estimation variance, motivating a multi-query paradigm to improves estimation accuracy. This, however, creates a critical trade-off: under a fixed budget of queries (i.e. cost), queries per iteration and the total number of optimization iterations are inversely proportional to one another. How to best allocate this budget is a fundamental, under-explored question. This work systematically resolves this query allocation problem. We analyze two aggregation methods: the de facto simple averaging (ZO-Avg), and a new Projection Alignment method (ZO-Align) we derive from local surrogate minimization. By deriving convergence rates for both methods that make the dependence on the number of queries explicit across strongly convex, convex, non-convex, and stochastic settings, we uncover a stark dichotomy: For ZO-Avg, we prove that using more than one query per iteration is always query-inefficient, rendering the single-query approach optimal. On the contrary, ZO-Align generally performs better with more queries per iteration, resulting in a full-subspace estimation as the optimal approach. Thus, our work clarifies that the multi-query problem boils down to a choice not about an intermediate query size, but between two classic algorithms, a choice dictated entirely by the aggregation method used. These theoretical findings are also consistently validated by extensive experiments.
[777] Staying on the Manifold: Geometry-Aware Noise Injection
Albert Kjøller Jacobsen, Johanna Marie Gegenfurtner, Georgios Arvanitidis
Main category: cs.LG
TL;DR: Geometry-aware input noise improves model generalization by accounting for data manifold structure, outperforming ambient noise on curved manifolds while matching performance on simpler ones.
Details
Motivation: Previous input perturbation methods use ambient noise without considering the underlying data manifold structure, which may not be optimal for data that lies on lower-dimensional manifolds.Method: Proposes geometry-aware noise strategies: 1) Project ambient Gaussian noise onto tangent space then map to manifold via geodesics, 2) Brownian motion noise that moves randomly along the manifold, 3) Extension to generative model-approximated manifolds.
Result: Geometry-aware noise leads to improved generalization and robustness to hyperparameter selection on highly curved manifolds, while performing at least as well as training without noise on simpler manifolds. Similar trends observed on MNIST dataset.
Conclusion: Incorporating manifold geometry into input noise regularization provides better regularization than ambient noise, especially for complex, curved data manifolds, offering a more principled approach to data augmentation.
Abstract: It has been shown that perturbing the input during training implicitly regularises the gradient of the learnt function, leading to smoother models and enhancing generalisation. However, previous research mostly considered the addition of ambient noise in the input space, without considering the underlying structure of the data. In this work, we propose several strategies of adding geometry-aware input noise that accounts for the lower dimensional manifold the input space inhabits. We start by projecting ambient Gaussian noise onto the tangent space of the manifold. In a second step, the noise sample is mapped on the manifold via the associated geodesic curve. We also consider Brownian motion noise, which moves in random steps along the manifold. We show that geometry-aware noise leads to improved generalisation and robustness to hyperparameter selection on highly curved manifolds, while performing at least as well as training without noise on simpler manifolds. Our proposed framework extends to data manifolds approximated by generative models and we observe similar trends on the MNIST digits dataset.
[778] RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning and the Role of Behavior Regularization
Kai Fukazawa, Kunal Mundada, Iman Soltani
Main category: cs.LG
TL;DR: RAMAC introduces a risk-aware multimodal actor-critic framework that combines expressive generative policies with distributional critics for safe offline RL, achieving improved CVaR while maintaining high returns.
Details
Motivation: Offline RL needs to deliver high returns without catastrophic risk in safety-critical domains. Prior risk-averse methods sacrifice value or use restricted policy classes, while expressive generative policies have only been used in risk-neutral settings.Method: RAMAC couples an expressive generative actor with a distributional critic, using a composite objective that adds Conditional Value-at-Risk (CVaR) to behavioral cloning loss. It analyzes OOD action behavior and uses behavior-regularized objectives to constrain policies to dataset support.
Result: RAMAC achieves consistent gains in CVaR₀.₁ while maintaining strong returns across tasks, including 2-D risky bandit and Stochastic-D4RL benchmarks. It effectively suppresses OOD actions in expressive policies.
Conclusion: RAMAC is the first model-free approach to learn risk-aware expressive generative policies, successfully addressing the safety-performance trade-off in offline RL through a composite CVaR-BC objective and behavior regularization.
Abstract: In safety-critical domains where online data collection is infeasible, offline reinforcement learning (RL) offers an attractive alternative but only if policies deliver high returns without incurring catastrophic lower-tail risk. Prior work on risk-averse offline RL achieves safety at the cost of value or model-based pessimism, and restricted policy classes that limit policy expressiveness, whereas diffusion/flow-based expressive generative policies trained with a behavioral-cloning (BC) objective have been used only in risk-neutral settings. Here, we address this gap by introducing the \textbf{Risk-Aware Multimodal Actor-Critic (RAMAC)}, which couples an expressive generative actor with a distributional critic and, to our knowledge, is the first model-free approach that learns \emph{risk-aware expressive generative policies}. RAMAC differentiates a composite objective that adds a Conditional Value-at-Risk (CVaR) term to a BC loss, achieving risk-sensitive learning in complex multimodal scenarios. Since out-of-distribution (OOD) actions are a major driver of catastrophic failures in offline RL, we further analyze OOD behavior under prior-anchored perturbation schemes from recent BC-regularized risk-averse offline RL. This clarifies why a behavior-regularized objective that directly constrains the expressive generative policy to the dataset support provides an effective, risk-agnostic mechanism for suppressing OOD actions in modern expressive policies. We instantiate RAMAC with a diffusion-based actor, using it both to illustrate the analysis in a 2-D risky bandit and to deploy OOD-action detectors on Stochastic-D4RL benchmarks, empirically validating our insights. Across these tasks, we observe consistent gains in $\mathrm{CVaR}_{0.1}$ while maintaining strong returns. Our implementation is available at GitHub: https://github.com/KaiFukazawa/RAMAC.git
[779] Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning
Hanjiang Hu, Changliu Liu, Na Li, Yebin Wang
Main category: cs.LG
TL;DR: The paper introduces Group Relative Policy Optimization (GRPO) to transform multi-turn task planning into single-turn reasoning problems, enabling efficient training of LLM agents for complex planning tasks.
Details
Motivation: Training LLM agents for complex multi-turn task planning faces challenges: sparse episode-wise rewards, credit assignment across long horizons, and computational overhead of reinforcement learning in multi-turn settings.Method: Transforms multi-turn task planning into single-turn task reasoning problems, uses Group Relative Policy Optimization (GRPO) with dense and verifiable rewards from expert trajectories.
Result: A 1.5B parameter model trained with single-turn GRPO achieves 70% success rate for long-horizon planning tasks, outperforming baseline models up to 14B parameters.
Conclusion: GRPO provides an efficient approach to train LLM agents for complex planning by converting multi-turn problems into single-turn reasoning with theoretical guarantees on performance improvement.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in knowledge acquisition, reasoning, and tool use, making them promising candidates for autonomous agent applications. However, training LLM agents for complex multi-turn task planning faces significant challenges, including sparse episode-wise rewards, credit assignment across long horizons, and the computational overhead of reinforcement learning in multi-turn interaction settings. To this end, this paper introduces a novel approach that transforms multi-turn task planning into single-turn task reasoning problems, enabling efficient policy optimization through Group Relative Policy Optimization (GRPO) with dense and verifiable reward from expert trajectories. Our theoretical analysis shows that GRPO improvement on single-turn task reasoning results in a lower bound of the multi-turn success probability under the minimal turns, as well as the generalization to subtasks with shorter horizons. Experimental evaluation on the complex task planning benchmark demonstrates that our 1.5B parameter model trained with single-turn GRPO achieves superior performance compared to larger baseline models up to 14B parameters, with success rates of 70% for long-horizon planning tasks.
[780] Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias
Shuofeng Zhang, Ard Louis
Main category: cs.LG
TL;DR: The paper analyzes scaling laws for ℓ_r norms of minimum-ℓ_p interpolators in overparameterized linear regression, revealing a data-dependent transition point (elbow) and universal threshold r_⋆=2(p-1) that separates norms that plateau from those that grow with sample size.
Details
Motivation: Understanding how different ℓ_r norms of interpolating solutions scale with sample size is fundamental but unresolved, especially since many generalization proxies depend on these norms. The analysis aims to provide a unified characterization that explains which norms saturate and which continue to grow.Method: Uses dual-ray analysis for overparameterized linear regression with isotropic Gaussian design and minimum-ℓ_p interpolators (p∈(1,2]). This reveals competition between signal spike and bulk of null coordinates in X^⊤Y. For diagonal linear networks, calibrates initialization scale α to effective p_eff(α) via separable potential to show inheritance of same scaling laws.
Result: Derives closed-form predictions for: (1) data-dependent transition n_⋆ (elbow), and (2) universal threshold r_⋆=2(p-1) that separates ℓ_r norms that plateau from those that grow with explicit exponent. Shows DLNs inherit same elbow/threshold laws through effective p_eff(α) calibration.
Conclusion: Provides unified solution for scaling of all ℓ_r norms within family r∈[1,p] under ℓ_p-biased interpolation. Since generalization proxies depend on ‖ŵ_p‖_r, predictive power will sensitively depend on which ℓ_r norm is used, explaining varying effectiveness of different norm-based generalization bounds.
Abstract: For overparameterized linear regression with isotropic Gaussian design and minimum-$\ell_p$ interpolator $p\in(1,2]$, we give a unified, high-probability characterization for the scaling of the family of parameter norms $ \{ \lVert \widehat{w_p} \rVert_r \}{r \in [1,p]} $ with sample size. We solve this basic, but unresolved question through a simple dual-ray analysis, which reveals a competition between a signal spike and a bulk of null coordinates in $X^\top Y$, yielding closed-form predictions for (i) a data-dependent transition $n\star$ (the “elbow”), and (ii) a universal threshold $r_\star=2(p-1)$ that separates $\lVert \widehat{w_p} \rVert_r$’s which plateau from those that continue to grow with an explicit exponent. This unified solution resolves the scaling of all $\ell_r$ norms within the family $r\in [1,p]$ under $\ell_p$-biased interpolation, and explains in one picture which norms saturate and which increase as $n$ grows. We then study diagonal linear networks (DLNs) trained by gradient descent. By calibrating the initialization scale $α$ to an effective $p_{\mathrm{eff}}(α)$ via the DLN separable potential, we show empirically that DLNs inherit the same elbow/threshold laws, providing a predictive bridge between explicit and implicit bias. Given that many generalization proxies depend on $\lVert \widehat {w_p} \rVert_r$, our results suggest that their predictive power will depend sensitively on which $l_r$ norm is used.
[781] AuON: A Linear-time Alternative to Orthogonal Momentum Updates
Dipan Maity
Main category: cs.LG
TL;DR: AuON is a linear-time optimizer that improves upon orthogonal momentum gradient updates, addressing computational complexity and exploding attention logits issues in previous methods like Muon.
Details
Motivation: Existing orthogonal momentum approaches (like SVD/QR) have high computational/memory costs and underperform compared to SGD with momentum. Recent methods like Muon improve efficiency but suffer from exploding attention logits and cubic complexity.Method: Proposes AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling) - a linear-time optimizer that achieves strong performance without approximate orthogonal matrices, preserves structural alignment, and handles ill-posed updates. Includes an “emergency brake” mechanism for exploding attention logits and a hybrid variant (Hybrid-AuON) using Newton-Schulz iterations.
Result: AuON reduces memory usage (up to 3x like Muon), achieves strong performance without approximate orthogonal matrices, handles exploding attention logits automatically, and Hybrid-AuON outperforms Muon in language modeling tasks.
Conclusion: AuON provides an efficient linear-time alternative to previous orthogonal momentum optimizers, addressing their computational complexity and stability issues while maintaining strong performance, particularly in language modeling tasks.
Abstract: Orthogonal momentum gradient updates have emerged to overcome the limitations of vector-based optimizers like Adam. The vector-based optimizer Adam suffers from high memory costs and ill-conditioned momentum gradient updates. However, traditional Orthogonal momentum approaches, such as SVD/QR decomposition, suffer from high computational and memory costs and underperform compared to well-tuned SGD with momentum.Recent advances, such as Muon, improve efficiency by applying momentum before orthogonalization and approximate orthogonal matrices via Newton-Schulz iterations, which gives better GPU utilization, active high TFLOPS, and reduces memory usage by up to 3x. Nevertheless, Muon(Vanilla) suffers from exploding attention logits and has cubic computation complexity. In this paper we , deep dive into orthogonal momentum gradient updates to find the main properties that help Muon to achieve remarkable performance.We propose \textbf{AuON} (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time optimizer that achieves strong performance without approximate orthogonal matrices, while preserving structural alignment and reconditioning ill-posed updates. AuON has an automatic (\textbf{“emergency brake”}) to handle exploding attention logits.. We further introduce a hybrid variant (\textbf{ Hybrid-AuON})that applies the linear transformations with Newton-Schulz iterations which out performs Muon in the language modeling tasks. Code is available at: https://github.com/ryyzn9/AuON
[782] Physics-Informed Inductive Biases for Voltage Prediction in Distribution Grids
Ehimare Okoyomon, Arbel Yaniv, Christoph Goebel
Main category: cs.LG
TL;DR: The paper investigates physics-informed inductive biases for improving GNN-based voltage prediction in distribution grids, evaluating three strategies to enhance generalization from limited data.
Details
Motivation: Voltage prediction in distribution grids is critical for power system stability but difficult. While GNNs offer speedups, they suffer from poor generalization when trained on limited or incomplete data, necessitating better learning approaches.Method: Systematically evaluates three physics-informed inductive biases: (1) power-flow-constrained loss functions, (2) complex-valued neural networks, and (3) residual-based task reformulation. Uses the ENGAGE dataset spanning multiple grid configurations with controlled experiments to isolate each bias effect.
Result: The study provides practical insights into which model assumptions most effectively guide learning for reliable voltage prediction, assessing both standard predictive performance and out-of-distribution generalization.
Conclusion: Physics-informed inductive biases can improve GNN-based voltage prediction reliability and generalization in distribution networks, with the systematic evaluation offering guidance on effective model assumptions for power flow learning.
Abstract: Voltage prediction in distribution grids is a critical yet difficult task for maintaining power system stability. Machine learning approaches, particularly Graph Neural Networks (GNNs), offer significant speedups but suffer from poor generalization when trained on limited or incomplete data. In this work, we systematically investigate the role of inductive biases in improving a model’s ability to reliably learn power flow. Specifically, we evaluate three physics-informed strategies: (i) power-flow-constrained loss functions, (ii) complex-valued neural networks, and (iii) residual-based task reformulation. Using the ENGAGE dataset, which spans multiple low- and medium-voltage grid configurations, we conduct controlled experiments to isolate the effect of each inductive bias and assess both standard predictive performance and out-of-distribution generalization. Our study provides practical insights into which model assumptions most effectively guide learning for reliable and efficient voltage prediction in modern distribution networks.
[783] Trustworthy Retrosynthesis: Eliminating Hallucinations with a Diverse Ensemble of Reaction Scorers
Michal Sadowski, Tadija Radusinović, Maria Wyrzykowska, Lukasz Sztukiewicz, Jan Rzymkowski, Paweł Włodarczyk-Pruszyński, Mikołaj Sacha, Piotr Kozakowski, Ruard van Workum, Stanislaw Kamil Jastrzebski
Main category: cs.LG
TL;DR: RetroTrim is a retrosynthesis system that effectively filters out hallucinated reactions using diverse scoring strategies, achieving the highest number of high-quality synthetic paths and winning the Standard Industries $1 million Retrosynthesis Challenge.
Details
Motivation: Current retrosynthesis systems suffer from hallucination problems (nonsensical/erroneous outputs), and reliable assessment of synthetic plans is time-consuming with lacking automatic methods. There's a need for systems that can filter out unreliable reactions while maintaining high-quality synthetic paths.Method: RetroTrim combines diverse reaction scoring strategies using machine learning models and chemical databases. The system analyzes different classes of hallucinations on labeled retrosynthetic intermediates and implements a novel evaluation protocol with expert chemist review.
Result: RetroTrim is the sole method that successfully filters out hallucinated reactions while producing the highest number of high-quality paths overall. It won the Standard Industries $1 million Retrosynthesis Challenge and was evaluated on 32 novel drug-like targets using expert review.
Conclusion: The combination of diverse scoring strategies effectively addresses hallucination problems in retrosynthesis. The authors release benchmark targets and evaluation protocols to inspire further research into reliable retrosynthesis, particularly for drug-like targets.
Abstract: Retrosynthesis is one of the domains transformed by the rise of generative models, and it is one where the problem of nonsensical or erroneous outputs (hallucinations) is particularly insidious: reliable assessment of synthetic plans is time-consuming, with automatic methods lacking. In this work, we present RetroTrim, a retrosynthesis system that successfully avoids nonsensical plans on a set of challenging drug-like targets. Compared to common baselines in the field, our system is not only the sole method that succeeds in filtering out hallucinated reactions, but it also results in the highest number of high-quality paths overall. The key insight behind RetroTrim is the combination of diverse reaction scoring strategies, based on machine learning models and existing chemical databases. We show that our scoring strategies capture different classes of hallucinations by analyzing them on a dataset of labeled retrosynthetic intermediates. This approach formed the basis of our winning solution to the Standard Industries $1 million Retrosynthesis Challenge. To measure the performance of retrosynthesis systems, we propose a novel evaluation protocol for reactions and synthetic paths based on a structured review by expert chemists. Using this protocol, we compare systems on a set of 32 novel targets, curated to reflect recent trends in drug structures. While the insights behind our methodology are broadly applicable to retrosynthesis, our focus is on targets in the drug-like domain. By releasing our benchmark targets and the details of our evaluation protocol, we hope to inspire further research into reliable retrosynthesis.
[784] First Attentions Last: Better Exploiting First Attentions for Efficient Transformer Training
Gyudong Kim, Hyukju Na, Jin Hyeon Kim, Hyunsung Jang, Jaemin Park, Jaegi Hwang, Namkoo Ha, Seungryong Kim, Young Geun Kim
Main category: cs.LG
TL;DR: FAL is an efficient transformer architecture that eliminates per-block MHA-MLP connections to reduce communication overhead in distributed training, achieving up to 44% faster training and better perplexity.
Details
Motivation: Existing transformer designs suffer from significant communication overhead in distributed training, especially in Tensor Parallelism where each block's MHA-MLP connection requires all-reduce communication. The authors discovered that MHA-MLP connections can be bypassed for efficiency while maintaining model quality.Method: Proposed FAL (First Attentions Last) architecture that redirects the first MHA output to the MLP inputs of following layers, eliminating per-block MHA-MLP connections. This removes all-reduce communication and enables parallel execution of MHA and MLP on a single GPU. Also introduced FAL+ which adds normalized first attention output to MHA outputs of following layers to augment MLP input for better quality.
Result: FAL reduces multi-GPU training time by up to 44%, improves single-GPU throughput by up to 1.18x, and achieves better perplexity compared to baseline GPT. FAL+ achieves even lower perplexity without increasing training time compared to baseline.
Conclusion: The proposed FAL architecture effectively addresses communication bottlenecks in distributed transformer training while maintaining or improving model quality, offering a practical solution for efficient large-scale transformer training.
Abstract: As training billion-scale transformers becomes increasingly common, employing multiple distributed GPUs along with parallel training methods has become a standard practice. However, existing transformer designs suffer from significant communication overhead, especially in Tensor Parallelism (TP), where each block’s MHA-MLP connection requires an all-reduce communication. Through our investigation, we show that the MHA-MLP connections can be bypassed for efficiency, while the attention output of the first layer can serve as an alternative signal for the bypassed connection. Motivated by the observations, we propose FAL (First Attentions Last), an efficient transformer architecture that redirects the first MHA output to the MLP inputs of the following layers, eliminating the per-block MHA-MLP connections. This removes the all-reduce communication and enables parallel execution of MHA and MLP on a single GPU. We also introduce FAL+, which adds the normalized first attention output to the MHA outputs of the following layers to augment the MLP input for the model quality. Our evaluation shows that FAL reduces multi-GPU training time by up to 44%, improves single-GPU throughput by up to 1.18x, and achieves better perplexity compared to the baseline GPT. FAL+ achieves even lower perplexity without increasing the training time than the baseline. Codes are available at: https://github.com/CASL-KU/FAL
[785] From Observations to Parameters: Detecting Changepoint in Nonlinear Dynamics with Simulation-based Inference
Xiangbo Deng, Cheng Chen, Peng Yang
Main category: cs.LG
TL;DR: Param-CPD: A two-stage framework for detecting regime shifts in chaotic time series by first inferring governing parameters via neural posterior estimation, then applying changepoint detection to parameter trajectories instead of raw observations.
Details
Motivation: Detecting regime shifts in chaotic time series is challenging because observation-space signals are entangled with intrinsic variability, making it difficult to distinguish parameter changes from natural chaotic fluctuations.Method: Two-stage framework: 1) Amortize Bayesian inference of governing parameters using neural posterior estimator trained via simulation-based inference, 2) Apply standard changepoint detection algorithm to the resulting parameter trajectory rather than raw observations.
Result: On Lorenz-63 system with piecewise-constant parameters, Param-CPD improves F1 score, reduces localization error, and lowers false positives compared to observation-space baselines. Shows consistent gains across tolerance, window length, and noise variations.
Conclusion: Operating in physically interpretable parameter space enables more accurate and interpretable changepoint detection in nonlinear dynamical systems by providing cleaner detection signals disentangled from intrinsic chaotic variability.
Abstract: Detecting regime shifts in chaotic time series is hard because observation-space signals are entangled with intrinsic variability. We propose Parameter–Space Changepoint Detection (Param–CPD), a two–stage framework that first amortizes Bayesian inference of governing parameters with a neural posterior estimator trained by simulation-based inference, and then applies a standard CPD algorithm to the resulting parameter trajectory. On Lorenz–63 with piecewise-constant parameters, Param–CPD improves F1, reduces localization error, and lowers false positives compared to observation–space baselines. We further verify identifiability and calibration of the inferred posteriors on stationary trajectories, explaining why parameter space offers a cleaner detection signal. Robustness analyses over tolerance, window length, and noise indicate consistent gains. Our results show that operating in a physically interpretable parameter space enables accurate and interpretable changepoint detection in nonlinear dynamical systems.
[786] QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation
Yang Zhang, Rui Zhang, Jiaming Guo, Lei Huang, Di Huang, Yunpu Zhao, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen
Main category: cs.LG
TL;DR: QiMeng-SALV introduces signal-aware learning for Verilog code generation, using functionally correct output signal segments to optimize RL training, achieving SOTA performance with a 7B model matching DeepSeek v3 671B.
Details
Motivation: LLMs show promise for automated circuit design via Verilog generation, but lack meaningful functional rewards hinders RL-based preference optimization for producing functionally correct code.Method: Extracts verified signal-aware implementations from partially incorrect modules by comparing signal correctness with reference modules, uses AST to identify signal-aware code segments, and introduces signal-aware DPO optimized on correct signal-level segments.
Result: Achieves state-of-the-art performance on VerilogEval and RTLLM benchmarks, with 7B parameter model matching DeepSeek v3 671B performance and significantly outperforming CodeV trained on same dataset.
Conclusion: Proposes paradigm shift from module-level to fine-grained signal-level optimization in Verilog code generation, effectively addressing insufficient functional rewards issue.
Abstract: The remarkable progress of Large Language Models (LLMs) presents promising opportunities for Verilog code generation which is significantly important for automated circuit design. The lacking of meaningful functional rewards hinders the preference optimization based on Reinforcement Learning (RL) for producing functionally correct Verilog code. In this paper, we propose Signal-Aware Learning for Verilog code generation (QiMeng-SALV) by leveraging code segments of functionally correct output signal to optimize RL training. Considering Verilog code specifies the structural interconnection of hardware gates and wires so that different output signals are independent, the key insight of QiMeng-SALV is to extract verified signal-aware implementations in partially incorrect modules, so as to enhance the extraction of meaningful functional rewards. Roughly, we verify the functional correctness of signals in generated module by comparing with that of reference module in the training data. Then abstract syntax tree (AST) is employed to identify signal-aware code segments which can provide meaningful functional rewards from erroneous modules. Finally, we introduce signal-aware DPO which is optimized on the correct signal-level code segments, thereby preventing noise and interference from incorrect signals. The proposed QiMeng-SALV underscores the paradigm shift from conventional module-level to fine-grained signal-level optimization in Verilog code generation, addressing the issue of insufficient functional rewards. Experiments demonstrate that our method achieves state-of-the-art performance on VerilogEval and RTLLM, with a 7B parameter model matching the performance of the DeepSeek v3 671B model and significantly outperforming the leading open-source model CodeV trained on the same dataset. Our code is available at https://github.com/QiMeng-IPRC/QiMeng-SALV.
[787] Self-diffusion for Solving Inverse Problems
Guanxiong Luo, Shoujin Huang, Yanlong Yang
Main category: cs.LG
TL;DR: Self-diffusion: A novel framework for solving inverse problems without pretrained generative models, using iterative noising/denoising with a single untrained network.
Details
Motivation: Traditional diffusion methods require pretrained generative models trained on clean datasets, which limits flexibility and requires extensive training. Self-diffusion aims to solve inverse problems without relying on external pretrained models, making it more adaptive to arbitrary forward operators and noisy observations.Method: Self-diffusion uses an iterative process alternating between noising and denoising steps. At each step: 1) Add noise to current estimate, 2) Continuously train a single untrained convolutional network (self-denoiser) via data fidelity loss to predict solution from noisy estimate, 3) Exploit neural network spectral bias modulated through scheduled noise process. No pretrained score functions or external denoisers needed.
Result: The approach demonstrates competitive or superior performance compared to other methods on various linear inverse problems. It remains adaptive to arbitrary forward operators and noisy observations, showing high flexibility and broad applicability.
Conclusion: Self-diffusion provides a novel, flexible framework for inverse problems that eliminates dependency on pretrained generative models while maintaining strong performance through iterative refinement with untrained networks.
Abstract: We propose self-diffusion, a novel framework for solving inverse problems without relying on pretrained generative models. Traditional diffusion-based approaches require training a model on a clean dataset to learn to reverse the forward noising process. This model is then used to sample clean solutions – corresponding to posterior sampling from a Bayesian perspective – that are consistent with the observed data under a specific task. In contrast, self-diffusion introduces a self-contained iterative process that alternates between noising and denoising steps to progressively refine its estimate of the solution. At each step of self-diffusion, noise is added to the current estimate, and a self-denoiser, which is a single untrained convolutional network randomly initialized from scratch, is continuously trained for certain iterations via a data fidelity loss to predict the solution from the noisy estimate. Essentially, self-diffusion exploits the spectral bias of neural networks and modulates it through a scheduled noise process. Without relying on pretrained score functions or external denoisers, this approach still remains adaptive to arbitrary forward operators and noisy observations, making it highly flexible and broadly applicable. We demonstrate the effectiveness of our approach on a variety of linear inverse problems, showing that self-diffusion achieves competitive or superior performance compared to other methods.
[788] K-DAREK: Distance Aware Error for Kurkova Kolmogorov Networks
Masoud Ataei, Vikas Dhiman, Mohammad Javad Khojasteh
Main category: cs.LG
TL;DR: K-DAREK is a novel learning algorithm that enhances Kurkova-Kolmogorov networks with distance-aware error bounds for efficient, interpretable function approximation with uncertainty quantification, showing significant improvements in speed, scalability, and safety over existing methods.
Details
Motivation: Neural networks are powerful but lack uncertainty quantification, while Gaussian processes provide probabilistic modeling but are computationally expensive for large-scale problems. There's a need for efficient, interpretable function approximation with robust uncertainty quantification, especially for safety-critical applications like control systems.Method: The paper develops K-DAREK by enhancing KKAN (Kurkova-Kolmogorov networks) architecture. KKANs extend KANs by replacing early spline layers with MLPs to map inputs to higher dimensions before spline transformations. K-DAREK establishes distance-aware error bounds that reflect test point proximity to training data, enabling uncertainty quantification.
Result: K-DAREK is 4x faster and 10x more computationally efficient than Ensemble of KANs, 8.6x more scalable than GPs as data size increases, and 7.2% safer than previous DAREK. On real estate valuation data, it achieves zero coverage violations in error bounds.
Conclusion: K-DAREK provides an efficient, interpretable, and safe approach for function approximation with uncertainty quantification, particularly valuable for safety-critical applications where computational efficiency and reliable error bounds are essential.
Abstract: Neural networks are powerful parametric function approximators, while Gaussian processes (GPs) are nonparametric probabilistic models that place distributions over functions via kernel-defined correlations but become computationally expensive for large-scale problems. Kolmogorov-Arnold networks (KANs), semi-parametric neural architectures, model complex functions efficiently using spline layers. Kurkova Kolmogorov-Arnold networks (KKANs) extend KANs by replacing the early spline layers with multi-layer perceptrons that map inputs into higher-dimensional spaces before applying spline-based transformations, which yield more stable training and provide robust architectures for system modeling. By enhancing the KKAN architecture, we develop a novel learning algorithm, distance-aware error for Kurkova-Kolmogorov networks (K-DAREK), for efficient and interpretable function approximation with uncertainty quantification. Our approach establishes robust error bounds that are distance-aware; this means they reflect the proximity of a test point to its nearest training points. In safe control case studies, we demonstrate that K-DAREK is about four times faster and ten times more computationally efficient than Ensemble of KANs, 8.6 times more scalable than GP as data size increases, and 7.2% safer than our previous work distance-aware error for Kolmogorov networks (DAREK). Moreover, on real data (e.g., Real Estate Valuation), K-DAREK’s error bound achieves zero coverage violations.
[789] FLEX: Continuous Agent Evolution via Forward Learning from Experience
Zhicheng Cai, Xinyuan Guo, Yu Pei, Jiangtao Feng, Jinsong Su, Jiangjie Chen, Ya-Qin Zhang, Wei-Ying Ma, Mingxuan Wang, Hao Zhou
Main category: cs.LG
TL;DR: FLEX enables LLM agents to continuously learn from experience through gradient-free learning, achieving significant improvements on reasoning and scientific tasks.
Details
Motivation: Current LLM agents remain static after training and cannot grow with experience like intelligent beings do during deployment, limiting their adaptability and continuous improvement.Method: FLEX uses a gradient-free learning paradigm that constructs a structured experience library through continual reflection on successes and failures during environment interaction, enabling scalable and inheritable evolution.
Result: Substantial improvements: up to 23% on AIME25 (mathematical reasoning), 10% on USPTO50k (chemical retrosynthesis), and 14% on ProteinGym (protein fitness prediction). Also identified scaling laws of experiential growth and experience inheritance across agents.
Conclusion: FLEX represents a step toward scalable and inheritable continuous agent evolution, enabling LLM agents to grow through accumulated experience like intelligent beings.
Abstract: Autonomous agents driven by Large Language Models (LLMs) have revolutionized reasoning and problem-solving but remain static after training, unable to grow with experience as intelligent beings do during deployment. We introduce Forward Learning with EXperience (FLEX), a gradient-free learning paradigm that enables LLM agents to continuously evolve through accumulated experience. Specifically, FLEX cultivates scalable and inheritable evolution by constructing a structured experience library through continual reflection on successes and failures during interaction with the environment. FLEX delivers substantial improvements on mathematical reasoning, chemical retrosynthesis, and protein fitness prediction (up to 23% on AIME25, 10% on USPTO50k, and 14% on ProteinGym). We further identify a clear scaling law of experiential growth and the phenomenon of experience inheritance across agents, marking a step toward scalable and inheritable continuous agent evolution. Project Page: https://flex-gensi-thuair.github.io.
[790] HN-MVTS: HyperNetwork-based Multivariate Time Series Forecasting
Andrey Savchenko, Oleg Kachan
Main category: cs.LG
TL;DR: HN-MVTS improves multivariate time series forecasting by using a hypernetwork to generate weights for the last layer of base models, enhancing generalization without increasing inference time.
Details
Motivation: Multivariate time series forecasting is challenging due to complex temporal dependencies. Channel-dependent models often underperform compared to simpler channel-independent models, despite the latter ignoring component relationships but offering better robustness through smaller capacity.Method: Propose HN-MVTS architecture that integrates a hypernetwork-based generative prior with arbitrary neural forecasting models. The hypernetwork takes learnable embeddings of time series components as input and generates weights for the last layer of target networks, acting as a data-adaptive regularizer. The hypernetwork is only used during training, not inference.
Result: Extensive experiments on eight benchmark datasets show that applying HN-MVTS to state-of-the-art models (DLinear, PatchTST, TSMixer, etc.) typically improves their performance. The approach enhances generalization and long-range predictive accuracy.
Conclusion: Hypernetwork-driven parameterization offers a promising direction for enhancing existing forecasting techniques in complex scenarios, improving performance without increasing inference time compared to base models.
Abstract: Accurate forecasting of multivariate time series data remains a formidable challenge, particularly due to the growing complexity of temporal dependencies in real-world scenarios. While neural network-based models have achieved notable success in this domain, complex channel-dependent models often suffer from performance degradation compared to channel-independent models that do not consider the relationship between components but provide high robustness due to small capacity. In this work, we propose HN-MVTS, a novel architecture that integrates a hypernetwork-based generative prior with an arbitrary neural network forecasting model. The input of this hypernetwork is a learnable embedding matrix of time series components. To restrict the number of new parameters, the hypernetwork learns to generate the weights of the last layer of the target forecasting networks, serving as a data-adaptive regularizer that improves generalization and long-range predictive accuracy. The hypernetwork is used only during the training, so it does not increase the inference time compared to the base forecasting model. Extensive experiments on eight benchmark datasets demonstrate that application of HN-MVTS to the state-of-the-art models (DLinear, PatchTST, TSMixer, etc.) typically improves their performance. Our findings suggest that hypernetwork-driven parameterization offers a promising direction for enhancing existing forecasting techniques in complex scenarios.
[791] Data Fusion-Enhanced Decision Transformer for Stable Cross-Domain Generalization
Guojian Wang, Quinson Hon, Xuyang Chen, Lin Zhao
Main category: cs.LG
TL;DR: DFDT improves cross-domain adaptation for Decision Transformers by fusing target data with filtered source fragments using MMD and OT metrics, replacing RTG with advantage tokens, and applying Q-guided regularization to enhance stitchability and continuity.
Details
Motivation: Existing cross-domain policy adaptation methods for Decision Transformers suffer from poor stitchability when combining source trajectory fragments - state structures misalign, RTG tokens become incomparable across domains, and actions jump at junctions, compromising DT's inference ability.Method: DFDT uses a two-level data filter: MMD mismatch for state-structure alignment and OT deviation for action feasibility. It trains on feasibility-weighted fusion distribution, replaces RTG tokens with advantage-conditioned tokens, and applies Q-guided regularization to suppress junction jumps.
Result: DFDT improves return and stability over strong offline RL and sequence-model baselines across gravity, kinematic, and morphology shifts on D4RL-style control tasks, with theoretical bounds showing performance gaps tighten as MMD and OT measures shrink.
Conclusion: DFDT effectively addresses cross-domain adaptation challenges for Decision Transformers by improving stitchability through data fusion, advantage tokens, and regularization, with both theoretical guarantees and empirical validation across diverse domain shifts.
Abstract: Cross-domain shifts present a significant challenge for decision transformer (DT) policies. Existing cross-domain policy adaptation methods typically rely on a single simple filtering criterion to select source trajectory fragments and stitch them together. They match either state structure or action feasibility. However, the selected fragments still have poor stitchability: state structures can misalign, the return-to-go (RTG) becomes incomparable when the reward or horizon changes, and actions may jump at trajectory junctions. As a result, RTG tokens lose continuity, which compromises DT’s inference ability. To tackle these challenges, we propose Data Fusion-Enhanced Decision Transformer (DFDT), a compact pipeline that restores stitchability. Particularly, DFDT fuses scarce target data with selectively trusted source fragments via a two-level data filter, maximum mean discrepancy (MMD) mismatch for state-structure alignment, and optimal transport (OT) deviation for action feasibility. It then trains on a feasibility-weighted fusion distribution. Furthermore, DFDT replaces RTG tokens with advantage-conditioned tokens, which improves the continuity of the semantics in the token sequence. It also applies a $Q$-guided regularizer to suppress junction value and action jumps. Theoretically, we provide bounds that tie state value and policy performance gaps to the MMD-mismatch and OT-deviation measures, and show that the bounds tighten as these two measures shrink. We show that DFDT improves return and stability over strong offline RL and sequence-model baselines across gravity, kinematic, and morphology shifts on D4RL-style control tasks, and further corroborate these gains with token-stitching and sequence-semantics stability analyses.
[792] FlowCast: Advancing Precipitation Nowcasting with Conditional Flow Matching
Bernardo Perrone Ribeiro, Jana Faganeli Pucer
Main category: cs.LG
TL;DR: FlowCast introduces a new probabilistic precipitation nowcasting model using Conditional Flow Matching (CFM) that achieves state-of-the-art performance with faster sampling than diffusion models.
Details
Motivation: Radar-based precipitation nowcasting is critical for flood risk management, but existing deep learning approaches struggle with atmospheric uncertainty and high-dimensional data modeling. Diffusion models produce good forecasts but are computationally expensive for time-critical applications.Method: FlowCast uses Conditional Flow Matching (CFM) as a direct noise-to-data generative framework in compressed latent space, enabling rapid high-fidelity sample generation without the iterative sampling of diffusion models.
Result: FlowCast establishes new state-of-the-art in probabilistic performance and exceeds deterministic baselines in predictive accuracy. CFM is more accurate and significantly more efficient than diffusion objectives on the same architecture, maintaining high performance with fewer sampling steps.
Conclusion: CFM is positioned as a powerful and practical alternative to diffusion models for high-dimensional spatiotemporal forecasting, offering both superior performance and computational efficiency for time-critical applications like precipitation nowcasting.
Abstract: Radar-based precipitation nowcasting, the task of forecasting short-term precipitation fields from previous radar images, is a critical problem for flood risk management and decision-making. While deep learning has substantially advanced this field, two challenges remain fundamental: the uncertainty of atmospheric dynamics and the efficient modeling of high-dimensional data. Diffusion models have shown strong promise by producing sharp, reliable forecasts, but their iterative sampling process is computationally prohibitive for time-critical applications. We introduce FlowCast, the first end-to-end probabilistic model leveraging Conditional Flow Matching (CFM) as a direct noise-to-data generative framework for precipitation nowcasting. Unlike hybrid approaches, FlowCast learns a direct noise-to-data mapping in a compressed latent space, enabling rapid, high-fidelity sample generation. Our experiments demonstrate that FlowCast establishes a new state-of-the-art in probabilistic performance while also exceeding deterministic baselines in predictive accuracy. A direct comparison further reveals the CFM objective is both more accurate and significantly more efficient than a diffusion objective on the same architecture, maintaining high performance with significantly fewer sampling steps. This work positions CFM as a powerful and practical alternative for high-dimensional spatiotemporal forecasting.
[793] LAD-BNet: Lag-Aware Dual-Branch Networks for Real-Time Energy Forecasting on Edge Devices
Jean-Philippe Lignier
Main category: cs.LG
TL;DR: LAD-BNet is a neural network for real-time energy forecasting on edge devices that combines lag-aware processing with temporal convolutions, achieving fast inference on Google Coral TPU with improved accuracy over existing methods.
Details
Motivation: Real-time energy forecasting on edge devices is challenging but crucial for smart grid optimization and intelligent buildings, requiring models that balance accuracy with computational efficiency for embedded deployment.Method: LAD-BNet (Lag-Aware Dual-Branch Network) combines two branches: one explicitly exploits temporal lags while the other uses Temporal Convolutional Network (TCN) with dilated convolutions to capture both short and long-term dependencies simultaneously.
Result: Achieves 14.49% MAPE at 1-hour horizon with only 18ms inference time on Edge TPU (8-12x faster than CPU), 2.39% improvement over LSTM baselines, 3.04% improvement over pure TCN, with 180MB memory footprint suitable for embedded devices.
Conclusion: LAD-BNet enables practical industrial applications in real-time energy optimization, demand management, and operational planning by providing accurate forecasting with edge-optimized performance on constrained devices.
Abstract: Real-time energy forecasting on edge devices represents a major challenge for smart grid optimization and intelligent buildings. We present LAD-BNet (Lag-Aware Dual-Branch Network), an innovative neural architecture optimized for edge inference with Google Coral TPU. Our hybrid approach combines a branch dedicated to explicit exploitation of temporal lags with a Temporal Convolutional Network (TCN) featuring dilated convolutions, enabling simultaneous capture of short and long-term dependencies. Tested on real energy consumption data with 10-minute temporal resolution, LAD-BNet achieves 14.49% MAPE at 1-hour horizon with only 18ms inference time on Edge TPU, representing an 8-12 x acceleration compared to CPU. The multi-scale architecture enables predictions up to 12 hours with controlled performance degradation. Our model demonstrates a 2.39% improvement over LSTM baselines and 3.04% over pure TCN architectures, while maintaining a 180MB memory footprint suitable for embedded device constraints. These results pave the way for industrial applications in real-time energy optimization, demand management, and operational planning.
[794] Mask the Redundancy: Evolving Masking Representation Learning for Multivariate Time-Series Clustering
Zexi Tan, Xiaopeng Luo, Yunlin Liu, Yiqun Zhang
Main category: cs.LG
TL;DR: EMTC improves MTS clustering by adaptively masking redundant timestamps and using multi-view learning to focus on discriminative patterns.
Details
Motivation: MTS data contains substantial redundancy (steady-state operations, zero-output periods) that diminishes attention to discriminative timestamps, creating performance bottlenecks in clustering. Existing masking strategies are static preprocessing steps that can't adapt to clustering-critical timestamps.Method: Proposes Evolving-masked MTS Clustering (EMTC) with two modules: 1) Importance-aware Variate-wise Masking (IVM) that adaptively guides representation learning, and 2) Multi-Endogenous Views (MEV) generation for reconstruction and cluster-guided contrastive learning pathways.
Result: Extensive experiments on 15 benchmark datasets show EMTC outperforms eight state-of-the-art methods, achieving average 4.85% improvement in F1-Score over strongest baselines.
Conclusion: EMTC effectively addresses redundancy in MTS data through adaptive masking and multi-view learning, significantly improving clustering performance by focusing on discriminative temporal patterns.
Abstract: Multivariate Time-Series (MTS) clustering discovers intrinsic grouping patterns of temporal data samples. Although time-series provide rich discriminative information, they also contain substantial redundancy, such as steady-state machine operation records and zero-output periods of solar power generation. Such redundancy diminishes the attention given to discriminative timestamps in representation learning, thus leading to performance bottlenecks in MTS clustering. Masking has been widely adopted to enhance the MTS representation, where temporal reconstruction tasks are designed to capture critical information from MTS. However, most existing masking strategies appear to be standalone preprocessing steps, isolated from the learning process, which hinders dynamic adaptation to the importance of clustering-critical timestamps. Accordingly, this paper proposes the Evolving-masked MTS Clustering (EMTC) method, whose model architecture comprises Importance-aware Variate-wise Masking (IVM) and Multi-Endogenous Views (MEV) generation modules. IVM adaptively guides the model in learning more discriminative representations for clustering, while the reconstruction and cluster-guided contrastive learning pathways enhance and connect the representation learning to clustering tasks. Extensive experiments on 15 benchmark datasets demonstrate the superiority of EMTC over eight SOTA methods, where the EMTC achieves an average improvement of 4.85% in F1-Score over the strongest baselines.
[795] An Adaptive Resonance Theory-based Topological Clustering Algorithm with a Self-Adjusting Vigilance Parameter
Naoki Masuyama, Yuichiro Toda, Yusuke Nojima, Hisao Ishibuchi
Main category: cs.LG
TL;DR: Proposes an ART-based topological clustering algorithm with diversity-driven adaptation for hyperparameter-free learning in stationary/nonstationary environments, outperforming SOTA methods.
Details
Motivation: Need for clustering models that can adapt to distributional shifts in dynamic environments while preserving learned cluster structures and avoiding catastrophic forgetting.Method: Adaptive Resonance Theory (ART)-based topological clustering with diversity-driven adaptation mechanism that autonomously adjusts recalculation interval and vigilance threshold.
Result: Outperforms state-of-the-art methods on 24 real-world datasets in both clustering performance and continual learning capability.
Conclusion: The proposed parameter adaptation effectively mitigates catastrophic forgetting and maintains consistent clustering in evolving data streams.
Abstract: Clustering in stationary and nonstationary settings, where data distributions remain static or evolve over time, requires models that can adapt to distributional shifts while preserving previously learned cluster structures. This paper proposes an Adaptive Resonance Theory (ART)-based topological clustering algorithm that autonomously adjusts its recalculation interval and vigilance threshold through a diversity-driven adaptation mechanism. This mechanism enables hyperparameter-free learning that maintains cluster stability and continuity in dynamic environments. Experiments on 24 real-world datasets demonstrate that the proposed algorithm outperforms state-of-the-art methods in both clustering performance and continual learning capability. These results highlight the effectiveness of the proposed parameter adaptation in mitigating catastrophic forgetting and maintaining consistent clustering in evolving data streams. Source code is available at https://github.com/Masuyama-lab/IDAT
[796] Subgoal Graph-Augmented Planning for LLM-Guided Open-World Reinforcement Learning
Shanwei Fan, Bin Zhang, Zhiwei Xu, Yingxuan Teng, Siqi Dai, Lin Cheng, Guoliang Fan
Main category: cs.LG
TL;DR: SGA-ACR framework improves RL planning by using environment-specific subgoal graphs and multi-LLM pipeline to generate executable subgoals, addressing LLM planning-execution misalignment.
Details
Motivation: LLMs have strong high-level planning capabilities for RL but suffer from poor planning-execution alignment due to two limitations: (1) LLMs produce semantically plausible but infeasible subgoals due to insufficient environment grounding, and (2) single-LLM planning conflates generation with self-verification, leading to overconfident yet unreliable subgoals.Method: Proposes Subgoal Graph-Augmented Actor-Critic-Refiner (SGA-ACR) framework that integrates: (1) environment-specific subgoal graph and structured entity knowledge, (2) multi-LLM planning pipeline that explicitly separates generation, critique, and refinement to produce executable subgoals, and (3) subgoal tracker that monitors execution progress, provides auxiliary rewards, and adaptively updates the subgoal graph.
Result: Experimental results on 22 diverse tasks in the open-world game “Crafter” demonstrate the effectiveness of the proposed method.
Conclusion: SGA-ACR successfully addresses LLM planning-execution misalignment by combining environment-specific knowledge structures with a multi-stage LLM planning approach, enabling more reliable and executable subgoal generation for reinforcement learning.
Abstract: Large language models (LLMs) offer strong high-level planning capabilities for reinforcement learning (RL) by decomposing tasks into subgoals. However, their practical utility is limited by poor planning-execution alignment, which reflects a critical gap between abstract plans and actionable, environment-compatible behaviors. This misalignment arises from two interrelated limitations: (1) LLMs often produce subgoals that are semantically plausible but infeasible or irrelevant in the target environment due to insufficient grounding in environment-specific knowledge, and (2) single-LLM planning conflates generation with self-verification, resulting in overconfident yet unreliable subgoals that frequently fail during execution. To address these challenges, we propose Subgoal Graph-Augmented Actor-Critic-Refiner (SGA-ACR), a framework that integrates an environment-specific subgoal graph and structured entity knowledge with a multi-LLM planning pipeline that explicitly separates generation, critique, and refinement to produce executable and verifiable subgoals. A subgoal tracker further monitors execution progress, provides auxiliary rewards, and adaptively updates the subgoal graph to maintain alignment between plans and actions. Experimental results on 22 diverse tasks in the open-world game “Crafter” demonstrate the effectiveness of our proposed method.
[797] Does the Model Say What the Data Says? A Simple Heuristic for Model Data Alignment
Henry Salgado, Meagan R. Kendall, Martine Ceberio
Main category: cs.LG
TL;DR: Proposes a simple framework to evaluate if ML models align with data structure by comparing data-derived feature rankings with model explanations.
Details
Motivation: Existing interpretability methods focus on explaining model behavior but lack a baseline from the data itself to assess whether models truly reflect what the data says.Method: Uses Rubin’s Potential Outcomes Framework to quantify how strongly each feature separates outcome groups in binary classification, creating data-derived feature rankings that are compared with model-based explanations.
Result: Provides an interpretable, model-agnostic method for assessing model-data alignment by establishing a data-driven baseline for comparison with model behavior.
Conclusion: The framework offers practitioners a simple, computationally efficient way to evaluate whether machine learning models say what the data says, moving beyond traditional descriptive statistics to assess model-data alignment.
Abstract: In this work, we propose a simple and computationally efficient framework for evaluating whether machine learning models align with the structure of the data they learn from; that is, whether the model says what the data says. Unlike existing interpretability methods that focus exclusively on explaining model behavior, our approach establishes a baseline derived directly from the data itself. Drawing inspiration from Rubin’s Potential Outcomes Framework, we quantify how strongly each feature separates the two outcome groups in a binary classification task, moving beyond traditional descriptive statistics to estimate each feature’s effect on the outcome. By comparing these data-derived feature rankings with model-based explanations, we provide practitioners with an interpretable and model-agnostic method for assessing model-data alignment.
[798] Neural Tucker Convolutional Network for Water Quality Analysis
Hongnan Si, Tong Li, Yujie Chen, Xin Liao
Main category: cs.LG
TL;DR: NTCN model uses Tucker decomposition and 3D convolution for water quality data imputation, outperforming state-of-the-art methods.
Details
Motivation: Water quality monitoring suffers from data missing due to sensor failures, creating challenges for analysis that require effective imputation methods.Method: Neural Tucker Convolutional Network (NTCN) encodes different mode entities into embedding vectors, constructs Tucker interaction tensor via outer product to capture complex mode-wise feature interactions, and uses 3D convolution to extract spatiotemporal features.
Result: Experiments on three real-world water quality datasets demonstrate NTCN outperforms several state-of-the-art imputation models in accuracy.
Conclusion: NTCN provides an effective solution for water quality data imputation by combining Tucker decomposition with convolutional neural networks to handle missing data in ecological monitoring.
Abstract: Water quality monitoring is a core component of ecological environmental protection. However, due to sensor failure or other inevitable factors, data missing often exists in long-term monitoring, posing great challenges in water quality analysis. This paper proposes a Neural Tucker Convolutional Network (NTCN) model for water quality data imputation, which features the following key components: a) Encode different mode entities into respective embedding vectors, and construct a Tucker interaction tensor by outer product operations to capture the complex mode-wise feature interactions; b) Use 3D convolution to extract fine-grained spatiotemporal features from the interaction tensor. Experiments on three real-world water quality datasets show that the proposed NTCN model outperforms several state-of-the-art imputation models in terms of accuracy.
[799] UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
Main category: cs.LG
TL;DR: UniQL is a unified framework for post-training quantization and low-rank compression of LLMs for mobile deployment, enabling configurable on-device pruning up to 35% with 4x-5.7x memory reduction and 2.7x-3.4x throughput improvement.
Details
Motivation: Deploying large language models on mobile devices is challenging due to limited memory and shared computational resources, with resource availability further complicated by variable device workloads.Method: UniQL integrates quantization and low-rank compression for Transformers, SSMs, and hybrid models. It features efficient structured weight-sorting (20x speedup), quantization-aware SVD, state-aware weight sorting for SSMs, and fused RoPE kernel for pruned models.
Result: Achieves 4x-5.7x memory reduction and 2.7x-3.4x token-throughput improvement while maintaining accuracy within 5% of original models at 15% pruning across diverse architectures (Llama3, Qwen2.5, Mamba2, Nemotron-H, Bamba-v2).
Conclusion: UniQL provides an effective unified framework for compressing LLMs for edge deployment, enabling on-device configurable pruning with significant efficiency gains while preserving model accuracy.
Abstract: Deploying large language models (LLMs) on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability may be an issue as it is directly impacted by the current device workload, adding to the uncertainty of model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. In our proposed joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. Our framework performs weight-sorting, fine-tuning, and quantization in the cloud in a single-pass workflow, while enabling on-device configurable pruning rates up to 35%. Our experiments show that quantized and pruned models achieve a memory reduction of 4x-5.7x and a token-throughput improvement of 2.7x-3.4x, maintaining accuracy within 5% of the original models at 15% pruning across Transformers (Llama3 and Qwen2.5), SSMs (Mamba2), and hybrid models (Nemotron-H and Bamba-v2). The code and quantized models are available at: https://github.com/enyac-group/UniQL.
[800] SweetDeep: A Wearable AI Solution for Real-Time Non-Invasive Diabetes Screening
Ian Henriques, Lynda Elhassar, Sarvesh Relekar, Denis Walrave, Shayan Hassantabar, Vishu Ghanakota, Adel Laoui, Mahmoud Aich, Rafia Tir, Mohamed Zerguine, Samir Louafi, Moncef Kimouche, Emmanuel Cosson, Niraj K Jha
Main category: cs.LG
TL;DR: SweetDeep is a lightweight neural network that uses wearable sensor data from Samsung Galaxy Watch 7 to detect type 2 diabetes with 82.5% accuracy in free-living conditions.
Details
Motivation: Type 2 diabetes is rising globally, but current diagnostic methods (biochemical assays) are invasive and costly. There's a need for scalable, cost-effective screening methods using consumer wearables.Method: Developed SweetDeep, a compact neural network with <3,000 parameters trained on physiological and demographic data from 285 participants (diabetic and non-diabetic) in EU and MENA regions. Data collected using Samsung Galaxy Watch 7 devices over 6 days in free-living conditions, with each participant providing ~20 recordings of 2-minute sensor data per day.
Result: Achieved 82.5% patient-level accuracy (82.1% macro-F1, 79.7% sensitivity, 84.6% specificity) under three-fold cross-validation. Expected calibration error of 5.5%. When allowing abstention on <10% of low-confidence predictions, accuracy improved to 84.5% on remaining patients.
Conclusion: Combining engineered features with lightweight neural architectures enables accurate, rapid, and generalizable type 2 diabetes detection in real-world wearable settings, offering a scalable alternative to invasive biochemical testing.
Abstract: The global rise in type 2 diabetes underscores the need for scalable and cost-effective screening methods. Current diagnosis requires biochemical assays, which are invasive and costly. Advances in consumer wearables have enabled early explorations of machine learning-based disease detection, but prior studies were limited to controlled settings. We present SweetDeep, a compact neural network trained on physiological and demographic data from 285 (diabetic and non-diabetic) participants in the EU and MENA regions, collected using Samsung Galaxy Watch 7 devices in free-living conditions over six days. Each participant contributed multiple 2-minute sensor recordings per day, totaling approximately 20 recordings per individual. Despite comprising fewer than 3,000 parameters, SweetDeep achieves 82.5% patient-level accuracy (82.1% macro-F1, 79.7% sensitivity, 84.6% specificity) under three-fold cross-validation, with an expected calibration error of 5.5%. Allowing the model to abstain on less than 10% of low-confidence patient predictions yields an accuracy of 84.5% on the remaining patients. These findings demonstrate that combining engineered features with lightweight architectures can support accurate, rapid, and generalizable detection of type 2 diabetes in real-world wearable settings.
[801] CoGraM: Context-sensitive granular optimization method with rollback for robust model fusion
Julius Lenz
Main category: cs.LG
TL;DR: CoGraM is a multi-stage, context-sensitive merging method that improves federated learning by aligning decisions with loss differences and preventing harmful updates through rollback, outperforming traditional methods like weight averaging and Fisher merging.
Details
Motivation: Current neural network merging methods (weight averaging, Fisher merging) in federated/distributed learning suffer from accuracy loss and instability across different training seeds, creating a need for more robust merging techniques.Method: CoGraM is a multi-stage, context-sensitive, loss-based iterative optimization method that operates across layers, neurons, and weight levels. It aligns merging decisions with loss differences and thresholds, and includes rollback mechanisms to prevent harmful updates.
Result: CoGraM significantly improves merged network performance compared to traditional methods like Fisher merging, addressing their weaknesses in accuracy and stability.
Conclusion: CoGraM provides a superior approach to neural network merging in federated/distributed learning settings, offering better accuracy and stability than existing methods through its context-sensitive, multi-stage optimization with rollback protection.
Abstract: Merging neural networks without retraining is central to federated and distributed learning. Common methods such as weight averaging or Fisher merging often lose accuracy and are unstable across seeds. CoGraM (Contextual Granular Merging) is a multi-stage, context-sensitive, loss-based, and iterative optimization method across layers, neurons, and weight levels that aligns decisions with loss differences and thresholds and prevents harmful updates through rollback. CoGraM is an optimization method that addresses the weaknesses of methods such as Fisher and can significantly improve the merged network.
[802] RGE-GCN: Recursive Gene Elimination with Graph Convolutional Networks for RNA-seq based Early Cancer Detection
Shreyas Shende, Varsha Narayanan, Vishal Fenn, Yiran Huang, Dincer Goksuluk, Gaurav Choudhary, Melih Agraz, Mengjia Xu
Main category: cs.LG
TL;DR: RGE-GCN combines graph convolutional networks with recursive gene elimination for biomarker discovery from RNA-seq data, achieving better performance than standard tools.
Details
Motivation: Early cancer detection requires reliable biomarkers, but RNA-seq data is high-dimensional and conventional statistical methods fail to capture complex gene relationships.Method: Builds graph from gene expression profiles, uses Graph Convolutional Network for classification, applies Integrated Gradients to identify informative genes, and recursively removes less relevant genes to converge on compact biomarker set.
Result: Achieved higher accuracy and F1-scores than DESeq2, edgeR, and limma-voom across synthetic data and real-world RNA-seq cohorts of lung, kidney, and cervical cancers. Selected genes aligned with known cancer pathways.
Conclusion: RGE-GCN shows promise as a generalizable approach for RNA-seq based early cancer detection and biomarker discovery, providing both interpretable and predictive biomarkers.
Abstract: Early detection of cancer plays a key role in improving survival rates, but identifying reliable biomarkers from RNA-seq data is still a major challenge. The data are high-dimensional, and conventional statistical methods often fail to capture the complex relationships between genes. In this study, we introduce RGE-GCN (Recursive Gene Elimination with Graph Convolutional Networks), a framework that combines feature selection and classification in a single pipeline. Our approach builds a graph from gene expression profiles, uses a Graph Convolutional Network to classify cancer versus normal samples, and applies Integrated Gradients to highlight the most informative genes. By recursively removing less relevant genes, the model converges to a compact set of biomarkers that are both interpretable and predictive. We evaluated RGE-GCN on synthetic data as well as real-world RNA-seq cohorts of lung, kidney, and cervical cancers. Across all datasets, the method consistently achieved higher accuracy and F1-scores than standard tools such as DESeq2, edgeR, and limma-voom. Importantly, the selected genes aligned with well-known cancer pathways including PI3K-AKT, MAPK, SUMOylation, and immune regulation. These results suggest that RGE-GCN shows promise as a generalizable approach for RNA-seq based early cancer detection and biomarker discovery (https://rce-gcn.streamlit.app/ ).
[803] EfficientECG: Cross-Attention with Feature Fusion for Efficient Electrocardiogram Classification
Hanhui Deng, Xinglin Li, Jie Luo, Di Wu
Main category: cs.LG
TL;DR: The paper proposes EfficientECG, a lightweight deep learning model for accurate ECG classification that handles high-frequency long-sequence data and incorporates multi-feature fusion using cross-attention mechanisms.
Details
Motivation: ECG is a valuable diagnostic tool but existing models have high misdiagnosis rates. There's a need for accurate, automated ECG analysis to reduce medical worker burden and leverage ECG's rich information for emerging applications.Method: Two-stage approach: 1) EfficientECG - a lightweight classification model based on EfficientNet architecture for handling high-frequency long-sequence ECG data with various lead types; 2) Cross-attention-based feature fusion model that integrates multi-lead ECG data with additional features like gender and age.
Result: Evaluations on representative ECG datasets show the model outperforms state-of-the-art works in precision, multi-feature fusion capability, and lightweight design.
Conclusion: The proposed deep learning approach provides an accurate, efficient solution for ECG analysis that can automatically extract features and handle complex multi-feature data, potentially reducing diagnostic burden in healthcare.
Abstract: Electrocardiogram is a useful diagnostic signal that can detect cardiac abnormalities by measuring the electrical activity generated by the heart. Due to its rapid, non-invasive, and richly informative characteristics, ECG has many emerging applications. In this paper, we study novel deep learning technologies to effectively manage and analyse ECG data, with the aim of building a diagnostic model, accurately and quickly, that can substantially reduce the burden on medical workers. Unlike the existing ECG models that exhibit a high misdiagnosis rate, our deep learning approaches can automatically extract the features of ECG data through end-to-end training. Specifically, we first devise EfficientECG, an accurate and lightweight classification model for ECG analysis based on the existing EfficientNet model, which can effectively handle high-frequency long-sequence ECG data with various leading types. On top of that, we next propose a cross-attention-based feature fusion model of EfficientECG for analysing multi-lead ECG data with multiple features (e.g., gender and age). Our evaluations on representative ECG datasets validate the superiority of our model against state-of-the-art works in terms of high precision, multi-feature fusion, and lightweights.
[804] Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs
Oren Rachmil, Roy Betser, Itay Gershon, Omer Hofman, Nitay Yakoby, Yuval Meron, Idan Yankelev, Asaf Shabtai, Yuval Elovici, Roman Vainshtein
Main category: cs.LG
TL;DR: A training-free method for detecting policy violations in LLMs by treating it as an out-of-distribution detection problem using whitening transformations on hidden activations.
Details
Motivation: Organizations need reliable mechanisms to detect policy violations when deploying LLMs in sensitive domains (legal, finance, medical), where breaches can cause legal and reputational risks. Existing methods lack robustness for nuanced organizational policies or are too slow/latent.Method: Proposes a training-free approach that treats policy violation detection as OOD detection. Uses whitening techniques to apply linear transformation that decorrelates hidden activations and standardizes them to zero mean/unit variance, then uses Euclidean norm as compliance score in transformed space.
Result: Achieves state-of-the-art results on a challenging policy benchmark, surpassing both existing guardrails and fine-tuned reasoning models.
Conclusion: Provides organizations with a practical, statistically grounded framework for policy-aware oversight of LLMs, advancing deployable AI governance. The method is lightweight, easily deployable, and requires only policy text and a few illustrative samples.
Abstract: Aligning proprietary large language models (LLMs) with internal organizational policies has become an urgent priority as organizations increasingly deploy LLMs in sensitive domains such as legal support, finance, and medical services. Beyond generic safety filters, enterprises require reliable mechanisms to detect policy violations within their regulatory and operational frameworks, where breaches can trigger legal and reputational risks. Existing content moderation frameworks, such as guardrails, remain largely confined to the safety domain and lack the robustness to capture nuanced organizational policies. LLM-as-a-judge and fine-tuning approaches, though flexible, introduce significant latency and lack interpretability. To address these limitations, we propose a training-free and efficient method that treats policy violation detection as an out-of-distribution (OOD) detection problem. Inspired by whitening techniques, we apply a linear transformation to decorrelate the model’s hidden activations and standardize them to zero mean and unit variance, yielding near-identity covariance matrix. In this transformed space, we use the Euclidean norm as a compliance score to detect policy violations. The method requires only the policy text and a small number of illustrative samples, which makes it light-weight and easily deployable. On a challenging policy benchmark, our approach achieves state-of-the-art results, surpassing both existing guardrails and fine-tuned reasoning models. This work provides organizations with a practical and statistically grounded framework for policy-aware oversight of LLMs, advancing the broader goal of deployable AI governance. Code is available at: https://tinyurl.com/policy-violation-detection
[805] TV2TV: A Unified Framework for Interleaved Language and Video Generation
Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan
Main category: cs.LG
TL;DR: TV2TV is a novel video generation framework that interleaves text and video generation using a Mixture-of-Transformers architecture, allowing the model to “think in words” before “acting in pixels” for improved quality and controllability.
Details
Motivation: Current video generation models struggle with complex outputs requiring semantic branching and high-level reasoning about what should happen next in videos.Method: TV2TV decomposes video generation into interleaved text and video generation using a Mixture-of-Transformers architecture that jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction). The model autonomously decides when to switch between generating text descriptions and video frames.
Result: TV2TV demonstrates substantial improvements in visual quality and prompt alignment on video game data, and scales to natural videos (sports) with strong visual quality and prompt alignment when trained on vision-language augmented data.
Conclusion: TV2TV represents a promising step toward video generation with open-ended textual reasoning and fine-grained controllability through text interventions during generation.
Abstract: Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to “think in words” about subsequent content before ``acting in pixels’’ to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model’s ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
[806] Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity
Noa Rubin, Orit Davidovich, Zohar Ringel
Main category: cs.LG
TL;DR: The paper proposes a simpler heuristic scale analysis method to predict when feature learning emerges in deep networks, avoiding complex numerical solutions required by current theories.
Details
Motivation: Current theories of rich feature learning in deep networks require solving computationally intensive high-dimensional non-linear equations, making analytical complexity a significant challenge. There's a need for simpler methods to predict feature learning emergence.Method: Proposes a heuristic scale analysis approach that predicts data and width scales at which various patterns of feature learning emerge. This method is simpler than exact theories and can handle complex architectures.
Result: The scale analysis reproduces scaling exponents of known results and makes novel predictions for complex toy architectures like three-layer non-linear networks and attention heads.
Conclusion: The proposed heuristic scale analysis provides a powerful, simpler alternative to computationally intensive exact theories, extending the scope of first-principle theories of deep learning to more complex architectures.
Abstract: Two pressing topics in the theory of deep learning are the interpretation of feature learning mechanisms and the determination of implicit bias of networks in the rich regime. Current theories of rich feature learning, often appear in the form of high-dimensional non-linear equations, which require computationally intensive numerical solutions. Furthermore, even under such limiting settings, predictions often appear in the form of high-dimensional non-linear equations, which require computationally intensive numerical solutions. Given the many details that go into defining a deep learning problem, this analytical complexity is a significant and often unavoidable challenge. Here, we propose a powerful heuristic route for predicting the data and width scales at which various patterns of feature learning emerge. This form of scale analysis is considerably simpler than such exact theories and reproduces the scaling exponents of various known results. In addition, we make novel predictions on complex toy architectures, such as three-layer non-linear networks and attention heads, thus extending the scope of first-principle theories of deep learning.
[807] The Universal Weight Subspace Hypothesis
Prakhar Kaushik, Shravan Chaudhari, Ankit Vaidya, Rama Chellappa, Alan Yuille
Main category: cs.LG
TL;DR: Deep neural networks across diverse tasks converge to remarkably similar low-dimensional parametric subspaces, as shown through spectral analysis of over 1100 models including Mistral-7B LoRAs, Vision Transformers, and LLaMA-8B models.
Details
Motivation: To investigate whether neural networks trained on different tasks and domains exhibit shared structural patterns, and to understand the intrinsic organization of information within deep networks.Method: Conducted mode-wise spectral analysis on over 1100 models (500 Mistral-7B LoRAs, 500 Vision Transformers, and 50 LLaMA-8B models) using spectral decomposition techniques on weight matrices across various architectures trained on diverse tasks and datasets.
Result: Found that neural networks systematically converge to shared spectral subspaces regardless of initialization, task, or domain. Identified universal subspaces capturing majority variance in just a few principal directions, revealing sparse, joint subspaces consistently exploited across diverse tasks and datasets.
Conclusion: The discovery of universal low-dimensional subspaces offers new insights into neural network organization and has significant implications for model reusability, multi-task learning, model merging, and efficient algorithm development, potentially reducing the computational and environmental costs of large-scale neural models.
Abstract: We show that deep neural networks trained across diverse tasks exhibit remarkably similar low-dimensional parametric subspaces. We provide the first large-scale empirical evidence that demonstrates that neural networks systematically converge to shared spectral subspaces regardless of initialization, task, or domain. Through mode-wise spectral analysis of over 1100 models - including 500 Mistral-7B LoRAs, 500 Vision Transformers, and 50 LLaMA-8B models - we identify universal subspaces capturing majority variance in just a few principal directions. By applying spectral decomposition techniques to the weight matrices of various architectures trained on a wide range of tasks and datasets, we identify sparse, joint subspaces that are consistently exploited, within shared architectures across diverse tasks and datasets. Our findings offer new insights into the intrinsic organization of information within deep networks and raise important questions about the possibility of discovering these universal subspaces without the need for extensive data and computational resources. Furthermore, this inherent structure has significant implications for model reusability, multi-task learning, model merging, and the development of training and inference-efficient algorithms, potentially reducing the carbon footprint of large-scale neural models.
[808] Data-regularized Reinforcement Learning for Diffusion Models at Scale
Haotian Ye, Kaiwen Zheng, Jiashu Xu, Puheng Li, Huayu Chen, Jiaqi Han, Sheng Liu, Qinsheng Zhang, Hanzi Mao, Zekun Hao, Prithvijit Chattopadhyay, Dinghao Yang, Liang Feng, Maosheng Liao, Junjie Bai, Ming-Yu Liu, James Zou, Stefano Ermon
Main category: cs.LG
TL;DR: DDRL is a new RL framework for aligning diffusion models with human preferences that prevents reward hacking by anchoring policies to off-policy data using forward KL divergence.
Details
Motivation: Existing RL methods for aligning diffusion models with human preferences suffer from reward hacking problems like quality degradation, over-stylization, and reduced diversity due to unreliable regularization penalties.Method: Data-regularized Diffusion Reinforcement Learning (DDRL) uses forward KL divergence to anchor the policy to an off-policy data distribution, combining reward maximization with diffusion loss minimization.
Result: With over 1M GPU hours of experiments and 10K double-blind human evaluations on high-resolution video generation, DDRL significantly improves rewards while alleviating reward hacking, achieving the highest human preference scores.
Conclusion: DDRL establishes a robust and scalable paradigm for diffusion post-training by enabling unbiased integration of RL with standard diffusion training through data regularization.
Abstract: Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Most existing algorithms are often vulnerable to reward hacking, such as quality degradation, over-stylization, or reduced diversity. Our analysis demonstrates that this can be attributed to the inherent limitations of their regularization, which provides unreliable penalties. We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. Theoretically, DDRL enables robust, unbiased integration of RL with standard diffusion training. Empirically, this translates into a simple yet effective algorithm that combines reward maximization with diffusion loss minimization. With over a million GPU hours of experiments and ten thousand double-blind human evaluations, we demonstrate on high-resolution video generation tasks that DDRL significantly improves rewards while alleviating the reward hacking seen in baselines, achieving the highest human preference and establishing a robust and scalable paradigm for diffusion post-training.
[809] Exploiting ftrace’s function_graph Tracer Features for Machine Learning: A Case Study on Encryption Detection
Kenan Begovic, Abdulaziz Al-Ali, Qutaibah Malluhi
Main category: cs.LG
TL;DR: Using Linux kernel ftrace function graph tracer to generate system-level features for ML applications, achieving 99.28% accuracy in encryption detection and demonstrating effectiveness in program identification tasks.
Details
Motivation: Bridging the gap between system tracing and machine learning to enable innovative solutions in performance monitoring, security analytics, and system behavior analysis by leveraging low-level system trace data for ML applications.Method: Utilizes Linux kernel ftrace framework, specifically the function graph tracer, to collect system call traces. Develops comprehensive methodologies for preprocessing raw trace data and extracting graph-based features from function call traces for machine learning applications.
Result: Achieved outstanding 99.28% accuracy in encryption detection task across multiple learning algorithms. Successfully validated approach in additional multilabel classification experiment for program identification from trace data.
Conclusion: The ftrace function graph tracer provides effective system-level features for ML applications, enabling significant advancements in system behavior analysis, program identification, and anomaly detection through the integration of system tracing with machine learning.
Abstract: This paper proposes using the Linux kernel ftrace framework, particularly the function graph tracer, to generate informative system level data for machine learning (ML) applications. Experiments on a real world encryption detection task demonstrate the efficacy of the proposed features across several learning algorithms. The learner faces the problem of detecting encryption activities across a large dataset of files, using function call traces and graph based features. Empirical results highlight an outstanding accuracy of 99.28 on the task at hand, underscoring the efficacy of features derived from the function graph tracer. The results were further validated in an additional experiment targeting a multilabel classification problem, in which running programs were identified from trace data. This work provides comprehensive methodologies for preprocessing raw trace data and extracting graph based features, offering significant advancements in applying ML to system behavior analysis, program identification, and anomaly detection. By bridging the gap between system tracing and ML, this paper paves the way for innovative solutions in performance monitoring and security analytics.
[810] Utility Boundary of Dataset Distillation: Scaling and Configuration-Coverage Laws
Zhengquan Luo, Zhiqiang Xu
Main category: cs.LG
TL;DR: The paper proposes a unified theoretical framework for dataset distillation that explains performance saturation and configuration robustness through scaling and coverage laws.
Details
Motivation: Dataset distillation lacks theoretical foundations - existing methods use different surrogate objectives without understanding common principles or guarantees. It's unclear when distilled data remains effective across different training configurations.Method: Proposed configuration-dynamics-error analysis framework that reformulates major DD approaches under a common generalization-error perspective. Provides two main theoretical results: scaling law (error decreases with distilled sample size) and coverage law (required samples scale linearly with configuration diversity).
Result: Theoretical analysis shows various matching methods are interchangeable surrogates reducing the same generalization error. Experiments confirm the derived laws across diverse methods and configurations.
Conclusion: The unified framework advances theoretical foundation for dataset distillation, enables theory-driven design of compact, configuration-robust distilled datasets, and explains why different methods can all achieve dataset distillation.
Abstract: Dataset distillation (DD) aims to construct compact synthetic datasets that allow models to achieve comparable performance to full-data training while substantially reducing storage and computation. Despite rapid empirical progress, its theoretical foundations remain limited: existing methods (gradient, distribution, trajectory matching) are built on heterogeneous surrogate objectives and optimization assumptions, which makes it difficult to analyze their common principles or provide general guarantees. Moreover, it is still unclear under what conditions distilled data can retain the effectiveness of full datasets when the training configuration, such as optimizer, architecture, or augmentation, changes. To answer these questions, we propose a unified theoretical framework, termed configuration–dynamics–error analysis, which reformulates major DD approaches under a common generalization-error perspective and provides two main results: (i) a scaling law that provides a single-configuration upper bound, characterizing how the error decreases as the distilled sample size increases and explaining the commonly observed performance saturation effect; and (ii) a coverage law showing that the required distilled sample size scales linearly with configuration diversity, with provably matching upper and lower bounds. In addition, our unified analysis reveals that various matching methods are interchangeable surrogates, reducing the same generalization error, clarifying why they can all achieve dataset distillation and providing guidance on how surrogate choices affect sample efficiency and robustness. Experiments across diverse methods and configurations empirically confirm the derived laws, advancing a theoretical foundation for DD and enabling theory-driven design of compact, configuration-robust dataset distillation.
cs.MA
[811] AI-Generated Compromises for Coalition Formation: Modeling, Simulation, and a Textual Case Study
Eyal Briman, Ehud Shapiro, Nimrod Talmon
Main category: cs.MA
TL;DR: AI models for generating compromise proposals in collaborative text editing using NLP/LLMs to create semantic spaces and find majority-supported compromises.
Details
Motivation: The challenge of finding compromise proposals that can unite agent coalitions in democratic processes like collaborative text editing (e.g., drafting constitutions) remains an open problem, especially when traditional tools are limited for large-scale democratic text creation.Method: Formalize a holistic model incorporating agent bounded rationality and uncertainty, apply NLP techniques and LLMs to create semantic metric spaces for text, and develop algorithms to suggest suitable compromise points in these spaces.
Result: Developed AI models that can generate compromise proposals for collaborative text editing, demonstrated through simulations of coalition formation processes showing potential for large-scale democratic text editing applications.
Conclusion: AI techniques using NLP and LLMs can effectively facilitate large-scale democratic text editing by finding compromise proposals that unite agent coalitions, addressing limitations of traditional tools in collaborative constitution drafting and similar processes.
Abstract: The challenge of finding compromises between agent proposals is fundamental to AI sub-fields such as argumentation, mediation, and negotiation. Building on this tradition, Elkind et al. (2021) introduced a process for coalition formation that seeks majority-supported proposals preferable to the status quo, using a metric space where each agent has an ideal point. The crucial step in this iterative process involves identifying compromise proposals around which agent coalitions can unite. How to effectively find such compromise proposals, however, remains an open question. We address this gap by formalizing a holistic model that encompasses agent bounded rationality and uncertainty and developing AI models to generate such compromise proposals. We focus on the domain of collaboratively writing text documents – e.g., to enable the democratic creation of a community constitution. We apply NLP (Natural Language Processing) techniques and utilize LLMs (Large Language Models) to create a semantic metric space for text and develop algorithms to suggest suitable compromise points. To evaluate the effectiveness of our algorithms, we simulate various coalition formation processes and demonstrate the potential of AI to facilitate large-scale democratic text editing, such as collaboratively drafting a constitution, an area where traditional tools are limited.
[812] HiveMind: Contribution-Guided Online Prompt Optimization of LLM Multi-Agent Systems
Yihan Xia, Taotao Wang, Shengli Zhang, Zhangyuhua Weng, Bin Cao, Soung Chang Liew
Main category: cs.MA
TL;DR: HiveMind is a self-adaptive framework that optimizes LLM multi-agent collaboration using contribution analysis and automated prompt refinement, with DAG-Shapley reducing computational costs by 80% while maintaining accuracy.
Details
Motivation: Current LLM-based multi-agent systems lack effective methods for evaluating individual agent effectiveness and optimizing underperforming agents in real-time, limiting their practical application in complex decision-making scenarios like financial trading.Method: HiveMind introduces Contribution-Guided Online Prompt Optimization (CG-OPO) using Shapley values to quantify agent contributions. To address computational complexity, they propose DAG-Shapley, which leverages the Directed Acyclic Graph structure of agent workflows to prune non-viable coalitions and reuse intermediate outputs.
Result: In multi-agent stock-trading scenarios, HiveMind outperforms static baselines. DAG-Shapley reduces LLM calls by over 80% while maintaining attribution accuracy comparable to full Shapley values.
Conclusion: HiveMind establishes a new standard for efficient credit assignment in multi-agent systems, enabling scalable, real-world optimization of LLM-based agent collaboration through principled contribution analysis and automated prompt refinement.
Abstract: Recent advances in LLM-based multi-agent systems have demonstrated remarkable capabilities in complex decision-making scenarios such as financial trading and software engineering. However, evaluating each individual agent’s effectiveness and online optimization of underperforming agents remain open challenges. To address these issues, we present HiveMind, a self-adaptive framework designed to optimize LLM multi-agent collaboration through contribution analysis. At its core, HiveMind introduces Contribution-Guided Online Prompt Optimization (CG-OPO), which autonomously refines agent prompts based on their quantified contributions. We first propose the Shapley value as a grounded metric to quantify each agent’s contribution, thereby identifying underperforming agents in a principled manner for automated prompt refinement. To overcome the computational complexity of the classical Shapley value, we present DAG-Shapley, a novel and efficient attribution algorithm that leverages the inherent Directed Acyclic Graph structure of the agent workflow to axiomatically prune non-viable coalitions. By hierarchically reusing intermediate outputs of agents in the DAG, our method further reduces redundant computations, and achieving substantial cost savings without compromising the theoretical guarantees of Shapley values. Evaluated in a multi-agent stock-trading scenario, HiveMind achieves superior performance compared to static baselines. Notably, DAG-Shapley reduces LLM calls by over 80% while maintaining attribution accuracy comparable to full Shapley values, establishing a new standard for efficient credit assignment and enabling scalable, real-world optimization of multi-agent collaboration.
[813] ChargingBoul: A Competitive Negotiating Agent with Novel Opponent Modeling
Joe Shymanski
Main category: cs.MA
TL;DR: ChargingBoul is a lightweight automated negotiating agent that placed 2nd in ANAC 2022 by balancing concession strategies with opponent modeling, achieving high utility through dynamic bid pattern analysis and late-stage concessions.
Details
Motivation: Automated negotiation is crucial for multiagent systems in e-commerce, resource allocation, and autonomous decision-making. There's a need for effective negotiating agents that can perform well in competitive environments like ANAC.Method: ChargingBoul uses a lightweight strategy combining concession management and opponent modeling. It classifies opponents based on bid patterns, dynamically adjusts bidding strategy, and applies concession policies in later negotiation stages to maximize utility while ensuring agreements.
Result: ChargingBoul placed second in individual utility in the 2022 ANAC competition by an exceptionally narrow margin. Subsequent studies have validated its effectiveness across diverse opponent strategies, demonstrating strong performance in automated negotiation scenarios.
Conclusion: ChargingBoul represents an effective approach to automated negotiation that balances concession and opponent modeling. While successful, potential enhancements include more sophisticated opponent modeling and adaptive bidding heuristics to further improve performance.
Abstract: Automated negotiation has emerged as a critical area of research in multiagent systems, with applications spanning e-commerce, resource allocation, and autonomous decision-making. This paper presents ChargingBoul, a negotiating agent that competed in the 2022 Automated Negotiating Agents Competition (ANAC) and placed second in individual utility by an exceptionally narrow margin. ChargingBoul employs a lightweight yet effective strategy that balances concession and opponent modeling to achieve high negotiation outcomes. The agent classifies opponents based on bid patterns, dynamically adjusts its bidding strategy, and applies a concession policy in later negotiation stages to maximize utility while fostering agreements. We evaluate ChargingBoul’s performance using competition results and subsequent studies that have utilized the agent in negotiation research. Our analysis highlights ChargingBoul’s effectiveness across diverse opponent strategies and its contributions to advancing automated negotiation techniques. We also discuss potential enhancements, including more sophisticated opponent modeling and adaptive bidding heuristics, to improve its performance further.
[814] Analyzing Collision Rates in Large-Scale Mixed Traffic Control via Multi-Agent Reinforcement Learning
Muyang Fan
Main category: cs.MA
TL;DR: This paper investigates factors influencing collision rates in MARL-based mixed traffic control systems with human-driven and robotic vehicles, focusing on traffic density, intersection configurations, and turning strategies.
Details
Motivation: Vehicle collisions remain a major challenge in mixed traffic systems with human-driven and robotic vehicles. While MARL shows promise for traffic signal control, ensuring safety remains difficult, and collision rates need to be better understood and incorporated into traffic control design.Method: The study investigates three dimensions: total vehicle count, signalized vs. unsignalized intersection configurations, and turning-movement strategies. Controlled simulation experiments evaluate how each factor affects collision likelihood in MARL-governed Mixed Traffic Control networks.
Result: Collision rates are sensitive to traffic density, the level of signal coordination, and turning-control design. Higher traffic density increases collision risk, while coordinated signalization and well-designed turning strategies reduce collisions.
Conclusion: The findings provide practical insights for improving safety and robustness of MARL-based mixed traffic control systems, supporting development of intelligent transportation systems where efficiency and safety are jointly optimized.
Abstract: Vehicle collisions remain a major challenge in large-scale mixed traffic systems, especially when human-driven vehicles (HVs) and robotic vehicles (RVs) interact under dynamic and uncertain conditions. Although Multi-Agent Reinforcement Learning (MARL) offers promising capabilities for traffic signal control, ensuring safety in such environments remains difficult. As a direct indicator of traffic risk, the collision rate must be well understood and incorporated into traffic control design. This study investigates the primary factors influencing collision rates in a MARL-governed Mixed Traffic Control (MTC) network. We examine three dimensions: total vehicle count, signalized versus unsignalized intersection configurations, and turning-movement strategies. Through controlled simulation experiments, we evaluate how each factor affects collision likelihood. The results show that collision rates are sensitive to traffic density, the level of signal coordination, and turning-control design. These findings provide practical insights for improving the safety and robustness of MARL-based mixed traffic control systems, supporting the development of intelligent transportation systems in which both efficiency and safety are jointly optimized.
[815] Characterizing Lane-Changing Behavior in Mixed Traffic
Sungyong Chung, Alireza Talebpour, Samer H. Hamdar
Main category: cs.MA
TL;DR: This study analyzes lane-changing behavior in mixed traffic (AVs + human drivers) using real-world data from Waymo Open Motion Dataset, applying game theory to classify cooperative/defective behaviors and examine social dilemmas.
Details
Motivation: Understanding lane-changing interactions between automated vehicles (AVs) and human-driven vehicles (HDVs) is crucial for safety and efficiency in mixed traffic environments, especially as AV market penetration increases.Method: Used 7,636 lane-changing events from WOMD; applied k-means clustering to classify vehicles as cooperative/defective; used quantal response equilibrium to estimate utilities; constructed empirical payoff tables; analyzed social dilemmas using evolutionary game theory and Monte Carlo simulations.
Result: Found higher proportions of cooperative AVs vs HDVs; identified social dilemmas in ~4% (active) and ~11% (passive) of events (mostly Stag Hunt or Prisoner’s Dilemma); Monte Carlo simulations showed repeated interactions consistently increase cooperative behavior over time regardless of AV penetration rate.
Conclusion: The study provides a game-theoretic framework for analyzing lane-changing interactions, reveals social dilemmas in mixed traffic, and demonstrates that repeated interactions naturally promote cooperative behavior, offering insights for traffic management and AV policy development.
Abstract: Characterizing and understanding lane-changing behavior in the presence of automated vehicles (AVs) is crucial to ensuring safety and efficiency in mixed traffic. Accordingly, this study aims to characterize the interactions between the lane-changing vehicle (active vehicle) and the vehicle directly impacted by the maneuver in the target lane (passive vehicle). Utilizing real-world trajectory data from the Waymo Open Motion Dataset (WOMD), this study explores patterns in lane-changing behavior and provides insight into how these behaviors evolve under different AV market penetration rates (MPRs). In particular, we propose a game-theoretic framework to analyze cooperative and defective behaviors in mixed traffic, applied to the 7,636 observed lane-changing events in the WOMD. First, we utilize k-means clustering to classify vehicles as cooperative or defective, revealing that the proportions of cooperative AVs are higher than those of HDVs in both active and passive roles. Next, we jointly estimate the utilities of active and passive vehicles to model their behaviors using the quantal response equilibrium framework. Empirical payoff tables are then constructed based on these utilities. Using these payoffs, we analyze the presence of social dilemmas and examine the evolution of cooperative behaviors using evolutionary game theory. Our results reveal the presence of social dilemmas in approximately 4% and 11% of lane-changing events for active and passive vehicles, respectively, with most classified as Stag Hunt or Prisoner’s Dilemma (Chicken Game rarely observed). Moreover, the Monte Carlo simulation results show that repeated lane-changing interactions consistently lead to increased cooperative behavior over time, regardless of the AV penetration rate.
[816] Understanding LLM Agent Behaviours via Game Theory: Strategy Recognition, Biases and Multi-Agent Dynamics
Trung-Kiet Huynh, Duy-Minh Dao-Sy, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen, Phu-Quy Nguyen-Lam, Minh-Luan Nguyen-Vo, Hong-Phat Pham, Phu-Hoa Pham, Thien-Kim Than, Chi-Nguyen Tran, Huy Tran, Gia-Thoai Tran-Le, Alessio Buscemi, Le Hong Trang, The Anh Han
Main category: cs.MA
TL;DR: FAIRGAME framework extended to evaluate LLM strategic behavior in repeated social dilemmas, revealing systematic cooperation biases and language-dependent intentions with implications for AI safety and governance.
Details
Motivation: As LLMs increasingly operate as autonomous decision-makers in interactive systems and human societies, understanding their strategic behavior is crucial for safety, coordination, and AI-driven social/economic infrastructure design. Need methods to capture not just LLM outputs but underlying intentions guiding decisions.Method: Extended FAIRGAME framework with two complementary advances: 1) payoff-scaled Prisoner’s Dilemma isolating sensitivity to incentive magnitude, and 2) integrated multi-agent Public Goods Game with dynamic payoffs and multi-agent histories. Trained traditional supervised classification models on canonical repeated-game strategies and applied them to FAIRGAME trajectories.
Result: Revealed consistent behavioral signatures across models and languages: incentive-sensitive cooperation, cross-linguistic divergence, and end-game alignment toward defection. LLMs exhibit systematic, model- and language-dependent behavioral intentions, with linguistic framing sometimes exerting effects as strong as architectural differences.
Conclusion: Provides unified methodological foundation for auditing LLMs as strategic agents and reveals systematic cooperation biases with direct implications for AI governance, collective decision-making, and safe multi-agent system design.
Abstract: As Large Language Models (LLMs) increasingly operate as autonomous decision-makers in interactive and multi-agent systems and human societies, understanding their strategic behaviour has profound implications for safety, coordination, and the design of AI-driven social and economic infrastructures. Assessing such behaviour requires methods that capture not only what LLMs output, but the underlying intentions that guide their decisions. In this work, we extend the FAIRGAME framework to systematically evaluate LLM behaviour in repeated social dilemmas through two complementary advances: a payoff-scaled Prisoners Dilemma isolating sensitivity to incentive magnitude, and an integrated multi-agent Public Goods Game with dynamic payoffs and multi-agent histories. These environments reveal consistent behavioural signatures across models and languages, including incentive-sensitive cooperation, cross-linguistic divergence and end-game alignment toward defection. To interpret these patterns, we train traditional supervised classification models on canonical repeated-game strategies and apply them to FAIRGAME trajectories, showing that LLMs exhibit systematic, model- and language-dependent behavioural intentions, with linguistic framing at times exerting effects as strong as architectural differences. Together, these findings provide a unified methodological foundation for auditing LLMs as strategic agents and reveal systematic cooperation biases with direct implications for AI governance, collective decision-making, and the design of safe multi-agent systems.
[817] Understanding Individual Decision-Making in Multi-Agent Reinforcement Learning: A Dynamical Systems Approach
James Rudd-Jones, María Pérez-Ortiz, Mirco Musolesi
Main category: cs.MA
TL;DR: The paper proposes modeling MARL systems as coupled stochastic dynamical systems to analyze individual agent behavior while accounting for inherent stochasticity, addressing limitations of traditional mean-field approximations.
Details
Motivation: Current MARL analysis faces challenges in studying individual decision-making due to inherent stochasticity (exploration noise, environment transitions, gradient updates). Traditional approaches like replicator dynamics use mean-field approximations that remove stochastic effects, leading to dissonance between analytical predictions and actual individual trajectories.Method: Model MARL systems as coupled stochastic dynamical systems capturing both agent interactions and environmental characteristics. Leverage tools from dynamical systems theory to analyze stability and sensitivity of individual agent behavior.
Result: Provides a framework to rigorously study MARL dynamics while considering inherent stochasticity, enabling deeper understanding of system behavior and practical insights for design and control of multi-agent learning processes.
Conclusion: This novel perspective allows for analysis of individual-level stability and sensitivity in MARL systems, which is crucial for practical deployments with strict safety requirements, bridging the gap between analytical predictions and real-world implementations.
Abstract: Analysing learning behaviour in Multi-Agent Reinforcement Learning (MARL) environments is challenging, in particular with respect to \textit{individual} decision-making. Practitioners frequently tend to study or compare MARL algorithms from a qualitative perspective largely due to the inherent stochasticity in practical algorithms arising from random dithering exploration strategies, environment transition noise, and stochastic gradient updates to name a few. Traditional analytical approaches, such as replicator dynamics, often rely on mean-field approximations to remove stochastic effects, but this simplification, whilst able to provide general overall trends, might lead to dissonance between analytical predictions and actual realisations of individual trajectories. In this paper, we propose a novel perspective on MARL systems by modelling them as \textit{coupled stochastic dynamical systems}, capturing both agent interactions and environmental characteristics. Leveraging tools from dynamical systems theory, we analyse the stability and sensitivity of agent behaviour at individual level, which are key dimensions for their practical deployments, for example, in presence of strict safety requirements. This framework allows us, for the first time, to rigorously study MARL dynamics taking into consideration their inherent stochasticity, providing a deeper understanding of system behaviour and practical insights for the design and control of multi-agent learning processes.
[818] Large Language Models Miss the Multi-Agent Mark
Emanuele La Malfa, Gabriele La Malfa, Samuele Marro, Jie M. Zhang, Elizabeth Black, Michael Luck, Philip Torr, Michael Wooldridge
Main category: cs.MA
TL;DR: Position paper critiques current Multi-Agent LLM systems for misusing MAS terminology without implementing core MAS principles, advocating for better integration of established MAS concepts.
Details
Motivation: Current MAS LLM frameworks appropriate MAS terminology but don't engage with foundational MAS principles, risking field stagnation by revisiting already-solved problems.Method: Systematic analysis of discrepancies between MAS theory and current MAS LLM implementations across four key areas: social agency, environment design, coordination/communication protocols, and emergent behavior measurement.
Result: Identifies that many MAS LLMs lack true multi-agent characteristics like autonomy, social interaction, and structured environments, relying instead on oversimplified LLM-centric architectures.
Conclusion: Advocates for better integration of established MAS concepts and more precise terminology to avoid mischaracterization and missed opportunities in MAS LLM research.
Abstract: Recent interest in Multi-Agent Systems of Large Language Models (MAS LLMs) has led to an increase in frameworks leveraging multiple LLMs to tackle complex tasks. However, much of this literature appropriates the terminology of MAS without engaging with its foundational principles. In this position paper, we highlight critical discrepancies between MAS theory and current MAS LLMs implementations, focusing on four key areas: the social aspect of agency, environment design, coordination and communication protocols, and measuring emergent behaviours. Our position is that many MAS LLMs lack multi-agent characteristics such as autonomy, social interaction, and structured environments, and often rely on oversimplified, LLM-centric architectures. The field may slow down and lose traction by revisiting problems the MAS literature has already addressed. Therefore, we systematically analyse this issue and outline associated research opportunities; we advocate for better integrating established MAS concepts and more precise terminology to avoid mischaracterisation and missed opportunities.
[819] Lark: Biologically Inspired Neuroevolution for Multi-Stakeholder LLM Agents
Dheeraj Chintapalli, Rikhil Tanugula, Sunkalp Chandra
Main category: cs.MA
TL;DR: Lark is a biologically inspired decision-making framework combining LLM reasoning with evolutionary multi-agent systems, using plasticity, duplication/maturation, ranked-choice voting, and token penalties to generate stakeholder-aligned strategies efficiently.
Details
Motivation: To address verbosity and stakeholder trade-offs in decision-making systems, creating a practical framework that balances diverse stakeholder preferences while maintaining computational efficiency and transparency.Method: Four key mechanisms: (1) plasticity for concise solution adjustments, (2) duplication/maturation for copying and specializing high-performing candidates, (3) ranked-choice stakeholder aggregation with influence-weighted Borda scoring, and (4) compute awareness via token-based penalties for brevity. The system iteratively proposes strategies, applies tweaks, simulates evaluations, aggregates preferences, and selects top candidates.
Result: In 30-round evaluation comparing 14 systems, Lark Full achieved mean rank of 2.55 (95% CI [2.17, 2.93]) and mean composite score of 29.4/50 (95% CI [26.34, 32.46]), finishing Top-3 in 80% of rounds while costing $0.016 per task. All four mechanisms contributed significantly, with duplication/maturation ablation causing largest deficit (ΔScore = 3.5).
Conclusion: Lark presents a practical, compute-aware neuroevolutionary loop for scalable stakeholder-aligned strategy generation with transparent trade-offs, offering proof-of-concept findings and inviting community feedback for real-world validation.
Abstract: We present Lark, a biologically inspired decision-making framework that couples LLM-driven reasoning with an evolutionary, stakeholder-aware Multi-Agent System (MAS). To address verbosity and stakeholder trade-offs, we integrate four mechanisms: (i) plasticity, which applies concise adjustments to candidate solutions; (ii) duplication and maturation, which copy high-performing candidates and specialize them into new modules; (iii) ranked-choice stakeholder aggregation using influence-weighted Borda scoring; and (iv) compute awareness via token-based penalties that reward brevity. The system iteratively proposes diverse strategies, applies plasticity tweaks, simulates stakeholder evaluations, aggregates preferences, selects top candidates, and performs duplication/maturation while factoring compute cost into final scores. In a controlled evaluation over 30 rounds comparing 14 systems, Lark Full achieves a mean rank of 2.55 (95% CI [2.17, 2.93]) and a mean composite score of 29.4/50 (95% CI [26.34, 32.46]), finishing Top-3 in 80% of rounds while remaining cost competitive with leading commercial models ($0.016 per task). Paired Wilcoxon tests confirm that all four mechanisms contribute significantly as ablating duplication/maturation yields the largest deficit (ΔScore = 3.5, Cohen’s d_z = 2.53, p < 0.001), followed by plasticity (ΔScore = 3.4, d_z = 1.86), ranked-choice voting (ΔScore = 2.4, d_z = 1.20), and token penalties (ΔScore = 2.2, d_z = 1.63). Rather than a formal Markov Decision Process with constrained optimization, Lark is a practical, compute-aware neuroevolutionary loop that scales stakeholder-aligned strategy generation and makes trade-offs transparent through per-step metrics. Our work presents proof-of-concept findings and invites community feedback as we expand toward real-world validation studies.
[820] EvoMem: Improving Multi-Agent Planning with Dual-Evolving Memory
Wenzhe Fan, Ning Yan, Masood Mortazavi
Main category: cs.MA
TL;DR: EvoMem is a multi-agent framework with dual-evolving memory mechanisms for natural language planning tasks, showing improved performance on trip planning, meeting planning, and calendar scheduling.
Details
Motivation: While LLM-based multi-agent frameworks have advanced planning capabilities, the role of human-like memory in these systems remains unexplored. Understanding how agents coordinate through memory is critical for natural language planning tasks that require iterative reasoning, constraint tracking, and error correction.Method: EvoMem is built on a dual-evolving memory mechanism inspired by working memory from cognitive psychology. It consists of three agents (Constraint Extractor, Verifier, and Actor) and two memory modules: Constraint Memory (CMem) that evolves across queries by storing task-specific rules/constraints (fixed within queries), and Query-feedback Memory (QMem) that evolves within queries by accumulating feedback across iterations for solution refinement. Both memories reset at the end of each query session.
Result: Evaluations on trip planning, meeting planning, and calendar scheduling tasks show consistent performance improvements, demonstrating the effectiveness of the EvoMem framework.
Conclusion: The success of EvoMem underscores the importance of memory mechanisms in enhancing multi-agent planning capabilities, particularly for natural language planning tasks requiring complex coordination and iterative refinement.
Abstract: Planning has been a cornerstone of artificial intelligence for solving complex problems, and recent progress in LLM-based multi-agent frameworks have begun to extend this capability. However, the role of human-like memory within these frameworks remains largely unexplored. Understanding how agents coordinate through memory is critical for natural language planning, where iterative reasoning, constraint tracking, and error correction drive the success. Inspired by working memory model in cognitive psychology, we present EvoMem, a multi-agent framework built on a dual-evolving memory mechanism. The framework consists of three agents (Constraint Extractor, Verifier, and Actor) and two memory modules: Constraint Memory (CMem), which evolves across queries by storing task-specific rules and constraints while remains fixed within a query, and Query-feedback Memory (QMem), which evolves within a query by accumulating feedback across iterations for solution refinement. Both memory modules are reset at the end of each query session. Evaluations on trip planning, meeting planning, and calendar scheduling show consistent performance improvements, highlighting the effectiveness of EvoMem. This success underscores the importance of memory in enhancing multi-agent planning.
cs.MM
[821] Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits
Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji
Main category: cs.MM
TL;DR: A pipeline for joint audio-visual editing that first edits video, then generates aligned audio using a novel video-to-audio generation model conditioned on source audio, target video, and text prompts.
Details
Motivation: To enhance coherence between edited video and its accompanying audio by ensuring audio aligns with visual changes after video editing.Method: Two-stage pipeline: 1) Apply video editing techniques to produce target video, 2) Use novel video-to-audio generation model that conditions on source audio, target video, and text prompts. The model incorporates conditional audio input, uses data augmentation for training efficiency, and dynamically adjusts source audio influence based on edit complexity.
Result: Outperforms existing approaches in maintaining audio-visual alignment and content integrity.
Conclusion: Proposed pipeline successfully achieves coherent joint audio-visual editing through a novel video-to-audio generation approach with adaptive source audio conditioning.
Abstract: We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video-to-audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio-visual alignment and content integrity.
[822] MAVERIX: Multimodal Audio-Visual Evaluation and Recognition IndeX
Liuyue Xie, Avik Kuthiala, George Z. Wei, Ce Zheng, Ananya Bal, Mosam Dabhi, Liting Wen, Taru Rustagi, Ethan Lai, Sushil Khyalia, Rohan Choudhury, Morteza Ziyadi, Xu Zhang, Hao Yang, László A. Jeni
Main category: cs.MM
TL;DR: MAVERIX is a unified benchmark for evaluating multimodal LLMs’ video understanding through audiovisual questions requiring tight integration of video and audio information, with human performance far exceeding current models.
Details
Motivation: The field lacks standardized evaluation frameworks to thoroughly assess multimodal models' cross-modality comprehension performance, particularly for audiovisual integration in video understanding.Method: Created MAVERIX benchmark with 2,556 questions from 700 videos in multiple-choice and open-ended formats, explicitly designed to evaluate multimodal models through questions requiring tight integration of video and audio information across diverse agentic scenarios.
Result: State-of-the-art models (Qwen 2.5 Omni, Gemini 2.5 Flash-Lite) achieve ~64% accuracy, while human experts reach 92.8% accuracy, revealing substantial gap between current models and human-level comprehension.
Conclusion: MAVERIX establishes a challenging testbed with standardized evaluation protocols, rigorous annotation pipeline, and public toolkit to advance audiovisual multimodal intelligence by exposing significant performance gaps between models and humans.
Abstract: We introduce MAVERIX (Multimodal audiovisual Evaluation and Recognition IndeX), a unified benchmark to probe the video understanding in multimodal LLMs, encompassing video, audio, text inputs with human performance baselines. Although recent advancements in models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework to thoroughly assess their cross-modality comprehension performance. MAVERIX curates 2,556 questions from 700 videos, in the form of both multiple-choice and open-ended formats, explicitly designed to evaluate multimodal models through questions that necessitate tight integration of video and audio information, spanning a broad spectrum of agentic scenarios. MAVERIX uniquely provides models with audiovisual questions, closely mimicking the multimodal perceptual experiences available to humans during inference and decision-making processes. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration in such granularity. Experiments with state-of-the-art models, including Qwen 2.5 Omni and Gemini 2.5 Flash-Lite, show performance around 64% accuracy, while human experts reach near-ceiling performance of 92.8%, exposing a substantial gap to human-level comprehension. With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.
eess.AS
[823] KidSpeak: A General Multi-purpose LLM for Kids’ Speech Recognition and Screening
Rohan Sharma, Dancheng Liu, Jingchen Sun, Shijie Zhou, Jiayu Qin, Jinjun Xiong, Changyou Chen
Main category: eess.AS
TL;DR: KidSpeak is a multi-task speech foundation model for children’s speech, with FASA alignment tool for creating high-quality training data from noisy children’s speech.
Details
Motivation: Current AI models fail with children's speech due to reliance on adult speech datasets, especially problematic for early developmental stages and speech pathologies.Method: Two-stage training with phonetic knowledge in speech encoder; FASA alignment tool for automatic high-quality speech alignment from noisy data.
Result: 87% average accuracy across four tasks; FASA improves data quality by 13.6x compared to human annotations on CHILDES dataset.
Conclusion: First comprehensive solution for children’s speech therapy, combining multi-purpose speech LLM with robust alignment tool.
Abstract: With the rapid advancement of conversational and diffusion-based AI, there is a growing adoption of AI in educational services, ranging from grading and assessment tools to personalized learning systems that provide targeted support for students. However, this adaptability has yet to fully extend to the domain of children’s speech, where existing models often fail due to their reliance on datasets designed for clear, articulate adult speech. Children, particularly those in early developmental stages or with speech and language pathologies, present unique challenges that current AI models and datasets are ill-equipped to handle. To address this, we introduce KidSpeak, a multi-task speech-enhanced Foundation Model capable of both generative and discriminative tasks specifically tailored to children’s speech patterns. Our framework employs a two-stage training process that incorporates phonetic knowledge into the speech encoder, achieving an average accuracy of 87% across four separate tasks. Furthermore, recognizing the limitations of scalable human annotation and existing speech alignment tools, we propose the Flexible and Automatic Speech Aligner (FASA) and leverage the method to construct high quality datasets for training and evaluation. This novel alignment tool significantly improves the quality of aligned children’s speech from noisy data, enhancing data quality by 13.6x compared to human annotations, as demonstrated on the CHILDES dataset. To the best of our knowledge, KidSpeak and FASA represent the first comprehensive solution designed for speech and language therapy in children, offering both a multi-purpose speech LLM and a robust alignment tool.
[824] Degrading Voice: A Comprehensive Overview of Robust Voice Conversion Through Input Manipulation
Xining Song, Zhihua Wei, Rui Wang, Haixiao Hu, Yanxiang Chen, Meng Han
Main category: eess.AS
TL;DR: This paper surveys robustness issues in voice conversion systems, analyzing how input degradation attacks affect VC performance and evaluating attack/defense methods across intelligibility, naturalness, timbre similarity, and subjective perception metrics.
Details
Motivation: Voice conversion models have advanced significantly but only learn from clean training data, making them vulnerable to real-world degraded inputs (noise, reverberation, attacks). There's a critical gap in understanding VC robustness under input manipulation, and questions remain about how different attacks affect outputs and whether attack/defense strategies can be optimized.Method: The paper classifies existing attack and defense methods from the perspective of input manipulation. It evaluates the impact of degraded input speech across four key dimensions: intelligibility, naturalness, timbre similarity, and subjective perception.
Result: The survey reveals that VC models are vulnerable to various input degradation attacks in real-world scenarios. The comprehensive evaluation framework assesses how different forms of input manipulation alter VC outputs, highlighting the need for more robust systems.
Conclusion: The paper identifies significant robustness gaps in current VC systems and outlines open issues and future research directions for developing more secure and reliable voice conversion technologies that can handle real-world degraded inputs effectively.
Abstract: Identity, accent, style, and emotions are essential components of human speech. Voice conversion (VC) techniques process the speech signals of two input speakers and other modalities of auxiliary information such as prompts and emotion tags. It changes para-linguistic features from one to another, while maintaining linguistic contents. Recently, VC models have made rapid advancements in both generation quality and personalization capabilities. These developments have attracted considerable attention for diverse applications, including privacy preservation, voice-print reproduction for the deceased, and dysarthric speech recovery. However, these models only learn non-robust features due to the clean training data. Subsequently, it results in unsatisfactory performances when dealing with degraded input speech in real-world scenarios, including additional noise, reverberation, adversarial attacks, or even minor perturbation. Hence, it demands robust deployments, especially in real-world settings. Although latest researches attempt to find potential attacks and countermeasures for VC systems, there remains a significant gap in the comprehensive understanding of how robust the VC model is under input manipulation. here also raises many questions: For instance, to what extent do different forms of input degradation attacks alter the expected output of VC models? Is there potential for optimizing these attack and defense strategies? To answer these questions, we classify existing attack and defense methods from the perspective of input manipulation and evaluate the impact of degraded input speech across four dimensions, including intelligibility, naturalness, timbre similarity, and subjective perception. Finally, we outline open issues and future directions.
[825] Unsupervised Single-Channel Audio Separation with Diffusion Source Priors
Runwu Shi, Chang Li, Jiang Wang, Rui Zhang, Nabeela Khan, Benjamin Yen, Takeshi Ashizawa, Kazuhiro Nakadai
Main category: eess.AS
TL;DR: Unsupervised single-channel audio separation using diffusion priors and reconstruction guidance, with novel inverse problem solver and time-frequency attention architecture.
Details
Motivation: Supervised methods require paired synthetic data which is hard to obtain in real-world scenarios, limiting generalization. Need unsupervised approach that doesn't rely on paired training data.Method: Frames separation as probabilistic inverse problem using diffusion priors trained on individual sources. Uses reconstruction guidance with specialized inverse problem solver to avoid gradient conflicts. Initializes with augmented mixture instead of Gaussian noise. Employs time-frequency attention-based network for audio modeling.
Result: Significant performance gains validated across speech-sound event, sound event, and speech separation tasks. Achieves high-quality and balanced separation across sources.
Conclusion: Proposed unsupervised approach with diffusion priors and reconstruction guidance effectively solves single-channel audio separation without paired data, demonstrating strong generalization across various separation tasks.
Abstract: Single-channel audio separation aims to separate individual sources from a single-channel mixture. Most existing methods rely on supervised learning with synthetically generated paired data. However, obtaining high-quality paired data in real-world scenarios is often difficult. This data scarcity can degrade model performance under unseen conditions and limit generalization ability. To this end, in this work, we approach this problem from an unsupervised perspective, framing it as a probabilistic inverse problem. Our method requires only diffusion priors trained on individual sources. Separation is then achieved by iteratively guiding an initial state toward the solution through reconstruction guidance. Importantly, we introduce an advanced inverse problem solver specifically designed for separation, which mitigates gradient conflicts caused by interference between the diffusion prior and reconstruction guidance during inverse denoising. This design ensures high-quality and balanced separation performance across individual sources. Additionally, we find that initializing the denoising process with an augmented mixture instead of pure Gaussian noise provides an informative starting point that significantly improves the final performance. To further enhance audio prior modeling, we design a novel time-frequency attention-based network architecture that demonstrates strong audio modeling capability. Collectively, these improvements lead to significant performance gains, as validated across speech-sound event, sound event, and speech separation tasks.
[826] Introduction to Ambisonics, Part 1: The Part With No Math
Jens Ahrens
Main category: eess.AS
TL;DR: A practical introduction to ambisonics focusing on intuitive understanding rather than technical details, covering what ambisonic signals are, how to obtain and manipulate them, and how to reproduce them for listeners.
Details
Motivation: To provide a practical, accessible introduction to ambisonics for readers who want to work with the technology without getting bogged down in deep mathematical details.Method: Part 1 of a 2-part series that explains ambisonic concepts intuitively, covering signal acquisition, manipulation techniques, and reproduction methods, supplemented with audio examples for illustration.
Result: A comprehensive practical guide that helps readers develop intuitive understanding of ambisonics, preparing them for the more technical Part 2 which covers mathematical details.
Conclusion: This paper successfully provides a practical foundation for working with ambisonics, with the mathematical details reserved for Part 2, making the technology more accessible to practitioners.
Abstract: The present document is Part 1 of a 2-part introduction to ambisonics and aims at readers who would like to work practically with ambisonics. We leave out deep technical details in this part and focus on helping the reader to develop an intuitive understanding of the underlying concept. We explain what ambisonic signals are, how they can be obtained, what manipulations can be applied to them, and how they can be reproduced to a listener. We provide a variety of audio examples that illustrate the matter. Part 2 of this introduction into ambisonics is provided in a separate document and aims at readers who would like to understand the mathematical details.
[827] DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
Dongya Jia, Zhuo Chen, Jiawei Chen, Chenpeng Du, Jian Wu, Jian Cong, Xiaobin Zhuang, Chumin Li, Zhen Wei, Yuping Wang, Yuxuan Wang
Main category: eess.AS
TL;DR: DiTAR is a patch-based autoregressive framework combining language models with diffusion transformers for continuous speech representation generation, achieving SOTA in zero-shot speech generation with improved efficiency and scalability.
Details
Motivation: Existing autoregressive approaches for generating continuous speech representations using diffusion models suffer from high computational loads and suboptimal performance. The authors aim to create a more efficient and effective framework for continuous token generation in speech synthesis.Method: DiTAR uses a divide-and-conquer patch generation strategy: a language model processes aggregated patch embeddings, then a diffusion transformer generates the next patch based on the language model’s output. For inference, temperature is defined as the time point for introducing noise during reverse diffusion ODE to balance diversity and determinism.
Result: DiTAR demonstrates superb scalability in extensive scaling analysis and achieves state-of-the-art performance in zero-shot speech generation across robustness, speaker similarity, and naturalness metrics.
Conclusion: DiTAR successfully addresses computational efficiency and performance limitations of previous approaches, providing an effective framework for autoregressive generation of continuous speech representations with strong scalability and SOTA results in zero-shot speech generation.
Abstract: Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of introducing noise during the reverse diffusion ODE to balance diversity and determinism. We also show in the extensive scaling analysis that DiTAR has superb scalability. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.
eess.IV
[828] Stronger is not better: Better Augmentations in Contrastive Learning for Medical Image Segmentation
Azeez Idris, Abdurahman Ali Mohammed, Samuel Fanijo
Main category: eess.IV
TL;DR: Self-supervised contrastive learning’s strong data augmentation doesn’t always improve medical image segmentation performance; alternative augmentations show better results.
Details
Motivation: To evaluate the effectiveness of strong data augmentation in self-supervised contrastive learning for medical image semantic segmentation, as existing augmentations don't consistently improve performance.Method: Experiments with various data augmentation techniques, testing existing strong augmentations (composition of multiple techniques) and exploring alternative augmentations specifically for medical images.
Result: Existing strong data augmentations don’t always improve semantic segmentation performance for medical images; alternative augmentations provide better performance gains.
Conclusion: Standard strong data augmentation techniques used in self-supervised contrastive learning may not be optimal for medical image segmentation, requiring specialized augmentation strategies.
Abstract: Self-supervised contrastive learning is among the recent representation learning methods that have shown performance gains in several downstream tasks including semantic segmentation. This paper evaluates strong data augmentation, one of the most important components for self-supervised contrastive learning’s improved performance. Strong data augmentation involves applying the composition of multiple augmentation techniques on images. Surprisingly, we find that the existing data augmentations do not always improve performance for semantic segmentation for medical images. We experiment with other augmentations that provide improved performance.
[829] Semantic Temporal Single-photon LiDAR
Fang Li, Tonglin Mu, Shuling Li, Junran Guo, Keyuan Li, Jianing Li, Ziyang Luo, Xiaodong Fan, Ye Chen, Yunfeng Liu, Hong Cai, Lip Ket Chin, Jinbei Zhang, Shihai Sun
Main category: eess.IV
TL;DR: Semantic TSP-LiDAR with self-updating knowledge base enables adaptive target recognition in open-set scenarios under low SNR and limited photons.
Details
Motivation: Existing TSP-LiDAR systems fail in open-set scenarios with unknown targets and degrade under low SNR/short acquisition times. Need adaptive recognition without extensive retraining.Method: Propose semantic TSP-LiDAR with self-updating semantic knowledge base (SKB), formulating target recognition as semantic communication. SKB dynamically updates semantic features of new targets.
Result: Outperforms conventional methods under low SNR/limited acquisition time. Self-updating SKB achieves 89% accuracy on nine unknown target types vs 66% without updating mechanism.
Conclusion: Framework enables adaptive, robust target recognition in complex dynamic environments without extensive neural network retraining, showing potential for practical applications.
Abstract: Temporal single-photon (TSP-) LiDAR presents a promising solution for imaging-free target recognition over long distances with reduced size, cost, and power consumption. However, existing TSP-LiDAR approaches are ineffective in handling open-set scenarios where unknown targets emerge, and they suffer significant performance degradation under low signal-to-noise ratio (SNR) and short acquisition times (fewer photons). Here, inspired by semantic communication, we propose a semantic TSP-LiDAR based on a self-updating semantic knowledge base (SKB), in which the target recognition processing of TSP-LiDAR is formulated as a semantic communication. The results, both simulation and experiment, demonstrate that our approach surpasses conventional methods, particularly under challenging conditions of low SNR and limited acquisition time. More importantly, our self-updating SKB mechanism can dynamically update the semantic features of newly encountered targets in the SKB, enabling continuous adaptation without the need for extensive retraining of the neural network. In fact, a recognition accuracy of 89% is achieved on nine types of unknown targets in real-world experiments, compared to 66% without the updating mechanism. These findings highlight the potential of our framework for adaptive and robust target recognition in complex and dynamic environments.
[830] Physics-Guided Diffusion Priors for Multi-Slice Reconstruction in Scientific Imaging
Laurentius Valdy, Richard D. Paul, Alessio Quercia, Zhuo Cao, Xuan Zhao, Hanno Scharr, Arya Bangun
Main category: eess.IV
TL;DR: A memory-efficient framework for multi-slice reconstruction from limited data using partitioned diffusion priors with physics constraints, achieving high quality results for MRI and 4D-STEM with strong generalization.
Details
Motivation: Accelerating acquisition in medical/scientific imaging requires accurate multi-slice reconstruction from limited data, but this is challenging due to ill-posedness and high computational/memory demands.Method: Integrates partitioned diffusion priors with physics-based constraints to reduce memory usage per GPU while maintaining reconstruction quality.
Result: Outperforms physics-only and full multi-slice reconstruction baselines for MRI and 4D-STEM, with reduced memory usage and preserved high reconstruction quality.
Conclusion: The framework successfully addresses computational challenges while improving both in-distribution accuracy and generalization to out-of-distribution datasets.
Abstract: Accurate multi-slice reconstruction from limited measurement data is crucial to speed up the acquisition process in medical and scientific imaging. However, it remains challenging due to the ill-posed nature of the problem and the high computational and memory demands. We propose a framework that addresses these challenges by integrating partitioned diffusion priors with physics-based constraints. By doing so, we substantially reduce memory usage per GPU while preserving high reconstruction quality, outperforming both physics-only and full multi-slice reconstruction baselines for different modalities, namely Magnetic Resonance Imaging (MRI) and four-dimensional Scanning Transmission Electron Microscopy (4D-STEM). Additionally, we show that the proposed method improves in-distribution accuracy as well as strong generalization to out-of-distribution datasets.
[831] Clinical Interpretability of Deep Learning Segmentation Through Shapley-Derived Agreement and Uncertainty Metrics
Tianyi Ren, Daniel Low, Pittra Jaengprajak, Juampablo Heras Rivera, Jacob Ruzevick, Mehmet Kurt
Main category: eess.IV
TL;DR: This paper proposes using contrast-level Shapley values to explain medical image segmentation models, showing that higher-performing models have Shapley rankings that better align with clinical expectations, providing interpretable reliability metrics.
Details
Motivation: Despite deep learning's success in medical image segmentation, there's a critical need for explainability to ensure clinical acceptance. Current gradient-based techniques focus on influential regions, but there's a need for broader, clinically aligned approaches that explain how model performance attributes importance to different imaging contrasts.Method: The approach uses contrast-level Shapley values to systematically perturb model inputs and assess feature importance. Using the BraTS 2024 dataset, they generated Shapley value rankings for four MRI contrasts across four model architectures. They proposed two metrics: agreement between model and clinician imaging rankings, and uncertainty quantified through Shapley ranking variance across cross-validation folds.
Result: Higher-performing cases (Dice >0.6) showed significantly greater agreement with clinical rankings. Increased Shapley ranking variance correlated with decreased performance (U-Net: r=-0.581). These metrics provide clinically interpretable proxies for model reliability.
Conclusion: Shapley values offer a clinically aligned approach to explain medical image segmentation models. The proposed metrics help clinicians better understand state-of-the-art segmentation models by providing interpretable proxies for model reliability based on agreement with clinical expectations and ranking consistency.
Abstract: Segmentation is the identification of anatomical regions of interest, such as organs, tissue, and lesions, serving as a fundamental task in computer-aided diagnosis in medical imaging. Although deep learning models have achieved remarkable performance in medical image segmentation, the need for explainability remains critical for ensuring their acceptance and integration in clinical practice, despite the growing research attention in this area. Our approach explored the use of contrast-level Shapley values, a systematic perturbation of model inputs to assess feature importance. While other studies have investigated gradient-based techniques through identifying influential regions in imaging inputs, Shapley values offer a broader, clinically aligned approach, explaining how model performance is fairly attributed to certain imaging contrasts over others. Using the BraTS 2024 dataset, we generated rankings for Shapley values for four MRI contrasts across four model architectures. Two metrics were proposed from the Shapley ranking: agreement between model and ``clinician" imaging ranking, and uncertainty quantified through Shapley ranking variance across cross-validation folds. Higher-performing cases (Dice \textgreater0.6) showed significantly greater agreement with clinical rankings. Increased Shapley ranking variance correlated with decreased performance (U-Net: $r=-0.581$). These metrics provide clinically interpretable proxies for model reliability, helping clinicians better understand state-of-the-art segmentation models.
[832] Affine Subspace Models and Clustering for Patch-Based Image Denoising
Tharindu Wickremasinghe, Marco F. Duarte
Main category: eess.IV
TL;DR: The paper proposes using affine subspace models instead of linear subspaces for image tile clustering, and demonstrates improved performance in clustering and denoising applications.
Details
Motivation: Linear subspace models are not well-suited for image patches because images are non-negative and thus not distributed around the origin in the tile vector space. This mismatch motivates the use of affine subspace models that better match the geometric structure of image tile data.Method: The paper presents an affine subspace clustering approach for image tiles, along with a simple denoising algorithm that uses least squares projection onto these affine subspaces. Several algorithmic approaches to solve the affine subspace clustering problem are reviewed and implemented.
Result: Experimental results show performance improvements in both clustering and denoising tasks when using affine subspace models compared to linear subspace approaches.
Conclusion: Affine subspace models provide a better geometric match for image tile data than linear subspaces, leading to improved performance in clustering and denoising applications commonly used in image processing pipelines.
Abstract: Image tile-based approaches are popular in many image processing applications such as denoising (e.g., non-local means). A key step in their use is grouping the images into clusters, which usually proceeds iteratively splitting the images into clusters and fitting a model for the images in each cluster. Linear subspaces have emerged as a suitable model for tile clusters; however, they are not well matched to images patches given that images are non-negative and thus not distributed around the origin in the tile vector space. We study the use of affine subspace models for the clusters to better match the geometric structure of the image tile vector space. We also present a simple denoising algorithm that relies on the affine subspace clustering model using least squares projection. We review several algorithmic approaches to solve the affine subspace clustering problem and show experimental results that highlight the performance improvements in clustering and denoising.
[833] From sparse recovery to plug-and-play priors, understanding trade-offs for stable recovery with generalized projected gradient descent
Ali Joundi, Yann Traonmilin, Jean-François Aujol
Main category: eess.IV
TL;DR: GPGD framework unifies sparse recovery and learned deep priors, with new convergence guarantees for robustness to model/projection errors, plus methods to control stability and handle structured noise.
Details
Motivation: Address the problem of recovering low-dimensional vectors from noisy, underdetermined observations, bridging traditional sparse recovery with modern deep learning approaches while improving robustness and stability.Method: Extend Generalized Projected Gradient Descent (GPGD) framework with convergence analysis for model/projection errors, propose generalized back-projection for structured noise, and introduce normalized idempotent regularization for learning deep projective priors.
Result: Theoretical convergence guarantees for robustness, methods to control stability constants, and numerical experiments showing trade-offs between identifiability and stability in sparse recovery and image inverse problems.
Conclusion: GPGD provides a unified framework that connects traditional and modern approaches, with enhanced robustness and stability controls, demonstrating practical trade-offs in recovery performance for various inverse problems.
Abstract: We consider the problem of recovering an unknown low-dimensional vector from noisy, underdetermined observations. We focus on the Generalized Projected Gradient Descent (GPGD) framework, which unifies traditional sparse recovery methods and modern approaches using learned deep projective priors. We extend previous convergence results to robustness to model and projection errors. We use these theoretical results to explore ways to better control stability and robustness constants. To reduce recovery errors due to measurement noise, we consider generalized back-projection strategies to adapt GPGD to structured noise, such as sparse outliers. To improve the stability of GPGD, we propose a normalized idempotent regularization for the learning of deep projective priors. We provide numerical experiments in the context of sparse recovery and image inverse problems, highlighting the trade-offs between identifiability and stability that can be achieved with such methods.
[834] Precise Liver Tumor Segmentation in CT Using a Hybrid Deep Learning-Radiomics Framework
Xuecheng Li, Weikuan Jia, Komildzhon Sharipov, Alimov Ruslan, Lutfuloev Mazbutdzhon, Ismoilov Shuhratjon, Yuanjie Zheng
Main category: eess.IV
TL;DR: Hybrid framework combining attention-enhanced cascaded U-Net, radiomics feature selection, and 3D CNN refinement for joint liver and liver-tumor segmentation in CT scans.
Details
Motivation: Manual liver tumor segmentation is slow, observer-dependent, and difficult to standardize across centers. Automatic segmentation faces challenges including low lesion-parenchyma contrast, blurred boundaries, heterogeneous enhancement patterns, and confounding structures like vessels and adjacent organs.Method: Three-stage hybrid approach: 1) 2.5D two-stage network with densely connected encoder, sub-pixel convolution decoders and multi-scale attention gates for initial segmentation; 2) Radiomics feature extraction (728 descriptors reduced to 20 features) with random forest classifier for false-positive rejection; 3) Compact 3D patch-based CNN for voxel-level relabelling and contour smoothing in boundary regions.
Result: The method produces accurate 3D delineation of liver tumors with improved handling of thin/tiny lesions, noise suppression, and false-positive reduction through the combined deep learning and radiomics approach.
Conclusion: The proposed hybrid framework effectively addresses challenges in liver tumor segmentation by combining deep learning architectures with traditional radiomics features, providing a robust solution for treatment planning, navigation, and response assessment in clinical settings.
Abstract: Accurate three-dimensional delineation of liver tumors on contrast-enhanced CT is a prerequisite for treatment planning, navigation and response assessment, yet manual contouring is slow, observer-dependent and difficult to standardise across centres. Automatic segmentation is complicated by low lesion-parenchyma contrast, blurred or incomplete boundaries, heterogeneous enhancement patterns, and confounding structures such as vessels and adjacent organs. We propose a hybrid framework that couples an attention-enhanced cascaded U-Net with handcrafted radiomics and voxel-wise 3D CNN refinement for joint liver and liver-tumor segmentation. First, a 2.5D two-stage network with a densely connected encoder, sub-pixel convolution decoders and multi-scale attention gates produces initial liver and tumor probability maps from short stacks of axial slices. Inter-slice temporal consistency is then enforced by a simple three-slice refinement rule along the cranio-caudal direction, which restores thin and tiny lesions while suppressing isolated noise. Next, 728 radiomic descriptors spanning intensity, texture, shape, boundary and wavelet feature groups are extracted from candidate lesions and reduced to 20 stable, highly informative features via multi-strategy feature selection; a random forest classifier uses these features to reject false-positive regions. Finally, a compact 3D patch-based CNN derived from AlexNet operates in a narrow band around the tumor boundary to perform voxel-level relabelling and contour smoothing.
[835] R2MF-Net: A Recurrent Residual Multi-Path Fusion Network for Robust Multi-directional Spine X-ray Segmentation
Xuecheng Li, Weikuan Jia, Komildzhon Sharipov, Sharipov Hotam Beknazarovich, Farzona S. Ataeva, Qurbonaliev Alisher, Yuanjie Zheng
Main category: eess.IV
TL;DR: R2MF-Net: A recurrent residual multi-path encoder-decoder network for automatic segmentation of multi-directional spine X-ray images to enable quantitative scoliosis assessment.
Details
Motivation: Current spinal structure segmentation in X-ray images is manual, time-consuming, and non-reproducible, especially in low-contrast images with rib shadows or overlapping tissues. This hinders quantitative scoliosis assessment including Cobb angle measurement, vertebral translation estimation, and curvature classification.Method: Proposes R2MF-Net with a two-stage cascade: coarse segmentation network followed by fine segmentation network. Features include: improved Inception-style multi-branch feature extractor, recurrent residual jump connection (R2-Jump) module for semantic alignment, multi-scale cross-stage skip (MC-Skip) mechanism for hierarchical representation reuse, and lightweight spatial-channel squeeze-and-excitation block (SCSE-Lite) for spine-related activation emphasis.
Result: Evaluated on a clinical multi-view radiograph dataset comprising 228 sets of coronal, left-bending and right-bending spine X-ray images with expert annotations. (Specific quantitative results not provided in abstract).
Conclusion: R2MF-Net addresses limitations of manual segmentation by providing an automated solution for multi-directional spine X-ray image segmentation, enabling more efficient and reproducible quantitative scoliosis assessment in clinical practice.
Abstract: Accurate segmentation of spinal structures in X-ray images is a prerequisite for quantitative scoliosis assessment, including Cobb angle measurement, vertebral translation estimation and curvature classification. In routine practice, clinicians acquire coronal, left-bending and right-bending radiographs to jointly evaluate deformity severity and spinal flexibility. However, the segmentation step remains heavily manual, time-consuming and non-reproducible, particularly in low-contrast images and in the presence of rib shadows or overlapping tissues. To address these limitations, this paper proposes R2MF-Net, a recurrent residual multi-path encoder–decoder network tailored for automatic segmentation of multi-directional spine X-ray images. The overall design consists of a coarse segmentation network and a fine segmentation network connected in cascade. Both stages adopt an improved Inception-style multi-branch feature extractor, while a recurrent residual jump connection (R2-Jump) module is inserted into skip paths to gradually align encoder and decoder semantics. A multi-scale cross-stage skip (MC-Skip) mechanism allows the fine network to reuse hierarchical representations from multiple decoder levels of the coarse network, thereby strengthening the stability of segmentation across imaging directions and contrast conditions. Furthermore, a lightweight spatial-channel squeeze-and-excitation block (SCSE-Lite) is employed at the bottleneck to emphasize spine-related activations and suppress irrelevant structures and background noise. We evaluate R2MF-Net on a clinical multi-view radiograph dataset comprising 228 sets of coronal, left-bending and right-bending spine X-ray images with expert annotations.
[836] Deep Spatiotemporal Clutter Filtering of Transthoracic Echocardiographic Images: Leveraging Contextual Attention and Residual Learning
Mahdi Tabassian, Somayeh Akbari, Sandro Queirós, Jan D’hooge
Main category: eess.IV
TL;DR: A deep 3D convolutional autoencoder with attention mechanism and residual learning effectively filters reverberation clutter from echocardiographic sequences, trained on synthetic data but generalizes well to real clinical data.
Details
Motivation: Reverberation clutter in transthoracic echocardiography (TTE) degrades image quality and affects downstream clinical measurements like strain analysis, requiring effective filtering methods that preserve fine structures.Method: 3D convolutional autoencoder network with attention mechanism for focusing on cluttered regions and residual learning for structure preservation, trained on synthetic TTE sequences with simulated artifacts from six ultrasound vendors.
Result: The network effectively filters clutter in both synthetic and in vivo data, reduces strain profile discrepancies between cluttered and clutter-free segments, and processes sequences in fractions of a second for real-time use.
Conclusion: The proposed deep learning approach enables real-time clutter filtering, improves precision of clinical indices derived from TTE, and generalizes well despite being trained only on synthetic data.
Abstract: This study presents a deep convolutional autoencoder network for filtering reverberation clutter from transthoracic echocardiographic (TTE) image sequences. Given the spatiotemporal nature of this type of clutter, the filtering network employs 3D convolutional layers to suppress it throughout the cardiac cycle. The design of the network incorporates two key features that contribute to the effectiveness of the filter: 1) an attention mechanism for focusing on cluttered regions and leveraging contextual information, and 2) residual learning for preserving fine image structures. To train the network, a diverse set of artifact patterns was simulated and superimposed onto ultra-realistic synthetic TTE sequences from six ultrasound vendors, generating input for the filtering network. The artifact-free sequences served as ground-truth. Performance of the filtering network was evaluated using unseen synthetic and in vivo artifactual sequences. Results from the in vivo dataset confirmed the network’s strong generalization capabilities, despite being trained solely on synthetic data and simulated artifacts. The suitability of the filtered sequences for downstream processing was assessed by computing segmental strain curves. A significant reduction in the discrepancy between strain profiles computed from cluttered and clutter-free segments was observed after filtering the cluttered sequences with the proposed network. The trained network processes a TTE sequence in a fraction of a second, enabling real-time clutter filtering and potentially improving the precision of clinically relevant indices derived from TTE sequences. The source code of the proposed method and example video files of the filtering results are available at: https://github.com/MahdiTabassian/Deep-ClutterFiltering/tree/main.
[837] Tyche: Stochastic In-Context Learning for Medical Image Segmentation
Marianne Rakic, Hallee E. Wong, Jose Javier Gonzalez Ortiz, Beth Cimini, John Guttag, Adrian V. Dalca
Main category: eess.IV
TL;DR: Tyche is an in-context learning model for medical image segmentation that generates stochastic predictions for unseen tasks without retraining, addressing both the need for task-specific training and the inherent uncertainty in expert segmentations.
Details
Motivation: Two key problems in medical image segmentation: (1) Most methods require training/fine-tuning new models for each task, which is resource-intensive and requires ML expertise, making it infeasible for clinicians. (2) Existing methods produce single deterministic masks, ignoring the inherent uncertainty and variability in expert annotations where different experts often segment the same image differently.Method: Tyche uses in-context learning with a context set to generate predictions for unseen tasks without retraining. Key innovations: (1) Novel convolution block architecture enabling interactions among predictions, (2) In-context test-time augmentation for prediction stochasticity, combined with appropriate model design and loss functions.
Result: Tyche can predict a set of plausible diverse segmentation candidates for new or unseen medical images and segmentation tasks without requiring retraining, addressing both task generalization and uncertainty quantification.
Conclusion: Tyche provides a practical solution for medical image segmentation by eliminating the need for task-specific retraining while capturing the inherent uncertainty in expert annotations through stochastic predictions, making segmentation more accessible to medical researchers and clinicians.
Abstract: Existing learning-based solutions to medical image segmentation have two important shortcomings. First, for most new segmentation task, a new model has to be trained or fine-tuned. This requires extensive resources and machine learning expertise, and is therefore often infeasible for medical researchers and clinicians. Second, most existing segmentation methods produce a single deterministic segmentation mask for a given image. In practice however, there is often considerable uncertainty about what constitutes the correct segmentation, and different expert annotators will often segment the same image differently. We tackle both of these problems with Tyche, a model that uses a context set to generate stochastic predictions for previously unseen tasks without the need to retrain. Tyche differs from other in-context segmentation methods in two important ways. (1) We introduce a novel convolution block architecture that enables interactions among predictions. (2) We introduce in-context test-time augmentation, a new mechanism to provide prediction stochasticity. When combined with appropriate model design and loss functions, Tyche can predict a set of plausible diverse segmentation candidates for new or unseen medical images and segmentation tasks without the need to retrain.
[838] Self-supervised Learning-based Reconstruction of High-resolution 4D Light Fields
Jianxin Lei, Dongze Wu, Chengcai Xu, Hongcheng Gu, Guangquan Zhou, Junhui Hou, Ping Zhou
Main category: eess.IV
TL;DR: Self-supervised LF spatial SR method that reconstructs higher-resolution light field images without pre-defined degradation models, using hybrid imaging and HR central-view reference.
Details
Motivation: Supervised LF SR methods struggle with domain gap between training (natural resolution LFs as ground truth) and inference (reconstructing higher-resolution LFs), especially for real-world data. Need method that works without pre-defined degradation models.Method: 1) Hybrid LF imaging prototype creating reference pairs between low-res central-view and high-res images. 2) Self-supervised framework with: LF spatial SR network with hybrid input, central-view synthesis network with HR-aware loss for side-views to learn from HR central view, and backward degradation network with EPI gradient loss to preserve parallax structures.
Result: Extensive experiments on simulated and real-world datasets demonstrate significant superiority over state-of-the-art methods in reconstructing higher spatial resolution LF images without pre-defined degradation.
Conclusion: Proposed self-supervised method effectively addresses domain gap problem in LF spatial SR, enabling reconstruction of higher-resolution LF images without requiring pre-defined degradation models, with strong performance on both simulated and real-world data.
Abstract: Hand-held light field (LF) cameras often exhibit low spatial resolution due to the inherent trade-off between spatial and angular dimensions. Existing supervised learning-based LF spatial super-resolution (SR) methods, which rely on pre-defined image degradation models, struggle to overcome the domain gap between the training phase – where LFs with natural resolution are used as ground truth – and the inference phase, which aims to reconstruct higher-resolution LFs, especially when applied to real-world data.To address this challenge, this paper introduces a novel self-supervised learning-based method for LF spatial SR, which can produce higher spatial resolution LF images than originally captured ones without pre-defined image degradation models. The self-supervised method incorporates a hybrid LF imaging prototype, a real-world hybrid LF dataset, and a self-supervised LF spatial SR framework. The prototype makes reference image pairs between low-resolution central-view sub-aperture images and high-resolution (HR) images. The self-supervised framework consists of a well-designed LF spatial SR network with hybrid input, a central-view synthesis network with an HR-aware loss that enables side-view sub-aperture images to learn high-frequency information from the only HR central view reference image, and a backward degradation network with an epipolar-plane image gradient loss to preserve LF parallax structures. Extensive experiments on both simulated and real-world datasets demonstrate the significant superiority of our approach over state-of-the-art ones in reconstructing higher spatial resolution LF images without pre-defined degradation.
[839] CT Synthesis with Conditional Diffusion Models for Abdominal Lymph Node Segmentation
Yongrui Yu, Hanyu Chen, Zitian Zhang, Qiong Xiao, Wenhui Lei, Linrui Dai, Yu Fu, Hui Tan, Guan Wang, Peng Gao, Xiaofan Zhang
Main category: eess.IV
TL;DR: A pipeline combining conditional diffusion model (LN-DDPM) for lymph node generation and nnU-Net for segmentation improves abdominal lymph node segmentation by generating diverse realistic training data.
Details
Motivation: Deep learning struggles with abdominal lymph node segmentation due to complex abdominal environment, small/indistinguishable lesions, and limited annotated data.Method: Propose LN-DDPM, a conditional DDPM using lymph node masks and anatomical structure masks with two conditioning mechanisms (global structure and local detail). Generated data is used to train nnU-Net for segmentation.
Result: LN-DDPM outperforms other generative methods in abdominal lymph node image synthesis and better assists downstream segmentation tasks on abdominal lymph node datasets.
Conclusion: The integrated pipeline of conditional diffusion model for data generation and nnU-Net for segmentation effectively addresses challenges in abdominal lymph node diagnosis by augmenting limited training data.
Abstract: Despite the significant success achieved by deep learning methods in medical image segmentation, researchers still struggle in the computer-aided diagnosis of abdominal lymph nodes due to the complex abdominal environment, small and indistinguishable lesions, and limited annotated data. To address these problems, we present a pipeline that integrates the conditional diffusion model for lymph node generation and the nnU-Net model for lymph node segmentation to improve the segmentation performance of abdominal lymph nodes through synthesizing a diversity of realistic abdominal lymph node data. We propose LN-DDPM, a conditional denoising diffusion probabilistic model (DDPM) for lymph node (LN) generation. LN-DDPM utilizes lymph node masks and anatomical structure masks as model conditions. These conditions work in two conditioning mechanisms: global structure conditioning and local detail conditioning, to distinguish between lymph nodes and their surroundings and better capture lymph node characteristics. The obtained paired abdominal lymph node images and masks are used for the downstream segmentation task. Experimental results on the abdominal lymph node datasets demonstrate that LN-DDPM outperforms other generative methods in the abdominal lymph node image synthesis and better assists the downstream abdominal lymph node segmentation task.
[840] T1-PILOT: Optimized Trajectories for T1 Mapping Acceleration
Tamir Shor, Moti Freiman, Chaim Baskin, Alex Bronstein
Main category: eess.IV
TL;DR: T1-PILOT is an end-to-end method that integrates T1 relaxation modeling into learned non-Cartesian sampling trajectories and reconstruction, achieving superior T1 map fidelity at higher acceleration factors compared to existing approaches.
Details
Motivation: Cardiac T1 mapping is crucial for assessing myocardial tissue composition but faces strict time constraints due to heart motion. Current compressed sensing approaches use static, hand-crafted undersampling masks that don't fully exploit acceleration potential, limiting both speed and accuracy.Method: T1-PILOT incorporates the T1 signal relaxation model directly into the sampling-reconstruction framework to guide learning of non-Cartesian trajectories, cross-frame alignment, and T1 decay estimation in an end-to-end optimization.
Result: On the CMRxRecon dataset, T1-PILOT significantly outperforms baseline strategies (learned single-mask, fixed radial, golden-angle sampling) with higher PSNR and VIF scores, better delineation of myocardial structures, and improved T1 map fidelity at greater acceleration factors.
Conclusion: Jointly optimizing sampling trajectories with the physical relaxation model enables both enhanced quantitative accuracy and reduced acquisition times for cardiac T1 mapping, demonstrating the value of physics-informed learning in medical imaging.
Abstract: Cardiac T1 mapping provides critical quantitative insights into myocardial tissue composition, enabling the assessment of pathologies such as fibrosis, inflammation, and edema. However, the inherently dynamic nature of the heart imposes strict limits on acquisition times, making high-resolution T1 mapping a persistent challenge. Compressed sensing (CS) approaches have reduced scan durations by undersampling k-space and reconstructing images from partial data, and recent studies show that jointly optimizing the undersampling patterns with the reconstruction network can substantially improve performance. Still, most current T1 mapping pipelines rely on static, hand-crafted masks that do not exploit the full acceleration and accuracy potential. In this work, we introduce T1-PILOT: an end-to-end method that explicitly incorporates the T1 signal relaxation model into the sampling-reconstruction framework to guide the learning of non-Cartesian trajectories, crossframe alignment, and T1 decay estimation. Through extensive experiments on the CMRxRecon dataset, T1-PILOT significantly outperforms several baseline strategies (including learned single-mask and fixed radial or golden-angle sampling schemes), achieving higher T1 map fidelity at greater acceleration factors. In particular, we observe consistent gains in PSNR and VIF relative to existing methods, along with marked improvements in delineating finer myocardial structures. Our results highlight that optimizing sampling trajectories in tandem with the physical relaxation model leads to both enhanced quantitative accuracy and reduced acquisition times.
[841] Echo-E$^3$Net: Efficient Endocardial Spatio-Temporal Network for Ejection Fraction Estimation
Moein Heidari, Afshin Bozorgpour, AmirHossein Zarif-Fakharnia, Wenjin Chen, Dorit Merhof, David J Foran, Jasmine Grewal, Ilker Hacihaliloglu
Main category: eess.IV
TL;DR: Echo-E³Net: An efficient deep learning model for real-time LVEF estimation from echocardiography videos using endocardial landmark detection and feature aggregation.
Details
Motivation: Existing deep learning approaches for LVEF estimation are computationally demanding and underutilize spatio-temporal information in echocardiography videos, limiting their suitability for real-time clinical deployment.Method: Proposes Echo-E³Net with two modules: (1) dual-phase Endocardial Border Detector using phase-specific cross-attention to predict ED/ES landmarks and learn embeddings, and (2) Endocardial Feature Aggregator that fuses embeddings with global statistical descriptors. Uses multi-component loss function inspired by Simpson’s biplane method.
Result: Achieves RMSE of 5.20 and R² score of 0.82 on EchoNet-Dynamic dataset, with only 1.54M parameters and 8.05 GFLOPs, making it suitable for real-time POCUS applications.
Conclusion: Echo-E³Net provides an efficient, real-time solution for LVEF estimation without requiring external pre-training, heavy data augmentation, or test-time ensembling, making it clinically practical for point-of-care ultrasound.
Abstract: Left ventricular ejection fraction (LVEF) is a key indicator of cardiac function and is routinely used to diagnose heart failure and guide treatment decisions. Although deep learning has advanced automated LVEF estimation, many existing approaches are computationally demanding and underutilize the joint structure of spatial and temporal information in echocardiography videos, limiting their suitability for real-time clinical deployment. We propose Echo-E$^3$Net, an efficient endocardial spatio-temporal network specifically designed for LVEF estimation from echocardiography videos. Echo-E$^3$Net comprises two complementary modules: (1) a dual-phase Endocardial Border Detector (E$^2$CBD), which uses phase-specific cross-attention to predict ED/ES endocardial landmarks (EBs) and learn phase-aware landmark embeddings (LEs), and (2) an Endocardial Feature Aggregator (E$^2$FA), which fuses these embeddings with global statistical descriptors (mean, maximum, variance) of deep feature maps to refine EF regression. A multi-component loss function, inspired by Simpson’s biplane method, jointly supervises EF, volumes, and landmark geometry, thereby aligning optimization with the clinical definition of LVEF and promoting robust spatio-temporal representation learning. Evaluated on the EchoNet-Dynamic dataset, Echo-E$^3$Net achieves an RMSE of 5.20 and an $R^2$ score of 0.82, while using only 1.54M parameters and 8.05 GFLOPs. The model operates without external pre-training, heavy data augmentation, or test-time ensembling, making it highly suitable for real-time point-of-care ultrasound (POCUS) applications. Code is available at https://github.com/UltrAi-lab/Echo-E3Net.
[842] MLICv2: Enhanced Multi-Reference Entropy Modeling for Learned Image Compression
Wei Jiang, Yongqi Zhai, Jiayu Yang, Feng Gao, Ronggang Wang
Main category: eess.IV
TL;DR: MLICv2 and MLICv2+ are enhanced learned image compression models that address limitations of previous MLIC variants through improved transforms, better entropy modeling, and instance-specific optimization, achieving state-of-the-art performance.
Details
Motivation: Existing MLIC variants suffer from performance degradation at high bitrates, suboptimal entropy modeling that fails to capture global correlations, and lack of adaptive channel importance modeling.Method: 1) Lightweight token mixing block for transform enhancement; 2) Hyperprior-guided global correlation prediction for entropy modeling; 3) Channel reweighting module; 4) Enhanced positional embedding and guided selective compression; 5) Stochastic Gumbel Annealing for instance-specific optimization.
Result: MLICv2 and MLICv2+ achieve state-of-the-art results, reducing Bjøntegaard-Delta Rate by 16.54-24.35% compared to VTM-17.0 Intra across Kodak, Tecnick, and CLIC Pro Val datasets.
Conclusion: The proposed enhancements systematically address limitations of previous MLIC variants, demonstrating significant performance improvements in learned image compression through better transform design, advanced entropy modeling, and instance-specific optimization.
Abstract: Recent advances in learned image compression (LIC) have achieved remarkable performance improvements over traditional codecs. Notably, the MLIC series-LICs equipped with multi-reference entropy models-have substantially surpassed conventional image codecs such as Versatile Video Coding (VVC) Intra. However, existing MLIC variants suffer from several limitations: performance degradation at high bitrates due to insufficient transform capacity, suboptimal entropy modeling that fails to capture global correlations in initial slices, and lack of adaptive channel importance modeling. In this paper, we propose MLICv2 and MLICv2+, enhanced successors that systematically address these limitations through improved transform design, dvanced entropy modeling, and exploration of the potential of instance-specific optimization. For transform enhancement, we introduce a lightweight token mixing block inspired by the MetaFormer architecture, which effectively mitigates high-bitrate performance degradation while maintaining computational efficiency. For entropy modeling improvements, we propose hyperprior-guided global correlation prediction to extract global context even in the initial slice of latent representation, complemented by a channel reweighting module that dynamically emphasizes informative channels. We further explore enhanced positional embedding and guided selective compression strategies for superior context modeling. Additionally, we apply the Stochastic Gumbel Annealing (SGA) to demonstrate the potential for further performance improvements through input-specific optimization. Extensive experiments demonstrate that MLICv2 and MLICv2+ achieve state-of-the-art results, reducing Bjøntegaard-Delta Rate by 16.54%, 21.61%, 16.05% and 20.46%, 24.35%, 19.14% on Kodak, Tecnick, and CLIC Pro Val datasets, respectively, compared to VTM-17.0 Intra.
[843] Stochastic Orthogonal Regularization for deep projective priors
Ali Joundi, Yann Traonmilin, Alasdair Newson
Main category: eess.IV
TL;DR: The paper proposes a stochastic orthogonal regularization method for training deep projective priors to improve convergence speed and robustness of generalized projected gradient descent algorithms for solving imaging inverse problems.
Details
Motivation: Many image processing and computer vision tasks are formulated as inverse problems requiring fast and robust algorithms. While neural networks enable projections onto complex data models (deep projective priors), traditional training with mean squared error losses doesn't guarantee the conditions needed for linear convergence in GPGD algorithms.Method: The authors propose a stochastic orthogonal regularization of the training loss for deep projective priors. This regularization ensures that the learned projections sufficiently approximate orthogonal projections, which theoretically guarantees linear stable recovery with performance close to orthogonal projected gradient descent.
Result: Experimental results using two different deep projective priors (autoencoders and denoising networks) show that the proposed stochastic orthogonal regularization yields projections that improve convergence speed and robustness of GPGD in challenging inverse problem settings.
Conclusion: The stochastic orthogonal regularization method enables deep projective priors to achieve near-optimal convergence performance similar to orthogonal PGD, bridging the gap between theoretical guarantees and practical neural network-based projections for solving imaging inverse problems.
Abstract: Many crucial tasks of image processing and computer vision are formulated as inverse problems. Thus, it is of great importance to design fast and robust algorithms to solve these problems. In this paper, we focus on generalized projected gradient descent (GPGD) algorithms where generalized projections are realized with learned neural networks and provide state-of-the-art results for imaging inverse problems. Indeed, neural networks allow for projections onto unknown low-dimensional sets that model complex data, such as images. We call these projections deep projective priors. In generic settings, when the orthogonal projection onto a lowdimensional model set is used, it has been shown, under a restricted isometry assumption, that the corresponding orthogonal PGD converges with a linear rate, yielding near-optimal convergence (within the class of GPGD methods) in the classical case of sparse recovery. However, for deep projective priors trained with classical mean squared error losses, there is little guarantee that the hypotheses for linear convergence are satisfied. In this paper, we propose a stochastic orthogonal regularization of the training loss for deep projective priors. This regularization is motivated by our theoretical results: a sufficiently good approximation of the orthogonal projection guarantees linear stable recovery with performance close to orthogonal PGD. We show experimentally, using two different deep projective priors (based on autoencoders and on denoising networks), that our stochastic orthogonal regularization yields projections that improve convergence speed and robustness of GPGD in challenging inverse problem settings, in accordance with our theoretical findings.
[844] Synthetic multi-inversion time magnetic resonance images for visualization of subcortical structures
Savannah P. Hays, Lianrui Zuo, Anqi Feng, Yihao Liu, Blake E. Dewey, Jiachen Zhuo, Ellen M. Mowry, Scott D. Newsome, Jerry L. Prince, Aaron Carass
Main category: eess.IV
TL;DR: SyMTIC is a deep learning method that generates synthetic multi-TI MR images from routine T1-w, T2-w, and FLAIR scans to improve subcortical gray matter visualization without requiring specialized multi-TI acquisitions.
Details
Motivation: Multi-TI T1-weighted MR imaging improves visualization of subcortical gray matter for neuroscience and clinical applications, but it's rarely acquired in routine clinical settings due to acquisition time and protocol limitations.Method: SyMTIC combines deep neural networks with imaging physics to estimate T1 and PD maps from routine T1-w, T2-w, and FLAIR images, then uses these maps to compute synthetic multi-TI images with arbitrary inversion times.
Result: The method accurately synthesized multi-TI images comparable to explicitly acquired data, particularly enhancing visualization for TI values between 400-800 ms and improving thalamic nuclei segmentation.
Conclusion: SyMTIC provides a practical solution for generating high-quality multi-TI images from routine clinical MR contrasts, generalizing well to varied datasets and improving brain MR visualization and analysis.
Abstract: Purpose: Visualization of subcortical gray matter is essential in neuroscience and clinical practice, particularly for disease understanding and surgical planning.While multi-inversion time (multi-TI) T$_1$-weighted (T$_1$-w) magnetic resonance (MR) imaging improves visualization, it is rarely acquired in clinical settings. Approach: We present SyMTIC (Synthetic Multi-TI Contrasts), a deep learning method that generates synthetic multi-TI images using routinely acquired T$_1$-w, T$_2$-weighted (T$_2$-w), and FLAIR images. Our approach combines image translation via deep neural networks with imaging physics to estimate longitudinal relaxation time (T$_1$) and proton density (PD) maps. These maps are then used to compute multi-TI images with arbitrary inversion times. Results: SyMTIC was trained using paired MPRAGE and FGATIR images along with T$_2$-w and FLAIR images. It accurately synthesized multi-TI images from standard clinical inputs, achieving image quality comparable to that from explicitly acquired multi-TI data.The synthetic images, especially for TI values between 400-800 ms, enhanced visualization of subcortical structures and improved segmentation of thalamic nuclei. Conclusion: SyMTIC enables robust generation of high-quality multi-TI images from routine MR contrasts. It generalizes well to varied clinical datasets, including those with missing FLAIR images or unknown parameters, offering a practical solution for improving brain MR image visualization and analysis.
[845] Robust brain age estimation from structural MRI with contrastive learning
Carlo Alberto Barbano, Benoit Dufumier, Edouard Duchesnay, Marco Grangetto, Pietro Gori
Main category: eess.IV
TL;DR: Contrastive learning with novel loss function L^exp outperforms L1-supervised methods for brain age estimation, showing better generalization, robustness to scanner confounds, and stronger clinical correlations.
Details
Motivation: Brain age estimation from structural MRI is valuable for studying aging and pathology, but current L1-supervised approaches may not be optimal. The paper explores contrastive learning as a more scalable and robust alternative.Method: Introduces novel contrastive loss function L^exp and evaluates it across multiple public neuroimaging datasets (20,000+ scans). Uses contrastive learning framework with scaling pre-training on diverse multi-site data.
Result: Four key findings: 1) Scaling pre-training improves generalization (cuts MAE nearly in half); 2) L^exp robust to site confounds; 3) Models capture accelerated aging in cognitive impairment/Alzheimer’s; 4) Maintains correlation between brain age accuracy and diagnostic performance.
Conclusion: Contrastive learning is a promising direction for building generalizable and clinically meaningful brain representations, positioning L^exp as a potential foundation model for neuroimaging.
Abstract: Estimating brain age from structural MRI has emerged as a powerful tool for characterizing normative and pathological aging. In this work, we explore contrastive learning as a scalable and robust alternative to L1-supervised approaches for brain age estimation. We introduce a novel contrastive loss function, $\mathcal{L}^{exp}$, and evaluate it across multiple public neuroimaging datasets comprising over 20,000 scans. Our experiments reveal four key findings. First, scaling pre-training on diverse, multi-site data consistently improves generalization performance, cutting external mean absolute error (MAE) nearly in half. Second, $\mathcal{L}^{exp}$ is robust to site-related confounds, maintaining low scanner-predictability as training size increases. Third, contrastive models reliably capture accelerated aging in patients with cognitive impairment and Alzheimer’s disease, as shown through brain age gap analysis, ROC curves, and longitudinal trends. Lastly, unlike L1-supervised baselines, $\mathcal{L}^{exp}$ maintains a strong correlation between brain age accuracy and downstream diagnostic performance, supporting its potential as a foundation model for neuroimaging. These results position contrastive learning as a promising direction for building generalizable and clinically meaningful brain representations.
[846] A Biophysically-Conditioned Generative Framework for 3D Brain Tumor MRI Synthesis
Valentin Biller, Lucas Zimmer, Ayhan Can Erdur, Sandeep Nagar, Daniel Rückert, Niklas Bubeck, Jonas Weidner
Main category: eess.IV
TL;DR: A generative model using latent diffusion with tumor concentration conditioning achieves high-fidelity brain tumor MRI synthesis and healthy tissue restoration, achieving PSNR scores of 17.4 and 18.5 respectively.
Details
Motivation: MRI inpainting has important clinical and research applications, but existing methods lack the ability to condition on voxel-level, continuous tumor concentrations for high-fidelity brain tumor MRI synthesis.Method: A latent diffusion model conditioned on both tissue segmentations and tumor concentrations generates 3D spatially coherent and anatomically consistent images for both tumor synthesis and healthy tissue inpainting.
Result: Achieved PSNR of 18.5 for healthy tissue inpainting and 17.4 for tumor inpainting, demonstrating the model’s effectiveness for both complementary tasks.
Conclusion: The proposed generative model successfully synthesizes high-fidelity brain tumor MRIs and performs healthy tissue restoration, with code publicly available for further research and clinical applications.
Abstract: Magnetic resonance imaging (MRI) inpainting supports numerous clinical and research applications. We introduce the first generative model that conditions on voxel-level, continuous tumor concentrations to synthesize high-fidelity brain tumor MRIs. For the BraTS 2025 Inpainting Challenge, we adapt this architecture to the complementary task of healthy tissue restoration by setting the tumor concentrations to zero. Our latent diffusion model conditioned on both tissue segmentations and the tumor concentrations generates 3D spatially coherent and anatomically consistent images for both tumor synthesis and healthy tissue inpainting. For healthy inpainting, we achieve a PSNR of 18.5, and for tumor inpainting, we achieve 17.4. Our code is available at: https://github.com/valentin-biller/ldm.git
[847] General and Domain-Specific Zero-shot Detection of Generated Images via Conditional Likelihood
Roy Betser, Omer Hofman, Roman Vainshtein, Guy Gilboa
Main category: eess.IV
TL;DR: CLIDE is a novel zero-shot AI-generated image detection method using conditional likelihood approximation that adapts to diverse image domains, outperforming existing methods especially on domain-specific cases.
Details
Motivation: Existing zero-shot detection methods struggle to adapt to specific image domains (like artistic images), limiting real-world applicability. Maintaining updated datasets for supervised/few-shot methods is time-consuming and challenging.Method: CLIDE uses conditional likelihood approximation, computing likelihoods conditioned on real images to enable adaptation across diverse image domains in a zero-shot manner.
Result: Achieves state-of-the-art performance on large-scale general datasets and significantly outperforms existing methods in domain-specific cases, demonstrating robustness and domain-aware generalization.
Conclusion: CLIDE demonstrates the importance of broad, domain-aware generalization for AI-generated image detection, offering a robust zero-shot solution that adapts well to diverse image domains.
Abstract: The rapid advancement of generative models, particularly diffusion-based methods, has significantly improved the realism of synthetic images. As new generative models continuously emerge, detecting generated images remains a critical challenge. While fully supervised, and few-shot methods have been proposed, maintaining an updated dataset is time-consuming and challenging. Consequently, zero-shot methods have gained increasing attention in recent years. We find that existing zero-shot methods often struggle to adapt to specific image domains, such as artistic images, limiting their real-world applicability. In this work, we introduce CLIDE, a novel zero-shot detection method based on conditional likelihood approximation. Our approach computes likelihoods conditioned on real images, enabling adaptation across diverse image domains. We extensively evaluate CLIDE, demonstrating SOTA performance on a large-scale general dataset and significantly outperform existing methods in domain-specific cases. These results demonstrate the robustness of our method and underscore the need of broad, domain-aware generalization for the AI-generated image detection task. Code is available at https://tinyurl.com/clide-detector.
[848] Proof of Concept for Mammography Classification with Enhanced Compactness and Separability Modules
Fariza Dahes
Main category: eess.IV
TL;DR: Validation of ConvNeXt Tiny with GAGM, SEVector, and FSL framework on mammography classification shows GAGM and SEVector improve feature discriminability and reduce false negatives for malignant cases, but FSL doesn’t help. Extended evaluation includes multi-metric analysis, Grad-CAM interpretability, and interactive dashboard.
Details
Motivation: To validate and extend a recent medical image classification framework (ConvNeXt Tiny with GAGM, SEVector, and FSL) from Alzheimer MRI to mammography classification, investigating its transposability across different medical imaging domains.Method: Used Kaggle dataset combining INbreast, MIAS, and DDSM mammography collections. Compared baseline CNN, ConvNeXt Tiny, and InceptionV3 backbones enhanced with GAGM and SEVector modules. Conducted multi-metric evaluation (macro F1, recall variance, ROC/AUC), feature interpretability analysis using Grad-CAM, and developed interactive clinical dashboard.
Result: GAGM and SEVector modules effectively enhance feature discriminability and reduce false negatives for malignant cases. However, Feature Smoothing Loss (FSL) did not yield measurable improvements in mammography classification. The framework shows domain-specific limitations.
Conclusion: The original framework is partially transposable to mammography classification, with GAGM and SEVector being effective but FSL not. Need to explore alternative approaches for better intra-class compactness and inter-class separability, particularly for malignant vs. benign distinction in mammography.
Abstract: This study presents a validation and extension of a recent methodological framework for medical image classification. While an improved ConvNeXt Tiny architecture, integrating Global Average and Max Pooling fusion (GAGM), lightweight channel attention (SEVector), and Feature Smoothing Loss (FSL), demonstrated promising results on Alzheimer MRI under CPU friendly conditions, our work investigates its transposability to mammography classification. Using a Kaggle dataset that consolidates INbreast, MIAS, and DDSM mammography collections, we compare a baseline CNN, ConvNeXt Tiny, and InceptionV3 backbones enriched with GAGM and SEVector modules. Results confirm the effectiveness of GAGM and SEVector in enhancing feature discriminability and reducing false negatives, particularly for malignant cases. In our experiments, however, the Feature Smoothing Loss did not yield measurable improvements under mammography classification conditions, suggesting that its effectiveness may depend on specific architectural and computational assumptions. Beyond validation, our contribution extends the original framework through multi metric evaluation (macro F1, per class recall variance, ROC/AUC), feature interpretability analysis (Grad CAM), and the development of an interactive dashboard for clinical exploration. As a perspective, we highlight the need to explore alternative approaches to improve intra class compactness and inter class separability, with the specific goal of enhancing the distinction between malignant and benign cases in mammography classification.